Abstract:
Lemmatization is an important task which is concerned with making computers understand the
relationship that exists amongst words written in natural language. It is a prior condition needed
for the development of natural language processing (NLP) systems such as machine translation
and information retrieval.
In particular, Lemmatization is intended to reduce the variability in word forms by collapsing
related words to a standard lemma. There is a limited research on lemmatization of Setswana
language. A large part of the available research on Setswana lemmatization relies on rule driven
strategy, which takes time to construct, lacks context of how words are used, and needs extremely
qualified language skills. Moreover, it has been discovered that the treatment of language with
hand coded regulations lacks generalization component as it requires a continual redesign every
time new data appears and this complicates the scalability of systems. With such rich vocabulary
and complex morphology, lemmatization of Setswana cannot be easily unraveled using explicit
rules developed by programmers.
In this thesis we describe how a supervised machine learning approach that employs the use of
Naive Bayes algorithm can solve Setswana lemmatization with regard to how words are used in
sentences. The contribution of this study includes; first, context aware lemmatization model,
that handles most of the morphologically productive classes. Second, we experiment with the
strongest multi-class algorithm Naive Bayes, which to our best knowledge has never been used
to address lemmatization in Setswana. The accuracy of the lemmatization model obtained from
the experiments reached 70.32%. The model shifts from entirely hand programmed rules and is
able to lemmatize words based on the context how they are used. In Setswana lemmatization
should be done according to sentence intension, the model again ensures that as long as the
data is a good example of the goal concept the generalization is simultaneously created, which
allows the model'’s future performance to continue improving.
Furthermore, given that this is a young area of research with no standard datasets for training
and testing, we also contribute with a considerable medium sized dataset which remains a coveted
resource for research community. The experimental results obtained from this study shows
that machine learning approaches are more reliable than rule based approaches in lemmatizing
Setswana inflectional words with regard to the context of how they are used.