A context-aware lemmatization model for setswana language using machine learning

Bafitlhile, Kgosiyame Ditiro

dc.contributor.supervisor	Dimane, Mpoeleng
dc.contributor.supervisor	Nedev, Zhivko
dc.contributor.supervisor	Letsholo, Keletso
dc.contributor.author	Bafitlhile, Kgosiyame Ditiro
dc.date.accessioned	2023-02-07T09:11:04Z
dc.date.available	2023-02-07T09:11:04Z
dc.date.issued	2022-08-25
dc.identifier.citation	Bafitlhile, K.D. (2022) A context-aware lemmatization model for setswana language using machine learning, Master's Thesis, Botswana International University of Science and Technology: Palapye.	en_US
dc.identifier.uri	http://repository.biust.ac.bw/handle/123456789/536
dc.description	Thesis (Master of Science in Computer Science and Information Systems), Botswana International University of Science and Technology, 2022	en_US
dc.description.abstract	Lemmatization is an important task concerned with helping computers recognize the relationships among words written in natural language. It is a prerequisite for natural language processing (NLP) systems such as machine translation and information retrieval. In particular, lemmatization reduces variability in word forms by collapsing related words to a standard lemma. Research on lemmatization of the Setswana language is limited. Much of the available work relies on a rule-driven strategy, which takes time to construct, ignores the context in which words are used, and requires highly specialized linguistic expertise. Moreover, hand-coded rules generalize poorly: they must be redesigned whenever new data appears, which complicates the scalability of systems. Given Setswana's rich vocabulary and complex morphology, its lemmatization cannot easily be solved with explicit rules developed by programmers. In this thesis we describe how a supervised machine learning approach based on the Naive Bayes algorithm can lemmatize Setswana words with regard to how they are used in sentences. The contributions of this study are as follows. First, we present a context-aware lemmatization model that handles most of the morphologically productive classes. Second, we experiment with Naive Bayes, a strong multi-class algorithm which, to the best of our knowledge, has never been used to address lemmatization in Setswana. The lemmatization model reached an accuracy of 70.32% in the experiments. The model moves away from entirely hand-programmed rules and is able to lemmatize words based on the context in which they are used. 
In Setswana, lemmatization should be done according to the intention of the sentence. The model also ensures that, as long as the data is a good example of the goal concept, a generalization is learned at the same time, which allows the model's future performance to continue improving. Furthermore, given that this is a young area of research with no standard datasets for training and testing, we also contribute a medium-sized dataset, a resource that remains much sought after by the research community. The experimental results obtained from this study show that machine learning approaches are more reliable than rule-based approaches in lemmatizing Setswana inflectional words with regard to the context in which they are used.	en_US
dc.description.sponsorship	Botswana International University of Science and Technology (BIUST)	en_US
dc.language.iso	en	en_US
dc.publisher	Botswana International University of Science and Technology (BIUST)	en_US
dc.subject	Setswana Language	en_US
dc.subject	Lemmatization	en_US
dc.subject	Natural Language Processing	en_US
dc.subject	Naive Bayes	en_US
dc.subject	Machine Learning	en_US
dc.subject	Data Structures	en_US
dc.subject	Algorithms	en_US
dc.title	A context-aware lemmatization model for setswana language using machine learning	en_US
dc.description.level	msc	en_US
dc.description.accessibility	unrestricted	en_US
dc.description.department	cis	en_US
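
Illustrative sketch (not taken from the thesis): the abstract describes a supervised Naive Bayes model that lemmatizes a word using the context in which it appears. One common way to frame this, assumed here, is to treat each "strip this suffix, add that suffix" rule as a class and to predict the class from the word's ending and its neighbouring words. The feature set, the class encoding, the names `features`, `edit_rule`, `apply_rule` and `NaiveBayesLemmatizer`, and the tiny training sample are all hypothetical choices made for illustration; they are not the author's features, code, or dataset.

```python
import math
from collections import Counter, defaultdict


def features(tokens, i):
    """Assumed features for token i: its last 3 letters and its two neighbours."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<s>"
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return [f"suf={word[-3:]}", f"prev={prev_word}", f"next={next_word}"]


def edit_rule(word, lemma):
    """Encode word -> lemma as (suffix to strip, suffix to add)."""
    k = 0
    while k < min(len(word), len(lemma)) and word[k] == lemma[k]:
        k += 1
    return (word[k:], lemma[k:])


def apply_rule(word, rule):
    """Apply a (strip, add) rule; leave the word unchanged if it does not fit."""
    strip, add = rule
    if strip and not word.endswith(strip):
        return word
    return word[: len(word) - len(strip)] + add


class NaiveBayesLemmatizer:
    """Multinomial Naive Bayes over edit-rule classes with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, examples):
        # Each example is (sentence tokens, index of target word, gold lemma).
        for tokens, i, lemma in examples:
            rule = edit_rule(tokens[i], lemma)
            self.class_counts[rule] += 1
            for f in features(tokens, i):
                self.feat_counts[rule][f] += 1
                self.vocab.add(f)
        return self

    def predict_rule(self, tokens, i):
        total = sum(self.class_counts.values())
        best_rule, best_score = None, -math.inf
        for rule, n in self.class_counts.items():
            # log P(class) + sum of log P(feature | class)
            score = math.log(n / total)
            denom = sum(self.feat_counts[rule].values()) + self.alpha * len(self.vocab)
            for f in features(tokens, i):
                score += math.log((self.feat_counts[rule][f] + self.alpha) / denom)
            if score > best_score:
                best_rule, best_score = rule, score
        return best_rule

    def lemmatize(self, tokens, i):
        return apply_rule(tokens[i], self.predict_rule(tokens, i))


# Hypothetical toy sample for illustration only; not the thesis dataset.
TRAIN = [
    (["ke", "rekile", "dijo"], 1, "reka"),
    (["o", "ratile", "ngwana"], 1, "rata"),
    (["ba", "buile", "thata"], 1, "bua"),
    (["ke", "reka", "dijo"], 1, "reka"),
    (["o", "rata", "ngwana"], 1, "rata"),
]

model = NaiveBayesLemmatizer().fit(TRAIN)
print(model.lemmatize(["ba", "ratile", "dijo"], 1))
```

Predicting an edit rule rather than the lemma itself keeps the number of classes small and lets the model handle word forms it has not seen, provided they share an ending with the training data. A sample this small says nothing about real accuracy; the 70.32% figure reported in the abstract comes from the author's own dataset and experiments.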

Files in this item

Name: Bafitlhile_BIUST_ ...

Size: 1.010Mb

Format: PDF

This item appears in the following Collection(s)

Faculty of Sciences
This collection is made up of electronic theses and dissertations produced by post graduate students from Faculty of Sciences
