| dc.contributor.supervisor | Dimane, Mpoeleng | |
| dc.contributor.supervisor | Nedev, Zhivko | |
| dc.contributor.supervisor | Letsholo, Keletso | |
| dc.contributor.author | Bafitlhile, Kgosiyame Ditiro | |
| dc.date.accessioned | 2023-02-07T09:11:04Z | |
| dc.date.available | 2023-02-07T09:11:04Z | |
| dc.date.issued | 2022-08-25 | |
| dc.identifier.citation | Bafitlhile, K.D. (2022) A context-aware lemmatization model for Setswana language using machine learning, Master's Thesis, Botswana International University of Science and Technology: Palapye. | en_US |
| dc.identifier.uri | http://repository.biust.ac.bw/handle/123456789/536 | |
| dc.description | Thesis (Master of Science in Computer Science and Information Systems)---Botswana International University of Science and Technology, 2022 | en_US |
| dc.description.abstract | Lemmatization is an important task concerned with making computers understand the relationships among words written in natural language. It is a prerequisite for developing natural language processing (NLP) systems such as machine translation and information retrieval. In particular, lemmatization reduces variability in word forms by collapsing related words to a standard lemma. Research on lemmatization of the Setswana language is limited. Much of the available research relies on a rule-driven strategy, which takes time to construct, ignores the context in which words are used, and requires highly specialized linguistic expertise. Moreover, hand-coded rules generalize poorly: they must be redesigned whenever new data appears, which complicates the scalability of systems. Given Setswana's rich vocabulary and complex morphology, its lemmatization cannot easily be solved using explicit rules developed by programmers. In this thesis we describe how a supervised machine learning approach based on the Naive Bayes algorithm can lemmatize Setswana words according to how they are used in sentences. The contributions of this study are as follows. First, a context-aware lemmatization model that handles most of the morphologically productive classes. Second, we experiment with Naive Bayes, a strong multi-class algorithm which, to the best of our knowledge, has never been used to address lemmatization in Setswana. The lemmatization model reached an accuracy of 70.32% in the experiments. The model moves away from entirely hand-programmed rules and is able to lemmatize words based on the context in which they are used. In Setswana, lemmatization should be done according to the intention of the sentence; the model also ensures that, as long as the data is a good example of the goal concept, a generalization is created at the same time, which allows the model's future performance to continue improving. Furthermore, given that this is a young area of research with no standard datasets for training and testing, we also contribute a medium-sized dataset, a valuable resource for the research community. The experimental results show that machine learning approaches are more reliable than rule-based approaches in lemmatizing Setswana inflectional words with regard to the context in which they are used. | en_US |
| dc.description.sponsorship | Botswana International University of Science and Technology (BIUST) | en_US |
| dc.language.iso | en | en_US |
| dc.publisher | Botswana International University of Science and Technology (BIUST) | en_US |
| dc.subject | Setswana Language | en_US |
| dc.subject | Lemmatization | en_US |
| dc.subject | Natural Language Processing | en_US |
| dc.subject | Naive Bayes | en_US |
| dc.subject | Machine Learning | en_US |
| dc.subject | Data Structures | en_US |
| dc.subject | Algorithms | en_US |
| dc.title | A context-aware lemmatization model for Setswana language using machine learning | en_US |
| dc.description.level | msc | en_US |
| dc.description.accessibility | unrestricted | en_US |
| dc.description.department | cis | en_US |
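
The abstract describes lemmatization as supervised multi-class classification with Naive Bayes, where the lemma depends on how a word is used in its sentence. The following is a minimal sketch of that idea, not the thesis implementation. The feature set (surface form plus left and right neighbour), the Laplace smoothing, and every name in the code (`features`, `NaiveBayesLemmatizer`, `TRAIN`) are assumptions made for illustration. The tokens are placeholders rather than real Setswana words, and the toy data is not drawn from the thesis dataset.

```python
import math
from collections import Counter, defaultdict


def features(tokens, i):
    """Context features for token i: its form and its two neighbours (assumed feature set)."""
    prev_tok = tokens[i - 1] if i > 0 else "<s>"
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return [f"form={tokens[i]}", f"prev={prev_tok}", f"next={next_tok}"]


class NaiveBayesLemmatizer:
    """Multinomial Naive Bayes with Laplace smoothing; each lemma is one class."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # smoothing constant (assumption)
        self.class_counts = Counter()           # lemma -> number of training examples
        self.feat_counts = defaultdict(Counter) # lemma -> feature -> count
        self.vocab = set()                      # all features seen in training

    def fit(self, examples):
        # examples: iterable of (tokens, index_of_target_token, lemma)
        for tokens, i, lemma in examples:
            self.class_counts[lemma] += 1
            for feat in features(tokens, i):
                self.feat_counts[lemma][feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, tokens, i):
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_lemma, best_score = None, -math.inf
        for lemma, count in self.class_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(count / total)
            denom = sum(self.feat_counts[lemma].values()) + self.alpha * vocab_size
            for feat in features(tokens, i):
                score += math.log((self.feat_counts[lemma][feat] + self.alpha) / denom)
            if score > best_score:
                best_lemma, best_score = lemma, score
        return best_lemma


# Placeholder data: the same surface form maps to different lemmas
# depending on the preceding token (hypothetical, not real Setswana).
TRAIN = [
    (["ctx_a", "form_x", "ctx_c"], 1, "lemma_1"),
    (["ctx_a", "form_x", "ctx_d"], 1, "lemma_1"),
    (["ctx_b", "form_x", "ctx_c"], 1, "lemma_2"),
    (["ctx_b", "form_x", "ctx_d"], 1, "lemma_2"),
]

model = NaiveBayesLemmatizer().fit(TRAIN)
print(model.predict(["ctx_a", "form_x", "ctx_c"], 1))
print(model.predict(["ctx_b", "form_x", "ctx_c"], 1))
```

In this sketch the surface form alone cannot separate the two lemmas, so the decision rests on the neighbouring tokens, which is the sense in which the classifier is context-aware. A realistic system would need a labelled corpus and richer features (for example affixes or part-of-speech tags), which are outside what this record specifies.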