| dc.contributor.supervisor | Kuthadi, Venu Madhav | |
| dc.contributor.supervisor | Dinakenyane, Otlhapile | |
| dc.contributor.supervisor | Schroeder, Heiko | |
| dc.contributor.author | Kaniwa, Freeson | |
| dc.date.accessioned | 2019-03-26T06:14:45Z | |
| dc.date.available | 2019-03-26T06:14:45Z | |
| dc.date.issued | 2018-08 | |
| dc.identifier.citation | Kaniwa,Freeson (2018) A kmer-based parallel algorithm for pattern searching in DNA sequences on shared-memory model,Masters Thesis, Botswana International University of Science and Technology: Palapye | en_US |
| dc.identifier.uri | https://repository.biust.ac.bw/handle/123456789/85 | |
| dc.description | Theses (PhD Computer Science)-----Botswana International University Science and Technology, 2018 | en_US |
| dc.description.abstract | The explosive growth of biological sequences over the last decade has consequently led to a lot of research on effective techniques for storing and analysing this big data. Generating Deoxyribonucleic Acid (DNA) sequences has become more reliable, faster, easier and cheaper due to latest sequencing technology leading to big data sequences. This poses a challenge to efficient analysis to such massive sequences. Advanced indexing algorithmic techniques are required to facilitate efficient analysis. A particular interesting indexing data structure for such analysis is the Suffix Tree. Search times for patterns have linear times, however longer construction times and high memory requirements make it prohibitive to use this data structure for very long sequences such as DNA sequences. To index these long sequences, several algorithms have been explored in recent years. These algorithms all target to improve running time and memory consumption for the Suffix Tree. That is, being able to index a long sequence by making it fit in the main memory and being able to look for patterns in reasonable times. DNA sequences carry basic information for each human being. Searching for patterns in DNA sequences has been shown to have biological relevance since patterns signify some underlying biological function such as disease detection, genetic variation, and regulation, etc. This thesis explores the use of truncating the generalised Suffix Tree basing on the details of the input, that is, the size of the repeat to be searched. This is aimed at improving the memory requirements. In this thesis, parallelisation of the Suffix Tree data structure is further explored to reduce the construction times. Parallelisation has become of more interest since parallel architectures are now ubiquitous in standard desktop computers. Based on the theoretical and experimental findings, we can conclude that specification of pattern size, prior to Suffix Tree construction by truncating, reduces the memory requirements of the generalised Suffix Tree. In conclusion, the novel alphabet dependant parallelisation algorithm achieves a speedup of 15x over the existing state of art sequential algorithm. Despite the success of our Parallel algorithm for Suffix Tree (PaST) algorithm, its main limitation is that of specification of pattern size prior to construction or searching. Other than that, comparatively our algorithm shows improved performance in terms of memory and time requirements. | en_US |
| dc.language.iso | en | en_US |
| dc.publisher | Botswana International University of Science and Technology | en_US |
| dc.subject | Big data | en_US |
| dc.subject | Generating Deoxyribonucleic Acid | en_US |
| dc.subject | DNA | en_US |
| dc.subject | Indexing algorithmic techniques | en_US |
| dc.subject | Suffix Tree | en_US |
| dc.subject | Algorithm | en_US |
| dc.title | A kmer-based parallel algorithm for pattern searching in DNA sequences on shared-memory model | en_US |
| dc.description.level | phd | en_US |
| dc.description.accessibility | unrestricted | en_US |
| dc.description.department | cis | en_US |