Abstract:
Outliers are data points that deviate from the rest of a dataset. Such deviations may be caused by, among other things, errors during data collection, instrumentation errors, and faults in data entry. These faults and errors make it difficult to capture and accurately model the patterns inherent in the data. On the other hand, outliers may indicate new discoveries in a field, calling for more research into why or how these outlying data points came about. Considering the potential benefits of outliers as well as the problems they may pose, this work focuses on the management of outliers in multivariate datasets, with a primary focus on classification problems. In this work, an algorithm to relabel outliers is developed. A Support Vector Machine (SVM) is used to detect outliers, given its effectiveness in dealing with high-dimensional data. After detection, the outliers are relabelled using the developed technique. This technique measures how far each outlier lies from the respective class centroids. The outliers are then relabelled based on a comparison of the measured distances. To evaluate the effectiveness of the developed technique, we use discrimination techniques such as Fisher’s Linear Discriminant Analysis (FLDA), the Gaussian Mixture Model (GMM), and an unsupervised clustering technique to highlight the extent of discrimination. The results show a significant improvement, and in some cases 100% classification is obtained when evaluated using the Adjusted Rand Index (ARI) metric.
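
The centroid-based relabelling step summarised in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation developed in this work: the distance is assumed to be Euclidean, the centroids are assumed to be computed only from points not flagged as outliers, and the outlier flags are supplied directly instead of being produced by an SVM. The function names `class_centroids` and `relabel_outliers` and the toy data are hypothetical.

```python
from collections import defaultdict
from math import dist


def class_centroids(points, labels, is_outlier):
    """Mean of each class, computed from points not flagged as outliers."""
    groups = defaultdict(list)
    for p, y, out in zip(points, labels, is_outlier):
        if not out:
            groups[y].append(p)
    return {
        y: tuple(sum(col) / len(members) for col in zip(*members))
        for y, members in groups.items()
    }


def relabel_outliers(points, labels, is_outlier):
    """Give each flagged point the label of its nearest class centroid."""
    centroids = class_centroids(points, labels, is_outlier)
    new_labels = list(labels)
    for i, (p, out) in enumerate(zip(points, is_outlier)):
        if out:
            # Assumption: Euclidean distance decides the new label.
            new_labels[i] = min(centroids, key=lambda y: dist(p, centroids[y]))
    return new_labels


# Toy data: the last point is labelled "a" but lies inside the "b" cluster.
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
labels = ["a", "a", "b", "b", "a"]
# Assumed detector output; in the described work an SVM provides these flags.
is_outlier = [False, False, False, False, True]

print(relabel_outliers(points, labels, is_outlier))
# prints ['a', 'a', 'b', 'b', 'b']
```

In this sketch the flagged point is reassigned from class "a" to class "b" because it is closer to the "b" centroid. Excluding flagged points from the centroid computation is a design choice made here so that the outliers do not pull the centroids towards themselves.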