Abstract:
Birth weight is one of the major factors that determine future health outcomes. Predicting birth weight can enable medical practitioners to make early obstetric interventions, thus minimising complications associated with low birth weight. Data-mining models are receiving a great deal of attention for making predictions based on vast amounts of low birth weight data. Low birth weight research has focused especially on identifying the risk factors of low birth weight. However, the prediction of actual birth weight values based on the identified risk factors, which can play a significant role in identifying mothers at risk of delivering low birth weight infants, remains unsolved. Since the performance of data-mining techniques depends on the underlying problem, it is vital to analyse their relative performance for any given task. Therefore, the goal of this thesis was to develop a data-mining model that predicts the actual birth weight with a relatively high Area Under the Receiver Operating Characteristic Curve (AUC).
The prediction was based on low birth weight risk factors and 2006 birth data from the North Carolina State Centre for Health Statistics. In order to extract interesting patterns from the data, the knowledge discovery in databases process model was utilised. The steps followed were data selection, data pre-processing, model building, and model evaluation/interpretation. Decision trees were used for classifying birth weight and were tested on the original imbalanced dataset, on a dataset balanced using the Synthetic Minority Oversampling Technique (SMOTE), and with both the full feature set and a reduced feature set obtained through a correlation-based feature selection algorithm.
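To illustrate the balancing step, the following is a minimal sketch of the core SMOTE idea: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. It is not the implementation used in the thesis. The function name `smote`, its parameters, and the toy two-feature data are assumptions made purely for illustration.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Hypothetical helper for illustration only; `minority` is a list of
    equal-length numeric tuples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of `base`, excluding `base` itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Made-up minority-class samples with two numeric features (not thesis data).
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
synthetic = smote(minority, n_new=6)
print(len(synthetic))
```

Because every synthetic point is a convex combination of two real minority samples, it always lies within the range of the observed minority data in each feature.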
The results showed that models built with datasets balanced using the SMOTE algorithm produce a relatively higher AUC than models built with imbalanced datasets. It was also found that models built with reduced features, selected through the correlation-based feature selection algorithm, give a comparatively higher AUC than models built with all features. The J48 decision tree built with reduced features outperformed REPTree and Random Tree with an AUC of 90.3%, and was therefore selected as the best model. After applying the selected J48 model to new unseen data and comparing the results with those on the testing set, we concluded that using J48 for birth weight prediction is feasible and offers the possibility of reducing obstetric-related complications, thus improving overall obstetric healthcare.
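Since the models are compared by AUC, a short sketch of how that metric can be computed may help. AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name `auc` and the labels and scores below are made up for illustration; they are not results from the thesis.

```python
def auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores.

    Hypothetical helper for illustration; labels are 1 (positive)
    or 0 (negative), scores are the classifier's predicted scores.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: 1 = low birth weight, 0 = normal birth weight (made-up values).
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
result = auc(labels, scores)
print(round(result, 3))
```

In this toy case, 8 of the 9 positive-negative pairs are ranked correctly, so the AUC is 8/9. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.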