
Learning from Data: Cleft Lip and Palate Patients in the West Coast of Sabah
Author(s) -
Zaturrawiah Ali Omar,
Su Na Chin,
Norhafiza Hamzah,
Fouziah Md Yassin
Publication year - 2019
Publication title -
Journal of Physics: Conference Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.21
H-Index - 85
eISSN - 1742-6596
pISSN - 1742-6588
DOI - 10.1088/1742-6596/1358/1/012063
Subject(s) - cluster analysis , resampling , computer science , random forest , artificial intelligence , euclidean distance , sample (material) , class (philosophy) , feature selection , pattern recognition (psychology) , machine learning , data mining , chromatography , chemistry
Analysing data can be challenging because of the nature of the data and the wide range of methods and techniques that can be applied to it. In this study, six years of Cleft Lip and Palate patient data were gathered in order to identify the factors that contribute to successful pre-graft orthodontic treatment. The challenges were the small size of the dataset and the imbalanced sample classes. The study therefore took a step back and approached the dataset with a combination of unsupervised and supervised learning methods: clustering, to create testing records, and resampling, to balance the sample classes. We also examined whether the automatically created testing records can replace manually selected testing records by comparing the performance of the resulting classification models. Based on the selected feature, k-Means and PAM were implemented as the clustering algorithms, with Euclidean distance as the distance measure. Resampling was done using SMOTE, and Random Forest was used as the classification model. When the models were compared, those trained on resampled training records showed higher AUC values and lower OOB error. Comparable results were also achieved between the training records produced by PAM and those produced by manual selection, as both models were rated excellent classification models based on their AUC values.
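
The abstract gives no implementation details, so the following is only a minimal sketch of the idea behind SMOTE-style resampling with Euclidean distance, not the authors' code. The function names (`euclidean`, `smote`), the parameter `k`, and the two-feature toy records are all hypothetical and chosen purely for illustration.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority records (illustrative SMOTE-style sketch).

    Each synthetic record lies on the line segment between a randomly chosen
    minority record and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority records by Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: euclidean(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy records (assumption: two numeric features per record).
majority = [(5.0, 5.0), (6.0, 5.5), (5.5, 6.0), (6.5, 6.5), (7.0, 5.0), (6.0, 7.0)]
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)]

# Oversample the minority class until both classes have the same size.
new_points = smote(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
```

In a pipeline like the one described above, resampling of this kind would be applied to the training records only, so that the testing records still reflect the original class distribution.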