
Comparison of Machine Learning With Logistic Regression for Prediction of Chronic Kidney Disease in the Thai Adult Population
Author(s) -
Ratchainant Thammasudjarit,
Punnathorn Ingsathit,
Sigit Ari Saputro,
Atiporn Ingsathit,
Ammarin Thakkinstian
Publication year - 2021
Publication title -
Ramathibodi Medical Journal
Language(s) - English
Resource type - Journals
ISSN - 2651-0561
DOI - 10.33165/rmj.2021.44.4.250334
Subject(s) - logistic regression, overfitting, statistics, random forest, decision tree, kidney disease, naive bayes classifier, population, predictive modelling, machine learning, artificial intelligence, artificial neural network, computer science, medicine, mathematics, support vector machine, environmental health
Background: Treating chronic kidney disease (CKD) consumes a large amount of resources. Early detection with a risk prediction model should help identify at-risk patients and allow treatment to begin early.
Objective: To compare the performance of traditional logistic regression with that of machine learning (ML) in predicting the risk of CKD in the Thai population.
Methods: This study used Thai Screening and Early Evaluation of Kidney Disease (SEEK) data. Seventeen features were initially considered for constructing prediction models using logistic regression and 4 ML algorithms (Random Forest, Naïve Bayes, Decision Tree, and Neural Network). Data were split into training and test sets at a ratio of 70:30. Model performance was assessed by estimating recall, C statistics, accuracy, F1, and precision.
Results: Seven of the 17 features were included in the prediction models. The logistic regression model discriminated well between CKD and non-CKD patients, with C statistics of 0.79 and 0.78 in the training and test data, respectively. The Neural Network performed best among the ML models, followed by Random Forest, Naïve Bayes, and Decision Tree, with corresponding C statistics of 0.82, 0.80, 0.78, and 0.77 in the training data set. In the test data, the performance of these models decreased by about 5%, 3%, 1%, and 2%, respectively, compared with a decrease of 2% for the logistic model.
Conclusions: A CKD risk prediction model built with logistic regression may yield better discrimination and be less prone to overfitting than ML models, including the Neural Network and Random Forest.
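
The comparison described in the Methods can be illustrated with a minimal sketch. It is not the authors' code: the SEEK data are not reproduced here, so a synthetic data set with 7 features stands in for them, and the use of scikit-learn, the hyperparameters, the random seeds, and the `evaluate` helper are all assumptions made for illustration. The sketch splits the data 70:30, fits logistic regression and the 4 ML algorithms, and reports the C statistic on both splits along with the relative drop from training to test data, which is the overfitting signal discussed in the Results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# ASSUMPTION: synthetic stand-in for the SEEK data (7 features, imbalanced outcome).
X, y = make_classification(n_samples=2000, n_features=7, n_informative=5,
                           n_redundant=2, weights=[0.8, 0.2], random_state=0)

# 70:30 train/test split, stratified so both sets keep the same outcome rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

# ASSUMPTION: default or illustrative hyperparameters, not those of the study.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Neural Network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)),
}


def evaluate(model, X_part, y_part):
    """Hypothetical helper: return the metrics named in the Methods."""
    prob = model.predict_proba(X_part)[:, 1]   # predicted risk of the outcome
    pred = model.predict(X_part)               # class label at the default cutoff
    return {
        "c_statistic": roc_auc_score(y_part, prob),  # C statistic = ROC AUC
        "accuracy": accuracy_score(y_part, pred),
        "precision": precision_score(y_part, pred, zero_division=0),
        "recall": recall_score(y_part, pred, zero_division=0),
        "f1": f1_score(y_part, pred, zero_division=0),
    }


results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_metrics = evaluate(model, X_train, y_train)
    test_metrics = evaluate(model, X_test, y_test)
    # Relative drop in C statistic from training to test data, in percent.
    drop_pct = 100.0 * (train_metrics["c_statistic"] - test_metrics["c_statistic"]) \
        / train_metrics["c_statistic"]
    results[name] = {"train": train_metrics, "test": test_metrics,
                     "drop_pct": drop_pct}
    print(f"{name}: C train={train_metrics['c_statistic']:.2f}, "
          f"C test={test_metrics['c_statistic']:.2f}, drop={drop_pct:.1f}%")
```

A larger drop from the training to the test C statistic suggests stronger overfitting. The numbers this sketch prints come from synthetic data and should not be compared with the values reported in the abstract.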