Open Access
Self-Training With Quantile Errors for Multivariate Missing Data Imputation for Regression Problems in Electronic Medical Records: Algorithm Development Study
Author(s) -
Hansle Gwon,
Imjin Ahn,
Yunha Kim,
Hee-Jun Kang,
Hyeram Seo,
Ha Na Cho,
Heejung Choi,
Tae Joon Jun,
YoungHak Kim
Publication year - 2021
Publication title -
JMIR Public Health and Surveillance
Language(s) - English
Resource type - Journals
ISSN - 2369-2960
DOI - 10.2196/30824
Subject(s) - missing data , imputation (statistics) , wilcoxon signed rank test , decision tree , computer science , statistics , random forest , test data , mean squared error , multivariate statistics , data mining , artificial intelligence , machine learning , mathematics , programming language , mann–whitney u test
Background: When machine learning is applied in the real world, missing values are often the first problem encountered. Methods to impute missing values include statistical approaches such as mean imputation, expectation-maximization, and multiple imputation by chained equations (MICE), as well as machine learning methods such as the multilayer perceptron, k-nearest neighbors, and decision trees.

Objective: The objective of this study was to impute numeric medical data such as physical measurements and laboratory values. We aimed to impute data effectively using a progressive method called self-training in the medical field, where training data are scarce.

Methods: In this paper, we propose a self-training method that gradually increases the amount of available data. Models trained on complete data predict the missing values in incomplete data. Among the incomplete data, rows whose missing values are validly predicted are incorporated into the complete data; using a predicted value as if it were the actual value is called pseudolabeling. This process is repeated until a stopping condition is satisfied. The most important part of the process is how the accuracy of the pseudolabels is evaluated; they can be evaluated by observing the effect of the pseudolabeled data on the performance of the model.

Results: In self-training using random forest (RF), the mean squared error was up to 12% lower than with pure RF, and the Pearson correlation coefficient was 0.1% higher. These differences were confirmed statistically. In the Friedman test performed against MICE and RF, self-training showed P values between .003 and .02. A Wilcoxon signed-rank test against mean imputation yielded the lowest possible P value, 3.05e-5, in all situations.

Conclusions: Self-training showed significant results when comparing predicted values with actual values, but it still needs to be verified in an actual machine learning system. Moreover, self-training has the potential to improve performance depending on the pseudolabel evaluation method, which will be the main subject of our future research.
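The abstract describes the self-training loop only at a high level, and the quantile-error acceptance rule named in the title is not spelled out here. The sketch below is a minimal illustration in Python of how such a loop might look, assuming a single numeric target column, scikit-learn's RandomForestRegressor, and the spread of per-tree predictions under a quantile cutoff as a stand-in confidence criterion. The function name self_train_impute and all of its parameters are hypothetical, not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def self_train_impute(X, target_col, n_rounds=5, q=0.25, seed=0):
    """Fill np.nan entries of X[:, target_col] by iterative pseudolabeling.

    X is a 2-D float array; rows complete in every column form the initial
    training pool, and each round the most confident predictions are
    promoted to pseudolabels. The confidence rule (per-tree prediction
    spread under a quantile cutoff) is an illustrative assumption.
    """
    X = X.copy()
    feats = [c for c in range(X.shape[1]) if c != target_col]
    for _ in range(n_rounds):
        missing = np.isnan(X[:, target_col])
        if not missing.any():
            break
        # Fully observed rows act as the labeled training set.
        complete = ~np.isnan(X).any(axis=1)
        if not complete.any():
            break
        model = RandomForestRegressor(n_estimators=100, random_state=seed)
        model.fit(X[complete][:, feats], X[complete, target_col])

        # Candidate rows: target missing but all feature columns observed.
        cand = missing & ~np.isnan(X[:, feats]).any(axis=1)
        if not cand.any():
            break
        cand_X = X[cand][:, feats]
        preds = model.predict(cand_X)

        # Confidence proxy: standard deviation of the per-tree predictions.
        per_tree = np.stack([t.predict(cand_X) for t in model.estimators_])
        spread = per_tree.std(axis=0)
        keep = spread <= np.quantile(spread, q)

        # Accept the pseudolabels for the most confident rows only.
        rows = np.flatnonzero(cand)[keep]
        X[rows, target_col] = preds[keep]
    return X
```

In the paper, acceptance is tied to how pseudolabeled rows affect model performance, so replacing the spread cutoff with a held-out-error check would follow the abstract's description more closely.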
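The Results rely on paired nonparametric tests. As a side note, a two-sided exact Wilcoxon signed-rank test on n paired samples has a minimum attainable P value of 2/2^n, and 3.05e-5 corresponds to n=16, consistent with the abstract's phrase "lowest possible P value." The following hedged sketch shows how such comparisons could be run with SciPy; the score arrays are randomly generated placeholders, not the study's data.

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# Placeholder per-configuration MSE scores (16 paired settings); these are
# fabricated stand-ins, not the study's actual results.
rng = np.random.default_rng(0)
mse_self_training = rng.uniform(0.80, 0.95, size=16)
mse_mice = rng.uniform(0.90, 1.05, size=16)
mse_rf = rng.uniform(0.90, 1.05, size=16)
mse_mean = rng.uniform(1.10, 1.30, size=16)

# Friedman test: omnibus comparison of three or more paired methods.
stat, p = friedmanchisquare(mse_self_training, mse_mice, mse_rf)
print(f"Friedman vs MICE and RF: statistic={stat:.2f}, P={p:.4f}")

# Wilcoxon signed-rank test: paired comparison against mean imputation.
# With 16 pairs, no ties, and every difference in the same direction, the
# exact two-sided P is 2 / 2**16 ≈ 3.05e-5, matching the value above.
stat, p = wilcoxon(mse_self_training, mse_mean)
print(f"Wilcoxon vs mean imputation: statistic={stat:.2f}, P={p:.2e}")
```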
