Variable selection in the presence of missing data: imputation‐based methods
Author(s) - Zhao Yize, Long Qi
Publication year - 2017
Publication title - Wiley Interdisciplinary Reviews: Computational Statistics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.693
H-Index - 38
eISSN - 1939-0068
pISSN - 1939-5108
DOI - 10.1002/wics.1402
Subject(s) - missing data, imputation (statistics), feature selection, computer science, resampling, data mining, variable (mathematics), selection (genetic algorithm), statistics, machine learning, artificial intelligence, mathematics, mathematical analysis
Variable selection plays an essential role in regression analysis as it identifies important variables that are associated with outcomes and is known to improve predictive accuracy of resulting models. Variable selection methods have been widely investigated for fully observed data. However, in the presence of missing data, methods for variable selection need to be carefully designed to account for missing data mechanisms and statistical techniques used for handling missing data. Since imputation is arguably the most popular method for handling missing data due to its ease of use, statistical methods for variable selection that are combined with imputation are of particular interest. These methods, valid and used under the assumptions of missing at random and missing completely at random, largely fall into three general strategies. The first strategy applies existing variable selection methods to each imputed dataset and then combines variable selection results across all imputed datasets. The second strategy applies existing variable selection methods to stacked imputed datasets. The third variable selection strategy combines resampling techniques such as bootstrap with imputation. Despite recent advances, this area remains under‐developed and offers fertile ground for further research.
WIREs Comput Stat 2017, 9:e1402. doi: 10.1002/wics.1402
This article is categorized under: Statistical and Graphical Methods of Data Analysis > Bootstrap and Resampling
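As a rough illustration of the first strategy named in the abstract (select variables on each imputed dataset, then combine the results), the following minimal sketch may help. It is not taken from the article. The imputation step is a deliberately simplistic stand-in that draws missing values from each column's observed marginal distribution, which is only plausible under missing completely at random; the selector is a small coordinate-descent lasso; and the combination rule (keep variables selected in at least half of the imputed datasets) is one possible choice among several. All function names, the penalty `lam`, and the `threshold` are assumptions made for this example.

```python
import numpy as np

def impute_once(X, rng):
    """Hypothetical stochastic imputation: fill each missing entry with a
    draw from a normal fitted to the observed values in the same column."""
    X = X.copy()
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        obs = X[~miss, j]
        X[miss, j] = rng.normal(obs.mean(), obs.std(ddof=1), miss.sum())
    return X

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso by cyclic coordinate descent on standardized columns.
    Objective: (1/2n) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    X = (X - X.mean(axis=0)) / X.std(axis=0)  # each column now has mean 0, variance 1
    resid = y - y.mean()
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]         # take out variable j's contribution
            rho = X[:, j] @ resid / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0)  # soft-threshold
            resid -= X[:, j] * beta[j]         # put the updated contribution back
    return beta

def select_across_imputations(X, y, m=10, lam=0.5, threshold=0.5, seed=0):
    """Run the selector on m imputed datasets and keep variables whose
    selection frequency is at least `threshold` (an assumed combining rule)."""
    rng = np.random.default_rng(seed)
    picked = np.zeros((m, X.shape[1]), dtype=bool)
    for k in range(m):
        beta = lasso_cd(impute_once(X, rng), y, lam)
        picked[k] = beta != 0
    freq = picked.mean(axis=0)
    return freq, freq >= threshold

# Simulated example: two true signals, four noise variables, 20% MCAR missingness.
rng = np.random.default_rng(1)
n, p = 300, 6
X_full = rng.normal(size=(n, p))
y = 2.0 * X_full[:, 0] - 1.5 * X_full[:, 1] + rng.normal(size=n)
X_miss = X_full.copy()
X_miss[rng.random((n, p)) < 0.2] = np.nan

freq, selected = select_across_imputations(X_miss, y)
print("selection frequency:", freq)
print("selected variables:", np.flatnonzero(selected))
```

In practice the imputation step would usually be replaced by a proper multiple imputation procedure that conditions on the other variables and the outcome. The second strategy would instead stack the `m` imputed datasets and run the selector once, and the third would wrap imputation and selection inside bootstrap resamples.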