A variance reduction framework for stable feature selection
Author(s) - Han Yue, Yu Lei
Publication year - 2012
Publication title - Statistical Analysis and Data Mining: The ASA Data Science Journal
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.381
H-Index - 33
eISSN - 1932-1872
pISSN - 1932-1864
DOI - 10.1002/sam.11152
Subject(s) - feature selection, weighting, stability (learning theory), computer science, variance (accounting), data mining, feature (linguistics), minimum redundancy feature selection, artificial intelligence, pattern recognition (psychology), selection (genetic algorithm), support vector machine, dimensionality reduction, machine learning, medicine, linguistics, philosophy, accounting, business, radiology
Stability of feature selection is an important but under‐addressed issue in knowledge discovery from high‐dimensional data. In this study, we present a theoretical framework that relates the stability of feature selection to its accuracy, based on a formal bias‐variance decomposition of feature selection error. The framework also reveals the connection between stability and sample size, and suggests a variance reduction approach for improving the stability of feature selection algorithms when the sample size is small. Following the theoretical framework, we propose an empirical variance reduction framework, margin‐based instance weighting, which weights training instances according to their importance to feature evaluation. Our extensive experimental study first verifies the theoretical and empirical frameworks on synthetic data sets using a popular feature selection algorithm, SVM‐RFE. Experiments on real‐world microarray data sets further verify that the empirical framework is effective at reducing the variance and improving the subset stability of two representative feature selection algorithms, SVM‐RFE and ReliefF, while maintaining comparable predictive accuracy based on the selected features. The proposed instance weighting framework is also shown to be more effective and efficient than the ensemble framework at improving the subset stability of the feature selection algorithms under study. © 2012 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2012
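
The abstract does not say how subset stability is measured, so the following is only a minimal sketch of one common way to quantify it: select features on repeated subsamples of the training data and average the pairwise Jaccard similarity of the selected subsets. The Jaccard measure, the simple mean-difference selector, and all function names and parameters (`top_k_features`, `subset_stability`, `n_rounds`, `frac`) are illustrative assumptions, not the paper's method; in practice the selector would be replaced by SVM‐RFE or ReliefF.

```python
from itertools import combinations

import numpy as np


def top_k_features(X, y, k):
    """Hypothetical stand-in selector (NOT SVM-RFE or ReliefF).

    Ranks features by the absolute difference of class means and
    returns the indices of the k best. Assumes binary labels 0/1.
    """
    scores = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return frozenset(np.argsort(scores)[-k:].tolist())


def jaccard(a, b):
    """Jaccard similarity between two non-empty feature subsets."""
    return len(a & b) / len(a | b)


def subset_stability(X, y, k, n_rounds=20, frac=0.8, seed=0):
    """Average pairwise Jaccard similarity of the subsets selected on
    n_rounds stratified subsamples (assumed protocol, for illustration)."""
    rng = np.random.default_rng(seed)
    subsets = []
    for _ in range(n_rounds):
        # Subsample each class separately so both classes stay present.
        idx = np.concatenate([
            rng.choice(members, size=max(1, int(frac * len(members))),
                       replace=False)
            for members in (np.flatnonzero(y == c) for c in np.unique(y))
        ])
        subsets.append(top_k_features(X[idx], y[idx], k))
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A score near 1 means the same features are chosen regardless of which training instances are drawn; a score near 0 means the selected subset changes almost entirely from one subsample to the next, which is the instability that arises with small samples of high‐dimensional data.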