Subsampling from features in large regression to find “winning features”
Author(s) - Fan Yiying, Sun Jiayang
Publication year - 2021
Publication title - Statistical Analysis and Data Mining: The ASA Data Science Journal
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.381
H-Index - 33
eISSN - 1932-1872
pISSN - 1932-1864
DOI - 10.1002/sam.11499
Subject(s) - feature selection, lasso (programming language), false discovery rate, computer science, elastic net regularization, feature (linguistics), regression, random forest, dimension (graph theory), artificial intelligence, selection (genetic algorithm), data mining, machine learning, pattern recognition (psychology), statistics, mathematics, biology, biochemistry, linguistics, philosophy, world wide web, pure mathematics, gene
Feature (or variable) selection from a large number of p features continues to challenge data science, especially as data keep growing and when the goal is to discover scientifically important features in a regression setting. For example, to develop valid drug targets for ovarian cancer, we must control the false-discovery rate (FDR) of a selection procedure. The popular approach to feature selection in large-p regression uses a penalized likelihood or shrinkage estimation, such as the LASSO, SCAD, Elastic Net, or MCP procedure. We present a different approach called the Subsampling Winner algorithm (SWA), which subsamples from the p features. The idea of SWA is analogous to the selection of US National Merit Scholars: semifinalists are first selected based on students' performance in tests taken at local schools (a.k.a. subsample analyses), and the finalists (a.k.a. winning features) are then determined from the semifinalists. Due to its subsampling nature, SWA can scale to data of any dimension. SWA also has the best-controlled FDR compared to the penalized and Random Forest procedures while having a competitive true-feature discovery rate. Our application of SWA to ovarian cancer data revealed functionally important genes and pathways.
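
The semifinalist/finalist idea in the abstract can be illustrated with a minimal sketch. This is not the authors' SWA implementation: the abstract does not give the scoring or thresholding rules, so the sketch assumes a simple stand-in (mean absolute OLS t-statistic across random feature subsamples, then a t-statistic cutoff in a final fit). The function names `ols_abs_t` and `subsample_winners`, and the parameters `s`, `m`, `q`, and `t_cut`, are hypothetical and chosen only for illustration.

```python
import numpy as np


def ols_abs_t(X, y):
    """Absolute OLS t-statistics for each column of X (intercept excluded)."""
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])          # add an intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)  # least-squares fit
    resid = y - Xd @ beta
    sigma2 = resid @ resid / (n - k - 1)           # residual variance estimate
    cov = sigma2 * np.linalg.pinv(Xd.T @ Xd)       # coefficient covariance
    se = np.sqrt(np.diag(cov))
    return np.abs(beta[1:] / se[1:])


def subsample_winners(X, y, s=10, m=500, q=10, t_cut=3.0, seed=0):
    """Assumed stand-in for a subsample-then-finalize selection scheme.

    s: features per subsample, m: number of subsamples,
    q: number of semifinalists, t_cut: cutoff in the final fit.
    """
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    score_sum = np.zeros(p)
    counts = np.zeros(p)
    for _ in range(m):
        idx = rng.choice(p, size=s, replace=False)  # subsample features, not rows
        score_sum[idx] += ols_abs_t(X[:, idx], y)   # "local test" scores
        counts[idx] += 1
    # Average score per feature; features never drawn keep a score of 0.
    mean_score = np.divide(score_sum, counts, out=np.zeros(p), where=counts > 0)
    semifinalists = np.argsort(mean_score)[::-1][:q]
    # Final round: fit the semifinalists jointly and keep the strong ones.
    t_final = ols_abs_t(X[:, semifinalists], y)
    finalists = np.sort(semifinalists[t_final > t_cut])
    return semifinalists, finalists


# Synthetic demo (assumed data): only features 0, 1, 2 affect y.
rng = np.random.default_rng(42)
n, p = 200, 100
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 0] + 3.0 * X[:, 1] + 3.0 * X[:, 2] + rng.standard_normal(n)

semis, finals = subsample_winners(X, y)
print("semifinalists:", sorted(semis.tolist()))
print("finalists:", finals.tolist())
```

Because each fit involves only `s` features, the cost of one subsample analysis does not depend on p, which is the sense in which a subsampling scheme of this kind scales with dimension. The fixed cutoff `t_cut` here does not by itself control the FDR; it is only a placeholder for whatever final-stage rule is actually used.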
