Facilitating high‐dimensional transparent classification via empirical Bayes variable selection
Author(s) - Bar Haim, Booth James, Wells Martin T., Liu Kangyan
Publication year - 2018
Publication title - Applied Stochastic Models in Business and Industry
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.413
H-Index - 40
eISSN - 1526-4025
pISSN - 1524-1904
DOI - 10.1002/asmb.2393
Subject(s) - random forest, naive bayes classifier, support vector machine, feature selection, logistic regression, machine learning, statistics, nonlinear system, bayes' theorem, variable (mathematics), pattern recognition, bayesian probability, artificial intelligence, computer science, mathematics
Abstract We present a two‐step approach to classification problems in the “large P, small N” setting, where the number of predictors may exceed the sample size. We assume that the association between the predictors and the class variable has an approximate linear‐logistic form, but we allow the class boundaries to be nonlinear. We further assume that the number of true predictors is relatively small. In the first step, we use a binomial generalized linear model to identify the predictors associated with each class; we then restrict the data set to these predictors and apply a nonlinear classifier, such as a random forest or a support vector machine. We show that, without the variable screening step, the classification performance of both the random forest and the support vector machine degrades when many of the P predictors are unrelated to the class.
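
The two-step idea in the abstract lends itself to a short illustration. The Python sketch below screens each predictor with a univariate binomial GLM (logistic regression) p-value filter, a simple stand-in for the paper's empirical Bayes selection procedure, and then fits a random forest on the retained predictors. The simulated data, the 0.01 threshold, and the model settings are all illustrative, not the authors' setup.

# A minimal sketch of the two-step "screen, then classify" approach.
# The per-predictor logistic-regression filter is a stand-in, NOT the
# paper's empirical Bayes variable selection.
import numpy as np
import statsmodels.api as sm
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# "Large P, small N": 500 predictors, 100 observations, 10 informative.
X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Step 1: screen each predictor with a univariate binomial GLM.
pvals = np.ones(X_tr.shape[1])
for j in range(X_tr.shape[1]):
    try:
        fit = sm.Logit(y_tr, sm.add_constant(X_tr[:, [j]])).fit(disp=0)
        pvals[j] = fit.pvalues[1]
    except Exception:
        pvals[j] = 0.0  # perfect separation: clearly informative, keep it
keep = pvals < 0.01  # illustrative threshold

# Step 2: fit a nonlinear classifier on the screened predictors only,
# and compare against the same classifier fit on all P predictors.
rf_screened = RandomForestClassifier(random_state=0).fit(X_tr[:, keep], y_tr)
rf_full = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("accuracy with screening:   ", rf_screened.score(X_te[:, keep], y_te))
print("accuracy without screening:", rf_full.score(X_te, y_te))

On data like this, the screened forest tends to fare better, consistent with the degradation the abstract reports when the nonlinear classifier sees many predictors that are unrelated to the class.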