Multivariate Procedure for Variable Selection and Classification of High Dimensional Heterogeneous Data
Author(s) -
Tahir Mehmood,
Zahid Rasheed
Publication year - 2015
Publication title -
Communications for Statistical Applications and Methods
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.326
H-Index - 6
eISSN - 2383-4757
pISSN - 2287-7843
DOI - 10.5351/csam.2015.22.6.575
Subject(s) - multivariate statistics , feature selection , selection (genetic algorithm) , statistics , mathematics , high dimensional , variable (mathematics) , multivariate analysis , computer science , data mining , pattern recognition (psychology) , artificial intelligence , mathematical analysis
Advances in data collection techniques produce high dimensional data sets, in which discrimination is an important and commonly encountered problem. It becomes crucial to resolve when the high dimensional data are heterogeneous (the classes do not share a common variance-covariance structure). An example is the classification of microbial habitat preferences based on codon/bi-codon usage. Habitat preference is important for the study of evolutionary genetic relationships and may help industry produce specific enzymes. Most classification procedures assume homogeneity (a common variance-covariance structure for all classes), which is not guaranteed in most high dimensional data sets. We introduce regularized elimination in partial least squares coupled with QDA (rePLS-QDA) for parsimonious variable selection and classification of high dimensional heterogeneous data sets. It is based on the recently introduced regularized elimination for variable selection in partial least squares (rePLS) and on quadratic discriminant analysis (QDA), a classification procedure for heterogeneous data. The proposed and existing methods are compared on a simulated data set; in addition, the proposed procedure is applied to classify microbial habitat preferences by their codon/bi-codon usage. Five bacterial habitats (Aquatic, Host Associated, Multiple, Specialized and Terrestrial) are modeled. The classification accuracy for each habitat is satisfactory and ranges from 89.1% to 100% on test data. Interesting codon/bi-codon usages, and the mutual interactions among them that influence the respective habitat preference, are identified. The proposed method also produced results that concur with known biological characteristics, which will help researchers better understand the divergence of species.
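To illustrate why heterogeneity matters for classification, the following is a minimal sketch of the QDA decision rule only, in which each class keeps its own mean and covariance. It is not the authors' rePLS-QDA procedure: the variable-selection step (rePLS) is omitted, the data are synthetic, and every name (`mean_cov_2d`, `qda_score`, `fit_qda`, `predict`, the labels `narrow` and `wide`) is hypothetical. The sketch is restricted to two variables so that the covariance inverse has a closed form.

```python
import math
import random


def mean_cov_2d(points):
    """Return the sample mean and the 2x2 sample covariance of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def qda_score(x, mean, cov, prior):
    """Quadratic discriminant score using the class's own covariance."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Squared Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    maha = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * math.log(det) - 0.5 * maha + math.log(prior)


def fit_qda(data):
    """Estimate (mean, covariance, prior) per class; data maps label -> points."""
    total = sum(len(pts) for pts in data.values())
    return {label: (*mean_cov_2d(pts), len(pts) / total)
            for label, pts in data.items()}


def predict(model, x):
    """Assign x to the class with the largest quadratic discriminant score."""
    return max(model, key=lambda label: qda_score(x, *model[label]))


# Synthetic heterogeneous data (an assumption for illustration): both classes
# are centred at the origin and differ only in their spread, so a rule that
# assumes a common covariance has nothing to separate them with.
random.seed(0)
data = {
    "narrow": [(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(200)],
    "wide": [(random.gauss(0, 3.0), random.gauss(0, 3.0)) for _ in range(200)],
}
model = fit_qda(data)

print(predict(model, (0.1, 0.0)))
print(predict(model, (5.0, 5.0)))
```

With many more variables than observations, the per-class covariance matrices in this rule cannot be estimated reliably, which is one reason a variable-selection step is needed before QDA in the high dimensional setting the abstract describes.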