Benchmarking Variable Selection in QSAR
Author(s) - Eklund Martin, Norinder Ulf, Boyer Scott, Carlsson Lars
Publication year - 2012
Publication title - Molecular Informatics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.481
H-Index - 68
eISSN - 1868-1751
pISSN - 1868-1743
DOI - 10.1002/minf.201100142
Subject(s) - quantitative structure–activity relationship , univariate , feature selection , benchmarking , interpretability , selection (genetic algorithm) , computer science , variable (mathematics) , multivariate adaptive regression splines , machine learning , latent variable , multivariate statistics , artificial intelligence , data mining , regression analysis , mathematics , bayesian multivariate linear regression , mathematical analysis , marketing , business
Variable selection is important in QSAR modeling since it can improve model performance and transparency, as well as reduce the computational cost of model fitting and prediction. Which variable selection methods perform well in QSAR settings is largely unknown. To address this question, we rigorously investigated, in a total of 1728 benchmarking experiments, how eight variable selection methods affect the predictive performance and transparency of random forest models fitted to seven QSAR datasets covering different endpoints, descriptor sets, types of response variables, and numbers of chemical compounds. The results show that univariate variable selection methods are suboptimal and that the number of variables in the benchmarked datasets can be reduced by about 60 % without significant loss in model performance when using multivariate adaptive regression splines (MARS) and forward selection.
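To make the idea of forward selection concrete, the following is a minimal sketch of greedy forward selection, not the authors' benchmarking protocol. It assumes a toy dataset and, to stay self-contained, swaps the random forest used in the study for a leave-one-out 1-nearest-neighbour regressor; the names `loo_1nn_score`, `forward_select`, and `min_gain` are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence


def loo_1nn_score(X: Sequence[Sequence[float]], y: Sequence[float],
                  cols: Sequence[int]) -> float:
    """Negative leave-one-out mean squared error of a 1-nearest-neighbour
    regressor restricted to the columns in `cols` (higher is better).
    Assumption: this stands in for the random forest used in the study."""
    n = len(y)
    total = 0.0
    for i in range(n):
        others = [k for k in range(n) if k != i]
        if cols:
            # Nearest other compound by squared Euclidean distance;
            # ties go to the lowest index.
            nearest = min(
                others,
                key=lambda k: sum((X[i][c] - X[k][c]) ** 2 for c in cols),
            )
            pred = y[nearest]
        else:
            # No variables selected yet: predict the mean of the other responses.
            pred = sum(y[k] for k in others) / len(others)
        total += (y[i] - pred) ** 2
    return -total / n


def forward_select(n_features: int,
                   score: Callable[[List[int]], float],
                   min_gain: float = 1e-9) -> List[int]:
    """Greedy forward selection: repeatedly add the single variable that
    improves the score most, and stop when no candidate improves it by
    more than `min_gain`."""
    selected: List[int] = []
    best = score(selected)
    remaining = list(range(n_features))
    while remaining:
        candidate_scores = {j: score(selected + [j]) for j in remaining}
        j_best = max(remaining, key=lambda j: candidate_scores[j])
        if candidate_scores[j_best] - best <= min_gain:
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best = candidate_scores[j_best]
    return selected


# Toy data (hypothetical): column 0 determines y, column 1 is shuffled
# noise, column 2 is constant and therefore uninformative.
X = [(0, 5, 1), (1, 0, 1), (2, 4, 1), (3, 1, 1), (4, 3, 1), (5, 2, 1)]
y = [0, 1, 2, 3, 4, 5]

selected = forward_select(3, lambda cols: loo_1nn_score(X, y, cols))
print(selected)  # [0]
```

On this toy data only the informative column is retained, since adding either of the other two columns does not improve the leave-one-out error. In a setting closer to the study, the score function would presumably be replaced by a cross-validated performance estimate of a random forest model.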