Selecting Relevant Descriptors for Classification by Bayesian Estimates: A Comparison with Decision Trees and Support Vector Machines Approaches for Disparate Data Sets

Author(s) - Carbon‐Mangels Miriam, Hutter Michael C.

Publication year - 2011

Publication title - Molecular Informatics

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.481

H-Index - 68

eISSN - 1868-1751

pISSN - 1868-1743

DOI - 10.1002/minf.201100069

Subject(s) - overfitting, support vector machine, discriminative model, artificial intelligence, Bayesian probability, computer science, feature selection, pattern recognition (psychology), decision tree, machine learning, curse of dimensionality, divergence (linguistics), data mining, posterior probability, mathematics, artificial neural network, linguistics, philosophy

Classification algorithms suffer from the curse of dimensionality, which leads to overfitting, particularly if the problem is over‐determined. Therefore, it is of particular interest to identify the most relevant descriptors in order to reduce the complexity. We applied Bayesian estimates to model the probability distribution of the descriptor values used for binary classification, using n‐fold cross‐validation. As a measure of the discriminative power of the classifiers, the symmetric form of the Kullback–Leibler divergence of their probability distributions was computed. We found that the most relevant descriptors possess a Gaussian‐like distribution of their values, show the largest divergences, and therefore appear most often in the cross‐validation scenario. The results were compared to those of the LASSO feature selection method applied to multiple decision trees and support vector machine approaches for data sets of substrates and nonsubstrates of three Cytochrome P450 isoenzymes, which comprise strongly unbalanced compound distributions. In contrast to decision trees and support vector machines, the performance of Bayesian estimates is less affected by unbalanced data sets. This strategy reveals those descriptors that allow a simple linear separation of the classes, whereas the superior accuracy of decision trees and support vector machines can be attributed to nonlinear separation, which is in turn more prone to overfitting.
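The ranking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each descriptor is modelled by one univariate Gaussian per class (the abstract only notes that the most relevant descriptors are Gaussian‐like), for which the symmetric Kullback–Leibler divergence has a closed form. The function names, the toy data, and the top_k cut‐off are hypothetical choices made for illustration.

```python
import numpy as np


def symmetric_kl_gaussian(mu_a, var_a, mu_b, var_b):
    """Symmetric KL divergence KL(a||b) + KL(b||a) of two 1-D Gaussians.

    The logarithmic terms of the two directed divergences cancel,
    leaving a variance-ratio term and a mean-shift term.
    """
    d2 = (mu_a - mu_b) ** 2
    ratio_term = 0.5 * (var_a / var_b + var_b / var_a - 2.0)
    shift_term = 0.5 * d2 * (1.0 / var_a + 1.0 / var_b)
    return ratio_term + shift_term


def descriptor_divergences(X, y, eps=1e-9):
    """Per-descriptor divergence between the two classes.

    Assumption: each column of X is Gaussian within each class.
    eps guards against zero variance in constant descriptors.
    Both classes must be present in y.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y).astype(bool)
    pos, neg = X[y], X[~y]
    return symmetric_kl_gaussian(
        pos.mean(axis=0), pos.var(axis=0) + eps,
        neg.mean(axis=0), neg.var(axis=0) + eps,
    )


def selection_frequency(X, y, n_folds=5, top_k=1, seed=0):
    """Count how often each descriptor ranks in the top_k across folds.

    Plain (unstratified) n-fold split; with strongly unbalanced data a
    stratified split would be the safer choice.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    counts = np.zeros(X.shape[1], dtype=int)
    for held_out in np.array_split(order, n_folds):
        train = np.setdiff1d(order, held_out)
        div = descriptor_divergences(X[train], y[train])
        counts[np.argsort(div)[::-1][:top_k]] += 1
    return counts


def make_toy_data(n_neg=160, n_pos=40, n_desc=4, shift=3.0, seed=1):
    """Hypothetical unbalanced data: only descriptor 0 separates the classes."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_neg + n_pos, n_desc))
    y = np.r_[np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)]
    X[y == 1, 0] += shift
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print("divergences:", descriptor_divergences(X, y))
    print("top-1 counts over 5 folds:", selection_frequency(X, y))
```

Because the divergence depends only on per‐class means and variances, and not on class sizes, this kind of score is comparatively insensitive to class imbalance, which is consistent with the behaviour the abstract reports for Bayesian estimates.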
