Activity prediction and identification of mis‐annotated chemical compounds using extreme descriptors
Author(s) -
Borysov Petro,
Hannig Jan,
Marron J. S.,
Muratov Eugene,
Fourches Denis,
Tropsha Alexander
Publication year - 2016
Publication title -
Journal of Chemometrics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.47
H-Index - 92
eISSN - 1099-128X
pISSN - 0886-9383
DOI - 10.1002/cem.2776
Subject(s) - identification (biology), quantitative structure–activity relationship, molecular descriptor, artificial intelligence, variance (accounting), pattern recognition (psychology), computer science, extreme value theory, mathematics, machine learning, statistics, ecology, biology, accounting, business
Data pre‐processing that includes removal of descriptors with low variance is a standard first step in quantitative structure–activity relationship modeling. In this paper, we study low‐variance descriptors and show that some of them contain significant amounts of useful information. In particular, we define the notion of extreme descriptors (those variables that have the same value for almost all compounds and only a few values that are different from the common median). We show that extreme descriptors can be helpful for activity prediction in a standard binary classification setting. Moreover, we demonstrate using two case studies (M2 muscarinic receptors and skin sensitization) that extreme descriptors can be used for the identification of possibly mislabeled compounds. Because of these previously unknown, but important, properties, extreme descriptors should be considered in quantitative structure–activity relationship modeling studies. Copyright © 2016 John Wiley & Sons, Ltd.
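The verbal definition of extreme descriptors in the abstract can be illustrated with a minimal Python sketch. It flags a descriptor column when at least one, but only a small fraction, of its values differ from the column median. The function name `find_extreme_descriptors`, the parameter `max_fraction_off_median`, and the 10% cut-off are assumptions made here for illustration; the abstract does not state the paper's exact criterion, and the toy data are synthetic.

```python
import numpy as np


def find_extreme_descriptors(X, max_fraction_off_median=0.10):
    """Return indices of columns that look like extreme descriptors.

    A column is flagged when some, but at most `max_fraction_off_median`,
    of its values differ from the column median. The cut-off is an
    illustrative assumption, not a value taken from the paper.
    """
    X = np.asarray(X, dtype=float)
    medians = np.median(X, axis=0)
    # Fraction of compounds (rows) whose value differs from the column median.
    frac_off = (X != medians).mean(axis=0)
    # Exclude fully constant columns (frac_off == 0): they carry no information.
    mask = (frac_off > 0) & (frac_off <= max_fraction_off_median)
    return np.flatnonzero(mask)


# Toy descriptor matrix: 40 compounds (rows) by 3 descriptors (columns).
X = np.zeros((40, 3))
X[[3, 17, 29], 1] = 5.0       # descriptor 1: three compounds off the median
X[:, 2] = np.arange(40)       # descriptor 2: ordinary, high-variance descriptor
extreme = find_extreme_descriptors(X)
print(extreme)
```

In this toy matrix only descriptor 1 is flagged: descriptor 0 is constant and descriptor 2 varies across every compound. A variance filter of the kind the abstract calls standard would typically discard descriptor 1 along with descriptor 0, which is the situation the paper argues against.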