The effect of structural redundancy in validation sets on virtual screening performance
Author(s) - Clark, Robert D.; Shepphird, Jennifer K.; Holliday, John
Publication year - 2009
Publication title - Journal of Chemometrics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.47
H-Index - 92
eISSN - 1099-128X
pISSN - 0886-9383
DOI - 10.1002/cem.1240
Subject(s) - weighting, virtual screening, redundancy (engineering), computer science, false positive paradox, cluster analysis, receiver operating characteristic, similarity (geometry), data mining, set (abstract data type), relevance (law), artificial intelligence, pattern recognition (psychology), mathematics, machine learning, pharmacophore, medicine, bioinformatics, radiology, political science, law, image (mathematics), biology, programming language, operating system
The performance of a classification model is often assessed in terms of how well it separates a set of known observations into appropriate classes. If the validation sets used for such analyses are redundant due to bias in sampling, the relevance of the conclusions drawn to prospective work in which new kinds of positives are sought may be compromised. In the case of the various virtual screening techniques used in modern drug discovery, such bias generally appears as over‐representation of particular structural subclasses in the test set. We show how clustering by substructural similarity, followed by applying arithmetic and harmonic weighting schemes to receiver operating characteristic (ROC) curves, can be used to identify validation sets that are biased due to such redundancies. This can be accomplished qualitatively by direct examination or quantitatively by comparing the areas under the respective linear or semilog curves (AUCs or pAUCs). Copyright © 2009 John Wiley & Sons, Ltd.
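
The abstract names the weighting schemes but does not define them, so the following Python sketch is only an illustration under stated assumptions, not the authors' implementation. It assumes that arithmetic weighting means each active contributes 1/(size of its structural cluster) to the true-positive rate, so that every cluster of actives counts equally. Harmonic weighting and the semilog pAUC are not covered. The function name `weighted_roc_auc`, the cluster labels, and the toy data are hypothetical, and tied scores are assumed not to occur.

```python
import numpy as np

def weighted_roc_auc(scores, is_active, cluster_ids=None):
    """Linear ROC AUC with optional per-cluster weighting of actives.

    Assumption (not taken from the abstract): with cluster_ids given, each
    active is weighted by 1 / (number of actives in its cluster).
    With cluster_ids=None every active has weight 1 (ordinary AUC).
    """
    scores = np.asarray(scores, dtype=float)
    is_active = np.asarray(is_active, dtype=bool)
    weights = np.ones(scores.size)
    if cluster_ids is not None:
        active_clusters = np.asarray(cluster_ids)[is_active]
        # counts[inverse] gives, for each active, the size of its cluster
        _, inverse, counts = np.unique(
            active_clusters, return_inverse=True, return_counts=True
        )
        weights[is_active] = 1.0 / counts[inverse]

    order = np.argsort(-scores, kind="stable")  # best score first
    act = is_active[order]
    w = weights[order]

    # Cumulative weighted true-positive rate and unweighted false-positive rate
    tpr = np.concatenate(([0.0], np.cumsum(w * act) / w[act].sum()))
    fpr = np.concatenate(([0.0], np.cumsum(~act) / (~act).sum()))

    # Trapezoidal area under the (fpr, tpr) curve
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy ranking: three redundant actives (cluster "a") ranked at the top,
# one structurally distinct active (cluster "b") ranked near the bottom.
scores = [8, 7, 6, 5, 4, 3, 2, 1]
is_active = [1, 1, 1, 0, 0, 0, 1, 0]
clusters = ["a", "a", "a", "d1", "d2", "d3", "b", "d4"]  # decoy labels are ignored

plain_auc = weighted_roc_auc(scores, is_active)
cluster_auc = weighted_roc_auc(scores, is_active, clusters)
print(plain_auc, cluster_auc)
```

In this toy case the unweighted AUC is 0.8125, while the cluster-weighted AUC is 0.625. A gap of this kind between the two values is the sort of signal the abstract describes for flagging a validation set whose apparent performance is driven by an over-represented structural subclass.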
