Impact of Benchmark Data Set Topology on the Validation of Virtual Screening Methods: Exploration and Quantification by Spatial Statistics
Author(s) - Sebastian Rohrer, Knut Baumann
Publication year - 2008
Publication title - Journal of Chemical Information and Modeling
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.24
H-Index - 160
eISSN - 1549-960X
pISSN - 1549-9596
DOI - 10.1021/ci700099u
Subject(s) - benchmark (surveying), data mining, computer science, data set, set (abstract data type), nonparametric statistics, network topology, redundancy (engineering), representation (politics), similarity (geometry), virtual screening, topology (electrical circuits), algorithm, theoretical computer science, artificial intelligence, mathematics, statistics, geodesy, combinatorics, politics, political science, law, image (mathematics), programming language, geography, operating system, chemistry, computational chemistry, molecular dynamics
A common finding of many reports evaluating ligand-based virtual screening methods is that validation results vary considerably with changing benchmark data sets. It is widely assumed that these data-set-specific effects are caused by the redundancy, self-similarity, and cluster structure inherent to those data sets. These phenomena manifest themselves in the data sets' representation in descriptor space, which is termed the data set topology. A methodology for the characterization of data set topology based on spatial statistics is introduced. The method is nonparametric and can deal with arbitrary distributions of descriptor values. With this methodology, it is possible to associate differences in virtual screening performance on different data sets with differences in data set topology. Moreover, the better virtual screening performance of certain descriptors can be explained by their ability to represent the benchmark data sets with a more favorable topology. Finally, it is shown that the composition of some benchmark data sets causes topologies that lead to overoptimistic validation results even in very "simple" descriptor spaces. Spatial statistics analysis as proposed here facilitates the detection of such biased data sets and may provide a tool for the future design of unbiased benchmark data sets.
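
The abstract does not name the statistic used, so the following is only an illustrative sketch of how a nonparametric, nearest-neighbor-based description of data set topology might look. It assumes that actives and decoys are given as points in a numeric descriptor space, that Euclidean distance is appropriate, and that comparing two empirical nearest-neighbor distance distributions is a reasonable proxy for "clumping" of actives. The function names (`ecdf_of_nearest_distances`, `clustering_score`) and the choice of thresholds are hypothetical and are not taken from the paper.

```python
import math


def ecdf_of_nearest_distances(sources, targets, thresholds, same_set=False):
    """Fraction of sources whose nearest target lies within each threshold.

    No distribution is assumed for the descriptor values; only ranks of
    distances against the thresholds matter (a nonparametric summary).
    When same_set is True, sources and targets are the same collection and
    each point is excluded from its own neighbor search.
    """
    nearest = []
    for i, p in enumerate(sources):
        candidates = [
            q for j, q in enumerate(targets) if not (same_set and i == j)
        ]
        nearest.append(min(math.dist(p, q) for q in candidates))
    return [sum(d <= t for d in nearest) / len(nearest) for t in thresholds]


def clustering_score(actives, decoys, thresholds):
    """Summed difference between two nearest-neighbor distributions.

    g: how quickly actives find another active nearby (self-similarity).
    f: how quickly decoys find an active nearby (separation from decoys).
    A clearly positive score suggests actives are clumped and separated
    from decoys; a score near zero or below suggests they are embedded
    among the decoys. This interpretation is an assumption of the sketch.
    """
    g = ecdf_of_nearest_distances(actives, actives, thresholds, same_set=True)
    f = ecdf_of_nearest_distances(decoys, actives, thresholds)
    return sum(gi - fi for gi, fi in zip(g, f))


# Toy 2-D descriptor space: a tight cluster of actives far from all decoys.
clumped = clustering_score(
    actives=[(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)],
    decoys=[(5.0, 5.0), (6.0, 5.0), (5.0, 6.0)],
    thresholds=[0.5, 1.0, 2.0],
)

# Toy 1-D-like case: actives far apart, each with a decoy close by.
embedded = clustering_score(
    actives=[(0.0, 0.0), (10.0, 0.0)],
    decoys=[(1.0, 0.0), (9.0, 0.0)],
    thresholds=[2.0],
)
```

Under these assumptions, a data set like the first toy case would be easy for almost any similarity-based method, which is the kind of bias the abstract warns can produce overoptimistic validation results.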
