Introduction of a Generally Applicable Method to Estimate Retrieval of Active Molecules for Similarity Searching using Fingerprints
Author(s) - Vogt Martin, Bajorath Jürgen
Publication year - 2007
Publication title - ChemMedChem
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.817
H-Index - 100
eISSN - 1860-7187
pISSN - 1860-7179
DOI - 10.1002/cmdc.200700090
Subject(s) - fingerprint (computing) , similarity (geometry) , computer science , virtual screening , data mining , set (abstract data type) , nearest neighbor search , bit array , selection (genetic algorithm) , precision and recall , chemical database , function (biology) , information retrieval , machine learning , artificial intelligence , drug discovery , bioinformatics , type (biology) , ecology , evolutionary biology , image (mathematics) , biology , programming language
Fingerprints are bit-string representations of molecular structure and properties and are among the most widely used computational tools for similarity searching and database screening. Various fingerprint designs are available, and their search performance generally depends strongly on the compound classes under study and the chemical characteristics of the screening database. Currently, it is not possible to predict the probability of identifying novel hits through fingerprint searching. For practical applications, however, such estimates would be very useful: one could, for example, prioritize fingerprints and compound selection strategies, or decide whether a similarity search campaign with subsequent experimental evaluation of candidate compounds is promising at all. We have developed a method that makes it possible to predict the outcome of similarity search calculations using any type of keyed fingerprint. The methodology incorporates bit frequency distributions of the reference molecules and the screening database into an information‐theoretic function and determines the recall of active compounds that is achievable in principle within selection sets of varying size. We calibrate the function on diverse compound classes and accurately predict compound recovery in retrospective virtual screening trials. Furthermore, we correctly predict fingerprint search performance on two experimental high‐throughput screening (HTS) data sets. Our findings indicate that, given a set of reference molecules, a fingerprint, and a screening database, we can readily estimate how likely it is that active compounds will be retrieved, without knowledge of the distribution of potential hits in the database.
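To make the idea of comparing bit frequency distributions concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not name the information-theoretic function, so the sketch assumes a Kullback-Leibler divergence summed over fingerprint bits that are treated as independent Bernoulli variables. The function names (`bit_frequencies`, `bernoulli_kl`, `fingerprint_divergence`) and the pseudocount smoothing are hypothetical choices made for illustration only.

```python
import math


def bit_frequencies(fingerprints, pseudocount=1.0):
    """Return the smoothed relative frequency of each bit being set.

    `fingerprints` is a list of equal-length sequences of 0/1 values.
    The pseudocount (an assumption of this sketch) keeps every frequency
    strictly between 0 and 1 so that the logarithms below are defined.
    """
    n_molecules = len(fingerprints)
    n_bits = len(fingerprints[0])
    freqs = []
    for i in range(n_bits):
        set_count = sum(fp[i] for fp in fingerprints)
        freqs.append((set_count + pseudocount) / (n_molecules + 2 * pseudocount))
    return freqs


def bernoulli_kl(p, q):
    """Kullback-Leibler divergence (in bits) between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))


def fingerprint_divergence(reference_fps, database_fps):
    """Sum per-bit divergences between reference and database bit frequencies.

    Assumes bit positions are independent, which is a simplification.
    """
    p = bit_frequencies(reference_fps)
    q = bit_frequencies(database_fps)
    return sum(bernoulli_kl(pi, qi) for pi, qi in zip(p, q))
```

Under these assumptions, a larger value means the bit settings of the reference molecules differ more strongly from those of the screening database. Turning such a value into an estimated recall requires the calibration step on known compound classes that the abstract describes, which this sketch does not attempt.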