Using the Gini coefficient to measure the chemical diversity of small‐molecule libraries
Author(s) -
Weidlich Iwona E.,
Filippov Igor V.
Publication year - 2016
Publication title -
Journal of Computational Chemistry
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.907
H-Index - 188
eISSN - 1096-987X
pISSN - 0192-8651
DOI - 10.1002/jcc.24423
Subject(s) - measure (data warehouse), diversity (politics), cheminformatics, chemical database, Gini coefficient, molecule, computer science, moment (physics), chemistry, data mining, mathematics, computational chemistry, physics, sociology, organic chemistry, mathematical analysis, classical mechanics, anthropology, economic inequality, inequality
Modern databases of small organic molecules contain tens of millions of structures, and the theoretically accessible chemical space is larger still. Yet despite this volume of chemical information, the “big data” moment for chemistry has not delivered a corresponding payoff in cheaper computer‐predicted medicines or robust machine‐learning models for predicting efficacy and toxicity. Here, we study the diversity of chemical datasets using the Gini coefficient, a measure commonly used in socioeconomic studies. We demonstrate this diversity measure on several datasets constructed to contain various congeneric subsets of molecules as well as randomly selected molecules. We also apply the method to a number of well‐known databases that are frequently used for structure‐activity relationship modeling. Our results show that common sources of potential lead compounds are poorly diverse compared with known drugs. © 2016 Wiley Periodicals, Inc.
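For readers unfamiliar with the measure, the following is a minimal sketch of the standard Gini coefficient, not the authors' implementation. The abstract does not say which quantity the coefficient is computed over, so the input here is an assumption: a hypothetical list of counts, such as the number of molecules falling into each structural cluster of a library. The function name `gini` and the example counts are illustrative only.

```python
from typing import Sequence


def gini(values: Sequence[float]) -> float:
    """Standard Gini coefficient of non-negative values.

    Returns 0.0 when all values are equal and approaches 1.0 as the
    total concentrates in a single entry.
    """
    xs = sorted(values)  # ascending order is required by the rank formula
    n = len(xs)
    if n == 0:
        raise ValueError("values must not be empty")
    if xs[0] < 0:
        raise ValueError("values must be non-negative")
    total = sum(xs)
    if total == 0:
        raise ValueError("at least one value must be positive")
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, ranks i = 1..n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Hypothetical cluster populations (assumed inputs, not data from the paper).
even_library = [25, 25, 25, 25]   # molecules spread evenly over clusters
skewed_library = [97, 1, 1, 1]    # one congeneric series dominates

print(gini(even_library))
print(gini(skewed_library))
```

Under this assumed setup, the evenly populated library gives a coefficient of 0 and the library dominated by one series gives roughly 0.72, so a higher value would correspond to lower diversity. Whether the paper uses this convention or a different underlying quantity is not stated in the abstract.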