Unsupervised Classification of Chemical Compounds
Author(s) -
Gutiérrez Toscano P.,
Marriott F. H. C.
Publication year - 1999
Publication title -
Journal of the Royal Statistical Society: Series C (Applied Statistics)
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.205
H-Index - 72
eISSN - 1467-9876
pISSN - 0035-9254
DOI - 10.1111/1467-9876.00146
Subject(s) - multidimensional scaling, cluster analysis, computer science, fingerprint (computing), data mining, pattern recognition (psychology), scaling, binary data, binary number, data set, metric (unit), cluster (spacecraft), metric space, set (abstract data type), coding (social sciences), artificial intelligence, mathematics, machine learning, statistics, discrete mathematics, engineering, operations management, geometry, arithmetic, programming language
Clustering chemical compounds of similar structure is important in the pharmaceutical industry. One way of describing the structure is the chemical 'fingerprint'. The fingerprint is a string of binary digits, and typical data sets consist of very large numbers of fingerprints; a suitable clustering procedure must take account of the properties of this method of coding, and must be able to handle large data sets. This paper describes the analysis of a set of fingerprint data. The analysis was based on an appropriate distance measure derived from the fingerprints, followed by metric scaling into a low‐dimensional space. An approximation to metric scaling, suitable for very large data sets, was investigated. Cluster analysis using two programs, mclust and AutoClass‐C, was carried out on the scaled data.
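The first two steps of the pipeline in the abstract (a distance between binary fingerprints, then metric scaling) can be illustrated with a minimal Python sketch. The abstract does not name the distance measure, so the Tanimoto (Jaccard) distance used here is an assumption, chosen only because it is a common choice for binary fingerprints. The function names, the random toy data, and the choice of three dimensions are also hypothetical. The paper's large-data approximation and the mclust and AutoClass‐C clustering runs are not reproduced.

```python
import numpy as np

def tanimoto_distance(fp):
    """Pairwise Tanimoto (Jaccard) distance between rows of a 0/1 matrix.

    Assumption: the abstract does not say which distance the paper uses.
    """
    fp = np.asarray(fp, dtype=float)
    common = fp @ fp.T                      # bits set in both fingerprints
    counts = fp.sum(axis=1)                 # bits set in each fingerprint
    union = counts[:, None] + counts[None, :] - common
    # Two all-zero fingerprints have an empty union; treat them as identical.
    sim = np.divide(common, union, out=np.ones_like(common), where=union > 0)
    return 1.0 - sim

def classical_mds(dist, n_components=2):
    """Metric (classical) scaling via the double-centred squared distances."""
    n = dist.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centring @ (dist ** 2) @ centring
    eigvals, eigvecs = np.linalg.eigh(gram)          # eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)    # ignore negative eigenvalues
    return eigvecs[:, order] * np.sqrt(top_vals)

# Toy data: 50 random 64-bit fingerprints (illustrative only).
rng = np.random.default_rng(0)
fingerprints = rng.integers(0, 2, size=(50, 64))
dist = tanimoto_distance(fingerprints)
coords = classical_mds(dist, n_components=3)
```

The full eigendecomposition needs the complete n × n distance matrix, so its cost grows quickly with the number of compounds; this is the kind of limit that motivates an approximate scaling method for very large data sets. The resulting `coords` array is the sort of low‐dimensional input that a model-based clustering program would then take.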