Application of the mol2vec Technology to Large‐size Data Visualization and Analysis
Author(s) -
Shibayama Shojiro,
Marcou Gilles,
Horvath Dragos,
Baskin Igor I.,
Funatsu Kimito,
Varnek Alexandre
Publication year - 2020
Publication title -
Molecular Informatics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.481
H-Index - 68
eISSN - 1868-1751
pISSN - 1868-1743
DOI - 10.1002/minf.201900170
Subject(s) - dimensionality reduction, principal component analysis, visualization, preprocessor, substructure, pattern recognition (psychology), computer science, curse of dimensionality, artificial intelligence, embedding, reduction (mathematics), data mining, mathematics, engineering, geometry, structural engineering
Generative Topographic Mapping (GTM) is a dimensionality reduction method that is widely used for both data visualization and structure‐activity modeling. A high-dimensional initial data space may require significant computational resources and slow down GTM construction. It can therefore be worthwhile to reduce the number of descriptors used to encode molecular structures. Principal Component Analysis (PCA), a standard preprocessing tool, suffers from information loss upon dimensionality reduction. As an alternative, we propose using the substructure vector embedding provided by the mol2vec technique. In addition to reducing data dimensionality, this technology also accounts for the proximity of substructures in molecular graphs. In this study, the dimensionality of large descriptor spaces of ISIDA fragment descriptors or Morgan fingerprints was reduced using either PCA or the mol2vec method. The latter significantly speeds up GTM training without compromising its predictive power in bioactivity classification tasks.
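
The two preprocessing routes compared in the abstract can be contrasted with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' pipeline: the count matrix is synthetic, the function names `pca_reduce` and `embed_sum` are hypothetical, and the embedding matrix is random, standing in for substructure vectors that mol2vec would learn from a corpus of molecules. The sketch assumes that a molecule vector is formed by summing the vectors of the substructures the molecule contains.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy sizes: 200 molecules, 1000 substructure descriptors, 50 target dimensions.
n_molecules, n_substructures, n_dims = 200, 1000, 50

# Synthetic sparse count matrix: rows are molecules, columns are substructure
# identifiers (a stand-in for fragment counts or Morgan substructure counts).
counts = rng.poisson(0.05, size=(n_molecules, n_substructures)).astype(float)


def pca_reduce(X, k):
    """Project mean-centered X onto its first k principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T


def embed_sum(X, E):
    """Sum substructure vectors per molecule, weighted by occurrence counts."""
    return X @ E


# Placeholder embedding (random). A trained model would supply one learned
# vector per substructure instead.
embedding = rng.normal(size=(n_substructures, n_dims))

pca_vectors = pca_reduce(counts, n_dims)
mol_vectors = embed_sum(counts, embedding)

print(pca_vectors.shape, mol_vectors.shape)
```

Both routes yield one dense, low-dimensional vector per molecule that could then be passed to a GTM implementation. The difference lies in where the projection comes from: PCA derives it from the variance of the descriptor matrix itself, whereas the embedding route takes it from vectors trained on substructure context.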