PySpark and RDKit: Moving towards Big Data in Cheminformatics
Author(s) - Lovrić Mario, Molero José Manuel, Kern Roman
Publication year - 2019
Publication title - Molecular Informatics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.481
H-Index - 68
eISSN - 1868-1751
pISSN - 1868-1743
DOI - 10.1002/minf.201800082
Subject(s) - cheminformatics, python (programming language), scalability, computer science, spark (programming language), workstation, big data, scripting language, computer cluster, analytics, data mining, computational science, parallel computing, operating system, distributed computing, database, programming language, bioinformatics, biology
The authors present an implementation of the cheminformatics toolkit RDKit in a distributed computing environment, Apache Hadoop. Together with the Apache Spark analytics engine, accessed through PySpark, resources from scalable commodity hardware can be employed for cheminformatics calculations and query operations, given basic knowledge of Python programming and an understanding of resilient distributed datasets (RDDs). Three use cases of cheminformatics computing in Spark on the Hadoop cluster are presented: querying substructures, calculating fingerprint similarity, and calculating molecular descriptors. The source code for the PySpark‐RDKit implementation is provided. The use cases showed that Spark scales reasonably well, to a degree that depends on the use case, and can be a suitable choice for datasets too big to be processed on current low‐end workstations.
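To make the fingerprint-similarity use case more concrete, the following is a minimal sketch of the per-record map-and-filter pattern that an RDD-based workflow relies on. It is not the authors' code. It makes several assumptions that the abstract does not state: the Tanimoto coefficient as the similarity metric, fingerprints represented as sets of on-bit indices, a 0.5 cut-off, and the hypothetical names `tanimoto`, `make_scorer`, `library`, and `query`. Python's built-in `map` stands in for a distributed `RDD.map`, so the sketch runs without Spark or RDKit.

```python
from typing import Callable, FrozenSet, List, Tuple

# Assumption: a fingerprint is modelled as the set of bit positions that are on.
Fingerprint = FrozenSet[int]
Record = Tuple[str, Fingerprint]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared on-bits divided by all on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints: define the similarity as zero
    return len(a & b) / union


def make_scorer(query: Fingerprint) -> Callable[[Record], Tuple[str, float]]:
    """Build a per-record function of the kind one would pass to a map step."""
    def score(record: Record) -> Tuple[str, float]:
        mol_id, fingerprint = record
        return mol_id, tanimoto(query, fingerprint)
    return score


# Hypothetical toy library; real fingerprints would come from a toolkit.
library: List[Record] = [
    ("mol_a", frozenset({1, 2, 3, 4})),
    ("mol_b", frozenset({3, 4, 5, 6})),
    ("mol_c", frozenset({7, 8})),
]
query: Fingerprint = frozenset({1, 2, 3, 4})

# Local stand-in for a distributed map followed by a filter.
scores = list(map(make_scorer(query), library))
hits = [mol_id for mol_id, similarity in scores if similarity >= 0.5]
```

Because `score` depends only on its own record and the fixed query, each record can be processed independently of the others, which is the property that lets this kind of calculation be spread across the partitions of a cluster.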
