
Analysis of protein features and machine learning algorithms for prediction of druggable proteins
Author(s) - Sun Tanlin, Lai Luhua, Pei Jianfeng
Publication year - 2018
Publication title - Quantitative Biology
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.707
H-Index - 15
eISSN - 2095-4697
pISSN - 2095-4689
DOI - 10.1007/s40484-018-0157-2
Subject(s) - druggability, computer science, machine learning, word2vec, support vector machine, artificial intelligence, pipeline (software), protein function, drug discovery, training set, data mining, bioinformatics, biology, biochemistry, embedding, gene, programming language
Background: Computational tools are widely used in the drug discovery process because they reduce time and cost. Predicting whether a protein is druggable is a fundamental and crucial step in the drug research pipeline. Sequence-based protein function prediction plays a vital role in many research areas. Training data, protein feature selection, and machine learning algorithms are three indispensable elements that determine the success of such models.

Methods: In this study, we tested the performance of different combinations of protein features and machine learning algorithms in predicting druggable proteins, based on the targets of FDA‐approved small molecules. We also enlarged the dataset to include the targets of small molecules under experimental or clinical investigation.

Results: We found that although the 146‐d vector used by Li et al. with a neural network achieved the best training accuracy (91.10%), overlapped 3‐gram word2vec with logistic regression achieved the best prediction accuracy on the independent test set (89.55%) and on newly approved targets. We also trained models on the enlarged dataset that included the targets of small molecules under experimental and clinical investigation; however, the best training accuracy was only 75.48%. In addition, we applied our models to predict potential targets for reference in future studies.

Conclusions: Our study indicates the potential of word2vec for predicting druggable proteins, and suggests that the training dataset of druggable proteins should not be extended to targets that lack verification. The target prediction package is available at https://github.com/pkumdl/target_prediction.
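
To make the "overlapped 3‐gram" representation concrete, the following is a minimal sketch of how a protein sequence might be split into overlapping 3-residue "words" before being passed to a word2vec-style model. The function names, the example sequence, the toy 2-d embeddings, and the mean-pooling step are all assumptions made for illustration; the abstract does not state how the authors pooled the 3-gram vectors, and this is not the code from the linked package.

```python
from typing import Dict, List, Sequence


def overlapping_kmers(sequence: str, k: int = 3) -> List[str]:
    """Split a protein sequence into overlapping k-mers (stride 1)."""
    seq = sequence.strip().upper()
    # A sequence shorter than k yields an empty list.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def mean_pool(tokens: Sequence[str],
              embeddings: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of all tokens that have an embedding.

    Mean pooling is an assumption here; it is one common way to turn
    per-token vectors into a fixed-length protein feature vector.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token has an embedding")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Hypothetical example sequence and toy embeddings (not real data).
tokens = overlapping_kmers("MKTAYIAK")
toy_embeddings = {"MKT": [1.0, 0.0], "KTA": [0.0, 1.0]}
protein_vector = mean_pool(tokens, toy_embeddings)

print(tokens)
print(protein_vector)
```

In a fuller pipeline, the token lists would typically be used to train an embedding model, and the pooled vectors would serve as input features for a classifier such as logistic regression.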