An improved deep learning method for predicting DNA-binding proteins based on contextual features in amino acid sequences
Author(s) - Siquan Hu, Ruixiong Ma, Haiou Wang
Publication year - 2019
Publication title - PLOS ONE
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.99
H-Index - 332
ISSN - 1932-6203
DOI - 10.1371/journal.pone.0225317
Subject(s) - convolutional neural network, artificial intelligence, support vector machine, context (archaeology), computer science, uniprot, generalization, pattern recognition (psychology), test set, machine learning, deep learning, set (abstract data type), recurrent neural network, computational biology, training set, artificial neural network, biology, mathematics, biochemistry, gene, paleontology, mathematical analysis, programming language
As the number of known proteins has grown, accurately identifying DNA-binding proteins has become a significant biological challenge. Various computational methods have been proposed to recognize DNA-binding proteins from amino acid sequences alone, such as SVM, DNABP and CNN-RNN. However, these methods do not consider the context within amino acid sequences, which makes it difficult for them to capture sequence features adequately. In this study, a new method called CNN-BiLSTM, which combines a bidirectional long short-term memory (BiLSTM) recurrent neural network with a convolutional neural network (CNN), is proposed to identify DNA-binding proteins. The CNN-BiLSTM model can explore the potential contextual relationships in amino acid sequences and extract more features than traditional models can. The experimental results show that CNN-BiLSTM achieves a validation-set prediction accuracy of 96.5%, which is 7.8% higher than that of SVM, 9.6% higher than that of DNABP and 3.7% higher than that of CNN-RNN. When tested on 20,000 independent samples provided by UniProt that were not involved in model training, CNN-BiLSTM reached an accuracy of 94.5%, which is 12% higher than that of SVM, 4.9% higher than that of DNABP and 4% higher than that of CNN-RNN. We visualized and compared the training process of CNN-BiLSTM with that of CNN-RNN and found that the former generalizes better from the training dataset, suggesting that CNN-BiLSTM adapts to a wider range of protein sequences. On the test set, the predictions of CNN-BiLSTM are more credible because its predicted scores are closer to the sample labels than those of CNN-RNN. Therefore, the proposed CNN-BiLSTM is a more powerful method for identifying DNA-binding proteins.
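
To make the data flow concrete, the following is a minimal pure-Python sketch of a convolution followed by a bidirectional recurrent pass over an amino acid sequence. It is not the authors' implementation: the one-hot encoding, layer sizes, random weights, mean pooling, and all function names are assumptions chosen for illustration, and a plain tanh recurrence stands in for the LSTM cell to keep the example short. The point it illustrates is that each position's final representation combines a forward state (context to its left) with a backward state (context to its right).

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def one_hot(seq):
    """Encode each residue as a 20-dimensional one-hot vector (assumed encoding)."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    return [[1.0 if index.get(aa) == i else 0.0 for i in range(len(AMINO_ACIDS))]
            for aa in seq]

def conv1d(x, kernels):
    """Valid 1-D convolution with ReLU; each kernel has shape [width][channels]."""
    width, channels = len(kernels[0]), len(x[0])
    out = []
    for t in range(len(x) - width + 1):
        out.append([max(0.0, sum(kern[k][c] * x[t + k][c]
                                 for k in range(width) for c in range(channels)))
                    for kern in kernels])
    return out

def recurrent_pass(x, w_in, w_rec):
    """Simple tanh recurrence (a stand-in for an LSTM); returns the state at every step."""
    hidden = len(w_rec)
    h = [0.0] * hidden
    states = []
    for vec in x:
        h = [math.tanh(sum(w_in[j][c] * vec[c] for c in range(len(vec)))
                       + sum(w_rec[j][i] * h[i] for i in range(hidden)))
             for j in range(hidden)]
        states.append(h)
    return states

def bidirectional(x, fwd, bwd):
    """Concatenate left-to-right and right-to-left states at each position."""
    forward = recurrent_pass(x, *fwd)
    backward = recurrent_pass(x[::-1], *bwd)[::-1]  # re-align to original order
    return [f + b for f, b in zip(forward, backward)]

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def build_weights(seed=0, filters=4, width=3, hidden=3):
    """Untrained random weights; sizes are illustrative only."""
    rng = random.Random(seed)
    kernels = [rand_matrix(rng, width, len(AMINO_ACIDS)) for _ in range(filters)]
    fwd = (rand_matrix(rng, hidden, filters), rand_matrix(rng, hidden, hidden))
    bwd = (rand_matrix(rng, hidden, filters), rand_matrix(rng, hidden, hidden))
    out = [rng.uniform(-0.5, 0.5) for _ in range(2 * hidden)]
    return kernels, fwd, bwd, out

def score(seq, weights):
    """Return a score in (0, 1); seq must be at least as long as the kernel width."""
    kernels, fwd, bwd, out = weights
    feats = conv1d(one_hot(seq), kernels)        # local motif features
    ctx = bidirectional(feats, fwd, bwd)         # context from both directions
    pooled = [sum(col) / len(ctx) for col in zip(*ctx)]  # mean over positions
    logit = sum(w * p for w, p in zip(out, pooled))
    return 1.0 / (1.0 + math.exp(-logit))

weights = build_weights()
print(score("MKRLAVDEQ", weights))  # untrained, so the value is not meaningful
```

Because the weights are random and untrained, the printed score carries no biological meaning; in practice the convolutional and recurrent layers would come from a deep learning framework and be trained on labeled sequences.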
