Computational Technique for an Efficient Classification of Protein Sequences With Distance‐Based Sequence Encoding Algorithm
Author(s) -
Iqbal Muhammad Javed,
Faye Ibrahima,
Said Abas MD,
Samir Brahim Belhaouari
Publication year - 2017
Publication title - Computational Intelligence
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.353
H-Index - 52
eISSN - 1467-8640
pISSN - 0824-7935
DOI - 10.1111/coin.12069
Subject(s) - protein sequencing, computer science, protein function prediction, sequence (biology), structural classification of proteins database, pattern recognition (psychology), feature selection, uniprot, algorithm, artificial intelligence, sequence alignment, benchmark (surveying), encoding (memory), set (abstract data type), peptide sequence, protein structure, biology, protein function, genetics, biochemistry, geodesy, gene, programming language, geography
Machine learning is increasingly applied in bioinformatics and computational biology to solve challenging problems that emerge in the analysis and modeling of biological data such as DNA, RNA, and protein sequences. Classifying protein sequences into existing families or superfamilies poses three major problems: selecting a suitable sequence encoding method, extracting an optimized subset of features that carries significant discriminatory information, and adopting a learning algorithm that classifies protein sequences with high accuracy. Accurate classification of protein sequences helps determine the structure and function of novel proteins. In this article, we propose a distance‐based sequence encoding algorithm that captures a sequence's statistical characteristics along with its amino acid order information. A statistical metric‐based feature selection algorithm is then applied to identify a reduced set of features that represents the original feature space. The performance of the proposed technique is validated using some of the best performing classifiers previously applied to protein sequence classification. An average classification accuracy of 92% was achieved on the yeast protein sequence data set downloaded from the benchmark UniProtKB database.
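
The abstract does not spell out the encoding itself, so the following is only an illustrative sketch of what a distance-based encoding could look like, not the authors' algorithm. It assumes that the feature vector holds the normalized frequency of each ordered amino acid pair separated by a gap of 1 to `max_gap` positions, which combines composition statistics with order information. The function name `distance_pair_features`, the `max_gap` parameter, and the normalization are all assumptions made for this example.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical index: ordered residue pair -> column position (400 pairs).
PAIR_INDEX = {pair: i for i, pair in enumerate(product(AMINO_ACIDS, repeat=2))}


def distance_pair_features(seq, max_gap=3):
    """Encode a protein sequence as gap-separated pair frequencies.

    For every gap g in 1..max_gap, count how often residue b occurs
    exactly g positions after residue a, and divide by the number of
    positions available at that gap. This is an assumed scheme, not the
    encoding defined in the paper.
    """
    seq = seq.upper()
    n_pairs = len(PAIR_INDEX)
    features = [0.0] * (n_pairs * max_gap)
    for gap in range(1, max_gap + 1):
        n_positions = len(seq) - gap
        if n_positions <= 0:
            continue  # sequence too short for this gap
        offset = (gap - 1) * n_pairs
        for i in range(n_positions):
            pair = (seq[i], seq[i + gap])
            # Skip non-standard residues such as 'X' or 'B'.
            if pair in PAIR_INDEX:
                features[offset + PAIR_INDEX[pair]] += 1.0 / n_positions
    return features


if __name__ == "__main__":
    vec = distance_pair_features("MKTAYIAKQR", max_gap=2)
    print(len(vec), sum(1 for v in vec if v > 0))
```

With `max_gap=3` this yields 1,200 features per sequence, most of them zero for short sequences, which is the kind of high-dimensional space where a subsequent feature selection step becomes useful.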
