Identifying DNA‐binding proteins based on multi‐features and LASSO feature selection

Author(s) - Zhang Shengli, Zhu Fu, Yu Qianhao, Zhu Xiaoyue

Publication year - 2021

Publication title - Biopolymers

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.556

H-Index - 125

eISSN - 1097-0282

pISSN - 0006-3525

DOI - 10.1002/bip.23419

Subject(s) - cross validation, pattern recognition (psychology), artificial intelligence, feature selection, computer science, subspace topology, lasso (programming language), linear discriminant analysis, discriminant, machine learning, world wide web

DNA‐binding proteins are indispensable to the maintenance and processing of genetic information, but their sheer number makes identification by traditional experimental methods inefficient. By contrast, machine learning methods offer satisfactory speed and accuracy when used to study these molecules. This work extracts four different feature sets from primary and secondary sequence information: Reduced sequence and index‐vectors (RS), Pseudo‐amino acid components (PseAACS), Position‐specific scoring matrix‐Auto Cross Covariance Transform (PSSM‐ACCT), and Position‐specific scoring matrix‐Discrete Wavelet Transform (PSSM‐DWT). Using the LASSO dimension reduction method, we experiment with combinations of the feature submodels to obtain an optimized number of top‐ranked features. These features are then used to train three classifiers, Ensemble subspace discriminant, Ensemble bagged tree, and KNN, to predict DNA‐binding proteins. Three datasets, PDB594, PDB1075, and PDB186, are used to evaluate the proposed approach: PDB1075 and PDB594 for five‐fold cross‐validation, and PDB186 for the independent experiment. In the five‐fold cross‐validation, the Ensemble subspace discriminant reaches accuracies of 86.98% on PDB1075 and 88.9% on PDB594. In the independent experiment, multi‐classifier voting reaches an accuracy of 83.33%, which suggests that the proposed methodology can predict DNA‐binding proteins effectively.
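
The selection and voting steps described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not state the LASSO solver, the penalty value, or the voting rule, so the coordinate-descent solver, the penalty `alpha=0.2`, the simple majority vote, the synthetic data, and all function names below are assumptions made for the example.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the LASSO proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise (1/2n)*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent.

    Assumes the columns of X are roughly standardised and y is roughly centred,
    so no intercept is fitted (a simplification for this sketch).
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


def top_ranked_features(weights, k):
    """Indices of the k non-zero coefficients with the largest magnitude."""
    nonzero = [j for j, wj in enumerate(weights) if wj != 0.0]
    return sorted(nonzero, key=lambda j: abs(weights[j]), reverse=True)[:k]


def majority_vote(predictions):
    """Combine 0/1 label lists from several classifiers by simple majority."""
    n_classifiers = len(predictions)
    return [1 if 2 * sum(votes) > n_classifiers else 0 for votes in zip(*predictions)]


# Synthetic stand-in for a combined feature matrix (hypothetical data):
# 300 samples, 10 features, of which only features 0 and 3 carry signal.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(300)]
y = [1.0 if 2.0 * row[0] - 1.5 * row[3] + 0.1 * rng.gauss(0.0, 1.0) > 0 else -1.0
     for row in X]

weights = lasso_coordinate_descent(X, y, alpha=0.2)
selected = top_ranked_features(weights, k=2)
print("selected feature indices:", selected)

# Three hypothetical classifiers voting on three test proteins.
votes = majority_vote([[1, 0, 1], [1, 1, 0], [0, 0, 1]])
print("voted labels:", votes)
```

In practice the selected indices would be used to subset the real feature matrix before training each classifier, and the penalty and the number of retained features would be tuned, for example inside the five‐fold cross‐validation loop.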
