Identifying DNA‐binding proteins based on multi‐features and LASSO feature selection
Author(s) -
Zhang Shengli,
Zhu Fu,
Yu Qianhao,
Zhu Xiaoyue
Publication year - 2021
Publication title -
Biopolymers
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.556
H-Index - 125
eISSN - 1097-0282
pISSN - 0006-3525
DOI - 10.1002/bip.23419
Subject(s) - cross validation , pattern recognition (psychology) , artificial intelligence , feature selection , computer science , subspace topology , lasso (programming language) , linear discriminant analysis , discriminant , machine learning , world wide web
DNA‐binding proteins play an indispensable role in maintaining and processing genetic information, but their sheer number makes them inefficient to identify with traditional experimental methods. Machine learning methods, by contrast, offer satisfactory speed and accuracy for this task. This work extracts four types of features from primary and secondary sequence information: reduced sequence and index‐vectors (RS), pseudo‐amino acid components (PseAACS), position‐specific scoring matrix‐auto cross covariance transform (PSSM‐ACCT), and position‐specific scoring matrix‐discrete wavelet transform (PSSM‐DWT). Using LASSO for dimension reduction, we experiment with combinations of the feature submodels to determine the optimal number of top‐ranked features. The selected features are then used to train three classifiers, Ensemble subspace discriminant, Ensemble bagged tree, and KNN, to predict DNA‐binding proteins. Three datasets, PDB594, PDB1075, and PDB186, are used to evaluate the proposed approach: PDB1075 and PDB594 for five‐fold cross‐validation, and PDB186 for independent testing. In five‐fold cross‐validation, the Ensemble subspace discriminant reaches accuracies of 86.98% on PDB1075 and 88.9% on PDB594. In the independent test, multi‐classifier voting achieves an accuracy of 83.33%, which suggests that the proposed methodology can predict DNA‐binding proteins effectively.
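
The LASSO-based ranking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic, the penalty `alpha` and the number of kept features `k` are arbitrary, and all function names are hypothetical. It fits a LASSO model by cyclic coordinate descent, then keeps the features with the largest absolute coefficients.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the LASSO proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimise (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


def top_ranked_features(weights, k):
    """Indices of the k largest non-zero coefficients by absolute value."""
    ranked = sorted(range(len(weights)), key=lambda j: abs(weights[j]), reverse=True)
    return [j for j in ranked[:k] if weights[j] != 0.0]


# Synthetic stand-in for a feature matrix: 120 samples, 8 features.
# Only features 0 and 3 carry signal (an assumption made for illustration).
random.seed(0)
X = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(120)]
labels = [1.0 if row[0] - row[3] > 0 else -1.0 for row in X]
mean_label = sum(labels) / len(labels)
y = [v - mean_label for v in labels]  # centre labels since no intercept is fitted

weights = lasso_coordinate_descent(X, y, alpha=0.1)
selected = top_ranked_features(weights, k=2)
print("selected feature indices:", sorted(selected))
```

In a pipeline like the one the abstract describes, the selected columns would then be passed to the downstream classifiers, and the number of kept features would be tuned by cross-validation rather than fixed in advance.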
