Using genetic algorithms to select most predictive protein features




Author(s) - Kernytsky Andrew, Rost Burkhard

Publication year - 2009

Publication title - Proteins: Structure, Function, and Bioinformatics

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 1.699

H-Index - 191

eISSN - 1097-0134

pISSN - 0887-3585

DOI - 10.1002/prot.22211

Subject(s) - feature vector, encode, computer science, artificial intelligence, sequence (biology), feature (linguistics), set (abstract data type), algorithm, pattern recognition (psychology), machine learning, biology, genetics, gene, linguistics, philosophy, programming language

Many important characteristics of proteins, such as biochemical activity and subcellular localization, present a challenge to machine-learning methods: it is often difficult to encode the appropriate input features at the residue level for the purpose of making a prediction for the entire protein. The problem is usually that the biophysics of the connection between a machine-learning method's input (sequence feature) and its output (observed phenomenon to be predicted) remains unknown; in other words, we may only know that a certain protein is an enzyme (output) without knowing which region may contain the active site residues (input). The goal then becomes to dissect a protein into a vast set of sequence-derived features and to correlate those features with the desired output. We introduce a framework that begins with a set of global sequence features and then vastly expands the feature space by generically encoding the coexistence of residue-based features. It is this combination of individual features, that is, the step from the fractions of serine and of buried residues (input space 20 + 2) to the fraction of buried serine (input space 20 × 2), that implicitly shifts the search space from global feature inputs to features that can capture very local evidence, such as the individual residues of a catalytic triad. The vast feature space created is explored by a genetic algorithm (GA) paired with neural networks and support vector machines. We find that the GA is critical for selecting combinations of features that are neither too general, resulting in poor performance, nor too specific, leading to overtraining. The final framework manages to effectively sample a feature space that is far too large for exhaustive enumeration. We demonstrate the power of the concept by applying it to the prediction of protein enzymatic activity. Proteins 2009. © 2008 Wiley-Liss, Inc.
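The step from 20 + 2 global inputs to 20 × 2 combined inputs, and a GA search over feature subsets, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The two-state burial annotation, the function names, the GA settings (truncation selection, one-point crossover, bit-flip mutation) and the toy fitness function are all assumptions made for illustration. In the framework described above, fitness would instead come from the performance of a neural network or support vector machine trained on the selected features.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
STATES = ("buried", "exposed")  # assumed two-state per-residue annotation

def global_features(seq, states):
    """20 + 2 inputs: fraction of each residue type, fraction of each state."""
    n = len(seq)
    feats = {aa: seq.count(aa) / n for aa in AMINO_ACIDS}
    feats.update({s: states.count(s) / n for s in STATES})
    return feats

def combined_features(seq, states):
    """20 x 2 inputs: fraction of residues that are type `aa` AND in state `s`."""
    n = len(seq)
    pairs = list(zip(seq, states))
    return {(aa, s): pairs.count((aa, s)) / n
            for aa in AMINO_ACIDS for s in STATES}

def select_features(fitness, n_features, pop_size=20, generations=30,
                    p_mut=0.05, seed=0):
    """Toy GA over boolean masks (True = feature is used); needs n_features >= 2."""
    rng = random.Random(seed)
    pop = [[rng.random() < 0.5 for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection: the fitter half survives unchanged (elitism).
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)      # one-point crossover
            child = a[:cut] + b[cut:]
            child = [(not bit) if rng.random() < p_mut else bit
                     for bit in child]              # bit-flip mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# Hypothetical four-residue protein with a made-up burial annotation.
seq = "SSAK"
states = ["buried", "exposed", "buried", "buried"]
g = global_features(seq, states)      # serine and buried fractions, separately
c = combined_features(seq, states)    # buried-serine fraction, jointly

# Toy fitness (assumption): reward features 0 and 3, penalize every other one.
def toy_fitness(mask):
    return sum(1 if i in (0, 3) else -1 for i, used in enumerate(mask) if used)

best = select_features(toy_fitness, n_features=6)
```

In this example the global features record that half the residues are serine and three quarters are buried, while only the combined feature `("S", "buried")` records that exactly one residue in four is a buried serine. The mask returned by `select_features` marks which features a downstream predictor would be given.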
