Will my protein crystallize? A sequence‐based predictor
Author(s) - Smialowski Pawel, Schmidt Thorsten, Cox Jürgen, Kirschner Andreas, Frishman Dmitrij
Publication year - 2005
Publication title - Proteins: Structure, Function, and Bioinformatics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.699
H-Index - 191
eISSN - 1097-0134
pISSN - 0887-3585
DOI - 10.1002/prot.20789
Subject(s) - structural genomics, naive bayes classifier, support vector machine, artificial intelligence, classifier (uml), computer science, feature selection, protein sequencing, machine learning, cross validation, nuclear magnetic resonance spectroscopy, algorithm, protein methods, protein structure, peptide sequence, pattern recognition (psychology), physics, biology, nuclear magnetic resonance, genetics, gene
We propose a machine-learning approach to sequence-based prediction of protein crystallizability in which we exploit subtle differences between proteins whose structures were solved by X-ray analysis [or by both X-ray and nuclear magnetic resonance (NMR) spectroscopy] and proteins whose structures were solved by NMR spectroscopy alone. Because the NMR technique is usually applied to relatively small proteins, the sequence length distributions of the X-ray and NMR datasets were adjusted to avoid predictions biased by protein size. As the feature space for classification, we used frequencies of mono-, di-, and tripeptides represented by the original 20-letter amino acid alphabet as well as by several reduced alphabets in which amino acids were grouped by their physicochemical and structural properties. The classification algorithm was constructed as a two-layered structure in which the outputs of primary support vector machine classifiers operating on peptide frequencies were combined by a second-level Naive Bayes classifier. Because cost-sensitive metamethods are applied, our method is able to handle real datasets with unbalanced class representation. An overall prediction accuracy of 67% [65% on the positive (crystallizable) and 69% on the negative (noncrystallizable) class] was achieved in a 10-fold cross-validation experiment, indicating that the proposed algorithm may be a valuable tool for more efficient target selection in structural genomics. A Web server for protein crystallizability prediction called SECRET is available at http://webclu.bio.wzw.tum.de:8080/secret.
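The feature space described in the abstract (frequencies of mono-, di-, and tripeptides over the full and reduced amino acid alphabets) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the six-group reduced alphabet, the group symbols, and the function names `reduce_sequence` and `peptide_frequencies` are assumptions made for illustration only, since the abstract does not list the actual groupings. The sketch covers feature extraction only; the support vector machine and Naive Bayes layers are not shown.

```python
from collections import Counter
from itertools import product

FULL_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical reduced alphabet (an assumption, not the grouping used in the paper):
# residues grouped loosely by physicochemical character.
REDUCED_GROUPS = {
    "h": "AVLIMC",  # aliphatic / hydrophobic
    "a": "FWYH",    # aromatic
    "p": "STNQ",    # polar, uncharged
    "+": "KR",      # positively charged
    "-": "DE",      # negatively charged
    "s": "GP",      # special backbone conformations
}
TO_REDUCED = {aa: group for group, members in REDUCED_GROUPS.items() for aa in members}


def reduce_sequence(seq, mapping=TO_REDUCED):
    """Translate a sequence into the reduced alphabet, skipping unknown residues."""
    return "".join(mapping[aa] for aa in seq.upper() if aa in mapping)


def peptide_frequencies(seq, k, alphabet):
    """Relative frequencies of all overlapping k-peptides over `alphabet`.

    Every possible k-peptide gets an entry, so feature vectors from
    different sequences have the same length and ordering.
    """
    allowed = set(alphabet)
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    windows = [w for w in windows if set(w) <= allowed]  # drop windows with unknown symbols
    counts = Counter(windows)
    total = len(windows)
    keys = ("".join(p) for p in product(alphabet, repeat=k))
    return {key: (counts[key] / total if total else 0.0) for key in keys}


# Usage on a made-up toy sequence
seq = "MKVLAAGE"
mono = peptide_frequencies(seq, 1, FULL_ALPHABET)              # 20 features
reduced = reduce_sequence(seq)                                 # "h+hhhhs-"
di = peptide_frequencies(reduced, 2, "".join(REDUCED_GROUPS))  # 6**2 = 36 features
```

The appeal of a reduced alphabet is visible in the feature counts: tripeptides over 20 letters give 8000 features, whereas tripeptides over a six-letter alphabet give only 216, which is easier to estimate reliably from short sequences.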