Dictionary of recurrent domains in protein structures
Author(s) - Liisa Holm, Chris Sander
Publication year - 1998
Publication title - Proteins: Structure, Function, and Bioinformatics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.699
H-Index - 191
eISSN - 1097-0134
pISSN - 0887-3585
DOI - 10.1002/(sici)1097-0134(19981001)33:1<88::aid-prot8>3.0.co;2-h
Subject(s) - computer science, domain (mathematical analysis), set (abstract data type), compact space, cluster analysis, structural classification of proteins database, protein structure, protein domain, sequence (biology), identification (biology), basis (linear algebra), pattern recognition (psychology), data mining, artificial intelligence, mathematics, biology, pure mathematics, geometry, genetics, mathematical analysis, biochemistry, botany, gene, programming language
The rapid growth in the number of experimentally determined three-dimensional protein structures has sharpened the need for comprehensive and up-to-date surveys of known structures. Classic work on protein structure classification has made it clear that a structural survey is best carried out at the level of domains, i.e., substructures that recur in evolution as functional units in different protein contexts. We present a method for automated domain identification from protein structure atomic coordinates based on quantitative measures of compactness and, as the new element, recurrence. Compactness criteria are used to recursively divide a protein into a series of successively smaller substructures. Recurrence criteria are used to select an optimal size level of these substructures, so that many of the chosen substructures are common to different proteins at a high level of statistical significance. The joint application of these criteria automatically yields consistent domain definitions between remote homologs, a result difficult to achieve using compactness criteria alone. The method is applied to a representative set of 1,137 sequence-unique protein families covering 6,500 known structures. Clustering of the resulting set of domains (substructures) yields 594 distinct fold classes (types of substructures). The Dali Domain Dictionary (http://www.embl-ebi.ac.uk/dali/) provides not only a global structural classification but also a comprehensive description of families of protein sequences grouped around representative proteins of known structure. The classification will be continuously updated and can serve as a basis for improving our understanding of protein evolution and function and for evolving optimal strategies to complete the map of all natural protein structures.
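The two-stage idea in the abstract (recursive division by compactness, then choice of a size level by recurrence) can be illustrated with a toy Python sketch. It is not the authors' algorithm. It assumes, purely for illustration, that substructures are contiguous residue ranges, that compactness can be approximated by minimizing contacts across a single cut, and that recurrence is supplied as a plain score per segment; the actual method works on atomic coordinates and statistical significance. The names `split_tree`, `select_domains`, and `recurrence` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Segment = Tuple[int, int]  # half-open residue range [start, end)


@dataclass
class Node:
    seg: Segment
    children: List["Node"] = field(default_factory=list)


def contacts_across(contact: List[List[int]], seg: Segment, cut: int) -> int:
    """Count contacts between residues before and after `cut` within `seg`."""
    start, end = seg
    return sum(contact[i][j] for i in range(start, cut) for j in range(cut, end))


def split_tree(contact: List[List[int]], seg: Segment, min_size: int = 2) -> Node:
    """Recursively bisect `seg` at the cut with the fewest crossing contacts
    (a stand-in for a compactness criterion, assumed for this sketch)."""
    start, end = seg
    node = Node(seg)
    if end - start < 2 * min_size:
        return node  # too small to divide further
    cuts = range(start + min_size, end - min_size + 1)
    best = min(cuts, key=lambda c: contacts_across(contact, seg, c))
    node.children = [
        split_tree(contact, (start, best), min_size),
        split_tree(contact, (best, end), min_size),
    ]
    return node


def select_domains(
    node: Node, recurrence: Callable[[Segment], float]
) -> Tuple[float, List[Segment]]:
    """Choose the level of the tree with the highest total recurrence score:
    keep a node whole unless its children together score higher."""
    own = recurrence(node.seg)
    if not node.children:
        return own, [node.seg]
    child_score, child_segs = 0.0, []
    for child in node.children:
        score, segs = select_domains(child, recurrence)
        child_score += score
        child_segs += segs
    if own >= child_score:
        return own, [node.seg]
    return child_score, child_segs


# Toy input: 8 residues forming two internally connected blocks, 0-3 and 4-7.
n = 8
contact = [[1 if i != j and (i < 4) == (j < 4) else 0 for j in range(n)]
           for i in range(n)]
tree = split_tree(contact, (0, n))

# Hypothetical recurrence scores, e.g. how often a substructure is seen elsewhere.
scores = {(0, 4): 5.0, (4, 8): 3.0, (0, 8): 1.0}
score, domains = select_domains(tree, lambda seg: scores.get(seg, 0.0))
print(score, domains)
```

In this sketch the compactness stage only proposes a hierarchy of candidate substructures; the recurrence stage decides where to stop dividing, which is the role the abstract assigns to the recurrence criteria.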