Dictionary building via unsupervised hierarchical motif discovery in the sequence space of natural proteins
Author(s) - Rigoutsos Isidore, Floratos Aris, Ouzounis Christos, Gao Yuan, Parida Laxmi
Publication year - 1999
Publication title - Proteins: Structure, Function, and Bioinformatics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.699
H-Index - 191
eISSN - 1097-0134
pISSN - 0887-3585
DOI - 10.1002/(sici)1097-0134(19991101)37:2<264::aid-prot11>3.0.co;2-c
Subject(s) - sequence motif, motif (music), computational biology, sequence (biology), computer science, artificial intelligence, biology, genetics, physics, dna, acoustics
Using TEIRESIAS, a pattern discovery method that identifies all motifs present in any given set of protein sequences without requiring alignment or explicit enumeration of the solution space, we have explored the GenPept sequence database and built a dictionary of all sequence patterns with two or more instances. The entries of this dictionary, henceforth named seqlets, cover 98.12% of all amino acid positions in the input database and in essence provide a comprehensive finite set of descriptors for protein sequence space. As such, seqlets can be effectively used to describe almost every naturally occurring protein. In fact, seqlets can be thought of as building blocks of protein molecules that are a necessary (but not sufficient) condition for function or family equivalence memberships. Thus, seqlets can either define conserved family signatures or cut across molecular families and reveal previously undetected sequence signals deriving from functional convergence. Moreover, we show that seqlets can also capture structurally conserved motifs. The availability of a dictionary of seqlets that has been derived in such an unsupervised, hierarchical manner is generating new opportunities for addressing problems that range from reliable classification and the correlation of sequence fragments with functional categories to faster and more sensitive engines for homology searches, evolutionary studies, and protein structure prediction. Proteins 1999;37:264–277. ©1999 Wiley‐Liss, Inc.
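The two quantities the abstract relies on, a dictionary of patterns with two or more instances and the fraction of amino acid positions those patterns cover, can be illustrated with a toy sketch. This is not TEIRESIAS: as a simplifying assumption it enumerates only contiguous, fixed-length substrings, with no wildcards and no variable-length patterns. The function names `build_dictionary` and `coverage` and the parameters `k` and `min_support` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def build_dictionary(sequences, k=4, min_support=2):
    """Toy stand-in for pattern discovery (NOT TEIRESIAS).

    Collects every contiguous substring of length k and keeps those
    that occur at least min_support times across the input sequences.
    Returns a dict mapping pattern -> list of (sequence index, start).
    """
    occurrences = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            occurrences[seq[start:start + k]].append((seq_id, start))
    return {
        pattern: locs
        for pattern, locs in occurrences.items()
        if len(locs) >= min_support
    }


def coverage(sequences, dictionary):
    """Fraction of residue positions covered by at least one pattern instance."""
    covered = set()
    for pattern, locs in dictionary.items():
        for seq_id, start in locs:
            for offset in range(len(pattern)):
                covered.add((seq_id, start + offset))
    total = sum(len(seq) for seq in sequences)
    return len(covered) / total if total else 0.0


if __name__ == "__main__":
    # Made-up example strings, not real protein sequences.
    toy = ["ACDEFG", "XXCDEFYY", "MMMM"]
    patterns = build_dictionary(toy, k=4)
    print(patterns)
    print(round(coverage(toy, patterns), 3))
```

On the made-up input, only `CDEF` occurs twice, so it is the single dictionary entry, and its two instances cover 8 of the 18 positions. The coverage figure reported in the abstract is the same kind of ratio, computed over the far richer pattern set that TEIRESIAS discovers.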