Assessing strategies for improved superfamily recognition
Author(s) - Sillitoe Ian, Dibley Mark, Bray James, Addou Sarah, Orengo Christine
Publication year - 2005
Publication title - Protein Science
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 3.353
H-Index - 175
eISSN - 1469-896X
pISSN - 0961-8368
DOI - 10.1110/ps.041056105
Subject(s) - superfamily, computational biology, computer science, biology, bioinformatics, genetics, gene
There are more than 200 completed genomes and over 1 million nonredundant sequences in public repositories. Although the structural data are more sparse (∼13,000 nonredundant structures solved to date), several powerful sequence‐based methodologies now allow these structures to be mapped onto related regions in a significant proportion of genome sequences. We review a number of publicly available strategies for providing structural annotations for genome sequences, and we describe the protocol adopted to provide CATH structural annotations for completed genomes. In particular, we assess the performance of several sequence‐based protocols employing Hidden Markov model (HMM) technologies for superfamily recognition, including a new approach (SAMOSA [sequence augmented models of structure alignments]) that exploits multiple structural alignments from the CATH domain structure database when building the models. Using a data set of remote homologs detected by structure comparison and manually validated in CATH, a single‐seed HMM library was able to recognize 76% of the data set. Including the SAMOSA models in the HMM library showed little gain in homolog recognition, although a slight improvement in alignment quality was observed for very remote homologs. However, using an expanded 1D‐HMM library, CATH‐ISL increased the coverage to 86%. The single‐seed HMM library has been used to annotate the protein sequences of 120 genomes from all three major kingdoms, allowing up to 70% of the genes or partial genes to be assigned to CATH superfamilies. It has also been used to recruit sequences from Swiss‐Prot and TrEMBL into CATH domain superfamilies, expanding the CATH database eightfold.