Assessing strategies for improved superfamily recognition

Sillitoe Ian; Dibley Mark; Bray James; Addou Sarah; Orengo Christine

Publication year - 2005

Publication title - Protein Science

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 3.353

H-Index - 175

eISSN - 1469-896X

pISSN - 0961-8368

DOI - 10.1110/ps.041056105

Subject(s) - superfamily, computational biology, computer science, biology, bioinformatics, genetics, gene

There are more than 200 completed genomes and over 1 million nonredundant sequences in public repositories. Although the structural data are sparser (∼13,000 nonredundant structures solved to date), several powerful sequence‐based methodologies now allow these structures to be mapped onto related regions in a significant proportion of genome sequences. We review a number of publicly available strategies for providing structural annotations for genome sequences, and we describe the protocol adopted to provide CATH structural annotations for completed genomes. In particular, we assess the performance of several sequence‐based protocols employing hidden Markov model (HMM) technologies for superfamily recognition, including a new approach, SAMOSA (sequence augmented models of structure alignments), that exploits multiple structural alignments from the CATH domain structure database when building the models. Using a data set of remote homologs detected by structure comparison and manually validated in CATH, a single‐seed HMM library was able to recognize 76% of the data set. Including the SAMOSA models in the HMM library showed little gain in homolog recognition, although a slight improvement in alignment quality was observed for very remote homologs. However, using an expanded 1D‐HMM library, CATH‐ISL, increased the coverage to 86%. The single‐seed HMM library has been used to annotate the protein sequences of 120 genomes from all three major kingdoms, allowing up to 70% of the genes or partial genes to be assigned to CATH superfamilies. It has also been used to recruit sequences from Swiss‐Prot and TrEMBL into CATH domain superfamilies, expanding the CATH database eightfold.
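
The coverage figures quoted above (76% and 86%) are fractions of a benchmark set of remote homologs that an HMM library recognizes. The following is a minimal sketch of how such a figure could be computed, not the authors' actual protocol. The `Hit` record layout, the `recognition_coverage` function, the E-value cutoff of 1e-3, and the toy data are all assumptions made for illustration. The superfamily labels are placeholders in CATH's dotted format.

```python
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    """One HMM match for a query sequence (hypothetical record layout)."""
    superfamily: str  # identifier of the superfamily the matched model represents
    evalue: float     # E-value reported by the HMM search tool


def recognition_coverage(
    true_superfamily: Dict[str, str],
    hits: Dict[str, List[Hit]],
    evalue_cutoff: float = 1e-3,  # illustrative threshold, not taken from the paper
) -> float:
    """Fraction of benchmark queries whose best significant hit is correct."""
    if not true_superfamily:
        raise ValueError("benchmark is empty")
    recognized = 0
    for query, expected in true_superfamily.items():
        significant = [h for h in hits.get(query, []) if h.evalue <= evalue_cutoff]
        if not significant:
            continue  # no model matched this query confidently
        best = min(significant, key=lambda h: h.evalue)
        if best.superfamily == expected:
            recognized += 1
    return recognized / len(true_superfamily)


# Toy benchmark: four remote homologs with known superfamilies (made-up data).
benchmark = {
    "d1": "3.40.50.720",
    "d2": "3.40.50.720",
    "d3": "1.10.10.10",
    "d4": "2.60.40.10",
}
single_seed = {
    "d1": [Hit("3.40.50.720", 1e-12)],
    "d2": [Hit("3.40.50.720", 0.5)],   # correct model, but too weak to count
    "d3": [Hit("1.10.10.10", 1e-5)],
    "d4": [Hit("3.40.50.720", 1e-4)],  # confident hit to the wrong superfamily
}
# An expanded library adds a model that matches d2 confidently.
expanded = dict(single_seed, d2=[Hit("3.40.50.720", 1e-6)])

print(recognition_coverage(benchmark, single_seed))  # 0.5
print(recognition_coverage(benchmark, expanded))     # 0.75
```

In this toy setup, adding models to the library raises coverage only when a previously missed query gains a significant hit to its correct superfamily, which is the kind of gain the abstract reports for the expanded library.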
