Functional Classification Using Phylogenomic Inference

Open Access

Author(s) - D. Brown, Kimmen Sjölander

Publication year - 2006

Publication title - PLoS Computational Biology

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 2.628

H-Index - 182

eISSN - 1553-7358

pISSN - 1553-734X

DOI - 10.1371/journal.pcbi.0020077

Subject(s) - uniprot, phylogenomics, inference, computational biology, annotation, phylogenetic tree, protein function prediction, subfamily, hidden markov model, genome, biology, function (biology), genomics, context (archaeology), genbank, tree (set theory), supermatrix, computer science, artificial intelligence, evolutionary biology, gene, genetics, protein function, clade, paleontology, mathematical analysis, current algebra, mathematics, affine lie algebra, pure mathematics, algebra over a field

Phylogenomic inference of protein (or gene) function attempts to address the question, "What function does this protein perform?" in an evolutionary context. As originally outlined by Jonathan Eisen (1-3), phylogenomic inference of protein function is a multistep process: selecting homologs, constructing a multiple sequence alignment (MSA), and building a phylogenetic tree; overlaying annotations on the tree topology; discriminating between orthologs and paralogs; and, finally, inferring the function of a protein from the orthologs identified by this process and the annotations retrieved. Figure 1 shows an example of using annotated subfamily groupings to infer function, in a manner similar to (1).

One of us, while at Celera Genomics, independently developed a similar approach for the functional classification of the human genome (4), based on the automated identification of functional subfamilies using the SCI-PHY algorithm and the use of subfamily hidden Markov models (HMMs) to classify novel sequences (5,6). Our experiences over the past several years in developing computational pipelines for automating phylogenomic inference at the genome scale (7), and the challenges we have faced in this effort, motivate this paper.

In practice, phylogenomic inference of gene function is not often used. Far from it. The majority of novel sequences are assigned a putative function by transferring the annotation of the top hits in a database search. In our analysis of over 300,000 proteins in the UniProt database, only 3% of proteins with informative annotations (i.e., those not labelled as "hypothetical" or "unknown") had experimental support for their annotations; 97% were annotated using electronic evidence alone. These annotations are uploaded to GenBank, where they persist even if they are eventually determined to be in error. The systematic errors associated with this annotation protocol have been pointed out by numerous investigators over the years (8-10).
The root causes of these errors are these:

Gene duplication. This enables protein superfamilies to innovate novel functions on the same structural template, so that the top database hit may have a function distinct from the query.

Domain shuffling. Domain fusion and fission events add an additional layer of complexity, as a query and database hit may share only a local region of homology and thus have entirely different molecular functions and structures.

Propagation of existing errors in database annotations. This is particularly pernicious, as existing annotation errors are seldom detected and, even if detected, are not necessarily corrected.
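
To make the contrast between top-hit annotation transfer and tree-based inference concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' pipeline and not the SCI-PHY algorithm: the tree, the sequence identifiers, the annotations, and the search scores are all hypothetical toy data, and the tree-based rule (take the smallest clade around the query that has annotated members, and abstain if they disagree) is a deliberately simplified stand-in for proper ortholog/paralog discrimination.

```python
# Hypothetical toy tree as nested tuples; leaves are sequence identifiers.
# "query" groups with seqA and seqB; seqC and seqD form a paralogous clade.
TREE = ((("query", "seqA"), "seqB"), ("seqC", "seqD"))

# Hypothetical functional annotations for the database sequences.
ANNOTATIONS = {
    "seqA": "kinase",
    "seqB": "kinase",
    "seqC": "phosphatase",
    "seqD": "phosphatase",
}

# Hypothetical database-search hits as (identifier, score) pairs.
# The paralog seqC happens to score highest.
HITS = [("seqC", 250.0), ("seqA", 240.0), ("seqB", 200.0)]


def leaves(node):
    """Return all leaf identifiers under a node."""
    if isinstance(node, str):
        return [node]
    found = []
    for child in node:
        found.extend(leaves(child))
    return found


def path_to(node, target):
    """Return the chain of nodes from this node down to the target leaf, or None."""
    if isinstance(node, str):
        return [node] if node == target else None
    for child in node:
        sub = path_to(child, target)
        if sub is not None:
            return [node] + sub
    return None


def top_hit_transfer(hits, annotations):
    """Copy the annotation of the best-scoring hit, ignoring the tree."""
    best_id = max(hits, key=lambda hit: hit[1])[0]
    return annotations.get(best_id)


def infer_from_tree(tree, annotations, query):
    """Use the smallest clade containing the query that has annotated members.

    Returns the shared annotation, or None if the members disagree or
    no annotated relative exists (simplified rule, an assumption of this sketch).
    """
    path = path_to(tree, query)
    if path is None:
        raise ValueError(f"{query!r} is not a leaf of the tree")
    # Walk outward from the innermost clade that contains the query.
    for clade in reversed(path[:-1]):
        labels = {
            annotations[leaf]
            for leaf in leaves(clade)
            if leaf != query and annotations.get(leaf)
        }
        if labels:
            return labels.pop() if len(labels) == 1 else None
    return None


top_hit_call = top_hit_transfer(HITS, ANNOTATIONS)
tree_call = infer_from_tree(TREE, ANNOTATIONS, "query")
print("top-hit transfer:", top_hit_call)
print("tree-based call: ", tree_call)
```

In this toy setup the two strategies disagree: the top hit is a member of the paralogous clade, so top-hit transfer reports its function, while the tree-based rule reports the function shared by the query's nearest annotated relatives. A real analysis would also need to handle domain architecture and annotation provenance, which this sketch ignores.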
