Open Access
Functional Classification Using Phylogenomic Inference
Author(s) - D. Brown, Kimmen Sjölander
Publication year - 2006
Publication title - PLoS Computational Biology
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 2.628
H-Index - 182
eISSN - 1553-7358
pISSN - 1553-734X
DOI - 10.1371/journal.pcbi.0020077
Subject(s) - uniprot, phylogenomics, inference, computational biology, annotation, phylogenetic tree, protein function prediction, subfamily, hidden markov model, genome, biology, function (biology), genomics, context (archaeology), genbank, tree (set theory), supermatrix, computer science, artificial intelligence, evolutionary biology, gene, genetics, protein function, clade, paleontology, mathematical analysis, current algebra, mathematics, affine lie algebra, pure mathematics, algebra over a field
Phylogenomic inference of protein (or gene) function attempts to address the question, "What function does this protein perform?" in an evolutionary context. As originally outlined by Jonathan Eisen (1-3), phylogenomic inference of protein function is a multistep process involving selection of homologs, multiple sequence alignment (MSA), and phylogenetic tree construction; overlaying annotations on the tree topology; discriminating between orthologs and paralogs; and, finally, inferring the function of a protein based on the orthologs identified by this process and the annotations retrieved. Figure 1 shows an example of using annotated subfamily groupings to infer function, in a manner similar to (1). One of us, while at Celera Genomics, separately came up with a similar approach for the functional classification of the human genome (4), based on the automated identification of functional subfamilies using the SCI-PHY algorithm and the use of subfamily hidden Markov models (HMMs) to classify novel sequences (5,6). Our experiences over the past several years in developing computational pipelines for automating phylogenomic inference at the genome scale (7), and the challenges we have faced in this effort, motivate this paper.
In practice, phylogenomic inference of gene function is not often used. Far from it. The majority of novel sequences are assigned a putative function through the use of annotation transfer from the top hits in a database search. In our analysis of over 300,000 proteins in the UniProt database, only 3% of proteins with informative annotations (i.e., those not labelled as "hypothetical" or "unknown") had experimental support for their annotations; 97% were annotated using electronic evidence alone. These annotations are uploaded to GenBank, where they persist even if they are eventually determined to be in error. The systematic errors associated with this annotation protocol have been pointed out by numerous investigators over the years (8-10).
The root causes of these errors are as follows:
Gene duplication. This enables protein superfamilies to innovate novel functions on the same structural template, so that the top database hit may have a function distinct from the query.
Domain shuffling. Domain fusion and fission events add an additional layer of complexity, as a query and database hit may share only a local region of homology and thus have entirely different molecular functions and structures.
Propagation of existing errors in database annotations. This is particularly pernicious, as existing annotation errors are seldom detected and, even if detected, are not necessarily corrected.
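The contrast between top-hit annotation transfer and tree-based inference can be sketched with a toy example. The Python below is a minimal illustration only, not the authors' pipeline or the SCI-PHY algorithm: the tree, sequence names, annotations, and search scores are all invented, and the clade-majority rule is a simplified stand-in for proper ortholog/paralog discrimination. It shows how a gene duplication can make the best-scoring database hit a paralog with a different function, while the annotated members of the query's own subfamily suggest another.

```python
from collections import Counter

# Hypothetical gene tree as nested tuples; leaves are sequence names.
# The root split stands for a gene duplication giving two subfamilies.
TREE = (
    (("query", "seqA"), "seqB"),  # subfamily 1 (contains the query)
    ("seqC", "seqD"),             # subfamily 2 (paralogs of the query)
)

# Invented annotations; seqB and the query are unannotated.
ANNOTATIONS = {"seqA": "kinase", "seqC": "phosphatase", "seqD": "phosphatase"}

# Invented database-search scores of each sequence against the query.
SCORES = {"seqC": 310.0, "seqA": 295.0, "seqD": 280.0, "seqB": 150.0}


def leaves(node):
    """Return the leaf names under a node of the nested-tuple tree."""
    if isinstance(node, str):
        return [node]
    names = []
    for child in node:
        names.extend(leaves(child))
    return names


def top_hit_function(scores, annotations):
    """Transfer the annotation of the highest-scoring annotated hit."""
    for name in sorted(scores, key=scores.get, reverse=True):
        if name in annotations:
            return annotations[name]
    return None


def clade_function(tree, query, annotations):
    """Walk from the root toward the query and return the majority
    annotation of the smallest enclosing clade that has any annotation."""
    best = None
    node = tree
    while not isinstance(node, str):
        labels = [annotations[n] for n in leaves(node) if n in annotations]
        if labels:
            best = Counter(labels).most_common(1)[0][0]
        # Descend into the child subtree that contains the query.
        node = next(child for child in node if query in leaves(child))
    return best


if __name__ == "__main__":
    print("top hit:", top_hit_function(SCORES, ANNOTATIONS))
    print("clade:  ", clade_function(TREE, "query", ANNOTATIONS))
```

In this invented case the top hit is a member of the paralogous subfamily, so the two strategies disagree. A real analysis would build the tree from an alignment of homologs and would also need to guard against the domain-shuffling and error-propagation problems described above, which this sketch does not address.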
