Open Access
From Bioinformatics to Computational Biology
Author(s) -
Jean-Michel Claverie
Publication year - 2000
Publication title -
Genome Research
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 9.556
H-Index - 297
eISSN - 1549-5469
pISSN - 1088-9051
DOI - 10.1101/gr.155500
Subject(s) - suspect, biology, sequence (biology), computational biology, artificial intelligence, bioinformatics, computer science, genetics, sociology, criminology
It is quite ironic that the uncertainty about the number of human genes (28,000–120,000) (Ewing and Green 2000; Liang et al. 2000; Roest Crollius et al. 2000) appears to increase as the determination of the human genome sequence is nearing completion. I shall contend here that this paradox reveals deep epistemological problems, and that “bioinformatics” (a term coined in 1990 to define the use of computers in sequence analysis) is no longer developing in directions relevant to biology. After the pioneers who established the basic concepts of molecular sequence analysis (Fitch and Margoliash 1967; Needleman and Wunsch 1970; Chou and Fasman 1974), most computational biologists of my generation (the second one) embarked on their journey into the emerging discipline with the ambition to turn it into the bona fide theoretical branch of molecular biology. Having a physicist’s background, I suspect that many of us had the vision of establishing bioinformatics in a leadership role over experimental biology, similar to the supremacy that theoretical physics enjoys over experimental physics. Somewhere along the line, it seems that bioinformatics lost this ambition and became sidetracked onto what physicists would call a “phenomenological” pathway.

Let us follow the example of particle physics a little longer. There, theoretical research has two phases (which, in fact, run in parallel). In the first, so-called phenomenological phase, a large number of physical events are recorded in huge raw databases, classified into separate groups based on statistical regularities, and then used to identify the most recurrent objects. Optimal database design, fast classification/clustering algorithms, and data-mining software are the main areas of development here. The level of knowledge gained from this phase is, for instance, that objects A and B often appear together except when C is around, or when parameter X is lower than a certain threshold; it is mostly statistical in nature. The parallel with the current state of bioinformatics is clear. However, theoretical physics also has a subsequent, totally different phase, aiming at discovering the few basic rules (e.g., E = mc²) that underlie the relationships between the objects and their individual properties, and thus finally explain the statistical distributions of the events recorded in the databases. Once known, these rules considerably simplify the description of the database content and, more importantly, have a predictive power: the realm of the theory may encompass objects or events that have not been observed previously. This part of the theoretical endeavor is entirely missing in current bioinformatics. As a consequence, we are still not able to agree on the number of human genes despite having the complete sequence of the human genome at hand.

Identifying precisely the 5′ and 3′ boundaries of genes (the transcription unit) in metazoan genomes, as well as the correct sequences of the resulting mRNAs (“exon parsing”), has been a major challenge of bioinformatics for years. Yet the performance of current programs is still totally insufficient for reliable automated annotation (Claverie 1997; Ashburner 2000). It is interesting to recapitulate quickly the research in this area to illustrate the essential limitation plaguing modern bioinformatics. Encoding a protein imposes a variety of constraints on nucleotide sequences that do not apply to noncoding regions of the genome.
These constraints induce statistical biases of various kinds, the most discriminant of which was soon recognized to be the distribution of six-nucleotide-long “words,” or hexamers (Claverie and Bougueleret 1986; Fickett and Tung 1992). Initial gene-parsing methods were thus simply based on word-frequency computation, eventually combined with the detection of splicing consensus motifs. The next generation of software implemented the same basic principles in a simulated neural-network architecture (Uberbacher and Mural 1991). Finally, the latest generation of software, based on hidden Markov models, adds a further refinement by computing the likelihood of the predicted gene architectures (e.g., favoring human genes with an average of seven coding exons, each about 150 nucleotides long) (Kulp et al. 1996; Burge and Karlin 1997). These ab initio methods are used in conjunction with a search for sequence similarity with previously characterized genes or expressed sequence tags (ESTs). Sadly, it is often claimed that matching cDNAs back to genomic sequences is the best gene-identification protocol, thereby admitting that the best way to find genes is to look them up in a previously established catalog! Thus, the two main principles behind state-of-the-art gene prediction software are (1) common statistical regularities and (2) plain sequence similarity.
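To make the first of these principles concrete, the following is a minimal sketch (not from the original article or the cited programs) of the hexamer word-frequency idea behind early coding-region detection: 6-mer frequencies are estimated from coding and noncoding training sequences, and a candidate window is scored by the summed log-likelihood ratio of its hexamers. The training sequences and function names below are illustrative placeholders, not the actual tables or software cited above.

import math
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def hexamer_frequencies(sequences):
    # Count overlapping 6-mers and apply add-one smoothing so every possible
    # hexamer has a nonzero frequency.
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - 5):
            counts[seq[i:i + 6]] += 1
    total = sum(counts.values()) + 4 ** 6  # smoothing over all 4096 hexamers
    return {"".join(h): (counts["".join(h)] + 1) / total
            for h in product(ALPHABET, repeat=6)}

def coding_score(window, coding_freq, noncoding_freq):
    # Summed log-likelihood ratio over overlapping hexamers; positive values
    # indicate a coding-like word composition.
    return sum(math.log(coding_freq[window[i:i + 6]] / noncoding_freq[window[i:i + 6]])
               for i in range(len(window) - 5))

if __name__ == "__main__":
    # Toy stand-ins for curated coding and noncoding training sets (purely illustrative).
    coding_train = ["ATGGCCGAGGAGCTGGTGAAGGCCGAGGCC", "ATGAAGCTGCTGGCCGAGGTGCGCAAGGAG"]
    noncoding_train = ["TTTATATATTTAAATTTTATATAATTTAAA", "AATATTTTAAATATATATTTTAAATTATAT"]
    coding_freq = hexamer_frequencies(coding_train)
    noncoding_freq = hexamer_frequencies(noncoding_train)
    print(coding_score("ATGGAGGCCGTGAAGCTGGCC", coding_freq, noncoding_freq))

In this simplified form, the per-hexamer log-ratio is exactly the kind of purely statistical signal discussed above; later neural-network and hidden-Markov-model gene finders fold the same word statistics into more elaborate scoring schemes rather than replacing them with an underlying theory.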
