Open Access
The characteristics of human genes: analysis of human chromosome 22
Author(s) -
Dunham Ian,
Beare David M.,
Collins John E.
Publication year - 2003
Publication title -
Comparative and Functional Genomics
Language(s) - English
Resource type - Journals
eISSN - 1532-6268
pISSN - 1531-6912
DOI - 10.1002/cfg.335
Subject(s) - human genome, gene, computational biology, chromosome, genetics, computer science, chromosome analysis, biology, genome, karyotype
On 14 April 2003 Homo sapiens became the first species on earth to finish reading its own set of instructions. However, although we have read our code, the problem that now faces us is to understand what we have read. The first task towards this understanding is to establish the full catalogue of human genes based on the genome sequence. Although evidence is accumulating that questions our assumptions about what constitutes a gene, and it seems possible that we may have to broaden our horizons to include various classes of non-protein-coding RNAs, a first-generation gene catalogue must inevitably be focused on protein-coding genes. This layering of information onto sequence is widely called ‘annotation’, and we term the specific process of describing gene structures on the genomic sequence ‘gene annotation’. Although ultimately the genome sequence should be accompanied by an information-rich annotation describing many aspects of structure and function, for the moment accurately describing gene structures remains a considerable task. Despite progress in methods for gene prediction, and the availability of genome sequence from other mammals for comparative analysis, deriving the gene catalogue by computation is still an imperfect art (Guigo et al., 2000). On the other hand, curation of gene structures supported by experimental data, either from the cDNA and EST databases or, more unusually, from de novo cloning and sequencing, is labour-intensive and has not yet been applied to the full genome. A particular problem is that most cDNA libraries are derived from total cellular RNA and contain a high proportion of unprocessed and partially processed RNA species, confounding the identification of intron–exon junctions from ESTs or cDNA sequences (Bashiardes and Lovett, 2000). Thus, the protein-coding gene catalogue is far from complete.
What we have instead is a series of attempts to approximate or estimate what the gene catalogue looks like, based on applying the currently favoured gene-finding paradigms to the available genome sequence (International Human Genome Sequencing Consortium, 2001; Das et al., 2001; Davuluri et al., 2001; Ewing and Green, 2000; Flicek et al., 2003; Guigo et al., 2003; Liang et al., 2000; Roest Crollius et al., 2000; Shoemaker et al., 2001; Wright et al., 2001). We have taken an alternative approach by concentrating on a single contiguous segment, representing 1% of the human genome, and attempting to produce a highly curated gene annotation, supported by expressed sequences from the databases and experimental confirmation of gene structures