Transposable elements encoding functional proteins: pitfalls in unprocessed genomic data?
Author(s) -
Pavlíček Adam,
Clay Oliver,
Bernardi Giorgio
Publication year - 2002
Publication title -
FEBS Letters
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.593
H-Index - 257
eISSN - 1873-3468
pISSN - 0014-5793
DOI - 10.1016/s0014-5793(02)02992-7
Subject(s) - transposable element , computational biology , encoding (memory) , biology , genetics , genome , gene , neuroscience
The contribution of transposable elements (TEs), including Alus, to human coding sequences has recently been reported to be high, 4% (1.3% Alus) out of 13 799 sequences [1,2]. This is surprising, because previous examinations had revealed only very few repeats, and almost no Alus, in coding sequences [3,4,25]. Since extreme caution about input data has been suggested [5–7], we examined the database of [1] and found that many (~30%) of its TE-containing sequences or their protein products are defined as ‘hypothetical’, and 63% (421/669 sequences) are annotated as ‘predicted, without experimental evidence or records without final NCBI revision’. Such a dataset is likely to contain several sequences that remain untranscribed, and more that remain untranslated. Not even experimental validation [8], let alone computer prediction of functional genes, is foolproof: the errors in coding sequence databases such as those used in [1] may well amount to 1–2% or more. Essentially all reported coding regions derived from Alus, or containing alternatively spliced Alus, have been detected at the RNA (cDNA) level, instead of at the protein level [3,9]. In eukaryotic cells, there is a significant turnover of RNA, and several steps of quality control exist for the synthesized RNA in both nucleus and cytoplasm [10–14]. mRNAs with an aberrant 3′ end are generally retained and/or degraded at their site of transcription [15] and the majority of stable RNA polymerase II transcripts remain in the nucleus as ‘junk’ RNA, so they never reach the cytoplasm [10]. The minority of transcripts that are successfully exported from the nucleus undergo additional check(s) during their translation. For example, there are specialized degradation mechanisms for transcripts having premature stop codons or lacking terminal codons, which prevent the creation of aberrant, potentially pathogenic proteins [11,13,16].
Thus, even detection of a transcript at the mRNA (cDNA) level cannot guarantee that these mRNAs are ever translated into stable proteins. As has been summarized in the light of growing evidence [17], ‘mRNA abundance is a poor indicator of the levels of the corresponding protein’, yet ‘it is the proteome that determines cell phenotype’: the transcriptome does not faithfully represent the proteome. Furthermore, to become a viable protein, a transcript must (after its accurate translation and possible post-translational modification) resist degradation until it can serve its functional role at the site of its required action. These facts underline the importance of detection at the protein level, for elucidating whether SINEs or other repeats contribute to true coding sequences in humans or mice. The most accurate sources of proteins are 3D structure databases and direct amino acid sequencing. Out of 781 non-redundant human proteins from a 3D database or determined at the amino acid level that we extracted from [18] (mean length 404 aa; including some fragments, but neglecting all peptides shorter than 50 aa or having > 70% identity) and compared to human repeats in RepBase [19] using TFASTX [20], we found no Alu-related protein domain (the best hit has an E-value of 0.5). Twenty-eight apparently significant hits with E-values under 0.01 were detected, but mainly from protein-coding elements (DNA transposons and LINE1). When cDNAs encoding these 28 proteins were extracted and searched by RepeatMasker [21], no interspersed repeats were detected. In addition, the similarity regions that had been reported by TFASTX were also found in other vertebrate orthologs. In summary, we did not detect any repeat sequence in our dataset of 781 protein sequences.
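The screening logic described above (discard peptides shorter than 50 aa, then treat protein-versus-repeat hits with E-values under 0.01 as apparently significant and tally them by repeat class) can be illustrated with a minimal Python sketch. The record layout `Hit`, the helper names, and the example records are hypothetical and introduced only for illustration; they are not the output format of TFASTX, and the numbers are not data from the study. The 70% identity redundancy filter is omitted because it would require pairwise alignments.

```python
from collections import Counter
from typing import NamedTuple


class Hit(NamedTuple):
    """One protein-versus-repeat similarity hit (hypothetical record layout)."""
    protein_id: str
    protein_length: int   # protein length in amino acids
    repeat_class: str     # e.g. "DNA transposon", "LINE1", "Alu"
    evalue: float         # E-value reported by the similarity search


MIN_LENGTH = 50    # peptides shorter than 50 aa are neglected, as in the text
MAX_EVALUE = 0.01  # hits with E-values under 0.01 count as apparently significant


def significant_hits(hits):
    """Keep hits on sufficiently long proteins whose E-value is under the cutoff."""
    return [h for h in hits
            if h.protein_length >= MIN_LENGTH and h.evalue < MAX_EVALUE]


def count_by_class(hits):
    """Tally the retained hits by repeat class."""
    return Counter(h.repeat_class for h in hits)


# Made-up illustrative records, NOT data from the study.
example_hits = [
    Hit("protA", 404, "DNA transposon", 1e-5),
    Hit("protB", 120, "LINE1", 0.004),
    Hit("protC", 310, "Alu", 0.5),      # E-value too high, dropped
    Hit("protD", 35, "LINE1", 1e-8),    # peptide too short, dropped
]

kept = significant_hits(example_hits)
print(count_by_class(kept))
```

With these invented records only the first two hits pass both filters, so no Alu hit is retained. In the analysis described in the text, such apparently significant hits were additionally checked by running RepeatMasker on the corresponding cDNAs, a step this sketch does not attempt to reproduce.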
In 1994, it was pointed out [5] that a discovery of a translated Alu element(s) in a functional part of a functional human protein ‘would represent the first report of its kind and would have important evolutionary implications’. Despite the 7 years since this challenge, confirmed cases of Alu-containing sequences that encode a functional protein still remain extremely elusive. The paucity of documented examples is a good indication that proteins are unlikely to utilize domains encoded by Alus for functional ends. The reluctance to accept this view is understandable, given the huge proportion of interspersed repeats in the human genome (around 45% [4]): in principle, at least some of them might have been recruited for functional purposes at the protein level. The great majority of previously detected repeat-derived coding sequences comes, however, from protein-coding repeats, and particularly from DNA transposons [4,25]. LINEs are less common in coding sequences and only a few Alus had been identified prior to the analysis of [1,2]. Since SINEs are derived from RNA genes without protein-coding capacity, the lack of Alu-encoded proteins is consistent with the notion that new domains arise from existing sequences encoding functional proteins (for example, by exon shuffling) and that the de novo creation of coding sequences from non-coding DNA is rare. Indeed, in the words of Graur and Li [22], ‘True novelty is almost unheard of during evolution; rather, preexisting genes and parts of genes [presumably encoding functional proteins or their domains] are transformed to produce new functions, and molecular systems are combined to give rise to new, often more complex systems. ... We may ... deduce that [such] molecular tinkering is most probably the paradigm of molecular evolution.’ Such a notion appears to contrast with the recent view of coding Alus presented by one of these authors [1].
The relative frequencies for the TE classes found by Nekrutenko and Li [1] are similar to genome-wide repeat proportions, i.e. to expectations under random sampling of sequences or random errors in predicting exons. In contrast, our findings are in good agreement with previous reports [4,25] and the above arguments that repeat-derived protein-coding sequences, especially those corresponding to Alus and other SINEs, should be rare. Indeed, Alus are derived from 7SL RNA, part of the signal recognition particle on ribosomes [23], and the strong selection for such 7SL-like secondary
