Using reads to annotate the genome: influence of length, background distribution, and sequence errors on prediction capacity


Open Access


Author(s) - Nicolas Philippe, Anthony Boureux, Laurent Bréhélin, Jorma Tarhio, Thérèse Commes, Éric Rivals

Publication year - 2009

Publication title - Nucleic Acids Research

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 9.008

H-Index - 537

eISSN - 1362-4954

pISSN - 0305-1048

DOI - 10.1093/nar/gkp492

Subject(s) - biology, genome, computational biology, false positive paradox, sequence (biology), reference genome, genetics, epigenomics, genomics, International HapMap Project, k-mer, human genome, gene, computer science, artificial intelligence, gene expression, DNA methylation

Ultra high-throughput sequencing is used to analyse the transcriptome or interactome at unprecedented depth on a genome-wide scale. These techniques yield short sequence reads that are then mapped on a genome sequence to predict putatively transcribed or protein-interacting regions. We argue that factors such as background distribution, sequence errors, and read length impact on the prediction capacity of sequence census experiments. Here we suggest a computational approach to measure these factors and analyse their influence on both transcriptomic and epigenomic assays. This investigation provides new clues on both methodological and biological issues. For instance, by analysing chromatin immunoprecipitation read sets, we estimate that 4.6% of reads are affected by SNPs. We show that, although the nucleotide error probability is low, it significantly increases with the position in the sequence. Choosing a read length above 19 bp practically eliminates the risk of finding irrelevant positions, while above 20 bp the number of uniquely mapped reads decreases. With our procedure, we obtain 0.6% false positives among genomic locations. Hence, even rare signatures should identify biologically relevant regions, if they are mapped on the genome. This indicates that digital transcriptomics may help to characterize the wealth of yet undiscovered, low-abundance transcripts.
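The link between read length and mapping ambiguity can be illustrated with a small Python sketch. It is not the authors' procedure: it only counts, for a given length k, the fraction of start positions in a sequence whose k-mer occurs exactly once, using exact matching on a single strand and ignoring sequence errors and SNPs. The function name `unique_fraction` and the toy sequence are hypothetical and chosen for illustration only.

```python
from collections import Counter


def unique_fraction(genome: str, k: int) -> float:
    """Return the fraction of k-mer start positions in `genome` whose
    k-mer occurs exactly once (exact match, forward strand only)."""
    n = len(genome) - k + 1  # number of k-mer start positions
    if k <= 0 or n <= 0:
        raise ValueError("k must be between 1 and len(genome)")
    # Count every k-mer occurrence in the sequence.
    counts = Counter(genome[i:i + k] for i in range(n))
    # A position is "uniquely located" if its k-mer was seen only once.
    unique = sum(1 for i in range(n) if counts[genome[i:i + k]] == 1)
    return unique / n


if __name__ == "__main__":
    # Hypothetical toy sequence; a real analysis would use a reference genome.
    toy = "ACGTACGTTTGACCAGTACGATTACA"
    for k in (2, 4, 6, 8):
        print(k, round(unique_fraction(toy, k), 3))
```

On an error-free sequence this fraction can only grow or stay flat as k increases. The decrease in uniquely mapped reads that the abstract reports above 20 bp comes from factors this sketch leaves out, such as sequence errors that accumulate along the read.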
