High sensitivity to aligner and high rate of false positives in the estimates of positive selection in the 12 Drosophila genomes

Penka Markova-Raina; Dmitri A. Petrov

Open Access

Author(s) - Penka Markova-Raina, Dmitri A. Petrov

Publication year - 2011

Publication title - Genome Research

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 9.556

H-Index - 297

eISSN - 1549-5469

pISSN - 1088-9051

DOI - 10.1101/gr.115949.110

Subject(s) - false positive paradox, biology, drosophila melanogaster, genome, selection (genetic algorithm), false positive rate, positive selection, inference, evolutionary biology, negative selection, computational biology, genetics, gene, machine learning, artificial intelligence, computer science

We investigate the effect of aligner choice on inferences of positive selection using site-specific models of molecular evolution. We find that, regardless of the aligner used, the false-positive rate is unacceptably high. Our study is a whole-genome analysis of all protein-coding genes in the 12 Drosophila genomes that are annotated either in all 12 species (∼6690 genes) or in the six melanogaster group species. We compare six popular aligners (PRANK, T-Coffee, ClustalW, ProbCons, AMAP, and MUSCLE) and find that aligner choice strongly influences the estimates of positive selection. The differences persist across (1) different stringency cutoffs, (2) different selection inference models, (3) alignments with or without gaps and/or additional masking, (4) per-site versus per-gene statistics, and (5) the closely related melanogaster group species versus the more distantly related full set of 12 Drosophila genomes. Furthermore, these differences are consequential for downstream analyses, such as determining which GO terms are over- or under-represented among genes associated with positive selection. Visual analysis indicates that most sites inferred as positively selected are, in fact, misaligned at the codon level, resulting in false-positive rates of 48%–82%. PRANK, which has been reported to outperform other aligners in simulations, performed best in our empirical study as well. Unfortunately, PRANK still had a false-positive rate of 50%–55%, which is unacceptably high for most applications. We identify misannotations and indels, many of which appear to be located in disordered protein regions, as the primary culprits for the high misalignment-related error levels, and we discuss possible workarounds for this apparently pervasive problem in genome-wide evolutionary analyses.
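One simple way to quantify how strongly aligner choice changes the set of genes inferred to be under positive selection is to compare the per-aligner gene sets pairwise. The sketch below is illustrative only and is not the authors' method: it uses the Jaccard index as an assumed agreement measure, the function names `jaccard` and `pairwise_agreement` are hypothetical, and the gene IDs are placeholders rather than results from the study.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets; defined here as 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_agreement(calls):
    """Map each unordered pair of aligners to the Jaccard index of their gene sets.

    `calls` maps an aligner name to the set of gene IDs that were inferred
    to be under positive selection when that aligner was used.
    """
    return {
        (x, y): jaccard(calls[x], calls[y])
        for x, y in combinations(sorted(calls), 2)
    }


# Hypothetical placeholder gene IDs, NOT data from the study.
toy_calls = {
    "PRANK": {"geneA", "geneB"},
    "MUSCLE": {"geneA", "geneC", "geneD"},
    "ClustalW": {"geneA", "geneB", "geneC", "geneD"},
}

for pair, score in pairwise_agreement(toy_calls).items():
    print(pair, round(score, 2))
```

A Jaccard index near 1 would mean two aligners lead to nearly the same set of candidate genes, while values near 0 would mean the inferred sets barely overlap. The same comparison could be run on per-site rather than per-gene calls by using (gene, codon position) pairs as set elements.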
