How to infer reliable diploid genotypes from NGS or traditional sequence data: from basic probability to experimental optimization
Author(s) - CHENUIL A.
Publication year - 2012
Publication title - Journal of Evolutionary Biology
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.289
H-Index - 128
eISSN - 1420-9101
pISSN - 1010-061X
DOI - 10.1111/j.1420-9101.2012.02488.x
Subject(s) - biology , bayes' theorem , genetics , locus (genetics) , ploidy , dna sequencing , population , amplicon , sequence (biology) , computational biology , evolutionary biology , bayesian probability , statistics , mathematics , polymerase chain reaction , gene , demography , sociology
The use of diploid sequence markers is still challenging despite the good quality of the information they provide. There is a problem common to all sequencing approaches [traditional cloning and sequencing of PCR amplicons as well as next‐generation sequencing (NGS)]: when no variation is found within the sequences from a given individual, homozygosity can never be asserted with certainty. As a consequence, sequence data from diploid markers are mostly analysed at the population (not the individual) level, particularly in animal studies. This study aims to contribute to solving this problem. Using Bayes' theorem and the binomial distribution, useful results are derived, among them: (i) the number of sequence reads per individual (or sequencing depth) required to ensure, at a given probability threshold, that heterozygotes are not erroneously considered homozygotes, as a function of the observed heterozygosity (H_o) of the locus in the population; (ii) a way of estimating H_o from low-coverage NGS data; (iii) a way of testing the null hypothesis that a genetic marker corresponds to a single, diploid locus, in the absence of data from controlled crosses; (iv) strategies for characterizing sequence genotypes in populations while minimizing the average number of sequence reads per individual; (v) a rationale for deciding which variations along the sequence need to be considered, as a function of the affordable sequencing depth, the desired level of polymorphism and the risk of sequencing error. For traditional sequencing technology, optimal strategies appear surprisingly different from the usual empirical ones. The average number of sequence reads required to obtain 99% of fully determined genotypes never exceeds six, this value corresponding to the worst case, when H_o equals 0.6. This threshold value of H_o is strikingly stable when the tolerated proportion of non-fully resolved genotypes varies within a reasonable range. These results do not rely on the assumption of Hardy–Weinberg equilibrium or on diallelism of nucleotide sites.
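The core calculation behind points (i) and (ii) can be sketched in a few lines. If one assumes that the two alleles of a heterozygote are equally likely to be observed at each read, that reads are independent, and that sequencing error is ignored (simplifying assumptions made here for illustration, not necessarily those of the paper), then the probability that n reads from a heterozygous individual all show the same allele is (1/2)^(n-1), and Bayes' theorem combines this with the prior H_o to give the posterior probability that an individual showing a single sequence variant is in fact a hidden heterozygote. The Python sketch below computes that posterior and the minimum depth needed to push it below a chosen risk level; the function names and the 1% threshold are illustrative, not taken from the paper.

```python
# Minimal sketch (not the paper's code): the Bayes/binomial calculation behind
# points (i) and (ii), assuming both alleles of a heterozygote are equally
# likely to be observed at each read, reads are independent, and sequencing
# error is ignored.

def p_identical_reads_given_het(n_reads: int) -> float:
    """P(all n reads show the same allele | individual is heterozygous) = (1/2)^(n-1)."""
    return 0.5 ** (n_reads - 1)

def p_het_given_identical_reads(n_reads: int, h_obs: float) -> float:
    """Posterior P(heterozygote | n identical reads), by Bayes' theorem.

    Prior P(het) = h_obs, the observed heterozygosity of the locus;
    a homozygote yields identical reads with probability 1 under these assumptions.
    """
    p_het = h_obs * p_identical_reads_given_het(n_reads)
    p_hom = (1.0 - h_obs) * 1.0
    return p_het / (p_het + p_hom)

def min_depth(h_obs: float, risk: float = 0.01) -> int:
    """Smallest depth n such that the probability of a hidden heterozygote,
    given n identical reads, falls below `risk` (threshold is illustrative)."""
    n = 1
    while p_het_given_identical_reads(n, h_obs) > risk:
        n += 1
    return n

if __name__ == "__main__":
    for h in (0.1, 0.3, 0.6, 0.9):
        print(f"H_o = {h:.1f}: {min_depth(h)} identical reads needed before "
              f"calling a homozygote at <1% risk of a hidden heterozygote")
```

Note that the depth returned here is the worst-case depth for an individual whose reads are all identical; the figure of at most six reads quoted in the abstract refers to the average depth over a sample under an optimized strategy, a smaller quantity, presumably because heterozygotes usually reveal both alleles after only a few reads.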
