Quantifying uncertainty of taxonomic placement in DNA barcoding and metabarcoding
Author(s) -
Somervuo Panu,
Yu Douglas W.,
Xu Charles C.Y.,
Ji Yinqiu,
Hultman Jenni,
Wirta Helena,
Ovaskainen Otso
Publication year - 2017
Publication title -
Methods in Ecology and Evolution
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 3.425
H-Index - 105
ISSN - 2041-210X
DOI - 10.1111/2041-210x.12721
Subject(s) - taxonomic rank , dna barcoding , biology , biodiversity , identification (biology) , taxonomy (biology) , biological classification , evolutionary biology , taxon , ecology
Summary -
A crucial step in the use of DNA markers for biodiversity surveys is the assignment of Linnaean taxonomies (species, genus, etc.) to sequence reads. This makes available all the information already associated with the taxonomic names. Taxonomic placement of DNA barcoding sequences is inherently probabilistic because DNA sequences contain errors, because there is natural variation among sequences within a species, and because reference data bases are incomplete and can have false annotations. However, most existing bioinformatics methods for taxonomic placement either ignore uncertainty or quantify it using metrics other than probability.

In this paper we evaluate the performance of the recently proposed probabilistic taxonomic placement method PROTAX by applying it both to annotated reference sequence data and to unknown environmental data. Our four case studies include contrasting taxonomic groups (fungi, bacteria, mammals and insects), variation in the length and quality of the barcoding sequences (from individually Sanger‐sequenced sequences to short Illumina reads), variation in the structures and sizes of the taxonomies (800–130 000 species) and variation in the completeness of the reference data bases (representing 15–100% of known species).

Our results demonstrate that PROTAX yields essentially unbiased probabilities of taxonomic placement, which means its quantification of species identification uncertainty is reliable. As expected, the accuracy of taxonomic placement increases with increasing coverage of taxonomic and reference sequence data bases, and with an increasing ratio of genetic variation among taxonomic levels to genetic variation within taxonomic levels.

We conclude that reliable species‐level identification from environmental samples is still challenging and that neglecting identification uncertainty can lead to spurious inference. A key aim for future research is the completion of taxonomic and reference sequence data bases and making these two types of data compatible.
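
The summary's claim of "essentially unbiased probabilities" can be read as a calibration statement: among placements assigned a probability near p, roughly a fraction p should turn out to be correct. The sketch below illustrates that check in plain Python. It is not PROTAX code and not the authors' evaluation procedure; the function name `calibration_table`, the equal-width binning, and the input values are all assumptions made up for illustration.

```python
from collections import defaultdict


def calibration_table(probs, correct, n_bins=5):
    """Bin predictions by probability and compare, per bin, the mean
    predicted probability with the observed fraction of correct calls.

    Hypothetical helper for illustration; not part of PROTAX.
    Returns a list of (bin_index, count, mean_probability, fraction_correct).
    """
    bins = defaultdict(list)
    for p, ok in zip(probs, correct):
        # Equal-width bins on [0, 1]; p == 1.0 is placed in the top bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, ok))

    rows = []
    for idx in sorted(bins):
        items = bins[idx]
        mean_p = sum(p for p, _ in items) / len(items)
        frac_ok = sum(1 for _, ok in items if ok) / len(items)
        rows.append((idx, len(items), mean_p, frac_ok))
    return rows


# Made-up example: predicted placement probabilities and whether each
# placement matched the known taxon of an annotated test sequence.
probs = [0.95, 0.90, 0.85, 0.90, 0.30, 0.25, 0.35, 0.30]
correct = [True, True, True, False, False, True, False, False]

for idx, n, mean_p, frac_ok in calibration_table(probs, correct):
    print(f"bin {idx}: n={n} mean_p={mean_p:.2f} observed={frac_ok:.2f}")
```

For a well-calibrated method the two columns stay close in every bin. A systematic gap (for example, high stated probabilities with a clearly lower observed accuracy) would indicate overconfident placements.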