The phylogenetic Kantorovich–Rubinstein metric for environmental sequence samples
Author(s) - Steven N. Evans, Frederick A. Matsen
Publication year - 2012
Publication title - Journal of the Royal Statistical Society: Series B (Statistical Methodology)
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 6.523
H-Index - 137
eISSN - 1467-9868
pISSN - 1369-7412
DOI - 10.1111/j.1467-9868.2011.01018.x
Subject(s) - mathematics, metric (unit), UniFrac, phylogenetic tree, statistics, tree (set theory), sequence (biology), sample (material), combinatorics, biology, paleontology, chromatography, biochemistry, operations management, genetics, chemistry, 16S ribosomal RNA, bacteria, gene, economics
Summary. It is now common to survey microbial communities by sequencing nucleic acid material extracted in bulk from a given environment. Comparative methods are needed that indicate the extent to which two communities differ given data sets of this type. UniFrac, which gives a somewhat ad hoc phylogenetics-based distance between two communities, is one of the most commonly used tools for these analyses. We provide a foundation for such methods by establishing that, if we equate a metagenomic sample with its empirical distribution on a reference phylogenetic tree, then the weighted UniFrac distance between two samples is just the classical Kantorovich–Rubinstein, or earth mover's, distance between the corresponding empirical distributions. We demonstrate that this Kantorovich–Rubinstein distance and extensions incorporating uncertainty in the sample locations can be written as a readily computable integral over the tree, we develop L^p Zolotarev-type generalizations of the metric, and we show how the p-value of the resulting natural permutation test of the null hypothesis ‘no difference between two communities’ can be approximated by using a Gaussian process functional. We relate the L^2 case to an analysis-of-variance type of decomposition, finding that the distribution of its associated Gaussian functional is that of a computable linear combination of independent random variables.
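
The "readily computable integral over the tree" can be illustrated with a minimal sketch. It assumes the simplest setting, in which all probability mass sits on nodes of a rooted tree, so that the integral reduces to a sum over branches of branch length times the absolute difference in mass below that branch. The function name `kr_distance` and the dictionary-based tree layout are hypothetical choices made for this illustration and are not taken from the paper or from any UniFrac software.

```python
from collections import defaultdict


def kr_distance(parent, length, p, q):
    """Earth mover's distance between two mass distributions on a rooted tree.

    parent: dict mapping each non-root node to its parent
    length: dict mapping each non-root node to the length of the branch above it
    p, q:   dicts mapping nodes to probability mass (missing nodes carry 0)
    Assumption: mass is placed on nodes only, not along branches.
    """
    nodes = set(parent) | set(parent.values())
    children = defaultdict(list)
    for child, par in parent.items():
        children[par].append(child)
    root = next(n for n in nodes if n not in parent)

    # Pre-order traversal; its reverse visits every node after all its descendants.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children[node])

    # diff[n] accumulates (P - Q) over the subtree hanging below n.
    diff = {n: p.get(n, 0.0) - q.get(n, 0.0) for n in nodes}
    total = 0.0
    for node in reversed(order):
        if node != root:
            # Net mass that must cross the branch above this node.
            total += length[node] * abs(diff[node])
            diff[parent[node]] += diff[node]
    return total


# Toy tree: root R with leaves A (branch length 1) and B (branch length 2).
parent = {"A": "R", "B": "R"}
length = {"A": 1.0, "B": 2.0}
print(kr_distance(parent, length, {"A": 1.0}, {"B": 1.0}))
```

In the toy example all of one sample sits at leaf A and all of the other at leaf B, so a unit of mass has to travel the full path of length 1 + 2 between them. Mass located along branches, or uncertainty in sample placement, would need the integral form rather than this node-only sum.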
