Open Access
A Bayesian nonparametric model for inferring subclonal populations from structured DNA sequencing data
Author(s) - Shai He, Aaron Schein, Vishal Kumar Sarsani, Patrick Flaherty
Publication year - 2021
Publication title - The Annals of Applied Statistics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.674
H-Index - 75
eISSN - 1941-7330
pISSN - 1932-6157
DOI - 10.1214/20-aoas1434
Subject(s) - gibbs sampling, inference, hierarchical dirichlet process, dirichlet distribution, computer science, dirichlet process, nonparametric statistics, sampling (signal processing), bayesian probability, computational biology, biology, artificial intelligence, data mining, latent dirichlet allocation, mathematics, statistics, topic model, mathematical analysis, filter (signal processing), computer vision, boundary value problem
There are distinguishing features or "hallmarks" of cancer that are found across tumors, individuals, and types of cancer, and these hallmarks can be driven by specific genetic mutations. Yet within a single tumor there is often extensive genetic heterogeneity, as evidenced by single-cell and bulk DNA sequencing data. The goal of this work is to jointly infer the underlying genotypes of tumor subpopulations and the distribution of those subpopulations in individual tumors by integrating single-cell and bulk sequencing data. Understanding the genetic composition of the tumor at the time of treatment is important for the personalized design of targeted therapeutic combinations and for monitoring possible recurrence after treatment. We propose a hierarchical Dirichlet process mixture model that incorporates the correlation structure induced by a structured sampling arrangement, and we show that this model improves the quality of inference. We develop a representation of the hierarchical Dirichlet process prior as a Gamma-Poisson hierarchy, and we use this representation to derive a fast Gibbs sampling inference algorithm using the augment-and-marginalize method. Experiments with simulated data show that our model outperforms standard numerical and statistical methods for decomposing admixed count data. Analysis of a real acute lymphoblastic leukemia sequencing dataset shows that our model improves upon state-of-the-art bioinformatic methods. An interpretation of the results of our model on this real dataset reveals co-mutated loci across samples.
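As an illustration only, the Gamma-Poisson idea mentioned in the abstract rests on two standard facts: normalizing independent Gamma variables gives Dirichlet-distributed weights, and independent Poisson counts with those Gamma rates are multinomial once their total is fixed. The Python sketch below shows a finite (truncated) version of that construction. It is not the authors' model or sampler; the function names, the truncation level, and the `depth` scaling are hypothetical choices made for this example.

```python
import math
import random


def normalized_gamma_weights(concentration, num_components, rng):
    """Draw mixture weights by normalizing independent Gamma variables.

    If g_k ~ Gamma(concentration / K, 1) independently for k = 1..K, then
    g / sum(g) follows a symmetric Dirichlet(concentration / K) distribution.
    This is a finite stand-in for a Dirichlet process (assumption of the sketch).
    """
    shape = concentration / num_components
    gammas = [rng.gammavariate(shape, 1.0) for _ in range(num_components)]
    total = sum(gammas)
    return [g / total for g in gammas]


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_component_counts(weights, depth, rng):
    """Independent Poisson counts with rates depth * weight.

    Conditional on their sum, these counts are multinomial with the given
    weights, which is the link between Poisson rates and mixture proportions.
    """
    return [sample_poisson(depth * w, rng) for w in weights]


rng = random.Random(0)
weights = normalized_gamma_weights(concentration=5.0, num_components=5, rng=rng)
counts = sample_component_counts(weights, depth=50.0, rng=rng)
print(weights)
print(counts)
```

Working with unnormalized Gamma rates and Poisson counts, rather than with normalized weights directly, is what typically makes conjugate Gibbs updates available in models of this kind; the sketch only simulates forward from the prior and performs no inference.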
