z-logo
Premium
Bayesian nonparametric variable selection as an exploratory tool for discovering differentially expressed genes
Author(s) -
Shahbaba Babak,
Johnson Wesley O.
Publication year - 2012
Publication title -
statistics in medicine
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.996
H-Index - 183
eISSN - 1097-0258
pISSN - 0277-6715
DOI - 10.1002/sim.5680
Subject(s) - regression , bayesian probability , feature selection , dirichlet process , regression analysis , statistics , scale (ratio) , linear regression , computer science , biology , mathematics , computational biology , artificial intelligence , physics , quantum mechanics
High‐throughput scientific studies involving no clear a priori hypothesis are common. For example, a large‐scale genomic study of a disease may examine thousands of genes without hypothesizing that any specific gene is responsible for the disease. In these studies, the objective is to explore a large number of possible factors (e.g., genes) in order to identify a small number that will be considered in follow‐up studies that tend to be more thorough and on smaller scales. A simple, hierarchical, linear regression model with random coefficients is assumed for case–control data that correspond to each gene. The specific model used will be seen to be related to a standard Bayesian variable selection model. Relatively large regression coefficients correspond to potential differences in responses for cases versus controls and thus to genes that might ‘matter’. For large‐scale studies, and using a Dirichlet process mixture model for the regression coefficients, we are able to find clusters of regression effects of genes with increasing potential effect or ‘relevance’, in relation to the outcome of interest. One cluster will always correspond to genes whose coefficients are in a neighborhood that is relatively close to zero and will be deemed least relevant. Other clusters will correspond to increasing magnitudes of the random/latent regression coefficients. Using simulated data, we demonstrate that our approach could be quite effective in finding relevant genes compared with several alternative methods. We apply our model to two large‐scale studies. The first study involves transcriptome analysis of infection by human cytomegalovirus. The second study's objective is to identify differentially expressed genes between two types of leukemia. Copyright © 2012 John Wiley & Sons, Ltd.

This content is not available in your region!

Continue researching here.

Having issues? You can contact us here