The Spike-and-Slab Lasso Generalized Linear Models for Prediction and Associated Genes Detection
Author(s) - Zaixiang Tang, Yueping Shen, Xinyan Zhang, Nengjun Yi
Publication year - 2016
Publication title - Genetics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 2.792
H-Index - 246
eISSN - 1943-2631
pISSN - 0016-6731
DOI - 10.1534/genetics.116.192195
Subject(s) - lasso (programming language) , bayesian probability , computer science , data set , feature selection , algorithm , scale (ratio) , set (abstract data type) , data mining , biology , computational biology , pattern recognition (psychology) , artificial intelligence , physics , quantum mechanics , world wide web , programming language
Large-scale "omics" data are increasingly used as a resource for prognostic prediction of diseases and detection of associated genes. However, analyzing high-dimensional molecular data poses considerable challenges, including the large number of potential molecular predictors, the limited number of samples, and the small effect of each predictor. We propose new Bayesian hierarchical generalized linear models (GLMs), called spike-and-slab lasso GLMs, for prognostic prediction and detection of associated genes using large-scale molecular data. The proposed model employs a spike-and-slab mixture double-exponential prior for coefficients that induces weak shrinkage on large coefficients and strong shrinkage on irrelevant coefficients. We have developed a fast and stable algorithm to fit large-scale hierarchical GLMs by incorporating expectation-maximization (EM) steps into the fast cyclic coordinate descent algorithm. The proposed approach combines the strengths of two popular methods: penalized lasso and Bayesian spike-and-slab variable selection. The performance of the proposed method is assessed via extensive simulation studies. The results show that the proposed approach provides not only more accurate parameter estimates but also better prediction. We demonstrate the proposed procedure on two cancer data sets: a well-known breast cancer data set of 295 tumors with expression data for 4919 genes, and the ovarian cancer data set from TCGA of 362 tumors with expression data for 5336 genes. Our analyses show that the proposed procedure can generate powerful models for predicting outcomes and detecting associated genes. The methods have been implemented in a freely available R package, BhGLM (http://www.ssg.uab.edu/bhglm/).
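The idea of nesting EM steps inside cyclic coordinate descent can be illustrated with a minimal Python sketch. It is not the BhGLM implementation and makes several assumptions the abstract does not state: a Gaussian outcome with unit residual variance, fixed spike and slab scales (the hypothetical parameters `s0` and `s1`), a single shared inclusion probability `theta`, and an E-step that replaces each coefficient's inverse prior scale with its conditional expectation. The function names `ssl_fit`, `soft_threshold`, and `de_density` are illustrative only.

```python
import numpy as np

def soft_threshold(z, t):
    """Lasso soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def de_density(beta, scale):
    """Double-exponential (Laplace) density with mean 0 and the given scale."""
    return np.exp(-np.abs(beta) / scale) / (2.0 * scale)

def ssl_fit(X, y, s0=0.05, s1=1.0, n_iter=50, cd_sweeps=5):
    """Sketch of an EM + coordinate descent fit for a Gaussian model.

    Assumes unit residual variance and no all-zero columns in X.
    s0 (spike scale) and s1 (slab scale) are assumed tuning values.
    """
    n, p = X.shape
    beta = np.zeros(p)
    theta = 0.5                      # assumed starting inclusion probability
    intercept = y.mean()
    col_sq = (X ** 2).sum(axis=0)
    prob = np.full(p, theta)
    for _ in range(n_iter):
        # E-step: probability that each coefficient comes from the slab,
        # and the implied expected inverse scale (the lasso penalty weight).
        slab = theta * de_density(beta, s1)
        spike = (1.0 - theta) * de_density(beta, s0)
        prob = slab / (slab + spike)
        penalty = (1.0 - prob) / s0 + prob / s1
        # M-step: update theta, then run weighted-lasso coordinate descent.
        theta = prob.mean()
        resid = y - intercept - X @ beta
        for _ in range(cd_sweeps):
            resid += intercept
            intercept = resid.mean()
            resid -= intercept
            for j in range(p):
                resid += X[:, j] * beta[j]          # remove j from the fit
                z = X[:, j] @ resid
                beta[j] = soft_threshold(z, penalty[j]) / col_sq[j]
                resid -= X[:, j] * beta[j]          # add updated j back
    # prob holds the inclusion probabilities from the last E-step.
    return intercept, beta, prob

# Simulated example: 3 real effects among 20 predictors (made-up data).
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]
y_demo = X_demo @ beta_true + 0.5 * rng.standard_normal(100)
b0_hat, beta_hat, prob_hat = ssl_fit(X_demo, y_demo)
```

In this sketch, coefficients with a high slab probability receive a penalty close to `1 / s1` (weak shrinkage), while coefficients near zero receive a penalty close to `1 / s0` (strong shrinkage), which mirrors the behavior the abstract attributes to the mixture prior. Extending it to other GLM families would require a working-response (iteratively reweighted) step that is omitted here.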