RNASeqDesign: a framework for ribonucleic acid sequencing genomewide power calculation and study design issues
Author(s) - Lin Chien-Wei, Liao Serena G., Liu Peng, Lee Mei-Ling Ting, Park Yong Seok, Tseng George C.
Publication year - 2019
Publication title - Journal of the Royal Statistical Society: Series C (Applied Statistics)
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.205
H-Index - 72
eISSN - 1467-9876
pISSN - 0035-9254
DOI - 10.1111/rssc.12330
Subject(s) - computer science, massive parallel sequencing, sample size determination, parametric statistics, count data, data mining, sample (material), statistical power, dna sequencing, computational biology, biology, poisson distribution, statistics, mathematics, genetics, chemistry, dna, chromatography
Summary - Massively parallel sequencing, also known as next generation sequencing (NGS), has emerged as a powerful tool for characterizing genomic profiles. Among the many NGS applications, ribonucleic acid sequencing (‘RNA‐Seq’) has gradually become a standard tool for global transcriptomic monitoring. Although the cost of NGS experiments has dropped steadily, high sequencing cost and bioinformatic complexity remain obstacles for many biomedical projects. Unlike earlier fluorescence‐based technologies such as microarrays, NGS produces discrete count data, and models of NGS data should account for this. In addition to sample size, sequencing depth also directly affects the experimental cost. Consequently, given a total budget and prespecified unit experimental costs, study design in RNA‐Seq is conceptually a multi‐dimensional constrained optimization problem, which is more complex than the one‐dimensional sample size calculation of a traditional hypothesis testing setting. We propose a statistical framework, ‘RNASeqDesign’, that uses pilot data for power calculation and study design of RNA‐Seq experiments. The approach is based on mixture model fitting of the p‐value distribution from pilot data and a parametric bootstrap procedure based on approximated Wald test statistics to infer genomewide power and so guide the choice of sample size and sequencing depth. We further illustrate five practical study design tasks for practitioners. We perform simulations and three real applications to evaluate the performance and to compare with existing methods.
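To make the budget-constrained design problem in the summary concrete, the following is a minimal Python sketch, not the RNASeqDesign implementation. It assumes a hypothetical two-group experiment with a linear cost model (a fixed cost per sample plus a cost per million reads) and enumerates the (sample size, sequencing depth) pairs that fit within a total budget. The function names, the cost figures, and in particular `toy_power` are illustrative assumptions; in practice the power surface would be estimated from pilot data.

```python
from typing import Callable, List, Optional, Tuple

Design = Tuple[int, int, int]  # (samples per group, depth in million reads, total cost)


def feasible_designs(
    budget: int,
    cost_per_sample: int,
    cost_per_million_reads: int,
    sample_sizes: List[int],
    depths_millions: List[int],
) -> List[Design]:
    """List every (n, depth) pair whose total cost fits within the budget.

    Assumed cost model (hypothetical): two groups of n samples each, and
    each sample costs a fixed amount plus a per-million-reads amount.
    """
    designs = []
    for n in sample_sizes:
        for depth in depths_millions:
            cost = 2 * n * (cost_per_sample + cost_per_million_reads * depth)
            if cost <= budget:
                designs.append((n, depth, cost))
    return designs


def best_design(
    designs: List[Design], power: Callable[[int, int], float]
) -> Optional[Design]:
    """Pick the feasible design with the highest value of power(n, depth)."""
    if not designs:
        return None
    return max(designs, key=lambda d: power(d[0], d[1]))


def toy_power(n: int, depth: int) -> float:
    """Placeholder power surface, NOT estimated from data.

    It only mimics diminishing returns in both sample size and depth.
    """
    return (1 - 0.8 ** n) * (1 - 0.9 ** depth)


# Illustrative numbers only; none of these costs come from the paper.
designs = feasible_designs(
    budget=10000,
    cost_per_sample=200,
    cost_per_million_reads=10,
    sample_sizes=[5, 10, 20],
    depths_millions=[10, 20, 40],
)
chosen = best_design(designs, toy_power)
```

Replacing `toy_power` with a power estimate derived from pilot data turns the same enumeration into a search over the two design dimensions, which is why the problem is no longer a one-dimensional sample size calculation.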