Extracting Actionable Information From Genome Scans
Author(s) - Bacanu Silviu‐Alin, Kendler Kenneth S.
Publication year - 2013
Publication title - Genetic Epidemiology
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.301
H-Index - 98
eISSN - 1098-2272
pISSN - 0741-0395
DOI - 10.1002/gepi.21682
Subject(s) - univariate, estimator, genome, statistics, genome wide association study, computer science, sample size determination, computational biology, data mining, biology, mathematics, genetics, multivariate statistics, gene, single nucleotide polymorphism, genotype
Genome‐wide association studies have discovered numerous genetic variants significantly associated with various phenotypes. However, these significant signals explain only a small portion of the variation in many traits. One explanation is that the missing variation resides in “suggestive signals,” i.e., variants with reasonably small P‐values. It is not clear, however, how to capture this information and use it optimally to design and analyze future studies. We propose to extract the available information from a genome scan by accurately estimating the means of its univariate statistics. The means are estimated by: (i) computing the sum of squares (SS) of a genome scan's univariate statistics, (ii) using SS to estimate the expected SS for the means (SSM) of the univariate statistics, and (iii) constructing accurate soft threshold (ST) estimators for the means of the univariate statistics by requiring that the SS of these estimators equal the SSM. Compared with competing estimators, ST estimators explain a substantially higher fraction of the variability in the true means. The accuracy of the proposed estimators can be used to design two‐tier follow‐up studies in which regions close to variants with ST‐estimated means above a certain threshold are sequenced at high coverage and the rest of the genome is sequenced at low coverage. This follow‐up approach reduces the sequencing burden by at least an order of magnitude compared with high‐coverage sequencing of the whole genome. Finally, we suggest ways in which the ST methodology can be used to improve signal detection in future sequencing studies and to perform general statistical model selection.
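
Steps (i) to (iii) can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on assumptions the abstract does not state: the univariate statistics are treated as independent Z‐scores with unit variance, so that the expected SS equals SSM plus the number of statistics, and the threshold is found by simple bisection. The names `soft_threshold` and `st_estimate` are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink each statistic toward zero by t; values with |z| <= t become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def st_estimate(z, tol=1e-10, max_iter=200):
    """Hypothetical sketch of a soft threshold (ST) estimator of the means.

    Assumption (not from the abstract): z holds independent unit-variance
    Z-scores, so E[sum(z**2)] = sum(mu**2) + len(z).
    Returns (estimated means, estimated SSM, threshold t).
    """
    z = np.asarray(z, dtype=float)
    ss = np.sum(z ** 2)                 # step (i): SS of the statistics
    ssm = max(ss - z.size, 0.0)         # step (ii): estimated SS of the means
    if z.size == 0 or ssm == 0.0:
        # No signal beyond noise under this model: shrink everything to zero.
        t = float(np.max(np.abs(z))) if z.size else 0.0
        return np.zeros_like(z), ssm, t
    # Step (iii): the SS of the shrunken values decreases continuously from
    # ss (at t = 0) to 0 (at t = max|z|), so bisection finds the t at which
    # it matches ssm.
    lo, hi = 0.0, float(np.max(np.abs(z)))
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if np.sum(soft_threshold(z, mid) ** 2) > ssm:
            lo = mid                    # too little shrinkage, raise t
        else:
            hi = mid                    # too much shrinkage, lower t
        if hi - lo < tol:
            break
    t = 0.5 * (lo + hi)
    return soft_threshold(z, t), ssm, t

# Usage on a toy scan: two strong signals among weak statistics.
z_scores = np.array([5.0, -4.0, 0.5, -0.3, 1.2, 0.1])
mu_hat, ssm_hat, threshold = st_estimate(z_scores)
```

In a two‐tier follow‐up design of the kind described above, variants whose `mu_hat` exceeds a chosen cutoff in absolute value would be the candidates for high‐coverage sequencing. The cutoff itself is a design choice that this sketch does not determine.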