Effect Size Estimation in Neuroimaging
Author(s) - Marianne C. Reddan, Martin A. Lindquist, Tor D. Wager
Publication year - 2017
Publication title - JAMA Psychiatry
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 7.531
H-Index - 365
eISSN - 2168-6238
pISSN - 2168-622X
DOI - 10.1001/jamapsychiatry.2016.3356
Subject(s) - neuroimaging, estimation, medicine, psychology, neuroscience, management, economics
A central goal of translational neuroimaging is to establish robust links between brain measures and clinical outcomes. Success hinges on the development of brain biomarkers with large effect sizes. With large enough effects, a measure may be diagnostic of outcomes at the individual patient level. Surprisingly, however, standard brain-mapping analyses are not designed to estimate or optimize the effect sizes of brain-outcome relationships, and estimates are often biased. Here, we review these issues and how to estimate effect sizes in neuroimaging research.

Effect size is a unit-free description of the strength of an effect, independent of sample size. Examples include Cohen d, Pearson r, and number needed to treat [1,2]. For a given sample size (N), these can be converted to a t or z score (eg, for a 1-sample test, Cohen d is t/√N). But t, z, F, and P values are sample size dependent and relate to the presence of an effect (statistical significance), not its magnitude. By contrast, effect size describes a finding's practical significance, which determines its clinical importance. This is an important distinction because small effects can reach statistical significance given a large enough sample, even if they are unlikely to be of practical importance or replicable across diverse samples [3].

Traditional neuroimaging studies are not designed to estimate effect sizes. A typical analysis tests for effects at each of 50 000 to 350 000 brain voxels. Post hoc effect sizes are then reported selectively, for only the small subset of voxels that reach significance. This practice creates bias, making effect size estimates larger than their true values [4]. It is like a mediocre golfer who plays 5000 holes over the course of his career but only reports his 10 best holes. Bias is introduced because the best performance, selected post hoc, is not representative of expected performance. The Figure shows a simulation in which the true effect size in a set of voxels is d = 0.5.
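A simulation of this kind can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code. It assumes unit-variance Gaussian noise, 5000 voxels, 30 individuals per voxel, and a 1-sample t test thresholded at t > 3.659 (the 2-sided P < .001 critical value at 29 df). The function name `simulate_selection_bias` and all of its parameters are hypothetical.

```python
import math
import random
import statistics

def simulate_selection_bias(true_d=0.5, n_subjects=30, n_voxels=5000,
                            t_crit=3.659, seed=0):
    """Return (all_d, significant_d): the estimated Cohen d for every voxel
    and for the subset of voxels whose t statistic exceeds t_crit.

    Assumptions (not from the article): unit-variance Gaussian noise and
    t_crit = 3.659, the 2-sided P < .001 critical value at 29 df.
    """
    rng = random.Random(seed)
    all_d, significant_d = [], []
    for _ in range(n_voxels):
        # Each individual's value is the true effect (in SD units) plus noise.
        sample = [rng.gauss(true_d, 1.0) for _ in range(n_subjects)]
        d_hat = statistics.fmean(sample) / statistics.stdev(sample)
        # One-sample t statistic; equivalently, d = t / sqrt(N).
        t_stat = d_hat * math.sqrt(n_subjects)
        all_d.append(d_hat)
        if t_stat > t_crit:
            significant_d.append(d_hat)
    return all_d, significant_d

all_d, significant_d = simulate_selection_bias()
print(f"mean estimated d, all voxels:         {statistics.fmean(all_d):.2f}")
print(f"mean estimated d, significant voxels: {statistics.fmean(significant_d):.2f}")
print(f"smallest significant estimated d:     {min(significant_d):.2f}")
```

Under these assumptions the average estimate across all voxels stays close to the true value of 0.5. A voxel can pass the threshold only if its estimated d exceeds 3.659/√30 ≈ 0.67, so every reported estimate is larger than the true effect.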
Once noise is added and a statistical test (t test) is conducted across 30 individuals, all significant voxels have an estimated effect size greater than the true effect. Why does this occur? Voxels tend to be significant if they show a true effect and have noise that favors the hypothesis. Correcting for multiple comparisons reduces false positives but actually increases this optimistic bias [6]. As statistical thresholds become more stringent, an increasingly small subset of tests with favorable noise will reach significance, making the estimated post hoc effect size grow. In sum, conducting a large number of tests inherently induces selection bias, which invalidates effect size estimates.

To overcome selection bias, we must reduce the number of statistical tests performed. One solution is to test a single, predefined region of interest. However, it is rare to consider only 1 region, and doing so discards valuable data. In addition, many symptoms and outcomes of interest are increasingly thought to be distributed across brain networks [5]. It can also be tempting to redefine the boundaries of regions of interest post hoc after looking at the results, a form of P hacking that invalidates both hypothesis tests and effect size estimates.

An alternative approach is to integrate effects across multiple voxels into 1 model of the outcome, which is then tested on new observations (ie, new patients). Instead of testing each voxel separately, associations with clinical outcomes are combined into a single model, and a single prediction is made for each patient. This approach is common in clinical research; for example, multiple factors, like diet, exercise, and hormone levels, are combined into models of disease risk. Neuroimaging models are based on voxels or network measures rather than risk factors, but the principle is the same.
As long as (1) the model makes a single prediction for each patient and (2) predictions are tested on patient samples independent of those used to derive the model, effect size estimates are unbiased.

A growing number of studies use machine learning and multivoxel pattern analysis to integrate brain information into predictive models. Effect sizes are assessed via prospective application of the model to new, "out-of-training-sample" patients, often using an iterative strategy of training and testing on different subsets of patients, known as cross-validation [7] (see Chang et al [5], for example). Cross-validation can fail in several ways, and it is possible to overfit a cross-validated data set by training many models and picking the best. However, if a model is tested prospectively on new, independent data sets without changing its parameters, then unbiased estimates of effect sizes can be obtained. Bias, or lack thereof, can also be assessed with permutation tests.

Because integrated models combine information distributed across the brain in an optimized way, these models can substantially outperform single regions in predicting outcomes (Figure, C [adapted from data in Chang et al [5]]). Thus, such models provide a promising way to establish meaningful associations between brain measures and clinically relevant outcomes.
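The two conditions above (a single prediction per patient, tested on patients not used to derive the model) can be illustrated with a toy example. This is a hedged sketch under stated assumptions, not the method of Chang et al or of any particular toolbox. A simple covariance-weighted sum of voxels stands in for a machine-learning model, and the outcome is simulated as pure noise, so the true effect size is zero and any apparent predictive accuracy is bias. All function names, the 5-fold split, and the data dimensions are hypothetical.

```python
import math
import random

def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def fit_weights(X, y):
    """Toy model: one weight per voxel, its covariance with the outcome."""
    y_mean = sum(y) / len(y)
    return [sum(row[j] * (yi - y_mean) for row, yi in zip(X, y))
            for j in range(len(X[0]))]

def predict(weights, row):
    """A single prediction per patient: weighted sum over all voxels."""
    return sum(w * x for w, x in zip(weights, row))

def cross_validated_predictions(X, y, n_folds=5):
    """Predict each patient with weights fit on the other folds only."""
    preds = [0.0] * len(y)
    for fold in range(n_folds):
        test_idx = set(range(fold, len(y), n_folds))
        train_X = [X[i] for i in range(len(y)) if i not in test_idx]
        train_y = [y[i] for i in range(len(y)) if i not in test_idx]
        weights = fit_weights(train_X, train_y)  # never sees test patients
        for i in test_idx:
            preds[i] = predict(weights, X[i])
    return preds

# Simulated data: 40 patients, 200 voxels, outcome unrelated to the voxels.
rng = random.Random(0)
n_patients, n_voxels = 40, 200
X = [[rng.gauss(0, 1) for _ in range(n_voxels)] for _ in range(n_patients)]
y = [rng.gauss(0, 1) for _ in range(n_patients)]

full_weights = fit_weights(X, y)
r_in = pearson_r([predict(full_weights, row) for row in X], y)  # same patients
r_cv = pearson_r(cross_validated_predictions(X, y), y)          # held-out patients
print(f"in-sample r:       {r_in:.2f}")
print(f"cross-validated r: {r_cv:.2f}")
```

Because the model is fit and evaluated on the same patients, the in-sample correlation should come out large even though no true relationship exists. The cross-validated correlation should fall near zero (it can be slightly negative), which is the honest answer for these data.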