Regression and Correlation
Author(s) - Harmatz Jerold S., Greenblatt David J.
Publication year - 2015
Publication title - Clinical Pharmacology in Drug Development
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.711
H-Index - 22
eISSN - 2160-7648
pISSN - 2160-763X
DOI - 10.1002/cpdd.200
Subject(s) - medicine, citation, library science, computer science
We took on this topic more than 2 decades ago, having encountered an increasing number of research presentations with poorly reasoned statistics. Our objective was to examine the use of regression and correlation procedures in biomedical sciences, with a particular focus on applications in clinical pharmacology and psychopharmacology. Others have since considered these issues, and the topic continues to be timely up to the present. If anything, the problem is becoming more prevalent, thereby affecting journal contributors as well as reviewers. Many training programs do not incorporate any kind of biomedical statistics requirement. Research programs often have replaced statistical consultation with readily available and largely automatic personal computer-based statistical calculation packages.

Regression and correlation approaches are typically applied to a set of data pairs [(X1, Y1), (X2, Y2), ..., (Xn, Yn)] in the context of evaluating the relation between X and Y. Unless otherwise specified, “regression” is assumed to imply “linear regression”; that is, the underlying relationship between X and Y is assumed to be a linear function. The best-fitting linear function is one that minimizes the sum of squared vertical (Y-direction) distances of the individual (X, Y) points from the corresponding straight line. The “correlation coefficient,” usually represented as r, is a quantitative measure of the extent to which the data points match the function. The “coefficient of determination” refers to the square of the correlation coefficient (r²), representing the fraction of variance in Y that is accounted for by X. As such, r² indicates the importance of the X–Y relationship. Inferential statistical tests are typically applied to r and r² values determined through linear regression analysis.
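The quantities defined above can be illustrated with a minimal Python sketch. It assumes ordinary least squares, in which distances are measured along the Y axis. The function names `linear_fit` and `fraction_explained` and the data values are hypothetical, invented for illustration; they do not come from the article.

```python
import math

def linear_fit(xs, ys):
    """Ordinary least-squares line y = intercept + slope * x, plus Pearson r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

def fraction_explained(xs, ys, slope, intercept):
    """1 - (residual sum of squares / total sum of squares about the mean of Y)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Hypothetical data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.7]

slope, intercept, r = linear_fit(xs, ys)
r_squared = r * r
print(f"slope={slope:.3f} intercept={intercept:.3f} r={r:.3f} r^2={r_squared:.3f}")
print(f"fraction of Y variance explained: {fraction_explained(xs, ys, slope, intercept):.3f}")
```

For a least-squares line, the squared correlation coefficient and the explained fraction of variance coincide, which is why r² rather than r is the natural measure of how much of Y is accounted for by X.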
The P value from such a test addresses only whether X explains some nonzero, and possibly trivial, fraction of the variance in Y; that is, it reflects how probable the observed data would be if X and Y had no relationship to each other. The issue is that, as the number of X,Y pairs increases, the size of the r or r² values needed to achieve “statistical significance” becomes smaller and smaller (Table 1). With 40 X,Y pairs, an r² value of less than 0.1 is “significant” at the P < .05 level. This indicates that, although a result this strong would arise less than 5% of the time if Y were unrelated to X, less than 10% of the variance in Y is explained by X. Although statistically significant in this case, the r and r² values provide little insight into the biomedical relation between X and Y, and minimal evidence that X “predicts” Y. With larger sample sizes, in the range of 200 or more X,Y pairs, “statistical significance” is achieved with r² values of 0.02 (Table 1)!

Even with careful and circumspect interpretation of r, r², and statistical significance, regression and correlation still can mislead. Common sources of trouble include: (1) A linear model is not an appropriate fit; (2) Data points are clustered rather than varying over a range; (3) Relationships are “driven” by outlying data points. Visual inspection of X,Y plots is needed to detect these problems.

Figure 1 is an example of bias introduced by outlying data. Linear regression analysis of the 8 data points indicates a highly significant (P < .001) relationship between X and Y, with X explaining 95% of the variability in Y (r² = 0.95). However, visual inspection indicates that the relationship is solely dependent on a single outlying

Clinical Pharmacology in Drug Development 2015, 4(3) 161–162. © 2015, The American College of Clinical Pharmacology. DOI: 10.1002/cpdd.200
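The way the significance threshold shrinks with sample size, as described in the text, can be checked numerically. The sketch below is not the authors' method for building their table. It assumes the usual two-sided t test of r with n − 2 degrees of freedom, where t = r·sqrt((n − 2)/(1 − r²)). It uses only the Python standard library, obtaining the P value by Simpson's-rule integration of the t density and the critical r by bisection. The function names and the sample sizes are illustrative choices.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))

def two_sided_p(r, n, steps=2000):
    """Two-sided P value for the hypothesis of no linear relationship."""
    df = n - 2
    t = abs(r) * math.sqrt(df / (1.0 - r * r))
    # Simpson's rule (steps must be even) for the density integral from 0 to t.
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3
    return max(0.0, 1.0 - 2.0 * central_area)

def critical_r(n, alpha=0.05):
    """Smallest |r| reaching two-sided significance at level alpha (bisection)."""
    lo, hi = 0.0, 0.999999
    for _ in range(60):
        mid = (lo + hi) / 2
        if two_sided_p(mid, n) > alpha:
            lo = mid   # not yet significant: the threshold lies above mid
        else:
            hi = mid   # already significant: the threshold lies at or below mid
    return hi

for n in (10, 40, 200, 1000):
    r = critical_r(n)
    print(f"n={n:5d}  critical r={r:.3f}  critical r^2={r * r:.3f}")
```

Under these assumptions the critical r² comes out just under 0.1 for 40 pairs and close to 0.02 for 200 pairs, consistent with the figures quoted in the text. The threshold keeps falling as n grows, which is why significance alone says little about the strength of the X–Y relationship.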