The essential role of balance tests in propensity‐matched observational studies: Comments on ‘A critical appraisal of propensity‐score matching in the medical literature between 1996 and 2003’ by Peter Austin, Statistics in Medicine
Author(s) - Hansen B. B.
Publication year - 2008
Publication title - Statistics in Medicine
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.996
H-Index - 183
eISSN - 1097-0258
pISSN - 0277-6715
DOI - 10.1002/sim.3208
Subject(s) - observational study, propensity score matching, matching (statistics), critical appraisal, history, library science, sociology, psychology, medicine, statistics, computer science, mathematics, alternative medicine, pathology
Peter Austin has made an exacting, timely and eye-opening review of uses of propensity-score matching in medical research. Its Section 2.1 argues that reports of propensity-matched analyses should include descriptive assessments of matched treatment-control differences on baseline variables. When propensity matching on covariates including BMI, for example, one should report the difference between the matched cohorts’ mean BMIs, perhaps after dividing by the pooled s.d. of BMI prior to matching. The recommendation is a good one: matched differences on prognostic variables, and on variables that track selection into treatment, speak to the credibility of subsequent matched outcome analyses; and although the basic promise of propensity matching is that it should lessen such differences, the extent of the reduction varies greatly from case to case. Furthermore, since successful propensity matches or subclassifications enable comparisons similar to those which randomization would have given (in terms of observed covariates, at least [1], and, should those covariates jointly suffice to remove confounding, then also in terms of outcomes [2]), it follows that balance is the basic mark of success of a propensity adjustment.

Austin’s review also makes a negative recommendation: when appraising balance, avoid significance tests. Having compared means of BMI and other variables, Austin would not have us go on to calculate paired or two-sample t-tests, or any other test for a treatment-control difference on BMI. His pessimism about the state of reporting in the medical propensity-matching literature stems in large part from this opinion: only 2 of 47 papers reported balance properly, Austin reports, but it turns out that another 33 were disqualified for having tested balance rather than reporting it using purely descriptive measures. This dim view of balance testing is driven by two complaints, complaints Austin shares with Imai, King and Stuart [3]: