Combining evidence across multiple endpoints with a global statistical test: Comparison of z‐scores versus ranks
Author(s) -
Dickson Samuel P.,
Knowlton Newman,
Hendrix Suzanne B.
Publication year - 2020
Publication title -
Alzheimer's and Dementia
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 6.713
H-Index - 118
eISSN - 1552-5279
pISSN - 1552-5260
DOI - 10.1002/alz.046771
Subject(s) - type i and type ii errors , standardization , statistical power , ceiling (cloud) , ranking (information retrieval) , null hypothesis , ceiling effect , disease , medicine , statistics , psychology , computer science , econometrics , mathematics , machine learning , engineering , alternative medicine , structural engineering , pathology , operating system
Background: Alzheimer’s disease (AD) is a multi‐symptom disease with cognitive, behavioral, functional and global outcomes. These outcomes are all important measures of disease severity and are driven by the underlying disease process. They provide multiple, potentially conflicting, answers to the question, “Did the treatment work?”

Method: Outcomes can be combined into a global statistical test (GST) by standardizing each outcome using ranks or z‐scores. The standardization can be performed either before or after calculating change from baseline. These four combinations are explored in simulations with no effect, corroborative effects across outcomes, and disparate effects across outcomes (a mix of positive and null effects) to assess type I error (under no effect) and power under various scenarios, along with other performance metrics.

Result: All four methods of calculating a GST adequately control type I error. In the absence of ceiling and floor effects, the GST that calculates change from baseline first and then combines evidence using z‐scores has the highest power. In the presence of disparate effects, the GST can still be more powerful than any single outcome, though the effect is appropriately attenuated. The GSTs that use ranking outperform the z‐score methods in the presence of strong floor or ceiling effects.

Conclusion: A clinical trial with a GST analyzed first can show success as a proof of concept even if the individual outcomes fail to achieve significance, signalling that development can proceed to later phases. Change from baseline should be calculated first. If there are no ceiling or floor effects, standardization should be performed using z‐scores; otherwise ranks (percentiles) should be used. GSTs offer a good way to combine outcomes to demonstrate efficacy using fewer subjects.
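The four combinations described in the Method section (z‐scores or ranks, applied before or after taking change from baseline) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not give these details, so several choices are assumptions: the per‐subject GST score is the unweighted sum of standardized outcomes, "standardize first" pools baseline and follow‐up onto one scale, higher values mean improvement, arms are compared with a Welch t statistic, and the simulated outcomes are independent normals. All function names (`zscore`, `rank_pct`, `gst_scores`, `welch_t`, `demo`) are hypothetical.

```python
import numpy as np


def zscore(x):
    """Standardize each column (outcome) to mean 0 and SD 1 across subjects."""
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)


def rank_pct(x):
    """Replace each column by percentile ranks in (0, 1); ties share the average rank."""
    n = x.shape[0]
    out = np.empty(x.shape, dtype=float)
    for j in range(x.shape[1]):
        col = x[:, j]
        ranks = np.empty(n)
        ranks[np.argsort(col, kind="stable")] = np.arange(1, n + 1)
        # Average the ranks within each group of tied values.
        _, inv, counts = np.unique(col, return_inverse=True, return_counts=True)
        avg = np.bincount(inv, weights=ranks) / counts
        out[:, j] = avg[inv] / (n + 1)
    return out


def gst_scores(baseline, followup, standardize, change_first=True):
    """Per-subject GST score: sum of standardized outcomes (assumed unweighted)."""
    if change_first:
        # Change from baseline first, then standardize the change scores.
        return standardize(followup - baseline).sum(axis=1)
    # Standardize first (assumption: both visits pooled onto one scale), then change.
    n = baseline.shape[0]
    pooled = standardize(np.vstack([baseline, followup]))
    return (pooled[n:] - pooled[:n]).sum(axis=1)


def welch_t(a, b):
    """Welch two-sample t statistic for mean(a) - mean(b)."""
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / se


def demo(seed=0, n_per_arm=200, n_outcomes=3, effect=0.5):
    """Simulate one trial with a corroborative effect and score all four variants."""
    rng = np.random.default_rng(seed)
    n = 2 * n_per_arm
    treated = np.arange(n) >= n_per_arm
    baseline = rng.normal(size=(n, n_outcomes))
    # Assumed data model: unit-variance change noise plus a shift in the treated arm.
    followup = baseline + rng.normal(size=(n, n_outcomes)) + effect * treated[:, None]
    results = {}
    for name, fn in (("z", zscore), ("rank", rank_pct)):
        for change_first in (True, False):
            s = gst_scores(baseline, followup, fn, change_first)
            results[(name, change_first)] = welch_t(s[treated], s[~treated])
    return results


if __name__ == "__main__":
    for (method, change_first), t in demo().items():
        order = "change first" if change_first else "standardize first"
        print(f"{method:>4} / {order}: t = {t:.2f}")
```

Repeating `demo` over many seeds with `effect=0` would estimate type I error, and with a nonzero effect would estimate power. Clipping the simulated outcomes at a bound is one way to mimic ceiling or floor effects in this sketch.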