What is a Good Calibration Question?
Author(s) - Victoria Hemming, Anca M. Hanea, Mark A. Burgman
Publication year - 2022
Publication title - Risk Analysis
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.972
H-Index - 130
eISSN - 1539-6924
pISSN - 0272-4332
DOI - 10.1111/risa.13725
Subject(s) - calibration, set (abstract data type), sample (material), computer science, reliability (semiconductor), domain (mathematical analysis), artificial intelligence, machine learning, statistics, data mining, mathematics, mathematical analysis, power (physics), chemistry, physics, chromatography, quantum mechanics, programming language
Abstract - Weighted aggregation of expert judgments based on their performance on calibration questions may improve mathematically aggregated judgments relative to equal weights. However, obtaining validated, relevant calibration questions can be difficult. If so, should analysts settle for equal weights? Or should they use calibration questions that are easier to obtain but less relevant? In this article, we examine what happens to the out-of-sample performance of weighted aggregations of the classical model (CM) compared to equal-weighted aggregations when the set of calibration questions includes many so-called "irrelevant" questions, those that might ordinarily be considered to be outside the domain of the questions of interest. We find that performance-weighted aggregations outperform equal weights on the combined CM score, but not on statistical accuracy (i.e., calibration). Importantly, there was no appreciable difference in performance when weights were developed on relevant versus irrelevant questions. Experts were unable to adapt their knowledge across vastly different domains, and in-sample validation did not accurately predict out-of-sample performance on irrelevant questions. We suggest that if relevant calibration questions cannot be found, then analysts should use equal weights, and draw on alternative techniques to improve judgments. Our study also indicates limits to the predictive accuracy of performance-weighted aggregation, and the degree to which expertise can be adapted across domains. We note limitations in our study and urge further research into the effect of question type on the reliability of performance-weighted aggregations.
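The abstract contrasts two aggregation schemes: equal weighting of expert estimates and performance weighting based on calibration-question scores. The following is a minimal illustrative sketch of that contrast only; it is not the classical model itself (whose weights are derived from formal calibration and information scores), and the expert estimates and scores shown are hypothetical.

```python
import numpy as np

def equal_weight_aggregate(estimates):
    """Equal-weight aggregation: simple average across experts."""
    return np.mean(np.asarray(estimates, dtype=float), axis=0)

def performance_weight_aggregate(estimates, scores):
    """Performance-weighted aggregation: weights proportional to each
    expert's (hypothetical) score on a set of calibration questions."""
    w = np.asarray(scores, dtype=float)
    w = w / w.sum()  # normalize weights so they sum to 1
    return w @ np.asarray(estimates, dtype=float)

# Three experts, each giving probability estimates for two questions of interest
estimates = [[0.60, 0.30],
             [0.80, 0.10],
             [0.70, 0.50]]

# Hypothetical calibration scores (higher = better in-sample performance)
scores = [0.1, 0.6, 0.3]

print(equal_weight_aggregate(estimates))              # -> [0.7  0.3 ]
print(performance_weight_aggregate(estimates, scores))  # -> [0.75 0.24]
```

The paper's finding is about which of these two schemes generalizes out of sample when the calibration questions are irrelevant to the questions of interest; the sketch only shows how the two aggregations differ mechanically.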
