Stochastic Simulation of Test Collections
Author(s) - Julián Urbano, Thomas Nagler
Publication year - 2018
Publication title - data archiving and networked services (dans)
Language(s) - English
Resource type - Conference proceedings
ISBN - 978-1-4503-5657-2
DOI - 10.1145/3209978.3210043
Subject(s) - computer science, vine copula, resampling, workaround, parametric statistics, population, sample (material), data mining, impossibility, algorithm, machine learning, statistics, mathematics, chemistry, demography, multivariate statistics, chromatography, sociology, political science, law, programming language
Part of Information Retrieval evaluation research is limited by the fact that we do not know the distributions of system effectiveness over the populations of topics and, by extension, their true mean scores. The usual workaround consists of resampling topics from an existing collection and approximating the statistics of interest with the observations made between random subsamples, as if one represented the population and the other a random sample. However, this methodology is clearly limited by the availability of data, the impossibility of controlling the properties of these data, and the fact that we do not really measure what we intend to measure. To overcome these limitations, we propose a method based on vine copulas for stochastic simulation of evaluation results where the true system distributions are known upfront. In the basic use case, it takes the scores from an existing collection to build a semi-parametric model representing the set of systems and the population of topics, which can then be used to make realistic simulations of the scores by the same systems on random new topics. Our ability to simulate this kind of data not only eliminates the current limitations, but also offers new opportunities for research. As an example, we show the benefits of this approach in two sample applications replicating typical experiments found in the literature. We provide a full R package to simulate new data following the proposed method, which can also be used to fully reproduce the results in this paper.
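To make the idea of simulating scores with known true distributions more concrete, the following is a minimal Python sketch and not the authors' method or their R package. It swaps the vine copula for a single bivariate Gaussian copula over two systems and uses plain empirical margins instead of the semi-parametric margins the abstract mentions. The score lists, function names, and sample size are all hypothetical and chosen only for illustration.

```python
import math
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()  # standard normal, used for the copula transforms


def normal_scores(values):
    """Map observations to standard-normal scores through their ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def correlation(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def empirical_quantile(sorted_values, u):
    """Inverse of the empirical step CDF built from the observed scores."""
    n = len(sorted_values)
    return sorted_values[min(n - 1, int(u * n))]


def simulate(scores_a, scores_b, n_topics, rng):
    """Draw (score_a, score_b) pairs for new topics from a Gaussian copula."""
    # Dependence between the two systems, estimated on the normal-score scale.
    rho = correlation(normal_scores(scores_a), normal_scores(scores_b))
    noise_scale = math.sqrt(max(0.0, 1.0 - rho ** 2))
    sorted_a, sorted_b = sorted(scores_a), sorted(scores_b)
    pairs = []
    for _ in range(n_topics):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + noise_scale * rng.gauss(0.0, 1.0)
        # Correlated normals -> uniforms -> scores on each system's margin.
        pairs.append((empirical_quantile(sorted_a, STD_NORMAL.cdf(z1)),
                      empirical_quantile(sorted_b, STD_NORMAL.cdf(z2))))
    return pairs


# Hypothetical per-topic effectiveness scores of two systems (made-up data).
scores_a = [0.12, 0.35, 0.41, 0.08, 0.66, 0.52, 0.27, 0.73, 0.19, 0.48]
scores_b = [0.10, 0.30, 0.47, 0.05, 0.59, 0.55, 0.21, 0.70, 0.25, 0.40]

# With empirical margins, each model margin is uniform over the observed
# scores, so the "true" mean of the simulated population is known upfront.
true_mean_a = mean(scores_a)
true_mean_b = mean(scores_b)

simulated = simulate(scores_a, scores_b, n_topics=5000, rng=random.Random(42))
sim_mean_a = mean([a for a, _ in simulated])
sim_mean_b = mean([b for _, b in simulated])
```

Because the model margins are fixed before any data are drawn, `true_mean_a` and `true_mean_b` play the role of the population means that are unknown when resampling from a real collection, and statistics computed on `simulated` can be compared against them directly. A vine copula would extend the same construction to many systems by composing pair copulas, which this two-system sketch does not attempt.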
