Special issue on Big Data and the Statistical Sciences: Guest Editor's Introduction
Author(s) - Lockhart, Richard
Publication year - 2018
Publication title - The Canadian Journal of Statistics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.804
H-Index - 51
eISSN - 1708-945X
pISSN - 0319-5724
DOI - 10.1002/cjs.11350
Subject(s) - citation, computer science, big data, library science, data science, data mining
The era of Big Data is well underway. It presents, on the one hand, clear challenges to the discipline of statistics and to statisticians and, on the other hand, many opportunities for statistical scientists. We statisticians are challenged to show our leadership in what is clearly our traditional domain: data. At the same time, Big Data offers many opportunities for statistical scientists to push science, technology, and engineering forward and to show that the basic ideas of our field remain relevant, nay, critically important, in this new era.
Current chatter around the meaning of Big Data shows that the terms “statistician” and “data scientist” are widely used, outside the community to which I belong, to apply to a much wider group than that community. Essentially every discipline has data; with those data come discipline-specific data scientists. Techniques and jargon develop independently of work in other fields. More importantly for the readership of The Canadian Journal of Statistics, those techniques often ignore the more encompassing, general work of statisticians such as we. We understand well that the basic ideas underlying our study of data analytic techniques apply in all sorts of contexts and that lessons learned in one context have important value in other contexts in entirely different disciplines.
This issue of The Canadian Journal of Statistics is therefore dedicated to Big Data and the Statistical Sciences and to highlighting both the value of classical statistical thought in approaching novel large-scale data problems and the challenges facing the professional statistical community. In this issue you will see that these classical statistical ideas continue to have a crucial role to play in keeping data analysis honest, efficient, and effective. You will see opportunities for new statistical methodology built on old statistical ideas across a wide spectrum of applications. You will see that huge new computing resources do not put an end to the need for careful modelling, for honest assessment of uncertainty, or for good experimental design.
We have here both review articles and methodological proposals. Some are Bayesian in view, some are frequentist, and some are clearly both. We cover experimental design, Official Statistics, modern genetics, on-line methods, Markov Chain Monte Carlo, functional data, graphical models, dimension reduction, local methods, model selection, post-selection inference, high-dimensional limit theory, and many more ideas. I want in the remainder of this introduction to highlight a few of those ideas, draw some connections to the challenges I have mentioned, and perhaps point out places where our community has particular obligations.
Mary Thompson looks at the impact of Big Data on Official Statistics in a wide-ranging review. She articulates many ways in which our ability to collect more data with much greater complexity and to fit much larger models will change the way agencies like Statistics Canada do their work. For instance, some concepts are traditionally defined in terms which suit the way they are measured more than the underlying idea; access to larger and timelier data sources may change this balance. As another example, most statistical agencies are moving rapidly to augment, or replace, traditional survey data with administrative data and to use the paradata gathered automatically as part of electronic data collection methods; statisticians will need to cope with data quality issues in the administrative data, at least, because those data were not gathered for the statistical agency’s purpose. Thompson considers carefully the impact of continuous or rolling data collection and discusses the future use of visualization in Official Statistics before concluding with an important list of research topics needing attention from statisticians. The topics show clearly that one impact of Big Data is positive for statisticians: there are many new and serious problems squarely situated within our field.