Open Access
Cross‐validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure
Author(s) -
Roberts David R.,
Bahn Volker,
Ciuti Simone,
Boyce Mark S.,
Elith Jane,
Guillera‐Arroita Gurutzeta,
Hauenstein Severin,
Lahoz‐Monfort José J.,
Schröder Boris,
Thuiller Wilfried,
Warton David I.,
Wintle Brendan A.,
Hartig Florian,
Dormann Carsten F.
Publication year - 2017
Publication title -
Ecography
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 2.973
H-Index - 128
eISSN - 1600-0587
pISSN - 0906-7590
DOI - 10.1111/ecog.02881
Subject(s) - overfitting , random forest , computer science , cross validation , econometrics , autoregressive model , contrast (vision) , data mining , ecology , statistics , machine learning , mathematics , artificial intelligence , biology , artificial neural network
Ecological data often show temporal, spatial, hierarchical (random effects), or phylogenetic structure. Modern statistical approaches are increasingly accounting for such dependencies. However, when performing cross‐validation, these structures are regularly ignored, resulting in serious underestimation of predictive error. One cause of the poor performance of uncorrected (random) cross‐validation, often noted by modellers, is dependence structures in the data that persist as dependence structures in model residuals, violating the assumption of independence. Even more concerning, because often overlooked, is that structured data also provide ample opportunity for overfitting with non‐causal predictors. This problem can persist even if remedies such as autoregressive models, generalized least squares, or mixed models are used. Block cross‐validation, where data are split strategically rather than randomly, can address these issues. However, the blocking strategy must be carefully considered. Blocking in space, time, random effects, or phylogenetic distance, while accounting for dependencies in the data, may also unwittingly induce extrapolation by restricting the ranges or combinations of predictor variables available for model training, thus overestimating interpolation errors. On the other hand, deliberate blocking in predictor space may improve error estimates when extrapolation is the modelling goal. Here, we review the ecological literature on non‐random and blocked cross‐validation approaches. We also provide a series of simulations and case studies, in which we show that, for all instances tested, block cross‐validation is nearly universally more appropriate than random cross‐validation if the goal is predicting to new data or predictor space, or selecting causal predictors.
We recommend that block cross‐validation be used wherever dependence structures exist in a dataset, even if no correlation structure is visible in the fitted model residuals, or if the fitted models account for such correlations.
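As an illustration of the core idea (not code from the paper), block cross‐validation can be sketched by assigning entire blocks, rather than individual observations, to folds, so that dependent observations never sit on both sides of a train/test split. The blocking variable here (`block_ids`, e.g. spatial grid cells or years) and the fold-assignment scheme are assumptions for this minimal example:

```python
import numpy as np

def block_cv_splits(block_ids, n_folds, seed=0):
    """Yield (train, test) index arrays where each fold is composed of
    whole blocks, so structured (dependent) observations are never
    split between training and test sets."""
    blocks = np.unique(block_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(blocks)
    # round-robin assignment of shuffled blocks to folds
    fold_of_block = {b: i % n_folds for i, b in enumerate(blocks)}
    fold = np.array([fold_of_block[b] for b in block_ids])
    for k in range(n_folds):
        test = np.where(fold == k)[0]
        train = np.where(fold != k)[0]
        yield train, test

# toy example: 20 observations in 5 blocks of 4 (e.g. spatial clusters)
block_ids = np.repeat(np.arange(5), 4)
for train, test in block_cv_splits(block_ids, n_folds=5):
    # no block appears in both the training and the test set
    assert set(block_ids[train]).isdisjoint(set(block_ids[test]))
```

In contrast, a random K-fold split of the same data would typically place observations from the same block in both partitions, leaking the dependence structure and underestimating predictive error.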
