Using Machine Learning to Predict Geomorphic Disturbance: The Effects of Sample Size, Sample Prevalence, and Sampling Strategy
Author(s) - Perry George L. W., Dickson Mark E.
Publication year - 2018
Publication title -
Journal of Geophysical Research: Earth Surface
Language(s) - English
Resource type - Journals
eISSN - 2169-9011
pISSN - 2169-9003
DOI - 10.1029/2018JF004640
Subject(s) - sampling (statistics), computer science, sample size determination, categorical variable, machine learning, hyperparameter, covariate, artificial intelligence, statistics, sampling bias, data mining, mathematics
Advances in data acquisition and statistical methodology have led to the growing use of machine‐learning methods to predict geomorphic disturbance events. However, capturing the data required to parameterize these models is challenging because of expense or, more fundamentally, because the phenomenon of interest occurs infrequently. Thus, it is important to understand how the nature of the data used to train predictive models influences their performance. Using a database of cliff failures and associated covariates from Auckland, New Zealand, we assess the performance of seven machine‐learning algorithms under different sampling strategies. Three sampling components are investigated: (i) the number of data points used in model training (sample size), (ii) the prevalence of occurrences (presences) in the data, and (iii) random versus spatial sampling strategy. Across the seven algorithms, small sample sizes can produce models that perform relatively well, especially if the prime concern is identifying key predictors rather than quantifying risk or predicting categorical outcomes. Our analyses show that for the same effort (i.e., number of samples), sampling around multiple locations provides better predictions than sampling at just one or a few locations. Predictive performance may be further improved by considering what absences actually represent and by paying careful attention to decisions about hyperparameter tuning, training‐testing data splits, and threshold optimization. It is well known that big data can inform complex data‐driven modeling, but here we show that careful sampling can facilitate informative event prediction even from small data.
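The abstract's point about sampling strategy can be illustrated with a minimal NumPy sketch on a hypothetical synthetic landscape (this is not the authors' Auckland data or their actual method): with a fixed sampling effort, points clustered around a single location span a much narrower range of an environmental covariate than points drawn at random across the whole landscape, which limits what a model trained on them can learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical landscape: a 100 x 100 grid of sites, each with one
# covariate (e.g., slope) that varies smoothly in space.
n = 100
xx, yy = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
slope = np.sin(3 * xx) + np.cos(2 * yy) + 0.1 * rng.standard_normal((n, n))
coords = np.column_stack([xx.ravel(), yy.ravel()])
slope = slope.ravel()

def covariate_coverage(idx):
    """Fraction of the landscape-wide slope range spanned by a sample."""
    s = slope[idx]
    return (s.max() - s.min()) / (slope.max() - slope.min())

m = 200  # fixed sampling effort (number of training points)

# Strategy 1: simple random sample across the whole landscape.
random_idx = rng.choice(len(slope), size=m, replace=False)

# Strategy 2: same effort, but the m sites nearest one random location,
# mimicking intensive sampling around a single study site.
centre = coords[rng.integers(len(coords))]
dists = np.linalg.norm(coords - centre, axis=1)
clustered_idx = np.argsort(dists)[:m]

print(f"random coverage:    {covariate_coverage(random_idx):.2f}")
print(f"clustered coverage: {covariate_coverage(clustered_idx):.2f}")
```

The spatially dispersed sample covers nearly the full covariate range, while the single-site sample sees only a local slice of it, which is one mechanism behind the paper's finding that, for the same effort, sampling around multiple locations yields better predictions.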