Comparison of statistical methods commonly used in predictive modelling
Author(s) -
Muñoz Jesús,
Felicísimo Ángel M.
Publication year - 2004
Publication title -
Journal of Vegetation Science
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.1
H-Index - 115
eISSN - 1654-1103
pISSN - 1100-9233
DOI - 10.1111/j.1654-1103.2004.tb02263.x
Subject(s) - multivariate adaptive regression splines , computer science , data mining , factor regression model , regression analysis , range (aeronautics) , principal component analysis , regression , data set , multivariate statistics , statistics , reliability (semiconductor) , logistic regression , cart , machine learning , artificial intelligence , mathematics , nonparametric regression , proper linear model , bayesian multivariate linear regression , geography , engineering , power (physics) , physics , aerospace engineering , archaeology , quantum mechanics
Logistic Multiple Regression, Principal Component Regression and Classification and Regression Tree Analysis (CART), all commonly used in GIS-based ecological modelling, are compared with a relatively new statistical technique, Multivariate Adaptive Regression Splines (MARS), to test their accuracy, reliability, implementation within GIS and ease of use. All four methods were applied to the same two data sets, which cover a wide range of conditions common in predictive modelling, namely geographical range, scale, nature of the predictors and sampling method. We ran two series of analyses to determine whether model validation by an independent data set was required or whether cross‐validation on a learning data set sufficed. Results show that validation by independent data sets is needed. Model accuracy was evaluated using the area under the Receiver Operating Characteristic (ROC) curve (AUC). This measure was used because it summarizes performance across all possible thresholds and is independent of the balance between classes. MARS and Regression Tree Analysis achieved the best prediction success, although the CART model was difficult to use for cartographic purposes because of its high complexity.
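
The two properties of AUC mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function name `auc` and the presence/absence labels and scores are hypothetical, and the computation uses the standard rank-based (Mann-Whitney) formulation, in which AUC is the proportion of presence/absence pairs where the presence site receives the higher score.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney).

    labels: 1 = presence, 0 = absence (hypothetical example data).
    scores: model output for each site; higher means more suitable.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one presence and one absence")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # presence ranked above absence
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical data: 3 presences, 5 absences.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
base = auc(labels, scores)

# No threshold is chosen anywhere: only the ranking of scores matters,
# so any strictly increasing rescaling of the scores leaves AUC unchanged.
rescaled = auc(labels, [s ** 3 for s in scores])

# Doubling the number of absences (same score distribution) changes the
# class balance but not the AUC.
absences = [(y, s) for y, s in zip(labels, scores) if y == 0]
more_labels = labels + [y for y, _ in absences]
more_scores = scores + [s for _, s in absences]
rebalanced = auc(more_labels, more_scores)

print(base, rescaled, rebalanced)
```

The pairwise loop is quadratic in the number of sites, which is fine for a demonstration; a sort-based rank computation would be preferable for large data sets.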