Regular, median and Huber cross‐validation: A computational comparison
Author(s) - Yu ChiWai, Clarke Bertrand
Publication year - 2015
Publication title - Statistical Analysis and Data Mining: The ASA Data Science Journal
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.381
H-Index - 33
eISSN - 1932-1872
pISSN - 1932-1864
DOI - 10.1002/sam.11254
Subject(s) - outlier, estimator, residual, mathematics, statistics, least squares function approximation, noise (video), cross validation, algorithm, computer science, artificial intelligence, image (mathematics)
We present a new technique for comparing models using a median form of cross‐validation and least median of squares estimation (MCV‐LMS). Rather than minimizing the sum of the squared residual errors, we minimize the median of the squared residual errors. We compare this with a robustified form of cross‐validation using the Huber loss function and robust coefficient estimators (HCV). Through extensive simulations we find that, for linear models, MCV‐LMS outperforms HCV for data that is representative of the data generator when the tails of the noise distribution are heavy enough and asymmetric enough. We also find that MCV‐LMS is often better able to detect the presence of small terms. Otherwise, HCV typically outperforms MCV‐LMS for ‘good’ data. MCV‐LMS also outperforms HCV in the presence of enough severe outliers. One of MCV‐LMS and HCV also generally gives better model selection for linear models than the conventional version of cross‐validation with least squares estimators (CV‐LS) when the tails of the noise distribution are heavy or asymmetric, or when the coefficients are small and the data is representative. CV‐LS performs well only when the tails of the error distribution are light and symmetric and the coefficients are large relative to the noise variance. Outside of these contexts and the contexts noted above, HCV outperforms CV‐LS and MCV‐LMS. We illustrate CV‐LS, HCV, and MCV‐LMS via numerous simulations to map out when each does best on representative data, and then apply all three to a real dataset from econometrics that includes outliers.
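To make the three criteria in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It rests on several assumptions: K-fold splitting, a random elemental-subset approximation to least median of squares, an iteratively reweighted least squares Huber fit with the conventional tuning constant 1.345, and Huber loss applied to unscaled held-out residuals. All function names (`fit_ls`, `fit_huber`, `fit_lms`, `huber_loss`, `cv_score`) and the simulated data are hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_ls(X, y):
    """Ordinary least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def huber_loss(r, delta=1.345):
    """Huber loss: quadratic for |r| <= delta, linear beyond (delta is an assumed default)."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

def fit_huber(X, y, delta=1.345, n_iter=50):
    """Huber M-estimate via iteratively reweighted least squares with a MAD scale."""
    beta = fit_ls(X, y)
    for _ in range(n_iter):
        r = y - X @ beta
        scale = np.median(np.abs(r - np.median(r))) / 0.6745 + 1e-12
        u = np.abs(r) / scale
        w = np.where(u <= delta, 1.0, delta / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta

def fit_lms(X, y, n_trials=500, seed=0):
    """Approximate least median of squares: try exact fits to random p-point
    subsets and keep the one with the smallest median squared residual."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_beta = fit_ls(X, y)  # fallback candidate
    best_crit = np.median((y - X @ best_beta) ** 2)
    for _ in range(n_trials):
        idx = rng.choice(n, size=p, replace=False)
        try:
            beta = np.linalg.solve(X[idx], y[idx])
        except np.linalg.LinAlgError:
            continue  # singular subset, skip it
        crit = np.median((y - X @ beta) ** 2)
        if crit < best_crit:
            best_beta, best_crit = beta, crit
    return best_beta

def cv_score(X, y, fit, aggregate, k=5, seed=0):
    """K-fold CV: collect held-out residuals, then summarize them with `aggregate`."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    resid = np.empty(n)
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        beta = fit(X[train], y[train])
        resid[test] = y[test] - X[test] @ beta
    return aggregate(resid)

# Each criterion pairs an estimator with a summary of the held-out residuals.
criteria = {
    "CV-LS": (fit_ls, lambda r: np.sum(r**2)),
    "HCV": (fit_huber, lambda r: np.sum(huber_loss(r))),
    "MCV-LMS": (fit_lms, lambda r: np.median(r**2)),
}

# Simulated data (assumed for illustration): heavy-tailed noise plus 10% gross outliers.
rng = np.random.default_rng(1)
n = 120
x = rng.normal(size=n)
y = 1.0 + 3.0 * x + rng.standard_t(df=3, size=n)
y[:12] += 30.0

models = {
    "intercept": np.ones((n, 1)),
    "intercept+x": np.column_stack([np.ones(n), x]),
}

scores = {
    (c, m): cv_score(X, y, fit, agg)
    for c, (fit, agg) in criteria.items()
    for m, X in models.items()
}
for key, val in scores.items():
    print(key, round(float(val), 3))
```

Within each criterion, the candidate model with the smaller score is the one selected. Scores are not comparable across criteria, since each uses a different loss. The random-subset search is only an approximation to the exact LMS fit, so results can vary with `n_trials` and `seed`.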