Non‐parametric statistical methods for multivariate calibration model selection and comparison
Author(s) - Thomas Edward V.
Publication year - 2003
Publication title - Journal of Chemometrics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.47
H-Index - 92
eISSN - 1099-128X
pISSN - 0886-9383
DOI - 10.1002/cem.833
Subject(s) - latent variable, latent variable model, multivariate statistics, partial least squares regression, calibration, statistics, parametric statistics, principal component regression, latent class model, mathematics, computer science, principal component analysis, regression analysis, model selection
Model selection is an important issue when constructing multivariate calibration models using methods based on latent variables (e.g. partial least squares regression and principal component regression). An accurate and precise calibration model requires an appropriate number of latent variables. Including too few can result in a model that is inaccurate over the complete space of interest. Including too many can result in a model that produces noisy predictions through incorporation of low‐order latent variables that have little or no predictive value. Commonly used metrics for selecting the number of latent variables are based on the predicted error sum of squares (PRESS) obtained via cross‐validation. This paper proposes a new approach in which the cross‐validated prediction errors of individual observations are compared across models incorporating varying numbers of latent variables. Based on these comparisons, non‐parametric statistical methods are used to select the simplest model (the one with the fewest latent variables) whose prediction quality is indistinguishable from that of more complex models. Unlike methods based on PRESS, this new approach is robust to the effects of anomalous observations. More generally, the same approach can be used to compare the performance of any models that are applied to the same data set for which reference values are available. The proposed methodology is illustrated with an industrial example involving the prediction of gasoline octane numbers from near‐infrared spectra. Published in 2004 by John Wiley & Sons, Ltd.
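
The selection rule described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the paper's procedure: the abstract does not name the non‐parametric test, so a one‐sided paired sign test on absolute cross‐validated errors is assumed here, and the function names (`sign_test_p`, `select_n_latent`), the significance level `alpha` and the rule of comparing each candidate against every more complex model are all hypothetical choices made for illustration.

```python
from math import comb


def sign_test_p(abs_simple, abs_complex):
    """One-sided paired sign test (assumed test, not taken from the paper).

    Returns the p-value for the alternative that the more complex model
    tends to give smaller absolute errors than the simpler one.
    """
    wins = sum(1 for s, c in zip(abs_simple, abs_complex) if c < s)
    losses = sum(1 for s, c in zip(abs_simple, abs_complex) if c > s)
    n = wins + losses  # tied pairs carry no sign information and are dropped
    if n == 0:
        return 1.0
    # P(X >= wins) for X ~ Binomial(n, 1/2)
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


def select_n_latent(cv_errors, alpha=0.05):
    """Pick the smallest number of latent variables that is not beaten.

    cv_errors[k] holds the cross-validated prediction errors, one per
    observation, of the model with k + 1 latent variables.
    """
    if not cv_errors:
        raise ValueError("cv_errors must contain at least one model")
    abs_err = [[abs(e) for e in row] for row in cv_errors]
    for k, simple in enumerate(abs_err):
        # Accept model k if no more complex model is significantly better.
        # The most complex model has nothing to be compared with, so the
        # loop always returns by the last iteration.
        if all(sign_test_p(simple, other) >= alpha for other in abs_err[k + 1:]):
            return k + 1


# Hypothetical errors for 10 observations and 1, 2 and 3 latent variables.
# The 2-variable model has one anomalous observation (error 50.0).
errors = [
    [1.0] * 10,
    [50.0] + [0.1] * 9,
    [0.11, 0.09] * 5,
]
chosen = select_n_latent(errors)
```

In this made‐up data the single anomalous error inflates the sum of squared errors of the two‐variable model far above that of the one‐variable model, whereas the sign test only counts which model is closer for each observation, which is the kind of robustness the abstract refers to.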