The prediction error in CLS and PLS: the importance of feature selection prior to multivariate calibration
Author(s) - Nadler, Boaz; Coifman, Ronald R.
Publication year - 2005
Publication title - Journal of Chemometrics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.47
H-Index - 92
eISSN - 1099-128X
pISSN - 0886-9383
DOI - 10.1002/cem.915
Subject(s) - partial least squares regression , mean squared error , chemometrics , feature selection , calibration , multivariate statistics , mathematics , statistics , pattern recognition (psychology) , algorithm , computer science , artificial intelligence , machine learning
Abstract - Classical least squares (CLS) and partial least squares (PLS) are two common multivariate regression algorithms in chemometrics. This paper presents an asymptotically exact mathematical analysis of the mean squared error of prediction of CLS and PLS under the linear mixture model commonly assumed in spectroscopy. For CLS regression with a very large calibration set the root mean squared error is approximately equal to the noise per wavelength divided by the length of the net analyte signal vector. It is shown, however, that for a finite training set with n samples in p dimensions there are additional error terms that depend on σ²p²/n², where σ is the noise level per co‐ordinate. Therefore in the 'large p, small n' regime, common in spectroscopy, these terms can be quite large and even dominate the overall prediction error. It is demonstrated both theoretically and by simulations that dimensional reduction of the input data via their compact representation with a few features, selected for example by adaptive wavelet compression, can substantially decrease these effects and recover the asymptotic error. This analysis provides a theoretical justification for the need to perform feature selection (dimensional reduction) of the input data prior to application of multivariate regression algorithms. Copyright © 2005 John Wiley & Sons, Ltd.
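The effect described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's own experiment: it assumes a two-component linear mixture model with hypothetical Gaussian pure-component spectra, uses least-squares calibration on the raw p = 500 wavelengths versus on 10 crude block-averaged features (a stand-in for adaptive wavelet compression), and compares test-set RMSE in the 'large p, small n' regime.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_train, n_test, sigma = 500, 50, 200, 0.05  # large p, small n

# Hypothetical pure-component spectra: two Gaussian bands on [0, 1]
w = np.linspace(0.0, 1.0, p)
s1 = np.exp(-((w - 0.3) ** 2) / 0.01)
s2 = np.exp(-((w - 0.6) ** 2) / 0.02)
S = np.vstack([s1, s2])

def simulate(n):
    """Linear mixture model: spectrum = c1*s1 + c2*s2 + iid noise."""
    c = rng.uniform(0.0, 1.0, size=(n, 2))
    X = c @ S + sigma * rng.standard_normal((n, p))
    return X, c[:, 0]  # predict concentration of analyte 1

X_tr, y_tr = simulate(n_train)
X_te, y_te = simulate(n_test)

def ls_rmse(Xtr, ytr, Xte, yte):
    """Fit least squares (minimum-norm if p > n) and report test RMSE."""
    b, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
    return float(np.sqrt(np.mean((Xte @ b - yte) ** 2)))

# Crude dimensional reduction: average contiguous blocks of 50 wavelengths
def compress(X):
    return X.reshape(len(X), 10, 50).mean(axis=2)

err_full = ls_rmse(X_tr, y_tr, X_te, y_te)
err_comp = ls_rmse(compress(X_tr), y_tr, compress(X_te), y_te)
print(f"RMSE, all {p} wavelengths : {err_full:.4f}")
print(f"RMSE, 10 features        : {err_comp:.4f}")
```

With n < p the raw-wavelength fit interpolates the calibration noise, while the compressed features both shrink p and average down the per-feature noise, so the second RMSE comes out much smaller, consistent with the σ²p²/n² error terms the paper derives.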