Significance of the structure of data in partial least squares regression predictions involving both natural and human experimental design
Author(s) - Rinnan Åsmund, Munck Lars
Publication year - 2012
Publication title - Journal of Chemometrics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.47
H-Index - 92
eISSN - 1099-128X
pISSN - 0886-9383
DOI - 10.1002/cem.2438
Subject(s) - partial least squares regression, chemometrics, biological system, data set, regression, set (abstract data type), linear regression, covariance, regression analysis, residual, chemistry, mathematics, computer science, statistics, algorithm, machine learning, biology, programming language
When the chemical composition of food samples is predicted from near‐infrared spectroscopy using partial least squares regression, deep knowledge of the origin of the information is lacking. We aim to open the Pandora's box of how the prediction of protein proceeds in a unique set of chemically diverse barley mutant samples. An external validation of the sources of co‐variation in nature that are exploited by chemometric models would give a framework for manipulating the deciding information to make expensive calibration more economical. The barley samples were supplemented by two designed data sets: one mirroring the coarse composition of the barley samples by mixing six main chemical components, and one in which the biological covariance between the six chemical components had been reduced. The three original data sets give remarkably comparable prediction models, although their regression coefficients are quite different. The origin of the prediction ability of the data is elucidated by splitting the natural barley samples into two parts: one based on simulated biology extracted from a set of chemical mixtures, and the residual after the chemistry has been removed from the raw data. As much as 98.1% of the spectral information in the natural barley data is explained through the simulated biology, leaving as little as 1.9% of the spectral information for the unexplained biological variation and noise. However, the unexplained biological variation still gives a fair prediction of protein (RMSECV = 1.23 and r² = 0.80, compared with RMSECV = 0.46 and r² = 0.97 for the natural data), and it gives a clear principal component analysis separation of the three genotype classes.
The results were interpreted by spectral inspection of the origin of the unique covariate patterns that appear in self‐organised biological systems. These findings should motivate researchers and industry to investigate the compressive effect that the model has on the essential deterministic biological data. Copyright © 2012 John Wiley & Sons, Ltd.
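The split of the natural spectra into a part explained by the chemical mixtures and a residual can be illustrated with a small sketch. The abstract does not state how the authors performed this split, so the version below is only one plausible reading: it assumes an orthogonal projection of the mean-centred natural spectra onto the subspace spanned by the leading singular vectors of the mixture spectra. The function name `split_by_mixture_subspace`, the choice of six components, and all data are hypothetical and synthetic; nothing here reproduces the 98.1% / 1.9% figures reported above.

```python
import numpy as np

def split_by_mixture_subspace(X_nat, X_mix, n_components):
    """Split natural spectra into a mixture-explained part and a residual.

    Assumption (not from the paper): the 'explained' part is the orthogonal
    projection of the centred natural spectra onto the leading right singular
    vectors of the centred mixture spectra.
    """
    Xc = X_nat - X_nat.mean(axis=0)          # centre natural spectra
    Mc = X_mix - X_mix.mean(axis=0)          # centre mixture spectra
    # Rows of Vt are orthonormal directions in wavelength space
    _, _, Vt = np.linalg.svd(Mc, full_matrices=False)
    V = Vt[:n_components].T                  # (n_wavelengths, n_components)
    explained = Xc @ V @ V.T                 # part lying in the mixture subspace
    residual = Xc - explained                # what the mixtures cannot describe
    # Fraction of the centred sum of squares captured by the projection
    frac = np.sum(explained ** 2) / np.sum(Xc ** 2)
    return explained, residual, frac

# Synthetic stand-in data: six hypothetical pure-component spectra
rng = np.random.default_rng(0)
n_wavelengths = 200
pure = rng.normal(size=(6, n_wavelengths))

# Designed mixtures and 'natural' samples built from the same six components;
# the natural samples also carry a small unexplained noise term
X_mix = rng.uniform(size=(40, 6)) @ pure
X_nat = rng.uniform(size=(60, 6)) @ pure + 0.01 * rng.normal(size=(60, n_wavelengths))

explained, residual, frac = split_by_mixture_subspace(X_nat, X_mix, n_components=6)
print(f"fraction explained by mixture subspace: {frac:.4f}")
print(f"fraction left in residual:              {1 - frac:.4f}")
```

Because the projection is orthogonal, the explained and residual sums of squares add up to the total, which is what makes a percentage split of the spectral information meaningful. The residual matrix could then be passed to a regression or principal component analysis of one's choice to check whether it still carries predictive information, as the abstract reports for protein.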