Significance of the structure of data in partial least squares regression predictions involving both natural and human experimental design

Author(s) - Åsmund Rinnan, Lars Munck

Publication year - 2012

Publication title - Journal of Chemometrics

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.47

H-Index - 92

eISSN - 1099-128X

pISSN - 0886-9383

DOI - 10.1002/cem.2438

Subject(s) - partial least squares regression, chemometrics, biological system, data set, regression, set (abstract data type), linear regression, covariance, regression analysis, residual, chemistry, mathematics, computer science, statistics, algorithm, machine learning, biology, programming language

When the chemical composition of food samples is predicted from near‐infrared spectroscopy using partial least squares regression, deep knowledge of the origin of the information is lacking. We aim to open the Pandora's box of how the prediction of protein proceeds in a unique set of chemically diverse barley mutant samples. An external validation of the sources of co‐variation in nature that are exploited by chemometric models would give a framework for manipulating the deciding information to make expensive calibration more economical. The barley samples were supplemented by two designed data sets: one mirroring the coarse composition of the barley samples by mixing six main chemical components, and one in which the biological covariance between the six chemical components had been reduced. The three original data sets give remarkably comparable prediction models, although their regression coefficients are quite different. The origin of the prediction ability of the data is elucidated by splitting the natural barley data into two parts: one based on simulated biology extracted from a set of chemical mixtures, and the residual after the chemistry has been removed from the raw data. As much as 98.1% of the spectral information in the natural barley data is explained by the simulated biology, leaving as little as 1.9% for unexplained biological variation and noise. However, the unexplained biological variation still gives a fair prediction of protein (RMSECV = 1.23 and r² = 0.80, compared with RMSECV = 0.46 and r² = 0.97 for the natural data), and it gives a clear principal component analysis separation of the three genotype classes. The results were interpreted by spectral inspection of the origin of the unique covariate patterns appearing in self‐organised biological systems, which should motivate researchers and industry to investigate the compressive effect that the model has on the essential deterministic biological data. Copyright © 2012 John Wiley & Sons, Ltd.
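
The abstract does not say how the natural spectra were split into a "simulated biology" part and a residual, so the following is only a minimal sketch of one plausible reading: an orthogonal projection of the centred natural spectra onto the subspace spanned by the leading principal directions of the mixture spectra. The function name `split_by_reference_subspace`, the choice of five components, and all the data are assumptions made for illustration; the data are synthetic and unrelated to the barley measurements.

```python
import numpy as np


def split_by_reference_subspace(X, X_ref, n_components):
    """Split spectra X into the part lying in the subspace spanned by the
    leading principal directions of X_ref, and the residual.

    Hypothetical helper: an assumed reading of the split, not the paper's method.
    """
    # Principal directions (rows of Vt) of the centred reference spectra.
    _, _, Vt = np.linalg.svd(X_ref - X_ref.mean(axis=0), full_matrices=False)
    P = Vt[:n_components].T          # orthonormal basis, shape (n_wavelengths, n_components)

    Xc = X - X.mean(axis=0)          # centre the spectra being decomposed
    X_explained = Xc @ P @ P.T       # orthogonal projection onto the reference subspace
    X_residual = Xc - X_explained    # what the reference subspace cannot describe

    frac_explained = np.sum(X_explained ** 2) / np.sum(Xc ** 2)
    return X_explained, X_residual, frac_explained


# Synthetic stand-in data (assumed, NOT the barley or mixture data from the paper).
rng = np.random.default_rng(0)
n_wavelengths = 50
pure = rng.normal(size=(6, n_wavelengths))            # six made-up pure-component spectra
mixtures = rng.dirichlet(np.ones(6), size=40) @ pure  # designed mixtures, fractions sum to 1
natural = (rng.dirichlet(np.ones(6), size=30) @ pure
           + 0.05 * rng.normal(size=(30, n_wavelengths)))  # "natural" samples plus extra variation

explained, residual, frac = split_by_reference_subspace(natural, mixtures, n_components=5)
print(f"explained fraction: {frac:.3f}, residual fraction: {1 - frac:.3f}")
```

Five components are used because six mixture fractions that sum to one leave five independent directions after centring. Under this reading, the residual matrix would be the input to a separate regression or principal component analysis, which is how a small residual fraction could still be tested for predictive content.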
