Fast method for GA‐PLS with simultaneous feature selection and identification of optimal preprocessing technique for datasets with many observations
Author(s) - Stefansson Petter, Liland Kristian H., Thiis Thomas, Burud Ingunn
Publication year - 2020
Publication title - Journal of Chemometrics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.47
H-Index - 92
eISSN - 1099-128X
pISSN - 0886-9383
DOI - 10.1002/cem.3195
Subject(s) - preprocessor, partial least squares regression, computer science, covariance matrix, pattern recognition (psychology), feature selection, set (abstract data type), data pre processing, feature (linguistics), genetic algorithm, matrix (chemical analysis), algorithm, artificial intelligence, data mining, machine learning, chemistry, linguistics, philosophy, chromatography, programming language
A fast and memory‐efficient new method is presented for performing genetic algorithm partial least squares (GA‐PLS) on spectroscopic data preprocessed in several different ways. The method, which is primarily intended for datasets containing many observations, preprocesses a spectral dataset with several different techniques and concatenates the resulting versions of the data horizontally into a design matrix X that is both tall and wide. This large matrix is then condensed into a substantially smaller covariance matrix XᵀX whose size is independent of the number of observations in the dataset, i.e. the height of X. It is demonstrated that the smaller covariance matrix can be used to efficiently calibrate partial least squares (PLS) models containing feature selections from any of the involved preprocessing techniques. The method is incorporated into GA‐PLS and used to evolve variable selections for a set of different preprocessing techniques concurrently within a single algorithm. This allows a single instance of GA‐PLS to determine which preprocessing technique, within the set of considered methods, is best suited for the spectroscopic dataset. Additionally, the method allows feature selections to be evolved that contain variables from a mixture of different preprocessing techniques. The benefits of the introduced GA‐PLS technique are threefold: (1) for datasets with many observations, the proposed method is substantially faster than conventional GA‐PLS implementations based on NIPALS, SIMPLS, etc.; (2) a single GA‐PLS run automatically reveals which of the considered preprocessing techniques results in the lowest model error; and (3) it allows the exploration of highly complex solutions composed of features preprocessed using various techniques.
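
The central idea, calibrating PLS models for arbitrary feature subsets from a covariance matrix that is computed only once, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single response variable, column-centered data, and a generic kernel-style PLS1 recursion that works on XᵀX and Xᵀy alone. The function name `pls_from_cov`, the two stand-in "preprocessings" (raw data and a finite-difference derivative), and the synthetic data are all hypothetical.

```python
import numpy as np

def pls_from_cov(XtX, Xty, n_components):
    """PLS1 regression coefficients computed from X^T X and X^T y only.

    Hypothetical helper: a kernel-style recursion that never touches
    the tall matrix X, so its cost does not depend on the row count.
    """
    k = XtX.shape[0]
    R = np.zeros((k, n_components))   # weights that map X directly to scores
    P = np.zeros((k, n_components))   # X loadings
    q = np.zeros(n_components)        # y loadings
    Xty = np.array(Xty, dtype=float)  # working copy, deflated in the loop
    for a in range(n_components):
        w = Xty / np.linalg.norm(Xty)
        r = w.copy()
        for j in range(a):
            r -= (P[:, j] @ w) * R[:, j]
        tt = r @ XtX @ r              # squared norm of the score vector
        p = (XtX @ r) / tt
        q[a] = (r @ Xty) / tt
        Xty -= p * q[a] * tt          # deflate the cross-product vector
        R[:, a], P[:, a] = r, p
    return R @ q

# Synthetic stand-in for a spectral dataset with many observations.
rng = np.random.default_rng(0)
n, m = 5000, 6
X_raw = rng.normal(size=(n, m))
y = X_raw[:, 1] - 0.5 * X_raw[:, 4] + 0.1 * rng.normal(size=n)

# Concatenate two assumed preprocessings horizontally: raw and a derivative.
X = np.hstack([X_raw, np.gradient(X_raw, axis=1)])
Xc, yc = X - X.mean(axis=0), y - y.mean()

# Condense once; the sizes depend only on the number of columns.
XtX, Xty = Xc.T @ Xc, Xc.T @ yc

# A candidate selection mixing columns from both preprocessings.
sel = np.array([1, 4, 7, 10])
b = pls_from_cov(XtX[np.ix_(sel, sel)], Xty[sel], n_components=2)
```

Each candidate selection evaluated by a genetic algorithm would then only require indexing a small submatrix of XᵀX and the matching entries of Xᵀy, rather than revisiting all rows of X.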
