Variable selection in support vector regression using angular search algorithm and variance inflation factor
Author(s) -
Folli Gabriely S.,
Nascimento Márcia H.C.,
Paulo Ellisson H.,
Cunha Pedro H.P.,
Romão Wanderson,
Filgueiras Paulo R.
Publication year - 2020
Publication title -
Journal of Chemometrics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.47
H-Index - 92
eISSN - 1099-128X
pISSN - 0886-9383
DOI - 10.1002/cem.3282
Subject(s) - mean squared error, calibration, outlier, mathematics, coefficient of determination, linear regression, feature selection, statistics, variance inflation factor, content (measure theory), analytical chemistry (journal), linearity, root mean square, chemistry, algorithm, artificial intelligence, physics, computer science, chromatography, mathematical analysis, multicollinearity, quantum mechanics
Here, we combine angular search algorithm and variance inflation factor (ASA‐VIF) with support vector regression (SVR) (ASA‐VIF‐SVR) to estimate total acid number (TAN), basic nitrogen content (BNC), and sulfur content (SC) in Brazilian crude oils. To prevent the interference of outliers, we further developed a strategy for outlier identification and applied it to nonlinear models based on RMSE (root mean square error). ASA‐VIF‐SVR was applied to near‐ and mid‐infrared spectroscopy (NIR and MIR) and hydrogen nuclear magnetic resonance (¹H NMR) spectroscopy data available in a range of 93–194 samples. The models were evaluated for accuracy (root mean square error of calibration [RMSEC] and root mean square error of prediction [RMSEP]) and linearity (coefficient of determination, R²). The removal of outliers increased accuracy and linearity of our models. The ASA‐VIF model for TAN, BNC, and SC selected 0.37%, 0.93%, and 0.30% of variables from full NIR spectra; 0.21%, 0.27%, and 0.21% from full MIR; and 0.20%, 0.42%, and 0.15% from full ¹H NMR. In most cases, the best results were obtained with variable selection compared with the full dataset. Also, ¹H NMR generated more accurate and linear models with RMSEP and R²ₚ of 0.0071 wt% and 0.86 for BNC and 0.0623 wt% and 0.79 for SC. TAN showed a better MIR result with RMSEP of 0.1426 mg KOH g⁻¹ and R²ₚ of 0.47. The most important region for ¹H NMR and MIR was the one with the largest quantity of unpaired electrons (aromatic region).
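For readers unfamiliar with the quantities named in the abstract, the following is a minimal sketch of how a variance inflation factor, a root mean square error, and a coefficient of determination are commonly computed. It is not the authors' ASA‐VIF‐SVR implementation: it contains no angular search and no SVR model, and the function names (`vif`, `rmse`, `r_squared`) and the synthetic data are assumptions made for illustration only. The R² shown uses the 1 - SS_res/SS_tot convention, which may differ from the definition used in the paper.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_vars).

    Each column is regressed on all other columns (with an intercept) by
    ordinary least squares; VIF_j = 1 / (1 - R_j^2) = SS_tot / SS_res.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other variables
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(np.sum((y - y.mean()) ** 2))
        # Perfect collinearity leaves no residual variance: VIF is infinite.
        out[j] = np.inf if ss_res == 0.0 else ss_tot / ss_res
    return out


def rmse(y_true, y_pred):
    """Root mean square error (RMSEC on calibration data, RMSEP on test data)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r_squared(y_true, y_pred):
    """Coefficient of determination, computed as 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Synthetic illustration (not data from the paper): the third variable is a
# near copy of the first, so both receive a large VIF; the second does not.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.01 * rng.normal(size=200)
X_demo = np.column_stack([x1, x2, x3])
vifs = vif(X_demo)
```

In a variable selection setting, a high VIF flags a variable as largely redundant with the others, which is the usual motivation for using it as a filter against multicollinearity before fitting a regression model.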