Premium
An efficient nonlinear programming strategy for PCA models with incomplete data sets
Author(s) -
de la Fuente Rodrigo LópezNegrete,
GarcíaMuñoz Salvador,
Biegler Lorenz T.
Publication year - 2010
Publication title -
journal of chemometrics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.47
H-Index - 92
eISSN - 1099-128X
pISSN - 0886-9383
DOI - 10.1002/cem.1306
Subject(s) - missing data , principal component analysis , nonlinear system , curse of dimensionality , computer science , partial least squares regression , minification , data mining , nonlinear programming , algorithm , artificial intelligence , mathematics , pattern recognition (psychology) , mathematical optimization , machine learning , physics , quantum mechanics
Abstract Processing plants can produce large amounts of data that process engineers use for analysis, monitoring, or control. Principal component analysis (PCA) is well suited to analyze large amounts of (possibly) correlated data, and for reducing the dimensionality of the variable space. Failing online sensors, lost historical data, or missing experiments can lead to data sets that have missing values where the current methods for obtaining the PCA model parameters may give questionable results due to the properties of the estimated parameters. This paper proposes a method based on nonlinear programming (NLP) techniques to obtain the parameters of PCA models in the presence of incomplete data sets. We show the relationship that exists between the nonlinear iterative partial least squares (NIPALS) algorithm and the optimality conditions of the squared residuals minimization problem, and how this leads to the modified NIPALS used for the missing value problem. Moreover, we compare the current NIPALS‐based methods with the proposed NLP with a simulation example and an industrial case study, and show how the latter is better suited when there are large amounts of missing values. The solutions obtained with the NLP and the iterative algorithm (IA) are very similar. However when using the NLP‐based method, the loadings and scores are guaranteed to be orthogonal, and the scores will have zero mean. The latter is emphasized in the industrial case study. Also, with the industrial data used here we are able to show that the models obtained with the NLP were easier to interpret. Moreover, when using the NLP many fewer iterations were required to obtain them. Copyright © 2010 John Wiley & Sons, Ltd.