Considering Numerical Error Propagation in Modeling and Regression of Data
Author(s) -
Neima Brauner,
Mordechai Shacham
Publication year - 2020
Language(s) - English
Resource type - Conference proceedings
DOI - 10.18260/1-2--6982
Subject(s) - computer science , regression analysis , collinearity , regression , statistics , algorithm , data mining , machine learning , mathematics
The use of user-friendly interactive regression software enables undergraduate engineering students to reach a high level of sophistication in regression, correlation and analysis of data. To interpret the results correctly, the students must be familiar with the potential causes of poor fits, be able to recognize a poor correlation, and improve it where possible. They should also be aware of the practical consequences of using a correlation that has no statistical validity. In this paper, the harmful effects of numerical error propagation (resulting from collinearity among the independent variables) are explained and demonstrated. Simple methods for minimizing such error propagation in polynomial regression are introduced. This material can be presented, for example, as part of a third-year undergraduate course on mathematical modeling and numerical methods.

Introduction

Realistic modeling and accurate correlation of experimental data are essential to sound engineering design. Many of the statistical techniques for analyzing the accuracy of correlations have been known for several decades (see, for example, Draper and Smith, 1981; Himmelblau, 1970; Bates and Watts, 1988; and Noggle, 1993). Until recently, however, those techniques had not been utilized at a significant level in undergraduate engineering education. One of the main reasons was that statistical tests usually yield numbers (variance, standard deviation, correlation coefficient, etc.) whose meaning can easily be misinterpreted if the statistical theory and the assumptions made in developing the tests are not well understood. The emergence of software packages with interactive regression and statistical analysis capabilities (such as POLYMATH, MATLAB, MATHEMATICA and EXCEL), which provide both numerical and graphical output, has changed the situation.
These software packages enable undergraduate engineering students with a moderate statistical background to carry out rigorous regression and statistical analysis of data. They are able to select the most appropriate correlation model and test its statistical validity using residual and confidence region plots. They can analyze the quality and precision of laboratory data by plotting one independent variable versus the others to detect hidden collinearity that may exist among the variables. Shacham et al (1996) described a set of lectures and exercises used to introduce freshman engineering students to the basics of data modeling and analysis using interactive software packages. This material is included in an introductory computing course or as part of an introductory engineering course. The introduction to data modeling and analysis (described by Shacham et al, 1996) includes the following subjects:

1. Basic statistical concepts.
2. Discrimination between real experimental data and smoothed interpolated data.
3. Using residual plots and confidence intervals for selecting the most appropriate model.
4. The dangers of extrapolation, in particular when a non-theory-based model is used.

This introductory material is very helpful to students in modeling and analyzing their own data. However, they may need more advanced material when dealing with, for example, models containing a large number of parameters. In this paper, such more advanced material related to regression is presented. The discussion covers models comprised of a sum of functions of the same independent variable (as in polynomial regression). Various effects of the interdependency between these functions are described and demonstrated. The material presented is taught to third-year undergraduate Chemical Engineering students at Ben Gurion University as part of a mathematical modeling and numerical methods course.
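The collinearity check mentioned above (comparing one regressor against the others) can be sketched numerically. The following is a minimal illustration, not code from the paper: the narrow temperature-like range and the use of centering as a remedy are assumptions chosen to show how near-collinearity between x and x^2 arises and how it can be reduced.

```python
import numpy as np

# On a narrow range far from zero, x and x^2 are nearly collinear:
# their correlation coefficient approaches 1, which is the source of
# numerical error propagation in polynomial regression.
x = np.linspace(290.0, 310.0, 50)      # assumed narrow range, e.g. T in K
r = np.corrcoef(x, x**2)[0, 1]
print(r)                               # very close to 1

# Centering the variable (one standard remedy) reduces the collinearity:
z = x - x.mean()
r_centered = np.corrcoef(z, z**2)[0, 1]
print(r_centered)                      # near 0
```

Plotting z against z**2 instead of x against x**2 makes the loss of the near-linear relationship visible as well.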
The calculations involved in solving the examples presented have been carried out using the POLYMATH 4.0 (Shacham and Cutlip, 1996) and MATLAB (MathWorks, 1992) packages, but other similar packages can be used for this purpose.

Linear Regression with Models Comprising Functions of One Independent Variable

Let us assume there is a set of N data points of a dependent variable (measured variable, such as vapor pressure, viscosity or heat capacity) yi versus an independent variable (controlled variable, such as temperature, concentration or pressure) xi, i = 1, 2, ..., N. A regression model comprised of a linear combination of n different functions of the independent variable is considered. Thus, the regressors are x1 = f1(x), x2 = f2(x), ..., xn = fn(x). For instance, for polynomials x1 = x, x2 = x^2, ..., xn = x^n. A linear model fitted to the data is of the form:

yi = β0 + β1x1i + β2x2i + ... + βnxni + ∈i    (1)

where β0, β1, ..., βn are the parameters of the model and ∈i is the measurement error in yi. It is assumed that the ∈i are independently and identically distributed (i.i.d.). The vector of estimated parameters β̂ᵀ = (β̂0, β̂1, ..., β̂n) is usually calculated using the least-squares approach, by minimizing the following function:

S = Σ_{i=1}^{N} [yi − (β0 + β1x1i + β2x2i + ... + βnxni)]^2    (2)

If the parameters appear in a linear expression (as in eq. (1)), the minimization can be carried out by solving a set of simultaneous linear algebraic equations (the normal equations):

XᵀX β̂ = Xᵀy    (3)

The columns of X are x0 = 1, x1, x2, ..., xn, and XᵀX = A is the normal matrix. To check the goodness of the fit between the observed values yi and the estimated values ŷi of the dependent variable, they can be plotted versus xi (when there is a single independent variable) or versus i, the point number (when there are several independent variables). The distance between the observed and estimated values can serve as an indication of the quality of the fit.
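The least-squares procedure of eqs. (1)–(3) can be sketched as follows. This is an illustrative NumPy version, not the POLYMATH or MATLAB code used in the paper, and the quadratic test data and noise level are assumptions for demonstration only.

```python
import numpy as np

def polyfit_normal_equations(x, y, n):
    """Fit yi = b0 + b1*x + ... + bn*x^n by solving the normal
    equations X^T X b = X^T y (eq. 3).  Illustrative only: forming
    A = X^T X squares the condition number of X, which is exactly
    the error-propagation hazard discussed in the paper."""
    X = np.vander(x, n + 1, increasing=True)  # columns 1, x, x^2, ..., x^n
    A = X.T @ X                               # normal matrix A = X^T X
    beta = np.linalg.solve(A, X.T @ y)        # estimated parameters beta-hat
    return beta, X @ beta                     # coefficients and fitted y-hat

# Assumed synthetic data: y = 1 + 2x - 3x^2 plus small random error
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.01, x.size)

beta, y_hat = polyfit_normal_equations(x, y, 2)
print(beta)   # estimates close to the true parameters [1, 2, -3]
```

For well-conditioned problems this reproduces the true parameters closely; for higher-degree polynomials on a narrow x-range, the same code exhibits the error amplification the paper analyzes.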
These distances can be amplified using a "residual plot". In the residual plot, the model error (residual) êi is usually plotted versus yi, where:

êi = yi − ŷi    (4)

A random distribution of the residuals around zero indicates that the model correctly represents the particular set of data. A definite trend or pattern in the residual plot may indicate either a lack of fit of the model, or that the assumption of random error distribution for the dependent variable is incorrect. In some cases (for example, when the value of the dependent variable changes by several orders of magnitude over the range of interest) it is the relative error that is normally distributed. The relative error is defined as:

êir = êi / yi    (5)

The appropriate transformation, which results in minimization of the relative error in the regression, is taking the logarithm of both sides of the model equation. It should be emphasized, however, that when the variables are transformed, the residual plot must be constructed using the transformed form of the dependent variable, in order to account for the change in the error distribution introduced by the transformation. The numerical indicator of the quality of the fit used most frequently is the square of the standard error of the estimate, which represents the sample variance, and is given by:
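The residual and relative-error definitions of eqs. (4)–(5), and the log transformation discussed above, can be sketched as follows. The exponential vapor-pressure-like model and all numerical values are assumptions invented for illustration, not data from the paper.

```python
import numpy as np

# Assumed data spanning about two orders of magnitude in y,
# with multiplicative (relative) measurement error.
rng = np.random.default_rng(1)
x = np.linspace(300.0, 400.0, 15)                  # e.g. temperature, K
y_true = np.exp(20.0 - 5000.0 / x)                 # assumed true model
y_obs = y_true * (1.0 + rng.normal(0.0, 0.02, x.size))

# Taking logarithms linearizes the model: ln(y) = a + b*(1/x).
X = np.column_stack([np.ones_like(x), 1.0 / x])
coef, *_ = np.linalg.lstsq(X, np.log(y_obs), rcond=None)
y_hat = np.exp(X @ coef)

e = y_obs - y_hat      # residual, eq. (4)
e_rel = e / y_obs      # relative error, eq. (5)

# After the transformation, the residual plot should use the
# transformed dependent variable ln(y), not y itself:
e_log = np.log(y_obs) - X @ coef
print(np.abs(e_rel).max(), np.abs(e_log).max())
```

Plotting e versus y_obs would be dominated by the largest y values, while e_rel (or e_log) shows the roughly uniform relative scatter that the log transformation is meant to capture.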