Dr. Siti Mariam binti Abdul Rahman
Faculty of Mechanical Engineering
Office: T1-A14-01C
e-mail: [email protected]

Objectives:
◦ To fit curves to data using available techniques.
◦ To assess the reliability of the answers obtained.
◦ To choose the preferred method for any particular problem.
◦ To study different techniques to fit curves or approximating
functions to a set of discrete data and to manipulate these
approximating functions.
Least-squares regression. Get the ‘best’ straight line to fit
through a set of uncertain data points.
Interpolation. Estimate intermediate values between precise
data points by deriving polynomials in equation forms. Three
methods to be investigated:
(a) Newton’s interpolating polynomial,
(b) Lagrange interpolating polynomial, and
(c) Spline interpolation (fitting data in piecewise fashion).
Typical data
◦ is discrete, but we are often interested in the intermediate values
◦ these intermediate values need to be estimated
Curve fitting:
◦ finding a curve (approximation) which best fits a series of
discrete data
◦ the curve is the estimate of the trend of the dependent variable
◦ the curve can be used to estimate intermediate values of the
data.
How to draw the curve?
◦ need to define the function of the curve – can be linear or non-linear
Approaches for curve fitting:
1. Least-squares regression
Data with significant error or noise
Curve doesn't pass through all data points – the curve represents
the general trend of the data
2. Interpolation
Data is known to be precise
Curve passes through all data points
Regression?
◦ modeling of relationship between dependent and
independent variables
◦ finding a curve which represents the best approximation of a
series of data points
◦ the curve is the estimate of the trend of dependent
variables
How to find the curve?
◦ by deriving the function of the curve
◦ functions can be linear, polynomial & exponential
Given n data points (x1, y1), (x2, y2), …, (xn, yn), find the best fit y = f(x)
to the data set. The best fit is generally based on minimizing
the sum of the squares of the residuals, Sr.
Regression model:

y ≈ f(x)

Residual at any point i:

ei = yi − f(xi)

Sum of the squares of the residuals:

Sr = Σi=1..n [yi − f(xi)]²
Fit a straight line to a set of n data points (x1, y1), (x2, y2), …, (xn, yn).
Equation of the line (regression model) is given by

y = a0 + a1x + e

a1 – slope
a0 – intercept
e – error, or residual, between the model and the measurement
• Ideally, if all the residuals are zero,
one may have found an equation in
which all the points lie on the
model.
• Thus, minimization of the residual
is an objective of obtaining
regression coefficients.
The most popular method to minimize the residual is the least-squares
method, where the estimates of the constants of the model
are chosen such that the sum of the squared residuals, Sr, is
minimized.
The ‘best’ straight line would be the one that minimizes the total
error. Several criteria may be used, for example

min Σi=1..n ei = Σi=1..n (yi − a0 − a1xi)

or

min Σi=1..n |ei| = Σi=1..n |yi − a0 − a1xi|

n = total number of points
These are inadequate criteria – they give no unique model.
Examples of some criteria for “best fit”
that are inadequate for regression:
a) minimizes the sum of the
residuals,
b) minimizes the sum of the absolute
values of the residuals, and
c) minimizes the maximum error of
any individual point.
However, a more practical criterion
for least-squares approach is to
minimize the sum of the squares of
the residuals, that is
Sr = Σi=1..n ei² = Σi=1..n (yi − a0 − a1xi)²
Best strategy! Yields a unique line for a given set of data.
Using the regression model:

y = a0 + a1x

the slope and intercept producing the best fit can be found
using:

a1 = (n Σxiyi − Σxi Σyi) / (n Σxi² − (Σxi)²)

a0 = (Σyi − a1 Σxi) / n = ȳ − a1x̄
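These closed-form formulas translate directly to code; a minimal Python sketch (the function name `linear_fit` is an illustrative choice, not from the slides):

```python
def linear_fit(x, y):
    """Least-squares straight line y = a0 + a1*x via the normal equations."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    a1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # slope
    a0 = (sy - a1 * sx) / n                         # intercept = ybar - a1*xbar
    return a0, a1
```

Applied to the seven (x, y) points of the worked example that follows, it reproduces a0 ≈ 1.5357 and a1 ≈ 4.1071.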
Fit the best straight line to the following set of x and y values
using the method of least-squares.
x 0 1 2 3 4 5 6
y 2 5 9 15 17 24 25
Solution:

xi    yi    xi²    xi·yi
0     2     0      0
1     5     1      5
2     9     4      18
3     15    9      45
4     17    16     68
5     24    25     120
6     25    36     150
Σ 21  97    91     406
Knowing the linear equation and using the known values:

a1 = (n Σxiyi − Σxi Σyi) / (n Σxi² − (Σxi)²)
   = (7(406) − (21)(97)) / (7(91) − (21)²)
   = (2842 − 2037) / (637 − 441)
   = 805 / 196
   = 4.1071

a0 = (Σyi − a1 Σxi) / n
   = (97 − 4.1071(21)) / 7
   = (97 − 86.2491) / 7
   = 10.7509 / 7
   = 1.5357

Least-squares fit is given by: y = 1.5357 + 4.1071x
For a straight line, the sum of the squares of the estimate
residuals:

Sr = Σi=1..n ei² = Σi=1..n (yi − a0 − a1xi)²
• Quantify the spread of data
around the regression line
• Used to quantify the ‘goodness’
of a fit
Standard deviation of the data:

Sy = √(St / (n − 1)),  where  St = Σ(yi − ȳ)²

Standard error of the estimate:

Sy/x = √(Sr / (n − 2)),  where  Sr = Σ(yi − a0 − a1xi)²
The standard error of the estimate, Sy/x, quantifies the spread of the data
around the regression line.
Regression data showing
◦ the spread of data around the mean of the dependent data, Sy
◦ the spread of the data around the best fit line, Sy/x.
◦ The reduction in spread represents the improvement due to linear
regression.
The coefficient of determination, r², indicates how well the regression
line represents the real data.
r2 is the difference between the sum of the squares of the data
residuals, St and the sum of the squares of the estimate residuals, Sr,
normalized by the sum of the squares of the data residuals:
r² = (St − Sr) / St,  where  St = Σ(yi − ȳ)²  and  Sr = Σ(yi − a0 − a1xi)²
St − Sr quantifies the improvement (or error reduction) due to
describing the data in terms of a straight line rather than an average value.
r² represents the percentage of the
original uncertainty explained by
the model.
For a perfect fit, Sr = 0 and r² = 1.
If r² = 0, then St = Sr and there is no improvement over simply picking the
mean.
If r² < 0, the model is worse than simply picking the mean!
Determine the coefficient of determination for the linear
regression line obtained in Example 1.

xi = 0, 1, 2, 3, 4, 5, 6
yi = 2, 5, 9, 15, 17, 24, 25
Fest = 1.5357 + 4.1071xi

St = Σ(yi − ȳ)² = 480.8571
Sr = Σ(yi − a0 − a1xi)² = 8.5357

Sy = √(480.8571 / (7 − 1)) = 8.9523
Sy/x = √(8.5357 / (7 − 2)) = 1.3066

r² = (480.8571 − 8.5357) / 480.8571 = 0.9822

98.22% of the original uncertainty has been
explained by the linear model.
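These goodness-of-fit measures can be computed directly; a short Python sketch (the function name `goodness_of_fit` is illustrative):

```python
import math

def goodness_of_fit(x, y, a0, a1):
    """St, Sr, Sy, Sy/x and r^2 for the straight-line fit y = a0 + a1*x."""
    n = len(x)
    ybar = sum(y) / n
    St = sum((yi - ybar) ** 2 for yi in y)                      # spread about the mean
    Sr = sum((yi - a0 - a1 * xi) ** 2 for xi, yi in zip(x, y))  # spread about the line
    Sy = math.sqrt(St / (n - 1))    # standard deviation of the data
    Syx = math.sqrt(Sr / (n - 2))   # standard error of the estimate
    r2 = (St - Sr) / St             # coefficient of determination
    return St, Sr, Sy, Syx, r2
```

With the Example 1 data and coefficients it reproduces r² ≈ 0.9822 and Sy/x ≈ 1.3066.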
How good is the model?
Check for adequacy
◦ Standard error of estimate, Sy/x
◦ Coefficient of determination, r2
◦ Plot graph and check visually
Examples of functions that can be linearized are:

1. Exponential function: y = α1·e^(β1·x), where α1 and β1 are constant
coefficients.
   Linearized: ln y = ln α1 + β1·x

2. Power function: y = α2·x^β2, where α2 and β2 are constant
coefficients.
   Linearized: log y = log α2 + β2·log x

3. Saturation-growth-rate function: y = α3·x / (β3 + x)
   Linearized: 1/y = 1/α3 + (β3/α3)·(1/x)

These transformations convert the equations into linear form so that
simple linear regression can be used.
Example of nonlinear transformation:
In their transformed forms, these models allow linear regression
to be used to evaluate the constant coefficients.
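As an illustration of the first transform, an exponential model can be recovered by ordinary linear regression on (x, ln y). The data here are synthetic, generated from assumed coefficients α = 2.0 and β = 0.5 purely to demonstrate the idea; they are not from the slides:

```python
import math

# Synthetic data from an assumed exponential model y = 2.0 * exp(0.5 * x).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.5 * xi) for xi in xs]

# Transform: ln y = ln(alpha) + beta * x, then fit a straight line.
lny = [math.log(yi) for yi in ys]
n = len(xs)
sx, sy = sum(xs), sum(lny)
sxy = sum(xi * yi for xi, yi in zip(xs, lny))
sxx = sum(xi * xi for xi in xs)
beta = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # slope -> beta
alpha = math.exp((sy - beta * sx) / n)            # intercept -> ln(alpha)
```

Because the synthetic data are noise-free, the regression recovers α and β essentially exactly.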
Use a power model to fit the following set of data.

y = α2·x^β2   →   log y = log α2 + β2·log x

xi    yi    log xi   log yi
1     0.5   0.000    -0.301
2     1.7   0.301    0.226
3     3.4   0.477    0.534
4     5.7   0.602    0.753
5     8.4   0.699    0.922
Use the linearized plot to determine the coefficients of the power equation
◦ Linear regression gives log y = 1.75·log x – 0.300
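The stated regression can be reproduced from the tabulated data; a short Python check (variable names are illustrative):

```python
import math

# Power-model data from the example above.
xs = [1, 2, 3, 4, 5]
ys = [0.5, 1.7, 3.4, 5.7, 8.4]

# Regress log10(y) on log10(x): slope -> beta2, intercept -> log10(alpha2).
lx = [math.log10(v) for v in xs]
ly = [math.log10(v) for v in ys]
n = len(lx)
sx, sy = sum(lx), sum(ly)
sxy = sum(a * b for a, b in zip(lx, ly))
sxx = sum(a * a for a in lx)
beta2 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # ~ 1.75
log_alpha2 = (sy - beta2 * sx) / n                 # ~ -0.300
```

The small differences from the slide values come from rounding the logarithms to three decimals in the table.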
Not all equations can be easily transformed!
◦ Alternative method: nonlinear regression
The linear least-squares regression
procedure can be readily extended to
fit data to a higher-order polynomial.
Again, the idea is to minimize the sum
of the squares of the estimate
residuals.
The figure shows the same data fit
with:
a) A first order polynomial
b) A second order polynomial
For second-order polynomial
regression:

y = a0 + a1x + a2x² + e
For a second-order polynomial, the best fit would mean minimizing:

Sr = Σi=1..n ei² = Σi=1..n (yi − a0 − a1xi − a2xi²)²

In general, for an mth-order polynomial, this would mean minimizing:

Sr = Σi=1..n ei² = Σi=1..n (yi − a0 − a1xi − a2xi² − … − am·xi^m)²
The standard error for fitting an mth-order polynomial to n data
points is:

Sy/x = √(Sr / (n − (m + 1)))

because the mth-order polynomial has (m + 1) coefficients.
The coefficient of determination r² is still found using:

r² = (St − Sr) / St
To find the constants of the polynomial model, we partially
differentiate Sr with respect to each of the unknown
coefficients and set the result equal to zero:

∂Sr/∂a0 = −2 Σi=1..n (yi − a0 − a1xi − a2xi² − … − am·xi^m) = 0
∂Sr/∂a1 = −2 Σi=1..n (yi − a0 − a1xi − a2xi² − … − am·xi^m)·xi = 0
∂Sr/∂a2 = −2 Σi=1..n (yi − a0 − a1xi − a2xi² − … − am·xi^m)·xi² = 0
⋮
∂Sr/∂am = −2 Σi=1..n (yi − a0 − a1xi − a2xi² − … − am·xi^m)·xi^m = 0
In general, these equations in matrix form are given by

| n        Σxi       Σxi²      …  Σxi^m     | | a0 |   | Σyi      |
| Σxi      Σxi²      Σxi³      …  Σxi^(m+1) | | a1 |   | Σxiyi    |
| Σxi²     Σxi³      Σxi⁴      …  Σxi^(m+2) | | a2 | = | Σxi²yi   |
| ⋮        ⋮         ⋮             ⋮        | | ⋮  |   | ⋮        |
| Σxi^m    Σxi^(m+1) Σxi^(m+2) …  Σxi^(2m)  | | am |   | Σxi^m·yi |

The above equations are then solved for a0, a1, …, am.
Fit a second-order polynomial to data with the following sums:

m = 2;  n = 6;  x̄ = 2.5;  ȳ = 25.433
Σxi = 15;  Σyi = 152.6;  Σxi² = 55;  Σxi³ = 225;  Σxi⁴ = 979
Σxiyi = 585.6;  Σxi²yi = 2488.8
Substituting these sums into the normal equations gives:

| 6    15    55  | | a0 |   | 152.6  |
| 15   55    225 | | a1 | = | 585.6  |
| 55   225   979 | | a2 |   | 2488.8 |

Solving gives a0 = 2.47857, a1 = 2.35929 and a2 = 1.86071, so the
least-squares fit is:

y = 2.47857 + 2.35929x + 1.86071x²
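The 3×3 system above can be solved with any linear solver; a minimal Gaussian-elimination sketch in Python (the helper name `gauss_solve` is illustrative):

```python
def gauss_solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for k in range(n):
        # pivot: bring the row with the largest |entry| in column k to the top
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                 # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

# Normal equations from the worked example above.
A = [[6, 15, 55], [15, 55, 225], [55, 225, 979]]
b = [152.6, 585.6, 2488.8]
a0, a1, a2 = gauss_solve(A, b)
```

Solving this system reproduces the coefficients a0 ≈ 2.47857, a1 ≈ 2.35929, a2 ≈ 1.86071.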
Sr: sum of the squares of the estimate residuals

Sr = Σi=1..n ei² = Σi=1..n (yi − a0 − a1xi − a2xi²)²

St: sum of the squares of the data residuals

St = Σ(yi − ȳ)²

Sy/x: standard error of the estimate

Sy/x = √(Sr / (n − (m + 1)))

r²: the coefficient of determination

r² = (St − Sr) / St