IE 211
Quantitative Methods in Industrial Engineering I
Agenda
Regression Analysis
Regression
Introduction
Many problems in engineering and science involve exploring
the relationships between two or more variables.
Regression analysis is a statistical technique that is very
useful for these types of problems.
For example, in a chemical process, suppose that the yield of
the product is related to the process-operating temperature.
Regression analysis can be used to build a model to predict
yield at a given temperature level.
Introduction
In a factorial design with a continuous factor, increasing the number of levels (3 or more) of the fixed or independent factors produces a range of response values that follows a certain trend.
Regression assumes a dependency of one variable on another, i.e. a "cause and effect" relationship: the factor causes an effect on some characteristic, the dependent or response variable.
Note: If there is no cause-and-effect relationship but the two variables still move together, the association between them is correlation (e.g. height and weight).
Types of Regression
Simple Linear Regression
Single regressor (x) variable such as x1 and model linear with
respect to coefficients
Example 1: y = a0 + a1x + error
Example 2: y = a0 + a1x + a2x² + a3x³ + error
Note: 'Linear' refers to the coefficients a0, a1, a2, etc.; each term containing a coefficient enters the model additively. In Example 2, the relationship between x and y is a cubic polynomial, but the model is still linear with respect to the coefficients.
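A minimal sketch (not from the slides), with made-up data, showing that the cubic model in Example 2 can still be fit by ordinary (linear) least squares:

import numpy as np

# Illustration only: y = a0 + a1*x + a2*x^2 + a3*x^3 + error is linear in the
# coefficients a0..a3, so ordinary least squares applies even though the curve is cubic.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x - 0.4 * x**2 + 0.03 * x**3 + rng.normal(0, 1, x.size)

# np.polyfit solves the linear least-squares problem for the polynomial coefficients
a3, a2, a1, a0 = np.polyfit(x, y, deg=3)   # coefficients returned highest degree first
print(a0, a1, a2, a3)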
Types of Regression
Multiple Linear Regression
Multiple regressor (x) variables such as x1, x2, x3 and model linear
with respect to coefficients
Example: y = b0 + b1 x1 + b2 x2 + b3 x3 + error
Simple Non-Linear Regression
Single regressor (x) variable such as x and model non-linear with
respect to coefficients
Example: y = b0 + b1(1 − e^(−b2·x)) + error (a fitting sketch follows below)
Multiple Non-Linear Regression
Multiple regressor (x) variables such as x1, x2, x3 and model non-
linear with respect to coefficients
Example: y = (b0 + b1 x1) / b2 x2 + b3 x3 + error
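A minimal sketch (not part of the original slides) of fitting the simple non-linear model y = b0 + b1(1 − e^(−b2·x)) with SciPy; the data and starting values are made up for illustration:

import numpy as np
from scipy.optimize import curve_fit

# Hypothetical example: the model is nonlinear in the coefficient b2,
# so linear least squares no longer applies and an iterative fit is used.
def model(x, b0, b1, b2):
    return b0 + b1 * (1 - np.exp(-b2 * x))

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 40)
y = model(x, 2.0, 5.0, 0.8) + rng.normal(0, 0.2, x.size)

# curve_fit performs iterative nonlinear least squares from the starting guess p0
(b0_hat, b1_hat, b2_hat), _ = curve_fit(model, x, y, p0=[1.0, 1.0, 1.0])
print(b0_hat, b1_hat, b2_hat)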
Simple Linear Regression
[Scatter plot: crop production (t/ha/yr) plotted against nitrogen fertilization level (N-0, N-20, N-40, N-60 kg/ha), suggesting a simple relationship between the factor (nitrogen fertilization) and the dependent variable (production).]
Simple linear regression
Model: Y = a + bX
where
Y = dependent variable or output (e.g. crop production, t/ha/yr)
X = independent variable or input/factor (e.g. nitrogen fertilization, kg/ha)
b = slope (ΔY / ΔX)
a = Y-intercept
[Diagram: fitted straight line of output vs. input, with the slope b and Y-intercept a indicated; X axis from 0 to 100.]
Type of simple linear relationship
“b” is positive – addition of
input increases the output
“b” is negative – addition of
input decreases the output
No response, b = 0
Empirical Model
Based on the scatter diagram, it is probably reasonable to
assume that the mean of the random variable Y is related to
x by the following straight-line relationship:
E(Y | x) = μ_Y|x = β0 + β1x
where the slope and intercept of the line are called
regression coefficients.
The simple linear regression model is given by
Y = β0 + β1x + ε
where ε is the random error term. The simple linear regression model gives the value of each individual data point; the true regression model gives the equation of the line.
Empirical Model
The true regression model is a line of mean values:
μ_Y|x = β0 + β1x
where β1 can be interpreted as the change in the mean of Y for
a unit change in x.
Also, the variability of Y at a particular value of x is determined by the error variance, σ².
This implies there is a distribution of Y-values at each x and
that the variance of this distribution is the same at each x.
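As a quick illustrative sketch (not from the slides), the model Y = β0 + β1x + ε with constant error variance σ² can be simulated to see the distribution of Y-values at each x; the coefficients below are arbitrary:

import numpy as np

# Hypothetical coefficients and error standard deviation, for illustration only
beta0, beta1, sigma = 3.0, 15.0, 2.0
rng = np.random.default_rng(42)

x = np.repeat([1.0, 1.5, 2.0, 2.5], 200)      # several x levels, many replicates each
y = beta0 + beta1 * x + rng.normal(0, sigma, x.size)

# At each x the Y-values scatter around the line of means beta0 + beta1*x
# with (approximately) the same spread sigma at every level.
for level in np.unique(x):
    ys = y[x == level]
    print(f"x={level}: mean Y ~ {ys.mean():.2f}, sd ~ {ys.std(ddof=1):.2f}")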
Empirical Model
Method of Least Squares
Suppose that we have n pairs of observations (x1, y1), (x2,
y2), …, (xn, yn).
The method of least squares is used to estimate the parameters β0 and β1 by minimizing the sum of the squares of the vertical deviations.
Even though the equations are given in the next few slides, we can use Minitab or Excel to come up with the regression equation.
Least Squares Estimate
Intercept: β̂0 = ȳ − β̂1 x̄

Slope: β̂1 = [ Σ xᵢyᵢ − (Σ xᵢ)(Σ yᵢ)/n ] / [ Σ xᵢ² − (Σ xᵢ)²/n ]

where the sums run over i = 1, …, n, and

x̄ = (Σ xᵢ)/n,  ȳ = (Σ yᵢ)/n
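A minimal sketch (not from the original slides) of these formulas in Python; the helper name least_squares_fit and the sample data are made up for illustration:

import numpy as np

def least_squares_fit(x, y):
    """Return (beta0_hat, beta1_hat) for the simple linear regression y = b0 + b1*x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n   # numerator of the slope
    sxx = np.sum(x**2) - np.sum(x)**2 / n             # denominator of the slope
    beta1 = sxy / sxx
    beta0 = y.mean() - beta1 * x.mean()               # intercept from the sample means
    return beta0, beta1

# Made-up data purely for demonstration
x = [1, 2, 3, 4, 5]
y = [2.1, 4.3, 5.9, 8.2, 9.8]
print(least_squares_fit(x, y))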
Fitted Regression Line
Plugging the estimated parameters into the equation, we get the fitted regression line:

ŷ = β̂0 + β̂1 x

For each observed value, the model can be written as:

yᵢ = β̂0 + β̂1 xᵢ + εᵢ

where εᵢ = yᵢ − ŷᵢ is called the residual.

Other notations:

S_xx = Σ xᵢ² − (Σ xᵢ)²/n

S_xy = Σ xᵢyᵢ − (Σ xᵢ)(Σ yᵢ)/n

β̂1 = S_xy / S_xx
Example
The number of workers on an assembly line varies due to the
level of absenteeism on any given day.
In a random sample of production output from several days
of work, the following data were obtained, where
x = number of workers absent from the assembly line, and
y = number of defects coming off the line
X 3 5 0 2 1
Y 16 20 9 12 10
Observation X Y X² XY
1 3 16 9 48
2 5 20 25 100
3 0 9 0 0
4 2 12 4 24
5 1 10 1 10
Total 11 67 39 182
S_xx = Σ xᵢ² − (Σ xᵢ)²/n = (3² + 5² + 0² + 2² + 1²) − (3 + 5 + 0 + 2 + 1)²/5 = 39 − 11²/5 = 14.8

S_xy = Σ xᵢyᵢ − (Σ xᵢ)(Σ yᵢ)/n = (3·16 + 5·20 + 0·9 + 2·12 + 1·10) − (11)(67)/5 = 182 − 147.4 = 34.6

β̂1 = S_xy / S_xx = 34.6 / 14.8 = 2.338

β̂0 = ȳ − β̂1 x̄ = 67/5 − 2.338 · (11/5) = 13.4 − 5.143 = 8.257
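The same computation, sketched in Python for the absenteeism data above:

import numpy as np

# Absenteeism example from the slides
x = np.array([3, 5, 0, 2, 1], dtype=float)      # workers absent
y = np.array([16, 20, 9, 12, 10], dtype=float)  # defects

n = x.size
sxx = np.sum(x**2) - np.sum(x)**2 / n            # 14.8
sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n  # 34.6

beta1 = sxy / sxx                                # ~ 2.338
beta0 = y.mean() - beta1 * x.mean()              # ~ 8.257
print(f"Sxx={sxx}, Sxy={sxy}, b1={beta1:.3f}, b0={beta0:.3f}")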
Estimating variance
The variance σ² can be estimated from the error sum of squares (SSE) or the mean squared error (MSE):

SSE = Σ eᵢ² = Σ (yᵢ − ŷᵢ)²    (this can be computed easily in Excel)

σ̂² = MSE = SSE / (n − 2)

(Since E(SSE) = (n − 2)σ², this estimator of σ² is unbiased.)
Estimating variance
However, obtaining the SSE directly would be tedious.
A more convenient formula is:

SSE = SST − β̂1 S_xy

Where:

SST = Σ (yᵢ − ȳ)² = Σ yᵢ² − n·ȳ²
Observation X Y X² XY Y²
1 3 16 9 48 256
2 5 20 25 100 400
3 0 9 0 0 81
4 2 12 4 24 144
5 1 10 1 10 100
Total 11 67 39 182 981
SST = Σ yᵢ² − n·ȳ² = (16² + 20² + 9² + 12² + 10²) − 5·(67/5)² = 981 − 897.8 = 83.2

SSE = SST − β̂1 S_xy = 83.2 − 2.338 · 34.6 = 2.305

σ̂² = SSE / (n − 2) = 2.305 / (5 − 2) = 0.7684
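Continuing the same sketch, the variance estimate for the example can be checked in code:

import numpy as np

x = np.array([3, 5, 0, 2, 1], dtype=float)
y = np.array([16, 20, 9, 12, 10], dtype=float)
n = x.size

sxx = np.sum(x**2) - np.sum(x)**2 / n
sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
beta1 = sxy / sxx

sst = np.sum(y**2) - n * y.mean()**2       # 83.2
sse = sst - beta1 * sxy                    # ~ 2.305
mse = sse / (n - 2)                        # ~ 0.768, the estimate of sigma^2
print(f"SST={sst:.3f}, SSE={sse:.3f}, MSE={mse:.4f}")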
Hypothesis Testing
We want to determine if the parameters, β1 and β0, are equal
to a certain constant.
Two methods
t-tests
ANOVA
t-tests
Slope: test whether the slope is equal to a constant β1,0
H0: β1 = β1,0
H1: β1 ≠ β1,0
Intercept: test whether the intercept is equal to a constant β0,0
H0: β0 = β0,0
H1: β0 ≠ β0,0
t-tests
Slope test statistic:
T0 = (β̂1 − β1,0) / √( σ̂² / S_xx )
Rejection region: |t0| > t(α/2, n − 2)

Intercept test statistic:
T0 = (β̂0 − β0,0) / √( σ̂² (1/n + x̄²/S_xx) )
Rejection region: |t0| > t(α/2, n − 2)
Example
Slope:
T0 = (β̂1 − β1,0) / √(σ̂²/S_xx) = (2.338 − 0) / √(0.7684/14.8) = 10.261

Intercept:
T0 = (β̂0 − β0,0) / √( σ̂² (1/n + x̄²/S_xx) ) = (8.257 − 0) / √( 0.7684·(1/5 + 2.2²/14.8) ) = 12.975

Rejection region: |t0| > t(0.025, 5 − 2) = 3.182
Both test statistics exceed 3.182, so H0 is rejected in each case: the slope and the intercept differ significantly from zero.
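A sketch of the same t-tests in Python; SciPy is used only for the critical value:

import numpy as np
from scipy import stats

x = np.array([3, 5, 0, 2, 1], dtype=float)
y = np.array([16, 20, 9, 12, 10], dtype=float)
n = x.size

sxx = np.sum(x**2) - np.sum(x)**2 / n
sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
beta1 = sxy / sxx
beta0 = y.mean() - beta1 * x.mean()
mse = (np.sum(y**2) - n * y.mean()**2 - beta1 * sxy) / (n - 2)

t_slope = (beta1 - 0) / np.sqrt(mse / sxx)                            # ~ 10.26
t_intercept = (beta0 - 0) / np.sqrt(mse * (1/n + x.mean()**2 / sxx))  # ~ 12.97
t_crit = stats.t.ppf(1 - 0.05/2, df=n - 2)                            # ~ 3.182

print(t_slope, t_intercept, t_crit)
# Reject H0 for both slope and intercept since |t0| > t_crit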
ANOVA
Hypothesis
Testing the significance of the regression model
H0: β1 = 0
H1: β1 ≠ 0
If the slope, β1, is 0, then there is no linear relationship between the dependent and independent variables.
ANOVA
Source of Variation   Sum of Squares           Degrees of Freedom   Mean Square   F0
Regression            SSR = β̂1·S_xy            1                    MSR           MSR/MSE
Error                 SSE = SST − β̂1·S_xy      n − 2                MSE
Total                 SST                      n − 1

Reject H0 if F0 > F(α, 1, n − 2)
Example
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F0
Regression            80.895           1                    80.895        105.286
Error                 2.305            3                    0.7683
Total                 83.2             4

Reject H0 since F0 = 105.286 > F(0.05, 1, 3) = 10.13
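The ANOVA quantities for this example, sketched in code (SciPy again only for the F critical value):

import numpy as np
from scipy import stats

x = np.array([3, 5, 0, 2, 1], dtype=float)
y = np.array([16, 20, 9, 12, 10], dtype=float)
n = x.size

sxx = np.sum(x**2) - np.sum(x)**2 / n
sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
beta1 = sxy / sxx

sst = np.sum(y**2) - n * y.mean()**2
ssr = beta1 * sxy                  # regression sum of squares ~ 80.895
sse = sst - ssr                    # error sum of squares ~ 2.305

msr = ssr / 1
mse = sse / (n - 2)
f0 = msr / mse                     # ~ 105.3
f_crit = stats.f.ppf(1 - 0.05, dfn=1, dfd=n - 2)   # ~ 10.13

print(f"F0={f0:.2f}, F_crit={f_crit:.2f}, reject H0: {f0 > f_crit}")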
Correlation
Correlation is a measure of the strength of association between two quantitative variables (e.g. pressure and yield).
It measures the degree of linear association between two variables, neither of which is designated as depending on the other.

R = S_xy / √(S_xx · S_yy)

It is important to remember that correlation pertains only to a linear relationship.
Correlation Coefficient
[Scatter plots of Y vs. X illustrating perfect positive correlation (r = +1.0), perfect negative correlation (r = −1.0), and no correlation (r = 0.0).]
Strength and Direction of “+”
Correlation
[Scatter plots of output vs. input illustrating the strength of positive correlation:
Moderate positive correlation: Y = 25.7595 + 0.645418X, R² = 0.369
Weak positive correlation: Y = 56.6537 + 0.181987X, R² = 0.115
Strong positive correlation: Y = 9.77271 + 0.745022X, R² = 0.876]
Strength and Direction of “-”
Correlation
[Scatter plots of output vs. input illustrating the strength of negative correlation:
Moderate negative correlation: Y = 90.3013 − 0.645418X, R² = 0.369
Weak negative correlation: Y = 74.8524 − 0.181987X, R² = 0.115
Strong negative correlation: Y = 99.1754 − 0.745022X, R² = 0.876]
Correlation vs. Causation
Data shows that average life expectancy of Americans increased when the divorce rate went up!
Is there a correlation between shark attacks and Popsicle sales?
[Scatter plots: average life expectancy in America vs. divorce rate; number of shark attacks vs. popsicle sales.]
Correlation does not imply causation! A third variable may be ‘lurking’ that causes both
x and y to vary
Example
The correlation for the absenteeism example:

R = S_xy / √(S_xx · S_yy) = 34.6 / √(14.8 · 83.2) = 0.98601

R² = (0.98601)² = 97.2%

Equivalently, R² = SSR / SST = 80.895 / 83.2 = 97.2%
97.2% of the variation in the number of defects can be
explained by absenteeism
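A final sketch computing R and R² for the example in code:

import numpy as np

x = np.array([3, 5, 0, 2, 1], dtype=float)
y = np.array([16, 20, 9, 12, 10], dtype=float)
n = x.size

sxx = np.sum(x**2) - np.sum(x)**2 / n            # 14.8
syy = np.sum(y**2) - np.sum(y)**2 / n            # 83.2
sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n  # 34.6

r = sxy / np.sqrt(sxx * syy)    # ~ 0.986, same value as np.corrcoef(x, y)[0, 1]
r_squared = r**2                # ~ 0.972
print(f"R={r:.5f}, R^2={r_squared:.3f}")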
fin