
IE 211

Quantitative Methods in Industrial Engineering I


Agenda

• Regression Analysis

Regression

Introduction
• Many problems in engineering and science involve exploring the relationships between two or more variables.
• Regression analysis is a statistical technique that is very useful for these types of problems.
• For example, in a chemical process, suppose that the yield of the product is related to the process-operating temperature.
• Regression analysis can be used to build a model to predict yield at a given temperature level.
Introduction
• In a factorial design with a continuous variable, increasing the number of levels (3 or more) of the fixed or independent factors produces a range of values with a certain trend.
• Dependency of one factor on the other means there is "cause and effect": the factor causes effects on a certain characteristic, i.e., the dependent or response variable.
• Note: If there is no cause-and-effect relationship but the two variables move together, then there is correlation between them (e.g., height and weight).
Types of Regression
• Simple Linear Regression
  • A single regressor (x) variable, with the model linear with respect to the coefficients
  • Example 1: y = a0 + a1·x + error
  • Example 2: y = a0 + a1·x + a2·x² + a3·x³ + error
  • Note: 'Linear' refers to the coefficients a0, a1, a2, etc. It implies that each term containing a coefficient is added to the model. In Example 2, the relationship between x and y is a cubic polynomial, but the model is linear with respect to the coefficients.
Types of Regression
• Multiple Linear Regression
  • Multiple regressor (x) variables such as x1, x2, x3, with the model linear with respect to the coefficients
  • Example: y = b0 + b1·x1 + b2·x2 + b3·x3 + error
• Simple Non-Linear Regression
  • A single regressor (x) variable, with the model non-linear with respect to the coefficients
  • Example: y = b0 + b1·(1 − e^(−b2·x)) + error
• Multiple Non-Linear Regression
  • Multiple regressor (x) variables such as x1, x2, x3, with the model non-linear with respect to the coefficients
  • Example: y = (b0 + b1·x1) / b2·x2 + b3·x3 + error
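To make the linear vs. non-linear distinction concrete, here is a minimal sketch (not from the slides) of fitting the simple non-linear model above with scipy.optimize.curve_fit; the data values and starting guesses are hypothetical, chosen only to illustrate the call:

    # Fit y = b0 + b1*(1 - exp(-b2*x)): non-linear because b2 sits inside
    # the exponential, so the closed-form least-squares formulas do not apply.
    import numpy as np
    from scipy.optimize import curve_fit

    def model(x, b0, b1, b2):
        return b0 + b1 * (1 - np.exp(-b2 * x))

    x = np.array([0.0, 1.0, 2.0, 4.0, 8.0])   # hypothetical inputs
    y = np.array([1.1, 2.6, 3.4, 4.1, 4.4])   # hypothetical responses

    params, _ = curve_fit(model, x, y, p0=[1.0, 3.0, 0.5])
    print(params)  # iteratively estimated b0, b1, b2

In contrast, the linear models above can be estimated in closed form, as the least-squares formulas later in the deck show.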
Simple Linear Regression

[Figure: scatter plot of production (t/ha/yr) against nitrogen fertilization (kg/ha) at levels N-0, N-20, N-40, and N-60, illustrating a simple relationship between factor and dependent variable.]

Simple linear regression

Model: Y = a + bX

[Figure: fitted line of the dependent variable or output (e.g. crop production, t/ha/yr) against the independent variable or input/factor (e.g. nitrogen fertilization, kg/ha); the slope is b = ΔY/ΔX and a is the Y-intercept.]
Type of simple linear relationship
• "b" is positive – addition of input increases the output
• "b" is negative – addition of input decreases the output
• No response, b = 0
Empirical Model
• Based on the scatter diagram, it is probably reasonable to assume that the mean of the random variable Y is related to x by the following straight-line relationship:

  E(Y \mid x) = \mu_{Y|x} = \beta_0 + \beta_1 x

  where the slope and intercept of the line are called regression coefficients.
• The simple linear regression model is given by

  Y = \beta_0 + \beta_1 x + \varepsilon

  where ε is the random error term. The simple linear regression model gives the value of each individual data point, while the true regression model gives the equation of the line.

Empirical Model
• The true regression model is a line of mean values:

  \mu_{Y|x} = \beta_0 + \beta_1 x

  where β1 can be interpreted as the change in the mean of Y for a unit change in x.
• Also, the variability of Y at a particular value of x is determined by the error variance, σ².
• This implies there is a distribution of Y-values at each x and that the variance of this distribution is the same at each x.

[Figure: distribution of Y-values centered on the regression line at each x, each with the same variance σ².]
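To make the "distribution of Y-values at each x" concrete, here is a minimal simulation sketch (not from the slides); the coefficient and σ values are assumed purely for illustration:

    # Draw several Y observations at each x from Y = beta0 + beta1*x + eps,
    # where eps has the same variance sigma^2 at every x.
    import numpy as np

    rng = np.random.default_rng(0)
    beta0, beta1, sigma = 8.0, 2.3, 0.9   # hypothetical values

    for x in [0, 1, 2, 3, 5]:
        y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=5)
        print(f"x={x}: observed Y {np.round(y, 2)}, true mean {beta0 + beta1 * x:.1f}")

Each x has its own cloud of Y-values centered on the regression line, with the same spread everywhere.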
Method of Least Squares
• Suppose that we have n pairs of observations (x1, y1), (x2, y2), …, (xn, yn).
• The method of least squares is used to estimate the parameters β0 and β1 by minimizing the sum of the squares of the vertical deviations.
• Even though the equations are given in the next few slides, we can use Minitab or Excel to come up with the regression equation.
Least Squares Estimate
• Intercept:

  \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

• Slope:

  \hat{\beta}_1 = \frac{\sum_{i=1}^{n} y_i x_i - \frac{(\sum_{i=1}^{n} y_i)(\sum_{i=1}^{n} x_i)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{(\sum_{i=1}^{n} x_i)^2}{n}}

• Where:

  \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}
Fitted Regression Line
• Plugging the parameter estimates into the equation, we get the regression line:

  \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x

• For each given value, the following holds:

  y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \varepsilon_i

  where ε_i = y_i − ŷ_i is called the residual.
• Other notations:

  S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{(\sum_{i=1}^{n} x_i)^2}{n}, \qquad S_{xy} = \sum_{i=1}^{n} y_i x_i - \frac{(\sum_{i=1}^{n} y_i)(\sum_{i=1}^{n} x_i)}{n}

  \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}
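These closed-form estimates are easy to code directly. The slides use Minitab or Excel, so the following Python function is only an illustrative sketch of the S_xx/S_xy formulas:

    # Least-squares estimates for simple linear regression, per the slide formulas.
    def fit_simple_linear(x, y):
        n = len(x)
        xbar, ybar = sum(x) / n, sum(y) / n
        sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
        sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
        b1 = sxy / sxx              # slope
        b0 = ybar - b1 * xbar       # intercept
        return b0, b1, sxx, sxy

Applied to the data of the worked example that follows (x = 3, 5, 0, 2, 1 and y = 16, 20, 9, 12, 10), it returns b0 ≈ 8.257 and b1 ≈ 2.338.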
Example
• The number of workers on an assembly line varies due to the level of absenteeism on any given day.
• In a random sample of production output from several days of work, the following data were obtained, where
  • x = number of workers absent from the assembly line, and
  • y = number of defects coming off the line

  X: 3  5  0  2  1
  Y: 16 20 9  12 10
  Observation   X    Y    X²   XY
  1             3    16   9    48
  2             5    20   25   100
  3             0    9    0    0
  4             2    12   4    24
  5             1    10   1    10
  Total         11   67   39   182

Using the totals from the table:

  S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = 39 - \frac{11^2}{5} = 14.8

  S_{xy} = \sum y_i x_i - \frac{(\sum y_i)(\sum x_i)}{n} = 182 - \frac{67 \cdot 11}{5} = 34.6

  \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{34.6}{14.8} = 2.338

  \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \frac{67}{5} - 2.338 \cdot \frac{11}{5} = 8.257
Estimating Variance
• The variance σ² can be estimated from the error sum of squares (this can be computed easily in Excel):

  SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

• Since E(SSE) = (n − 2)σ², an unbiased estimator is the mean squared error:

  \hat{\sigma}^2 = MSE = \frac{SSE}{n-2}
Estimating Variance
• However, obtaining the SSE directly would be tedious.
• Another, more convenient formula is as follows:

  SSE = SST - \hat{\beta}_1 S_{xy}

• Where:

  SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2
  Observation   X    Y    X²   XY    Y²
  1             3    16   9    48    256
  2             5    20   25   100   400
  3             0    9    0    0     81
  4             2    12   4    24    144
  5             1    10   1    10    100
  Total         11   67   39   182   981
  SST = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 = (16^2 + \dots + 10^2) - 5\left(\frac{67}{5}\right)^2 = 981 - 897.8 = 83.2

  SSE = SST - \hat{\beta}_1 S_{xy} = 83.2 - 2.338 \cdot 34.6 = 2.305

  \hat{\sigma}^2 = \frac{SSE}{n-2} = \frac{2.305}{5-2} = 0.7683
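These sums of squares are quick to verify in code; the sketch below (illustrative Python, not part of the slides) computes them from scratch:

    # SST, SSE and the variance estimate for the worked example.
    x = [3, 5, 0, 2, 1]
    y = [16, 20, 9, 12, 10]
    n = len(x)
    ybar = sum(y) / n

    sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n                   # 14.8
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n   # 34.6
    b1 = sxy / sxx                                                     # 2.3378...

    sst = sum(yi ** 2 for yi in y) - n * ybar ** 2   # 83.2
    sse = sst - b1 * sxy                             # about 2.311 with unrounded b1
    mse = sse / (n - 2)                              # about 0.770
    print(round(sst, 3), round(sse, 3), round(mse, 4))

Note that carrying the unrounded slope (34.6/14.8 = 2.3378…) gives SSE ≈ 2.311; the slide's 2.305 comes from rounding β̂1 to 2.338 first.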
Hypothesis Testing
• We want to determine if the parameters β1 and β0 are equal to certain constants.
• Two methods:
  • t-tests
  • ANOVA

t-tests
• Slope: test whether the slope is equal to a constant β1,0

  H_0: \beta_1 = \beta_{1,0}
  H_1: \beta_1 \neq \beta_{1,0}

• Intercept: test whether the intercept is equal to a constant β0,0

  H_0: \beta_0 = \beta_{0,0}
  H_1: \beta_0 \neq \beta_{0,0}
t-tests
• Slope test statistic:

  T_0 = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{\hat{\sigma}^2 / S_{xx}}}

• Intercept test statistic:

  T_0 = \frac{\hat{\beta}_0 - \beta_{0,0}}{\sqrt{\hat{\sigma}^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)}}

• Rejection region (for both tests):

  |t_0| > t_{\alpha/2,\, n-2}
Example
• Slope test statistic:

  T_0 = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{\hat{\sigma}^2 / S_{xx}}} = \frac{2.338 - 0}{\sqrt{0.7684 / 14.8}} = 10.261

• Intercept test statistic:

  T_0 = \frac{\hat{\beta}_0 - \beta_{0,0}}{\sqrt{\hat{\sigma}^2 \left(\frac{1}{5} + \frac{2.2^2}{14.8}\right)}} = \frac{8.257 - 0}{\sqrt{0.7684 \left(\frac{1}{5} + \frac{2.2^2}{14.8}\right)}} = 12.975

• Rejection region:

  |t_0| > t_{0.025,\, 5-2} = 3.182

  Both statistics exceed 3.182, so H0 is rejected in each case: the slope and intercept differ significantly from zero.
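A quick Python check of these t-statistics (illustrative only; scipy.stats.t supplies the critical value):

    # t-tests for the slope and intercept of the worked example.
    from math import sqrt
    from scipy.stats import t

    n, sxx, xbar = 5, 14.8, 2.2
    b0, b1, mse = 8.257, 2.338, 0.7684    # values carried over from the slides

    t_slope = (b1 - 0) / sqrt(mse / sxx)
    t_intercept = (b0 - 0) / sqrt(mse * (1 / n + xbar ** 2 / sxx))
    t_crit = t.ppf(1 - 0.025, df=n - 2)   # two-sided test at alpha = 0.05

    print(round(t_slope, 3), round(t_intercept, 3), round(t_crit, 3))
    # 10.261 12.975 3.182 -> both exceed the critical value, reject H0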
ANOVA
• Hypothesis: testing the significance of the regression model

  H_0: \beta_1 = 0
  H_1: \beta_1 \neq 0

• If the slope β1 is 0, then the dependent and independent variables have no linear relationship to one another (they are independent).
ANOVA

  Source of     Sum of                 Degrees of   Mean      F0
  Variation     Squares                Freedom      Square
  Regression    SSR = β̂1·Sxy           1            MSR       MSR/MSE
  Error         SSE = SST − β̂1·Sxy     n − 2        MSE
  Total         SST                    n − 1

• Reject H0 if:

  F_0 > F_{\alpha,\,(1,\,n-2)}
Example

  Source of     Sum of     Degrees of   Mean      F0
  Variation     Squares    Freedom      Square
  Regression    80.895     1            80.895    105.286
  Error         2.305      3            0.7683
  Total         83.2       4

• Reject H0:

  F_0 = 105.286 > F_{0.05,\,(1,\,3)} = 10.13
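The same ANOVA can be sketched in Python (illustrative; scipy.stats.f gives the critical value):

    # ANOVA for significance of regression, using the slide values.
    from scipy.stats import f

    sst, sse = 83.2, 2.305
    ssr = sst - sse             # regression sum of squares (= beta1*Sxy)
    msr = ssr / 1               # regression mean square, 1 degree of freedom
    mse = sse / 3               # error mean square, n - 2 = 3 degrees of freedom
    f0 = msr / mse
    f_crit = f.ppf(1 - 0.05, dfn=1, dfd=3)

    print(round(f0, 3), round(f_crit, 2))   # 105.286 10.13 -> reject H0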
Correlation
• Correlation is a measure of the strength of association between two quantitative variables (e.g., pressure and yield).
• It measures the degree of linearity between two variables, with neither assumed to depend on the other.
• It is important to remember that correlation pertains to a linear relationship:

  R = \frac{S_{xy}}{\sqrt{S_{xx} \cdot S_{yy}}}
Correlation Coefficient

[Figure: three scatter plots of Y against X. A perfect positive linear pattern corresponds to r = +1.0, a perfect negative linear pattern to r = -1.0, and a patternless cloud to r = 0.0 (no correlation).]
Strength and Direction of "+" Correlation

[Figure: three scatter plots of Output against Input with fitted lines.
  Weak positive correlation: Y = 56.6537 + 0.181987X, R² = 0.115.
  Moderate positive correlation: Y = 25.7595 + 0.645418X, R² = 0.369.
  Strong positive correlation: Y = 9.77271 + 0.745022X, R² = 0.876.]
Strength and Direction of "-" Correlation

[Figure: three scatter plots of Output against Input with fitted lines.
  Weak negative correlation: Y = 74.8524 - 0.181987X, R² = 0.115.
  Moderate negative correlation: Y = 90.3013 - 0.645418X, R² = 0.369.
  Strong negative correlation: Y = 99.1754 - 0.745022X, R² = 0.876.]
Correlation vs. Causation
• Data show that the average life expectancy of Americans increased when the divorce rate went up!
• Is there a correlation between shark attacks and Popsicle sales?

[Diagram: average life expectancy rises with the divorce rate in America, and Popsicle sales rise with the number of shark attacks.]

• Correlation does not imply causation! A third variable may be 'lurking' that causes both x and y to vary.
Example
• The correlation for the worked example:

  R = \frac{S_{xy}}{\sqrt{S_{xx} \cdot S_{yy}}} = \frac{34.6}{\sqrt{14.8 \cdot 83.2}} = 0.98601

  R^2 = (0.98601)^2 = 97.2\%

• Equivalently:

  R^2 = \frac{SSR}{SST} = \frac{80.895}{83.2} = 97.2\%

• 97.2% of the variation in the number of defects can be explained by absenteeism.
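A final sketch (illustrative Python) confirming R and R² for the example:

    # Correlation coefficient and R^2 for the worked example.
    from math import sqrt

    sxx, sxy, syy = 14.8, 34.6, 83.2   # note Syy equals SST here
    r = sxy / sqrt(sxx * syy)
    print(round(r, 5), round(r ** 2, 3))   # 0.98601 0.972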
fin
