Regression Analysis
Regression Analysis is used to:
1) understand the relation between two variables
2) predict the value of one variable based on another variable.
A regression model consists of a dependent (response) variable and an independent (predictor) variable.
[Diagram: Independent Variable(s) → prediction relationship → Dependent Variable]
Regression Analysis
Linear regression estimates the coefficients of the linear equation,
involving one or more independent variables, that best predict the value
of the dependent variable.
If you believe that none of your predictor variables is correlated with
the errors in your dependent variable, you can use the linear regression
procedure.
Simple Linear Regression
The Scatter Diagram – used to graphically investigate the relationship
between the dependent and independent variables
[Scatter plot of all (Xi, Yi) pairs; X (0–60) on the horizontal axis,
Y (0–100) on the vertical axis]
Types of Regression Models
Positive Linear Relationship Relationship NOT Linear
Negative Linear Relationship No Relationship
Simple Linear Regression Model
Regression models are used to test whether a relationship exists between
variables; that is, to use one variable to predict another. However, there
is a random error that cannot be predicted.
Yi = β0 + β1·Xi + εi

where Yi is the dependent (response) variable, Xi is the independent
(predictor/explanatory) variable, β0 is the Y-intercept, β1 is the slope,
and εi is the random error.
Population Linear Regression Model
Observed value: Yi = β0 + β1·Xi + εi, where εi is the random error

Mean response: μY|X = β0 + β1·Xi

[Diagram: an observed value Yi deviates from the population regression
line μY|X = β0 + β1·Xi by the random error εi]
Sample Linear Regression Model
ŷi = b0 + b1·xi

ŷi = predicted value of Y for observation i
xi = value of X for observation i
b0 = sample Y-intercept, used as an estimate of the population β0
b1 = sample slope, used as an estimate of the population β1
Sample Linear Regression Model
Sample data are used to estimate the true
values for the intercept and slope.
ŷi = b0 + b1·xi
The difference between the actual value of Y
and the predicted value (using sample data) is
known as the error.
Error = actual value – predicted value, i.e. ei = Yi – ŷi
Sample Linear Regression Model
ŷi = b0 + b1·xi

b1 = [n·Σ xiyi – (Σ xi)(Σ yi)] / [n·Σ xi² – (Σ xi)²]   (sums over i = 1, …, n)

b0 = ȳ – b1·x̄
Table 3.1. Intelligence Test Scores and Freshmen Chemistry Grades
Student   Test Score (x)   Chemistry Grade (y)
1         65               85
2         50               74
3         55               76
4         65               90
5         55               85
6         70               87
7         65               94
8         70               98
9         55               81
10        70               91
11        50               76
12        55               74
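As a minimal sketch in plain Python (data taken from Table 3.1 above; variable names are illustrative), the raw-sum least-squares formulas for b1 and b0 can be applied directly:

```python
# Data from Table 3.1: intelligence test scores (x) and chemistry grades (y).
x = [65, 50, 55, 65, 55, 70, 65, 70, 55, 70, 50, 55]
y = [85, 74, 76, 90, 85, 87, 94, 98, 81, 91, 76, 74]
n = len(x)

sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# b1 = [n*Sum(xy) - Sum(x)*Sum(y)] / [n*Sum(x^2) - (Sum(x))^2]
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
# b0 = ybar - b1*xbar
b0 = sum_y / n - b1 * sum_x / n

print(round(b1, 3), round(b0, 3))  # 0.897 30.043
```

Note that carrying full precision for b1 gives b0 ≈ 30.043 (matching the SPSS output later in these slides); rounding b1 to 0.897 before computing b0 gives the 30.056 shown on the next slides.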
Figure 3.1. Scatter Diagram with regression line
[Scatter plot of Chemistry Grade (70–100) vs. Intelligence Test Score
(40–75) with the fitted line ŷi = b0 + b1·xi; the point estimates of b0
and b1 are determined using the method of least squares]
Measures of Variation: The Sum of Squares

SST = Σ (Yi – Ȳ)²    (total sum of squares)
SSR = Σ (Ŷi – Ȳ)²    (regression sum of squares)
SSE = Σ (Yi – Ŷi)²   (error sum of squares)

[Diagram: for a point (Xi, Yi), the deviations Yi – Ȳ, Ŷi – Ȳ, and
Yi – Ŷi about the fitted line Ŷ = b0 + b1·X]
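The three sums of squares can be sketched in plain Python (data assumed from Table 3.1; names illustrative), which also verifies the decomposition SST = SSR + SSE:

```python
# Data from Table 3.1.
x = [65, 50, 55, 65, 55, 70, 65, 70, 55, 70, 50, 55]
y = [85, 74, 76, 90, 85, 87, 94, 98, 81, 91, 76, 74]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Least-squares fit in deviation form.
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

sst = sum((yi - ybar) ** 2 for yi in y)               # total variation
ssr = sum((yh - ybar) ** 2 for yh in yhat)            # explained by regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # unexplained error

print(round(sst, 2), round(ssr, 3), round(sse, 3))  # 728.25 541.693 186.557
```

These totals match the ANOVA table from SPSS later in the slides.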
Method of Least Squares
SSE = Σ ei² = Σ (yi – b0 – b1·xi)²   (sums over i = 1, …, n)

Differentiating Σ ei² with respect to b0 and b1 and equating the
derivatives to zero gives:

b1 = [n·Σ xiyi – (Σ xi)(Σ yi)] / [n·Σ xi² – (Σ xi)²]

b0 = ȳ – b1·x̄
Method of Least Squares
b1 = [n·Σ xiyi – (Σ xi)(Σ yi)] / [n·Σ xi² – (Σ xi)²]

or, equivalently, in deviation form:

b1 = Σ (xi – x̄)(yi – ȳ) / Σ (xi – x̄)²
Applying the least-squares formulas to the data of Table 3.1 gives

b1 = 0.897
b0 = 30.056

so the fitted regression line is

ŷi = 30.056 + 0.897·xi
Figure 3.1. Scatter Diagram with regression line
[Scatter plot of Chemistry Grade (70–100) vs. Intelligence Test Score
(40–75) with the fitted line ŷi = 30.056 + 0.897·xi]

The slope of 0.897 means that for each increase of one unit in
Intelligence Test Score (X), the Chemistry Grade (Y) is estimated to
increase by 0.897 units.
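As a quick illustration (plain Python; the helper `predict` is hypothetical), the fitted line gives point predictions, and the slope is exactly the estimated change in ŷ per one-unit change in x:

```python
def predict(x):
    # Fitted line from the slides: yhat = 30.056 + 0.897 * x
    # (hypothetical helper for illustration).
    return 30.056 + 0.897 * x

# Predicted chemistry grade for a test score of 60.
print(round(predict(60), 3))  # 83.876

# Raising the test score by one unit raises the prediction by the slope.
print(round(predict(61) - predict(60), 3))  # 0.897
```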
Using SPSS

Graphs → Scatter → Simple produces the scatter plot. To add the regression
line, use the SPSS Chart Editor: Chart → Options → Fit Line → Regression.

[Scatter plot of Chemistry Grade (70–100) vs. Test Score (40–80) with the
regression (prediction) line; Rsq = 0.7438]
Using SPSS

Analyze → Regression → Linear

Coefficients(a)

                 Unstandardized Coefficients   Standardized Coefficients
Model            B          Std. Error         Beta         t        Sig.
1  (Constant)    30.043     10.137                          2.964    .014
   Test Score    .897       .167               .862         5.389    .000
a. Dependent Variable: Chemistry Grade

ŷi = 30.043 + 0.897·xi
Using SPSS

Analyze → Regression → Linear

Model Summary(b)

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .862(a)  .744       .718                4.319
a. Predictors: (Constant), Test Score
b. Dependent Variable: Chemistry Grade

R is the coefficient of correlation and R Square the coefficient of
determination (measures of variation); the Std. Error of the Estimate is
the standard deviation around the regression line.
ANOVA(b)

Model           Sum of Squares   df   Mean Square   F        Sig.
1  Regression   541.693          1    541.693       29.036   .000(a)
   Residual     186.557          10   18.656
   Total        728.250          11
a. Predictors: (Constant), Test Score
b. Dependent Variable: Chemistry Grade
Testing the Significance of b1

Similar to a test on r in the one-predictor case:

t = (0.8972136 – 0)/0.1665043 = 5.39, so H0: β1 = 0 is rejected, i.e. the
regression line has a nonzero slope.
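The t statistic can be reproduced in plain Python (a sketch; data assumed from Table 3.1), using the standard error of the slope se(b1) = sqrt(MSE / Sxx):

```python
import math

x = [65, 50, 55, 65, 55, 70, 65, 70, 55, 70, 50, 55]
y = [85, 74, 76, 90, 85, 87, 94, 98, 81, 91, 76, 74]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)            # s^2, variance around the regression line
se_b1 = math.sqrt(mse / sxx)   # standard error of the slope

t = (b1 - 0) / se_b1           # test of H0: beta1 = 0
print(round(se_b1, 4), round(t, 2))  # 0.1665 5.39
```

The values agree with the SPSS coefficients table (Std. Error .167, t 5.389).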
Variance Explained – r²

r² tells us the proportion of variance in Y which is explained by X:

r² = SS regression / SS total = SSŶ / SSY = Σ (Ŷ – Ȳ)² / Σ (Y – Ȳ)²

• a ratio reflecting the proportion of variance captured by our model
relative to the overall variance in our data
• highly interpretable: r² = .50 means 50% of the variance in Y is
explained by X
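A minimal sketch in plain Python (data assumed from Table 3.1) confirms that r² is both the SSR/SST ratio and the square of the correlation coefficient:

```python
import math

x = [65, 50, 55, 65, 55, 70, 65, 70, 55, 70, 50, 55]
y = [85, 74, 76, 90, 85, 87, 94, 98, 81, 91, 76, 74]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

# r^2 as SS(regression) / SS(total) ...
r2 = (sxy ** 2 / sxx) / syy
# ... equals the squared correlation coefficient r.
r = sxy / math.sqrt(sxx * syy)

print(round(r2, 4), round(r, 3))  # 0.7438 0.862
```

This matches the Rsq = 0.7438 and R = .862 reported by SPSS.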
Linear Regression Assumptions

For linear models:
1. Normality – Y values are normally distributed for each X; the
probability distribution of the error is normal
2. Homoscedasticity (constant variance)
3. Independence of errors
Variation of Errors Around the Regression Line

[Diagram: normal error distributions f(e) around the regression line at
two values X1 and X2]

y values are normally distributed around the regression line. For each x
value, the “spread” or variance around the regression line is the same.
Residual Analysis
Purposes
• Examine linearity
• Evaluate violations of assumptions

Graphical Analysis of Residuals
• Plot residuals vs. Xi values (the residual is the difference between
the actual Yi and the predicted Ŷi)
• Studentized residuals allow consideration of the magnitude of the
residuals
Residual Analysis for Linearity
[Residual plots e vs. X: a curved pattern indicates the relationship is
not linear; a random scatter about zero indicates linearity]
Residual Analysis for Homoscedasticity
[Plots of studentized residuals (SR) vs. X: a fan-shaped spread indicates
heteroscedasticity; a constant spread indicates homoscedasticity]
Using Standardized Residuals

After fitting the model (in Stata, via the predict postestimation
command):
• predict the Chemistry Grade (fitted values)
• predict the residuals
• predict the studentized residuals
• predict the standardized residuals
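These quantities (mirroring what Stata's predict produces) can be sketched in plain Python; the leverage-based formula below is the internally studentized residual, and the names are illustrative:

```python
import math

x = [65, 50, 55, 65, 55, 70, 65, 70, 55, 70, 50, 55]
y = [85, 74, 76, 90, 85, 87, 94, 98, 81, 91, 76, 74]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

yhat = [b0 + b1 * xi for xi in x]                  # fitted chemistry grades
e = [yi - yh for yi, yh in zip(y, yhat)]           # raw residuals
s = math.sqrt(sum(ei ** 2 for ei in e) / (n - 2))  # std. error of estimate

std_res = [ei / s for ei in e]                     # standardized residuals
h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]   # leverages
stud_res = [ei / (s * math.sqrt(1 - hi))           # studentized residuals
            for ei, hi in zip(e, h)]

print(round(s, 3))  # 4.319
```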
Residual Analysis for Normality

kdensity r, normal        swilk r

[Kernel density estimate of the residuals (kernel = epanechnikov,
bandwidth = 2.25) overlaid with the normal density over the range –10 to
10; the residuals appear approximately normal]
Residual Analysis for Linearity

scatter r X, yline(0)

[Plot of residuals vs. Test Score (50–70), scattered randomly about zero:
linear]
Residual Analysis for Homoscedasticity

scatter r1 X, yline(0)        scatter sr X, yline(0)

[Plots of standardized residuals (left) and studentized residuals (right)
vs. Test Score (50–70), each scattered evenly about zero within ±2:
homoscedasticity]
Residual Analysis for Homoscedasticity

hettest (Stata's Breusch–Pagan/Cook–Weisberg test for heteroskedasticity)

Conclusion: homoscedasticity
Residual Analysis for Independence

scatter r obs, yline(0)

[Plot of residuals vs. observation order (0–15), scattered randomly about
zero: independent]
Residual Analysis for Independence

Durbin-Watson Statistic. The D-W statistic is defined as:

d = Σ (et – et–1)² / Σ et²   (numerator over t = 2, …, n; denominator over t = 1, …, n)

Values of d near 2 indicate independent errors.
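As a sketch of this definition (plain Python; the example residual sequences are made up), values of d near 0 suggest positive autocorrelation, near 4 suggest negative autocorrelation, and near 2 suggest independence:

```python
def durbin_watson(e):
    # d = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(et ** 2 for et in e)
    return num / den

# Hypothetical residual sequences:
print(durbin_watson([1, 1, 1, -1, -1, -1]))  # runs of like signs -> d near 0
print(durbin_watson([1, -1, 1, -1, 1, -1]))  # alternating signs  -> d near 4
```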