Selvanathan 6e - 19 - PPT
Multiple regression
Chapter outline
19.1 Model and required conditions
19.2 Estimating the coefficients and assessing the model
19.3 Regression diagnostics – II
19.4 Regression diagnostics – III (time series)
Learning objectives
LO1 Develop a multiple regression model, use a computer program to estimate the model and interpret the estimated coefficients
LO2 Understand the adjusted coefficient of determination and assess the fitness of the model
LO3 Test the significance of the individual coefficients and the overall utility of the model
LO4 Use the estimated regression model to make predictions
LO5 Perform diagnostic checks for the regression model assumptions
Introduction
The simple linear regression model was used to
analyse how one numerical variable (the dependent
variable y) is related to one other numerical variable
(the independent variable x).
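Multiple regression extends this to k independent variables:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$
Dependent variable: $y$
Independent variables: $x_1, x_2, \dots, x_k$
Coefficients: $\beta_0, \beta_1, \dots, \beta_k$
Error variable: $\varepsilon$
Note how the straight line $y = \beta_0 + \beta_1 x$ becomes a plane, $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$, when there are two independent variables.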
Example 1…
[Figure: Margin (profitability) and the groups of factors behind it: Competition, Market awareness, Customers, Community, Physical (of the site).]
Example 1…
Model Assessment…
Coefficient of Determination…
Again, the coefficient of determination is defined as: $R^2 = 1 - \dfrac{SSE}{SST} = \dfrac{SSR}{SST}$
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.7231
R Square 0.5229
Adjusted R Square 0.4921
Standard Error 5.5248
Observations 100
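Adjusted R2 Value…
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.7231
R Square 0.5229
Adjusted R Square 0.4921 (What’s this?)
Standard Error 5.5248
Observations 100
The ‘adjusted’ R2 is the coefficient of determination adjusted for degrees of freedom: it takes into account the sample size n and the number of independent variables k.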
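The regression statistics above can be reproduced from the sums of squares that Excel reports for Example 1 (SSR = 3110.80, SSE = 2838.66, n = 100, k = 6). The sketch below uses the standard textbook formulas; the variable names are illustrative and not part of the slides.

```python
# Values taken from the Example 1 Excel output.
n = 100          # number of observations
k = 6            # number of independent variables
ssr = 3110.80    # sum of squares for regression
sse = 2838.66    # sum of squares for error (residual)
sst = ssr + sse  # total variation in y

# Coefficient of determination: share of the variation in y that is explained.
r2 = 1 - sse / sst

# Adjusted R2: each sum of squares is divided by its degrees of freedom.
adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))

# Standard error of estimate.
std_error = (sse / (n - k - 1)) ** 0.5

print(f"R Square          {r2:.4f}")
print(f"Adjusted R Square {adj_r2:.4f}")
print(f"Standard Error    {std_error:.4f}")
```

Rounded to four decimals, the three values agree with the R Square, Adjusted R Square and Standard Error lines of the summary output.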
The F-test
• Construct the F-statistic
MSR = SSR/k
MSE = SSE/(n − k − 1)
$F = \dfrac{MSR}{MSE}$
• Reject the null hypothesis if $F > F_{\alpha,\,k,\,n-k-1}$
SST = [Variation in y] = SSR + SSE. Large F results from a large SSR. Then much of the variation in y is explained by the regression model. The null hypothesis should be rejected; thus the model is useful.
Required conditions must be satisfied.
A large value of F indicates that most of the variation in y is explained by the regression model
and that the model is valid or useful. A small value of F indicates that most of the variation in y is
unexplained.
Example 1…
ANOVA
             df   SS        MS       F       Significance F
Regression    6   3110.80   518.47   16.99   0.00
Residual     93   2838.66    30.52
Total        99   5949.46
SS column: SSR (Regression row) and SSE (Residual row). MS column: MSR and MSE. F = MSR/MSE. Significance F is the p-value of the F-test.
Example 1…
• Excel provides the following ANOVA results
ANOVA
             df   SS        MS       F       Significance F
Regression    6   3110.80   518.47   16.99   0.00
Residual     93   2838.66    30.52
Total        99   5949.46
Conclusion: There is sufficient evidence to reject the null hypothesis in favour of the alternative hypothesis. That is, at least one of the $\beta_i$ is not equal to zero. Thus, at least one independent variable is linearly related to y.
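As a check on the ANOVA table, the F-statistic can be recomputed from the sums of squares and degrees of freedom. This is a minimal sketch using the Example 1 figures; the second calculation uses the equivalent form of F written in terms of R2, which is a standard identity rather than something shown on the slides.

```python
# Values taken from the Example 1 ANOVA table.
n, k = 100, 6
ssr, sse = 3110.80, 2838.66

msr = ssr / k            # mean square for regression
mse = sse / (n - k - 1)  # mean square for error
f_stat = msr / mse

# Equivalent form: F = (R2 / k) / ((1 - R2) / (n - k - 1)).
r2 = ssr / (ssr + sse)
f_from_r2 = (r2 / k) / ((1 - r2) / (n - k - 1))

print(f"MSR = {msr:.2f}, MSE = {mse:.2f}, F = {f_stat:.2f}")
```

A large F arises exactly when R2 is large relative to the unexplained share 1 − R2, which is why the two measures lead to the same assessment of the model.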
A perfect model has SSE = 0, a standard error of estimate of 0 and R2 = 1. A useless model has R2 = 0 and F = 0.
Once we’re satisfied that the model fits the data as well as possible, and that the required conditions are
satisfied, we can interpret and test the individual coefficients and use the model to predict and estimate…
$H_0: \beta_i = 0$
$H_A: \beta_i \neq 0$
Test statistic: $t = \dfrac{\hat{\beta}_i - \beta_i}{s_{\hat{\beta}_i}}$, d.f. $= n - k - 1$
Excel output:
                    Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept           38.6643        6.9690            5.5480   0.0000    24.8252     52.5034
Number (x1)         -0.0076        0.0013           -6.0586   0.0000    -0.0101     -0.0051
Nearest (x2)         1.6564        0.6345            2.6105   0.0105     0.3964      2.9164
Office Space (x3)    0.1980        0.0342            5.7933   0.0000     0.1302      0.2659
Enrollment (x4)      0.2131        0.1338            1.5921   0.1148    -0.0527      0.4788
Income (x5)          0.3660        0.1271            2.8803   0.0049     0.1137      0.6184
Distance (x6)       -0.1424        0.1119           -1.2725   0.2064    -0.3647      0.0798
Conclusion:
Thus, the number of hotel rooms within 5 km, the distance to the nearest competitor (Holiday Inn), the amount of office space, and median household income are linearly related to the operating margin.
There is no evidence to infer that university enrolment and distance to the town centre are linearly related to operating margin.
The number of hotel rooms within 5 km and the distance to the town centre have negative estimated coefficients, while all other variables have positive ones.
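The conclusion can be traced back to the coefficient table by comparing each p-value with a significance level. The sketch below assumes a 5% level (the slides do not state one explicitly, although the output reports 95% limits) and copies the coefficients and p-values from the Excel output; the list and variable names are illustrative.

```python
ALPHA = 0.05  # assumed significance level

# (variable, estimated coefficient, p-value) from the Example 1 Excel output.
rows = [
    ("Number (x1)", -0.0076, 0.0000),
    ("Nearest (x2)", 1.6564, 0.0105),
    ("Office Space (x3)", 0.1980, 0.0000),
    ("Enrollment (x4)", 0.2131, 0.1148),
    ("Income (x5)", 0.3660, 0.0049),
    ("Distance (x6)", -0.1424, 0.2064),
]

# Reject H0: beta_i = 0 when the p-value is below the significance level.
significant = [name for name, coef, p in rows if p < ALPHA]
not_significant = [name for name, coef, p in rows if p >= ALPHA]

# The sign of each estimate gives the direction of the estimated relationship.
negative = [name for name, coef, p in rows if coef < 0]

print("Linearly related to margin:", significant)
print("No evidence of a linear relationship:", not_significant)
print("Negative coefficients:", negative)
```

Under this assumption, Enrollment and Distance are the only variables whose 95% limits include zero, which is another way of reading the same result from the table.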
Example 1
Prediction Interval
Margin (y)
Predicted value   37.07624
Prediction Interval
Lower limit       25.35321
Upper limit       48.79927
We predict that the operating margin will fall between 25.4 and 48.8.
If management defines a profitable …
Example 2
Using Excel (Data Analysis): Data > Data Analysis > Regression
that the number of bedrooms, the house size, and the lot size
(i.e. this is reasonable: larger houses have more bedrooms and are situated on larger lots, and smaller
Multicollinearity has affected the t-tests so that they implied that none of the independent
Time    40   45   50   55   60
Marks   20   24   26   30   32
        23   26   25   32   31
        .    .    .    .    .
        .    .    .    .    .
Example 3 - Solution
[Figure: histogram of the residuals; the visible part of its caption reads “… distributed.”]
ANOVA
             df   SS       MS         F          Significance F
Regression    1   1512.5   1512.5     284.7743   9.42E-31
Residual     98    520.5   5.311224
Total        99   2033

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   -2.2           1.64582          -1.33672   0.184409   -5.46608    1.066077
Time         0.55          0.032592         16.87526   9.42E-31    0.485322   0.614678
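The estimated line from this output is Mark = −2.2 + 0.55 × Time. The sketch below uses it to compute point predictions for the study times shown in the data table and checks a standard property of simple regression, namely that the F-statistic equals the square of the slope's t-statistic. The function name is illustrative only.

```python
# Estimates taken from the Example 3 Excel output.
intercept = -2.2
slope = 0.55
t_stat_slope = 16.87526
f_stat = 284.7743

def predict_mark(time):
    """Point prediction of the mark for a given study time."""
    return intercept + slope * time

for time in (40, 45, 50, 55, 60):
    print(f"Time {time}: predicted mark {predict_mark(time):.2f}")

# With one independent variable, F equals t squared (up to rounding).
print(f"t^2 = {t_stat_slope ** 2:.4f}, F = {f_stat}")
```

These are point predictions only; they say nothing about whether the required conditions hold, which is what the residual diagnostics address.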
Example 3 - Solution…
The standard error of estimate seems to increase with the predicted value of y.
1. $y' = \log_e y$
2. $y' = 1/y$
Example 3 - Solution…
Let us see what happens when a transformation is applied:
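[Figure: two scatter plots against Time. Left panel (Mark): the original data, where ‘Mark’ is a function of ‘Time’, with the point (40, 23) highlighted. Right panel (LogMark): the modified data, where LogMark is a function of ‘Time’, with the points (40, 3.135) and (40, 2.89) highlighted.]
$\log_e 23 = 3.135$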
Example 3 - Solution…
SUMMARY OUTPUT
ANOVA
             df   SS         MS        F        Significance F
Regression    1   2.357901   2.35790   330.72   3.58E-33
Residual     98   0.698705   0.00713
Total        99   3.056606
Example 3 - Solution…
[Figure: histogram of the standard residuals; the visible part of its caption reads “… to be normally distributed.”]
[Figure: standard residuals plotted against the predicted y.]
The standard error still changes with the predicted y, but the change is smaller than before.
Example 3 - Solution…
$\text{Mark} = \operatorname{antilog}_e 3.323 = e^{3.323} = 27.743$
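The two directions of the transformation can be verified with Python's standard `math` module. This is a small sketch using only the numbers quoted in the slides: a mark of 23 is transformed to LogMark, and a predicted LogMark of 3.323 is converted back to the original scale.

```python
import math

# Forward transformation: y' = log_e(y) for an observed mark of 23.
log_mark = math.log(23)

# Back-transformation: a prediction on the log scale is converted to a mark.
predicted_log_mark = 3.323
predicted_mark = math.exp(predicted_log_mark)

print(f"log_e(23) = {log_mark:.3f}")
print(f"e^{predicted_log_mark} = {predicted_mark:.3f}")
```

Because the model was fitted to LogMark, every prediction it produces has to be passed through the exponential in this way before it can be read as a mark.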
19.4 Regression Diagnostics – III (Time series)
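Durbin–Watson test
This test detects first-order autocorrelation between consecutive residuals in a time series of the form
$e_t = \rho e_{t-1} + v_t, \quad t = 2, 3, \dots, n,$
$H_0: \rho = 0$
$H_A: \rho \neq 0$
Test statistic:
$d = \dfrac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}, \quad d \approx 2(1 - \rho)$
Since $-1 \le \rho \le 1$, we have $0 \le d \le 4$. Therefore,
• Small values of d (d < 2) indicate a positive first-order autocorrelation.
• Large values of d (d > 2) imply a negative first-order autocorrelation.
[Figures: residual at time t plotted against time, one panel for each pattern of autocorrelation.]
[Figures: number lines from 0 to 4 marking the critical values dL and dU (test for positive autocorrelation), 4−dU and 4−dL (test for negative autocorrelation), and all four values (two-tail test).]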
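The statistic d is straightforward to compute from a list of residuals ordered in time. The sketch below is a minimal implementation of the formula above; the function name and the two residual series are hypothetical and were made up purely to show how d behaves. Comparing d with the critical values dL and dU still requires the Durbin–Watson tables.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: squared successive differences over squared residuals."""
    if len(residuals) < 2:
        raise ValueError("at least two residuals are required")
    denominator = sum(e ** 2 for e in residuals)
    if denominator == 0:
        raise ValueError("all residuals are zero; d is undefined")
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    return numerator / denominator

# Hypothetical residuals: long runs of the same sign (positive autocorrelation).
runs = [1.0, 1.2, 0.8, 0.9, -1.1, -0.9, -1.0, -0.8]

# Hypothetical residuals: the sign flips every period (negative autocorrelation).
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

print(f"runs:        d = {durbin_watson(runs):.3f}")
print(f"alternating: d = {durbin_watson(alternating):.3f}")
```

The first series gives a value well below 2 and the second a value well above 2, matching the interpretation of small and large d given above.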
Example 4 IDENTIFY
How does the weather affect the sales of lift tickets at a ski resort? Data on ticket sales for the past 20 years, along with the total snowfall and the average temperature during July in each year, were collected.
The model seems to be very poor:
• Model is not valid (significance F = 0.33)
• None of the 3 variables seem to be linearly related to sales.
Diagnosis of the required conditions resulted in the following findings:
Example 4 – Solution…
The histogram of residuals…