Selvanathan 6e - 19 - PPT

CHAPTER 19

Multiple regression
Chapter outline
19.1 Model and required conditions
19.2 Estimating the coefficients and assessing the model
19.3 Regression diagnostics – II
19.4 Regression diagnostics – III (time series)
Learning objectives
LO1 Develop a multiple regression model, use a computer and statistical software to estimate the model, and interpret the estimated coefficients.
LO2 Understand the adjusted coefficient of determination and assess the fit of the model.
LO3 Test the significance of the individual coefficients and the overall utility of the model.
LO4 Use the estimated regression model to make predictions.
LO5 Perform diagnostic checks of the regression model assumptions.
19.5

Introduction
The simple linear regression model was used to analyse how one numerical variable (the dependent variable y) is related to one other numerical variable (the independent variable x).

Multiple regression allows for any number of independent variables.

We expect to develop models that fit the data better than would a simple linear regression model.
19.6

19.1 The model and required conditions

We now assume we have k independent variables potentially related to the one dependent variable. This relationship is represented by the first-order linear model:

y = β0 + β1x1 + β2x2 + … + βkxk + ε

where y is the dependent variable, x1, x2, …, xk are the independent variables, β0, β1, …, βk are the coefficients and ε is the error variable.

In the one-variable, two-dimensional case we drew a regression line; here we imagine a response surface.
19.7

The simple linear regression model allows for one independent variable x:

y = β0 + β1x + ε    (the deterministic part, y = β0 + β1x, is a straight line)

The multiple linear regression model allows for more than one independent variable, for example:

y = β0 + β1x1 + β2x2 + ε

Note how the straight line becomes a plane: the deterministic part, y = β0 + β1x1 + β2x2, is a response surface over the (x1, x2) space.
19.8

Required conditions for the error variable ε

(1) The mean of ε is zero: E(ε) = 0.
(2) The standard deviation of ε is a constant (σε).
(3) The errors are independent.
(4) The errors are independent of the independent variable x.
(5) The error ε is normally distributed.

These conditions are required in order to
• estimate the model coefficients with desirable properties
• test hypotheses about the model coefficients
• assess the resulting model.
19.9

19.2 Estimating the coefficients and assessing the model…

Estimating the model…
The sample regression equation is expressed as:

ŷ = β̂0 + β̂1x1 + β̂2x2 + … + β̂kxk

We will use computer output to assess the model.

Assessing the model…
• How well does the model fit the data?
• Is it useful?
• Are any required conditions violated?
19.10

Estimating the coefficients and assessing the model…

Employ the model…

ŷ = β̂0 + β̂1x1 + β̂2x2 + … + β̂kxk

• Interpreting the coefficients
• Making predictions using the prediction equation
• Estimating the expected value of the dependent variable.
19.11

Regression analysis steps…


1. Use a computer and software to generate the
estimated coefficients and the statistics required to
assess the model.
2. Diagnose violations of required conditions. If there
are problems, attempt to remedy them.
3. Assess the fitness of the model.
• standard error of estimate
• coefficient of determination
• F-test of the analysis of variance.
4. If 1, 2 and 3 are OK, use the model to predict or
estimate the expected value of the dependent
variable.
19.12

Example 1 - Selecting sites for a motel chain (Example 19.1, p770)

The Holiday Inns group is planning an expansion. The management wishes to predict which sites are likely to be profitable. Several areas in which predictors of profitability can be identified are:
• competition
• market awareness
• demand generators
• demographics
• physical quality.
19.13

Example 1…
Dependent variable: Margin (profitability).
Independent variables:
• Competition – Rooms: number of hotel/motel rooms within 5 km of the site
• Market awareness – Nearest: distance to the nearest Holiday Inn
• Customers (demand generators) – Office: office space; University: university enrolment
• Community (demographics) – Income: median household income
• Physical – Distance: distance to downtown.
19.14

Example 1…

Data were collected from 100 randomly selected Holiday Inns and run for the following suggested model:

Margin = β0 + β1Rooms + β2Nearest + β3Office + β4Enrolment + β5Income + β6Distance + ε
19.15

Excel output

This is the sample regression equation (sometimes called the prediction equation):

MARGIN = 38.664 – 0.0076ROOMS + 1.656NEAREST + 0.198OFFICE + 0.213ENROLMT + 0.366INCOME – 0.142DISTTWN

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.7231
R Square            0.5229
Adjusted R Square   0.4921
Standard Error      5.5248
Observations        100

ANOVA (let us assess this equation)
             df   SS        MS       F       Significance F
Regression   6    3110.80   518.47   16.99   0.00
Residual     93   2838.66   30.52
Total        99   5949.46

                    Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept           38.6643        6.9690           5.5480    0.0000    24.8252     52.5034
Number (x1)         -0.0076        0.0013           -6.0586   0.0000    -0.0101     -0.0051
Nearest (x2)        1.6564         0.6345           2.6105    0.0105    0.3964      2.9164
Office Space (x3)   0.1980         0.0342           5.7933    0.0000    0.1302      0.2659
Enrollment (x4)     0.2131         0.1338           1.5921    0.1148    -0.0527     0.4788
Income (x5)         0.3660         0.1271           2.8803    0.0049    0.1137      0.6184
Distance (x6)       -0.1424        0.1119           -1.2725   0.2064    -0.3647     0.0798
19.16
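The same estimation can be reproduced outside Excel. Below is a minimal sketch using Python's statsmodels, assuming the Example 1 data sit in a file named hotels.csv with columns Margin, Rooms, Nearest, Office, Enrolment, Income and Distance (hypothetical file and column names).

import pandas as pd
import statsmodels.api as sm

# Hypothetical file and column names for the Example 1 data set
df = pd.read_csv("hotels.csv")
X = sm.add_constant(df[["Rooms", "Nearest", "Office", "Enrolment", "Income", "Distance"]])  # adds the intercept term
y = df["Margin"]

model = sm.OLS(y, X).fit()   # least-squares estimation of the coefficients
print(model.summary())       # coefficients, R^2, adjusted R^2 and the ANOVA F-test

The summary() table plays the same role as the Excel SUMMARY OUTPUT above.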

Model Assessment…

We will assess the model in three ways:


• Standard error of estimate,
• Coefficient of determination, and
• F-test of the analysis of variance.
19.17

Standard error of estimate…

In multiple regression, the standard error of estimate is defined as:

sε = √( SSE / (n – k – 1) )

where n is the sample size and k is the number of independent variables in the model. We compare this value with the mean value of y:

sε = 5.5248 compared to ȳ = 45.739

It seems the standard error of estimate is not particularly small. What can we conclude?
19.18
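As a quick worked check, sε can be recomputed from the SSE and degrees of freedom reported in the ANOVA table above:

import math

SSE = 2838.66              # residual sum of squares from the ANOVA table
n, k = 100, 6              # sample size and number of independent variables
s_e = math.sqrt(SSE / (n - k - 1))
print(round(s_e, 4))       # ~5.5248, matching 'Standard Error' in the Excel output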

Coefficient of determination…
Again, the coefficient of determination is defined as:

R² = 1 – SSE/SST = SSR/SST

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.7231
R Square            0.5229
Adjusted R Square   0.4921
Standard Error      5.5248
Observations        100

This means that 52.29% of the variation in the operating margin is explained by the six independent variables, but 47.71% remains unexplained.
19.19

Adjusted R² value…

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.7231
R Square            0.5229
Adjusted R Square   0.4921   ← What's this?
Standard Error      5.5248
Observations        100

The 'adjusted' R² is the coefficient of determination adjusted for degrees of freedom. It takes into account the sample size n and the number of independent variables k, and is given by:

Adjusted R² = 1 – [SSE/(n – k – 1)] / [SST/(n – 1)] = 1 – (1 – R²)(n – 1)/(n – k – 1)
19.20
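Plugging the Example 1 values into this formula reproduces the Excel figure (a small worked check):

R2, n, k = 0.5229, 100, 6
adj_R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)
print(round(adj_R2, 4))    # ~0.4921, matching 'Adjusted R Square' in the output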

Testing the validity of the model…

In a multiple regression model (i.e. more than one independent variable), we utilise an analysis of variance technique to test the overall validity/utility of the model. Here's the idea:

H0: β1 = β2 = … = βk = 0
HA: At least one βi is not equal to zero.

If the null hypothesis is true, none of the independent variables is linearly related to y, and so the model is invalid. If at least one βi ≠ 0, the model does have some validity and it is useful.
19.21

Testing the validity of the model…

To test these hypotheses we perform an analysis of variance procedure.

The F-test
• Construct the F-statistic: F = MSR/MSE, where MSR = SSR/k and MSE = SSE/(n – k – 1).
• Rejection region: F > Fα,k,n–k–1 (the required conditions must be satisfied).

SST = [variation in y] = SSR + SSE. A large F results from a large SSR; in that case much of the variation in y is explained by the regression model, the null hypothesis is rejected, and the model is useful. A worked check of this calculation follows.
19.22
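The F statistic and a critical value can be checked directly from the ANOVA components (a sketch using scipy for the F distribution; the numbers are those reported for Example 1):

from scipy import stats

SSR, SSE = 3110.80, 2838.66
n, k = 100, 6
MSR = SSR / k                              # 518.47
MSE = SSE / (n - k - 1)                    # 30.52
F = MSR / MSE                              # ~16.99
F_crit = stats.f.ppf(0.95, k, n - k - 1)   # critical value at alpha = 0.05 (about 2.2; the slides quote 2.17)
print(round(F, 2), round(F_crit, 2))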

Testing the validity of the model…

ANOVA table for regression analysis:

Source of variation   Degrees of freedom   Sums of squares   Mean squares            F-statistic
Regression            k                    SSR               MSR = SSR/k             F = MSR/MSE
Error                 n – k – 1            SSE               MSE = SSE/(n – k – 1)
Total                 n – 1                SST

A large value of F indicates that most of the variation in y is explained by the regression model and that the model is valid or useful. A small value of F indicates that most of the variation in y is unexplained.
19.23

Example 1…

Excel provides the following ANOVA results. The SS column contains SSR, SSE and SST, the MS column contains MSR and MSE, F = MSR/MSE, and 'Significance F' is the p-value of the F-test.

ANOVA
             df   SS        MS       F       Significance F
Regression   6    3110.80   518.47   16.99   0.00
Residual     93   2838.66   30.52
Total        99   5949.46
19.24

Example 1…
Excel provides the following ANOVA results:

ANOVA
             df   SS        MS       F       Significance F
Regression   6    3110.80   518.47   16.99   0.00
Residual     93   2838.66   30.52
Total        99   5949.46

Fα,k,n–k–1 = F0.05,6,100–6–1 = 2.17

F = 16.99 > 2.17

Also, the p-value (Significance F) = 0.00. Clearly, p-value = 0.00 < 0.05 = α, so the null hypothesis is rejected.

Conclusion: There is sufficient evidence to reject the null hypothesis in favour of the alternative hypothesis. That is, at least one of the βi is not equal to zero. Thus, at least one independent variable is linearly related to y.

This linear regression model is useful.
19.25

Table 19.1 Summary

SSE     sε      R²           F       Assessment of model
0       0       1            ∞       Perfect
Small   Small   Close to 1   Large   Good
Large   Large   Close to 0   Small   Poor
                0            0       Useless

Once we're satisfied that the model fits the data as well as possible, and that the required conditions are satisfied, we can interpret and test the individual coefficients and use the model to predict and estimate…
19.26

Interpreting the coefficients

• β̂0 = 38.66. This is the intercept, the value of y when all the variables take the value zero. Since the data ranges of the independent variables do not cover the value zero, do not interpret the intercept.
• β̂1 = –0.0076. In this model, for each additional 1 000 rooms within 5 km of the Holiday Inn, the operating margin decreases on average by 7.6% (assuming the other variables are held constant).
• β̂2 = 1.656. In this model, for each additional km that the nearest competitor is from the Holiday Inn, the average operating margin increases by 1.65% (holding the other variables constant).
19.27

Interpreting the coefficients…

• β̂3 = 0.198. For each additional 1 000 square metres of office space, the average increase in operating margin will be 0.198%.
• β̂4 = 0.213. For each additional thousand students enrolled, MARGIN increases on average by 0.21%.
• β̂5 = 0.366. For each additional $1 000 increase in median household income, MARGIN increases on average by 0.37%.
• β̂6 = –0.142. For each additional km to downtown, MARGIN decreases by 0.14% on average.
19.28

Testing the coefficients

For each independent variable, we can test to determine whether there is enough evidence of a linear relationship between it and the dependent variable for the entire population:

H0: βi = 0
HA: βi ≠ 0    (for i = 1, 2, …, k)

using

t = (β̂i – βi) / s_β̂i

as our t test statistic (with n – k – 1 degrees of freedom).
19.29

Testing the coefficients…

The hypotheses for each βi are:
H0: βi = 0
HA: βi ≠ 0

Test statistic:  t = (β̂i – βi) / s_β̂i,   d.f. = n – k – 1

Excel output:
                    Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept           38.6643        6.9690           5.5480    0.0000    24.8252     52.5034
Number (x1)         -0.0076        0.0013           -6.0586   0.0000    -0.0101     -0.0051
Nearest (x2)        1.6564         0.6345           2.6105    0.0105    0.3964      2.9164
Office Space (x3)   0.1980         0.0342           5.7933    0.0000    0.1302      0.2659
Enrollment (x4)     0.2131         0.1338           1.5921    0.1148    -0.0527     0.4788
Income (x5)         0.3660         0.1271           2.8803    0.0049    0.1137      0.6184
Distance (x6)       -0.1424        0.1119           -1.2725   0.2064    -0.3647     0.0798
19.30
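Each t statistic in the output is simply the estimated coefficient divided by its standard error (under H0: βi = 0), and the p-value comes from a t distribution with n – k – 1 = 93 degrees of freedom. A small worked check for the Nearest coefficient:

from scipy import stats

beta_hat, se = 1.6564, 0.6345               # Nearest (x2) from the Excel output
t_stat = beta_hat / se                      # ~2.61
df = 100 - 6 - 1
p_value = 2 * stats.t.sf(abs(t_stat), df)   # two-tailed p-value, ~0.0105
print(round(t_stat, 4), round(p_value, 4))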

Testing the coefficients… INTERPRET

Conclusion:
The number of hotel/motel rooms within 5 km, the distance to the nearest competitor (Holiday Inn), the amount of office space and median household income are linearly related to the operating margin.
There is no evidence to infer that university enrolment and distance to the town centre are linearly related to the operating margin.
The number of hotel/motel rooms within 5 km and the distance to downtown have a negative effect on the operating margin, while all other variables have a positive effect.
19.31

Using the regression equation

The model can be used by:
• producing a prediction interval for a particular value of y, for a given set of values of the xi
• producing an interval estimate for the expected value of y, for a given set of values of the xi.

The model can also be used to learn about the relationships between the independent variables xi and the dependent variable y, by interpreting the coefficients βi.
19.32

Example 1

Predict the MARGIN of an inn at a site with the


following characteristics:
• 3 815 rooms within 5 km
• closest competitor 0.9 km away
• 476 000 square metres of office space
• 24 500 university students
• $38 000 median household income
• 17.9 km distance to downtown centre.

MARGIN = 38.66 – 0.0076(3815) + 1.656(0.9) + 0.198(47.6)

+0.213(24.5) + 0.366(38) – 0.142(17.9) = 37.08


19.33

Example 1 - Solution COMPUTE

Using Excel (Data Analysis Plus):

We add one row (our given values for the independent variables) to the bottom of our data set. Then we use:
Add-Ins > Data Analysis Plus > Prediction Interval
to obtain the predicted value and the intervals. The Excel output is:

Prediction Interval: Margin (y)
Predicted value                        37.07624
Prediction Interval
  Lower limit                          25.35321
  Upper limit                          48.79927
Interval Estimate of Expected Value
  Lower limit                          32.94539
  Upper limit                          41.20709
19.34
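Outside Excel, the same two intervals can be obtained from a fitted statsmodels model. A sketch, assuming the fitted result named model from the earlier sketch and the hypothetical column names used there (the new site's values are taken from the slide above):

import pandas as pd

# New site, in the same units as the regression data
new_site = pd.DataFrame({"const": [1.0], "Rooms": [3815], "Nearest": [0.9],
                         "Office": [47.6], "Enrolment": [24.5],
                         "Income": [38], "Distance": [17.9]})

pred = model.get_prediction(new_site)
frame = pred.summary_frame(alpha=0.05)
print(frame[["mean", "mean_ci_lower", "mean_ci_upper"]])   # interval estimate of the expected value
print(frame[["obs_ci_lower", "obs_ci_upper"]])             # prediction interval for an individual site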

Prediction interval… INTERPRET

We predict that the operating margin of this site will fall between 25.3 and 48.8.

If management defines a profitable inn as one with an operating margin greater than 50%, and an unprofitable inn as one with an operating margin below 30%, they will pass on this site, since the entire prediction interval is below 50%.
19.35

Confidence interval INTERPRET

The expected operating margin of all sites that fit this category is estimated to be between 32.9 and 41.2.

We interpret this to mean that if we built inns on an infinite number of sites that fit the category described, the mean operating margin would fall between 32.9 and 41.2. In other words, the average inn would not be profitable either…
19.36

19.3 Regression diagnostics – II

The required conditions for the model must be checked. Calculate the residuals and check the following:
• Is the error variable non-normal? Draw the histogram of the residuals.
• Is the error variance constant? Plot the residuals versus the predicted values of y.
• Are the errors independent (time-series data)? Plot the residuals versus the time periods.
• Are there observations that are inaccurate or do not belong to the target population? Double-check the accuracy of outliers and influential observations.
19.37
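These checks can be produced directly from the residuals of a fitted model. A minimal sketch using matplotlib, assuming a fitted statsmodels result named model as in the earlier sketch:

import matplotlib.pyplot as plt

residuals = model.resid
fitted = model.fittedvalues

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(residuals, bins=15)                  # check normality of the errors
axes[0].set_title("Histogram of residuals")
axes[1].scatter(fitted, residuals)                # check for constant error variance
axes[1].axhline(0, linestyle="--")
axes[1].set_title("Residuals vs predicted values")
plt.show()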

Regression Diagnostics – II…

• Multiple regression models have a problem that


simple regressions do not, namely multicollinearity.

• It happens when the independent variables are


highly correlated.

• We’ll explore this concept through the following


example…
19.38

Example 2

A real estate agent believes that the selling price of a house can be predicted using the house size, number of bedrooms and lot size. A random sample of 100 houses was drawn and the data recorded.

Price    Bedrooms   H size   Lot size
124100   3          129      390
218300   4          208      660
117800   3          125      375
.        .          .        .
.        .          .        .

Analyse the relationship between house prices and the three variables: house size, number of bedrooms and lot size.
19.39

Example 2: Solution IDENTIFY

The proposed model is

PRICE = β0 + β1BEDROOMS + β2H-SIZE + β3LOTSIZE + ε
19.40

Example 2: Solution IDENTIFY

Using Excel (Data Analysis): Data > Data Analysis > Regression

The F-test indicates the model is valid… but the t-tests suggest that none of the variables is related to the selling price.
19.41

Example 2: Solution IDENTIFY

Unlike the t-tests in the multiple regression model, the three t-tests for the significance of the correlation coefficients tell us that the number of bedrooms, the house size and the lot size are all linearly related to the price…
19.42

Example 2: Solution… IDENTIFY

How do we account for this apparent contradiction? The answer is that the three independent variables are correlated with each other!

(This is reasonable: larger houses have more bedrooms and are situated on larger lots, and smaller houses have fewer bedrooms and are located on smaller lots.)

Multicollinearity has affected the t-tests so that they imply that none of the independent variables is linearly related to price when, in fact, all are.
19.43

Example 2: Solution… IDENTIFY

When regressing the price on each independent variable alone, it is found that each variable is strongly related to the selling price. Multicollinearity is the source of this problem.

Multicollinearity causes two kinds of difficulties:
• The t statistics appear to be too small.
• The β coefficients cannot be interpreted as 'slopes'.

To overcome the multicollinearity problem:
• one can drop one of the correlated variables, or
• the variables can be expressed as deviations from their respective means, and these mean-deviated variables can be used in the regression model estimated without a constant term.
19.44
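Multicollinearity is usually diagnosed by examining the pairwise correlations among the independent variables or their variance inflation factors (VIFs). A sketch, assuming the Example 2 data sit in a file named houses.csv with hypothetical columns Bedrooms, HouseSize and LotSize:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("houses.csv")                 # hypothetical file name
X = df[["Bedrooms", "HouseSize", "LotSize"]]

print(X.corr())                                # high pairwise correlations signal trouble

X_const = sm.add_constant(X)
for i, name in enumerate(X_const.columns):
    if name != "const":
        print(name, variance_inflation_factor(X_const.values, i))   # VIFs well above 10 are a common warning sign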

Remedying violations of required conditions

Non-normality or heteroscedasticity can be remedied


using transformations on the y variable.
The transformations can improve the linear
relationship between the dependent variable and the
independent variables.
Many computer software systems allow us to make the
transformations easily.
19.45

Remedying violations of required conditions

A brief list of transformations (a short sketch follows this list):
y′ = log y (for y > 0)
• Use when σε increases with y, or
• Use when the error distribution is positively skewed.
y′ = y²
• Use when σ²ε is proportional to E(y), or
• Use when the error distribution is negatively skewed.
y′ = y½ (for y > 0)
• Use when σ²ε is proportional to E(y).
y′ = 1/y
• Use when σ²ε increases significantly when y increases beyond some value.
19.46
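As an illustration, the log transformation can be applied before refitting, and predictions can be back-transformed afterwards. A minimal sketch on a few toy (time, mark) pairs like those in Example 3:

import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({"time": [40, 45, 50, 55, 60], "mark": [20, 24, 26, 30, 32]})  # toy values

df["log_mark"] = np.log(df["mark"])          # y' = log y, used when sigma_e grows with y
X = sm.add_constant(df[["time"]])

model_log = sm.OLS(df["log_mark"], X).fit()  # fit on the transformed scale
predicted = np.exp(model_log.predict(X))     # back-transform predictions to the original scale
print(predicted.round(2))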

Example 3: The effect of time limits on quiz marks (Example 19.9, p800)

A statistics lecturer wanted to know whether the time limit affects the marks on a quiz. A random sample of 100 students was split into five groups. Each student did a quiz, but each group was given a different time limit. See the data below. Analyse these results and include diagnostics.

Time    40   45   50   55   60
Marks   20   24   26   30   32
        23   26   25   32   31
        .    .    .    .    .
        .    .    .    .    .
19.47

Example 3 - Solution

The model tested: MARK = β0 + β1TIME + ε

[Histogram of the residuals: the errors seem to be normally distributed.]

This model is useful and provides a good fit:

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.86254
R Square            0.743974
Adjusted R Square   0.741362
Standard Error      2.304609
Observations        100

ANOVA
             df   SS       MS         F          Significance F
Regression   1    1512.5   1512.5     284.7743   9.42E-31
Residual     98   520.5    5.311224
Total        99   2033

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   -2.2           1.64582          -1.33672   0.184409   -5.46608    1.066077
Time        0.55           0.032592         16.87526   9.42E-31   0.485322    0.614678
19.48

Example 3 - Solution…

[Plot of standardised residuals versus predicted mark.]

The standard error of estimate seems to increase with the predicted value of y. Two transformations are used to remedy this problem:
1. y′ = loge y
2. y′ = 1/y
19.49

Example 3 - Solution…

Let us see what happens when a transformation is applied:

[Left panel: the original data, where 'mark' is a function of 'time'. Right panel: the modified data, where LogMark is a function of 'time'. For example, the marks 23 and 18 recorded at time = 40 become loge23 = 3.135 and loge18 = 2.89.]
19.50

Example 3 - Solution…

The new regression analysis and diagnostics are:

The model tested: LOGMARK = β′0 + β′1TIME + ε′

Predicted LogMark = 2.1295 + 0.0217 time

This model is useful and provides a good fit:

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.878300
R Square            0.771412
Adjusted R Square   0.769079
Standard Error      0.084437
Observations        100

ANOVA
             df   SS         MS        F        Significance F
Regression   1    2.357901   2.35790   330.72   3.58E-33
Residual     98   0.698705   0.00713
Total        99   3.056606

            Coefficients   Standard Error   t Stat     P-value       Lower 95%   Upper 95%
Intercept   2.129582       0.060300         35.31632   1.51409E-57   2.009918    2.249246
Time        0.021716       0.001194         18.18566   3.58062E-33   0.019346    0.024086
19.51

Example 3 - Solution…

[Histogram of the residuals: the errors seem to be normally distributed.]

[Plot of standardised residuals versus predicted LogMark: the standard error still changes with the predicted y, but the change is smaller than before.]
19.52

Example 3 - Solution…

How do we use the modified model to predict?

Let TIME = 55 minutes:

LogMark = 2.1295 + 0.0217 time = 2.1295 + 0.0217(55) = 3.323

To find the predicted mark, take the antilog:

Mark = antiloge 3.323 = e^3.323 = 27.743
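This back-transformation is easy to verify (a small worked check of the calculation above):

import numpy as np

log_mark = 2.1295 + 0.0217 * 55             # predicted LogMark at TIME = 55
mark = np.exp(log_mark)                     # antilog base e returns the prediction to the original scale
print(round(log_mark, 3), round(mark, 2))   # ~3.323 and ~27.7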
19.53

19.4 Regression diagnostics – III (time series)
Durbin–Watson test
This test detects first-order autocorrelation between consecutive residuals in a time series of the form

et = ρ et−1 + vt,   t = 2, 3, …, n

and tests the hypotheses:

H0: ρ = 0
HA: ρ ≠ 0

If the null hypothesis is rejected, we conclude that autocorrelation exists and the error variables are not independent.
19.54

Durbin-Watson (DW) Test…

The test statistic for the Durbin-Watson (DW) test is

d = Σ(t=2 to n) (et – et−1)² / Σ(t=1 to n) et²,   and d ≈ 2(1 – ρ)

where et is the residual at time t. Since –1 ≤ ρ ≤ 1, we have 0 ≤ d ≤ 4. Therefore, if d = 2 there is no evidence of autocorrelation.
19.55
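The statistic is easy to compute from a series of residuals, either directly from the definition or with statsmodels' helper (a sketch with illustrative residual values):

import numpy as np
from statsmodels.stats.stattools import durbin_watson

residuals = np.array([1.2, 0.8, 1.1, -0.3, -0.9, -1.2, 0.4, 0.7])   # illustrative values only

d = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)         # the definition above
print(round(d, 3), round(durbin_watson(residuals), 3))               # the two calculations agree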

Durbin-Watson (DW) Test…

Positive first-order autocorrelation occurs when consecutive residuals tend to be similar. Then the value of d is small (< 2).
[Plot: residuals versus time showing runs of similar residuals – positive first-order autocorrelation.]

Negative first-order autocorrelation occurs when consecutive residuals tend to differ markedly. Then the value of d is large (> 2).
[Plot: residuals versus time alternating in sign – negative first-order autocorrelation.]
19.56

Durbin-Watson (DW) Test…

The statistic d ranges from 0 to 4. Small values of d (d < 2) indicate positive first-order autocorrelation; large values of d (d > 2) imply negative first-order autocorrelation.
19.57

Durbin-Watson (one-tail) test

To test for positive first-order autocorrelation:
[Number line from 0 to 4: below dL, positive first-order autocorrelation exists; between dL and dU, the test is inconclusive; above dU, positive first-order autocorrelation does not exist.]

• If d < dL, we conclude that there is enough evidence to support positive first-order autocorrelation.
• If d > dU, we conclude that there is not enough evidence to support positive first-order autocorrelation.
• If dL ≤ d ≤ dU, the test is inconclusive.

dL and dU can be read from Table 11, Appendix B.
19.58

Durbin–Watson (one-tail) test

To test for negative first-order autocorrelation:
[Number line from 0 to 4: below 4 – dU, negative first-order autocorrelation does not exist; between 4 – dU and 4 – dL, the test is inconclusive; above 4 – dL, negative first-order autocorrelation exists.]

• If d > 4 – dL, we conclude that there is enough evidence to support negative first-order autocorrelation.
• If d < 4 – dU, we conclude that there is not enough evidence to support negative first-order autocorrelation.
• If 4 – dU ≤ d ≤ 4 – dL, the test is inconclusive.

dL and dU can be read from Table 11, Appendix B.
19.59

Durbin–Watson (two-tail) test

To test for first-order autocorrelation (positive or negative):
[Number line from 0 to 4: below dL and above 4 – dL, first-order autocorrelation exists; between dL and dU, and between 4 – dU and 4 – dL, the test is inconclusive; between dU and 4 – dU, there is no evidence of first-order autocorrelation.]

• If d < dL or d > 4 – dL, first-order autocorrelation exists.
• If d falls between dL and dU or between 4 – dU and 4 – dL, the test is inconclusive.
• If d falls between dU and 4 – dU, there is no evidence of first-order autocorrelation.

dL and dU can be read from Table 11, Appendix B. A sketch of this decision rule in code follows.
19.60
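For completeness, the two-tail decision rule can be written as a small helper; dL and dU would be read from Table 11 in Appendix B (the values below are illustrative only):

def dw_two_tail_decision(d, d_lower, d_upper):
    """Two-tail Durbin-Watson decision rule (0 <= d <= 4)."""
    if d < d_lower or d > 4 - d_lower:
        return "first-order autocorrelation exists"
    if d_lower <= d <= d_upper or 4 - d_upper <= d <= 4 - d_lower:
        return "test is inconclusive"
    return "no evidence of first-order autocorrelation"

print(dw_two_tail_decision(0.59, 1.10, 1.54))   # illustrative d, dL and dU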

Durbin–Watson test using Excel COMPUTE

Step 1: Using Data Analysis, run the regression and obtain the residuals:
Data > Data Analysis > Regression (tick the Residuals box, then OK)

Step 2: Use Data Analysis Plus and the residual output from Step 1 to obtain the DW statistic:
Add-Ins > Data Analysis Plus > Durbin–Watson statistic > Highlight the range of residuals from the regression run > OK
19.61

Example 4 IDENTIFY

How does the weather affect the sales of lift tickets at a ski resort? Data on ticket sales for the past 20 years, along with the total snowfall and the average temperature during July in each year, were collected.

The model hypothesised was

TICKETS = β0 + β1SNOWFALL + β2TEMPERATURE + ε

Regression analysis yielded the following results:
19.62

Example 4 – Solution… COMPUTE

Both the coefficient of determination and the p-value of the F-test indicate the model is poor…

Neither variable is linearly related to ticket sales…
19.63

Example 4 - Solution… INTERPRET

The model seems to be very poor:
• The fit of the model is very low (R² = 0.12).
• The model is not valid (Significance F = 0.33).
• None of the variables seems to be linearly related to sales.

Diagnosis of the required conditions resulted in the following findings:
19.64

Example 4 – Solution…

The histogram of residuals reveals that the errors may be normally distributed…
19.65

Example 4 – Solution… COMPUTE

In the plot of residuals versus predicted values (testing for heteroscedasticity), the error variance appears to be constant…
19.66

Example 4 – Solution… COMPUTE

Durbin-Watson (DW) test


Apply the Durbin-Watson Statistic from Data Analysis Plus
to the entire list of residuals.
19.67

Example 4 - Solution… INTERPRET

To test for positive first-order autocorrelation with α = 0.05, we find in Table 11(a) in Appendix B:
dL = 1.10 and dU = 1.54

The null and alternative hypotheses are:
H0: There is no first-order autocorrelation.
HA: There is positive first-order autocorrelation.

The rejection region is d < dL = 1.10. Since d = 0.59, we reject the null hypothesis and conclude that there is enough evidence to infer that positive first-order autocorrelation exists.
19.68

Example 4 – Solution… INTERPRET

Autocorrelation usually indicates that the model needs to include an independent variable that has a time-ordered effect on the dependent variable.

The simplest such independent variable represents the time periods. We included a third independent variable that records the number of years since the year the data were gathered. Thus, YEARS = 1, 2, ..., 20. The new model is

TICKETS = β0 + β1SNOWFALL + β2TEMPERATURE + β3YEARS + ε


19.69
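Adding a time trend and refitting is straightforward. A sketch, assuming the Example 4 data sit in a file named tickets.csv with hypothetical columns Tickets, Snowfall and Temperature, ordered by year:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

df = pd.read_csv("tickets.csv")               # hypothetical file with 20 yearly observations
df["Years"] = np.arange(1, len(df) + 1)       # time trend: 1, 2, ..., 20

X = sm.add_constant(df[["Snowfall", "Temperature", "Years"]])
model2 = sm.OLS(df["Tickets"], X).fit()

print(model2.rsquared, model2.f_pvalue)       # fit and overall validity of the new model
print(durbin_watson(model2.resid))            # re-check for first-order autocorrelation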

Example 4 – Solution… COMPUTE

The fit of the model is high. The model is valid…

Snowfall and time (our new variable) are linearly related to ticket sales; temperature is not…
19.70

Example 4 – Solution… INTERPRET

If we re-run the Durbin-Watson statistic against the


residuals from our Regression analysis,

we can conclude that there is not enough evidence to


infer the presence of first-order autocorrelation.
(Determining dL and dU is left as an exercise for the
reader…)

Hence, we have improved our model dramatically!


19.71

Example 4 – Solution… INTERPRET

All the required conditions are met for this model.


The fit of this model is high: R2 = 0.74.
The model is useful as p-value (F-test) = 0.0001 is very low.
SNOWFALL and YEARS are linearly related to ticket sales.
TEMPERATURE is not linearly related to ticket sales.
19.72

Example 4 - Solution INTERPRET

Notice that the model has improved dramatically.


The F-test tells us that the model is valid. The t-tests tell
us that both the amount of snowfall and time are
significantly linearly related to the number of lift tickets.
This information could prove useful in advertising for the
resort. For example, if there has been a recent snowfall,
the resort could emphasise that in its advertising.
If no new snow has fallen, it may emphasise its snow-making facilities.
