MULTIPLE REGRESSION ANALYSIS
In simple linear regression, the relationship between a single independent variable and a dependent variable is investigated. The relationship between two variables frequently allows one to accurately predict the dependent variable from knowledge of the independent variable. Unfortunately, many real-life forecasting situations are not so simple. More than one independent variable is usually necessary in order to predict a dependent variable accurately. Regression models with more than one independent variable are called multiple regression models. Most of the concepts introduced in simple linear regression carry over to multiple regression. However, some new concepts arise because more than one independent variable is used to predict the dependent variable.

Multiple regression involves the use of more than one independent variable to
predict a dependent variable.

SEVERAL PREDICTOR VARIABLES

As an example, return to the problem in which sales volume of gallons of milk is forecast from knowledge of price per gallon. Mr. Bump is faced with the problem of making a prediction that is not entirely accurate. He can explain almost 75% of the differences in gallons of milk sold by using one independent variable. Thus, 25% (1 − r²) of the total variation is unexplained. In other words, from the sample evidence Mr. Bump knows 75% of what he must know to forecast sales volume perfectly. To do a more accurate job of forecasting, he needs to find another predictor variable that will enable him to explain more of the total variation. If Mr. Bump can reduce the unexplained variation, his forecast will involve less uncertainty and be more accurate.
A search must be conducted for another independent variable that is related to sales
volume of gallons of milk. However, this new independent, or predictor, variable cannot
relate too highly to the independent variable (price per gallon) already in use. If the two
independent variables are highly related to each other, they will explain the same varia-
tion, and the addition of the second variable will not improve the forecast.1 In fields such
as econometrics and applied statistics, there is a great deal of concern with this problem
of intercorrelation among independent variables, often referred to as multicollinearity.

1Interrelated predictor variables essentially contain much of the same information and therefore do not
contribute “new” information about the behavior of the dependent variable. Ideally, the effects of separate
predictor variables on the dependent variable should be unrelated to one another.

From Chapter 7 of Business Forecasting, Ninth Edition. John E. Hanke, Dean W. Wichern.
Copyright © 2009 by Pearson Education, Inc. All rights reserved.


The simple solution to the problem of two highly related independent variables is merely
not to use both of them together. The multicollinearity problem will be discussed later in
this chapter.

CORRELATION MATRIX

Mr. Bump decides that advertising expense might help improve his forecast of weekly
sales volume. He investigates the relationships among advertising expense, sales vol-
ume, and price per gallon by examining a correlation matrix. The correlation matrix is
constructed by computing the simple correlation coefficients for each combination of
pairs of variables.
An example of a correlation matrix is illustrated in Table 1. The correlation coeffi-
cient that indicates the relationship between variables 1 and 2 is represented as r12. Note
that the first subscript, 1, also refers to the row and the second subscript, 2, also refers to
the column in the table. This approach allows one to determine, at a glance, the relation-
ship between any two variables. Of course, the correlation between, say, variables 1 and
2 is exactly the same as the correlation between variables 2 and 1; that is, r12 = r21.
Therefore, only half of the correlation matrix is necessary. In addition, the correlation of
a variable with itself is always 1, so that, for example, r11 = r22 = r33 = 1.
Mr. Bump runs his data on the computer, and the correlation matrix shown in Table 2 results. An investigation of the relationships among advertising expense, sales volume, and price per gallon indicates that the new independent variable should contribute to improved prediction. The correlation matrix shows that advertising expense has a high positive relationship (r13 = .89) with the dependent variable, sales volume, and a moderate negative relationship (r23 = −.65) with the independent variable, price per gallon. This combination of relationships should permit advertising expenses to explain some of the total variation of sales volume that is not already being explained by price per gallon. As will be seen, when both price per gallon and advertising expense are used to estimate sales volume, R² increases to 93.2%.
The analysis of the correlation matrix is an important initial step in the solution of
any problem involving multiple independent variables.

TABLE 1  Correlation Matrix

                 Variables
Variables      1       2       3
    1         r11     r12     r13
    2         r21     r22     r23
    3         r31     r32     r33

TABLE 2  Correlation Matrix for Mr. Bump's Data

                       Variables
                  Sales    Price    Advertising
Variables           1        2           3
Sales, 1           1.00     −.86        .89
Price, 2                    1.00       −.65
Advertising, 3                          1.00


MULTIPLE REGRESSION MODEL

In simple regression, the dependent variable can be represented by Y and the independent variable by X. In multiple regression analysis, X's with subscripts are used to represent the independent variables. The dependent variable is still represented by Y, and the independent variables are represented by X1, X2, . . . , Xk. Once the initial set of independent variables has been determined, the relationship between Y and these X's can be expressed as a multiple regression model.
In the multiple regression model, the mean response is taken to be a linear function of the explanatory variables:

    μY = β0 + β1X1 + β2X2 + · · · + βkXk        (1)
This expression is the population multiple regression function. As was the case with simple linear regression, we cannot directly observe the population regression function because the observed values of Y vary about their means. Each combination of values for all of the X's defines the mean for a subpopulation of responses Y. We assume that the Y's in each of these subpopulations are normally distributed about their means with the same standard deviation, σ.
The data for simple linear regression consist of observations (Xi, Yi) on the two variables. In multiple regression, the data for each case consist of an observation on the response and an observation on each of the independent variables. The ith observation on the jth predictor variable is denoted by Xij. With this notation, data for multiple regression have the form given in Table 3. It is convenient to refer to the data for the ith case as simply the ith observation. With this convention, n is the number of observations and k is the number of predictor variables.

Statistical Model for Multiple Regression

The response, Y, is a random variable that is related to the independent (predictor) variables, X1, X2, . . . , Xk, by

    Y = β0 + β1X1 + β2X2 + · · · + βkXk + ε

where
1. For the ith observation, Y = Yi and X1, X2, . . . , Xk are set at values Xi1, Xi2, . . . , Xik.

TABLE 3  Data Structure for Multiple Regression

          Predictor Variables                     Response
Case    X1      X2     ............     Xk           Y
 1      X11     X12    ............     X1k          Y1
 2      X21     X22    ............     X2k          Y2
 .       .       .     ............      .           .
 i      Xi1     Xi2    ............     Xik          Yi
 .       .       .     ............      .           .
 n      Xn1     Xn2    ............     Xnk          Yn


2. The ε's are error components that represent the deviations of the response from the true relation. They are unobservable random variables accounting for the effects of other factors on the response. The errors are assumed to be independent, and each is normally distributed with mean 0 and unknown standard deviation σ.
3. The regression coefficients, β0, β1, . . . , βk, that together locate the regression function are unknown.
Given the data, the regression coefficients can be estimated using the principle of least squares. The least squares estimates are denoted by b0, b1, . . . , bk and the estimated regression function by

    Ŷ = b0 + b1X1 + · · · + bkXk        (2)

The residuals, e = Y − Ŷ, are estimates of the error component and, similar to the simple linear regression situation, the correspondence between population and sample is

    Population:  Y = β0 + β1X1 + β2X2 + · · · + βkXk + ε
    Sample:      Y = b0 + b1X1 + b2X2 + · · · + bkXk + e

The calculations in multiple regression analysis are ordinarily performed using computer software packages such as Minitab and Excel (see the Minitab and Excel Applications sections at the end of the chapter).

Example 1
For the data shown in Table 4, Mr. Bump considers a multiple regression model relating sales volume (Y) to price (X1) and advertising (X2):

    Y = β0 + β1X1 + β2X2 + ε

Mr. Bump determines the fitted regression function:

    Ŷ = 16.41 − 8.25X1 + .59X2

The least squares values, b0 = 16.41, b1 = −8.25, and b2 = .59, minimize the sum of squared errors:

    SSE = Σi (Yi − b0 − b1Xi1 − b2Xi2)² = Σi (Yi − Ŷi)²

for all possible choices of b0, b1, and b2. Here, the best-fitting function is a plane (see Figure 1). The data points are plotted in three dimensions along the Y, X1, and X2 axes. The points fall above and below the plane in such a way that Σ(Y − Ŷ)² is a minimum.
The fitted regression function can be used to forecast next week's sales. If plans call for a price per gallon of $1.50 and advertising expenditures of $1,000, the forecast is 9.935 thousands of gallons; that is,

    Ŷ = 16.41 − 8.25X1 + .59X2
    Ŷ = 16.41 − 8.25(1.5) + .59(10) = 9.935


TABLE 4  Mr. Bump's Data for Example 1

Week    Sales (1,000s)    Price per Gallon ($)    Advertising ($100s)
              Y                    X1                     X2
 1           10                   1.30                     9
 2            6                   2.00                     7
 3            5                   1.70                     5
 4           12                   1.50                    14
 5           10                   1.60                    15
 6           15                   1.20                    12
 7            5                   1.60                     6
 8           12                   1.40                    10
 9           17                   1.00                    15
10           20                   1.10                    21
Totals      112                  14.40                   114
Means       11.2                  1.44                   11.4

[Figure 1 is a three-dimensional plot of the Table 4 data, with sales (Y) on the vertical axis and price (X1) and advertising (X2) on the horizontal axes. The fitted regression plane Ŷ = 16.41 − 8.25X1 + .59X2 passes through the points; point A is week 7 (sales 5, price $1.60, advertising 6) and point B is week 9 (sales 17, price $1.00, advertising 15).]

FIGURE 1  Fitted Regression Plane for Mr. Bump's Data for Example 1
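For readers who want to check these numbers, the following is a minimal Python sketch (NumPy only; the library choice and variable names are my own, not part of the text) that fits the same two-predictor model to the Table 4 data by least squares. The printed coefficients and forecast should come out close to the values 16.41, −8.25, .59, and 9.935 reported above.

    import numpy as np

    # Mr. Bump's data from Table 4
    sales = np.array([10, 6, 5, 12, 10, 15, 5, 12, 17, 20], dtype=float)            # Y, thousands of gallons
    price = np.array([1.30, 2.00, 1.70, 1.50, 1.60, 1.20, 1.60, 1.40, 1.00, 1.10])  # X1, dollars per gallon
    advert = np.array([9, 7, 5, 14, 15, 12, 6, 10, 15, 21], dtype=float)            # X2, hundreds of dollars

    # Design matrix with a leading column of ones for the intercept
    X = np.column_stack([np.ones_like(price), price, advert])

    # Least squares estimates b0, b1, b2
    b, *_ = np.linalg.lstsq(X, sales, rcond=None)
    print("b0, b1, b2 =", b.round(3))        # roughly [16.41, -8.25, 0.59]

    # Forecast for a price of $1.50 and advertising of $1,000 (X2 = 10)
    forecast = b @ np.array([1.0, 1.50, 10.0])
    print("forecast =", round(forecast, 3))  # roughly 9.94 thousand gallons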

INTERPRETING REGRESSION COEFFICIENTS

Consider the interpretation of b0, b1, and b2 in Mr. Bump's fitted regression function. The value b0 is again the Y-intercept. However, now it is interpreted as the value of Ŷ when both X1 and X2 are equal to zero. The coefficients b1 and b2 are referred to as the partial, or net, regression coefficients. Each measures the average change in Y per unit change in the relevant independent variable. However, because the simultaneous influence of all independent variables on Y is being measured by the regression function, the partial or net effect of X1 (or any other X) must be measured apart from any influence of other variables. Therefore, it is said that b1 measures the average change in Y per unit change in X1, holding the other independent variables constant.

The partial, or net, regression coefficient measures the average change in the
dependent variable per unit change in the relevant independent variable, holding
the other independent variables constant.

In the present example, the b1 value of -8.25 indicates that each increase of 1 cent
in the price of a gallon of milk when advertising expenditures are held constant reduces
the quantity purchased by an average of 82.5 gallons. Similarly, the b2 value of .59
means that, if advertising expenditures are increased by $100 when the price per gallon
is held constant, then sales volume will increase an average of 590 gallons.
Example 2
To illustrate the net effects of individual X's on the response, consider the situation in which price is to be $1.00 per gallon and $1,000 is to be spent on advertising. Then

    Ŷ = 16.41 − 8.25X1 + .59X2
      = 16.41 − 8.25(1.00) + .59(10)
      = 16.41 − 8.25 + 5.9 = 14.06

Sales are forecast to be 14,060 gallons of milk.
What is the effect on sales of a 1-cent price increase if $1,000 is still spent on advertising?

    Ŷ = 16.41 − 8.25(1.01) + .59(10)
      = 16.41 − 8.3325 + 5.9 = 13.9775

Note that sales decrease by 82.5 gallons (14.06 − 13.9775 = .0825).
What is the effect on sales of a $100 increase in advertising if price remains constant at $1.00?

    Ŷ = 16.41 − 8.25(1.00) + .59(11)
      = 16.41 − 8.25 + 6.49 = 14.65

Note that sales increase by 590 gallons (14.65 − 14.06 = .59).

INFERENCE FOR MULTIPLE REGRESSION MODELS

Inference for multiple regression models is analogous to that for simple linear regres-
sion. The least squares estimates of the model parameters, their estimated standard
errors, the t statistics used to examine the significance of individual terms in the regres-
sion model, and the F statistic used to check the significance of the regression are all
provided in output from standard statistical software packages. Determining these
quantities by hand for a multiple regression analysis of any size is not practical, and the
computer must be used for calculations.
As you know, any observation Y can be written

Observation = Fit + Residual


or

    Y = Ŷ + (Y − Ŷ)

where

    Ŷ = b0 + b1X1 + b2X2 + · · · + bkXk

is the fitted regression function. Recall that Ŷ is an estimate of the population regression function. It represents that part of Y explained by the relation of Y with the X's. The residual, Y − Ŷ, is an estimate of the error component of the model. It represents that part of Y not explained by the predictor variables.
The sum of squares decomposition and the associated degrees of freedom are

    Σ(Y − Ȳ)² = Σ(Ŷ − Ȳ)² + Σ(Y − Ŷ)²
        SST   =    SSR     +    SSE
    df: n − 1 =     k      +  n − k − 1        (3)

The total variation in the response, SST, consists of two components: SSR, the variation
explained by the predictor variables through the estimated regression function, and
SSE, the unexplained or error variation. The information in Equation 3 can be set out
in an analysis of variance (ANOVA) table, which is discussed in a later section.

Standard Error of the Estimate


The standard error of the estimate is the standard deviation of the residuals. It measures the typical scatter of the Y values about the fitted regression function.2 The standard error of the estimate is

    sy·x's = √[Σ(Y − Ŷ)² / (n − k − 1)] = √[SSE/(n − k − 1)] = √MSE        (4)

where
    n = the number of observations
    k = the number of independent variables in the regression function
    SSE = Σ(Y − Ŷ)² = the residual sum of squares
    MSE = SSE/(n − k − 1) = the residual mean square

The standard error of the estimate is the standard deviation of the residuals. It measures the amount the actual values (Y) differ from the estimated values (Ŷ). For relatively large samples, we would expect about 67% of the differences Y − Ŷ to be within sy·x's of zero and about 95% of these differences to be within 2sy·x's of zero.

Example 3
The quantities required to calculate the standard error of the estimate for Mr. Bump’s data
are given in Table 5.

2The standard error of the estimate is an estimate of σ, the standard deviation of the error term, ε, in the multiple regression model.


TABLE 5  Residuals from the Model for Mr. Bump's Data for Example 3

                     Predicted Y (Ŷ) Using               Residual
 Y     X1     X2     Ŷ = 16.406 − 8.248X1 + .585X2      (Y − Ŷ)    (Y − Ŷ)²
10    1.30     9               10.95                      −.95        .90
 6    2.00     7                4.01                      1.99       3.96
 5    1.70     5                5.31                      −.31        .10
12    1.50    14               12.23                      −.23        .05
10    1.60    15               11.99                     −1.99       3.96
15    1.20    12               13.53                      1.47       2.16
 5    1.60     6                6.72                     −1.72       2.96
12    1.40    10               10.71                      1.29       1.66
17    1.00    15               16.94                       .06        .00
20    1.10    21               19.62                       .38        .14
Totals                                                     .00      15.90

The standard error of the estimate is

    sy·x's = √[15.90/(10 − 2 − 1)] = √2.27 = 1.51

With a single predictor, X1 = price, the standard error of the estimate was sy·x = 2.72. With the additional predictor, X2 = advertising, Mr. Bump has reduced the standard error of the estimate by almost 50%. The differences between the actual volumes of milk sold and their forecasts obtained from the fitted regression equation are considerably smaller with two predictor variables than they were with a single predictor. That is, the two-predictor equation comes a lot closer to reproducing the actual Y's than the single-predictor equation.
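Continuing the sketch begun after Example 1 (it assumes the arrays X, sales, and b defined there), the residuals and the standard error of the estimate in Example 3 can be reproduced in a few lines:

    # Fitted values and residuals (Y minus Y-hat) for the model above
    fitted = X @ b
    resid = sales - fitted

    n, k = len(sales), 2               # 10 observations, 2 predictors
    sse = np.sum(resid ** 2)           # residual sum of squares, about 15.9
    mse = sse / (n - k - 1)            # residual mean square, about 2.27
    s = np.sqrt(mse)                   # standard error of the estimate, about 1.51
    print(round(sse, 2), round(mse, 2), round(s, 2))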

Significance of the Regression


The ANOVA table based on the decomposition of the total variation in Y (SST) into its explained (SSR) and unexplained (SSE) parts (see Equation 3) is given in Table 6.
Consider the hypothesis H0: β1 = β2 = · · · = βk = 0. This hypothesis means that Y is not related to any of the X's (the coefficient attached to every X is zero). A test of H0 is referred to as a test of the significance of the regression. If the regression model assumptions are appropriate and H0 is true, the ratio

    F = MSR/MSE

has an F distribution with df = k, n − k − 1. Thus, the F ratio can be used to test the significance of the regression.

TABLE 6  ANOVA Table for Multiple Regression

Source        Sum of Squares        df            Mean Square               F Ratio
Regression         SSR               k           MSR = SSR/k             F = MSR/MSE
Error              SSE           n − k − 1       MSE = SSE/(n − k − 1)
Total              SST             n − 1


In simple linear regression, there is only one predictor variable. Consequently, test-
ing for the significance of the regression using the F ratio from the ANOVA table is
equivalent to the two-sided t test of the hypothesis that the slope of the regression line
is zero. For multiple regression, the t tests (to be introduced shortly) examine the sig-
nificance of individual X’s in the regression function, and the F test examines the
significance of all the X’s collectively.

F Test for the Significance of the Regression

In the multiple regression model, the hypotheses

    H0: β1 = β2 = · · · = βk = 0
    H1: at least one βj ≠ 0

are tested by the F ratio:

    F = MSR/MSE

with df = k, n − k − 1. At significance level α, the rejection region is

    F > Fα

where Fα is the upper α percentage point of an F distribution with ν1 = k, ν2 = n − k − 1 degrees of freedom.

The coefficient of determination, R², is given by

    R² = SSR/SST = Σ(Ŷ − Ȳ)² / Σ(Y − Ȳ)²
       = 1 − SSE/SST = 1 − Σ(Y − Ŷ)² / Σ(Y − Ȳ)²        (5)

and has the same form and interpretation as r² does for simple linear regression. It represents the proportion of variation in the response, Y, explained by the relationship of Y with the X's.
A value of R² = 1 says that all the observed Y's fall exactly on the fitted regression function. All of the variation in the response is explained by the regression. A value of R² = 0 says that Ŷ = Ȳ, that is, SSR = 0, and none of the variation in Y is explained by the regression. In practice, 0 ≤ R² ≤ 1, and the value of R² must be interpreted relative to the extremes, 0 and 1.
The quantity

    R = √R²        (6)

is called the multiple correlation coefficient and is the correlation between the responses, Y, and the fitted values, Ŷ. Since the fitted values predict the responses, R is always positive, so that 0 ≤ R ≤ 1.


For multiple regression,

    F = [R² / (1 − R²)] × [(n − k − 1) / k]        (7)

so, everything else equal, significant regressions (large F ratios) are associated with relatively large values for R².
The coefficient of determination (R²) can always be increased by adding an additional independent variable, X, to the regression function, even if this additional variable is not important.3 For this reason, some analysts prefer to interpret R² adjusted for the number of terms in the regression function. The adjusted coefficient of determination, R̄², is given by

    R̄² = 1 − (1 − R²)[(n − 1) / (n − k − 1)]        (8)

Like R², R̄² is a measure of the proportion of variability in the response, Y, explained by the regression. It can be shown that 0 ≤ R̄² ≤ R². When the number of observations (n) is large relative to the number of independent variables (k), R̄² ≈ R². If k = 0, Ŷ = Ȳ and R̄² = R². In many practical situations, there is not much difference between the magnitudes of R̄² and R².

Example 4
Using the total sum of squares for Mr. Bump's data and the residual sum of squares from Example 3, the sum of squares decomposition for Mr. Bump's problem is

    SST = SSR + SSE
    Σ(Y − Ȳ)² = Σ(Ŷ − Ȳ)² + Σ(Y − Ŷ)²
    233.6 = 217.7 + 15.9

Hence, using both forms of Equation 5 to illustrate the calculations,

    R² = 217.7/233.6 = 1 − 15.9/233.6 = .932

and the multiple correlation coefficient is R = √R² = √.932 = .965.
Here, about 93% of the variation in sales volume is explained by the regression, that is, the relation of sales to price and advertising expenditures. In addition, the correlation between sales and fitted sales is about .965, indicating close agreement between the actual and predicted values. A summary of the analysis of Mr. Bump's data to this point is given in Table 7.
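The sum of squares decomposition and summary measures in Example 4, along with the adjusted R² and F ratio defined above, follow directly from quantities already computed; this sketch again reuses sales, sse, n, and k from the earlier sketches.

    sst = np.sum((sales - sales.mean()) ** 2)        # total sum of squares, about 233.6
    ssr = sst - sse                                  # regression sum of squares, about 217.7

    r2 = ssr / sst                                   # R-squared, about .932
    r = np.sqrt(r2)                                  # multiple correlation R, about .965
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)    # adjusted R-squared, about .912
    f_ratio = (ssr / k) / (sse / (n - k - 1))        # F = MSR/MSE, about 47.9

    print(round(r2, 3), round(r2_adj, 3), round(f_ratio, 2))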

Individual Predictor Variables


The coefficient of an individual X in the regression function measures the partial or net
effect of that X on the response, Y, holding the other X’s in the equation constant. If the
regression is judged significant, then it is of interest to examine the significance of
the individual predictor variables. The issue is this: Given the other X’s, is the effect of
this particular X important, or can this X term be dropped from the regression func-
tion? This question can be answered by examining an appropriate t value.

3Here, “not important” means “not significant.” That is, the coefficient of X is not significantly different from
zero (see the Individual Predictor Variables section that follows).


TABLE 7  Summary of the Analysis of Mr. Bump's Data for Example 4

Variables Used to Explain
Variability in Y                    R²      Σ(Y − Ŷ)²
None                               .00        233.6
Price                              .75         59.4
Price and Advertising expense      .93         15.9

If H0: βj = 0 is true, the test statistic, t, with the value t = bj/sbj has a t distribution with df = n − k − 1.4

To judge the significance of the jth term, j = 0, 1, . . . , k, in the regression function, the test statistic, t, is compared with a percentage point of a t distribution with n − k − 1 degrees of freedom. For an α level test of

    H0: βj = 0
    H1: βj ≠ 0

reject H0 if |t| > tα/2. Here, tα/2 is the upper α/2 percentage point of a t distribution with df = n − k − 1.

Some care must be exercised in dropping from the regression function those predictor variables that are judged to be insignificant by the t test (H0: βj = 0 cannot be rejected). If the X's are related (multicollinear), the least squares coefficients and the corresponding t values can change, sometimes appreciably, if a single X is deleted from the regression function. For example, an X that was previously insignificant may become significant. Consequently, if there are several small (insignificant) t values, predictor variables should be deleted one at a time (starting with the variable having the smallest t value) rather than in bunches. The process stops when the regression is significant and all the predictor variables have large (significant) t statistics.

Forecast of a Future Response


A forecast, Ŷ*, of a future response, Y, for new values of the X's, say X1 = X1*, X2 = X2*, . . . , Xk = Xk*, is given by evaluating the fitted regression function at the X*'s:

    Ŷ* = b0 + b1X1* + b2X2* + · · · + bkXk*        (9)

With confidence level 1 − α, a prediction interval for Y takes the form

    Ŷ* ± tα/2 × (standard error of the forecast)

The standard error of the forecast is a complicated expression, but the standard error of the estimate, sy·x's, is an important component. In fact, if n is large and all the X's are quite variable, an approximate 100(1 − α)% prediction interval for a new response Y is

    (Ŷ* − tα/2 sy·x's ,  Ŷ* + tα/2 sy·x's)        (10)

4Here, bj is the least squares coefficient for the jth predictor variable, Xj, and sbj is its estimated standard
deviation (standard error). These two statistics are ordinarily obtained with computer software such as
Minitab.
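As an illustration of the approximate interval in Equation 10, the sketch below applies it to Mr. Bump's forecast, reusing forecast, s, n, and k from the earlier sketches; the t percentage point is taken from SciPy, which is an assumed dependency, not something used in the text.

    from scipy import stats

    alpha = 0.05
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - k - 1)   # upper 2.5% point with 7 df, about 2.365

    lower = forecast - t_crit * s
    upper = forecast + t_crit * s
    print(f"approximate 95% prediction interval: ({lower:.2f}, {upper:.2f}) thousand gallons")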


COMPUTER OUTPUT

The computer output for Mr. Bump's problem is presented in Table 8. Examination of this output leads to the following observations (explanations are keyed to Table 8).
1. The regression coefficients are −8.25 for price and .585 for advertising. The fitted regression equation is Ŷ = 16.4 − 8.25X1 + .585X2.
2. The regression equation explains 93.2% of the variation in sales volume.
3. The standard error of the estimate is 1.5072 thousand gallons. This value is a measure of the amount the actual values differ from the fitted values.
4. Each regression slope coefficient was tested to determine whether it was different from zero. In the current situation, the large t statistic of −3.76 for the price variable, X1, and its small p-value (.007) indicate that the coefficient of price is significantly different from zero (reject H0: β1 = 0). Given the advertising variable, X2, in the regression function, price cannot be dropped from the regression function. Similarly, the large t statistic of 4.38 for the advertising variable, X2, and its small p-value (.003) indicate that the coefficient of advertising is significantly different from zero (reject H0: β2 = 0). Given the price variable, X1, in the regression function, the advertising variable cannot be dropped from the regression function. (As a reference point for the magnitude of the t values, with seven degrees of freedom, Table 3 in Appendix: Tables gives t.01 = 2.998.) In summary, the coefficients of both predictor variables are significantly different from zero.
5. The p-value .007 is the probability of obtaining a t value at least as large as −3.76 if the hypothesis H0: β1 = 0 is true. Since this probability is extremely small, H0 is unlikely to be true, and it is rejected. The coefficient of price is significantly different from zero. The p-value .003 is the probability of obtaining a t value at least as large as 4.38 if H0: β2 = 0 is true. Since a t value of this magnitude is extremely unlikely, H0 is rejected. The coefficient of advertising is significantly different from zero.

TABLE 8  Minitab Output for Mr. Bump's Data

Correlations: Y, X1, X2
          Y        X1
X1     -0.863
X2      0.891   -0.654          (6)

Regression Analysis: Y versus X1, X2
The regression equation is
Y = 16.4 - 8.25 X1 + 0.585 X2   (1)

Predictor      Coef          SE Coef      T             P
Constant     16.406 (1)       4.343       3.78         0.007
X1           -8.248 (1)       2.196      -3.76 (4)     0.007 (5)
X2           0.5851 (1)      0.1337       4.38 (4)     0.003 (5)

S = 1.50720 (3)   R-Sq = 93.2% (2)   R-Sq(adj) = 91.2% (9)

Analysis of Variance
Source            DF        SS            MS           F            P
Regression         2      217.70 (7)    108.85       47.92 (8)     0.000
Residual Error     7       15.90 (7)      2.27
Total              9      233.60 (7)


6. The correlation matrix was demonstrated in Table 2.
7. The sum of squares decomposition, SST = SSR + SSE (sum of squares total = sum of squares regression + sum of squares error), was given in Example 4.
8. The computed F value (47.92) is used to test for the significance of the regression. The large F ratio and its small p-value (.000) show the regression is significant (reject H0: β1 = β2 = 0). The F ratio is computed from

    F = MSR/MSE = 108.85/2.27 = 47.92

As a reference for the magnitude of the F ratio, Table 5 in Appendix: Tables gives the upper 1% point of an F distribution with two and seven degrees of freedom as F.01 = 9.55. The regression function explains a significant amount of the variability in sales, Y.
9. The computation for the corrected or adjusted R², R̄², is

    R̄² = 1 − (1 − R²)[(n − 1) / (n − k − 1)]
       = 1 − (1 − .932)[(10 − 1) / (10 − 2 − 1)] = 1 − (.068)(1.286) = .912
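Output containing essentially the same pieces as Table 8 (coefficients, standard errors, t statistics, p-values, R-Sq, and the ANOVA F) can be produced with the statsmodels package; this is only a sketch under the assumption that statsmodels is available, since the text itself relies on Minitab and Excel. It reuses the price, advert, and sales arrays from the earlier sketches.

    import statsmodels.api as sm

    X_sm = sm.add_constant(np.column_stack([price, advert]))   # intercept, X1, X2
    model = sm.OLS(sales, X_sm).fit()

    print(model.summary())          # coefficients, standard errors, t and p-values, R-Sq, F
    print(model.params.round(3))    # roughly [16.406, -8.248, 0.585]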

DUMMY VARIABLES

Consider the following example.


Example 5
Suppose an analyst wishes to investigate how well a particular aptitude test predicts job per-
formance. Eight women and seven men have taken the test, which measures manual dexter-
ity in using the hands with tiny objects. Each subject then went through a month of intensive
training as an electronics assembler, followed by a month of actual assembly, during which
his or her productivity was evaluated by an index having values ranging from 0 to 10 (zero
means unproductive).
The data are shown in Table 9. A scatter diagram is presented in Figure 2. Each female
worker is represented by a 0 and each male by a 1.
It is immediately evident from observing Figure 2 that the relationship of this aptitude test
to job performance follows two distinct patterns, one applying to women and the other to men.

It is sometimes necessary to determine how a dependent variable is related to an independent variable when a qualitative factor is influencing the situation. This is accomplished by creating a dummy variable. There are many ways to identify quantitatively the classes of a qualitative variable; the values 0 and 1 are used in this text.

Dummy, or indicator, variables are used to determine the relationships between qualitative independent variables and a dependent variable.

The dummy variable technique is illustrated in Figure 3. The data points for females are shown as 0's; the 1's represent males. Two parallel lines are constructed for the scatter diagram. The top one fits the data for females; the bottom one fits the male data points.
Each of these lines was obtained from a fitted regression function of the form

    Ŷ = b0 + b1X1 + b2X2


TABLE 9  Electronics Assemblers Dummy Variable Data for Example 5

           Job Performance Rating    Aptitude Test Score    Gender
Subject               Y                       X1              X2

1 5 60 0(F)
2 4 55 0(F)
3 3 35 0(F)
4 10 96 0(F)
5 2 35 0(F)
6 7 81 0(F)
7 6 65 0(F)
8 9 85 0(F)
9 9 99 1(M)
10 2 43 1(M)
11 8 98 1(M)
12 6 91 1(M)
13 7 95 1(M)
14 3 70 1(M)
15 6 85 1(M)
Totals 87 1,093
ȲF = the mean female job performance rating = 5.75
ȲM = the mean male job performance rating = 5.86
X̄F = the mean female aptitude test score = 64
X̄M = the mean male aptitude test score = 83

[Figure 2 is a scatter diagram of job performance rating (Y) against aptitude test score (X), with each female worker plotted as a 0 and each male as a 1.]

FIGURE 2  Scatter Diagram for Data in Example 5


[Figure 3 repeats the scatter diagram of Figure 2 and adds the two parallel fitted lines: Ŷ = −1.96 + .12X1 for females (the upper line) and Ŷ = −4.14 + .12X1 for males (the lower line).]

FIGURE 3  Regression Lines Corresponding to Dummy Variables in Example 5

where
    X1 = the test score
    X2 = 0 for females, 1 for males (dummy variable)

The single equation is equivalent to the following two equations:

    Ŷ = b0 + b1X1                              for females
    Ŷ = b0 + b1X1 + b2 = (b0 + b2) + b1X1      for males

Note that b2 represents the effect of a male on job performance and that b1 represents the effect of differences in aptitude test scores (the b1 value is assumed to be the same for both males and females). The important point is that one multiple regression equation will yield the two estimated lines shown in Figure 3. The top line is the estimated relation for females, and the lower line is the estimated relation for males. One might envisage X2 as a "switching" variable that is "on" when an observation is made for a male and "off" when it is made for a female.

Example 6
The estimated multiple regression equation for the data of Example 5 is shown in the Minitab computer output in Table 10. It is

    Ŷ = −1.96 + .12X1 − 2.18X2


TABLE 10  Minitab Output for Example 6

Correlations: Rating, Test, Gender
            Rating      Test
Test         0.876
Gender       0.021      0.428

Regression Analysis: Rating versus Test, Gender
The regression equation is
Rating = -1.96 + 0.120 Test - 2.18 Gender

Predictor       Coef        SE Coef        T          P
Constant      -1.9565       0.7068       -2.77      0.017
Test           0.12041      0.01015      11.86      0.000
Gender        -2.1807       0.4503       -4.84      0.000

S = 0.7863   R-sq = 92.1%   R-sq(adj) = 90.8%

Analysis of Variance
Source            DF        SS          MS          F          P
Regression         2      86.981      43.491      70.35      0.000
Residual Error    12       7.419       0.618
Total             14      94.400

For the two values (0 and 1) of X2, the fitted equation becomes

    Ŷ = −1.96 + .12X1 − 2.18(0) = −1.96 + .12X1        for females

and

    Ŷ = −1.96 + .12X1 − 2.18(1) = −4.14 + .12X1        for males

These equations may be interpreted in the following way: The regression coefficient value b1 = .12, which is the slope of each of the lines, is the estimated average increase in performance rating for each one-unit increase in aptitude test score. This coefficient applies to both males and females.
The other regression coefficient, b2 = −2.18, applies only to males. For a male test taker, the estimated job performance rating is reduced, relative to female test takers, by 2.18 units when the aptitude score is held constant.
An examination of the means of the Y and X1 variables, classified by gender, helps one understand this result. Table 9 shows that the mean job performance ratings were approximately equal for males, 5.86, and females, 5.75. However, the males scored significantly higher (83) on the aptitude test than did the females (64). Therefore, if two applicants, one male and one female, took the aptitude test and both scored 70, the female's estimated job performance rating would be 2.18 points higher than the male's, since

    Female:  Ŷ = −1.96 + .12X1 = −1.96 + .12(70) = 6.44
    Male:    Ŷ = −4.14 + .12X1 = −4.14 + .12(70) = 4.26

A look at the correlation matrix in Table 10 provides some interesting insights. A strong linear relationship exists between job performance and the aptitude test because r12 = .876. If the aptitude test score alone were used to predict performance, it would explain about 77% ((.876)² = .767) of the variation in job performance scores.
The correlation coefficient r13 = .02 indicates virtually no relationship between gender and job performance. This conclusion is also evident from the fact that the mean performance ratings for males and females are nearly equal (5.86 versus 5.75). At first glance, one might conclude that knowledge of whether an applicant is male or female is not useful information. However, the moderate relationship, r23 = .428, between gender and aptitude test score indicates that the test might discriminate between sexes. Males seem to do better on the test than do females (83 versus 64). Perhaps some element of strength is required on the test that is not required on the job.
When both test results and gender are used to forecast job performance, 92% of the variance is explained. This result suggests that both variables make a valuable contribution to predicting performance. The aptitude test scores explain 77% of the variance, and gender used in conjunction with the aptitude test scores adds another 15%. The computed t statistics, 11.86 (p-value = .000) and −4.84 (p-value = .000), for aptitude test score and gender, respectively, indicate that both predictor variables should be included in the final regression function.
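The dummy variable fit in Example 6 can be reproduced from the Table 9 data with ordinary least squares; the sketch below (NumPy, with illustrative variable names) should return coefficients close to −1.96, .120, and −2.18.

    import numpy as np

    # Table 9 data: job performance rating, aptitude test score, gender (0 = female, 1 = male)
    rating = np.array([5, 4, 3, 10, 2, 7, 6, 9, 9, 2, 8, 6, 7, 3, 6], dtype=float)
    test = np.array([60, 55, 35, 96, 35, 81, 65, 85, 99, 43, 98, 91, 95, 70, 85], dtype=float)
    gender = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], dtype=float)

    Xd = np.column_stack([np.ones_like(test), test, gender])
    bd, *_ = np.linalg.lstsq(Xd, rating, rcond=None)
    print(bd.round(3))   # roughly [-1.956, 0.120, -2.181]

    # The single fitted equation produces two parallel lines:
    #   females (gender = 0): rating-hat = bd[0] + bd[1] * test
    #   males   (gender = 1): rating-hat = (bd[0] + bd[2]) + bd[1] * test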

MULTICOLLINEARITY

In many regression problems, data are routinely recorded rather than generated from
preselected settings of the independent variables. In these cases, the independent vari-
ables are frequently linearly dependent or multicollinear. For example, in appraisal
work, the selling price of a home may be related to predictor variables such as age, liv-
ing space in square feet, number of bathrooms, number of rooms other than bath-
rooms, lot size, and an index of construction quality. Living space, number of rooms,
and number of bathrooms should certainly “move together.” If one of these variables
increases, the others will generally increase.
If this linear dependence is less than perfect, the least squares estimates of the
regression model coefficients can still be obtained. However, these estimates tend to be
unstable—their values can change dramatically with slight changes in the data—and
inflated—their values are larger than expected. In particular, individual coefficients
may have the wrong sign, and the t statistics for judging the significance of individual
terms may all be insignificant, yet the F test will indicate the regression is significant.
Finally, the calculation of the least squares estimates is sensitive to rounding errors.

Multicollinearity is the situation in which independent variables in a multiple regression equation are highly intercorrelated. That is, a linear relation exists between two or more independent variables.

The strength of the multicollinearity is measured by the variance inflation factor (VIF):5

    VIFj = 1 / (1 − Rj²)        j = 1, 2, . . . , k        (11)

Here, Rj² is the coefficient of determination from the regression of the jth independent variable on the remaining k − 1 independent variables. For k = 2 independent variables, Rj² is the square of their sample correlation, r.
If the jth predictor variable, Xj, is not related to the remaining X's, Rj² = 0 and VIFj = 1. If there is a relationship, then VIFj > 1. For example, when Rj² is equal to .90, VIFj = 1/(1 − .90) = 10.
A VIF near 1 suggests that multicollinearity is not a problem for that independent
variable. Its estimated coefficient and associated t value will not change much as the other
independent variables are added or deleted from the regression equation. A VIF much

5The variance inflation factor (VIF) gets its name from the fact that the sampling variance of bj is proportional to VIFj. The estimated standard deviation (standard error) of the least squares coefficient, bj, increases as VIFj increases.


greater than 1 indicates that the estimated coefficient attached to that independent vari-
able is unstable. Its value and associated t statistic may change considerably as the other
independent variables are added or deleted from the regression equation. A large VIF
means essentially that there is redundant information among the predictor variables. The
information being conveyed by a variable with a large VIF is already being explained by
the remaining predictor variables. Thus, multicollinearity makes interpreting the effect of
an individual predictor variable on the response (dependent variable) difficult.
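A variance inflation factor can be computed directly from Equation 11 by regressing each predictor on the remaining predictors. The helper below is a sketch (the function name vif is my own, not from the text); applied to Mr. Bump's two predictors from the earlier sketches, it reproduces VIF = 1/(1 − r²) for their sample correlation.

    import numpy as np

    def vif(X):
        """VIF for each column of the predictor matrix X (no intercept column), per Equation 11."""
        n, k = X.shape
        vifs = []
        for j in range(k):
            y_j = X[:, j]                                # treat the jth predictor as the response
            others = np.delete(X, j, axis=1)             # the remaining k - 1 predictors
            Z = np.column_stack([np.ones(n), others])
            b_j, *_ = np.linalg.lstsq(Z, y_j, rcond=None)
            resid = y_j - Z @ b_j
            r2_j = 1 - resid @ resid / np.sum((y_j - y_j.mean()) ** 2)
            vifs.append(1.0 / (1.0 - r2_j))
        return np.array(vifs)

    # Example: VIFs for price and advertising in Mr. Bump's data (both about 1.7)
    print(vif(np.column_stack([price, advert])).round(2))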
Example 7
A large component of the cost of owning a newspaper is the cost of newsprint. Newspaper publishers are interested in factors that determine annual newsprint consumption. In one study (see Johnson and Wichern, 1997), data on annual newsprint consumption (Y), the number of newspapers in a city (X1), the logarithm6 of the number of families in a city (X2), and the logarithm of total retail sales in a city (X3) were collected for n = 15 cities. The correlation array for the three predictor variables and the Minitab output from a regression analysis relating newsprint consumption to the predictor variables are in Table 11.
The F statistic (18.54) and its p-value (.000) clearly indicate that the regression is significant. The t statistic for each of the independent variables is small with a relatively large p-value. It must be concluded, for example, that the variable LnFamily is not significant, provided the other predictor variables remain in the regression function. This suggests that the term β2X2 can be dropped from the regression function if the remaining terms, β1X1 and β3X3, are retained. Similarly, it appears as if β3X3 can be dropped if β1X1 and β2X2 remain in the regression function. The t value (1.69) associated with Papers is marginally significant, but the term β1X1 might also be dropped if the other predictor variables remain in the equation. Here, the regression is significant, but each of the predictor variables is not significant. Why?
The VIF column in Table 11 provides the answer. Since VIF = 1.7 for Papers, this pre-
dictor variable is very weakly related (VIF near 1) to the remaining predictor variables,
LnFamily and LnRetSales. The VIF = 7.4 for LnFamily is relatively large, indicating this

TABLE 11  Minitab Output for Example 7: Three Predictor Variables

Correlations: Papers, LnFamily, LnRetSales
              Papers    LnFamily
LnFamily       0.600
LnRetSales     0.643      0.930

Regression Analysis: Newsprint versus Papers, LnFamily, LnRetSales
The regression equation is
Newsprint = -56388 + 2385 Papers + 1859 LnFamily + 3455 LnRetSales

Predictor        Coef      SE Coef        T          P        VIF
Constant       -56388       13206       -4.27      0.001
Papers           2385        1410        1.69      0.119      1.7
LnFamily         1859        2346        0.79      0.445      7.4
LnRetSales       3455        2590        1.33      0.209      8.1

S = 1849   R-Sq = 83.8%   R-Sq(adj) = 79.0%

Analysis of Variance
Source            DF         SS            MS           F         P
Regression         3     190239371      63413124      18.54     0.000
Residual Error    11      37621478       3420134
Total             14     227860849

6Logarithms of the number of families and the total retail sales are used to make the numbers less positively skewed and more manageable.


TABLE 12  Minitab Output for Example 7: Two Predictor Variables

Regression Analysis: Newsprint versus Papers, LnRetSales
The regression equation is
Newsprint = -59766 + 2393 Papers + 5279 LnRetSales

Predictor        Coef      SE Coef        T          P        VIF
Constant       -59766       12304       -4.86      0.000
Papers           2393        1388        1.72      0.110      1.7
LnRetSales       5279        1171        4.51      0.001      1.7

S = 1820   R-sq = 82.5%   R-sq(adj) = 79.6%

Analysis of Variance
Source            DF         SS            MS           F         P
Regression         2     188090489      94045244      28.38     0.000
Residual Error    12      39770360       3314197
Total             14     227860849

variable is linearly related to the remaining predictor variables. Also, the VIF = 8.1 for LnRetSales indicates that LnRetSales is related to the remaining predictor variables. Since Papers is weakly related to LnFamily and LnRetSales, the relationship among the predictor variables is essentially the relationship between LnFamily and LnRetSales. In fact, the sample correlation between LnFamily and LnRetSales is r = .93, showing strong linear association.
The variables LnFamily and LnRetSales are very similar in their ability to explain newsprint consumption. We need only one, but not both, in the regression function. The Minitab output from a regression analysis with LnFamily (smallest t statistic) deleted from the regression function is shown in Table 12.
Notice that the coefficient of Papers is about the same for the two regressions. The coefficients of LnRetSales, however, are considerably different (3,455 for k = 3 predictors and 5,279 for k = 2 predictors). Also, for the second regression, the variable LnRetSales is clearly significant (t = 4.51 with p-value = .001). With Papers in the model, LnRetSales is an additional important predictor of newsprint consumption. The R²'s for the two regressions are nearly the same, approximately .83, as are the standard errors of the estimates, sy·x's = 1,849 and sy·x's = 1,820, respectively. Finally, the common VIF = 1.7 for the two predictors in the second model indicates that multicollinearity is no longer a problem. As a residual analysis confirms, for the variables considered, the regression of Newsprint on Papers and LnRetSales is entirely adequate.

If estimating the separate effects of the predictor variables is important and multi-
collinearity appears to be a problem, what should be done? There are several ways to
deal with severe multicollinearity, as follows. None of them may be completely satisfac-
tory or feasible.
• Create new X variables (call them X̃) by scaling all the independent variables according to the formula (a sketch of this scaling appears after this list):

    X̃ij = (Xij − X̄j) / √[ Σi (Xij − X̄j)² ]        j = 1, 2, . . . , k;  i = 1, 2, . . . , n        (12)

These new variables will each have a sample mean of 0 and the same sample standard deviation. The regression calculations with the new X̃'s are less sensitive to round-off error in the presence of severe multicollinearity.
• Identify and eliminate one or more of the redundant independent variables from
the regression function. (This approach was used in Example 7.)


• Consider estimation procedures other than least squares.7


• Regress the response, Y, on new X’s that are uncorrelated with each other. It is
possible to construct linear combinations of the original X’s that are uncorrelated.8
• Carefully select potential independent variables at the beginning of the study. Try
to avoid variables that “say the same thing.”
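As promised above, here is a short sketch of the scaling in Equation 12 (it reuses the price and advert arrays from the earlier sketches purely for illustration; the function name is my own).

    def scale_columns(X):
        """Scale each column as in Equation 12: subtract its mean and divide by the
        square root of its sum of squared deviations."""
        Xc = X - X.mean(axis=0)
        return Xc / np.sqrt(np.sum(Xc ** 2, axis=0))

    X_scaled = scale_columns(np.column_stack([price, advert]))
    print(X_scaled.mean(axis=0).round(6))    # each column now has mean 0
    print(np.sum(X_scaled ** 2, axis=0))     # and sum of squared values 1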

SELECTING THE “BEST” REGRESSION EQUATION

How does one develop the best multiple regression equation to forecast a variable of
interest? The first step involves the selection of a complete set of potential predictor
variables. Any variable that might add to the accuracy of the forecast should be
included. In the selection of a final equation, one is usually faced with the dilemma of
providing the most accurate forecast for the smallest cost. In other words, when choos-
ing predictor variables to include in the final equation, the analyst must evaluate them
by using the following two opposed criteria:
1. The analyst wants the equation to include as many useful predictor variables as
possible.9
2. Given that it costs money to obtain and monitor information on a large number of
X’s, the equation should include as few predictors as possible. The simplest equa-
tion is usually the best equation.
The selection of the best regression equation usually involves a compromise between
these extremes, and judgment will be a necessary part of any solution.
After a seemingly complete list of potential predictors has been compiled, the
second step is to screen out the independent variables that do not seem appropriate. An
independent variable (1) may not be fundamental to the problem (there should be
some plausible relation between the dependent variable and an independent variable),
(2) may be subject to large measurement errors, (3) may duplicate other independent
variables (multicollinearity), or (4) may be difficult to measure accurately (accurate
data are unavailable or costly).
The third step is to shorten the list of predictors so as to obtain a "best" selection of independent variables. Techniques currently in use are discussed in the material that follows. None of the search procedures can be said to yield the "best" set of independent variables. Indeed, there is often no unique "best" set. To add to the confusion, the various techniques do not all necessarily lead to the same final prediction equation. The entire variable selection process is very subjective. The primary advantage of automatic-search procedures is that analysts can then focus their judgments on the pivotal areas of the problem.
To demonstrate various search procedures, a simple example is presented that has
five potential independent variables.
Example 8
Pam Weigand, the personnel manager of the Zurenko Pharmaceutical Company, is inter-
ested in forecasting whether a particular applicant will become a good salesperson. She
decides to use the first month’s sales as the dependent variable (Y), and she chooses to
analyze the following independent variables:

7Alternative procedures for estimating the regression parameters are beyond the scope of this text. The
interested reader should consult the work of Draper and Smith (1998).
8Again, the procedures for creating linear combinations of the X’s that are uncorrelated are beyond the
scope of this text. Draper and Smith (1998) discuss these techniques.
9Recall that, whenever a new predictor variable is added to a multiple regression equation, R2 increases.
Therefore, it is important that a new predictor variable make a significant contribution to the regression equation.


    X1 = the selling aptitude test score
    X2 = the age, in years
    X3 = the anxiety test score
    X4 = the experience, in years
    X5 = the high school GPA (grade point average)
The personnel manager collects the data shown in Table 13, and she assigns the task of
obtaining the “best” set of independent variables for forecasting sales ability to her analyst.
The first step is to obtain a correlation matrix for all the variables from a computer pro-
gram. This matrix will provide essential knowledge about the basic relationships among the
variables.
Examination of the correlation matrix in Table 14 reveals that the selling aptitude
test score, age, experience, and GPA are positively related to sales ability and have poten-
tial as good predictor variables. The anxiety test score shows a low negative correlation
with sales, and it is probably not an important predictor. Further analysis indicates that
age is moderately correlated with both GPA and experience. It is the presence of these

TABLE 13  Zurenko Pharmaceutical Data for Example 8

One Month's      Aptitude      Age        Anxiety       Experience    High School
Sales (units)    Test Score    (years)    Test Score    (years)       GPA
44 10 22.1 4.9 0 2.4
47 19 22.5 3.0 1 2.6
60 27 23.1 1.5 0 2.8
71 31 24.0 .6 3 2.7
61 64 22.6 1.8 2 2.0
60 81 21.7 3.3 1 2.5
58 42 23.8 3.2 0 2.5
56 67 22.0 2.1 0 2.3
66 48 22.4 6.0 1 2.8
61 64 22.6 1.8 1 3.4
51 57 21.1 3.8 0 3.0
47 10 22.5 4.5 1 2.7
53 48 22.2 4.5 0 2.8
74 96 24.8 .1 3 3.8
65 75 22.6 .9 0 3.7
33 12 20.5 4.8 0 2.1
54 47 21.9 2.3 1 1.8
39 20 20.5 3.0 2 1.5
52 73 20.8 .3 2 1.9
30 4 20.0 2.7 0 2.2
58 9 23.3 4.4 1 2.8
59 98 21.3 3.9 1 2.9
52 27 22.9 1.4 2 3.2
56 59 22.3 2.7 1 2.7
49 23 22.6 2.7 1 2.4
63 90 22.4 2.2 2 2.6
61 34 23.8 .7 1 3.4
39 16 20.6 3.1 1 2.3
62 32 24.4 .6 3 4.0
78 94 25.0 4.6 5 3.6


TABLE 14 Correlations: Sales, Aptitude, Age, Anxiety, Experience, GPA


Correlations: Sales, Aptitude, Age, Anxiety, Experience, GPA
Sales Aptitude Age Anxiety Experience
Aptitude 0.676
Age 0.798 0.228
Anxiety -0.296 -0.222 -0.287
Experience 0.550 0.350 0.540 -0.279
GPA 0.622 0.318 0.695 -0.244 0.312

interrelationships that must be dealt with in attempting to find the best possible set of
explanatory variables.

Two procedures are demonstrated: all possible regressions and stepwise regression.

All Possible Regressions


The procedure calls for the investigation of all possible regression equations that
involve the potential independent variables. The analyst starts with an equation
containing no independent variables and then proceeds to analyze every possible
combination in order to select the best set of predictors.
Different criteria for comparing the various regression equations may be used with
the all possible regressions approach. Only the R2 technique, which involves four steps,
is discussed here.
This procedure first requires the fitting of every possible regression model that involves the dependent variable and any number of independent variables. Each independent variable can either be or not be in the equation (two possible outcomes), and this fact is true for every independent variable. Thus, altogether there are 2ᵏ equations (where k equals the number of independent variables). So, if there are eight independent variables to consider (k = 8), then 2⁸ = 256 equations must be examined.
The second step in the procedure is to divide the equations into sets according to the number of parameters to be estimated.

Example 9
The results from the all possible regressions runs for the Zurenko Pharmaceutical Company
are presented in Table 15. Notice that Table 15 is divided into six sets of regression equation
outcomes. This breakdown coincides with the number of parameters contained in each
equation.

The third step involves the selection of the best independent variable (or vari-
ables) for each parameter grouping. The equation with the highest R2 is considered
best. Using the results from Example 9, the best equation from each set listed in
Table 15 is presented in Table 16.
The fourth step involves making the subjective decision: “Which equation is the
best?” On the one hand, the analyst desires the highest R2 possible; on the other
hand, he or she wants the simplest equation possible. The all possible regressions
approach assumes that the number of data points, n, exceeds the number of parame-
ters, k + 1.
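The all possible regressions idea is easy to express in code: loop over every subset of predictors, fit each model by least squares, and record its R². The sketch below (illustrative names; the predictors are assumed to be the columns of an array X_all) shows the 2ᵏ enumeration; it is meant to clarify the procedure, not to recommend it.

    from itertools import combinations
    import numpy as np

    def r_squared(y, X_subset):
        """R-squared for the least squares regression of y on the columns of X_subset."""
        Z = np.column_stack([np.ones(len(y)), X_subset])
        b_s, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ b_s
        return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

    def all_possible_regressions(y, X_all):
        """Fit every nonempty subset of predictors and return (subset, R-squared) pairs."""
        k = X_all.shape[1]
        results = []
        for size in range(1, k + 1):
            for cols in combinations(range(k), size):
                results.append((cols, r_squared(y, X_all[:, list(cols)])))
        return results

    # Usage: with the five Zurenko predictors as the columns of X_all, this evaluates
    # 2**5 - 1 = 31 nonempty subsets; the no-predictor model brings the count to 2**5 = 32.
    # for cols, r2 in all_possible_regressions(y, X_all): print(cols, round(r2, 4))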

Example 10
The analyst is attempting to find the point at which adding additional independent variables
for the Zurenko Pharmaceutical problem is not worthwhile because it leads to a very small
increase in R2. The results in Table 16 clearly indicate that adding variables after selling


TABLE 15  R² Values for All Possible Regressions for Zurenko Pharmaceutical for Example 9

Independent         Number of       Error Degrees
Variables Used      Parameters      of Freedom         R²
None 1 29 .0000
X1 2 28 .4570
X2 2 28 .6370
X3 2 28 .0880
X4 2 28 .3020
X5 2 28 .3870
X1, X2 3 27 .8948
X1, X3 3 27 .4790
X1, X4 3 27 .5690
X1, X5 3 27 .6410
X2, X3 3 27 .6420
X2, X4 3 27 .6570
X2, X5 3 27 .6460
X3, X4 3 27 .3240
X3, X5 3 27 .4090
X4, X5 3 27 .5270
X1, X2, X3 4 26 .8951
X1, X2, X4 4 26 .8948
X1, X2, X5 4 26 .8953
X1, X3, X4 4 26 .5750
X1, X3, X5 4 26 .6460
X1, X4, X5 4 26 .7010
X2, X3, X4 4 26 .6590
X2, X3, X5 4 26 .6500
X2, X4, X5 4 26 .6690
X3, X4, X5 4 26 .5310
X1, X2, X3, X4 5 25 .8951
X1, X2, X3, X5 5 25 .8955
X1, X2, X4, X5 5 25 .8953
X1, X3, X4, X5 5 25 .7010
X2, X3, X4, X5 5 25 .6710
X1, X2, X3, X4, X5 6 24 .8955

TABLE 16  Best Regression Equations for Zurenko Pharmaceutical from Example 9

Number of       Independent            Error Degrees
Parameters      Variables              of Freedom         R²

1 None 29 .0000
2 X2 28 .6370
3 X1, X2 27 .8948
4 X1, X2, X5 26 .8953
5 X1, X2, X3, X5 25 .8955
6 X1, X2, X3, X4, X5 24 .8955


aptitude test (X1) and age (X2) is not necessary. Therefore, the final fitted regression equation is of the form

    Ŷ = b0 + b1X1 + b2X2

and it explains 89.48% of the variation in Y.

The all possible regressions procedure is best summed up by Draper and Smith (1998):

In general the analysis of all regressions is quite unwarranted. While it means


that the investigator has “looked at all possibilities” it also means he has exam-
ined a large number of regression equations that intelligent thought would
often reject out of hand. The amount of computer time used is wasteful and
the sheer physical effort of examining all the computer printouts is enormous
when more than a few variables are being examined. Some sort of selection
procedure that shortens this task is preferable. (p. 333)

Stepwise Regression
The stepwise regression procedure adds one independent variable at a time to the
model, one step at a time. A large number of independent variables can be handled on
the computer in one run when using this procedure.
Stepwise regression can best be described by listing the basic steps (algorithm)
involved in the computations.

1. All possible simple regressions are considered. The predictor variable that explains
the largest significant proportion of the variation in Y (has the largest correlation
with the response) is the first variable to enter the regression equation.
2. The next variable to enter the equation is the one (out of those not included) that
makes the largest significant contribution to the regression sum of squares. The sig-
nificance of the contribution is determined by an F test. The value of the F statistic
that must be exceeded before the contribution of a variable is deemed significant
is often called the F to enter.
3. Once an additional variable has been included in the equation, the individual con-
tributions to the regression sum of squares of the other variables already in the
equation are checked for significance using F tests. If the F statistic is less than a
value called the F to remove, the variable is deleted from the regression equation.
4. Steps 2 and 3 are repeated until all possible additions are nonsignificant and all
possible deletions are significant. At this point, the selection stops.

Stepwise regression permits predictor variables to enter or leave the regression


function at different stages of its development. An independent variable is
removed from the model if it doesn’t continue to make a significant contribution
when a new variable is added.
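As a rough illustration of steps 1 through 4 above, the following sketch implements the add/remove cycle with partial F statistics. The function names, the use of NumPy, and the default cutoffs of 4 are our own choices for the sketch, not a transcription of Minitab's algorithm.

```python
import numpy as np

def sse_of(cols, X, y):
    """SSE of an OLS fit using the predictor columns in `cols`, plus an intercept."""
    n = len(y)
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    b, _, _, _ = np.linalg.lstsq(Z, y, rcond=None)
    return np.sum((y - Z @ b) ** 2)

def partial_f(smaller, larger, X, y):
    """F statistic for the single extra predictor in `larger` relative to `smaller`."""
    n = len(y)
    sse_big = sse_of(larger, X, y)
    return (sse_of(smaller, X, y) - sse_big) / (sse_big / (n - len(larger) - 1))

def stepwise(X, y, f_enter=4.0, f_remove=4.0):
    selected = []
    k = X.shape[1]
    while True:
        # Step 2: find the excluded variable with the largest partial F
        entries = [(partial_f(selected, selected + [j], X, y), j)
                   for j in range(k) if j not in selected]
        added = False
        if entries:
            f_best, j_best = max(entries)
            if f_best > f_enter:
                selected.append(j_best)
                added = True
        # Step 3: drop the weakest included variable if its partial F is too small
        if selected:
            f_worst, j_worst = min(
                (partial_f([i for i in selected if i != j], selected, X, y), j)
                for j in selected)
            if f_worst < f_remove:
                selected.remove(j_worst)
                continue
        # Step 4: stop when nothing was added and nothing can be removed
        if not added:
            return selected
```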

The user of a stepwise regression program supplies the values that decide when a
variable is allowed to enter and when a variable is removed. Since the F statistics used in
stepwise regression are such that F = t², where t is the t statistic for checking the signifi-
cance of a predictor variable, F = 4 (corresponding to |t| = 2) is a common choice for
both the F to enter and the F to remove. An F to enter of 4 is essentially equivalent to
testing for the significance of a predictor variable at the 5% level. The Minitab stepwise


program allows the user to choose an α level to enter and to remove variables or the F
value to enter and to remove variables. Using an α value of .05 is approximately equiv-
alent to using F = 4. The current default values in Minitab are α = .15 and F = 4.
The result of the stepwise procedure is a model that contains only independent
variables with t values that are significant at the specified level. However, because of
the step-by-step development, there is no guarantee that stepwise regression will
select, for example, the best three variables for prediction. In addition, an automatic
selection method is not capable of indicating when transformations of variables are
useful, nor does it necessarily avoid a multicollinearity problem. Finally, stepwise
regression cannot create important variables that are not supplied by the user. It is nec-
essary to think carefully about the collection of independent variables that is supplied
to a stepwise regression program.
The stepwise procedure is illustrated in Example 11.

Example 11
Let’s “solve” the Zurenko Pharmaceutical problem using stepwise regression.
Pam examines the correlation matrix shown in Table 14 and decides that, when she
runs the stepwise analysis, the age variable will enter the model first because it has the
largest correlation with sales (r1,3 = .798) and will explain 63.7% (.798²) of the variation
in sales.
She notes that the aptitude test score will probably enter the model second because it is
strongly related to sales (r1,2 = .676) but not highly related to the age variable (r2,3 = .228)
already in the model.
Pam also notices that the other variables will probably not qualify as good predictor
variables. The anxiety test score will not be a good predictor because it is not well related to
sales (r1,4 = -.296). The experience and GPA variables might have potential as good
predictor variables (r1,5 = .550 and r1,6 = .622, respectively). However, both of these
predictor variables have a potential multicollinearity problem with the age variable
(r3,5 = .540 and r3,6 = .695, respectively).
The Minitab commands to run a stepwise regression analysis for this example are
demonstrated in the Minitab Applications section at the end of the chapter. The output
from this stepwise regression run is shown in Table 17. The stepwise analysis proceeds
according to the steps that follow.

TABLE 17  Stepwise Regression for Example 11: Sales versus Aptitude, Age, Anxiety, Experience, GPA
Alpha-to-Enter: 0.05 Alpha-to-Remove: 0.05
Response is Sales on 5 predictors, with N = 30
Step 1 2
Constant -100.85 -86.79
Age 6.97 5.93
T-Value 7.01 10.60
P-Value 0.000 0.000
Aptitude 0.200
T-Value 8.13
P-Value 0.000
S 6.85 3.75
R-Sq 63.70 89.48
R-Sq(adj) 62.41 88.70
Mallows Cp 57.4 0.2


Step 1. The model after step 1 is


Sales = -100.85 + 6.97(Age)

As Pam thought, the age variable entered the model first and explains 63.7%
of the sales variance. Since the p-value of .000 is less than the α value of .05, age is
added to the model. Remember that the p-value is the probability of obtaining a
t statistic as large as 7.01 by chance alone. The Minitab decision rule that Pam
selected is to enter a variable if the p-value is less than a = .05.
Note that t = 7.01 > 2.048, the upper .025 point of a t distribution with
28 (n - k - 1 = 30 - 1 - 1) degrees of freedom. Thus, at the .05 significance
level, the hypothesis H0: β1 = 0 is rejected in favor of H1: β1 ≠ 0. Since
t² = F or 2.048² = 4.19, an F to enter of 4 is also essentially equivalent to testing
for the significance of a predictor variable at the 5% level. In this case, since the
coefficient of the age variable is clearly significantly different from zero, age
enters the regression equation, and the procedure now moves to step 2.
Step 2. The model after step 2 is
Sales = -86.79 + 5.93(Age) + 0.200(Aptitude)

This model explains 89.48% of the variation in sales.


The null and alternative hypotheses to determine whether the aptitude test
score’s regression coefficient is significantly different from zero are

H0: β2 = 0
H1: β2 ≠ 0

Again, the p-value of .000 is less than the α value of .05, and aptitude test score is
added to the model. The aptitude test score’s regression coefficient is significantly
different from zero, and the probability that this occurred by chance sampling
error is approximately zero. This result means that the aptitude test score is an
important variable when used in conjunction with age.
The critical t statistic based on 27 (n - k - 1 = 30 - 2 - 1) degrees of
freedom is 2.052.10 The computed t ratio found on the Minitab output is 8.13,
which is greater than 2.052. Using a t test, the null hypothesis is also rejected.
Note that the p-value for the age variable’s t statistic, .000, remains very small.
Age is still a significant predictor of sales. The procedure now moves on to step 3.
Step 3. The computer now considers adding a third predictor variable, given that X1 (age)
and X2 (aptitude test score) are in the regression equation. None of the remaining
independent variables is significant (has a p-value less than .05) when run in
combination with X1 and X2, so the stepwise procedure is completed.
Pam’s final model selected by the stepwise procedure is the two-predictor
variable model given in step 2.
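If you want to verify the critical values quoted in steps 1 and 2, a quick check (assuming the scipy library is available; the variable names are ours) is:

```python
from scipy import stats

# Two-sided 5% critical t values for the error degrees of freedom in steps 1 and 2
t_step1 = stats.t.ppf(0.975, df=28)   # about 2.048
t_step2 = stats.t.ppf(0.975, df=27)   # about 2.052

# The F = t^2 equivalence behind the "F to enter = 4" convention
print(t_step1 ** 2, t_step2 ** 2)     # about 4.19 and 4.21
```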

Final Notes on Stepwise Regression


The stepwise regression technique is extremely easy to use. Unfortunately, it is also
extremely easy to misuse. Analysts developing a regression model often produce a
large set of potential independent variables and then let the stepwise procedure deter-
mine which ones are significant. The problem is that, when a large set of independent
variables is analyzed, many t tests are performed, and it is likely that a type I error
(adding a nonsignificant variable) will result. That is, the final model might contain a
variable that is not linearly related to the dependent variable and entered the model
just by chance.

10Again, since 2.052² = 4.21, using an F to enter of 4 is roughly equivalent to testing for the significance of a
predictor variable at the .05 level.


As mentioned previously, another problem involves the initial selection of poten-


tial independent variables. When these variables are selected, higher-order terms
(curvilinear, nonlinear, and interaction) are often omitted to keep the number of vari-
ables manageable. Consequently, several important variables may be initially omitted
from the model. It becomes obvious that an analyst’s intuitive choice of the initial
independent variables is critical to the development of a successful regression model.

REGRESSION DIAGNOSTICS AND RESIDUAL ANALYSIS

A regression analysis is not complete until one is convinced the model is an adequate
representation of the data. It is imperative to examine the adequacy of the model
before it becomes part of the decision-making apparatus.
An examination of the residuals is a crucial component of the determination of
model adequacy. Also, if regression models are used with time series data, it is important
to compute the residual autocorrelations to check the independence assumption.
Inferences (and decisions) made with models that do not approximately conform to the
regression assumptions can be grossly misleading. For example, it may be concluded
that the manipulation of a predictor variable will produce a specified change in the
response when, in fact, it will not. It may be concluded that a forecast is very likely (95%
confidence) to be within 2% of the future response when, in fact, the actual confidence
is much less, and so forth.
In this section, some additional tools that can be used to evaluate a regression
model will be discussed. These tools are designed to identify observations that are out-
lying or extreme (observations that are well separated from the remainder of the data).
Outlying observations are often hidden by the fitting process and may not be easily
detected from an examination of residual plots. Yet they can have a major role in deter-
mining the fitted regression function. It is important to study outlying observations to
decide whether they should be retained or eliminated and, if retained, whether their
influence should be reduced in the fitting process or the regression function revised.
A measure of the influence of the ith data point on the location of the fitted
regression function is provided by the leverage hii. The leverage depends only on the
predictors; it does not depend on the response, Y. For simple linear regression with one
predictor variable, X,

hii = 1/n + (Xi - X̄)² / Σ(Xi - X̄)²        (13)

With k predictors, the expression for the ith leverage is more complicated; however,
one can show that 0 < hii < 1 and that the mean leverage is h̄ = (k + 1)/n.
If the ith data point has high leverage (hii is close to 1), the fitted response, Ŷi, at
these X's is almost completely determined by Yi, with the remaining data having very
little influence. The high leverage data point is also an outlier among the X's (far from
other combinations of X values).11 A rule of thumb suggests that hii is large enough to
merit checking if hii ≥ 3(k + 1)/n.
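For the general k-predictor case, the leverage hii is the ith diagonal element of the "hat" matrix. A minimal sketch of the computation and the rule of thumb, with X as a placeholder predictor array, is:

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix H = Z(Z'Z)^(-1)Z' for the design matrix with intercept."""
    n = X.shape[0]
    Z = np.column_stack([np.ones(n), X])
    H = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
    return np.diag(H)                          # h_ii, one value per observation

def high_leverage_points(X):
    """Indices worth checking under the 3(k + 1)/n rule of thumb."""
    n, k = X.shape
    h = leverages(X)
    return np.where(h >= 3 * (k + 1) / n)[0]
```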
The detection of outlying or extreme Y values is based on the size of the residuals,
e = Y - Ŷ. Large residuals indicate a Y value that is “far” from its fitted or predicted

11The converse is not necessarily true. That is, an outlier among the X’s may not be a high leverage point.


value, Ŷ. A large residual will show up in a histogram of the residuals as a value far (in
either direction) from zero. A large residual will show up in a plot of the residuals ver-
sus the fitted values as a point far above or below the horizontal axis.
Software packages such as Minitab flag data points with extreme Y values by com-
puting “standardized” residuals and identifying points with large standardized residuals.
One standardization is based on the fact that the residuals have estimated stan-
dard deviations:

sei = sy·x's √(1 - hii)

where sy·x's = √MSE is the standard error of the estimate and hii is the leverage
associated with the ith data point. The standardized residual12 is then

ei / sei = ei / (sy·x's √(1 - hii))        (14)

The standardized residuals all have a variance of 1. A standardized residual is considered
large (the response extreme) if

|ei / sei| > 2

The Y values corresponding to data points with large standardized residuals can heav-
ily influence the location of the fitted regression function.
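A direct numerical sketch of Equation 14, again with placeholder arrays X and y, is shown below; observations with a standardized residual larger than 2 in absolute value are the ones a package such as Minitab would flag.

```python
import numpy as np

def standardized_residuals(X, y):
    """Residuals divided by their estimated standard deviations (Equation 14)."""
    n, k = X.shape
    Z = np.column_stack([np.ones(n), X])
    b, _, _, _ = np.linalg.lstsq(Z, y, rcond=None)
    e = y - Z @ b                                     # ordinary residuals
    mse = np.sum(e ** 2) / (n - k - 1)                # MSE = SSE / (n - k - 1)
    h = np.diag(Z @ np.linalg.inv(Z.T @ Z) @ Z.T)     # leverages h_ii
    return e / np.sqrt(mse * (1 - h))                 # e_i / (s * sqrt(1 - h_ii))
```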

Example 12
Chief executive officer (CEO) salaries in the United States are of interest because of their
relationship to salaries of CEOs in international firms and to salaries of top professionals
outside corporate America. Also, for an individual firm, the CEO compensation directly, or
indirectly, influences the salaries of managers in positions below that of CEO. CEO salary
varies greatly from firm to firm, but data suggest that salary can be explained in terms of a
firm’s sales and the CEO’s amount of experience, educational level, and ownership stake in
the firm. In one study, 50 firms were used to develop a multiple regression model linking
CEO compensation to several predictor variables such as sales, profits, age, experience,
professional background, educational level, and ownership stake.
After eliminating unimportant predictor variables, the final fitted regression func-
tion was

Ŷ = 5.52 - .467X1 + .263X2

where

Y = the logarithm of CEO compensation


X1 = the indicator variable for educational level
X2 = the logarithm of company sales

Minitab identified three observations from this regression analysis that have either
large standardized residuals or large leverage.

12Some software packages may call the standardized residual given by Equation 14 the Studentized residual.


Unusual Observations
Obs Educate LnComp Fit StDev Fit Residual St Resid
14 1.00 6.0568 7.0995 0.0949 -1.0427 -2.09R
25 0.00 8.1342 7.9937 0.2224 0.1405 0.31X
33 0.00 6.3969 7.3912 0.2032 -0.9943 -2.13R
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large
influence.

Observations 14 and 33 have large standardized residuals. The fitted regression function is
predicting (log) compensation that is too large for these two CEOs. An examination of the
full data set shows that these CEOs each own relatively large percentages of their compa-
nies’ stock. Case 14 owns more than 10% of the company’s stock, and case 33 owns more
than 17% of the company’s stock. These individuals are receiving much of their remunera-
tion through long-term compensation, such as stock incentives, rather than through annual
salary and bonuses. Since amount of stock owned (or stock value) is not included as a vari-
able in the regression function, it cannot be used to adjust the prediction of compensation
determined by CEO education and company sales. Although education and (log) sales do
not predict the compensation of these two CEOs as well as the others, there appears to be
no reason to eliminate them from consideration.
Observation 25 is singled out because the leverage for this data point is greater than
3(k + 1)/n = 3(3)/50 = .18. This CEO has no college degree (Educate = 0) but is with a
company with relatively large sales (LnSales = 9.394). The combination (0, 9.394) is far
from the point (X̄1, X̄2); therefore, it is an outlier among the pairs of X's. The response asso-
ciated with these X’s will have a large influence on the determination of the fitted regres-
sion function. (Notice that the standardized residual for this data point is small, indicating
that the predicted or fitted (log) compensation is close to the actual value.) This particular
CEO has 30 years of experience as a CEO, more experience than all but one of the CEOs in
the data set. This observation is influential, but there is no reason to delete it.

Leverage tells us if an observation has unusual predictors, and a standardized


residual tells us if an observation has an unusual response. These quantities can be
combined into one overall measure of influence known as Cook’s distance. Cook’s
distances can be printed out in most statistical software packages, but additional
discussion is beyond the scope of this text.13
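As one illustration (not the analysis actually used in Example 12), the statsmodels library will return leverages, standardized residuals, and Cook's distances from a single fitted model; the data below are simulated placeholders.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                         # illustrative predictors
y = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(scale=0.5, size=50)

fit = sm.OLS(y, sm.add_constant(X)).fit()
influence = fit.get_influence()

leverage = influence.hat_matrix_diag                 # h_ii values
std_resid = influence.resid_studentized_internal     # standardized residuals, Equation 14
cooks_d, _ = influence.cooks_distance                # Cook's distance combines the two
```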

FORECASTING CAVEATS

We finish this discussion of multiple regression with some general comments. These
comments are oriented toward the practical application of regression analysis.

Overfitting

Overfitting refers to adding independent variables to the regression function that,


to a large extent, account for all the eccentricities of the sample data under analysis.

When such a model is applied to new sets of data selected from the same population, it
does not forecast as well as the initial fit might suggest.
Overfitting is more likely to occur when the sample size is small, especially if a large
number of independent variables are included in the model. Some practitioners have

13A good discussion of Cook’s distance is provided by Draper and Smith (1998).


suggested that there should be at least 10 observations for each independent variable. (If
there are four independent variables, a sample size n of at least 40 is suggested.)
One way to guard against overfitting is to develop the regression function from
one part of the data and then apply it to a “holdout” sample. Use the fitted regression
function to forecast the holdout responses and calculate the forecast errors. If the fore-
cast errors are substantially larger than the fitting errors as measured by, say, compara-
ble mean squared errors, then overfitting has occurred.
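A simple sketch of that holdout check, assuming the data sit in arrays X and y and using a 70/30 split chosen only for illustration, is:

```python
import numpy as np

def holdout_check(X, y, fit_fraction=0.7, seed=0):
    """Compare the fitting-sample MSE with the holdout MSE to spot overfitting."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    n_fit = int(fit_fraction * n)
    fit_idx, hold_idx = idx[:n_fit], idx[n_fit:]

    Z = np.column_stack([np.ones(n), X])
    b, _, _, _ = np.linalg.lstsq(Z[fit_idx], y[fit_idx], rcond=None)

    mse_fit = np.mean((y[fit_idx] - Z[fit_idx] @ b) ** 2)
    mse_hold = np.mean((y[hold_idx] - Z[hold_idx] @ b) ** 2)
    return mse_fit, mse_hold    # a much larger holdout MSE suggests overfitting
```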

Useful Regressions, Large F Ratios


A regression that is statistically significant is not necessarily useful. With a relatively large
sample size (i.e., when n is large relative to k, the number of predictors), it is not unusual to
get a significant F ratio and a small R². That is, the regression is significant, yet it explains
only a small proportion of the variation in the response. One rule of thumb suggests that
with a significance level of .05, the F ratio should be at least four times the corresponding
critical value before the regression is likely to be of much use for prediction purposes.14
The “four times” criterion comes from the argument that the range of the predic-
tions (over all the X’s) should be about four times the (average) prediction error
before the regression is likely to yield a worthwhile interpretation.15
As an example, with k = 3 predictors, n = 25 observations, and a significance level
of .05, the computed F from the ANOVA table would have to exceed the critical value
F = 3.07 (see Table 5 in Appendix: Tables with ν1 = k = 3, ν2 = n - k - 1 = 21
degrees of freedom) for the regression to be significant. (Using Equation 7, the critical
F = 3.07 corresponds to an R² of about 30%, not a particularly large number.) However,
the “four times” rule suggests that the computed F should exceed 4(3.07) = 12.28 in
order for the regression to be worthwhile from a practical point of view.
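The arithmetic in this example can be reproduced directly; the snippet below (assuming scipy is available, with variable names of our own choosing) also inverts Equation 7 to show the R² implied by the critical F.

```python
from scipy import stats

k, n, alpha = 3, 25, 0.05
f_crit = stats.f.ppf(1 - alpha, dfn=k, dfd=n - k - 1)      # about 3.07
rule_of_thumb = 4 * f_crit                                  # about 12.28

# Equation 7 solved for R^2: R^2 = kF / (kF + n - k - 1)
r2_at_crit = k * f_crit / (k * f_crit + n - k - 1)          # about .30
```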

APPLICATION TO MANAGEMENT

Multiple regression analysis has been used extensively to help forecast the economic
activity of the various segments of the economy. Many of the reports and forecasts
about the future of our economy that appear in the Wall Street Journal, Fortune,
Business Week, and other similar sources are based on econometric (regression) mod-
els. The U.S. government makes wide use of regression analysis in predicting future
revenues, expenditures, income levels, interest rates, birthrates, unemployment, and
Social Security benefits requirements as well as a multitude of other events. In fact,
almost every major department in the U.S. government makes use of the tools
described in this chapter.
Similarly, business entities have adopted and, when necessary, modified regression
analysis to help in the forecasting of future events. Few firms can survive in today’s
environment without a fairly accurate forecast of tomorrow’s sales, expenditures, capi-
tal requirements, and cash flows. Although small or less sophisticated firms may be
able to get by with intuitive forecasts, larger and/or more sophisticated firms have
turned to regression analysis to study the relationships among several variables and to
determine how these variables are likely to affect their future.
Unfortunately, the very notoriety that regression analysis receives for its usefulness
as a tool in predicting the future tends to overshadow an equally important asset: its

14Some authors argue that the “four times” rule is not enough and should be replaced by a “ten times”
criterion.
15This assumes that no other defect is detected in the fit.


ability to help evaluate and control the present. Because a fitted regression equation
provides the researcher with both strength and direction information, management can
evaluate and change current strategies.
Suppose, for example, a manufacturer of jams wants to know where to direct its
marketing efforts when introducing a new flavor. Regression analysis can be used to
help determine the profile of heavy users of jams. For instance, a company might try to
predict the number of flavors of jam a household might have at any one time on the
basis of a number of independent variables, such as the following:

Number of children living at home


Age of children
Gender of children
Home ownership versus rental
Time spent shopping
Income

Even a superficial reflection on the jam example quickly leads the researcher to
realize that regression analysis has numerous possibilities for use in market segmenta-
tion studies. In fact, many companies use regression to study market segments to deter-
mine which variables seem to have an impact on market share, purchase frequency,
product ownership, and product and brand loyalty as well as on many other areas.
Agricultural scientists use regression analysis to explore the relationship of product
yield (e.g., number of bushels of corn per acre) to fertilizer type and amount, rainfall,
temperature, days of sun, and insect infestation. Modern farms are equipped with mini-
and microcomputers complete with software packages to help them in this process.
Medical researchers use regression analysis to seek links between blood pressure
and independent variables such as age, social class, weight, smoking habits, and race.
Doctors explore the impact of communications, number of contacts, and age of patient
on patient satisfaction with service.
Personnel directors explore the relationship of employee salary levels to geo-
graphic location, unemployment rates, industry growth, union membership, industry
type, and competitive salaries. Financial analysts look for causes of high stock prices by
analyzing dividend yields, earnings per share, stock splits, consumer expectations of
interest rates, savings levels, and inflation rates.
Advertising managers frequently try to study the impact of advertising budgets,
media selection, message copy, advertising frequency, and spokesperson choice on
consumer attitude change. Similarly, marketers attempt to determine sales from adver-
tising expenditures, price levels, competitive marketing expenditures, and consumer
disposable income as well as a wide variety of other variables.
A final example further illustrates the versatility of regression analysis. Real estate
site location analysts have found that regression analysis can be very helpful in pin-
pointing geographic areas of over- and underpenetration of specific types of retail
stores. For instance, a hardware store chain might look for a potential city in which to
locate a new store by developing a regression model designed to predict hardware sales
in any given city. Researchers could concentrate their efforts on those cities where the
model predicted higher sales than actually achieved (as can be determined from many
sources). The hypothesis is that sales of hardware are not up to potential in these cities.
In summary, regression analysis has provided management with a powerful and
versatile tool for studying the relationships between a dependent variable and multiple
independent variables. The goal is to better understand and perhaps control present
events as well as to better predict future events.


Glossary

Dummy variables. Dummy, or indicator, variables are used to determine the relationships between qualitative independent variables and a dependent variable.

Multicollinearity. Multicollinearity is the situation in which independent variables in a multiple regression equation are highly intercorrelated. That is, a linear relation exists between two or more independent variables.

Multiple regression. Multiple regression involves the use of more than one independent variable to predict a dependent variable.

Overfitting. Overfitting refers to adding independent variables to the regression function that, to a large extent, account for all the eccentricities of the sample data under analysis.

Partial, or net, regression coefficient. The partial, or net, regression coefficient measures the average change in the dependent variable per unit change in the relevant independent variable, holding the other independent variables constant.

Standard error of the estimate. The standard error of the estimate is the standard deviation of the residuals. It measures the amount the actual values (Y) differ from the estimated values (Ŷ).

Stepwise regression. Stepwise regression permits predictor variables to enter or leave the regression function at different stages of its development. An independent variable is removed from the model if it doesn’t continue to make a significant contribution when a new variable is added.

Key Formulas

Population multiple regression function

μY = β0 + β1X1 + β2X2 + ··· + βkXk        (1)

Estimated (fitted) regression function

Ŷ = b0 + b1X1 + ··· + bkXk        (2)

Sum of squares decomposition and associated degrees of freedom

Σ(Y - Ȳ)² = Σ(Ŷ - Ȳ)² + Σ(Y - Ŷ)²
SST = SSR + SSE        (3)
df: n - 1 = k + (n - k - 1)

Standard error of the estimate

sy·x's = √[Σ(Y - Ŷ)² / (n - k - 1)] = √[SSE / (n - k - 1)] = √MSE        (4)

F statistic for testing the significance of the regression

F = MSR / MSE

Coefficient of determination

R² = SSR/SST = Σ(Ŷ - Ȳ)² / Σ(Y - Ȳ)²
   = 1 - SSE/SST = 1 - Σ(Y - Ŷ)² / Σ(Y - Ȳ)²        (5)

Multiple correlation coefficient

R = √R²        (6)

Relation between F statistic and R²

F = [R² / (1 - R²)] · [(n - k - 1) / k]        (7)

Adjusted coefficient of determination

R̄² = 1 - (1 - R²)[(n - 1) / (n - k - 1)]        (8)

t statistic for testing H0: βj = 0

t = bj / sbj

Forecast of a future value

Ŷ* = b0 + b1X1* + b2X2* + ··· + bkXk*        (9)

Large-sample prediction interval for a future response

(Ŷ* - tα/2 sy·x's , Ŷ* + tα/2 sy·x's)        (10)

Variance inflation factors

VIFj = 1 / (1 - Rj²),   j = 1, 2, . . . , k        (11)

Standardized independent variable values

X̃ij = (Xij - X̄j) / √[Σi (Xij - X̄j)²],   j = 1, 2, . . . , k;  i = 1, 2, . . . , n        (12)

Leverage (one-predictor variable)

hii = 1/n + (Xi - X̄)² / Σ(Xi - X̄)²        (13)

Standardized residual

ei / sei = ei / (sy·x's √(1 - hii))        (14)
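As a compact check on Equations 2 through 8, the following sketch computes the main summary quantities for an ordinary least squares fit; the array names X and y and the function name are placeholders of our own.

```python
import numpy as np

def regression_summary(X, y):
    """SST, SSR, SSE, the standard error of the estimate, F, R^2, and adjusted R^2."""
    n, k = X.shape
    Z = np.column_stack([np.ones(n), X])
    b, _, _, _ = np.linalg.lstsq(Z, y, rcond=None)     # fitted coefficients, Equation 2
    y_hat = Z @ b

    sst = np.sum((y - y.mean()) ** 2)
    sse = np.sum((y - y_hat) ** 2)
    ssr = sst - sse                                    # Equation 3

    s = np.sqrt(sse / (n - k - 1))                     # Equation 4
    f = (ssr / k) / (sse / (n - k - 1))                # F = MSR / MSE
    r2 = ssr / sst                                     # Equation 5
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)      # Equation 8
    return {"SST": sst, "SSR": ssr, "SSE": sse, "s": s,
            "F": f, "R2": r2, "R2_adj": r2_adj}
```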
Problems

1. What are the characteristics of a good predictor variable?


2. What are the assumptions associated with the multiple regression model?
3. What does the partial, or net, regression coefficient measure in multiple regression?
4. What does the standard error of the estimate measure in multiple regression?
5. Your estimated multiple regression equation is Ŷ = 7.52 + 3X1 - 12.2X2. Predict
the value of Y if X1 = 20 and X2 = 7.


TABLE P–7

Variable    1    2    3    4    5    6
1 1.00 .55 .20 -.51 .79 .70
2 1.00 .27 .09 .39 .45
3 1.00 .04 .17 .21
4 1.00 -.44 -.14
5 1.00 .69
6 1.00

6. Explain each of the following concepts:


a. Correlation matrix
b. R²
c. Multicollinearity
d. Residual
e. Dummy variable
f. Stepwise regression
7. Most computer solutions for multiple regression begin with a correlation matrix.
Examining this matrix is often the first step when analyzing a regression problem
that involves more than one independent variable. Answer the following questions
concerning the correlation matrix given in Table P-7.
a. Why are all the entries on the main diagonal equal to 1.00?
b. Why is the bottom half of the matrix below the main diagonal blank?
c. If variable 1 is the dependent variable, which independent variables have the
highest degree of linear association with variable 1?
d. What kind of association exists between variables 1 and 4?
e. Does this correlation matrix show any evidence of multicollinearity?
f. In your opinion, which variable or variables will be included in the best fore-
casting model? Explain.
g. If the data given in this correlation matrix are run on a stepwise program, which
independent variable (2, 3, 4, 5, or 6) will be the first to enter the regression function?
8. Jennifer Dahl, supervisor of the Circle O discount chain, would like to forecast the
time it takes to check out a customer. She decides to use the following independent
variables: number of purchased items and total amount of the purchase. She
collects data for a sample of 18 customers, shown in Table P-8.
a. Determine the best regression equation.
b. When an additional item is purchased, what is the average increase in the
checkout time?
c. Compute the residual for customer 18.
d. Compute the standard error of the estimate.
e. Interpret part d in terms of the variables used in this problem.
f. Compute a forecast of the checkout time if a customer purchases 14 items that
amount to $70.
g. Compute a 95% interval forecast for your prediction in part f.
h. What should Jennifer conclude?
9. Table P-9 contains data on food expenditures, annual income, and family size for a
sample of 10 families.


TABLE P-8

Customer    Checkout Time (minutes) Y    Amount ($) X1    Number of Items X2
1 3.0 36 9
2 1.3 13 5
3 .5 3 2
4 7.4 81 14
5 5.9 78 13
6 8.4 103 16
7 5.0 64 12
8 8.1 67 11
9 1.9 25 7
10 6.2 55 11
11 .7 13 3
12 1.4 21 8
13 9.1 121 21
14 .9 10 6
15 5.4 60 13
16 3.3 32 11
17 4.5 51 15
18 2.4 28 10

TABLE P-9

Family    Annual Food Expenditures ($100s) Y    Annual Income ($1,000s) X1    Family Size X2

A 24 11 6
B 8 3 2
C 16 4 1
D 18 7 3
E 24 9 5
F 23 8 4
G 11 5 2
H 15 7 2
I 21 8 3
J 20 7 2

a. Construct the correlation matrix for the three variables in Table P-9. Interpret
the correlations in the matrix.
b. Fit a multiple regression model relating food expenditures to income and family
size. Interpret the partial regression coefficients of income and family size. Do
they make sense?
c. Compute the variance inflation factors (VIFs) for the independent variables. Is
multicollinearity a problem for these data? If so, how might you modify the
regression model?
10. Beer sales at the Shapiro One-Stop Store are analyzed using temperature
and number of people (age 21 or over) on the street as independent variables.


TABLE P-10 Minitab Output


Correlations
Y X1
X1 0.827
X2 0.822 0.680
Regression Analysis
The regression equation is
Y = -26.7 + .782 X1 + .068 X2
Predictor Coef SE Coef T P
Constant -26.706
X1 .78207 .22694
X2 .06795 .02026
S =        R-Sq =        R-Sq(adj) =
Analysis of Variance

Source DF SS MS F
Regression 2 11589.035 5794.516 36.11
Residual Error 17 2727.914 160.466
Total 19 14316.949

A random sample of 20 days is selected, and the following variables are


measured:
Y = the number of six-packs of beer sold each day
X1 = the daily high temperature
X2 = the daily traffic count
The data are analyzed using multiple regression. The partial computer output is in
Table P-10.
a. Analyze the correlation matrix.
b. Test the hypothesis H0: βj = 0, j = 1, 2, at the .01 significance level.
c. Forecast the volume of beer sold if the high temperature is 60 degrees and the
traffic count is 500 people.
d. Calculate R², and interpret its meaning in terms of this problem.
e. Calculate the standard error of the estimate.
f. Explain how beer sales are affected by an increase of one degree in the high
temperature.
g. State your conclusions for this analysis concerning the accuracy of the forecast-
ing equation and also the contributions of the independent variables.
11. A taxi company is interested in the relationship between mileage, measured in
miles per gallon, and the age of cars in its fleet. The 12 fleet cars are the same make
and size and are in good operating condition as a result of regular maintenance.
The company employs both male and female drivers, and it is believed that some
of the variability in mileage may be due to differences in driving techniques
between the groups of drivers of opposite gender. In fact, other things being equal,
women tend to get better mileage than men. Data are generated by randomly
assigning the 12 cars to five female and seven male drivers and computing miles
per gallon after 300 miles. The data appear in Table P-11.
a. Construct a scatter diagram with Y as the vertical axis and X1 as the horizontal
axis. Identify the points corresponding to male and female drivers, respectively.


TABLE P-11
Miles per Gallon Y    Age of Car (years) X1    Gender (0 = male, 1 = female) X2

22.3 3 0
22.0 4 1
23.7 3 1
24.2 2 0
25.5 1 1
21.1 5 0
20.6 4 0
24.0 1 0
26.0 1 1
23.1 2 0
24.8 2 1
20.2 5 0

b. Fit the regression model

Y = β0 + β1X1 + β2X2 + ε

and interpret the least squares coefficient, b2.


c. Compute the fitted values for each of the (X1, X2) pairs, and plot the fitted val-
ues on the scatter diagram. Draw straight lines through the fitted values for
male drivers and female drivers, respectively. Specify the equations for these
two straight lines.
d. Suppose gender is ignored. Fit the simple linear regression model,
Y = β0 + β1X1 + ε, and plot the fitted straight line on the scatter diagram. Is
it important to include the effects of gender in this case? Explain.
12. The sales manager of a large automotive parts distributor, Hartman Auto Supplies,
wants to develop a model to forecast as early as May the total annual sales of a
region. If regional sales can be forecast, then the total sales for the company can be
forecast. The number of retail outlets in the region stocking the company’s parts
and the number of automobiles registered for each region as of May 1 are the two
independent variables investigated. The data appear in Table P-12.
a. Analyze the correlation matrix.
b. How much error is involved in the prediction for region 1?
c. Forecast the annual sales for region 12, given 2,500 retail outlets and 20.2 million
automobiles registered.
d. Discuss the accuracy of the forecast made in part c.
e. Show how the standard error of the estimate was computed.
f. Give an interpretation of the partial regression coefficients. Are these regression
coefficients sensible?
g. How can this regression equation be improved?
13. The sales manager of Hartman Auto Supplies decides to investigate a new inde-
pendent variable, personal income by region (see Problem 12). The data for this
new variable are presented in Table P-13.
a. Does personal income by region make a contribution to the forecasting of sales?


TABLE P-12

Region    Annual Sales ($ millions) Y    Number of Retail Outlets X1    Number of Automobiles Registered (millions) X2
1 52.3 2,011 24.6
2 26.0 2,850 22.1
3 20.2 650 7.9
4 16.0 480 12.5
5 30.0 1,694 9.0
6 46.2 2,302 11.5
7 35.0 2,214 20.5
8 3.5 125 4.1
9 33.1 1,840 8.9
10 25.2 1,233 6.1
11 38.2 1,699 9.5

TABLE P-13
Region Personal Income Region Personal Income
($ billions) ($ billions)

1 98.5 7 67.6
2 31.1 8 19.7
3 34.8 9 67.9
4 32.7 10 61.4
5 68.8 11 85.6
6 94.7

b. Forecast annual sales for region 12 for personal income of $40 billion and the val-
ues for retail outlets and automobiles registered given in part c of Problem 12.
c. Discuss the accuracy of the forecast made in part b.
d. Which independent variables would you include in your final forecast model?
Why?
14. The Nelson Corporation decides to develop a multiple regression equation to
forecast sales performance. A random sample of 14 salespeople is interviewed and
given an aptitude test. Also, an index of effort expended is calculated for each
salesperson on the basis of a ratio of the mileage on his or her company car to the
total mileage projected for adequate coverage of territory. Regression analysis
yields the following results:
Ŷ = 16.57 + .65X1 + 20.6X2
                (.05)   (1.69)

The quantities in parentheses are the standard errors of the partial regression
coefficients. The standard error of the estimate is 3.56. The standard deviation of
the sales variable is sy = 16.57. The variables are
Y = the sales performance, in thousands
X1 = the aptitude test score
X2 = the effort index


a. Are the partial regression coefficients significantly different from zero at the .01
significance level?
b. Interpret the partial regression coefficient for the effort index.
c. Forecast the sales performance for a salesperson who has an aptitude test score
of 75 and an effort index of .5.
d. Calculate the sum of squared residuals, Σ(Y - Ŷ)².
e. Calculate the total sum of squares, Σ(Y - Ȳ)².
f. Calculate R², and interpret this number in terms of this problem.
g. Calculate the adjusted coefficient of determination, R̄².
15. We might expect credit card purchases to differ from cash purchases at the same
store. Table P-15 contains daily gross sales and items sold for cash purchases and
daily gross sales and items sold for credit card purchases at the same consignment
store for 25 consecutive days.
a. Make a scatter diagram of daily gross sales, Y, versus items sold for cash pur-
chases, X1. Using a separate plot symbol or color, add daily gross sales and
items sold for credit card purchases, X2 . Visually compare the relationship
between sales and number of items sold for cash with that for credit card
purchases.

TABLE P-15

Day    Gross Cash ($)    Number of Items    Gross Credit Card ($)    Number of Items

1 348 55 148 4
2 42 8 111 6
3 61 9 62 7
4 94 16 0 0
5 60 11 39 5
6 165 26 7 1
7 126 27 143 26
8 111 19 27 5
9 26 5 14 2
10 109 18 71 12
11 180 27 116 21
12 212 36 50 9
13 58 10 13 2
14 115 20 105 16
15 15 8 19 3
16 97 15 44 14
17 61 10 0 0
18 85 15 24 3
19 157 24 144 10
20 88 15 63 11
21 96 19 0 0
22 202 33 14 3
23 108 23 0 0
24 158 21 24 4
25 176 43 253 28


b. Define the dummy variable

   X2 = 1 if cash purchase, 0 if credit card purchase

   and fit the regression model

   Y = β0 + β1X1 + β2X2 + ε
c. Analyze the fit in part b. Be sure to include an analysis of the residuals. Are you
happy with your model?
d. Using the fitted model from part b, generate a forecast of daily sales for an
individual that purchases 25 items and pays cash. Construct a large-sample 95%
prediction interval for daily sales.
e. Describe the nature of the fitted function in part b. Do you think it is better to
fit two separate straight lines, one for the cash sales and another for the credit
card sales, to the data in Table P-15? Discuss.
16. Cindy Lawson just bought a major league baseball team. She has been receiving a lot
of advice about what she should do to create a winning ball club. Cindy asks you to
study this problem and write a report. You decide to use multiple regression analysis
to determine which statistics are important in developing a winning team (measured
by the number of games won during the 1991 season). You gather data for six statis-
tics from the Sporting News 1992 Baseball Yearbook, as shown in Table P-16, and run

TABLE P-16

Team Wins ERA SO BA Runs HR SB


Giants 75 4.03 905 .246 649 141 95
Mets 77 3.56 1,028 .244 640 117 153
Cubs 77 4.03 927 .253 695 159 123
Reds 74 3.83 997 .258 689 164 124
Pirates 98 3.44 919 .263 768 126 124
Cardinals 84 3.69 822 .255 651 68 202
Phillies 78 3.86 988 .241 629 111 92
Astros 65 4.00 1,033 .244 605 79 125
Dodgers 93 3.06 1,028 .253 665 108 126
Expos 71 3.64 909 .246 579 95 221
Braves 94 3.49 969 .258 749 141 165
Padres 84 3.57 921 .244 636 121 101
Red Sox 84 4.01 999 .269 731 126 59
White Sox 87 3.79 923 .262 758 139 134
Yankees 71 4.42 936 .256 674 147 109
Tigers 84 4.51 739 .247 817 209 109
Orioles 67 4.59 868 .254 686 170 50
Brewers 83 4.14 859 .271 799 116 106
Indians 57 4.23 862 .254 576 79 84
Blue Jays 91 3.50 971 .257 684 133 148
Mariners 83 3.79 1,003 .255 702 126 97
Rangers 85 4.47 1,022 .270 829 177 102
Athletics 84 4.57 892 .248 760 159 151
Royals 82 3.92 1,004 .264 727 117 119
Angels 81 3.69 990 .255 653 115 94
Twins 95 3.69 876 .280 776 140 107


a stepwise regression program, assuming a multiple regression model with “Wins” as


the dependent variable.
a. Discuss the importance of each independent variable.
b. What equation should Cindy use to forecast wins?
c. Write a report for Cindy.
d. Collect data from the most recent Sporting News Baseball Yearbook or other
source of baseball statistics. Run a stepwise regression, and compare your results.
17. Ms. Haight, a real estate broker, wishes to forecast the importance of four factors
in determining the prices of lots. She accumulates data on price, area, elevation,
and slope and rates the view for 50 lots. She runs the data on a correlation program
and obtains the correlation matrix given in Table P-17. Ms. Haight then runs the
data on a stepwise multiple regression program.
a. Determine which variable will enter the model first, second, third, and last.
b. Which variable or variables will be included in the best prediction equation?
18. The scores for two within-term examinations, X1 and X2; the current grade point
average (GPA), X3; and the final exam score, Y, for 20 students in a business statis-
tics class are listed in Table P-18.
a. Fit a multiple regression model to predict the final exam score from the scores
on the within-term exams and GPA. Is the regression significant? Explain.
b. Predict the final exam score for a student with within-term exam scores of 86
and 77 and a GPA of 3.4.
c. Compute the VIFs and examine the t statistics for checking the significance of
the individual predictor variables. Is multicollinearity a problem? Explain.

TABLE P-17
Variable
Variable Price Area Elevation Slope View
Price 1.00 .59 .66 .68 .88
Area 1.00 .04 .64 .41
Elevation 1.00 .13 .76
Slope 1.00 .63
View 1.00

TABLE P-18
X1 X2 X3 Y X1 X2 X3 Y
87 85 2.7 91 93 60 3.2 54
100 84 3.3 90 92 69 3.1 63
91 82 3.5 83 100 86 3.6 96
85 60 3.7 93 80 87 3.5 89
56 64 2.8 43 100 96 3.8 97
81 48 3.1 75 69 51 2.8 50
77 67 3.1 63 80 75 3.6 74
86 73 3.0 78 74 70 3.1 58
79 90 3.8 98 79 66 2.9 87
96 69 3.7 99 95 83 3.3 57


d. Compute the mean leverage. Are any of the observations high leverage points?
e. Compute the standardized residuals. Identify any observation with a large stan-
dardized residual. Does the fitted model under- or overpredict the response for
these observations?
19. Refer to the data in Table P-18. Find the “best” regression model using the step-
wise regression procedure and the all possible regressions procedure. Compare the
results. Are you confident using a regression model to predict the final exam score
with fewer than the original three independent variables?
20. Recall Example 12. The full data set related to CEO compensation is contained in
Appendix: Data Sets and Databases. Use stepwise regression to select the “best”
model with k = 3 predictor variables. Fit the stepwise model, and interpret the
estimated coefficients. Examine the residuals. Identify and explain any influential
observations. If you had to choose between this model and the k = 2 predictor
model discussed in Example 12, which one would you choose? Why?
21. Table P-21 contains the number of accounts (in thousands) and the assets (in bil-
lions of dollars) for 10 online stock brokerages. Plot the assets versus the number
of accounts. Investigate the possibility the relationship is curved by running a mul-
tiple regression to forecast assets using the number of accounts and the square of
the number of accounts as independent variables.
a. Give the fitted regression function. Is the regression significant? Explain.
b. Test for the significance of the coefficient of the squared term. Summarize your
conclusion.
c. Rerun the analysis without the quadratic (squared) term. Explain why the coeffi-
cient of the number of accounts is not the same as the one you found for part a.
22. The quality of cheese is determined by tasters whose scores are summarized in a
dependent variable called Taste. The independent (predictor) variables are three
chemicals that are present in the cheese: acetic acid, hydrogen sulfide (H2S), and
lactic acid. The 15 cases in the data set are given in Table P-22. Analyze these data
using multiple regression methods. Be sure to include only significant independent
variables in your final model and interpret R². Include an analysis of the residuals.
23. Refer to Problem 22. Using your final fitted regression function, forecast Taste (qual-
ity) for Acetic = 5.750, H2S = 7.300, and Lactic = 1.85. (All three independent
variable values may not be required.) Although n in this case is small, construct the

TABLE P-21
Assets ($ billions) Y    Number of Accounts (1,000s) X
219.0 2,500
21.1 909
38.8 615
5.5 205
160.0 2,300
19.5 428
11.2 590
5.9 134
1.3 130
6.8 125

TABLE P-22
Case Taste Y Acetic X1 H2S X2 Lactic X3
1 40.9 6.365 9.588 1.74
2 15.9 4.787 3.912 1.16
3 6.4 5.412 4.700 1.49
4 18.0 5.247 6.174 1.63
5 38.9 5.438 9.064 1.99
6 14.0 4.564 4.949 1.15
7 15.2 5.298 5.220 1.33
8 32.0 5.455 9.242 1.44
9 56.7 5.855 10.199 2.01
10 16.8 5.366 3.664 1.31
11 11.6 6.043 3.219 1.46
12 26.5 6.458 6.962 1.72
13 0.7 5.328 3.912 1.25
14 13.4 5.802 6.685 1.08
15 5.5 6.176 4.787 1.25

TABLE P-24 Accounting Categories for Major League Baseball ($ millions)


GtReceit MediaRev StadRev TotRev PlayerCt OpExpens OpIncome FranValu Franchise

19.4 61.0 7.4 90.0 29.8 59.6 30.4 200 NyYank


26.6 32.5 18.0 79.3 36.0 72.0 7.3 180 LADodge
22.9 50.0 16.0 91.1 35.2 70.4 20.7 170 NYMets
44.5 30.0 12.0 88.7 29.7 62.4 26.3 160 TorBlJay
24.5 40.5 14.3 81.5 35.4 70.8 10.7 160 BosRdSox
19.0 24.4 5.0 50.6 15.8 39.5 11.1 140 BaltOrio
27.5 25.7 19.3 78.0 18.0 60.0 18.0 140 ChWhSox
19.9 25.0 12.0 59.1 23.2 46.4 12.7 132 StLCard
22.8 27.5 12.0 64.5 29.0 58.0 6.5 132 ChCubs
19.0 25.5 14.0 61.5 20.7 47.6 13.9 123 TxRanger
16.9 20.5 14.0 53.6 30.4 60.8 -7.2 117 KCRoyals
15.2 23.2 7.6 48.2 21.7 43.4 4.8 115 PhilPhil
25.7 27.0 10.0 64.9 39.2 66.6 -1.7 115 OakAthlt
19.0 27.9 5.0 54.1 34.3 61.7 -7.6 103 CalAngls
15.5 26.2 5.0 48.9 33.3 53.3 -4.4 99 SFGiants
17.1 24.4 5.3 49.0 27.1 48.8 0.2 98 CinnReds
15.6 23.1 7.5 48.4 24.4 48.8 -0.4 96 SDPadres
10.6 25.2 8.0 46.0 12.1 31.5 14.5 95 HousAstr
16.2 21.9 5.5 45.8 24.9 49.8 -4.0 87 PittPirt
15.6 28.8 5.0 51.6 31.1 54.4 -2.8 85 DetTiger
15.4 18.9 3.8 40.3 20.4 40.8 -0.5 83 AtBraves
18.2 20.5 3.2 44.1 24.1 48.2 -4.1 83 MinnTwin
15.5 22.0 5.0 44.7 17.4 41.8 2.9 79 SeatMar
14.2 19.4 3.0 38.8 26.4 50.2 -11.4 77 MilBrews
9.5 23.7 6.6 42.0 19.5 46.8 -4.8 77 ClevIndn
10.7 23.5 3.0 39.4 21.8 43.6 -4.2 75 MonExpos
Source: M. Ozanian and S. Taub, “Big Leagues, Bad Business,” Financial World, July 7, 1992, pp. 34–51.


large-sample approximate 95% prediction interval for your forecast. Do you feel your
regression analysis has yielded a useful tool for forecasting cheese quality? Explain.
24. The 1991 accounting numbers for major league baseball are given in Table P-24.
All figures are in millions of dollars. The numerical variables are GtReceit (Gate
Receipts), MediaRev (Media Revenue), StadRev (Stadium Revenue), TotRev
(Total Revenue), PlayerCt (Player Costs), OpExpens (Operating Expenses),
OpIncome (Operating Income = Total Revenue - Operating Expenses), and
FranValu (Franchise Value).
a. Construct the correlation matrix for the variables GtReceit, MediaRev, . . . ,
FranValu. From the correlation matrix, can you determine a variable that is
likely to be a good predictor of FranValu? Discuss.
b. Use stepwise regression to build a model for predicting franchise value using
the remaining variables. Are you surprised at the result? Explain.
c. Can we conclude that, as a general rule, franchise value is about twice total rev-
enue? Discuss.
d. Player costs are likely to be a big component of operating expenses. Develop an
equation for forecasting operating expenses from player costs. Comment on the
strength of the relation. Using the residuals as a guide, identify teams that have
unusually low or unusually high player costs as a component of operating
expenses.
e. Consider the variables other than FranValu. Given their definitions, are there
groups of variables that are multicollinear? If so, identify these sets.

CASES

CASE 1 THE BOND MARKET16


Judy Johnson, vice president of finance of a large, private, investor-owned utility in the Northwest, was faced with a financing problem. The company needed money both to pay off short-term debts coming due and to continue construction of a coal-fired plant.
Judy's main concern was estimating the 10- or 30-year bond market; the company needed to decide whether to use equity financing or long-term debt. To make this decision, the utility needed a reliable forecast of the interest rate it would pay at the time of bond issuance.
Judy called a meeting of her financial staff to discuss the bond market problem. One member of her staff, Ron Peterson, a recent M.B.A. graduate, said he thought a multiple regression model could be developed to forecast the bond rates. Since the vice president was not familiar with multiple regression, she steered the discussion in another direction. After an hour of unproductive interaction, Judy then asked Ron to have a report on her desk the following Monday.
Ron knew that the key to the development of a good forecasting model is the identification of

16The data for this case study were provided by Dorothy Mercer, an Eastern Washington University M.B.A.
student. The analysis was done by M.B.A. students Tak Fu, Ron Hand, Dorothy Mercer, Mary Lou
Redmond, and Harold Wilson.
