Multiple Regression Analysis
In simple linear regression, the relationship between a single independent variable and
a dependent variable is investigated. The relationship between two variables frequently
allows one to accurately predict the dependent variable from knowledge of the inde-
pendent variable. Unfortunately, many real-life forecasting situations are not so simple.
More than one independent variable is usually necessary in order to predict a depen-
dent variable accurately. Regression models with more than one independent variable
are called multiple regression models. Most of the concepts introduced in simple linear
regression carry over to multiple regression. However, some new concepts arise because
more than one independent variable is used to predict the dependent variable.
Multiple regression involves the use of more than one independent variable to
predict a dependent variable.
As an example, return to the problem in which sales volume of gallons of milk is fore-
cast from knowledge of price per gallon. Mr. Bump is faced with the problem of making
a prediction that is not entirely accurate. He can explain almost 75% of the differences
in gallons of milk sold by using one independent variable. Thus, 25% (1 - r²) of the
total variation is unexplained. In other words, from the sample evidence Mr. Bump
knows 75% of what he must know to forecast sales volume perfectly. To do a more accu-
rate job of forecasting, he needs to find another predictor variable that will enable him
to explain more of the total variation. If Mr. Bump can reduce the unexplained varia-
tion, his forecast will involve less uncertainty and be more accurate.
A search must be conducted for another independent variable that is related to sales
volume of gallons of milk. However, this new independent, or predictor, variable cannot
relate too highly to the independent variable (price per gallon) already in use. If the two
independent variables are highly related to each other, they will explain the same varia-
tion, and the addition of the second variable will not improve the forecast.1 In fields such
as econometrics and applied statistics, there is a great deal of concern with this problem
of intercorrelation among independent variables, often referred to as multicollinearity.
1Interrelated predictor variables essentially contain much of the same information and therefore do not
contribute “new” information about the behavior of the dependent variable. Ideally, the effects of separate
predictor variables on the dependent variable should be unrelated to one another.
From Chapter 7 of Business Forecasting, Ninth Edition. John E. Hanke, Dean W. Wichern.
Copyright © 2009 by Pearson Education, Inc. All rights reserved.
The simple solution to the problem of two highly related independent variables is merely
not to use both of them together. The multicollinearity problem will be discussed later in
this chapter.
CORRELATION MATRIX
Mr. Bump decides that advertising expense might help improve his forecast of weekly
sales volume. He investigates the relationships among advertising expense, sales vol-
ume, and price per gallon by examining a correlation matrix. The correlation matrix is
constructed by computing the simple correlation coefficients for each combination of
pairs of variables.
An example of a correlation matrix is illustrated in Table 1. The correlation coeffi-
cient that indicates the relationship between variables 1 and 2 is represented as r12. Note
that the first subscript, 1, also refers to the row and the second subscript, 2, also refers to
the column in the table. This approach allows one to determine, at a glance, the relation-
ship between any two variables. Of course, the correlation between, say, variables 1 and
2 is exactly the same as the correlation between variables 2 and 1; that is, r12 = r21.
Therefore, only half of the correlation matrix is necessary. In addition, the correlation of
a variable with itself is always 1, so that, for example, r11 = r22 = r33 = 1.
Mr. Bump runs his data on the computer, and the correlation matrix shown in
Table 2 results. An investigation of the relationships among advertising expense, sales
volume, and price per gallon indicates that the new independent variable should con-
tribute to improved prediction. The correlation matrix shows that advertising expense
has a high positive relationship (r13 = .89) with the dependent variable, sales volume,
and a moderate negative relationship (r23 = -.65) with the independent variable,
price per gallon. This combination of relationships should permit advertising expenses
to explain some of the total variation of sales volume that is not already being
explained by price per gallon. As will be seen, when both price per gallon and advertis-
ing expense are used to estimate sales volume, R² increases to 93.2%.
The analysis of the correlation matrix is an important initial step in the solution of
any problem involving multiple independent variables.
TABLE 1  A Correlation Matrix

Variables      1      2      3
    1         r11    r12    r13
    2         r21    r22    r23
    3         r31    r32    r33
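A correlation matrix of this kind is easy to compute with most software. The following is a minimal sketch in Python (assuming numpy is available) using the weekly sales, price, and advertising figures that appear later in Table 4; the off-diagonal entries should agree with the correlations quoted in this section and in Table 8.

import numpy as np

# Weekly data from Table 4: sales (thousands of gallons), price per gallon ($),
# and advertising expense ($100s)
sales       = [10, 6, 5, 12, 10, 15, 5, 12, 17, 20]
price       = [1.30, 2.00, 1.70, 1.50, 1.60, 1.20, 1.60, 1.40, 1.00, 1.10]
advertising = [9, 7, 5, 14, 15, 12, 6, 10, 15, 21]

# Each row of the input is treated as one variable; the result is the matrix
# of simple correlation coefficients r_ij
corr = np.corrcoef([sales, price, advertising])
print(np.round(corr, 3))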
In simple regression, the dependent variable can be represented by Y and the inde-
pendent variable by X. In multiple regression analysis, X’s with subscripts are used to
represent the independent variables. The dependent variable is still represented by Y,
and the independent variables are represented by X1, X2, . . ., Xk. Once the initial set of
independent variables has been determined, the relationship between Y and these X’s
can be expressed as a multiple regression model.
In the multiple regression model, the mean response is taken to be a linear func-
tion of the explanatory variables:
μY = β0 + β1X1 + β2X2 + ... + βkXk                (1)
This expression is the population multiple regression function. As was the case with
simple linear regression, we cannot directly observe the population regression function
because the observed values of Y vary about their means. Each combination of values
for all of the X’s defines the mean for a subpopulation of responses Y. We assume that
the Y’s in each of these subpopulations are normally distributed about their means
with the same standard deviation, σ.
The data for simple linear regression consist of observations (Xi, Yi) on the two
variables. In multiple regression, the data for each case consist of an observation on the
response and an observation on each of the independent variables. The ith observation
on the jth predictor variable is denoted by Xi j. With this notation, data for multiple
regression have the form given in Table 3. It is convenient to refer to the data for the ith
case as simply the ith observation. With this convention, n is the number of observa-
tions and k is the number of predictor variables.
TABLE 3  Data for Multiple Regression

Case      X1       X2      ...      Xk       Y
  1       X11      X12     ...      X1k      Y1
  2       X21      X22     ...      X2k      Y2
  .        .        .               .        .
  n       Xn1      Xn2     ...      Xnk      Yn

In terms of this notation, the multiple regression model is

Yi = β0 + β1Xi1 + β2Xi2 + ... + βkXik + εi

where
1. For the ith observation, Y = Yi and X1, X2, ..., Xk are set at the values Xi1, Xi2, ..., Xik.
2. The ε's are error components that represent the deviations of the response from
the true relation. They are unobservable random variables accounting for the
effects of other factors on the response. The errors are assumed to be independent,
and each is normally distributed with mean 0 and unknown standard deviation σ.
3. The regression coefficients, β0, β1, ..., βk, that together locate the regression function
are unknown.
Given the data, the regression coefficients can be estimated using the principle of
least squares. The least squares estimates are denoted by b0, b1, ..., bk and the estimated
regression function by

Ŷ = b0 + b1X1 + ... + bkXk                (2)

The residuals, e = Y - Ŷ, are estimates of the error components. As in simple linear
regression, the sample quantities b0, b1, ..., bk and e correspond to the population
quantities β0, β1, ..., βk and ε.
Example 1
For the data shown in Table 4, Mr. Bump considers a multiple regression model relating
sales volume (Y) to price (X1) and advertising (X2):

Y = β0 + β1X1 + β2X2 + ε

The least squares values, b0 = 16.41, b1 = -8.25, and b2 = .59, minimize the sum of
squared errors

Σ(Y - b0 - b1X1 - b2X2)²

for all possible choices of b0, b1, and b2. Here, the best-fitting function is a plane (see
Figure 1). The data points are plotted in three dimensions along the Y, X1, and X2 axes.
The points fall above and below the plane in such a way that Σ(Y - Ŷ)² is a minimum.
The fitted regression function can be used to forecast next week's sales. If plans call for
a price per gallon of $1.50 and advertising expenditures of $1,000 (X2 = 10 hundreds of
dollars), the forecast is 9.935 thousands of gallons; that is,

Ŷ = 16.41 - 8.25(1.50) + .59(10) = 9.935
TABLE 4  Mr. Bump's Data for Example 1

Week     Sales (Y)               Price (X1)       Advertising (X2)
         (thousands of gallons)  ($ per gallon)   ($100s)
  1         10                      1.30               9
  2          6                      2.00               7
  3          5                      1.70               5
  4         12                      1.50              14
  5         10                      1.60              15
  6         15                      1.20              12
  7          5                      1.60               6
  8         12                      1.40              10
  9         17                      1.00              15
 10         20                      1.10              21
Totals     112                     14.40             114
Means      11.2                     1.44             11.4
[FIGURE 1  Fitted Regression Plane for Mr. Bump's Data for Example 1. The fitted plane
Ŷ = 16.41 - 8.25X1 + .59X2 is plotted with sales (Y) on the vertical axis against price (X1)
and advertising (X2). Two data points are highlighted: A (week 7: sales 5, price $1.60,
advertising 6) and B (week 9: sales 17, price $1.00, advertising 15).]
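The least squares plane and the forecast in Example 1 can be reproduced with a few lines of code. This is a minimal sketch in Python (numpy assumed), not the Minitab procedure used in the chapter.

import numpy as np

# Table 4: sales (Y, thousands of gallons), price (X1, $), advertising (X2, $100s)
y  = np.array([10, 6, 5, 12, 10, 15, 5, 12, 17, 20], dtype=float)
x1 = np.array([1.30, 2.00, 1.70, 1.50, 1.60, 1.20, 1.60, 1.40, 1.00, 1.10])
x2 = np.array([9, 7, 5, 14, 15, 12, 6, 10, 15, 21], dtype=float)

# Design matrix with a leading column of 1s for the intercept
X = np.column_stack([np.ones_like(y), x1, x2])

# Least squares values b0, b1, b2 minimize the sum of squared errors
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b, 3))                            # approximately [16.41, -8.25, 0.585]

# Forecast for a price of $1.50 and $1,000 of advertising (X2 = 10)
print(round(float(b @ [1.0, 1.50, 10.0]), 3))    # approximately 9.9 thousand gallons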
Consider the interpretation of b0, b1, and b2 in Mr. Bump’s fitted regression function.
The value b0 is again the Y-intercept. However, now it is interpreted as the value of YN
when both X1 and X2 are equal to zero. The coefficients b1 and b2 are referred to as the
partial, or net, regression coefficients. Each measures the average change in Y per unit
change in the relevant independent variable. However, because the simultaneous influence of
all the independent variables on Y is accounted for, each coefficient is interpreted with the
other independent variables held constant.
The partial, or net, regression coefficient measures the average change in the
dependent variable per unit change in the relevant independent variable, holding
the other independent variables constant.
In the present example, the b1 value of -8.25 indicates that each increase of 1 cent
in the price of a gallon of milk when advertising expenditures are held constant reduces
the quantity purchased by an average of 82.5 gallons. Similarly, the b2 value of .59
means that, if advertising expenditures are increased by $100 when the price per gallon
is held constant, then sales volume will increase an average of 590 gallons.
Example 2
To illustrate the net effects of individual X's on the response, consider the situation in which
price is to be $1.00 per gallon and $1,000 (X2 = 10) is to be spent on advertising. Then

Ŷ = 16.41 - 8.25(1.00) + .59(10) = 14.06

or about 14,060 gallons. Holding advertising fixed at $1,000, each $0.10 increase in price lowers
this estimate by .825 thousand gallons; holding price fixed at $1.00, each additional $100 of
advertising raises it by .59 thousand gallons.
Inference for multiple regression models is analogous to that for simple linear regres-
sion. The least squares estimates of the model parameters, their estimated standard
errors, the t statistics used to examine the significance of individual terms in the regres-
sion model, and the F statistic used to check the significance of the regression are all
provided in output from standard statistical software packages. Determining these
quantities by hand for a multiple regression analysis of any size is not practical, and the
computer must be used for calculations.
As you know, any observation Y can be written
Y = b0 + b1X1 + b2X2 + ... + bkXk + (Y - Ŷ)

or

Y = Ŷ + (Y - Ŷ)
where
Ŷ = b0 + b1X1 + b2X2 + ... + bkXk
is the fitted regression function. Recall that YN is an estimate of the population regres-
sion function. It represents that part of Y explained by the relation of Y with the X’s.
The residual, Y - YN , is an estimate of the error component of the model. It represents
that part of Y not explained by the predictor variables.
The sum of squares decomposition and the associated degrees of freedom are

     SST      =      SSR      +      SSE
 Σ(Y - Ȳ)²    =   Σ(Ŷ - Ȳ)²   +   Σ(Y - Ŷ)²                (3)
df:  n - 1    =       k       +   (n - k - 1)
The total variation in the response, SST, consists of two components: SSR, the variation
explained by the predictor variables through the estimated regression function, and
SSE, the unexplained or error variation. The information in Equation 3 can be set out
in an analysis of variance (ANOVA) table, which is discussed in a later section.
The standard error of the estimate, s_y·x's, is

s_y·x's = √(Σ(Y - Ŷ)² / (n - k - 1)) = √(SSE / (n - k - 1)) = √MSE                (4)

where

n = the number of observations
k = the number of independent variables in the regression function
SSE = Σ(Y - Ŷ)² = the residual sum of squares
MSE = SSE/(n - k - 1) = the residual mean square

The standard error of the estimate is the standard deviation of the residuals. It mea-
sures the amount the actual values (Y) differ from the estimated values (Ŷ). For rel-
atively large samples, we would expect about 67% of the differences Y - Ŷ to be
within s_y·x's of zero and about 95% of these differences to be within 2s_y·x's of zero.
Example 3
The quantities required to calculate the standard error of the estimate for Mr. Bump’s data
are given in Table 5.
2The standard error of the estimate is an estimate of σ, the standard deviation of the error term, ε, in the
multiple regression model.
TABLE 5  Residuals from the Model for Mr. Bump's Data for Example 3
(columns: Y, X1, X2, the predicted value Ŷ = 16.406 - 8.248X1 + .585X2, the residual
Y - Ŷ, and the squared residual (Y - Ŷ)²)

The squared residuals in Table 5 sum to SSE = Σ(Y - Ŷ)² = 15.90, so, with n = 10 and k = 2,

s_y·x's = √(15.90 / (10 - 2 - 1)) = √2.27 = 1.507
Source          Sum of Squares     df            Mean Square               F
Regression           SSR            k            MSR = SSR/k            F = MSR/MSE
Error                SSE        n - k - 1        MSE = SSE/(n - k - 1)
Total                SST          n - 1
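The entries of such an ANOVA table follow directly from the observed and fitted values. Below is a small Python sketch (numpy assumed) that computes them for any fitted regression; applied to Mr. Bump's data it should reproduce SSR = 217.7, SSE = 15.9, SST = 233.6, and an F ratio of about 47.9.

import numpy as np

def anova_table(y, y_hat, k):
    """Print the ANOVA table for a regression with k predictor variables,
    given the observed responses y and the fitted values y_hat."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)           # unexplained (error) variation
    ssr = np.sum((y_hat - y.mean()) ** 2)    # variation explained by the regression
    sst = np.sum((y - y.mean()) ** 2)        # total variation (SSR + SSE)
    msr, mse = ssr / k, sse / (n - k - 1)
    print(f"Regression  SS = {ssr:8.2f}  df = {k:3d}  MS = {msr:8.2f}  F = {msr / mse:.2f}")
    print(f"Error       SS = {sse:8.2f}  df = {n - k - 1:3d}  MS = {mse:8.2f}")
    print(f"Total       SS = {sst:8.2f}  df = {n - 1:3d}")
    return ssr, sse, sst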
In simple linear regression, there is only one predictor variable. Consequently, test-
ing for the significance of the regression using the F ratio from the ANOVA table is
equivalent to the two-sided t test of the hypothesis that the slope of the regression line
is zero. For multiple regression, the t tests (to be introduced shortly) examine the sig-
nificance of individual X’s in the regression function, and the F test examines the
significance of all the X’s collectively.
The coefficient of determination, R², is

R² = 1 - SSE/SST = 1 - Σ(Y - Ŷ)² / Σ(Y - Ȳ)²                (5)
and has the same form and interpretation as r2 does for simple linear regression. It rep-
resents the proportion of variation in the response, Y, explained by the relationship of
Y with the X’s.
A value of R² = 1 says that all the observed Y's fall exactly on the fitted regression
function. All of the variation in the response is explained by the regression. A value of
R² = 0 says that Ŷ = Ȳ (that is, SSR = 0) and none of the variation in Y is explained
by the regression. In practice, 0 ≤ R² ≤ 1, and the value of R² must be interpreted relative
to the extremes, 0 and 1.
The quantity

R = √R²                (6)

is called the multiple correlation coefficient and is the correlation between the
responses, Y, and the fitted values, Ŷ. Since the fitted values predict the responses, R is
always positive, so that 0 ≤ R ≤ 1.
The F statistic for testing the significance of the regression can be written in terms of R²:

F = (R² / (1 - R²)) × ((n - k - 1) / k)                (7)
so, everything else equal, significant regressions (large F ratios) are associated with rel-
atively large values for R2.
The coefficient of determination (R²) can always be increased by adding an addi-
tional independent variable, X, to the regression function, even if this additional vari-
able is not important.3 For this reason, some analysts prefer to interpret R2 adjusted for
the number of terms in the regression function. The adjusted coefficient of determina-
tion, R̄², is given by

R̄² = 1 - (1 - R²) ((n - 1) / (n - k - 1))                (8)
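Equations 5 and 8 translate directly into code. A short Python sketch, evaluated with the sums of squares from Mr. Bump's regression, follows.

def r_squared(sse, sst):
    """Coefficient of determination, Equation 5: R^2 = 1 - SSE/SST."""
    return 1.0 - sse / sst

def adjusted_r_squared(sse, sst, n, k):
    """Adjusted coefficient of determination, Equation 8."""
    return 1.0 - (1.0 - r_squared(sse, sst)) * (n - 1) / (n - k - 1)

# Mr. Bump's regression: SSE = 15.9, SST = 233.6, n = 10 observations, k = 2 predictors
print(round(r_squared(15.9, 233.6), 3))                  # about 0.932
print(round(adjusted_r_squared(15.9, 233.6, 10, 2), 3))  # about 0.912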
Example 4
Using the total sum of squares in Table 6 and the residual sum of squares from Example 3, the
sum of squares decomposition for Mr. Bump’s problem is
Σ(Y - Ȳ)² = Σ(Ŷ - Ȳ)² + Σ(Y - Ŷ)²
   233.6  =    217.7   +   15.9

and the coefficient of determination is

R² = 217.7/233.6 = 1 - 15.9/233.6 = .932
3Here, “not important” means “not significant.” That is, the coefficient of X is not significantly different from
zero (see the Individual Predictor Variables section that follows).
Individual Predictor Variables

The significance of an individual predictor variable, Xj, can be examined by testing
H0: βj = 0 against H1: βj ≠ 0. If H0: βj = 0 is true, the test statistic, t, with the value
t = bj/s_bj has a t distribution with df = n - k - 1.4 At significance level α,
reject H0 if |t| > t_α/2. Here, t_α/2 is the upper α/2 percentage point of a t distribution
with df = n - k - 1.
Some care must be exercised in dropping from the regression function those pre-
dictor variables that are judged to be insignificant by the t test (H0: βj = 0 cannot be
rejected). If the X's are related (multicollinear), the least squares coefficients and the
corresponding t values can change, sometimes appreciably, if a single X is deleted
from the regression function. For example, an X that was previously insignificant may
become significant. Consequently, if there are several small (insignificant) t values,
predictor variables should be deleted one at a time (starting with the variable having
the smallest t value) rather than in bunches. The process stops when the regression is
significant and all the predictor variables have large (significant) t statistics.
4Here, bj is the least squares coefficient for the jth predictor variable, Xj, and sbj is its estimated standard
deviation (standard error). These two statistics are ordinarily obtained with computer software such as
Minitab.
COMPUTER OUTPUT
The computer output for Mr. Bump’s problem is presented in Table 8. Examination of
this output leads to the following observations (explanations are keyed to Table 8).
1. The regression coefficients are -8.25 for price and .585 for advertising (1). The
fitted regression equation is Ŷ = 16.4 - 8.25X1 + .585X2.
2. The regression equation explains 93.2% of the variation in sales volume.
3. The standard error of the estimate is 1.5072 gallons. This value is a measure of the
amount the actual values differ from the fitted values.
4. The regression slope coefficient was tested to determine whether it was different
from zero. In the current situation, the large t statistic of -3.76 for the price vari-
able, X1, and its small p-value (.007) indicate that the coefficient of price is signifi-
cantly different from zero (reject H0: β1 = 0). Given the advertising variable, X2,
in the regression function, price cannot be dropped from the regression function.
Similarly, the large t statistic of 4.38 for the advertising variable, X2, and its small p-
value (.003) indicate that the coefficient of advertising is significantly different
from zero (reject H0: β2 = 0). Given the price variable, X1, in the regression func-
tion, the advertising variable cannot be dropped from the regression function. (As
a reference point for the magnitude of the t values, with seven degrees of freedom,
Table 3 in Appendix: Tables gives t.01 = 2.998.) In summary, the coefficients of
both predictor variables are significantly different from zero.
5. The p-value .007 is the probability of obtaining a t value at least as large (in absolute value) as -3.76 if
the hypothesis H0: β1 = 0 is true. Since this probability is extremely small, H0 is
unlikely to be true, and it is rejected. The coefficient of price is significantly different
from zero. The p-value .003 is the probability of obtaining a t value at least as large
as 4.38 if H0: b 2 = 0 is true. Since a t value of this magnitude is extremely unlikely,
H0 is rejected. The coefficient of advertising is significantly different from zero.
TABLE 8  Minitab Output for Mr. Bump's Data

Correlations: Y, X1, X2

          Y         X1
X1    -0.863
X2     0.891    -0.654 (6)

Regression Analysis: Y versus X1, X2

The regression equation is
Y = 16.4 - 8.25 X1 + 0.585 X2   (1)

Predictor      Coef           SE Coef       T              P
Constant      16.406 (1)       4.343        3.78           0.007
X1            -8.248 (1)       2.196       -3.76 (4)       0.007 (5)
X2            0.5851 (1)      0.1337        4.38 (4)       0.003 (5)

S = 1.50720 (3)   R-Sq = 93.2% (2)   R-Sq(adj) = 91.2% (9)

Analysis of Variance

Source            DF        SS           MS          F           P
Regression         2      217.70 (7)   108.85      47.92 (8)   0.000
Residual Error     7       15.90 (7)     2.27
Total              9      233.60 (7)
The adjusted R² in the output (9) is obtained from Equation 8:

R̄² = 1 - (1 - R²)((n - 1)/(n - k - 1)) = 1 - (1 - .932)(9/7) = .912
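Output comparable to the Minitab results in Table 8 can be produced by most statistical software. The sketch below uses Python's statsmodels package (an assumption; the chapter itself uses Minitab) with the Table 4 data.

import numpy as np
import statsmodels.api as sm

y = np.array([10, 6, 5, 12, 10, 15, 5, 12, 17, 20], dtype=float)
X = np.column_stack([
    [1.30, 2.00, 1.70, 1.50, 1.60, 1.20, 1.60, 1.40, 1.00, 1.10],   # X1: price
    [9, 7, 5, 14, 15, 12, 6, 10, 15, 21],                           # X2: advertising
])

# Coefficients, standard errors, t and p values, R-squared, adjusted R-squared,
# and the overall F test, comparable to the Minitab output in Table 8
results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.summary())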
DUMMY VARIABLES
Dummy, or indicator, variables are used to incorporate qualitative independent variables,
such as gender, into a regression function. The dummy variable technique is illustrated in
Figure 3. The data points for
females are shown as 0’s; the 1’s represent males. Two parallel lines are constructed for
the scatter diagram. The top one fits the data for females; the bottom one fits the male
data points.
Each of these lines was obtained from a fitted regression function of the form
Ŷ = b0 + b1X1 + b2X2
TABLE 9  Job Performance Ratings, Aptitude Test Scores, and Gender for Example 5

Employee    Performance Rating (Y)    Aptitude Test Score (X1)    Gender (X2)
    1                5                         60                   0 (F)
    2                4                         55                   0 (F)
    3                3                         35                   0 (F)
    4               10                         96                   0 (F)
    5                2                         35                   0 (F)
    6                7                         81                   0 (F)
    7                6                         65                   0 (F)
    8                9                         85                   0 (F)
    9                9                         99                   1 (M)
   10                2                         43                   1 (M)
   11                8                         98                   1 (M)
   12                6                         91                   1 (M)
   13                7                         95                   1 (M)
   14                3                         70                   1 (M)
   15                6                         85                   1 (M)
Totals              87                      1,093

ȲF = the mean female job performance rating = 5.75
ȲM = the mean male job performance rating = 5.86
X̄F = the mean female aptitude test score = 64
X̄M = the mean male aptitude test score = 83
[Figure: scatter diagram of job performance rating (Y) versus aptitude test score (X);
0 = females, 1 = males]
[Figure: job performance rating (Y) versus aptitude test score (X1), with a fitted line
labeled Ŷ = -1.96 + .12X1; 0 = females, 1 = males]
where

X1 = the test score
X2 = 0 for females, 1 for males (a dummy variable)
The single equation is equivalent to the following two equations:

Females (X2 = 0):  Ŷ = b0 + b1X1
Males (X2 = 1):    Ŷ = (b0 + b2) + b1X1
Note that b2 represents the effect of a male on job performance and that b1 represents
the effect of differences in aptitude test scores (the b1 value is assumed to be the same
for both males and females). The important point is that one multiple regression equa-
tion will yield the two estimated lines shown in Figure 3. The top line is the estimated
relation for females, and the lower line is the estimated relation for males. One might
envisage X2 as a “switching” variable that is “on” when an observation is made for a
male and “off” when it is made for a female.
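The fitted coefficients for this example are not preserved on this page, but they can be recovered from the Table 9 data. A minimal Python sketch (numpy assumed) of fitting the dummy-variable model follows.

import numpy as np

# Table 9: job performance rating (Y), aptitude test score (X1),
# and the gender dummy (X2 = 0 for females, 1 for males)
y  = np.array([5, 4, 3, 10, 2, 7, 6, 9, 9, 2, 8, 6, 7, 3, 6], dtype=float)
x1 = np.array([60, 55, 35, 96, 35, 81, 65, 85, 99, 43, 98, 91, 95, 70, 85], dtype=float)
x2 = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], dtype=float)

X = np.column_stack([np.ones_like(y), x1, x2])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

# The single fitted equation gives two parallel lines:
#   females (X2 = 0): y_hat = b0 + b1 * x1
#   males   (X2 = 1): y_hat = (b0 + b2) + b1 * x1
print(round(b0, 3), round(b1, 3), round(b2, 3))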
Example 6
The estimated multiple regression equation for the data of Example 5 is shown in the
Minitab computer output in Table 10.
Analysis of Variance
Source DF SS MS F P
Regression 2 86.981 43.491 70.35 0.000
Residual Error 12 7.419 0.618
Total 14 94.400
For the two values (0 and 1) of X2, the fitted equation yields the two parallel lines shown
in Figure 3, one for females and one for males.
Considered by itself, gender provides little information about job performance (the mean
ratings, 5.75 for females and 5.86 for males, are nearly equal). However, the moderate relationship, r23 = .428, between gender and aptitude
test score indicates that the test might discriminate between sexes. Males seem to do better
on the test than do females (83 versus 64). Perhaps some element of strength is required on
the test that is not required on the job.
When both test results and gender are used to forecast job performance, 92% of the vari-
ance is explained. This result suggests that both variables make a valuable contribution to pre-
dicting performance. The aptitude test scores explain 77% of the variance, and gender used in
conjunction with the aptitude test scores adds another 15%. The computed t statistics, 11.86
(p-value = .000) and -4.84 (p-value = .000), for aptitude test score and gender, respec-
tively, indicate that both predictor variables should be included in the final regression function.
MULTICOLLINEARITY
In many regression problems, data are routinely recorded rather than generated from
preselected settings of the independent variables. In these cases, the independent vari-
ables are frequently linearly dependent or multicollinear. For example, in appraisal
work, the selling price of a home may be related to predictor variables such as age, liv-
ing space in square feet, number of bathrooms, number of rooms other than bath-
rooms, lot size, and an index of construction quality. Living space, number of rooms,
and number of bathrooms should certainly “move together.” If one of these variables
increases, the others will generally increase.
If this linear dependence is less than perfect, the least squares estimates of the
regression model coefficients can still be obtained. However, these estimates tend to be
unstable—their values can change dramatically with slight changes in the data—and
inflated—their values are larger than expected. In particular, individual coefficients
may have the wrong sign, and the t statistics for judging the significance of individual
terms may all be insignificant, yet the F test will indicate the regression is significant.
Finally, the calculation of the least squares estimates is sensitive to rounding errors.
5The variance inflation factor (VIF) gets its name from the fact that the sampling variance of the least
squares coefficient bj is proportional to VIFj. The estimated standard deviation (standard error) of bj
therefore increases as VIFj increases.
For the jth independent variable, the variance inflation factor is VIFj = 1/(1 - Rj²), where Rj²
is the coefficient of determination from the regression of Xj on the remaining independent
variables. A VIF much greater than 1 indicates that the estimated coefficient attached to that independent vari-
able is unstable. Its value and associated t statistic may change considerably as the other
independent variables are added or deleted from the regression equation. A large VIF
means essentially that there is redundant information among the predictor variables. The
information being conveyed by a variable with a large VIF is already being explained by
the remaining predictor variables. Thus, multicollinearity makes interpreting the effect of
an individual predictor variable on the response (dependent variable) difficult.
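A VIF of this kind can be computed by regressing each predictor on the remaining predictors and applying VIFj = 1/(1 - Rj²). The following is a minimal Python sketch (numpy assumed), not the routine used by any particular package.

import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2) for each column of the predictor matrix X
    (n rows, k columns), where R_j^2 is the coefficient of determination from
    regressing column j on the remaining columns (with an intercept)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        xj = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, xj, rcond=None)
        resid = xj - others @ coef
        r2_j = 1.0 - resid.var() / xj.var()    # R_j^2 of the auxiliary regression
        vifs.append(1.0 / (1.0 - r2_j))
    return vifs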
Example 7
A large component of the cost of owning a newspaper is the cost of newsprint. Newspaper
publishers are interested in factors that determine annual newsprint consumption. In one
study (see Johnson and Wichern, 1997), data on annual newsprint consumption (Y), the num-
ber of newspapers in a city (X1), the logarithm6 of the number of families in a city (X2), and
the logarithm of total retail sales in a city (X3) were collected for n = 15 cities. The
correlation array for the three predictor variables and the Minitab output from a regression
analysis relating newsprint consumption to the predictor variables are in Table 11.
The F statistic (18.54) and its p-value (.000) clearly indicate that the regression is signifi-
cant. The t statistic for each of the independent variables is small with a relatively large
p-value. It must be concluded, for example, that the variable LnFamily is not significant, pro-
vided the other predictor variables remain in the regression function. This suggests that the
term b 2X2 can be dropped from the regression function if the remaining terms, b 1X1 and
b 3X3, are retained. Similarly, it appears as if b 3X3 can be dropped if b 1X1 and b 2X2 remain in
the regression function. The t value (1.69) associated with papers is marginally significant, but
the term b 1X1 might also be dropped if the other predictor variables remain in the equation.
Here, the regression is significant, but each of the predictor variables is not significant. Why?
The VIF column in Table 11 provides the answer. Since VIF = 1.7 for Papers, this pre-
dictor variable is very weakly related (VIF near 1) to the remaining predictor variables,
LnFamily and LnRetSales. The VIF = 7.4 for LnFamily is relatively large, indicating this
6Logarithms of the number of families and the total retail sales are used to make the numbers less positively
skewed and more manageable.
variable is linearly related to the remaining predictor variables. Also, the VIF = 8.1 for
LnRetSales indicates that LnRetSales is related to the remaining predictor variables. Since
Papers is weakly related to LnFamily and LnRetSales, the relationship among the predictor
variables is essentially the relationship between LnFamily and LnRetSales. In fact, the sample
correlation between LnFamily and LnRetSales is r = .93, showing strong linear association.
The variables LnFamily and LnRetSales are very similar in their ability to explain
newsprint consumption. We need only one, but not both, in the regression function. The
Minitab output from a regression analysis with LnFamily (smallest t statistic) deleted from
the regression function is shown in Table 12.
Notice that the coefficient of Papers is about the same for the two regressions. The coef-
ficients of LnRetSales, however, are considerably different (3,455 for k = 3 predictors and
5,279 for k = 2 predictors). Also, for the second regression, the variable LnRetSales is
clearly significant (t = 4.51 with p-value = .001). With Papers in the model, LnRetSales is
an additional important predictor of newsprint consumption. The R2’s for the two regres-
sions are nearly the same, approximately .83, as are the standard errors of the estimates,
s_y·x's = 1,849 and s_y·x's = 1,820, respectively. Finally, the common VIF = 1.7 for the two pre-
dictors in the second model indicates that multicollinearity is no longer a problem. As a
residual analysis confirms, for the variables considered, the regression of Newsprint on
Papers and LnRetSales is entirely adequate.
If estimating the separate effects of the predictor variables is important and multi-
collinearity appears to be a problem, what should be done? There are several ways to
deal with severe multicollinearity, as follows. None of them may be completely satisfac-
tory or feasible.
• Create new X variables (call them X̃) by scaling all the independent variables
  according to the formula

  X̃ij = (Xij - X̄j) / √(Σi (Xij - X̄j)²)        j = 1, 2, ..., k; i = 1, 2, ..., n                (12)

  These new variables will each have a sample mean of 0 and the same sample
  standard deviation. The regression calculations with the new X̃'s are less sensitive
  to round-off error in the presence of severe multicollinearity. (A small computational
  sketch of this scaling is given after this list.)
• Identify and eliminate one or more of the redundant independent variables from
the regression function. (This approach was used in Example 7.)
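The scaling in Equation 12 is a one-line computation per variable. A minimal Python sketch (numpy assumed):

import numpy as np

def scale_predictors(X):
    """Scale each column of X as in Equation 12: subtract its sample mean and
    divide by the square root of its sum of squared deviations. Every scaled
    column then has mean 0 and sum of squares 1."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    return centered / np.sqrt((centered ** 2).sum(axis=0))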
How does one develop the best multiple regression equation to forecast a variable of
interest? The first step involves the selection of a complete set of potential predictor
variables. Any variable that might add to the accuracy of the forecast should be
included. In the selection of a final equation, one is usually faced with the dilemma of
providing the most accurate forecast for the smallest cost. In other words, when choos-
ing predictor variables to include in the final equation, the analyst must evaluate them
by using the following two opposed criteria:
1. The analyst wants the equation to include as many useful predictor variables as
possible.9
2. Given that it costs money to obtain and monitor information on a large number of
X’s, the equation should include as few predictors as possible. The simplest equa-
tion is usually the best equation.
The selection of the best regression equation usually involves a compromise between
these extremes, and judgment will be a necessary part of any solution.
After a seemingly complete list of potential predictors has been compiled, the
second step is to screen out the independent variables that do not seem appropriate. An
independent variable (1) may not be fundamental to the problem (there should be
some plausible relation between the dependent variable and an independent variable),
(2) may be subject to large measurement errors, (3) may duplicate other independent
variables (multicollinearity), or (4) may be difficult to measure accurately (accurate
data are unavailable or costly).
The third step is to shorten the list of predictors so as to obtain a “best” selection of
independent variables. Techniques currently in use are discussed in the material that
follows. None of the search procedures can be said to yield the “best” set of independent
variables. Indeed, there is often no unique “best” set. To add to the confusion, the various
techniques do not all necessarily lead to the same final prediction [Link] entire vari-
able selection process is very [Link] primary advantage of automatic-search proce-
dures is that analysts can then focus their judgments on the pivotal areas of the problem.
To demonstrate various search procedures, a simple example is presented that has
five potential independent variables.
Example 8
Pam Weigand, the personnel manager of the Zurenko Pharmaceutical Company, is inter-
ested in forecasting whether a particular applicant will become a good salesperson. She
decides to use the first month’s sales as the dependent variable (Y), and she chooses to
analyze the following independent variables:
7Alternative procedures for estimating the regression parameters are beyond the scope of this text. The
interested reader should consult the work of Draper and Smith (1998).
8Again, the procedures for creating linear combinations of the X’s that are uncorrelated are beyond the
scope of this text. Draper and Smith (1998) discuss these techniques.
9Recall that, whenever a new predictor variable is added to a multiple regression equation, R2 increases.
Therefore, it is important that a new predictor variable make a significant contribution to the regression equation.
interrelationships that must be dealt with in attempting to find the best possible set of
explanatory variables.
Two procedures are demonstrated: all possible regressions and stepwise regression.
Example 9
The results from the all possible regressions runs for the Zurenko Pharmaceutical Company
are presented in Table 15. Notice that Table 15 is divided into six sets of regression equation
outcomes. This breakdown coincides with the number of parameters contained in each
equation.
The third step involves the selection of the best independent variable (or vari-
ables) for each parameter grouping. The equation with the highest R2 is considered
best. Using the results from Example 9, the best equation from each set listed in
Table 15 is presented in Table 16.
The fourth step involves making the subjective decision: “Which equation is the
best?” On the one hand, the analyst desires the highest R2 possible; on the other
hand, he or she wants the simplest equation possible. The all possible regressions
approach assumes that the number of data points, n, exceeds the number of parame-
ters, k + 1.
Example 10
The analyst is attempting to find the point at which adding additional independent variables
for the Zurenko Pharmaceutical problem is not worthwhile because it leads to a very small
increase in R2. The results in Table 16 clearly indicate that adding variables after selling
TABLE 16  Best Regression from Each Set of All Possible Regressions

Number of Parameters    Variables in Equation       Error df     R²
        1               None                           29       .0000
        2               X2                             28       .6370
        3               X1, X2                         27       .8948
        4               X1, X2, X5                     26       .8953
        5               X1, X2, X3, X5                 25       .8955
        6               X1, X2, X3, X4, X5             24       .8955
aptitude test (X1) and age (X2) is not necessary. Therefore, the final fitted regression equa-
tion is of the form
Ŷ = b0 + b1X1 + b2X2
The all possible regressions procedure is discussed at length by Draper and Smith (1998).
Stepwise Regression
The stepwise regression procedure adds one independent variable at a time to the
model, one step at a time. A large number of independent variables can be handled on
the computer in one run when using this procedure.
Stepwise regression can best be described by listing the basic steps (algorithm)
involved in the computations.
1. All possible simple regressions are considered. The predictor variable that explains
the largest significant proportion of the variation in Y (has the largest correlation
with the response) is the first variable to enter the regression equation.
2. The next variable to enter the equation is the one (out of those not included) that
makes the largest significant contribution to the regression sum of squares. The sig-
nificance of the contribution is determined by an F test. The value of the F statistic
that must be exceeded before the contribution of a variable is deemed significant
is often called the F to enter.
3. Once an additional variable has been included in the equation, the individual con-
tributions to the regression sum of squares of the other variables already in the
equation are checked for significance using F tests. If the F statistic is less than a
value called the F to remove, the variable is deleted from the regression equation.
4. Steps 2 and 3 are repeated until all possible additions are nonsignificant and all
possible deletions are significant. At this point, the selection stops.
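The four steps above can be sketched in code. The version below, in Python (assuming pandas and statsmodels are available), enters and removes variables using p-values at a chosen α rather than explicit F-to-enter and F-to-remove values; since F = t², the two formulations are essentially equivalent. It is an illustration of the algorithm, not Minitab's implementation.

import statsmodels.api as sm

def stepwise(X, y, alpha_enter=0.05, alpha_remove=0.05, max_steps=100):
    """Forward selection with backward removal on the columns of the pandas
    DataFrame X against the response y. Returns the list of selected columns."""
    selected = []
    for _ in range(max_steps):
        changed = False
        # Steps 1 and 2: enter the candidate with the smallest significant p-value
        p_enter = {}
        for c in [c for c in X.columns if c not in selected]:
            fit = sm.OLS(y, sm.add_constant(X[selected + [c]])).fit()
            p_enter[c] = fit.pvalues[c]
        if p_enter:
            best = min(p_enter, key=p_enter.get)
            if p_enter[best] < alpha_enter:
                selected.append(best)
                changed = True
        # Step 3: drop any variable already entered that is no longer significant
        if selected:
            fit = sm.OLS(y, sm.add_constant(X[selected])).fit()
            worst = fit.pvalues[selected].idxmax()
            if fit.pvalues[worst] > alpha_remove:
                selected.remove(worst)
                changed = True
        # Step 4: stop when no additions or deletions occur
        if not changed:
            break
    return selected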
The user of a stepwise regression program supplies the values that decide when a
variable is allowed to enter and when a variable is removed. Since the F statistics used in
stepwise regression are such that F = t2 where t is the t statistic for checking the signifi-
cance of a predictor variable, F = 4 (corresponding to |t| = 2) is a common choice for
both the F to enter and the F to remove. An F to enter of 4 is essentially equivalent to
testing for the significance of a predictor variable at the 5% level. The Minitab stepwise
program allows the user to choose an α level to enter and to remove variables or the F
value to enter and to remove variables. Using an α value of .05 is approximately equiv-
alent to using F = 4. The current default values in Minitab are α = .15 and F = 4.
The result of the stepwise procedure is a model that contains only independent
variables with t values that are significant at the specified level. However, because of
the step-by-step development, there is no guarantee that stepwise regression will
select, for example, the best three variables for prediction. In addition, an automatic
selection method is not capable of indicating when transformations of variables are
useful, nor does it necessarily avoid a multicollinearity problem. Finally, stepwise
regression cannot create important variables that are not supplied by the user. It is nec-
essary to think carefully about the collection of independent variables that is supplied
to a stepwise regression program.
The stepwise procedure is illustrated in Example 11.
Example 11
Let’s “solve” the Zurenko Pharmaceutical problem using stepwise regression.
Pam examines the correlation matrix shown in Table 14 and decides that, when she
runs the stepwise analysis, the age variable will enter the model first because it has the
largest correlation with sales (r1,3 = .798) and will explain 63.7% (.798²) of the variation
in sales.
She notes that the aptitude test score will probably enter the model second because it is
strongly related to sales (r1,2 = .676) but not highly related to the age variable (r2,3 = .228)
already in the model.
Pam also notices that the other variables will probably not qualify as good predictor
variables. The anxiety test score will not be a good predictor because it is not well related to
sales (r1,4 = -.296). The experience and GPA variables might have potential as good
predictor variables (r1,5 = .550 and r1,6 = .622, respectively). However, both of these
predictor variables have a potential multicollinearity problem with the age variable
(r3,5 = .540 and r3,6 = .695, respectively).
The Minitab commands to run a stepwise regression analysis for this example are
demonstrated in the Minitab Applications section at the end of the chapter. The output
from this stepwise regression run is shown in Table 17. The stepwise analysis proceeds
according to the steps that follow.
Step 1. As Pam thought, the age variable entered the model first and explains 63.7%
of the sales variance. Since the p-value of .000 is less than the α value of .05, age is
added to the model. Remember that the p-value is the probability of obtaining a
t statistic as large as 7.01 by chance alone. The Minitab decision rule that Pam
selected is to enter a variable if the p-value is less than α = .05.
Note that t = 7.01 > 2.048, the upper .025 point of a t distribution with
28 (n - k - 1 = 30 - 1 - 1) degrees of freedom. Thus, at the .05 significance
level, the hypothesis H0: β1 = 0 is rejected in favor of H1: β1 ≠ 0. Since
t² = F or 2.048² = 4.19, an F to enter of 4 is also essentially equivalent to testing
for the significance of a predictor variable at the 5% level. In this case, since the
coefficient of the age variable is clearly significantly different from zero, age
enters the regression equation, and the procedure now moves to step 2.
Step 2. The model after step 2 is

Sales = -86.79 + 5.93 (Age) + 0.200 (Aptitude)

H0: β2 = 0
H1: β2 ≠ 0
Again, the p-value of .000 is less than the α value of .05, and aptitude test score is
added to the model. The aptitude test score’s regression coefficient is significantly
different from zero, and the probability that this occurred by chance sampling
error is approximately zero. This result means that the aptitude test score is an
important variable when used in conjunction with age.
The critical t statistic based on 27 (n - k - 1 = 30 - 2 - 1) degrees of
freedom is 2.052.10 The computed t ratio found on the Minitab output is 8.13,
which is greater than 2.052. Using a t test, the null hypothesis is also rejected.
Note that the p-value for the age variable’s t statistic, .000, remains very small.
Age is still a significant predictor of sales. The procedure now moves on to step 3.
Step 3. The computer now considers adding a third predictor variable, given that X1 (age)
and X2 (aptitude test score) are in the regression equation. None of the remaining
independent variables is significant (has a p-value less than .05) when run in
combination with X1 and X2, so the stepwise procedure is completed.
Pam’s final model selected by the stepwise procedure is the two-predictor
variable model given in step 2.
10Again, since 2.052² = 4.21, using an F to enter of 4 is roughly equivalent to testing for the significance of a
predictor variable at the .05 level.
A regression analysis is not complete until one is convinced the model is an adequate
representation of the data. It is imperative to examine the adequacy of the model
before it becomes part of the decision-making apparatus.
An examination of the residuals is a crucial component of the determination of
model adequacy. Also, if regression models are used with time series data, it is important
to compute the residual autocorrelations to check the independence assumption.
Inferences (and decisions) made with models that do not approximately conform to the
regression assumptions can be grossly misleading. For example, it may be concluded
that the manipulation of a predictor variable will produce a specified change in the
response when, in fact, it will not. It may be concluded that a forecast is very likely (95%
confidence) to be within 2% of the future response when, in fact, the actual confidence
is much less, and so forth.
In this section, some additional tools that can be used to evaluate a regression
model will be discussed. These tools are designed to identify observations that are out-
lying or extreme (observations that are well separated from the remainder of the data).
Outlying observations are often hidden by the fitting process and may not be easily
detected from an examination of residual plots. Yet they can have a major role in deter-
mining the fitted regression function. It is important to study outlying observations to
decide whether they should be retained or eliminated and, if retained, whether their
influence should be reduced in the fitting process or the regression function revised.
A measure of the influence of the ith data point on the location of the fitted
regression function is provided by the leverage hii. The leverage depends only on the
predictors; it does not depend on the response, Y. For simple linear regression with one
predictor variable, X,
hii = 1/n + (Xi - X̄)² / Σ(Xi - X̄)²                (13)
With k predictors, the expression for the ith leverage is more complicated; however,
one can show that 0 < hii < 1 and that the mean leverage is h̄ = (k + 1)/n.
If the ith data point has high leverage (hii is close to 1), the fitted response, Ŷi, at
these X's is almost completely determined by Yi, with the remaining data having very
little influence. The high leverage data point is also an outlier among the X's (far from
other combinations of X values).11 A rule of thumb suggests that hii is large enough to
merit checking if hii ≥ 3(k + 1)/n.
The detection of outlying or extreme Y values is based on the size of the residuals,
e = Y - Ŷ. Large residuals indicate a Y value that is "far" from its fitted or predicted
11The converse is not necessarily true. That is, an outlier among the X’s may not be a high leverage point.
value, Ŷ. A large residual will show up in a histogram of the residuals as a value far (in
either direction) from zero. A large residual will show up in a plot of the residuals ver-
sus the fitted values as a point far above or below the horizontal axis.
Software packages such as Minitab flag data points with extreme Y values by com-
puting “standardized” residuals and identifying points with large standardized residuals.
One standardization is based on the fact that the residuals have estimated stan-
dard deviations

s_ei = s_y·x's √(1 - hii)

where s_y·x's = √MSE is the standard error of the estimate and hii is the leverage
associated with the ith data point. The standardized residual12 is then

ei / s_ei = ei / (s_y·x's √(1 - hii))                (14)

A data point is flagged as having an extreme Y value if

|ei / s_ei| > 2
The Y values corresponding to data points with large standardized residuals can heav-
ily influence the location of the fitted regression function.
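Leverages and standardized residuals can be computed directly from the design matrix. A minimal Python sketch (numpy assumed) follows; statistical packages report the same quantities automatically.

import numpy as np

def leverages_and_standardized_residuals(X, y):
    """X is the design matrix (including the column of 1s) and y is the response.
    Returns the leverages h_ii and the standardized residuals e_i / s_ei."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n, p = X.shape                                   # p = k + 1 parameters
    H = X @ np.linalg.inv(X.T @ X) @ X.T             # hat matrix
    h = np.diag(H)                                   # leverages h_ii
    resid = y - H @ y                                # residuals e = y - y_hat
    mse = np.sum(resid ** 2) / (n - p)               # residual mean square
    std_resid = resid / np.sqrt(mse * (1.0 - h))     # Equation 14
    return h, std_resid

# Points with h_ii >= 3(k + 1)/n or |standardized residual| > 2 merit a closer look.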
Example 12
Chief executive officer (CEO) salaries in the United States are of interest because of their
relationship to salaries of CEOs in international firms and to salaries of top professionals
outside corporate America. Also, for an individual firm, the CEO compensation directly, or
indirectly, influences the salaries of managers in positions below that of CEO. CEO salary
varies greatly from firm to firm, but data suggest that salary can be explained in terms of a
firm’s sales and the CEO’s amount of experience, educational level, and ownership stake in
the firm. In one study, 50 firms were used to develop a multiple regression model linking
CEO compensation to several predictor variables such as sales, profits, age, experience,
professional background, educational level, and ownership stake.
After eliminating unimportant predictor variables, the final fitted regression func-
tion related the logarithm of CEO compensation (LnComp) to the CEO's educational
level (Educate) and the logarithm of company sales (LnSales).
Minitab identified three observations from this regression analysis that have either
large standardized residuals or large leverage.
12Some software packages may call the standardized residual given by Equation 14 the Studentized residual.
Unusual Observations
Obs Educate LnComp Fit StDev Fit Residual St Resid
14    1.00    6.0568    7.0995    0.0949    -1.0427    -2.09R
25    0.00    8.1342    7.9937    0.2224     0.1405     0.31X
33    0.00    6.3969    7.3912    0.2032    -0.9943    -2.13R
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large
influence.
Observations 14 and 33 have large standardized residuals. The fitted regression function is
predicting (log) compensation that is too large for these two CEOs. An examination of the
full data set shows that these CEOs each own relatively large percentages of their compa-
nies’ stock. Case 14 owns more than 10% of the company’s stock, and case 33 owns more
than 17% of the company’s stock. These individuals are receiving much of their remunera-
tion through long-term compensation, such as stock incentives, rather than through annual
salary and bonuses. Since amount of stock owned (or stock value) is not included as a vari-
able in the regression function, it cannot be used to adjust the prediction of compensation
determined by CEO education and company sales. Although education and (log) sales do
not predict the compensation of these two CEOs as well as the others, there appears to be
no reason to eliminate them from consideration.
Observation 25 is singled out because the leverage for this data point is greater than
3(k + 1)/n = 3(3)/50 = .18. This CEO has no college degree (Educate = 0) but is with a
company with relatively large sales (LnSales = 9.394). The combination (0, 9.394) is far
from the point (X̄1, X̄2); therefore, it is an outlier among the pairs of X's. The response asso-
ciated with these X’s will have a large influence on the determination of the fitted regres-
sion function. (Notice that the standardized residual for this data point is small, indicating
that the predicted or fitted (log) compensation is close to the actual value.) This particular
CEO has 30 years of experience as a CEO, more experience than all but one of the CEOs in
the data set. This observation is influential, but there is no reason to delete it.
FORECASTING CAVEATS
We finish this discussion of multiple regression with some general comments. These
comments are oriented toward the practical application of regression analysis.
Overfitting

Overfitting refers to adding independent variables to the regression function that, to a
large extent, account for the eccentricities of the particular sample under analysis.
When such a model is applied to new sets of data selected from the same population, it
does not forecast as well as the initial fit might suggest.
Overfitting is more likely to occur when the sample size is small, especially if a large
number of independent variables are included in the model. Some practitioners have
13A good discussion of Cook’s distance is provided by Draper and Smith (1998).
suggested that there should be at least 10 observations for each independent variable. (If
there are four independent variables, a sample size n of at least 40 is suggested.)
One way to guard against overfitting is to develop the regression function from
one part of the data and then apply it to a “holdout” sample. Use the fitted regression
function to forecast the holdout responses and calculate the forecast errors. If the fore-
cast errors are substantially larger than the fitting errors as measured by, say, compara-
ble mean squared errors, then overfitting has occurred.
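A sketch of this holdout check in Python (numpy assumed); here X is a design matrix that already includes the column of 1s, and n_fit is the number of cases used for fitting.

import numpy as np

def holdout_check(X, y, n_fit):
    """Fit the regression on the first n_fit cases and compare the mean squared
    error there with the mean squared forecast error on the holdout cases."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    b, *_ = np.linalg.lstsq(X[:n_fit], y[:n_fit], rcond=None)
    fit_mse     = np.mean((y[:n_fit] - X[:n_fit] @ b) ** 2)
    holdout_mse = np.mean((y[n_fit:] - X[n_fit:] @ b) ** 2)
    # Holdout errors substantially larger than the fitting errors suggest overfitting.
    return fit_mse, holdout_mse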
APPLICATION TO MANAGEMENT
Multiple regression analysis has been used extensively to help forecast the economic
activity of the various segments of the economy. Many of the reports and forecasts
about the future of our economy that appear in the Wall Street Journal, Fortune,
Business Week, and other similar sources are based on econometric (regression) mod-
els. The U.S. government makes wide use of regression analysis in predicting future
revenues, expenditures, income levels, interest rates, birthrates, unemployment, and
Social Security benefits requirements as well as a multitude of other events. In fact,
almost every major department in the U.S. government makes use of the tools
described in this chapter.
Similarly, business entities have adopted and, when necessary, modified regression
analysis to help in the forecasting of future events. Few firms can survive in today’s
environment without a fairly accurate forecast of tomorrow’s sales, expenditures, capi-
tal requirements, and cash flows. Although small or less sophisticated firms may be
able to get by with intuitive forecasts, larger and/or more sophisticated firms have
turned to regression analysis to study the relationships among several variables and to
determine how these variables are likely to affect their future.
Unfortunately, the very notoriety that regression analysis receives for its usefulness
as a tool in predicting the future tends to overshadow an equally important asset: its
14Some authors argue that the “four times” rule is not enough and should be replaced by a “ten times”
criterion.
15This assumes that no other defect is detected in the fit.
ability to help evaluate and control the present. Because a fitted regression equation
provides the researcher with both strength and direction information, management can
evaluate and change current strategies.
Suppose, for example, a manufacturer of jams wants to know where to direct its
marketing efforts when introducing a new flavor. Regression analysis can be used to
help determine the profile of heavy users of jams. For instance, a company might try to
predict the number of flavors of jam a household might have at any one time on the
basis of a number of independent variables describing the household.
Even a superficial reflection on the jam example quickly leads the researcher to
realize that regression analysis has numerous possibilities for use in market segmenta-
tion studies. In fact, many companies use regression to study market segments to deter-
mine which variables seem to have an impact on market share, purchase frequency,
product ownership, and product and brand loyalty as well as on many other areas.
Agricultural scientists use regression analysis to explore the relationship of product
yield (e.g., number of bushels of corn per acre) to fertilizer type and amount, rainfall,
temperature, days of sun, and insect infestation. Modern farms are equipped with mini-
and microcomputers complete with software packages to help them in this process.
Medical researchers use regression analysis to seek links between blood pressure
and independent variables such as age, social class, weight, smoking habits, and race.
Doctors explore the impact of communications, number of contacts, and age of patient
on patient satisfaction with service.
Personnel directors explore the relationship of employee salary levels to geo-
graphic location, unemployment rates, industry growth, union membership, industry
type, and competitive salaries. Financial analysts look for causes of high stock prices by
analyzing dividend yields, earnings per share, stock splits, consumer expectations of
interest rates, savings levels, and inflation rates.
Advertising managers frequently try to study the impact of advertising budgets,
media selection, message copy, advertising frequency, and spokesperson choice on
consumer attitude change. Similarly, marketers attempt to determine sales from adver-
tising expenditures, price levels, competitive marketing expenditures, and consumer
disposable income as well as a wide variety of other variables.
A final example further illustrates the versatility of regression analysis. Real estate
site location analysts have found that regression analysis can be very helpful in pin-
pointing geographic areas of over- and underpenetration of specific types of retail
stores. For instance, a hardware store chain might look for a potential city in which to
locate a new store by developing a regression model designed to predict hardware sales
in any given city. Researchers could concentrate their efforts on those cities where the
model predicted higher sales than actually achieved (as can be determined from many
sources). The hypothesis is that sales of hardware are not up to potential in these cities.
In summary, regression analysis has provided management with a powerful and
versatile tool for studying the relationships between a dependent variable and multiple
independent variables. The goal is to better understand and perhaps control present
events as well as to better predict future events.
Glossary
Dummy variables. Dummy, or indicator, variables are used to determine the relationships
between qualitative independent variables and a dependent variable.

Multicollinearity. Multicollinearity is the situation in which independent variables in a
multiple regression equation are highly intercorrelated. That is, a linear relation exists
between two or more independent variables.

Multiple regression. Multiple regression involves the use of more than one independent
variable to predict a dependent variable.

Overfitting. Overfitting refers to adding independent variables to the regression function
that, to a large extent, account for all the eccentricities of the sample data under analysis.

Partial, or net, regression coefficient. The partial, or net, regression coefficient measures
the average change in the dependent variable per unit change in the relevant independent
variable, holding the other independent variables constant.

Standard error of the estimate. The standard error of the estimate is the standard
deviation of the residuals. It measures the amount the actual values (Y) differ from the
estimated values (Ŷ).

Stepwise regression. Stepwise regression permits predictor variables to enter or leave the
regression function at different stages of its development. An independent variable is
removed from the model if it doesn't continue to make a significant contribution when a
new variable is added.
Key Formulas
Standard error of the estimate

s_y·x's = √(Σ(Y - Ŷ)² / (n - k - 1)) = √(SSE / (n - k - 1)) = √MSE                (4)

Coefficient of determination

R² = SSR/SST = Σ(Ŷ - Ȳ)² / Σ(Y - Ȳ)²
   = 1 - SSE/SST = 1 - Σ(Y - Ŷ)² / Σ(Y - Ȳ)²                (5)
Adjusted coefficient of determination

R̄² = 1 - (1 - R²) ((n - 1) / (n - k - 1))                (8)

t statistic for testing H0: βj = 0

t = bj / s_bj

Forecast of a future value

Ŷ* = b0 + b1X1* + b2X2* + ... + bkXk*                (9)

Leverage of the ith data point (simple linear regression)

hii = 1/n + (Xi - X̄)² / Σ(Xi - X̄)²                (13)

Standardized residual

ei / s_ei = ei / (s_y·x's √(1 - hii))                (14)
Problems
TABLE P–7
Variable Number
Variable Number 1 2 3 4 5 6
1 1.00 .55 .20 -.51 .79 .70
2 1.00 .27 .09 .39 .45
3 1.00 .04 .17 .21
4 1.00 -.44 -.14
5 1.00 .69
6 1.00
TABLE P-8
TABLE P-9
Family    Annual Food Expenditures ($100s) Y    Annual Income ($1,000s) X1    Family Size X2
A 24 11 6
B 8 3 2
C 16 4 1
D 18 7 3
E 24 9 5
F 23 8 4
G 11 5 2
H 15 7 2
I 21 8 3
J 20 7 2
a. Construct the correlation matrix for the three variables in Table P-9. Interpret
the correlations in the matrix.
b. Fit a multiple regression model relating food expenditures to income and family
size. Interpret the partial regression coefficients of income and family size. Do
they make sense?
c. Compute the variance inflation factors (VIFs) for the independent variables. Is
multicollinearity a problem for these data? If so, how might you modify the
regression model?
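One possible route through parts (a)-(c) of Problem 9 is sketched below in Python (pandas and statsmodels are assumed to be available; the data frame and column names are illustrative).

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Table P-9 data
families = pd.DataFrame({
    "food":   [24, 8, 16, 18, 24, 23, 11, 15, 21, 20],  # Y, annual food expenditures ($100s)
    "income": [11, 3, 4, 7, 9, 8, 5, 7, 8, 7],           # X1, annual income ($1,000s)
    "size":   [6, 2, 1, 3, 5, 4, 2, 2, 3, 2],            # X2, family size
})

print(families.corr())                                   # part a: correlation matrix

X = sm.add_constant(families[["income", "size"]])
fit = sm.OLS(families["food"], X).fit()
print(fit.params)                                        # part b: partial regression coefficients

for i, name in enumerate(X.columns[1:], start=1):        # part c: VIF for each independent variable
    print(name, variance_inflation_factor(X.values, i))
```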
10. Beer sales at the Shapiro One-Stop Store are analyzed using temperature
and number of people (age 21 or over) on the street as independent variables.
Source DF SS MS F
Regression 2 11589.035 5794.516 36.11
Residual Error 17 2727.914 160.466
Total 19 14316.949
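The summary measures for this fit follow directly from the Key Formulas and the ANOVA table entries; as a quick check:

$$R^2 = \frac{SSR}{SST} = \frac{11{,}589.035}{14{,}316.949} \approx .81, \qquad
s_{y \cdot x's} = \sqrt{MSE} = \sqrt{160.466} \approx 12.7$$

With k = 2 independent variables and n - k - 1 = 17 residual degrees of freedom, the sample size is n = 20, and F = MSR/MSE = 5,794.516/160.466 ≈ 36.11 matches the value printed in the table.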
TABLE P-11

Miles per Gallon Y   Age of Car (years) X1   Gender (0 = male, 1 = female) X2
22.3                 3                       0
22.0                 4                       1
23.7                 3                       1
24.2                 2                       0
25.5                 1                       1
21.1                 5                       0
20.6                 4                       0
24.0                 1                       0
26.0                 1                       1
23.1                 2                       0
24.8                 2                       1
20.2                 5                       0
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$$
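A sketch of how this model, with the gender dummy X2, might be fit to the Table P-11 data in Python (statsmodels assumed available; names are illustrative):

```python
import pandas as pd
import statsmodels.api as sm

# Table P-11 data: miles per gallon, age of car, and a gender dummy (0 = male, 1 = female)
cars = pd.DataFrame({
    "mpg":    [22.3, 22.0, 23.7, 24.2, 25.5, 21.1, 20.6, 24.0, 26.0, 23.1, 24.8, 20.2],
    "age":    [3, 4, 3, 2, 1, 5, 4, 1, 1, 2, 2, 5],
    "gender": [0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0],
})

fit = sm.OLS(cars["mpg"], sm.add_constant(cars[["age", "gender"]])).fit()
print(fit.summary())

# The coefficient of the gender dummy estimates the average difference in miles per gallon
# between cars driven by females and by males, holding the age of the car constant.
```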
TABLE P-12
TABLE P-13

Region   Personal Income ($ billions)   Region   Personal Income ($ billions)
1        98.5                            7       67.6
2        31.1                            8       19.7
3        34.8                            9       67.9
4        32.7                           10       61.4
5        68.8                           11       85.6
6        94.7
b. Forecast annual sales for region 12 for personal income of $40 billion and the val-
ues for retail outlets and automobiles registered given in part c of Problem 12.
c. Discuss the accuracy of the forecast made in part b.
d. Which independent variables would you include in your final forecast model?
Why?
14. The Nelson Corporation decides to develop a multiple regression equation to
forecast sales performance. A random sample of 14 salespeople is interviewed and
given an aptitude test. Also, an index of effort expended is calculated for each
salesperson on the basis of a ratio of the mileage on his or her company car to the
total mileage projected for adequate coverage of territory. Regression analysis
yields the following results:
$$\hat{Y} = 16.57 + \underset{(.052)}{.65}\,X_1 + \underset{(1.69)}{20.6}\,X_2$$
The quantities in parentheses are the standard errors of the partial regression
coefficients. The standard error of the estimate is 3.56. The standard deviation of
the sales variable is sy = 16.57. The variables are
Y = the sales performance, in thousands
X1 = the aptitude test score
X2 = the effort index
a. Are the partial regression coefficients significantly different from zero at the .01
significance level?
b. Interpret the partial regression coefficient for the effort index.
c. Forecast the sales performance for a salesperson who has an aptitude test score
of 75 and an effort index of .5.
d. Calculate the sum of squared residuals, Σ(Y - Ŷ)².
e. Calculate the total sum of squares, Σ(Y - Ȳ)².
f. Calculate R², and interpret this number in terms of this problem.
g. Calculate the adjusted coefficient of determination, R̄².
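Every part of Problem 14 reduces to a direct substitution into the Key Formulas; one way to organize the arithmetic is sketched below in Python (variable names are illustrative).

```python
# Quantities reported in Problem 14
n, k = 14, 2
b1, se_b1 = 0.65, 0.052           # aptitude coefficient and its standard error
b2, se_b2 = 20.6, 1.69            # effort-index coefficient and its standard error
s_yx, s_y = 3.56, 16.57           # standard error of the estimate; std. dev. of sales

t1, t2 = b1 / se_b1, b2 / se_b2                 # part a: t statistics, df = n - k - 1 = 11
y_hat = 16.57 + b1 * 75 + b2 * 0.5              # part c: forecast for score 75, effort index .5
sse = s_yx ** 2 * (n - k - 1)                   # part d: from s_yx^2 = SSE / (n - k - 1)
sst = s_y ** 2 * (n - 1)                        # part e: from s_y^2 = SST / (n - 1)
r2 = 1 - sse / sst                              # part f: Eq. (5)
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # part g: Eq. (8)
```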
15. We might expect credit card purchases to differ from cash purchases at the same
store. Table P-15 contains daily gross sales and items sold for cash purchases and
daily gross sales and items sold for credit card purchases at the same consignment
store for 25 consecutive days.
a. Make a scatter diagram of daily gross sales, Y, versus items sold for cash pur-
chases, X1. Using a separate plot symbol or color, add daily gross sales and
items sold for credit card purchases, X2 . Visually compare the relationship
between sales and number of items sold for cash with that for credit card
purchases.
TABLE P-15

Day   Cash Sales   Cash Items   Credit Card Sales   Credit Card Items
1 348 55 148 4
2 42 8 111 6
3 61 9 62 7
4 94 16 0 0
5 60 11 39 5
6 165 26 7 1
7 126 27 143 26
8 111 19 27 5
9 26 5 14 2
10 109 18 71 12
11 180 27 116 21
12 212 36 50 9
13 58 10 13 2
14 115 20 105 16
15 15 8 19 3
16 97 15 44 14
17 61 10 0 0
18 85 15 24 3
19 157 24 144 10
20 88 15 63 11
21 96 19 0 0
22 202 33 14 3
23 108 23 0 0
24 158 21 24 4
25 176 43 253 28
TABLE P-16
TABLE P-17

Variable     Price   Area   Elevation   Slope   View
Price        1.00    .59    .66         .68     .88
Area                 1.00   .04         .64     .41
Elevation                   1.00        .13     .76
Slope                                   1.00    .63
View                                            1.00
TABLE P-18
X1 X2 X3 Y X1 X2 X3 Y
87 85 2.7 91 93 60 3.2 54
100 84 3.3 90 92 69 3.1 63
91 82 3.5 83 100 86 3.6 96
85 60 3.7 93 80 87 3.5 89
56 64 2.8 43 100 96 3.8 97
81 48 3.1 75 69 51 2.8 50
77 67 3.1 63 80 75 3.6 74
86 73 3.0 78 74 70 3.1 58
79 90 3.8 98 79 66 2.9 87
96 69 3.7 99 95 83 3.3 57
d. Compute the mean leverage. Are any of the observations high leverage points?
e. Compute the standardized residuals. Identify any observation with a large stan-
dardized residual. Does the fitted model under- or overpredict the response for
these observations?
19. Refer to the data in Table P-18. Find the “best” regression model using the step-
wise regression procedure and the all possible regressions procedure. Compare the
results. Are you confident using a regression model to predict the final exam score
with fewer than the original three independent variables?
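For Problems 19, 20, and 24, a stepwise routine is normally run in a statistics package. The forward-selection sketch below (Python, statsmodels assumed; names are illustrative) conveys the core idea; a full stepwise procedure would also re-test variables already in the model for removal, as noted in the Glossary.

```python
import statsmodels.api as sm

def forward_select(y, X, alpha_enter=0.05):
    """Greedy forward selection: at each step add the candidate predictor with the
    smallest p-value, provided that p-value is below alpha_enter.
    X is a pandas DataFrame of candidate predictors; y is the response Series."""
    selected, remaining = [], list(X.columns)
    while remaining:
        pvalues = {}
        for col in remaining:
            fit = sm.OLS(y, sm.add_constant(X[selected + [col]])).fit()
            pvalues[col] = fit.pvalues[col]
        best = min(pvalues, key=pvalues.get)
        if pvalues[best] > alpha_enter:
            break                                  # no remaining predictor is significant
        selected.append(best)
        remaining.remove(best)
    return selected
```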
20. Recall Example 12. The full data set related to CEO compensation is contained in
Appendix: Data Sets and Databases. Use stepwise regression to select the “best”
model with k = 3 predictor variables. Fit the stepwise model, and interpret the
estimated coefficients. Examine the residuals. Identify and explain any influential
observations. If you had to choose between this model and the k = 2 predictor
model discussed in Example 12, which one would you choose? Why?
21. Table P-21 contains the number of accounts (in thousands) and the assets (in bil-
lions of dollars) for 10 online stock brokerages. Plot the assets versus the number
of accounts. Investigate the possibility that the relationship is curved by running a mul-
tiple regression to forecast assets using the number of accounts and the square of
the number of accounts as independent variables.
a. Give the fitted regression function. Is the regression significant? Explain.
b. Test for the significance of the coefficient of the squared term. Summarize your
conclusion.
c. Rerun the analysis without the quadratic (squared) term. Explain why the coeffi-
cient of the number of accounts is not the same as the one you found for part a.
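One way to set up the curvature check in Problem 21 is sketched below in Python (statsmodels assumed available); the squared term simply enters the model as an additional independent variable.

```python
import pandas as pd
import statsmodels.api as sm

# Table P-21 data
brokers = pd.DataFrame({
    "assets":   [219.0, 21.1, 38.8, 5.5, 160.0, 19.5, 11.2, 5.9, 1.3, 6.8],  # Y, $ billions
    "accounts": [2500, 909, 615, 205, 2300, 428, 590, 134, 130, 125],         # X, thousands
})
brokers["accounts_sq"] = brokers["accounts"] ** 2

quad = sm.OLS(brokers["assets"],
              sm.add_constant(brokers[["accounts", "accounts_sq"]])).fit()    # parts a and b
linear = sm.OLS(brokers["assets"],
                sm.add_constant(brokers[["accounts"]])).fit()                 # part c
print(quad.summary())
print(linear.params)
```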
22. The quality of cheese is determined by tasters whose scores are summarized in a
dependent variable called Taste. The independent (predictor) variables are three
chemicals that are present in the cheese: acetic acid, hydrogen sulfide (H2S), and
lactic acid. The 15 cases in the data set are given in Table P-22. Analyze these data
using multiple regression methods. Be sure to include only significant independent
variables in your final model and interpret R². Include an analysis of the residuals.
23. Refer to Problem 22. Using your final fitted regression function, forecast Taste (qual-
ity) for Acetic = 5.750, H2S = 7.300, and Lactic = 1.85. (All three independent
variable values may not be required.) Although n in this case is small, construct the
TABLE P-21

Assets ($ billions) Y   Number of Accounts (1,000s) X
219.0                   2,500
21.1                    909
38.8                    615
5.5                     205
160.0                   2,300
19.5                    428
11.2                    590
5.9                     134
1.3                     130
6.8                     125
TABLE P-22
Case Taste Y Acetic X1 H2S X2 Lactic X3
1 40.9 6.365 9.588 1.74
2 15.9 4.787 3.912 1.16
3 6.4 5.412 4.700 1.49
4 18.0 5.247 6.174 1.63
5 38.9 5.438 9.064 1.99
6 14.0 4.564 4.949 1.15
7 15.2 5.298 5.220 1.33
8 32.0 5.455 9.242 1.44
9 56.7 5.855 10.199 2.01
10 16.8 5.366 3.664 1.31
11 11.6 6.043 3.219 1.46
12 26.5 6.458 6.962 1.72
13 0.7 5.328 3.912 1.25
14 13.4 5.802 6.685 1.08
15 5.5 6.176 4.787 1.25
large-sample approximate 95% prediction interval for your forecast. Do you feel your
regression analysis has yielded a useful tool for forecasting cheese quality? Explain.
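As a reminder for Problem 23, the large-sample approximate 95% prediction interval is simply the point forecast plus or minus roughly two standard errors of the estimate:

$$\hat{Y}^* \pm 2\, s_{y \cdot x's}$$

With a sample as small as this one, a t multiplier with n - k - 1 degrees of freedom would be somewhat larger than 2, which is why the problem asks you to treat the interval as approximate.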
24. The 1991 accounting numbers for major league baseball are given in Table P-24.
All figures are in millions of dollars. The numerical variables are GtReceit (Gate
Receipts), MediaRev (Media Revenue), StadRev (Stadium Revenue), TotRev
(Total Revenue), PlayerCt (Player Costs), OpExpens (Operating Expenses),
OpIncome (Operating Income = Total Revenue - Operating Expenses), and
FranValu (Franchise Value).
a. Construct the correlation matrix for the variables GtReceit, MediaRev, . . . ,
FranValu. From the correlation matrix, can you determine a variable that is
likely to be a good predictor of FranValu? Discuss.
b. Use stepwise regression to build a model for predicting franchise value using
the remaining variables. Are you surprised at the result? Explain.
c. Can we conclude that, as a general rule, franchise value is about twice total rev-
enue? Discuss.
d. Player costs are likely to be a big component of operating expenses. Develop an
equation for forecasting operating expenses from player costs. Comment on the
strength of the relation. Using the residuals as a guide, identify teams that have
unusually low or unusually high player costs as a component of operating
expenses.
e. Consider the variables other than FranValu. Given their definitions, are there
groups of variables that are multicollinear? If so, identify these sets.
CASES
16The data for this case study were provided by Dorothy Mercer, an Eastern Washington University M.B.A.
student. The analysis was done by M.B.A. students Tak Fu, Ron Hand, Dorothy Mercer, Mary Lou
Redmond, and Harold Wilson.