STAT 654
Regression Analysis Part 3
Multiple Linear Regression
Sharmistha Guha
Dept. of Statistics
Texas A&M University
Spring 2022
80
Multiple Regression
In multiple regression, we have one dependent
variable (response/outcome) Y and more than
one independent variable (predictor/covariate).
Examples
(1) Study involving preschool and elementary
school children.
y = developmental score, which is an
average of scores for many characteris-
tics such as verbal ability, graphical per-
ception, communication ability, etc.
x1 = age of child
x2 = no. of older siblings
81
(2) Medical study
y = systolic blood pressure of an indi-
vidual
x1 = age
x2 = weight
x3 = cholesterol level
Our regression models will have the form:
Y = β0 + β1x1 + β2x2 + · · · + βkxk + ε.
As usual E(ε) = 0 and Var(ε) = σ², which
doesn’t depend on x1, . . . , xk, and ε ∼ N(0, σ²).
The expected value of Y given x1, . . . , xk is:
E(Y ) = β0 + β1x1 + β2x2 + · · · + βk xk .
Again, all the E(·)’s and Var(·)’s above denote
conditional expectation and variance, respec-
tively, given the predictor values x1, . . . , xk .
82
This model is more flexible than it might first
appear, since a given xi might be a function of
other independent variables.
Two distinct situations are possible:
I. k = number of separate independent vari-
ables in the model. In other words, k + 1
measurements are obtained from each ex-
perimental unit, one measurement being y.
II. Number of separate independent variables
is p, which is less than k.
Examples of II:
Suppose p = 2 with independent variables x1
and x2. Two possible models are as follows:
Y = β0 + β1x1 + β2x2 + β3x1x2 + ε    (M1)
Here k = 3 with x3 = x1x2 (also known as the
interaction effect between x1 and x2).
83
Or we could have
Y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + ε,    (M2)
in which case k = 4 with x3 = x1² and x4 = x2²
(i.e. the quadratic effects of x1 and x2).
The function
f (x1, x2) = β0 + β1x1 + β2x2
is a plane, which has no curvature. If one sus-
pects that the surface of averages E(Y |x1, x2)
has curvature, then the models M 1 and M 2
might be more appropriate to use.
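For instance, such models can be fit in R with lm(); a minimal sketch (not from these notes), assuming vectors y, x1 and x2 are already defined:
fit.m1 <- lm(y ~ x1 + x2 + x1:x2)                # M1: adds the interaction term x1*x2
fit.m2 <- lm(y ~ x1 + x2 + I(x1^2) + I(x2^2))    # M2: adds the quadratic terms
summary(fit.m1)
summary(fit.m2)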
84
Obtaining the least squares estimates:
Given the model on pg. 82N, we want to find
least squares estimates of the parameters β0, β1,
. . . , βk . This works in exactly the same way as
in polynomial regression.
Data comes in the form:
(xi1, xi2, . . . , xik , yi), i = 1, . . . , n.
Define
f(b0, b1, . . . , bk) = Σ_{i=1}^{n} [yi − (b0 + b1xi1 + · · · + bkxik)]².
Choose b0, b1, . . . , bk to minimize f .
85
∂f(b0, . . . , bk)/∂bj = 0 if and only if
Σ_{i=1}^{n} [yi − (b0 + b1xi1 + · · · + bkxik)]xij = 0,
for j = 0, 1, . . . , k, and with xi0 = 1 for each i.
As before, we can write the solution to this set
of linear equations in matrix form.
Let X be the n × (k + 1) matrix with (j + 1)st column equal to
[x1j x2j · · · xnj]ᵀ, for each j = 0, 1, . . . , k.
Let b = [b0 b1 · · · bk]ᵀ and y = [y1 y2 · · · yn]ᵀ.
86
The normal equations are:
(XᵀX)b = Xᵀy,
which have the solution:
b = (XᵀX)⁻¹Xᵀy.
The vector β̂ = [β̂0 β̂1 · · · β̂k]ᵀ of least squares
estimates is:
β̂ = (XᵀX)⁻¹Xᵀy.
In an R session suppose that the response and
independent variables are called y and x1, x2,
x3. Then, we could obtain the least squares
estimates using the command:
summary(lm(y ~ x1 + x2 + x3)).
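If you want to check the matrix formula directly, here is a small sketch (assuming y, x1, x2, x3 are numeric vectors of the same length); the manual solution should agree with coef(lm(...)):
X <- cbind(1, x1, x2, x3)               # n x (k+1) design matrix, first column all 1's
b <- solve(t(X) %*% X, t(X) %*% y)      # solves the normal equations (X'X)b = X'y
b
coef(lm(y ~ x1 + x2 + x3))              # should match b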
87
Example 8: Regression analysis of mesquite
tree data (dataset available on the course website)
The data in this example concern the predic-
tion of total production of photosynthetic bio-
mass (leaves) of mesquite trees by using cer-
tain easily measured aspects of the plant as
opposed to actual harvesting of the mesquite.
Data on 20 mesquite trees from the same ge-
ographic location are given on page 91N.
The variables are:
z = leafwt = total weight (in grams) of pho-
tosynthetic material derived from the actual
harvesting of mesquite
u1 = diam1 = canopy diameter (in meters)
measured along the longest axis of the tree
parallel to the ground
88
u2 = diam2 = canopy diameter (in meters)
measured along the shortest axis of the tree
parallel to the ground
u3 = totht = total height (in meters) of the
mesquite tree
u4 = canht = canopy height (in meters) of the
mesquite tree
It’s desired to be able to predict leafwt using
the other measurements.
It’s more natural to consider a multiplicative
model rather than a linear one since leaf weight
should be nearly proportional to canopy vol-
ume, and canopy volume should be nearly pro-
portional to the product of canopy dimensions.
The (multiplicative) model we will consider is:
Z = β0* · u1^β1 · u2^β2 · u3^β3 · u4^β4 · ε′.
89
To analyze the data by means of a linear model,
we can take logarithms, which leads to:
y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + ε, where
β0 = log(β0*), ε = log(ε′),
y = log(z), x1 = log(u1), x2 = log(u2),
x3 = log(u3), x4 = log(u4).
As usual, we’ll assume that ε ∼ N(0, σ²).
We’ll be able to find a prediction interval for
y = log(z) using the linear model above.
If we’re 95% sure that y is between, say, y1
and y2, then we’re 95% sure that z is between
e^y1 and e^y2. (See Supplementary Notes.)
90
Mesquite tree data
diam1 diam2 totht canht leafwt
2.50 2.3 1.70 1.40 723.0
5.20 4.0 3.00 2.50 4052.0
2.00 1.6 1.70 1.40 345.0
1.60 1.6 1.60 1.30 330.9
1.40 1.0 1.40 1.10 163.5
3.20 1.9 1.90 1.50 1160.0
1.90 1.8 1.10 0.80 386.6
2.40 2.4 1.60 1.10 693.5
2.50 1.8 2.00 1.30 674.4
2.10 1.5 1.25 0.85 217.5
2.40 2.2 2.00 1.50 771.3
2.40 1.7 1.30 1.20 341.7
1.90 1.2 1.45 1.15 125.7
2.70 2.5 2.20 1.50 462.5
1.30 1.1 0.70 0.70 64.5
2.90 2.7 1.90 1.90 850.6
2.10 1.0 1.80 1.50 226.0
4.10 3.8 2.00 1.50 1745.1
2.80 2.5 2.20 1.50 908.0
1.27 1.0 0.92 0.62 213.5
91
Scatterplot matrix
[Pairwise scatterplots of the variables ldiam1, ldiam2, ltotht, lcanht and lleafwt; figure not reproduced here.]
The horizontal axis for a column of plots corre-
sponds to the variable named in that column.
Likewise the vertical axis for each row corre-
sponds to the variable named in that row.
92
The least squares estimates found in R are as
follows:
β̂0 = 4.3043 β̂1 = 0.9579
β̂2 = 1.0194 β̂3 = 1.1650
β̂4 = −0.6040
The coefficient of x4 seems to be a little trou-
bling. What does it imply?
Variance estimation (estimation of σ²):
ŷi = β̂0 + β̂1xi1 + β̂2xi2 + · · · + β̂kxik,
SSE = Σ_{i=1}^{n} (yi − ŷi)²,
σ̂² = SSE/(n − k − 1) = MSE.
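In R these quantities can be pulled out of a fitted model; a sketch assuming fit is the lm object:
SSE <- sum(residuals(fit)^2)        # residual sum of squares
MSE <- SSE / df.residual(fit)       # SSE / (n - k - 1)
sqrt(MSE)                           # sigma-hat, same value as summary(fit)$sigma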
93
Coefficient of determination:
SST = Σ_{i=1}^{n} (yi − ȳ)²,
SSR = Σ_{i=1}^{n} (ŷi − ȳ)²,
SST = SSR + SSE.
R² = SSR/SST,
the fraction of total variation in the yi’s explained by the model being considered.
Inference: Confidence intervals and tests for βi
Done exactly as in polynomial regression. The
formulas are exactly the same.
Var(β̂i) = σ² c_{i+1},
with cj the jth diagonal element of (XᵀX)⁻¹.
94
Confidence intervals for β0 + β1x1 + · · · + βk xk
(i.e. the conditional mean of Y ) and prediction
intervals for Y at (x1, . . . , xk ) are also the same.
µ̂(x) = β̂0 + β̂1x1 + · · · + β̂kxk,
xᵀ = [1 x1 x2 · · · xk].
Confidence interval for E(Y ) at X = x:
µ̂(x) ± tn−k−1;α/2 · σ̂ · √(xᵀ(XᵀX)⁻¹x)
Prediction interval for Y at X = x:
µ̂(x) ± tn−k−1;α/2 · σ̂ · √(1 + xᵀ(XᵀX)⁻¹x)
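In practice predict() computes both intervals, but the formulas can also be coded directly. A sketch, assuming fit = lm(y ~ x1 + x2 + x3) and a hypothetical new point x0:
Xmat <- model.matrix(fit)                               # n x (k+1) design matrix
x0   <- c(1, 1.2, 0.5, 2.0)                             # hypothetical point (1, x1, x2, x3)
s    <- summary(fit)$sigma                              # sigma-hat
mu0  <- sum(coef(fit) * x0)                             # mu-hat(x0)
tq   <- qt(0.975, df.residual(fit))                     # t quantile for alpha = 0.05
q0   <- drop(t(x0) %*% solve(t(Xmat) %*% Xmat) %*% x0)  # x0'(X'X)^{-1}x0
mu0 + c(-1, 1) * tq * s * sqrt(q0)                      # 95% CI for E(Y) at x0
mu0 + c(-1, 1) * tq * s * sqrt(1 + q0)                  # 95% PI for Y at x0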
The F -test (or ‘model utility’ test):
An interesting hypothesis to test is:
H0 : β1 = β2 = · · · = βk = 0.
95
If our model is correct, this hypothesis says
that Y is unrelated to x1, . . . , xk.
A test statistic for this hypothesis is:
F = (SSR/k)/MSE = (R²/k) / [(1 − R²)/(n − k − 1)].
If ε1, . . . , εn are normally distributed, then under
H0, F has the so-called F-distribution with de-
grees of freedom k and n − k − 1, i.e. Fk,n−k−1.
Rejection region: Reject H0 and conclude that
at least one βj ≠ 0 (for some j = 1, 2, . . . , k)
if F ≥ Fk,n−k−1;α, where
Fk,n−k−1;α is the (1 − α) quantile of Fk,n−k−1.
Alternatively, you may compute the p-value:
P (F > Fobs) where F ∼ Fk,n−k−1 and Fobs is the
observed value of the test statistic F above.
You can find this value (or get a good approxi-
mation) using the F distribution tables or via R.
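In R, a sketch of the p-value and critical-value computations, assuming Fobs, k and n are already defined:
pf(Fobs, df1 = k, df2 = n - k - 1, lower.tail = FALSE)   # p-value P(F > Fobs)
qf(0.95, df1 = k, df2 = n - k - 1)                       # critical value Fk,n-k-1;0.05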
96
Reduction method of testing:
Suppose, for example, that k = 4 and we want
to test the hypothesis:
H0 : β3 = 0, β4 = 0.
This hypothesis says that x3 and x4 are not
needed in the same model that contains x1
and x2.
This hypothesis would be interesting in a case
where x1 and x2 were known to be related to
Y , but it’s not clear whether x3 and x4 are
related to Y or not.
For example: y = measure of product quality
x1 = experience of production line workers
x2 = purity measure of raw materials
97
The variables x1 and x2 are well-known to af-
fect production quality. How about
x3 = average height of production
line workers, or
x4 = temperature of factory at
production time?
You might be tempted to test H0 : β3 = 0, β4 =
0 by carrying out separate tests of H03 : β3 = 0
and H04 : β4 = 0. But this is not a good idea.
Doing so can lead to incorrect conclusions.
Instead, you should use the reduction method.
Formal details of the reduction method:
Without loss of generality, suppose we want to
test that the last ℓ coefficients are 0 in a model
with k predictors:
H0 : βk−ℓ+1 = βk−ℓ+2 = · · · = βk = 0.
Estimate β0, β1, . . . , βk in the full model con-
taining all k independent variables. Let SSEf
be the SSE for this model.
Estimate β0, β1, . . . , βk−ℓ in the reduced model
that contains only x1, . . . , xk−ℓ. Let SSEr be
the SSE for this model.
The test statistic is:
F = [(SSEr − SSEf)/ℓ] / [SSEf/(n − k − 1)].
Reject H0 if and only if F ≥ Fℓ,n−k−1;α. Can
also compute the p-value accordingly.
The F statistic is necessarily positive. This
is because SSEr must be at least as large as
SSEf . Why?
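In R the reduction test can be carried out by fitting both models and comparing them with anova(). A sketch for the k = 4 example above (H0: β3 = 0, β4 = 0):
fit.full    <- lm(y ~ x1 + x2 + x3 + x4)
fit.reduced <- lm(y ~ x1 + x2)
anova(fit.reduced, fit.full)      # reports the partial F statistic and its p-value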
98
Example 8 (continued): Inference for mesquite
tree data
In an R session, I defined y, x1, x2, x3 and x4
to be the natural logarithms of z, u1, u2, u3
and u4, respectively.
Then I used the command
fit = lm(y ~ x1 + x2 + x3 + x4).
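For completeness, a sketch of the preliminary steps (the file name is hypothetical; the column names are those shown in the data table on page 91):
mesquite <- read.table("mesquite.txt", header = TRUE)    # hypothetical file name
y  <- log(mesquite$leafwt)
x1 <- log(mesquite$diam1); x2 <- log(mesquite$diam2)
x3 <- log(mesquite$totht); x4 <- log(mesquite$canht)
fit <- lm(y ~ x1 + x2 + x3 + x4)
summary(fit)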
Test of model utility
In the last line of the output from summary(fit),
we see that the F -statistic for the test of model
utility is 28.6 with a P-value of 7.307 × 10⁻⁷.
So, we conclude that the model is useful, in the
sense that at least one of the four independent
variables has a nonzero coefficient.
99
Test for a single regression coefficient
Suppose we want to test H0 : β2 = 0, which
says “x2 = log(canopy diameter along shorter
axis) is not needed in the same model with the
other three independent variables.”
According to the output from summary(fit),
the P -value for the test of this hypothesis is
0.0353. Therefore, we reject H0 and conclude
that we should have x2 in the model.
Coefficient of determination
Note that the R² value is pretty high: 0.884.
So, the model is explaining about 88% of the
variation in log(leaf weight). That’s good!
Estimation of mean response and prediction
Consider trees with the following dimensions:
u1 = 2.5 u2 = 2.0 u3 = 1.7 u4 = 0.92
100
Estimate the average log(leaf weight) of such
trees by means of a 95% confidence interval.
Using the commands
X=data.frame(x1=log(2.5),x2=log(2),
x3=log(1.7),x4=log(0.92))
predict(fit,X,interval="conf")
the interval is
(5.98, 7.13).
Guess the leaf weight of a single tree with the
dimensions above. Using
predict(fit,X,interval="predict")
we’re 95% sure that the log(leaf weight) is in
(5.58, 7.53),
and so we’re 95% sure that the leaf weight
is between e^5.58 = 265.1 grams and e^7.53 =
1863.1 grams.
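The back-transformation can also be done directly in R; a small sketch using the data frame X defined above:
pi.log <- predict(fit, X, interval = "prediction")   # interval on the log scale
exp(pi.log[, c("lwr", "upr")])                       # interval endpoints in grams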
101
Testing that a subset of variables is not needed
For these data, it makes sense that some of
the independent variables are highly correlated
with each other.
For example, it wouldn’t be surprising that tall
trees tend to have tall canopies. This in fact
is the case. Look at the third row and fourth
column of the scatterplot matrix.
This suggests that maybe we don’t need all
four independent variables in the model: the
information about leaf weight conveyed by tree
height may be contained in canopy height.
Likewise, we may need only one of the two
canopy diameters. Let’s test the following hy-
pothesis then:
H0 : β1 = 0, β3 = 0.
102
This hypothesis says we don’t need log(diameter
1) and log(total height) in the same model
with log(diameter 2) and log(canopy height).
Use the reduction method to test the hypoth-
esis. The command
anova(fit)
gives us the SSE for the full model:
SSEf = 2.0368.
Now we fit a reduced model that has in it only
the two independent variables x2 (log-diameter
2) and x4 (log-canopy height). The SSE for
this model is
SSEr = 2.8261.
The F-statistic is
F = [(2.8261 − 2.0368)/2] / [2.0368/15] = 2.906.
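A sketch of how these numbers can be obtained in R (using the variables defined earlier):
fit.r <- lm(y ~ x2 + x4)                   # reduced model
SSEf  <- sum(residuals(fit)^2)             # about 2.0368
SSEr  <- sum(residuals(fit.r)^2)           # about 2.8261
((SSEr - SSEf) / 2) / (SSEf / 15)          # F, about 2.906
anova(fit.r, fit)                          # same test in one command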
103
From the F table or R we find F2,15;0.05 =
3.68. So, we would not reject H0. There isn’t
sufficient evidence to conclude that we need x1
and x3 in the model.
Note that R² for the reduced model is 0.84.
This is only a bit smaller than the R² of 0.88
for the full model.
In other words, the reduced model explains al-
most as much of the variance in leaf weight
as does the full model. For reasons of simplic-
ity, we might then want to go with the simpler
model.
To predict leaf weight, we’d only have to make
two measurements from a tree as opposed to
four.
104
Model selection
In multiple regression we’re faced with the prob-
lem of choosing a good subset of independent
variables. This is especially important when
the researcher has measured everything but
the kitchen sink.
There are many tools for selecting a good sub-
set of variables. We’ll talk about three: R²,
AIC and BIC.
Two basic principles:
• If the models being compared all have the
same number of independent variables, just
choose the model with the highest R2 value.
• If the models have different numbers of in-
dependent variables, compare models using
either AIC or BIC. Smaller values of AIC
or BIC indicate better models.
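In R, a sketch for comparing two fitted models fit1 and fit2 (hypothetical names); different AIC formulas can differ by an additive constant, so compare values computed the same way:
AIC(fit1); AIC(fit2)    # smaller AIC is preferred
BIC(fit1); BIC(fit2)    # smaller BIC is preferred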
105
When you don’t have more than 5 or 6 inde-
pendent variables, a reasonable thing to do is
to fit all possible models and pick out the best
candidates using AIC or BIC.
The number of possible models when you have
k variables is
(k choose 0) + (k choose 1) + · · · + (k choose k) = 2^k.
The command leaps in the R package leaps
will fit all possible models and give you the R²
value for each one.
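A sketch of such a search for the mesquite data, assuming the log variables x1, . . . , x4 and y from before:
library(leaps)
Xp  <- cbind(x1, x2, x3, x4)                                     # predictor matrix
out <- leaps(x = Xp, y = y, method = "r2", nbest = 2, names = colnames(Xp))
cbind(out$which, R2 = out$r2)     # which variables each model uses, and its R-squared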
106
The command leaps lists all models with the
same number of variables consecutively, and in
order from largest to smallest R² value.
According to leaps, the best models for the
mesquite data are as follows:
No. vars.   Best model                        R²      AIC     BIC
1           ldiam1                            0.818   26.06   29.04
2           ldiam2, ltotht                    0.865   22.17   26.16
3           ldiam1, ldiam2, ltotht            0.878   22.04   27.02
4           ldiam1, ldiam2, lcanht, ltotht    0.884   23.07   29.05
So, AIC prefers the three-variable model with
ldiam1, ldiam2, ltotht, and BIC prefers the two-
variable model with ldiam2, ltotht.
107
Plotting residuals
As in straight line and polynomial regression,
residuals are valuable tools for helping us de-
termine if the model assumptions are met (at
least approximately).
If the model is correct, the residuals
ei = yi − ŷi, i = 1, . . . , n,
should behave like a random sample from a
normal distribution.
Standardized residuals:
e*i = (yi − ŷi)/σ̂, i = 1, . . . , n.
These should behave like a random sample
from a standard normal distribution.
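A small sketch for computing them from a fitted model (rstandard() gives a refined version that also adjusts for leverage):
e.star <- residuals(fit) / summary(fit)$sigma   # (y_i - yhat_i) / sigma-hat
rstandard(fit)                                  # internally studentized residuals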
108
Methods of plotting:
Plot                                  Purpose
e*i vs. i                             Check independence of errors.
e*i vs. ŷi                            Check constant variance.
e*i vs. an independent variable       Check for a nonlinear relationship.
Normal probability plot or            Check normality.
kernel density estimate
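These plots take only a few lines of R; a sketch assuming fit and e.star as above, using x1 as an example predictor:
plot(e.star, xlab = "Observation index i", ylab = "Standardized residual")
plot(fitted(fit), e.star, xlab = "Fitted value", ylab = "Standardized residual")
plot(x1, e.star, xlab = "x1", ylab = "Standardized residual")
qqnorm(e.star); qqline(e.star)       # normal probability plot
plot(density(e.star))                # kernel density estimate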
Influential observations
Sometimes a single observation
(xi1, xi2, . . . , xik , yi)
can have a big influence on the estimates of
β0, β1, . . . , βk and/or σ 2.
109
Such an observation is called an outlier. There
are two basic kinds of outliers:
1. xi is in the “middle” of x1, . . . , xn but yi is
unusually large or small.
2. xi is far away from the rest of the xj s.
The two kinds of outliers have different effects.
The first kind doesn’t have much effect on the
β̂is, but can inflate σ̂ 2.
The second kind can have a big effect on the
β̂is.
In multiple regression spotting outliers can be
difficult because with more than two indepen-
dent variables we can’t plot y against the vec-
tor of independent variables.
110
A useful diagnostic is Cook’s D. This measures
the distance between β̂ and the estimate of β
calculated after deleting the ith observation
from the data set.
This means there is one Cook’s D value for
each of the n cases in the data set. If a D value
is “big,” then the corresponding observation is
influential, i.e., an outlier.
How large is “large”? Cook’s D values less
than 1 are not usually indicative of an outlier.
Values of D larger than 1.5 suggest the pos-
sibility of an outlier. If D > 1.5, fit the model
without the suspicious case and look at how
much the estimates of the βis change.
111
In R suppose you have put regression output
from lm into an object called fit. To obtain
Cook’s D value for all the data, use the com-
mand
output=cooks.distance(fit).
You could then plot the Cook’s D values as
follows:
plot(output, ylab="Cook's D").
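To flag suspicious cases on that plot, a small sketch using the 1.5 guideline from the previous page:
abline(h = 1.5, lty = 2)    # reference line at the suggested cutoff
which(output > 1.5)         # indices of potentially influential cases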
For more details on Cook’s D values and their
computation, see Supplementary Notes.
112