Statistical Inference in Multiple Linear Regression
8.1 INTRODUCTION
In Unit 6, you have learnt how to test the significance of the individual
regression coefficients, the intercept and the slope, as well as the overall
regression model for simple linear regression. You have also learnt how to
compute the confidence intervals of regression coefficients. In Unit 7, you have
studied the multiple linear regression model used for explaining the
relationship between the response variable and more than one regressor
variable. In this unit, you will learn about the inferential aspect of the
multiple regression model. The concept is the same as discussed in Unit 6 for
simple linear regression. In Sec. 8.2, we describe the properties of the
estimated regression coefficients for the fitted multiple linear regression
model. We discuss testing the significance of the overall fitted multiple
regression model and of the individual regression coefficients in Secs. 8.3
and 8.4, respectively. In Sec. 8.5, we explain how to determine the
(1 – α)100% confidence interval of individual regression coefficients. We
discuss the coefficient of determination and the adjusted coefficient of
determination in Sec. 8.6.
In the next unit, you will learn about the multiple linear regression model when
some of the regressor variables are quantitative and some are qualitative.
Objectives:
After studying the unit, you should be able to:
2. We define the variance-covariance matrix of β̂ = (β̂₀, β̂₁, β̂₂, ..., β̂ₖ)′ as:

V(β̂) = σ²(X′X)⁻¹ = σ²S                                              ... (2)

with i, j = 0, 1, 2, ..., k. Here, σ²S is also known as the variance-covariance
matrix of the estimated regression coefficients. Its dimension is
(k + 1) × (k + 1).

Thus, the variance of β̂ᵢ is given by

V(β̂ᵢ) = σ²Sᵢᵢ;  i = 0, 1, 2, ..., k                                  ... (3)

cov(β̂ᵢ, β̂ⱼ) = σ²Sᵢⱼ;  i ≠ j = 1, 2, ..., k                           ... (5)

Since σ² is usually unknown, we estimate it by the residual mean square:

σ̂² = Σᵢ₌₁ⁿ rᵢ² / (n − k − 1) = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / (n − k − 1)          ... (6)

Thus, we define the estimated variance-covariance matrix of β̂ as:

V̂(β̂) = σ̂²(X′X)⁻¹ = σ̂²S                                             … (7)
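The quantities in equations (2)–(7) can be computed directly. The following is a minimal sketch with made-up illustrative data (not the unit's blood-pressure data): it forms S = (X′X)⁻¹, the residual estimate σ̂² of equation (6), and the estimated variance-covariance matrix of equation (7).

```python
import numpy as np

# Illustrative data (hypothetical, for demonstration only).
y = np.array([120.0, 125.0, 131.0, 140.0, 142.0, 150.0])
x1 = np.array([35.0, 40.0, 45.0, 52.0, 55.0, 60.0])    # e.g. age
x2 = np.array([60.0, 62.0, 66.0, 70.0, 72.0, 75.0])    # e.g. weight

n, k = len(y), 2
X = np.column_stack([np.ones(n), x1, x2])              # first column x0 = 1

S = np.linalg.inv(X.T @ X)                             # S = (X'X)^(-1)
beta_hat = S @ X.T @ y                                 # least-squares estimates
resid = y - X @ beta_hat
sigma2_hat = (resid @ resid) / (n - k - 1)             # equation (6)

V = sigma2_hat * S                                     # equation (7)
se = np.sqrt(np.diag(V))                               # SE(beta_i) = sqrt(sigma2_hat * S_ii)
print(beta_hat, se)
```

The diagonal of V gives the estimated variances of the coefficients; the off-diagonal entries are their covariances, as in equations (3) and (5).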
You may like to study the following examples to learn this concept.
Example 1: For the data of systolic blood pressure considering age and weight
as regressor variables given in Example 3 of Unit 7, determine the variance-
covariance matrix of the regression coefficients. Also compute the standard
errors of regression coefficients.
Solution: In order to compute variance-covariance matrix of the regression
coefficients, we use the following values computed in Example 3 of Unit 7:
σ̂² = 1.82415221 and
Example 2: For the data of systolic blood pressure considering three regressor
variables, age, weight and height, given in Example 4 of Unit 7, obtain the
variance-covariance matrix of the regression coefficients. Also compute
SE(β̂₀), SE(β̂₁), SE(β̂₂) and SE(β̂₃).
Solution: From the solution of Example 4 of Unit 7, we have

V(β̂₀) = 71.6966,  SE(β̂₀) = √71.6966 = 8.4674
V(β̂₁) = 0.0102,  SE(β̂₁) = √0.0102 = 0.1008
(ii) deviation of the observed value (yᵢ) from the predicted value (ŷᵢ).
Mathematically, we define

(yᵢ − ȳ) = (ŷᵢ − ȳ) + (yᵢ − ŷᵢ)                                      ... (9)
After squaring and taking summation on both sides of equation (9), we obtain

Σᵢ₌₁ⁿ (yᵢ − ȳ)² = Σᵢ₌₁ⁿ (ŷᵢ − ȳ)² + Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² + 2 Σᵢ₌₁ⁿ (ŷᵢ − ȳ)(yᵢ − ŷᵢ)

The cross-product term will be zero, i.e., Σᵢ₌₁ⁿ (ŷᵢ − ȳ)(yᵢ − ŷᵢ) = 0.

Therefore, we have

Σᵢ₌₁ⁿ (yᵢ − ȳ)² = Σᵢ₌₁ⁿ (ŷᵢ − ȳ)² + Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²                   ... (10)
In the same way, we can partition the corresponding degrees of freedom (d.f.)
as:
d.f. for Total Sum of Squares = d.f. for Regression Sum of Squares + d.f. for
Residual Sum of Squares
or (n – 1) = k + (n – k –1)
where k represents the number of regressor variables which is used in the
regression model.
Note that if there is only one regressor variable in the regression model, i.e.,
k = 1, the error d.f. will be (n – 2).
Before computing various measures of variations, we define the null and
alternative hypotheses as follows:
Null Hypothesis: H₀: The fitted regression model is not significant, i.e.,
H₀: β₁ = β₂ = ... = βₖ = 0
Alternative Hypothesis: H₁: The fitted regression model is significant, i.e.,
H₁: at least one of the βᵢ (i = 1, 2, ..., k) is not equal to zero.
For applying ANOVA, we first compute these measures of variations as
follows:
• Total Sum of Squares (SST)
It has (n – 1) degrees of freedom. We calculate the overall variability in
the response variable Y and obtain the total sum of squares as:

SST = Σᵢ₌₁ⁿ (yᵢ − ȳ)² = Σᵢ₌₁ⁿ yᵢ² − nȳ²                               ... (12)

where x₀ⱼ = 1; j = 1, 2, ..., n
SSRes = Y′Y − β̂′X′Y                                                 ... (20)

We compute the mean sum of squares as:

Mean Sum of Squares = Sum of Squares / Respective Degrees of Freedom

Therefore, for testing the significance of the regression model, we define the
variance ratio (F) as:

Fcal = MSSReg / MSSRes

For the data of systolic blood pressure with age as the regressor variable,
we have

Σᵢ₌₁ⁿ xᵢ² = 15372,  Σᵢ₌₁ⁿ yᵢxᵢ = 59880,  Σᵢ₌₁ⁿ yᵢ² = 236403

Fcal = MSSReg / MSSRes = 492.7646 / 2.5258 = 195.0927

Source of Variation    d.f.            Sum of Squares    Mean Sum of Squares
Regression             1               492.7646          492.7646
Error                  15 − 2 = 13     32.8354           2.5258
Total                  15 − 1 = 14     525.6
Since Fcal > F(1,13),0.025 = 6.41, we may reject our null hypothesis at 5% level
of significance. We conclude that there may be some dependence of SBP on age.
Hence the fitted regression model is significant. If you compare these results
with the results of Example 6 of Unit 6, you will observe that
t² = (13.9676)² = 195.0927, which is equal to Fcal. So, you can use either the
t-test or the ANOVA method for testing the significance of the simple linear
regression model.
(You can also test the significance of the fitted regression model for
Example 4 using the matrix approach. Give it a try and match the results
obtained using both methods.)
Example 4: Using the data given in Example 1 of Unit 7, test the significance
of the multiple linear regression model at 5% level of significance.
Solution: In this case,
Null Hypothesis: H₀: β₁ = β₂ = 0, and Alternative Hypothesis: H₁: at least one
of β₁ and β₂ is not equal to zero.
From Example 2 of Unit 7, we have

n = 15,  Σᵢ₌₁¹⁵ yᵢ = 1881,  ȳ = 125.4,  Σᵢ₌₁¹⁵ x₁ᵢ = 474,  Σᵢ₌₁¹⁵ x₂ᵢ = 1102,
Σᵢ₌₁¹⁵ yᵢ² = 236403,  Σᵢ₌₁¹⁵ x₁ᵢ² = 15372,  Σᵢ₌₁¹⁵ x₂ᵢ² = 83140

We also have

β̂₀ = 88.1732,  β̂₁ = 0.9266 and β̂₂ = 0.1082
We now compute the total sum of squares using equation (12) as follows:

SST = Σᵢ₌₁ⁿ yᵢ² − nȳ² = 236403 − 15(125.4)² = 236403 − 235877.4 = 525.6

Next, we compute

β̂′X′Y = β̂₀ Σyᵢ + β̂₁ Σyᵢx₁ᵢ + β̂₂ Σyᵢx₂ᵢ
      = 88.1732 × 1881 + 0.9266 × 59880 + 0.1082 × 139075
      = 236381.1102

Using equation (17), we calculate the sum of squares due to regression as:

SSReg = β̂′X′Y − nȳ² = 236381.1102 − 15(125.4)²
      = 236381.1102 − 235877.4 = 503.7102

The sum of squares due to error is computed using equation (18) as:

SSRes = SST − SSReg = 525.6 − 503.7102 = 21.8898
We present all these values in the following ANOVA table.

Table 3: ANOVA Table

Source of Variation   d.f.           Sum of     Mean Sum of   Fcal                    Ftab
                                     Squares    Squares
Regression            3 − 1 = 2      503.7102   251.8551      Fcal = 251.8551/1.8242   F(2,12),0.025 = 5.10
Error                 14 − 2 = 12    21.8898    1.8242        = 138.0669
Total                 15 − 1 = 14    525.6
Since Fcal > 5.10, we may reject H₀ at 5% level of significance. Hence, the
fitted regression model can be considered significant, i.e., both regressor
variables contribute significantly to the model, and we may conclude that Y
has a significant linear relationship with X₁ and X₂.
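The ANOVA arithmetic of Example 4 can be verified with a few lines, using the sums of squares derived above:

```python
# Verifying the ANOVA quantities of Example 4.
n, k = 15, 2                     # 15 observations, 2 regressors (age, weight)
SST = 525.6
SSReg = 503.7102
SSRes = SST - SSReg              # 21.8898

MSSReg = SSReg / k               # mean sum of squares due to regression
MSSRes = SSRes / (n - k - 1)     # mean sum of squares due to error
F_cal = MSSReg / MSSRes          # variance ratio

print(round(F_cal, 4))           # about 138.07; compare with F(2,12),0.025 = 5.10
```

Since the computed F far exceeds the tabulated value 5.10, the conclusion of the example is reproduced.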
Example 5: Using the data given in Example 4 of Unit 7, test the significance
of the multiple linear regression model at 5% level of significance using the
matrix approach.
Solution: We have Null Hypothesis: H₀: β₁ = β₂ = β₃ = 0 and Alternative
Hypothesis: H₁: at least one of β₁, β₂ and β₃ is not equal to zero.
From the solution of Example 4 of Unit 7, we have

n = 15,  ȳ = 125.4,  Y′Y = 236403,

X′Y = [1881, 59880, 139075, 299065]′  and  β̂ = [71.5436, 0.8462, 0.1048, 0.1223]′
We now compute the total sum of squares using equation (13) as:

SST = Y′Y − nȳ² = 236403 − 15(125.4)² = 525.6

and

β̂′X′Y = [71.5436  0.8462  0.1048  0.1223] [1881, 59880, 139075, 299065]′
      = 236387.0531
Using equation (16), we calculate the sum of squares due to regression as:

SSReg = β̂′X′Y − nȳ² = 236387.0531 − 15(125.4)²
      = 236387.0531 − 235877.4 = 509.6531

The sum of squares due to error can be computed using equation (18) as:

SSRes = SST − SSReg = 525.6 − 509.6531 = 15.9469
We present these calculations in the following ANOVA table.

Table 8: ANOVA Table

Source of Variation   d.f.           Sum of     Mean Sum of   Fcal                    Ftab
                                     Squares    Squares
Regression            4 − 1 = 3      509.6531   169.8844      Fcal = 169.8844/1.4497   F(3,11),0.025 = 4.63
Error                 14 − 3 = 11    15.9469    1.4497        = 117.1847
Total                 15 − 1 = 14    525.6
Since Fcal > 4.63, we reject H₀ at 5% level of significance. Hence, the fitted
regression model can be considered significant, i.e., all regressor variables
contribute significantly to the model. We may conclude that the relationship
of Y with X₁, X₂ and X₃ is significant.
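The matrix-form quantities used in Example 5 generalize to any design matrix. A minimal sketch with simulated data (made up here, since the full X matrix of the example is not reproduced in this unit):

```python
import numpy as np

# Matrix-form ANOVA: SST = Y'Y - n*ybar^2, SSReg = beta_hat'X'Y - n*ybar^2.
rng = np.random.default_rng(0)
n, k = 15, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([2.0, 1.0, -0.5, 0.3]) + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # least-squares estimates
ybar = y.mean()

SST = y @ y - n * ybar**2                      # equation (13)
SSReg = beta_hat @ (X.T @ y) - n * ybar**2     # equation (16)
SSRes = SST - SSReg                            # equals the residual sum of squares

F_cal = (SSReg / k) / (SSRes / (n - k - 1))
print(F_cal)
```

A useful check: SST − SSReg computed this way agrees with the residual sum of squares Σ(yᵢ − ŷᵢ)², which is exactly the identity of equation (10).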
You may now like to solve the following exercises to assess your
understanding.
E4) Test the significance of the fitted multiple regression model given in E5
of Unit 7 at 1% level of significance.
E5) For the exercise given in E6 of Unit 7, test the significance of the fitted
multiple regression model at 5% level of significance.
tᵢ = (β̂ᵢ − βᵢ*) / SE(β̂ᵢ) = β̂ᵢ / √(σ̂²Sᵢᵢ)                            ... (25)

where βᵢ* = 0 under the null hypothesis H₀: βᵢ = 0 for testing the significance
of the regression coefficient.
The statistic tᵢ follows the t-distribution with (n – k – 1) degrees of
freedom. We then determine the tabulated t-value, i.e., t₍ₙ₋ₖ₋₁₎,α/2, with
(n – k – 1) degrees of freedom at α level of significance. This value has been
tabulated for various degrees of freedom in Table I given at the end of this
block, as explained in Unit 4 of MST-004.
If |tᵢ| is greater than or equal to the tabulated value, i.e.,
|tᵢ| ≥ t₍ₙ₋ₖ₋₁₎,α/2 for the given degrees of freedom and the level of
significance, we may reject the null hypothesis and conclude that the value of
the ith regression coefficient (βᵢ) is significant. We may also conclude that
the regressor variable Xᵢ (i = 1, 2, ..., k) is contributing significantly to
the model.
In the following examples, we explain the procedure of testing the significance
of individual regression coefficients.
Example 6: For the data of systolic blood pressure considering age and weight
as regressor variables given in Example 3 of Unit 7, test whether the regression
coefficients (i) β0, (ii) β1 and (iii) β2 are significant or not at 5% level of
significance.
Solution: In order to test the significance of the regression coefficients, we
use the following values computed in Example 3 of Unit 7:

β̂₀ = 88.1732,  β̂₁ = 0.9266 and β̂₂ = 0.1082

(i) t₀ = β̂₀ / SE(β̂₀) = 88.1732 / 2.3091 = 38.1851

The tabulated value is ttab = t₍ₙ₋ₖ₋₁₎,α/2 = t₁₂,0.025 = 2.179

(ii) t₁ = β̂₁ / SE(β̂₁) = 0.9266 / 0.1039 = 8.9168
SE(β̂₀) = 8.4674,  SE(β̂₁) = 0.1008,  SE(β̂₂) = 0.0394 and SE(β̂₃) = 0.0604

t₁ = β̂₁ / SE(β̂₁) = 0.8462 / 0.1008 = 8.3965
Since |t₃| < 2.201, we may not reject the null hypothesis H₀: β₃ = 0 at 5%
level of significance, and we conclude that height (X₃) is an insignificant
variable in the model.
Thus, we may infer that only two regressor variables, age and weight,
contribute significantly to the given multiple regression model.
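As a quick check, the t-statistics of Example 6 can be recomputed from the reported estimates and standard errors (small differences in the last decimal place, such as 8.9182 versus the 8.9168 above, come only from rounding of the reported values):

```python
# t_i = beta_hat_i / SE(beta_hat_i), as in equation (25),
# using the rounded estimates of Example 6.
beta_hat = [88.1732, 0.9266, 0.1082]
se = [2.3091, 0.1039, 0.0442]
t_tab = 2.179                              # t_(12), 0.025

t = [b / s for b, s in zip(beta_hat, se)]
for i, t_i in enumerate(t):
    # reject H0: beta_i = 0 when |t_i| >= t_tab
    print(i, round(t_i, 4), abs(t_i) >= t_tab)
```

All three statistics exceed 2.179 in absolute value, matching the example's conclusion that the intercept, age and weight are significant.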
You may now like to solve the following exercises to check your
understanding.
8.5 CONFIDENCE INTERVAL OF REGRESSION COEFFICIENTS
In Unit 6, we have explained how to determine the confidence interval for
simple linear regression with the desired confidence level. We expect that
(1 – α)100% of such confidence intervals will include the true value of the
regression coefficient. In the same way, we define the (1 – α)100% lower and
upper confidence limits for the jth (j = 0, 1, 2, ..., k) regression
coefficient in the multiple linear regression model when σ is unknown, and

((β̂ⱼ)L, (β̂ⱼ)U) is called the (1 − α)100% confidence interval for the jth
regression coefficient.
You may like to solve the following examples for practice.
Example 8: Obtain the 95% confidence intervals for (i) β₀, (ii) β₁ and (iii) β₂
for Example 4 given in Sec. 8.3.
Solution: From the solution of Example 4, we have

β̂₀ = 88.1732,  β̂₁ = 0.9266 and β̂₂ = 0.1082
SE(β̂₀) = 2.3091,  SE(β̂₁) = 0.1039 and SE(β̂₂) = 0.0442

From Table I given at the end of this block, we have
t₍ₙ₋ₖ₋₁₎,α/2 = t₁₂,0.025 = 2.179
From equations (26) and (27), the lower and upper confidence limits of β̂ⱼ;
j = 0, 1, 2, ..., k can be determined as:

β̂ⱼ ∓ t₍ₙ₋ₖ₋₁₎,α/2 SE(β̂ⱼ);  j = 0, 1, 2, ..., k

(i) For j = 0, we obtain the lower and upper confidence limits of β₀ as:

β̂₀L = β̂₀ − t₍ₙ₋ₖ₋₁₎,α/2 SE(β̂₀) = 88.1732 − 2.179 × 2.3091
    = 88.1732 − 5.0315 = 83.1417
β̂₀U = β̂₀ + t₍ₙ₋ₖ₋₁₎,α/2 SE(β̂₀) = 88.1732 + 2.179 × 2.3091
    = 88.1732 + 5.0315 = 93.2047

(ii) For j = 1, the lower and upper confidence limits of β̂₁ can be determined
as:

β̂₁L = 0.9266 − 2.179 × 0.1039 = 0.9266 − 0.2264 = 0.7002
β̂₁U = 0.9266 + 0.2264 = 1.1530

(iii) For j = 2, the lower and upper confidence limits of β̂₂ can be determined
as:

β̂₂L = 0.1082 − 2.179 × 0.0442 = 0.1082 − 0.0963 = 0.0119
β̂₂U = 0.1082 + 0.0962 = 0.2044

Note that we can also test the significance of an individual regression
coefficient with the help of its confidence interval. If the (1 − α)100%
confidence interval contains the value of the respective regression
coefficient under the null hypothesis, we do not reject the null hypothesis;
otherwise, we may reject it at α level of significance.
Thus, the 95% confidence intervals for β₀, β₁ and β₂ are (83.1417, 93.2047),
(0.7002, 1.1530) and (0.0119, 0.2044), respectively. You have learnt in
MST-004 that if we draw many samples from the same population and compute a
95% confidence interval from each sample, we expect about 95% of these
intervals to contain the true value of the regression coefficient. In other
words, if we select 100 different samples from the same population and compute
a 95% confidence interval for each sample, about 95 of these confidence
intervals would contain the true value of the regression coefficients.
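The interval arithmetic of Example 8 can likewise be reproduced in a few lines (endpoints may differ from the text in the fourth decimal place because the reported standard errors are rounded):

```python
# 95% confidence limits beta_hat_j -/+ t * SE(beta_hat_j) for Example 8.
beta_hat = [88.1732, 0.9266, 0.1082]
se = [2.3091, 0.1039, 0.0442]
t_tab = 2.179                              # t_(12), 0.025

intervals = [(b - t_tab * s, b + t_tab * s) for b, s in zip(beta_hat, se)]
for j, (low, high) in enumerate(intervals):
    print(j, round(low, 4), round(high, 4))
```

None of the three intervals contains zero, which agrees with the t-test conclusions of Sec. 8.4.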
Example 9: For Example 5 given in Sec. 8.3, obtain the 95% confidence
intervals for (i) β₀, (ii) β₁, (iii) β₂ and (iv) β₃.
Solution: From the solution of Example 5, we have

β̂₀ = 71.5436,  β̂₁ = 0.8462,  β̂₂ = 0.1048 and β̂₃ = 0.1223
SE(β̂₀) = 8.4674,  SE(β̂₁) = 0.1008,  SE(β̂₂) = 0.0394 and SE(β̂₃) = 0.0604

Here t₍ₙ₋ₖ₋₁₎,α/2 = t₁₁,0.025 = 2.201. From equations (26) and (27), the lower
and upper confidence limits of β̂ⱼ; j = 0, 1, 2, ..., k can be determined as:

β̂ⱼ ∓ t₍ₙ₋ₖ₋₁₎,α/2 SE(β̂ⱼ);  j = 0, 1, 2, ..., k

(i) For j = 0, we obtain the lower and upper confidence limits of β₀ as:

β̂₀L = 71.5436 − 2.201 × 8.4674 = 71.5436 − 18.6367 = 52.9069
β̂₀U = 71.5436 + 18.6367 = 90.1803

(ii) For j = 1, the lower and upper confidence limits of β̂₁ can be computed
as:

β̂₁L = 0.8462 − 2.201 × 0.1008 = 0.8462 − 0.2219 = 0.6243
β̂₁U = 0.8462 + 0.2219 = 1.0681

(iii) For j = 2, the lower and upper confidence limits of β̂₂ can be
determined as:

β̂₂L = 0.1048 − 2.201 × 0.0394 = 0.1048 − 0.0867 = 0.0181
β̂₂U = 0.1048 + 0.0867 = 0.1915

(iv) For j = 3, the lower and upper confidence limits of β̂₃ can be obtained
as:

β̂₃L = 0.1223 − 2.201 × 0.0604 = 0.1223 − 0.1329 = −0.0106
β̂₃U = 0.1223 + 0.1329 = 0.2552
The coefficient of determination is defined as:

R² = Variation explained by the regression model / Total variation in Y
   = Σᵢ₌₁ⁿ (ŷᵢ − ȳ)² / Σᵢ₌₁ⁿ (yᵢ − ȳ)²                                ... (28)

or, equivalently,

R² = 1 − Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / Σᵢ₌₁ⁿ (yᵢ − ȳ)²                            ... (29)

We can also compute the value of R² using the values calculated in the ANOVA
table discussed in Sec. 8.3:

R² = Regression sum of squares / Total sum of squares = SSReg / SST    ... (30)

or R² = 1 − SSRes / SST                                               ... (31)

We can rewrite R² in matrix notation using equations (13) and (16) as:

R² = (β̂′X′Y − nȳ²) / (Y′Y − nȳ²)                                      ... (32)

We can also determine the value of adjusted R² using the following formula:

R²Adj = 1 − (n − 1)(1 − R²) / (n − k − 1)                              ... (34)
Example 10: For Example 4 given in Sec. 8.3, determine the values of R² and
R²Adj.
Solution: From the solution of Example 4, we have n = 15, k = 2, SST = 525.6
and SSRes = 21.8898. From equation (31),

R² = 1 − SSRes/SST = 1 − 21.8898/525.6 = 1 − 0.0416 = 0.9584

The adjusted coefficient of determination is

R²Adj = 1 − (SSRes/(n − k − 1)) / (SST/(n − 1)) = 1 − (21.8898/12)/(525.6/14)
      = 1 − 1.8242/37.5429 = 1 − 0.0486 = 0.9514

Hence, the model explains approximately 95.14% of the variation in the
response variable due to age and weight. Note that R²Adj is more reliable
than R².
Example 11: For Example 5 given in Sec. 8.3, determine the values of R² and
R²Adj.
Solution: From the solution of Example 5, we have
n = 15, k = 3, SST = 525.6 and SSRes = 15.9469
From equation (31), we calculate the coefficient of determination as:

R² = 1 − SSRes/SST = 1 − 15.9469/525.6 = 0.9697

The fitted multiple regression model explains approximately 96.97% of the
variation in Y due to variations in X₁, X₂ and X₃, so it can be considered a
good model. We compute the value of the adjusted coefficient of determination
using equation (33) as:

R²Adj = 1 − (SSRes/(n − k − 1)) / (SST/(n − 1))
      = 1 − 1.4497/37.5429 = 1 − 0.0386 = 0.9614
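Both coefficients for Example 11 can be checked directly from SST and SSRes:

```python
# R^2 and adjusted R^2 for Example 11, via equations (31) and (33).
n, k = 15, 3
SST, SSRes = 525.6, 15.9469

R2 = 1 - SSRes / SST
R2_adj = 1 - (SSRes / (n - k - 1)) / (SST / (n - 1))

print(round(R2, 4), round(R2_adj, 4))      # 0.9697 0.9614
```

As always, R²Adj is slightly smaller than R² because it penalizes the extra regressors.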
You can try the following exercises in order to check your understanding.
E12) Differentiate between coefficient of determination and adjusted
coefficient of determination.
E13) For the exercise given in E4, calculate the coefficient of determination
and adjusted coefficient of determination.
E14) Obtain R2 and R 2Adj for the exercise given in E5.
8.7 SUMMARY
1. The variance of the ith estimated regression coefficient β̂ᵢ is given by
   V(β̂ᵢ) = σ²Sᵢᵢ;  i = 0, 1, 2, ..., k
2. The standard error of the ith estimated regression coefficient β̂ᵢ is
   defined as:
   SE(β̂ᵢ) = √(σ̂²Sᵢᵢ)
3. The covariance between the ith and jth estimated regression coefficients
   β̂ᵢ and β̂ⱼ is given by
   cov(β̂ᵢ, β̂ⱼ) = σ²Sᵢⱼ;  i ≠ j = 1, 2, ..., k
4. The significance of the fitted multiple regression model can be tested by
   the analysis of variance (ANOVA) technique.
5. To calculate the overall variability of the response variable Y, we obtain
   the total sum of squares as:
   SST = Σᵢ₌₁ⁿ (yᵢ − ȳ)² = Σᵢ₌₁ⁿ yᵢ² − nȳ²   (It has (n – 1) degrees of freedom.)
8. For testing the significance of the regression model, we define the
   variance ratio (F) as:
   Fcal = MSSReg / MSSRes = (SSReg/k) / (SSRes/(n − k − 1))
   The (1 − α)100% lower and upper confidence limits of the jth regression
   coefficient are
   (β̂ⱼ)L = β̂ⱼ − t₍ₙ₋ₖ₋₁₎,α/2 SE(β̂ⱼ)  and  (β̂ⱼ)U = β̂ⱼ + t₍ₙ₋ₖ₋₁₎,α/2 SE(β̂ⱼ)
   The coefficient of determination is
   R² = Σᵢ₌₁ⁿ (ŷᵢ − ȳ)² / Σᵢ₌₁ⁿ (yᵢ − ȳ)² = SSReg/SST = 1 − SSRes/SST
σ̂² = 281.3269
Now, the variance-covariance matrix is obtained as in equation (7). Thus, we
obtain

V(β̂₀) = 515.7464,  SE(β̂₀) = √515.7464 = 22.7101

V(β̂) = 0.0746 ×
[  89.4681   −0.4103   −0.1805   −0.4528  ]
[ −0.4103     0.0146   −0.0147    0.00002 ]
[ −0.1805    −0.0147    0.0288    0.0028  ]
[ −0.4528     0.00002   0.0028    0.0027  ]

V(β̂₀) = 6.6698,  SE(β̂₀) = √6.6698 = 2.5826

X′Y = [5895, 215275, 3332.5]′

β̂′X′Y = [477.32692576  −2.07953754  −12.95445048] [5895, 215275, 3332.5]′
       = 2322999.0772
We calculate the sum of squares due to regression as in equation (16).

X′Y = [36, 1390.81, 374.05, 5710.3]′  and  β̂ = [−2.6916, 0.1266, 0.0317, 0.0036]′

We now compute the total sum of squares as:

SST = Y′Y − nȳ² = 111.88 − 12(3)² = 3.88

β̂′X′Y = 111.2836
Since Fcal > Ftab at 5% level of significance, we reject H0 . Hence, the model
can be considered as significant.
E6) Refer to Sec. 8.4.
E7) We have computed the following values in E2:

SE(β̂₀) = 22.7101,  SE(β̂₁) = 0.7595 and SE(β̂₂) = 63.5936

t₀ = β̂₀ / SE(β̂₀) = 477.3269 / 22.7101 = 21.0183
t₁ = β̂₁ / SE(β̂₁) = −2.0795 / 0.7595 = −2.7380
t₂ = β̂₂ / SE(β̂₂) = −12.9545 / 63.5936 = −0.2037

Since |t₁| < 3.055 and |t₂| < 3.055, we may not reject the null hypotheses
H₀: β₁ = 0 and H₀: β₂ = 0 at 1% level of significance. So we may infer that
both regressor variables are not playing important roles in constructing the
model.
E8) The following values are computed in E3:

SE(β̂₀) = 2.5826,  SE(β̂₁) = 0.0330,  SE(β̂₂) = 0.0464 and SE(β̂₃) = 0.0141

t₃ = β̂₃ / SE(β̂₃) = 0.0036 / 0.0141 = 0.2522
Since only |t₁| > 2.306, we may reject the null hypothesis H₀: β₁ = 0 at 5%
level of significance and conclude that X₁ is an important variable in the
model.
The values of |t₀|, |t₂| and |t₃| are not greater than 2.306. So, we do not
reject their respective null hypotheses at 5% level of significance. Hence, we
may conclude that β₀, β₂ and β₃ are not playing important roles in the model.
E9) Refer to Sec. 8.5.
E10) From the solution of E7, we have

β̂₀ = 477.3269,  β̂₁ = −2.0795 and β̂₂ = −12.9545
SE(β̂₀) = 22.7101,  SE(β̂₁) = 0.7595 and SE(β̂₂) = 63.5936

with t₍ₙ₋ₖ₋₁₎,α/2 = 3.055. The lower and upper confidence limits of β̂₀ are:

β̂₀L = β̂₀ − t₍ₙ₋ₖ₋₁₎,α/2 SE(β̂₀) = 477.3269 − 3.055 × 22.7101 = 407.9475
β̂₀U = β̂₀ + t₍ₙ₋ₖ₋₁₎,α/2 SE(β̂₀) = 477.3269 + 3.055 × 22.7101 = 546.7063

The lower and upper confidence limits of β̂₁ can be computed as:

β̂₁L = β̂₁ − t₍ₙ₋ₖ₋₁₎,α/2 SE(β̂₁) = −2.0795 − 3.055 × 0.7595 = −4.3998
β̂₁U = β̂₁ + t₍ₙ₋ₖ₋₁₎,α/2 SE(β̂₁) = −2.0795 + 3.055 × 0.7595 = 0.2407

The lower and upper confidence limits of β̂₂ can be computed as:

β̂₂L = β̂₂ − t₍ₙ₋ₖ₋₁₎,α/2 SE(β̂₂) = −12.9545 − 3.055 × 63.5936 = −207.2329
β̂₂U = β̂₂ + t₍ₙ₋ₖ₋₁₎,α/2 SE(β̂₂) = −12.9545 + 3.055 × 63.5936 = 181.3240
SE(β̂₀) = 2.58259858,  SE(β̂₁) = 0.0330,  SE(β̂₂) = 0.0464
and SE(β̂₃) = 0.0141

The lower and upper confidence limits of β̂ⱼ for j = 0, 1, 2, 3 can be
obtained as:

β̂ⱼL = β̂ⱼ − t₍ₙ₋ₖ₋₁₎,α/2 SE(β̂ⱼ)
β̂ⱼU = β̂ⱼ + t₍ₙ₋ₖ₋₁₎,α/2 SE(β̂ⱼ)
R² = 1 − SSRes/SST = 1 − 0.5964/3.88 = 0.8463

Further, R²Adj = 1 − (SSRes/(n − k − 1)) / (SST/(n − 1))
             = 1 − (0.5964/8) / (3.88/11) = 0.7887

Hence, the model explains 84.63% and 78.87% of the variation in Y according
to the coefficient of determination and the adjusted coefficient of
determination, respectively.