
STATS 330: Lecture 6: Inference For The Multiple Regression Model

1) The lecture covered inference for the multiple regression model, including estimating the residual variance, calculating confidence intervals for coefficients, hypothesis testing of coefficients, and testing whether subsets of variables are required in the model.
2) As an example using cherry tree data, it was shown how to calculate confidence intervals and p-values for coefficients to assess their significance.
3) Testing the overall significance of the regression using the F-statistic and p-value was also demonstrated with this example.

STATS 330: Lecture 6

Inference for the Multiple Regression Model

30.07.2015
Housekeeping

- Contact details

  Name            Office    Email (@auckland.ac.nz)  Office hours
  Steffen Klaere  303.219   s.klaere                 9:30-10:30, Thu+Fri
  Arden Miller    303.229C  a.miller                 Wed 9-10, Thu 12-13

- Class representatives

  Name              Course  Username (@aucklanduni.ac.nz)
  Jessica Courtney  330     jcou608
  Monica Hill       330     mhil084
  Ben Wilson        762     bwil003
Tutor Office Hours

- Blake Seers
  - Tuesday, 9-11
  - Thursday, 10-11
  - Friday, 14-15
- Hongbin Guo
  - Monday, 9-11
  - Tuesday, 15-16
  - Wednesday, 14-15
- Stage 3 Assistance Room: 303S.294
Assignments

- Assignment 1 is due August 10.

- It focusses on data cleaning and exploratory analysis.

- Submit it to the Student Resource Centre by 2pm.

- Use the cover page provided on the webpage.


Estimate of residual variance σ²

- Recall that σ² controls the scatter of the observations about the regression plane:

  - The bigger σ², the more scatter;
  - The smaller σ², the bigger R².

- σ² is estimated by

      s² = RSS / (n - k - 1).

- s is also known as the residual standard error.

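The formula for s² can be checked directly; a minimal sketch, assuming R's built-in trees data stands in for cherry.df (its diameter and height columns are named Girth and Height):

```r
# Compute s^2 = RSS / (n - k - 1) by hand and compare with the residual
# standard error reported by summary().  'trees' is an assumed stand-in
# for the cherry data used in the lecture.
fit <- lm(Volume ~ Girth + Height, data = trees)
rss <- sum(resid(fit)^2)        # residual sum of squares
n <- nrow(trees)                # 31 observations
k <- 2                          # 2 covariates
s2 <- rss / (n - k - 1)         # estimate of sigma^2
sqrt(s2)                        # residual standard error, about 3.882
summary(fit)$sigma              # the same value, as reported by summary()
```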

Calculations for cherry trees

cherry.lm <- lm(volume~diameter+height,data=cherry.df)


summary(cherry.lm)

- The lm function produces an lm object that contains all the information from fitting the regression.

- lm stands for "linear model".

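The pieces of the fitted model can be pulled out of the lm object with R's extractor functions; a sketch, again assuming the built-in trees data stands in for cherry.df:

```r
# Extractor functions on an lm object: coefficients, fitted values,
# residuals, and residual degrees of freedom.
cherry.lm <- lm(Volume ~ Girth + Height, data = trees)
coef(cherry.lm)          # the estimated coefficients
head(fitted(cherry.lm))  # fitted values for the first six trees
head(resid(cherry.lm))   # residuals for the first six trees
df.residual(cherry.lm)   # residual degrees of freedom: n - k - 1 = 28
```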

Calculations for cherry trees

Call:
lm(formula = volume ~ diameter + height, data = cherry.df)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
diameter 4.7082 0.2643 17.816 < 2e-16 ***
height 0.3393 0.1302 2.607 0.0145 *
---

Residual standard error: 3.882 on 28 degrees of freedom


Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
Calculations for cherry trees

Hence, we get the model

    V = 0.3393 h + 4.7082 d - 57.9877.

Is this the true model?


Inference for the regression model

Aim of today's lecture

- To discuss how we assess the significance of variables in the regression.

- Key concepts:

  - Standard errors
  - Confidence intervals for the coefficients
  - Tests of significance
Variability of the regression coefficients

- Imagine that we keep the x's fixed, but resample the errors and refit the plane. How much would the plane (the estimated coefficients) change?

- This gives us an idea of the variability (accuracy) of the estimated coefficients as estimates of the coefficients of the true regression plane.

[Figure: data points scattered about a fitted regression plane for Y against X1 and X2]
Variability of the regression coefficients

- Variability depends on

  - the arrangement of the x's (the more correlation, the more change);
  - the error variance (the more scatter about the true plane, the more the fitted plane changes).

- We measure variability by the standard error of the coefficients.
Example: Cherries

Call:
lm(formula = volume ~ diameter + height)

Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
diameter 56.4979 3.1712 17.816 < 2e-16 ***
height 0.3393 0.1302 2.607 0.0145 *
---

Residual standard error: 3.882 on 28 degrees of freedom


Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Confidence intervals

    CI: estimated coefficient ± t × standard error

t : the 97.5% point of the t distribution with df degrees of freedom.

df : n - k - 1.

n : the number of observations.

k : the number of covariates (assuming we have a constant term).
Confidence intervals
Example: Cherries

Use the stats function confint:

> confint(cherry.lm)
2.5 % 97.5 %
(Intercept) -75.68226247 -40.2930554
diameter 50.00206788 62.9937842
height 0.07264863 0.6058538
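The same intervals can be computed by hand from the formula; a sketch assuming the built-in trees data stands in for cherry.df (the diameter numbers may differ from the output above, which appears to use different units):

```r
# Reproduce confint() by hand: estimate ± t * standard error,
# with t the 97.5% point of the t distribution on n - k - 1 df.
cherry.lm <- lm(Volume ~ Girth + Height, data = trees)
tab <- coef(summary(cherry.lm))                   # estimates and std. errors
tcrit <- qt(0.975, df = df.residual(cherry.lm))   # 97.5% point, df = 28
ci <- cbind(lower = tab[, "Estimate"] - tcrit * tab[, "Std. Error"],
            upper = tab[, "Estimate"] + tcrit * tab[, "Std. Error"])
ci                                                # agrees with confint(cherry.lm)
```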
Hypothesis test

- Often we ask: do we need a particular variable, given that the others are in the model?

- Note that this is not the same as asking: is a particular variable related to the response?

- We can test the former by examining the ratio of the coefficient to its standard error.
Hypothesis test

- This ratio is the t-statistic t.

- The bigger |t|, the more we need the variable.

- Equivalently, the smaller the p-value, the more we need the variable.
Example: Cherries

Call:
lm(formula = volume ~ diameter + height)

Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
diameter 56.4979 3.1712 17.816 < 2e-16 ***
height 0.3393 0.1302 2.607 0.0145 *
---

Residual standard error: 3.882 on 28 degrees of freedom


Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Recall: p-value
[Figure: density of the t distribution with df = 28; the shaded area outside ±2.607 is the p-value, 0.0145.]
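The two-sided p-value in the figure can be reproduced from the t distribution; a quick check:

```r
# Two-sided p-value for height's t statistic (2.607 on 28 df):
# twice the tail area beyond |t|.
tval <- 2.607
p <- 2 * pt(-abs(tval), df = 28)   # about 0.0145, matching the summary output
p
```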
Other hypotheses

- Overall significance of the regression: do none of the variables have a relationship with the response?

- Use the F statistic: the bigger F, the more evidence that at least one variable has a relationship.

- Equivalently, the smaller the p-value, the more evidence that at least one variable has a relationship.
Example: Cherries

Call:
lm(formula = volume ~ diameter + height)

Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
diameter 56.4979 3.1712 17.816 < 2e-16 ***
height 0.3393 0.1302 2.607 0.0145 *
---

Residual standard error: 3.882 on 28 degrees of freedom


Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Testing if a subset is required

- Often we want to test whether a subset of variables is unnecessary.

- Terminology:

  Full model: the model containing all variables.

  Submodel: the model with a set of variables removed.

- The test is based on comparing the RSS of the submodel with the RSS of the full model. The full model's RSS is always smaller (why?).
Testing if a subset is required

- If the full model's RSS is not much smaller than the submodel's RSS, the submodel is adequate: we do not need the extra variables.

- To do the test, we

  - fit both models and get the RSS for each;
  - calculate the test statistic.

- If the test statistic is large (equivalently, if the p-value is small), the submodel is not adequate.
Testing if a subset is required

- The test statistic is

      F = (RSS_sub - RSS_full) / (s² × (df_sub - df_full))

- df_sub - df_full is the number of variables dropped.

- s² is the estimate of σ² from the full model (the residual mean square).

- R has a function anova to do the calculation.


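The hand calculation can be sketched in R; assuming the built-in trees data stands in for the cherry data, and dropping height from the full model:

```r
# Compute the subset F statistic by hand and check it against anova().
full <- lm(Volume ~ Girth + Height, data = trees)   # full model
sub  <- lm(Volume ~ Girth, data = trees)            # submodel: height dropped
rss_full <- sum(resid(full)^2)
rss_sub  <- sum(resid(sub)^2)
s2 <- rss_full / df.residual(full)                  # full-model estimate of sigma^2
Fstat <- (rss_sub - rss_full) /
         (s2 * (df.residual(sub) - df.residual(full)))
Fstat                  # with one variable dropped, this equals height's t^2
anova(sub, full)       # same F statistic and p-value
```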
p-values

- If the submodel is adequate, the test statistic has an F-distribution with df_sub - df_full and n - k - 1 degrees of freedom.

- We assess whether the value of F calculated from the sample is a plausible value from this distribution by means of a p-value.

- If the p-value is too small, we have evidence against the hypothesis that the submodel is adequate.
p-values
[Figure: density of the F distribution with 2 and 16 degrees of freedom; the p-value is the shaded area to the right of the observed F value.]
Example: Free fatty acid data

- We use physical measures to model a biochemical parameter in overweight children.

- The variables are

  FFA: free fatty acid level in blood (response variable)

  Age: months

  Weight: pounds

  Skinfold thickness: inches

Analysis

Call:
lm(formula = ffa ~ age + weight + skinfold, data = fatty.df)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.95777 1.40138 2.824 0.01222 *
age -0.01912 0.01275 -1.499 0.15323
weight -0.02007 0.00613 -3.274 0.00478 **
skinfold -0.07788 0.31377 -0.248 0.80714

This suggests that

- age is not required if weight and skinfold are retained;

- skinfold is not required if weight and age are retained.

- Can we get away with just weight?


Analysis

> model.sub <- lm(ffa~weight,data=fatty.df)


> anova(model.sub,model.full)
Analysis of Variance Table

Model 1: ffa ~ weight


Model 2: ffa ~ age + weight + skinfold
Res.Df RSS Df Sum of Sq F Pr(>F)
1 18 0.91007
2 16 0.79113 2 0.11895 1.2028 0.3261

- The small F and large p-value suggest weight alone is adequate.

- But the test should be interpreted with caution: could there be confounding?
Confounding?
- Confounding: a non-causal relation due to a missing variable.

- The effect can be checked by comparing the coefficients in the full model and the submodel (if available).
> summary(model.full)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.95777 1.40138 2.824 0.01222 *
age -0.01912 0.01275 -1.499 0.15323
weight -0.02007 0.00613 -3.274 0.00478 **
skinfold -0.07788 0.31377 -0.248 0.80714

> summary(model.sub)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.01651 0.37578 5.366 4.23e-05 ***
weight -0.02162 0.00608 -3.555 0.00226 **
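For contrast, here is a small simulated sketch (hypothetical data, not the fatty acid study) of what a confounded coefficient shift looks like: x2 has no direct effect on y, yet its coefficient jumps once the true driver x1 is dropped. In the output above, weight's coefficient barely moves between the two models, which is reassuring.

```r
# Simulated confounding: x2 is correlated with x1, but y depends on x1 only.
set.seed(330)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)     # x2 confounded with x1
y  <- 2 * x1 + rnorm(n)           # y has no direct dependence on x2
coef(lm(y ~ x1 + x2))["x2"]       # near 0 in the full model
coef(lm(y ~ x2))["x2"]            # far from 0 once x1 is dropped
```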
