STATS 330: Lecture 6
Inference for the Multiple Regression Model
30.07.2015
Housekeeping
- Contact details
  Name            Office    Email (@auckland.ac.nz)   Office hours
  Steffen Klaere  303.219   s.klaere                  9:30-10:30, Thu+Fri
  Arden Miller    303.229C  a.miller                  Wed 9-10, Thu 12-13
- Class representatives
  Name              Course  Email (@aucklanduni.ac.nz)
  Jessica Courtney  330     jcou608
  Monica Hill       330     mhil084
  Ben Wilson        762     bwil003
Tutor Office Hours
- Blake Seers
  - Tuesday, 9-11
  - Thursday, 10-11
  - Friday, 14-15
- Hongbin Guo
  - Monday, 9-11
  - Tuesday, 15-16
  - Wednesday, 14-15
- Stage 3 Assistance Room: 303S.294
Assignments
- Assignment 1 is due August 10.
- It focusses on data cleaning and exploratory analysis.
- Submit to the Student Resource Centre by 2pm.
- Use the cover page provided on the webpage.
Estimate of residual variance σ²
- Recall that σ² controls the scatter of the observations about the regression plane:
  - the bigger σ², the more scatter;
  - the smaller σ², the bigger R².
- σ² is estimated by

  s² = RSS / (n − k − 1).

- s is also known as the residual standard error.
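This estimate can be computed by hand. A minimal sketch, assuming the lecture's cherry.df is R's built-in trees data with renamed columns (Girth is the diameter in inches):

```r
# Compute s^2 = RSS / (n - k - 1) by hand and compare with sigma()
fit <- lm(Volume ~ Girth + Height, data = trees)
RSS <- sum(residuals(fit)^2)  # residual sum of squares
n <- nrow(trees)              # number of observations (31)
k <- 2                        # number of covariates
s2 <- RSS / (n - k - 1)       # estimate of sigma^2
sqrt(s2)                      # residual standard error, approx 3.882
sigma(fit)                    # same value, extracted directly
```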
Calculations for cherry trees
cherry.lm <- lm(volume~diameter+height,data=cherry.df)
summary(cherry.lm)
- The lm function produces an lm object that contains all the
information from fitting the regression.
- lm stands for "linear model".
Calculations for cherry trees
Call:
lm(formula = volume ~ diameter + height, data = cherry.df)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
diameter 4.7082 0.2643 17.816 < 2e-16 ***
height 0.3393 0.1302 2.607 0.0145 *
---
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
Calculations for cherry trees
Hence, we get the model
V = 0.3393h + 4.7082d − 57.9877.
Calculations for cherry trees
Hence, we get the model
V = 0.3393h + 4.7082d − 57.9877.
Is this the true model?
Inference for the regression model
Aim of today's lecture
- To discuss how we assess the significance of variables in the
regression.
- Key concepts:
  - Standard errors
  - Confidence intervals for the coefficients
  - Tests of significance
Variability of the regression coefficients
- Imagine that we keep the x's fixed, but resample the errors
and refit the plane. How much would the plane (the estimated
coefficients) change?
- This gives us an idea of the variability (accuracy) of the
estimated coefficients as estimates of the coefficients of the
true regression plane.
[Figure: the fitted regression plane for Y as a function of X1 and X2]
Variability of the regression coefficients
- Variability depends on
  - the arrangement of the x's (the more correlation, the more
  change);
  - the error variance (the more scatter about the true plane, the
  more the fitted plane changes).
- We measure variability by the standard error of the coefficients.
Example: Cherries
Call:
lm(formula = volume ~ diameter + height)
Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
diameter 56.4979 3.1712 17.816 < 2e-16 ***
height 0.3393 0.1302 2.607 0.0145 *
---
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Confidence intervals
CI: estimated coefficient ± standard error × t
t: the 97.5% point of the t-distribution with df degrees of
freedom.
df: n − k − 1.
n: number of observations.
k: number of covariates (assuming we have a constant
term).
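The formula can be applied directly to the summary table. A sketch, using R's built-in trees data (Girth = diameter) as a stand-in for the lecture's cherry.df:

```r
# Confidence intervals from the formula: estimate +/- t * standard error
fit <- lm(Volume ~ Girth + Height, data = trees)
est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]
tcrit <- qt(0.975, df = df.residual(fit))  # 97.5% point, df = n - k - 1
cbind(lower = est - tcrit * se, upper = est + tcrit * se)
confint(fit)  # the built-in function gives the same intervals
```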
Confidence intervals
Example: Cherries
Use the stats function confint:
> confint(cherry.lm)
2.5 % 97.5 %
(Intercept) -75.68226247 -40.2930554
diameter 50.00206788 62.9937842
height 0.07264863 0.6058538
Hypothesis test
- Often we ask: "Do we need a particular variable, given that the
others are in the model?"
- Note that this is not the same as asking "Is a particular
variable related to the response?"
- We can test the former by examining the ratio of the coefficient
to its standard error.
Hypothesis test
- This ratio is the t-statistic t.
- The bigger t (in absolute value), the more we need the variable.
- Equivalently, the smaller the p-value, the more we need the
variable.
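The t values and p-values in the summary table can be reproduced by hand. A sketch, with the built-in trees data standing in for cherry.df:

```r
# Recompute the t-statistics and two-sided p-values from summary()
fit <- lm(Volume ~ Girth + Height, data = trees)
ct <- coef(summary(fit))
tval <- ct[, "Estimate"] / ct[, "Std. Error"]      # coefficient / standard error
pval <- 2 * pt(-abs(tval), df = df.residual(fit))  # two-sided p-value
cbind(t = tval, p = pval)
```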
Example: Cherries
Call:
lm(formula = volume ~ diameter + height)
Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
diameter 56.4979 3.1712 17.816 < 2e-16 ***
height 0.3393 0.1302 2.607 0.0145 *
---
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Recall: p-value
[Figure: density of the t-distribution with df = 28; the shaded area in the two tails beyond t = ±2.607 gives the p-value of 0.0145]
Other hypotheses
- Overall significance of the regression: do none of the variables
have a relationship with the response?
- Use the F statistic: the bigger F, the more evidence that at
least one variable has a relationship.
- Equivalently, the smaller the p-value, the more evidence that
at least one variable has a relationship.
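The overall F statistic is stored in the model summary. A sketch of extracting it and its p-value, again with trees standing in for cherry.df:

```r
# The overall F statistic and its p-value from a fitted model
fit <- lm(Volume ~ Girth + Height, data = trees)
fs <- summary(fit)$fstatistic  # F value, numerator df, denominator df
fs
pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE)  # overall p-value
```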
Example: Cherries
Call:
lm(formula = volume ~ diameter + height)
Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
diameter 56.4979 3.1712 17.816 < 2e-16 ***
height 0.3393 0.1302 2.607 0.0145 *
---
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Testing if a subset is required
- Often we want to test if a subset of variables is unnecessary.
- Terminology:
  Full model: the model containing all variables.
  Submodel: the model with a set of variables removed.
- The test is based on comparing the RSS of the submodel with the
RSS of the full model. The full model's RSS is always smaller
(why? least squares over more variables can never fit worse, since
setting the extra coefficients to zero recovers the submodel).
Testing if a subset is required
- If the full model RSS is not much smaller than the submodel
RSS, the submodel is adequate: we do not need the extra
variables.
- To do the test, we
  - fit both models and get the RSS for each;
  - calculate the test statistic;
  - if the test statistic is large (equivalently, the p-value is
  small), conclude the submodel is not adequate.
Testing if a subset is required
- The test statistic is

  F = (RSS_sub − RSS_full) / (s² × (df_full − df_sub))

- df_full − df_sub is the number of variables dropped.
- s² is the estimate of σ² from the full model (the residual
mean square).
- R has a function anova to do the calculation.
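The statistic can be computed from the formula and checked against anova. A sketch using trees as a stand-in, with Height as the variable being dropped:

```r
# Subset F test by hand: compare submodel and full-model RSS
full <- lm(Volume ~ Girth + Height, data = trees)
sub  <- lm(Volume ~ Girth, data = trees)
rss.full <- sum(residuals(full)^2)
rss.sub  <- sum(residuals(sub)^2)
d  <- df.residual(sub) - df.residual(full)  # number of variables dropped (1 here)
s2 <- rss.full / df.residual(full)          # sigma^2 estimate from the full model
Fstat <- (rss.sub - rss.full) / (s2 * d)
pval  <- pf(Fstat, d, df.residual(full), lower.tail = FALSE)
c(F = Fstat, p = pval)
anova(sub, full)  # same F and p-value
```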
p-values
- If the submodel is adequate, the test statistic has an
F-distribution with df_full − df_sub and n − k − 1 degrees of
freedom.
- We assess whether the value of F calculated from the sample is a
plausible value from this distribution by means of a p-value.
- If the p-value is too small, we have evidence against the
hypothesis that the submodel is adequate.
p-values
[Figure: density of the F-distribution with 2 and 16 degrees of freedom; the p-value is the area in the upper tail beyond the observed F-value]
Example: Free fatty acid data
- Use physical measures to model a biochemical parameter in
overweight children.
- Variables are
  FFA: free fatty acid level in blood (response variable)
  Age: in months
  Weight: in pounds
  Skinfold thickness: in inches
Analysis
Call:
lm(formula = ffa ~ age + weight + skinfold, data = fatty.df)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.95777 1.40138 2.824 0.01222 *
age -0.01912 0.01275 -1.499 0.15323
weight -0.02007 0.00613 -3.274 0.00478 **
skinfold -0.07788 0.31377 -0.248 0.80714
This suggests
- age is not required if weight and skinfold are retained;
- skinfold is not required if weight and age are retained.
- Can we get away with just weight?
Analysis
> model.sub <- lm(ffa~weight,data=fatty.df)
> anova(model.sub,model.full)
Analysis of Variance Table
Model 1: ffa ~ weight
Model 2: ffa ~ age + weight + skinfold
Res.Df RSS Df Sum of Sq F Pr(>F)
1 18 0.91007
2 16 0.79113 2 0.11895 1.2028 0.3261
- The small F and large p-value suggest that weight alone is adequate.
- But the test should be interpreted with caution: could there be
confounding?
Confounding?
- Confounding: a non-causal relation due to a missing variable.
- The effect can be checked by comparing the coefficients in the full
model and the submodel (when both are available).
> summary(model.full)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.95777 1.40138 2.824 0.01222 *
age -0.01912 0.01275 -1.499 0.15323
weight -0.02007 0.00613 -3.274 0.00478 **
skinfold -0.07788 0.31377 -0.248 0.80714
> summary(model.sub)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.01651 0.37578 5.366 4.23e-05 ***
weight -0.02162 0.00608 -3.555 0.00226 **
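Since fatty.df is not available here, the same coefficient comparison can be sketched on the cherry data (trees), comparing the Girth coefficient with and without Height in the model:

```r
# Confounding check: does the coefficient of interest shift when
# the other variable is dropped?
full <- lm(Volume ~ Girth + Height, data = trees)
sub  <- lm(Volume ~ Girth, data = trees)
c(full = coef(full)["Girth"], sub = coef(sub)["Girth"])
# A large shift in the coefficient would suggest the dropped
# variable was confounding the relationship.
```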