Linear regression in R
MACC7006 Accounting Data and Analytics
Keri Hu
Faculty of Business and Economics
Today: Linear regression in R
By the end of today’s lecture, you should be able to:
• Perform regression analysis to determine linear relationships
between variables
• Understand hypothesis testing and statistical inference
• Interpret coefficient estimates and add best fit lines to scatter
plots of the data
We will work with the datasets: Wine.csv and WineTest.csv.
Review of regression basics
• Linear regression: Explain movements in the dependent variable by
movements in the independent variables
⇒ find the line that fits the data in the sample
• Univariate model: Yi = β0 + β1 Xi + ϵi
• Multivariate model: Yi = β0 + β1 Xi1 + β2 Xi2 + · · · + βK XiK + ϵi
• Produce estimated coefficients: β̂0 , β̂1 , . . . , β̂K
• Interpretation:
• β̂1 : A one-unit increase in X1 is associated with an increase of β̂1
units in Y on average, holding X2 , . . . , XK constant.
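The interpretation above can be checked numerically. The sketch below is only an illustration on simulated data (not the Wine dataset); the names sim, X1, X2 and the true coefficients 1, 2 and -0.5 are made up for this example.

```r
# Simulated data (an assumption for illustration, NOT the Wine dataset):
# Y depends linearly on X1 and X2 plus random noise.
set.seed(1)
n  <- 200
X1 <- rnorm(n)
X2 <- rnorm(n)
Y  <- 1 + 2 * X1 - 0.5 * X2 + rnorm(n, sd = 0.3)
sim <- data.frame(Y, X1, X2)

# Estimate the multivariate model Y = b0 + b1 * X1 + b2 * X2 + error
fit <- lm(Y ~ X1 + X2, data = sim)
beta_hat <- coef(fit)          # named vector: (Intercept), X1, X2
print(round(beta_hat, 3))

# Raise X1 by one unit while holding X2 constant: the fitted value
# changes by the estimated coefficient on X1.
base    <- data.frame(X1 = 0, X2 = 1)
shifted <- data.frame(X1 = 1, X2 = 1)
diff_in_prediction <- predict(fit, shifted) - predict(fit, base)
print(diff_in_prediction)
```

The difference between the two fitted values equals the estimated coefficient on X1, which is what "holding X2 constant" means in practice.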
Correlation is not causation
Regression results cannot prove causality.
• If two things A and B are related statistically, it is possible that
• A causes B
• B causes A
• Some third factor causes both A and B.
• Correlation
• Measures how strongly the variables are linearly related and change together
• Says nothing about the why and how behind the relationship, only that
the relationship exists
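A small simulation can make the "third factor" case concrete. This is a sketch under made-up assumptions: Z is a hypothetical common cause, and A and B are each generated from Z alone, so A has no effect on B by construction.

```r
# Hypothetical setup: a third factor Z drives both A and B.
set.seed(2)
n <- 1000
Z <- rnorm(n)
A <- Z + rnorm(n, sd = 0.5)    # A depends on Z only
B <- Z + rnorm(n, sd = 0.5)    # B depends on Z only, not on A

# A and B are strongly correlated even though neither causes the other.
cor_AB <- cor(A, B)

# Regressing B on A alone gives a clearly nonzero slope ...
slope_naive <- coef(lm(B ~ A))["A"]

# ... but once Z is held constant, the slope on A is close to zero.
slope_adjusted <- coef(lm(B ~ A + Z))["A"]

print(c(correlation = cor_AB,
        naive = unname(slope_naive),
        adjusted = unname(slope_adjusted)))
```

In real data the third factor is often unobserved, which is why a significant coefficient alone cannot establish causality.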
Example: Covid infection
03/26/2020 excerpt of KPBS San Diego:
• “Of the 297 people in San Diego County with positive diagnoses,
cases in patients between 20 and 59 formed the bulk of the total,
236 overall or 79% of cases.”
Does the age range of 20–59 lead to a higher risk of contracting Covid?
1 KPBS San Diego, 2020
Testing rate matters
“Dr. Eric McDonald said that statistic probably represented a testing bias,
as members of the military, first-responders and healthcare workers fall
most frequently into that age group and these people are tested at rates
much higher than the general population.”
• Members of the military, first-responders and healthcare workers are
mostly aged 20–59.
• Essential workers are tested more often, so more positive cases are
found among them.
• Age 20–59 ⇏ more vulnerable to COVID
Variables in the dataset
Build a linear regression model to predict Price, using Age, AGST,
HarvestRain, WinterRain, and FrancePop as independent variables
Plot Price versus Age, AGST, HarvestRain, WinterRain
Estimate a linear model: lm()
Fit a regression line (we save the model to WineReg)
• We do not need to use $ to specify variables here, because we have
the data argument telling R which dataset to use.
WineReg <- lm(Price ~ WinterRain + AGST + HarvestRain +
              Age + FrancePop, data = Wine)
• Check the output of the model: summary(WineReg)
Regression result
Description of the table
• Residual: ei = Yi − Ŷi
• Estimate: β̂0 (Intercept), β̂1 (WinterRain), β̂2 (AGST), β̂3
(HarvestRain), β̂4 (Age), β̂5 (FrancePop)
• The other three columns (Std. Error, t value, and Pr(>|t|))
help us determine if a variable should be included in the model,
specifically if its coefficient is significantly different from zero.
• “***”, “**”, “*”, “.”, “ ” (most significant → least significant): which
variables are significant
• Adjusted R² : R² adjusted for the number of independent variables
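Each piece of the table can also be pulled out of the summary object directly. The sketch below uses simulated data rather than the Wine dataset; the names sim, X1 and X2 are assumptions for illustration, and X2 is generated with no true effect on Y.

```r
# Simulated stand-in data (NOT the Wine dataset).
set.seed(3)
n <- 50
sim <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
sim$Y <- 1 + 2 * sim$X1 + rnorm(n)   # X2 has no true effect on Y

fit <- lm(Y ~ X1 + X2, data = sim)
s   <- summary(fit)

# Coefficient table: columns Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- s$coefficients
print(coef_table)

# Residuals: e_i = Y_i - fitted value of Y_i
res <- residuals(fit)

# R-squared and adjusted R-squared
r2     <- s$r.squared
adj_r2 <- s$adj.r.squared
print(c(r2 = r2, adj_r2 = adj_r2))
```

With K independent variables and an intercept, adjusted R² equals 1 − (1 − R²)(n − 1)/(n − K − 1), so it is never larger than R².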
So what do the three columns mean?
Hypothesis testing
Did our sample “conform to” a particular hypothesis?
A hypothesis test evaluates two mutually exclusive statements about a
population and determines which is supported by the sample data.
Null hypothesis H0 : The age does not affect wine quality.
versus
Alternative hypothesis HA : The age does affect wine quality.
1. State the hypotheses to be tested: H0 (something we expect to
reject) and HA (something to be supported)
2. Determine which test to use (e.g. t test)
3. Estimate equation and calculate value of test statistic (e.g. t value)
4. Draw a conclusion
Hypothesis testing in regression
We want to determine, for each independent variable in turn, whether it
is correlated with the dependent variable.
1. Hypothesize that each regression coefficient is zero: β1 = 0, . . . , βK = 0.
H0 : βk = 0 and HA : βk ≠ 0
2. Obtain the t value and Pr(>|t|) for each variable Xk
3. Determine whether to reject the null hypothesis βk = 0.
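The three steps above can be sketched in R as follows. The data are simulated (not the Wine dataset), the names sim, X1, X2 and decision are made up for this example, and the 5% cutoff alpha is just one common choice.

```r
# Simulated data: X1 truly affects Y, X2 does not (an assumption of this sketch).
set.seed(4)
n <- 60
sim <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
sim$Y <- 1 + 1.5 * sim$X1 + rnorm(n)

# Steps 1-2: fit the model and read off the t value and Pr(>|t|)
# for H0: beta_k = 0 versus HA: beta_k != 0.
fit <- lm(Y ~ X1 + X2, data = sim)
tab <- summary(fit)$coefficients

# Step 3: reject H0 for a variable when its p-value is below alpha.
alpha <- 0.05
decision <- data.frame(
  t_value   = tab[, "t value"],
  p_value   = tab[, "Pr(>|t|)"],
  reject_H0 = tab[, "Pr(>|t|)"] < alpha,
  row.names = rownames(tab)
)
print(decision)
```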
Null hypothesis H0 : βk = 0
If H0 (i.e. βk = 0) is true, the estimated regression coefficient β̂k
divided by its standard error (the t value) should follow a t distribution
centered at 0, something like this:
[Figure: t distribution under H0; see source 2]
If our sample statistic (e.g. t value) is far from the hypothetical value 0,
we can say this is unusual enough and reject the null hypothesis βk = 0.
2 https://analystprep.com/cfa-level-1-exam/quantitative-methods/one-tailed-vs-two-tailed-hypothesis-testing/
t value, Std. Error, and Pr(>|t|)
A higher |t value| or a lower Pr(>|t|) means the variable is statistically
more significant, i.e., there is stronger evidence of correlation with the
dependent variable.
• Normalization: t value of Xk = β̂k / (Std. Error of β̂k)
• Standard error: estimated standard deviation of β̂k
• The larger the sample, the more precise the coefficient estimates and
the higher |t value| tends to be.
• Pr(>|t|): Probability of observing a t value at least as extreme as
the one in this sample if H0 is true
• If Pr(>|t|) is small, it means that the t value in this sample is
extreme and unlikely if we assume H0 .
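To see where the last two columns come from, they can be recomputed by hand from the first two. This is a sketch on simulated data (the names sim, X and Y are assumptions); it uses the two-sided p-value from a t distribution with the model's residual degrees of freedom.

```r
# Simulated data for illustration (NOT the Wine dataset).
set.seed(5)
n <- 40
sim <- data.frame(X = rnorm(n))
sim$Y <- 0.5 * sim$X + rnorm(n)

fit <- lm(Y ~ X, data = sim)
tab <- summary(fit)$coefficients

# t value = estimate / standard error
t_manual <- tab[, "Estimate"] / tab[, "Std. Error"]

# Pr(>|t|) = two-sided tail probability of the t distribution
# with n - (number of estimated coefficients) degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)

print(cbind(t_manual, p_manual))
```

The manual values agree with the t value and Pr(>|t|) columns reported by summary(fit).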
Level of significance α
Definition: probability of type I error (rejecting H0 when it is true)
• An independent variable is statistically significant if H0 : βk = 0 can
be rejected at a small level of significance.
• The probability of false rejection is small when |t value| is big
enough or Pr(>|t|) is small enough.
• Use 0.1% (***), 1% (**), 5% (*), or 10% (.) level of significance
• If Pr(>|t|) is between 1% and 5%, we say the estimated coefficient
β̂k is statistically significant at the 5% level, or βk = 0 is rejected at
the 5% level.
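The definition of α as the probability of a type I error can be illustrated by simulation. In the sketch below (all names and settings are assumptions for illustration), y is generated independently of x, so H0 is true in every replication, and we count how often it is rejected at the 5% level. The helper to_stars is hypothetical and only mirrors the star cutoffs listed above.

```r
set.seed(6)
n_sims <- 2000   # number of simulated samples
n      <- 30     # observations per sample
alpha  <- 0.05

# In every replication y is unrelated to x, so H0: beta_1 = 0 is true.
p_values <- replicate(n_sims, {
  x <- rnorm(n)
  y <- rnorm(n)
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
})

# Share of samples in which a true H0 is (falsely) rejected
false_rejection_rate <- mean(p_values < alpha)
print(false_rejection_rate)

# Hypothetical helper: map p-values to the significance codes above
to_stars <- function(p) {
  as.character(cut(p,
                   breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
                   labels = c("***", "**", "*", ".", " "),
                   include.lowest = TRUE))
}
print(to_stars(c(0.0004, 0.03, 0.07, 0.2)))
```

The false rejection rate should land close to the chosen α, which is why a smaller level of significance means fewer false rejections.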
Refine the model
Remove insignificant independent variables
• Due to multicollinearity, we should remove independent variables one
at a time.
• Two variables that are not significant: Age and FrancePop
• Try removing FrancePop first, since it makes the least intuitive sense.
Re-run the model by leaving out FrancePop
WineReg <- lm(Price ~ WinterRain + AGST + HarvestRain +
              Age, data = Wine)
What has changed?
• All of our independent variables are significant!
• After removing an independent variable, all of our coefficient
estimates changed slightly.
• R² decreases slightly from 0.8294 to 0.8286, while adjusted R²
increases from 0.7845 to 0.7943.
• If we removed Age and FrancePop at the same time (they were both
insignificant in the original model), R² would decrease to 0.7537.
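The same comparison can be made side by side in R. This is a sketch on simulated data, not the Wine results quoted above; the names sim, full and reduced are made up, and X2 is generated with no true effect on Y.

```r
# Simulated data for illustration (NOT the Wine dataset).
set.seed(7)
n <- 25
sim <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
sim$Y <- 1 + 2 * sim$X1 + rnorm(n)   # X2 has no true effect on Y

full    <- summary(lm(Y ~ X1 + X2, data = sim))
reduced <- summary(lm(Y ~ X1, data = sim))   # X2 removed

comparison <- data.frame(
  model         = c("full", "reduced"),
  r_squared     = c(full$r.squared, reduced$r.squared),
  adj_r_squared = c(full$adj.r.squared, reduced$adj.r.squared)
)
print(comparison)
```

R² can never rise when a variable is removed, whereas adjusted R² can move in either direction because it penalizes the number of independent variables.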
Multicollinearity
What is the correlation between Age and FrancePop?
cor(Wine$Age, Wine$FrancePop)
[1] -0.9945
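To see why such a high correlation is a problem, the sketch below builds two nearly mirror-image predictors on simulated data (X1 and X2 are made-up stand-ins, not Age and FrancePop) and compares the standard error of X1 with and without X2 in the model.

```r
# Simulated data: X2 is almost exactly -X1 (an assumption of this sketch).
set.seed(9)
n  <- 30
X1 <- rnorm(n)
X2 <- -X1 + rnorm(n, sd = 0.1)
Y  <- 1 + X1 + rnorm(n)
sim <- data.frame(Y, X1, X2)

# Correlation between the two predictors: close to -1
r12 <- cor(sim$X1, sim$X2)

# Standard error of the X1 coefficient, alone and alongside X2
se_alone <- summary(lm(Y ~ X1, data = sim))$coefficients["X1", "Std. Error"]
se_joint <- summary(lm(Y ~ X1 + X2, data = sim))$coefficients["X1", "Std. Error"]

print(c(correlation = r12, se_alone = se_alone, se_joint = se_joint))
```

When both predictors are included, the model cannot tell their effects apart, so the standard error grows and the t value shrinks. This is why highly correlated variables can each look insignificant, and why they are removed one at a time.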
Add best fit line to plot
We regress Price on AGST:
WineLess <- lm(Price ~ AGST, data = Wine)
plot(Wine$AGST, Wine$Price, ylab = ...)
abline(WineLess)   # add the fitted line to the existing scatter plot
Make predictions
We can make predictions on new observations by using predict().
WineTest <- read.csv("WineTest.csv")
WinePredictions <- predict(WineReg, newdata = WineTest)
str(WinePredictions)
Compare to the actual values
Out-of-sample R²
Use the mean of Price in the training set to calculate SST.
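A minimal sketch of the calculation, assuming out-of-sample R² is defined as 1 − SSE/SST with SST measured around the training-set mean. The data are simulated (not Wine.csv and WineTest.csv), and the names sim, train, test and R2_out are made up for this example.

```r
# Simulated data split into a training set and a test set.
set.seed(8)
n <- 120
sim <- data.frame(X = rnorm(n))
sim$Y <- 2 + 3 * sim$X + rnorm(n)
train <- sim[1:90, ]
test  <- sim[91:120, ]

# Fit on the training set, predict on the test set
fit  <- lm(Y ~ X, data = train)
pred <- predict(fit, newdata = test)

# SSE: squared prediction errors on the test set
SSE <- sum((test$Y - pred)^2)

# SST: squared deviations of the test values from the TRAINING mean,
# i.e. the error of a baseline that always predicts the training average
SST <- sum((test$Y - mean(train$Y))^2)

R2_out <- 1 - SSE / SST
print(R2_out)
```

Unlike in-sample R², this quantity can be negative: that happens when the model predicts the test set worse than the training-mean baseline.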