Lab 2

Lab lecture notes on the R language.

Linear regression in R

MACC7006 Accounting Data and Analytics

Keri Hu

Faculty of Business and Economics

Today: Linear regression in R

By the end of today’s lecture, you should be able to:

• Perform regression analysis to determine linear relationships between variables
• Understand hypothesis testing and statistical inference
• Interpret coefficient estimates and add best-fit lines to scatter plots of the data

We will work with the datasets: Wine.csv and WineTest.csv.

Review of regression basics

• Linear regression: Explain movements in the dependent variable by movements in the independent variables
  ⇒ find the line that best fits the data in the sample

• Univariate model: $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$

• Multivariate model: $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_K X_{iK} + \epsilon_i$

• Produce estimated coefficients: $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_K$

• Interpretation:
  • $\hat{\beta}_1$: A one-unit increase in $X_1$ is associated with an increase of $\hat{\beta}_1$ units in $Y$ on average, holding $X_2, \ldots, X_K$ constant.
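
As a minimal sketch of the multivariate model, the following simulates data with known coefficients and checks that lm() roughly recovers them. The variable names (x1, x2, y) and the coefficient values are invented for illustration and have nothing to do with Wine.csv.

```r
# Simulated data: the "true" coefficients below are arbitrary illustrative choices.
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 3 * x1 - 1.5 * x2 + rnorm(n, sd = 0.5)  # beta0 = 2, beta1 = 3, beta2 = -1.5

fit      <- lm(y ~ x1 + x2)   # ordinary least squares fit
beta_hat <- coef(fit)         # named vector: (Intercept), x1, x2
print(round(beta_hat, 2))
```

Here the estimate for x1 is read as the average change in y for a one-unit increase in x1, holding x2 constant.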

Correlation is not causation

Regression results cannot prove causality.

• If two things A and B are related statistically, it is possible that
  • A causes B
  • B causes A
  • Some third factor causes both A and B
• Correlation tells us
  • How strongly the variables are linearly related and change together
  • Not why or how the relationship arises, only that it exists
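
The "third factor" case can be illustrated with a small simulation. This is only a sketch on invented data: z plays the role of a hypothetical third factor that drives both a and b, while a has no effect on b at all.

```r
# Simulated confounding: z causes both a and b; a does NOT cause b.
set.seed(42)
n <- 1000
z <- rnorm(n)                  # hypothetical third factor
a <- z + rnorm(n, sd = 0.5)    # A depends only on z
b <- z + rnorm(n, sd = 0.5)    # B depends only on z

r_ab            <- cor(a, b)                   # A and B are strongly correlated
coef_naive      <- coef(lm(b ~ a))["a"]        # regression that ignores z
coef_controlled <- coef(lm(b ~ a + z))["a"]    # regression that holds z constant

print(round(c(correlation = r_ab,
              naive       = unname(coef_naive),
              controlled  = unname(coef_controlled)), 3))
```

The naive regression reports a sizeable coefficient on a, but once z is held constant the coefficient shrinks toward zero, which is what we would expect when the association is driven entirely by the third factor.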

Example: Covid infection

03/26/2020 excerpt of KPBS San Diego:

• “Of the 297 people in San Diego County with positive diagnoses,
cases in patients between 20 and 59 formed the bulk of the total,
236 overall or 79% of cases.”

Does the age range of 20-59 lead to a higher risk of contracting Covid?

1 KPBS San Diego, 2020

Testing rate matters

“Dr. Eric McDonald said that statistic probably represented a testing bias,
as members of the military, first-responders and healthcare workers fall
most frequently into that age group and these people are tested at rates
much higher than the general population.”

• Members of the military, first responders and healthcare workers are mostly aged 20-59.
• Essential workers are tested more often, so more positive cases are found among them.
• Being aged 20-59 ⇏ being more vulnerable to COVID

Variables in the dataset

Build a linear regression model to predict Price, using Age, AGST, HarvestRain, WinterRain, and FrancePop as independent variables.

Plot Price versus Age, AGST, HarvestRain, WinterRain
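
The slide's figure is not reproduced here. One way such a set of scatter plots could be drawn is sketched below. The data frame is a stand-in with the slide's column names and random placeholder values, not the real Wine.csv; with the real file, Wine <- read.csv("Wine.csv") would replace the simulated block.

```r
# Stand-in data: column names follow the slides, values are random placeholders.
set.seed(7)
Wine <- data.frame(
  Price       = rnorm(25),
  Age         = rnorm(25),
  AGST        = rnorm(25),
  HarvestRain = rnorm(25),
  WinterRain  = rnorm(25)
)

predictors <- c("Age", "AGST", "HarvestRain", "WinterRain")

old_par <- par(mfrow = c(2, 2))     # arrange the panels in a 2 x 2 grid
for (v in predictors) {
  plot(Wine[[v]], Wine$Price,
       xlab = v, ylab = "Price",
       main = paste("Price vs", v))
}
par(old_par)                        # restore the previous plotting layout
```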

Estimate a linear model: lm()

Fit a regression line (we save the model to WineReg)

• We do not need the $ operator (as in Wine$Price) to specify variables here, because the data argument tells R which dataset to use.

Wine <- read.csv("Wine.csv")   # load the training data (assumes the file is in the working directory)
WineReg <- lm(Price ~ WinterRain + AGST + HarvestRain + Age + FrancePop,
              data = Wine)

• Check the output of the model: summary(WineReg)

Regression result

Description of the table

• Residual: $e_i = Y_i - \hat{Y}_i$
• Estimate: $\hat{\beta}_0$ (Intercept), $\hat{\beta}_1$ (WinterRain), $\hat{\beta}_2$ (AGST), $\hat{\beta}_3$ (HarvestRain), $\hat{\beta}_4$ (Age), $\hat{\beta}_5$ (FrancePop)
• The other three columns (Std. Error, t value, and Pr(>|t|)) help us determine whether a variable should be included in the model, specifically whether its coefficient is significantly different from zero.
• Significance codes "***", "**", "*", ".", " " (most significant → least significant): show which variables are significant
• Adjusted $R^2$: $R^2$ adjusted for the number of independent variables
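
The pieces of the summary table can also be pulled out programmatically. The sketch below uses a small simulated data frame (d, x1, x2 and y are invented names, not the wine data) to show where the residuals, the coefficient table and adjusted $R^2$ live in a fitted model.

```r
# Simulated data for illustration only; x2 has no true effect on y here.
set.seed(3)
d   <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
d$y <- 1 + 2 * d$x1 + rnorm(40)

fit <- lm(y ~ x1 + x2, data = d)
s   <- summary(fit)

coef_table <- coef(s)            # matrix with one row per coefficient
print(colnames(coef_table))      # Estimate, Std. Error, t value, Pr(>|t|)

e      <- residuals(fit)         # e_i = Y_i - Yhat_i
adj_r2 <- s$adj.r.squared        # adjusted R-squared
r2     <- s$r.squared            # plain R-squared
```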

So what do the three columns mean?

Hypothesis testing

Did our sample “conform to” a particular hypothesis?

A hypothesis test evaluates two mutually exclusive statements about a
population and determines which is supported by the sample data.
  Null hypothesis $H_0$: Age does not affect wine quality.
  versus
  Alternative hypothesis $H_A$: Age does affect wine quality.

1. State the hypotheses to be tested: $H_0$ (something we expect to reject) and $H_A$ (something to be supported)
2. Determine which test to use (e.g. t test)
3. Estimate equation and calculate value of test statistic (e.g. t value)
4. Draw a conclusion

Hypothesis testing in regression

We want to determine, for each independent variable in turn, whether it is correlated with the dependent variable.

1. Hypothesize that each regression coefficient is zero: $\beta_1 = 0, \ldots, \beta_K = 0$.
   $H_0: \beta_k = 0$ and $H_A: \beta_k \neq 0$
2. Obtain the t value and Pr(>|t|) for each variable $X_k$
3. Determine whether to reject the null hypothesis $\beta_k = 0$.

Null hypothesis $H_0: \beta_k = 0$

If $H_0$ (i.e. $\beta_k = 0$) is true, the estimated regression coefficient $\hat{\beta}_k$, divided by its standard error, should follow a t distribution centered at 0, something like this:

[Figure: t distribution under $H_0$, not reproduced here]

If our sample statistic (e.g. the t value) is far from the hypothesized value 0, we can say this is unusual enough to reject the null hypothesis $\beta_k = 0$.

2 https://analystprep.com/cfa-level-1-exam/quantitative-methods/one-tailed-vs-two-tailed-hypothesis-testing/
t value, Std. Error, and Pr(>|t|)

A higher |t value| or a lower Pr(>|t|) indicates greater statistical significance, i.e., stronger evidence of correlation with the dependent variable.

• Normalization: t value of $X_k$ $= \hat{\beta}_k / \text{Std. Error of } \hat{\beta}_k$
  • Standard error: estimated standard deviation of $\hat{\beta}_k$
  • The larger the sample, the more precise the coefficient estimates and, other things equal, the higher |t value|.

• Pr(>|t|): Probability of observing a t value at least as extreme as the one in this sample if $H_0$ is true
  • If Pr(>|t|) is small, the t value in this sample is extreme and would be unlikely if $H_0$ were true.
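
To make the link between the three columns concrete, the sketch below recomputes the t value and Pr(>|t|) by hand from the estimate and its standard error, then compares them with what summary() reports. The data are simulated and the names (x, y, t_manual, p_manual) are illustrative only.

```r
# Simulated data for illustration only.
set.seed(10)
n <- 30
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

fit <- lm(y ~ x)
tab <- coef(summary(fit))                   # the coefficient table

est <- tab["x", "Estimate"]
se  <- tab["x", "Std. Error"]

t_manual <- est / se                        # t value = estimate / standard error
df       <- df.residual(fit)                # n minus number of estimated coefficients
p_manual <- 2 * pt(-abs(t_manual), df = df) # two-sided tail probability under H0

print(c(t_manual = t_manual, t_reported = tab["x", "t value"]))
print(c(p_manual = p_manual, p_reported = tab["x", "Pr(>|t|)"]))
```

The factor of 2 reflects the two-sided alternative $H_A: \beta_k \neq 0$: values far from 0 in either tail count as extreme.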

Level of significance α

Definition: probability of type I error (rejecting H0 when it is true)

• An independent variable is statistically significant if $H_0$ can be rejected at a small level of significance.

• The probability of false rejection is small when |t value| is big enough or Pr(>|t|) is small enough.

• Use the 0.1% (***), 1% (**), 5% (*), or 10% (.) level of significance

• If Pr(>|t|) is between 1% and 5%, we say the estimated coefficient $\hat{\beta}_k$ is statistically significant at the 5% level, or $\beta_k = 0$ is rejected at the 5% level.
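
As a rough sketch of how p-values map onto the significance codes, the helper below bins them at the 0.1%, 1%, 5% and 10% thresholds. The function name sig_code and the p-values are made up for illustration; this is not how summary() is implemented internally, only a mapping that follows the same thresholds.

```r
# Hypothetical helper: map p-values to the significance codes used in the table.
sig_code <- function(p) {
  as.character(cut(p,
                   breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
                   labels = c("***", "**", "*", ".", " "),
                   include.lowest = TRUE))   # bins are (a, b], with 0 included in the first
}

p_values <- c(0.0004, 0.003, 0.03, 0.08, 0.4)   # made-up p-values
codes    <- sig_code(p_values)
reject5  <- p_values < 0.05                      # reject H0 at the 5% level?

print(data.frame(p = p_values, code = codes, reject_at_5pct = reject5))
```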

Refine the model

Remove insignificant independent variables

• Due to multicollinearity, we should remove independent variables one at a time.

• Two variables that are not significant: Age and FrancePop

• Try removing FrancePop first, since it makes the least intuitive sense.

Re-run the model without FrancePop

WineReg <- lm(Price ~ WinterRain + AGST + HarvestRain + Age,
              data = Wine)

What has changed?

• All of our independent variables are now significant!

• After removing an independent variable, all of our coefficient estimates changed slightly.

• $R^2$ decreases slightly from 0.8294 to 0.8286, while adjusted $R^2$ increases from 0.7845 to 0.7943.
• If we removed Age and FrancePop at the same time (they were both insignificant in the original model), $R^2$ would decrease to 0.7537.
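
Why can adjusted $R^2$ rise while $R^2$ falls? The usual definition is $\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - K - 1}$, which penalizes each additional independent variable. The sketch below plugs in the $R^2$ values quoted above. The sample size n = 25 is an assumption: it is not stated in these slides, but it is the value under which the quoted adjusted figures are reproduced.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors k.
adj_r2 <- function(r2, n, k) 1 - (1 - r2) * (n - 1) / (n - k - 1)

n <- 25   # ASSUMPTION: sample size is not given in the slides

full    <- adj_r2(0.8294, n, k = 5)   # model with FrancePop
reduced <- adj_r2(0.8286, n, k = 4)   # model without FrancePop

print(round(c(full = full, reduced = reduced), 4))
```

Dropping FrancePop costs very little fit but removes one predictor from the penalty term, so the adjusted measure goes up.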

Multicollinearity

What is the correlation between Age and FrancePop?

cor(Wine$Age, Wine$FrancePop)

[1] -0.9945
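
A correlation this close to -1 means the two variables carry almost the same information, which makes their individual coefficients hard to pin down. The simulation below is a sketch of that effect on invented data (x1 and x2 are hypothetical stand-ins, not Age and FrancePop): adding a near-duplicate predictor inflates the standard error of the original one.

```r
# Simulated near-collinear predictors; values are for illustration only.
set.seed(5)
n  <- 100
x1 <- rnorm(n)
x2 <- -x1 + rnorm(n, sd = 0.1)      # almost a mirror image of x1
y  <- 1 + 2 * x1 + rnorm(n)

r12     <- cor(x1, x2)                                        # close to -1
se_both <- coef(summary(lm(y ~ x1 + x2)))["x1", "Std. Error"] # x1 with x2 in the model
se_one  <- coef(summary(lm(y ~ x1)))["x1", "Std. Error"]      # x1 alone

print(round(c(correlation = r12, se_with_x2 = se_both, se_alone = se_one), 3))
```

A larger standard error means a smaller |t value|, which is one reason two highly correlated variables can both look insignificant when entered together.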

Add best fit line to plot

We regress Price on AGST:

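WineLess <- lm(Price ~ AGST, data = Wine)
plot(Wine$AGST, Wine$Price, ylab = ...)   # draw the scatter plot first (arguments elided in the original)
abline(WineLess)                          # then overlay the fitted regression line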

Make predictions

We can make predictions for new observations using predict().

WineTest <- read.csv("WineTest.csv")
WinePredictions <- predict(WineReg, newdata = WineTest)
str(WinePredictions)

Compare to the actual values

Out-of-sample $R^2$

Use the mean of Price in the training set to calculate SST.
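
A minimal sketch of the calculation, using $R^2 = 1 - SSE/SST$ with SST measured against the training-set mean. The data are simulated, and make_data, train and test are hypothetical names standing in for Wine and WineTest.

```r
# Simulated train/test split for illustration only.
set.seed(8)
make_data <- function(n) {
  x <- rnorm(n)
  data.frame(x = x, y = 1 + 2 * x + rnorm(n))
}
train <- make_data(50)
test  <- make_data(20)

fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)

sse <- sum((test$y - pred)^2)             # squared prediction errors on the test set
sst <- sum((test$y - mean(train$y))^2)    # baseline: predict the TRAINING mean for every test point
r2_out <- 1 - sse / sst

print(round(r2_out, 3))
```

The baseline uses the training mean because that is the only "no-model" prediction available before seeing the test data. Unlike in-sample $R^2$, the out-of-sample value can be negative when the model predicts worse than that baseline.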
