Machine Learning and Data Analytics
Linear regression – Part 1
Dr. Rossana Cavagnini
Deutsche Post Chair – Optimization of Distribution Networks (DPO)
RWTH Aachen University
mlda@[Link]
Simple linear regression
Agenda
1 Simple linear regression
Overview
Assessing the accuracy of the coefficient estimates
Assessing the accuracy of the model
DPO MLDA 2
Simple linear regression
Overview
Simple linear regression Assessing the accuracy of the coefficient estimates
Assessing the accuracy of the model
Overview
Regression: find the line which best interpolates the data
Context:
n observation pairs (measurement of X and Y )
(x1 , y1 ), (x2 , y2 ), · · · , (xn , yn )
Predict a quantitative response Y on the basis of a single predictor variable X
Y ≈ β0 + β1 X
Y : target (output)
X : input (predictor)
β0 : intercept (value of Y at X = 0)
β1 : slope (steepness)
Estimating the coefficients
Example: Advertising
Regress sales onto TV advertising: sales ≈ β0 + β1 TV
[Figure: scatter plot of Sales (vertical axis) against TV (horizontal axis). Regression line in blue and errors in gray.]
Error = actual sales − predicted sales
ŷi = β̂0 + β̂1 xi : prediction for Y based on the ith value of X
Residual: ei = yi − ŷi
A good fit minimizes the residuals
What will a good fit minimize?
1 The sum of the errors? → No. Positive and negative errors cancel each other out.
2 The sum of the absolute values of the errors? → No. There can be multiple regression lines minimizing this sum.
3 The sum of squared errors? → Yes. There is only one line minimizing this sum (provided the xi are not all equal).
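A minimal sketch of why the plain sum of errors is not a usable criterion. The toy data and the helper `errors` are made up for illustration and are not part of the lecture. Every line passing through the point (x̄, ȳ) has errors summing to zero, whatever its slope, while the sum of squared errors still tells the lines apart.

```python
import numpy as np

# Hypothetical toy data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.0])


def errors(slope):
    """Errors of the line with the given slope that passes through (mean(x), mean(y))."""
    intercept = y.mean() - slope * x.mean()
    return y - (intercept + slope * x)


# Three very different lines: the signed errors always sum to (numerically) zero,
# but the sum of squared errors clearly prefers the slope close to the data trend.
for slope in (-3.0, 0.0, 2.0):
    e = errors(slope)
    print(f"slope={slope:5.1f}  sum of errors={e.sum():8.4f}  sum of squares={np.sum(e ** 2):8.4f}")
```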
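OLS
The ordinary least squares (OLS) criterion minimizes the Residual Sum of Squares (RSS):
$$RSS = e_1^2 + e_2^2 + \cdots + e_n^2$$
by defining the least squares coefficient estimates:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},$$
where $\bar{y} \equiv \frac{1}{n} \sum_{i=1}^{n} y_i$ and $\bar{x} \equiv \frac{1}{n} \sum_{i=1}^{n} x_i$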
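The closed-form estimates can be sketched in a few lines of NumPy. The function name `ols_fit` and the toy data are assumptions made for this example, not part of the lecture material.

```python
import numpy as np


def ols_fit(x, y):
    """Least squares estimates (beta0_hat, beta1_hat) for simple linear regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: sum of cross-deviations divided by the sum of squared deviations of x.
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: forces the fitted line through (x_bar, y_bar).
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


# Hypothetical toy data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.0])

beta0, beta1 = ols_fit(x, y)
residuals = y - (beta0 + beta1 * x)
rss = np.sum(residuals ** 2)
print(f"beta0_hat={beta0:.4f}  beta1_hat={beta1:.4f}  RSS={rss:.4f}")
```

The result can be cross-checked against `np.polyfit(x, y, 1)`, which returns the slope first and the intercept second.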
Example: Advertising
How many units will be sold by spending 1000 USD in TV advertising?
Regress sales onto TV advertising: sales ≈ β0 + β1 TV.
[Figure: plot of Sales against TV, and a plot of RSS as a function of β0 and β1. The red dot corresponds to the pair of least squares estimates.]
OLS: β̂0 = 7.03 and β̂1 = 0.0475
Each additional 1000 USD spent on TV advertising is associated with about 47.5 additional units sold (TV is measured in thousands of USD, sales in thousands of units)
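The arithmetic behind this reading can be sketched as follows. The function `predict_sales` is a hypothetical helper. The sketch assumes TV is measured in thousands of USD and sales in thousands of units, which is consistent with how the confidence intervals are interpreted later in the deck.

```python
# Estimates reported on the slide.
beta0_hat = 7.03
beta1_hat = 0.0475


def predict_sales(tv_budget):
    """Predicted sales (thousands of units) for a TV budget (thousands of USD)."""
    return beta0_hat + beta1_hat * tv_budget


baseline = predict_sales(0.0)  # predicted sales without TV advertising
with_1k = predict_sales(1.0)   # predicted sales with 1000 USD of TV advertising

# Convert the difference from thousands of units to units.
extra_units = (with_1k - baseline) * 1000
print(f"predicted sales at 1000 USD: {with_1k * 1000:.1f} units")
print(f"additional units per 1000 USD: {extra_units:.1f}")
```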
Assessing the accuracy of the coefficient estimates
Population regression line is the best linear approximation to the true relationship between X and Y :
Y = β0 + β1 X + ε
β0 , β1 : unknown coefficients, estimated by β̂0 and β̂1
The error term (ε) captures what we miss with the simple model:
The true relationship is probably not linear
Other variables causing variation in Y
Measurement error
Least squares line is characterized by the least squares regression coefficient estimates
The true relationship is generally not known, while the least squares line can always be computed
Population regression line vs least squares line
[Figure: two panels plotting Y against X.]
Left: the red line is the population regression line, the blue line is the least squares line.
Right: in light blue, ten least squares lines, each fitted to a different set of observations
The least squares line uses information from a sample to estimate characteristics of a large population
For a particular set of observations, the estimates may underestimate or overestimate the true coefficients
Averaged over a huge number of observation sets, the estimates are expected to equal the true values exactly (the estimators are unbiased)
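How far will a single estimate be from the actual value?
Standard error of the sample mean µ̂ (shows how the deviation shrinks with n):
$$\mathrm{Var}(\hat{\mu}) = SE(\hat{\mu})^2 = \frac{\sigma^2}{n}$$
Standard error (SE) of the estimators:
$$SE(\hat{\beta}_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right], \qquad SE(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
where σ² = Var(ε) (generally not known)
SE(β̂0)² and SE(β̂1)² are small when the xi are more spread out
σ is estimated by the residual standard error:
$$RSE = \sqrt{RSS/(n - 2)}$$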
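A minimal sketch of these formulas, with the unknown σ replaced by the RSE. The function name `ols_standard_errors` and the toy data are assumptions made for this example.

```python
import numpy as np


def ols_standard_errors(x, y):
    """Return the RSE and the estimated standard errors of beta0_hat and beta1_hat."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    x_bar = x.mean()
    sxx = np.sum((x - x_bar) ** 2)

    # Least squares estimates.
    beta1 = np.sum((x - x_bar) * (y - y.mean())) / sxx
    beta0 = y.mean() - beta1 * x_bar

    # sigma is unknown, so it is estimated by the residual standard error.
    rss = np.sum((y - (beta0 + beta1 * x)) ** 2)
    rse = np.sqrt(rss / (n - 2))

    se_beta0 = rse * np.sqrt(1.0 / n + x_bar ** 2 / sxx)
    se_beta1 = rse / np.sqrt(sxx)
    return {"rse": rse, "se_beta0": se_beta0, "se_beta1": se_beta1}


# Hypothetical toy data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.0])

result = ols_standard_errors(x, y)
print(result)
```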
What are standard errors useful for?
1. Confidence intervals
A 95% confidence interval is defined as a range of values such that, with 95% probability, the range will contain the true unknown value of the parameter
There is approx. a 95% chance that the interval [β̂1 − 2SE(β̂1), β̂1 + 2SE(β̂1)] will contain the true value of β1.
Example: Advertising
β0 has a 95% confidence interval of [6.130, 7.935]
β1 has a 95% confidence interval of [0.042, 0.053]
With no advertising, sales will, on average, fall somewhere between 6,130 and 7,935 units
For each 1000 USD increase in TV advertising, there is an average increase in sales of between 42 and 53 units.
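The ±2·SE rule can be checked against the reported intervals. This sketch takes the estimates and standard errors from the coefficient table of the Advertising example, and the helper `approx_ci` is hypothetical. The slope interval is reproduced after rounding. The intercept interval comes out slightly wider than [6.130, 7.935], presumably because the reported interval uses an exact quantile rather than the factor 2.

```python
def approx_ci(estimate, std_error, width=2.0):
    """Approximate 95% confidence interval: estimate +/- width * SE."""
    return estimate - width * std_error, estimate + width * std_error


# Estimates and standard errors from the Advertising coefficient table.
b0_lo, b0_hi = approx_ci(7.0325, 0.4578)
b1_lo, b1_hi = approx_ci(0.0475, 0.0027)

print(f"beta0: [{b0_lo:.3f}, {b0_hi:.3f}]")
print(f"beta1: [{b1_lo:.3f}, {b1_hi:.3f}]")
```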
What are standard errors useful for?
2. Hypothesis tests on the coefficients
H0 : β1 = 0: No relationship between X and Y (Y = β0 + ε)
HA : β1 ≠ 0: Some relationship between X and Y
Is β̂1 far from zero? How far is “far enough”? → Depends on the accuracy of β̂1 (SE(β̂1))
t-statistic measures the number of standard deviations that β̂1 is away from 0:
$$t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}$$
If SE(β̂1) is small, even a small β̂1 suggests that β1 ≠ 0 (a relationship between X and Y)
If SE(β̂1) is large, β̂1 must be large in absolute value to reject the null hypothesis
p-value is the probability of observing any value equal to |t| or larger in absolute value, assuming that β1 = 0
Small p-value: there is an association between the predictor and the response, i.e. reject the null hypothesis. Typical cutoff values: 5% or 1%
Example: Advertising
            Coefficient   Std. Error   t-statistic   p-value
Intercept   7.0325        0.4578       15.36         < 0.0001
TV          0.0475        0.0027       17.67         < 0.0001
β̂0 and β̂1 are very large relative to their standard errors → t-statistics are large
The p-values are close to zero → reject the null hypotheses: β0 ≠ 0 and β1 ≠ 0
β0 ≠ 0: in the absence of TV advertising, sales are non-zero
β1 ≠ 0: there is a relationship between TV and sales
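The t-statistics can be recomputed from the rounded table entries. In this sketch the helper names are hypothetical, and the p-value uses a standard normal approximation, which is an assumption of the example. The exact test uses a t-distribution with n − 2 degrees of freedom, which is close to the normal for large n. The TV value comes out as about 17.59 rather than 17.67 because the table entries are rounded.

```python
import math


def t_statistic(estimate, std_error):
    """Number of standard errors that the estimate lies away from 0."""
    return (estimate - 0.0) / std_error


def two_sided_p_normal(t):
    """P(|Z| >= |t|) for a standard normal Z (approximation of the t-test p-value)."""
    return math.erfc(abs(t) / math.sqrt(2.0))


# Rounded values from the Advertising coefficient table.
t_intercept = t_statistic(7.0325, 0.4578)
t_tv = t_statistic(0.0475, 0.0027)

print(f"Intercept: t={t_intercept:.2f}  p~{two_sided_p_normal(t_intercept):.2e}")
print(f"TV:        t={t_tv:.2f}  p~{two_sided_p_normal(t_tv):.2e}")
```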
Assessing the accuracy of the model
1. The sum of squared errors (RSS) is an absolute measure (it increases as the number of points increases)
2. Residual Standard Error (RSE) is an estimate of the standard deviation of ε, i.e., the average amount that the response will deviate from the true regression line.
$$RSE = \sqrt{\frac{1}{n-2} RSS} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
Absolute measure (expressed in units of Y)
3. R² statistic measures the proportion of variability in Y that can be explained using X
$$R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS},$$
where the total sum of squares is $TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$
It is a relative measure (0 ≤ R² ≤ 1)
Close to 1: a large proportion of the variability in the response can be explained by the regression
Close to 0: the regression does not explain much of the variability in the response.
For the simple linear regression, R² = [Corr(X, Y)]²
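The three accuracy measures can be computed side by side. This is a sketch on made-up toy data that uses `np.polyfit` for the fit, and it also checks numerically that R² equals the squared correlation between X and Y in simple linear regression.

```python
import numpy as np

# Hypothetical toy data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.0])

# Least squares fit; np.polyfit returns the slope first, then the intercept.
beta1, beta0 = np.polyfit(x, y, 1)
y_hat = beta0 + beta1 * x

rss = np.sum((y - y_hat) ** 2)       # variability left after the regression
tss = np.sum((y - y.mean()) ** 2)    # variability around the mean of y
rse = np.sqrt(rss / (len(x) - 2))    # absolute measure, in units of Y
r2 = 1.0 - rss / tss                 # relative measure, between 0 and 1

corr = np.corrcoef(x, y)[0, 1]
print(f"RSS={rss:.4f}  TSS={tss:.4f}  RSE={rse:.4f}")
print(f"R^2={r2:.4f}  Corr(X, Y)^2={corr ** 2:.4f}")
```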
Remark:
$$TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2 \quad \text{vs} \quad RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Both use the actual responses yi of the training data set
RSS measures the variability left after performing the regression (difference with respect to the predictions)
TSS measures the variability in Y before the regression is performed (difference with respect to the mean of the data points)
The ratio RSS/TSS measures how good the model is compared to always predicting the mean value ȳ
A low ratio: the errors with respect to the predictions are small compared to the errors with respect to the mean → the model fits the data better
Example: Advertising
Quantity                  Value
Residual Standard Error   3.26
R²                        0.612
RSE: Actual sales in each market deviate from the true regression line by approx. 3,260 units, on average
R²: About 61% (just under 2/3) of the variability in sales is explained by a linear regression on the TV budget