Module 4
Linear Regression with a Single Regressor: Estimation
Department of Economics, HKUST
Instructor: Junlong Feng
Fall 2022
Menu of Module 4
I. Linear Regression Model
II. OLS Estimator
III. LS Assumptions for Causal Inference
IV. Sampling Distribution of OLS
V. On Prediction
I. Linear Regression Model
$Y_i = \beta_0 + \beta_1 X_i + u_i$
• $Y$: dependent variable
• $X$: independent variable, or regressor
• $u$: regression error, or simply, error (the conventional name, though not accurate in economics), or unobservable (more accurate)
  • Contains unobserved factors that also affect $Y_i$.
  • Also contains measurement error in $Y_i$.
• $\beta_0$: intercept
• $\beta_1$: slope
• $\beta_0 + \beta_1 X_i$: population regression function, or population regression line
I. Linear Regression Model
$Y_i = \beta_0 + \beta_1 X_i + u_i$
• $E[Y_i \mid X_i] = \beta_0 + \beta_1 X_i$. The conditional expectation is a population quantity.
• $\beta_0, \beta_1$: population parameters.
  • Use the random sample to estimate them.
  • How?
II. Estimate $(\beta_0, \beta_1)$: The OLS Estimator
$Y_i = \beta_0 + \beta_1 X_i + u_i$
• Intuition: If $u_i = 0$ for all $i$, then $Y_i = \beta_0 + \beta_1 X_i$.
  • $E[Y_i \mid X_i] = \beta_0 + \beta_1 X_i$ is still satisfied.
• Since we can observe the $Y_i$'s and the $X_i$'s (e.g. test scores and STRs), we have $n$ equations and two unknowns:
$\beta_0 + \beta_1 X_1 - Y_1 = 0$
$\beta_0 + \beta_1 X_2 - Y_2 = 0$
$\cdots$
$\beta_0 + \beta_1 X_n - Y_n = 0$
• In general, this system of equations has NO solution.
II. Estimate $(\beta_0, \beta_1)$: The OLS Estimator
• Idea: Instead of forcing all $n$ equations to be exactly zero, try to make each of them as close to zero as possible.
• Maybe none of the equations is exactly zero in the end, but the total distance from zero is the smallest.
• For any $b_0$ and $b_1$:
  • Distance of equation 1 from zero: $b_0 + b_1 X_1 - Y_1$
  • Distance of equation 2 from zero: $b_0 + b_1 X_2 - Y_2$
  • $\cdots$
  • Total distance (the squared Euclidean distance), which OLS makes as small as possible (see the display below):
$(b_0 + b_1 X_1 - Y_1)^2 + (b_0 + b_1 X_2 - Y_2)^2 + \cdots + (b_0 + b_1 X_n - Y_n)^2$
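Compactly, the OLS estimator is the pair $(b_0, b_1)$ that minimizes this total squared distance:

```latex
(\hat{\beta}_0, \hat{\beta}_1)
  \;=\; \operatorname*{arg\,min}_{b_0,\, b_1}\;
        \sum_{i=1}^{n} \bigl( b_0 + b_1 X_i - Y_i \bigr)^2
```

Minimizing this sum of squares is exactly the "least squares" in ordinary least squares.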
II. Estimate $(\beta_0, \beta_1)$: The OLS Estimator
• The least squares estimator was first invented by Gauss (1795, 1809) and Legendre (1805).
  • More than 200 years old.
• It's still among the most powerful estimators/predictors/machine learners in statistics, econometrics, machine learning, etc.
II. Estimate $(\beta_0, \beta_1)$: The OLS Estimator
• $\hat{\beta}_1 = \dfrac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}$
• $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$
  • Derivation in Appendix 4.2. Not required.
• Predicted value (or, fitted value): $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$
  • Compare with the population regression line.
• Residual: $\hat{u}_i = Y_i - \hat{Y}_i$
  • Compare with the regression error $u_i$. (A numerical sketch follows below.)
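As a concrete illustration of these formulas, here is a minimal numpy sketch on simulated data (the sample size and the "true" values $\beta_0 = 2$, $\beta_1 = 0.5$ are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data; the "true" beta_0 = 2.0 and beta_1 = 0.5 are illustrative choices.
n = 200
X = rng.normal(loc=10.0, scale=2.0, size=n)
u = rng.normal(loc=0.0, scale=1.0, size=n)
Y = 2.0 + 0.5 * X + u

# OLS formulas from this slide.
beta1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0_hat = Y.mean() - beta1_hat * X.mean()

Y_fitted = beta0_hat + beta1_hat * X   # predicted (fitted) values
u_resid = Y - Y_fitted                 # residuals

# Cross-check against numpy's built-in least-squares line fit.
slope_check, intercept_check = np.polyfit(X, Y, deg=1)
print(beta1_hat, slope_check)          # should agree
print(beta0_hat, intercept_check)      # should agree
```

The hand-computed $(\hat{\beta}_0, \hat{\beta}_1)$ and numpy's least-squares fit coincide because they solve the same minimization problem.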
III. When does OLS estimate the causal effect? The Potential Outcome Framework
Mathematically, you can imagine there is a production function $g$, and
$Y = g(X, V)$
(Of course $g$ is abstract and unknown.)
Now, imagine there are two parallel universes. In the first universe, $X = 1$. In the second, $X = 0$. In both universes, $V$ is the same.
• The causal effect is defined as $g(1, V) - g(0, V)$.
• The two terms $g(1, V)$ and $g(0, V)$ are called potential outcomes, denoted by $Y(1)$ and $Y(0)$.
The problem, however, is that Alice only lives once. She cannot observe the two potential outcomes at the same time: she either goes to the library every day this semester or she does not.
• This is called the fundamental problem of causal inference.
III. When does OLS estimate the causal effect? The Potential Outcome Framework
So, instead of trying to learn Alice's own causal effect, economists try to learn something less informative but still useful in many settings: the average treatment effect (ATE):
$ATE = E[Y(1) - Y(0)]$
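Why random assignment helps: assuming random assignment means $X$ is independent of the potential outcomes $(Y(0), Y(1))$, the ATE can be recovered from two conditional expectations of the observed $Y$:

```latex
\begin{aligned}
E[Y \mid X = 1] - E[Y \mid X = 0]
  &= E[Y(1) \mid X = 1] - E[Y(0) \mid X = 0] \\
  &= E[Y(1)] - E[Y(0)] \;=\; \text{ATE}
\end{aligned}
```

The first equality uses that the observed $Y$ equals $Y(1)$ when $X = 1$ and $Y(0)$ when $X = 0$; the second uses the independence of $X$ and the potential outcomes.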
III. When does OLS estimate the causal effect? The Potential Outcome Framework
To summarize:
• ATE is what people aim for, given the fundamental problem of causal inference.
• ATE can be expressed as the difference of two conditional expectations of $Y$ given $X$ under random assignment.
• The conditional expectation of $Y$ given $X$ can be assumed to be linear.
  • In Module 7, I'll show you cases where linearity is NOT an assumption but a mathematical truth.
• OLS is a tool to estimate the coefficients in a linear model.
• So, putting everything together, we are tempted to use OLS to estimate the conditional expectation, and further, to estimate the ATE.
• However, this requires us to reverse the logic chain.
III. When does OLS estimate the causal effect?
We motivated the linear model and OLS through the following chain of logic, but under what conditions can the arrows be reversed?
III. When does OLS estimate the causal effect?
Reverse Arrow 2: When is OLS a good estimator of the parameters in the linear model (now, the causal parameters)? Assumption 1 is still helpful, but it's not enough.
Under Assumptions 1-3, the OLS estimator $(\hat{\beta}_0, \hat{\beta}_1)$ is an unbiased and consistent estimator of $(\beta_0, \beta_1)$. Details in Topic IV.
IV. Sampling distribution of OLS
A. Under Assumptions 1-3, the OLS estimator $(\hat{\beta}_0, \hat{\beta}_1)$ is unbiased for $(\beta_0, \beta_1)$.
• $E[\hat{\beta}_1] = \beta_1$. Proof in Appendix 4.3.
• $E[\hat{\beta}_0] = \beta_0$. Proof as a PSET question (also see Exercise 4.7).
  • You CAN use $E[\hat{\beta}_1] = \beta_1$ as a known fact. (A small Monte Carlo illustration of unbiasedness is sketched below.)
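A minimal simulation sketch of unbiasedness (the data-generating process, with $\beta_1 = 0.5$ and $E[u_i \mid X_i] = 0$ holding by construction, is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

beta0, beta1 = 2.0, 0.5   # illustrative "true" parameters
n, n_reps = 100, 5000

slopes = np.empty(n_reps)
for r in range(n_reps):
    X = rng.normal(10.0, 2.0, size=n)
    u = rng.normal(0.0, 1.0, size=n)    # E[u_i | X_i] = 0 holds by construction
    Y = beta0 + beta1 * X + u
    slopes[r] = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)

# Unbiasedness: the average of beta1_hat across repeated samples is close to beta1.
print(slopes.mean())   # approximately 0.5
```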
IV. Sampling distribution of OLS
B. Under Assumptions 1-3, the OLS estimator $(\hat{\beta}_0, \hat{\beta}_1)$ is consistent for $(\beta_0, \beta_1)$.
• $\hat{\beta}_1 = \dfrac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2} \xrightarrow{p} \beta_1$. Heuristics:
  • $\bar{X}$ and $\bar{Y}$ are consistent for $\mu_X$ and $\mu_Y$. So $\hat{\beta}_1 \approx \dfrac{\sum_i (X_i - \mu_X)(Y_i - \mu_Y)}{\sum_i (X_i - \mu_X)^2} = \dfrac{\frac{1}{n}\sum_i (X_i - \mu_X)(Y_i - \mu_Y)}{\frac{1}{n}\sum_i (X_i - \mu_X)^2}$
  • WLLN: $\frac{1}{n}\sum_i (X_i - \mu_X)(Y_i - \mu_Y) \xrightarrow{p} E[(X_i - \mu_X)(Y_i - \mu_Y)] = \mathrm{cov}(X_i, Y_i)$.
  • WLLN: $\frac{1}{n}\sum_i (X_i - \mu_X)^2 \xrightarrow{p} E[(X_i - \mu_X)^2] = \mathrm{var}(X_i)$.
  • So $\hat{\beta}_1 \xrightarrow{p} \dfrac{\mathrm{cov}(X_i, Y_i)}{\mathrm{var}(X_i)}$.
  • Meanwhile, given $Y_i = \beta_0 + \beta_1 X_i + u_i$ and $\mathrm{cov}(X_i, u_i) = 0$ (the latter implied by $E[u_i \mid X_i] = 0$), we have $\mathrm{cov}(Y_i, X_i) = \mathrm{cov}(\beta_0 + \beta_1 X_i + u_i, X_i) = \mathrm{cov}(\beta_0, X_i) + \beta_1 \mathrm{cov}(X_i, X_i) + \mathrm{cov}(u_i, X_i) = 0 + \beta_1 \mathrm{var}(X_i) + 0$. Therefore, $\beta_1 = \mathrm{cov}(X_i, Y_i)/\mathrm{var}(X_i)$, so $\hat{\beta}_1 \xrightarrow{p} \beta_1$. (A numerical check is sketched below.)
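A quick numerical check of the heuristics above: on any sample, $\hat{\beta}_1$ equals the sample covariance of $X$ and $Y$ divided by the sample variance of $X$, and it approaches the true $\beta_1$ as $n$ grows (simulated data with illustrative parameter values):

```python
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1 = 2.0, 0.5   # illustrative "true" parameters

for n in (100, 10_000, 1_000_000):
    X = rng.normal(10.0, 2.0, size=n)
    u = rng.normal(0.0, 1.0, size=n)
    Y = beta0 + beta1 * X + u

    beta1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    cov_over_var = np.cov(X, Y, ddof=0)[0, 1] / np.var(X)  # sample cov / sample var

    # beta1_hat equals the sample cov/var ratio exactly, and both approach beta1.
    print(n, beta1_hat, cov_over_var)
```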
IV. Sampling distribution of OLS
B. Under Assumptions 1-3, the OLS estimator $(\hat{\beta}_0, \hat{\beta}_1)$ is consistent for $(\beta_0, \beta_1)$.
• $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \xrightarrow{p} \beta_0$
  • Proof as a PSET question.
  • You may use $\hat{\beta}_1 \xrightarrow{p} \beta_1$ as a known fact.
  • You also need to invoke Slutsky's theorem: if two random sequences $A_n \xrightarrow{p} A$ and $B_n \xrightarrow{p} B$, then $A_n B_n \xrightarrow{p} AB$ and $A_n + B_n \xrightarrow{p} A + B$.
IV. Sampling distribution of OLS
C. Under Assumptions 1-3, the OLS estimator $(\hat{\beta}_0, \hat{\beta}_1)$ is asymptotically normal (the key step in the proof is the CLT).
• $\hat{\beta}_1$ is approximately $N(\beta_1, \sigma^2_{\hat{\beta}_1})$, where $\sigma^2_{\hat{\beta}_1} = \dfrac{1}{n} \cdot \dfrac{\mathrm{var}[(X_i - \mu_X) u_i]}{[\mathrm{var}(X_i)]^2}$.
• $\hat{\beta}_0$ is approximately $N(\beta_0, \sigma^2_{\hat{\beta}_0})$, where $\sigma^2_{\hat{\beta}_0} = \dfrac{1}{n} \cdot \dfrac{\mathrm{var}(H_i u_i)}{[E(H_i^2)]^2}$, with $H_i = 1 - \dfrac{\mu_X}{E(X_i^2)} X_i$.
• Takeaways: (Recall $Y_i = \beta_0 \cdot 1 + \beta_1 \cdot X_i + u_i$. Decompose $Y_i$ into $1$, $X_i$ and $u_i$.)
  • If $n \to \infty$, then $\sigma^2_{\hat{\beta}_1} \to 0$ and $\sigma^2_{\hat{\beta}_0} \to 0$. The distributions collapse to $\beta_1$ and $\beta_0$, echoing consistency.
  • When $X_i$ has larger variance, $\hat{\beta}_1$ has smaller variance, i.e., is more accurate. The larger the variance of $X_i$ is, the more different $X_i$ is from the constant $1$ (which has zero variance). It is easier for the algorithm to recognize the different contributions of $X_i$ and $1$.
  • When $u_i$ has larger variance, $\hat{\beta}_1$ has larger variance. The noise is too noisy to extract accurate information about the contribution of $X_i$. (A simulation check of the variance formula is sketched below.)
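A simulation sketch comparing the formula for $\sigma^2_{\hat{\beta}_1}$ with the sampling variance of $\hat{\beta}_1$ across repeated samples (all parameter values are illustrative; with $X_i$ and $u_i$ independent here, $\mathrm{var}[(X_i - \mu_X)u_i] = \sigma_X^2 \sigma_u^2$):

```python
import numpy as np

rng = np.random.default_rng(3)
beta0, beta1 = 2.0, 0.5                  # illustrative "true" parameters
mu_X, sigma_X, sigma_u = 10.0, 2.0, 1.0
n, n_reps = 200, 20_000

slopes = np.empty(n_reps)
for r in range(n_reps):
    X = rng.normal(mu_X, sigma_X, size=n)
    u = rng.normal(0.0, sigma_u, size=n)
    Y = beta0 + beta1 * X + u
    slopes[r] = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)

# Formula: sigma^2_{beta1_hat} = var[(X - mu_X) u] / (n * var(X)^2).
# With X and u independent here, var[(X - mu_X) u] = sigma_X^2 * sigma_u^2.
formula_var = (sigma_X ** 2 * sigma_u ** 2) / (n * sigma_X ** 4)

print(slopes.var(), formula_var)   # the two numbers should be close
```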
V. On prediction
Now forget about causal inference. We just want to use a straight line to fit the data and perhaps make predictions.
• We still write $Y_i = \beta_0 + \beta_1 X_i + u_i$.
• But the only restriction (actually a normalization) is $E[u_i] = 0$.
  • This is not even an assumption!
  • It is without loss of generality because if $E[u_i] = c \neq 0$, then just let $\beta_0' = \beta_0 + c$ and $u_i' = u_i - c$. In the new regression $Y_i = \beta_0' + \beta_1 X_i + u_i'$ we have $E[u_i'] = 0$.
• Once we have estimated $\beta_0$ and $\beta_1$, we can make a prediction for any value $X_i = x$.
  • The prediction is formed by $\hat{\beta}_0 + \hat{\beta}_1 x$.
  • $x$ may be a value in your dataset (in-sample prediction).
  • $x$ may not be a value in your dataset (out-of-sample prediction).
• We want to know a) how well the straight line fits the data and b) how accurate a prediction is. (A quick numerical example of in-sample vs. out-of-sample prediction is sketched below.)
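A minimal sketch of in-sample vs. out-of-sample prediction with the estimated line (simulated data; the out-of-sample value $x = 25$ is chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(10.0, 2.0, size=150)              # simulated regressor
Y = 2.0 + 0.5 * X + rng.normal(0.0, 1.0, 150)    # illustrative line plus noise

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()

x_in = X[0]    # a value that appears in the dataset: in-sample prediction
x_out = 25.0   # a value outside the dataset: out-of-sample prediction
print(b0 + b1 * x_in, b0 + b1 * x_out)
```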
V. On prediction
a) Goodness of fit
• Fitted value: $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$.
• Residual: $\hat{u}_i = Y_i - \hat{Y}_i$.
• Then $Y_i = \hat{Y}_i + \hat{u}_i$.
  • If $\hat{u}_i = 0$, perfect fit at $i$.
• Inspired by this, define $SSR = \sum_i \hat{u}_i^2$.
  • SSR: sum of squared residuals.
• Define $ESS = \sum_i (\hat{Y}_i - \bar{Y})^2$.
  • ESS: explained sum of squares.
• Define $TSS = \sum_i (Y_i - \bar{Y})^2$.
  • TSS: total sum of squares.
a) Goodness of fit
• It can be shown that $ESS + SSR = TSS$.
• Define $R^2 = \dfrac{ESS}{TSS}$, or equivalently $R^2 = 1 - \dfrac{SSR}{TSS}$. The smaller the total magnitude of the residuals is, the larger $R^2$ is.
• $R^2$ is between 0 and 1 by construction.
• A large $R^2$ is usually favored when the goal is fitting and in-sample prediction.
• A large $R^2$ is NEVER a priority if the goal is causal inference!!!
  • A nearly zero $R^2$ is still OK as long as $E[u_i \mid X_i] = 0$. More in Module 6.
(A numerical check of the decomposition and of the two ways to compute $R^2$ is sketched below.)
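A numerical sketch verifying $ESS + SSR = TSS$ and that the two expressions for $R^2$ coincide (simulated data with illustrative parameter values):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(10.0, 2.0, size=150)
Y = 2.0 + 0.5 * X + rng.normal(0.0, 1.0, 150)    # illustrative data

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X
u_hat = Y - Y_hat

SSR = np.sum(u_hat ** 2)
ESS = np.sum((Y_hat - Y.mean()) ** 2)
TSS = np.sum((Y - Y.mean()) ** 2)

print(np.isclose(ESS + SSR, TSS))    # the decomposition ESS + SSR = TSS
print(ESS / TSS, 1 - SSR / TSS)      # two equivalent ways to compute R^2
```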
a) Goodness of fit
• Some useful facts to show $ESS + SSR = TSS$ (verified numerically in the sketch below):
$\frac{1}{n} \sum_i \hat{u}_i = 0$ (analogous to $E[u_i] = 0$, implied by Assumption 1 and the L.I.E.)
$\frac{1}{n} \sum_i \hat{Y}_i = \bar{Y}$
$\frac{1}{n} \sum_i \hat{u}_i X_i = 0$ (analogous to $E[u_i X_i] = 0$, implied by Assumption 1 and the L.I.E.)
$s_{\hat{u}X} = 0$ (analogous to $\mathrm{cov}(u_i, X_i) = 0$, implied by Assumption 1 and the L.I.E.)
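The same kind of sketch can verify these facts numerically on any OLS fit (simulated data with illustrative parameter values):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(10.0, 2.0, size=150)
Y = 2.0 + 0.5 * X + rng.normal(0.0, 1.0, 150)    # illustrative data

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X
u_hat = Y - Y_hat

print(np.isclose(u_hat.mean(), 0.0))         # (1/n) sum of residuals = 0
print(np.isclose(Y_hat.mean(), Y.mean()))    # mean of fitted values = Y-bar
print(np.isclose((u_hat * X).mean(), 0.0))   # (1/n) sum of residual * X = 0
```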
b) Prediction accuracy
• Define the standard error of the regression (SER) by
$SER = s_{\hat{u}}$, where $s_{\hat{u}}^2 = \dfrac{1}{n-2} \sum_i \hat{u}_i^2 = \dfrac{SSR}{n-2}$.
• $n - 2$ because at least two data points are needed to estimate $\beta_0$ and $\beta_1$.
  • Asymptotically, the $-2$ does not matter.
• $\hat{Y} \pm s_{\hat{u}}$ as a measure of prediction accuracy. (A small numerical sketch follows below.)
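A small sketch computing the SER and the band $\hat{Y} \pm s_{\hat{u}}$ at a hypothetical value of $X$ (simulated data; $x = 12$ is illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 150
X = rng.normal(10.0, 2.0, size=n)
Y = 2.0 + 0.5 * X + rng.normal(0.0, 1.0, n)      # illustrative data

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
u_hat = Y - (b0 + b1 * X)

SSR = np.sum(u_hat ** 2)
SER = np.sqrt(SSR / (n - 2))        # standard error of the regression

x_new = 12.0                        # hypothetical value of X for prediction
y_pred = b0 + b1 * x_new
print(y_pred - SER, y_pred + SER)   # rough band: Y_hat plus/minus s_u_hat
```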