
Econ 3334

Module 4
Linear Regression with a
Single Regressor:
Estimation
Department of Economics, HKUST
Instructor: Junlong Feng
Fall 2022
Menu of Module 4
I. Linear Regression Model
II. OLS Estimator
III. LS Assumptions for Causal Inference
IV. Sampling Distribution of OLS
V. On Prediction
2
I. Linear Regression Model

• Alice wants to choose a university to attend.
• She really wants to have a good GPA, or test score.
• Her friend Amy told her that the student-teacher ratio (STR) is an important factor that determines test scores.
• Alice wishes to know the mean test score for every possible STR.
• Knowing probability theory, what Alice wants to know is E[Testscore | STR].
• But how?

3
I. Linear Regression Model

• To know E[Testscore | STR] is hard:
• 1. You need information about the joint distribution of (Testscore, STR). Unknown.
• 2. For every possible value of STR, say c, you need to know E[Testscore | STR = c]. Infinite number of unknowns.
• Idea: use a model to simplify.
• The simplest model:
E[Testscore | STR] = β₀ + β₁·STR
• Nice because
• 1. No need to know the joint distribution.
• 2. Once β₀ and β₁ are known, you know everything! Only two unknowns.
• Drawback: Not robust; how do you know the conditional mean is linear in the conditioning variable?

4
I. Linear Regression Model

From conditional expectation to linear regression model:


• E[Testscore | STR] = β₀ + β₁·STR
• Let u be Testscore − E[Testscore | STR].
• Then Testscore = β₀ + β₁·STR + u
• Test score is indeed related to STR.
• But even if you know β₀ and β₁, you won't know the exact test score.
• u contains all other factors that affect test score.

5
I. Linear Regression Model

In general, suppose we have an i.i.d. sample {(Y_i, X_i) : i = 1, 2, …, n}
• A linear conditional expectation model: E[Y_i | X_i] = β₀ + β₁X_i for all i = 1, …, n
• Introduce u_i = Y_i − E[Y_i | X_i]
• Then Y_i = β₀ + β₁X_i + u_i. This is called a linear regression model. (A small simulation sketch of this model follows below.)

6
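As an illustration, here is a minimal Python sketch that simulates an i.i.d. sample from such a model. The values β₀ = 700, β₁ = −2 and the distributions of X and u are hypothetical, chosen only to mimic the test-score example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 420                        # sample size, as in the test-score data used later
beta0, beta1 = 700.0, -2.0     # hypothetical population intercept and slope

X = rng.uniform(14, 26, size=n)    # regressor (e.g., student-teacher ratios)
u = rng.normal(0, 15, size=n)      # unobservable, drawn so that E[u_i | X_i] = 0
Y = beta0 + beta1 * X + u          # linear regression model: Y_i = beta0 + beta1*X_i + u_i
```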
I. Linear Regression Model

Y_i = β₀ + β₁X_i + u_i
• Y: dependent variable
• X: independent variable, or regressor
• u: regression error, or simply, error (the conventional term, though not an accurate one in economics), or unobservable (more accurate)
• Contains unobserved factors that also affect Y_i.
• Also contains measurement error in Y_i.
• β₀: intercept
• β₁: slope
• β₀ + β₁X_i: population regression function, or population regression line

7
I. Linear Regression Model

• Blue line: population regression line.
• Points: {(X_i, Y_i)}, i = 1, …, n.
• The blue line does not pass through any of the points (recall that the expectation may lie outside the sample space! Bob's singing competition).
• The vertical distance from each point to the line is u_i.
• The points cannot all lie on the same side of the population regression line.

SW (4th edition) p.146.


8
II. Estimate (β₀, β₁): The OLS Estimator

Y_i = β₀ + β₁X_i + u_i
• E[Y_i | X_i] = β₀ + β₁X_i. The conditional expectation is a population quantity.
• β₀, β₁: population parameters.
• Use the random sample to estimate them.
• How?

9
II. Estimate (β₀, β₁): The OLS Estimator

Y_i = β₀ + β₁X_i + u_i
• Intuition: If u_i = 0 for all i, then Y_i = β₀ + β₁X_i.
• E[Y_i | X_i] = β₀ + β₁X_i is still satisfied.
• Since we can observe the Y_i's and the X_i's (e.g. test scores and STRs), we have n equations and two unknowns:
β₀ + β₁X₁ − Y₁ = 0
β₀ + β₁X₂ − Y₂ = 0
⋮
β₀ + β₁X_n − Y_n = 0
• In general, this system of equations has NO solution.

10
II. Estimate (β₀, β₁): The OLS Estimator

• Idea: Instead of forcing all n equations to hold exactly, try to make each of them as close to zero as possible.
• Maybe none of the equations is exactly zero in the end, but the total distance from zero is the smallest.
• For any b₀ and b₁:
• Distance of equation 1 from zero: b₀ + b₁X₁ − Y₁
• Distance of equation 2 from zero: b₀ + b₁X₂ − Y₂
• ⋯
• Total distance (squared Euclidean distance):
(b₀ + b₁X₁ − Y₁)² + (b₀ + b₁X₂ − Y₂)² + ⋯ + (b₀ + b₁X_n − Y_n)²

11
II. Estimate (β₀, β₁): The OLS Estimator

• Total distance (squared Euclidean distance):
(b₀ + b₁X₁ − Y₁)² + (b₀ + b₁X₂ − Y₂)² + ⋯ + (b₀ + b₁X_n − Y_n)²
• The b₀ and b₁ that minimize the total distance are called the OLS estimators, denoted by β̂₀ and β̂₁.
• That is, β̂₀ and β̂₁ are the minimizers of
min_{b₀,b₁} [(b₀ + b₁X₁ − Y₁)² + (b₀ + b₁X₂ − Y₂)² + ⋯ + (b₀ + b₁X_n − Y_n)²]
Or more compactly,
min_{b₀,b₁} Σ_{i=1}^{n} (b₀ + b₁X_i − Y_i)²
(A crude numerical sketch of this minimization follows below.)

12
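A crude numerical sketch of the minimization itself, using hypothetical simulated data: it evaluates the sum of squares over a grid of candidate (b₀, b₁) and keeps the smallest. The analytical solution on the next slide does the same job exactly.

```python
import numpy as np

# hypothetical simulated sample with beta0 = 700, beta1 = -2
rng = np.random.default_rng(0)
X = rng.uniform(14, 26, size=420)
Y = 700.0 - 2.0 * X + rng.normal(0, 15, size=420)

def total_sq_distance(b0, b1):
    """Total squared distance of the n equations from zero."""
    return np.sum((b0 + b1 * X - Y) ** 2)

# crude grid search over candidate intercepts and slopes (illustration only)
b0_grid = np.linspace(650, 750, 201)
b1_grid = np.linspace(-4.0, 0.0, 201)
best = min((total_sq_distance(b0, b1), b0, b1) for b0 in b0_grid for b1 in b1_grid)
print(best[1], best[2])   # approximate minimizers, near 700 and -2 (up to grid resolution)
```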
II. Estimate (β₀, β₁): The OLS Estimator
• The least squares estimator was first invented by Gauss (1795, 1809) and Legendre (1805).
• More than 200 years old.
• It’s still among the most powerful
estimators/predictors/machine
learners in statistics, econometrics,
machine learning, etc.

13
II. Estimate (β₀, β₁): The OLS Estimator

min_{b₀,b₁} Σ_{i=1}^{n} (b₀ + b₁X_i − Y_i)²
• This OLS minimization problem has an analytical solution:
β̂₁ = Σ_i (X_i − X̄)(Y_i − Ȳ) / Σ_i (X_i − X̄)² = s_XY / s_X²
β̂₀ = Ȳ − β̂₁X̄
• Derivation in Appendix 4.2. Not required.
• Predicted value (or, fitted value): Ŷ_i = β̂₀ + β̂₁X_i
• Compare with the population regression line.
• Residual: û_i = Y_i − Ŷ_i
• Compare with the error u_i.
(A short code sketch of these formulas follows below.)

14
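The closed-form formulas translate directly into code. A minimal sketch on hypothetical simulated data (the helper name ols_fit is mine):

```python
import numpy as np

def ols_fit(X, Y):
    """OLS estimates from the analytical formulas on this slide."""
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    b0 = Y.mean() - b1 * X.mean()
    return b0, b1

# hypothetical simulated sample with beta0 = 700, beta1 = -2
rng = np.random.default_rng(0)
X = rng.uniform(14, 26, size=420)
Y = 700.0 - 2.0 * X + rng.normal(0, 15, size=420)

b0_hat, b1_hat = ols_fit(X, Y)
Y_fit = b0_hat + b1_hat * X    # predicted (fitted) values
u_hat = Y - Y_fit              # residuals
print(b0_hat, b1_hat)          # close to 700 and -2
```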
II. Estimate (β₀, β₁): The OLS Estimator

• Still the Testscore-STR sample. n = 420
• Blue line: Ŷ = β̂₀ + β̂₁X
• Districts with one more student per teacher on average have test scores that are 2.28 points lower.
• The intercept (taken literally) means that, according to this estimated line, districts with zero students per teacher would have a predicted test score of 698.9. But this interpretation of the intercept makes no sense: it extrapolates the line outside the range of the data. Here, the intercept is not economically meaningful.
SW (4th edition) p.151.
15
II. Estimate (β₀, β₁): The OLS Estimator

• One of the districts in the data set is Antelope, for which STR = 19.33 and Test Score = 657.8
• Predicted value: Ŷ_Antelope = 698.9 − 2.28 × 19.33 = 654.8
• Residual: û_Antelope = 657.8 − 654.8 = 3

SW (4th edition) p.151.


16
III. When does OLS estimate the causal effect? The
Potential Outcome Framework
How do we define causality?
• There are a couple of approaches to defining causality. The most popular one among econometricians and economists is the potential outcome framework developed by Rubin in the 1970s. It gained popularity in economics through the work of Angrist, Imbens, and Rubin himself in the 1990s. (Angrist and Imbens won the 2021 Nobel Prize in Economics for it.)
• Let's think about a very simple question: does going to the library every day cause an A+ in ECON3334 this semester?
• Let's use X to denote whether Alice goes to the library every day. X = 1 if she does; X = 0 if not.
• Let her final grade be Y. Of course Y is determined by a lot of things. Let's use V to represent all other factors besides X. For example, V may contain her interest in 3334, her math background, her time management skills, etc.

17
III. When does OLS estimate the causal effect? The
Potential Outcome Framework
Mathematically, you can imagine there is a production function g, and
Y = g(X, V)
(Of course g is abstract and unknown.)
Now, imagine there are two parallel universes. In the first universe, X = 1. In the second, X = 0. In both universes, V is the same.
• The causal effect is defined as g(1, V) − g(0, V).
• The two terms g(1, V) and g(0, V) are called potential outcomes, denoted by Y(1) and Y(0).
The problem, however, is that Alice only lives once. She cannot observe the two potential outcomes at the same time: she either goes to the library every day this semester or she does not.
• This is called the fundamental problem of causal inference.

18
III. When does OLS estimate the causal effect? The
Potential Outcome Framework
So, instead of getting to know Alice's own causal effect, economists try to learn something less informative but still useful enough in many settings: the average treatment effect (ATE):

ATE(x, x′) = E[Y(x)] − E[Y(x′)]

• As we have seen, Y(x) = g(x, V). Since V is unknown to the researcher, we treat it as random, so it is meaningful to take the expectation.
• ATE means the average effect in a population.
• Although the fundamental problem still exists, knowing the ATE is possible:
• Suppose the university (of course not UST! This policy is too bad, eww!) launches a policy under which whether a student goes to the library every day (X) is determined by flipping a coin. Then X is independent of everything else in V.
• Then E[Y(x)] = E[g(x, V)] = E[g(x, V) | X = x] = E[g(X, V) | X = x] = E[Y | X = x]!
• Therefore, under such random assignment of X, ATE(x, x′) = E[Y | X = x] − E[Y | X = x′].
• Under our linearity assumption that E[Y | X] = β₀ + β₁X, ATE = β₁! (A small simulation of this argument follows below.)

19
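A small simulation illustrates the argument. The production function g and all numbers below are hypothetical, with a true causal effect of 5 points; under coin-flip assignment, the difference of the two conditional sample means recovers it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

V = rng.normal(0, 1, size=n)       # all other factors (interest, math background, ...)
X = rng.integers(0, 2, size=n)     # coin flip: random assignment, independent of V

def g(x, v):
    """Hypothetical production function for the final grade."""
    return 60 + 5 * x + 10 * v     # true causal effect of X is 5

Y = g(X, V)

# under random assignment, E[Y | X = 1] - E[Y | X = 0] equals the ATE
ate_hat = Y[X == 1].mean() - Y[X == 0].mean()
print(ate_hat)                     # close to 5
```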
III. When does OLS estimate the causal effect? The
Potential Outcome Framework
To summarize:
• ATE is what people aim for given the fundamental problem of causal inference.
• ATE can be expressed as the difference of two conditional expectations of 𝑌 given 𝑋
under random assignment.
• Conditional expectation of 𝑌 given 𝑋 can be assumed to be linear.
• In Module 7, I’ll show you cases where linearity is NOT an assumption but a mathematical truth.
• OLS is a tool to estimate the coefficients in a linear model.
• So, putting everything together, we are tempted to use OLS to estimate the conditional expectation and, further, the ATE.
• However, this requires that we reverse the logic chain.

20
III. When does OLS estimate the causal effect?

We motivated the linear model and OLS with the following logic, but under what conditions can the arrows be reversed?

From E[Y_i | X_i] = β₀ + β₁X_i, define u_i = Y_i − E[Y_i | X_i]
→ Linear regression model: Y_i = β₀ + β₁X_i + u_i
→ OLS: min_{b₀,b₁} Σ_i (Y_i − b₀ − b₁X_i)²

1. When can a linear model represent a conditional mean function? (Reversing the first arrow)
2. Even without the model, solving the OLS minimization problem always gives a number. When is that number a good estimate of the coefficients of the linear model? (Reversing the second arrow)

21
III. When does OLS estimate the causal effect?

Reverse Arrow 1: Give the linear model a conditional-mean interpretation.

From E[Y_i | X_i] = β₀ + β₁X_i, define u_i = Y_i − E[Y_i | X_i]
→ Linear regression model: Y_i = β₀ + β₁X_i + u_i
→ OLS: min_{b₀,b₁} Σ_i (Y_i − b₀ − b₁X_i)²

Assumption 1: E[u_i | X_i] = 0
• Proof: Taking the conditional expectation given X_i on both sides of the linear model, we have E[Y_i | X_i] = β₀ + β₁X_i + E[u_i | X_i]. Under Assumption 1, E[Y_i | X_i] = β₀ + β₁X_i.
• The assumption is called unconfoundedness in causal inference.

22
III. When does OLS estimate the causal effect?

Reverse Arrow 2: When is OLS a good estimator of the parameters in the linear model (now the causal parameters)? Assumption 1 is still helpful, but it's not enough.

From E[Y_i | X_i] = β₀ + β₁X_i, define u_i = Y_i − E[Y_i | X_i]
→ Linear regression model: Y_i = β₀ + β₁X_i + u_i
→ OLS: min_{b₀,b₁} Σ_i (Y_i − b₀ − b₁X_i)²

Assumption 2: {(X_i, Y_i) : i = 1, …, n} is an i.i.d. sample.
• Reason: we are going to use asymptotic theory.
Assumption 3: Large outliers are unlikely.
• Reason (partly): asymptotic theory holds for a large sample, but no matter how large the sample is, if the minimization problem is driven by a few outliers, the effective sample size is small and the theory does not work.

23
III. When does OLS estimate the causal effect?

Reverse Arrow 2: When is OLS a good estimator of the parameters in the linear model
(now, the causal parameters)? Assumption 1 is still helpful, but it’s not enough.

From E[Y_i | X_i] = β₀ + β₁X_i, define u_i = Y_i − E[Y_i | X_i]
→ Linear regression model: Y_i = β₀ + β₁X_i + u_i
→ OLS: min_{b₀,b₁} Σ_i (Y_i − b₀ − b₁X_i)²

Under Assumptions 1-3, the OLS estimator (β̂₀, β̂₁) is an unbiased and consistent estimator of (β₀, β₁). Details in Topic IV.

24
III. When does OLS estimate the causal effect?

When do Assumptions 1-3 fail?


• Failure of Assumption 1 (unconfoundedness).
• Given Y_i = β₀ + β₁X_i + u_i, recall that u_i contains all factors other than X_i that affect Y_i.
• Since E[u_i | X_i] = 0 ⇒ cov(u_i, X_i) = 0, Assumption 1 fails if u_i and X_i are correlated.
• In the test score example, the assumption fails if there are factors that are correlated with STR and also determine test scores. There can be many. (A simulation sketch of this failure follows below.)
• Failure of Assumption 2 (i.i.d.).
• i.i.d. usually holds in cross-sectional data. Suppose Alice went to LG1 for lunch and Bob went to LG7 for lunch. There is no way to know where Amy went for lunch.
• It usually does not hold in time series data. Suppose Alice has gone to LG1 for lunch every day; where would she go for lunch tomorrow?

25
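A sketch of the Assumption 1 failure with made-up numbers: an omitted factor W enters both X and u, so cov(u, X) ≠ 0 and the OLS slope no longer recovers β₁.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
beta0, beta1 = 700.0, -2.0

W = rng.normal(0, 1, size=n)                  # omitted factor (hypothetical, e.g., district income)
X = 20 + 2 * W + rng.normal(0, 1, size=n)     # regressor correlated with the omitted factor
u = 5 * W + rng.normal(0, 5, size=n)          # u contains W, so cov(u, X) != 0: Assumption 1 fails
Y = beta0 + beta1 * X + u

b1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
print(b1_hat)    # about 0 rather than -2: OLS no longer estimates the causal slope
```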
III. When does OLS estimate the causal effect?

Failure of Assumption 3 (no large outliers).
• Not a concern for bounded data.
• Outliers may be due to data entry errors.
• Or due to heavy-tailed (fat-tailed) data, which is common in finance applications.

SW (4th edition) p.159.


26
IV. Sampling distribution of OLS

It is useful to know the distribution of (β̂₀, β̂₁).
• Recall that unbiasedness is about the mean of the estimator, and the mean is a feature of the distribution.
• Recall that hypothesis tests and confidence intervals are constructed by choosing appropriate critical values to control certain probabilities.

27
IV. Sampling distribution of OLS

A. Under Assumptions 1-3, the OLS estimator (β̂₀, β̂₁) is unbiased for (β₀, β₁).
• E[β̂₁] = β₁. Proof in Appendix 4.3.
• E[β̂₀] = β₀. Proof as a PSET question (also see Exercise 4.7).
• You CAN use E[β̂₁] = β₁ as known.
(A Monte Carlo sketch follows below.)

28
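A Monte Carlo sketch of what unbiasedness means, with hypothetical parameter values: across many repeated samples, the average of β̂₁ is close to β₁ even though each individual estimate is noisy.

```python
import numpy as np

rng = np.random.default_rng(3)
beta0, beta1 = 700.0, -2.0   # hypothetical true values
n, reps = 100, 5_000

estimates = np.empty(reps)
for r in range(reps):
    X = rng.uniform(14, 26, size=n)
    u = rng.normal(0, 15, size=n)          # E[u | X] = 0 by construction
    Y = beta0 + beta1 * X + u
    estimates[r] = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)

print(estimates.mean())                    # close to beta1 = -2: unbiased
```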
IV. Sampling distribution of OLS

B. Under Assumptions 1-3, the OLS estimator (β̂₀, β̂₁) is consistent for (β₀, β₁).
• β̂₁ = Σ_i (X_i − X̄)(Y_i − Ȳ) / Σ_i (X_i − X̄)² = s_XY / s_X² →p β₁ (→p denotes convergence in probability). Heuristics:
• X̄ and Ȳ are consistent for μ_X and μ_Y. So β̂₁ ≈ [(1/n) Σ_i (X_i − μ_X)(Y_i − μ_Y)] / [(1/n) Σ_i (X_i − μ_X)²]
• WLLN: (1/n) Σ_i (X_i − μ_X)(Y_i − μ_Y) →p E[(X_i − μ_X)(Y_i − μ_Y)] = cov(X_i, Y_i).
• WLLN: (1/n) Σ_i (X_i − μ_X)² →p E[(X_i − μ_X)²] = var(X_i).
• So β̂₁ →p cov(X_i, Y_i) / var(X_i).
• Meanwhile, given Y_i = β₀ + β₁X_i + u_i and cov(X_i, u_i) = 0 (the latter implied by E[u_i | X_i] = 0), we have cov(Y_i, X_i) = cov(β₀ + β₁X_i + u_i, X_i) = cov(β₀, X_i) + β₁ cov(X_i, X_i) + cov(u_i, X_i) = 0 + β₁ var(X_i) + 0. Therefore, β₁ = cov(X_i, Y_i) / var(X_i).
(A simulation sketch of this convergence follows below.)

29
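A sketch of consistency with hypothetical numbers: as n grows, the estimate β̂₁ = s_XY / s_X² settles down at β₁.

```python
import numpy as np

rng = np.random.default_rng(4)
beta0, beta1 = 700.0, -2.0   # hypothetical true values

for n in (100, 1_000, 10_000, 100_000):
    X = rng.uniform(14, 26, size=n)
    u = rng.normal(0, 15, size=n)
    Y = beta0 + beta1 * X + u
    # sample analogue of cov(X, Y) / var(X)
    b1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    print(n, b1_hat)                       # approaches beta1 = -2 as n grows
```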
IV. Sampling distribution of OLS

B. Under Assumptions 1-3, the OLS estimator (β̂₀, β̂₁) is consistent for (β₀, β₁).
• β̂₀ = Ȳ − β̂₁X̄ →p β₀
• Proof as a PSET question.
• You may use β̂₁ →p β₁ as a known fact.
• You also need to invoke Slutsky's theorem: if two random sequences A_n →p A and B_n →p B, then A_n·B_n →p A·B and A_n + B_n →p A + B.

30
IV. Sampling distribution of OLS

C. Under Assumptions 1-3, the OLS estimator (β̂₀, β̂₁) is asymptotically normal (the key in the proof is the CLT).
• β̂₁ is approximately N(β₁, σ²_β̂₁), where σ²_β̂₁ = (1/n) · var[(X_i − μ_X)u_i] / [var(X_i)]².
• β̂₀ is approximately N(β₀, σ²_β̂₀), where σ²_β̂₀ = (1/n) · var(H_i u_i) / [E(H_i²)]², where H_i = 1 − (μ_X / E[X_i²]) X_i.
• Takeaways: (Recall Y_i = β₀ · 1 + β₁ · X_i + u_i. Decompose Y_i into 1, X_i and u_i.)
• If n → ∞, then σ²_β̂₁ → 0 and σ²_β̂₀ → 0. The distributions collapse onto β₁ and β₀, echoing consistency.
• When X_i has a larger variance, β̂₁ has a smaller variance, i.e., it is more accurate. The larger the variance of X_i, the more different X_i is from the constant 1 (which has zero variance), so it is easier for the algorithm to recognize the different contributions of X_i and 1.
• When u_i has a larger variance, β̂₁ has a larger variance. The noise is too noisy to extract accurate information about the contribution of X_i.
(A simulation check of the variance formula follows below.)

31
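A simulation check of the variance formula for β̂₁, under a hypothetical design: X uniform on [14, 26], u normal with standard deviation 15 and independent of X, so var[(X_i − μ_X)u_i] = var(u_i)·var(X_i).

```python
import numpy as np

rng = np.random.default_rng(5)
beta0, beta1 = 700.0, -2.0
n, reps = 200, 20_000

# population quantities for this hypothetical design
var_X = (26 - 14) ** 2 / 12                              # variance of Uniform(14, 26) = 12
sigma2_b1 = (1 / n) * (15 ** 2 * var_X) / var_X ** 2     # (1/n) * var[(X - mu_X)u] / var(X)^2

estimates = np.empty(reps)
for r in range(reps):
    X = rng.uniform(14, 26, size=n)
    u = rng.normal(0, 15, size=n)
    Y = beta0 + beta1 * X + u
    estimates[r] = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)

print(estimates.var(), sigma2_b1)   # Monte Carlo variance vs. the formula; both about 0.094
```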
V. On prediction

Now forget about causal inference. We just want to use a straight line to fit the data and
perhaps make predictions.
• We still write Y_i = β₀ + β₁X_i + u_i
• But the only restriction (actually a normalization) is E[u_i] = 0.
• This is not even an assumption!
• It is without loss of generality because if E[u_i] = c ≠ 0, then just let β₀′ = β₀ + c and u_i′ = u_i − c. In the new regression Y_i = β₀′ + β₁X_i + u_i′ we have E[u_i′] = 0.
• Once we have estimated β₀ and β₁, we can make a prediction for any value X_i = x.
• The prediction is formed as β̂₀ + β̂₁x.
• x may be a value in your dataset (in-sample prediction).
• x may not be a value in your dataset (out-of-sample prediction).
• We want to know a) how well the straight line fits the data and b) how accurate a prediction is.

32
V. On prediction

a) Goodness of fit
• Fitted value: Ŷ_i = β̂₀ + β̂₁X_i
• Residual: û_i = Y_i − Ŷ_i
• Then Y_i = Ŷ_i + û_i
• If û_i = 0, perfect fit at i.
• Inspired by this, define SSR = Σ_i û_i²
• SSR: sum of squared residuals
• Define ESS = Σ_i (Ŷ_i − Ȳ)²
• ESS: explained sum of squares
• Define TSS = Σ_i (Y_i − Ȳ)²
• TSS: total sum of squares

SW (4th edition) p.146.


33
V. On prediction

a) Goodness of fit
• It can be shown that ESS + SSR = TSS.
• Define R² = ESS/TSS, or equivalently R² = 1 − SSR/TSS. The smaller the total magnitude of the residuals, the larger R² is.
• R² is between 0 and 1 by construction.
• A large R² is usually favored when the goal is fitting and in-sample prediction.
• A large R² is NEVER a priority if the goal is causal inference!!!
• A nearly zero R² is still OK as long as E[u_i | X_i] = 0. More in Module 6.
(A code sketch computing R² follows below.)

SW (4th edition) p.146.


34
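A short sketch on hypothetical simulated data that computes the three sums of squares, checks the decomposition, and computes R² both ways.

```python
import numpy as np

# hypothetical simulated sample and its OLS fit
rng = np.random.default_rng(0)
X = rng.uniform(14, 26, size=420)
Y = 700.0 - 2.0 * X + rng.normal(0, 15, size=420)
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
Y_fit = b0 + b1 * X
u_hat = Y - Y_fit

SSR = np.sum(u_hat ** 2)                 # sum of squared residuals
ESS = np.sum((Y_fit - Y.mean()) ** 2)    # explained sum of squares
TSS = np.sum((Y - Y.mean()) ** 2)        # total sum of squares

print(np.isclose(ESS + SSR, TSS))        # True: ESS + SSR = TSS
print(ESS / TSS, 1 - SSR / TSS)          # the two equivalent definitions of R^2
```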
V. On prediction

a) Goodness of fit
• Some useful facts used to show ESS + SSR = TSS:
(1/n) Σ_i û_i = 0 (analogous to E[u_i] = 0; implied by Assumption 1 and the L.I.E.)
(1/n) Σ_i Ŷ_i = Ȳ
(1/n) Σ_i û_i X_i = 0 (analogous to E[u_i X_i] = 0; implied by Assumption 1 and the L.I.E.)
s_ûX = 0, i.e., the sample covariance between û_i and X_i is zero (analogous to cov(u_i, X_i) = 0; implied by Assumption 1 and the L.I.E.)

• Proofs in Appendix 4.3


35
V. On prediction

b) Prediction accuracy
• Define the standard error of the regression (SER) by
SER = s_û, where s_û² = (1/(n − 2)) Σ_i û_i² = SSR / (n − 2)
• n − 2 because at least two data points are needed to estimate β₀ and β₁.
• Asymptotically, the −2 does not matter.
• Ŷ ± s_û as a measure of prediction accuracy.
(A code sketch follows below.)

36
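Finally, a sketch of the SER computation on hypothetical simulated data, with the n − 2 correction.

```python
import numpy as np

# hypothetical simulated sample and its OLS residuals
rng = np.random.default_rng(0)
X = rng.uniform(14, 26, size=420)
Y = 700.0 - 2.0 * X + rng.normal(0, 15, size=420)
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
u_hat = Y - (b0 + b1 * X)

n = len(Y)
SSR = np.sum(u_hat ** 2)
SER = np.sqrt(SSR / (n - 2))   # s_u^2 = SSR / (n - 2); two coefficients were estimated
print(SER)                     # close to the standard deviation of the simulated error (15)
```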
