Part A: Regression and causality
A1: Key facts about regression
Kirill Borusyak
ARE 213 Applied Econometrics
UC Berkeley, Fall 2024
Acknowledgments
These lecture slides draw on the materials by Michael Anderson, Peter Hull, Paul
Goldsmith-Pinkham, and Michal Kolesar
All errors are mine — please let me know if you spot them!
What is this course about (1)
Goal: help you do rigorous empirical (micro)economic research
Focus on causal inference / program evaluation / treatment effects
What is shared by [the causal] literature is [...] an explicit emphasis on credibly
estimating causal effects, a recognition of the heterogeneity in these effects, clarity in the
identifying assumptions, and a concern about endogeneity of choices and the role study
design plays. (Imbens, 2010, “Better LATE Than Nothing”)
What is this course about (2)
Focus on most common research designs / identification strategies
The econometrics literature has developed a small number of canonical settings where
researchers view the specific causal models and associated statistical methods as well
established and understood. [They are] referred to as identification strategies. [These]
include unconfoundedness, IV, DiD, RDD, and synthetic control methods and are
familiar to most empirical researchers in economics. The [associated] methods are
commonly used in empirical work and are constantly being refined, and new
identification strategies are occasionally added to the canon. Empirical strategies not
currently in this canon, rightly or wrongly, are viewed with much more suspicion until
they reach the critical momentum to be added. (Imbens, 2020)
We will study target estimands, assumptions, tests, estimators, statistical inference
Introduce multi-purpose econometric tools: e.g. randomization inference
Course outline (1)
A. Introduction: regression and causality (~4 lectures)
▶ Key facts about regression; potential outcomes and RCTs
B. Selection on observables (~4 lectures)
▶ Covariate adjustment via regression, via propensity scores, doubly-robust methods,
double machine learning
C. Panel data methods (~7 lectures)
▶ Diff-in-diffs and event studies; synthetic controls and factor models
Course outline (2)
D. Instrumental variables (IVs) (~7 lectures)
▶ Linear IV; IV with treatment effect heterogeneity
▶ Formula instruments, recentering, shift-share IV, spillovers
▶ Examiner designs (“judge IVs”)
E. Regression discontinuity (RD) designs (~3 lectures)
▶ Sharp and fuzzy RD designs and various extensions
F. Miscellaneous topics (~3 lectures)
▶ Nonlinear models: Poisson regression, quantile regression
▶ Statistical inference: clustering, bootstrap
▶ Topics of your interest (email me in advance!)
Course outline (3)
Currently not covered
Descriptive statistics, data visualization
Structural estimation
Time series data
Experimental design
Textbooks
MHE Angrist, Joshua and Jörn-Steffen Pischke (2009). Mostly Harmless Econometrics.
Princeton University Press.
CT Cameron, A. Colin and Pravin Trivedi (2005). Microeconometrics: Methods and
Applications. Cambridge University Press.
IW Imbens, Guido and Jeffrey Wooldridge (2009). New developments in econometrics:
Lecture notes.
https://www.cemmap.ac.uk/resource/new-developments-in-econometrics/
JW Wooldridge, Jeffrey (2002). Econometric Analysis of Cross Section and Panel
Data. MIT Press. (Or second edition from 2010)
Some econometric vocabulary
OLS estimator: β̂ = (X′X)−1 X′Y ≡ ((1/N) ∑_{i=1}^N Xi X′i)−1 ((1/N) ∑_{i=1}^N Xi Yi)
▶ Random variable, function of the observed sample
OLS estimand: βOLS = E [XX′ ]−1 E [XY] ≡ E [Xi X′i ]−1 E [Xi Yi ] (assuming a
random sample)
▶ A non-stochastic population parameter
▶ β̂ →ᵖ βOLS with a random sample under weak regularity conditions
▶ This does not involve assuming a model, exogeneity conditions etc.
β̂ and βOLS correspond to a linear specification Yi = β ′ Xi + error
▶ Just notational convention for reg Y X, not necessarily a model
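A minimal NumPy sketch of the estimator/estimand distinction (my own illustration, not part of the slides; the DGP and seed are made up): β̂ is recomputed from samples of growing size and settles down at βOLS.

# Compute beta_hat = (X'X)^{-1} X'Y and watch it approach the estimand as N grows
import numpy as np

rng = np.random.default_rng(0)

def ols(X, Y):
    """OLS estimator (X'X)^{-1} X'Y."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

for N in [100, 10_000, 1_000_000]:
    x = rng.normal(size=N)
    Y = 1 + 2 * x + rng.normal(size=N)      # arbitrary DGP
    X = np.column_stack([np.ones(N), x])    # include an intercept
    print(N, ols(X, Y))                     # converges to beta_OLS = (1, 2)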
Some econometric vocabulary (2)
An economic or statistical model is needed to interpret βOLS and other estimands
▶ A model involves parameters (with economic meaning) and assumptions
(restricting the DGP)
▶ Assumptions hopefully make some parameters identified, i.e. possible to uniquely
determine from everything the data contain — here, the distribution of (X, Y)
Some econometric vocabulary (3)
Example 1: demand and supply
Qi = −βd Pi + εd , Qi = βs Pi + εs , Cov [εd , εs ] = 0
▶ Regressing Qi on Pi and a constant yields (prove this!)
βOLS = (Var[εs] / (Var[εd] + Var[εs])) · (−βd) + (Var[εd] / (Var[εd] + Var[εs])) · βs
Example 2: heterogeneous effects
Yi = βi Xi + εi ,  Xi ⊥⊥ (βi , εi )
▶ Regressing Yi on Xi and a constant yields (prove this!)
βOLS = E [βi ]
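A quick simulation sketch of Example 2 (my own illustration; the uniform distribution for βi and the rest of the DGP are arbitrary choices): with slopes independent of Xi, the OLS slope recovers E[βi].

# Heterogeneous slopes beta_i independent of X_i: the OLS slope is E[beta_i]
import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000
beta_i = rng.uniform(0, 2, size=N)          # E[beta_i] = 1
eps = rng.normal(size=N)
x = rng.normal(size=N)                      # independent of (beta_i, eps)
Y = beta_i * x + eps

X = np.column_stack([np.ones(N), x])
slope = np.linalg.solve(X.T @ X, X.T @ Y)[1]
print(slope)                                # close to E[beta_i] = 1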
Outline
1 Course intro
2 What is regression and why do we use it?
3 Linear regression and its mechanics
Regression and its uses
Regression of Y on X ≡ conditional expectation function (CEF):
h(·) : x ↦ h(x) ≡ E [Yi | Xi = x]
Conditional expectation E [Yi | Xi ] = h(Xi ) is a random variable because Xi is
Uses of regression:
Descriptive: how Y on average covaries with X — by definition
Prediction: if we know Xi , our best guess for Yi is h(Xi ) — prove next
Causal inference: what happens to Yi if we manipulate Xi — sometimes
Regression as optimal prediction (1)
What counts as the best guess is defined by a loss function
Proposition: CEF is the best predictor with quadratic loss:
h(·) = arg min_{g(·)} E[(Yi − g(Xi))²]
Lemma: the CEF residual Yi − E [Yi | Xi ] is mean-zero and uncorrelated with any
g(Xi ).
▶ Proof by the law of iterated expectations (LIE)
▶ E [Yi − E [Yi | Xi ]] = E [E [Yi − E [Yi | Xi ] | Xi ]] = 0
▶ E [(Yi − h(Xi )) g(Xi )] = E [E [(Yi − h(Xi )) g(Xi ) | Xi ]] =
E [E [Yi − h(Xi ) | Xi ] · g(Xi )] = 0
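A small numerical check of the Lemma (my own sketch, with an arbitrary discrete DGP so the CEF can be computed exactly by group means): the CEF residual has mean zero and zero covariance with an arbitrary g(Xi).

# CEF residual is mean-zero and uncorrelated with any function of X
import numpy as np

rng = np.random.default_rng(2)
N = 500_000
x = rng.integers(0, 5, size=N)                 # discrete X with 5 values
Y = np.exp(x) + rng.normal(size=N) * (1 + x)   # arbitrary nonlinear DGP

h = np.array([Y[x == v].mean() for v in range(5)])  # h(x) = E[Y | X = x] by group means
resid = Y - h[x]                                    # CEF residual

g = np.sin(x)                                  # any function of X
print(resid.mean())                            # ~ 0
print(np.cov(resid, g)[0, 1])                  # ~ 0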
Regression as optimal prediction (2)
Proposition: CEF is the best predictor with quadratic loss:
h(·) = arg min_{g(·)} E[(Yi − g(Xi))²]
Lemma: the CEF residual Yi − E [Yi | Xi ] is mean-zero and uncorrelated with any
g(Xi ).
Proposition proof:
E[(Yi − g(Xi))²] = E[{(Yi − h(Xi)) + (h(Xi) − g(Xi))}²]
= E[(Yi − h(Xi))²] + 2E[(Yi − h(Xi))(h(Xi) − g(Xi))] + E[(h(Xi) − g(Xi))²]
= E[(Yi − h(Xi))²] + E[(h(Xi) − g(Xi))²] ≥ E[(Yi − h(Xi))²]
where the cross term vanishes by the Lemma, since h(Xi) − g(Xi) is a function of Xi
Regression as optimal prediction: Exercise
What is the best predictor with loss |Yi − g(Xi)|, i.e. arg min_{g(·)} E[|Yi − g(Xi)|]?
Or with the “check” loss function (slope q ∈ (0, 1) on the right, q − 1 on the left)?
Hint: solve it first assuming Xi takes only one value
Note: this exercise is linked to quantile regression
Outline
1 Course intro
2 What is regression and why do we use it?
3 Linear regression and its mechanics
Five reasons for linear regression
What does CEF have to do with least squares estimand βOLS = E [XX′ ]−1 E [XY]? And
why do we use it instead of E [Y | X]?
1. Curse of dimensionality: E [Y | X] is hard to estimate when X is high-dimensional
[but machine learning methods make it easier]
2. OLS and CEF solve similar problems: X′ βOLS is the best linear predictor of Y, i.e.
βOLS = arg min_b E[(Y − X′b)²]
3. OLS is also the best linear approximation to the CEF:
βOLS = arg min_b E[(E[Y | X] − X′b)²]
Five reasons for linear regression (cont.)
1. Curse of dimensionality: E [Y | X] is hard to estimate when X is high-dimensional
2. OLS and CEF solve similar problems: X′ βOLS is the best linear predictor of Y, i.e.
βOLS = arg min_b E[(Y − X′b)²]
3. OLS is also the best linear approximation to the CEF:
βOLS = arg min_b E[(E[Y | X] − X′b)²]
▶ Proof by FOC: E[X (E[Y | X] − X′b)] = 0 ⟹
b = E[XX′]−1 E[X E[Y | X]] = E[XX′]−1 E[XY] = βOLS (last step by LIE)
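A quick check of point 3 (my own sketch; the quadratic CEF is an arbitrary example): regressing Y on X and regressing E[Y | X] on X give the same coefficients, because the CEF residual is orthogonal to any function of X.

# OLS on Y vs. OLS on the CEF E[Y | X]: same coefficients
import numpy as np

rng = np.random.default_rng(3)
N = 400_000
x = rng.integers(0, 10, size=N).astype(float)
Y = (x - 4) ** 2 + rng.normal(size=N)        # nonlinear CEF plus noise

h = np.zeros(N)                              # E[Y | X], estimated by group means
for v in np.unique(x):
    h[x == v] = Y[x == v].mean()

X = np.column_stack([np.ones(N), x])
b_y = np.linalg.solve(X.T @ X, X.T @ Y)      # regress Y on X
b_h = np.linalg.solve(X.T @ X, X.T @ h)      # regress E[Y | X] on X
print(b_y, b_h)                              # identical (up to rounding)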
Five reasons for linear regression (cont.)
1. Curse of dimensionality: E [Y | X] is hard to estimate when X is high-dimensional
2. OLS and CEF solve similar problems: X′ βOLS is the best linear predictor of Y
3. OLS is also the best linear approximation to the CEF
4. With scalar X, βOLS is a convexly-weighted average of dE [Y | X = x] /dx (or its
discrete analog)
Proof of #4: Discrete X (with values x0 < · · · < xK)
Rewrite E[Y | X = x] ≡ h(x) = h(x0) + ∑_{k=1}^K (h(xk) − h(xk−1)) 1[x ≥ xk]
Thus Cov[Y, X] = Cov[E[Y | X], X] = ∑_{k=1}^K (h(xk) − h(xk−1)) Cov[1[X ≥ xk], X], and
βOLS = Cov[Y, X] / Var[X] = ∑_{k=1}^K ωk · (h(xk) − h(xk−1)) / (xk − xk−1),  where ωk = (xk − xk−1) Cov[1[X ≥ xk], X] / Var[X]
Here ωk ≥ 0 because 1 [X ≥ xk ] is monotone. Specifically (prove it!):
Cov [1 [X ≥ xk ] , X] = (E [X | X ≥ xk ] − E [X | X < xk ]) P (X ≥ xk ) P (X < xk )
And ∑_{k=1}^K ωk = 1 because X = x0 + ∑_{k=1}^K (xk − xk−1) 1[X ≥ xk]
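A numerical check of the discrete-X weights (my own sketch; the support points and probabilities are arbitrary): the weights ωk are nonnegative, sum to one, and combine the CEF slopes into the OLS slope Cov[X, Y]/Var[X].

# Check: OLS slope = convex combination of CEF slopes with weights omega_k
import numpy as np

rng = np.random.default_rng(4)
N = 1_000_000
support = np.array([0.0, 1.0, 3.0, 4.0, 7.0])
x = rng.choice(support, size=N, p=[0.2, 0.3, 0.2, 0.2, 0.1])
Y = np.log(1 + x) + rng.normal(size=N)

h = np.array([Y[x == v].mean() for v in support])    # CEF by group means
slopes = np.diff(h) / np.diff(support)               # (h(x_k)-h(x_{k-1}))/(x_k-x_{k-1})
omega = np.array([
    (support[k] - support[k - 1])
    * np.cov((x >= support[k]).astype(float), x)[0, 1] / x.var(ddof=1)
    for k in range(1, len(support))
])
print(omega.sum())                            # ~ 1, and all omega_k >= 0
print(omega @ slopes)                         # matches the OLS slope below
print(np.cov(x, Y)[0, 1] / x.var(ddof=1))     # OLS slope Cov[X,Y]/Var[X]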
Proof of #4: Continuous X
Similarly for continuous X:
βOLS = ∫_{−∞}^{∞} ω(x) h′(x) dx,  where ω(x) = Cov[1[X ≥ x], X] / Var[X]
with ω(x) ≥ 0 and ∫_{−∞}^{∞} ω(x) dx = 1
Exercise: if X is Gaussian, βOLS = E [h′ (X)] (prove it!)
▶ Hint: use E[Z | Z ≥ a] = φ(a) / (1 − Φ(a)) for Z ∼ N(0, 1)
Five reasons for linear regression (cont.)
1. Curse of dimensionality: E [Y | X] is hard to estimate when X is high-dimensional
2. OLS and CEF solve similar problems: X′ βOLS is the best linear predictor of Y
3. OLS is also the best linear approximation to the CEF
4. With scalar X, βOLS is a convexly-weighted average of ∂E [Y | X = x] /∂x
5. If E [Y | X] happens to be linear, E [Y | X] = X′ βOLS
▶ Linearity is guaranteed when (X, Y) are jointly normally distributed
▶ or when X is “saturated”: dummies for all values of a discrete variable. E.g. for
binary D and X = (1, D),
E[Y | X] = E[Y | D] = E[Y | D = 0] · 1 + (E[Y | D = 1] − E[Y | D = 0]) · D
(intercept: E[Y | D = 0];  slope: E[Y | D = 1] − E[Y | D = 0])
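A check of the saturated binary case (my own sketch with an arbitrary DGP): regressing Y on a constant and D reproduces the two conditional means exactly.

# Regression of Y on (1, D): intercept = E[Y|D=0], slope = difference in means
import numpy as np

rng = np.random.default_rng(5)
N = 100_000
D = rng.integers(0, 2, size=N)
Y = rng.normal(size=N) + 3 * D               # any DGP

X = np.column_stack([np.ones(N), D])
intercept, slope = np.linalg.solve(X.T @ X, X.T @ Y)
print(intercept, Y[D == 0].mean())                  # equal
print(slope, Y[D == 1].mean() - Y[D == 0].mean())   # equal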
(Linear) regression mechanics: Key results
1. When an intercept is included, residuals are mean-zero and uncorrelated with
regressors
2. Regressing Y = Xk on X1 , . . . , XK produces coefficients (0, . . . , 0, 1, 0, . . . 0)
3. β̂ is a linear estimator
4. Frisch-Waugh-Lovell (FWL) theorem
5. Omitted variable bias (OVB) formula
6. Asymptotic distribution and robust standard errors for OLS estimator
Linear regression results (cont.)
When an intercept is included, population residuals are mean-zero and uncorrelated
with regressors: E[X (Y − β′OLS X)] = 0
▶ A simple result, not an assumption (prove it!)
▶ The sample analog also holds: (1/N) ∑_i Xi (Yi − β̂′Xi) = 0
▶ Since residuals are mean-zero, the average fitted value equals the average outcome:
(1/N) ∑_i β̂′Xi = (1/N) ∑_i Yi
Regressing Y = Xk on X1 , . . . , XK produces coefficients (0, . . . , 0, 1, 0, . . . 0)
▶ Prove it!
OLS is a linear estimator
Given the regressors X, each β̂k is linear in the outcomes, i.e. ∃ weights {ωki}_{i=1}^N, with ωki ≡ ωki(X), such that
β̂k = ∑_i ωki Yi   (prove it!)
▶ Weights ωki are mean-zero (for Xk ≠ intercept), orthogonal to non-Xk regressors,
and satisfy ∑_i ωki Xki = 1 (prove it!)
Implication: Regression coefficients can be decomposed
▶ If Yi = Y1i + · · · + YPi , regressing each Ypi on Xi and adding up the coefficient
estimates is numerically the same as regressing Yi on Xi
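A check of the decomposition implication (my own sketch with arbitrary outcomes): coefficients from regressing Y1 and Y2 separately add up to the coefficients from regressing Y1 + Y2, because β̂ is linear in the outcome vector.

# Coefficient decomposition: ols(Y1) + ols(Y2) = ols(Y1 + Y2)
import numpy as np

rng = np.random.default_rng(6)
N, K = 1_000, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K))])
Y1 = rng.normal(size=N)
Y2 = rng.normal(size=N)

ols = lambda Y: np.linalg.solve(X.T @ X, X.T @ Y)
print(ols(Y1) + ols(Y2))
print(ols(Y1 + Y2))                          # identical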
Partialling out: Frisch-Waugh-Lovell (FWL) theorem
Theorem: The k’th element of βOLS can be obtained as βk = Cov[X̃k, Y] / Var[X̃k] or βk = Cov[X̃k, Ỹ] / Var[X̃k],
where X̃k is the residual from regressing Xk on all other regressors (and same for Ỹ)
Proof:
Define ε = Y − β′OLS X. Plug Y = β′OLS X + ε into Cov[X̃k, Y] / Var[X̃k]
Note that X̃k is uncorrelated with ε; with the other regressors; and with Y − Ỹ
Implication: Explicit characterization of the weights ωki :
β̂k = (∑_i X̃ki Yi) / (∑_i X̃ki²) = ∑_i ωki Yi,  for ωki = X̃ki / ∑_{j=1}^N X̃kj²
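FWL in a short NumPy sketch (my own illustration; the DGP is made up): the coefficient on Xk from the full regression equals the slope from regressing Y, or the residualized Y, on the residualized Xk.

# Partialling out: full-regression coefficient on xk = slope on residualized xk
import numpy as np

rng = np.random.default_rng(7)
N = 10_000
W = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])   # other regressors
xk = rng.normal(size=N) + W[:, 1]                            # correlated with W
Y = 2 * xk + W @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=N)

ols = lambda A, b: np.linalg.solve(A.T @ A, A.T @ b)

beta_full = ols(np.column_stack([xk, W]), Y)[0]     # coefficient on xk
xk_tilde = xk - W @ ols(W, xk)                      # residualize xk on W
y_tilde = Y - W @ ols(W, Y)                         # residualize Y on W
print(beta_full)
print(xk_tilde @ Y / (xk_tilde @ xk_tilde))         # same (uses Y)
print(xk_tilde @ y_tilde / (xk_tilde @ xk_tilde))   # same (uses residualized Y)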
Omitted variable “bias”
OVB formula is a mechanical relationship between βOLS from a “long” specification
Y = β0 + β1 X1 + β2 X2 + ε
and δOLS from a “short” specification
Y = δ0 + δ1 X1 + error
Claim: δ1 = β1 + β2 ρ, where ρ = Cov [X1 , X2 ] /Var [X1 ] is the regression slope of X2
(“omitted”) on X1 (“included”)
Proof: δ1 = Cov[X1, Y] / Var[X1] = Cov[X1, β0 + β1X1 + β2X2 + ε] / Var[X1] = β1 + β2 · Cov[X1, X2] / Var[X1], using Cov[X1, ε] = 0.
When included X1 is uncorrelated with omitted X2 , OVB = 0
Generalizes to multiple omitted variables (with OVB = β2′ ρ)
Applies with extra controls X3 included in long, short, and auxiliary regression
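A numerical check of the OVB formula (my own sketch; the coefficients 2, 3, and 0.7 are arbitrary): the short-regression slope equals β1 + β2·ρ exactly, since the formula is a mechanical identity that also holds in sample moments.

# OVB formula: delta_1 = beta_1 + beta_2 * rho
import numpy as np

rng = np.random.default_rng(8)
N = 200_000
X1 = rng.normal(size=N)
X2 = 0.7 * X1 + rng.normal(size=N)            # omitted variable, correlated with X1
Y = 1 + 2 * X1 + 3 * X2 + rng.normal(size=N)

ols = lambda A, b: np.linalg.solve(A.T @ A, A.T @ b)
c = lambda *cols: np.column_stack([np.ones(N), *cols])   # add an intercept

beta = ols(c(X1, X2), Y)                      # long regression: (beta0, beta1, beta2)
delta1 = ols(c(X1), Y)[1]                     # short regression slope
rho = ols(c(X1), X2)[1]                       # slope of X2 on X1
print(delta1, beta[1] + beta[2] * rho)        # equal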
Asymptotic distribution of the OLS estimator
β̂ = ((1/N) ∑_i Xi X′i)−1 ((1/N) ∑_i Xi Yi) = βOLS + ((1/N) ∑_i Xi X′i)−1 ((1/N) ∑_i Xi εi)
where by definition ε = Y − β′OLS X. Thus,
√N (β̂ − βOLS) = ((1/N) ∑_i Xi X′i)−1 ((1/√N) ∑_i Xi εi)
By the LLN, (1/N) ∑_i Xi X′i →ᵖ E[XX′] (assumed non-singular)
In a random sample, by the CLT (using E[Xε] = 0), (1/√N) ∑_i Xi εi →ᵈ N(0, Var[Xε])
By the continuous mapping theorem,
√N (β̂ − βOLS) →ᵈ N(0, V),  V = E[XX′]−1 Var[Xε] E[XX′]−1
Robust standard errors
We estimate V by its sample analog (“sandwich formula”), up to a
degree-of-freedom correction:
V̂ = ((1/N) ∑_i Xi X′i)−1 · ((1/(N − dim(X))) ∑_i Xi X′i ε̂i²) · ((1/N) ∑_i Xi X′i)−1
Heteroskedasticity-robust (Eicker-Huber-White) standard error is
SE(β̂k) = √(V̂kk / N)
Never use homoskedastic standard errors!
For later: standard errors outside iid samples, e.g. clustered SE in panels
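A sketch of the sandwich formula with the stated degrees-of-freedom correction (my own NumPy implementation of the formula above, with a made-up heteroskedastic DGP):

# Robust (sandwich) variance estimator and standard errors
import numpy as np

rng = np.random.default_rng(9)
N = 5_000
x = rng.normal(size=N)
Y = 1 + 2 * x + rng.normal(size=N) * (1 + np.abs(x))   # heteroskedastic errors
X = np.column_stack([np.ones(N), x])

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e_hat = Y - X @ beta_hat                               # residuals

Sxx = X.T @ X / N                                      # (1/N) sum X_i X_i'
meat = (X * e_hat[:, None] ** 2).T @ X / (N - X.shape[1])   # df-corrected middle term
V_hat = np.linalg.inv(Sxx) @ meat @ np.linalg.inv(Sxx)
se = np.sqrt(np.diag(V_hat) / N)                       # SE(beta_hat_k) = sqrt(V_kk / N)
print(beta_hat, se)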