
Part A: Regression and causality

A1: Key facts about regression

Kirill Borusyak
ARE 213 Applied Econometrics
UC Berkeley, Fall 2024

Acknowledgments

These lecture slides draw on the materials by Michael Anderson, Peter Hull, Paul
Goldsmith-Pinkham, and Michal Kolesar

All errors are mine — please let me know if you spot them!

What is this course about (1)

Goal: help you do rigorous empirical (micro)economic research

Focus on causal inference / program evaluation / treatment effects

What is shared by [the causal] literature is [...] an explicit emphasis on credibly estimating causal effects, a recognition of the heterogeneity in these effects, clarity in the identifying assumptions, and a concern about endogeneity of choices and the role study design plays. (Imbens, 2010, “Better LATE Than Nothing”)

What is this course about (2)
Focus on most common research designs / identification strategies
The econometrics literature has developed a small number of canonical settings where
researchers view the specific causal models and associated statistical methods as well
established and understood. [They are] referred to as identification strategies. [These]
include unconfoundedness, IV, DiD, RDD, and synthetic control methods and are
familiar to most empirical researchers in economics. The [associated] methods
are commonly used in empirical work and are constantly being refined, and new
identification strategies are occasionally added to the canon. Empirical strategies not
currently in this canon, rightly or wrongly, are viewed with much more suspicion until
they reach the critical momentum to be added. (Imbens, 2020)

We will study target estimands, assumptions, tests, estimators, statistical inference


Introduce multi-purpose econometric tools: e.g. randomization inference
Course outline (1)
A. Introduction: regression and causality (~4 lectures)

▶ Key facts about regression; potential outcomes and RCTs

B. Selection on observables (~4 lectures)

▶ Covariate adjustment via regression, via propensity scores, doubly-robust methods, double machine learning

C. Panel data methods (~7 lectures)

▶ Diff-in-diffs and event studies; synthetic controls and factor models

Course outline (2)
D. Instrumental variables (IVs) (~7 lectures)
▶ Linear IV; IV with treatment effect heterogeneity
▶ formula instruments, recentering, shift-share IV, spillovers
▶ Examiner designs (“judge IVs”)
E. Regression discontinuity (RD) designs (~3 lectures)
▶ Sharp and fuzzy RD designs and various extensions
F. Miscellaneous topics (~3 lectures)
▶ Nonlinear models: Poisson regression, quantile regression
▶ Statistical inference: clustering, bootstrap
▶ Topics of your interest (email me in advance!)

Course outline (3)

Currently not covered

Descriptive statistics, data visualization

Structural estimation

Time series data

Experimental design

Textbooks
MHE Angrist, Joshua and Jörn-Steffen Pischke (2009). Mostly Harmless Econometrics. Princeton University Press.

CT Cameron, A. Colin and Pravin Trivedi (2005). Microeconometrics: Methods and Applications. Cambridge University Press.

IW Imbens, Guido and Jeffrey Wooldridge (2009). New developments in econometrics: Lecture notes. https://www.cemmap.ac.uk/resource/new-developments-in-econometrics/

JW Wooldridge, Jeffrey (2002). Econometric Analysis of Cross Section and Panel Data. MIT Press. (Or second edition from 2010)

Some econometric vocabulary
OLS estimator: β̂ = (X′ X)−1 X′ Y ≡ (1/N Σ_{i=1}^N Xi X′i )−1 (1/N Σ_{i=1}^N Xi Yi )

▶ Random variable, function of the observed sample

OLS estimand: βOLS = E [XX′ ]−1 E [XY] ≡ E [Xi X′i ]−1 E [Xi Yi ] (assuming a
random sample)
▶ A non-stochastic population parameter
▶ β̂ → βOLS in probability with a random sample, under weak regularity conditions
▶ This does not involve assuming a model, exogeneity conditions etc.

β̂ and βOLS correspond to a linear specification Yi = β ′ Xi + error


▶ Just notational convention for reg Y X, not necessarily a model
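
A minimal NumPy sketch of the estimator/estimand distinction (the DGP, coefficient values, and seed below are assumptions for illustration, not part of the slides): β̂ = (X′X)−1 X′Y computed on random samples settles near the population estimand as N grows.

    import numpy as np

    rng = np.random.default_rng(0)

    def beta_hat(X, Y):
        # sample OLS estimator: (X'X)^{-1} X'Y
        return np.linalg.solve(X.T @ X, X.T @ Y)

    for N in (100, 10_000, 1_000_000):
        x = rng.normal(size=N)
        Y = 1.0 + 2.0 * x + rng.normal(size=N)   # assumed DGP, for illustration only
        X = np.column_stack([np.ones(N), x])     # regressors include a constant
        print(N, beta_hat(X, Y))                 # settles near the estimand (1, 2) as N grows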

Some econometric vocabulary (2)

An economic or statistical model is needed to interpret βOLS and other estimands

▶ A model involves parameters (with economic meaning) and assumptions (restricting the DGP)

▶ Assumptions hopefully make some parameters identified, i.e. possible to uniquely determine from everything the data contain — here, the distribution of (X, Y)

Some econometric vocabulary (3)
Example 1: demand and supply

Qi = −βd Pi + εd , Qi = βs Pi + εs , Cov [εd , εs ] = 0

▶ Regressing Qi on Pi and a constant yields (prove this!)

βOLS = (Var [εs ] / (Var [εd ] + Var [εs ])) · (−βd ) + (Var [εd ] / (Var [εd ] + Var [εs ])) · βs

Example 2: heterogeneous effects

Yi = βi Xi + εi ,   Xi ⊥⊥ (βi , εi )

▶ Regressing Yi on Xi and a constant yields (prove this!)

βOLS = E [βi ]
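
Both claims are easy to check by simulation before proving them; here is a minimal sketch (all parameter values, distributions, and the seed are assumptions for illustration only):

    import numpy as np
    rng = np.random.default_rng(1)
    N = 1_000_000

    def slope(y, x):
        c = np.cov(x, y)          # 2x2 sample covariance matrix
        return c[0, 1] / c[0, 0]  # Cov(x, y) / Var(x)

    # Example 1: demand and supply, with assumed beta_d = 1 and beta_s = 2
    beta_d, beta_s = 1.0, 2.0
    eps_d = rng.normal(scale=1.0, size=N)
    eps_s = rng.normal(scale=0.5, size=N)
    P = (eps_d - eps_s) / (beta_d + beta_s)     # equilibrium price from the two equations
    Q = beta_s * P + eps_s                      # equilibrium quantity
    w = np.var(eps_s) / (np.var(eps_d) + np.var(eps_s))
    print(slope(Q, P), w * (-beta_d) + (1 - w) * beta_s)   # close: variance-weighted average

    # Example 2: heterogeneous effects with beta_i independent of X, E[beta_i] = 3
    beta_i = rng.normal(loc=3.0, scale=1.0, size=N)
    X = rng.normal(size=N)
    Y = beta_i * X + rng.normal(size=N)
    print(slope(Y, X), beta_i.mean())                      # both close to E[beta_i]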

Outline

1 Course intro

2 What is regression and why do we use it?

3 Linear regression and its mechanics


Regression and its uses
Regression of Y on X ≡ conditional expectation function (CEF):

h(·) : x ↦ h(x) ≡ E [Yi | Xi = x]

Conditional expectation E [Yi | Xi ] = h(Xi ) is a random variable because Xi is

Uses of regression:

Descriptive: how Y on average covaries with X — by definition

Prediction: if we know Xi , our best guess for Yi is h(Xi ) — prove next

Causal inference: what happens to Yi if we manipulate Xi — sometimes

Regression as optimal prediction (1)
What counts as the best guess is defined by a loss function

Proposition: CEF is the best predictor with quadratic loss:

h(·) = arg min_{g(·)} E [(Yi − g(Xi ))²]

Lemma: the CEF residual Yi − E [Yi | Xi ] is mean-zero and uncorrelated with any
g(Xi ).
▶ Proof by the law of iterated expectations (LIE)
▶ E [Yi − E [Yi | Xi ]] = E [E [Yi − E [Yi | Xi ] | Xi ]] = 0
▶ E [(Yi − h(Xi )) g(Xi )] = E [E [(Yi − h(Xi )) g(Xi ) | Xi ]] =
E [E [Yi − h(Xi ) | Xi ] · g(Xi )] = 0

Regression as optimal prediction (2)
Proposition: CEF is the best predictor with quadratic loss:

h(·) = arg min_{g(·)} E [(Yi − g(Xi ))²]

Lemma: the CEF residual Yi − E [Yi | Xi ] is mean-zero and uncorrelated with any
g(Xi ).
Proposition proof:
E [(Yi − g(Xi ))²] = E [{(Yi − h(Xi )) + (h(Xi ) − g(Xi ))}²]
= E [(Yi − h(Xi ))²] + 2 E [(Yi − h(Xi )) (h(Xi ) − g(Xi ))] + E [(h(Xi ) − g(Xi ))²]
= E [(Yi − h(Xi ))²] + E [(h(Xi ) − g(Xi ))²] ≥ E [(Yi − h(Xi ))²] ,
where the cross term vanishes by the Lemma
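
As a numerical sanity check of the proposition (the discrete DGP below is an assumption for illustration only): among a few predictors of Yi given Xi, the conditional mean attains the smallest mean squared error.

    import numpy as np
    rng = np.random.default_rng(2)
    N = 200_000

    X = rng.integers(0, 5, size=N)                      # discrete X, so the CEF is easy to estimate
    Y = np.sqrt(X) + (1 + X) * rng.normal(size=N)       # nonlinear CEF, heteroskedastic noise

    cef_by_x = np.array([Y[X == x].mean() for x in range(5)])   # estimated E[Y | X = x]
    design = np.column_stack([np.ones(N), X])
    b = np.linalg.lstsq(design, Y, rcond=None)[0]               # best linear predictor

    predictors = {
        "CEF h(X)":      cef_by_x[X],
        "linear in X":   design @ b,
        "constant E[Y]": np.full(N, Y.mean()),
    }
    for name, pred in predictors.items():
        print(name, np.mean((Y - pred) ** 2))           # the CEF attains the lowest MSE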

Regression as optimal prediction: Exercise
What is the best predictor with loss |Yi − g(Xi )|, i.e. arg min_{g(·)} E [|Yi − g(Xi )|]?
Or with the “check” loss function (slope q ∈ (0, 1) on the right, q − 1 on the left)?

Hint: solve it first assuming Xi takes only one value


Note: this exercise is linked to quantile regression
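
If you want to check your answer numerically, here is a small grid-search sketch (the skewed distribution, q = 0.9, and the grid are assumptions for illustration); it follows the hint and fixes Xi at a single value:

    import numpy as np
    rng = np.random.default_rng(3)

    Y = rng.exponential(scale=2.0, size=100_000)   # a skewed outcome, X fixed at one value
    grid = np.linspace(0, 10, 501)                 # candidate constant predictions c

    abs_loss = [np.mean(np.abs(Y - c)) for c in grid]
    q = 0.9
    check_loss = [np.mean(np.where(Y >= c, q * (Y - c), (1 - q) * (c - Y))) for c in grid]

    print("argmin of E|Y - c|:          ", grid[np.argmin(abs_loss)])
    print("argmin of check loss (q=0.9):", grid[np.argmin(check_loss)])
    # compare these with summary statistics of Y to confirm your answer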
Outline

1 Course intro

2 What is regression and why do we use it?

3 Linear regression and its mechanics


Five reasons for linear regression
What does the CEF have to do with the least squares estimand βOLS = E [XX′ ]−1 E [XY]? And
why do we use it instead of E [Y | X]?
1. Curse of dimensionality: E [Y | X] is hard to estimate when X is high-dimensional
[but machine learning methods make it easier]
2. OLS and CEF solve similar problems: X′ βOLS is the best linear predictor of Y, i.e.
βOLS = arg min_b E [(Y − X′ b)²]

3. OLS is also the best linear approximation to the CEF:


βOLS = arg min_b E [(E [Y | X] − X′ b)²]

Five reasons for linear regression (cont.)
1. Curse of dimensionality: E [Y | X] is hard to estimate when X is high-dimensional
2. OLS and CEF solve similar problems: X′ βOLS is the best linear predictor of Y, i.e.
βOLS = arg min_b E [(Y − X′ b)²]

3. OLS is also the best linear approximation to the CEF:


βOLS = arg min_b E [(E [Y | X] − X′ b)²]

▶ Proof by FOC: E [X (E [Y | X] − X′ b)] = 0 =⇒


b = E [XX′ ]−1 E [XE [Y | X]] = E [XX′ ]−1 E [XY] = βOLS
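
A quick numerical illustration of #2 and #3 (the nonlinear CEF below is an assumption for illustration only): regressing Y on X and regressing the fitted CEF values E[Y | X] on X return the same coefficients.

    import numpy as np
    rng = np.random.default_rng(4)
    N = 1_000_000

    x = rng.integers(1, 6, size=N).astype(float)      # discrete X so E[Y | X] is easy to compute
    Y = np.exp(x / 3) + rng.normal(size=N)            # nonlinear CEF plus noise
    X = np.column_stack([np.ones(N), x])

    # OLS of Y on X (best linear predictor of Y)
    b_y = np.linalg.lstsq(X, Y, rcond=None)[0]

    # OLS of the (estimated) CEF values on X (best linear approximation to the CEF)
    cef_vals = np.zeros(N)
    for v in np.unique(x):
        cef_vals[x == v] = Y[x == v].mean()
    b_cef = np.linalg.lstsq(X, cef_vals, rcond=None)[0]

    print(b_y)
    print(b_cef)   # identical coefficients (up to floating point)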

Five reasons for linear regression (cont.)

1. Curse of dimensionality: E [Y | X] is hard to estimate when X is high-dimensional

2. OLS and CEF solve similar problems: X′ βOLS is the best linear predictor of Y

3. OLS is also the best linear approximation to the CEF

4. With scalar X, βOLS is a convexly-weighted average of dE [Y | X = x] /dx (or its discrete analog)

Proof of #4: Discrete X (with values x0 < · · · < xK)
Rewrite E [Y | X = x] ≡ h(x) = h(x0 ) + Σ_{k=1}^K (h(xk ) − h(xk−1 )) 1 [x ≥ xk ]

Thus Cov [Y, X] = Cov [E [Y | X] , X] = Σ_{k=1}^K (h(xk ) − h(xk−1 )) Cov [1 [X ≥ xk ] , X], and

βOLS = Cov [Y, X] / Var [X] = Σ_{k=1}^K ωk · (h(xk ) − h(xk−1 )) / (xk − xk−1 ),
where ωk = (xk − xk−1 ) Cov [1 [X ≥ xk ] , X] / Var [X]

Here ωk ≥ 0 because 1 [X ≥ xk ] is monotone. Specifically (prove it!):

Cov [1 [X ≥ xk ] , X] = (E [X | X ≥ xk ] − E [X | X < xk ]) P (X ≥ xk ) P (X < xk )

And Σ_{k=1}^K ωk = 1 because X = x0 + Σ_{k=1}^K (xk − xk−1 ) 1 [X ≥ xk ]
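
A minimal check of this weighting result (the support points, probabilities, and CEF below are assumptions for illustration only): compute the ωk directly and compare the weighted average of CEF slopes with the regression slope.

    import numpy as np
    rng = np.random.default_rng(5)
    N = 1_000_000

    support = np.array([0.0, 1.0, 3.0, 4.0])              # x_0 < x_1 < x_2 < x_3
    X = rng.choice(support, size=N, p=[0.4, 0.3, 0.2, 0.1])
    Y = np.log1p(X) + rng.normal(size=N)                  # nonlinear CEF h(x) = log(1 + x), plus noise

    h = np.array([Y[X == x].mean() for x in support])     # estimated CEF at each support point

    c = np.cov(X, Y)
    beta_ols = c[0, 1] / c[0, 0]                          # regression slope of Y on X

    slopes, weights = [], []
    for k in range(1, len(support)):
        slopes.append((h[k] - h[k - 1]) / (support[k] - support[k - 1]))
        ind = (X >= support[k]).astype(float)             # 1[X >= x_k]
        weights.append((support[k] - support[k - 1]) * np.cov(ind, X)[0, 1] / c[0, 0])

    print(beta_ols, np.dot(weights, slopes))              # identical (up to floating point)
    print(sum(weights))                                   # the weights sum to 1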
Proof of #4: Continuous X

Similarly for continuous X:


βOLS = ∫_{−∞}^{∞} ω(x) h′ (x) dx,   ω(x) = Cov [1 [X ≥ x] , X] / Var [X]

with ω(x) ≥ 0 and ∫_{−∞}^{∞} ω(x) dx = 1

Exercise: if X is Gaussian, βOLS = E [h′ (X)] (prove it!)

▶ Hint: use E [Z | Z ≥ a] = φ(a) / (1 − Φ(a)) for Z ∼ N (0, 1)

Five reasons for linear regression (cont.)
1. Curse of dimensionality: E [Y | X] is hard to estimate when X is high-dimensional
2. OLS and CEF solve similar problems: X′ βOLS is the best linear predictor of Y
3. OLS is also the best linear approximation to the CEF
4. With scalar X, βOLS is a convexly-weighted average of ∂E [Y | X = x] /∂x
5. If E [Y | X] happens to be linear, E [Y | X] = X′ βOLS
▶ Linearity is guaranteed when (X, Y) are jointly normally distributed
▶ or when X is “saturated”: dummies for all values of a discrete variable. E.g. for
binary D and X = (1, D),

E [Y | X] = E [Y | D] = E [Y | D = 0] · 1 + (E [Y | D = 1] − E [Y | D = 0]) · D,
where E [Y | D = 0] is the intercept and E [Y | D = 1] − E [Y | D = 0] is the slope
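
To see #5 in the saturated binary case, a tiny sketch (the DGP below is an assumption for illustration only): regressing Y on a constant and D recovers the group means exactly.

    import numpy as np
    rng = np.random.default_rng(6)
    N = 100_000

    D = rng.integers(0, 2, size=N)
    Y = 2.0 + 1.5 * D + rng.normal(size=N)            # any DGP works; D is saturated
    X = np.column_stack([np.ones(N), D])

    intercept, slope = np.linalg.lstsq(X, Y, rcond=None)[0]
    print(intercept, Y[D == 0].mean())                 # identical: intercept = sample E[Y | D = 0]
    print(slope, Y[D == 1].mean() - Y[D == 0].mean())  # identical: slope = difference in means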

(Linear) regression mechanics: Key results
1. When an intercept is included, residuals are mean-zero and uncorrelated with
regressors

2. Regressing Y = Xk on X1 , . . . , XK produces coefficients (0, . . . , 0, 1, 0, . . . 0)

3. β̂ is a linear estimator

4. Frisch-Waugh-Lovell (FWL) theorem

5. Omitted variable bias (OVB) formula

6. Asymptotic distribution and robust standard errors for OLS estimator

Linear regression results (cont.)
When an intercept is included, population residuals are mean-zero and uncorrelated with regressors: E [X (Y − X′ βOLS )] = 0

▶ A simple result, not an assumption (prove it!)


▶ The sample analog also holds: (1/N) Σi Xi (Yi − β̂′ Xi ) = 0

▶ Since residuals are mean-zero, average fitted value equals average outcome: (1/N) Σi β̂′ Xi = (1/N) Σi Yi

Regressing Y = Xk on X1 , . . . , XK produces coefficients (0, . . . , 0, 1, 0, . . . 0)

▶ Prove it!
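
Both results are easy to confirm numerically; a minimal sketch (the multivariate DGP below is an assumption for illustration only):

    import numpy as np
    rng = np.random.default_rng(7)
    N, K = 10_000, 3

    X = np.column_stack([np.ones(N), rng.normal(size=(N, K))])
    Y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=N)

    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
    resid = Y - X @ beta_hat
    print(X.T @ resid / N)                     # numerically zero: mean-zero, uncorrelated with X
    print((X @ beta_hat).mean(), Y.mean())     # average fitted value equals average outcome

    # Regressing one of the regressors on all of X returns a unit vector of coefficients
    print(np.linalg.lstsq(X, X[:, 2], rcond=None)[0])   # (0, 0, 1, 0) up to floating point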

OLS is a linear estimator
Given the regressors X, each β̂k is linear in the outcomes, i.e. ∃ {ωki }_{i=1}^N such that

β̂k = Σi ωki Yi

for some weights ωki ≡ ωki (X) (prove it!)

Weights ωki are mean-zero (for Xk ≠ intercept), orthogonal to non-Xk regressors, and Σi ωki Xki = 1 (prove it!)

Implication: Regression coefficients can be decomposed


▶ If Yi = Y1i + · · · + YPi , regressing each Ypi on Xi and adding up the coefficient estimates is numerically the same as regressing Yi on Xi
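
A short check of the decomposition result (the two outcome components below are assumptions for illustration only):

    import numpy as np
    rng = np.random.default_rng(8)
    N = 10_000

    X = np.column_stack([np.ones(N), rng.normal(size=N)])
    Y1 = X @ np.array([1.0, 0.5]) + rng.normal(size=N)   # one component of the outcome
    Y2 = rng.exponential(size=N)                          # another component, any distribution
    Y = Y1 + Y2

    ols = lambda y: np.linalg.lstsq(X, y, rcond=None)[0]
    print(ols(Y))                 # coefficients from regressing the total outcome
    print(ols(Y1) + ols(Y2))      # identical: sum of the component-wise coefficients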

Partialling out: Frisch-Waugh-Lovell (FWL) theorem
Theorem: The k’th element of βOLS can be obtained as βk = Cov [X̃k , Y] / Var [X̃k ] or βk = Cov [X̃k , Ỹ] / Var [X̃k ],
where X̃k is the residual from regressing Xk on all other regressors (and same for Ỹ)
where X̃k is the residual from regressing Xk on all other regressors (and same for Ỹ)
Proof:
′ ′ Cov[X̃k ,Y]
Define ε = Y − βOLS X. Plug in Y = βOLS X + ε to Var[X̃k ]

Note that X̃k is uncorrelated with ε; with other regressors; and with Y − Ỹ

Implication: Explicit characterization of the weights ωki :


β̂k = (Σi X̃ki Yi ) / (Σi X̃ki² ) = Σi ωki Yi   for   ωki = X̃ki / Σ_{j=1}^N X̃kj²
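
A numerical illustration of FWL (the three-regressor DGP below is an assumption for illustration only): the coefficient on one regressor in the full regression equals the slope from the residual-on-residual (or residual-on-Y) regression.

    import numpy as np
    rng = np.random.default_rng(9)
    N = 10_000

    x1 = rng.normal(size=N)
    x2 = 0.5 * x1 + rng.normal(size=N)                 # correlated regressors
    Y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=N)

    X = np.column_stack([np.ones(N), x1, x2])
    W = np.column_stack([np.ones(N), x1])              # all regressors except x2

    beta_full = np.linalg.lstsq(X, Y, rcond=None)[0]

    resid = lambda Z, v: v - Z @ np.linalg.lstsq(Z, v, rcond=None)[0]
    x2_t = resid(W, x2)                                # x2 partialled out of the other regressors
    y_t = resid(W, Y)

    print(beta_full[2])                                # coefficient on x2 in the full regression
    print(x2_t @ Y / (x2_t @ x2_t))                    # FWL: same, using raw Y
    print(x2_t @ y_t / (x2_t @ x2_t))                  # FWL: same, using partialled-out Y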

Omitted variable “bias”
OVB formula is a mechanical relationship between βOLS from a “long” specification

Y = β0 + β1 X1 + β2 X2 + ε

and δOLS from a “short” specification

Y = δ0 + δ1 X1 + error

Claim: δ1 = β1 + β2 ρ, where ρ = Cov [X1 , X2 ] / Var [X1 ] is the regression slope of X2 (“omitted”) on X1 (“included”)
Proof: δ1 = Cov [X1 , Y] / Var [X1 ] = Cov [X1 , β0 + β1 X1 + β2 X2 + ε] / Var [X1 ] = β1 + β2 Cov [X1 , X2 ] / Var [X1 ].
When included X1 is uncorrelated with omitted X2 , OVB = 0
Generalizes to multiple omitted variables (with OVB = β2′ ρ)
Applies with extra controls X3 included in long, short, and auxiliary regression
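
A quick check of the OVB formula (the DGP below is an assumption for illustration only); because the relationship is mechanical, the two numbers agree exactly in any sample:

    import numpy as np
    rng = np.random.default_rng(10)
    N = 100_000

    x1 = rng.normal(size=N)
    x2 = 0.7 * x1 + rng.normal(size=N)                     # omitted variable, correlated with x1
    Y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=N)

    long = np.column_stack([np.ones(N), x1, x2])
    short = np.column_stack([np.ones(N), x1])

    b = np.linalg.lstsq(long, Y, rcond=None)[0]            # (beta0, beta1, beta2)
    d = np.linalg.lstsq(short, Y, rcond=None)[0]           # (delta0, delta1)
    rho = np.linalg.lstsq(short, x2, rcond=None)[0][1]     # slope of omitted on included

    print(d[1], b[1] + b[2] * rho)                         # identical: delta1 = beta1 + beta2 * rho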
Asymptotic distribution of the OLS estimator
β̂ = (1/N Σi Xi X′i )−1 (1/N Σi Xi Yi ) = βOLS + (1/N Σi Xi X′i )−1 (1/N Σi Xi εi )

where by definition ε = Y − X′ βOLS . Thus,

√N (β̂ − βOLS ) = (1/N Σi Xi X′i )−1 (1/√N Σi Xi εi )
By LLN, (1/N) Σi Xi X′i → E [XX′ ] in probability (assumed non-singular)

In a random sample, by CLT (using E [Xε] = 0), (1/√N) Σi Xi εi → N (0, Var [Xε]) in distribution
By the continuous mapping theorem,
√N (β̂ − βOLS ) → N (0, V) in distribution,   V = E [XX′ ]−1 Var [Xε] E [XX′ ]−1
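
A small Monte Carlo sketch of this result (the heteroskedastic DGP below is an assumption for illustration only): the simulated variance of √N (β̂ − βOLS) for the slope is close to the corresponding element of V.

    import numpy as np
    rng = np.random.default_rng(11)
    N, reps = 500, 5_000
    beta = np.array([1.0, 2.0])

    draws = []
    for _ in range(reps):
        x = rng.normal(size=N)
        eps = (1 + np.abs(x)) * rng.normal(size=N)        # heteroskedastic errors, mean zero given x
        X = np.column_stack([np.ones(N), x])
        Y = X @ beta + eps
        b = np.linalg.lstsq(X, Y, rcond=None)[0]
        draws.append(np.sqrt(N) * (b[1] - beta[1]))

    # approximate V = E[XX']^{-1} Var[X eps] E[XX']^{-1} on one large draw
    x = rng.normal(size=2_000_000)
    eps = (1 + np.abs(x)) * rng.normal(size=2_000_000)
    X = np.column_stack([np.ones_like(x), x])
    Xe = X * eps[:, None]
    A = np.linalg.inv(X.T @ X / len(x))
    V = A @ (Xe.T @ Xe / len(x)) @ A

    print(np.var(draws), V[1, 1])   # Monte Carlo variance of the slope vs. its asymptotic variance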
Robust standard errors
We estimate V by its sample analog (“sandwich formula”), up to a
degree-of-freedom correction:
V̂ = (1/N Σi Xi X′i )−1 · (1/(N − dim(X)) Σi Xi X′i ε̂²i ) · (1/N Σi Xi X′i )−1

Heteroskedasticity-robust (Eicker-Huber-White) standard error is


SE(β̂k ) = √(V̂kk / N)

Never use homoskedastic standard errors!


For later: standard errors outside iid samples, e.g. clustered SE in panels
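
A minimal implementation of the sandwich formula above (the heteroskedastic DGP is an assumption for illustration; the degrees-of-freedom correction follows the slide):

    import numpy as np
    rng = np.random.default_rng(12)
    N = 10_000

    x = rng.normal(size=N)
    X = np.column_stack([np.ones(N), x])
    Y = 1.0 + 2.0 * x + (1 + np.abs(x)) * rng.normal(size=N)   # heteroskedastic errors

    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
    resid = Y - X @ beta_hat

    bread = np.linalg.inv(X.T @ X / N)                          # (1/N sum X_i X_i')^{-1}
    meat = (X * resid[:, None]).T @ (X * resid[:, None]) / (N - X.shape[1])
    V_hat = bread @ meat @ bread                                # sandwich estimate of V

    se_robust = np.sqrt(np.diag(V_hat) / N)                     # SE(beta_k) = sqrt(V_kk / N)
    print(beta_hat)
    print(se_robust)   # heteroskedasticity-robust (Eicker-Huber-White) standard errors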

