LECTURE 5
Matching and subclassification

Intro
- Recall from the DAG discussion: you can use DAGs to identify a causal effect so long as there exists a conditioning strategy that satisfies the backdoor criterion
- Let's now consider three different conditioning strategies:
  ● Subclassification
  ● Exact matching
  ● Approximate matching

Subclassification
- Subclassification is a method of satisfying the backdoor criterion by weighting differences in means by strata-specific weights
  ● Weights, in turn, adjust the differences in means so their distribution by strata is the same as the counterfactual's strata
  ● The method implicitly achieves distributional balance between the treatment and control in terms of that known, observable confounder
  ● The method was created by Cochran (1968), who analyzed the causal effect of smoking on lung cancer

Intro
- We will rely a lot on the conditional independence assumption (CIA): (𝑌¹, 𝑌⁰) ⊥ 𝐷 | 𝑋
- That is, the expected values of 𝑌¹ and 𝑌⁰ are equal for treatment and control groups for each value of X
- Alternatively, E[𝑌¹ | D = 1, X] = E[𝑌¹ | D = 0, X] and E[𝑌⁰ | D = 1, X] = E[𝑌⁰ | D = 0, X]
- Sometimes we know that randomization occurred only conditional on some observable characteristics
  ● E.g., in Krueger (1999), treatment assignment was conditionally random: Tennessee randomly assigned kindergarten students and teachers to small classrooms (treatment), large classrooms, and large classrooms with an aide. Schools were chosen, then students were randomized. Regression models included a school fixed effect.

Subclassification
- Insofar as CIA is credible, it means that you found a conditioning strategy satisfying the backdoor criterion
- But when treatment assignment is conditional on observable variables, it's a situation of selection on observables
  ● Think of X as an n x k matrix of covariates satisfying CIA as a whole

Example: Smoking and lung cancer
- A big problem in the mid-to-late 20th century was the rising incidence of lung cancer
- People began to suspect that smoking had something to do with it (high correlation). E.g., daily smoking and lung cancer in males was monotonically increasing in the number of cigarettes per day
- But smoking was not independent of potential health outcomes. Smoking is endogenous: people choose to smoke, after all
- For all we know, people who smoked cigarettes differed from non-smokers in ways directly related to the incidence of lung cancer
  ● What if there was an unobserved genetic element causing people to smoke and independently causing them to develop lung cancer?
  ● What if smokers were more extroverted? Differed in age, income, education, etc.?
- So comparing the incidence of lung cancer between smokers and non-smokers won't do (SDO) if the independence assumption doesn't hold
- Statisticians said that the correlation was spurious due to a non-random selection of subjects
  ● Or maybe the functional form was incorrect, and this affected risk ratios and odds ratios. (Usual critiques of statistical association from an observational dataset.)
  ● Or the magnitudes relating smoking and lung cancer were implausibly large.
- Finally, there was no experimental evidence that could incriminate smoking as a cause of lung cancer
- Critics of the smoking-lung cancer nexus were proven wrong by science and medicine, which showed beyond doubt that there's a causal relationship. Why did critics get it wrong?
- Cochran (1968) studied smoking patterns and laid out mortality rates by country and smoking type
- The highest death rate for Canadians is among cigar and pipe smokers, and that's higher than for non-smokers or those who smoke cigarettes. Similar patterns hold for the UK and US
- So are pipes and cigars more dangerous than cigarettes? Note that cigar and pipe smokers don't often inhale (so less tar accumulates in the lungs)
- Do we satisfy the CIA? That is, are the factors related to these three states of the world (cigarette, pipe, or cigar smoking) truly independent of the factors determining death rates?
- Assuming the independence assumption holds, what else would be true across the three groups?
  ● If mean potential outcomes are the same for each type of smoking category, wouldn't we expect the observable characteristics of the smokers themselves to be equal as well?
  ● If the means of the covariates are the same for each group, those covariates are balanced, and the two groups are exchangeable with respect to those covariates
  ● The connection between the independence assumption and the characteristics of the groups is called balance
- Since older people die at a higher rate (apart from being more likely to smoke cigars), maybe the higher death rate for cigar smokers is because they're older on average
- By the same token, cigarette smoking has a lower mortality rate because cigarette smokers are younger on average
- The role of age can be summarized in a DAG, where D is smoking, Y mortality, and A age
- CIA is violated; we have an open backdoor path, and also omitted variable bias
- But note that the distribution of age for each group will be different (aka covariate imbalance)
- The first strategy to address covariate imbalance is to do subclassification: here, condition on age so that the distribution of age is comparable for treatment and control groups
- First, divide age into strata: e.g., 20-40, 41-70, 71 and older
- Second, calculate the mortality rate for some treatment group (cigarette smokers) by strata (age)
- Third, weight the mortality rate for the treatment group by a strata-specific (here age-specific) weight corresponding to the control group. This gives the age-adjusted mortality rate for the treatment group
- Age seems to matter: older people were more likely to smoke cigars and pipes, and older people are more likely to die
- Look at the mean ages across smoking groups below
- By construction, the age distribution of pipe or cigar smokers is the exact opposite of the cigarette smokers'. So the age distribution is "imbalanced"
- Assume that age is the only relevant confounder between smoking and mortality. What's the average death rate for cigarette smokers without subclassification?
- It's the weighted average of the mortality-rate column, where each weight is 𝑁ₜ/𝑁, with 𝑁ₜ the number of people in group/stratum t and N the total number of people:
  20 × (65/100) + 40 × (25/100) + 60 × (10/100) = 29
  (mortality rate of cigarette smokers per 100,000)
- Subclassification adjusts the mortality rate for cigarette smokers (treatment) so it has the same age distribution as pipe or cigar smokers (the control group)
- That is, we multiply each age-specific mortality rate by the proportion of individuals in that age stratum for the comparison group:
  20 × (10/100) + 40 × (25/100) + 60 × (65/100) = 51
  (mortality rate of cigarette smokers, adjusted to the pipe or cigar smokers' age distribution)
- This is almost twice the cigarette smokers' mortality rate from the naive calculation, which is unadjusted for the age confounder
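To make the arithmetic concrete, here is a minimal Python sketch that reproduces the naive and age-adjusted rates from the example above:

```python
# Cochran-style subclassification with the example's numbers.
# Age strata: 20-40, 41-70, 71+; rates are deaths per 100,000.
rates_cig = [20, 40, 60]   # age-specific mortality rates for cigarette smokers
n_cig     = [65, 25, 10]   # cigarette smokers per stratum (shares of 100)
n_pipe    = [10, 25, 65]   # pipe/cigar smokers per stratum (shares of 100)

def weighted_rate(rates, weights):
    """Weight each stratum's rate by that stratum's share of the chosen group."""
    total = sum(weights)
    return sum(r * w / total for r, w in zip(rates, weights))

print(weighted_rate(rates_cig, n_cig))    # 29.0: naive rate, cigarette smokers' own age mix
print(weighted_rate(rates_cig, n_pipe))   # 51.0: adjusted to the pipe/cigar age distribution
```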
- Cochran (1968), in his paper using subclassification, recalculates the mortality rates for the three countries and smoking groups. Adjusting for age, cigarette smokers have the highest death rates
- What variables should we use for adjustment?
- In the prior example, we were guided by the DAG and the backdoor criterion to choose the right covariates: exogenous random variables assigned to individual units prior to treatment
  ● Controlling for the right covariates allows us to close all open backdoor paths, so CIA is achieved
  ● A variable is exogenous with respect to D if the value of X doesn't depend on D. Usually these are time-invariant variables, e.g., race.
  ● But covariates must not be colliders either
  ● This is why, when we try to adjust for a confounder using subclassification, we rely on a credible DAG to help guide the selection of variables

Subclassification
- Formally, to estimate the causal effect of the treatment, we need CIA and the probability of treatment to be between 0 and 1 for each stratum:
  1. (𝑌¹, 𝑌⁰) ⊥ 𝐷 | 𝑋 (conditional independence assumption)
  2. Pr(D = 1 | X) ∈ (0, 1) with probability one (aka common support)
- CIA simply means that the backdoor criterion is met in the data by conditioning on a vector X; i.e., conditional on X, the assignment of units to the treatment is as good as random
- CIA effectively requires that for each value of X, there's a positive probability of being both treated and untreated: 0 < Pr(Dᵢ = 1 | Xᵢ) < 1
- This means that the probability of receiving treatment for every value of the vector X is strictly within the unit interval
- Common support means that there should be units in BOTH the treatment and control groups. It ensures there's sufficient overlap in the characteristics of treated and untreated units to find adequate matches
- This is required to calculate any particular kind of ATE. Without it, you will just get some kind of weird weighted ATE for only those regions that do have common support
- "Weird" because that ATE doesn't correspond to any of the interesting effects policymakers need
- These two assumptions yield the identity:
- Here, being a woman (W) or a child (C)
made you more likely to be in 1st class
(D), but also made you more likely to
Where each value of Y is determined by survive (Y) because lifeboats were more
the switching equation likely to be allocated to women and
- Given common support, we get the children. There are no other confounders
following estimator: (observed or unobserved).
1. 𝐷 → 𝑌 (direct, causal path)
2. 𝐷 ← 𝐶 → 𝑌
- ATE requires that the treatment is 3. 𝐷 ← 𝑊 → 𝑌
conditionally independent of both potential - Data includes age and gender, so we can
outcomes close each backdoor path and satisfy the
- But here, to identify ATT, we need only backdoor criterion, through
● The treatment to be conditionally subclassification
independent of 𝑌⁰, and
- But first, let's calculate the naive SDO for the sample: E[Y|D = 1] - E[Y|D = 0]
● There exist some units in the
control group for each treatment
strata
- Note: The reason for the common support
assumption is because we are weighing
the data; without common support, we
can’t calculate the relevant weights
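For reference, a standard way to write the subclassification estimators described above (with strata s = 1, …, S of X, stratum sizes Nˢ overall and N_Tˢ among the treated, and stratum means Ȳ¹,ˢ and Ȳ⁰,ˢ); this is the usual textbook form, not a reproduction of the original slide equations:

```latex
\hat{\delta}_{ATE} = \sum_{s=1}^{S}\left(\bar{Y}^{1,s}-\bar{Y}^{0,s}\right)\frac{N^{s}}{N},
\qquad
\hat{\delta}_{ATT} = \sum_{s=1}^{S}\left(\bar{Y}^{1,s}-\bar{Y}^{0,s}\right)\frac{N_{T}^{s}}{N_{T}}
```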
Example: Surviving the Titanic disaster
- What role did wealth and norms play in
passengers’ survival in the Titanic - SDO reveals that being seated in first
disaster? class raised the probability of survival by
- E.g., Did being seated in 1st class make 35.4%
someone more likely to survive? - But this is likely a biased estimate of the
● Problem is that women and ATE, because it doesn’t adjust for
children were explicitly given observable confounders, age and sex
priority for boarding the scarce - Let’s use subclassification weighting to
lifeboats control for the confounders. Steps:
● If they’re more likely to be seated 1. Stratify the data into 4 groups:
in 1st class, maybe differences in young males, young females, old
survival by first class is simply males, old females
picking up the effect of that social 2. Calculate the difference in
norm survival probabilities in each
● DAG can reveal the sufficient group
conditions for identifying the 3. Calculate the number of people in
causal effect of 1st class on the non-first-class groups, and
survival divide by the total number of
non-first-class population. These
are the strata-specific weights
4. Calculate the weighted average sparseness in some cells where sample is
survival rate using these strata too small
weights
- But suppose the problem is with the
treatment group, and there’s always
someone in the control group for any given
age-sex combination. Then we can
calculate ATT if there exist controls for a
given treatment strata
- In dataset, for male 11-yo and 14-yo, there
are both treatment and control group
values for the calculation
- Equation to compute ATT:
- As the number of covariates/dimensions K
- Note: once we condition on confounders increases, subclassification becomes less
(age, sex), 1st class seating has a much feasible because data becomes sparse
lower probability of survival associated ● Sample is too small relative to
with it (ATE is 18.9% only). size of covariate matrix. Missing
- What if age data had specific ages (not values will appear for K
just young and old)? categories
- When we condition on individual age and - If we added a third strata (e.g., race), we’d
sex, probably we won’t have necessary have two age, two sex, two race
information to calculate differences in categories—8 possibilities. Many cells will
strata – and we’re unable to calculate likely be blank
strata-specific weights needed for - This is the curse of dimensionality: many
subclassification. (That is, the common cells may contain either only treatment or
support assumption is violated.) only control units, not both. We won’t have
- Consider data with specific ages, the common support
subdataset is shown as follows: ● We need to look for an alternative
method to satisfy the backdoor
criterion
Exact matching
- Recap: Subclassification uses the
difference between treatment and control
group units and achieves covariate
- For each stratum, there must exist balance by using the K probability weights
observations in both the treatment and to weight the averages
control group ● It uses raw data, but weighting it
- But there aren’t any 12-yo male to achieve balance. We weight
passengers in 1st class, nor 14-yo male the differences, and sum over
passengers in 1st class. Applies to many those weighted differences
other age-sex combinations - Subclassification, though, often runs into
- In this case, we can’t estimate ATE by the curse of dimensionality; in many
subclassification: stratifying variable has research projects you’ll likely run into many
too many dimensions, and we have variables
- One alternative is to resort to matching: - But we can also estimate ATE
one estimates δ𝐴𝑇𝑇 by imputing the ● We are filling in both missing
control group units like before,
missing potential outcomes by conditioning and missing treatment group units
on the confounding, observed covariate ● If i is treated, we need to fill in the
● That is, we filled in the missing 0
missing 𝑌𝑖 using the control
potential outcome for each
treatment unit using the control matches; if i is a control unit, we
group that’s closest to the 1
need to fill in the missing 𝑌𝑖 using
treatment group unit for some X
the treatment matches
confounder
- The estimator is as follows:
● This would give us estimates of
all counterfactuals from which we
can simply take the average over
the differences. This also
achieves covariate balance Where 2𝐷𝑖 − 1 is a trick; if 𝐷𝑖 = 1, leading
- Two types of matching: exact and term becomes 1; if 𝐷𝑖 = 0, leading term is
approximate
-1, and outcomes reverse order so the
- With exact matching, the simple matching
treatment observation can be imputed
estimator is given by:
Example: Job training program
- Consider the example data from a job
training program, and a list of
non-participants/non-trainees
Where 𝑌𝑗(𝑖) is the jth unit matched to the ith - Left group is the treatment, right the
control
unit based on the jth unit being closest to
- Matching algorithm will create a third group
the ith unit for some covariate X
called the matched sample consisting of
- E.g., A unit in the treatment group has a
each treatment group unit’s matched
covariate with value 2, and there’s exactly
counterfactual. Here we’ll match on age
one other unit in the control group with
covariate 2. We will impute the treatment
unit’s missing counterfactual with the
matched unit’s, and take the difference
- What if there's more than one variable
closest to the ith unit (e.g., the same ith
unit has a covariate of 2 and we find two j
units with value of 2)?
- One option is to just take the average of
those two units’ Y outcome value
- If there are 3 close units, or 4? However
many matches M there are, assign the
average outcome 1/M as the
counterfactual for the treatment group unit.
The estimator is:
Where we just replaced 𝑌𝑗(𝑖). That is, we
average over several close matches, not
just one
- This works well if we can find a number of
good matches for each treatment group
unit. We usually define M to be small like
M=2; if greater than 2, simply randomly
select two units to average outcomes over
- The above equations are ATT estimators
(summing over the treatment group)
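A minimal pandas sketch of the exact-matching ATT estimator just described: for each treated unit, average the outcomes of all control units with the same covariate value (handling multiple matches by averaging), then average the differences over the treatment group. The column names are illustrative:

```python
import pandas as pd

def exact_match_att(df, y="earnings", d="treated", x="age"):
    """ATT by exact matching on a single covariate x.

    For each treated unit, impute the missing Y0 with the average outcome of
    all control units sharing the same value of x, then average the
    treated-minus-imputed differences (summing over the treatment group).
    """
    controls = df[df[d] == 0].groupby(x)[y].mean()   # mean control outcome per x value
    treated = df[df[d] == 1]
    matched_y0 = treated[x].map(controls)            # impute each treated unit's counterfactual
    if matched_y0.isna().any():
        raise ValueError("Some treated units have no exact match (no common support).")
    return (treated[y] - matched_y0).mean()

# toy usage
df = pd.DataFrame({
    "treated":  [1, 1, 0, 0, 0],
    "age":      [18, 30, 18, 30, 30],
    "earnings": [9500, 12000, 9000, 10000, 11000],
})
print(exact_match_att(df))   # (9500-9000 + 12000-10500)/2 = 1000
```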
- Note how the ages of trainees differ on - This won't always be the case, but as the
average from the ages of non-trainees control group sample size grows, the
● The average age of participants is likelihood of us finding a unit with the same
24.3, and non-participants 31.95. covariate value as one in the treatment
(People in the control group are group grows
older; and since wages rise with - Do this for all treatment units (trainees)
age, and that could explain partly - E.g., For treatment unit 1, their age is 18.
why their average earnings are Search down the list of non-trainees, and
higher.) you find exactly one person with age 18
- The two groups are not exchangeable (unit 14). We move the age and earnings
because the covariate is not balanced info of unit 14 to the new matched sample
- Histogram shows the age distribution for column
the treatment and control groups, - If there’s more than one control group unit
respectively that's close, simply average over them
● The two populations not only ● E.g., There are two non-trainees
have different means, but the with age 30, units 10 and 18. So
entire age distribution across get the average earnings and
samples is different match that average for treatment
unit 10
- Let’s use the exact matching algorithm and
create the missing counterfactuals for each
treatment group unit (i.e., we create the
matched sample).
● I.e., We impute the missing units
for each treatment unit, so what
we get is an estimate of δ𝐴𝑇𝑇
- The distance traveled to the nearest neighbor will be zero (ages are integers): find the control unit with the closest value on X to fill in the missing counterfactual for each treatment unit.
- Now, in the matched sample, the mean
age is the same for both groups. Apparent
in the distribution of age for the matched
sample
- That is, the two groups are now exactly
balanced on age, and are now
exchangeable
- The difference in earnings between those - But here distance itself depends on the
in the treatment and control group is scale of the variables. So researchers
$1,695 (so the causal effect of the program typically use a modification of Euclidean
was $1,695 in higher earnings) distance, such as normalized Euclidean
distance, where the distance of each
Exact matching variable is scaled by the variable’s
- In sum, the treatment and control groups variance
are likely to be different in ways that are a - The distance is measured as
direct function of potential outcomes—thus
violating the independence assumption
- But if the treatment assignment was
conditionally random, matching on So that
covariates X creates an exchangeable set
of observations (a balanced matched
sample)
- Just find a unit or collection of units that
have the same value as some covariate X,
and substitute their outcomes as some unit
j’s counterfactuals. Then, get the
differences for an estimate of the ATE.
Approximate matching - The normalized Euclidean distance is
- What if you couldn’t find another unit with
the exact same value? Do approximate
matching
- For instance, if K grows large, how can we
match on more than one variable without
using the subclassification approach? So that if there are changes in the scale of
- We use the notion of distance: how close X, the changes also affect its variance, so
is one unit’s covariate to someone else’s? the normalized Euclidean distance doesn’t
What does it mean when there are multiple change
covariates with measurements in multiple - There’s also the Mahalanobis distance,
dimensions? which is a scale-invariant distance metric:
- If you match on a single covariate that’s
straightforward; distance is measured in
terms of the own covariate’s measures
● E.g., Age is just how close in −1
years/months one person is to Where Σ𝑋 is the sample
another variance-covariance matrix of X
● E.g., If we have several - All in all, having more than on covariate is
covariates for matching (age and problematic: it creates the curse of
log income), a 1-point change in dimensionality problem, and distance is
age is different from a 1-point values produce larger matching
change in log income, and we’re in the data is not trivial
now measuring distance in 2 - There’s always going to be matching
dimensions discrepancies
- We need a new definition of distance, the - E.g., 𝑋𝑖 ≠ 𝑋𝑗, so some unit i has been
simplest being Euclidean distance:
matched with j on the basis of a covariate
value of X = x. Maybe i has age 25, but j is
26. Sometimes differences are zero, small,
or large. As they move away from zero,
they introduce bias and spell trouble for
estimation
- Thankfully, matching discrepancies tend to
converge to zero as the sample size
increases. Approximate matching is
“data-greedy.” The more covariates, the
longer it takes for convergence to zero to
occur
● The larger the dimension, the
greater likelihood of matching
discrepancies, and the more data
you need
- What options do you have if you don’t
have a large dataset with many controls? - Applying the central limit theorem and the
- Abadie and Imbens (2011) introduced bias
difference, √𝑁_T (δ̂_ATT − δ_ATT) converges to
correction techniques with matching
estimators when there are matching a normal distribution with zero mean
discrepancies in finite samples - But,
- Let’s derive the bias if we have poor
matching discrepancies. Subtract from the
sample ATT estimates the true ATT: - If the number of covariates becomes large,
Where each i and j(i) units are matched, the difference between 𝑋𝑖 and 𝑋𝑗(𝑖)
𝑋𝑖 ≈ 𝑋𝑗(𝑖), and 𝐷𝑗(𝑖) = 0 converges slowly to zero. So, the
- Then we define the conditional difference between µ⁰(𝑋𝑖) and µ⁰(𝑋𝑗(𝑖))
expectations outcomes, based on the
converges to zero very slowly
switching equation for both control and
- Also, the RHS may not converge to zero,
treatment groups:
and therefore, also the LHS
- The bias of the matching estimator can be
severe depending on the magnitude of
these matching discrepancies
- We then write the observed value as a - BUT the discrepancies are observed, and
function of expected conditional outcomes we can see the degree to which each
and some stochastic element: unit’s mached sample has severe
mismatch on the covariates themselves
- We can also make the matching
discrepancy small by using a large donor
pool of untreated units to select our
- Rewrite the ATT estimator using the above matches, since the likelihood of finding a
𝜇 terms: good match grows as a function of sample
size
- So if we are content to estimating ATT,
increasing the donor pool’s size can be
useful. And if we can’t increase the donor
- Note that the first line is just ATT with the pool, we can apply bias-correction
stochastic element from the previous line. methods to minimize the bias (Abadie &
Second line rearranges, so we get the Imbens, 2011)
estimated ATT plus the average difference - Total bias is made up of the bias
in the stochastic terms for the matched each treated observation contributes
sample each treated observation contributes
- Let's now compare this estimator with the µ⁰(𝑋𝑖) − µ⁰(𝑋𝑗(𝑖)) to the overall bias
true value of ATT - The bias-corrected matching estimator
is:
Where µ̂⁰(𝑋) is an estimate of E[Y|X = x, D = 0] using, say, OLS
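A compact sketch of the bias-corrected matching estimator described above, assuming a single nearest-neighbor match on one covariate and an OLS fit of Y on X among controls for µ̂⁰(X); the function and variable names are illustrative:

```python
import numpy as np

def bias_corrected_att(y, d, x):
    """Nearest-neighbor ATT with an Abadie-Imbens-style bias correction.

    y, d, x: 1-D arrays of outcomes, treatment indicators (0/1), and a single
    continuous covariate. mu0 is an OLS fit of Y on X using control units
    only, used to predict E[Y | X, D=0].
    """
    y, d, x = map(np.asarray, (y, d, x))
    yc, xc = y[d == 0], x[d == 0]
    b, a = np.polyfit(xc, yc, 1)          # OLS among controls: mu0(x) = a + b*x
    mu0 = lambda v: a + b * v

    diffs = []
    for yi, xi in zip(y[d == 1], x[d == 1]):
        j = np.argmin(np.abs(xc - xi))    # nearest control on X
        # matched difference minus the discrepancy in predicted control outcomes
        diffs.append((yi - yc[j]) - (mu0(xi) - mu0(xc[j])))
    return np.mean(diffs)
```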
Example of approximate matching
- Consider the example below for 8 units, 4
And we get the outcomes, treatment
treated and 4 controls
status, and predicted values in the
- According to the switching equation, we
following table
only observe the actual outcomes
- Do this for the other three sample
associated with the potential outcomes
differences, each of which is added to the
under treatment or control. So we’re
bias-correction term based on the fitted
missing the control values of the treatment
values from the covariate values.
group
- We can’t implement exact matching
because none of the treatment group units
has an exact match in the control group—a
consequence of finite samples. (If the
sample size control group grows faster
than the treatment group, the likelihood of
finding an exact match grows.)
- We now use the fitted values for bias
correction
- We use nearest-neighbor matching, - Take the simple differences (e.g., 5 - 4 in
which is simply going to match each row 1) but also subtract out the fitted
treatment unit to the control group unit values associated with each observation’s
whose covariate value is nearest to that of unique covariate
the treatment group itself - E.g., In row 1, the outcome 5 has covariate
- We necessarily create matching 11, giving a fitted value of 3.89. But the
discrepancies, though: covariates are not counterfactual has value 10, giving a
perfectly matched for every unit predicted value of 3.94. So we use the
- The nearest-neighbor algorithm creates following bias correction:
the following table:
- This illustrates how a specific fitted value is
calculated, and how it contributes to the
ATT calculation
- In full, the entire calculation is as follows:
- We can estimate the ATT from the sample
as follows:
- This is slightly higher than the adjusted
ATE of 3.25
- With bias correction, we need to estimate - The bias-correcting adjustment is more
µ⁰(𝑋) using OLS (fitted values from a significant as the matching discrepancies
themselves become more common
regression of Y on X). We get:
- If not common to begin with, by definition, developed by Rubin (1977) and
bias adjustment doesn’t change the Rosenbaum and Rubin (1983)
estimated parameter very much - Similar to both subclassification and the
nearest-neighbor matching of Abadie and
Approximate matching Imbens (2006)
- Again, bias arises because of the large - PSM is very popular especially in
matching discrepancies. To minimize medicine, as a way to address selection on
discrepancies, observation. It’s also popular among
● We need a small number of economists
matches M, like 1. Larger M - But these days, PSM is not so popular and
values produce larger matching widely used among economists, vis-a-vis,
discrepancies say, regression discontinuity or DiD
● We also need matching with - It helps to be agnostic about whether CIA
replacement, and since this can holds or doesn’t hold in your
use untreated units as a match study/application. There’s no theoretical
more than once, matching with basis to dismiss a procedure designed to
replacement produces smaller estimate causal effects on some ad hoc
discrepancies principle one holds due to a hunch
● Finally, try to match covariates - Only prior knowledge and deep familiarity
with a large effect on µ⁰(·) about the institutional details of a topic can
- Matching estimators have a normal tell you what the appropriate identification
distribution in large samples, if the bias is strategy is
small - Insofar as the backdoor criterion is met,
- For matching without replacement, the matching methods may be perfectly
usual variance estimator is valid: appropriate. If backdoor criterion is NOT
met, then matching methods may be
inappropriate—but naive multivariate
regression will likely be inappropriate, too.
- How does PSM work?
- For matching with replacement, - PSM takes necessary covariates,
estimates a maximum likelihood model of
the conditional probability of treatment
(logit or probit to ensure that fitted values
are bounded by 0 and 1), and uses the
predicted values to collapse those
covariates into a single scalar called a
Where 𝐾𝑖 is the number of times an
propensity score. All comparisons
observation i is used as a match between treatment and control are based
- 𝑉𝑎𝑟(𝑌𝑖|𝑋𝑖, 𝐷𝑖 = 0) can be estimated by
- E.g., We have two units A, B assigned to
matching
treatment and control, respectively. Their
- If you have two observations 𝐷𝑖 = 𝐷𝑗 = 0,
propensity score is 0.6. So they have the
and 𝑋𝑖 ≈ 𝑋𝑗, same 60% conditional probability of being
assigned to treatment
● By random chance, A was
assigned to treatment, B to
control
- PSM compares units who, based on
And this is an unbiased estimator of variables, had very similar probabilities of
𝑉𝑎𝑟(ϵ𝑖|𝑋𝑖, 𝐷𝑖 = 0) being placed into treatment group—even
though those units differed with regard to
- But the bootstrap doesn’t create valid
actual treatment assignment
standard errors (Abadie and Imbens 2008)
- If conditional on X, two units have the
same probability of being treated, they
Propensity score methods
have similar propensity scores, and all
- A popular way to achieve the conditioning
remaining variation in treatment
strategy implied by the backdoor criterion
assignment is due to chance
is the propensity score method
● Insofar as A, B have the same - Jobs varied within sites (some were gas
propensity score of 0.6, but one is station attendants, others worked at a
in treatment and the other is not, printer shop). Men and women frequently
and CIA credibly holds in the performed different kinds of work
data, then differences in observed - MDRC collected earnings and
outcomes are attributable to the demographic information from both the
treatment treatment and control groups, at baseline
- Implicit here, though, is meeting the and every 9 months thereafter
common support assumption, which - They also conducted 4 post-baseline
requires there be units in the treatment interviews. But sample sizes differed from
and control group across the estimated one study to the next
propensity score - NSW was a randomized job training
● Here, we had common support program, so the independence assumption
for 0.6 because there was a unit was satisfied
in the treatment group (A) and - Hence, calculating ATE is straightforward:
one in control group (B) for 0.6
- Propensity score can thus be used to
check for covariate balance between
treatment and control group so that the two - Turns out, the treatment benefitted
groups become observationally equivalent workers: their real earnings post-treatment
in 1978 were more than the earnings of the
Example: NSW job training program control by $900 to $1800—depending on
- The National Supported Work the sample used
Demonstration (NSW) job training program - LaLonde (1986) evaluated the NSW
was operated by the Manpower program and commonly used econometric
Demonstration Research Corp (MDRC) in estimators' performance by trading out the
the mid-1970s - Particularly, he evaluated econometric
- It was a temporary employment program estimators’ performance by trading out the
meant to help disadvantaged workers experimental control group with data on
lacking basic job skills to move into the the non-experimental control group from
labor market by giving them work the US population
experience and counseling in a sheltered - He used 3 samples of the Current
environment Population Survey and 3 samples of the
- It randomly assigned qualified applicants Panel Survey of Income Dynamics but we
to training posts: treatment group received use one from each (after all,
all the benefits of NSW program, while non-experimental data is more commonly
control was left to fend for themselves encountered by economists)
- Program admitted women receiving Aid to - But the difference with NSW is that it’s a
Families with Dependent Children, randomized experiment, so we know the
recovering addicts, released offenders, ATE, and we can see how well a variety of
and men and women of both sexes who econometric models perform
didn’t complete high school ● If NSW increased earnings by
- Treatment group guaranteed a job for 9-18 about $900, do other estimators
months, depending on target group and yield that, too?
site - Results were consistently horrible:
- They were divided into crews of 3-5 estimates of LaLonde (1986) were very
participants who worked together and met different in magnitude and had the wrong
frequently with a NSW counselor to sign
discuss grievances with the program and - Next table shows the effect of the
performance treatment when comparing the treatment
- They were also paid for their work: NSW group to the experimental control group
offered trainees lower wages than they ● Baseline difference in real
would’ve received in a regular job, but earnings was negligible:
allowed earnings to increase for treatment made $39 more than
satisfactory performance and attendance control in pre-treatment period
- After participants’ terms expired, they were without controls, and $21 less in
forced to find regular employment multivariate regression—neither
statistically significant
● But post-treatment difference in unemployed in 1975, and less
average earnings was between likely to have considerable
$798 and $886 earnings in 1975
- Pessimistic conclusion of the paper led to ● In short, the two groups are not
more experimental observations exchangeable on observables,
- When he used the non-experimental data and likely on unobservables, too
as the control group, using one sample
from PSID and one from CPS, in nearly
every point estimate the effect was
negative. (Except in DiD model which had
a small, insignificant, positive effect)
- Dehejia and Wahba (1999) reevaluated
LaLonde (1986), using the same
non-experimental control group datasets
- They wanted to examine whether PSM
could be an improvement in estimating
- Stark difference when we move from the treatment effects using non-experimental
NSW control to either PSID or CPS was data. They also wanted to show the
because of selection bias: real earnings of diagnostic value of PSM
NSW participants would’ve been much - First, the authors estimated the propensity
lower than the non-experimental control score using maximum likelihood modeling,
group’s earnings then compared treatment units to control
units within intervals of the propensity
score itself
● This process of checking if there
- That is, it's highly likely that the real
earnings of NSW participants would’ve control for intervals of the
been much lower than the propensity is called checking for
non-experimental control group’s earnings common support
- From the SDO decomposition, the second ● Easy way is to plot the number of
form of bias is selection bias, and if treatment and control group
𝐸[𝑌⁰|𝐷 = 1] < 𝐸[𝑌⁰|𝐷 = 0], this will bias observations separately across
the ATE estimate downward (estimates the propensity score with a
show a negative effect) histogram. They found that the
- But in fact, a violation of independence overlap was almost nonexistent
also implies that covariates will be - In their CPS sample, the overlap was so
unbalanced across the propensity score bad that they dropped 12,611 observations
(aka balancing property) in the control group because their
- Next table shows the mean values for propensity scores were outside the
each covariate for the treatment and treatment group range
control groups, where the control is the - Also, a large number of observations had
15,992 obs. from the CPS. low propensity scores, evidenced by the
- Treatment group appears to be very fact that the first bin contains 2,969
different on average from the control group comparison units. Even with trimming, the
CPS sample along nearly every covariate overlap improved but it wasn’t great
listed - From this diagnostic, we learn that:
● NSW participants are more black, ● The selection bias on
more Hispanic, younger, less observables is probably extreme
likely to be married, more likely to since there are so few units in
have no degree and less both treatment and control for
schooling, more likely to be
given values of the propensity
score
● When there is considerable
bunching at either end of the
propensity score distribution, it
suggests that you have units who
differ remarkably on observables
with respect to the treatment
variable itself
● Trimming around those extreme
values has been a way of
addressing this when employing
traditional propensity score
adjustment techniques - We use the data from Dehejia and Wahba
- With estimated propensity scores, and (2002) for the following exercises. First, we
using a slightly different sample, Dehejia calculate the ATE from the actual
and Wahba (1999) estimated the treatment experiment
effect on real earnings in 1978 using the - From the code, the NSW job training
experimental treatment group, compared program caused real earnings in 1978 to
with the non-experimental control group increase $1,794.343
- They found that the NSW program caused
earnings to increase between $1,672 and
$1,794—depending on whether
exogenous covariates were included in a
regression. Both estimates were highly
significant
- The first two columns in the next table
labeled unadjusted and adjusted represent
OLS regressions without and with controls, respectively
● Without controls, PSID and CPS
estimates were extremely
negative and precise—but
- Next, we look at examples in which we
because of severe selection bias
estimate the ATE or some of its variants,
in the NSW program
such as ATT and ATU
● With controls, effects become
- Rather than use the experimental control
positive and imprecise for the
group from the original randomized
PSID sample, though almost
experiment, we use the non-experimental
significant at 5% for CPS. Each
control group from the Current Population
effect size is only about half the
Survey
true effect
- While the treatment group is an
- Using PSM, the results considerably
experimental group, the control group now
improved over LaLonde (1986)
consists of a random sample of Americans
- Treatment effects are positive and similar
from that period
in magnitude to what they found in
- So the control group suffers from extreme
columns 1 and 2 using only the
selection bias, since most Americans
experimental data
wouldn’t function as counterfactuals for the
distressed group of workers who selected
into the NSW program
- We will append now the CPS data to the
experimental data, and estimate the
propensity score using logit
- The propensity score is the fitted values of
the logit model. Put differently, we used the
estimated coefficients from logit to
estimate the conditional probability of
treatment—assuming that probabilities are
based on the cumulative logistic
distribution:
Where 𝐹(·) = 𝑒^(·) / (1 + 𝑒^(·)) and X is the vector of
exogenous covariates
- The propensity score used the fitted values
from the maximum likelihood regression to
calculate each unit’s conditional probability
of treatment regardless of actual treatment
status
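A minimal sketch of this step with statsmodels' Logit: fit the conditional probability of treatment on the covariates and take the fitted values as the propensity scores. The DataFrame and column names below are placeholders, not the study's actual variables:

```python
import statsmodels.api as sm

def estimate_pscore(df, covariates, d="treat"):
    """Fit a logit of treatment on covariates; return fitted probabilities.

    The fitted values are each unit's estimated conditional probability of
    treatment, p(X) = Pr(D=1 | X), regardless of actual treatment status.
    """
    X = sm.add_constant(df[covariates])
    logit = sm.Logit(df[d], X).fit(disp=0)
    return logit.predict(X)        # probabilities in (0, 1) by construction

# e.g. (hypothetical column names):
# df["pscore"] = estimate_pscore(df, ["age", "educ", "black", "hisp", "married", "re74", "re75"])
```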
- The definition of the propensity score is the
selection probability conditional on the
confounding variables:
𝑝(𝑋) = 𝑃𝑟(𝐷 = 1|𝑋)
● I.e., Propensity score is just the
predicted conditional probability - The next graph shows the distribution of
of treatment or fitted value for the propensity score for the two groups
each unit using histograms
● It’s advisable to use maximum - The probability of treatment is spread out
likelihood here, so the fitted across the units, but there’s a large mass
values are in [0,1]. If we use a
linear probability model, it can
create values below zero or
above one, which are not true
probabilities in [0,1]
- Note that the CIA assumption is not negative selection into the
testable because it’s based on treatment, they’re younger, less
unobservable potential outcomes likely to be married, more likely to
- Unlike CIA, the second assumption, be educated, and a minority
common support, is testable by simply - So, if the two groups are significantly
plotting histograms or summarizing data different on background characteristics,
- Using the data, the mean value of the the propensity scores will have grossly
propensity score for the treatment group is different distributions by treatment status
0.43, and the mean for the CPS control - The simple diagnostic tests show the
group is 0.007 problem later if we use inverse probability
● The following table shows that the weighting (for later)
50th percentile for the treatment
group is 0.4, but the control group
doesn’t reach a high enough
number until the 99th percentile
Propensity score matching
- The treatment parameter under both
assumptions would be:
- The CIA allows us to make the following
substitutions:
longer, and D, X are independent of one
another conditional on the propensity
score:
- So under both assumptions,
- From this, we also obtain the balancing
property of the propensity score:
- From the assumptions we derive the
propensity score theorem, which states
that under CIA,
Which means that, conditional on the
propensity score, the distribution of the
covariates is the same for treatment and
control group units
Which yields:
Where p(X) = Pr(D=1|X) (propensity score)
- This means that to achieve independence,
assuming CIA, we just have to condition
on the propensity score—and it’s enough
to have independence between the
treatment and potential outcomes
- An extremely valuable theorem since
stratifying on X tends to run into the
sparseness-related problems (empty cells)
in finite samples for even a moderate
number of covariates
● By contrast, propensity scores
are just scalars, so stratifying
across a probability is going to
reduce the dimensionality - In the DAG on the right, there exist two
problem paths between X, D: the direct path X →
- See proof of propensity score theorem in p(X) → D, and a backdoor path X → Y ←
book (application of law of iterated D (blocked by a collider, so there’s no
expectations) systematic correlation between X and D
- Like the omitted variable bias formula in through it)
regression, the propensity score theorem - But there’s a systematic correlation
says you need only control for covariates between X, D through the first directed
that determine the likelihood a unit path
receives the treatment - When we condition on propensity score
- More, it says that the only covariate you p(X), D and X are statistically independent:
need to condition on is the propensity D ┴ X|p(X). This implies
score—all of the information from matrix X
has been collapsed to the (scalar)
propensity score
- We can directly test this, but conditional on
- Corollary: given CIA, we can estimate the
the propensity score, treatment and control
ATE by weighting appropriately the simple
should be (on average) the same with
difference in means
respect to X
- Because the propensity score is a function
- That is, the propensity score theorem
of X,
implies balanced observable covariates
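One simple way to act on this is to cut the estimated propensity score into strata and compare covariate means by treatment status within each stratum; a small pandas sketch (with illustrative column names) follows.

```python
import pandas as pd

def balance_by_pscore_strata(df, covariates, d="treat", pscore="pscore", n_strata=5):
    """Compare covariate means by treatment status within propensity-score strata.

    If conditioning on p(X) balances X, treated and control means should be
    similar within each stratum.
    """
    df = df.copy()
    df["stratum"] = pd.qcut(df[pscore], q=n_strata, labels=False, duplicates="drop")
    return df.groupby(["stratum", d])[covariates].mean()
```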
Weighting on the propensity score
- Using estimated propensity score, how can
we estimate ATEs?
- Thus, conditional on the propensity score,
- Busso, Dinardo, and McCrary (2014)
the probability that D=1 depends on X no
examined the properties of various
approaches, and found that inverse - We have a few options for estimating the
probability weighting was competitive in variance of this estimator, one simply to
several simulations use bootstrapping (Efron 1979)
- Basically, give more weight to - In the context of inverse probability
underrepresented people to make the weighting, we repeatedly draw with
control group similar to the treatment replacement a random sample of our
group original data, then use that smaller sample
● If someone in the treatment group to calculate the sample analogs of ATE or
had a low probability of taking the ATT
treatment, they get a higher - With the smaller bootstrapped data, we
weight because they are rare first estimate the propensity score, which is
● If someone in the control group then used to calculate the sample analogs
had a low probability of not taking of ATE or ATT
the treatment, they also get a - Do this over and over (1,000 or 10,000
higher weight times) to obtain a distribution of treatment
● People who were highly likely to effects corresponding to the different cuts
be in either group get a lower of the data
weight because they don’t need - We also get a distribution of the parameter
extra emphasis estimates from which we calculate the
- This comes from the work of Horvitz and standard deviation—which becomes akin
Thompson (1952) to a standard error, and gives us a
- Assuming CIA holds in the data, we can measure of the dispersion of the
use a weighting procedure where each parameter estimate under uncertainty
individual’s propensity score is a weight of regarding the sample itself
the individual's outcomes. When - Adusumilli (2018) and Bodory et al. (2020)
aggregated, this can identify some ATE discuss the performance of various
- Weight enters the expression differently bootstrapping procedures, such as the
depending on each unit’s treatment status, standard bootstrap and wild bootstrap
and takes on two forms depending on - The sensitivity of inverse probability
whether that target parameter is ATE or weighting to extreme values of the
ATT: propensity score has led some
researchers to propose an alternative that
can handle extremes better
- Hirano and Imbens (2001) proposed an
inverse probability weighting estimator of
the ATE assigning weights normalized by
the sum of propensity scores for treated
and control groups, as opposed to equal
weights of 1/N for each observation.
Millimet and Tchernis (2009) call this the
“normalized estimator”:
- The sample versions of both (written
- Most statistical software have programs
below) are obtained by a two-step
estimating the sample analog of these
estimation procedure
inverse probability weighted parameters
1. Estimate the propensity score
using the second method with normalized
using logit or probit
weights. (E.g., Stata’s -teffects- command.)
2. Use the estimated propensity
They’ll also generate standard errors
score to produce sample versions
- We can also manually calculate the point
of one of the ATE estimators
estimates to see how to use propensity
above
scores to construct non-normalized or
normalized weights, and then estimate
ATT
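For concreteness, here is a sketch of the non-normalized and normalized inverse-probability-weighted ATT computations described above, with optional trimming of extreme propensity scores; array names are placeholders, and this is a generic illustration rather than the exact code behind the numbers reported below:

```python
import numpy as np

def ipw_att(y, d, pscore, trim=None):
    """Inverse-probability-weighted ATT, non-normalized and normalized versions.

    y, d, pscore: 1-D arrays. If trim is a (lo, hi) tuple, keep only units with
    lo <= pscore <= hi before estimating (common-support trimming).
    """
    y, d, p = map(np.asarray, (y, d, pscore))
    if trim is not None:
        keep = (p >= trim[0]) & (p <= trim[1])
        y, d, p = y[keep], d[keep], p[keep]

    w = p / (1 - p)                 # odds weight applied to control outcomes
    n_t = d.sum()

    att_raw = (d * y - (1 - d) * w * y).sum() / n_t
    att_norm = (d * y).sum() / n_t - ((1 - d) * w * y).sum() / ((1 - d) * w).sum()
    return att_raw, att_norm

# e.g. ipw_att(df["re78"], df["treat"], df["pscore"], trim=(0.1, 0.9))
```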
- One can also match on the propensity
score (as an alternative to inverse
probability weighting)
- This is done by finding a couple of units
with comparable propensity scores from
the control unit donor pool, with some ad
hoc radius distance of the treated unit’s
own propensity score
- Researcher averages the outcomes, then
assigns that average as an imputation to
the original treated unit, as a proxy for the
potential outcome under counterfactual
control
- Then one trims to ensure common support
- Nearest-neighbor matching, along with
inverse probability weighting, is perhaps
the most common method for estimating a
PSM
- Using the propensity score pairs,
nearest-neighbor matching pairs each
treatment unit i with one or more
comparable control group units j, where
comparability is measured in terms of
distance to the nearest propensity score
- After deriving the matched sample, ATT is
estimated as:
- Using inverse probability weighting and the
non-normalized weighting procedure, the
estimated ATT is -$11,876. With
normalization of weights, ATT is -$7,238
- Why so different than with experimental Where 𝑌𝑖(𝑗) is the matched control group
data? Note that inverse probability
unit to i. We focus on ATT because of
weighting is weighting the treatment and
problems with overlap (as discussed
control units according to 𝑝(𝑋), which earlier
causes units with very small values of the
propensity score to blow up and become
unusually influential in calculating ATT
- So we’ll need to trim the data, and even a - Here we match using 5 nearest neighbors;
small trim can eliminate the mass of values i.e., we find the 5 nearest units in the
at the far-left tail control group, where “nearest” is
● Crump et al. (2009) developed a measured as closest on the propensity
principled method for addressing score itself
a lack of overlap: a rule of thumb - Unlike covariate matching, distance is
is to keep only observations on straightforward due to dimension reduction
the interval [0.1,0.9] afforded by the propensity score
● With trimmed propensity score, - We average actual outcome, and match
ATT is $2,006 with that to each treatment unit
non-normalized weights and - Unlike covariate matching, distance is
$1,806 with normalized weights. from its treatment value, and divide by 𝑁𝑇
This is similar to the true causal - Result is ATT of $1,725 with p < 0.05. (It’s
effect of $1,794. (Normalized relatively precise and similar to the
weights are closer) experiment’s result)
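A rough sketch of the k-nearest-neighbor matching on the propensity score described here (k = 5, matching with replacement); array names are placeholders:

```python
import numpy as np

def psm_knn_att(y, d, pscore, k=5):
    """ATT via k-nearest-neighbor matching on the propensity score, with replacement.

    For each treated unit, average the outcomes of the k control units whose
    propensity scores are closest, use that as the imputed counterfactual, and
    average the treated-minus-imputed differences.
    """
    y, d, p = map(np.asarray, (y, d, pscore))
    y_t, p_t = y[d == 1], p[d == 1]
    y_c, p_c = y[d == 0], p[d == 0]
    diffs = []
    for yi, pi in zip(y_t, p_t):
        nearest = np.argsort(np.abs(p_c - pi))[:k]   # indices of the k closest controls
        diffs.append(yi - y_c[nearest].mean())
    return np.mean(diffs)
```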
- Standard errors can be calculated with
various bootstrapping methods Coarsened exact matching
- Recap:
● Exact matching matches a which affects what you’re
treated unit to all control units estimating. As long as you're
with the same covariate value. clear about this, readers may
But sometimes impossible accept it
because of matching ● When trimming data, you’re not
discrepancies (e.g., matching estimating ATE or ATT, just like
continuous age and income). when trimming propensity scores
Mismatching leads to bias - CEM is good in that it’s part of a class of
● Approximate matching specifies a matching methods called monotonic
metric (Euclidean distance, imbalance bounding (MIB), which bound
Mahalanobis distance, or the maximum imbalance in some feature
propensity score) to find control of the empirical distributions by an ex ante
units close to the treated unit decision by the user
- Iacus et al. (2012) introduced a kind of ● In CEM, this ex ante choice is the
exact matching called coarsened exact coarsening decision, and users
matching (CEM). It’s based on the notion control the amount of imbalance
that sometimes it’s possible to do exact in the matching solution. It’s also
matching once we coarsen the data very fast
enough - One measure of imbalance is L1. It’s a
● CEM simplifies matching by number that tells you how different your
grouping people into broad treatment and control groups are before
categories (or bins) before finding and after matching
matches - Perfect global balance is indicated by
● I.e., if we create categorical L1=0. Larger values indicate larger
variables (e.g., 0-10 yo, 11-12 imbalance, with a maximum of 1. So,
yo), we can often find exact there’s “imbalance bounding” between 0
matches. Once we do, calculate and 1
the weights on the basis of where - Let’s use the same job training data to
a person fits in some strata, and estimate
weights are used in a simple
weighted regression
- First, we begin with covariates X and make
a copy X*
- Next, we coarsen X* according to
user-defined cutpoints, or CEM’s
automatic binning algorithm. (e.g.,
schooling becomes less than high school,
high school only, some college, college
graduate, post-college)
- Third, we create one stratum per unique
observation of X* and place each
- The estimated ATE is $2,152, larger than
observation in a stratum
our estimated experimental effect. But this
- Fourth, we assign these strata to the
ensured a high degree of balance on
original and uncoarsened data X and drop
covariates, as seen in the output of Stata’s
any observation whose stratum doesn’t
-cem- command
contain at least one treated and control
- Note, too, that L1 values are close to zero
unit
in most cases, and the largest L1 is 0.12
- Fifth, we add weights for stratum size and
for squared age
analyze without matching
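A stripped-down pandas sketch of the CEM steps just listed (coarsen, stratify, drop unmatched strata, weight); the bin edges and column names are illustrative, and dedicated cem packages also report the L1 imbalance statistic:

```python
import numpy as np
import pandas as pd

def cem(df, bins, d="treat"):
    """Coarsened exact matching sketch: coarsen X, stratify, drop unmatched
    strata, and build the usual CEM weights (treated = 1, controls reweighted
    toward the treated strata distribution)."""
    df = df.copy()
    # coarsen each covariate into user-defined bins and form one stratum per
    # unique combination of coarsened values
    coarse = [pd.cut(df[c], edges, labels=False, include_lowest=True) for c, edges in bins.items()]
    df["stratum"] = [str(t) for t in zip(*coarse)]
    # keep only strata containing at least one treated and one control unit
    has_both = df.groupby("stratum")[d].transform(lambda s: s.nunique() == 2)
    df = df[has_both].copy()
    # CEM weights: w = 1 for treated; controls get (m_T^s / m_C^s) * (m_C / m_T)
    m_t, m_c = (df[d] == 1).sum(), (df[d] == 0).sum()
    counts = df.groupby(["stratum", d]).size().unstack()
    ratio = df["stratum"].map(counts[1] / counts[0])
    df["cem_weight"] = np.where(df[d] == 1, 1.0, ratio * (m_c / m_t))
    return df

# usage (hypothetical column names and bin edges):
# matched = cem(df, bins={"age": [17, 25, 35, 55], "educ": [0, 9, 12, 18]})
# then compare weighted outcome means or run a weighted regression on `matched`
```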
- There's a tradeoff: larger bins mean more Conclusion
Conclusion
coarsening of the data and fewer strata,
- Matching methods are an important
and in turn more diverse observations
member of the causal inference arsenal.
within the same strata and higher covariate
(Also an active area of research)
imbalance
- You never know when the right project
- CEM keeps only the treated and control
comes along for which matching methods
individuals who fall into the same bins
are the perfect solution, so don’t write
● CEM removes unmatched
them off!
individuals from both groups,
- Propensity scores are an excellent tool to
check the balance and overlap of
covariates. It’s an underappreciated
diagnostic, one you might miss if you ran
only regressions
- Propensity scores make groups
comparable, but only on the variables used
to estimate the propensity scores in the
first place
- Propensity score also has a very long
half-life: they make their way into other
designs like DiD
- But every matching solution to a causality
problem requires the credible belief that
the backdoor criterion can be achieved by
conditioning on some matrix X (CIA)
- This explicitly requires that there are no
unobservable variables opening backdoor
paths as confounders—but this might be a
leap of faith, and concluding that requires
deep institutional knowledge
- If you have good reason to believe that
there are important unobservable
variables, you might need another tool
LECTURE 6 ● In so doing, RDD can recover
Regression discontinuity ATE for a given subpopulation of
units
Intro - Consider the DAG of Steiner, Kim, Hall,
- There’s increasing interest in regression and Su (2017):
discontinuity design (RDD) in the past two
decades
- More papers on Google Scholar have used
the phrase “regression discontinuity
design.”
- Dates back to Thistlethwaite and Campbell
(1960). Didn’t catch on until a few doctoral
students and a handful of papers
- In economics, the first RDD paper was - In the first graph, X is a continuous
Goldberger (2008) (based on a much older variable assigning units to treatment D; X
1972 paper), but didn't catch on until 1999 → D
● Angrist and Lavy (1999) studied - Assignment is based on a cutoff score 𝑐0
the effect of class size on pupil so that any unit with a score above the
achievement in Israeli public cutoff gets placed into the treatment group,
schools, where smaller classes and units below don’t
were created when the number of - E.g., driving under the influence:
students passed a certain individuals with a blood-alcohol content of
threshold 0.08 or more are arrested and charged,
● Black (1999) used discontinuities while those below aren’t
at the geographic level created by - Assignment variable may itself
school district zoning to estimate independently affect the outcome variable
people’s willingness to pay for (X→Y) and may even be related to a set of
better schools variables U that independently determine
Y
- Note that treatment status is exclusively
determined by the assignment rule, and
not by U
- DAG shows clearly that assignment
variable X (aka the running variable) is
an observable confounder since it causes
both D, Y
- Since assignment variable assigns
treatment on the basis of a cutoff, we’re
never able to observe units in both control
- Cook (2008) said RDD was “waiting for and treatment for the same value of X
life” from 1972 to 1999 ● I.e., it doesn’t satisfy the overlap
- There was growing acceptance of the condition needed to use matching
potential outcomes framework among methods, so the backdoor
microeconomists (Angrist, Card, Krueger, criterion can’t be met
Levitt, etc.) and greater availability of large, - But we can identify causal effects with
digitized administrative datasets (many of RDD, as shown in the right graph,
which captured unusual administrative specifically for subjects whose score is in a
rules for treatment assignments)---a close neighborhood of cutoff score 𝑐0
confluence of factors that allowed RDD to
● The average causal effect for this
flourish
subpopulation is X → 𝑐0 in the
- Main appeal is that RDD convincingly
eliminates selection bias limit
● It's based on a simple, intuitive
idea. And its underlying the sole point where treatment
assumptions are viewed by many and control subjects overlap in
as easier to accept and evaluate the limit
- Explicit assumptions must hold. Mainly,
continuity must hold: i.e., the cutoff itself
cannot be endogenous to some competing ● Considering differences in
intervention occurring at precisely the same
moment the cutoff is triggering units into effects, are there heterogeneous
the D treatment category returns across public unis?
- I.e., the expected potential outcomes are - With positive selection into a flagship
continuous at the cutoff. If so, it rules out school, we might expect individuals with
competing interventions occurring at the higher ability (observed and unobserved)
same time to sort into that school. Since ability
- In the right graph, note there’s no arrow increases marginal product, they tend to
from X to Y, because 𝑐0 cut it off. At that earn more in the workforce—regardless if
point, X no longer has a direct effect on Y they attend the state flagship
- The null hypothesis is continuity, and any - Selection bias confounds our ability to
discontinuity necessarily implies some estimate the causal effect of attending the
cause, because the tendency for things to state flagship on earnings.
change gradually is what we come to - (Hoekstra, 2009) used RDD. He has data
expect from nature on all applications to the state flagship
● Darwin: Natura non facit saltum (building presumably a relationship with
(nature does not make jumps) the admissions office).
● Saying: “If you see a turtle on a ● Pro tip: Data acquisition (say, for
fencepost, you know he didn’t get RDD) requires far more soft skills
there by himself.” than what you’re used to
- We use our knowledge about selection into (friendship, respect, alliances).
treatment to estimate ATE ● “This isn’t as straightforward as
- We know that the probability of treatment simply downloading the CPS from
assignment changes discontinuously at 𝑐0, IPUMS; it’s going to take genuine
smiles, hustle, and luck.”
so our job is to simply compare people ● “It is of utmost importance that
above and below 𝑐0 to estimate the local you approach these individuals
average treatment effect (LATE) (Imbens with humility, genuine curiosity,
& Angrist, 1994)---a special type of ATE and most of all, scientific
- Since we don’t have overlap or common integrity.”
support, we must rely on extrapolation: we
compare units with different values of the
running variable, and they overlap in the
limit as X approaches the cutoff from either
direction
- All methods for RDD are ways of handling
the bias from extrapolation as cleanly as
possible
- Pictures of main results, including the
identification strategy, are absolutely
essential to any study attempting to
convince the reader of a causal effect
- RDD has a comparative advantage in
pictures: it’s a very visually intensive - Note in figure the horizontal axis, ranging
design (along with synthetic control) from a negative to a positive number, and
- Consider Hoekstra (2009) who looked at zero is at the center. Hoekstra (2009)
the causal effect of college on earnings. recentered the admissions criteria by
He estimated the causal effect of attending subtracting the admission cutoff from
the state flagship university students’ actual score. Recentered SAT is
● State flagship unis are often more the running variable.
selective than other public unis in - Vertical line at zero marks the cutoff, here
the same state the minimum SAT score for admissions.
● E.g., in Texas, the top 7% of ● It appears to be binding, but not
graduating HS students can deterministically, since some
select their university in state, and students who enrolled didn’t have
the modal first choice is UT the minimum SAT requirements
Austin
(other qualifications compensated Where ø is a vector of year dummies, w is
for lower SAT scores). a dummy for years after HS that earnings
- Hollow dots were used at regular intervals were observed, and Ө is a vector of
along the running variable. Dots represent dummies controlling for the cohort in which
conditional mean enrollments per the student applied to university
recentered SAT score. - Residuals were averaged for each
● Administrative dataset contains applicant, with the resulting average
thousands of observations, but he residual earnings used to implement a
shows only the conditional means partialled out future earnings variable
along evenly spaced bins of the - Student’s residuals (from the natural log of
running variable. earnings regression) were collapsed into
- There are two curvy lines fitting the data: conditional averages for bins along the
to the left of zero and to the right. recentered running variable, yielding the
Researcher fit lines separately to the left following graph
and right of the cutoff. - Note that the discontinuous jump at zero in
● They’re the least squares fitted earnings isn’t as compelling, so Hoekstra
values of the running variable, did hypothesis tests to determine if the
where the running variable can mean between the groups just above and
take on higher-ordered terms just below are the same.
(included in the regression). ● Turns out that it’s not a significant
● Doing so allows fitted values to difference: those just above the
more flexibly track the central cutoff earn 9.5% higher wages in
tendencies of the data. the long term than those below
- Finally, there’s a giant jump in the dots at (discontinuity is 0.095).
zero on the recentered running variable. ● Author explored various binning
● The probability of enrolling at the of the data (played around with
flagship uni jumps the bandwidth) and estimates
discontinuously when the student range from 7.4% to 11.1%.
barely hits the minimum SAT of
the school.
● If 𝑐0 = 1, 250, a student with
1,240 had a lower chance of
getting in; 10 points and you’re off
to a different path!
- The thing is, is a 1,240 student so different
from a 1,250 student?
- What if we have hundreds of students
getting 1,240 and 1,250 respectively?
Might they be similar on observable and
unobservable characteristics?
- If the uni is arbitrarily picking a reasonable
cutoff, are there reasons to believe they’re - So, at exactly the point where workers
also picking a cutoff where the natural experienced a jump in the probability of
ability of students jumps at the exact spot? enrolling in the state flagship uni, 10-15
- In the study, the state flagship uni sent the years later there’s a separate jump in
admissions data to a state office in which logged earnings of around 10%. Those
the employer submits unemployment who barely made it in made around 10%
insurance tax reports. more than those who just missed the
- The uni had social security numbers, so cutoff.
matching of student to future worker - Selection bias is present since the two
worked well. groups of applicants around the cutoff
- This yielded a matching of admissions data
to quarterly earnings records from 1998 to world where neither is attending the state
2005. flagship uni.
- Hoekstra (2009) estimated: - All in all, the study shows that college
matters for long-term earnings, as well as
the type of college (even among public
unis).
- This is an ingenious natural experiment - Sharp RDD is where treatment is a
demonstrating the heart and soul of RDD. deterministic function of running variable
- RDD is all about finding jumps in the X.
probability of treatment as you move along - Fuzzy RDD represents a discontinuous
a running variable. jump in the probability of treatment where
- These jumps/discontinuities are often 𝑋 > 𝑐0. The cutoff is used as an
found in rules. instrumental variable for treatment
- For this reason, firms, gov’t agencies are
unknowingly sitting atop mountains of
potential RDD-based projects. Try to hunt
for them, and build relationships with
people who can supply you with such
data.
- “Take them out for coffee, get to know
them, learn about their job, and ask them
how treatment assignment works.”
- How are individual units assigned to a
program: randomly or via a rule (e.g., if a
running variable exceeds a threshold,
people switch into some intervention)?
- RDD is valid even if the rule isn’t arbitrary;
it just needs to be known, precise, and Estimation using an RDD
free of manipulation. Program usually has Sharp RD design
running variable X with a “hair trigger” not - In sharp RDD, treatment status is a
tightly related to the outcome being deterministic and discontinuous function of a
studied. running variable 𝑋𝑖, where:
● E.g., Probability of being arrested
for DUI greater if there’s 0.08
blood alcohol content.
● E.g., Probability of receiving
healthcare insurance jumps at
- If you know the value of 𝑋𝑖 for unit i, you
age 65.
● E.g., Probability of receiving know treatment assignment for i with
medical attention jumps if certainty
birthweight falls below 1.5 kg. - But if for every value of X you can perfectly
● E.g., Probability of attending predict the treatment assignment, it means
summer school greater when there are no overlaps along the running
grades fall below a minimum variable.
level. - Assuming constant treatment effects, in
- We need a lot of data around terms of potential outcomes we get:
discontinuities (we therefore usually need
large datasets for RDD, e.g.,
administrative datasets).
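- A minimal Stata sketch of the assignment rule just described (the variable names x and d, and the 0.08 BAC cutoff, are illustrative, not from any of the cited studies):
    * sharp assignment: treatment is a deterministic function of the running variable
    gen byte d = (x >= 0.08)   // 1 if at or above the cutoff, 0 otherwise
    gen x_c  = x - 0.08        // recentered running variable (cutoff at zero)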
Estimation using an RDD
- Using the switching equation we get:
- There are two kinds of RDD studies: sharp
and fuzzy
- A sharp RD design is where the
probability of treatment goes from 0 to 1 at
the cutoff.
- A fuzzy RD design is where the Where the treatment effect parameter 𝛿 is
probability of treatment discontinuously the discontinuity in the conditional
increases at the cutoff. expectation function:
- At any rate, there should be a running
variable X that, upon reaching cutoff 𝑐0,
sees an increase in the likelihood of
receiving treatment
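- In symbols (a sketch consistent with the definitions above, using the continuity assumption discussed below), the sharp RDD parameter is the discontinuity in the conditional expectation function at the cutoff:
    \delta_{SRD} = E[Y^1_i - Y^0_i \mid X_i = c_0]
                 = \lim_{x \downarrow c_0} E[Y_i \mid X_i = x] \; - \; \lim_{x \uparrow c_0} E[Y_i \mid X_i = x]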
- The sharp RDD estimation is an average continuously related to the
causal effect of the treatment (aka LATE) running variable X.
as the running variable approaches the ● If there exist some omitted
cutoff in the limit—since it’s only in the limit variable wherein the outcome
we have overlap. could jump at 𝑐0 even if we
- Since identification in an RDD is a limiting disregard the treatment
case, we’re only identifying an average altogether, the continuity
causal effect for those units at the cutoff. assumption is violated, and our
- Insofar as those units have treatment methods won’t recover the LATE
effects that differ from units along the rest - Consider Carpenter and Dobkin (2009),
of the running variable, we have only studying mortality rates in the US for
estimated an ATE that’s local to the range different types of causes.
around the cutoff: - There’s a large discontinuous jump in
motor vehicle death rates at age 21, likely
because people tend to drink more at that
- Note the role that extrapolation plays in age, and sometimes when they’re driving.
estimating treatment effects with sharp - But what if there’s something else that
RDD. happens to 21-year-olds causing them to
● If unit i is just below 𝑐0, 𝐷𝑖 = 0. If become bad drivers? Or they graduate
from college and they become reckless
it’s just above 𝑐0, 𝐷𝑖 = 1 during celebrations?
● For any value of 𝑋𝑖, there are - The study looks at jump in motor vehicle
either units in the treatment or - As a placebo, one could look at the jump in motor
control groups but not both vehicle accidents at age 21 in a country like Uruguay,
● So RDD doesn’t have common where the drinking age is 18 (there should be no jump).
support, and that’s why we rely reasonably defined placebos,
on extrapolation for the estimation even if it’s not a direct test.
- The key identifying assumption in RDD is the continuity assumption: E[Y_i^0|X] and E[Y_i^1|X] are continuous (smooth) functions of X even across the 𝑐0 threshold
- Without the treatment, the expected
potential outcomes wouldn’t have jumped
(and they would’ve remained smooth
functions of X).
- That is, if E[Y^1|X] jumped at 𝑐0, something other than the treatment must have caused the jump, because Y^1 is already under treatment
- If so, then there are no competing
interventions at 𝑐0. Hence, continuity
explicitly rules out omitted variable bias at
the cutoff itself
● All other unobserved
determinants of Y are
- The figure shows the simulation result, 65. Recentering age by subtracting 65
which allows us to see the potential yields:
outcomes (not realistic). Where α = β0 + 65·β1
- Note that the value of E[Y^1|X] is changing - What about nonlinear data-generating
continuously over X through 𝑐0—the processes? They can yield false positives
continuity assumption if we don’t handle the specification
- So, absent the treatment itself, the carefully
expected potential outcomes would’ve - If the underlying DGP is nonlinear, it may
remained a smooth function of X even as be a spurious result due to model
one passes through 𝑐0 misspecification.
- Sometimes if we’re fitting local linear
- If continuity held, then only the treatment regressions around the cutoff, we could
triggered at 𝑐0 could be responsible for spuriously pick up an effect simply
discrete jumps in 𝐸[𝑌|𝑋] because we imposed linearity on the
- Because of switching equation, we can model.
only observe actual outcomes. Since units
switch from Y^0 to Y^1 at 𝑐0, we can’t directly
evaluate the continuity assumption
- Institutional knowledge can help establish
that there’s nothing else changing at the
cutoff that would otherwise shift potential
outcomes.
- In another simulation, while Y^1 doesn’t jump at 50 on the running variable X, the observed Y will.
Note in the following graph the jump at the
discontinuity in the outcome (LATE):
Estimation using an RDD
Sharp RD design
- Now let’s look at the regression model we
can use to estimate the LATE parameter in
RDD.
- It’s common for authors to transform the
- The DGP was nonlinear, but when fit with
running variable X by recentering at 𝑐0
straight lines to the left or right of the
(although this isn’t necessary): cutoff, the trends in the running variable
generate a spurious discontinuity at the
cutoff.
- This shows up in the regression, too. If we
- This doesn’t change the interpretation of
fit the model with least squares regression
the treatment effect, only the intercept.
controlling for the running variable, we get
- In Card, Dobkin, and Maestas (2008),
a causal effect even if there isn’t one.
Medicare is triggered when a person turns
● The estimated effect of D on Y is results in an RDD paper (Lee and Lemieux
large and highly significant, even 2010).
if the true effect is zero. - To derive the regression model, first note
● We’ll need some way to model that the observed values must be used in
nonlinearity below and above the place of potential outcomes:
cutoff to check if, even with
nonlinearity, there would be an
outcome jump at 𝑐0 - The regression model is then:
Where β*_1 = β_11 − β_01 and β*_p = β_1p − β_0p
- The equation we looked at earlier was just a special case of the above equation, where β*_1 = β*_p = 0
- The treatment effect at 𝑐0 is δ, and the treatment effect at 𝑋𝑖 − 𝑐0 = c is δ + β*_1·c + ... + β*_p·c^p
- Suppose the nonlinear relationship is E[Y^0_i|X_i] = f(X_i), for some reasonably smooth function f(X_i)
- We’d fit the regression model:
- Since 𝑓(𝑋𝑖) is counterfactual for values of
𝑋𝑖 > 𝑐0, we can model the nonlinearity and
approximate 𝑓(𝑋𝑖) by using a pth-order
polynomial:
- But it’s uncommon to see higher-order
polynomials in estimating local linear
regressions, and using them could lead to
overfitting and bias (Gelman and Imbens
2019).
- Another way is to use local linear
regressions with linear and quadratic forms
only.
- Note that reg y D##c.(x x2 x3) translates
- We can generate 𝑓(𝑋𝑖) by allowing 𝑋𝑖
to:
terms to differ on both sides of the cutoff, - Once we model the data using a quadratic
by including them both individually and (cubic was ultimately unnecessary), there’s
interacting them with 𝐷𝑖: no estimated treatment effect at the cutoff.
- Also, there’s no effect in our least squares
regression.
Where 𝑋𝑖 is the recentered running
variable, 𝑋𝑖 − 𝑐0
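- A minimal Stata sketch of the interacted specification just described (the names y, d, x_c and the 0.05 bandwidth are hypothetical; d is a 0/1 treatment dummy):
    * separate slopes on each side of the cutoff: d##c.x_c adds d, x_c, and d*x_c
    reg y d##c.x_c if abs(x_c) <= 0.05, robust
    * quadratic version, in the spirit of the reg y D##c.(x x2 x3) syntax noted nearby
    gen x_c2 = x_c^2
    reg y d##c.(x_c x_c2) if abs(x_c) <= 0.05, robust
  With the running variable recentered, the coefficient on 1.d is the estimated jump at the cutoff.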
- Centering ensures that the treatment effect - Another way is to estimate using
at 𝑋𝑖 = 𝑋0 is the coefficient on 𝐷𝑖 in a nonparametric kernel regression, where
we try to estimate regressions at the cutoff
regression model with interaction terms point, which can result in the boundary
- Allowing different functions on both sides problem (where bias is caused by strong
of the discontinuity should be the main
trends in expected potential outcomes
throughout the running variable). Example: Medicare and universal healthcare
- In the figure below, while the true effect is - Card et al. (2008) is a good example of
AB, with a certain bandwidth a rectangular sharp RDD, because it focuses on the
kernel would estimate the effect as A’B’ (a provision of universal HC insurance for the
biased estimator). elderly: Medicare at age 65.
- There’s systematic bias with the kernel - In 2014, Medicare was easily 14% of the
method if the underlying nonlinear function federal budget at $505 billion.
f(X) is upward- or downward-sloping. - Relevant also because of debates
surrounding Obama’s Affordable Care Act
and support for Medicare for All.
- In 2005, about a fifth of non-elderly adults
in the US lacked insurance, mostly from
lower-income families; nearly half were
African-American or Hispanic.
- Unequal insurance coverage contributes to
disparities in HC utilization and health
outcomes across socioeconomic status.
- But even among policies, there’s
heterogeneity in the form of different
- A standard solution is to run a local linear copays, deductibles, and other features
nonparametric regression, and this can that affect usage.
substantially reduce the bias. - Health insurance suffers from deep
- Kernel regression is a weighted regression selection bias, and this confounds
restricted to a window (hence, “local”). evidence that better insurance causes
- Kernel provides the weights to that better health outcomes.
regression. ● Both supply and demand for
● A rectangular kernel would give insurance depend on health
the same result as taking E[Y] at status, confounding observational
a given bin on X. comparisons between people with
● A triangular kernel gives more different insurance
importance to observations characteristics.
closest to the center. ● BUT for the elderly, it’s different:
- The model is some version of: less than 1% of the elderly is
uninsured, and most have
fee-for-service Medicare
Application: The close election coverage. Transition to Medicare
occurs sharply at age 65 (the
threshold for eligibility).
- Card et al. (2008) estimated a reduced
form model measuring the causal effect of
health insurance status on HC usage:
Where i indexes individuals, j
socioeconomic groups, a age; 𝑢𝑖𝑗𝑎 indexes
unobserved errors; 𝑦𝑖𝑗𝑎 HC usage, 𝑥𝑖𝑗𝑎 a
Estimation using an RDD
Sharp RD design set of covariates (like gender, region);
- Estimating the above in a given window of 𝑓𝑗(α; β) is a smooth function representing
width h around the cutoff is the age profile of outcome y for group j;
straightforward, but knowing how large or 𝑘
and 𝐶𝑖𝑗𝑎 (k=1,2,...,K) are characteristics of
small to make the bandwidth is not.
- This method is sensitive to the choice of the insurance coverage held by the
bandwidth, but some studies have looked individual (e.g., copayment rates)
at optimal bandwidths. (Bandwidths may - Suppose health insurance coverage can
also vary left and right of the cutoff.) be summarized by two dummy variables
C^1_ija (any coverage) and C^2_ija (generous insurance). The authors estimated the following linear probability models:
● Sample selection criteria are to drop records for people admitted as transfers from other institutions and limit people between 60 and 70 years at the age of admission.
● Sample sizes are 4M for California, 2.8M for Florida, 3.1M for NY.
(Where β^1_j, β^2_j are group-specific coefficients, g^1_j(a), g^2_j(a) are smooth age profiles for group j, and 𝐷𝑎 is a dummy if
the respondent is 65 or older. - Medicare is available to people at least 65
- Combining the 𝐶𝑖𝑗𝑎 equations and rewriting yo and worked 40 quarters or more in
covered employment or have a spouse
the reduced form model, we get:
who did.
- Coverage is available to younger people
with severe kidney disease, and recipients
of Social Security Disability Insurance.
Where h(a) = f_j(a) + δ^1·g_j^1(a) + δ^2·g_j^2(a) is the reduced form age profile for group j; π_j^y = π_j^1·δ^1 + π_j^2·δ^2; and v_ija^y = u_ija + v_ija^1·δ^1 + v_ija^2·δ^2 is the error term
- Eligible individuals can obtain Medicare hospital insurance (Part A) free of charge, and medical insurance (Part B) for a modest monthly premium.
- Individuals receive notice of impending
- Assuming that profiles f_j(a), g_j^1(a), g_j^2(a) eligibility shortly before turning 65, and are
are continuous at age 65 (i.e., the informed they have to enroll in it and
continuity assumption necessary for choose whether to accept Part B
identification), then any discontinuity in y coverage.
must be due to insurance. - Coverage begins on the first day of the
- The magnitudes will depend on the size of month in which they earn 65.
1 2 - There are five insurance-related variables:
the insurance changes at 65 (π𝑗 , π𝑗 ) and
probability of Medicare coverage, any
on the associated causal effects (δ , δ )
1 2 health insurance coverage, private
- Card et al. (2008) used many datasets: a coverage, 2 or more forms of coverage,
survey and administrative records of individual’s primary health insurance is
hospitals from three states. managed care.
- First, they used the 1992-2003 National - Data are drawn from NHIS and for each
Health Interview Survey (NHIS) with birth characteristic authors show the incidence
year, birth month, and calendar quarter of rate at age 63-64, and the change at age
interview. 65 based on a version of the 𝐶𝐾 equations
● They used this to construct an that include a quadratic in age, fully
estimated age at the time of interacted with a post-65 dummy and
interview. controls for gender, educ, race/ethnicity,
● A person reaching 65 in the region, sample year.
interview quarter is coded as age - Alternative specifications were also used,
65 and 0 quarters such as a parametric model fit to a narrow
● Assuming uniform distribution of age window (63-67) and a local linear
interview dates, half of people will regression specification using a chosen
be 0-6 weeks younger than 65, bandwidth.
the other half 0-6 weeks older. - Both show similar estimates of the change
● Analysis is limited to people at 65.
between 55 and 75. Final sample - Recall:
has 160,821 observations. ● Treatment: Medicare-age
- Second, they used hospital discharge eligibility
records from California, Florida, and NY ● Outcome: Insurance coverage
● A complete census of discharges ● Running variable: age
from all hospitals in the three - Each cell in the following table shows the
states, except for federally ATE for the 65-year-old population
regulated institutions. complying with the treatment.
● Data files include info on age in - Unsurprisingly, the effect of receiving
months and time of admission. Medicare is to cause a very large increase
of being on Medicare, as well as reducing - Problem: At age 65, Medicare eligibility
coverage on private and managed care. changes as well as employment!
(Retirement age is typically 65.)
● Any abrupt employment change
can lead to differences in HC
utilization if nonworkers have
more time to visit doctors.
● Authors need to investigate this
possible confounder, by testing
for any potential discontinuities at
age 65 for confounding variables
using a third dataset (CPS).
● They ultimately find no evidence
for discontinuities in employment
at age 65.
- Authors then investigated the impact of
Medicare on access to HC utilization using
NHIS, which asked since 1997:
1. “During the past 12 months has
medical care been delayed for
this person because of worry
about the cost?”
2. “During the past 12 months was
there any time when this person
needed medical care but did not
get it because this person could
not afford it?”
- To formally establish identification in an 3. “Did the individual have at least
RDD, we need to rely on the assumption one doctor visit in the past year?”
that the CEF for both potential outcomes is 4. “Did the individual have one or
continuous at 65 years of age. more overnight hospital stays in
more overnight hospital stays in
● That is, E[Y^0|a] and E[Y^1|a] are the past year?”
- The table below shows the ATE for the
continuous through age 65
complier population at the discontinuity.
- If assumption is plausible, the ATE at age
(Compliers are those whose treatment
65 is:
status changed as we moved the value of
𝑥𝑖 from just to the left of 𝑐0 to just to the
right.)
● Note that the share of the
- Continuity assumption requires that all relevant population who delayed
other factors, observed and unobserved, care the previous year fell by 1.8
that affect insurance coverage are trending points. (Similar to share who
smoothly at the cutoff. didn’t get care at all in previous
year.)
● Share who saw doctor went up,
as did share who stayed in
hospital.
● Small ATEs, but relatively
precisely estimated (small SEs).
● Results differed a lot by
race/ethnicity and education.
Estimation using an RDD
Sharp RD design
- It’s standard practice to estimate causal
effects using local polynomial regressions.
- In its simplest form, this means fitting a
linear specification separately on each side
of the cutoff using a least squares
regression.
● You’re using only the
observations within some
pre-specified window (hence,
”local”).
● As the true CEF is probably not
linear at this window, the resulting
- Results show modest effects on care and estimator likely suffers from
utilization, but what about the kinds of care specification bias.
they received? ● But if you can get the window
- The next figure shows the effect of narrow enough, the bias of the
Medicare on hip and knee replacements estimator is probably small
by age. The effects are largest for whites. relative to its SD.
- In conclusion, the authors find that - What if the window cannot be narrowed
universal HC coverage for the elderly enough (e.g., the running variable takes on
increases care and utilization as well as a few values only, or the gap between
coverage. values closest to the cutoff is large)? This
- Later, Card, Dobkin, and Maestas (2009) often happens.
found modest decreases in mortality rates. ● Then you may not have enough
observations close to the cutoff
for the local polynomial
regression.
● This can lead also to the
heteroskedasticity-robust
confidence intervals to
undercover the average causal
effect because it’s not centered.
- Lee and Card (2008) and Lee and
Lemieux (2010) suggest clustering of SEs
by the running variable.
- Kolesar and Rothe (2018) warn, though,
that this can be one of the worst
approaches, and it can actually perform worse
than the heteroskedasticity-robust SEs.
● Instead, they propose 2
- Fuzzy RDD also requires the continuity
alternative CIs that have
assumption.
guaranteed coverage properties
- For identification, we must assume that the
under various restrictions on the
conditional expectation of potential
CEF.
● They’re ”honest” CIs, so they outcomes (like E[Y^0|X < c0]) is changing
achieve correct coverage smoothly through 𝑐0
uniformly over all CEFs in large
- But what changes at 𝑐0 is the probability of
samples.
● Only available in R now. Stata treatment assignment (see graph below).
users can use the
heteroskedastic-robust SEs.
- Randomization inference can also be
used.
● Cattaneo et al. (2015): What if
treatment is randomly assigned
around the cutoff? If this
neighborhood exists, it’s as good
as a randomized experiment
around the cutoff.
● We can proceed as if only those - Estimating some ATE under fuzzy RDD is
observations closest to the similar to estimating a LATE with IV.
discontinuity were randomly - A simple way is a type of Wald estimator:
assigned, so RI can be used to estimate some causal effect as the ratio of
get exact or approximate a reduced form difference in mean
p-values. outcomes around the cutoff and a reduced
form difference in the mean treatment
Estimation using an RDD assignment around the cutoff:
Fuzzy RD design
- Sometimes the relationship between the running variable and treatment is not entirely deterministic: crossing the cutoff is associated with a discontinuity in the probability of treatment assignment rather than a jump from 0 to 1.
- Fuzzy RDD is when there’s an increase in restrictions, monotonicity, SUTVA, and the
the probability of treatment assignment (as strength of the first stage.
seen in Hoekstra (2009) and Angrist and - Note: this LATE applies only to the
Lavy (1999)). compliers (those who were actually
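- In symbols, the Wald-type ratio described above is (a sketch consistent with the text):
    \delta_{FRD} = \frac{\lim_{x \downarrow c_0} E[Y \mid X=x] - \lim_{x \uparrow c_0} E[Y \mid X=x]}{\lim_{x \downarrow c_0} E[D \mid X=x] - \lim_{x \uparrow c_0} E[D \mid X=x]}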
- Probabilistic treatment assignment means: assigned in the treatment group).
- One can also estimate the effect using
two-stage least squares (2SLS) or a
similar appropriate model, such as
- That is, the conditional probability is
limited-information maximum likelihood.
discontinuous as X approaches 𝑐0 in the
- There are two events: (1) when the
limit running variable exceeds the cutoff; (2)
when a unit is placed in the treatment.
- Let 𝑍𝑖 be an indicator for when 𝑋 > 𝑐0.
One can use 𝑍𝑖 and the interaction terms
as instruments for the treatment 𝐷𝑖
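- A minimal Stata sketch of the just-identified fuzzy RD via 2SLS (the names y, d, x_c and the 0.05 window are hypothetical):
    gen byte z = (x_c >= 0)     // above-cutoff indicator, used as the instrument
    ivregress 2sls y x_c (d = z) if abs(x_c) <= 0.05, vce(robust)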
- If one uses only 𝑍𝑖 as an IV, it’s a just - Hahn et al. (2001): One needs the same
identified model (there’s exactly as many assumptions for identification as in IV.
instruments as endogenous variables). It - Imbens and Angrist (1994): As with other
has good finite sample properties binary IVs, the fuzzy RDD is estimating the
- Consider few regressions involved in this LATE, which is the ATE for the compliers
IV approach. Three possible regressions only
are the first stage, the second stage, and
the reduced form. Challenges to identification
- In the just identified case (only one IV for - In RDD literature, it’s common to provide
one endogenous variable), the first stage evidence of the credibility of identifying
regression is: assumptions. Even indirect evidence may
be persuasive.
- The continuity assumption can be violated
Where π is the causal effect of 𝑍𝑖 on the in practice if any of the following is true:
1. The assignment rule is known in
conditional probability of treatment. The
advance.
fitted values are then used in the second
2. Agents are interested in
stage.
adjusting.
- If we use 𝑍𝑖 and all its interactions as
3. Agents have time to adjust.
instruments for 𝐷𝑖, the estimated first stage 4. The cutoff is endogenous to
would be: factors that independently cause
potential outcomes to shift.
5. There’s nonrandom heaping
along the running variable.
- If we want to forgo estimating the full IV - E.g., Retaking an exam, self-reported
model, we might estimate just the reduced income, etc.
form. Quite popular and favored among - The cutoff is endogenous if some other
applied economists. unobservable characteristic could happen
- The fuzzy RDD reduced form regresses Y at the threshold, and this has a direct
onto the instrument and the running effect on the outcome.
variable: ● E.g., Age cutoffs for policies, such
as when a person turns 18 and
faces more severe penalties for
- As in the sharp RDD case, we can allow crimes. Turning 18 is correlated
the smooth function to be different on both wth HS graduation, voting rights,
sides of the discontinuity by interacting 𝑍𝑖 etc.
- One solution is the McCrary density test
with the running variable:
(McCrary, 2008) that can check if units are
“sorting on the running variable” (aka
manipulation).
● E.g., There are two rooms with
- If you wanted to present the estimated
patients in line for a life-saving
effect of the treatment on some outcome,
treatment. Patients in room A will
that necessitates estimating a first stage,
receive it, and those in B
using fitted values from that regression,
knowingly will receive nothing.
and estimating a second stage on those
● If you’re in room B, you’d walk up
fitted values.
to room A. In the extreme, room A
- The reduced form only estimates the
would be crowded, B empty.
causal effect of the instrument on the
(Recall: people optimize under
outcome, while the second stage model
constraints!)
with interaction terms would be the same
- Just as continuity is the null, here the null
as in sharp RDD:
is continuous density throughout the cutoff.
Bunching in the density at the cutoff is a
sign that someone is moving over to the
cutoff to take advantage of the rewards
Where 𝑥 are now not only normalized with there.
respect to 𝑐0 but are also fitted values - Formally, if we assume a desirable
obtained from the first-stage regressions. treatment D and an assignment rule
𝑋 ≥ 𝑐0, we expect individuals to sort into D that section, and pretending it
'
by choosing X such that 𝑋 ≥ 𝑐0—so long was a discontinuity 𝑐0. Then test if
as they’re able there’s a discontinuity in the
- If they do, it could imply selection bias outcome there; there should be
insofar as their sorting is a function of none.
potential outcomes.
- McCrary’s density test checks if there’s
bunching of units at the cutoff. The
alternative hypothesis is that there’s an
increase in density at the kink.
- Partition the assignment variable into bins,
and calculate frequencies in each bin
- Treat the frequency count as the
dependent variable in a local linear
regression. If you can estimate conditional
expectations, you have the data on the
running variable, so you can always do the
density test (Stata: -rddensity-.)
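- A minimal sketch of that density test with the user-written command cited here (x_c is the running variable recentered so the cutoff sits at zero):
    rddensity x_c, c(0)    // null: the density of x_c is continuous at the cutoff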
- This is a “high-powered” test, meaning
you’ll need lots of observations at 𝑐0 to
Example: Medical care of at-risk newborns
distinguish a density discontinuity from
- Consider Almond, Doyle, Kowalski, and
noise.
Williams (2010), who looked at the causal
effect of medical spending on health
Challenges to identification
outcomes (noting that many med
Fuzzy RD design
technologies are effective but too costly).
- Doctors assign patients to treatments
based on what they think would be best
given potential outcomes—but this violates
the independence assumption. (Think of
Dr Strange example.)
- If endogeneity is deep enough, controlling
for selection directly will be impossible.
Spurious correlations may emerge due to
Challenges to identification selection bias.
- A second test is a covariate balance test - Almond et al. (2010) figured that in the US,
(aka placebo test): there must not be an babies with very low birth weight (VLBW)
observable discontinuous change in the receive heightened medical attention.
average values of reasonably chosen - Using administrative records linked to
covariates around the cutoff. mortality data, they find that the 1-year
● Since these are pre-treatment infant mortality decreases by about 1 pp
characteristics, they should be when the child’s birth weight is just below
invariant to change in treatment the 1.5-kg threshold compared to those
assignment. born just above.
● E.g., Lee, Moretti, and Butler - Given the mean 1-year mortality of 5.5%,
(2004) evaluated the impact of this is a sizable estimate; medical
Democratic vote share just at interventions triggered by VLBW
50% on various demographic classification have benefits far exceeding
factors (income, education, race, costs.
eligibility to vote). - Some studies look at econometric issues
- A third test is an extension of the covariate related to heaping on the running variable:
balance test: there shouldn’t be effects on when there’s an excess number of units at
the outcome of interest at arbitrarily certain points along the running variable.
chosen cutoffs. - Here, it seems to be happening at regular
● Imbens and Lemieux (2008) 100-gram intervals, caused by a tendency
suggest looking at one side of for hospitals to round to the nearest
discontinuity, taking the median integer.
value of the running variable in
- In the figure below, long black lines one removes units in the vicinity of 1.5 kg
appearing regularly at the birth-weight and reestimate the model.
distribution are excess mass of children ● Parameter we’re estimating at the
born at those numbers. cutoff has become an even more
- This is unnatural, and is almost certainly unusual type of LATE that’s even
caused by sorting or rounding less informative about ATE that
● E.g., It could be due to less policymakers want to know.
sophisticated scales, or staff ● Sample size went down by 2%
rounding off birth weights to 1.5 but has very large effects on
kg to make a child eligible for 1-year mortality (50% lower than
increased medical attention. in Almond et al. (2010)).
- Caution: If the sample size is small relative
to the number of heaping units, donut hole
RDD could be infeasible.
- It also changes the parameter of interest to
be estimated in ways difficult to
explain/understand.
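- A minimal Stata sketch of a donut-hole RD in this spirit (the variable names bweight and mort1yr and the 90-gram donut are purely illustrative):
    drop if abs(bweight - 1500) < 90      // remove the heaped mass right around the cutoff
    rdrobust mort1yr bweight, c(1500)     // re-estimate on the remaining sample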
Application: The close election
- The close election design has become
popular in the RDD literature. Banks on the
- Almond et al. (2010) used the McCrary
fact that winners in political races are
density test and found no clear, statistically
declared when a candidate gets the
significant evidence of sorting on the
minimum needed share of votes.
running variable at the 1.5-kg cutoff.
- Since very close races represent
- In their main analysis, they found a causal
exogenous assignments of a party’s
effect of a 1 pp reduction in 1-year
victory, we can use close elections to
mortality.
identify the causal impact of the winner on
- Later work by Barreca et al. (2011) and
a variety of outcomes
Barreca et al. (2016) focused on the
- We can also test political economy
heaping phenomenon, and showed the
theories that are otherwise impossible to
shortcomings of the McCrary test.
evaluate.
● Data heap at 1.5 kg appears to
- The classic study here is Lee et al. (2004)
be babies whose mortality rates
- For instance, how do voters affect policy?
are unusually high—outliers
Do politicians or voters pick policies? There
compared to both the immediate
are two competing theories:
left and right.
- Convergence theory: heterogeneous voter
● Such events wouldn’t occur
ideology forces each candidate to
naturally; there’s no reason to
moderate their position (like the median
believe that nature would produce
voter theorem).
heaps of children born with outlier
- Divergence theory: when the winning
health defects every 100 grams.
candidate pursues their most-preferred
- ”This heaping at 1,500 grams may be a
policy after taking office.
signal that poor-quality hospitals have
● Voters can’t compel candidates to
relatively high propensities to round birth
reach any kind of policy
weights but is also consistent with
compromise, and opposing
manipulation of recorded birth weights by
candidates choose very different
doctors, nurses, or parents to obtain
policies under different
favorable treatment for their children.”
counterfactual victory scenarios.
● ”Barreca et al. (2011) show that
- Let R, D be candidates in a congressional
this nonrandom heaping leads
race.
one to conclude that it is ’good’ to
- The policy space is a single dimension,
be strictly less than any 100-g
where the candidates’ policy preferences
cutoff between 1,000 and 3,000
in a period are quadratic loss functions
grams.”
u(l), v(l) and l is the policy variable.
- RDD must not be sensitive to observations
- Each player has a bliss point or the most
at the thresholds themselves. Barreca et
preferred location in a unidimensional
al.’s solution was donut hole RDD where
policy range.
- For Democrats, l* = c > 0, and for Republicans l* = 0.
- Ex ante, voters expect the candidate to
choose some policy, and the candidates
𝑒 𝑒
have probability of winning 𝑃(𝑥 , 𝑦 ), where
𝑒 𝑒
𝑥 , 𝑦 are the policies chosen by D,R Where γ is the total effect of the initial win
respectively. on future roll call votes.
● When x^e > y^e, ∂P/∂x^e > 0 and ∂P/∂y^e < 0
- Also, the following are both observable:
- P* is the underlying popularity of the
Democratic Party, or the probability that D - The elect component is estimated as the
would win if the policy chosen x equaled difference in mean voting records between
the Democrat’s bliss point c. 𝐷 𝑅
parties at t, and [𝑃𝑡+1 − 𝑃𝑡+1] can be
- Multiple Nash equilibria mean:
1. Partial/complete convergence: estimated by the fraction of districts won
voters affect policies by Democrats in t + 1.
2. Complete divergence: voters - We can net out the elect component to
elect politicians with fixed policy implicitly get the “effect” component since
preferences and therefore do we can estimate γ, the total effect of a
whatever they want. Popularity Democrat victory in t on 𝑅𝐶𝑡+1
has no effect on policies: ∂x*/∂P* = 0
- Random assignment of 𝐷𝑡 is crucial, and
● An exogenous shock to we use RDD. Without it, equation will
P* (i.e., dropping reflect π1 and selection (Democratic
Democrats into the districts have more liberal bliss points).
district) does nothing to - Lee et al. (2004) used two primary
equilibrium policies. datasets:
- The potential roll-call voting record 1. How liberals voted, using the
outcomes of the candidate following some ADA Voting Scores (1946–1995):
election is: The Americans for Democratic
Action (ADA) compiled scores for
U.S. House Representatives
based on approximately 25
significant roll-call votes per
Where 𝐷𝑡 indicates whether a Democrat
Congress.
won the election. Only the winning ● Each lawmaker is
candidate’s policy is observed. assigned a score
- Converted into regression equations between 0 and 100, with
they’re: higher scores indicating
more liberal voting
records.
● Running variable is the
vote share that went to a
- Since P* is unobservable, we can’t directly
Democrat.
estimate the above equations. But if we can 2. Election Results (1946–1995):
randomize 𝐷𝑡, it would be independent of These data detail the vote shares
P*_t and ϵ_t for Democratic candidates in
House elections during the same
- Taking conditional expectations with
period.
respect to 𝐷𝑡, we get:
- Authors used exogenous variation in
Democratic wins to check if convergence
or divergence is correct.
- If convergence is true, lawmakers who just
barely won should vote almost identically.
If divergence is true, they should vote 10 points, but the contemporaneous effect
differently at the margins of a close race. becomes smaller.
● At the margins of a close race, - Effect on incumbency increases a lot.
voter preferences are supposedly - So, simply running the regression yields
the same. But if policies diverge different estimates when we include data
at the cutoff, politicians and not far from the cutoff itself.
voters are driving policy making.
- The exogenous shock comes from the
discontinuity in the running variable: at a
vote share of just above 0.5, the
Democratic candidate wins.
- Just around the cutoff, random chance
determines the Democratic win—random
assignment of 𝐷𝑡
- Now we replicate the results of Cattaneo,
Frandsen, and Titiunik (2015), using - Neither of the above regressions controlled
regressions limited to the window right for the running variable, or the recentered
around the cutoff. running variable.
● These are local regressions since - Let’s do that and simply subtract 0.5 from
they use the data close to the the running variable, so values of 0 are
cutoff. where the vote shares equals 0.5, negative
● The window is such that we use values of vote shares are less than 0.5,
only observations between 0.48 positive values above 0.5.
and 0.52 vote share. Regression
estimates the coefficient on 𝐷𝑡
right around the cutoff.
- While the incumbency effect falls closer to
what Lee et al. (2004) find, the effects are
- In Cattaneo et al. (2015), the effect of a still quite different.
Democratic victory increases liberal voting
by 21 points in the next period, 48 points in
the current period, and the probability of
reelection by 48%.
- Using this design they found evidence of
divergence and incumbency advantage.
- What if we allow the running variable to
vary on either side of the discontinuity?
● We need a regression line to be
on either side, so we must have 2
lines left and right of the
discontinuity.
● We need an interaction of the
running variable with the
treatment variable.
- What if we use all of the data? We
replicate Lee et al. (2004).
- We report the global regression analysis
with the running variable interacted with
the treatment variable.
- We get somewhat different results. The - This pulled down the coefficients, but they
effect on future ADA scores gets larger by remain larger than what was found when
only obs. within 0.02 points of the 0.50 cutoff
were used.
- Recap: First we fit a model without
- Let’s now run a quadratic model: controlling for the running variable. We
then included the running variable:
interacting the Democratic vote share with
Democratic dummy, including a quadratic.
- Including the quadratic caused the - In all of the above, we extrapolated trend
estimated effect of a Democratic victory on lines from the running variable beyond the
future voting to fall considerably. support of the data to estimate LATE right
- Effect on contemporaneous voting is at the cutoff.
similar to Lee et al. (2004), as is the - Note that including the running variable in
incumbent effect. any form tended to reduce the effect of a
- This illustrates the standard steps using victory for Democrats on future
global regressions. But estimating global Democratic voting patterns.
regressions results in large coefficients. - Lee et al. (2004) estimated 21 points,
- Maybe there are strong outliers in the data which is attenuated considerably when we
causing the distance at 𝑐0 to spread more include controls for the running variable,
even when we estimate very local flexible
widely
regressions.
- Effect is smaller but significant, whereas
the immediate effect remains quite large.
- A solution is to again limit the analysis to a
smaller window.
- We drop observations far away from 𝑐0
and omit the influence of outliers from our
estimation at the cutoff.
- Since we used ±0.02 before, we use ±0.05
just to mix things up.
- With N = 2, 441, we use fewer
observations, but compared to the
truncation in Cattaneo et al. (2015) it’s still
more observations (there, N = 915).
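- A minimal Stata sketch of the specifications walked through above (the names score and demvoteshare are hypothetical; window widths and polynomial order are illustrative):
    gen byte dem = (demvoteshare > 0.5)   // Democratic win indicator
    gen x_c  = demvoteshare - 0.5         // recentered running variable
    * local comparison within a +/-0.05 window
    reg score dem if abs(x_c) <= 0.05, robust
    * global regression with the running variable (and its square) interacted with treatment
    gen x_c2 = x_c^2
    reg score dem##c.(x_c x_c2), robust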
- Including the quadratic interaction pulled - There are other ways of exploring the
the estimated size of future voting down impact of the treatment at the cutoff.
considerably, even with the smaller - Hahn, Todd, and van der Klaauw (2001) framed
sample. estimation as a nonparametric problem,
and emphasized using local polynomial
regressions.
● In RDD contexts, this means
estimating a model, like Y = f(X) +
ϵ, that doesn’t assume a
functional form.
● Calculate E[Y] for each bin on X point of interest is at the boundary
like a histogram (in Stata it’s (boundary problem).
-cmogram-). ● They propose using local
● Let’s revisit Lee et al. (2004) and nonparametric regression where
show the relationship between weight is given to observations at
the Democratic win (as a function the center.
of the running variable, the - One can also estimate kernel-weighted
Democratic vote share) and the local polynomial regressions, or weighted
candidates’ second-period ASA regressions restricted to a window. The
score. chosen kernel provides the weights.
- A solution is to again limit the analysis to a ● A rectangular kernel gives the
smaller window. same results as E[Y] at a given
- We drop observations far away from 𝑐0 bin on X.
and omit the influence of outliers from our ● A triangular kernel would give
estimation at the cutoff. more importance to observations
closest to the center.
● This method is sensitive to the
chosen bandwidth size.
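- A minimal Stata sketch of such kernel-weighted local fits on each side of the cutoff (lpoly is official Stata; the 0.05 bandwidth is illustrative):
    lpoly y x_c if x_c < 0,  degree(1) kernel(triangle) bwidth(0.05)
    lpoly y x_c if x_c >= 0, degree(1) kernel(triangle) bwidth(0.05)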
- Note: continuity assumption is an
untestable assumption (it involves
continuous CEFs of the potential outcomes
at the cutoff).
● But we can check for whether
there are changes in the CEFs for
other exogenous covariates that
cannot and should not be
changing as a result of the cutoff.
● So it’s common to look at race or
gender at the cutoff.
● Any RDD paper will always
involve such placebos, even if
- If there’s no apparent trend in the running they’re not direct tests of the
variable, polynomials aren’t so useful. continuity assumption.
Some papers use just the linear fit - The fundamental tradeoff is choosing
because there weren’t strong trends to between bandwidth versus bias/variance.
begin with. E.g., Carrell et al. (2011). ● The shorter the window, the lower
- Hahn et al. (2001) showed that one-sided the bias, but the variance of the
kernel estimation such as lowess may estimate increases because you
suffer from poor properties because the have less data.
● Calonico et al. (2014) developed ● Thankfully this doesn’t invalidate
a Stata command -rdrobust- for the entire close-election design,
optimal bandwidth selection. and Caughey and Sekhon (2011)
● Optimal bandwidth may vary from found their result only in a specific
left and right of the cutoff based subset of House races.
on some bias-variance trade-off.
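- A minimal sketch using the user-written command mentioned above (x_c recentered so the cutoff is at zero; the options shown are illustrative):
    rdrobust y x_c, c(0) p(1) kernel(triangular)   // local linear RD with data-driven bandwidth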
Regression kink design
- Sometimes the idea of a jump doesn’t
describe what happens at the discontinuity.
- Note: the coefficient is 46.48 with SE 1.24. - Card, Lee, Pei, and Weber (2015)
- This method is data-greedy because it introduced the regression kink design
gobbles up data at the discontinuity. (RKD): rather than the cutoff causing a
- So use when you have large observations discontinuous jump in the treatment
so you have ample data points at the variable at the cutoff, it changes the first
cutoff. Otherwise, you may not have derivative (a kink).
enough power to pick up that effect. - They use kinks to identify the causal effect
- We also look at the implementation of the of a policy by exploiting the jump in the first
McCrary test, using local polynomial derivative.
density estimation. - Specifically, they look at the level of
- From the next graph, we see no signs of unemployment benefits and whether that
manipulation in the running variable at the affects the length of time spent
cutoff: unemployed in Austria.
● Unemployment benefits there are
based on income in a base
period, and a minimum benefit
level applies that isn’t binding for
people with low earnings
● Benefits are 55% of earnings in
the base period.
● There’s a maximum benefit level
adjusted every year, causing a
discontinuity in the schedule.
- Conclusion: close-election design has
become a cottage industry in econ and
polsci, and extended to other types of
elections and outcomes (e.g., Beland
2015).
● Eggers et al. (2014):
close-election design is one of the
best RD designs now, and
assumptions in it are likely to be
met in a wide variety of electoral
outcomes.
- Caughey and Sekhon (2011) question the
validity of Lee et al. (2004). They found
that bare winners and bare losers in House
elections differed considerably on
pretreatment covariates, and imbalance
got worse in the closest elections.
● Therefore, sorting problems get
more severe in the closest of
House races, suggesting these
races couldn’t be used for RDD.
Conclusion
- RDD is often considered a winning design
because it credibly identifies causal
effects.
- But its credibility stems from deep
institutional knowledge surrounding the
relationship between the running variable,
the cutoff, the treatment assignment, and
the outcomes themselves. Also, it relies on
the continuity assumption.
- RDD opportunities are out there, using
data from firms and government agencies
who face scarcity problems and must
ration a treatment.
- Find those cutoff points and you can have
a cheap yet powerfully informative natural
experiment (you don’t need to do
randomization).
LECTURE 7 influence of Sewall’s work on path
Instrumental variables analysis.
Intro Intuition of instrumental variables
- Instrumental variables (IV) design is one of
the most important research designs ever
devised.
- Also unique in that it was devised by an
economist, and not imported from statistics
(like standard errors) or some other field
(like RCTs or RDD).
- IV design has distinctly economic origins, - To understand the IV estimator, let’s start
with its first use arising not from statistics with a DAG above. Note the backdoor path
but from a practical policy problem. between D,Y: D ← U → Y.
- Philip Wright (1861–1934), an economist ● Note that U is unobserved, so the
and mathematician, was deeply interested backdoor path remains open.
in the identification problem in ● With selection on unobservables,
econometrics. there’s no conditioning strategy
- In 1928, he wrote a book on tariffs satisfying the backdoor criterion.
affecting animal and vegetable oils. He - But there’s a mediated pathway from the
believed that recent tariff increases were instrumental variable (IV) or instrument
harming international relations. And he Z to Y via D.
wrote about the damage from the tariffs, ● Note, though, that even if Y varies
which had affected animal and vegetable with Z, Y is varying only because
oils. D varied. There’s no direct path
- He included Appendix B, which contained from Z to Y
the first known derivation of the ● When Z varies, D varies, causing
instrumental variables estimator. Y to vary. Z affects Y “only
● The IV method was originally through” D (aka the only through
developed to address assumption).
endogeneity when estimating - Imagine D consists of people making
supply and demand elasticities choices. Sometimes these choices affect
● Specifically, if there is one Y, sometimes they’re merely correlated
instrument for supply, and the with changes in Y due to unobserved
supply and demand errors are changes in U.
uncorrelated, then the elasticity of - But Z comes along and induces some but
demand can be identified. not all people in D to make different
- Philip’s son, Sewall Wright (1889–1988), decisions.
was a pioneering geneticist who developed ● When those people’s decisions
path analysis, a precursor to modern change, Y will change too
causal diagrams. because of the causal effect.
- Sewall’s ideas strongly influenced ● But all the correlation between
Appendix B, leading to speculation that he D,Y in that situation will reflect the
may have been its true author causal effect. The reason is D is a
- Because Appendix B used path analysis collider in the backdoor path
and addressed a key econometric between Z,Y: Z → D ← U → Y.
problem, historians questioned whether - Assume further that in D, only some of the
Sewall or Philip wrote it. people change their behavior because of
● Stock and Trebbi (2003) D. So Z causes a change in Y for just a
investigated this using stylometric subset of the population.
analysis, analyzing writing style - E.g., If the instrument only changes
through function word frequency women’s behavior, the causal effect of D
and grammatical patterns. on Y will only reflect the causal effect of
● Their analysis attributed all women’s choices (not men’s)
Appendix B text to Philip Wright, - First, if there are heterogeneous treatment
not Sewall, confirming his effects (e.g., men affect Y differently than
authorship, though it women), shock Z only identifies some of
acknowledged the intellectual the causal effect of D on Y.
● That causal effect may be valid correlated with U, by using the
only for the population of women exogenous variation from Z.
whose behavior changed in ● The variation in D induced by Z is
response to Z (not reflective of “clean”—not confounded by U.
how men’s behavior would affect ● If we isolate the part of D
Y). explained by Z (first-stage
- Second, if Z induces some change in Y regression in 2SLS), we can use
only via a fraction of the change in D, it’s that to explain Y in the second
almost as though we have less data to stage, and we can therefore
identify the causal effect than we really recover the causal effect of D on
have. Y.
- These illustrate two difficulties in - How do you know you have a good
interpreting IVs and identifying a instrument? Justifying an IV demands
parameter using IVs. strong theoretical reasoning, often via a
1. IVs only identify a causal effect DAG. IV must affect the outcome only
for any group of units whose through the treatment (the exclusion
behavior are changed as a result restriction).
of the instrument (aka the causal ● The exclusion restriction is
effect of the complier population untestable, and many economists
[e.g., only women complied with are skeptical of IVs because it’s
the instrument]). easy to imagine violations of the
2. Also, IVs are typically going to exclusion restriction.
have large SEs, and they will fail - Good IVs often sound weird or confusing.
to reject in many instances People find the relationship between
because they’re underpowered. instrument Z and outcome Y
confusing—unless they understand that Z
affects treatment D, which in turn affects Y.
● E.g., Saying that “mothers with
two boys work less” sounds odd
until you realize it affects family
size (they want to try for a girl
next), and then labor force (they’ll
- Note that we drew the DAG so that Z is
spend more time at home).
independent of U. D is a collider in Z → D
● Instrument (two boys) has no
← U, which implies Z, U are independent.
effect on the outcome (labor
- This is the exclusion restriction: the IV
market participation) except
estimator assumes that Z is independent
through the endogenous
of the variables that determine Y except for
treatment (family size
D.
preferences).
● “Exclusion” because the
● Instruments are quasi-random,
instrument’s direct effect on the
and unless you know the
outcome is excluded. The
mediated path or DAG, IVs sound
instrument is excluded from the
weird!
structural equation for the
- E.g., Kanye West’s “Ultralight Beam” has
outcome.
the lyric sung by Chance the Rapper : “I
- Note, too, that Z is correlated with D, and Z
made ‘Sunday Candy,’ I’m never going to
is correlated with Y only through its effect
hell / I met Kanye West, I’m never going to
on D.
fail”
- The relationship between Z, D is the first
● Making the song led to a church
stage of the two-stage least squares
visit, and then to a religious
estimator, a kind of IV estimator.
reconversion.
- To summarize:
● SC →? → H: Without knowing the
● To get the causal effect of D on Y,
endogenous treatment, you’re
can’t just regress Y on D because
clueless about how “Sunday
U is an open backdoor path that
Candy” and hell are related.
cannot be closed. There’s
(That’s the hallmark of a good IV!)
spurious correlation between D,Y.
● But meeting Kanye West is not a
● But we can purge the
good instrument, because Kanye
endogenous part of D that’s
can influence one’s success
through many channels (thus
violating the exclusion restriction).
● KW →? → F: Since it’s easy to
tell how knowing Kanye might
directly cause one’s success,
knowing Kanye is likely a bad
instrument.
- If you can easily craft a story between Z
and Y (or if the instrument obviously
affects the outcome), it’s likely the
exclusion restriction was violated, and IV is
likely invalid.
- In IV design, confusion is a feature—not a
bug.
Homogeneous treatment effects
- There are two ways to discuss IV design: one in a world where the treatment has the same causal effect for everyone (homogeneous treatment effects) and one where the effects differ in the population (heterogeneous treatment effects).
- With homogeneous treatment effects, we won't rely on the (more modern) potential outcomes notation.
- IV methods are typically used to address omitted variable bias, measurement error, and simultaneity.
- E.g., Quantity and price are determined by the intersection of supply and demand, so any observational correlation between P and Q is uninformative about the elasticities associated with the supply or demand curves. (The subject of Philip Wright's development of IV in the first place.)
- Let the homogeneous treatment effect be δ.
- E.g., College causes my wage and your wage to increase by 10%.
- E.g., Consider a classic labor market problem where we want to get the impact of schooling on earnings, but schooling is endogenous because of unobserved ability.
- We can represent this DAG in a simple regression:
where Y is log of earnings, S is schooling in years, A is unobserved individual ability, and ϵ is an error term uncorrelated with schooling or ability.
- A may be missing because, for instance, the survey did not collect it or it's missing from the dataset (things like family background, intelligence, motivation, non-cognitive ability).
- Because A is unobserved, a more appropriate equation may be:
where η𝑖 is a composite error: η𝑖 = γ𝐴𝑖 + ϵ𝑖.
- Assume schooling is correlated with ability, so it's therefore correlated with η𝑖, making it endogenous in the second, shorter regression.
- Only ϵ𝑖 is uncorrelated with the regressors by definition.
- We know from the derivation of the least squares operator that the estimated value of δ is:
and substituting Y from the first model above we get,
- So, if γ > 0 and Cov(A, S) > 0, then δ̂ is upward-biased. (After all, ability and schooling are positively related.)
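- The displayed equations on this slide were lost in extraction. A minimal LaTeX sketch of the omitted-variable-bias algebra implied by the surrounding text (a reconstruction under the stated assumptions, not necessarily the original notation):
```latex
% Long model described above: log earnings on schooling and unobserved ability
Y_i = \alpha + \delta S_i + \gamma A_i + \epsilon_i
% Short model actually estimated, with composite error \eta_i = \gamma A_i + \epsilon_i
Y_i = \alpha + \delta S_i + \eta_i
% OLS on the short model, substituting the long model for Y:
\hat{\delta} = \frac{\mathrm{Cov}(Y,S)}{\mathrm{Var}(S)} = \delta + \gamma\,\frac{\mathrm{Cov}(A,S)}{\mathrm{Var}(S)}
% So if \gamma > 0 and Cov(A,S) > 0, \hat{\delta} is biased upward.
```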
- Assume you found a good (and weird) instrument 𝑍𝑖 causing people to go to school, but one that is independent of student ability (so we can get around the endogeneity problem) and of the structural error term (so it's weird).
- The DAG becomes:
- Let's use Z to estimate δ. First, we get the covariance of Y and Z:
- To isolate δ, which is on the RHS, we can estimate it with:
so long as Cov(A,Z) = 0 and Cov(ϵ,Z) = 0.
- Zero covariances are the statistical truth contained in the IV DAG.
● This is what the exclusion restriction means: the instrument must be independent of both parts of the composite error term.
- But the exclusion restriction is a necessary, not sufficient, condition for IV to work. (Otherwise we could use a random number generator as an instrument.) We also need the instrument to be very highly correlated with the endogenous variable schooling S.
- We are dividing by Cov(S,Z), so it is necessarily required that this covariance is not zero.
- In IV terminology, the reduced form is the relationship between the instrument and the outcome; the first stage gets its name from the 2SLS estimator, and it is the relationship between the instrument and the treatment variable.
- If you take the probability limit of this expression and assume Cov(A,Z) = Cov(ϵ,Z) = 0, then plim δ̂ = δ.
- But if Z is not independent of η (because it's correlated with A or ϵ), and if the correlation between S and Z is weak, then in finite samples δ̂ becomes severely biased.

Two-stage least squares
- One of the most intuitive IV estimators is two-stage least squares (2SLS).
- Suppose you have sample data on Y, S, Z. For each observation i, assume the data is generated as follows:
where Cov(Z, ϵ) = 0 (exclusion restriction) and β ≠ 0 (non-zero first stage).
- Using the result that ∑ᵢ₌₁ⁿ (𝑥𝑖 − 𝑥̄) = 0, the IV estimator can be written as:
- And when we substitute the true model for Y, we get:
- From the earlier description of δ as the ratio of two covariances, we can derive:
so that β̂Var(Z) = Cov(Z, S).
- The IV estimator can therefore be rewritten as:
- Note that β̂Z gives the fitted values of schooling from the first-stage regression. So we're no longer using S; instead we're using its fitted values.
● To see why, recall that S = γ + βZ + ϵ, and Ŝ = γ̂ + β̂Z.
- Therefore, the 2SLS estimator is given by:
- It can be shown that Cov(Ŝ, Y) = β̂ Cov(Y, Z), and that β̂² Var(Z) = Var(Ŝ).
- Note that the 2SLS estimator uses only the fitted values of the endogenous regressors for estimation. These were based on all variables used in the model, including the excludable instrument.
- Since all these instruments are exogenous in the structural model, the fitted values themselves have become exogenous, too.
● We are using only the variation in schooling that's exogenous.
● We're back to a world where we're identifying causal effects from exogenous changes in schooling caused by our instrument.
- Consider again the IV estimator:
- BUT, the exogenous variation in S driven by the instrument is only a subset of the total variation in schooling.
- So IV reduces the variation in the data, and there's less information available for identification. And what little variation we have left comes only from the units who responded to the instrument in the first place.
- This is critical later as we relax the homogeneous treatment effect assumption (i.e., we allow for heterogeneity).
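- To make the mechanics above concrete, here is a minimal Python sketch on simulated data (hypothetical variable names), showing that the ratio-of-covariances IV estimator and the fitted-values (2SLS) estimator coincide in the just-identified case:
```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Simulated data loosely matching the schooling example:
# A = unobserved ability, Z = instrument, S = schooling, Y = log earnings.
A = rng.normal(size=n)
Z = rng.normal(size=n)
S = 0.5 * Z + 0.8 * A + rng.normal(size=n)            # first stage: Z shifts schooling
Y = 1.0 + 0.10 * S + 0.50 * A + rng.normal(size=n)    # true delta = 0.10

def cov(a, b):
    return np.cov(a, b)[0, 1]

# OLS is biased upward because gamma > 0 and Cov(A, S) > 0
delta_ols = cov(Y, S) / np.var(S, ddof=1)

# IV as a ratio of covariances: Cov(Y, Z) / Cov(S, Z)
delta_iv = cov(Y, Z) / cov(S, Z)

# 2SLS "by hand": regress S on Z, take fitted values, regress Y on the fitted values
X1 = np.column_stack([np.ones(n), Z])
S_hat = X1 @ np.linalg.lstsq(X1, S, rcond=None)[0]
X2 = np.column_stack([np.ones(n), S_hat])
delta_2sls = np.linalg.lstsq(X2, Y, rcond=None)[0][1]

print(delta_ols, delta_iv, delta_2sls)  # OLS biased up; IV and 2SLS both ~ 0.10 and equal
```
Note that the naive second-stage OLS standard errors are not the correct 2SLS standard errors; dedicated IV routines (e.g., Stata's ivregress 2sls) compute them properly.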
Example: Parental meth use and child abuse
- Let's look at Cunningham and Finlay (2012), who looked at the effects of parental meth use on child abuse and foster care admissions.
- Meth abuse causes heightened energy and alertness, lower appetite, intense euphoria, impaired judgment, and psychosis.
- The US epidemic started on the West Coast and moved eastward in the 1990s.
- Intuitively, substance abuse causes poor parenting. But all of that occurs in equilibrium, and there's possible selection bias: parents using meth would've been bad parents anyway—even without the drugs.
- Authorities were positing that increased meth use was causing greater foster care admissions. How to establish causality here?
- Breaking Bad segment: meth is synthesized from a reduction of two precursors, namely ephedrine and pseudoephedrine.
- In 1995, Congress passed a law providing safeguards in distributing products containing ephedrine as the primary medicinal ingredient.
- Traffickers shifted to pseudoephedrine, which was not covered by the regulation, and they used it as the primary precursor.
- A new law then required distributors of all forms of pseudoephedrine to be subject to chemical registration.
- These precursor shocks are said to be the largest supply shocks in drug enforcement history in the US.
- With an FOI request, Cunningham and Finlay (2012) obtained data on undercover purchases and seizures of illicit drugs; the data included the price, drug type, weight, purity, and locations where they occurred.
- They built a time series of prices of meth, heroin, and cocaine.
- The first supply intervention/disruption caused a quadrupling of retail street prices, adjusted for inflation, purity, and weight. The second intervention had a smaller but still significant effect.
- Note that the 1995 and 1997 shocks uniquely impacted meth (not cocaine and heroin).
- So, cocaine and heroin were not substitutes for meth, and the effect of meth can be isolated (versus the effect of substance abuse in a broad sense).
- They used an IV design, and in the first stage the proxy for meth abuse is the number of people entering treatment who listed meth as one of the substances they used in their last episode of substance abuse.
- The next graph shows the first stage, coming from the Treatment Episode Data Set (TEDS), which includes all people going into treatment for substance abuse at federally funded clinics.
● Around 1995 and 1997, there was a drop in self-admitted meth admissions, and also in total meth admissions. (The second wave's effect was not so big, but there's a tapering off of meth admissions.)
● So with the decline of meth admissions, there appears to be a first stage.
- The next graph shows the reduced form: the effect of price shocks on foster care admissions.
● The first supply intervention had a negative effect on foster care admissions: they fell from 8,000 children removed per month to 6,000, then rose again.
● The second intervention had a milder effect, possibly because the price effect was about half, and domestic meth production was replaced by Mexican meth imports in the late 1990s.
● So by the end of the 1990s, domestic meth production was less prominent in total output, and the effect of price on admissions was more muted.
- The IV (a higher price of meth) is weird because by itself it doesn't cause child abuse. Higher prices (Z) reduce parents' consumption of meth (D), and that in turn causes a reduction in child abuse (Y).
- Inspecting raw data trends can give strong clues about the validity of the first stage and reduced form. Helpful because the exclusion restriction is untestable!
- Now let's look at IV tables where we compare OLS and IV estimates.
- In column 1, the dependent variable is total entry into foster care, and OLS shows no effect of meth.
- In a 2SLS table, we show the first stage at the bottom.
● A unit deviation in price from its long-term trend is associated with a fall of 0.0005 log points in meth admissions into treatment (the proxy), significant at the 1% level.
● An F-stat of 17.6 suggests the instrument is strong enough for identification.
- Now, the 2SLS estimate of the treatment: using only the exogenous variation in log meth admissions, and assuming the exclusion restriction holds, we can look at the causal impact of log meth admissions on log aggregate foster care admissions.
● Elasticity (log-log): a 10% increase in meth admissions for treatment causes a 15.4% increase in children being removed from their homes and placed into foster care (the effect is large and precise).
● Importantly, OLS failed to detect this effect!
- So Cunningham and Finlay (2012) found an effect of meth admissions on removals for physical abuse and neglect.
● Possible channels include parental incarceration, child neglect, parental drug use, and physical abuse.
● Interestingly, there's no effect of parental drug use or parental incarceration (signs are negative and SEs are large).
- In conclusion, for meth users whose behavior was changed by rising real prices, their meth use was causing such severe child abuse that their children had to be removed from them and placed into foster care.
● Social costs are not apparent if you just look at county-level data for CA and only the 1997 ephedrine shock (Dobkin and Nicosia 2009).
● Meth doesn't cause crime in CA, but it harms children of meth users and places strain on the foster care system.

Example: Compulsory school attendance
- Another classic IV study is Angrist and Krueger (1991), which investigated the return to schooling.
- In the US, a child enters a grade on the basis of their birthday, and the cutoff usually was late December:
● Children born on or before Dec 31 were assigned to 1st grade; if born later, they're assigned to kindergarten.
● Two people (those born Dec 31 and Jan 1) were exogenously assigned different grades.
- This affects when they get their high school degree. For most of US history, kids were forced to stay in school until age 16; after that they could stop schooling.
- This quirk means more schooling is assigned to people born later in the year (exogenous variation in schooling).
- If you were born in December, you'd reach 16 with more education than the person born in January.
- Note that this is similar to RDD! (IV and RDD are very similar strategies, actually.)
- The next graph shows the first stage: those with birthdays in the 3rd and 4th quarters have more schooling on average than those in the 1st and 2nd quarters. (Note the rise from quarter 1 to quarter 4 within each birth year.)
- This pattern is not so strong in later cohorts, for whom the price on higher levels of schooling was rising and fewer people were dropping out before finishing their degrees.
- The next graph shows the reduced form relationship between quarter of birth and log weekly earnings.
- Note that along the peaks are those born in the 3rd and 4th quarters, while along the troughs are those born in the 1st and 2nd quarters.
- The IV (quarter of birth) is weird: how can it affect earnings? If you know, however, that people born later in the year get more schooling (randomly, like a coin toss), the logic snaps into place.
- The only way earnings (Y) are affected by birth quarter (Z) is through schooling (D).
- Angrist and Krueger (1991) used 3 dummies as instruments: 1st quarter, 2nd quarter, 3rd quarter (the 4th quarter—the group with the most schooling—is omitted).
- If we regress schooling on the three dummies, what sign would you expect for their estimated coefficients? The first stage regression is:
where 𝑍𝑖 is the dummy for each of the first three quarters, and π𝑖 is the coefficient on each dummy.
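- The first-stage equation displayed above did not survive extraction. A hedged LaTeX sketch consistent with the description (three quarter-of-birth dummies, fourth quarter omitted):
```latex
S_i = \pi_0 + \pi_1 Z_{1i} + \pi_2 Z_{2i} + \pi_3 Z_{3i} + \eta_i
% Z_{qi} = 1 if person i was born in quarter q. Because the omitted fourth quarter
% gets the most schooling, we expect \hat{\pi}_1, \hat{\pi}_2, \hat{\pi}_3 < 0.
```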
- The first stage results are produced in the next table. Coefficients are all negative and significant for the total-years-of-education and high-school-graduate dependent variables.
- The relationship gets weaker as you move beyond groups bound by compulsory schooling: there's no effect on (1) the number of years of schooling for high school students, and (2) the probability of being a college graduate.
- Why would quarter of birth affect the probability of being a HS grad, but not a college grad?
- If it affected even college graduation, the whole design would be dubious. It should impact only high school completion, since it doesn't bind anyone beyond high school.
- The next table shows the second stage, both for OLS and 2SLS. They find a 7.1 percent return for every additional year of schooling using OLS, but with 2SLS it's higher (8.9 percent).
- Even if quarter of birth is a good enough IV, the authors loaded up the first stage with even more instruments: they used specifications with 30 dummies (quarter of birth × year) and 150 dummies (quarter of birth × state) as instruments.
● They posited that the quarter of birth effect may differ by cohort and state.
● The estimator had lower variance, BUT many of these instruments were weakly correlated with schooling, some with even zero correlation. The same goes for later cohorts (akin to two tables ago).
- Bound, Jaeger, and Baker (1995), in their
critique of Angrist and Krueger (1991),
jumpstarted the weak instrument
literature.
- Consider a single endogenous regressor
and a simple constant treatment effect.
The causal model is y = βs + ϵ where s is
an endogenous regressor.
- The instrument is Z, and the first stage is 𝑠 = 𝑍′π + η. (Note that 𝑍′ comes first; the matrix form of our usual regression is Y = Xβ + ϵ.)
- Assume first that ϵ, η are correlated. Estimating the first equation by OLS yields a bias of:
- Bound et al. (1995) showed that as you add more instruments, the bias of 2SLS grows.
- Expressing the bias as a function of the first-stage F-statistic, one gets:
where F is the population analog of the F-statistic for the joint significance of the instruments in the first-stage regression.
- If the first stage is weak and F → 0, the bias of 2SLS approaches σϵη/σ𝑠².
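- The OLS and 2SLS bias expressions referenced above were not extracted. A LaTeX sketch of the standard approximation that matches the two limits discussed here (a reconstruction, not necessarily the authors' exact normalization):
```latex
E\big[\hat{\delta}_{2SLS} - \delta\big] \;\approx\; \frac{\sigma_{\epsilon\eta}}{\sigma_s^{2}}\cdot\frac{1}{F+1}
% As F -> 0 the bias approaches sigma_{epsilon eta} / sigma_s^2 (close to the OLS bias);
% as F -> infinity the bias goes to zero.
```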
- But if the first stage is strong, F → ∞, and the bias of 2SLS goes to zero.
- Therefore, adding more weak instruments causes the first-stage F-statistic to go to zero, thereby increasing the bias of 2SLS.
- The next table shows that when Bound et al. (1995) added controls, the F-stat on the excludability of the instrument falls from 13.5 to 1.6.
- They're running into weak instruments (the relationship between quarter of birth and schooling got smaller for later cohorts).
- They then added in all 180 weak instruments, and the problem persists.
- The instruments are weak, and the bias of the 2SLS coefficient is close to the OLS bias.

Homogeneous treatment effects
- All in all, if you have weak instruments, use a just-identified model with your strongest IV.
- Or, use a limited-information maximum likelihood estimator (LIML), which reduces finite-sample bias and has similar asymptotic properties to 2SLS under homogeneous treatment effects.
- The real solution, however, is to use better instruments that satisfy the exclusion restriction.
- IV is powerful when there's selection on unobservables (a common problem). But it may also be sensitive to the strength of the instruments (you don't want weak instruments).
- More limitations emerge when we discuss heterogeneous treatment effects.

Exercise: Think of good instruments in the following studies
- Income → Health outcomes
- Police presence → Crime rates
- Microfinance → Household consumption
Heterogeneous treatment effects
- When treatment effects vary across individuals, IV no longer identifies a universal causal effect. Instead, there are heterogeneous treatment effects.
- Each unit i has a unique response to the treatment:
- What is IV estimating here? Under what assumptions will IV identify a causal effect?
- There turns out to be a huge tension between internal validity and external validity (whether findings apply to populations outside the study).
- The tension is so great it may undermine the relevance of the estimated causal effect—despite a valid IV design.
- We modify the potential outcomes notation a bit.
- The new potential variable is the potential treatment status (versus the observed treatment status):
● 𝐷𝑖¹ is the treatment status of i when 𝑍𝑖 = 1
● 𝐷𝑖⁰ is the treatment status of i when 𝑍𝑖 = 0
and the observed treatment status is based on a treatment status switching equation:
where π0𝑖 = E[𝐷𝑖⁰], π1𝑖 = 𝐷𝑖¹ − 𝐷𝑖⁰ (aka the heterogeneous causal effect of the IV on 𝐷𝑖), and E[π1𝑖] is the average causal effect of 𝑍𝑖 on 𝐷𝑖.
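- The switching equation displayed above was not extracted. One standard way to write it, consistent with the definitions in the text (the deviation term φ𝑖 is a label introduced only for this sketch):
```latex
D_i = D_i^0 + (D_i^1 - D_i^0)\,Z_i
    = \pi_{0i} + \pi_{1i} Z_i + \phi_i ,
\qquad \pi_{0i} = E[D_i^0],\quad \pi_{1i} = D_i^1 - D_i^0,\quad \phi_i = D_i^0 - E[D_i^0]
```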
- Five assumptions are necessary for identification given heterogeneous treatment effects.
- We cite Angrist (1990), which studied the effect of draft lottery number (randomly generated) as an instrument to explain the effect of military service on earnings. (If the number fell within a certain range, the person was drafted.)
- First is SUTVA: the potential outcomes for i are unrelated to the treatment status of other units.
● If 𝑍𝑖 = 𝑍𝑖′, then 𝐷𝑖(Z) = 𝐷𝑖(𝑍′); and
● If 𝑍𝑖 = 𝑍𝑖′ and 𝐷𝑖 = 𝐷𝑖′, then 𝑌𝑖(D, Z) = 𝑌𝑖(𝐷′, 𝑍′).
- You violate SUTVA if the status of a person at risk of being drafted was affected by the draft status of others at risk of being drafted (spillovers).
- Second is the independence assumption, aka "as good as random assignment": the IV is independent of the potential outcomes and potential treatment assignments.
- Independence is sufficient for a causal interpretation of the reduced form.
- This is why many prefer to work with just the reduced form. But you're not likely to be interested in studying the instrument Z per se (you want to see the effect of D on Y).
- Interestingly, independence means the first stage measures the causal effect of Z on D.
- For instance, assignment using a random draft lottery number is as good as random, so it's independent of potential military service or earnings.
- Third is the exclusion restriction: any effect of Z on Y must be via the effect of Z on D. Put another way, 𝑌𝑖(𝐷𝑖, 𝑍𝑖) is a function of only 𝐷𝑖: for D = 0, 1, 𝑌𝑖(D, 0) = 𝑌𝑖(D, 1).
- An individual's earnings potential (as a veteran or nonveteran) is the same regardless of draft eligibility status.
- Violated if low lottery numbers affected schooling by people avoiding the draft (so lottery numbers would be correlated with earnings through the IV's effect on military service and on schooling).
- Note that independence (e.g., a random lottery number) doesn't imply that the exclusion restriction is satisfied.
- Fourth is the first stage: Z must be correlated with the endogenous variable, so that E[𝐷𝑖¹ − 𝐷𝑖⁰] ≠ 0.
- Z must have a statistically significant effect on the average probability of treatment.
- E.g., If you have a low lottery number, does it increase the average probability of military service? If yes, it satisfies the first stage requirement.
- This is testable because it's based on observable data (D, Z) (unlike the exclusion restriction).
- Fifth is the monotonicity assumption: the IV must weakly operate in the same direction on all individual units.
- Formally, either π1𝑖 ≥ 0 for all i or π1𝑖 ≤ 0 for all i.
- That is, while the instrument may have no effect on some people, all those affected are affected in the same direction (positively or negatively).
- E.g., Draft eligibility may have no effect on the probability of military service for some, but when it does have an effect, it shifts them all into service or all out of service—not both.
- Without monotonicity, IV estimators aren't guaranteed to estimate a weighted average of the underlying causal effects of the affected group.
- If all 5 assumptions are satisfied, we have a valid IV strategy.
- Even if valid, though, it's estimating a different thing than with homogeneous treatment effects.
- The IV estimator is estimating the LATE of D on Y:
- So the LATE parameter calculates the difference in the potential outcomes (the average causal effect of D) for those whose treatment status was changed by the instrument.
- Hence, we're only averaging over treatment effects for those for whom 𝐷𝑖¹ − 𝐷𝑖⁰ = 1.
- With all 5 assumptions, IV estimates the ATE for compliers (LATE); "local" because it applies to compliers only.
- By contrast, with traditional IV under homogeneous treatment effects, compliers and defiers have the same treatment effect.
- In Angrist (1990), IV estimates the ATE of military service on earnings only for the subpopulation who enrolled in the military because of the draft—those who would not have served otherwise.
● It doesn't identify causal effects on patriots who would always serve (for them, 𝐷𝑖¹ = 𝐷𝑖⁰ = 1, meaning 𝐷𝑖¹ − 𝐷𝑖⁰ = 0).
● It also doesn't tell us the effect of military service on those who were exempted from military service for medical reasons (for them, 𝐷𝑖¹ = 𝐷𝑖⁰ = 0).
- The LATE framework partitions the population of units into 4 mutually exclusive groups:
1. Compliers: treatment status is affected by the instrument in the correct direction; 𝐷𝑖¹ = 1, 𝐷𝑖⁰ = 0.
2. Defiers: treatment status is affected by the instrument in the wrong direction; 𝐷𝑖¹ = 0, 𝐷𝑖⁰ = 1.
3. Never takers: they never take the treatment, regardless of the instrument's value; 𝐷𝑖¹ = 𝐷𝑖⁰ = 0.
4. Always takers: they always take the treatment, regardless of the instrument's value; 𝐷𝑖¹ = 𝐷𝑖⁰ = 1.
- What does the table look like?
- Without further assumptions, LATE is non-informative about effects on never-takers and always-takers (the instrument doesn't affect their treatment status).
- In most applications, we're mostly interested in estimating the ATE on the whole population—but that's usually not possible with IV.
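- The LATE estimand displayed above was not extracted. A standard way to write it under the five assumptions (the usual Wald form—reduced form divided by first stage—given here as a sketch):
```latex
\delta_{LATE} = E\big[Y_i^1 - Y_i^0 \mid D_i^1 - D_i^0 = 1\big]
             = \frac{E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]}{E[D_i \mid Z_i = 1] - E[D_i \mid Z_i = 0]}
```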
Example: Living near a college
- We revisit the returns-to-schooling literature, looking at Card (1995), who estimated the regression:
where Y is log earnings, S is years of schooling, X is a matrix of exogenous covariates, and ϵ is an error term containing unobserved ability, etc.
- Assuming ϵ contains ability and ability is correlated with schooling, then Cov(S, ϵ) ≠ 0, so the schooling coefficient is biased.
- Card instrumented schooling with the
college-in-county dummy variable.
● Data came from NLS Young Men
Cohort of the National
Longitudinal Survey, following
men aged 14-24 in 1966 until
1981
● One question is whether the
respondent lives in the same
county as a 4-year (and a 2-year)
college.
- This is a weird instrument, and it’s good:
the presence of a college increases the
likelihood of going to college by lowering
costs.
- We select on a group of compliers whose behavior is affected by the variable: some kids will always go to college regardless, and some never will.
- But there's a group of compliers who go to college only because their county has a college. Poorer (liquidity-constrained) people attended only because college became slightly cheaper.
- If the returns to schooling for them are different from those of always-takers, our estimates represent not the ATE but the LATE.
- OLS: An extra year of schooling increases earnings by 7.1%.
- With 2SLS (in Stata, ivregress 2sls), there's a much larger impact (75% larger): a 12.4% increase in earnings for every extra year of schooling.
- In the first stage, the college-in-county dummy is associated with 0.327 more years of schooling, significant at 0.1%. The F-stat exceeds 15, so we don't have a weak instrument.
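- A quick numerical sanity check of the just-identified logic above, on simulated data with hypothetical names (not Card's data): with a single binary instrument, the 2SLS coefficient equals the reduced-form coefficient divided by the first-stage coefficient.
```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

ability = rng.normal(size=n)
near_college = rng.binomial(1, 0.5, size=n)                       # hypothetical instrument
schooling = 12 + 0.33 * near_college + ability + rng.normal(size=n)
log_wage = 1.0 + 0.10 * schooling + 0.30 * ability + rng.normal(size=n)

def slope(y, x):
    # bivariate OLS slope of y on x (with intercept)
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

first_stage = slope(schooling, near_college)   # effect of Z on S
reduced_form = slope(log_wage, near_college)   # effect of Z on Y
wald_2sls = reduced_form / first_stage         # just-identified 2SLS estimate

print(first_stage, reduced_form, wald_2sls)    # wald_2sls ~ 0.10 (the true return)
```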
- Results show that compliers have higher
return to schooling than the general
population. Why?
- Do compliers have more ability? No,
because if so, and we use the instrument,
the 2SLS estimate must be even smaller
than OLS!
- Instead, the likely reason is that compliers
are the ones initially underinvesting in
education due to higher marginal costs
(e.g., not living near a college).
● The instrument (college proximity)
lowers these costs, enabling them
to attend. This suggests their true
returns to schooling are high, but
they weren’t realizing them before
due to barriers.
● Thus, the likely explanation is
heterogeneous treatment effects,
with compliers benefiting more
from schooling
Example: Elasticity of demand for fish
- Graddy (2006) allegedly collected data from the Fulton Fish Market in New York. It's one of the largest fish markets in the world (next to Tsukiji in Tokyo).
- We want to look at the price elasticity of demand for fish, an extremely heterogeneous and differentiated product category. (Akin to Philip Wright's study of supply and demand.)
- Elasticity of demand is a sequence of quantity and price pairs, but with only one pair observed at a given point in time.
- Demand is itself a sequence of potential outcomes (quantity) associated with different potential treatments (price). The demand curve itself is a real object, but mostly unobserved.
- To trace out the elasticity, we need an instrument correlated with just supply.
- Graddy (2006) proposed a number of instruments, centering on weather at sea in the days before the fish arrived at the market.
- The first instrument is the average maximum wave height in the previous two days.
- We're estimating the model:
where Q is the log of the quantity of whiting sold in pounds, P is the log of the average daily price per pound, X are day-of-the-week dummies and a time trend, and ϵ is a structural error term.
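- The estimating equation itself was not extracted; a hedged LaTeX sketch matching the variable definitions above (δ is the price elasticity of demand):
```latex
\ln Q_t = \alpha + \delta \ln P_t + X_t'\gamma + \epsilon_t
```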
- The next table shows OLS and 2SLS
estimates.
- Note that the estimated elasticity of demand from OLS is −0.549. When we use average wave height as an instrument for price, the elasticity is −0.96. (A 10-percent increase in price causes quantity to decrease by 9.6%.)
- The instrument is strong (F > 22), and for a one-unit increase in wave height, price rose by 10%.
- What is the instrument doing to supply, exactly? If higher waves make it more difficult to catch fish, is the composition of caught fish also changing (and also the quantities bought and sold)? If so, the exclusion restriction is violated.
- Another instrument is wind speed (the three-day lagged maximum wind speed). The next table shows OLS and 2SLS estimates.
- It turns out this is a weak instrument: F < 10, and the estimated elasticity is twice as large.
- This estimate is likely severely biased and less reliable than the previous one (which itself doesn't convincingly satisfy the exclusion restriction, and may be at best a LATE relevant to compliers only).
- But LATE itself may be useful and informative, if we think compliers' causal effects are similar to those of the broader population.

Popular IV designs
- So long as you have a good instrument, IV can be used in any context.
- But some IV strategies have been used so many times that they constitute their own designs.
- We discuss three:
1. Lotteries
2. Judge fixed effects
3. Bartik instruments

Popular IV designs
Lotteries
- A randomized lottery is often used as an instrument for participation in a treatment.
- E.g., In RCTs, treatment is randomly assigned. But there would be positive selection bias if only those particularly likely to benefit from the treatment actually volunteer to be in the treatment (versus people in the control group who don't have access to it at all).
- If you compare means between treated and untreated people with OLS, the treatment effect will be biased even in an RCT because of "noncompliance."
- A solution is to use a randomized lottery as an instrument for being enrolled in the treatment.
- IV is incredibly useful in experimental designs, especially where people refuse to comply with their treatment assignment or even to participate in the experiment altogether.
- E.g., In the case of Medicaid in the US, what's the effect of expanding access to public health insurance for low-income adults?
- Observational studies are confounded by selection into health insurance.
- The RAND health insurance experiment in the 1970s was important, but it randomized only cost-sharing and not health insurance coverage.
- In Oregon, Medicaid benefits for poor adults were scaled up in the 2000s.
● Adults aged 19-64 with income less than 100% of the federal poverty line were eligible, so long as they weren't eligible for other programs.
● They also had to have been uninsured for fewer than 6 months and be legal US residents.
- In this so-called Oregon Medicaid Experiment, the state used a lottery to enroll volunteers.
● For 5 weeks, people signed up, and the state used a lot of advertising.
● From a list of 85,000, the state drew 30,000, and they were given a chance to apply.
● If they did apply, their entire household was enrolled, so long as they returned the application within 45 days.
● Of this, only 10,000 were enrolled.
- Prominent studies on this are Finkelstein et al. (2012) and Baicker et al. (2013), who gathered several outcome variables from third parties.
● From the lottery sign-up they got the pre-randomization demographic info.
● State admin records on Medicaid enrollment were also collected, and became the primary measure of the first stage (insurance coverage).
● Outcomes were hospital discharge, mortality, credit, etc., and they got them from mail surveys, in-person surveys, and measurements of blood samples, BMI, etc.
- Straightforward IV design. The two stages were:
where the first equation is the first stage (insurance regressed against the lottery outcome and a bunch of covariates) and the second stage regresses individual-level outcomes against predicted insurance and all the controls.
- So long as the first stage is strong, F will be large, and the finite sample bias is small.
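- The two estimating equations referenced above were not extracted. A hedged LaTeX sketch of the standard 2SLS pair they describe (notation is illustrative, not the papers' exact specification):
```latex
\text{First stage: } \quad Insurance_i = \alpha_0 + \alpha_1 Lottery_i + X_i'\alpha_2 + \nu_i
\text{Second stage: } \quad Y_i = \beta_0 + \beta_1 \widehat{Insurance}_i + X_i'\beta_2 + \varepsilon_i
```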
- The next table shows the IV regression results: using different samples, they showed a large effect of winning the lottery on enrollment.
- Winning the lottery increased the probability of being enrolled by 26% and raised the number of months on Medicaid from 3.3 to 4.
- The next table shows regression models: the first is the intent-to-treat estimates (reduced form model) and the second is the LATE (full IV specification).
- Note that Medicaid increases the number of hospital admissions but had no effect on emergency room visits. But there's a positive and significant effect on non-ER admissions.
- So, Medicaid is increasing hospital admissions without putting strain on ERs (which have scarce resources).
- The next table shows other healthcare-utilization outcomes. Focusing on the LATE estimates, Medicaid enrollees were 34% more likely to have a usual place of care, 28% more likely to have a personal doctor, 24% more likely to complete their healthcare needs, 20% more likely to get all needed prescriptions, and reported a 14% increase in satisfaction with the quality of their care.
- Interestingly, Medicaid also helped people cope with catastrophic health events, and the next table shows the impact on financial outcomes.
- There was a reduction in personal debt of $390, and a reduction in debt going to collection.
- They also found reductions in out-of-pocket medical expenses, medical expenses, borrowing money or skipping bills for medical expenses, and whether they refused medical treatment due to medical debt.
- Oddly enough, the impact on health outcomes was unclear.
- There was an improvement in self-reported health outcomes, and more days in which they were physically and mentally healthy.
- Meanwhile, there was a reduction in depression.
- Effects, however, were quite small. Also, there's no effect on mortality (we'll return to this in the discussion of DiD).

Popular IV designs
Judge fixed effects
- The judge fixed effects (JFE) design (aka leniency design) has also become popular.
- It's been used to answer important questions in the area of criminal justice (specifically, the effect of criminal justice interventions on long-term outcomes).
- E.g., In the US, jurisdictions will randomly assign judges to defendants. In Harris County, Texas, they used bingo machines to assign defendants to one of dozens of courts.
- Ingredients of a good JFE design:
1. There's a narrow pipeline through which all individuals must pass;
2. Many randomly assigned decision-makers (who assign a treatment to individuals) block individuals' passage; and
3. There's discretion among decision-makers.
- Gaudet, Harris, and St John (1933) recognized that there were systematic differences in judge sentencing behavior.
- Besides guilt, what determined the sentencing outcomes of defendants?
- With random assignment of judges, the characteristics of defendants should remain approximately the same across judges. So, any differences in sentencing outcomes must be connected to the judge.
- The next figure shows the identification strategy on over 7,000 hand-collected cases, showing systematic differences in judge sentencing behavior.
- Imbens and Angrist (1994) mentioned JFE in relation to the 5 identifying assumptions of IV.
- Kling (2006) used randomized judge assignments with judge propensities to instrument for incarceration length. He then linked defendants to employment and earnings records. He found no adverse effects on labor market outcomes from longer sentences.
- Mueller-Smith (2015) reached the opposite conclusion for Harris County, Texas: incarceration increases the frequency and severity of recidivism, worsens labor market outcomes, and increases dependence on public assistance.
- The literature features a ton of studies showing that judicial severity causes adverse consequences for defendants, including juvenile delinquency, high school outcomes, adult recidivism, teen pregnancy, future employment, etc.
- In the JFE design, the three main identifying assumptions are:
1. Independence assumption
2. Exclusion restriction
3. Monotonicity assumption
- First, independence is satisfied in most cases because administrators are randomly assigned to individual cases.
- The instrument (modeled as the average propensity of a judge excluding the defendant's own case, or simply a series of judge fixed effects) easily passes the independence test.
- But defendants may engage in endogenous sorting in response to the strictness of the judge (e.g., strategically changing one's plea so they can "forum shop" and end up with a more lenient judge).
● Before the analysis, check on pre-treatment covariates: all observable characteristics must be equally distributed across judges.
● Or, just use the original assignment for identification (but that data might not always be available).
- Second, violations of the exclusion restriction are a bigger problem.
- E.g., If a defendant is assigned to a severe judge, the defense attorney and defendant may choose to accept a lesser plea in response to the judge's anticipated severity—so the instrument (judge severity) no longer affects the outcome (plea deal) only through the sentence.
- The exclusion restriction is more compelling if, as in Dobbie et al. (2018), the instrument is the bail decision set by judges who have no other effect on the judicial process later on.
- Third, monotonicity is also difficult to defend: here, monotonicity means that if one judge is stricter than another, that judge should always be stricter for all types of defendants.
- E.g., If Judge A tends to sentence more harshly than Judge B, then this must hold for all defendants, regardless of race, type of crime, or other characteristics. A defendant shouldn't face a harsher sentence from Judge A in some cases but a lighter one in others.
- But this is a tough assumption to defend because judges are human, and their decisions may depend on specific circumstances.
● A judge might be lenient except in cases involving drugs or Black defendants, for example.
● In such cases, their behavior isn't consistent across all types of cases, and monotonicity fails.
- Imbens and Angrist (1994) were skeptical that judge assignment satisfies monotonicity—biases and context-specific behavior break the assumption that the judge's strictness is consistent across all cases.
- Mueller-Smith (2015) used a parametric strategy of instrumenting for all observed sentencing dimensions, allowing the instruments' effects on sentencing outcomes to be heterogeneous in defendant traits and crime characteristics.
- Frandsen, Lefgren, and Leslie (2023) proposed a test for exclusion and monotonicity based on relaxing the monotonicity assumption.
● It requires that the ATE on individuals violating monotonicity be identical to the ATE among some subset of individuals who satisfy it.
● It is based on the observation that (1) conditional on judge assignment, average outcomes must fit a continuous function of judge propensities; and (2) the slope of that function is bounded by the width of the outcome variable's support.
● It requires simply that observed outcomes averaged by judge are consistent with some such function.
● But their procedure tests for exclusion and monotonicity jointly (they can't be unbundled unless you have prior information to rule out one of them).
- Consider Stevenson (2018), who looked at how cash bail affected case outcomes.
- She used data scraped from online court records in Philadelphia.
- The natural experiment hinged on the random assignment of bail judges ("magistrates") who differ widely in their propensity to set bail at affordable levels.
- Given a downward-sloping demand curve, more severe judges set expensive bail, will see more defendants unable to pay their bail, and those defendants are forced to remain in detention prior to trial.
- The author found that an increase in randomized pretrial detention leads to a 13% increase in the likelihood of conviction.
● This is caused by an increase in guilty pleas among defendants who otherwise would've been acquitted or had their charges dropped.
● Pretrial detention also increased the length of incarceration by 42%, and the amount of non-bail fees owed by 41%.
● Cash bail contributes to a cycle of poverty (those unable to pay court fees are trapped in the penal system with higher guilt rates, court fees, and reoffending rates).
- When using judge assignment as an instrument, why can't we just use each judge's average strictness (excluding the defendant's own case) as a single instrument?
- Since each person gets a different judge, you'd get a unique instrument per person—so it seems like a good setup for two-stage least squares (2SLS).
- But even though it looks like one instrument, you're actually using a high-dimensional set of judge fixed effects—potentially dozens or hundreds of them.
- Some of these judges may not differ much in their strictness, making their instruments weak, which can bias your estimates back toward ordinary least squares (OLS), undermining your causal inference.
- This is still being resolved by economists.
- The dataset of Stevenson (2018) used 331,971 observations and 8 randomly assigned bail judges.
- 2SLS suffers from finite sample problems when there are weak instruments. Angrist, Imbens, and Krueger (1999) proposed the jackknife IV estimator (JIVE) to remedy this; it's suitable if you have several instruments and some are weak.
● JIVE is a leave-one-out estimator: use all observations except for unit i.
● It's suitable for JFE because ideally the instrument is the mean strictness of the judge in all other cases, excluding the defendant's own case.
- Let's replicate the results:
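- A minimal Python sketch of the leave-one-out leniency instrument just described—each defendant's instrument is their judge's average detention rate computed from all of the judge's other cases (simulated data, hypothetical names):
```python
import numpy as np

rng = np.random.default_rng(2)
n_cases, n_judges = 10_000, 8

judge = rng.integers(0, n_judges, size=n_cases)       # random assignment to 8 judges
severity = np.linspace(0.2, 0.6, n_judges)            # judges differ in strictness
detained = rng.binomial(1, severity[judge])           # pretrial detention decision

# Leave-one-out leniency: judge's detention rate excluding the defendant's own case
sums = np.bincount(judge, weights=detained, minlength=n_judges)
counts = np.bincount(judge, minlength=n_judges)
leniency_loo = (sums[judge] - detained) / (counts[judge] - 1)

# leniency_loo can now serve as the (single) instrument for `detained`
# in a 2SLS regression of case outcomes on pretrial detention.
print(leniency_loo[:5])
```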
immigration flows in region l and time t, 𝑋𝑙,𝑡
are controls including region and time fixed
effects.
- δ is some ATE of the immigration flows’
effect on native wages.
- Here, immigration flows are going to be
highly correlated with the error term such
as the time-varying characteristics of
location l (e.g., changing amenities).
- The Bartik instrument is created by
interacting initial shares of geographic
regions—prior to the contemporaneous
immigration flow—with regional growth
rates.
- Deviations of a region’s growth from the
US national average are explained by
deviations in the growth prediction
variation from the national average.
- And deviations of the growth prediction
variables from the US national average are
- OLS results show no connection between due to the shares because the national
pre-trial detention and guilty plea. growth effect for any particular time period
- But when we use IV with binary JFE as is the same for all regions.
instruments, estimates range form 15% to - The Bartik instrument is given by:
21%.
- JIVE results even larger.
- Instruments are strong (see by regressing
detention onto the binary instruments): all
but two are statistically significant at 1%.
- In conclusion, JFE is a very popular form
of IV, and a powerful estimator of LATE. where 𝑧 0 are the initial share of
𝑙,𝑘,𝑡
- BUT make sure you can satisfy and immigrants from source country k in
defend the three identifying assumptions. location l; and 𝑚𝑘,𝑡 is the change in
Popular IV designs immigration from country k into the US.
Bartik instruments - The first term is the share variable, the
- Bartik instruments (aka shift-share second the shift variable.
instruments) were named after Bartik - The predicted flow of immigrants B into
(1991), but developed by earlier scholars. destination l is just a weighted average of
- They’re influential in migration economics, the national inflow rates from each country
trade, labor, public, etc. in which weights depend on the initial
- The idea is that OLS estimates of distribution of immigrants.
employment growth rates on labor market - Once we construct the instrument, we
outcomes are likely biased, since labor have a 2SLS estimator that first regresses
market outcomes are determined by labor the endogenous 𝐼𝑙,𝑡 onto the controls and
supply and demand. the Bartik instrument.
- Bartik instruments measure the change in - Use the fitted values to regress 𝑌𝑙,𝑡 onto 𝐼 𝑙,𝑡
a region’s labor demand due to changes in
the national demand for different to recover the impact of immigration flows
industries’ products. onto wages.
- Consider the following wage equation:
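- A minimal Python sketch of the shift-share construction just described: initial shares of immigrants by source country interacted with national inflow changes (simulated data, hypothetical names):
```python
import numpy as np

rng = np.random.default_rng(3)
n_locations, n_countries, n_periods = 50, 10, 5

# z0[l, k]: initial (pre-period) share of immigrants from country k in location l
z0 = rng.dirichlet(np.ones(n_countries), size=n_locations)

# m[k, t]: national change in immigration from country k in period t (the "shift")
m = rng.normal(size=(n_countries, n_periods))

# Bartik / shift-share instrument: B[l, t] = sum_k z0[l, k] * m[k, t]
bartik = z0 @ m

# bartik[l, t] would then instrument the endogenous local immigration flow I[l, t]
print(bartik.shape)  # (50, 5)
```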
- Share: how much a place is initially exposed to something (industries, exports, migration, etc.).
- Shift: how much that something changes at a national/global level (the shock).
- Bartik instrument = share × shift.
● E.g., A city with many auto workers (share) is more affected by a national boom in auto jobs.
- The idea is that variations in local outcomes (employment, wages, etc.) are plausibly exogenous, with the shock coming from outside the local economy and not due to local shocks/policies.
- Use Bartik instruments to study causal effects, especially when the treatment is endogenous (e.g., local job growth affected by other unobservables).

Popular IV designs
Bartik instruments
- The Bartik instrument works as an instrument because it satisfies two conditions.
- First, the first stage: the Bartik instrument is correlated with the endogenous variable.
● Bartik is constructed as a weighted average of national shocks (shifts), where the weights are local exposures (shares).
● If national shocks matter, and a region is more exposed to affected industries, the Bartik instrument will predict local outcomes.
● E.g., If a city has many manufacturing workers, and there's a manufacturing boom, the instrument predicts higher local job growth.
- Second, the exclusion restriction: the shares are predetermined and not related to future shocks or unobserved local trends.
● The national shocks are exogenous and unrelated to local unobservables.
- What identifying assumptions are needed? There are two approaches or perspectives.
- Shares perspective (Goldsmith-Pinkham et al., 2020): The exogeneity assumption should focus on the initial industry shares, which act as the instruments (they provide the exogenous variation).
● The researcher must make sure (and convince readers) that these shares are exogenous conditional on observables (e.g., location fixed effects).
● The shifts mainly influence instrument strength, but not identification.
● If you think the initial shares are the proper instruments, they should measure differential exogenous exposures to some common shock.
- As the shares are equilibrium values based on labor supply and demand, it may be tough to justify them as exogenous to the structural unobserved determinants of some future labor market outcome.
- Another challenge is the sheer number of shifting values (e.g., hundreds of industries, multiplied by many time periods, may lead to violations of the exclusion restriction).
- Note: If there's a pre-period, this design resembles DiD (in which case we need to test for placebos, pre-trends, etc.).
- Shifts perspective (Borusyak et al., 2022): Exogenous shares are sufficient but not necessary for identification.
● Even if shares aren't exogenous (i.e., they're correlated with outcomes indirectly), identification is still possible if temporal shocks are exogenous and uncorrelated with bias in the shares (i.e., shares can't be correlated with the differential changes associated with the national shock).
● If the shock itself creates exogenous variation, that shifts the burden of excludability from shares to shocks.
● Diagnostics: Focus on testing the validity of shocks as exogenous sources of variation.
- Bartik may also be similar to JFE, since it's just a specific combination of many instruments (analogously, the judge's propensity was a combination of many binary fixed effects).
● If one assumes a null of constant
treatment effects, one can use
overidentification tests. But they
fail if there’s treatment
heterogeneity as opposed to a
violation of the exclusion
restriction.
● Or, if one assumes
cross-sectional heterogeneity—in
which treatment effects are
constant within a location
only—there are diagnostic aids in
Goldsmith-Pinkham et al. (2020).
- Finally, the Bartik estimator can be
decomposed into a weighted combination
of estimates, where each share is an
instrument (aka Rotemberg weights which
sum to one).
● Larger weights indicate that those
instruments are responsible for
more of the identifying variation.
● Weights tell us which shares get more weight in the overall estimate, helping identify which industry shares must be scrutinized.
● You can be more confident in the
identification strategy if some
regions with large weights pass
some basic specification tests.
Conclusion
- IV design is powerful if your data suffers
from selection on unobservables.
- But because it has many limitations, many
researchers avoid it.
● It identifies only LATE under
heterogeneous treatment effects.
And its value depends on how
closely the compliers’ ATE
resembles that of other
subpopulations
● IV also has 5 identifying
assumptions (versus RDD which
just has one). It’s hard to imagine
a pure instrument satisfying all
that.
- At any rate, IV can come in handy, and the
best instruments come from in-depth
knowledge of institutional details of a
program or intervention (a function of your
investment in a field) (Angrist & Krueger,
2001).
- “Rarely will you find them from simply
downloading a new data set, though.
Intimate familiarity is how you find
instrumental variables, and there is, alas,
no shortcut to achieving that.”