Causal Inference
Introduction
1. Confusion Over Causality
● Spurious Correlation
Causally unrelated variables might happen to be highly correlated with each other over
some period of time.
● Anecdotes & Science Reporting
People have beliefs about causal effects in their own lives
Headlines often do not use forms of the word “cause”, but are still interpreted causally
● Reverse Causality
Even if there is a causal relationship, sometimes the direction is unclear
2. Causal Inference
● Formal Definitions of causal effects
● Assumptions necessary to identify causal effects from data (untestable)
● Rules about what variables need to be controlled for
● Sensitivity analyses to determine the impact of violations of assumptions on conclusions
“Observational studies are an interesting and challenging field which demands a good deal of
humility, since we can claim only to be groping towards the truth.” (Cochran, 1972)
3.1 Treatment and Outcomes
Suppose we are interested in the causal effect of some treatment A on some outcome Y.
● Treatment: binary or categorical
● Potential Outcomes:
Outcome we would see under each possible treatment option
Ya is the outcome that would be observed if treatment was set to A = a
● Counterfactuals:
Outcomes that would have been observed had the treatment been different
E.g. If treatment was A = 1, then counterfactual outcome is Y0
Suppose Treatment A is binary (0, 1):
Before the treatment decision is made, any outcome is a potential outcome: Y0, Y1
After the study, there is an observed outcome Y = YA and counterfactual outcome Y1-A
3.2 Hypothetical Intervention
We will primarily focus on treatments that could be thought of as interventions → Well Defined in
“Potential Outcomes Framework”
● Intervention: can imagine being randomized / manipulated in a hypothetical trial
● Not Immutable Variables: immutable variables cannot be manipulated
● Has one version, i.e. no hidden versions of treatment
● Potentially actionable afterwards
3.3 Causal Effect
(Figure: hypothetical world vs. real world)
● Treatment A had a causal effect on outcome Y if Y1 differs from Y0
● Average Causal Effect: E(Y1 - Y0), the average value of Y if everyone was treated with A = 1
minus the average value of Y if everyone was treated with A = 0
● E(Y1-Y0) ≠ E(Y|A=1) - E(Y|A=0) in general, because E(Y|A=1) is the mean outcome in the
subpopulation that actually received A = 1, which can differ systematically from the whole
population (see the simulation sketch at the end of this subsection)
● Other Causal Effect
○ E(Y1/Y0): causal relative risk
○ E(Y1-Y0|A=1): causal effect of treatment on the treated
○ E(Y1-Y0|V=v): average causal effect in the subpopulation with covariate V=v
● Challenge - Fundamental Problem of Causal Inference: we only observe one treatment
and one outcome for each unit (so we consider population causal effect) → How do we
use observed data to link observed outcomes to potential outcomes?
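A minimal simulation sketch of this point (hypothetical data and variable names, not from these notes): a confounder X drives both treatment and outcome, so the naive contrast E(Y|A=1) - E(Y|A=0) differs from the true average causal effect E(Y1 - Y0).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Confounder X affects both treatment assignment and the outcome.
x = rng.normal(size=n)
a = rng.binomial(1, 1 / (1 + np.exp(-2 * x)))   # P(A=1|X) increases with X
y0 = x + rng.normal(size=n)                     # potential outcome under A = 0
y1 = y0 + 1.0                                   # true individual causal effect = 1
y = np.where(a == 1, y1, y0)                    # observed outcome Y = Y^A (consistency)

print("True ACE   E(Y1 - Y0)        :", (y1 - y0).mean())                      # ~1.0
print("Naive diff E(Y|A=1) - E(Y|A=0):", y[a == 1].mean() - y[a == 0].mean())  # biased, > 1
```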
3.4 Causal Assumptions
Identifiability of causal effects requires making some untestable assumptions about observed
data: Y, A and a set of pre-treatment covariates X.
1. Stable Unit Treatment Value Assumption (SUTVA): Yia depends only on unit i’s own treatment Ai = a
a. No interference: units do not interfere with each other
b. One version of treatment
→ Write potential outcome for the i-th person in terms of only that person’s treatments
2. Consistency Assumption: Y = Ya if A=a, for all a
→ The potential outcome under treatment A=a, Ya is equal to the observed outcome if
actual treatment received is A=a
3. Ignorability Assumption (No Unmeasured Confounders): Y0, Y1 ⊥ A | X
→ Within levels of X, treatment is randomly assigned
4. Positivity Assumption
For every set of values of X, treatment assignment was not deterministic, i.e.
P(A=a|X=x)>0 for all a and x
E(Y|A=a, X=x) (observed data) = E(Ya|A=a,X=x) (consistency) = E(Ya|X=x) (ignorability)
3.5 Causal Design
Cross-sectional user design: a snapshot of users at one point in time; it ignores treatment history
and compares against users with no treatment. Better alternatives include:
● Incident User Design (new user design): restrict the treated population to those newly
initiating treatment → a cleaner problem, not confounded by prior treatment experience
● Active Comparator: the control group also receives a treatment (of a similar type as the one
studied) → fewer confounders
3.6 Confounding
Confounders are often defined as variables that affect both the treatment and the outcome
→ For Ignorability: within levels of confounders, treatment and outcome are independent
Eg1. Assign the color of the onboarding card (treatment) in Zephyr based on a coin flip, and collect
onboarding members’ weekly macrosessions (outcome). → The coin flip is not a confounder since
it does not affect the outcome (except through the treatment).
Eg2. People with a family history of cancer are more likely to develop cancer (the outcome), but
family history does not influence the treatment decision → family history is a risk factor for the
outcome, not a confounder.
Eg3. If older people are at higher risk of cardiovascular disease (the outcome) and are also more
likely to receive statins (the treatment) → age is a confounder.
Confounding Control:
1. Identifying a set of variables X that make the ignorability assumption hold
2. Using statistical methods to control for these variables and estimate the causal effect
3.7 Causal Graphs
A causal graph is a Directed Acyclic Graph (DAG) that helps identify confounding variables
needed to achieve ignorability, by telling us
● which variables are independent from each other
● which variables are conditionally independent from each other
→ ways that we can factor and simplify the joint distribution
DAG: all edges are directed and there are no cycles
Terminology:
● nodes / vertices, edges, paths (a way to get from one vertex to another traveling along edges)
● Parents and children (for adjacent nodes), ancestors and descendants
DAG & Probabilities
Decomposition of the joint distribution (the factorization the DAG is compatible with): start from
the root nodes and condition each node on its parents, i.e. P(X1, …, Xn) = Π i P(Xi | parents(Xi))
Paths & Associations & Blocking
1. Chains: A → X → Y
● A and Y are associated, since information from A flows through X to Y
● Conditioning on X (the node in the middle of the chain) blocks the path from A to Y.
Eg. A: temperature, X: whether or not sidewalks are icy, Y: whether or not someone falls
2. Forks: A ← X→ Y
● A and Y are associated, since information from X flows to both of them
● Conditioning on X (the node in the middle of the fork) blocks the path from A to Y.
3. Inverted forks: A → X ← Y
● A and Y are independent since info from A and Y collide at X (collider)
● Conditioning on X induces an association between A and Y
Eg. A: state of an on/off switch (coin flip), Y: state of another, independent switch (another coin),
X: whether the lightbulb is lit (it lights up only if both A and Y are in the on state)
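A small simulation sketch of the collider example above (hypothetical variable names): A and Y are independent coin flips and X is the bulb; marginally A tells us nothing about Y, but conditioning on the collider X induces an association.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

a = rng.binomial(1, 0.5, n)   # switch A (coin flip)
y = rng.binomial(1, 0.5, n)   # switch Y (independent coin flip)
x = a & y                     # collider: bulb lit only if both switches are on

# Marginally, A carries no information about Y.
print("P(Y=1|A=1) =", y[a == 1].mean(), " P(Y=1|A=0) =", y[a == 0].mean())   # both ~0.5

# Conditioning on the collider (bulb off) induces a strong negative association.
off = x == 0
print("P(Y=1|A=1,X=0) =", y[(a == 1) & off].mean(),   # = 0
      " P(Y=1|A=0,X=0) =", y[(a == 0) & off].mean())  # ~0.5
```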
Rules for d-separation
A path is blocked (d-separated) by a set of nodes X if it contains a chain whose middle node is in
X, or a fork whose middle node is in X, or an inverted fork whose middle node (the collider) is not
in X and has no descendant in X.
Two nodes A and Y are d-separated by a set of nodes X if X blocks every path from A to Y, i.e.
Y ⊥ A | X. → Recall the ignorability assumption: Y0, Y1 ⊥ A | X
3.8 How to control confounding variables (X)?
Pre: Frontdoor v.s. Backdoor path
● A frontdoor path from A to Y is one that begins with an arrow emanating out of A
(we do not need to worry about these, since they capture effects of the treatment on the outcome;
do not control for nodes on them, e.g. mediators; frontdoor paths mainly matter in causal
mediation analysis)
● A backdoor path from A to Y is a path from A to Y that travels through an arrow going into A
Backdoor paths confound the relationship between A and Y! → They need to be blocked!
Backdoor Path Criterion
A set of variables X is sufficient to control for confounding if 1. it blocks all backdoor paths from
treatment to outcome and 2. it does not include any descendants of the treatment
● Requires knowing the causal DAG (domain expertise & assumptions)
● There are usually many choices of control sets → pick one likely to be sufficient and run a
sensitivity analysis
Disjunctive Cause Criterion
Control for all (observed) causes of the exposure, the outcome, or both
(Property: if there is a set of observed variables that satisfies the backdoor path criterion, then the
variables selected by the disjunctive cause criterion will also be sufficient to control confounding)
● Does not always select the smallest set of variables
● Is conceptually simpler (no need to know the full causal DAG)
● Works if 1) such a set exists and 2) we correctly identify all observed causes of A and Y
3.9 Sensitivity Analysis
Overt bias: there is imbalance on observed covariates
Hidden bias: there are unobserved variables that are confounders
Sensitivity analysis: if there is hidden bias, determine how severe it would have to be to change
the conclusions (statistical significance or direction of the effect)
Hypothesis: no hidden bias corresponds to Γ = 1
→ Increase Γ until the evidence of a treatment effect goes away (no longer statistically significant)
→ If this happens at Γ = 1.1, the result is very sensitive to hidden bias; if at Γ = 5, it is not very sensitive
(R packages: sensitivity2x2xk, sensitivityfull)
Observational Study
1. Randomized Trial Revisit
In a randomized trial, treatment assignment A would be determined randomly → erasing the
arrow from X to A → there are no backdoor paths from treatment A to outcome Y.
(DAG comparison: randomized trial with no arrow from X into A vs. observational study with X → A)
→ The distribution of pre-treatment variables X that affect Y will be the same in both treatment
groups (covariate balance) → if outcome distribution ends up differing, it will not be because of
X → X is dealt with at the design phase
Why not always randomize?
● Randomized experiments are expensive
● Sometimes randomizing is unethical / impractical
● Trials take time, since we need to wait for outcome data
2. Observational Study
Type I: Planned, prospective, observational studies with active data collection:
● Like trials: data collected on a common set of variables at planned times, outcomes
carefully measured, study protocols
● Unlike trials: regulations are much weaker since we are not intervening, and a broader
population is eligible for the study
Type II: Databases, retrospective, passive data collection
● Large sample sizes, inexpensive, potential for rapid analysis
● Data quality typically lower, no uniform standard of collection
3. Approaches
Analysis process: define metric & population → select confounders & instruments → select model →
tune & calculate → validate
How to choose?
● One-time or multiple treatments? Fixed Effect for multiple treatments & short-term effects
● Small sample size? Time Series, Matching
● Need a quick calculation? Regression, Stratification
● Not enough covariates? Doubly Robust, Time Series, Fixed Effect, Propensity Score
Stratification
1. Methodology
- E(Y|A=a, X=x) = E(Ya|X=x): the causal mean within stratum X = x
- P(X=x): the probability / size of each stratum
- Combine by standardization: E(Ya) = Σx E(Y|A=a, X=x) P(X=x) (a sketch follows at the end of this subsection)
- The overall (marginal) effect direction may differ from the stratum-specific directions (Simpson’s Paradox)
2. Challenges
- The ignorability assumption may be violated if X does not include all confounders
- May lead to many empty cells as the dimension or number of values of X increases
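A minimal sketch of the standardization estimator described above, for a single discrete confounder; the column names (y, a, x) are illustrative assumptions, not from the notes.

```python
import pandas as pd

def standardized_mean(df: pd.DataFrame, a: int) -> float:
    """E(Y^a) = sum_x E(Y | A=a, X=x) * P(X=x) for a discrete confounder x."""
    p_x = df["x"].value_counts(normalize=True)                 # P(X = x), whole population
    strata_means = df[df["a"] == a].groupby("x")["y"].mean()   # E(Y | A=a, X=x)
    # Strata with no subjects at treatment level a are empty cells: a positivity problem.
    return (strata_means * p_x).dropna().sum()

# ace_hat = standardized_mean(df, 1) - standardized_mean(df, 0)
```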
Matching
1. Definition
Matching is to match individuals in the treated group to individuals in the control group on the
covariates, attempting to make an observational study more like a randomized trial
Characteristics:
- Controlling for confounders is achieved at the design phase, without looking at the outcome
- Matching reveals lack of overlap in the covariate distributions (the positivity assumption needs to hold)
- Once the data are matched, the outcome can be analyzed as in a randomized trial
- Why match on the treated group? The treated group is usually smaller, and we make
inference about the treated population (other target populations are possible)
2. Methodology
Step 1: For every individual in the treatment group, find the matched individuals in the control
group based on covariates.
Step 1.1 Calculate the distance score based on distance measurements
1) Exact matching: distance is infinite if the covariates are not all equal
2) Mahalanobis distance: D(Xi, Xj) = sqrt( (Xi − Xj)ᵀ S⁻¹ (Xi − Xj) )
*S is the covariance matrix of the covariates, i.e. S = Cov(X)
*The square root of the sum of squared covariate differences, scaled by the covariance
matrix (covariates with higher variance get lower weight in the distance)
*Robust MD: replace each covariate value with its rank before computing the distance → not affected by outliers
3) Propensity Score
Step 1.2 Select Matches based on distance scores
1) Greedy (Nearest neighbor) Matching
a) Randomly order list of treated subjects and control subjects
b) Start with first treated subject. Match to the control with smallest distance and
remove the matched control from the list
c) Repeat b) until all treated are matched
d) For k:1 matching, go through the list again to find 2nd matches; repeat until each
treated subject has k matches
2) Optimal Matching
a) Consider all M treated × N control pairings
b) Select the M pairs that minimize the total distance
3) Sparse Optimal Matching
a) Match within blocks
b) Mismatches can be tolerated if fine balance can still be achieved
Caliper: a bad match can be defined using a caliper - the maximum acceptable distance
- If a treated subject has no match within the caliper, the positivity assumption is likely violated
- If we exclude bad matches, positivity holds but the target population becomes harder to define
- In PSM, a common threshold is 0.2 × std(logit(PS)) (see the sketch after the comparison table below)
Greedy v.s. Optimal
                                Greedy Matching     Optimal Matching
Is total distance minimized?    No                  Yes
Invariant to initial order?     No                  Yes
Computation                     Fast                Demanding / can be infeasible
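A rough sketch of greedy 1:1 nearest-neighbor matching on the logit propensity score with a caliper, assuming a fitted propensity score is already available; this follows the greedy algorithm above (not optimal matching), and the array names are illustrative.

```python
import numpy as np

def greedy_match(logit_ps, treated, caliper):
    """1:1 greedy nearest-neighbor matching on logit(PS), without replacement."""
    treated_idx = np.flatnonzero(treated == 1)
    control_idx = list(np.flatnonzero(treated == 0))
    np.random.default_rng(0).shuffle(treated_idx)          # random order of treated subjects
    pairs = []
    for t in treated_idx:
        if not control_idx:                                # no controls left to match
            break
        dists = np.abs(logit_ps[control_idx] - logit_ps[t])
        j = int(np.argmin(dists))
        if dists[j] <= caliper:                            # reject matches outside the caliper
            pairs.append((t, control_idx.pop(j)))          # remove matched control from the pool
    return pairs

# caliper = 0.2 * np.std(logit_ps)   # rule-of-thumb caliper from the notes
```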
Step 1.3 Assessing Balance
The purpose of matching is to achieve stochastic balance or, less ideally, fine balance (the
marginal distribution of each covariate is balanced across groups)
Method 1: Test for a difference in means between treated and controls for each covariate using a
two-sample t-test.
→ Drawback: p-values depend on sample size; we probably do not care much if the mean
differences are small, yet large samples will flag them as significant
Method 2: Create a “Table 1” comparing pre-matching and post-matching balance using the
Standardized Mean Difference (SMD) for each covariate (a sketch of the computation follows Table 1).
(Table 1: standardized mean differences before and after matching)
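A sketch of the standardized mean difference used in such a Table 1; the pooled-SD form and the |SMD| > 0.1 flag are common conventions, and the function name is illustrative.

```python
import numpy as np

def smd(x_treated, x_control):
    """Standardized mean difference for one covariate (pooled SD in the denominator)."""
    pooled_sd = np.sqrt((np.var(x_treated, ddof=1) + np.var(x_control, ddof=1)) / 2)
    return (np.mean(x_treated) - np.mean(x_control)) / pooled_sd

# Compute once on the full sample (pre-matching) and once on the matched sample;
# |SMD| > 0.1 is a common flag for meaningful imbalance.
```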
Step 2: After successfully matching and achieving adequate balance, proceed to the outcome
analysis, using randomization-style tests on the matched (dependent) samples
1) Paired-Samples T Test
H0: μd = 0 v.s. H1: μd ≠ 0 (μd: difference between mean of paired treatment and control)
t = d̄ / (sd / √n), where d̄ and sd are the mean and standard deviation of the within-pair differences and n is the number of pairs
2) Exact / Permutation Test
a) Compute test statistics T from observed data
b) Assume H0: no treatment effect
c) Randomly permute treatment assignment within pairs and recompute T
d) Repeat many times and see how unusual observed T is (by distribution p-value)
Eg. observed T = 6 compared to the distribution of T over 1,000 permutations (see the sketch after this list)
3) McNemar’s Chi-squared Test (for binary outcome)
4) Conditional Logistic Regression (Matched binary outcome data)
5) Stratified Cox Model (Time-to-event / survival outcome data)
6) Generalized Estimating Equations
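A minimal sketch of the paired permutation test from item 2), randomly flipping treatment labels within matched pairs; the test statistic here is the mean within-pair difference, and the argument names are illustrative.

```python
import numpy as np

def paired_permutation_test(y_treated, y_control, n_perm=1000, seed=0):
    """Permutation p-value for H0: no treatment effect, permuting labels within pairs."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(y_treated) - np.asarray(y_control)   # within-pair differences
    t_obs = diffs.mean()                                     # observed test statistic
    t_perm = np.empty(n_perm)
    for b in range(n_perm):
        signs = rng.choice([-1, 1], size=diffs.size)         # flip treatment/control within pairs
        t_perm[b] = (signs * diffs).mean()
    return float(np.mean(np.abs(t_perm) >= abs(t_obs)))      # two-sided p-value

# p = paired_permutation_test(y_matched_treated, y_matched_control)
```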
Weighting (IPTW)
1. Intuition
1) Use all of the data, but down-weight some observations and up-weight others
(matching vs. weighting, for a stratum with 1 treated and n control subjects)
Matching: 1 : n → 1 : 1 (select 1 of the n controls)
Weighting: 1 : n → (n+1) : (n+1) (the 1 treated subject is up-weighted to n+1; the n controls
together also represent n+1)
2. Methodology
2.1 Estimate Propensity Score
2.2 Create a pseudo-population by applying inverse propensity score weights, to obtain unconfounded groups
→ In the pseudo-population, treatment assignment no longer depends on X → everyone is equally
likely to be treated, under ignorability (the PS model π is correct) and positivity (π is not 0 or 1)
- Might need to trim the tails or truncate the weights.
2.3 Assessing Balance
Covariate balance can be checked on the weighted sample using standardized differences in a
Table 1 (or a plot) → stratify on treatment & compute weighted means and variances
→ If imbalanced: refine the propensity score model (add interactions or non-linear terms)
2.4 Estimate the causal effect (a sketch of the full IPTW pipeline follows below)
1) Linear regression model (linear marginal structural model)
E.g. for binary A ∈ {0, 1}: E(Ya) = ψ0 + ψ1 a, so the average causal effect E(Y1) − E(Y0) = ψ1
2) Generalized Marginal Structural Model
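A sketch of the IPTW pipeline in steps 2.1-2.4: fit a propensity score by logistic regression, build inverse-probability weights (with clipping as a simple stand-in for trimming/truncation), and take the weighted difference in means, which is the linear MSM contrast. scikit-learn is assumed; names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def iptw_ate(X, a, y, clip=(0.01, 0.99)):
    """IPTW estimate of E(Y1) - E(Y0); clip keeps the PS away from 0/1 (weight truncation)."""
    ps = LogisticRegression(max_iter=1000).fit(X, a).predict_proba(X)[:, 1]   # 2.1 propensity score
    ps = np.clip(ps, *clip)                                                   # guard against positivity violations
    w = a / ps + (1 - a) / (1 - ps)                                           # 2.2 inverse-probability weights
    ey1 = np.sum(w * a * y) / np.sum(w * a)                                   # weighted mean among treated
    ey0 = np.sum(w * (1 - a) * y) / np.sum(w * (1 - a))                       # weighted mean among controls
    return ey1 - ey0                                                          # 2.4 causal contrast (psi_1)
```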
Propensity Score
1. Balancing Score
A balancing score π(X) is a function of the covariates such that, conditional on it, the distribution
of X is the same in treated and control groups. If we match on a balancing score, we should achieve balance on X.
Why? Under ignorability, subjects with the same X have the same P(A=1); if two groups share the
same value of π, their X distributions are balanced, so conditioning on the balancing score works
like conditioning on the allocation probability
2. Propensity Score Definition
The PS is the probability of receiving treatment given covariates X, π(X) = P(A=1|X) (a balancing score)
The propensity score is a scalar - each subject has exactly one value of the propensity score
- In a randomized trial, P(A=1|X) = P(A) (e.g. 0.5) is known
- In an observational study, we need to estimate P(A=1|X), e.g. via logistic regression
Overlap - the positivity assumption needs to hold
If there is a lack of overlap, trimming the tails is an option,
e.g. remove controls with PS < min(PS among treated) & treated with PS > max(PS among controls)
In practice, the logit (log-odds) of the PS is often used, since it is unbounded and stretches out the distribution
3. Trimming the data
Problem: if the PS is close to 0 / 1, the positivity assumption may be violated; a PS close to 0 for a
treated subject makes its IPTW weight very large → can distort the result
Trimming the tails → remove subjects who have extreme values of the PS (close to 0/1)
- Rule of thumb: cut off the 2% tails (above the 98th percentile and below the 2nd)
Trimming the tails changes the population!
Marginal Structural Model (MSM)
1. Definition
A model for the mean of the population potential outcomes as a function of treatment:
g(E(Ya)) = ψ0 + ψ1 a, where g() is a link function
Marginal: model that is not conditional on the confounders (population average)
Structural: model for potential outcomes, not observed outcomes
2. Linear MSM (continuous outcome): E(Ya) = ψ0 + ψ1 a v.s. Logistic MSM (binary outcome): logit(E(Ya)) = ψ0 + ψ1 a
3. MSM With Effect Modification
Suppose V is a subset of confounders that modifies the effect of A.
A linear MSM with effect modification:
E(Ya|V) = ψ0 + ψ1 a + ψ2 V + ψ3 aV
More generally, g(E(Ya|V)) = h(a, V; ψ), where
h() is a function specifying the parametric form in a and V (typically additive, linear).
4. Compare with Generalized Linear Model
MSM and GLM are not equivalent when there is confounding: the MSM models E(Ya), the mean if
treatment were set to a (potential outcomes), while the GLM models E(Y|A=a), the conditional
mean among those observed with A = a:
g(E(Ya)) = ψ0 + ψ1 a (MSM) v.s. g(E(Y|A=a)) = β0 + β1 a (GLM)
However, the pseudo-population created by IPTW is free from confounding!
→ Estimate the MSM by fitting a model to the observed data of the IPTW pseudo-population
→ i.e. fit a weighted generalized linear model to estimate the coefficients
→ those coefficients are the MSM parameters ψ
Doubly Robust (DR)
1. Definition
- Propensity Score Model + Outcome Regression Model
- IPTW: Ê(Y1) = (1/n) Σi Ai Yi / π(Xi), where π(Xi) = P(A=1|Xi) is the propensity score
- Regression: Ê(Y1) = (1/n) Σi m1(Xi), where m1(Xi) = E(Y|A=1, Xi) is the outcome regression
- Doubly robust (augmented IPTW): Ê(Y1) = (1/n) Σi [ Ai Yi / π(Xi) − (Ai − π(Xi)) m1(Xi) / π(Xi) ]
- Unbiased if either one is correctly specified
2. Justification
- If the propensity score model is correct, i.e. E(Ai|Xi) = π(Xi), then the augmentation term
(Ai − π(Xi)) m1(Xi)/π(Xi) has mean 0, so the estimator behaves like (unbiased) IPTW
- If the outcome regression model is correct, i.e. E(Yi|Ai=1, Xi) = m1(Xi), then rewriting the
estimator as (1/n) Σi [ m1(Xi) + Ai (Yi − m1(Xi))/π(Xi) ] shows the residual term has mean 0, so it
behaves like the (unbiased) regression estimator (a code sketch follows below)
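A minimal sketch of the doubly robust (AIPW) estimate of E(Y1) using the two models above; scikit-learn models stand in for whatever specifications are used in practice, and the names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def dr_mean_y1(X, a, y):
    """Doubly robust (AIPW) estimate of E(Y1)."""
    pi = LogisticRegression(max_iter=1000).fit(X, a).predict_proba(X)[:, 1]   # propensity score model
    m1 = LinearRegression().fit(X[a == 1], y[a == 1]).predict(X)              # outcome model E(Y|A=1, X)
    # Outcome-model prediction plus an inverse-weighted residual correction.
    return np.mean(m1 + a * (y - m1) / pi)
```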
Instrument Variable (IV)
1. Variables affecting A and Y
2. What is IV?
- Definition:
An IV is a variable that affects the treatment but does not directly affect the outcome.
IV analysis is an alternative causal inference method that does not rely on the ignorability assumption.
- Example:
A: smoking during pregnancy; Y: birth weight; X: mother’s age, weight, etc.
Z: randomize to either receive encouragement to stop smoking (Z=1) or receive usual care (0)
- Types:
1) randomly assigned as part of the study 2) believed to be randomized in nature (a natural experiment)
3. IV method
3.1 Assumption
1) Z is associated with the treatment (e.g. the encouragement changes uptake) [can be checked with data]
2) Exclusion restriction: Z affects the outcome only through A, i.e. no Z → Y path and no Z → U/X → Y path
3.2 Measurement
With an IV, we can estimate the complier average causal effect, under a monotonicity assumption
*Compliance classes: compliers (A = Z), always-takers (A = 1 regardless of Z), never-takers (A = 0
regardless of Z), defiers (A = 1 − Z)
*Monotonicity assumption: there are no defiers → encouragement can only increase the probability of treatment
3.3 Method
1) Complier Average Causal Effect (CACE)
E(Y|Z=1) − E(Y|Z=0)
= [E(Y|Z=1, always-takers)·P(always-takers) + E(Y|Z=1, never-takers)·P(never-takers)
 + E(Y|Z=1, compliers)·P(compliers)] − (the same expression for Z=0)
= E(Y|Z=1, compliers)·P(compliers) − E(Y|Z=0, compliers)·P(compliers)
  ← for always-/never-takers Z does not change A, so E(Y|Z, class) = E(Y|class) and those terms cancel
= [E(Yz=1|compliers) − E(Yz=0|compliers)]·P(compliers) ← Z is randomized, so conditioning on Z equals setting Z
= [E(Ya=1|compliers) − E(Ya=0|compliers)]·P(compliers) ← by definition, compliers have A = Z
where, with no defiers (monotonicity), P(compliers) = P(A=1|Z=1) − P(A=1|Z=0) = E(A|Z=1) − E(A|Z=0).
So CACE = E(Ya=1|compliers) − E(Ya=0|compliers) = [E(Y|Z=1) − E(Y|Z=0)] / [E(A|Z=1) − E(A|Z=0)] = ITT / P(compliers)
● If there is perfect compliance, CACE = ITT (the intention-to-treat effect, E(Y|Z=1) − E(Y|Z=0))
● Otherwise the ITT underestimates the CACE, since always-takers / never-takers dilute it (ITT = CACE × P(compliers))
2) Two-Stage Least Squares (2SLS)
Stage 1: regress A on Z: Ai = α0 + α1 Zi + ei → obtain fitted values Âi
Stage 2: regress Y on Â: Yi = β0 + β1 Âi + εi
Then β1 is the estimate of the causal effect.
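A minimal numeric sketch of the two stages above using plain least squares (packages such as linearmodels provide full 2SLS with valid standard errors; the naive second-stage errors here are not valid). Argument names are illustrative.

```python
import numpy as np

def two_stage_least_squares(z, a, y):
    """2SLS estimate of the causal effect of A on Y using instrument Z."""
    Z1 = np.column_stack([np.ones(len(z)), z])
    alpha, *_ = np.linalg.lstsq(Z1, a, rcond=None)     # Stage 1: A = a0 + a1*Z + e
    a_hat = Z1 @ alpha                                  # fitted treatment values
    A1 = np.column_stack([np.ones(len(a_hat)), a_hat])
    beta, *_ = np.linalg.lstsq(A1, y, rcond=None)       # Stage 2: Y = b0 + b1*A_hat + e
    return beta[1]                                      # b1: estimated causal effect (CACE)
```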
Notes:
- In an ordinary least squares (OLS) fit of Yi = b0 + b1*Ai + ei, we assume the error term e and the
covariate A are independent; confounding makes them correlated. Z is randomized, so it is
independent of the error, and Â (the projection of A onto the space spanned by Z) inherits that
independence, which makes the second-stage regression valid
- In the binary scenario, Yi = b0 + b1*Âi + ei = b0 + b1*(â0 + â1*Zi) + ei
So â1*b1 = E(Yi|Zi=1) − E(Yi|Zi=0), and â1 = E(Ai|Zi=1) − E(Ai|Zi=0)
So b1 = [E(Y|Z=1) − E(Y|Z=0)] / [E(A|Z=1) − E(A|Z=0)] is a consistent estimator of the CACE!
I.e. Z increases by 1 → Â increases by â1 → Y increases by â1*b1
- Consider covariates: regress A on Z and X, then regress Y on A_est and X
- Sensitivity Analysis:
- Exclusion restriction: if Z did directly affect Y by some amount p, would the conclusions change?
- Monotonicity: if there were a proportion π of defiers, would the conclusions change?
- The strength of an IV is the proportion of compliers, i.e. E(A|Z=1) − E(A|Z=0)
For a weak instrument, the complier population (to which inference applies) is small and the
estimator has large variance
→ can use methods for strengthening the IV, e.g. near/far matching
Fixed Effect Model
1. Panel Data Model
Yit = λi + γt + Σk βk Xk,it + uit
λi - individual intercept, γt - time intercept, βk - explanatory variable slopes, uit - error term
● Fixed Effect Model: the effects λ/γ are fixed parameters and may be correlated with x
○ Individual fixed effects: only λ; time fixed effects: only γ; or both (two-way)
○ To estimate: 1) a dummy for each individual/time, or 2) the mean-deviation (within) transformation
● Random Effect Model: the effects λ/γ are independent of x and follow some distribution
● FEM or REM? Use the Hausman test; the REM independence assumption is often hard to justify,
though REM supports inference about the broader population
2. Causal Practice
● Assumptions:
○ Unobserved confounders are time-invariant during the analysis window
○ Units can switch between treatment and control → each unit serves as its own control
○ Markov assumption: past treatments do not affect the current outcome → select the data window accordingly
● Method:
○ FEM with individual fixed effects (a demeaning sketch follows below)
○ Can add observed confounders X alongside the treatment in the regression
○ How to choose the window T? Too small → insignificant; too large → confounders become
time-varying → rule of thumb: 4 weeks
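A minimal sketch of the individual fixed-effect ("within") estimator via mean deviation, assuming a panel DataFrame with columns unit, a (treatment), and y (outcome); these column names are illustrative.

```python
import numpy as np
import pandas as pd

def within_estimator(df: pd.DataFrame) -> float:
    """Individual fixed effects by demeaning: regress demeaned y on demeaned a."""
    demeaned = df[["a", "y"]] - df.groupby("unit")[["a", "y"]].transform("mean")
    a_tilde = demeaned["a"].to_numpy()
    y_tilde = demeaned["y"].to_numpy()
    # OLS slope on demeaned data; the unit intercepts (and their confounding) are removed.
    return float(np.sum(a_tilde * y_tilde) / np.sum(a_tilde ** 2))
```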
Time Series Model
1. Method
● Step 1: Using pre-period time series data, build a model that predicts the treated series Y(t)
from control-group series Y(c) and covariates X
Step 2: Effect = actual − predicted over the post-treatment period (a simplified sketch follows at the end of this subsection)
● Bayesian structural time-series (BSTS) models are commonly used for the prediction step
2. Notes
● Assumptions
○ Control groups are not affected by the treatment
○ The relationship between control and treated series is unchanged from the pre- to the
post-period ← can be sanity-checked by predicting control series from other controls
● The effect can be accumulated by date (cumulative effect)
● Useful when there are limited covariates or few treated units
● How to define the control group? Start from the whole population or a cohort; use covariates/PS
to synthesize a control if needed
● Validation
○ Run an A/A check to make sure there is no effect (fit and predict on pre-period data)
○ Check the model fit (e.g. MAPE): if the fit is poor, segment the treated units or add a seasonal component
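A rough sketch of the regression-based counterfactual prediction in Steps 1-2, standing in for a full Bayesian structural time-series model: fit the treated series on control series over the pre-period, predict the post-period, and take actual minus predicted. scikit-learn is assumed and the names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def ts_counterfactual_effect(y_treated, X_controls, pre_mask):
    """Pointwise and cumulative effect: actual minus predicted counterfactual in the post-period."""
    model = LinearRegression().fit(X_controls[pre_mask], y_treated[pre_mask])  # fit on pre-period only
    post = ~pre_mask
    predicted = model.predict(X_controls[post])        # counterfactual for the post-period
    effect = y_treated[post] - predicted               # effect by date
    return effect, np.cumsum(effect)                   # also the cumulative effect
```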