Introduction to Handling Data
ECON20222 - Lecture 1
Ralf Becker and Martyn Andrews
What is this course unit about?
Help you implement and interpret the main estimation and
inference techniques used in Economics
Focus on:
- causal inference
- the main pitfalls of time-series analysis
This Week’s Empirical Question
Card, David; Krueger, Alan B. (1994). Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania. The American Economic Review, 84, 772-793.
Do higher minimum wages decrease employment (as predicted by common sense and a competitive labour market model)?
The Research Question
“This paper presents new evidence on the effect of minimum wages on
establishment-level employment outcomes. We analyze the experiences
of 410 fast-food restaurants in New Jersey and Pennsylvania following
the increase in New Jersey’s minimum wage from $4.25 to $5.05 per
hour. Comparisons of employment, wages, and prices at stores in New
Jersey and Pennsylvania before and after the rise offer a simple method
for evaluating the effects of the minimum wage.”
Card, David; Krueger, Alan B. (1994, p. 772)
Why Data Matter
The debate is still alive:
Overall negative effect on employment, IZA.
"Research findings are not unanimous, but especially for the US,
evidence suggests that minimum wages reduce the jobs available to
low-skill workers."
An overview of the empirical evidence is provided in this report by
Arindrajit Dube for the UK Government.
"Especially for the set of studies that consider broad groups of
workers, the overall evidence base suggests an employment impact
of close to zero."
At the end of this unit ...
You will be able to:
Understand and discuss the challenges of making causal inferences
Perform inference appropriate for the model being estimated
Interpret empirical results (with due caution!)
Discuss strengths and weaknesses of particular empirical
applications
Do intermediate data work in R
Confidently apply regression analysis in R
Apply more advanced causal inference techniques in R
Find coding help for any new challenges in R
What you need to do
To learn in this unit you need to:
coding, cleaning data, struggling, self-learning, answering real questions, amazement at what you can do, and accepting that there is not always a clear answer
Assessment Structure and feedback
Online test (on the use of R) - 10%
End-of-Term exam (short answer questions) - 50%
Group coursework - 40% (see extra info)
Aim for today
Statistics/Econometrics:
- Summary Statistics
- Difference between population and sample
- Hypothesis testing
- Graphical Data Representations
- Diff-in-Diff Analysis
- Simple regression analysis

R Coding:
- Introduce you to R and RStudio
- How do I learn R?
- Import data into R
- Perform some basic data manipulation
- Perform hypothesis tests
- Estimate a regression
This Week’s Plan
Replicate some of the basic results presented in Card and Krueger
(1994)
Introduce the Difference-in-Difference methodology (Project!!)
[Sometimes known as “Diff-in-Diff” or DiD.]
Use this example to
- introduce you to R
- review some summary statistics
- review simple regression and its implementation
- introduce some basic visualisations
Introduce R/R-Studio
R is a statistical software package; it is open source
and free
a lot of useful functionality is added by independent
researchers via packages (also for free)
RStudio is a user interface which makes working with
R easier. You need to install R before you install
RStudio.
ECLR is a web-resource we have set up to support
you in your R work.
Welcome to RStudio
[Screenshot: the RStudio interface]
Write Code Files or the Basic Workflow
keep an original data file (usually .xlsx or .csv) and do not overwrite this file
any manipulation we make to the data (data cleaning, statistical analysis etc.) is command-based, and we collect all these commands in a script file. R then interprets and executes these commands. A script file is hence like a recipe which you present to a chef. Script files have the extension .r
you can also learn to write Rmarkdown files (.rmd). They combine code with normal text and output.
when you write code you should add comments. Comments are bits of text which R ignores (everything after a #), but which help you or someone else decipher what the code does.
By following the above advice you make it easy for yourself and others
to replicate your work.
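A minimal example of a script with comments (a sketch, not from the original slides):

# example.r - everything after a # is ignored by R
x <- c(1, 2, 3, 4)   # create a numeric vector
mean(x)              # compute and print its mean: 2.5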
Prepare your code
We start by loading the extra packages we need in our code.
The first time you need these packages on a computer you may have to install them. Use the following code to do this:
install.packages(c("readxl","tidyverse","ggplot2","stargazer"))
This only needs to be done once on a particular computer. However,
every time you want to use any of these packages in a script you need to
make them available to your code (load them):
library(tidyverse) # for almost all data handling tasks
library(readxl) # to import Excel data
library(ggplot2) # to produce nice graphics
library(stargazer) # to produce nice results tables
The data
Then we load the data from Excel:
CKdata <- read_xlsx("CK_public.xlsx", na = ".")
na = "." indicates how missing data are coded.
Check some characteristics of the data which are now stored in CKdata:
Discuss the structure of the data frame: the number of observations and the number of variables, their names and their variable types.
str(CKdata) # prints some basic info on variables
## tibble[,46] [410 x 46] (S3: tbl_df/tbl/data.frame)
## $ SHEET : num [1:410] 46 49 506 56 61 62 445 451 455 458 ...
## $ CHAIN : num [1:410] 1 2 2 4 4 4 1 1 2 2 ...
## $ CO_OWNED: num [1:410] 0 0 1 1 1 1 0 0 1 1 ...
## $ STATE : num [1:410] 0 0 0 0 0 0 0 0 0 0 ...
## $ SOUTHJ : num [1:410] 0 0 0 0 0 0 0 0 0 0 ...
## $ CENTRALJ: num [1:410] 0 0 0 0 0 0 0 0 0 0 ...
## $ NORTHJ : num [1:410] 0 0 0 0 0 0 0 0 0 0 ...
## $ PA1 : num [1:410] 1 1 1 1 1 1 0 0 0 1 ...
## $ PA2 : num [1:410] 0 0 0 0 0 0 1 1 1 0 ...
## $ SHORE : num [1:410] 0 0 0 0 0 0 0 0 0 0 ...
## $ NCALLS : num [1:410] 0 0 0 0 0 2 0 0 0 2 ...
The data
To see the entire dataset (like in a spreadsheet):
Either click the little spreadsheet symbol next to the data frame (CKdata) in the
Environment tab, or
View(CKdata) # opens the data in a spreadsheet-like viewer
The data - Unit of observation
A unit of observation is a fast food restaurant.
Observation 27 in our dataset, say, is a Roy Rogers (CHAIN = 3) store in
Pennsylvania (STATE = 0) with 7 full-time employees (EMPFT), 19
part-time employees (EMPPT) and 4 managers (NMGRS) in Feb 1992, and
17.5 in Dec 1992:
CKdata[27,]
## # A tibble: 1 x 46
## SHEET CHAIN CO_OWNED STATE SOUTHJ CENTRALJ NORTHJ PA1
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <
## 1 515 3 1 0 0 0 0 0
## # ... with 35 more variables: EMPFT <dbl>, EMPPT <dbl>, NMG
## # WAGE_ST <dbl>, INCTIME <dbl>, FIRSTINC <dbl>, BONUS <db
## # MEALS <dbl>, OPEN <dbl>, HRSOPEN <dbl>, PSODA <dbl>, PF
## # PENTREE <dbl>, NREGS <dbl>, NREGS11 <dbl>, TYPE2 <dbl>,
## # DATE2 <dbl>, NCALLS2 <dbl>, EMPFT2 <dbl>, EMPPT2 <dbl>,
## # WAGE_ST2 <dbl>, INCTIME2 <dbl>, FIRSTIN2 <dbl>, SPECIAL
Addressing particular variables
If you want to call/use the entire spreadsheet/data frame/tibble then
you call CKdata.
But often you want to call one variable only:
CKdata$CHAIN, calls CHAIN only
CKdata["CHAIN"], calls CHAIN only
CKdata[2], calls CHAIN only, as it is the 2nd variable
And sometimes you want to call several, but not all, variables:
CKdata[c("STATE","CHAIN")]
c("STATE","CHAIN") creates a list of names. c really represents a
function, c for concatenation.
Also note: R is case sensitive, CHAIN ≠ Chain
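A short sketch of what each of these calls returns (using the CKdata tibble from above):

CKdata$CHAIN                 # the CHAIN variable as a plain vector
CKdata["CHAIN"]              # the CHAIN variable as a one-column tibble
CKdata[2]                    # the 2nd variable, again CHAIN
CKdata[c("STATE","CHAIN")]   # a two-column tibble with STATE and CHAIN
c("STATE","CHAIN")           # c() on its own just creates a character vector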
Variable types
There are five basic data types.
character: "a", "swc"
numeric: 2, 15.5
integer: 2L (the L tells R to store this as an
integer)
logical: TRUE, FALSE
factor: a set number of categories
It is important that you know and understand the differences between data
types. Each variable has a particular type, and some operations only
work for particular data types. For instance, we need num or int for any
mathematical operations.
In our data frame we have only num variable types.
We will encounter logical variables frequently. They are very powerful.
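You can check the type of any object with the class() function, and logical vectors let you select observations. A quick sketch (the last line uses CKdata from above):

class("a")     # "character"
class(2.5)     # "numeric"
class(2L)      # "integer"
class(TRUE)    # "logical"
CKdata$WAGE_ST[CKdata$STATE == 1]   # logical selection: starting wages in New Jersey only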
factor variables
We store categorical variables as factor variables.
Sometimes you need to type convert to factor variables.
str(CKdata[c("STATE","CHAIN")]) # prints some basic info on variables
## tibble[,2] [410 x 2] (S3: tbl_df/tbl/data.frame)
## $ STATE: num [1:410] 0 0 0 0 0 0 0 0 0 0 ...
## $ CHAIN: num [1:410] 1 2 2 4 4 4 1 1 2 2 ...
STATE, 1 if New Jersey (NJ); 0 if Pennsylvania (Pa)
CHAIN, 1 = Burger King; 2 = KFC; 3 = Roy Rogers; 4 = Wendy’s
factor variables
CKdata$STATEf <- as.factor(CKdata$STATE)
levels(CKdata$STATEf) <- c("Pennsylvania","New Jersey")
CKdata$CHAINf <- as.factor(CKdata$CHAIN)
levels(CKdata$CHAINf) <- c("Burger King","KFC", "Roy Rogers", "Wendy's")
CKdata$STATE calls the variable STATE in the data frame CKdata
<- assigns what is on the right, as.factor(CKdata$STATE), to the
variable on the left, CKdata$STATEf
as.factor(CKdata$STATE) calls the function as.factor and applies
it to CKdata$STATE
str(CKdata[c("STATEf","CHAINf")]) # prints some basic info on variables
## tibble[,2] [410 x 2] (S3: tbl_df/tbl/data.frame)
## $ STATEf: Factor w/ 2 levels "Pennsylvania",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ CHAINf: Factor w/ 4 levels "Burger King",..: 1 2 2 4 4 4 1 1 2 2 ...
factor variables
Factor variables are variables with discrete categories. You can find out
which categories a factor has with the levels() function:
levels(CKdata$CHAINf)
## [1] "Burger King" "KFC" "Roy Rogers" "Wendy's"
Learn more about your data
Use the summary function for some initial summary stats for num or int
variables
WAGE_ST, starting wage ($/hr), Wave 1, before min wage increase,
Feb 1992
EMPFT, # full-time employees before policy implementation
summary(CKdata[c("WAGE_ST","EMPFT")])
## WAGE_ST EMPFT
## Min. :4.250 Min. : 0.000
## 1st Qu.:4.250 1st Qu.: 2.000
## Median :4.500 Median : 6.000
## Mean :4.616 Mean : 8.203
## 3rd Qu.:4.950 3rd Qu.:12.000
## Max. :5.750 Max. :60.000
## NA's :20 NA's :6
Learn more about your data
How many observations are there in each state, and how are the chains distributed across states?
Tab1 <- CKdata %>% group_by(STATEf) %>%
summarise(n = n()) %>%
print()
## # A tibble: 2 x 2
## STATEf n
## <fct> <int>
## 1 Pennsylvania 79
## 2 New Jersey 331
prop.table(table(CKdata$CHAINf, CKdata$STATEf, dnn = c("Chain", "State")), margin = 2)
## State
## Chain Pennsylvania New Jersey
## Burger King 0.4430380 0.4108761
## KFC 0.1518987 0.2054381
## Roy Rogers 0.2151899 0.2477341
## Wendy's 0.1898734 0.1359517
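Here margin = 2 gives column proportions (within each state). For comparison, a sketch of the alternative, margin = 1, which would give for each chain the share of its stores located in each state:

prop.table(table(CKdata$CHAINf, CKdata$STATEf, dnn = c("Chain", "State")), margin = 1)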
Scatter plot of the data
p1 <- ggplot(CKdata,aes(WAGE_ST,EMPFT)) +
geom_point(size=0.5) + # this produces the scatter plot
geom_smooth(method = "lm", se = FALSE) # adds the line
p1
[Figure: scatter plot of EMPFT (vertical axis) against WAGE_ST (horizontal axis), with the fitted line]
Each dot represents the data for one store. Note the line of best fit.
Regression Line
The line in the previous plot is the line of best fit coming from a linear
regression:

$$EMPFT = \alpha + \beta\, WAGE\_ST + u \qquad \text{(Population Model)}$$

The population model is defined by unknown parameters $\alpha$ and $\beta$
and the unknown error terms $u$. We will use sample data to obtain
sample estimates of these parameters.
The error terms $u$ contain the effects of any omitted variables and
reflect that any modelled relationship will only be an
approximation. The $u$ are random variables.

$$EMPFT_{it} = \hat{\alpha} + \hat{\beta}\, WAGE\_ST_{it} + \hat{u}_{it} \qquad \text{(Estimated Sample Model)}$$

Here we have two subscripts as the data have a cross-section ($i$) and a
time-series dimension ($t$).
The regression line in the previous figure is represented by

$$\widehat{EMPFT}_{it} = \hat{\alpha} + \hat{\beta}\, WAGE\_ST_{it} \qquad \text{(Regression Line)}$$
Simple Regression Model and OLS
Regression analysis is the core technique used in Econometrics. It is
based on certain assumptions about the Population Model and the error
terms u (more on this in the next few weeks).
How do we estimate the parameters (i.e. obtain $\hat{\alpha}$ and $\hat{\beta}$) using the available sample of
data? This is typically done by Ordinary Least Squares (OLS).
Simple Regression Model and OLS
mod1 <- lm(EMPFT ~ WAGE_ST, data = CKdata)
summary(mod1)
##
## Call:
## lm(formula = EMPFT ~ WAGE_ST, data = CKdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.091 -5.898 -2.100 3.005 51.304
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.468 5.807 -1.114 0.2660
## WAGE_ST 3.193 1.255 2.544 0.0114 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.5 on 383 degrees of freedom
## (25 observations deleted due to missingness)
## Multiple R-squared: 0.01662, Adjusted R-squared: 0.01405
## F-statistic: 6.472 on 1 and 383 DF, p-value: 0.01135
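Once the estimated model is stored in mod1, results can be extracted with standard accessor functions. A short sketch:

coef(mod1)     # the estimated coefficients, alpha-hat and beta-hat
confint(mod1)  # 95% confidence intervals for the coefficients
resid(mod1)    # the estimated residuals u-hat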
OLS - nice output
stargazer(mod1,type="text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## EMPFT
## -----------------------------------------------
## WAGE_ST 3.193**
## (1.255)
##
## Constant -6.468
## (5.807)
##
## -----------------------------------------------
## Observations 385
## R2 0.017
## Adjusted R2 0.014
## Residual Std. Error 8.500 (df = 383)
## F Statistic 6.472** (df = 1; 383)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
OLS - calculation and interpretation
How were $\hat{\beta}$ and $\hat{\alpha}$ calculated?

$$\hat{\beta} = \frac{\widehat{Cov}(EMPFT_{it},\, WAGE\_ST_{it})}{\widehat{Var}(WAGE\_ST_{it})}$$

$$\hat{\alpha} = \overline{EMPFT}_{it} - \hat{\beta}\, \overline{WAGE\_ST}_{it}$$

How to interpret $\hat{\beta} = 3.193$?
An increase of one unit in WAGE_ST (= USD 1) is associated with an increase of
about 3 full-time employees (EMPFT).
Have we established that higher wages cause higher employment?
NO
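These formulas can be verified directly in R. A sketch (complete.cases keeps only the rows where both variables are observed, mirroring the observations lm() uses):

ok <- complete.cases(CKdata$WAGE_ST, CKdata$EMPFT)  # drop missing values
b_hat <- cov(CKdata$EMPFT[ok], CKdata$WAGE_ST[ok]) / var(CKdata$WAGE_ST[ok])
a_hat <- mean(CKdata$EMPFT[ok]) - b_hat * mean(CKdata$WAGE_ST[ok])
c(a_hat, b_hat)  # should match coef(mod1): about -6.468 and 3.193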
Regression Analysis - Underneath the hood
We need to recognise that in a sample $\hat{\beta}$ and $\hat{\alpha}$ are really random variables.
For short, write $EMPFT = E$ and $WAGE\_ST = W$:

$$\begin{aligned}
\hat{\beta} &= \frac{\widehat{Cov}(E, W)}{\widehat{Var}(W)} = \frac{\widehat{Cov}(\alpha + \beta W + u,\; W)}{\widehat{Var}(W)} \\
&= \frac{\widehat{Cov}(\alpha, W) + \beta\, \widehat{Cov}(W, W) + \widehat{Cov}(u, W)}{\widehat{Var}(W)} \\
&= \beta\, \frac{\widehat{Var}(W)}{\widehat{Var}(W)} + \frac{\widehat{Cov}(u, W)}{\widehat{Var}(W)} = \beta + \frac{\widehat{Cov}(u, W)}{\widehat{Var}(W)}
\end{aligned}$$

So $\hat{\beta}$ is a function of the random term $u$ and hence is itself a random
variable. Once $\widehat{Cov}(E, W)$ and $\widehat{Var}(W)$ are replaced by sample
estimates we get ONE value, which is a draw from a random distribution.
OLS - estimator properties
What can we learn from this?
If $u_{it}$ is a random variable, so is $\hat{\beta}$
Any particular value we get is a draw from a random distribution
An estimator is unbiased if, on average, the estimates would be
equal to the unknown $\beta$
At this stage the concept of unbiasedness may still be a little hazy,
and that is fine
For this to happen we need to assume that $Cov(u, x) = 0$, as then
$E(\hat{\beta}) = \beta$
Why do we need to assume this? Because while we do have values
for $x_{it}$, we do not have values for the unobserved error terms $u_{it}$.
Hence we cannot test this. As you will find out, this is a thinking
exercise, and whether it is true/false/sensible/appropriate is at the
core of what we do.
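A small simulation (a sketch, not from the slides; all names are made up) illustrates what goes wrong when $Cov(u, x) \neq 0$:

set.seed(123)
n <- 10000
z <- rnorm(n)        # an unobserved common driver
x <- z + rnorm(n)    # explanatory variable, correlated with z
u <- z + rnorm(n)    # error term, also correlated with z, hence Cov(u, x) > 0
y <- 1 + 2 * x + u   # the true beta is 2
coef(lm(y ~ x))      # slope around 2.5: biased upwards
y2 <- 1 + 2 * x + rnorm(n)  # same model with an exogenous error term
coef(lm(y2 ~ x))     # slope close to the true value of 2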
OLS - the exogeneity assumption
For $\hat{\beta}$ in $y_{it} = \alpha + \beta x_{it} + u_{it}$ to be unbiased (i.e. on average correct) we
needed

$$Cov(u_{it}, x_{it}) = 0$$

This is sometimes called the exogeneity assumption. The error term has
to be uncorrelated with the explanatory variable $x_{it}$.
There are a lot of reasons why this assumption may be breached:
Simultaneity ($WAGE\_ST \rightarrow EMPFT$ and $EMPFT \rightarrow WAGE\_ST$).
Discuss: if causality goes in both directions, and you could just as well
estimate the model the other way round, we cannot attach a one-directional
causal interpretation to the estimated coefficient.
Omitted relevant variables or unobserved heterogeneity
Measurement error in $x_{it}$
So how to make causal statements
Once we have found reasons to believe in the exogeneity assumption, the
next few lectures will introduce various standard techniques that use
this assumption:
First Difference
Diff-in-Diff, to be used in Project
Instrumental Variables
Regression Discontinuity
All of them can be thought of as specific ways to apply a regression
model.
Diff-in-Diff - The Problem
Do higher minimum wages decrease employment (as predicted by a
simplistic labour market model)?
The Research Question
“This paper presents new evidence on the effect of minimum wages on
establishment-level employment outcomes. We analyze the experiences
of 410 fast-food restaurants in New Jersey and Pennsylvania following
the increase in New Jersey’s minimum wage from $4.25 to $5.05 per
hour. Comparisons of employment, wages, and prices at stores in New
Jersey and Pennsylvania before and after the rise offer a simple method
for evaluating the effects of the minimum wage.”
Card, David; Krueger, Alan B. (1994, p. 772)
Wage distribution - Pre
Look at the distribution of starting wages before the change in minimum
wage in New Jersey (WAGE_ST).
At this stage it is not so important to understand the commands for
these plots.
The easiest way to plot a histogram is
hist(CKdata$WAGE_ST[CKdata$STATEf == "Pennsylvania"])
where, in square brackets, we select only the data from Pennsylvania.
hist(CKdata$WAGE_ST[CKdata$STATEf == "Pennsylvania"])
hist(CKdata$WAGE_ST[CKdata$STATEf == "New Jersey"])
Wage distribution - Pre
Or here an alternative visualisation.
ggplot(CKdata,aes(WAGE_ST, colour = STATEf), colour = STATEf) +
geom_histogram(position="identity",
aes(y = ..density..),
bins = 10,
alpha = 0.2) +
ggtitle(paste("Starting wage distribution, Feb/Mar 1992"))
[Figure: overlaid density histograms of WAGE_ST for Pennsylvania and New Jersey, titled "Starting wage distribution, Feb/Mar 1992"]
Wage distribution - Pre
Both plots show that the starting wage distribution is fairly similar in
both states, with peaks at the minimum wage of $4.25 and at $5.00.
Policy Evaluation
First we can evaluate whether the legislation has been implemented.
Tab1 <- CKdata %>% group_by(STATEf) %>%
  summarise(wage_FEB = mean(WAGE_ST, na.rm = TRUE),
            wage_DEC = mean(WAGE_ST2, na.rm = TRUE)) %>%
print()
## # A tibble: 2 x 3
## STATEf wage_FEB wage_DEC
## <fct> <dbl> <dbl>
## 1 Pennsylvania 4.63 4.62
## 2 New Jersey 4.61 5.08
Average wage in New Jersey has increased.
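As a small sketch (not in the original slides) of how a hypothesis test looks in R, we could test whether mean starting wages differed between the two states before the policy change:

t.test(WAGE_ST ~ STATEf, data = CKdata)  # Welch two-sample t-test, Pennsylvania vs New Jersey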
Policy Evaluation - Wage distribution
ggplot(CKdata,aes(WAGE_ST2, colour = STATEf), colour = STATEf) +
geom_histogram(position="identity",
aes(y = ..density..),
bins = 10,
alpha = 0.2) +
ggtitle(paste("Starting wage distribution, Nov/Dec 1992"))
[Figure: overlaid density histograms of WAGE_ST2 for Pennsylvania and New Jersey, titled "Starting wage distribution, Nov/Dec 1992"]
Policy Evaluation - Employment outcomes
Let’s measure employment before and after the policy change.
We calculate two new variables, FTE and FTE2 (full-time-equivalent
employment before and after the policy change):
CKdata$FTE <- CKdata$EMPFT + CKdata$NMGRS + 0.5*CKdata$EMPPT
CKdata <- CKdata %>% mutate(FTE2 = EMPFT2 + NMGRS2 + 0.5*EMPPT2)
TabDiD <- CKdata %>% group_by(STATEf) %>%
  summarise(meanFTE_FEB = mean(FTE, na.rm = TRUE),
            meanFTE_DEC = mean(FTE2, na.rm = TRUE)) %>%
print()
## # A tibble: 2 x 3
## STATEf meanFTE_FEB meanFTE_DEC
## <fct> <dbl> <dbl>
## 1 Pennsylvania 23.3 21.2
## 2 New Jersey 20.4 21.0
Policy Evaluation - Diff-in-Diff estimator
ggplot(CKdata, aes(1992,FTE, colour = STATEf)) +
geom_point(alpha = 0.2) +
geom_point(aes(1993,FTE2),alpha = 0.2) +
labs(x = "Time") +
ggtitle(paste("Employment, FTE"))
[Figure: FTE for each store, plotted at 1992 (FTE) and 1993 (FTE2) and coloured by state, titled "Employment, FTE"]
Policy Evaluation - Diff-in-Diff estimator
ggplot(CKdata, aes(1992,FTE, colour = STATEf)) +
geom_jitter(alpha = 0.2) +
geom_jitter(aes(1993,FTE2),alpha = 0.2) +
labs(x = "Time") +
ggtitle(paste("Employment, FTE"))
[Figure: the same scatter with jittered points, titled "Employment, FTE"]
Policy Evaluation - Diff-in-Diff estimator
ggplot(TabDiD, aes(1992,meanFTE_FEB, colour = STATEf)) +
geom_point(size = 3) +
geom_point(aes(1993,meanFTE_DEC),size=3) +
ylim(17, 24) +
labs(x = "Time") +
ggtitle(paste("Employment, mean FTE"))
[Figure: mean FTE by state at 1992 and 1993, titled "Employment, mean FTE"]
Policy Evaluation - Diff-in-Diff estimator
print(TabDiD)
## # A tibble: 2 x 3
## STATEf meanFTE_FEB meanFTE_DEC
## <fct> <dbl> <dbl>
## 1 Pennsylvania 23.3 21.2
## 2 New Jersey 20.4 21.0
Numerically the DiD estimator is calculated as follows:
(21.0 - 20.4) - (21.2 - 23.3) = 2.7
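The same number can be computed from the TabDiD tibble created above (a sketch; row 1 is Pennsylvania, row 2 New Jersey):

did <- (TabDiD$meanFTE_DEC[2] - TabDiD$meanFTE_FEB[2]) -
       (TabDiD$meanFTE_DEC[1] - TabDiD$meanFTE_FEB[1])
did  # approximately 2.7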
Later: this can be calculated using a regression approach (which has some
additional advantages)
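As a preview, a sketch of that regression approach (the long-format data frame CKlong and the dummy names nj and post are made up here, not defined in these slides):

CKlong <- CKdata %>%
  select(STATEf, FTE, FTE2) %>%
  pivot_longer(c(FTE, FTE2), names_to = "wave", values_to = "fte") %>%
  mutate(post = (wave == "FTE2"),        # TRUE for the Dec 1992 wave
         nj = (STATEf == "New Jersey"))  # TRUE for New Jersey stores
lm(fte ~ nj*post, data = CKlong)  # the coefficient on nj:post is the DiD estimate

Note that lm() drops stores with missing employment, so the estimate can differ slightly from the difference of the na.rm means above.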
Outlook
Over the next weeks you will learn
to perform more advanced statistical analysis in R, such as:
- Hypothesis testing
- Multivariate regression analysis
- Specification testing
to devise methods to draw causal inference
to understand the main pitfalls of time-series modelling and
forecasting