
Introduction to Handling Data

ECON20222 - Lecture 1

Ralf Becker and Martyn Andrews



What is this course unit about?

Help you implement and interpret the main estimation and inference techniques used in Economics.

Focus on:
- causal inference
- the main pitfalls of time-series analysis



This Week’s Empirical Question

Card, David; Krueger, Alan B. (1994) Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania, The American Economic Review, 84, 772-793.

Do higher minimum wages decrease employment (as predicted by common sense and a competitive labour market model)?



The Research Question

“This paper presents new evidence on the effect of minimum wages on establishment-level employment outcomes. We analyze the experiences of 410 fast-food restaurants in New Jersey and Pennsylvania following the increase in New Jersey’s minimum wage from $4.25 to $5.05 per hour. Comparisons of employment, wages, and prices at stores in New Jersey and Pennsylvania before and after the rise offer a simple method for evaluating the effects of the minimum wage.”

Card, David; Krueger, Alan B. (1994, p. 772)



Why Data Matter

The debate is still alive:

- Overall negative effect on employment (IZA): "Research findings are not unanimous, but especially for the US, evidence suggests that minimum wages reduce the jobs available to low-skill workers."
- An overview of the empirical evidence is provided in this report by Arindrajit Dube for the UK Government: "Especially for the set of studies that consider broad groups of workers, the overall evidence base suggests an employment impact of close to zero."



At the end of this unit . . .

You will be able to:

- Understand and discuss the challenges of making causal inferences
- Perform inference appropriate for the model being estimated
- Interpret empirical results (with due caution!)
- Discuss strengths and weaknesses of particular empirical applications
- Do intermediate data work in R
- Confidently apply regression analysis in R
- Apply more advanced causal inference techniques in R
- Find coding help for any new challenges in R



What you need to do
To learn in this unit you need: coding, cleaning data, struggling, self-learning, answering real questions, amazement at what you can do, and acceptance that there is not always a clear answer.



Assessment Structure and feedback

- Online test (on the use of R) - 10%
- End-of-term exam (short answer questions) - 50%
- Group coursework - 40% (see extra info)



Aim for today

Statistics/Econometrics:
- Summary statistics
- Difference between population and sample
- Hypothesis testing
- Graphical data representations
- Diff-in-Diff analysis
- Simple regression analysis

R Coding:
- Introduce you to R and RStudio
- How do I learn R
- Import data into R
- Perform some basic data manipulation
- Perform hypothesis tests
- Estimate a regression



This Week’s Plan

- Replicate some of the basic results presented in Card and Krueger (1994)
- Introduce the Difference-in-Differences methodology (Project!!) [Sometimes known as “Diff-in-Diff” or DiD.]
- Use this example to:
  - introduce you to R
  - review some summary statistics
  - review simple regression and its implementation
  - introduce some basic visualisations



Introduce R/R-Studio

- R is a statistical software package; it is open source and free. A lot of useful functionality is added by independent researchers via packages (also free).
- RStudio is a user interface which makes working with R easier. You need to install R before you install RStudio.
- ECLR is a web resource we have set up to support you in your R work.



Welcome to RStudio



Write Code Files or the Basic Workflow
- Keep an original data file (usually .xlsx or .csv) and do not overwrite this file.
- Any manipulation we make to the data (data cleaning, statistical analysis etc.) is command based, and we collect all these commands in a script file. R will then interpret and execute these commands. It is hence like a recipe which you present to a chef. These script files have the extension .r.
- You can also learn to write Rmarkdown files (.rmd). They combine code with normal text and output.
- When you write code you should ensure that you add comments. Comments are bits of text which are ignored by R (everything after a #) but help you or someone else decipher what the code does.

By following the above advice you make it easy for yourself and others to replicate your work.
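For example, a minimal commented snippet might look like this (the file name is purely illustrative):

# load the package needed to import Excel files
library(readxl)
# import the raw data - we never overwrite the original file
mydata <- read_xlsx("my_original_data.xlsx") # everything after # is ignored by R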
Prepare your code

We start by loading the extra packages we need in our code.

The first time you need these packages on a computer you may need to install them. Use the following code to do this:

install.packages(c("readxl","tidyverse","ggplot2","stargazer"))

This only needs to be done once on a particular computer. However, every time you want to use any of these packages in your code you need to make them available (load them):

library(tidyverse) # for almost all data handling tasks
library(readxl) # to import Excel data
library(ggplot2) # to produce nice graphics
library(stargazer) # to produce nice results tables



The data
Then we load the data from Excel:

CKdata <- read_xlsx("CK_public.xlsx", na = ".")

na = "." indicates how missing data are coded.

Check some characteristics of the data which are now stored in CKdata: the number of observations, the number of variables, their names and their variable types.

str(CKdata) # prints some basic info on variables

## tibble[,46] [410 x 46] (S3: tbl_df/tbl/data.frame)


## $ SHEET : num [1:410] 46 49 506 56 61 62 445 451 455 458 ...
## $ CHAIN : num [1:410] 1 2 2 4 4 4 1 1 2 2 ...
## $ CO_OWNED: num [1:410] 0 0 1 1 1 1 0 0 1 1 ...
## $ STATE : num [1:410] 0 0 0 0 0 0 0 0 0 0 ...
## $ SOUTHJ : num [1:410] 0 0 0 0 0 0 0 0 0 0 ...
## $ CENTRALJ: num [1:410] 0 0 0 0 0 0 0 0 0 0 ...
## $ NORTHJ : num [1:410] 0 0 0 0 0 0 0 0 0 0 ...
## $ PA1 : num [1:410] 1 1 1 1 1 1 0 0 0 1 ...
## $ PA2 : num [1:410] 0 0 0 0 0 0 1 1 1 0 ...
## $ SHORE : num [1:410] 0 0 0 0 0 0 0 0 0 0 ...
## $ NCALLS : num [1:410] 0 0 0 0 0 2 0 0 0 2 ...
The data

To see the entire dataset (like in a spreadsheet):

Either click the little spreadsheet symbol next to the dataframe name in the Environment tab, or

view(CKdata) # opens the dataset in a spreadsheet-style viewer



The data - Unit of observation
A unit of observation is a fast-food restaurant. Say observation 27 in our dataset is a Roy Rogers (CHAIN = 3) store in Pennsylvania (STATE = 0) with 7 full-time employees (EMPFT), 19 part-time employees (EMPPT) and 4 managers (NMGRS) in Feb 1992, and 17.5 in Dec.
CKdata[27,]

## # A tibble: 1 x 46
## SHEET CHAIN CO_OWNED STATE SOUTHJ CENTRALJ NORTHJ PA1
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <
## 1 515 3 1 0 0 0 0 0
## # ... with 35 more variables: EMPFT <dbl>, EMPPT <dbl>, NMG
## # WAGE_ST <dbl>, INCTIME <dbl>, FIRSTINC <dbl>, BONUS <db
## # MEALS <dbl>, OPEN <dbl>, HRSOPEN <dbl>, PSODA <dbl>, PF
## # PENTREE <dbl>, NREGS <dbl>, NREGS11 <dbl>, TYPE2 <dbl>,
## # DATE2 <dbl>, NCALLS2 <dbl>, EMPFT2 <dbl>, EMPPT2 <dbl>,
## # WAGE_ST2 <dbl>, INCTIME2 <dbl>, FIRSTIN2 <dbl>, SPECIAL
Addressing particular variables

If you want to call/use the entire spreadsheet/data frame/tibble then you call CKdata.

But often you want to call one variable only:
- CKdata$CHAIN calls CHAIN only
- CKdata["CHAIN"] calls CHAIN only
- CKdata[2] calls CHAIN only, as it is the 2nd variable

And sometimes you want to call several, but not all, variables:
- CKdata[c("STATE","CHAIN")]

c("STATE","CHAIN") creates a list of names. c really represents a function; c stands for concatenation.

Also note: R is case sensitive, CHAIN ≠ Chain.
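A quick sketch of these options in action (assuming CKdata has been imported as above; head() just shows the first few rows):

head(CKdata$CHAIN) # a plain vector
head(CKdata["CHAIN"]) # a one-column tibble
head(CKdata[2]) # also CHAIN, selected by position
head(CKdata[c("STATE","CHAIN")]) # a two-column tibble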



Variable types
There are five basic data types:
- character: "a", "swc"
- numeric: 2, 15.5
- integer: 2L (the L tells R to store this as an integer)
- logical: TRUE, FALSE
- factor: a set number of categories

It is important that you know and understand the differences between data types. Each variable has a particular type, and some operations only work for particular data types. For instance, we need num or int for any mathematical operations.

In our dataset we have only num variable types. We will encounter logical variables frequently; they are very powerful. A small sketch of each type follows below.
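A minimal sketch of the five types (the variable names are purely illustrative):

x_chr <- "swc" # character
x_num <- 15.5 # numeric
x_int <- 2L # integer
x_log <- TRUE # logical
x_fct <- as.factor(c("KFC", "Wendy's")) # factor with two categories
class(x_fct) # reports the type, here "factor"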
factor variables

We store categorical variables as factor variables. Sometimes you need to type-convert a variable to a factor.

str(CKdata[c("STATE","CHAIN")]) # prints some basic info on variables

## tibble[,2] [410 x 2] (S3: tbl_df/tbl/data.frame)


## $ STATE: num [1:410] 0 0 0 0 0 0 0 0 0 0 ...
## $ CHAIN: num [1:410] 1 2 2 4 4 4 1 1 2 2 ...
STATE, 1 if New Jersey (NJ); 0 if Pennsylvania (Pa)
CHAIN, 1 = Burger King; 2 = KFC; 3 = Roy Rogers; 4 = Wendy’s



factor variables
CKdata$STATEf <- as.factor(CKdata$STATE)
levels(CKdata$STATEf) <- c("Pennsylvania","New Jersey")

CKdata$CHAINf <- as.factor(CKdata$CHAIN)


levels(CKdata$CHAINf) <- c("Burger King","KFC", "Roy Rogers", "Wendy's")

- CKdata$STATE calls variable STATE in dataframe CKdata
- <- assigns what is on the right, as.factor(CKdata$STATE), to the variable on the left, CKdata$STATEf
- as.factor(CKdata$STATE) calls the function as.factor and applies it to CKdata$STATE
str(CKdata[c("STATEf","CHAINf")]) # prints some basic info on variables

## tibble[,2] [410 x 2] (S3: tbl_df/tbl/data.frame)


## $ STATEf: Factor w/ 2 levels "Pennsylvania",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ CHAINf: Factor w/ 4 levels "Burger King",..: 1 2 2 4 4 4 1 1 2 2 ...



factor variables

Factor variables are variables with discrete categories. You can find out which categories a factor has with the levels() function:
levels(CKdata$CHAINf)

## [1] "Burger King" "KFC" "Roy Rogers" "Wendy's"



Learn more about your data

Use the summary function for some initial summary stats for num or int
variables
WAGE_ST, starting wage ($/hr), Wave 1, before min wage increase,
Feb 1992
EMPFT, # full-time employees before policy implementation
summary(CKdata[c("WAGE_ST","EMPFT")])

## WAGE_ST EMPFT
## Min. :4.250 Min. : 0.000
## 1st Qu.:4.250 1st Qu.: 2.000
## Median :4.500 Median : 6.000
## Mean :4.616 Mean : 8.203
## 3rd Qu.:4.950 3rd Qu.:12.000
## Max. :5.750 Max. :60.000
## NA's :20 NA's :6



Learn more about your data

How many observations are there in each state, and to which chains do they belong?

Tab1 <- CKdata %>% group_by(STATEf) %>%
summarise(n = n()) %>%
print()

## # A tibble: 2 x 2
## STATEf n
## <fct> <int>
## 1 Pennsylvania 79
## 2 New Jersey 331
prop.table(table(CKdata$CHAINf, CKdata$STATEf, dnn = c("Chain", "State")), margin = 2)

## State
## Chain Pennsylvania New Jersey
## Burger King 0.4430380 0.4108761
## KFC 0.1518987 0.2054381
## Roy Rogers 0.2151899 0.2477341
## Wendy's 0.1898734 0.1359517



Scatter plot of the data
p1 <- ggplot(CKdata,aes(WAGE_ST,EMPFT)) +
geom_point(size=0.5) + # this produces the scatter plot
geom_smooth(method = "lm", se = FALSE) # adds the line
p1

[Scatter plot: EMPFT against WAGE_ST, with the fitted regression line]

Each dot represents the data for one store; the line is the line of best fit.


Regression Line
The line in the previous plot is the line of best fit coming from a linear regression.

$$EMPFT = \alpha + \beta \, WAGE\_ST + u \qquad \text{(Population Model)}$$

- The population model is defined by the unknown parameters $\alpha$ and $\beta$ and the unknown error terms $u$. We will use sample data to obtain sample estimates of these parameters.
- The error terms $u$ contain the effects of any omitted variables and reflect that any modelled relationship will only be an approximation. The $u$ are random variables.

$$EMPFT_{it} = \hat{\alpha} + \hat{\beta} \, WAGE\_ST_{it} + \hat{u}_{it} \qquad \text{(Estimated Sample Model)}$$

Here we have two subscripts as the data have a cross-section ($i$) and a time-series dimension ($t$).

The regression line in the previous figure is represented by

$$\widehat{EMPFT}_{it} = \hat{\alpha} + \hat{\beta} \, WAGE\_ST_{it} \qquad \text{(Regression Line)}$$
Simple Regression Model and OLS

Regression analysis is the core technique used in Econometrics. It is based on certain assumptions about the Population Model and the error terms $u$ (more on this in the next few weeks).

How do we estimate the parameters (get $\hat{\alpha}$ and $\hat{\beta}$) using the available sample of data? This is typically done by Ordinary Least Squares (OLS).



Simple Regression Model and OLS
mod1 <- lm(EMPFT~WAGE_ST, data= CKdata)
summary(mod1)

##
## Call:
## lm(formula = EMPFT ~ WAGE_ST, data = CKdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.091 -5.898 -2.100 3.005 51.304
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.468 5.807 -1.114 0.2660
## WAGE_ST 3.193 1.255 2.544 0.0114 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.5 on 383 degrees of freedom
## (25 observations deleted due to missingness)
## Multiple R-squared: 0.01662, Adjusted R-squared: 0.01405
## F-statistic: 6.472 on 1 and 383 DF, p-value: 0.01135
OLS - nice output
stargazer(mod1,type="text")

##
## ===============================================
## Dependent variable:
## ---------------------------
## EMPFT
## -----------------------------------------------
## WAGE_ST 3.193**
## (1.255)
##
## Constant -6.468
## (5.807)
##
## -----------------------------------------------
## Observations 385
## R2 0.017
## Adjusted R2 0.014
## Residual Std. Error 8.500 (df = 383)
## F Statistic 6.472** (df = 1; 383)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
OLS - calculation and interpretation

How were $\hat{\beta}$ and $\hat{\alpha}$ calculated?

$$\hat{\beta} = \frac{\widehat{Cov}(EMPFT_{it}, WAGE\_ST_{it})}{\widehat{Var}(WAGE\_ST_{it})}$$

$$\hat{\alpha} = \overline{EMPFT}_{it} - \hat{\beta} \, \overline{WAGE\_ST}_{it}$$

How do we interpret $\hat{\beta} = 3.193$? An increase of one unit in WAGE_ST (= USD 1) is related to an increase of about 3 full-time employees (EMPFT).

Have we established that higher wages cause higher employment? NO.
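As a sketch, we can verify these formulas directly in R and compare with the lm() output above (assuming CKdata is loaded; we first drop observations where either variable is missing):

cc <- complete.cases(CKdata$EMPFT, CKdata$WAGE_ST) # keep complete pairs only
E <- CKdata$EMPFT[cc]
W <- CKdata$WAGE_ST[cc]
beta_hat <- cov(E, W) / var(W) # approx. 3.193
alpha_hat <- mean(E) - beta_hat * mean(W) # approx. -6.468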



Regression Analysis - Underneath the hood
Need to recognise that in a sample $\hat{\beta}$ and $\hat{\alpha}$ are really random variables. For short, write EMPFT = $E$ and WAGE_ST = $W$:

$$\begin{aligned}
\hat{\beta} &= \frac{\widehat{Cov}(E, W)}{\widehat{Var}(W)} = \frac{\widehat{Cov}(\alpha + \beta W + u, W)}{\widehat{Var}(W)} \\
&= \frac{\widehat{Cov}(\alpha, W) + \beta \, \widehat{Cov}(W, W) + \widehat{Cov}(u, W)}{\widehat{Var}(W)} \\
&= \beta \, \frac{\widehat{Var}(W)}{\widehat{Var}(W)} + \frac{\widehat{Cov}(u, W)}{\widehat{Var}(W)} = \beta + \frac{\widehat{Cov}(u, W)}{\widehat{Var}(W)}
\end{aligned}$$

So $\hat{\beta}$ is a function of the random term $u$ and hence is itself a random variable. Once $\widehat{Cov}(E, W)$ and $\widehat{Var}(W)$ are replaced by sample estimates we get ONE value which is a draw from a random distribution.
OLS - estimator properties
What can we learn from this?

- If $u_{it}$ is a random variable, so is $\hat{\beta}$.
- Any particular value we get is a draw from a random distribution.
- An estimator is unbiased if, on average, the estimates would be equal to the unknown $\beta$. (At this stage the concept of unbiasedness may still be a little hazy and that is fine; the simulation sketch below illustrates the idea.)
- For this to happen we need to assume that $Cov(u, x) = 0$, as then $E(\hat{\beta}) = \beta$.
- Why do we need to assume this? Because while we do have values for $x_{it}$ we do not have values for the unobserved error terms $u_{it}$. Hence we cannot test this assumption. As you will find out, this is a thinking exercise, and whether it is true/false/sensible/appropriate is at the core of what we do.
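A small simulation sketch of this idea, constructed so that $Cov(u, x) = 0$ holds by design (all numbers are illustrative):

set.seed(123)
beta_hats <- replicate(1000, {
  x <- rnorm(200) # explanatory variable
  u <- rnorm(200) # error term, independent of x by construction
  y <- 1 + 3 * x + u # true alpha = 1, true beta = 3
  coef(lm(y ~ x))[2] # OLS slope estimate from this sample
})
mean(beta_hats) # close to 3: on average OLS recovers beta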
OLS - the exogeneity assumption
For $\hat{\beta}$ in $y_{it} = \alpha + \beta x_{it} + u_{it}$ to be unbiased (i.e. on average correct) we needed

$$Cov(u_{it}, x_{it}) = 0$$

This is sometimes called the exogeneity assumption. The error term has to be uncorrelated with the explanatory variable $x_{it}$.

There are a lot of reasons why this assumption may be breached:
- Simultaneity ($WAGE\_ST \rightarrow EMPFT$ and $EMPFT \rightarrow WAGE\_ST$): causality may go in both directions, so we cannot attach a one-directional causal interpretation to the estimated coefficient; the model could equally well be estimated the other way round.
- Omitted relevant variables or unobserved heterogeneity
- Measurement error in $x_{it}$
So how to make causal statements

Once we have found reasons to believe in the exogeneity assumption, the next few lectures introduce various standard techniques that use this assumption:
- First Difference
- Diff-in-Diff (to be used in the Project)
- Instrumental Variables
- Regression Discontinuity

All of them can be thought of as specific ways to apply a regression model.



Diff-in-Diff - The Problem

Do higher minimum wages decrease employment (as predicted by a simplistic labour market model)?



The Research Question

“This paper presents new evidence on the effect of minimum wages on establishment-level employment outcomes. We analyze the experiences of 410 fast-food restaurants in New Jersey and Pennsylvania following the increase in New Jersey’s minimum wage from $4.25 to $5.05 per hour. Comparisons of employment, wages, and prices at stores in New Jersey and Pennsylvania before and after the rise offer a simple method for evaluating the effects of the minimum wage.”

Card, David; Krueger, Alan B. (1994, p. 772)



Wage distribution - Pre

Look at the distribution of starting wages before the change in the minimum wage in New Jersey (WAGE_ST). At this stage it is not so important to understand the commands for these plots.

The easiest way to plot a histogram is

hist(CKdata$WAGE_ST[CKdata$STATEf == "Pennsylvania"])

where, in square brackets, we select that we only want data from Pennsylvania.

hist(CKdata$WAGE_ST[CKdata$STATEf == "Pennsylvania"])
hist(CKdata$WAGE_ST[CKdata$STATEf == "New Jersey"])



Wage distribution - Pre
Or here is an alternative visualisation:
ggplot(CKdata,aes(WAGE_ST, colour = STATEf), colour = STATEf) +
geom_histogram(position="identity",
aes(y = ..density..),
bins = 10,
alpha = 0.2) +
ggtitle(paste("Starting wage distribution, Feb/Mar 1992"))

[Histogram: starting wage distribution by state, Feb/Mar 1992]
Wage distribution - Pre

Both plots show that the starting wage distribution is fairly similar in both states, with peaks at the minimum wage of $4.25 and at $5.00.



Policy Evaluation

First we can evaluate whether the legislation has been implemented.

Tab1 <- CKdata %>% group_by(STATEf) %>%
  summarise(wage_FEB = mean(WAGE_ST, na.rm = TRUE),
            wage_DEC = mean(WAGE_ST2, na.rm = TRUE)) %>%
  print()

## # A tibble: 2 x 3
## STATEf wage_FEB wage_DEC
## <fct> <dbl> <dbl>
## 1 Pennsylvania 4.63 4.62
## 2 New Jersey 4.61 5.08
Average wage in New Jersey has increased.



Policy Evaluation - Wage distribution
ggplot(CKdata,aes(WAGE_ST2, colour = STATEf), colour = STATEf) +
geom_histogram(position="identity",
aes(y = ..density..),
bins = 10,
alpha = 0.2) +
ggtitle(paste("Starting wage distribution, Nov/Dec 1992"))

[Histogram: starting wage distribution by state, Nov/Dec 1992]


Policy Evaluation - Employment outcomes

Let’s measure employment before and after the policy change. Calculate two new variables, FTE and FTE2 (full-time equivalent employment before and after the policy change). The two lines below show two equivalent ways of doing this (base R assignment and dplyr's mutate()):

CKdata$FTE <- CKdata$EMPFT + CKdata$NMGRS + 0.5*CKdata$EMPPT
CKdata <- CKdata %>% mutate(FTE2 = EMPFT2 + NMGRS2 + 0.5*EMPPT2)

TabDiD <- CKdata %>% group_by(STATEf) %>%
  summarise(meanFTE_FEB = mean(FTE, na.rm = TRUE),
            meanFTE_DEC = mean(FTE2, na.rm = TRUE)) %>%
  print()

## # A tibble: 2 x 3
## STATEf meanFTE_FEB meanFTE_DEC
## <fct> <dbl> <dbl>
## 1 Pennsylvania 23.3 21.2
## 2 New Jersey 20.4 21.0



Policy Evaluation - Diff-in-Diff estimator
ggplot(CKdata, aes(1992,FTE, colour = STATEf)) +
geom_point(alpha = 0.2) +
geom_point(aes(1993,FTE2),alpha = 0.2) +
labs(x = "Time") +
ggtitle(paste("Employment, FTE"))

[Scatter plot: FTE employment by state in Feb 1992 and Dec 1992]


Policy Evaluation - Diff-in-Diff estimator
ggplot(CKdata, aes(1992,FTE, colour = STATEf)) +
geom_jitter(alpha = 0.2) +
geom_jitter(aes(1993,FTE2),alpha = 0.2) +
labs(x = "Time") +
ggtitle(paste("Employment, FTE"))

[Jittered scatter plot: FTE employment by state in Feb 1992 and Dec 1992]


Policy Evaluation - Diff-in-Diff estimator
ggplot(TabDiD, aes(1992,meanFTE_FEB, colour = STATEf)) +
geom_point(size = 3) +
geom_point(aes(1993,meanFTE_DEC),size=3) +
ylim(17, 24) +
labs(x = "Time") +
ggtitle(paste("Employment, mean FTE"))

[Plot: mean FTE by state, Feb 1992 vs Dec 1992]


Policy Evaluation - Diff-in-Diff estimator

print(TabDiD)

## # A tibble: 2 x 3
## STATEf meanFTE_FEB meanFTE_DEC
## <fct> <dbl> <dbl>
## 1 Pennsylvania 23.3 21.2
## 2 New Jersey 20.4 21.0
Numerically the DiD estimator is calculated as follows:

(21.0 - 20.4) - (21.2 - 23.3) = 0.6 - (-2.1) = 2.7

Later: this can also be calculated using a regression approach (which has some additional advantages).
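As a preview, here is a sketch of that regression approach (the long-format reshape and the variable names nj, post and fte are illustrative, not part of the original slides):

CKlong <- CKdata %>%
  select(STATEf, FTE, FTE2) %>%
  pivot_longer(c(FTE, FTE2), names_to = "period", values_to = "fte") %>% # one row per store and period
  mutate(post = as.numeric(period == "FTE2"), # 1 = Nov/Dec 1992, after the rise
         nj = as.numeric(STATEf == "New Jersey")) # 1 = treatment state

did_mod <- lm(fte ~ nj * post, data = CKlong)
summary(did_mod) # the coefficient on nj:post is the DiD estimate, close to the 2.7 above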



Outlook

Over the next weeks you will learn:

- to perform more advanced statistical analysis in R, such as:
  - hypothesis testing
  - multivariate regression analysis
  - specification testing
- to devise methods to draw causal inference
- to understand the main pitfalls of time-series modelling and forecasting

