Administrative Issues Zero-Inflated Logistic Regression Counts: Poisson Model Counts: Negative Binomial Model

Gov 2001 Section 8:


Continuing with Binary and Count Outcomes

Konstantin Kashin [1]

March 27, 2013

[1] Thanks to Jen Pan, Brandon Stewart, Iain Osgood, and Patrick Lam for
contributing to this material.

Outline

Administrative Issues

Zero-Inflated Logistic Regression

Counts: Poisson Model

Counts: Negative Binomial Model


Replication Paper

▸ You will receive a group to re-replicate tonight


▸ Re-replication due Wednesday, April 3 at 7pm
▸ Aim to be helpful, not critical!
▸ Any questions about expectations?

Why Zero-Inflation?

▸ What if we knew that something in our data were mismeasured?


▸ For example, what if we thought that some of our data were
systematically zero rather than randomly zero? This could happen
when:
1. Some data are spoiled or lost
2. Survey respondents put “zero” to an ordered answer on a survey
just to get it done.

If our data are mismeasured in some systematic way, our estimates will
be off.

A Working Example: Fishing

You’re trying to figure out the probability of catching a fish in a park
from a survey. People were asked:
▸ How many children were in the group

▸ How many people were in the group

▸ Whether they caught a fish.


The problem is, some people didn’t even fish! These people have
systematically zero fish.

The Model

We’re going to assume that whether or not the person fished is the
outcome of a Bernoulli trial.

\[
Y_i = \begin{cases}
0 & \text{with probability } \psi_i \\
\text{Logistic} & \text{with probability } 1 - \psi_i
\end{cases}
\]

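We can write out the distribution of Yi as:

\[
P(Y_i = y_i \mid \beta, \psi_i) = \begin{cases}
\psi_i + (1 - \psi_i)\left(1 - \dfrac{1}{1 + e^{-X_i\beta}}\right) & \text{if } y_i = 0 \\
(1 - \psi_i)\left(\dfrac{1}{1 + e^{-X_i\beta}}\right) & \text{if } y_i = 1
\end{cases}
\]

And we can put covariates on ψ:

\[
\psi_i = \frac{1}{1 + e^{-z_i\gamma}}
\]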
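
As a quick check on the two-case distribution, the sketch below plugs in illustrative values and confirms that the two probabilities sum to one. The numbers chosen for `psi` and `xb` are assumptions made for the example, not estimates from the fishing data.

```r
# Illustrative values only (assumptions, not estimates):
# psi is the probability of a structural zero,
# xb is the linear predictor X_i * beta of the logistic part.
psi <- 0.3
xb  <- 0.5

pi.logit <- 1 / (1 + exp(-xb))              # logistic probability of a 1

p.zero <- psi + (1 - psi) * (1 - pi.logit)  # P(Y_i = 0): structural or ordinary zero
p.one  <- (1 - psi) * pi.logit              # P(Y_i = 1): not a structural zero, and a success

c(p.zero = p.zero, p.one = p.one, total = p.zero + p.one)
```

Note that `p.one` is always smaller than the plain logistic probability `pi.logit` whenever `psi` is positive, which is the sense in which the zeros are "inflated".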

Deriving the Likelihood

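The likelihood function is proportional to the probability of Yi :

\[
\begin{aligned}
L(\beta, \psi_i \mid Y_i) &\propto P(Y_i \mid \beta, \psi_i) \\
&= \left[\psi_i + (1 - \psi_i)\left(1 - \frac{1}{1 + e^{-X_i\beta}}\right)\right]^{1 - Y_i}
   \left[(1 - \psi_i)\left(\frac{1}{1 + e^{-X_i\beta}}\right)\right]^{Y_i} \\
&= \left[\frac{1}{1 + e^{-z_i\gamma}} + \left(1 - \frac{1}{1 + e^{-z_i\gamma}}\right)\left(1 - \frac{1}{1 + e^{-X_i\beta}}\right)\right]^{1 - Y_i}
   \left[\left(1 - \frac{1}{1 + e^{-z_i\gamma}}\right)\left(\frac{1}{1 + e^{-X_i\beta}}\right)\right]^{Y_i}
\end{aligned}
\]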

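Multiplying over all observations we get:

\[
L(\beta, \gamma \mid Y) = \prod_{i=1}^{n}
\left[\frac{1}{1 + e^{-z_i\gamma}} + \left(1 - \frac{1}{1 + e^{-z_i\gamma}}\right)\left(1 - \frac{1}{1 + e^{-X_i\beta}}\right)\right]^{1 - Y_i}
\left[\left(1 - \frac{1}{1 + e^{-z_i\gamma}}\right)\left(\frac{1}{1 + e^{-X_i\beta}}\right)\right]^{Y_i}
\]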

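Taking the log we get:

\[
\begin{aligned}
\ln L &= \sum_{i=1}^{n} \left\{ Y_i \ln\left[(1 - \psi_i)\left(\frac{1}{1 + e^{-X_i\beta}}\right)\right]
 + (1 - Y_i) \ln\left[\psi_i + (1 - \psi_i)\left(1 - \frac{1}{1 + e^{-X_i\beta}}\right)\right] \right\} \\
&= \sum_{i=1}^{n} \left\{ Y_i \ln\left[\left(1 - \frac{1}{1 + e^{-z_i\gamma}}\right)\left(\frac{1}{1 + e^{-X_i\beta}}\right)\right]
 + (1 - Y_i) \ln\left[\frac{1}{1 + e^{-z_i\gamma}} + \left(1 - \frac{1}{1 + e^{-z_i\gamma}}\right)\left(1 - \frac{1}{1 + e^{-X_i\beta}}\right)\right] \right\}
\end{aligned}
\]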

Let’s program this in R

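Load and get the data ready:

# The file is comma-separated with named columns, so read.csv() is used here
fish <- read.csv("http://www.ats.ucla.edu/stat/R/dae/fish.csv")
X <- fish[c("child", "persons")]   # covariates for the logistic part
Z <- fish[c("persons")]            # covariates for the zero-inflation part
X <- as.matrix(cbind(1, X))        # add an intercept column
Z <- as.matrix(cbind(1, Z))
y <- ifelse(fish$count > 0, 1, 0)  # 1 if the group caught any fish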

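Write out the log-likelihood function:

ll.zilogit <- function(par, X, Z, y){
  beta  <- par[1:ncol(X)]                  # coefficients of the logistic part
  gamma <- par[(ncol(X)+1):length(par)]    # coefficients of the zero-inflation part
  phi   <- 1/(1+exp(-Z%*%gamma))           # psi_i: probability of a structural zero
  pie   <- 1/(1+exp(-X%*%beta))            # logistic probability of a 1
  sum(y*log((1-phi)*pie) + (1-y)*(log(phi + (1-phi)*(1-pie))))
}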
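
One way to sanity-check a hand-written log-likelihood like this is to look at a limiting case: when the zero-inflation probability goes to 0, it should reduce to the ordinary logit log-likelihood. The sketch below does that on simulated data; the data, the seed, and the name `gamma.no.inflation` are all hypothetical and only serve the check.

```r
# Same log-likelihood as in the slides, repeated so the block runs on its own.
ll.zilogit <- function(par, X, Z, y){
  beta  <- par[1:ncol(X)]
  gamma <- par[(ncol(X)+1):length(par)]
  phi   <- 1/(1+exp(-Z%*%gamma))
  pie   <- 1/(1+exp(-X%*%beta))
  sum(y*log((1-phi)*pie) + (1-y)*(log(phi + (1-phi)*(1-pie))))
}

# Simulated (hypothetical) data, not the fishing survey.
set.seed(2001)
n <- 200
X <- cbind(1, rnorm(n))
Z <- cbind(1, rnorm(n))
beta <- c(0.5, -1)
p.logit <- as.vector(1/(1+exp(-X %*% beta)))
y <- rbinom(n, 1, p.logit)

# A very negative intercept in gamma pushes psi_i toward 0 for every i.
gamma.no.inflation <- c(-40, 0)
ll.zi    <- ll.zilogit(c(beta, gamma.no.inflation), X, Z, y)
ll.logit <- sum(dbinom(y, 1, p.logit, log = TRUE))   # ordinary logit log-likelihood

c(zero.inflated = ll.zi, plain.logit = ll.logit)
```

The two numbers should agree to many decimal places; a mismatch would point to a coding error in one of the two branches.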

Optimize to get the results

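# Starting values: one per element of beta and gamma
par <- rep(1, ncol(X) + ncol(Z))

# fnscale = -1 makes optim() maximize; the Hessian is kept for the variance estimates
out <- optim(par, ll.zilogit, Z=Z, X=X, y=y, method="BFGS",
             control=list(fnscale=-1), hessian=TRUE)

out$par
[1] 1.507470 -2.686476 1.447307 1.876404 -1.247189

The first three values are the elements of beta (intercept, child, persons); the
last two are the elements of gamma (intercept, persons).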

Plotting to See the Relationship

These numbers don’t mean a lot to us, so we can plot the predicted
probabilities of a group having not fished.

First, we have to simulate our gammas:

# Variance-covariance matrix: inverse of the negative Hessian at the maximum
varcv.par <- solve(-out$hessian)

library(mvtnorm)
sim.pars <- rmvnorm(10000, out$par, varcv.par)   # draws of (beta, gamma)
sim.z <- sim.pars[,(ncol(X)+1):length(par)]      # keep only the gamma columns

We then generate predicted probabilities that groups of different sizes
did not fish.

person.vec <- seq(1,4)                 # group sizes 1 through 4
Zcovariates <- cbind(1, person.vec)    # intercept plus group size
exp.holder <- matrix(NA, ncol=4, nrow=10000)
for(i in 1:length(person.vec)){
  # probability of a structural zero under each simulated gamma
  exp.holder[,i] <- 1/(1+exp(-Zcovariates[i,]%*%t(sim.z)))
}

Using these numbers, we can plot the densities of the probabilities to get a
sense of the probability and the uncertainty.

plot(density(exp.holder[,4]), col="blue", xlim=c(0,1),
main="Probability of a Structural Zero", xlab="Probability")
lines(density(exp.holder[,3]), col="red")
lines(density(exp.holder[,2]), col="green")
lines(density(exp.holder[,1]), col="black")
legend(.7,12, legend=c("One Person", "Two People",
"Three People", "Four People"),
col=c("black", "green", "red", "blue"), lty=1)

[Figure: “Probability of a Structural Zero”. Density curves (x-axis: Probability, 0.0 to 1.0; y-axis: Density) for One Person, Two People, Three People, and Four People.]

The Poisson Distribution

It’s a discrete probability distribution which gives the probability that
some number of events will occur in a fixed period of time.
Examples:
1. number of terrorist attacks in a given year
2. number of publications by a professor in a career
3. number of times word “hope” is used in a Barack Obama speech
4. number of songs on a pop music CD

- Here’s the probability mass function (PMF) for a random variable Y
that is distributed Pois(λ):

\[
\Pr(Y = y) = \frac{\lambda^{y}}{y!} e^{-\lambda}
\]

- Suppose Y ∼ Pois(3). What’s Pr(Y = 4)?

\[
\Pr(Y = 4) = \frac{3^{4}}{4!} e^{-3} = 0.168.
\]

[Figure: “Poisson Distribution”. Pr(Y=y) on the y-axis against y from 0 to 10 on the x-axis.]

One more time, the probability mass function (PMF) for a random
variable Y that is distributed Pois(λ):

\[
\Pr(Y = y) = \frac{\lambda^{y}}{y!} e^{-\lambda}
\]

Using the power series for e^λ, it isn’t too hard to show that

\[
E[Y] = \sum_{y=0}^{\infty} y \cdot \frac{\lambda^{y}}{y!} e^{-\lambda} = \lambda.
\]

It also turns out that Var(Y) = λ, a feature of the model we will discuss
later on.
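
The mean and variance claims can be checked numerically. The sketch below evaluates the PMF for λ = 3 with R's built-in `dpois()`, compares Pr(Y = 4) with the hand formula, and approximates E[Y] and Var(Y) by summing over a truncated support (the cutoff of 200 is an arbitrary choice for the example).

```r
lambda <- 3
y <- 0:200                       # the tail beyond 200 is numerically negligible

pmf.by.hand <- lambda^4 / factorial(4) * exp(-lambda)  # Pr(Y = 4) from the formula
pmf.builtin <- dpois(4, lambda)                        # same quantity via dpois()

pmf    <- dpois(y, lambda)
mean.y <- sum(y * pmf)                 # E[Y]
var.y  <- sum((y - mean.y)^2 * pmf)    # Var(Y)

round(c(by.hand = pmf.by.hand, builtin = pmf.builtin,
        mean = mean.y, var = var.y), 3)
```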

Poisson data arise when there is some discrete event which occurs
(possibly multiple times) at a constant rate for some fixed time period.

This constant rate assumption could be restated: the probability of an
event occurring at any moment is independent of whether an event
has occurred at any other moment.

The derivation of the distribution relies on some other technical first
principles, but the above is the most important.
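
A rough way to see the constant-rate story in action is to chop each period into many small slices, let an event occur in each slice independently with the same small probability, and count events per period. The sketch below does this by simulation; the rate, the number of slices, and the seed are arbitrary assumptions for the example.

```r
set.seed(42)
n.periods <- 5000    # number of observation periods
n.slices  <- 1000    # small time slices within each period
rate      <- 3       # expected number of events per period

# Each slice independently contains an event with probability rate / n.slices,
# so the count per period is Binomial(n.slices, rate / n.slices).
counts <- rbinom(n.periods, size = n.slices, prob = rate / n.slices)

# For Poisson-like data the mean and the variance should be close.
c(mean = mean(counts), var = var(counts))

# Compare a few simulated frequencies with the Poisson PMF.
rbind(simulated = as.vector(table(factor(counts, levels = 0:6))) / n.periods,
      poisson   = dpois(0:6, rate))
```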

The Poisson Model for Event Counts

1. The stochastic component:

Yi ∼ Pois(λi )

2. The systematic component:

λi = exp(Xi β)

The likelihood is therefore:

\[
L(\beta \mid X, y) = \prod_{i=1}^{n} \frac{\lambda_i^{y_i}}{y_i!} e^{-\lambda_i}
\]

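And the log-likelihood:

\[
\begin{aligned}
\ln L(\beta \mid X, y) &= \sum_{i=1}^{n} \left[ y_i \ln \lambda_i - \ln(y_i!) - \lambda_i \right] \\
&= \sum_{i=1}^{n} \left[ y_i \ln(\exp(X_i\beta)) - \ln(y_i!) - \exp(X_i\beta) \right] \\
&\doteq \sum_{i=1}^{n} \left[ y_i (X_i\beta) - \exp(X_i\beta) \right]
\end{aligned}
\]

The last line drops ln(yi !), which does not depend on β and so does not
affect where the maximum is.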
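
The Poisson log-likelihood can be programmed and maximized the same way as the zero-inflated logit earlier in the section. The sketch below is one possible version on simulated data (the data, the seed, the true coefficients, and the names `ll.poisson` and `out.pois` are assumptions for the example); it compares the `optim()` result with R's built-in `glm()`.

```r
# Simulated (hypothetical) data with a known coefficient vector.
set.seed(2001)
n <- 500
x <- runif(n)
X <- cbind(1, x)
beta.true <- c(0.5, 1)
y <- rpois(n, exp(X %*% beta.true))

# Log-likelihood without the ln(y_i!) term, which is constant in beta.
ll.poisson <- function(par, X, y) {
  xb <- X %*% par
  sum(y * xb - exp(xb))
}

# fnscale = -1 turns optim() into a maximizer.
out.pois <- optim(c(0, 0), ll.poisson, X = X, y = y, method = "BFGS",
                  control = list(fnscale = -1), hessian = TRUE)

glm.fit <- glm(y ~ x, family = poisson)
cbind(optim = out.pois$par, glm = coef(glm.fit))
```

The two columns should agree up to the optimizer's tolerance, since `glm()` with `family = poisson` maximizes the same likelihood.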

Comparing with the Linear Model


Possible dimensions for comparison:


1. distribution of Y∣X
2. shape of the mean function
3. assumptions about Var(Y∣X)
4. calculating fitted values
5. meaning of intercept and slope

Generally: the linear model (OLS) is biased, inefficient, and
inconsistent for count data!
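
One of these dimensions, the shape of the mean function and the fitted values it produces, is easy to see on simulated data. The sketch below is only an illustration under assumed values (the seed, sample size, and coefficients are made up): it fits a linear model and a Poisson model to counts whose true mean is exponential in x and compares the smallest fitted mean from each.

```r
# Simulated (hypothetical) count data with an exponential mean function.
set.seed(8)
n <- 1000
x <- runif(n, -2, 2)
y <- rpois(n, exp(0.5 + 1.2 * x))

ols  <- lm(y ~ x)                       # linear mean: E[Y|x] = a + b*x
pois <- glm(y ~ x, family = poisson)    # log link:    E[Y|x] = exp(a + b*x)

# Fitted means at the low end of x: the straight line can dip below zero,
# which is impossible for a count; the Poisson fit cannot.
c(min.fitted.ols = min(fitted(ols)), min.fitted.poisson = min(fitted(pois)))
```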

Example: Civil Conflict in Northern Ireland

Background: a conflict largely along religious lines about the status of
Northern Ireland within the United Kingdom, and the division of
resources and political power between Northern Ireland’s Protestant
(mainly Unionist) and Catholic (mainly Republican) communities.

The data: the number of Republican deaths for every month from
1969, the beginning of sustained violence, to 2001 (at which point,
most organized violence had subsided). Also, the unemployment rates
in the two main religious communities.

The model: Let Yi = # of Republican deaths in a month. Our sole
predictor for the moment will be: U^C = the unemployment rate among
Northern Ireland’s Catholics.

Our model is then:

\[
Y_i \sim \text{Pois}(\lambda_i)
\]
and
\[
\lambda_i = E[Y_i \mid U_i^{C}] = \exp(\beta_0 + \beta_1 U_i^{C}).
\]

Estimate (just as we have all along!)

mod <- zelig(repdeaths ~ cathunemp,
             data = troubles, model = "poisson")

> summary(mod)$coefficients
             Estimate Std. Error  z value     Pr(>|z|)
(Intercept)  1.295875  0.1805327 7.178064 7.070547e-13
cathunemp    1.406498  0.6689819 2.102445 3.551432e-02

Our fitted model

\[
\lambda_i = E[Y_i \mid U_i^{C}] = \exp(1.296 + 1.407 \, U_i^{C}).
\]

[Figure: scatterplot of repdeaths (y-axis, 0 to 50) against cathunemp (x-axis, 0.10 to 0.40).]
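
To read the fitted equation, it can help to plug in a few values. The sketch below uses the rounded coefficients reported above; the helper name `lambda.hat` and the chosen unemployment values are assumptions for the example.

```r
b0 <- 1.296   # rounded intercept from the fitted model
b1 <- 1.407   # rounded coefficient on cathunemp

# Expected number of Republican deaths in a month at unemployment rate u
lambda.hat <- function(u) exp(b0 + b1 * u)

lambda.hat(c(0.1, 0.2, 0.3, 0.4))

# Because of the log link, effects are multiplicative: raising u by 0.1
# multiplies the expected count by exp(b1 * 0.1), whatever the starting u.
exp(b1 * 0.1)
lambda.hat(0.3) / lambda.hat(0.2)
```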

Some fitted and predicted values


Suppose U^C is equal to .2.

library(MASS)   # provides mvrnorm()
mod.coef <- coef(mod); mod.vcov <- vcov(mod)
beta.draws <- mvrnorm(10000, mod.coef, mod.vcov)          # estimation uncertainty
lambda.draws <- exp(beta.draws[,1] + .2*beta.draws[,2])   # expected values
outcome.draws <- rpois(10000, lambda.draws)               # predicted values

[Figure: two panels. “Expected Values”: density of E[Y|U], roughly 4.4 to 5.2. “Predicted Values”: frequency histogram of Y|U, 0 to 15.]

Is the difference between expected and predicted values clear? What
kind of uncertainty is accounted for in each of the two distributions?

Estimation uncertainty for expected values.

Both estimation uncertainty and fundamental uncertainty for
predicted values.
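
The difference in spread between the two distributions can be made concrete. Since the `troubles` data are not reproduced here, the sketch below fits a Poisson model to simulated data (the data, seed, and coefficients are assumptions for the example) and then follows the same simulation steps as the slides.

```r
library(MASS)   # for mvrnorm(); MASS ships with standard R installations

# Simulated (hypothetical) stand-in for the real data.
set.seed(8)
u <- runif(300, 0.1, 0.4)
y <- rpois(300, exp(1.3 + 1.4 * u))
fit <- glm(y ~ u, family = poisson)

# Estimation uncertainty: draw coefficients from their sampling distribution.
beta.draws   <- mvrnorm(10000, coef(fit), vcov(fit))
lambda.draws <- exp(beta.draws[, 1] + 0.2 * beta.draws[, 2])   # expected values

# Fundamental uncertainty added on top: draw an outcome for each lambda.
outcome.draws <- rpois(10000, lambda.draws)                    # predicted values

c(var.expected = var(lambda.draws), var.predicted = var(outcome.draws))
```

The predicted values are far more spread out because they stack the Poisson noise on top of the uncertainty about λ.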

Overdispersion

36% of observations lie below the 2.5% quantile or above the 97.5% quantile
of the Poisson distribution that we are alleging generated them.

[Figure: scatterplot of repdeaths (y-axis, 0 to 50) against cathunemp (x-axis, 0.10 to 0.40).]
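
A check of this kind can be sketched with `qpois()`. The example below uses simulated data rather than the Northern Ireland series (the seed, the mean of 5, and the dispersion setting are assumptions): it compares how often Poisson and overdispersed counts fall outside the central 95% band of the Poisson distribution with the same mean.

```r
set.seed(1)
n <- 1000
lambda <- rep(5, n)                          # same mean for every observation

y.pois <- rpois(n, lambda)                   # data that really are Poisson
y.over <- rnbinom(n, size = 1, mu = lambda)  # same mean, much larger variance

# Central 95% band implied by the Poisson model
lo <- qpois(0.025, lambda)
hi <- qpois(0.975, lambda)

share.outside <- function(y) mean(y < lo | y > hi)

c(poisson = share.outside(y.pois), overdispersed = share.outside(y.over))
```

When the Poisson model is right, only a small share of points falls outside the band; a share far above 5% is a warning sign of overdispersion.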

The Negative Binomial Model

The variance of the Poisson distribution is only equal to its mean if the
probability of an event occurring at any moment is independent of
whether an event has occurred at any other moment, and if the
occurrence rate is constant.

We can perturb this second assumption (constant rate) in order to
derive a distribution which can handle both violations of the constant
rate assumption and violations of the independence of events (or no
contagion) assumption.

The trick is to assume that λ varies, within the same observation span,
according to a new parameter that we will introduce, called ζ.

Alternative Parameterization

Here’s the new stochastic component:

\[
\begin{aligned}
Y_i \mid \lambda_i, \zeta_i &\sim \text{Poisson}(\zeta_i \lambda_i) \\
\zeta_i &\sim \text{Gamma}\left(\frac{1}{\sigma^2 - 1}, \frac{1}{\sigma^2 - 1}\right)
\end{aligned}
\]

Note that this Gamma distribution has a mean of 1. Therefore,
Poisson(ζi λi ) has mean λi . Note also that the variance of this Gamma
distribution is σ² − 1. This means that as σ² goes to 1, the distribution
of ζi collapses to a spike over 1.
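
The mixture can be simulated directly. The sketch below assumes the two Gamma arguments are a shape and a rate (that reading is an assumption, chosen because it gives the stated mean of 1 and variance of σ² − 1); the values of `sigma2`, `lambda`, and the seed are arbitrary.

```r
set.seed(7)
n      <- 200000
sigma2 <- 1.5
lambda <- 4

a    <- 1 / (sigma2 - 1)
zeta <- rgamma(n, shape = a, rate = a)   # multiplicative noise on the rate
y    <- rpois(n, zeta * lambda)          # Poisson draw with a perturbed rate

c(mean.zeta = mean(zeta), var.zeta = var(zeta),   # about 1 and sigma2 - 1
  mean.y = mean(y), var.y = var(y))               # mean about lambda, variance above it
```

The simulated counts keep the mean λ but have a variance well above it, which is the overdispersion the plain Poisson model cannot produce.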

Using a similar approach to that described in UPM pgs. 51-52 we can
derive the marginal distribution of Y as

\[
Y_i \sim \text{Negbin}(\lambda_i, \sigma^2)
\]

where

\[
f_{nb}(y_i \mid \lambda_i, \sigma^2) =
\frac{\Gamma\left(\frac{\lambda_i}{\sigma^2 - 1} + y_i\right)}{y_i! \, \Gamma\left(\frac{\lambda_i}{\sigma^2 - 1}\right)}
\left(\frac{\sigma^2 - 1}{\sigma^2}\right)^{y_i}
\left(\sigma^2\right)^{-\frac{\lambda_i}{\sigma^2 - 1}}
\]

Notes:
1. λi > 0 and σ > 1
2. E[Yi ] = λi and Var[Yi ] = λi σ². What value of σ² would be
evidence against overdispersion?
3. We still have the same old systematic component: λi = exp(Xi β).
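
The density and the moments in note 2 can be checked numerically. The sketch below codes the density as written (the function name `f.nb` and the values of `lambda` and `sigma2` are assumptions for the example) and compares it with R's `dnbinom()` using size = λ/(σ² − 1) and prob = 1/σ², a mapping obtained by matching terms between the two formulas.

```r
# Negative binomial PMF in the (lambda, sigma^2) parameterization,
# computed on the log scale for numerical stability.
f.nb <- function(y, lambda, sigma2) {
  r <- lambda / (sigma2 - 1)
  exp(lgamma(r + y) - lgamma(y + 1) - lgamma(r) +
      y * log((sigma2 - 1) / sigma2) - r * log(sigma2))
}

lambda <- 5
sigma2 <- 3
y <- 0:500                       # the tail beyond 500 is numerically negligible

p      <- f.nb(y, lambda, sigma2)
mean.y <- sum(y * p)                    # should be lambda
var.y  <- sum((y - mean.y)^2 * p)       # should be lambda * sigma2

# Largest discrepancy from R's built-in negative binomial PMF
max.diff <- max(abs(p - dnbinom(y, size = lambda / (sigma2 - 1), prob = 1 / sigma2)))

c(total = sum(p), mean = mean.y, var = var.y, max.diff = max.diff)
```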

Estimates

mod <- zelig(repdeaths ~ cathunemp, data = troubles,
             model = "negbin")
summary(mod)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.2959     0.1805   7.178 7.07e-13 ***
cathunemp     1.4065     0.6690   2.102   0.0355 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: 0.8551
Std. Err.: 0.0754

Overdispersion Handled!

5.68% of observations lie at or above the 95% quantile of the Negative
Binomial distribution that we are alleging generated them.

[Figure: scatterplot of repdeaths (y-axis, 0 to 50) against cathunemp (x-axis, 0.10 to 0.40).]

Other Models

Note that there are many other count models:


▸ Generalized Event Count (GEC) Model
▸ Zero-Inflated Poisson
▸ Zero-Inflated Negative Binomial
▸ Zero-Truncated Models
▸ Hurdle Models
