Count data analysis
By Lemma Derseh
Presentation outline
Poisson regression
Negative binomial Poisson regression
Zero-inflated Poisson regression
Poisson distribution
• The Poisson distribution assigns a positive probability to every
nonnegative integer 0, 1, 2, . . ., so that every nonnegative integer
becomes a mathematical possibility (albeit practically zero
possibility for most count values)
• The Poisson is different than the binomial, Bin(n, π), which takes on
numbers only up to some n, and leads to a proportion (out of n)
• But the Poisson is similar to the binomial in that it can be shown
that the Poisson is the limiting distribution of a Binomial for large n
and small π
• Furthermore, because of the simple form of the Poisson
distribution, it is often computationally preferred over the Binomial
Poisson distribution
We cannot consider count data as continuous normal because:
Count distribution is too skewed to satisfy normality
(incorrect test results)
Normal model does not necessarily prevent negative
estimated counts
If we just dichotomize count data (e.g. 0 vs >0) and analyze it
using logistic regression, then we:
Loss information resulting in under-powered tests
For instance 1 event is not really equal to 100 events
Poisson distribution
• Expected number of counts (per unit of time) is strictly
positive
• As mean increases, distribution approximates normal
Examples of count data
• Number of deaths due to HIV/AIDs in a region per month
• Number of car accidents in a city
• Number of malaria cases in a district
Interpretation of loglinear models
Let Yi be the observed count for experimental unit i
Yi|Xi ~ Poisson(μi)
log(μi) = Xi β
Assumption: the mean (i.e. μ) equals the variance (i.e. σ2)
• We think that the covariates influence the mean of the counts (μ)
in a multiplicative way, i.e. as a covariate increases by 1 unit, the
log of the mean increases by β units and this implies the mean
increases by a “fold-change” of or “scale factor” of exp(β).
• The log link is the canonical link in GLM for Poisson distribution
Interpretation of loglinear model
• Considering only one covariate, the model is:
log(μ) = α + βx
• Since the log of the expected value of Y is a linear function of
explanatory variable(s), and the expected value of Y is a
multiplicative function of x:
μ = eα + βx
= eαeβx
What does this mean for μ? How do we interpret ?
Interpretation: Example
Let a Poisson process involves the event of myocardial infarction (MI)
encountered per month and is required to model it using a single factor
say age of study participants (in years). Suppose the estimates are:
Intercept or α= 0.3396 and β=0.02565
Our estimated model will be:
log(μ) = 0.3396 + 0.02565x
Therefore, for a 1-year increase in age, the expectation (or mean)
number of event of MI increases by a factor of e0.2565 = 1.026 per
month
That is, the number of MIs is growing at the rate of 2.6% per month
due to a one year increase in age
Interpretation
• All the findings mentioned above are based on the assumption
that the mean equals to the variance or equi-dispersion.
• Empirically, however, we often find data that exhibit over-
dispersion, with a variance larger than the mean and rarely
under dispersion, when the variance less than the mean
• Thus we need to have models that accommodate the excess
variation.
Fitting count data
(Stata Oriented)
Data: We use data from the STATA log provided by Long (1990) on the number
of publications produced by Ph.D. biochemists to illustrate the application on
the models mentioned. The variables in the data set include art: articles in last
three years of Ph.D., fem: coded one for females, mar: coded one if married,
kid5: number of children under age five, phd: prestige of PhD program, and
ment: articles published by mentor in last three years.
It can be accessed from the web via: use http://www.stata-
press.com/data/lf2/couart2,clear
Example
Data assessment
• After running the data using Stata: sum art, one can see that the
mean number of articles is 1.69 and the variance is 3.71
• a bit more than twice the mean
• The data are over-dispersed, but of course we haven’t
considered any covariates
A Poisson Model
We can fit Poisson model with the five covariates in two
ways:
i. Use the command: poison
ii. Use the command: glm
We will also store the estimates for latter use using the
command: estimates store poison
A Poisson Model
. glm art fem mar kid5 phd ment, family(poisson) nolog
Generalized linear models No. of obs = 915
Optimization : ML Residual df = 909
Scale parameter = 1
Deviance = 1634.370984 (1/df) Deviance = 1.797988
Pearson = 1662.54655 (1/df) Pearson = 1.828984
Variance function: V(u) = u [Poisson]
Link function : g(u) = ln(u) [Log]
AIC = 3.621981
Log likelihood = -1651.056316 BIC = -4564.031
OIM
art Coef. Std. Err. z P>|z| [95% Conf. Interval]
fem -.2245942 .0546138 -4.11 0.000 -.3316352 -.1175532
mar .1552434 .0613747 2.53 0.011 .0349512 .2755356
kid5 -.1848827 .0401272 -4.61 0.000 -.2635305 -.1062349
phd .0128226 .0263972 0.49 0.627 -.038915 .0645601
ment .0255427 .0020061 12.73 0.000 .0216109 .0294746
_cons .3046168 .1029822 2.96 0.003 .1027755 .5064581
Negative Binomial approach
• In the context of count data, consider the assumption that the
variance is proportional to the mean. Specifically, var(Y ) = φE(Y ) =
φμ
• If φ= 1 then the variance equals the mean and we obtain the
Poisson mean-variance relationship. If φ > 1 then we have over-
dispersion, and if φ < 1 we would have under-dispersion, but this is
relatively rare
• φ= 1 when the outcome data is generated by pure Poisson process;
φ< 1 when outcome events are happening in a regulated manner;
φ> 1 when there are unobserved heterogeneities that should be
captured by random effect models like negative binomial or mixed
methods or both
Negative Binomial
• Modeling over-dispersion in count data should start from a Poisson
regression model and add a multiplicative random effect to represent
unobserved heterogeneity which leads into the negative binomial
regression model
• Suppose that the conditional distribution of the outcome Y given an
unobserved variable effect θ is Poisson with mean and variance μθ.
That is, Y|θ ~ P(θμ)
• In the context of the Ph.D. biochemists data, θ captures unobserved
factors effect that increase (if θ > 1) or decrease (if θ < 1) productivity
relative to what one would expect given the observed values of the
covariates, which is of course μ where logμ = x’β.
• For convenience we take E(θ) = 1, that represents the expected
outcome for the average individual given covariates x.
Negative Binomial
• Stata has a command called nbreg that can fit the negative
binomial model described here by maximum likelihood.
• The output uses alpha to label the variance of θ the unobserved
variable, which we call σ2.
Negative Binomial Regression
• Stata’s alpha is the variance of the multiplicative random effect and
corresponds to σ2
• It is estimated to be 0.44 and is statistically significant
Negative Binomial Regression
• Because the Poisson model is a special case of the negative binomial
when σ2 =0, we can use a likelihood ratio test to compare the two
models.
• Or to test the significance of Stata’s alpha we may think of computing
twice the difference in log-likelihoods between this model and the
Poisson model, 180.2, and treating it as chi-squared with one df.
• However, the usual asymptotic do not apply, because the null
hypothesis is on a boundary of the parameter space
• A better approximation is to treat the statistic as 50:50 mixture of zeros
and chi-squared with one df
• And Stata implements this procedure, reporting the statistics as
chi2bar.
Negative Binomial Regression
• Alternatively to compare the two models, we can treat the
statistics as a chi-square with one df which gives a conservative
test
• To test hypothesis about the regression coefficients, we can use
either the Wald or LR tests, which are possible because we have
made full distributional assumptions.
Zero-Inflated Poisson
• One way to model when there is a situation of inflated zeros is to
assume that the data come from a mixture of two populations,
one where the counts is always zero, and another where the
count has a Poisson distribution with mean μ
• In this model, zero counts can come from either population,
while positive counts come only from the second one
• In the context of publications by PhD biochemists data, we can
imagine that some had in mind jobs where publications wouldn’t
be important, while others were aiming for academic jobs where
a record of publications was expected.
• Members of the first group would publish zero articles, whereas
members of the second group would publish 0, 1, 2, …., a count
that may be assumed to have a Poisson distribution
Zero-Inflated Poisson
• The distribution of the outcome can be modeled in terms of two
parameters, π the probability of “always zero” and μ, the mean
number of publications for those not in the “always zero’ group
• A natural way to introduce covariates is to model the logit of the
probability π of always zero and the log of the mean μ for those
not in the “always zero” class
• Stata implements this combination in the zip command when the
counts are assumed Poisson
• A parallel development using a negative binomial model for the
counts in the second group leads to the zinb command.
• In both cases, the model for the probability of always zero is
specified in the inflate () option
Zero-Inflated Poisson
Zero-inflated Poisson model with all covariates in both equation
. zip art fem mar kid5 phd ment, inflate(fem mar kid5 phd ment) nolog
Zero-inflated Poisson regression Number of obs = 915
Nonzero obs = 640
Zero obs = 275
Inflation model = logit LR chi2(5) = 78.56
Log likelihood = -1604.773 Prob > chi2 = 0.0000
art Coef. Std. Err. z P>|z| [95% Conf. Interval]
art
fem -.2091446 .0634047 -3.30 0.001 -.3334155 -.0848737
mar .103751 .071111 1.46 0.145 -.035624 .243126
kid5 -.1433196 .0474293 -3.02 0.003 -.2362793 -.0503599
phd -.0061662 .0310086 -0.20 0.842 -.066942 .0546096
ment .0180977 .0022948 7.89 0.000 .0135999 .0225955
_cons .640839 .1213072 5.28 0.000 .4030814 .8785967
inflate
fem .1097465 .2800813 0.39 0.695 -.4392028 .6586958
mar -.3540107 .3176103 -1.11 0.265 -.9765155 .2684941
kid5 .2171001 .196481 1.10 0.269 -.1679956 .6021958
phd .0012702 .1452639 0.01 0.993 -.2834418 .2859821
ment -.134111 .0452461 -2.96 0.003 -.2227918 -.0454302
_cons -.5770618 .5093853 -1.13 0.257 -1.575439 .421315
Zero-Inflated Poisson
• Looking at the inflate equation we see that the only significant
predictor of being in the “always zero” class is the number of
articles published by the mentor, with each article by the mentor
associated with 13.4% lower odds of never publishing
• Looking at the equation for the mean number of articles among
those not in the always zero class, we find significant
disadvantages for females and scientists with children under five,
and
• A large positive effect of the number of publications by the
mentor, with each article associated with a 1.8% increase in the
expected number of publications