Count data analysis

By Lemma Derseh
Presentation outline
Poisson regression

Negative binomial regression

Zero-inflated Poisson regression

Poisson distribution
• The Poisson distribution assigns a positive probability to every
nonnegative integer 0, 1, 2, . . ., so every nonnegative integer is
mathematically possible (although the probability is practically
zero for most count values)

• The Poisson differs from the binomial, Bin(n, π), which takes
values only up to some n and leads to a proportion (out of n)

• But the Poisson is similar to the binomial in that it can be shown
to be the limiting distribution of the binomial for large n and
small π (with nπ held fixed)

• Furthermore, because of its simple form, the Poisson distribution
is often computationally preferred over the binomial
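The limiting relationship between the binomial and the Poisson can be checked numerically. The sketch below is illustrative only: the function names, the mean of 2 and the values of n are assumptions chosen for the example, not taken from the slides. It holds nπ fixed and reports the largest gap between the two probability functions, which should shrink as n grows.

```python
import math


def poisson_pmf(k, mu):
    """P(Y = k) for a Poisson variable with mean mu."""
    return math.exp(-mu) * mu**k / math.factorial(k)


def binomial_pmf(k, n, p):
    """P(Y = k) for a Bin(n, p) variable."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def max_gap(n, mu, k_max=10):
    """Largest absolute difference between Bin(n, mu/n) and Poisson(mu)
    over the counts 0..k_max (capped at n)."""
    p = mu / n
    return max(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, mu))
        for k in range(min(n, k_max) + 1)
    )


if __name__ == "__main__":
    mu = 2.0  # assumed mean, for illustration only
    for n in (10, 100, 10000):
        print(f"n = {n:>5}, pi = {mu / n:.4f}, max gap = {max_gap(n, mu):.6f}")
```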
Poisson distribution
• We cannot treat count data as continuous and normally distributed because:
   - Count distributions are often too skewed to satisfy normality
     (leading to incorrect test results)
   - A normal model does not prevent negative estimated counts

• If we simply dichotomize count data (e.g. 0 vs >0) and analyze it
using logistic regression, then we:
   - Lose information, resulting in under-powered tests
   - For instance, 1 event is not really equal to 100 events
Poisson distribution
• The expected number of counts (per unit of time) is strictly
positive

• As the mean increases, the distribution approaches the normal

Examples of count data
• Number of deaths due to HIV/AIDS in a region per month

• Number of car accidents in a city

• Number of malaria cases in a district

Interpretation of loglinear models
Let Y_i be the observed count for experimental unit i
Y_i | X_i ~ Poisson(μ_i)
log(μ_i) = X_i β

Assumption: the mean (i.e. μ) equals the variance (i.e. σ²)

• We assume that the covariates influence the mean of the counts (μ)
in a multiplicative way, i.e. as a covariate increases by 1 unit, the
log of the mean increases by β units, which implies that the mean
is multiplied by a “fold-change” or “scale factor” of exp(β).

• The log link is the canonical link in GLM for Poisson distribution
Interpretation of loglinear model

• Considering only one covariate, the model is:

log(μ) = α + βx

• Since the log of the expected value of Y is a linear function of the
explanatory variable(s), the expected value of Y is a
multiplicative function of x:
μ = e^{α + βx}
  = e^{α} e^{βx}

What does this mean for μ? How do we interpret β?
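One way to answer this, using only the one-covariate model above, is to compare the mean at x + 1 with the mean at x. The short derivation below is a sketch; the notation μ(x) for the mean at covariate value x is introduced here for convenience.

```latex
\frac{\mu(x+1)}{\mu(x)}
  = \frac{e^{\alpha + \beta (x+1)}}{e^{\alpha + \beta x}}
  = e^{\beta},
\qquad
\mu(0) = e^{\alpha}.
```

The ratio does not depend on x, so each one-unit increase in x multiplies the mean by e^β (β > 0 raises it, β < 0 lowers it, β = 0 leaves it unchanged), and e^α is the mean when x = 0.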

Interpretation: Example
• Suppose a Poisson process describes the number of myocardial infarction (MI)
events encountered per month, and we are required to model it using a single
factor, say the age of study participants (in years). Suppose the estimates are:
Intercept α = 0.3396 and β = 0.02565

• Our estimated model will be:

log(μ) = 0.3396 + 0.02565x

• Therefore, for a 1-year increase in age, the expected (mean)
number of MI events per month increases by a factor of
e^{0.02565} = 1.026

• That is, the expected number of MIs per month is about 2.6% higher
for each additional year of age
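The arithmetic in this example can be reproduced in a few lines. This is a minimal sketch using the two estimates quoted on the slide; the function name and the ages 50 and 51 are arbitrary choices made for illustration.

```python
import math

# Estimates quoted in the example above
alpha = 0.3396
beta = 0.02565


def expected_count(age):
    """Expected number of MI events per month at a given age under the fitted model."""
    return math.exp(alpha + beta * age)


# Multiplicative effect of one extra year of age
rate_ratio = math.exp(beta)
percent_change = 100 * (rate_ratio - 1)

# The same ratio recovered from two predictions one year apart (ages are arbitrary)
ratio_check = expected_count(51) / expected_count(50)

# Effects compound multiplicatively: ten years multiply the mean by exp(10 * beta)
ten_year_ratio = math.exp(10 * beta)

if __name__ == "__main__":
    print(f"rate ratio per year  : {rate_ratio:.3f}")
    print(f"percent change       : {percent_change:.1f}%")
    print(f"ratio from age 50->51: {ratio_check:.3f}")
    print(f"rate ratio per decade: {ten_year_ratio:.3f}")
```

Note that the effect of a ten-year difference is exp(10β), not 10 times the one-year percentage change.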
Interpretation

• All the findings mentioned above are based on the assumption
that the mean equals the variance (equi-dispersion).

• Empirically, however, we often find data that exhibit
over-dispersion, with a variance larger than the mean, and more
rarely under-dispersion, where the variance is less than the mean

• Thus we need models that accommodate the excess variation.
Fitting count data

(Stata Oriented)
Data: We use the data provided by Long (1990) on the number of publications
produced by Ph.D. biochemists to illustrate the application of the models
mentioned. The variables in the data set include art: articles in last
three years of Ph.D., fem: coded one for females, mar: coded one if married,
kid5: number of children under age five, phd: prestige of PhD program, and
ment: articles published by mentor in last three years.

It can be loaded from the web in Stata via:
use http://www.stata-press.com/data/lf2/couart2, clear
Example

Data assessment
• After loading the data and running the Stata command sum art, one
can see that the mean number of articles is 1.69 and the variance
is 3.71, a bit more than twice the mean

• The data are over-dispersed, but of course we haven’t yet
considered any covariates
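The over-dispersion check above is just a variance-to-mean ratio. The sketch below applies it first to the two summary figures reported on the slide and then to a small made-up list of counts; the list and the function name are hypothetical and are not the biochemists data.

```python
import statistics


def dispersion_ratio(counts):
    """Sample variance divided by sample mean; values well above 1
    suggest over-dispersion relative to a Poisson distribution."""
    return statistics.variance(counts) / statistics.mean(counts)


# Summary figures reported on the slide for the art variable
reported_mean = 1.69
reported_variance = 3.71
reported_ratio = reported_variance / reported_mean

# Hypothetical counts, for illustration only (NOT the real data)
example_counts = [0, 0, 0, 1, 1, 2, 3, 7]

if __name__ == "__main__":
    print(f"reported variance / mean: {reported_ratio:.2f}")
    print(f"example variance / mean : {dispersion_ratio(example_counts):.2f}")
```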
A Poisson Model
• We can fit a Poisson model with the five covariates in two
ways:

i. Use the command: poisson

ii. Use the command: glm (with the family(poisson) option)

• We will also store the estimates for later use using the
command: estimates store poisson
A Poisson Model
. glm art fem mar kid5 phd ment, family(poisson) nolog

Generalized linear models                          No. of obs      =       915
Optimization     : ML                              Residual df     =       909
                                                   Scale parameter =         1
Deviance         =  1634.370984                    (1/df) Deviance =  1.797988
Pearson          =   1662.54655                    (1/df) Pearson  =  1.828984

Variance function: V(u) = u                        [Poisson]
Link function    : g(u) = ln(u)                    [Log]

                                                   AIC             =  3.621981
Log likelihood   = -1651.056316                    BIC             = -4564.031

------------------------------------------------------------------------------
             |                 OIM
         art |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         fem |  -.2245942   .0546138    -4.11   0.000    -.3316352   -.1175532
         mar |   .1552434   .0613747     2.53   0.011     .0349512    .2755356
        kid5 |  -.1848827   .0401272    -4.61   0.000    -.2635305   -.1062349
         phd |   .0128226   .0263972     0.49   0.627     -.038915    .0645601
        ment |   .0255427   .0020061    12.73   0.000     .0216109    .0294746
       _cons |   .3046168   .1029822     2.96   0.003     .1027755    .5064581
------------------------------------------------------------------------------
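To read this output on the scale of the counts, the coefficients can be exponentiated into rate ratios, and the deviance and Pearson statistics can be divided by the residual degrees of freedom to gauge over-dispersion. The sketch below only re-does that arithmetic with the numbers printed above; the variable names are illustrative.

```python
import math

# Coefficients copied from the Poisson output above
coefs = {
    "fem": -0.2245942,
    "mar": 0.1552434,
    "kid5": -0.1848827,
    "phd": 0.0128226,
    "ment": 0.0255427,
}

# exp(coefficient) is the multiplicative change in the expected number
# of articles for a one-unit increase in the covariate
irr = {name: math.exp(b) for name, b in coefs.items()}

# Goodness-of-fit statistics and residual df copied from the output above
pearson_chi2 = 1662.54655
deviance = 1634.370984
residual_df = 909

# Values near 1 are expected under equi-dispersion
pearson_dispersion = pearson_chi2 / residual_df
deviance_dispersion = deviance / residual_df

if __name__ == "__main__":
    for name, ratio in irr.items():
        print(f"{name:>5}: rate ratio = {ratio:.3f} ({100 * (ratio - 1):+.1f}%)")
    print(f"Pearson chi2 / df = {pearson_dispersion:.3f}")
    print(f"Deviance / df     = {deviance_dispersion:.3f}")
```

Both ratios are about 1.8, which matches the (1/df) figures in the output and points to over-dispersion even after the covariates are included.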
Negative Binomial approach
• In the context of count data, consider the assumption that the
variance is proportional to the mean. Specifically, var(Y) = φE(Y) =
φμ

• If φ = 1 then the variance equals the mean and we obtain the
Poisson mean-variance relationship. If φ > 1 then we have
over-dispersion, and if φ < 1 we have under-dispersion, but this is
relatively rare

• φ = 1 when the outcome data are generated by a pure Poisson process;
φ < 1 when outcome events happen in a regulated manner;
φ > 1 when there is unobserved heterogeneity that should be
captured by random-effects models such as the negative binomial,
mixed models, or both
Negative Binomial
• Modeling over-dispersion in count data can start from a Poisson
regression model and add a multiplicative random effect to represent
unobserved heterogeneity, which leads to the negative binomial
regression model

• Suppose that the conditional distribution of the outcome Y given an
unobserved effect θ is Poisson with mean and variance μθ.
That is, Y|θ ~ Poisson(μθ)

• In the context of the Ph.D. biochemists data, θ captures unobserved
factors that increase (if θ > 1) or decrease (if θ < 1) productivity
relative to what one would expect given the observed values of the
covariates, which is of course μ, where log μ = x’β.

• For convenience we take E(θ) = 1, so that μ represents the expected
outcome for the average individual given covariates x.
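The consequence of this setup for the mean and variance of Y can be worked out with the laws of total expectation and total variance. The derivation below is a sketch that assumes only E(θ) = 1 and writes σ² for var(θ); the negative binomial distribution itself arises in the specific case where θ follows a gamma distribution.

```latex
\begin{aligned}
E(Y)   &= E\big[E(Y \mid \theta)\big] = \mu\,E(\theta) = \mu, \\
\operatorname{var}(Y)
       &= E\big[\operatorname{var}(Y \mid \theta)\big]
        + \operatorname{var}\big[E(Y \mid \theta)\big] \\
       &= \mu\,E(\theta) + \mu^{2}\operatorname{var}(\theta)
        = \mu + \sigma^{2}\mu^{2}
        = \mu\,(1 + \sigma^{2}\mu).
\end{aligned}
```

The variance exceeds the mean whenever σ² > 0, and the model reduces to the Poisson when σ² = 0.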
Negative Binomial
• Stata has a command called nbreg that can fit the negative
binomial model described here by maximum likelihood.

• The output uses alpha to label the variance of the unobserved
effect θ, which we call σ².
Negative Binomial Regression

• Stata’s alpha is the variance of the multiplicative random effect and
corresponds to σ²
• It is estimated to be 0.44 and is statistically significant
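To get a feel for what an alpha of 0.44 implies, one can plug it into the standard mean-variance relationship of this negative binomial model, var(Y) = μ(1 + σ²μ). The sketch below is only illustrative: the function name is made up, and evaluating the formula at the overall sample mean of 1.69 ignores the covariates.

```python
def nb_variance(mu, alpha):
    """Variance of a count with mean mu when the multiplicative random
    effect has mean 1 and variance alpha: mu * (1 + alpha * mu)."""
    return mu * (1.0 + alpha * mu)


alpha_hat = 0.44    # estimate quoted on the slide
sample_mean = 1.69  # overall mean of art quoted earlier in the slides

implied_variance = nb_variance(sample_mean, alpha_hat)
poisson_variance = nb_variance(sample_mean, 0.0)  # alpha = 0 gives the Poisson case

if __name__ == "__main__":
    print(f"Poisson variance at the mean : {poisson_variance:.2f}")
    print(f"Neg. binomial variance       : {implied_variance:.2f}")
    print(f"variance / mean              : {implied_variance / sample_mean:.2f}")
```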
Negative Binomial Regression
• Because the Poisson model is a special case of the negative binomial
when σ² = 0, we can use a likelihood ratio test to compare the two
models.

• To test the significance of Stata’s alpha, we may think of computing
twice the difference in log-likelihoods between this model and the
Poisson model, 180.2, and treating it as chi-squared with one df.

• However, the usual asymptotics do not apply, because the null
hypothesis is on the boundary of the parameter space

• A better approximation is to treat the statistic as a 50:50 mixture of
zero and a chi-squared with one df

• Stata implements this procedure, reporting the statistic as
chibar2.
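The p-value under the 50:50 mixture is half the usual upper-tail probability of a chi-squared with one df. The sketch below computes it with the standard library, using the identity P(χ²₁ > x) = erfc(√(x/2)); the function names are illustrative, and the statistic 180.2 is the value quoted above.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability P(X > x) for a chi-squared variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


def chibar2_pvalue(x):
    """P-value when the statistic is a 50:50 mixture of a point mass at zero
    and a chi-squared with 1 df (boundary null hypothesis)."""
    if x <= 0:
        return 1.0
    return 0.5 * chi2_sf_1df(x)


lr_stat = 180.2  # twice the difference in log-likelihoods, from the slide

if __name__ == "__main__":
    print(f"naive chi2(1) p-value : {chi2_sf_1df(lr_stat):.3e}")
    print(f"mixture p-value       : {chibar2_pvalue(lr_stat):.3e}")
```

For any positive statistic the plain chi-squared p-value is exactly twice the mixture p-value, so using it errs on the side of not rejecting.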
Negative Binomial Regression

• Alternatively, to compare the two models we can treat the
statistic as a chi-squared with one df, which gives a conservative
test

• To test hypotheses about the regression coefficients, we can use
either Wald or LR tests, which are possible because we have
made full distributional assumptions.
Zero-Inflated Poisson
• One way to model data with inflated zeros is to
assume that the data come from a mixture of two populations,
one where the count is always zero, and another where the
count has a Poisson distribution with mean μ
• In this model, zero counts can come from either population,
while positive counts come only from the second one
• In the context of the publications by PhD biochemists data, we can
imagine that some had in mind jobs where publications wouldn’t
be important, while others were aiming for academic jobs where
a record of publications was expected.
• Members of the first group would publish zero articles, whereas
members of the second group would publish 0, 1, 2, ..., a count
that may be assumed to have a Poisson distribution
Zero-Inflated Poisson
• The distribution of the outcome can be modeled in terms of two
parameters: π, the probability of “always zero”, and μ, the mean
number of publications for those not in the “always zero” group

• A natural way to introduce covariates is to model the logit of the
probability π of always zero and the log of the mean μ for those
not in the “always zero” class
• Stata implements this combination in the zip command when the
counts are assumed Poisson

• A parallel development using a negative binomial model for the
counts in the second group leads to the zinb command.

• In both cases, the model for the probability of always zero is
specified in the inflate() option
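The two-part structure described above translates directly into a probability function: a zero can come from either group, while a positive count can only come from the Poisson group. The sketch below is a minimal illustration; the function names and the two linear-predictor values are hypothetical and are not estimates from any model.

```python
import math


def inv_logit(eta):
    """Map a logit-scale linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def zip_pmf(k, pi, mu):
    """P(Y = k) under a zero-inflated Poisson with 'always zero'
    probability pi and Poisson mean mu for everyone else."""
    poisson = math.exp(-mu) * mu**k / math.factorial(k)
    if k == 0:
        # zeros arise from the always-zero group OR from the Poisson group
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


# Hypothetical linear predictors, for illustration only
pi = inv_logit(-1.0)  # logit model for "always zero"
mu = math.exp(0.7)    # log model for the Poisson mean

if __name__ == "__main__":
    print(f"pi = {pi:.3f}, mu = {mu:.3f}")
    print(f"P(Y = 0) under ZIP    : {zip_pmf(0, pi, mu):.3f}")
    print(f"P(Y = 0) under Poisson: {math.exp(-mu):.3f}")
    print(f"overall mean (1-pi)*mu: {(1 - pi) * mu:.3f}")
```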
Zero-Inflated Poisson
Zero-inflated Poisson model with all covariates in both equations
. zip art fem mar kid5 phd ment, inflate(fem mar kid5 phd ment) nolog

Zero-inflated Poisson regression                   Number of obs   =       915
                                                   Nonzero obs     =       640
                                                   Zero obs        =       275

Inflation model = logit                            LR chi2(5)      =     78.56
Log likelihood  = -1604.773                        Prob > chi2     =    0.0000

------------------------------------------------------------------------------
         art |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
art          |
         fem |  -.2091446   .0634047    -3.30   0.001    -.3334155   -.0848737
         mar |    .103751    .071111     1.46   0.145     -.035624     .243126
        kid5 |  -.1433196   .0474293    -3.02   0.003    -.2362793   -.0503599
         phd |  -.0061662   .0310086    -0.20   0.842     -.066942    .0546096
        ment |   .0180977   .0022948     7.89   0.000     .0135999    .0225955
       _cons |    .640839   .1213072     5.28   0.000     .4030814    .8785967
-------------+----------------------------------------------------------------
inflate      |
         fem |   .1097465   .2800813     0.39   0.695    -.4392028    .6586958
         mar |  -.3540107   .3176103    -1.11   0.265    -.9765155    .2684941
        kid5 |   .2171001    .196481     1.10   0.269    -.1679956    .6021958
         phd |   .0012702   .1452639     0.01   0.993    -.2834418    .2859821
        ment |   -.134111   .0452461    -2.96   0.003    -.2227918   -.0454302
       _cons |  -.5770618   .5093853    -1.13   0.257    -1.575439     .421315
------------------------------------------------------------------------------
Zero-Inflated Poisson
• Looking at the inflate equation we see that the only significant
predictor of being in the “always zero” class is the number of
articles published by the mentor, with each article by the mentor
associated with about 12.6% lower odds of never publishing
(odds ratio exp(−0.1341) ≈ 0.8745)

• Looking at the equation for the mean number of articles among
those not in the always zero class, we find significant
disadvantages for females and scientists with children under five,
and a highly significant positive effect of the number of
publications by the mentor, with each article associated with a
1.8% increase in the expected number of publications
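The two percentages quoted for the mentor's articles come from exponentiating the ment coefficient in each equation of the output above. The sketch below re-does that arithmetic; the variable names are illustrative.

```python
import math

# ment coefficients copied from the zero-inflated Poisson output above
ment_inflate = -0.134111  # logit equation for "always zero"
ment_count = 0.0180977    # log equation for the Poisson mean

# Logit scale: exp(coefficient) is an odds ratio
odds_ratio = math.exp(ment_inflate)
pct_lower_odds = 100 * (1 - odds_ratio)

# Log scale: exp(coefficient) is a ratio of expected counts
rate_ratio = math.exp(ment_count)
pct_higher_mean = 100 * (rate_ratio - 1)

if __name__ == "__main__":
    print(f"odds ratio for always zero: {odds_ratio:.4f} "
          f"({pct_lower_odds:.1f}% lower odds per mentor article)")
    print(f"rate ratio for the count  : {rate_ratio:.4f} "
          f"({pct_higher_mean:.1f}% higher mean per mentor article)")
```

Reading the raw coefficient as a percentage works well for small values such as 0.018, but it overstates the change for the larger inflate coefficient, so the exponentiated value is used there.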
