0% found this document useful (0 votes)

136 views19 pages

Understanding Poisson Regression Models

- The document discusses Poisson regression, a generalized linear model used to model count data that follows a Poisson distribution. - Key properties of count data include being discrete and non-negative numbers with variance increasing with the mean. The Poisson distribution is appropriate for modeling this type of data. - Poisson regression models the log of the mean response (λ) as a linear combination of predictors, resulting in a multiplicative effect of predictors on the expected count. - Overdispersion can occur if the data violates the Poisson assumption of independent observations. This results in more extreme observed values than the model predicts.

Uploaded by

ግራዶ ጎማ

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

136 views19 pages

Understanding Poisson Regression Models

Uploaded by

ግራዶ ጎማ

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 19

Poisson regression

November 8, 2021

Objectives
• Know when and how to use a generalized linear model with a Poisson distribution.
• Detect overdispersion in a generalized linear model.
• Adapt the estimation of parameters and the selection of models to the case of overdispersed models.

Introduction
At the last class, we applied logistic regression to the analysis of binary and binomial data. In this course, we
will see another type of generalized linear model, Poisson regression, which is used to model the number of
observations of a certain event in a defined sampling unit.
Here are some examples of this type of data in ecology:
• the number of seedlings of a species in several quadrats;
• the number of species (species richness) observed at different sites;
• the number of individuals of a species of bird heard in a certain period of time during different visits to
a site.
Or in other areas:
• the number of patients arriving at the emergency room per hour;
• the number of cars crossing an intersection per minute.

Count data properties

• The count data are discrete numbers ≥ 0; we can have 0, 1, 2, . . . observations, but not -1 or 1.5.
• When the mean number of observations is small, the distribution of observations is generally asymmetric
(e.g. if the mean is 1, some sites will have more than 2, but never 0).
• The variance of the number of observations tends to increase with the mean: for example, if we observe
a mean of 1 species of mushroom per plot of 10x10m, the observations will vary perhaps between 0 and
3, but if we observe a mean of 10, the observations could vary between 6 and 15.
These properties justify the use of a distribution different from the normal distribution to model the observed
response: the Poisson distribution.

Poisson distribution

Suppose that the response y represents the number of observations of an event in a given interval (interval
of time, length, area, etc.). Suppose moreover that these events are independent, that is to say that the
observation of a first event does not influence the probability of observing a second one or not.

1
In this case, the variable y follows a Poisson distribution, with a parameter λ representing the mean rate of
observations per interval. The probability of a certain value of y as a function of λ is given by the following
equation.

λy −λ
P (y|λ) = e
y!

√ has only one parameter: λ corresponds to both its mean and its variance (so the
The Poisson distribution
standard deviation is λ).
In R, the rpois function can generate data according to a Poisson distribution, and dpois gives the probability
of a given value of y according to the above equation.
# Barplot with 10000 values drawn from the
# Poisson distribution with lambda = 3
ggplot(NULL, aes(x = rpois(10000, 3))) +
geom_bar()

2000

1500
count

1000

500

0
0 4 8 12
rpois(10000, 3)
# Probability of y = 1 if lambda = 3
dpois(1, 3)

## [1] 0.1493612

Comparison with the binomial distribution

For the binomial distribution, we had a positive or negative response for each individual (or sample unit) and
we were trying to predict the probability of a positive response p.

2
In the Poisson distribution, the response is the number of observations per sample unit and we try to predict
the mean of that number.
If we have a binomial response that is rare at the individual level (p is small) and the population n is large,
then we can model the number of cases at the population level by a Poisson distribution with λ = np.
Example: We are studying a birch forest where chaga is present on a small proportion (e.g. ~ 1%) of trees.
Instead of counting birch trees affected and unaffected by the chaga, one could only count the number of
chaga conks in plots of a given size and model that number with a Poisson distribution.

Comparison with the normal distribution

When λ is large enough, the Poisson distribution becomes more symmetrical and approaches a normal
distribution.

λ=5 λ = 20

1500 750
count

count

1000 500

500 250

0 0
0 5 10 15 10 20 30 40
y y
In other words, if the mean is large enough, the number of observations behaves almost like a continuous
variable following a normal distribution with µ = λ.
In that case, the linear regression model could apply. However, keep in mind that each model makes different
assumptions about the variance of the response. In the linear regression model, the mean µ depends on the
predictors, but the variance σ 2 is constant. In the Poisson regression model, the mean and the variance
depend on the predictors, because they are both equal to λ. Diagnostic plots can tell you which model is the
most appropriate.

Independence of observations and overdispersion

As mentioned above, the Poisson distribution assumes the independence of individual observations. It is rare
that this condition is perfectly respected in nature. For example:
• the probability of seeing a seedling of one species increases near other seedlings (agglomeration) if the
seeds are dispersed in groups near the parent;
• the observation of a species on a plot affects the likelihood of observing others based on competition or
facilitation relationships between these species.
Failure to meet the assumptions of the Poisson distribution can lead to overdispersion of observations,
i.e., more extreme values are observed than predicted by the Poisson model. In the first example above,
agglomeration results in more plots with 0 seedlings and more plots with a large number of seedlings.

3
In the graph below, the bars in red follow a Poisson distribution while the bars in blue represent a case of
overdispersion. The mean of y is 5 in both cases.

1500
count

1000

500

0
0 10 20 30
y
We will see later how to identify and address cases of overdispersion.

Poisson regression
Poisson regression is a generalized linear model where the response y follows a Poisson distribution:

y ∼ P ois(λ)

Since λ must be a positive number, we use the log function as a link with the linear predictor.

m
X
log λ = η = β0 + βi x i
i=1

According to the properties of the log function, a value of 0 of the predictor η corresponds to λ = 1, a positive
value corresponds to λ > 1 and a negative value to λ < 1.

4
7
6
5
4
λ

3
2
1
0
−2 −1 0 1 2
η

By inverting the log, we obtain a exponential relationship between the mean of the response (λ) and the
predictors:
Pm
ŷ = λ = eβ0 + i=1 βi xi

Since adding exponents is equivalent to a multiplication, this model corresponds to multiplicative effects of
the predictors on the response.

ŷ = eβ0 eβ1 x1 eβ2 x2 . . .

The logarithm is the default link function for Poisson regression. To properly interpret the results of the
regression, remember that a logarithmic transformation is applied to the mean response.

Summary of the generalized linear models seen in this course

Model Distribution Default link Inverse link

Linear Normal: y ∼ N (µ, σ) Identity: µ = η µ=η
regression
Logistic Binomial: y ∼ B(n, p) Logit: log(p/(1 − p)) = η p = 1/(1 + e−η )
regression
Poisson Poisson: y ∼ P ois(λ) Log: log(λ) = η λ = eη
regression

Example

The species.csv dataset shows the number of species of plants in different quadrats according to the total
biomass in the quadrat, and soil pH coded on a three-level scale (low, mid, high).
sp <- read.csv("../donnees/species.csv")
# Niveaux de pH dans le bon ordre
sp$pH <- factor(sp$pH, levels = c("low", "mid", "high"))
str(sp)

5
## 'data.frame': 90 obs. of 3 variables:
## $ pH : Factor w/ 3 levels "low","mid","high": 3 3 3 3 3 3 3 3 3 3 ...
## $ Biomass: num 0.469 1.731 2.09 3.926 4.367 ...
## $ Species: int 30 39 44 35 25 29 23 18 19 12 ...
Let’s look at the distribution of the response according to the predictors:
ggplot(sp, aes(x = Biomass, y = Species)) +
geom_point()

30
Species

0
0.0 2.5 5.0 7.5 10.0
Biomass
ggplot(sp, aes(x = pH, y = Species)) +
geom_boxplot()

30
Species

0
low mid high
pH
On these graphs, we notice that the conditions associated with a larger mean number of species (low biomass,
high pH) also have greater variance. This suggests that a Poisson regression might be appropriate.
As with logistic regression, Poisson regression uses the glm function. You must specify the poisson family

6
and (optionally) the log link.
mod_sp <- glm(Species ~ Biomass + pH, sp, family = poisson(link = "log"))
summary(mod_sp)

7
Std. Pearson resid.
Residuals vs Fitted Normal Q−Q
18 18
4
Residuals

1 3
1
−2

−2
80
70 1 70

2.5 3.0 3.5 −2 −1 0 1 2

Predicted values Theoretical Quantiles

Std. Pearson resid.

Scale−Location Residuals vs Leverage
0.0 1.0 2.0

18 18
0.5
70 1

2
Cook's distance
−2
20 1

2.5 3.0 3.5 0.00 0.04 0.08 0.12

Predicted values Leverage

The first graph shows a trend in residuals (more negative at the extremes and more positive at the center).
Given the number of points, this trend is probably not due to chance but represents a systematic effect that
is not taken into account in this model.
So let’s try a more complex model where there is interaction between the effect of biomass and pH.
mod_sp_inter <- glm(Species ~ Biomass * pH, sp, family = poisson)

8
Std. Pearson resid.
Residuals vs Fitted Normal Q−Q
18 18
Residuals

1 3

1 3
88 88
−2

−2
20 20

2.0 2.5 3.0 3.5 −2 −1 0 1 2

Predicted values Theoretical Quantiles

Std. Pearson resid.

Scale−Location Residuals vs Leverage
2.0

18 18 0.5

3
88 20
1.0

0
Cook's distance 1
0.0

−3
20

2.0 2.5 3.0 3.5 0.00 0.05 0.10 0.15

Predicted values Leverage

Without having completely eliminated the trend, this fit seems much better.
Note: In this case, these residuals approach normality, according to the quantile-quantile plot, because the
observed values of the response are quite high (mean number of 20 species per quadrat). We had the same
situation for binomial logistic regression when n was high. However, we do not expect the quantile-quantile
plot to show a straight line if λ is small, even if the Poisson model applies perfectly.
Now let’s look at the model summary.
summary(mod_sp_inter)

##
## Call:
## glm(formula = Species ~ Biomass * pH, family = poisson, data = sp)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4978 -0.7485 -0.0402 0.5575 3.2297
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.95255 0.08240 35.833 < 2e-16 ***
## Biomass -0.26216 0.03803 -6.893 5.47e-12 ***
## pHmid 0.48411 0.10723 4.515 6.34e-06 ***
## pHhigh 0.81557 0.10284 7.931 2.18e-15 ***
## Biomass:pHmid 0.12314 0.04270 2.884 0.003927 **
## Biomass:pHhigh 0.15503 0.04003 3.873 0.000108 ***
## ---

9
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 452.346 on 89 degrees of freedom
## Residual deviance: 83.201 on 84 degrees of freedom
## AIC: 514.39
##
## Number of Fisher Scoring iterations: 4
The AIC for this model is 514 compared to 526 for the no-interaction model, which confirms the better fit.
The pseudo-R2 is also high (1 - 83/452 = 0.82).
Here is a graph showing the superposition of the expected values of the response (curves plotted from
fitted(mod_sp_inter)) and observed values (points) as a function of biomass and pH.
ggplot(sp, aes(x = Biomass, y = Species, color = pH)) +
geom_point() +
geom_line(aes(y = fitted(mod_sp_inter))) +
scale_color_brewer(palette = "Dark2")

30
pH
Species

low
mid
20
high

0
0.0 2.5 5.0 7.5 10.0
Biomass

Interpretation of coefficients

The interpretation of the model coefficients is more complex with the interactions, so let’s start with the
model without interaction.

10
summary(mod_sp)

##
## Call:
## glm(formula = Species ~ Biomass + pH, family = poisson(link = "log"),
## data = sp)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5959 -0.6989 -0.0737 0.6647 3.5604
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.71255 0.05713 47.48 <2e-16 ***
## Biomass -0.12756 0.01014 -12.58 <2e-16 ***
## pHmid 0.69123 0.06815 10.14 <2e-16 ***
## pHhigh 1.13639 0.06720 16.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 452.346 on 89 degrees of freedom
## Residual deviance: 99.242 on 86 degrees of freedom
## AIC: 526.43
##
## Number of Fisher Scoring iterations: 4
easier to first determine the effect of each coefficient of the model on the linear predictor: η =
It is P
m
β0 + i=1 βi xi , then deduce the effect on the mean response (λ) from the function λ = eη .
For example, the intercept indicates that in the case where the biomass is 0 and the pH is low (reference
level), η = 2.71 and therefore λ = eη = 15.0. Of course, this value is not realistic because there can be no
species without biomass.
For other parameters:
• The Biomass coefficient indicates that η decreases by 0.13 for each increase of one unit of biomass, if
the other variables (i.e. the pH) remain constant. An additive change of -0.13 on the logarithmic scale
corresponds to a multiplicative change of e−0.13 = 0.88, a loss of 12% of the mean number of species
per additional biomass unit.
• The coefficient of pHmid indicates that when going from a low pH to a medium pH (for the same
biomass), η increases by 0.69. Since e0.69 = 2.0, the mean number of species at an medium pH is twice
that at a low pH.
• Similarly, the coefficient of pHhigh indicates that the mean number of species is multiplied by e1.14 = 3.13
if we go from a low pH to a high pH for the same biomass.
In summary, in the case where there are no interactions, the effects are additive on the linear predictor scale.
The βi coefficient of the variable xi indicates that an increase of 1 unit of xi , while keeping the other variables
constant, results in a multiplication of the mean response by eβi .
Let’s look now at the case of the interaction:
summary(mod_sp_inter)

11
## Call:
## glm(formula = Species ~ Biomass * pH, family = poisson, data = sp)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4978 -0.7485 -0.0402 0.5575 3.2297
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.95255 0.08240 35.833 < 2e-16 ***
## Biomass -0.26216 0.03803 -6.893 5.47e-12 ***
## pHmid 0.48411 0.10723 4.515 6.34e-06 ***
## pHhigh 0.81557 0.10284 7.931 2.18e-15 ***
## Biomass:pHmid 0.12314 0.04270 2.884 0.003927 **
## Biomass:pHhigh 0.15503 0.04003 3.873 0.000108 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 452.346 on 89 degrees of freedom
## Residual deviance: 83.201 on 84 degrees of freedom
## AIC: 514.39
##
## Number of Fisher Scoring iterations: 4
In the model with interaction, the effect of pH depends on the biomass and vice versa. So let’s consider the
three pH levels separately; for simplicity, the effect “biomass = 0” will be indicated even if it is not a realistic
level of the variable.
• At low pH we have: η = 2.95 − 0.26 × biomass. Therefore, the mean number of species is e2.95 = 19 for
a biomass of 0 and decreases by 23% (e−0.26 = 0.77) per unit of biomass.
• At medium pH, we have: η = (2.95 + 0.48) + (−0.26 + 0.12) × biomass, or η = 3.43 − 0.14 × biomass.
Therefore, the mean number of species is 30.9 for a biomass of 0 and decreases by 13% per unit of
biomass.
• At high pH, we have: η = (2.95 + 0.82) + (−0.26 + 0.16) × biomass, equivalent to an mean number of
species of 43.4 for a biomass of 0 and a 10% decrease per unit of biomass.
For an interaction between a numeric variable and a categorical variable, it is easier to describe the effects
for each level of the categorical variable, as we just did. For an interaction between two numeric variables,
we can visualize the effects from the fitted value curves, but it becomes difficult to interpret the individual
coefficients.

Overdispersion
The summary of the Poisson regression results in the previous section included the message: (Dispersion
parameter for poisson family taken to be 1). This message reminds us that the variance of the
observations is not estimated independently in this model (unlike the linear regression model), but is equal to
the mean value λ estimated according to the predictors at each point.
As mentioned in the introduction, the non-independence of individual observations may overdisperse the
data with respect to the assumptions of the Poisson distribution. This overdispersion is represented by a
parameter φ which multiplies the expected variance: for a mean λ, the variance becomes φλ.

12
Less frequently, it happens that φ < 1, corresponding to underdispersion of the observations. Unlike
overdispersion, where observations tend to be clustered (e.g., a few quadrats with many individuals, and
many with few or no individuals), underdispersion means that observations are distributed more evenly than
expected. This could be due, for example, to intraspecific competition or territoriality (in the case of animals)
which leads individuals to space themselves at regular intervals.

Estimation of the dispersion parameter

To detect this overdispersion or under-dispersion, we will use the χ2 statistic, which we saw before in the
context of contingency tables. This statistic is calculated from the square deviations between the observed
(y) and expected (ŷ) values of the response, normalized by the expected value, for each of the n points in the
dataset.

n
X (yk − yˆk )2
χ2 =
yˆk
k=1

If the data follow the Poisson distribution, the mean value of χ2 is equal to the number of residual degrees
of freedom of the model: dfres = n − p, where p is the number of estimated parameters. To estimate the
dispersion parameter φ, we will use the estimator:

χ2
ĉ =
dfres

For the model with interactions, we obtain almost exactly the theoretical value of 1:
chisq <- sum((sp$Species - fitted(mod_sp_inter))^2/fitted(mod_sp_inter))
chisq/mod_sp_inter$df.residual

## [1] 0.9970014
For the model without interaction, we detect a small overdispersion.
chisq <- sum((sp$Species - fitted(mod_sp))^2/fitted(mod_sp))
chisq/mod_sp$df.residual

## [1] 1.10675
To determine if the χ2 differs significantly from the expected value according to the Poisson distribution, we
can calculate the probability of obtaining a higher χ2 if the model is correct.
1 - pchisq(chisq, df = mod_sp$df.residual)

## [1] 0.2336238
In this case, the overdispersion is not significant, so it is not necessary to adjust the results of the model.

Overdispersed data example

The galapagos.csv dataset gives the number of plant species (Species) and the number of endemic species
(Endemics) for different islands in the Galapagos according to their area, their mean elevation, the area of
the nearest island (Adjacent), the distance to the nearest island (Nearest) and the distance to Santa Cruz
island (Scruz).
glp <- read.csv("../donnees/galapagos.csv")
str(glp)

13
## 'data.frame': 30 obs. of 8 variables:
## $ Name : chr "Baltra" "Bartolome" "Caldwell" "Champion" ...
## $ Species : int 58 31 3 25 2 18 24 10 8 2 ...
## $ Endemics : int 23 21 3 9 1 11 0 7 4 2 ...
## $ Area : num 25.09 1.24 0.21 0.1 0.05 ...
## $ Elevation: int 346 109 114 46 77 119 93 168 71 112 ...
## $ Nearest : num 0.6 0.6 2.8 1.9 1.9 8 6 34.1 0.4 2.6 ...
## $ Scruz : num 0.6 26.3 58.7 47.4 1.9 ...
## $ Adjacent : num 1.84 572.33 0.78 0.18 903.82 ...
Suppose we want to determine how the number of endemic plant species (those present on a single island)
varies according to the predictors Area, Elevation, Nearest and Adjacent. Since each of these predictors is
distributed very asymmetrically (see the example below for area), we will use the logarithm of each predictor.

4 4
log(Endemics)

3 log(Endemics) 3

2 2

1 1

0 0
0 1000 2000 3000 4000 −5 0 5
Area log(Area)
We fit the following model in R. Note that the response is Endemics and not log(Endemics), because the
log link is included in the glm model with Poisson distribution.
mod_glp <- glm(Endemics ~ log(Area) + log(Elevation) + log(Nearest) + log(Adjacent),
data = glp, family = poisson)

Diagnostic graphs indicate the presence of some extreme values, although none of Cook’s distances are >1.
par(mfrow = c(2,2))
plot(mod_glp)

14
Std. Pearson resid.
Residuals vs Fitted Normal Q−Q
22 22
Residuals

132 13
2
2

2
−2

−2
1.5 2.5 3.5 4.5 −2 −1 0 1 2

Predicted values Theoretical Quantiles

Std. Pearson resid.

Scale−Location Residuals vs Leverage
22 22
2
13
1.5

2
0.5

Cook's distance
−2
16 1
0.0

1.5 2.5 3.5 4.5 0.0 0.1 0.2 0.3 0.4 0.5

Predicted values Leverage

par(mfrow = c(1, 1))

The model indicates significant effects for 3 parameters: the number of endemic species increases with the
area of the island; it decreases with distance to the nearest island and the area of that nearest island.
summary(mod_glp)

##
## Call:
## glm(formula = Endemics ~ log(Area) + log(Elevation) + log(Nearest) +
## log(Adjacent), family = poisson, data = glp)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4411 -1.7215 0.0330 0.7395 3.8476
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.47684 0.45817 5.406 6.45e-08 ***
## log(Area) 0.27504 0.02922 9.414 < 2e-16 ***
## log(Elevation) 0.01251 0.09251 0.135 0.89243
## log(Nearest) -0.06128 0.02118 -2.894 0.00381 **
## log(Adjacent) -0.04576 0.01099 -4.162 3.15e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)

15
##
## Null deviance: 743.55 on 29 degrees of freedom
## Residual deviance: 95.66 on 25 degrees of freedom
## AIC: 238.61
##
## Number of Fisher Scoring iterations: 5
However, when testing the χ2 , we note that the data is overdispersed.
chisq <- sum((glp$Endemics - fitted(mod_glp))^2/fitted(mod_glp))
1 - pchisq(chisq, df = mod_glp$df.residual)

## [1] 1.109039e-09
disp <- chisq/mod_glp$df.residual
disp

## [1] 3.700441
The estimate of the dispersion parameter is 3.7. When ĉ is not too high (typically, we suggest ĉ < 4), the
√
Poisson regression coefficient estimates remain valid, but we need to multiply their standard errors by ĉ.
In other words, overdispersion does not introduce bias, but increases the uncertainty over the values of the
coefficients. If the overdispersion is very large, it is better to use another model.
To correct the standard errors, just add a dispersion argument when calling the summary function:
summary(mod_glp, dispersion = disp)

##
## Call:
## glm(formula = Endemics ~ log(Area) + log(Elevation) + log(Nearest) +
## log(Adjacent), family = poisson, data = glp)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4411 -1.7215 0.0330 0.7395 3.8476
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.47684 0.88137 2.810 0.00495 **
## log(Area) 0.27504 0.05620 4.894 9.89e-07 ***
## log(Elevation) 0.01251 0.17796 0.070 0.94396
## log(Nearest) -0.06128 0.04074 -1.504 0.13251
## log(Adjacent) -0.04576 0.02115 -2.164 0.03049 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 3.700441)
##
## Null deviance: 743.55 on 29 degrees of freedom
## Residual deviance: 95.66 on 25 degrees of freedom
## AIC: 238.61
##
## Number of Fisher Scoring iterations: 5
Equivalently, the quasipoisson family can be used to automatically estimate the dispersion parameter.

16
mod_quasi <- glm(Endemics ~ log(Area) + log(Elevation) + log(Nearest) + log(Adjacent),
data = glp, family = quasipoisson)
summary(mod_quasi)

##
## Call:
## glm(formula = Endemics ~ log(Area) + log(Elevation) + log(Nearest) +
## log(Adjacent), family = quasipoisson, data = glp)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4411 -1.7215 0.0330 0.7395 3.8476
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.47684 0.88137 2.810 0.00948 **
## log(Area) 0.27504 0.05620 4.894 4.91e-05 ***
## log(Elevation) 0.01251 0.17796 0.070 0.94451
## log(Nearest) -0.06128 0.04074 -1.504 0.14504
## log(Adjacent) -0.04576 0.02115 -2.164 0.04025 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 3.700441)
##
## Null deviance: 743.55 on 29 degrees of freedom
## Residual deviance: 95.66 on 25 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5

Dispersion parameter for the binomial family

For a regression where the response is binomial (number of successes y and failures n − y), it can also be
overdispersed or underdispersed. In this case, the dispersion parameter is estimated in the same way with
the χ2 statistic. You can also use the quasibinomial family with theglm function.
For a binary response (0 or 1), the concept of dispersion parameter does not apply because the variance is
always p(1 − p).

Model selection with overdispersion

In addition to increasing the standard error of the model coefficients, overdispersion also affects model
selection with the AIC. QAIC, the equivalent of the AIC for quasibinomial or quasipoisson type models, is
calculated by dividing the first term of the AIC by the estimate of the dispersion parameter.

2 log L
QAIC = − + 2K
ĉ
Similarly, we can modify the AICc to obtain the QAICc.

2 log L n
QAICc = − + 2K
ĉ n−K −1

17
The ĉ itself counts as an estimated parameter, so the value of K also increases by 1 compared to the model
without overdispersion.
If we compare several models with the QAIC or QIACc, we must use a single value of ĉ, the one calculated
from the most complex model. This value is supplied to the AICcmodavg package functions (for example,
aictab) with the c.hat argument.
For example, consider the following four models to predict the number of endemic species on the Galapagos
Islands.
mod_area <- glm(Endemics ~ log(Area), data = glp, family = poisson)
mod_area_near <- glm(Endemics ~ log(Area) + log(Nearest), data = glp, family = poisson)
mod_area_adj <- glm(Endemics ~ log(Area) + log(Adjacent), data = glp, family = poisson)
mod_comp <- glm(Endemics ~ log(Area) + log(Nearest) + log(Adjacent), data = glp, family = poisson)

Note that we use the poisson family rather than quasipoisson; we will let the model selection function
apply the correction. The dispersion parameter of the complete model can be calculated with the c_hat
function of AICcmodavg. (This value differs a bit from the one above because we have removed the Elevation
predictor.)
library(AICcmodavg)

disp <- c_hat(mod_comp)

disp

## 'c-hat' 3.56 (method: pearson estimator)

Here is the QAICc table for these models:
mod_list <- list(mod_area = mod_area, mod_area_near = mod_area_near,
mod_area_adj = mod_area_adj, mod_comp = mod_comp)
aictab(mod_list, c.hat = disp)

##
## Model selection based on QAICc:
## (c-hat estimate = 3.562208)
##
## K QAICc Delta_QAICc QAICcWt Cum.Wt Quasi.LL
## mod_area_adj 4 76.19 0.00 0.42 0.42 -33.30
## mod_comp 5 76.68 0.49 0.33 0.75 -32.09
## mod_area 3 78.49 2.29 0.13 0.89 -35.78
## mod_area_near 4 78.82 2.63 0.11 1.00 -34.61
Let’s now compare with the version obtained without accounting for overdispersion.
aictab(mod_list)

##
## Model selection based on AICc:
##
## K AICc Delta_AICc AICcWt Cum.Wt LL
## mod_comp 4 238.23 0.00 0.95 0.95 -114.31
## mod_area_adj 3 244.14 5.91 0.05 1.00 -118.61
## mod_area_near 3 253.50 15.27 0.00 1.00 -123.29
## mod_area 2 259.37 21.14 0.00 1.00 -127.46
Not only does the overdispersion change the ranking of the models, but it decreases the difference between
the models: the ∆QAICc are much closer together than the ∆AICc. The four models have a weight greater
than 10% in the first case, while the first model gets 95% of the weight in the second table. Thus, there is

18
more uncertainty about which model is best.

Summary

Poisson regression

• The Poisson distribution represents the number of independent observations of an event in a sampling
interval.
• The Poisson regression model is based on a Poisson distribution of the response around its mean value;
this value is related to the linear predictor by the logarithm function.
• Because of the logarithmic relationship, the additive effects on the linear predictor scale become
multiplicative effects on the scale of the response.
• The binomial and Poisson distributions assume a certain relationship between the mean value of the
response and its variance. Overdispersion occurs when the variance is larger than expected by the
distribution.
• If the dispersion parameter is not too high (ĉ < 4), we √
can still use the Poisson (or binomial) model
and multiply the standard errors of the coefficients by ĉ. For model selection, the AIC(c) must be
replaced by the QAIC(c) to account for overdispersion.

Statistical modeling: general points to remember

We studied different types of models this semester (ANOVA, linear regression, generalized linear models and
mixed models in the next classes). You may have noticed that we apply similar steps in each case.
• Explore the data (look at the distribution of individual variables and correlations);
• Choose a model type and fit it (lm, glm, etc.);
• Verify the model assumptions (e.g. diagnostic graphs, dispersion parameter);
• Interpret the estimated values of the coefficients;
• Visualize model predictions;
• If several candidate models: compare the fit, select the best model or make average predictions from
multiple models.
You should be able to complete each of its steps for the statistical models seen in this course and those that
you will apply to your research.

Count Data Models in SAS
No ratings yet
Count Data Models in SAS
12 pages
Poisson Regression
No ratings yet
Poisson Regression
12 pages
Kaplan-Meier Estimator: Association. The Journal Editor, John Tukey, Convinced Them To Combine Their
No ratings yet
Kaplan-Meier Estimator: Association. The Journal Editor, John Tukey, Convinced Them To Combine Their
7 pages
Analyzing Grouped Data Statistics
No ratings yet
Analyzing Grouped Data Statistics
51 pages
Regression Logistic Regression
100% (1)
Regression Logistic Regression
37 pages
A Confidence Interval For The Median Survival Time
No ratings yet
A Confidence Interval For The Median Survival Time
14 pages
Lecture 10 Randomized Complete Block Design Last Lecture
100% (1)
Lecture 10 Randomized Complete Block Design Last Lecture
4 pages
Statatistical Inferences
No ratings yet
Statatistical Inferences
22 pages
0 Ppt1 Introduction To Biostatistics123
No ratings yet
0 Ppt1 Introduction To Biostatistics123
59 pages
Survival Analysis With STATA 1701597623
No ratings yet
Survival Analysis With STATA 1701597623
252 pages
Spss Poisson Regression
No ratings yet
Spss Poisson Regression
17 pages
Steps in Epidemic Investigation
100% (1)
Steps in Epidemic Investigation
6 pages
Lecture 2A - Biological Variability, Descriptive Stats
No ratings yet
Lecture 2A - Biological Variability, Descriptive Stats
9 pages
(Cox (1972) ) Regression Models and Life Tables PDF
No ratings yet
(Cox (1972) ) Regression Models and Life Tables PDF
35 pages
Dummy Regression
No ratings yet
Dummy Regression
23 pages
Kaplan-Meier and Log-Rank Analysis
No ratings yet
Kaplan-Meier and Log-Rank Analysis
39 pages
Data Analysis with Python
No ratings yet
Data Analysis with Python
38 pages
Survival Analysis
No ratings yet
Survival Analysis
28 pages
Chapter 3 Sampling Methods
No ratings yet
Chapter 3 Sampling Methods
46 pages
Aiken & West (1991) Chap07 PDF
No ratings yet
Aiken & West (1991) Chap07 PDF
14 pages
Oecd TG 488
No ratings yet
Oecd TG 488
23 pages
Social Class and Health Outcomes Analysis
No ratings yet
Social Class and Health Outcomes Analysis
5 pages
13 Pag Design and Analysis of Experiments in The Health Sciences
100% (1)
13 Pag Design and Analysis of Experiments in The Health Sciences
13 pages
Measuring The Occurrence of Disease: Dr. Elijah Kakande MBCHB, MPH Department of Public Health
No ratings yet
Measuring The Occurrence of Disease: Dr. Elijah Kakande MBCHB, MPH Department of Public Health
25 pages
Survival Analysis
No ratings yet
Survival Analysis
30 pages
ROBINS-I Detailed Guidance
No ratings yet
ROBINS-I Detailed Guidance
53 pages
Optimization & Stochastic Theory
No ratings yet
Optimization & Stochastic Theory
29 pages
SIS Model For An Infectious Disease
No ratings yet
SIS Model For An Infectious Disease
3 pages
Survival Analysis Dengan Pendekatan R
No ratings yet
Survival Analysis Dengan Pendekatan R
32 pages
Week 1 (Part 2) - Introduction To Biostatistics and Epidemiology
No ratings yet
Week 1 (Part 2) - Introduction To Biostatistics and Epidemiology
4 pages
Odds Ratio, Hazard Ratio and Relative Risk: Janez Stare Delphine Maucort-Boulch
No ratings yet
Odds Ratio, Hazard Ratio and Relative Risk: Janez Stare Delphine Maucort-Boulch
9 pages
Longitudinal Data Analysis
100% (1)
Longitudinal Data Analysis
103 pages
Stata 11: GMM Estimation Guide
No ratings yet
Stata 11: GMM Estimation Guide
29 pages
CT 4201304
No ratings yet
CT 4201304
26 pages
Advanced Survival Analysis Guide
No ratings yet
Advanced Survival Analysis Guide
71 pages
Cox Proportional Hazard Model Overview
No ratings yet
Cox Proportional Hazard Model Overview
34 pages
Biostatistics Lecture Notes Overview
50% (4)
Biostatistics Lecture Notes Overview
36 pages
Cross Sectional Studies 1
No ratings yet
Cross Sectional Studies 1
49 pages
Advanced Statistical Distributions
No ratings yet
Advanced Statistical Distributions
13 pages
Epid 600 Class 4 Measures of Association
No ratings yet
Epid 600 Class 4 Measures of Association
59 pages
Statistics for Educators & Analysts
100% (1)
Statistics for Educators & Analysts
5 pages
Introduction to Survival Analysis
No ratings yet
Introduction to Survival Analysis
54 pages
Binomial Distribution
No ratings yet
Binomial Distribution
26 pages
Different Types of Study Designs
No ratings yet
Different Types of Study Designs
42 pages
Understanding Logistic Regression Basics
100% (7)
Understanding Logistic Regression Basics
21 pages
Competency 6
No ratings yet
Competency 6
7 pages
Logistic Regression Survival Analysis Kaplan-Meier
No ratings yet
Logistic Regression Survival Analysis Kaplan-Meier
13 pages
Survival Analysis for Academics
No ratings yet
Survival Analysis for Academics
133 pages
151 Practice Final 1
100% (1)
151 Practice Final 1
11 pages
Exam With Model Answers
No ratings yet
Exam With Model Answers
4 pages
The Three MS: Analysis Data
No ratings yet
The Three MS: Analysis Data
5 pages
Poisson Regression Guide
No ratings yet
Poisson Regression Guide
15 pages
Poisson
No ratings yet
Poisson
54 pages
Chapter 11 Generalized
No ratings yet
Chapter 11 Generalized
28 pages
6 - Poisson Reg
No ratings yet
6 - Poisson Reg
46 pages
Shorten - Count Data Analysis
No ratings yet
Shorten - Count Data Analysis
24 pages
Countdata2018 2
No ratings yet
Countdata2018 2
23 pages
Binomial vs Poisson Distributions in Biology
No ratings yet
Binomial vs Poisson Distributions in Biology
3 pages
Lect 12
No ratings yet
Lect 12
36 pages
Assumptions
No ratings yet
Assumptions
9 pages
CEV Option
No ratings yet
CEV Option
55 pages
Generalized Fractional Brownian Motion (GFBM)
No ratings yet
Generalized Fractional Brownian Motion (GFBM)
48 pages
07 GLM
No ratings yet
07 GLM
49 pages
Lectures Notes
No ratings yet
Lectures Notes
173 pages
Highway Safety for Civil Engineers
No ratings yet
Highway Safety for Civil Engineers
14 pages
Vasili Kandinsky
No ratings yet
Vasili Kandinsky
2 pages
Identity and the End of Irony
No ratings yet
Identity and the End of Irony
8 pages
F. Guanzon and Sons Inc. v. Register of Deeds of Manila
No ratings yet
F. Guanzon and Sons Inc. v. Register of Deeds of Manila
4 pages
Edwards Apg Active Pirani Gauges
No ratings yet
Edwards Apg Active Pirani Gauges
2 pages
Surya in 12 Houses: Vedic Insights
No ratings yet
Surya in 12 Houses: Vedic Insights
6 pages
CS 228 Revision 2 Skidding Resistance
No ratings yet
CS 228 Revision 2 Skidding Resistance
98 pages
Framework of Accounting
No ratings yet
Framework of Accounting
11 pages
Piezoelectric Actuators (Joshua E. Segel) (Z-Library)
No ratings yet
Piezoelectric Actuators (Joshua E. Segel) (Z-Library)
156 pages
赚钱的艺术art of Money Getting
No ratings yet
赚钱的艺术art of Money Getting
47 pages
Periodontics and Host Modulation
No ratings yet
Periodontics and Host Modulation
24 pages
Gloria Macapagal Arroyo Pres
No ratings yet
Gloria Macapagal Arroyo Pres
23 pages
Rslogix 5000 Software Release History: Installation Notes
No ratings yet
Rslogix 5000 Software Release History: Installation Notes
79 pages
Report Your Landing - Worker Streams: New Brunswick Provincial Nominee Program (NBPNP)
No ratings yet
Report Your Landing - Worker Streams: New Brunswick Provincial Nominee Program (NBPNP)
3 pages
IT Leadership & Service Delivery
No ratings yet
IT Leadership & Service Delivery
3 pages
Volodymyr Zelenskyy: Ukraine's 6th President
No ratings yet
Volodymyr Zelenskyy: Ukraine's 6th President
2 pages
Polyester Yarn Tech for Industry
No ratings yet
Polyester Yarn Tech for Industry
8 pages
Technical Service Report
No ratings yet
Technical Service Report
1 page
Curriculum Input Adaptation
No ratings yet
Curriculum Input Adaptation
7 pages
Hindu Marriage Act: Maintenance Rights Explained
No ratings yet
Hindu Marriage Act: Maintenance Rights Explained
12 pages
Bainite Transformation in Medium Carbon Steel
No ratings yet
Bainite Transformation in Medium Carbon Steel
11 pages
Before Mestizaje: Published Online by Cambridge University Press
No ratings yet
Before Mestizaje: Published Online by Cambridge University Press
308 pages
Problem Unites Us, Religion Divides Us
No ratings yet
Problem Unites Us, Religion Divides Us
1 page
Future Career and Life Goals Essay
No ratings yet
Future Career and Life Goals Essay
1 page
GNM 1st Year Results - October 2015
No ratings yet
GNM 1st Year Results - October 2015
830 pages
Group Work Economics FINAL
No ratings yet
Group Work Economics FINAL
22 pages
Lesson 2 Prelim
No ratings yet
Lesson 2 Prelim
6 pages
HSK 4 Nouns 91 120
No ratings yet
HSK 4 Nouns 91 120
11 pages
Falcon Xa Series
No ratings yet
Falcon Xa Series
770 pages