Chapter 11: Generalized Linear Models for Count Data
Mark Andrews
Contents
Poisson regression
    Maximum likelihood estimation
    Poisson regression using R
    Prediction in Poisson regression
    Model comparison
    Bayesian approaches to Poisson regression
References
In the previous chapter, we covered logistic regression models, each of which is a type of generalized linear model. Generalized linear models are regression models that extend the normal linear model so that they can model data that is not normally distributed around a mean that is a linear function of predictors. The general definition of a generalized linear model is as follows. Assuming that our data is

(y1, ~x1), (y2, ~x2) . . . (yn, ~xn),

where y1, y2 . . . yn are the observed values of an outcome variable and ~x1, ~x2 . . . ~xn are the corresponding vectors of predictors, with each ~xi = [x1i, x2i . . . xki . . . xKi], a generalized linear model of this data is

yi ∼ D(θi, ψ),   f(θi) = β0 + Σ_{k=1}^{K} βk xki,   for i ∈ 1 . . . n,
where D(θi , ψ) is some probability distribution centered at θi and with an optional parameter ψ that controls
the scale or shape of the distribution, and where f is a monotonic (and thus invertible) link function. As an
example, we've already seen that the binary logistic regression is

yi ∼ Bernoulli(θi),   logit(θi) = β0 + Σ_{k=1}^{K} βk xki,   for i ∈ 1 . . . n.

Thus, in this case, D(θi, ψ) is Bernoulli(θi), so there is no optional ψ parameter, and the link function is the logit function.
In this chapter, we will cover some other generalized linear models, specifically ones designed to model count data. Count data is simply data of counts, or the number of times something has happened. Examples of count data are widespread: the number of car accidents that occur in a region each day (or week, year, etc.); the number of extramarital affairs that a married person has in a year; the number of times a person visits a doctor in a year; and so on. Count data must be non-negative integers: counts can take a value of 0, but not negative values, and they must be whole numbers. Although there are many models that could be considered under this general topic, we will cover just some of them, focusing on ones that are very widely used or otherwise very useful. In particular, we will cover Poisson regression, negative binomial regression, and so-called zero-inflated count models, particularly zero-inflated Poisson regression.
Poisson regression
In Poisson regression, our data are
(y1 , ~x1 ), (y2 , ~x2 ) . . . (yi , ~xi ) . . . (yn , ~xn ),
as described above, and where each yi ∈ {0, 1, 2, . . .}. In other words, yi takes on a non-negative integer value that specifically represents a count, or the number of times something happened in a period of time.
Figure 1: Poisson distributions with rate parameters λ = 3.5, λ = 5, λ = 10, and λ = 15.
In order to deal with count outcome data in a regression framework, we first must use an appropriate
probability distribution as a model of the outcome variable. A default choice here is the Poisson distribution.
A Poisson distribution is a probability distribution over non-negative integers. If x is a Poisson random variable, it takes on values k ∈ {0, 1, . . .}, and the probability that x = k is

Poisson(x = k|λ) = P(x = k|λ) = (e^{−λ} λ^k) / k!.
Here, λ is the Poisson distribution's single parameter. Usually denoted by λ, it is known as the rate or the rate parameter, and it gives the average value of the random variable x. Unlike the values of x, which must be non-negative integers, λ is not constrained to be an integer; it is only constrained to be non-negative.
In Figure 1, we plot four different Poisson distributions, which differ from one another by the value of λ.
The Poisson distribution can be understood as the limit of a binomial distribution. The binomial distribution is also a distribution over counts, but one where there is a fixed number of opportunities, known as trials and signified by n, for an event, known as a success, to happen, and where the probability of a success, signified by θ, is independent and identical on each trial. As the number of trials n tends to infinity while θ · n is held constant, the binomial distribution tends to a Poisson distribution with parameter λ = θ · n. This phenomenon is illustrated in Figure 2.
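This limiting behaviour is easy to check numerically in R by comparing binomial and Poisson probabilities directly; a small sketch (the values of n and λ = θ · n are arbitrary choices for illustration):

```r
lambda <- 5   # theta * n, held constant
ks <- 0:25

# As n grows with theta = lambda / n, the binomial pmf approaches the Poisson pmf
for (n in c(10, 20, 100, 1000)) {
  max_diff <- max(abs(dbinom(ks, size = n, prob = lambda / n) - dpois(ks, lambda)))
  cat('n =', n, ': largest pointwise difference =', signif(max_diff, 3), '\n')
}
```

The printed differences shrink towards zero as n increases.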
Figure 2: a) Binomial distributions with increasing values of n, which denote the number of trials, but where
θ · n is held constant at θ · n = 5. b) A Poisson distribution with λ = 5.
This relationship between the binomial distribution and the Poisson distribution helps us to understand why the Poisson distribution commonly occurs in the natural and social world. In situations where events occur independently with fixed probability θ, when the number of opportunities for these events to occur is very large but θ is very low, the distribution of the number of times an event occurs tends to a Poisson distribution. As an example, the number of occasions when a car accident can occur on any given day is extremely high, yet the probability of an accident occurring on any one of these occasions is very low, and so the resulting distribution of the number of car accidents is well described by a Poisson distribution.
In Poisson regression, we assume that each yi is distributed as a Poisson distribution with parameter λi, and we assume that λi is determined by a function of ~xi. For reasons analogous to those in logistic regression, we cannot have each λi be a linear function of ~xi: each λi is constrained to take non-negative values only, and in general, if we allow λi to be a linear function of ~xi, we cannot guarantee that it will be non-negative. For this reason, again analogously to logistic regression, we transform λi to another variable φi, which can take any value in R, and then treat φi as the linear function of ~xi. For this, as before, we need an invertible link function f : R⁺ → R that maps any value of λi to a unique value of φi, and vice versa. We have a number of options for f, but the default choice is simply the natural logarithm function:

φi = f(λi) = log(λi).
Modelling φi as a linear function of the predictors then gives

yi ∼ Poisson(λi),   λi = e^{φi},   φi = β0 + Σ_{k=1}^{K} βk xki,   for i ∈ 1 . . . n.

Thus, in the case of Poisson regression, D(θi, ψ) is Poisson(λi), where we follow convention and use λi instead of θi as the location parameter; there is no optional ψ parameter, and the link function f is the log function.
As an example of a problem seemingly suited to a Poisson regression model, we will use the following data set.
lbw_df <- read_csv('data/lbw.csv')
lbw_df
#> # A tibble: 189 x 11
#> ...1 low smoke race age lwt ptl ht ui ftv bwt
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 0 2 19 182 0 0 1 0 2523
#> 2 2 0 0 3 33 155 0 0 0 3 2551
#> 3 3 0 1 1 20 105 0 0 0 1 2557
#> 4 4 0 1 1 21 108 0 0 1 2 2594
#> 5 5 0 1 1 18 107 0 0 1 0 2600
#> 6 6 0 0 3 21 124 0 0 0 0 2622
#> 7 7 0 0 1 22 118 0 0 0 1 2637
#> 8 8 0 0 3 17 103 0 0 0 1 2637
#> 9 9 0 1 1 29 123 0 0 0 1 2663
#> 10 10 0 1 1 26 113 0 0 0 0 2665
#> # ... with 179 more rows
This gives us data relating to low birth weight infants. One variable in this data set is ftv, which is the number of visits to the doctor by the mother in her first trimester of pregnancy. In Figure 3, we show the distribution of values of ftv as a function of the age of the mother, which we have grouped by age tercile. There, we see that the distribution shifts upwards as we go from the 1st to the 2nd to the 3rd age tercile. Thus, we could model ftv as a Poisson variable whose mean varies as a function of age, as well as other potentially interesting explanatory variables.
Figure 3: The number of visits to a doctor in the first trimester of pregnancy for each age tercile.
In general, in Poisson regression, we model a count response variable as a Poisson distribution whose parameter
λ varies by a set of explanatory variables. More precisely, we model the log of λ as a linear function of the
explanatory variables.
In more detail, the likelihood function is

P(~y|X, β~) = ∏_{i=1}^{n} (e^{−λi} λi^{yi}) / yi!,

where

λi = e^{~xi β~} = e^{β0 + Σ_{k=1}^{K} βk xki},

and where ~y = [y1, y2 . . . yn]ᵀ, β~ = [β0, β1 . . . βK]ᵀ, and X is the matrix of n stacked row vectors [1, ~x1], [1, ~x2] . . . [1, ~xn], that is, each predictor vector prefixed by a 1, so that X has K + 1 columns. The logarithm of the likelihood is then

L(β~|~y, X) = Σ_{i=1}^{n} (yi log(λi) − λi − log(yi!)).

The maximum likelihood estimator is the value of β~ that maximizes this function. We obtain this by calculating the gradient of L(β~|~y, X) with respect to β~, setting this equal to zero, and solving for β~. As with the logistic regression, this is done using a Newton-Raphson method, and the resulting estimator is labelled β̂. Similarly to the case of the logistic regression, the asymptotic sampling distribution of β̂ is

β̂ ∼ N(β~, (XᵀW X)⁻¹),

where W is an n × n diagonal matrix whose ith diagonal element is

λ̂i = e^{~xi β̂}.

This entails that

(β̂k − βk) / √([(XᵀW X)⁻¹]kk) ∼ N(0, 1),

where √([(XᵀW X)⁻¹]kk) is the standard error term ŝe_k. This is the basis for hypothesis testing and confidence intervals for the coefficients.
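The calculations in the remainder of this section assume a fitted Poisson regression of ftv on age; a minimal sketch of that fit, assuming the lbw_df data loaded above (the names lbw_m, estimates, and std_err are those used in the calculations that follow):

```r
# Poisson regression of first-trimester doctor visits on mother's age
lbw_m <- glm(ftv ~ age,
             family = poisson(link = 'log'),
             data = lbw_df)

# Coefficient estimates and their standard errors
estimates <- coef(lbw_m)
std_err <- summary(lbw_m)$coefficients[, 'Std. Error']
```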
(z_stat <- estimates / std_err)
#> (Intercept) age
#> -3.955444 3.509178
The corresponding p-values are given by Pr(>|z|) in the summary output, which is easily verified as the probability of getting a value as or more extreme than the z statistic in a standard normal distribution, as we see in the following.
2 * pnorm(abs(z_stat), lower.tail = F)
#> (Intercept) age
#> 7.639269e-05 4.494944e-04
The 95% confidence interval for age is as follows.
confint.default(lbw_m, parm='age')
#> 2.5 % 97.5 %
#> age 0.02176194 0.07682551
We can confirm that this is β̂k ± ŝe_k · ζ(0.975), where ζ(0.975) is the 0.975 quantile of the standard normal distribution.
estimates['age'] + c(-1, 1) * std_err['age'] * qnorm(0.975)
#> [1] 0.02176194 0.07682551
For example, the predicted log of the rate for mothers aged 20, 25, 30 is easily calculated as follows.
estimates['(Intercept)'] + estimates['age'] * c(20, 25, 30)
#> [1] -0.42689167 -0.18042304 0.06604559
And so the predicted rates for these women are as follows.
exp(estimates['(Intercept)'] + estimates['age'] * c(20, 25, 30))
#> [1] 0.6525342 0.8349169 1.0682754
As we have seen above, these calculations are easier using the predict function. There, we have the option of obtaining these predictions on the linear scale, the default, or by using type = 'response' to give predictions after the inverse of the link function is applied.
lbw_df_new <- tibble(age = c(20, 25, 30))
predict(lbw_m, newdata = lbw_df_new)
#> 1 2 3
#> -0.42689167 -0.18042304 0.06604559
predict(lbw_m, newdata = lbw_df_new, type = 'response')
#> 1 2 3
#> 0.6525342 0.8349169 1.0682754
We also saw that these predictions can be even more easily performed using add_predictions.
lbw_df_new %>%
add_predictions(lbw_m)
#> # A tibble: 3 x 2
#> age pred
#> <dbl> <dbl>
#> 1 20 -0.427
#> 2 25 -0.180
#> 3 30 0.0660
lbw_df_new %>%
add_predictions(lbw_m, type='response')
#> # A tibble: 3 x 2
#> age pred
#> <dbl> <dbl>
#> 1 20 0.653
#> 2 25 0.835
#> 3 30 1.07
Given that φ̂ι = ~xι β̂ and that β̂ has the multivariate normal distribution stated above, φ̂ι will have the following sampling distribution:

φ̂ι ∼ N(~xι β~, ~xι (XᵀW X)⁻¹ ~xιᵀ),

where ŝe²ι = ~xι (XᵀW X)⁻¹ ~xιᵀ is the squared standard error of the prediction.
Using the se.fit = TRUE option in predict, we can obtain the standard errors for prediction.
predict(lbw_m, newdata = lbw_df_new, se.fit = T)$se.fit
#> 1 2 3
#> 0.10547495 0.08172315 0.10999279
We can verify that these are calculated as stated above.
x_iota <- model_matrix(lbw_df_new, ~ age) %>%
  as.matrix()
sqrt(diag(x_iota %*% vcov(lbw_m) %*% t(x_iota)))
Model comparison
Just as we did in the case of binary logistic regression, we can compare nested Poisson regression models using a likelihood ratio test. If we have one model with a set of K predictors and another model with K′ < K predictors, the null hypothesis when comparing these two models is that the coefficients of all the K − K′ predictors in the larger model but not in the smaller one are simultaneously zero. In other words, if the larger model M1 has K predictors whose coefficients are β0, β1 . . . βK′ . . . βK, and the smaller model M0 has K′ predictors whose coefficients are β0, β1 . . . βK′, then the null hypothesis is

βK′+1 = βK′+2 = . . . = βK = 0.

We can test this null hypothesis by comparing the maximum of the likelihood of the model with the K predictors to that of the model with the K′ predictors. Under this null hypothesis, −2 times the log of the likelihood ratio of the models will be distributed as χ²_{K−K′}.
As an example, consider the model whose predictor variables are age, low, and smoke, where low is a binary variable that indicates if the birth weight of the newborn infant was low (low = 1) or not (low = 0), and smoke is a binary variable that indicates if the pregnant woman was a smoker (smoke = 1) or not (smoke = 0). We will denote this model with three predictors by M1. We can then compare this to the model with age alone, which we will denote by M0. The null hypothesis when comparing M1 and M0 is that the coefficients for low and smoke are both zero. To test this null hypothesis, we calculate L1 and L0, which are the likelihoods of M1 and M0 evaluated at their maximum. According to the null hypothesis,

−2 log(L0 / L1) ∼ χ²₂,

where the degrees of freedom of 2 is the difference between the number of predictors in M1 and M0. We can calculate −2 times the log of the likelihood ratio by the difference of the deviances:

∆D = −2 log(L0 / L1) = D0 − D1.
lbw_m_1 <- glm(ftv ~ age + low + smoke, family = poisson(link = 'log'), data = lbw_df)
delta_deviance <- deviance(lbw_m) - deviance(lbw_m_1)
pchisq(delta_deviance, df = 2, lower.tail = F)
#> [1] 0.8285266
This null hypothesis test can be performed more easily with the generic anova function.
anova(lbw_m_1, lbw_m, test='Chisq')
#> Analysis of Deviance Table
#>
#> Model 1: ftv ~ age + low + smoke
#> Model 2: ftv ~ age
#> Resid. Df Resid. Dev Df Deviance Pr(>Chi)
#> 1 185 252.58
#> 2 187 252.96 -2 -0.37621 0.8285
From this result, we cannot reject the null hypothesis that the coefficients for low and smoke are simultaneously zero. Put less formally, the model with age, low, and smoke is not significantly better at predicting the ftv outcome variable than the model with age alone, and so we can conclude that low and smoke are not significant predictors of ftv, at least when age is known.
In a Bayesian approach to Poisson regression, the model is as before, but now our aim is to infer the posterior distribution over β~ = [β0, β1 . . . βK]:

P(β~|~y, X) ∝ P(~y|X, β~) × P(β~),

where P(~y|X, β~) is the likelihood and P(β~) is the prior.
Just as was the case with binary logistic regression, there is no analytic solution for the posterior distribution, and so numerical methods are necessary. As we explained already, a powerful and general numerical method is Markov Chain Monte Carlo (MCMC). Practically, the most powerful general purpose MCMC Bayesian modelling software is the probabilistic programming language Stan, for which we have the extremely easy to use R interface package brms.
We perform a Bayesian Poisson regression model of the lbw data with outcome variable ftv and predictor
age using brms as follows.
lbw_m_bayes <- brm(ftv ~ age,
family = poisson(link = 'log'),
data = lbw_df)
The priors are very similar to those used by default in the logistic regression analysis above:
prior_summary(lbw_m_bayes)
#> prior class coef group resp dpar nlpar lb ub
#> (flat) b
#> (flat) b age
#> student_t(3, -2.3, 2.5) Intercept
#> source
#> default
#> (vectorized)
#> default
A uniform prior is used for the coefficient for age and a non-standard t-distribution for the intercept term. Using these priors, and just as with the binary logistic regression, the default settings give us 4 chains, each with 2000 iterations, where the initial 1000 iterations of each chain are discarded as warm-up, leaving 4000 total samples from the posterior.
We can view the summary of the posterior distribution as follows.
summary(lbw_m_bayes)$fixed
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept -1.40625739 0.37027631 -2.15780630 -0.68461735 0.9998975 2458.783 2171.800
#> age 0.04864108 0.01451812 0.01959422 0.07733883 1.0003270 2694.482 2536.169
As we can see, the Rhat values close to 1 and the relatively high ESS values indicate that the sampler has converged and mixed well. As was the case with binary logistic regression, the mean and standard deviation of the posterior distribution very closely match the maximum likelihood estimate and the standard error of its sampling distribution. Likewise, the 95% high posterior density interval closely matches the 95% confidence interval.
The posterior distribution over the predicted value of φι = ~xι β~, where ~xι is a vector of values of the predictors, can be obtained similarly to the case of binary logistic regression:
posterior_linpred(lbw_m_bayes, newdata = lbw_df_new) %>%
as_tibble() %>%
map_df(~quantile(., probs=c(0.025, 0.5, 0.975))) %>%
as.matrix() %>%
t() %>%
as_tibble() %>%
set_names(c('l-95% CI', 'prediction', 'u-95% CI')) %>%
bind_cols(lbw_df_new, .)
#> # A tibble: 3 x 4
#> age `l-95% CI` prediction `u-95% CI`
#> <dbl> <dbl> <dbl> <dbl>
#> 1 20 -0.650 -0.353 -0.168
#> 2 25 -0.431 -0.188 0.0558
#> 3 30 -0.235 -0.0356 0.259
As we can see, these are very close to the confidence intervals for predictions in the classical approach.
In addition to the posterior distribution over φι = ~xι β~, we can also calculate the posterior predictive distribution, which is defined as follows:

P(yι|~xι, ~y, X) = ∫ P(yι|~xι, β~) P(β~|~y, X) dβ~,

where P(β~|~y, X) is the posterior distribution and

P(yι|~xι, β~) = (e^{−λι} λι^{yι}) / yι!,   where λι = e^{φι}, φι = ~xι β~.

The posterior predictive distribution returns a probability distribution over the counts 0, 1, 2 . . ., just like a Poisson distribution, but it essentially averages over all possible values of β~ according to the posterior distribution.
Using Stan/brms, the posterior_predict function can be used to draw samples from the posterior predictive distribution, one for each sample from the posterior.

pp_samples <- posterior_predict(lbw_m_bayes, newdata = lbw_df_new)

This returns a matrix of 4000 rows and 3 columns, where each element of each column is a sample from P(yι|~xι, β~) for one draw of β~ from the posterior, and each column corresponds to one of the values of age that we are making predictions about. We plot the histograms of these samples for each value of age in Figure 4.
Figure 4: Posterior predictive distribution of the number of visits to the doctor by women of different ages.
Figure 5: A series of Poisson distributions with increasing means. As the mean of the distribution increases,
so too does the variance.
Figure 6: A histogram of the number of publications by PhD students in the field of bio-chemistry.
Mq <- glm(publications ~ 1,
family=quasipoisson(link = 'log'),
data = biochemists_df)
summary(Mq)$coefficients
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.5264408 0.03761239 13.99647 1.791686e-40
Note that the standard error is now 0.038. The quasi-Poisson model estimates an overdispersion parameter, which is roughly the ratio of the variance to the mean, and multiplies the standard error by its square root. In this example, the overdispersion parameter is estimated to be 2.191. This value is returned in the summary output and can be obtained directly as follows.
summary(Mq)$dispersion
#> [1] 2.191389
We can see that this is very close to the ratio of the variance of publications to its mean.
with(biochemists_df, var(publications) / mean(publications))
#> [1] 2.191358
The square root of this is 1.48. Multiplying this by the standard error of the Poisson model leads to 0.038.
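The fact that the quasi-Poisson dispersion estimate is essentially the variance-to-mean ratio can be checked on simulated data; a sketch using simulated overdispersed counts (not the biochemists data):

```r
set.seed(10)

# Simulate overdispersed counts: negative binomial data has variance > mean
y <- rnbinom(1000, size = 2, mu = 1.7)

m_qp <- glm(y ~ 1, family = quasipoisson(link = 'log'))

# The estimated overdispersion parameter ...
summary(m_qp)$dispersion

# ... is essentially the variance-to-mean ratio of the outcome
var(y) / mean(y)
```

For an intercept-only model the two quantities agree exactly, because the Pearson dispersion statistic reduces to the sample variance divided by the sample mean.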
An alternative and more principled approach to modelling overdispersed count data is to use a negative binomial regression model. As we will see, there are close links between the Poisson and negative binomial models, but for simplicity, we can view the negative binomial distribution as similar to a Poisson distribution but with an additional dispersion parameter.
The number of successes in a fixed number of such trials follows a binomial distribution. In this example, an outcome of Heads is regarded as a success and each flip is a trial. The probability mass function for a binomial random variable y is as follows:

Binomial(y = m|n, θ) = P(y = m|n, θ) = C(n, m) θ^m (1 − θ)^{n−m},

where C(n, m) = n!/(m!(n − m)!) is the binomial coefficient.
By contrast to a binomial distribution, a negative binomial distribution gives the probability distribution over the number of failures before r successes in a set of independent binary outcome (success or failure) trials where the probability of a success is, as before, a fixed constant θ. Again consider the coin flipping scenario. The negative binomial distribution gives the probability of observing any number of Tails (failures) before r Heads (successes) occur. For example, in Figure 8, we show a set of negative binomial distributions, with each one giving the probability distribution over the number of failures, e.g. Tails, that occur before we observe r successes, e.g. Heads, when the probability of a success is θ, for different values of r and θ.
The probability mass function for a negative binomial random variable y, with parameters r and θ, is

NegBinomial(y = k|r, θ) = P(y = k|r, θ) = C(r + k − 1, k) θ^r (1 − θ)^k,

where C(r + k − 1, k) is a binomial coefficient, or more generally

NegBinomial(y = k|r, θ) = P(y = k|r, θ) = (Γ(r + k) / (Γ(r) k!)) θ^r (1 − θ)^k,

where Γ() is the Gamma function (note that Γ(n) = (n − 1)!). In the negative binomial distribution, the mean of the distribution is

µ = ((1 − θ)/θ) × r,

and so

θ = r/(r + µ),

and so we can equivalently parameterize the distribution by µ and r.
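In R, dnbinom accepts either parameterization, via prob (our θ) or mu (our µ), so the relationship µ = r(1 − θ)/θ can be checked directly; a small sketch with arbitrary values of r and θ:

```r
r <- 2
theta <- 0.25
mu <- (1 - theta) / theta * r  # the mean implied by r and theta

# The theta-based and mu-based parameterizations give identical probabilities
p_prob <- dnbinom(0:10, size = r, prob = theta)
p_mu <- dnbinom(0:10, size = r, mu = mu)
all.equal(p_prob, p_mu)
#> [1] TRUE
```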
A negative binomial distribution is equivalent to a weighted sum of Poisson distributions. We illustrate this in Figure 9, where we show an average of four different Poisson distributions. More precisely, the negative
Figure 8: Negative binomial distributions with parameters a) r = 2 and θ = 0.25, b) r = 3 and θ = 0.25, c)
r = 4 and θ = 0.5, and d) r = 7 and θ = 0.5.
binomial distribution with parameters r and θ is an infinite mixture of Poisson distributions with all possible values of λ from 0 to ∞, where the weighting distribution is a gamma distribution with shape parameter r and scale parameter s = (1 − θ)/θ:

NegBinomial(y = k|r, θ) = ∫₀^∞ Poisson(y = k|λ) Gamma(λ|α = r, s = (1 − θ)/θ) dλ.
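This mixture identity can be verified numerically by integrating the Poisson pmf against the gamma weighting distribution; a sketch with arbitrary values of r and θ (note the gamma scale s = (1 − θ)/θ, consistent with the mean µ = r(1 − θ)/θ given above):

```r
r <- 2
theta <- 0.25
s <- (1 - theta) / theta  # gamma scale parameter

# Integrate Poisson(k | lambda) against the Gamma(shape = r, scale = s) weighting
mixture_pmf <- function(k) {
  integrate(function(lambda) dpois(k, lambda) * dgamma(lambda, shape = r, scale = s),
            lower = 0, upper = Inf)$value
}

round(sapply(0:5, mixture_pmf), 6)
round(dnbinom(0:5, size = r, prob = theta), 6)  # agrees with the mixture
```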
We have seen that a Poisson distribution arises when there is a large number of opportunities for an event to
happen but a low probability of it happening on any one of those opportunities. Given that the negative
binomial is a weighted average of Poisson distributions, we can now see that it arises from a similar process
as the Poisson distribution, but where there is a probability distribution (specifically a gamma distribution)
over the probability of the event happening on any one opportunity.
In other words, our probability distribution for outcome variable yi is a negative binomial distribution whose mean is µi, and which has an additional parameter r, whose value is a fixed but unknown constant. The
Figure 9: A negative binomial distribution is an infinite weighted sum of Poisson distributions, where the weighting distribution is a gamma distribution. In a) we show a set of four different Poisson distributions and their (unweighted) average. In b) we show a gamma distribution with shape parameter r = 2 and scale parameter s = (1 − θ)/θ, where θ = 0.25.
link function is, like in the case of the Poisson model, the natural logarithm, and so we model the natural
logarithm of µi as a linear function of the K predictors.
In R, we perform negative binomial regression using the glm.nb function from the MASS package, and we use
glm.nb very similarly to how we have used other regression functions in R like lm and glm. For example,
we perform the intercept only negative binomial regression on the biochemists_df’s publication data as
follows.
library(MASS)
Mnb <- glm.nb(publications ~ 1, data = biochemists_df)
The coefficients for the regression model are obtained as per usual.
summary(Mnb)$coefficients
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.5264408 0.03586252 14.67942 8.734017e-49
From this, we can see that the estimate of µ, which is identical for all i, is

µ = e^{0.526} = 1.693.

Using the relationship between the probability θ and µ and r from above, we have

θ = r/(r + µ) = 1.706/(1.706 + 1.693) = 0.502.

In other words, our model (using no predictor variables) of the distribution of the number of publications by PhD students in bio-chemistry is estimated to be a negative binomial distribution with parameters θ = 0.502 and r = 1.706. This distribution is shown in Figure 10.
Figure 10: A negative binomial distribution with parameters θ = 0.502 and r = 1.706. This is the estimated
model of the distribution of the number of publications by PhD students in bio-chemistry.
Now let us use a negative binomial regression model with predictors. Specifically, we will use gender as a predictor of the mean of the distribution of the number of publications.
Mnb1 <- glm.nb(publications ~ gender, data=biochemists_df)
summary(Mnb1)$coefficients
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.6326491 0.04716825 13.412604 5.101555e-41
#> genderWomen -0.2471766 0.07203652 -3.431268 6.007661e-04
Prediction in negative binomial regression works exactly like in Poisson regression. We can extract the
estimates of the coefficients using coef.
estimates <- coef(Mnb1)
In this model, the two values for gender are Men and Women, with Men being the base level of the binary dummy code corresponding to gender in the regression. Thus, the predicted log of the mean of the number of publications for men is 0.633, and for women it is 0.633 − 0.247 = 0.385, and so the predicted means for men and women are e^{0.633} = 1.883 and e^{0.633−0.247} = 1.47, respectively. These predictions can be made more easily using the predict or add_predictions functions. For example, to get the predicted logs of the means, we do the following.
tibble(gender = c('Men', 'Women')) %>%
add_predictions(Mnb1)
#> # A tibble: 2 x 2
#> gender pred
#> <chr> <dbl>
#> 1 Men 0.633
#> 2 Women 0.385
The predicted means can be obtained as follows.
tibble(gender = c('Men', 'Women')) %>%
add_predictions(Mnb1, type= 'response')
#> # A tibble: 2 x 2
#> gender pred
#> <chr> <dbl>
#> 1 Men 1.88
#> 2 Women 1.47
The negative binomial distributions corresponding to these means are shown in Figure 11.
In a negative binomial regression, for any predictor k, e^{βk} has the same interpretation as it would have in a Poisson regression, namely the factor by which the mean of the outcome variable increases for a unit change in the predictor. The coefficient corresponding to gender is −0.247, and so e^{−0.247} = 0.781 is the factor by which the mean of the number of publications changes as we go from men to women. Obviously, this is a value less than 1, and so we see that the mean decreases as we go from men to women.
Figure 11: The estimated negative binomial distributions of the number of publications by male and female
PhD students in bio-chemistry.
As in the case of logistic and Poisson regression, we estimate the coefficients using maximum likelihood estimation. In addition, we also estimate the value of r by maximum likelihood. Once we have the maximum likelihood estimates for all the parameters, we can calculate the log of the likelihood function at its maximum, or the deviance, which is −2 times the maximized log likelihood. For a glm.nb model, the value of the log of the likelihood can be obtained by logLik.
logLik(Mnb1)
#> 'log Lik.' -1604.082 (df=3)
This deviance is not the value reported as the deviance in the summary output, nor by using the deviance function. Instead, we multiply the log-likelihood by −2, or equivalently use the negative of the value of the twologlik attribute of the model.
c(-2 * logLik(Mnb1), -Mnb1$twologlik)
#> [1] 3208.165 3208.165
We can compare nested negative binomial regressions using the generic anova function as we did with logistic
regression or Poisson regression. For example, here we compare models Mnb and Mnb1.
anova(Mnb, Mnb1)
#> Likelihood ratio tests of Negative Binomial Models
#>
#> Response: publications
#> Model theta Resid. df 2 x log-lik. Test df LR stat. Pr(Chi)
#> 1 1 1.706205 914 -3219.873
#> 2 gender 1.760904 913 -3208.165 1 vs 2 1 11.70892 0.0006220128
This layout is not identical to the one produced when we used glm based models. However, it is easy to verify that the value of LR stat. is the difference of the deviances of the two models.
deviances <- -c(Mnb$twologlik, Mnb1$twologlik)
deviances
#> [1] 3219.873 3208.165
deviances[1] - deviances[2]
#> [1] 11.70892
Thus we have

∆D = D0 − D1 = 3219.873 − 3208.165 = 11.709.

Under the null hypothesis that the two models have equal predictive power, ∆D is distributed as a χ² distribution with degrees of freedom equal to the difference in the number of parameters between the two models, which is 1 in this case. Thus, the p-value for the null hypothesis is

pchisq(deviances[1] - deviances[2], df = 1, lower.tail = F)
#> [1] 0.0006220128

which is the value of Pr(Chi) reported in the anova table.
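The prior and posterior summaries below come from a Bayesian negative binomial regression fitted with brms; a sketch of such a fit, assuming the biochemists_df data used above (negbinomial is the brms family for negative binomial regression):

```r
# Bayesian negative binomial regression of publications on several predictors
Mnb2_bayes <- brm(publications ~ gender + married + I(children > 0) + prestige + mentor,
                  family = negbinomial(link = 'log'),
                  data = biochemists_df)

prior_summary(Mnb2_bayes)
```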
#> (flat) b default
#> (flat) b genderWomen (vectorized)
#> (flat) b Ichildren>0TRUE (vectorized)
#> (flat) b marriedSingle (vectorized)
#> (flat) b mentor (vectorized)
#> (flat) b prestige (vectorized)
#> student_t(3, 0, 2.5) Intercept default
#> gamma(0.01, 0.01) shape 0 default
This tells us that a flat improper prior is used for the coefficients of the predictors, a student t-distribution for the intercept term, and a Gamma distribution for the r parameter of the negative binomial distribution (labelled shape), with shape and rate (or inverse scale) parameters of 0.01 and 0.01. This Gamma prior has a mean of exactly one, a variance of 100, and a positive skew of 20.
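These moments follow from the standard formulas for a Gamma distribution with shape a and rate b: mean a/b, variance a/b², and skewness 2/√a. A quick check with the values above:

```r
a <- 0.01  # shape
b <- 0.01  # rate

c(mean = a / b, variance = a / b^2, skewness = 2 / sqrt(a))
# mean 1, variance 100, skewness 20
```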
The coefficients summary is as follows.
summary(Mnb2_bayes)$fixed
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept 0.38883799 0.128936596 0.14061841 0.63674303 1.0052193 5657.373 3205.330
#> genderWomen -0.20835614 0.073530981 -0.35186184 -0.06209811 0.9998066 5668.787 2978.681
#> marriedSingle -0.14273189 0.086422619 -0.31501035 0.02848022 1.0002953 4904.067 2920.246
#> Ichildren>0TRUE -0.22891800 0.087640701 -0.40104654 -0.05937863 1.0021289 4813.428 3160.174
#> prestige 0.01593058 0.036026462 -0.05352791 0.08671809 1.0031660 5223.980 3292.215
#> mentor 0.02938006 0.003363474 0.02289493 0.03606538 1.0009252 5901.048 3506.838
As in the many cases we have seen above, these results are largely in line with those from classical maximum likelihood based methods, as we can see if we compare the results above to those of glm.nb applied to the same data.
Mnb2 <- glm.nb(publications ~ gender + married + I(children > 0) + prestige + mentor,
data = biochemists_df)
summary(Mnb2)$coefficients
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.39147046 0.128450977 3.0476254 2.306573e-03
#> genderWomen -0.20637739 0.072876028 -2.8318967 4.627279e-03
#> marriedSingle -0.14384319 0.084788294 -1.6964983 8.979156e-02
#> I(children > 0)TRUE -0.22895331 0.085438163 -2.6797546 7.367614e-03
#> prestige 0.01547769 0.035978126 0.4301973 6.670521e-01
#> mentor 0.02927434 0.003229138 9.0656829 1.238261e-19
The posterior summary for the r parameter is as follows.
summary(Mnb2_bayes)$spec_pars
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> shape 2.230603 0.2667462 1.776507 2.791699 1.001183 6360.45 3114.755
We can see that this is very close to that estimated with glm.nb.
Mnb2$theta
#> [1] 2.239056
smoking_df <- read_csv('data/smoking.csv')
smoking_df
#> # A tibble: 807 x 10
#> educ cigpric white age income cigs restaurn lincome agesq lcigpric
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 16 60.5 1 46 20000 0 0 9.90 2116 4.10
#> 2 16 57.9 1 40 30000 0 0 10.3 1600 4.06
#> 3 12 57.7 1 58 30000 3 0 10.3 3364 4.05
#> 4 13.5 57.9 1 30 20000 0 0 9.90 900 4.06
#> 5 10 58.3 1 17 20000 0 0 9.90 289 4.07
#> 6 6 59.3 1 86 6500 0 0 8.78 7396 4.08
#> 7 12 57.9 1 35 20000 0 0 9.90 1225 4.06
#> 8 15 57.9 1 48 30000 0 0 10.3 2304 4.06
#> 9 12 59.2 1 48 20000 0 0 9.90 2304 4.08
#> 10 12 58.0 1 31 20000 0 0 9.90 961 4.06
#> # ... with 797 more rows
In this data frame, for each of 807 individuals, we have their number of years of formal education (educ), their age, and their reported number of cigarettes smoked per day (cigs). A bar plot of cigs, shown in Figure 12, shows that there is an excessive number of zero values.
Figure 12: A bar plot of cigs (x-axis: cigs, from 0 to 80; y-axis: count).
To model count variables like cigs, we can use zero-inflated models, such as zero-inflated Poisson or zero-inflated negative binomial models. Here, we will just consider the example of zero-inflated Poisson regression, but all the principles apply equally to zero-inflated negative binomial and other count regression models.
Zero-inflated Poisson models are a special type of mixture model, so let us first consider mixture models more generally. A single normal distribution model of an outcome variable y1 , y2 . . . yn assumes
$$y_i \sim N(\mu, \sigma^2), \quad \text{for } i \in 1 \ldots n.$$
By contrast, a K = 3 component mixture of normal distributions model assumes that there is a discrete
latent variable z1 , z2 . . . zn corresponding to each of y1 , y2 . . . yn , where each zi ∈ {1, 2, 3}, and for i ∈ 1 . . . n,
$$y_i \sim \begin{cases} N(\mu_1, \sigma_1^2), & \text{if } z_i = 1,\\ N(\mu_2, \sigma_2^2), & \text{if } z_i = 2,\\ N(\mu_3, \sigma_3^2), & \text{if } z_i = 3, \end{cases} \qquad z_i \sim P(\pi),$$
where π = [π1 , π2 , π3 ] is a probability distribution over {1, 2, 3}. More generally, for any value of K, we can write the K component mixture of normals as follows:
$$y_i \sim N(\mu_{z_i}, \sigma_{z_i}^2), \quad z_i \sim P(\pi), \quad \text{for } i \in 1 \ldots n,$$
where π = [π1 , π2 . . . πK ] is a probability distribution over 1 . . . K. In other words, each yi is assumed to be drawn from one of K normal distributions whose mean and variance parameters are (µ1 , σ1²), (µ2 , σ2²) . . . (µK , σK²). Which of these K distributions each yi is drawn from is determined by the value of the latent variable zi ∈ 1 . . . K, where for each zi , P(zi = k) = πk . In a model like this, we must infer the values of (µ1 , σ1²), (µ2 , σ2²) . . . (µK , σK²) and π, and also the posterior probability that zi = k for each value of k; see Figure 13 for an illustration of the problem in the case of K = 3 normal distributions. Without delving into the details, the traditional maximum likelihood based approach to this inference is to use the Expectation Maximization (EM) algorithm.
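To give a flavour of how this works, the EM iteration for a mixture of normals can be sketched in a few lines of base R. This is an illustrative sketch on simulated data, not a production implementation; dedicated packages such as mclust or flexmix should be used in practice.

```r
# A minimal EM sketch for a K = 3 mixture of normals, on simulated data.
set.seed(101)
y <- c(rnorm(150, -5, 2), rnorm(150, 2, 1), rnorm(150, 10, 2))
K <- 3
# Initialize: means at spread-out quantiles, a common sd, uniform weights
mu <- unname(quantile(y, c(0.2, 0.5, 0.8)))
sigma <- rep(sd(y), K)
p <- rep(1 / K, K)
for (iter in 1:100) {
  # E step: responsibility r[i, k] = P(z_i = k | y_i, current parameters)
  r <- sapply(1:K, function(k) p[k] * dnorm(y, mu[k], sigma[k]))
  r <- r / rowSums(r)
  # M step: responsibility-weighted maximum likelihood updates
  nk <- colSums(r)
  p <- nk / length(y)
  mu <- sapply(1:K, function(k) sum(r[, k] * y) / nk[k])
  sigma <- sapply(1:K, function(k) sqrt(sum(r[, k] * (y - mu[k])^2) / nk[k]))
}
round(rbind(p = p, mu = mu, sigma = sigma), 2)  # estimates near the true values
```

After a few dozen iterations, the estimated means settle close to the true values of -5, 2, and 10 used in the simulation.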
Figure 13: A probabilistic mixture of K = 3 normal distributions. A histogram of the observed data is shown
in a), and we model each observed value as drawn from one of three different normal distributions, e.g. shown
in b). The parameters of the normal distributions, the relative probabilities of the three distributions, as well
as the probability that any one observation came from each distribution must be simultaneously inferred.
The mixture models discussed so far were not regression models. However, we can easily extend this description to apply to regression models. For this, let us assume our data is {(y1 , ~x1 ), (y2 , ~x2 ) . . . (yn , ~xn )}, just as we have described it repeatedly above. In a non-mixture normal linear regression model, we have seen that our model is as follows:
$$y_i \sim N(\mu_i, \sigma^2), \quad \mu_i = \beta_0 + \sum_{k=1}^{K} \beta_k x_{ki}, \quad \text{for } i \in 1 \ldots n.$$
On the other hand, in a mixture of K normal linear models, we assume that there is a latent variable z1 , z2 . . . zn corresponding to each observation, with each zi ∈ 1 . . . K, and
$$y_i \sim N(\mu_i, \sigma_{z_i}^2), \quad \mu_i = \beta_{0[z_i]} + \sum_{k=1}^{K} \beta_{k[z_i]} x_{ki}, \quad z_i \sim P(\pi), \quad \text{for } i \in 1 \ldots n,$$
where π = [π1 , π2 . . . πK ] is a probability distribution over 1 . . . K. Note that here we have K sets of regression coefficients and variances, (~β1 , σ1²), (~β2 , σ2²) . . . (~βK , σK²), each one defining a different linear model.
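To make this generative story concrete, here is a short simulation from a K = 2 mixture of linear regressions. All parameter values are hypothetical, chosen purely for illustration:

```r
# Simulating data from a K = 2 mixture of linear regressions.
set.seed(42)
n <- 300
x <- runif(n, 0, 10)
pi_k <- c(0.7, 0.3)                               # mixing distribution pi
z <- sample(1:2, n, replace = TRUE, prob = pi_k)  # latent component labels
beta0 <- c(1, 5)                                  # component-specific intercepts
beta1 <- c(0.5, -0.5)                             # component-specific slopes
sigma <- c(1, 1)
y <- rnorm(n, beta0[z] + beta1[z] * x, sigma[z])
```

A scatterplot of y against x for data simulated like this shows two distinct linear trends, one per latent component.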
In the mixture of regressions just provided, the probability that zi takes any value from 1 . . . K is determined
by the fixed but unknown probability distribution π. We may, however, extend the mixture of regressions
model to allow each zi to also vary with the predictors ~xi . For example, each zi could be modelled using a
categorical logistic regression of the kind we saw in Chapter 10.
As interesting as these mixture of normal linear regression models are, we will not explore them further here. However, we have described them in order to introduce zero-inflated Poisson models, which are a special type of mixture of regression models. Specifically, a zero-inflated Poisson regression is a K = 2 mixture of regressions model. There are two component models, and so each latent variable zi is binary valued, i.e. zi ∈ {0, 1}. Furthermore, the probability that zi = 1 is a logistic regression function of the predictors ~xi . The two components of the zero-inflated Poisson model are as follows:
1. A Poisson distribution.
2. A zero-valued point mass distribution (a probability distribution with all its mass at zero).
More precisely, in a zero-inflated Poisson regression, our data are {(y1 , ~x1 ), (y2 , ~x2 ) . . . (yn , ~xn )} as before, and the model is
$$y_i \sim \begin{cases} \text{Poisson}(\lambda_i), & \text{if } z_i = 0,\\ 0, & \text{if } z_i = 1, \end{cases} \qquad z_i \sim \text{Bernoulli}(\theta_i), \quad \text{for } i \in 1 \ldots n,$$
where
$$\log(\lambda_i) = \beta_0 + \sum_{k=1}^{K} \beta_k x_{ki},$$
and
$$\log\left(\frac{\theta_i}{1 - \theta_i}\right) = \gamma_0 + \sum_{k=1}^{K} \gamma_k x_{ki}.$$
In other words, λi is modelled just as in ordinary Poisson regression and θi is modelled as in logistic regression. We are using ~β and ~γ to make it clear that these are two separate sets of regression coefficients.
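This generative process can be simulated directly, which makes the source of the excess zeros concrete: zeros arise both from the point mass (when zi = 1) and, occasionally, from the Poisson component itself. The coefficient values below are hypothetical, chosen for illustration:

```r
# Simulating from a zero-inflated Poisson regression (hypothetical coefficients).
set.seed(10)
n <- 1000
x <- runif(n, 0, 10)
theta <- plogis(-1 + 0.2 * x)   # P(z_i = 1): probability of a structural zero
lambda <- exp(0.5 + 0.1 * x)    # Poisson mean when z_i = 0
z <- rbinom(n, 1, theta)
y <- ifelse(z == 1, 0L, rpois(n, lambda))
# Zeros are far more common than a plain Poisson with the same lambda implies
c(observed = mean(y == 0), poisson_only = mean(dpois(0, lambda)))
```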
Zero-inflated Poisson in R
We can perform zero-inflated Poisson regression, as well as other zero-inflated count regression models, using functions in the pscl package. Here, we model how cigs varies as a function of educ.
library(pscl)
Mzip <- zeroinfl(cigs ~ educ, data=smoking_df)
The two sets of coefficients can be obtained as follows.
summary(Mzip)$coefficients
#> $count
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 2.69785404 0.056778696 47.515252 0.000000e+00
#> educ 0.03471929 0.004536397 7.653495 1.955885e-14
#>
#> $zero
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.56273266 0.30605009 -1.838695 0.0659601142
#> educ 0.08356933 0.02417456 3.456912 0.0005464026
From this, we see that the logistic regression model for each observation is estimated to be
$$\log\left(\frac{\theta_i}{1 - \theta_i}\right) = \gamma_0 + \gamma_1 x_i = -0.563 + 0.084 x_i,$$
and the Poisson model is estimated to be
$$\log(\lambda_i) = \beta_0 + \beta_1 x_i = 2.698 + 0.035 x_i.$$
For example, for a smoker with 6 years of education, the predicted average number of cigarettes smoked per day is
$$\lambda_i = e^{2.698 + 0.035 \times 6} = e^{2.698 + 0.21} = 18.287,$$
while for a smoker with 18 years of education, it is
$$\lambda_i = e^{2.698 + 0.035 \times 18} = e^{2.698 + 0.63} = 27.738.$$
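These by-hand calculations can be checked in R from the estimated count-model coefficients printed above:

```r
# Reproducing the worked lambda predictions from the count-model estimates above
b0 <- 2.69785404  # count-model intercept
b1 <- 0.03471929  # count-model coefficient for educ
round(exp(b0 + b1 * c(6, 18)), 3)
# 18.287 27.738
```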
From this we see that as educ increases, the average number of cigarettes smoked by smokers also increases. This is an interesting result. It shows that the effect of education on smoking behaviour is not a simple one and that two opposing effects are happening at the same time. On the one hand, as education increases, it is more likely that a person does not smoke at all. This was revealed by the logistic regression model. On the other hand, if the person is a smoker, then the more educated they are, the more they smoke. This was revealed by the Poisson model.
These two predictions can be performed more efficiently and with less error using predict or add_predictions.
Let us consider the range of values for educ from 6 to 18 in steps of 2 years.
smoking_df_new <- tibble(educ = seq(6, 18, by = 2))
The predicted probabilities that zi = 1, and hence that a person of a given level of education is a non-smoker, can be obtained using type = 'zero' as follows.
smoking_df_new %>%
add_predictions(Mzip, type = 'zero')
#> # A tibble: 7 x 2
#> educ pred
#> <dbl> <dbl>
#> 1 6 0.485
#> 2 8 0.526
#> 3 10 0.568
#> 4 12 0.608
#> 5 14 0.647
#> 6 16 0.684
#> 7 18 0.719
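These predicted probabilities can be reproduced by hand from the estimated zero-model coefficients using plogis, the inverse of the logit function:

```r
# Reproducing the zero-model predictions from the estimates above
g0 <- -0.56273266  # zero-model intercept
g1 <- 0.08356933   # zero-model coefficient for educ
round(plogis(g0 + g1 * c(6, 18)), 3)
# 0.485 0.719
```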
The predicted values of λi , and hence the average number of cigarettes smoked per day by a smoker of a given level of education, can be obtained using type = 'count' as follows.
smoking_df_new %>%
add_predictions(Mzip, type = 'count')
#> # A tibble: 7 x 2
#> educ pred
#> <dbl> <dbl>
#> 1 6 18.3
#> 2 8 19.6
#> 3 10 21.0
#> 4 12 22.5
#> 5 14 24.1
#> 6 16 25.9
#> 7 18 27.7
Now let us consider the average number of cigarettes smoked by a person, who might be a smoker or a non-smoker, given that we know their level of education. Put more generally, what is the expected value of yi given ~xi in a zero-inflated Poisson model? This is the sum of two quantities. The first is the average value of yi given ~xi when zi = 0, multiplied by the probability that zi = 0. The second is the average value of yi given ~xi when zi = 1, multiplied by the probability that zi = 1. This second quantity is always zero: if zi = 1, then yi = 0 necessarily. The first quantity is
$$\lambda_i \times (1 - \theta_i).$$
We can obtain these predictions using type = 'response'.
smoking_df_new %>%
add_predictions(Mzip, type = 'response')
#> # A tibble: 7 x 2
#> educ pred
#> <dbl> <dbl>
#> 1 6 9.42
#> 2 8 9.28
#> 3 10 9.08
#> 4 12 8.82
#> 5 14 8.51
#> 6 16 8.17
#> 7 18 7.78
We can verify that these predictions match the definition above. As we've seen, λ and θ are calculated using predict with type = 'count' and type = 'zero', respectively. Putting these in a data frame, we can then calculate λ × (1 − θ).
smoking_df_new %>%
mutate(lambda = predict(Mzip, newdata = ., type = 'count'),
theta = predict(Mzip, newdata = ., type = 'zero'),
response = lambda * (1-theta)
)
#> # A tibble: 7 x 4
#> educ lambda theta response
#> <dbl> <dbl> <dbl> <dbl>
#> 1 6 18.3 0.485 9.42
#> 2 8 19.6 0.526 9.28
#> 3 10 21.0 0.568 9.08
#> 4 12 22.5 0.608 8.82
#> 5 14 24.1 0.647 8.51
#> 6 16 25.9 0.684 8.17
#> 7 18 27.7 0.719 7.78
We can also perform zero-inflated Poisson regression in a Bayesian manner using brm from the brms package. Note, however, that if we were to supply brm with a single regression formula, as in previous examples, the resulting model would in fact be the following:
$$y_i \sim \begin{cases} \text{Poisson}(\lambda_i), & \text{if } z_i = 0,\\ 0, & \text{if } z_i = 1, \end{cases} \qquad z_i \sim \text{Bernoulli}(\theta),$$
where λi is a function of the predictors ~xi , specifically
$$\log(\lambda_i) = \beta_0 + \sum_{k=1}^{K} \beta_k x_{ki},$$
but θ is a fixed constant that does not vary with ~xi . To obtain the model as we used it above in the classical inference based example, where the log odds of θi is a linear function of ~xi , we must define two regression formulas: one for the Poisson model and the other for the logistic regression model. We do so using the brmsformula function, which is also available as bf.
Mzip_bayes <- brm(bf(cigs ~ educ, zi ~ educ),
family = zero_inflated_poisson(link = "log", link_zi = "logit"),
data = smoking_df)
Note that inside bf there are two formulas. The first is as above, and the second is the logistic regression model for the zi latent variables.
Again, we have accepted all the defaults. Let us look at the priors that have been used.
prior_summary(Mzip_bayes)
#> prior class coef group resp dpar nlpar lb ub
#> (flat) b
#> (flat) b educ
#> (flat) b zi
#> (flat) b educ zi
#> student_t(3, -2.3, 2.5) Intercept
#> logistic(0, 1) Intercept zi
#> source
#> default
#> (vectorized)
#> default
#> (vectorized)
#> default
#> default
Much of this is similar to previous examples. However, we note that the prior on the intercept term for the logistic regression model is a standard logistic distribution. We saw this distribution when discussing the latent variable formulation of logistic regression in Chapter 10. It is close to a normal distribution with zero mean and a standard deviation of 1.63.
The coefficients for both the Poisson and logistic regression models can be obtained from the fixed attribute of the summary output, and we can see that they are very close to the estimates of the classical inference model above.
summary(Mzip_bayes)$fixed
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept 2.69816456 0.056020797 2.58828770 2.81194012 1.000689 5492.946 3215.403
#> zi_Intercept -0.56746971 0.309771515 -1.17336711 0.04338648 1.001765 4113.549 2877.782
#> educ 0.03468162 0.004454841 0.02559739 0.04337649 1.000191 5350.191 2890.342
#> zi_educ 0.08392619 0.024516775 0.03640201 0.13122232 1.002506 3983.940 2933.668
References