PH6205 R Tutorial 4
Haoran Xue
(Last Updated February 25, 2025)
1 Scottish Lip Cancer Data
We analyze the Scottish Lip Cancer Data. First we load the data into R:
Lip.data = read.csv("/Users/x/Dropbox/PH6205/Slides4/SC_lip_cancer.csv")
head(Lip.data)
## id district obs pop propag
## 1 1 Caithness 11 83190 10
## 2 2 Sutherland 5 37521 16
## 3 3 Ross-Cromarty 15 129271 10
## 4 4 Banff-Buchan 39 231337 16
## 5 5 Nairn 3 29374 10
## 6 6 Skye-Lochalsh 9 28324 16
1.1 Poisson Regression
Next, we fit the Poisson Regression of obs on propag, with log(pop) as the offset:
obs = Lip.data$obs
pop = Lip.data$pop
propag = Lip.data$propag
poisson.model = glm(obs ~ propag, offset = log(pop), family = poisson(link = "log"))
summary(poisson.model)
##
## Call:
## glm(formula = obs ~ propag, family = poisson(link = "log"), offset = log(pop))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.82457 0.07006 -154.50 <2e-16 ***
## propag 0.08114 0.00603 13.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 423.92 on 55 degrees of freedom
## Residual deviance: 256.92 on 54 degrees of freedom
## AIC: 468.89
##
## Number of Fisher Scoring iterations: 5
1
1.2 Plot for Checking Overdispersion
We can draw the scatter plot of (y − λ̂)2 versus λ̂ to visually check for overdispersion:
lambda.hat = fitted(poisson.model)
plot(lambda.hat,(obs - lambda.hat)ˆ2)
abline(0,1,col = "red",lwd = 2)
500
(obs − lambda.hat)^2
400
300
200
100
0
0 10 20 30 40
lambda.hat
Here the red solid line is y = x. Most of the points are above the red line, i.e. (y − λ̂)2 > λ̂ in general. This
indicates the existence of overdispersion.
We can also calculate the dispersion parameter as
phi = sum((obs - lambda.hat)ˆ2/lambda.hat)/(56-1-1)
phi
## [1] 5.430562
1.3 Quasi-Poisson Regression
We fit a Quasi-Poisson Regression to deal with the overdispersion.
Quasipoisson.model = glm(obs ~ propag, offset = log(pop), family = quasipoisson)
summary(Quasipoisson.model)
##
## Call:
## glm(formula = obs ~ propag, family = quasipoisson, offset = log(pop))
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.82457 0.16326 -66.301 < 2e-16 ***
## propag 0.08115 0.01405 5.774 3.91e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
2
## (Dispersion parameter for quasipoisson family taken to be 5.430563)
##
## Null deviance: 423.92 on 55 degrees of freedom
## Residual deviance: 256.92 on 54 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
1.4 Negative Binomial Regression
We can also fit a Negative Binomial Regression to deal with the overdispersion. We use the function
glm.nb() from the R package MASS to fit the Negative Binomial Regression model.
library(MASS)
NB.model = glm.nb(formula = obs ~ propag + offset(log(pop)))
summary(NB.model)
##
## Call:
## glm.nb(formula = obs ~ propag + offset(log(pop)), init.theta = 2.531752157,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.66858 0.15792 -67.558 < 2e-16 ***
## propag 0.08499 0.01408 6.036 1.58e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(2.5318) family taken to be 1)
##
## Null deviance: 93.005 on 55 degrees of freedom
## Residual deviance: 63.170 on 54 degrees of freedom
## AIC: 356.58
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 2.532
## Std. Err.: 0.642
##
## 2 x log-likelihood: -350.575
2 Vietnam Highly Pathogenic Avian Influenza Data
We use the Vietnam Highly Pathogenic Avian Influenza Data to illustrate the Zero-inflated Poisson
Regression model. First we load the data:
VN.new = read.csv("/Users/x/Dropbox/PH6205/Slides4/Vietnam2003HPAI.csv")
head(VN.new)
## obs pop pop.dens road.dens river.dens elevation
## 1 0 1 0.5731923 -0.9532340 -0.58138623 0.4925043
## 2 0 1 0.7068947 -0.7887648 -0.72589578 0.4782148
## 3 0 1 0.4386707 0.3990558 -0.24816734 0.3940719
3
## 4 0 1 0.7793921 0.2686765 0.40521494 0.4497510
## 5 0 1 0.6405167 -0.7609713 0.12034335 0.4716434
## 6 0 1 0.2975146 -0.1706405 -0.06887103 0.3827288
obs = VN.new$obs
pop = VN.new$pop
pop.dens = VN.new$pop.dens
road.dens = VN.new$road.dens
river.dens = VN.new$river.dens
elevation = VN.new$elevation
We can use the table() function to show the counts for different values of obs:
table(obs)
## obs
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 1407 156 67 39 23 18 15 14 10 8 3 7 5 1 1 1
## 16 17 18 21
## 2 1 2 1
We can see, out of 1781 districts, there are 1407 report obs as 0. Next we draw to bar plot for better
visualization. (Note: the codes for drawing bar plot are not required.)
barplot(c((table(obs))[1:19],0,0,(table(obs))[20]),names.arg = 0:21,xlab = "obs",ylab = "Count")
1400
1000
Count
600
200
0
0 2 4 6 8 10 12 14 16 18 20
obs
We fit a Zero-inflated Poisson Regression model. We use the function zeroinfl() from the R package
pscl.
library(pscl)
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
4
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
ZIP.VN = zeroinfl(obs ~ road.dens + river.dens + elevation | pop.dens,
offset = log(pop),dist = c("poisson"))
summary(ZIP.VN)
##
## Call:
## zeroinfl(formula = obs ~ road.dens + river.dens + elevation | pop.dens,
## offset = log(pop), dist = c("poisson"))
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -4.10910 -0.37482 -0.17808 -0.06144 17.42802
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.97356 0.07965 -24.778 < 2e-16 ***
## road.dens -0.12954 0.02251 -5.755 8.64e-09 ***
## river.dens 0.28272 0.08009 3.530 0.000416 ***
## elevation -1.00963 0.09860 -10.239 < 2e-16 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.0195 0.1690 6.032 1.62e-09 ***
## pop.dens 3.1881 0.3105 10.268 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 13
## Log-likelihood: -1425 on 6 Df