Intro to Categorical Data Analysis

Nicholas Reich and Anna Liu, based on Agresti Ch 1

Where this course fits


• Third course in Biostat “Methods Sequence”, after intro stats and linear regression.
• Good lead-in to random effects models, machine learning/classification models.
• Balance of traditional stat theory and application.
• Most applications will have biomedical/public health focus.

History of the course


• Taught since mid-1980s at UMass-Amherst (PUBHLTH 743)
• Led to most cited statistics book in print (> 30,000 citations)

Focus of this course (different from the original)


• Foundational concepts
– Analysis of contingency tables
– Generalized Linear Models (GLMs)
– Discussion of Bayesian and frequentist approaches
• A taste of common, modern extensions to GLMs
– Machine Learning classification methods
– Longitudinal data (repeated measures)
– Zero-inflated models, over-dispersion

Course Introduction
• This course focuses on methods for categorical response, or outcome variables.
– Binary, e.g.
– Nominal, e.g.
– Ordinal, e.g.
– Discrete-valued (“interval”), e.g.

1
• Explanatory, or predictor variables can be any type
• Very generally, we are trying to build models

Types of categorical variables


• The way that a variable is measured determines its classification
– What are different ways that a variable on education could be classified?

• The granularity of your data matters!


– In terms of information per measured datapoint, discrete variables > ordinal variables > nominal
variables
– This has implications for study design and sample size.

Distributions of categorical variables: Binomial


Let y1 , y2 , · · · , yn denote observations from n independent and identical trials such that

P(Yi = 1) = π,    P(Yi = 0) = 1 − π

The total number of successes (1s), Y = Σ_{i=1}^n Yi, has the binomial distribution, denoted bin(n, π). The probability mass function for the possible outcomes y of Y is

p(y) = (n choose y) π^y (1 − π)^(n−y),    y = 0, 1, ..., n

with µ = E(Y) = nπ and σ² = Var(Y) = nπ(1 − π).


• The binomial distribution converges to normality as n increases, for fixed π, the approximation being reasonable when n[min(π, 1 − π)] is at least about 5.
• Interactive binomial distribution
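A minimal R sketch of this point (the values of n and π are hypothetical, not from the notes): it overlays the bin(n, π) pmf with its N(nπ, nπ(1 − π)) approximation.

# compare the binomial pmf with its normal approximation
n <- 50; pi <- 0.3
y <- 0:n
pmf <- dbinom(y, size = n, prob = pi)
approx <- dnorm(y, mean = n * pi, sd = sqrt(n * pi * (1 - pi)))
plot(y, pmf, type = "h", xlab = "y", ylab = "P(Y = y)")
lines(y, approx, col = "red")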

Statistical inference
Inference is the use of sample data to estimate unknown parameters of the population. One method we will
focus on is maximum likelihood estimation (MLE).

(Diagram: a sample is drawn from the population; the sample statistic is used to estimate the population parameter.)

Statistical inference: maximum likelihood


• The likelihood function is the likelihood (or probability in the discrete case) of the sample of your
data X1 , ..., Xn , given the unknown parameter(s) β. Denoted as l(β|X1 , ..., Xn ) or simply l(β).
• The MLE of β is defined as

β̂ = argmax_β l(β) = argmax_β L(β)

where L(β) = log(l(β)). The MLE is the parameter value under which the observed data have the highest probability of occurrence.

Statistical inference: MLE (con’t)


• MLEs have desirable properties: under weak regularity conditions, MLEs have large-sample normal distributions; they are asymptotically consistent, converging to the parameter as n increases; and they are asymptotically efficient, producing large-sample standard errors no greater than those from other estimation methods.

Covariance matrix of the MLE


Let cov(β̂) denote the asymptotic covariance matrix of β̂, where β is a multidimensional parameter.

• Under regularity conditions, cov(β̂) is the inverse of the information matrix, which has elements

[I(β)]ij = −E[ ∂²L(β) / (∂βi ∂βj) ]

• The standard errors are the square roots of the diagonal elements of the inverse of the information matrix. The greater the curvature of the log-likelihood function, the smaller the standard errors.

Statistical inference for Binomial parameter


• The binomial log-likelihood function is

L(π) = log[π^y (1 − π)^(n−y)] = y log(π) + (n − y) log(1 − π)

• Differentiating with respect to π and setting the derivative to 0 gives the MLE π̂ = y/n.
• The Fisher information is I(π) = n/[π(1 − π)].
• The asymptotic distribution of the MLE π̂ is N(π, π(1 − π)/n).
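As a small R sketch (with hypothetical values of y and n), the MLE and its asymptotic standard error can be computed directly:

y <- 12; n <- 40
pi_hat <- y / n                              # MLE
se_hat <- sqrt(pi_hat * (1 - pi_hat) / n)    # 1 / sqrt(I(pi_hat))
c(estimate = pi_hat, std_error = se_hat)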

Statistical inference for Binomial parameter


The score, Wald, and likelihood ratio tests use different information from this curve to draw inference about
π.

Wald test
Consider the hypothesis

H0: β = β0    vs.    H1: β ≠ β0

The Wald test defines the test statistic

z = (β̂ − β0)/SE,    where SE = 1/√I(β̂) = √(π̂(1 − π̂)/n) in the binomial case

Under H0: β = β0, the Wald test statistic z is approximately standard normal. Therefore H0 is rejected if |z| > z_(α/2).

Likelihood ratio test


The likelihood ratio test (LRT) is defined as

−2 log Λ = −2 log(l0/l1) = −2(L0 − L1)

where l0 and l1 are the maximized likelihoods under H0 and under H0 ∪ H1, respectively. The null hypothesis is rejected if −2 log Λ > χ²_α(df), where df is the difference in the dimensions of the parameter spaces under H0 ∪ H1 and under H0.

(Figure 1: binomial log-likelihood curve.)
Score (a.k.a. Wilson) test
Score test, also called the Wilson or Lagrange multiplier test, is based on the slope and expected curvature
of the log-likelihood function L(β) at the null value β0 . It utilizes the size of the score function

u(β) = ∂L(β)/∂β

evaluated at β0 .
The score test statistic is

z = u(β0)/[I(β0)]^(1/2) = (π̂ − π0)/√(π0(1 − π0)/n)

Example: Estimating the proportion of Vegetarians


Students in a class were surveyed whether they are vegetarians. Of n = 25 students, y = 0 answered “yes”.
• Using the Wald method, compute the 95% confidence interval for π (true proportion of vegetarians in
the population):

• Using the Score method, compute the 95% confidence interval for π (true proportion of vegetarians in
the population):
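A sketch of both computations in R for these data (y = 0, n = 25); the Wilson interval is obtained by inverting the score test:

y <- 0; n <- 25; z <- qnorm(0.975)
pi_hat <- y / n

# Wald interval: degenerates to [0, 0] here because the estimated SE is 0
wald <- pi_hat + c(-1, 1) * z * sqrt(pi_hat * (1 - pi_hat) / n)

# Score (Wilson) interval: closed-form inversion of the score test
center <- (pi_hat + z^2 / (2 * n)) / (1 + z^2 / n)
half   <- z * sqrt(pi_hat * (1 - pi_hat) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
wilson <- c(center - half, center + half)

rbind(wald = wald, wilson = wilson)   # wilson is approximately (0, 0.133)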

Warning about the Wald test


• When a parameter falls near the boundary of the sample space, often sample estimates of standard
errors are poor and the Wald method does not provide a sensible answer.
• For small to moderate sample sizes, the likelihood-ratio and score tests are usually more reliable than
the Wald test, having actual error rates closer to the nominal level.

Comparison of the tests


There are lots of different methods to compute CIs for a binomial proportion!
library(binom)
binom.confint(x = 0, n = 25)

## method x n mean lower upper


## 1 agresti-coull 0 25 0.00000000 -0.02439494 0.15758719
## 2 asymptotic 0 25 0.00000000 0.00000000 0.00000000
## 3 bayes 0 25 0.01923077 0.00000000 0.07323939
## 4 cloglog 0 25 0.00000000 0.00000000 0.13718517
## 5 exact 0 25 0.00000000 0.00000000 0.13718517
## 6 logit 0 25 0.00000000 0.00000000 0.13718517
## 7 probit 0 25 0.00000000 0.00000000 0.13718517
## 8 profile 0 25 0.00000000 0.00000000 0.12291101
## 9 lrt 0 25 0.00000000 0.00000000 0.07398085
## 10     prop.test 0 25 0.00000000  0.00000000 0.16577301
## 11 wilson 0 25 0.00000000 0.00000000 0.13319225

Bayesian inference for binomial parameters
Bayesian analyses incorporate “prior information” about parameters using
• prior subjective belief about a parameter, or
• prior knowledge from other studies, or
• very little knowledge (a “weakly informative” prior)
Prior distribution (g) is combined with the likelihood (f ) to create a posterior (h):

f (y|θ)g(θ)
h(θ|y) =
f (y)
∝ f (y|θ)g(θ)

Using Beta distributions for priors


If π ∼ beta(α1 , α2 ) (for α1 > 0 and α2 > 0) then g(π) ∝ π α1 −1 (1 − π)α2 −1 .

Beta is a conjugate prior distribution for a binomial parameter, implying that the posterior is also a beta distribution; specifically, h(π|y) follows a beta(y + α1, n − y + α2).
Shiny app for Bayesian inference of a Binomial.

An exercise
1. Write down your prior belief about the probability that this coin will land heads.
2. Share it with the class
3. Use the prior probabilities to estimate a beta distribution.
library(MASS)
x <- c(
## enter probabilities here
)
fitdistr(x, "beta", list(shape1=1,shape2=1))

4. Use the app to see how the posterior changes as we flip the coin.

Intro to Contingency Tables

Author: Nicholas Reich and Anna Liu, based on Agresti Ch 1

Course: Categorical Data Analysis (BIOSTATS 743)

Made available under the Creative Commons Attribution-ShareAlike 4.0 International License.
Distributions of categorical variables: Multinomial
Suppose that each of n independent and identical trials can have
outcome in any of c categories. Let
(
1 if trial i has outcome in category j
yij =
0 otherwise

Then yi = (yi1, ..., yic) represents a multinomial trial, with Σⱼ yij = 1. Let nj = Σᵢ yij denote the number of trials having outcome in category j. The counts (n1, n2, ..., nc) have the multinomial distribution. The multinomial pmf is

p(n1, ..., nc−1) = [n! / (n1! n2! ... nc!)] π1^(n1) π2^(n2) ... πc^(nc),

where πj = P(Yij = 1), and

E(nj) = nπj,    Var(nj) = nπj(1 − πj),    Cov(ni, nj) = −nπi πj


Statistical inference for multinomial parameters

Given n observations in c categories, with nj in category j, j = 1, ..., c, the multinomial log-likelihood function is

L(π) = Σⱼ nj log πj

Maximizing this gives the MLE

π̂j = nj/n
The Chi-Squared distribution

This is not a distribution for the data but rather a sampling distribution for many statistics.

• The chi-squared distribution with df degrees of freedom has mean df, variance 2(df), and skewness √(8/df). It converges (slowly) to normality as df increases, the approximation being reasonably good when df is at least about 50.
• If Z ∼ N(0, 1), then Z² ∼ χ²(1).
• The reproductive property: if X1² ∼ χ²(ν1) and X2² ∼ χ²(ν2) are independent, then X² = X1² + X2² ∼ χ²(ν1 + ν2). In particular, X = Z1² + Z2² + ... + Zν² ∼ χ²(ν) for independent standard normal Z's.
Chi-square goodness-of-fit test for a specified multinomial

Consider the hypothesis H0: πj = πj0, j = 1, ..., c.

• The chi-square goodness-of-fit (score) statistic is

X² = Σⱼ (nj − µj)²/µj

where the µj = nπj0 are the expected frequencies under H0.

• Let Xo² denote the observed value of X². The P-value is P(X² > Xo²).
• For large samples, X² has approximately a chi-squared distribution with df = c − 1, so the P-value is approximated by P(χ²_(c−1) ≥ Xo²).
LRT test for a specified multinomial

• The LRT statistic is

G² = −2 log Λ = 2 Σⱼ nj log[nj/(nπj0)]

For large n, G² has a chi-squared null distribution with df = c − 1.
• When H0 holds, the goodness-of-fit chi-square X² and the likelihood ratio G² both have large-sample chi-squared distributions with df = c − 1.
• For fixed c, as n increases the distribution of X² usually converges to chi-squared more quickly than that of G². The chi-squared approximation is often poor for G² when n/c < 5. When c is large, it can be decent for X² for n/c as small as 1, provided the table does not contain both very small and moderately large expected frequencies.
Distributions of categorical variables: Poisson

The Poisson is a simple distribution for count data that do not result from a fixed number of trials. The Poisson pmf is

p(y) = e^(−µ) µ^y / y!,    y = 0, 1, 2, ...,    with E(Y) = Var(Y) = µ

For adult residents of Britain who visit France this year, let
• Y1 = number who fly there
• Y2 = number who travel there by train without a car
• Y3 = number who travel there by ferry without a car
• Y4 = number who take a car

A Poisson model for (Y1, Y2, Y3, Y4) treats these as independent Poisson random variables, with parameters (µ1, µ2, µ3, µ4). The total n = Σᵢ Yi also has a Poisson distribution, with parameter Σᵢ µi.
Distributions of categorical variables: Poisson

The conditional distribution of (Y1, Y2, Y3, Y4) given Σᵢ Yi = n is multinomial(n, πi = µi/Σⱼ µⱼ).
Example: A survey of student characteristics

In the R data set survey, the Smoke column records the survey response about the student's smoking habit. As there are exactly four proper responses in the survey: "Heavy", "Regul" (regularly), "Occas" (occasionally), and "Never", the Smoke data are multinomial.
library(MASS) # load the MASS package
levels(survey$Smoke)

## [1] "Heavy" "Never" "Occas" "Regul"


(smoke.freq <- table(survey$Smoke))

##
## Heavy Never Occas Regul
## 11 189 19 17
Example: A survey of student characteristics

Suppose the campus smoking data are as shown above. You wish to
test null hypothesis of whether the frequency of smoking is the
same in all of the groups on campus, or H0 : πj = πj0 , j = 1, ..., 4.
(chisq.result <- chisq.test(smoke.freq,
    p = rep(1/length(smoke.freq), length(smoke.freq))))

##
## Chi-squared test for given probabilities
##
## data:  smoke.freq
## X-squared = 382.51, df = 3, p-value < 2.2e-16
Thus, there is strong evidence against the null hypothesis that all
groups are equally represented on campus (p<.0001).
Example (continued): expected and observed counts

chisq.result$expected

## Heavy Never Occas Regul


## 59 59 59 59
chisq.result$observed

##
## Heavy Never Occas Regul
## 11 189 19 17
Testing with estimated expected frequencies
In some applications, the hypothesized πj0 = πj0 (θ) are functions of
a smaller set of unknown parameters θ.
For example, consider a scenario (Table 1.1 in CDA) in which we are
studying the rates of infection in dairy calves. Some calves become
infected with pneumonia. A subset of those calves also develop a
secondary infection within two weeks of the first infection clearing
up. The goal of the study was to test whether the probability of
primary infection was the same as the conditional probability of
secondary infection, given that the calf got the primary infection.
Let π be the probability of primary infection. Fill in the following 2×2 table with the associated probabilities under the null hypothesis:

Secondary Infection
Primary Infection Yes No Total
Yes
No
Example continued
Let nab denote the number of observations in row a and column b.

Secondary Infection
Primary Infection Yes No Total
Yes n11 n12
No n21 n22

The ML estimate of π is the value maximizing the kernel of the


multinomial likelihood

(π 2 )n11 (π − π 2 )n12 (1 − π)n22

The MLE is

π̂ = (2n11 + n12 )/(2n11 + 2n12 + n22 )


Example continued

One process for drawing inference in this setting would be the


following:
• Obtain the ML estimates of the expected frequencies, µ̂j = nπj0(θ̂), by plugging in the ML estimate θ̂ of θ
• Replace µj by µ̂j in the definitions of X² and G²
• The approximate distributions of X² and G² are then χ²_df with df = (c − 1) − dim(θ)
Example continued
A sample of 156 dairy calves born in Okeechobee County, Florida,
were classified according to whether they caught pneumonia within
60 days of birth. Calves that got a pneumonia infection were also
classified according to whether they got a secondary infection within
2 weeks after the first infection cleared up.

                      Secondary Infection
Primary Infection     Yes          No
Yes                   30 (38.1)    63 (39.0)
No                    0            63 (78.9)

(ML-estimated expected frequencies in parentheses.)

The MLE is

π̂ = (2n11 + n12 )/(2n11 + 2n12 + n22 ) = 0.494

The score statistic is X² = 19.7. It follows a chi-square distribution with df = (c − 1) − dim(θ) = (3 − 1) − 1 = 1.
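A short R sketch reconstructing π̂ and X² from the observed counts given above:

n11 <- 30; n12 <- 63; n22 <- 63
n <- n11 + n12 + n22                                     # 156 calves
pi_hat <- (2 * n11 + n12) / (2 * n11 + 2 * n12 + n22)    # about 0.494

obs <- c(n11, n12, n22)
exp <- n * c(pi_hat^2, pi_hat * (1 - pi_hat), 1 - pi_hat)  # 38.1, 39.0, 78.9
X2  <- sum((obs - exp)^2 / exp)                          # about 19.7
c(pi_hat = pi_hat, X2 = X2, p_value = 1 - pchisq(X2, df = 1))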
Example continued

The p-value is
P(χ21 > 19.7) =

1-pchisq(19.7, df=1)

## [1] 9.060137e-06
Therefore, the data do not support the hypothesis that the probability of primary infection equals the conditional probability of secondary infection. Under H0, we would anticipate many more calves with secondary infections than were actually observed. "The researchers concluded that primary infection had an immunizing effect that reduced the likelihood of a secondary infection."
Lecture 3: Contingency Tables

Author: Nick Reich / transcribed by Josh Nugent and Nutcha Wattanachit

Course: Categorical Data Analysis (BIOSTATS 743)

Made available under the Creative Commons Attribution-ShareAlike 4.0 International License.
Contingency Tables (notation)
Imagine we have categorical random variables X and Y with I and J categories, respectively. We can draw a contingency table with I rows and J columns:

        Y=1   Y=2   ···   Y=j   ···   Y=J
X=1     n11   n12   ···   n1j   ···   n1J    n1+
X=2     n21   n22   ···   n2j   ···   n2J    n2+
 ⋮
X=i     ni1   ni2   ···   nij   ···   niJ    ni+
 ⋮
X=I     nI1   nI2   ···   nIj   ···   nIJ    nI+
        n+1   n+2   ···   n+j   ···   n+J    n

The generic entry nij denotes the count of observations when X = i


and Y = j. This table also represents the joint distribution of X
and Y when X and Y are categorical.
Contingency Tables (notation and example)
The notation n+j is the sum of the j-th column, n+j = Σᵢ nij, while ni+ is the sum of the i-th row, ni+ = Σⱼ nij. The total is simply n = Σᵢⱼ nij.
Contingency tables are ubiquitous in scientific papers and in the
media. Example:
Contingency Tables - Probabilities
A contingency table can be represented by probabilities as well. Define πij to be the population parameter representing the true probability of being in the ij-th cell, i.e. the probability that both X = i and Y = j. Formally, πij = Pr(X = i, Y = j), the joint probability of X and Y, for all i = 1, ..., I and j = 1, ..., J.

        Y=1   Y=2   ···   Y=j   ···   Y=J
X=1     π11   π12   ···   π1j   ···   π1J    π1+
X=2     π21   π22   ···   π2j   ···   π2J    π2+
 ⋮
X=i     πi1   πi2   ···   πij   ···   πiJ    πi+
 ⋮
X=I     πI1   πI2   ···   πIj   ···   πIJ    πI+
        π+1   π+2   ···   π+j   ···   π+J    1
Contingency Tables - Marginal Probabilities

We can also use these probabilities to find the marginal probability distributions of X and Y: πi+ = P(X = i) and π+j = P(Y = j).

        Y=1   Y=2   ···   Y=j   ···   Y=J
X=1     π11   π12   ···   π1j   ···   π1J    π1+
X=2     π21   π22   ···   π2j   ···   π2J    π2+
 ⋮
X=i     πi1   πi2   ···   πij   ···   πiJ    πi+
 ⋮
X=I     πI1   πI2   ···   πIj   ···   πIJ    πI+
        π+1   π+2   ···   π+j   ···   π+J    1
Contingency Tables - Conditional Probabilities

Finally, we can use our contingency table to find conditional probabilities, e.g. the probability of being in column j, conditional on being in row i:

πj|i = πij/πi+ = Pr(Y = j | X = i)

Another way of saying this is the probability of outcome j when i is already known. Example: the probability that you have arthritis, given that you are over 65.
Contingency Tables - Conditional Distributions

Conditional distributions of Y given X = i,

(π1|i, π2|i, π3|i, ..., πJ|i),

are key in modeling. There are similarities here to a regression-like problem, where we are trying to describe an outcome variable as a function of a predictor variable. This is similar to the conditional formulation of E[Y|X] in regression, where we model an outcome Y conditional on observed X. Stay tuned for more on this later in the course.
Sampling Methods - Poisson
How do we obtain our data for a contingency table? Different
sampling methods will call for different analyses, so we need to
understand the key features of the methods.
1. Poisson Sampling:

• The total n is not fixed; it is theoretically unbounded
• Counts are generally observed in a fixed time interval
• Cell counts nij ∼ Poisson(µij), independently
Example 1: Observations of how many people are using
Mac/PC/Linux operating systems over the course of an hour at
three different locations on campus:

Mac PC Linux
DuBois
SciLi
Arnold
n
Sampling Methods - Multinomial (fixed n)
2. Multinomial Sampling (fixed n)

• Row/column totals are not fixed, but the overall n is.
• Counts follow a multinomial distribution with I × J categories.

For example, we could make a table of counts of operating systems


among the people in a single classroom at UMass by age:
Mac PC Linux total
30 or older
Under 30
total n=16

• Counts are distributed {nij} ∼ Multinomial(n, π), where π is a vector of the probabilities for all cells in the table.

Another example: Cohort study. Enroll 5,000 people and track them
over a year. Measure video game playing (none / low / high) and
hospitalizations for repetitive strain injuries.
Sampling Methods - Multinomial (fixed row n / column n)

3. Multinomial Sampling (fixed row or column n)

I Both n and ni (or nj ) are known for all i (or j).


I Example: We want to know the video-game playing habits of
Republicans, Democrats, and Independents. We survey 500 in
each group.
None Low High total
Dem 500
Rep n21 n22 n23 500
Ind 500
Sampling Methods - Multinomial (fixed row n / column n)
• With fixed row totals, the vector of counts in each row follows a multinomial distribution based on the row totals:

(ni1, ni2, ..., niJ) ∼ Multinomial(ni+, (π1|i, π2|i, ..., πJ|i))

Another example: A case-control study:

Case Control
SE_1 . . . . . .
SE_2 . . . . . .
SE_3 . . . . . .
Total 1000 1000
The χ² Statistic

The chi-squared test statistic compares the observed and expected values for all count observations in the cells of a table:

Test Statistic = Σᵢⱼ (Oij − Eij)²/Eij

• Eij is the expected value for each cell based on its probability under the null hypothesis
• Oij is the value observed for that cell
• We reject H0 if our χ² statistic is large: a larger test statistic means larger deviations from the expected counts
• The test is two-sided by virtue of the (...)² term
• Compare to the (1 − α) quantile of the χ²_df distribution with the appropriate degrees of freedom
Goodness of Fit χ² Test

H0: π1 = π01, π2 = π02, ..., πk = π0k

This answers the question: does a set of counts follow a specified (H0) distribution?

n = total # of observations
Ei = π0i · n
df = k − 1

Simply evaluate using the test statistic above: Σ (O − E)²/E.
Note: π1 = π2 = ... = πk is a special case.
Fisher's Exact Test for Independence

For small sample sizes, methods like this "use exact small-sample distributions rather than large sample approximations." (Agresti p. 90-92) For 2×2 tables, the test looks at all possible combinations of outcomes under

H0: variables independent

to determine whether the observed outcome is unusual enough to reject the null.
Conditioning on both sets of marginal totals, we have

p(t) = p(n11 = t) = [(n1+ choose t)(n2+ choose n+1 − t)] / (n choose n+1)

Note: Because this is a 2×2 table with row and column totals known, n11 determines the other three counts.
Fisher’s Exact Test: Example

Muriel Bristol, a colleague of the esteemed statistician Sir Ronald


Fisher, claimed that she could, upon tasting a cup of tea mixed with
milk, divine whether the cup had had milk poured in before the tea,
or tea before the milk. They performed an experiment to test her
claim. . .
Fisher’s Exact Test: Example
Guess Poured First
Poured First Milk Tea Total
Milk 3 1 4
Tea 1 3 4
Total 4 4

In this case, the P-value for Fisher’s exact test is the null probability
of this table and of tables having even more evidence in favor of her
claim.
The observed table, with t0 = 3 correct choices, has null probability

(4 choose 3)(4 choose 1) / (8 choose 4) = 0.229.

The only table more extreme in the direction of Ha is n11 = 4, which has probability 0.014. The P-value is P(n11 ≥ 3) = 0.243.
Despite the underwhelming evidence in this test, Bristol did
eventually convince Fisher that she could tell the difference.
For more, see Agresti p. 90 - 92.
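A sketch in R checking these probabilities with the hypergeometric pmf and with fisher.test():

dhyper(3, m = 4, n = 4, k = 4)            # P(n11 = 3) = 0.229
dhyper(4, m = 4, n = 4, k = 4)            # P(n11 = 4) = 0.014
sum(dhyper(3:4, m = 4, n = 4, k = 4))     # one-sided P-value = 0.243

tea <- matrix(c(3, 1, 1, 3), nrow = 2, byrow = TRUE)
fisher.test(tea, alternative = "greater")$p.value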
Birth Order and Gender

We have data on 1,000 two-child families. It is typically thought that the genders of two offspring from the same parents are i.i.d. ∼ Bernoulli(0.5). So we can fill in the expected values: 250 for each of the four birth-order groups.

H0: π1 = π2 = π3 = π4 = 0.25
Ha: at least one πi ≠ 0.25

First Child            M           F
Second Child        M     F     M     F     Totals
Count              218   227   278   277     1000
Expected Value     250   250   250   250     1000
Birth Order and Gender
Chi-squared test for these data:

TS = Σ_{k=1}^{4} (n_obs,k − n_exp,k)² / n_exp,k

n = 1000, k = 4, df = k − 1 = 3
α = 0.05

n_obs <- c(218, 227, 278, 277)


n_exp <- c(250, 250, 250, 250)
(ts <- sum((n_obs-n_exp)^2/n_exp))

## [1] 12.264

(pval <- 1 - pchisq(ts, df = 3))

## [1] 0.006531413
Birth Order and Gender
With a p-value of 0.0065, we reject the null hypothesis at a significance level of 0.05. We have evidence to suggest that at least one πk ≠ 0.25.
(Figure: chi-square density with df = 3; the observed statistic 12.26 falls far in the right tail.)
Intro to Contingency Tables

Author: Nicholas Reich

Course: Categorical Data Analysis (BIOSTATS 743)

Made available under the Creative Commons Attribution-ShareAlike 4.0 International License.
Independence
Definition:
Two categorical variables are independent iff

πij = πi+ π+j,  ∀ i ∈ {1, 2, ..., I} and j ∈ {1, 2, ..., J}

or

P(X = i, Y = j) = P(X = i)P(Y = j)

Independence implies that the conditional distribution reverts to the marginal distribution:

πj|i = πij/πi+ = (πi+ π+j)/πi+ = π+j

or, under the independence assumption,

P(Y = j | X = i) = P(Y = j)
Testing for independence (Two-way contingency table)

• Under H0: πij = πi+ π+j ∀ i, j, the expected cell counts are

µij = nπi+ π+j

• Usually πi+ and π+j are unknown. Their MLEs are

π̂i+ = ni+/n,    π̂+j = n+j/n

• The estimated expected cell counts are

µ̂ij = nπ̂i+ π̂+j = ni+ n+j / n

• The Pearson χ² statistic is

X² = Σ_{i=1}^{I} Σ_{j=1}^{J} (nij − µ̂ij)²/µ̂ij

• µ̂ij requires estimating the πi+ and π+j, which contribute I − 1 and J − 1 free parameters, respectively (note the constraints Σᵢ πi+ = Σⱼ π+j = 1)
• The degrees of freedom are therefore

(IJ − 1) − (I − 1) − (J − 1) = (I − 1)(J − 1)

• X² is asymptotically χ²_((I−1)(J−1))
• It is helpful to look at the cell contributions {(O − E)²/E}; these residuals give useful information about where the model fits well or poorly
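As an R sketch with a hypothetical 2 × 3 table of counts, chisq.test() carries out exactly this computation and returns the estimated expected counts and standardized residuals:

tab <- matrix(c(20, 30, 25,
                15, 40, 35), nrow = 2, byrow = TRUE)
fit <- chisq.test(tab)
fit$expected     # estimated expected counts n_i+ n_+j / n
fit$stdres       # standardized residuals, approximately N(0, 1) under H0
fit              # X-squared statistic with df = (I - 1)(J - 1) = 2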
Measures of Diagnostic Tests

                    Diagnosis
Disease Status      +       −
D                   π11     π12
D̄                   π21     π22

• Sensitivity: P(+|D) = π11/π1+
• Specificity: P(−|D̄) = π22/π2+
• An ideal diagnostic test has high sensitivity and high specificity

Example:

                    Diagnosis
Disease Status      +       −
D                   0.86    0.14
D̄                   0.12    0.88

• Sensitivity = 0.86
• Specificity = 0.88

However, from the clinical point of view, sensitivity and specificity alone do not tell us how likely disease is given a test result. So we introduce the positive predictive value and the negative predictive value:

• Positive predictive value (PPV) = P(D|+) = π11/π+1
• Negative predictive value (NPV) = P(D̄|−) = π22/π+2

• Relationship between PPV and sensitivity:

PPV = P(D|+) = P(D ∩ +)/P(+)
    = P(+|D)P(D) / [P(+|D)P(D) + P(+|D̄)P(D̄)]
    = P(+|D)P(D) / [P(+|D)P(D) + (1 − P(−|D̄))P(D̄)]
    = Sensitivity × Prevalence / [Sensitivity × Prevalence + (1 − Specificity) × (1 − Prevalence)]
The same example:

                    Diagnosis
Disease Status      +       −
D                   0.86    0.14
D̄                   0.12    0.88

• If the prevalence is P(D) = 0.02:

PPV = (0.86 × 0.02) / (0.86 × 0.02 + 0.12 × 0.98) ≈ 13%

• Notice:

PPV ≠ π11/(π11 + π21)

This equality would hold only when n1/(n1 + n2), the proportion of diseased subjects in the sample, equals the disease prevalence.
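A one-line R sketch of the PPV formula with the values above:

sens <- 0.86; spec <- 0.88; prev <- 0.02
ppv <- sens * prev / (sens * prev + (1 - spec) * (1 - prev))
ppv    # about 0.13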
Comparing two groups
We first consider 2 × 2 tables. Suppose that the response variable
Y has two categories: success and failure. The explanatory variable
X has two categories, group 1 and group 2, with fixed sample sizes
in each group.

Response Y
Explanatory X Success Failure Row Total
group 1    n11 = x1    n12 = n1 − x1    n1
group 2    n21 = x2    n22 = n2 − x2    n2

The goal is to compare the probability of an outcome (success) of Y


across the two levels of X.
Assume:X1 ∼ bin(n1 , π1 ), X2 ∼ bin(n2 , π2 )

I difference of proportions
I relative risk
I odds ratio
Difference of Proportions

Response Y
Explanatory X Success Failure Row Total
group 1    n11 = x1    n12 = n1 − x1    n1
group 2    n21 = x2    n22 = n2 − x2    n2

I The difference of proportions of successes is: π1 − π2


I Comparison on failures is equivalent to comparison on
successes:
(1 − π1 ) − (1 − π2 ) = π2 − π1
• The difference of proportions takes values in [−1, 1]
• The estimate of π1 − π2 is π̂1 − π̂2 = n11/n1 − n21/n2
• The estimate of the asymptotic standard error is

σ̂(π̂1 − π̂2) = [π̂1(1 − π̂1)/n1 + π̂2(1 − π̂2)/n2]^(1/2)

• The statistic for testing H0: π1 = π2 vs. Ha: π1 ≠ π2 is

Z = (π̂1 − π̂2)/σ̂(π̂1 − π̂2),

which follows approximately a standard normal distribution (a difference of two asymptotically normal estimators is asymptotically normal)
I The CI is given by

(π̂1 − π̂2 ) ± Zα/2 σ̂(π̂1 − π̂2 )


Relative Risk

I Definition
r = π1 /π2
I Motivation: The difference between π1 = 0.010 and
π2 = 0.001 is more noteworthy than the difference between
π1 = 0.410 and π2 = 0.401. The “relative risk”
(0.010/0.001=10, 0.410/0.401=1.02) is more informative than
“difference of proportions” (0.009 for both).
I The estimate of r is
r̂ = π̂1 /π̂2
I The estimator converges to normality faster on the log scale.
I The estimator of log r is

log r̂ = log π̂1 − log π̂2

The asymptotic standard error of log r̂ is estimated by

σ̂(log r̂) = [(1 − π̂1)/(π̂1 n1) + (1 − π̂2)/(π̂2 n2)]^(1/2)

• Delta method: if √n(β̂ − β0) → N(0, σ²), then

√n(f(β̂) − f(β0)) → N(0, [f′(β0)]² σ²)

for any function f such that f′(β0) exists and is nonzero
• Here β = π1 or π2, and f(β) = log(π1) or log(π2)
I The CI for log r̂ is

[log r̂ − Z1−α/2 σ̂(log r̂ ), log r̂ + Z1−α/2 σ̂(log r̂ )]

I The CI for r̂ is

[exp{log r̂ − Z1−α/2 σ̂(log r̂ )}, exp{log r̂ + Z1−α/2 σ̂(log r̂ )}]


Odds Ratio

• Odds in group 1:

φ1 = π1/(1 − π1)

• Interpretation: φ1 = 3 means a success is three times as likely as a failure in group 1
• Odds ratio:

θ = φ1/φ2 = [π1/(1 − π1)] / [π2/(1 − π2)]

• Interpretation: θ = 4 means the odds of success in group 1 are four times the odds of success in group 2
• The estimate is

θ̂ = (n11 n22)/(n12 n21)
• log(θ̂) converges to normality much faster than θ̂
• An estimate of the asymptotic standard error of log(θ̂) is

σ̂(log θ̂) = √(1/n11 + 1/n12 + 1/n21 + 1/n22)
This formula can be derived using the Delta method. Recall

log θ̂ = log(π̂1) − log(1 − π̂1) − log(π̂2) + log(1 − π̂2)

First, take f(β) = log(π̂1) − log(1 − π̂1). Then

σ² = π1(1 − π1)/n1,    f′(β) = 1/π1 + 1/(1 − π1)

[f′(β)]² σ² = 1/(n1 π1) + 1/(n1(1 − π1))

whose estimate is 1/n11 + 1/n12. Similarly, the terms for f(β) = log(π̂2) − log(1 − π̂2) give 1/n21 + 1/n22.


I The Wald CI for log θ̂ is

log θ̂ ± Zα/2 σ̂(log θ̂)

• Exponentiating the endpoints provides a confidence interval for θ
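A short R sketch (hypothetical 2 × 2 counts) of the point estimate and Wald interval:

n11 <- 30; n12 <- 20; n21 <- 15; n22 <- 35
theta_hat <- (n11 * n22) / (n12 * n21)
se_log    <- sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)
ci <- exp(log(theta_hat) + c(-1, 1) * qnorm(0.975) * se_log)
c(OR = theta_hat, lower = ci[1], upper = ci[2])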
Relationship between Odds Ratio and Relative Risk

• A large odds ratio does not necessarily imply a large relative risk
• From the definitions of relative risk and odds ratio, we have

θ = (π1/π2) × (1 − π2)/(1 − π1) = relative risk × (1 − π2)/(1 − π1)

• When the probabilities π1 and π2 (the risk in each row group) are both very small, the second ratio above is ≈ 1. Thus

odds ratio ≈ relative risk

• This means that when the relative risk is not directly estimable, e.g. in case-control studies, and the probabilities π1 and π2 are both very small, the relative risk can be approximated by the odds ratio.
Case-Control Studies and Odds Ratio

Consider the case-control study of lung cancer:

Lung Cancer
Smoker Cases Controls
Yes 688 650
No 21 59
Total 709 709

I People are recruited based on lung cancer status, therefore


P(Y = j) is known. However P(X = i) is unknown
I Conditional probabilities P(X = i|Y = j) can be estimated
I Conditional probabilities P(Y = j|X = i) cannot be estimated
I Relative risk and difference of proportions cannot be estimated
• Odds can be estimated:

Odds of lung cancer among smokers = P(Case | Smoker) / P(Control | Smoker)
  = [P(Case ∩ Smoker)/P(Smoker)] / [P(Control ∩ Smoker)/P(Smoker)]
  = P(Case ∩ Smoker) / P(Control ∩ Smoker)
  = 688/650 = 1.06

• This odds calculation does not depend on the probability of being a smoker


• The odds ratio can also be estimated:

θ̂ = [P̂(X = 1|Y = 1) P̂(X = 2|Y = 2)] / [P̂(X = 1|Y = 2) P̂(X = 2|Y = 1)] = 2.97
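A sketch in R verifying the 2.97 from the cell counts and adding a Wald confidence interval:

n11 <- 688; n12 <- 650    # smokers: cases, controls
n21 <- 21;  n22 <- 59     # non-smokers: cases, controls
theta_hat <- (n11 * n22) / (n12 * n21)               # about 2.97
se_log    <- sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)
exp(log(theta_hat) + c(-1, 1) * qnorm(0.975) * se_log)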
Supplementary: Review of the Delta Method

The Delta method builds upon the Central Limit Theorem to allow
us to examine the convergence of the distribution of a function g of
a random variable X .
It is not too complicated to derive the Delta method in the
univariate case. We need to use Slutsky’s Theorem along the way; it
will be helpful to first review ideas of convergence in order to better
understand where Slutsky’s Theorem fits into the derivation.
Delta Method: Convergence of Random Variables

Consider a sequence of random variables X1 , X2 , . . . , Xn , where the


distribution of Xi may be a function of i.

• Let Fn(x) be the CDF of Xn and F(x) be the CDF of X. We say that Xn converges in distribution to X, written Xn →d X, if lim_{n→∞} [Fn(x) − F(x)] = 0 for all x where F(x) is continuous.
• We say that Xn converges in probability to X, written Xn →p X, if for every ε > 0, lim_{n→∞} P(|Xn − X| > ε) = 0.

Note that if Xn →p X, then Xn →d X, since Fn(x) = P(Xn ≤ x) and F(x) = P(X ≤ x). (This is not a proof, but an intuition; the Wikipedia article on convergence of random variables has a nice proof.)
Delta Method: Slutsky’s Theorem and First-Order Taylor
Approximation

I Recall that Slutsky’s Theorem tells us that if some random


variable Xn converges in distribution to X and some random
variable Yn converges in probability to c, then Xn + Yn
converges in distribution to X + c and Xn Yn converges in
distribution to cX .
• Recall that the first-order Taylor approximation of a function g centered at u can be written as g(x) = g′(u)(x − u) + g(u) + R(x), where R(x) = Σ_{i≥2} g^(i)(u)(x − u)^i / i!.
Delta Method: Hand-wave-y Derivation


Suppose we know that √n(Xn − θ) →d N(0, σ²) and we are interested in the behavior of some function g(Xn) as n → ∞. If g′(θ) exists and is not zero, we can write g(Xn) ≈ g′(θ)(Xn − θ) + g(θ) using Taylor's approximation:

g(Xn) = g′(θ)(Xn − θ) + g(θ) + Σ_{i≥2} g^(i)(θ)(Xn − θ)^i / i!
Delta Method: Hand-wave-y Derivation

Some manipulation gives:

√n g(Xn) = √n g′(θ)(Xn − θ) + √n g(θ) + √n Σ_{i≥2} g^(i)(θ)(Xn − θ)^i / i!

or, using the definition of R from the previous slide,

√n(g(Xn) − g(θ)) = √n g′(θ)(Xn − θ) + √n R(Xn)
Delta Method: Hand-wave-y Derivation

Since g′(θ) is a constant with respect to n and √n(Xn − θ) →d N(0, σ²), we have

g′(θ) √n(Xn − θ) →d N(0, σ² [g′(θ)]²).

It can be shown that the remainder term √n R(Xn) →p 0 (see the Stephens link from McGill below for a proof).
We now have the necessary setup to apply Slutsky's Theorem, and we can conclude that

√n(g(Xn) − g(θ)) →d N(0, σ² [g′(θ)]²).
Delta Method: References

I [Link]
[Link]
I https:
//[Link]/wiki/Convergence_of_random_variables
I [Link]
I [Link]
I [Link]
[Link]
Lecture 5: Likelihood Ratio Confidence Intervals &
Bayesian Methods for Contingency Tables
Nick Reich / transcribed by Bianca Doone and Guandong Yang

Likelihood Ratio Confidence Intervals


Motivating Example: Poisson Sample-Mean Estimation (1-parameter Poisson Mean)
• The probability mass function (pmf) for the Poisson distribution is defined as:

p(y|µ) = e^(−µ) µ^y / y!,    y = 0, 1, ... (non-negative integers)

with E(Y) = V(Y) = µ.

• Given observations y1, y2, ..., yn, assuming yi ~iid Poisson(µ), the log-likelihood is defined as:

L(µ|y) = log( Πᵢ e^(−µ) µ^(yi) / yi! ) = log( e^(−nµ) µ^(Σᵢ yi) / Πᵢ yi! )
       = −nµ + log(µ) Σᵢ yi + C

where C = −Σᵢ log(yi!) is a constant that does not depend on µ.

LR CI’s - Motivating Example Cont’d

L(µ|y) ∝ −nµ + log(µ) Σᵢ yi

• Note that taking the first derivative and setting it equal to zero gives the MLE:

µ̂ = ȳ = (1/n) Σᵢ yi

• Let ỹ = (2, 2, 3)T . Then the log-likelihood is

L(µ|ỹ) ∝ −3µ + 7 log(µ)


maximized at µ̂ = 7/3.

LR CI’s - Visualization of this likelihood

(Figure: log-likelihood L(µ | ỹ) plotted against µ, peaking at µ̂ = 7/3.)

LR CI’s - A Few Notes


• Each point on this curve represents a “fit” to the data.

• In general, adding more data makes the maximized log-likelihood smaller (more negative), since each discrete observation contributes a log-probability ≤ 0.

• Likelihoods always need to be relative (i.e. L1 vs. L0 ).

• Using an absolute scale is not particularly meaningful.

100(1 − α)% likelihood-based C.I.


Suppose Θ is a set of parameters, and we let Θ or a subset of Θ vary
Heuristic Definition:
• We are interested in the set Θ for which

LRTS(Θ) = −2[L(Θ) − L(Θ̂)] < χ2df (1 − α)

with Θ̂ as the fixed value at the MLE, and the LR test statistic compared to the (1 − α)th quantile of
χ2df
• The set of Θ values where this holds is a 100(1 − α)% confidence region, with degrees of freedom equal to the number of parameters the likelihood is varying over (free parameters)

LR CI’s - Motivating Example Cont’d
Motivating Example: Poisson Sample-Mean Estimation (1-parameter Poisson Mean)
• Let Θ = µ, then we can define:
LRTS(µ) = −2[L(µ) − L(µ̂)]

• Setting α = 0.05, we can define a 95% confidence interval for µ as

{µ : LRTS(µ) < χ²₁(.95)}
  = {µ : −2[L(µ) − L(µ̂)] < χ²₁(.95)}
  = {µ : L(µ) > L(µ̂) − χ²₁(.95)/2}

LR CI’s - Motivating Example Cont’d


Using the MLE µ̂ = 7/3 and the value χ²₁(.95) = 3.84, we derive an approximate 95% confidence interval for µ:

{µ : L(µ) > L(7/3) − 3.84/2} ≈ {µ : 1.1 ≤ µ ≤ 4.5}
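The same interval can be found numerically in R (a sketch using the log-likelihood above, up to its constant):

y <- c(2, 2, 3)
loglik <- function(mu) -length(y) * mu + sum(y) * log(mu)   # up to a constant
mu_hat <- mean(y)                                           # 7/3
lrts   <- function(mu) -2 * (loglik(mu) - loglik(mu_hat))

mu_grid <- seq(0.01, 10, by = 0.01)
range(mu_grid[lrts(mu_grid) < qchisq(0.95, df = 1)])        # roughly (1, 4.5)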
(Figure: log-likelihood curve with a horizontal line at L(µ̂) − χ²₁(.95)/2; the µ values where the curve lies above the line form the 95% likelihood-ratio CI.)

Bayesian Method for Contingency Tables


Bayesian methods for contingency tables can be a good alternative when sample sizes are small, because there is less reliance on large-sample theory; however, results can be sensitive to the choice of prior.
• Supplement on impact of priors: See Chapter 3.6.4 in Agresti.

Beta/Binomial Example - Handedness
Q: Are women and men left-handed at the same rate?
Gender RH LH Total
Men 43 9 n1 = 52
Women 44 4 n2 = 48
Total 77 13 100
In other words:
• Is there a difference in the proportions of men who are left-handed and women who are left-handed?

H0 : Pr(left-handed|male) = Pr(left-handed|female)

• Is the difference between left-handed men and left-handed women equal to zero?

H0 : Pr(left-handed|male) − Pr(left-handed|female) = 0

Beta/Binomial Example - Handedness


For a 2 × 2 table, the chi-square test for independence is equivalent to the two-sided test for a difference in proportions.

dat <- matrix(c(43, 9, 44, 4), ncol = 2, byrow = T)
chi <- chisq.test(dat, correct = F)
dif <- prop.test(dat, correct = F)

Test                    Test Statistic    DF    P-value
Chi-square test         1.777             1     0.182
Two-proportion test     1.777             1     0.182

Beta/Binomial Example - Handedness


• Probability Structure:
– Men who are left-handed: Y1 ∼ Bin(n1 , π1 )
– Women who are left-handed: Y2 ∼ Bin(n2 , π2 )
• Observed Data:
– (y1 , y2 ) = (9, 4)
– (n1 , n2 ) = (52, 48)
• Let us assign a Uniform prior onto π1 and π2 , such that
– π1 ∼ U (0, 1) (also considered ∼ Beta(1, 1))
– π2 ∼ U (0, 1)
• Because the Beta distribution is a conjugate prior for the Binomial likelihood, the posterior distributions for π1 and π2 are

p(π1|y, n) ∼ Beta(y1 + 1, n1 − y1 + 1)
p(π2|y, n) ∼ Beta(y2 + 1, n2 − y2 + 1)

Bayesian Method - Computational Technique
1. Simulate N independent draws from p(π1|y, n) and p(π2|y, n)
2. Compute θ^(i) = π1^(i) − π2^(i) for i = 1, ..., N
3. Plot the empirical posterior of θ
4. Calculate summary statistics (see the sketch below)
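A sketch of these four steps in R for the handedness data, assuming the Beta(1, 1) priors above (the seed and N are arbitrary choices):

set.seed(743)
N  <- 10000
p1 <- rbeta(N, 9 + 1, 52 - 9 + 1)      # posterior for pi_1 (men)
p2 <- rbeta(N, 4 + 1, 48 - 4 + 1)      # posterior for pi_2 (women)
theta <- p1 - p2                        # difference in proportions

hist(theta, main = "Posterior of pi_1 - pi_2")
mean(theta > 0)                         # posterior Pr(pi_1 > pi_2)
quantile(theta, c(0.025, 0.975))        # 95% credible interval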

Bayesian Method - Multinomial/Dirichlet


• Suppose y is a vector of counts, with yj the number of observations for each possible outcome j
• Then the likelihood can be written as

p(y|θ) ∝ Πⱼ θj^(yj)

where Σⱼ θj = 1 and θ is the vector of outcome probabilities.
• The conjugate prior distribution is a multivariate generalization of the Beta distribution: The Dirichlet

Bayesian Method - Multinomial/Dirichlet


• We set the Dirichlet distribution as the prior for θ: θ ∼ Dir(α), with pdf

p(θ|α) ∝ Πⱼ θj^(αj − 1)

where α is a vector of hyperparameters, θj > 0, and Σⱼ θj = 1
• The posterior distribution can then be derived as

p(θ|y) ∼ Dir(α1 + y1, α2 + y2, ..., αk + yk)

• Plausible "non-informative" priors:
  – Setting αj = 1 ∀j gives equal density to any vector θ such that Σⱼ θj = 1
  – Setting αj = 0 ∀j (an improper prior) gives a uniform distribution on log(θj); if yj > 0 ∀j, the posterior is still proper

Multinomial/Dirichlet - Example in R
Adapted from Bayesian Data Analysis 3
• A poll was conducted with n = 1447 participants, with the following results:
– Obama: y1 = 727
– Romney: y2 = 583
– Other: y3 = 137

• The estimand of interest is θ1 − θ2
• Assuming simple random sampling, we have

(y1, y2, y3) ∼ Multinomial(n, (θ1, θ2, θ3))
• We now apply the same computational technique as in the univariate case. . .

Multinomial/Dirichlet - Example in R

# data
y <- c(727, 583, 137)

# "uniform" hyperparameter
a <- c(1, 1, 1)

# prior draws (rdirichlet is available in, e.g., the gtools package)
library(gtools)
pri <- rdirichlet(100, a)

# posterior draws
postr <- rdirichlet(1000, y + a)

Multinomial/Dirichlet - Visualization of Prior

(Figure: marginal prior densities for Group 1, Group 2, and Group 3; each is approximately uniform on (0, 1).)
Multinomial/Dirichlet - Visualization of Prior (3D)

(Figure: ternary plot of draws from the Dirichlet(1, 1, 1) prior over the three components V1, V2, V3.)
Multinomial/Dirichlet - Visualization of Posterior

(Figure: marginal posterior densities for Group 1, Group 2, and Group 3; each is tightly concentrated near its observed proportion.)
Multinomial/Dirichlet - Visualization of Posterior (3D)

(Figure: ternary plot of posterior draws over V1, V2, V3, concentrated near the observed proportions.)

Multinomial/Dirichlet - Summary Statistics

poll_diff <- postr[, 1] - postr[, 2]
hist(poll_diff, main = "Difference between group 1 and group 2")

(Figure: histogram of poll_diff, the posterior draws of θ1 − θ2, centered near 0.10.)

Multinomial/Dirichlet - Summary Statistics

### Point Estimates


mean(poll_diff)

## [1] 0.09999157
### Posterior probability that theta1 exceeds theta2
mean(poll_diff > 0)

## [1] 1
### 95% CI
quantile(poll_diff, c(.025, .975))

## 2.5% 97.5%
## 0.05538754 0.14811648

Lecture 6: GLMs

Author: Nicholas Reich Transcribed by Nutcha Wattanachit/Edited by


Bianca Doone

Course: Categorical Data Analysis (BIOSTATS 743)

Made available under the Creative Commons Attribution-ShareAlike 4.0 International License.
Generalized Linear Models (GLMs)
GLMs are extensions, or generalizations, of linear regression models to encompass non-normal response distributions and to model functions of the mean.
• Example of an ordinary LM:

Y = Xβ + ε,    εi ~iid N(0, σ²),    with E[Y|X] = Xβ
The best fit line on the following plot represents E (Y |X ).
(Figure: scatterplot of y against x with the fitted regression line E(Y | X).)
Overview of GLMs

• Early versions of GLMs were used by Fisher in the 1920s, and GLM theory was unified in the 1970s.
I Fairly flexible parametric framework, good at describing
relationships and associations between variables
I Fairly simple (‘transparent’) and interpretable, but basic GLMs
are not generally seen as the best approach for predictions.
I Both frequentist and Bayesian methods can be used for
parametric and nonparametric models.
GLMs: Parametric vs. Nonparametric Models

I Parametric models: Assumes data follow a fixed distribution


defined by some parameters. GLMs are examples of parametric
models. If assumed model is “close to” the truth, these
methods are both accurate and precise.
I Nonparametric models: Does not assume data follow a fixed
distribution, thus could be a better approach if assumptions are
violated.
Components of GLMs
1. Random Component: Response variable Y with N observations
from a distribution in the exponential family:
  – One parameter: f(yi|θi) = a(θi) b(yi) exp{yi Q(θi)}
  – Two parameters: f(yi|θi, Φ) = exp{[yi θi − b(θi)]/a(Φ) + c(yi, Φ)}, where Φ is fixed for all observations
  – Q(θi) is the natural parameter
2. Systematic Component: The linear predictor relating ηi to Xi :
I ηi = Xi β
3. Link Function: Connects random and systematic components
I µi = E (Yi )
I ηi = g(µi ) = g(E (Yi |Xi )) = Xi β
I g(µi ) is the link function of µi
g(µ) = µ, called the identity link, has ηi = µi (a linear model for a
mean itself).
Example 1: Normal Distribution (with fixed variance)
Suppose yi follows a normal distribution with
• mean µi = ŷi = E(Yi|Xi)
• fixed variance σ². The pdf is

f(yi|µi, σ²) = (1/√(2πσ²)) exp{−(yi − µi)²/(2σ²)}
            = [(1/√(2πσ²)) exp{−yi²/(2σ²)}] exp{yi µi/σ²} exp{−µi²/(2σ²)}

• In exponential-family form:
  – θ = µi
  – a(µi) = exp{−µi²/(2σ²)}
  – b(yi) = (1/√(2πσ²)) exp{−yi²/(2σ²)}
  – Q(µi) = µi/σ²
Example 2: Binomial Logit for binary outcome data

• Pr(Yi = 1) = πi = E(Yi|Xi)
• The pmf is

f(yi|θi) = πi^(yi) (1 − πi)^(1−yi) = (1 − πi) [πi/(1 − πi)]^(yi)
        = (1 − πi) exp{yi log[πi/(1 − πi)]}

• In exponential-family form:
  – θ = πi
  – a(πi) = 1 − πi
  – b(yi) = 1
  – Q(πi) = log[πi/(1 − πi)]
• The natural parameter Q(πi) implies the canonical link function: logit(πi) = log[πi/(1 − πi)]
Example 3: Poisson for count outcome data

• Yi ∼ Pois(µi)
• The pmf is

f(yi|µi) = e^(−µi) µi^(yi) / yi! = e^(−µi) (1/yi!) exp{yi log µi}

• In exponential-family form:
  – θ = µi
  – a(µi) = e^(−µi)
  – b(yi) = 1/yi!
  – Q(µi) = log µi
Deviance
For a particular GLM for observations y = (y1 , ..., yN ), let L(µ|y )
denote the log-likelihood function expressed in terms of the means
µ = (µ1 , ..., µN ). The deviance of a Poisson or binomial GLM is

D = −2[L(µ̂|y ) − L(y |y )]

I L(µ̂|y ) denotes the maximum of the log likelihood for y1 , ..., yn


expressed in terms of µ̂1 , ..., µ̂n
I L(y |y ) is called a saturated model (a perfect fit where µ̂i = yi ,
representing “best case” scenario). This model is not useful,
since it does not provide data reduction. However, it serves as
a baseline for comparison with other model fits.
I Relationship with LRTs: This is the likelihood-ratio statistic for
testing the null hypothesis that the model holds against the
general alternative (i.e., the saturated model)
Logistic Regression

For “simple” one predictor case where Yi ∼ Bernoulli(πi ) and


Pr(Yi = 1) = πi:

logit(πi) = log[πi/(1 − πi)]
         = logit(Pr(Yi = 1))
         = logit(E[Yi])
         = g(E[Yi])
         = Xβ
         = β0 + β1 xi,

which implies Pr(Yi = 1) = e^(Xβ)/(1 + e^(Xβ)).

• g does not have to be a linear function ("linear model" means linear with respect to β).
Logistic Regression (Cont.)

The graphs below illustrate the correspondence between the linear systematic component and the logit link. The inverse-logit transformation restricts the fitted probabilities πi to lie between 0 and 1.

(Figure: two panels showing lines with β1 > 0 and β1 < 0 on the log-odds scale, logit(πi), and the corresponding S-shaped curves on the probability scale, πi.)
Example: Linear Probability vs. Logistic Regression Models

I For a binary response, the linear probability model


π(x ) = α + β1 X1 + ... + βp Xp with independent observations is
a GLM with binomial random component and identity link
function
I Logistic regression model is a GLM with binomial random
component and logit link function
Example: Linear Probability vs. Logistic Regression Models
(Cont.)

An epidemiological survey of 2484 subjects to investigate snoring as


a risk factor for heart disease.
n<-c(1379, 638, 213, 254)
snoring<-rep(c(0,2,4,5),n)
y<-rep(rep(c(1,0),4),c(24,1355,35,603,21,192,30,224))
Example: Linear Probability vs. Logistic Regression Models
(Cont.)
library(MASS)
logitreg <- function(x, y, wt = rep(1, length(y)),
intercept = T, start = rep(0, p), ...)
{
fmin <- function(beta, X, y, w) {
p <- plogis(X %*% beta)
-sum(2 * w * ifelse(y, log(p), log(1-p)))
}
gmin <- function(beta, X, y, w) {
eta <- X %*% beta; p <- plogis(eta)
t(-2 * (w *dlogis(eta) * ifelse(y, 1/p, -1/(1-p))))%*% X
}
if(is.null(dim(x))) dim(x) <- c(length(x), 1)
dn <- dimnames(x)[[2]]
if(!length(dn)) dn <- paste("Var", 1:ncol(x), sep="")
p <- ncol(x) + intercept
if(intercept) {x <- cbind(1, x); dn <- c("(Intercept)", dn)}
if(is.factor(y)) y <- (unclass(y) != 1)
fit <- optim(start, fmin, gmin, X = x, y = y, w = wt, ...)
names(fit$par) <- dn
invisible(fit)
}
snoring.logit <- logitreg(x=snoring, y=y, hessian=T, method="BFGS")
snoring.logit$par

## (Intercept) Var1
## -3.866245 0.397335

- Logistic regression model fit: logit[π̂(x )]= - 3.87 + 0.40x
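For comparison, a sketch of the same fit using R's built-in glm() (assuming the snoring and y vectors constructed above):

glm_fit <- glm(y ~ snoring, family = binomial(link = "logit"))
coef(glm_fit)    # close to (-3.87, 0.40)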


Example: Linear Probability vs. Logistic Regression Models
(Cont.)
lpmreg <- function(x, y, wt = rep(1, length(y)),
intercept = T, start = rep(0, p), ...)
{
fmin <- function(beta, X, y, w) {
p <- X %*% beta
-sum(2 * w * ifelse(y, log(p), log(1-p)))
}
gmin <- function(beta, X, y, w) {
p <- X %*% beta;
t(-2 * (w * ifelse(y, 1/p, -1/(1-p))))%*% X
}
if(is.null(dim(x))) dim(x) <- c(length(x), 1)
dn <- dimnames(x)[[2]]
if(!length(dn)) dn <- paste("Var", 1:ncol(x), sep="")
p <- ncol(x) + intercept
if(intercept) {x <- cbind(1, x); dn <- c("(Intercept)", dn)}
if(is.factor(y)) y <- (unclass(y) != 1)
fit <- optim(start, fmin, gmin, X = x, y = y, w = wt, ...)
names(fit$par) <- dn
invisible(fit)
}
snoring.lpm <- lpmreg(x=snoring, y=y, start=c(.05,.05), hessian=T, method="BFGS")
snoring.lpm$par

## (Intercept) Var1
## 0.01724645 0.01977784

- Linear probability model fit: π̂(x ) = 0.0172 + 0.0198x


Example: Linear Probability vs. Logistic Regression Models
(Cont.)
Coefficient Interpretation in Logistic Regression

Our goal is to say in words what βj is. Consider

logit(Pr (Yi = 1)) = β0 + β1 X1i + β2 X2i + ...

• The logit at X1 = k and at the one-unit increase X1 = k + 1 is given by:

logit(Pr (Yi = 1|X1 = k, X2 = z)) = β0 + β1 k + β2 z


logit(Pr (Yi = 1|X1 = k + 1, X2 = z)) = β0 + β1 (k + 1) + β2 z
Coefficient Interpretation in Logistic Regression (Cont.)

I Subtracting the first equation from the second:

log[odds(πi |X1 = k+1, X2 = z)]−log[odds(πi |X1 = k, X2 = z)] = β1

• The difference can be expressed as

log[ odds(πi|X1 = k + 1, X2 = z) / odds(πi|X1 = k, X2 = z) ] = log odds ratio

• So log OR = β1, or equivalently OR = e^(β1).


Coefficient Interpretation in Logistic Regression (Cont.)

I For continuous Xi : For every one-unit increase in Xi , the


estimated odds of outcome changes by a factor of e β1 or by
[(e β1 − 1) × 100]%, controlling for other variables
I For categorical Xi : Group Xi has e β1 times the odds of
outcome compared to group Xj , controlling for other variables
(Figure: two panels on the log-odds scale, logit(πi). Left, continuous X: the slope β1 compares k to k + 1. Right, categorical X: β1 compares the odds for group A and group B.)
Lecture 7: GLMs: Score equations,
Residuals

Author: Nick Reich / Transcribed by Bing Miu and Yukun Li

Course: Categorical Data Analysis (BIOSTATS 743)


Likelihood Equations for GLMs

• The GLM log-likelihood function is given as follows:

L(β) = Σᵢ log f(yi|θi, φ)
     = Σᵢ {[yi θi − b(θi)]/a(φ) + C(yi, φ)}
     = Σᵢ [yi θi − b(θi)]/a(φ) + Σᵢ C(yi, φ)

I φ is a dispersion parameter. Not indexed by i, assumed to be


fixed
I θi contains β, from ηi
I C (yi , φ) is from the random component.
Score Equations

• Take the derivative of the log-likelihood and set it equal to 0:

∂L(β)/∂βj = Σᵢ ∂Li/∂βj = 0,  ∀ j

• Since ∂Li/∂θi = (yi − µi)/a(φ), µi = b′(θi), Var(Yi) = b″(θi) a(φ), and ηi = Σⱼ βj xij, the score equations become

0 = Σᵢ ∂Li/∂βj = Σᵢ [(yi − µi)/a(φ)] [a(φ)/Var(Yi)] xij (∂µi/∂ηi)
  = Σᵢ [(yi − µi) xij / Var(Yi)] (∂µi/∂ηi)

• V(θ) = b″(θ) is the variance function of the GLM.
• µi = E[Yi|xi] = g⁻¹(Xi β). These functions are typically non-linear in the β's, so the score equations require iterative computational solutions.
Example: Score Equations for a Binomial GLM (Ch 5.5.1)

Yi ∼ Binomial(ni, πi)

• The joint probability mass function:

Π_{i=1}^{N} π(xi)^(yi) [1 − π(xi)]^(ni − yi)

• The log-likelihood:

L(β) = Σⱼ (Σᵢ yi xij) βj − Σᵢ ni log[1 + exp(Σⱼ βj xij)]

• The score equations:

∂L(β)/∂βj = Σᵢ (yi − ni π̂i) xij,    noting that π̂i = e^(Xi β)/(1 + e^(Xi β))
Asymptotic Covariance of β̂

• The likelihood function determines the asymptotic covariance of the ML estimator β̂.
• The information matrix I has (h, j) elements

I_hj = E[ −∂²L(β)/(∂βh ∂βj) ] = Σ_{i=1}^{N} [xih xij / Var(Yi)] (∂µi/∂ηi)²

where wi denotes

wi = (1/Var(Yi)) (∂µi/∂ηi)²
Asymptotic Covariance Matrix of β̂:

• The information matrix is therefore equivalent to I = Σ_{i=1}^{N} xih xij wi = X^T W X
• W is a diagonal matrix with the wi as its diagonal elements. In practice, W is evaluated at the MLE β̂ and depends on the link function
• The square roots of the main diagonal elements of (X^T W X)⁻¹ are the estimated standard errors of β̂
Analogous to SLR

             SLR                             GLM
Var(β̂i)      σ̂² / Σ_{i=1}^{N} (xi − x̄)²      i-th main diagonal element of (X^T W X)⁻¹
Cov(β̂)       σ̂² (X^T X)⁻¹                    (X^T W X)⁻¹
Residuals and Diagnostics

• Deviance tests
  – A measure of goodness of fit in GLMs based on the likelihood
  – Most useful as a comparison between models (used as a screening method to identify important covariates)
  – Use the saturated model as a baseline for comparison with other model fits
  – For a Poisson or binomial GLM: D = −2[L(µ̂|y) − L(y|y)]
• Examples of deviances:

Model       D(y, µ̂)
Gaussian    Σ (yi − µ̂i)²
Poisson     2 Σ [yi ln(yi/µ̂i) − (yi − µ̂i)]
Binomial    2 Σ [yi ln(yi/µ̂i) + (ni − yi) ln((ni − yi)/(ni − µ̂i))]
Deviance tests for nested models

• Consider two models, M0 with fitted values µ̂0 and M1 with fitted values µ̂1, where M0 is nested within M1:

M1: g(µ1) = β0 + β1 X11 + β2 X12
M0: g(µ0) = β0 + β1 X11

• Simpler models have smaller maximized log-likelihoods and larger deviances: L(µ̂0|y) ≤ L(µ̂1|y) and D(y|µ̂1) ≤ D(y|µ̂0).
• The likelihood-ratio statistic comparing the two models is the difference between the deviances:

−2[L(µ̂0|y) − L(µ̂1|y)]
  = −2[L(µ̂0|y) − L(y|y)] − {−2[L(µ̂1|y) − L(y|y)]}
  = D(y|µ̂0) − D(y|µ̂1)
Hypothesis test with differences in Deviance

• H0: βi1 = ... = βij = 0; fit a full and a reduced model
• The test statistic is the difference in deviance, where df is the number of parameters that differ between µ1 and µ0:

D(y|µ̂0) − D(y|µ̂1) ∼ χ²_df

• Reject H0 if the calculated chi-square value is larger than χ²_(df, 1−α). (See the sketch below.)
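A sketch in R with simulated data (hypothetical variables x1, x2, y), showing that the deviance difference is the LRT statistic reported by anova():

set.seed(1)
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
dat$y <- rbinom(200, 1, plogis(-0.5 + dat$x1))
m0 <- glm(y ~ x1,      family = binomial, data = dat)   # reduced model
m1 <- glm(y ~ x1 + x2, family = binomial, data = dat)   # full model
deviance(m0) - deviance(m1)       # difference in deviance = LRT statistic
anova(m0, m1, test = "Chisq")     # same statistic, compared to chi-square(df = 1)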
Residual Examinations

• Pearson residuals:
  ei^P = (yi − µ̂i)/√V(µ̂i), where µ̂i = g⁻¹(η̂i) = g⁻¹(Xi β̂)
• Deviance residuals:
  ei^D = sign(yi − µ̂i) √di, where di is the deviance contribution of the i-th observation and sign(x) = 1 if x > 0, −1 if x ≤ 0
• Standardized residuals:
  ri = ei/√(1 − ĥi), where ei = (yi − µ̂i)/√V(µ̂i), ĥi is the leverage of observation i, and ri ≈ N(0, 1)
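A sketch of how these residual types are extracted from a fitted glm object in R (simulated data, hypothetical fit):

set.seed(2)
x <- rnorm(100); y <- rbinom(100, 1, plogis(x))
fit <- glm(y ~ x, family = binomial)
head(residuals(fit, type = "pearson"))    # (y_i - mu_hat_i)/sqrt(V(mu_hat_i))
head(residuals(fit, type = "deviance"))   # sign(y_i - mu_hat_i) * sqrt(d_i)
head(rstandard(fit))                      # standardized (deviance) residuals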
Residual Plot
Problem: Residual plot is hard to interpret for logistic regression
(Figure: raw residuals plotted against expected values for a logistic regression.)
Binned Residual Plot
• Group observations into ordered groups (by xj, ŷ, or xij), with an equal number of observations per group.
• Compute the group-wise average of the raw residuals.
• Plot the average residuals vs. the predicted values; each dot represents a group.
(Figure: binned average residuals plotted against expected values.)
Binned Residual Plot (Part 2)
• Red lines indicate ± 2 standard-error bounds, within which one would expect about 95% of the binned residuals to fall.
• An R function is available in the arm package:

library(arm)
binnedplot(x, y, nclass = ...)
# x <- expected values; y <- residual values; nclass <- number of bins
(Figure: binned residual plot of average residuals vs. expected values, with ± 2 SE bounds.)
Binned Residual Plot (Part 3)

I In practice may need to fiddle with the number of observations


per group. Default will take the value of nclass according to
the n such that:
– if n ≥ 100, nclass = floor (sqrt(length(x )));
– if 10 < n < 100, nclass = 10;
– if n < 10, nclass = floor (n/2).
Ex: Binned Residual Plot with different bin sizes
(Figure: four binned residual plots of average residuals vs. expected values, with bin sizes 10, 50, 100, and 500.)


Lecture 8: GLMs: Residuals,
Newton-Raphson Algorithm, CIs

Author: Nick Reich / Transcribed by Yukun Li and Daveed Goldenberg

Course: Categorical Data Analysis (BIOSTATS 743)


Model Diagnostics via Residuals

I Example: Heart disease and blood pressure (Ch.6.2.2)


I Random sample of male residents in Framingham, MA aged
40-57. The response variable is whether they developed
coronary heart disease during a six-year follow-up period.
I Let πi be the probability of heart disease for blood pressure
category i.
I Let xi be the categories of systolic blood pressure
Model Diagnostics via Residuals

I Independent model (Blood Pressure is independent of Heart


Disease)
logit(πi ) = α
I Linear logit model (models association between Blood Pressure
and Heart Disease)

logit(πi ) = α + βxi
Model Diagnostics via Residuals

SBP Independence Linear Logit


<117 -2.62 -1.11
117-126 -0.12 2.37
127-136 -2.02 -0.95
137-146 -0.74 -0.57
147-156 0.84 0.13
157-166 0.93 -0.33
167-186 3.76 0.65
>186 3.07 -0.18
Model Diagnostics via Residuals
I Standardized residual plot of two models
(Figure: standardized residuals plotted against SBP for the independence model and the linear logit model.)
Model Diagnostics via Residuals

• The independence model has residuals that increase with blood pressure; a pattern like this violates the assumption that standardized residuals are approximately normal with mean 0 and variance 1.
• The residuals show that the independence model does not fit well.
• The linear logit model, by contrast, shows no pattern in the residual plot and appears to fit well.
Newton Raphson Algorithm
I An iterative method for solving nonlinear equations
I General steps:
I 1. initial guess for the solution
I 2. approximate the function locally and find its maximum
I 3. the maximum becomes the next guess
I 4. repeat steps 2 and 3 until convergence
• Recall the estimating equations, the score vector, and the Hessian:

∂L(β)/∂β = 0

u^T = (∂L(β)/∂β1, ..., ∂L(β)/∂βp)

H = ( ∂²L(β)/(∂βi ∂βj) ),  ∀ i, j = 1, 2, ..., p

• H is the Hessian matrix; −H is also called the observed information.
Newton Raphson Algorithm

• We are trying to maximize L(β) through an iterative process
• Starting with t = 1, let

u^(t) = u(β^(t)),    H^(t) = H(β^(t))

where β^(t) is our t-th guess for β


Newton Raphson Algorithm

I Let µ(t) and H (t) be µ and H evaluated at β (t) .


I By a second-order Taylor series expansion around β(t),

L(β) ≈ L(β(t)) + µ(t)T (β − β(t)) + (1/2)(β − β(t))T H(t) (β − β(t))

I Solving ∂L(β)/∂β ≈ µ(t) + H(t)(β − β(t)) = 0, we get

β(t+1) = β(t) − (H(t))^{-1} µ(t)
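As an illustration, here is a minimal sketch of this iteration for a logistic regression log-likelihood (simulated data, made-up names; with the canonical logit link the observed and expected information coincide, so the same code also illustrates Fisher scoring):

set.seed(1)
n <- 500
X <- cbind(1, rnorm(n))                 # design matrix with intercept
beta_true <- c(-0.5, 1.0)
y <- rbinom(n, 1, plogis(X %*% beta_true))

beta <- c(0, 0)                         # initial guess beta^(1)
for (t in 1:25) {
  pi_hat <- as.vector(plogis(X %*% beta))
  score  <- t(X) %*% (y - pi_hat)                    # mu^(t): gradient of L(beta)
  H      <- -t(X) %*% (X * (pi_hat * (1 - pi_hat)))  # Hessian of L(beta)
  step   <- solve(H, score)
  beta   <- beta - as.vector(step)                   # beta^(t+1) = beta^(t) - H^{-1} mu
  if (max(abs(step)) < 1e-10) break
}
beta
coef(glm(y ~ X[, 2], family = binomial))  # should agree closely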


Fisher Scoring Method

I Fisher scoring is an alternative iterative method for solving


likelihood equations.
I Use the negative of the expected value of the Hessian matrix, called the
expected information, denoted J:

J = −E( ∂²L(β)/∂βi ∂βj ), ∀ i, j = 1, 2, ..., p

β(t+1) = β(t) + (J(t))^{-1} µ(t)
Inference for GLMs

I The curvature of the likelihood function determines the
uncertainty/information about a parameter.
I Classical/frequentist inference assumes, under certain regularity
conditions, that the MLE β̂ MLE follows an approximate multivariate
normal distribution, with asymptotic covariance approximated by
(−H)^{-1} or J^{-1} evaluated at β̂ MLE.
I for confidence intervals:

β̂ MLE ≈ N(β, SE(β̂ MLE)²)

I for hypothesis testing (under H0: β = β0):

β̂ MLE ≈ N(β0, SE(β̂ MLE)²)


Inference for GLMs

I Bayesian inference
I Sample directly from posterior multivariate distribution and
calculate the credible sets.
I Likelihood based inference
I Similarly, directly calculate confidence intervals from likelihood
function.
Inference for GLMs

I SE and CI for GLMs (Frequentist edition)


I 95% CI for β:
β̂ ± 1.96SE (β̂)
I 95% CI for logit(µi ):

logit(µi) = α + βxi
⇒ Var(logit(µ̂i)) = Var(α̂ + β̂xi)
                 = Var(α̂) + xi² Var(β̂) + 2xi Cov(α̂, β̂)
⇒ CI: logit(µ̂i) ± 1.96 SE(logit(µ̂i))

I We can obtain a 95% CI for µi by applying the inverse-logit
transformation to the endpoints of the CI for logit(µ̂i), or by
using the delta method (a small R sketch follows).
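A minimal R sketch of both routes, assuming a fitted logistic regression fit <- glm(y ~ x, family = binomial) with a single covariate x (names are illustrative):

# Assume `fit` is a logistic regression, e.g. fit <- glm(y ~ x, family = binomial)
newdat <- data.frame(x = 1.5)   # covariate value of interest (made up)

# Route 1: CI on the logit scale via predict(), then transform the endpoints
pr <- predict(fit, newdata = newdat, type = "link", se.fit = TRUE)
ci_logit <- pr$fit + c(-1.96, 1.96) * pr$se.fit
plogis(ci_logit)                # 95% CI for mu_i

# Route 2: the same SE computed by hand from the covariance matrix
V  <- vcov(fit)                 # 2x2 covariance of (alpha-hat, beta-hat)
xi <- newdat$x
var_logit <- V[1, 1] + xi^2 * V[2, 2] + 2 * xi * V[1, 2]
sqrt(var_logit)                 # matches pr$se.fit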
Multivariate GLMs

Author: Nicholas Reich, transcribed by Kate Hoff Shutta and Herb


Susmann

Course: Categorical Data Analysis (BIOSTATS 743)

Made available under the Creative Commons Attribution-ShareAlike 4.0 International License.
Overview: Models for Multinomial Responses
Note: This lecture focuses mainly on the Baseline Category Logit
Model (see Agresti Ch. 8), but for the exam we are responsible for
reading Chapter 8 of the text and being familiar with all types of
models for multinomial responses introduced there.

I GLMs for Nominal Responses


I Baseline Category Logit Model (Multinomial Logit Model)
I Multinomial Probit Model

I GLMs for Ordinal Responses


I Cumulative Logit Model
I Cumulative Link Models
I Cumulative Probit Model
I Cumulative Log-Log Model
I Adjacent-Categories Logit Models
I Continuation-Ratio Logit Models

I Discrete-Choice Models
I Conditional Logit Models (and relationship to Multinomial Logit Model)
I Multinomial Probit Discrete-Choice Models
I Extension to Nested Logit and Mixed Logit Models
I Extension to Discrete Choice Model with Ordered Categories
Baseline Category Logit Model
The Baseline Category Logit (BCL) model is appropriate for
modeling nominal response data as a function of one or more
categorical or quantitative covariates.

I Example: Modeling choice of voter candidate as a function of


voter age (quantitative), gender (categorical nominal), race
(categorical nominal), and socioeconomic status (categorical
ordinal).
I Example: Modeling transcription factor binding to a promoter
region as a function of transcription factor abundance
(quantitative), affinity for the binding site (quantitative), and
primary immune response activation status (categorical binary).
I Non-Example: Modeling consumer choice of soda size as a
function of air temperature (quantitative) and time of day
(quantitative). Soda size is a categorical ordinal variable, so
although this model will technically work, it does not
incorporate all of the information that our data contain.
BCL Model Formulation

Consider the set of J possible values of a categorical response
variable {C1, C2, . . . , CJ} and the vector of P covariates
X⃗ = (X1, X2, . . . , XP).

Goal: For a particular vector of covariates x⃗i = (xi1, xi2, . . . , xiP),
predict Yi, the category to which the observation with covariates x⃗i
belongs. (Note that Yi ∈ {C1, . . . , CJ}.)

Intermediate Goal: For all j ∈ 1, . . . , J, use training data to fit
πj(x⃗i) = P(Yi = Cj | x⃗i) under the constraint that Σ_{j=1}^{J} πj(x⃗i) = 1.

Conditional on the observed covariates and the estimates for the
functions πj, Yi is multinomial:

Yi | x⃗i ∼ Multinomial(1, {π1(x⃗i), . . . , πJ(x⃗i)})
Overview of Modeling Process

I Choose one of the J categories as a baseline. Without loss of


generality, use CJ (since the Cj are nominal and ordering is
irrelevant).
I Let βj = (βj1, . . . , βjP) be the category-specific coefficients of the
covariates x⃗i for a particular category Cj. (Note the dimensions of βj
are P × 1.)
I Recall x⃗i = (xi1, xi2, . . . , xiP) is P × 1.
I We can now calculate the following scalar quantity, a log
probability ratio that is modeled as a linear function of the covariates
x⃗i:

log( πj(x⃗i) / πJ(x⃗i) ) = αj + βj^T x⃗i
Overview of Modeling Process, continued

I Specifying the probabilities πj relative to the reference category
πJ specifies a similar log probability ratio for any two categories
πa, πb, a ≠ b, since

log( πa(x⃗i) / πJ(x⃗i) ) − log( πb(x⃗i) / πJ(x⃗i) ) = log( πa(x⃗i) / πb(x⃗i) )

I Note that we only need to model (J − 1) of the probabilities πj,
since the constraint Σ_{j=1}^{J} πj(x⃗i) = 1 uniquely determines the
J-th probability given the other (J − 1).
Formulation of the BCL Model as a Multivariate GLM
Response Vector

y⃗i = (yi1, yi2, . . . , yi(J−1))

Expected Response Vector

E[y⃗i] = µ⃗i

Argument to Link Function

µ⃗i = (µi1, µi2, . . . , µi(J−1))
    = (π1(x⃗i), π2(x⃗i), . . . , πJ−1(x⃗i))

Link Function

g(µ⃗i) = ( log(π1(x⃗i)/πJ(x⃗i)), log(π2(x⃗i)/πJ(x⃗i)), . . . , log(π(J−1)(x⃗i)/πJ(x⃗i)) )^T = Xi β

where Xi and β are defined on the next slide.


Formulation of the BCL Model as a Multivariate GLM
Matrix of Covariates
Xi is a (J − 1) × (P + 1)(J − 1) matrix (recall that P is the number of
covariates) constructed from blocks of the form (1, xi1, xi2, . . . , xiP),
placed block-diagonally:

Xi = [ 1 xi1 ... xiP   0 0   ... 0     ...   0 0   ... 0   ]
     [ 0 0   ... 0     1 xi1 ... xiP   ...   0 0   ... 0   ]
     [ ...                                                 ]
     [ 0 0   ... 0     0 0   ... 0     ...   1 xi1 ... xiP ]

Vector of Parameters
β is a column vector with dimension (P + 1)(J − 1) × 1, containing the
category-specific coefficients αj and βjk for j ∈ {1, . . . , J − 1} and
k ∈ {1, . . . , P}:

β = (α1, β11, . . . , β1P, α2, β21, . . . , β2P, . . . , αJ−1, β(J−1)1, . . . , β(J−1)P)^T
Multivariate GLM : The Mechanics of Prediction

I Xi is (J − 1) × (P + 1)(J − 1) and β is (P + 1)(J − 1) × 1
I The predicted (link-scale) vector y⃗i = g(µ⃗i) = Xi β is a (J − 1) × 1 column vector

Let Xi^(j) refer to the j-th row vector of Xi. Then the dot product of
Xi^(j) with the parameter vector β is the predicted log probability
ratio for observation i and non-reference category Cj:

yij = [g(µ⃗i)]_j = log( πj(x⃗i) / πJ(x⃗i) ) = Xi^(j) · β
Multivariate GLM : Example of the Mechanics of
Prediction
Suppose we wish to calculate yi1.

The first row vector of Xi is:

Xi^(1) = (1, xi1, xi2, . . . , xiP, 0, 0, . . . , 0)

The column vector of parameters β is the same for all i:

β = (α1, β11, . . . , β1P, α2, β21, . . . , β2P, . . . , αJ−1, β(J−1)1, . . . , β(J−1)P)^T

Their dot product gives us the predicted yi1:

yi1 = g(π1(x⃗i)) = log( π1(x⃗i) / πJ(x⃗i) )
    = Xi^(1) · β
    = 1·α1 + xi1 β11 + · · · + xiP β1P
      + 0·α2 + 0·β21 + · · · + 0·β2P
      + . . .
      + 0·αJ−1 + 0·β(J−1)1 + · · · + 0·β(J−1)P
    = α1 + xi1 β11 + · · · + xiP β1P
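To make the block structure concrete, here is a minimal numerical sketch in R (the values of J, P, x⃗i, and β are made up) that builds Xi with a Kronecker product and checks the dot-product calculation above:

J <- 4; P <- 3
xi <- c(0.5, -1.0, 2.0)                      # made-up covariate vector (length P)

# Xi: (J-1) x (P+1)(J-1) matrix with the block (1, xi) placed block-diagonally
Xi <- diag(J - 1) %x% t(c(1, xi))            # Kronecker product builds the blocks
dim(Xi)                                      # 3 x 12

# beta: (alpha_1, beta_11..beta_1P, ..., alpha_{J-1}, beta_{(J-1)1}..beta_{(J-1)P})
beta <- c(0.2, 1, -1, 0.5,
          -0.3, 0.4, 0.1, 0,
          1.1, -0.2, 0.3, 0.6)

eta <- Xi %*% beta                           # (J-1) predicted log probability ratios
all.equal(as.numeric(eta[1]), sum(c(1, xi) * beta[1:4]))  # first entry = alpha_1 + xi' beta_1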
Response Probabilities

Note the following relationship:

log( πj(x⃗i) / πJ(x⃗i) ) = αj + βj^T x⃗i   =⇒   πj(x⃗i) = exp(αj + βj^T x⃗i) / ( 1 + Σ_{n=1}^{J−1} exp(αn + βn^T x⃗i) )

The argument of the log function here is sometimes referred to as
the “relative risk” in the public health setting.
Response Probabilities
Plotting πj(x⃗i) as a function of one covariate can provide a nice graphic of how these probabilities compare
to one another when projected onto the (covariate, πj) plane (i.e., compare the category-specific response
probabilities across values of that covariate for subject i, with all other covariates held constant).
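One way to produce such a plot is sketched below using the nnet package's multinom() function (one of several R implementations of the BCL model; VGAM::vglm is another). The data and variable names are simulated/made up:

library(nnet)

set.seed(743)
n <- 600
x1 <- rnorm(n)
x2 <- factor(sample(c("F", "M"), n, replace = TRUE))
# Simulate a 3-category nominal outcome whose probabilities depend on x1
lp2 <- 0.8 * x1; lp3 <- -1 + 1.5 * x1
pr  <- cbind(1, exp(lp2), exp(lp3)) / (1 + exp(lp2) + exp(lp3))
y   <- factor(apply(pr, 1, function(p) sample(c("A", "B", "C"), 1, prob = p)))

fit <- multinom(y ~ x1 + x2)     # category "A" (the first level) is the baseline
summary(fit)

# Predicted category probabilities across x1, holding x2 fixed at "F"
grid  <- data.frame(x1 = seq(-3, 3, length.out = 100),
                    x2 = factor("F", levels = levels(x2)))
probs <- predict(fit, newdata = grid, type = "probs")
matplot(grid$x1, probs, type = "l", lty = 1,
        xlab = "x1", ylab = "Estimated category probability")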
Using χ2 or G 2 as a Model Check

When all predictors in a model are categorical and the training data
can be represented in a contingency table that is not sparse, the
χ² or G² goodness-of-fit tests used earlier in the semester can be
used to assess whether or not the fitted BCL model is appropriate.
(Generate an “expected” contingency table from the model’s predictions;
the “residuals” then compare observed to expected counts.)
If some predictors are not categorical or the contingency table is
sparse, these statistics are “valid only for comparing nested models
differing by relatively few terms” (A. Agresti, Categorical Data
Analysis p. 294). This means that they cannot validly be used as a
model check overall, but they can be used to compare fit of full
vs. reduced models if the full model only has “relatively few” more
covariates than the reduced one(s).
Example: Using Symptoms to Classify Disease (Reich Lab
Research)
Motivating Question: Confirmatory clinical tests are expensive
and take time, meaning they are not a reasonable diagnostic option
in many public health settings. Can we instead design a model that
can use routine observable symptoms to classify sick individuals
accurately? (Adapted from work in progress by Brown et al.)

Categories:
I C1: Dengue
I C2: Zika
I C3: Flu
I C4: Chikungunya
I C5: Other
I C6: No Diagnosis

Covariates (a few of many in the actual model):
I Age
I Headache
I Rash
I Conjunctivitis
I ...
Using Symptoms to Classify Disease, Continued
Assume that each individual can only have one disease at once, and
let yi be a binary vector representing the gold-standard diagnosis of
the i th individual in the training set. For example, using the ordering
on the previous slide, the observation

y1 = (0, 0, 1, 0, 0, 0)

means that individual 1 was diagnosed with the flu using


gold-standard methods.
The proposed model chooses the category C6 : No Diagnosis as the
baseline category, estimates the πj based on the training data
{(yi , x~i )} and finds the set of parameters β such that

log( πj(x⃗i) / π6(x⃗i) ) = Xi β
Using Symptoms to Classify Disease: Visualization Method
We might use a graphic like the one below to represent resulting estimates
for β (these results are just randomly generated from a standard normal
distribution):
Using Symptoms to Classify Disease: Interpretation of
Graphic

Here are a few interpretations of what this model’s coefficients would


mean from our classmates, taken with author permission from the course
discussion page on Piazza.
“A dark blue rectangle means that your probability of being diagnosed with that
particular disease (given the presence of that covariate) is lower than the probability
that you will have the negative diagnosis (given the presence of that covariate). In
particular, the ratio of the probabilities is e β , which in the ‘dark blue case’ could be
something like e −1 . If it were that particular value, that means that P(that_disease) /
P(neg_diagnosis) would be about .36 - that is, your probability of the diagnosis is
about 1/3 of the probability of the baseline.” - Yukun Li and Josh Nugent
“Holding all else constant, given you have a covariate, the risk ratio of having a
disease versus a negative test is e β .” - Bianca Doone
“If X is a binary category variable (group1: X = 1, group2: X = 0): Holding other
variable constant, the risk ratio of having a j th disease versus a negative test in group
1 is e β times the risk ratio in group 2. If X is a contiuous variable: Holding other
variable constant, with one unit increase in X , the risk ratio of having a j th disease
versus a negative test is multiply by e β .” - Guandong (Elliot) Yang
Supplementary - Utility Functions and Probit Models

In a setting where the response variable is categorical and represents


an individual’s choice as a function of certain covariates, we can
define a utility function that takes on values U1, . . . , UJ for the
categories C1, . . . , CJ. The “voter choice” example from earlier in
these notes represents such a setting.
Models based on utility functions assume that the individual will
make the choice of maximum utility, i.e., choose the category j∗
such that Uj∗ = maxj {Uj }.
Utility is typically different for each individual, so a more detailed
formulation defines Ui = (Ui1 , . . . , UiJ ) for each individual i, and
predicts response ji∗ such that Uij∗ = maxj {Uij }.
Supplementary - Utility Functions and Probit Models
(Agresti p.299)
If a utility function is used as the link function in the multivariate
GLM, we get an equation of the form:

Uij = αj + βj^T x⃗i + εij

Under the assumption that the εij are i.i.d. with the
extreme value distribution, McFadden (1974) showed that this
model is equivalent to the BCL model. In this setting, the
interpretation of each element of βj is the expected change in Uij with a
one-unit change in the corresponding covariate, all other covariates held
constant.
Recall that the extreme value distribution has CDF:

FX(x) = exp(−exp(−x))

What if the εij are not assumed to have this distribution?
Supplementary - Utility Functions and Probit Models
(Agresti p.299)

If we instead assume the εij are i.i.d. with the standard normal
distribution, the resulting model

Uij = αj + βj^T x⃗i + εij

is the multinomial probit model. In this setting, the
interpretation of each element of βj is again the expected change in Uij
with a one-unit change in the corresponding covariate, all other
covariates held constant, but the link is now based on the probit
(standard normal CDF) rather than the logit function.
Why Probit over Logit?
Implicit in the multinomial logit model is dependence on the
Independence of Irrelevant Alternatives (IIA) axiom.
Framed in the language of utility functions, the IIA axiom says:

I If C = {C1, C2} represents the categorical outcome set with
utilities Ui = {Ui1, Ui2} such that Ui1 > Ui2, then adding a
third option C3 to the outcome set will not change this
ordering.

The multinomial probit model does not depend on the IIA axiom, and is
therefore an interesting approach for many applications, including voting
theory.
Example: In the 2016 election, if the only two candidates in the mix were
Hillary Clinton and Jill Stein, a voter might have chosen Jill Stein knowing
that Hillary was likely to win anyways but that a vote for Jill represented
their beliefs. However, introducing Donald Trump into the mix might have
convinced that voter that they should choose Hillary instead of Jill, since a
third-party vote for Jill would draw from Hillary’s chance. Thus the IIA
axiom is violated. The multinomial probit model can model this setting.
Example: Alvarez and Katz (2007) Multinomial Probit
Model for Election Choice in Chile in 2005

Alvarez and Katz study the 2005 election in Chile, in which


candidates came from three main coalitions with four main
candidates:

I Left coalition (Tomas Hirsch Goldschmidt)

I Center-left Concertacion coalition (Michelle Bachelet Jeria)

I Conservative Alianza por Chile coalition (Independent Democratic


Union - Joaquin Lavin Infante, National Renewal Party - Sebastian
Pinera Echenique)
Example: Alvarez and Katz (2007) Multinomial Probit
Model for Election Choice in Chile in 2005
None of the four candidates won a majority in the first vote, so Chile held
a run-off election and eventually elected Michelle Bachelet Jeria, who had
the highest proportion of votes in the original election.

“We . . . find that the presence of a second conservative candidate


significantly affected citizens’ electoral behavior, increasing the
support for the right and influencing the electoral outcome in a way
that cannot be accounted for by analyses focused exclusively on
citizens’ party identification.”

R. Michael Alvarez and Gabriel Katz, 2007. A Bayesian Multinomial


Probit Analysis of Voter Choice in Chile’s 2005 Presidential Election Social
Science Working Paper 1287, California Institute of Technology, Division
of the Humanities and Social Sciences.
[Election Results] [Link]
election,_2005%E2%80%9306
Lecture 10: GLMs: Poisson
Regression, Overdispersion

Author: Nick Reich / Transcribed by Daveed Goldenberg, edited by Josh


Nugent

Course: Categorical Data Analysis (BIOSTATS 743)


Poisson GLMs

I Imagine you have count data following the Poisson distribution:

Yi ∼ Poisson(λi )

I Yi is the total count in the time interval, λi is E (Yi ), that is,


the risk/rate of occurrence in some time interval,
I We use a log link for our GLM:

ηi = Xi β = log λi = g(λi ) = g(E [yi ])


Poisson GLMs

I Key Points:
I Log link implies a multiplicative effect of covariates:

log(λ) = β0 + β1 X1 + β2 X2
λ = e^{β0} e^{β1 X1} e^{β2 X2}

I e^β is interpreted as a relative risk (rate ratio):

log(λ | X1 = k + 1, X2 = c) = β0 + β1 (k + 1) + β2 c
log(λ | X1 = k, X2 = c) = β0 + β1 k + β2 c

so, subtracting,

log( (λ | X1 = k + 1, X2 = c) / (λ | X1 = k, X2 = c) ) = β1
Exposure / Offset Term

Often Poisson models have an ‘exposure’ or ‘offset’ term,
representing a denominator of some kind. Examples, letting ui denote
the offset for Yi:

I Disease incidence: Yi = the number of cases of flu in a


population in 1 year (in location i), ui = population size
I Accident rates: Yi = the number of traffic accidents at site i in
1 day, ui = average number of vehicles travelling through site i
in 1 day, or ui = the number of vehicles through site i yesterday
I The offset is used to scale the Yi
Exposure / Offset Term

Yi ∼ Poisson(ui ∗ λi )
E (Yi ) = ui ∗ λi
log(E (Yi )) = log(ui ) + log(λi )

I log(ui ) is our offset (from observed data, can be thought of as


an intercept)
I log(λi ) is our ηi (the linear predictor)
Exposure / Offset Term

I In R, a Poisson glm can be specified with an offset:
I glm(Y ~ X1 + X2, family = "poisson", offset = log(u), data = ...)
I taking the log is important so that the offset is on the scale of
the linear predictor
I The offset term adds information to the model without adding an
estimated coefficient (its coefficient is fixed at 1)
Exposure / Offset Term

Yi ∼ Poisson(ui ∗ λi )

I The Yi could be cases per day


I ui could be population (persons)
I then λi would be cases per day per population (persons)
I which makes this a rate for an individual

log(E (Yi )) − log(ui ) = log(λi )


log(E (Yi )/ui ) = log(λi )
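A minimal simulated sketch (all names and numbers made up) of fitting a Poisson rate model with an offset in R:

set.seed(743)
n   <- 200
pop <- round(runif(n, 1e3, 5e4))        # exposure: population size u_i
x   <- rnorm(n)
lam <- exp(-7 + 0.4 * x)                # per-person rate lambda_i
y   <- rpois(n, lambda = pop * lam)     # observed counts

fit <- glm(y ~ x, family = poisson, offset = log(pop))
# equivalently: glm(y ~ x + offset(log(pop)), family = poisson)
summary(fit)
exp(coef(fit)["x"])                     # rate ratio per 1-unit increase in x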
Overdispersion
I In Poisson models for Yi ∼ Poisson(λi )

Var (Yi ) = λi

I In GLM estimation notation

µi = E(Yi) = λi
Var(Yi) = λi = µi

I In an overdispersed model, the variance is larger because of
some variability not captured by the Poisson model:

Var(Yi) = φ µi ,  φ > 0

I Overdispersion corresponds to φ > 1
Overdispersion

I Likelihood equations for a Poisson GLM:

Σ_{i=1}^{N} [ (yi − µi) xij / Var(Yi) ] (∂µi/∂ηi) = 0 ,  j = 0, . . . , p

I These depend on the distribution of Yi only through µi and Var(Yi)


I φ drops out of the likelihood equations - this makes sense;
variability won’t affect the MLE - that is, βs are identical for
models with φ > 1 and φ = 1
I However, φ does impact estimated standard errors
Overdispersion

wi = (∂µi/∂ηi)² / Var(Yi)

cov(β̂) = (X^T W X)^{-1}; with overdispersion this is inflated to φ (X^T W X)^{-1}

I φ does not affect the β̂s, but it scales their covariance matrix
by a constant factor
Is Overdispersion Term Needed in a Model?

I (See example 4.7.4 in Agresti)


I Start with standardized residuals
I Assume:

zi = (yi − ŷi) / sqrt(Var(ŷi)) = (yi − µi) / sqrt(µi) ∼ N(0, 1)

Σ_{i=1}^{n} zi² ∼ χ²_{n−k}
I where k is the number of parameters
I if the sum of zi2 is large (compare to chi-squared), we may
need an overdispersion term φ
Is Overdispersion Term Needed in a Model?

φ̂ = Σ_{i=1}^{n} zi² / (n − k)

I φ̂ summarizes overdispersion in the data relative to the fitted model
I if φ̂ > 1, we should consider using the “quasipoisson” family in R’s glm()
function
I The SEs from a quasipoisson model are equal to the SEs of the
Poisson model multiplied by sqrt(φ̂):

SE_qp(β̂) = SE_p(β̂) · sqrt(φ̂)

(A small R sketch follows.)
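A minimal sketch comparing Poisson and quasipoisson fits on simulated overdispersed counts; the dispersion reported by summary() for the quasipoisson fit is the Pearson-based φ̂ described above:

set.seed(743)
n  <- 300
x  <- rnorm(n)
mu <- exp(1 + 0.5 * x)
y  <- rnbinom(n, mu = mu, size = 2)      # overdispersed counts (Var > mean)

fit_p  <- glm(y ~ x, family = poisson)
fit_qp <- glm(y ~ x, family = quasipoisson)

# Pearson-based estimate of phi, computed by hand
phi_hat <- sum(residuals(fit_p, type = "pearson")^2) / df.residual(fit_p)
phi_hat
summary(fit_qp)$dispersion               # same quantity, reported by summary()

# Quasipoisson SEs equal Poisson SEs multiplied by sqrt(phi_hat)
sqrt(phi_hat) * summary(fit_p)$coefficients[, "Std. Error"]
summary(fit_qp)$coefficients[, "Std. Error"]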
Lecture 11 : Smoothing: Penalized Likelihood Methods
and GAMs
Tyrell Richu
10/23/2018

Penalized Likelihood
• Consider an arbitrary model with generic parameter β, a log-likelihood function L(β).

• Let λ(.) denote the roughness penalty, which decreases as the values of β become smoother (i.e., uniformly
closer to zero). The penalized log-likelihood L*(β) is:

L∗(β) = L(β) − λ(β)

• Penalized likelihood methods are examples of regularization methods: a general approach for
modifying ML methods to give sensible answers in unstable situations, such as modeling with data sets
containing too many variables.

Types of Penalties λ(β)


• L2-norm (Ridge Regression): λ(β) = λ Σj βj²
• L1-norm (LASSO): λ(β) = λ Σj |βj|; equivalently, maximize the likelihood subject to the constraint Σj |βj| ≤ K, where K is some constant.
• L0-norm: λ(β) ∝ number of non-zero βj. AIC/BIC methods are a special case of L0-penalization, but it is
hard to optimize when the number of candidate coefficients is large.

How to select λ(β) for penalized likelihood


The degree of smoothing depends on the smoothing parameter λ, the choice of which reflects the bias/variance
trade-off: as λ increases, the estimates {βj} shrink towards zero, decreasing the variance but increasing the bias.
• K-fold Cross-validation Goal: choose λ by fitting the model to part of the data and then checking the
goodness of fit, in terms of prediction error, on the remaining data.
• Step 1: Fix a candidate value λ0.
• Step 2: For each of the k folds in turn, leave out that fraction 1/k of the data and predict it using the model
fit to the remaining data.
• Step 3: Compute the total (cross-validated) prediction error for λ0.
• Step 4: Repeat for a grid of candidate values of λ, and choose the value of λ with the lowest prediction error
(see the cv.glmnet sketch after this section).
Note: Bayesian methods can also be viewed as a form of penalized likelihood: if prior(β) ∝ exp(−λ(β)), then the
log-posterior is, up to a constant, L∗(β) = L(β) − λ(β).
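As a concrete illustration, here is a minimal sketch using the glmnet package (a package not mentioned in the original notes) to fit L2- and L1-penalized logistic regressions and to choose λ by K-fold cross-validation:

library(glmnet)

set.seed(743)
n <- 300; p <- 20
X <- matrix(rnorm(n * p), n, p)
beta <- c(1.5, -1, 0.8, rep(0, p - 3))        # only 3 truly nonzero coefficients
y <- rbinom(n, 1, plogis(X %*% beta))

ridge <- glmnet(X, y, family = "binomial", alpha = 0)  # alpha = 0: L2 penalty
lasso <- glmnet(X, y, family = "binomial", alpha = 1)  # alpha = 1: L1 penalty
plot(lasso, xvar = "lambda")    # coefficient paths: betas dropping out as lambda grows

# 10-fold cross-validation to choose lambda for the lasso fit
cv <- cv.glmnet(X, y, family = "binomial", alpha = 1, nfolds = 10)
cv$lambda.min                   # lambda with the lowest CV prediction error
coef(cv, s = "lambda.min")      # coefficients at that lambda (many exactly zero)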

Pros/Cons of Penalized Likelihood
• L2-norm (−): Not useful for obtaining a more parsimonious (sparse) model, because all of the variables remain in the model.
• L1-norm (+): Allows us to plot the estimates as a function of λ to summarize how explanatory variables βj
drop out as λ increases; it selects individual indicator terms rather than entire factors.
• L1-norm (−): May overly penalize coefficients βj that are truly large, introducing bias and making inference
difficult. Solution: adjust the penalty function so that it combines both the L1 and L2 norms (as in the elastic net).

Generalized Additive Models (GAMs)

• GAMs are another extension of GLMs that specify a link function g(.) and a distribution for the random
component.
• In GLMs, we had g(µi) = Σj βj xij.
• In GAMs, g(µi) = Σj sj(xij), where sj(.) is an unspecified smooth function of predictor j. Example:
cubic splines, which are cubic polynomials over sets of disjoint intervals, joined together at boundaries called
knots.
We can fit GAMs using the backfitting algorithm, which cycles through the terms, applying local smoothing to
partial residuals.
• Step 1: Initialize sj = 0 for all j.
• Step 2: At the rth iteration, update each sj by smoothing the partial residuals against the jth predictor:

sj(r) = smooth of ( yi − Σ_{k≠j} sk(r)(xik) ),  j = 1, ..., p

• This fits a model that assigns a deviance and an approximate degrees of freedom (df) to each sj in the
additive predictor, allowing inference about each term. The df helps determine how smooth the GAM
fit looks (e.g., smooth functions with df = 4 look similar to cubic polynomials, which have 4 parameters).
• As with GLMs, we can compare deviances for nested models to test whether a model gives a significantly
better fit than a simpler model. A minimal fitting sketch with the mgcv package follows.
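A minimal fitting sketch using the mgcv package (one common R implementation of GAMs; note that mgcv uses penalized regression splines rather than backfitting) on simulated binary data:

library(mgcv)

set.seed(743)
n  <- 500
x1 <- runif(n); x2 <- runif(n)
p  <- plogis(sin(2 * pi * x1) + 2 * (x2 - 0.5)^2 - 0.5)
y  <- rbinom(n, 1, p)

fit  <- gam(y ~ s(x1) + s(x2), family = binomial)
summary(fit)          # reports an effective df (edf) for each smooth term
plot(fit, pages = 1)  # estimated smooth functions s_1(x1), s_2(x2)

# Compare to a simpler nested model via an (approximate) deviance-based test
fit0 <- gam(y ~ s(x1), family = binomial)
anova(fit0, fit, test = "Chisq")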

Final Notes
• GAMs and penalized likelihood methods are more flexible than ordinary GLMs while retaining familiar GLM
features, such as assuming a binomial distribution for a binary response and having a df value associated with
each explanatory effect.

Lecture 13: Clustered Data, An
Introduction to Applied Mixed
Models

Author: Nick Reich / Transcribed by Gregory Guranich, edited by Bing


Miu

Course: Categorical Data Analysis (BIOSTATS 743)


Clustered Data
I Clustered Data:
I Hierarchy, nested populations
I Longitudinal or otherwise correlated sets of observations, e.g.,
repeated measures
I Example: (Hierarchy of nested populations)
I Patient ∈ Hospital ∈ Region
Clustered Data
I Example: (Longitudinal)
I Repeated measurements on the same unit of observation
I Patients have multiple temperature measurements over time
I Temperature ∈ Patient ∈ Clinic

Figure 1: Alkema 2017


Implications of Clustered Data
I Observations within clusters violate our assumption of
independence
I Clustered data are less informative and less generalizable than
independent data
I Example: Subjects within a household are more similar and
will vary less. Similar genetics and behavior
Example Notation

Mixed models, also known as multilevel or hierarchical models, are


used to model cluster data. We will use the following notation to
model such data.
Yij = Response variable
Trtij = Treatment variable
Example Notation 2

i = 1, 2, ..., n (groups)
j = 1, 2, ..., ni (observations within group i)

N = Σ_i ni = total number of observations

Treatment Trtij = 1 if observation j in group i was treated; 0 otherwise
Example to begin with

I Let Yij denote the counts of mortality

Yij ∼ Poisson(λij , Pi )

log(λij ) = β0 + β1 Trtij + β2 Xi
I We want to make inference about β1 . What is the treatment
effect?
I Observations are not iid, need “adjustment”
I How can we account for variance within groups?
I 1. Marginal models with generalized estimating equations
(GEE) for variance adjustment
I 2. generalized linear mixed models (GLMM’s)
GLMMs

I Account for both individual- and group-level variation when
estimating group-level coefficients
I Allow a proper measure of the variation in individual-level
regression coefficients
I For point estimates, shrinkage (depending on group sample size) of
group-specific parameters toward the overall mean
I Rules of thumb for when GLMMs are worthwhile:
I number of groups greater than 5
I substantial variation among groups
GLMMS Model Notation

I where β is a fixed effect and µi are varying (random) effects

g(E [yij |µi ]) = XijT β + ZijT µi


µi ∼ N(0, Gθ )

I Agresti uses i to represent the group, Yij is the jth observation


in ith group
I So here we have i different µ which are draws from a normal
I The µi have a common distribution
I Notice β has no subscript, β is fixed
I β is a global estimate and µi is a group specific estimate
I g is a glm link
I θ is a parameter that governs the distribution of the random
effect
Estimation
L(β, θ) = f(y⃗ | β, θ) = ∫ f(y⃗ | µ⃗, β) f(µ⃗ | θ) dµ⃗

I where we integrate over the random-effects distribution f(µ⃗ | θ)
I these integrals usually do not have closed-form solutions
I How to approximate? Use Bayesian MCMC or HMC
algorithms to sample from the posterior distribution of β and θ
I or approximate with Laplace methods (some coefficients are
subject to penalty terms)
I Degrees of freedom: approximated by an extension of the hat
matrix
I some closed-form solutions exist (e.g., the beta-binomial conjugate
model in the next lecture)

tr (H) = p
p ≤ tr (H) ≤ p + q
Example
I Let's say we are looking at a treatment (spinal implants) to relieve
back pain. Rows represent visit 1 and columns represent visit 2.

Figure 3:

I pain indicator Yij and visit indicator Xij

yij = 1 if patient i at visit j has no pain; 0 if patient i at visit j has pain

xij = 1 if j = 2 (patient's second visit); 0 if j = 1 (patient's first visit)
Example (possible models)
I logistic-normal model

logit(P(Yij = 1|µi )) = α + βXij + µi


µi ∼ N(0, σµ2 )

I the same logistic-normal model written in Bayesian notation with
weakly informative priors

Yij ∼ Bernoulli(πij)
logit(πij) = αi + βXij
αi ∼ N(β0, σα²)
β0 ∼ N(0, 100)
σα ∼ U(0, 5)
Example (interpretation)

logit(P(Yij = 1|µi )) = α + βXij + µi


µi ∼ N(0, σµ2 )

I α ∼ log odds of pain free at visit 1


I β ∼ change in log odds of being pain free comparing visit 2 to
visit 1
Example (interpretation)

I With the random effect µi, the intercept can now vary by patient
(a fitting sketch with lme4 follows)
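A minimal sketch of fitting this logistic-normal model with lme4::glmer (a package not named in these notes); the paired-visit data here are simulated with made-up names:

library(lme4)

set.seed(743)
n_pat <- 200
dat <- data.frame(
  patient = rep(1:n_pat, each = 2),
  visit2  = rep(c(0, 1), times = n_pat)          # x_ij: 0 = visit 1, 1 = visit 2
)
a_i <- rnorm(n_pat, 0, 1.5)                       # patient-level random intercepts
dat$pain_free <- rbinom(nrow(dat), 1,
                        plogis(-0.5 + 0.8 * dat$visit2 + a_i[dat$patient]))

fit <- glmer(pain_free ~ visit2 + (1 | patient), family = binomial, data = dat)
summary(fit)
# fixef(fit)["visit2"]: change in log odds of being pain free, visit 2 vs. visit 1
# VarCorr(fit): estimated sigma_mu^2 for the patient-level random intercepts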


Lecture 14: Mixture Distributions
in GLMMs

Author: Nick Reich / Transcribed by Hachem Saddiki, edited by Gregory


Guranich

Course: Categorical Data Analysis (BIOSTATS 743)


Mixture Distributions

I A mixture distribution is a probability distribution derived from


a convex (weighted) combination of individual probability
distributions, called mixture components.
I The weights associated with each mixture component are
called mixture weights; these are non-negative and sum to 1.
I Notation: Given a set of mixture components gk (x ) and
weights wk indexed by k = 1, ..., K , the mixture distribution
can be written as

f(x) = Σ_{k=1}^{K} wk gk(x)    (1)

where Σ_{k=1}^{K} wk = 1 and wk ≥ 0 ∀k.
Mixture Distributions (cont’d)

I Mixture distributions are not to be confused with the


convolution of probability distributions.
I The former models the distribution of a random variable as
the weighted sum of two or more mixture distributions; while
the latter models the value of a random variable as the sum of
the values of two or more random variables.
I Mixture distributions are typically used to model a statistical
population with subpopulations that are believed to share
similar characteristics.
I Mixture distributions are formulated as hierarchical models
with hidden (latent) factors, often making exact statistical
inference an intractable task.
GLMM as a model for mixture distributions

I Consider an analysis of children’s gender in relation to their


mother. Define a binary random variable Yij which equals 1 if
the j-th child born to the i-th mother is a boy.
I We write a GLMM model for the outcome as follows

Yij |αi ∼ Bernoulli(πi )


πi = logit −1 (αi )
αi ∼ Normal(0, σα2 )

I In this example, the population of interest are the children, and


the mixture distribution accounts for subpopulations of children
that share the same mother. The rationale is that these
children, being from the same mother, would share similar
(genetic or other) characteristics.
GLMM as a model for mixture distributions (cont’d)

I Using the law of total expectation, we can compute the


expectation of Yi as:
E [Yi ] = E [E [Yi |αi ]] = E [πi ] = E [logit −1 (αi )]
I Unfortunately, an exact expression for E [Yi ] involves a
complicated integral form, and so exact inference is intractable
for this model formulation.
I Nonetheless, in this simple setting, one can appeal to a Monte
Carlo simulation to approximate the expectation of interest.
I Alternatively, one can opt for a different model formulation
that is amenable to exact expressions for the expectation and
variance of interest.
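A minimal Monte Carlo sketch of this approximation of E[Yi] = E[logit⁻¹(αi)], with an assumed value of σα for illustration:

set.seed(743)
sigma_alpha <- 1.5                          # assumed value for illustration
alpha <- rnorm(1e6, mean = 0, sd = sigma_alpha)
mean(plogis(alpha))                         # Monte Carlo estimate of E[logit^{-1}(alpha_i)]
# (With a mean-zero normal this is exactly 0.5 by symmetry; the same approach
#  applies for any mean, e.g. alpha_i ~ N(0.2, sigma^2), where no closed form exists.)
mean(plogis(rnorm(1e6, mean = 0.2, sd = sigma_alpha)))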
Beta-Binomial Model

I Following the same example of children's gender and their
mother, we can formulate a Beta-Binomial model for the
mother-level total Yi = Σj Yij (the number of boys among mother i's ni children):

Yi | πi ∼ Binomial(ni, πi)
πi ∼ Beta(α, β)

I Here, we model πi as a draw from a Beta distribution.


I Note that when α = β = 1, the Beta prior on πi is uniform on (0, 1),
and when ni = 1, Yi reduces to a Bernoulli random variable.
Beta-Binomial Model (cont’d)

I Another appealing property of the model is that the Beta


distribution is conjugate to the Binomial distribution.
I This leads to a closed form expression for the mean and
variance of Yi :

E[Yi] = ni α / (α + β)    (2)

Var(Yi) = ni αβ (α + β + ni) / [ (α + β)² (α + β + 1) ]    (3)
Shrinkage Effect for Beta-Binomial
I We can use the conjugacy of the Beta prior and Binomial
likelihood to derive a distributional form for the posterior πi |yi .

yi | πi ∼ Binomial(ni, πi)
πi | α, β ∼ Beta(α, β)
p(πi | yi, α, β) ∝ p(yi | πi) p(πi | α, β)
                ∝ πi^{yi} (1 − πi)^{ni − yi} · πi^{α−1} (1 − πi)^{β−1}
                ∝ πi^{yi + α − 1} (1 − πi)^{ni − yi + β − 1}
⇒ πi | yi ∼ Beta(yi + α, ni − yi + β)

I Having found a closed form expression for the posterior πi |yi ,


we can compute the posterior mean as

E(πi | yi, α, β) = (yi + α) / (ni + α + β)    (4)
Shrinkage Effect for Beta-Binomial (cont’d)

I From Eq(4), we can clearly see the influence of the parameters


of the prior (α, β) on the posterior mean πi |yi .
I As α + β increases, the posterior mean is pulled more strongly toward
the prior mean α/(α + β); this is sometimes referred to as a shrinkage effect.
I The effect of the prior on the posterior mean gets weaker as
the sample size increases. Intuitively, this reflects the Bayesian
perspective of relying more on prior information when few data
points are available; and relying on the data when the sample
size is large.
I We provide an illustration of the shape of the posterior π|yi for
different settings of the prior π|α, β. We use a fixed sample
size of N = 20.
Shrinkage Effect: plot of posterior for different prior
settings of the Beta-Binomial model
[Figure: posterior density of π under prior distributions Beta(1,1), Beta(1,3), Beta(3,1), and Beta(.5,.5), with N = 20]
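A minimal R sketch that reproduces the style of the figure above; the data values yi = 12 and ni = 20 are assumed for illustration (they are not given in the original figure):

# Assumed data for illustration: y_i = 12 boys out of n_i = 20 children
yi <- 12; ni <- 20
priors <- list(c(1, 1), c(1, 3), c(3, 1), c(0.5, 0.5))   # (alpha, beta) settings

curve(dbeta(x, yi + priors[[1]][1], ni - yi + priors[[1]][2]),
      from = 0, to = 1, ylab = "Posterior density", xlab = "pi", ylim = c(0, 5))
for (k in 2:4) {
  curve(dbeta(x, yi + priors[[k]][1], ni - yi + priors[[k]][2]),
        add = TRUE, lty = k)
}
legend("topleft", lty = 1:4,
       legend = c("Beta(1,1)", "Beta(1,3)", "Beta(3,1)", "Beta(.5,.5)"))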
Extensions to Models for Count
Data

Author: Nicholas Reich, transcribed by Herb Susmann edited by Hachem


Saddiki

Course: Categorical Data Analysis (BIOSTATS 743)


Extensions to Models for Count Data

There are several ways to extend models for count data in order to
capture properties like overdispersion.
I Poisson model with adjustment for overdispersion (see previous
notes)
I Poisson-Gamma Model
I Generalized Linear Mixed Models (GLMMs)
Poisson-Gamma Model

A Poisson-Gamma model is one way to account for overdispersion in


models of count data. The model has two parts:
I First, we assume that the outcome variable follows a Poisson
distribution: Y |λ ∼ Poisson(λ).
I Second, we assume that the rate parameter for that Poisson
distribution itself follows a Gamma distribution:
λ ∼ Gamma(k, µ).
I Under this parameterization, E[λ] = µ, Var[λ] = µ2 /k.
I We can alternatively parameterize in terms of a dispersion
parameter γ = 1/k.
Poisson-Gamma Model

I Under these two assumptions, the marginal distribution of Y


follows a negative binomial distribution:
Y ∼ NegativeBinomial(k, µ)
I See this blog post for a proof that the Poisson-Gamma model
is a negative binomial distribution.
I The mean and variance of the Poisson-Gamma model is not
equal (as opposed to a Poisson model), which allows it to
account for overdispersion.
Poisson-Gamma Model

I The expected value of Y is given by:

E[Y] = E[ E[Y | λ] ] = E[λ] = µ

I And the variance:

Var[Y] = E[ Var[Y | λ] ] + Var[ E[Y | λ] ]
       = E[λ] + Var[λ]
       = µ + µ²/k, or equivalently
       = µ + γµ²

I Note that as γ → 0 (k → ∞), the distribution of Y


approaches a Poisson distribution.
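A minimal sketch of fitting this Poisson-Gamma (negative binomial) model with MASS::glm.nb on simulated overdispersed counts:

library(MASS)

set.seed(743)
n  <- 400
x  <- rnorm(n)
mu <- exp(1 + 0.5 * x)
y  <- rnbinom(n, mu = mu, size = 2)   # size = k; small k means strong overdispersion

fit_nb <- glm.nb(y ~ x)
summary(fit_nb)
fit_nb$theta                          # estimate of k (gamma = 1/k is the dispersion)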
Generalized Linear Mixed Models
Another approach is to use a Generalized Linear Mixed Model
(GLMM).
I First, assume that the outcome Yi follows a Poisson
distribution.
I Assume the link-transformed expected value of the outcome is
a linear function of the covariates and random effects:

log(E[Yi | ui]) = Xi^T β + ui

I Finally, assume that the random effects ui follow a distribution:

ui ∼ N(0, σ 2 )

I This example uses a log link and assumes the ui are normally
distributed.
Generalized Linear Mixed Models

I Note that you need a link whose inverse maps the linear
predictor to a non-negative mean. For example, the identity link
leads to structural problems, because a negative linear predictor
would imply a negative expected count, which is impossible.
Other choices are possible for the distribution of ui :
I Assuming ui ∼ Gamma(1, γ) implies a negative binomially
distributed outcome Y .
I Another possible choice is assume ui follow a log-normal
distribution.
I Each choice implies a different structure for the random
intercepts.
