STT553: Categorical Data Analysis (CDA)
Lectured by Md. Kaderi Kibria, STT-HSTU
Lecture #2: Introduction to CDA
Objectives of this lecture:
After reading this unit, you should be able to
• explain the basic concepts of CDA
• describe the distribution functions used for categorical data
Multinomial Distribution
Nominal and ordinal response variables have more than two possible outcomes. When the
observations are independent with the same category probabilities for each, the probability
distribution of counts in the outcome categories is the multinomial.
Suppose we have a scenario with r = 3 outcomes, with probabilities p_1, p_2, p_3 respectively, such that p_1 + p_2 + p_3 = 1. Suppose we have n = 7 independent trials, and let Y = (Y_1, Y_2, Y_3) be the random vector of counts of each outcome. Suppose we define each X_i as a one-hot vector (exactly one 1, and the rest 0), so that

Y = \sum_{i=1}^{n} X_i

(this is exactly like how adding Bernoulli indicators gives us a Binomial).
Now, what is the probability of the outcome (two of outcome 1, one of outcome 2, and four of outcome 3), that is, (Y_1 = 2, Y_2 = 1, Y_3 = 4)? We get the following:

p_{Y_1, Y_2, Y_3}(2, 1, 4) = \frac{7!}{2!\,1!\,4!}\, p_1^2 p_2^1 p_3^4 = \binom{7}{2, 1, 4} p_1^2 p_2^1 p_3^4

This describes the joint distribution of the random vector Y = (Y_1, Y_2, Y_3), and its PMF should remind you of the binomial PMF. We simply count the number of ways \binom{7}{2, 1, 4} to arrange these counts (the multinomial coefficient) and multiply by the probability p_1^2 p_2^1 p_3^4 of getting each outcome that many times.
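As a quick numerical check, here is a minimal Python sketch (using scipy; the values of p_1, p_2, p_3 are assumed for illustration, not given above) that evaluates this PMF both by hand and with a library routine:

# Numerical check of the example: P(Y1=2, Y2=1, Y3=4) with n = 7 trials.
# The probabilities p1, p2, p3 are illustrative assumptions.
from math import factorial
from scipy.stats import multinomial

p = [0.2, 0.3, 0.5]   # assumed values with p1 + p2 + p3 = 1
counts = [2, 1, 4]    # the counts (Y1, Y2, Y3)

# Direct evaluation: (7! / (2! 1! 4!)) * p1^2 * p2^1 * p3^4
coef = factorial(7) // (factorial(2) * factorial(1) * factorial(4))
by_hand = coef * p[0]**2 * p[1]**1 * p[2]**4

# The same probability via scipy's multinomial distribution
via_scipy = multinomial.pmf(counts, n=7, p=p)

print(by_hand, via_scipy)  # both print 0.07875

Both computations return the same value, confirming the counting argument above.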
Now let us define the Multinomial Distribution more generally:
Let c denote the number of outcome categories. We denote their probabilities by (p_1, p_2, ..., p_c), where \sum_j p_j = 1. For n independent observations, the multinomial probability that x_1 observations fall in category 1, x_2 fall in category 2, ..., x_c fall in category c, where \sum_j x_j = n, equals

p(x_1, x_2, ..., x_c) = \frac{n!}{x_1!\, x_2! \cdots x_c!}\, p_1^{x_1} p_2^{x_2} \cdots p_c^{x_c}
The binomial distribution is the special case with c = 2 categories.
Then the mean and variance of each multinomial count are

\mu_i = n p_i \quad \text{and} \quad \sigma_i^2 = n p_i (1 - p_i)

Then we can specify the entire mean vector E[Y] and the covariance matrix:

E[Y] = np = (n p_1, n p_2, ..., n p_c)^T

Var(Y_i) = n p_i (1 - p_i), \qquad Cov(Y_i, Y_j) = -n p_i p_j \quad (\text{for } i \neq j)
Proof of Multinomial Covariance. Recall that marginally, Y_i and Y_j are binomial random variables; let us decompose them into their Bernoulli trials. We use different dummy indices because we are dealing with covariances.

Let X_{ik} for k = 1, ..., n be the indicator (Bernoulli) random variable of whether the k-th trial resulted in outcome i, so that

Y_i = \sum_{k=1}^{n} X_{ik}

Similarly, let X_{jl} for l = 1, ..., n be indicators of whether the l-th trial resulted in outcome j, so that

Y_j = \sum_{l=1}^{n} X_{jl}

Before we begin, note that Cov(X_{ik}, X_{jl}) = 0 when k ≠ l, since k and l are different trials and hence independent. Furthermore, E[X_{ik} X_{jk}] = 0, since it is not possible for both outcome i and outcome j to occur at trial k.
Cov(Y_i, Y_j) = Cov\left( \sum_{k=1}^{n} X_{ik}, \; \sum_{l=1}^{n} X_{jl} \right)

= \sum_{k=1}^{n} \sum_{l=1}^{n} Cov(X_{ik}, X_{jl})

= \sum_{k=1}^{n} Cov(X_{ik}, X_{jk})
= \sum_{k=1}^{n} \left( E[X_{ik} X_{jk}] - E[X_{ik}]\, E[X_{jk}] \right)

= \sum_{k=1}^{n} (0 - p_i p_j)

= -n p_i p_j

Note that in the third line we dropped one of the sums because indicators from different trials k ≠ l are independent (zero covariance); hence we only need the terms with k = l.
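A minimal simulation sketch in Python/NumPy can verify these moment formulas empirically (the values of n and p below are assumed for illustration):

# Simulation check of Var(Yi) = n*pi*(1-pi) and Cov(Yi, Yj) = -n*pi*pj.
# n and p are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, p = 7, np.array([0.2, 0.3, 0.5])

draws = rng.multinomial(n, p, size=200_000)  # each row is a vector (Y1, Y2, Y3)

print(np.cov(draws, rowvar=False))           # sample covariance matrix
print(n * (np.diag(p) - np.outer(p, p)))     # theoretical covariance matrix

The two matrices agree to within simulation error; note the negative off-diagonal entries, reflecting that a trial landing in category i cannot also land in category j.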
Relationship between Poisson and Multinomial Distribution
A Poisson model for (Y_1, Y_2, Y_3, Y_4) treats them as independent Poisson random variables with parameters (\mu_1, \mu_2, \mu_3, \mu_4). The joint probability mass function for {Y_i} is the product of the four mass functions of the form

P(y_i) = \frac{e^{-\mu_i} \mu_i^{y_i}}{y_i!}, \quad y_i = 0, 1, 2, ...

The total n = \sum_i Y_i also has a Poisson distribution, with parameter \sum_i \mu_i.
With Poisson sampling, the total count n is random rather than fixed. If we assume a Poisson model but condition on n, the {Y_i} no longer have Poisson distributions, since each Y_i cannot exceed n. Given n, the {Y_i} are also no longer independent, since the value of one affects the possible range for the others.
For c independent Poisson variates with E(Y_i) = \mu_i, the conditional probability of a set of counts {n_i} satisfying \sum_i Y_i = n is

P\left( Y_1 = n_1, Y_2 = n_2, ..., Y_c = n_c \mid \sum_j Y_j = n \right) = \frac{P(Y_1 = n_1, Y_2 = n_2, ..., Y_c = n_c)}{P(\sum_j Y_j = n)}

= \frac{\prod_i e^{-\mu_i} \mu_i^{n_i} / n_i!}{e^{-\sum_j \mu_j} \left( \sum_j \mu_j \right)^n / n!}

= \frac{n!}{\prod_i n_i!} \times \frac{\prod_i \mu_i^{n_i}}{\left( \sum_j \mu_j \right)^n}

= \frac{n!}{\prod_i n_i!} \prod_i \pi_i^{n_i}, \quad \text{where } \pi_i = \frac{\mu_i}{\sum_j \mu_j}

which is exactly the multinomial(n, {\pi_i}) distribution.
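This identity is easy to check numerically. The following Python sketch (using scipy; the means \mu_i and the counts are assumed for illustration) compares the conditional Poisson probability with the corresponding multinomial PMF:

# Check: conditioning independent Poisson counts on their total gives
# multinomial probabilities. mu and counts are illustrative assumptions.
import numpy as np
from scipy.stats import poisson, multinomial

mu = np.array([1.0, 2.0, 3.0, 4.0])  # Poisson means (mu1, ..., mu4)
counts = [1, 2, 2, 5]                # a set of counts with total n = 10
n = sum(counts)

# P(Y1=n1, ..., Y4=n4 | sum_j Yj = n), computed directly from Poisson PMFs
joint = np.prod([poisson.pmf(c, m) for c, m in zip(counts, mu)])
conditional = joint / poisson.pmf(n, mu.sum())  # total is Poisson(sum of mu)

# Multinomial PMF with pi_i = mu_i / sum_j mu_j
multi = multinomial.pmf(counts, n=n, p=mu / mu.sum())

print(conditional, multi)  # the two probabilities match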
Likelihood Function
Let x_1, x_2, ..., x_n be a random sample of size n from a population with density function f(x; \theta). When the joint pdf is regarded as a function of \theta, it is called the likelihood function, denoted by L(\theta) and defined as

L(\theta) = f(x_1; \theta)\, f(x_2; \theta) \cdots f(x_n; \theta) = \prod_i f(x_i; \theta)
Maximum Likelihood Estimate
Let L(\theta) = \prod_i f(x_i; \theta) be the likelihood function for the random variables x_1, x_2, ..., x_n. Then the value of \theta that maximizes the likelihood function is called the maximum likelihood estimate (MLE). It is usually denoted by \hat{\theta}.

The MLE of \theta is a solution of the likelihood equation

\frac{\partial L(\theta)}{\partial \theta} = 0

If \hat{\theta} is the MLE of \theta, then

\frac{\partial L(\theta)}{\partial \theta} \Big|_{\theta = \hat{\theta}} = 0 \quad \text{and} \quad \frac{\partial^2 L(\theta)}{\partial \theta^2} \Big|_{\theta = \hat{\theta}} < 0
Likelihood Function and ML Estimate for Binomial Distribution
If an experiment has n trials with success probability \pi, then the probability mass function is

p(y) = \binom{n}{y} \pi^y (1 - \pi)^{n - y}, \quad y = 0, 1, 2, ..., n

The binomial coefficient \binom{n}{y} has no influence on where the maximum occurs with respect to \pi, so we ignore it. The binomial log-likelihood function then becomes

L(\pi) = \log p(y) = \log\left[ \pi^y (1 - \pi)^{n - y} \right] = y \log(\pi) + (n - y) \log(1 - \pi)

Differentiating with respect to \pi and setting the derivative to zero,

\frac{\partial L(\pi)}{\partial \pi} = \frac{y}{\pi} - \frac{n - y}{1 - \pi} = 0

yields

\hat{\pi} = \frac{y}{n}

To calculate the variance of the MLE, compute \frac{\partial^2 L(\pi)}{\partial \pi^2} and then take its expectation:
-E\left( \frac{\partial^2 L(\pi)}{\partial \pi^2} \right) = \frac{n}{\pi (1 - \pi)}

Thus, the asymptotic variance of \hat{\pi} is

\frac{1}{-E\left( \frac{\partial^2 L(\pi)}{\partial \pi^2} \right)} = \frac{\pi (1 - \pi)}{n}
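The closed form \hat{\pi} = y/n can be checked by maximizing the log-likelihood numerically; a short Python sketch (scipy, with assumed data y and n) is:

# Numerical maximization of the binomial log-likelihood vs the closed form.
# The data (y successes in n trials) are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize_scalar

y, n = 6, 10  # assumed data: 6 successes in 10 trials

def neg_loglik(pi):
    # negative of L(pi) = y*log(pi) + (n - y)*log(1 - pi)
    return -(y * np.log(pi) + (n - y) * np.log(1 - pi))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, y / n)             # numerical MLE vs closed form 0.6
print(res.x * (1 - res.x) / n)  # plug-in estimate of the asymptotic variance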
Estimation of Multinomial Parameters
Suppose a multinomial experiment consists of n trials, and each trial can result in any of c possible outcomes y_1, y_2, ..., y_c. Further suppose that the possible outcomes occur with probabilities \pi_1, \pi_2, ..., \pi_c. Then the probability distribution becomes

p(y) = \frac{n!}{y_1!\, y_2! \cdots y_c!}\, \pi_1^{y_1} \pi_2^{y_2} \cdots \pi_c^{y_c} = \frac{n!}{\prod_i y_i!} \prod_{i=1}^{c} \pi_i^{y_i}
The Lagrangian with the constraint \sum_i \pi_i = 1 then has the following form:

L(\pi, \lambda) = \log L(\pi) + \lambda \left( 1 - \sum_i \pi_i \right)

To find the maximum, we differentiate with respect to each \pi_i and set the result to zero:

\frac{\partial L(\pi, \lambda)}{\partial \pi_i} = \frac{y_i}{\pi_i} - \lambda = 0 \quad \Rightarrow \quad \pi_i = \frac{y_i}{\lambda}

To solve for \lambda, we sum both sides and make use of the initial constraint:

\sum_i \pi_i = \sum_i \frac{y_i}{\lambda} = 1 \quad \Rightarrow \quad \lambda = \sum_i y_i = n
Thus the MLE of \pi_i becomes

\hat{\pi}_i = \frac{y_i}{n}
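The same closed form can be verified numerically by maximizing the log-likelihood kernel under the constraint \sum_i \pi_i = 1; here is a minimal Python sketch (scipy, with assumed counts y):

# Constrained maximization of the multinomial log-likelihood kernel,
# compared with the closed form pi_hat_i = y_i / n. Counts are assumed.
import numpy as np
from scipy.optimize import minimize

y = np.array([2, 1, 4])  # assumed observed counts, n = 7

def neg_loglik(pi):
    # kernel of the log-likelihood: sum_i y_i * log(pi_i)
    return -np.sum(y * np.log(pi))

res = minimize(
    neg_loglik,
    x0=np.full(len(y), 1 / len(y)),  # start at the uniform distribution
    constraints={"type": "eq", "fun": lambda pi: pi.sum() - 1},
    bounds=[(1e-6, 1)] * len(y),
)
print(res.x, y / y.sum())  # numerical solution vs closed form y_i / n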
Likelihood-Ratio Chi-squared Test
Let us consider the following hypotheses:

H_0: \pi_j = \pi_{j0}, \; j = 1, 2, ..., c \quad \text{vs} \quad H_1: \pi_j \neq \pi_{j0} \text{ for at least one } j
The kernel of the multinomial likelihood is \prod_j \pi_j^{y_j}. Under H_0 the likelihood is maximized when \hat{\pi}_j = \pi_{j0}. In the general case, it is maximized when \hat{\pi}_j = y_j / n. The ratio of the likelihoods equals

\Lambda = \frac{\prod_j (\pi_{j0})^{y_j}}{\prod_j (y_j / n)^{y_j}} = \prod_j \left( \frac{\pi_{j0}}{y_j / n} \right)^{y_j}
Thus, the likelihood-ratio statistic, denoted by G^2, is

G^2 = -2 \log \Lambda = -2 \sum_j y_j \left[ \log \pi_{j0} - \log(y_j / n) \right] = 2 \sum_j y_j \log\left( \frac{y_j}{n \pi_{j0}} \right)

This statistic is called the likelihood-ratio chi-squared statistic. The larger the value of G^2, the greater the evidence against H_0. Under H_0, G^2 has an approximate chi-squared distribution with c - 1 degrees of freedom.
Example:
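As a hypothetical illustration (the observed counts and null probabilities below are assumed, not taken from real data), G^2 can be computed in a few lines of Python:

# Likelihood-ratio chi-squared test of H0: pi_j = pi_j0 for assumed data.
import numpy as np
from scipy.stats import chi2

y = np.array([30, 50, 20])          # assumed observed counts, n = 100
pi0 = np.array([0.25, 0.50, 0.25])  # assumed null probabilities
n = y.sum()

G2 = 2 * np.sum(y * np.log(y / (n * pi0)))  # G^2 = 2 sum_j y_j log(y_j/(n pi_j0))
p_value = chi2.sf(G2, df=len(y) - 1)        # compare to chi-squared, df = c - 1
print(G2, p_value)                          # G^2 ≈ 2.01, p ≈ 0.37

Here G^2 is small, so these assumed data give little evidence against H_0.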
Reference Book:
i. Agresti, A. (2019). An Introduction to Categorical Data Analysis, 3rd edition. John Wiley & Sons, Inc.
<><><><><><><><><> End <><><><><><><><><>