COLUMBIA UNIVERSITY
Prof. Marco Avella Medina
Statistical Inference and Modeling Spring 2022
Lecture Notes 7: Markov Chains
1 Some Examples
Until now we have focused on the i.i.d. random sample framework. However, in many
applications there are dependence structures in the data that cannot be ignored. In the
three examples below we model the data as dependent discrete random variables.
1.1 Snow Fall Data
Figure 1: Snow Fall Data. (a) Observed Snow Depth; (b) Binary Time Series
Figure 1 gives the daily depth of snow in Central Park for the period between December 17
and December 31, for the years 1912 to 2009. Figure 1a records the snow depth in mm; an
observed snow depth of at least 50 mm on the jth day is represented with a '1', while '0'
indicates otherwise. This yields the binary time series in Figure 1b.
1.2 DNA Data
Figure 2 gives an example of a DNA sequence. The data consist of the first 150 bases
from a sequence of 1572 bases found in a human genome. The plot in Figure 2a shows the
proportion of times the bases (A, C, T or G) appear in a window of width 100 centered at
position t. The estimated proportions seem to be fairly stable across the different locations
in the sequence. Figure 2b shows the observed frequencies and estimated proportions of the
16 possible successive pairs of bases.
Figure 2: DNA Sequence Data. (a) Proportion of Bases; (b) Successive Pairs
1.3 Breast Cancer Data
Figure 3 gives the initial and follow-up status for 37 breast cancer patients treated for spinal
metastases. The three possible statuses are: able to walk unaided (1), unable to walk unaided
(2), and dead (3). The follow-up times are 0, 3, 6, 12, 24, and 60 months after treatment
began. Note that Woman 24 was alive after 6 months but her ability to walk was not
recorded. In this example, there are apparent relationships between successive statuses.
Furthermore, state 3 (dead) is an absorbing state, since it cannot be followed by any other
status.
Figure 3: Breast Cancer Data
2 Markov Chain Model
2.1 Statistical Model
2.1.1 Basics
Let X_t denote a process taking values in a state space S = {1, . . . , S}. The data are of the
form X_0 = s_0, X_{t_1} = s_1, . . . , X_{t_k} = s_k, where 0 < t_1 < · · · < t_k.
Example.
• Snow fall data: t is the day; there are k = 15 days in total per year (between
December 17 and December 31) and the state space is S = {1, 2} for the states "no
snow" and "snow".
• DNA data: t is the location in the DNA sequence; we have k = 1571 locations in total
and the state space is S = {1, 2, 3, 4} = {A, C, G, T}.
• Breast cancer data: in total, k = 5 at most, with t_0 = 0, t_1 = 3, t_2 = 6, t_3 = 12,
t_4 = 24, and t_5 = 60 months, and state space S = {1, 2, 3} representing the states
"walk unaided", "not able to walk unaided" and "dead".
2.1.2 Markov Property
Assume that we have a sample Xt1 , . . . , Xtk of dependent data. Using conditional probabil-
ities, we can always write the joint frequency function as
\[
\begin{aligned}
&P(X_0 = s_0, X_{t_1} = s_1, \ldots, X_{t_k} = s_k) \\
&\quad = P(X_0 = s_0, X_{t_1} = s_1, \ldots, X_{t_{k-1}} = s_{k-1})\,
P(X_{t_k} = s_k \mid X_0 = s_0, X_{t_1} = s_1, \ldots, X_{t_{k-1}} = s_{k-1}).
\end{aligned}
\]
The first factor can be factorized using the same rule:
\[
\begin{aligned}
&P(X_0 = s_0, X_{t_1} = s_1, \ldots, X_{t_{k-1}} = s_{k-1}) \\
&\quad = P(X_0 = s_0, \ldots, X_{t_{k-2}} = s_{k-2})\,
P(X_{t_{k-1}} = s_{k-1} \mid X_0 = s_0, \ldots, X_{t_{k-2}} = s_{k-2}).
\end{aligned}
\]
Iterating conditional probabilities following the same logic leads to
\[
P(X_0 = s_0, X_{t_1} = s_1, \ldots, X_{t_k} = s_k)
= P(X_0 = s_0) \prod_{j=1}^{k} P(X_{t_j} = s_j \mid X_0 = s_0, X_{t_1} = s_1, \ldots, X_{t_{j-1}} = s_{j-1}).
\]
The (first-order) Markov property assumption requires that, given the "present" X_i, the
"future" X_{i+1}, X_{i+2}, . . . is independent of the "past" X_{i-1}, X_{i-2}, . . . . This leads to the
simplification
\[
P(X_0 = s_0, \ldots, X_{t_k} = s_k) = P(X_0 = s_0) \prod_{i=1}^{k} P(X_{t_i} = s_i \mid X_{t_{i-1}} = s_{i-1}).
\]
A random process that satisfies this (first-order) Markov property, in which the future is
independent of the past given the present, is called a (first-order) Markov process.
2.1.3 Markov Chains
A Markov Chain is a Markov process in discrete time taking values in a state space
S = {1, . . . , S}.
Figure 4: Markov Chain Models. (a) Snow Fall Data; (b) DNA Sequence Data; (c) Breast Cancer Data
Example. The Markov chain graphs for the three examples above are shown in Figure 4.
The arrows indicate the direction of possible transitions in the following time step.
2.2 Stationarity and Transition Probabilities
• Stationarity
We will assume that the process X_t is stationary, i.e.
\[
P(X_t = s \mid X_u = r) = P(X_{t-u} = s \mid X_0 = r).
\]
Hence the conditional probabilities only depend on the time differences. In particular,
stationarity assumes that the probability of transitioning between states will not change
over time.
• Transition Probabilities
For a stationary chain X_t observed at discrete, equally spaced times t = 0, 1, . . . , k,
we define the transition probabilities as
\[
p_{rs} = P(X_1 = s \mid X_0 = r), \qquad r, s \in S.
\]
Note that we only need to consider the event X_1 = s given X_0 = r, since the transition
probabilities are assumed to stay fixed over time and the observation times are equally spaced.
2.3 Transition Matrix
• The transition probabilities define the S × S transition matrix P whose (r, s) element is
p_{rs}. Summing all elements in the rth row of P gives \sum_{j=1}^{S} p_{rj} = \sum_{j=1}^{S} P(X_1 = j \mid X_0 = r) = 1.
Since each row of the transition matrix sums up to 1, we have that
P 1_S = 1_S, where 1_S denotes the S-dimensional vector of ones.
• Consider an initial probability vector p_{(0)} = (p_1^{(0)}, p_2^{(0)}, . . . , p_S^{(0)})^⊤ where p_r^{(0)} = P(X_0 = r).
Then the probability distribution of the state at time 1 is
\[
p_{(1)}^\top = p_{(0)}^\top P
= \big[p_1^{(0)} \; p_2^{(0)} \; \cdots \; p_S^{(0)}\big]
\begin{pmatrix}
p_{11} & p_{12} & \cdots & p_{1S} \\
p_{21} & p_{22} & \cdots & p_{2S} \\
\vdots & \vdots & \ddots & \vdots \\
p_{S1} & p_{S2} & \cdots & p_{SS}
\end{pmatrix}
= \Big[\sum_{j=1}^{S} p_j^{(0)} p_{j1} \;\; \sum_{j=1}^{S} p_j^{(0)} p_{j2} \;\; \cdots \;\; \sum_{j=1}^{S} p_j^{(0)} p_{jS}\Big].
\]
Therefore the sth element of p_{(1)}^⊤ = p_{(0)}^⊤ P is P(X_1 = s) = \sum_{j=1}^{S} p_j^{(0)} p_{js}. By the same
reasoning, the probability distribution of X_n is given by p_{(n)}^⊤ = p_{(0)}^⊤ P^n.
• A Markov chain is said to have a stationary distribution π if π^⊤ P = π^⊤ and
π^⊤ 1_S = 1. In other words, π is a left eigenvector of P (an eigenvector of P^⊤) associated
with the eigenvalue λ = 1, normalized to sum to one. When the limit exists, it is the
limiting distribution π^⊤ = lim_{n→∞} p_{(n)}^⊤ = lim_{n→∞} p_{(0)}^⊤ P^n.
Example. Consider the simple four-state Markov chain shown in Figure 5. The model
diagram and transition matrix are shown in Figures 5a and 5b respectively.
Figure 5: Four-state Markov chain. (a) Model Diagram; (b) Transition Matrix
By solving π^⊤ P = π^⊤ and π^⊤ 1_S = 1, we get
\[
\begin{cases}
\tfrac{1}{3}\pi_2 = \pi_1 \\
\pi_1 + \tfrac{1}{2}\pi_3 + \tfrac{1}{2}\pi_4 = \pi_2 \\
\tfrac{1}{3}\pi_2 + \tfrac{1}{2}\pi_4 = \pi_3 \\
\tfrac{1}{3}\pi_2 + \tfrac{1}{2}\pi_3 = \pi_4 \\
\pi_1 + \pi_2 + \pi_3 + \pi_4 = 1
\end{cases}
\]
Solving this linear system we get a unique solution π = (1/8, 3/8, 1/4, 1/4)^⊤. Note that
this stationary distribution does not depend on the initial probabilities. In this example
there is a unique stationary distribution; however, this is not always the case.
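As a sanity check, π can be recovered numerically by solving the stationarity equations as a linear system. A short sketch, where the matrix P below is the one implied by the equations above (i.e., the transition matrix of Figure 5b):

```python
import numpy as np

# Transition matrix implied by the stationarity equations above:
# state 1 -> 2 w.p. 1; state 2 -> 1, 3, 4 each w.p. 1/3;
# state 3 -> 2, 4 each w.p. 1/2; state 4 -> 2, 3 each w.p. 1/2.
P = np.array([[0,   1,   0,   0  ],
              [1/3, 0,   1/3, 1/3],
              [0,   1/2, 0,   1/2],
              [0,   1/2, 1/2, 0  ]])

# Stack pi^T (P - I) = 0 with the constraint sum(pi) = 1, solve by least squares.
S = P.shape[0]
A = np.vstack([(P - np.eye(S)).T, np.ones(S)])  # (S+1) x S coefficient matrix
b = np.concatenate([np.zeros(S), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)  # [0.125 0.375 0.25 0.25] = (1/8, 3/8, 1/4, 1/4)
```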
Example. From the diagram representing the Markov chain in Figure 6a, we can
construct the transition matrix P shown in Figure 6b.
Figure 6: Six-state Markov chain. (a) Model Diagram; (b) Transition Matrix
We can check that the elements in each row of the transition matrix sum to 1 since
\[
P 1_S =
\begin{pmatrix}
1/2 & 1/2 & 0 & 0 & 0 & 0 \\
1/4 & 3/4 & 0 & 0 & 0 & 0 \\
1/4 & 1/4 & 1/4 & 1/4 & 0 & 0 \\
1/4 & 0 & 1/4 & 1/4 & 0 & 1/4 \\
0 & 0 & 0 & 0 & 1/2 & 1/2 \\
0 & 0 & 0 & 0 & 1/2 & 1/2
\end{pmatrix}
\begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}
=
\begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}
= 1_S.
\]
Further assume that the initial probability vector is p_{(0)} = (1/4, 1/4, 0, 1/4, 0, 1/4)^⊤.
We can then compute P(X_2 = C) as the third element of p_{(0)}^⊤ P²:
\[
p_{(0)}^\top P^2 = \big[0.234 \;\; 0.375 \;\; 0.031 \;\; 0.031 \;\; 0.156 \;\; 0.172\big],
\]
so P(X_2 = C) ≈ 0.031.
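This computation is straightforward to verify numerically; a quick sketch using the transition matrix above:

```python
import numpy as np

P = np.array([[1/2, 1/2, 0,   0,   0,   0  ],
              [1/4, 3/4, 0,   0,   0,   0  ],
              [1/4, 1/4, 1/4, 1/4, 0,   0  ],
              [1/4, 0,   1/4, 1/4, 0,   1/4],
              [0,   0,   0,   0,   1/2, 1/2],
              [0,   0,   0,   0,   1/2, 1/2]])
p0 = np.array([1/4, 1/4, 0, 1/4, 0, 1/4])

p2 = p0 @ np.linalg.matrix_power(P, 2)
print(np.round(p2, 3))  # [0.234 0.375 0.031 0.031 0.156 0.172]
print(p2[2])            # P(X_2 = C) = 0.03125
```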
The stationary distribution π = (π_1, π_2, . . . , π_6)^⊤ has to satisfy π^⊤ P = π^⊤ and
π^⊤ 1_S = 1. Solving these equations we get
\[
\begin{cases}
2\pi_1 = \pi_2 \\
\pi_3 = \pi_4 = 0 \\
\pi_5 = \pi_6 \\
3\pi_1 + 2\pi_5 = 1
\end{cases}
\]
In this case there is not a unique stationary distribution; one of them is π =
(0.2, 0.4, 0, 0, 0.2, 0.2)^⊤. In fact, there is a unique stationary distribution when the chain is
both irreducible and positive recurrent.¹
¹ See the section "Classification of chains", pp. 229-232, in A.C. Davison (2003) for details.
2.4 Irreducibility
We say that state j is reachable from state i, denoted by i → j, if there exists an integer
n ≥ 0 such that P^n_{ij} > 0. A Markov chain is irreducible if and only if all states are reachable
from each other.
Example. The Markov chain illustrated in Figure 5 is irreducible, but the one from
Figure 6 is not.
Remark. If a Markov chain is reducible, there exists at least one "isolated island" of
states that cannot be reached from the remaining states.
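For a finite chain, irreducibility can be checked mechanically from the definition: j is reachable from i exactly when some power of P has a positive (i, j) entry, and paths of length at most S − 1 suffice. A Python sketch (the helper name is ours):

```python
import numpy as np

def is_irreducible(P):
    """Check irreducibility of a finite chain with transition matrix P.

    State j is reachable from i iff (I + A)^(S-1) has a positive (i, j)
    entry, where A is the 0/1 adjacency matrix of possible transitions;
    that power accumulates all paths of length <= S - 1.
    """
    S = P.shape[0]
    reach = np.linalg.matrix_power(np.eye(S) + (P > 0), S - 1)
    return bool((reach > 0).all())

P6 = np.array([[1/2,1/2,0,0,0,0], [1/4,3/4,0,0,0,0], [1/4,1/4,1/4,1/4,0,0],
               [1/4,0,1/4,1/4,0,1/4], [0,0,0,0,1/2,1/2], [0,0,0,0,1/2,1/2]])
print(is_irreducible(P6))  # False: states 3, 4 cannot be reached from 1, 2, 5, 6
```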
2.5 Periodicity (Optional)
The period of a state i in a Markov chain with transition matrix P is given by
\[
d(i) = \gcd\{m \geq 1 : P^m_{ii} > 0\}.
\]
If d(i) = 1, then the state i is called aperiodic.
Remark 1. A Markov chain is called aperiodic if and only if all its states are aperiodic.
Remark 2. If a Markov chain is aperiodic and has finitely many states, then there exists
some positive integer N such that
\[
P^m_{ii} > 0
\]
for any m ≥ N and any state i.
Remark 3. If an irreducible Markov chain is aperiodic and has finitely many states, then
there exists some positive integer N such that
\[
P^m_{ij} > 0
\]
for any m ≥ N and any states i and j.
2.6 Recurrence and Transience (Optional)
Let τ_{i,i} denote the return time to state i given X_0 = i:
\[
\tau_{i,i} = \min\{k \geq 1 : X_k = i \mid X_0 = i\}.
\]
If the Markov chain never revisits i, then τ_{i,i} = ∞. The state i is recurrent if P(τ_{i,i} < ∞) = 1,
and transient if P(τ_{i,i} = ∞) > 0.
If state i is recurrent and E(τ_{i,i}) < ∞, then we say that state i is positive recurrent. Otherwise,
state i is called null recurrent.
Example. Recurrence and transience of the simple random walk on Z^d: it can be proved
that the walk is recurrent when d = 1, 2 but transient when d ≥ 3. Hence the famous
quote: "A drunken man will always find his way home, but a drunken bird may get lost
forever."
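A small simulation illustrates the dichotomy by estimating the probability of returning to the origin within a fixed horizon. This is an illustrative sketch, not part of the original notes; the horizon and number of paths are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_return(d, n_steps=2_000, n_paths=1_000):
    """Estimate P(simple random walk on Z^d returns to 0 within n_steps)."""
    returns = 0
    for _ in range(n_paths):
        pos = np.zeros(d, dtype=int)
        for _ in range(n_steps):
            axis = rng.integers(d)            # pick a coordinate uniformly
            pos[axis] += rng.choice((-1, 1))  # step +/- 1 along it
            if not pos.any():                 # back at the origin
                returns += 1
                break
    return returns / n_paths

print(prob_return(1))  # close to 1: recurrent
print(prob_return(3))  # near 0.3: transient
```

For d = 1 the estimate approaches 1 as the horizon grows, while for d = 3 it stabilizes near the return probability of roughly 0.34 (Polya's constant), strictly below 1.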
Remark. If all states in an irreducible Markov chain are positive recurrent, then we say that
the Markov chain is positive recurrent. If all states in an irreducible Markov chain are null
recurrent, then we say that the Markov chain is null recurrent. If all states in an irreducible
Markov chain are transient, then we say that the Markov chain is transient.
3 Likelihood Inference
3.1 Likelihood Function
Assuming we have observed data s0 , s1 , . . . , sk at times 0, 1, . . . , k, from a stationary Markov
chain Xt , the likelihood function is
\[
\begin{aligned}
P(X_0 = s_0, \ldots, X_k = s_k)
&= P(X_0 = s_0) \prod_{i=0}^{k-1} P(X_{i+1} = s_{i+1} \mid X_i = s_i) \\
&= P(X_0 = s_0) \prod_{i=0}^{k-1} p_{s_i s_{i+1}} \\
&= p_0 \prod_{r=1}^{S} \prod_{s=1}^{S} p_{rs}^{n_{rs}},
\end{aligned}
\]
where p_0 = P(X_0 = s_0) and n_{rs} is the number of observed transitions from r to s. Since
\prod_{i=0}^{k-1} p_{s_i s_{i+1}} = p_{s_0 s_1} \times p_{s_1 s_2} \times \cdots \times p_{s_{k-1} s_k}, it suffices to count the number of times
each transition occurs; therefore \prod_{i=0}^{k-1} p_{s_i s_{i+1}} = \prod_{r=1}^{S} \prod_{s=1}^{S} p_{rs}^{n_{rs}}.
From the expression above we get the log-likelihood
\[
\ell(P) = \sum_{r=1}^{S} \sum_{s=1}^{S} n_{rs} \log p_{rs} + \log p_0.
\]
When taking partial derivatives to compute the MLE of p_{rs}, the term log p_0 disappears,
so the optimization does not depend on the initial probabilities.
The MLE of p_{rs} can be shown to be of the form
\[
\hat{p}_{rs} = \frac{n_{rs}}{n_{r\cdot}}, \qquad \text{where } n_{r\cdot} = \sum_{j=1}^{S} n_{rj}.
\]
This is the most intuitive estimator: it simply computes the observed proportion of
transitions, as in Figure 2b of the DNA example.
Let the rows of the transition matrix be P = (p_1 p_2 . . . p_S)^⊤, where p_r is the transition
probability vector from state r, and let P̂ = (p̂_1 p̂_2 . . . p̂_S)^⊤ denote the MLE of the
transition matrix. Under regularity conditions (ergodicity) the p̂_{rs} have a joint normal
distribution:
\[
\sqrt{n_{r\cdot}}\,(\hat{p}_r - p_r) \xrightarrow[n \to \infty]{D} N(0, V_r),
\]
where n_{r·} is the number of visits the chain makes to state r, and
\[
V_r =
\begin{pmatrix}
p_{r1}(1 - p_{r1}) & -p_{r1} p_{r2} & -p_{r1} p_{r3} & \cdots & -p_{r1} p_{rS} \\
-p_{r2} p_{r1} & p_{r2}(1 - p_{r2}) & -p_{r2} p_{r3} & \cdots & -p_{r2} p_{rS} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
-p_{rS} p_{r1} & -p_{rS} p_{r2} & -p_{rS} p_{r3} & \cdots & p_{rS}(1 - p_{rS})
\end{pmatrix},
\]
and p̂_i and p̂_j are asymptotically independent for i ≠ j. Therefore the covariances of the
p̂_{rs} can be estimated by
\[
\widehat{\mathrm{cov}}(\hat{p}_{rs}, \hat{p}_{tu}) =
\begin{cases}
\hat{p}_{rs}(1 - \hat{p}_{rs})/n_{r\cdot}, & r = t,\; s = u, \\
-\hat{p}_{rs}\,\hat{p}_{ru}/n_{r\cdot}, & r = t,\; s \neq u, \\
0, & \text{otherwise.}
\end{cases}
\]
Example. Assume X = (0, 0, 0, 1, 1, 0, 0, 0, 1, 0) was drawn from a first-order Markov
chain with state space {0, 1}, and that the initial probability vector is p_{(0)}^⊤ =
(p_1^{(0)}, p_2^{(0)}) = (0.8, 0.2). The observed transition counts are n_{00} = 4, n_{01} = 2,
n_{10} = 2 and n_{11} = 1, so the observed likelihood function is
\[
P(X) = p_0 \prod_{r=0}^{1} \prod_{s=0}^{1} p_{rs}^{n_{rs}}
= 0.8 \times p_{00}^4 \times p_{01}^2 \times p_{10}^2 \times p_{11}.
\]
The maximum likelihood estimates of the p_{rs} are therefore
\[
\hat{p}_{00}^{MLE} = \frac{4}{6}, \qquad
\hat{p}_{01}^{MLE} = \frac{2}{6}, \qquad
\hat{p}_{10}^{MLE} = \frac{2}{3}, \qquad
\hat{p}_{11}^{MLE} = \frac{1}{3}.
\]
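These counts and estimates are easy to reproduce. A short sketch that also returns the estimated variances p̂_{rs}(1 − p̂_{rs})/n_{r·} from the asymptotic covariance formula above (the function name is ours):

```python
import numpy as np

def fit_markov_chain(x, n_states):
    """MLE of the transition matrix of a first-order chain from one path."""
    counts = np.zeros((n_states, n_states))
    for r, s in zip(x[:-1], x[1:]):  # count observed transitions r -> s
        counts[r, s] += 1
    n_row = counts.sum(axis=1, keepdims=True)  # n_r.
    p_hat = counts / n_row
    var_hat = p_hat * (1 - p_hat) / n_row      # estimated var of each p_hat_rs
    return counts, p_hat, var_hat

x = [0, 0, 0, 1, 1, 0, 0, 0, 1, 0]
counts, p_hat, var_hat = fit_markov_chain(x, n_states=2)
print(counts)  # [[4. 2.] [2. 1.]]
print(p_hat)   # [[0.667 0.333] [0.667 0.333]]
```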
3.2 A Simpler Model
One could test a simpler model where the transition probabilities are independent of the
current state i.e. prs = ps This model is a zeroth-order Markov Chain PS that just assumes a
multinomial distribution.
P In
P this case the likelihood simplifies to j=1 n·j log pj and p̂s =
n·s /n·· with n·· = s n·s = r,s nrs .
The likelihood ratio test statistic for testing a zeroth-order vs. a first-order Markov chain
model is
\[
\begin{aligned}
\Lambda_n &= 2\big\{\ell(\hat{P})_{\text{first order}} - \ell(\hat{P})_{\text{zeroth order}}\big\}
= 2\Big\{\sum_{r=1}^{S}\sum_{s=1}^{S} n_{rs} \log \hat{p}_{rs} - \sum_{s=1}^{S} n_{\cdot s} \log \hat{p}_s\Big\} \\
&= 2\Big\{\sum_{r=1}^{S}\sum_{s=1}^{S} n_{rs} \log \hat{p}_{rs} - \sum_{r=1}^{S}\sum_{s=1}^{S} n_{rs} \log \hat{p}_s\Big\}
= 2\sum_{r=1}^{S}\sum_{s=1}^{S} n_{rs} \log \frac{\hat{p}_{rs}}{\hat{p}_s} \\
&= 2\sum_{r=1}^{S}\sum_{s=1}^{S} n_{rs} \log \frac{n_{rs}\, n_{\cdot\cdot}}{n_{r\cdot}\, n_{\cdot s}}.
\end{aligned}
\]
To test the simpler independence model, the null hypothesis is H_0: "all rows of P equal
(p_1, . . . , p_S)", or equivalently, "the zeroth-order model holds". Under H_0 we have
\[
\Lambda_n \xrightarrow[n \to \infty]{D} \chi^2_{(S-1)^2}.
\]
There are (S − 1)² degrees of freedom since the difference between the numbers of parameters
in the two models is S(S − 1) − (S − 1) = (S − 1)². Note that, since the probabilities in each
row must sum to 1, each row contributes only S − 1 free parameters.
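Given a matrix of transition counts n_{rs}, the statistic and its p-value take only a few lines. A sketch, where the 2-state count matrix is an illustrative choice rather than the DNA data:

```python
import numpy as np
from scipy.stats import chi2

def lrt_zeroth_vs_first(counts):
    """Likelihood ratio test of a zeroth- vs. first-order Markov chain.

    counts[r, s] = number of observed transitions r -> s.
    Returns (Lambda_n, p-value) with df = (S - 1)^2.
    """
    counts = np.asarray(counts, dtype=float)
    S = counts.shape[0]
    n_row = counts.sum(axis=1, keepdims=True)  # n_r.
    n_col = counts.sum(axis=0, keepdims=True)  # n_.s
    n_tot = counts.sum()                       # n_..
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = counts * np.log(counts * n_tot / (n_row * n_col))
    lam = 2 * np.nansum(terms)                 # cells with n_rs = 0 contribute 0
    return lam, chi2.sf(lam, df=(S - 1) ** 2)

counts = np.array([[30, 10],
                   [12, 28]])                  # illustrative 2-state counts
print(lrt_zeroth_vs_first(counts))
```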
Example. DNA Sequence Data
Computing the likelihood ratio statistic for the independence (zeroth-order) model against
the first-order model, Λ_n is asymptotically χ²_9 under H_0; the observed value of the
statistic is t_n = 64.45, with p-value p < .00001. This gives strong evidence against the
independence model.
Figure 7: Fit of Markov Chains to the DNA Data
Note that Λ_n is asymptotically equivalent to the Pearson chi-squared statistic
\sum_i (O_i - E_i)^2 / E_i, where the sum is taken over all the cells in Figure 2b (16 in total),
the O_i are the observed counts in those cells, and the E_i denote the expected counts under
H_0: "the zeroth-order model is correct". The E_i are calculated using the empirical
proportions p̂_s = n_{·s}/n_{··} and the total counts per row in Figure 2b. Since the
standardized residuals (O_i − E_i)/E_i^{1/2} are asymptotically normally distributed under
H_0, they can be used for model checking. From the Q-Q plots in Figure 7 we can also
conclude that the first-order model (right) fits better than the zeroth-order model (left).
In particular, while the left plot shows that the zeroth-order and first-order models give
very different fits, the Q-Q plot on the right-hand side shows no huge discrepancies between
the fits of the first- and second-order models. In that case the O_i denote the counts in the
cells of Figure 8 (64 in total) and the expected counts are computed under H_0: "the
first-order model is correct", using the empirical proportions p̂_{rs} = n_{rs}/n_{r·} and the
total counts of each row of Figure 8.
3.3 Higher Order Models
One can extend the idea of a first-order Markov chain to chains of order m, where the
probability of a transition into s depends on the m previous states:
\[
\begin{aligned}
P(X_j = s \mid X_0 = s_0, \ldots, X_{j-1} = s_{j-1})
&= P(X_j = s \mid X_{j-m} = s_{j-m}, \ldots, X_{j-1} = s_{j-1}) \\
&= p_{s_{j-m} s_{j-m+1} \ldots s_{j-1} s}.
\end{aligned}
\]
Thus the 'current' state X_j depends not only on X_{j−1} but on X_{j−m}, X_{j−m+1}, . . . , X_{j−1}.
When m = 1 we have S(S − 1) parameters, since we had S vectors of parameters (p_{r1}, . . . , p_{rS}),
each with S − 1 free entries. In general there are S^m such vectors, which leads to S^m(S − 1)
parameters. Hence the dimension of the parameter increases exponentially fast with the order
of the Markov chain. If we consider a second-order model for the DNA data example, i.e. a
model where the 'current' state depends on the previous two states, we get the observed
frequencies of Figure 8; see the sketch after this paragraph for how such a table is built.
Figure 8: Second-order Markov Chain Model
The proportions of transitions described by the 'frequencies for third base' in Figure 8
need to sum to 1, so the second-order Markov model has S − 1 = 3 free parameters in each
row; there are S^m(S − 1) = 16 × 3 = 48 unknown parameters in total. A third-order model
would have S^m = 4³ = 64 rows, hence S^m(S − 1) = 64 × 3 = 192 parameters.
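In practice an order-m chain can be fitted by treating each m-tuple of preceding states as the row index of an enlarged transition table, exactly as in Figure 8. A sketch with hypothetical data (the function and variable names are ours):

```python
import numpy as np
from itertools import product

def higher_order_counts(x, n_states, m):
    """Transition counts for an order-m chain: one row per m-tuple of states."""
    rows = {t: i for i, t in enumerate(product(range(n_states), repeat=m))}
    counts = np.zeros((n_states ** m, n_states))
    for j in range(m, len(x)):
        counts[rows[tuple(x[j - m:j])], x[j]] += 1  # (m previous states) -> x[j]
    return counts

x = [0, 1, 3, 2, 0, 1, 1, 3, 2, 0, 2, 1, 3, 0]    # toy sequence, 4 states
counts = higher_order_counts(x, n_states=4, m=2)  # 16 x 4 table, like Figure 8
p_hat = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
```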
Example. Model Selection (DNA Sequence Data)
The likelihood ratio statistic for testing first-order vs. second-order dependence leads to
P(χ²_36 > 55.2) ≈ 0.02, and the likelihood ratio statistic for testing second-order vs.
third-order dependence leads to P(χ²_144 > 150.3) ≈ 0.34. Recall that there is strong
evidence (p < .00001) for the first-order dependence model compared to the independence
(zeroth-order) model; the evidence for the second- compared to the first-order model is
weaker, and there is no suggestion of third-order dependence.
Another way to perform model selection is to compare AIC values. The Akaike Information
Criterion (AIC) is a general likelihood-based model selection criterion. For a parametric
model it can be defined as
\[
\mathrm{AIC}(\hat{\theta}_k) = -2\,\ell_k(\hat{\theta}_k) + 2k,
\]
where θ_k is a k-dimensional parameter, θ̂_k is the MLE of θ_k and ℓ_k(θ̂_k) denotes the log-
likelihood evaluated at the MLE. Using the AIC as the model selection criterion, we choose
the model with minimum AIC.² The AIC values for the zeroth-, first-, second- and third-order
models are 4122.9, 4076.0, 4092.8 and 4230.5. This suggests that m = 1 is the best model.
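The comparison itself is mechanical once the maximized log-likelihoods are available. In the sketch below, the log-likelihood values are the ones implied by the quoted AICs and parameter counts, i.e. ℓ = −(AIC − 2k)/2; they were not computed independently:

```python
# AIC comparison across Markov chain orders for the DNA data (S = 4).
# Parameter counts S^m (S - 1) for orders m = 0, 1, 2, 3.
n_params = {0: 3, 1: 12, 2: 48, 3: 192}
# Log-likelihoods implied by the AIC values quoted in the notes.
loglik = {0: -2058.45, 1: -2026.0, 2: -1998.4, 3: -1923.25}

aic = {m: -2 * loglik[m] + 2 * n_params[m] for m in n_params}
best = min(aic, key=aic.get)
print(aic)   # {0: 4122.9, 1: 4076.0, 2: 4092.8, 3: 4230.5}
print(best)  # 1 -> the first-order model minimizes AIC
```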
4 Final Remarks
The general ideas developed here can be extended in many ways. Major examples include:
• Markov Chain Monte Carlo: an extremely useful method for running simulations and
Bayesian inference.
• Continuous-time models: extend the Markov property to infinitesimal intervals [t, t + δ).
Discrete-time Markov chains miss the precise lengths of time spent in each state before
a transition occurs; continuous-time models accommodate such event times in a more
realistic way.
• Hidden Markov Models: the distribution that generates an observation depends on the
state of an underlying, unobserved Markov process. In the snow fall example, if the
observed states S = {'snow', 'no snow'} also depend on an unobserved atmospheric
pressure state H = {'low', 'high'}, the data can be modeled with a Hidden Markov
Model.
² See Chapter 4.7 of Statistical Models (Davison, 2003) for more details.
Appendix A Eigenvalues and Eigenvectors
Given a square matrix A of dimension n × n, the eigenvalues are defined as the roots of the
characteristic equation
\[
|A - \lambda I_n| = 0.
\]
Let λ be an eigenvalue of A. An eigenvector x associated with the eigenvalue λ satisfies
\[
A x = \lambda x.
\]
In general, eigenvectors are normalized so that x^⊤ x = 1. The multiplicity of an eigenvalue
corresponds to the number of times it is a root of the characteristic equation.
• A and A> have the same eigenvalues, but not necessarily the same eigenvectors.
• If A is non-singular and λ is an eigenvalue of A, then 1/λ is an eigenvalue of A⁻¹.
• If A is a matrix n × n and G a non-singular matrix n × n, then A and G−1 AG have
the same eigenvalues.
• A singular matrix has at least one eigenvalue equal to zero.
• An idempotent matrix has only eigenvalues equal to 0 or 1.
• An orthogonal matrix has only eigenvalues of modulus 1; its real eigenvalues are equal to 1 or -1.
• Given A a real symmetric n × n matrix:
– All its eigenvalues are real.
– The eigenvectors are orthogonal.
– A is positive definite (positive semi-definite) if and only if its eigenvalues are
positive (non-negative).
– There exists an orthogonal matrix P such that
\[
P A P^\top = \Lambda =
\begin{pmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_n
\end{pmatrix},
\]
with P^⊤ Λ P = A.
– If A is in addition positive definite, setting Q = Λ^{-1/2} P we obtain
\[
Q A Q^\top = I.
\]
• Given A an n × n real symmetric, positive definite matrix:
– |A| = \prod_{i=1}^{n} \lambda_i
– \mathrm{tr}(A) = \sum_i a_{ii} = \sum_i \lambda_i
– The eigenvalues of I − A are equal to 1 − λ_i.
• Given A and B two real symmetric n × n matrices, with B positive definite:
– The roots of |A − λB| = 0 are the same as the eigenvalues of (T^⊤)⁻¹ A T⁻¹, where B = T^⊤ T.
– There exists a matrix R such that
\[
R A R^\top = \Lambda \quad \text{and} \quad R B R^\top = I_n.
\]
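These spectral decomposition identities are easy to verify numerically. A small sketch using numpy, with an arbitrary symmetric positive definite matrix built for the purpose:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4 * np.eye(4)   # symmetric positive definite by construction

lam, V = np.linalg.eigh(A)    # eigendecomposition for symmetric matrices
P = V.T                       # rows of P are orthonormal eigenvectors
assert np.allclose(P @ A @ P.T, np.diag(lam))      # P A P^T = Lambda
assert np.allclose(P.T @ np.diag(lam) @ P, A)      # P^T Lambda P = A

Q = np.diag(lam ** -0.5) @ P                       # Q = Lambda^(-1/2) P
assert np.allclose(Q @ A @ Q.T, np.eye(4))         # Q A Q^T = I
print(np.allclose(np.prod(lam), np.linalg.det(A))) # |A| = product of eigenvalues
print(np.allclose(lam.sum(), np.trace(A)))         # tr(A) = sum of eigenvalues
```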