Stochastic Processes

1. Lecture 1: April 3, 2012


1.1. Motivating Remarks: Differential Equations. In the deterministic world, a stan-
dard tool used for modeling the evolution of a system is a differential equation. Such an
equation relates the rate of change of a measured quantity, at a given time t, to the value
of that quantity at time t. A typical first order ordinary differential equation takes the
form
x'(t) = F (t, x(t))
for some (hopefully nice) function F . Note that the change in x does not depend on the
value of x at earlier times in this equation. An example of such an equation would be
x'(t) = x(t − t0)
for some fixed time-difference t0 . This is called a delay equation. Delay equations are
extremely difficult to solve, or even qualitatively analyze; their behavior can depend on
the parameter t0 in complicated ways. For example, if t0 = π/2 then the functions x(t) =
cos t and x(t) = sin t are solutions of the above delay equation. But for other values of t0
(even close to π/2), solutions cannot be described in terms of elementary functions, and are
not periodic.
This is a course in stochastic processes, meaning processes that are inherently random.
In the same manner as above, such processes are vastly easier to analyze if their future
behavior can be predicted knowing only their current behavior, without reference to the
past. We will begin this course with a study of the discrete-time version of this property.

1.2. Markov Chains. Before we define Markov processes, we must define stochastic pro-
cesses.
Definition 1.1. Let S be a fixed state space. A (discrete time) stochastic process (Xn )n∈N is a
sequence of random variables all defined on a fixed probability space (Ω, P) all taking values in S.
A (continuous time) stochastic process (Xt )t≥0 is a collection of random variables Xt indexed
by t ∈ [0, ∞), all defined on a fixed probability space (Ω, P) all taking values in S.
We will discuss exclusively discrete time processes for the first half of the course.
Example 1.2. Let X1 , X2 , . . . be i.i.d. real-valued random variables; then they form a dis-
crete time stochastic process with state space R. Similarly, the process Sn = X1 + · · · + Xn
is a discrete time stochastic process with state space R.
Example 1.3. As a special case of the previous example, sample the i.i.d. variables from
the symmetric Bernoulli distribution: P(Xn = ±1) = 1/2. Then the state space for the
process (Xn ) is just {±1}. The corresponding process Sn has state space Z. The process
Sn is called the symmetric random walk on Z.
Example 1.4. Fix N > 0 in N. We will soon describe stochastic processes Xn which behave
as the symmetric random walk Sn does from Example 1.3 when 0 < Xn < N , but
have different behavior when Xn reaches 0 or N . These will be known as reflected, par-
tially reflected, and absorbing random walks. To describe them, we need the technology
of Markov processes, which we introduce next.

Given a collection X1 , . . . , Xn of random variables with discrete state space S, their joint
distribution is completely described by the collection of probabilities
{P(X1 = i1 , X2 = i2 , . . . , Xn = in ) : i1 , . . . , in ∈ S}.
Thinking of the index as a time parameter, it is useful to express these numbers, using the
definition of conditional probability, as
P(X1 = i1 , . . . , Xn = in )
= P(X1 = i1 )P(X2 = i2 |X1 = i1 )P(X3 = i3 |X1 = i1 , X2 = i2 ) · · · P(Xn = in |X1 = i1 , . . . , Xn−1 = in−1 ).
Definition 1.5. Let Xn be a discrete time stochastic process, with state space S that is finite or
countably infinite. Then Xn is called a (discrete time) Markov chain if, for each n ∈ N and any
i1 , . . . , in ∈ S,
P(Xn = in |X1 = i1 , . . . , Xn−1 = in−1 ) = P(Xn = in |Xn−1 = in−1 ).
A Markov chain (sometimes called a Markov process, though that is usually reserved
for the case of an uncountable state space) is a stochastic process whose evolution to the
next state depends only on the current state, not the past behavior. Let’s reconsider the
above examples:
• In Example 1.2 (making the additional assumption that the state space is a count-
able or finite subset of R to conform to the definition), the independence assump-
tion shows that
P(X1 = i1 , . . . , Xn = in ) = P(X1 = i1 ) · · · P(Xn = in ).
In particular, this means that P(Xn = in |X1 = i1 , . . . , Xn−1 = in−1 ) = P(Xn = in ) =
P(Xn = in |Xn−1 = in−1 ). Hence, Xn is a Markov chain. For the process Sn of sums,
P(S1 = i1 , S2 = i2 , . . . , Sn = in ) = P(X1 = i1 , X2 = i2 − i1 , . . . , Xn = in − in−1 ).
Thus
P(Sn = in |S1 = i1 , . . . , Sn−1 = in−1 ) = P(S1 = i1 , . . . , Sn = in ) / P(S1 = i1 , . . . , Sn−1 = in−1 )
= P(X1 = i1 , X2 = i2 − i1 , . . . , Xn = in − in−1 ) / P(X1 = i1 , X2 = i2 − i1 , . . . , Xn−1 = in−1 − in−2 )
= P(Xn = in − in−1 )
while
P(Sn = in |Sn−1 = in−1 ) = P(Sn−1 = in−1 , Sn = in ) / P(Sn−1 = in−1 ).
The numerator can be rewritten as
P(Sn − Sn−1 = in − in−1 , Sn−1 = in−1 ) = P(Xn = in − in−1 , Sn−1 = in−1 )
= P(Xn = in − in−1 )P(Sn−1 = in−1 ),
since Xn is independent from X1 , . . . , Xn−1 and hence from Sn−1 . Dividing, we
achieve the desired equality. So Sn is a Markov chain.

• In the special case of Example 1.3, we can calculate the probabilities P(Sn = in |Sn−1 =
in−1 ) = P(Xn = in − in−1 ). Since Xn ∈ {±1} with equal probabilities, we must have
in − in−1 ∈ {±1} to have positive probability, and in this case each has probability 1/2. In
other words:
P(Sn = i ± 1 | Sn−1 = i) = 1/2,
and all other “transition probabilities” are 0.
In general, for a Markov chain (Xn )n∈N , the probabilities P(Xn = in |Xn−1 = in−1 ) are
called the transition probabilities. The last example shows that the transition probabili-
ties of some Markov chains are even more uniform than the definition insists.

Definition 1.6. A Markov chain is called time-homogeneous if, for any i, j ∈ S, the transition
probability P(Xn = j|Xn−1 = i) is independent of n. That is: there is a fixed function p : S × S →
[0, 1] with the property that, for all n,

P(Xn = j|Xn−1 = i) = p(i, j).

The symmetric random walk above is a time-homogeneous Markov chain, as is the sum
process Sn of any sequence of i.i.d. random variables. Note that

Σ_{j∈S} p(i, j) = Σ_{j∈S} P(Xn = j|Xn−1 = i) = Σ_{j∈S} P(Xn = j, Xn−1 = i) / P(Xn−1 = i) = 1

for each i ∈ S, by the law of total probability. If we let P be the |S| × |S| matrix whose (i, j)
entry is p(i, j), this says that the rows of P all sum to 1; P is a stochastic matrix.

Now let’s flesh out Example 1.4.

Example 1.7. Consider the following Markov chain. The state space is {0, . . . , N }. As
with the full random walk, we have transition probabilities p(i, j) = 1/2 if 0 < i, j < N and
j = i ± 1, and p(i, j) = 0 otherwise. We have yet to specify the transition probabilities
involving the states 0, N . As usual p(0, i) = p(i, 0) = p(N, j) = p(j, N ) = 0 if i ≠ 0, 1 and
j ≠ N − 1, N . But we impose different boundary conditions for different kinds of random
walks.
• Reflecting Random Walk. Here p(0, 1) = 1 and p(N, N − 1) = 1 (and so p(0, 0) =
0 = p(N, N )). That is: the random walker bounces off the walls with certainty when reaching them.
• Absorbing Random Walk. Here p(0, 0) = p(N, N ) = 1; so when the walker reaches
the boundary, it stays there forever.
• Partially Reflecting Walk. In general, we can set p(0, 0) = p, p(0, 1) = 1 − p,
p(N, N ) = q, and p(N, N − 1) = 1 − q for some arbitrary probabilities 0 ≤ p, q ≤ 1.
As we will see in the next weeks, the long-time behavior of the random walk is very
dependent on the parameters p, q in the partially reflecting walk.
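To make these boundary conditions concrete, here is a small illustrative Python sketch (not part of the original notes; the function name and the roles of p, q are my own labels matching the definitions above) that builds the transition matrix of the partially reflecting walk on {0, . . . , N}; p = q = 0 recovers the reflecting walk and p = q = 1 the absorbing walk.

    import numpy as np

    def partially_reflecting_walk(N, p, q):
        """Transition matrix of the partially reflecting random walk on {0, ..., N}."""
        P = np.zeros((N + 1, N + 1))
        for i in range(1, N):
            P[i, i - 1] = P[i, i + 1] = 0.5   # interior states move left/right with prob. 1/2
        P[0, 0], P[0, 1] = p, 1 - p           # boundary behaviour at 0
        P[N, N], P[N, N - 1] = q, 1 - q       # boundary behaviour at N
        return P

    P = partially_reflecting_walk(4, p=1.0, q=1.0)          # absorbing boundaries
    print(P)
    print("rows sum to 1:", np.allclose(P.sum(axis=1), 1))  # P is a stochastic matrix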

2. Lecture 2: April 4, 2012


2.1. Examples.
Example 2.1. Ehrenfest urn. There are N balls and 2 urns. At each time step, choose a
ball from one of the two urns randomly and move it to the opposite urn. Let Xn denote
the number of balls in the first urn after n time steps. Then (Xn )n∈N is a Markov chain
with state space {0, . . . , N } and transition probabilities p(i, i − 1) = i/N and p(i, i + 1) = (N − i)/N
(and all other p(i, j) = 0). [This is a statistical mechanical model of the exchange of gas
between two chambers connected by a small hole.]
Example 2.2. Wright-Fisher model. A population of fixed size N can have genes of two
types: A and a. The gene pool in generation n + 1 is obtained by sampling with replace-
ment from the population in generation n. Let Xn denote the number of A genes in the
population in generation n. Then Xn is a Markov chain with state space {0, 1, . . . , N } and
transition probabilities
p(i, j) = (N choose j) (i/N)^j ((N − i)/N)^{N−j}.
I.e. the model is that, if i A genes are present at time n, then in generation n + 1 each
individual has probability i/N of getting A and probability 1 − i/N of getting a; so the number
of A is given by the binomial distribution. [This is a model of genetic drift in diploid
populations with non-overlapping generations; for example annual plants.]
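For concreteness, here is a short illustrative sketch (not from the notes; the helper names are mine) that builds the Ehrenfest and Wright–Fisher transition matrices directly from the formulas above and checks that each row is a probability distribution.

    import numpy as np
    from math import comb

    def ehrenfest_matrix(N):
        """Ehrenfest urn on {0, ..., N}: p(i, i-1) = i/N, p(i, i+1) = (N - i)/N."""
        P = np.zeros((N + 1, N + 1))
        for i in range(N + 1):
            if i > 0:
                P[i, i - 1] = i / N
            if i < N:
                P[i, i + 1] = (N - i) / N
        return P

    def wright_fisher_matrix(N):
        """Wright-Fisher model: p(i, j) = C(N, j) (i/N)^j (1 - i/N)^(N - j)."""
        P = np.zeros((N + 1, N + 1))
        for i in range(N + 1):
            for j in range(N + 1):
                P[i, j] = comb(N, j) * (i / N) ** j * (1 - i / N) ** (N - j)
        return P

    for P in (ehrenfest_matrix(4), wright_fisher_matrix(4)):
        assert np.allclose(P.sum(axis=1), 1)   # every row is a probability distribution
    print(wright_fisher_matrix(4).round(3))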
2.2. n-Step Transition Probabilities. The array p(i, j) gives the one-step transition prob-
abilities of the Markov chain. But the typical question is: given an initial probability
distribution (for X0 ), what is the probability that Xn = i for some large n and any i ∈ S?
To answer that question, we will have to delve into linear algebra, because:
Lemma 2.3. Let (Xn )n∈N be a homogeneous-time Markov chain with state space S, with transi-
tion probabilities p : S × S → [0, 1], and associated transition matrix P. Then
P(Xn = j|X0 = i) = [Pn ]ij .
That is: the n-step transition probabilities are given by the entries of Pn .
Proof. Let pn (i, j) = P(Xn = j|X0 = i). By definition, p1 (i, j) = p(i, j) (by time-homogeneity).
We proceed by induction. By the law of total probability
P(Xn+1 = j|X0 = i) = Σ_{k∈S} P(Xn+1 = j, Xn = k|X0 = i)
= Σ_{k∈S} P(Xn+1 = j|Xn = k, X0 = i) P(Xn = k|X0 = i).

Since Xn is a Markov process, P(Xn+1 = j|Xn = k, X0 = i) = P(Xn+1 = j|Xn = k) = p(k, j). Hence, we have
pn+1 (i, j) = Σ_{k∈S} P(Xn = k|X0 = i) p(k, j) = Σ_{k∈S} pn (i, k) p(k, j).
By the inductive hypothesis, pn (i, k) = [Pn ]ik , and so we have
P(Xn+1 = j|X0 = i) = Σ_{k∈S} [Pn ]ik [P]kj = [Pn+1 ]ij .

This proves the claim. 

We can generalize this slightly, as follows.


Corollary 2.4. If (Xn )n∈N is a time-homogeneous Markov chain with transition matrix P with
entries [P]ij = p(i, j) = P(Xn = j|Xn−1 = i), then the n-step transition probabilities pn (i, j) =
P(Xn = j|X0 = i) satisfy the Chapman-Kolmogorov Equations:
pm+n (i, j) = Σ_{k∈S} pm (i, k) pn (k, j).

Proof. In light of Lemma 2.3, this is just a restatement of the fact that Pm+n = Pm Pn :
pm+n (i, j) = [Pm+n ]ij = [Pm Pn ]ij = Σ_{k∈S} [Pm ]ik [Pn ]kj

which yields the stated sum. 
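Lemma 2.3 and the Chapman–Kolmogorov equations are easy to check numerically; the following sketch (illustrative only, not part of the notes) uses the absorbing walk of Example 1.7 with N = 4.

    import numpy as np

    # Absorbing random walk on {0, 1, 2, 3, 4} (Example 1.7 with p(0,0) = p(4,4) = 1).
    P = np.zeros((5, 5))
    for i in range(1, 4):
        P[i, i - 1] = P[i, i + 1] = 0.5
    P[0, 0] = P[4, 4] = 1.0

    def n_step(P, n):
        """n-step transition matrix: its (i, j) entry is p_n(i, j) = P(X_n = j | X_0 = i)."""
        return np.linalg.matrix_power(P, n)

    m, n = 3, 5
    # Chapman-Kolmogorov: p_{m+n}(i, j) = sum_k p_m(i, k) p_n(k, j), i.e. P^{m+n} = P^m P^n.
    assert np.allclose(n_step(P, m + n), n_step(P, m) @ n_step(P, n))
    print(n_step(P, 4).round(4))   # 4-step transition probabilities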

2.3. The Markov Property. One way to state the definition of a Markov chain is that,
conditioned on the present, the future is independent from the past. This can be made
precise and slightly stronger, in the following proposition which may be taken as the
(recursive) definition of a homogeneous time Markov chain.
Proposition 2.5. Let (Xn )n∈N be a time-homogeneous Markov chain with discrete state space S
and transition probabilities p(i, j). Fix m ∈ N and i ∈ S, and suppose that P(Xm = i) > 0.
Then conditional on Xm = i, the process (Xm+n )n∈N is a time-homogeneous Markov chain with
the same transition probabilities p(i, j), and independent from X0 , X1 , . . . , Xm .
In other words: if A is an event determined only by X0 , X1 , . . . , Xm with P(A ∩ {Xm = i}) > 0,
then for all n ≥ 0
P(Xm+1 = im+1 , . . . , Xm+n = im+n |A∩{Xm = i}) = p(i, im+1 )p(im+1 , im+2 ) · · · p(im+n−1 , im+n ).
Proof. From the definition of conditional probability, what we need to prove is that
P({Xm+1 = im+1 , . . . , Xm+n = im+n } ∩ A)
= P({Xm = i} ∩ A)p(i, im+1 )p(im+1 , im+2 ) · · · p(im+n−1 , im+n ).

If A takes the form A = {X0 = i0 , X1 = i1 , . . . , Xm = im } (and so im = i, since we made the
assumption that P(A ∩ {Xm = i}) > 0), the left hand side is just
P(X0 = i0 , . . . , Xm+n = im+n ) = P(X0 = i0 )p(i0 , i1 ) · · · p(im−1 , i)p(i, im+1 ) · · · p(im+n−1 , im+n )
because (Xn )n≥0 is a Markov chain. But we also have
P({Xm = i} ∩ A) = P(X0 = i0 , X1 = i1 , . . . , Xm = i) = P(X0 = i0 )p(i0 , i1 ) · · · p(im−1 , i)
for the same reason. Comparing, this proves the desired equality for such special A. But
we assumed that A depends only on the X0 , . . . , Xm ; thus, A must be a disjoint union of
events of the assumed form, and the result follows by summing. 

2.4. Hitting Times. One important question about a stochastic process is: when is the
first time it enters a certain set? Let A ⊆ S be a set of states, and define the random time τA by
τA = min{n ∈ N : Xn ∈ A}.
(Here we make the convention that min ∅ = ∞.) For example, we may wish to know (the
distribution of) the first time the symmetric random walk gets distance N away from the
origin. Better yet: in the random walk with absorbing boundaries 0, N , we may want to
know the probability that the walk reaches N before it reaches 0, given an initial distribu-
tion. This can be formalized as a general problem of this form:
Let A, B ⊂ S with A ∩ B = ∅. For i ∈ S, let h(i) = P(τA < τB |X0 = i). Calculate the
function h.
When (Xn )n∈N is a Markov chain, this becomes a linear algebra problem. First note that it
is always true that
h(i) = 1 if i ∈ A, h(i) = 0 if i ∈ B.
Now, take i ∉ A ∪ B. Conditioning on the first step of the process, we have
P(τA < τB |X0 = i) = Σ_{j∈S} P(τA < τB |X0 = i, X1 = j) P(X1 = j|X0 = i).

By the Markov property, conditioning on X1 = j gives a new Markov process with the
same transition probabilities, hence the same joint distributions. The probability of the
event {τA < τB } = {τA − 1 < τB − 1} depends only on the joint distributions of the Xn s,
and so it follows that
P(τA < τB |X0 = i, X1 = j) = P(τA < τB |X1 = j) = P(τA − 1 < τB − 1|X0 = j) = h(j).
Thus, we have the equations
h(i) = Σ_{j∈S} p(i, j) h(j). (2.1)

If the state space is finite, then we can think of h as a vector h ∈ R|S| (with hi = h(i)),
and these equations say that h = Ph. That is, h is an eigenvector of P with eigenvalue 1.
This is not typically enough to nail down h (it could be scaled, or 1 may not be a simple
eigenvalue), but the boundary conditions h(i) = 1 for i ∈ A and h(i) = 0 for i ∈ B are
often enough to uniquely determine h. Let’s consider an example.
Example 2.6. Let (Xn )n∈N be random walk on Z, but not necessarily symmetric: we define
p(i, i + 1) = q and p(i, i − 1) = 1 − q for some q ∈ [0, 1] and all i ∈ Z. If q = 1/2 this is the
symmetric random walk of Example 1.3. Let’s consider the probability that the random
walk reaches N before 0; with the language above, A = {N } and B = {0}. Thus h(N ) = 1
and h(0) = 0. Equation 2.1 in this case says
h(i) = (1 − q)h(i − 1) + qh(i + 1), 0 < i < N.
It is convenient to rewrite this in the form
(1 − q)[h(i) − h(i − 1)] = q[h(i + 1) − h(i)].
If q = 0 this gives h(i)−h(i−1) = 0 for 0 < i < N and since h(0) = 0 it follows that h(i) = 0
for i < N . (Indeed, here the walk always moves left.) Otherwise, let c = h(1) − h(0), and

let θ = (1 − q)/q. Define ∆h(i) = h(i) − h(i − 1) for i ≥ 1. Thus, we have a geometric progression

∆h(i + 1) = θ∆h(i), 1≤i≤N −1

with initial condition ∆h(1) = c. The solution is

∆h(i + 1) = θ^i c, 1 ≤ i ≤ N − 1.

Therefore, the telescoping sum

h(i) = h(i) − h(i − 1) + h(i − 1) − h(i − 2) + · · · + h(1) − h(0) + h(0)

yields (since h(0) = 0)

h(i) = ∆h(1) + · · · + ∆h(i) = c + θc + · · · + θ^{i−1} c = c Σ_{j=0}^{i−1} θ^j

for 2 ≤ i ≤ N .
• Symmetric RW. Here q = 1/2 so θ = 1. This gives h(i) = ci for 2 ≤ i ≤ N . In
particular h(N ) = cN . But also h(N ) = 1, so c = 1/N, and we have h(i) = i/N.

In terms of gambling: suppose we bet on coin tosses. If heads comes up, you win
$1; if tails, you lose $1. You begin with $i, and will cash out either when you go bust
or win $N. If the coin is fair, the probability of cashing out rather than going bust
is i/N – it increases linearly the closer your initial capital is to the cash-out.

• Biased RW. For q ≠ 1/2, we have

h(i) = c Σ_{j=0}^{i−1} θ^j = c (1 − θ^i)/(1 − θ), 2 ≤ i ≤ N.

In particular 1 = h(N ) = c (1 − θ^N)/(1 − θ), and so c = (1 − θ)/(1 − θ^N). Combining, this yields

h(i) = (1 − θ^i)/(1 − θ^N), 1 ≤ i ≤ N.

This highlights the gambler’s ruin phenomenon. In the gambling game above, we
suppose the coin is biased against you (perhaps only slightly). Suppose, for ex-
ample, that q = 0.49. If you start with $100 and set the cash-out at $200, then
h(100) = 0.018, as opposed to 0.5 in the q = 0.5 case.
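The boundary-value problem (2.1) can also be solved numerically and compared with the closed form h(i) = (1 − θ^i)/(1 − θ^N); a sketch (the code and the function name are illustrative, not part of the notes):

    import numpy as np

    def hit_N_before_0(N, q):
        """Solve h(i) = (1-q) h(i-1) + q h(i+1) for 0 < i < N, with h(0) = 0, h(N) = 1."""
        A = np.eye(N + 1)
        b = np.zeros(N + 1)
        b[N] = 1.0                       # boundary conditions h(0) = 0, h(N) = 1
        for i in range(1, N):
            A[i, i - 1] -= 1 - q         # row i encodes h(i) - (1-q) h(i-1) - q h(i+1) = 0
            A[i, i + 1] -= q
        return np.linalg.solve(A, b)

    N, q = 200, 0.49
    h = hit_N_before_0(N, q)
    theta = (1 - q) / q
    closed_form = (1 - theta ** np.arange(N + 1)) / (1 - theta ** N)
    print(abs(h - closed_form).max())    # agrees with the closed form up to rounding error
    print(round(h[100], 3))              # about 0.018: the gambler's ruin probability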

3. Lecture 3: April 6, 2012


Last time we saw how to calculate Pi (τA < τB ), the probability that a Markov chain
hits a set A before a set B (starting in a given state i). Today, we will consider the closely
related question: what is Ei (τA ), the expected hitting time of a set A?

3.1. Expected Hitting Times. Let (Xn )n∈N be a Markov chain with state space S and tran-
sition probabilities p(i, j). Let A ⊆ S, and as usual set τA = min{n ≥ 0 : Xn ∈ A}. Let us
define the function g : S → [0, ∞] by
g(i) = Ei (τA ) = E[τA |X0 = i].
Note: the conditional expectation here is the expectation with respect to the conditional
probability Pi ,
Ei (Y ) = Σ_{n=1}^∞ n Pi (Y = n) = Σ_{n=1}^∞ n P(Y = n, X0 = i) / P(X0 = i).

Of course, g(i) = 0 if i ∈ A. For i ∉ A, we condition on the first step of the trajectory
and use the Markov property:
g(i) = Ei (τA ) = Σ_{j∈S} Ei [τA |X1 = j] Pi (X1 = j).
Now,
Pi (τA = n|X1 = j) = P(X0 ∉ A, X1 ∉ A, . . . , Xn−1 ∉ A, Xn ∈ A|X0 = i, X1 = j)
= P(X0 ∉ A, . . . , Xn−2 ∉ A, Xn−1 ∈ A|X0 = j) = Pj (τA = n − 1)
by the Markov property. Hence

Ei [τA |X1 = j] = Σ_{n=1}^∞ n Pi (τA = n|X1 = j) = Σ_{n=1}^∞ n Pj (τA = n − 1)
= Σ_{m=0}^∞ (m + 1) Pj (τA = m)
= Σ_{m=0}^∞ m Pj (τA = m) + Σ_{m=0}^∞ Pj (τA = m).

The first term is just Ej (τA ), and since τA is N ∪ {∞}-valued, the latter sum is 1. Hence we have
g(i) = Σ_{j∈S} [Ej (τA ) + 1] Pi (X1 = j) = Σ_{j∈S} [g(j) + 1] p(i, j) = 1 + Σ_{j∈S} p(i, j) g(j). (3.1)

This system of equations may not have a unique solution, since g(i) = ∞ for i ∉ A is
often be used to calculate the function g.
Remark 3.1. Let 1 denote the column vector with all entries equal to 1. If we interpret the
function g as a vector g ∈ R|S| , then Equation 3.1 can be written as
g = 1 + Pg

where P is the transition matrix for the Markov chain. This is an affine equation, rather
than a linear equation; it is less amenable to the techniques of linear algebra. But in many
examples, it can be solved iteratively.
Example 3.2. On average, how many times do we need to toss a coin to get two consecutive
heads? To answer this question, we need to construct an appropriate Markov chain. There
are different ways to approach this; probably the simplest is to let Xn denote the number
of consecutive heads one has seen immediately after the nth toss; we stop
the chain once we see two consecutive heads. This means Xn is a Markov chain with state
space {0, 1, 2} and transition matrix
P = [ 1/2  1/2   0  ]
    [ 1/2   0   1/2 ]
    [  0    0    1  ] .
As above, set g(i) = Ei (τ{2} ). Then g(2) = 0, and Equation 3.1 says
g(0) = 1 + p(0, 0)g(0) + p(0, 1)g(1) + p(0, 2)g(2) = 1 + (1/2)g(0) + (1/2)g(1)
g(1) = 1 + p(1, 0)g(0) + p(1, 1)g(1) + p(1, 2)g(2) = 1 + (1/2)g(0) + (1/2)g(2) = 1 + (1/2)g(0)
(there is no equation for the absorbing state 2, where g(2) = 0 by definition). The first equation
simplifies to (1/2)g(0) − (1/2)g(1) = 1, or g(0) − g(1) = 2, and the second equation says g(0) =
2g(1) − 2. Thus (2g(1) − 2) − g(1) = 2 and so g(1) = 4; therefore g(0) = g(1) + 2 = 6.
So, starting having seen no heads (of course), on average it takes 6 tosses until 2
consecutive heads come up.
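Equivalently, one can solve the affine system of Remark 3.1 restricted to the non-absorbed states; a minimal sketch (illustrative code, not from the notes):

    import numpy as np

    # Chain from Example 3.2: the state is the current run of consecutive heads, absorbed at 2.
    P = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.0, 0.5],
                  [0.0, 0.0, 1.0]])

    # On the transient states {0, 1}, Equation (3.1) reads g = 1 + Q g with Q = P[:2, :2]
    # (the g(2) = 0 term drops out), i.e. (I - Q) g = 1.
    Q = P[:2, :2]
    g = np.linalg.solve(np.eye(2) - Q, np.ones(2))
    print(g)   # expect [6., 4.]: 6 tosses on average starting from no heads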

In making the above calculations, the key tool was the Markov property: conditioned
on X1 = j, the chain behaves after time 1 just like another Markov chain started from
j. To make other calculations, we will need something even stronger: we will want to
apply this reasoning not only to shifts by deterministic times, but also by (appropriate)
random times. I.e. we will see that, if T is an appropriate N-valued random variable, then
the shifted process XT +n is still a Markov chain with the same transition probabilities.
3.2. Stopping Times. The appropriate kind of random time alluded to in the above para-
graph is a stopping time.
Definition 3.3. Let (Xn )n∈N be a discrete-time stochastic process. A stopping time is a random
variable T with state space N ∪ {∞} such that, for each n, the event {T = n} depends only on
X0 , X1 , . . . , Xn .
It is typical to think of stopping times in terms of betting strategies. If Xn represents
some statistic associated to a gambling game, one has to decide when to bet (or fold)
based only on information up to the present time; but that information depends on the
random occurrences in the game up to that time. For example: one could decide to sell
a stock once it reaches $200; this is a stopping time. One would prefer to sell it the day
before it drops; this is not a stopping time.
Example 3.4. Fix i ∈ S. The return time Ti is the first time that Xn returns to state i after
time n = 0 (where we condition on X0 = i). That is:
Ti = min{n ≥ 1 : Xn = i}.

For any given n, the event {Ti = n} can be written as {X1 ≠ i, X2 ≠ i, . . . , Xn−1 ≠ i, Xn = i}.
Thus, the event {Ti = n} depends only on the values of X0 , . . . , Xn , and hence Ti is a
stopping time. Note: the hitting time τ{i} = min{n ≥ 0 : Xn = i} is generally different
from Ti (they differ if X0 = i which we generally condition on).
Example 3.5. If T = min{n ≥ 0 : Xn+1 = i}, then {T = n} = {X0 ≠ i, X1 ≠ i, . . . , Xn ≠
i, Xn+1 = i} depends on the future value of Xn+1 , and so T is not a stopping time.

3.3. The Strong Markov Property. Restarting a Markov chain at a stopping time pro-
duces a new Markov chain with the same transition probabilities.
Proposition 3.6. Let (Xn )n∈N be a time-homogeneous Markov chain with state space S and tran-
sition probabilities p(i, j). Let T be a stopping time. Suppose i ∈ S and P(XT = i) > 0. Then,
conditional on XT = i, (XT +n )n∈N is a time-homogeneous Markov chain with transition proba-
bilities p(i, j), independent of X0 , X1 , . . . , XT .
In other words: if A is an event that depends only on X0 , X1 , . . . , XT and P(A ∩ {XT = i}) > 0,
then for all n ≥ 0 and all i1 , . . . , in ∈ S,
P(XT +1 = i1 , XT +2 = i2 , . . . , XT +n = in |A ∩ {XT = i}) = p(i, i1 )p(i1 , i2 ) · · · p(in−1 , in ).
Proof. What we need to show is that
P(A ∩ {XT = i, XT +1 = i1 , . . . , XT +n = in }) = P(A ∩ {XT = i})p(i, i1 ) · · · p(in−1 , in ).
Because A depends only on X0 , . . . , XT , the event A∩{T = m} depends only on X0 , . . . , Xm .
We therefore decompose according to the value m of T and apply the Markov property:
P(A ∩ {XT = i, XT +1 = i1 , . . . , XT +n = in })
= Σ_{m=0}^∞ P({T = m} ∩ A ∩ {Xm = i, Xm+1 = i1 , . . . , Xm+n = in })
= Σ_{m=0}^∞ P({T = m} ∩ A ∩ {Xm = i}) p(i, i1 ) · · · p(in−1 , in )
= P(A ∩ {XT = i}) p(i, i1 ) · · · p(in−1 , in ).


4. Lecture 4: April 9, 2012


We now begin our discussion of the limiting (long-term) behavior of Markov chains. A
quick reminder: we are utilizing the brief notation:
Pi (Xn = j) = P(Xn = j|X0 = i).

4.1. Transience and Recurrence. Let (Xn )n∈N be a Markov chain with state space S. There
are only two eventual behaviors a state can have under the action Xn .
Definition 4.1. A state i ∈ S is called recurrent if Pi (Xn = i for infinitely many n) = 1. A
state i ∈ S is called transient if Pi (Xn = i for infinitely many n) = 0.
We will use the strong Markov property to show that every state is either recurrent or
transient. The idea is to use the following collection of stopping times. Fixing i ∈ S, and
conditioning on the event X0 = i, we set
Ti,1 = 0, Ti,k = min{n > Ti,k−1 : Xn = i} for k ≥ 2.
So Ti,k is the time of the k th visit to i. Note that, for each m ∈ N,
{Ti,k = m} = {XTi,k−1 +1 ≠ i, XTi,k−1 +2 ≠ i, . . . , Xm−1 ≠ i, Xm = i}
depends only on X0 , . . . , Xm (inductively verified, since Ti,k−1 is involved). These gener-
alize the first return time Ti = Ti,2 = min{n ≥ 1 : Xn = i}, which you saw was a stopping
time in the previous lecture. Now, define
ri = Pi (Ti < ∞).
Assume, for the moment, that, for fixed k ≥ 1, Ti,k < ∞. So-conditioned, the strong
Markov property yields that the process XTi,k +n is a Markov chain with the same transi-
tion probabilities as (Xn ). It follows that
P(Ti,k+1 < ∞|Ti,k < ∞) = Pi (Ti < ∞) = ri .
By induction, then, we have
P(Ti,k < ∞) = ri^{k−1}.
This leads to the desired dichotomy.
Theorem 4.2. Let i ∈ S. Then ri = 1 iff Σ_{n=0}^∞ pn (i, i) = ∞ iff i is recurrent; and ri < 1 iff
Σ_{n=0}^∞ pn (i, i) < ∞ iff i is transient.

Remark 4.3. The statement of Theorem 4.2 shows that one can verify recurrence/transience
for the state i either by calculating (or estimating) the probability ri , or by calculating (or
estimating) the sum Σ_n pn (i, i); both are useful in actual examples.
Remark 4.4. The intuition is pretty clear here. If ri = 1 then we’re guaranteed to return
to i, and then (since the process is Markov) we start over and are guaranteed to return
again, eventually visiting i infinitely often. If, on the other hand, ri < 1, there is a chance
we never return the first time; in the (uncertain) case we do, there is the same chance
we won’t visit a second time, etc.; these probabilities compound and eventually we stop
returning to i. The proof below makes this intuition rigorous.

In order to prove the theorem, we need to recall a useful fact from undergraduate prob-
ability. Let T be an N-valued random variable. Then one way to calculate its expectation
is
E(T ) = Σ_{j=1}^∞ j P(T = j) = Σ_{j=1}^∞ Σ_{k=1}^j P(T = j) = Σ_{k=1}^∞ Σ_{j=k}^∞ P(T = j) = Σ_{k=1}^∞ P(T ≥ k).

(This is sometimes called the “layer cake representation” of T .)


Proof. Define Ni to be the number of visits to i:

Ni = Σ_{n=0}^∞ 1{Xn = i}.

Then {Ni ≥ k} = {Ti,k < ∞}, and so Pi (Ni ≥ k) = P(Ti,k < ∞) = ri^{k−1}. We also have
Ei [Ni ] = Σ_{n=0}^∞ Ei [1{Xn = i}] = Σ_{n=0}^∞ Pi (Xn = i) = Σ_{n=0}^∞ pn (i, i).

Now consider the two cases for ri .


• If ri = 1 then Pi (Ni ≥ k) = 1^{k−1} = 1 for all k, so Pi (Ni = ∞) = 1, which is precisely
the statement that i is recurrent. Moreover, we have
Σ_{n=0}^∞ pn (i, i) = Ei [Ni ] = ∞

as claimed.
• If ri < 1 then Pi (Ni ≥ k) = ri^{k−1} is exponentially small and so Pi (Ni = ∞) = 0,
which is precisely the statement that i is transient. Moreover, we have
Σ_{n=0}^∞ pn (i, i) = Ei [Ni ] = Σ_{k=1}^∞ Pi (Ni ≥ k) = Σ_{k=1}^∞ ri^{k−1} = 1/(1 − ri ) < ∞.

Example 4.5. Consider simple random walk on Z, with p(i, i + 1) = p and p(i, i − 1) =
1 − p. For the n-step transition probabilities, we have pn (i, i) = 0 unless n is even. So we
calculate p2n (i, i). Consider the trajectory of the path; in order for X2n = i given X0 = i, it
must be that n of the steps are left and n are right. There are (2n choose n) such paths, and because
of the independence of the steps, each has probability p^n (1 − p)^n of having occurred. Thus,
we have
p2n (i, i) = (2n choose n) p^n (1 − p)^n.
So the recurrence/transience is decided by the summability of
Σ_{n=0}^∞ pn (i, i) = Σ_{n=0}^∞ (2n choose n) p^n (1 − p)^n.
Of course this is finite (in fact 0) if p ∈ {0, 1}, where the walk moves deterministically
either right or left. In general, we can make the blunt upper bound (2n choose n) ≤ 4^n: this follows
from the overcount of paths, since (2n choose n) counts the number of length-2n paths that return
to their starting point, while 4^n = 2^{2n} counts the number of all length-2n paths. Then we
have
Σ_{n=0}^∞ pn (i, i) ≤ Σ_{n=0}^∞ 4^n p^n (1 − p)^n = Σ_{n=0}^∞ (4p(1 − p))^n.
The function p ↦ p(1 − p) takes maximum value 1/4 at p = 1/2, and is strictly less than 1/4
elsewhere in [0, 1]; hence the series is summable when p ≠ 1/2, and all such non-symmetric
random walks on Z are transient.
When p = 1/2 this gives ∞ as an upper bound for Σ_{n=0}^∞ pn (i, i), which is not useful; we
need a lower bound. Here we use Stirling's approximation:

n! ∼ √(2πn) (n/e)^n.
From this it follows that
(2n choose n) = (2n)! / (n!)² ∼ √(4πn)(2n/e)^{2n} / (2πn(n/e)^{2n}) = 2^{2n} / √(πn).
Thus, in the case p = 1/2, we then have
Σ_{n≥1} pn (i, i) = Σ_{n≥1} p2n (i, i) ∼ Σ_{n≥1} (2^{2n}/√(πn)) (1/2)^n (1/2)^n = Σ_{n≥1} 1/√(πn) = ∞.

Hence, the symmetric random walk on Z is recurrent.
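This dichotomy is visible numerically: the partial sums of Σ_n p2n(i, i) stabilize when p ≠ 1/2 and keep growing when p = 1/2. A small sketch (illustrative only; the recursion just updates the term (2n choose n)(p(1−p))^n step by step):

    def partial_sums(p, checkpoints):
        """Partial sums of sum_{n>=0} C(2n, n) (p(1-p))^n for the simple RW on Z."""
        sums, total, term = {}, 0.0, 1.0   # term = C(2n, n) (p(1-p))^n, starting at n = 0
        for n in range(max(checkpoints) + 1):
            total += term
            if n in checkpoints:
                sums[n] = round(total, 3)
            term *= (2 * n + 2) * (2 * n + 1) / ((n + 1) ** 2) * p * (1 - p)
        return sums

    for p in (0.5, 0.45):
        print(p, partial_sums(p, {10, 100, 1000, 10000}))
    # p = 0.45: the sums settle near 1/sqrt(1 - 4 p (1-p)) = 10, so the series converges (transient).
    # p = 0.5 : the sums keep growing, roughly like 2*sqrt(N/pi), so the series diverges (recurrent).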


4.2. Irreducible Markov chains. Transience and recurrence are properties of a state of a
Markov chain. Example 4.5 shows that all states of any simple RW on Z have the same
behavior: either all recurrent (when p = 1/2) or all transient (when p ≠ 1/2). But it is perfectly
possible for different states to have different asymptotic properties in a Markov chain.
Example 4.6. Consider the stupid Markov chain with state space {1, 2} where p(1, 2) = 1
and p(2, 2) = 1. Then 1 is transient and 2 is recurrent.
The issue that makes Example 4.6 stupid is that 1 is transient for the silly reason that
you can never get back to it once you leave. If we avoid this kind of thing, then we will
always have universal asymptotic behavior for the states.
Definition 4.7. A Markov chain is irreducible if, for any two states i, j there exists n ∈ N so
that pn (i, j) > 0.
For example, in a simple random walk on Z, as soon as n ≥ |i − j| and n has the same
parity as i − j, pn (i, j) > 0. Hence, Example 4.5 highlights the following proposition.
Proposition 4.8. If (Xn )n∈N is an irreducible Markov chain, then either all states are recurrent
or all states are transient.
Proof. It suffices to show that if even one state is transient then all others must be transient.
Suppose i ∈ S is a transient state, and let j ∈ S. By assumption, there exists m ∈ N with
pm (i, j) > 0, and also there exists n ∈ N with pn (j, i) > 0. Then we have, for any ℓ ≥ 0,
pℓ+m+n (i, i) ≥ pm (i, j) pℓ (j, j) pn (j, i).

(I.e. one way to get from i back to i in ℓ + m + n steps is to move from i to j in m steps, then
return to j in ℓ steps, and then move from j to i in n steps.) Thus, we have the estimate
Σ_{ℓ=0}^∞ pℓ (j, j) ≤ Σ_{ℓ=0}^∞ pℓ+m+n (i, i) / (pm (i, j) pn (j, i)) = (1 / (pm (i, j) pn (j, i))) Σ_{k=m+n}^∞ pk (i, i) < ∞
by the assumption that i is transient. Hence j is transient as well. 

5. Lecture 5: April 11, 2012


5.1. Random Walks on Graphs. One of the most important classes of examples of Markov
chains consists of simple random walks on graphs.
Definition 5.1. A graph G = (V, E) is a collection of vertices V and a relation E on V × V
(the set of edges). If x, y ∈ V we write x ∼ y to mean (x, y) ∈ E. The relation E is assumed to
be anti-reflexive (i.e. no loops, x ≁ x for any x) and symmetric (i.e. undirected, x ∼ y means the
same as y ∼ x).
The valence of a vertex x ∈ V in a graph is vx = #{y ∈ V : x ∼ y}. We will generally work with
graphs for which every vertex has finite, non-zero valence. (If vx = 0 then x is an isolated point
disconnected from the rest of the graph, and we may as well ignore it.)
Definition 5.2. The simple random walk on a graph G = (V, E) is the Markov chain Xn with
state space V , and with transition probabilities given by p(x, y) = 1/vx if x ∼ y and p(x, y) = 0 if
x ≁ y.
Example 5.3. The integers Z form a graph: the vertices are the integers Z themselves, and
the edges are all adjacent pairs (i, i + 1) for i ∈ Z. The simple random walk on this graph
is precisely the symmetric random walk we studied in Example 4.5.
Example 5.4. More generally, we consider the graphs Zd for d ≥ 1. Here the vertices
are all d-tuples (i1 , . . . , id ) ∈ Zd , and we declare (i1 , . . . , id ) ∼ (j1 , . . . , jd ) if there is an
m ∈ {1, . . . , d} with |im − jm | = 1 and ik = jk for all k ≠ m. I.e. two points are adjacent if
they are exactly one step apart along one coordinate axis.
Thus, the simple random walk on Zd has transition probabilities
p((i1 , . . . , id ), (i1 , . . . , im ± 1, . . . , id )) = 1/(2d), m ∈ {1, . . . , d},
and all other transition probabilities 0.
Let i and j be any two vertices in Zd . There is, of course, a path between them along
coordinate-axis single steps. If n is the minimal length of such a path, then pn (i, j) > 0
while pk (i, j) = 0 for k < n. In particular, it follows that the SRW on Zd is irreducible, and so all states
have the same asymptotic behavior. We will therefore restrict the discussion to the state
0 = (0, . . . , 0), the origin, as a representative state for all others.
5.2. SRW in Zd . We have already seen (Example 4.5) that the SRW on Z is recurrent. Now
consider Z2 .
Example 5.5. SRW on Z². As usual, pn (0, 0) = 0 if n is odd. Now, there are trajectories of
length 2n that return to the origin; since each step has transition probability 1/(2d) = 1/4, any
particular length-2n trajectory will have probability 4^{−2n} of occurring. We have to count
the number of trajectories.
Let m be the number of (1, 0) steps; to return to the origin, there must be m (−1, 0)
steps. Since there are 2n steps total, this means 2m ≤ 2n. The same analysis for (0, ±1)
steps shows there must be k of each with 0 ≤ k ≤ n. But there are exactly 2n steps so
2k + 2m = 2n so k = n − m. Hence, the total number of length 2n trajectories from 0 to 0
can be enumerated as
Σ_{m=0}^n #{paths with m (±1, 0) steps and n − m (0, ±1) steps}.

This summand is a multinomial coefficient: from 2n, choose m (1, 0) steps, m (−1, 0) steps,
n − m (0, 1) steps, and n − m (0, −1) steps. This is given by
(2n choose m, m, n−m, n−m) = (2n)! / ((m!)² ((n − m)!)²)

and so we have the exact calculation
p2n (0, 0) = 4^{−2n} Σ_{m=0}^n (2n)! / ((m!)² ((n − m)!)²).

This can be simplified as
4^{−2n} Σ_{m=0}^n (2n)! / ((m!)² ((n − m)!)²) = 4^{−2n} ((2n)!/(n!)²) Σ_{m=0}^n (n! / (m!(n − m)!))².
The squared term in the sum is just (n choose m), and since (n choose m) = (n choose n−m), we can write this as
4^{−2n} (2n choose n) Σ_{m=0}^n (n choose m)(n choose n − m).

A convenient binomial relationship is that
Σ_{m=0}^n (n choose m)(n choose n − m) = (2n choose n).

(Think of it this way: to choose n objects from 2n, one could start by artificially dividing
the 2n into two groups of n, then selecting m from the first group and n − m from the
second; summing over all allotments of 0 ≤ m ≤ n from the first and n − m from the
second, we achieve all ways of selecting n from the original 2n.)
Hence, we have shown that
p2n (0, 0) = 4^{−2n} (2n choose n)² = (2^{−2n} (2n choose n))².
Looking back at Example 4.5, we see that this is exactly the square of p2n (0, 0) for the SRW
on Z. In particular, we already calculated that
2^{−2n} (2n choose n) ∼ 1/√(πn).
Hence, for SRW on Z²,
Σ_{n=0}^∞ pn (0, 0) = Σ_{n=0}^∞ p2n (0, 0) ∼ Σ_{n≥1} 1/(πn) = ∞.

Thus, SRW on Z2 is recurrent.
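For small n the identity p2n(0, 0) = (2^{−2n} (2n choose n))² can also be verified by brute force, enumerating all 4^{2n} nearest-neighbour paths of length 2n (a sketch, feasible only for tiny n; not part of the notes):

    from itertools import product
    from math import comb

    STEPS = [(1, 0), (-1, 0), (0, 1), (0, -1)]

    def p2n_bruteforce(n):
        """P(X_{2n} = 0 | X_0 = 0) for SRW on Z^2, by enumerating all 4^(2n) paths."""
        count = 0
        for path in product(STEPS, repeat=2 * n):
            if sum(s[0] for s in path) == 0 and sum(s[1] for s in path) == 0:
                count += 1
        return count / 4 ** (2 * n)

    for n in range(1, 5):
        formula = (comb(2 * n, n) / 4 ** n) ** 2
        print(n, p2n_bruteforce(n), formula)   # the two columns agree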


It turns out the situation is different for SRW in Zd for d ≥ 3.

Example 5.6. SRW on Z3 . Arguing exactly as we did above to enumerate the trajectories
of length 2n, now noting that the adjacent-step transition probabilities are 1/6, we have
p2n (0, 0) = 6^{−2n} Σ_{i,j,k≥0, i+j+k=n} (2n)! / (i!j!k!)² = 2^{−2n} (2n choose n) 3^{−2n} Σ_{i,j,k≥0, i+j+k=n} (n!/(i!j!k!))².

You might hope this works out to exactly p2n (0, 0)³ from SRW on Z, but it does not. How-
ever, it turns out to be asymptotically a constant times this, as we will now see. Here are
some facts that help in the calculation:
• The number of ways to put n balls into 3 boxes is, of course, 3^n; but by considering
all ways of subdividing them individually, we have
3^n = Σ_{i,j,k≥0, i+j+k=n} (n choose i, j, k) = Σ_{i,j,k≥0, i+j+k=n} n!/(i!j!k!).

• If n is divisible by 3, say n = 3m, and i + j + k = n = 3m, then i!j!k! ≥ (m!)³ and
so n!/(i!j!k!) ≤ n!/(m!)³. (I.e. the “middle” multinomial coefficient is largest; this can be
checked easily by induction.)
• If ai ≥ 0 are non-negative with sum Σ_i ai = 1, then
Σ_i ai² ≤ (Σ_i ai) · max_j aj = max_j aj.

Now, define a^n_i = a^n_{i,j,k} = 3^{−n} n!/(i!j!k!). The first fact above asserts that Σ_i a^n_i = 1, where the
sum is over those i = (i, j, k) in N³ with i + j + k = n. We calculated above that
p2n (0, 0) = 2^{−2n} (2n choose n) Σ_i (a^n_i)².

So, the third fact above shows us that
p2n (0, 0) ≤ 2^{−2n} (2n choose n) max_j a^n_j.

For the moment, take n = 3m. Then the second fact above gives
max_j a^n_j = max_{i,j,k≥0, i+j+k=n} 3^{−n} n!/(i!j!k!) ≤ 3^{−n} n!/(m!)³.

Hence, we have
p6m (0, 0) ≤ 2^{−2n} (2n choose n) · 3^{−n} n!/(m!)³ ∼ (1/√(πn)) · 3^{−n} √(2πn)(n/e)^n / ((2πm)^{3/2}(m/e)^{3m}) = 1/(2π^{3/2} m^{3/2})
using Stirling's formula again.
This only gives us an upper bound on p6m ; to get p2n for all n, we also need p6m−2 and
p6m−4 . But note: if there is a length 6m − 2 path from 0 to 0, then by traveling one step
forward and back along any of the 3 axes, we produce a path of length 6m; since each step
has transition probability 1/6, we then have the lower bound p6m (0, 0) ≥ (1/6)² p6m−2 (0, 0).
Similarly, we have p6m (0, 0) ≥ (1/6)⁴ p6m−4 (0, 0). This gives the estimate

Σ_{n=1}^∞ p2n (0, 0) = Σ_{m=1}^∞ [p6m−4 (0, 0) + p6m−2 (0, 0) + p6m (0, 0)] ≤ (1 + 6² + 6⁴) Σ_{m=1}^∞ p6m (0, 0)
and thus
Σ_{n=1}^∞ p2n (0, 0) ≤ ((1 + 6² + 6⁴)/(2π^{3/2})) Σ_{m=1}^∞ 1/m^{3/2} < ∞.
Ergo, SRW on Z3 is transient.
In particular, by Theorem 4.2, this shows that r0^(3) = P0(T0^(3) < ∞), the probability of return
to 0 in Z³, is < 1. Now, in Z^d, in order for the process to return to 0, all components must
return to 0; hence we have {T0^(d−1) < ∞} ⊇ {T0^(d) < ∞} for all d, and hence the probability
r0^(d) is non-increasing with dimension. It therefore follows that SRW on Z^d is transient for
all d ≥ 3.
Remark 5.7. It can be calculated, using generating-function techniques we do not discuss
here, that the return probability in Z³ is r0^(3) ≈ 0.34.

6. Lecture 6: April 13, 2012


Let (Xn )n≥0 be a Markov chain with state space S. For the moment, we will not con-
dition on a particular initial state. Instead, suppose the initial probability distribution is
π0 ; i.e. P(X0 = i) = π0 (i) for i ∈ S. The Markov property then allows us to compute the
distribution of Xn for any n:
P(Xn = j) = Σ_{i∈S} P(Xn = j, X0 = i) = Σ_{i∈S} P(X0 = i) P(Xn = j|X0 = i) = Σ_{i∈S} π0 (i) pn (i, j).

As we established in Lecture 2 (cf. the Chapman-Kolmogorov Equations), pn (i, j) is the


(i, j)-entry of the matrix Pn . So, for n ≥ 0, the distribution πn (j) = P(Xn = j) can
be computed as
πn (j) = Σ_{i∈S} π0 (i) pn (i, j) = [π0 Pn ]j .

I.e. considering π0 and πn as row-vectors, we have πn = π0 Pn .


So, the question of what happens for large n amounts to understanding the behavior of
high powers of stochastic matrices.

6.1. Stationarity. Suppose that π is a probability (row) vector with the property that πP = π;
i.e. π is a left eigenvector of P with eigenvalue 1. Then, of course, π = πPn for all n;
in other words, if π is ever the distribution of the process at some time, it will remain
stationary in that distribution for all time. So we call such a π a stationary distribution. We
can be a little forgiving with the requirement that it be a distribution.
Definition 6.1. Let (Xn )n≥0 be a Markov chain with state space S and transition matrix P. A vector
π = [π(i) : i ∈ S] is called a stationary measure if π(i) ≥ 0 for all i ∈ S and πP = π. If, in
addition, Σ_{i∈S} π(i) = 1, we call π a stationary distribution.
Note: at least in the case of a finite-state Markov chain, if there is a stationary measure π
with at least one non-zero entry, then we can normalize the vector to produce a stationary
distribution π̂: just take π̂ = (Σ_{i∈S} π(i))^{−1} π.
Example 6.2. Suppose a 2-state Markov chain has transition matrix
P = [ 1/2   1/2 ]
    [ 1/4   3/4 ] .
If π = [x, y] is a stationary measure, then we have the equations
[x, y] P = [ (1/2)x + (1/4)y , (1/2)x + (3/4)y ] = [x, y].
These two equations actually amount to the same thing: (1/2)x = (1/4)y, so y = 2x. Hence
there is a one-parameter family of stationary measures; with the additional constraint
that x + y = 1, we see the unique stationary distribution is
π = [1/3, 2/3].
(If there exists a stationary measure, it is never unique: one can always scale an eigenvec-
tor. The question is whether 1 is an eigenvalue.)

Now, let us look at the powers of the matrix P. Accurate to 3 digits,
P² = [ 0.375  0.625 ]      P⁴ = [ 0.336  0.664 ]      P⁷ = [ 0.333  0.667 ]
     [ 0.313  0.688 ] ,         [ 0.332  0.668 ] ,         [ 0.333  0.667 ] .
It appears (and is easily verified) that
lim_{n→∞} P^n = [ 1/3  2/3 ]
                [ 1/3  2/3 ] .
I.e. all the rows of the limiting transition matrix are equal to the stationary distribu-
tion π. In particular, if ν = [x, y] is any distribution (so x + y = 1), it then follows that
limn→∞ νPn = π; i.e. no matter what is the starting distribution, the limit distribution of
the chain is the stationary distribution.
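These observations are easy to reproduce numerically; a sketch (not part of the notes) that extracts the stationary distribution as a normalized left eigenvector and watches P^n converge:

    import numpy as np

    P = np.array([[0.5, 0.5],
                  [0.25, 0.75]])

    # Stationary distribution: left eigenvector of P for eigenvalue 1, normalized to sum 1.
    evals, evecs = np.linalg.eig(P.T)           # left eigenvectors of P = right eigenvectors of P^T
    pi = np.real(evecs[:, np.argmin(abs(evals - 1))])
    pi /= pi.sum()
    print(pi)                                    # [1/3, 2/3]

    print(np.linalg.matrix_power(P, 7).round(3)) # every row is already close to pi

    nu = np.array([0.9, 0.1])                    # an arbitrary initial distribution
    print(nu @ np.linalg.matrix_power(P, 50))    # converges to pi regardless of nu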
Example 6.2 leaves us with the following general questions.
(Q1) When does a Markov chain possess a stationary distribution?
(Q2) If it exists, when is the stationary distribution unique?
(Q3) When does the distribution of Xn converge to the stationary distribution?
First let’s consider some examples to see that all three of these can go wrong.
Example 6.3. (1) Let S = Z and set p(i, i + 1) = 1 for i ∈ Z. This chain is deterministic:
it always moves right. So all states are transient. That is: regardless of the starting
distribution, limn→∞ P(Xn = i) = 0 for any i ∈ Z. This means there is no station-
ary distribution. (Indeed: if there were a stationary distribution, then starting in
it would preserve the distribution for all time and so some state, with stationary
distribution having positive probability, would have fixed positive probability of
being visited each turn – it would therefore be recurrent.)
(2) Let S = {1, 2, 3, 4} and suppose P has the block form
P = [ 1/2  1/2   0    0  ]
    [ 1/2  1/2   0    0  ]
    [  0    0   1/2  1/2 ]
    [  0    0   1/2  1/2 ] .
Note, then, that all powers of the matrix have this same block form; in particular
pn (i, j) = 0 for all n if i ∈ {1, 2} and j ∈ {3, 4}. It is also easy to check that
the two vectors [1/2, 1/2, 0, 0] and [0, 0, 1/2, 1/2] are both stationary distributions, so this
chain possesses stationary distributions but they are not unique.
(3) Consider SRW on the square (vertices 1, 2, 3, 4 with the four edges (1, 2), (2, 3),
(3, 4), (4, 1)). If X0 = 1, then Xn is odd when n is even and even when n is odd. So for
any j ∈ {1, 2, 3, 4}, limn→∞ P(Xn = j) cannot exist; this chain is periodic, and the
distribution of Xn cannot converge to a stationary distribution. (We'll come back to periodicity later.)
As we will see, the three examples in 6.3 show all the possible ways Questions 1–3 can
fail to have positive answers.
6.2. General 2-state Markov chain. Let’s take a moment to completely classify the be-
havior of all 2-state Markov chains to gain intuition. Any 2-state Markov chain has a
transition matrix of the form
P = [ 1−p    p  ]
    [  q    1−q ]

for some 0 ≤ p, q ≤ 1. To compute all powers of this matrix, we diagonalize. The charac-
teristic polynomial is (λ − (1 − p))(λ − (1 − q)) − pq = λ2 + (p + q − 2)λ + 1 − p − q whose
roots are λ ∈ {1, 1 − p − q}.
First, suppose that p, q ∈ {0, 1}. That is, consider separately the four special cases:
P ∈ { [ 1  0 ]   [ 0  1 ]   [ 1  0 ]   [ 0  1 ] }
    { [ 1  0 ] , [ 0  1 ] , [ 0  1 ] , [ 1  0 ] } .
In the first two cases, one can easily check that the matrix already has both rows equal to
π, the stationary distribution, as desired. The latter two are different, however. For the
identity matrix, every vector is a left-eigenvector with eigenvalue 1, and so although it
is true that πI^n = π for all n (and so too in the limit), the stationary distribution is not
unique, and the limit distribution depends on the initial distribution. Worse yet, for the
last matrix: the powers do not stabilize, but cycle periodically between the matrix and
I. The eigenvalues are ±1, and the unique stationary distribution is [1/2, 1/2], but for any
other initial distribution the limit limn→∞ P(Xn = j) does not exist.
Now, suppose that at least one of p, q is strictly in (0, 1). Then the eigenvalues are
distinct and we can find two linearly independent eigenvectors. The result is that we can
write P in the form P = QDQ−1 where
Q = [ 1  −p ]        Q^{−1} = [  q/(p+q)   p/(p+q) ]        D = [ 1      0     ]
    [ 1   q ] ,               [ −1/(p+q)   1/(p+q) ] ,          [ 0    1−p−q ] .
Note: writing it this way means PQ = QD or Q−1 P = DQ−1 and so the columns of Q are
right-eigenvectors for P, and the rows of Q−1 are left-eigenvectors for P. The normalized
left-eigenvector for eigenvalue 1 is [q/(p+q), p/(p+q)], so this is the stationary distribution
π. Now, since 0 ≤ p, q ≤ 1 and one is strictly in (0, 1), the eigenvalue 1 − p − q has
|1 − p − q| < 1. Thus
D^n = [ 1        0       ]   →   [ 1   0 ]     as n → ∞,
      [ 0   (1−p−q)^n ]         [ 0   0 ]
and so
P^n = Q D^n Q^{−1}  →  [ 1  −p ] [ 1   0 ] [  q/(p+q)   p/(p+q) ]
                       [ 1   q ] [ 0   0 ] [ −1/(p+q)   1/(p+q) ]
                    =  [ q/(p+q)   p/(p+q) ]  =  [ π ] ,     as n → ∞.
                       [ q/(p+q)   p/(p+q) ]     [ π ]
Hence, the chain converges to the stationary distribution regardless of initial distribution,
provided at least one of p, q is in (0, 1).
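The decomposition and the limit can be checked numerically for any particular p, q; a small sketch (the values p = 0.3, q = 0.8 are arbitrary, not from the notes):

    import numpy as np

    p, q = 0.3, 0.8
    P = np.array([[1 - p, p], [q, 1 - q]])
    Q = np.array([[1.0, -p], [1.0, q]])
    Qinv = np.array([[q, p], [-1.0, 1.0]]) / (p + q)
    D = np.diag([1.0, 1 - p - q])

    assert np.allclose(P, Q @ D @ Qinv)            # P = Q D Q^{-1} as claimed
    limit = Q @ np.diag([1.0, 0.0]) @ Qinv         # D^n -> diag(1, 0)
    print(limit)                                   # both rows equal [q/(p+q), p/(p+q)]
    print(np.linalg.matrix_power(P, 50))           # P^n indeed approaches this limit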
Remark 6.4. If both p, q are in (0, 1), then all entries of P are strictly positive. This turns
out to be enough in general to guarantee the existence of a unique stationary distribution
and universal convergence to it. Now, if (say) p ∈ (0, 1) but q = 1, then the matrix has the
form
P = [ 1−p   p ] ,    and so    P² = [ 1−p+p²   p−p² ]
    [  1    0 ]                     [  1−p       p  ]
and so all entries of P² are strictly positive, and so similar reasoning will allow us to say
something about stationary distributions and limit theorems for such chains.

6.3. Perron-Frobenius Theorem. Let P be a (finite-dimensional) stochastic matrix. It is


easy to check that the column vector 1 with all entries equal to 1 is a right-eigenvector for
P with eigenvalue 1: P1 = 1. So 1 is always an eigenvalue of a stochastic matrix, and
therefore there exists a left eigenvector for this eigenvalue. Suppose we could show that:
(1) 1 is a simple eigenvalue, so its (left-)eigenspace is 1-dimensional.
(2) There exists a left eigenvector with all non-negative entries.
(3) All non-1 eigenvalues have magnitude strictly < 1.
Even in this case, it may not be possible to diagonalize the matrix P, but we don’t need
to. Using the Jordan normal form for P, we can decompose P = QDQ−1 where D is not
diagonal (it may have entries on the super-diagonal as well), Q has columns that are gen-
eralized right-eigenvectors, and Q−1 has rows that are generalized left-eigenvectors.
In particular, this means we will have
D = [ 1   0 ]
    [ 0   M ] ,
with the first column of Q equal to the all-ones column vector 1 and the first row of Q−1 equal to π,
where π is a left-eigenvector with eigenvalue 1, and the submatrix M is a bidiagonal matrix
and easily seen to satisfy M n → 0 as n → ∞ (by condition (3) above). (Consult a linear
algebra textbook, or Wikipedia, regarding the Jordan normal form.) Hence
lim_{n→∞} P^n = Q diag(1, 0, . . . , 0) Q^{−1}, which is the matrix each of whose rows equals π.
Theorem 6.5 (Perron, Frobenius). Let M be an N × N matrix all of whose entries are strictly
positive. Then there is an eigenvalue r > 0 such that all other eigenvalues λ satisfy |λ| < r.
Moreover, r is a simple eigenvalue, and its one-dimensional eigenspace contains a vector with all
strictly-positive entries. Finally, r satisfies the bounds min_i Σ_j Mij ≤ r ≤ max_i Σ_j Mij.
The Perron-Frobenius theorem is one of the most important results in linear algebra.
Applied to a stochastic matrix (with strictly positive entries), we see that the Perron-
Frobenius eigenvalue r = 1 (since it is bounded above by the maximal row sum and
below by the minimal row sum, both of which are equal to 1 for a stochastic matrix).
Hence, conditions (1), (2), and (3) above are satisfied if P has all strictly positive entries.
Of course, there are plenty of Markov chains with some transition probabilities 0; but we
can generalize a little bit.
Corollary 6.6. Suppose P is a stochastic matrix with the property that, for some n ∈ N, Pn has
all entries strictly positive. Then P satisfies conditions (1), (2), and (3) above, and therefore there
exists a unique stationary distribution π and limn→∞ νPn = π for any initial distribution ν.
Proof. If λ is an eigenvalue of P with eigenvector v then λn is an eigenvalue of Pn with
eigenvector v. By the Perron-Frobenius theorem, the non-1 eigenvalues of Pn are all < 1
in magnitude, and so the same holds true for their nth roots; since 1 is a simple eigenvalue
for Pn , the same must be true for P, and the eigenvector π of 1 (for Pn , and so also for P)
may be chosen with all positive entries. 

7. Lecture 7: April 16, 2012


We will soon classify all Markov chains whose transition matrices P have the property
that Pn has all positive entries for some n ≥ 1. In the meantime, we give a probabilistic
interpretation to the (positive) entries of the stationary distribution.
Proposition 7.1. Let (Xn )n∈N be a Markov chain with finite state space S, satisfying the condi-
tions of Corollary 6.6. Let π denote the unique stationary distribution. Then for each j ∈ S, π(j)
is equal to the asymptotic expected fraction of time the chain spends in state j.
Proof. To make the statement of the proposition clear, let Y (j, n) denote the number of
time steps the process is in state j up through time n:
Y (j, n) = Σ_{m=0}^n 1{Xm = j}.

Then the precise statement we are proving is that, for any initial state i ∈ S,
π(j) = lim_{n→∞} Ei [Y (j, n)] / (n + 1).
To prove this, simply note that
Ei [Y (j, n)] = Σ_{m=0}^n Ei (1{Xm = j}) = Σ_{m=0}^n Pi (Xm = j) = Σ_{m=0}^n [ei P^m ]j

where ei is the row vector with a 1 in the i-entry and 0s elsewhere. By assumption ei P^n → π
as n → ∞, and the result follows. (I.e. this is the elementary fact that if an is any sequence
of real numbers that converges to α ∈ R, then the “running average” (a1 + a2 + · · · + an )/n
converges to α as well.) 

On Homework 2, you also explore the connection between the stationary distribution
and the expected return times Ti for states i ∈ S. Here is a somewhat heuristic argument
for the relation between them. As usual, let Ti,k denote the time of the k th return to the
state i (so that, conditioned on X0 = i, Ti = Ti,2 ). Write this time as
Ti,k = T2 + T3 + · · · + Tk
where Tℓ = Ti,ℓ − Ti,ℓ−1 for ℓ ≥ 2. The Markov property shows that the Tℓ are independent
and identically distributed, each with the distribution of Ti . Now, the law of large numbers
therefore gives
Ti,k / k = (T2 + · · · + Tk )/k ≈ E(Ti ).
I.e. there are approximately k visits to state i in the first kE(Ti ) steps of the chain. But
Proposition 7.1 states that in n steps there are about nπ(i) such visits to the state i. Taking
n = kE(Ti ), we find kE(Ti )π(i) ≈ k for large k, and so we have loosely derived the fact
E(Ti ) = 1/π(i).
You will give a rigorous proof of this fact on Homework 2.
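The relation E(Ti) = 1/π(i) is also easy to see by simulation; a Monte Carlo sketch using the chain of Example 6.2 (the sample size and seed are arbitrary choices, not from the notes):

    import numpy as np

    P = np.array([[0.5, 0.5],
                  [0.25, 0.75]])
    pi = np.array([1/3, 2/3])                  # stationary distribution from Example 6.2
    rng = np.random.default_rng(0)

    def return_time(P, i, rng):
        """Sample T_i = min{n >= 1 : X_n = i} for the chain started at X_0 = i."""
        state, n = i, 0
        while True:
            state = rng.choice(len(P), p=P[state])
            n += 1
            if state == i:
                return n

    for i in (0, 1):
        mean_T = np.mean([return_time(P, i, rng) for _ in range(20000)])
        print(i, round(mean_T, 3), "vs 1/pi(i) =", round(1 / pi[i], 3))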

7.1. Periodic Chains. Let P be the transition matrix for an irreducible chain. The period
of a state i ∈ S, denoted d(i), is the greatest common divisor of the set of times Ji for
which the chain can return from i to i:
d(i) = gcd Ji , Ji = {n ≥ 0 : pn (i, i) > 0}.
Lemma 7.2. If P is the transition matrix for an irreducible Markov chain, then d(i) = d(j) for
all states i, j.
Proof. For fixed state i, note that if n ∈ Ji and m ∈ Ji then n + m ∈ Ji (if there is a
positive probability trajectory of length n and another of length m, then the conjunction
of the two is a positive probability trajectory of length n + m). Hence, Ji is closed un-
der addition. Now, if J is any subset of N closed under addition and d = gcd J, then
J ⊆ {0, d, 2d, . . .} = dN (otherwise J would contain a number not divisible by d which
contradicts the definition of d).
Now, let j be another state. The chain is irreducible, so there are m, n ∈ N so that
pm (i, j) > 0 and pn (j, i) > 0; hence m + n ∈ Ji . Since m + n ∈ Ji there is k ∈ N so that
m + n = kd(i). Now, suppose that ` ∈ Jj . Since p` (j, j) > 0, we have
pm+n+` (i, i) ≥ pm (i, j)p` (j, j)pn (j, i) > 0
and so m + n + ` ∈ Ji ; hence m + n + ` is divisible by d(i). Thus, ` is divisible by d(i)
as well. Thus, d(i) is a common divisor of Jj , and it follows that d(j) divides d(i). The
symmetric argument shows that d(i) divides d(j), so d(i) = d(j). 

It therefore makes sense to talk about the period of an irreducible Markov chain or the
period of P.
Example 7.3. Let G be a finite, connected graph. Then the simple random walk on G is
irreducible. Note, if i, j are two adjacent vertices, then p2 (i, i) ≥ p(i, j)p(j, i) > 0, and so
the period of i is ≤ 2; so simple random walks on graphs have period 1 or 2. It is not hard
to see that the period is 2 if and only if the graph is bipartite: G = (V1 ⊔ V2 , E) where there
are no edges between any two vertices in either V1 or V2 , only between the two non-empty
disjoint sets V1 and V2 . For example, any even cycle (like the square, in Example 6.3(3)
above) is bipartite, partitioned by parity.
A Markov chain is called aperiodic if all of its states have period 1.
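Computationally, the period of a state can be read off from which powers of P have a positive (i, i) entry; a small sketch (the cutoff nmax is an arbitrary horizon, so this only takes the gcd of the return times up to that horizon; not part of the notes):

    import numpy as np
    from math import gcd
    from functools import reduce

    def period(P, i, nmax=50):
        """gcd of {n >= 1 : p_n(i, i) > 0}, scanning only n <= nmax (enough for small chains)."""
        Pn, times = np.eye(len(P)), []
        for n in range(1, nmax + 1):
            Pn = Pn @ P
            if Pn[i, i] > 0:
                times.append(n)
        return reduce(gcd, times)

    # SRW on the square (a 4-cycle): bipartite, so period 2.
    square = np.array([[0, .5, 0, .5], [.5, 0, .5, 0], [0, .5, 0, .5], [.5, 0, .5, 0]])
    print(period(square, 0))   # 2

    # Adding a self-loop (a lazy walk) makes the chain aperiodic: period 1.
    lazy = 0.5 * np.eye(4) + 0.5 * square
    print(period(lazy, 0))     # 1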
7.2. Limit Theorem for Irreducible Aperiodic Chains. The Perron-Frobenius approach
we discussed in the last lecture works precisely for irreducible aperiodic Markov chains.
Theorem 7.4. Let P be the transition matrix for a finite-state, irreducible, aperiodic Markov
chain. Then there exists a unique stationary distribution π: πP = π, and for any initial probability
distribution ν,
lim νPn = π.
n→∞

For the proof, we need a slightly stronger number theoretic claim about additively-
closed subsets of N. If J ⊆ N is closed under addition, then J ⊆ dN where d = gcd J as
shown above. What’s more, it is true that J contains all but finitely many multiples of d. In
other words, there is M ∈ N such that md ∈ J for all m ≥ M . (A proof of this fact is
outlined in Exercise 1.21 on page 41 of Lawler.)

Proof. Using the Perron-Frobenius theorem as discussed in Corollary 6.6, it suffices to
show that there is an n ∈ N for which Pn has all strictly positive entries. Fix states i, j.
Since the chain is irreducible, there is a time n(i, j) such that pn(i,j) (i, j) > 0. Since the
chain is aperiodic, the state i has period 1 and so, as per the discussion preceding this
proof, Ji contains all sufficiently large integers: there is M (i) ∈ N so that pm (i, i) > 0 for
all m ≥ M (i). Hence, if m ≥ M (i),
pm+n(i,j) (i, j) ≥ pm (i, i)pn(i,j) (i, j) > 0.
Now, since the state space is finite, n = max{M (i) + n(i, j) : i, j ∈ S} is finite. The above
shows that pn (i, j) > 0 for every pair i, j ∈ S, concluding the proof.

7.3. Reducible Chains. Irreducible Markov chains are the easiest to handle, in large part
because of Proposition 4.8. Other Markov chains are called reducible, because they re-
duce (in some sense) to collections of irreducible chains. Following Definition 4.7 and the
proof of Proposition 4.8, the key feature of irreducible chains is the ability of all states to
eventually communicate with each other.
Definition 7.5. Let (Xn ) be a Markov chain with state space S. Say that two states i, j ∈ S can
communicate, written i ↔ j, if there exist n, m ∈ N so that pn (i, j) > 0 and pm (j, i) > 0.
Lemma 7.6. The relation ↔ on the state space S (cf. Definition 7.5) is an equivalence relation.
Proof. Since p0 (i, i) = 1 by definition, i ↔ i and so the relation is reflexive. Symmetry is
built into the definition. We need only verify transitivity. Suppose, then, that pm (i, j) > 0
and pn (j, k) > 0. Then
pm+n (i, k) = Pi (Xm+n = k)
≥ Pi (Xm+n = k, Xm = j)
= Pi (Xm = j)Pi (Xm+n = k|Xm = j)
= Pi (Xm = j)Pj (Xn = k)
= pm (i, j)pn (j, k) > 0
where the penultimate equality is the Markov property. A similar argument applied to
(k, j) proves transitivity. 
The equivalence classes under ↔ partition the state space S into communication classes.
An irreducible Markov chain is one with exactly one communication class. Examining the
proof of Proposition 4.8 shows that:
Corollary 7.7. Let (Xn ) be any Markov chain. Then all states within any one of its communica-
tion classes have the same asymptotic behavior: either all recurrent or all transient.
Remark 7.8. Similarly, the proof of Lemma 7.2 shows that any two states in a given com-
munication class have the same period.
Based on Corollary 7.7, we refer to communication classes either as recurrent classes or
transient classes. If S is a finite state space, at least one class must be recurrent (since at
least one state must be visited infinitely often).

Let i, j be states in two distinct classes. If p(i, j) > 0 then it must be true that pn (j, i) = 0
for all n (or else they would be in the same class). It follows that Pj (Xn = i) = 0 for all n,
and so i is transient: from i the chain moves to j with probability p(i, j) > 0 and then never
returns, so ri ≤ 1 − p(i, j) < 1. Hence, if i, j are recurrent states in distinct communication classes,
we must have p(i, j) = 0 (and also p(j, i) = 0 by the symmetric argument).
Let us agree to order the states by communication class, with recurrent classes R1 , . . . , Rr
first. The above argument shows that the Markov chain “reduces” the subsets R` : once in
a state in R` , the chain stays in R` forever; we can consider this as its own Markov chain
with transition matrix P` . The transition matrix for the full chain has the form
P = [ P1    0    · · ·    0     0 ]
    [ 0     P2   · · ·    0     0 ]
    [ ⋮            ⋱            ⋮ ]
    [ 0     0    · · ·    Pr    0 ]
    [           S               Q ]
where the bottom rows [ S |Q] give the transition probabilities starting in the transient
classes. Now, this block diagonal form is preserved under matrix multiplication, so we
have
\[
\mathbf{P}^n = \begin{pmatrix}
\mathbf{P}_1^n & & & & 0 \\
& \mathbf{P}_2^n & & & \\
& & \ddots & & \\
& & & \mathbf{P}_r^n & \\
& & \mathbf{S}_n & & \mathbf{Q}^n
\end{pmatrix}
\]
for some matrix Sn . Note that the row sums of [ S |Q] are all 1 and so the same remains
true for [ Sn | Qn ] since Pn is a stochastic matrix. So Qn is substochastic for all n (its row-
sums are ≤ 1). This will be important later.

8. L ECTURE 8: A PRIL 18, 2012


8.1. Transient States. We saw above that, at least in the case of an irreducible chain, if a
stationary distribution π exists then π(j) represents the average amount of time the chain
spends in state j. If j is a transient state, however, this average ought to be 0. Of course,
in a finite-state irreducible Markov chain, all states have the same asymptotic behavior,
and so since there must be at least one recurrent state, all states are recurrent. But in
an infinite-state Markov chain, or a reducible chain, this highlights one obstacle to the
existence of a stationary distribution.
The question remains: how can we calculate or estimate the (finite) number of visits
that will eventually be made to a transient state j? That is: fix a transient state j, and
define

\[ Y_j = \sum_{n=0}^{\infty} \mathbf{1}_{\{X_n = j\}} \]
(which, by assumption, is almost surely finite). We would like to calculate Ei [Yj ] for some
starting state i.
"∞ # ∞ ∞ ∞
1{Xn =j} = Ei [1{Xn =j} ] =
X X X X
Ei [Yj ] = Ei Pi (Xn = j) = pn (i, j).
n=0 n=0 n=0 n=0

This last sum is the (i, j)-entry of the matrix I + P + P2 + · · · which, a priori, may not
converge at all. However, consider the block-diagonal form above for the matrix P. Let
us assume that both i and j are transient states, which means that the (i, j)-entry of Pn is the
(i, j)-entry of Qn , where Q is the substochastic matrix for the transitions of transient states
to transient states. Hence
\[ [\mathbf{I} + \mathbf{P} + \mathbf{P}^2 + \cdots]_{ij} = [\mathbf{I} + \mathbf{Q} + \mathbf{Q}^2 + \cdots]_{ij}. \]
Since j is transient, the number of visits to j is dominated by a geometric random variable,
so E_i[Y_j] = Σ_{n≥0} p_n(i, j) is finite; in particular the series on the right converges entrywise
and [Q^n]_{ij} → 0 as n → ∞. To calculate it, note that (I − Q)(I + Q + Q² + · · · + Q^n) = I − Q^{n+1} → I,
and hence
\[ \mathbf{I} + \mathbf{Q} + \mathbf{Q}^2 + \cdots = (\mathbf{I} - \mathbf{Q})^{-1}. \]
Example 8.1. Consider the random walk on {0, 1, 2, 3, 4} with absorbing boundaries (cf.
Example 1.7). The states 0, 4 are recurrent, so we should order the states {0, 4, 1, 2, 3}; then
the transition matrix becomes
\[
\mathbf{P} = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
\tfrac12 & 0 & 0 & \tfrac12 & 0 \\
0 & 0 & \tfrac12 & 0 & \tfrac12 \\
0 & \tfrac12 & 0 & \tfrac12 & 0
\end{pmatrix},
\qquad
\mathbf{Q} = \begin{pmatrix}
0 & \tfrac12 & 0 \\
\tfrac12 & 0 & \tfrac12 \\
0 & \tfrac12 & 0
\end{pmatrix}.
\]
So we can easily and quickly calculate
\[
(\mathbf{I} - \mathbf{Q})^{-1} = \begin{pmatrix}
1 & -\tfrac12 & 0 \\
-\tfrac12 & 1 & -\tfrac12 \\
0 & -\tfrac12 & 1
\end{pmatrix}^{-1}
= \begin{pmatrix}
\tfrac32 & 1 & \tfrac12 \\
1 & 2 & 1 \\
\tfrac12 & 1 & \tfrac32
\end{pmatrix}.
\]

Thus, starting in state 1, the expected number of visits to state 1 is 3/2, to state 2 is 1, and to
state 3 is 1/2. Hence, the total expected number of steps before absorption at the boundary,
starting in state 1, is 3/2 + 1 + 1/2 = 3. The same expected number of steps holds starting in
state 3, while starting in state 2 the expected number of steps is 1 + 2 + 1 = 4.
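These numbers are easy to check numerically. The following is a minimal sketch (assuming NumPy is available; the matrix Q is the one from Example 8.1):

```python
import numpy as np

# Substochastic block Q for the transient states {1, 2, 3} of the
# absorbing random walk on {0, 1, 2, 3, 4} (Example 8.1).
Q = np.array([[0.0, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.0]])

# Entry (i, j) of (I - Q)^{-1} is the expected number of visits to
# transient state j starting from transient state i.
N = np.linalg.inv(np.eye(3) - Q)
print(N)              # [[1.5, 1.0, 0.5], [1.0, 2.0, 1.0], [0.5, 1.0, 1.5]]

# Row sums give the expected number of steps before absorption.
print(N.sum(axis=1))  # [3., 4., 3.]
```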

8.2. Eventual Recurrence Probabilities. As another application of the matrix (I − Q)−1 ,


consider a Markov chain with at least two distinct recurrent classes. What is the probabil-
ity, when starting in a given initial state, of eventually ending up in a particular recurrent
class? To answer this question, it suffices to consider a reduced Markov chain: collapse
each recurrent class to a single point. The reduced Markov chain then has transition ma-
trix of the form  
I 0
P=
S Q
(i.e. each Pk has been collapsed to 1). Denote the states of this Markov chain as either ri
(for the reduced recurrent states) or tj (for the original transient states). Let
α(ti , rj ) = Pti (Xn = rj eventually).
Of course we also have α(rj , rk ) = δjk and α(x, tj ) = 0 for all states x and all transient
states tj . Let A be the square matrix with α(x, y) as entries, for all x, y ∈ {r1 , r2 , . . . , rk , t1 , . . . , tm }.
To calculate these probabilities, we use the Markov property as usual to assert that
\begin{align*}
\alpha(t_i, r_j) = P_{t_i}(X_n = r_j \text{ eventually})
&= \sum_{x\in S} P_{t_i}(X_n = r_j \text{ eventually},\, X_1 = x) \\
&= \sum_{x\in S} P_{t_i}(X_1 = x)\, P_{t_i}(X_n = r_j \text{ eventually} \mid X_1 = x) \\
&= \sum_{x\in S} p(t_i, x)\, P_x(X_n = r_j \text{ eventually}) \\
&= \sum_{x\in S} p(t_i, x)\, \alpha(x, r_j).
\end{align*}

In other words, A satisfies the equations A = PA, at least for entries of the form (ti , rj ). If
we write A in block form as such, we have
\[ \mathbf{A} = \begin{pmatrix} \mathbf{I} & 0 \\ \mathbf{A}_{tr} & 0 \end{pmatrix} \]
where Atr are the entries calculated above. The matrix equation A = PA then yields the
matrix equation Atr = S + QAtr , or
Atr = (I − Q)−1 S.
Example 8.2. Again, consider the random walk on {0, 1, 2, 3, 4} with absorbing bound-
aries, cf. Example 7.3 above. Then we have
\[
\mathbf{S} = \begin{pmatrix} \tfrac12 & 0 \\ 0 & 0 \\ 0 & \tfrac12 \end{pmatrix},
\qquad
\mathbf{Q} = \begin{pmatrix} 0 & \tfrac12 & 0 \\ \tfrac12 & 0 & \tfrac12 \\ 0 & \tfrac12 & 0 \end{pmatrix},
\qquad
(\mathbf{I} - \mathbf{Q})^{-1} = \begin{pmatrix} \tfrac32 & 1 & \tfrac12 \\ 1 & 2 & 1 \\ \tfrac12 & 1 & \tfrac32 \end{pmatrix}.
\]

Hence
\[
\mathbf{A}_{tr} = (\mathbf{I} - \mathbf{Q})^{-1}\mathbf{S}
= \begin{pmatrix} \tfrac32 & 1 & \tfrac12 \\ 1 & 2 & 1 \\ \tfrac12 & 1 & \tfrac32 \end{pmatrix}
\begin{pmatrix} \tfrac12 & 0 \\ 0 & 0 \\ 0 & \tfrac12 \end{pmatrix}
= \begin{pmatrix} \tfrac34 & \tfrac14 \\ \tfrac12 & \tfrac12 \\ \tfrac14 & \tfrac34 \end{pmatrix}.
\]

Thus, starting at state 1, the walk is eventually absorbed at 0 with probability 3/4; starting
at state 2 it is an even split for where it ends up; starting at state 3 the walk is absorbed at
0 with probability 1/4 (and at 4 with probability 3/4).

8.3. Transit Times. Techniques like those above can be used to determine expected tran-
sit times from one state to another, in an irreducible chain. Fix a state i, and let Ti =
min{n ≥ 1 : Xn = i} as usual. We saw (in Lecture 3) a technique for calculating Ei [Ti ], but
what about the more general question of Ej [Ti ] for different states i, j?
One approach is to define a new Markov chain from the original which converts the
coveted state i into an absorbing state. First, write the transition matrix for the original
chain with the state i ordered first:
\[ \mathbf{P} = \begin{pmatrix} p(i,i) & \mathbf{r} \\ \mathbf{s} & \mathbf{Q} \end{pmatrix}. \]
Then the new chain X'_n has transition matrix
\[ \mathbf{P}' = \begin{pmatrix} 1 & 0 \\ \mathbf{s} & \mathbf{Q} \end{pmatrix}. \]

Now, for any state k ≠ i let Y_i^k denote the number of visits to k before reaching i. Then
\[ E_j[T_i] = E_j\left[ \sum_{k\ne i} Y_i^k \right] = \sum_{k\ne i} E_j[Y_i^k]. \]

Now, because we have changed the chain to make i an absorbing state, if Xn0 = i then it
will never reach a state k 6= i at any time ≥ n. Thus, we can calculate the expectation of
Y_i^k simply as
\[ E_j[Y_i^k] = E_j\left[ \sum_{n=0}^{\infty} \mathbf{1}_{\{X'_n = k\}} \right] = \sum_{n=0}^{\infty} E_j[\mathbf{1}_{\{X'_n = k\}}] = \sum_{n=0}^{\infty} P_j(X'_n = k) = \sum_{n=0}^{\infty} p'_n(j,k). \]

As above, p'_n(j, k) = [(\mathbf{P}')^n]_{jk}, and so long as j, k ≠ i, this means that
\[ E_j[Y_i^k] = \sum_{n=0}^{\infty} [\mathbf{Q}^n]_{jk} = [(\mathbf{I} - \mathbf{Q})^{-1}]_{jk}. \]

Thus, for j ≠ i,
\[ E_j[T_i] = \sum_{k\ne i} [(\mathbf{I} - \mathbf{Q})^{-1}]_{jk} = [(\mathbf{I} - \mathbf{Q})^{-1}\mathbf{1}]_j. \]

Here 1 is the column-vector with all entries equal to 1.



Example 8.3. Consider the random walk on {0, 1, 2, 3, 4} with reflecting boundary:
\[
\mathbf{P} = \begin{pmatrix}
0 & 1 & 0 & 0 & 0 \\
\tfrac12 & 0 & \tfrac12 & 0 & 0 \\
0 & \tfrac12 & 0 & \tfrac12 & 0 \\
0 & 0 & \tfrac12 & 0 & \tfrac12 \\
0 & 0 & 0 & 1 & 0
\end{pmatrix}.
\]
Taking the state i = 0 as the target, we use the preceding technique to calculate the
expected transit time to 0. The submatrix Q and the associated matrix (I − Q)^{-1} are
\[
\mathbf{Q} = \begin{pmatrix}
0 & \tfrac12 & 0 & 0 \\
\tfrac12 & 0 & \tfrac12 & 0 \\
0 & \tfrac12 & 0 & \tfrac12 \\
0 & 0 & 1 & 0
\end{pmatrix},
\qquad
(\mathbf{I} - \mathbf{Q})^{-1} = \begin{pmatrix}
2 & 2 & 2 & 1 \\
2 & 4 & 4 & 2 \\
2 & 4 & 6 & 3 \\
2 & 4 & 6 & 4
\end{pmatrix}.
\]
Thus
\[
(\mathbf{I} - \mathbf{Q})^{-1}\mathbf{1} = \begin{pmatrix}
2 & 2 & 2 & 1 \\
2 & 4 & 4 & 2 \\
2 & 4 & 6 & 3 \\
2 & 4 & 6 & 4
\end{pmatrix}\begin{pmatrix}1\\1\\1\\1\end{pmatrix}
= \begin{pmatrix}7\\12\\15\\16\end{pmatrix}.
\]
So, starting at 1, the expected number of steps to reach 0 is 7; starting at 4 it is 16.
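The same numbers can be obtained programmatically by deleting the row and column of the target state from P (a sketch assuming NumPy; P is the reflecting-walk matrix above):

```python
import numpy as np

# Reflecting random walk on {0, 1, 2, 3, 4}.
P = np.array([[0.0, 1.0, 0.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0, 0.5, 0.0],
              [0.0, 0.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 0.0, 1.0, 0.0]])

target = 0
keep = [s for s in range(5) if s != target]
Q = P[np.ix_(keep, keep)]            # transitions among the states != 0

# E_j[T_0] for j = 1, 2, 3, 4 are the row sums of (I - Q)^{-1}.
transit = np.linalg.inv(np.eye(4) - Q) @ np.ones(4)
print(transit)                       # [ 7. 12. 15. 16.]
```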

9. L ECTURE 9: A PRIL 20, 2012


We now turn our attention to issues that arise in the world of discrete-time Markov
chains with countably infinitely many states. To begin, we consider a class of examples
called birth-and-death chains.
9.1. Birth-and-Death chains. Consider a Markov chain (Xn )n≥0 with state space N =
{0, 1, 2, . . .}. For i ≥ 1, we will have p(i, j) = 0 unless |i − j| = 1, as with simple random
walk. But we generally allow the transition probabilities to vary with location:
p(i, i + 1) = pi , p(i, i − 1) = qi = 1 − pi , i ≥ 1.
As for p(0, 0) = 1 − p(0, 1), we can choose this parameter to be 1 (absorbing), 0 (reflect-
ing), or anything in between for mixed boundary conditions. Such a chain can serve as
a general model for population growth, for example. As such, the important questions
are Pi (∃n ≥ 0 Xn = 0) (i.e. starting with population i, what is the probability the popu-
lation dies out) and Pi (Xn → ∞ as n → ∞) (i.e. starting with population i, what is the
probability the population explodes).
As usual, let h(i) = Pi (∃n ≥ 0 Xn = 0). Then standard first step analysis shows that
\[ h(i) = \sum_{j\ge 0} P_i(\exists n \ge 0\ X_n = 0 \mid X_1 = j)\, P_i(X_1 = j) = \sum_{j\ge 0} P_j(\exists n \ge 0\ X_n = 0)\, p(i,j) = \sum_{j\ge 0} p(i,j)\, h(j). \]

For this kind of chain, with i ≥ 1, this says h(i) = pi h(i + 1) + qi h(i − 1). This is a 2-step
recurrence relation, but it has variable coefficients, so it is a little trickier to handle, but
not too much trickier. Let u(i) = h(i − 1) − h(i) for i ≥ 1. Then h(i) = pi h(i + 1) + qi h(i − 1)
can be rewritten as pi u(i + 1) = qi u(i). Assuming pi 6= 0 for any i, this gives us
\[ u(i+1) = \frac{q_i}{p_i}\,u(i) = \frac{q_i q_{i-1}\cdots q_1}{p_i p_{i-1}\cdots p_1}\,u(1) \]
by induction. Define
\[ \rho_i = \frac{q_i q_{i-1}\cdots q_1}{p_i p_{i-1}\cdots p_1}. \]
Then the telescoping sum gives us u(1) + u(2) + · · · + u(i) = h(0) − h(i), and so
\[ h(0) - h(i) = u(1)\,(1 + \rho_1 + \cdots + \rho_{i-1}). \tag{9.1} \]
The value of h(0) will be determined by the boundary conditions; for example if p(0, 0) =
1 then h(0) = 1; if p(0, 1) = 1 then h(0) = 0. The first difference u(1) remains undeter-
mined. We consider two cases.
• Suppose Σ_{i=0}^∞ ρ_i = ∞. Since h(i) is a probability, h(i) ∈ [0, 1], so the sequence
h(0) − h(i) must remain bounded. Taking i → ∞ in (9.1) then shows that we must have
u(1) = 0 here, and so h(i) = h(0) for all i (the boring case).
• If, on the other hand, Σ_{i=0}^∞ ρ_i < ∞, then u(1) can be taken > 0 provided that
h(i) = h(0) − u(1)(1 + ρ_1 + · · · + ρ_{i−1}) ≥ 0 for all i ≥ 1.


This suggests how to find a solution: the minimal possible solution will be achieved
when lim_{i→∞} [h(0) − u(1)(1 + ρ_1 + · · · + ρ_{i−1})] = 0. So we take
\[ u(1) = h(0)\left( \sum_{j=0}^{\infty} \rho_j \right)^{-1}, \]
where we set ρ_0 = 1.

Thus, the solution is
\[ h(i) = h(0) - h(0)\left(\sum_{j=0}^{\infty}\rho_j\right)^{-1}\sum_{j=0}^{i-1}\rho_j = h(0)\,\frac{\sum_{j=i}^{\infty}\rho_j}{\sum_{j=0}^{\infty}\rho_j}. \]

Example 9.1. In a population growth model we certainly must have p(0, 0) = 1, giving
boundary condition h(0) = 1. So long as p_i > 0 and q_i > 0 for each i, we have ρ_i > 0, and
hence we either have Σ_i ρ_i = ∞, in which case h(i) = 1 for all i – i.e. guaranteed extinction
– or Σ_i ρ_i < ∞, in which case h(i) < 1 and h(i) → 0 as i → ∞ – i.e. there is positive
probability of survival, and this probability increases with larger starting population (as
expected).
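As a concrete illustration, here is a minimal numerical sketch (the constant up-probability p = 0.6 and the truncation level are assumed values, chosen only for illustration; any positive p_i, q_i could be used to build the ρ_i):

```python
# Extinction probabilities h(i) for a birth-and-death chain with absorbing 0
# and constant up-probability p(i, i+1) = p, so rho_i = ((1-p)/p)^i.
p = 0.6                      # drift upward, so sum(rho) < infinity
M = 10_000                   # truncation level for the infinite sums
rho = [1.0]                  # rho_0 = 1 by convention
for i in range(1, M):
    rho.append(rho[-1] * (1 - p) / p)

total = sum(rho)
tail = total
h = []
for i in range(10):
    h.append(tail / total)   # h(i) = sum_{j >= i} rho_j / sum_{j >= 0} rho_j
    tail -= rho[i]
print(h)                     # h(0) = 1, and h(i) = ((1-p)/p)^i in this special case
```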

9.2. Positive and Null Recurrence. In general, we can refine the definition of recurrence
further (although, as we’ll see, this really only makes sense for infinite-state Markov
chains).
Definition 9.2. Let i be a recurrent state for a Markov chain. Denote T_i = min{n ≥ 1 : X_n = i}
as usual. If Ei [Ti ] < ∞, we call i positive recurrent. If, on the other hand, Ei [Ti ] = ∞, then we
call i null recurrent.
Remark 9.3. Note, it is certainly possible for a finite-valued random variable T to have
infinite expectation. For example, suppose that P(T = 2^k) = 2^{−k} for each k ≥ 1. Then T
takes only the values {2, 4, 8, 16, . . .} and is never infinite, but E[T] = Σ_{k=1}^∞ 2^k · P(T = 2^k) =
Σ_{k=1}^∞ 1 = ∞.

Proposition 9.4. In a finite-state, irreducible Markov chain, all states are positive recurrent.
Proof. Fix states i, j ∈ S. By the irreducible assumption, there is a finite time n(i, j) so that
pn(i,j) (i, j) > 0. Let N = max{n(i, j) : i, j ∈ S} and let q = min{pn(i,j) (i, j) : i, j ∈ S}. Thus,
regardless of the current state i, there is a positive probability ≥ q of reaching state j in
the next ≤ N steps; i.e. Pi (Tj ≤ N ) ≥ q, and hence Pi (Tj > N ) ≤ 1 − q. We now apply the
Markov property: for any states i, j ∈ S and any k ∈ N,
Pj (Tj > (k + 1)N |Tj > kN, XkN = i) = Pi (Tj > N ) ≤ 1 − q.
Averaging over the possible values i of X_{kN} (conditioned on T_j > kN), we conclude that
P_j(T_j > (k + 1)N | T_j > kN) ≤ 1 − q. By induction
it follows that Pj (Tj > kN ) ≤ (1 − q)k for all k ∈ N. Thus:

\[ E_j[T_j] = \sum_{n=1}^{\infty} P_j(T_j \ge n) = \sum_{k=0}^{\infty} \sum_{n=kN+1}^{(k+1)N} P_j(T_j \ge n). \]

We can bound the internal sum by


\[ \sum_{n=kN+1}^{(k+1)N} P_j(T_j \ge n) \le \sum_{n=kN+1}^{(k+1)N} P_j(T_j > kN) = N\, P_j(T_j > kN) \]

because Pj (Tj ≥ n) is non-increasing with n. Thus



\[ E_j[T_j] \le N \sum_{k=0}^{\infty} P_j(T_j > kN) \le N \sum_{k=0}^{\infty} (1-q)^k < \infty. \]

Thus, null recurrence can only occur if there are infinitely many states (at least in the
case of an irreducible chain). For infinite state Markov chains, even irreducible ones, it
can certainly happen that the states are null recurrent. Indeed, simple random walk on Z
gives an example.
Example 9.5. Let (Xn )n≥0 denote SRW on Z. Let us calculate E0 [T0 ], the expected return
time to 0 starting at 0 (of course this will be the same as Ei [Ti ] for any state i ∈ Z).
\[ E_0[T_0] = \sum_{n\ge 1} P_0(T_0 \ge n) \ge \sum_{n\ge 1} P_0(X_1 \ne 0,\, X_2 \ne 0,\, \dots,\, X_n \ne 0). \]

We can evaluate this probability by summing over all non-0 states:


\begin{align*}
P_0(X_1 \ne 0, \dots, X_n \ne 0) &= \sum_{i_1,\dots,i_n \ne 0} P_0(X_1 = i_1, \dots, X_n = i_n) \\
&= \sum_{i_1,\dots,i_n \ne 0} p(0, i_1)\, p(i_1, i_2) \cdots p(i_{n-1}, i_n).
\end{align*}

Given the form of the transition probabilities, there are only finitely many (i1 , . . . , in ) for
which this product is non-zero: namely those that satisfy ik+1 = ik ± 1 for each k. So i1 =
±1, and so forth. Thus (i1 , . . . , in ) forms a lattice path. The restriction that i1 , . . . , in 6= 0
means that, after the first step i1 = ±1, the path must stay either strictly above 0 (if i1 = 1)
or strictly below 0 (if i1 = −1). By symmetry, we consider only the first case, so we have
i1 = 1, ik+1 = ik ± 1, and ik ≥ 1 for all k; call such paths Pn+ . Along any such path, all the
probabilities p(i_k, i_{k+1}) equal 1/2. Thus, we have
\[ P_0(X_1 \ne 0, \dots, X_n \ne 0) = \sum_{(1, i_2, \dots, i_n) \in P_n^+} \frac{1}{2^n} + \sum_{(-1, i_2, \dots, i_n) \in -P_n^+} \frac{1}{2^n} = 2^{-n+1}\, |P_n^+|. \]

So we would like to count the number of paths in Pn+ . This turns out to be tricky. But
we can bound the size from below as follows. Let P_n^{+,m} denote the subset of P_n^+ of those
paths that end at height m (so i_n = m). Consider a pair π_1, π_2 of paths in P_n^{+,m}.
By reversing π_2 (i.e. if π_2 = (1, i_2, . . . , i_n) then π_2^{−1} = (i_n, i_{n−1}, . . . , i_2, 1)), the path π_1 π_2^{−1} is a
lattice path of length 2n that starts and ends at height 1, and never drops below height 1.
By decreasing all heights by 1, we see this path is a Dyck path of length 2n. What’s more,
any Dyck path of length 2n, whose middle-height is m − 1, can be identified with the pair
π1 , π2 as above.
So, if Dn,m−1 denotes the set of Dyck paths of length 2n whose middle height is m − 1,
then we have a bijection P_n^{+,m} × P_n^{+,m} ≅ D_{n,m−1} for all m ≥ 1. (These sets are nonempty
only if m ≤ n + 1.) So we have |D_{n,m−1}| = |P_n^{+,m}|², and so
\[ |P_n^+| = \sum_{m=1}^{n+1} |P_n^{+,m}| = \sum_{m=1}^{n+1} |D_{n,m-1}|^{1/2}. \]

It follows that
\[ |P_n^+|^2 = \left( \sum_{m=1}^{n+1} |D_{n,m-1}|^{1/2} \right)^{2} \ge \sum_{m=1}^{n+1} |D_{n,m-1}| = |D_n| \]

where D_n is the set of all Dyck paths of length 2n. It is a well known result that D_n is counted by the
Catalan number |D_n| = \frac{1}{n+1}\binom{2n}{n}. Thus, we have
\[ |P_n^+| \ge \sqrt{\frac{1}{n+1}\binom{2n}{n}} \sim \sqrt{\frac{1}{n+1}\cdot\frac{2^{2n}}{\sqrt{\pi n}}} \sim \frac{2^n}{\pi^{1/4}\, n^{3/4}}. \]
Hence, we have
\[ P_0(X_1 \ne 0, \dots, X_n \ne 0) \gtrsim 2^{-n+1} \cdot \frac{2^n}{\pi^{1/4}\, n^{3/4}} = \frac{2}{\pi^{1/4}\, n^{3/4}}. \]
In particular, this means that
\[ E_0[T_0] \ge \sum_{n\ge 1} P_0(X_1 \ne 0, \dots, X_n \ne 0) \gtrsim \sum_{n\ge 1} \frac{2}{\pi^{1/4}\, n^{3/4}} = \infty. \]
We showed in Example 4.5 that all states of the SRW on Z are recurrent. So the state 0 is
null recurrent.

9.3. Positive Recurrence and Stationary Distributions. Proposition 9.4 shows that finite-
state (irreducible) Markov chains have only positive recurrent states. As the next theorem
shows, this is the real reason that stationary distributions exists for such chains.
Theorem 9.6. Let (Xn )n≥0 be a Markov chain with state space S that is countable (but not nec-
essarily finite). Suppose there exists a positive recurrent state i ∈ S: i.e. Ei [Ti ] < ∞, where, as
usual, Ti = min{n ≥ 1 : Xn = i}. For each state j ∈ S, define
"T −1 #
i

1{Xn =j}
X
γ(i, j) = Ei
n=0
be the expected number of visits to j before returning to i. Then the function π : S → R+
γ(i, j)
π(j) =
Ei [Ti ]
is a stationary distribution for (Xn ){n≥0} .
Remark 9.7. Note: Theorem 9.6 gives a much simpler proof that no state can be positive
recurrent for the SRW on Z. Indeed, if such a state existed, then there would necessarily
exist a stationary distribution π for the SRW. From the form of the transition probabilities,
this would mean that
\[ \pi(j) = \tfrac12\,\pi(j-1) + \tfrac12\,\pi(j+1) \]
for all j. Rearranging this (by writing π(j) = ½π(j) + ½π(j), subtracting, and multiplying
through by 2), we have
\[ \pi(j+1) - \pi(j) = \pi(j) - \pi(j-1). \]
Thus, if there is ever a j such that π(j + 1) 6= π(j), we see that the sequence π is an
arithmetic progression and is both unbounded and not always positive, which means it
is not a stationary distribution. If, on the other hand, π(j) = π(j + 1) for all j, then π is
constant, and is either 0 so is not a distribution or is not summable so is not a distribution.
Thus, the SRW on Z does not possess an invariant distribution, and so by Theorem 9.6,
it cannot have any positive recurrent states.

10. L ECTURE 10: A PRIL 23, 2012


We begin with the proof of Theorem 9.6 from the previous lecture.
Proof of Theorem 9.6. First note that Ei [Ti ] can be calculated as 1+ the sum over all states
j 6= i of the expected number of visits to j before returning to i:
\[ E_i[T_i] = 1 + \sum_{j\ne i} \gamma(i,j) = \sum_{j\in S} \gamma(i,j) \]

where the second equality follows from the fact that γ(i, i) = 1 by definition. It follows
that
\[ \sum_{j\in S} \pi(j) = \frac{\sum_{j\in S} \gamma(i,j)}{E_i[T_i]} = 1. \]
Since γ(i, j) and E_i[T_i] are ≥ 0, the components of π are ≥ 0; so π is a probability dis-
tribution. It remains to show that it is stationary: that π(j) = Σ_{k∈S} π(k)p(k, j) for all
j ∈ S. By multiplying both sides by E_i[T_i], what we wish to show is that
γ(i, j) = Σ_{k∈S} γ(i, k)p(k, j) for all j ∈ S. To that end, we calculate: for j ∈ S,
\[ \gamma(i,j) = E_i\left[ \sum_{n=0}^{T_i - 1} \mathbf{1}_{\{X_n = j\}} \right] = E_i\left[ \sum_{n=0}^{\infty} \mathbf{1}_{\{X_n = j,\, T_i > n\}} \right] = \sum_{n=0}^{\infty} P_i(X_n = j,\, T_i > n). \]

On the other hand, so long as j 6= i we have Ei [1{XTi =j} ] = 0 and so


"T # "T −1 #
i i

1{Xn =j} = Ei 1{Xn =j} = γ(i, j).


X X
Ei
n=0 n=0

Thus, in the case j 6= i, we have


"T # "T # "T −1 #
i i i

1{Xn =j} = Ei 1{Xn =j} = Ei 1{Xm+1 =j}


X X X
γ(i, j) = Ei
n=0 n=1 m=0

where the second equality follows from the fact that Ei [1X0 =j ] = 0. On the other hand,
when j = i, γ(i, i) = 1, and the right-hand side also equals 1: indeed,
\[ \sum_{m=0}^{T_i - 1} \mathbf{1}_{\{X_{m+1} = i\}} = \sum_{n=1}^{T_i} \mathbf{1}_{\{X_n = i\}} = 1 \quad\text{almost surely,} \]

by definition of Ti . So for all states j we have


"T −1 # ∞
i

1{Xn+1 =j} =
X X
γ(i, j) = Ei Pi (Xn+1 = j, Ti > n).
n=0 n=0

All told, we have the two seemingly competing equalities:



\begin{align*}
\gamma(i,j) &= \sum_{n=0}^{\infty} P_i(X_n = j,\, T_i > n), \quad\text{and} \\
\gamma(i,j) &= \sum_{n=0}^{\infty} P_i(X_{n+1} = j,\, T_i > n).
\end{align*}

Therefore, applying the Markov property, we have


\begin{align*}
\sum_{k\in S} \gamma(i,k)\,p(k,j) &= \sum_{k\in S}\sum_{n=0}^{\infty} P_i(X_n = k,\, T_i > n)\,p(k,j) \\
&= \sum_{n=0}^{\infty}\sum_{k\in S} P_i(X_n = k,\, X_{n+1} = j,\, T_i > n) \\
&= \sum_{n=0}^{\infty} P_i(X_{n+1} = j,\, T_i > n) = \gamma(i,j),
\end{align*}
as claimed.

Corollary 10.1. If i is a positive recurrent state, then the stationary distribution π defined in
Theorem 9.6 satisfies
\[ \pi(i) = \frac{1}{E_i[T_i]}. \]
Proof. This follows immediately from the definition of π, since γ(i, i) = 1. 

Corollary 10.1 gives a more succinct interpretation of the components of the stationary
distribution, should it exist; it also points out the necessity of positive recurrence for the
existence of a stationary distribution with positive entries. The next result makes this fact
clearer.
Proposition 10.2. Suppose (Xn )n≥0 is a time-homogeneous Markov chain with state space S,
and suppose the chain possesses a stationary distribution π.
(1) If (Xn )n≥0 is irreducible, then π(j) > 0 for all j ∈ S.
(2) In general, if π(j) > 0, then j is positive recurrent.
Proof. (1) Since π is a stationary distribution, π = πP and hence by induction π = πP^n for
all n. Thus, π(j) = Σ_{i∈S} π(i) p_n(i, j). There must exist some i with π(i) > 0 (since π(i) ≥ 0
for all i and Σ_i π(i) = 1 > 0). By irreducibility, for this choice of i, there is some n with
p_n(i, j) > 0. Hence π(j) ≥ π(i) p_n(i, j) > 0.
(2) Suppose that j is not positive recurrent. Start the chain in distribution π: so P(X0 =
i) = π(i) for all i. Denote by Vn = Vn (j) the number of visits (after time 0) to j up to time
n:
\[ V_n(j) = \sum_{m=1}^{n} \mathbf{1}_{\{X_m = j\}}. \]
Since π is a stationary distribution, we have
\[ E_\pi[V_n(j)] = \sum_{m=1}^{n} E_\pi[\mathbf{1}_{\{X_m = j\}}] = \sum_{m=1}^{n} P_\pi(X_m = j) = \sum_{m=1}^{n} \pi(j) = n\,\pi(j). \]

Now, let Tjk be the time of the kth visit to state j: Tjk = min{n ≥ 0 : Vn (j) = k}. We claim
that
\[ P_\pi\left( \lim_{k\to\infty} \frac{T_j^k}{k} = \infty \right) = 1. \tag{10.1} \]

Assume, for the moment, that we have proved Equation 10.1. It follows that, for fixed
M > 0, the probability that T_j^k / k ≤ M is small for large k. In particular, there must exist
N so that P(T_j^N / N ≤ M) ≤ 1/M. Note that
\[ T_j^N / N \le M \iff \min\{n : V_n(j) = N\} \le NM. \]
Now, if VN M (j) ≥ N (there are at least N visits to j in the first N M time steps) then the
minimal time n by which there are N visits to j is ≤ N M . In other words
{VN M (j) ≥ N } ⊂ {TjN ≤ N M }
and so
\[ P_\pi(V_{NM}(j) \ge N) \le P_\pi(T_j^N \le NM) \le \frac{1}{M}. \]
But this implies that
\[ NM\,\pi(j) = E_\pi[V_{NM}] = \sum_{k=1}^{NM} P_\pi(V_{NM} \ge k) \]
since VN M ≤ N M . We may estimate this sum as follows:
\[ \sum_{k=1}^{NM} P_\pi(V_{NM} \ge k) = \sum_{k=1}^{N} P_\pi(V_{NM} \ge k) + \sum_{k=N+1}^{NM} P_\pi(V_{NM} \ge k). \]

The first sum has N terms each ≤ 1, so is ≤ N . Since the events {VN M ≥ k} are decreasing
as k increases, the second sum may be bounded by
\[ \sum_{k=N+1}^{NM} P_\pi(V_{NM} \ge k) \le \sum_{k=N+1}^{NM} P_\pi(V_{NM} \ge N) \le \sum_{k=N+1}^{NM} \frac{1}{M} = \frac{NM - N}{M} < N. \]
Hence, we have NM\,π(j) < 2N, and so π(j) < 2/M for each M > 0, which contradicts the
assumption that π(j) > 0.
It remains to verify Equation 10.1. By assumption j is not positive recurrent. If j is
transient, then Tjk = ∞ for sufficiently large k, proving Equation 10.1 trivially in this case.
So, suppose j is null recurrent. Define
\[ \tau_j^k = T_j^{k+1} - T_j^k \]
to be the duration of the kth excursion away from state j. Then we can write T_j^2 = T_j^1 +
τ_j^1, T_j^3 = T_j^2 + τ_j^2, and so forth. Clearly each T_j^k is a stopping time. Hence, by the strong
Markov property, τ_j^1, τ_j^2, . . . are independent and identically distributed. Moreover, each has the
same distribution as T_j under P_j; hence E(τ_j^k) = ∞ by the assumption of
null recurrence. Now
\[ \frac{T_j^k}{k} = \frac{T_j^1 + \tau_j^1 + \cdots + \tau_j^{k-1}}{k} = \frac{T_j^1}{k} + \frac{k-1}{k}\cdot\frac{\tau_j^1 + \cdots + \tau_j^{k-1}}{k-1}. \]
The first term tends to 0 since Tj1 < ∞ (as the state j is recurrent). By the Strong Law
of Large Numbers, the second term tends to E(Tj ) = ∞ with probability 1. This proves
Equation 10.1. 

11. L ECTURE 11: A PRIL 25, 2012


Proposition 10.2 allows us to conclude that positive recurrence is a class property.
Corollary 11.1. For an irreducible Markov chain, the following are equivalent:
(1) There exists a stationary distribution with all entries > 0.
(2) There exists a stationary distribution.
(3) There exists a positive recurrent state.
(4) All states are positive recurrent.
Proof. The implication (4) =⇒ (3) is trivial. The implication (3) =⇒ (2) is the statement of
Theorem 9.6. Implication (2) =⇒ (1) is statement (1) of Proposition 10.2, while implication
(1) =⇒ (4) is statement (2) of Proposition 10.2. 
Example 11.2. Consider a birth and death chain with homogeneous probabilities: S =
{0, 1, 2, . . .} and p(i, i + 1) = q ∈ (0, 1) for i ∈ S while p(i, i − 1) = 1 − q for i ≥ 1
and p(0, 0) = 1 − q. Since 0 < q < 1 it is easy to see that the chain is irreducible. We
wish to determine if there exists a stationary distribution. If π is stationary, then π(0) =
π(0)(1 − q) + π(1)(1 − q), which we can rearrange as
\[ \pi(1) = \beta\,\pi(0), \qquad \beta = \frac{q}{1-q}. \]
For i ≥ 1, we have π(i) = π(i−1)q +π(i+1)(1−q), which we can rearrange in the standard
way as
π(i + 1) − π(i) = β[π(i) − π(i − 1)].
It is easy to verify that both these conditions are satisfied by π(i) = cβ^i for some constant c.
Of course, we want π to be a distribution, so π(i) ≥ 0 and Σ_i π(i) = 1. The latter is only
possible if β < 1, in which case
\[ \sum_{i=0}^{\infty} c\beta^i = \frac{c}{1-\beta} = \frac{c}{1 - \frac{q}{1-q}} = c\,\frac{1-q}{1-2q}. \]
Thus, when 0 < q < 1/2 (so that 0 < β < 1), if we take c = (1−2q)/(1−q), we find that
\[ \pi(i) = \frac{1-2q}{1-q}\left(\frac{q}{1-q}\right)^{i} \]
is a stationary distribution for the chain. By Corollary 11.1, it follows that all
states are positive recurrent in this case.
Now, when q = 1/2, the situation is very similar to the SRW on Z: we have π(i+1) − π(i) =
π(i) − π(i−1) for all i ≥ 1, and so unless π(1) − π(0) = 0 the sequence π is a nonconstant
arithmetic progression and cannot be a distribution; if π(1) = π(0), it follows that π(i) = π(0) for all i
and again π cannot be summable unless it is constantly 0, and so it is not a distribution. Thus
the chain does not possess an invariant distribution when q = 1/2, and so no states are
positive recurrent.
exactly as a SRW on Z until time T0 , and since SRW is recurrent, it follows that this chain
is recurrent. Hence, all its states are null recurrent.
Finally, suppose q > 1/2. The same first step analysis we did in Example 2.6 (biased SRW)
shows that
\[ P_i(T_N < T_0) = \frac{1 - \left(\frac{1-q}{q}\right)^{i}}{1 - \left(\frac{1-q}{q}\right)^{N}} \]

which shows that lim_{N→∞} P_1(T_N < T_0) = 1 − (1−q)/q = (2q−1)/q > 0 in this case. Hence, there is
a positive probability that the chain visits all sufficiently large N before visiting 0, so 0 is
(and hence all states are) transient.
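A small numerical sanity check of the positive recurrent case: with the closed-form π above, the stationarity equations π(0) = (1−q)(π(0)+π(1)) and π(i) = qπ(i−1) + (1−q)π(i+1) should hold. Here is a minimal sketch (the value q = 0.3 and the truncation are arbitrary illustrative choices):

```python
q = 0.3                                   # up-probability, with q < 1/2
beta = q / (1 - q)
pi = [(1 - 2*q) / (1 - q) * beta**i for i in range(50)]

# Check the stationarity equations pi = pi P (up to the truncation).
assert abs(pi[0] - (1 - q) * (pi[0] + pi[1])) < 1e-12
for i in range(1, 40):
    assert abs(pi[i] - (q * pi[i-1] + (1 - q) * pi[i+1])) < 1e-12
print(sum(pi))                            # close to 1; the geometric tail is tiny
```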

11.1. Average Convergence to the Stationary Distribution: The Ergodic Theorem. We


have now seen that (in the irreducible case) the existence of a stationary distribution is
equivalent to the existence of a positive recurrent state (which is the same as all states
being positive recurrent). Even in the finite-state case, however, we saw examples (cf.
Example 6.3(3), SRW on an even-length cycle) where a stationary distribution exists, but
the chain does not converge to it; indeed, periodicity plays a role too. Nevertheless, there
is still always a sense in which a chain possessing a stationary distribution converges to it:
in terms of running average.
Theorem 11.3 (Ergodic Theorem). Let (Xn )n≥0 be an irreducible recurrent Markov chain with
state space S. Let j ∈ S be fixed, and define
\[ \pi(j) = \frac{1}{E_j[T_j]} \]
(with the understanding that π(j) = 0 if j is null recurrent, i.e. if E_j[T_j] = ∞). Let V_n(j) =
Σ_{m=1}^n 1_{\{X_m = j\}} be the number of visits to state j up to time n. Then for any state i,
\[ P_i\left( \lim_{n\to\infty} \frac{V_n(j)}{n} = \pi(j) \right) = 1 \]
and
\[ \lim_{n\to\infty} \frac{1}{n}\sum_{m=1}^{n} p_m(i,j) = \pi(j). \]

Remark 11.4. In general, the claim (1/n) Σ_{m=1}^n p_m → p is weaker than the claim p_n → p.
When p_n ≥ 0 and p = 0, however, convergence of the averages to 0 forces p_n → 0 along a set of
indices of density one, and (with a little more work, using the Markov structure) one can show that the
full sequence converges. Hence the theorem has the interesting corollary that, in a null recurrent
irreducible chain, p_n(i, j) → 0 for all i, j. This is borne out, for example, in SRW on Z and Z².
Proof. The proof is quite similar to the proof of Proposition 10.2(2). Since j is a recurrent
state, we know that V_n(j) → ∞ as n → ∞ with P_i-probability 1. It follows that V_n(j)/(V_n(j)+1) → 1
with P_i-probability 1. Now, since there were V_n(j) visits to j by time n, the V_n(j)th visit to
j occurred at a time ≤ n. Using the times T_j^k from the proof of Proposition 10.2, this means
\[ T_j^{V_n(j)} \le n \]
with P_i-probability 1. On the other hand, the (V_n(j)+1)st visit to j occurs at a time ≥ n, meaning that
\[ T_j^{V_n(j)+1} \ge n \]
with P_i-probability 1. Hence T_j^{V_n(j)} ≤ n ≤ T_j^{V_n(j)+1}. Dividing through by V_n(j), we have
\[ \frac{T_j^{V_n(j)}}{V_n(j)} \le \frac{n}{V_n(j)} \le \frac{T_j^{V_n(j)+1}}{V_n(j)+1}\cdot\frac{V_n(j)+1}{V_n(j)}. \]

In the proof of Proposition 10.2, we showed that T_j^k / k → E_j[T_j] with P_i-probability 1 (using
the Strong Law of Large Numbers). Hence, by the Squeeze Theorem, n / V_n(j) → E_j[T_j] as
n → ∞ with P_i-probability 1. Taking reciprocals proves the first equation.
For the second equality, note that since lim_{n→∞} V_n(j)/n = π(j) with P_i-probability 1 and
0 ≤ V_n(j)/n ≤ 1, bounded convergence gives E_i[V_n(j)/n] → π(j) as n → ∞. The proof is then
concluded by noting that
\[ E_i\left[ \frac{V_n(j)}{n} \right] = \frac{1}{n}\,E_i\left[ \sum_{m=1}^{n} \mathbf{1}_{\{X_m = j\}} \right] = \frac{1}{n}\sum_{m=1}^{n} E_i[\mathbf{1}_{\{X_m = j\}}] = \frac{1}{n}\sum_{m=1}^{n} P_i(X_m = j) = \frac{1}{n}\sum_{m=1}^{n} p_m(i,j). \]


12. L ECTURE 12: A PRIL 27, 2012


12.1. The Convergence Theorem. In order to have genuine convergence to the stationary
distribution, we know (from the finite-state case, cf. Theorem 7.4) that aperiodicity is also
required. As in the finite-state case, it is also sufficient.
Theorem 12.1. Let (Xn )n≥0 be an irreducible, aperiodic Markov chain possessing a stationary
distribution π. Then for any states i, j, limn→∞ pn (i, j) = π(j).
Remark 12.2. (1) Since limits of real number sequences are unique, it follows that an irre-
ducible, aperiodic Markov chain has at most one stationary distribution (i.e. if a stationary
distribution exists, then it is unique). Moreover, from Theorem 11.3 (in fact from Corollary
10.1), it follows that this stationary distribution is given by π(i) = 1/E_i[T_i].
(2) It is a fact, which we won’t prove however, that the uniqueness of the stationary distri-
bution does not require aperiodicity. Any irreducible Markov chain possesses at most one
stationary distribution; aperiodicity is only required to guarantee convergence toward the
stationary distribution.
Example 12.3. Consider SRW on a 5-cycle. This graph is connected so the SRW is irre-
ducible. Also, because there are an odd number of vertices, there are paths of odd length
from any state i to itself (going all around the cycle), and so d(i) = 1. Hence, this is an
irreducible, aperiodic, finite-state Markov chain, and so possesses an invariant distribu-
tion to which the process converges in distribution. It is easy to check that the invariant
distribution is π(i) = 1/5 for each i. This immediately shows us that E_i[T_i] = 5 for each i.
Even more, we note from Theorem 9.6 that π(j) = γ(i,j)/E_i[T_i] = (1/5)γ(i,j) for all pairs i, j. So we
can quickly calculate that the expected number of visits to any state j before returning to
i satisfies 1/5 = (1/5)γ(i,j); in other words, γ(i,j) = 1.
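Both conclusions are easy to see in a quick simulation (a sketch using only the standard library; the number of trials is an arbitrary choice):

```python
import random

# Estimate E_0[T_0] for simple random walk on the 5-cycle by simulation.
trials, total = 100_000, 0
for _ in range(trials):
    state, steps = 0, 0
    while True:
        state = (state + random.choice([-1, 1])) % 5
        steps += 1
        if state == 0:
            break
    total += steps
print(total / trials)   # should be close to E_0[T_0] = 5
```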
Proof of Theorem 12.1. Consider the conditioned Markov chain (Xni )n≥0 where X0i = i, and
(Xni )n≥0 has the same transition probabilities p(i, j) as X. Then pn (i, j) = Pi (Xn = j) =
P(Xni = j). Similarly, let (Xnπ )n≥0 be a Markov chain independent of (Xni )n≥0 , which has
initial distribution π: P(X0 = i) = π(i), and also has transition probabilities p(i, j). Since
π is stationary, it follows that P(Xnπ = i) = π(i) for all n.
Let T = min{n ∈ N : Xni = Xnπ } (clearly T is a stopping time). Define a new process
(Y_n)_{n≥0} by
\[ Y_n = \begin{cases} X_n^i, & n \le T \\ X_n^\pi, & n > T. \end{cases} \]
Since T is a stopping time, (Yn ) is a Markov chain with transition probabilities p(i, j),
which satisfies Y0 = i; thus, P(Yn = j) = pn (i, j). Now, for any j, if Yn = j then either
Xnπ = j or Xnπ 6= Yn ; in other words {Yn = j} ⊆ {Xnπ = j} ∪ {Xnπ 6= Yn }, and so
P(Yn = j) ≤ P(Xnπ = j) + P(Xnπ 6= Yn ).
Similarly, {Xnπ = j} ⊆ {Yn = j} ∪ {Xnπ 6= Yn } and so
P(Xnπ = j) ≤ P(Yn = j) + P(Xnπ 6= Yn ).
Together, these say that |P(Y_n = j) − P(X_n^π = j)| ≤ P(Y_n ≠ X_n^π). It follows that we can
estimate
\[ |p_n(i,j) - \pi(j)| = |P(Y_n = j) - P(X_n^\pi = j)| \le P(Y_n \ne X_n^\pi) = P(T > n). \]

Hence, to prove the theorem, it suffices to prove that limn→∞ P(T > n) = 0.
Consider the Markov chain Zn = (Xni , Xnπ ) (whose state space is S × S).
Claim. For any state j ∈ S, P(Zn = (j, j) for some n) = 1.
Once the Claim is proved, it follows immediately that P(T < ∞) = 1, which implies
P(T > n) tends to 0 as n → ∞, as required to prove the theorem. To prove the Claim, we
refer to Exercise 1 on Homework 3, which states that if (Zn )∞n=0 is an irreducible, recurrent
Markov chain, then Pi (Zn = j for some n) = 1 for all states i, j. As such, it suffices to show
that Zn is irreducible and recurrent. In fact, we will show it is irreducible and positive
recurrent, by demonstrating (for the latter) that it possesses an invariant distribution.
The key observation is that, since Xni and Xnπ are independent by construction, it follows
that the transition probabilities q((i_1, i_2), (j_1, j_2)) for the chain Z_n are
q ((i1 , i2 ), (j1 , j2 )) = p(i1 , j1 )p(i2 , j2 ).
• Zn is irreducible. Let (i1 , i2 ) and (j1 , j2 ) be any two states in S × S. By assumption,
both X_n^i and X_n^π have the irreducible transition kernel p, and so there are times m_1 and m_2 where p_{m_1}(i_1, j_1) > 0
and p_{m_2}(i_2, j_2) > 0. Since X_n is aperiodic, we know (cf. the remark preceding the
proof of Theorem 7.4) that p_n(j_1, j_1) > 0 and p_n(j_2, j_2) > 0 for all sufficiently
large n; in particular, for all sufficiently large k we have pm1 +k (j1 , j1 ) > 0 and
pm2 +k (j2 , j2 ) > 0. Thus
qm1 +m2 +k ((i1 , i2 ), (j1 , j2 )) = pm1 +m2 +k (i1 , j1 )pm1 +m2 +k (i2 , j2 )
≥ [pm1 (i1 , j1 )pm2 +k (j1 , j1 )][pm2 (i2 , j2 )pm1 +k (j2 , j2 )] > 0.
• Zn positive recurrent. By Corollary 11.1, it suffices to show there exists a stationary
distribution for Zn . Define θ(i, j) = π(i)π(j) for (i, j) ∈ S × S. Then
\begin{align*}
\sum_{(k,\ell)\in S\times S} \theta(k,\ell)\, q((k,\ell),(i,j)) &= \sum_{(k,\ell)\in S\times S} \pi(k)\pi(\ell)\, p(k,i)\, p(\ell,j) \\
&= \left[ \sum_{k\in S} \pi(k)\,p(k,i) \right]\left[ \sum_{\ell\in S} \pi(\ell)\,p(\ell,j) \right] \\
&= \pi(i)\pi(j) = \theta(i,j)
\end{align*}
because π is stationary for Xn . Thus, θ is stationary for Zn .


13. L ECTURE 13: A PRIL 30, 2012


13.1. Time Reversal. Let (Xn )n≥0 be a Markov chain. For a fixed time N , by definition
XN +1 is conditionally independent of X0 , X1 , . . . , XN −1 , conditioned on XN . In other
words, the past and the future are independent, conditioned on the present. Stated in
this symmetric manner, it is natural to think of a Markov chain running backward in time,
and expect this process to be another Markov chain. In fact, it is sometimes the same
Markov chain.
Example 13.1. Let (Xn )n≥0 be SRW on Z. Then for any trajectory (i0 , i1 , . . . , iN ) for the first
N steps, we have P(X0 = i0 , . . . , XN = iN ) = 0 unless in+1 = in ± 1 for each n; in this case,
each such trajectory has probability 2−N .
Now, let Yn = XN −n for 0 ≤ n ≤ N . Then for any trajectory j0 , . . . , jN ,
P(Y_0 = j_0, . . . , Y_N = j_N) = P(X_0 = j_N, X_1 = j_{N−1}, . . . , X_N = j_0).
The condition jn+1 = jn ± 1 is equivalent to jn−1 = jn ± 1, and so we once again have the
joint distribution of (Y0 , . . . , YN ) is supported on the same trajectories as that of (X0 , . . . , XN ),
and each such trajectory has probability 2−N , the same as for (X0 , . . . , XN ).
Example 13.1 gives an appropriate sense we can give to the statement “the chain looks
the same forwards and backwards”: we consider the joint distribution of (X_0, . . . , X_N)
and (X_N, X_{N−1}, . . . , X_0). In this case, they are both the same. This will not always occur;
but what is true is that, started/ended in the right distribution, the backward process will
still be a Markov chain. We have to be careful, though. If we (as is typical) condition
the process (Xn )n≥0 to start in a particular state, X0 = i, we cannot expect the reversed
process Yn to be Markovian; for example, if Xn is irreducible and positive recurrent, then
XN will be close to stationarity for large N , not concentrated near the state i.
The best situation where we can hope for the reversed process to exhibit Markovian
behavior is when it starts (and therefore stays) in its stationary state.
Theorem 13.2. Let (Xn )n≥0 be an irreducible Markov chain possessing a stationary distribution
π. Let N ∈ N, and for 0 ≤ n ≤ N define Yn = XN −n . Then (Yn )0≤n≤N is an irreducible Markov
chain with the same stationary distribution, and transition probabilities q(i, j) given by
\[ q(i,j) = \frac{\pi(j)}{\pi(i)}\, p(j,i). \tag{13.1} \]
Remark 13.3. (1) Recall from Proposition 10.2 that the stationary distribution has strictly
positive entries; hence there is no trouble dividing by π(i) in Equation 13.1.
(2) Note that q(i, j) ≥ 0 from Equation 13.1, and for fixed i
\[ \sum_{j} q(i,j) = \frac{1}{\pi(i)} \sum_{j} \pi(j)\,p(j,i) = \frac{1}{\pi(i)}\cdot \pi(i) = 1 \]
where the penultimate equality follows from the stationarity of π for p. Hence q is still
stochastic, and so indeed gives the transition probabilities of some Markov chain.
Proof. First let’s check that π is still stationary for q:
\[ \sum_{i} \pi(i)\,q(i,j) = \sum_{i} \pi(i)\,\frac{\pi(j)}{\pi(i)}\,p(j,i) = \sum_{i} \pi(j)\,p(j,i) = \pi(j)\sum_{i} p(j,i) = \pi(j)\cdot 1 = \pi(j) \]

where the penultimate equality is just the statement that p is stochastic. Thus, π is q-
stationary as well as p-stationary.
Now, for a given trajectory i0 , . . . , iN , we have
P(Y0 = i0 , Y1 = i1 , . . . , YN = iN ) = P(X0 = iN , X1 = iN −1 , . . . , XN = i0 )
= P(X0 = iN )p(iN , iN −1 ) · · · p(i2 , i1 )p(i1 , i0 )
= π(iN )p(iN , iN −1 ) · · · p(i2 , i1 )p(i1 , i0 )
by the Markov property and the assumption that Xn is in the stationary distribution. If
we rewrite Equation 13.1 in the form
π(j)p(j, i) = q(i, j)π(i)
then we can proceed by induction and write
π(iN )p(iN , iN −1 )p(iN −1 , iN −2 ) · · · p(i2 , i1 )p(i1 , i0 )
= q(iN −1 , iN )π(iN −1 )p(iN −1 , iN −2 ) · · · p(i2 , i1 )p(i1 , i0 )
= q(iN −1 , iN )q(iN −2 , iN −1 )π(iN −2 ) · · · p(i2 , i1 )p(i1 , i0 )
= q(iN −1 , iN )q(iN −2 , iN −1 ) · · · q(i1 , i2 )π(i1 )p(i1 , i0 )
= q(iN −1 , iN )q(iN −2 , iN −1 ) · · · q(i1 , i2 )q(i0 , i1 )π(i0 ).
Thus, we have
P(Y0 = i0 , Y1 = i1 , . . . , YN = iN ) = π(i0 )q(i0 , i1 )q(i1 , i2 ) · · · q(iN −1 , iN )
which is precisely to say that (Yn )0≤n≤N is a Markov chain.
Finally, since p is irreducible, given states i, j there is some finite n so that pn (i, j) > 0.
By the definition of matrix multiplication,
\[ p_n(i,j) = \sum_{i_1,\dots,i_{n-1}} p(i, i_1)\,p(i_1, i_2) \cdots p(i_{n-1}, j) \]

and since all these terms are ≥ 0 and pn (i, j) > 0 it must be that there is at least one
trajectory i, i_1, . . . , i_{n−1}, j such that p(i, i_1)p(i_1, i_2) · · · p(i_{n−1}, j) > 0. But then
\begin{align*}
q(j, i_{n-1})\,q(i_{n-1}, i_{n-2}) \cdots q(i_1, i)
&= \frac{\pi(i_{n-1})}{\pi(j)}\,p(i_{n-1}, j)\cdot\frac{\pi(i_{n-2})}{\pi(i_{n-1})}\,p(i_{n-2}, i_{n-1}) \cdots \frac{\pi(i)}{\pi(i_1)}\,p(i, i_1) \\
&= \frac{\pi(i)}{\pi(j)}\,p(i, i_1)\,p(i_1, i_2) \cdots p(i_{n-1}, j) > 0.
\end{align*}
Thus, at least one term in the matrix multiplication sum for qn (i, j) is > 0, so qn (i, j) > 0.
The symmetric argument for qn (j, i) shows that q is irreducible, as claimed. 
Remark 13.4. The final step in the above proof highlights a(n obvious) alternate definition
of irreducibility of a Markov chain, or more precisely of communication classes. Say j
can be reached from i if there is a finite trajectory i1 , i2 , . . . , in−1 so that p(i, i1 ), p(i1 , i2 ), . . . ,
p(in−1 , j) are all > 0. Then say i and j can communicate if j can be reached from i and i can
be reached from j. The above proof shows that this is equivalent to the usual definition
of communication classes.

13.2. Reversible Markov Chains. Theorem 13.2 shows that a time reversed Markov chain
(in stationary distribution) is still a Markov chain, but with a potentially different tran-
sition matrix. It is natural to ask when it happens that the reversed chain is actually the
same as the original chain.
Definition 13.5. Let (Xn )n≥0 be a (time-homogeneous) Markov chain, with countable state space
S and transition probabilities p(i, j) for i, j ∈ S. Call the chain reversible if there is a function
π : S → [0, ∞), not identically 0, which verifies the detailed balance condition:
π(i)p(i, j) = π(j)p(j, i), ∀ i, j ∈ S. (13.2)
The detailed balance condition is invariant under scaling – that is, if π(i)p(i, j) = π(j)p(j, i)
then [λπ(i)]p(i, j) = [λπ(j)]p(j, i) as well for any λ > 0. Thus, if the function π in Definition
13.5 is summable, then we can scale it to be a distribution. It is not hard to see that
Corollary 13.6. If (Xn )n≥0 is a reversible Markov chain that possesses a unique stationary distri-
bution (i.e. if it is irreducible and positive recurrent), then that stationary distribution witnesses
the detailed balance condition.
Proof. We need only show that π is stationary for p; this follows from the same argument
in the proof of Theorem 13.2:
\[ \sum_{i\in S} \pi(i)\,p(i,j) = \sum_{i\in S} \pi(j)\,p(j,i) = \pi(j)\sum_{i\in S} p(j,i) = \pi(j)\cdot 1 = \pi(j). \]

It is possible, however, for a π verifying the detailed balance condition to exist even if
a stationary distribution does not exist (cf. Example 13.1).
Example 13.7. Suppose the Markov chain is symmetric: the transition probabilities satisfy
p(i, j) = p(j, i). In this case, trivially the constant vector π = (c, c, . . . , c) witnesses the
detailed balance condition for any c. If there are a finite number of states, N, it follows that
π(i) = 1/N for each i is a stationary distribution, and the chain is reversible. If there are
infinitely many states, there may not be any distribution that witnesses the detailed balance
condition. For example, take SRW on Z, whose transition probabilities are symmetric:
p(i, j) = p(j, i) has value 1/2 if |i − j| = 1 and 0 otherwise. Here any constant π = (c, c, c, . . .)
witnesses the detailed balance condition, and it is easy to see that only constants do. (Indeed,
we know there is no stationary distribution.)
Perhaps the most important reason reversible processes are nice is that the detailed
balance condition provides a much easier way to find the stationary distribution than the
definition π = πP. Here is an important class of examples, cf. Definition 5.2.
Example 13.8. Let G be a countable graph with no isolated vertices. The SRW on G has
transition probabilities p(i, j) = 1/v_i if i ∼ j, where v_i is the valence of i (which is ≥ 1 since i is not
isolated), and p(i, j) = 0 if i ≁ j. Then we can check that π(i) = v_i satisfies
\[ \pi(i)\,p(i,j) = \pi(j)\,p(j,i) = \begin{cases} 1, & i \sim j \\ 0, & i \not\sim j \end{cases} \]
so the “valence vector” π satisfies the detailed balance condition, and SRWs on graphs are
reversible. In particular, this shows that if the graph is finite, there exists a stationary
distribution: just normalize, π(i) = v_i / Σ_j v_j.

Example 13.9. Capitalizing on the previous example, consider the following problem. A
Knight starts at a corner of a standard 8 × 8 chessboard, and on each step moves at random. How
long, on average, does it take to return to its starting position?
If i is a corner, the question is what is E_i[T_i], where the Markov process in question is the
one described in the problem:
chessboard, and two squares are connected in the graph if a Knight can move from one
to the other. (Note: a Knight's move displaces it by 1 in one direction and 2 in the other;
each square j reachable from i corresponds to exactly one such move, so choosing uniformly
among legal moves or among reachable squares amounts to the same thing.) This graph is connected (as a little
thought will show), and so the SRW is irreducible; therefore there is a unique stationary
distribution. By Example 13.8, the stationary distribution π is given by π(i) = v_i / Σ_j v_j,
and so by the Ergodic Theorem, E_i[T_i] = 1/π(i) = Σ_j v_j / v_i.
A Knight moves 2 units on one direction and 1 unit in the other direction. Starting at
a corner (1, 1), the Knight can only more to the positions (3, 2) and (2, 3), so vi = 2. To
solve the problem, we need to calculate vj for all starting positions j on the board. By
square symmetry, we need only calculate the numbers for the upper 4 × 4 grid. An easy
calculation then gives these valences as
\[ \begin{pmatrix}
2 & 3 & 4 & 4 \\
3 & 4 & 6 & 6 \\
4 & 6 & 8 & 8 \\
4 & 6 & 8 & 8
\end{pmatrix} \]
The sum is 84, and so the sum over the full chess board is 4 · 84 = 336. Thus, the expected
number of steps for the Knight to return to the corner is 336/2 = 168.
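The valence count, and hence the answer 168, can be reproduced with a short script (a sketch; coordinates run over the 8 × 8 grid, indexed from 0):

```python
# Expected return time of the Knight to a corner: E_i[T_i] = (sum_j v_j) / v_i.
moves = [(1, 2), (2, 1), (-1, 2), (-2, 1), (1, -2), (2, -1), (-1, -2), (-2, -1)]

def valence(x, y):
    # Number of legal Knight moves from square (x, y) on the 8x8 board.
    return sum(0 <= x + dx < 8 and 0 <= y + dy < 8 for dx, dy in moves)

total = sum(valence(x, y) for x in range(8) for y in range(8))
corner = valence(0, 0)
print(total, corner, total / corner)   # 336 2 168.0
```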

14. L ECTURE 14: M AY 2, 2012


One of the main uses of Markov chains is sampling (approximately) from probability
distributions that are computationally intractable.
Example 14.1. A Hard Core Configuration on an N × N grid, HCCN , is a function
c : {1, . . . , N }2 → {0, 1}
such that if c(i, j) = 1 then the four neighboring values c(i ± 1, j) and c(i, j ± 1) are 0. (I.e. there are no adjacent
vertices in the square lattice both having value 1.) This is a simple model of the configura-
tion of an ideal gas (the 1s represent gas molecules, the 0s empty space). We would like to
pick a random hard-core configuration – meaning sample from the uniform distribution
on HCC_N, so each c is chosen with weight 1/|HCC_N|. The problem is that |HCC_N| is
unknown, even to exponential order, for large N. A simple upper-bound is 2^{N²} (which
counts the set of all configurations, functions f : {1, . . . , N}² → {0, 1} without constraint).
It is conjectured that |HCC_N| ∼ β^{N²} for some β ∈ (1, 2), but no one even has a good guess
as to what β is.
So, how do we sample from a distribution we cannot even approximate? By using
the following Markov chain (Xn )n≥0 . The state space is S = HCCN , and the transition
probabilities for c ≠ c' are:
\[ p(c, c') = \begin{cases} 1/N^2, & \text{if } c, c' \text{ differ at exactly one point in } \{1,\dots,N\}^2, \\ 0, & \text{otherwise.} \end{cases} \]
For any fixed hard-core configuration c, there are at most N² hard-core configurations
that differ from it in one position, so Σ_{c'≠c} p(c, c') ≤ 1. We therefore define p(c, c) = 1 −
Σ_{c'≠c} p(c, c'), and this gives a valid Markov transition kernel.

It is actually easy to implement this chain computationally: at each step, we choose one
of the N² points in the grid uniformly at random, and change the value of the configura-
tion at that point if we can; if the change would violate the HCC condition, then we leave
the configuration unchanged for the next step. To be precise: if X_n = c, choose a point
(i, j) in the grid uniformly at random from all N²; if c(i, j) = 1, then let X_{n+1} = c' where
c'(i, j) = 0 and c'(k, ℓ) = c(k, ℓ) for (k, ℓ) ≠ (i, j); if c(i, j) = 0 and all four neighboring values
c(i ± 1, j), c(i, j ± 1) are 0, then similarly set X_{n+1} = c' where c'(i, j) = 1 and c'(k, ℓ) = c(k, ℓ)
for (k, ℓ) ≠ (i, j); if, on the other hand, c(i, j) = 0 but a neighboring point has c = 1, then
set X_{n+1} = X_n = c. It is
very easy to see that this algorithm produces a Markov chain with the above transition
kernel p(c, c0 ).
Note that p(c, c0 ) = p(c0 , c), so the transition kernel is symmetric. It follows that the
uniform distribution on the state space, π, satisfies the detailed balance condition with
p(c, c'), and hence the uniform distribution π is the stationary distribution for (X_n)_{n≥0}. (It
is easy to see that (X_n)_{n≥0} is irreducible; for example, to get from any c to any c', one can
begin by eliminating the 1s in c one by one to get to the 0 configuration, then add new 1s
one by one to reach c'.)
So, how do we select a uniformly random hard core configuration? Start at any config-
uration and run the Markov chain above. For large n, the distribution of Xn is approxi-
mately π, by the Convergence Theorem; so, for large n, Xn will be a (nearly) uniformly
random hard-core configuration.
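A minimal simulation of this chain might look as follows (a sketch; the grid size N and the run length are arbitrary choices, and the configuration is stored as a 0/1 array):

```python
import random

N, steps = 8, 100_000              # grid size and run length (arbitrary)
c = [[0] * N for _ in range(N)]    # start from the all-0 configuration

def neighbors_are_empty(c, i, j):
    # True if no nearest neighbor of (i, j) carries a 1.
    for di, dj in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
        ni, nj = i + di, j + dj
        if 0 <= ni < N and 0 <= nj < N and c[ni][nj] == 1:
            return False
    return True

for _ in range(steps):
    i, j = random.randrange(N), random.randrange(N)
    if c[i][j] == 1:
        c[i][j] = 0                      # removing a 1 is always allowed
    elif neighbors_are_empty(c, i, j):
        c[i][j] = 1                      # adding a 1 is allowed: no adjacent 1s
    # otherwise the proposed change is rejected and c is left unchanged

print(sum(map(sum, c)))   # occupied sites in the (approximately uniform) sample
```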

Example 14.2. Consider the problem of graph colorings. Let G = (V, E) be a finite graph.
A q-coloring of G (with q ∈ N) is a function f : V → {1, 2, . . . , q} with the property that,
if u, v ∈ V with u ∼ v, then f (u) 6= f (v). The set of q-colorings is very hard to count (es-
pecially for small q), so again we cannot directly sample a uniformly random q-coloring.
Instead, define a Markov chain on the state space of all q-colorings f, where for f ≠ g
\[ p(f, g) = \begin{cases} \dfrac{1}{q|V|}, & \text{if } f \text{ and } g \text{ differ at exactly one vertex,} \\ 0, & \text{otherwise.} \end{cases} \]
Again: since there are at most q different q-colorings of G that differ from a given one at a
given vertex, and there are |V| vertices, we have Σ_{g≠f} p(f, g) ≤ 1, and so setting p(f, f) =
1 − Σ_{g≠f} p(f, g) yields a stochastic matrix p which is the transition kernel of a Markov chain. It is evidently
symmetric, and by considerations like those in Example 14.1 the chain is irreducible and
aperiodic (so long as q is large enough for any q-colorings to exist!); hence, the stationary
distribution is uniform, and this Markov chain converges to the uniform distribution.
To simulate the corresponding Markov chain: given Xn = f , choose one of the |V |
vertices, v, uniformly at random and one of the q colors, k, uniformly at random. If the
new function g defined by g(v) = k and g(w) = f (w) for w 6= v is a q-coloring of the
graph, set Xn+1 = g; otherwise, keep Xn+1 = Xn = f . This process has the transition
probabilities listed above, and so running it for large n gives an approximately uniformly
random q-coloring.

In Examples 14.1 and 14.2, we constructed Markov chains with symmetric kernels,
therefore having the uniform distribution as stationary, and converging to it. This is often
a good way to simulate uniform samples. But how can we sample from a non-uniform
distribution?

14.1. The Metropolis-Hastings Algorithm. We can generalize Examples 14.1 and 14.2
to give a method for sampling any (strictly positive) distribution π, by building a 2-step
Markov chain: one to propose moves, a second to accept/reject them. Here is the algo-
rithm.
Let S be a finite set, and let π be a (strictly positive) distribution on S. First, construct
an irreducible Markov chain on S which has transition probabilities q(i, j) = q(j, i), sym-
metric. This chain is thus reversible, and has the uniform distribution as its stationary
state; but π is not necessarily uniform. Instead, we construct a new Markov chain with
transition kernel
\[ p(i, j) = q(i, j)\,\min\left\{ 1, \frac{\pi(j)}{\pi(i)} \right\} \quad\text{for } i \ne j, \qquad p(i, i) = 1 - \sum_{j\ne i} p(i, j). \]

Once we have constructed this chain, we run it for a long time to sample from the distribu-
tion π: it is clear from the definition that p is irreducible (since q is), and
π(i)p(i, j) = q(i, j) min{π(i), π(j)} = q(j, i) min{π(j), π(i)} = π(j)p(j, i)
so π satisfies the detailed balance condition for p, and thus π is the stationary distribu-
tion for p. Hence, by the Convergence Theorem, the p-Markov chain converges to π in
distribution for large time.

It remains to be seen how to actually simulate the Markov chain with kernel p (given
that we know how to simulate the chain with kernel q). Here is the algorithm:
• Suppose we’ve reached Xn = i. For each j 6= i, propose the move Xn+1 = j
according to the original chain; i.e. propose the move i → j with probability q(i, j).
• Once the first move has been proposed (i.e. the proposed next state j has been
chosen), accept this move with probability min{1, π(j)/π(i)}; otherwise reject it and stay
in state i. In other words: if π(j) ≥ π(i), set X_{n+1} = j; if π(j) = απ(i) with
0 < α < 1, then set X_{n+1} = j with probability α and X_{n+1} = i with probability
1 − α.
It is an easy exercise to show that the transition kernel p above is the result of this two-step
Markov chain algorithm. Hence, it may be used to simulate any distribution π (approxi-
mately, for large run time). Note: the indeterminacy of q is a benefit here. It means we are
free to choose the initial (symmetric) chain to fit the symmetries of the given state space.
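Here is a minimal sketch of the two-step algorithm on a finite state space (the target weights and the symmetric proposal used below are illustrative assumptions, not part of the lecture; any unnormalized positive weights work, since only ratios π(j)/π(i) are used):

```python
import random

def metropolis_step(i, pi, propose):
    """One Metropolis step: propose j ~ q(i, .), accept with prob min(1, pi[j]/pi[i])."""
    j = propose(i)
    if random.random() < min(1.0, pi[j] / pi[i]):
        return j
    return i

# Illustrative target on {0,...,9} (unnormalized weights) and a symmetric
# proposal: move one step left or right on the cycle Z/10.
pi = [k + 1 for k in range(10)]
propose = lambda i: (i + random.choice([-1, 1])) % 10

counts = [0] * 10
state = 0
for _ in range(200_000):
    state = metropolis_step(state, pi, propose)
    counts[state] += 1
total = sum(counts)
print([c / total for c in counts])   # approximately (k+1)/55 for k = 0,...,9
```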
There are many variations on the Metropolis-Hastings algorithm, all with the same ba-
sic ingredients and goal: to use a (collection of correlated) Markov chain(s), designed to
have a given stationary distribution π, to sample from π approximately. They are col-
lectively known as Markov chain Monte Carlo simulation methods. One place they are
especially effective is in calculating high-dimensional integrals (i.e. expectations), a com-
mon problem in multivariate statistics.

14.2. A word on convergence rates. In order for the Metropolis-Hastings algorithm (or
other MCMC methods) to be effective, the Markov chain designed to converge to the
given distribution must converge reasonably fast. The term used to measure this rate
is the mixing time of the chain. (To make this precise, we need a precise measure of
closeness to the stationary distribution; then we can ask how long before the distance to
stationarity is less than ε > 0.) This is a topic of a great deal of current research, with
literally thousands of papers published. We will not even scratch the surface here. Let’s
just point out the connection to eigenvalues.
Suppose our Markov chain has N states, is irreducible and aperiodic (as Homework
3, Problem 3 points out, every finite-state Markov chain is arbitrarily close to one). Let’s
suppose further that the chain is symmetric; i.e. the transition matrix P = P> . Then by the
spectral theorem from linear algebra,
P = UDU>
where U, the eigenvector matrix (its columns are normalized eigenvectors), is an orthogonal
matrix (its columns are mutually orthogonal unit vectors), and
\[ \mathbf{D} = \begin{pmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_N \end{pmatrix} \]
where λ1 ≥ λ2 ≥ · · · ≥ λN are the eigenvalues of P. We also know, from the Perron-
Frobenius theorem (Corollary 6.6) and the assumption of irreducibility and aperiodicity,

λ1 = 1 and for j ≥ 2 |λj | < 1. (I.e. the largest magnitude eigenvalue is 1 and all others are
strictly smaller.) So 1 = λ1 > λ2 ≥ · · · ≥ λN > −1. Hence
\[ \lim_{n\to\infty} \mathbf{P}^n = \lim_{n\to\infty} \mathbf{U}\mathbf{D}^n\mathbf{U}^\top
= \mathbf{U}\left( \lim_{n\to\infty} \begin{pmatrix} 1 & & & \\ & \lambda_2^n & & \\ & & \ddots & \\ & & & \lambda_N^n \end{pmatrix} \right)\mathbf{U}^\top
= \mathbf{U}\begin{pmatrix} 1 & & & \\ & 0 & & \\ & & \ddots & \\ & & & 0 \end{pmatrix}\mathbf{U}^\top. \]
Moreover, we see that this convergence is exponentially-fast: setting P∞ equal to the
limit, we have
\[ \|\mathbf{P}^n - \mathbf{P}^\infty\| = \left\| \begin{pmatrix} 1 & & & \\ & \lambda_2^n & & \\ & & \ddots & \\ & & & \lambda_N^n \end{pmatrix} - \begin{pmatrix} 1 & & & \\ & 0 & & \\ & & \ddots & \\ & & & 0 \end{pmatrix} \right\|
= \max_{2\le j\le N} |\lambda_j|^n = \left( \max\{\lambda_2, |\lambda_N|\} \right)^n. \]
So, measuring in terms of matrix norm at least, the mixing rate is controlled by the size
of 1 − max{λ2 , |λN |}; if this is big, then mixing is fast; if it is small, mixing is slow. For
example:
\[ \begin{pmatrix} 1-\epsilon & \epsilon \\ \epsilon & 1-\epsilon \end{pmatrix}: \quad \lambda_1 = 1,\ \lambda_2 = 1 - 2\epsilon \quad\text{(slow mixing for small } \epsilon\text{)}; \qquad
\begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix}: \quad \lambda_1 = 1,\ \lambda_2 = 0 \quad\text{(fast mixing).} \]
Much of the technology for studying mixing times thus amounts to estimating the largest-
magnitude eigenvalues other than 1 of (huge) stochastic matrices.
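For small symmetric chains this spectral picture is easy to examine directly (a sketch assuming NumPy; the value of ε is an arbitrary choice):

```python
import numpy as np

eps = 0.01
P_slow = np.array([[1 - eps, eps], [eps, 1 - eps]])
P_fast = np.array([[0.5, 0.5], [0.5, 0.5]])

for P in (P_slow, P_fast):
    lam = np.sort(np.linalg.eigvalsh(P))[::-1]   # eigenvalues, largest first
    gap = 1 - max(lam[1], abs(lam[-1]))          # 1 - max{lambda_2, |lambda_N|}
    print(lam, gap)   # slow chain: gap = 2*eps; fast chain: gap = 1
```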

15. L ECTURE 15: M AY 4, 2012


Today we consider a class of countable state Markov chains where the transition prob-
abilities, although theoretically computable, are somewhat complicated to handle. We
introduce alternate methods (generating functions) to compute the desired statistics.
15.1. Galton-Watson branching process. Consider an evolving population of asexual re-
producers. For example: bacteria. One can also think of the propagation (and extinction)
of family names through male offspring, at least under the (traditional, unrealistic) assumption
that family names pass from father to son. Relatedly, one can think about the evolution
of Y-chromosome DNA haplogroups.
Let Xn denote the number of individuals in the given population in the nth generation.
We make the following assumptions:
• There is a fixed distribution p0 , p1 , p2 , . . . on N so that each individual produces k
offspring with probability pk , and then dies.
• All individuals reproduce independently.
It is clear, then, that the number of individuals in the nth generation is determined by the
individuals present in generation n−1 only; that is, (Xn )n≥0 is a Markov chain. This chain
is called a (Galton-Watson) branching process, since one can represent it by a “family
tree” diagram. Note that if the population ever reaches 0 then it remains 0 from then on –
the population has gone extinct.
We could explicitly calculate the transition probabilities of the chain as follows. Let
Y1 , Y2 , Y3 , . . . be a sequence of i.i.d. random variables each with distribution P(Yi = k) = pk
for k ∈ N. Then
p(i, j) = P(Xn+1 = j|Xn = i) = P(Y1 + · · · + Yi = j).
Since the Yi are independent, this probability can be written as an i-fold convolution of
the distribution of Y1 evaluated at j, and written down explicitly; but the formula is quite
complicated and difficult to work with. We will avoid the exact form of the transition
kernel, and use other tools to analyze the branching process.
The fundamental question we want to answer is: what is the probability that the population
never goes extinct? That is, we want to calculate (or estimate)
q(i) = Pi (Xn = 0 for some n)
where i is the initial population size. It turns out this probability is largely controlled by
the mean of the offspring distribution:
\[ \mu = \sum_{k=0}^{\infty} k\, p_k. \]

To see why, consider the expected size of the population in generation n: Ei [Xn ]. We
compute it by conditioning on the population in the previous generation:
\[ E_i[X_n] = \sum_{k=0}^{\infty} P_i(X_{n-1} = k)\, E[X_n \mid X_{n-1} = k]. \]
But, using the i.i.d. Yi variables above, we have
Ei [Xn |Xn−1 = k] = Ek [X1 ] = E[Y1 + · · · + Yk ] = kE[Y1 ] = kµ.

Thus
\[ E_i[X_n] = \sum_{k=0}^{\infty} P_i(X_{n-1} = k)\, k\mu = \mu \sum_{k=0}^{\infty} k\, P_i(X_{n-1} = k) = \mu\, E_i[X_{n-1}]. \]
By induction, it follows that
Ei [Xn ] = µn Ei [X0 ] = iµn .
Now, from the following simple estimate
\[ i\mu^n = E_i[X_n] = \sum_{k=1}^{\infty} P_i(X_n \ge k) \ge P_i(X_n \ge 1) \]

we see that if µ < 1 then lim_{n→∞} P_i(X_n ≥ 1) = 0. It follows that in this case q(i) = P_i(X_n =
0 eventually) = 1, for any i. This motivates (part of) the following definition.

Definition 15.1. Let (X_n)_{n≥0} be a branching process, with offspring distribution p_0, p_1, p_2, . . .
and mean µ = Σ_{k=0}^∞ k p_k. Call the process subcritical if µ < 1; critical if µ = 1; and supercrit-
ical if µ > 1.
The calculation preceding Definition 15.1 shows that a subcritical branching process has
q(i) = 1 for all i: no matter how large the initial population, the population eventually
goes extinct. For critical and supercritical branching processes, we need another tool to
analyze q(i). Before we proceed, let us note that, since the individuals in each generation
reproduce independently, the branches of the process are all independent; it follows that
if qn (i) = Pi (Xn = 0), then qn (i) = [qn (1)]i . Since q(i) = limn→∞ qn (i), it therefore suffices
to calculate q(1) ≡ q, which we define to be the extinction probability. (I.e. q is the
probability the population eventually goes extinct, conditioned on starting with a single
individual.)

15.2. Probability generating function. Let X be a N-valued random variable. Define its
probability generating function ϕX as

X
X
ϕX (s) = E[s ] = sk P(X = k).
k=0

This is a variant of the characteristic function (a.k.a. Fourier transform) χX (t) = E[eitX ]
which is more commonly used for continuous distributions. Let us point out a few facts
about ϕX .
• From the power-series, since P(X = k) ≤ 1 for all k, it follows that ϕX is analytic
(n)
on (−1, 1). We have ϕX (1) = P 1, and ϕX (0) = n! P(X = n).

• For |s| < 1, we have ϕ0X (s) = k=1 ksk−1 P(X = k). If E[X] < ∞, it follows that this
series converges at s = 1, and ϕ0X (1) = E[X].
• Because P(X = k) ≥ 0, the function s 7→ ϕ0X (s) is nondecreasing; if P(X ≥ 2) > 0,
it is strictly increasing.
Theorem 15.2. Let (Xn )n≥0 be a branching process with offspringP distribution p0 , p1 , p2 , . . . Let ϕ
be the probability generating function of this distribution ϕ(s) = ∞ k
k=0 pk s . Then the extinction
probability q is given by
q = min{s ∈ [0, 1] : ϕ(s) = s}.
53

Proof. As noted above, q(i) = q(1)i = q i . Then by first step conditioning,


X∞
q = P1 (Xn = 0 eventually) = P1 (Xn = 0 eventually|X1 = i)P1 (X1 = i)
i=0

X
= Pi (Xn = 0 eventually)pi
k=0
X∞
= pi q i = ϕ(q)
i=0
so q is a fixed point of ϕ. Of course, 1 is always a fixed point of ϕ; it remains to show that
if there is a smaller fixed point, q takes it as its value.
Define q̂ = min{s ∈ [0, 1] : ϕ(s) = s}. Since qn ≡ qn (1) = P1 (Xn = 0) have limit
limn→∞ qn = q, it suffices to show that qn ≤ q̂ for all n (then the limit q must be ≤ q̂,
and therefore equal to q̂ since q is a fixed point in [0, 1]). Clearly q0 = 0 ≤ q̂. Suppose we
have shown that qn−1 ≤ q̂. Then
X∞ X∞
i
qn = P1 (Xn = 0) = P1 (Xn = 0|X1 = i)P(X1 = i) = pi qn−1
i=0 k=0
by the independence of the i offspring in gneeration n − 1. Thus, by the inductive hypoth-
esis,
X∞ ∞
X
i
qn = pi qn−1 ≤ pi q̂ i = ϕ(q̂) = q̂,
k=0 k=0
concluding the proof. 
Corollary 15.3. Suppose p1 6= 1. Then q = 1 if the process is critical or subcritical, and q < 1 if
the process is supercritical.
Note: if p1 = 1 (so that pk = 0 for k 6= 1) the population remains 1 for all time, the boring
case. Excluding this deterministic situation, we know exactly when there is a chance that
the population will survive in the limit: precisely when it is supercritical.
Proof. We have already seen that, in the subcritical case, q = 1. If the process is supercriti-
cal, then 1 < µ = ϕ0 (1). On the other hand we know that ϕ0 (0) = P(Y = 1) = p1 < 1. Now,
consider the function f (s) = ϕ(s) − s. Then f (1) = 0 and f (0) = p0 ≥ 0; we also have
f 0 (s) = ϕ0 (s) − 1 so f 0 (1) > 0. Since f is smooth, it follows that f is strictly increasing near
1, so f (s) < 0 for s near 1. Since f is continuous and f (0) ≥ 0, there must be some point
s < 1 so that f (s) = 0. Hence the minimal fixed point q̂ of ϕ is in (0, 1), and so q < 1.
Now,
P∞suppose the process is critical: so µ = 1 (in particular p0 6= 1). Since p1 6= 0, we
have k=2 pk > 0. Thus ϕ (1) = 1 and ϕ0 (t) < 1 for all t < 1. Thus, if s < 1 then
0
Z 1
1 − ϕ(s) = ϕ(1) − ϕ(s) = ϕ0 (t) dt < 1 − s
s
so ϕ(s) > s for all s < 1. It follows that q = q̂ = 1. 
54

16. L ECTURE 16: M AY 7, 2012


We are about to discuss a class of stochastic models known as Hidden Markov models.
To begin, consider the following motivating example.
Example 16.1 (Occasionally Dishonest Casino). In a casino game, a die is thrown. Usually
a fair die is used, but sometimes the casino swaps it out for a loaded die. To be precise:
there are two dice, F (fair) and L (loaded). For the fair die F , the probability of rolling i
is 61 for i = 1, 2, . . . , 6. For the loaded die, however, 1 is rolled with probability 0.5 while 2
through 5 are each rolled with probability 0.1. Each roll, the casino randomly switches out
F for L with probability 0.05; if L is in use, they switch back to F next roll with probability
0.9.
You notice that 1 is rolled 6 times in a row. How likely is it that the fair die was in use
for those rolls (given the above information)? How likely is it the loaded die was used in
rolls 2, 3, and 5? More generally, what is the most likely sequence of dice that was used to
generate this sequence of rolls?
16.1. Hidden Markov Models. Example 16.1 is a very typical situation that occurs in
many real-world problems. To model it, we consider it as two processes: there is a pro-
cess of observations (the rolls of the die), and an underlying hidden process (which die is
chosen). We would like to “decode” the hidden process, but all the information we have
are the conditional probabilities of the observed process given the hidden process.
Definition 16.2. A Hidden Markov Model (HMM) is a pair of stochastic processes, (Xn , Yn )n≥0 ,
where (Yn )n≥0 is a Markov chain with state space S, and (Xn )n≥0 has a possibly different state
space R, and the vector-valued process Zn = (Xn , Yn ) is a Markov chain. For y ∈ S and x ∈ R,
the conditional probabilities
ey (x) = P(Xn = x|Yn = y)
are called the emission probabilities of the model. Let p : S × S → [0, 1] be the transition kernel
for (Yn )n≥0 . It is taken as an assumption that the transition kernel for (Zn )n≥0 is
P(Zn+1 = (x0 , y 0 )|Zn = (x, y)) = p(y, y 0 )ey0 (x0 ). (16.1)
It is easy to verify that Equation 16.1 defined a valid transition kernel on R × S (the reader
should take a moment to check this). Note: it is not assumed that (Xn )n≥0 alone is a
Markov chain. The form of the transition kernel of (Zn )n≥0 is special; it is not generally
true that a vector-valued Markov chain’s transition probabilities will only depend on the
new state of one of the coordinates. This is the defining characteristic of Hidden Markov
models.
By induction and the Markov property, it follows that
P(X0 = x0 , . . . , Xn = xn , Y0 = y0 , . . . , Yn = yn )
= P(X0 = x0 , Y0 = y0 )p(y0 , y1 )ey1 (x1 ) · · · p(yn−1 , yn )eyn (xn )
= P(Y0 = y0 )ey0 (x0 )p(y0 , y1 )ey1 (x1 ) · · · p(yn−1 , yn )eyn (xn ). (16.2)
On the other hand, we could condition as
P(X0 = x0 , Y0 = y0 , . . . , Xn = xn , Yn = yn )
= P(Y1 = y1 , . . . , Yn = yn )P(X0 = x0 , . . . , Xn = yn |Y0 = y0 , . . . , Yn = yn ).
55

The former probability is P(Y1 = y1 , . . . , Yn = yn ) = P(Y0 = y0 )p(y0 , y1 ) · · · p(yn−1 , yn ) since


(Yn )n≥0 is a Markov chain, and so combining Equations 16.2 and 16.3 we have
P(X0 = x0 , . . . , Xn = xn |Y0 = y0 , . . . , Yn = yn ) = ey0 (x0 ) · · · eyn (xn ). (16.3)
By reversing the above calculations, we see that Equation 16.3 implies that (Zn )n≥0 is a
Markov chain. So, we could alternatively take, as the definition of a HMM, a pair (Xn , Yn )
such that Yn is a Markov chain and Equation 16.3 holds for the joint distirbution.
The factorization of Equation 16.3 is a kind of independence property (though not par-
ticularly eary to formalize). The intuition is that the underlying process (Yn )n≥0 is run-
ning, and then at each time k the observed signal Xk is generated from the present value
of Yk , independently of the previous observations. Problems 1 and 2 on Homework 5
show how to construct Hidden Markov models.
The language is suggestive of the chain (Yn )n≥0 generating signals which we observe as
(Xn )n≥0 , possessing some random noise. Indeed, one of the main application of HMMs
is in machine speech recognition. Here we have S = {words} and R = {wave forms}. We
want to develop a model to decode which wave forms correspond to which words (i.e.
determine the emission probabilities).
Example 16.3 (Occasionally Dishonest Casino, Continued). Example 16.1 can be described
as a HMM: the underlying Markov chain (Yn )n≥0 (which we cannot observe directly) is
the switching between fair and loaded dice: S = {F, L}. The model presented is that
p(F, L) = 0.05 and p(L, F ) = 0.9 (so p(F, F ) = 0.95 and p(L, L) = 0.1). The observed
process (Xn )n≥0 is the roll of the die, so R = {1, 2, 3, 4, 5, 6}. In this case, we know a priori
that eF (i) = 61 for i = 1, . . . , 6, while eL (1) = 0.5 and eL (2) = · · · = eL (6) = 0.1. Since all
the rolls (of any dice) are independent of each other, the product formula of Equation 16.3
follows.
Example 16.4. Let (Yn )n≥0 be a Markov chain with state space S. Let R be a set, and let
f : S → R be a function. Then Xn = f (Yn ) is a Hidden Markov Model, with ei (x) =
P(Xn = x|Yn = i) = P(f (Yn ) = x|Yn = i) = 1{f (i)=x} . (As Homework 5, Problem 2
shows, this is a special case of a general construction for HMMs: we can always represent
Xn = F (Yn , Vn ) where F : S × [0, 1] → R is a fixed function and Vn are i.i.d. uniform
random variables.)
16.2. The Forward Algorithm. Let (Xn , Yn ) be a HMM (with known parameters – i.e.
known transition kernel for Yn and known emission probabilities). As per Example 16.1,
the kinds of questions we’re interested in are: what is the probability of a given se-
quence of states y0 , y1 , . . . , yN of Yn occurring, given an observed sequence of symbols
x0 , x1 , . . . , xN ?
Fix an observation sequence x = (x0 , x1 , . . . , xN ) and a state sequence y = (y0 , y1 , . . . , yN ).
We use the notation:
P(x) = P(X0 = x0 , X1 = x1 , . . . , XN = xn )
P(x, y) = P(X0 = x0 , X1 = x1 , . . . , XN = xN , Y0 = y0 , Y1 = y1 , . . . , YN = yN )
Then what we seek to calculate is
P(x, y)
P(y|x) = .
P(x)
56

In this language, Equation 16.2 takes the form


P(x, y) = P(Y0 = y0 )p(y0 , y1 )ey0 (x0 )p(y1 , y2 )ey1 (x1 ) · · · p(yN −1 , yN )eyN (xN ). (16.4)
P
We could then calculate P(x) = y∈S N P(x, y) to determine the conditional probability
P(x|y) in question. But this approach is computationally infeasible. For each y ∈ S N
there are 2N multiplications to be done to compute the term P(x, y) in Equation 16.4, and
then the sum is over |S|N such terms: so there are 2N |S|N computations. Even if |S| is
very small, as N grows moderatley this becomes intractible for any computer.
The forward algorithm is a recursive procedure to compute the conditional probability
in question in polynomial time. Here is how it works. Fix an observation sequence x =
(x0 , x1 , . . . , xN ). For any state y ∈ S and any time 0 ≤ n < N , set
αn (y) = P(X0 = x0 , X1 = x1 , . . . , Xn = xn , Yn = y)
(I.e. the probability that the first n observations have occurred and the hidden state at the
present time is y). Then we have
αn+1 (y 0 ) = P(X0 = x0 , . . . , Xn = xn , Xn+1 = xn+1 , Yn+1 = y 0 )
X
= P(X0 = x0 , . . . , Xn = xn , Xn+1 = xn+1 , Yn = y, Yn+1 = y 0 ).
y∈S

For each such term,


P(X0 = x0 , . . . , Xn = xn , Xn+1 = xn+1 , Yn = y, Yn+1 = y 0 )
= P(X0 = x0 , . . . , Xn = xn , Yn = y)P(Xn+1 = xn+1 , Yn+1 = y 0 |X0 = x0 , . . . , Xn = xn , Yn = y)
= αn (z)P(Xn+1 = xn+1 , Yn+1 = y 0 |X0 = x0 , . . . , Xn = xn , Yn = y)
This last term is handled easily by the Markov property for Zn .
Lemma 16.5. Let Zn = (Xn , Yn ) be a Markov chain with state space R × S. For any states
x0 , . . . , xn+1 ∈ R and y, y 0 ∈ S,
P(Zn+1 = (xn+1 , y 0 )|X0 = x0 , X1 = x1 , . . . , Xn = xn , Yn = y)
= P(Zn+1 = (xn+1 , y 0 )|Zn = (xn , y)).
Remark 16.6. Below we give an elementary but cumbersome discrete probability proof
of this fact. The intuition is quite clear. Let’s denote the information in all the random
variables Z0 , . . . , Zn symbolically as F≤n , and the information contained in just Zn as Fn .
The Markov property then says that
P(Zn = z|F≤n ) = P(Zn = z|Fn ).
Of coursr Fn ⊆ F≤n . Now, denote by G the information contained in the random vari-
ables X0 , . . . , Xn , Yn . This list is strictly contained in the components of Z0 , . . . , Zn , and it
contains the components of Zn , so we have
Fn ⊆ G ⊆ F≤n .
Since the conditional distribution of Zn+1 , conditioned on the big set F≤n is the same
as conditioned on the small set Fn , it stands to reason it should also be the same when
conditioned on the intermediate set G . This intuition can be formalized and forms the basis
of the theory of conditional expectation and Martingales, that we will study later in this
course.
57

Proof. By definition
P(Zn+1 = (xn+1 , y 0 )|X0 = x0 , X1 = x1 , . . . , Xn = xn , Yn = y)
P(X0 = x0 , . . . , Xn = xn , Xn+1 = xn+1 , Yn = y, Yn+1 = y 0 )
= .
P(X0 = x0 , . . . , Xn = xn , Yn = y)
The numerator can be expanded by summing over the probabilities of all possible values
taken by Y0 , . . . , Yn−1 :
P(X0 = x0 , . . . , Xn = xn , Xn+1 = xn+1 , Yn = y, Yn+1 = y 0 )
X
= P (Z0 = (x0 , y0 ), . . . , Zn−1 = (xn−1 , yn−1 ), Zn = (xn , y), Zn+1 = (xn+1 , y 0 ))
y0 ,...,yn−1

We can then express


P(Zn+1 = (xn+1 , y 0 )|X0 = x0 , X1 = x1 , . . . , Xn = xn , Yn = y)
X P (Z0 = (x0 , y0 ), . . . , Zn−1 = (xn−1 , yn−1 ), Zn = (xn , y), Zn+1 = (xn+1 , y 0 ))
=
y ,...,y
P(X0 = x0 , . . . , Xn = xn , Yn = y)
0 n−1

P (Z0 = (x0 , y0 ), . . . , Zn−1 = (xn−1 , yn−1 ), Zn = (xn , y))


·
P (Z0 = (x0 , y0 ), . . . , Zn−1 = (xn−1 , yn−1 ), Zn = (xn , y))

X P (Z0 = (x0 , y0 ), . . . , Zn−1 = (xn−1 , yn−1 ), Zn = (xn , y), Zn+1 = (xn+1 , y 0 ))


=
y ,...,y
P (Z0 = (x0 , y0 ), . . . , Zn−1 = (xn−1 , yn−1 ), Zn = (xn , y))
0 1

P (Z0 = (x0 , y0 ), . . . , Zn−1 = (xn−1 , yn−1 ), Zn = (xn , y))


·
P(X0 = x0 , . . . , Xn = xn , Yn = y)

X
= P(Zn+1 = (xn+1 , y 0 )|Z0 = (x0 , y0 ), . . . , Zn−1 = (xn−1 , yn−1 ), Zn = (xn , y))
y0 ,...,yn−1

P (Z0 = (x0 , y0 ), . . . , Zn−1 = (xn−1 , yn−1 ), Zn = (xn , y))


· .
P(X0 = x0 , . . . , Xn = xn , Yn = y)
Since Zn is a Markov chain, the conditional probability is simply equal to the one step
transition probability P(Zn+1 = (xn+1 , y 0 )|Zn = (xn , y)), which does not depend on the
summation variables. Hence we have
X P (Z0 = (x0 , y0 ), . . . , Zn−1 = (xn−1 , yn−1 ), Zn = (xn , y))
P(Zn+1 = (xn+1 , y 0 )|Zn = (xn , y)) .
y ,...,y
P(X0 = x0 , . . . , Xn = xn , Yn = y)
0 n−1

The numerator is just the expansion of the denominator over all possible values of Y0 , . . . , Yn−1 ,
so the ratio is 1, proving the lemma. 
We therefore have
X
αn+1 (y 0 ) = αn (y)P (Zn+1 = (xn+1 , y 0 )|Zn = (xn , y))
y∈S
X
= αn (y)p(y, y 0 )ey0 (xn+1 )
y∈S
58

by Equation 16.1. We can now factor out the term ey0 (xn+1 ) which is independent of the
summation variable y, yo achieve
X
αn+1 (y 0 ) = ey0 (xn+1 ) αn (y)p(y, y 0 ). (16.5)
y∈S

Then, to compute P(x) = P(X0 = x0 , . . . , XN = xN ), we simply sum


X X
P(x) = P(X0 = x0 , . . . , XN = xN , YN = y) = αN (y).
y∈S y∈S

Let us now analyze the Forward Algorithm for complexity. Let 0 < n < N , and suppose
we have calculated all the quantities {αn (y) : y ∈ S}. For each y 0 ∈ S, Equation 16.5
allows us to compute αn+1 (y 0 ) with |S| multiplications, then the sum of those |S| terms,
and finally the product with the one remaining term, for a total of 2|S| + 1 operations. We
must compute all |S| terms {αn+1 (y 0 ) : y 0 ∈ S}, which means repeating this |S| times for a
total of |S|(2|S| + 1) operations.
At time 0, we initialize with the known distribution α0 (y) = P(X0 = x0 , Y0 = y) =
P(Y0 = y)ey (x0 ), requiring one computation for each state, so |S| computations in total.
Then we need N ·|S|(2|S|+1) operations to compute all the terms {αN (y) : y ∈ S}. Finally,
to compute P(x) we must sum up these |S| terms, for a total of 2|S| + N |S|(2|S| + 1) =
O(N |S|2 ) computations.

Example 16.7. To reiterate, if (Xn , Yn )n≥0 is a HMM and x = (x0 , . . . , xN ) is a fixed obser-
vation sequence, computing P(x) by
• the straightforward method y∈S N P(x, y) requires 2N |S|N operations.
P

• the Forward algorithm requires 2N |S|2 + (N + 2)|S| operations.


For example, suppose |S| = 2 (as in the occasionally dishonest casino example) and
N = 100. Then the Forward Algorithm takes 1004 computations, while the straightfor-
ward method takes 2.54 × 1032 computations. Note, the world’s fastest supercomputer
(as of November, 2011) operates at about 10.5 PFLOPS, or 1.05 × 1016 flops (floating-point
operations per second). On this computer, the straightforward computation above would
take more than 750 million years. The computation by the forward algorithm, by com-
parison, could be implemented on my iPhone in 1 microsecond.
59

17. L ECTURE 16: M AY 9, 2012


To recap from last time: a Hidden Markov Model (HMM) is a pair (Xn , Yn )n≥0 of sto-
chastic processes, where (Yn )n0 is a Markov chain with transition kernel p(·, ·), and the
joint distribution P(X0 = x0 , . . . , Xn = xn , Y0 = y0 , . . . , Yn = yn ) is given by
P(Y0 = y0 )ey0 (x0 )p(y0 , y1 )ey1 (x1 ) · · · p(yn−1 , yn )eyn (xn ).
Here ey (x) = P(Xn = x|Yn = y) are the emission probabilities (and are time-homogeneous).
Equivalently, the model is that (Yn )n≥0 and the pair (Xn , Yn )n≥0 are Markov chains, where
the transition probabilities for the pair chain are
P ((Xn+1 , Yn+1 ) = (x0 , y 0 )|(Xn , Yn ) = (x, y)) = p(y, y 0 )ey0 (x0 ).
This models a scenario where there is an underlying driving Markov process Yn which
is “hidden”, but it generates a process Xn which can be observed (as signals with some
random noise). The emission probabilities are known, and we would like to determine
from a given signal what the hidden state of the underlying process is.
One question we can answer (in a computationally tractable way) is: given an observed
output x = (x0 , x1 , . . . , xN ), what is the probability that the hidden chain followed the
trajectory y = (y0 , y1 , . . . , yN )? That is, what is P(y|x)? By definition is it P(x, y)/P(x). The
numerator is calculated (as above) as the product of transition probabilities and emission
probabilities (and the probability of the initial state). To calculate the denominator P(x),
we use the Forward Algorithm: for fixed observation sequence x as above
Initializiation: For y ∈ S, set α0 (y) = P(X0 = x0 , Y0 = y0 ) = P(Y0 = y0 )ey0 (x0 ).
X
Recursion: For y 0 ∈ S and 0 < n < N , set αn+1 (y 0 ) = ey0 (xn+1 ) αn (y)p(y, y 0 ).
X y∈S
Termination: P(x) = αN (y).
y∈S

If S is the state space of the hidden chain (Yn )n≥0 , this algorithm is O(N |S|2 ) and so quickly
implemented on a computer. Homework 5 Problem 1 asks you to do this computation for
a Crooked Casino style HMM.

17.1. Most Likely Trajectory. HMMs are used extensively in signal processessing appli-
cations: notably speach recognition and error correcting codes for wireless communica-
tions. The observed process Xn is interpretted as a signal with random noise, and the
goal is to decode the true source Yn . This is, of course, impossible, even if all the emission
probabilities for Xn given Yn and all the transition probabilities for Yn are known exactly
(which is usually not true anyhow). The question is, then, what is our best guess about
the values of y0 , y1 , . . . , yN given an observed sequence x0 , x1 , . . . , xN ?
Referring to the previous lecture and the forward algorithm, we have an efficient way
to calculate P(y|x). The usual answer for the best guess y for the true value of y is then
y∗ = argmax P(y|x).
y

The operation argmax is Engineering Speak: the meaning is that y∗ is chosen so that
P(y∗ |x) = max P(y|x).
y
60

Such a y∗ surely exists (since we are dealing with finite-state processes here). A more
subtle theoretical question is whether the maximum is unique (the answer is no typically),
so we can weaken our goal to find a maximizer. The intuition is simple: we observe a
given x. We check all possible strings y, and the one that has the highest probability of
generating x is (probably close to) the true string.
We now come to the question of computational feasibility. We have an algorithm
(the Forward Algorithm) to compute P(y|x) for fixed x and y, and it has complexity
O(N |S|3 ) where S is the state space of the hidden chain. The most obvious way to find
argmaxy P(y|x) is to compute all of the probabilities {P(y|x) : y ∈ S N }, and then select
y∗ as (one of) the maximizer(s). Of course, this procedure will exponentially complex-
ify the calculation, yielding O(N |S|N +3 ). This will never be computationally possible for
moderately large N .
Enter Viterbi. In 1967, he was an EE professor here at UCLA, when he invented the
algorithm (below) for recursively effectively computing the most likely trajectory in a
HMM. He quickly realized the enormous practical applications, and formed a company
(called Linkabit) to market it (initially only to the military and NASA). In 1973, he re-
signed from UCLA and moved to San Diego (where Linkabit was based) to run the com-
pany full-time. In 1980, Linkabit was acquired by a larger company called Microwave
Associates Communications. A change in management caused Viterbi (and his original
Linkabit team) to resign en masse in 1985, at which time they formed a new startup com-
pany that they called QUALCOMM (which is, of course, now an industry giant in com-
munications technology). While at QUALCOMM, Viterbi more or less single-handedly
invented the CDMA algorithm used by all modern cellular providers; the Viterbi algo-
rithm is used at the heart of it (for error-correcting codes). Viterbi retired from QUAL-
COMM in 2000 (at age 65), and became an EE professor here at UCSD, where he is still
on the faculty.

17.2. The Viterbi Algorithm. Fix an observation sequence x = (x0 , x1 , . . . , xN ). First note
that P(y|x) = P(x,y)
P(x)
, so
1
max P(y|x) = max P(x, y).
y P(x) y
It therefore suffices to maximize P(x, y) over y; a maximizer y∗ for this joint probability
is also a maximizer for the conditional probability.
We build up recursively the maximum conditional probability of a state sequence given
the observed sequence. For 0 < n ≤ N , define, for any state y,
Vn (y) = max P(X0 = x0 , . . . , Xn = xn , Y0 = y0 , . . . , Yn−1 = yn−1 , Yn = y).
y0 ,...,yn−1 ∈S

Then maxy∈S Vn (y) = maxy∈S N P(x, y), whose argmax we wish to compute. On its face, it
seems that computing Vn (y) requires a brute force approach with |S|n−1 computations of
probabilities; however, note that the Hidden Markov Model provides a simple product
structure for the probabilities in question. From Equation 16.4, we have for any states
y0 , . . . , yn−1 , y, y 0 that
P(X0 = x0 , . . . , Xn+1 = xn+1 , Y0 = y0 , . . . , Yn = y, Yn+1 = y 0 )
= P(Y0 = y0 )ey0 (x0 )p(y0 , y1 )ey1 (x1 ) · · · p(yn−1 , y)ey (xn )p(y, y 0 )ey0 (xn+1 )
61

and so
 
Vn+1 (y 0 ) = max P(Y0 = y0 )p(y0 , y1 )ey0 (x0 ) · · · p(yn−1 , y)eyn (xn )p(y, y 0 )ey (xn+1 )
y0 ,...,yn−1 ,y
 
= max P(Y0 = y0 )p(y0 , y1 )ey0 (x0 ) · · · p(yn−1 , yn )ey (xn )p(y, y 0 ) ey0 (xn+1 ).
y0 ,...,yn−1 ,y

The quantity inside the large parentheses is P(X0 = x0 , . . . , Xn = xn , Y0 = y0 , . . . , Yn−1 =


yn−1 , Yn = y) · p(y, y 0 ), again by Equation 16.4. Thus, we have
 
0 0 0
Vn+1 (y ) = max P(X0 = x0 , . . . , Xn = xn , Y0 = y0 , . . . , Yn−1 = y, Yn = y )p(y, y ) ey0 (xn+1 ).
y0 ,...,yn−1 ,y

The maximum can be rewritten as


 
0
max p(y, y ) max P(X0 = x0 , . . . , Xn = xn , Y0 = y0 , . . . , Yn = y)
y y0 ,...,yn−1
 
= max p(y, y 0 )Vn (y) .
y

Altogether, then, we have the recursive formula


Vn+1 (y 0 ) = max Vn (y)p(y, y 0 )ey0 (xn+1 ). (17.1)
y∈S

Actually computing this maximum means computing the quantitites


Wn+1 (y; y 0 ) = Vn (y)p(y, y 0 )ey0 (xn+1 ) (17.2)
0 0
for each pair y, y . Then define, for each state y ∈ s, a function ψn∗ : S → S, by
ψn∗ (y 0 ) = argmax Wn+1 (y; y 0 ). (17.3)
y

Then, by definition, we have


Vn+1 (y 0 ) = Wn+1 (ψn∗ (y 0 ), y 0 ). (17.4)
Once we have computed Wk (y; y 0 ) and Vk (y 0 ) for all 0 < k ≤ N , we can then do the final
maximization: since maxy P(x, y) = maxy VN (y), we set
∗ ∗
yN = argmax VN (y); ∴ max P(x, y) = VN (yN ).
y y

Now, to compute the maximizing sequence of states, we back-track: we have already



computed yN . In the process of computing VN (y 0 ), we computed ψN ∗ 0 0
−1 (y ) for each y , i.e.
∗ 0 0 0
the state y = ψN −1 (y ) which maximizes WN (y; y ). We now know that the state y which
maximizes the end result is y 0 = yN ∗
, and so we define yN ∗ ∗ ∗
−1 = ψN −1 (yN ). Continuing
recursively, we define
yk∗ = ψk∗ (yk+1 ), 0 ≤ k < N − 1.
This produces the maximizing sequence y∗ = (y0∗ , y1∗ , . . . , yN

) that we sought.
62

To summarize, here is the laid out Viterbi algorithm.

Initializiation: For y ∈ S, set V0 (y) = P(X0 = x0 , Y0 = y0 ) = P(Y0 = y0 )ey0 (x0 ).

Recursion: For y, y 0 ∈ S and 0 < n < N , set Wn+1 (y; y 0 ) = Vn (y)p(y, y 0 )ey0 (xn+1 ).

Then compute ψn∗ (y 0 ) = argmax Wn+1 (y; y 0 ).


y

Set Vn+1 (y 0 ) = Wn+1 (ψn∗ (y 0 ); y 0 ).



Termination: max P(x, y) = max VN (y), so define yN = argmax VN (y).
y y∈S y

Back-tracking: For 0 ≤ k < N , set yk∗ = ψk∗ (yk+1



).
Note: at the n + 1st recursion step we compute |S| maximizers ψn∗ (y 0 ) for all states y 0 ∈
S. It could well be that more than one state y maximizes Wn+1 (y; y 0 ), so there is some
indeterminacy. In practice, we just define ψn∗ (y 0 ) to be the first y to maximize Wn+1 (y; y 0 )
(i.e. we order the states of S to begin with). The same goes for the final mazimization of
Vn (y) over y ∈ S. So the order we choose for S determines which maximum we find.
17.3. Complexity Analysis. The Viterbi algorithm produces a maximizing state sequence

y∗ = (y0∗ , . . . , yN ) for a given observation sequences x = (x0 , . . . , xN ); that is, P(y∗ |x) =
maxy P(y|x). The reason it is so useful is that, unlike the brute force aprroach discussed
above that is exponential in time N , this algorithm is linear in time N . Let’s run through
the algorithm to see why.
• For each y, computing V0 (y) requires one multiplication; so computing the quanti-
ties {V0 (y) : y ∈ S} requires |S| operations.
• Once we have computed all the quantities {Vn (y) : y ∈ S}, for any fixed state
y 0 we need two multiplications to compute Wn+1 (y; y 0 ) = Vn (y)p(y, y 0 )ey0 (xn+1 )
for each pair y, y 0 ∈ S. So there are 2|S|2 operations to compute the quantities
{Wn+1 (y; y 0 ) : y, y 0 ∈ S}; in computing them, we run a pointer to identify the (first)
y that maximizes Wn+1 (y; y 0 ) for each given y 0 , so we compute ψn∗ (y 0 ) and Vn+1 (y 0 )
in the process |S| additional operations (specifying the pointer ψn∗ (y 0 ) for each y 0 ).

• At termination, we argument-maximize VN (y) over y to find yN , and backtrack to
∗ ∗ ∗
compute the maximizing sequence yk = ψk (yk+1 ); this is all a matter of calling the
pointers we specified through the process, so no additional operations are needed.
Thus, the total number of operations is: |S| for intialization, and N · (2|S|2 + |S|) to com-
pute V1 , V2 , . . . , VN and specify the relevant pointers. The total complexity of the Viterbi
algorithm is then |S| + N (2|S|2 + |S|) = 2N |S|2 + (N + 1)|S| = O(N |S|2 ).
63

18. L ECTURE 18: C ONTINUOUS T IME M ARKOV C HAINS


Let S be a finite or countable state space. A stochastic process (Xt )t≥0 with state space S,
indexed by non-negative reals t (in the half-infinite interval [0, ∞), or perhaps in a subin-
terval [a, b]) is called a continuous-time Markov chain if the following two properties
hold.
(1) (Markov property) Let 0 ≤ t0 < t1 < · · · < tn < ∞ be a sequence of times, and let
i0 , i1 , . . . , in ∈ S be a sequence of states such that P(Xt0 = i0 , Xt1 = i1 , . . . , Xtn−1 =
in−1 ) > 0. Then

P(Xtn = in |Xt0 = i0 , Xt1 = i1 , . . . , Xtn−1 = in−1 ) = P(Xtn = in |Xtn−1 = in−1 ).

(2) (Right-continuity) For t ≥ 0 and i ∈ S, if Xt = i then there is  > 0 such that Xs = i


for t ≤ s ≤ t + .
The Markov property is fundamentally the same as the the one in Definition 1.5, except
that we now must allow the times to be chosen from a larger set (all positive reals). The
intuition remains the same: the distribution of the process at the present time tn , condi-
tioned on any past information, is the same as conditioned on only the immediate past. (I.e.
for any time t earlier than tn , conditioning on times t and earlier is the same as condition-
ing only on time t.) Condition (2) (right-continuity) is a technical condition to guarantee
that very wild small-time behavior is ruled out.
As with the discrete time case, we immediately restrict our attention to time-homogeneous
chains:
(3) For any times 0 ≤ s < t < ∞ and states i, j ∈ S,
P(Xt = j|Xs = i) = P(Xt−s = j|X0 = i).

For discrete time chains, it was really enough to understand the one-step transition
matrix P(X1 = j|X0 = i), since there were no times in between 0 and 1. For continuous-
time chains, however, we really need all the information in the transition kernel
pt (i, j) = P(Xt = j|X0 = i), t > 0.
This is a lot more information to process, so we will need a different approach to analyze
such chains.

18.1. Jump-Times. Because of the right-continuity assumption, if Xt = i then Xt+ = i


for all sufficiently small  > 0. Hence, if we define the jump-time
J1 = min{t ≥ 0 : Xt 6= X0 }
then J1 > 0 with probability 1. Now, suppose J1 > s – so we have already waited time s
in state X0 = i. What is the probability that J1 > s + t – i.e. how likely is it wait at least t
longer in state i?
Proposition 18.1. For s, t > 0 and i ∈ S,
P(J1 > s + t|J1 > s) = P(J1 > t).
64

Proof. Let (Xt )t≥0 be a time-homogeneous Markov chain, with jump time J1 from the
initial state. Conditioned on X0 = i, he event {J1 > s} is equal to the event {Xu = i, ∀u ≥
s}. Let’s consider a discrete version of this: fix a sequence of times
0 ≤ u0 < u1 < · · · < un = s < un+1 < · · · < un+m = s + t.
Then Pi (J1 > s + t|J1 > s) is approximated by
P(Xu0 = Xu1 = · · · = Xun+m = i|Xu0 = Xu1 = · · · = Xun = i).
Abbreviate the event {Xu0 = Xu1 = · · · = Xun = i} as An (i); then we have the usual
decomposition
P(Xu0 = Xu1 = · · · = Xun+m = i|An (i))
= P(Xun+m = · · · = Xun+1 = i|An (i))
= P(Xun+1 = i|An (i))P(Xun+2 = i|Xun , An (i)) · · · P(Xun+m = i|Xun+m−1 = · · · = Xun+1 = i, An (i)).
By the Markov property, this is simply the product
P(Xun+1 = i|Xun = i)P(Xun+2 = i|Xun+1 = i) · · · P(Xun+m = i|Xun+m−1 = i).
Now lets choose the partition points to evenly subdivide the interval [s, s + t] into m parts
– that is, un+j+1 = un+j + mt . Then by time-homogeneity, each of the above terms in the
product is equal to
P(Xt/m = i|X0 = i).
By letting the partition witdth tend to 0, we recover the desired probability, so
 m
Pi (J1 > s + t|J1 > s) = lim P(Xt/m = i|X0 = i) . (18.1)
m→∞

As for Pi (J1 > t), we can approximate this probability similar to above by choosing a
partition 0 = v1 < v2 < · · · < vm = t and computing
Pi (Xv1 = · · · = Xvm = i)
= P(Xv1 = i|X0 = i)P(Xv2 = i|Xv1 = i) · · · P(Xvm = i|Xv0 = · · · = Xvm−1 = i)
= P(Xv1 −v0 = i)P(Xv2 −v1 = i) · · · P(Xvm −vm−1 = i)
where we have used the Markov property and time-homogeneity in the last equality. If
we again choose the even partition with vj+1 − vj = t/m, and take limits, this gives
 m
Pi (J1 > t) = lim P(Xt/m = i) . (18.2)
m→∞

Equations 18.1 and 18.2 prove that Pi (J1 > s + t|J1 > s) = Pi (J1 > t) for all states i. This
proves the proposition. 

18.2. The Exponential Distribution. The formula in Equation 18.2 for the (reverse) cu-
mulative probability distribution function of J1 has the feel of an exponential to it. This
turns out to indeed be true. The result of Proposition 18.1 is a “memoryless” property: it
says that the distribution of jump times after already waiting some fixed length of time
is the same as it was at the beginning. In fact, there is only a one-parameter family of
probability distributions with this property: the exponential distributions.
65

Proposition 18.2. If T is a random variable taking values in (0, ∞), and if T has the memoryless
property P(T > s + t|T > s) = P(T > t) for all s, t > 0, then T is an exponential random
variable with some intensity q > 0:
P(T > t) = e−qt , t ≥ 0.
Notation: we write T ∼ Exp(q).
Proof. Let GT (t) = P(T > t). Then for s, t > 0
GT (s + t) = P(T > s + t) = P(T > s)P(T > s + t|T > s) = P(T > s)P(T > t) = GT (s)GT (t).
It follows from this functional equation that GT (t) = e−qt for some q > 0: by induc-
tion it follows that GT (nt) = GT (t)n for all natural numbers n and t > 0. In particular,
GT (1/n)n = GT (1) ∈ (0, 1) and so there is q > 0 with GT (1) = e−q . Thus, for m, n ∈ N,
GT (m/n)n = GT (1)m = e−qm so that GT (m/n) = e−qm/n . So for rational t we have proved
the result. The function GT is decreasing, and so since it is equal to the function e−q(·) on
the dense set of positive rationals, it is equal to it everywhere. 

Here are a few useful properties of the exponential distribution.


Proposition 18.3. Let T1 , T2 , . . . , Tn be independent with Tj ∼ Exp(qj ).
(a) The density of Tj is fTj (t) = qj e−qj t 1{t≥0} , and E[Tj ] = q1j and Var[Tj ] = 1
qj2
.
(b) For s, t > 0, P(Tj ≥ s + t|Tj > s) = P(Tj > t) = e−qt t .
(c) T = minj Tj is also exponential, with T ∼ Exp(q1 + · · · + qn ). Moreover, P(Tj = T ) =
qj
q1 +···+qn
.
Proof. Parts (a) and (b) follow from easy calculation. For part (c), simply note from inde-
pendence that
P(T ≥ t) = P(T1 ≥ t, . . . , Tn ≥ t) = P(T1 ≥ t) · · · P(Tn ≥ t) = e−q1 t · · · e−qn t = e−t(q1 +···+qn ) .
Moreover,
P(T1 = T )
= P(T2 > T1 , . . . , Tn > T1 )
Z
= fT1 ,...,Tn (t1 , . . . , tn ) dt1 · · · dtn
t2 >t1 ,...,tn >t1
Z ∞ Z ∞ Z ∞
= dt ··· fT1 ,...,Tn (t, t2 , . . . , tn ) dt2 · · · dtn
0 t t
Z ∞ Z ∞ Z ∞
= dt ··· q1 e−q1 t q2 e−q2 t2 · · · qn e−qn t dt2 · · · dtn
0 t t
Z ∞ n Z
Y ∞
= q1 e−q1 t dt qi e−qi ti dti
0 i=2 t
Z ∞ n Z ∞
−q1 t
Y
−qi t q1
= q1 e e = q1 e−(q1 +···+qn )t dt = .
0 i=2 0 q1 + · · · + qn

66

Remark 18.4. In fact, one can go further. Let M be the {1, 2, . . . , n}-valued random variable
defined by M = j iff min{T1 , . . . , Tn } = Tj . We just computed that the distribution of M
qj
is P(M = j) = q1 +···+q j
. Similar calculations show that M is independent of T1 , . . . , Tn .
This is quite remarkable. For example, suppose we have two processes, one Exp(1) and
the other Exp(100). If we sample one of the two at random without knowing which one,
even if we get value 0.9, we do not have any information about which one we actually
sampled. This is a very strong form of the memoryless property.
18.3. Transition Rates. Proposition 18.2 shows that the first jump time J1 of a Markov
chain has an exponential distribution with some intensity q > 0, which may differ de-
pending on the initial state. That is: for each i ∈ S, there is a q(i) > 0 so that the condi-
tional distribution of J1 , conditioned on X0 = i, is Exp(q(i)). This gives a convenient way
to think about any continuous-time Markov chain: at each state, there is an “exponential
clock”, which will ring at a random time whose distribution is exponential with some rate
parameter q(i). Once arriving at state i, the clock is started; it rings at time J1 , and the
process moves to another state j, chosen with probabilities
p(i, j) = P(XJ1 = j|X0 = i).
Note that, by definition of the jump-time J1 , XJ1 6= X0 , so p(i, i) = 0.
There is a slightly different interpretation which is somewhat more convenient for cal-
culations. Define
q(i, j) = q(i)p(i, j) = q(i)P(XJ1 = j|X0 = i). (18.3)
The numbers q(i, j) for i, j ∈ S are called the transition rates of the Markov chain. They
satisfy
q(i, j) ≥ 0, q(i, i) = 0, i, j ∈ S
X X
q(i, j) = q(i) p(i, j) = q(i) < ∞.
j∈S j∈S

The transition rates fully determine the Markov chain, so we can specify the chain simply
by specifying the matrix Q = [q(i, j)]i,j∈S . Their interpretation in terms of the process is
thus: when we arrive at state i, all other states start exponential clocks – state j’s clock as
rate q(i, j). When a clock goes off, the chain moves to the state whose clock rang first. All
of the clocks are independent, so the time that the first clock will go off is the minimum
of allPthe independent exponential clocks; by Proposition 18.3(c), this is exponential with
rate j q(i, j) = q(i), as required. The probability that the clock at j goes off first is, again
by Proposition 18.3(c), Pq(i,j)
q(i,k)
= q(i,j)
q(i)
= p(i, j).
k

18.4. The Poisson Process. The simplest non-trivial continuous time Markov chain is a
Poisson Process, with rate λ > 0. This is a Markov chain with state space S = {0, 1, 2, . . .}
and transition rates q(i, i + 1) = λ for all i ≥ 0, and q(i, j) = 0 when j 6= i + 1. So the
process is non-decreasing; the moment that it arrives at state i, it begins an Exp(λ) clock,
and jumps to state i + 1 when it rings. If (Xt )t≥0 is a Poisson process, we can think of the
random variable Xt as counting the number of events that have occurred up to time t. It
is a very good model for radioactive decay, telephone calls coming into a call-center, the
number of requests for a particular page coming into a web server, etc.
67

Why do we call it a Poisson process?


Proposition 18.5. Let (Xt )t≥0 be a Poisson process with rate λ > 0. Then for each t > 0,
conditioned on X0 = 0, the random variable Xt has a Poisson distribution with rate λt:
(λt)k
P1 (Xt = k) = e−λt , k ∈ N.
k!
We will prove Proposition 18.5 in the next lecture, using the infinitesimal description of
continuous-time processes and the forward and backward Kolmogorov equations.
68

19. L ECTURE 19: M AY 14, 2012


19.1. The Strong Markov Property. In complete analogy with the discrete-time case, we
have continuous-time stopping times. Given a Markov chain (Xt )t≥0 , a stopping time T
is a random variable taking values in [0, ∞] with the property that:
for t ≥ 0, the event {T ≤ t} depends only on {Xs : s ≤ t}.
For example: the jump-time J1 of a Markov chain, J1 = min{t ≥ 0 : Xt 6= X0 } is a stopping
time: the event {J1 ≤ t} = {∃s ≤ t Xs 6= X0 } manifestly depends only on Xs for s ≤ t.
Theorem 19.1 (Strong Markov property). Let (Xt )t≥0 be a continuous-time Markov chain with
state space S and transition rates [q(i, j)]i,j∈S . Let T be a stopping time. For some i ∈ S, suppose
that P(XT = i) > 0. Then conditioned on XT = i, (XT +t )t≥0 is a Markov chain with the same
transition rates [q(i, j)]i,j∈S , independent from (Xt )0≤t≤T .
The proof follows the same idea as the proof of the discrete-time Strong Markov prop-
erty, Proposition 3.6. To make the terminology fully rigorously precise (i.e. in particu-
lar what it means for a stopping time to depend only on an uncountable collection like
{Xs : s ≤ t}) really requires measure theory.
We can use the Strong Markov property to develop tools for “first-step” computations
with continuous time Markov chains, in complete analogy with the techniques in Lectures
2 and 3. For example, for a subset A ⊆ S of the state space, as usual define the hitting
time τA = min{t ≥ 0 : Xt ∈ A}. How likely is it that Xt hits A before B, for two disjoint
subsets A, B ⊂ S? To answer, for each state i set h(i) = Pi (τA < τB ). Then h(i) = 1 for
i ∈ A and h(i) = 0 for i ∈ B. For i ∈ A ∪ B, by the Strong Markov property:
X X X q(i, j)
h(i) = Pi (XJ1 =j )Pi (τA < τB |XJ1 = j) = p(i, j)h(j) = h(j). (19.1)
j∈S j∈S j∈S
q(i)
(That is: Pi (τA < τB |XJ1 = j) = Pj (τA − J1 < τB − J1 ) = Pj (τA < τB ) by the Strong Markov
property with stopping time equal to the jump time J1 .) Similarly, we can compute the
expected hitting time g(i) = Ei [τA ] for i ∈ / A by
X X q(i, j)
g(i) = Pi (XJ1 = j)Ei [τA |XJ1 = j] = Ei [τA |XJ1 = j].
j∈S j∈S
q(i)
Now, define Yt = XJ1 +t . Then τA = min{t ≥ 0 : Xt ∈ A} = min{t + J1 : Xt+J1 ∈ A} =
min{t + J1 : Yt ∈ A} = J1 + τA0 , where τA0 is the hitting time of A for the process (Yt )t≥0 . By
the Strong Markov property
Ei [τA |XJ1 =j ] = E[τA0 + J1 |X0 = i, XJ1 = j] = Ej [τA ] + Ei [J1 ].
1
Conditoned on X0 = i, J1 is Exp(q(i)), so its expectation is q(i) . Altogether, then, we have
X q(i, j)  
1
g(i) = g(j) + .
j∈S
q(i) q(i)

Multiplying through by q(i), and using the fact that j∈S q(i,j)
P
q(i)
= 1, this gives us the
equations X
q(i)g(i) = 1 + q(i, j)g(j). (19.2)
j∈S
69

Equations 19.1 and 19.2 can be used to iteratively calculate P(τA < τB ) and E[τA ] just as in
the the discrete-time setting.
19.2. Embedded Jump Chain. Given a continuous-time Markov chain, we have already
defined the first jump-time J1 = min{t ≥ 0 : Xt 6= X0 }. This is part of a whole sequence
of jump times. We can inductively define
J0 = 0, Jn+1 = min{t ≥ Jn : Xt 6= XJn }.
By the strong Markov property, for any states i0 , i1 , . . . , in ∈ S,
P(XJn = in |XJn−1 = in−1 , . . . , XJ1 = i1 , X0 = i0 )
= P(XJn −J1 = in |XJn−1 −J1 = in−1 , . . . , XJ2 −J1 = i2 , X0 = i1 )
= P(XJn −J2 = in |XJn−1 −J2 = in−1 , . . . , XJ3 −J2 = i3 , X0 = i2 )
..
.
= P(XJn −Jn−1 = in |X0 = in−1 )
= P(XJn = in |XJn−1 = in−1 )
where the final step is from time-homogeneity, and the intermediate kth step follows by
restarting the process at time Jk (note that (Jn − Jk−1 ) − (Jk − Jk−1 ) = Jn − Jk , so the
differences telescope). This shows that the discrete-time stochastic process
Yn = XJn ,
called the embedded jump chain of (Xt )t≥0 , is a Markov chain. Moreover, its transition
probabilities are P(Y1 = j|Y0 = i) = P(XJ1 = j|X0 = i) = p(i, j) = q(i,j)
q(i)
.
Now, let us condition on a particular trajectory for the jump chain: let A = {Y0 =
i0 , . . . , Yn = in }. Let us consider the so-called sojourn times Sk = Jk − Jk−1 for k ≥ 1.
Then we have for any 0 < s1 < · · · < sn < ∞
= P(S1 > s1 , . . . , Sn > sn |A)
= P(J1 > s1 , J2 − J1 > s2 , . . . , Jn − Jn−1 > sn |A)
= P(J1 > s1 , J2 > s2 + J1 , . . . , Jn > sn + Jn−1 |A)
= PA (J1 > s1 )PA (J2 > s2 + J1 |J1 > s1 ) · · · PA (Jn > sn + Jn−1 |J1 > s1 , . . . , Jn−1 > sn−1 ).
By the Strong Markov property,
PA (Jk > sk + Jk−1 |J1 > s1 , . . . , Jk−1 > sk−1 ) = PA (Jk > sk + Jk−1 |Jk−1 > sk−1 )
= P(J1 > sk |X0 = ik−1 ) = e−q(ik−1 )sk .
(I.e. restarting the process at time Jk−1 , the next jump time Jk becomes the first jump-time
of the shifted process; the conditioning XJk−1 = Yk−1 = ik−1 translates to X0 = ik−1 for the
shifted process.) So, we have shown that
P(S1 > s1 , . . . , Sn > sn |A) = e−q(i0 )s1 · · · e−q(in−1 )sn
which is precisely the joint distribution of independent exponential random with param-
eters q(i0 ), . . . , q(in−1 ). In particular, this shows that
Proposition 19.2. Conditioned on Y0 , . . . , Yn−1 , the sojourn times S1 , . . . , Sn are independent
exponential random variables with Sk ∼ Exp(q(Yk−1 )).
70

19.3. Infinitesimal Description. We know that the transition rates q(i, j) completely de-
termine the Markov chain, and give a convenient description in terms of exponential
jumps. Since they give a complete description, they must therefore encode the transition
probaiblities pt (i, j) = Pi (Xt = j) for all t ≥ 0. One way to see how pt (i, j) is determined
from q(i, j) is through the following so-called infinitesimal description of the chain.
Theorem 19.3. Let (Xt )t≥0 be a Markov chain with state space S and transition rates [q(i, j)]i,j∈S .
Then the transition probabilities pt (i, j) = Pi (Xt = j) satisfy
pt (i, i) = 1 − q(i)t + o(t) for i ∈ S
for i 6= j ∈ S.
pt (i, j) = q(i, j)t + o(t)
P
In the statement of Theorem 19.3, we have (as usual) q(i) = j∈S q(i, j). The notation
f (t) = g(t) + o(t) is short-hand for
f (t) − g(t)
lim = 0.
t↓0 t
That is, to first order f and g agree near 0. So the statement is that, for small t > 0, pt (i, i)
is very close to 1 − q(i)t, and pt (i, j) is very close to q(i, j)t for i 6= j.
Proof. First, for the i = j case, we have pt (i, i) = Pi (Xt = i). The event {Xt = i} contains
the event {∀s ≤ t Xs = i} which (conditioned on X0 = i) is the event {J1 > t}. Thus
pt (i, i) = Pi (Xt = i) ≥ Pi (J1 > t) = e−q(i)t = 1 − q(i)t + o(t). (19.3)
For j 6= i, we can similarly estimate as follows: if J1 ≤ t, the first jump state is Y1 = j, and
the following sojurn time S2 > t, then the process jumps to j before time t and doesn’t
leave it until time J2 = J1 + S2 > t, thus Xt = j. Hence
pt (i, j) = Pi (Xt = j) ≥ Pi (J1 ≤ t, Y1 = j, S2 > t) = Pi (Y1 = j)Pi (J1 ≤ t, S2 > t|Y1 = j).
By definition Pi (Y1 = j) = Pi (XJ1 = j) = p(i, j). By Proposition 19.2, since J1 = S1 , we
have
Pi (J1 ≤ t, S2 > t|Y1 = j) = Pi (J1 ≤ t|Y1 = j)Pi (S2 > t|Y1 = j) = (1 − e−q(i) )e−q(j)t .
Thus, we have the inequality
pt (i, j) ≥ p(i, j)(1−e−q(i)t )e−q(j)t = p(i, j)(q(i)t+o(t))(1−q(j)t+o(t)) = q(i, j)t+o(t). (19.4)
(The final equality follows from the definition q(i, j) = q(i)p(i, j).) Now, fixing i and
summing over all states j, we have from Inequalities 19.3 and 19.4
X X
pt (i, j) ≥ 1 − q(i)t + o(t) + q(i, j)t + o(t).
j∈S j6=i
P
Since q(i) = j q(i, j) and q(i, i) = 0, this shows that
X
pt (i, j) ≥ 1 + o(t).
j∈S

Hence, if any
P one of the inequailities in 19.3 or 19.4 were strict, we would have the strict
inequality j∈S pt (i, j) > 1 + o(t) and so for small enough t the sum would exceed 1. This
P
contradicts the stochastic condition j∈S pt (i, j) = 1, and so the inequalites must in fact
be equalities. 
71

The infinitesimal description is a very handy way to identify a Markov chain. Indeed,
the chain is completely determined by the parameters q(i) and q(i, j) for i, j ∈ S, so to
fully understand the chain one need only understand pt (i, j) to first order in t. This can
be used to identify a process quickly.
Example 19.4. Let (Xt )t≥0 and (Yt )t≥0 be independent Poisson processes, with parameters
λ1 and λ2 . The infinitesimal description of Theorem 19.3 can be used to prove that Xt + Yt
is a Poisson process with paramter λ1 + λ2 ; this is an exercise on Homework 5.
72

20. L ECTURE 20: M AY 16, 2012


20.1. Kolmogorov’s Forward and Backward Equations. The infinitesimal decription gives
rise to a system of differential equations for pt (i, j). It can be described elegantly in terms
of the following matrix.
Definition 20.1. Let (Xt )t≥0 be a continuous-time Markov chain with state space S and transi-
tion rates [q(i, j)]i,j∈S . Define the infinitesimal
P generator matrix A of the chain as follows: for
i 6= j, [A]ij = q(i, j), and [A]ii = −q(i) = − j∈S q(i, j).
So we can specify a Markov chain by specifying its infinitesimal generator.
Theorem 20.2 (Kolmogorov’s Forward and Backward Equations). Le (Xt )t≥0 be a continuous-
time Markov chain with state space S and infinitesimal generator A. Let Pt denote the matrix with
entries [Pt ]ij = pt (i, j) for i, j ∈ S. Then
d
Pt = Pt A = APt . (20.1)
dt
The first equality is called the Kolomogorov Forward Equation; the Kolmogorov Back-
ward Equation is the equality dtd Pt = APt . At least in the case of a finite state-space, the
second equality follows from the first, as will be evident in the discussion following the
proof.
Proof. Fix t ≥ 0 and h > 0. By the Markov property,
X
pt+h (i, j) = Pi (Xt+h = j) = Pi (Xt = k)Pi (Xt+h = j|Xt = k)
k∈S
X
= pt (i, k)Pk (Xh = j)
k∈S
X
= pt (i, k)ph (k, j).
k∈S

From the infinitesimal description, we have ph (j, j) = 1 − q(j)h + o(h) and ph (k, j) =
q(k, j)h + o(h) when k 6= j. Thus
X
pt+h (i, j) = pt (i, j)(1 − q(j)h + o(h)) + pt (i, k)(q(k, j)h + o(h))
k6=j

which says
X
pt+h (i, j) − pt (i, j) = −pt (i, j)q(j)h + pt (i, k)q(k, j)h + o(h).
k6=j

Dividing both sides by h and taking the limit as h → 0 gives


d d X X
[Pt ]ij = pt (i, j) = −pt (i, j)q(j) + pt (i, k)q(k, j) = [P]ik [A]kj
dt dt k6=j k∈S

which proves the Forward Equation. The Backward Equation is proved similarly, in that
case conditioning on Xh = k rather than Xt = k. 

From the elementary theory of ODEs, we can now easily solve the Kolomogorov For-
ward and Backward Equations, at least in the case of a finite state space.
73

Corollary 20.3. Let (Xt )t≥0 be a finite-state continuous time Markov chain with infinitesimal
generator A. Then the transition probabilities are given by
Pt = etA .
The exponential of a matrix can be defined, for example, by the power-series etA =
P∞ tk k d tA
k=0 k! A . The essential point here is that dt e = AetA = etA A, which shows that
Pt = etA satisfies the Kolmogorov Forward and Backward equations. In addition, we
have 0A = I, the identity matrix, and indeed [P0 ]ij = P(X0 = j|X0 = i) = δij = [I]ij ;
hence, uniqueness of the solution to an ODE with given initial conditions proves the
corollary. It is worth noting that, when there are infinitely-many states, the matrix ex-
ponential may not even make sense, and there can in fact be more than one solution
to the equations. In this case, at can be proved (with probabilistic techniques) that the
Kolmogorov Backward equation possesses a unique minimal non-negative solution, and this
solution gives the transition probabilities pt (i, j).
Example 20.4. Consider an arbitrary two-state chain. The infinitesimal generator A then
has the form  
−α α
A=
β −β
where α = q(1, 2) and β = q(2, 1). We can calculate that
 2   
2 α + αβ −α2 − αβ −α α
A = = −(α + β) = −(α + β)A.
−αβ − β 2 αβ + β 2 β −β
Thus An = [−(α + β)]n−1 A, and so we have
∞ ∞ k ∞
tA
X tk k
X t [−(α + β)]k−1 1 X [−(α + β)t]k
e = A =I+ A=I− A
k=0
k! k=1
k! α + β k=1 k!

which we can write as


1 1 − e−(α+β)t
Pt = etA = I − A e−(α+β)t − 1 = I +

A.
α+β α+β
So we have a closed-form expression for the transition probabilities of any 2-state continuous-
time Markov chain.
Example 20.5. Let (Xt )t≥0 be a PoissonP process with rate λ > 0. So the transition rates are
q(i, j) = λδj,i+1 . It follows that q(i) = j6=i q(i, j) = λ. Hence, the Kolmogorov Forward
Equations are
d X
pt (i, 0) = −pt (i, 0)q(0) + pt (i, k)q(k, 0) = −λpt (i, 0)
dt k6=0
d X
pt (i, j) = −pt (i, j)q(j) + pt (i, j)q(k, j) = −λpt (i, j) + λpt (i, j − 1), j ≥ 1.
dt k6=j

Since this is an infinite system of ODEs, we cannot simply solve it by taking a matrix ex-
ponential; we do not even have a general theorem to tell us that there is a unique solution.
74

But we can, in fact, solve the system and prove it has a unique solution, as follows. Fix i,
and consider the probability generating function

X
ϕi (t, z) = pt (i, j)z j .
j=0

All the coefficients pt (i, j) are ≤ 1 in modulus, so the series converges (uniformly) for
|z| < 1. Taking the time derivative can then be done term-by-term:
∞   ∞  
∂ X d j d X d
ϕi (t, z) = pt (i, j) z = pt (i, 0) + pt (i, j) z j .
∂t j=0
dt dt j=1
dt

Kolmogorov’s Forward Equations then give



∂ X
ϕi (t, z) = −λpt (i, 0) + [−λpt (i, j) + λpt (i, j − 1)] z j
∂t j=1

X ∞
X
= −λ pt (i, j)z j + λ pt (i, j − 1)z j
j=0 j=1

X
= −λϕi (t, z) + λ pt (i, k)z k+1
k=0
= −λϕi (t, z) + λzϕi (t, z).
In other words, the function ϕi (t, z) satisfies the partial differential equation ∂t ϕi (t, z) =
λ(z − 1)ϕi (t, z). Dividing through by ϕi (t, z) gives
∂ 1 ∂
ln ϕi (t, z) = ϕi (t, z) = λ(z − 1).
∂t ϕi (t, z) ∂t
It follows immediately that the solution is
ϕi (t, z) = ϕi (0, z)eλt(z−1) = z i eλt(z−1) = e−λt z i eλtz
where the constant term ϕi (0, z) = z i comes from the fact that pi (0, j) = δij . We can then
recover the desired coefficients pt (i, j) by expanding the power-series
∞ ∞ ∞ k
X
j −λt i
X 1 k
X
−λt (λt)
pt (i, j)z = ϕi (t, z) = e z (λtz) = e z k+i .
j=0 k=0
k! k=0
k!

Comparing coefficients, we see that pt (i, j) = 0 for j < i, while for j ≥ i


(λt)j−i
pt (i, j) = e−λt .
(j − i)!
In particular, when i = 0, we have
(λt)j
P(Xt = j|X0 = 0) = pt (0, j) = e−λt ,
j!
confirming that Xt is a Poisson random variable with rate λt.
75

20.2. More in Poisson Processes. The Poisson process with rate λ has a very special
property that is not shared by all Markov processes. To get a hint of what it is, consider
the embedded jump chain (Yn )n≥0 of such a Poisson process (Xt )t≥0 . Since q(i, j) = 0 un-
less j = i + 1, we have XJn+1 = XJn + 1. It follows that Yn = Y0 + n; the jump chain is
just a deterministic one-step increase chain. Now, by Proposition 19.2, the sojourn times
are independent exponentials conditioned on the trajectory of (Yn ); but since there is only
one trajectory of (Yn ), and the jump rates q(i, i + 1) = λ do not depend on i, it follows that
the sojourn times are just plan i.i.d. Indeed, we can compute P(S1 > s1 , . . . , Sn > sn ) as
the conditional sum
X
P(S1 > s1 , . . . , Sn > sn |Y0 = i0 , . . . , Yn = in )P(Y0 = i0 , . . . , Yn = in ).
i0 ,...,in

All of the terms P(Y0 = i0 , . . . , Yn = in ) are 0 except for those for which ik = i0 + k for
1 ≤ k ≤ n. So we have just the single sum
X
P(S1 > s1 , . . . , Sn > sn |Y0 = i, . . . , Yn = i + n)P(Y0 = i, . . . , Yn = i + n).
i

Because the jump chain is deterministic, P(Y0 = i, . . . , Yn = i + n) = P(Y0 = i). By


Proposition 19.2,
P(S1 > s1 , . . . , Sn > sn |Y0 = i, . . . , Yn = i + n) = e−q(i)s1 · · · e−q(i+n−1)sn
= e−λ(s1 +···+sn ) .
So in total we have
X
P(S1 > s1 , . . . , Sn > sn ) = e−λ(s1 +···+sn ) P(Y0 = i) = e−λ(s1 +···+sn ) .
i
That is:
Proposition 20.6. If (Xt )t≥0 is a Poisson process with rate λ, then its sojourn times are i.i.d.
Exp(λ) random variables.
This gives us a slightly different way to describe/construct a Poisson process. We can
begin with a sequence S1 , S2 , . . . of i.i.d. Exp(λ) random variables, define Jn = S1 +· · ·+Sn ,
and then define Xt = n for Jn ≤ t < Jn+1 ; then (Xt )t≥0 is a Poisson process of rate λ.

Theorem 20.7. Let (Xt )t≥0 be a Poisson process of rate λ (with X0 = 0). Then for any s ≥ 0, the
et = Xt+s − Xs is a Poisson process of rate λ, independent of {Xu : 0 ≤ u ≤ s}.
process X
Proof. It suffices to prove the claim under the conditioning Xs = i for an arbitrary fixed
et for t ≥ 0, and A is any event
state i. Indeed, if B is any event determined by the process X
determined by {Xu : 0 ≤ u ≤ s}, if Aand B are independent conditioned on {Xs = i} then
X X
P(A, B) = P(A, B|Xs = i)P(Xs = i) = P(A|Xs = i)P(B|Xs = i)P(Xs = i).
i i

Note that, by the conditional independence claim, we have B and {Xs = i} are indepen-
dent conditioned on {Xs = i}, which means simply that B and {Xs = i} are independent.
Thus P(B|Xs = i) = P(B), and so
X X
P(A, B) = P(A|Xs = i)P(B)P(Xs = i) = P(B) P(A|Xs = i)P(Xs = i) = P(B)P(A)
i i
76

as required.
Let J1 , J2 , . . . be the jump times and S1 , S2 , . . . the sojourn times. The event {Xs = i}
can be written exclusively in terms of these variables: Xs = i means that there have been
exactly i jumps (and no more) by time s. Hence
{Xs = i} = {Ji ≤ s < Ji+1 } = {Ji ≤ s} ∩ {Si+1 > s − Ji }.
By definition Xe0 = Xs − X0 = Xs = i. The first jump time Je1 of X et is the portion of Si+1 to
the future of s; since Si+1 is the interval between Ji and Ji+1 , this means that
Je1 = Se1 = Si+1 − (s − Ji ) = Ji+1 − s.
For further sojourn times, we simply have a time shift
Sen = Si+n , n ≥ 2.
We can then calculate the distirbution of Se1 , . . . , Sen conditioned on S1 , . . . , Si and the event
{Xs = i} = {Ji ≤ s} ∩ {Ji+1 > s}.
P(Se1 > s̃1 , . . . , Sen > s̃n |S1 > s1 , . . . , Si > si , Xs = i)
= P(Si+n > s̃n , . . . , Si+2 > s̃2 , Ji+1 > s̃1 + s|Ji+1 > s, Ji ≤ s, Si > si , . . . , S1 > s1 )
P(Si+n > s̃n , . . . , Si+2 > s̃2 , Ji+1 > s̃1 + s, Ji ≤ s, Si > si , . . . , S1 > s1 )
= . (20.2)
P(Ji+1 > s, Ji ≤ s, Si > si , . . . , S1 > s1 )
Both numerator and denominator can be computed as multiple integrals. Recalling that
Ji = S1 + · · · + Si , and that S1 , S2 , . . . are i.i.d. Exp(λ) random variables, the denominator
of Equation 20.2 is
P(S1 + · · · + Si+1 > s, S1 + · · · + Si ≤ s, Si > si , . . . , S1 > s1 )
Z Z
= 1{u1 +···+ui+1 >s} λi+1 e−λ(u1 +···+ui+1 ) du1 · · · dui+1
u1 >s1 ,...,ui >si
u1 +···+ui ≤s
ui+1 ≥0
Z Z
= λi+1 e−λ(u1 +···+ui+1 ) du1 · · · dui+1 .
u1 >s1 ,...,ui >si
u1 +···+ui ≤s
ui+1 ≥s−(u1 +···+ui )

The inside integral can be computed as


Z ∞
i −λ(u1 +···+ui )
λe λe−λui+1 dui+1 = λi e−λ(u1 +···+ui ) e−λ(s−(u1 +···+ui ))
s−(u1 +···+un )

= λi e−λs .
Hence, the denominator of Equation 20.2 is simply equal to
Z
i −λs
λe du1 · · · dui .
u1 >s1 ,...,ui >si
u1 +···+ui ≤s

A totally analogous argument applied to the numerator of Equation 20.2, integrating out
all the variables ui+1 , . . . , ui+n , yields
P(Si+n > s̃n , . . . , Si+2 > s̃2 , Ji+1 > s̃1 + s, Ji ≤ s, Si > si , . . . , S1 > s1 )
Z
−λs̃n −λs̃2 i −λ(s+s̃1 )
= e ···e ·λe du1 · · · dui .
u1 >s1 ,...,ui >si
u1 +···+ui ≤s
77

Taking the ratio then yields


P(Se1 > s̃1 , . . . , Sen > s̃n |S1 > s1 , . . . , Si > si , Xs = i) = e−λs̃1 · · · e−λs̃n .
This shows that, conditioned on Xs = i, Se1 , . . . , Sen are independent Exp(λ) random vari-
ables, all independent from S1 , . . . , Si . Since it is clear that X et = Xs+t − Xs also has
embedded jump chain Yen = n, we see that X et is a Poisson process of rate λ.
The brief discussion following Proposition 20.6 shows that the process Xt is completely
determined by its Sojourn times. Indeed, conditioned on Xs = i, we can write Xr for
r ≤ s as
i
1{Sj ≤r} .
X
Xr = (20.3)
j=1
et = k is equal to {Se1 +· · ·+Sek ≤ t < Se1 +· · ·+Sek +Sek+1 },
Similarly, for any state k the event X
and by the previous paragraph this event is independent from S1 , . . . , Si and therefore
from Xr for r ≤ s. This concludes the proof. 
20.3. Independent Increments. Given a stochastic process (Xt )t≥0 , its increments are the
random variables {Xt − Xs : 0 ≤ s < t < ∞}. If (Xt )t≥0 is a Poisson process (or any
counting process, so that Xt represents the number of events that have occurred by time
t), then the increment Xt − Xs is the number of events that have occurred in the interval
(s, t]. It is because of the following corollary to Theorem 20.7 that the Poisson process is
so useful in applications.
Corollary 20.8. If (Xt )t≥0 is a Poisson process with rate λ, then for any sequence 0 ≤ t0 < t1 <
· · · < tn the increments Xtn − Xtn−1 , . . . , Xt1 − Xt0 are independent, and each increment Xt − Xs
is a Poisson random variable with rate λ(t − s). Moreover, these properties uniquely characterize
the Poisson process of rate λ.
Proof. First, write any increment Xt − Xs = X(t−s)+s − Xs . By Theorem 20.7, X̃u = Xu+s − Xs is a Poisson process with rate λ, and hence Xt − Xs = X̃t−s is Poisson(λ(t − s)) as claimed. For independence, proceed by induction. By Theorem 20.7, X̃t = Xt+tn − Xtn is
independent of Xs for s ≤ tn ; in particular, it is independent of Xtn − Xtn−1 , . . . , Xt1 − Xt0 ,
which are themselves all independent by the inductive hypothesis. It follows that for any
tn+1 = tn +t > tn , all the increments Xtn+1 −Xtn , Xtn −Xtn−1 , . . . , Xt1 −Xt0 are independent.
For the converse, note that independent increments with Poisson distributions com-
pletely determine the joint distributions of (Xt0 , . . . , Xtn ) for any choice of times 0 ≤ t0 <
· · · < tn . Hence, there is a unique process with these properties. The above proof shows
that the Poisson process of rate λ is such a process, hence it is the only one.


Not all Markov chains have independent increments, although the converse is true (and
easy to prove): if a stochastic process (with a countable state space and right-continuous
trajectories) has independent increments, then it is a Markov chain; if the increments
have the property that the distribution of Xt − Xs only depends on t − s (we say the process has stationary increments), then (Xt )t≥0 is time-homogeneous as well. The class of
such stochastic processes is called the Lévy processes. Most Markov chains are not Lévy
processes; the Poisson process is almost alone in this regard (with countable state space).
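For concreteness, here is a short Python sketch (not part of the original lecture; the rate λ = 2 and the time points 0 < 1 < 3 are chosen arbitrarily) that builds a Poisson process from i.i.d. Exp(λ) sojourn times and checks numerically that the increment over (1, 3] has mean and variance close to 2λ, and is essentially uncorrelated with the disjoint increment over (0, 1], as Corollary 20.8 predicts.

import random

def poisson_jump_times(lam, t_max, rng):
    # Jump times of a rate-lam Poisson process on [0, t_max], built from
    # i.i.d. Exp(lam) sojourn times.
    jumps, t = [], 0.0
    while True:
        t += rng.expovariate(lam)
        if t > t_max:
            return jumps
        jumps.append(t)

def increment(jumps, a, b):
    # Number of jumps in (a, b], i.e. the increment X_b - X_a.
    return sum(1 for t in jumps if a < t <= b)

rng = random.Random(0)
lam, trials = 2.0, 20000
inc_01, inc_13 = [], []
for _ in range(trials):
    jumps = poisson_jump_times(lam, 3.0, rng)
    inc_01.append(increment(jumps, 0.0, 1.0))
    inc_13.append(increment(jumps, 1.0, 3.0))

m1 = sum(inc_01) / trials
m2 = sum(inc_13) / trials
var2 = sum((x - m2) ** 2 for x in inc_13) / trials
cov = sum((x - m1) * (y - m2) for x, y in zip(inc_01, inc_13)) / trials
print("X_3 - X_1: mean", round(m2, 3), "variance", round(var2, 3), "(both should be near 4)")
print("covariance of disjoint increments:", round(cov, 3), "(should be near 0)")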

21. LECTURE 21: MAY 18, 2012


21.1. Birth and Death Chains. A continuous time Markov chain is called a birth and
death chain if
• the state space is S = {0, 1, 2, . . .}
• the transition rates are q(i, i + 1) = λi ≥ 0 for i ≥ 0, q(i, i − 1) = µi ≥ 0 for i ≥ 1,
and q(i, j) = 0 if j ≠ i ± 1.
If all the µi = 0, it is a pure birth process; if all the λi = 0 it is a pure death process. The
Poisson process with rate λ is a pure birth process with λi = λ for all i.
Example 21.1 (Kingman's Coalescent Process). Consider the pure death process (Xt )t≥0 with µ1 = 0 and µk = C(k, 2) = k(k − 1)/2 for k ≥ 2. This is a model of genetic coalescence: tracking ancestor lines backward in time. It models the following: suppose we take a sample (of size N) from a large population (of genes, for example), and track their ancestors back in time. As soon as we find a common ancestor, that shrinks the sample size by 1; the set of pairs of members of the sample has size C(N, 2), so there are this many possible coalescence points. Once one is reached, the process starts over with the new coalesced population of size N − 1.
What we are most interested in is T = min{t : Xt = 1}, the Time to Most Recent Common
Ancestor (TMRCA). Conditioned on X0 = N , this time is
T = S1 + S2 + · · · + SN −1
(here the sojourn times are given by S_1 = amount of time spent in state N, S_2 = amount of time spent in state N − 1, etc.). Conditioned on the trajectory of the jump chain (which is then guaranteed to be Y_j = N − j), these times are independent and S_j ∼ Exp(q(Y_{j−1})) = Exp(C(N − j + 1, 2)). Therefore we have
E[S_{N−k}] = 1/C(k + 1, 2) = 2/((k + 1)k) = 2(1/k − 1/(k + 1)).
Hence
E[T] = Σ_{j=1}^{N−1} E[S_j] = 2 Σ_{k=1}^{N−1} (1/k − 1/(k + 1)) = 2(1 − 1/N).

Thus, in this model, as N grows large, the TMRCA approaches 2.


Another interesting statistic is the sum of the branch lengths in the ancestral tree, L.
This gives a measure of the number of mutations in the population: if we assume mu-
tations happen randomly along the branches with a constant rate, then the number of
mutations is proportional to L. Conditioned on X_0 = N, the amount of time spent with sample size k is S_{N−k+1}, so the total branch length is L = Σ_{k=2}^{N} k S_{N−k+1}. So we have
E[L] = Σ_{k=2}^{N} k E[S_{N−k+1}] = Σ_{k=2}^{N} k · 2/(k(k − 1)) = 2 Σ_{k=2}^{N} 1/(k − 1) ≈ 2 ln N.
Hence, the number of mutations in a sample of size N is proportional to ln N .
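Both formulas are easy to check by simulation. The following Python sketch (an illustration added here, not from the lecture; N = 50 is arbitrary) runs the coalescent directly from its jump-chain and sojourn-time description and compares the sample means of T and L with 2(1 − 1/N) and 2 Σ_{k=2}^{N} 1/(k − 1).

import random

def coalescent(N, rng):
    # Simulate the pure death chain of Example 21.1 started at N.
    # While k >= 2 lineages remain, the sojourn time is Exp(C(k, 2)).
    # Returns (T, L): time to most recent common ancestor, total branch length.
    T, L = 0.0, 0.0
    for k in range(N, 1, -1):
        s = rng.expovariate(k * (k - 1) / 2.0)
        T += s
        L += k * s          # k branches are alive during this sojourn
    return T, L

rng = random.Random(1)
N, trials = 50, 20000
results = [coalescent(N, rng) for _ in range(trials)]
mean_T = sum(r[0] for r in results) / trials
mean_L = sum(r[1] for r in results) / trials
print("E[T] ~", round(mean_T, 3), " theory:", round(2 * (1 - 1 / N), 3))
print("E[L] ~", round(mean_L, 3), " theory:", round(2 * sum(1 / (k - 1) for k in range(2, N + 1)), 3))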

Example 21.2 (Explosion). Let (Xt )t≥0 be a pure birth process with λi = i^2 for i ∈ N. Condition on X0 = 1. If TN is the time to reach state N, then TN = S1 + · · · + S_{N−1}, and so
E[TN ] = Σ_{i=1}^{N−1} E[Si ] = Σ_{i=1}^{N−1} 1/i^2.
In particular, this means that lim_{N→∞} E[TN ] = π^2/6 < ∞. Since all the random variables are positive, we can exchange the limit and the sum, and if we set T = Σ_{i=1}^{∞} Si (the time to reach ∞), then E[T] = π^2/6 < ∞. In particular, this means P(T < ∞) = 1.
The situation in Example 21.2 is called explosion. It presents a subtlety that we have thus far ignored: what happens after the time T = Σ_{i=1}^{∞} Si ? The jump-chain, sojourn-times description is insufficient to answer this. Indeed, after this time, the process may restart in any state, and the Markov property is still obeyed. It is for this reason that the Kolmogorov forward and backward equations can have multiple solutions (despite always having the same initial data P0 = I). If explosion occurs, then the very same transition rates describe different chains: one that stays at ∞ forever after the explosion, and others that restart in other states. Such chains are called non-minimal. We will not discuss such chains. But it is important to realize that the jump-chain, sojourn-time description is only valid up to the time of explosion (if it occurs).
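A quick numerical illustration of explosion (a sketch, not part of the notes; truncating the chain at 1000 states is an arbitrary approximation to the full infinite sum): adding up simulated sojourn times gives an average total time close to π²/6, even though the chain runs through arbitrarily many states.

import math
import random

def truncated_explosion_time(rng, top=1000):
    # Total sojourn time of the pure birth chain with lambda_i = i^2, started at 1,
    # through states 1, 2, ..., top-1.  This approximates the explosion time
    # T = sum_{i >= 1} S_i, whose mean is pi^2/6.
    return sum(rng.expovariate(i * i) for i in range(1, top))

rng = random.Random(2)
trials = 5000
mean_T = sum(truncated_explosion_time(rng) for _ in range(trials)) / trials
print("mean (truncated) explosion time ~", round(mean_T, 4))
print("pi^2 / 6 =", round(math.pi ** 2 / 6, 4))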

21.2. Recurrence and Transience. The same asymptotic notions for the behavior of a
Markov chain apply in the continuous-time setting as the discrete-time setting.
Definition 21.3. Let (Xt )t≥0 be a continuous-time Markov chain with state space S, and let
i ∈ S. Let Ti = min{t ≥ J1 : Xt = i} be the first return time to i (the first time, after the first jump, at which the chain is in state i). The state i is called transient if Pi (Ti < ∞) < 1; it is called recurrent if Pi (Ti < ∞) = 1; it is called positive recurrent if Ei [Ti ] < ∞.
Note that if Xt = i then there is k with Jk ≤ t < Jk+1 , and so the jump chain also
satisfies Yk = i. Since Yk = XJk , it follows that Yn returns to i iff Xt returns to i, and so the
state i is recurrent (resp. transient) for the chain Xt iff it is recurrent (resp. transient) for
the jump chain Yn . We therefore need no new technology to analyze a continuous-time
chain for recurrence/transience; we need only look at its jump chain. (In particular, as
before, if the chain is irreducible – meaning for each i, j ∈ S pt (i, j) > 0 for some t > 0 –
then either all states are recurrent or all states are transient.)
However, positive recurrence is determined not only by the property of return to a
state, but how long it takes to get there, which is different for the two chains. Hence,
positive recurrence is a different matter for continuous-time chains than discrete-time.

Example 21.4. Let (Xt )t≥0 be a birth and death chain with parameters λi = q(i, i + 1) > 0
for i ≥ 0 and µi = q(i, i − 1) > 0 for i ≥ 1. Since all the parameters are > 0, the chain is
irreducible, so we may test any state (by default 0) for transience/recurrence of the chain.
We can analyze such a chain in exactly the same manner we did for discrete-time birth
and death chains in Section 9.1.
Let h(i) = Pi (∃t ≥ 0 : Xt = 0); then h(0) = 1. For i ≥ 1, we do a first-jump analysis: we have
h(i) = Σ_{j≥0} Pi (∃t ≥ 0 : Xt = 0 | X_{J_1} = j) Pi (X_{J_1} = j).

By the strong Markov property, we can restart the process at time J1 . Since i ≠ 0, Xt ≠ 0 for any t < J1 , and so Pi (∃t ≥ 0 : Xt = 0 | X_{J_1} = j) = Pj (∃t ≥ 0 : Xt = 0) = h(j). Hence we have the usual recursion
h(i) = Σ_{j≥0} p(i, j) h(j),   i ≥ 1.

For this chain p(i, j) = q(i, j)/q(i) = 0 unless j = i ± 1. So we have q(i) = λi + µi , and the
recursion becomes
λi µi
h(i) = h(i + 1) + h(i − 1).
λi + µi λi + µi
Rewriting this as
(λi + µi )h(i) = λi h(i + 1) + µi h(i − 1)
we can subtract to get
0 = λi h(i + 1) − λi h(i) − µi h(i) + µi h(i − 1) = λi [h(i + 1) − h(i)] − µi [h(i) − h(i − 1)].
Hence, we have
h(i + 1) − h(i) = (µi /λi )[h(i) − h(i − 1)],   i ≥ 1.
By induction, then, we have
h(i + 1) − h(i) = (µi µi−1 · · · µ1 )/(λi λi−1 · · · λ1 ) [h(1) − h(0)] ≡ ρi [h(1) − h(0)].
Setting ρ0 = 1 (so that this formula also covers i = 0) and summing the telescoping differences gives
h(n) − h(0) = Σ_{i=0}^{n−1} [h(i + 1) − h(i)] = [h(1) − h(0)] Σ_{i=0}^{n−1} ρi .    (21.1)

Now, we have two cases. Suppose that Σ_{i=0}^{n−1} ρi → ∞ as n → ∞. Since the left-hand side h(n) − h(0) is a difference of two probabilities, it stays bounded (in [−1, 1]) for all n; the right-hand side must as well, which means we must have h(1) − h(0) = 0. Since h(0) = 1, this means 1 = h(1) = P1 (∃t ≥ 0 : Xt = 0). It then similarly follows that the left-hand side is always 0, so 1 = h(n) = Pn (∃t ≥ 0 : Xt = 0) for all n. This means (by Homework 4, Problem 1) that the chain is recurrent.
Thus, if Σ_{i=0}^{∞} ρi = ∞ then the chain is recurrent. Suppose, on the other hand, that this sum is finite. Referring to Equation 21.1 once more, set u(n) = h(0) − h(n) = 1 − h(n). Since h(n) is a probability, u(n) ∈ [0, 1]. Of course u(n) = 0 for all n is a solution, as in the case that Σ_{i=0}^{∞} ρi = ∞. But there are strictly positive solutions as well. Indeed, we can choose the value of u(1) as we like, provided that u(n) ∈ [0, 1] for all n. Since ρi > 0, the sequence u(1) Σ_{i=0}^{n−1} ρi is increasing, and so we need

u(1) Σ_{i=0}^{∞} ρi = lim_{n→∞} u(1) Σ_{i=0}^{n−1} ρi = lim_{n→∞} u(n) ≤ 1.
The minimal solution h(n), meaning the maximal solution u(n), is then achieved when u(1) = (Σ_{i=0}^{∞} ρi )^{−1}. So we have
1 − h(n) = u(1) Σ_{i=0}^{n−1} ρi = (Σ_{i=0}^{n−1} ρi ) / (Σ_{i=0}^{∞} ρi )

meaning that
h(n) = 1 − (Σ_{i=0}^{n−1} ρi ) / (Σ_{i=0}^{∞} ρi ) = (Σ_{i=n}^{∞} ρi ) / (Σ_{i=0}^{∞} ρi ).
As n → ∞, h(n) → 0, which means that, for some n, there is a positive probability that the chain started at n never reaches 0. This means 0 is a transient state, and so the chain is transient.
Hence, recurrence is characterized by the summability of Σ_i ρi : the chain is recurrent if the sum is ∞, and transient if it is < ∞.
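As a crude numerical companion to this computation (a sketch, not from the notes; the constant rates below are purely illustrative), one can watch the partial sums of the ρ_i: a partial sum that keeps growing points to recurrence, a convergent one to transience. Of course a finite partial sum can only suggest, not prove, divergence.

def partial_rho_sum(lam, mu, n_terms):
    # Partial sum rho_1 + ... + rho_n, where rho_i = (mu_1 ... mu_i)/(lambda_1 ... lambda_i)
    # and lam(j), mu(j) return the birth/death rates out of state j.
    total, r = 0.0, 1.0
    for i in range(1, n_terms + 1):
        r *= mu(i) / lam(i)
        total += r
    return total

# Constant rates (the M/M/1 queue of the next example): for mu >= lam the terms do not
# decay and the partial sums keep growing (recurrent); for mu < lam they decay
# geometrically and the sum converges (transient).
for lam_c, mu_c in [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]:
    s = partial_rho_sum(lambda j: lam_c, lambda j: mu_c, 500)
    print("lambda =", lam_c, " mu =", mu_c, " partial sum of rho_i =", s)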
Example 21.5 (M/M/1 queueing system). Consider a birth and death chain with λi = λ
and µi = µ both constant and non-zero. This is called an M/M/1 queue. It is a model for
a system where jobs (or customers) arrive at Poissonian times (at rate λ), queue up, and
are served in the order they arrived at rate µ. The process Xt is the number of jobs in the
queue at time t. There are much more general kinds of queueing models; the M/M stands
for Markov/Markov (meaning that the arrivals and service times are both Markovian),
and the index 1 means that only 1 job is served at a time.
Following Example 21.4, here we simply have ρi = µ^i /λ^i = (µ/λ)^i . Hence Σ_{i=0}^{∞} ρi = Σ_{j=0}^{∞} (µ/λ)^j is infinite if µ ≥ λ, and is finite and equal to (1 − µ/λ)^{−1} if µ < λ. So Xt is recurrent if µ ≥ λ, and transient if µ < λ. Indeed: if the service rate is at least the arrival rate, we expect the queue to keep emptying out over time; if the service rate is less than the arrival rate, then the line eventually grows without bound and never empties again (i.e. 0 is never revisited).

22. LECTURE 22: MAY 23


22.1. Stationary Distribution.
Definition 22.1. Let (Xt )t≥0 be a continuous time Markov chain with transition rates q(i, j). A
probability distribution π is called stationary or invariant if, for each state j,
q(j)π(j) = Σ_i π(i)q(i, j).

In terms of the infinitesimal generator A, this says πA = 0.


Remark 22.2. One might think that the stationary distribution for Xt should be the same as the stationary distribution ϖ for its jump chain Yn . However, the distribution ϖ satisfies ϖ(j) = Σ_i ϖ(i)p(i, j) = Σ_i ϖ(i) q(i, j)/q(i). Therefore, if π is invariant for (Xt )t≥0 then ϖ(i) = q(i)π(i) is invariant for (Yn )n≥0 . Note that, even if π is a probability distribution (Σ_i π(i) = 1), it may be that ϖ is not summable, and so there is no invariant distribution for Yn . The converse holds similarly: (Yn )n≥0 may have an invariant distribution while (Xt )t≥0 fails to. This contributes to the fact that positive recurrence can differ between the process and its jump chain.

Proposition 22.3. Let (Xt )t≥0 be a continuous time non-explosive Markov chain, and suppose
that π is a stationary distribution for (Xt )t≥0 . If P(X0 = j) = π(j) for all states j, then P(Xt = j) = π(j) for all t > 0 and all states j.
Remark 22.4. In the following proof, we will basically assume there are finitely-many states (this is required to interchange the derivative and the sum). A different proof is needed to show the result holds true for infinitely-many states. Since Kolmogorov's equations actually only determine the process up to the time of explosion, however, the fact that (d/dt) P(Xt = j) = 0 only shows that P(Xt = j) is constant up to the explosion time. If the process explodes, it can well start in and stay in the stationary distribution up to the explosion time, and then be in an arbitrary new distribution after explosion.
Proof. Fix a state j. Then
(d/dt) P(Xt = j) = (d/dt) Σ_i P(X0 = i)P(Xt = j|X0 = i) = (d/dt) Σ_i π(i)pt (i, j) = Σ_i π(i) (d/dt) pt (i, j).
By Kolmogorov's backward equation,
(d/dt) pt (i, j) = Σ_{k≠i} q(i, k)pt (k, j) − q(i)pt (i, j)
and so
Σ_i π(i) (d/dt) pt (i, j) = Σ_i π(i) Σ_{k≠i} q(i, k)pt (k, j) − Σ_i π(i)q(i)pt (i, j).
In the first term, exchanging the order of summation yields
Σ_k pt (k, j) Σ_{i≠k} π(i)q(i, k) = Σ_k pt (k, j) Σ_i π(i)q(i, k) = Σ_k pt (k, j)q(k)π(k),
where we have used the fact that q(k, k) = 0 and the definition of invariance of π. Hence we have
(d/dt) P(Xt = j) = Σ_k pt (k, j)q(k)π(k) − Σ_i π(i)q(i)pt (i, j) = 0.
It follows that t ↦ P(Xt = j) is constant, as claimed. □
Hence, it makes sense to call such a distribution stationary or invariant.

Example 22.5. Let (Xt )t≥0 be an irreducible birth-and-death chain. (Irreducible here sim-
ply means that λi > 0 and µi > 0 for all i.) Then q(0) = λ0 , q(j) = λj + µj for j ≥ 1, and
the equations for a stationary distribution are
λ0 π(0) = µ1 π(1), (λj + µj )π(j) = µj+1 π(j + 1) + λj−1 π(j − 1), j ≥ 1.
The equations for j ≥ 1 can be written in the form
µj+1 π(j + 1) − µj π(j) = λj π(j) − λj−1 π(j − 1).
Thus, for example, µ2 π(2) − µ1 π(1) = λ1 π(1) − λ0 π(0); but λ0 π(0) = µ1 π(1) and so this simplifies to µ2 π(2) = λ1 π(1). By induction, it follows that
µ_{j+1} π(j + 1) = λ_j π(j),   j ≥ 0,
and hence (since µj > 0 for all j), again by induction, we have
π(j) = (λ_{j−1}/µ_j ) π(j − 1) = (λ_{j−1}λ_{j−2}/(µ_j µ_{j−1})) π(j − 2) = · · · = (λ_0 · · · λ_{j−1}/(µ_1 · · · µ_j )) π(0).
Define this new coefficient as θ_j ; in terms of the previous notation ρ_j , we have
θ_j = (λ_0 · · · λ_{j−1})/(µ_1 · · · µ_j ) = (λ_0 /µ_j )(1/ρ_{j−1}).
By convention, we set θ_0 = 1. Thus, normalizing π(0) as we like, we have that π(j) ∝ θ_j is an invariant measure; in order for it to be a stationary distribution, we need it to be summable. Hence, an irreducible birth-and-death chain has a stationary distribution if and only if Σ_{i=0}^{∞} θ_i < ∞, in which case the stationary distribution is π(j) = (Σ_{i=0}^{∞} θ_i )^{−1} θ_j .

Example 22.6. Consider the M/M/1 queue of Example 21.5. This is a birth-and-death
chain with µi = µ and λi = λ for all i. Hence, we have
θ_j = (λ_0 · · · λ_{j−1})/(µ_1 · · · µ_j ) = λ^j /µ^j = (λ/µ)^j .
So (θ_j ) is summable if and only if λ < µ, in which case the sum is
Σ_{j=0}^{∞} θ_j = Σ_{j=0}^{∞} (λ/µ)^j = 1/(1 − λ/µ).
Hence, the stationary distribution for this chain is
π(j) = (Σ_{i=0}^{∞} θ_i )^{−1} θ_j = (1 − λ/µ)(λ/µ)^j ,
which is a (shifted) geometric distribution.
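As an illustration (a Python sketch added here, not part of the lecture; λ = 1, µ = 2 and the time horizon are arbitrary choices), one can simulate the M/M/1 queue from its jump-chain and sojourn-time description and compare the long-run fraction of time spent in each state with the geometric stationary distribution just computed.

import random

def mm1_occupation_fractions(lam, mu, horizon, n_states, rng):
    # Simulate the M/M/1 queue (arrival rate lam, service rate mu) started empty,
    # up to total time `horizon`, and return the fraction of time spent in each of
    # the states 0, 1, ..., n_states - 1.
    time_in = [0.0] * n_states
    x, t = 0, 0.0
    while t < horizon:
        rate = lam if x == 0 else lam + mu
        s = rng.expovariate(rate)              # sojourn time in the current state
        if x < n_states:
            time_in[x] += min(s, horizon - t)
        t += s
        if x == 0 or rng.random() < lam / (lam + mu):
            x += 1                              # arrival
        else:
            x -= 1                              # departure
    return [w / horizon for w in time_in]

rng = random.Random(3)
lam, mu = 1.0, 2.0
emp = mm1_occupation_fractions(lam, mu, 200000.0, 6, rng)
theory = [(1 - lam / mu) * (lam / mu) ** j for j in range(6)]
for j in range(6):
    print("state", j, " empirical fraction", round(emp[j], 4), " stationary pi(j)", round(theory[j], 4))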

Example 22.7 (M/M/∞ queue). Consider a queueing system with infinitely many servers. Jobs arrive according to a Poisson process with rate λ, and are served independently with independent service times at rate µ, where all jobs currently in the queue are served simultaneously. We model this as a birth-and-death chain with λj = λ for all j ≥ 0, and µj = jµ for all j ≥ 1. Thus, we have
θ_j = (λ_0 · · · λ_{j−1})/(µ_1 · · · µ_j ) = λ^j /(µ · 2µ · · · jµ) = (1/j!)(λ/µ)^j
and so
Σ_{j=0}^{∞} θ_j = Σ_{j=0}^{∞} (1/j!)(λ/µ)^j = e^{λ/µ},
which is finite for any λ, µ > 0. Hence the stationary distribution always exists, and is given by
π(j) = (e^{−λ/µ}/j!)(λ/µ)^j .
That is, the stationary distribution is Poissonian with mean λ/µ. In particular, if the queue
is at equilibrium, i.e. in the stationary distribution, then the mean number of customers
in the queue is the mean of Poisson(λ/µ), which is λ/µ.

22.2. Convergence to the Stationary Distribution. The exact analog of the basic conver-
gence theory for discrete Markov chains (cf. Corollary 11.1, Theorem 11.3, and Theorem
12.1) holds for continuous time chains as well, with the proviso that the theory is sim-
pler: there is no issue with periodicity. Indeed, for any continuous-time Markov chain, if
pt (i, j) > 0 for some time t > 0, then in fact pt (i, j) > 0 for all time t > 0 (this is Problem 2
on Homework 6).
Theorem 22.8. Let (Xt )t≥0 be an irreducible, continuous-time Markov chain, with transition kernel pt (i, j) = Pi (Xt = j) for states i, j, and transition rates q(i, j) ≥ 0 with q(i) = Σ_{j≠i} q(i, j).
The following are equivalent.
(1) All states are positive recurrent.
(2) Some state is positive recurrent.
(3) The chain is non-explosive, and there exists a stationary distribution π.
Moreover, when these conditions hold, the stationary distribution is given by
π(j) = 1/(q(j) Ej [Tj ])
where Tj is the return time to state j. Finally, under these conditions, we have pt (i, j) → π(j) as
t → ∞ for any states i, j.

Hence, Example 22.6 shows that the M/M/1 queue is, in fact, positive recurrent if and
only if λ < µ, and that it converges to equilibrium in this case. We already saw, in Example
21.5, that the M/M/1 queue is recurrent when λ ≤ µ; we conclude that in the critical case
λ = µ, the chain is null recurrent. As usual, this means that pt (i, j) → 0 for each fixed i, j in
this case. Example 22.7 shows that the M/M/∞ queue is always positive recurrent, and
always converges to its stationary (Poissonian) distribution.

The proofs of all parts of Theorem 22.8 are very similar to the proofs of Corollary 11.1
and Theorem 12.1, and so we will not touch them here. It is important to note that the non-
explosion clause (3) in the theorem is absolutely necessary. For example, for convergence,
we know that an exploding chain can start in the stationary distribution and hence stay in
it until it explodes; afterward it can take any distribution at all, and so may not converge
back to the stationary distribution.
Example 22.9. Consider a birth-and-death chain with λj = λ · 2j and µj = µ · 2j for some
parameters λ, µ chosen so that 1 < λ/µ < 2. Then
θ_j = (λ_0 · · · λ_{j−1})/(µ_1 · · · µ_j ) = (λ · 2λ · · · 2^{j−1}λ)/(2µ · 4µ · · · 2^j µ) = (λ/µ)^j (1/2^j )
and so Σ_{j=0}^{∞} θ_j < ∞, so an invariant distribution exists. But because the rates increase exponentially up the chain, explosion does occur, so the limiting behavior of the chain is not determined by the rate parameters alone.

23. LECTURE 23: MAY 23, 2012


We are about to discuss a new kind of stochastic process, called martingales, that are
not necessarily Markov processes (but are associated to them in important ways). We
begin with a motivating example that we have already studied quite a bit from a different
perspective.
Example 23.1. Let (Xn )n≥0 be SRW on Z. This is a Markov chain, of course, but it is even
more “memoryless” than your average Markov chain. To see how, let’s think about it in
terms of a game: recall that Xn can be thought of as the sum of n independent Bernoulli
trials. As such, we think of it as a kind of gambling game: you and your opponent each
wager $1 each toss; if heads comes up you win your dollar and your opponent’s, while if
tails you lose your dollar. Hence Xn represents your winnings after n tosses. In this con-
text, it is natural to ask about the expectation of the process: what are your expected winnings
after n tosses? In this simple game, we have
E[Xn ] = E[B1 + · · · + Bn ] = Σ_{j=1}^{n} E[Bj ] = 0

where Bj are the independent Bernoulli trials that all have expectation 0. So the expec-
tation is constant(ly equal to 0). This is inevitable for a “fair game” – if your expected
winnings were ever positive, your opponent’s would be negative, which means the game
would be slanted in your favor.
Actually, there is a stronger “fairness” property that holds for this game. Suppose m
tosses have already occurred, and your fortune has varied according to the trajectory
i0 , i1 , . . . , im over that time. After some larger number of tosses n, what is your expected
winnings? Since we already know Xm = im , the interesting quantity is Xn − Xm : your net
winnings (or loss) since time m. So what we want to know is
E[Xn − Xm |X0 = i0 , . . . , Xm = im ].
Expressing this as a telescoping sum and using the linearity of expectation, we have
E[Xn − Xm |X0 = i0 , . . . , Xm = im ] = Σ_{k=m}^{n−1} E[Xk+1 − Xk |X0 = i0 , . . . , Xm = im ].

Now further conditioning up to time k, we have that E[Xk+1 − Xk |X0 = i0 , . . . , Xm = im ] is equal to
Σ_{i_{m+1},...,i_k} E[Xk+1 − Xk |X0 = i0 , . . . , Xm = im , Xm+1 = im+1 , . . . , Xk = ik ] · P(Xm+1 = im+1 , . . . , Xk = ik |X0 = i0 , . . . , Xm = im ).


The one-step increment conditional probabilities are easy to calculate. From the Markov
property,
E[Xk+1 − Xk |X0 = i0 , . . . , Xm = im , Xm+1 = im+1 , . . . , Xk = ik ]
= E[Xk+1 |X0 = i0 , . . . , Xm = im , Xm+1 = im+1 , . . . , Xk = ik ] − ik
= E[Xk+1 |Xk = ik ] − ik
= E[X1 |X0 = ik ] − ik .

Since X1 = X0 + B1 where B1 is a symmetric Bernoulli trial independent from X0 , and


since E[B1 ] = 0, it follows that
E[X1 |X0 = ik ] = E[X0 + B1 |X0 = ik ] = ik + E[B1 |X0 = ik ] = ik + E[B1 ] = ik .
Hence all the terms in the above sum are 0, and we have the strong averaging property
E[Xn − Xm |X0 = i0 , . . . , Xm = im ] = 0 (23.1)
for any initial trajectory i0 , . . . , im with m ≤ n. In other words, no matter what has hap-
pened to the player’s fortune so far, the expected net win or loss for any future time is
always 0.

A process which satisfies Equation 23.1 is called a (discrete-time) martingale. So SRW


is a martingale. It is also a Markov chain, and in fact we used the Markov property to
deduce the martingale property – but not alone. Indeed, Markov chains need not take
real values, so the expectation is not in general even defined. Even when it is, in general a
Markov chain Xn need not have constant expectation – E[Xn ] may well vary with n. For
example, consider a non-symmetric random walk on Z, with p(i, i+1) = p and p(i, i−1) =
1 − p with p ≠ 1/2, cf. Example 4.5. This random walk can be realized as Xn = B1 + · · · + Bn
where Bj are i.i.d. biased Bernoulli random variables with P(Bj = 1) = p and P(Bj =
−1) = 1 − p. Then E[Bj ] = 1 · p + (−1) · (1 − p) = 2p − 1, and so E[Xn ] = (2p − 1)n is only
constant when p = 1/2. Most Markov chains are not martingales. Conversely, most martingales are not Markov chains either;
but before we see examples, it will be useful to improve our understanding of conditional
expectation.

23.1. Conditional Expectation. We are accustomed to thinking of conditional expecta-


tion as a number: if XP is a random variable with (discrete) state space S ⊂ R and B is an
event, then E[X|B] = x∈S x · P(X = x|B). The kinds of events we typically condition on
are the outcomes of (other) random variables: for example B = {Y1 = i1 , . . . , Yn = in } for
some random variables Y1 , . . . , Yn and some states i1 , . . . , in . As a matter of convenience,
we can then group all of the conditional expectations together into a new random vari-
able which we call E[X|Y1 , . . . , Yn ]. It simply takes the value E[X|Y1 = i1 , . . . , Yn = in ] in
the event {Y1 = i1 , . . . , Yn = in }. That is:

E[X|Y1 , . . . , Yn ] = Σ_{i1 ,...,in } E[X|Y1 = i1 , . . . , Yn = in ] 1_{\{Y1 =i1 ,...,Yn =in \}} .

This may look like simple book-keeping, but it brings new intuition about conditioning.
We start with a random variable X. We are then given some information, in the form of
some other random variables Y1 , . . . , Yn that we may observe. Given these observations,
what is our best guess for the value of X? In the (random event) that Y1 = i1 , . . . , Yn = in ,
our best guess is the conditional expectation E[X|Y1 = i1 , . . . , Yn = in ]. Hence, the new
random variable E[X|Y1 , . . . , Yn ] is our best (random) guess about the value of X given
any observations of Y1 , . . . , Yn .
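Here is a tiny numerical illustration of this point of view (a sketch, not from the notes; the variables are artificial): take Y1, Y2, Z independent fair 0/1 coin flips and X = Y1 + Y2 + Z. The random variable E[X | Y1, Y2] should be Y1 + Y2 + 1/2, and averaging X over each observed event recovers exactly these values.

import random

rng = random.Random(4)
# Each sample is (Y1, Y2, Z); the quantity of interest is X = Y1 + Y2 + Z.
samples = [(rng.randint(0, 1), rng.randint(0, 1), rng.randint(0, 1))
           for _ in range(200000)]

def empirical_cond_exp(y1, y2):
    # Estimate E[X | Y1 = y1, Y2 = y2] by averaging X over the samples in that event.
    vals = [a + b + c for (a, b, c) in samples if a == y1 and b == y2]
    return sum(vals) / len(vals)

for y1 in (0, 1):
    for y2 in (0, 1):
        est = empirical_cond_exp(y1, y2)
        print("E[X | Y1=%d, Y2=%d] ~ %.3f  (exact: %.1f)" % (y1, y2, est, y1 + y2 + 0.5))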
Example 23.2. Suppose that X is a function of Y1 , . . . , Yn , X = F (Y1 , . . . , Yn ) for some
fixed (deterministic) F . Then complete information about Y1 , . . . , Yn yields complete in-
formation about X, so our “best guess” about X should be X itself. Indeed, in this case

we have
E[X|Y1 , . . . , Yn ] = E[F (Y1 , . . . , Yn )|Y1 , . . . , Yn ]
  = Σ_{i1 ,...,in } E[F (Y1 , . . . , Yn )|Y1 = i1 , . . . , Yn = in ] 1_{\{Y1 =i1 ,...,Yn =in \}}
  = Σ_{i1 ,...,in } F (i1 , . . . , in ) 1_{\{Y1 =i1 ,...,Yn =in \}} .

On the other hand, we also have
X = Σ_{i1 ,...,in } X 1_{\{Y1 =i1 ,...,Yn =in \}} ,
and since X = F (Y1 , . . . , Yn ), on the event {Y1 = i1 , . . . , Yn = in } we have X = F (i1 , . . . , in ). Thus
X = Σ_{i1 ,...,in } X 1_{\{Y1 =i1 ,...,Yn =in \}} = Σ_{i1 ,...,in } F (i1 , . . . , in ) 1_{\{Y1 =i1 ,...,Yn =in \}} = E[X|Y1 , . . . , Yn ].

When X is a function of Y1 , . . . , Yn , we say X is measurable with respect to Y1 , . . . , Yn .


So Example 23.2 shows that, when X is measurable with respect to some random vari-
ables Y1 , . . . , Yn , we have E[X|Y1 , . . . , Yn ] = X – our best guess for the value of X is X
itself, since we have complete information.

Example 23.3. On the other hand, suppose that X and {Y1 , . . . , Yn } are independent. In
this case, information about Y1 , . . . , Yn should be essentially useless in determining the
value of X. What does this mean about E[X|Y1 , . . . , Yn ]? For any states x, i1 , . . . , in the
events {X = x} and {Y1 = i1 , . . . , Yn = in } are independent, so P(X = x|Y1 = i1 , . . . , Yn =
in ) = P(X = x), and therefore
E[X|Y1 = i1 , . . . , Yn = in ] = Σ_x x · P(X = x|Y1 = i1 , . . . , Yn = in ) = Σ_x x · P(X = x) = E[X].
Thus
E[X|Y1 , . . . , Yn ] = Σ_{i1 ,...,in } E[X|Y1 = i1 , . . . , Yn = in ] 1_{\{Y1 =i1 ,...,Yn =in \}} = E[X] Σ_{i1 ,...,in } 1_{\{Y1 =i1 ,...,Yn =in \}} = E[X].

Thus, when X is independent from Y1 , . . . , Yn , the “best guess” E[X|Y1 , . . . , Yn ] is just the
constant E[X]. This is built into conditional expectation; the average value of X without
conditioning is always known. If no further information is given, this constant is the
conditional expectation as well.

Example 23.4. Consider the SRW Xn of Example 23.1 again. Using Equation 23.1, we have
E[Xn − Xm |X0 , . . . , Xm ] = Σ_{i0 ,...,im } E[Xn − Xm |X0 = i0 , . . . , Xm = im ] 1_{\{X0 =i0 ,...,Xm =im \}} = 0.
Now, Xm is certainly measurable with respect to X0 , . . . , Xm , and so we also have
E[Xn − Xm |X0 , . . . , Xm ] = E[Xn |X0 , . . . , Xm ] − E[Xm |X0 , . . . , Xm ] = E[Xn |X0 , . . . , Xm ] − Xm
by Example 23.2. Thus, for n ≥ m, we have E[Xn |X0 , . . . , Xm ] = Xm ; the best guess about
our future fortune at time n, given complete information up to time m, is our present for-
tune Xm . This is the essence of the “average fairness” property that defines a martingale.

Let’s collect the properties of conditional expectation from the last examples (and a few
more) in the following proposition.
Proposition 23.5. Let X, X 0 be random variables, and Y = Y1 , . . . , Yn a collection of random
variables. The following properties hold.
(1) For a, b ∈ R, E[aX + bX 0 |Y] = aE[X|Y] + bE[X 0 |Y].
(2) If X is Y-measurable, then E[X|Y] = X.
(3) If X is independent of Y, then E[X|Y] = E[X].
(4) (Tower Property) Let Z be another collection of random variables, and suppose Y are Z-
measurable, so Y = F (Z) for some function F . (A typical situation is when Z ⊇ Y).
Then
E[E[X|Z]|Y] = E[X|Y].
(5) (Factoring) If Y is Y-measurable then E[XY |Y] = Y E[X|Y].
Proof. Properties (2) and (3) were proved in Examples 23.2 and 23.3. Properties (1), (4),
and (5) all follow from straightforward (but lengthy) calculations immediate from the
definitions. 
The simplest case of the Tower Property (4) in Proposition 23.5 is worthy of its own
statement.
Corollary 23.6. If X and Y = Y1 , . . . , Yn are random variables, then E[E[X|Y]] = E[X].
Proof. The set ∅ of random variables is (vacuously) independent from all random vari-
ables, including X, and Y ⊃ ∅ for any Y. Thus, by the Tower Property (4) of Proposition
23.5, we have
E[E[X|Y]|∅] = E[X|∅] = E[X]
where the second equality follows from (3) in Proposition 23.5. By the same argument, the
random variable E[X|Y] is independent from ∅, and so again by (3)
E[E[X|Y]|∅] = E[E[X|Y]]
which proves the claim. 

24. LECTURE 24: MAY 23, 2012


24.1. Martingales. We have now seen several examples and equivalent formulations of
the martingale concept. Let’s give the full definition. To make all calculations valid, we
need to ensure that expectations are finite.
Definition 24.1. A (discrete-time) martingale is a stochastic process (Xn )n≥0 which satisfies
E[|Xn |] < ∞ and
E[Xn+1 |X0 , . . . , Xn ] = Xn
for all n ≥ 0.
In Example 23.4, we showed that SRW is a martingale. (Indeed, since |Xn | ≤ n, E[|Xn |] ≤
n < ∞ for all n.) In fact, we showed something apparently stronger: that E[Xn |X0 , . . . , Xm ] =
Xm for all m ≤ n. In fact, this nominally stronger property is equivalent to being a mar-
tingale.
Lemma 24.2. If (Xn )n≥0 is a martingale, then E[Xn |X0 , . . . , Xm ] = Xm for all m ≤ n.
Proof. The proof is by induction. The statement holds when m = n since then Xn is
(X0 , . . . , Xm )-measurable, cf. Proposition 23.5(2). Suppose that we have shown that, for
some n ≥ m, E[Xn |X0 , . . . , Xm ] = Xm . Then by the Tower Property of Proposition 23.5(4)
E[Xn+1 |X0 , . . . , Xm ] = E[E[Xn+1 |X0 , . . . , Xn ]|X0 , . . . , Xm ].
We assumed that Xn is a martingale, meaning that the inside conditional expectation is
E[Xn+1 |X0 , . . . , Xn ] = Xn . Hence, we have
E[Xn+1 |X0 , . . . , Xm ] = E[Xn |X0 , . . . , Xm ] = Xm
where the second equality is the inductive hypothesis. 
Corollary 24.3. If (Xn )n≥0 is a martingale, then it has constant expectation: E[Xn ] = E[X0 ] for
all n.
Proof. Here we simply use the Double Expectation property (Corollary 23.6). By Lemma
24.2 X0 = E[Xn |X0 ], and thus
E[X0 ] = E[E[Xn |X0 ]] = E[Xn ].

Example 24.4 (Betting on Independent Coin Tosses). We motivated the discussion of mar-
tingales with a simple betting strategy on coin tosses (i.e. SRW), where we wager $1 on
each toss regardless. In fact, it doesn’t matter how we bet – as long as we have no fu-
ture information, our winnings forms a martingale. To make this precise: let X1 , X2 , . . .
be i.i.d. Bernoulli trials (so P(Xn = ±1) = 21 , and thus E[Xn ] = 0). We start with some
initial fortune W0 (which is independent of the tosses Xn ). Suppose we decide to bet an
amount, Bn , on the nth coin toss. Thus, after the toss, we either increase our fortune by
Bn if Xn = 1, or decrease it by Bn if Xn = −1 – that is, our winnings are
Wn = Σ_{i=1}^{n} Xi Bi .

The sequence B1 , B2 , . . . is our betting strategy, which we allow to be random. We insist


that E[|Bn |] < ∞ for each n – in a real betting game, we would be restricted to have

Bn ≤ max{Wn , 0}, and so long as we start with finite capital, by induction this means
Bn is a bounded random variable so certainly has finite expectation. We also insist that
the betting strategy at time n can only use information up to time n: in other words, we
require Bn to be (W0 , W1 , . . . , Wn−1 )-measurable.
Under these conditions, we have
" n # " n # n
X X X
E[|Wn |] ≤ E |Xi Bi | = E |Bi | = E[|Bi |] < ∞
i=1 i=1 i=1

for each n, as required. Now, the winnings at time n + 1 are Wn+1 = Wn + Xn+1 Bn+1 , and
so
E[Wn+1 |W0 , . . . , Wn ] = E[Wn |W0 , . . . , Wn ] + E[Xn+1 Bn+1 |W0 , . . . , Wn ].
The first term is E[Wn |W0 , . . . , Wn ] = Wn . For the second term: by assumption Bn+1 is
measurable with respect to W0 , . . . , Wn , and so by the factoring property of Proposition
23.5(5), we have
E[Xn+1 Bn+1 |W0 , . . . , Wn ] = Bn+1 E[Xn+1 |W0 , . . . , Wn ].
Now, B1 is W0 measurable, and W0 is independent of Xk for all k, so B1 is as well. It
follows that X1 B1 is independent of Xn+1 , and therefore so is W1 = W0 + X1 B1 . Then B2
is (W0 , W1 )-measurable, and since both are independent of Xn+1 , so is B2 . Thence so is
X2 B2 , and so too W2 = W1 + X2 B2 . Continuing this way, we find that W0 , . . . , Wn are all
independent of Xn+1 . Hence, by Proposition 23.5(3),
E[Xn+1 |W0 , . . . , Wn ] = E[Xn+1 ] = 0.
So, we have shown that E[Wn+1 |W0 , . . . , Wn ] = Wn + 0 = Wn , and so Wn is a martingale.
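This "no admissible strategy helps at any fixed time" statement is easy to see in simulation. The sketch below (illustrative only, not part of the notes; the two strategies are arbitrary) estimates the expected fortune after 30 tosses for two different non-anticipating betting rules; both come out at the initial fortune, as the martingale property demands.

import random

def final_fortune(strategy, n_tosses, w0, rng):
    # Fortune after n_tosses of the fair-coin game, where strategy(history)
    # sees only past tosses and returns the next (non-negative) bet.
    w, history = w0, []
    for _ in range(n_tosses):
        bet = strategy(history)
        x = rng.choice((-1, 1))
        w += bet * x
        history.append(x)
    return w

flat = lambda hist: 1.0                                          # always bet 1
streaky = lambda hist: 2.0 if hist and hist[-1] == 1 else 0.5    # bet more after a win

rng = random.Random(5)
trials, n, w0 = 100000, 30, 10.0
for name, strat in [("flat", flat), ("streaky", streaky)]:
    avg = sum(final_fortune(strat, n, w0, rng) for _ in range(trials)) / trials
    print(name, ": average fortune after", n, "tosses ~", round(avg, 3), "(martingale predicts", w0, ")")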

24.2. Stopping Martingales. Just as for a (discrete-time) Markov chain, or any stochastic
process (Xn )n≥0 , a stopping time T is a random variable taking values in {0, 1, . . .} ∪ {∞},
which has the property that the event {T ≤ n} depends only on X0 , . . . , Xn . (To be precise:
{T ≤ n} can be expressed in the form {(X0 , . . . , Xn ) ∈ U } for some set of states U .)
In the context of Example 24.4, we would like to stop betting at some time. We could
stop after a fixed number n of coin tosses. More likely, we would like to stop as soon as
Wn ≥ F for some fixed fortune F . (Indeed, T = min{n ≥ 0 : Wn ≥ F } is a stopping time.)
If we stop betting at this time, then no matter what happens to the coin, our fortune is
equal to WT from then on. In other words, we have replaced the original martingale Wn
with the new process WT ∧n , where T ∧ n = min{T, n}. In fact, this stopped martingale is
still a martingale.
Proposition 24.5. Let (Xn )n≥0 be a martingale and let T be a stopping time for this martingale.
Then (XT ∧n )n≥0 is a martingale.
Proof. Let Yn = XT ∧n . First note that Yn = XT ∧n ∈ {X0 , . . . , Xn } for each n, so
|Yn | ≤ max{|X0 |, . . . , |Xn |} ≤ |X0 | + · · · + |Xn |,
hence E[|Yn |] ≤ E[|X0 |] + · · · + E[|Xn |] < ∞, as required. We now need to show that
E[Yn+1 |Y0 , . . . , Yn ] = Yn . We will first show that
E[Yn+1 |X0 , . . . , Xn ] = Yn . (24.1)

To prove Equation 24.1, we can express Yn+1 = XT ∧(n+1) as follows. If T ≥ n + 1, then T ∧ (n + 1) = n + 1 and so Yn+1 = Xn+1 in this case. If, on the other hand, T ≤ n, then T ∧ (n + 1) = T and Yn+1 = XT . Thus
Yn+1 = XT 1{T ≤n} + Xn+1 1{T >n} .
Since T is a stopping-time, {T ≤ n} only depends on X0 , . . . , Xn ; in other words 1{T ≤n} is
(X0 , . . . , Xn )-measurable. Of course, in the event T ≤ n, XT is (X0 , . . . , Xn )-measurable as
well, and so the product XT 1{T ≤n} is (X0 , . . . , Xn )-measurable. In addition, the function
1{T >n} = 1 − 1{T ≤n} is (X0 , . . . , Xn )-measurable. By Proposition 23.5, it follows that
E[Yn+1 |X0 , . . . , Xn ] = E[XT 1{T ≤n} |X0 , . . . , Xn ] + E[Xn+1 1{T >n} |X0 , . . . , Xn ]
= XT 1{T ≤n} + 1{T >n} E[Xn+1 |X0 , . . . , Xn ].
By assumption Xn is a martingale, and so we have
E[Yn+1 |X0 , . . . , Xn ] = XT 1{T ≤n} + 1{T >n} Xn .
Note that, in the case T = n, XT = Xn , so we can rewrite this as
XT 1{T ≤n} + 1{T >n} Xn = XT 1{T <n} + 1{T ≥n} Xn = XT ∧n = Yn .
This proves Equation 24.1.
Now, the collection Yn = (Y0 , . . . , Yn ) is Xn = (X0 , . . . , Xn )-measurable. Hence, we can
use the Tower Property:
E[Yn+1 |Y0 , . . . , Yn ] = E[Yn+1 |Yn ] = E[E[Yn+1 |Xn ]|Yn ].
Equation 24.1 says that E[Yn+1 |Xn ] = Yn , and so we have
E[Yn+1 |Y0 , . . . , Yn ] = E[Yn |Yn ] = E[Yn |Y0 , . . . , Yn ] = Yn
concluding the proof.


As a corollary, we therefore have that E[XT ∧n ] = E[XT ∧0 ] = E[X0 ] for all n. This is not
quite the same as saying that, when we’ve stopped playing, our expected fortune is still
unchanged. What we would like to see, for a truly “fair game”, is that E[XT ] = E[X0 ]. If
this can be violated, then even though at each fixed time n we cannot expect any profit
from any betting strategy, perhaps we can choose a clever enough random stopping time
(based on events we’ve seen thus far in the game) to come out on top, on average. In fact,
this is possible.
Example 24.6 (Martingale Betting Strategy). Consider a betting strategy as in Example
24.4, where the strategy is fixed as Bn = 2^{n−1} (bet $1 on the first toss and double the bet each round). Example 24.4
shows that the winnings Wn form a martingale. Now, let T = min{n : Xn = 1}, the first
time heads comes up. If the initial capital is the fixed amount W0 = C, then we have

E[WT ] = Σ_{n=1}^{∞} E[WT |T = n] P(T = n).

The event {T = n} is the event that X1 = · · · = Xn−1 = −1 and Xn = 1, so it fixes a


trajectory and thus P(T = n) = 2−n . So we have
E[WT ] = Σ_{n=1}^{∞} E[WT |T = n] · 2^{−n} = Σ_{n=1}^{∞} 2^{−n} E[Wn |X1 = · · · = Xn−1 = −1, Xn = 1].
By the betting strategy, conditioned on X1 = · · · = Xn−1 = −1 and Xn = 1, we have
Wn = W0 + Σ_{i=1}^{n} Xi Bi = C − 1 − 2 − · · · − 2^{n−2} + 2^{n−1} = C + 1
and so
E[WT ] = Σ_{n=1}^{∞} 2^{−n} (C + 1) = C + 1.

Hence, E[WT ] = C + 1 ≠ C = W0 = E[W0 ]. (However, by Proposition 24.5, E[WT ∧n ] = C


for all n. The problem is that the stopping time can be arbitrarily large, so it is never true
that T ∧ n = T for any finite n.)
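A short Monte Carlo sketch (not from the notes; the initial capital and the caps on the number of tosses are arbitrary) makes the contrast vivid. The stopped fortune W_{T∧cap} has expectation exactly C for every cap, while E[W_T] = C + 1; note that for large caps the naive sample average drifts toward C + 1, because the rare enormous losses that hold the true mean at C are almost never sampled.

import random

def doubling_until_first_head(C, cap, rng):
    # Fortune W_{T ^ cap}: bet 1, 2, 4, ... on successive fair tosses, stop at the
    # first head (time T), or after `cap` tosses if no head has appeared.
    w, bet = C, 1.0
    for _ in range(cap):
        if rng.random() < 0.5:       # heads: win the current bet and stop
            return w + bet
        w -= bet                     # tails: lose the bet and double it
        bet *= 2.0
    return w

rng = random.Random(6)
C, trials = 100.0, 200000
for cap in (3, 8, 20):
    avg = sum(doubling_until_first_head(C, cap, rng) for _ in range(trials)) / trials
    print("cap =", cap, " Monte Carlo estimate of E[W_{T^cap}] ~", round(avg, 2))
# The exact value of E[W_{T^cap}] is C for every cap (Proposition 24.5).  For small caps
# the simulation reproduces this; for large caps the estimate creeps toward C + 1 = E[W_T]
# because the compensating rare losses are essentially never observed.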
Remark 24.7. Example 24.6 is the reason for the terminology “martingale” for the processes
we’re studying (at least according to legend). This betting strategy was popular with 18th
Century French gamblers. The term “martingale” comes from the shape of a particular
horse harness; the reason it was used to name the betting strategy is lost to the mists of
time (though there are many interesting theories that can be found online).
This betting strategy appears to provide a sure-fire way to make money (which is what
18th Century gamblers thought). The problem is that the above model is not quite ac-
curate; the real betting strategy is Bn = 2^{n−1} 1_{\{Wn−1 ≥ 2^{n−1}\}} , since we cannot bet money we do
not have. The strategy therefore only really works if C = ∞, i.e. infinite capital. When
C < ∞, the player will eventually go bust (by the law of large numbers).

Example 24.6 shows that “arbitrage” is possible: one can employ a betting strategy that
is guaranteed to make money in a fair game (if one has infinite capital). But the winnings-
martingale Wn wildly fluctuates in this case. In fact, if we don’t allow this (or require that
our stopping time must occur by a fixed time), then arbitrage is impossible.
Theorem 24.8 (Optional Sampling Theorem). Let (Xn )n≥0 be a martingale, and let T be a
finite stopping time. Suppose that either:
(1) T is bounded: there is some N < ∞ so that P(T ≤ N ) = 1; or
(2) (Xn )0≤n≤T is bounded: for some B < ∞, P(|Xn | ≤ B for n ≤ T ) = 1.
Then E[XT ] = E[X0 ].
Proof. First, suppose (1) holds. By Proposition 24.5, XT ∧n is a martingale, and so E[XT ∧n ] =
E[XT ∧0 ] = E[X0 ] for all n. In particular, E[X0 ] = E[XT ∧N ]; but T ∧N = T , so E[X0 ] = E[XT ]
as claimed.
Now, suppose T is not necessarily bounded, but (2) holds. Then for any n, we can write
XT = XT ∧n + (XT − XT ∧n ) = XT ∧n + (XT − XT ∧n )1{T >n} .

(The second equality follows from the fact that, if T ≤ n, then XT − XT ∧n = XT − XT = 0.)
So we have
E[XT ] = E[XT ∧n ] + E[(XT − XT ∧n )1{T >n} ].
Now, T ∧ n is bounded by n, and it is easy to see that it is a stopping time; so by part (1), it
follows that E[XT ∧n ] = E[X0 ]. For the second term, we have
|E[(XT − XT ∧n )1{T >n} ]| ≤ E[|XT − XT ∧n |1{T >n} ] ≤ 2BE[1{T >n} ] = 2BP(T > n).
Since P(T < ∞) = 1, P(T > n) → 0 as n → ∞. Hence, we have
|E[XT ] − E[X0 ]| = |E[XT ] − E[XT ∧n ]| ≤ 2BP(T > n) → 0.
Since the left-hand-side does not depend on n, it follows that E[XT ] = E[X0 ], as required.


25. LECTURE 25: MAY 30, 2012


25.1. Applications of the Optional Sampling Theorem.
Example 25.1. Let (Xn )n≥0 be SRW on Z, conditioned to start at X0 = j for some inte-
ger j ∈ {0, 1, . . . , N }. This is a martingale (indeed it was our motivating example of a
martingale). Let τk = min{n : Xn = k} and set T = min{τ0 , τN }. Using the first-step
analysis iterative scheme from Section 2.4, we know how to calculate h(j) = Pj (τ0 < τN ).
Martingale theory provides a different, much simpler method. Note that, by definition,
|Xn | ≤ N for n ≤ T , hence the optional sampling theorem applies to (Xn )0≤n≤T . Therefore
E[X0 ] = E[XT ]. Since we conditioned X0 = j, E[X0 ] = j. On the other end, the random
variable XT can take only two values: XT = 0 or XT = N . Thus
j = E[X0 ] = E[XT ] = 0 · P(XT = 0) + N · P(XT = N ).
It follows that P(XT = N ) = j/N , and so P(τ0 < τN ) = P(XT = 0) = 1 − j/N .
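This prediction is easy to check by simulation (a sketch, not part of the lecture; j = 3 and N = 10 are arbitrary choices):

import random

def srw_exit(j, N, rng):
    # Run simple random walk from j until it hits 0 or N; return the exit point.
    x = j
    while 0 < x < N:
        x += rng.choice((-1, 1))
    return x

rng = random.Random(7)
j, N, trials = 3, 10, 100000
hits_N = sum(srw_exit(j, N, rng) == N for _ in range(trials))
print("P(X_T = N) ~", hits_N / trials, " optional sampling predicts", j / N)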

Example 25.2. Let X1 , X2 , . . . be a sequence of i.i.d. random variables with common mean
E[Xn ] = µ, and let Sn = X1 + · · · + Xn . Define Mn = Sn − nµ, and M0 = 0. Then note that
E[Mn+1 |M0 , M1 , . . . , Mn ] = E[Sn+1 − (n + 1)µ|M0 , . . . , Mn ]
= E[Sn − nµ + Xn+1 − µ|M0 , . . . , Mn ]
= E[Mn + Xn+1 − µ|M0 , . . . , Mn ]
= Mn + E[Xn+1 − µ|M0 , . . . , Mn ]
Since the Xn are independent, Xn+1 is independent from Sk for all k ≤ n, and hence also
from Mk for k ≤ n. Thus the last term is
E[Xn+1 − µ|S0 , . . . , Sn ] = E[Xn+1 − µ] = µ − µ = 0.
Also: E[M1 |M0 ] = E[M1 |0] = E[M1 ] = E[X1 − µ] = 0 = M0 . Thus (Mn )n≥0 is a martingale.
Now, let T be a bounded stopping time for (Xn )n≥0 (which is therefore also a bounded
stopping time for (Mn )n≥0 ). Then by the Optional Sampling Theorem,
0 = E[M0 ] = E[MT ] = E[ST − T µ] = E[ST ] − µE[T ].
We have thus proved Wald’s equation: if T is a bounded stopping time for an i.i.d. se-
quence with mean µ, then E[X1 + X2 + · · · + XT ] = µE[T ]. (In fact, this is true for any
stopping time T with E[T ] < ∞, and it can be proved using the optional sampling theo-
rem in this stronger form, but the proof is more involved.)

25.2. Submartingales. A stochastic process (Xn )n≥0 is called a submartingale (resp. su-
permartingale) if, for each n, E[Xn+1 |X0 , . . . , Xn ] ≥ Xn . (resp. E[Xn+1 |X0 , . . . , Xn ] ≤ Xn ).
For example, suppose we bet on coin-tosses with a betting strategy as in Example 24.4, but
we additionally have a secret admirer who watches us bet. At each time n, the admirer
adds some amount Gn to our winnings, where the amount given depends only on our
betting and the coin tosses up to time n − 1. That is: Wn+1 = Wn + Bn+1 Xn+1 + Gn+1 where
Gn+1 is (X1 , . . . , Xn , B1 , . . . , Bn )-measurable. Then our total fortune Wn is a submartin-
gale: E[Wn+1 |Wn ] = Wn + Gn+1 ≥ Wn . (We might similarly consider a game where, at
each time, we bet but may also buy some number of drinks, with the number depending
only on our current and past fortune; this would give a supermartingale.)

Sub- and supermartingales are very useful technical tools in analyzing martingales and
other stochastic processes. The main example we will consider comes from maximal in-
equalities. One of the most basic estimates useful in probability theory is Markov’s in-
equality: P(|X| ≥ a) ≤ E[|X|]/a. There are more general versions (proved similarly) that give P(|X| ≥ a) ≤ E[|X|^r ]/a^r and P(X ≥ a) ≤ E[e^{bX} ]/e^{ba} for any b > 0. If Xn is a martingale,
stronger estimates are possible: the same bounds can be given not only for Xn , but for
max{X0 , . . . , Xn }. The key is the following strengthening of the r = 1 Markov inequality
for a positive submartingale.
Theorem 25.3 (Doob’s Maximal Inequality). Let (Xn )n≥0 be a non-negative submartingale.
Then for any a > 0,
P(max{X0 , . . . , Xn } ≥ a) ≤ E[Xn ]/a.
Proof. Let T = min{n : Xn ≥ a} (a stopping time). Consider the event
Ak = {T = k} = {T ≤ k} \ {T ≤ k − 1}.
Since T is a stopping time, {T ≤ k} is (X0 , . . . , Xk )-measurable, and the event {T ≤ k − 1}
is (X0 , . . . , Xk−1 )-measurable, so too is its complement, and therefore Ak is (X0 , . . . , Xk )-
measurable. Now, since Xn ≥ 0, we have Xn ≥ Xn 1{T ≤n} , and hence
" n
# n
E[Xn ] ≥ E[Xn 1{T ≤n} ] = E Xn 1Ak = E[Xn 1Ak ].
X X

k=0 k=0

We now use Proposition 23.5 to simplify the summands:


E[Xn 1Ak ] = E[E[Xn 1Ak |X0 , . . . , Xk ]] = E[1Ak E[Xn |X0 , . . . , Xk ]].
Since Xn is a submartingale, it follows by induction that for k ≤ n E[Xn |X0 , . . . , Xk ] ≥ Xk .
Thus, we have
E[Xn 1Ak ] ≥ E[1Ak Xk ].
Now, Ak is the event that min{n : Xn ≥ a} = k; in particular, when Ak holds, Xk ≥ a.
Thus 1Ak Xk ≥ a1Ak , and so we have
E[Xn ] ≥ Σ_{k=0}^{n} E[1_{Ak } Xk ] ≥ a Σ_{k=0}^{n} E[1_{Ak }] = a E[Σ_{k=0}^{n} 1_{Ak }] = a P(T ≤ n).

Now, the event {T ≤ n} is the event that there exists a k ≤ n with Xk ≥ a; i.e. {T ≤ n} =
{max{X0 , . . . , Xn } ≥ a}. This proves the theorem. 
We will use Doob’s maximal inequality directly for the special case of a non-negative
martingale, in the next section. The more general result for submartingales actually al-
lows a much more powerful family of estimates for martingales. This is because of the
following.
Lemma 25.4. Let (Xn )n≥0 be a martingale, and let f : R → R be a convex function such that
E[|f (Xn )|] < ∞ for all n. Then Yn = f (Xn ) is a submartingale.
The proof of Lemma 25.4 is an Exercise on Homework 7.

Corollary 25.5. Let (Xn )n≥0 be a martingale, and let r ≥ 1 and a, b > 0. Then
P(max{X0 , . . . , Xn } ≥ a) ≤ E[|Xn |^r ]/a^r    (25.1)
P(max{X0 , . . . , Xn } ≥ a) ≤ E[e^{bXn } ]/e^{ba} .    (25.2)
Proof. The function f (x) = |x|r is convex for r ≥ 1. Since (Xn )n≥0 is a martingale, Lemma
25.4 shows that (|Xn |r )n≥0 is a (non-negative) submartingale. Now, for fixed a, if there
exists a k ∈ {0, . . . , n} with Xk ≥ a then |Xk |r ≥ ar . Thus {max{X0 , . . . , Xn } ≥ a} ⊆
{max{|X0 |r , . . . , |Xn |r } ≥ ar }. Therefore, using Theorem 25.3 for Yn = |Xn |r , we have
P(max{X0 , . . . , Xn } ≥ a) ≤ P(max{Y0 , . . . , Yn } ≥ a^r ) ≤ E[Yn ]/a^r ,
which proves Equation 25.1. The proof of Equation 25.2 is very similar. 

Example 25.6. Let X1 , X2 , . . . be i.i.d. symmetric Bernoulli random variables, so that Sn = X1 + · · · + Xn (and S0 = 0) forms a martingale. Taking b = 1/√n in Equation 25.2, and setting α = a/√n, we have the estimate
P(max{S1 , . . . , Sn } ≥ α√n) ≤ e^{−ba} E[e^{bSn }] = e^{−α} E[e^{Sn /√n}].    (25.3)
Now, using independence and identical distribution of the Xk , we have
E[e^{bSn }] = E[e^{b(X1 +···+Xn )}] = E[e^{bX1 }]^n .
Of course E[e^{bX1 }] = (1/2)e^b + (1/2)e^{−b} , and so
E[e^{Sn /√n}] = ((e^{1/√n} + e^{−1/√n})/2)^n .
One can now employ elementary calculus (for example l'Hôpital's Rule) to show that
lim_{n→∞} E[e^{Sn /√n}] = lim_{n→∞} ((e^{1/√n} + e^{−1/√n})/2)^n = e^{1/2}.
Since this limit exists, we can conclude from the estimate of Equation 25.3 that there exists a constant C > 0 such that for α > 0
P(max{S0 , . . . , Sn } ≥ α√n) ≤ C e^{−α}.
This is a very strong statement to the effect that, with high probability, Sn is not much bigger than √n.
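A numerical sanity check of this estimate (a sketch added here, not from the notes; n = 400 and α = 2 are arbitrary choices): the empirical probability comes out well below e^{1/2} e^{−α}, which is one valid choice of the bound above for every n, though it is far from tight.

import math
import random

rng = random.Random(8)
n, trials, alpha = 400, 10000, 2.0
threshold = alpha * math.sqrt(n)

exceed = 0
for _ in range(trials):
    s, running_max = 0, 0
    for _ in range(n):
        s += rng.choice((-1, 1))
        if s > running_max:
            running_max = s
    if running_max >= threshold:
        exceed += 1

print("P(max_k S_k >= alpha*sqrt(n)) ~", exceed / trials)
print("bound C*e^{-alpha} with C = e^{1/2}:", round(math.exp(0.5 - alpha), 4))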

26. LECTURE 26: JUNE 1, 2012


26.1. The Martingale Convergence Theorem. Perhaps the central result about martin-
gales is the following convergence theorem.
Theorem 26.1. Let (Xn )n≥0 be a martingale, and suppose there is a constant C > 0 so that
P(Xn ≥ −C) = 1 for all n. Then there is a random variable X∞ such that limn→∞ Xn = X∞
with probability 1.
Remark 26.2. The condition of the theorem is often stated as requiring either Xn ≥ 0 or
|Xn | ≤ C (i.e. non-negative or bounded martingales). The standard proof does not require
an upper-bound, only a lower-bound, so we state the theorem in this stronger form.
Proof. First, suppose we have proved the theorem in the special case C = 0: i.e. for
Xn ≥ 0. Then for any martingale Yn ≥ −C, set Xn = Yn + C. It is easy to check that Xn
is a martingale, and of course Xn ≥ 0, so by the theorem Xn → X∞ with probability 1.
Hence, Yn → X∞ − C with probability 1, proving the theorem in general. Hence, we will
assume that Xn ≥ 0.
Since Xn ≥ 0 is a (sub)martingale, Doob’s Maximal Inequality shows that, for any fixed
N ∈ N,
P(max_{0≤n≤N} Xn ≥ a) ≤ E[XN ]/a = E[X0 ]/a.
Taking N → ∞, this shows that
P(max_{n≥0} Xn ≥ a) ≤ E[X0 ]/a.
It follows that P(maxn Xn < ∞) = 1, and so Xn is bounded with probability 1. In particu-
lar, this means that Xn has a convergent subsequence. This means there is a small interval
[a, b] that Xn keeps returning to infinitely often.
It is still possible for the sequence Xn to diverge, if Xn oscillates: i.e. if it also leaves the
interval [a, b] infinitely often. So we fix 0 ≤ a < b, and consider these potential oscillations.
Let S0 = min{k : Xk ≤ a}; for n ≥ 1 let Tn = min{k ≥ Sn−1 : Xk ≥ b}, and Sn = min{k ≥
Tn : Xk ≤ a}. So the Tn are the (completed) upcrossing times and the Sn are the downcrossing times through the interval [a, b]. The bounded sequence Xn fails to converge only if there are some fixed a < b such that there are infinitely many up-crossings and down-crossings through [a, b]. So define Un
to be the number of up-and-down crossings through [a, b] by time n:
Un = max{k : Tk ≤ n}.
Since Un is non-decreasing, U = limn→∞ Un exists (though it may be ∞). Our goal is to
show that P(U < ∞) = 1.
We will prove this with a betting strategy akin to Example 24.4. Let Bj = 1 if Sk−1 < j ≤ Tk for some k, and Bj = 0 if Tk < j ≤ Sk for some k. (So we bet 1 dollar every step during an up-crossing, and we don't bet during a down-crossing.) Then our winnings by time n are
Wn = Σ_{j=1}^{n} Bj (Xj − Xj−1 ).

In fact, (Wn )n≥1 is a martingale; the proof is similar to the proof in Example 24.4. We have
E[Wn+1 |X1 , . . . , Xn ] = E[Wn + Bn+1 (Xn+1 − Xn )|X1 , . . . , Xn ].
Now, the bet Bn+1 is determined by X1 , . . . , Xn , and the winnings Wn are determined by
X1 , . . . , Xn and B1 , . . . , Bn (hence just by X1 , . . . , Xn ). Thus Wn and Bn+1 are (X1 , . . . , Xn )-
measurable, and so
E[Wn+1 |X1 , . . . , Xn ] = Wn + Bn+1 E[Xn+1 − Xn |X1 , . . . , Xn ] = Wn
since Xn is a martingale. Now, since W1 , . . . , Wn are (X1 , . . . , Xn )-measurable, we there-
fore have
Wn = E[Wn+1 |W1 , . . . , Wn ] = E[E[Wn+1 |X1 , . . . , Xn ]|W1 , . . . , Wn ] = E[Wn+1 |W1 , . . . , Wn ]
and so (Wn )n≥1 is a martingale. In particular E[Wn ] = E[W1 ] for all n, and W1 equals either X1 − X0 or 0, both of which have expectation 0. Thus E[Wn ] = 0 for all n.
Now, note that
Wn = Σ_{k : Tk ≤ n} Σ_{Sk−1 < j ≤ Tk} 1 · (Xj − Xj−1 ) = Σ_{k=1}^{Un} Σ_{Sk−1 < j ≤ Tk} (Xj − Xj−1 ) + Σ_{j=S_{Un}+1}^{n} (Xj − Xj−1 ).
The first sum is over precisely the complete Un up-crossings. In each up-crossing, the sum over j is a telescoping sum with value XTk − XSk−1 ; by definition XTk ≥ b and XSk−1 ≤ a. Hence, each of those terms is ≥ b − a, and we have
Wn ≥ (b − a)Un + Σ_{j=S_{Un}+1}^{n} (Xj − Xj−1 ).

The latter sum is also telescoping, with value Xn − X_{S_{Un}}. By assumption Xn ≥ 0, and X_{S_{Un}} ≤ a, so we conclude that
Wn ≥ (b − a)Un − a.
Hence 0 = E[Wn ] ≥ (b − a)E[Un ] − a, and so we have E[Un ] ≤ a/(b − a) for all n. It follows that the limit U = limn→∞ Un also has E[U ] ≤ a/(b − a) < ∞. In particular, P(U < ∞) = 1, which concludes the proof of the theorem. □


Example 26.3. Let (Xn )n≥0 be SRW on Z, conditioned so X0 = 1. Let T = min{n ≥


0 : Xn = 0}. Then T is a stopping time, and so, by Proposition 24.5 Mn = XT ∧n is a
martingale. Before time T , Xn ≥ 1, and so Mn ≥ 0. Thus by the Martingale Convergence
Theorem, there is a limit M∞ so that Mn → M∞ with probability 1.
What could this limit M∞ possibly be? First, note that Mn is integer valued (since Xn
is), so convergence means that Mn is eventually constant. Fix k ∈ Z; then {M∞ = k} =
{Mn = k for all but finitely many n}. If k ≠ 0, then this is the same as the event {Xn = k for all but finitely many n}; but any such eventually constant trajectory has probability (1/2) · (1/2) · (1/2) · · · = 0. Thus, for k ≠ 0, P(M∞ = k) = 0. It follows that M∞ = 0 with probability 1.

So, start Xn as X0 = 0 as usual. Its first step is to ±1. In the case it is +1, the above
analysis shows that Xn = 0 for some n; a similar analysis works for X1 = −1. This gives an alternate proof that the SRW is recurrent.
Note: in Example 26.3, conditioning X0 = 1 means that M0 = XT ∧0 = X0 = 1,
and so E[Mn ] = E[M0 ] = 1 for all n. But M∞ = 0 so E[M∞ ] = 0. We see that the
constant expectation property of martingales does not generally extend to their limits.
This is not hard to believe – it is always a tricky business exchanging a limit with a
sum/integral/expectation.
Example 26.4 (Polya Urns). An urn initially contains a red balls and b blue balls. At each
time step, draw a ball at random from the urn, then return it along with another ball of
the same color. Let Xn denote the number of red balls after n turns. Then (Xn )n≥0 is a
(time-inhomogeneous) Markov chain. Indeed, suppose at time n there are k red balls in
the bin. The total number of balls at time n is n + a + b, and so the probability that the next ball we pick is red is k/(n + a + b). Then at time n + 1, we return this ball with another of the same color; if it was red, Xn+1 = k + 1, and if it was blue, Xn+1 = k again. Hence, for k ≥ a,
P(Xn+1 = k + 1|Xn = k) = k/(n + a + b),   P(Xn+1 = k|Xn = k) = 1 − k/(n + a + b).
All our techniques for studying Markov chains required time homogeneity, so we are left
in the lurch to analyze this process directly.

Xn
Instead, let Mn = n+a+b be the fraction of red balls after n turns. Then (Mn )n≥0 is a
martingale: first, note that 0 ≤ Mn ≤ 1, so E[|Mn |] ≤ 1 < ∞. Now, since Xn is a Markov
chain, we know that for any states i0 , i1 , . . . , in , in+1 , P(Xn+1 = in+1 |Xn = in , . . . , X0 =
i0 ) = P(Xn+1 = in+1 |Xn = in ). It follows (from the summation definition of conditional
expectation) that E[Xn+1 |X0 , . . . , Xn ] = E[Xn+1 |Xn ]. This we can compute:
X
E[Xn+1 |Xn = k] = jP(Xn+1 = j|Xn = k)
j

= kP(Xn+1 = k|Xn = k) + (k + 1)P(Xn+1 = k + 1|Xn = k)


 
k k k
=k· 1− + (k + 1) · =k+ .
n+a+b n+a+b n+a+b
Hence
X k

E[Xn+1 = k|Xn = k]1{Xn =k} 1{Xn =k}
X
E[Xn+1 |Xn ] = = k+
k≥a k≥a
n + a + b
X Xn
 
Xn
X
= Xn + 1{Xn =k} = Xn + 1{Xn =k}
k≥a
n+a+b n + a + b k≥a
Xn
= Xn + .
n+a+b
(This makes good sense: our best guess for the number of red balls after the known out-
come of the previous selection is the number of red balls previously plus the probability
101

of adding a red ball at the next step.) Since Mn is Xn -measurable, we therefore have
E[Mn+1 |Mn , . . . , M0 ] = E[Mn+1 |Xn ] = (1/(n + 1 + a + b)) E[Xn+1 |Xn ]
and this equals
(1/(n + 1 + a + b)) (Xn + Xn /(n + a + b)) = Xn /(n + a + b) = Mn ,
showing that (Mn )n≥0 is a martingale.
This martingale is non-negative, hence by the Martingale Convergence Theorem, Mn →
M∞ with probability 1 for some random variable M∞ . It can be shown (though we are not
equipped to show it) that M∞ has a beta distribution: it is a continuous random variable
on [0, 1] with density
f_{M∞}(x) = ((a + b − 1)!/((a − 1)!(b − 1)!)) x^{a−1} (1 − x)^{b−1} ,   0 < x < 1.
There are good heuristics for why a beta should show up, so it is not too hard to justify
that if there is a limit it ought to be beta. The martingale convergence theorem is the key to
making the proof rigorous.
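A simulation makes the beta limit plausible (a Python sketch, not part of the notes; running the urn for 1000 draws is used as a stand-in for the limit M_∞, and a = 2, b = 3 are arbitrary):

import random

def polya_red_fraction(a, b, n_steps, rng):
    # Run the Polya urn (a red, b blue to start) for n_steps draws and return the
    # fraction of red balls, i.e. (approximately) the limiting value M_infinity.
    red, total = a, a + b
    for _ in range(n_steps):
        if rng.random() < red / total:
            red += 1
        total += 1
    return red / total

rng = random.Random(9)
a, b, n_steps, trials = 2, 3, 1000, 10000
vals = [polya_red_fraction(a, b, n_steps, rng) for _ in range(trials)]
mean = sum(vals) / trials
var = sum((v - mean) ** 2 for v in vals) / trials
# Beta(a, b) has mean a/(a+b) and variance ab/((a+b)^2 (a+b+1)).
print("empirical mean", round(mean, 4), " vs Beta(2,3) mean", 2 / 5)
print("empirical variance", round(var, 4), " vs Beta(2,3) variance", round(6 / (25 * 6), 4))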

27. LECTURE 27: JUNE 4, 2012


Let’s take a moment to recall the Poisson process: the simplest (continuous-time) birth-
and-death process, with λi = λ for all i and µi = 0 for all i. As we showed in Lecture 20,
there is an alternative description of this process: the Poisson process (Xt )t≥0 with rate
λ > 0 satisfies
(1) Stationary Increments: if 0 ≤ s < t < ∞, then Xt − Xs has a Poisson distribution
Poisson(λ(t − s)) that only depends on s and t through the difference t − s.
(2) Independent Increments: If 0 ≤ t0 < t1 < · · · < tn < ∞, then the increments of
the process Xt1 − Xt0 , Xt2 − Xt1 , . . . , Xtn − Xtn−1 are independent.
(3) Right-continuous Trajectories: For each t ≥ 0, lims↓t Xs = Xt .
Properties (1) and (2) are essential for what we should call random motion. If we think
of the trajectory t 7→ Xt as describing the motion of a particle, property (2) says that once
the particle reaches position Xs , its change in position Xt − Xs at a future time interval is
independent of any previous motion. (This is a strong kind of Markov property.) Property
(1) is a time-homogeneity condition: the distribution of the random choice of “change in
direction” also does not depend on your position or time. The Poisson process is not a
good description of motion, however, since its paths are not continuous.

27.1. Random Continuous Motion. Call a stochastic process (Xt )t≥0 a random continu-
ous motion if it has:
(1) Stationary Increments: There is a one parameter family of probability distribu-
tions {µt }t≥0 so that, for 0 ≤ s < t < ∞, the distribution of Xt − Xs is µt−s .
(2) Independent Increments: If 0 ≤ t0 < t1 < · · · < tn < ∞, then the increments of
the process Xt1 − Xt0 , Xt2 − Xt1 , . . . , Xtn − Xtn−1 are independent.
(3) Continuous Trajectories: The function t 7→ Xt is continuous with probability 1.
A Poisson process only misses the mark on item (3); but this is a serious change. Indeed,
unless (Xt )t≥0 is constant (the degenerate case), the condition that t 7→ Xt be continuous
means that the sample space is continuous (i.e. it must contain an interval). Hence, a random
continuous motion is a new kind of stochastic process beyond what we’ve studied: it is a
continuous-time, continuous-state process.
One might expect this to open a can of worms on the kinds of processes that fit the bill;
but the continuity of the paths (together with the other properties) actually pins down
the measures: up to scale, there is only one random continuous motion! To see how the
measure µt is determined, note that µt is the distribution of Xt − X0 . We can write this
increment as a telescoping sum: let tj = (j/n)t, so that t0 = 0 and tn = t. Then
Xt − X0 = [Xtn − Xtn−1 ] + [Xtn−1 − Xtn−2 ] + · · · + [Xt1 − Xt0 ].
By conditions (1) and (2), this means Xt − X0 is a sum of n independent random vari-
ables, each of which has distribution µ_{t_j − t_{j−1}} = µ_{t/n}. Probability distributions µ_t that can be realized as the distribution of a sum of n i.i.d. random variables, for every n, are called infinitely-divisible, and they form a small (but still infinite) family of probability distributions. Indeed, if we
forget about the continuity, then there are many such processes, called Lévy processes,
among which the Poisson process is an example. (As follows from Homework 5, Problem
3, a sum of independent Poisson random variables is Poisson, hence the Poisson distribu-
tion of any fixed rate is infinitely-divisible.)
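For the Poisson case this is easy to check numerically. Here is a small sketch (again not from the notes; λ = 3 and n = 10 are arbitrary choices) comparing samples of Poisson(λ) with sums of n independent Poisson(λ/n) variables:

    import numpy as np

    rng = np.random.default_rng(1)
    lam, n, size = 3.0, 10, 200000

    direct = rng.poisson(lam, size)                       # Poisson(lam) sampled directly
    pieces = rng.poisson(lam / n, (size, n)).sum(axis=1)  # sum of n i.i.d. Poisson(lam/n)

    for k in range(6):                                    # compare the two empirical pmfs
        print(k, round((direct == k).mean(), 4), round((pieces == k).mean(), 4))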

Now we take the continuity condition (3) into account. Define


Mn = max{|Xtn − Xtn−1 |, |Xtn−1 − Xtn−2 |, . . . , |Xt1 − Xt0 |}.
Since t ↦ X_t is continuous on the compact interval [0, t], it is uniformly continuous there. So for any ε > 0, there is δ > 0 such that |X_s − X_{s′}| < ε whenever |s − s′| < δ. For large enough n, all the partition widths satisfy |t_j − t_{j−1}| < δ, and so for all large n, M_n < ε. It follows
that limn→∞ Mn = 0. It turns out that this condition – that the maximum value of the
increments tends to 0 as the partition-width tends to 0 – pins down µt : the distribution
must be a centered Gaussian.
Definition 27.1. A stochastic process (Bt )t≥0 with state space R is called a Brownian Motion
or Wiener process if:
(1) For 0 ≤ s < t < ∞, Bt − Bs has a normal distribution N (0, σ 2 (t − s)) for some σ > 0.
(2) The increments are independent.
(3) The trajectories are continuous with probability 1.
If we additionally normalize σ 2 = 1 and condition B0 = 0, then (Bt )t≥0 is called a standard
Brownian motion.
To be clear: condition (1) says that the increment Bt − Bs is a continuous random vari-
able with density
f_{B_t − B_s}(x) = (1/√(2πσ²(t − s))) e^{−x²/2σ²(t−s)}.
Our preceding discussion shows that all random continuous motions are Brownian mo-
tions. The process is named in honor of the early 19th Century botanist Robert Brown
who noticed that pollen particles on the surface of apparently still water appeared, under
a microscope, to execute “random continuous motion” – they jiggled around seemingly
at random. The explanation for this motion (which is the result of zillions of collisions
with water molecules all moving essentially randomly) in terms of the above definition
of “random continuous motion” was provided by Einstein (and is one of his key papers
that led to his Nobel Prize). However, a mathematically rigorous proof of the
existence of this motion was first given by Norbert Wiener in the 1920s; for this reason,
the process is often called the Wiener process and often denoted Wt rather than Bt .
Note that E[Bt ] = 0 since the normal distribution is centered. It is also interesting to
consider the covariances. Let’s take s ≤ t; then
Cov(Bs , Bt ) = E[(Bs − E[Bs ])(Bt − E[Bt ])]
= E[Bs Bt ] = E[Bs (Bs + Bt − Bs )] = E[Bs2 ] + E[Bs (Bt − Bs )].
It is customary to take B0 to be a constant; in general, we can allow it to be any random
variable, as long as we assume B0 is independent of all increments. Then we have the last
term is
E[(Bs − B0 + B0 )(Bt − Bs )] = E[(Bs − B0 )(Bt − Bs )] + E[B0 (Bt − Bs )]
= E[Bs − B0 ]E[Bt − Bs ] + E[B0 ]E[Bt − Bs ] = 0.
Hence
Cov(Bs , Bt ) = E[Bs2 ] = Var(Bs ) = σ 2 s, s≤t

since (E[B_s])² = 0, and the variance of N(0, σ²s) is (by definition) σ²s. We can restate all of this as
Cov(Bs , Bt ) = σ 2 min{s, t}. (27.1)
On the subject of variances, note that we have
Var(B_{t+h} − B_t) = σ²h.
In other words, the standard deviation of the increment B_{t+h} − B_t is σ√h. For small h, this is ≫ h. So, on average, we cannot expect (1/h)[B_{t+h} − B_t] to have a limit as h ↓ 0. Indeed,
Brownian trajectories are nowhere differentiable.
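One can see this numerically without any machinery: for an increment of variance h, the typical size of [B_{t+h} − B_t]/h grows like h^{−1/2} as h ↓ 0. A quick sketch (not from the notes; the sample size is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 200000

    for h in [0.1, 0.01, 0.001, 0.0001]:
        incr = rng.normal(0.0, np.sqrt(h), n)   # B_{t+h} - B_t ~ N(0, h)
        print(h, "mean |difference quotient| ~", np.mean(np.abs(incr)) / h)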
Here is a very useful invariance property of Brownian motion, which we will exploit in investigating its finer behavior.
Lemma 27.2 (Scaling Property). Let (Bt )t≥0 be a standard Brownian motion, and let a > 0. For
t ≥ 0, define Wt = a−1/2 Bat . Then (Wt )t≥0 is a standard Brownian motion.
Proof. Since B0 = 0, W0 = 0. Scaling does not affect continuity. So we need only
check that the increments are independent and normally distributed. For fixed times
0 ≤ t0 < t1 < · · · < tn , we also have 0 ≤ at0 < at1 < · · · < atn , and so by as-
sumption the increments Bat1 − Bat0 , . . . , Batn − Batn−1 are independent. Scaling them
by a−1/2 preserves independence, and so the increments of Wt are independent. Finally,
W_t − W_s = a^{−1/2}(B_{at} − B_{as}). By assumption the distribution of B_{at} − B_{as} is N(0, a(t − s)),
and so scaling by a−1/2 scales the variance by (a−1/2 )2 = a−1 , showing that Wt − Ws has
distribution N(0, t − s). Thus (W_t)_{t≥0} is a standard Brownian motion. □
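As a sanity check of Lemma 27.2 (a sketch, not from the notes; a, s, t and the sample size are arbitrary choices), one can sample the pair (B_{as}, B_{at}) from independent Gaussian increments, rescale by a^{−1/2}, and verify that the increment variance comes out as t − s and the covariance as min{s, t} = s:

    import numpy as np

    rng = np.random.default_rng(3)
    a, s, t, n = 4.0, 0.6, 1.7, 500000

    B_as = rng.normal(0.0, np.sqrt(a * s), n)               # B_{as} ~ N(0, a s)
    B_at = B_as + rng.normal(0.0, np.sqrt(a * (t - s)), n)  # add an independent increment
    W_s, W_t = B_as / np.sqrt(a), B_at / np.sqrt(a)         # W_t = a^{-1/2} B_{at}

    print("Var(W_t - W_s) ~", (W_t - W_s).var(), " should be t - s =", t - s)
    print("Cov(W_s, W_t)  ~", np.mean(W_s * W_t), " should be s =", s)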

27.2. Gaussian Processes.


Definition 27.3. Let T denote a set of times (so T ⊆ R, could be discrete or continuous). A stochastic process (X_t)_{t∈T} is called a Gaussian process if its state space is R, and if for all t_1, . . . , t_n ∈ T and all a_1, . . . , a_n ∈ R, the random variable a_1 X_{t_1} + · · · + a_n X_{t_n} is Gaussian.
For any stochastic process (Xt )t≥0 , its marginal distributions are the joint distributions
of all random vectors (Xt1 , . . . , Xtn ) for all choices of n and times t1 , . . . , tn . Definition
27.3 is a statement about the marginal distributions of the process. Properties (1) and (2)
in the definition of Brownian motion are also statements exclusively about the marginal
distributions of the process.
Example 27.4. Let (B_t)_{t≥0} be a standard Brownian motion. Fix 0 = t_0 ≤ t_1 < · · · < t_n, and set G_j = B_{t_j} − B_{t_{j−1}}, so that B_{t_j} = G_1 + · · · + G_j (since B_0 = 0). Then for any a_1, . . . , a_n ∈ R,
a1 Bt1 + a2 Bt2 + · · · + an Btn = a1 G1 + a2 (G1 + G2 ) + · · · + an (G1 + · · · + Gn ).
We can rewrite this as
(a1 + · · · + an )G1 + (a2 + · · · + an )G2 + · · · + (an−1 + an )Gn−1 + an Gn .
The increments Gj are independent Gaussians, and therefore this linear combination of
them is also Gaussian. This shows that (Bt )t≥0 is a Gaussian process.
It is very important to note that the marginal distributions do not encode fine properties
of the trajectories, like continuity.

Example 27.5. Let (Bt )t≥0 be a standard Brownian motion, and let U be a uniform random
variable (on [0, 1]) independent from (Bt )t≥0 . Define
X_t = B_t for t ≠ U, and X_t = 0 for t = U.
In fact, Xt is a Gaussian process with the same marginal distributions as Bt . To see this,
it suffices to show that for any times t1 , . . . , tn and any coefficients a1 , . . . , an , the two
random variables a1 Xt1 + · · · + an Xtn and a1 Bt1 + · · · + an Btn have the same distribution.
This can be tested by looking at their characteristic functions χX (ξ) = E[eiξX ]. We have, for
any ξ ∈ R,
E[e^{iξ(a_1 X_{t_1} + ··· + a_n X_{t_n})}] = E[e^{iξ(a_1 X_{t_1} + ··· + a_n X_{t_n})} 1_{U ≠ t_1, ..., t_n}]
because the event {U ≠ t_1, . . . , t_n} has probability 1. But on this event X_t = B_t for each t, so
E[e^{iξ(a_1 X_{t_1} + ··· + a_n X_{t_n})}] = E[e^{iξ(a_1 B_{t_1} + ··· + a_n B_{t_n})} 1_{U ≠ t_1, ..., t_n}]
= E[e^{iξ(a_1 B_{t_1} + ··· + a_n B_{t_n})}] E[1_{U ≠ t_1, ..., t_n}]
= E[e^{iξ(a_1 B_{t_1} + ··· + a_n B_{t_n})}],
where the second equality follows from the independence of U from B_{t_1}, . . . , B_{t_n}, and the third is just the statement that P(U ≠ t_1, . . . , t_n) = 1. So a_1 X_{t_1} + · · · + a_n X_{t_n} has the same
characteristic function as a1 Bt1 + · · · + an Btn for all times t1 , . . . , tn and all coefficients
a1 , . . . , an . It follows that Xt and Bt have the same marginal distributions, and that Xt is a
Gaussian process. But the function t ↦ X_t disagrees with t ↦ B_t at t = U (unless B_U = 0, which is an event of probability 0), hence with probability 1 the trajectory t ↦ X_t has a discontinuity in [0, 1].
We therefore have to be careful to verify continuity properties separately from marginal distribution properties. In particular, Brownian motion is a continuous Gaussian
process; the “continuous” has to be added to the description, as it does not follow from a
description of the marginal distributions.

28. LECTURE 28: JUNE 6, 2012


One of the reasons that the Gaussian distribution is so nice is that it is completely deter-
mined by its mean and variance. This fact implies that Gaussian processes are determined
(in the sense of marginal distributions) by their mean and covariances.
Lemma 28.1. Let (Xt )t∈T and (Yt )t∈T be Gaussian processes. Suppose that E[Xt ] = E[Yt ] and
Cov(Xs , Xt ) = Cov(Ys , Yt ) for all s, t ∈ T . Then (Xt )t∈T and (Yt )t∈T have the same marginal
distributions.
(The proof is somewhat involved, but not difficult.) We therefore introduce some notation: for a Gaussian process X_t, set
µ(t) = E[Xt ], Γ(s, t) = Cov(Xs , Xt ).
Example 28.2. Let (Bt )t≥0 be a standard Brownian motion. Then Bt is a Gaussian process
with µ(t) = 0 and Γ(s, t) = min{s, t}. It is a continuous process (meaning its trajecto-
ries are continuous with probability 1). The process described in Example 27.5 is also a
Gaussian process with the same mean and covariance functions; but it is not a continuous
process.
Example 28.3. Let (Bt )t≥0 be a standard Brownian motion, and let Xt = Bt − tB1 for
0 ≤ t ≤ 1. Then for any coefficients a1 , . . . , an and times 0 ≤ t1 < · · · < tn ≤ 1,
a1 Xt1 + · · · + an Xtn = (a1 Bt1 + · · · + an Btn ) − (a1 t1 + · · · + an tn )B1 .
If t_n = 1 we just combine the last terms. In either case, this reduces to a linear combi-
nation of instances of Bt , and from the previous example we saw this yields a Gaussian.
Hence X_t is a Gaussian process. Note that it also has continuous trajectories. X_t is called
a Brownian bridge.
Example 28.4. Let (B_t)_{t≥0} be a standard Brownian motion, and let Z_t = e^{−t} B_{e^{2t}} for t ≥ 0. Then a linear combination of the Z_{t_j} is a linear combination of the B_{s_j} for s_j = e^{2t_j}, and hence is Gaussian; so Z_t is a Gaussian process. Again, it is clear that it has continuous trajectories.
Zt is called an Ornstein-Uhlenbeck process.
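A small numerical sketch (not from the notes; the times and sample size are arbitrary choices) illustrates both examples: the Brownian bridge has Var(X_t) = t(1 − t), and the Ornstein-Uhlenbeck process has Var(Z_t) = e^{−2t} · e^{2t} = 1 for every t.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 500000

    # Brownian bridge X_t = B_t - t B_1 at t = 0.3: Var(X_t) should be t(1 - t).
    t = 0.3
    B_t = rng.normal(0.0, np.sqrt(t), n)
    B_1 = B_t + rng.normal(0.0, np.sqrt(1 - t), n)   # independent increment from t to 1
    X_t = B_t - t * B_1
    print("Var(X_0.3) ~", X_t.var(), " t(1-t) =", t * (1 - t))

    # Ornstein-Uhlenbeck Z_t = e^{-t} B_{e^{2t}} at t = 1.4: Var(Z_t) should be 1.
    t = 1.4
    Z_t = np.exp(-t) * rng.normal(0.0, np.exp(t), n)  # B_{e^{2t}} ~ N(0, e^{2t})
    print("Var(Z_1.4) ~", Z_t.var(), " should be 1")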

The Gaussian processes point-of-view gives us another useful invariance property of


Brownian motion.
Lemma 28.5 (Time Inversion). Let (Bt )t≥0 be a standard Brownian motion. Let W0 = 0 and for
t > 0 set Wt = tB1/t . Then (Wt )t≥0 is a standard Brownian motion.
Proof. Let t_1, . . . , t_n ≥ 0. If any t_j = 0, then since W_0 = 0 we may ignore these terms in a sum, so wlog t_j > 0. Set s_j = 1/t_j. Then for any coefficients a_1, . . . , a_n,
a_1 W_{t_1} + · · · + a_n W_{t_n} = a_1 t_1 B_{s_1} + · · · + a_n t_n B_{s_n}
is a linear combination of Brownian points, and as shown in Example 27.4, it follows that
a_1 W_{t_1} + · · · + a_n W_{t_n} is Gaussian. Hence (W_t)_{t≥0} is a Gaussian process. We can also calculate
µ(t) = E[Wt ] = tE[B1/t ] = 0 for t > 0 and µ(0) = E[W0 ] = E[0] = 0. For covariances, if
either s = 0 or t = 0 then Cov(Ws , Wt ) = 0, while if 0 < s ≤ t
Cov(Ws , Wt ) = E[Ws Wt ] = stE[B1/s B1/t ] = st min{1/s, 1/t} = st · 1/t = s.
So Γ(s, t) = Cov(Ws , Wt ) = min{s, t} as well, and we have shown that (Wt )t≥0 has the
same marginal distributions as (Bt )t≥0 .

It now suffices to show that t ↦ W_t is continuous with probability 1. For t > 0 this is clear, since t ↦ t B_{1/t} is continuous wherever 1/t is continuous. At t = 0, first we note that the above treatment shows that B_t and W_t have the same marginal distributions. In particular, their distributions agree on the dense set of rationals Q ∩ (0, ∞). Since B_t has continuous paths with B_0 = 0, the event that B_q → 0 as q ↓ 0 along the rationals has probability 1; this event is determined by the marginal distributions on the rationals, so W_q → 0 as q ↓ 0 along the rationals as well. Combined with the continuity of W_t on (0, ∞), this shows W_t is continuous at 0 as well. Hence (W_t)_{t≥0} is a standard
Brownian motion. □
Corollary 28.6. If (B_t)_{t≥0} is a standard Brownian motion, then lim_{t→∞} B_t/t = 0.
Remark 28.7. For integer t = n, we can write B_n = B_n − B_0 = Σ_{k=1}^{n} [B_k − B_{k−1}]. The increments B_k − B_{k−1} are i.i.d. with mean 0, and hence by the strong law of large numbers B_n/n → 0 with probability 1. Extending this argument to t ∉ N directly takes a lot of work (owing to the very rough nature of Brownian motion: just because it is continuous doesn’t mean there’s a very nice way to bound B_t by B_n and B_{n+1} for n ≤ t < n + 1). But time-inversion gives an immediate easy proof.
Proof. Let W_t = t B_{1/t}; then W_{1/t} = B_t/t. In Lemma 28.5, we saw that W_t is a standard Brownian motion; in particular this means W_0 = 0 and t ↦ W_t is continuous. It follows that lim_{t→∞} B_t/t = lim_{t→∞} W_{1/t} = lim_{s↓0} W_s = 0. □

28.1. Donsker’s Scaling Limit. Brownian motion is the pre-eminent Gaussian process.
The Gaussian distribution arises as the limit distribution in the central limit theorem; sim-
ilarly, Brownian motion arises as the “limit process” in a collection of limit theorems. The
gist is that if one rescales a discrete-time random walk with space scaled as the square-
root of time, then in the scaling limit the result is Brownian motion. Here is a precise
statement.
Theorem 28.8 (Donsker’s Invariance Principle). Let (X_n)_{n≥0} be SRW on Z. Let c > 0, and define a continuous-time, continuous process X_t^{(c)} as follows: for t ≥ 0 such that ct ∈ N, set X_t^{(c)} = c^{−1/2} X_{ct}; for other times t, define X_t^{(c)} by linear interpolation. Then for any times 0 ≤ t_1 < · · · < t_n,
(X_{t_1}^{(c)}, . . . , X_{t_n}^{(c)}) →^D (B_{t_1}, . . . , B_{t_n})
as c → ∞, where (B_t)_{t≥0} is a standard Brownian motion, and →^D means convergence in distribution.
The proof, boiled down to its basic idea, is just repeated application of the central limit
theorem. Indeed, one does not need to use the SRW; instead, one could take Xn = Y1 +
· · · + Yn where (Yn )n≥0 is a sequence of i.i.d. random variables each with E[Yj ] = 0 and
Var[Yj ] = 1. (If one takes Var[Yj ] = σ 2 then the theorem holds with a Brownian motion
of variance σ² in the limit.) Note also that the linear-interpolation definition of the scaled process X_t^{(c)} is included only to make these processes continuous. One could just as well use (and indeed the proof uses)
X̃_t^{(c)} = c^{−1/2} X_{⌊ct⌋}.
The sense of convergence stated in Theorem 28.8 is very weak; much stronger forms of convergence hold. For example, for a suitable construction of the walks and the Brownian motion on a common probability space, the random continuous paths (t ↦ X_t^{(c)})_{0≤t≤R} converge to the Brownian path (t ↦ B_t)_{0≤t≤R} uniformly, with probability 1, for any R > 0. (This is one way to prove the existence of Brownian motion – if one can show this uniform limit exists, then the limit is a uniform limit of continuous paths and therefore has continuous paths; that it satisfies the marginal distribution properties of Brownian motion is then verified via central limit theorem considerations as above.)
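The one-dimensional marginals in Theorem 28.8 are easy to test empirically. The sketch below (not from the notes; c, t, and the number of walks are arbitrary choices) compares c^{−1/2} X_{⌊ct⌋} for SRW with the N(0, t) distribution of B_t.

    import numpy as np

    rng = np.random.default_rng(5)
    c, t, n_walks = 1000, 1.0, 10000

    steps = rng.choice([-1, 1], size=(n_walks, int(c * t)))  # +-1 steps of SRW on Z
    scaled = steps.sum(axis=1) / np.sqrt(c)                  # c^{-1/2} X_{ct}

    print("mean ~", scaled.mean(), " var ~", scaled.var(), " (B_t: mean 0, var t =", t, ")")
    print("P(scaled > 1) ~", (scaled > 1).mean(), " vs Gaussian tail P(B_1 > 1) ~ 0.1587")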

28.2. Brownian Motion as a Markov Process. It is difficult to formulate the Markov


property in exact accordance with the definition of continuous time Markov chains (with
countable state space) in Lecture 18, without the benefit of more sophisticated technology
(i.e. measure theory). But because of the independent increments, it is possible to state the
Markov property in the following form.
Proposition 28.9 (Markov Property for Brownian Motion). Let (Bt )t≥0 be a standard Brow-
nian motion. Fix a time s ≥ 0, and let X_t = B_{t+s} − B_s. Then (X_t)_{t≥0} is a standard Brownian
motion, independent from (Bt )0≤t≤s .
Proof. Of course X0 = Bs − Bs = 0, and the trajectories of Xt are continuous. For given
0 ≤ t1 < · · · < tn < ∞, the increments of Xt are Xtj − Xtj−1 = (Btj +s − Bs ) − (Btj−1 +s −
Bs ) = Btj +s − Btj−1 +s = Bsj − Bsj−1 , where sj = tj + s so 0 ≤ s1 < · · · < sn < ∞.
Hence, they are shifted increments of the Brownian motion, and hence are independent.
Finally, if t1 < t2 then Xt2 − Xt1 = Bt2 +s − Bt1 +s which is a Brownian increment of length
(t2 + s) − (t1 + s) = t2 − t1 , and therefore has a N (0, t2 − t1 ) distribution, as required. Thus,
(Xt )t≥0 is a standard Brownian motion.
Moreover, suppose s > 0 and let 0 ≤ t_1 < · · · < t_n < s. Now for t ≥ 0, since X_t = B_{t+s} − B_s, we know that the increments B_{t+s} − B_s, B_s − B_{t_n}, B_{t_n} − B_{t_{n−1}}, . . . , B_{t_2} − B_{t_1}, B_{t_1} − B_0 are independent. So, in particular, X_t is independent from B_{t_1}, B_{t_2} − B_{t_1}, . . . , B_{t_n} − B_{t_{n−1}}. It is therefore also independent from the successive sums of these, namely B_{t_2} = (B_{t_2} − B_{t_1}) + B_{t_1}, and B_{t_3} = (B_{t_3} − B_{t_2}) + B_{t_2}, and so forth. Hence (X_t)_{t≥0} is independent from (B_t)_{0≤t≤s}. □


29. LECTURE 29: JUNE 9, 2012


Just as in the discrete state-space case, this implies a Strong Markov property. As usual, a
stopping time (in this case for Brownian motion) is a random variable T taking values in
[0, ∞] with the property that, for each t ≥ 0, the event {T ≤ t} depends only on {Bs }0≤s≤t .
Example 29.1. Fix x ∈ R, and let τx = min{t ≥ 0 : Bt = x}, the first passage time to x.
Then
{τx ≤ t} = {∃s ≤ t : Bs = x}
which is an event determined by the random variables (Bs )0≤s≤t . Hence τx is a stopping
time for (Bt )t≥0 .

Theorem 29.2 (Strong Markov property). Let (Bt )t≥0 be a standard Brownian motion, and let
T be a stopping time. For t ≥ 0, set Xt = BT +t − BT . Then (Xt )t≥0 is a standard Brownian
motion, independent from (Bt )0≤t≤T .
Example 29.3. Fix x > 0; we will determine the distribution of the first passage time τx .
This can be done with the following trick. If we had a discrete random variable Y , then
we could compute using the Law of Total Probability that, for any event A,
P(A) = Σ_y P(A, Y = y) = Σ_y P(A | Y = y) P(Y = y).

This formula does not make sense if Y is a continuous random variable with density f_Y,
since the sum is over a continuous set, and since P(Y = y) = 0. The analog of this formula
which does hold is
P(A) = ∫ P(A | Y = y) f_Y(y) dy. (29.1)
The range of the integral is over all possible values that Y can take on the event A. We apply
this in the setting A = {Bt > x} and Y = τx . Note: when Bt > x, since B0 = 0 and the
Brownian path is continuous, by the intermediate value theorem there is a time s ∈ (0, t)
where Bs = x; hence τx ∈ (0, t). Thus
P(B_t > x) = ∫_0^t P(B_t > x | τ_x = s) f_{τ_x}(s) ds. (29.2)
We can deal with the integrand as follows. First, by definition Bτx = x. Also, as we
integrate over 0 < s < t, we may write t = s + (t − s). Hence
P(Bt > x|τx = s) = P(Bs+(t−s) − Bτx > 0|τx = s) = P(Bτx +(t−s) − Bτx > 0|τx = s).
By the strong Markov property, Bτx +t − Bτx is a standard Brownian motion, independent
from (Bs )0≤s≤τx . In particular, it is independent from τx , which is measurable with respect
to (Bs )0≤s≤τx , and so
P(B_{τ_x + (t−s)} − B_{τ_x} > 0 | τ_x = s) = P(B_{τ_x + (t−s)} − B_{τ_x} > 0) = P(B_{t−s} > 0) = 1/2
where the last equality follows from the fact that Bt−s is a N (0, t − s) random variable
(hence symmetric). Plugging this into Equation 29.2 yields
P(B_t > x) = ∫_0^t (1/2) f_{τ_x}(s) ds = (1/2) P(τ_x ∈ (0, t)) = (1/2) P(τ_x < t).

FIGURE 1. The density f_{τ_1} of the passage time to 1 for a standard Brownian motion.

Thus P(τx < t) = 2P(Bt > x), which is a probability we can calculate explicitly (in terms
of the Gaussian kernel).
P(τ_x < t) = 2 P(B_t > x) = 2 ∫_x^∞ (1/√(2πt)) e^{−y²/2t} dy.

To calculate the density f_{τ_x}, it is convenient to make the change of variables y/√t = u, so
the integral becomes
P(τ_x < t) = 2 ∫_{x/√t}^∞ (1/√(2π)) e^{−u²/2} du.
This is convenient because the t only appears in the limits of integration. So we can
differentiate easily:
f_{τ_x}(t) = (d/dt) P(τ_x < t) = −(2/√(2π)) e^{−(x/√t)²/2} · (d/dt)(x/√t) = (x/√(2πt³)) e^{−x²/2t}.
We assumed that x > 0 in all of the above. Note, however, that (−Bt )t≥0 is also a standard
Brownian motion (as can be very easily verified from the definition), and so by symmetry
we will then have the more general formula
f_{τ_x}(t) = (|x|/√(2πt³)) e^{−x²/2t}, x ∈ R.
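The identity P(τ_x < t) = 2 P(B_t > x) can be checked against simulated paths. Here is a rough sketch (not from the notes; the parameters are arbitrary choices, and the time discretization slightly undercounts crossings, so the empirical value runs a bit low):

    import math
    import numpy as np

    rng = np.random.default_rng(6)
    x, t, dt, n_paths = 1.0, 1.0, 0.001, 10000

    increments = rng.normal(0.0, np.sqrt(dt), (n_paths, int(t / dt)))
    running_max = increments.cumsum(axis=1).max(axis=1)  # max of each discretized path on [0, t]

    empirical = (running_max >= x).mean()                # estimates P(tau_x <= t)
    formula = math.erfc(x / math.sqrt(2 * t))            # equals 2 P(B_t > x) for B_t ~ N(0, t)
    print("empirical P(tau_x <= t) ~", empirical, " formula 2 P(B_t > x) =", formula)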

Example 29.4. Consider the Maximum process for Brownian motion: Mt = max0≤s≤t Bs .
First note that Mt ≥ 0 for all t, since Mt is increasing and M0 = B0 = 0. We can calculate
the distribution of the random variable Mt using the passage time. Notice that the event
{Mt ≥ x} is equal to the event that there is some time s ≤ t for which Bs ≥ x. Since
Brownian paths are continuous, this is equal to the event that there is some (possibly
different) time s′ ≤ t where B_{s′} = x, and so we have {M_t ≥ x} = {τ_x ≤ t}. Thus, by

Example 29.3, for x ≥ 0 we have


P(M_t ≥ x) = P(τ_x ≤ t) = 2 ∫_x^∞ (1/√(2πt)) e^{−y²/2t} dy.
Thus, Mt has a density:
f_{M_t}(x) = −(d/dx) 2 ∫_x^∞ (1/√(2πt)) e^{−y²/2t} dy = √(2/(πt)) e^{−x²/2t} 1_{x≥0}.
So the density of the maximum process is exactly 2 times the Gaussian density on the
positive real axis. I.e. it is exactly twice as likely for Mt to exceed x as it is for Bt to exceed
x. (This is a symmetry you may have expected, but it is not obvious.)

29.1. The Zero Set of Brownian Motion. Let (Bt )t≥0 be a standard Brownian motion,
and denote by Z the (random) set of zeroes of Bt :
Z = {t ≥ 0 : Bt = 0}.
Using techniques like the ones above, we can prove the following exact formula, which is a great asset in understanding the zeroes, and the general behavior, of Brownian motion.
Proposition 29.5 (Arcsine Law for Zeroes of BM). Let 0 < s < t. Define
Θ(s, t) = P(Z ∩ [s, t] ≠ ∅).
Then
Θ(s, t) = 1 − (2/π) arcsin √(s/t).
Proof. To begin, we will condition on the value of Brownian motion at time s. Using
Equation 29.1,
Θ(s, t) = P(Z ∩ [s, t] ≠ ∅) = ∫_{−∞}^{∞} P(Z ∩ [s, t] ≠ ∅ | B_s = x) f_{B_s}(x) dx.

The density fBs is a centered Gaussian of variance s. As for the conditional probability,
we can use the Markov property: given Bs = x, we restart the process at time s: take
Xr = Bs+r − Bs = Bs+r − x. Then (Xr )r≥0 is a standard Brownian motion. Translating, the
event that B· has a zero in [s, t] is the same as the event that X· reaches height −x in the
interval [0, t − s]. Since X· is a standard Brownian motion, this is the event that τ−x ≤ t − s.
Hence:
Θ(s, t) = ∫_{−∞}^{∞} P(τ_{−x} ≤ t − s) (1/√(2πs)) e^{−x²/2s} dx.
Referring to Example 29.3, the random variable τ_{−x} has density f_{τ_{−x}}(r) = (|x|/√(2πr³)) e^{−x²/2r}. This is symmetric in x, as is the Gaussian density, so we have
Θ(s, t) = 2 ∫_0^∞ (∫_0^{t−s} f_{τ_{−x}}(r) dr) (1/√(2πs)) e^{−x²/2s} dx
= √(2/(πs)) ∫_0^∞ e^{−x²/2s} (∫_0^{t−s} (x/√(2πr³)) e^{−x²/2r} dr) dx.

We must now evaluate this double integral. To start, we exchange the order of integration
(since everything in sight is positive, this follows from Fubini’s theorem).
Θ(s, t) = (1/(π√s)) ∫_0^{t−s} r^{−3/2} (∫_0^∞ x e^{−x²/2r − x²/2s} dx) dr.
The inside integral can be explicitly evaluated, since the derivative (d/dx) e^{−x²/2r − x²/2s} is equal to −(1/r + 1/s) x e^{−x²/2r − x²/2s}. Thus
Θ(s, t) = (1/(π√s)) ∫_0^{t−s} r^{−3/2} · [−(1/(1/r + 1/s)) e^{−x²/2r − x²/2s}]_{x=0}^{x→∞} dr
= (1/(π√s)) ∫_0^{t−s} r^{−3/2} · (rs/(r + s)) dr = (1/π) ∫_0^{t−s} √(s/r) · dr/(r + s).
Now, we substitute u = √(r/s). Then r = su², so dr = 2su du. The limits of integration become 0 up to √((t − s)/s), and we have
Θ(s, t) = (1/π) ∫_0^{√((t−s)/s)} (1/u) · 2su du/(su² + s) = (2/π) ∫_0^{√((t−s)/s)} du/(1 + u²).
The antiderivative of 1/(1 + u²) is arctan u, so we can write this as
Θ(s, t) = (2/π)[arctan √((t − s)/s) − arctan 0] = (2/π) arctan √((t − s)/s).
This is a perfectly good final answer, but it is more common to write this as an arcsine. If we let θ = arctan √((t − s)/s), then θ is an angle in a right triangle whose opposite side has length √(t − s) and adjacent side has length √s. The hypotenuse therefore has length √(s + (t − s)) = √t. Hence cos θ = √(s/t). Using the relation sin(π/2 − θ) = cos θ, we therefore have
π/2 − θ = arcsin √(s/t),
and so
Θ(s, t) = (2/π)[π/2 − arcsin √(s/t)] = 1 − (2/π) arcsin √(s/t).


Example 29.6. Let L = max{t ≤ 1 : B_t = 0} be the largest zero of Brownian motion in [0, 1]. Then for t < 1 the event {L ≥ t} is the event that Z ∩ [t, 1] ≠ ∅, and so
P(L ≥ t) = P(Z ∩ [t, 1] ≠ ∅) = Θ(t, 1) = 1 − (2/π) arcsin √t.
So we can calculate the density of L:
f_L(t) = (d/dt) P(L < t) = (d/dt) (2/π) arcsin √t = 1/(π√(t(1 − t))), 0 < t < 1.
This density is symmetric about 1/2, with a minimum there. It blows up at 0 and at 1; so the maximal zero L is most likely to be either very close to 0 or very close to 1.
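The arcsine law for L is also easy to probe by simulation. The sketch below (not from the notes; the grid size and path count are arbitrary choices) approximates L by the last sign change of a discretized path on [0, 1] and compares P(L ≤ 1/2) with the arcsine prediction (2/π) arcsin √(1/2) = 1/2.

    import numpy as np

    rng = np.random.default_rng(7)
    dt, n_paths = 0.001, 10000
    n_steps = int(1.0 / dt)
    times = np.arange(n_steps + 1) * dt

    incr = rng.normal(0.0, np.sqrt(dt), (n_paths, n_steps))
    B = np.concatenate([np.zeros((n_paths, 1)), incr.cumsum(axis=1)], axis=1)

    # A zero lies in [t_k, t_{k+1}] whenever the path changes sign (or hits 0) on that step.
    crossing = B[:, :-1] * B[:, 1:] <= 0
    last = crossing.shape[1] - 1 - np.argmax(crossing[:, ::-1], axis=1)  # last crossing step
    L = times[last + 1]                                                  # approximate last zero

    print("empirical P(L <= 0.5) ~", (L <= 0.5).mean(),
          " arcsine prediction (2/pi) arcsin sqrt(0.5) =", 0.5)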

We can use the arcsine law for the zeroes to show, for example, that Brownian motion
(in one dimension) is recurrent.

Theorem 29.7 (Recurrence of Brownian Motion on R). Let (Bt )t≥0 be a standard Brownian
motion on R. Then the zero set Z is unbounded with probability 1. (I.e. the process returns to 0
at arbitrarily large times.)
Proof. If the process is not recurrent, there is some time s after which there are no zeroes (with positive probability):
P(∀t ≥ s, B_t ≠ 0) = P(Z ∩ [s, ∞) = ∅).
The event {Z ∩ [s, ∞) = ∅} is contained in the event {Z ∩ [s, t] = ∅} for any t > s.
Hence
P(Z ∩ [s, ∞) = ∅) ≤ P(Z ∩ [s, t] = ∅) = 1 − Θ(s, t) = (2/π) arcsin √(s/t).
As t → ∞, arcsin √(s/t) → 0 for any fixed s, and hence P(Z ∩ [s, ∞) = ∅) = 0 for any s > 0. Thus, there are arbitrarily large elements in Z, with probability 1. □
Remark 29.8. Since we proved, early on, that SRW on Z is recurrent, it is natural that
its scaling limit Brownian motion is recurrent as well – but this is highly non-obvious.
Indeed, given that space is scaled along with time, the kind of recurrence that is more natural to expect for Brownian motion is this: given any ε > 0, the set of t ≥ 0 for which |B_t| ≤ ε
is unbounded with probability 1. The fact that the Brownian motion actually returns to 0
infinitely often is a very fine property. Now, one can consider Brownian motion on R^d: simply take B_t^{(d)} = (B_t^1, B_t^2, . . . , B_t^d) where the B_t^j are independent Brownian motions. Donsker’s invariance theorem extends to this case, and B_t^{(d)} is the scaling limit of the SRW on Z^d. Recall that SRW on Z^2 is recurrent, while SRW on Z^d is transient for d ≥ 3. If we interpret recurrence and transience in the weaker sense of return to a neighborhood of 0, then these translate to Brownian motion as well. But it is not true that B_t^{(2)} is recurrent in the strong sense: the event that {B_t^{(2)} ≠ (0, 0) for all large t} has positive probability. For d ≥ 3, the results match up again: {|B_t^{(d)}| → ∞} has probability 1 for d ≥ 3.

The recurrence of Brownian motion on R, together with its time inversion property, gives one of the most graphic demonstrations of just how rough Brownian paths are.
Corollary 29.9. For any ε > 0, (0, ε) ∩ Z ≠ ∅ with probability 1; hence (0, ε) contains infinitely many zeroes of (B_t)_{t≥0}.
Proof. Consider Wt = tB1/t . We have shown that (Wt )t≥0 is a standard Brownian motion,
and hence with probability 1 Wt = 0 for arbitrarily large times t. But Bt = tW1/t , and
therefore B_t has zeroes for t arbitrarily close to 0. □
Remark 29.10. By the strong Markov property, if T is a stopping time at which BT = 0,
then the same argument shows that Z ∩ [T, T + ε) contains infinitely-many zeroes of the Brownian motion. It is important to note that this only holds true for stopping times T, not just any random time at which the Brownian motion has value 0. So, for example, if we fix
a time t0 and define Tt0 = min{t ≥ t0 : Bt = 0}, then Tt0 is a stopping time, and indeed
zeroes of the Brownian motion will accumulate to the right of Tt0 . On the other hand,
there is a largest zero L smaller than 1 and, by Example 29.6, it has a positive probability of being less than 0.9. Indeed, from the arcsine law for Z, given any interval [s, t] with 0 < s < t, the probability that there are no zeroes in [s, t] is (2/π) arcsin √(s/t), which is strictly positive.

What can be proved, using the stopping times T_{t_0} above, is that Z is a perfect set: it is closed (since t ↦ B_t is continuous) and contains no isolated points. So, considering the largest zero L less than 1 again, there must be a sequence of zeroes less than L converging to L from below. In general, the zero set has the topological structure of a Cantor set.
