Stochastic Proccess
Stochastic Proccess
1.2. Markov Chains. Before we define Markov process, we must define stochastic pro-
cesses.
Definition 1.1. Let S be a fixed state space. A (discrete time) stochastic process (Xn )n∈N is a
sequence of random variables all defined on a fixed probability space (Ω, P) all taking values in S.
A (continuous time) stochastic process (Xt )t≥0 is a collection of random variables Xt indexed
by t ∈ [0, ∞), all defined on a fixed probability space (Ω, P) all taking values in S.
We will discuss exclusively discrete time processes for the first half of the course.
Example 1.2. Let X1 , X2 , . . . be i.i.d. real-valued random variables; then they form a dis-
crete time stochastic process with state space R. Similarly, the process Sn = X1 + · · · + Xn
is a discrete time stochastic process with state space R.
Example 1.3. As a special case of the previous example, sample the i.i.d. variables from
the symmetric Bernoulli distribution: P(Xn = ±1) = 12 . Then The state space for the
process (Xn ) is just {±1}. The corresponding process Sn has state space Z. The process
Sn is called the symmetric random walk on Z.
Example 1.4. Fix N > 0 in N. We will soon describe stochastic processes Xn which be-
haves as the symmetric random walk Sn does from Example 1.3 when 0 < Xn < N , but
have different behavior when Xn reaches 0, N . These will be known as reflected, par-
tially reflected, and absorbing random walks. To describe them, we need the technology
of Markov processes, which we introduce next.
1
2
Given a collection X1 , . . . , Xn of random variables with discrete state space S, their joint
distribution is completely described by the collection of probabilities
{P(X1 = i1 , X2 = i2 , . . . , Xn = in ) : i1 , . . . , in ∈ S}.
Thinking of the index as a time parameter, it is useful to express these numbers, using the
definition of conditional probability, as
P(X1 = i1 , . . . , Xn = in )
= P(X1 = i1 )P(X2 = i2 |X1 = i1 )P(X3 = i3 |X1 = i1 , X2 = i2 ) · · · P(Xn = in |X1 = i1 , . . . , Xn−1 = in−1 ).
Definition 1.5. Let Xn be a discrete time stochastic process, with state space S that is finite or
countably infinite. Then Xn is called a (discrete time) Markov chain if, for each n ∈ N and any
i1 , . . . , in ∈ S,
P(Xn = in |X1 = i1 , . . . , Xn−1 = in−1 ) = P(Xn = i1 |Xn−1 = in−1 ).
A Markov chain (sometimes called a Markov process, though that is usuall reserved
for the case of an uncountable state space) is a stochastic process whose evolution to the
next state depends only on the current state, not the past behavior. Let’s reconsider the
above examples:
• In Example 1.2 (making the additonal assumption that the state space is a count-
able or finite subset of R to conform to the definition), the independence assump-
tion shows that
P(X1 = i1 , . . . , Xn = in ) = P(X1 = i1 ) · · · P(Xn = in ).
In particular, this means that P(Xn = in |X1 = i1 , . . . , Xn−1 = in−1 ) = P(Xn = in ) =
P(Xn = in |Xn−1 = in−1 ). Hence, Xn is a Markov chain. For the process Sn of sums,
P(S1 = i1 , S2 = i2 , . . . , Sn = in ) = P(X1 = i1 , X2 = i2 − i1 , . . . , Xn = in − in−1 ).
Thus
P(S1 = i1 , . . . , Sn = in )
P(Sn = in |S1 = i1 , . . . , Sn−1 = in−1 ) =
P(S1 = i1 , . . . , Sn−1 = in−1 )
P(X1 = i1 , X2 = i2 − i1 , . . . , Xn = in − in−1 )
=
P(X1 = i1 , X2 = i2 − i1 , . . . , Xn−1 = in−1 − in−2 )
= P(Xn = in − in−1 )
while
P(Sn−1 = in−1 , Sn = in )
P(Sn = in |Sn−1 = in−1 ) = .
P(Sn−1 = in−1 )
The numerator can be rewritten as
P(Sn − Sn−1 = in − in−1 , Sn−1 = in−1 ) = P(Xn = in − in−1 , Sn−1 = in−1 )
= P(Xn = in − in−1 )P(Sn−1 = in−1 ),
since Xn is independent from X1 , . . . , Xn−1 and hence from Sn−1 . Dividing, we
achieve the desired equality. So Sn is a Markov chain.
3
• In the special case of Example 1.3, we can calculate the probabilities P(Sn = in |Sn−1 =
in−1 ) = P(Xn = in − in−1 ). Since Xn ∈ {±1} with equal probabilities, we must have
in − in−1 ∈ ±1 to have positive probability, and in this case each has value 12 . In
other words:
1
P(Sn = i ± 1|Sn−1 = i) = ,
2
and all other “transition probabilities” are 0.
In general, for a Markov chain (Xn )n∈N , the probabilities P(Xn = in |Xn−1 = in−1 ) are
called the transition probabilities. The last example shows that the transition probabili-
ties of some Markov chains are even more uniform than the definition insists.
Definition 1.6. A Markov chain is called time-homogeneous if, for any i, j ∈ S, the transition
probability P(Xn = j|Xn−1 = i) is independent of n. That is: there is a fixed function p : S × S →
[0, 1] with the property that, for all n,
X X X P(Xn = j, Xn−1 = i)
p(i, j) = P(Xn = j|Xn−1 = i) = = 1.
j∈S j∈S j∈S
P(Xn−1 = i)
for each i ∈ S, by the law of total probability. If we let P be the |S| × |S| matrix with i, j
entry is p(i, j), this says that the rows of P all sum to 1; P is a stochastic matrix.
Example 1.7. Consider the following Markov chain. The state space is {0, . . . , N }. As
with the full random walk, we have transition probabilities p(i, j) = 12 if 0 < i, j < N and
j = i ± 1, and p(i, j) = 0 otherwise. We have yet to specify the transition probabilities
involving the states 0, N . As usual p(0, i) = p(i, 0) = p(N, j) = p(j, N ) = 0 if i 6= 0, 1 and
j 6= N − 1, N . But we impose different boundary conditions for different kinds of random
walks.
• Reflecting Random Walk. Here p(0, 1) = 1 and p(N, N − 1) = 1 (and so p(0, 0) =
0 = p(N, N )). That is: the random walker bounces off the walls surely when reach-
ing them.
• Absorbing Random Walk. Here p(0, 0) = p(N, N ) = 1; so when the walker reaches
the boundary, he says there forever.
• Partially Reflecting Walk. In general, we can set p(0, 0) = p, p(0, 1) = 1 − p,
p(N, N ) = q, and p(N, N − 1) = 1 − q for some arbitrary probabilities 0 ≤ p, q ≤ 1.
As we will see in the next weeks, the long-time behavior of the random walk is very
depenedent on the parameters p, q in the partially reflecting walk.
4
Proof. In light of Lemma 2.3, this is just the restatement of the fact that powers of P com-
mute under matrix multiplication:
X
pm+n (i, j) = [Pm+n ]ij = [Pm Pn ]ij = [Pm ]ik [Pn ]kj
k∈S
2.3. The Markov Property. One way to state the definition of a Markov chain is that,
conditioned on the present, the future is independent from the past. This can be made
precise and slightly stronger, in the following proposition which may be taken as the
(recursive) definition of a homogeneous time Markov chain.
Proposition 2.5. Let (Xn )n∈N be a time-homogeneous Markov chain with discrete state space S
and transition probabilities p(i, j). Fix m ∈ N and i ∈ S, and suppose that P(Xm = i) > 0.
Then conditional on Xm = i, the process (Xm+n )n∈N is a time-homogeneous Markov chain with
the same transition probabilities p(i, j), and independent from X0 , X1 , . . . , Xm .
In other words: if A is an event determined only by X0 , X1 , . . . , Xm with P(A ∩ {Xm = i}) > 0,
then for all n ≥ 0
P(Xm+1 = im+1 , . . . , Xm+n = im+n |A∩{Xm = i}) = p(i, im+1 )p(im+1 , im+2 ) · · · p(im+n−1 , im+n ).
Proof. From the definition of conditional probability, what we need to prove is that
P({Xm+1 = im+1 , . . . , Xm+n = im+n } ∩ A)
= P({Xm = i} ∩ A)p(i, im+1 )p(im+1 , im+2 ) · · · p(im+n−1 , im+n ).
2.4. Hitting Times. One important question about a stochastic process is: when is the
first time it enters a certain set? Let A be an event, and define the random time τA by
τA = min{n ∈ N : Xn ∈ A}.
(Here we make the convention that min ∅ = ∞.) For example, we may wish to know (the
distribution of) the first time the symmetric random walk gets distance N away from the
origin. Better yet: in the random walk with absorbing boundaries 0, N , we may want to
know the probability that the walk reaches N before it reachs 0, given an initial distribu-
tion. This can be formalized as a general problem of this form:
Let A, B ⊂ S with A ∩ B = ∅. For i ∈ S, let h(i) = P(τA < τB |X0 = i). Calculate the
function h.
When (Xn )n∈N is a Markov chain, this becomes a linear algebra problem. First note that it
is always true that
h(i) = 1 if i ∈ A, h(i) = 0 if i ∈ B.
Now, take i ∈
/ A ∪ B. Conditioning on the first step of the process, we have
X
P(τA < τB |X0 = i) = P(τA < τB |X0 = i, X1 = j)P(X1 = j|X0 = i).
j∈S
By the Markov property, conditioning on X1 = j gives a new Markov process with the
same transition probabilities, hence the same joint distributions. The probability of the
event {τA < τB } = {τA − 1 < τB − 1} depends only on the joint distributions of the Xn s,
and so it follows that
P(τA < τB |X0 = i, X1 = j) = P(τA < τB |X1 = j) = P(τA − 1 < τB − 1|X0 = j) = h(j).
Thus, we have the equations
X
h(i) = p(i, j)h(j). (2.1)
j∈S
If the state space is finite, then we can think of h as a vector h ∈ R|S| (with hi = h(i)),
and these equations say that h = Ph. That is, h is an eigenvector of P with eigenvalue 1.
This is not typically enough to nail down h (it could be scaled, or 1 may not be a simple
eigenvalue), but the boundary conditions h(i) = 1 for i ∈ A and h(i) = 0 for i ∈ B are
often enough to uniquely determine h. Let’s consider an example.
Example 2.6. Let (Xn )n∈N be random walk on Z, but not necessarily symmetric: we define
p(i, i + 1) = q and p(i, i − 1) = 1 − q for some q ∈ [0, 1] and all i ∈ Z. If q = 21 this is the
symmetric random walk of Example 1.3. Let’s consider the probability that the random
walk reaches N before 0; with the language above, A = {N } and B = {0}. Thus h(N ) = 1
and h(0) = 0. Equation 2.1 in this case says
h(i) = (1 − q)h(i − 1) + qh(i + 1), 0 < i < N.
It is convenient to rewrite this in the form
(1 − q)[h(i) − h(i − 1)] = q[h(i + 1) − h(i)].
If q = 0 this gives h(i)−h(i−1) = 0 for 0 < i < N and since h(0) = 0 it follows that h(i) = 0
for i < N . (Indeed, here the walk always moves left.) Otherwise, let c = h(1) − h(0), and
7
1−q
let θ = q
. Define ∆h(i) = h(i)−h(i−1) for i ≥ 1. Thus, we have a geometric progression
∆h(i + 1) = θi c, 1 ≤ i ≤ N − 1.
i−1
X
h(i) = ∆h(1) + · · · + ∆h(i) = c + θc + · · · + θi−1 c = c θj
j=0
for 2 ≤ i ≤ N .
• Symmetrix RW. Here q = 12 so θ = 1. This gives h(i) = ci for 2 ≤ i ≤ N . In
particular h(N ) = cN . But also h(N ) = 1, so c = N1 , and we have h(i) = Ni .
In terms of gambling: suppose we bet on coin tosses. If heads comes up, you win
$1; if tails, you lose $. You begin with $i, and will cash out either when you go bust
or win $N. If the coin is fair, the probability of cashing out rather than going bust
is i/N – it increases linearly th closer your intial capital is to the cash-out.
i−1
X 1 − θi
h(i) = c θj = c , 2 ≤ i ≤ N.
j=0
1−θ
N
In particular 1 = h(N ) = c 1−θ
1−θ
, and so c = 1−θ
1−θN
. Combining, this yields
1 − θi
h(i) = , 1 ≤ i ≤ N.
1 − θN
This highlights the gambler’s ruin phenomenon. In the gambling game above, we
suppose the coin is biased against you (perhaps only slightly). Suppose, for ex-
ample, that q = 0.49. If you start with $100 and set the cash-out at $200, then
h(100) = 0.018, as oppoed to 0.5 in the q = 0.5 case.
8
3.1. Expected Hitting Times. Let (Xn )n∈N be a Markov chain with state space S and tran-
sition probabilities p(i, j). Let A ⊆ S, and as usual set τA = min{n ≥ 0 : Xn ∈ A}. Let us
define the function g : S → [0, ∞] by
g(i) = Ei (τA ) = E[τA |X0 = i].
Note: the conditional expectation here is the expectation with respect to the conditional
probability Pi ,
∞ ∞
X X P(Y = n, X0 = i)
Ei (Y ) = nPi (Y = n) = n .
n=1 n=1
P(X 0 = i)
Now,
Pi (τA = n|X1 = j) = P(X0 ∈
/ A, X1 ∈
/ A, . . . , Xn−1 ∈
/ A, Xn ∈ A|X0 = i, X1 = j)
= P(X0 ∈
/ A, . . . , Xn−2 ∈
/ A, Xn−1 ∈ A|X0 = j) = Pj (τA = n − 1)
by the Markov property. Hence
∞
X ∞
X
Ei [τA |X1 = j] = nPi (τA = n|X1 = j) = nPj (τA = n − 1)
n=1 n=1
X∞
= (m + 1)Pj (τA = m)
m=0
X∞ ∞
X
= mPj (τA = m) + Pj (τA = m).
m=0 m=0
The first term is just Ej (τA ), and since τA is N ∪ ∞-valued, the latter sum is 1. Hence we
have
X X X
g(i) = [Ej (τj ) + 1]Pi (X1 = j) = [g(j) + 1]p(i, j) = 1 + p(i, j)g(j). (3.1)
j∈S j∈S j∈S
This system of equations may not have a unique solution, since g(i) = ∞ for i ∈ / A is
always a solution. But if it is known a priori that the expected hitting time is finite, it may
often be used to calculate the function g.
Remark 3.1. Let 1 denote the column vector with all entries equal to 1. If we interpret the
function g as a vector g ∈ R|S| , then Equation 3.1 can be written as
g = 1 + Pg
9
where P is the transition matrix for the Markov chain. This is an affine equation, rather
than a linear equation; it is less amenable to the techniques of linear algebra. But in many
examples, it can be solved iteratively.
Example 3.2. On average, how many times do we need to toss a coin to get two consecutive
heads? To answer this question, we need to construct an appropriate Markov chain. There
are different ways to approach this; probably the simplest is to let Xn denote the number
of consecutive heads one has immediately seen immediately after the nth toss; we stop
the chain once we see two consecutive heads. This means Xn is a Markov chain with state
space {0, 1, 2} and transition matrix
1 1
2 2
0
P = 2 0 21 .
1
0 0 1
As above, set g(i) = Ei (τ{2} ). Then g(2) = 0. Then Equation 3.1 says
g(0) = 1 + p(0, 0)g(0) + p(0, 1)g(1) + p(0, 2)g(2) = 1 + 21 g(0) + 21 g(1)
g(1) = 1 + p(1, 0)g(0) + p(1, 1)g(1) + p(1, 2)g(2) = 1 + 12 g(0) + 21 g(2) = 1 + 12 g(0)
(and the third equation gives no information since it says g(2) = g(2)). The first equation
simplifies to 12 g(0) − 21 g(1) = 1 or g(0) − g(1) = 2, and the seconq equation says g(0) =
2g(1) − 2. Thus (2g(1) − 2) − g(1) = 2 and so g(1) = 4; therefore g(0) = g(1) + 2 = 6.
So, starting having seen no heads (of course), on average it takes about 6 tosses until 2
consecutive heads come up.
In making the above calculations, the key tool was the Markov property: conditioned
on X1 = j, the chain behaves after time 1 just like another Markov chain started from
j. To make other calculations, we will need something even stronger: we will want to
apply this reasoning not only to shifts by deterministic times, but also by (appropriate)
random times. I.e. we will see that, if T is an appropraite N-valued random variable, then
the shifted process XT +n is still a Markov chain with the same transition probabilities.
3.2. Stopping Times. The appropriate kind of random time alluded to in the above para-
graph is a stopping time.
Definition 3.3. Let (Xn )n∈N be a discrete-time stochastic process. A stopping time is a random
variable T with state space N ∪ {∞} such that, for each n, the event {T = n} depends only on
X0 , X1 , . . . , Xn .
It is typical to think of stopping times in terms of betting strategies. If Xn represents
some statistic associated to a gambling game, one has to decide when to bet (or fold)
based only on information up to the present time; but that information depends on the
random occurrences in the game up to that time. For example: one could decide to sell
a stock once it reaches $200; this is a stopping time. One would prefer to sell it the day
before it drops; this is not a stopping time.
Example 3.4. Fix i ∈ S. The return time Ti is the first time that Xn returns to state i after
time n = 0 (where we condition on X0 = i). That is:
Ti = min{n ≥ 1 : Tn = i}.
10
For any given n, the event {Ti = n} can be written as {X1 6= i, X2 6= i, Xn−1 6= i, Xn = i}.
Thus, the event {Ti = n} depends only on the values of X0 , . . . , Xn , and hence Ti is a
stopping time. Note: the hitting time τ{i} = min{n ≥ 0 : Xn = i} is generally different
from Ti (they differ if X0 = i which we generally condition on).
Example 3.5. If T = min{n ≥ 0 : Xn+1 = i}, then {T = n} = {X0 6= i, X1 6= i, . . . , Xn 6=
i, Xn+1 = i} depends on the future value of Xn+1 , and so T is not a stopping time.
3.3. The Strong Markov Property. Restarting a Markov chain at a stopping time pro-
duces a new Markov chain with the same transition probabilities.
Proposition 3.6. Let (Xn )n∈N be a time-homogeneous Markov chain with state space S and tran-
sition probabilities p(i, j). Let T be a stopping time. Suppose i ∈ S and P(XT = i) > 0. Then,
conditional on XT = i, (XT +n )n∈N is a time-homogeneous Markov chain with transition proba-
bilities p(i, j), independent of X0 , X1 , . . . , XT .
In other words: if A is an event that depends only on X0 , X1 , . . . , XT and P(A ∩ {XT = i}) > 0,
then for all n ≥ 0 and all i1 , . . . , in ∈ S,
P(XT +1 = i1 , XT +2 = i2 , . . . , XT +n = in |A ∩ {XT = i}) = p(i, i1 )p(i1 , i2 ) · · · p(in−1 , in ).
Proof. What we need to show is that
P(A ∩ {XT = i, XT +1 = i1 , . . . , XT +n = in }) = P(A ∩ {XT = i})p(i, i1 ) · · · p(in−1 , in ).
Because A depends only on X0 , . . . , XT , the event A∩{T = m} depends only on X0 , . . . , Xm .
We therefore decompose according to the value m of T and apply the Markov property:
P(A ∩ {XT = i, XT +1 = i1 , . . . , XT +n = in })
X∞
= P({T = m} ∩ A ∩ {Xm = i, Xm+1 = i1 , . . . , Xm+n = in })
m=0
X∞
= P({T = m} ∩ A ∩ Xm = i)p(i, i1 ) · · · p(in−1 , in )
m=0
= P(A ∩ {XT = i})p(i, i1 ) · · · p(in−1 , in ).
11
4.1. Transience and Recurrence. Let (Xn )n∈N be a Markov chain with state space S. There
are only two eventual behaviors a state can have under the action Xn .
Definition 4.1. A state i ∈ S is called recurrent if Pi (Xn = i for infinitely many n) = 1. A
state i ∈ S is called transient of Pi (Xn = i for infinitely many n) = 0.
We will use the strong Markov property to show that every state is either recurrent or
transient. The idea is to use the following collection of stopping times. Fixing i ∈ S, and
conditioning on the event X0 = i, we set
Ti,1 = 0, Ti,k = min{n > Ti,k : Xn = i} for k ≥ 2.
So Ti,k is the time of the k th visit to i. Note that, for each m ∈ N,
{Ti,k = m} = {XTi,k−1 +1 6= i, XTi,k−1 +2 6= i, . . . , Xm−1 6= i, Xm = i}
depends only on X0 , . . . , Xm (inductively verified, since Ti,k−1 is involved). These gener-
alize the first return time Ti = Ti,2 = min{n ≥ 1 : Xn = i}, which you saw was a stopping
time in the previous lecture. Now, define
ri = Pi (Ti < ∞).
Assume, for the moment, that, for fixed k ≥ 1, Ti,k < ∞. So-conditioned, the strong
Markov property yields that the process XTi,k +n is a Markov chain with the same transi-
tion probabilities as (Xn ). It follows that
P(Ti,k+1 < ∞|Ti,k < ∞) = Pi (Ti < ∞) = ri .
By induction, then, we have
P(Ti,k < ∞) = rik−1 .
This leads to the desired dichotomy.
P∞
Theorem 4.2. Let i ∈ S. Then ri = 1 iff n=0 pn (i, i) = ∞ iff i is recurrent; and ri < 1 iff
P ∞
n=0 pn (i, i) < ∞ iff i is transient.
Remark 4.3. The statement of Theorem 4.2 shows that one can verify recurrence/transience
for the state i either by
P calculating (or estimating) the probability ri , or by calculating (or
estimating) the sum pn (i, i); both are useful in actual examples.
Remark 4.4. The intuition is pretty clear here. If ri = 1 then we’re guaranteed to return
to i, and then (since the process is Markov) we start over and are guaranteed to return
again, eventually visiting i infinitely often. If, onthe other hand, ri < 1, there is a chance
we never return the first time; in the (uncertain) case we do, there is the same chance
we won’t visit a second time, etc.; these probabilities compound and eventually we stop
returning to i. The proof below makes this intuition rigorous.
12
In order to prove the theorem, we need to recall a useful fact from undergraduate prob-
ability. Let T be an N-valued random variable. Then one way to calculate its expectation
is
∞ ∞ X j ∞ X∞ ∞
X X X X
E(T ) = jP(T = j) = P(T = j) = P(T = j) = P(T ≥ k).
j=1 j=1 k=1 k=1 j=k k=1
Then {Ni ≥ k} = {Ti,k < ∞}, and so Pi (Ni ≥ k) = P(Ti,k < ∞) = rik−1 . We also have
∞ ∞ ∞
Ei [1Xn =i ] =
X X X
Ei [Ni ] = Pi (Xn = i) = pn (i, i).
n=0 n=0 n=0
as claimed.
• If ri < 1 then Pi (Ni ≥ k) = rik−1 is exponentially small and so Pi (Ni = ∞) = 0,
which his precisely the statement that i is transient. Moreover, we have
∞ ∞ ∞
X X X 1
pn (i, i) = Ei [Ni ] = Pi (Ni ≥ k) = rik−1 = < ∞.
i=0 k=1 k=1
1 − ri
Example 4.5. Consider simple random walk on Z, with p(i, i + 1) = p and p(i, i − 1) =
1 − p. For the n-step transition probabilities, we have pn (i, i) = 0 unless n is even. So we
calculated p2n (i, i). Consider the trajectory of the path; in order for X2n = i given X0 = i, it
2n
must be that n of the steps are left and n are right. There are n such paths, and because
of the independence of the steps, each has probability pn (1−p)n of having occurred. Thus,
we have
2n n
p2n (i, i) = p (1 − p)n .
n
So the recurrence/transience is decided by the summability of
∞ ∞
X X 2n n
pn (i, i) = p (1 − p)n .
n=0 n=0
n
Of course this is finite (in fact 0) of p ∈ {0, 1}, where the walk moves deterministically
2n
either right or left. In general, we can make the blunt upper-bound n ≤ 4n : this follows
from the overcount of paths, since 2n
n
counts the number of length-2n paths that return
13
to their starting point, while 4n = 22n counts the number of all length-2n paths. Then we
have
X∞ ∞
X ∞
X
pn (i, i) ≤ 4n pn (1 − p)n = (4p(1 − p))n .
n=0 n=0 n=0
1
The function p 7→ p(1 − p) takes maximum value at p = 12 , and is strictly less than 14
4
elsewhere in [0, 1]; hence the series is summable when p 6= 12 and all such non-symmetric
random walks on Z are transient.
When p = 21 this gives ∞ as an upper-bound for ∞
P
n=0 pn (i, i) which is not useful; we
need a lower-bound. Here we use Stirling’s approximation:
√
n! ∼ 2πn(n/e)n .
From this it follows that
√
4πn(2n/e)2n 22n
2n (2n)!
= ∼ = √ .
n (n!)2 2πn(n/e)2n πn
Thus, in the case p = 12 , we then have
∞ ∞ n n X ∞
X XX 22n 1 1 1
pn (i, i) √ = √ = ∞.
n=1 n=0
πn 2 2 n=1
πn
(I.e. one way to get from i back to i in ` + m + n steps is to move from i to j in m steps, that
return to j in ` steps, and then move from j to i in n steps.) Thus, we have the estimate
∞ ∞ ∞
X X p`+m+n (i, i) 1 X
p` (j, j) ≤ = pk (i, i) < ∞
`=0 `=0
pm (i, j)pn (j, i) pm (i, j)pn (j, i) k=m+n
by the assumption that i is transient. Hence j is transient as well.
15
This summand is a multinomial coefficient: from 2n, choose m (1, 0) steps, m (−1, 0) steps,
n − m (0, 1) steps, and n − m (0, 1) steps. This is given by
2n (2n)!
=
m m n−m n−m (m!) ((n − m)!)2
2
(Think of it this way: to choose n objects from 2n, one could start by artificially dividing
the 2n into two groups of n, then selecting m from the first group and n − m from the
second; summing over all allotments of 0 ≤ m ≤ n from the first and n − m from the
second, we achieve all ways of selecting n from the original 2n.)
Hence, we have shown that
2 2
−2n 2n −2n 2n
p2n (0, 0) = 4 = 2 .
n n
Looking back at Example 4.5, we see that this is exactly the square of p2 n(0, 0) for the SRW
on Z. In particular, we already calculated that
−2n 2n 1
2 ∼√ .
n πn
Hence, for SRW on Z2 ,
∞ ∞ ∞
X X X 1
pn (0, 0) = p2n (0, 0) = = ∞.
n=0 n=0 n=0
πn
Example 5.6. SRW on Z3 . Arguing exactly as we did above to enumerate the trajectories
of length 2n, now noting that the adjacent-step transition probabilities are 16 , we have
X 2
−2n
X (2n)! −2n 2n −2n n!
p2n (0, 0) = 6 =2 3 .
i,j,k≥0
(i!j!k!)2 n i,j,k≥0
i!j!k!
i+j+k=n i+j+k=n
You might hope this works out to exactly p2n (0, 0)3 from SRW on Z, but it does not. How-
ever, it turns out to be asymptotically a constant times this, as we will now see. Here are
some facts that help in the calculation:
• The number of ways to put n balls into 3 boxes is, of course, 3n ; but by considering
all ways of subdividing them individually, we have
X n X n!
n
3 = = .
i,j,k≥0
i j k i,j,k≥0
i!j!k!
i+j+k=n i+j+k=n
For the moment, take n = 3m. Then the second fact above gives
n! n!
max anj = max 3−n ≤ 3−n .
j i,j,k≥0
i+j+k=n
i!j!k! (m!)3
Hence, we have
√
3−n 2πn(n/e)n
−2n 2n −n n! 1 1
p6m (0, 0) ≤ 2 ·3 ∼ √ · =
n (m!)3 πn (2πm)3/2 (m/e)3m 2π 3/2 m3/2
using Stirling’s formula again.
This only gives us an upper bound on p6m ; to get p2n for all n, we also need p6m−2 and
p6m−4 . But note: if there is a length 6m − 2 path from 0 to 0, then by traveling one step
forward and back along any of the 3 axes, we produce a path of length 6m; since each step
18
has transition probability 16 , we then have the lower bound p2m (0, 0) ≥ ( 16 )2 p6m−2 (0, 0).
Similarly, we have p2m (0, 0) ≥ ( 61 )4 p6m−4 (0, 0). This gives the estimate
∞
X ∞
X ∞
X
4
p2n (0, 0) = [p6m−4 (0, 0) + p6m−2 (0, 0) + p6m (0, 0)] ≤ 6 p6m (0, 0)
n=1 m=1 m=1
and thus ∞ ∞
X 64 X 1
p2n (0, 0) ≤ 3/2 < ∞.
n=1
2π m=1 m3/2
Ergo, SRW on Z3 is transient.
In particular, by Theorem 4.2, this shows that r03 = P0 (T03 < ∞), the probability of return
to 0 in Z3 , is < 1. Now, in Zd , in order for the process to return to 0, all components must
return to 0; hence we have {T0d−1 < ∞} ⊇ {T0d < ∞} for all d, and hence the probability
r0d is non-increasing with dimension. It therefore follows that SRW on Zd is transient for
all d ≥ 3.
Remark 5.7. It can be calculated, using generating-function techniques we do not discuss
here, that the return-probability in Z3 is r03 ≈ 0.34.
19
6.1. Stationarity. Suppose that π is a probability vector with the property that π = Pπ;
i.e. π is a right eigenvector of P with eigenvalue 1. Then, of course, π = πPn for all n;
in other words, if π is ever the distribution of the process at some time, it will remain
stationary in that distribution for all time. So we call such a π a stationary distribution. We
can be a little forgiving with the requirement that it be a distribution.
Definition 6.1. Let (Xn )n≥0 be a Markov chain with state space S transition matrix P. A vector
π = [π(i)P: i ∈ S] is called a stationary measure if π(i) ≥ 0 for all i ∈ S and πP = π. If, in
addtion, i∈S π(i) = 1, we call π a stationary distribution.
Note: at least in the case of a finite-state Markov chain, if there is a stationary measure π
with at least one non-zero entry,P then we−1 can normalize the vector to produce a stationary
distribution π̂: just take π̂ = ( i∈S π(i)) π.
Example 6.2. Suppose a 2-state Markov chain has transition matrix
1/2 1/2
P= .
1/4 3/4
If π = [x, y] is a stationary measure, then we have the equations
1/2 1/2
[x, y] = [x, y] = [ 21 x + 41 y, 12 x + 34 y].
1/4 3/4
These two equations actually amount to the same thing: 12 x = 14 y, so y = 2x. Hence
there is a one-parameter family of stationary measures; with the additional constraint
that x + y = 1, we see the unique stationary distribution is
π = [1/3, 2/3].
(If there exists a stationary measure, it is never unique: one can always scale an eigenvec-
tor. The question is whether 1 is an eigenvalue.)
20
for some 0 ≤ p, q ≤ 1. To compute all powers of this matrix, we diagonalize. The charac-
teristic polynomial is (λ − (1 − p))(λ − (1 − q)) − pq = λ2 + (p + q − 2)λ + 1 − p − q whose
roots are λ ∈ {1, 1 − p − q}.
First, suppose that {p, q} ∈ {0, 1}. That is, consider separately the four special cases:
1 0 0 1 1 0 0 1
P∈ , , , .
1 0 0 1 0 1 1 0
π
In the first two cases, one can easily check that the matrix is already in the form for
π
π the stationary distribution, as desired. The latter two are different, however. For the
identity matrix, every vector is a left-eigenvector with eigenvalue 1, and so although it
is true that πI n = π for all n (and so too in the limit), the stationary distirbution is not
unique, and the limit distribution depends on the initial distribution. Worse yet, for the
last matrix: the powers do not stablize, but cycle periodically between the matrix and
I. The eigenvalues are ±1, and the unique stationary distribution is [ 21 , 21 ], but the limit
limn→∞ P(Xn = j) does not exists, regardless of the initial distribution.
Now, suppose that at least one of p, q is strictly in (0, 1). Then the eigenvalues are
distinct and we can find two linearly independent eigenvectors. The result is that we can
write P in the form P = QDQ−1 where
1 −p −1 q/(p + q) p/(p + q) 1 0
Q= , Q = , D= .
1 q −1/(p + q) 1/(p + q) 0 1−p−q
Note: writing it this way means PQ = QD or Q−1 P = DQ−1 and so the columns of Q are
right-eigenvectors for P, and the rows of Q−1 are left-eigenvectors for P. The normalized
left-eigenvector for eigenvalue 1 is [q/(p+q), p/(p+q)], so this is the stationary distribution
π. Now, since 0 ≤ p, q ≤ 1 and one is strictly in (0, 1), the eigenvalue 1 − p − q has
|1 − p − q| < 1. Thus
n 1 0 1 0
D = → , as n → ∞
0 (1 − p − q)n 0 0
and so
n −1 q/(p + q) p/(p + q) 1
n 0 1 −p
P =Q D Q→
−1/(p + q) 1/(p + q) 0 0 1 q
q/(p + q) p/(p + q) π
= = , as n → ∞.
q/(p + q) p/(p + q) π
Hence, the chain converges to the stationary distribution regardless of initial distribution,
provided at least one of p, q is in (0, 1).
Remark 6.4. If both p, q are in (0, 1), then all entries of P are strictly positive. This turns
out to be enough in general to guarantee the existence of a unique stationary distribution
and universal convergence to it. Now, if (say) p ∈ (0, 1) but q = 1, then the matrix has the
form
1−p p 2 1 − p + p2 p − p2
P= , and so P =
1 0 1−p p
2
and so all entries of P are strictly positive, and so similar reasoning will allow us to say
something about stationary distributions and limit theorems for such chains.
22
Then the precise statement we are proving is that, for any initial state i ∈ S,
Ei [Y (j, n)]
π(j) = lim .
n→∞ n+1
To prove this, simply note that
n n n
Ei (1{Xm =j} ) =
X X X
Ei [Y (j, n)] = Pi (Xm = j) = [ei Pn ]j
m=0 m=0 m=0
On Homework 2, you also explore the connection between the stationary distribution
and the expected return times Ti for states i ∈ S. Here is a somewhat heuristic argument
for the relation between them. As usual, let Ti,k denote the time of the k th return to the
state i (so that, conditioned on X0 = i, Ti = Ti,2 ). Write this time as
Ti,k = T1 + T2 + · · · + Tk
where T` = Ti,` − Ti,`−1 for ` ≥ 1. The Markov property shows that the T` are independent
and identically distributed (with distribution T1 ∼ Ti ). Now, the law of large numbers
therefore gives
Ti,k 1
= (T1 + · · · + Tk ) ≈ E(Ti ).
k k
I.e. there are approximately k visits to state i in the first kE(Ti ) steps of the chain. But
Proposition 7.1 states that in n steps there are about nπ(i) such visits to the state i. Taking
n = kE(Ti ), we find kE(Ti )π(i) ≈ k for large k, and so we have loosely derived the fact
1
E(Ti ) = .
π(i)
You will give a rigorous proof of this fact on Homework 2.
24
7.1. Periodic Chains. Let P be the transition matrix for an irreducible chain. The period
of a state i ∈ S, denoted d(i), is the greatest common divisor of the set of times Ji for
which the chain can return from i to i:
d(i) = gcd Ji , Ji = {n ≥ 0 : pn (i, i) > 0}.
Lemma 7.2. If P is the transition matrix for an irreducible Markov chain, then d(i) = d(j) for
all states i, j.
Proof. For fixed state i, note that if n ∈ Ji and m ∈ Ji then n + m ∈ Ji (if there is a
positive probability trajectory of length n and another of length m, then the conjunction
of the two is a positive probability trajectory of length n + m). Hence, Ji is closed un-
der addition. Now, if J is any subset of N closed under addition and d = gcd J, then
J ⊆ {0, d, 2d, . . .} = dN (otherwise J would contain a number not divisible by d which
contradicts the definition of d).
Now, let j be another state. The chain is irreducible, so there are m, n ∈ N so that
pn (i, j) > 0 and pm (j, i) > 0; hence m + n ∈ Ji . Since m + n ∈ Ji there is k ∈ N so that
m + n = kd(i). Now, suppose that ` ∈ Jj . Since p` (j, j) > 0, we have
pm+n+` (i, i) ≥ pm (i, j)p` (j, j)pn (j, i) > 0
and so m + n + ` ∈ Ji ; hence m + n + ` is divisible by d(i). Thus, ` is divisible by d(i)
as well. Thus, d(i) is a common divisor of Jj , and it follows that d(j) divides d(i). The
symmetric argument shows that d(i) divides d(j), so d(i) = d(j).
It therefore makes sense to talk about the period of an irreducible Markov chain or the
period of P.
Example 7.3. Let G be a finite, connected graph. Then the simple random walk on G is
irreducible. Note, if i, j are two adjacent vertices, then p2 (i, i) ≥ p(i, j)p(j, i) > 0, and so
the period of i is ≤ 2; so simple random walks on graphs have period 1 or 2. It is not hard
to see that the period is 2 if and only if the graph is bipartite: G = (V1 t V2 , E) where there
are no edges between any two vertices in either V1 or V2 , only between the two non-empty
disjoint sets V1 and V2 . For example, any cyclic graph (like the square, in Example 6.3(3)
above) is bipartitie, partitioned by parity.
A Markov chain is called aperiodic of all of its states have period 1.
7.2. Limit Theorem for Irreducible Aperiodic Chains. The Perron-Frobenius approach
we discussed in the last lecture works precisely for irreducible aperiodic Markov chains.
Theorem 7.4. Let P be the transition matrix for a finite-state, irreducible, aperiodic Markov
chain. Then there exists a unique stationary distribution π: πP = π, and for any initial probability
distribution ν,
lim νPn = π.
n→∞
For the proof, we need a slightly stronger number theoretic claim about additively-
closed subsets of N. If J ⊆ N is closed under addition, then J ⊆ dN where d = gcd J as
shown above. What’s more, it is true that J contains all but finitely many multiples of d. In
other words, there is M ∈ N such that md ∈ J for all m ≥ M . (A proof of this fact is
outlined in Exercise 1.21 on page 41 of Lawler.)
25
7.3. Reducible Chains. Irreducible Markov chains are the easiest to handle, in large part
because of Proposition 4.8. Other Markov chains are called reducible, because they re-
duce (in some sense) to collections of irreducible chains. Following Definition 4.7 and the
proof of Proposition 4.8, they key feature of irreducible chains is the ability of all states to
eventually communicate with each other.
Definition 7.5. Let (Xn ) be a Markov chain with state space S. Say that two states i, j ∈ S can
communicate, written i ↔ j, if there exist n, m ∈ N so that pn (i, j) > 0 and pm (j, i) > 0.
Lemma 7.6. The relation ↔ on the state space S (cf. Definition 7.5) is an equivalence relation.
Proof. Since p0 (i, i) = 1 by definition, i ↔ i and so the relation is reflexive. Symmetry is
built into the definition. We need only verify transitivity. Suppose, then, that pm (i, j) > 0
and pn (j, k) > 0. Then
pm+n (i, k) = Pi (Xm+n = k)
≥ Pi (Xm+n = k, Xm = j)
= Pi (Xm = j)Pi (Xm+n = k|Xm = j)
= Pi (Xm = j)Pj (Xn = k)
= pm (i, j)pn (j, k) > 0
where the penultimate equality is the Markov property. A similar argument applied to
(k, j) proves transitivity.
The equivalence classes under ↔ partition the state space S into communication classes.
An irreducible Markov chain is one with exactly one communication class. Examining the
proof of Proposition 4.8 shows that:
Corollary 7.7. Let (Xn ) be any Markov chain. Then all states within any one of its communica-
tion classes have the same asymptotic behavior: either all recurrent or all transient.
Remark 7.8. Similarly, the proof of Lemma 7.2 shows that any two states in a given com-
munication class have the same period.
Based on Corollary 7.7, we refer to communication classes either as recurrent classes or
transient classes. If S is a finite state space, at least one class must be recurrent (since at
least one state must be visited infnitely often).
26
Let i, j be states in two distinct classes. If p(i, j) > 0 then it must be true that pn (j, i) = 0
for all n (or else they would be in the same class). It follows that Pj (Xn = i) = 0 for all n,
and so i is transient. Hence, if i, j are recurrent states in distinct communication classes,
we must have p(i, j) = 0 (and also p(j, i) = 0 by the symmetric argument).
Let us agree to order the states by communication class, with recurrent classes R1 , . . . , Rr
first. The above argument shows that the Markov chain “reduces” the subsets R` : once in
a state in R` , the chain stays in R` forever; we can consider this as its own Markov chain
with transition matrix P` . The transition matrix for the full chain has the form
P1
P2 0
P3 0
P= ...
0
P r
S Q
where the bottom rows [ S |Q] give the transition probabilities starting in the transient
classes. Now, this block diagonal form is preserved under matrix multiplication, so we
have n
P1
Pn2 0
n
P 0
n
3
P = ...
0
Pn
r
Sn Qn
for some matrix Sn . Note that the row sums of [ S |Q] are all 1 and so the same remains
true for [ Sn | Qn ] since Pn is a stochastic matrix. So Qn is substochastic for all n (its row-
sums are ≤ 1). This will be important later.
27
This last sum is the (i, j)-entry of the matrix I + P + P2 + · · · which, a priori, may not
converge at all. However, consider the block-diagonal form above for the matrix P. Let
us assume that both i and j are transient states, which means that the (i, j)-entry of Pn is the
(i, j)-entry of Qn , where Q is the substochastic matrix for the transitions of transient states
to transient states. By definition of transience, for each i, j there is N so that [Qn ]ij = 0 for
all n ≥ N . Hence, the sum
[I + P + P2 + · · · ]ij = [I + Q + Q2 + · · · ]ij
is actually a finitely-terminatining series; so, in particular, it converges. In terms of calcu-
lating it, however, we use the fact that it converges to use the geometric series to express
it as
I + Q + Q2 + · · · = (I − Q)−1 .
Example 8.1. Consider the random walk on {0, 1, 2, 3, 4} with absorbing boundaries (cf.
Example 1.7). The states 0, 4 are recurrent, so we should order the states {0, 4, 1, 2, 3}; then
the transition matrix becomes
1 0 0 0 0 1
0 1 0 0 0 0 2 0
1 1
P = 2 0 0 2 0 ,
Q = 21 0 12 .
1 1
0 0
2
0 2 0 12 0
0 12 0 12 0
So we can easily and quickly calculate
1 1
−1 3
2
0 2 2
1 12
(I − Q)−1 = 0 1
2
0 = 1 2 1 .
1
1
2
0 1
2 2
1 32
28
Thus, starting in state 1, the expected number of visits to state 1 is 32 , to state 2 is 1, and to
state 3 is 12 . Hence, the total expected number of steps before absorption at the boundary,
starting in state 1 is 23 + 1 + 21 = 3. The same expected number of steps holds starting in
state 3, while starting in state 2 the expected number of steps is 1 + 2 + 1 = 4.
In other words, A satisfies the equations A = PA, at least for entries of the form (ti , rj ). If
we write A in block form as such, we have
I 0
A=
Atr 0
where Atr are the entries calculated above. The matrix equation A = PA then yields the
matrix equation Atr = S + QAtr , or
Atr = (I − Q)−1 S.
Example 8.2. Again, consider the random walk on {0, 1, 2, 3, 4} with absorbing bound-
aries, cf. Example 7.3 above. Then we have
1 1 3
2
0 0 2 0 2
1 21
S = 0 0 , Q = 12 0 21 , (I − Q)−1 = 1 2 1 .
1 1
0 2 1
0 2 0 2
1 23
29
Hence 3
1 3
1 1
2
1 2 2
0 4 4
Atr = (I − Q)−1 S = 1 2 1 0 0 = 12 1
2 .
1
2
1 32 0 12 1
4
3
4
Thus, starting at state 1, the walk is eventually absorbed at 0 with probability 34 ; starting
at state 2 its an even split for where it ends up; starting at state 3 the walk is absorbed at
state 1 with probability 41 .
8.3. Transit Times. Techniques like those above can be used to determine expected tran-
sit times from one state to another, in an irreducible chain. Fix a state i, and let Ti =
min{n ≥ 1 : Xn = i} as usual. We saw (in Lecture 3) a technique for calculating Ei [Ti ], but
what about the more general question of Ej [Ti ] for different states i, j?
One approach is to define a new Markov chain from the original which converts the
coveted state i into an absorbing state. First, write the transition matrix for the original
chain with the stat i ordered first:
p(i, i) r
P= .
s Q
Then the new chain Xn0 has transition matrix
0 1 0
P = .
s Q
Now, for any state k 6= i let Yik denote the number of visits to k before reaching i. Then
" #
X X
k
Ej [Ti ] = Ej Yi = Ej [Yik ].
k6=i k6=i
Now, because we have changed the chain to make i an absorbing state, if Xn0 = i then it
will never reach a state k 6= i at any time ≥ n. Thus, we can calculate the expectation of
Yik simply as
"∞ # ∞ ∞ ∞
1{Xn0 =k} = Ej [1{Xn0 =k} ] =
X X X X
0
k
Ej [Yi ] = Ej Pj (Xn = k) = p0n (j, k).
n=0 n=0 n=0 n=0
As above, p0n (j, k) = [(P0 )n ]jk , and so long as j, k 6= i, this means that
∞
X
Ej [Yik ] = [Qn ]jk = [(I − Q)−1 ]jk .
n=0
Thus, for j 6= i,
X
Ej [Ti ] = [(I − Q)−1 ]jk = [(I − Q)−1 1]j .
k6=i
Example 8.3. Consider the random walk on {0, 1, 2, 3, 4} with reflecting boundary:
0 1 0 0 0
1 0 1 0 0
2 1 2 1
P= 0 2 01 2 01 .
0 0 0 2
2
0 0 0 1 0
Taking the state i = 0 as the target, we use the preceding technique to calculated the
expected transit time to 0. The submatrix Q and the associated matrix (I − Q)−1 are
1
0 2 0 0 2 2 2 1
1 0 1 0 −1
2 4 4 2
Q= 0
2
1
2
1
, (I − Q) = 2 4 6 3 .
2
0 2
0 0 1 0 2 4 6 4
Thus
2 2 2 1 1 7
2 4 4 2 1 12
(I − Q)−1 1 =
= .
2 4 6 3 1 15
2 4 6 4 1 16
So, starting at 1, it takes about 7 steps to get to 0; starting at 4 it takes about 16 steps.
31
For this kind of chain, with i ≥ 1, this says h(i) = pi h(i + 1) + qi h(i − 1). This is a 2-step
recurrence relation, but it has variable coefficients, so it is a little trickier to handle, but
not too much trickier. Let u(i) = h(i − 1) − h(i) for i ≥ 1. Then h(i) = pi h(i + 1) + qi h(i − 1)
can be rewritten as pi u(i + 1) = qi u(i). Assuming pi 6= 0 for any i, this gives us
qi qi qi−1 · · · q1
u(i + 1) = u(i) = u(1)
pi pi pi−1 · · · p1
by induction. Define
qi qi−1 · · · q1
ρi = .
pi pi−1 · · · p1
Then the telescoping sum gives us u(1) + u(2) + · · · + u(i) = h(0) − h(i), and so
h(0) − h(i) = u1 (1 + ρ1 + · · · + ρi−1 ). (9.1)
The value of h(0) will be determined by the boundary conditions; for example if p(0, 0) =
1 then h(0) = 1; if p(0, 1) = 1 then h(0) = 0. The first difference u(1) remains undeter-
mined. We consider two cases.
• Suppose ∞
P
i=0 ρi = ∞. Since h(i) is a probability, h(i) ∈ [0, 1]. Thus the sequence
h(0) − h(i) must remain bounded. Taking i → ∞ then shows that we must have
u(1) = 0 here, and so h(i)
P = h(0) for all i (boring case).
• If, on the other hand, ∞ i=0 ρi < ∞, then u(1) can be taken > 0 provided that
Example 9.1. In a population growth model we certainly must have p(0, 0) = 1, giving
boundary condition h(0)
P = 1. So long as pi > 0 for each i, this implies that ρi > 0, and
hencePwe either have ρi = ∞ in which case h(i) = 1 for all i – i.e. guaranteed extinction
– or ρi < ∞ in which case h(i) < 1 and h(i) → 1 as i → ∞ – i.e. there is positive
probability of survival, and this probability increased with larger starting population (as
expected).
9.2. Positive and Null Recurrence. In general, we can refine the definition of recurrence
further (although, as we’ll see, this really only makes sense for infinite-state Markov
chains).
Definition 9.2. Let i be a recurrent state for a Markov chain. Denote Ti = min{n ≥ 1 : Xn = 1}
as usual. If Ei [Ti ] < ∞, we call i positive recurrent. If, on the other hand, Ei [Ti ] = ∞, then we
call i null recurrent.
Remark 9.3. Note, it is certainly possible for a finite-valued random variable T to have
infinite expectation. For example, suppose that P(T = 2k ) = 2−k for each P kk ≥ 1. Then T
k
P∞ only the values {2, 4, 8, 16, . . .} and is never infinite, but E[T ] = k 2 · P(T = 2 ) =
takes
k=1 1 = ∞.
Proposition 9.4. In a finite-state, irreducible Markov chain, all states are positive recurrent.
Proof. Fix states i, j ∈ S. By the irreducible assumption, there is a finite time n(i, j) so that
pn(i,j) (i, j) > 0. Let N = max{n(i, j) : i, j ∈ S} and let q = min{pn(i,j) (i, j) : i ∈ S}. Thus,
regardless of the current state i, there is a positive probability ≥ q of reaching state j in
the next ≤ N steps; i.e. Pi (Tj ≤ N ) ≥ q, and hence Pi (Tj > N ) ≤ 1 − q. We now apply the
Markov property: for any states i, j ∈ S and any k ∈ N,
Pj (Tj > (k + 1)N |Tj > kN, XkN = i) = Pi (Tj > N ) ≤ 1 − q.
Of course, conditioned on Tj > kN the event Tj > (k + 1)N is independent of XkN by the
Markov property, and so in fact we have Pj (Tj > (k + 1)N |Tj > kN ) ≤ 1 − q. By induction
it follows that Pj (Tj > kN ) ≤ (1 − q)k for all k ∈ N. Thus:
∞
X ∞ (k+1)N
X X
Ej [Tj ] = Pj (Tj ≥ n) = Pj (Tj ≥ n).
n=1 k=0 n=kN +1
Thus, null recurrence can only occur if there are infinitely many states (at least in the
case of an irreducible chain). For infinite state Markov chains, even irreducible ones, it
can certainly happen that the states are null recurrent. Indeed, simple random walk on Z
gives an example.
Example 9.5. Let (Xn )n≥0 denote SRW on Z. Let us calculate E0 [T0 ], the expected return
time to 0 starting at 0 (of course this will be the same as Ei [Ti ] for any state i ∈ Z).
X
E0 [T0 ] = P0 (T0 ≥ n) = P0 (X1 6= 0, X2 6= 0, . . . , Xn 6= 0).
n≥1
Given the form of the transition probabilities, there are only finitely many (i1 , . . . , in ) for
which this product is non-zero: namely those that satisfy ik+1 = ik ± 1 for each k. So i1 =
±1, and so forth. Thus (i1 , . . . , in ) forms a lattice path. The restriction that i1 , . . . , in 6= 0
means that, after the first step i1 = ±1, the path must stay either strictly above 0 (if i1 = 1)
or strictly below 0 (if i1 = −1). By symmetry, we consider only the first case, so we have
i1 = 1, ik+1 = ik ± 1, and ik ≥ 1 for all k; call such paths Pn+ . Along any such path, all the
probabilities p(ik , ik+1 ) = 21 . Thus, we have
X 1 X 1
P(X1 6= 0, . . . , Xn 6= 0) = n
+ n
= 2−n+1 |Pn+ |.
+
2 +
2
(1,i2 ,...,in )∈Pn (−1,i2 ,...,in )∈−Pn
So we would like to count the number of paths in Pn+ . This turns out to be tricky. But
we can bound the size from below as follows. Let Pn+,m denote the subset of Pn+ of those
paths that end at height m (so 1 + i2 + · · · + in = m). Consider a pair π1 , π2 of paths in Pn+ .
By reversing π2 (i.e. if π2 = (1, i2 , . . . , in ) then π2−1 = (in , in−1 , . . . , i2 , 1)), the path π1 π2−1 is a
lattice path of length 2n that starts and ends at height 1, and never drops below height 1.
By decreasing all heights by 1, we see this path is a Dyck path of length 2n. What’s more,
any Dyck path of length 2n, whose middle-height is m − 1, can be identified with the pair
π1 , π2 as above.
So, if Dn,m−1 denotes the set of Dyck paths of length 2n whose middle height is m − 1,
then we have a bijection Pn+,m × Pn+,m ∼
= Dn,m−1 for all m ≥ 1. (These sets are nonempty
only if m ≤ n + 1.) So we have |Dn,m−1 | = |Pn+,m |2 , and so
n+1
X n+1
X
|Pn+ | = |Pn+,m | = |Dn,m−1 |1/2 .
m=1 m=1
It follows that !2
n+1
X n+1
X
|Pn+ |2 = |Dn,m−1 | 1/2
≥ |Dn,m−1 | = |Dn |
m=1 m=1
34
where Dn is the set of all Dyck paths. It is a well known result that Dn is counted by the
1 2n
Catalan number |Dn | = n+1 n
. Thus, we have
s s
1 2n 1 22n 2n 1
|Pn+ | ≥ ∼ √ ∼ √ 3/4 .
n+1 n n + 1 πn πn
Hence, we have
2n 1 2 1
P0 (X1 6= 0, . . . , Xn 6= 0) & 2−n+1 · √ 3/4 = √ 3/4 .
πn πn
In particular, this means that
X X 2 1
E0 [T0 ] = P0 (X1 6= 0, . . . , Xn 6= 0) & √ 3/4 = ∞.
n≥1 n≥1
πn
We showed in Example 4.5 that all states of the SRW on Z are recurrent. So the state 0 is
null recurrent.
9.3. Positive Recurrence and Stationary Distributions. Proposition 9.4 shows that finite-
state (irreducible) Markov chains have only positive recurrent states. As the next theorem
shows, this is the real reason that stationary distributions exists for such chains.
Theorem 9.6. Let (Xn )n≥0 be a Markov chain with state space S that is countable (but not nec-
essarily finite). Suppose there exists a positive recurrent state i ∈ S: i.e. Ei [Ti ] < ∞, where, as
usual, Ti = min{n ≥ 1 : Xn = i}. For each state j ∈ S, define
"T −1 #
i
1{Xn =j}
X
γ(i, j) = Ei
n=0
be the expected number of visits to j before returning to i. Then the function π : S → R+
γ(i, j)
π(j) =
Ei [Ti ]
is a stationary distribution for (Xn ){n≥0} .
Remark 9.7. Note: Theorem 9.6 gives a much simpler proof that no state can be positive
recurrent for the SRW on Z. Indeed, if such a state existed, then there would necessarily
exist a stationary distribution π for the SRW. From the form of the transition probabilities,
this would mean that
1 1
π(j) = π(j − 1) + π(j + 1).
2 2
for all j. Rearranging this (by writing π(j) = 2 π(j) + 21 π(j), subtracting, and multiplying
1
where the second inequality follows from the fact that γ(i, i) = 1 by definition. It follows
that P
j∈S γ(i, j)
X
π(j) = = 1.
j∈S
Ei [Ti ]
Since γ(i, j) and Ei [Ti ] are ≥ 0, the components of π are ≥ 0; soP π is a probability dis-
tribution. It remains to show that it is stationary: that π(j) = k∈S π(k)p(k, j) for all
j ∈ S. P By multiplying through both sides by Ei [Ti ], what we wish to show is that
γ(i, j) = k∈S γ(i, k)p(k, j) for all j ∈ S. To that end, we calculate: for j ∈ S,
"T −1 # "∞ # ∞
i
where the second equality follows from the fact that Ei [1X0 =j ] = 0. On the other hand,
when j = i γ(i, i) = 1 and the right-hand-side is
"T −1 # T −1 Ti
i i
1{Xm+1 =i} =
X X X
Ei P(Xm+1 = i) = P(Xn = i) = 1
m=0 m=0 n=1
1{Xn+1 =j} =
X X
γ(i, j) = Ei Pi (Xn+1 = j, Ti > n).
n=0 n=0
Corollary 10.1 gives a more succinct interpretation of the components of the stationary
distribution, should it exist; it also points out the necessity of positive recurrence for the
existence of a stationary distribution with positive entries. The next result makes this fact
clearer.
Proposition 10.2. Suppose (Xn )n≥0 is a time-homogeneous Markov chain with state space S,
and suppose the chain possesses a stationary distribution π.
(1) If (Xn )n≥0 is irreducible, then π(j) > 0 for all j ∈ S.
(2) In general, if π(j) > 0, then j is positive recurrent.
a stationary distribution, π = πP and hence by induction π = πPn for
Proof. (1) Since π isP
P = i∈S π(i)pn (i, j). There must exists some i with π(i) > 0 (since π(i) ≥ 0
all n. Thus, π(j)
for all i and i π(i) = 1 > 0). By irreducibility, for this choice of i, there is some n with
pn (i, j) > 0. Hence π(j) ≥ π(i)pn (i, j) > 0.
(2) Suppose that j is not positive recurrent. Start the chain in distribution π: so P(X0 =
i) = π(i) for all i. Denote by Vn = Vn (j) the number of visits (after time 0) to j up to time
n: n
1{Xm =j} .
X
Vn (j) =
m=1
Since π is a stationary distribution, we have
n n n
Eπ [1{Xm =j} ] =
X X X
Eπ [Vn (j)] = Pπ (Xm = j) = π(j) = nπ(j).
m=1 m=1 m=1
Now, let Tjk be the time of the kth visit to state j: Tjk = min{n ≥ 0 : Vn (j) = k}. We claim
that
!
Tjk
Pπ lim = ∞ = 1. (10.1)
k→∞ k
37
Assume, for the moment, that we have proved Equation 10.1. It follows that, for fixed
M > 0, the probability that Tjk /k ≤ M is small for large k. In particular, there must exist
N so that P(TjN /N ≤ M ) ≤ M1 . Note that
TjN /N ≤ M ⇐⇒ min{n : Vn (j) = N } ≤ N M.
Now, if VN M (j) ≥ N (there are at least N visits to j in the first N M time steps) then the
minimal time n by which there are N visits to j is ≤ N M . In other words
{VN M (j) ≥ N } ⊂ {TjN ≤ N M }
and so
1
Pπ (VN M (j) ≥ N ) ≤ Pπ (TjN ≤ N M ) ≤ .
M
But this implies that
NM
X
M N π(j) = Eπ [VM N ] = Pπ (VN M ≥ k)
k=1
since VN M ≤ N M . We may estimate this sum as follows:
NM
X N
X NM
X
Pπ (VN M ≥ k) = Pπ (VN M ≥ k) + Pπ (VN M ≥ k).
k=1 k=1 k=N +1
The first sum has N terms each ≤ 1, so is ≤ N . Since the events {VN M ≥ k} are decreasing
as k increases, the second sum may be bounded by
NM NM NM
X X X 1 NM − N
Pπ (VN M ≥ k) ≤ Pπ (VN M ≥ N ) ≤ = < N.
k=N +1 k=N +1 k=N +1
M M
2
Hence, we have N M π(j) < 2N , and so π(j) < M
for each M > 0, which contradicts the
assumption that π(j) > 0.
It remains to verify Equation 10.1. By assumption j is not positive recurrent. If j is
transient, then Tjk = ∞ for sufficiently large k, proving Equation 10.1 trivially in this case.
So, suppose j is null recurrent. Define
τjk = Tjk+1 − Tjk
be the duration of the kth excursion away from state j. Then we can write Tj2 = Tj1 +
τj1 , Tj3 = Tj2 + τj2 , and so forth. Clearly τjk is a stopping time. Hence, by the strong
Markov property, τj1 , τj2 , . . . are independent and identically-distributed. Moreover, the
distirbution is the same as that of Tj1 − Tj0 = Tj ; hence E(τjk ) = ∞ by the assumption of
null recurrence. Now
Tjk Tj1 + τj1 + · · · + τjk−1 Tj1 k − 1 τj1 + · · · + τjk−1
= = + .
k k k k k−1
The first term tends to 0 since Tj1 < ∞ (as the state j is recurrent). By the Strong Law
of Large Numbers, the second term tends to E(Tj ) = ∞ with probability 1. This proves
Equation 10.1.
38
Vn (j)
Pi lim = π(j) = 1
n→∞ n
and
n
1X
lim pm (i, j) = π(j).
n→∞ n
m=1
Remark 11.4. In general, the claim n1 nm=1 pm → p is weaker than the claim pn → p. The
P
exception is when pn ≥ 0 and p = 0; in this case, the average converging to 0 implies
that the sequence converges to 0 as the reader should check. Hence, the theorem has the
interesting corollary that, in a null recurrent irreducible chain, pn (i, j) → 0 for all i, j. This
is born out, for example, in SRW on Z1 and Z2 .
Proof. The proof is quite similar to the proof of Proposition 10.2(2). Since j is a recurrent
state, we know that Vn (j) → ∞ as n → ∞ with Pi -probability 1. It follows that VnV(j)+1
n (j)
→1
with Pi -probability 1. Now, since there were Vn (j) visits to j by time n, the Vn (j)th visit to
j occured at a time ≤ n. Using the times Tjk from the proof of Proposition 10.2, this means
V (j)
Tj n ≤n
with Pi -probability 1. On the other hand, the (Vn (j) + 1)st visit to j is ≥ n, meaning that
V (j)+1
Tj n ≥n
V (j) V (j)+1
with Pi -probability 1. Hence Tj n ≤ n ≤ Tj n . Dividing through by Vn (j), we have
V (j) V (j)+1
Tj n n Tj n Vn (j) + 1
≤ ≤
Vn (j) Vn (j) Vn (j) + 1 Vn (j)
40
In the proof of Proposition 10.2, we shows that Tjk /k → Ei [Tj ] with Pi -probability 1 (using
the Strong Law of Large Numbers). Hence, by the Squeeze Theorem, Vnn(j) → E[Tj ] as
n → ∞ with Pi -probability 1. Taking reciprocals proves the first equation.
For the second equality, note that since limn→∞ Vnn(j) = π(j) with Pi -probability 1, it
follows that Ei [ Vnn(j) ] → Ei [π(j)] = π(j) as n → ∞. The proof is then concluded by noting
that
n n n n
Vn (j) 1 X 1X 1X 1X
Ei = Ei [ 1{Xm =j} ] = Ei [1{Xm =j} ] = Pi (Xm = j) = pm (i, j).
n n m=1 n m=1 n m=1 n m=1
41
Hence, to prove the theorem, it suffices to prove that limn→∞ P(T > n) = 0.
Consider the Markov chain Zn = (Xni , Xnπ ) (whose state space is S × S).
Claim. For any state j ∈ S, P(Zn = (j, j) for some n) = 1.
Once the Claim is proved, it follows immediately that P(T < ∞) = 1, which implies
P(T > n) tends to 0 as n → ∞, as required to prove the theorem. To prove the Claim, we
refer to Exercise 1 on Homework 3, which states that if (Zn )∞n=0 is an irreducible, recurrent
Markov chain, then Pi (Zn = j for some n) = 1 for all states i, j. As such, it suffices to show
that Zn is irreducible and recurrent. In fact, we will show it is irreducible and positive
recurrent, by demonstrating (for the latter) that it possesses an invariant distribution.
The key observation is that, since Xni and Xnπ are independent by construction, it follows
that the transitions probabilities q ((i1 , i2 ), (j1 , j2 )) for the chain Zn are
q ((i1 , i2 ), (j1 , j2 )) = p(i1 , j1 )p(i2 , j2 ).
• Zn is irreducible. Let (i1 , i2 ) and (j1 , j2 ) be any two states in S × S. By assumption,
both Xn is irreducible, and so there are times m1 and m2 where pm1 (i1 , j1 ) > 0
and pm2 (i2 , j2 ) > 0. Since Xn is aperiodic, we know (cf. the remark preceeding the
proof of Theorem 7.4) we know pn (j1 , j1 ) > 0 and pn (j2 , j2 ) > 0 for all sufficiently
large n; in particular, for all sufficiently large k we have pm1 +k (j1 , j1 ) > 0 and
pm2 +k (j2 , j2 ) > 0. Thus
qm1 +m2 +k ((i1 , i2 ), (j1 , j2 )) = pm1 +m2 +k (i1 , j1 )pm1 +m2 +k (i2 , j2 )
≥ [pm1 (i1 , j1 )pm2 +k (j1 , j1 )][pm2 (i2 , j2 )pm1 +k (j2 , j2 )] > 0.
• Zn positive recurrent. By Corollary 11.1, it suffices to show there exists a stationary
distribution for Zn . Define θ(i, j) = π(i)π(j) for (i, j) ∈ S × S. Then
X X
θ(k, `) · q ((k, `), (i, j)) = π(k)π(`)p(k, i)p(`, j)
(k,`)∈S×S (k,`)∈S×S
" #" #
X X
= π(k)p(k, i) π(`)p(`, j)
k∈S `∈S
= π(i)π(j) = θ(i, j)
because π is stationary for Xn . Thus, θ is stationary for Zn .
43
where the penultimate equality is just the statement that p is stochastic. Thus, π is q-
stationary as well as p-stationary.
Now, for a given trajectory i0 , . . . , iN , we have
P(Y0 = i0 , Y1 = i1 , . . . , YN = iN ) = P(X0 = iN , X1 = iN −1 , . . . , XN = i0 )
= P(X0 = iN )p(iN , iN −1 ) · · · p(i2 , i1 )p(i1 , i0 )
= π(iN )p(iN , iN −1 ) · · · p(i2 , i1 )p(i1 , i0 )
by the Markov property and the assumption that Xn is in the stationary distribution. If
we rewrite Equation 13.1 in the form
π(j)p(j, i) = q(i, j)π(i)
then we can proceed by induction and write
π(iN )p(iN , iN −1 )p(iN −1 , iN −2 ) · · · p(i2 , i1 )p(i1 , i0 )
= q(iN −1 , iN )π(iN −1 )p(iN −1 , iN −2 ) · · · p(i2 , i1 )p(i1 , i0 )
= q(iN −1 , iN )q(iN −2 , iN −1 )π(iN −2 ) · · · p(i2 , i1 )p(i1 , i0 )
= q(iN −1 , iN )q(iN −2 , iN −1 ) · · · q(i1 , i2 )π(i1 )p(i1 , i0 )
= q(iN −1 , iN )q(iN −2 , iN −1 ) · · · q(i1 , i2 )q(i0 , i1 )π(i0 ).
Thus, we have
P(Y0 = i0 , Y1 = i1 , . . . , YN = iN ) = π(i0 )q(i0 , i1 )q(i1 , i2 ) · · · q(iN −1 , iN )
which is precisely to say that (Yn )0≤n≤N is a Markov chain.
Finally, since p is irreducible, given states i, j there is some finite n so that pn (i, j) > 0.
By the definition of matrix multiplication,
X
pn (i, j) = p(i, i1 )p(i1 , i2 ) · · · p(in−1 , j)
i1 ,...,in−1
and since all these terms are ≥ 0 and pn (i, j) > 0 it must be that there is at least one
trajectory i, i1 , . . . , in−1 , j such that p(i, i1 )p(i1 , i2 ) · · · p(in−1 , j) > 0. But then
q(j, in−1 )q(in−1 , in−2 ) · · · q(i1 , i) = [π(in−1 )/π(j)] p(in−1 , j) · [π(in−2 )/π(in−1 )] p(in−2 , in−1 ) · · · [π(i)/π(i1 )] p(i, i1 )
= [π(i)/π(j)] p(i, i1 )p(i1 , i2 ) · · · p(in−1 , j) > 0.
Thus, at least one term in the matrix multiplication sum for qn (j, i) is > 0, so qn (j, i) > 0.
The symmetric argument for qn (i, j) shows that q is irreducible, as claimed.
Remark 13.4. The final step in the above proof highlights a(n obvious) alternate definition
of irreducibility of a Markov chain, or more precisely of communication classes. Say j
can be reached from i if there is a finite trajectory i1 , i2 , . . . , in−1 so that p(i, i1 ), p(i1 , i2 ), . . . ,
p(in−1 , j) are all > 0. Then say i and j can communicate if j can be reached from i and i can
be reached from j. The above proof shows that this is equivalent to the usual definition
of communication classes.
13.2. Reversible Markov Chains. Theorem 13.2 shows that a time reversed Markov chain
(in stationary distribution) is still a Markov chain, but with a potentially different tran-
sition matrix. It is natural to ask when it happens that the reversed chain is actually the
same as the original chain.
Definition 13.5. Let (Xn )n≥0 be a (time-homogeneous) Markov chain, with countable state space
S and transition probabilities p(i, j) for i, j ∈ S. Call the chain reversible if there is a function
π : S → [0, ∞), not identically 0, which verifies the detailed balance condition:
π(i)p(i, j) = π(j)p(j, i), ∀ i, j ∈ S. (13.2)
The detailed balance condition is invariant under scaling – that is, if π(i)p(i, j) = π(j)p(j, i)
then [λπ(i)]p(i, j) = [λπ(j)]p(j, i) as well for any λ > 0. Thus, if the function π in Definition
13.5 is summable, then we can scale it to be a distribution. It is not hard to see that
Corollary 13.6. If (Xn )n≥0 is a reversible Markov chain that possesses a unique stationary distri-
bution (i.e. if it is irreducible and positive recurrent), then that stationary distribution witnesses
the local balance condition.
Proof. We need only show that π is stationary for p; this follows from the same argument
in the proof of Theorem 13.2:
Σ_{i∈S} π(i)p(i, j) = Σ_{i∈S} π(j)p(j, i) = π(j) Σ_{i∈S} p(j, i) = π(j) · 1 = π(j).
It is possible, however, for a π verifying the detailed balance condition to exist even if
a stationary distribution does not exist (cf. Example 13.1).
Example 13.7. Suppose the Markov chain is symmetric: the transition probabilities satisfy
p(i, j) = p(j, i). In this case, trivially the constant vector π = (c, c, . . . , c) witnesses the
local balance condition for any c. If there are a finite number of states, N , it follows that
π(i) = 1/N for each i is a stationary distribution, and the chain is reversible. If there are
infinitely many states, there may not be any distribution that witnesses the local balance
condition. For example, take SRW on Z, whose transition probabilities are symmetric:
p(i, j) = p(j, i) has value 1/2 if |i − j| = 1 and 0 otherwise. Here any constant π = (c, c, c, . . .)
witnesses the local balance condition, and it is easy to see that only constants do. (Indeed,
we know there is no stationary distribution.)
Perhaps the most important reason reversible processes are nice is that the detailed
balance condition provides a much easier way to find the stationary distribution than the
definition π = πP. Here is an important class of examples, cf. Definition 5.2.
Example 13.8. Let G be a countable graph with no isolated vertices. The SRW on G has
transition probabilities p(i, j) = 1/vi if i ∼ j, where vi is the valence of i (which is ≥ 1 since i is not
isolated), and p(i, j) = 0 if i ≁ j. Then we can check that π(i) = vi satisfies
π(i)p(i, j) = π(j)p(j, i) = 1 if i ∼ j, and = 0 if i ≁ j,
so the "valence vector" π satisfies the local balance condition and SRW on a graph is
reversible. In particular, this shows that, if the graph is finite, there exists a stationary
distribution: just normalize, π(i) = vi / Σ_j vj .
Example 13.9. Capitalizing on the previous example, consider the following problem. A
Knight starts at a corner of a standard 8 × 8 chessboard, and on each step moves at random. How
long, on average, does it take to return to its starting position?
If i is a corner, the question is what is Ei [Ti ], where the Markov process in question is the
one described in the problem: it is SRW on a graph whose vertices are the 64 squares of the
chessboard, and two squares are connected in the graph if a Knight can move from one
to the other. (Note: if the Knight can move from i to j, it can do so in exactly two ways: 1
up/down and 2 left/right, or 2 left/right and 1 up/down. So choosing uniformly among
moves or among positions amounts to the same thing.) This graph is connected (as a little
thought will show), and so the SRW is irreducible; therefore there is a unique stationary
distribution. By Example 13.8, the stationary distribution π is given by π(i) = vi / Σ_j vj ,
and so by the Ergodic Theorem, Ei [Ti ] = 1/π(i) = Σ_j vj / vi .
A Knight moves 2 units in one direction and 1 unit in the other direction. Starting at
a corner (1, 1), the Knight can only move to the positions (3, 2) and (2, 3), so vi = 2. To
solve the problem, we need to calculate vj for all starting positions j on the board. By
square symmetry, we need only calculate the numbers for the upper 4 × 4 grid. An easy
calculation then gives these valences as
2 3 4 4
3 4 6 6
4 6 8 8
4 6 8 8
The sum is 84, and so the sum over the full chess board is 4 · 84 = 336. Thus, the expected
number of steps for the Knight to return to the corner is 336/2 = 168.
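For readers who want to double-check the arithmetic, here is a short Python sketch (the function and variable names are ours) that recomputes the valences of the Knight's-move graph, their total over the board, and the expected return time from a corner.

    # Knight moves: 2 in one direction and 1 in the other.
    MOVES = [(1, 2), (2, 1), (-1, 2), (-2, 1), (1, -2), (2, -1), (-1, -2), (-2, -1)]

    def valence(r, c):
        """Number of legal Knight moves from square (r, c) on the 8 x 8 board."""
        return sum(1 for dr, dc in MOVES if 1 <= r + dr <= 8 and 1 <= c + dc <= 8)

    total = sum(valence(r, c) for r in range(1, 9) for c in range(1, 9))
    print(total)                  # 336 = sum of all valences
    print(total / valence(1, 1))  # 336 / 2 = 168, the expected return time from a corner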
It is actually easy to implement this chain computationally: at each step, we choose one
of the N 2 points in the grid uniformly at random, and change the value of the configura-
tion at that point if we can; if the change would violate the HCC condition, then we leave
the configuration unchanged for the next step. To be precise: if Xn = c, choose a point
(i, j) in the grid uniformly at random from all N 2 ; if c(i, j) = 1, then let Xn+1 = c′ where
c′ (i, j) = 0 and c′ (k, ℓ) = c(k, ℓ) for (k, ℓ) ≠ (i, j); if c(i, j) = 0 and c(i ± 1, j) = c(i, j ± 1) = 0
(that is, all lattice neighbors of (i, j) are unoccupied), then similarly set Xn+1 = c′ where
c′ (i, j) = 1 and c′ (k, ℓ) = c(k, ℓ) for (k, ℓ) ≠ (i, j); if, on the other hand, c(i, j) = 0 but a
neighboring point has c = 1, then set Xn+1 = Xn = c. It is very easy to see that this
algorithm produces a Markov chain with the above transition kernel p(c, c′ ).
Note that p(c, c′ ) = p(c′ , c), so the transition kernel is symmetric. It follows that the
uniform distribution on the state space, π, satisfies the detailed balance condition with
p(c, c′ ), and hence the uniform distribution π is the stationary distribution for (Xn )n≥0 . (It
is easy to see that (Xn )n≥0 is irreducible; for example, to get from any c to any c′ , one can
begin by eliminating the 1s in c one by one to get to the 0 configuration, then add new 1s
one by one to reach c′ .)
So, how do we select a uniformly random hard core configuration? Start at any config-
uration and run the Markov chain above. For large n, the distribution of Xn is approxi-
mately π, by the Convergence Theorem; so, for large n, Xn will be a (nearly) uniformly
random hard-core configuration.
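A minimal Python sketch of this chain is given below. It assumes the hard-core condition forbids two 1s at horizontally or vertically adjacent grid points (the convention used above); the function names and parameter values are ours.

    import random

    def hard_core_step(config, N):
        """One step of the chain: pick a uniform grid point and flip it if allowed."""
        i, j = random.randrange(N), random.randrange(N)
        if config[i][j] == 1:
            config[i][j] = 0      # removing a 1 never violates the hard-core condition
        else:
            nbrs = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
            if all(config[a][b] == 0 for a, b in nbrs if 0 <= a < N and 0 <= b < N):
                config[i][j] = 1  # add a 1 only if all neighbors are unoccupied
            # otherwise the move is rejected and the configuration is unchanged

    def sample_hard_core(N=8, n_steps=200_000):
        config = [[0] * N for _ in range(N)]   # start from the all-0 configuration
        for _ in range(n_steps):
            hard_core_step(config, N)
        return config   # approximately uniform over hard-core configurations for large n_steps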
Example 14.2. Consider the problem of graph colorings. Let G = (V, E) be a finite graph.
A q-coloring of G (with q ∈ N) is a function f : V → {1, 2, . . . , q} with the property that,
if u, v ∈ V with u ∼ v, then f (u) 6= f (v). The set of q-colorings is very hard to count (es-
pecially for small q), so again we cannot directly sample a uniformly random q-coloring.
Instead, define a Markov chain on the state space of all q-colorings f , where for f ≠ g
p(f, g) = 1/(q|V |) if f and g differ at exactly one vertex, and p(f, g) = 0 otherwise.
Again: since there are at most q different q-colorings of G that differ from f at a given vertex,
and there are |V | vertices, we have Σ_{g≠f} p(f, g) ≤ 1, and so setting p(f, f ) = 1 − Σ_{g≠f} p(f, g)
yields a stochastic matrix p which is the transition kernel of a Markov chain. It is evidently
symmetric, and by considerations like those in Example 14.1 the chain is irreducible and
aperiodic (so long as q is large enough for any q-colorings to exist!); hence, the stationary
distribution is uniform, and this Markov chain converges to the uniform distribution.
To simulate the corresponding Markov chain: given Xn = f , choose one of the |V |
vertices, v, uniformly at random and one of the q colors, k, uniformly at random. If the
new function g defined by g(v) = k and g(w) = f (w) for w 6= v is a q-coloring of the
graph, set Xn+1 = g; otherwise, keep Xn+1 = Xn = f . This process has the transition
probabilities listed above, and so running it for large n gives an approximately uniformly
random q-coloring.
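The simulation just described is only a few lines of Python; the sketch below uses colors {0, . . . , q − 1} and an adjacency-list representation of G (the small example graph at the end is made up for illustration).

    import random

    def coloring_step(coloring, adj, q):
        """One step of the chain: uniform vertex, uniform color, accept if still proper."""
        v = random.randrange(len(adj))
        k = random.randrange(q)
        if all(coloring[w] != k for w in adj[v]):
            coloring[v] = k   # the recoloring is a proper q-coloring: accept it
        # otherwise keep X_{n+1} = X_n

    # Example: approximately uniform 3-coloring of a 4-cycle.
    adj = [[1, 3], [0, 2], [1, 3], [0, 2]]
    coloring = [0, 1, 0, 1]          # any valid starting coloring
    for _ in range(10_000):
        coloring_step(coloring, adj, 3)
    print(coloring)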
In Examples 14.1 and 14.2, we constructed Markov chains with symmetric kernels,
therefore having the uniform distribution as stationary, and converging to it. This is often
a good way to simulate uniform samples. But how can we sample from a non-uniform
distribution?
14.1. The Metropolis-Hastings Algorithm. We can generalize Examples 14.1 and 14.2
to give a method for sampling any (strictly positive) distribution π, by building a 2-step
Markov chain: one to propose moves, a second to accept/reject them. Here is the algo-
rithm.
Let S be a finite set, and let π be a (strictly positive) distribution on S. First, construct
an irreducible Markov chain on S which has transition probabilities q(i, j) = q(j, i), sym-
metric. This chain is thus reversible, and has the uniform distribution as its stationary
state; but π is not necessarily uniform. Instead, we construct a new Markov chain with
transition kernel
p(i, j) = q(i, j) min{1, π(j)/π(i)} for i ≠ j, and p(i, i) = 1 − Σ_{j≠i} p(i, j).
Once we have constructed this chain, we run it for long time to sample from the distribu-
tion π: it is clear from the definition that p is irreducible (since q is), and
π(i)p(i, j) = q(i, j) min{π(i), π(j)} = q(j, i) min{π(j), π(i)} = π(j)p(j, i)
so π satisfies the detailed balance condition for p, and thus π is the stationary distribu-
tion for p. Hence, by the Convergence Theorem, the p-Markov chain converges to π in
distribution for large time.
It remains to be seen how to actually simulate the Markov chain with kernel p (given
that we know how to simulate the chain with kernel q). Here is the algorithm:
• Suppose we’ve reached Xn = i. For each j 6= i, propose the move Xn+1 = j
according to the original chain; i.e. propose the move i → j with probability q(i, j).
• Once the first move has been proposed (i.e. the proposed next state j has been
chosen), accept this move with probability min{1, π(j)/π(i)}; otherwise reject it and stay
in state i. In other words: if π(j) ≥ π(i), set Xn+1 = j; if π(j) = απ(i) with
0 < α < 1, then set Xn+1 = j with probability α and Xn+1 = i with probability
1 − α.
It is an easy exercise to show that the transition kernel p above is the result of this two-step
Markov chain algorithm. Hence, it may be used to simulate any distribution π (approxi-
mately, for large run time). Note: the indeterminacy of q is a benefit here. It means we are
free to choose the initial (symmetric) chain to fit the symmetries of the given state space.
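Here is a minimal Python sketch of the two-step chain described above, assuming we are handed a symmetric proposal sampler and can evaluate π up to a constant (all names and the toy target below are ours).

    import random

    def metropolis(pi, propose, x0, n_steps):
        """Run the chain: propose from the symmetric kernel q, then accept the
        move i -> j with probability min{1, pi(j)/pi(i)}."""
        x = x0
        for _ in range(n_steps):
            y = propose(x)
            if random.random() < min(1.0, pi(y) / pi(x)):
                x = y          # accept the proposed move
            # otherwise reject and stay at x
        return x

    # Toy example: sample from the distribution on {0,...,9} proportional to (k+1)^2,
    # with the symmetric proposal "step by +/-1, staying put if the step leaves {0,...,9}".
    pi = lambda k: (k + 1) ** 2
    def propose(k):
        j = k + random.choice([-1, 1])
        return k if j < 0 or j > 9 else j

    samples = [metropolis(pi, propose, 0, 500) for _ in range(2_000)]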
There are many variations on the Metropolis-Hastings algorithm, all with the same ba-
sic ingredients and goal: to use a (collection of correlated) Markov chain(s), designed to
have a given stationary distribution π, to sample from π approximately. They are col-
lectively known as Markov chain Monte Carlo simulation methods. One place they are
especially effective is in calculating high-dimensional integrals (i.e. expectations), a com-
mon problem in multivariate statistics.
14.2. A word on convergence rates. In order for the Metropolis-Hastings algorithm (or
other MCMC methods) to be effective, the Markov chain designed to converge to the
given distribution must converge reasonably fast. The term used to measure this rate
is the mixing time of the chain. (To make this precise, we need a precise measure of
closeness to the stationary distribution; then we can ask how long before the distance to
stationarity is less than a given ε > 0.) This is a topic of a great deal of current research, with
literally thousands of papers published. We will not even scratch the surface here. Let’s
just point out the connection to eigenvalues.
Suppose our Markov chain has N states, is irreducible and aperiodic (as Homework
3, Problem 3 points out, every finite-state Markov chain is arbitrarily close to one). Let’s
suppose further that the chain is symmetric; i.e. the transition matrix P = Pᵀ. Then by the
spectral theorem from linear algebra,
P = UDUᵀ
where U, the eigenvector matrix (whose columns are normalized eigenvectors), is an orthogonal
matrix, and D = diag(λ1 , λ2 , . . . , λN )
where λ1 ≥ λ2 ≥ · · · ≥ λN are the eigenvalues of P. We also know, from the Perron-
Frobenius theorem (Corollary 6.6) and the assumption of irreducibility and aperiodicity,
λ1 = 1 and for j ≥ 2 |λj | < 1. (I.e. the largest magnitude eigenvalue is 1 and all others are
strictly smaller.) So 1 = λ1 > λ2 ≥ · · · ≥ λN > −1. Hence
lim_{n→∞} P^n = lim_{n→∞} U D^n Uᵀ = U [lim_{n→∞} diag(1, λ2^n , . . . , λN^n )] Uᵀ = U diag(1, 0, . . . , 0) Uᵀ.
Moreover, we see that this convergence is exponentially-fast: setting P∞ equal to the
limit, we have
‖P^n − P^∞ ‖ = ‖ diag(1, λ2^n , . . . , λN^n ) − diag(1, 0, . . . , 0) ‖ = max_{2≤j≤N} |λj |^n = (max{λ2 , |λN |})^n .
So, measuring in terms of matrix norm at least, the mixing rate is controlled by the size
of 1 − max{λ2 , |λN |}; if this is big, then mixing is fast; if it is small, mixing is slow. For
example:
P = [ 1−ε  ε ; ε  1−ε ]:  λ1 = 1, λ2 = 1 − 2ε, slow mixing (for small ε > 0);
P = [ 1/2  1/2 ; 1/2  1/2 ]:  λ1 = 1, λ2 = 0, fast mixing.
Much of the technology for studying mixing times thus amounts to estimating the largest
magnitude eigenvalues of (huge) stochastic matrices.
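The following Python sketch (using NumPy) illustrates the point for the two 2 × 2 matrices above: it computes the eigenvalues, the gap 1 − max{λ2, |λN|}, and the distance from P^n to its limit.

    import numpy as np

    eps = 0.01
    P_slow = np.array([[1 - eps, eps], [eps, 1 - eps]])   # lambda_2 = 1 - 2*eps
    P_fast = np.array([[0.5, 0.5], [0.5, 0.5]])           # lambda_2 = 0

    for P in (P_slow, P_fast):
        lam = np.sort(np.linalg.eigvalsh(P))[::-1]        # eigenvalues, largest first
        gap = 1 - max(lam[1], abs(lam[-1]))               # 1 - max{lambda_2, |lambda_N|}
        P_inf = np.full_like(P, 1 / len(P))               # limit of P^n (uniform rows here)
        dist = np.linalg.norm(np.linalg.matrix_power(P, 50) - P_inf, 2)
        print(lam, gap, dist)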
To see why, consider the expected size of the population in generation n: Ei [Xn ]. We
compute it by conditioning on the population in the previous generation:
Ei [Xn ] = Σ_{k=0}^∞ Pi (Xn−1 = k) E[Xn |Xn−1 = k].
But, using the i.i.d. Yi variables above, we have
Ei [Xn |Xn−1 = k] = Ek [X1 ] = E[Y1 + · · · + Yk ] = kE[Y1 ] = kµ.
Thus
Ei [Xn ] = Σ_{k=0}^∞ Pi (Xn−1 = k) kµ = µ Σ_{k=0}^∞ k Pi (Xn−1 = k) = µ Ei [Xn−1 ].
By induction, it follows that
Ei [Xn ] = µ^n Ei [X0 ] = iµ^n .
Now, from the following simple estimate
iµ^n = Ei [Xn ] = Σ_{k=1}^∞ Pi (Xn ≥ k) ≥ Pi (Xn ≥ 1)
we see that if µ < 1 then limn→∞ Pi (Xn ≥ 1) = 0. It follows that in this case q(i) = Pi (Xn =
0 eventually) = 1, for any i. This motivates (part of) the following definition.
15.2. Probability generating function. Let X be a N-valued random variable. Define its
probability generating function ϕX as
ϕX (s) = E[s^X ] = Σ_{k=0}^∞ s^k P(X = k).
This is a variant of the characteristic function (a.k.a. Fourier transform) χX (t) = E[e^{itX} ],
which is more commonly used for continuous distributions. Let us point out a few facts
about ϕX .
• From the power-series, since P(X = k) ≤ 1 for all k, it follows that ϕX is analytic
on (−1, 1). We have ϕX (1) = 1, and ϕX^{(n)} (0) = n! P(X = n).
• For |s| < 1, we have ϕ′X (s) = Σ_{k=1}^∞ k s^{k−1} P(X = k). If E[X] < ∞, it follows that this
series converges at s = 1, and ϕ′X (1) = E[X].
• Because P(X = k) ≥ 0, the function s ↦ ϕ′X (s) is nondecreasing; if P(X ≥ 2) > 0,
it is strictly increasing.
Theorem 15.2. Let (Xn )n≥0 be a branching process with offspring distribution p0 , p1 , p2 , . . . Let ϕ
be the probability generating function of this distribution, ϕ(s) = Σ_{k=0}^∞ pk s^k . Then the extinction
probability q is given by
q = min{s ∈ [0, 1] : ϕ(s) = s}.
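Numerically, the smallest fixed point in Theorem 15.2 can be found by iterating ϕ starting from s = 0 (the iterates increase to the minimal fixed point). A short Python sketch, with a made-up offspring distribution:

    def extinction_probability(p, tol=1e-12, max_iter=100_000):
        """Iterate s <- phi(s) from s = 0; p = [p0, p1, p2, ...] is the offspring pmf."""
        phi = lambda s: sum(pk * s**k for k, pk in enumerate(p))
        s = 0.0
        for _ in range(max_iter):
            s_new = phi(s)
            if abs(s_new - s) < tol:
                break
            s = s_new
        return s

    # Offspring distribution p0 = 0.25, p1 = 0.25, p2 = 0.5 (mean 1.25 > 1):
    print(extinction_probability([0.25, 0.25, 0.5]))   # 0.5, the smaller root of phi(s) = s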
Proof. By definition
P(Zn+1 = (xn+1 , y′ )|X0 = x0 , X1 = x1 , . . . , Xn = xn , Yn = y)
= P(X0 = x0 , . . . , Xn = xn , Xn+1 = xn+1 , Yn = y, Yn+1 = y′ ) / P(X0 = x0 , . . . , Xn = xn , Yn = y).
The numerator can be expanded by summing over the probabilities of all possible values
taken by Y0 , . . . , Yn−1 :
P(X0 = x0 , . . . , Xn = xn , Xn+1 = xn+1 , Yn = y, Yn+1 = y′ )
= Σ_{y0 ,...,yn−1} P(Z0 = (x0 , y0 ), . . . , Zn−1 = (xn−1 , yn−1 ), Zn = (xn , y), Zn+1 = (xn+1 , y′ ))
= Σ_{y0 ,...,yn−1} P(Zn+1 = (xn+1 , y′ )|Z0 = (x0 , y0 ), . . . , Zn = (xn , y)) P(Z0 = (x0 , y0 ), . . . , Zn = (xn , y)).
By the Markov property of (Zn ), the conditional probability in each summand is just
P(Zn+1 = (xn+1 , y′ )|Zn = (xn , y)), which does not depend on y0 , . . . , yn−1 ; pulling it out,
the remaining sum is just the expansion of the denominator over all possible values of Y0 , . . . , Yn−1 ,
so the ratio of that sum to the denominator is 1, proving the lemma.
We therefore have
αn+1 (y′ ) = Σ_{y∈S} αn (y) P(Zn+1 = (xn+1 , y′ )|Zn = (xn , y))
= Σ_{y∈S} αn (y) p(y, y′ ) e_{y′} (xn+1 )
by Equation 16.1. We can now factor out the term e_{y′} (xn+1 ), which is independent of the
summation variable y, to achieve
αn+1 (y′ ) = e_{y′} (xn+1 ) Σ_{y∈S} αn (y) p(y, y′ ). (16.5)
Let us now analyze the Forward Algorithm for complexity. Let 0 < n < N , and suppose
we have calculated all the quantities {αn (y) : y ∈ S}. For each y 0 ∈ S, Equation 16.5
allows us to compute αn+1 (y 0 ) with |S| multiplications, then the sum of those |S| terms,
and finally the product with the one remaining term, for a total of 2|S| + 1 operations. We
must compute all |S| terms {αn+1 (y 0 ) : y 0 ∈ S}, which means repeating this |S| times for a
total of |S|(2|S| + 1) operations.
At time 0, we initialize with the known distribution α0 (y) = P(X0 = x0 , Y0 = y) =
P(Y0 = y)ey (x0 ), requiring one computation for each state, so |S| computations in total.
Then we need N ·|S|(2|S|+1) operations to compute all the terms {αN (y) : y ∈ S}. Finally,
to compute P(x) we must sum up these |S| terms, for a total of 2|S| + N |S|(2|S| + 1) =
O(N |S|2 ) computations.
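For concreteness, here is a short NumPy sketch of the Forward Algorithm as analyzed above; the array names and the toy two-state example are ours.

    import numpy as np

    def forward_probability(init, trans, emit, obs):
        """Compute P(x) for an HMM: init[y] = P(Y_0 = y), trans[y, y2] = p(y, y2),
        emit[y, x] = e_y(x), obs = [x_0, ..., x_N]."""
        alpha = init * emit[:, obs[0]]            # alpha_0(y) = P(Y_0 = y) e_y(x_0)
        for x in obs[1:]:
            alpha = emit[:, x] * (alpha @ trans)  # Equation 16.5
        return alpha.sum()                        # P(x) = sum_y alpha_N(y)

    # Toy "crooked casino" style example (all numbers invented for illustration):
    init = np.array([0.5, 0.5])
    trans = np.array([[0.95, 0.05], [0.10, 0.90]])
    emit = np.array([[1/6] * 6, [0.1] * 5 + [0.5]])
    print(forward_probability(init, trans, emit, [5, 2, 5, 5, 0, 5]))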
Example 16.7. To reiterate, if (Xn , Yn )n≥0 is a HMM and x = (x0 , . . . , xN ) is a fixed obser-
vation sequence, computing P(x) by
• the straightforward method Σ_{y∈S^N} P(x, y) requires 2N |S|^N operations.
If S is the state space of the hidden chain (Yn )n≥0 , this algorithm is O(N |S|2 ) and so quickly
implemented on a computer. Homework 5 Problem 1 asks you to do this computation for
a Crooked Casino style HMM.
17.1. Most Likely Trajectory. HMMs are used extensively in signal processing applica-
tions: notably speech recognition and error correcting codes for wireless communica-
tions. The observed process Xn is interpreted as a signal with random noise, and the
goal is to decode the true source Yn . This is, of course, impossible, even if all the emission
probabilities for Xn given Yn and all the transition probabilities for Yn are known exactly
(which is usually not true anyhow). The question is, then, what is our best guess about
the values of y0 , y1 , . . . , yN given an observed sequence x0 , x1 , . . . , xN ?
Referring to the previous lecture and the forward algorithm, we have an efficient way
to calculate P(y|x). The usual answer for the best guess y for the true value of y is then
y∗ = argmax_y P(y|x).
The operation argmax is Engineering Speak: the meaning is that y∗ is chosen so that
P(y∗ |x) = max_y P(y|x).
Such a y∗ surely exists (since we are dealing with finite-state processes here). A more
subtle theoretical question is whether the maximizer is unique (typically the answer is no),
so we weaken our goal to finding a maximizer. The intuition is simple: we observe a
given x. We check all possible strings y, and the one that has the highest probability of
generating x is (probably close to) the true string.
We now come to the question of computational feasibility. We have an algorithm
(the Forward Algorithm) to compute P(y|x) for fixed x and y, and it has complexity
O(N |S|3 ) where S is the state space of the hidden chain. The most obvious way to find
argmaxy P(y|x) is to compute all of the probabilities {P(y|x) : y ∈ S N }, and then select
y∗ as (one of) the maximizer(s). Of course, this procedure will exponentially complex-
ify the calculation, yielding O(N |S|N +3 ). This will never be computationally possible for
moderately large N .
Enter Viterbi. In 1967, he was an EE professor here at UCLA, when he invented the
algorithm (below) for recursively effectively computing the most likely trajectory in a
HMM. He quickly realized the enormous practical applications, and formed a company
(called Linkabit) to market it (initially only to the military and NASA). In 1973, he re-
signed from UCLA and moved to San Diego (where Linkabit was based) to run the com-
pany full-time. In 1980, Linkabit was acquired by a larger company called Microwave
Associates Communications. A change in management caused Viterbi (and his original
Linkabit team) to resign en masse in 1985, at which time they formed a new startup com-
pany that they called QUALCOMM (which is, of course, now an industry giant in com-
munications technology). While at QUALCOMM, Viterbi more or less single-handedly
invented the CDMA algorithm used by all modern cellular providers; the Viterbi algo-
rithm is used at the heart of it (for error-correcting codes). Viterbi retired from QUAL-
COMM in 2000 (at age 65), and became an EE professor here at UCSD, where he is still
on the faculty.
17.2. The Viterbi Algorithm. Fix an observation sequence x = (x0 , x1 , . . . , xN ). First note
that P(y|x) = P(x, y)/P(x), so
max_y P(y|x) = (1/P(x)) max_y P(x, y).
It therefore suffices to maximize P(x, y) over y; a maximizer y∗ for this joint probability
is also a maximizer for the conditional probability.
We build up recursively the maximum conditional probability of a state sequence given
the observed sequence. For 0 < n ≤ N , define, for any state y,
Vn (y) = max_{y0 ,...,yn−1 ∈S} P(X0 = x0 , . . . , Xn = xn , Y0 = y0 , . . . , Yn−1 = yn−1 , Yn = y).
Then max_{y∈S} VN (y) = max_y P(x, y), whose argmax we wish to compute. On its face, it
seems that computing Vn (y) requires a brute force approach with |S|n−1 computations of
probabilities; however, note that the Hidden Markov Model provides a simple product
structure for the probabilities in question. From Equation 16.4, we have for any states
y0 , . . . , yn−1 , y, y 0 that
P(X0 = x0 , . . . , Xn+1 = xn+1 , Y0 = y0 , . . . , Yn = y, Yn+1 = y′ )
= P(Y0 = y0 )e_{y0} (x0 )p(y0 , y1 )e_{y1} (x1 ) · · · p(yn−1 , y)e_y (xn )p(y, y′ )e_{y′} (xn+1 )
and so
Vn+1 (y′ ) = max_{y0 ,...,yn−1 ,y} P(Y0 = y0 )e_{y0} (x0 )p(y0 , y1 ) · · · p(yn−1 , y)e_y (xn )p(y, y′ )e_{y′} (xn+1 )
= [ max_{y0 ,...,yn−1 ,y} P(Y0 = y0 )e_{y0} (x0 )p(y0 , y1 ) · · · p(yn−1 , y)e_y (xn )p(y, y′ ) ] e_{y′} (xn+1 ).
Recursion: For y, y′ ∈ S and 0 < n < N , set Wn+1 (y; y′ ) = Vn (y)p(y, y′ )e_{y′} (xn+1 ).
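The recursion above (together with back-pointers recording the maximizing y at each step) is all that is needed to compute y∗ in O(N|S|²) time. Here is a NumPy sketch, under the same notation as the Forward Algorithm code earlier (with the caveat that the remainder of the derivation is not reproduced here):

    import numpy as np

    def viterbi(init, trans, emit, obs):
        """Compute (one) most likely hidden trajectory y* and max_y P(x, y)."""
        N, S = len(obs), len(init)
        V = np.zeros((N, S))
        back = np.zeros((N, S), dtype=int)
        V[0] = init * emit[:, obs[0]]                     # V_0(y) = P(Y_0 = y) e_y(x_0)
        for n in range(1, N):
            # scores[y, y2] = V_{n-1}(y) p(y, y2) e_{y2}(x_n)
            scores = V[n - 1][:, None] * trans * emit[:, obs[n]][None, :]
            back[n] = scores.argmax(axis=0)               # maximizing previous state
            V[n] = scores.max(axis=0)
        path = [int(V[-1].argmax())]
        for n in range(N - 1, 0, -1):                     # trace the back-pointers
            path.append(int(back[n][path[-1]]))
        return list(reversed(path)), V[-1].max()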
For discrete time chains, it was really enough to understand the one-step transition
matrix P(X1 = j|X0 = i), since there were no times in between 0 and 1. For continuous-
time chains, however, we really need all the information in the transition kernel
pt (i, j) = P(Xt = j|X0 = i), t > 0.
This is a lot more information to process, so we will need a different approach to analyze
such chains.
Proof. Let (Xt )t≥0 be a time-homogeneous Markov chain, with jump time J1 from the
initial state. Conditioned on X0 = i, the event {J1 > s} is equal to the event {Xu = i, ∀u ≤
s}. Let's consider a discrete version of this: fix a sequence of times
0 ≤ u0 < u1 < · · · < un = s < un+1 < · · · < un+m = s + t.
Then Pi (J1 > s + t|J1 > s) is approximated by
P(Xu0 = Xu1 = · · · = Xun+m = i|Xu0 = Xu1 = · · · = Xun = i).
Abbreviate the event {Xu0 = Xu1 = · · · = Xun = i} as An (i); then we have the usual
decomposition
P(Xu0 = Xu1 = · · · = Xun+m = i|An (i))
= P(Xun+m = · · · = Xun+1 = i|An (i))
= P(Xun+1 = i|An (i))P(Xun+2 = i|Xun , An (i)) · · · P(Xun+m = i|Xun+m−1 = · · · = Xun+1 = i, An (i)).
By the Markov property, this is simply the product
P(Xun+1 = i|Xun = i)P(Xun+2 = i|Xun+1 = i) · · · P(Xun+m = i|Xun+m−1 = i).
Now let's choose the partition points to evenly subdivide the interval [s, s + t] into m parts
– that is, un+j+1 = un+j + t/m. Then by time-homogeneity, each of the above terms in the
product is equal to
P(Xt/m = i|X0 = i).
By letting the partition width tend to 0, we recover the desired probability, so
Pi (J1 > s + t|J1 > s) = lim_{m→∞} [P(Xt/m = i|X0 = i)]^m . (18.1)
As for Pi (J1 > t), we can approximate this probability, similarly to the above, by choosing a
partition 0 = v0 < v1 < · · · < vm = t and computing
Pi (Xv1 = · · · = Xvm = i)
= P(Xv1 = i|X0 = i)P(Xv2 = i|Xv1 = i) · · · P(Xvm = i|Xv0 = · · · = Xvm−1 = i)
= Pi (Xv1 −v0 = i) Pi (Xv2 −v1 = i) · · · Pi (Xvm −vm−1 = i)
where we have used the Markov property and time-homogeneity in the last equality. If
we again choose the even partition with vj+1 − vj = t/m, and take limits, this gives
Pi (J1 > t) = lim_{m→∞} [Pi (Xt/m = i)]^m . (18.2)
Equations 18.1 and 18.2 prove that Pi (J1 > s + t|J1 > s) = Pi (J1 > t) for all states i. This
proves the proposition.
18.2. The Exponential Distribution. The formula in Equation 18.2 for the (reverse) cu-
mulative probability distribution function of J1 has the feel of an exponential to it. This
turns out to indeed be true. The result of Proposition 18.1 is a “memoryless” property: it
says that the distribution of jump times after already waiting some fixed length of time
is the same as it was at the beginning. In fact, there is only a one-parameter family of
probability distributions with this property: the exponential distributions.
Proposition 18.2. If T is a random variable taking values in (0, ∞), and if T has the memoryless
property P(T > s + t|T > s) = P(T > t) for all s, t > 0, then T is an exponential random
variable with some intensity q > 0:
P(T > t) = e−qt , t ≥ 0.
Notation: we write T ∼ Exp(q).
Proof. Let GT (t) = P(T > t). Then for s, t > 0
GT (s + t) = P(T > s + t) = P(T > s)P(T > s + t|T > s) = P(T > s)P(T > t) = GT (s)GT (t).
It follows from this functional equation that GT (t) = e−qt for some q > 0: by induc-
tion it follows that GT (nt) = GT (t)n for all natural numbers n and t > 0. In particular,
GT (1/n)n = GT (1) ∈ (0, 1) and so there is q > 0 with GT (1) = e−q . Thus, for m, n ∈ N,
GT (m/n)n = GT (1)m = e−qm so that GT (m/n) = e−qm/n . So for rational t we have proved
the result. The function GT is decreasing, and so since it is equal to the function e−q(·) on
the dense set of positive rationals, it is equal to it everywhere.
Remark 18.4. In fact, one can go further. Let M be the {1, 2, . . . , n}-valued random variable
defined by M = j iff min{T1 , . . . , Tn } = Tj . We just computed that the distribution of M
is P(M = j) = qj /(q1 + · · · + qn ). Similar calculations show that M is independent of min{T1 , . . . , Tn }.
This is quite remarkable. For example, suppose we have two processes, one Exp(1) and
the other Exp(100). If we sample one of the two at random without knowing which one,
even if we get value 0.9, we do not have any information about which one we actually
sampled. This is a very strong form of the memoryless property.
18.3. Transition Rates. Proposition 18.2 shows that the first jump time J1 of a Markov
chain has an exponential distribution with some intensity q > 0, which may differ de-
pending on the initial state. That is: for each i ∈ S, there is a q(i) > 0 so that the condi-
tional distribution of J1 , conditioned on X0 = i, is Exp(q(i)). This gives a convenient way
to think about any continuous-time Markov chain: at each state, there is an “exponential
clock”, which will ring at a random time whose distribution is exponential with some rate
parameter q(i). Once arriving at state i, the clock is started; it rings at time J1 , and the
process moves to another state j, chosen with probabilities
p(i, j) = P(XJ1 = j|X0 = i).
Note that, by definition of the jump-time J1 , XJ1 6= X0 , so p(i, i) = 0.
There is a slightly different interpretation which is somewhat more convenient for cal-
culations. Define
q(i, j) = q(i)p(i, j) = q(i)P(XJ1 = j|X0 = i). (18.3)
The numbers q(i, j) for i, j ∈ S are called the transition rates of the Markov chain. They
satisfy
q(i, j) ≥ 0, q(i, i) = 0, i, j ∈ S
Σ_{j∈S} q(i, j) = q(i) Σ_{j∈S} p(i, j) = q(i) < ∞.
The transition rates fully determine the Markov chain, so we can specify the chain simply
by specifying the matrix Q = [q(i, j)]i,j∈S . Their interpretation in terms of the process is
thus: when we arrive at state i, all other states start exponential clocks – state j's clock has
rate q(i, j). When a clock goes off, the chain moves to the state whose clock rang first. All
of the clocks are independent, so the time that the first clock will go off is the minimum
of all the independent exponential clocks; by Proposition 18.3(c), this is exponential with
rate Σ_j q(i, j) = q(i), as required. The probability that the clock at j goes off first is, again
by Proposition 18.3(c), q(i, j)/Σ_k q(i, k) = q(i, j)/q(i) = p(i, j).
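This clock-race description translates directly into a simulation algorithm. Below is a minimal Python sketch (the data format, a dict of dicts of rates, is our own choice): the sojourn time at i is drawn as Exp(q(i)), and the next state is chosen with probability q(i, j)/q(i), exactly as described above.

    import random

    def simulate_ctmc(Q, i0, t_max):
        """Q[i][j] = q(i, j); returns the trajectory [(jump time, state), ...] up to t_max."""
        t, i = 0.0, i0
        path = [(t, i)]
        while True:
            rates = Q.get(i, {})
            q_i = sum(rates.values())            # q(i) = sum_j q(i, j)
            if q_i == 0:
                break                            # absorbing state
            t += random.expovariate(q_i)         # minimum of the clocks ~ Exp(q(i))
            if t >= t_max:
                break
            r, acc = random.uniform(0, q_i), 0.0
            for j, rate in rates.items():        # the clock at j wins with prob. q(i, j)/q(i)
                acc += rate
                if r <= acc:
                    i = j
                    break
            path.append((t, i))
        return path

    # Two-state example with q(0, 1) = 2 and q(1, 0) = 1 (rates chosen for illustration):
    print(simulate_ctmc({0: {1: 2.0}, 1: {0: 1.0}}, 0, 5.0))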
18.4. The Poisson Process. The simplest non-trivial continuous time Markov chain is a
Poisson Process, with rate λ > 0. This is a Markov chain with state space S = {0, 1, 2, . . .}
and transition rates q(i, i + 1) = λ for all i ≥ 0, and q(i, j) = 0 when j 6= i + 1. So the
process is non-decreasing; the moment that it arrives at state i, it begins an Exp(λ) clock,
and jumps to state i + 1 when it rings. If (Xt )t≥0 is a Poisson process, we can think of the
random variable Xt as counting the number of events that have occurred up to time t. It
is a very good model for radioactive decay, telephone calls coming into a call-center, the
number of requests for a particular page coming into a web server, etc.
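Since the sojourn times of a Poisson process are i.i.d. Exp(λ), it is equally easy to simulate. The sketch below counts the jumps by time t and checks empirically that E[Xt] ≈ λt (the parameter values are arbitrary).

    import random

    def poisson_count(lam, t):
        """X_t for a rate-lam Poisson process: number of Exp(lam) sojourns fitting in [0, t]."""
        count, time = 0, 0.0
        while True:
            time += random.expovariate(lam)
            if time > t:
                return count
            count += 1

    lam, t, trials = 3.0, 2.0, 10_000
    print(sum(poisson_count(lam, t) for _ in range(trials)) / trials)   # approximately lam * t = 6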
Multiplying through by q(i), and using the fact that Σ_{j∈S} q(i, j)/q(i) = 1, this gives us the
equations
q(i)g(i) = 1 + Σ_{j∈S} q(i, j)g(j). (19.2)
Equations 19.1 and 19.2 can be used to iteratively calculate P(τA < τB ) and E[τA ] just as in
the discrete-time setting.
19.2. Embedded Jump Chain. Given a continuous-time Markov chain, we have already
defined the first jump-time J1 = min{t ≥ 0 : Xt 6= X0 }. This is part of a whole sequence
of jump times. We can inductively define
J0 = 0, Jn+1 = min{t ≥ Jn : Xt 6= XJn }.
By the strong Markov property, for any states i0 , i1 , . . . , in ∈ S,
P(XJn = in |XJn−1 = in−1 , . . . , XJ1 = i1 , X0 = i0 )
= P(XJn −J1 = in |XJn−1 −J1 = in−1 , . . . , XJ2 −J1 = i2 , X0 = i1 )
= P(XJn −J2 = in |XJn−1 −J2 = in−1 , . . . , XJ3 −J2 = i3 , X0 = i2 )
..
.
= P(XJn −Jn−1 = in |X0 = in−1 )
= P(XJn = in |XJn−1 = in−1 )
where the final step is from time-homogeneity, and the intermediate kth step follows by
restarting the process at time Jk (note that (Jn − Jk−1 ) − (Jk − Jk−1 ) = Jn − Jk , so the
differences telescope). This shows that the discrete-time stochastic process
Yn = XJn ,
called the embedded jump chain of (Xt )t≥0 , is a Markov chain. Moreover, its transition
probabilities are P(Y1 = j|Y0 = i) = P(XJ1 = j|X0 = i) = p(i, j) = q(i, j)/q(i).
Now, let us condition on a particular trajectory for the jump chain: let A = {Y0 =
i0 , . . . , Yn = in }. Let us consider the so-called sojourn times Sk = Jk − Jk−1 for k ≥ 1.
Then we have for any 0 < s1 < · · · < sn < ∞
P(S1 > s1 , . . . , Sn > sn |A)
= P(J1 > s1 , J2 − J1 > s2 , . . . , Jn − Jn−1 > sn |A)
= P(J1 > s1 , J2 > s2 + J1 , . . . , Jn > sn + Jn−1 |A)
= PA (J1 > s1 )PA (J2 > s2 + J1 |J1 > s1 ) · · · PA (Jn > sn + Jn−1 |J1 > s1 , . . . , Jn−1 > sn−1 ).
By the Strong Markov property,
PA (Jk > sk + Jk−1 |J1 > s1 , . . . , Jk−1 > sk−1 ) = PA (Jk > sk + Jk−1 |Jk−1 > sk−1 )
= P(J1 > sk |X0 = ik−1 ) = e−q(ik−1 )sk .
(I.e. restarting the process at time Jk−1 , the next jump time Jk becomes the first jump-time
of the shifted process; the conditioning XJk−1 = Yk−1 = ik−1 translates to X0 = ik−1 for the
shifted process.) So, we have shown that
P(S1 > s1 , . . . , Sn > sn |A) = e^{−q(i0 )s1} · · · e^{−q(in−1 )sn}
which is precisely the joint distribution of independent exponential random variables with
parameters q(i0 ), . . . , q(in−1 ). In particular, this shows that
Proposition 19.2. Conditioned on Y0 , . . . , Yn−1 , the sojourn times S1 , . . . , Sn are independent
exponential random variables with Sk ∼ Exp(q(Yk−1 )).
19.3. Infinitesimal Description. We know that the transition rates q(i, j) completely de-
termine the Markov chain, and give a convenient description in terms of exponential
jumps. Since they give a complete description, they must therefore encode the transition
probabilities pt (i, j) = Pi (Xt = j) for all t ≥ 0. One way to see how pt (i, j) is determined
from q(i, j) is through the following so-called infinitesimal description of the chain.
Theorem 19.3. Let (Xt )t≥0 be a Markov chain with state space S and transition rates [q(i, j)]i,j∈S .
Then the transition probabilities pt (i, j) = Pi (Xt = j) satisfy
pt (i, i) = 1 − q(i)t + o(t) for i ∈ S,
pt (i, j) = q(i, j)t + o(t) for i ≠ j ∈ S.
In the statement of Theorem 19.3, we have (as usual) q(i) = Σ_{j∈S} q(i, j). The notation
f (t) = g(t) + o(t) is short-hand for
lim_{t↓0} [f (t) − g(t)]/t = 0.
That is, to first order f and g agree near 0. So the statement is that, for small t > 0, pt (i, i)
is very close to 1 − q(i)t, and pt (i, j) is very close to q(i, j)t for i 6= j.
Proof. First, for the i = j case, we have pt (i, i) = Pi (Xt = i). The event {Xt = i} contains
the event {∀s ≤ t Xs = i} which (conditioned on X0 = i) is the event {J1 > t}. Thus
pt (i, i) = Pi (Xt = i) ≥ Pi (J1 > t) = e−q(i)t = 1 − q(i)t + o(t). (19.3)
For j ≠ i, we can similarly estimate as follows: if J1 ≤ t, the first jump state is Y1 = j, and
the following sojourn time S2 > t, then the process jumps to j before time t and doesn't
leave it until time J2 = J1 + S2 > t, thus Xt = j. Hence
pt (i, j) = Pi (Xt = j) ≥ Pi (J1 ≤ t, Y1 = j, S2 > t) = Pi (Y1 = j)Pi (J1 ≤ t, S2 > t|Y1 = j).
By definition Pi (Y1 = j) = Pi (XJ1 = j) = p(i, j). By Proposition 19.2, since J1 = S1 , we
have
Pi (J1 ≤ t, S2 > t|Y1 = j) = Pi (J1 ≤ t|Y1 = j)Pi (S2 > t|Y1 = j) = (1 − e^{−q(i)t} )e^{−q(j)t} .
Thus, we have the inequality
pt (i, j) ≥ p(i, j)(1−e−q(i)t )e−q(j)t = p(i, j)(q(i)t+o(t))(1−q(j)t+o(t)) = q(i, j)t+o(t). (19.4)
(The final equality follows from the definition q(i, j) = q(i)p(i, j).) Now, fixing i and
summing over all states j, we have from Inequalities 19.3 and 19.4
Σ_{j∈S} pt (i, j) ≥ 1 − q(i)t + o(t) + Σ_{j≠i} [q(i, j)t + o(t)].
Since q(i) = Σ_j q(i, j) and q(i, i) = 0, this shows that
Σ_{j∈S} pt (i, j) ≥ 1 + o(t).
Hence, if any one of the inequalities in 19.3 or 19.4 were strict, we would have the strict
inequality Σ_{j∈S} pt (i, j) > 1 + o(t), and so for small enough t the sum would exceed 1. This
contradicts the stochastic condition Σ_{j∈S} pt (i, j) = 1, and so the inequalities must in fact
be equalities.
The infinitesimal description is a very handy way to identify a Markov chain. Indeed,
the chain is completely determined by the parameters q(i) and q(i, j) for i, j ∈ S, so to
fully understand the chain one need only understand pt (i, j) to first order in t. This can
be used to identify a process quickly.
Example 19.4. Let (Xt )t≥0 and (Yt )t≥0 be independent Poisson processes, with parameters
λ1 and λ2 . The infinitesimal description of Theorem 19.3 can be used to prove that Xt + Yt
is a Poisson process with parameter λ1 + λ2 ; this is an exercise on Homework 5.
From the infinitesimal description, we have ph (j, j) = 1 − q(j)h + o(h) and ph (k, j) =
q(k, j)h + o(h) when k 6= j. Thus
pt+h (i, j) = pt (i, j)(1 − q(j)h + o(h)) + Σ_{k≠j} pt (i, k)(q(k, j)h + o(h))
which says
pt+h (i, j) − pt (i, j) = −pt (i, j)q(j)h + Σ_{k≠j} pt (i, k)q(k, j)h + o(h),
which proves the Forward Equation. The Backward Equation is proved similarly, in that
case conditioning on Xh = k rather than Xt = k.
From the elementary theory of ODEs, we can now easily solve the Kolmogorov For-
ward and Backward Equations, at least in the case of a finite state space.
Corollary 20.3. Let (Xt )t≥0 be a finite-state continuous time Markov chain with infinitesimal
generator A. Then the transition probabilities are given by
Pt = e^{tA} .
The exponential of a matrix can be defined, for example, by the power-series e^{tA} =
Σ_{k=0}^∞ (t^k /k!) A^k . The essential point here is that (d/dt) e^{tA} = A e^{tA} = e^{tA} A, which shows that
Pt = e^{tA} satisfies the Kolmogorov Forward and Backward equations. In addition, we
have e^{0A} = I, the identity matrix, and indeed [P0 ]ij = P(X0 = j|X0 = i) = δij = [I]ij ;
hence, uniqueness of the solution to an ODE with given initial conditions proves the
corollary. It is worth noting that, when there are infinitely-many states, the matrix ex-
ponential may not even make sense, and there can in fact be more than one solution
to the equations. In this case, it can be proved (with probabilistic techniques) that the
Kolmogorov Backward equation possesses a unique minimal non-negative solution, and this
solution gives the transition probabilities pt (i, j).
Example 20.4. Consider an arbitrary two-state chain. The infinitesimal generator A then
has the form
A = [ −α   α ; β   −β ]
where α = q(1, 2) and β = q(2, 1). We can calculate that
A² = [ α² + αβ   −α² − αβ ; −αβ − β²   αβ + β² ] = −(α + β) [ −α  α ; β  −β ] = −(α + β)A.
Thus A^n = [−(α + β)]^{n−1} A for n ≥ 1, and so we have
e^{tA} = Σ_{k=0}^∞ (t^k /k!) A^k = I + Σ_{k=1}^∞ (t^k [−(α + β)]^{k−1} /k!) A = I − (1/(α + β)) Σ_{k=1}^∞ ([−(α + β)t]^k /k!) A
= I + [(1 − e^{−(α+β)t} )/(α + β)] A.
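As a numerical check of Corollary 20.3 and the series computation above, the following Python sketch (assuming SciPy is available) compares scipy.linalg.expm(tA) with the closed form obtained by summing the series; the values of α, β, t are arbitrary.

    import numpy as np
    from scipy.linalg import expm

    alpha, beta, t = 2.0, 1.0, 0.7
    A = np.array([[-alpha, alpha], [beta, -beta]])

    Pt_numeric = expm(t * A)   # P_t = e^{tA}
    # Summing the series gives e^{tA} = I + (1 - e^{-(alpha+beta)t})/(alpha+beta) * A:
    Pt_closed = np.eye(2) + (1 - np.exp(-(alpha + beta) * t)) / (alpha + beta) * A

    print(np.allclose(Pt_numeric, Pt_closed))   # True
    print(Pt_numeric)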
Since this is an infinite system of ODEs, we cannot simply solve it by taking a matrix ex-
ponential; we do not even have a general theorem to tell us that there is a unique solution.
But we can, in fact, solve the system and prove it has a unique solution, as follows. Fix i,
and consider the probability generating function
ϕi (t, z) = Σ_{j=0}^∞ pt (i, j) z^j .
All the coefficients pt (i, j) are ≤ 1 in modulus, so the series converges (uniformly) for
|z| < 1. Taking the time derivative can then be done term-by-term:
(∂/∂t) ϕi (t, z) = Σ_{j=0}^∞ (d/dt) pt (i, j) z^j = (d/dt) pt (i, 0) + Σ_{j=1}^∞ (d/dt) pt (i, j) z^j .
20.2. More on Poisson Processes. The Poisson process with rate λ has a very special
property that is not shared by all Markov processes. To get a hint of what it is, consider
the embedded jump chain (Yn )n≥0 of such a Poisson process (Xt )t≥0 . Since q(i, j) = 0 un-
less j = i + 1, we have XJn+1 = XJn + 1. It follows that Yn = Y0 + n; the jump chain is
just a deterministic one-step increase chain. Now, by Proposition 19.2, the sojourn times
are independent exponentials conditioned on the trajectory of (Yn ); but since there is only
one trajectory of (Yn ), and the jump rates q(i, i + 1) = λ do not depend on i, it follows that
the sojourn times are just plain i.i.d. Indeed, we can compute P(S1 > s1 , . . . , Sn > sn ) as
the conditional sum
Σ_{i0 ,...,in} P(S1 > s1 , . . . , Sn > sn |Y0 = i0 , . . . , Yn = in )P(Y0 = i0 , . . . , Yn = in ).
All of the terms P(Y0 = i0 , . . . , Yn = in ) are 0 except for those for which ik = i0 + k for
1 ≤ k ≤ n. So we have just the single sum
Σ_i P(S1 > s1 , . . . , Sn > sn |Y0 = i, . . . , Yn = i + n)P(Y0 = i, . . . , Yn = i + n).
Theorem 20.7. Let (Xt )t≥0 be a Poisson process of rate λ (with X0 = 0). Then for any s ≥ 0, the
process X̃t = Xt+s − Xs is a Poisson process of rate λ, independent of {Xu : 0 ≤ u ≤ s}.
Proof. It suffices to prove the claim under the conditioning Xs = i for an arbitrary fixed
state i. Indeed, if B is any event determined by the process X̃t for t ≥ 0, and A is any event
determined by {Xu : 0 ≤ u ≤ s}, and if A and B are independent conditioned on {Xs = i}, then
P(A, B) = Σ_i P(A, B|Xs = i)P(Xs = i) = Σ_i P(A|Xs = i)P(B|Xs = i)P(Xs = i).
Note that, by the conditional independence claim, we have B and {Xs = i} are indepen-
dent conditioned on {Xs = i}, which means simply that B and {Xs = i} are independent.
Thus P(B|Xs = i) = P(B), and so
P(A, B) = Σ_i P(A|Xs = i)P(B)P(Xs = i) = P(B) Σ_i P(A|Xs = i)P(Xs = i) = P(B)P(A)
as required.
Let J1 , J2 , . . . be the jump times and S1 , S2 , . . . the sojourn times. The event {Xs = i}
can be written exclusively in terms of these variables: Xs = i means that there have been
exactly i jumps (and no more) by time s. Hence
{Xs = i} = {Ji ≤ s < Ji+1 } = {Ji ≤ s} ∩ {Si+1 > s − Ji }.
By definition X̃0 = Xs − Xs = 0. The first jump time J̃1 of X̃t is the portion of Si+1 to
the future of s; since Si+1 is the interval between Ji and Ji+1 , this means that
J̃1 = S̃1 = Si+1 − (s − Ji ) = Ji+1 − s.
For further sojourn times, we simply have a time shift:
S̃n = Si+n , n ≥ 2.
We can then calculate the distribution of S̃1 , . . . , S̃n conditioned on S1 , . . . , Si and the event
{Xs = i} = {Ji ≤ s} ∩ {Ji+1 > s}.
P(S̃1 > s̃1 , . . . , S̃n > s̃n |S1 > s1 , . . . , Si > si , Xs = i)
= P(Si+n > s̃n , . . . , Si+2 > s̃2 , Ji+1 > s̃1 + s|Ji+1 > s, Ji ≤ s, Si > si , . . . , S1 > s1 )
= P(Si+n > s̃n , . . . , Si+2 > s̃2 , Ji+1 > s̃1 + s, Ji ≤ s, Si > si , . . . , S1 > s1 ) / P(Ji+1 > s, Ji ≤ s, Si > si , . . . , S1 > s1 ). (20.2)
Both numerator and denominator can be computed as multiple integrals. Recalling that
Ji = S1 + · · · + Si , and that S1 , S2 , . . . are i.i.d. Exp(λ) random variables, the denominator
of Equation 20.2 is
P(S1 + · · · + Si+1 > s, S1 + · · · + Si ≤ s, Si > si , . . . , S1 > s1 )
= ∫ · · · ∫_{u1 >s1 ,...,ui >si ; u1 +···+ui ≤s; ui+1 ≥0} 1{u1 +···+ui+1 >s} λ^{i+1} e^{−λ(u1 +···+ui+1 )} du1 · · · dui+1
= ∫ · · · ∫_{u1 >s1 ,...,ui >si ; u1 +···+ui ≤s; ui+1 ≥s−(u1 +···+ui )} λ^{i+1} e^{−λ(u1 +···+ui+1 )} du1 · · · dui+1 .
Hence, the denominator of Equation 20.2 is simply equal to
∫ · · · ∫_{u1 >s1 ,...,ui >si ; u1 +···+ui ≤s} λ^i e^{−λs} du1 · · · dui .
A totally analogous argument applied to the numerator of Equation 20.2, integrating out
all the variables ui+1 , . . . , ui+n , yields
P(Si+n > s̃n , . . . , Si+2 > s̃2 , Ji+1 > s̃1 + s, Ji ≤ s, Si > si , . . . , S1 > s1 )
= ∫ · · · ∫_{u1 >s1 ,...,ui >si ; u1 +···+ui ≤s} e^{−λs̃n} · · · e^{−λs̃2} · λ^i e^{−λ(s+s̃1 )} du1 · · · dui .
Not all Markov chains have independent increments, although the converse is true (and
easy to prove): if a stochastic process (with a countable state space and right-continuous
trajectories) has independent increments, then it is a Markov chain; if the increments
have the property that the distribution of Xt − Xs only depends on t − s (we say the
process has stationary increments), then (Xt )t≥0 is time-homogeneous as well. The class of
such stochastic processes is called the Lévy processes. Most Markov chains are not Lévy
processes; the Poisson process is almost alone in this regard (with countable state space).
Example 21.2 (Explosion). Let (Xt )t≥0 be a pure birth process with λi = i2 for i ∈ N.
Condition on X0 = 1. If TN is the time to reach state N , then TN = S1 + · · · + SN −1 , and so
E[TN ] = Σ_{i=1}^{N −1} E[Si ] = Σ_{i=1}^{N −1} 1/i² .
In particular, this means that lim_{N →∞} E[TN ] = π²/6 < ∞. Since all the random variables are
positive, we can exchange the limit and the sum, and if we set T = Σ_{i=1}^∞ Si (the time to
reach ∞), then E[T ] = π²/6 < ∞. In particular, this means P(T < ∞) = 1.
The situation in Example 21.2 is called explosion. It presents a subtlety that we have
thus far ignored: what happens after T = Σ_{i=1}^∞ Si ? The jump-chain, sojourn-times de-
scription is insufficient to answer this. Indeed, after this time, the process may restart
in any state, and the Markov property is still obeyed. It is for this reason that the Kol-
mogorov Forward- and Backward- equations can have multiple solutions (despite always
having the same initial data P0 = I). If explosion occurs, then the very same transition
rates describe different chains: one that stays at ∞ ever after the explosion, and others
that restart in other states. Such chains are called non-minimal. We will not discuss such
chains. But it is important to realize that the jump-chain sojourn-time description is only
valid up to the time of explosion (if it occurs).
21.2. Recurrence and Transience. The same asymptotic notions for the behavior of a
Markov chain apply in the continuous-time setting as the discrete-time setting.
Definition 21.3. Let (Xt )t≥0 be a continuous-time Markov chain with state space S, and let
i ∈ S. Let Ti = min{t ≥ J1 : Xt = i} be the first return time to i. The state i is called transient
if Pi (Ti < ∞) < 1; it is called recurrent if Pi (Ti < ∞) = 1; it is called positive recurrent if Ei [Ti ] < ∞.
Note that if Xt = i then there is k with Jk ≤ t < Jk+1 , and so the jump chain also
satisfies Yk = i. Since Yk = XJk , it follows that Yn returns to i iff Xt returns to i, and so the
state i is recurrent (resp. transient) for the chain Xt iff it is recurrent (resp. transient) for
the jump chain Yn . We therefore need no new technology to analyze a continuous-time
chain for recurrence/transience; we need only look at its jump chain. (In particular, as
before, if the chain is irreducible – meaning for each i, j ∈ S pt (i, j) > 0 for some t > 0 –
then either all states are recurrent or all states are transient.)
However, positive recurrence is determined not only by the property of return to a
state, but how long it takes to get there, which is different for the two chains. Hence,
positive recurrence is a different matter for continuous-time chains than discrete-time.
Example 21.4. Let (Xt )t≥0 be a birth and death chain with parameters λi = q(i, i + 1) > 0
for i ≥ 0 and µi = q(i, i − 1) > 0 for i ≥ 1. Since all the parameters are > 0, the chain is
irreducible, so we may test any state (by default 0) for transience/recurrence of the chain.
We can analyze such a chain in exactly the same manner we did for discrete-time birth
and death chains in Section 9.1.
Let h(i) = Pi (∃t ≥ 0 Xt = 0); then h(0) = 1. For i ≥ 1, doing a first-jump analysis, we have
h(i) = Σ_{j≥0} Pi (∃t ≥ 0 Xt = 0|XJ1 = j)Pi (XJ1 = j).
By the strong Markov property, we can restart the process at time J1 . Since i ≠ 0, Xt ≠ 0 for any
t < J1 , and so Pi (∃t ≥ 0 Xt = 0|XJ1 = j) = Pj (∃t ≥ 0 Xt = 0) = h(j). Hence we have the
usual recursion
h(i) = Σ_{j≥0} p(i, j)h(j), i ≥ 1.
For this chain p(i, j) = q(i, j)/q(i) = 0 unless j = i ± 1. So we have q(i) = λi + µi , and the
recursion becomes
h(i) = [λi /(λi + µi )] h(i + 1) + [µi /(λi + µi )] h(i − 1).
Rewriting this as
(λi + µi )h(i) = λi h(i + 1) + µi h(i − 1)
we can subtract to get
0 = λi h(i + 1) − λi h(i) − µi h(i) + µi h(i − 1) = λi [h(i + 1) − h(i)] − µi [h(i) − h(i − 1)].
Hence, we have
h(i + 1) − h(i) = (µi /λi )[h(i) − h(i − 1)], i ≥ 1.
By induction, then, we have
h(i + 1) − h(i) = [µi µi−1 · · · µ1 /(λi λi−1 · · · λ1 )] [h(1) − h(0)] ≡ ρi [h(1) − h(0)].
h(n) − h(0) = Σ_{i=1}^{n−1} [h(i + 1) − h(i)] = [h(1) − h(0)] Σ_{i=1}^{n−1} ρi . (21.1)
The minimal solution h(n), meaning the maximal solution u(n) = 1 − h(n), is then achieved
when u(1) = (Σ_{i=1}^∞ ρi )^{−1} . So we have
1 − h(n) = u(1) Σ_{i=1}^{n−1} ρi = (Σ_{i=1}^{n−1} ρi ) / (Σ_{i=1}^∞ ρi )
meaning that
h(n) = 1 − (Σ_{i=1}^{n−1} ρi ) / (Σ_{i=1}^∞ ρi ) = (Σ_{i=n}^∞ ρi ) / (Σ_{i=1}^∞ ρi ).
As n → ∞, h(n) → 0, which means that, for some n, there is a positive probability that
the chain started at n never reaches 0. This means 0 is a transient state, and so the chain
is transient.
Hence, recurrence is characterized by the summability of Σ_{i=1}^∞ ρi : the chain is recurrent if
the sum is ∞, and transient if it is < ∞.
Example 21.5 (M/M/1 queueing system). Consider a birth and death chain with λi = λ
and µi = µ both constant and non-zero. This is called an M/M/1 queue. It is a model for
a system where jobs (or customers) arrive at Poissonian times (at rate λ), queue up, and
are served in the order they arrived at rate µ. The process Xt is the number of jobs in the
queue at time t. There are much more general kinds of queueing models; the M/M stands
for Markov/Markov (meaning that the arrivals and service times are both Markovian),
and the index 1 means that only 1 job is served at a time.
Following Example 21.4, here we simply have ρi = µ^{i−1} /λ^{i−1} = (µ/λ)^{i−1} . Hence Σ_{i=1}^∞ ρi =
Σ_{j=0}^∞ (µ/λ)^j is infinite if µ ≥ λ, and is finite and equal to (1 − µ/λ)^{−1} if µ < λ. So Xt is
recurrent if µ ≥ λ, and transient if µ < λ. Indeed: if the service rate is at least the arrival
rate, we expect the queue to keep emptying out over time; if the service rate is less than
the arrival rate, then the line eventually grows without bound and never empties again
(i.e. 0 is never revisited).
Proposition 22.3. Let (Xt )t≥0 be a continuous time non-explosive Markov chain, and suppose
that π is a stationary distribution for (Xt )t≥0 . If P(X0 = j) = π(j) for all states j, then P(Xt =
j) = π(j) for all t > 0 and all states j.
Remark 22.4. In the following proof, we will basically assume there are finitely-many
states (this is required to interchange the derivative and the sum). A different proof is
needed to show the result holds true for infinitely-many states. Since Kolmogorov's equa-
tions actually only determine the process up to the time of explosion, however, the fact
that (d/dt) P(Xt = j) = 0 only shows that P(Xt = j) is constant up to the explosion time. If the
process explodes, it can well start in and stay in the stationary distribution up to explosion
time, and then be in a(n arbitrary) new distribution after explosion.
Proof. Fix a state j. Then
(d/dt) P(Xt = j) = (d/dt) Σ_i P(X0 = i)P(Xt = j|X0 = i) = (d/dt) Σ_i π(i)pt (i, j) = Σ_i π(i) (d/dt) pt (i, j).
By Kolmogorov’s backward equation,
(d/dt) pt (i, j) = Σ_{k≠i} q(i, k)pt (k, j) − q(i)pt (i, j)
and so
Σ_i π(i) (d/dt) pt (i, j) = Σ_i π(i) Σ_{k≠i} q(i, k)pt (k, j) − Σ_i π(i)q(i)pt (i, j).
Exchanging the order of summation in the first term, and using the fact that q(k, k) = 0 together
with the invariance of π (which says Σ_i π(i)q(i, k) = π(k)q(k) for each state k), we have
(d/dt) P(Xt = j) = Σ_k pt (k, j)q(k)π(k) − Σ_i π(i)q(i)pt (i, j) = 0.
It follows that t ↦ P(Xt = j) is constant, as claimed.
Hence, it makes sense to call such a distribution stationary or invariant.
Example 22.5. Let (Xt )t≥0 be an irreducible birth-and-death chain. (Irreducible here sim-
ply means that λi > 0 and µi > 0 for all i.) Then q(0) = λ0 , q(j) = λj + µj for j ≥ 1, and
the equations for a stationary distribution are
λ0 π(0) = µ1 π(1), (λj + µj )π(j) = µj+1 π(j + 1) + λj−1 π(j − 1), j ≥ 1.
The equations for j ≥ 1 can be written in the form
µj+1 π(j + 1) − µj π(j) = λj π(j) − λj−1 π(j − 1).
Thus, for example, µ2 π(2) − µ1 π(1) = λ1 π(1) − λ0 π(0); but λ0 π(0) = µ1 π(1) and so this
simplifies to µ2 π(2) = λ1 π(1). By induction, it follows that
µj+1 π(j + 1) = λj π(j), j≥0
and hence (since µj > 0 for all j), again by induction, we have
π(j) = (λj−1 /µj ) π(j − 1) = (λj−1 λj−2 /(µj µj−1 )) π(j − 2) = · · · = (λ0 · · · λj−1 /(µ1 · · · µj )) π(0).
Define this new coefficient as θj ; in terms of the previous notation ρj , we have
θj = λ0 · · · λj−1 /(µ1 · · · µj ) = (λ0 /λj )(1/ρj ).
By convention, we set θ0 = 1. Thus, normalizing π(0) as we like, we have that π(j) ∼ θj
is an invariant measure; in order for it to be a stationary distribution, we need it to be
summable. Hence, an irreducible birth-and-death chain has a stationary distribution if and
only if Σ_{i=0}^∞ θi < ∞, in which case the stationary distribution is π(j) = (Σ_{i=0}^∞ θi )^{−1} θj .
Example 22.6. Consider the M/M/1 queue of Example 21.5. This is a birth-and-death
chain with µi = µ and λi = λ for all i. Hence, we have
θj = λ0 · · · λj−1 /(µ1 · · · µj ) = λ^j /µ^j = (λ/µ)^j .
So Σ_{j=0}^∞ θj is finite if and only if λ < µ, in which case the sum is
Σ_{j=0}^∞ θj = Σ_{j=0}^∞ (λ/µ)^j = 1/(1 − λ/µ).
Hence, the stationary distribution for this chain is
π(j) = (Σ_{i=0}^∞ θi )^{−1} θj = (1 − λ/µ)(λ/µ)^j ,
which is a (shifted) geometric distribution.
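As a sanity check, the sketch below solves πQ = 0 numerically for a truncated M/M/1 generator (truncating at a level K where (λ/µ)^K is negligible is our own approximation) and compares the result to the geometric distribution above.

    import numpy as np

    lam, mu, K = 1.0, 2.0, 60
    Q = np.zeros((K + 1, K + 1))
    for j in range(K + 1):
        if j < K:
            Q[j, j + 1] = lam          # birth rate lambda_j = lam
        if j > 0:
            Q[j, j - 1] = mu           # death rate mu_j = mu
        Q[j, j] = -Q[j].sum()          # rows of a generator sum to 0

    # Solve pi Q = 0 together with sum(pi) = 1 (least squares on the stacked system).
    M = np.vstack([Q.T, np.ones(K + 1)])
    b = np.zeros(K + 2); b[-1] = 1.0
    pi = np.linalg.lstsq(M, b, rcond=None)[0]

    geometric = (1 - lam / mu) * (lam / mu) ** np.arange(K + 1)
    print(np.allclose(pi, geometric, atol=1e-8))   # True, up to truncation error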
Example 22.7 (M/M/∞ queue). Consider a queueing system with infinitely many servers.
Jobs arrive according to a Poisson process with rate λ, and are served independently with
independent service times at rate µ, where all jobs currently in the queue are served si-
multaneously. We model this as a birth-and-death chain with λj = λ for all j ≥ 0, and
µj = jµ for all j ≥ 1. Thus, we have
θj = λ0 · · · λj−1 /(µ1 · · · µj ) = λ^j /(µ · 2µ · · · jµ) = (1/j!)(λ/µ)^j
and so
Σ_{j=0}^∞ θj = Σ_{j=0}^∞ (1/j!)(λ/µ)^j = e^{λ/µ}
which is finite for any λ, µ > 0. Hence the stationary distribution always exists, and is
given by
π(j) = (e^{−λ/µ} /j!)(λ/µ)^j .
That is, the stationary distribution is Poissonian with mean λ/µ. In particular, if the queue
is at equilibrium, i.e. in the stationary distribution, then the mean number of customers
in the queue is the mean of Poisson(λ/µ), which is λ/µ.
22.2. Convergence to the Stationary Distribution. The exact analog of the basic conver-
gence theory for discrete Markov chains (cf. Corollary 11.1, Theorem 11.3, and Theorem
12.1) holds for continuous time chains as well, with the proviso that the theory is sim-
pler: there is no issue with periodicity. Indeed, for any continuous-time Markov chain, if
pt (i, j) > 0 for some time t > 0, then in fact pt (i, j) > 0 for all time t > 0 (this is Problem 2
on Homework 6).
Theorem 22.8. Let (Xt )t≥0 be an irreducible, continuous-time Markov chain, with transition
kernel pt (i, j) = Pi (Xt = j) for states i, j, and transition rates q(i, j) ≥ 0 with q(i) = Σ_{j≠i} q(i, j).
The following are equivalent.
(1) All states are positive recurrent.
(2) Some state is positive recurrent.
(3) The chain is non-explosive, and there exists a stationary distribution π.
Moreover, when these conditions hold, the stationary distribution is given by
π(j) = 1/Ej [Tj ]
where Tj is the return time to state j. Finally, under these conditions, we have pt (i, j) → π(j) as
t → ∞ for any states i, j.
Hence, Example 22.6 shows that the M/M/1 queue is, in fact, positive recurrent if and
only if λ < µ, and that it converges to equilibrium in this case. We already saw, in Example
21.5, that the M/M/1 queue is recurrent when λ ≤ µ; we conclude that in the critical case
λ = µ, the chain is null recurrent. As usual, this means that pt (i, j) → 0 for each fixed i, j in
this case. Example 22.7 shows that the M/M/∞ queue is always positive recurrent, and
always converges to its stationary (Poissonian) distribution.
The proofs of all parts of Theorem 22.8 are very similar to the proofs of Corollary 11.1
and Theorem 12.1, and so we will not touch them here. It is important to note that the non-
explosion clause (3) in the theorem is absolutely necessary. For example, for convergence,
we know that an exploding chain can start in the stationary distribution and hence stay in
it until it explodes; afterward it can take any distribution at all, and so may not converge
back to the stationary distribution.
Example 22.9. Consider a birth-and-death chain with λj = λ · 2j and µj = µ · 2j for some
parameters λ, µ chosen so that 1 < λ/µ < 2. Then
θj = λ0 · · · λj−1 /(µ1 · · · µj ) = (λ · 2λ · · · 2^{j−1} λ)/(2µ · 4µ · · · 2^j µ) = (λ/µ)^j (1/2^j )
and so Σ_{j=0}^∞ θj < ∞, so an invariant distribution exists. But because the rates increase
exponentially up the chain, explosion does occur, so the limiting behavior of the chain is
not determined by the rate parameters at all.
where Bj are the independent Bernoulli trials that all have expectation 0. So the expec-
tation is constant(ly equal to 0). This is inevitable for a “fair game” – if your expected
winnings were ever positive, your opponent’s would be negative, which means the game
would be slanted in your favor.
Actually, there is a stronger “fairness” property that holds for this game. Suppose m
tosses have already occurred, and your fortune has varied according to the trajectory
i0 , i1 , . . . , im over that time. After some larger number of tosses n, what is your expected
winnings? Since we already know Xm = im , the interesting quantity is Xn − Xm : your net
winnings (or loss) since time m. So what we want to know is
E[Xn − Xm |X0 = i0 , . . . , Xm = im ].
Expressing this as a telescoping sum and using the linearity of expectation, we have
E[Xn − Xm |X0 = i0 , . . . , Xm = im ] = Σ_{k=m}^{n−1} E[Xk+1 − Xk |X0 = i0 , . . . , Xm = im ].
This may look like simple book-keeping, but it brings new intuition about conditioning.
We start with a random variable X. We are then given some information, in the form of
some other random variables Y1 , . . . , Yn that we may observe. Given these observations,
what is our best guess for the value of X? In the (random event) that Y1 = i1 , . . . , Yn = in ,
our best guess is the conditional expectation E[X|Y1 = i1 , . . . , Yn = in ]. Hence, the new
random variable E[X|Y1 , . . . , Yn ] is our best (random) guess about the value of X given
any observations of Y1 , . . . , Yn .
Example 23.2. Suppose that X is a function of Y1 , . . . , Yn , X = F (Y1 , . . . , Yn ) for some
fixed (deterministic F ). Then complete information about Y1 , . . . , Yn yields complete in-
formation about X, so our “best guess” about X should be X itself. Indeed, in this case
we have
E[X|Y1 , . . . , Yn ] = E[F (Y1 , . . . , Yn )|Y1 , . . . , Yn ]
= Σ_{i1 ,...,in} E[F (Y1 , . . . , Yn )|Y1 = i1 , . . . , Yn = in ] 1{Y1 =i1 ,...,Yn =in}
= Σ_{i1 ,...,in} F (i1 , . . . , in ) 1{Y1 =i1 ,...,Yn =in} = F (Y1 , . . . , Yn ) = X.
Example 23.3. On the other hand, suppose that X and {Y1 , . . . , Yn } are independent. In
this case, information about Y1 , . . . , Yn should be essentially useless in determining the
value of X. What does this mean about E[X|Y1 , . . . , Yn ]? For any states x, i1 , . . . , in the
events {X = x} and {Y1 = i1 , . . . , Yn = in } are independent, so P(X = x|Y1 = i1 , . . . , Yn =
in ) = P(X = x), and therefore
E[X|Y1 = i1 , . . . , Yn = in ] = Σ_x x·P(X = x|Y1 = i1 , . . . , Yn = in ) = Σ_x x·P(X = x) = E[X].
Thus
E[X|Y1 , . . . , Yn ] = Σ_{i1 ,...,in} E[X|Y1 = i1 , . . . , Yn = in ] 1{Y1 =i1 ,...,Yn =in}
= E[X] Σ_{i1 ,...,in} 1{Y1 =i1 ,...,Yn =in} = E[X].
Thus, when X is independent from Y1 , . . . , Yn , the “best guess” E[X|Y1 , . . . , Yn ] is just the
constant E[X]. This is built into conditional expectation: the average value of X without
conditioning is always known, and if no further information is given, this constant is the
conditional expectation as well.
Example 23.4. Consider the SRW Xn of Example 22.5 again. Using Equation 23.1, we
have
E[Xn − Xm |X0 , . . . , Xm ] = Σ_{i0 ,...,im} E[Xn − Xm |X0 = i0 , . . . , Xm = im ] 1_{X0 =i0 ,...,Xm =im} = 0
by Example 22.6. Thus, for n ≥ m, we have E[Xn |X0 , . . . , Xm ] = Xm ; the best guess about
our future fortune at time n, given complete information up to time m, is our present for-
tune Xm . This is the essence of the “average fairness” property that defines a martingale.
Let’s collect the properties of conditional expectation from the last examples (and a few
more) in the following proposition.
Proposition 23.5. Let X, X 0 be random variables, and Y = Y1 , . . . , Yn a collection of random
variables. The following properties hold.
(1) For a, b ∈ R, E[aX + bX 0 |Y] = aE[X|Y] + bE[X 0 |Y].
(2) If X is Y-measurable, then E[X|Y] = X.
(3) If X is independent of Y, then E[X|Y] = E[X].
(4) (Tower Property) Let Z be another collection of random variables, and suppose Y are Z-
measurable, so Y = F (Z) for some function F . (A typical situation is when Z ⊇ Y).
Then
E[E[X|Z]|Y] = E[X|Y].
(5) (Factoring) If Y is Y-measurable then E[XY |Y] = Y E[X|Y].
Proof. Properties (2) and (3) were proved in Examples 23.2 and 23.3. Properties (1), (4),
and (5) all follow from straightforward (but lengthy) calculations immediate from the
definitions.
The simplest case of the Tower Property (4) in Proposition 23.5 is worthy of its own
statement.
Corollary 23.6. If X and Y = Y1 , . . . , Yn are random variables, then E[E[X|Y]] = E[X].
Proof. The set ∅ of random variables is (vacuously) independent from all random vari-
ables, including X, and Y ⊃ ∅ for any Y. Thus, by the Tower Property (4) of Proposition
23.5, we have
E[E[X|Y]|∅] = E[X|∅] = E[X]
where the second equality follows from (3) in Proposition 23.5. By the same argument, the
random variable E[X|Y] is independent from ∅, and so again by (3)
E[E[X|Y]|∅] = E[E[X|Y]]
which proves the claim.
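As a sanity check on Corollary 23.6, here is a small Monte Carlo sketch in Python; the model (roll a die Y, then toss Y fair coins and let X be the number of heads) is a toy example of my own choosing, not one from these notes.

import numpy as np

rng = np.random.default_rng(0)
N = 200_000

Y = rng.integers(1, 7, size=N)        # the observed variable: a fair die roll
X = rng.binomial(Y, 0.5)              # X depends on Y: number of heads in Y fair coin tosses

# Here E[X | Y = y] = y/2, so the random variable E[X|Y] is simply Y/2.
cond_exp = Y / 2.0

print("E[X]        ~", X.mean())         # should be E[Y]/2 = 1.75
print("E[ E[X|Y] ] ~", cond_exp.mean())  # Corollary 23.6: the same number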
The bets must satisfy Bn ≤ max{Wn , 0}, and so long as we start with finite capital, by induction this means
Bn is a bounded random variable, so it certainly has finite expectation. We also insist that
the betting strategy at time n can only use information available before time n: in other words, we
require Bn to be (W0 , W1 , . . . , Wn−1 )-measurable.
Under these conditions, we have
E[|Wn |] ≤ E[ Σ_{i=1}^n |Xi Bi | ] = E[ Σ_{i=1}^n |Bi | ] = Σ_{i=1}^n E[|Bi |] < ∞
for each n, as required. Now, the winnings at time n + 1 are Wn+1 = Wn + Xn+1 Bn+1 , and
so
E[Wn+1 |W0 , . . . , Wn ] = E[Wn |W0 , . . . , Wn ] + E[Xn+1 Bn+1 |W0 , . . . , Wn ].
The first term is E[Wn |W0 , . . . , Wn ] = Wn . For the second term: by assumption Bn+1 is
measurable with respect to W0 , . . . , Wn , and so by the factoring property of Proposition
23.5(5), we have
E[Xn+1 Bn+1 |W0 , . . . , Wn ] = Bn+1 E[Xn+1 |W0 , . . . , Wn ].
Now, B1 is W0 measurable, and W0 is independent of Xk for all k, so B1 is as well. It
follows that X1 B1 is independent of Xn+1 , and therefore so is W1 = W0 + X1 B1 . Then B2
is (W0 , W1 )-measurable, and since both are independent of Xn+1 , so is B2 . Thence so is
X2 B2 , and so too W2 = W1 + X2 B2 . Continuing this way, we find that W0 , . . . , Wn are all
independent of Xn+1 . Hence, by Proposition 23.5(3),
E[Xn+1 |W0 , . . . , Wn ] = E[Xn+1 ] = 0.
So, we have shown that E[Wn+1 |W0 , . . . , Wn ] = Wn + 0 = Wn , and so Wn is a martingale.
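Here is a short simulation sketch of a betting strategy of the kind just described; the particular rule (bet 1 while the fortune is below its starting value, otherwise bet 2) is a hypothetical choice, made only so that B_{n+1} is a function of the past fortunes. The sample averages of Wn stay flat, as the martingale property predicts.

import numpy as np

rng = np.random.default_rng(1)
trials, n_rounds, W0 = 100_000, 30, 10.0

W = np.full(trials, W0)
means = [W.mean()]
for n in range(n_rounds):
    X = rng.choice([-1, 1], size=trials)   # fair coin toss X_{n+1}
    B = np.where(W < W0, 1.0, 2.0)         # bet B_{n+1}: a function of the fortunes so far
    W = W + B * X                          # W_{n+1} = W_n + B_{n+1} X_{n+1}
    means.append(W.mean())

print("sample averages of W_n (should hover near W0 = 10):")
print(np.round(means, 3))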
24.2. Stopping Martingales. Just as for a (discrete-time) Markov chain, or any stochastic
process (Xn )n≥0 , a stopping time T is a random variable taking values in {0, 1, . . .} ∪ {∞},
which has the property that the event {T ≤ n} depends only on X0 , . . . , Xn . (To be precise:
{T ≤ n} can be expressed in the form {(X0 , . . . , Xn ) ∈ U } for some set of states U .)
In the context of Example 24.4, we would like to stop betting at some time. We could
stop after a fixed number n of coin tosses. More likely, we would like to stop as soon as
Wn ≥ F for some fixed fortune F . (Indeed, T = min{n ≥ 0 : Wn ≥ F } is a stopping time.)
If we stop betting at this time, then no matter what happens to the coin, our fortune is
equal to WT from then on. In other words, we have replaced the original martingale Wn
with the new process WT ∧n , where T ∧ n = min{T, n}. In fact, this stopped martingale is
still a martingale.
Proposition 24.5. Let (Xn )n≥0 be a martingale and let T be a stopping time for this martingale.
Then (XT ∧n )n≥0 is a martingale.
Proof. Let Yn = XT ∧n . First note that Yn ∈ {X0 , . . . , Xn } for each n, so
|Yn | ≤ max{|X0 |, . . . , |Xn |} ≤ |X0 | + · · · + |Xn |,
hence E[|Yn |] ≤ E[|X0 |] + · · · + E[|Xn |] < ∞, as required. We now need to show that
E[Yn+1 |Y0 , . . . , Yn ] = Yn . We will first show that
E[Yn+1 |X0 , . . . , Xn ] = Yn . (24.1)
As a corollary, we therefore have that E[XT ∧n ] = E[XT ∧0 ] = E[X0 ] for all n. This is not
quite the same as saying that, when we’ve stopped playing, our expected fortune is still
unchanged. What we would like to see, for a truly “fair game”, is that E[XT ] = E[X0 ]. If
this can be violated, then even though at each fixed time n we cannot expect any profit
from any betting strategy, perhaps we can choose a clever enough random stopping time
(based on events we’ve seen thus far in the game) to come out on top, on average. In fact,
this is possible.
Example 24.6 (Martingale Betting Strategy). Consider a betting strategy as in Example
24.4, where the strategy is fixed as Bn = 2^n (double your bet each round). Example 24.4
shows that the winnings Wn form a martingale. Now, let T = min{n : Xn = 1}, the first
time heads comes up. If the initial capital is the fixed amount W0 = C, then we have
E[WT ] = Σ_{n=0}^∞ E[WT |T = n] P(T = n).
Now P(T = n) = 2^{−n} and, on the event {T = n}, the final winning bet recovers all earlier losses and leaves a net profit of one unit, so E[WT |T = n] = C + 1; and so
E[WT ] = Σ_{n=1}^∞ 2^{−n} (C + 1) = C + 1.
Example 24.6 shows that “arbitrage” is possible: one can employ a betting strategy that
is guaranteed to make money in a fair game (if one has infinite capital). But the winnings-
martingale Wn fluctuates wildly in this case. In fact, if we rule out such unbounded fluctuations (or require that
our stopping time must occur by a fixed time), then arbitrage is impossible.
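A simulation sketch of the doubling strategy of Example 24.6 (here normalized so that the first bet is 1 and the bet doubles after each loss, which is what makes the net profit exactly one unit): it shows E[WT ] ≈ C + 1, but also how deep into debt a single run can go before the first heads arrives.

import numpy as np

rng = np.random.default_rng(2)
trials = 100_000
C = 0.0   # initial capital; any fixed value works

final = np.empty(trials)
worst = np.empty(trials)
for i in range(trials):
    W, bet, low = C, 1.0, C
    while True:
        if rng.random() < 0.5:   # heads: win the current bet and stop (this is the time T)
            W += bet
            break
        W -= bet                 # tails: lose the bet, then double it
        low = min(low, W)
        bet *= 2.0
    final[i] = W
    worst[i] = low

print("E[W_T]                     ~", final.mean(), "(theory: C + 1 =", C + 1, ")")
print("average worst fortune seen ~", worst.mean())
print("single worst fortune seen  ~", worst.min())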
Theorem 24.8 (Optional Sampling Theorem). Let (Xn )n≥0 be a martingale, and let T be a
finite stopping time. Suppose that either:
(1) T is bounded: there is some N < ∞ so that P(T ≤ N ) = 1; or
(2) (Xn )0≤n≤T is bounded: for some B < ∞, P(|Xn | ≤ B for n ≤ T ) = 1.
Then E[XT ] = E[X0 ].
Proof. First, suppose (1) holds. By Proposition 24.5, XT ∧n is a martingale, and so E[XT ∧n ] =
E[XT ∧0 ] = E[X0 ] for all n. In particular, E[X0 ] = E[XT ∧N ]; but T ∧N = T , so E[X0 ] = E[XT ]
as claimed.
Now, suppose T is not necessarily bounded, but (2) holds. Then for any n, we can write
XT = XT ∧n + (XT − XT ∧n ) = XT ∧n + (XT − XT ∧n )1{T >n} .
(The second equality follows from the fact that, if T ≤ n, then XT − XT ∧n = XT − XT = 0.)
So we have
E[XT ] = E[XT ∧n ] + E[(XT − XT ∧n )1{T >n} ].
Now, T ∧ n is bounded by n, and it is easy to see that it is a stopping time; so by part (1), it
follows that E[XT ∧n ] = E[X0 ]. For the second term, we have
|E[(XT − XT ∧n )1{T >n} ]| ≤ E[|XT − XT ∧n |1{T >n} ] ≤ 2BE[1{T >n} ] = 2BP(T > n).
Since P(T < ∞) = 1, P(T > n) → 0 as n → ∞. Hence, we have
|E[XT ] − E[X0 ]| = |E[XT ] − E[XT ∧n ]| ≤ 2BP(T > n) → 0.
Since the left-hand-side does not depend on n, it follows that E[XT ] = E[X0 ], as required.
Example 25.2. Let X1 , X2 , . . . be a sequence of i.i.d. random variables with common mean
E[Xn ] = µ, and let Sn = X1 + · · · + Xn . Define Mn = Sn − nµ, and M0 = 0. Then note that
E[Mn+1 |M0 , M1 , . . . , Mn ] = E[Sn+1 − (n + 1)µ|M0 , . . . , Mn ]
= E[Sn − nµ + Xn+1 − µ|M0 , . . . , Mn ]
= E[Mn + Xn+1 − µ|M0 , . . . , Mn ]
= Mn + E[Xn+1 − µ|M0 , . . . , Mn ]
Since the Xn are independent, Xn+1 is independent from Sk for all k ≤ n, and hence also
from Mk for k ≤ n. Thus the last term is
E[Xn+1 − µ|M0 , . . . , Mn ] = E[Xn+1 − µ] = µ − µ = 0.
Also: E[M1 |M0 ] = E[M1 |0] = E[M1 ] = E[X1 − µ] = 0 = M0 . Thus (Mn )n≥0 is a martingale.
Now, let T be a bounded stopping time for (Xn )n≥0 (which is therefore also a bounded
stopping time for (Mn )n≥0 ). Then by the Optional Sampling Theorem,
0 = E[M0 ] = E[MT ] = E[ST − T µ] = E[ST ] − µE[T ].
We have thus proved Wald’s equation: if T is a bounded stopping time for an i.i.d. se-
quence with mean µ, then E[X1 + X2 + · · · + XT ] = µE[T ]. (In fact, this is true for any
stopping time T with E[T ] < ∞, and it can be proved using the optional sampling theo-
rem in this stronger form, but the proof is more involved.)
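A quick Monte Carlo sketch of Wald's equation; the specific model (Xi exponential with mean µ = 2, and T the bounded stopping time min{first n with Sn > 5, 50}) is a hypothetical choice, not an example from the notes.

import numpy as np

rng = np.random.default_rng(3)
trials, mu, cap = 100_000, 2.0, 50

ST = np.empty(trials)
T = np.empty(trials)
for i in range(trials):
    s, n = 0.0, 0
    while True:
        n += 1
        s += rng.exponential(mu)      # X_n, i.i.d. with mean mu
        if s > 5.0 or n == cap:       # a bounded stopping time: depends only on X_1, ..., X_n
            break
    ST[i], T[i] = s, n

print("E[S_T]    ~", ST.mean())
print("mu * E[T] ~", mu * T.mean())   # Wald's equation: these should agree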
25.2. Submartingales. A stochastic process (Xn )n≥0 is called a submartingale (resp. su-
permartingale) if, for each n, E[Xn+1 |X0 , . . . , Xn ] ≥ Xn . (resp. E[Xn+1 |X0 , . . . , Xn ] ≤ Xn ).
For example, suppose we bet on coin-tosses with a betting strategy as in Example 24.4, but
we additionally have a secret admirer who watches us bet. At each time n, the admirer
adds some amount Gn to our winnings, where the amount given depends only on our
betting and the coin tosses up to time n − 1. That is: Wn+1 = Wn + Bn+1 Xn+1 + Gn+1 where
Gn+1 is (X1 , . . . , Xn , B1 , . . . , Bn )-measurable. Then our total fortune Wn is a submartin-
gale: E[Wn+1 |W0 , . . . , Wn ] = Wn + E[Gn+1 |W0 , . . . , Wn ] ≥ Wn . (We might similarly consider a game where, at
each time, we bet but may also buy some number of drinks, with the number depending
only on our current and past fortune; this would give a supermartingale.)
Sub- and supermartingales are very useful technical tools in analyzing martingales and
other stochastic processes. The main example we will consider comes from maximal in-
equalities. One of the most basic estimates useful in probability theory is Markov’s in-
equality: P(|X| ≥ a) ≤ E[|X|]/a. There are more general versions (proved similarly) that
give P(|X| ≥ a) ≤ E[|X|^r ]/a^r and P(X ≥ a) ≤ E[e^{bX} ]/e^{ba} for any b > 0. If Xn is a martingale,
stronger estimates are possible: the same bounds can be given not only for Xn , but for
max{X0 , . . . , Xn }. The key is the following strengthening of the r = 1 Markov inequality
for a positive submartingale.
Theorem 25.3 (Doob’s Maximal Inequality). Let (Xn )n≥0 be a non-negative submartingale.
Then for any a > 0,
P(max{X0 , . . . , Xn } ≥ a) ≤ E[Xn ]/a.
Proof. Let T = min{n : Xn ≥ a} (a stopping time). Consider the event
Ak = {T = k} = {T ≤ k} ∩ {T > k − 1}.
Since T is a stopping time, {T ≤ k} is (X0 , . . . , Xk )-measurable, and the event {T ≤ k − 1}
is (X0 , . . . , Xk−1 )-measurable, so too is its complement, and therefore Ak is (X0 , . . . , Xk )-
measurable. Now, since Xn ≥ 0, we have Xn ≥ Xn 1_{T ≤n} , and hence
E[Xn ] ≥ E[Xn 1_{T ≤n} ] = E[ Σ_{k=0}^n Xn 1_{Ak} ] = Σ_{k=0}^n E[Xn 1_{Ak} ].
Since Ak is (X0 , . . . , Xk )-measurable and (Xn ) is a submartingale, E[Xn 1_{Ak} ] ≥ E[Xk 1_{Ak} ]; and on the event Ak we have Xk ≥ a, so E[Xk 1_{Ak} ] ≥ aP(Ak ). Summing over k gives E[Xn ] ≥ aP(T ≤ n).
Now, the event {T ≤ n} is the event that there exists a k ≤ n with Xk ≥ a; i.e. {T ≤ n} =
{max{X0 , . . . , Xn } ≥ a}. This proves the theorem.
We will use Doob’s maximal inequality directly for the special case of a non-negative
martingale, in the next section. The more general result for submartingales actually al-
lows a much more powerful family of estimates for martingales. This is because of the
following.
Lemma 25.4. Let (Xn )n≥0 be a martingale, and let f : R → R be a convex function such that
E[|f (Xn )|] < ∞ for all n. Then Yn = f (Xn ) is a submartingale.
The proof of Lemma 25.4 is an Exercise on Homework 7.
Corollary 25.5. Let (Xn )n≥0 be a martingale, and let r ≥ 1 and a, b > 0. Then
P(max{X0 , . . . , Xn } ≥ a) ≤ E[|Xn |^r ]/a^r    (25.1)
P(max{X0 , . . . , Xn } ≥ a) ≤ E[e^{bXn} ]/e^{ba} .    (25.2)
Proof. The function f (x) = |x|^r is convex for r ≥ 1. Since (Xn )n≥0 is a martingale, Lemma
25.4 shows that (|Xn |^r )n≥0 is a (non-negative) submartingale. Now, for fixed a, if there
exists a k ∈ {0, . . . , n} with Xk ≥ a then |Xk |^r ≥ a^r . Thus {max{X0 , . . . , Xn } ≥ a} ⊆
{max{|X0 |^r , . . . , |Xn |^r } ≥ a^r }. Therefore, using Theorem 25.3 for Yn = |Xn |^r , we have
P(max{X0 , . . . , Xn } ≥ a) ≤ P(max{Y0 , . . . , Yn } ≥ a^r ) ≤ E[Yn ]/a^r
which proves Equation 25.1. The proof of Equation 25.2 is very similar.
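A sketch checking the r = 2 case of Corollary 25.5 for the simple random walk (a martingale): the probability that the running maximum reaches a is estimated by simulation and compared with the bound E[Xn²]/a² = n/a². The values of n and a are hypothetical.

import numpy as np

rng = np.random.default_rng(4)
trials, n, a = 20_000, 400, 50.0

steps = rng.choice([-1, 1], size=(trials, n))
X = np.cumsum(steps, axis=1)          # SRW paths X_1, ..., X_n (with X_0 = 0)

lhs = np.mean(X.max(axis=1) >= a)     # P(max{X_0, ..., X_n} >= a); X_0 = 0 < a plays no role here
rhs = n / a ** 2                      # E[|X_n|^2]/a^2 = Var(X_n)/a^2 = n/a^2

print("P(max >= a) ~", lhs)
print("bound n/a^2 =", rhs)           # Corollary 25.5 with r = 2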
In fact, (Wn )n≥1 is a martingale; the proof is similar to the proof in Example 24.4. We have
E[Wn+1 |X1 , . . . , Xn ] = E[Wn + Bn+1 (Xn+1 − Xn )|X1 , . . . , Xn ].
Now, the bet Bn+1 is determined by X1 , . . . , Xn , and the winnings Wn are determined by
X1 , . . . , Xn and B1 , . . . , Bn (hence just by X1 , . . . , Xn ). Thus Wn and Bn+1 are (X1 , . . . , Xn )-
measurable, and so
E[Wn+1 |X1 , . . . , Xn ] = Wn + Bn+1 E[Xn+1 − Xn |X1 , . . . , Xn ] = Wn
since Xn is a martingale. Now, since W1 , . . . , Wn are (X1 , . . . , Xn )-measurable, we there-
fore have
E[Wn+1 |W1 , . . . , Wn ] = E[E[Wn+1 |X1 , . . . , Xn ]|W1 , . . . , Wn ] = E[Wn |W1 , . . . , Wn ] = Wn
and so (Wn )n≥1 is a martingale. In particular E[Wn ] = E[W1 ] for all n, and W1 equals either
X1 − X0 or 0, both of which have expectation 0. Thus E[Wn ] = 0 for all n.
Now, note that
Wn = Σ_{k : Tk ≤ n} Σ_{Sk−1 < j ≤ Tk} 1 · (Xj − Xj−1 ) = Σ_{k=1}^{Un} Σ_{Sk−1 < j ≤ Tk} (Xj − Xj−1 ) + Σ_{j=S_{Un}+1}^{n} (Xj − Xj−1 ).
The first sum is over precisely the complete Un up-crossings. In each up-crossing, the sum
over j is a telescoping sum with value XTk − XSk−1 ; by definition XTk ≥ b and XSk−1 ≤ a.
Hence, each of those terms is ≥ b − a, and we have
Wn ≥ Σ_{k=1}^{Un} (b − a) + Σ_{j=S_{Un}+1}^{n} (Xj − Xj−1 ).
The latter sum is also telescoping, with value Xn − XSUn . By assumption Xn ≥ 0, and
X_{S_{Un}} ≤ a, so we conclude that
Wn ≥ (b − a)Un − a.
Hence 0 = E[Wn ] ≥ (b − a)E[Un ] − a, and so E[Un ] ≤ a/(b − a) for all n. It follows (by
monotone convergence) that the limit U = limn→∞ Un also has E[U ] ≤ a/(b − a) < ∞. In particular,
P(U < ∞) = 1, which concludes the proof of the theorem.
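For intuition about the up-crossing counts Un, here is a simulation sketch using a concrete non-negative martingale: simple random walk started at 5 and absorbed at 0 (a stopped martingale). Assuming the convention that an up-crossing of [a, b] is completed each time the path, having been ≤ a, next reaches ≥ b, the sample average of Un should respect the bound E[Un ] ≤ a/(b − a); the interval [a, b] = [2, 4] and the horizon n are hypothetical choices.

import numpy as np

rng = np.random.default_rng(5)
trials, n, a, b, start = 20_000, 2000, 2, 4, 5

x = np.full(trials, float(start))
in_crossing = np.zeros(trials, dtype=bool)   # True once the path has been <= a since the last completed up-crossing
ups = np.zeros(trials, dtype=int)

for _ in range(n):
    alive = x > 0                                      # absorbed paths stay frozen at 0
    x = x + alive * rng.choice([-1, 1], size=trials)   # one SRW step for the unabsorbed paths
    in_crossing |= (x <= a)
    completed = (x >= b) & in_crossing                 # an up-crossing of [a, b] is completed
    ups += completed
    in_crossing &= ~completed

print("E[U_n]          ~", ups.mean())
print("bound a/(b - a) =", a / (b - a))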
So, start Xn as X0 = 0 as usual. Its first step is to ±1. In the case it is +1, the above
analysis shows that Xn = 0 for some n; a similar analysis works for X1 = −1. This gives
an alternate proof that the SRW is recurrent.
Note: in Example 26.3, conditioning on X0 = 1 means that M0 = XT ∧0 = X0 = 1,
and so E[Mn ] = E[M0 ] = 1 for all n. But M∞ = 0 so E[M∞ ] = 0. We see that the
constant expectation property of martingales does not generally extend to their limits.
This is not hard to believe – it is always a tricky business exchanging a limit with a
sum/integral/expectation.
Example 26.4 (Polya Urns). An urn initially contains a red balls and b blue balls. At each
time step, draw a ball at random from the urn, then return it along with another ball of
the same color. Let Xn denote the number of red balls after n turns. Then (Xn )n≥0 is a
(time-inhomogeneous) Markov chain. Indeed, suppose at time n there are k red balls in
the bin. The total number of balls at time n is n + a + b, and so the probability that the
next ball we pick is red is k/(n + a + b). Then at time n + 1, we return this ball with another
of the same color; if it was red, Xn+1 = k + 1, and if it was blue, Xn+1 = k again. Hence,
for k ≥ a,
P(Xn+1 = k + 1|Xn = k) = k/(n + a + b),    P(Xn+1 = k|Xn = k) = 1 − k/(n + a + b).
All our techniques for studying Markov chains required time homogeneity, so we are left
to analyze this process directly.
Instead, let Mn = Xn /(n + a + b) be the fraction of red balls after n turns. Then (Mn )n≥0 is a
martingale: first, note that 0 ≤ Mn ≤ 1, so E[|Mn |] ≤ 1 < ∞. Now, since Xn is a Markov
chain, we know that for any states i0 , i1 , . . . , in , in+1 , P(Xn+1 = in+1 |Xn = in , . . . , X0 =
i0 ) = P(Xn+1 = in+1 |Xn = in ). It follows (from the summation definition of conditional
expectation) that E[Xn+1 |X0 , . . . , Xn ] = E[Xn+1 |Xn ]. This we can compute:
E[Xn+1 |Xn = k] = Σ_j j P(Xn+1 = j|Xn = k) = (k + 1) · k/(n + a + b) + k · (1 − k/(n + a + b)) = k + k/(n + a + b).
(The expected increment k/(n + a + b) is exactly the probability of adding a red ball at the next step.) Since Mn is Xn -measurable, we therefore have
E[Mn+1 |Mn , . . . , M0 ] = E[Mn+1 |Xn ] = (1/(n + 1 + a + b)) E[Xn+1 |Xn ]
and this equals
(1/(n + 1 + a + b)) (Xn + Xn /(n + a + b)) = Xn /(n + a + b) = Mn
showing that (Mn )n≥0 is a martingale.
This martingale is non-negative, hence by the Martingale Convergence Theorem, Mn →
M∞ with probability 1 for some random variable M∞ . It can be shown (though we are not
equipped to show it) that M∞ has a beta distribution: it is a continuous random variable
on [0, 1] with density
fM∞ (x) = [(a + b − 1)!/((a − 1)!(b − 1)!)] x^{a−1} (1 − x)^{b−1} ,    0 < x < 1.
There are good heuristics for why a beta should show up, so it is not too hard to justify
that if there is a limit it ought to be beta. The martingale convergence theorem is the key to
making the proof rigorous.
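A simulation sketch of the Polya urn: with the (hypothetically chosen) initial counts a = 2 red and b = 1 blue, the density above is Beta(2, 1), i.e. f(x) = 2x on (0, 1), so the limit M∞ has mean 2/3 and P(M∞ ≤ 1/2) = 1/4; after many steps Mn should already be close to this.

import numpy as np

rng = np.random.default_rng(6)
a, b = 2, 1                           # initial red and blue balls (a hypothetical choice)
n_steps, trials = 2_000, 50_000

red = np.full(trials, float(a))
for n in range(n_steps):
    total = n + a + b
    draw_red = rng.random(trials) < red / total   # draw a ball: red with probability X_n/(n+a+b)
    red += draw_red                               # return it together with one more of the same color

M = red / (n_steps + a + b)                       # fraction of red balls M_n
print("mean of M_n   ~", M.mean(), "(Beta(2,1) mean: 2/3)")
print("P(M_n <= 1/2) ~", np.mean(M <= 0.5), "(Beta(2,1): 1/4)")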
27.1. Random Continuous Motion. Call a stochastic process (Xt )t≥0 a random continu-
ous motion if it has:
(1) Stationary Increments: There is a one parameter family of probability distribu-
tions {µt }t≥0 so that, for 0 ≤ s < t < ∞, the distribution of Xt − Xs is µt−s .
(2) Independent Increments: If 0 ≤ t0 < t1 < · · · < tn < ∞, then the increments of
the process Xt1 − Xt0 , Xt2 − Xt1 , . . . , Xtn − Xtn−1 are independent.
(3) Continuous Trajectories: The function t 7→ Xt is continuous with probability 1.
A Poisson process only misses the mark on item (3); but this is a serious restriction. Indeed,
unless (Xt )t≥0 is constant (the degenerate case), the condition that t 7→ Xt be continuous
means that the state space is continuous (i.e. it must contain an interval). Hence, a random
continuous motion is a new kind of stochastic process beyond what we’ve studied: it is a
continuous-time, continuous-state process.
One might expect this to open a can of worms on the kinds of processes that fit the bill;
but the continuity of the paths (together with the other properties) actually pins down
the measures: up to scale, there is only one random continuous motion! To see how the
measure µt is determined, note that µt is the distribution of Xt − X0 . We can write this
increment as a telescoping sum: let tj = (j/n) t, so that t0 = 0 and tn = t. Then
Xt − X0 = [Xtn − Xtn−1 ] + [Xtn−1 − Xtn−2 ] + · · · + [Xt1 − Xt0 ].
By conditions (1) and (2), this means Xt − X0 is a sum of n independent random vari-
ables, each of which has distribution µtj −tj−1 = µt/n . Probability distributions µt that can
be written as the distribution of a sum of n i.i.d. random variables, for every n, are called infinitely-divisible,
and they form a small (but still infinite) family of probability distributions. Indeed, if we
forget about the continuity, then there are many such processes, called Lévy processes,
among which the Poisson process is an example. (As follows from Homework 5, Problem
3, a sum of independent Poisson random variables is Poisson, hence the Poisson distribu-
tion of any fixed rate is infinitely-divisible.)
since (E[Bs ])^2 = 0, and the variance of N (0, σ^2 s) is (by definition) σ^2 s. We can restate all of
this as
Cov(Bs , Bt ) = σ^2 min{s, t}.    (27.1)
On the subject of variances, note that we have
Var(Bt+h − Bt ) = σ^2 h.
In other words, the standard deviation of the increment Bt+h − Bt is σ√h. For small h,
this is much larger than h. So, on average, we cannot expect (1/h)[Bt+h − Bt ] to have a limit as h ↓ 0. Indeed,
Brownian trajectories are nowhere differentiable.
Here is a very useful property of Brownian motion that we will exploit in investigating its
finer behavior.
Lemma 27.2 (Scaling Property). Let (Bt )t≥0 be a standard Brownian motion, and let a > 0. For
t ≥ 0, define Wt = a^{−1/2} Bat . Then (Wt )t≥0 is a standard Brownian motion.
Proof. Since B0 = 0, W0 = 0. Scaling does not affect continuity. So we need only
check that the increments are independent and normally distributed. For fixed times
0 ≤ t0 < t1 < · · · < tn , we also have 0 ≤ at0 < at1 < · · · < atn , and so by as-
sumption the increments Bat1 − Bat0 , . . . , Batn − Batn−1 are independent. Scaling them
by a^{−1/2} preserves independence, and so the increments of Wt are independent. Finally,
Wt − Ws = a^{−1/2} (Bat − Bas ). By assumption the distribution of Bat − Bas is N (0, a(t − s)),
and so scaling by a^{−1/2} scales the variance by (a^{−1/2} )² = a^{−1} , showing that Wt − Ws has
distribution N (0, t − s). Thus (Wt )t≥0 is a standard Brownian motion.
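A numerical sketch of the scaling property: simulate discretized Brownian paths, set Wt = a^{−1/2} Bat with the hypothetical value a = 4, and check that Var(W1 ) ≈ 1 and Cov(W_{1/2} , W1 ) ≈ 1/2, the values Equation 27.1 (with σ² = 1) prescribes for a standard Brownian motion.

import numpy as np

rng = np.random.default_rng(7)
trials, dt, a = 20_000, 0.01, 4.0
n = int(round(a * 1.0 / dt))          # we need the B-path up to time a * 1 = 4

incs = rng.normal(0.0, np.sqrt(dt), size=(trials, n))   # independent N(0, dt) increments
B = np.cumsum(incs, axis=1)                              # discretized standard Brownian paths

def B_at(t):                          # value of each path at (grid) time t
    return B[:, int(round(t / dt)) - 1]

W_half = a ** -0.5 * B_at(a * 0.5)    # W_{1/2} = a^{-1/2} B_{a/2}
W_one = a ** -0.5 * B_at(a * 1.0)     # W_1     = a^{-1/2} B_a

print("Var(W_1)        ~", W_one.var(), "(should be 1)")
print("Cov(W_1/2, W_1) ~", np.mean(W_half * W_one), "(should be min{1/2, 1} = 1/2)")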
Example 27.5. Let (Bt )t≥0 be a standard Brownian motion, and let U be a uniform random
variable (on [0, 1]) independent from (Bt )t≥0 . Define
Xt = Bt for t ≠ U,    and    XU = 0.
In fact, Xt is a Gaussian process with the same marginal distributions as Bt . To see this,
it suffices to show that for any times t1 , . . . , tn and any coefficients a1 , . . . , an , the two
random variables a1 Xt1 + · · · + an Xtn and a1 Bt1 + · · · + an Btn have the same distribution.
This can be tested by looking at their characteristic functions χX (ξ) = E[e^{iξX} ]. We have, for
any ξ ∈ R,
E[e^{iξ(a1 Xt1 +···+an Xtn )} ] = E[e^{iξ(a1 Xt1 +···+an Xtn )} 1_{U ≠ t1 ,...,tn} ]
because the event {U ≠ t1 , . . . , tn } has probability 1. But on this event Xt = Bt for each t,
so
E[e^{iξ(a1 Xt1 +···+an Xtn )} ] = E[e^{iξ(a1 Bt1 +···+an Btn )} 1_{U ≠ t1 ,...,tn} ]
= E[e^{iξ(a1 Bt1 +···+an Btn )} ] E[1_{U ≠ t1 ,...,tn} ]
= E[e^{iξ(a1 Bt1 +···+an Btn )} ],
where the second equality follows from the independence of U from Bt1 , . . . , Btn , and the
third is just the statement that P(U ≠ t1 , . . . , tn ) = 1. So a1 Xt1 + · · · + an Xtn has the same
characteristic function as a1 Bt1 + · · · + an Btn for all times t1 , . . . , tn and all coefficients
a1 , . . . , an . It follows that Xt and Bt have the same marginal distributions, and that Xt is a
Gaussian process. But the function t 7→ Xt disagrees with t 7→ Bt at t = U (unless BU = 0,
which is an event of probability 0), hence with probability 1 the trajectory t 7→ Xt has a
discontinuity in [0, 1].
We therefore have to be careful to verify continuity properties separately from mar-
ginal distribution properties. In particular, Brownian motion is a continuous Gaussian
process; the “continuous” has to be added to the description, as it does not follow from a
description of the marginal distributions.
It now suffices to show that t 7→ Wt is continuous with probability 1. For t > 0 this
is clear, since t 7→ tB1/t is continuous wherever 1/t is continuous. At t = 0, first we
note that the above treatment shows that Bt and Wt have the same marginal distributions.
In particular, their distributions agree on the dense set of rationals Q ∩ (0, ∞). Since Bt
has continuous paths and Wt is continuous on (0, ∞), it follows (from the uniqueness
of Brownian motion) that Wt is continuous at 0 as well. Hence (Wt )t≥0 is a standard
Brownian motion.
Corollary 28.6. If (Bt )t≥0 is a standard Brownian motion, then limt→∞ Bt /t = 0.
Remark 28.7. For integer t = n, we can write Bn = Bn − B0 = Σ_{k=1}^n [Bk − Bk−1 ]. The
increments Bk − Bk−1 are i.i.d. with mean 0, and hence by the strong law of large numbers
Bn /n → 0 with probability 1. Extending this argument to t ∉ N directly takes a lot of
work (owing to the very rough nature of Brownian motion: just because it is continuous
doesn’t mean there’s a very nice way to bound Bt by Bn and Bn+1 for n ≤ t < n + 1). But
time-inversion gives an immediate easy proof.
Proof. Let Wt = tB1/t ; then W1/t = Bt /t. In Lemma 28.5, we saw that Wt is a standard
Brownian motion; in particular this means W0 = 0 and t 7→ Wt is continuous. It follows
that limt→∞ Bt /t = limt→∞ W1/t = lims↓0 Ws = 0.
28.1. Donsker’s Scaling Limit. Brownian motion is the pre-eminent Gaussian process.
The Gaussian distribution arises as the limit distribution in the central limit theorem; sim-
ilarly, Brownian motion arises as the “limit process” in a collection of limit theorems. The
gist is that if one rescales a discrete-time random walk with space scaled as the square-
root of time, then in the scaling limit the result is Brownian motion. Here is a precise
statement.
Theorem 28.8 (Donsker’s Invariance Principle). Let (Xn )n≥0 be SRW on Z. Let c > 0, and
define a continuous-time, continuous process X_t^{(c)} as follows: for t ≥ 0 such that ct ∈ N, set
X_t^{(c)} = c^{−1/2} X_{ct} ; for other times t, define X_t^{(c)} by linear interpolation. Then for any times 0 ≤
t1 < · · · < tn ,
(X_{t1}^{(c)} , . . . , X_{tn}^{(c)} ) → (Bt1 , . . . , Btn )  in distribution
as c → ∞, where (Bt )t≥0 is a standard Brownian motion.
The proof, boiled down to its basic idea, is just repeated application of the central limit
theorem. Indeed, one does not need to use the SRW; instead, one could take Xn = Y1 +
· · · + Yn where (Yn )n≥0 is a sequence of i.i.d. random variables each with E[Yj ] = 0 and
Var[Yj ] = 1. (If one takes Var[Yj ] = σ^2 then the theorem holds with a Brownian motion
of variance σ^2 in the limit.) Note also that the linear-interpolation definition of the scaled
process X_t^{(c)} is included only to make these processes continuous. One could just as well
use (and indeed the proof uses)
X̃_t^{(c)} = c^{−1/2} X_{⌊ct⌋} .
The sense of convergence stated in Theorem 28.8 is very weak; much stronger forms of
convergence hold. For example, the random continuous path (t 7→ X_t^{(c)} )_{0≤t≤R} converges
to the Brownian path (t 7→ Bt )_{0≤t≤R} uniformly, with probability 1, for any R > 0. (This is
one way to prove the existence of Brownian motion – if one can show this uniform limit
exists, then the limit is a uniform limit of continuous paths and therefore has continuous
paths; that it satisfies the marginal distribution properties of Brownian motion is then
verified via central limit theorem considerations as above.)
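A sketch of the rescaling in Donsker's theorem: generate the SRW, form X_t^{(c)} = c^{−1/2} X_{ct} at the integer grid times, and check that the marginal at t = 1 looks standard normal and that the increment over [1, 2] has variance 1 and is (nearly) uncorrelated with it; c = 1000 is a hypothetical choice.

import numpy as np

rng = np.random.default_rng(8)
c, trials = 1_000, 20_000

# X^(c)_1 = X_c / sqrt(c), and the increment X^(c)_2 - X^(c)_1 = (X_{2c} - X_c)/sqrt(c).
S1 = rng.choice([-1, 1], size=(trials, c)).sum(axis=1)
S2 = rng.choice([-1, 1], size=(trials, c)).sum(axis=1)
X1 = S1 / np.sqrt(c)
inc = S2 / np.sqrt(c)

print("Var(X^(c)_1)           ~", X1.var(), "(Brownian motion: 1)")
print("Var(X^(c)_2 - X^(c)_1) ~", inc.var(), "(Brownian motion: 2 - 1 = 1)")
print("correlation            ~", np.corrcoef(X1, inc)[0, 1], "(independent increments: 0)")
print("P(X^(c)_1 <= 1)        ~", np.mean(X1 <= 1), "(N(0,1): about 0.84)")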
Theorem 29.2 (Strong Markov property). Let (Bt )t≥0 be a standard Brownian motion, and let
T be a stopping time. For t ≥ 0, set Xt = BT +t − BT . Then (Xt )t≥0 is a standard Brownian
motion, independent from (Bt )0≤t≤T .
Example 29.3. Fix x > 0; we will determine the distribution of the first passage time τx .
This can be done with the following trick. If we had a discrete random variable Y , then
we could compute using the Law of Total Probability that, for any event A,
P(A) = Σ_y P(A, Y = y) = Σ_y P(A|Y = y)P(Y = y).
This formula does not make sense if Y is a continuous random variable with density fY ,
since the sum is over a continuous set, and since P(Y = y) = 0. The analog of this formula
which does hold is
P(A) = ∫ P(A|Y = y) fY (y) dy.    (29.1)
The range of the integral is over all possible values that Y can take on the event A. We apply
this in the setting A = {Bt > x} and Y = τx . Note: when Bt > x, since B0 = 0 and the
Brownian path is continuous, by the intermediate value theorem there is a time s ∈ (0, t)
where Bs = x; hence τx ∈ (0, t). Thus
P(Bt > x) = ∫_0^t P(Bt > x|τx = s) fτx (s) ds.    (29.2)
We can deal with the integrand as follows. First, by definition Bτx = x. Also, as we
integrate over 0 < s < t, we may write t = s + (t − s). Hence
P(Bt > x|τx = s) = P(Bs+(t−s) − Bτx > 0|τx = s) = P(Bτx +(t−s) − Bτx > 0|τx = s).
By the strong Markov property, Bτx +t − Bτx is a standard Brownian motion, independent
from (Bs )0≤s≤τx . In particular, it is independent from τx , which is measurable with respect
to (Bs )0≤s≤τx , and so
P(Bτx +(t−s) − Bτx > 0|τx = s) = P(Bτx +(t−s) − Bτx > 0) = P(Bt−s > 0) = 1/2
where the last equality follows from the fact that Bt−s is a N (0, t − s) random variable
(hence symmetric). Plugging this into Equation 29.2 yields
P(Bt > x) = ∫_0^t (1/2) fτx (s) ds = (1/2) P(τx ∈ (0, t)) = (1/2) P(τx < t).
Figure 1. The density fτ1 of the passage time to 1 for a standard Brownian motion.
Thus P(τx < t) = 2P(Bt > x), which is a probability we can calculate explicitly (in terms
of the Gaussian kernel).
P(τx < t) = 2P(Bt > x) = 2 ∫_x^∞ (1/√(2πt)) e^{−y²/2t} dy.
To calculate the density fτx , it is convenient to make the change of variables u = y/√t, so
the integral becomes
P(τx < t) = 2 ∫_{x/√t}^∞ (1/√(2π)) e^{−u²/2} du.
This is convenient because the t only appears in the limits of integration. So we can
differentiate easily:
fτx (t) = (d/dt) P(τx < t) = −(2/√(2π)) e^{−(x/√t)²/2} (d/dt)(x/√t) = (x/√(2πt³)) e^{−x²/2t} .
We assumed that x > 0 in all of the above. Note, however, that (−Bt )t≥0 is also a standard
Brownian motion (as can be very easily verified from the definition), and so by symmetry
we will then have the more general formula
fτx (t) = (|x|/√(2πt³)) e^{−x²/2t} ,    x ∈ R.
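A simulation sketch of the passage-time law: discretize Brownian motion on a fine grid, record the first time each path exceeds x = 1, and compare the empirical P(τ1 < t) with the formula 2P(Bt > 1); the grid size and horizon are hypothetical choices, and discretization makes the empirical values come out slightly low.

import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(9)
trials, dt, T, x = 10_000, 0.002, 2.0, 1.0
n = int(round(T / dt))

incs = rng.normal(0.0, np.sqrt(dt), size=(trials, n))
B = np.cumsum(incs, axis=1)                          # discretized Brownian paths on (0, T]

crossed = B >= x
hit = crossed.any(axis=1)
first = np.argmax(crossed, axis=1)                   # first grid index with B >= x (0 if never)
tau = np.where(hit, (first + 1) * dt, np.inf)        # discretized passage time tau_x

for t in (0.5, 1.0, 2.0):
    emp = np.mean(tau < t)
    theory = erfc(x / sqrt(2.0 * t))                 # = 2 P(N(0, t) > x) = P(tau_x < t)
    print(f"t = {t}: simulated {emp:.3f}   formula {theory:.3f}")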
Example 29.4. Consider the Maximum process for Brownian motion: Mt = max0≤s≤t Bs .
First note that Mt ≥ 0 for all t, since Mt is increasing and M0 = B0 = 0. We can calculate
the distribution of the random variable Mt using the passage time. Notice that the event
{Mt ≥ x} is equal to the event that there is some time s ≤ t for which Bs ≥ x. Since
Brownian paths are continuous, this is equal to the event that there is some (possibly
different) time s0 ≤ t where Bs0 = x, and so we have {Mt ≥ x} = {τx ≤ t}. Thus, by Example 29.3,
P(Mt ≥ x) = P(τx ≤ t) = 2P(Bt > x) = P(|Bt | > x); that is, Mt has the same distribution as |Bt |.
29.1. The Zero Set of Brownian Motion. Let (Bt )t≥0 be a standard Brownian motion,
and denote by Z the (random) set of zeroes of Bt :
Z = {t ≥ 0 : Bt = 0}.
Using techniques like the ones above, we can prove the following exact formula, which is
a great asset in understanding the zeroes, and general behavior, of Brownian motion.
Proposition 29.5 (Arcsine Law for Zeroes of BM). Let 0 < s < t. Define
Θ(s, t) = P(Z ∩ [s, t] ≠ ∅).
Then
Θ(s, t) = 1 − (2/π) arcsin √(s/t).
Proof. To begin, we will condition on the value of Brownian motion at time s. Using
Equation 29.1,
Θ(s, t) = P(Z ∩ [s, t] ≠ ∅) = ∫_{−∞}^∞ P(Z ∩ [s, t] ≠ ∅ | Bs = x) fBs (x) dx.
The density fBs is a centered Gaussian of variance s. As for the conditional probability,
we can use the Markov property: given Bs = x, we restart the process at time s: take
Xr = Bs+r − Bs = Bs+r − x. Then (Xr )r≥0 is a standard Brownian motion. Translating, the
event that B· has a zero in [s, t] is the same as the event that X· reaches height −x in the
interval [0, t − s]. Since X· is a standard Brownian motion, this is the event that τ−x ≤ t − s.
Hence:
Θ(s, t) = ∫_{−∞}^∞ P(τ−x ≤ t − s) (1/√(2πs)) e^{−x²/2s} dx.
Referring to Example 29.3, the random variable τ−x has density fτ−x (t) = (|x|/√(2πt³)) e^{−x²/2t} .
This is symmetric in x, as is the Gaussian density, so we have
Θ(s, t) = 2 ∫_0^∞ [ ∫_0^{t−s} fτ−x (r) dr ] (1/√(2πs)) e^{−x²/2s} dx
= √(2/(πs)) ∫_0^∞ e^{−x²/2s} [ ∫_0^{t−s} (x/√(2πr³)) e^{−x²/2r} dr ] dx.
We must now evaluate this double integral. To start, we exchange the order of integration
(since everything in sight is positive, this follows from Fubini’s theorem).
Θ(s, t) = (1/(π√s)) ∫_0^{t−s} r^{−3/2} dr ∫_0^∞ x e^{−x²/2r − x²/2s} dx.
The inside integral can be explicitly evaluated, since the derivative (d/dx) e^{−x²/2r − x²/2s} is equal
to −(1/r + 1/s) x e^{−x²/2r − x²/2s} . Thus
Θ(s, t) = (1/(π√s)) ∫_0^{t−s} r^{−3/2} · [ (−1/(1/r + 1/s)) e^{−x²/2r − x²/2s} ]_{x=0}^{x→∞} dr
= (1/(π√s)) ∫_0^{t−s} r^{−3/2} (rs/(r + s)) dr = (1/π) ∫_0^{t−s} √(s/r) dr/(r + s).
Now, we substitute u = √(r/s). Then r = su², so dr = 2su du. The limits of integration
become 0 up to √((t − s)/s), and we have
Θ(s, t) = (1/π) ∫_0^{√((t−s)/s)} (1/u) · 2su du/(su² + s) = (2/π) ∫_0^{√((t−s)/s)} du/(1 + u²).
The antiderivative of 1/(1 + u²) is arctan u, so we can write this as
Θ(s, t) = (2/π) [ arctan √((t − s)/s) − arctan 0 ] = (2/π) arctan √((t − s)/s).
This is a perfectly good final answer, but it is more common to write this as an arcsine.
If we let θ = arctan √((t − s)/s), then θ is an angle in a right triangle whose opposite side has
length √(t − s) and adjacent side has length √s. The hypotenuse therefore has length
√(s + (t − s)) = √t. Hence cos θ = √(s/t). Using the relation sin(π/2 − θ) = cos θ, we therefore
have π/2 − θ = arcsin √(s/t), and so
Θ(s, t) = (2/π) [ π/2 − arcsin √(s/t) ] = 1 − (2/π) arcsin √(s/t).
We can use the arcsine law for the zeroes to show, for example, that Brownian motion
(in one dimension) is recurrent.
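A simulation sketch of the arcsine law: on a fine grid, declare that a path has a zero in [s, t] when it changes sign there, and compare with Θ(s, t) = 1 − (2/π) arcsin √(s/t). The values s = 0.25, t = 1 are hypothetical; the grid misses some very short excursions across 0, so the empirical frequency comes out slightly low.

import numpy as np

rng = np.random.default_rng(10)
trials, dt, s, t = 20_000, 0.002, 0.25, 1.0
n = int(round(t / dt))

incs = rng.normal(0.0, np.sqrt(dt), size=(trials, n))
B = np.cumsum(incs, axis=1)                     # discretized Brownian paths on (0, t]

i0 = int(round(s / dt))                         # grid index of time s
seg = B[:, i0 - 1:]                             # the path restricted to [s, t]
has_zero = np.any(seg[:, 1:] * seg[:, :-1] <= 0, axis=1)   # a sign change (or exact zero) in [s, t]

print("P(zero in [s, t]) ~", has_zero.mean())
print("arcsine law       :", 1 - (2 / np.pi) * np.arcsin(np.sqrt(s / t)))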
Theorem 29.7 (Recurrence of Brownian Motion on R). Let (Bt )t≥0 be a standard Brownian
motion on R. Then the zero set Z is unbounded with probability 1. (I.e. the process returns to 0
at arbitrarily large times.)
Proof. If the process is not recurrent, there is some time s after which there are no zeroes
(with positive probability):
P(∀t ≥ s, Bt ≠ 0) = P(Z ∩ [s, ∞) = ∅).
The event {Z ∩ [s, ∞) = ∅} is contained in the event {Z ∩ [s, t] = ∅} for any t > s.
Hence
P(Z ∩ [s, ∞) = ∅) ≤ P(Z ∩ [s, t] = ∅) = 1 − Θ(s, t) = (2/π) arcsin √(s/t).
As t → ∞, arcsin √(s/t) → 0 for any fixed s, and hence P(Z ∩ [s, ∞) = ∅) = 0 for any s > 0.
Thus, there are arbitrarily large elements in Z , with probability 1.
Remark 29.8. Since we proved, early on, that SRW on Z is recurrent, it is natural that
its scaling limit Brownian motion is recurrent as well – but this is highly non-obvious.
Indeed, given that space is scaled along with time, the kind of recurrence that is more natural
to expect for Brownian motion is this: given any ε > 0, the set of t ≥ 0 for which |Bt | ≤ ε
is unbounded with probability 1. The fact that the Brownian motion actually returns to 0
infinitely often is a very fine property. Now, one can consider Brownian motion on R^d :
simply take B_t^{(d)} = (B_t^1 , B_t^2 , . . . , B_t^d ) where the B_t^j are independent Brownian motions.
Donsker’s invariance theorem extends to this case, and B_t^{(d)} is the scaling limit of the
SRW on Z^d . Recall that SRW on Z^2 is recurrent, while SRW on Z^d is transient for d ≥ 3. If
we interpret recurrence and transience in the weaker sense of return to a neighborhood of
0, then these translate to Brownian motion as well. But it is not true that B_t^{(2)} is recurrent
in the strong sense: the event that {B_t^{(2)} ≠ (0, 0) for all large t} has positive probability.
For d ≥ 3, the results match up again: {|B_t^{(d)}| → ∞} has probability 1 for d ≥ 3.
The recurrence of Brownian motion on R, together with its time inversion property,
gives one of the most graphic demonstrations of just how rough Brownian paths are.
Corollary 29.9. For any ε > 0, (0, ε) ∩ Z ≠ ∅ with probability 1; hence it contains infinitely
many zeroes of (Bt )t≥0 .
Proof. Consider Wt = tB1/t . We have shown that (Wt )t≥0 is a standard Brownian motion,
and hence with probability 1 Wt = 0 for arbitrarily large times t. But Bt = tW1/t , and
therefore Bt has zeroes for t arbitrarily close to 0.
Remark 29.10. By the strong Markov property, if T is a stopping time at which BT = 0,
then the same argument shows that Z ∩ [T, T + ε) contains infinitely many zeroes of the
Brownian motion. It is important to note that this only holds true for stopping times T , not
just any random time at which the Brownian motion has value 0. So, for example, if we fix
a time t0 and define Tt0 = min{t ≥ t0 : Bt = 0}, then Tt0 is a stopping time, and indeed
zeroes of the Brownian motion will accumulate to the right of Tt0 . On the other hand,
there is a largest zero L smaller than 1 and, by Example 29.6, it has a positive probability
of being less than 0.9. Indeed, from the arcsine law for Z , given any interval [s, t] with
s < t, the probability that there are no zeroes in [s, t] is (2/π) arcsin √(s/t), which is strictly positive.
What can be proved, using the stopping times Tt0 above, is that Z is a perfect set: it is
closed (since t 7→ Bt is continuous) and contains no isolated points. So, considering the
largest zero L less than 1 again, there must be a sequence of zeroes less than L converging
to L from below. In general, the zero set has the topological structure of a Cantor set.