HMMs and the forward-backward algorithm
Ramesh Sridharan
These notes give a short review of Hidden Markov Models (HMMs) and the forward-backward algorithm. They're written assuming familiarity with the sum-product belief propagation algorithm, but should be accessible to anyone who's seen the fundamentals of HMMs before.
The notation here is borrowed from Introduction to Probability by Bertsekas & Tsitsiklis: random variables are represented with capital letters, values they take are represented with lowercase letters, $p_X$ represents a probability distribution for random variable $X$, and $p_X(x)$ represents the probability of value $x$ (according to $p_X$).
Hidden Markov Models
Figure 1 shows the (undirected) graphical model for HMMs. Here's a quick recap of the important facts:
Figure 1: An undirected graphical model for the HMM. Connections between nodes indicate dependence.
- We observe $Y_1$ through $Y_n$, which we model as being observed from hidden states $X_1$ through $X_n$.
- Any particular state variable $X_k$ depends only on $X_{k-1}$ (what came before it), $X_{k+1}$ (what comes after it), and $Y_k$ (the observation associated with it).
- The goal of the forward-backward algorithm is to find the conditional distribution over hidden states given the data.
In order to specify an HMM, we need three pieces:
Figure 2: A visualization of the forward and backward messages. Each message is a table that indicates what the node at the start point believes about the node at the end point.
- A transition distribution, $p_{X_{k+1}|X_k}(x_{k+1}\,|\,x_k) = W(x_{k+1}\,|\,x_k)$, which describes the distribution for the next state given the current state. (We'll only worry about homogeneous Markov chains, where the transition distribution doesn't change over time: that's why the $W$ and $A$ notations depend only on the values and not on the timepoints.) This is often represented as a matrix that we'll call $A$. Rows of $A$ correspond to the current state, columns correspond to the next state, and each entry corresponds to the transition probability. So, the entry at row $i$ and column $j$, $A_{ij}$, is $p_{X_{k+1}|X_k}(j\,|\,i)$, or equivalently $W(j\,|\,i)$.
- An observation distribution (also called an emission distribution) $p_{Y_k|X_k}(y_k\,|\,x_k) = p_{Y|X}(y_k\,|\,x_k)$, which describes the distribution for the output given the current state. (Once again, we'll focus on chains where the emission distribution doesn't change over time.) We'll represent this with a matrix $B$. Here, rows correspond to the current state, and columns correspond to the observation. So, $B_{ij} = p_{Y|X}(j\,|\,i)$: the probability of observing output $j$ from state $i$ is $B_{ij}$. Since the number of possible observations isn't necessarily the same as the number of possible states, $B$ won't necessarily be square.
- An initial state distribution $p_{X_1}$, which describes the starting distribution over states. We'll represent this with a vector called $\pi_0$, where item $i$ in the vector represents $p_{X_1}(i)$.
The forward-backward algorithm computes forward and backward messages as follows:
$$m_{(k-1)\to k}(x_k) = \sum_{x_{k-1}} \underbrace{m_{(k-2)\to(k-1)}(x_{k-1})}_{\text{prev. message}}\;\underbrace{p_{Y|X}(y_{k-1}\,|\,x_{k-1})}_{\text{observation term}}\;\underbrace{W(x_k\,|\,x_{k-1})}_{\text{transition term}}$$

$$m_{(k+1)\to k}(x_k) = \sum_{x_{k+1}} \underbrace{m_{(k+2)\to(k+1)}(x_{k+1})}_{\text{prev. message}}\;\underbrace{p_{Y|X}(y_{k+1}\,|\,x_{k+1})}_{\text{observation term}}\;\underbrace{W(x_{k+1}\,|\,x_k)}_{\text{transition term}}$$
These messages are illustrated in Figure 2. The first forward message $m_{0\to 1}(x_1)$ is initialized to $\pi_0(x_1) = p_{X_1}(x_1)$. The first backward message $m_{(n+1)\to n}(x_n)$ is initialized to uniform (this is equivalent to not including it at all).
Figure 3 illustrates the computation of one forward message $m_{2\to 3}(x_3)$.
To obtain a marginal distribution for a particular state given all the observations, $p_{X_k|Y_1,\dots,Y_n}$, we simply multiply the incoming messages together with the observation term, and then normalize:
$$p_{X_k|Y_1,\dots,Y_n}(x_k\,|\,y_1,\dots,y_n) \propto m_{(k-1)\to k}(x_k)\; m_{(k+1)\to k}(x_k)\; p_{Y|X}(y_k\,|\,x_k)$$
Here, the symbol $\propto$ means "is proportional to", and indicates that we have to normalize at the end so that the answer sums to 1.
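These recursions and the marginal computation translate directly into code. Below is a minimal NumPy sketch of the unnormalized $m$ messages; the function name hmm_messages and the 0-based indexing conventions are my own assumptions, not notation from these notes.

```python
import numpy as np

def hmm_messages(A, B, pi0, ys):
    """Unnormalized sum-product messages for an HMM (assumed conventions).

    A[i, j]  = W(j | i)        transition probability (row = current, column = next)
    B[i, y]  = p_{Y|X}(y | i)  emission probability (row = state, column = observation)
    pi0[i]   = p_{X1}(i)       initial state distribution
    ys       = observations y_1 ... y_n as 0-based column indices into B

    Returns fwd, bwd, post, where fwd[k] is the forward message into node k,
    bwd[k] is the backward message into node k, and post[k] is p(X_k | y_1, ..., y_n).
    """
    n, S = len(ys), len(pi0)
    fwd = np.zeros((n, S))
    bwd = np.zeros((n, S))

    fwd[0] = pi0                                  # m_{0 -> 1} is the initial distribution
    for k in range(1, n):
        # sum over x_{k-1} of (prev. message * observation term * transition term)
        fwd[k] = (fwd[k - 1] * B[:, ys[k - 1]]) @ A

    bwd[n - 1] = np.ones(S)                       # m_{(n+1) -> n} is uniform
    for k in range(n - 2, -1, -1):
        bwd[k] = A @ (bwd[k + 1] * B[:, ys[k + 1]])

    post = fwd * bwd * B[:, ys].T                 # multiply incoming messages and observation term
    post /= post.sum(axis=1, keepdims=True)       # normalize so each timestep sums to 1
    return fwd, bwd, post
```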
Traditionally, the forward-backward algorithm computes a slightly different set of messages. The forward message $\alpha_k$ represents a message from $k-1$ to $k$ that includes $p_{Y|X}(y_k\,|\,x_k)$, and the backward message $\beta_k$ represents a message from $k+1$ to $k$ identical to $m_{(k+1)\to k}$ above.
$$\alpha_k(x_k) = \underbrace{p_{Y|X}(y_k\,|\,x_k)}_{\text{observation term}} \sum_{x_{k-1}} \underbrace{\alpha_{k-1}(x_{k-1})}_{\text{prev. message}}\;\underbrace{W(x_k\,|\,x_{k-1})}_{\text{transition term}}$$

$$\beta_k(x_k) = \sum_{x_{k+1}} \underbrace{\beta_{k+1}(x_{k+1})}_{\text{prev. message}}\;\underbrace{p_{Y|X}(y_{k+1}\,|\,x_{k+1})}_{\text{observation term}}\;\underbrace{W(x_{k+1}\,|\,x_k)}_{\text{transition term}}$$
These messages have a particularly nice interpretation as probabilities:
$$\alpha_k(x_k) = p_{Y_1,\dots,Y_k,X_k}(y_1,\dots,y_k,x_k)$$
$$\beta_k(x_k) = p_{Y_{k+1},\dots,Y_n|X_k}(y_{k+1},\dots,y_n\,|\,x_k)$$
The initial forward message is initialized to $\alpha_1(x_1) = p_{X_1}(x_1)\, p_{Y|X}(y_1\,|\,x_1)$. To obtain a marginal distribution, we simply multiply the messages together and normalize:
$$p_{X_k|Y_1,\dots,Y_n}(x_k\,|\,y_1,\dots,y_n) \propto \alpha_k(x_k)\,\beta_k(x_k)$$
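As a companion to the sketch of the $m$ messages above, here is a minimal NumPy sketch of the $\alpha$/$\beta$ recursions under the same assumed conventions (the function name and 0-based indexing are mine).

```python
import numpy as np

def alpha_beta(A, B, pi0, ys):
    """alpha/beta recursions sketched from the formulas above (assumed conventions).

    A[i, j]  = W(j | i)        transition probability (row = current, column = next)
    B[i, y]  = p_{Y|X}(y | i)  emission probability (row = state, column = observation)
    pi0[i]   = p_{X1}(i)       initial state distribution
    ys       = observations y_1 ... y_n as 0-based column indices into B
    """
    n, S = len(ys), len(pi0)
    alpha = np.zeros((n, S))
    beta = np.zeros((n, S))

    alpha[0] = pi0 * B[:, ys[0]]                     # alpha_1 = p_{X1}(x_1) p_{Y|X}(y_1 | x_1)
    for k in range(1, n):
        # observation term * sum over x_{k-1} of (prev. message * transition term)
        alpha[k] = B[:, ys[k]] * (alpha[k - 1] @ A)

    beta[n - 1] = np.ones(S)                         # beta_n is all ones (uniform)
    for k in range(n - 2, -1, -1):
        # sum over x_{k+1} of (prev. message * observation term * transition term)
        beta[k] = A @ (beta[k + 1] * B[:, ys[k + 1]])

    post = alpha * beta                              # proportional to p(X_k | y_1, ..., y_n)
    post /= post.sum(axis=1, keepdims=True)          # normalize each timestep
    return alpha, beta, post
```

Kept unnormalized as above, alpha[k] and beta[k] carry exactly the probabilistic interpretations just given; in practice each step is often rescaled to avoid numerical underflow, which does not change the normalized marginals.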
Example
Suppose you send a robot to Mars. Unfortunately, it gets stuck in a canyon while landing
and most of its sensors break. You know the canyon has 3 areas. Areas 1 and 3 are sunny
and hot, while Area 2 is cold. You decide to plan a rescue mission for the robot from Area
3, knowing the following things about the robot:
Figure 3: An illustration of how to compute $m_{2\to 3}(x_3)$. In order for node 2 to summarize its belief about $X_3$, it must incorporate the previous message $m_{1\to 2}(x_2)$, its observation $p_{Y|X}(y_2\,|\,x_2)$, and the relationship $W(x_3\,|\,x_2)$ between $X_2$ and $X_3$.
- Every hour, it tries to move forward by one area (i.e. from Area 1 to Area 2, or Area 2 to Area 3). It succeeds with probability 0.75 and fails with probability 0.25. If it fails, it stays where it is. If it is in Area 3, it always stays there (and waits to be rescued).
- The temperature sensor still works. Every hour, we get a binary reading telling us whether the robot's current environment is hot or cold.
- We have no idea where the robot initially got stuck.
(a) Construct an HMM for this problem: define a transition matrix $A$, an observation matrix $B$, and an initial state distribution $\pi_0$.
(b) Suppose we observe the sequence (hot, cold, hot). First, before doing any computation, determine the sequence of locations. Then, compute the forward and backward messages, and determine the distribution for the second state using the messages. Do your answers match up?

Solution:
(a) We'll start with the transition matrix. Remember that each row corresponds to the current state, and each column corresponds to the next state. We'll use 3 states, each corresponding to an area.
- If the robot is in Area 1, it stays where it is with probability 0.25, moves to Area 2 with probability 0.75, and can't move to Area 3.
- Similarly, if the robot is in Area 2, it stays where it is with probability 0.25, can't move back to Area 1, and moves to Area 3 with probability 0.75.
- If the robot is in Area 3, it always stays in Area 3.
Each item above gives us one row of A. Putting it all together, we obtain
$$A = \begin{pmatrix} 0.25 & 0.75 & 0 \\ 0 & 0.25 & 0.75 \\ 0 & 0 & 1 \end{pmatrix}$$
where rows and columns are both ordered as Areas 1, 2, 3.
Next, let's look at the observation matrix. There are two possible observations, "hot" and "cold". Areas 1 and 3 always produce "hot" readings, while Area 2 always produces a "cold" reading:
$$B = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}$$
where rows are ordered as Areas 1, 2, 3 and columns are ordered as (hot, cold).
Last but not least, since we have no idea where the robot starts, our initial state distribution will be uniform:
$$\pi_0 = \begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix}$$
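For reference, here is the same model written out in NumPy. The choice of indices (Areas 1, 2, 3 as rows/columns 0, 1, 2; hot as column 0 and cold as column 1) is just a convention I am picking for the sketch.

```python
import numpy as np

# Robot-in-the-canyon HMM from part (a).
# States: Areas 1, 2, 3 -> indices 0, 1, 2.  Observations: hot -> 0, cold -> 1.
A = np.array([[0.25, 0.75, 0.00],   # from Area 1: stay, move forward, (can't reach Area 3)
              [0.00, 0.25, 0.75],   # from Area 2: (can't go back), stay, move forward
              [0.00, 0.00, 1.00]])  # Area 3 is absorbing

B = np.array([[1.0, 0.0],           # Area 1 always reads hot
              [0.0, 1.0],           # Area 2 always reads cold
              [1.0, 0.0]])          # Area 3 always reads hot

pi0 = np.full(3, 1/3)               # uniform: no idea where the robot got stuck
```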
(b) Before doing any computation, we see that the sequence (hot,cold,hot) could only have
been observed from the hidden state sequence (1,2,3). Make sure you convince yourself
this is true before continuing!
We'll start with the forward messages.
$$m_{1\to 2}(x_2) = \sum_{x_1} \underbrace{m_{0\to 1}(x_1)\, p_{Y|X}(y_1\,|\,x_1)}_{\text{depends only on } x_1 \text{ and } y_1}\;\psi(x_1, x_2)$$
where $\psi(x_1, x_2) = W(x_2\,|\,x_1)$ is the transition term.
The output message should have three different possibilities, one for each value of $x_2$. We can therefore represent it as a vector indexed by $x_2$:
$$\begin{pmatrix} \text{value for } x_2 = 1 \\ \text{value for } x_2 = 2 \\ \text{value for } x_2 = 3 \end{pmatrix}$$
For each term in the sum (i.e., each possible value of $x_1$):
- $m_{0\to 1}$ comes from the initial distribution. Normally it would come from the previous message, but our first forward message is always set to the initial state distribution.
- $p_{Y|X}(y_1\,|\,x_1)$ comes from the column of $B$ corresponding to our observation $y_1 = $ hot.
- $\psi$ comes from a row of $A$: we are fixing $x_1$ and asking about possible values for $x_2$, which corresponds exactly to the transition distributions given in the rows of $A$ (remember that the rows of $A$ correspond to the current state and the columns correspond to the next state).
So, we obtain
$$m_{1\to 2} = \underbrace{\tfrac{1}{3}\cdot 1\cdot\begin{pmatrix}0.25\\0.75\\0\end{pmatrix}}_{x_1=1} + \underbrace{\tfrac{1}{3}\cdot 0\cdot\begin{pmatrix}0\\0.25\\0.75\end{pmatrix}}_{x_1=2} + \underbrace{\tfrac{1}{3}\cdot 1\cdot\begin{pmatrix}0\\0\\1\end{pmatrix}}_{x_1=3} \propto \begin{pmatrix}1/8\\3/8\\4/8\end{pmatrix}$$
Since our probabilities are eventually computed by multiplying messages and normalizing, we can arbitrarily renormalize at any step to make the computation easier; that is what the $\propto$ above indicates.
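This hand computation is easy to check numerically. The sketch below redefines the matrices from part (a) and uses the same assumed 0-based indexing (hot = column 0).

```python
import numpy as np

A = np.array([[0.25, 0.75, 0.0], [0.0, 0.25, 0.75], [0.0, 0.0, 1.0]])
B = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])   # columns: hot, cold
pi0 = np.full(3, 1/3)
hot = 0

# m_{1->2}(x_2) = sum over x_1 of pi0(x_1) * p(hot | x_1) * W(x_2 | x_1)
m12 = (pi0 * B[:, hot]) @ A
print(m12)              # [0.0833... 0.25 0.3333...]
print(m12 / m12.sum())  # [0.125 0.375 0.5], i.e. (1/8, 3/8, 4/8) after renormalizing
```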
For the second message, we perform a similar computation:
$$m_{2\to 3}(x_3) = \sum_{x_2} m_{1\to 2}(x_2)\, p_{Y|X}(y_2\,|\,x_2)\,\psi(x_2, x_3)$$
$$m_{2\to 3} = \underbrace{\tfrac{1}{8}\cdot 0\cdot\begin{pmatrix}0.25\\0.75\\0\end{pmatrix}}_{x_2=1} + \underbrace{\tfrac{3}{8}\cdot 1\cdot\begin{pmatrix}0\\0.25\\0.75\end{pmatrix}}_{x_2=2} + \underbrace{\tfrac{4}{8}\cdot 0\cdot\begin{pmatrix}0\\0\\1\end{pmatrix}}_{x_2=3} \propto \begin{pmatrix}0\\1/4\\3/4\end{pmatrix}$$
The backwards messages are computed using a similar formula:
$$m_{3\to 2}(x_2) = \sum_{x_3} \underbrace{m_{4\to 3}(x_3)\, p_{Y|X}(y_3\,|\,x_3)}_{\text{depends only on } x_3}\;\psi(x_2, x_3)$$
The first backwards message, $m_{4\to 3}(x_3)$, is always initialized to uniform since we have
no information about what the last state should be. Note that this is equivalent to not
including that term at all.
For each value of $x_3$, the transition term $\psi(x_2, x_3)$ is now drawn from a column of $A$, since we are interested in the probability of arriving at $x_3$ from each possible state for $x_2$. We compute the messages as:
$$m_{3\to 2} = \underbrace{1\cdot 1\cdot\begin{pmatrix}0.25\\0\\0\end{pmatrix}}_{x_3=1} + \underbrace{1\cdot 0\cdot\begin{pmatrix}0.75\\0.25\\0\end{pmatrix}}_{x_3=2} + \underbrace{1\cdot 1\cdot\begin{pmatrix}0\\0.75\\1\end{pmatrix}}_{x_3=3} \propto \begin{pmatrix}1/8\\3/8\\4/8\end{pmatrix}$$
Similarly, the second backwards message is:
$$m_{2\to 1} = \underbrace{\tfrac{1}{8}\cdot 0\cdot\begin{pmatrix}0.25\\0\\0\end{pmatrix}}_{x_2=1} + \underbrace{\tfrac{3}{8}\cdot 1\cdot\begin{pmatrix}0.75\\0.25\\0\end{pmatrix}}_{x_2=2} + \underbrace{\tfrac{4}{8}\cdot 0\cdot\begin{pmatrix}0\\0.75\\1\end{pmatrix}}_{x_2=3} \propto \begin{pmatrix}3/4\\1/4\\0\end{pmatrix}$$
Notice from the symmetry of the problem that our forwards messages and backwards
messages were the same.
To compute the marginal distribution for $X_2$ given the data, we multiply the messages and the observation:
$$p_{X_2|Y_1,\dots,Y_n}(x_2\,|\,y_1,\dots,y_n) \propto m_{1\to 2}(x_2)\, m_{3\to 2}(x_2)\, p_{Y|X}(y_2\,|\,x_2) \propto \begin{pmatrix}\tfrac{1}{8}\cdot\tfrac{1}{8}\cdot 0\\[3pt] \tfrac{3}{8}\cdot\tfrac{3}{8}\cdot 1\\[3pt] \tfrac{4}{8}\cdot\tfrac{4}{8}\cdot 0\end{pmatrix} \propto \begin{pmatrix}0\\1\\0\end{pmatrix}$$
Notice that in this case, because of our simplified observation model, the observation
cold allowed us to determine the state. This matches up with our earlier conclusion
that the robot must have been in Area 2 during the second hour.
If we were to compute $\alpha$ and $\beta$ messages instead, we would start with our initial message, $\alpha_1$:
$$\alpha_1(x_1) = p_{X_1}(x_1)\, p_{Y|X}(y_1\,|\,x_1) = \begin{pmatrix}1/3\\0\\1/3\end{pmatrix}$$
The first real message, $\alpha_2$, is computed as follows:
$$\alpha_2 = \begin{pmatrix}0\\1\\0\end{pmatrix} \odot \left[\underbrace{\tfrac{1}{3}\begin{pmatrix}0.25\\0.75\\0\end{pmatrix}}_{x_1=1} + \underbrace{0\cdot\begin{pmatrix}0\\0.25\\0.75\end{pmatrix}}_{x_1=2} + \underbrace{\tfrac{1}{3}\begin{pmatrix}0\\0\\1\end{pmatrix}}_{x_1=3}\right] \propto \begin{pmatrix}0\\1\\0\end{pmatrix}$$
Here the first vector is the observation term $p_{Y|X}(y_2\,|\,x_2)$ (with $y_2 = $ cold), $\odot$ denotes an element-wise product, and the bracketed sum runs over $x_1$.
The second message, $\alpha_3$, is similar:
$$\alpha_3 = \begin{pmatrix}1\\0\\1\end{pmatrix} \odot \left[\underbrace{0\cdot\begin{pmatrix}0.25\\0.75\\0\end{pmatrix}}_{x_2=1} + \underbrace{1\cdot\begin{pmatrix}0\\0.25\\0.75\end{pmatrix}}_{x_2=2} + \underbrace{0\cdot\begin{pmatrix}0\\0\\1\end{pmatrix}}_{x_2=3}\right] \propto \begin{pmatrix}0\\0\\1\end{pmatrix}$$
The $\beta$ messages would be identical to our backwards messages computed earlier.
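Finally, here is a short end-to-end check of part (b): running the $\alpha$/$\beta$ recursions on the robot HMM (same assumed 0-based indexing as in the earlier sketches) reproduces the posterior we found for the second state.

```python
import numpy as np

A = np.array([[0.25, 0.75, 0.0], [0.0, 0.25, 0.75], [0.0, 0.0, 1.0]])
B = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])   # columns: hot, cold
pi0 = np.full(3, 1/3)
hot, cold = 0, 1
ys = [hot, cold, hot]                                # the observed sequence

n, S = len(ys), len(pi0)

alpha = np.zeros((n, S))
alpha[0] = pi0 * B[:, ys[0]]                         # alpha_1
for k in range(1, n):
    alpha[k] = B[:, ys[k]] * (alpha[k - 1] @ A)

beta = np.zeros((n, S))
beta[n - 1] = np.ones(S)                             # beta_n
for k in range(n - 2, -1, -1):
    beta[k] = A @ (beta[k + 1] * B[:, ys[k + 1]])

post = alpha * beta
post /= post.sum(axis=1, keepdims=True)
print(post[1])   # [0. 1. 0.] -- the robot was in Area 2 during the second hour
print(post)      # full posterior puts the robot in Area 1, then 2, then 3
```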