Understanding Hidden Markov Models

Hidden Markov models are a type of probabilistic graphical model that can be represented as a Bayes net. The key properties of hidden Markov models are: 1) They consist of a set of states, initial state probabilities, and state transition probabilities between each pair of states. 2) They assume the Markov property - the probability of moving to the next state depends only on the current state, not any previous states. 3) The state at each time step is represented by a random variable, and the goal is to compute the probability of being in each state at each time step using dynamic programming techniques like the forward algorithm.


Hidden Markov Models

Ronald J. Williams
CSG220
Spring 2007

Contains several slides adapted from an Andrew Moore tutorial on this topic and a few figures
from Russell & Norvig’s AIMA site and Alpaydin’s Introduction to Machine Learning site.

A Simple Markov Chain


[Figure: a three-state chain over s1, s2, s3, with transition probabilities on the arcs and starting probabilities at the nodes.]

Numbers at the nodes represent the probability of starting at the corresponding state.
Numbers on the arcs represent transition probabilities.
At each time step, t = 1, 2, ..., a new state is selected randomly according to the distribution at the current state.

Let Xt be a random variable for the state at time step t.
Let xt represent the actual value of the state at time t. In this example, xt can be s1, s2, or s3.

Hidden Markov Models: Slide 2


Markov Property
• For any t, Xt+1 is conditionally independent of {Xt-1, Xt-2, … X1}
given Xt.
• In other words:
P(Xt+1 = sj |Xt = si ) = P(Xt+1 = sj |Xt = si ,any earlier history)
• Question: What would be the best Bayes Net structure to
represent the Joint Distribution of (X1, X2, … , Xt-1, Xt, ...)?
• Answer:

X1 → X2 → ... → Xt-1 → Xt → ...

Hidden Markov Models: Slide 4

Markov chain as a Bayes net

X1 → X2 → ... → Xt-1 → Xt → ...

CPT for X1: P(X1 = si) = πi for i = 1, ..., N.

CPT at every other node (the same at each; rows indexed by current state i, columns by next state j):

       1    2   ...   j   ...   N
  1   a11  a12  ...  a1j  ...  a1N
  2   a21  a22  ...  a2j  ...  a2N
  :    :    :         :         :
  i   ai1  ai2  ...  aij  ...  aiN
  :    :    :         :         :
  N   aN1  aN2  ...  aNj  ...  aNN

Notation:
  aij = P(Xt+1 = sj | Xt = si)
  πi = P(X1 = si)
Hidden Markov Models: Slide 5

Markov Chain: Formal Definition

A Markov chain is a 3-tuple consisting of

• a set of N possible states {s1, s2, ..., sN}
• the starting state probabilities {π1, π2, ..., πN}, where πi = P(X1 = si)
• the state transition probabilities aij = P(Xt+1 = sj | Xt = si), arranged as the matrix

  a11 a12 ... a1N
  a21 a22 ... a2N
   :   :       :
  aN1 aN2 ... aNN

Hidden Markov Models: Slide 6

Computing stuff in Markov chains
• Some notation and assumptions
• Assume time t runs from 1 to T
• Recall that Xt is the r.v. representing the state at
time t and xt denotes the actual value
• Use Xt1:t2 and xt1:t2 as shorthand for
(Xt1, Xt1+1, ..., Xt2) and (xt1, xt1+1, ... xt2),
respectively
• Use notation like P(xt) as shorthand for P(Xt=xt)

Hidden Markov Models: Slide 7

What is P(Xt = si)? 1st attempt


Step 1: Work out how to compute P(x1:t) for any state sequence x1:t

  P(x1:t) = P(xt | x1:t-1) P(x1:t-1)
          = P(xt | x1:t-1) P(xt-1 | x1:t-2) P(x1:t-2)        WHY?
          ...
          = P(xt | x1:t-1) P(xt-1 | x1:t-2) ··· P(x2 | x1:1) P(x1:1)
          = P(xt | xt-1) P(xt-1 | xt-2) ··· P(x2 | x1) P(x1)     (by the Markov property)

Step 2: Use this knowledge to get P(Xt = si)

  P(Xt = si) = Σ over sequences for which xt = si of P(x1:t)

This computation is exponential in t.
Hidden Markov Models: Slide 8

State sequence as a path through a trellis

[Figure: trellis with one column of the N states per time step.]

There are exponentially many paths, but each path goes through exactly one of the N states at each time step.
Hidden Markov Models: Slide 9

What is P(Xt =si)? Clever approach


• For each state si, define

  pt(i) = P(Xt = si)

• Express inductively:

  p1(i) ≡ P(X1 = si) = πi   for all i

  pt+1(j) ≡ P(Xt+1 = sj)
          = Σ_{i=1..N} P(Xt+1 = sj | Xt = si) P(Xt = si)
          = Σ_{i=1..N} aij pt(i)   for all j

Hidden Markov Models: Slide 10

What is P(Xt =si)? Clever approach
• For each state si, define pt(i) = P(Xt = si) and express it inductively as on the previous slide:

  p1(i) = πi        pt+1(j) = Σ_{i=1..N} aij pt(i)

• Picture a table with one row per state index 1, ..., N and one column per time step 1, 2, ..., T.
• Computation is simple: just fill in this table one column at a time, from left to right.
• Cells in this table correspond to nodes in the trellis.
Hidden Markov Models: Slide 11

What is P(Xt =si)? Clever approach


• For each state si, define pt(i) = P(Xt = si), computed inductively as on the previous slides.
• Cost of computing pt(i) for all states si and all t ≤ T is now O(TN²).
• The first way was O(N^T).
• This was a simple example. It was meant to warm you up to this trick, called Dynamic Programming, because HMM computations involve many tricks just like this.

Hidden Markov Models: Slide 12
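As a concrete illustration of the table-filling trick (not part of the slides), here is a minimal Python sketch; the function name `state_distribution` is our own, and the chain used is the three-state example that appears later in the deck:

```python
# Dynamic-programming computation of the state distribution of a Markov
# chain: fill in an N x T table one column at a time, in O(T N^2) time.
pi = [1/2, 1/2, 0]                     # pi_i = P(X1 = s_i)
A = [[0, 1/3, 2/3],                    # A[i][j] = a_ij = P(X_{t+1}=s_j | X_t=s_i)
     [1/3, 0, 2/3],
     [1/3, 1/3, 1/3]]

def state_distribution(pi, A, T):
    """Return p where p[t-1][i] = p_t(i) = P(X_t = s_i), for t = 1..T."""
    N = len(pi)
    p = [list(pi)]                     # base case: p_1(i) = pi_i
    for _ in range(T - 1):             # inductive step, one column at a time
        prev = p[-1]
        p.append([sum(A[i][j] * prev[i] for i in range(N)) for j in range(N)])
    return p

p = state_distribution(pi, A, 3)       # columns t = 1, 2, 3 of the table
```

Each column is a probability distribution over the N states, so every column should sum to 1.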

Inductive step:
graphical representation

pt+1(j) = Σ_{i=1..N} aij pt(i)

[Figure: each pt(i) node at time t feeds the pt+1(j) node at time t+1.]

Compare this with similar depictions of updates we'll use in HMMs.

Hidden Markov Models: Slide 13

Hidden State
• Given a Markov model of a process, computation of various
quantities of interest (e.g., probabilities) is straightforward if
the state is observable – use techniques like the one just
described.
• More realistic: assume the true state is not observable – only
have observations that depend on, but do not fully determine,
the actual states.
• Examples
• Robot localization
• state = actual location
• observations = (noisy) sensor readings
• Speech recognition
• state sequence => word
• observations = acoustic signal
• In this situation, we say the state is hidden
• Model this using a Hidden Markov Model (HMM)
Hidden Markov Models: Slide 14

HMMs
• An HMM is just a Markov chain augmented
with
• a set of M possible observations {o1, o2, ..., oM}
• for each state s1, s2, ..., sN a distribution over
possible observations that might be sensed in that
state
• We’ll let Zt be the r.v. for the observation that
occurs at time t (with zt representing the
actual observation)
• In addition, we’ll assume that the observation
at time t depends only on the state at time t, in
the sense about to be described
Hidden Markov Models: Slide 15

Markov Property of Observations
• For any t, Zt is conditionally independent of {Xt-1, Xt-2, … X1,
Zt-1, Zt-2, ..., Z1} given Xt.
• In other words:
P(Zt = oj |Xt = si ) = P(Zt = oj |Xt = si ,any earlier history)
• Question: What would be the best Bayes Net structure to
represent the Joint Distribution of (X1, Z1, X2, Z2, … , Xt-1,Zt-1,
Xt, Zt, ...)?
• Answer:
X1 → X2 → ... → Xt-1 → Xt → ...
 ↓    ↓          ↓      ↓
 Z1   Z2        Zt-1    Zt
Hidden Markov Models: Slide 17

HMM as a Bayes Net


X1 → X2 → ... → Xt-1 → Xt → ...
 ↓    ↓          ↓      ↓
 Z1   Z2        Zt-1    Zt

This is the CPT for every Z node (rows indexed by state i, columns by observation ok):

       o1      o2     ...    ok     ...    oM
  1  b1(o1)  b1(o2)  ...  b1(ok)  ...  b1(oM)
  2  b2(o1)  b2(o2)  ...  b2(ok)  ...  b2(oM)
  3  b3(o1)  b3(o2)  ...  b3(ok)  ...  b3(oM)
  :    :       :            :            :
  i  bi(o1)  bi(o2)  ...  bi(ok)  ...  bi(oM)
  :    :       :            :            :
  N  bN(o1)  bN(o2)  ...  bN(ok)  ...  bN(oM)

Notation: bi(ok) = P(Zt = ok | Xt = si)


Hidden Markov Models: Slide 18

Are HMMs Useful?
You bet !!
• Robot planning & sensing under uncertainty (e.g.
Reid Simmons / Sebastian Thrun / Sven Koenig)
• Robot learning control (e.g. Yangsheng Xu’s work)
• Speech Recognition/Understanding
Phones → Words, Signal → phones
• Human Genome Project
Complicated stuff your lecturer knows nothing
about.
• Consumer decision modeling
• Economics & Finance.
Plus at least 5 other things I haven’t thought of.
Hidden Markov Models: Slide 19

Dynamic Bayes Nets


• An HMM is actually a special case of a more
general concept: Dynamic Bayes Net (DBN)
• Can decompose into multiple state variables
and multiple observation variables at each
time slice, with only direct influences
represented explicitly
• (1st order) Markov property: nodes in any time
slice have arcs only from nodes in their own
or the immediately preceding time slice
• Higher-order Markov models also easily
represented in this framework
Hidden Markov Models: Slide 20

DBN Example

[Figure: DBN for a linear dynamical system with position sensors; e.g., target tracking.]

Hidden Markov Models: Slide 21

Another DBN Example

[Figure: DBN modeling a robot with position sensors and a battery charge meter.]

Hidden Markov Models: Slide 22

Back to HMMs ...
Summary of our HMM notation:
• Xt = state at time t (r.v.)
• Zt = observation at time t (r.v.)
• Vt1:t2 = (Vt1, Vt1+1, ..., Vt2) for any time-indexed r.v. V
• Possible states = {s1, s2, ..., sN}
• Possible observations = {o1, o2, ..., oM}
• vt = actual value of r.v. V at time step t
• vt1:t2 = (vt1, vt1+1, ..., vt2) = sequence of actual values
of r.v. V from time steps t1 through t2
• Convenient shorthand: E.g., P(x1:t | z1:t) means P(X1:t = x1:t | Z1:t = z1:t)
• T = final time step
Hidden Markov Models: Slide 23

HMM: Formal Definition


An HMM λ is a 5-tuple consisting of
• a set of N possible states {s1, s2, ..., sN}
• a set of M possible observations {o1, o2, ..., oM}
• the starting state probabilities {π1, π2, ..., πN}, πi = P(X1 = si)
• the state transition probabilities aij = P(Xt+1 = sj | Xt = si):

  a11 a12 ... a1N
  a21 a22 ... a2N
   :   :       :
  aN1 aN2 ... aNN

• the observation probabilities bi(ok) = P(Zt = ok | Xt = si):

  b1(o1) b1(o2) ... b1(oM)
  b2(o1) b2(o2) ... b2(oM)
    :      :           :
  bN(o1) bN(o2) ... bN(oM)

Hidden Markov Models: Slide 24

Here's an HMM

[Figure: three-state HMM; s1 emits u or v, s2 emits v or w, s3 emits u or w.]

Start randomly in state 1 or 2. Choose one of the output symbols in each state at random.

State set {s1, s2, s3}
Observation set {u, v, w}

π1 = 1/2    π2 = 1/2    π3 = 0

a11 = 0      a12 = 1/3    a13 = 2/3
a21 = 1/3    a22 = 0      a23 = 2/3
a31 = 1/3    a32 = 1/3    a33 = 1/3

b1(u) = 1/2    b1(v) = 1/2    b1(w) = 0
b2(u) = 0      b2(v) = 1/2    b2(w) = 1/2
b3(u) = 1/2    b3(v) = 0      b3(w) = 1/2
Hidden Markov Models: Slide 25
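To make the generative story concrete, here is a hedged Python sketch of sampling from this example HMM (not part of the slides); the function name `sample_sequence` and the 0-based index encodings of states and symbols are our own choices:

```python
import random

# Sample a state path and observation sequence from the example HMM:
# pick a start state from pi, emit a symbol from b, transition via a.
pi = [1/2, 1/2, 0]
A = [[0, 1/3, 2/3], [1/3, 0, 2/3], [1/3, 1/3, 1/3]]
B = [[1/2, 1/2, 0],      # b_1(u), b_1(v), b_1(w)
     [0, 1/2, 1/2],      # b_2(.)
     [1/2, 0, 1/2]]      # b_3(.)
obs_symbols = "uvw"

def sample_sequence(T, rng=random):
    states, obs = [], []
    x = rng.choices(range(3), weights=pi)[0]          # start state ~ pi
    for _ in range(T):
        states.append(x)
        obs.append(obs_symbols[rng.choices(range(3), weights=B[x])[0]])
        x = rng.choices(range(3), weights=A[x])[0]    # next state ~ a_x,.
    return states, obs
```

Running it repeatedly produces sequences like the (s1, s3, s3) / (u, u, w) trial walked through on the following slides.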

Here's an HMM: generating a sequence of observations

The next slides step through the generative process for the model above, one random choice at a time:

• 50-50 choice between s1 and s2: x1 = s1
• In s1, 50-50 choice between u and v: z1 = u
• From s1, go to s2 with probability 1/3 or s3 with probability 2/3: x2 = s3
• In s3, 50-50 choice between u and w: z2 = u
• From s3, each of the three next states is equally likely: x3 = s3
• In s3, 50-50 choice between u and w: z3 = w

Result: x1:3 = (s1, s3, s3) and z1:3 = (u, u, w).

Hidden State

The observer sees only z1 = u, z2 = u, z3 = w; the states x1, x2, x3 are hidden. This is what the observer has to work with.
Hidden Markov Models: Slide 33

Problems to solve
• So now we have an HMM (or, more generally,
a DBN) that models a temporal process of
interest
• What are some of the kinds of problems we’d
like to be able to solve with this?

Hidden Markov Models: Slide 34

Temporal Model Problems to Solve
• Filtering: Compute P(Xt | z1:t, λ)
• Prediction: Compute P(Xk | z1:t, λ) for k > t
• Smoothing: Compute P(Xk | z1:t, λ) for k < t
• Observation sequence likelihood: Compute P(z1:T | λ)
• Most probable path (state sequence): Compute the x1:T maximizing P(x1:T | z1:T, λ)
• Maximum likelihood model: Given a set of observation sequences { z^r_{1:Tr} }_r, compute the λ maximizing ∏_r P(z^r_{1:Tr} | λ)

Hidden Markov Models: Slide 35

Temporal Model Problems to Solve


• Used in a wide variety of dynamical systems
modeling applications:
• filtering
• prediction
• smoothing
• Used especially in HMM applications:
• observation sequence likelihood
• most probable path
• maximum likelihood model fitting

Hidden Markov Models: Slide 36

Filtering
Compute P(Xt | z1:t, λ)

[Diagram: the HMM Bayes net with z1, ..., zt observed; the distribution of Xt at the current time t is inferred.]

Hidden Markov Models: Slide 37

Prediction
Compute P(Xk | z1:t, λ) for k > t

[Diagram: z1, ..., zt observed; the distribution of the future state Xk, k > t, is inferred.]

Hidden Markov Models: Slide 38

Smoothing
Compute P(Xk | z1:t, λ) for k < t

[Diagram: z1, ..., zt observed; the distribution of the past state Xk, k < t, is inferred.]

Hidden Markov Models: Slide 39

Observation Sequence Likelihood


Compute P(z1:T | λ)

[Diagram: the entire observation sequence z1, ..., zT is observed.]

What's the probability of this particular sequence of observations as a function of the model parameters?
Useful for such things as finding which of a set of HMM models best fits an observation sequence, as in speech recognition.
Hidden Markov Models: Slide 40

Most Probable Path
Compute argmax_{x1:T} P(x1:T | z1:T, λ)

[Diagram: z1, ..., zT observed; only the most probable state sequence is inferred.]

Not necessarily the same as the sequence of individually most probable states (obtained by smoothing).
Hidden Markov Models: Slide 41

Maximum Likelihood Model


Assume the number of states is given.
Given a set of R observation sequences

  z^1_{1:T1} = (z^1_1, z^1_2, ..., z^1_{T1})
  z^2_{1:T2} = (z^2_1, z^2_2, ..., z^2_{T2})
  ...
  z^R_{1:TR} = (z^R_1, z^R_2, ..., z^R_{TR})

compute

  λ* = argmax_λ ∏_{r=1..R} P(z^r_{1:Tr} | λ)

Hidden Markov Models: Slide 42

Solution methods for these problems
Let's start by considering the observation sequence likelihood problem:

  Given z1:T, compute P(z1:T | λ)

We'll use our example HMM to illustrate.

Hidden Markov Models: Slide 43

Prob. of a sequence of 3 observations


P(z1:3) = Σ_{x1:3 ∈ paths of length 3} P(z1:3 ∧ x1:3)
        = Σ_{x1:3 ∈ paths of length 3} P(z1:3 | x1:3) P(x1:3)

[Figure: the example HMM.]

How do we compute P(x1:3) for an arbitrary path x1:3?
How do we compute P(z1:3 | x1:3) for an arbitrary path x1:3?

Hidden Markov Models: Slide 44


Prob. of a sequence of 3 observations


P(z1:3) = Σ_{x1:3 ∈ paths of length 3} P(z1:3 | x1:3) P(x1:3)

How do we compute P(x1:3) for an arbitrary path x1:3?

  P(x1, x2, x3) = P(x1) P(x2 | x1) P(x3 | x2)
  E.g., P(s1, s3, s3) = 1/2 · 2/3 · 1/3 = 1/9

How do we compute P(z1:3 | x1:3) for an arbitrary path x1:3?

  P(z1, z2, z3 | x1, x2, x3) = P(z1 | x1) P(z2 | x2) P(z3 | x3)
  E.g., P(u, u, w | s1, s3, s3) = 1/2 · 1/2 · 1/2 = 1/8
Hidden Markov Models: Slide 46

Prob. of a sequence of 3 observations
P(z1:3) = Σ_{x1:3 ∈ paths of length 3} P(z1:3 | x1:3) P(x1:3)

But this sum has 3³ = 27 terms in it!
It is exponential in the length of the sequence.
We need to use a dynamic programming trick like before.

Hidden Markov Models: Slide 47

The probability of a given sequence of observations, non-exponential-cost-style

Given observation sequence (z1, z2, ..., zT) = z1:T, define the forward variable

  αt(i) = P(z1:t, Xt = si | λ) for 1 ≤ t ≤ T

αt(i) = probability that, in a random trial,
• we'd have seen the first t observations; and
• we'd have ended up in si as the t-th state visited.

Hidden Markov Models: Slide 48

Computing the forward variables
Base case:
  α1(i) ≡ P(z1 ∧ X1 = si)
        = P(z1 | X1 = si) P(X1 = si)
        = bi(z1) πi

[Figure: the first trellis column of states 1, ..., N, each weighted by πi and the observation z1.]

Note: For simplicity, we'll drop explicit reference to conditioning on the HMM parameters λ for many of the upcoming slides, but it's always there implicitly.

Hidden Markov Models: Slide 49

Forward variables: inductive step


αt+1(j) ≡ P(z1:t+1 ∧ Xt+1 = sj)
  = Σ_{i=1..N} P(z1:t+1 ∧ Xt = si ∧ Xt+1 = sj)                            [sum over all possible previous states]
  = Σ_{i=1..N} P(z1:t ∧ zt+1 ∧ Xt = si ∧ Xt+1 = sj)                       [split off the last observation]
  = Σ_{i=1..N} P(zt+1 ∧ Xt+1 = sj | z1:t ∧ Xt = si) P(z1:t ∧ Xt = si)     [chain rule]
  = Σ_{i=1..N} P(zt+1 ∧ Xt+1 = sj | Xt = si) αt(i)                        [latest state and observation conditionally independent of earlier observations given the previous state]
  = Σ_{i=1..N} P(zt+1 | Xt+1 = sj ∧ Xt = si) P(Xt+1 = sj | Xt = si) αt(i) [chain rule]
  = Σ_{i=1..N} P(zt+1 | Xt+1 = sj) aij αt(i)                              [latest observation conditionally independent of the earlier state given the latest state]
  = bj(zt+1) Σ_{i=1..N} aij αt(i)

Hidden Markov Models: Slide 50

Forward variables: inductive step

αt+1(j) = bj(zt+1) Σ_{i=1..N} aij αt(i)

[Figure: every αt(i) node at time t feeds the αt+1(j) node, which is then weighted by bj(zt+1).]

Hidden Markov Models: Slide 57

Observation Sequence Likelihood

Efficient solution to the observation sequence likelihood problem using the forward variables:

  P(z1:t | λ) = Σ_{i=1..N} P(z1:t ∧ Xt = si | λ) = Σ_{i=1..N} αt(i)

Hidden Markov Models: Slide 58
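As an illustration (not part of the slides), a minimal Python sketch of the forward recursion and the resulting likelihood; `forward` is our own function name, and observations are encoded as indices into (u, v, w):

```python
def forward(pi, A, B, obs):
    """Forward variables: alpha[t-1][i] = alpha_t(i) = P(z_{1:t}, X_t = s_i)."""
    N = len(pi)
    alpha = [[B[i][obs[0]] * pi[i] for i in range(N)]]   # base case: b_i(z1) pi_i
    for z in obs[1:]:                                     # inductive step
        prev = alpha[-1]
        alpha.append([B[j][z] * sum(A[i][j] * prev[i] for i in range(N))
                      for j in range(N)])
    return alpha

# The example HMM from the slides; the observation sequence u u w is [0, 0, 2].
pi = [1/2, 1/2, 0]
A = [[0, 1/3, 2/3], [1/3, 0, 2/3], [1/3, 1/3, 1/3]]
B = [[1/2, 1/2, 0], [0, 1/2, 1/2], [1/2, 0, 1/2]]
alpha = forward(pi, A, B, [0, 0, 2])
likelihood = sum(alpha[-1])            # P(z_{1:T}) = sum_i alpha_T(i)
```

For u u w this reproduces the numbers worked out on the next slide, including the final likelihood of 1/36.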

In our example
αt(i) ≡ P(z1:t ∧ Xt = si | λ)

  α1(i) = bi(z1) πi
  αt+1(j) = bj(zt+1) Σ_i aij αt(i)

Observed: z1 z2 z3 = u u w

  α1(1) = 1/4    α2(1) = 0       α3(1) = 0
  α1(2) = 0      α2(2) = 0       α3(2) = 1/72
  α1(3) = 0      α2(3) = 1/12    α3(3) = 1/72

So the probability of observing uuw is 1/72 + 1/72 = 1/36.
Hidden Markov Models: Slide 59

Filtering

Efficient solution to the filtering problem using the forward variables:

  P(Xt = si | z1:t) = P(Xt = si ∧ z1:t) / P(z1:t) = αt(i) / Σ_{j=1..N} αt(j)

This estimates the current state based on all observations up to the current time.

So in our example, after observing uuw, the probability of being in s1 is 0, and the probabilities of being in s2 and s3 are each 1/2.

Hidden Markov Models: Slide 60
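A short self-contained Python sketch of filtering on the same example (our own encoding; observations u u w are indices 0 0 2): run the forward recursion, then normalize the final column of α values.

```python
# Filtering: P(X_t = s_i | z_{1:t}) = alpha_t(i) / sum_j alpha_t(j).
pi = [1/2, 1/2, 0]
A = [[0, 1/3, 2/3], [1/3, 0, 2/3], [1/3, 1/3, 1/3]]
B = [[1/2, 1/2, 0], [0, 1/2, 1/2], [1/2, 0, 1/2]]

alpha = [B[i][0] * pi[i] for i in range(3)]           # alpha_1 for z1 = u
for z in (0, 2):                                       # z2 = u, z3 = w
    alpha = [B[j][z] * sum(A[i][j] * alpha[i] for i in range(3))
             for j in range(3)]

total = sum(alpha)                                     # P(z_{1:3}) = 1/36
filtered = [a / total for a in alpha]                  # [0, 1/2, 1/2]
```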

Prediction
• Note that the (state) prediction problem can
be viewed as a special case of the filtering
problem in which there are missing
observations.
• That is, trying to compute the probability of Xk
given observations up through time step t,
with k > t, amounts to filtering with missing
observations at time steps t+1, t+2, ..., k.
• Therefore, we now focus on the missing
observations problem.
Hidden Markov Models: Slide 61

Missing Observations
• Looking at the derivation of the inductive step for computing the forward
variables, we see that the last step involves writing

  αt+1(j) = P(zt+1 | Xt+1 = sj) P(Xt+1 = sj ∧ z1:t) = bj(zt+1) Σ_{i=1..N} aij αt(i)

• Thus the second factor gives us a prediction of the state at time t+1 based
on all earlier observations, which we then multiply by the observation
probability at time t+1 given the state at time t+1.
• If there is no observation at time t+1, clearly the set of observations made
through time t+1 is the same as the set of observations made through
time t.

Hidden Markov Models: Slide 62

Missing Observations (cont.)
• Thus we redefine

  αt(i) = P(Xt = si ∧ all available observations through time t)

• This generalizes our earlier definition but allows for the possibility that some observations are present and others are missing.
• Then define

  b′i(zt) = bi(zt)  if there is an observation at time t
  b′i(zt) = 1       otherwise

• It's not hard to see that the correct forward computation should then proceed as:

  α1(i) = b′i(z1) πi
  αt+1(j) = b′j(zt+1) Σ_i aij αt(i)
• Amounts to propagating state predictions forward wherever there are no
observations
• Interesting special case: When there are no observations at any time, the
α values are identical to the p values we defined earlier for Markov chains

Hidden Markov Models: Slide 63
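The b′ trick above can be sketched in a few lines of Python (our own names; a missing observation is encoded as `None`):

```python
# Forward recursion with missing observations: b'_i(z_t) = b_i(z_t) when z_t
# is observed, and 1 when it is missing.
pi = [1/2, 1/2, 0]
A = [[0, 1/3, 2/3], [1/3, 0, 2/3], [1/3, 1/3, 1/3]]
B = [[1/2, 1/2, 0], [0, 1/2, 1/2], [1/2, 0, 1/2]]

def b_prime(i, z):
    return 1.0 if z is None else B[i][z]

def forward_missing(obs):
    alpha = [b_prime(i, obs[0]) * pi[i] for i in range(3)]
    for z in obs[1:]:
        alpha = [b_prime(j, z) * sum(A[i][j] * alpha[i] for i in range(3))
                 for j in range(3)]
    return alpha

# With no observations at all, the alphas reduce to the Markov-chain p values,
# the special case noted on this slide.
p3 = forward_missing([None, None, None])
```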

Solving the smoothing problem


• Define the backward variables

  βt(i) = P(zt+1:T | Xt = si, λ)

• This is the probability of observing zt+1, ..., zT given that the system was in state si at time step t.
• These can be computed efficiently by starting
at the end (time T) and working backwards
• Base case: βT (i ) = 1 for all i, 1 ≤ i ≤ N
• Valid because zT+1:T is an empty sequence of
observations so its probability is 1
Hidden Markov Models: Slide 64

Backward variables: inductive step
  βt(i) ≡ P(zt+1:T | Xt = si)
        = Σ_{j=1..N} P(zt+1:T ∧ Xt+1 = sj | Xt = si)
        = Σ_{j=1..N} P(zt+1:T | Xt+1 = sj ∧ Xt = si) P(Xt+1 = sj | Xt = si)
        = Σ_{j=1..N} P(zt+1 ∧ zt+2:T | Xt+1 = sj ∧ Xt = si) aij
        = Σ_{j=1..N} P(zt+1 ∧ zt+2:T | Xt+1 = sj) aij
        = Σ_{j=1..N} P(zt+1 | zt+2:T ∧ Xt+1 = sj) P(zt+2:T | Xt+1 = sj) aij
        = Σ_{j=1..N} P(zt+1 | Xt+1 = sj) βt+1(j) aij
        = Σ_{j=1..N} bj(zt+1) βt+1(j) aij

Hidden Markov Models: Slide 65

Backward variables: inductive step

βt(i) = Σ_{j=1..N} bj(zt+1) βt+1(j) aij

[Figure: node βt(i) collects a contribution from each βt+1(j), weighted by aij and bj(zt+1).]

Hidden Markov Models: Slide 66

Solving the smoothing problem
• Use the notation γt(i) = P(Xt = si | z1:T) for the probability we want to compute.
• Then

  γt(i) = c P(z1:T | Xt = si) P(Xt = si)
        = c P(z1:t | Xt = si) P(zt+1:T | Xt = si) P(Xt = si)
        = c P(z1:t ∧ Xt = si) P(zt+1:T | Xt = si)
        = c αt(i) βt(i)

  where c = 1/P(z1:T) is a constant of proportionality we can ignore as long as we normalize to get the actual probabilities.
Hidden Markov Models: Slide 67

Smoothing

Efficient solution to the smoothing problem using the forward and backward variables:

  P(Xt = si | z1:T) = αt(i) βt(i) / Σ_{j=1..N} αt(j) βt(j)

This estimates a state based on all observations before, during, and after that time step.

This combined computation is known as the forward-backward algorithm.
Hidden Markov Models: Slide 68
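A self-contained Python sketch of the forward-backward algorithm (not from the slides; `forward_backward` is our own name), run here on the example HMM with observations u u w:

```python
def forward_backward(pi, A, B, obs):
    """Smoothed marginals gamma[t-1][i] = P(X_t = s_i | z_{1:T})."""
    N = len(pi)
    # Forward pass: alpha_t(i) = P(z_{1:t}, X_t = s_i).
    alpha = [[B[i][obs[0]] * pi[i] for i in range(N)]]
    for z in obs[1:]:
        prev = alpha[-1]
        alpha.append([B[j][z] * sum(A[i][j] * prev[i] for i in range(N))
                      for j in range(N)])
    # Backward pass: beta_T(i) = 1, then work right to left.
    beta = [[1.0] * N]
    for z in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][z] * nxt[j] for j in range(N))
                        for i in range(N)])
    # Combine and normalize: gamma_t(i) proportional to alpha_t(i) beta_t(i).
    gamma = []
    for a, b in zip(alpha, beta):
        tot = sum(x * y for x, y in zip(a, b))
        gamma.append([x * y / tot for x, y in zip(a, b)])
    return gamma

pi = [1/2, 1/2, 0]
A = [[0, 1/3, 2/3], [1/3, 0, 2/3], [1/3, 1/3, 1/3]]
B = [[1/2, 1/2, 0], [0, 1/2, 1/2], [1/2, 0, 1/2]]
gamma = forward_backward(pi, A, B, [0, 0, 2])   # observations u u w
```

After seeing the whole sequence uuw, the smoothed distribution at t = 1 puts all mass on s1 (only s1 can emit the first u from a starting state), while the final time step matches the filtering result.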

Solving the most probable path problem
• Want argmax_{x1:T} P(x1:T | z1:T)
• One approach:

  argmax_{x1:T} P(x1:T | z1:T) = argmax_{x1:T} P(z1:T | x1:T) P(x1:T) / P(z1:T)
                               = argmax_{x1:T} P(z1:T | x1:T) P(x1:T)

• It is easy to compute each factor for a given state and observation sequence, but the number of paths is exponential in T.
• Use dynamic programming instead.
Hidden Markov Models: Slide 69

DP for Most Probable Path


• Define

  δt(i) = max_{x1:t-1} P(x1:t-1 ∧ Xt = si ∧ z1:t)

• A path giving this maximum is one of length t-1 having the highest probability of simultaneously
  • occurring
  • ending at si
  • producing observation sequence z1:t

Hidden Markov Models: Slide 70

DP for MPP (cont.)
• We’ll show that these values can be
computed by an efficient forward computation
similar to the computation of the α values
• But first, let’s check that it gives us something
useful:
  δT(i) = max_{x1:T-1} P(x1:T-1 ∧ XT = si ∧ z1:T)
        = max_{x1:T-1} P(x1:T-1 ∧ XT = si | z1:T) P(z1:T)

• Thus a value of i maximizing δT(i) identifies a state which is the final state in a path maximizing P(x1:T | z1:T).
Hidden Markov Models: Slide 71

DP for MPP (cont.)


• First, the base case is

  δ1(i) = P(X1 = si ∧ z1:1)            [only one choice to maximize over]
        = P(z1 | X1 = si) P(X1 = si)
        = bi(z1) πi

• Then, since the maximum-probability path ending at sj at time t+1 must go through some state at time t, we can write

  δt+1(j) ≡ max_{x1:t} P(x1:t ∧ Xt+1 = sj ∧ z1:t+1)
          = max_i max_{x1:t-1} P(x1:t-1 ∧ Xt = si ∧ z1:t ∧ zt+1 ∧ Xt+1 = sj)

• Now work on just the probability inside the max; call it Δ(i, j).

Hidden Markov Models: Slide 72

DP for MPP (cont.)
• Using the chain rule and the Markov property, we find that
the probability to be maximized can be written as

  Δ(i,j) ≡ P(x_{1:t-1} ∧ X_t = s_i ∧ z_{1:t} ∧ z_{t+1} ∧ X_{t+1} = s_j)
         = P(z_{t+1} ∧ X_{t+1} = s_j | x_{1:t-1} ∧ X_t = s_i ∧ z_{1:t}) P(x_{1:t-1} ∧ X_t = s_i ∧ z_{1:t})
         = P(z_{t+1} ∧ X_{t+1} = s_j | X_t = s_i) P(x_{1:t-1} ∧ X_t = s_i ∧ z_{1:t})
         = P(z_{t+1} | X_{t+1} = s_j) P(X_{t+1} = s_j | X_t = s_i) P(x_{1:t-1} ∧ X_t = s_i ∧ z_{1:t})
         = b_j(z_{t+1}) a_ij P(x_{1:t-1} ∧ X_t = s_i ∧ z_{1:t})

Hidden Markov Models: Slide 73

DP for MPP (cont.)


• Finally, then, we get

  δ_{t+1}(j) = max_i max_{x_{1:t-1}} Δ(i,j)
             = max_i max_{x_{1:t-1}} [ b_j(z_{t+1}) a_ij P(x_{1:t-1} ∧ X_t = s_i ∧ z_{1:t}) ]
             = b_j(z_{t+1}) max_i [ a_ij max_{x_{1:t-1}} P(x_{1:t-1} ∧ X_t = s_i ∧ z_{1:t}) ]
             = b_j(z_{t+1}) max_i a_ij δ_t(i)

• This is the inductive step
• Virtually identical to the computation of the forward variables α; the only difference is that it uses max instead of sum
• Also need to keep track of which state s_i gives the max for each state s_j at the next time step, to be able to determine the actual MPP, not just its probability
Hidden Markov Models: Slide 74

Viterbi Algorithm for Most Probable Path
Summary
• Base case: ∀i, δ_1(i) = b_i(z_1) π_i
• Inductive step: ∀j, δ_{t+1}(j) = b_j(z_{t+1}) max_i a_ij δ_t(i)
• Compute for all states at t=1, then t=2, etc.
• Also save index giving max for each state at
each time step (backward pointers)
• Construct the MPP by determining state with
largest δT(i), then following backward pointers
to time steps T-1, T-2, etc.

Hidden Markov Models: Slide 75
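The summary above translates almost directly into code. Below is a hedged sketch (the names pi, A, B and the integer encoding of states and observations are my own conventions, not from the slides):

```python
import numpy as np

def viterbi(pi, A, B, z):
    """Most probable state path for an observation index sequence z.
    pi: (N,) initial probabilities; A: (N, N) transition probs a_ij;
    B: (N, M) emission probs b_i(o_k)."""
    T, N = len(z), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)          # backward pointers
    delta[0] = B[:, z[0]] * pi                 # base case: b_i(z_1) * pi_i
    for t in range(1, T):
        scores = A * delta[t - 1][:, None]     # scores[i, j] = delta[t-1][i] * a_ij
        psi[t] = scores.argmax(axis=0)         # best predecessor i for each j
        delta[t] = B[:, z[t]] * scores.max(axis=0)
    # construct the MPP: best final state, then follow backward pointers
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```

For long sequences one would work with log probabilities (replacing products by sums) to avoid numerical underflow in the δ values.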

Viterbi Algorithm

Store two numbers at each node in this trellis, one for δ and the other a
backward pointer to a node in the previous layer giving the max for this
node – this is computed left to right.
To find a most probable path, determine a node in the T layer with max δ
value, then follow backward pointers from right to left.
Hidden Markov Models: Slide 76

Viterbi algorithm: inductive step

  δ_{t+1}(j) = b_j(z_{t+1}) max_i a_ij δ_t(i)

  ψ_{t+1}(j) = arg max_i a_ij δ_t(i)

Hidden Markov Models: Slide 77

Prob. of a given transition


• The final problem we want to address is the HMM
inference (learning) problem, given a training set of
observation sequences
• Most of the ingredients for deriving a max. likelihood
method for this are in place
• But there’s one more sub-problem we’ll need to
address:
Given an observation sequence z_{1:T}, what's the probability that the transition from s_i to s_j occurred at time t?
• Thus we define

  ξ_t(i,j) = P(X_t = s_i ∧ X_{t+1} = s_j | z_{1:T})
Hidden Markov Models: Slide 78

Prob. of a given transition (cont.)
  ξ_t(i,j) ≡ P(X_t = s_i ∧ X_{t+1} = s_j | z_{1:T})
           = c P(z_{1:T} | X_t = s_i ∧ X_{t+1} = s_j) P(X_t = s_i ∧ X_{t+1} = s_j)
           = c P(z_{1:T} | X_t = s_i ∧ X_{t+1} = s_j) P(X_{t+1} = s_j | X_t = s_i) P(X_t = s_i)
           = c P(z_{1:T} | X_t = s_i ∧ X_{t+1} = s_j) a_ij P(X_t = s_i)
           = c P(z_{1:t} | X_t = s_i) P(z_{t+1} | X_{t+1} = s_j) P(z_{t+2:T} | X_{t+1} = s_j) a_ij P(X_t = s_i)
           = c P(z_{1:t} ∧ X_t = s_i) b_j(z_{t+1}) P(z_{t+2:T} | X_{t+1} = s_j) a_ij
           = c α_t(i) b_j(z_{t+1}) β_{t+1}(j) a_ij
           = c α_t(i) a_ij b_j(z_{t+1}) β_{t+1}(j)

  Here c = 1/P(z_{1:T}) is a normalizing constant we can ignore as long as we make the sum over all (i,j) pairs equal to 1 when computing actual probabilities.
Hidden Markov Models: Slide 79

Prob. of a given transition (cont.)

  ξ_t(i,j) ≡ P(X_t = s_i ∧ X_{t+1} = s_j | z_{1:T})
           = α_t(i) a_ij b_j(z_{t+1}) β_{t+1}(j) / ∑_{k,l} α_t(k) a_kl b_l(z_{t+1}) β_{t+1}(l)
Hidden Markov Models: Slide 80
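The normalized expression above can be computed for all (i, j) pairs at once at each time step. A sketch under assumed conventions (the function name, NumPy usage, and array shapes are mine):

```python
import numpy as np

def transition_posteriors(alpha, beta, A, B, z):
    """xi_t(i, j) for t = 1..T-1, following the normalized formula above.
    alpha, beta: (T, N) forward/backward variables; A: (N, N) transitions;
    B: (N, M) emissions; z: sequence of observation indices."""
    T = len(z)
    N = A.shape[0]
    out = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # unnormalized alpha_t(i) * a_ij * b_j(z_{t+1}) * beta_{t+1}(j)
        u = alpha[t][:, None] * A * B[:, z[t + 1]][None, :] * beta[t + 1][None, :]
        out[t] = u / u.sum()    # normalize over all (i, j) pairs
    return out
```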

Max. Likelihood HMM Inference
Given a state set {s1, s2, ..., sN} and a set of R
observation sequences
  z^1_{1:T_1} = (z^1_1, z^1_2, …, z^1_{T_1})
  z^2_{1:T_2} = (z^2_1, z^2_2, …, z^2_{T_2})
  ⋮
  z^R_{1:T_R} = (z^R_1, z^R_2, …, z^R_{T_R})

determine the parameter set λ = (π_i, {a_ij}, {b_i(o_j)}) maximizing

  λ* = arg max_λ ∏_{r=1}^R P(z^r_{1:T_r} | λ)

(From now on, we'll make conditioning on λ explicit.)

Hidden Markov Models: Slide 81

A cheat
Let’s first imagine that along with each observation sequence
  z^r_{1:T_r} = (z^r_1, z^r_2, …, z^r_{T_r})
an oracle also gives us the corresponding state sequence
  x^r_{1:T_r} = (x^r_1, x^r_2, …, x^r_{T_r})
Then we could obtain max. likelihood estimates of all
parameters as follows:
  π̂_i = (# of sequences starting in s_i) / (total # of sequences)

  â_ij = (# of transitions s_i → s_j) / (# of visits to state s_i)

  b̂_i(o_k) = (# of visits to state s_i where o_k observed) / (# of visits to state s_i)
Hidden Markov Models: Slide 82

A cheat (cont.)
More formally, define the indicator functions

  χ^r_t(i) = 1 if x^r_t = s_i, 0 otherwise

  χ^r_t(i→j) = 1 if x^r_t = s_i and x^r_{t+1} = s_j, 0 otherwise

  χ^r_t(i:k) = 1 if x^r_t = s_i and z^r_t = o_k, 0 otherwise

Hidden Markov Models: Slide 83

A cheat (cont.)
In terms of these indicator functions, our ML estimates would
then be


  π̂_i = ( ∑_{r=1}^R χ^r_1(i) ) / R

  â_ij = ( ∑_{r=1}^R ∑_{t=1}^{T_r-1} χ^r_t(i→j) ) / ( ∑_{r=1}^R ∑_{t=1}^{T_r-1} χ^r_t(i) )

  b̂_i(o_k) = ( ∑_{r=1}^R ∑_{t=1}^{T_r} χ^r_t(i:k) ) / ( ∑_{r=1}^R ∑_{t=1}^{T_r} χ^r_t(i) )

For â_ij, we can't use the last state in any of the training sequences because there's no next state.

Hidden Markov Models: Slide 84
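Given oracle-provided state sequences, the counting formulas above reduce to a few loops. A sketch with my own naming and an integer encoding of states and observations (neither is from the slides):

```python
def ml_estimates(state_seqs, obs_seqs, N, M):
    """Counting-based ML estimates, assuming each observation sequence
    in obs_seqs comes paired with its true state sequence in state_seqs.
    States are indices 0..N-1, observations indices 0..M-1."""
    pi = [0.0] * N
    trans = [[0.0] * N for _ in range(N)]   # transition counts s_i -> s_j
    emit = [[0.0] * M for _ in range(N)]    # counts of o_k observed in s_i
    for x, z in zip(state_seqs, obs_seqs):
        pi[x[0]] += 1
        for t in range(len(x) - 1):
            trans[x[t]][x[t + 1]] += 1
        for xt, zt in zip(x, z):
            emit[xt][zt] += 1
    pi = [c / len(state_seqs) for c in pi]
    a = [[c / max(sum(row), 1) for c in row] for row in trans]
    b = [[c / max(sum(row), 1) for c in row] for row in emit]
    return pi, a, b
```

Note that the denominator of â_ij here is the row sum of the transition counts, which automatically excludes each sequence's final state, as the formulas on this slide require.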

The bad news ...
• There is no oracle to tell us the state sequence
corresponding to each observation sequence
• So we don’t know these actual indicator function
values
• So we can’t compute these sums

Hidden Markov Models: Slide 85

The good news ...


• We can compute their expected values efficiently:
  γ^r_t(i) ≡ P(X_t = s_i | z^r_{1:T_r}, λ) = E(χ^r_t(i) | λ)

  ξ^r_t(i,j) ≡ P(X_t = s_i ∧ X_{t+1} = s_j | z^r_{1:T_r}, λ) = E(χ^r_t(i→j) | λ)

• Also:

  E(χ^r_t(i:k) | λ) = P(X_t = s_i ∧ Z_t = o_k | z^r_{1:T_r}, λ)
                    = P(X_t = s_i | z^r_{1:T_r}, λ) if z^r_t = o_k, 0 otherwise
                    = γ^r_t(i) I(z^r_t = o_k)

  where I(·) is the usual indicator function: 1 if true, 0 if false.

Hidden Markov Models: Slide 86

The good news ... (slide repeated, with an overlaid annotation)

Looks like a job for EM!

Hidden Markov Models: Slide 87

EM for HMMs (Baum-Welch)


E-step
  Use the current estimate of the model parameters λ to compute all the γ^r_t(i) and ξ^r_t(i,j) values for each training sequence z^r_{1:T_r}.

M-step
  π_i ← ( ∑_{r=1}^R γ^r_1(i) ) / R

  a_ij ← ( ∑_{r=1}^R ∑_{t=1}^{T_r-1} ξ^r_t(i,j) ) / ( ∑_{r=1}^R ∑_{t=1}^{T_r-1} γ^r_t(i) )

  b_i(k) ← ( ∑_{r=1}^R ∑_{t=1}^{T_r} γ^r_t(i) I(z^r_t = o_k) ) / ( ∑_{r=1}^R ∑_{t=1}^{T_r} γ^r_t(i) )
Hidden Markov Models: Slide 88
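Assuming the E-step has already produced the γ and ξ arrays, the M-step updates above can be sketched as follows (the function name, argument names, and array shapes are my own assumptions):

```python
import numpy as np

def m_step(gammas, xis, obs_seqs, M):
    """One Baum-Welch M-step. For each training sequence r, assumes
    gammas[r] has shape (T_r, N) and xis[r] has shape (T_r - 1, N, N);
    obs_seqs[r] is the sequence of observation indices."""
    R = len(gammas)
    N = gammas[0].shape[1]
    pi = sum(g[0] for g in gammas) / R
    # transitions: expected counts over t = 1..T_r-1
    A = sum(x.sum(axis=0) for x in xis) / sum(g[:-1].sum(axis=0) for g in gammas)[:, None]
    # emissions: accumulate gamma mass per observed symbol
    num_b = np.zeros((N, M))
    for g, z in zip(gammas, obs_seqs):
        for t, o in enumerate(z):
            num_b[:, o] += g[t]
    B = num_b / sum(g.sum(axis=0) for g in gammas)[:, None]
    return pi, A, B
```

When the γ's and ξ's are mutually consistent (as the forward-backward algorithm guarantees), the rows of A and B automatically sum to 1.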

Remarks on Baum-Welch
• Bad news: There may be many local maxima
• Good news: The local maxima are usually
adequate models of the data
• Any probabilities initialized to zero will remain
zero throughout – useful when one wants a
model with limited state transitions

Hidden Markov Models: Slide 89

Summary of solution methods


• Filtering: forward variables (α’s)
• Prediction: (modified) forward variables
• Smoothing: forward-backward algorithm
• Observation sequence likelihood: forward
variables
• Most probable path: Viterbi algorithm
• Maximum likelihood model: Baum-Welch
algorithm

Hidden Markov Models: Slide 90

Some good references
• Standard HMM reference:
L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected
Applications in Speech Recognition," Proceedings of the IEEE,
Vol. 77, No. 2, pp. 257-286, 1989.
• Excellent reference for Dynamic Bayes Nets
as a unifying framework for probabilistic
temporal models (including HMMs and
Kalman filters):
Chapter 15 of Artificial Intelligence, A Modern
Approach, 2nd Edition, by Russell & Norvig

Hidden Markov Models: Slide 91

What You Should Know


• What an HMM is
• Definition, computation, and use of αt(i)
• The Viterbi algorithm
• Outline of the EM algorithm for HMM learning
(Baum-Welch)
• Be comfortable with the kind of math needed
to derive the HMM algorithms described here
• What a DBN is and how an HMM is a special
case
• Appreciate that a DBN (and thus an HMM) is
really just a special kind of Bayes net
Hidden Markov Models: Slide 92

