Understanding Hidden Markov Models
Ronald J. Williams
CSG220
Spring 2007
Contains several slides adapted from an Andrew Moore tutorial on this topic and a few figures
from Russell & Norvig’s AIMA site and Alpaydin’s Introduction to Machine Learning site.
Markov Property

• For any t, X_{t+1} is conditionally independent of {X_{t-1}, X_{t-2}, ..., X_1} given X_t.
• In other words:

    P(X_{t+1} = s_j | X_t = s_i) = P(X_{t+1} = s_j | X_t = s_i, any earlier history)

• Question: What would be the best Bayes net structure to represent the joint distribution of (X_1, X_2, ..., X_{t-1}, X_t, ...)?
• Answer: a simple chain, shown on the next slide.
Markov chain as a Bayes net

    X_1 → X_2 → ... → X_{t-1} → X_t → ...

CPT for X_1 (initial distribution):

    i    P(X_1 = s_i)
    1    π_1
    2    π_2
    3    π_3
    :    :
    N    π_N

CPT for each X_{t+1} given X_t (transition matrix; row i, column j holds a_{ij} = P(X_{t+1} = s_j | X_t = s_i)):

         1     2     ...  j     ...  N
    1    a_11  a_12  ...  a_1j  ...  a_1N
    2    a_21  a_22  ...  a_2j  ...  a_2N
    3    a_31  a_32  ...  a_3j  ...  a_3N
    :    :     :          :          :
    i    a_i1  a_i2  ...  a_ij  ...  a_iN
    :    :     :          :          :
    N    a_N1  a_N2  ...  a_Nj  ...  a_NN
Computing stuff in Markov chains

• Some notation and assumptions:
• Assume time t runs from 1 to T.
• Recall that X_t is the r.v. representing the state at time t and x_t denotes its actual value.
• Use X_{t1:t2} and x_{t1:t2} as shorthand for (X_{t1}, X_{t1+1}, ..., X_{t2}) and (x_{t1}, x_{t1+1}, ..., x_{t2}), respectively.
• Use notation like P(x_t) as shorthand for P(X_t = x_t).
State sequence as a path

[Trellis figure: states s_1, ..., s_N listed vertically, time steps 1 through T horizontally; any state sequence corresponds to a path through this trellis from left to right.]
What is P(X_t = s_i)? Clever approach

• For each state s_i, define

    p_t(i) = P(X_t = s_i)

• Picture a table with one row per state index (1 through N) and one column per time step (1 through T); cell (i, t) holds p_t(i). Cells in this table correspond to nodes in the trellis.
• Express inductively:

    ∀i  p_1(i) ≡ P(X_1 = s_i) = π_i

    ∀j  p_{t+1}(j) ≡ P(X_{t+1} = s_j)
                   = Σ_{i=1}^N P(X_{t+1} = s_j | X_t = s_i) P(X_t = s_i)
                   = Σ_{i=1}^N a_{ij} p_t(i)

• Computation is simple: just fill in this table one column at a time, from left to right, as in the sketch below.
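As a concrete illustration, here is a minimal Python/NumPy sketch of this column-by-column table fill; the names `state_marginals`, `pi`, and `A` are our own, chosen to mirror the notation above.

    import numpy as np

    def state_marginals(pi, A, T):
        """Fill the N x T table of marginals p_t(i) = P(X_t = s_i).

        pi : length-N initial distribution (pi[i] = P(X_1 = s_{i+1}))
        A  : N x N transition matrix (A[i, j] = a_ij)
        T  : number of time steps
        """
        N = len(pi)
        p = np.zeros((N, T))
        p[:, 0] = pi                     # base case: p_1(i) = pi_i
        for t in range(1, T):            # one column at a time, left to right
            p[:, t] = A.T @ p[:, t - 1]  # p_{t+1}(j) = sum_i a_ij p_t(i)
        return p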
Inductive step:
graphical representation

    p_{t+1}(j) = Σ_{i=1}^N a_{ij} p_t(i)

[Trellis figure: each node p_{t+1}(j) in column t+1 collects the values p_t(i) from every node in column t, weighted by the transition probabilities a_{ij}.]
Hidden State

• Given a Markov model of a process, computation of various quantities of interest (e.g., probabilities) is straightforward if the state is observable – use techniques like the one just described.
• More realistic: assume the true state is not observable – we only have observations that depend on, but do not fully determine, the actual states.
• Examples:
  • Robot localization
    • state = actual location
    • observations = (noisy) sensor readings
  • Speech recognition
    • state sequence => word
    • observations = acoustic signal
• In this situation, we say the state is hidden.
• Model this using a Hidden Markov Model (HMM).
HMMs

• An HMM is just a Markov chain augmented with
  • a set of M possible observations {o_1, o_2, ..., o_M}
  • for each state s_1, s_2, ..., s_N, a distribution over the possible observations that might be sensed in that state
• We'll let Z_t be the r.v. for the observation that occurs at time t (with z_t representing the actual observation).
• In addition, we'll assume that the observation at time t depends only on the state at time t, in the sense about to be described.
Markov Property of Observations

• For any t, Z_t is conditionally independent of {X_{t-1}, X_{t-2}, ..., X_1, Z_{t-1}, Z_{t-2}, ..., Z_1} given X_t.
• In other words:

    P(Z_t = o_j | X_t = s_i) = P(Z_t = o_j | X_t = s_i, any earlier history)

• Question: What would be the best Bayes net structure to represent the joint distribution of (X_1, Z_1, X_2, Z_2, ..., X_{t-1}, Z_{t-1}, X_t, Z_t, ...)?
• Answer:

    X_1 → X_2 → ... → X_{t-1} → X_t → ...
     |     |           |          |
     v     v           v          v
    Z_1   Z_2         Z_{t-1}    Z_t

This is the CPT for every Z node. Notation: b_i(o_k) = P(Z_t = o_k | X_t = s_i); rows are state indices 1..N, columns are observation indices 1..M:

         o_1       o_2       ...  o_k       ...  o_M
    1    b_1(o_1)  b_1(o_2)  ...  b_1(o_k)  ...  b_1(o_M)
    2    b_2(o_1)  b_2(o_2)  ...  b_2(o_k)  ...  b_2(o_M)
    :    :         :              :              :
    N    b_N(o_1)  b_N(o_2)  ...  b_N(o_k)  ...  b_N(o_M)
Are HMMs Useful?

You bet!!
• Robot planning & sensing under uncertainty (e.g., Reid Simmons / Sebastian Thrun / Sven Koenig)
• Robot learning control (e.g., Yangsheng Xu's work)
• Speech Recognition/Understanding
  Phones → words, signal → phones
• Human Genome Project
  Complicated stuff your lecturer knows nothing about.
• Consumer decision modeling
• Economics & Finance
Plus at least 5 other things I haven't thought of.
DBN Example

[Figure: a dynamic Bayes net modeling a robot with position sensors and a battery charge meter.]
Back to HMMs ...

Summary of our HMM notation:
• X_t = state at time t (r.v.)
• Z_t = observation at time t (r.v.)
• V_{t1:t2} = (V_{t1}, V_{t1+1}, ..., V_{t2}) for any time-indexed r.v. V
• Possible states = {s_1, s_2, ..., s_N}
• Possible observations = {o_1, o_2, ..., o_M}
• v_t = actual value of r.v. V at time step t
• v_{t1:t2} = (v_{t1}, v_{t1+1}, ..., v_{t2}) = sequence of actual values of r.v. V from time steps t1 through t2
• Convenient shorthand: e.g., P(x_{1:t} | z_{1:t}) means P(X_{1:t} = x_{1:t} | Z_{1:t} = z_{1:t})
• T = final time step
Here's an HMM

[Figure: a three-state HMM. State s_1 can emit u or v, state s_2 can emit v or w, and state s_3 can emit u or w, choosing one of its two output symbols at random (each with probability 1/2). Arcs between the states carry transition probabilities 1/3, 2/3, and 1/2; the exact arc labels are in the original figure.]

    π_1 = 1/2    π_2 = 1/2    π_3 = 0

• Start randomly in state 1 or 2.
• Choose one of the output symbols in each state at random.
• Let's generate a sequence of observations by repeatedly making a random transition and then emitting a random symbol – this is sketched in code below.

Hidden State

• The observer sees only the generated observation sequence; the underlying state sequence that produced it is hidden.
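To make the generative story concrete, here is a hedged Python sketch of sampling from a three-state HMM of this shape. The initial distribution π = (1/2, 1/2, 0) is from the slide; the specific transition matrix below is a placeholder (the true arc labels are only in the original figure), and the emission table encodes "each state emits one of its two symbols with probability 1/2".

    import numpy as np

    rng = np.random.default_rng(0)

    pi = np.array([1/2, 1/2, 0.0])      # initial distribution (from the slide)
    A = np.array([[1/3, 1/3, 1/3],      # PLACEHOLDER transitions: rows sum to 1, using
                  [1/3, 1/3, 1/3],      # the 1/3s from the figure; the real arc labels
                  [1/3, 1/3, 1/3]])     # (1/2s, 2/3s) can't be recovered from the text
    # observation indices: 0 = u, 1 = v, 2 = w
    B = np.array([[1/2, 1/2, 0.0],      # s_1 emits u or v, each w.p. 1/2
                  [0.0, 1/2, 1/2],      # s_2 emits v or w
                  [1/2, 0.0, 1/2]])     # s_3 emits u or w

    def generate(T):
        """Sample hidden states x_1..x_T and observations z_1..z_T."""
        x = rng.choice(3, p=pi)
        states, obs = [], []
        for _ in range(T):
            states.append(x)
            obs.append(rng.choice(3, p=B[x]))  # emit in the current state
            x = rng.choice(3, p=A[x])          # then make a random transition
        return states, obs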
Problems to solve

• So now we have an HMM (or, more generally, a DBN) that models a temporal process of interest.
• What are some of the kinds of problems we'd like to be able to solve with this?
Temporal Model Problems to Solve

• Filtering: Compute P(X_t | z_{1:t}, λ)
• Prediction: Compute P(X_k | z_{1:t}, λ) for k > t
• Smoothing: Compute P(X_k | z_{1:t}, λ) for k < t
• Observation sequence likelihood: Compute P(z_{1:T} | λ)
• Most probable path (state sequence): Compute the x_{1:T} maximizing P(x_{1:T} | z_{1:T}, λ)
• Maximum likelihood model: Given a set of observation sequences {z^r_{1:T_r}}_r, compute the λ maximizing ∏_r P(z^r_{1:T_r} | λ)
Filtering

    Compute P(X_t | z_{1:t}, λ)

[Timeline figure: Z_1, Z_2, ..., Z_{t-1}, Z_t are observed; t is the current time; we infer a distribution over X_t.]

Prediction

    Compute P(X_k | z_{1:t}, λ) for k > t

[Timeline figure: Z_1, ..., Z_t are observed; t is the current time; we infer a distribution over a future state X_k.]
Smoothing

    Compute P(X_k | z_{1:t}, λ) for k < t

[Timeline figure: Z_1, ..., Z_t are observed; t is the current time; we infer a distribution over a past state X_k. In the common offline case, all of Z_1, ..., Z_T is observed.]

Most Probable Path

    Compute argmax_{x_{1:T}} P(x_{1:T} | z_{1:T}, λ)

[Timeline figure: Z_1, ..., Z_T are observed; we infer only the most probable state sequence.]
Solution methods for these problems

Let's start by considering the observation sequence likelihood problem:

    Given z_{1:T}, compute P(z_{1:T} | λ)
Prob. of a sequence of 3 observations

    P(z_{1:3}) = Σ_{x_{1:3} ∈ paths of length 3} P(z_{1:3} ∧ x_{1:3})

               = Σ_{x_{1:3} ∈ paths of length 3} P(z_{1:3} | x_{1:3}) P(x_{1:3})

[Figure: the three-state example HMM again; each term in the sum corresponds to one length-3 path through its states.]

Note that a direct sum like this has one term per path – N^T terms in general – so we'll want something cleverer; see the brute-force sketch below for contrast.
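For intuition, here is a tiny brute-force sketch of this sum over paths, reusing the `pi`, `A`, `B` arrays from the generation sketch above; it is exponential in T and only meant as a reference to check the forward algorithm against.

    from itertools import product

    def likelihood_bruteforce(pi, A, B, z):
        """P(z_{1:T}) by summing P(z, x) over all N^T state paths x."""
        N, T = len(pi), len(z)
        total = 0.0
        for x in product(range(N), repeat=T):           # every length-T path
            p = pi[x[0]] * B[x[0], z[0]]                # start, then first emission
            for t in range(1, T):
                p *= A[x[t - 1], x[t]] * B[x[t], z[t]]  # transition, then emission
            total += p
        return total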
Computing the forward variables

Base case:

    α_1(i) ≡ P(z_1 ∧ X_1 = s_i)
           = P(z_1 | X_1 = s_i) P(X_1 = s_i)
           = b_i(z_1) π_i

[Trellis figure: the α_1(i) values fill the first column (time step 1) of the trellis; state i's entry combines π_i with the probability of observing z_1 there.]

Note: For simplicity, we'll drop explicit reference to conditioning on the HMM parameters λ for many of the upcoming slides, but it's always there implicitly.
Forward variables: inductive step

    α_{t+1}(j) ≡ P(z_{1:t+1} ∧ X_{t+1} = s_j)

      = Σ_{i=1}^N P(z_{1:t+1} ∧ X_t = s_i ∧ X_{t+1} = s_j)

      = Σ_{i=1}^N P(z_{1:t} ∧ z_{t+1} ∧ X_t = s_i ∧ X_{t+1} = s_j)        [split off last observation]

      = Σ_{i=1}^N P(z_{t+1} ∧ X_{t+1} = s_j | z_{1:t} ∧ X_t = s_i) P(z_{1:t} ∧ X_t = s_i)        [chain rule]

      = Σ_{i=1}^N P(z_{t+1} ∧ X_{t+1} = s_j | X_t = s_i) α_t(i)        [latest state and observation conditionally independent of earlier observations given X_t]

      = Σ_{i=1}^N P(z_{t+1} | X_{t+1} = s_j ∧ X_t = s_i) P(X_{t+1} = s_j | X_t = s_i) α_t(i)        [chain rule]

      = Σ_{i=1}^N P(z_{t+1} | X_{t+1} = s_j) a_{ij} α_t(i)        [latest observation conditionally independent of earlier states given latest state]

      = b_j(z_{t+1}) Σ_{i=1}^N a_{ij} α_t(i)

[Trellis figure: node α_{t+1}(j) in column t+1 combines the α_t(i) values from column t, each weighted by a_{ij}, and then multiplies by b_j(z_{t+1}).]
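Here is a minimal NumPy sketch of this recursion (function and variable names are our own); the columns of `alpha` are exactly the trellis columns above.

    import numpy as np

    def forward(pi, A, B, z):
        """Forward variables alpha[i, t] = P(z_1..z_{t+1}, X_{t+1} = s_i) (0-indexed t)."""
        N, T = len(pi), len(z)
        alpha = np.zeros((N, T))
        alpha[:, 0] = B[:, z[0]] * pi                           # alpha_1(i) = b_i(z_1) pi_i
        for t in range(1, T):
            alpha[:, t] = B[:, z[t]] * (A.T @ alpha[:, t - 1])  # b_j(z_{t+1}) sum_i a_ij alpha_t(i)
        return alpha

    # Observation sequence likelihood and filtering both fall out:
    #   P(z_{1:T}) = alpha[:, -1].sum()
    #   P(X_t = s_i | z_{1:t}) = alpha[i, t] / alpha[:, t].sum()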
In our example

    α_t(i) ≡ P(z_{1:t} ∧ X_t = s_i | λ)
    α_1(i) = b_i(z_1) π_i
    α_{t+1}(j) = b_j(z_{t+1}) Σ_i a_{ij} α_t(i)

[Figure: the three-state example HMM again.]

Observed: z_1 z_2 z_3 = u u w

Filtering

Recall that filtering asks for P(X_t | z_{1:t}, λ). The forward variables give it directly, up to normalization:

    P(X_t = s_i | z_{1:t}, λ) = α_t(i) / Σ_k α_t(k)
Prediction

• Note that the (state) prediction problem can be viewed as a special case of the filtering problem in which there are missing observations.
• That is, trying to compute the probability of X_k given observations up through time step t, with k > t, amounts to filtering with missing observations at time steps t+1, t+2, ..., k.
• Therefore, we now focus on the missing observations problem.
Missing Observations

• Looking at the derivation of the inductive step for computing the forward variables, we see that the last step involves writing

    α_{t+1}(j) = P(z_{t+1} | X_{t+1} = s_j) P(X_{t+1} = s_j ∧ all observations up through time t)
               = b_j(z_{t+1}) Σ_{i=1}^N a_{ij} α_t(i)

• Thus the second factor gives us a prediction of the state at time t+1 based on all earlier observations, which we then multiply by the observation probability at time t+1 given the state at time t+1.
• If there is no observation at time t+1, clearly the set of observations made through time t+1 is the same as the set of observations made through time t.
Missing Observations (cont.)

• Thus we redefine

    α_t(i) = P(X_t = s_i ∧ all available observations through time t)

• This generalizes our earlier definition but allows for the possibility that some observations are present and others are missing.
• Then define

    b′_i(z_t) = { b_i(z_t)  if there is an observation at time t
                { 1         otherwise

• It's not hard to see that the correct forward computation should then proceed as:

    α_1(i) = b′_i(z_1) π_i
    α_{t+1}(j) = b′_j(z_{t+1}) Σ_i a_{ij} α_t(i)

• Amounts to propagating state predictions forward wherever there are no observations; a sketch of this variant appears below.
• Interesting special case: when there are no observations at any time, the α values are identical to the p values we defined earlier for Markov chains.
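A hedged sketch of the same forward pass with optional observations, where `z` entries of `None` mark missing time steps (this encoding is our own):

    import numpy as np

    def forward_missing(pi, A, B, z):
        """Forward pass where z[t] is an observation index, or None if missing."""
        N, T = len(pi), len(z)
        alpha = np.zeros((N, T))

        def b_prime(t):
            # b'_i(z_t): emission column if observed, all ones otherwise
            return B[:, z[t]] if z[t] is not None else np.ones(N)

        alpha[:, 0] = b_prime(0) * pi
        for t in range(1, T):
            alpha[:, t] = b_prime(t) * (A.T @ alpha[:, t - 1])
        return alpha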
Backward variables: inductive step

    β_t(i) ≡ P(z_{t+1:T} | X_t = s_i)

      = Σ_{j=1}^N P(z_{t+1:T} ∧ X_{t+1} = s_j | X_t = s_i)

      = Σ_{j=1}^N P(z_{t+1:T} | X_{t+1} = s_j ∧ X_t = s_i) P(X_{t+1} = s_j | X_t = s_i)

      = Σ_{j=1}^N P(z_{t+1} ∧ z_{t+2:T} | X_{t+1} = s_j ∧ X_t = s_i) a_{ij}

      = Σ_{j=1}^N P(z_{t+1} ∧ z_{t+2:T} | X_{t+1} = s_j) a_{ij}

      = Σ_{j=1}^N P(z_{t+1} | z_{t+2:T} ∧ X_{t+1} = s_j) P(z_{t+2:T} | X_{t+1} = s_j) a_{ij}

      = Σ_{j=1}^N P(z_{t+1} | X_{t+1} = s_j) β_{t+1}(j) a_{ij}

      = Σ_{j=1}^N b_j(z_{t+1}) β_{t+1}(j) a_{ij}

(Base case: β_T(i) = 1 for all i, since z_{T+1:T} is the empty sequence.)

[Trellis figure: node β_t(i) in column t combines the β_{t+1}(j) values from column t+1, each weighted by a_{ij} b_j(z_{t+1}).]
Solving the smoothing problem

• Use the notation

    γ_t(i) = P(X_t = s_i | z_{1:T})

  for the probability we want to compute.
• Then

    γ_t(i) = c P(z_{1:T} | X_t = s_i) P(X_t = s_i)
           = c P(z_{1:t} | X_t = s_i) P(z_{t+1:T} | X_t = s_i) P(X_t = s_i)
           = c P(z_{1:t} ∧ X_t = s_i) P(z_{t+1:T} | X_t = s_i)
           = c α_t(i) β_t(i)

  where c = 1/P(z_{1:T}) is a constant of proportionality we can ignore as long as we normalize to get the actual probs.

Smoothing

Forward-backward algorithm

[Figure: an α pass left to right, a β pass right to left, and γ_t(i) ∝ α_t(i) β_t(i) at every node of the trellis.]
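A minimal sketch of the backward pass and the resulting smoothed marginals, paired with the `forward` function above (again, all names are our own):

    import numpy as np

    def backward(A, B, z):
        """Backward variables beta[i, t] = P(z_{t+2}..z_T | X_{t+1} = s_i) (0-indexed t)."""
        N, T = A.shape[0], len(z)
        beta = np.ones((N, T))                  # base case: beta_T(i) = 1
        for t in range(T - 2, -1, -1):          # right to left
            # beta_t(i) = sum_j a_ij b_j(z_{t+1}) beta_{t+1}(j)
            beta[:, t] = A @ (B[:, z[t + 1]] * beta[:, t + 1])
        return beta

    def smooth(pi, A, B, z):
        """gamma[i, t] = P(X_{t+1} = s_i | z_{1:T}) via forward-backward."""
        alpha, beta = forward(pi, A, B, z), backward(A, B, z)
        gamma = alpha * beta
        return gamma / gamma.sum(axis=0)        # normalize each column (the constant c)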
Solving the most probable path problem

• Want argmax_{x_{1:T}} P(x_{1:T} | z_{1:T})
• One approach:

    argmax_{x_{1:T}} P(x_{1:T} | z_{1:T}) = argmax_{x_{1:T}} P(z_{1:T} | x_{1:T}) P(x_{1:T}) / P(z_{1:T})
                                          = argmax_{x_{1:T}} P(z_{1:T} | x_{1:T}) P(x_{1:T})
DP for MPP (cont.)

• Define the values

    δ_t(i) ≡ max_{x_{1:t-1}} P(x_{1:t-1} ∧ X_t = s_i ∧ z_{1:t})

• We'll show that these values can be computed by an efficient forward computation similar to the computation of the α values.
• But first, let's check that it gives us something useful:

    δ_T(i) = max_{x_{1:T-1}} P(x_{1:T-1} ∧ X_T = s_i ∧ z_{1:T})
           = max_{x_{1:T-1}} P(x_{1:T-1} ∧ X_T = s_i | z_{1:T}) P(z_{1:T})

• Thus a value of i maximizing δ_T(i) identifies a state which represents the final state in a path maximizing P(x_{1:T} | z_{1:T}).
DP for MPP (cont.)

• Using the chain rule and the Markov property, we find that the probability to be maximized can be written as

    δ_{t+1}(j) = max_{x_{1:t}} P(x_{1:t} ∧ X_{t+1} = s_j ∧ z_{1:t+1})
               = max_i max_{x_{1:t-1}} [ b_j(z_{t+1}) a_{ij} P(x_{1:t-1} ∧ X_t = s_i ∧ z_{1:t}) ]
               = b_j(z_{t+1}) max_i [ a_{ij} max_{x_{1:t-1}} P(x_{1:t-1} ∧ X_t = s_i ∧ z_{1:t}) ]
               = b_j(z_{t+1}) max_i a_{ij} δ_t(i)

• This is the inductive step.
• Virtually identical to the computation of the forward variables α – the only difference is that it uses max instead of sum.
• Also need to keep track of which state s_i gives the max for each state s_j at the next time step, to be able to determine the actual MPP, not just its probability.
Viterbi Algorithm for Most Probable Path

Summary
• Base case: ∀i δ_1(i) = b_i(z_1) π_i
• Inductive step: ∀j δ_{t+1}(j) = b_j(z_{t+1}) max_i a_{ij} δ_t(i)
• Compute for all states at t=1, then t=2, etc.
• Also save the index giving the max for each state at each time step (backward pointers):

    ψ_{t+1}(j) = argmax_i a_{ij} δ_t(i)

• Construct the MPP by determining the state with the largest δ_T(i), then following backward pointers to time steps T-1, T-2, etc.

Viterbi Algorithm

Store two numbers at each node in the trellis, one for δ and the other a backward pointer to a node in the previous layer giving the max for this node – this is computed left to right.
To find a most probable path, determine a node in the T layer with max δ value, then follow backward pointers from right to left. A sketch appears below.

[Trellis figure: the inductive step δ_{t+1}(j) = b_j(z_{t+1}) max_i a_{ij} δ_t(i), with ψ_{t+1}(j) recording the maximizing predecessor for node (j, t+1).]
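A minimal Viterbi sketch in the same NumPy style (our own names; `psi` holds the backward pointers):

    import numpy as np

    def viterbi(pi, A, B, z):
        """Most probable state path given observations z (indices), as a list of states."""
        N, T = len(pi), len(z)
        delta = np.zeros((N, T))
        psi = np.zeros((N, T), dtype=int)          # backward pointers
        delta[:, 0] = B[:, z[0]] * pi              # delta_1(i) = b_i(z_1) pi_i
        for t in range(1, T):
            scores = A * delta[:, t - 1][:, None]  # scores[i, j] = a_ij delta_t(i)
            psi[:, t] = scores.argmax(axis=0)      # best predecessor for each state j
            delta[:, t] = B[:, z[t]] * scores.max(axis=0)
        # trace back from the best final state
        path = [int(delta[:, -1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(psi[path[-1], t]))
        return path[::-1]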
Prob. of a given transition (cont.)

    ξ_t(i, j) ≡ P(X_t = s_i ∧ X_{t+1} = s_j | z_{1:T})

      = c P(z_{1:T} | X_t = s_i ∧ X_{t+1} = s_j) P(X_t = s_i ∧ X_{t+1} = s_j)

      = c P(z_{1:T} | X_t = s_i ∧ X_{t+1} = s_j) P(X_{t+1} = s_j | X_t = s_i) P(X_t = s_i)

      = c P(z_{1:T} | X_t = s_i ∧ X_{t+1} = s_j) a_{ij} P(X_t = s_i)

      = c P(z_{1:t} | X_t = s_i) P(z_{t+1} | X_{t+1} = s_j) P(z_{t+2:T} | X_{t+1} = s_j) a_{ij} P(X_t = s_i)

      = c P(z_{1:t} ∧ X_t = s_i) b_j(z_{t+1}) P(z_{t+2:T} | X_{t+1} = s_j) a_{ij}

      = c α_t(i) a_{ij} b_j(z_{t+1}) β_{t+1}(j)

Normalizing away c = 1/P(z_{1:T}):

    ξ_t(i, j) = α_t(i) a_{ij} b_j(z_{t+1}) β_{t+1}(j) / Σ_{k,l} α_t(k) a_{kl} b_l(z_{t+1}) β_{t+1}(l)
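In code, all the ξ_t(i, j) values can be formed at once from the α and β tables (a sketch, continuing the conventions above):

    import numpy as np

    def transition_posteriors(alpha, beta, A, B, z):
        """xi[t, i, j] = P(X_{t+1} = s_i, X_{t+2} = s_j | z_{1:T}) for t = 0..T-2."""
        T = len(z)
        xi = np.array([
            # entry (i, j): alpha_t(i) a_ij b_j(z_{t+1}) beta_{t+1}(j)
            alpha[:, t][:, None] * A * (B[:, z[t + 1]] * beta[:, t + 1])[None, :]
            for t in range(T - 1)
        ])
        return xi / xi.sum(axis=(1, 2), keepdims=True)  # normalize each time slice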
Max. Likelihood HMM Inference

Given a state set {s_1, s_2, ..., s_N} and a set of R observation sequences

    z^1_{1:T_1} = (z^1_1, z^1_2, ..., z^1_{T_1})
    z^2_{1:T_2} = (z^2_1, z^2_2, ..., z^2_{T_2})
    ...
    z^R_{1:T_R} = (z^R_1, z^R_2, ..., z^R_{T_R})

determine the parameter set λ = (π_i, {a_{ij}}, {b_i(o_j)}) maximizing

    ∏_{r=1}^R P(z^r_{1:T_r} | λ)

A cheat

Let's first imagine that along with each observation sequence

    z^r_{1:T_r} = (z^r_1, z^r_2, ..., z^r_{T_r})

an oracle also gives us the corresponding state sequence

    x^r_{1:T_r} = (x^r_1, x^r_2, ..., x^r_{T_r})

Then we could obtain max. likelihood estimates of all parameters as follows:

    π̂_i = (# of sequences starting with s_i) / (total # of sequences)

    â_{ij} = (# of transitions s_i → s_j) / (# of visits to state s_i)

    b̂_i(o_k) = (# of visits to state s_i where o_k observed) / (# of visits to state s_i)
A cheat (cont.)

More formally, define the indicator functions

    χ^r_t(i)     = 1 if x^r_t = s_i, 0 otherwise

    χ^r_t(i → j) = 1 if x^r_t = s_i and x^r_{t+1} = s_j, 0 otherwise

    χ^r_t(i : k) = 1 if x^r_t = s_i and z^r_t = o_k, 0 otherwise
A cheat (cont.)

In terms of these indicator functions, our ML estimates would then be

    π̂_i = ( Σ_{r=1}^R χ^r_1(i) ) / R

    â_{ij} = ( Σ_{r=1}^R Σ_{t=1}^{T_r - 1} χ^r_t(i → j) ) / ( Σ_{r=1}^R Σ_{t=1}^{T_r - 1} χ^r_t(i) )

    b̂_i(o_k) = ( Σ_{r=1}^R Σ_{t=1}^{T_r} χ^r_t(i : k) ) / ( Σ_{r=1}^R Σ_{t=1}^{T_r} χ^r_t(i) )

For â_{ij}, we can't use the last state in any of the training sequences because there's no next state – hence the sums run only to T_r - 1.
The bad news ...

• There is no oracle to tell us the state sequence corresponding to each observation sequence.
• So we don't know these actual indicator function values.
• So we can't compute these sums.
The good news ...

• We can compute their expected values efficiently:

    γ^r_t(i) ≡ P(X_t = s_i | z^r_{1:T_r}, λ) = E(χ^r_t(i) | λ)

    ξ^r_t(i, j) ≡ P(X_t = s_i ∧ X_{t+1} = s_j | z^r_{1:T_r}, λ) = E(χ^r_t(i → j) | λ)

• Also:

    E(χ^r_t(i : k) | λ) = P(X_t = s_i ∧ Z_t = o_k | z^r_{1:T_r}, λ)
                        = P(X_t = s_i | z^r_{1:T_r}, λ) if z^r_t = o_k, 0 otherwise
                        = γ^r_t(i) I(z^r_t = o_k)

• Replacing the indicator sums by these expectations gives the Baum-Welch (EM) re-estimation updates:

    π_i ← ( Σ_{r=1}^R γ^r_1(i) ) / R

    a_{ij} ← ( Σ_{r=1}^R Σ_{t=1}^{T_r - 1} ξ^r_t(i, j) ) / ( Σ_{r=1}^R Σ_{t=1}^{T_r - 1} γ^r_t(i) )

    b_i(o_k) ← ( Σ_{r=1}^R Σ_{t=1}^{T_r} γ^r_t(i) I(z^r_t = o_k) ) / ( Σ_{r=1}^R Σ_{t=1}^{T_r} γ^r_t(i) )
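For a single observation sequence (R = 1), one Baum-Welch iteration can be sketched as follows, reusing `forward`, `backward`, and `transition_posteriors` from above. This is only an illustration of the updates under those assumed helpers – no log-space scaling, no convergence test.

    import numpy as np

    def baum_welch_step(pi, A, B, z):
        """One EM re-estimation step for a single observation sequence z."""
        alpha = forward(pi, A, B, z)
        beta = backward(A, B, z)
        gamma = alpha * beta
        gamma /= gamma.sum(axis=0)                        # gamma[i, t]
        xi = transition_posteriors(alpha, beta, A, B, z)  # xi[t, i, j]

        new_pi = gamma[:, 0]                              # pi_i <- gamma_1(i)
        # numerator sums xi over t = 1..T-1; denominator sums gamma over the same range
        new_A = xi.sum(axis=0) / gamma[:, :-1].sum(axis=1)[:, None]
        M = B.shape[1]
        obs_onehot = np.eye(M)[z]                         # I(z_t = o_k), shape (T, M)
        new_B = (gamma @ obs_onehot) / gamma.sum(axis=1)[:, None]
        return new_pi, new_A, new_B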
Remarks on Baum-Welch

• Bad news: There may be many local maxima.
• Good news: The local maxima are usually adequate models of the data.
• Any probabilities initialized to zero will remain zero throughout – useful when one wants a model with limited state transitions.
Some good references

• Standard HMM reference:
  L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
• Excellent reference for dynamic Bayes nets as a unifying framework for probabilistic temporal models (including HMMs and Kalman filters):
  Chapter 15 of Artificial Intelligence: A Modern Approach, 2nd Edition, by Russell & Norvig.