CS440/ECE448 Lecture 16: HMM Inference
and the Viterbi Algorithm
Mark Hasegawa-Johnson, 3/2021
CC-BY 4.0: You may remix or
redistribute if you cite the source.
Louis-Leopold Boilly, Passer Payez, 1803. Public domain
work of art, https://en.wikipedia.org/wiki/Umbrella
Outline
• Inference by Enumeration in an HMM
• Decoding using the Viterbi Algorithm
Example Scenario: UmbrellaWorld
Characters from the novel Hammered by Elizabeth Bear,
Scenario from chapter 15 of Russell & Norvig
Since he has read a lot about rain, Richard proposes a hidden Markov model:
• Rain on day t-1 (R_{t-1}) makes rain on day t (R_t) more likely.
• Elspeth usually brings her umbrella (U_t) on days when it rains (R_t), but not always.
(Figure: the HMM graphical model, with hidden state R_t and observation U_t.)
Example Scenario: UmbrellaWorld
Characters from the novel Hammered by Elizabeth Bear,
Scenario from chapter 15 of Russell & Norvig
• Richard has no idea whether or not it's raining on day 1, so he assumes
  P(R_1) = 0.5  (initial state model).
• Richard learns that the weather changes on 3 out of 10 days, thus
  P(R_t | R_{t-1}) = 0.7
  P(R_t | ¬R_{t-1}) = 0.3  (transition model).
• He also learns that Elspeth sometimes forgets her umbrella when it's raining, and that she sometimes brings an umbrella when it's not raining. Specifically,
  P(U_t | R_t) = 0.9
  P(U_t | ¬R_t) = 0.2  (observation model).
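These three tables are all there is to the model, so they can be written down directly in code. Below is a minimal sketch in Python; the names P_R1, P_trans, and P_obs are mine, not from the lecture.

```python
# Umbrella World HMM parameters, as dictionaries keyed by the truth values
# of the hidden state R (rain) and the observation U (umbrella).

# Initial state model: Richard has no idea whether it rains on day 1.
P_R1 = {True: 0.5, False: 0.5}

# Transition model: the weather changes on 3 out of 10 days.
P_trans = {True:  {True: 0.7, False: 0.3},   # P(R_t = . | R_{t-1} = T)
           False: {True: 0.3, False: 0.7}}   # P(R_t = . | R_{t-1} = F)

# Observation model: Elspeth sometimes forgets her umbrella.
P_obs = {True:  {True: 0.9, False: 0.1},     # P(U_t = . | R_t = T)
         False: {True: 0.2, False: 0.8}}     # P(U_t = . | R_t = F)
```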
Inference in an HMM: Example
• Elspeth has no umbrella on day 1.
• Elspeth has an umbrella on day 2.
• Assume P(R_1) = 0.5.
• What is the probability that it's raining on day 2?
  P(R_2 | ¬U_1, U_2) = ?
(Figure: the initial state model, transition model, and observation model from the previous slide.)
Inference by Enumeration
To calculate a probability P(R_2 | U_1, U_2):
1. Select: which variables do we need, in order to model the relationship among U_1, U_2, and R_2?
   • We also need R_1.
2. Multiply to compute the joint probability:
   P(R_1, R_2, U_1, U_2) = P(R_1) P(U_1 | R_1) P(R_2 | R_1) P(U_2 | R_2)
3. Add to eliminate the variables we don't care about:
   P(R_2, U_1, U_2) = Σ_{r_1} P(R_1 = r_1, R_2, U_1, U_2)
4. Divide: use Bayes' rule to get the desired conditional:
   P(R_2 | U_1, U_2) = P(R_2, U_1, U_2) / P(U_1, U_2)
(Figure: HMM graphical model with states R_1, R_2, …, R_{T-1}, R_T and observations U_1, U_2, …, U_{T-1}, U_T.)
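The four steps can be checked by brute force: enumerate every assignment of all four variables, keep the ones consistent with the evidence, then add and divide. A minimal sketch, using the same dictionary encoding of the model as above:

```python
from itertools import product

P_R1 = {True: 0.5, False: 0.5}
P_trans = {True: {True: 0.7, False: 0.3}, False: {True: 0.3, False: 0.7}}
P_obs = {True: {True: 0.9, False: 0.1}, False: {True: 0.2, False: 0.8}}

def joint(r1, r2, u1, u2):
    # Multiply: factor P(R1, R2, U1, U2) along the HMM.
    return P_R1[r1] * P_obs[r1][u1] * P_trans[r1][r2] * P_obs[r2][u2]

# Enumerate all 2**4 assignments; keep those consistent with the evidence
# U1 = False, U2 = True; then Add and Divide.
numer = denom = 0.0
for r1, r2, u1, u2 in product((True, False), repeat=4):
    if (u1, u2) != (False, True):
        continue                 # inconsistent with the observations
    p = joint(r1, r2, u1, u2)
    denom += p                   # Add: accumulates P(¬U1, U2)
    if r2:
        numer += p               # Add: accumulates P(R2, ¬U1, U2)
print(numer / denom)             # Divide: P(R2 | ¬U1, U2) ≈ 0.70
```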
Computational Complexity of “Inference by
Enumeration”
• Russell & Norvig call this "inference by enumeration" because you have to enumerate every possible combination of R_1, R_2, U_1, U_2.
• The complexity comes from this enumeration: if there are 2^n possible combinations, then the complexity can't be less than 2^n!
First simplification for HMMs: only enumerate
the values of the hidden variables
• Notice: we don't really need to calculate P(¬R_1, R_2, ¬U_1, ¬U_2) if we have already observed that U_2 is True!
• First computational simplification for HMMs:
• Only enumerate the possible values of the hidden variables.
• Set the observed variables to their observed values.
Inference by Enumerating only the Hidden Variables
Multiply:
P(R_1, R_2, ¬U_1, U_2) = P(R_1) P(¬U_1 | R_1) P(R_2 | R_1) P(U_2 | R_2)

Joint probabilities, with the observed events ¬U_1 and U_2 held fixed:

            ¬R_1      R_1
  ¬R_2      0.056     0.003
   R_2      0.108     0.0315

Transition probabilities:
                R_t = T    R_t = F
  R_{t-1} = T     0.7        0.3
  R_{t-1} = F     0.3        0.7

Observation probabilities:
             U_t = T    U_t = F
  R_t = T      0.9        0.1
  R_t = F      0.2        0.8

• We only compute joint probabilities that include the observed events, ¬U_1 and U_2.
• The numbers don't add up to one; they add up to P(¬U_1, U_2).
Inference by Enumerating only the Hidden Variables
Add:
P(R_2, ¬U_1, U_2) = Σ_{r_1} P(R_1 = r_1, R_2, ¬U_1, U_2)

Summed probabilities (observations ¬U_1, U_2 fixed):

  ¬R_2     0.059
   R_2     0.1395

(Transition and observation probability tables as on the previous slide.)
• We only compute joint probabilities that include the observed events, ¬U_1 and U_2.
• The numbers don't add up to one; they add up to P(¬U_1, U_2).
Inference by Enumerating only the Hidden Variables
Divide:
P(R_2 | ¬U_1, U_2) = P(R_2, ¬U_1, U_2) / P(¬U_1, U_2)

Normalized probabilities:

  ¬R_2     0.30
   R_2     0.70

(Transition and observation probability tables as before.)
• Normalize, so that the column sums to one.
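Here is a minimal sketch of the same multiply-add-divide computation, this time enumerating only the hidden variables R_1 and R_2 while the observations stay clamped at ¬U_1, U_2; it reproduces the tables above (the dictionary names are mine, not the lecture's):

```python
P_R1 = {True: 0.5, False: 0.5}
P_trans = {True: {True: 0.7, False: 0.3}, False: {True: 0.3, False: 0.7}}
P_obs = {True: {True: 0.9, False: 0.1}, False: {True: 0.2, False: 0.8}}

u1, u2 = False, True        # observed: no umbrella on day 1, umbrella on day 2

# Multiply: joint probability for each assignment of the hidden variables only.
joint = {(r1, r2): P_R1[r1] * P_obs[r1][u1] * P_trans[r1][r2] * P_obs[r2][u2]
         for r1 in (True, False) for r2 in (True, False)}
# e.g. joint[(False, True)] ≈ 0.108, joint[(True, True)] ≈ 0.0315

# Add: sum out R1.
P_R2_joint = {r2: joint[(True, r2)] + joint[(False, r2)] for r2 in (True, False)}
# ≈ {True: 0.1395, False: 0.059}

# Divide: normalize so the values sum to one.
Z = sum(P_R2_joint.values())    # = P(¬U1, U2) ≈ 0.1985
P_R2_given = {r2: p / Z for r2, p in P_R2_joint.items()}
print(P_R2_given)               # ≈ {True: 0.70, False: 0.30}
```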
First simplification for HMMs: only enumerate
the values of the hidden variables
• Only enumerate the possible values of the hidden variables. Set the
observed variables to their observed values.
• Inference with binary hidden variables: enumerate R_1, …, R_T; the complexity is O(2^T).
• Inference with N-ary hidden variables: if each of the variables R_t has N possible values, instead of only 2 possible values, then the inference complexity would be O(N^T).
Outline
• Inference by Enumeration in an HMM
• Decoding using the Viterbi Algorithm
Inference complexity in an HMM
• O(N^T) is still a lot. Can we do better?
• For a general Bayes net, no. Bayes net inference, in an arbitrary Bayes
net, is NP-complete.
• For an HMM, yes, we can do better.
Hard EM for an HMM: the Viterbi Algorithm
The Viterbi algorithm is the inference step of hard EM for an HMM.
Given a particular sequence of observations, what is the most likely underlying
sequence of states?
Viterbi Algorithm Example
Given a particular sequence of observations, what is the most likely underlying
sequence of states?
• Example: given U_1 = F, U_2 = T, U_3 = T, U_4 = F,
• what is the most likely sequence of state variables, R_1, R_2, R_3, R_4?
The Trellis
• X-axis = time
• Y-axis = state variable (the value of the hidden variable, R_t)
• Node = a particular state at a particular time
• Edge = possible transition from R_{t-1} to R_t
(Trellis figure: one row per state value, R_t = T on top and R_t = F on the bottom; one column per time step, labeled with the observations U_1 = F, U_2 = T, U_3 = T, U_4 = F.)
A Path Through the Trellis
• A path through the trellis is a sequence of connected states.
• For example, this path is the sequence R_1 = T, R_2 = F, R_3 = T, R_4 = T.
(Trellis figure: the path T, F, T, T highlighted; columns labeled U_1 = F, U_2 = T, U_3 = T, U_4 = F.)
Viterbi Algorithm Key Concept
Given a particular sequence of observations, what is the most
likely underlying sequence of states?
In other words, given a particular sequence of observations,
what is the most probable path through the trellis?
Viterbi Algorithm: Key concepts
Nodes and edges have numbers attached to them:
• Edge Probability: probability of taking that transition, and then generating the next observed output:
  e_{ijt} = P(R_t = j, U_t = u_t | R_{t-1} = i)
• Node Probability: probability of the best path up to node j at time t:
  v_{jt} = max_{r_1, …, r_{t-1}} P(U_1 = u_1, …, U_t = u_t, R_1 = r_1, …, R_{t-1} = r_{t-1}, R_t = j)
Edge Probabilities
e_{ijt} = P(R_t = j, U_t = u_t | R_{t-1} = i)
        = P(R_t = j | R_{t-1} = i) × P(U_t = u_t | R_t = j)

(Trellis figure: every edge labeled with its probability. With the observations U_1 = F, U_2 = T, U_3 = T, U_4 = F, the edge weights into times t = 2, 3, 4 are:
  T→T: 0.63, 0.63, 0.07
  F→T: 0.27, 0.27, 0.03
  T→F: 0.06, 0.06, 0.24
  F→F: 0.14, 0.14, 0.56)

Notice that, since U_2 and U_3 have the same observed values, their inbound edges have the same weights.
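A small sketch of the edge-probability computation, using the same dictionary encoding as earlier (the function name edge is mine):

```python
P_trans = {True: {True: 0.7, False: 0.3}, False: {True: 0.3, False: 0.7}}
P_obs = {True: {True: 0.9, False: 0.1}, False: {True: 0.2, False: 0.8}}
U = [False, True, True, False]          # observations U_1, ..., U_4

def edge(i, j, t):
    """e_{ijt} = P(R_t=j | R_{t-1}=i) * P(U_t=u_t | R_t=j); t is 1-based."""
    return P_trans[i][j] * P_obs[j][U[t - 1]]

print(edge(True, True, 2))    # 0.7 * 0.9 ≈ 0.63
print(edge(False, True, 4))   # 0.3 * 0.1 ≈ 0.03
```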
Node Probabilities
v_{jt} = max_{r_1, …, r_{t-1}} P(U_1 = u_1, …, U_t = u_t, R_1 = r_1, …, R_{t-1} = r_{t-1}, R_t = j)

(Trellis figure: edge probabilities as on the previous slide; all eight node probabilities are still unknown, marked "?".)
Node Probabilities: Initialization
For example, let's consider how to find v_{j,t} for t = 1:

v_{j,1} = P(U_1 = u_1, R_1 = j)
        = P(R_1 = j) × P(U_1 = u_1 | R_1 = j)
        = (0.5)(0.1) = 0.05 for j = T
        = (0.5)(0.8) = 0.40 for j = F

(Trellis figure: v_{T,1} = 0.05 and v_{F,1} = 0.40 filled in; all later node probabilities still "?".)
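The initialization step, as a sketch in the same style (names are mine):

```python
P_R1 = {True: 0.5, False: 0.5}
P_obs = {True: {True: 0.9, False: 0.1}, False: {True: 0.2, False: 0.8}}
U = [False, True, True, False]                 # U_1 = F, ..., U_4 = F

# v[j] holds v_{j,1}: the probability of starting in state j and emitting U_1.
v1 = {j: P_R1[j] * P_obs[j][U[0]] for j in (True, False)}
print(v1)   # ≈ {True: 0.05, False: 0.40}
```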
… and what about time t=2?
Notice that, at time t = 2, there are two ways to get to any particular state:
• The previous state might have been F.
• The previous state might have been T.
(Trellis figure: v_{T,1} = 0.05 and v_{F,1} = 0.40 are known; the node probabilities at times 2, 3, and 4 are still "?".)
Viterbi Algorithm: the iteration step
Given edge probabilities defined as
  e_{ijt} = P(R_t = j, U_t = u_t | R_{t-1} = i)
and node probabilities defined as
  v_{jt} = max_{r_1, …, r_{t-2}, i} P(U_1 = u_1, …, U_t = u_t, R_1 = r_1, …, R_{t-1} = i, R_t = j),
the node probability can be efficiently computed as
  v_{jt} = max_i [ ( max_{r_1, …, r_{t-2}} P(U_1 = u_1, …, U_{t-1} = u_{t-1}, R_1 = r_1, …, R_{t-1} = i) )
                   × P(R_t = j, U_t = u_t | R_{t-1} = i) ]
Viterbi Algorithm: the iteration step
Given edge probabilities defined as
  e_{ijt} = P(R_t = j, U_t = u_t | R_{t-1} = i)
and node probabilities defined as
  v_{jt} = max_{r_1, …, r_{t-2}, i} P(U_1 = u_1, …, U_t = u_t, R_1 = r_1, …, R_{t-1} = i, R_t = j),
the node probability can be efficiently computed as
  v_{jt} = max_i v_{i,t-1} e_{ijt}
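The recursion v_{jt} = max_i v_{i,t-1} e_{ijt} is one dictionary comprehension per time step. A minimal sketch under the same encoding as before (names are mine):

```python
P_trans = {True: {True: 0.7, False: 0.3}, False: {True: 0.3, False: 0.7}}
P_obs = {True: {True: 0.9, False: 0.1}, False: {True: 0.2, False: 0.8}}
U = [False, True, True, False]

def edge(i, j, t):                       # e_{ijt}, t is 1-based
    return P_trans[i][j] * P_obs[j][U[t - 1]]

def viterbi_step(v_prev, t):
    """One iteration: v_{jt} = max_i v_{i,t-1} * e_{ijt}."""
    return {j: max(v_prev[i] * edge(i, j, t) for i in (True, False))
            for j in (True, False)}

print(viterbi_step({True: 0.05, False: 0.40}, 2))   # ≈ {True: 0.108, False: 0.056}
```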
… and what about time t=2?
v_{T,2} = max_i v_{i,1} e_{i,T,2}
        = max{ (0.05)(0.63), (0.4)(0.27) }
        = 0.108

(Trellis figure: v_{T,1} = 0.05, v_{F,1} = 0.40, v_{T,2} = 0.108; the remaining node probabilities are still "?".)
… and what about time t=2?
v_{T,2} = max_i v_{i,1} e_{i,T,2} = max{ (0.05)(0.63), (0.4)(0.27) } = 0.108
v_{F,2} = max_i v_{i,1} e_{i,F,2} = max{ (0.05)(0.06), (0.4)(0.14) } = 0.056

(Trellis figure: v_{T,1} = 0.05, v_{F,1} = 0.40, v_{T,2} = 0.108, v_{F,2} = 0.056; the time-3 and time-4 node probabilities are still "?".)
Node probabilities and backpointers
• Node Probability: probability of the best path up to node j at time t:
  v_{jt} = max_{r_1, …, r_{t-2}, i} P(U_1 = u_1, …, U_t = u_t, R_1 = r_1, …, R_{t-1} = i, R_t = j)
• Backpointer: which node, i, precedes node j on the best path?
  i*_{jt} = argmax_i [ max_{r_1, …, r_{t-2}} P(U_1 = u_1, …, U_t = u_t, R_1 = r_1, …, R_{t-1} = i, R_t = j) ]
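In code, the backpointer is just the argmax recorded alongside each max. A sketch of one iteration that returns both (again, names are mine):

```python
P_trans = {True: {True: 0.7, False: 0.3}, False: {True: 0.3, False: 0.7}}
P_obs = {True: {True: 0.9, False: 0.1}, False: {True: 0.2, False: 0.8}}
U = [False, True, True, False]

def edge(i, j, t):
    return P_trans[i][j] * P_obs[j][U[t - 1]]

def viterbi_step_with_backpointers(v_prev, t):
    """Returns (v, back): v[j] = max_i v_prev[i] * e_{ijt}; back[j] = the best i."""
    v, back = {}, {}
    for j in (True, False):
        # argmax over the predecessor state i
        best_i = max((True, False), key=lambda i: v_prev[i] * edge(i, j, t))
        back[j] = best_i
        v[j] = v_prev[best_i] * edge(best_i, j, t)
    return v, back

v2, back2 = viterbi_step_with_backpointers({True: 0.05, False: 0.40}, 2)
print(v2, back2)   # ≈ {True: 0.108, False: 0.056}, {True: False, False: False}
```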
Backpointers at t=2
i*_{T,2} = argmax_i v_{i,1} e_{i,T,2}
         = argmax{ i = T: (0.05)(0.63), i = F: (0.4)(0.27) }
         = F
i*_{F,2} = argmax_i v_{i,1} e_{i,F,2}
         = argmax{ i = T: (0.05)(0.06), i = F: (0.4)(0.14) }
         = F

(Trellis figure: node probabilities 0.05 and 0.40 at t = 1, 0.108 and 0.056 at t = 2; the backpointer from each time-2 node points back to R_1 = F.)
Backpointers at t=3
i*_{T,3} = argmax_i v_{i,2} e_{i,T,3}
         = argmax{ i = T: (0.108)(0.63), i = F: (0.056)(0.27) }
         = T
i*_{F,3} = argmax_i v_{i,2} e_{i,F,3}
         = argmax{ i = T: (0.108)(0.06), i = F: (0.056)(0.14) }
         = F

(Trellis figure: node probabilities 0.05, 0.40 at t = 1; 0.108, 0.056 at t = 2; 0.068, 0.008 at t = 3; the backpointer into R_3 = T comes from R_2 = T, and the backpointer into R_3 = F comes from R_2 = F.)
Backpointers at t=4
i*_{T,4} = argmax_i v_{i,3} e_{i,T,4}
         = argmax{ i = T: (0.068)(0.07), i = F: (0.008)(0.03) }
         = T
i*_{F,4} = argmax_i v_{i,3} e_{i,F,4}
         = argmax{ i = T: (0.068)(0.24), i = F: (0.008)(0.56) }
         = T

(Trellis figure: node probabilities 0.05, 0.40; 0.108, 0.056; 0.068, 0.008; 0.005, 0.016; both backpointers into the time-4 nodes come from R_3 = T.)
So which is the best path?
• Answer: whichever one is most probable.
Node probabilities at t=4
v_{T,4} = max_i v_{i,3} e_{i,T,4} = max{ (0.068)(0.07), (0.008)(0.03) } = 0.005
v_{F,4} = max_i v_{i,3} e_{i,F,4} = max{ (0.068)(0.24), (0.008)(0.56) } = 0.016

The best path is the one that ends at R_4 = F.

(Trellis figure: all node probabilities filled in: 0.05, 0.40; 0.108, 0.056; 0.068, 0.008; 0.005, 0.016.)
Termination: which is the best path?
• Best final state is whichever final state has the highest node
probability.
• The best path leading to that state is the most probable one
• … but we’ve already found the most probable path…
• …we just need to follow the backpointers!
Follow the backpointers!
Given the observations U_1, U_2, U_3, U_4 = F, T, T, F, and given our hidden Markov model, we conclude that the most probable sequence of state variables is
  R_1, R_2, R_3, R_4 = F, T, T, F.

(Trellis figure: starting from the most probable final node, R_4 = F, the backpointers trace out the path F, T, T, F.)
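Putting initialization, iteration, backpointers, and termination together gives the whole algorithm. A minimal end-to-end sketch for the umbrella example (all names are mine, not the lecture's):

```python
P_R1 = {True: 0.5, False: 0.5}
P_trans = {True: {True: 0.7, False: 0.3}, False: {True: 0.3, False: 0.7}}
P_obs = {True: {True: 0.9, False: 0.1}, False: {True: 0.2, False: 0.8}}

def viterbi(observations):
    states = (True, False)
    # Initialization: v_{j,1} = P(R_1=j) P(U_1=u_1 | R_1=j).
    v = [{j: P_R1[j] * P_obs[j][observations[0]] for j in states}]
    back = [{}]
    # Iteration: v_{j,t} = max_i v_{i,t-1} e_{ijt}, remembering the argmax.
    for t in range(1, len(observations)):
        v.append({})
        back.append({})
        for j in states:
            best_i = max(states,
                         key=lambda i: v[t - 1][i] * P_trans[i][j] * P_obs[j][observations[t]])
            back[t][j] = best_i
            v[t][j] = v[t - 1][best_i] * P_trans[best_i][j] * P_obs[j][observations[t]]
    # Termination: pick the most probable final state, then follow the backpointers.
    last = max(states, key=lambda j: v[-1][j])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

print(viterbi([False, True, True, False]))   # [False, True, True, False] = F, T, T, F
```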
Some conclusions
• We have discovered that, if Elspeth brings her umbrella only on days 2
and 3, then the best inference is that it’s raining on only those days.
• Well, this was kind of obvious from the beginning!
Some conclusions
Other types of HMMs might be less obvious. For example, consider the
following assertion:
A fly flies well. A well does not fly.
In order to decide if these sentences are true or false, you first need to
know which words are nouns, which verbs, and which adverbs.
In MP4, you will solve this problem using an HMM.
• State variable = part of speech
• Observation = word
• Transition model: verbs tend to come after nouns.
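The same recursion applies unchanged once the states are parts of speech and the observations are words; only the tables change. Below is a purely illustrative sketch with a generic Viterbi function and an invented toy model; none of the tags, words, or probabilities come from MP4.

```python
def viterbi(obs, states, P_init, P_trans, P_emit):
    """Generic Viterbi: most probable state sequence for the observed words."""
    v = [{s: P_init[s] * P_emit[s].get(obs[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for j in states:
            best_i = max(states, key=lambda i: v[t - 1][i] * P_trans[i][j] * P_emit[j].get(obs[t], 0.0))
            back[t][j] = best_i
            v[t][j] = v[t - 1][best_i] * P_trans[best_i][j] * P_emit[j].get(obs[t], 0.0)
    last = max(states, key=lambda j: v[-1][j])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Invented toy model: tags as states, words as observations.
tags = ("DET", "NOUN", "VERB")
P_init = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
P_trans = {"DET":  {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
           "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},   # verbs tend to come after nouns
           "VERB": {"DET": 0.4, "NOUN": 0.3, "VERB": 0.3}}
P_emit = {"DET":  {"a": 0.9, "the": 0.1},
          "NOUN": {"fly": 0.4, "well": 0.3, "flies": 0.3},
          "VERB": {"flies": 0.5, "fly": 0.4, "well": 0.1}}

print(viterbi(["a", "fly", "flies"], tags, P_init, P_trans, P_emit))
# ['DET', 'NOUN', 'VERB']
```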
Final Word: Computational Complexity
• Inference by Enumeration in an HMM: O(N^T)
  P(vars you care about) = Σ_{unobserved vars} P(all vars)
• Decoding using the Viterbi Algorithm: O(T N^2)
  v_{jt} = max_i v_{i,t-1} e_{ijt}
  Max over N values of i, performed for N values of j, and for T values of t.