INT3406E 20, 2023-2024
Sequence Model:
Hidden Markov Models
Dr. Nguyen Van Vinh
UET-VNU
Why model sequences?
General Framework for many NLP
problems
Examples
Part-of-Speech Tagging
Chunking (Shallow Parsing)
Named Entity Recognition
Semantic Role Labeling
…
Part-of-Speech Tagging
What is Part of Speech?
The part of speech explains how a word is
used in a sentence
nouns, pronouns, adjectives, verbs, adverbs,
prepositions, conjunctions, …
How does POS tagging work?
The cat sat on the mat
The/DT cat/NN sat/VBD on/IN the/DT mat/NN
Penn treebank part-of-speech tagset
45 tags based on Wall Street Journal (WSJ)
Quiz
How many part of speech tags do you think
English has?
A) <10
B) 10-20
C) 20-40
D) > 40
Chunking – Named Entity Recognition
NLP pipelines
Source: https://spacy.io/usage/processing-pipelines
Outline
Hidden Markov models
Viterbi algorithm
Introduction
Modeling dependencies in input
Sequences:
Temporal: In speech; phonemes in a word (dictionary),
words in a sentence (syntax, semantics of the
language).
In handwriting, pen movements
Spatial: In a DNA sequence; base pairs
Andrei Andreyevich Markov
Born: 14 June 1856 in Ryazan, Russia
Died: 20 July 1922 in Petrograd (now
St Petersburg), Russia
Markov is particularly remembered
for his study of Markov chains,
sequences of random variables in
which the distribution of the future variable
depends only on the present variable
Discrete Markov Process
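N states: $S_1, S_2, \ldots, S_N$
State at "time" $t$: $q_t = S_i$
First-order Markov
$P(q_{t+1}=S_j \mid q_t=S_i, q_{t-1}=S_k, \ldots) = P(q_{t+1}=S_j \mid q_t=S_i)$
Transition probabilities
$a_{ij} \equiv P(q_{t+1}=S_j \mid q_t=S_i)$, with $a_{ij} \ge 0$ and $\sum_{j=1}^{N} a_{ij} = 1$
Initial probabilities
$\pi_i \equiv P(q_1=S_i)$, with $\sum_{i=1}^{N} \pi_i = 1$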
Markov random processes
A random sequence has the Markov property if the distribution of its next
state is determined solely by its current state. Any random process having
this property is called a Markov random process.
For observable state sequences (state is known from data), this
leads to a Markov chain model.
For non-observable states, this leads to a Hidden Markov Model
(HMM).
Chain Rule & Markov Property
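Product rule (from the definition of conditional probability)
$P(q_t, q_{t-1}, \ldots, q_1) = P(q_t \mid q_{t-1}, \ldots, q_1)\, P(q_{t-1}, \ldots, q_1)$
$P(q_t, q_{t-1}, \ldots, q_1) = P(q_t \mid q_{t-1}, \ldots, q_1)\, P(q_{t-1} \mid q_{t-2}, \ldots, q_1)\, P(q_{t-2}, \ldots, q_1)$
$P(q_t, q_{t-1}, \ldots, q_1) = P(q_1) \prod_{i=2}^{t} P(q_i \mid q_{i-1}, \ldots, q_1)$
Markov property
$P(q_i \mid q_{i-1}, \ldots, q_1) = P(q_i \mid q_{i-1})$ for $i > 1$
$P(q_t, q_{t-1}, \ldots, q_1) = P(q_1) \prod_{i=2}^{t} P(q_i \mid q_{i-1}) = P(q_1)\, P(q_2 \mid q_1) \cdots P(q_t \mid q_{t-1})$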
Stochastic Automaton
$P(O = Q \mid A, \pi) = P(q_1) \prod_{t=2}^{3} P(q_t \mid q_{t-1}) = \pi_{q_1}\, a_{q_1 q_2}\, a_{q_2 q_3}$
Example: Balls and Urns
Markov process with a non-hidden observation process:
a stochastic automaton
Three urns each full of balls of one color
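$S_1$: red, $S_2$: blue, $S_3$: green
$\pi = [0.5, 0.2, 0.3]^T$
$A = \begin{bmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{bmatrix}$
$O = \{S_1, S_1, S_3, S_3\}$
$P(O \mid A, \pi) = P(S_1) \cdot P(S_1 \mid S_1) \cdot P(S_3 \mid S_1) \cdot P(S_3 \mid S_3)$
$= \pi_1 \cdot a_{11} \cdot a_{13} \cdot a_{33}$
$= 0.5 \cdot 0.4 \cdot 0.3 \cdot 0.8 = 0.048$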
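The urn calculation can be checked with a few lines of Python. This is a minimal sketch: the function name `sequence_probability` and the 0-based state indices are illustrative choices, while the numbers for π and A are the ones given on the slide.

```python
# Markov chain from the balls-and-urns example (values from the slide).
pi = [0.5, 0.2, 0.3]            # initial probabilities for S1, S2, S3
A = [[0.4, 0.3, 0.3],
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]           # A[i][j] = P(next = S_(j+1) | current = S_(i+1))

def sequence_probability(states, pi, A):
    """Probability of a fully observed state sequence (0-based state indices)."""
    prob = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= A[prev][cur]    # one transition factor per consecutive pair
    return prob

O = [0, 0, 2, 2]                # S1, S1, S3, S3
p = sequence_probability(O, pi, A)
print(round(p, 6))              # 0.048
```

Because every state is observed directly, no summation over alternatives is needed; the probability is a single product along the sequence.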
Hidden Markov Models
States are not observable
Discrete observations {v1,v2,...,vM} are recorded;
a probabilistic function of the state
Emission probabilities
$b_j(m) \equiv P(O_t = v_m \mid q_t = S_j)$
Example: In each urn, there are balls of
different colors, but with different probabilities.
For each observation sequence, there are
multiple state sequences
From Markov To Hidden Markov
The previous model assumes that each state can be uniquely
associated with an observable event
Once an observation is made, the state of the system is then trivially retrieved
This model, however, is too restrictive to be of practical use for most realistic
problems
To make the model more flexible, we will assume that the outcomes or
observations of the model are a probabilistic function of each state
Each state can produce a number of outputs according to a unique probability
distribution, and each distinct output can potentially be generated at any state
These are known as Hidden Markov Models (HMMs) because the state sequence
is not directly observable; it can only be inferred from the sequence of
observations produced by the system
The urn-ball problem
To further illustrate the concept of an HMM, consider this scenario
You are placed in a room with a curtain
Behind the curtain there are N urns, each containing a large number of balls with M
different colors
The person behind the curtain selects an urn according to an internal random process,
then randomly grabs a ball from the selected urn
He shows you the ball, and places it back in the urn
This process is repeated over and over
Questions?
How would you represent this experiment with an HMM?
What are the states?
Why are the states hidden?
What are the observations?
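One way to make the experiment concrete is to simulate it. The sketch below assumes N = 2 urns and M = 3 colors with made-up probabilities (none of these numbers come from the slides), and the helper name `sample_hmm` is hypothetical. The urn indices play the role of hidden states and the ball colors are the observations.

```python
import random

def sample_hmm(pi, A, B, length, rng):
    """Draw (hidden urns, observed colors) from an HMM with the given parameters."""
    states, observations = [], []
    state = rng.choices(range(len(pi)), weights=pi)[0]           # pick first urn
    for _ in range(length):
        states.append(state)
        color = rng.choices(range(len(B[state])), weights=B[state])[0]
        observations.append(color)                                # ball shown to you
        state = rng.choices(range(len(A[state])), weights=A[state])[0]  # next urn
    return states, observations

# Hypothetical parameters, for illustration only.
pi = [0.6, 0.4]                              # which urn is chosen first
A = [[0.7, 0.3], [0.4, 0.6]]                 # urn-to-urn transition probabilities
B = [[0.8, 0.1, 0.1], [0.1, 0.3, 0.6]]       # color probabilities inside each urn
colors = ["red", "blue", "green"]

rng = random.Random(0)
urns, balls = sample_hmm(pi, A, B, 10, rng)
print([colors[b] for b in balls])            # what the observer sees
print(urns)                                  # what stays behind the curtain
```

The observer only ever receives the first printed list; recovering the second one from it is the decoding problem discussed later.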
Lecture Notes for E. Alpaydın 2004, Introduction to Machine Learning © The MIT Press (V1.1)
Hidden Markov Model (HMM)
HMMs allow you to estimate probabilities
of unobserved events
Given plain text, which underlying
parameters generated the surface
HMMs and their Usage
HMMs are very common in AI:
Speech recognition (observed: acoustic signal,
hidden: words)
Handwriting recognition (observed: image, hidden:
words)
Part-of-speech tagging (observed: words, hidden:
part-of-speech tags)
Named Entity Recognition (observed: words, hidden:
named-entity labels)
Example: POS Tagging
Homework exercise!!!
Reference: https://web.stanford.edu/~jurafsky/slp3/8.pdf
Example: Chunking
The Trellis
Parameters of an HMM
States: a set of states $S = \{s_1, \ldots, s_n\}$
Transition probabilities: $A = \{a_{1,1}, a_{1,2}, \ldots, a_{n,n}\}$. Each
$a_{i,j}$ represents the probability of transitioning
from state $s_i$ to $s_j$.
Emission probabilities: a set $B$ of functions of
the form $b_i(o_t)$, which is the probability of
observation $o_t$ being emitted by $s_i$
Initial state distribution: $\pi_i$ is the probability that
$s_i$ is a start state
The Three Basic HMM Problems
Problem 1 (Evaluation): Given the observation
sequence $O = o_1, \ldots, o_T$ and an HMM model
$\lambda = (A, B, \pi)$, how do we compute the
probability of $O$ given the model?
Problem 2 (Decoding): Given the observation
sequence $O = o_1, \ldots, o_T$ and an HMM model
$\lambda = (A, B, \pi)$, how do we find the state
sequence that best explains the observations?
The Three Basic HMM Problems
Problem 3 (Learning): How do we adjust
the model parameters $\lambda = (A, B, \pi)$ to
maximize $P(O \mid \lambda)$?
Problem 1: Probability of an Observation
Sequence
What is $P(O \mid \lambda)$?
The probability of an observation sequence is the
sum, over all possible state sequences in the HMM,
of the joint probability of the observations and that state sequence.
Naïve computation is very expensive. Given $T$
observations and $N$ states, there are $N^T$
possible state sequences.
Even small HMMs, e.g. $T=10$ and $N=10$,
contain 10 billion different paths
The solution to this and to Problem 2 is to use dynamic
programming
Examples
The Ice cream task by Jason
Source: https://web.stanford.edu/~jurafsky/slp3/A.pdf
Example (cont.)
$P(O) = \sum_{Q} P(O, Q) = \sum_{Q} P(O \mid Q)\, P(Q)$
P(3 1 3) = P(3 1 3, cold cold cold) +
P(3 1 3, cold cold hot) + P(3 1 3, hot
hot cold) + … = ?
The observation likelihood for the ice-cream
events 3 1 3 given the hidden state
sequence hot hot cold
$P(O, Q) = P(O \mid Q)\, P(Q) = \prod_{i=1}^{n} P(o_i \mid q_i) \times \prod_{i=1}^{n} P(q_i \mid q_{i-1})$
P(3 1 3, hot hot cold) = ?
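The sum over all state sequences can be evaluated by brute force for a sequence this short. The sketch below is illustrative only. The ice-cream parameter values are an assumption: they are not stated in the text of these slides and were chosen because they reproduce the value v3(2) = 0.012544 quoted later in the Viterbi example. Here $P(q_1 \mid q_0)$ is taken to be the initial probability of $q_1$.

```python
from itertools import product

# ASSUMED ice-cream parameters (not given in the slide text).
STATES = ["hot", "cold"]
PI = {"hot": 0.8, "cold": 0.2}
A = {"hot": {"hot": 0.7, "cold": 0.3},
     "cold": {"hot": 0.4, "cold": 0.6}}
B = {"hot": {1: 0.2, 2: 0.4, 3: 0.4},
     "cold": {1: 0.5, 2: 0.4, 3: 0.1}}

def joint_probability(obs, path):
    """P(O, Q): product of transition and emission terms along one path."""
    prob = PI[path[0]] * B[path[0]][obs[0]]
    for t in range(1, len(obs)):
        prob *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
    return prob

def brute_force_likelihood(obs):
    """P(O) as the sum of P(O, Q) over all N^T state sequences."""
    return sum(joint_probability(obs, path)
               for path in product(STATES, repeat=len(obs)))

obs = [3, 1, 3]
p_hhc = joint_probability(obs, ("hot", "hot", "cold"))
p_total = brute_force_likelihood(obs)
print(round(p_hhc, 6))    # 0.001344 under the assumed parameters
print(round(p_total, 6))  # 0.026264 under the assumed parameters
```

With 2 states and 3 observations there are only 8 paths, so enumeration is feasible here; it stops being so as soon as T grows.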
Forward Probabilities
What is the probability that, given an
HMM $\lambda$, at time $t$ the state is $s_i$ and the
partial observation $o_1 \ldots o_t$ has been
generated?
$\alpha_t(i) = P(o_1 \ldots o_t,\ q_t = s_i \mid \lambda)$
Forward Probabilities
$\alpha_t(i) = P(o_1 \ldots o_t,\ q_t = s_i \mid \lambda)$
$\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t)$
Forward Algorithm
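Initialization: $\alpha_1(i) = \pi_i\, b_i(o_1)$, $1 \le i \le N$
Induction:
$\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t)$, $2 \le t \le T$, $1 \le j \le N$
Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$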
Example
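As a concrete instance of the forward algorithm, the sketch below follows the three steps (initialization, induction, termination) for the ice-cream sequence 3 1 3. The parameter values are an assumption (they are not stated in the slide text and were picked to match the v3(2) = 0.012544 value quoted in the Viterbi example); the encoding of states and symbols as list indices is also just an illustrative choice.

```python
def forward(obs, pi, A, B):
    """Return the forward trellis alpha (alpha[t][j], 0-based t) and P(O | lambda)."""
    n = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    # Induction: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                      for j in range(n)])
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return alpha, sum(alpha[-1])

# ASSUMED parameters: state 0 = hot, state 1 = cold;
# symbol index k stands for "k + 1 ice creams".
pi = [0.8, 0.2]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]

obs = [2, 0, 2]                       # the sequence 3 1 3
alpha, likelihood = forward(obs, pi, A, B)
print(round(likelihood, 6))           # 0.026264 under the assumed parameters
```

Each trellis column is computed from the previous one only, which is what turns the exponential sum over paths into a polynomial computation.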
Forward Algorithm Complexity
In the naïve approach to solving Problem
1 it takes on the order of $2T \cdot N^T$
computations
The forward algorithm takes on the order
of $N^2 T$ computations
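To get a feel for the gap between the two orders of growth, the following sketch simply evaluates both expressions for N = 10 and T = 10. The function names are hypothetical and the counts are order-of-magnitude estimates, not exact operation counts.

```python
def naive_cost(n_states, length):
    """Order of growth of enumerating all paths: 2T * N^T."""
    return 2 * length * n_states ** length

def forward_cost(n_states, length):
    """Order of growth of the forward algorithm: N^2 * T."""
    return n_states ** 2 * length

N, T = 10, 10
print(naive_cost(N, T))     # 200000000000
print(forward_cost(N, T))   # 1000
```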
Backward Probabilities
Analogous to the forward probability, just in
the other direction
What is the probability that, given an HMM $\lambda$
and given that the state at time $t$ is $s_i$, the partial
observation $o_{t+1} \ldots o_T$ is generated?
$\beta_t(i) = P(o_{t+1} \ldots o_T \mid q_t = s_i, \lambda)$
Backward Probabilities
$\beta_t(i) = P(o_{t+1} \ldots o_T \mid q_t = s_i, \lambda)$
$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$
Backward Algorithm
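Initialization: $\beta_T(i) = 1$, $1 \le i \le N$
Induction:
$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$, $t = T-1, \ldots, 1$, $1 \le i \le N$
Termination:
$P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i)$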
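The backward pass can be sketched in the same style as the forward pass. The termination step used here multiplies each $\beta_1(i)$ by both the initial probability and the emission probability of the first observation, so that the result equals the forward likelihood. The ice-cream parameters are again an assumption, not values stated in the slide text.

```python
def backward(obs, pi, A, B):
    """Return the backward trellis beta (beta[t][i], 0-based t) and P(O | lambda)."""
    n, T = len(pi), len(obs)
    beta = [[1.0] * n for _ in range(T)]        # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):              # induction, from T-1 down to 1
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n))
    # Termination: sum_i pi_i * b_i(o_1) * beta_1(i)
    likelihood = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(n))
    return beta, likelihood

# ASSUMED parameters: state 0 = hot, state 1 = cold; symbol k = "k + 1 ice creams".
pi = [0.8, 0.2]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]

obs = [2, 0, 2]                                 # the sequence 3 1 3
beta, likelihood = backward(obs, pi, A, B)
print(round(likelihood, 6))                     # 0.026264 under the assumed parameters
```

Getting the same likelihood from both directions is a useful sanity check when implementing the two passes.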
Problem 2: Decoding
The solution to Problem 1 (Evaluation) efficiently gives us
the sum over all paths through an HMM.
For Problem 2, we want to find the path with the
highest probability.
We want to find the state sequence $Q = q_1 \ldots q_T$
such that
$Q = \arg\max_{Q'} P(Q' \mid O, \lambda)$
Viterbi Algorithm
Similar to computing the forward
probabilities, but instead of summing
over transitions from incoming states,
compute the maximum
Forward:
$\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t)$
Viterbi recursion:
$\delta_t(j) = \left[ \max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij} \right] b_j(o_t)$
where $\delta_t(j) = \max_{q_1, \ldots, q_{t-1}} P(q_1, \ldots, q_{t-1},\ o_1, \ldots, o_t,\ q_t = s_j \mid \lambda)$
Viterbi Algorithm
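Initialization: $\delta_1(i) = \pi_i\, b_i(o_1)$, $1 \le i \le N$
Induction:
$\delta_t(j) = \left[ \max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij} \right] b_j(o_t)$
$\psi_t(j) = \arg\max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}$, $2 \le t \le T$, $1 \le j \le N$
Termination: $p^* = \max_{1 \le i \le N} \delta_T(i)$, $q_T^* = \arg\max_{1 \le i \le N} \delta_T(i)$
Read out path: $q_t^* = \psi_{t+1}(q_{t+1}^*)$, $t = T-1, \ldots, 1$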
Example (1)
Example (2)
v3(2)= 0.012544
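The value above can be reproduced with a short Viterbi sketch. The ice-cream parameters below are an assumption: they are not stated in the slide text and were chosen because they yield 0.012544 for the hot state at time 3, on the further assumption that "state 2" in v3(2) denotes hot. The trellis is stored in `delta` and the backpointers in `psi`.

```python
def viterbi(obs, pi, A, B):
    """Return (delta trellis, best state path, probability of the best path)."""
    n, T = len(pi), len(obs)
    delta = [[pi[i] * B[i][obs[0]] for i in range(n)]]   # initialization
    psi = [[0] * n]                                      # backpointers (unused at t = 0)
    for t in range(1, T):                                # induction
        row, back = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[t - 1][i] * A[i][j])
            back.append(best_i)
            row.append(delta[t - 1][best_i] * A[best_i][j] * B[j][obs[t]])
        delta.append(row)
        psi.append(back)
    last = max(range(n), key=lambda i: delta[-1][i])     # termination
    path = [last]
    for t in range(T - 1, 0, -1):                        # read out path backwards
        path.append(psi[t][path[-1]])
    path.reverse()
    return delta, path, delta[-1][last]

# ASSUMED parameters: state 0 = hot, state 1 = cold; symbol k = "k + 1 ice creams".
pi = [0.8, 0.2]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]

delta, path, best = viterbi([2, 0, 2], pi, A, B)         # the sequence 3 1 3
print(round(delta[2][0], 6))                             # 0.012544
print(path)                                              # [0, 0, 0], i.e. hot hot hot
```

The only change relative to the forward pass is that `max` replaces `sum`, plus the bookkeeping needed to recover the arg max path.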
Problem 3: Learning
Up to now we've assumed that we know the
underlying model $\lambda = (A, B, \pi)$
Often these parameters are estimated on
annotated training data, which has two
drawbacks:
Annotation is difficult and/or expensive
Training data is different from the current data
We want to choose the parameters that maximize the likelihood of
the current data, i.e., we're looking for
a model $\lambda'$ such that $\lambda' = \arg\max_{\lambda} P(O \mid \lambda)$
Problem 3: Learning
Unfortunately, there is no known way to
analytically find a global maximum, i.e., a model
$\lambda'$ such that $\lambda' = \arg\max_{\lambda} P(O \mid \lambda)$
But it is possible to find a local maximum
Given an initial model $\lambda$, we can always find a
model $\lambda'$ such that $P(O \mid \lambda') \ge P(O \mid \lambda)$
Parameter Re-estimation
Use the forward-backward (or Baum-Welch)
algorithm, which is a hill-climbing
algorithm
Starting from an initial parameter instantiation,
the forward-backward algorithm iteratively
re-estimates the parameters and
improves the probability that the given
observations are generated by the new
parameters
Parameter Re-estimation
Three parameters need to be re-estimated:
Initial state distribution: $\pi_i$
Transition probabilities: $a_{i,j}$
Emission probabilities: $b_i(o_t)$
Re-estimating Transition Probabilities
What's the probability of being in state $s_i$
at time $t$ and going to state $s_j$, given the
current model and parameters?
$\xi_t(i, j) = P(q_t = s_i,\ q_{t+1} = s_j \mid O, \lambda)$
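Re-estimating Transition Probabilities
$\xi_t(i, j) = P(q_t = s_i,\ q_{t+1} = s_j \mid O, \lambda)$
$\xi_t(i, j) = \dfrac{\alpha_t(i)\, a_{i,j}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{i,j}\, b_j(o_{t+1})\, \beta_{t+1}(j)}$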
Re-estimating Transition Probabilities
The intuition behind the re-estimation
equation for transition probabilities is
$\hat{a}_{i,j} = \dfrac{\text{expected number of transitions from state } s_i \text{ to state } s_j}{\text{expected number of transitions from state } s_i}$
Formally:
$\hat{a}_{i,j} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{j'=1}^{N} \xi_t(i, j')}$
Re-estimating Transition Probabilities
Defining $\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j)$
as the probability of being in state $s_i$ at time $t$,
given the complete observation $O$,
we can say:
$\hat{a}_{i,j} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$
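The quantities $\xi_t(i,j)$ and $\gamma_t(i)$ can be computed directly from the forward and backward trellises. The following is a sketch under assumptions: the ice-cream parameters are not given in the slide text, and the helper names (`forward`, `backward`, `expected_counts`) are illustrative.

```python
def forward(obs, pi, A, B):
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha

def backward(obs, A, B):
    n, T = len(A), len(obs)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
                   for i in range(n)]
    return beta

def expected_counts(obs, pi, A, B):
    """Return xi[t][i][j] and gamma[t][i] for t = 1 .. T-1 (stored 0-based)."""
    n, T = len(pi), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    xi = []
    for t in range(T - 1):
        unnorm = [[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                   for j in range(n)] for i in range(n)]
        total = sum(map(sum, unnorm))            # the double sum in the denominator
        xi.append([[x / total for x in row] for row in unnorm])
    gamma = [[sum(row) for row in xi_t] for xi_t in xi]   # gamma_t(i) = sum_j xi_t(i, j)
    return xi, gamma

# ASSUMED parameters: state 0 = hot, state 1 = cold; symbol k = "k + 1 ice creams".
pi = [0.8, 0.2]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]
xi, gamma = expected_counts([2, 0, 2], pi, A, B)
print([round(g, 4) for g in gamma[0]])           # posterior over states at time 1
```

For every t, the entries of `xi[t]` sum to 1 because they form a distribution over state pairs given the whole observation sequence.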
Review of Probabilities
Forward probability: $\alpha_t(i)$
The joint probability of the partial observation $o_1, \ldots, o_t$
and of being in state $s_i$ at time $t$
Backward probability: $\beta_t(i)$
The probability of the partial observation $o_{t+1}, \ldots, o_T$,
given state $s_i$ at time $t$
Transition probability: $\xi_t(i, j)$
The probability of going from state $s_i$ to state $s_j$, given
the complete observation $o_1, \ldots, o_T$
State probability: $\gamma_t(i)$
The probability of being in state $s_i$ at time $t$, given the complete
observation $o_1, \ldots, o_T$
Re-estimating Initial State Probabilities
Initial state distribution: $\pi_i$ is the
probability that $s_i$ is a start state
Re-estimation is easy:
$\hat{\pi}_i$ = expected number of times in state $s_i$ at time 1
Formally:
$\hat{\pi}_i = \gamma_1(i)$
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
$\hat{b}_i(k) = \dfrac{\text{expected number of times in state } s_i \text{ and observing symbol } v_k}{\text{expected number of times in state } s_i}$
Formally:
$\hat{b}_i(k) = \dfrac{\sum_{t=1}^{T} \delta(o_t, v_k)\, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}$
where $\delta(o_t, v_k) = 1$ if $o_t = v_k$, and 0 otherwise
Note that $\delta$ here is the Kronecker delta function and is not
related to the $\delta$ in the discussion of the Viterbi algorithm!
The Updated Model
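Coming from $\lambda = (A, B, \pi)$ we get to
$\lambda' = (\hat{A}, \hat{B}, \hat{\pi})$ by the following update rules:
$\hat{a}_{i,j} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$
$\hat{b}_i(k) = \dfrac{\sum_{t=1}^{T} \delta(o_t, v_k)\, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}$
$\hat{\pi}_i = \gamma_1(i)$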
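The three update rules can be put together into one re-estimation step. The sketch below is not a reference implementation: it works on a single observation sequence, uses unscaled probabilities (so it is only suitable for short sequences), computes $\gamma_t(i)$ as $\alpha_t(i)\beta_t(i)/P(O \mid \lambda)$ so that it is also defined at $t = T$, and starts from assumed ice-cream parameters and a made-up observation sequence.

```python
def forward(obs, pi, A, B):
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha

def backward(obs, A, B):
    n, T = len(A), len(obs)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
                   for i in range(n)]
    return beta

def baum_welch_step(obs, pi, A, B):
    """One re-estimation step; also returns P(O | lambda) for the input model."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    likelihood = sum(alpha[-1])
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_pi = gamma[0][:]                                   # pi_i = gamma_1(i)
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(m)] for i in range(n)]
    return new_pi, new_A, new_B, likelihood

# ASSUMED starting model and a made-up observation sequence (symbol k = "k + 1 ice creams").
pi = [0.8, 0.2]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]
obs = [2, 0, 2, 1, 2, 1, 0, 0, 2, 0]

history = []
for _ in range(5):
    pi, A, B, likelihood = baum_welch_step(obs, pi, A, B)
    history.append(likelihood)       # likelihood of the model before each update
print(history)
```

The printed likelihoods should form a non-decreasing sequence, which is the hill-climbing behavior described for the forward-backward algorithm. Practical implementations typically rescale the trellises or work in log space to avoid underflow on long sequences.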
The inner loop for
forward-backward algorithm
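Given an input sequence and $(S, A, B, \pi)$
(This slide indexes states over times $1, \ldots, T+1$, with $o_t$ emitted on the step from time $t$ to $t+1$)
1. Calculate forward probability:
• Base case: $\alpha_i(1) = \pi_i$
• Recursive case: $\alpha_j(t+1) = \sum_{i} \alpha_i(t)\, a_{ij}\, b_j(o_t)$
2. Calculate backward probability:
• Base case: $\beta_i(T+1) = 1$
• Recursive case: $\beta_i(t) = \sum_{j} \beta_j(t+1)\, a_{ij}\, b_j(o_t)$
3. Calculate expected counts:
$\xi_t(ij) = \dfrac{\alpha_i(t)\, a_{ij}\, b_j(o_t)\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)}$
4. Update the parameters:
$a_{ij} = \dfrac{\sum_{t=1}^{T} \xi_t(ij)}{\sum_{j=1}^{N} \sum_{t=1}^{T} \xi_t(ij)}$
$b_j(k) = \dfrac{\sum_{i=1}^{N} \sum_{t=1}^{T} \delta(o_t, v_k)\, \xi_t(ij)}{\sum_{i=1}^{N} \sum_{t=1}^{T} \xi_t(ij)}$
$\pi_i = \sum_{j=1}^{N} \xi_1(i, j)$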
Iterations
Each iteration provides values for all the
parameters
The new model never decreases the
likelihood of the training data:
$P(O \mid \hat{\lambda}) \ge P(O \mid \lambda)$
The algorithm is not guaranteed to
reach a global maximum.
Expectation Maximization
The forward-backward algorithm is an
instance of the more general EM
algorithm
The E Step: Compute the forward and
backward probabilities for a given model
The M Step: Re-estimate the model
parameters
HMM, MEMM, CRF
Graphical Structures of simple HMM(A),
MEMM(B), and chain-structured CRF(C)
Homework (Study and coding)!
Study of CRF model
Refer: https://web.stanford.edu/~jurafsky/slp3/8.pdf
Programming with Viterbi Algorithm
Apply HMM, CRF for Part-of-Speech
Tagging
Reference
https://web.stanford.edu/~jurafsky/slp3/8.pdf
https://web.stanford.edu/~jurafsky/slp3/A.pdf
Some slides from the Internet