Probabilistic graphical models
CPSC 532c (Topics in AI)
Stat 521a (Topics in multivariate analysis)
Lecture 9
Kevin Murphy
Monday 18 October 2004
Administrivia
• HW4 due today
Review
• Variable elimination can be used to answer a single query, P(X_q | e).
• VarElim requires an elimination ordering; you can use elimOrderGreedy to find this.
• VarElim implicitly creates an elimination tree (a junction tree with
non-maximal cliques).
• You can create a jtree of maximal cliques by triangulating and using
max weight spanning tree.
• Given a jtree, we can compute P(X_c | e) for all cliques c using belief propagation (BP).
Belief propagation
• There are two variants of BP, which we will cover today:
• Shafer-Shenoy, which multiplies together all-but-one of the incoming messages:
    δ_{i→j} = f · ∏_{k ∈ N_i \ {j}} δ_{k→i}
• Lauritzen-Spiegelhalter, which multiplies together all incoming messages and then divides one back out (a numeric check that the two rules agree follows below):
    δ_{i→j} = f · ( ∏_{k ∈ N_i} δ_{k→i} ) / δ_{j→i}
[Figure: clique tree in which C_k and C_k' send messages into C_i, which sends δ_{i→j} on to C_j, towards the root C_r]
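As a quick sanity check (not from the slides), the two update rules yield the same message whenever the divided-out message is strictly positive. A minimal NumPy sketch, with all names invented and, for simplicity, every term defined over the same scope:

import numpy as np

# Illustrative check that the Shafer-Shenoy rule (product over N_i \ {j}) and the
# Lauritzen-Spiegelhalter rule (product over all of N_i, divided by delta_{j->i})
# give the same message when delta_{j->i} > 0. Real messages live on separator
# scopes and need proper factor products; here everything shares one scope.
rng = np.random.default_rng(0)
f = rng.random(4)                                    # local term f
incoming = {"k": rng.random(4), "kp": rng.random(4), "j": rng.random(4)}

shafer_shenoy = f * incoming["k"] * incoming["kp"]   # multiply all-but-one incoming message
lauritzen = (f * incoming["k"] * incoming["kp"] * incoming["j"]) / incoming["j"]
assert np.allclose(shafer_shenoy, lauritzen)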
Shafer-Shenoy algorithm
{ψ_i^1} := function Ctree-VE-calibrate({φ}, T, α)
    R := pickRoot(T)
    DT := mkRootedTree(T, R)
    {ψ_i^0} := initializeCliques(φ, α)
    (* Upwards pass *)
    for i ∈ postorder(DT)
        j := pa(DT, i)
        δ_{i→j} := VE-msg({δ_{k→i} : k ∈ ch(DT, i)}, ψ_i^0)
Sub-functions
{ψ_i^0} := function initializeCliques(φ, α)
    for i := 1 : C
        ψ_i^0(C_i) := ∏_{φ : α(φ)=i} φ

δ_{i→j} := function VE-msg({δ_{k→i}}, ψ_i^0)
    ψ_i^1(C_i) := ψ_i^0(C_i) ∏_k δ_{k→i}
    δ_{i→j}(S_{i,j}) := Σ_{C_i \ S_{i,j}} ψ_i^1(C_i)
Shafer-Shenoy algorithm
(* Downwards pass *)
for i ∈ preorder(DT)
    for j ∈ ch(DT, i)
        δ_{i→j} := VE-msg({δ_{k→i} : k ∈ N_i \ {j}}, ψ_i^0)

(* Combine *)
for i := 1 : C
    ψ_i^1 := ψ_i^0 ∏_{k ∈ N_i} δ_{k→i}
[Figure: the same clique tree, with messages now flowing from the root back down towards the leaves]
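A runnable Python/NumPy sketch of the two-pass algorithm above may help make the pseudocode concrete. It is illustrative only (none of the helper names come from the course code) and assumes discrete variables named by single letters, so that np.einsum subscripts can double as variable names; a factor is a (vars, table) pair.

import numpy as np

def multiply(f1, f2):
    """Pointwise product of two factors over the union of their scopes."""
    v1, t1 = f1
    v2, t2 = f2
    out = v1 + ''.join(v for v in v2 if v not in v1)
    return out, np.einsum(f'{v1},{v2}->{out}', t1, t2)

def marginalize(f, keep):
    """Sum out every variable of f that is not in `keep`."""
    v, t = f
    out = ''.join(x for x in v if x in keep)
    return out, np.einsum(f'{v}->{out}', t)

def ve_msg(in_msgs, psi0, sep):
    """VE-msg: multiply the clique potential by the incoming messages, then sum onto the separator."""
    psi = psi0
    for m in in_msgs:
        psi = multiply(psi, m)
    return marginalize(psi, sep)

def calibrate(cliques, potentials, edges, root):
    """Ctree-VE-calibrate: upward (post-order) and downward (pre-order) passes.
    cliques: id -> string of variable names; potentials: id -> factor psi^0_i
    (the product of the factors assigned to clique i); edges: list of tree edges."""
    nbrs = {i: set() for i in cliques}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    sep = {(i, j): ''.join(v for v in cliques[i] if v in cliques[j])
           for i in cliques for j in nbrs[i]}
    # Root the tree; `order` lists every clique with its parent before it.
    order, parent, stack = [], {root: None}, [root]
    while stack:
        i = stack.pop()
        order.append(i)
        for j in nbrs[i]:
            if j not in parent:
                parent[j] = i
                stack.append(j)
    msgs = {}
    # Upwards pass: children before parents.
    for i in reversed(order):
        j = parent[i]
        if j is not None:
            incoming = [msgs[(k, i)] for k in nbrs[i] if k != j]
            msgs[(i, j)] = ve_msg(incoming, potentials[i], sep[(i, j)])
    # Downwards pass: parents before children.
    for i in order:
        for j in nbrs[i]:
            if parent.get(j) == i:
                incoming = [msgs[(k, i)] for k in nbrs[i] if k != j]
                msgs[(i, j)] = ve_msg(incoming, potentials[i], sep[(i, j)])
    # Combine: psi^1_i = psi^0_i times the product of all incoming messages.
    beliefs = {}
    for i in cliques:
        b = potentials[i]
        for k in nbrs[i]:
            b = multiply(b, msgs[(k, i)])
        beliefs[i] = b
    return beliefs

For the chain A − B − C with cliques C1 = {A, B} and C2 = {B, C}, calibrate({1: 'ab', 2: 'bc'}, {1: ('ab', psi1), 2: ('bc', psi2)}, [(1, 2)], root=1) returns beliefs proportional to P(A, B) and P(B, C).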
Shafer Shenoy for HMMs
[Figure: HMM X_1 → X_2 → X_3 → X_4 → ⋯ with observations Y_1, …, Y_4, and clique tree C1: X1,X2 − C2: X2,X3 − C3: X3,X4]
ψ_t^0(X_t, X_{t+1}) = P(X_{t+1} | X_t) p(y_{t+1} | X_{t+1})
δ_{t→t+1}(X_{t+1}) = Σ_{X_t} δ_{t−1→t}(X_t) ψ_t^0(X_t, X_{t+1})
δ_{t→t−1}(X_t) = Σ_{X_{t+1}} δ_{t+1→t}(X_{t+1}) ψ_t^0(X_t, X_{t+1})
ψ_t^1(X_t, X_{t+1}) = δ_{t−1→t}(X_t) δ_{t+1→t}(X_{t+1}) ψ_t^0(X_t, X_{t+1})
Forwards-backwards algorithm for HMMs
α_t(i) ≜ δ_{t−1→t}(i) = P(X_t = i, y_{1:t})
β_t(i) ≜ δ_{t→t−1}(i) = p(y_{t+1:T} | X_t = i)
ξ_t(i, j) ≜ ψ_t^1(X_t = i, X_{t+1} = j) = P(X_t = i, X_{t+1} = j, y_{1:T})
A(i, j) ≜ P(X_{t+1} = j | X_t = i)
B_t(i) ≜ p(y_t | X_t = i)

α_t(j) = Σ_i α_{t−1}(i) A(i, j) B_t(j)
β_t(i) = Σ_j β_{t+1}(j) A(i, j) B_{t+1}(j)
ξ_t(i, j) = α_t(i) β_{t+1}(j) A(i, j) B_{t+1}(j)
γ_t(i) ≜ P(X_t = i | y_{1:T}) ∝ α_t(i) β_t(i) ∝ Σ_j ξ_t(i, j)
Forwards-backwards algorithm, matrix-vector form
α_t(j) = Σ_i α_{t−1}(i) A(i, j) B_t(j)              ⇔   α_t = (Aᵀ α_{t−1}) .* B_t
β_t(i) = Σ_j β_{t+1}(j) A(i, j) B_{t+1}(j)           ⇔   β_t = A (β_{t+1} .* B_{t+1})
ξ_t(i, j) = α_t(i) β_{t+1}(j) A(i, j) B_{t+1}(j)     ⇔   ξ_t = [α_t (β_{t+1} .* B_{t+1})ᵀ] .* A
γ_t(i) ∝ α_t(i) β_t(i)                               ⇔   γ_t ∝ α_t .* β_t
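These recursions translate directly into NumPy. A sketch under assumed conventions (not the course code): A[i, j] = P(X_{t+1}=j | X_t=i), B[t, j] = p(y_t | X_t=j), pi[j] = P(X_1=j); no scaling is done here, so see the underflow-safe version two slides below.

import numpy as np

def forwards_backwards(pi, A, B):
    # Unscaled forwards-backwards recursions; assumes T >= 2.
    T, K = B.shape
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = pi * B[0]                        # alpha_1(j) = P(X_1=j) p(y_1 | X_1=j)
    for t in range(1, T):
        alpha[t] = (A.T @ alpha[t - 1]) * B[t]  # alpha_t = (A' alpha_{t-1}) .* B_t
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (beta[t + 1] * B[t + 1])  # beta_t = A (beta_{t+1} .* B_{t+1})
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)   # gamma_t ∝ alpha_t .* beta_t
    xi = np.stack([np.outer(alpha[t], beta[t + 1] * B[t + 1]) * A
                   for t in range(T - 1)])      # xi_t = [alpha_t (beta_{t+1} .* B_{t+1})'] .* A
    xi /= xi.sum(axis=(1, 2), keepdims=True)
    return alpha, beta, gamma, xi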
HMM trellis
• The forwards algorithm uses dynamic programming to efficiently sum over all possible paths that end in state i at time t:
    α_t(i) ≜ P(X_t = i, y_{1:t})
           = Σ_{X_1} ⋯ Σ_{X_{t−1}} P(X_1, …, X_{t−1}, y_{1:t−1}) P(X_t | X_{t−1}) p(y_t | X_t)
           = Σ_{X_{t−1}} P(X_{t−1}, y_{1:t−1}) P(X_t | X_{t−1}) p(y_t | X_t)
           = Σ_{X_{t−1}} α_{t−1}(X_{t−1}) P(X_t | X_{t−1}) p(y_t | X_t)
Avoiding numerical underflow in HMMs
• α_t(j) ≜ P(X_t = j, y_{1:t}) is a tiny number
• Hence in practice we use
    α̂_t(j) ≜ P(X_t = j | y_{1:t}) = P(X_t = j, y_t | y_{1:t−1}) / p(y_t | y_{1:t−1})
            = [Σ_i P(X_{t−1} = i | y_{1:t−1}) P(X_t = j | X_{t−1} = i) p(y_t | X_t = j)] / p(y_t | y_{1:t−1})
            = (1/c_t) Σ_i α̂_{t−1}(i) A(i, j) B_t(j)
  where
    c_t ≜ p(y_t | y_{1:t−1}) = Σ_j Σ_i α̂_{t−1}(i) A(i, j) B_t(j)

log p(y_{1:T}) = log [p(y_1) p(y_2 | y_1) p(y_3 | y_{1:2}) ⋯] = log ∏_{t=1}^T c_t = Σ_{t=1}^T log c_t
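Under the same A, B, pi conventions as the earlier sketch, the scaled recursion is a few lines; the normalizers c_t are accumulated in log space so the function returns log p(y_{1:T}) = Σ_t log c_t.

import numpy as np

def forwards_scaled(pi, A, B):
    T, K = B.shape
    alpha_hat = np.zeros((T, K))
    log_lik = 0.0
    a = pi * B[0]
    for t in range(T):
        if t > 0:
            a = (A.T @ alpha_hat[t - 1]) * B[t]  # unnormalized alpha-hat_t
        c = a.sum()                              # c_t = p(y_t | y_{1:t-1})
        alpha_hat[t] = a / c
        log_lik += np.log(c)                     # log p(y_{1:T}) = sum_t log c_t
    return alpha_hat, log_lik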
Avoiding numerical underflow in Shafer Shenoy
• We always normalize all the messages:
    δ̂_{i→j} = (1/z_i) VE-msg({δ̂_{k→i}}, ψ_i^0)
• By keeping track of the normalization constants during the collect-to-root pass, we can compute the log-likelihood:
    log p(e) = Σ_i log z_i
Shafer-Shenoy for pairwise MRFs
• Consider an MRF with one potential per edge:
    P(X) = (1/Z) ∏_{<ij>} ψ_{ij}(X_i, X_j) ∏_i φ_i(X_i)
• We can generalize the forwards-backwards algorithm as follows:
    m_{ij}(x_j) = Σ_{x_i} φ_i(x_i) ψ_{ij}(x_i, x_j) ∏_{k ∈ N_i \ {j}} m_{ki}(x_i)
    b_i(x_i) ∝ φ_i(x_i) ∏_{j ∈ N_i} m_{ji}(x_i)
• In matrix-vector form (see the sketch below), this becomes
    m_{ij} = ψ_{ij}ᵀ (φ_i .* ∏_{k ∈ N_i \ {j}} m_{ki})
    b_i ∝ φ_i .* ∏_{j ∈ N_i} m_{ji}
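A sketch of these updates for a tree-structured pairwise MRF, run as synchronous ("flooding") sweeps; the schedule is our choice here, not something the slide specifies. Assumed, illustrative data structures: phi[i] is the local-evidence vector of node i, psi[(i, j)] the edge potential with psi[(j, i)] = psi[(i, j)].T, and nbrs[i] the neighbour list.

import numpy as np

def belief_propagation(phi, psi, nbrs, n_iters=50):
    # n_iters should be at least the diameter of the tree for exact convergence.
    msgs = {(i, j): np.ones(len(phi[j])) for i in nbrs for j in nbrs[i]}
    for _ in range(n_iters):
        new = {}
        for i in nbrs:
            for j in nbrs[i]:
                prod = phi[i].astype(float)
                for k in nbrs[i]:
                    if k != j:
                        prod = prod * msgs[(k, i)]     # phi_i .* prod_{k != j} m_{ki}
                m = psi[(i, j)].T @ prod               # m_{ij} = psi_{ij}' (phi_i .* prod_k m_{ki})
                new[(i, j)] = m / m.sum()              # normalize for numerical stability
        msgs = new
    beliefs = {}
    for i in nbrs:
        b = phi[i].astype(float)
        for j in nbrs[i]:
            b = b * msgs[(j, i)]                       # b_i ∝ phi_i .* prod_j m_{ji}
        beliefs[i] = b / b.sum()
    return beliefs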
Message passing with division
• The posterior (clique belief) is the initial potential times the product of all incoming messages:
    π_i(C_i) = π_i^0(C_i) ∏_{k ∈ N_i} δ_{k→i}(S_{ik})
• The message from i to j is the product of all incoming messages excluding δ_{j→i}:
    δ_{i→j}(S_{ij}) = Σ_{C_i \ S_{ij}} π_i^0(C_i) ∏_{k ∈ N_i \ {j}} δ_{k→i}(S_{ik})
                    = Σ_{C_i \ S_{ij}} π_i^0(C_i) [∏_{k ∈ N_i} δ_{k→i}(S_{ik})] / δ_{j→i}(S_{ij})
                    = [Σ_{C_i \ S_{ij}} π_i(C_i)] / δ_{j→i}(S_{ij})
Lauritzen-Spiegelhalter algorithm
{ψ_i} := function Ctree-BP-two-pass({φ}, T, α)
    R := pickRoot(T)
    DT := mkRootedTree(T, R)
    {ψ_i} := initializeCliques(φ, α)
    μ_{i,j} := 1   (* initialize messages for each edge *)
    (* Upwards pass *)
    for i ∈ postorder(DT)
        j := pa(DT, i)
        ψ_j, μ_{i,j} := BP-msg(ψ_i, ψ_j, μ_{i,j})
Lauritzen-Spiegelhalter algorithm
(* Downwards pass *)
for i ∈ preorder(DT)
    for j ∈ ch(DT, i)
        ψ_j, μ_{i,j} := BP-msg(ψ_i, ψ_j, μ_{i,j})

ψ_j, μ_{i,j} := function BP-msg(ψ_i, ψ_j, μ_{i,j})
    δ_{i→j} := Σ_{C_i \ S_{ij}} ψ_i
    ψ_j := ψ_j · δ_{i→j} / μ_{i,j}
    μ_{i,j} := δ_{i→j}
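A sketch of a single BP-msg call in Python, reusing the (vars, table) factor convention from the Shafer-Shenoy sketch earlier (single-letter variable names, numpy tables). The stored edge message mu_ij is assumed to be a factor over the separator, with its variables in the order in which they appear in psi_i's scope; all names are illustrative.

import numpy as np

def bp_msg(psi_i, psi_j, mu_ij):
    # Clique i sends to clique j: marginalize psi_i onto the separator,
    # divide out the stored edge message, and multiply the result into psi_j.
    vi, ti = psi_i
    vj, tj = psi_j
    _, mt = mu_ij
    sep = ''.join(v for v in vi if v in vj)              # S_ij = C_i ∩ C_j
    delta = np.einsum(f'{vi}->{sep}', ti)                # delta_{i->j} = sum over C_i \ S_ij of psi_i
    update = delta / mt                                  # delta_{i->j} / mu_{i,j}
    new_tj = np.einsum(f'{vj},{sep}->{vj}', tj, update)  # psi_j := psi_j * delta_{i->j} / mu_{i,j}
    return (vj, new_tj), (sep, delta)                    # updated psi_j, and mu_{i,j} := delta_{i->j}

Calling this along a post-order and then a pre-order traversal, with every mu_{i,j} initialized to a table of ones over S_{i,j}, reproduces the two-pass schedule above.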
Properties of BP
• μ_{i,j} stores the most recent message sent along edge C_i − C_j, in either direction.
• We can send messages in any order, including multiple times,
because the recipient divides out by the old µi,j , to avoid
overcounting.
• Hence the algorithm can be run in a parallel, distributed fashion.
• ψ_i ∝ P(C_i | e′) contains the product of all messages received so far (summarizing the evidence e′ seen so far); it is our best partial guess (belief) about P(C_i | e).
Parallel BP
(* send *)
for i = 1 : C
    for j ∈ N_i
        δ_{i→j}^{old} := δ_{i→j}
        δ_{i→j} := Σ_{C_i \ S_{ij}} ψ_i
    end
end
(* receive *)
for i = 1 : C
    for j ∈ N_i
        ψ_i := ψ_i · δ_{j→i} / δ_{i→j}^{old}
    end
end
Using a clique tree to answer queries
• We can enter evidence about Xi by multiplying a local evidence
factor into any potential that contains Xi in its scope.
• After the tree is calibrated, we can compute P(X_q | e) for any q contained in a clique (e.g., a node and its parents).
• If new evidence arrives about Xi, we pick a clique Cr that contains
Xi and distribute the evidence (downwards pass from Cr ).
Separator sets
• Define the separator set on each edge to be S_ij = C_i ∩ C_j.
• Thm 8.1.8: Let X_left be all the nodes to the “left” of S_ij and X_right be all the nodes to the “right”. Then X_left ⊥ X_right | S_ij.
• ABCDE ⊥ DEF |DE, i.e., ABC ⊥ F |DE.
[Figure: Markov net on A, B, C, D, E, F with clique tree ABC −(C)− CDE −(DE)− DEF]
Clique tree as a distribution
• Consider Markov net A − B − C with clique tree
C1 : A, B − C2 : B, C
• After BP has converged, we have
    ψ_1(A, B) = P_F(A, B),   ψ_2(B, C) = P_F(B, C)
• In addition, neighboring cliques agree on their intersection, e.g.
    Σ_A ψ_1(A, B) = Σ_C ψ_2(B, C) = P_F(B)
• Hence the joint is
    P(A, B, C) = P(A, B) P(C | B) = P(A, B) P(B, C) / P(B)
               = ψ_1(A, B) ψ_2(B, C) / Σ_c ψ_2(B, c) = ψ_1(A, B) ψ_2(B, C) / Σ_a ψ_1(a, B)
               = ψ_1(A, B) ψ_2(B, C) / μ_{1,2}(B)
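A small numeric check of this identity (illustrative: random potentials, binary variables):

import numpy as np

# After calibration, psi_1(A,B) * psi_2(B,C) / mu_{1,2}(B) recovers P(A,B,C).
rng = np.random.default_rng(1)
phi_AB = rng.random((2, 2))                       # potential on clique C1 = {A, B}
phi_BC = rng.random((2, 2))                       # potential on clique C2 = {B, C}
joint = phi_AB[:, :, None] * phi_BC[None, :, :]
joint /= joint.sum()                              # P(A, B, C)

delta_12 = phi_AB.sum(axis=0)                     # message C1 -> C2, a function of B
delta_21 = phi_BC.sum(axis=1)                     # message C2 -> C1, a function of B
psi1 = phi_AB * delta_21[None, :]                 # psi_1(A, B) ∝ P(A, B)
psi2 = phi_BC * delta_12[:, None]                 # psi_2(B, C) ∝ P(B, C)
mu12 = psi1.sum(axis=0)                           # sepset marginal, ∝ P(B)

recovered = psi1[:, :, None] * psi2[None, :, :] / mu12[None, :, None]
recovered /= recovered.sum()
assert np.allclose(recovered, joint)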
Clique tree as a distribution
• Defn 8.9: The clique tree invariant for T is
    π_T ≜ ∏_φ φ = ∏_{i∈T} ψ_i(C_i) / ∏_{<ij>∈T} μ_{i,j}(S_{i,j})
• Initially, the clique tree over all factors satisfies the invariant since
µi,j = 1 and all the factors φ are assigned to cliques.
• Thm 8.3.6: Each step of BP maintains the clique invariant.
Message passing maintains clique invariant
• Proof. Suppose C_i sends to C_j, resulting in a new message μ_{i,j}^{new} and a new potential
    ψ_j^{new} = ψ_j μ_{i,j}^{new} / μ_{i,j}
  Then
    π_T = ∏_{i′} ψ_{i′}^{new} / ∏_{<i′j′>} μ_{i′,j′}^{new}
        = [ψ_j^{new} ∏_{i′≠j} ψ_{i′}] / [μ_{i,j}^{new} ∏_{<i′j′>≠(i,j)} μ_{i′,j′}]
        = [ψ_j μ_{i,j}^{new} ∏_{i′≠j} ψ_{i′}] / [μ_{i,j} μ_{i,j}^{new} ∏_{<i′j′>≠(i,j)} μ_{i′,j′}]
        = ∏_{i′} ψ_{i′} / ∏_{<i′j′>} μ_{i′,j′}
Proof of correctness of BP
• Message passing does not change the invariant, so the clique tree
always represents the distribution as a whole.
• However, we want to show that when the algorithm has converged,
the clique potentials represent correct marginals.
• Defn 8.3.7. Ci is ready to transmit to Cj when Ci has
received informed messages from all its neighbors except from Cj ; a
message from Ci to Cj is informed if it is sent when Ci is ready
to transmit to Cj .
• e.g., leaf nodes are always ready to transmit.
• Defn 8.3.8: A connected subtree T′ is fully informed if, for each C_i ∈ T′ and each neighbor C_j ∉ T′ of C_i, C_j has sent an informed message to C_i.
• Thm 8.3.9: After running BP, π_{T′} = P_F(Scope(T′)) for any fully informed connected subtree T′.
Proof of correctness of BP
• Corollary 8.3.10: If all nodes in T are fully informed, then
πT = PF (Scope(T )). Hence πi = PF (Ci).
• Claim: There is a scheduling such that all nodes can become fully
informed (namely postorder/ preorder).
• Defn 8.3.11: A clique tree is said to be calibrated if, for each edge C_i − C_j, the two cliques agree on the marginal over their separator:
    Σ_{C_i \ S_{ij}} ψ_i(C_i) = Σ_{C_j \ S_{ij}} ψ_j(C_j)
• Claim: if all nodes are fully informed, the clique tree is calibrated.
Hence any further message passing will have no effect.
Out-of-clique queries
• To compute P(X_q | e) where q is not contained within a single clique, we look at the smallest subtree that contains q and perform variable elimination on its factors.
• e.g. Consider Markov net A − B − C − D with clique tree
C1 : A, B − C2 : B, C − C3 : C, D
• We can compute P(B, D) as follows:
    P(B, D) = Σ_C P(B, C, D)
            = Σ_C π_2(B, C) π_3(C, D) / μ_{2,3}(C)
            = Σ_C P(B | C) P(C, D)
            = VarElim({π_2, π_3/μ_{2,3}}, {B, D})
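An illustrative numeric check on a random chain A − B − C − D with binary variables; the calibrated beliefs are computed directly as marginals of the joint (which is exactly what calibration guarantees), and the formula above is then applied:

import numpy as np

rng = np.random.default_rng(2)
phi_AB, phi_BC, phi_CD = rng.random((2, 2)), rng.random((2, 2)), rng.random((2, 2))
F = phi_AB[:, :, None, None] * phi_BC[None, :, :, None] * phi_CD[None, None, :, :]
P = F / F.sum()                                   # P(A, B, C, D), axes (A, B, C, D)

pi2 = P.sum(axis=(0, 3))                          # pi_2(B, C) = P(B, C)
pi3 = P.sum(axis=(0, 1))                          # pi_3(C, D) = P(C, D)
mu23 = P.sum(axis=(0, 1, 3))                      # mu_{2,3}(C) = P(C)

# P(B, D) = sum_C pi_2(B, C) pi_3(C, D) / mu_{2,3}(C)
PBD = np.einsum('bc,cd,c->bd', pi2, pi3, 1.0 / mu23)
assert np.allclose(PBD, P.sum(axis=(0, 2)))       # matches the brute-force marginal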
Viterbi decoding (finding the MPE)
• Let x*_{1:N} = argmax_{x_{1:N}} P(x_{1:N}) be (one of the) most probable assignments.
• We can compute p* = P(x*_{1:N}) using the max-product algorithm.
• e.g., for A → B:
    P(a*, b*) = max_a max_b P(a) P(b | a)
              = max_a max_b φ_A(a) φ_B(b, a)
              = max_a φ_A(a) max_b φ_B(b, a)        [τ_B(a) ≜ max_b φ_B(b, a)]
              = max_a φ_A(a) τ_B(a)                 [≜ τ_A(∅)]
• We can push max inside products.
• (max, ∏) and (Σ, ∏) are both commutative semi-rings.
Viterbi decoding (finding the MPE)
• Max-product gives us p* = max_{x_{1:N}} P(x_{1:N}), but not x*_{1:N} = argmax_{x_{1:N}} P(x_{1:N}).
• To compute the most probable assignment, we need to do
max-product followed by a traceback procedure.
• e.g., A → B.
• We cannot find the most probable value for A unless we know what
B we would choose in response.
• So when we compute τ_B(a) = max_b φ_B(b, a), we also store
    λ_B(a) = argmax_b φ_B(b, a)
• When we compute τ_A(∅) = max_a φ_A(a) τ_B(a), we also compute
    a* = argmax_a φ_A(a) τ_B(a)
• Then traceback: b* = λ_B(a*).
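For HMMs this becomes the classic Viterbi algorithm. A NumPy sketch (same A, B, pi conventions assumed in the earlier HMM sketches), working in log space and storing the argmax pointers λ for the traceback:

import numpy as np

def viterbi(pi, A, B):
    T, K = B.shape
    logA, logB = np.log(A), np.log(B)
    tau = np.zeros((T, K))                # tau[t, j] = best log-score of any path ending in state j
    lam = np.zeros((T, K), dtype=int)     # lam[t, j] = best predecessor of state j at time t
    tau[0] = np.log(pi) + logB[0]
    for t in range(1, T):
        scores = tau[t - 1][:, None] + logA + logB[t][None, :]
        lam[t] = scores.argmax(axis=0)
        tau[t] = scores.max(axis=0)
    # Traceback: pick the best final state, then follow the pointers backwards.
    x = np.zeros(T, dtype=int)
    x[-1] = tau[-1].argmax()
    for t in range(T - 2, -1, -1):
        x[t] = lam[t + 1][x[t + 1]]
    return x, tau[-1].max()               # MPE path and log p*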
More complex example
[Figure: the student network — Difficulty and Intelligence are parents of Grade, Intelligence is a parent of SAT, and Grade is a parent of Letter]
p* = max_G max_L φ_L(L, G) max_D φ_D(D) max_I φ_I(I) φ_G(G, I, D) max_S φ_S(I, S)
More complex example
p* = max_G max_L φ_L(L, G) max_D φ_D(D) max_I φ_I(I) φ_G(G, I, D) max_S φ_S(I, S)        [τ_1(I) ≜ max_S φ_S(I, S)]
   = max_G max_L φ_L(L, G) max_D φ_D(D) max_I φ_I(I) φ_G(G, I, D) τ_1(I)                  [τ_2(G, D) ≜ max_I φ_I(I) φ_G(G, I, D) τ_1(I)]
   = max_G max_L φ_L(L, G) max_D φ_D(D) τ_2(G, D)                                          [τ_3(G) ≜ max_D φ_D(D) τ_2(G, D)]
   = max_G max_L φ_L(L, G) τ_3(G)                                                          [τ_4(G) ≜ max_L φ_L(L, G) τ_3(G)]
   = max_G τ_4(G)                                                                           [τ_5(∅) ≜ max_G τ_4(G)]
   = 0.184
Traceback
p* = max_G max_L φ_L(L, G) max_D φ_D(D) max_I φ_I(I) φ_G(G, I, D) max_S φ_S(I, S), with the intermediate terms τ_1(I), τ_2(G, D), τ_3(G), τ_4(G), τ_5(∅) defined as on the previous slide. We then trace back from the root:
λ_5(∅) = argmax_g τ_4(g) = g*
λ_4(g) = argmax_l φ_L(l, g) τ_3(g),                 l* = λ_4(g*)
λ_3(g) = argmax_d φ_D(d) τ_2(g, d),                 d* = λ_3(g*)
λ_2(g, d) = argmax_i φ_I(i) φ_G(g, i, d) τ_1(i),    i* = λ_2(g*, d*)
λ_1(i) = argmax_s φ_S(i, s),                        s* = λ_1(i*)
Finding K-most probable assignments
• There may be several (say m_1) assignments with the same highest probability; call them x^{(1,1)}_{1:N}, …, x^{(1,m_1)}_{1:N}.
• These can be found by breaking ties in the argmax.
• The second most probable assignment(s) after these, x^{(2,1)}_{1:N}, …, x^{(2,m_2)}_{1:N}, must differ from them in at least one variable's assignment.
• Hence we assert evidence that the next assignment must be distinct from all m_1 MPEs, and re-run Viterbi.
• Project idea: implement this and compare to the loopy belief
propagation version to be discussed later.
• This is often used to produce the “N-best list” in speech
recognition; these hypotheses are then re-ranked using more
sophisticated (discriminative) models.
Marginal MAP (max-sum-product)
[Figure: the student network (Difficulty, Intelligence, Grade, SAT, Letter)]
p* = max_L max_S Σ_G φ_L(L, G) Σ_I φ_I(I) φ_S(S, I) Σ_D φ_G(G, I, D)
• We can easily modify the previous algorithms to cope with examples such as this.
• However, max and sum do not commute!
    max_X Σ_Y φ(X, Y) ≠ Σ_Y max_X φ(X, Y)
• Hence we must use a constrained elimination ordering, in which we sum out first, then max out.
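A two-line numeric illustration (not from the slides) that the two orderings differ; in fact summing the column-wise maxima upper-bounds the max of the row sums:

import numpy as np

rng = np.random.default_rng(3)
phi = rng.random((3, 4))                 # axis 0 = X, axis 1 = Y
max_then_sum = phi.sum(axis=1).max()     # max_X sum_Y phi(X, Y)
sum_then_max = phi.max(axis=0).sum()     # sum_Y max_X phi(X, Y)
print(max_then_sum, sum_then_max)        # generally different; the second is an upper bound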
Constrained elimination orderings may induce large cliques
[Figure: chain X_1 − X_2 − ⋯ − X_n, with a Y_i attached to each X_i]
p* = max_{Y_1, …, Y_n} Σ_{X_1, …, X_n} P(Y_{1:n}, X_{1:n})
• We must eliminate all the Xi’s first, which induces a huge clique
over all the Yi’s!
• Thm: exact marginal MAP inference is NP-hard even in tree-structured graphical models.
• An identical problem arises with influence diagrams (decision networks), where we must sum out random variables before maxing out action variables.
• An identical problem arises with “hybrid networks”, where we must integrate out the Gaussian random variables before summing out the discrete random variables (ch 11).