
Probabilistic graphical models

CPSC 532c (Topics in AI)


Stat 521a (Topics in multivariate analysis)

Lecture 9

Kevin Murphy

Monday 18 October 2004


Administrivia

• HW4 due today


Review

• Variable elimination can be used to answer a single query, P(X_q | e).
• VarElim requires an elimination ordering; you can use elimOrderGreedy to find this.
• VarElim implicitly creates an elimination tree (a junction tree with non-maximal cliques).
• You can create a jtree of maximal cliques by triangulating and using a max-weight spanning tree.
• Given a jtree, we can compute P(X_c | e) for all cliques c using belief propagation (BP).
Belief propagation

• There are 2 variants of BP, which we will cover today:

• Shafer-Shenoy, which multiplies by all-but-one incoming message:

  δ_{i→j} = f( ∏_{k ∈ N_i \ {j}} δ_{k→i} )

• Lauritzen-Spiegelhalter, which multiplies by all incoming messages and then divides out one:

  δ_{i→j} = f( (∏_{k ∈ N_i} δ_{k→i}) / δ_{j→i} )

[Figure: clique tree fragment — messages from C_k and C_k' flow into C_i, which sends to C_j on the path towards the root C_r]
Shafer-Shenoy algorithm

{ψ_i^1} := function Ctree-VE-calibrate({φ}, T, α)
  R := pickRoot(T)
  DT := mkRootedTree(T, R)
  {ψ_i^0} := initializeCliques(φ, α)
  (* Upwards pass *)
  for i ∈ postorder(DT)
    j := pa(DT, i)
    δ_{i→j} := VE-msg({δ_{k→i} : k ∈ ch(DT, i)}, ψ_i^0)

[Figure: clique tree fragment — messages from C_k and C_k' flow into C_i, which sends to C_j on the path towards the root C_r]
Sub-functions

{ψ_i^0} := function initializeCliques(φ, α)
  for i := 1 : C
    ψ_i^0(C_i) := ∏_{φ : α(φ)=i} φ

δ_{i→j} := function VE-msg({δ_{k→i}}, ψ_i^0)
  ψ_i^1(C_i) := ψ_i^0(C_i) ∏_k δ_{k→i}
  δ_{i→j}(S_{i,j}) := Σ_{C_i \ S_{ij}} ψ_i^1(C_i)
Shafer-Shenoy algorithm

(* Downwards pass *)
for i ∈ preorder(DT)
  for j ∈ ch(DT, i)
    δ_{i→j} := VE-msg({δ_{k→i} : k ∈ N_i \ {j}}, ψ_i^0)
(* Combine *)
for i := 1 : C
  ψ_i^1 := ψ_i^0 ∏_{k ∈ N_i} δ_{k→i}

[Figure: clique tree fragment — in the downwards pass, C_i sends messages to its children C_k and C_k', having received from C_j on the root side]
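Below is a minimal Python/NumPy sketch of the two-pass Shafer-Shenoy schedule above. The Factor class, the align/multiply/marginalize helpers and the function names (collect, distribute, shafer_shenoy) are illustrative choices, not part of the pseudocode; the clique tree is given as an adjacency dictionary over clique indices.

import numpy as np

class Factor:
    """A potential over named discrete variables; one array axis per variable."""
    def __init__(self, vars, table):
        self.vars = tuple(vars)
        self.table = np.asarray(table, dtype=float)

def align(f, vars):
    """Reshape f.table so its axes follow the ordering `vars` (size 1 for absent vars)."""
    perm = sorted(range(len(f.vars)), key=lambda a: vars.index(f.vars[a]))
    t = np.transpose(f.table, perm)
    present = [f.vars[a] for a in perm]
    shape, k = [], 0
    for v in vars:
        if k < len(present) and present[k] == v:
            shape.append(t.shape[k]); k += 1
        else:
            shape.append(1)
    return t.reshape(shape)

def multiply(f, g):
    """Factor product over the union of the two scopes."""
    vars = list(f.vars) + [v for v in g.vars if v not in f.vars]
    return Factor(vars, align(f, vars) * align(g, vars))

def marginalize(f, keep):
    """Sum out every variable not in `keep`."""
    axes = tuple(i for i, v in enumerate(f.vars) if v not in keep)
    return Factor([v for v in f.vars if v in keep], f.table.sum(axis=axes))

def shafer_shenoy(cliques, tree, root=0):
    """cliques: list of Factor (initial potentials psi_i^0);
    tree: dict clique index -> list of neighbour indices (undirected clique tree).
    Returns the combined potentials psi_i^1 (unnormalised clique marginals)."""
    msgs = {}                                    # (i, j) -> delta_{i->j}
    sep = lambda i, j: [v for v in cliques[i].vars if v in cliques[j].vars]

    def send(i, j):                              # this is VE-msg above
        f = cliques[i]
        for k in tree[i]:                        # all-but-one incoming messages
            if k != j and (k, i) in msgs:
                f = multiply(f, msgs[(k, i)])
        msgs[(i, j)] = marginalize(f, sep(i, j))

    def collect(i, parent):                      # upwards pass (postorder)
        for k in tree[i]:
            if k != parent:
                collect(k, i)
        if parent is not None:
            send(i, parent)

    def distribute(i, parent):                   # downwards pass (preorder)
        for k in tree[i]:
            if k != parent:
                send(i, k)
                distribute(k, i)

    collect(root, None)
    distribute(root, None)
    beliefs = []                                 # combine step
    for i, f in enumerate(cliques):
        for k in tree[i]:
            f = multiply(f, msgs[(k, i)])
        beliefs.append(f)
    return beliefs

For the HMM clique chain on the next slide, cliques would be the ψ_t^0(X_t, X_{t+1}) factors, the tree a path, and the returned beliefs correspond to the ψ_t^1.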
Shafer-Shenoy for HMMs

[Figure: HMM with hidden chain X1 − X2 − X3 − X4 − ... and observations Y1, Y2, Y3, Y4; clique chain C1: X1,X2 − C2: X2,X3 − C3: X3,X4]

ψ_t^0(X_t, X_{t+1}) = P(X_{t+1} | X_t) p(y_{t+1} | X_{t+1})

δ_{t→t+1}(X_{t+1}) = Σ_{X_t} δ_{t−1→t}(X_t) ψ_t^0(X_t, X_{t+1})

δ_{t→t−1}(X_t) = Σ_{X_{t+1}} δ_{t+1→t}(X_{t+1}) ψ_t^0(X_t, X_{t+1})

ψ_t^1(X_t, X_{t+1}) = δ_{t−1→t}(X_t) δ_{t+1→t}(X_{t+1}) ψ_t^0(X_t, X_{t+1})
Forwards-backwards algorithm for HMMs

α_t(i) := δ_{t−1→t}(i) = P(X_t = i, y_{1:t})
β_t(i) := δ_{t→t−1}(i) = p(y_{t+1:T} | X_t = i)
ξ_t(i, j) := ψ_t^1(X_t = i, X_{t+1} = j) = P(X_t = i, X_{t+1} = j, y_{1:T})
A(i, j) := P(X_{t+1} = j | X_t = i)
B_t(i) := p(y_t | X_t = i)

α_t(j) = Σ_i α_{t−1}(i) A(i, j) B_t(j)
β_t(i) = Σ_j β_{t+1}(j) A(i, j) B_{t+1}(j)
ξ_t(i, j) = α_t(i) β_{t+1}(j) A(i, j) B_{t+1}(j)
γ_t(i) := P(X_t = i | y_{1:T}) ∝ α_t(i) β_t(i) ∝ Σ_j ξ_t(i, j)
Forwards-backwards algorithm, matrix-vector form

[Figure: HMM with hidden chain X1 − X2 − X3 − X4 − ... and observations Y1, Y2, Y3, Y4]

α_t(j) = Σ_i α_{t−1}(i) A(i, j) B_t(j)
α_t = (A^T α_{t−1}) .* B_t

β_t(i) = Σ_j β_{t+1}(j) A(i, j) B_{t+1}(j)
β_t = A (β_{t+1} .* B_{t+1})

ξ_t(i, j) = α_t(i) β_{t+1}(j) A(i, j) B_{t+1}(j)
ξ_t = α_t (β_{t+1} .* B_{t+1})^T .* A

γ_t(i) ∝ α_t(i) β_t(i)
γ_t ∝ α_t .* β_t
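Here is a direct NumPy transcription of the matrix-vector updates above, as a sketch. The array layout is an assumption: A is the K×K transition matrix, B is a T×K matrix of local evidence with B[t, i] = p(y_t | X_t = i), and pi is the initial distribution (so α_1 = π .* B_1).

import numpy as np

def forwards_backwards(A, B, pi):
    """Unnormalised forwards-backwards in the matrix-vector form above.
    A  : K x K, A[i, j] = P(X_{t+1}=j | X_t=i)
    B  : T x K, B[t, i] = p(y_t | X_t=i)
    pi : K,     pi[i]   = P(X_1=i)"""
    T, K = B.shape
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = pi * B[0]                          # alpha_1 = pi .* B_1
    for t in range(1, T):
        alpha[t] = (A.T @ alpha[t-1]) * B[t]      # alpha_t = (A' alpha_{t-1}) .* B_t
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (beta[t+1] * B[t+1])        # beta_t = A (beta_{t+1} .* B_{t+1})
    xi = np.zeros((T - 1, K, K))
    for t in range(T - 1):
        xi[t] = np.outer(alpha[t], beta[t+1] * B[t+1]) * A   # xi_t = alpha_t (beta_{t+1} .* B_{t+1})' .* A
    gamma = alpha * beta                          # gamma_t ∝ alpha_t .* beta_t
    gamma /= gamma.sum(axis=1, keepdims=True)
    return alpha, beta, gamma, xi

For long sequences these unnormalised quantities underflow, which is what the scaled recursion on the next slides addresses.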
HMM trellis

• The forwards algorithm uses dynamic programming to efficiently sum over all possible paths that end in state i at time t.

α_t(i) := P(X_t = i, y_{1:t})
 = [ Σ_{X_1} ... Σ_{X_{t−1}} P(X_1, ..., X_{t−1}, y_{1:t−1}) P(X_t | X_{t−1}) ] p(y_t | X_t)
 = [ Σ_{X_{t−1}} P(X_{t−1}, y_{1:t−1}) P(X_t | X_{t−1}) ] p(y_t | X_t)
 = [ Σ_{X_{t−1}} α_{t−1}(X_{t−1}) P(X_t | X_{t−1}) ] p(y_t | X_t)
Avoiding numerical underflow in HMMs

• α_t(j) := P(X_t = j, y_{1:t}) is a tiny number.
• Hence in practice we use

α̂_t(j) := P(X_t = j | y_{1:t}) = P(X_t = j, y_t | y_{1:t−1}) / p(y_t | y_{1:t−1})
 = [ Σ_i P(X_{t−1} = i | y_{1:t−1}) P(X_t = j | X_{t−1} = i) p(y_t | X_t = j) ] / p(y_t | y_{1:t−1})
 = (1/c_t) Σ_i α̂_{t−1}(i) A(i, j) B_t(j)

where

c_t := p(y_t | y_{1:t−1}) = Σ_j Σ_i α̂_{t−1}(i) A(i, j) B_t(j)

log p(y_{1:T}) = log [ p(y_1) p(y_2 | y_1) p(y_3 | y_{1:2}) ... ] = log ∏_{t=1}^T c_t = Σ_{t=1}^T log c_t
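A sketch of the scaled recursion above, with the same (assumed) array conventions as the previous sketch; the normalisers c_t accumulate into the log-likelihood.

import numpy as np

def forward_scaled(A, B, pi):
    """Scaled forward pass: alpha_hat[t, i] = P(X_t=i | y_{1:t}), plus log p(y_{1:T})."""
    T, K = B.shape
    alpha_hat = np.zeros((T, K))
    log_lik = 0.0
    a = pi * B[0]                                 # unnormalised alpha_1
    for t in range(T):
        c = a.sum()                               # c_t = p(y_t | y_{1:t-1})
        alpha_hat[t] = a / c
        log_lik += np.log(c)                      # log p(y_{1:T}) = sum_t log c_t
        if t + 1 < T:
            a = (A.T @ alpha_hat[t]) * B[t+1]     # next unnormalised update
    return alpha_hat, log_lik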
Avoiding numerical underflow in Shafer-Shenoy

• We always normalize all the messages:

δ̂_{i→j} = (1/z_i) VE-msg({δ̂_{k→i}}, ψ_i^0)

• By keeping track of the normalization constants during the collect-to-root pass, we can compute the log-likelihood:

log p(e) = Σ_i log z_i
Shafer-Shenoy for pairwise MRFs

• Consider an MRF with one potential per edge:

P(X) = (1/Z) ∏_{<ij>} ψ_{ij}(X_i, X_j) ∏_i φ_i(X_i)

• We can generalize the forwards-backwards algorithm as follows:

m_{ij}(x_j) = Σ_{x_i} φ_i(x_i) ψ_{ij}(x_i, x_j) ∏_{k ∈ N_i \ {j}} m_{ki}(x_i)

b_i(x_i) ∝ φ_i(x_i) ∏_{j ∈ N_i} m_{ji}(x_i)

• In matrix-vector form, this becomes

m_{ij} = ψ_{ij}^T (φ_i .* ∏_{k ∈ N_i \ {j}} m_{ki})

b_i ∝ φ_i .* ∏_{j ∈ N_i} m_{ji}
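A sketch of the generalised recursion above for a pairwise MRF whose graph is a tree, in NumPy. The dict-based containers, the edge-potential orientation (ψ_ij indexed [x_i, x_j]) and the per-message normalisation are assumptions of this sketch, not part of the slide.

import numpy as np

def mrf_bp_tree(phi, psi, adj, root=0):
    """Sum-product BP on a tree-structured pairwise MRF, following m_ij / b_i above.
    phi : dict i -> length-K node potential phi_i
    psi : dict (i, j) -> K x K edge potential psi_ij, indexed [x_i, x_j]
    adj : dict i -> list of neighbours
    Returns normalised beliefs b_i."""
    msgs = {}

    def edge_pot(i, j):
        return psi[(i, j)] if (i, j) in psi else psi[(j, i)].T

    def send(i, j):
        prod = phi[i].copy()
        for k in adj[i]:
            if k != j:
                prod = prod * msgs[(k, i)]        # product of all-but-one incoming messages
        m = edge_pot(i, j).T @ prod               # m_ij = psi_ij' (phi_i .* prod_k m_ki)
        msgs[(i, j)] = m / m.sum()                # normalise to avoid underflow

    def collect(i, parent):                       # upwards pass
        for k in adj[i]:
            if k != parent:
                collect(k, i)
        if parent is not None:
            send(i, parent)

    def distribute(i, parent):                    # downwards pass
        for k in adj[i]:
            if k != parent:
                send(i, k)
                distribute(k, i)

    collect(root, None)
    distribute(root, None)
    beliefs = {}
    for i in adj:
        b = phi[i].copy()
        for k in adj[i]:
            b = b * msgs[(k, i)]                  # b_i ∝ phi_i .* prod_j m_ji
        beliefs[i] = b / b.sum()
    return beliefs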
Message passing with division

• The posterior is the product of all incoming messages:

π_i(C_i) = π_i^0(C_i) ∏_{k ∈ N_i} δ_{k→i}(S_{ik})

• The message from i to j is the product of all incoming messages excluding δ_{j→i}:

δ_{i→j}(S_{ij}) = Σ_{C_i \ S_{ij}} π_i^0(C_i) ∏_{k ∈ N_i \ {j}} δ_{k→i}(S_{ik})
 = Σ_{C_i \ S_{ij}} π_i^0(C_i) [ ∏_{k ∈ N_i} δ_{k→i}(S_{ik}) ] / δ_{j→i}(S_{ij})
 = [ Σ_{C_i \ S_{ij}} π_i(C_i) ] / δ_{j→i}(S_{ij})

[Figure: clique tree fragment — messages from C_k and C_k' flow into C_i, which sends to C_j on the path towards the root C_r]
Lauritzen-Spiegelhalter algorithm

{ψ_i} := function Ctree-BP-two-pass({φ}, T, α)
  R := pickRoot(T)
  DT := mkRootedTree(T, R)
  {ψ_i} := initializeCliques(φ, α)
  μ_{i,j} := 1 (* initialize messages for each edge *)
  (* Upwards pass *)
  for i ∈ postorder(DT)
    j := pa(DT, i)
    (ψ_j, μ_{i,j}) := BP-msg(ψ_i, ψ_j, μ_{i,j})

[Figure: clique tree fragment — messages from C_k and C_k' flow into C_i, which sends to C_j on the path towards the root C_r]
Lauritzen-Spiegelhalter algorithm

(* Downwards pass *)
for i ∈ preorder(DT)
  for j ∈ ch(DT, i)
    (ψ_j, μ_{i,j}) := BP-msg(ψ_i, ψ_j, μ_{i,j})

(ψ_j, μ_{i,j}) := function BP-msg(ψ_i, ψ_j, μ_{i,j})
  δ_{i→j} := Σ_{C_i \ S_{ij}} ψ_i
  ψ_j := ψ_j * δ_{i→j} / μ_{i,j}
  μ_{i,j} := δ_{i→j}

[Figure: clique tree fragment — in the downwards pass, C_i sends messages to its children C_k and C_k', having received from C_j on the root side]
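A sketch of the two-pass belief-updating schedule with division. It reuses the Factor, multiply and marginalize helpers from the Shafer-Shenoy sketch above; the divide helper and the keying of μ by undirected edge (frozenset) are assumptions of this sketch, and division treats entries where μ is zero as zero.

import numpy as np
# assumes Factor, multiply, marginalize from the Shafer-Shenoy sketch are in scope

def divide(f, g):
    """Factor division f / g; zero entries of g contribute 0 rather than raising."""
    return multiply(f, Factor(g.vars, 1.0 / np.where(g.table == 0, np.inf, g.table)))

def ls_calibrate(cliques, tree, root=0):
    """Lauritzen-Spiegelhalter two-pass calibration (BP-msg with division)."""
    psi = list(cliques)                               # clique beliefs, updated in place
    mu = {}                                           # per-edge stored message mu_{i,j}
    sep = lambda i, j: [v for v in psi[i].vars if v in psi[j].vars]

    def bp_msg(i, j):
        e = frozenset((i, j))
        delta = marginalize(psi[i], sep(i, j))                     # delta_{i->j} = sum_{C_i \ S_ij} psi_i
        old = mu.get(e, Factor(delta.vars, np.ones_like(delta.table)))
        psi[j] = multiply(psi[j], divide(delta, old))              # psi_j := psi_j * delta_{i->j} / mu_{i,j}
        mu[e] = delta                                              # mu_{i,j} := delta_{i->j}

    def collect(i, parent):                           # upwards pass
        for k in tree[i]:
            if k != parent:
                collect(k, i)
        if parent is not None:
            bp_msg(i, parent)

    def distribute(i, parent):                        # downwards pass
        for k in tree[i]:
            if k != parent:
                bp_msg(i, k)
                distribute(k, i)

    collect(root, None)
    distribute(root, None)
    return psi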
Properties of BP

• μ_{i,j} stores the most recent message sent along edge C_i − C_j, in either direction.
• We can send messages in any order, including multiple times, because the recipient divides out by the old μ_{i,j}, to avoid overcounting.
• Hence the algorithm can be run in a parallel, distributed fashion.
• ψ_i ∝ P(C_i | e') contains the product of all received messages so far (summarizing evidence e'); it is our best partial guess (belief) about P(C_i | e).
Parallel BP

(* send *)
for i = 1 : C
  for j ∈ N_i
    δ_{i→j}^{old} := δ_{i→j}
    δ_{i→j} := Σ_{C_i \ S_{ij}} ψ_i
  end
end
(* receive *)
for i = 1 : C
  for j ∈ N_i
    ψ_i := ψ_i * δ_{j→i} / δ_{i→j}^{old}
  end
end
Using a clique tree to answer queries

• We can enter evidence about X_i by multiplying a local evidence factor into any potential that contains X_i in its scope.
• After the tree is calibrated, we can compute P(X_q | e) for any q contained in a clique (e.g., a node and its parents).
• If new evidence arrives about X_i, we pick a clique C_r that contains X_i and distribute the evidence (downwards pass from C_r).
Separator sets

• Define the separator set on each edge to be S_{ij} = C_i ∩ C_j.
• Thm 8.1.8: Let X_i be all the nodes to the “left” of S_{ij} and X_j be all the nodes to the “right”. Then X_i ⊥ X_j | S_{ij}.
• e.g., ABCDE ⊥ DEF | DE, i.e., ABC ⊥ F | DE.

[Figure: Markov net on A, B, C, D, E, F with clique tree ABC −(C)− CDE −(DE)− DEF]
Clique tree as a distribution

• Consider the Markov net A − B − C with clique tree C1: A,B − C2: B,C.
• After BP has converged, we have

ψ_1(A, B) = P_F(A, B),  ψ_2(B, C) = P_F(B, C)

• In addition, neighboring cliques agree on their intersection, e.g.

Σ_A ψ_1(A, B) = Σ_C ψ_2(B, C) = P_F(B)

• Hence the joint is

P(A, B, C) = P(A, B) P(C | B) = P(A, B) P(B, C) / P(B)
 = ψ_1(A, B) ψ_2(B, C) / Σ_c ψ_2(B, c) = ψ_1(A, B) ψ_2(B, C) / Σ_a ψ_1(a, B)
 = ψ_1(A, B) ψ_2(B, C) / μ_{1,2}(B)
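A small numeric check of the identity above on the A − B − C example; the binary potentials here are made up purely for illustration.

import numpy as np

# Markov net A - B - C with clique tree C1:{A,B} - C2:{B,C}; made-up binary potentials
psiAB = np.array([[0.3, 0.7],      # psi_1(A, B): rows index A, columns index B
                  [0.9, 0.1]])
psiBC = np.array([[0.4, 0.6],      # psi_2(B, C): rows index B, columns index C
                  [0.2, 0.8]])

joint = psiAB[:, :, None] * psiBC[None, :, :]          # unnormalised P_F(A, B, C)
joint /= joint.sum()

# calibrate by hand: each clique absorbs the message from the other side
beta1 = psiAB * psiBC.sum(axis=1)[None, :]             # psi_1 <- psi_1 * delta_{2->1}(B)
beta2 = psiBC * psiAB.sum(axis=0)[:, None]             # psi_2 <- psi_2 * delta_{1->2}(B)
mu12 = beta1.sum(axis=0)                               # separator marginal on B

# clique tree as a distribution: psi_1(A,B) psi_2(B,C) / mu_{1,2}(B) = P_F(A,B,C)
recon = beta1[:, :, None] * beta2[None, :, :] / mu12[None, :, None]
recon /= recon.sum()
assert np.allclose(recon, joint)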
Clique tree as a distribution

• Defn 8.9: The clique tree invariant for T is

π_T = ∏_φ φ = [ ∏_{i ∈ T} ψ_i(C_i) ] / [ ∏_{<i,j> ∈ T} μ_{i,j}(S_{i,j}) ]

• Initially, the clique tree over all factors satisfies the invariant, since μ_{i,j} = 1 and all the factors φ are assigned to cliques.
• Thm 8.3.6: Each step of BP maintains the clique invariant.
Message passing maintains clique invariant

• Proof. Suppose C_i sends to C_j, resulting in new message μ_{i,j}^{new} and new potential

ψ_j^{new} = ψ_j μ_{ij}^{new} / μ_{ij}

Then

π_T^{new} = [ ∏_{i'} ψ_{i'}^{new} ] / [ ∏_{<i',j'>} μ_{i',j'}^{new} ]
 = [ ψ_j^{new} ∏_{i' ≠ j} ψ_{i'} ] / [ μ_{ij}^{new} ∏_{<i',j'> ≠ (i,j)} μ_{i',j'} ]
 = [ ψ_j (μ_{ij}^{new} / μ_{ij}) ∏_{i' ≠ j} ψ_{i'} ] / [ μ_{ij}^{new} ∏_{<i',j'> ≠ (i,j)} μ_{i',j'} ]
 = [ ∏_{i'} ψ_{i'} ] / [ ∏_{<i',j'>} μ_{i',j'} ] = π_T
Proof of correctness of BP

• Message passing does not change the invariant, so the clique tree always represents the distribution as a whole.
• However, we want to show that when the algorithm has converged, the clique potentials represent the correct marginals.
• Defn 8.3.7: C_i is ready to transmit to C_j when C_i has received informed messages from all its neighbors except C_j; a message from C_i to C_j is informed if it is sent when C_i is ready to transmit to C_j.
• e.g., leaf nodes are always ready to transmit.
• Defn 8.3.8: A connected subtree T′ is fully informed if, for each C_i ∈ T′ and each C_j ∉ T′ adjacent to C_i, C_j has sent an informed message to C_i.
• Thm 8.3.9: After running BP, π_{T′} = P_F(Scope(T′)) for any fully informed connected subtree T′.
Proof of correctness of BP

• Corollary 8.3.10: If all nodes in T are fully informed, then π_T = P_F(Scope(T)). Hence π_i = P_F(C_i).
• Claim: there is a schedule under which all nodes become fully informed (namely postorder/preorder).
• Defn 8.3.11: A clique tree is said to be calibrated if, for each edge C_i − C_j, the cliques agree on their intersection:

Σ_{C_i \ S_{ij}} ψ_i(C_i) = Σ_{C_j \ S_{ij}} ψ_j(C_j)

• Claim: if all nodes are fully informed, the clique tree is calibrated. Hence any further message passing will have no effect.
Out-of-clique queries

• To compute P(X_q | e) where q is not contained within a clique, we look at the smallest subtree that contains q, and perform variable elimination on those factors.
• e.g., consider the Markov net A − B − C − D with clique tree C1: A,B − C2: B,C − C3: C,D.
• We can compute P(B, D) as follows:

P(B, D) = Σ_C P(B, C, D)
 = Σ_C π_2(B, C) π_3(C, D) / μ_{2,3}(C)
 = Σ_C P(B | C) P(C, D)
 = VarElim({π_2, π_3 / μ_{2,3}}, {B, D})
Viterbi decoding (finding the MPE)

• Let x*_{1:N} = argmax_{x_{1:N}} P(x_{1:N}) be (one of the) most probable assignments.
• We can compute p* = P(x*_{1:N}) using the max-product algorithm.
• e.g., A → B:

P(a*, b*) = max_a max_b P(a) P(b | a)
 = max_a max_b φ_A(a) φ_B(b, a)
 = max_a φ_A(a) max_b φ_B(b, a)        (define τ_B(a) = max_b φ_B(b, a))
 = max_a φ_A(a) τ_B(a)                 (= τ_A(∅))

• We can push max inside products.
• (max, ∏) and (Σ, ∏) are both commutative semi-rings.
Viterbi decoding (finding the MPE)

• Max-product gives us p* = max_{x_{1:N}} P(x_{1:N}), but not x*_{1:N} = argmax_{x_{1:N}} P(x_{1:N}).
• To compute the most probable assignment, we need to do max-product followed by a traceback procedure.
• e.g., A → B: we cannot find the most probable value for A unless we know which B we would choose in response.
• So when we compute τ_B(a) = max_b φ_B(b, a), we also store

λ_B(a) = argmax_b φ_B(b, a)

• When we compute τ_A(∅) = max_a φ_A(a) τ_B(a), we also compute

a* = argmax_a φ_A(a) τ_B(a)

• Then traceback: b* = λ_B(a*).
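A sketch of max-product with traceback for an HMM, using the same (assumed) array conventions as the forwards-backwards sketches above; it works in log space, which is the usual way to combine Viterbi with the underflow fix discussed earlier.

import numpy as np

def viterbi(A, B, pi):
    """Most probable state sequence for an HMM (max-product + traceback).
    A[i, j] = P(X_{t+1}=j | X_t=i), B[t, i] = p(y_t | X_t=i), pi[i] = P(X_1=i)."""
    T, K = B.shape
    logA, logB = np.log(A), np.log(B)
    delta = np.zeros((T, K))                 # delta[t, j] = best log score ending in state j at time t
    back = np.zeros((T, K), dtype=int)       # back[t, j]  = argmax predecessor (the lambda's)
    delta[0] = np.log(pi) + logB[0]
    for t in range(1, T):
        scores = delta[t-1][:, None] + logA          # scores[i, j]
        back[t] = scores.argmax(axis=0)              # store argmax for traceback
        delta[t] = scores.max(axis=0) + logB[t]      # push max inside the product (log sum)
    x = np.zeros(T, dtype=int)                       # traceback
    x[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        x[t] = back[t+1, x[t+1]]
    return x, delta[-1].max()                        # x*_{1:T} and log p*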
More complex example

[Figure: student network — Difficulty and Intelligence are parents of Grade, Intelligence is a parent of SAT, Grade is a parent of Letter]

p* = max_G max_L φ_L(L, G) max_D φ_D(D) max_I φ_I(I) φ_G(G, I, D) max_S φ_S(I, S)
More complex example

p* = max_G max_L φ_L(L, G) max_D φ_D(D) max_I φ_I(I) φ_G(G, I, D) max_S φ_S(I, S)
   = max_G max_L φ_L(L, G) max_D φ_D(D) max_I φ_I(I) φ_G(G, I, D) τ_1(I)      where τ_1(I) = max_S φ_S(I, S)
   = max_G max_L φ_L(L, G) max_D φ_D(D) τ_2(G, D)                             where τ_2(G, D) = max_I φ_I(I) φ_G(G, I, D) τ_1(I)
   = max_G max_L φ_L(L, G) τ_3(G)                                             where τ_3(G) = max_D φ_D(D) τ_2(G, D)
   = max_G τ_4(G)                                                             where τ_4(G) = max_L φ_L(L, G) τ_3(G)
   = τ_5(∅) = 0.184
Traceback

p* = max_G max_L φ_L(L, G) max_D φ_D(D) max_I φ_I(I) φ_G(G, I, D) max_S φ_S(I, S)
   = max_G max_L φ_L(L, G) max_D φ_D(D) max_I φ_I(I) φ_G(G, I, D) τ_1(I)
   = max_G max_L φ_L(L, G) max_D φ_D(D) τ_2(G, D)
   = max_G max_L φ_L(L, G) τ_3(G)
   = max_G τ_4(G) = τ_5(∅)

λ_5(∅) = argmax_g τ_4(g) = g*
λ_4(g) = argmax_l φ_L(l, g) τ_3(g),               l* = λ_4(g*)
λ_3(g) = argmax_d φ_D(d) τ_2(g, d),               d* = λ_3(g*)
λ_2(g, d) = argmax_i φ_I(i) φ_G(g, i, d) τ_1(i),  i* = λ_2(g*, d*)
λ_1(i) = argmax_s φ_S(i, s),                      s* = λ_1(i*)
Finding the K most probable assignments

• There may be several (m_1) assignments with the same highest probability; call them x^{(1,1)}_{1:N}, ..., x^{(1,m_1)}_{1:N}.
• These can be found by breaking ties in the argmax.
• The second most probable assignment(s) after these, x^{(2,1)}_{1:N}, ..., x^{(2,m_2)}_{1:N}, must differ from these in at least one variable's assignment.
• Hence we assert evidence that the next assignment must be distinct from all m_1 MPEs, and re-run Viterbi.
• Project idea: implement this and compare it to the loopy belief propagation version to be discussed later.
• This is often used to produce the “N-best list” in speech recognition; these hypotheses are then re-ranked using more sophisticated (discriminative) models.
Marginal MAP (max-sum-product)

[Figure: student network — Difficulty and Intelligence are parents of Grade, Intelligence is a parent of SAT, Grade is a parent of Letter]

p* = max_L max_S Σ_G φ_L(L, G) Σ_I φ_I(I) φ_S(S, I) Σ_D φ_G(G, I, D)

• We can easily modify the previous algorithms to cope with examples such as this.
• However, max and sum do not commute!

max_Y Σ_X φ(X, Y) ≠ Σ_X max_Y φ(X, Y)

• Hence we must use a constrained elimination ordering, in which we sum out first, then max out.
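A two-line numeric check that max and sum do not commute, using an arbitrary made-up 2×2 table φ(X, Y).

import numpy as np

phi = np.array([[1.0, 5.0],
                [4.0, 3.0]])                    # phi(X, Y): rows index X, columns index Y

sum_then_max = phi.sum(axis=0).max()            # max_Y sum_X phi(X, Y) = max(5, 8) = 8
max_then_sum = phi.max(axis=1).sum()            # sum_X max_Y phi(X, Y) = 5 + 4    = 9
assert sum_then_max != max_then_sum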
Constrained elimination orderings may induce large cliques

[Figure: chain X1 − X2 − ... − Xn with a Yi attached to each Xi]

p* = max_{Y_1,...,Y_n} Σ_{X_1,...,X_n} P(Y_{1:n}, X_{1:n})

• We must eliminate all the X_i's first, which induces a huge clique over all the Y_i's!
• Thm: exact marginal MAP inference is NP-hard even in tree-structured graphical models.
• An identical problem arises with influence (decision) diagrams, where we must sum out random variables before maxing out action variables.
• An identical problem arises with “hybrid networks”, where we must sum out discrete random variables before integrating out Gaussian random variables (ch 11).
