08 Graphical Models
Mixing these three notions is a common source of trouble. We will start with the
frequentist interpretation and then discuss the Bayesian one.
Why do we need probability for ML?
Probabilistic vs. Algorithmic learning
• Algorithmic ML (e.g., SVMs):
◦ Strictly focus on the task at hand → discriminative
◦ Black box
◦ Algorithms often motivated directly by optimization methods → fast
◦ Examples: the perceptron, SVM, etc.
◦ “Frequentist”
• Probabilistic ML (e.g., graphical models):
◦ Everything in the world is a random variable → generative
◦ Flexible modeling framework for incorporating prior knowledge
◦ Models are often expressed with graphs → efficient message passing algorithms
◦ Example: k-means clustering
◦ “Bayesian”
[Breiman: Statistical modeling: the two cultures]
Joint probabilities and independence
Machine learning applications often involve a large number of variables (features) X1, . . . , Xn.
• The conditional probability of Xi given Xj is p(Xi | Xj) = p(Xi, Xj) / p(Xj).
IDEA: When faced with a large number of features, use our prior knowledge of independencies to make learning easier.
Directed graphical models
Also called Bayes nets or Belief Networks. Each vertex v ∈ V corresponds
to a random variable. Graph must be acyclic but not necessarily a tree.
Directed graphical models
Assuming that X1, . . . , X6 are binary random variables, how many numbers are needed to describe their joint distribution? 2^6 − 1 = 63.
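To make the count concrete, here is a minimal Python sketch comparing the full joint table with a factored model; the chain structure X1 → X2 → . . . → X6 is assumed purely for illustration (the actual graph on the slide is not reproduced here).

# Number of free parameters needed for 6 binary variables.
n = 6

# Full joint table: one entry per configuration, minus 1 for normalization.
full_joint = 2**n - 1          # 63

# Assumed chain X1 -> X2 -> ... -> X6:
# p(x1) needs 1 number; each p(x_i | x_{i-1}) needs 2 (one per parent value).
chain = 1 + (n - 1) * 2        # 11

print(full_joint, chain)       # 63 11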
Example: Markov chains
• If x1, x2, . . . is a series of (discrete or continuous) random variables corresponding to a process evolving in time, then xt should only depend on what happened in the past:
p(x_t \mid x_1, \ldots, x_{t-1}) = p(x_t \mid x_{t-1}).
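A minimal sampling sketch of this Markov property; the two-state transition matrix P below is an assumption made for illustration, not part of the slide.

import numpy as np

rng = np.random.default_rng(0)

# Assumed 2-state transition matrix: row i gives p(x_t | x_{t-1} = i).
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

def sample_chain(T, x0=0):
    """Sample x_1, ..., x_T using only the previous state (Markov property)."""
    xs = [x0]
    for _ in range(T - 1):
        xs.append(rng.choice(2, p=P[xs[-1]]))
    return xs

print(sample_chain(10))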
Hidden Markov Models (HMM)
Applications of HMMs
Undirected graphical models
Also called Markov Random Fields. Graph can be any undirected graph.
Common example used for image segmentation:
The general form of the joint distribution over all the variables is
p(x) = \frac{1}{Z} \prod_{c \in \mathrm{Cliques}(G)} \phi_c(x_c)
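As a minimal sketch of this factorization, the tiny 2×2 grid, its pairwise cliques, and the smoothness potential below are all assumed for illustration, in the spirit of the segmentation example.

import itertools

# Assumed tiny MRF: a 2x2 grid of binary pixels, cliques = neighboring pairs.
cliques = [(0, 1), (2, 3), (0, 2), (1, 3)]

def phi(xi, xj):
    # Assumed smoothness potential: neighboring pixels prefer to agree.
    return 2.0 if xi == xj else 1.0

def unnormalized(x):
    out = 1.0
    for i, j in cliques:
        out *= phi(x[i], x[j])
    return out

# Z sums the unnormalized score over all 2^4 configurations.
Z = sum(unnormalized(x) for x in itertools.product([0, 1], repeat=4))
x = (0, 0, 1, 1)
print(unnormalized(x) / Z)   # p(x) = (1/Z) * prod_c phi_c(x_c)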
Example: MRFs for segmentation
Purpose of graphical models
Graphical models
• capture prior knowledge about relationships between variables
• provide a compact representation of distributions over many variables
• define a specific hypothesis class
• help with figuring out causality
• the variables can be either discrete (e.g., “airbag yes/no”), continuous (e.g., “value”), or a mixture of both types
Tasks for graphical models
Inference
Partition V , the set of nodes, into three sets:
1. the set O of observed nodes
2. the set Q of query nodes
3. the set L of latent nodes
Interested in
p(x_Q \mid x_O) = \frac{\sum_{x_L} p(x_Q, x_L, x_O)}{\sum_{x_L, x_Q} p(x_Q, x_L, x_O)}
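A brute-force sketch of this formula; the three-node chain A → B → C and its conditional probability tables are assumed for illustration (B plays the role of the latent set L, C is observed, A is the query).

import itertools

# Assumed chain A -> B -> C over binary variables, p(a, b, c) = p(a) p(b|a) p(c|b).
pA = {0: 0.6, 1: 0.4}
pB = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # pB[a][b] = p(b | a)
pC = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}   # pC[b][c] = p(c | b)

def joint(a, b, c):
    return pA[a] * pB[a][b] * pC[b][c]

def query(c_obs):
    """p(A | C = c_obs): sum out the latent B, then normalize over A."""
    num = {a: sum(joint(a, b, c_obs) for b in (0, 1)) for a in (0, 1)}
    Z = sum(num.values())
    return {a: num[a] / Z for a in (0, 1)}

print(query(1))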
Directed graphical models (Bayes nets)
Common cause
XA ̸⊥⊥ XB but XA ⊥⊥ XB | XC
Therefore, if C is observed, then A and B become independent.
Explaining away
XA ⊥⊥ XB but XA ̸⊥⊥ XB | XC
Therefore, if C is not observed (and neither are any of its descendants), then A and B are independent.
Example: Burglary ̸⊥⊥ Earthquake | Alarm
D–separation
Theorem
X ⊥⊥ Y | S if and only if all paths from X to Y are blocked.
Learning parameters in Bayes nets
Assuming for now that every node v has two parents, with values (x_{m(v)}, x_{f(v)}), the conditional distributions can be parametrized by 3D arrays θ1, . . . , θk:
Simpson’s paradox: word of caution
You are trying to determine whether a particular treatment for a serious
disease is beneficial. Given the following observations, would you recommend it?
Now what if you discovered that the breakdown by gender was this?
Simpson’s paradox
• A graphical model can never capture all the variables that might possibly
be relevant. In the first case we ignored gender. This can affect what
interpretation the model suggests.
• The fact that there is an arrow from A (treatment) to B (outcome) does not imply that A causes B. In our case we had a hidden common cause, gender, with the opposite effect on B.
• To tease out causal structure we need more sophisticated tools than just ordinary graphical models: we need to introduce interventions.
• Observational studies are not sufficient. The gold standard in medicine is
randomized controlled trials (RCTs).
Learning parameters in Bayes nets
p(x_v \mid x_{m(v)}, x_{f(v)}) = [\theta_v]_{x_{m(v)}, x_{f(v)}, x_v}.

\ell(\theta \mid D) = \prod_{t=1}^{T} \prod_{v \in V} [\theta_v]_{x^t_{m(v)}, x^t_{f(v)}, x^t_v} = \prod_{v \in V} \ell_v(\theta_v \mid D)

\ell_v(\theta_v \mid D) = \prod_{t=1}^{T} [\theta_v]_{x^t_m, x^t_f, x^t_v}

or

\ell_v(\theta_v \mid D) = \prod_{a} \prod_{b} \frac{N_{a,b}!}{N_{a,b,1}!\, N_{a,b,2}! \cdots N_{a,b,k_v}!}\; [\theta_v]_{a,b,1}^{N_{a,b,1}} [\theta_v]_{a,b,2}^{N_{a,b,2}} \cdots [\theta_v]_{a,b,k_v}^{N_{a,b,k_v}}

where N_{a,b,c} = |\{\, t : x^t_m = a,\ x^t_f = b,\ x^t_v = c \,\}| and N_{a,b} = \sum_c N_{a,b,c}.
Learning parameters in Bayes nets
The MLE for each entry is
[\hat{\theta}_v]_{a,b,c} = \frac{N_{a,b,c}}{\sum_{c'} N_{a,b,c'}}.
With a pseudocount \gamma, the smoothed estimate is
[\hat{\theta}_v]_{a,b,c} = \frac{N_{a,b,c} + \gamma}{\sum_{c'} (N_{a,b,c'} + \gamma)}.
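A minimal sketch of these counting formulas; the toy data and the variable sizes are assumed.

import numpy as np

# Assumed toy data: each row is (x_m, x_f, x_v) for one node v with two parents.
data = [(0, 0, 1), (0, 0, 1), (0, 1, 0), (1, 0, 1), (1, 1, 0), (0, 0, 0)]
k_m, k_f, k_v = 2, 2, 2
gamma = 1.0   # pseudocount

# N[a, b, c] = number of examples with x_m = a, x_f = b, x_v = c.
N = np.zeros((k_m, k_f, k_v))
for a, b, c in data:
    N[a, b, c] += 1

# Smoothed estimate: [theta_v]_{a,b,c} = (N_{a,b,c} + gamma) / sum_c' (N_{a,b,c'} + gamma)
theta = (N + gamma) / (N + gamma).sum(axis=2, keepdims=True)
print(theta[0, 0])   # p(x_v | x_m = 0, x_f = 0)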
Inference in Bayes nets: example
XA ⊥⊥ XC | {XB, XD},   XB ⊥⊥ XD | {XA, XC}.
Examples of undirected models
Ordinary separation
p(x) = \frac{1}{Z} \prod_{c \in \mathrm{Cliques}(G)} \phi_c(x_c)
Theorem
X ⊥⊥ Y | S if and only if all paths from X to Y contain at least one node in S.
This is simpler than in the directed case.
Parameter estimation and inference
In undirected models
• Parameter estimation: Not as easy as in the directed case!
• Inference: message passing algorithms.
Bayesians vs. Frequentists
Joint and conditional probability
Joint:
P(A, B)
Conditional:
P(A \mid B) = \frac{P(A, B)}{P(B)}, \qquad P(B \mid A) = \frac{P(A, B)}{P(A)}
Mammography
Sensitivity of screening mammogram p(+|cancer) ≈ 90%
Bayes’ rule
p(B \mid A) = \frac{p(A \mid B)\, p(B)}{p(A)}
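A hedged worked example of plugging the mammography numbers into this rule: only the 90% sensitivity comes from the earlier slide; the prevalence and false-positive rate below are assumed for illustration.

# Assumed numbers (illustrative only):
p_cancer = 0.01                # prevalence (assumed)
sens = 0.90                    # p(+ | cancer), from the slide
p_pos_given_healthy = 0.09     # assumed false-positive rate

# Bayes' rule: p(cancer | +) = p(+ | cancer) p(cancer) / p(+)
p_pos = sens * p_cancer + p_pos_given_healthy * (1 - p_cancer)
print(sens * p_cancer / p_pos)   # roughly 0.09: a positive test is far from conclusive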
Prosecutor’s fallacy: Sally Clark
Two of her children died without explanation.
Convict if ...
Statistical estimation
The fundamental problem of statistics
Probability: given the parameter θ, reason about the samples it generates.
Statistics: given a sample, infer the parameter:
S = \{X_1, X_2, \ldots, X_m\} \xrightarrow{\ \text{estimation}\ } \hat{\theta}
The fundamental problem of statistics
The problem of inferring θ from X1 , X2 , . . . , Xm is inherently ill defined:
• pθ (x) as a function of x is a probability — OK
• pθ (x) as a function of θ (called the likelihood ℓ(θ) = pθ (x) ) is not a
probability — Panic!
Two religions:
• Turn pθ (x) into a probability by putting a distribution on θ (called the
prior) → Bayesian statistics
• Just come up with a guess θ̂ for θ and then show that a sample leading to a θ̂ that is far off is unlikely. → Frequentist statistics
Bayesian statistics in ML
\underbrace{p(\theta \mid X)}_{\text{posterior}} = \frac{\overbrace{p(X \mid \theta)}^{\text{likelihood}}\ \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(X)}_{\text{evidence}}} = \frac{p(X \mid \theta)\, p(\theta)}{\int p(X \mid \theta)\, p(\theta)\, d\theta}
Frequentist statistics in ML
Example: perceptron
Frequentist vs. Bayesian estimators
Frequentist:
• Use the maximum likelihood estimator \hat{\theta}_{\mathrm{MLE}} = \arg\max_\theta \ell(\theta)
• Just one of many options
Bayesian:
• A true Bayesian always reports the full posterior p(θ|X).
• When pressed, might give the maximum a posteriori (MAP) estimator \hat{\theta}_{\mathrm{MAP}} = \arg\max_\theta p(\theta \mid X)
• or the posterior mean \hat{\theta} = \int \theta\, p(\theta \mid X)\, d\theta.
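To contrast the three estimators concretely, here is a minimal sketch for a Bernoulli (coin-flip) model with a Beta prior; the data and the prior parameters are assumed.

# Assumed data: 7 heads out of 10 flips; assumed Beta(2, 2) prior on theta.
heads, tails = 7, 3
alpha, beta = 2.0, 2.0

# Frequentist MLE: maximize the likelihood.
theta_mle = heads / (heads + tails)                                     # 0.700

# Bayesian: the posterior is Beta(alpha + heads, beta + tails).
theta_map  = (alpha + heads - 1) / (alpha + beta + heads + tails - 2)   # posterior mode, 0.667
theta_mean = (alpha + heads) / (alpha + beta + heads + tails)           # posterior mean, ~0.643

print(theta_mle, theta_map, theta_mean)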
Naive Bayes
A generative model for documents
The multinomial model
p(x) = \frac{n!}{x_1!\, x_2! \cdots x_k!}\, \theta_1^{x_1} \theta_2^{x_2} \cdots \theta_k^{x_k}, \qquad \sum_{i=1}^{k} \theta_i = 1.
Naive Bayes — the Frequentist way
p(x) = \frac{n!}{x_1!\, x_2! \cdots x_k!}\, \theta_1^{x_1} \theta_2^{x_2} \cdots \theta_k^{x_k}, \qquad \sum_{i=1}^{k} \theta_i = 1.
MLE: maximize
\log \ell(\theta_1, \theta_2, \ldots, \theta_k) = \text{constant} + \sum_{i=1}^{k} x_i \log \theta_i \quad \text{s.t.} \quad \sum_{i=1}^{k} \theta_i = 1
\frac{x_i}{\hat{\theta}_i} = \lambda \;\longrightarrow\; \hat{\theta}_i = \frac{x_i}{\sum_{i=1}^{k} x_i}
Naive Bayes — the Frequentist way
The MLE \hat{\theta}_i = x_i / \sum_{i=1}^{k} x_i makes perfect sense but gives p(x') = 0 whenever a new document D' contains a word not seen in training! The fix is to replace it with
\hat{\theta}_i = \frac{x_i + \gamma}{\sum_{i=1}^{k} (x_i + \gamma)},
where γ is a “pseudocount”.
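A minimal sketch of the pseudocount fix; the vocabulary size and word counts are assumed.

import numpy as np

# Assumed word counts x_i for a k-word vocabulary in one class of documents.
x = np.array([5, 3, 0, 2])    # the third word was never seen in training
gamma = 1.0                   # pseudocount

theta_mle    = x / x.sum()                        # gives p = 0 for the unseen word
theta_smooth = (x + gamma) / (x + gamma).sum()    # strictly positive for every word

print(theta_mle)      # [0.5  0.3  0.   0.2]
print(theta_smooth)   # [0.43 0.29 0.07 0.21] (rounded)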
Naive Bayes — the Bayesian way
1. Take the prior p(θ1 , θ2 , . . . , θk ) to be
\mathrm{Dirichlet}(\alpha_1, \alpha_2, \ldots, \alpha_k) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_k)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_k)}\, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}
Naive Bayes — the Bayesian way
\theta_i = \frac{x_i + \alpha_i - 1}{\sum_{i=1}^{k} (x_i + \alpha_i - 1)}
Learning the parameters of Bayes nets
Parameter Learning
Recall that the joint distribution of Bayes net is of the form
p(x) = \prod_{v \in V} p_v(x_v \mid x_{\mathrm{pa}(v)}).
A simple model
The simplest case is a model with just two variables, x = (x1 , x2 ) :
Assume that:
• x1 can take on k1 different values {1, 2, . . . , k1 } ,
• x2 can take on k2 different values {1, 2, . . . , k2 } .
In this case p(x_2 | x_1) is described by the matrix Θ ∈ [0, 1]^{k_1 × k_2} of parameters
\theta_{i,j} = p(X_2 = j \mid X_1 = i).
How do we learn this matrix from the training data \{(x^u_1, x^u_2)\}_{u=1}^{m}?
The Maximum Likelihood Principle
Assume that we are given
• a parametric family of distributions p_Θ(x) parametrized by the (set of) parameters Θ,
• a sample {x_1, x_2, . . . , x_m} from p_Θ.
\ell(\Theta) = \prod_{u=1}^{m} p(x^u_2 \mid x^u_1) = \prod_{u=1}^{m} \theta_{x^u_1, x^u_2} = \prod_{i=1}^{k_1} \prod_{j=1}^{k_2} \theta_{i,j}^{N_i(j)},
where N_i(j) is the number of training examples with x_1 = i and x_2 = j.
• The likelihood only depends on how many times we have observed each
combination of (x1 , x2 ) together. This makes sense.
• For each possible i, the conditional p(X_2 = j | X_1 = i) = θ_{i,j} is a probability distribution as a function of j. → We can learn each row of Θ separately, with the constraint that \sum_{j=1}^{k_2} \theta_{i,j} = 1.
Maximum Likelihood
To learn (θ_{i,1}, θ_{i,2}, . . . , θ_{i,k_2}) we only need the likelihood over those training examples in which x_1 = i:
\ell_i(\theta_{i,1}, \ldots, \theta_{i,k_2}) = \prod_{j=1}^{k_2} \theta_{i,j}^{N_i(j)}.
maximize \sum_{j=1}^{k_2} N_i(j) \log \theta_{i,j} \quad \text{subject to} \quad \sum_{j=1}^{k_2} \theta_{i,j} = 1.
Zero counts
One problem with the MLE
\hat{\theta}_{i,j} = \frac{N_i(j)}{\sum_{j'=1}^{k_2} N_i(j')}
is that if one (i, j) pair never occurs in the training data, then the corresponding \hat{\theta}_{i,j} will be zero. → The learned model will not be able to deal with any example in which X_1 = i and X_2 = j.
The fix is to add a pseudocount γ:
\hat{\theta}_{i,j} = \frac{N_i(j) + \gamma}{\sum_{j'=1}^{k_2} (N_i(j') + \gamma)}.
Multiple parents
What if the model is p(x_{r+1} | x_1, . . . , x_r), i.e., the node has r parents? The parameter array θ is now (r+1)-dimensional. However, fixing any (i_1, . . . , i_r) we can solve for the corresponding \hat{\theta}_{\ldots,j}'s just as before and get (using pseudocounts)
\hat{\theta}_{i_1,\ldots,i_r,j} = \frac{N_{i_1,\ldots,i_r}(j) + \gamma}{\sum_{j'=1}^{k_{r+1}} (N_{i_1,\ldots,i_r}(j') + \gamma)}.
MLE for the whole Bayes Net
Recall that the joint distribution of the whole Bayes net is
p(x) = \prod_{v \in V} p_v(x_v \mid x_{\mathrm{pa}(v)}),
1. The likelihood factorizes into a separate term for each node v, so each conditional p_v(x_v | x_pa(v)) can be learned independently.
2. Use the same steps as before to find the corresponding MLE solution (with pseudocounts)
[\hat{\theta}_v]_{i_1,\ldots,i_r,j} = \frac{N_{i_1,\ldots,i_r}(j) + \gamma}{\sum_{j'=1}^{k_{r+1}} (N_{i_1,\ldots,i_r}(j') + \gamma)}.
Expectation maximization
What about when some of the xi ’s are not observed? Use the EM strategy
and iterate until convergence:
1. E-step: Compute the expected log-likelihood (w.r.t. the hidden variables) under \hat{\Theta}^{\text{old}}:
L_{\hat{\Theta}^{\text{old}}}(\Theta).
2. M-step: Maximize this to get the new estimate for \hat{\Theta}:
\hat{\Theta} = \arg\max_{\Theta} L_{\hat{\Theta}^{\text{old}}}(\Theta).
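A minimal EM sketch for one concrete hidden-variable model, a two-component mixture of Bernoullis; the data, the initialization, and the model choice are all assumed for illustration rather than taken from the slide.

import numpy as np

rng = np.random.default_rng(0)

# Assumed binary data: each row is one example; the component that generated it is hidden.
X = rng.binomial(1, 0.8, size=(100, 5))
X[50:] = rng.binomial(1, 0.2, size=(50, 5))

pi = np.array([0.5, 0.5])                 # mixing weights
mu = rng.uniform(0.3, 0.7, size=(2, 5))   # Bernoulli parameters per component

for _ in range(50):
    # E-step: responsibilities r[n, z] proportional to pi[z] * prod_d mu[z,d]^x (1-mu[z,d])^(1-x)
    like = np.prod(mu[None] ** X[:, None] * (1 - mu[None]) ** (1 - X[:, None]), axis=2)
    r = pi * like
    r /= r.sum(axis=1, keepdims=True)
    # M-step: maximize the expected complete-data log-likelihood.
    pi = r.mean(axis=0)
    mu = (r.T @ X) / r.sum(axis=0)[:, None]

print(pi.round(2), mu.round(2))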
FURTHER READING