
Topic 8: GRAPHICAL MODELS

STAT 37710/CAAM 37710/CMSC 35400 Machine Learning


Risi Kondor, The University of Chicago
Three types of “Probability”
1. Frequency of repeated trials: if an experiment is repeated infinitely
many times, 0 ≤ p(A) ≤ 1 is the fraction of times that the outcome will
be A . Typical example: number of times that a coin comes up heads.
→ Frequentist probability.
2. Degree of belief: A quantity obeying the same laws as the above,
describing how likely we think a (possibly deterministic) event is. Typical
example: the probability that the Earth will warm by more than 5°F by
2100. → Bayesian probability.
3. Subjective probability: “I’m 110% sure that I’ll go out to dinner with you
tonight.”

Mixing these three notions is a source of lots of trouble. We will start with the
frequentist interpretation and then discuss the Bayesian one.

Why do we need probability for ML?

Two distinct reasons:


1. To analyze, understand and predict the performance of learning
algorithms (Statistical Learning Theory, PAC model, etc.)
2. To build flexible and intuitive probabilistic models.

Probabilistic vs. Algorithmic learning
• Algorithmic ML (e.g., SVMs):
◦ Strictly focus on the task at hand → discriminative
◦ Black box
◦ Algorithms often motivated directly by optimization methods → fast
◦ Examples: the perceptron, SVM, etc.
◦ “Frequentist”
• Probabilistic ML (e.g., graphical models):
◦ Everything in the world is a random variable → generative
◦ Flexible modeling framework for incorporating prior knowledge
◦ Models are often expressed with graphs → efficient message passing
algorithms
◦ Example: k –means clustering
◦ “Bayesian”
[Breiman: Statistical modeling: the two cultures]

Joint probabilities and
independence
Machine learning applications often involve a large number of variables
(features) X1 , . . . , Xn .
• The conditional probability of X_i given X_j is

    p(x_i | x_j) = P(X_i = x_i | X_j = x_j),    p(x_i, x_j) = p(x_i | x_j) p(x_j).

• X_i and X_j are independent (denoted X_i ⊥⊥ X_j) if

    p(x_i | x_j) does not depend on x_j  ⟺  p(x_i, x_j) = p(x_i) p(x_j).

• X_i is conditionally independent of X_j given X_k (denoted X_i ⊥⊥ X_j | X_k) if

    p(x_i, x_j | x_k) = p(x_i | x_k) p(x_j | x_k).

IDEA: When faced with a large number of features, use our prior knowledge
of independencies to make learning easier.
Directed graphical models
Also called Bayes nets or Belief Networks. Each vertex v ∈ V corresponds
to a random variable. Graph must be acyclic but not necessarily a tree.

The general form of the joint distribution of all the variables is

    p(x) = ∏_{v∈V} p(x_v | x_{pa(v)}),

where pa(v) denotes the set of parents of v in the graph.
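For concreteness, here is a minimal sketch (not from the lecture) of how this factorization is evaluated in code, for a hypothetical three-node chain A → B → C with made-up CPT numbers:

```python
# Minimal sketch (illustrative): evaluating the Bayes-net factorization
# p(x) = prod_v p(x_v | x_pa(v)) for a small hypothetical network A -> B -> C.
# The CPT numbers below are made up.

parents = {"A": [], "B": ["A"], "C": ["B"]}

# cpt[v] maps a tuple of parent values to the distribution over v's values (0/1).
cpt = {
    "A": {(): [0.6, 0.4]},
    "B": {(0,): [0.9, 0.1], (1,): [0.3, 0.7]},
    "C": {(0,): [0.8, 0.2], (1,): [0.5, 0.5]},
}

def joint_prob(x):
    """p(x) = product over vertices of p(x_v | x_pa(v))."""
    p = 1.0
    for v, pa in parents.items():
        parent_vals = tuple(x[u] for u in pa)
        p *= cpt[v][parent_vals][x[v]]
    return p

print(joint_prob({"A": 1, "B": 0, "C": 1}))   # 0.4 * 0.3 * 0.2 = 0.024
```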

Directed graphical models
Assuming that X_1, ..., X_6 are binary random variables, how many
numbers are needed to describe their joint distribution? 2^6 − 1 = 63.

Now what if we know that they conform to this Bayes net?

Each p(x_i | x_j) corresponds to a 2 × 2 table, but its rows sum to 1, so only 2
numbers are required. p(x_6 | x_2, x_5) requires 4 numbers.
Total: 1 + 2 + 2 + 2 + 2 + 4 = 13. Quite a saving!

Example: Markov chains
• If x1 , x2 , . . . is a series of (discrete or continuous) random variables
corresponding to a process evolving in time, then xt should only depend
on what happened in the past:

p(xt |x1 , . . . , xt−1 , xt+1 , . . .) = p(xt |x1 , . . . , xt−1 ).

• The sequence x1 , x2 , . . . is said to be a k ’th order Markov chain if

p(xt |x1 , . . . , xt−1 , xt+1 , . . .) = p(xt |xt−1 , . . . , xt−k ).

• A (first order) Markov chain is said to be stationary if the transition
probabilities p(x_t | x_{t−1}) are independent of t,

    p(x_t | x_{t−1}) = M_{x_t, x_{t−1}}.
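As an illustration (my own sketch, not from the slides), sampling from a stationary first-order chain amounts to repeatedly drawing the next state from the column of M selected by the current state:

```python
# Minimal sketch (illustrative): sampling a stationary first-order Markov chain
# with transition matrix M, where M[j, i] = p(x_t = j | x_{t-1} = i),
# matching the M_{x_t, x_{t-1}} convention above.
import numpy as np

rng = np.random.default_rng(0)
M = np.array([[0.9, 0.2],    # p(next=0 | prev=0), p(next=0 | prev=1)
              [0.1, 0.8]])   # p(next=1 | prev=0), p(next=1 | prev=1)

def sample_chain(T, x0=0):
    xs = [x0]
    for _ in range(T - 1):
        xs.append(rng.choice(2, p=M[:, xs[-1]]))
    return xs

print(sample_chain(20))
```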

Hidden Markov Models (HMM)

An HMM is a Markov chain of unobserved random variables x_1, x_2, ...,
each of which is related to an observed random variable y_1, y_2, ....

Examples: tracking, part of speech tagging, phonemes, physiological states of babies, ...
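A minimal generative sketch (with made-up transition and emission matrices) makes the structure explicit: the hidden chain x_t evolves on its own, and each y_t is drawn from a distribution that depends only on x_t.

```python
# Minimal sketch (illustrative): generating from an HMM. All matrices are
# made up, with A[j, i] = p(x_t = j | x_{t-1} = i) and B[k, i] = p(y_t = k | x_t = i).
import numpy as np

rng = np.random.default_rng(1)
pi = np.array([0.5, 0.5])                    # initial state distribution
A = np.array([[0.95, 0.10], [0.05, 0.90]])   # hidden-state transitions
B = np.array([[0.8, 0.3], [0.2, 0.7]])       # emission probabilities

def sample_hmm(T):
    x = [rng.choice(2, p=pi)]
    y = [rng.choice(2, p=B[:, x[0]])]
    for _ in range(T - 1):
        x.append(rng.choice(2, p=A[:, x[-1]]))
        y.append(rng.choice(2, p=B[:, x[-1]]))
    return x, y   # x is hidden, y is what we would actually observe

states, obs = sample_hmm(10)
print(obs)
```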

Applications of HMMs

HMMs and related state space models are widely applied in


• speech recognition (which phoneme/word/etc.?)
• part of speech tagging (is it an NP, VP, etc.?)
• biological sequence analysis (intron or exon?)
• time series analysis (finance, climate, etc.)
• robotics (what is the actual location of the robot?)
• tracking

Undirected graphical models
Also called Markov Random Fields. Graph can be any undirected graph.
Common example used for image segmentation:

The general form of the joint distribution over all the variables is

    p(x) = (1/Z) ∏_{c∈Cliques(G)} ϕ_c(x_c),

where each ϕ_c is a potentially different clique potential (just a positive
function) and Z is the normalizing factor Z = Σ_x ∏_{c∈Cliques(G)} ϕ_c(x_c).
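A brute-force sketch (illustrative only, with a made-up pairwise potential on a 3-cycle) shows why Z is the expensive part: it sums the product of clique potentials over every configuration.

```python
# Minimal sketch (illustrative): computing p(x) = (1/Z) * prod_c phi_c(x_c) by
# brute force for a tiny binary MRF on a 3-cycle, with one made-up pairwise
# potential per edge. Only feasible for very small graphs, since Z sums over
# all 2^n configurations.
from itertools import product

edges = [(0, 1), (1, 2), (0, 2)]
phi = lambda a, b: 2.0 if a == b else 1.0   # favors agreeing neighbors

def unnormalized(x):
    w = 1.0
    for i, j in edges:
        w *= phi(x[i], x[j])
    return w

Z = sum(unnormalized(x) for x in product([0, 1], repeat=3))
print(unnormalized((0, 0, 0)) / Z)   # probability of the all-zeros configuration
```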
Example: the Ising model

Imagine an infinite grid of {−1, +1}-valued random variables in which
neighboring variables are connected by the potential

    ϕ(x_i, x_j) = e^{−(β/2)(x_i − x_j)^2}.

Simple model of ferromagnetism. Exhibits a phase transition.
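A minimal Gibbs-sampling sketch on a small finite grid (an assumption; the slide's model is an infinite grid) illustrates the potential: since x ∈ {−1, +1}, exp(−(β/2)(x_i − x_j)^2) is proportional to exp(β x_i x_j), so each spin's conditional distribution given its neighbors is a logistic function of the neighboring sum.

```python
# Minimal sketch (illustrative): Gibbs-sampling sweeps over a small finite-grid
# version of the Ising model.
import numpy as np

rng = np.random.default_rng(0)
n, beta, sweeps = 20, 0.6, 50
x = rng.choice([-1, 1], size=(n, n))

for _ in range(sweeps):
    for i in range(n):
        for j in range(n):
            # sum of the neighboring spins (free boundary conditions)
            s = sum(x[a, b] for a, b in [(i-1, j), (i+1, j), (i, j-1), (i, j+1)]
                    if 0 <= a < n and 0 <= b < n)
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * s))  # p(x_ij = +1 | rest)
            x[i, j] = 1 if rng.random() < p_plus else -1

print(x.mean())   # magnetization; grows in magnitude for larger beta
```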

Example: MRFs for segmentation

Purpose of graphical models

In ML we often have a large number of variables related in complicated ways.

Graphical models
• capture prior knowledge about relationships between variables
• provide a compact representation of distributions over many variables
• define a specific hypothesis class
• help with figuring out causality
• the variables can be either discrete (e.g., “airbag yes/no”), continuous (e.g.,
“value”), or a mixture of both types

Tasks for graphical models

• Model selection (i.e., learn the graph itself from data)


• Learn the parameters of the model from data (i.e., the individual
conditionals or clique potentials)
• Deduce conditional independence relations
• Infer marginals and conditional distributions

Inference
Partition V , the set of nodes, into three sets:
1. the set O of observed nodes
2. the set Q of query nodes
3. the set L of latent nodes

We are interested in

    p(x_Q | x_O) = Σ_{x_L} p(x_Q, x_L, x_O) / Σ_{x_L, x_Q} p(x_Q, x_L, x_O).

Essential for both


• Training, when we are trying to learn the distribution of some of the
nodes from data.
• Prediction, when we are trying to predict the values of some nodes (the
output) given the values of some other nodes (the input)
Question: How can we do this in less than m^{|Q|+|L|} time?
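A brute-force sketch of this computation (assuming binary variables and a joint_prob function like the hypothetical one earlier) makes the cost explicit: we sum the joint over all configurations of the latent and query variables.

```python
# Minimal sketch (illustrative): brute-force inference p(x_Q | x_O), assuming
# all variables are binary and a joint_prob(dict) function is available
# (e.g. the hypothetical one from the earlier Bayes-net sketch).
from itertools import product

def conditional(joint_prob, variables, query, observed):
    """query and observed are dicts {variable name: value}."""
    latent = [v for v in variables if v not in query and v not in observed]

    def summed_over_latent(q):
        s = 0.0
        for vals in product([0, 1], repeat=len(latent)):
            s += joint_prob({**q, **observed, **dict(zip(latent, vals))})
        return s

    numerator = summed_over_latent(query)
    denominator = sum(summed_over_latent(dict(zip(query, qv)))
                      for qv in product([0, 1], repeat=len(query)))
    return numerator / denominator

# e.g. with the earlier A -> B -> C sketch: p(C = 1 | A = 1)
# print(conditional(joint_prob, ["A", "B", "C"], {"C": 1}, {"A": 1}))
```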

Directed graphical models (Bayes nets)
Common cause

XA ̸⊥⊥ XB but XA ⊥⊥ XB | XC
Therefore, if C is observed, then A and B become independent.

Example: Lung cancer ⊥⊥ Yellow teeth | Smoking

Explaining away

XA ⊥⊥ XB but XA ̸⊥⊥ XB | XC
Therefore, if C is not observed (and neither are any of its descendants),
then A and B remain independent.

Example: Burglary ̸⊥⊥ Earthquake | Alarm

D–separation

Is X independent of Y given the set of nodes S ?

An undirected path from X to Y is said to be blocked if


1. it includes at least one node Z from S such that the arrows along the
path at Z meet head to tail or tail to tail; or
2. it includes at least one node W such that the arrows along the path at W
meet head to head, and neither W nor any of its descendants are in S .

Theorem
X ⊥⊥ Y | S if and only if all paths from X to Y are blocked.
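A sketch of an equivalent check (using the standard moralization criterion rather than the path-blocking procedure verbatim): X and Y are d-separated by S iff S separates them in the moralized ancestral graph of X ∪ Y ∪ S. The graph encoding and node names below are hypothetical.

```python
# Minimal sketch (illustrative): d-separation via the moralized ancestral graph,
# which is equivalent to the path-blocking definition above.
# The DAG is given as a dict: node -> list of parents.

def d_separated(parents, x, y, s):
    # 1. ancestral set of {x, y} ∪ S
    relevant, stack = set(), [x, y, *s]
    while stack:
        v = stack.pop()
        if v not in relevant:
            relevant.add(v)
            stack.extend(parents.get(v, []))
    # 2. moralize: connect co-parents, drop edge directions
    adj = {v: set() for v in relevant}
    for v in relevant:
        ps = [p for p in parents.get(v, []) if p in relevant]
        for p in ps:
            adj[v].add(p); adj[p].add(v)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])
    # 3. check whether x can reach y without passing through S
    seen, stack = set(s), [x]
    while stack:
        v = stack.pop()
        if v == y:
            return False
        if v not in seen:
            seen.add(v)
            stack.extend(adj[v] - seen)
    return True

# Explaining away: Burglary -> Alarm <- Earthquake
g = {"Alarm": ["Burglary", "Earthquake"]}
print(d_separated(g, "Burglary", "Earthquake", set()))        # True
print(d_separated(g, "Burglary", "Earthquake", {"Alarm"}))    # False
```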

Learning parameters in Bayes nets

Recall the general form of a discrete Bayes net:

    p(x) = ∏_{v∈V} p(x_v | x_{pa(v)}),    x_v ∈ {1, 2, ..., k_v}.

Assuming for now that every vertex has two parents, (x_{m(v)}, x_{f(v)}), the
conditional distributions can be parametrized by 3D arrays (θ_v)_{v∈V}:

    p(x_v | x_{m(v)}, x_{f(v)}) = [θ_v]_{x_{m(v)}, x_{f(v)}, x_v}.

To ensure normalization, Σ_{x_v} [θ_v]_{x_{m(v)}, x_{f(v)}, x_v} = 1 for all x_{m(v)}, x_{f(v)}.

Given data D = (x1 , . . . , xT ) , what is the MLE setting of (θv )v∈V ?

Simpson’s paradox: word of caution
You are trying to determine whether a particular treatment for a serious
disease is beneficial. Given the following observations would you
recommend it?

                Survived   Did not survive   Survival rate
Treatment           20            20              50%
No treatment        16            24              40%

Now what if you discovered that the breakdown by gender was this?

Males           Survived   Did not survive   Survival rate
Treatment           18            12              60%
No treatment         7             3              70%

Females         Survived   Did not survive   Survival rate
Treatment            2             8              20%
No treatment         9            21              30%
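A quick check of the arithmetic in these tables:

```python
# Minimal sketch: verifying the rates above. The treatment looks better in
# aggregate (50% vs 40%) yet worse within each gender group.
groups = {
    "males":   {"treatment": (18, 12), "no treatment": (7, 3)},
    "females": {"treatment": (2, 8),   "no treatment": (9, 21)},
}

def rate(survived, died):
    return survived / (survived + died)

for arm in ["treatment", "no treatment"]:
    total_s = sum(groups[g][arm][0] for g in groups)
    total_d = sum(groups[g][arm][1] for g in groups)
    print(arm, "overall:", rate(total_s, total_d),
          "by group:", {g: rate(*groups[g][arm]) for g in groups})
```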

Simpson’s paradox

• A graphical model can never capture all the variables that might possibly
be relevant. In the first case we ignored gender. This can affect what
interpretation the model suggests.
• The fact that there is an arrow from A (treatment) to B (outcome) does
not imply that A causes B. In our case we had a hidden common
cause, gender, whose effect on B went in the opposite direction.
• To tease out causal structure we need more sophisticated tools than just
ordinary graphical models: we need to introduce interventions.
• Observational studies are not sufficient. The gold standard in medicine is
randomized controlled trials (RCTs).

Learning parameters in Bayes nets
p(x_v | x_{m(v)}, x_{f(v)}) = [θ_v]_{x_{m(v)}, x_{f(v)}, x_v}.

    ℓ(θ|D) = ∏_{t=1}^{T} ∏_{v∈V} [θ_v]_{x^t_{m(v)}, x^t_{f(v)}, x^t_v} = ∏_{v∈V} ℓ_v(θ_v|D)

    ℓ_v(θ_v|D) = ∏_{t=1}^{T} [θ_v]_{x^t_m, x^t_f, x^t_v}

or

    ∏_a ∏_b [ N_{a,b}! / (N_{a,b,1}! N_{a,b,2}! ... N_{a,b,k_v}!) ] [θ_v]_{a,b,1}^{N_{a,b,1}} [θ_v]_{a,b,2}^{N_{a,b,2}} ... [θ_v]_{a,b,k_v}^{N_{a,b,k_v}}

where N_{a,b,c} = |{ t : x^t_m = a, x^t_f = b, x^t_v = c }| and N_{a,b} = Σ_c N_{a,b,c}.

Learning parameters in Bayes nets
Each

    ℓ_{v,a,b}(θ_v|D) = [ N_{a,b}! / (N_{a,b,1}! N_{a,b,2}! ... N_{a,b,k_v}!) ] [θ_v]_{a,b,1}^{N_{a,b,1}} ... [θ_v]_{a,b,k_v}^{N_{a,b,k_v}}

is just a multinomial like in Naive Bayes, so we know the MLE is

    [θ̂_v]_{a,b,c} = N_{a,b,c} / Σ_{c′} N_{a,b,c′}.

As before, can also use the biased estimator

    [θ̂_v]_{a,b,c} = (N_{a,b,c} + γ) / Σ_{c′} (N_{a,b,c′} + γ).

Inference in Bayes nets: example

Is there a general algorithm that allows us to find factorizations like this?


→ Message passing algorithms
Undirected graphical models
Not every type of conditional independence structure can be represented by a
Bayes net. Example:

    X_A ⊥⊥ X_C | {X_B, X_D},    X_B ⊥⊥ X_D | {X_A, X_C}.

Exercise: Give an example of a structure that cannot be represented by a directed model either.

Examples of undirected models

• Social network
• Protein interaction network
• Grid model (e.g., Ising)

Ordinary separation

Recall the general form of the undirected models:

    p(x) = (1/Z) ∏_{c∈Cliques(G)} ϕ_c(x_c)

Is X independent of Y given the set of nodes S?

Theorem
X ⊥⊥ Y | S if and only if all paths from X to Y contain at least one node in S.
This is simpler than in the directed case.
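A minimal sketch of this check (illustrative adjacency list): remove S and see whether X can still reach Y.

```python
# Minimal sketch (illustrative): ordinary graph separation. X ⊥⊥ Y | S iff
# removing S disconnects X from Y. The graph is an adjacency dict.
def separated(adj, x, y, s):
    seen, stack = set(s), [x]
    while stack:
        v = stack.pop()
        if v == y:
            return False
        if v not in seen:
            seen.add(v)
            stack.extend(n for n in adj[v] if n not in seen)
    return True

# 4-cycle A - B - C - D - A:
adj = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["A", "C"]}
print(separated(adj, "A", "C", {"B", "D"}))   # True: A ⊥⊥ C | {B, D}
```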

Parameter estimation and
inference

In undirected models
• Parameter estimation: Not as easy as in the directed case!
• Inference: message passing algorithms.

Bayesians vs. Frequentists
Joint and conditional probability

Joint:
P(A, B) = P(B, A)

Conditional:

    P(A|B) = P(A, B) / P(B)
    P(B|A) = P(A, B) / P(A)

AI is all about conditional probabilities.

Mammography
Sensitivity of screening mammogram p(+|cancer) ≈ 90%

Specificity of screening mammogram p(−|no cancer) ≈ 91%

Probability that a woman age 40 has breast cancer ≈ 1%

If a previously unscreened 40 year old woman’s mammogram is positive,


what is the probability that she has breast cancer?

    P(cancer|+) = P(cancer, +) / P(+) = P(+|cancer) P(cancer) / P(+)
                ≈ (0.01 × 0.9) / (0.01 × 0.9 + 0.99 × 0.09) ≈ 0.009 / (0.009 + 0.09) ≈ 0.009 / 0.1 ≈ 9%

Message: P(A|B) ̸= P(B|A) .
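The same calculation as a quick numeric check:

```python
# Minimal sketch: checking the mammography calculation with Bayes' rule.
sens, spec, prior = 0.90, 0.91, 0.01

p_pos = sens * prior + (1 - spec) * (1 - prior)   # P(+)
p_cancer_given_pos = sens * prior / p_pos         # P(cancer | +)
print(round(p_cancer_given_pos, 3))               # ~0.092, i.e. about 9%
```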

Bayes’ rule

    p(B|A) = p(A|B) p(B) / p(A)

Rev. Thomas Bayes (1701–1761)

Prosecutor’s fallacy: Sally Clark
Two kids died with no explanation.

Sir Roy Meadow testified that the chance of this happening due to SIDS is
(1/8500)^2 ≈ (73 × 10^6)^{−1}.

Sally Clark found guilty and imprisoned.

Later verdict overturned and Meadow struck


off medical register.
Sally Clark (1964–2007)

Fallacy: P(SIDS|2 deaths) ̸= P(SIDS, 2 deaths)

P(guilty|+) = 1 − P(not guilty|+) ̸= 1 − P(+|not guilty)

Convict if ...

P(innocence) < reasonable doubt


or

P(innocence) < shadow of a doubt

Statistical estimation
The fundamental problem of
statistics

Probability:

    θ (parameter)  →  p_θ(x) (the model)  → IID sampling →  S = {X_1, X_2, ..., X_m} (the sample)

Statistics:

    S = {X_1, X_2, ..., X_m}  → estimation →  θ̂

The fundamental problem of
statistics
The problem of inferring θ from X1 , X2 , . . . , Xm is inherently ill defined:
• pθ (x) as a function of x is a probability — OK
• pθ (x) as a function of θ (called the likelihood ℓ(θ) = pθ (x) ) is not a
probability — Panic!

Two religions:
• Turn pθ (x) into a probability by putting a distribution on θ (called the
prior) → Bayesian statistics
• Just come up with a guess θ̂ for θ and then show that it is unlikely to get
a sample that will lead to a θ̂ which is far off. → Frequentist statistics

Bayesian statistics in ML

    posterior:  p(θ|X) = p(X|θ) p(θ) / p(X) = p(X|θ) p(θ) / ∫ p(X|θ) p(θ) dθ

    (likelihood: p(X|θ);  prior: p(θ);  evidence: p(X) = ∫ p(X|θ) p(θ) dθ)

Example: HMM for tracking

Assumptions are steep, but at least we are honest about them.

Frequentist statistics in ML

• Come up with an algorithm to get an estimator


• Try and justify later.

Example: perceptron

Justification itself involves frequentist statistics, because it requires


estimating the probability of error! → Error analysis of Bayesian methods
also requires frequentist statistics.

Frequentist vs. Bayesian estimators

In the more classical setting of parametric models, how do we get an actual


estimator for θ ?

Frequentist:
• Use the maximum likelihood estimator θ̂_MLE = arg max_θ ℓ(θ)
• Just one of many options
Bayesian:
• A true Bayesian always reports the full posterior p(θ|X).
• When pressed, might give the maximum a posteriori (MAP) estimator
  θ̂_MAP = arg max_θ p(θ|X)
• or the posterior mean θ̂ = ∫ θ p(θ|X) dθ.

Naive Bayes
A generative model for documents

Let xi be the number of times that word i occurs in a document D .

Generative model for docs from author A: p_A(x)

Generative model for docs from author B: p_B(x)

Given a new document with vector x′, attribute it to A iff

    p_A(x′) > p_B(x′)

What form should p take?

The multinomial model

Multinomial (Naive Bayes) model for word counts:

    p(x) = [ n! / (x_1! x_2! ... x_k!) ] θ_1^{x_1} θ_2^{x_2} ... θ_k^{x_k},    Σ_{i=1}^{k} θ_i = 1,

where n = Σ_i x_i is the total number of words in the document.

How do we find the parameters θ1 , θ2 , . . . , θk ?

Naive Bayes — the Frequentist way
    p(x) = [ n! / (x_1! x_2! ... x_k!) ] θ_1^{x_1} θ_2^{x_2} ... θ_k^{x_k},    Σ_{i=1}^{k} θ_i = 1.

MLE: maximize

    log ℓ(θ_1, θ_2, ..., θ_k) = constant + Σ_{i=1}^{k} x_i log θ_i    s.t.  Σ_{i=1}^{k} θ_i = 1.

Introduce the Lagrange multiplier λ:

    ∂/∂θ_i [ log ℓ(θ_1, ..., θ_k) − λ ( Σ_{i=1}^{k} θ_i − 1 ) ] = 0

    x_i / θ̂_i = λ    →    θ̂_i = x_i / Σ_{j=1}^{k} x_j

Naive Bayes — the Frequentist way

The MLE θ̂_i = x_i / Σ_{j=1}^{k} x_j makes perfect sense but gives p(x′) = 0
whenever D′ contains a word not seen in training!

Idea: bias the estimator:

    θ̂_i = (x_i + γ) / ( kγ + Σ_{j=1}^{k} x_j ),

where γ is a “pseudocount”.
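A minimal sketch (with made-up word counts) of the smoothed estimator and the resulting author decision; comparing log-probabilities is enough since the multinomial coefficient is the same for both authors:

```python
# Minimal sketch (illustrative): pseudocount-smoothed word probabilities and
# the resulting author decision. All counts are made up.
import numpy as np

def smoothed_theta(counts, gamma=1.0):
    counts = np.asarray(counts, dtype=float)
    return (counts + gamma) / (counts.sum() + gamma * len(counts))

counts_A = [40, 10, 0, 5]      # word counts over a 4-word vocabulary, author A
counts_B = [10, 30, 10, 5]     # author B
theta_A, theta_B = smoothed_theta(counts_A), smoothed_theta(counts_B)

x_new = np.array([3, 1, 1, 0])  # word counts of the new document
# The multinomial coefficient n!/(x_1!...x_k!) is the same for both authors,
# so it is enough to compare sum_i x_i log theta_i.
score_A = (x_new * np.log(theta_A)).sum()
score_B = (x_new * np.log(theta_B)).sum()
print("attribute to A" if score_A > score_B else "attribute to B")
```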

Naive Bayes — the Bayesian way
1. Take the prior p(θ1 , θ2 , . . . , θk ) to be

    Dirichlet(α_1, α_2, ..., α_k) = [ Γ(α_1 + ... + α_k) / (Γ(α_1) ... Γ(α_k)) ] θ_1^{α_1 − 1} ... θ_k^{α_k − 1}

where Γ is the Gamma function, obeying Γ(n) = (n − 1)! for any n ∈ N.

2. Apply Bayes’ rule

    p(θ|x) = p(x|θ) p(θ) / ∫ p(x|θ) p(θ) dθ

3. The posterior becomes

p(θ1 , θ2 , . . . , θk |X) = Dirichlet(α1 + x1 , . . . , αk + xk ).

Naive Bayes — the Bayesian way

The maximum a posteriori (MAP) estimate is given by

    θ̂_i = (x_i + α_i − 1) / Σ_{j=1}^{k} (x_j + α_j − 1)

Equivalent to the frequentist estimator with pseudocounts of α_i − 1!
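A quick numeric check of this equivalence (illustrative counts; symmetric Dirichlet assumed):

```python
# Minimal sketch: the MAP estimate under a symmetric Dirichlet(alpha) prior
# coincides with the pseudocount estimator with gamma = alpha - 1.
import numpy as np

x = np.array([40.0, 10.0, 0.0, 5.0])
alpha = 2.0                       # symmetric Dirichlet; pseudocount gamma = 1

map_estimate = (x + alpha - 1) / (x + alpha - 1).sum()
pseudocount  = (x + (alpha - 1)) / (x.sum() + (alpha - 1) * len(x))
print(np.allclose(map_estimate, pseudocount))   # True
```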

Learning the parameters of Bayes nets
Parameter Learning
Recall that the joint distribution of a Bayes net is of the form

    p(x) = ∏_{v∈V} p_v(x_v | x_{pa(v)}).

Up to now, we have assumed that each pv is fully specified, and asked


questions about the distribution of certain variables given others.

Now the question is:


• Assuming that each p_v(x_v | x_{pa(v)}) is parametrized by some set of
parameters Θ_v, given a training set x^1, x^2, ..., x^m of all the
variables together, how should we set the Θ_v parameters?

This type of task is central to using graphical models in practice.

A simple model
The simplest case is a model with just two variables, x = (x_1, x_2):

    p(x) = p(x_2 | x_1) p(x_1)        (graph: X_1 → X_2)

Assume that:
• x1 can take on k1 different values {1, 2, . . . , k1 } ,
• x2 can take on k2 different values {1, 2, . . . , k2 } .

In this case p(x_2 | x_1) is described by the matrix Θ ∈ [0, 1]^{k_1 × k_2} of parameters

    θ_{i,j} = p(X_2 = j | X_1 = i).

How do we learn this matrix from the training data {(x_1^u, x_2^u)}_{u=1}^{m}?

The Maximum Likelihood Principle
Assume that we are given
• a parametric family of distributions p_Θ(x), parametrized by some set of parameters Θ,
• a sample {x^1, x^2, ..., x^m} from p_Θ.

The likelihood of {x^1, x^2, ..., x^m} under p_Θ is

    ℓ(Θ) = ∏_{u=1}^{m} p_Θ(x^u).

The Maximum Likelihood Estimator (MLE) Θ̂ of Θ is then the setting of the parameters that maximizes this expression.

Finding (estimating) parameters this way is a general strategy in statistics, not just for Bayes nets. Often it is slightly more convenient to maximize the log-likelihood

    log ℓ(Θ) = Σ_{u=1}^{m} log p_Θ(x^u).
Maximum Likelihood

In our case, recalling that p(X_2 = j | X_1 = i) = θ_{i,j},

    ℓ(Θ) = ∏_{u=1}^{m} p(x_2^u | x_1^u) = ∏_{u=1}^{m} θ_{x_1^u, x_2^u} = ∏_{i=1}^{k_1} ∏_{j=1}^{k_2} θ_{i,j}^{N_i(j)},

where N_i(j) is the number of samples in which X_1 = i and X_2 = j.

• The likelihood only depends on how many times we have observed each
combination of (x_1, x_2) together. This makes sense.
• For each possible i, the conditional p(X_2 = j | X_1 = i) = θ_{i,j} is a
probability distribution as a function of j. → We can learn each row of Θ
separately, with the constraint that Σ_{j=1}^{k_2} θ_{i,j} = 1.

Maximum Likelihood
To learn (θ_{i,1}, θ_{i,2}, ..., θ_{i,k_2}) we only need the likelihood over those training
examples in which x_1 = i:

    ℓ_i(θ_{i,1}, ..., θ_{i,k_2}) = ∏_{j=1}^{k_2} θ_{i,j}^{N_i(j)}.

In logarithmic form, to get the MLE (θ̂_{i,1}, θ̂_{i,2}, ..., θ̂_{i,k_2}), solve

    maximize  Σ_{j=1}^{k_2} N_i(j) log θ_{i,j}    subject to  Σ_{j=1}^{k_2} θ_{i,j} = 1.

This is a constrained optimization problem. → Use Lagrange multipliers.

Zero counts
One problem with the MLE

    θ̂_{i,j} = N_i(j) / Σ_{j′=1}^{k_2} N_i(j′)

is that if one (i, j) pair never occurs in the training data, then the
corresponding θ̂_{i,j} will be zero. → The learned model will not be able to
deal with any example in which X_1 = i and X_2 = j.

Standard solution: add a small “pseudocount” γ to each N_i(j), e.g., γ = 0.1:

    θ̂_{i,j} = (N_i(j) + γ) / Σ_{j′=1}^{k_2} (N_i(j′) + γ).

This is a form of regularization. We will see other interpretations later.
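A minimal sketch (illustrative data, 0-indexed values) of estimating Θ with this pseudocount rule:

```python
# Minimal sketch (illustrative): estimating Theta from pairs (x1, x2) with a
# pseudocount gamma, as in the smoothed estimator above. Values are 0-indexed.
import numpy as np

def estimate_theta(pairs, k1, k2, gamma=0.1):
    counts = np.zeros((k1, k2))
    for x1, x2 in pairs:
        counts[x1, x2] += 1
    counts += gamma
    return counts / counts.sum(axis=1, keepdims=True)   # each row sums to 1

data = [(0, 0), (0, 1), (0, 0), (1, 1), (1, 1), (1, 0)]
print(estimate_theta(data, k1=2, k2=2))
```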

Multiple parents
What if the model is

    p(x) = p(x_{r+1} | x_1, x_2, ..., x_r)?

Now Θ is a k_1 × ... × k_r × k_{r+1} array (tensor) with

    θ_{i_1,...,i_r,j} = p(X_{r+1} = j | X_1 = i_1, ..., X_r = i_r).

However, fixing any (i_1, ..., i_r) we can solve for the corresponding θ̂_{...,j}'s
just as before and get (using pseudocounts)

    θ̂_{i_1,...,i_r,j} = (N_{i_1,...,i_r}(j) + γ) / Σ_{j′=1}^{k_{r+1}} (N_{i_1,...,i_r}(j′) + γ).

MLE for the whole Bayes Net
Recall that the joint distribution of the whole Bayes net is

    p(x) = ∏_{v∈V} p_v(x_v | x_{pa(v)}),

so we need to learn a separate Θ_v array for each of these factors.

Assuming that each of the variables is discrete, this is not so hard. For each
v, assuming that the parents of v are {p_1, ..., p_r}:
1. Form the corresponding likelihood

    ℓ(Θ_v) = ∏_{u=1}^{m} p(x_v^u | x_{p_1}^u, ..., x_{p_r}^u).

2. Use the same steps as before to find the corresponding MLE solution

    [θ̂_v]_{i_1,...,i_r,j} = (N_{i_1,...,i_r}(j) + γ) / Σ_{j′=1}^{k_{r+1}} (N_{i_1,...,i_r}(j′) + γ).
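A minimal sketch of the whole procedure (hypothetical data format: each row is a dict of variable values, and the graph is a dict of parent lists), counting and smoothing each CPT separately:

```python
# Minimal sketch (illustrative): learning every CPT of a discrete Bayes net by
# counting, with pseudocounts. Data rows and the parent structure are made up.
from collections import defaultdict

def learn_cpts(data, parents, num_values, gamma=0.1):
    cpts = {}
    for v, pa in parents.items():
        counts = defaultdict(lambda: [0.0] * num_values[v])
        for row in data:
            counts[tuple(row[p] for p in pa)][row[v]] += 1
        cpts[v] = {}
        for cfg, c in counts.items():
            smoothed = [ci + gamma for ci in c]
            total = sum(smoothed)
            cpts[v][cfg] = [s / total for s in smoothed]
    return cpts   # cpts[v][parent values] = distribution over v's values

data = [{"A": 0, "B": 1}, {"A": 0, "B": 0}, {"A": 1, "B": 1}]
print(learn_cpts(data, {"A": [], "B": ["A"]}, {"A": 2, "B": 2}))
```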

Expectation maximization

What about when some of the xi ’s are not observed? Use the EM strategy
and iterate until convergence:
1. E-step: Compute the expected log-likelihood (w.r.t. the hidden variables) under Θ̂_old:

    L_{Θ̂_old}(Θ).

2. M-step: Maximize this to get the new estimate for Θ̂:

    Θ̂ = arg max_Θ L_{Θ̂_old}(Θ).

Just like in probabilistic k –means. Whether or not this is viable for a


complicated model is not obvious.
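As an illustration of the recipe (not the general Bayes-net case), here is EM for the simplest latent-variable model, a two-component 1-D Gaussian mixture, the "probabilistic k-means" alluded to above. All numbers are synthetic.

```python
# Minimal sketch (illustrative): EM for a two-component 1-D Gaussian mixture.
import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

pi, mu, sigma = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    # E-step: posterior responsibility of component 1 for each point
    d0 = np.exp(-0.5 * ((data - mu[0]) / sigma[0]) ** 2) / sigma[0]
    d1 = np.exp(-0.5 * ((data - mu[1]) / sigma[1]) ** 2) / sigma[1]
    r1 = pi * d1 / ((1 - pi) * d0 + pi * d1)
    # M-step: maximize the expected complete-data log-likelihood
    pi = r1.mean()
    mu = np.array([((1 - r1) * data).sum() / (1 - r1).sum(),
                   (r1 * data).sum() / r1.sum()])
    sigma = np.array([
        np.sqrt(((1 - r1) * (data - mu[0]) ** 2).sum() / (1 - r1).sum()),
        np.sqrt((r1 * (data - mu[1]) ** 2).sum() / r1.sum()),
    ])

print(mu)   # should end up close to the true means -2 and 3
```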

FURTHER READING

• David Barber: Bayesian Reasoning and Machine Learning (online)


• Daphne Koller and Nir Friedman: Probabilistic Graphical Models
• Tutorial by Sam Roweis:
http://videolectures.net/mlss06tw_roweis_mlpgm/
• Coursera course “Probabilistic Graphical Models” by Daphne Koller

