Artificial Intelligence: Cognitive Agents
AI, Uncertainty & Bayesian Networks
2015-03-10 / 03-12
Kim, Byoung-Hee
Biointelligence Laboratory
Seoul National University
[Link]
A Bayesian network is a graphical model
for probabilistic relationships among a set of variables
ACM Turing Award
Often called the “Nobel Prize of Computing”
2011 Winner: Judea Pearl (UCLA)
For fundamental contributions to artificial intelligence
through the development of a calculus for probabilistic
and causal reasoning
Invention of Bayesian networks
Pearl's accomplishments have “redefined the term ‘thinking machine’” over the past 30 years
BN mimics “the neural activities of the human brain,
constantly exchanging messages without benefit of a
supervisor”
Notes
Related chapters in the textbook (AIMA, 3rd ed., by Russell and Norvig)
Ch. 13 Quantifying Uncertainty
Ch. 14 Probabilistic Reasoning (14.1~14.2)
Ch. 20 Learning Probabilistic Models (20.1~20.2)
On the reference ‘A Tutorial on Learning with Bayesian Networks’ by David Heckerman
It is quite technical but provides insight and comprehensive background on Bayesian networks
This lecture covers the ‘Introduction’ section
This lecture is an easier introductory tutorial
Both the textbook chapters and Heckerman’s tutorial are fairly mathematical
This lecture covers basic concepts and tools to understand
Bayesian networks
Contents
Bayesian Networks: Introduction
Motivating example
Decomposing a joint distribution of variables
d-separation
A mini Turing test in causal conversation
Correlation & causation
AI & Uncertainty
Bayesian Networks in Detail
d-separation: revisited & details
Probability & Bayesian
Inference in and learning of Bayesian networks
BN as AI tools and advantages
Bayesian Networks:
Introduction
Slide from K. Mohan & J. Pearl, UAI ’12 Tutorial on Graphical Models for Causal Inference
Causality, Dependency
From correlation to causality
Qualitative methods
Quantitative methods
Granger causality index
[Link]
Google's CausalImpact
R package for causal inference in time series
Official posting: [Link] (2014/09)
Introductory article (in English): [Link] (“… has open-sourced a tool for inferring cause from correlations”)
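As a concrete illustration of the quantitative route from correlation toward causality, here is a minimal numpy sketch of one common formulation of the Granger causality index (the log ratio of residual variances of a restricted vs. an augmented autoregression). This is only an illustrative sketch under that assumed formulation, not the CausalImpact package or the exact index used in the links above.

```python
import numpy as np

def granger_index(x, y, p=2):
    """Granger causality index from x to y (one common formulation):
    ln( Var[residuals of AR(p) on y alone] /
        Var[residuals of AR(p) on y plus p lags of x] ).
    Values near 0 suggest the past of x adds no predictive power for y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(y)
    Y = y[p:]
    lags_y = np.column_stack([y[p - k: n - k] for k in range(1, p + 1)])
    lags_x = np.column_stack([x[p - k: n - k] for k in range(1, p + 1)])
    ones = np.ones((n - p, 1))
    restricted = np.hstack([ones, lags_y])           # y's own past only
    augmented = np.hstack([ones, lags_y, lags_x])    # ... plus x's past
    res_r = Y - restricted @ np.linalg.lstsq(restricted, Y, rcond=None)[0]
    res_a = Y - augmented @ np.linalg.lstsq(augmented, Y, rcond=None)[0]
    return float(np.log(res_r.var() / res_a.var()))

# Toy check: y is driven by the previous value of x, so the index x->y comes out
# clearly positive while the index y->x stays near zero.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.8 * x[t - 1] + 0.1 * rng.normal()
print(granger_index(x, y), granger_index(y, x))
```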
Assuming binary states for all the variables
Ex) Season: dry or rainy
Ex) Sprinkler: ON or OFF
(Slides 9–24: figures from K. Mohan & J. Pearl, UAI ’12 Tutorial on Graphical Models for Causal Inference, presenting the sprinkler example as a directed acyclic graph)
Correlation & Causation
Correlation does not imply causation
One chart correlates the number of pirates with global temperature: the two variables are correlated, but one does not imply the other.
Another chart shows a correlation between ice cream consumption and crime, where the actual cause of both is temperature.
Example
Designing a Bayesian Network
My own design of conditional probability tables
SEASON (prior):
  P(DRY) = 0.6, P(RAINY) = 0.4

SPRINKLER | SEASON:
  SEASON:   DRY   RAINY
  P(ON)     0.8   0.1
  P(OFF)    0.2   0.9

RAIN | SEASON:
  SEASON:   DRY   RAINY
  P(YES)    0.3   0.8
  P(NO)     0.7   0.2

WET | SPRINKLER, RAIN:
  SPRINKLER:  ON          OFF
  RAIN:       YES   NO    YES   NO
  P(YES)      0.8   0.1   0.8   0.1
  P(NO)       0.2   0.9   0.2   0.9

SLIPPERY | WET:
  WET:      YES   NO
  P(YES)    0.8   0.1
  P(NO)     0.2   0.9
Example
Designing a Bayesian Network
Tool: GeNIe ([Link])
GeNIe (Graphical Network Interface) is the graphical interface to SMILE,
a fully portable Bayesian inference engine in C++
Inference based on the designed Bayesian Network
Q1
A1
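The lecture uses GeNIe/SMILE for this step; as a language-neutral illustration, the sketch below encodes the CPTs from the previous slide in plain Python and answers posterior queries by brute-force enumeration of all 2^5 assignments. The CPT numbers are read from the tables above (interpreting the SPRINKLER rows as ON/OFF); since the actual Q1/A1 of the slide are not reproduced here, the queries at the bottom are only illustrative.

```python
from itertools import product

# CPTs read from the tables on the previous slide.
p_season    = {'DRY': 0.6, 'RAINY': 0.4}
p_sprinkler = {('DRY', 'ON'): 0.8, ('DRY', 'OFF'): 0.2,              # P(SPRINKLER | SEASON)
               ('RAINY', 'ON'): 0.1, ('RAINY', 'OFF'): 0.9}
p_rain      = {('DRY', 'YES'): 0.3, ('DRY', 'NO'): 0.7,              # P(RAIN | SEASON)
               ('RAINY', 'YES'): 0.8, ('RAINY', 'NO'): 0.2}
p_wet       = {('ON', 'YES', 'YES'): 0.8, ('ON', 'YES', 'NO'): 0.2,  # P(WET | SPRINKLER, RAIN)
               ('ON', 'NO', 'YES'): 0.1, ('ON', 'NO', 'NO'): 0.9,
               ('OFF', 'YES', 'YES'): 0.8, ('OFF', 'YES', 'NO'): 0.2,
               ('OFF', 'NO', 'YES'): 0.1, ('OFF', 'NO', 'NO'): 0.9}
p_slippery  = {('YES', 'YES'): 0.8, ('YES', 'NO'): 0.2,              # P(SLIPPERY | WET)
               ('NO', 'YES'): 0.1, ('NO', 'NO'): 0.9}

STATES = {'season': ['DRY', 'RAINY'], 'sprinkler': ['ON', 'OFF'],
          'rain': ['YES', 'NO'], 'wet': ['YES', 'NO'], 'slippery': ['YES', 'NO']}

def joint(season, sprinkler, rain, wet, slippery):
    """P(all five variables) via the Bayesian-network factorization."""
    return (p_season[season]
            * p_sprinkler[(season, sprinkler)]
            * p_rain[(season, rain)]
            * p_wet[(sprinkler, rain, wet)]
            * p_slippery[(wet, slippery)])

def posterior(query_var, evidence):
    """P(query_var | evidence) by summing the joint over all assignments."""
    names = list(STATES)
    totals = {s: 0.0 for s in STATES[query_var]}
    for values in product(*(STATES[n] for n in names)):
        a = dict(zip(names, values))
        if all(a[k] == v for k, v in evidence.items()):
            totals[a[query_var]] += joint(**a)
    z = sum(totals.values())
    return {s: t / z for s, t in totals.items()}

print(posterior('rain', {'slippery': 'YES'}))     # illustrative query
print(posterior('season', {'wet': 'YES'}))        # illustrative query
```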
AI & Uncertainty
Probability
Probability plays a central role in modern pattern
recognition.
The main tool for dealing with uncertainty
All of probabilistic inference and learning amounts to repeated application of the sum rule and the product rule
Random Variables: variables + probability
Artificial Intelligence (AI)
The objective of AI is to build intelligent computers
We want intelligent, adaptive, robust behavior
(example images: a cat and a car)
Often, hand programming is not possible.
Solution? Get the computer to program itself, by
showing it examples of the behavior we want!
This is the learning approach to AI.
Artificial Intelligence (AI)
(Traditional) AI
Knowledge & reasoning; work with facts/assertions;
develop rules of logical inference
Planning: work with applicability/effects of actions;
develop searches for actions which achieve goals/avert
disasters.
Expert systems: develop by hand a set of rules for
examining inputs, updating internal states and
generating outputs
Artificial Intelligence (AI)
Probabilistic AI
emphasis on noisy measurements, approximation in hard
cases, learning, algorithmic issues.
The power of learning
Automatic system building
old expert systems needed hand coding of knowledge and
of output semantics
learning automatically constructs rules and supports all
types of queries
Probabilistic databases
traditional DB technology cannot answer queries about
items that were never loaded into the dataset
UAI models are like probabilistic databases
Uncertainty and Artificial Intelligence
(UAI)
Probabilistic methods can be used to:
make decisions given partial information about the world
account for noisy sensors or actuators
explain phenomena not part of our models
describe inherently stochastic behavior in the world
Other Names for UAI
Machine learning (ML), data mining, applied
statistics, adaptive (stochastic) signal processing,
probabilistic planning/reasoning...
Some differences:
Data mining almost always uses large data sets,
statistics almost always small ones
Data mining, planning, decision theory often have no
internal parameters to be learned
Statistics often has no algorithm to run!
ML/UAI algorithms are rarely online and rarely scale to
huge data (changing now).
Learning in AI
Learning is most useful
when the structure of the task is not well understood
but can be characterized by a dataset with strong
statistical regularity
Also useful in adaptive or dynamic situations where the task (or its parameters) is constantly changing
Currently, these are challenging topics of machine
learning and data mining research
Probabilistic AI
Let inputs=X, correct answers=Y, outputs of our
machine=Z
Learning: estimation of p(X, Y)
The central object of interest is the joint distribution
The main difficulty is compactly representing it and
robustly learning its shape given noisy samples
Probabilistic Graphical Models (PGMs)
Probabilistic graphical models represent large joint
distributions compactly using a set of “local”
relationships specified by a graph
Each random variable in our model corresponds
to a graph node.
Probabilistic Graphical Models (PGMs)
There are useful properties in using
probabilistic graphical models
A simple way to visualize the structure of a probabilistic
model
Insights into the properties of the model
Complex computations (for inference and learning) can be expressed in terms of graphical manipulations, with the underlying mathematical expressions carried along implicitly
Directed graph vs. undirected graph
Both (probabilistic) graphical models
Specify a factorization (how to express the joint distribution)
Define a set of conditional independence properties
Bayesian Networks (BN): parent–child links, local conditional distributions
Markov Random Fields (MRF): maximal cliques, potential functions
Bayesian Networks in Detail
(DAG)
Designing a Bayesian Network Model
TakeHeart II: Decision support system for clinical cardiovascular risk
assessment
Inference in a Bayesian Network Model
Given an assignment of a subset of variables (evidence) in a BN,
estimate the posterior distribution over another subset of unobserved
variables of interest.
Inference can be viewed as message passing along the network
Bayesian Networks
The joint distribution defined by the graph is the product, over all K nodes, of a conditional distribution for each node conditioned on its parents:
p(x) = Π_{k=1..K} p(x_k | Pa(x_k))      (Pa(x_k) denotes the set of parents of x_k)
ex) p(x_1, x_2, …, x_7) — see the worked factorization below
* Without a given DAG structure, the usual chain rule can be applied to obtain the joint distribution, but the computational cost is much higher.
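To make the cost remark concrete (my own illustration): for K binary variables a full joint table needs 2^K − 1 free parameters, while the factorized form needs only Σ_k 2^{|Pa(x_k)|}. The seven-variable example is expanded below assuming it refers to the standard DAG of Bishop's PRML Fig. 8.2, which these slides appear to follow; if the slide used a different graph, only the right-hand side changes.

```latex
% Assumed DAG (PRML Fig. 8.2): x1, x2, x3 are roots; x4 has parents x1, x2, x3;
% x5 has parents x1, x3; x6 has parent x4; x7 has parents x4, x5.
p(x_1,\dots,x_7) = p(x_1)\,p(x_2)\,p(x_3)\,
    p(x_4 \mid x_1, x_2, x_3)\, p(x_5 \mid x_1, x_3)\,
    p(x_6 \mid x_4)\, p(x_7 \mid x_4, x_5)
% Parameter count for binary variables:
% 1 + 1 + 1 + 2^3 + 2^2 + 2^1 + 2^2 = 21   versus   2^7 - 1 = 127 for the full joint.
```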
Bayes’ Theorem
p(Y | X) = p(X | Y) p(Y) / p(X)
(posterior = likelihood × prior / normalizing constant)
The normalizing constant: p(X) = Σ_Y p(X | Y) p(Y)
posterior ∝ likelihood × prior
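A small worked example with hypothetical numbers, just to show how the normalizing constant is assembled by the sum rule:

```latex
% Hypothetical numbers: prior P(Y=1) = 0.01, likelihoods P(X=1|Y=1) = 0.9, P(X=1|Y=0) = 0.1
P(Y=1 \mid X=1)
  = \frac{P(X=1 \mid Y=1)\,P(Y=1)}{\sum_{y} P(X=1 \mid Y=y)\,P(Y=y)}
  = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.1 \times 0.99}
  = \frac{0.009}{0.108} \approx 0.083
```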
Bayes’ Theorem
Figure from Figure 1 in (Adams et al., 2013), obtained from [Link]
Bayesian Probabilities
-Frequentist vs. Bayesian
p(w | D) = p(D | w) p(w) / p(D)        (portrait: Thomas Bayes)
Likelihood: p(D | w)
Frequentist
w: a fixed parameter determined by an ‘estimator’
Maximum likelihood: error function = −log p(D | w)
Error bars: Obtained by the distribution of possible data sets D
Bootstrap
Cross-validation
Bayesian
w: the uncertainty in the parameters is expressed through a probability distribution over w
Prior knowledge
Noninformative (uniform) prior, Laplace correction in estimating priors
Monte Carlo methods, variational Bayes, EP
(See the article ‘Where do probabilities come from?’ on page 491 of the textbook (Russell and Norvig, 2010) for more discussion)
Conditional Independence
Conditional independence simplifies both the structure of a model and
the computations
An important feature of graphical models is that conditional
independence properties of the joint distribution can be read directly
from the graph without having to perform any analytical manipulations
The general framework for this is called d-separation
Three example graphs – 1st case
None of the variables are observed
Node c is tail-to-tail
The variable c is observed
The conditioned node ‘blocks’ the path from a to b,
causes a and b to become (conditionally) independent.
Three example graphs – 2nd case
None of the variables are observed
Node c is head-to-tail
The variable c is observed
The conditioned node ‘blocks’ the path from a to b,
causes a and b to become (conditionally) independent.
Three example graphs – 3rd case
None of the variables are observed
Node c is head-to-head
The variable c is observed
When node c is unobserved,
it ‘blocks’ the path and the variables a and b are independent.
Conditioning on c ‘unblocks’ the path and renders a and b dependent.
Three example graphs - Fuel gauge example
B: battery, F: fuel tank, G: electric fuel gauge (a rather unreliable gauge)
Checking the fuel gauge: observing that the gauge reads ‘empty’ makes an empty tank more likely
Does checking the battery as well tell us anything more?
Finding the battery dead makes an empty tank less likely than when only the gauge is observed
(explaining away)
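A minimal Python check of the explaining-away effect. The CPT numbers here are assumptions borrowed from the fuel-gauge example in Bishop's PRML (which this slide appears to follow): P(B=1) = P(F=1) = 0.9 and a gauge that reads 'full' with probability 0.8, 0.2, 0.2, 0.1 for the four battery/fuel combinations.

```python
from itertools import product

# Assumed CPTs (borrowed from the PRML fuel-gauge example); 1 = OK/full, 0 = dead/empty.
p_B = {1: 0.9, 0: 0.1}                       # battery charged
p_F = {1: 0.9, 0: 0.1}                       # tank full
p_G1 = {(1, 1): 0.8, (1, 0): 0.2,            # P(G=1 | B, F): gauge reads 'full'
        (0, 1): 0.2, (0, 0): 0.1}

def prob_F_empty(evidence):
    """P(F=0 | evidence), by enumerating the 8 joint assignments."""
    num = den = 0.0
    for b, f, g in product([0, 1], repeat=3):
        pg = p_G1[(b, f)] if g == 1 else 1.0 - p_G1[(b, f)]
        p = p_B[b] * p_F[f] * pg
        if all({'B': b, 'F': f, 'G': g}[k] == v for k, v in evidence.items()):
            den += p
            if f == 0:
                num += p
    return num / den

print(prob_F_empty({}))                # prior:                    0.100
print(prob_F_empty({'G': 0}))          # gauge reads empty:       ~0.257
print(prob_F_empty({'B': 0, 'G': 0}))  # and battery found dead:  ~0.111  (explaining away)
```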
d-separation
Tail-to-tail node or head-to-tail node
The path is unblocked unless the node is observed, in which case it blocks the path.
Head-to-head node
Blocks a path if it is unobserved; but if the node, and/or at least one of its descendants, is observed, the path becomes unblocked.
d-separation?
All paths between the two (sets of) nodes are blocked.
The joint distribution then satisfies the corresponding conditional independence statement for the variables concerned.
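For readers who want to check blocked paths mechanically, here is a sketch of a d-separation test. Rather than enumerating paths, it uses the equivalent ancestral-moral-graph criterion (restrict to ancestors of the query and conditioning nodes, 'marry' co-parents, drop edge directions, delete the conditioning nodes, then test connectivity); this is my own illustrative implementation, not code from the lecture.

```python
from itertools import combinations

def ancestors(dag, nodes):
    """`nodes` together with all of their ancestors. dag: node -> list of parents."""
    result, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in result:
            result.add(n)
            stack.extend(dag.get(n, []))
    return result

def d_separated(dag, xs, ys, zs):
    """True iff every path between xs and ys is blocked given zs."""
    keep = ancestors(dag, set(xs) | set(ys) | set(zs))
    adj = {n: set() for n in keep}                 # moralized ancestral graph
    for child in keep:
        pa = [p for p in dag.get(child, []) if p in keep]
        for p in pa:                               # keep parent-child edges
            adj[child].add(p); adj[p].add(child)
        for p, q in combinations(pa, 2):           # 'marry' co-parents
            adj[p].add(q); adj[q].add(p)
    blocked, seen = set(zs), set()                 # delete zs, test reachability
    frontier = [x for x in xs if x not in blocked]
    while frontier:
        n = frontier.pop()
        if n not in seen and n not in blocked:
            seen.add(n)
            frontier.extend(adj[n] - blocked)
    return not (seen & set(ys))

# Head-to-head (collider) example: a -> c <- b
dag = {'a': [], 'b': [], 'c': ['a', 'b']}
print(d_separated(dag, {'a'}, {'b'}, set()))   # True: path blocked while c is unobserved
print(d_separated(dag, {'a'}, {'b'}, {'c'}))   # False: conditioning on c unblocks the path
```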
d-separation
(a) a is dependent on b given c
Head-to-head node e is unblocked, because a descendant c is in
the conditioning set.
Tail-to-tail node f is unblocked
(b) a is independent of b given f
Head-to-head node e is blocked
Tail-to-tail node f is blocked
d-separation
Another example of conditional independence and d-separation: i.i.d.
(independent identically distributed) data
Problem: finding posterior dist. for the mean of a univariate Gaussian dist.
Every path is blocked, and so the observations D = {x1, …, xN} are independent given μ (the mean)
(If μ is integrated out, the observations are in general no longer independent.)
d-separation
Naïve Bayes model
Key assumption: conditioned on the class z, the input variables x1, …, xD are independent.
Given inputs {x1, …, xN} with their class labels,
we can fit the naïve Bayes model to the training data using maximum
likelihood, assuming that the data are drawn independently from the model.
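A small self-contained sketch of the maximum-likelihood fit described above, for discrete inputs (with a little additive smoothing so that no count is exactly zero); the data at the bottom are made up purely for illustration.

```python
import numpy as np

def fit_naive_bayes(X, z, n_states, n_classes, alpha=1.0):
    """Fit a naive Bayes model for discrete features by counting:
    class prior P(z) and per-feature conditionals P(x_d | z),
    with additive smoothing alpha to avoid zero probabilities."""
    N, D = X.shape
    prior = np.bincount(z, minlength=n_classes).astype(float)
    prior /= prior.sum()
    cond = np.full((D, n_classes, n_states), alpha)   # cond[d, c, s] ~ P(x_d = s | z = c)
    for n in range(N):
        for d in range(D):
            cond[d, z[n], X[n, d]] += 1.0
    cond /= cond.sum(axis=2, keepdims=True)
    return prior, cond

def predict(x, prior, cond):
    """argmax_c  log P(z=c) + sum_d log P(x_d | z=c)  (the naive Bayes rule)."""
    log_post = np.log(prior) + sum(np.log(cond[d, :, x[d]]) for d in range(len(x)))
    return int(np.argmax(log_post))

# Toy usage with made-up binary data.
X = np.array([[0, 1], [0, 1], [1, 0], [1, 1]])
z = np.array([0, 0, 1, 1])
prior, cond = fit_naive_bayes(X, z, n_states=2, n_classes=2)
print(predict(np.array([0, 1]), prior, cond))   # -> 0
```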
d-separation
Markov blanket or Markov boundary
When dealing with the conditional distribution of xi , consider the
minimal set of nodes that isolates xi from the rest of the graph.
The set of nodes comprising the parents, children, and co-parents (other parents of the children) is called the Markov blanket.
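Reading the Markov blanket off a DAG is purely structural; a small sketch (using the sprinkler network from the earlier example as test data):

```python
def markov_blanket(dag, node):
    """Parents, children, and co-parents (other parents of the children) of `node`.
    dag: dict mapping each node to the list of its parents."""
    parents = set(dag.get(node, []))
    children = {c for c, pa in dag.items() if node in pa}
    co_parents = {p for c in children for p in dag[c]} - {node}
    return parents | children | co_parents

# The sprinkler network from the earlier example.
dag = {'SEASON': [], 'SPRINKLER': ['SEASON'], 'RAIN': ['SEASON'],
       'WET': ['SPRINKLER', 'RAIN'], 'SLIPPERY': ['WET']}
print(markov_blanket(dag, 'RAIN'))   # {'SEASON', 'WET', 'SPRINKLER'}
```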
Probability Distributions
Discrete variables
Beta, Bernoulli, binomial
Dirichlet, multinomial
…
Continuous variables
Normal (Gaussian)
Student-t
…
Exponential family & conjugacy
Many probability densities on x can be written in the same exponential-family form:
p(x | η) = h(x) g(η) exp{ ηᵀ u(x) }
There are conjugate families of densities: the prior and the resulting posterior have the same functional form
Beta & binomial
Dirichlet & multinomial
Normal & Normal
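As one concrete instance of this conjugacy, multiplying a beta prior by a binomial likelihood stays inside the beta family (the same update used later for parameter learning):

```latex
\mathrm{Beta}(\theta \mid a, b)\,\times\,\mathrm{Bin}(m \mid N, \theta)
  \;\propto\; \theta^{a-1}(1-\theta)^{b-1}\,\cdot\,\theta^{m}(1-\theta)^{N-m}
  \;=\; \theta^{a+m-1}(1-\theta)^{b+N-m-1}
  \;\propto\; \mathrm{Beta}(\theta \mid a+m,\; b+N-m)
```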
Inference in Graphical Models
Inference in graphical models
Given evidence (some nodes are clamped to observed values)
Wish to compute the posterior distributions of other nodes
Inference algorithms in graphical structures
Main idea: propagation of local messages
Exact inference
Sum-product algorithm, max-product algorithm, junction tree algorithm
(figure: a small graph over nodes A–E and its junction tree with cliques ABD, BCD, CDE)
Approximate inference
Loopy belief propagation + message passing schedule
Variational methods, sampling methods (Monte Carlo methods)
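A minimal numpy illustration of the 'local messages' idea on a three-node chain x1 → x2 → x3 with made-up CPTs: the marginal (or the posterior, once evidence on x3 is included) of x2 is obtained by combining one message coming from each side. Full sum-product / junction-tree implementations generalize exactly this to trees and cliques.

```python
import numpy as np

# Toy chain x1 -> x2 -> x3, all binary (CPT numbers are made up).
p_x1 = np.array([0.6, 0.4])
p_x2_given_x1 = np.array([[0.7, 0.3],    # rows: x1, columns: x2
                          [0.2, 0.8]])
p_x3_given_x2 = np.array([[0.9, 0.1],    # rows: x2, columns: x3
                          [0.4, 0.6]])

# Message into x2 from the x1 side: sum out x1.
msg_from_x1 = p_x1 @ p_x2_given_x1
# Message into x2 from the x3 side: sum out x3 (all ones when x3 is unobserved).
msg_from_x3 = p_x3_given_x2.sum(axis=1)

p_x2 = msg_from_x1 * msg_from_x3
print(p_x2 / p_x2.sum())                 # marginal p(x2)

# With evidence x3 = 1, the backward message simply selects that column.
msg_from_x3_ev = p_x3_given_x2[:, 1]
p_x2_ev = msg_from_x1 * msg_from_x3_ev
print(p_x2_ev / p_x2_ev.sum())           # posterior p(x2 | x3 = 1)
```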
Learning Parameters of
Bayesian Networks
Parameters
probabilities in the conditional probability tables (CPTs) for all the variables in the network, e.g. (entries to be estimated):
  SEASON:         P(DRY) = ?,  P(RAINY) = ?
  RAIN | SEASON:  P(YES | DRY) = ?,  P(NO | DRY) = ?,  P(YES | RAINY) = ?,  P(NO | RAINY) = ?
Learning parameters
Assuming that the structure is fixed, i.e. designed or learned.
We need data, i.e. observed instances
Estimation based on relative frequencies from data + belief
Example: coin toss. Estimation of ‘heads’ in various ways
(1) The principle of indifference: heads and tails are equally probable, P(heads) = 1/2
(2) If we tossed a coin 10,000 times and it landed heads 3373 times, we would estimate the probability of heads to be about 0.3373
Learning Parameters of
Bayesian Networks
Learning parameters (continued)
Estimation based on relative frequencies from data + belief
Example: an A-match (international) soccer game between Korea and Japan. How probable, do you think, is it that Korea would win?
A: 0.85 (a Korean), B: 0.3 (a Japanese)
(3) This probability is not a ratio, and it is not a relative frequency, because the game cannot be repeated many times under exactly the same conditions
Degree of belief or subjective probability
Usual method
Estimate the probability distribution of a variable X
based on a relative frequency and belief concerning a
relative frequency
Learning Parameters of
Bayesian Networks
Simple ‘counting’ solution (Bayesian point of view)
Parameter estimation of a single node
Assume local parameter independence
For a binary variable (for example, a coin toss)
prior: Beta distribution - Beta(a,b)
After we have observed m heads and N−m tails, the posterior is Beta(a+m, b+N−m) and P(X = heads) = (a+m) / (a+b+N)
(conjugacy of the beta and binomial distributions)
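A two-line numeric version of the counting rule above, using a hypothetical Beta(3, 3) prior together with the 3373-heads-in-10,000-tosses data from the earlier slide; with this much data the prior pseudo-counts barely move the estimate.

```python
a, b = 3, 3            # hypothetical prior pseudo-counts (our belief before seeing data)
m, N = 3373, 10000     # observed heads and total tosses (coin example above)

# Conjugate update: posterior is Beta(a + m, b + N - m); use its mean as the estimate.
p_heads = (a + m) / (a + b + N)
print(round(p_heads, 4))   # 0.3374 -- essentially the relative frequency 0.3373
```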
Learning Parameters of
Bayesian Networks
Simple ‘counting’ solution (Bayesian point of view)
For a multinomial variable (for example, a dice toss)
prior: Dirichlet distribution – Dirichlet(a1,a2, …, ad)
P(X = k) = a_k / N,  where N = Σ_k a_k
Observing state i: Dirichlet(a1,…,ai+1,…, ad)
(conjugacy of Dirichlet and Multinomial distributions)
For an entire network
We simply iterate over its nodes
In the case of incomplete data
In real data, many of the variable values may be incorrect or missing
The usual approximate solutions are given by Gibbs sampling or the EM (expectation-maximization) algorithm
Learning Parameters of
Bayesian Networks
Smoothing
Another viewpoint
Laplace smoothing (additive smoothing), given observed counts (x_1, x_2, …, x_d) for the d states of a variable X, with N = Σ_k x_k:
P(X = k) = (x_k + α) / (N + αd),   k = 1, …, d   (α = α_1 = α_2 = ⋯ = α_d)
From a Bayesian point of view, this corresponds to the expected
value of the posterior distribution, using a symmetric Dirichlet
distribution with parameter α as a prior.
Additive smoothing is commonly a component of naive Bayes
classifiers.
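The smoothing formula above in code form, as a tiny helper (the counts here are arbitrary toy numbers):

```python
import numpy as np

def additive_smoothing(counts, alpha=1.0):
    """Laplace / additive smoothing: (x_k + alpha) / (N + alpha * d)."""
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

print(additive_smoothing([0, 2, 8]))            # no state gets probability exactly zero
print(additive_smoothing([0, 2, 8], alpha=0.0)) # alpha = 0 recovers raw relative frequencies
```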
Learning the Graph Structure
Learning the graph structure itself from data requires
A space of possible structures
A measure that can be used to score each structure
From a Bayesian viewpoint
The posterior probability of each candidate structure given the data provides a score for each model
Tough points
Marginalization over latent variables => challenging computational
problem
Exploring the space of structures can also be problematic
The # of different graph structures grows exponentially with the # of
nodes
Usually we resort to heuristics
Local score based, global score based, conditional independence test based, …
Bayesian Networks as Tools for AI
Learning
Extracting and encoding knowledge from data
Knowledge is represented as probabilistic relationships among variables
Causal relationship
Network of variables
Common framework for machine learning models
Supervised and unsupervised learning
Knowledge Representation & Reasoning
Bayesian networks can be constructed from prior knowledge alone
Constructed model can be used for reasoning based on probabilistic inference methods
Expert System
Uncertain expert knowledge can be encoded into a Bayesian network
The DAG in a Bayesian network is hand-constructed by domain experts
Then the conditional probabilities are assessed by the expert, learned from data, or obtained using a combination of both techniques.
Bayesian network-based expert systems are popular
Planning
In a somewhat different form, known as decision graphs or influence diagrams
We do not cover this direction here
Advantages of Bayesian Networks for
Data Analysis
Ability to handle missing data
Because the model encodes dependencies among all variables
Learning causal relationships
Can be used to gain understanding about a problem domain
Can be used to predict the consequences of intervention
Having both causal and probabilistic semantics
It is an ideal representation for combining prior knowledge (which often comes in causal form) and data
Efficient and principled approach for avoiding the overfitting of data
By Bayesian statistical methods in conjunction with Bayesian networks
(summary from the abstract of D. Heckerman’s Tutorial on BN)
(Read ‘Introduction’ section for detailed explanations)
References
K. Mohan & J. Pearl, UAI ’12 Tutorial on Graphical Models for Causal
Inference
S. Roweis, MLSS ’05 Lecture on Probabilistic Graphical Models
Chapter 1, Chapter 2, Chapter 8 (Graphical Models), in Pattern
Recognition and Machine Learning by C.M. Bishop, 2006.
David Heckerman, A Tutorial on Learning with Bayesian Networks.
R.E. Neapolitan, Learning Bayesian Networks, Pearson Prentice Hall,
2004.
More Textbooks and Courses
[Link] :
Probabilistic Graphical Models by D. Koller
APPENDIX
Learning Parameters of
Bayesian Networks
F: the probability distribution of F represents our belief concerning the relative frequency with which X equals k.
Case 1: X is a binary variable
F: beta distribution, X: Bernoulli or binomial distribution
Ex) F ~ Beta(a, b), then P(X = 1) = a / N  (N = a + b)
Case 2: X is a multinomial variable
F: Dirichlet distribution, X: multinomial distribution
Ex) F ~ Dirichlet(a_1, a_2, …, a_d), then P(X = k) = a_k / N  (N = Σ_k a_k)
Laplace smoothing (additive smoothing), given observed frequencies (x_1, x_2, …, x_d) of X, with N = Σ_k x_k:
P(X = k) = (x_k + α) / (N + αd),   k = 1, …, d   (α = α_1 = α_2 = ⋯ = α_d)
Graphical interpretation of
Bayes’ theorem
Given structure: p(x, y) = p(x) p(y | x)   (a)
We observe the value of y   (b)
Goal: infer the posterior distribution over x, p(x | y)
The marginal distribution p(x) is a prior over the latent variable x
We can evaluate the marginal distribution p(y) = Σ_{x'} p(y | x') p(x')
By Bayes' theorem we can then calculate p(x | y) = p(y | x) p(x) / p(y)   (c)
d-separation
Directed factorization
Filtering: can a given distribution p(x) be expressed in terms of the factorization implied by the graph?
If we present to the filter the set of all possible distributions p(x) over the set
of variables X, then the subset of distributions that are passed by the filter
will be denoted DF (Directed Factorization)
Fully connected graph: The set DF will contain all possible distributions
Fully disconnected graph: The joint distributions which factorize into the
product of the marginal distributions over the variables only.
Gaussian distribution
N(x | μ, σ²) = (2πσ²)^{−1/2} exp{ −(x − μ)² / (2σ²) }
Multivariate Gaussian
N(x | μ, Σ) = (2π)^{−D/2} |Σ|^{−1/2} exp{ −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) }