BAYESIAN NETWORKS
Bayesian Networks
• A Bayesian network specifies a joint distribution in a structured form
• Represent dependence/independence via a directed graph
– Nodes = random variables
– Edges = direct dependence
• Structure of the graph => conditional independence relations
In general,
p(X1, X2, ..., XN) = Πi p(Xi | parents(Xi))
(on the left, the full joint distribution; on the right, the graph-structured approximation)
• Requires that graph is acyclic (no directed cycles)
• 2 components to a Bayesian network
– The graph structure (conditional independence assumptions)
– The numerical probabilities (for each variable given its parents)
Example of a simple Bayesian network
(Graph: A → C ← B)
p(A,B,C) = p(C|A,B)p(A)p(B)
• Probability model has simple factored form
• Directed edges => direct dependence
• Absence of an edge => conditional independence
• Also known as belief networks, graphical models, causal networks
• Other formulations, e.g., undirected graphical models
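To make this factored form concrete, here is a minimal Python sketch of the three-node network above (the CPT numbers are illustrative assumptions, not from the slides): the structure is a parent list per node, the numbers are one table per node, and any joint probability is the product of one factor per node.

parents = {
    "A": [],          # no parents
    "B": [],
    "C": ["A", "B"],  # C depends directly on A and B
}

cpt = {
    "A": {(): 0.3},   # P(A=T)
    "B": {(): 0.6},   # P(B=T)
    "C": {            # P(C=T | A, B)
        (True, True): 0.9,  (True, False): 0.7,
        (False, True): 0.5, (False, False): 0.1,
    },
}

def joint(assignment):
    # p(A,B,C) = p(C|A,B) p(A) p(B): one factor per node
    p = 1.0
    for var, pars in parents.items():
        p_true = cpt[var][tuple(assignment[q] for q in pars)]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p

print(joint({"A": True, "B": False, "C": True}))  # 0.3 * 0.4 * 0.7 = 0.084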
Examples of 3-way Bayesian Networks
(Graph: A, B, C with no edges between them)
Marginal independence:
p(A,B,C) = p(A) p(B) p(C)
Examples of 3-way Bayesian Networks
(Graph: B ← A → C)
Conditionally independent effects:
p(A,B,C) = p(B|A) p(C|A) p(A)
B and C are conditionally independent given A
e.g., A is a disease, and we model B and C as
conditionally independent symptoms given A
Examples of 3-way Bayesian Networks
(Graph: A → C ← B)
Independent causes:
p(A,B,C) = p(C|A,B) p(A) p(B)
“Explaining away” effect:
given C, observing A makes B less likely
e.g., the earthquake/burglary/alarm example
A and B are (marginally) independent,
but become dependent once C is known
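The explaining-away effect can be checked numerically. Below is a small sketch with made-up CPT numbers (ours, not from the slides): observing C raises the probability of B, but additionally observing A drives it back toward the prior.

from itertools import product

pA, pB = 0.1, 0.1                                  # marginally independent causes
pC = {(True, True): 0.95, (True, False): 0.9,      # P(C=T | A, B)
      (False, True): 0.9,  (False, False): 0.01}

def joint(a, b, c):
    pc = pC[(a, b)]
    return ((pA if a else 1 - pA) * (pB if b else 1 - pB)
            * (pc if c else 1 - pc))

def prob_b(**evidence):
    # P(B=T | evidence) by brute-force summation over the joint
    worlds = [dict(a=a, b=b, c=c) for a, b, c in product([True, False], repeat=3)]
    match = [w for w in worlds if all(w[k] == v for k, v in evidence.items())]
    den = sum(joint(w["a"], w["b"], w["c"]) for w in match)
    num = sum(joint(w["a"], w["b"], w["c"]) for w in match if w["b"])
    return num / den

print(prob_b(c=True))          # ~0.50: observing C raises belief in B
print(prob_b(c=True, a=True))  # ~0.10: observing A as well explains C away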
Examples of 3-way Bayesian Networks
(Graph: A → B → C)
Markov dependence:
p(A,B,C) = p(C|B) p(B|A) p(A)
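A quick numerical check of the Markov property, again with made-up numbers: P(C | A, B) does not depend on A, so C is conditionally independent of A given B.

pA = 0.3
pB = {True: 0.8, False: 0.2}    # P(B=T | A)
pC = {True: 0.6, False: 0.1}    # P(C=T | B)

def joint(a, b, c):
    return ((pA if a else 1 - pA) * (pB[a] if b else 1 - pB[a])
            * (pC[b] if c else 1 - pC[b]))

# P(C=T | A=a, B=T) for both values of a
for a in (True, False):
    num = joint(a, True, True)
    den = num + joint(a, True, False)
    print(a, num / den)         # both lines print 0.6 = P(C=T | B=T)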
Example
• Consider the following 5 binary variables:
– B = a burglary occurs at your house
– E = an earthquake occurs at your house
– A = the alarm goes off
– J = John calls to report the alarm
– M = Mary calls to report the alarm
– What is P(B | M, J) ? (for example)
– We can use the full joint distribution to answer this question
• Requires 2^5 = 32 probabilities
• Can we use prior domain knowledge to come up with a Bayesian
network that requires fewer probabilities?
Constructing a Bayesian Network: Step 1
• Order the variables in terms of causality (may be a partial order)
e.g., {E, B} -> {A} -> {J, M}
• P(J, M, A, E, B) = P(J, M | A, E, B) P(A | E, B) P(E, B)
≈ P(J, M | A) P(A | E, B) P(E) P(B)
≈ P(J | A) P(M | A) P(A | E, B) P(E) P(B)
These conditional independence (CI) assumptions are reflected in the
graph structure of the Bayesian network
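The slides give no CPT values; as a hedged illustration, the sketch below fills in the values commonly quoted for this alarm example in Russell & Norvig and answers the earlier query P(B | M, J) by brute-force summation over the factored joint.

from itertools import product

P_B, P_E = 0.001, 0.002                              # priors P(B=T), P(E=T)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=T | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(J=T | A)
P_M = {True: 0.70, False: 0.01}                      # P(M=T | A)

def pr(p, x):                 # P(X = x) given P(X = True) = p
    return p if x else 1 - p

def joint(b, e, a, j, m):     # P(J|A) P(M|A) P(A|E,B) P(E) P(B)
    return (pr(P_B, b) * pr(P_E, e) * pr(P_A[(b, e)], a)
            * pr(P_J[a], j) * pr(P_M[a], m))

# P(B=T | J=T, M=T) = P(B,J,M) / P(J,M), summing out E and A
num = sum(joint(True, e, a, True, True) for e, a in product([True, False], repeat=2))
den = sum(joint(b, e, a, True, True) for b, e, a in product([True, False], repeat=3))
print(num / den)              # ~0.284 with these numbers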
The Resulting Bayesian Network
(Graph: E → A ← B; A → J, A → M)
Constructing this Bayesian Network: Step 2
• P(J, M, A, E, B) =
P(J | A) P(M | A) P(A | E, B) P(E) P(B)
• There are 3 conditional probability tables (CPTs) to be determined:
P(J | A), P(M | A), P(A | E, B)
– Requiring 2 + 2 + 4 = 8 probabilities
• And 2 marginal probabilities P(E), P(B) -> 2 more probabilities
• Where do these probabilities come from?
– Expert knowledge
– From data (relative frequency estimates)
– Or a combination of both – see discussion in Sections 20.1 and 20.2 (optional)
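A minimal sketch of the "from data" option: estimating P(A | E, B) by relative frequencies over complete records (the records below are fabricated for illustration; in practice one would usually smooth these counts).

from collections import Counter

# Each record is a (b, e, a) triple of booleans; counts are made up
records = ([(False, False, False)] * 950 + [(False, False, True)] * 2
           + [(True, False, True)] * 45 + [(True, False, False)] * 3)

counts, trues = Counter(), Counter()
for b, e, a in records:
    counts[(b, e)] += 1       # how often each (B, E) configuration occurs
    trues[(b, e)] += a        # how often A was true in that configuration

cpt_A = {k: trues[k] / counts[k] for k in counts}
print(cpt_A)                  # e.g. P(A=T | B=T, E=F) = 45/48 ≈ 0.94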
The Bayesian Network
(figure: the network above, annotated with its conditional probability tables)
Number of Probabilities in Bayesian Networks
• Consider n binary variables
• Unconstrained joint distribution requires O(2^n) probabilities
• If we have a Bayesian network, with a maximum of k parents for
any node, then we need O(n 2^k) probabilities
• Example
– Full unconstrained joint distribution
• n = 30: need 2^30 ≈ 10^9 probabilities for the full joint distribution
– Bayesian network
• n = 30, k = 4: need 30 × 2^4 = 480 probabilities
The Bayesian Network from a Different Variable Ordering
(figures omitted: a different ordering of the variables can yield a different,
and typically less compact, network)
Given a graph, can we “read off” conditional independencies?
A node is conditionally independent of all other nodes in the network
given its Markov blanket: its parents, its children,
and its children’s other parents
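A small helper (our own sketch) reads a node's Markov blanket directly off the parent lists:

def markov_blanket(node, parents):
    # blanket = parents + children + the children's other parents
    children = [v for v, ps in parents.items() if node in ps]
    blanket = set(parents[node]) | set(children)
    for c in children:
        blanket |= set(parents[c])
    blanket.discard(node)
    return blanket

# The alarm network: B and E are parents of A; A is parent of J and M
parents = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}
print(markov_blanket("A", parents))   # {'B', 'E', 'J', 'M'}
print(markov_blanket("B", parents))   # {'A', 'E'}: E is A's other parent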
Inference (Reasoning) in Bayesian Networks
• Consider answering a query in a Bayesian Network
– Q = set of query variables
– e = evidence (set of instantiated variable-value pairs)
– Inference = computation of conditional distribution P(Q | e)
• Examples
– P(burglary | alarm)
– P(earthquake | JCalls, MCalls)
– P(JCalls, MCalls | burglary, earthquake)
• Can we use the structure of the Bayesian Network
to answer such queries efficiently? Answer = yes
– Generally speaking, the sparser the graph, the more efficient the inference
Example: Tree-Structured Bayesian Network
(Graph: D at the root; D → B, D → E; B → A, B → C; E → F, E → G)
p(a, b, c, d, e, f, g) is modeled as p(a|b)p(c|b)p(f|e)p(g|e)p(b|d)p(e|d)p(d)
Example
(Same tree; the evidence nodes c and g are observed, shown in lowercase)
Say we want to compute p(a | c, g)
Example
Direct calculation: p(a|c,g) = Σbdef p(a,b,d,e,f | c,g)
Complexity of the sum is O(m^4), where m is the arity of each variable
Example
Reordering:
Σb p(a|b) Σd p(b|d,c) Σe p(d|e) Σf p(e,f|g)
Example
Reordering:
Σb p(a|b) Σd p(b|d,c) Σe p(d|e) Σf p(e,f|g)
The innermost sum gives: Σf p(e,f|g) = p(e|g)
Example
Reordering:
Σb p(a|b) Σd p(b|d,c) Σe p(d|e) p(e|g)
The next sum gives: Σe p(d|e) p(e|g) = p(d|g)
Example
Reordering:
Σb p(a|b) Σd p(b|d,c) p(d|g)
The next sum gives: Σd p(b|d,c) p(d|g) = p(b|c,g)
Example
Reordering:
Σb p(a|b) p(b|c,g) = p(a|c,g)
Complexity is O(m), compared to O(m^4) for the direct calculation
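This speedup can be checked end to end. The sketch below rebuilds the same tree with randomly generated binary CPTs (our own code; it works with the equivalent unnormalized sums and divides by p(c,g) at the end, rather than the conditional forms written on the slides), and confirms the reordered computation matches the direct one.

import random
from itertools import product

random.seed(0)
V = [0, 1]

def rand_cpt(n_par):
    # a table of P(X=1 | parent values) for every parent combination
    return {k: random.random() for k in product(V, repeat=n_par)}

pD = rand_cpt(0); pB = rand_cpt(1); pE = rand_cpt(1)
pA = rand_cpt(1); pC = rand_cpt(1); pF = rand_cpt(1); pG = rand_cpt(1)

def pr(table, x, *pars):
    p1 = table[pars]
    return p1 if x == 1 else 1 - p1

def joint(a, b, c, d, e, f, g):
    # p(d) p(b|d) p(e|d) p(a|b) p(c|b) p(f|e) p(g|e)
    return (pr(pD, d) * pr(pB, b, d) * pr(pE, e, d) * pr(pA, a, b)
            * pr(pC, c, b) * pr(pF, f, e) * pr(pG, g, e))

c_obs, g_obs = 1, 1

def brute(a):
    # direct summation over b, d, e, f: O(m^4) terms
    return sum(joint(a, b, c_obs, d, e, f, g_obs)
               for b, d, e, f in product(V, repeat=4))

def reordered(a):
    # Σb p(a|b)p(c|b) Σd p(b|d)p(d) Σe p(e|d)p(g|e) Σf p(f|e)
    total = 0.0
    for b in V:
        inner_d = 0.0
        for d in V:
            inner_e = sum(pr(pE, e, d) * pr(pG, g_obs, e)
                          * sum(pr(pF, f, e) for f in V)
                          for e in V)
            inner_d += pr(pB, b, d) * pr(pD, d) * inner_e
        total += pr(pA, a, b) * pr(pC, c_obs, b) * inner_d
    return total

z = sum(brute(a) for a in V)
print([brute(a) / z for a in V])      # p(a|c,g) by direct summation
print([reordered(a) / z for a in V])  # identical values, cheaper sum order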
General Strategy for inference
• Want to compute P(q | e)
Step 1:
P(q | e) = P(q,e)/P(e) = α P(q,e), since P(e) is constant with respect to Q
Step 2:
P(q,e) = Σa..z P(q, e, a, b, …, z), by the law of total probability
Step 3:
Σa..z P(q, e, a, b, …, z) = Σa..z Πi P(variable i | parents of variable i)
(using the Bayesian network factoring)
Step 4:
Distribute summations across product terms for efficient computation
Complexity of Bayesian Network Inference
• Assume the network is a polytree
– At most one (undirected) path between any 2 nodes
• Complexity scales as O(n m^(K+1))
• n = number of variables
• m = arity of variables
• K = maximum number of parents for any node
– Compare to O(m^(n-1)) for the brute-force method
• Network is not a polytree?
– Can cluster variables to render the new graph a tree
– Very similar to the tree-decomposition methods used for constraint
satisfaction problems (CSPs)
– Complexity is O(n m^(W+1)), where W = number of variables in the largest cluster
Real-valued Variables
• Can Bayesian networks handle real-valued variables?
– If we can assume the variables are Gaussian, then the inference and theory
for Bayesian networks are well-developed
• E.g., the conditionals of a joint Gaussian are still Gaussian, etc.
• In inference we replace sums with integrals
– For other density functions it depends…
• Can often include a univariate variable at the “edge” of a graph, e.g., a Poisson
conditioned on day of week
– But for many variables little is known beyond their univariate
properties, e.g., what would be the joint distribution of a Poisson and a
Gaussian? (it is not defined)
– Common approaches in practice
• Put real-valued variables at “leaf nodes” (so nothing is conditioned on them)
• Assume real-valued variables are Gaussian or discrete
• Discretize real-valued variables
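A sketch of the discretization approach (the variable and the bin choices here are assumptions for illustration): replace a real-valued measurement with the index of the quantile bin it falls in, so it can enter a discrete network.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=37.0, scale=0.8, size=1000)   # e.g. a temperature reading

edges = np.quantile(x, [0.25, 0.5, 0.75])        # equal-frequency bin boundaries
labels = np.digitize(x, edges)                   # discrete symbol 0..3 per sample
print(np.bincount(labels) / len(x))              # roughly 0.25 each, by design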
Other Aspects of Bayesian Network Inference
• The problem of finding an optimal (for inference) ordering and/or
clustering of variables for an arbitrary graph is NP-hard
– Various heuristics are used in practice
– Efficient algorithms and software now exist for working with large
Bayesian networks
• E.g., work in Professor Rina Dechter’s group
• Other types of queries?
– E.g., finding the most likely values of a variable given evidence
– arg maxq P(Q = q | e) = “most probable explanation”
or maximum a posteriori (MAP) query
– Can also leverage the graph structure in the same manner as for
inference – essentially replaces the “sum” operator with “max”
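On a small network the MPE query can be written directly as an arg max over the unobserved variables; the sketch below uses a made-up three-node chain A → B → C. A structured version would reuse the elimination machinery above with each Σ replaced by a max.

from itertools import product

pA = 0.2
pB = {True: 0.9, False: 0.1}    # P(B=T | A)
pC = {True: 0.7, False: 0.2}    # P(C=T | B)

def joint(a, b, c):
    return ((pA if a else 1 - pA) * (pB[a] if b else 1 - pB[a])
            * (pC[b] if c else 1 - pC[b]))

# Most probable explanation: arg max over (A, B) of P(A, B, C=evidence)
evidence_c = True
best = max(product([True, False], repeat=2),
           key=lambda ab: joint(ab[0], ab[1], evidence_c))
print(best)                     # the most likely (A, B) given C = True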
Naïve Bayes Model
(Graph: the class variable C points to each feature Y1, Y2, …, Yn)
P(C | Y1, …, Yn) = α Πi P(Yi | C) P(C)
Features Y are conditionally independent given the class variable C
Widely used in machine learning
e.g., spam email classification: Y’s = counts of words in emails
Conditional probabilities P(Yi | C) can easily be estimated from labeled data
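A minimal naive Bayes sketch along these lines (the toy documents and add-one smoothing are our own assumptions): each P(Yi | C) is a smoothed relative word frequency per class, and classification compares the two log scores.

import math
from collections import Counter

spam = ["win money now", "free money offer", "win offer now"]
ham  = ["meeting at noon", "project meeting notes", "lunch at noon"]

def fit(docs):
    # word counts and total word count for one class
    counts = Counter(w for d in docs for w in d.split())
    return counts, sum(counts.values())

spam_counts, spam_total = fit(spam)
ham_counts, ham_total = fit(ham)
vocab = set(spam_counts) | set(ham_counts)

def log_score(doc, counts, total, prior):
    # log P(C) + sum of log P(word | C), with add-one (Laplace) smoothing
    s = math.log(prior)
    for w in doc.split():
        s += math.log((counts[w] + 1) / (total + len(vocab)))
    return s

doc = "free money meeting"
print(log_score(doc, spam_counts, spam_total, 0.5))  # spam score (higher wins)
print(log_score(doc, ham_counts, ham_total, 0.5))    # ham score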
APPLICATIONS OF BAYESIAN NETWORKS
• Bayesian networks are used for modelling knowledge in computational
biology and bioinformatics (gene regulatory networks, protein structure,
gene expression analysis)
• Sports betting, learning epistasis from GWAS data sets
• Medicine
• Bio-monitoring
• Document classification, information retrieval
• Semantic search
• Image processing, data fusion, decision support systems
• Engineering, gaming, law, and risk analysis.
• There are texts applying Bayesian networks to bioinformatics, and to
financial and marketing informatics.
Summary
• Bayesian networks represent a joint distribution using a
graph
• The graph encodes a set of conditional independence
assumptions
• Answering queries (or inference or reasoning) in a Bayesian
network amounts to efficient computation of appropriate
conditional probabilities
• Probabilistic inference is intractable in the general case
– But can be carried out in linear time for certain classes of
Bayesian networks