Mod3 Notes

BAYESIAN NETWORKS

1. Syntax: How the Network is Structured


The syntax refers to the graphical structure of the Bayesian network: the nodes (variables),
arrows (directed edges), and the conditional probability tables (CPTs) associated with each node.
Example: Consider a simple Bayesian network with 3 binary variables:
• A: "Is it raining?" (Yes/No)
• B: "Is the grass wet?" (Yes/No)
• C: "Is the sidewalk slippery?" (Yes/No)
The graph structure is:
A → B → C

This means:
• A has no parents.
• B's parent is A.
• C's parent is B.
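
As a tiny illustrative sketch (not part of the notes), the structure alone can be recorded in code as a mapping from each node to its list of parents; the dictionary name is a hypothetical choice:

```python
# Minimal representation of the A -> B -> C chain: node -> list of parents.
structure = {
    "A": [],      # "Is it raining?" -- a root node with no parents
    "B": ["A"],   # "Is the grass wet?" -- depends on A
    "C": ["B"],   # "Is the sidewalk slippery?" -- depends on B
}
```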

2. Semantics: What the Numbers Mean


The semantics define the meaning of the numbers in the CPTs in terms of real-world
probabilities.
CPTs for the Example:
1. P(A): Prior probability of raining.

o P(A = Yes) = 0.2 (it rains 20% of the time).


o P(A = No) = 0.8.
2. P(B | A): Probability of grass being wet given rain.

o If it rains (A = Yes), the grass is likely wet:


▪ P(B = Yes | A = Yes) = 0.9.
▪ P(B = No | A = Yes) = 0.1.
o If it doesn’t rain (A = No), the grass might still be wet (e.g., from sprinklers):
▪ P(B = Yes | A = No) = 0.2.
▪ P(B = No | A = No) = 0.8.
3. P(C | B): Probability of sidewalk being slippery given grass is wet.

o If grass is wet (B = Yes), sidewalk is likely slippery:


▪ P(C = Yes | B = Yes) = 0.7.
▪ P(C = No | B = Yes) = 0.3.
o If grass is not wet (B = No), sidewalk is unlikely slippery:
▪ P(C = Yes | B = No) = 0.1.
▪ P(C = No | B = No) = 0.9.

3. Joint Distribution
The joint distribution is the probability of all variables taking on specific values simultaneously.
For a Bayesian network, it is computed as:
𝑃(𝐴, 𝐵, 𝐶) = 𝑃(𝐴) ⋅ 𝑃(𝐵 ∣ 𝐴) ⋅ 𝑃(𝐶 ∣ 𝐵)
Example Calculation: Compute the probability of:
• It is raining (A = Yes),
• The grass is wet (B = Yes),
• The sidewalk is slippery (C = Yes).
Using the CPTs:
𝑃(𝐴 = 𝑌, 𝐵 = 𝑌, 𝐶 = 𝑌) = 𝑃(𝐴 = 𝑌) ⋅ 𝑃(𝐵 = 𝑌 ∣ 𝐴 = 𝑌) ⋅ 𝑃(𝐶 = 𝑌 ∣ 𝐵 = 𝑌) = 0.2 ⋅ 0.9 ⋅ 0.7
= 0.126.
So, there’s a 12.6% chance of this specific scenario happening.
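
As a minimal sketch (illustrative code, not part of the notes), the same chain-rule calculation in Python, with each CPT stored as a small dictionary keyed by the parent's value:

```python
# CPTs of the A -> B -> C rain example, keyed by the parent's value.
P_A = {True: 0.2, False: 0.8}          # P(Rain)
P_B_given_A = {True: 0.9, False: 0.2}  # P(GrassWet = Yes | Rain)
P_C_given_B = {True: 0.7, False: 0.1}  # P(Slippery = Yes | GrassWet)

# Joint probability of one complete assignment (all three "Yes"):
p = P_A[True] * P_B_given_A[True] * P_C_given_B[True]
print(p)  # ~0.126 -> the 12.6% figure above
```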

4. Why Bayesian Networks?


Storing the full joint distribution explicitly for all combinations of variables would require
exponential space. For our 3 binary variables, there are 2³ = 8 possible combinations, but for
larger networks, this becomes intractable.
The Bayesian network factorizes the joint distribution using the graph structure, allowing
compact representation and efficient inference. Here, we only store:
• 1 parameter for 𝑃(𝐴) (since it’s binary, one number suffices: 𝑃(𝐴 = 𝑌) = 0.2).
• 2 parameters for 𝑃(𝐵 ∣ 𝐴) (since 𝑃(𝐵 ∣ 𝐴) has 2 entries for 𝐴 = 𝑌 and 𝐴 = 𝑁, and the
other is determined as probabilities sum to 1).
• 2 parameters for 𝑃(𝐶 ∣ 𝐵).
Total: 5 parameters instead of 7 (for a full joint distribution over 3 binary variables, we’d need
2³ − 1 = 7 parameters). The savings grow dramatically with more variables.
The joint distribution defined by the Bayesian network also lets us derive conditional
probabilities (like 𝑃(𝐵 ∣ 𝐴)) and shows that the 𝜃-values (parameters) in the network are
themselves conditional probabilities. We’ll use the Alarm Network example to illustrate this.

Key Concepts Recap:


Joint Distribution Factorization:
In a Bayesian network, the joint distribution is factorized as:
𝑃(𝑥1, 𝑥2, . . . , 𝑥𝑛) = ∏ᵢ₌₁ⁿ 𝑃(𝑥𝑖 ∣ Parents(𝑋𝑖)).

This means the probability of any complete scenario is the product of conditional
probabilities of each variable given its parents.

𝜃-values are Conditional Probabilities:


The 𝜃-values in the network (e.g., 𝜃(𝑏 ∣ 𝑎)) are the conditional probabilities 𝑃(𝐵 ∣ 𝐴)
stored in the Conditional Probability Tables (CPTs). These are the building blocks of
the joint distribution.

Deriving Conditional Probabilities:


Once the joint distribution is defined, we can derive other conditional probabilities
(like 𝑃(𝐵 ∣ 𝐴)) using standard probability rules, such as:

𝑃(𝐵 ∣ 𝐴) = 𝑃(𝐵, 𝐴) / 𝑃(𝐴).
Here, 𝑃(𝐵, 𝐴) and 𝑃(𝐴) are computed by summing over the joint distribution.
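
As a minimal sketch (illustrative, reusing the rain example above), deriving 𝑃(𝐵 ∣ 𝐴) by summing over the joint recovers exactly the entry stored in the CPT:

```python
# Rain-example CPTs (same numbers as earlier).
P_A = {True: 0.2, False: 0.8}   # P(A)
P_B = {True: 0.9, False: 0.2}   # P(B = Yes | A)
P_C = {True: 0.7, False: 0.1}   # P(C = Yes | B)

def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(A) * P(B | A) * P(C | B)."""
    pb = P_B[a] if b else 1 - P_B[a]
    pc = P_C[b] if c else 1 - P_C[b]
    return P_A[a] * pb * pc

# P(B=Yes, A=Yes) and P(A=Yes), each obtained by marginalizing the joint:
p_ba = sum(joint(True, True, c) for c in (True, False))
p_a = sum(joint(True, b, c) for b in (True, False) for c in (True, False))
print(p_ba / p_a)  # 0.9 -- matches the CPT entry P(B = Yes | A = Yes)
```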

The Alarm Network Example:


Consider the classic Alarm Network with the following variables and structure:
Burglary (B)        Earthquake (E)
           \          /
            Alarm (A)
           /          \
John Calls (J)      Mary Calls (M)

1. Graph Structure and CPTs:


The Bayesian network encodes the following dependencies:
• Burglary (B) and Earthquake (E) are independent root nodes (no parents).
• Alarm (A) depends on both B and E.
• John Calls (J) and Mary Calls (M) depend only on Alarm (A).
CPTs (Conditional Probability Tables):
• 𝑃(𝐵): Prior probability of a burglary.
o 𝑃(𝐵 = True) = 0.001, 𝑃(𝐵 = False) = 0.999.
• 𝑃(𝐸): Prior probability of an earthquake.
o 𝑃(𝐸 = True) = 0.002, 𝑃(𝐸 = False) = 0.998.
• 𝑃(𝐴 ∣ 𝐵, 𝐸): Alarm's response to burglary/earthquake.
o E.g., 𝑃(𝐴 = True ∣ 𝐵 = True, 𝐸 = True) = 0.95 (the alarm is very likely to go off if
both happen).
• 𝑃(𝐽 ∣ 𝐴): Probability John calls given the alarm.
o E.g., 𝑃(𝐽 = True ∣ 𝐴 = True) = 0.9 (John is reliable).
• 𝑃(𝑀 ∣ 𝐴): Probability Mary calls given the alarm.
o E.g., 𝑃(𝑀 = True ∣ 𝐴 = True) = 0.7 (Mary is less reliable).
2. Joint Distribution Factorization:
The joint distribution for all variables is:
𝑃(𝐵, 𝐸, 𝐴, 𝐽, 𝑀) = 𝑃(𝐵) ⋅ 𝑃(𝐸) ⋅ 𝑃(𝐴 ∣ 𝐵, 𝐸) ⋅ 𝑃(𝐽 ∣ 𝐴) ⋅ 𝑃(𝑀 ∣ 𝐴).
This factorization exploits the independence assumptions in the graph (e.g., 𝐽 and 𝑀 are
independent given 𝐴).
3. Example Calculation:
Compute the probability of a specific scenario:
• Burglary occurs (𝐵 = True),
• No earthquake (𝐸 = False),
• Alarm goes off (𝐴 = True),
• John calls (𝐽 = True),
• Mary doesn't call (𝑀 = False).
Using the CPTs:
𝑃(𝐵 = 𝑇, 𝐸 = 𝐹, 𝐴 = 𝑇, 𝐽 = 𝑇, 𝑀 = 𝐹)
= 𝑃(𝐵 = 𝑇) ⋅ 𝑃(𝐸 = 𝐹) ⋅ 𝑃(𝐴 = 𝑇 ∣ 𝐵 = 𝑇, 𝐸 = 𝐹) ⋅ 𝑃(𝐽 = 𝑇 ∣ 𝐴 = 𝑇)
⋅ 𝑃(𝑀 = 𝐹 ∣ 𝐴 = 𝑇).
Plugging in values (hypothetical CPT numbers for illustration):
= 0.001 ⋅ 0.998 ⋅ 0.94 ⋅ 0.9 ⋅ 0.3 ≈ 0.000253.
So, this specific scenario has a ~0.025% chance of occurring.
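
As a quick sketch (illustrative only), the same product evaluated in Python:

```python
# One entry of the alarm-network joint:
# P(B=T, E=F, A=T, J=T, M=F) = P(B=T) P(E=F) P(A=T|B=T,E=F) P(J=T|A=T) P(M=F|A=T)
p = (0.001      # P(B = True)
     * 0.998    # P(E = False)
     * 0.94     # P(A = True | B = True, E = False)
     * 0.90     # P(J = True | A = True)
     * 0.30)    # P(M = False | A = True) = 1 - 0.70
print(p)  # ~0.000253, i.e. about a 0.025% chance
```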
4. Deriving Conditional Probabilities:
Suppose we want to compute 𝑃(𝐵 = True ∣ 𝐽 = True), the probability of a burglary given that
John calls.
Using the definition of conditional probability:
𝑃(𝐵 ∣ 𝐽) = 𝑃(𝐵, 𝐽) / 𝑃(𝐽).
We compute 𝑃(𝐵, 𝐽) and 𝑃(𝐽) by summing over the joint distribution:
• 𝑃(𝐵 = 𝑇, 𝐽 = 𝑇) = ∑𝐸,𝐴,𝑀 𝑃 (𝐵 = 𝑇, 𝐸, 𝐴, 𝐽 = 𝑇, 𝑀).
• 𝑃(𝐽 = 𝑇) = ∑𝐵,𝐸,𝐴,𝑀 𝑃 (𝐵, 𝐸, 𝐴, 𝐽 = 𝑇, 𝑀).
This involves marginalizing (summing out) the other variables, which is computationally
tractable due to the factorization.
5. 𝜃-values as Conditional Probabilities:
The CPT entries (e.g., 𝑃(𝐴 ∣ 𝐵, 𝐸)) are the 𝜃-values. They are:
• Defined upfront as part of the network's semantics.
• Used to construct the joint distribution via the product rule.
• Can also be derived from data (learning) in real-world applications.
For example:
𝜃(𝐴 = True ∣ 𝐵 = True, 𝐸 = False) = 𝑃(𝐴 = True ∣ 𝐵 = True, 𝐸 = False),
which is a fixed parameter in the CPT.

Key Takeaways:
1. Bayesian networks compactly represent the joint distribution by factoring it into
conditional probabilities (CPTs).
2. The 𝜃-values are the CPT entries, which are conditional probabilities like 𝑃(𝑋𝑖 ∣
Parents(𝑋𝑖 )).
3. Any conditional probability can be derived from the joint distribution using standard
probability rules (e.g., 𝑃(𝐵 ∣ 𝐽) = 𝑃(𝐵, 𝐽) / 𝑃(𝐽)).
4. The Alarm Network shows how real-world scenarios (burglary, alarms, calls) can be
modeled efficiently using conditional independence.
This is the power of Bayesian networks: they combine graph theory (for structure) and
probability theory (for semantics) to enable scalable reasoning under uncertainty.
Exact Inference in Bayesian Networks
Query: P(Burglary | JohnCalls = true, MaryCalls = true)

Explanation:
You want to compute the probability that a Burglary happened, given that JohnCalls = true
and MaryCalls = true.
This involves:
• Evidence (e): JohnCalls = true, MaryCalls = true
• Query Variable (X): Burglary
• Hidden Variables (y): variables not in query or evidence, here Earthquake and Alarm.

Formula:
By enumeration, we use:

𝑃(𝑋 ∣ 𝑒) = 𝛼 ∑𝑦 𝑃(𝑋, 𝑒, 𝑦)

Where:
• 𝛼 is a normalization constant (to ensure the result sums to 1),
• We sum over all possible values of the hidden variables 𝑦.

Writing the Full Joint:


The joint distribution 𝑃(𝐵𝑢𝑟𝑔𝑙𝑎𝑟𝑦, 𝐸𝑎𝑟𝑡ℎ𝑞𝑢𝑎𝑘𝑒, 𝐴𝑙𝑎𝑟𝑚, 𝐽𝑜ℎ𝑛𝐶𝑎𝑙𝑙𝑠, 𝑀𝑎𝑟𝑦𝐶𝑎𝑙𝑙𝑠) factorizes
according to the Bayesian network into:
𝑃(𝐵, 𝐸, 𝐴, 𝐽, 𝑀) = 𝑃(𝐵)𝑃(𝐸)𝑃(𝐴 ∣ 𝐵, 𝐸)𝑃(𝐽 ∣ 𝐴)𝑃(𝑀 ∣ 𝐴)
Thus, instead of writing a huge full joint table, we can compute it using the CPTs.

Now, step by step, to compute 𝑃(𝐵𝑢𝑟𝑔𝑙𝑎𝑟𝑦 ∣ 𝐽𝑜ℎ𝑛𝐶𝑎𝑙𝑙𝑠 = 𝑡𝑟𝑢𝑒, 𝑀𝑎𝑟𝑦𝐶𝑎𝑙𝑙𝑠 = 𝑡𝑟𝑢𝑒):
We calculate:
𝑃(𝐵 ∣ 𝐽 = 𝑡𝑟𝑢𝑒, 𝑀 = 𝑡𝑟𝑢𝑒) = 𝛼 ∑𝐸 ∑𝐴 𝑃(𝐵)𝑃(𝐸)𝑃(𝐴 ∣ 𝐵, 𝐸)𝑃(𝐽 = 𝑡𝑟𝑢𝑒 ∣ 𝐴)𝑃(𝑀 = 𝑡𝑟𝑢𝑒 ∣ 𝐴)
Breaking it down:
1. For each possible value of Earthquake (true/false) and Alarm (true/false),
2. Compute the product 𝑃(𝐵)𝑃(𝐸)𝑃(𝐴 ∣ 𝐵, 𝐸)𝑃(𝐽 = 𝑡𝑟𝑢𝑒 ∣ 𝐴)𝑃(𝑀 = 𝑡𝑟𝑢𝑒 ∣ 𝐴),
3. Sum over all combinations of Earthquake and Alarm,
4. Multiply by 𝛼 to normalize.

Quick Sketch of Terms:


If you enumerate all combinations of hidden variables:

Burglary Earthquake Alarm JohnCalls MaryCalls Probability


true true true true true compute
true true false true true compute
true false true true true compute
true false false true true compute
false true true true true compute
false true false true true compute
false false true true true compute
false false false true true compute

(Only two cases for Burglary: true and false.)


Each row gets a probability via CPTs.

We now compute the probability that there's a burglary given that both John and Mary have called,
i.e., 𝑃(𝐵𝑢𝑟𝑔𝑙𝑎𝑟𝑦 ∣ 𝐽𝑜ℎ𝑛𝐶𝑎𝑙𝑙𝑠 = 𝑡𝑟𝑢𝑒, 𝑀𝑎𝑟𝑦𝐶𝑎𝑙𝑙𝑠 = 𝑡𝑟𝑢𝑒).

To do this, we'll use the given formula:


𝑃(𝐵 ∣ 𝐽 = 𝑡𝑟𝑢𝑒, 𝑀 = 𝑡𝑟𝑢𝑒) = 𝛼 ∑𝐸 ∑𝐴 𝑃(𝐵)𝑃(𝐸)𝑃(𝐴 ∣ 𝐵, 𝐸)𝑃(𝐽 = 𝑡𝑟𝑢𝑒 ∣ 𝐴)𝑃(𝑀 = 𝑡𝑟𝑢𝑒 ∣ 𝐴)

This involves summing over all possible values of Earthquake (E) and Alarm (A), computing the
product of the probabilities for each combination, and then normalizing at the end.

Given Probabilities
First, let's recall the probabilities we have from the Bayesian network:
1. Prior Probabilities:

o 𝑃(𝐵𝑢𝑟𝑔𝑙𝑎𝑟𝑦 = 𝑡𝑟𝑢𝑒) = 0.001 (let's denote Burglary as B)


o 𝑃(𝐵𝑢𝑟𝑔𝑙𝑎𝑟𝑦 = 𝑓𝑎𝑙𝑠𝑒) = 0.999
o 𝑃(𝐸𝑎𝑟𝑡ℎ𝑞𝑢𝑎𝑘𝑒 = 𝑡𝑟𝑢𝑒) = 0.002 (let's denote Earthquake as E)
o 𝑃(𝐸𝑎𝑟𝑡ℎ𝑞𝑢𝑎𝑘𝑒 = 𝑓𝑎𝑙𝑠𝑒) = 0.998
2. Alarm (A) Conditional Probabilities:

o 𝑃(𝐴 = 𝑡𝑟𝑢𝑒 ∣ 𝐵 = 𝑡𝑟𝑢𝑒, 𝐸 = 𝑡𝑟𝑢𝑒) = 0.95


o 𝑃(𝐴 = 𝑡𝑟𝑢𝑒 ∣ 𝐵 = 𝑡𝑟𝑢𝑒, 𝐸 = 𝑓𝑎𝑙𝑠𝑒) = 0.94
o 𝑃(𝐴 = 𝑡𝑟𝑢𝑒 ∣ 𝐵 = 𝑓𝑎𝑙𝑠𝑒, 𝐸 = 𝑡𝑟𝑢𝑒) = 0.29
o 𝑃(𝐴 = 𝑡𝑟𝑢𝑒 ∣ 𝐵 = 𝑓𝑎𝑙𝑠𝑒, 𝐸 = 𝑓𝑎𝑙𝑠𝑒) = 0.001
o 𝑃(𝐴 = 𝑓𝑎𝑙𝑠𝑒 ∣. . . ) is just 1 minus the above probabilities.
3. John Calls (J) Conditional on Alarm:

o 𝑃(𝐽 = 𝑡𝑟𝑢𝑒 ∣ 𝐴 = 𝑡𝑟𝑢𝑒) = 0.90


o 𝑃(𝐽 = 𝑡𝑟𝑢𝑒 ∣ 𝐴 = 𝑓𝑎𝑙𝑠𝑒) = 0.05
4. Mary Calls (M) Conditional on Alarm:

o 𝑃(𝑀 = 𝑡𝑟𝑢𝑒 ∣ 𝐴 = 𝑡𝑟𝑢𝑒) = 0.70


o 𝑃(𝑀 = 𝑡𝑟𝑢𝑒 ∣ 𝐴 = 𝑓𝑎𝑙𝑠𝑒) = 0.01

Calculating 𝑃(𝐵 = 𝑡𝑟𝑢𝑒 ∣ 𝐽 = 𝑡𝑟𝑢𝑒, 𝑀 = 𝑡𝑟𝑢𝑒)


We need to compute:
𝑃(𝐵 = 𝑡𝑟𝑢𝑒 ∣ 𝐽 = 𝑡𝑟𝑢𝑒, 𝑀 = 𝑡𝑟𝑢𝑒) = 𝛼 ∑𝐸 ∑𝐴 𝑃(𝐵 = 𝑡𝑟𝑢𝑒)𝑃(𝐸)𝑃(𝐴 ∣ 𝐵 = 𝑡𝑟𝑢𝑒, 𝐸)𝑃(𝐽 = 𝑡𝑟𝑢𝑒 ∣ 𝐴)𝑃(𝑀 = 𝑡𝑟𝑢𝑒 ∣ 𝐴)
Similarly, we'll compute 𝑃(𝐵 = 𝑓𝑎𝑙𝑠𝑒 ∣ 𝐽 = 𝑡𝑟𝑢𝑒, 𝑀 = 𝑡𝑟𝑢𝑒) and then normalize.
Step 1: Compute the sum for 𝐵 = 𝑡𝑟𝑢𝑒
We'll iterate over all combinations of E (true/false) and A (true/false).
There are 4 combinations:
1. E=true, A=true
2. E=true, A=false
3. E=false, A=true
4. E=false, A=false
Let's calculate each term:
1. E=true, A=true:
o 𝑃(𝐵 = 𝑡𝑟𝑢𝑒) = 0.001
o 𝑃(𝐸 = 𝑡𝑟𝑢𝑒) = 0.002
o 𝑃(𝐴 = 𝑡𝑟𝑢𝑒 ∣ 𝐵 = 𝑡𝑟𝑢𝑒, 𝐸 = 𝑡𝑟𝑢𝑒) = 0.95
o 𝑃(𝐽 = 𝑡𝑟𝑢𝑒 ∣ 𝐴 = 𝑡𝑟𝑢𝑒) = 0.90
o 𝑃(𝑀 = 𝑡𝑟𝑢𝑒 ∣ 𝐴 = 𝑡𝑟𝑢𝑒) = 0.70
o Product: 0.001 × 0.002 × 0.95 × 0.90 × 0.70
Calculating: 0.001 × 0.002 = 0.000002; 0.000002 × 0.95 = 0.0000019; 0.0000019 ×
0.90 = 0.00000171; 0.00000171 × 0.70 = 0.000001197
2. E=true, A=false:

o 𝑃(𝐵 = 𝑡𝑟𝑢𝑒) = 0.001


o 𝑃(𝐸 = 𝑡𝑟𝑢𝑒) = 0.002
o 𝑃(𝐴 = 𝑓𝑎𝑙𝑠𝑒 ∣ 𝐵 = 𝑡𝑟𝑢𝑒, 𝐸 = 𝑡𝑟𝑢𝑒) = 1 − 0.95 = 0.05
o 𝑃(𝐽 = 𝑡𝑟𝑢𝑒 ∣ 𝐴 = 𝑓𝑎𝑙𝑠𝑒) = 0.05
o 𝑃(𝑀 = 𝑡𝑟𝑢𝑒 ∣ 𝐴 = 𝑓𝑎𝑙𝑠𝑒) = 0.01
o Product: 0.001 × 0.002 × 0.05 × 0.05 × 0.01
Calculating: 0.001 × 0.002 = 0.000002; 0.000002 × 0.05 = 0.0000001; 0.0000001 ×
0.05 = 0.000000005; 0.000000005 × 0.01 = 0.00000000005
3. E=false, A=true:

o 𝑃(𝐵 = 𝑡𝑟𝑢𝑒) = 0.001


o 𝑃(𝐸 = 𝑓𝑎𝑙𝑠𝑒) = 0.998
o 𝑃(𝐴 = 𝑡𝑟𝑢𝑒 ∣ 𝐵 = 𝑡𝑟𝑢𝑒, 𝐸 = 𝑓𝑎𝑙𝑠𝑒) = 0.94
o 𝑃(𝐽 = 𝑡𝑟𝑢𝑒 ∣ 𝐴 = 𝑡𝑟𝑢𝑒) = 0.90
o 𝑃(𝑀 = 𝑡𝑟𝑢𝑒 ∣ 𝐴 = 𝑡𝑟𝑢𝑒) = 0.70
o Product: 0.001 × 0.998 × 0.94 × 0.90 × 0.70
Calculating: 0.001 × 0.998 = 0.000998; 0.000998 × 0.94 ≈ 0.00093812;
0.00093812 × 0.90 ≈ 0.000844308; 0.000844308 × 0.70 ≈ 0.0005910156
4. E=false, A=false:

o 𝑃(𝐵 = 𝑡𝑟𝑢𝑒) = 0.001


o 𝑃(𝐸 = 𝑓𝑎𝑙𝑠𝑒) = 0.998
o 𝑃(𝐴 = 𝑓𝑎𝑙𝑠𝑒 ∣ 𝐵 = 𝑡𝑟𝑢𝑒, 𝐸 = 𝑓𝑎𝑙𝑠𝑒) = 1 − 0.94 = 0.06
o 𝑃(𝐽 = 𝑡𝑟𝑢𝑒 ∣ 𝐴 = 𝑓𝑎𝑙𝑠𝑒) = 0.05
o 𝑃(𝑀 = 𝑡𝑟𝑢𝑒 ∣ 𝐴 = 𝑓𝑎𝑙𝑠𝑒) = 0.01
o Product: 0.001 × 0.998 × 0.06 × 0.05 × 0.01
Calculating: 0.001 × 0.998 = 0.000998; 0.000998 × 0.06 = 0.00005988;
0.00005988 × 0.05 = 0.000002994; 0.000002994 × 0.01 = 0.00000002994
Now, sum all four terms for 𝐵 = 𝑡𝑟𝑢𝑒:
1. 0.000001197
2. 0.00000000005 ≈ 0
3. 0.0005910156
4. 0.00000002994 ≈ 0
Sum ≈ 0.000001197 + 0.0005910156 ≈ 0.0005922126
Step 2: Compute the sum for 𝐵 = 𝑓𝑎𝑙𝑠𝑒
Now, we do the same for 𝐵 = 𝑓𝑎𝑙𝑠𝑒:
𝑃(𝐵 = 𝑓𝑎𝑙𝑠𝑒 ∣ 𝐽 = 𝑡𝑟𝑢𝑒, 𝑀 = 𝑡𝑟𝑢𝑒) = 𝛼 ∑𝐸 ∑𝐴 𝑃(𝐵 = 𝑓𝑎𝑙𝑠𝑒)𝑃(𝐸)𝑃(𝐴 ∣ 𝐵 = 𝑓𝑎𝑙𝑠𝑒, 𝐸)𝑃(𝐽 = 𝑡𝑟𝑢𝑒 ∣ 𝐴)𝑃(𝑀 = 𝑡𝑟𝑢𝑒 ∣ 𝐴)
Again, four combinations:
1. E=true, A=true:

o 𝑃(𝐵 = 𝑓𝑎𝑙𝑠𝑒) = 0.999


o 𝑃(𝐸 = 𝑡𝑟𝑢𝑒) = 0.002
o 𝑃(𝐴 = 𝑡𝑟𝑢𝑒 ∣ 𝐵 = 𝑓𝑎𝑙𝑠𝑒, 𝐸 = 𝑡𝑟𝑢𝑒) = 0.29
o 𝑃(𝐽 = 𝑡𝑟𝑢𝑒 ∣ 𝐴 = 𝑡𝑟𝑢𝑒) = 0.90
o 𝑃(𝑀 = 𝑡𝑟𝑢𝑒 ∣ 𝐴 = 𝑡𝑟𝑢𝑒) = 0.70
o Product: 0.999 × 0.002 × 0.29 × 0.90 × 0.70
Calculating: 0.999 × 0.002 = 0.001998; 0.001998 × 0.29 ≈ 0.00057942;
0.00057942 × 0.90 ≈ 0.000521478; 0.000521478 × 0.70 ≈ 0.0003650346
2. E=true, A=false:

o 𝑃(𝐵 = 𝑓𝑎𝑙𝑠𝑒) = 0.999


o 𝑃(𝐸 = 𝑡𝑟𝑢𝑒) = 0.002
o 𝑃(𝐴 = 𝑓𝑎𝑙𝑠𝑒 ∣ 𝐵 = 𝑓𝑎𝑙𝑠𝑒, 𝐸 = 𝑡𝑟𝑢𝑒) = 1 − 0.29 = 0.71
o 𝑃(𝐽 = 𝑡𝑟𝑢𝑒 ∣ 𝐴 = 𝑓𝑎𝑙𝑠𝑒) = 0.05
o 𝑃(𝑀 = 𝑡𝑟𝑢𝑒 ∣ 𝐴 = 𝑓𝑎𝑙𝑠𝑒) = 0.01
o Product: 0.999 × 0.002 × 0.71 × 0.05 × 0.01
Calculating: 0.999 × 0.002 = 0.001998; 0.001998 × 0.71 ≈ 0.00141858;
0.00141858 × 0.05 ≈ 0.000070929; 0.000070929 × 0.01 = 0.00000070929
3. E=false, A=true:

o 𝑃(𝐵 = 𝑓𝑎𝑙𝑠𝑒) = 0.999


o 𝑃(𝐸 = 𝑓𝑎𝑙𝑠𝑒) = 0.998
o 𝑃(𝐴 = 𝑡𝑟𝑢𝑒 ∣ 𝐵 = 𝑓𝑎𝑙𝑠𝑒, 𝐸 = 𝑓𝑎𝑙𝑠𝑒) = 0.001
o 𝑃(𝐽 = 𝑡𝑟𝑢𝑒 ∣ 𝐴 = 𝑡𝑟𝑢𝑒) = 0.90
o 𝑃(𝑀 = 𝑡𝑟𝑢𝑒 ∣ 𝐴 = 𝑡𝑟𝑢𝑒) = 0.70
o Product: 0.999 × 0.998 × 0.001 × 0.90 × 0.70
Calculating: 0.999 × 0.998 ≈ 0.997002; 0.997002 × 0.001 = 0.000997002;
0.000997002 × 0.90 ≈ 0.0008973018; 0.0008973018 × 0.70 ≈ 0.00062811126
4. E=false, A=false:

o 𝑃(𝐵 = 𝑓𝑎𝑙𝑠𝑒) = 0.999


o 𝑃(𝐸 = 𝑓𝑎𝑙𝑠𝑒) = 0.998
o 𝑃(𝐴 = 𝑓𝑎𝑙𝑠𝑒 ∣ 𝐵 = 𝑓𝑎𝑙𝑠𝑒, 𝐸 = 𝑓𝑎𝑙𝑠𝑒) = 1 − 0.001 = 0.999
o 𝑃(𝐽 = 𝑡𝑟𝑢𝑒 ∣ 𝐴 = 𝑓𝑎𝑙𝑠𝑒) = 0.05
o 𝑃(𝑀 = 𝑡𝑟𝑢𝑒 ∣ 𝐴 = 𝑓𝑎𝑙𝑠𝑒) = 0.01
o Product: 0.999 × 0.998 × 0.999 × 0.05 × 0.01
Calculating: 0.999 × 0.998 ≈ 0.997002; 0.997002 × 0.999 ≈ 0.996004998;
0.996004998 × 0.05 ≈ 0.0498002499; 0.0498002499 × 0.01 ≈ 0.000498002499

Now, sum all four terms for 𝐵 = 𝑓𝑎𝑙𝑠𝑒:


1. 0.0003650346
2. 0.00000070929 ≈ 0
3. 0.00062811126
4. 0.000498002499
Sum ≈ 0.0003650346 + 0.00062811126 + 0.000498002499 ≈ 0.001491148359
Step 3: Normalize the Probabilities
Now, we have:
• 𝑃′ (𝐵 = 𝑡𝑟𝑢𝑒) = 0.0005922126
• 𝑃′ (𝐵 = 𝑓𝑎𝑙𝑠𝑒) = 0.001491148359
The normalizing constant 𝛼 is:
𝛼 = 1 / (𝑃′(𝐵 = 𝑡𝑟𝑢𝑒) + 𝑃′(𝐵 = 𝑓𝑎𝑙𝑠𝑒)) = 1 / (0.0005922126 + 0.001491148359)
= 1 / 0.002083360959 ≈ 480
Now, compute the actual probabilities:
• 𝑃(𝐵 = 𝑡𝑟𝑢𝑒 ∣ 𝐽 = 𝑡𝑟𝑢𝑒, 𝑀 = 𝑡𝑟𝑢𝑒) = 𝛼 × 𝑃′ (𝐵 = 𝑡𝑟𝑢𝑒) ≈ 480 × 0.0005922126 ≈
0.284
• 𝑃(𝐵 = 𝑓𝑎𝑙𝑠𝑒 ∣ 𝐽 = 𝑡𝑟𝑢𝑒, 𝑀 = 𝑡𝑟𝑢𝑒) = 𝛼 × 𝑃′ (𝐵 = 𝑓𝑎𝑙𝑠𝑒) ≈ 480 ×
0.001491148359 ≈ 0.716
Check the normalization: 𝑃′(𝐵 = 𝑡𝑟𝑢𝑒) + 𝑃′(𝐵 = 𝑓𝑎𝑙𝑠𝑒) ≈ 0.002083, so:
𝑃(𝐵 = 𝑡𝑟𝑢𝑒 ∣ . . . ) = 0.000592 / 0.002083 ≈ 0.284
𝑃(𝐵 = 𝑓𝑎𝑙𝑠𝑒 ∣ . . . ) = 0.001491 / 0.002083 ≈ 0.716
These indeed sum to 1.

Final Answer
After performing all the calculations:
𝑃(𝐵𝑢𝑟𝑔𝑙𝑎𝑟𝑦 = 𝑡𝑟𝑢𝑒 ∣ 𝐽𝑜ℎ𝑛𝐶𝑎𝑙𝑙𝑠 = 𝑡𝑟𝑢𝑒, 𝑀𝑎𝑟𝑦𝐶𝑎𝑙𝑙𝑠 = 𝑡𝑟𝑢𝑒) ≈ 0.284
So, there's approximately a 28.4% chance that there's been a burglary given that both John and
Mary have called.
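
A minimal Python sketch of this enumeration (the dictionary layout is an illustrative choice, not a library API); it reproduces the ≈ 0.284 / 0.716 result above:

```python
# CPTs of the alarm network, as given above.
P_B = {True: 0.001, False: 0.999}                   # P(Burglary)
P_E = {True: 0.002, False: 0.998}                   # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,     # P(Alarm=true | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                     # P(JohnCalls=true | A)
P_M = {True: 0.70, False: 0.01}                     # P(MaryCalls=true | A)

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) via the chain-rule factorization."""
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

# Unnormalized P(B | J=true, M=true): sum out the hidden variables E and A.
unnorm = {b: sum(joint(b, e, a, True, True)
                 for e in (True, False) for a in (True, False))
          for b in (True, False)}

alpha = 1.0 / sum(unnorm.values())
posterior = {b: alpha * p for b, p in unnorm.items()}
print(posterior[True])   # ~0.284
print(posterior[False])  # ~0.716
```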
Variable Elimination Method

We now apply the Variable Elimination (VE) algorithm to the classic Burglary-Earthquake-Alarm
Bayesian network, focusing on summing out the variables 𝐴 and 𝐸 to
compute 𝑃(𝐵 ∣ 𝐽 = 𝑡𝑟𝑢𝑒, 𝑀 = 𝑡𝑟𝑢𝑒).

Given Bayesian Network (Parameters)


We'll use the standard probabilities from the Burglary-Earthquake-Alarm network:
1. Prior Probabilities:

o 𝑃(𝐵 = 𝑡𝑟𝑢𝑒) = 0.001


o 𝑃(𝐸 = 𝑡𝑟𝑢𝑒) = 0.002
2. Alarm (A) CPT:

o 𝑃(𝐴 = 𝑡𝑟𝑢𝑒 ∣ 𝐵 = 𝑡𝑟𝑢𝑒, 𝐸 = 𝑡𝑟𝑢𝑒) = 0.95


o 𝑃(𝐴 = 𝑡𝑟𝑢𝑒 ∣ 𝐵 = 𝑡𝑟𝑢𝑒, 𝐸 = 𝑓𝑎𝑙𝑠𝑒) = 0.94
o 𝑃(𝐴 = 𝑡𝑟𝑢𝑒 ∣ 𝐵 = 𝑓𝑎𝑙𝑠𝑒, 𝐸 = 𝑡𝑟𝑢𝑒) = 0.29
o 𝑃(𝐴 = 𝑡𝑟𝑢𝑒 ∣ 𝐵 = 𝑓𝑎𝑙𝑠𝑒, 𝐸 = 𝑓𝑎𝑙𝑠𝑒) = 0.001
3. John Calls (J) CPT:

o 𝑃(𝐽 = 𝑡𝑟𝑢𝑒 ∣ 𝐴 = 𝑡𝑟𝑢𝑒) = 0.90


o 𝑃(𝐽 = 𝑡𝑟𝑢𝑒 ∣ 𝐴 = 𝑓𝑎𝑙𝑠𝑒) = 0.05
4. Mary Calls (M) CPT:

o 𝑃(𝑀 = 𝑡𝑟𝑢𝑒 ∣ 𝐴 = 𝑡𝑟𝑢𝑒) = 0.70


o 𝑃(𝑀 = 𝑡𝑟𝑢𝑒 ∣ 𝐴 = 𝑓𝑎𝑙𝑠𝑒) = 0.01

Step 1: Define Factors


We represent the joint distribution as factors (tables):
1. 𝑓1(𝐵): Prior for Burglary

o 𝑓1 (𝐵 = 𝑡𝑟𝑢𝑒) = 0.001, 𝑓1 (𝐵 = 𝑓𝑎𝑙𝑠𝑒) = 0.999.


2. 𝑓2(𝐸): Prior for Earthquake

o 𝑓2 (𝐸 = 𝑡𝑟𝑢𝑒) = 0.002, 𝑓2 (𝐸 = 𝑓𝑎𝑙𝑠𝑒) = 0.998.


3. 𝑓3(𝐴, 𝐵, 𝐸): Alarm CPT

o E.g., 𝑓3 (𝐴 = 𝑡𝑟𝑢𝑒, 𝐵 = 𝑡𝑟𝑢𝑒, 𝐸 = 𝑡𝑟𝑢𝑒) = 0.95.


4. 𝑓4(𝐽, 𝐴): John Calls CPT
o 𝑓4 (𝐽 = 𝑡𝑟𝑢𝑒, 𝐴 = 𝑡𝑟𝑢𝑒) = 0.90, 𝑓4 (𝐽 = 𝑡𝑟𝑢𝑒, 𝐴 = 𝑓𝑎𝑙𝑠𝑒) = 0.05.
5. 𝑓5(𝑀, 𝐴): Mary Calls CPT

o 𝑓5 (𝑀 = 𝑡𝑟𝑢𝑒, 𝐴 = 𝑡𝑟𝑢𝑒) = 0.70, 𝑓5 (𝑀 = 𝑡𝑟𝑢𝑒, 𝐴 = 𝑓𝑎𝑙𝑠𝑒) = 0.01.

Step 2: Sum Out 𝐴 (Combine 𝑓3 , 𝑓4 , 𝑓5 )


We compute a new factor 𝑓6 (𝐵, 𝐸) by:
1. Multiply 𝑓3 ⋅ 𝑓4 ⋅ 𝑓5 (pointwise products for all 𝐴, 𝐵, 𝐸).
2. Sum over 𝐴 (marginalize 𝐴).
Calculation for 𝐵 = 𝑡𝑟𝑢𝑒, 𝐸 = 𝑡𝑟𝑢𝑒:
For 𝐴 = 𝑡𝑟𝑢𝑒:
𝑓3 (𝐴 = 𝑡𝑟𝑢𝑒, 𝐵 = 𝑡𝑟𝑢𝑒, 𝐸 = 𝑡𝑟𝑢𝑒) ⋅ 𝑓4 (𝐽 = 𝑡𝑟𝑢𝑒, 𝐴 = 𝑡𝑟𝑢𝑒) ⋅ 𝑓5 (𝑀 = 𝑡𝑟𝑢𝑒, 𝐴 = 𝑡𝑟𝑢𝑒)
= 0.95 × 0.90 × 0.70 = 0.5985.
For 𝐴 = 𝑓𝑎𝑙𝑠𝑒:
𝑓3 (𝐴 = 𝑓𝑎𝑙𝑠𝑒, 𝐵 = 𝑡𝑟𝑢𝑒, 𝐸 = 𝑡𝑟𝑢𝑒) ⋅ 𝑓4 (𝐽 = 𝑡𝑟𝑢𝑒, 𝐴 = 𝑓𝑎𝑙𝑠𝑒) ⋅ 𝑓5 (𝑀 = 𝑡𝑟𝑢𝑒, 𝐴 = 𝑓𝑎𝑙𝑠𝑒)
= (1 − 0.95) × 0.05 × 0.01 = 0.000025.
Sum over 𝐴 for 𝐵 = 𝑡𝑟𝑢𝑒, 𝐸 = 𝑡𝑟𝑢𝑒:
0.5985 + 0.000025 = 0.598525.
Repeat for all 𝐵, 𝐸 combinations:
• 𝐵 = 𝑡𝑟𝑢𝑒, 𝐸 = 𝑓𝑎𝑙𝑠𝑒:
(0.94 × 0.90 × 0.70) + (0.06 × 0.05 × 0.01) = 0.5922 + 0.00003 = 0.59223.

• 𝐵 = 𝑓𝑎𝑙𝑠𝑒, 𝐸 = 𝑡𝑟𝑢𝑒:
(0.29 × 0.90 × 0.70) + (0.71 × 0.05 × 0.01) = 0.1827 + 0.000355 = 0.183055.

• 𝐵 = 𝑓𝑎𝑙𝑠𝑒, 𝐸 = 𝑓𝑎𝑙𝑠𝑒:
(0.001 × 0.90 × 0.70) + (0.999 × 0.05 × 0.01) = 0.00063 + 0.0004995 =
0.0011295.

Result: 𝑓6 (𝐵, 𝐸)
B E 𝑓6 (𝐵, 𝐸)
true true 0.598525
true false 0.59223
false true 0.183055
false false 0.0011295

Step 3: Sum Out 𝐸 (Combine 𝑓2 and 𝑓6 )


Now compute 𝑓7 (𝐵) by:
1. Multiply 𝑓2(𝐸) ⋅ 𝑓6(𝐵, 𝐸) for all 𝐵, 𝐸.
2. Sum over 𝐸.
Calculation for 𝐵 = 𝑡𝑟𝑢𝑒:
For 𝐸 = 𝑡𝑟𝑢𝑒:
𝑓2 (𝐸 = 𝑡𝑟𝑢𝑒) ⋅ 𝑓6 (𝐵 = 𝑡𝑟𝑢𝑒, 𝐸 = 𝑡𝑟𝑢𝑒) = 0.002 × 0.598525 = 0.00119705.
For 𝐸 = 𝑓𝑎𝑙𝑠𝑒:
𝑓2 (𝐸 = 𝑓𝑎𝑙𝑠𝑒) ⋅ 𝑓6 (𝐵 = 𝑡𝑟𝑢𝑒, 𝐸 = 𝑓𝑎𝑙𝑠𝑒) = 0.998 × 0.59223 = 0.59104554.
Sum over 𝐸 for 𝐵 = 𝑡𝑟𝑢𝑒:
0.00119705 + 0.59104554 = 0.59224259.
Calculation for 𝐵 = 𝑓𝑎𝑙𝑠𝑒:
For 𝐸 = 𝑡𝑟𝑢𝑒:
0.002 × 0.183055 = 0.00036611.
For 𝐸 = 𝑓𝑎𝑙𝑠𝑒:
0.998 × 0.0011295 = 0.00112724.
Sum over 𝐸 for 𝐵 = 𝑓𝑎𝑙𝑠𝑒:
0.00036611 + 0.00112724 = 0.00149335.
Result: 𝑓7 (𝐵)
B 𝑓7 (𝐵)
true 0.59224259
false 0.00149335

Step 4: Final Calculation (Multiply 𝑓1 (𝐵) and 𝑓7 (𝐵))


Now compute the unnormalized posterior:
• For 𝐵 = 𝑡𝑟𝑢𝑒:
𝑓1 (𝐵 = 𝑡𝑟𝑢𝑒) ⋅ 𝑓7 (𝐵 = 𝑡𝑟𝑢𝑒) = 0.001 × 0.59224259 = 0.00059224.

• For 𝐵 = 𝑓𝑎𝑙𝑠𝑒:
𝑓1 (𝐵 = 𝑓𝑎𝑙𝑠𝑒) ⋅ 𝑓7 (𝐵 = 𝑓𝑎𝑙𝑠𝑒) = 0.999 × 0.00149335 = 0.00149186.

Normalize:
• Total mass: 0.00059224 + 0.00149186 = 0.00208410.
• 𝑃(𝐵 = 𝑡𝑟𝑢𝑒 ∣ 𝐽, 𝑀) = 0.00059224 / 0.00208410 ≈ 0.284.
• 𝑃(𝐵 = 𝑓𝑎𝑙𝑠𝑒 ∣ 𝐽, 𝑀) = 0.00149186 / 0.00208410 ≈ 0.716.
Final Answer:
After variable elimination, the posterior probabilities are:
𝑃(𝐵 = 𝑡𝑟𝑢𝑒 ∣ 𝐽 = 𝑡𝑟𝑢𝑒, 𝑀 = 𝑡𝑟𝑢𝑒) ≈ 28.4%
𝑃(𝐵 = 𝑓𝑎𝑙𝑠𝑒 ∣ 𝐽 = 𝑡𝑟𝑢𝑒, 𝑀 = 𝑡𝑟𝑢𝑒) ≈ 71.6%.
This matches the result from the enumeration method, confirming the correctness of the variable
elimination steps. The key takeaway is that summing out variables reduces the problem size
systematically, making inference tractable in larger networks.
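
A minimal Python sketch of the same variable-elimination steps (the factor-as-dictionary representation is an illustrative choice); it reproduces 𝑓6, 𝑓7, and the final posterior:

```python
from itertools import product

f1 = {True: 0.001, False: 0.999}                    # f1(B) = P(B)
f2 = {True: 0.002, False: 0.998}                    # f2(E) = P(E)
f3 = {(True, True): 0.95, (True, False): 0.94,      # P(A=true | B, E)
      (False, True): 0.29, (False, False): 0.001}
f4 = {True: 0.90, False: 0.05}                      # f4(J=true, A) = P(J=true | A)
f5 = {True: 0.70, False: 0.01}                      # f5(M=true, A) = P(M=true | A)

# Step 2: multiply f3, f4, f5 and sum out A -> f6(B, E)
f6 = {}
for b, e in product((True, False), repeat=2):
    total = 0.0
    for a in (True, False):
        pa = f3[(b, e)] if a else 1 - f3[(b, e)]
        total += pa * f4[a] * f5[a]
    f6[(b, e)] = total          # e.g. f6[(True, True)] ≈ 0.598525

# Step 3: multiply by f2(E) and sum out E -> f7(B)
f7 = {b: sum(f2[e] * f6[(b, e)] for e in (True, False)) for b in (True, False)}

# Step 4: multiply by f1(B) and normalize
unnorm = {b: f1[b] * f7[b] for b in (True, False)}
alpha = 1.0 / sum(unnorm.values())
print({b: alpha * p for b, p in unnorm.items()})    # {True: ~0.284, False: ~0.716}
```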
1. Variable Elimination: Limitations
• Efficient for Single Queries:
If you just want the probability of one variable given some evidence, variable
elimination can be very efficient. It eliminates other variables one by one, reusing
computations as needed.

• Inefficient for All Posterior Probabilities:


If you want to compute the posterior probabilities for every variable, variable
elimination has to repeat a lot of work.

• Why computing all posteriors becomes slow, even in a polytree (a singly connected network):

o For each variable’s posterior, you may need a separate query.


o Each query may cost O(n) time.
o With n variables, you do O(n) such queries.
o So, total cost = O(n²).

2. Clustering (Join Tree): The Solution


• Goal: Avoid doing repeated work for each variable. Ideally, do one computation
that lets you get all posterior probabilities efficiently — in O(n) total time.

• How:

o Combine multiple nodes into cluster nodes (aka meganodes).


o This turns a complex network (possibly with loops) into a polytree of clusters.
o Inference on a polytree is efficient: each message is passed just once.

3. Example: Multiply Connected Network


• Say you have nodes like Sprinkler and Rain that lie on two parallel paths between a common ancestor and a common descendant; the resulting loop means the network is not a polytree.
• You cluster them into one node: Sprinkler+Rain.
• Now, the network of cluster nodes becomes a polytree (no loops between clusters).
• You can now run efficient inference on this polytree.

4. Cluster Nodes (Meganodes):


• A meganode represents all combinations of values for the nodes it includes.
• Example: Sprinkler and Rain are both Boolean.
o Possible values: (True, True), (True, False), (False, True), (False, False) → 4
total.
• Meganodes may share variables (e.g., Rain might appear in two clusters), which is key
to how messages are passed between clusters in inference.

Summary Table:
Technique            | Pros                               | Cons
Variable Elimination | Simple, good for single query      | Slow for all variables (O(n²) total)
Clustering/Join Tree | Efficient for all variables (O(n)) | More complex to set up (needs clustering)

Why is a Special Inference Algorithm Needed


• In a normal Bayesian network (with no clustering), inference (like message passing) is
straightforward because each node has its own variables — no overlaps.
• In a clustered network (Join Tree):
o Meganodes (clusters) may share variables with each other (e.g., Rain might
appear in two neighboring clusters).
o So normal inference doesn’t work directly — because if two clusters have the
same variable, they must agree on what value that variable has.
Problem:
How do we make sure that when passing information (probabilities) between clusters, shared
variables have consistent values?

The Solution: Constraint Propagation


• The inference process forces agreement on the shared variables between neighboring
clusters.
• This is done through a special message-passing algorithm, often called belief
propagation or message passing on a junction tree.
Key idea:
• Each cluster computes a local probability table.
• Then, when sending messages to neighboring clusters:
o Marginalize (sum out) all variables except the shared ones.
o Pass that marginalized probability as a message.
o Update local beliefs based on incoming messages to maintain consistency on
shared variables.
Constraint Propagation = ensuring that connected clusters agree on their shared
variables by exchanging summarized probability information.
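
A heavily simplified sketch of a single message-passing step under these assumptions (the cluster contents, the hypothetical second cluster, and the placeholder numbers are illustrative; this is not the full junction-tree algorithm):

```python
from itertools import product

# Local (unnormalized) table of a cluster over {Sprinkler, Rain}.
# The uniform 0.25 entries are placeholders, not values from the notes.
cluster_SR = {(s, r): 0.25 for s, r in product((True, False), repeat=2)}

def message(table, shared_index):
    """Marginalize (sum out) every variable except the shared one."""
    msg = {}
    for assignment, p in table.items():
        key = assignment[shared_index]
        msg[key] = msg.get(key, 0.0) + p
    return msg

# A hypothetical neighboring cluster that also contains Rain multiplies in the
# message on Rain, so both clusters agree on Rain's marginal.
msg_rain = message(cluster_SR, shared_index=1)   # Rain is position 1 in (S, R)
cluster_RX = {(r, x): 0.25 for r, x in product((True, False), repeat=2)}
cluster_RX = {(r, x): p * msg_rain[r] for (r, x), p in cluster_RX.items()}
print(cluster_RX)
```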
1. The Core Idea: Modeling the World Over Time

Imagine the world is unfolding as a sequence of time slices (t = 0, 1, 2, ...), like frames in a
movie.
In each time slice, we model two kinds of variables:

State Variables (Xt):

• These are hidden, meaning we can’t directly observe them.

• They represent the true condition of the world at time t.

• Example:

o Rt = Is it raining at time t? (Umbrella scenario)

o BloodSugar_t = True blood sugar level (in diabetes monitoring)

Evidence Variables (Et):

• These are what we can observe at each time.

• They are noisy signals of the hidden state.

• Example:

o Ut = Director is carrying an umbrella → gives indirect info about rain

o MeasuredBloodSugar_t → may be close but not exactly correct

2. Transition Model (How the World Changes)

This defines how the state at time t (Xt) depends on the previous state:

P(Xt | Xt−1)

• This is called a first-order Markov process.

• Markov Assumption: The future only depends on the most recent past, not the full
history.

o So: P(Xt | Xt−1, Xt−2, ..., X0) = P(Xt | Xt−1)

• Stationary Process: The same rule applies for all time steps.

o Example: The chance of rain tomorrow only depends on today’s rain, not
which day it is.
3. Sensor Model (How We Observe the World)

This defines how the current state leads to observations:

P(Et | Xt)

• This reflects the idea that state causes evidence — not the other way around.

• Called the Sensor Markov Assumption.

• Example:

o If it’s raining (Xt = true), the director is more likely to carry an umbrella (Et =
true).

4. Initial State Distribution

We need to define where things start:

P(X0)

• This gives the probability distribution over the state at time 0.

• It’s the model’s initial belief (e.g., 50% chance it’s raining on day 0).

5. Full Joint Probability Over Time

To represent the entire evolution of the system from time 0 to t, we multiply:


𝑃(𝑋0, 𝑋1, . . . , 𝑋𝑡, 𝐸1, . . . , 𝐸𝑡) = 𝑃(𝑋0) × ∏ᵢ₌₁ᵗ 𝑃(𝑋𝑖 ∣ 𝑋𝑖−1) × 𝑃(𝐸𝑖 ∣ 𝑋𝑖)

• It chains together:

o The initial state,

o The transition model (how states evolve),

o The sensor model (how states generate observations).
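
A minimal sketch of this chained product for the umbrella scenario. Note that the specific numbers (0.5 initial belief, 0.7/0.3 rain persistence, 0.9/0.2 umbrella likelihoods) are assumptions for illustration and are not given in these notes:

```python
P_R0 = {True: 0.5, False: 0.5}      # initial belief P(X0)           -- assumed
P_trans = {True: 0.7, False: 0.3}   # P(R_t = true | R_{t-1})        -- assumed
P_sens = {True: 0.9, False: 0.2}    # P(U_t = true | R_t)            -- assumed

def joint(rains, umbrellas):
    """P(X0..Xt, E1..Et) = P(X0) * prod_i P(Xi | Xi-1) * P(Ei | Xi)."""
    p = P_R0[rains[0]]
    for prev, cur, obs in zip(rains, rains[1:], umbrellas):
        p_cur = P_trans[prev] if cur else 1 - P_trans[prev]
        p_obs = P_sens[cur] if obs else 1 - P_sens[cur]
        p *= p_cur * p_obs
    return p

# Probability of: rain on days 0 and 1, no rain on day 2,
# umbrella seen on day 1 but not on day 2.
print(joint(rains=[True, True, False], umbrellas=[True, False]))
```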

6. Model Refinements

What if our model isn’t accurate enough?


• Option 1: Use a higher-order Markov model

o Instead of just depending on Xt−1, let Xt depend on Xt−2 too.


(e.g., P(Xt | Xt−1, Xt−2))

• Option 2: Add more state variables

o Add Temperature, Season, or BatteryLevel if they influence the process.

These help the model capture more realistic dependencies.

7. Example: Robot Tracking

• State Variables: Position, Velocity, BatteryLevel

• Evidence Variables: Noisy GPS, battery monitor readings

• Transition Model:

o P(Position_t | Position_{t-1}, Velocity_{t-1})

o P(Velocity_t | Velocity_{t-1}, BatteryLevel_{t-1})

• Sensor Model:

o P(GPS_t | Position_t)

o P(BatteryReading_t | BatteryLevel_t)

This lets us infer where the robot is, even if GPS is noisy or missing.
