Proba Based Notes
Lyes HADJAR
May 31, 2025
Contents
1 Big Idea
2 Fundamentals
  2.1 Bayes' Theorem
  2.2 Bayesian Prediction
3 Standard Approach: The Naive Bayes Model
  3.1 Naive Bayes Formula
  3.2 Prediction Rule
  3.3 Example: Naive Bayes with Categorical Features
4 Extensions and Variations
  4.1 Smoothing (Laplace Correction)
  4.2 Continuous Features: Probability Density Functions (PDF)
  4.3 Continuous Features: Binning
5 Bayesian Networks
  5.1 Intuition and Purpose
  5.2 Structure and Semantics
  5.3 Conditional Probability Tables (CPTs)
  5.4 Inference in Bayesian Networks
    5.4.1 Exact Inference: Enumeration
    5.4.2 Approximate Inference: Sampling
  5.5 Worked Example: Wet Grass Problem
6 Summary of Key Concepts
7 Practice Problems
  7.1 Problem 1: Medical Diagnosis with Bayes' Theorem
  7.2 Problem 2: Multi-feature Naive Bayes with Smoothing
  7.3 Problem 3: Continuous Features with Gaussian Distribution
Probability-Based Learning
1 Big Idea
Big Idea
Probability-based learning is about revising beliefs as new data arrives. It is built on Bayes' Theorem, which connects prior beliefs with observed evidence to form new (posterior) beliefs.
Intuitive Understanding
Imagine you’re playing a game called Find the Lady where a queen is hidden under one
of three cards. If you have no information, you assume equal probability (1/3) for each
card. But if you watch the dealer favor the right position 19 out of 30 times, you revise
your guess — now you believe the queen is more likely to be on the right.
This updating of belief using evidence is the essence of Bayesian thinking.
2 Fundamentals
2.1 Bayes’ Theorem
Intuitive Understanding
Bayes' Theorem lets you reverse a conditional probability: from "how likely is this evidence under each hypothesis" to "how likely is each hypothesis given this evidence".
Formal Idea
Formally:

P(t | d) = P(d | t) · P(t) / P(d)    (1)

Where:
• t is the target (class) and d is the observed data (features)
• P(t) is the prior probability of the class
• P(d | t) is the likelihood of the data given the class
• P(d) is the evidence, the overall probability of the data
We often ignore P (d) for classification because it is the same for all classes.
P(t | d) ∝ P(d | t) · P(t)    (2)
We choose the class with the highest posterior, called the MAP (Maximum A Posteriori) prediction:

t* = arg max_t P(d | t) · P(t)    (3)
3 Standard Approach: The Naive Bayes Model
Formal Idea
Naive Bayes Assumption: All features are conditionally independent given the target.
This reduces the complexity significantly: instead of modeling the joint distribution of all features given the class, we only need one conditional distribution per feature:

P(t | d) ∝ P(t) · ∏ᵢ P(dᵢ | t)
Example
Assume a dataset for predicting whether an email is SPAM or NOT SPAM based on three binary features: Free, Buy, and Click (each Yes/No).
The dataset is summarized below, where each row shows a unique feature combination
and how many emails fall into that category.
Class Free Buy Click Count
Spam Yes Yes Yes 3
Spam No Yes Yes 2
Not No No Yes 3
Not No No No 2
Total number of training examples: 3 + 2 + 3 + 2 = 10
You want to predict the class of an email with: {Free=Yes, Buy=Yes, Click=Yes}.
Step 1: Compute Prior Probabilities
• P(Spam) = (3 + 2)/10 = 5/10 = 0.5
• P(Not) = (3 + 2)/10 = 5/10 = 0.5
Step 2: Compute Likelihoods
• P(Free = Yes | Spam) = 3/5 = 0.6
• P(Buy = Yes | Spam) = 5/5 = 1.0
• P(Click = Yes | Spam) = 5/5 = 1.0
• P(Free = Yes | Not) = 0/5 = 0
• P(Buy = Yes | Not) = 0/5 = 0
• P(Click = Yes | Not) = 3/5 = 0.6
Step 3: Compute Posterior Scores
For Spam: P(Spam | d) ∝ 0.5 × 0.6 × 1.0 × 1.0 = 0.3
For Not: P(Not | d) ∝ 0.5 × 0 × 0 × 0.6 = 0
Final Prediction: Since P(Spam | d) > P(Not | d), the model predicts Spam.
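The arithmetic above can be reproduced in a short Python sketch. This is a minimal illustration built directly on the table's counts, not a general implementation:

```python
# Training data from the table: each (Free, Buy, Click) combination with its count
data = {
    "Spam": {("Yes", "Yes", "Yes"): 3, ("No", "Yes", "Yes"): 2},
    "Not":  {("No", "No", "Yes"): 3, ("No", "No", "No"): 2},
}

def class_score(cls, instance):
    """Unnormalized posterior: P(cls) * prod_i P(feature_i = value_i | cls)."""
    rows = data[cls]
    n_cls = sum(rows.values())                             # examples of this class
    n_total = sum(sum(r.values()) for r in data.values())  # all training examples
    score = n_cls / n_total                                # the prior P(cls)
    for i, value in enumerate(instance):
        # Count training examples of this class where feature i takes this value
        match = sum(c for combo, c in rows.items() if combo[i] == value)
        score *= match / n_cls                             # likelihood of feature i
    return score

test = ("Yes", "Yes", "Yes")  # Free=Yes, Buy=Yes, Click=Yes
scores = {cls: class_score(cls, test) for cls in data}
print(scores)                       # ≈ {'Spam': 0.3, 'Not': 0.0}
print(max(scores, key=scores.get))  # Spam
```

Note that the zero likelihoods for the Not class drive its score to exactly zero, which motivates the smoothing discussed below.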
Warning
Question: In this example, why did we calculate P (Click = Y es | Spam) but not
P (Click = N o | Spam)?
Answer: This is an excellent question that highlights a key principle of Naive Bayes
classification.
Key Principle: In Naive Bayes, we only compute the probabilities of the specific
feature values observed in the test instance we’re trying to classify.
Our test email has: {Free=Yes, Buy=Yes, Click=Yes}
Therefore, for each class t, we calculate P(Free = Yes | t), P(Buy = Yes | t), and P(Click = Yes | t).
We do not need P(Click=No | Spam) because the test instance has Click=Yes, not Click=No.
When would we compute P(Click=No | Spam)? Only if the test instance we were classifying had Click=No.
Intuition: Think of it as asking "Given this exact email, how likely is it to be spam?"
We’re not concerned with emails that look different — only the observed features matter
for this specific prediction.
4 Extensions and Variations
4.1 Smoothing (Laplace Correction)
Warning
A single zero likelihood, such as P(Free = Yes | Not) = 0 above, forces the whole product for that class to zero, no matter how strong the other features are. Laplace smoothing avoids this by adding a pseudo-count k to every count:

P(f = v | t) = (count(f = v, t) + k) / (count(t) + k · |values(f)|)

Example
With k = 1 and two possible values (Yes/No), P(Free = Yes | Not) becomes (0 + 1)/(5 + 1 · 2) = 1/7 ≈ 0.143 instead of 0.
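Laplace (add-k) smoothing replaces each raw count with a smoothed one so that no likelihood is ever exactly zero. A minimal sketch, using counts from the spam example and assuming k = 1:

```python
def smoothed_likelihood(count, class_total, n_values, k=1):
    """P(f = v | t) with add-k (Laplace) smoothing.

    count       -- training examples of class t with f = v
    class_total -- training examples of class t
    n_values    -- number of distinct values feature f can take
    """
    return (count + k) / (class_total + k * n_values)

# P(Free=Yes | Not) was 0/5; with k=1 and 2 possible values (Yes/No):
print(smoothed_likelihood(0, 5, 2))  # 1/7 ≈ 0.143
# P(Free=Yes | Spam) was 3/5:
print(smoothed_likelihood(3, 5, 2))  # 4/7 ≈ 0.571
```

Smoothing slightly flattens all likelihoods toward the uniform distribution; larger k means stronger flattening.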
4.2 Continuous Features: Probability Density Functions (PDF)
Formal Idea
For a continuous feature, we model P(dᵢ | t) with a probability density, typically a Gaussian:

P(dᵢ | t) = (1 / √(2πσ²)) · exp(−(dᵢ − µ)² / (2σ²))    (7)

Where:
• µ is the mean of the feature for class t
• σ² is the variance of the feature for class t
Example
Step-by-step example. Suppose that for class Spam the feature has:
• µ = 40, σ² = 25
and the observed feature value is dᵢ = 50. Then:

P(dᵢ | t = Spam) = (1 / √(2π · 25)) · exp(−(50 − 40)² / (2 · 25))    (8)
= (1 / √157.08) · exp(−100 / 50)    (9)
= (1 / 12.53) · exp(−2) ≈ 0.0798 × 0.135 ≈ 0.0108    (10)
4.3 Continuous Features: Binning
Big Idea
Instead of fitting a density, discretize the continuous feature into intervals (bins) and treat the bin labels as categorical values.
Advantages of Binning:
• Simple to implement and understand
• Works well when meaningful intervals exist
• Robust to outliers within bins
• Preserves interpretability
Disadvantages of Binning:
• Loss of information due to discretization
• Arbitrary bin boundaries can affect results
• May not capture smooth relationships
• Requires domain knowledge for optimal binning
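As a sketch of how binning works in practice (the bin edges below are illustrative assumptions, not from the notes):

```python
import bisect

def to_bin(value, edges):
    """Map a continuous value to a bin index using sorted bin edges."""
    return bisect.bisect_right(edges, value)

# Bin 0: v < 30, bin 1: 30 <= v < 50, bin 2: 50 <= v < 70, bin 3: v >= 70
edges = [30, 50, 70]
print(to_bin(45, edges))  # 1
print(to_bin(75, edges))  # 3
```

Once each value is mapped to its bin index, the feature is treated as categorical and its likelihoods are estimated by counting, exactly as in Section 3.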
5 Bayesian Networks
5.1 Intuition and Purpose
Intuitive Understanding
Naive Bayes assumes every feature is independent given the class, which is often too crude. A Bayesian network relaxes this: it is a directed acyclic graph (DAG) in which nodes are variables and edges encode direct dependencies, so the joint distribution factorizes into one small conditional distribution per node.
Example
Example DAG:

[Diagram: Weather (root variable) with arrows to Sprinkler and Rain; Sprinkler and Rain with arrows to WetGrass (leaf variable).]
Dependency structure:
• Weather → Sprinkler
• Weather → Rain
• Sprinkler → WetGrass
• Rain → WetGrass
Intuitive Understanding
• Weather is the root node: it has no parents and influences both Sprinkler and Rain
• Sprinkler and Rain are intermediate nodes: they depend on Weather and influence WetGrass
• WetGrass is the leaf node: it depends on both Sprinkler and Rain but influences nothing else
The arrows show causal relationships: Weather affects whether the sprinkler is on and whether it rains, and both of these factors determine if the grass gets wet.
5.4 Inference in Bayesian Networks
5.4.1 Exact Inference: Enumeration
Example
To answer a query P(X | e), enumeration sums the joint distribution over every assignment of the hidden variables:

P(X | e) ∝ Σ_H P(X, e, H)

Where H are hidden variables. This can be slow for large networks.
5.4.2 Approximate Inference: Sampling
Common sampling methods include:
• Likelihood weighting
• Gibbs sampling
• Rejection sampling
These simulate values to estimate probabilities without full enumeration.
5.5 Worked Example: Wet Grass Problem
Example
DAG structure: Weather (W) at the root, with edges W → S, W → R, S → G, and R → G.
Variables:
• W: Weather
• S: Sprinkler (On/Off)
• R: Rain (Yes/No)
• G: WetGrass (Yes/No)
The joint distribution factorizes according to the network:

P(W, S, R, G) = P(W) · P(S | W) · P(R | W) · P(G | S, R)

Each factor is read from a CPT, for example:

P(R = Yes | W)    (17)
P(G = Yes | S, R = Yes)    (18)

Query: given that the grass is wet, how likely is rain?

P(R = Yes | G = Yes) = P(R = Yes, G = Yes) / P(G = Yes)

Interpretation: This answers the question: "If the grass is wet, what is the probability that it rained?"
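A query like P(R = Yes | G = Yes) can be estimated by rejection sampling: draw joint samples in topological order and keep only those consistent with the evidence. The CPT values below are illustrative assumptions; the notes do not specify them:

```python
import random

random.seed(0)

# Hypothetical CPTs for the Wet Grass network (values are illustrative only)
P_SUNNY = 0.7                                     # P(W = Sunny)
P_SPRINKLER = {"Sunny": 0.5, "Rainy": 0.1}        # P(S = On | W)
P_RAIN = {"Sunny": 0.1, "Rainy": 0.8}             # P(R = Yes | W)
P_WET = {(True, True): 0.99, (True, False): 0.9,  # P(G = Yes | S, R)
         (False, True): 0.9, (False, False): 0.0}

def sample():
    """Draw one joint sample (W, S, R, G) in topological order."""
    w = "Sunny" if random.random() < P_SUNNY else "Rainy"
    s = random.random() < P_SPRINKLER[w]
    r = random.random() < P_RAIN[w]
    g = random.random() < P_WET[(s, r)]
    return w, s, r, g

# Rejection sampling: estimate P(R = Yes | G = Yes)
kept = rain = 0
for _ in range(100_000):
    _, _, r, g = sample()
    if g:            # keep only samples consistent with the evidence G = Yes
        kept += 1
        rain += r
print(rain / kept)   # estimate of P(R = Yes | G = Yes)
```

Rejection sampling is simple but wasteful when the evidence is rare; likelihood weighting and Gibbs sampling address that inefficiency.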
Summary
Bayesian Networks are powerful for encoding structured probabilistic relationships and
allow for efficient reasoning under uncertainty.
6 Summary of Key Concepts
Summary
• Bayes' Theorem: updates beliefs using evidence. P(t | d) = P(d | t) · P(t) / P(d)
7 Practice Problems
7.1 Problem 1: Medical Diagnosis with Bayes’ Theorem
Practice Problem
A disease affects 0.2% of the population (P(Disease) = 0.002). A test detects it with 95% sensitivity (P(Test+ | Disease) = 0.95) and has a 2% false-positive rate (P(Test+ | No Disease) = 0.02). What is P(Disease | Test+)?

First compute the evidence:
P(Test+) = 0.95 × 0.002 + 0.02 × 0.998 = 0.0019 + 0.01996 = 0.02186

Therefore:
P(Disease | Test+) = 0.0019 / 0.02186 ≈ 0.087
Insight: Despite a 95% accurate test, a positive result only indicates 8.7% chance of
having the disease due to the low base rate!
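This calculation can be verified in a few lines. Note the 2% false-positive rate is inferred from the evidence term 0.02186 shown in the solution:

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(Disease | Test+) via Bayes' theorem."""
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

p = posterior(prior=0.002, sensitivity=0.95, false_positive_rate=0.02)
print(round(p, 3))  # 0.087
```

Varying the prior shows the base-rate effect directly: with a 10% prevalence the same test would give a posterior above 80%.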
Practice Problem
Email classification dataset with three binary features:
For Ham:
Practice Problem
Height classification for gender prediction:
Training Data:
• Male heights (cm): [175, 180, 170, 185, 178, 172, 183, 177]
• Female heights (cm): [160, 165, 158, 162, 167, 163, 159, 164]
For the Female class (µ = 162.25, σ² = 10.21), with the test height 174 cm:

P(174 | Female) = (1 / √(2π × 10.21)) × exp(−(174 − 162.25)² / (2 × 10.21))    (33)
= (1 / 8.01) × exp(−138.06 / 20.42)    (34)
≈ 0.125 × exp(−6.76) ≈ 0.125 × 0.00116 ≈ 0.00015    (35)
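As a sanity check, the likelihood can be recomputed from the raw heights. This sketch uses the sample variance (dividing by n − 1), which gives σ² ≈ 9.64 for the female heights rather than the 10.21 used above, so the numbers differ slightly in the last digits:

```python
import math

female = [160, 165, 158, 162, 167, 163, 159, 164]

mu = sum(female) / len(female)                                # 162.25
var = sum((x - mu) ** 2 for x in female) / (len(female) - 1)  # ≈ 9.64

def gaussian(x, mu, var):
    """Gaussian likelihood of observing x under N(mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

print(mu, round(var, 2))
print(gaussian(174, mu, var))  # small likelihood: 174 cm is unusual for this class
```

The same two lines applied to the male heights would give a much larger likelihood for 174 cm, which is what drives the final classification.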