
Probability-Based Learning

Comprehensive Course Material

Lyes HADJAR
May 31, 2025



Course Overview
This comprehensive course covers probability-based learning methods from fundamentals to
advanced techniques, including Naive Bayes and Bayesian Networks with detailed mathematical
derivations and practical examples.

Contents
1 Big Idea
2 Fundamentals
  2.1 Bayes' Theorem
  2.2 Bayesian Prediction
3 Standard Approach: The Naive Bayes Model
  3.1 Naive Bayes Formula
  3.2 Prediction Rule
  3.3 Example: Naive Bayes with Categorical Features
4 Extensions and Variations
  4.1 Smoothing (Laplace Correction)
  4.2 Continuous Features: Probability Density Functions (PDF)
  4.3 Continuous Features: Binning
5 Bayesian Networks
  5.1 Intuition and Purpose
  5.2 Structure and Semantics
  5.3 Conditional Probability Tables (CPTs)
  5.4 Inference in Bayesian Networks
    5.4.1 Exact Inference: Enumeration
    5.4.2 Approximate Inference: Sampling
  5.5 Worked Example: Wet Grass Problem
6 Summary of Key Concepts
7 Practice Problems
  7.1 Problem 1: Medical Diagnosis with Bayes' Theorem
  7.2 Problem 2: Multi-feature Naive Bayes with Smoothing
  7.3 Problem 3: Continuous Features with Gaussian Distribution

1 Big Idea

Big Idea

Probability-based learning is about revising beliefs as new data arrives. It’s built on
Bayes’ Theorem, which connects prior beliefs with observed evidence to form new (pos-
terior) beliefs.

Intuitive Understanding

Imagine you’re playing a game called Find the Lady where a queen is hidden under one
of three cards. If you have no information, you assume equal probability (1/3) for each
card. But if you watch the dealer favor the right position 19 out of 30 times, you revise
your guess — now you believe the queen is more likely to be on the right.
This updating of belief using evidence is the essence of Bayesian thinking.

2 Fundamentals
2.1 Bayes’ Theorem

Intuitive Understanding

Intuitively, Bayes’ Theorem tells us:

Updated belief ∝ Evidence-based likelihood × Prior belief

Formal Idea

Formally:

P(t | d) = P(d | t) · P(t) / P(d)    (1)
Where:

• P (t | d): Posterior probability (probability of target t given data d)

• P (d | t): Likelihood (probability of seeing data d if target is t)

• P (t): Prior (our belief about t before seeing d)

• P (d): Evidence (overall probability of data d)

We often ignore P (d) for classification because it is the same for all classes.

2.2 Bayesian Prediction


To predict a class for an instance d, we compute:

P(t | d) ∝ P(d | t) · P(t)    (2)

We choose the class with the highest posterior, called the MAP (Maximum A Posteriori) prediction:

t* = arg max_t P(d | t) · P(t)    (3)
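As a small illustration, here is a minimal Python sketch of the MAP rule (the priors and likelihoods are made-up numbers, not taken from these notes): score each class by P(d | t) · P(t) and pick the largest.

```python
# Minimal MAP prediction sketch (hypothetical numbers, purely illustrative).
priors = {"spam": 0.5, "not_spam": 0.5}          # P(t)
likelihoods = {"spam": 0.36, "not_spam": 0.02}   # P(d | t) for the observed d

# Score each class by P(d | t) * P(t); P(d) is the same for all classes.
scores = {t: likelihoods[t] * priors[t] for t in priors}
t_map = max(scores, key=scores.get)
print(scores, "->", t_map)
```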

3 Standard Approach: The Naive Bayes Model


The problem with Bayes’ approach is scalability — we need to estimate probabilities for every
possible combination of features. Naive Bayes simplifies this using a strong assumption:

Formal Idea
Naive Bayes Assumption: All features are conditionally independent given the
target.
This reduces the complexity significantly.

3.1 Naive Bayes Formula


Given features d_1, d_2, ..., d_m and class t:

P(t | d) ∝ P(t) · ∏_{i=1}^{m} P(d_i | t)    (4)

3.2 Prediction Rule


t̂ = arg max_t [ P(t) · ∏_{i=1}^{m} P(d_i | t) ]    (5)

3.3 Example: Naive Bayes with Categorical Features

Example

Assume a dataset for predicting whether an email is SPAM or NOT SPAM based on
three binary features:

• Free: email contains the word "free"

• Buy: email contains the word "buy"

• Click: email contains the word "click"

The dataset is summarized below, where each row shows a unique feature combination
and how many emails fall into that category.
Class Free Buy Click Count
Spam Yes Yes Yes 3
Spam No Yes Yes 2
Not No No Yes 3
Not No No No 2
Total number of training examples: 3 + 2 + 3 + 2 = 10
You want to predict the class of an email with: {Free=Yes, Buy=Yes, Click=Yes}.
Step 1: Compute Prior Probabilities

• P(Spam) = (3 + 2)/10 = 5/10 = 0.5

• P(Not) = (3 + 2)/10 = 5/10 = 0.5

Step 2: Compute Conditional Probabilities (Likelihoods)


For class Spam (total Spam emails = 5):

• P(Free = Yes | Spam) = 3/5 = 0.6

• P(Buy = Yes | Spam) = (3 + 2)/5 = 5/5 = 1.0

• P(Click = Yes | Spam) = (3 + 2)/5 = 5/5 = 1.0

For class Not (total Not emails = 5):

• P(Free = Yes | Not) = 0/5 = 0

• P(Buy = Yes | Not) = 0/5 = 0

• P(Click = Yes | Not) = 3/5 = 0.6

Step 3: Apply Naive Bayes Formula for Each Class


For Spam:

P(Spam | d) ∝ P(Spam) · P(Free = Yes | Spam) · P(Buy = Yes | Spam) · P(Click = Yes | Spam)
            = 0.5 · 0.6 · 1.0 · 1.0
            = 0.3

For Not:

P(Not | d) ∝ P(Not) · P(Free = Yes | Not) · P(Buy = Yes | Not) · P(Click = Yes | Not)
           = 0.5 · 0 · 0 · 0.6
           = 0

Final Prediction: Since P(Spam | d) > P(Not | d), the model predicts Spam.
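The same hand calculation can be scripted. Below is a minimal Python sketch, using the counts from the table above, that reproduces the unsmoothed estimates and the two class scores (0.3 for Spam, 0 for Not).

```python
# Unsmoothed Naive Bayes by hand, using the counts from the spam example above.
counts = {  # (Free, Buy, Click) -> number of training emails, per class
    "Spam": {("Yes", "Yes", "Yes"): 3, ("No", "Yes", "Yes"): 2},
    "Not":  {("No", "No", "Yes"): 3, ("No", "No", "No"): 2},
}
total = sum(sum(c.values()) for c in counts.values())  # 10 training emails

def feature_prob(cls, index, value):
    """P(feature at `index` = value | cls), estimated from raw counts."""
    cls_total = sum(counts[cls].values())
    matching = sum(n for combo, n in counts[cls].items() if combo[index] == value)
    return matching / cls_total

test = ("Yes", "Yes", "Yes")  # Free=Yes, Buy=Yes, Click=Yes
for cls in counts:
    prior = sum(counts[cls].values()) / total
    score = prior
    for i, value in enumerate(test):
        score *= feature_prob(cls, i, value)
    print(cls, round(score, 3))   # Spam -> 0.3, Not -> 0.0
```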

Warning

Question: In this example, why did we calculate P (Click = Y es | Spam) but not
P (Click = N o | Spam)?
Answer: This is an excellent question that highlights a key principle of Naive Bayes
classification.
Key Principle: In Naive Bayes, we only compute the probabilities of the specific
feature values observed in the test instance we’re trying to classify.
Our test email has: {Free=Yes, Buy=Yes, Click=Yes}
Therefore, for each class, we calculate:

P (Free=Yes | Class) · P (Buy=Yes | Class) · P (Click=Yes | Class)

We do not need P (Click=No | Spam) because the test instance has Click=Yes, not
Click=No.
When would we compute P (Click=No | Spam)?

• Only if the test instance had Click=No

• When building a complete probability table for analysis

• When predicting a different email with Click=No



Intuition: Think of it as asking "Given this exact email, how likely is it to be spam?"
We’re not concerned with emails that look different — only the observed features matter
for this specific prediction.

4 Extensions and Variations


4.1 Smoothing (Laplace Correction)

Warning

If a feature-value/class combination never appears in training, its estimated probability becomes zero. This causes the entire product to be zero, even if other features are strongly predictive.

To solve this, we apply Laplace Smoothing with smoothing parameter k:


P(d_i = v | t) = (count(d_i = v, t) + k) / (count(t) + k · V)    (6)
Where:
• k is the smoothing parameter (typically k = 1 for Laplace smoothing)
• V is the number of possible values for feature di (for binary features, V = 2)
• Adding k to numerator and kV to denominator avoids zero probabilities

Example

Applying Laplace Smoothing to the Previous Example (using k = 1):

Updated Conditional Probabilities with Smoothing (k = 1, V = 2 for binary features):

For class Spam (total Spam emails = 5):

• P(Free = Yes | Spam) = (3 + 1)/(5 + 1 · 2) = 4/7 ≈ 0.571

• P(Buy = Yes | Spam) = (5 + 1)/(5 + 1 · 2) = 6/7 ≈ 0.857

• P(Click = Yes | Spam) = (5 + 1)/(5 + 1 · 2) = 6/7 ≈ 0.857

For class Not (total Not emails = 5):

• P(Free = Yes | Not) = (0 + 1)/(5 + 1 · 2) = 1/7 ≈ 0.143

• P(Buy = Yes | Not) = (0 + 1)/(5 + 1 · 2) = 1/7 ≈ 0.143

• P(Click = Yes | Not) = (3 + 1)/(5 + 1 · 2) = 4/7 ≈ 0.571
Updated Naive Bayes Predictions with Smoothing:
For Spam:

P(Spam | d) ∝ P(Spam) · P(Free = Yes | Spam) · P(Buy = Yes | Spam) · P(Click = Yes | Spam)
            = 0.5 · (4/7) · (6/7) · (6/7)
            = 0.5 · 144/343 ≈ 0.210

For Not:

P(Not | d) ∝ P(Not) · P(Free = Yes | Not) · P(Buy = Yes | Not) · P(Click = Yes | Not)
           = 0.5 · (1/7) · (1/7) · (4/7)
           = 0.5 · 4/343 ≈ 0.006

Final Prediction with Smoothing: Since P(Spam | d) > P(Not | d) (0.210 > 0.006), the model still predicts Spam, but now with non-zero probabilities for both classes.
Key Improvement: Without smoothing, P(Not | d) = 0 due to zero probabilities. With smoothing (k = 1), both classes have non-zero probabilities, making the model more robust and providing meaningful probability estimates.
Note on Smoothing Parameter: Different values of k can be used:

• k = 1: Standard Laplace smoothing (add-one smoothing)

• k < 1: Less aggressive smoothing for large datasets

• k > 1: More aggressive smoothing for small or noisy datasets
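To see the effect of k concretely, the small helper below (a sketch, not part of the original notes) applies the smoothed estimate from Equation (6) to the spam counts and reproduces the 4/7, 6/7, and 1/7 values above.

```python
# Laplace-smoothed estimate: (count(d_i = v, t) + k) / (count(t) + k * V)
def smoothed_prob(value_count, class_count, k=1, V=2):
    return (value_count + k) / (class_count + k * V)

# Counts from the spam example (5 Spam emails, 5 Not emails).
print(smoothed_prob(3, 5))         # P(Free=Yes | Spam) = 4/7 ≈ 0.571
print(smoothed_prob(5, 5))         # P(Buy=Yes  | Spam) = 6/7 ≈ 0.857
print(smoothed_prob(0, 5))         # P(Free=Yes | Not)  = 1/7 ≈ 0.143
print(smoothed_prob(0, 5, k=0.5))  # smaller k -> less aggressive smoothing
```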

4.2 Continuous Features: Probability Density Functions (PDF)


Instead of estimating probabilities for every value, we model P (di | t) using a probability distri-
bution, typically Gaussian (normal distribution).

Formal Idea

The Gaussian formula:

P(d_i | t) = (1 / √(2πσ²)) · exp(−(d_i − µ)² / (2σ²))    (7)

Where:

• µ: mean of feature d_i for class t

• σ²: variance of feature d_i for class t

Example

Step-by-step example:
If for class Spam:

• Feature: word count d_i = 50

• µ = 40, σ² = 25

Then:

P(d_i | t = Spam) = (1 / √(2π · 25)) · exp(−(50 − 40)² / (2 · 25))
                  = (1 / √157.08) · exp(−100/50)
                  = (1 / 12.53) · exp(−2)
                  ≈ 0.0798 · 0.1353 ≈ 0.0108
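Such density values are easy to check numerically. A minimal Python sketch (note that scipy.stats.norm takes the standard deviation √σ² as its scale argument, not the variance):

```python
import math
from scipy.stats import norm

mu, var = 40.0, 25.0
x = 50.0

# Direct formula: (1 / sqrt(2*pi*var)) * exp(-(x - mu)^2 / (2*var))
manual = (1.0 / math.sqrt(2 * math.pi * var)) * math.exp(-(x - mu) ** 2 / (2 * var))

# Same value via scipy; norm expects the standard deviation as `scale`.
library = norm.pdf(x, loc=mu, scale=math.sqrt(var))

print(round(manual, 4), round(library, 4))  # both ≈ 0.0108
```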

4.3 Continuous Features: Binning


When dealing with continuous features in Naive Bayes, we cannot directly compute probabilities
like P (Age = 25.7 | Class) because continuous values have infinite possibilities. One effective
strategy is to discretize continuous values into bins (intervals) and treat them as categorical
features.

Big Idea

Core Concept: Transform continuous features into discrete categorical features by grouping similar values into meaningful intervals.

Steps for Binning Approach:


1. Choose bin boundaries based on domain knowledge or data distribution
2. Assign each continuous value to its corresponding bin
3. Apply standard Naive Bayes treating bins as categorical values
4. Compute probabilities using bin frequencies

Example

Age Binning Example:


Suppose we have a dataset predicting loan approval based on age:
Step 1: Define Age Bins
• Bin 1: [18-25] (Young adults)

• Bin 2: [26-40] (Early career)

• Bin 3: [41-55] (Mid career)

• Bin 4: [56+] (Senior)


Step 2: Convert Continuous Ages to Bins
Original Age Assigned Bin Loan Approved Class
23 [18-25] Yes Approved
34 [26-40] No Rejected
19 [18-25] Yes Approved
45 [41-55] Yes Approved
62 [56+] No Rejected
Step 3: Compute Conditional Probabilities
For Approved class (3 instances):

• P(Age ∈ [18-25] | Approved) = 2/3

• P(Age ∈ [26-40] | Approved) = 0/3

• P(Age ∈ [41-55] | Approved) = 1/3

• P(Age ∈ [56+] | Approved) = 0/3

For Rejected class (2 instances):

• P(Age ∈ [18-25] | Rejected) = 0/2

• P(Age ∈ [26-40] | Rejected) = 1/2

• P(Age ∈ [41-55] | Rejected) = 0/2

• P(Age ∈ [56+] | Rejected) = 1/2

Step 4: Make Predictions

For a new applicant aged 28 (falls in the [26-40] bin):

P(Approved | Age = 28) ∝ P(Approved) · P(Age ∈ [26-40] | Approved) = (3/5) · (0/3) = 0

P(Rejected | Age = 28) ∝ P(Rejected) · P(Age ∈ [26-40] | Rejected) = (2/5) · (1/2) = 1/5

Prediction: Rejected (but Laplace smoothing should be used in practice!)

Advantages of Binning:
• Simple to implement and understand
• Works well when meaningful intervals exist
• Robust to outliers within bins
• Preserves interpretability
Disadvantages of Binning:
• Loss of information due to discretization
• Arbitrary bin boundaries can affect results
• May not capture smooth relationships
• Requires domain knowledge for optimal binning

Intuitive Understanding

Choosing Good Bins:

• Equal-width bins: Divide range into equal intervals

• Equal-frequency bins: Each bin contains same number of samples

• Domain-based bins: Use meaningful thresholds (e.g., age groups)

• Decision tree-based: Use optimal splits from decision trees
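As a concrete illustration of the binning step, the sketch below (assuming the age bins defined above) maps raw ages to bin labels with numpy.digitize; the resulting labels can then be treated as categorical features exactly as in Section 3.

```python
import numpy as np

ages = np.array([23, 34, 19, 45, 62, 28])
# Bin edges for [18-25], [26-40], [41-55], [56+] (right edges inclusive).
edges = [25, 40, 55]
labels = ["[18-25]", "[26-40]", "[41-55]", "[56+]"]

bin_idx = np.digitize(ages, bins=edges, right=True)  # indices 0..3
binned = [labels[i] for i in bin_idx]
print(list(zip(ages.tolist(), binned)))
# [(23, '[18-25]'), (34, '[26-40]'), (19, '[18-25]'),
#  (45, '[41-55]'), (62, '[56+]'), (28, '[26-40]')]
```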



5 Bayesian Networks
5.1 Intuition and Purpose

Intuitive Understanding

A Bayesian Network is a graphical model that represents the probabilistic relationships between variables using a Directed Acyclic Graph (DAG). Each node represents a variable, and each edge represents a conditional dependency.

Unlike Naive Bayes, Bayesian Networks allow you to relax the conditional independence assumption and explicitly model dependencies between features.

5.2 Structure and Semantics


Let each node in the DAG represent a random variable. The network defines a joint probability
distribution over all variables:
P(X_1, X_2, ..., X_n) = ∏_{i=1}^{n} P(X_i | Parents(X_i))    (11)

Each variable is conditionally independent of its non-descendants, given its parents.

Example

Example DAG:

[Diagram: Weather → Sprinkler, Weather → Rain, Sprinkler → WetGrass, Rain → WetGrass]

Dependency structure:

• Weather → Sprinkler

• Weather → Rain

• Sprinkler, Rain → WetGrass



Joint probability factorization:

P(Weather, Sprinkler, Rain, WetGrass) = P(Weather) · P(Sprinkler | Weather) · P(Rain | Weather) · P(WetGrass | Sprinkler, Rain)    (12)

Intuitive Understanding

Reading the graph:

• Weather is the root node — it has no parents and influences both Sprinkler and
Rain

• Sprinkler and Rain are intermediate nodes — they depend on Weather and influ-
ence WetGrass

• WetGrass is the leaf node — it depends on both Sprinkler and Rain but influences
nothing else

The arrows show causal relationships: Weather affects whether the sprinkler is on and
whether it rains, and both of these factors determine if the grass gets wet.

5.3 Conditional Probability Tables (CPTs)


Each node has a Conditional Probability Table (CPT) specifying probabilities for each
combination of parent values.

Example

Example: Rain | Weather

Weather Rain=True Rain=False


Sunny 0.1 0.9
Cloudy 0.4 0.6
Rainy 0.8 0.2
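In code, a CPT is often just a nested lookup table keyed by the parent values. A minimal sketch mirroring the Rain | Weather table above:

```python
# CPT for P(Rain | Weather), keyed by the parent's value.
cpt_rain = {
    "Sunny":  {"True": 0.1, "False": 0.9},
    "Cloudy": {"True": 0.4, "False": 0.6},
    "Rainy":  {"True": 0.8, "False": 0.2},
}

print(cpt_rain["Cloudy"]["True"])  # P(Rain=True | Weather=Cloudy) = 0.4
```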

5.4 Inference in Bayesian Networks


Inference means computing the posterior probability of one or more variables given some
observed evidence.

5.4.1 Exact Inference: Enumeration


To compute P(Q | E = e), where Q is the query and E is the evidence:

P(Q | E = e) = Σ_H P(Q, E = e, H) / Σ_{Q,H} P(Q, E = e, H)    (15)

Where H are hidden variables. This can be slow for large networks.

5.4.2 Approximate Inference: Sampling


We can use methods like:

• Likelihood weighting
• Gibbs sampling
• Rejection sampling
These simulate values to estimate probabilities without full enumeration.

5.5 Worked Example: Wet Grass Problem

Example

DAG Structure: Weather (W) → Sprinkler (S), Weather (W) → Rain (R), and both S and R → Wet Grass (G).

Variables:

• W: Weather (Sunny, Rainy)

• S: Sprinkler (On/Off)

• R: Rain (Yes/No)

• G: Wet Grass (Yes/No)

Goal: Compute P (R = Yes | G = Yes)


Step 1: Write joint distribution

P (W, S, R, G) = P (W )P (S | W )P (R | W )P (G | S, R)

Step 2: Use enumeration to marginalize non-query variables

P(R = Yes, G = Yes) = Σ_{W,S} P(W) · P(S | W) · P(R = Yes | W) · P(G = Yes | S, R = Yes)    (16)

Step 3: Compute the denominator to normalize

P(G = Yes) = Σ_R P(R, G = Yes)

Step 4: Compute the posterior

P(R = Yes | G = Yes) = P(R = Yes, G = Yes) / P(G = Yes)

Interpretation: This answers the question: "If the grass is wet, what is the probability
that it rained?"
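To make the enumeration concrete, the sketch below plugs in hypothetical CPT values (the notes do not specify them, so these numbers are purely illustrative) and computes P(R = Yes | G = Yes) exactly by summing the joint distribution over the hidden variables W and S.

```python
from itertools import product

# Hypothetical CPTs (illustrative numbers only; the notes leave them unspecified).
P_W = {"Sunny": 0.7, "Rainy": 0.3}
P_S_given_W = {"Sunny": {"On": 0.4, "Off": 0.6}, "Rainy": {"On": 0.1, "Off": 0.9}}
P_R_given_W = {"Sunny": {"Yes": 0.1, "No": 0.9}, "Rainy": {"Yes": 0.8, "No": 0.2}}
P_G_given_SR = {  # P(G=Yes | S, R); P(G=No | S, R) = 1 - value
    ("On", "Yes"): 0.99, ("On", "No"): 0.9,
    ("Off", "Yes"): 0.8, ("Off", "No"): 0.0,
}

def joint(w, s, r, g):
    """P(W=w, S=s, R=r, G=g) via the factorization from Step 1."""
    p_g_yes = P_G_given_SR[(s, r)]
    p_g = p_g_yes if g == "Yes" else 1 - p_g_yes
    return P_W[w] * P_S_given_W[w][s] * P_R_given_W[w][r] * p_g

# Numerator: sum over hidden W and S with R=Yes, G=Yes fixed.
num = sum(joint(w, s, "Yes", "Yes") for w, s in product(P_W, ["On", "Off"]))
# Denominator: additionally sum over R.
den = sum(joint(w, s, r, "Yes") for w, s, r in product(P_W, ["On", "Off"], ["Yes", "No"]))

print(round(num / den, 3))  # P(R=Yes | G=Yes) under the assumed CPTs
```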

Summary

Bayesian Networks are powerful for encoding structured probabilistic relationships and
allow for efficient reasoning under uncertainty.

6 Summary of Key Concepts

Summary
• Bayes' Theorem: Updates beliefs using evidence. P(t | d) = P(d | t) · P(t) / P(d)

• Naive Bayes: Assumes feature independence. P(t | d) ∝ P(t) · ∏_i P(d_i | t)

• Smoothing: Laplace correction avoids zero probabilities.

• Continuous Features: Modeled using Gaussian PDFs or binned into intervals.

• Bayesian Networks: DAGs that model variable dependencies with conditional probability tables.

• Inference: Can be exact (enumeration) or approximate (sampling methods).



7 Practice Problems
7.1 Problem 1: Medical Diagnosis with Bayes’ Theorem

Practice Problem

A rare disease affects 0.2% of the population. A medical test has:

• Sensitivity (true positive rate): P(Test+ | Disease+) = 0.95

• Specificity (true negative rate): P(Test− | Disease−) = 0.98

• Prior probability: P(Disease+) = 0.002

Part A: What is the probability of a false positive, P(Test+ | Disease−)?

Part B: If someone tests positive, what is the probability they actually have the disease, P(Disease+ | Test+)?

Solution:
Part A:

P(Test+ | Disease−) = 1 − P(Test− | Disease−) = 1 − 0.98 = 0.02

Part B: Using Bayes’ theorem:

P(Disease+ | Test+) = P(Test+ | Disease+) · P(Disease+) / P(Test+)    (19)

First, compute P(Test+) using the law of total probability:

P(Test+) = P(Test+ | Disease+) · P(Disease+) + P(Test+ | Disease−) · P(Disease−)
         = 0.95 × 0.002 + 0.02 × 0.998
         = 0.0019 + 0.01996 = 0.02186

Therefore:

P(Disease+ | Test+) = (0.95 × 0.002) / 0.02186 = 0.0019 / 0.02186 ≈ 0.087
Insight: Despite a 95% accurate test, a positive result only indicates 8.7% chance of
having the disease due to the low base rate!
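A quick numerical check of this result (simply re-running the arithmetic above in Python):

```python
sens, spec, prior = 0.95, 0.98, 0.002

p_test_pos = sens * prior + (1 - spec) * (1 - prior)  # law of total probability
posterior = sens * prior / p_test_pos                 # Bayes' theorem
print(round(p_test_pos, 5), round(posterior, 3))      # 0.02186, 0.087
```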

7.2 Problem 2: Multi-feature Naive Bayes with Smoothing

Practice Problem
Email classification dataset with three binary features:

Email Contains "Free" Contains "Click" Length > 100 Class


1 Yes No Yes Spam
2 Yes Yes No Spam
3 No Yes Yes Spam
4 Yes Yes Yes Spam
5 No No No Ham
6 No No Yes Ham
7 No Yes No Ham
8 No No No Ham

Part A: Calculate all conditional probabilities with Laplace smoothing (k = 1, V = 2).


Part B: Classify a new email: [Contains "Free" = Yes, Contains "Click" = No, Length
> 100 = Yes]
Part C: What happens if we don’t use smoothing? Calculate the difference.
Solution:
Part A: Prior probabilities: P(Spam) = 4/8 = 0.5, P(Ham) = 4/8 = 0.5
With Laplace smoothing (k = 1, V = 2):
For Spam (4 emails):

• P(Free = Yes | Spam) = (3 + 1)/(4 + 2) = 4/6 ≈ 0.667

• P(Click = Yes | Spam) = (3 + 1)/(4 + 2) = 4/6 ≈ 0.667

• P(Length > 100 = Yes | Spam) = (3 + 1)/(4 + 2) = 4/6 ≈ 0.667

For Ham (4 emails):

• P(Free = Yes | Ham) = (0 + 1)/(4 + 2) = 1/6 ≈ 0.167

• P(Click = Yes | Ham) = (1 + 1)/(4 + 2) = 2/6 ≈ 0.333

• P(Length > 100 = Yes | Ham) = (1 + 1)/(4 + 2) = 2/6 ≈ 0.333

Part B: For the new email [Free = Yes, Click = No, Length > 100 = Yes]:

(With k = 1 and V = 2, the smoothed Yes and No probabilities for a feature sum to 1, so P(Click = No | Spam) = 1 − 0.667 = 0.333 and P(Click = No | Ham) = 1 − 0.333 = 0.667.)

For Spam:

P(Spam | features) ∝ 0.5 × 0.667 × (1 − 0.667) × 0.667
                   = 0.5 × 0.667 × 0.333 × 0.667
                   ≈ 0.074

For Ham:

P(Ham | features) ∝ 0.5 × 0.167 × (1 − 0.333) × 0.333
                  = 0.5 × 0.167 × 0.667 × 0.333
                  ≈ 0.019

Prediction: Spam (0.074 > 0.019)


Part C: Without smoothing, P(Free = Yes | Ham) = 0/4 = 0, making the entire Ham probability 0, regardless of other features. Smoothing prevents this zero-probability trap.
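For a library cross-check, scikit-learn's BernoulliNB applies the same add-k smoothing to binary features, (count + α) / (class count + 2α), so with alpha = 1 it should reproduce these estimates. A minimal sketch, assuming the 8-email dataset above encoded as 0/1 feature vectors:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Columns: Free, Click, Length>100 (1 = Yes); rows follow emails 1-8 above.
X = np.array([
    [1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 1, 1],   # Spam
    [0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 0],   # Ham
])
y = np.array(["Spam"] * 4 + ["Ham"] * 4)

clf = BernoulliNB(alpha=1.0)          # Laplace smoothing, k = 1
clf.fit(X, y)

new_email = np.array([[1, 0, 1]])     # Free=Yes, Click=No, Length>100=Yes
print(clf.predict(new_email))         # expected: ['Spam']
print(clf.predict_proba(new_email))   # ≈ [0.20, 0.80] for classes ['Ham', 'Spam']
```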

7.3 Problem 3: Continuous Features with Gaussian Distribution

Practice Problem
Height classification for gender prediction:
Training Data:

• Male heights (cm): [175, 180, 170, 185, 178, 172, 183, 177]

• Female heights (cm): [160, 165, 158, 162, 167, 163, 159, 164]

Part A: Calculate mean and variance for each class.


Part B: What’s the probability of height 174 cm given Male? Given Female?
Part C: Classify someone who is 174 cm tall (assume equal priors).
Solution:
Part A (using the sample variance, i.e. dividing by n − 1 = 7):

Male: µ_M = 177.5 cm, σ²_M ≈ 26.57 cm²

Female: µ_F = 162.25 cm, σ²_F ≈ 9.64 cm²
Part B:

P(174 | Male) = (1 / √(2π × 26.57)) · exp(−(174 − 177.5)² / (2 × 26.57))
              = (1 / 12.92) · exp(−12.25 / 53.14)
              ≈ 0.0774 × 0.794 ≈ 0.061

P(174 | Female) = (1 / √(2π × 9.64)) · exp(−(174 − 162.25)² / (2 × 9.64))
                = (1 / 7.78) · exp(−138.06 / 19.29)
                ≈ 0.1285 × 0.00078 ≈ 0.0001

Part C: Since P(174 | Male) ≫ P(174 | Female), classify as Male.
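These numbers can be verified in a few lines of NumPy (a sketch; np.var(..., ddof=1) gives the sample variance used in Part A):

```python
import numpy as np

male = np.array([175, 180, 170, 185, 178, 172, 183, 177], dtype=float)
female = np.array([160, 165, 158, 162, 167, 163, 159, 164], dtype=float)

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x = 174.0
for name, heights in [("Male", male), ("Female", female)]:
    mu, var = heights.mean(), heights.var(ddof=1)   # sample variance (n - 1)
    print(name, round(mu, 2), round(var, 2), round(gaussian_pdf(x, mu, var), 4))
# Male   177.5   26.57  0.0615
# Female 162.25   9.64  0.0001
```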

You might also like