Aprendizagem 2023
Lab 3: Bayesian learning
Practical exercises
I. Probability theory
1. Consider the following registry where an experiment is repeated six times and four events (A, B, C and D) are detected.

        A   B   C   D
  x1    1   1   0   0
  x2    1   1   1   0
  x3    0   0   0   1
  x4    0   0   0   1
  x5    0   0   0   0
  x6    0   0   0   0

Considering frequentist estimates, compute:

p(A) = 2/6,   p(A,B) = 2/6,   p(A,B,C) = 1/6,   p(A,B,C,D) = 0
p(B|A) = p(A,B) / p(A) = (2/6) / (2/6) = 1
p(A|B,C) = p(A,B,C) / p(B,C) = (1/6) / (1/6) = 1
p(D|A,B,C) = p(A,B,C,D) / p(A,B,C) = 0 / (1/6) = 0
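The same estimates can be checked programmatically. A minimal numpy sketch (the array encodes the table above, one row per trial):

```python
import numpy as np

# Event registry: rows are the six trials, columns are events A, B, C, D.
X = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]])
A, B, C, D = X.T.astype(bool)

p_A = A.mean()                                           # p(A) = 2/6
p_AB = (A & B).mean()                                    # p(A,B) = 2/6
p_ABC = (A & B & C).mean()                               # p(A,B,C) = 1/6
p_ABCD = (A & B & C & D).mean()                          # p(A,B,C,D) = 0
p_B_given_A = (A & B).sum() / A.sum()                    # p(B|A) = 1
p_A_given_BC = (A & B & C).sum() / (B & C).sum()         # p(A|B,C) = 1
p_D_given_ABC = (A & B & C & D).sum() / (A & B & C).sum()  # p(D|A,B,C) = 0
print(p_A, p_AB, p_ABC, p_ABCD, p_B_given_A, p_A_given_BC, p_D_given_ABC)
```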
2. Consider the following two-dimensional measurements: {(-2,2), (-1,3), (0,1), (-2,1)}.
a) What are the maximum likelihood parameters of a multivariate Gaussian distribution for this set of points?

N(x | μ, Σ), with

μ = [-1.25, 1.75]ᵀ,   Σ = [0.92  -0.083; -0.083  0.92],
det(Σ) = 0.83,   Σ⁻¹ = [1.1  0.1; 0.1  1.1]

(Note: the covariance shown uses the unbiased N−1 = 3 denominator; the strict maximum likelihood estimate divides by N = 4, giving 0.69 on the diagonal.)
b) What is the shape of the Gaussian? Draw it approximately using a contour map.
The contours are ellipses centered at (-1.25, 1.75). Since the off-diagonal covariance is small and negative, the ellipses are slightly tilted: the eigenvectors of Σ are (1,1) and (1,-1), with variances 0.84 and 1.00 respectively, so the major axis points along the (1,-1) direction.
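Both a) and b) can be verified numerically. A minimal sketch with numpy, scipy and matplotlib; note that np.cov defaults to the same N−1 denominator used in the values above:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

pts = np.array([[-2, 2], [-1, 3], [0, 1], [-2, 1]])
mu = pts.mean(axis=0)    # [-1.25, 1.75]
Sigma = np.cov(pts.T)    # [[0.92, -0.083], [-0.083, 0.92]] (N-1 denominator)
print(mu, Sigma, np.linalg.det(Sigma), np.linalg.inv(Sigma))

# Contour map for b): ellipses centered at mu, slightly tilted towards (1,-1).
xs, ys = np.meshgrid(np.linspace(-4, 2, 200), np.linspace(-1, 4, 200))
density = multivariate_normal(mu, Sigma).pdf(np.dstack([xs, ys]))
plt.contour(xs, ys, density)
plt.scatter(*pts.T)
plt.show()
```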
II. Bayesian learning
3. Consider the following dataset where:
− 0: False and 1: True
− y1: Fast processing
− y2: Decent Battery
− y3: Good Camera
− y4: Good Look and Feel
− y5: Easiness of Use
− class: iPhone

        y1  y2  y3  y4  y5  class
  x1     1   1   0   1   0     1
  x2     1   1   1   0   0     0
  x3     0   1   1   1   0     0
  x4     0   0   0   1   1     0
  x5     1   0   1   1   1     1
  x6     0   0   1   0   0     1
  x7     0   0   0   0   1     1

And the query vector x_new = [1 1 1 1 1]ᵀ
a) Using Bayes’ rule, without making any assumptions, compute the posterior probabilities for
the query vector. How is it classified?
p(C=0) = 3/7,   p(C=1) = 4/7

p(C=0 | y1=1, y2=1, y3=1, y4=1, y5=1) = p(C=0) p(y1=1, y2=1, y3=1, y4=1, y5=1 | C=0) / p(y1=1, y2=1, y3=1, y4=1, y5=1)
p(C=1 | y1=1, y2=1, y3=1, y4=1, y5=1) = p(C=1) p(y1=1, y2=1, y3=1, y4=1, y5=1 | C=1) / p(y1=1, y2=1, y3=1, y4=1, y5=1)

No training example matches the query, so the frequentist estimates of both likelihoods and of the denominator are zero. The posteriors are therefore undefined (0/0) and we cannot classify the input: a small training sample is not enough to decide under the classic Bayes rule.
b) What is the problem of working without assumptions?
Estimating the full joint distribution requires a number of parameters exponential in the number of features, so there is rarely enough data to estimate it meaningfully. This is especially problematic for datasets with high dimensionality or small sample size.
c) Compute the class for the same query vector under the naive Bayes assumption.
p(C=0 | y1=1, y2=1, y3=1, y4=1, y5=1) = p(C=0) ∏ᵢ p(yᵢ=1 | C=0) / p(y1=1, ..., y5=1)
  = (3/7)(1/3)(2/3)(2/3)(2/3)(1/3) / p(y1=1, ..., y5=1) = 0.0141 / p(y1=1, ..., y5=1)
p(C=1 | y1=1, y2=1, y3=1, y4=1, y5=1) = p(C=1) ∏ᵢ p(yᵢ=1 | C=1) / p(y1=1, ..., y5=1)
  = (4/7)(2/4)(1/4)(2/4)(2/4)(2/4) / p(y1=1, ..., y5=1) = 0.0089 / p(y1=1, ..., y5=1)
Since 0.0141 > 0.0089 and the denominator is shared, the label is C = 0 (not an iPhone).
d) Consider the presence of missing values. Under the same naive Bayes assumption, how do you classify x_new = [1 ? 1 ? 1]ᵀ?
Under the independence assumption, missing features are simply marginalized out, so only the observed features enter the product:
p(C=0 | y1=1, y3=1, y5=1) = (3/7)(1/3)(2/3)(1/3) / p(y1=1, y3=1, y5=1) = 0.0317 / p(y1=1, y3=1, y5=1)
p(C=1 | y1=1, y3=1, y5=1) = (4/7)(2/4)(2/4)(2/4) / p(y1=1, y3=1, y5=1) = 0.0714 / p(y1=1, y3=1, y5=1)
Label C = 1 (iPhone).
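Both c) and d) can be reproduced with a short naive Bayes sketch in numpy (plain frequency estimates, no smoothing; missing entries are passed as None and skipped, i.e. marginalized out):

```python
import numpy as np

X = np.array([[1, 1, 0, 1, 0],
              [1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 0, 1, 1],
              [1, 0, 1, 1, 1],
              [0, 0, 1, 0, 0],
              [0, 0, 0, 0, 1]])
y = np.array([1, 0, 0, 0, 1, 1, 1])

def nb_scores(query):
    """Unnormalized posteriors p(C=c) * prod_i p(y_i = query_i | C=c).

    Entries of `query` set to None are treated as missing and skipped."""
    scores = {}
    for c in (0, 1):
        Xc = X[y == c]
        score = len(Xc) / len(X)  # prior p(C=c)
        for i, v in enumerate(query):
            if v is not None:
                score *= (Xc[:, i] == v).mean()  # likelihood p(y_i=v | C=c)
        scores[c] = score
    return scores

print(nb_scores([1, 1, 1, 1, 1]))        # {0: ~0.0141, 1: ~0.0089} -> C=0
print(nb_scores([1, None, 1, None, 1]))  # {0: ~0.0317, 1: ~0.0714} -> C=1
```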
4. Consider the following dataset

        weight (kg)  height (cm)  NBA player
  x1        170          160           0
  x2         80          220           1
  x3         90          200           1
  x4         60          160           0
  x5         50          150           0
  x6         70          190           1

And the query vector x_new = [100 225]ᵀ
a) Compute the most probable class for the query vector assuming that the likelihoods are 2-dimensional Gaussians.

p(C=0) = 1/2,   p(C=1) = 1/2

        p(y1, y2 | C=0)                            p(y1, y2 | C=1)
μ       [93.(3), 156.(6)]ᵀ                         [80, 203.(3)]ᵀ
Σ       [4433.(3)  216.(6); 216.(6)  33.(3)]       [100  50; 50  233.(3)]

p(C=0 | y1=100, y2=225) = p(C=0) p(y1=100, y2=225 | C=0) / p(y1=100, y2=225)
  = (1/2) N([100, 225]ᵀ | μ = [93.(3), 156.(6)]ᵀ, Σ = [4433.(3)  216.(6); 216.(6)  33.(3)]) / p(y1=100, y2=225)
  = 1.74 × 10⁻⁴⁸ / p(y1=100, y2=225)

p(C=1 | y1=100, y2=225) = p(C=1) p(y1=100, y2=225 | C=1) / p(y1=100, y2=225)
  = (1/2) N([100, 225]ᵀ | μ = [80, 203.(3)]ᵀ, Σ = [100  50; 50  233.(3)]) / p(y1=100, y2=225)
  = 5.38 × 10⁻⁵ / p(y1=100, y2=225)
Classified as an NBA player.
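A sketch of this computation with scipy.stats.multivariate_normal (np.cov again reproduces the N−1 covariance estimates above; only the unnormalized posteriors are compared, since the denominator is shared):

```python
import numpy as np
from scipy.stats import multivariate_normal

X = np.array([[170, 160], [80, 220], [90, 200],
              [60, 160], [50, 150], [70, 190]], dtype=float)
y = np.array([0, 1, 1, 0, 0, 1])
x_new = np.array([100, 225])

for c in (0, 1):
    Xc = X[y == c]
    mu, Sigma = Xc.mean(axis=0), np.cov(Xc.T)
    # Unnormalized posterior: prior * class-conditional density at the query.
    score = (len(Xc) / len(X)) * multivariate_normal(mu, Sigma).pdf(x_new)
    print(c, score)  # c=0: ~1.74e-48, c=1: ~5.38e-05 -> NBA player
```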
b) Compute the most probable class for the query vector, under the naive Bayes assumption, using 1-dimensional Gaussians to model the likelihoods.

        p(y1 | C=0)   p(y1 | C=1)   p(y2 | C=0)   p(y2 | C=1)
μ         93.(3)          80          156.(6)       203.(3)
σ         66.58           10           5.77          15.275

p(C=0 | y1=100, y2=225) = p(C=0) p(y1=100 | C=0) p(y2=225 | C=0) / p(y1=100, y2=225)
  = (1/2) N(100 | μ = 93.(3), σ = 66.58) N(225 | μ = 156.(6), σ = 5.77) / p(y1=100, y2=225)
  = 7.854 × 10⁻³⁵ / p(y1=100, y2=225)

p(C=1 | y1=100, y2=225) = p(C=1) p(y1=100 | C=1) p(y2=225 | C=1) / p(y1=100, y2=225)
  = (1/2) N(100 | μ = 80, σ = 10) N(225 | μ = 203.(3), σ = 15.275) / p(y1=100, y2=225)
  = 2.578 × 10⁻⁵ / p(y1=100, y2=225)
Classified as an NBA player.
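The naive Bayes variant replaces the joint density with a product of univariate Gaussians, one per feature. A minimal sketch with scipy.stats.norm (ddof=1 reproduces the σ values above):

```python
import numpy as np
from scipy.stats import norm

X = np.array([[170, 160], [80, 220], [90, 200],
              [60, 160], [50, 150], [70, 190]], dtype=float)
y = np.array([0, 1, 1, 0, 0, 1])
x_new = np.array([100, 225])

for c in (0, 1):
    Xc = X[y == c]
    score = len(Xc) / len(X)  # prior p(C=c)
    for i in range(X.shape[1]):
        # One univariate Gaussian per feature (independence assumption).
        score *= norm(Xc[:, i].mean(), Xc[:, i].std(ddof=1)).pdf(x_new[i])
    print(c, score)  # c=0: ~7.85e-35, c=1: ~2.58e-05 -> NBA player
```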
5. Assume training examples with m features and a binary class.
a) How many parameters do you have to estimate considering features are Boolean and:
i. no assumptions about how the data is distributed
ii. naive Bayes assumption
One parameter for the prior: p(z=0) = 1 − p(z=1).
Considering the classic Bayesian model: we need (2^m − 1) × 2 parameters to estimate p(y1=v1, ..., ym=vm | z=c), hence (2^m − 1) × 2 + 1 = 2^(m+1) − 1 in total.
Considering naive Bayes: we need to estimate p(yi | z=c). Since there are 2 classes and m features, we have 2 × m × 1 = 2m parameters for the likelihoods. The total number of parameters is 1 + 2m.
b) How many parameters do you have to estimate considering features are numeric and:
iii. multivariate Gaussian assumption
iv. naive Bayes with Gaussian assumption
Similarly, one parameter for the prior: p(z=0) = 1 − p(z=1).
A multivariate Gaussian for the likelihood p(x | z=c) requires a mean vector and a covariance matrix. For m variables, the mean vector has m parameters. The covariance is an m × m matrix; since it is symmetric, we only need to count the diagonal and the upper triangle, i.e. m + m(m−1)/2 = m(m+1)/2 parameters. The total number of parameters is therefore 2(m + m(m+1)/2) + 1 = m² + 3m + 1.
Considering naive Bayes: we need to estimate each p(yi | z=c) by fitting a (univariate) Gaussian distribution with two parameters, μi and σi. Since there are 2 classes and m features, we have 2 × m × 2 = 4m parameters for the likelihoods. The total number of parameters is 1 + 4m.
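These counts can be tabulated for a few values of m to contrast the exponential blow-up of the joint Boolean model against the linear growth of naive Bayes (a small sketch using the formulas derived above):

```python
for m in (2, 5, 10):
    full_boolean = (2**m - 1) * 2 + 1                # joint Bernoulli, both classes + prior
    nb_boolean = 2 * m + 1                           # one Bernoulli parameter per feature/class
    full_gaussian = 2 * (m + m * (m + 1) // 2) + 1   # mean + symmetric covariance per class
    nb_gaussian = 4 * m + 1                          # (mu, sigma) per feature/class
    print(m, full_boolean, nb_boolean, full_gaussian, nb_gaussian)
```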
Programming quests
Resources: Classification and Evaluation notebooks available at the course’s webpage
1. Reuse the sklearn code from last lab where we learned a decision tree on the breast.w data:
a) apply the naïve Bayes classifier with default parameters
b) compare the accuracy of both classifiers using a 10-fold cross-validation
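A possible sketch for both steps. Note the assumption: breast.w is substituted here by sklearn's built-in Wisconsin breast-cancer loader so the snippet is self-contained; in the lab, load the dataset as in the previous notebook:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# a) naive Bayes with default parameters; b) 10-fold CV accuracy for both models
for model in (GaussianNB(), DecisionTreeClassifier()):
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(type(model).__name__, scores.mean())
```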
2. Consider the accuracy estimates collected under a 5-fold CV for two predictive models M1 and
M2, accM1=(0.7,0.5,0.55,0.55,0.6) and accM2=(0.75,0.6,0.6,0.65,0.55).
Using scipy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html),
assess whether the differences in predictive accuracy are statistically significant.
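A minimal sketch with the paired t-test referenced above (paired, because both accuracy samples come from the same folds):

```python
from scipy.stats import ttest_rel

acc_m1 = [0.7, 0.5, 0.55, 0.55, 0.6]
acc_m2 = [0.75, 0.6, 0.6, 0.65, 0.55]

# Paired t-test over per-fold accuracies of the two models.
stat, pvalue = ttest_rel(acc_m1, acc_m2)
print(stat, pvalue)  # if pvalue < 0.05, the difference is statistically significant
```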