
Probabilistic Bayesian Modelling

Debaditya Roy
Probabilistic Model
• 𝑥 – an observation (random variable/vector)
• 𝑋 = {𝑥₁, 𝑥₂, …, 𝑥ₙ}, set of observations, evidence, data
• Probabilistic model – a mathematical form which provides stochastic
information about the random variable 𝑋
• 𝜃 - parameters of a model
• 𝑀 – hyperparameters of a model
Modelling Goals
• Estimation (of the underlying model parameters) − 𝑝(𝜃, 𝑀 | 𝑋)
• Understand
• Generate new data

• Prediction − 𝑝(𝑥* | 𝜃) or 𝑝(𝑥* | 𝑋), where 𝑥* is a new observation

• Model comparison − 𝑝(𝑋 | 𝜃₁) > 𝑝(𝑋 | 𝜃₂) (see the sketch below)

• Solving the first goal helps solve the second and third goals
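As a concrete illustration of the model-comparison goal, the sketch below (with hypothetical coin-toss data and candidate parameter values, not taken from the slides) compares two Bernoulli parameters by their log-likelihood on the observed data.

```python
# A minimal sketch (hypothetical data and parameter values, not from the
# slides) of the model-comparison goal: prefer the parameter value with the
# higher likelihood on the observed data.
import numpy as np

X = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # hypothetical coin tosses (1 = head)

def log_likelihood(theta, X):
    # log p(X | theta) for i.i.d. Bernoulli observations
    return np.sum(X * np.log(theta) + (1 - X) * np.log(1 - theta))

theta1, theta2 = 0.5, 0.75
print(log_likelihood(theta1, X), log_likelihood(theta2, X))
# p(X | theta2) > p(X | theta1) here, so theta2 is the preferred "model"
```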
Some probabilities of interest

Note: We are talking about probability distributions and not single (point) probabilities
Maximum Likelihood Estimation
Rules of Probability
Posterior Distribution
Posterior Distribution
Posterior Predictive Distribution
Marginal Likelihood
Model Comparison/Averaging
A Simple Parameter Estimation Problem
• For a single-parameter model
• Hyperparameters, if any, will be assumed to be fixed/known
Simple Example (MLE)
• Consider a sequence of N coin tosses (call head = 1, tail = 0)
• The 𝑛ᵗʰ outcome 𝑥ₙ is a binary random variable ∈ {0, 1}
• Assume 𝜃 to be the probability of a head (the parameter we wish to estimate)

• Each likelihood term 𝑝(𝑥ₙ | 𝜃) is Bernoulli: 𝑝(𝑥ₙ | 𝜃) = 𝜃^𝑥ₙ (1 − 𝜃)^(1−𝑥ₙ)

• Log-likelihood: ∑ₙ₌₁ᴺ log 𝑝(𝑥ₙ | 𝜃) = ∑ₙ₌₁ᴺ [𝑥ₙ log 𝜃 + (1 − 𝑥ₙ) log(1 − 𝜃)]

• Taking the derivative of the log-likelihood w.r.t. 𝜃 and setting it to zero gives:

  𝜃̂ₘₗₑ = (∑ₙ₌₁ᴺ 𝑥ₙ) / 𝑁

• 𝜃̂ₘₗₑ in this example is simply the fraction of heads! (A numerical check follows below.)
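A minimal numerical check of the closed-form result, using synthetic data (the tosses below are assumed for illustration): the fraction of heads should match a direct numerical maximization of the log-likelihood.

```python
# A minimal numerical check (synthetic data, assumed for illustration) that the
# closed-form MLE, the fraction of heads, matches a direct maximization of the
# Bernoulli log-likelihood.
import numpy as np
from scipy.optimize import minimize_scalar

X = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])  # hypothetical coin tosses
theta_closed_form = X.mean()                   # sum(x_n) / N = fraction of heads

def neg_log_lik(theta):
    return -np.sum(X * np.log(theta) + (1 - X) * np.log(1 - theta))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(theta_closed_form, res.x)                # both should be ~0.7
```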
MAP Estimate
Posterior Distribution

Posterior has the same form as prior – conjugate prior


Posterior Predictive Distribution
Visualization

• Prior: 𝐵𝑒𝑡𝑎(2, 2)
• Likelihood (scaled)
• Posterior: 𝐵𝑒𝑡𝑎(14, 10)

Vertical lines for:

• MLE: 𝜃 = 12/20 = 0.60
• MAP: 𝜃 = 13/22 ≈ 0.59
• Bayesian mean: 𝜃 = 14/24 ≈ 0.58

The Bayesian mean and MAP are pulled slightly toward the prior compared to the MLE (the sketch below reproduces these numbers).
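The sketch below reproduces the numbers on this slide from the Beta(2, 2) prior and the implied data (12 heads in 20 tosses); only the use of scipy.stats.beta is an implementation choice.

```python
# A short sketch reproducing the numbers on this slide: a Beta(2, 2) prior and
# 12 heads in 20 tosses, which gives the Beta(14, 10) posterior shown above.
from scipy.stats import beta

a0, b0 = 2, 2                           # prior Beta(a0, b0)
heads, N = 12, 20                       # observed data
a, b = a0 + heads, b0 + (N - heads)     # posterior Beta(14, 10)

theta_mle  = heads / N                  # 12/20 = 0.60
theta_map  = (a - 1) / (a + b - 2)      # 13/22 ≈ 0.59
theta_mean = beta(a, b).mean()          # 14/24 ≈ 0.58
print(theta_mle, theta_map, theta_mean)
```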
Multinoulli Observation Model
Multinoulli Model
Detour: Dirichlet Distribution
A Bag of Proportions
Imagine you're trying to model the proportions of 𝐾 different categories (say: red, green, blue
marbles in a bag). But instead of knowing the exact proportions, you're uncertain — and you
want a probabilistic guess of what those proportions might be.

The Dirichlet distribution gives you a way to describe that uncertainty:


• Each sample from a Dirichlet distribution gives you a possible set of proportions (like: 60%
red, 30% green, 10% blue).
• Different parameters of the Dirichlet control what kinds of proportions you're more likely to
see.
Detour: Dirichlet Distribution
The Dirichlet distribution has a parameter vector α = [α₁, α₂, …, α_K], one for each of the 𝐾 categories.
Here’s what those parameters intuitively do:

• αᵢ > 1 → “I believe the 𝑖ᵗʰ category will have a large proportion.”

• αᵢ < 1 → “I believe the 𝑖ᵗʰ category will have a small proportion (or maybe even zero).”

• αᵢ = 1 → “I have no strong preference for the 𝑖ᵗʰ category.”

The sum of the αs, often denoted α₀ = ∑ᵢ αᵢ, controls the concentration:


• High α₀ (e.g. all αᵢ = 10): samples are tightly clustered around the mean (less variability).
• Low α₀ (e.g. all αᵢ = 0.2): samples are sparse — most of the probability mass goes to just
one or two categories in each sample.
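A small sketch of this concentration effect, using NumPy's Dirichlet sampler; the α values below are illustrative choices, not from the slides.

```python
# A small sketch of the concentration effect described above, using NumPy's
# Dirichlet sampler (the alpha values are illustrative choices).
import numpy as np

rng = np.random.default_rng(0)
for alpha in ([10.0, 10.0, 10.0], [0.2, 0.2, 0.2]):
    samples = rng.dirichlet(alpha, size=5)
    print("alpha =", alpha)
    print(np.round(samples, 2))
# High alpha_0: samples cluster near (1/3, 1/3, 1/3).
# Low  alpha_0: most of the mass lands on one or two categories per sample.
```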
Detour: Dirichlet Distribution
Posterior Distribution
Exercise
For Multinoulli Likelihood and Dirichlet Prior
- What is the MLE/MAP?
- Posterior Predictive Distribution?
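If you want to check your derivations numerically, the sketch below (with an assumed symmetric prior and hypothetical counts) implements the standard closed-form answers for this conjugate pair: the posterior adds the observed counts to α, the MLE is the empirical proportion, and the posterior predictive is the normalized posterior α.

```python
# Hypothetical counts and prior, for numerically checking the exercise answers.
import numpy as np

alpha  = np.array([1.0, 1.0, 1.0])          # Dirichlet prior parameters
counts = np.array([5, 3, 2])                # observed category counts n_k

alpha_post = alpha + counts                 # Dirichlet posterior parameters
theta_mle  = counts / counts.sum()          # empirical proportions
theta_map  = (alpha_post - 1) / (alpha_post.sum() - len(alpha_post))
predictive = alpha_post / alpha_post.sum()  # posterior predictive over categories
print(alpha_post, theta_mle, theta_map, predictive)
```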
Gaussian Models
• Univariate with fixed variance
• Univariate with fixed mean
• Univariate with varying mean and variance
• Multivariate
Detour: Generative Models
Generative models are invariably also probabilistic models

• Image-to-image translation
• Deepfake generation
• Anomaly detection in medical imaging
• Generating synthetic but interpretable data
• High-fidelity audio generation (e.g., WaveGlow for speech synthesis)
• Text-to-image generation

Figure credit: Lilian Weng


Fixed Variance Gaussian Model
Bayesian Inference for Mean of a Gaussian

Notion of Sufficient Statistics


We only need the sufficient statistics to estimate the parameters; the values of individual observations aren't needed.
Likelihood

Prior
Completing the square

Resulting Posterior
Posterior Predictive Distribution

Convolution of Gaussians
Posterior Predictive Distribution

Why? Because the predictive distribution adds the observation noise variance on top of the posterior uncertainty (a convolution of Gaussians), as sketched below.
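The update equations themselves appear as images on the slides; as a hedged sketch, the block below implements the standard conjugate update for a Gaussian mean with known variance (prior values and synthetic data are assumptions). The predictive variance is the posterior variance plus the observation variance.

```python
# A sketch of the standard conjugate update for a Gaussian mean with known
# variance; prior N(mu0, tau0_sq) on mu, likelihood N(mu, sigma_sq) per point.
# The numbers below are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
sigma_sq = 1.0                                             # known noise variance
x = rng.normal(loc=2.0, scale=np.sqrt(sigma_sq), size=30)  # synthetic data

mu0, tau0_sq = 0.0, 10.0                                   # Gaussian prior on mu
N, x_sum = len(x), x.sum()                                 # sufficient statistics

tau_N_sq = 1.0 / (1.0 / tau0_sq + N / sigma_sq)            # posterior variance
mu_N = tau_N_sq * (mu0 / tau0_sq + x_sum / sigma_sq)       # posterior mean

# Posterior predictive for a new x*: Gaussian, variance = tau_N_sq + sigma_sq
print(mu_N, tau_N_sq, tau_N_sq + sigma_sq)
```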


Fixed Mean Gaussian Model
Choosing a Conjugate Prior for 𝜎²
Goal: Find a prior 𝑝(𝜎²) that makes posterior inference tractable (i.e., a conjugate prior).
Posterior Distribution over 𝜎² or the Precision 𝜆

sum of squared deviations


Visualization

• The posterior sharpens around the true variance.
• Bayesian inference updates our belief after observing data (see the sketch below).
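As a hedged sketch of this update (the slides' exact parameterization appears as images), the block below uses a Gamma(a₀, b₀) prior on the precision λ = 1/σ², equivalent to an inverse-gamma prior on σ²; hyperparameters and data are assumptions for illustration.

```python
# A sketch of the conjugate update for the precision lambda = 1 / sigma^2 when
# the mean is known: Gamma(a0, b0) prior (shape a0, rate b0), equivalent to an
# inverse-gamma prior on sigma^2. Hyperparameters and data are assumptions.
import numpy as np

rng = np.random.default_rng(0)
mu_known, sigma_true = 0.0, 2.0
x = rng.normal(mu_known, sigma_true, size=200)   # synthetic data, mean known

a0, b0 = 1.0, 1.0                                # prior shape and rate
sq_dev = np.sum((x - mu_known) ** 2)             # sum of squared deviations

a_N = a0 + len(x) / 2.0                          # posterior shape
b_N = b0 + 0.5 * sq_dev                          # posterior rate

print(a_N / b_N, 1.0 / sigma_true ** 2)          # posterior mean of lambda vs. true precision
```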
Univariate Gaussian — Unknown Mean & Variance

𝜅₀ is a scaling parameter that determines the confidence in the prior belief about 𝜇.
Posterior Derivation — Normal-Inverse-Gamma
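Since the derivation itself appears as equations on the slide, here is a minimal numerical sketch of the standard Normal-Inverse-Gamma update; the hyperparameters (𝜇₀, 𝜅₀, 𝑎₀, 𝑏₀) and synthetic data are assumptions for illustration.

```python
# A sketch of the standard Normal-Inverse-Gamma update for unknown mean and
# variance; the hyperparameter values and synthetic data are assumptions.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.5, 1.0, size=50)                  # synthetic data
N, xbar = len(x), x.mean()
ss = np.sum((x - xbar) ** 2)                       # sum of squared deviations

mu0, kappa0, a0, b0 = 0.0, 1.0, 1.0, 1.0           # NIG prior hyperparameters

kappa_N = kappa0 + N
mu_N = (kappa0 * mu0 + N * xbar) / kappa_N
a_N = a0 + N / 2.0
b_N = b0 + 0.5 * ss + kappa0 * N * (xbar - mu0) ** 2 / (2.0 * kappa_N)

print(mu_N, b_N / (a_N - 1))   # posterior mean of mu and posterior mean of sigma^2
```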
Visualization
Multivariate Gaussian

A two-dimensional Gaussian
Multivariate Gaussian: Examples
The covariance matrix Σ determines:
• Shape of the distribution
• Orientation and spread (illustrated in the sketch below)
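A small sketch of this point with illustrative covariance matrices (assumed, not from the slides): sampling from each Gaussian and computing the empirical covariance approximately recovers Σ.

```python
# A small sketch (with illustrative covariance matrices) of how Sigma controls
# shape, orientation, and spread; the empirical covariance of samples should
# approximately recover each Sigma.
import numpy as np

rng = np.random.default_rng(0)
mean = np.zeros(2)
covs = {
    "isotropic":  np.eye(2),
    "elongated":  np.array([[3.0, 0.0], [0.0, 0.3]]),
    "correlated": np.array([[1.0, 0.8], [0.8, 1.0]]),
}
for name, Sigma in covs.items():
    samples = rng.multivariate_normal(mean, Sigma, size=2000)
    print(name, np.round(np.cov(samples, rowvar=False), 2))
```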
Multivariate Gaussian: Marginals and Conditionals
Multivariate Gaussian : Full Bayesian Estimation
Multivariate Gaussian : Full Bayesian Estimation
Multivariate Gaussian : Full Bayesian Estimation
Linear Gaussian Model Formulation
LGM ↔ Bayesian Inference Mapping
Bayesian Inference                       Linear Gaussian Model
Unknown 𝝁                                Latent variable
𝐱ᵢ = 𝜇 + 𝜖                               LGM equation
Gaussian noise 𝜖 ∼ 𝒩(0, 𝚺)               Measurement uncertainty
Gaussian prior on 𝝁                      Conjugate prior
Posterior of 𝝁                           LGM inference result

Bayesian inference in this setup is equivalent to inference in a Linear Gaussian Model where
parameters (like the mean vector) are latent variables and observations are generated through a
linear-Gaussian transformation.
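As a hedged sketch of the inference this mapping describes, the block below computes the posterior over the latent mean 𝝁 for 𝐱ᵢ = 𝝁 + 𝜖 with known 𝚺 and a Gaussian prior; the prior, 𝚺, and synthetic data are assumptions for illustration.

```python
# A sketch of the multivariate conjugate update described by this mapping:
# x_i = mu + eps with known Sigma, and a Gaussian prior on mu. The prior,
# Sigma, and synthetic data below are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
D, N = 2, 40
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])             # known noise covariance
mu_true = np.array([1.0, -2.0])
X = rng.multivariate_normal(mu_true, Sigma, size=N)    # observations x_i

m0, S0 = np.zeros(D), 10.0 * np.eye(D)                 # Gaussian prior on mu

S0_inv, Sigma_inv = np.linalg.inv(S0), np.linalg.inv(Sigma)
S_N = np.linalg.inv(S0_inv + N * Sigma_inv)            # posterior covariance of mu
m_N = S_N @ (S0_inv @ m0 + Sigma_inv @ X.sum(axis=0))  # posterior mean of mu

print(np.round(m_N, 2))                                # should be close to mu_true
```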
Gaussian Observation Model
• MLE/MAP for 𝜇, 𝜎² (or both) is straightforward in Gaussian observation models.

• The posterior is also straightforward in most situations for such models.

• (As we saw) computing the posterior of 𝜇 is easy (using a Gaussian prior) if the variance 𝜎² is known.
• Likewise, computing the posterior of 𝜎² is easy (using a gamma prior on the precision 1/𝜎², i.e., an inverse-gamma prior on 𝜎²) if the mean 𝜇 is known.

• If 𝜇 and 𝜎² are both unknown, posterior computation requires computing 𝑝(𝜇, 𝜎² | 𝒙).

• Computing the joint posterior 𝑝(𝜇, 𝜎² | 𝒙) exactly requires a jointly conjugate prior 𝑝(𝜇, 𝜎²).
• The "Gaussian-gamma" ("Normal-gamma") prior is such a conjugate prior – a product of a normal and a gamma distribution.
• Note: Computing joint posteriors exactly is possible only in rare cases such as this one.

• If each observation 𝒙ₙ ∈ ℝᴰ, we can assume a likelihood/observation model 𝒩(𝒙 | 𝝁, 𝚺).

• Need to estimate a vector-valued mean 𝝁 ∈ ℝᴰ: can use a multivariate Gaussian prior.
• Need to estimate a 𝐷 × 𝐷 positive definite covariance matrix 𝚺: can use a Wishart prior on the precision matrix 𝚺⁻¹ (equivalently, an inverse-Wishart prior on 𝚺).
• If 𝝁 and 𝚺 are both unknown, can use a Normal-Wishart as a jointly conjugate prior.
References
• Zoubin Ghahramani, Probabilistic machine learning and artificial intelligence, Nature, 521(7553), 452–459, 2015 (freely available online)

• Section 4.6 and Section 11.7 of Kevin Murphy, Probabilistic Machine Learning: An Introduction (PML-1), MIT Press, 2022 (freely available online)

• Chapter 2 and Appendix B of Christopher Bishop, Pattern Recognition and Machine Learning (PRML), Springer, 2007 (freely available online)

• Kevin Murphy, Conjugate Bayesian analysis of the Gaussian distribution, https://www.cs.ubc.ca/~murphyk/Papers/bayesGauss.pdf

• Probabilistic Machine Learning (CS772A), Piyush Rai
