Introduction to Deep Generative Modeling
Lecture #3
HY-673 – Computer Science Dep., University of Crete
Professors: Yannis Pantazis & Yannis Stylianou
TAs: Michail Raptakis & Michail Spanakis
Taxonomy of Deep Generative Models
According to the Likelihood Function

• Exact likelihood:
  – ARMs: (R)NADE, WaveNet, WaveRNN, GPT, ...
  – NFs: Planar, Coupling, MAFs/IAFs, ...
• Approximate likelihood:
  – VAEs: Vanilla, β-VAE, VQ-VAE, ...
  – EBMs: Belief nets, Boltzmann machines, ...
  – DPMs: diffusion, denoising, score, ...
• Implicit likelihood:
  – GANs: Vanilla, WGAN, 𝑓-GAN, (𝑓, Γ)-GAN, ...
  – GGFs: KALE, Lipschitz-reg., ...
Introduction to Estimator Theory

Let D = {x1, . . . , xn} be a set of data drawn from pd(x), and let pθ(x), θ ∈ Θ, be a family of models. A point estimator θ̂ = θ̂(D) is a random variable for which we want:

pθ̂(x) ≈ pd(x)
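A minimal sketch of this setup in code, assuming purely for illustration that both pd and the model family are Gaussian, so the point estimator θ̂(D) is just the fitted mean and standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=500)   # D = {x_1, ..., x_n} drawn from p_d = N(2, 1.5^2)

# A point estimator theta_hat = theta_hat(D): here the sample mean and standard deviation,
# so p_theta_hat is the Gaussian N(mean, std^2) fitted to the data.
theta_hat = (data.mean(), data.std(ddof=1))
print(theta_hat)   # close to the true parameters (2.0, 1.5), so p_theta_hat ≈ p_d
```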
Introduction to Estimator Theory

• How to construct an estimator?
  – Maximum Likelihood Estimation (MLE)
  – Maximum A Posteriori (MAP) Estimation
  – Based on a Probability Distance or a Divergence (implicit)
  – Bayesian Inference (learns a distribution over the estimator's parameters)
Maximum Likelihood Estimator
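For reference, with data D = {x1, . . . , xn} the log-likelihood used below and the resulting estimator are commonly written as (the 1/n normalization is a convention and does not change the argmax):

$$L_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \log p_\theta(x_i), \qquad \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta \in \Theta} L_n(\theta).$$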
– Ln(θ̂1) > Ln(θ̂2) implies that θ̂1 is more likely to have generated the observed samples x1, . . . , xn.
– Thus, the log-likelihood provides a ranking of a model's fitness/accuracy/match to the data.
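A small numerical illustration of this ranking, assuming purely for illustration Gaussian data and two candidate Gaussian models:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=1000)   # samples from p_d = N(2, 1)

# Average log-likelihood of a candidate model p_theta = N(mu, 1).
def avg_log_likelihood(mu, x):
    return norm.logpdf(x, loc=mu, scale=1.0).mean()

L1 = avg_log_likelihood(2.1, x)   # model close to the data distribution
L2 = avg_log_likelihood(0.0, x)   # model far from the data distribution
print(L1 > L2)                    # True: the better-fitting model ranks higher
```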
MLE Example #1

• The MLE is found by setting the derivative of the likelihood to zero:
  $$\frac{d}{d\theta} L(\hat{\theta}; \mathcal{D}) = 0$$
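As one concrete instance (the choice of a Gaussian with known variance is an assumption made here for illustration), take pθ(x) = N(x; θ, σ²); the first-order condition then yields the sample mean:

$$\frac{d}{d\theta}\sum_{i=1}^{n} \log p_\theta(x_i) = \sum_{i=1}^{n}\frac{x_i - \theta}{\sigma^2} = 0 \;\;\Longrightarrow\;\; \hat{\theta}_{\mathrm{MLE}} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$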
MLE Example #2
MLE Example #3

• Partial derivative or gradient vector:
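Assuming, as the least-squares equivalence below suggests, a linear model yi = θT xi + ϵi with ϵi ∼ N(0, 1), this gradient reads (up to a positive scaling that does not affect the maximizer):

$$\nabla_\theta L(\theta; \mathcal{D}) = \sum_{i=1}^{n}\left(y_i - \theta^{\top} x_i\right) x_i.$$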
• Maximizing L(θ) is equivalent to minimizing the Sum of Squares (Least Squares).
• Exactly the same solution as LS!
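A minimal numerical check of this equivalence, under the linear-Gaussian model assumed above (the data and dimensions are illustrative):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(size=n)        # y_i = theta^T x_i + eps_i, eps_i ~ N(0, 1)

# Least-squares solution.
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# MLE: maximize the Gaussian log-likelihood, i.e. minimize its negative.
def neg_log_lik(theta):
    return -norm.logpdf(y - X @ theta, scale=1.0).sum()

theta_mle = minimize(neg_log_lik, x0=np.zeros(d)).x

print(np.allclose(theta_ls, theta_mle, atol=1e-3))   # True: MLE and LS coincide
```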
MLE Example #4
• Logistic regression with sigmoids, a.k.a. binary classification.
• Dataset: D = {(x1, y1), . . . , (xn, yn)} with xi ∈ Rd and yi ∈ {0, 1}.
• Model family: pθ(yi = 1|xi) = σ(θT xi), pθ(yi = 0|xi) = 1 − pθ(yi = 1|xi),
  where θ ∈ Rd and σ(z) = 1/(1 + e^{−z}) is the sigmoid function.
• There is no closed-form solution here; the MLE is computed iteratively, e.g. by gradient ascent on the log-likelihood, where the step size is the learning rate.
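A minimal sketch of such a gradient-ascent loop; the data, learning rate, and iteration count below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 2
X = rng.normal(size=(n, d))
theta_true = np.array([1.5, -1.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = rng.binomial(1, sigmoid(X @ theta_true))   # labels y_i ~ Bernoulli(sigma(theta^T x_i))

# Gradient ascent on the (average) log-likelihood of the logistic model.
theta = np.zeros(d)
lr = 0.1                                       # learning rate
for _ in range(2000):
    p = sigmoid(X @ theta)                     # p_theta(y_i = 1 | x_i)
    grad = X.T @ (y - p) / n                   # gradient of the average log-likelihood
    theta += lr * grad

print(theta, theta_true)                       # the MLE ends up close to theta_true
```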
Kullback-Leibler Divergence (KLD)

• Geometric interpretation: MLE is equivalent to minimizing the KLD of pd(x) w.r.t. pθ(x).
Maximum Likelihood Estimator

• Maximizing the likelihood amounts to minimizing the cross entropy H×(pd||pθ), where the cross entropy of a probability P with PDF p(x) with respect to a probability Q with PDF q(x) is defined as
  $$H^{\times}(P\|Q) := -\int \log q(x)\, p(x)\, dx.$$
Kullback-Leibler Divergence

• MLE is also equivalent to minimizing the KLD of pd(x) w.r.t. pθ(x):
  $$\arg\max_{\theta} L(\theta; p_d) = \arg\min_{\theta} D_{KL}(p_d \| p_\theta)$$

• The Kullback-Leibler divergence (KLD) of P w.r.t. Q is defined as:
  $$D_{KL}(P\|Q) := \int \log\frac{p(x)}{q(x)}\, p(x)\, dx = \int \log p(x)\, p(x)\, dx - \int \log q(x)\, p(x)\, dx$$

• Equivalently, $D_{KL}(P\|Q) = -H(P) + H^{\times}(P\|Q)$, i.e. (negative) entropy plus cross entropy.
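A small numerical check of this decomposition on discrete distributions (the particular distributions are illustrative):

```python
import numpy as np

# Two discrete distributions on the same support (illustrative values).
p = np.array([0.5, 0.3, 0.2])   # P
q = np.array([0.4, 0.4, 0.2])   # Q

kl            = np.sum(p * np.log(p / q))     # D_KL(P || Q)
entropy       = -np.sum(p * np.log(p))        # H(P)
cross_entropy = -np.sum(p * np.log(q))        # H^x(P || Q)

print(np.isclose(kl, -entropy + cross_entropy))   # True: D_KL = -H(P) + H^x(P||Q)
print(kl >= 0)                                    # True: non-negativity
```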
Kullback-Leibler Divergence

• DKL(P||Q) ≥ 0, with equality if and only if P = Q (a consequence of Jensen's inequality).
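One standard way to obtain the inequality is Jensen's inequality applied to the concave logarithm:

$$-D_{KL}(P\|Q) = \int \log\frac{q(x)}{p(x)}\, p(x)\, dx \;\le\; \log \int \frac{q(x)}{p(x)}\, p(x)\, dx = \log \int q(x)\, dx = \log 1 = 0.$$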
Maximum A Posteriori Estimator

• The MAP estimator maximizes the posterior distribution of the parameters given the data:
  $$\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} p(\theta \mid \mathcal{D})$$
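Expanding with Bayes' rule (the evidence p(D) does not depend on θ and can be dropped from the maximization):

$$\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})} = \arg\max_{\theta} \left[\, \log p(\mathcal{D} \mid \theta) + \log p(\theta)\, \right],$$

so MAP adds a log-prior term to the MLE objective, which acts as a regularizer in the example below.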
• Linear model: D = {(x1, y1), . . . , (xn, yn)}, xi ∈ Rd, yi ∈ R, with
  yi = θT xi + ϵi, ϵi ∼ N(0, 1)
  – p(θ) = N(0, λ−1 Id) ⇒ ridge regression, a.k.a. (Tikhonov) regularized Least Squares (see the sketch below).
  – p(θ) = Laplace(0, λ−1) ⇒ lasso regression (least absolute shrinkage and selection operator).
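A minimal sketch of the Gaussian-prior case under the model above; with unit noise variance the MAP estimate has the ridge closed form (XᵀX + λI)⁻¹Xᵀy (the data and λ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100, 5, 1.0
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + rng.normal(size=n)        # y_i = theta^T x_i + eps_i, eps_i ~ N(0, 1)

# MAP with prior p(theta) = N(0, lam^{-1} I): maximizing log p(D|theta) + log p(theta)
# is equivalent to minimizing ||y - X theta||^2 + lam ||theta||^2, i.e. ridge regression.
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(theta_map)                               # shrunk towards 0 relative to plain LS
```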
Estimator Assessment

• Basic toolkit to assess an estimator: e.g. bias, variance, consistency, and efficiency.
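A minimal Monte Carlo sketch of bias and variance, using (as an illustrative assumption) two estimators of the variance of a Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0                                 # data ~ N(0, 4), so the true variance is 4
n, trials = 10, 20_000

# Two estimators of the variance: the MLE (divide by n) and the unbiased one (divide by n-1).
est_mle, est_unbiased = [], []
for _ in range(trials):
    x = rng.normal(0.0, np.sqrt(true_var), size=n)
    est_mle.append(np.var(x, ddof=0))
    est_unbiased.append(np.var(x, ddof=1))

for name, est in [("MLE (1/n)", est_mle), ("unbiased (1/(n-1))", est_unbiased)]:
    est = np.array(est)
    print(name, "bias:", est.mean() - true_var, "variance:", est.var())
```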
Estimator Assessment

• Chebyshev's inequality: P(|θ̂ − E[θ̂]| ≥ ε) ≤ Var(θ̂)/ε², so an estimator whose variance vanishes concentrates around its mean.
Estimator Assessment
• Let θ̂1 and θ̂2 be two unbiased estimators of θ∗ . θ̂1 is more efficient than
θ̂2 if and only if Var(θ̂1 ) < Var(θ̂2 ).
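A small simulation illustrating efficiency, comparing (as an illustrative assumption) two unbiased estimators of the mean of a Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 20_000

# Two unbiased estimators of the mean of N(0, 1): the sample mean and the sample median.
means   = np.array([rng.normal(size=n).mean()     for _ in range(trials)])
medians = np.array([np.median(rng.normal(size=n)) for _ in range(trials)])

print("Var(sample mean):  ", means.var())     # about 1/n = 0.01
print("Var(sample median):", medians.var())   # about pi/(2n) ≈ 0.016, so the mean is more efficient
```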
References

1. Larry Wasserman, All of Statistics: A Concise Course in Statistical Inference (Chapters 6 & 9), Springer (2004).
3. Matrix Calculus:
   [Link]
   [Link]