
Review: An Introduction to VAE

Type: Literature

Introduction
Motivation
Discriminative models predict outcomes based on observed data, whereas generative models learn the joint distribution of all variables, simulating how data is generated in the real world

Generative modeling is attractive because it can incorporate physical laws and can simplify unknown details by treating them as noise → intuitive, interpretable models that can be tested against observations to confirm or reject hypotheses

Transforming a generative model into a discriminative one involves Bayes’ rule, though the computational cost is high. Discriminative methods directly map inputs to predictions and are efficient with large data → but they suffer higher bias if the model assumptions are incorrect

Variational Autoencoders (VAEs) bridge the two approaches by combining two models: the encoder (recognition model) and the decoder (generative model)

The encoder approximates the posterior distribution, facilitating expectation maximization during training

VAEs improve efficiency by “amortized inference”, where a single set of parameters models the relationship between input and latent variables

The VAE framework is inspired by the Helmholtz Machine but avoids its
inefficiencies by optimizing a single objective through the reparameterization
trick, which reduces gradient noise during learning

VAEs combine graphical models with deep learning, organizing latent
variables in hierarchical Bayesian networks, and are optimized through
expectation maximization and backpropagation

Aim
A principled approach for jointly learning deep latent-variable models and
inference models through stochastic gradient descent → supporting
applications like generative modeling, semi-supervised learning, and
representation learning

The structure of the paper includes a discussion of probabilistic models, directed graphical models, and their integration with neural networks, as well as learning approaches for fully observed and deep latent-variable models (DLVMs)

Chapter 2 covers the basics of VAEs

Chapter 3 explores advanced inference techniques

Chapter 4 addresses advanced generative models

Mathematical notation can be found in section A.1

Probabilistic models and Variational inference


Probabilistic models inherently involve unknown factors → a complete model would specify all correlations and higher-order dependencies between its variables, forming a comprehensive joint probability distribution

A vector x represents all observed variables whose joint distribution is modeled

The true distribution of x, denoted p∗(x), is generally unknown, so the goal is to approximate it with a model pθ(x), where θ represents the parameters

The learning process involves finding values for θ so that pθ(x) closely approximates p∗(x) for any observed x

For effective modeling, pθ(x) must be flexible enough to adapt to the data while allowing prior knowledge about the data distribution to be integrated into the model

Conditional models
A conditional model, pθ(y∣x), is preferred over an unconditional model, pθ(x), when the interest lies in predicting y from x.

This model approximates the distribution p∗(y∣x), which represents the probability distribution over possible values of y (as a label) given an observed variable x.

x is typically considered the model’s input


The goal is to choose and optimize pθ(y∣x) so that it closely approximates p∗(y∣x) for any given x and y

$$p_\theta(y \mid x) \approx p^*(y \mid x)$$

Parameterizing conditional distributions with Neural networks
Neural networks are used to parameterize probability density functions
(PDFs) or probability mass functions (PMFs) → allowing for stochastic
gradient-based optimization, which enables scaling to large datasets and
models

In applications like image classification, neural networks can parameterize a conditional distribution pθ(y∣x) over a label y, given an input image x

The network can be represented as a function, denoted NeuralNet(x), which outputs the parameter vector p of the categorical distribution pθ(y∣x) = Categorical(y; p)

The final layer in such models typically employs a softmax function to ensure that the output probabilities sum to one (∑i pi = 1) → suited to classification tasks (a minimal sketch follows)
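A minimal sketch of this parameterization, assuming PyTorch; the layer sizes, the 784-dimensional input, the 10 classes, and the use of torch.distributions.Categorical are illustrative choices rather than anything specified in the source.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class NeuralNet(nn.Module):
    """Maps an input image x to the parameter vector p of Categorical(y; p)."""
    def __init__(self, in_dim=784, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        logits = self.net(x)
        # softmax guarantees p_i >= 0 and sum_i p_i = 1
        return torch.softmax(logits, dim=-1)

# p_theta(y | x) = Categorical(y; p) with p = NeuralNet(x)
x = torch.randn(32, 784)                      # a batch of flattened images (assumed shape)
p = NeuralNet()(x)                            # class probabilities, shape (32, 10)
y = torch.randint(0, 10, (32,))               # hypothetical labels
log_p_y_given_x = Categorical(probs=p).log_prob(y)
```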

Directed graphical models and Neural networks

Directed probabilistic models, also known as directed probabilistic graphical models (PGMs) or Bayesian networks, structure variables in a directed acyclic graph (DAG), where the joint distribution over the variables factorizes into a product of prior and conditional distributions:

$$p_\theta(x_1, \ldots, x_M) = \prod_{j=1}^{M} p_\theta(x_j \mid Pa(x_j))$$

Pa(xj) represents the set of parent variables of node j in the graph

Root nodes have no parents, so their distributions are unconditional

Neural networks provide a flexible approach by taking the parent variables of a node as input and producing the distributional parameters, η, for that variable:

$$\eta = \text{NeuralNet}(Pa(x))$$

$$p_\theta(x \mid Pa(x)) = p_\theta(x \mid \eta)$$

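As a concrete sketch of this pattern (assuming PyTorch; the Bernoulli node, the two-parent setup, and the layer sizes are assumptions for illustration, not taken from the source), a small network maps the parent values to the distributional parameters η of one node:

```python
import torch
import torch.nn as nn
from torch.distributions import Bernoulli

class NodeConditional(nn.Module):
    """Computes eta = NeuralNet(Pa(x)) and returns p_theta(x | Pa(x)) = p_theta(x | eta)."""
    def __init__(self, num_parents=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_parents, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),             # eta: a single Bernoulli logit for node x
        )

    def forward(self, parents):
        eta = self.net(parents)               # distributional parameters for this node
        return Bernoulli(logits=eta.squeeze(-1))

pa_x = torch.randn(16, 2)                     # a batch of values for the two parent variables
p_x_given_pa = NodeConditional()(pa_x)        # the conditional distribution p_theta(x | Pa(x))
log_prob = p_x_given_pa.log_prob(torch.ones(16))
```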
Learning in fully observed models with Neural nets


Dataset
A dataset D typically consists of N ≥ 1 datapoints:

$$D = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\} \equiv \{x^{(i)}\}_{i=1}^{N} \equiv x^{(1:N)}$$

Each datapoint is an independent sample from the same underlying distribution (the dataset is composed of distinct, independent measurements from a stable system) → the observations in D are independently and identically distributed (i.i.d.)

The probability of the dataset given the model parameters θ can be expressed as a product of individual datapoint probabilities. The log-probability assigned to the data by the model is thus:

$$\log p_\theta(D) = \sum_{x \in D} \log p_\theta(x)$$

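A toy illustration (not from the source): with a univariate Gaussian standing in for pθ(x) and θ = (μ, log σ), the dataset log-probability is just the sum of per-datapoint log-probabilities.

```python
import torch
from torch.distributions import Normal

# toy model p_theta(x): a univariate Gaussian with parameters theta = (mu, log_sigma)
mu = torch.tensor(0.0, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)

data = 2.0 * torch.randn(1000) + 1.0          # a synthetic dataset D

# log p_theta(D) = sum over x in D of log p_theta(x)
log_p_D = Normal(mu, log_sigma.exp()).log_prob(data).sum()
```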
Maximum likelihood and Minibatch SGD

Maximizing the log-likelihood is equivalent to minimizing the Kullback-Leibler (KL) divergence between the data distribution and the model’s distribution → maximum likelihood seeks parameters θ that maximize the summed log-probability of the dataset D (log pθ(D) = ∑x∈D log pθ(x))

Stochastic Gradient Descent (SGD) uses minibatches M ⊂ D of size N_M → creating an unbiased estimator (≃) of the log-probability:

$$\frac{1}{N_D} \log p_\theta(D) \simeq \frac{1}{N_M} \log p_\theta(M) = \frac{1}{N_M} \sum_{x \in M} \log p_\theta(x)$$

The stochastic gradient is then:

$$\frac{1}{N_D} \nabla_\theta \log p_\theta(D) \simeq \frac{1}{N_M} \nabla_\theta \log p_\theta(M) = \frac{1}{N_M} \sum_{x \in M} \nabla_\theta \log p_\theta(x)$$

⇒ SGD optimizes the objective by iteratively adjusting the model parameters in the direction of the stochastic gradient (a minimal sketch of this loop follows)
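A minimal sketch of the resulting training loop, assuming PyTorch and reusing the toy Gaussian model from the previous sketch; the optimizer, learning rate, minibatch size, and number of steps are illustrative assumptions.

```python
import torch
from torch.distributions import Normal

data = 2.0 * torch.randn(1000) + 1.0                      # dataset D, N_D = 1000
mu = torch.tensor(0.0, requires_grad=True)                # theta = (mu, log_sigma)
log_sigma = torch.tensor(0.0, requires_grad=True)
opt = torch.optim.SGD([mu, log_sigma], lr=0.1)

for step in range(500):
    idx = torch.randint(0, data.shape[0], (64,))          # minibatch M ⊂ D of size N_M = 64
    minibatch = data[idx]

    # (1/N_M) * sum_{x in M} log p_theta(x): unbiased estimate of (1/N_D) log p_theta(D)
    log_p = Normal(mu, log_sigma.exp()).log_prob(minibatch)
    loss = -log_p.mean()                                   # minimize the negative estimator

    opt.zero_grad()
    loss.backward()                                        # stochastic gradient w.r.t. theta
    opt.step()                                             # step in its direction
```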

Bayesian inference
Point estimates obtained by maximum likelihood can be improved upon through:

Maximum a posteriori (MAP) estimation (see the formula after this list)

Inference of a full approximate posterior distribution over the parameters
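For reference, MAP estimation maximizes the (unnormalized) log-posterior over the parameters, i.e. it adds a log-prior term to the log-likelihood; the formula below is standard Bayes rather than something quoted in these notes:

$$\hat{\theta}_{MAP} = \arg\max_\theta \left[ \log p(\theta) + \log p_\theta(D) \right]$$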

Learning and inference in deep latent variable models
