Representation
CS236, Stanford University
Lecture 2
Overview
What is a generative model
Representing probability distributions
Curse of dimensionality
Crash course on graphical models (Bayesian networks)
Generative vs discriminative models
Neural models
Learning a generative model
We are given a training set of examples, e.g., images of dogs
We want to learn a probability distribution p(x) over images x such that
Generation: If we sample xnew ∼ p(x), xnew should look like a dog
(sampling)
Density estimation: p(x) should be high if x looks like a dog, and low
otherwise (anomaly detection)
Unsupervised representation learning: We should be able to learn
what these images have in common, e.g., ears, tail, etc. (features)
First question: how to represent p(x)
Basic discrete distributions
Bernoulli distribution: (biased) coin flip
D = {Heads, Tails}
Specify P(X = Heads) = p. Then P(X = Tails) = 1 − p.
Write: X ∼ Ber (p)
Sampling: flip a (biased) coin
Categorical distribution: (biased) m-sided dice
D = {1, · · · , m}
Specify P(Y = i) = pi, such that ∑_i pi = 1
Write: Y ∼ Cat(p1 , · · · , pm )
Sampling: roll a (biased) die
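As a minimal sketch (not from the lecture), here is how one might sample from these two distributions with NumPy; the probabilities below are arbitrary placeholder values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli(p): biased coin flip, P(X = Heads) = p
p = 0.7  # arbitrary example value
x = rng.random() < p  # True ~ Heads, False ~ Tails

# Categorical(p1, ..., pm): biased m-sided die
probs = np.array([0.1, 0.2, 0.3, 0.4])  # must sum to 1
y = rng.choice(len(probs), p=probs)     # returns a face index 0..m-1

print(x, y)
```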
Example of joint distribution
Modeling a single pixel’s color. Three discrete random variables:
Red Channel R. Val(R) = {0, · · · , 255}
Green Channel G . Val(G ) = {0, · · · , 255}
Blue Channel B. Val(B) = {0, · · · , 255}
Sampling from the joint distribution (r , g , b) ∼ p(R, G , B) randomly
generates a color for the pixel. How many parameters do we need to
specify the joint distribution p(R = r , G = g , B = b)?
256 ∗ 256 ∗ 256 − 1
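Concretely, the joint table has 256 · 256 · 256 = 16,777,216 entries; since they must sum to 1, one entry is determined by the others, leaving 256³ − 1 = 16,777,215 free parameters.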
Example of joint distribution
Suppose X1 , . . . , Xn are binary (Bernoulli) random variables, i.e.,
Val(Xi ) = {0, 1} = {Black, White}.
How many possible images (states)?
2 × 2 × · · · × 2 (n times) = 2^n
Sampling from p(x1 , . . . , xn ) generates an image
How many parameters to specify the joint distribution p(x1 , . . . , xn )
over n binary pixels?
2^n − 1
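To get a feel for the scale: even a small 28 × 28 binary image has n = 784 pixels, so the full joint table would need 2^784 − 1 ≈ 10^236 parameters.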
Structure through independence
If X1 , . . . , Xn are independent, then
p(x1 , . . . , xn ) = p(x1 )p(x2 ) · · · p(xn )
How many possible states? 2^n
How many parameters to specify the joint distribution p(x1 , . . . , xn )?
How many to specify the marginal distribution p(x1 )? 1
2^n entries can be described by just n numbers (if |Val(Xi)| = 2)!
Independence assumption is too strong. Model not likely to be useful
For example, each pixel would be chosen independently of all the others when we sample from the model.
Two important rules
1 Chain rule Let S1 , . . . Sn be events, p(Si ) > 0.
p(S1 ∩ S2 ∩ · · · ∩ Sn ) = p(S1 )p(S2 | S1 ) · · · p(Sn | S1 ∩ . . . ∩ Sn−1 )
2 Bayes’ rule Let S1 , S2 be events, p(S1 ) > 0 and p(S2 ) > 0.
p(S1 | S2) = p(S1 ∩ S2)/p(S2) = p(S2 | S1)p(S1)/p(S2)
Structure through conditional independence
Using Chain Rule
p(x1 , . . . , xn ) = p(x1 )p(x2 | x1 )p(x3 | x1 , x2 ) · · · p(xn | x1 , · · · , xn−1 )
How many parameters? 1 + 2 + · · · + 2^(n−1) = 2^n − 1
p(x1 ) requires 1 parameter
p(x2 | x1 = 0) requires 1 parameter, p(x2 | x1 = 1) requires 1 parameter
Total 2 parameters.
···
2^n − 1 is still exponential, chain rule does not buy us anything.
Now suppose Xi+1 ⊥ X1 , . . . , Xi−1 | Xi , then
p(x1, . . . , xn) = p(x1)p(x2 | x1)p(x3 | x1, x2) · · · p(xn | x1, · · · , xn−1)
= p(x1 )p(x2 | x1 )p(x3 | x2 ) · · · p(xn | xn−1 )
How many parameters? 1 + 2(n − 1) = 2n − 1. Exponential reduction!
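As an illustration (my own sketch, not from the slides), the Markov assumption needs only one number for p(x1 = 1) and two numbers per transition p(x_{i+1} = 1 | x_i), i.e., 1 + 2(n − 1) = 2n − 1 parameters, and sampling proceeds pixel by pixel:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16  # number of binary pixels (arbitrary)

# 2n - 1 parameters in total (all values below are arbitrary placeholders):
p_x1 = 0.5                                       # 1 parameter: p(x1 = 1)
p_next = rng.uniform(0.1, 0.9, size=(n - 1, 2))  # 2(n-1) parameters: p(x_{i+1} = 1 | x_i = 0 or 1)

# Sample x1, ..., xn from the chain p(x1) p(x2 | x1) ... p(xn | x_{n-1})
x = np.zeros(n, dtype=int)
x[0] = rng.random() < p_x1
for i in range(1, n):
    x[i] = rng.random() < p_next[i - 1, x[i - 1]]
print(x)
```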
Bayes Network: General Idea
Use conditional parameterization (instead of joint parameterization)
For each random variable Xi , specify p(xi |xAi ) for set XAi of random
variables
Then get joint parametrization as
p(x1, . . . , xn) = ∏_i p(xi | xAi)
Need to guarantee it is a legal probability distribution. It has to
correspond to a chain rule factorization, with factors simplified due to
assumed conditional independencies
Bayesian networks
A Bayesian network is specified by a directed acyclic graph (DAG)
G = (V , E ) with:
1 One node i ∈ V for each random variable Xi
2 One conditional probability distribution (CPD) per node, p(xi | xPa(i) ),
specifying the variable’s probability conditioned on its parents’ values
Graph G = (V , E ) is called the structure of the Bayesian Network
Defines a joint distribution:
p(x1, . . . , xn) = ∏_{i∈V} p(xi | xPa(i))
Claim: p(x1, . . . , xn) is a valid probability distribution because of the topological ordering implied by the DAG: each variable is conditioned only on its parents, which come earlier in that ordering, so the product is a (simplified) chain-rule factorization
Economical representation: exponential in |Pa(i)|, not |V |
Example
DAG stands for Directed Acyclic Graph
Example
Consider the following Bayesian network over variables D, I, G, S, L, with edges D → G, I → G, I → S, and G → L.
What is its joint distribution?
p(x1, . . . , xn) = ∏_{i∈V} p(xi | xPa(i))
p(d, i, g , s, l) = p(d)p(i)p(g | i, d)p(s | i)p(l | g )
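To make the factorization concrete, here is a small sketch that evaluates this joint by multiplying the local CPDs. The graph structure (D → G ← I, I → S, G → L) is read off the factorization above; the CPT values and variable cardinalities below are invented for illustration only.

```python
# Hypothetical CPTs: binary D, I, S, L and a three-valued G (all numbers invented)
p_D = {0: 0.6, 1: 0.4}
p_I = {0: 0.7, 1: 0.3}
p_G_given_ID = {  # p(g | i, d), g in {0, 1, 2}
    (0, 0): [0.3, 0.4, 0.3], (0, 1): [0.05, 0.25, 0.7],
    (1, 0): [0.9, 0.08, 0.02], (1, 1): [0.5, 0.3, 0.2],
}
p_S_given_I = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.2, 1: 0.8}}                        # p(s | i)
p_L_given_G = {0: {0: 0.1, 1: 0.9}, 1: {0: 0.4, 1: 0.6}, 2: {0: 0.99, 1: 0.01}}   # p(l | g)

def joint(d, i, g, s, l):
    """p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)"""
    return p_D[d] * p_I[i] * p_G_given_ID[(i, d)][g] * p_S_given_I[i][s] * p_L_given_G[g][l]

print(joint(d=0, i=1, g=0, s=1, l=1))
```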
Bayesian network structure implies conditional independencies!
The joint distribution corresponding to the above BN factors as
p(d, i, g , s, l) = p(d)p(i)p(g | i, d)p(s | i)p(l | g )
However, by the chain rule, any distribution can be written as
p(d, i, g , s, l) = p(d)p(i | d)p(g | i, d)p(s | i, d, g )p(l | g , d, i, s)
Thus, we are assuming the following additional independencies:
D ⊥ I, S ⊥ {D, G } | I , L ⊥ {I , D, S} | G .
Summary
Bayesian networks given by (G , P) where P is specified as a set of
local conditional probability distributions associated with G ’s nodes
Efficient representation using a graph-based data structure
Computing the probability of any assignment is obtained by
multiplying CPDs
Can sample from the joint by sampling from the CPDs according to
the DAG ordering
Can identify some conditional independence properties by looking at
graph properties
In this class, graphical models will be simple (e.g., only 2 or 3 random
vectors)
Next: generative vs. discriminative; functional parameterizations
Naive Bayes for single label prediction
Classify e-mails as spam (Y = 1) or not spam (Y = 0)
Let 1 : n index the words in our vocabulary (e.g., English)
Xi = 1 if word i appears in an e-mail, and 0 otherwise
E-mails are drawn according to some distribution p(Y , X1 , . . . , Xn )
Words are conditionally independent given Y. Then
p(y, x1, . . . , xn) = p(y) ∏_{i=1}^{n} p(xi | y)
Example: naive Bayes for classification
Classify e-mails as spam (Y = 1) or not spam (Y = 0)
Let 1 : n index the words in our vocabulary (e.g., English)
Xi = 1 if word i appears in an e-mail, and 0 otherwise
E-mails are drawn according to some distribution p(Y , X1 , . . . , Xn )
Suppose that the words are conditionally independent given Y . Then,
p(y, x1, . . . , xn) = p(y) ∏_{i=1}^{n} p(xi | y)
Estimate parameters from training data. Predict with Bayes rule:
p(Y = 1 | x1, . . . , xn) = [p(Y = 1) ∏_{i=1}^{n} p(xi | Y = 1)] / [∑_{y∈{0,1}} p(Y = y) ∏_{i=1}^{n} p(xi | Y = y)]
Are the independence assumptions made here reasonable?
Philosophy: Nearly all probabilistic models are “wrong”, but many are
nonetheless useful
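A minimal sketch of this prediction rule for the spam example; the prior and per-word probabilities below are invented placeholders, and parameter estimation (e.g., by counting word frequencies per class) is not shown.

```python
import numpy as np

# Hypothetical parameters for a 4-word vocabulary (all numbers invented)
prior = {1: 0.3, 0: 0.7}                       # p(Y = y)
p_word = {1: np.array([0.8, 0.6, 0.1, 0.05]),  # p(X_i = 1 | Y = 1)
          0: np.array([0.1, 0.2, 0.3, 0.4])}   # p(X_i = 1 | Y = 0)

def p_spam_given_x(x):
    """p(Y = 1 | x) via Bayes rule under the naive Bayes factorization."""
    x = np.asarray(x)
    def joint(y):
        px = p_word[y] ** x * (1 - p_word[y]) ** (1 - x)  # Bernoulli likelihood per word
        return prior[y] * px.prod()
    return joint(1) / (joint(0) + joint(1))

print(p_spam_given_x([1, 1, 0, 0]))  # e-mail containing words 1 and 2 only
```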
Discriminative versus generative models
Using chain rule p(Y , X) = p(X | Y )p(Y ) = p(Y | X)p(X).
Corresponding Bayesian networks: Y → X (left) and X → Y (right)
However, suppose all we need for prediction is p(Y | X)
In the left model, we need to specify/learn both p(Y ) and p(X | Y ),
then compute p(Y | X) via Bayes rule
In the right model, it suffices to estimate just the conditional
distribution p(Y | X)
We never need to model/learn/use p(X)!
Called a discriminative model because it is only useful for
discriminating Y ’s label when given X
Discriminative versus generative models
Since X is a random vector, the chain rule gives
p(Y , X) = p(Y )p(X1 | Y )p(X2 | Y , X1 ) · · · p(Xn | Y , X1 , · · · , Xn−1 )
p(Y , X) = p(X1 )p(X2 | X1 )p(X3 | X1 , X2 ) · · · p(Y | X1 , · · · , Xn−1 , Xn )
We must make the following choices:
1 In the generative model, p(Y ) is simple, but how do we parameterize
p(Xi | Xpa(i) , Y )?
2 In the discriminative model, how do we parameterize p(Y | X)? Here
we assume we don’t care about modeling p(X) because X is always
given to us in a classification problem
Naive Bayes
1 For the generative model, assume that Xi ⊥ X−i | Y (naive Bayes)
Logistic regression
1 For the discriminative model, assume that
p(Y = 1 | x; α) = f (x, α)
2 Not represented as a table anymore. It is a parameterized function of
x (regression)
Has to be between 0 and 1
Depends in some simple but reasonable way on x1, · · · , xn
Completely specified by a vector α of n + 1 parameters (compact
representation)
Linear dependence: let z(α, x) = α0 + ∑_{i=1}^{n} αi xi. Then p(Y = 1 | x; α) = σ(z(α, x)), where σ(z) = 1/(1 + e^(−z)) is called the logistic function:
[Plot of the logistic function σ(z) = 1/(1 + e^(−z))]
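A minimal sketch of this parameterization (the weights below are arbitrary, not a trained model):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_y1_given_x(x, alpha0, alpha):
    """Logistic regression: p(Y = 1 | x) = sigma(alpha0 + sum_i alpha_i * x_i)."""
    return sigmoid(alpha0 + np.dot(alpha, x))

alpha0, alpha = -1.0, np.array([2.0, -0.5, 0.3])  # n + 1 = 4 parameters (made up)
x = np.array([1.0, 0.0, 1.0])
print(p_y1_given_x(x, alpha0, alpha))  # between 0 and 1; decision boundary at 0.5
```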
Logistic regression
Linear dependence: let z(α, x) = α0 + ∑_{i=1}^{n} αi xi. Then p(Y = 1 | x; α) = σ(z(α, x)), where σ(z) = 1/(1 + e^(−z)) is called the logistic function
1 Decision boundary p(Y = 1 | x; α) > 0.5 is linear in x
2 Equal probability contours are straight lines
3 Probability rate of change has very specific form (third plot)
Discriminative models are powerful
Logistic model does not assume Xi ⊥ X−i | Y , unlike naive Bayes
This can make a big difference in many applications
For example, in spam classification, let X1 = 1[“bank” in e-mail] and
X2 = 1[“account” in e-mail]
Regardless of whether the e-mail is spam, these words always appear together, i.e., X1 = X2
Learning in naive Bayes results in p(X1 | Y ) = p(X2 | Y ). Thus, naive Bayes
double counts the evidence
Learning with logistic regression sets α1 = 0 or α2 = 0, in effect ignoring the redundant feature
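For instance, if p(X1 = 1 | Y = 1) = 0.8 and X2 always equals X1, naive Bayes contributes p(X1 = 1 | Y = 1) p(X2 = 1 | Y = 1) = 0.8 · 0.8 = 0.64 to the joint, treating a single piece of evidence as if it were two independent observations, whereas logistic regression can give the duplicate feature zero weight.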
Generative models are still very useful
Using chain rule p(Y, X) = p(X | Y)p(Y) = p(Y | X)p(X). Corresponding Bayesian networks: Y → X (generative) and X → Y (discriminative)
1 Using a conditional model is only possible when X is always observed
When some Xi variables are unobserved, the generative model allows us
to compute p(Y | Xevidence ) by marginalizing over the unseen variables
Neural Models
1 In discriminative models, we assume that
p(Y = 1 | x; α) = f (x, α)
2 Linear dependence: let z(α, x) = α0 + ∑_{i=1}^{n} αi xi. Then p(Y = 1 | x; α) = σ(z(α, x)), where σ(z) = 1/(1 + e^(−z)) is the logistic function
Dependence might be too simple
3 Non-linear dependence: let h(A, b, x) = f(Ax + b) be a non-linear transformation of the inputs (features).
pNeural(Y = 1 | x; α, A, b) = σ(α0 + ∑_{i=1}^{h} αi hi)
More flexible
More parameters: A, b, α
Neural Models
1 In discriminative models, we assume that
p(Y = 1 | x; α) = f (x, α)
2 Linear dependence: let z(α, x) = α0 + ∑_{i=1}^{n} αi xi. Then p(Y = 1 | x; α) = f(z(α, x)), where f(z) = 1/(1 + e^(−z)) is the logistic function
Dependence might be too simple
3 Non-linear dependence: let h(A, b, x) = f(Ax + b) be a non-linear transformation of the inputs (features).
pNeural(Y = 1 | x; α, A, b) = f(α0 + ∑_{i=1}^{h} αi hi)
More flexible
More parameters: A, b, α
Can repeat multiple times to get a neural network
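A minimal sketch of this one-hidden-layer parameterization (random weights for illustration; tanh is used as the non-linearity f applied to Ax + b, and the logistic function squashes the output):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_y1_neural(x, A, b, alpha0, alpha, f=np.tanh):
    """p_Neural(Y = 1 | x) = sigmoid(alpha0 + alpha . h), with features h = f(Ax + b)."""
    h = f(A @ x + b)                      # non-linear transformation of the inputs
    return sigmoid(alpha0 + np.dot(alpha, h))

rng = np.random.default_rng(0)
n, k = 3, 5                               # input dim and number of hidden features (arbitrary)
A, b = rng.normal(size=(k, n)), rng.normal(size=k)
alpha0, alpha = 0.0, rng.normal(size=k)
print(p_y1_neural(np.array([1.0, 0.0, 1.0]), A, b, alpha0, alpha))
```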
Bayesian networks vs neural models
Using Chain Rule
p(x1 , x2 , x3 , x4 ) = p(x1 )p(x2 | x1 )p(x3 | x1 , x2 )p(x4 | x1 , x2 , x3 )
Fully General
Bayes Net
p(x1, x2, x3, x4) ≈ p(x1)p(x2 | x1)p(x3 | x2)p(x4 | x1)
Assumes conditional independencies
Neural Models
p(x1 , x2 , x3 , x4 ) ≈ p(x1 )p(x2 | x1 )pNeural (x3 | x1 , x2 )pNeural (x4 | x1 , x2 , x3 )
Assume specific functional form for the conditionals. A sufficiently
deep neural net can approximate any function.
Continuous variables
If X is a continuous random variable, we can usually represent it using
its probability density function pX : R → R+ . However, we cannot
represent this function as a table anymore. Typically consider
parameterized densities:
Gaussian: X ∼ N(µ, σ) if pX(x) = 1/(σ√(2π)) e^(−(x−µ)²/(2σ²))
Uniform: X ∼ U(a, b) if pX(x) = 1/(b − a) 1[a ≤ x ≤ b]
Etc.
If X is a continuous random vector, we can usually represent it using
its joint probability density function:
Gaussian: if pX(x) = 1/√((2π)^n |Σ|) exp(−(1/2)(x − µ)^T Σ^(−1) (x − µ))
Chain rule, Bayes rule, etc all still apply. For example,
pX ,Y ,Z (x, y , z) = pX (x)pY |X (y | x)pZ |{X ,Y } (z | x, y )
Continuous variables
This means we can still use Bayesian networks with continuous (and
discrete) variables. Examples:
Mixture of 2 Gaussians: Bayes net Z → X with factorization
pZ ,X (z, x) = pZ (z)pX |Z (x | z) and
Z ∼ Bernoulli(p)
X | (Z = 0) ∼ N (µ0 , σ0 ) , X | (Z = 1) ∼ N (µ1 , σ1 )
The parameters are p, µ0 , σ0 , µ1 , σ1
Bayes net Z → X with factorization pZ ,X (z, x) = pZ (z)pX |Z (x | z)
Z ∼ U(a, b)
X | (Z = z) ∼ N (z, σ)
The parameters are a, b, σ
Variational autoencoder: Bayes net Z → X with factorization
pZ ,X (z, x) = pZ (z)pX |Z (x | z) and
Z ∼ N (0, 1)
X | (Z = z) ∼ N(µθ(z), e^(σϕ(z))), where µθ : R → R and σϕ are neural networks with parameters (weights) θ, ϕ respectively
Note: Even if µθ , σϕ are very deep (flexible), functional form is still
Gaussian
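As an illustration of sampling from the Z → X network, here is a sketch for the mixture-of-two-Gaussians example above (parameter values are arbitrary): sample Z from its prior first, then X from the corresponding conditional.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mixture of 2 Gaussians: Z ~ Bernoulli(p), X | Z = k ~ N(mu_k, sigma_k)
p = 0.3                                   # arbitrary example parameters
mu = np.array([-2.0, 2.0])
sigma = np.array([0.5, 1.0])

def sample(num_samples):
    z = rng.random(num_samples) < p            # sample the parent Z first
    k = z.astype(int)
    x = rng.normal(loc=mu[k], scale=sigma[k])  # then X given Z
    return k, x

z, x = sample(5)
print(z, x)
```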