Lecture 5: Autoregressive Models

Autoregressive Models

Hao Dong

Peking University

Autoregressive Models
• Definition of Autoregressive Models (Ⅰ)
• Challenge of Generative Models
• Definition of Autoregressive Models (Ⅱ)
• Learning and Inference of Autoregressive Models
• Examples of Autoregressive Models
• Fully Visible Sigmoid Belief Network (FVSBN)
• Neural Autoregressive Density Estimation (NADE)
• Masked Autoencoder for Distribution Estimation (MADE)
• PixelRNN, PixelCNN, WaveNet…. (Next Lecture)

• Definition of Autoregressive Models (Ⅰ)
• Challenge of Generative Models
• Definition of Autoregressive Models (Ⅱ)
• Learning and Inference of Autoregressive Models
• Examples of Autoregressive Models
• Fully Visible Sigmoid Belief Network (FVSBN)
• Neural Autoregressive Density Estimation (NADE)
• Masked Autoencoder for Distribution Estimation (MADE)
• PixelRNN, PixelCNN, WaveNet….

Definition of Autoregressive Models
The term autoregressive originates from the time-series literature, where observations from
previous time steps are used to predict the value at the current time step.
Put simply, an autoregressive model is a feed-forward model that predicts future values from
past values:

$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, \sigma^2)$

$y_t$ could be:
• the price of a specific stock on day $t$,
• the amplitude of a simple pendulum at period $t$,
• or any variable that depends on its preceding values!
Definition of Autoregressive Models
Autoregressive models have a strong ability to represent data.

$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, \sigma^2)$

[Figure] Two examples of data simulated from autoregressive models with different parameters.
Left: AR(1) with $y_t = 18 - 0.8\, y_{t-1} + \varepsilon_t$. Right: AR(2) with $y_t = 8 + 1.3\, y_{t-1} - 0.7\, y_{t-2} + \varepsilon_t$.
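As a concrete illustration (a sketch added to these notes, not from the original slides), the two example processes above can be simulated directly with NumPy; the coefficients come from the figure caption, while the noise scale sigma = 1 is an assumption.

```python
import numpy as np

def simulate_ar(coeffs, c, n_steps, sigma=1.0, seed=0):
    """Simulate y_t = c + sum_k coeffs[k] * y_{t-1-k} + eps_t, with eps_t ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    p = len(coeffs)
    y = np.zeros(n_steps + p)          # first p entries serve as zero initial conditions
    for t in range(p, n_steps + p):
        past = y[t - p:t][::-1]        # y_{t-1}, y_{t-2}, ..., y_{t-p}
        y[t] = c + np.dot(coeffs, past) + rng.normal(0.0, sigma)
    return y[p:]

ar1 = simulate_ar([-0.8], c=18.0, n_steps=100)       # AR(1): y_t = 18 - 0.8 y_{t-1} + eps_t
ar2 = simulate_ar([1.3, -0.7], c=8.0, n_steps=100)   # AR(2): y_t = 8 + 1.3 y_{t-1} - 0.7 y_{t-2} + eps_t
print(ar1[:5], ar2[:5])
```
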
Definition of Autoregressive Models
Autoregressive models have a strong ability to represent data, and can be used for:

• Regression

• Generation

• Prediction
Recap: Statistical Generative Models

[Figure] A statistical generative model is learned from a dataset $\mathcal{D}$ together with prior
knowledge (model family, loss function, optimization algorithm, etc.) to approximate the data
distribution $p_{data}(x)$. Each sample $x$ (e.g., a 64x64x3 image) is a high-dimensional vector;
the learned density assigns high probability to realistic samples, and sampling from $p(x)$
generates new images, e.g., of a woman with blonde hair.
• Definition of Autoregressive Models (Ⅰ)
• Challenge of Generative Models
• Definition of Autoregressive Models (Ⅱ)
• Learning and Inference of Autoregressive Models
• Examples of Autoregressive Models
• Fully Visible Sigmoid Belief Network (FVSBN)
• Neural Autoregressive Density Estimation (NADE)
• Masked Autoencoder for Distribution Estimation (MADE)
• PixelRNN, PixelCNN, WaveNet….

Recap: Challenge of Generative Models
• Compactness
Suppose $x_1, x_2, x_3$ are binary variables. $P(x_1, x_2, x_3)$ can be specified with $2^3 - 1 = 7$ parameters.

What about a 28×28 black/white digit image?

$2^{28 \times 28} = 2^{784} \approx 10^{236}$ parameters! Yet the data has only about 10 modes, one for each digit 0, 1, 2, …, 9.
In general, a full joint table over $n$ binary variables needs $O(2^n)$ parameters.

Main challenge: distributions over high-dimensional objects are actually very sparse, and there
are far too many possibilities to enumerate. Main idea: write the joint as a product of simpler terms.
Recap: Challenge of Generative Models
• Solution #1: Factorization
Definition of conditional probability:
$P(x_1, x_2) = P(x_1)\, P(x_2 \mid x_1)$
Product rule (chain rule):
$P(x_1, x_2, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_{<i})$

Divide and conquer! We can model the joint distribution $P(\mathbf{x})$ by modeling the simpler
conditional distributions $P(x_i \mid x_{<i})$ one by one (a small numeric example follows below).

Still complex! It is hard to model every conditional distribution exactly. Can you tell the exact
likelihood of the next pixel (marked as a red point) conditioned on the given pixels?
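As a quick sanity check (an added example; the probability values below are made-up illustration numbers, not from the slides), the chain rule lets any joint probability be computed from the conditionals:

$P(x_1{=}1,\, x_2{=}0,\, x_3{=}1) \;=\; P(x_1{=}1)\; P(x_2{=}0 \mid x_1{=}1)\; P(x_3{=}1 \mid x_1{=}1, x_2{=}0) \;=\; 0.6 \times 0.3 \times 0.9 \;=\; 0.162$
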
Recap: Challenge of Generative Models
Solution #2a: use simple functions to form the conditionals
$P(x_4 \mid x_1, x_2, x_3) \approx \mathrm{sigmoid}(W_1 x_1 + W_2 x_2 + W_3 x_3)$
(the sigmoid squashes the output into (0, 1), so it can parameterize a binary variable)
◦ Only requires storing 3 parameters
◦ But the relationship between $x_4$ and $(x_1, x_2, x_3)$ could be too simple

Solution #2b: use a more complex functional form: a neural network with a sigmoid output
$Z_1 = f_{11}(x_1, x_2, x_3),\quad Z_2 = f_{12}(x_1, x_2, x_3),\quad Z_3 = f_{13}(x_1, x_2, x_3)$
$Y_1 = f_{21}(Z_1, Z_2, Z_3),\quad Y_2 = f_{22}(Z_1, Z_2, Z_3),\quad Y_3 = f_{23}(Z_1, Z_2, Z_3),\ \dots$
$P(x_4 \mid x_1, x_2, x_3) \approx \mathrm{sigmoid}(W_1 Y_1 + W_2 Y_2 + W_3 Y_3 + \dots)$
◦ More flexible
◦ More parameters
◦ More powerful at fitting data

Finally, it becomes possible to model the data distribution!

Source: IJCAI-ECAI 2018 Tutorial: Deep Generative Models
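To make the two parameterizations concrete, here is a minimal NumPy sketch (an illustration added to these notes; the weights are random placeholders rather than learned values) of a single conditional $P(x_4 \mid x_1, x_2, x_3)$ modeled by (a) plain logistic regression and (b) a small one-hidden-layer network:

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 1.0])              # observed values of x1, x2, x3

# Solution #2a: logistic regression, only 3 parameters
w = rng.normal(size=3)
p_x4_simple = sigmoid(w @ x)               # P(x4 = 1 | x1, x2, x3)

# Solution #2b: one-hidden-layer network, more parameters and more flexible
W1, b1 = rng.normal(size=(8, 3)), np.zeros(8)
W2, b2 = rng.normal(size=8), 0.0
h = np.tanh(W1 @ x + b1)                   # hidden features (the Z / Y units in the slide)
p_x4_mlp = sigmoid(W2 @ h + b2)            # P(x4 = 1 | x1, x2, x3)

print(p_x4_simple, p_x4_mlp)
```
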
• Definition of Autoregressive Models (Ⅰ)
• Challenge of Generative Models
• Definition of Autoregressive Models (Ⅱ)
• Learning and Inference of Autoregressive Models
• Examples of Autoregressive Models
• Fully Visible Sigmoid Belief Network (FVSBN)
• Neural Autoregressive Density Estimation (NADE)
• Masked Autoencoder for Distribution Estimation (MADE)
• PixelRNN, PixelCNN, WaveNet….

Definition of Autoregressive Models
By defining $\hat{x}_i$, the output at step $i$, as the conditional distribution of $x_i$ given the
previous inputs $x_1, x_2, \dots, x_{i-1}$, we obtain a probability model that represents the joint
distribution $p_\theta(x_1, x_2, \dots, x_n)$.
• Key idea: decompose the joint distribution into a product of tractable conditionals
$\hat{x}_i = p_\theta(x_i \mid x_1, x_2, \dots, x_{i-1})$
$p_\theta(\mathbf{x}) = \prod_{i=1}^{n} p_\theta(x_i \mid x_1, x_2, \dots, x_{i-1}) = \prod_{i=1}^{n} p_\theta(x_i \mid x_{<i})$

• Graphical model: a directed, fully-observed Bayesian network
Definition of Autoregressive Models

[Figures] Obligatory RNN diagram (source: Chris Olah) and WaveNet animation (source: Google DeepMind).

Relationship with RNNs:
• Like an RNN, an autoregressive model's output $h_t$ at time $t$ depends not just on $x_t$ but
also on $x_1, x_2, \dots, x_{t-1}$ from previous time steps.
• However, unlike an RNN, the previous $x_1, x_2, \dots, x_{t-1}$ are not provided via some hidden
state: they are given directly as inputs to the model.
• Definition of Autoregressive Models (Ⅰ)
• Challenge of Generative Models
• Definition of Autoregressive Models (Ⅱ)
• Learning and Inference of Autoregressive Models
• Examples of Autoregressive Models
• Fully Visible Sigmoid Belief Network (FVSBN)
• Neural Autoregressive Density Estimation (NADE)
• Masked Autoencoder for Distribution Estimation (MADE)
• PixelRNN, PixelCNN, WaveNet….

Learning and Inference of Autoregressive Models
• Learning maximizes the model log-likelihood over the dataset:

$\min_{\theta \in \mathcal{M}} d_{KL}(p_{data}, p_\theta) = \min_{\theta \in \mathcal{M}} \mathbb{E}_{x \sim p_{data}}\left[\log p_{data}(x) - \log p_\theta(x)\right] \;\propto\; \max_{\theta \in \mathcal{M}} \mathbb{E}_{x \sim p_{data}}\left[\log p_\theta(x)\right]$

$\max_{\theta} \log p_\theta(\mathcal{D}) = \sum_{x \in \mathcal{D}} \log p_\theta(x) = \sum_{x \in \mathcal{D}} \sum_{i=1}^{n} \log p_\theta(x_i \mid x_{<i})$

Tractable: the distribution is simple enough to be modeled explicitly.
Tractable conditionals make learning each conditional distribution meaningful, and allow for
exact likelihood evaluation.
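The objective above is simply a sum of per-step log-conditionals. A minimal sketch (added for illustration; `cond_prob(x, i)` is a hypothetical model function returning $p_\theta(x_i = 1 \mid x_{<i})$ for binary data):

```python
import numpy as np

def log_likelihood(X, cond_prob):
    """Sum of log p_theta(x_i | x_<i) over all dimensions and all samples.
    X: (num_samples, n) array of 0/1 values.
    cond_prob(x, i): returns p_theta(x_i = 1 | x_1 .. x_{i-1}) for one sample x."""
    total = 0.0
    for x in X:
        for i in range(len(x)):
            p1 = cond_prob(x, i)
            total += np.log(p1 if x[i] == 1 else 1.0 - p1)
    return total  # maximize this (or minimize its negative) with respect to theta
```
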
Learning and Inference of Autoregressive Models
• Inference samples each variable of a data point from the estimated conditional distributions,
step by step, until the whole data point is generated.

Ancestral sampling: a process for producing samples from a probabilistic model.
First, sample the variables that have no conditional constraints from their prior distribution:
$x_1 \sim p_\theta(x_1)$.
Then sample each child variable from its conditional distribution given its parents, and repeat:
$x_2 \sim p_\theta(x_2 \mid x_1), \dots$

Because autoregressive models directly model and output these conditional distributions,
ancestral sampling is straightforward (see the sketch below).
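A minimal ancestral-sampling loop for binary data (added for illustration; again assuming a hypothetical `cond_prob(x, i)` that returns $p_\theta(x_i = 1 \mid x_{<i})$):

```python
import numpy as np

def ancestral_sample(n, cond_prob, seed=0):
    """Generate one sample x_1 .. x_n by sampling each conditional in order."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n, dtype=int)
    for i in range(n):
        p1 = cond_prob(x, i)        # p_theta(x_i = 1 | x_1 .. x_{i-1}); only x[:i] is used
        x[i] = rng.random() < p1    # Bernoulli sample
    return x
```
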
Learning and Inference of Autoregressive Models

Differences between autoregressive models (AR), VAEs and GANs:

• A GAN does not define an explicit distribution; it uses a discriminator to learn the data
distribution implicitly.
• A VAE assumes the data distribution is too complex to model directly, so it learns the
distribution by defining a simple intermediate (latent) distribution and learning a mapping from
that simple distribution to the complex data distribution.
• An AR model assumes the data distribution can be modeled directly (it is tractable): it defines
its outputs as conditional distributions and solves the generation problem by explicitly modeling
each conditional distribution.
Learning and Inference of Autoregressive Models
Conclusion:
1. Using complex networks, at each step an autoregressive model outputs an approximate
conditional distribution $\hat{x}_i = p_\theta(x_i \mid x_1, x_2, \dots, x_{i-1})$.

2. Feeding in the previous inputs $x_1, x_2, \dots, x_{i-1}$, and obtaining each next input $x_i$ by
sampling from the previously estimated conditional distribution $\hat{x}_i$, the autoregressive
model generates all conditional distributions iteratively:
$x_1 \sim p_\theta(x_1),\; x_2 \sim p_\theta(x_2 \mid x_1),\; x_3 \sim p_\theta(x_3 \mid x_1, x_2),\; \dots,\; x_n \sim p_\theta(x_n \mid x_1, \dots, x_{n-1})$

3. The product rule guarantees that the generated data point, made up of the sampled results
$x_i$ from each step, follows the data distribution:
$(x_1, x_2, \dots, x_n) \sim \prod_{i=1}^{n} p_\theta(x_i \mid x_{<i})$
• Definition of Autoregressive Models (Ⅰ)
• Challenge of Generative Models
• Definition of Autoregressive Models (Ⅱ)
• Learning and Inference of Autoregressive Models
• Examples of Autoregressive Models
• Fully Visible Sigmoid Belief Network (FVSBN)
• Neural Autoregressive Density Estimation (NADE)
• Masked Autoencoder for Distribution Estimation (MADE)
• PixelRNN, PixelCNN, WaveNet….

Fully Visible Sigmoid Belief Network (FVSBN)
• A fully visible sigmoid belief network without any hidden units is denoted FVSBN.
• The conditional variables $x_i \mid x_1, \dots, x_{i-1}$ in FVSBN are Bernoulli. Some conditionals
are too complex to model exactly, so FVSBN assumes a logistic-regression form for their
parameters:

$\hat{x}_i = p(x_i = 1 \mid x_1, x_2, \dots, x_{i-1}) = f_i(x_1, x_2, \dots, x_{i-1};\ \alpha^{(i)}) = \sigma(\alpha_0^{(i)} + \alpha_1^{(i)} x_1 + \dots + \alpha_{i-1}^{(i)} x_{i-1})$

• $\sigma$ denotes the sigmoid function
• $\alpha^{(i)} = \{\alpha_0^{(i)}, \alpha_1^{(i)}, \dots, \alpha_{i-1}^{(i)}\}$ denotes the parameters of the $i$-th conditional
• The conditional for variable $x_i$ requires $i$ parameters, and hence the total number of
parameters in the model is $\sum_{i=1}^{n} i = O(n^2) \ll O(2^n)$

[Figure: FVSBN graphical model over $x_1, x_2, x_3, x_4$]

Gan Z, Henao R, Carlson D, et al. Learning Deep Sigmoid Belief Networks with Data Augmentation[C]// Artificial Intelligence and Statistics (AISTATS). 2015.
FVSBN Example

• Suppose we have a dataset D of handwritten digits (binarized MNIST).
• Each image has n = 28×28 = 784 pixels. Each pixel is either black (0) or white (1).
• We want to learn a probability distribution $p(x) = p(x_1, \dots, x_{784})$ over $x \in \{0, 1\}^{784}$
such that when $x \sim p(x)$, $x$ looks like a digit.
• Idea: define a FVSBN model, then pick a good one based on the training data D
(more on that later).
FVSBN Example
• We can pick an ordering, i.e., order the variables (pixels) from the top-left ($x_1$) to the
bottom-right ($x_{784}$).
• Use the product-rule factorization (without loss of generality):
$p(x_1, \dots, x_{784}) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_{784} \mid x_1, \dots, x_{783})$
• FVSBN modeling assumption (fewer parameters):
$\hat{x}_i = p(x_i = 1 \mid x_1, x_2, \dots, x_{i-1}) = f_i(x_1, x_2, \dots, x_{i-1};\ \alpha^{(i)}) = \sigma(\alpha_0^{(i)} + \alpha_1^{(i)} x_1 + \dots + \alpha_{i-1}^{(i)} x_{i-1})$
• Note: this is a modeling assumption. We are using logistic regression to predict the
distribution of the next pixel based on the previous ones. This is called autoregressive.
FVSBN Example

[Figure: FVSBN over binary variables $x_1, x_2, x_3, x_4$, producing $\hat{x}_1, \hat{x}_2, \hat{x}_3, \hat{x}_4$]

• How to evaluate $p(x_1, \dots, x_{784})$, i.e., density estimation? Multiply all the conditionals (factors).
In the example above:
$p(x_1 = 0, x_2 = 1, x_3 = 1, x_4 = 0)$
$= p(x_1 = 0)\, p(x_2 = 1 \mid x_1 = 0)\, p(x_3 = 1 \mid x_1 = 0, x_2 = 1)\, p(x_4 = 0 \mid x_1 = 0, x_2 = 1, x_3 = 1)$
$= (1 - \hat{x}_1) \times \hat{x}_2 \times \hat{x}_3 \times (1 - \hat{x}_4)$

• How to sample from $p(x_1, \dots, x_{784})$?
1. Sample $\bar{x}_1 \sim p(x_1)$   (np.random.choice([1, 0], p=[$\hat{x}_1$, 1 − $\hat{x}_1$]))
2. Sample $\bar{x}_2 \sim p(x_2 \mid x_1 = \bar{x}_1)$
3. Sample $\bar{x}_3 \sim p(x_3 \mid x_1 = \bar{x}_1, x_2 = \bar{x}_2)$
···
A runnable sketch combining evaluation and sampling follows below.
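Putting the pieces together, here is a small NumPy sketch of FVSBN (added for illustration; the weights are randomly initialized placeholders rather than trained values), covering both density evaluation and ancestral sampling:

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

class FVSBN:
    def __init__(self, n, seed=0):
        rng = np.random.default_rng(seed)
        # alpha[i] holds (alpha_0, alpha_1, ..., alpha_{i-1}) for the i-th conditional
        self.alpha = [rng.normal(scale=0.1, size=i + 1) for i in range(n)]
        self.n = n

    def cond_prob(self, x, i):
        """p(x_i = 1 | x_1 .. x_{i-1}) = sigmoid(alpha_0 + alpha_1 x_1 + ...)."""
        a = self.alpha[i]
        return sigmoid(a[0] + np.dot(a[1:], x[:i]))

    def log_prob(self, x):
        """log p(x) = sum_i log p(x_i | x_<i)."""
        lp = 0.0
        for i in range(self.n):
            p1 = self.cond_prob(x, i)
            lp += np.log(p1 if x[i] == 1 else 1.0 - p1)
        return lp

    def sample(self, seed=0):
        """Ancestral sampling: generate one pixel at a time in the chosen ordering."""
        rng = np.random.default_rng(seed)
        x = np.zeros(self.n, dtype=int)
        for i in range(self.n):
            x[i] = rng.random() < self.cond_prob(x, i)
        return x

model = FVSBN(n=784)            # one conditional per MNIST pixel
x = model.sample()
print(x.shape, model.log_prob(x))
```
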
• Definition of Autoregressive Models (Ⅰ)
• Challenge of Generative Models
• Definition of Autoregressive Models (Ⅱ)
• Learning and Inference of Autoregressive Models
• Examples of Autoregressive Models
• Fully Visible Sigmoid Belief Network (FVSBN)
• Neural Autoregressive Density Estimation (NADE)
• Masked Autoencoder for Distribution Estimation (MADE)
• PixelRNN, PixelCNN, WaveNet….

NADE: Neural Autoregressive Density Estimation

Improve FVSBN: use a one-hidden-layer neural network instead of logistic regression for each
conditional.

$\mathbf{h}_i = \sigma(A_i \mathbf{x}_{<i} + \mathbf{c}_i)$
$\hat{x}_i = p(x_i \mid x_1, x_2, \dots, x_{i-1};\ A_i, \mathbf{c}_i, \boldsymbol{\alpha}_i, b_i) = \sigma(\boldsymbol{\alpha}_i \mathbf{h}_i + b_i)$

Tied weights are shared across conditionals to reduce the number of parameters and speed up
computation (see the blue dots in the figure).

[Figure: NADE architecture; in the figure the inputs $x_i$ are denoted $v_i$]

Uria B, Côté M A, Gregor K, et al. Neural autoregressive distribution estimation[J]. The Journal of Machine Learning Research, 2016, 17(1): 7184-7220.
NADE: Neural Autoregressive Density Estimation

$\mathbf{h}_i = \sigma(A_i \mathbf{x}_{<i} + \mathbf{c}_i)$, where $\mathbf{x}_{<i} \in \mathbb{R}^{i-1}$ denotes the vector of preceding $x$'s and $\mathbf{h}_i \in \mathbb{R}^{d}$ denotes the hidden-layer activations of the MLP
$\hat{x}_i = p(x_i \mid x_1, x_2, \dots, x_{i-1};\ A_i, \mathbf{c}_i, \boldsymbol{\alpha}_i, b_i) = f_i(x_1, x_2, \dots, x_{i-1}) = \sigma(\boldsymbol{\alpha}_i \mathbf{h}_i + b_i)$

$\theta_i = \{A_i \in \mathbb{R}^{d \times (i-1)},\ \mathbf{c}_i \in \mathbb{R}^{d},\ \boldsymbol{\alpha}_i \in \mathbb{R}^{d},\ b_i \in \mathbb{R}\}$ is the set of parameters for step $i$.

The total number of parameters in this model is dominated by the matrices $\{A_1, A_2, \dots, A_n\}$
and is $O(n^2 d)$.
Sharing parameters: tying the weights (each $A_i$ consists of the first $i-1$ columns of a single
shared matrix, with a shared hidden bias) reduces the number of parameters to $O(nd)$ and
speeds up computation, as in the sketch below.
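A minimal sketch of the tied-weight NADE computation (added for illustration; the parameters are random placeholders, not trained values). Note how the hidden pre-activation is updated incrementally, which is what makes the shared parameterization fast:

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
rng = np.random.default_rng(0)

n, d = 784, 50                          # number of visible units, hidden size
W = rng.normal(scale=0.1, size=(d, n))  # shared input-to-hidden weights (A_i = W[:, :i])
c = np.zeros(d)                         # shared hidden bias
V = rng.normal(scale=0.1, size=(n, d))  # per-output hidden-to-output weights (alpha_i)
b = np.zeros(n)                         # per-output biases

def nade_log_prob(x):
    """log p(x) = sum_i log p(x_i | x_<i), computed in O(n*d) with the incremental trick."""
    a = c.copy()                        # pre-activation for h_1 (no preceding inputs yet)
    lp = 0.0
    for i in range(n):
        h = sigmoid(a)                  # h_i = sigmoid(W[:, :i] @ x[:i] + c)
        p1 = sigmoid(V[i] @ h + b[i])   # p(x_i = 1 | x_<i)
        lp += np.log(p1 if x[i] == 1 else 1.0 - p1)
        a += W[:, i] * x[i]             # incorporate x_i for the next conditional
    return lp

x = rng.integers(0, 2, size=n)
print(nade_log_prob(x))
```
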
Generate Samples

[Figure] Performance of FVSBN and NADE on the MNIST dataset. (Left) Training data.
(Middle) Averaged synthesized samples. (Right) Learned features at the bottom layer.
Generate Other Distributions
• How do we model non-binary discrete random variables $v_i \in \{1, \dots, K\}$, e.g., pixel
intensities varying from 0 to 255?

• One solution: let $\hat{\mathbf{v}}_i$ parameterize a categorical distribution:

$\mathbf{h}_i = \sigma(A_i \mathbf{v}_{<i} + \mathbf{c}_i)$
$\hat{\mathbf{v}}_i = p(v_i \mid v_1, \dots, v_{i-1}) = (p_i^1, p_i^2, \dots, p_i^K) = \mathrm{softmax}(U_i \mathbf{h}_i + \mathbf{b}_i)$

• Softmax generalizes the sigmoid/logistic function $\sigma(\cdot)$ and transforms a vector of $K$
numbers into a vector of $K$ probabilities (non-negative, summing to 1):

$\mathrm{softmax}(\mathbf{a}) = \mathrm{softmax}(a^1, \dots, a^K) = \left( \frac{\exp a^1}{\sum_k \exp a^k}, \dots, \frac{\exp a^K}{\sum_k \exp a^k} \right)$
Generate Other Distributions
• How do we model continuous random variables $v_i \in \mathbb{R}$, e.g., speech signals?

• One solution: let $\hat{\mathbf{v}}_i$ parameterize a continuous distribution,
e.g., a uniform mixture of $K$ Gaussians:

$\mathbf{h}_i = \sigma(A_i \mathbf{v}_{<i} + \mathbf{c}_i)$
$\hat{\mathbf{v}}_i = f(\mathbf{h}_i) = (\mu_i^1, \dots, \mu_i^K, \sigma_i^1, \dots, \sigma_i^K)$
$p(v_i \mid v_1, \dots, v_{i-1}) = \frac{1}{K} \sum_{k=1}^{K} \mathcal{N}(v_i;\, \mu_i^k, \sigma_i^k)$
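For illustration (an added sketch, not from the slides), evaluating such a uniform mixture-of-Gaussians conditional, given the means and standard deviations predicted by the network, might look like:

```python
import numpy as np

def mixture_log_prob(v, mu, sigma):
    """log p(v) for a uniform mixture of K Gaussians with means mu[k] and std devs sigma[k]."""
    mu, sigma = np.asarray(mu), np.asarray(sigma)
    K = len(mu)
    # component log-densities log N(v; mu_k, sigma_k)
    comp = -0.5 * ((v - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
    # log( (1/K) * sum_k exp(comp_k) ), computed stably
    m = comp.max()
    return m + np.log(np.exp(comp - m).sum()) - np.log(K)

# e.g., K = 3 components predicted for step i (placeholder values)
print(mixture_log_prob(0.2, mu=[-1.0, 0.0, 1.0], sigma=[0.5, 0.3, 0.8]))
```
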
• Definition of Autoregressive Models (Ⅰ)
• Challenge of Generative Models
• Definition of Autoregressive Models (Ⅱ)
• Learning and Inference of Autoregressive Models
• Examples of Autoregressive Models
• Fully Visible Sigmoid Belief Network (FVSBN)
• Neural Autoregressive Density Estimation (NADE)
• Masked Autoencoder for Distribution Estimation (MADE)
• PixelRNN, PixelCNN, WaveNet….

Autoregressive Models vs. Autoencoders
• FVSBN and NADE look similar to an autoencoder:
• an encoder $e(\cdot)$, e.g., $e(x) = \sigma(W^2 (W^1 x + b^1) + b^2)$
• a decoder such that $d(e(x)) \approx x$

Binary reconstruction loss:
$\min_{W^1, W^2, b^1, b^2, V, c} \sum_{x \in \mathcal{D}} \sum_{i} \left( -x_i \log \hat{x}_i - (1 - x_i) \log(1 - \hat{x}_i) \right)$

Continuous reconstruction loss:
$\min_{W^1, W^2, b^1, b^2, V, c} \sum_{x \in \mathcal{D}} \sum_{i} (x_i - \hat{x}_i)^2$

• Encoder: feature learning.
• A vanilla autoencoder is not a generative model: it does not define a distribution over $x$ that
we can sample from to generate new data points.
Autoregressive Models vs. Autoencoders
• FVSBN and NADE look similar to an autoencoder.
• Can we get a generative model from an autoencoder?

A dependency-order constraint is required on the autoencoder to make it a valid Bayesian
network.

[Figure: autoencoder vs. autoregressive dependency structure]
Autoregressive Models vs. Autoencoders
• To get an autoregressive model from an autoencoder, we need to make sure it corresponds to
a valid Bayesian network, so we need an ordering. If the ordering is 1, 2, 3, then:
• $\hat{x}_1$ cannot depend on any input $x$.
• $\hat{x}_2$ can only depend on $x_1$.
• $\hat{x}_3$ can only depend on $x_1, x_2$.
• Bonus: we can use a single neural network (with $n$ outputs) to produce all the parameters.
In contrast, NADE requires $n$ passes. This is much more efficient on modern hardware.
MADE: Masked Autoencoder for Distribution Estimation
Use masks to constrain the dependency paths!
Each output unit is an estimated conditional distribution; it is only allowed to depend on the
inputs that come before it in the chosen ordering.

With the ordering $x_2, x_3, x_1$:
1. $p(x_2)$ does not depend on any input
2. $p(x_3 \mid x_2)$ depends on input $x_2$
3. $p(x_1 \mid x_2, x_3)$ depends on inputs $x_2, x_3$

[Figure: masked autoencoder]

Germain M, Gregor K, Murray I, Larochelle H. MADE: Masked Autoencoder for Distribution Estimation[C]// International Conference on Machine Learning (ICML). 2015.
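As a sketch of how such masks can be built for a single hidden layer (added here for illustration, following the degree-based construction described in the MADE paper, with hidden-unit degrees drawn at random):

```python
import numpy as np

def made_masks(n_in, n_hidden, ordering=None, seed=0):
    """Build MADE masks for one hidden layer.
    m_in[k]  : position of input x_k in the ordering (1 .. n_in)
    m_hid[j] : 'degree' of hidden unit j, drawn from {1, ..., n_in - 1}
    Hidden mask : allow input k -> hidden j  iff m_hid[j] >= m_in[k]
    Output mask : allow hidden j -> output k iff m_in[k]  >  m_hid[j]
    so output k only ever sees inputs that precede x_k in the ordering."""
    rng = np.random.default_rng(seed)
    m_in = np.array(ordering) if ordering is not None else np.arange(1, n_in + 1)
    m_hid = rng.integers(1, n_in, size=n_hidden)                     # degrees in {1, ..., n_in - 1}
    mask_hidden = (m_hid[:, None] >= m_in[None, :]).astype(float)    # shape (n_hidden, n_in)
    mask_output = (m_in[:, None] > m_hid[None, :]).astype(float)     # shape (n_in, n_hidden)
    return mask_hidden, mask_output

# ordering x2, x3, x1 from the slide: x1 is 3rd, x2 is 1st, x3 is 2nd
mask_h, mask_o = made_masks(n_in=3, n_hidden=5, ordering=[3, 1, 2])
# the masks are applied element-wise to the weight matrices: W * mask_h, V * mask_o
print(mask_h)
print(mask_o)
```
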


Generate Samples

[Figure] Performance of MADE on the MNIST dataset.
(Left) Samples from a 2-hidden-layer MADE.
(Right) Nearest neighbour in binarized MNIST.
Autoregressive Models in NLP
Natural language generation (NLG) is one of the important research fields of artificial
intelligence, including text-to-text generation, meaning-to-text generation, image-to-text
generation, etc.
When generating a word, it is always helpful to condition on the text that has already been
generated. This is why autoregressive models are widely adopted in NLP.

[Figure] Example of the powerful GPT-2 model generating text about the "First Law of Robotics".
Appendix A — Taxonomy of Generative Models
• Definition of Autoregressive Models (Ⅰ)
• Challenge of Generative Models
• Definition of Autoregressive Models (Ⅱ)
• Learning and Inference of Autoregressive Models
• Examples of Autoregressive Models
• Fully Visible Sigmoid Belief Network (FVSBN)
• Neural Autoregressive Density Estimation (NADE)
• Masked Autoencoder for Distribution Estimation (MADE)
• PixelRNN, PixelCNN, WaveNet…. (Next Lecture)

Thanks
