Lecture 5: Autoregressive Models

Autoregressive Models

Hao Dong

Peking University

Autoregressive Models
• Definition of Autoregressive Models (Ⅰ)
• Challenge of Generative Models
• Definition of Autoregressive Models (Ⅱ)
• Learning and Inference of Autoregressive Models
• Examples of Autoregressive Models
• Fully Visible Sigmoid Belief Network (FVSBN)
• Neural Autoregressive Density Estimation (NADE)
• Masked Autoencoder for Distribution Estimation (MADE)
• PixelRNN, PixelCNN, WaveNet…. (Next Lecture)

• Definition of Autoregressive Models (Ⅰ)
• Challenge of Generative Models
• Definition of Autoregressive Models (Ⅱ)
• Learning and Inference of Autoregressive Models
• Examples of Autoregressive Models
• Fully Visible Sigmoid Belief Network (FVSBN)
• Neural Autoregressive Density Estimation (NADE)
• Masked Autoencoder for Distribution Estimation (MADE)
• PixelRNN, PixelCNN, WaveNet….

Definition of Autoregressive Models
The term autoregressive originates from the time-series literature, where observations from
previous time steps are used to predict the value at the current time step.
Put simply, an autoregressive model is a feed-forward model that predicts future values from
past values:

$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, \sigma^2)$

$y_t$ could be:
• the price of a specific stock on day $t$,
• the amplitude of a simple pendulum at period $t$,
• or any variable that depends on its preceding values!
Definition of Autoregressive Models
Autoregressive models have a strong ability to represent data.

$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, \sigma^2)$

[Figure] Two examples of data simulated from autoregressive models with different parameters.
Left: AR(1) with $y_t = 18 - 0.8\, y_{t-1} + \varepsilon_t$. Right: AR(2) with $y_t = 8 + 1.3\, y_{t-1} - 0.7\, y_{t-2} + \varepsilon_t$.
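As a concrete illustration (a sketch added to these notes, not from the original slides), the two example processes above can be simulated directly with NumPy; the coefficients come from the figure caption, while the noise scale sigma = 1 is an assumption.

```python
import numpy as np

def simulate_ar(coeffs, c, n_steps, sigma=1.0, seed=0):
    """Simulate y_t = c + sum_k coeffs[k] * y_{t-1-k} + eps_t, with eps_t ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    p = len(coeffs)
    y = np.zeros(n_steps + p)          # first p entries serve as zero initial conditions
    for t in range(p, n_steps + p):
        past = y[t - p:t][::-1]        # y_{t-1}, y_{t-2}, ..., y_{t-p}
        y[t] = c + np.dot(coeffs, past) + rng.normal(0.0, sigma)
    return y[p:]

ar1 = simulate_ar([-0.8], c=18.0, n_steps=100)       # AR(1): y_t = 18 - 0.8 y_{t-1} + eps_t
ar2 = simulate_ar([1.3, -0.7], c=8.0, n_steps=100)   # AR(2): y_t = 8 + 1.3 y_{t-1} - 0.7 y_{t-2} + eps_t
print(ar1[:5], ar2[:5])
```
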
Definition of Autoregressive Models
Autoregressive models have a strong ability to represent data, and can be used for:

• Regression

• Generation

• Prediction
Recap: Statistical Generative Models

[Figure] A statistical generative model is learned from a dataset $\mathcal{D}$ together with prior
knowledge (model family, loss function, optimization algorithm, etc.) to approximate the data
distribution $p_{data}(x)$. Each sample $x$ (e.g., a 64x64x3 image) is a high-dimensional vector;
the learned density assigns high probability to realistic samples, and sampling from $p(x)$
generates new images, e.g., of a woman with blonde hair.
• Definition of Autoregressive Models (Ⅰ)
• Challenge of Generative Models
• Definition of Autoregressive Models (Ⅱ)
• Learning and Inference of Autoregressive Models
• Examples of Autoregressive Models
• Fully Visible Sigmoid Belief Network (FVSBN)
• Neural Autoregressive Density Estimation (NADE)
• Masked Autoencoder for Distribution Estimation (MADE)
• PixelRNN, PixelCNN, WaveNet….

Recap: Challenge of Generative Models
• Compactness
Suppose $x_1, x_2, x_3$ are binary variables. $P(x_1, x_2, x_3)$ can be specified with $2^3 - 1 = 7$ parameters.

What about a 28×28 black/white digit image?

$2^{28 \times 28} = 2^{784} \approx 10^{236}$ parameters! Yet the data has only about 10 modes, one for each digit 0, 1, 2, …, 9.
In general, a full joint table over $n$ binary variables needs $O(2^n)$ parameters.

Main challenge: distributions over high-dimensional objects are actually very sparse, and there
are far too many possibilities to enumerate. Main idea: write the joint as a product of simpler terms.
Recap: Challenge of Generative Models
• Solution #1: Factorization
Definition of conditional probability:
$P(x_1, x_2) = P(x_1)\, P(x_2 \mid x_1)$
Product rule (chain rule):
$P(x_1, x_2, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_{<i})$

Divide and conquer! We can model the joint distribution $P(\mathbf{x})$ by modeling the simpler
conditional distributions $P(x_i \mid x_{<i})$ one by one (a small numeric example follows below).

Still complex! It is hard to model every conditional distribution exactly. Can you tell the exact
likelihood of the next pixel (marked as a red point) conditioned on the given pixels?
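As a quick sanity check (an added example; the probability values below are made-up illustration numbers, not from the slides), the chain rule lets any joint probability be computed from the conditionals:

$P(x_1{=}1,\, x_2{=}0,\, x_3{=}1) \;=\; P(x_1{=}1)\; P(x_2{=}0 \mid x_1{=}1)\; P(x_3{=}1 \mid x_1{=}1, x_2{=}0) \;=\; 0.6 \times 0.3 \times 0.9 \;=\; 0.162$
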
Recap: Challenge of Generative Models
Solution #2a: use simple functions to form the conditionals
$P(x_4 \mid x_1, x_2, x_3) \approx \mathrm{sigmoid}(W_1 x_1 + W_2 x_2 + W_3 x_3)$
(the sigmoid squashes the output into (0, 1), so it can parameterize a binary variable)
◦ Only requires storing 3 parameters
◦ But the relationship between $x_4$ and $(x_1, x_2, x_3)$ could be too simple

Solution #2b: use a more complex functional form: a neural network with a sigmoid output
$Z_1 = f_{11}(x_1, x_2, x_3),\quad Z_2 = f_{12}(x_1, x_2, x_3),\quad Z_3 = f_{13}(x_1, x_2, x_3)$
$Y_1 = f_{21}(Z_1, Z_2, Z_3),\quad Y_2 = f_{22}(Z_1, Z_2, Z_3),\quad Y_3 = f_{23}(Z_1, Z_2, Z_3),\ \dots$
$P(x_4 \mid x_1, x_2, x_3) \approx \mathrm{sigmoid}(W_1 Y_1 + W_2 Y_2 + W_3 Y_3 + \dots)$
◦ More flexible
◦ More parameters
◦ More powerful at fitting data

Finally, it becomes possible to model the data distribution!

Source: IJCAI-ECAI 2018 Tutorial: Deep Generative Models
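To make the two parameterizations concrete, here is a minimal NumPy sketch (an illustration added to these notes; the weights are random placeholders rather than learned values) of a single conditional $P(x_4 \mid x_1, x_2, x_3)$ modeled by (a) plain logistic regression and (b) a small one-hidden-layer network:

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 1.0])              # observed values of x1, x2, x3

# Solution #2a: logistic regression, only 3 parameters
w = rng.normal(size=3)
p_x4_simple = sigmoid(w @ x)               # P(x4 = 1 | x1, x2, x3)

# Solution #2b: one-hidden-layer network, more parameters and more flexible
W1, b1 = rng.normal(size=(8, 3)), np.zeros(8)
W2, b2 = rng.normal(size=8), 0.0
h = np.tanh(W1 @ x + b1)                   # hidden features (the Z / Y units in the slide)
p_x4_mlp = sigmoid(W2 @ h + b2)            # P(x4 = 1 | x1, x2, x3)

print(p_x4_simple, p_x4_mlp)
```
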
• Definition of Autoregressive Models (Ⅰ)
• Challenge of Generative Models
• Definition of Autoregressive Models (Ⅱ)
• Learning and Inference of Autoregressive Models
• Examples of Autoregressive Models
• Fully Visible Sigmoid Belief Network (FVSBN)
• Neural Autoregressive Density Estimation (NADE)
• Masked Autoencoder for Distribution Estimation (MADE)
• PixelRNN, PixelCNN, WaveNet….

Definition of Autoregressive Models
By defining $\hat{x}_i$, the output at step $i$, as the conditional distribution of $x_i$ given the
previous inputs $x_1, x_2, \dots, x_{i-1}$, we obtain a probability model that represents the joint
distribution $p_\theta(x_1, x_2, \dots, x_n)$.
• Key idea: decompose the joint distribution into a product of tractable conditionals
$\hat{x}_i = p_\theta(x_i \mid x_1, x_2, \dots, x_{i-1})$
$p_\theta(\mathbf{x}) = \prod_{i=1}^{n} p_\theta(x_i \mid x_1, x_2, \dots, x_{i-1}) = \prod_{i=1}^{n} p_\theta(x_i \mid x_{<i})$

• Graphical model: a directed, fully-observed Bayesian network
Definition of Autoregressive Models

[Figures] Obligatory RNN diagram (source: Chris Olah) and WaveNet animation (source: Google DeepMind).

Relationship with RNNs:
• Like an RNN, an autoregressive model's output $h_t$ at time $t$ depends not just on $x_t$ but
also on $x_1, x_2, \dots, x_{t-1}$ from previous time steps.
• However, unlike an RNN, the previous $x_1, x_2, \dots, x_{t-1}$ are not provided via some hidden
state: they are given directly as inputs to the model.
• Definition of Autoregressive Models (Ⅰ)
• Challenge of Generative Models
• Definition of Autoregressive Models (Ⅱ)
• Learning and Inference of Autoregressive Models
• Examples of Autoregressive Models
• Fully Visible Sigmoid Belief Network (FVSBN)
• Neural Autoregressive Density Estimation (NADE)
• Masked Autoencoder for Distribution Estimation (MADE)
• PixelRNN, PixelCNN, WaveNet….

Learning and Inference of Autoregressive Models
• Learning maximizes the model log-likelihood over the dataset:

$\min_{\theta \in \mathcal{M}} d_{KL}(p_{data}, p_\theta) = \min_{\theta \in \mathcal{M}} \mathbb{E}_{x \sim p_{data}}\left[\log p_{data}(x) - \log p_\theta(x)\right] \;\propto\; \max_{\theta \in \mathcal{M}} \mathbb{E}_{x \sim p_{data}}\left[\log p_\theta(x)\right]$

$\max_{\theta} \log p_\theta(\mathcal{D}) = \sum_{x \in \mathcal{D}} \log p_\theta(x) = \sum_{x \in \mathcal{D}} \sum_{i=1}^{n} \log p_\theta(x_i \mid x_{<i})$

Tractable: the distribution is simple enough to be modeled explicitly.
Tractable conditionals make learning each conditional distribution meaningful, and allow for
exact likelihood evaluation.
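The objective above is simply a sum of per-step log-conditionals. A minimal sketch (added for illustration; `cond_prob(x, i)` is a hypothetical model function returning $p_\theta(x_i = 1 \mid x_{<i})$ for binary data):

```python
import numpy as np

def log_likelihood(X, cond_prob):
    """Sum of log p_theta(x_i | x_<i) over all dimensions and all samples.
    X: (num_samples, n) array of 0/1 values.
    cond_prob(x, i): returns p_theta(x_i = 1 | x_1 .. x_{i-1}) for one sample x."""
    total = 0.0
    for x in X:
        for i in range(len(x)):
            p1 = cond_prob(x, i)
            total += np.log(p1 if x[i] == 1 else 1.0 - p1)
    return total  # maximize this (or minimize its negative) with respect to theta
```
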
Learning and Inference of Autoregressive Models
• Inference samples each variable of a data point from the estimated conditional distributions,
step by step, until the whole data point is generated.

Ancestral sampling: a process for producing samples from a probabilistic model.
First, sample the variables that have no conditional constraints from their prior distribution:
$x_1 \sim p_\theta(x_1)$.
Then sample each child variable from its conditional distribution given its parents, and repeat:
$x_2 \sim p_\theta(x_2 \mid x_1), \dots$

Because autoregressive models directly model and output these conditional distributions,
ancestral sampling is straightforward (see the sketch below).
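A minimal ancestral-sampling loop for binary data (added for illustration; again assuming a hypothetical `cond_prob(x, i)` that returns $p_\theta(x_i = 1 \mid x_{<i})$):

```python
import numpy as np

def ancestral_sample(n, cond_prob, seed=0):
    """Generate one sample x_1 .. x_n by sampling each conditional in order."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n, dtype=int)
    for i in range(n):
        p1 = cond_prob(x, i)        # p_theta(x_i = 1 | x_1 .. x_{i-1}); only x[:i] is used
        x[i] = rng.random() < p1    # Bernoulli sample
    return x
```
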
Learning and Inference of Autoregressive Models

Differences between autoregressive models (AR), VAEs and GANs:

• A GAN does not define an explicit distribution; it uses a discriminator to learn the data
distribution implicitly.
• A VAE assumes the data distribution is too complex to model directly, so it learns the
distribution by defining a simple intermediate (latent) distribution and learning a mapping from
that simple distribution to the complex data distribution.
• An AR model assumes the data distribution can be modeled directly (it is tractable): it defines
its outputs as conditional distributions and solves the generation problem by explicitly modeling
each conditional distribution.
Learning and Inference of Autoregressive Models
Conclusion:
1. Using complex networks, at each step an autoregressive model outputs an approximate
conditional distribution $\hat{x}_i = p_\theta(x_i \mid x_1, x_2, \dots, x_{i-1})$.

2. Feeding in the previous inputs $x_1, x_2, \dots, x_{i-1}$, and obtaining each next input $x_i$ by
sampling from the previously estimated conditional distribution $\hat{x}_i$, the autoregressive
model generates all conditional distributions iteratively:
$x_1 \sim p_\theta(x_1),\; x_2 \sim p_\theta(x_2 \mid x_1),\; x_3 \sim p_\theta(x_3 \mid x_1, x_2),\; \dots,\; x_n \sim p_\theta(x_n \mid x_1, \dots, x_{n-1})$

3. The product rule guarantees that the generated data point, made up of the sampled results
$x_i$ from each step, follows the data distribution:
$(x_1, x_2, \dots, x_n) \sim \prod_{i=1}^{n} p_\theta(x_i \mid x_{<i})$
• Definition of Autoregressive Models (Ⅰ)
• Challenge of Generative Models
• Definition of Autoregressive Models (Ⅱ)
• Learning and Inference of Autoregressive Models
• Examples of Autoregressive Models
• Fully Visible Sigmoid Belief Network (FVSBN)
• Neural Autoregressive Density Estimation (NADE)
• Masked Autoencoder for Distribution Estimation (MADE)
• PixelRNN, PixelCNN, WaveNet….

Fully Visible Sigmoid Belief Network (FVSBN)
• A fully visible sigmoid belief network without any hidden units is denoted FVSBN.
• The conditional variables $x_i \mid x_1, \dots, x_{i-1}$ in FVSBN are Bernoulli. Some conditionals
are too complex to model exactly, so FVSBN assumes a logistic-regression form for their
parameters:

$\hat{x}_i = p(x_i = 1 \mid x_1, x_2, \dots, x_{i-1}) = f_i(x_1, x_2, \dots, x_{i-1};\ \alpha^{(i)}) = \sigma(\alpha_0^{(i)} + \alpha_1^{(i)} x_1 + \dots + \alpha_{i-1}^{(i)} x_{i-1})$

• $\sigma$ denotes the sigmoid function
• $\alpha^{(i)} = \{\alpha_0^{(i)}, \alpha_1^{(i)}, \dots, \alpha_{i-1}^{(i)}\}$ denotes the parameters of the $i$-th conditional
• The conditional for variable $x_i$ requires $i$ parameters, and hence the total number of
parameters in the model is $\sum_{i=1}^{n} i = O(n^2) \ll O(2^n)$

[Figure: FVSBN graphical model over $x_1, x_2, x_3, x_4$]

Gan Z, Henao R, Carlson D, et al. Learning Deep Sigmoid Belief Networks with Data Augmentation[C]// Artificial Intelligence and Statistics (AISTATS). 2015.
FVSBN Example

• Suppose we have a dataset D of handwritten digits (binarized MNIST).
• Each image has n = 28×28 = 784 pixels. Each pixel is either black (0) or white (1).
• We want to learn a probability distribution $p(x) = p(x_1, \dots, x_{784})$ over $x \in \{0, 1\}^{784}$
such that when $x \sim p(x)$, $x$ looks like a digit.
• Idea: define a FVSBN model, then pick a good one based on the training data D
(more on that later).
FVSBN Example
• We can pick an ordering, i.e., order the variables (pixels) from the top-left ($x_1$) to the
bottom-right ($x_{784}$).
• Use the product-rule factorization (without loss of generality):
$p(x_1, \dots, x_{784}) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_{784} \mid x_1, \dots, x_{783})$
• FVSBN modeling assumption (fewer parameters):
$\hat{x}_i = p(x_i = 1 \mid x_1, x_2, \dots, x_{i-1}) = f_i(x_1, x_2, \dots, x_{i-1};\ \alpha^{(i)}) = \sigma(\alpha_0^{(i)} + \alpha_1^{(i)} x_1 + \dots + \alpha_{i-1}^{(i)} x_{i-1})$
• Note: this is a modeling assumption. We are using logistic regression to predict the
distribution of the next pixel based on the previous ones. This is called autoregressive.
FVSBN Example

[Figure: FVSBN over binary variables $x_1, x_2, x_3, x_4$, producing $\hat{x}_1, \hat{x}_2, \hat{x}_3, \hat{x}_4$]

• How to evaluate $p(x_1, \dots, x_{784})$, i.e., density estimation? Multiply all the conditionals (factors).
In the example above:
$p(x_1 = 0, x_2 = 1, x_3 = 1, x_4 = 0)$
$= p(x_1 = 0)\, p(x_2 = 1 \mid x_1 = 0)\, p(x_3 = 1 \mid x_1 = 0, x_2 = 1)\, p(x_4 = 0 \mid x_1 = 0, x_2 = 1, x_3 = 1)$
$= (1 - \hat{x}_1) \times \hat{x}_2 \times \hat{x}_3 \times (1 - \hat{x}_4)$

• How to sample from $p(x_1, \dots, x_{784})$?
1. Sample $\bar{x}_1 \sim p(x_1)$   (np.random.choice([1, 0], p=[$\hat{x}_1$, 1 − $\hat{x}_1$]))
2. Sample $\bar{x}_2 \sim p(x_2 \mid x_1 = \bar{x}_1)$
3. Sample $\bar{x}_3 \sim p(x_3 \mid x_1 = \bar{x}_1, x_2 = \bar{x}_2)$
···
A runnable sketch combining evaluation and sampling follows below.
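Putting the pieces together, here is a small NumPy sketch of FVSBN (added for illustration; the weights are randomly initialized placeholders rather than trained values), covering both density evaluation and ancestral sampling:

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

class FVSBN:
    def __init__(self, n, seed=0):
        rng = np.random.default_rng(seed)
        # alpha[i] holds (alpha_0, alpha_1, ..., alpha_{i-1}) for the i-th conditional
        self.alpha = [rng.normal(scale=0.1, size=i + 1) for i in range(n)]
        self.n = n

    def cond_prob(self, x, i):
        """p(x_i = 1 | x_1 .. x_{i-1}) = sigmoid(alpha_0 + alpha_1 x_1 + ...)."""
        a = self.alpha[i]
        return sigmoid(a[0] + np.dot(a[1:], x[:i]))

    def log_prob(self, x):
        """log p(x) = sum_i log p(x_i | x_<i)."""
        lp = 0.0
        for i in range(self.n):
            p1 = self.cond_prob(x, i)
            lp += np.log(p1 if x[i] == 1 else 1.0 - p1)
        return lp

    def sample(self, seed=0):
        """Ancestral sampling: generate one pixel at a time in the chosen ordering."""
        rng = np.random.default_rng(seed)
        x = np.zeros(self.n, dtype=int)
        for i in range(self.n):
            x[i] = rng.random() < self.cond_prob(x, i)
        return x

model = FVSBN(n=784)            # one conditional per MNIST pixel
x = model.sample()
print(x.shape, model.log_prob(x))
```
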
• Definition of Autoregressive Models (Ⅰ)
• Challenge of Generative Models
• Definition of Autoregressive Models (Ⅱ)
• Learning and Inference of Autoregressive Models
• Examples of Autoregressive Models
• Fully Visible Sigmoid Belief Network (FVSBN)
• Neural Autoregressive Density Estimation (NADE)
• Masked Autoencoder for Distribution Estimation (MADE)
• PixelRNN, PixelCNN, WaveNet….

NADE: Neural Autoregressive Density Estimation

Improve FVSBN: use a one-hidden-layer neural network instead of logistic regression for each
conditional.

$\mathbf{h}_i = \sigma(A_i \mathbf{x}_{<i} + \mathbf{c}_i)$
$\hat{x}_i = p(x_i \mid x_1, x_2, \dots, x_{i-1};\ A_i, \mathbf{c}_i, \boldsymbol{\alpha}_i, b_i) = \sigma(\boldsymbol{\alpha}_i \mathbf{h}_i + b_i)$

Tied weights are shared across conditionals to reduce the number of parameters and speed up
computation (see the blue dots in the figure).

[Figure: NADE architecture; in the figure the inputs $x_i$ are denoted $v_i$]

Uria B, Côté M A, Gregor K, et al. Neural autoregressive distribution estimation[J]. The Journal of Machine Learning Research, 2016, 17(1): 7184-7220.
NADE: Neural Autoregressive Density Estimation

$\mathbf{h}_i = \sigma(A_i \mathbf{x}_{<i} + \mathbf{c}_i)$, where $\mathbf{x}_{<i} \in \mathbb{R}^{i-1}$ denotes the vector of preceding $x$'s and $\mathbf{h}_i \in \mathbb{R}^{d}$ denotes the hidden-layer activations of the MLP
$\hat{x}_i = p(x_i \mid x_1, x_2, \dots, x_{i-1};\ A_i, \mathbf{c}_i, \boldsymbol{\alpha}_i, b_i) = f_i(x_1, x_2, \dots, x_{i-1}) = \sigma(\boldsymbol{\alpha}_i \mathbf{h}_i + b_i)$

$\theta_i = \{A_i \in \mathbb{R}^{d \times (i-1)},\ \mathbf{c}_i \in \mathbb{R}^{d},\ \boldsymbol{\alpha}_i \in \mathbb{R}^{d},\ b_i \in \mathbb{R}\}$ is the set of parameters for step $i$.

The total number of parameters in this model is dominated by the matrices $\{A_1, A_2, \dots, A_n\}$
and is $O(n^2 d)$.
Sharing parameters: tying the weights (each $A_i$ consists of the first $i-1$ columns of a single
shared matrix, with a shared hidden bias) reduces the number of parameters to $O(nd)$ and
speeds up computation, as in the sketch below.
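A minimal sketch of the tied-weight NADE computation (added for illustration; the parameters are random placeholders, not trained values). Note how the hidden pre-activation is updated incrementally, which is what makes the shared parameterization fast:

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
rng = np.random.default_rng(0)

n, d = 784, 50                          # number of visible units, hidden size
W = rng.normal(scale=0.1, size=(d, n))  # shared input-to-hidden weights (A_i = W[:, :i])
c = np.zeros(d)                         # shared hidden bias
V = rng.normal(scale=0.1, size=(n, d))  # per-output hidden-to-output weights (alpha_i)
b = np.zeros(n)                         # per-output biases

def nade_log_prob(x):
    """log p(x) = sum_i log p(x_i | x_<i), computed in O(n*d) with the incremental trick."""
    a = c.copy()                        # pre-activation for h_1 (no preceding inputs yet)
    lp = 0.0
    for i in range(n):
        h = sigmoid(a)                  # h_i = sigmoid(W[:, :i] @ x[:i] + c)
        p1 = sigmoid(V[i] @ h + b[i])   # p(x_i = 1 | x_<i)
        lp += np.log(p1 if x[i] == 1 else 1.0 - p1)
        a += W[:, i] * x[i]             # incorporate x_i for the next conditional
    return lp

x = rng.integers(0, 2, size=n)
print(nade_log_prob(x))
```
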
Generate Samples

[Figure] Performance of FVSBN and NADE on the MNIST dataset. (Left) Training data.
(Middle) Averaged synthesized samples. (Right) Learned features at the bottom layer.
Generate Other Distributions
• How do we model non-binary discrete random variables $v_i \in \{1, \dots, K\}$, e.g., pixel
intensities varying from 0 to 255?

• One solution: let $\hat{\mathbf{v}}_i$ parameterize a categorical distribution:

$\mathbf{h}_i = \sigma(A_i \mathbf{v}_{<i} + \mathbf{c}_i)$
$\hat{\mathbf{v}}_i = p(v_i \mid v_1, \dots, v_{i-1}) = (p_i^1, p_i^2, \dots, p_i^K) = \mathrm{softmax}(U_i \mathbf{h}_i + \mathbf{b}_i)$

• Softmax generalizes the sigmoid/logistic function $\sigma(\cdot)$ and transforms a vector of $K$
numbers into a vector of $K$ probabilities (non-negative, summing to 1):

$\mathrm{softmax}(\mathbf{a}) = \mathrm{softmax}(a^1, \dots, a^K) = \left( \frac{\exp a^1}{\sum_k \exp a^k}, \dots, \frac{\exp a^K}{\sum_k \exp a^k} \right)$
Generate Other Distributions
• How do we model continuous random variables $v_i \in \mathbb{R}$, e.g., speech signals?

• One solution: let $\hat{\mathbf{v}}_i$ parameterize a continuous distribution,
e.g., a uniform mixture of $K$ Gaussians:

$\mathbf{h}_i = \sigma(A_i \mathbf{v}_{<i} + \mathbf{c}_i)$
$\hat{\mathbf{v}}_i = f(\mathbf{h}_i) = (\mu_i^1, \dots, \mu_i^K, \sigma_i^1, \dots, \sigma_i^K)$
$p(v_i \mid v_1, \dots, v_{i-1}) = \frac{1}{K} \sum_{k=1}^{K} \mathcal{N}(v_i;\, \mu_i^k, \sigma_i^k)$
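For illustration (an added sketch, not from the slides), evaluating such a uniform mixture-of-Gaussians conditional, given the means and standard deviations predicted by the network, might look like:

```python
import numpy as np

def mixture_log_prob(v, mu, sigma):
    """log p(v) for a uniform mixture of K Gaussians with means mu[k] and std devs sigma[k]."""
    mu, sigma = np.asarray(mu), np.asarray(sigma)
    K = len(mu)
    # component log-densities log N(v; mu_k, sigma_k)
    comp = -0.5 * ((v - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
    # log( (1/K) * sum_k exp(comp_k) ), computed stably
    m = comp.max()
    return m + np.log(np.exp(comp - m).sum()) - np.log(K)

# e.g., K = 3 components predicted for step i (placeholder values)
print(mixture_log_prob(0.2, mu=[-1.0, 0.0, 1.0], sigma=[0.5, 0.3, 0.8]))
```
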
• Definition of Autoregressive Models (Ⅰ)
• Challenge of Generative Models
• Definition of Autoregressive Models (Ⅱ)
• Learning and Inference of Autoregressive Models
• Examples of Autoregressive Models
• Fully Visible Sigmoid Belief Network (FVSBN)
• Neural Autoregressive Density Estimation (NADE)
• Masked Autoencoder for Distribution Estimation (MADE)
• PixelRNN, PixelCNN, WaveNet….

Autoregressive Models vs. Autoencoders
• FVSBN and NADE look similar to an autoencoder:
• an encoder $e(\cdot)$, e.g., $e(x) = \sigma(W^2 (W^1 x + b^1) + b^2)$
• a decoder such that $d(e(x)) \approx x$

Binary reconstruction loss:
$\min_{W^1, W^2, b^1, b^2, V, c} \sum_{x \in \mathcal{D}} \sum_{i} \left( -x_i \log \hat{x}_i - (1 - x_i) \log(1 - \hat{x}_i) \right)$

Continuous reconstruction loss:
$\min_{W^1, W^2, b^1, b^2, V, c} \sum_{x \in \mathcal{D}} \sum_{i} (x_i - \hat{x}_i)^2$

• Encoder: feature learning.
• A vanilla autoencoder is not a generative model: it does not define a distribution over $x$ that
we can sample from to generate new data points.
Autoregressive Models vs. Autoencoders
• FVSBN and NADE look similar to an autoencoder.
• Can we get a generative model from an autoencoder?

A dependency-order constraint is required on the autoencoder to make it a valid Bayesian
network.

[Figure: autoencoder vs. autoregressive dependency structure]
Autoregressive Models vs. Autoencoders
• To get an autoregressive model from an autoencoder, we need to make sure it corresponds to
a valid Bayesian network, so we need an ordering. If the ordering is 1, 2, 3, then:
• $\hat{x}_1$ cannot depend on any input $x$.
• $\hat{x}_2$ can only depend on $x_1$.
• $\hat{x}_3$ can only depend on $x_1, x_2$.
• Bonus: we can use a single neural network (with $n$ outputs) to produce all the parameters.
In contrast, NADE requires $n$ passes. This is much more efficient on modern hardware.
MADE: Masked Autoencoder for Distribution Estimation
Use masks to constrain the dependency paths!
Each output unit is an estimated conditional distribution; it is only allowed to depend on the
inputs that come before it in the chosen ordering.

With the ordering $x_2, x_3, x_1$:
1. $p(x_2)$ does not depend on any input
2. $p(x_3 \mid x_2)$ depends on input $x_2$
3. $p(x_1 \mid x_2, x_3)$ depends on inputs $x_2, x_3$

[Figure: masked autoencoder]

Germain M, Gregor K, Murray I, Larochelle H. MADE: Masked Autoencoder for Distribution Estimation[C]// International Conference on Machine Learning (ICML). 2015.
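As a sketch of how such masks can be built for a single hidden layer (added here for illustration, following the degree-based construction described in the MADE paper, with hidden-unit degrees drawn at random):

```python
import numpy as np

def made_masks(n_in, n_hidden, ordering=None, seed=0):
    """Build MADE masks for one hidden layer.
    m_in[k]  : position of input x_k in the ordering (1 .. n_in)
    m_hid[j] : 'degree' of hidden unit j, drawn from {1, ..., n_in - 1}
    Hidden mask : allow input k -> hidden j  iff m_hid[j] >= m_in[k]
    Output mask : allow hidden j -> output k iff m_in[k]  >  m_hid[j]
    so output k only ever sees inputs that precede x_k in the ordering."""
    rng = np.random.default_rng(seed)
    m_in = np.array(ordering) if ordering is not None else np.arange(1, n_in + 1)
    m_hid = rng.integers(1, n_in, size=n_hidden)                     # degrees in {1, ..., n_in - 1}
    mask_hidden = (m_hid[:, None] >= m_in[None, :]).astype(float)    # shape (n_hidden, n_in)
    mask_output = (m_in[:, None] > m_hid[None, :]).astype(float)     # shape (n_in, n_hidden)
    return mask_hidden, mask_output

# ordering x2, x3, x1 from the slide: x1 is 3rd, x2 is 1st, x3 is 2nd
mask_h, mask_o = made_masks(n_in=3, n_hidden=5, ordering=[3, 1, 2])
# the masks are applied element-wise to the weight matrices: W * mask_h, V * mask_o
print(mask_h)
print(mask_o)
```
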


Generate Samples

[Figure] Performance of MADE on the MNIST dataset.
(Left) Samples from a 2-hidden-layer MADE.
(Right) Nearest neighbour in binarized MNIST.
Autoregressive Models in NLP
Natural language generation (NLG) is one of the important research fields of artificial
intelligence, including text-to-text generation, meaning-to-text generation, image-to-text
generation, etc.
When generating a word, it is always helpful to condition on the text that has already been
generated. This is why autoregressive models are widely adopted in NLP.

[Figure] Example of the powerful GPT-2 model generating text about the "First Law of Robotics".
Appendix A — Taxonomy of Generative Models
• Definition of Autoregressive Models (Ⅰ)
• Challenge of Generative Models
• Definition of Autoregressive Models (Ⅱ)
• Learning and Inference of Autoregressive Models
• Examples of Autoregressive Models
• Fully Visible Sigmoid Belief Network (FVSBN)
• Neural Autoregressive Density Estimation (NADE)
• Masked Autoencoder for Distribution Estimation (MADE)
• PixelRNN, PixelCNN, WaveNet…. (Next Lecture)

Thanks
