Week 9-10 RNN

This document provides an overview of Recurrent Neural Networks (RNNs), their advantages, and challenges such as exploding and vanishing gradients. It discusses various RNN variants like LSTMs and GRUs, and highlights their applications in sequence modeling tasks such as machine translation and sentiment analysis. The document also emphasizes the importance of sequential data processing and the need for memory mechanisms in neural networks.


Week 9

Recurrent Neural Networks

Lecture Outline
• Motivation: Sequence Modeling
• Understanding Recurrent Neural Networks (RNNs)
• Challenges in vanilla RNNs: Exploding and Vanishing gradients. Why? Remedies?
• RNN variants:
  • Long Short Term Memory (LSTM) networks, Gated Recurrent Units (GRUs)
  • Bi-directional Sequence Learning
  • Recursive Neural Networks (RecNNs): TreeRNNs and TreeLSTMs
  • Deep, Multi-tasking and Generative RNNs (overview)
• Attention Mechanism: Attentive RNNs
• RNNs in Practice + Applications
• Introduction to Explainability/Interpretability of RNNs
Convolutional vs Recurrent Neural Networks

CNN/FF-Nets
• each output depends only on the current input
• feed-forward nets do not remember historic input data at test time, unlike recurrent networks

RNN
• performs well when the input data is interdependent in a sequential pattern
• exploits the correlation between the previous input and the next input
• introduces a bias based on the previous output
Introduction

A vanilla RNN (Recurrent Neural Network) is the simplest type of neural network for processing sequential data.

The hidden state at the current time step is determined by the input at the current time step and the hidden state from the previous time step.
Recurrent neural networks
• Date back to (Rumelhart et al., 1986)
• A family of neural networks for handling sequential data, which involves variable-length inputs or outputs
• Especially useful for natural language processing (NLP)

Sequential data
• Each data point: a sequence of vectors x^(t), for 1 ≤ t ≤ τ
• Batch data: many sequences with different lengths τ
• Label: can be a scalar, a vector, or even a sequence
• Examples: sentiment analysis, machine translation
Example: machine translation

More complicated sequential data
• Data point: two-dimensional sequences, like images
• Label: a different type of sequence, like text sentences
• Example: image captioning

Image captioning
• Figure from the paper "DenseCap: Fully Convolutional Localization Networks for Dense Captioning", by Justin Johnson, Andrej Karpathy, Li Fei-Fei
Computational graphs

A typical dynamic system:
s^(t+1) = f(s^(t); θ)

A system driven by external data:
s^(t+1) = f(s^(t), x^(t+1); θ)

Compact view:
s^(t+1) = f(s^(t), x^(t+1); θ)
• square: one-step time delay
• Key: the same f and θ are used for all time steps

(Figures from Deep Learning, Goodfellow, Bengio and Courville)
Recurrent neural networks
• Unrolled computational graph with Label, Loss, Output, State, and Input nodes
(Figure from Deep Learning, by Goodfellow, Bengio and Courville)

Recurrent neural networks
• Math formula: see the figure from Deep Learning, Goodfellow, Bengio and Courville
Advantage

• Hidden state: a lossy summary of the past.
• Shared functions and parameters: greatly reduce the capacity and are good for generalization in learning.
  • They explicitly use the prior knowledge that sequential data can be processed in the same way at different time steps (e.g., NLP).
• Yet still powerful (actually universal): any function computable by a Turing machine can be computed by such a recurrent network of finite size (see, e.g., Siegelmann and Sontag (1995)).
Training RNN

• Principle: unfold the computational graph, and use backpropagation.
• This is called the back-propagation through time (BPTT) algorithm.
• One can then apply any general-purpose gradient-based technique.
• Conceptually: first compute the gradients of the internal nodes, then compute the gradients of the parameters.
Recurrent neural networks: BPTT gradients
(Math formula and figures from Deep Learning, Goodfellow, Bengio and Courville)
• Gradient at L^(t): the total loss is the sum of the losses at the different time steps
• Gradient at o^(t)
• Gradient at s^(τ) (the final state)
• Gradient at s^(t) (intermediate states, through the states that follow them)
• Gradient at parameter V
Motivation: Need for Sequential Modeling

Why do we need Sequential Modeling?
Motivation: Need for Sequential Modeling

Examples of sequence data (input → output, via an RNN-based model):
• Speech Recognition: audio → "Hello, I am Pankaj."
• Machine Translation: "Hello, I am Pankaj." → "Hallo, ich bin Pankaj." (and the Hindi equivalent)
• Language Modeling: predict the next word in the sequence
• Named Entity Recognition: "Pankaj lives in Munich" → person, other, other, location
• Sentiment Classification: "There is nothing to like in this movie."
• Video Activity Analysis: video frames → "Punching"
Motivation: Need for Sequential Modeling

Inputs and outputs can be of different lengths in different examples.
Example:
Sentence 1: Pankaj lives in Munich
Sentence 2: Pankaj Gupta lives in Munich Germany (DE)

With an FF-net / CNN, the shorter sentence needs an additional word 'PAD', i.e., padding, so that both inputs have the same fixed length:
• Sentence 1: Pankaj→person, lives→other, in→other, Munich→location, PAD→other, …
• Sentence 2: Pankaj→person, Gupta→person, lives→other, in→other, Munich→location, Germany→location

A sequential model (RNN) instead models variable-length sequences directly:
• Pankaj lives in Munich → person other other location
• Pankaj Gupta lives in Munich Germany → person person other other location location
(*FF-net: Feed-forward network)
Motivation: Need for Sequential Modeling

Share features learned across different positions or time steps.
Example:
Sentence 1: Market falls into bear territory → Trading/Marketing
Sentence 2: Bear falls into market territory → UNK
Both sentences have the same unigram statistics.

• With no sequential or temporal modeling (i.e., order-less bag-of-words features), an FF-net / CNN treats the two sentences as the same sentence.
• Direction of information flow matters! A sequential model (RNN) captures language concepts, word ordering, and syntactic & semantic information, so the two word orders lead to different predictions (Trading vs UNK).
(*FF-net: Feed-forward network)
Motivation: Need for Sequential Modeling

Machine Translation: different input and output sizes, incurring sequential patterns.
• The Encoder encodes the input text "Pankaj lives in Munich".
• The Decoder generates the output text in the target language, e.g., "pankaj lebt in münchen" (German) or the Hindi equivalent.
Motivation: Need for Sequential Modeling

Convolutional vs Recurrent Neural Networks

RNN
- performs well when the input data is interdependent in a sequential pattern
- exploits the correlation between the previous input and the next input
- introduces a bias based on the previous output

CNN/FF-Nets
- each output depends only on the current input
- feed-forward nets do not remember historic input data at test time, unlike recurrent networks
Motivation: Need for Sequential Modeling

Memory-less models
• Autoregressive models: predict the next input in a sequence from a fixed number of previous inputs using "delay taps".
• Feed-forward neural networks: generalize autoregressive models by using non-linear hidden layers.

Memory networks
• possess a dynamic hidden state that can store long-term information, e.g., RNNs.
• Recurrent Neural Networks are very powerful because they combine the following properties:
  • Distributed hidden state: can efficiently store a lot of information about the past.
  • Non-linear dynamics: can update the hidden state in complicated ways.
  • Temporal and accumulative: can build semantics, e.g., word by word in a sequence over time.
Notations
• h_t: hidden unit
• x_t: input
• o_t: output
• W_hh: shared (hidden-to-hidden) weight parameter
• W_ho: weight parameter between the hidden layer and the output
• θ: parameters in general
• g_θ: non-linear function
• L_t: loss between the RNN outputs and the true output
• E_t: cross-entropy loss
Long Term and Short Term Dependencies

Short Term Dependencies
• need recent information to perform the present task.
• For example, in a language model, predict the next word based on the previous ones: "the clouds are in the ?" → 'sky'
• It is easy to predict 'sky' given the context, i.e., a short-term dependency.

Long Term Dependencies
• Consider a longer word sequence: "I grew up in France… … I speak fluent French."
• Recent information suggests that the next word is probably the name of a language, but to narrow down which language, we need the context of France, from further back.
Foundation of Recurrent Neural Networks

Goal
• model long-term dependencies
• connect previous information to the present task
• model a sequence of events with loops, allowing information to persist
(example: video activity recognition → "punching")

Feed-forward NNets cannot take time dependencies into account; sequential data needs a feedback mechanism.
• FF-net / CNN: x → o, with no memory of previous inputs
• RNN: a feedback mechanism or internal state (a loop), which unfolds in time as x_0 … x_t … x_T → o_0 … o_t … o_T with the shared weights W_hh
Foundation of Recurrent Neural Networks

Example: tagging the input sequence "Pankaj lives in Munich" with the output labels person, other, other, location over time.
• input layer: one-hot word vectors
• hidden layer: recurrent states connected through W_hh (with input weights W_xh)
• output layer + softmax layer: a per-token probability distribution over the labels (person, location, other), through W_ho
(Vanilla) Recurrent Neural Network

Process a sequence of vectors x by applying a recurrence at every time step:
h_t = g_θ(h_{t−1}, x_t)
• the new hidden state at time step t is some function g, with parameters W_hh and W_xh, of the old hidden state at time step t−1 and the input vector at time step t.

Remark: the same function g and the same set of parameters W are used at every time step.
(Vanilla) Recurrent Neural Network

Process a sequence of vectors x by applying a recurrence at every time step:
h_t = g_θ(h_{t−1}, x_t)
h_t = tanh(W_hh h_{t−1} + W_xh x_t)
o_t = softmax(W_ho h_t)

Remark: an RNN can be seen as a selective summarization of the input sequence into a fixed-size state/hidden vector via a recursive update (a code sketch of this update follows below).
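To make the recurrence concrete, here is a minimal numpy sketch of a single vanilla-RNN time step. It is not the lecture's own code; the matrix names W_xh, W_hh, W_ho and the toy dimensions are assumptions.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_ho):
    """One vanilla-RNN step: h_t = tanh(W_hh h_{t-1} + W_xh x_t), o_t = softmax(W_ho h_t)."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)         # new hidden state (fixed-size summary of the past)
    z_t = W_ho @ h_t                                  # output logits
    o_t = np.exp(z_t - z_t.max()); o_t /= o_t.sum()   # softmax over output classes
    return h_t, o_t

# toy usage: 10-dim input, 16-dim hidden state, 4 output classes
rng = np.random.default_rng(0)
W_xh = rng.normal(0, 0.1, (16, 10))
W_hh = rng.normal(0, 0.1, (16, 16))
W_ho = rng.normal(0, 0.1, (4, 16))
h, o = rnn_step(rng.normal(size=10), np.zeros(16), W_xh, W_hh, W_ho)
```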
Recurrent Neural Network: Probabilistic Interpretation

RNN as a generative model
• induces a set of procedures to model the conditional distribution of x_{t+1} given x_{<=t}, for all t = 1, …, T (with <bos>/<eos> marking the sequence boundaries).
• Think of the output as the probability distribution of x_{t+1} given the previous elements of the sequence (a sampling sketch follows below).
• Training: compute the probability of the sequence and use maximum likelihood training.

Details: https://www.cs.cmu.edu/~epxing/Class/10708-17/project-reports/project10.pdf
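The generative view can be sketched as ancestral sampling: treat the softmax output as p(x_{t+1} | x_{<=t}), sample the next symbol, and feed it back in. The sketch below is hypothetical; the embedding matrix E, the <bos>/<eos> ids, and all shapes are assumptions.

```python
import numpy as np

def sample_sequence(W_xh, W_hh, W_ho, E, bos_id, eos_id, max_len=20, rng=None):
    """Ancestral sampling from an RNN language model: x_{t+1} ~ p(. | x_{<=t})."""
    rng = rng or np.random.default_rng()
    h = np.zeros(W_hh.shape[0])
    x_id, out = bos_id, []
    for _ in range(max_len):
        h = np.tanh(W_hh @ h + W_xh @ E[x_id])     # update state with the embedding of the last symbol
        z = W_ho @ h
        p = np.exp(z - z.max()); p /= p.sum()      # softmax -> distribution over the next symbol
        x_id = rng.choice(len(p), p=p)             # sample the next symbol
        if x_id == eos_id:
            break
        out.append(x_id)
    return out

# toy usage: vocabulary of 5 symbols (0 = <bos>, 4 = <eos>), 8-dim embeddings, 12-dim hidden state
rng = np.random.default_rng(1)
E = rng.normal(0, 0.1, (5, 8))
W_xh, W_hh, W_ho = rng.normal(0, 0.1, (12, 8)), rng.normal(0, 0.1, (12, 12)), rng.normal(0, 0.1, (5, 12))
print(sample_sequence(W_xh, W_hh, W_ho, E, bos_id=0, eos_id=4, rng=rng))
```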
RNN: Computational Graphs
• Sequence of inputs: x_1, x_2, x_3, …
• Initial state A_0; the next state is computed by the shared function g_θ with parameters θ
• Sequence of outputs: o_1, o_2, o_3, …
RNN: Different Computational Graphs
• One to One: x_1 → o_1
• One to Many: x_1 → o_1, o_2, o_3
• Many to One: x_1, x_2, x_3, x_4 → o_1
• Many to Many: x_1, x_2, x_3, x_4 → o_1, o_2, o_3, o_4
Backpropagation through time (BPTT) in RNN
• Training recurrent networks via BPTT
• Compute gradients via backpropagation on the (multi-layer) unrolled model
• Think of the recurrent net as a layered, feed-forward net with shared weights, and then train that feed-forward net in the time domain
  • Forward through the entire sequence → compute the loss
  • Backward through the entire sequence → compute the gradient

Training algorithm in the time domain:
• The forward pass builds up a stack of the activities of all the units at each time step.
• The backward pass peels activities off the stack to compute the error derivatives at each time step.
• After the backward pass we add together the derivatives at all the different times for each weight.

(Lecture from the course Neural Networks for Machine Learning)
Backpropagation through time (BPTT) in RNN
• Training recurrent networks via BPTT
• The output at time t = T depends on the inputs from t = T down to t = 1.
• Unrolled graph: inputs x_1, x_2, x_3 → hidden states h_1, h_2, h_3 → outputs o_1, o_2, o_3 with losses E_1, E_2, E_3; the forward pass runs left to right through time.
• The backward pass (via partial derivatives) sends the gradient flow in the opposite direction: from each loss E_t through ∂E_t/∂h_t, and back through the chain ∂h_3/∂h_2 and ∂h_2/∂h_1.
Backpropagation through time (BPTT) in RNN
• The output at time t = T depends on the inputs from t = T down to t = 1.
• Let us take our loss/error function to be cross entropy:
E_t(o_t', o_t) = − o_t' log o_t
E(o', o) = Σ_t E_t(o_t', o_t) = − Σ_t o_t' log o_t
where the o_t' are the truth values. (A code sketch follows below.)
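A small numpy sketch of this loss, assuming one-hot truth vectors o_t'; the function and variable names are illustrative.

```python
import numpy as np

def cross_entropy(o_true, o_pred, eps=1e-12):
    """E_t(o_t', o_t) = -sum_k o'_{t,k} * log o_{t,k} for a single time step."""
    return -np.sum(o_true * np.log(o_pred + eps))

def sequence_loss(O_true, O_pred):
    """E = sum_t E_t over the whole sequence (rows are time steps)."""
    return sum(cross_entropy(ot, op) for ot, op in zip(O_true, O_pred))

# toy usage: 3 time steps, 4 classes; truths are one-hot, predictions are softmax outputs
O_true = np.eye(4)[[0, 2, 1]]
O_pred = np.array([[0.7, 0.1, 0.1, 0.1],
                   [0.2, 0.2, 0.5, 0.1],
                   [0.1, 0.6, 0.2, 0.1]])
print(sequence_loss(O_true, O_pred))
```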
Backpropagation through time (BPTT) in RNN

The output at time t = 3 depends on the inputs from t = 3 down to t = 1.

Writing gradients in a sum-of-products form:
∂E/∂θ = Σ_{1≤t≤3} ∂E_t/∂θ

Gradient w.r.t. the output weights W_ho (for t = 3):
∂E_3/∂W_ho = (∂E_3/∂o_3)(∂o_3/∂z_3)(∂z_3/∂W_ho),  where z_3 = W_ho h_3, i.e., o_3 = softmax(z_3)
           = o_3'(o_3 − 1) × (h_3),  where × = outer product
• ∂E_3/∂W_ho depends only on o_3, o_3' and h_3.

How is ∂E_3/∂W_ho = o_3'(o_3 − 1) × (h_3)? Proof (offline):
E_3(o_3', o_3) = −o_3' log o_3,  with o_3 = softmax(z_3) = e^{z_3}/Ω, Ω = Σ_i e^{z_i}, and z_3 = W_ho h_3
log o_3 = z_3 − log Ω, so ∂log(o_3)/∂z_3 = 1 − (1/Ω)·∂Ω/∂z_3 = 1 − o_3, hence ∂o_3/∂z_3 = o_3(1 − o_3)
∂E_3/∂z_3 = −o_3'(1 − o_3) = o_3'(o_3 − 1)
∂E_3/∂W_ho = (∂E_3/∂z_3)(∂z_3/∂W_ho) = o_3'(o_3 − 1) × (h_3)

1. http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/
2. https://stats.stackexchange.com/questions/235528/backpropagation-with-softmax-cross-entropy
Backpropagation through time (BPTT) in RNN

The output at time t = 3 depends on the inputs from t = 3 down to t = 1.

Writing gradients in a sum-of-products form:
∂E/∂W_hh = Σ_{1≤t≤3} ∂E_t/∂W_hh
Since h_3 depends on h_2 and h_2 depends on h_1, therefore (e.g., for t = 3):
∂E_3/∂W_hh = Σ_{1≤k≤3} (∂E_3/∂h_3)(∂h_3/∂h_k)(∂h_k/∂W_hh)
In general:
∂E_t/∂W_hh = Σ_{1≤k≤t} (∂E_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W_hh)

The term ∂h_t/∂h_k transports the error in time from step t back to step k:
∂h_t/∂h_k = Π_{t≥i>k} ∂h_i/∂h_{i−1} = Π_{t≥i>k} W_hh^T · diag[g′(h_{i−1})]
(W_hh^T: the recurrent weight matrix; diag[g′(h_{i−1})]: the derivative of the activation function)
• The Jacobian matrix is the matrix of all first-order partial derivatives of a vector-valued function; it generalizes the derivative to functions with multiple inputs and outputs.
Backpropagation through time (BPTT) in RNN

∂E_t/∂W_hh = Σ_{1≤k≤t} (∂E_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W_hh)
∂h_t/∂h_k = Π_{t≥i>k} W_hh^T · diag[g′(h_{i−1})]   (a product of Jacobian matrices)

Repeated matrix multiplications lead to vanishing and exploding gradients.
BPTT: Gradient Flow

∂E_3/∂W_hh = Σ_{k=1}^{3} (∂E_3/∂h_3)(∂h_3/∂h_k)(∂h_k/∂W_hh)
           = (∂E_3/∂o_3)(∂o_3/∂h_3)(∂h_3/∂h_2)(∂h_2/∂h_1)(∂h_1/∂W_hh)
           + (∂E_3/∂o_3)(∂o_3/∂h_3)(∂h_3/∂h_2)(∂h_2/∂W_hh)
           + (∂E_3/∂o_3)(∂o_3/∂h_3)(∂h_3/∂W_hh)
Backpropagation through time (BPTT) in RNN

Code snippet for forward propagation (offline; a hedged sketch follows below), before going to the BPTT code.

https://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf
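Since the referenced snippet is offline, the following is a hedged re-creation of a forward pass over a whole sequence that caches every hidden state (the "stack of activities" BPTT needs). It is not the CS224d code; all names and shapes are assumptions.

```python
import numpy as np

def rnn_forward(X, h0, W_xh, W_hh, W_ho):
    """Forward propagation over a whole sequence; caches h_t and o_t for BPTT."""
    H, O = [h0], []
    for x_t in X:                                    # X: sequence of input vectors, one per time step
        h_t = np.tanh(W_hh @ H[-1] + W_xh @ x_t)     # recurrence
        z_t = W_ho @ h_t
        o_t = np.exp(z_t - z_t.max()); o_t /= o_t.sum()
        H.append(h_t); O.append(o_t)
    return H, O                                      # H[t+1] is the state after input X[t]
```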
Backpropagation through time (BPTT) in RNN

Code snippet for backpropagation w.r.t. time (offline; a hedged sketch follows below).

∂E_t/∂W_hh = Σ_{k=1}^{t} (∂E_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W_hh)
           = (∂E_t/∂o_t)(∂o_t/∂h_t)(∂h_t/∂W_hh)
           + (∂E_t/∂o_t)(∂o_t/∂h_t)(∂h_t/∂h_{t−1})(∂h_{t−1}/∂W_hh)
           + (∂E_t/∂o_t)(∂o_t/∂h_t)(∂h_t/∂h_{t−1})(∂h_{t−1}/∂h_{t−2})(∂h_{t−2}/∂W_hh) + ⋯

For a tanh RNN with softmax outputs and cross-entropy loss:
∂E_t/∂W_ho = −(o_t − o_t')(h_t),   ∂E_t/∂h_t = −(o_t − o_t') W_ho,   ∂h_t/∂W_hh = (1 − h_t²)(h_{t−1})
A (last step):      (∂E_t/∂o_t)(∂o_t/∂h_t)(∂h_t/∂W_hh) = −(o_t − o_t') W_ho (1 − h_t²)(h_{t−1})
B (one step back):  (∂E_t/∂o_t)(∂o_t/∂h_t)(∂h_t/∂h_{t−1})(∂h_{t−1}/∂W_hh) = −(o_t − o_t') W_ho (1 − h_t²)(W_hh)(1 − h_{t−1}²)(h_{t−2})
∂E_t/∂W_hh = A + B + ⋯ (till the end of the dependency)
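The backpropagation snippet is likewise offline; the sketch below accumulates ∂E/∂W_hh in the A + B + ⋯ fashion derived above for a tanh RNN with softmax outputs and cross-entropy loss, using the states cached by rnn_forward and one-hot targets Y. It is illustrative, not the course's exact code.

```python
import numpy as np

def bptt(X, Y, H, O, W_hh, W_xh, W_ho):
    """Accumulate gradients over all time steps and all backward dependencies."""
    dW_hh, dW_xh, dW_ho = np.zeros_like(W_hh), np.zeros_like(W_xh), np.zeros_like(W_ho)
    for t in range(len(X)):
        dz = O[t] - Y[t]                      # dE_t/dz_t for softmax + cross entropy (i.e., o_t - o_t')
        dW_ho += np.outer(dz, H[t + 1])
        dh = W_ho.T @ dz                      # dE_t/dh_t
        for k in range(t, -1, -1):            # transport the error from step t back to step k
            da = dh * (1.0 - H[k + 1] ** 2)   # through tanh: gradient at the pre-activation of step k
            dW_hh += np.outer(da, H[k])       # contribution of h_{k-1} (the terms A, B, ... summed)
            dW_xh += np.outer(da, X[k])
            dh = W_hh.T @ da                  # dE_t/dh_{k-1}
    return dW_hh, dW_xh, dW_ho
```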
Challenges in Training an RNN: Vanishing Gradients

Short Term Dependencies
• need recent information to perform the present task.
• For example, in a language model, predict the next word based on the previous ones: "the clouds are in the ?" → 'sky'
• It is easy to predict 'sky' given the context, i.e., a short-term dependency → the (vanilla) RNN is good so far.

Long Term Dependencies
• Consider a longer word sequence: "I grew up in France… … I speak fluent French."
• Recent information suggests that the next word is probably the name of a language, but to narrow down which language, we need the context of France, from further back.
• As the gap increases, it becomes practically difficult for an RNN to learn from the past information.
Challenges in Training an RNN: Vanishing Gradients

Assume an RNN of 5 time steps (long-term dependencies), and look at the Jacobian terms during BPTT:
∂E_5/∂θ = (∂E_5/∂h_5)(∂h_5/∂h_4)(∂h_4/∂h_3)(∂h_3/∂h_2)(∂h_2/∂h_1)(∂h_1/∂θ) + (∂E_5/∂h_5)(∂h_5/∂h_4)(∂h_4/∂h_3)(∂h_3/∂h_2)(∂h_2/∂θ) + (∂E_5/∂h_5)(∂h_5/∂h_4)(∂h_4/∂h_3)(∂h_3/∂θ) + …
Call the successive terms A (longest range), B, C (shortest range). In the numerical example, the entries of A are on the order of 1e−10 to 1e−08, while the entries of C are on the order of 1e−06 to 1e−05.
• ∂E_5/∂θ is dominated by the short-term dependencies (e.g., C).
• The gradient vanishes for long-term dependencies, i.e., ∂E_5/∂θ is updated much less due to A than due to C.
• The long-term components go exponentially fast to norm 0 → no correlation is learned between temporally distant events.
Challenges in Training an RNN: Exploding Gradients

Assume an RNN of 5 time steps (long-term dependencies), and look at the Jacobian terms during BPTT:
∂E_5/∂θ = (∂E_5/∂h_5)(∂h_5/∂h_4)(∂h_4/∂h_3)(∂h_3/∂h_2)(∂h_2/∂h_1)(∂h_1/∂θ) + …
In this numerical example the entries of the terms are very large (on the order of 1e+04 up to 1e+10).
• ∂E_5/∂θ explodes, i.e., becomes NaN due to very large numbers.
• The large increase in the norm of the gradient during training is due to the explosion of the long-term components.
Vanishing Gradient in Long-term Dependencies

Often the sequences are long, e.g., documents, speech, etc.
(unrolled RNN over x_1, x_2, x_3, …, x_500 with losses E_1, …, E_500)
In practice, as the length of the sequence increases, the probability of training being successful decreases drastically. Why?
Vanishing Gradient in Long-term Dependencies

Why? Let us look at the recurrent part of the RNN equations:
h_t = g_θ(h_{t−1}, x_t)
h_t = tanh(W_hh h_{t−1} + W_xh x_t)
o_t = softmax(W_ho h_t)

Expansion:
h_t = W_hh f(h_{t−1}) + some other terms
h_t ≈ (W_hh)^t h_0 + some other terms   (unrolling the recurrence down to h_0)
Vanishing Gradient in Long-term Dependencies

Writing gradients in a sum-of-products form:
∂E/∂W_hh = Σ_{1≤t≤3} ∂E_t/∂W_hh
∂E_3/∂W_hh = Σ_{1≤k≤3} (∂E_3/∂h_3)(∂h_3/∂h_k)(∂h_k/∂W_hh)
(e.g., for k = 1: (∂E_3/∂h_3)(∂h_3/∂h_2)(∂h_2/∂h_1)(∂h_1/∂W_hh), since h_3 depends on h_2 and h_2 depends on h_1)

In general, with h_t = W_hh f(h_{t−1}) + some other terms:
∂E_t/∂W_hh = Σ_{1≤k≤t} (∂E_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W_hh)
∂h_t/∂h_k = Π_{t≥i>k} ∂h_i/∂h_{i−1} = Π_{t≥i>k} W_hh^T · diag[g′(h_{i−1})]

This term is a product of Jacobian matrices; it transports the error in time from step t back to step k.
Repeated matrix multiplications lead to vanishing gradients!
Mechanics behind Vanishing and Exploding Gradients

∂h_t/∂h_k = Π_{t≥i>k} W_hh^T · diag[g′(h_{i−1})]

Consider an identity activation function. If the recurrent matrix W_hh is diagonalizable:
W_hh = Q · Λ · Q^{−1}
• Q: matrix composed of the eigenvectors of W_hh
• Λ: diagonal matrix with the eigenvalues placed on the diagonal

Computing powers of W_hh (using the power iteration method):
W_hh^n = Q · Λ^n · Q^{−1}

Example (eigenvalues on the diagonal):
Λ = diag(−0.618, 1.618)  →  Λ^10 ≈ diag(0.0081, 122.99)
• an eigenvalue of magnitude < 1 shrinks toward 0 → vanishing gradients
• an eigenvalue of magnitude > 1 blows up → exploding gradients

→ We need tight conditions on the eigenvalues during training to prevent the gradients from vanishing or exploding.

Bengio et al, "On the difficulty of training recurrent neural networks." (2012)
Mechanics behind Vanishing and Exploding Gradients

∂E_t/∂W_hh = Σ_{1≤k≤t} (∂E_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W_hh),  with ∂h_t/∂h_k = Π_{t≥i>k} W_hh^T · diag[g′(h_{i−1})]

To find a sufficient condition for when gradients vanish, compute an upper bound for the term ∂h_t/∂h_k:
‖∂h_i/∂h_{i−1}‖ ≤ ‖W_hh^T‖ · ‖diag[g′(h_{i−1})]‖
→ find an upper bound for the norm of the Jacobian!
Mechanics behind Vanishing and Exploding Gradients  (offline)

Let us find an upper bound for the term ‖W_hh^T · diag[g′(h_{i−1})]‖.

Property of the matrix norm: ‖M‖_2 = sqrt(λ_max(M*M)) = γ_max(M)
• The spectral norm of a complex matrix M is defined as max{‖Mx‖_2 : ‖x‖ = 1}.
• The norm of a matrix equals its largest singular value and is related to its largest eigenvalue (spectral radius).

Proof:
• Put B = M*M, which is a Hermitian matrix. A linear transformation of the Euclidean vector space E is Hermitian iff there exists an orthonormal basis of E consisting of eigenvectors of B.
• Let λ_1, λ_2, …, λ_n be the eigenvalues of B and e_1, e_2, …, e_n an orthonormal basis of E.
• Let x = a_1 e_1 + … + a_n e_n (a linear combination of eigenvectors); its norm is ‖x‖ = (Σ_{i=1}^{n} a_i²)^{1/2}.
• Using the characteristic equation for the eigenvalues: Bx = B Σ_i a_i e_i = Σ_i a_i B e_i = Σ_i λ_i a_i e_i.
• Therefore ‖Mx‖² = ⟨Mx, Mx⟩ = ⟨x, M*Mx⟩ = ⟨x, Bx⟩ = ⟨Σ_i a_i e_i, Σ_i λ_i a_i e_i⟩ = Σ_i λ_i a_i² ≤ (max_{1≤j≤n} λ_j) · ‖x‖².
• Thus, if ‖x‖ = 1, then ‖M‖ ≤ sqrt(max_{1≤j≤n} λ_j)   … equation (1)
• Consider x_0 = e_{j0} with ‖x_0‖ = 1, where λ_{j0} = max_{1≤j≤n} λ_j is the largest eigenvalue of B = M*M: ⟨x_0, Bx_0⟩ = ⟨e_{j0}, λ_{j0} e_{j0}⟩ = λ_{j0}, so ‖M‖ ≥ sqrt(λ_{j0})   … equation (2)
• Combining (1) and (2) gives us ‖M‖_2 = sqrt(λ_max(M*M)) = γ_max(M)   … equation (3)

Remarks:
• The spectral norm of a matrix equals its largest singular value and is related to the largest eigenvalue (spectral radius). (A quick numerical check follows below.)
• If the matrix is square symmetric, the singular value equals the spectral radius.
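A quick numerical check of equation (3), i.e., that the spectral norm ‖M‖_2, the largest singular value γ_max(M), and sqrt(λ_max(M*M)) coincide; purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))

spectral_norm = np.linalg.norm(M, 2)                            # max ||Mx|| over ||x|| = 1
largest_singular = np.linalg.svd(M, compute_uv=False)[0]        # gamma_max(M)
sqrt_lambda_max = np.sqrt(np.max(np.linalg.eigvalsh(M.T @ M)))  # sqrt of largest eigenvalue of M*M

print(spectral_norm, largest_singular, sqrt_lambda_max)         # all three agree up to rounding
```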
Mechanics behind Vanishing and Exploding Gradients

Let us use these properties:
∂h_t/∂h_k = Π_{t≥i>k} ∂h_i/∂h_{i−1} = Π_{t≥i>k} W_hh^T · diag[g′(h_{i−1})]
‖∂h_i/∂h_{i−1}‖ ≤ ‖W_hh^T‖ · ‖diag[g′(h_{i−1})]‖ ≤ γ_W γ_g

• The gradient of the nonlinear function g (sigmoid or tanh) is bounded by a constant: ‖diag[g′(h_{i−1})]‖ ≤ γ_g, i.e., γ_g is an upper bound for the norm of the gradient of the activation function (γ_g = ¼ for sigmoid, γ_g = 1 for tanh).
• γ_W is the largest singular value of W_hh.
→ γ_W γ_g is an upper bound for the norm of the Jacobian!
Mechanics behind Vanishing and Exploding Gradients

Applying the bound ‖∂h_i/∂h_{i−1}‖ ≤ γ_W γ_g over all steps from k to t:
‖∂h_t/∂h_k‖ ≤ (γ_W γ_g)^{t−k}

Sufficient Condition for Vanishing Gradients
• If γ_W γ_g < 1 and (t−k) → ∞, then the long-term contributions go to 0 exponentially fast with t−k (power iteration method).
• Therefore, a sufficient condition for the vanishing gradient to occur is γ_W < 1/γ_g, i.e., γ_W < 4 for sigmoid and γ_W < 1 for tanh.

Necessary Condition for Exploding Gradients
• If γ_W γ_g > 1 and (t−k) → ∞, then the gradient explodes.
• Therefore, a necessary condition for the exploding gradient to occur is γ_W > 1/γ_g, i.e., γ_W > 4 for sigmoid and γ_W > 1 for tanh.
Vanishing Gradient in Long-term Dependencies

What have we concluded with the upper bound of the derivative from the recurrent step?
∂h_t/∂h_k = Π_{t≥i>k} W_hh^T · diag[g′(h_{i−1})],  with ‖∂h_i/∂h_{i−1}‖ ≤ γ_W γ_g and ‖∂h_t/∂h_k‖ ≤ (γ_W γ_g)^{t−k}

If we multiply the same term γ_W γ_g < 1 again and again, the overall number becomes very small (i.e., almost equal to zero).
Repeated matrix multiplications lead to vanishing and exploding gradients.
Vanishing Gradient in Long-term Dependencies

∂E_3/∂W_hh = (∂E_3/∂h_3)(∂h_3/∂h_2)(∂h_2/∂h_1)(∂h_1/∂W_hh) + (∂E_3/∂h_3)(∂h_3/∂h_2)(∂h_2/∂W_hh) + (∂E_3/∂h_3)(∂h_3/∂W_hh)
           =            ≪≪ 1                              +              ≪ 1                  +        < 1
             (gradient due to long-term dependencies)                                            (gradient due to short-term dependencies)

The total gradient no longer depends on the distant past inputs, since the near-past inputs dominate the gradient.

Remark: the gradients due to short-term dependencies (just the previous steps) dominate the gradients due to long-term dependencies. This means the network will tend to focus on short-term dependencies, which is often not desired → the problem of the Vanishing Gradient.
Repeated matrix multiplications lead to vanishing and exploding gradients.
Exploding Gradient in Long-term Dependencies

What have we concluded with the upper bound of the derivative from the recurrent step?
∂h_t/∂h_k = Π_{t≥i>k} W_hh^T · diag[g′(h_{i−1})],  with ‖∂h_i/∂h_{i−1}‖ ≤ γ_W γ_g and ‖∂h_t/∂h_k‖ ≤ (γ_W γ_g)^{t−k}

If we multiply the same term γ_W γ_g > 1 again and again, the overall number explodes, and hence the gradient explodes.
Repeated matrix multiplications lead to vanishing and exploding gradients.
Exploding Gradient in Long-term Dependencies

∂E_3/∂W_hh = (∂E_3/∂h_3)(∂h_3/∂h_2)(∂h_2/∂h_1)(∂h_1/∂W_hh) + (∂E_3/∂h_3)(∂h_3/∂h_2)(∂h_2/∂W_hh) + (∂E_3/∂h_3)(∂h_3/∂W_hh)
           =            ≫≫ 1                              +              ≫≫ 1                 +        ≫≫ 1
           = a very large number, i.e., NaN  →  the problem of the Exploding Gradient
Vanishing vs Exploding Gradients

‖∂h_t/∂h_k‖ ≤ (γ_W γ_g)^{t−k}   (for tanh or linear activation)
• γ_W γ_g less than 1 → the gradient vanishes!
• γ_W γ_g greater than 1 → the gradient explodes!

Remark: this problem of exploding/vanishing gradients occurs because the same number is multiplied into the gradient repeatedly (a small numerical demo follows below).
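A tiny numpy experiment illustrating the remark: multiplying by the same Jacobian over and over drives the product norm toward 0 when γ_W γ_g < 1 and blows it up when γ_W γ_g > 1. A linear/identity activation is assumed (so γ_g = 1), and the symmetric test matrix is purely illustrative.

```python
import numpy as np

def product_norm(W, steps):
    """Spectral norm of the product of `steps` identical Jacobians (linear activation: J = W^T)."""
    J = np.eye(W.shape[0])
    for _ in range(steps):
        J = W.T @ J
    return np.linalg.norm(J, 2)

rng = np.random.default_rng(0)
A = rng.normal(size=(16, 16))
W = (A + A.T) / 2                                    # symmetric, so spectral norm = spectral radius
for gamma_W in (0.9, 1.1):                           # largest singular value below / above 1
    W_s = gamma_W * W / np.linalg.norm(W, 2)
    print(gamma_W, [round(product_norm(W_s, t), 4) for t in (1, 20, 60)])
# gamma_W = 0.9: the product norm decays toward 0 (vanishing); gamma_W = 1.1: it grows (exploding)
```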
Dealing With Exploding Gradients

Dealing with Exploding Gradients: Gradient Clipping
• Scale down the gradients: rescale the norm of the gradients whenever it goes over a threshold (a sketch follows below).
• The proposed clipping is simple and computationally efficient,
• but it introduces an additional hyper-parameter, namely the threshold.

Pascanu et al., 2013. On the difficulty of training recurrent neural networks.
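A minimal sketch of clipping by the global gradient norm, in the spirit of Pascanu et al.; the function name and threshold value are illustrative, not a specific library API.

```python
import numpy as np

def clip_by_global_norm(grads, threshold):
    """Rescale all gradients if their combined L2 norm exceeds `threshold`."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > threshold:
        grads = [g * (threshold / total_norm) for g in grads]
    return grads

# toy usage: two gradient arrays whose joint norm is far above the threshold of 5.0
dW_hh, dW_xh = np.full((4, 4), 100.0), np.full((4, 3), -50.0)
clipped = clip_by_global_norm([dW_hh, dW_xh], threshold=5.0)
print([round(float(np.linalg.norm(g)), 3) for g in clipped])
```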
Dealing With Vanishing Gradients

Dealing with Vanishing Gradients
• As discussed, the gradient vanishes due to the recurrent part of the RNN equations: h_t = W_hh h_{t−1} + some other terms.
• What if the largest eigenvalue of the parameter matrix becomes 1? In that case the memory just grows.
• We need to be able to decide when to put information into the memory.
Activation functions (revisit)
Long Short Term Memory (LSTM): Gating Mechanism

Gates:
• a way to optionally let information through.
• composed of a sigmoid neural net layer and a pointwise multiplication operation.
• remove or add information to the cell state.

An LSTM has 3 gates: the Input Gate, the Output Gate and the Forget Gate, acting on the current cell state (e.g., remembering the word "clouds"); the gates protect and control the cell state. Input comes from the rest of the LSTM and output goes to the rest of the LSTM.

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Long Short Term Memory (LSTM): Gating Mechanism

Remember the word "clouds" over time:
• the input gate opens (value 1) once to write "clouds" into the cell, the forget gate keeps the stored content across the following steps, and the output gate opens (value 1) only when "clouds" is needed; otherwise the gates stay at 0.

(Lecture from the course Neural Networks for Machine Learning by Geoffrey Hinton)
Long Short Term Memory (LSTM)

Motivation:
• Create a self-loop path through which the gradient can flow.
• The self loop corresponds to an eigenvalue of the Jacobian being slightly less than 1.

new state = old state + update
∂(new state)/∂(old state) ≈ Identity

LONG SHORT-TERM MEMORY, Sepp Hochreiter and Jürgen Schmidhuber
Long Short Term Memory (LSTM): Step by Step

Key ingredients
• Cell state: transports the information through the units (the horizontal line running through the top of the LSTM); the LSTM removes or adds information to the cell state using gates.
• Gates: optionally allow information passage.
(http://colah.github.io/posts/2015-08-Understanding-LSTMs/)

Forget Gate:
• decides what information to throw away or remember from the previous cell state.
• decision maker: a sigmoid layer (the "forget gate layer"); its output lies between 0 and 1, with 0 meaning forget and 1 meaning keep.
• looks at h_{t−1} and x_t and outputs a number between 0 and 1 for each number in the cell state C_{t−1}.
f_t = sigmoid(θ_xf x_t + θ_hf h_{t−1} + b_f)

Input Gate: selectively updates the cell state based on the new input.
• A multiplicative input gate unit protects the memory contents stored in the cell from perturbation by irrelevant inputs.
• The next step is to decide what new information we are going to store in the cell state. This has two parts:
  1. A sigmoid layer called the "input gate layer" decides which values we will update: i_t = sigmoid(θ_xi x_t + θ_hi h_{t−1} + b_i)
  2. A tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state.
• In the next step, we combine these two to create an update to the state.

Cell Update:
• update the old cell state C_{t−1} into the new cell state C_t.
• multiply the old state by f_t, forgetting the things we decided to forget earlier.
• add i_t ∗ C̃_t, the new candidate values, scaled by how much we decided to update each state value: C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t

Output Gate: the output is a filtered version of the cell state.
• decides which part of the cell we want as our output, in the form of the new hidden state.
• a multiplicative output gate protects other units from perturbation by currently irrelevant memory contents.
• a sigmoid layer decides what parts of the cell state go to the output; apply tanh to the cell state and multiply it by the output of the sigmoid gate, so that only the decided parts are output.
o_t = sigmoid(θ_xo x_t + θ_ho h_{t−1} + b_o)
h_t = o_t ∗ tanh(C_t)
Dealing with Vanishing Gradients in LSTM

As seen, the gradient vanishes due to the recurrent part of the RNN equations: h_t = W_hh h_{t−1} + some other terms.

How does the LSTM tackle the vanishing gradient? Answer: the forget gate.
• The forget gate parameters take care of the vanishing gradient problem.
• Along the cell state the activation function becomes the identity, and therefore the problem of the vanishing gradient is addressed.
• The derivative of the identity function is, conveniently, always one. So if f = 1, information from the previous cell state can pass through this step unchanged.
LSTM code snippet

Code snippet for the LSTM unit (offline): parameter dimensions, the forward-pass equations and the shapes of the gates. (A hedged sketch follows below.)
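Since the original snippet is offline, here is a hedged numpy sketch of one LSTM forward step implementing the gate equations above (f_t, i_t, C̃_t, C_t, o_t, h_t); the stacked-weight layout and toy shapes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wx, Wh, b):
    """One LSTM step. Wx: (4H, D), Wh: (4H, H), b: (4H,) stacking [forget, input, candidate, output]."""
    H = h_prev.shape[0]
    z = Wx @ x_t + Wh @ h_prev + b
    f_t = sigmoid(z[0:H])               # forget gate: what to keep from C_{t-1}
    i_t = sigmoid(z[H:2 * H])           # input gate: which candidate values to write
    c_hat = np.tanh(z[2 * H:3 * H])     # candidate cell values (C~_t)
    o_t = sigmoid(z[3 * H:4 * H])       # output gate: which parts of the cell to expose
    c_t = f_t * c_prev + i_t * c_hat    # cell update
    h_t = o_t * np.tanh(c_t)            # new hidden state
    return h_t, c_t

# toy usage: 10-dim input, 8-dim hidden/cell state
rng = np.random.default_rng(0)
D, H = 10, 8
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H),
                 rng.normal(0, 0.1, (4 * H, D)), rng.normal(0, 0.1, (4 * H, H)), np.zeros(4 * H))
```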
Gated Recurrent Unit (GRU)
• The GRU, like the LSTM, attempts to solve the vanishing gradient problem in RNNs.

Gates: two vectors that decide what information should be passed to the output.
• Update gate z: units with long-term dependencies have active update gates.
• Reset gate r: units with short-term dependencies have active reset gates.

Update Gate:
• determines how much of the past information (from previous time steps) needs to be passed along to the future.
• learns to copy information from the past so that the gradient does not vanish.
• Here, x_t is the input and h_{t−1} holds the information from the previous time step.
z_t = sigmoid(W_z x_t + U_z h_{t−1})

Reset Gate:
• models how much of the information the unit should forget.
r_t = sigmoid(W_r x_t + U_r h_{t−1})

Memory content:
h′_t = tanh(W x_t + r_t ⊙ U h_{t−1})
Final memory at the current time step:
h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h′_t
Dealing with Vanishing Gradients in the Gated Recurrent Unit (GRU)  (offline)

We had a product of Jacobians:
∂h_t/∂h_k = Π_{j=k+1}^{t} ∂h_j/∂h_{j−1} ≤ α^{t−k−1}
where α depends upon the weight matrix and the derivative of the activation function.

Now, for the GRU:
∂h_j/∂h_{j−1} = z_j + (1 − z_j) ∂h′_j/∂h_{j−1}
and ∂h_j/∂h_{j−1} = 1 for z_j = 1,
so the update gate can hold the Jacobian near 1 and let the gradient pass back without vanishing.
Code snippet of GRU unit

Code snippet of the GRU unit (offline; a hedged sketch follows below).
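The GRU snippet is also offline; below is a hedged numpy sketch of one GRU step following the equations above (z_t, r_t, h′_t, h_t). The weight names mirror the slides (W, U); the shapes and toy dimensions are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU step: update gate z_t, reset gate r_t, memory content h'_t, final memory h_t."""
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)            # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)            # reset gate
    h_hat = np.tanh(W @ x_t + r_t * (U @ h_prev))    # candidate memory content
    h_t = z_t * h_prev + (1.0 - z_t) * h_hat         # final memory at the current time step
    return h_t

# toy usage: 10-dim input, 8-dim hidden state
rng = np.random.default_rng(0)
D, H = 10, 8
Ws = [rng.normal(0, 0.1, (H, D)) for _ in range(3)]  # Wz, Wr, W
Us = [rng.normal(0, 0.1, (H, H)) for _ in range(3)]  # Uz, Ur, U
h = gru_step(rng.normal(size=D), np.zeros(H), Ws[0], Us[0], Ws[1], Us[1], Ws[2], Us[2])
```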
Comparing LSTM and GRU

LSTM over GRU
• One feature the LSTM has is controlled exposure of the memory content, which the GRU lacks.
• In the LSTM unit, the amount of memory content that is seen, or used, by other units in the network is controlled by the output gate; the GRU, on the other hand, exposes its full content without any control.
• The GRU performs comparably to the LSTM.

Chung et al, 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
Bi-directional RNNs

Bidirectional Recurrent Neural Networks (BRNN)
• connect two hidden layers of opposite directions to the same output.
• the output layer can get information from past (backward) and future (forward) states simultaneously.
• learn representations from future time steps to better understand the context and eliminate ambiguity.

Example sentences:
Sentence 1: "He said, Teddy bears are on sale."
Sentence 2: "He said, Teddy Roosevelt was a great President."
• When we look at the word "Teddy" and the previous two words "He said", we might not be able to tell whether the sentence refers to the President or to Teddy bears.
• Therefore, to resolve this ambiguity, we need to look ahead (a sketch of the pattern follows below).

(Figure: sequence of inputs → forward and backward states → sequence of outputs)
https://towardsdatascience.com/introduction-to-sequence-models-rnn-bidirectional-rnn-lstm-gru-73927ec9df15
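A small sketch of the bidirectional pattern: one vanilla RNN reads the sequence left to right, another right to left, and the per-step representation is the concatenation of the forward and backward states. All names and shapes are illustrative.

```python
import numpy as np

def birnn(X, W_xh_f, W_hh_f, W_xh_b, W_hh_b):
    """Bidirectional RNN: concatenate forward and backward hidden states per time step."""
    H = W_hh_f.shape[0]
    fwd, h = [], np.zeros(H)
    for x_t in X:                                   # left-to-right pass
        h = np.tanh(W_hh_f @ h + W_xh_f @ x_t)
        fwd.append(h)
    bwd, h = [], np.zeros(H)
    for x_t in reversed(X):                         # right-to-left pass
        h = np.tanh(W_hh_b @ h + W_xh_b @ x_t)
        bwd.append(h)
    bwd.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]   # [h_fwd_t ; h_bwd_t]
```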
Bi-directional
RNNs
Bidirectional Recurrent Neural Networks (BRNN)

Gupta 2015. (Master Thesis). Deep Learning Methods for the Extraction of Relations in Natural Language Text
Gupta and Schütze. 2018. LISA: Explaining Recurrent Neural Network Judgments via Layer-wIse Semantic Accumulation and Example to Pattern
Transformation
Vu et al., 2016. Combining recurrent and convolutional neural networks for relation classification

100
Recursive Neural Networks (RecNNs): TreeRNN or TreeLSTM
• apply the same set of weights recursively over a structured input, by traversing a given structure in topological order, e.g., a parse tree.
• use the principle of compositionality.
• Recursive Neural Nets can jointly learn compositional vector representations and parse trees.
• The meaning (vector) of a sentence is determined by (1) the meanings of its words and (2) the rules that combine them.

http://www.iro.umontreal.ca/~bengioy/talks/gss2012-YB6-NLP-recursive.pdf
Recursive Neural Networks (RecNNs): TreeRNN or TreeLSTM

Applications
• represent the meaning of longer phrases
• map phrases into a vector space
• sentence parsing
• scene parsing
• Relation Extraction within and across sentence boundaries, i.e., document-level relation extraction

http://www.iro.umontreal.ca/~bengioy/talks/gss2012-YB6-NLP-recursive.pdf
Gupta et al., 2019. Neural Relation Extraction Within and Across Sentence Boundaries.
Deep and Multi-tasking RNNs
• Deep RNN architecture
• Multi-task RNN architecture

Marek Rei. 2017. Semi-supervised Multitask Learning for Sequence Labeling
RNN in Practice: Training Tips

Weight Initialization Methods
• Identity weight initialization with ReLU activation, i.e., ReLU(x) = max{0, x}, whose gradient is 0 for x < 0 and 1 for x > 0.

Weight Initialization Methods (in vanilla RNNs)
• Random initialization of W_hh → no constraint on the eigenvalues → vanishing or exploding gradients in the initial epochs.
• Careful initialization of W_hh with suitable eigenvalues: initialize W_hh to the identity matrix and use the ReLU activation function.
  • allows the RNN to learn in the initial epochs
  • can generalize well in further iterations
What else?
• Batch Normalization: faster convergence
• Dropout: better generalization
(A sketch of the identity initialization follows below.)

Geoffrey et al, "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units"
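A sketch of the identity initialization with ReLU, in the spirit of the cited paper: W_hh starts as the identity, so the eigenvalues of the recurrent Jacobian are initially 1 and the early gradients neither vanish nor explode. The scale of the other weights and the function names are assumptions.

```python
import numpy as np

def init_irnn(input_dim, hidden_dim, output_dim, rng=None):
    """Identity-initialized recurrent weights, small random input/output weights."""
    rng = rng or np.random.default_rng()
    W_hh = np.eye(hidden_dim)                              # all eigenvalues equal to 1 at the start
    W_xh = rng.normal(0.0, 0.001, (hidden_dim, input_dim))
    W_ho = rng.normal(0.0, 0.001, (output_dim, hidden_dim))
    return W_xh, W_hh, W_ho

def irnn_step(x_t, h_prev, W_xh, W_hh):
    return np.maximum(0.0, W_hh @ h_prev + W_xh @ x_t)     # ReLU(x) = max{0, x}

# toy usage
W_xh, W_hh, W_ho = init_irnn(input_dim=10, hidden_dim=16, output_dim=4)
h = irnn_step(np.ones(10), np.zeros(16), W_xh, W_hh)
```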
Attention Mechanism: Attentive RNNs
• Translation often requires arbitrary input length and output length.
• An encoder-decoder can be applied to N-to-M sequences, but is one hidden state really enough?

Attention Mechanism: Attentive RNNs
• Attention improves the performance of the encoder-decoder RNN on machine translation:
  • allows the decoder to focus on local or global features
  • is a vector, often the output of a dense layer using the softmax function
  • generates a context vector bridging the gap between encoder and decoder
• Context vector:
  • takes all the encoder cells' outputs as input
  • computes the probability distribution over source-language words for each word in the decoder (e.g., 'Je')

https://medium.com/syncedreview/a-brief-overview-of-attention-mechanism-13c578ba9129
Attention Mechanism: Attentive RNNs

How does it work?
Idea: compute a context vector for every output/target word t (during decoding).
For each target word t:
1. generate scores between each encoder state h_s and the target state h_t
2. apply softmax to normalize the scores → attention weights (the probability distribution conditioned on the target state)
3. compute the context vector for the target word t using the attention weights
4. compute the attention vector for the target word t
(A sketch follows below.)

https://medium.com/syncedreview/a-brief-overview-of-attention-mechanism-13c578ba9129
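A compact numpy sketch of steps 1-4 using a simple dot-product score between the target state and each encoder state (one of several possible scoring functions; all names are illustrative, and the learned projection that usually precedes the final tanh is omitted for brevity).

```python
import numpy as np

def attend(encoder_states, h_target):
    """Steps 1-4: scores -> softmax attention weights -> context vector -> attention vector."""
    scores = encoder_states @ h_target                    # 1. score(h_s, h_t) = h_s . h_t
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                              # 2. attention weights (a distribution over h_s)
    context = weights @ encoder_states                    # 3. context vector = weighted sum of encoder states
    attention_vec = np.tanh(np.concatenate([context, h_target]))  # 4. combine context and target state
    return weights, context, attention_vec

# toy usage: 6 encoder states of dimension 8, one decoder/target state
rng = np.random.default_rng(0)
w, c, a = attend(rng.normal(size=(6, 8)), rng.normal(size=8))
```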
Explainability/Interpretability of RNNs

Visualization
• Visualize output predictions: LISA
• Visualize neuron activations: sensitivity analysis

Further details:
- Gupta et al, 2018. "LISA: Explaining Recurrent Neural Network Judgments via Layer-wIse Semantic Accumulation and Example to Pattern Transformation". https://arxiv.org/abs/1808.01591
- Andrej Karpathy, blog on "The Unreasonable Effectiveness of Recurrent Neural Networks"
- Hendrick et al, "Visual Analysis of Hidden State Dynamics in Recurrent Neural Networks"

Explainability/Interpretability of RNNs
• Visualize output predictions: LISA
• Check out our POSTER about the LISA paper (EMNLP 2018 conference):
https://www.researchgate.net/publication/328956863_LISA_Explaining_RNN_Judgments_via_Layer-wIse_Semantic_Accumulation_and_Example_to_Pattern_Transformation_Analyzing_and_Interpreting_RNNs_for_NLP
• Full paper: Gupta et al, 2018. "LISA: Explaining Recurrent Neural Network Judgments via Layer-wIse Semantic Accumulation and Example to Pattern Transformation". https://arxiv.org/abs/1808.01591
Explainability/Interpretability of RNNs

Visualize neuron activations via heat maps, i.e., sensitivity analysis.
• The figure plots the sensitivity score: each row corresponds to the saliency scores of the corresponding word representation, with each grid cell representing one dimension.
• All three models assign high sensitivity to "hate" and dampen the influence of other tokens. The LSTM offers a clearer focus on "hate" than the standard recurrent model, but the bi-directional LSTM shows the clearest focus, attaching almost zero emphasis to words other than "hate". This is presumably due to the gate structures in LSTMs and Bi-LSTMs that control information flow, making these architectures better at filtering out less relevant information.
• In this example, the LSTM and the RNN capture short-term dependency.
• In a second example, the LSTM captures long-term dependency, while the (vanilla) RNN does not.

Jiwei Li et al, "Visualizing and Understanding Neural Models in NLP"
RNNs in Topic Trend Extraction (Dynamic Topic Evolution): RNN-RSM
• Architecture (figure): a series of RSMs, one per time step t = 1 … T, each with latent topics h^(t) (bias b_h) over observable softmax visibles V^(t) (bias b_v), coupled through weights W_uh, W_vh, W_uv, W_vu to an RNN with hidden states u^(0) … u^(T) connected by W_uu.
• Example of topic evolution: topics such as "Neural Network Language Models" and "Word Representation" over time (e.g., 1996 → 2014, from linear models and rule sets toward supervised word embeddings and 'word vectors').
• Cost in RNN-RSM: the negative log-likelihood; training via BPTT.
• Application: topic trend extraction, or topic evolution, in NLP research over time.

Gupta et al. 2018. Deep Temporal-Recurrent-Replicated-Softmax for Topical Trends over Time
Key Takeaways
• RNNs model sequential data.
• Long-term dependencies are a major problem in RNNs. Solutions: careful weight initialization, LSTMs/GRUs.
• Gradients explode. Solution: gradient norm clipping.
• Regularization (batch normalization and dropout) and attention help.
• An interesting direction is to visualize and interpret RNN learning.
References, Resources and Further
Reading
 RNN lecture (Ian Goodfellow): https://www.youtube.com/watch?v=ZVN14xYm7JA
 Andrew Ng lecture on RNN:
https://www.coursera.org/lecture/nlp-sequence-models/why-sequence-models-0h7gT
 Recurrent Highway Networks (RHN)
 LSTMs for Language Models (Lecture 07)
 Bengio et al,. "On the difficulty of training recurrent neural networks." (2012)
 Geoffrey et al, “Improving Performance of Recurrent Neural Network with ReLU nonlinearity”
 Geoffrey et al, “A Simple Way to Initialize Recurrent Networks of Rectified Linear Units”
 Cooijmans, Tim, et al. "Recurrent batch normalization."(2016).
 Dropout : A Probabilistic Theory of Deep Learning, Ankit B. Patel, Tan Nguyen, Richard G. Baraniuk.
 Barth (2016) : “Semenuita et al. 2016. “Recurrent dropout without memory loss”
 Andrej Karpathy, Blog on “Unreasonable Effectiveness of Recurrent Neural Networks”
 Ilya Sutskever, et al. 2014. “Sequence to Sequence Learning with Neural Networks”
 Bahdanau et al. 2014. “Neural Machine Translation by Jointly Learning to Align and Translate”
 Hierarchical Attention Networks for Document Classification, 2016.
 Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification, 2016
 Good Resource: http://slazebni.cs.illinois.edu/spring17/lec20_rnn.pdf
120
References, Resources and Further
Reading
 Lecture from the course Neural Networks for Machine Learning by Geoffrey Hinton
 Lecture by Richard Socher: https://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf
 Understanding LSTM: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
 Recursive NN: http://www.iro.umontreal.ca/~bengioy/talks/gss2012-YB6-NLP-recursive.pdf
 Attention: https://medium.com/syncedreview/a-brief-overview-of-attention-mechanism-13c578ba9129
 Gupta, 2015. Master Thesis on “Deep Learning Methods for the Extraction of Relations in Natural Language Text”
 Gupta et al., 2016. Table Filling Multi-Task Recurrent Neural Network for Joint Entity and Relation Extraction.
 Vu et al., 2016. Combining recurrent and convolutional neural networks for relation classification.
 Vu et al., 2016. Bi-directional recurrent neural network with ranking loss for spoken language understanding.
 Gupta et al. 2018. Deep Temporal-Recurrent-Replicated-Softmax for Topical Trends over Time
 Gupta et al., 2018. LISA: Explaining Recurrent Neural Network Judgments via Layer-wIse Semantic Accumulation and
Example to Pattern Transformation.
 Gupta et al., 2018. Replicated Siamese LSTM in Ticketing System for Similarity Learning and Retrieval in Asymmetric Texts.
 Gupta et al., 2019. Neural Relation Extraction Within and Across Sentence Boundaries
 Talk/slides: https://vimeo.com/277669869

121
