Week 9-10 RNN

This document provides an overview of Recurrent Neural Networks (RNNs), their advantages, and challenges such as exploding and vanishing gradients. It discusses various RNN variants like LSTMs and GRUs, and highlights their applications in sequence modeling tasks such as machine translation and sentiment analysis. The document also emphasizes the importance of sequential data processing and the need for memory mechanisms in neural networks.


Week 9

Recurrent Neural Networks

Lecture Outline
• Motivation: Sequence Modeling
• Understanding Recurrent Neural Networks (RNNs)
• Challenges in vanilla RNNs: Exploding and Vanishing gradients. Why? Remedies?
• RNN variants:
  • Long Short Term Memory (LSTM) networks, Gated Recurrent Units (GRUs)
  • Bi-directional Sequence Learning
  • Recursive Neural Networks (RecNNs): TreeRNNs and TreeLSTMs
  • Deep, Multi-tasking and Generative RNNs (overview)
• Attention Mechanism: Attentive RNNs
• RNNs in Practice + Applications
• Introduction to Explainability/Interpretability of RNNs
Convolutional vs Recurrent Neural Networks

CNN/FF-Nets
• each output depends only on the current input
• feed-forward nets do not remember historic input data at test time, unlike recurrent networks

RNN
• performs well when the input data is interdependent in a sequential pattern
• exploits the correlation between the previous input and the next input
• introduces a bias based on the previous output
Introduction

A vanilla RNN (Recurrent Neural Network) is the simplest type of neural network for processing sequential data.

The hidden state at the current time step is determined by the input at the current time step and the hidden state from the previous time step.
Recurrent neural networks
• Date back to (Rumelhart et al., 1986)
• A family of neural networks for handling sequential data, which involves variable-length inputs or outputs
• Especially useful for natural language processing (NLP)

Sequential data
• Each data point: a sequence of vectors x^(t), for 1 ≤ t ≤ τ
• Batch data: many sequences with different lengths τ
• Label: can be a scalar, a vector, or even a sequence
• Examples: sentiment analysis, machine translation
Example: machine translation

More complicated sequential data
• Data point: two-dimensional sequences, like images
• Label: a different type of sequence, like text sentences
• Example: image captioning

Image captioning
• Figure from the paper "DenseCap: Fully Convolutional Localization Networks for Dense Captioning", by Justin Johnson, Andrej Karpathy, Li Fei-Fei
Computational graphs

A typical dynamic system:
s^(t+1) = f(s^(t); θ)

A system driven by external data:
s^(t+1) = f(s^(t), x^(t+1); θ)

Compact view:
s^(t+1) = f(s^(t), x^(t+1); θ)
• square: one-step time delay
• Key: the same f and θ are used for all time steps

(Figures from Deep Learning, Goodfellow, Bengio and Courville)
Recurrent neural networks
• Unrolled computational graph with Label, Loss, Output, State, and Input nodes
(Figure from Deep Learning, by Goodfellow, Bengio and Courville)

Recurrent neural networks
• Math formula: see the figure from Deep Learning, Goodfellow, Bengio and Courville
Advantage

• Hidden state: a lossy summary of the past.
• Shared functions and parameters: greatly reduce the capacity and are good for generalization in learning.
  • They explicitly use the prior knowledge that sequential data can be processed in the same way at different time steps (e.g., NLP).
• Yet still powerful (actually universal): any function computable by a Turing machine can be computed by such a recurrent network of finite size (see, e.g., Siegelmann and Sontag (1995)).
Training RNN

• Principle: unfold the computational graph, and use backpropagation.
• This is called the back-propagation through time (BPTT) algorithm.
• One can then apply any general-purpose gradient-based technique.
• Conceptually: first compute the gradients of the internal nodes, then compute the gradients of the parameters.
Recurrent neural networks: BPTT gradients
(Math formula and figures from Deep Learning, Goodfellow, Bengio and Courville)
• Gradient at L^(t): the total loss is the sum of the losses at the different time steps
• Gradient at o^(t)
• Gradient at s^(τ) (the final state)
• Gradient at s^(t) (intermediate states, through the states that follow them)
• Gradient at parameter V
Motivation: Need for Sequential Modeling

Why do we need Sequential Modeling?
Motivation: Need for Sequential Modeling

Examples of sequence data (input → output, via an RNN-based model):
• Speech Recognition: audio → "Hello, I am Pankaj."
• Machine Translation: "Hello, I am Pankaj." → "Hallo, ich bin Pankaj." (and the Hindi equivalent)
• Language Modeling: predict the next word in the sequence
• Named Entity Recognition: "Pankaj lives in Munich" → person, other, other, location
• Sentiment Classification: "There is nothing to like in this movie."
• Video Activity Analysis: video frames → "Punching"
Motivation: Need for Sequential Modeling

Inputs and outputs can be of different lengths in different examples.
Example:
Sentence 1: Pankaj lives in Munich
Sentence 2: Pankaj Gupta lives in Munich Germany (DE)

With an FF-net / CNN, the shorter sentence needs an additional word 'PAD', i.e., padding, so that both inputs have the same fixed length:
• Sentence 1: Pankaj→person, lives→other, in→other, Munich→location, PAD→other, …
• Sentence 2: Pankaj→person, Gupta→person, lives→other, in→other, Munich→location, Germany→location

A sequential model (RNN) instead models variable-length sequences directly:
• Pankaj lives in Munich → person other other location
• Pankaj Gupta lives in Munich Germany → person person other other location location
(*FF-net: Feed-forward network)
Motivation: Need for Sequential Modeling

Share features learned across different positions or time steps.
Example:
Sentence 1: Market falls into bear territory → Trading/Marketing
Sentence 2: Bear falls into market territory → UNK
Both sentences have the same unigram statistics.

• With no sequential or temporal modeling (i.e., order-less bag-of-words features), an FF-net / CNN treats the two sentences as the same sentence.
• Direction of information flow matters! A sequential model (RNN) captures language concepts, word ordering, and syntactic & semantic information, so the two word orders lead to different predictions (Trading vs UNK).
(*FF-net: Feed-forward network)
Motivation: Need for Sequential Modeling

Machine Translation: different input and output sizes, incurring sequential patterns.
• The Encoder encodes the input text "Pankaj lives in Munich".
• The Decoder generates the output text in the target language, e.g., "pankaj lebt in münchen" (German) or the Hindi equivalent.
Motivation: Need for Sequential Modeling

Convolutional vs Recurrent Neural Networks

RNN
- performs well when the input data is interdependent in a sequential pattern
- exploits the correlation between the previous input and the next input
- introduces a bias based on the previous output

CNN/FF-Nets
- each output depends only on the current input
- feed-forward nets do not remember historic input data at test time, unlike recurrent networks
Motivation: Need for Sequential Modeling

Memory-less models
• Autoregressive models: predict the next input in a sequence from a fixed number of previous inputs using "delay taps".
• Feed-forward neural networks: generalize autoregressive models by using non-linear hidden layers.

Memory networks
• possess a dynamic hidden state that can store long-term information, e.g., RNNs.
• Recurrent Neural Networks are very powerful because they combine the following properties:
  • Distributed hidden state: can efficiently store a lot of information about the past.
  • Non-linear dynamics: can update the hidden state in complicated ways.
  • Temporal and accumulative: can build semantics, e.g., word by word in a sequence over time.
Notations
• h_t: hidden unit
• x_t: input
• o_t: output
• W_hh: shared (hidden-to-hidden) weight parameter
• W_ho: weight parameter between the hidden layer and the output
• θ: parameters in general
• g_θ: non-linear function
• L_t: loss between the RNN outputs and the true output
• E_t: cross-entropy loss
Long Term and Short Term Dependencies

Short Term Dependencies
• need recent information to perform the present task.
• For example, in a language model, predict the next word based on the previous ones: "the clouds are in the ?" → 'sky'
• It is easy to predict 'sky' given the context, i.e., a short-term dependency.

Long Term Dependencies
• Consider a longer word sequence: "I grew up in France… … I speak fluent French."
• Recent information suggests that the next word is probably the name of a language, but to narrow down which language, we need the context of France, from further back.
Foundation of Recurrent Neural Networks

Goal
• model long-term dependencies
• connect previous information to the present task
• model a sequence of events with loops, allowing information to persist
(example: video activity recognition → "punching")

Feed-forward NNets cannot take time dependencies into account; sequential data needs a feedback mechanism.
• FF-net / CNN: x → o, with no memory of previous inputs
• RNN: a feedback mechanism or internal state (a loop), which unfolds in time as x_0 … x_t … x_T → o_0 … o_t … o_T with the shared weights W_hh
Foundation of Recurrent Neural Networks

Example: tagging the input sequence "Pankaj lives in Munich" with the output labels person, other, other, location over time.
• input layer: one-hot word vectors
• hidden layer: recurrent states connected through W_hh (with input weights W_xh)
• output layer + softmax layer: a per-token probability distribution over the labels (person, location, other), through W_ho
(Vanilla) Recurrent Neural Network

Process a sequence of vectors x by applying a recurrence at every time step:
h_t = g_θ(h_{t−1}, x_t)
• the new hidden state at time step t is some function g, with parameters W_hh and W_xh, of the old hidden state at time step t−1 and the input vector at time step t.

Remark: the same function g and the same set of parameters W are used at every time step.
(Vanilla) Recurrent Neural Network

Process a sequence of vectors x by applying a recurrence at every time step:
h_t = g_θ(h_{t−1}, x_t)
h_t = tanh(W_hh h_{t−1} + W_xh x_t)
o_t = softmax(W_ho h_t)

Remark: an RNN can be seen as a selective summarization of the input sequence into a fixed-size state/hidden vector via a recursive update (a code sketch of this update follows below).
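To make the recurrence concrete, here is a minimal numpy sketch of a single vanilla-RNN time step. It is not the lecture's own code; the matrix names W_xh, W_hh, W_ho and the toy dimensions are assumptions.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_ho):
    """One vanilla-RNN step: h_t = tanh(W_hh h_{t-1} + W_xh x_t), o_t = softmax(W_ho h_t)."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)         # new hidden state (fixed-size summary of the past)
    z_t = W_ho @ h_t                                  # output logits
    o_t = np.exp(z_t - z_t.max()); o_t /= o_t.sum()   # softmax over output classes
    return h_t, o_t

# toy usage: 10-dim input, 16-dim hidden state, 4 output classes
rng = np.random.default_rng(0)
W_xh = rng.normal(0, 0.1, (16, 10))
W_hh = rng.normal(0, 0.1, (16, 16))
W_ho = rng.normal(0, 0.1, (4, 16))
h, o = rnn_step(rng.normal(size=10), np.zeros(16), W_xh, W_hh, W_ho)
```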
Recurrent Neural Network: Probabilistic Interpretation

RNN as a generative model
• induces a set of procedures to model the conditional distribution of x_{t+1} given x_{<=t}, for all t = 1, …, T (with <bos>/<eos> marking the sequence boundaries).
• Think of the output as the probability distribution of x_{t+1} given the previous elements of the sequence (a sampling sketch follows below).
• Training: compute the probability of the sequence and use maximum likelihood training.

Details: https://www.cs.cmu.edu/~epxing/Class/10708-17/project-reports/project10.pdf
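The generative view can be sketched as ancestral sampling: treat the softmax output as p(x_{t+1} | x_{<=t}), sample the next symbol, and feed it back in. The sketch below is hypothetical; the embedding matrix E, the <bos>/<eos> ids, and all shapes are assumptions.

```python
import numpy as np

def sample_sequence(W_xh, W_hh, W_ho, E, bos_id, eos_id, max_len=20, rng=None):
    """Ancestral sampling from an RNN language model: x_{t+1} ~ p(. | x_{<=t})."""
    rng = rng or np.random.default_rng()
    h = np.zeros(W_hh.shape[0])
    x_id, out = bos_id, []
    for _ in range(max_len):
        h = np.tanh(W_hh @ h + W_xh @ E[x_id])     # update state with the embedding of the last symbol
        z = W_ho @ h
        p = np.exp(z - z.max()); p /= p.sum()      # softmax -> distribution over the next symbol
        x_id = rng.choice(len(p), p=p)             # sample the next symbol
        if x_id == eos_id:
            break
        out.append(x_id)
    return out

# toy usage: vocabulary of 5 symbols (0 = <bos>, 4 = <eos>), 8-dim embeddings, 12-dim hidden state
rng = np.random.default_rng(1)
E = rng.normal(0, 0.1, (5, 8))
W_xh, W_hh, W_ho = rng.normal(0, 0.1, (12, 8)), rng.normal(0, 0.1, (12, 12)), rng.normal(0, 0.1, (5, 12))
print(sample_sequence(W_xh, W_hh, W_ho, E, bos_id=0, eos_id=4, rng=rng))
```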
RNN: Computational Graphs
• Sequence of inputs: x_1, x_2, x_3, …
• Initial state A_0; the next state is computed by the shared function g_θ with parameters θ
• Sequence of outputs: o_1, o_2, o_3, …
RNN: Different Computational Graphs
• One to One: x_1 → o_1
• One to Many: x_1 → o_1, o_2, o_3
• Many to One: x_1, x_2, x_3, x_4 → o_1
• Many to Many: x_1, x_2, x_3, x_4 → o_1, o_2, o_3, o_4
Backpropagation through time (BPTT) in RNN
• Training recurrent networks via BPTT
• Compute gradients via backpropagation on the (multi-layer) unrolled model
• Think of the recurrent net as a layered, feed-forward net with shared weights, and then train that feed-forward net in the time domain
  • Forward through the entire sequence → compute the loss
  • Backward through the entire sequence → compute the gradient

Training algorithm in the time domain:
• The forward pass builds up a stack of the activities of all the units at each time step.
• The backward pass peels activities off the stack to compute the error derivatives at each time step.
• After the backward pass we add together the derivatives at all the different times for each weight.

(Lecture from the course Neural Networks for Machine Learning)
Backpropagation through time (BPTT) in RNN
• Training recurrent networks via BPTT
• The output at time t = T depends on the inputs from t = T down to t = 1.
• Unrolled graph: inputs x_1, x_2, x_3 → hidden states h_1, h_2, h_3 → outputs o_1, o_2, o_3 with losses E_1, E_2, E_3; the forward pass runs left to right through time.
• The backward pass (via partial derivatives) sends the gradient flow in the opposite direction: from each loss E_t through ∂E_t/∂h_t, and back through the chain ∂h_3/∂h_2 and ∂h_2/∂h_1.
Backpropagation through time (BPTT) in RNN
• The output at time t = T depends on the inputs from t = T down to t = 1.
• Let us take our loss/error function to be cross entropy:
E_t(o_t', o_t) = − o_t' log o_t
E(o', o) = Σ_t E_t(o_t', o_t) = − Σ_t o_t' log o_t
where the o_t' are the truth values. (A code sketch follows below.)
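A small numpy sketch of this loss, assuming one-hot truth vectors o_t'; the function and variable names are illustrative.

```python
import numpy as np

def cross_entropy(o_true, o_pred, eps=1e-12):
    """E_t(o_t', o_t) = -sum_k o'_{t,k} * log o_{t,k} for a single time step."""
    return -np.sum(o_true * np.log(o_pred + eps))

def sequence_loss(O_true, O_pred):
    """E = sum_t E_t over the whole sequence (rows are time steps)."""
    return sum(cross_entropy(ot, op) for ot, op in zip(O_true, O_pred))

# toy usage: 3 time steps, 4 classes; truths are one-hot, predictions are softmax outputs
O_true = np.eye(4)[[0, 2, 1]]
O_pred = np.array([[0.7, 0.1, 0.1, 0.1],
                   [0.2, 0.2, 0.5, 0.1],
                   [0.1, 0.6, 0.2, 0.1]])
print(sequence_loss(O_true, O_pred))
```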
Backpropagation through time (BPTT) in RNN

The output at time t = 3 depends on the inputs from t = 3 down to t = 1.

Writing gradients in a sum-of-products form:
∂E/∂θ = Σ_{1≤t≤3} ∂E_t/∂θ

Gradient w.r.t. the output weights W_ho (for t = 3):
∂E_3/∂W_ho = (∂E_3/∂o_3)(∂o_3/∂z_3)(∂z_3/∂W_ho),  where z_3 = W_ho h_3, i.e., o_3 = softmax(z_3)
           = o_3'(o_3 − 1) × (h_3),  where × = outer product
• ∂E_3/∂W_ho depends only on o_3, o_3' and h_3.

How is ∂E_3/∂W_ho = o_3'(o_3 − 1) × (h_3)? Proof (offline):
E_3(o_3', o_3) = −o_3' log o_3,  with o_3 = softmax(z_3) = e^{z_3}/Ω, Ω = Σ_i e^{z_i}, and z_3 = W_ho h_3
log o_3 = z_3 − log Ω, so ∂log(o_3)/∂z_3 = 1 − (1/Ω)·∂Ω/∂z_3 = 1 − o_3, hence ∂o_3/∂z_3 = o_3(1 − o_3)
∂E_3/∂z_3 = −o_3'(1 − o_3) = o_3'(o_3 − 1)
∂E_3/∂W_ho = (∂E_3/∂z_3)(∂z_3/∂W_ho) = o_3'(o_3 − 1) × (h_3)

1. http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/
2. https://stats.stackexchange.com/questions/235528/backpropagation-with-softmax-cross-entropy
Backpropagation through time (BPTT) in RNN

The output at time t = 3 depends on the inputs from t = 3 down to t = 1.

Writing gradients in a sum-of-products form:
∂E/∂W_hh = Σ_{1≤t≤3} ∂E_t/∂W_hh
Since h_3 depends on h_2 and h_2 depends on h_1, therefore (e.g., for t = 3):
∂E_3/∂W_hh = Σ_{1≤k≤3} (∂E_3/∂h_3)(∂h_3/∂h_k)(∂h_k/∂W_hh)
In general:
∂E_t/∂W_hh = Σ_{1≤k≤t} (∂E_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W_hh)

The term ∂h_t/∂h_k transports the error in time from step t back to step k:
∂h_t/∂h_k = Π_{t≥i>k} ∂h_i/∂h_{i−1} = Π_{t≥i>k} W_hh^T · diag[g′(h_{i−1})]
(W_hh^T: the recurrent weight matrix; diag[g′(h_{i−1})]: the derivative of the activation function)
• The Jacobian matrix is the matrix of all first-order partial derivatives of a vector-valued function; it generalizes the derivative to functions with multiple inputs and outputs.
Backpropagation through time (BPTT) in RNN

∂E_t/∂W_hh = Σ_{1≤k≤t} (∂E_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W_hh)
∂h_t/∂h_k = Π_{t≥i>k} W_hh^T · diag[g′(h_{i−1})]   (a product of Jacobian matrices)

Repeated matrix multiplications lead to vanishing and exploding gradients.
BPTT: Gradient Flow

∂E_3/∂W_hh = Σ_{k=1}^{3} (∂E_3/∂h_3)(∂h_3/∂h_k)(∂h_k/∂W_hh)
           = (∂E_3/∂o_3)(∂o_3/∂h_3)(∂h_3/∂h_2)(∂h_2/∂h_1)(∂h_1/∂W_hh)
           + (∂E_3/∂o_3)(∂o_3/∂h_3)(∂h_3/∂h_2)(∂h_2/∂W_hh)
           + (∂E_3/∂o_3)(∂o_3/∂h_3)(∂h_3/∂W_hh)
Backpropagation through time (BPTT) in RNN

Code snippet for forward propagation (offline; a hedged sketch follows below), before going to the BPTT code.

https://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf
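Since the referenced snippet is offline, the following is a hedged re-creation of a forward pass over a whole sequence that caches every hidden state (the "stack of activities" BPTT needs). It is not the CS224d code; all names and shapes are assumptions.

```python
import numpy as np

def rnn_forward(X, h0, W_xh, W_hh, W_ho):
    """Forward propagation over a whole sequence; caches h_t and o_t for BPTT."""
    H, O = [h0], []
    for x_t in X:                                    # X: sequence of input vectors, one per time step
        h_t = np.tanh(W_hh @ H[-1] + W_xh @ x_t)     # recurrence
        z_t = W_ho @ h_t
        o_t = np.exp(z_t - z_t.max()); o_t /= o_t.sum()
        H.append(h_t); O.append(o_t)
    return H, O                                      # H[t+1] is the state after input X[t]
```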
Backpropagation through time (BPTT) in RNN

Code snippet for backpropagation w.r.t. time (offline; a hedged sketch follows below).

∂E_t/∂W_hh = Σ_{k=1}^{t} (∂E_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W_hh)
           = (∂E_t/∂o_t)(∂o_t/∂h_t)(∂h_t/∂W_hh)
           + (∂E_t/∂o_t)(∂o_t/∂h_t)(∂h_t/∂h_{t−1})(∂h_{t−1}/∂W_hh)
           + (∂E_t/∂o_t)(∂o_t/∂h_t)(∂h_t/∂h_{t−1})(∂h_{t−1}/∂h_{t−2})(∂h_{t−2}/∂W_hh) + ⋯

For a tanh RNN with softmax outputs and cross-entropy loss:
∂E_t/∂W_ho = −(o_t − o_t')(h_t),   ∂E_t/∂h_t = −(o_t − o_t') W_ho,   ∂h_t/∂W_hh = (1 − h_t²)(h_{t−1})
A (last step):      (∂E_t/∂o_t)(∂o_t/∂h_t)(∂h_t/∂W_hh) = −(o_t − o_t') W_ho (1 − h_t²)(h_{t−1})
B (one step back):  (∂E_t/∂o_t)(∂o_t/∂h_t)(∂h_t/∂h_{t−1})(∂h_{t−1}/∂W_hh) = −(o_t − o_t') W_ho (1 − h_t²)(W_hh)(1 − h_{t−1}²)(h_{t−2})
∂E_t/∂W_hh = A + B + ⋯ (till the end of the dependency)
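The backpropagation snippet is likewise offline; the sketch below accumulates ∂E/∂W_hh in the A + B + ⋯ fashion derived above for a tanh RNN with softmax outputs and cross-entropy loss, using the states cached by rnn_forward and one-hot targets Y. It is illustrative, not the course's exact code.

```python
import numpy as np

def bptt(X, Y, H, O, W_hh, W_xh, W_ho):
    """Accumulate gradients over all time steps and all backward dependencies."""
    dW_hh, dW_xh, dW_ho = np.zeros_like(W_hh), np.zeros_like(W_xh), np.zeros_like(W_ho)
    for t in range(len(X)):
        dz = O[t] - Y[t]                      # dE_t/dz_t for softmax + cross entropy (i.e., o_t - o_t')
        dW_ho += np.outer(dz, H[t + 1])
        dh = W_ho.T @ dz                      # dE_t/dh_t
        for k in range(t, -1, -1):            # transport the error from step t back to step k
            da = dh * (1.0 - H[k + 1] ** 2)   # through tanh: gradient at the pre-activation of step k
            dW_hh += np.outer(da, H[k])       # contribution of h_{k-1} (the terms A, B, ... summed)
            dW_xh += np.outer(da, X[k])
            dh = W_hh.T @ da                  # dE_t/dh_{k-1}
    return dW_hh, dW_xh, dW_ho
```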
Challenges in Training an RNN: Vanishing Gradients

Short Term Dependencies
• need recent information to perform the present task.
• For example, in a language model, predict the next word based on the previous ones: "the clouds are in the ?" → 'sky'
• It is easy to predict 'sky' given the context, i.e., a short-term dependency → the (vanilla) RNN is good so far.

Long Term Dependencies
• Consider a longer word sequence: "I grew up in France… … I speak fluent French."
• Recent information suggests that the next word is probably the name of a language, but to narrow down which language, we need the context of France, from further back.
• As the gap increases, it becomes practically difficult for an RNN to learn from the past information.
Challenges in Training an RNN: Vanishing Gradients

Assume an RNN of 5 time steps (long-term dependencies), and look at the Jacobian terms during BPTT:
∂E_5/∂θ = (∂E_5/∂h_5)(∂h_5/∂h_4)(∂h_4/∂h_3)(∂h_3/∂h_2)(∂h_2/∂h_1)(∂h_1/∂θ) + (∂E_5/∂h_5)(∂h_5/∂h_4)(∂h_4/∂h_3)(∂h_3/∂h_2)(∂h_2/∂θ) + (∂E_5/∂h_5)(∂h_5/∂h_4)(∂h_4/∂h_3)(∂h_3/∂θ) + …
Call the successive terms A (longest range), B, C (shortest range). In the numerical example, the entries of A are on the order of 1e−10 to 1e−08, while the entries of C are on the order of 1e−06 to 1e−05.
• ∂E_5/∂θ is dominated by the short-term dependencies (e.g., C).
• The gradient vanishes for long-term dependencies, i.e., ∂E_5/∂θ is updated much less due to A than due to C.
• The long-term components go exponentially fast to norm 0 → no correlation is learned between temporally distant events.
Challenges in Training an RNN: Exploding Gradients

Assume an RNN of 5 time steps (long-term dependencies), and look at the Jacobian terms during BPTT:
∂E_5/∂θ = (∂E_5/∂h_5)(∂h_5/∂h_4)(∂h_4/∂h_3)(∂h_3/∂h_2)(∂h_2/∂h_1)(∂h_1/∂θ) + …
In this numerical example the entries of the terms are very large (on the order of 1e+04 up to 1e+10).
• ∂E_5/∂θ explodes, i.e., becomes NaN due to very large numbers.
• The large increase in the norm of the gradient during training is due to the explosion of the long-term components.
Vanishing Gradient in Long-term Dependencies

Often the sequences are long, e.g., documents, speech, etc.
(unrolled RNN over x_1, x_2, x_3, …, x_500 with losses E_1, …, E_500)
In practice, as the length of the sequence increases, the probability of training being successful decreases drastically. Why?
Vanishing Gradient in Long-term Dependencies

Why? Let us look at the recurrent part of the RNN equations:
h_t = g_θ(h_{t−1}, x_t)
h_t = tanh(W_hh h_{t−1} + W_xh x_t)
o_t = softmax(W_ho h_t)

Expansion:
h_t = W_hh f(h_{t−1}) + some other terms
h_t ≈ (W_hh)^t h_0 + some other terms   (unrolling the recurrence down to h_0)
Vanishing Gradient in Long-term Dependencies

Writing gradients in a sum-of-products form:
∂E/∂W_hh = Σ_{1≤t≤3} ∂E_t/∂W_hh
∂E_3/∂W_hh = Σ_{1≤k≤3} (∂E_3/∂h_3)(∂h_3/∂h_k)(∂h_k/∂W_hh)
(e.g., for k = 1: (∂E_3/∂h_3)(∂h_3/∂h_2)(∂h_2/∂h_1)(∂h_1/∂W_hh), since h_3 depends on h_2 and h_2 depends on h_1)

In general, with h_t = W_hh f(h_{t−1}) + some other terms:
∂E_t/∂W_hh = Σ_{1≤k≤t} (∂E_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W_hh)
∂h_t/∂h_k = Π_{t≥i>k} ∂h_i/∂h_{i−1} = Π_{t≥i>k} W_hh^T · diag[g′(h_{i−1})]

This term is a product of Jacobian matrices; it transports the error in time from step t back to step k.
Repeated matrix multiplications lead to vanishing gradients!
Mechanics behind Vanishing and Exploding Gradients

∂h_t/∂h_k = Π_{t≥i>k} W_hh^T · diag[g′(h_{i−1})]

Consider an identity activation function. If the recurrent matrix W_hh is diagonalizable:
W_hh = Q · Λ · Q^{−1}
• Q: matrix composed of the eigenvectors of W_hh
• Λ: diagonal matrix with the eigenvalues placed on the diagonal

Computing powers of W_hh (using the power iteration method):
W_hh^n = Q · Λ^n · Q^{−1}

Example (eigenvalues on the diagonal):
Λ = diag(−0.618, 1.618)  →  Λ^10 ≈ diag(0.0081, 122.99)
• an eigenvalue of magnitude < 1 shrinks toward 0 → vanishing gradients
• an eigenvalue of magnitude > 1 blows up → exploding gradients

→ We need tight conditions on the eigenvalues during training to prevent the gradients from vanishing or exploding.

Bengio et al, "On the difficulty of training recurrent neural networks." (2012)
Mechanics behind Vanishing and Exploding Gradients

∂E_t/∂W_hh = Σ_{1≤k≤t} (∂E_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W_hh),  with ∂h_t/∂h_k = Π_{t≥i>k} W_hh^T · diag[g′(h_{i−1})]

To find a sufficient condition for when gradients vanish, compute an upper bound for the term ∂h_t/∂h_k:
‖∂h_i/∂h_{i−1}‖ ≤ ‖W_hh^T‖ · ‖diag[g′(h_{i−1})]‖
→ find an upper bound for the norm of the Jacobian!
Mechanics behind Vanishing and Exploding Gradients  (offline)

Let us find an upper bound for the term ‖W_hh^T · diag[g′(h_{i−1})]‖.

Property of the matrix norm: ‖M‖_2 = sqrt(λ_max(M*M)) = γ_max(M)
• The spectral norm of a complex matrix M is defined as max{‖Mx‖_2 : ‖x‖ = 1}.
• The norm of a matrix equals its largest singular value and is related to its largest eigenvalue (spectral radius).

Proof:
• Put B = M*M, which is a Hermitian matrix. A linear transformation of the Euclidean vector space E is Hermitian iff there exists an orthonormal basis of E consisting of eigenvectors of B.
• Let λ_1, λ_2, …, λ_n be the eigenvalues of B and e_1, e_2, …, e_n an orthonormal basis of E.
• Let x = a_1 e_1 + … + a_n e_n (a linear combination of eigenvectors); its norm is ‖x‖ = (Σ_{i=1}^{n} a_i²)^{1/2}.
• Using the characteristic equation for the eigenvalues: Bx = B Σ_i a_i e_i = Σ_i a_i B e_i = Σ_i λ_i a_i e_i.
• Therefore ‖Mx‖² = ⟨Mx, Mx⟩ = ⟨x, M*Mx⟩ = ⟨x, Bx⟩ = ⟨Σ_i a_i e_i, Σ_i λ_i a_i e_i⟩ = Σ_i λ_i a_i² ≤ (max_{1≤j≤n} λ_j) · ‖x‖².
• Thus, if ‖x‖ = 1, then ‖M‖ ≤ sqrt(max_{1≤j≤n} λ_j)   … equation (1)
• Consider x_0 = e_{j0} with ‖x_0‖ = 1, where λ_{j0} = max_{1≤j≤n} λ_j is the largest eigenvalue of B = M*M: ⟨x_0, Bx_0⟩ = ⟨e_{j0}, λ_{j0} e_{j0}⟩ = λ_{j0}, so ‖M‖ ≥ sqrt(λ_{j0})   … equation (2)
• Combining (1) and (2) gives us ‖M‖_2 = sqrt(λ_max(M*M)) = γ_max(M)   … equation (3)

Remarks:
• The spectral norm of a matrix equals its largest singular value and is related to the largest eigenvalue (spectral radius). (A quick numerical check follows below.)
• If the matrix is square symmetric, the singular value equals the spectral radius.
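A quick numerical check of equation (3), i.e., that the spectral norm ‖M‖_2, the largest singular value γ_max(M), and sqrt(λ_max(M*M)) coincide; purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))

spectral_norm = np.linalg.norm(M, 2)                            # max ||Mx|| over ||x|| = 1
largest_singular = np.linalg.svd(M, compute_uv=False)[0]        # gamma_max(M)
sqrt_lambda_max = np.sqrt(np.max(np.linalg.eigvalsh(M.T @ M)))  # sqrt of largest eigenvalue of M*M

print(spectral_norm, largest_singular, sqrt_lambda_max)         # all three agree up to rounding
```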
Mechanics behind Vanishing and Exploding Gradients

Let us use these properties:
∂h_t/∂h_k = Π_{t≥i>k} ∂h_i/∂h_{i−1} = Π_{t≥i>k} W_hh^T · diag[g′(h_{i−1})]
‖∂h_i/∂h_{i−1}‖ ≤ ‖W_hh^T‖ · ‖diag[g′(h_{i−1})]‖ ≤ γ_W γ_g

• The gradient of the nonlinear function g (sigmoid or tanh) is bounded by a constant: ‖diag[g′(h_{i−1})]‖ ≤ γ_g, i.e., γ_g is an upper bound for the norm of the gradient of the activation function (γ_g = ¼ for sigmoid, γ_g = 1 for tanh).
• γ_W is the largest singular value of W_hh.
→ γ_W γ_g is an upper bound for the norm of the Jacobian!
Mechanics behind Vanishing and Exploding Gradients

Applying the bound ‖∂h_i/∂h_{i−1}‖ ≤ γ_W γ_g over all steps from k to t:
‖∂h_t/∂h_k‖ ≤ (γ_W γ_g)^{t−k}

Sufficient Condition for Vanishing Gradients
• If γ_W γ_g < 1 and (t−k) → ∞, then the long-term contributions go to 0 exponentially fast with t−k (power iteration method).
• Therefore, a sufficient condition for the vanishing gradient to occur is γ_W < 1/γ_g, i.e., γ_W < 4 for sigmoid and γ_W < 1 for tanh.

Necessary Condition for Exploding Gradients
• If γ_W γ_g > 1 and (t−k) → ∞, then the gradient explodes.
• Therefore, a necessary condition for the exploding gradient to occur is γ_W > 1/γ_g, i.e., γ_W > 4 for sigmoid and γ_W > 1 for tanh.
Vanishing Gradient in Long-term Dependencies

What have we concluded with the upper bound of the derivative from the recurrent step?
∂h_t/∂h_k = Π_{t≥i>k} W_hh^T · diag[g′(h_{i−1})],  with ‖∂h_i/∂h_{i−1}‖ ≤ γ_W γ_g and ‖∂h_t/∂h_k‖ ≤ (γ_W γ_g)^{t−k}

If we multiply the same term γ_W γ_g < 1 again and again, the overall number becomes very small (i.e., almost equal to zero).
Repeated matrix multiplications lead to vanishing and exploding gradients.
Vanishing Gradient in Long-term Dependencies

∂E_3/∂W_hh = (∂E_3/∂h_3)(∂h_3/∂h_2)(∂h_2/∂h_1)(∂h_1/∂W_hh) + (∂E_3/∂h_3)(∂h_3/∂h_2)(∂h_2/∂W_hh) + (∂E_3/∂h_3)(∂h_3/∂W_hh)
           =            ≪≪ 1                              +              ≪ 1                  +        < 1
             (gradient due to long-term dependencies)                                            (gradient due to short-term dependencies)

The total gradient no longer depends on the distant past inputs, since the near-past inputs dominate the gradient.

Remark: the gradients due to short-term dependencies (just the previous steps) dominate the gradients due to long-term dependencies. This means the network will tend to focus on short-term dependencies, which is often not desired → the problem of the Vanishing Gradient.
Repeated matrix multiplications lead to vanishing and exploding gradients.
Exploding Gradient in Long-term Dependencies

What have we concluded with the upper bound of the derivative from the recurrent step?
∂h_t/∂h_k = Π_{t≥i>k} W_hh^T · diag[g′(h_{i−1})],  with ‖∂h_i/∂h_{i−1}‖ ≤ γ_W γ_g and ‖∂h_t/∂h_k‖ ≤ (γ_W γ_g)^{t−k}

If we multiply the same term γ_W γ_g > 1 again and again, the overall number explodes, and hence the gradient explodes.
Repeated matrix multiplications lead to vanishing and exploding gradients.
Exploding Gradient in Long-term Dependencies

∂E_3/∂W_hh = (∂E_3/∂h_3)(∂h_3/∂h_2)(∂h_2/∂h_1)(∂h_1/∂W_hh) + (∂E_3/∂h_3)(∂h_3/∂h_2)(∂h_2/∂W_hh) + (∂E_3/∂h_3)(∂h_3/∂W_hh)
           =            ≫≫ 1                              +              ≫≫ 1                 +        ≫≫ 1
           = a very large number, i.e., NaN  →  the problem of the Exploding Gradient
Vanishing vs Exploding Gradients

‖∂h_t/∂h_k‖ ≤ (γ_W γ_g)^{t−k}   (for tanh or linear activation)
• γ_W γ_g less than 1 → the gradient vanishes!
• γ_W γ_g greater than 1 → the gradient explodes!

Remark: this problem of exploding/vanishing gradients occurs because the same number is multiplied into the gradient repeatedly (a small numerical demo follows below).
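A tiny numpy experiment illustrating the remark: multiplying by the same Jacobian over and over drives the product norm toward 0 when γ_W γ_g < 1 and blows it up when γ_W γ_g > 1. A linear/identity activation is assumed (so γ_g = 1), and the symmetric test matrix is purely illustrative.

```python
import numpy as np

def product_norm(W, steps):
    """Spectral norm of the product of `steps` identical Jacobians (linear activation: J = W^T)."""
    J = np.eye(W.shape[0])
    for _ in range(steps):
        J = W.T @ J
    return np.linalg.norm(J, 2)

rng = np.random.default_rng(0)
A = rng.normal(size=(16, 16))
W = (A + A.T) / 2                                    # symmetric, so spectral norm = spectral radius
for gamma_W in (0.9, 1.1):                           # largest singular value below / above 1
    W_s = gamma_W * W / np.linalg.norm(W, 2)
    print(gamma_W, [round(product_norm(W_s, t), 4) for t in (1, 20, 60)])
# gamma_W = 0.9: the product norm decays toward 0 (vanishing); gamma_W = 1.1: it grows (exploding)
```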
Dealing With Exploding Gradients

Dealing with Exploding Gradients: Gradient Clipping
• Scale down the gradients: rescale the norm of the gradients whenever it goes over a threshold (a sketch follows below).
• The proposed clipping is simple and computationally efficient,
• but it introduces an additional hyper-parameter, namely the threshold.

Pascanu et al., 2013. On the difficulty of training recurrent neural networks.
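A minimal sketch of clipping by the global gradient norm, in the spirit of Pascanu et al.; the function name and threshold value are illustrative, not a specific library API.

```python
import numpy as np

def clip_by_global_norm(grads, threshold):
    """Rescale all gradients if their combined L2 norm exceeds `threshold`."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > threshold:
        grads = [g * (threshold / total_norm) for g in grads]
    return grads

# toy usage: two gradient arrays whose joint norm is far above the threshold of 5.0
dW_hh, dW_xh = np.full((4, 4), 100.0), np.full((4, 3), -50.0)
clipped = clip_by_global_norm([dW_hh, dW_xh], threshold=5.0)
print([round(float(np.linalg.norm(g)), 3) for g in clipped])
```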
Dealing With Vanishing Gradients

Dealing with Vanishing Gradients
• As discussed, the gradient vanishes due to the recurrent part of the RNN equations: h_t = W_hh h_{t−1} + some other terms.
• What if the largest eigenvalue of the parameter matrix becomes 1? In that case the memory just grows.
• We need to be able to decide when to put information into the memory.
Activation functions (revisit)
Long Short Term Memory (LSTM): Gating Mechanism

Gates:
• a way to optionally let information through.
• composed of a sigmoid neural net layer and a pointwise multiplication operation.
• remove or add information to the cell state.

An LSTM has 3 gates: the Input Gate, the Output Gate and the Forget Gate, acting on the current cell state (e.g., remembering the word "clouds"); the gates protect and control the cell state. Input comes from the rest of the LSTM and output goes to the rest of the LSTM.

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Long Short Term Memory (LSTM): Gating Mechanism

Remember the word "clouds" over time:
• the input gate opens (value 1) once to write "clouds" into the cell, the forget gate keeps the stored content across the following steps, and the output gate opens (value 1) only when "clouds" is needed; otherwise the gates stay at 0.

(Lecture from the course Neural Networks for Machine Learning by Geoffrey Hinton)
Long Short Term Memory (LSTM)

Motivation:
• Create a self-loop path through which the gradient can flow.
• The self loop corresponds to an eigenvalue of the Jacobian being slightly less than 1.

new state = old state + update
∂(new state)/∂(old state) ≈ Identity

LONG SHORT-TERM MEMORY, Sepp Hochreiter and Jürgen Schmidhuber
Long Short Term Memory (LSTM): Step by Step

Key ingredients
• Cell state: transports the information through the units (the horizontal line running through the top of the LSTM); the LSTM removes or adds information to the cell state using gates.
• Gates: optionally allow information passage.
(http://colah.github.io/posts/2015-08-Understanding-LSTMs/)

Forget Gate:
• decides what information to throw away or remember from the previous cell state.
• decision maker: a sigmoid layer (the "forget gate layer"); its output lies between 0 and 1, with 0 meaning forget and 1 meaning keep.
• looks at h_{t−1} and x_t and outputs a number between 0 and 1 for each number in the cell state C_{t−1}.
f_t = sigmoid(θ_xf x_t + θ_hf h_{t−1} + b_f)

Input Gate: selectively updates the cell state based on the new input.
• A multiplicative input gate unit protects the memory contents stored in the cell from perturbation by irrelevant inputs.
• The next step is to decide what new information we are going to store in the cell state. This has two parts:
  1. A sigmoid layer called the "input gate layer" decides which values we will update: i_t = sigmoid(θ_xi x_t + θ_hi h_{t−1} + b_i)
  2. A tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state.
• In the next step, we combine these two to create an update to the state.

Cell Update:
• update the old cell state C_{t−1} into the new cell state C_t.
• multiply the old state by f_t, forgetting the things we decided to forget earlier.
• add i_t ∗ C̃_t, the new candidate values, scaled by how much we decided to update each state value: C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t

Output Gate: the output is a filtered version of the cell state.
• decides which part of the cell we want as our output, in the form of the new hidden state.
• a multiplicative output gate protects other units from perturbation by currently irrelevant memory contents.
• a sigmoid layer decides what parts of the cell state go to the output; apply tanh to the cell state and multiply it by the output of the sigmoid gate, so that only the decided parts are output.
o_t = sigmoid(θ_xo x_t + θ_ho h_{t−1} + b_o)
h_t = o_t ∗ tanh(C_t)
Dealing with Vanishing Gradients in LSTM

As seen, the gradient vanishes due to the recurrent part of the RNN equations: h_t = W_hh h_{t−1} + some other terms.

How does the LSTM tackle the vanishing gradient? Answer: the forget gate.
• The forget gate parameters take care of the vanishing gradient problem.
• Along the cell state the activation function becomes the identity, and therefore the problem of the vanishing gradient is addressed.
• The derivative of the identity function is, conveniently, always one. So if f = 1, information from the previous cell state can pass through this step unchanged.
LSTM code snippet

Code snippet for the LSTM unit (offline): parameter dimensions, the forward-pass equations and the shapes of the gates. (A hedged sketch follows below.)
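Since the original snippet is offline, here is a hedged numpy sketch of one LSTM forward step implementing the gate equations above (f_t, i_t, C̃_t, C_t, o_t, h_t); the stacked-weight layout and toy shapes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wx, Wh, b):
    """One LSTM step. Wx: (4H, D), Wh: (4H, H), b: (4H,) stacking [forget, input, candidate, output]."""
    H = h_prev.shape[0]
    z = Wx @ x_t + Wh @ h_prev + b
    f_t = sigmoid(z[0:H])               # forget gate: what to keep from C_{t-1}
    i_t = sigmoid(z[H:2 * H])           # input gate: which candidate values to write
    c_hat = np.tanh(z[2 * H:3 * H])     # candidate cell values (C~_t)
    o_t = sigmoid(z[3 * H:4 * H])       # output gate: which parts of the cell to expose
    c_t = f_t * c_prev + i_t * c_hat    # cell update
    h_t = o_t * np.tanh(c_t)            # new hidden state
    return h_t, c_t

# toy usage: 10-dim input, 8-dim hidden/cell state
rng = np.random.default_rng(0)
D, H = 10, 8
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H),
                 rng.normal(0, 0.1, (4 * H, D)), rng.normal(0, 0.1, (4 * H, H)), np.zeros(4 * H))
```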
Gated Recurrent Unit (GRU)
• The GRU, like the LSTM, attempts to solve the vanishing gradient problem in RNNs.

Gates: two vectors that decide what information should be passed to the output.
• Update gate z: units with long-term dependencies have active update gates.
• Reset gate r: units with short-term dependencies have active reset gates.

Update Gate:
• determines how much of the past information (from previous time steps) needs to be passed along to the future.
• learns to copy information from the past so that the gradient does not vanish.
• Here, x_t is the input and h_{t−1} holds the information from the previous time step.
z_t = sigmoid(W_z x_t + U_z h_{t−1})

Reset Gate:
• models how much of the information the unit should forget.
r_t = sigmoid(W_r x_t + U_r h_{t−1})

Memory content:
h′_t = tanh(W x_t + r_t ⊙ U h_{t−1})
Final memory at the current time step:
h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h′_t
Dealing with Vanishing Gradients in the Gated Recurrent Unit (GRU)  (offline)

We had a product of Jacobians:
∂h_t/∂h_k = Π_{j=k+1}^{t} ∂h_j/∂h_{j−1} ≤ α^{t−k−1}
where α depends upon the weight matrix and the derivative of the activation function.

Now, for the GRU:
∂h_j/∂h_{j−1} = z_j + (1 − z_j) ∂h′_j/∂h_{j−1}
and ∂h_j/∂h_{j−1} = 1 for z_j = 1,
so the update gate can hold the Jacobian near 1 and let the gradient pass back without vanishing.
Code snippet of GRU unit

Code snippet of the GRU unit (offline; a hedged sketch follows below).
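The GRU snippet is also offline; below is a hedged numpy sketch of one GRU step following the equations above (z_t, r_t, h′_t, h_t). The weight names mirror the slides (W, U); the shapes and toy dimensions are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU step: update gate z_t, reset gate r_t, memory content h'_t, final memory h_t."""
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)            # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)            # reset gate
    h_hat = np.tanh(W @ x_t + r_t * (U @ h_prev))    # candidate memory content
    h_t = z_t * h_prev + (1.0 - z_t) * h_hat         # final memory at the current time step
    return h_t

# toy usage: 10-dim input, 8-dim hidden state
rng = np.random.default_rng(0)
D, H = 10, 8
Ws = [rng.normal(0, 0.1, (H, D)) for _ in range(3)]  # Wz, Wr, W
Us = [rng.normal(0, 0.1, (H, H)) for _ in range(3)]  # Uz, Ur, U
h = gru_step(rng.normal(size=D), np.zeros(H), Ws[0], Us[0], Ws[1], Us[1], Ws[2], Us[2])
```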
Comparing LSTM and GRU

LSTM over GRU
• One feature the LSTM has is controlled exposure of the memory content, which the GRU lacks.
• In the LSTM unit, the amount of memory content that is seen, or used, by other units in the network is controlled by the output gate; the GRU, on the other hand, exposes its full content without any control.
• The GRU performs comparably to the LSTM.

Chung et al, 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
Bi-directional RNNs

Bidirectional Recurrent Neural Networks (BRNN)
• connect two hidden layers of opposite directions to the same output.
• the output layer can get information from past (backward) and future (forward) states simultaneously.
• learn representations from future time steps to better understand the context and eliminate ambiguity.

Example sentences:
Sentence 1: "He said, Teddy bears are on sale."
Sentence 2: "He said, Teddy Roosevelt was a great President."
• When we look at the word "Teddy" and the previous two words "He said", we might not be able to tell whether the sentence refers to the President or to Teddy bears.
• Therefore, to resolve this ambiguity, we need to look ahead (a sketch of the pattern follows below).

(Figure: sequence of inputs → forward and backward states → sequence of outputs)
https://towardsdatascience.com/introduction-to-sequence-models-rnn-bidirectional-rnn-lstm-gru-73927ec9df15
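A small sketch of the bidirectional pattern: one vanilla RNN reads the sequence left to right, another right to left, and the per-step representation is the concatenation of the forward and backward states. All names and shapes are illustrative.

```python
import numpy as np

def birnn(X, W_xh_f, W_hh_f, W_xh_b, W_hh_b):
    """Bidirectional RNN: concatenate forward and backward hidden states per time step."""
    H = W_hh_f.shape[0]
    fwd, h = [], np.zeros(H)
    for x_t in X:                                   # left-to-right pass
        h = np.tanh(W_hh_f @ h + W_xh_f @ x_t)
        fwd.append(h)
    bwd, h = [], np.zeros(H)
    for x_t in reversed(X):                         # right-to-left pass
        h = np.tanh(W_hh_b @ h + W_xh_b @ x_t)
        bwd.append(h)
    bwd.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]   # [h_fwd_t ; h_bwd_t]
```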
Bi-directional
RNNs
Bidirectional Recurrent Neural Networks (BRNN)

Gupta 2015. (Master Thesis). Deep Learning Methods for the Extraction of Relations in Natural Language Text
Gupta and Schütze. 2018. LISA: Explaining Recurrent Neural Network Judgments via Layer-wIse Semantic Accumulation and Example to Pattern
Transformation
Vu et al., 2016. Combining recurrent and convolutional neural networks for relation classification

100
Recursive Neural Networks (RecNNs): TreeRNN or TreeLSTM
• apply the same set of weights recursively over a structured input, by traversing a given structure in topological order, e.g., a parse tree.
• use the principle of compositionality.
• Recursive Neural Nets can jointly learn compositional vector representations and parse trees.
• The meaning (vector) of a sentence is determined by (1) the meanings of its words and (2) the rules that combine them.

http://www.iro.umontreal.ca/~bengioy/talks/gss2012-YB6-NLP-recursive.pdf
Recursive Neural Networks (RecNNs): TreeRNN or TreeLSTM

Applications
• represent the meaning of longer phrases
• map phrases into a vector space
• sentence parsing
• scene parsing
• Relation Extraction within and across sentence boundaries, i.e., document-level relation extraction

http://www.iro.umontreal.ca/~bengioy/talks/gss2012-YB6-NLP-recursive.pdf
Gupta et al., 2019. Neural Relation Extraction Within and Across Sentence Boundaries.
Deep and Multi-tasking RNNs
• Deep RNN architecture
• Multi-task RNN architecture

Marek Rei. 2017. Semi-supervised Multitask Learning for Sequence Labeling
RNN in Practice: Training Tips

Weight Initialization Methods
• Identity weight initialization with ReLU activation, i.e., ReLU(x) = max{0, x}, whose gradient is 0 for x < 0 and 1 for x > 0.

Weight Initialization Methods (in vanilla RNNs)
• Random initialization of W_hh → no constraint on the eigenvalues → vanishing or exploding gradients in the initial epochs.
• Careful initialization of W_hh with suitable eigenvalues: initialize W_hh to the identity matrix and use the ReLU activation function.
  • allows the RNN to learn in the initial epochs
  • can generalize well in further iterations
What else?
• Batch Normalization: faster convergence
• Dropout: better generalization
(A sketch of the identity initialization follows below.)

Geoffrey et al, "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units"
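A sketch of the identity initialization with ReLU, in the spirit of the cited paper: W_hh starts as the identity, so the eigenvalues of the recurrent Jacobian are initially 1 and the early gradients neither vanish nor explode. The scale of the other weights and the function names are assumptions.

```python
import numpy as np

def init_irnn(input_dim, hidden_dim, output_dim, rng=None):
    """Identity-initialized recurrent weights, small random input/output weights."""
    rng = rng or np.random.default_rng()
    W_hh = np.eye(hidden_dim)                              # all eigenvalues equal to 1 at the start
    W_xh = rng.normal(0.0, 0.001, (hidden_dim, input_dim))
    W_ho = rng.normal(0.0, 0.001, (output_dim, hidden_dim))
    return W_xh, W_hh, W_ho

def irnn_step(x_t, h_prev, W_xh, W_hh):
    return np.maximum(0.0, W_hh @ h_prev + W_xh @ x_t)     # ReLU(x) = max{0, x}

# toy usage
W_xh, W_hh, W_ho = init_irnn(input_dim=10, hidden_dim=16, output_dim=4)
h = irnn_step(np.ones(10), np.zeros(16), W_xh, W_hh)
```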
Attention Mechanism: Attentive RNNs
• Translation often requires arbitrary input length and output length.
• An encoder-decoder can be applied to N-to-M sequences, but is one hidden state really enough?

Attention Mechanism: Attentive RNNs
• Attention improves the performance of the encoder-decoder RNN on machine translation:
  • allows the decoder to focus on local or global features
  • is a vector, often the output of a dense layer using the softmax function
  • generates a context vector bridging the gap between encoder and decoder
• Context vector:
  • takes all the encoder cells' outputs as input
  • computes the probability distribution over source-language words for each word in the decoder (e.g., 'Je')

https://medium.com/syncedreview/a-brief-overview-of-attention-mechanism-13c578ba9129
Attention Mechanism: Attentive RNNs

How does it work?
Idea: compute a context vector for every output/target word t (during decoding).
For each target word t:
1. generate scores between each encoder state h_s and the target state h_t
2. apply softmax to normalize the scores → attention weights (the probability distribution conditioned on the target state)
3. compute the context vector for the target word t using the attention weights
4. compute the attention vector for the target word t
(A sketch follows below.)

https://medium.com/syncedreview/a-brief-overview-of-attention-mechanism-13c578ba9129
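A compact numpy sketch of steps 1-4 using a simple dot-product score between the target state and each encoder state (one of several possible scoring functions; all names are illustrative, and the learned projection that usually precedes the final tanh is omitted for brevity).

```python
import numpy as np

def attend(encoder_states, h_target):
    """Steps 1-4: scores -> softmax attention weights -> context vector -> attention vector."""
    scores = encoder_states @ h_target                    # 1. score(h_s, h_t) = h_s . h_t
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                              # 2. attention weights (a distribution over h_s)
    context = weights @ encoder_states                    # 3. context vector = weighted sum of encoder states
    attention_vec = np.tanh(np.concatenate([context, h_target]))  # 4. combine context and target state
    return weights, context, attention_vec

# toy usage: 6 encoder states of dimension 8, one decoder/target state
rng = np.random.default_rng(0)
w, c, a = attend(rng.normal(size=(6, 8)), rng.normal(size=8))
```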
Explainability/Interpretability of RNNs

Visualization
• Visualize output predictions: LISA
• Visualize neuron activations: sensitivity analysis

Further details:
- Gupta et al, 2018. "LISA: Explaining Recurrent Neural Network Judgments via Layer-wIse Semantic Accumulation and Example to Pattern Transformation". https://arxiv.org/abs/1808.01591
- Andrej Karpathy, blog on "The Unreasonable Effectiveness of Recurrent Neural Networks"
- Hendrick et al, "Visual Analysis of Hidden State Dynamics in Recurrent Neural Networks"

Explainability/Interpretability of RNNs
• Visualize output predictions: LISA
• Check out our POSTER about the LISA paper (EMNLP 2018 conference):
https://www.researchgate.net/publication/328956863_LISA_Explaining_RNN_Judgments_via_Layer-wIse_Semantic_Accumulation_and_Example_to_Pattern_Transformation_Analyzing_and_Interpreting_RNNs_for_NLP
• Full paper: Gupta et al, 2018. "LISA: Explaining Recurrent Neural Network Judgments via Layer-wIse Semantic Accumulation and Example to Pattern Transformation". https://arxiv.org/abs/1808.01591
Explainability/Interpretability of RNNs

Visualize neuron activations via heat maps, i.e., sensitivity analysis.
• The figure plots the sensitivity score: each row corresponds to the saliency scores of the corresponding word representation, with each grid cell representing one dimension.
• All three models assign high sensitivity to "hate" and dampen the influence of other tokens. The LSTM offers a clearer focus on "hate" than the standard recurrent model, but the bi-directional LSTM shows the clearest focus, attaching almost zero emphasis to words other than "hate". This is presumably due to the gate structures in LSTMs and Bi-LSTMs that control information flow, making these architectures better at filtering out less relevant information.
• In this example, the LSTM and the RNN capture short-term dependency.
• In a second example, the LSTM captures long-term dependency, while the (vanilla) RNN does not.

Jiwei Li et al, "Visualizing and Understanding Neural Models in NLP"
RNNs in Topic Trend Extraction (Dynamic Topic Evolution): RNN-RSM
• Architecture (figure): a series of RSMs, one per time step t = 1 … T, each with latent topics h^(t) (bias b_h) over observable softmax visibles V^(t) (bias b_v), coupled through weights W_uh, W_vh, W_uv, W_vu to an RNN with hidden states u^(0) … u^(T) connected by W_uu.
• Example of topic evolution: topics such as "Neural Network Language Models" and "Word Representation" over time (e.g., 1996 → 2014, from linear models and rule sets toward supervised word embeddings and 'word vectors').
• Cost in RNN-RSM: the negative log-likelihood; training via BPTT.
• Application: topic trend extraction, or topic evolution, in NLP research over time.

Gupta et al. 2018. Deep Temporal-Recurrent-Replicated-Softmax for Topical Trends over Time
Key Takeaways
• RNNs model sequential data.
• Long-term dependencies are a major problem in RNNs. Solutions: careful weight initialization, LSTMs/GRUs.
• Gradients explode. Solution: gradient norm clipping.
• Regularization (batch normalization and dropout) and attention help.
• An interesting direction is to visualize and interpret RNN learning.
References, Resources and Further
Reading
 RNN lecture (Ian Goodfellow): https://www.youtube.com/watch?v=ZVN14xYm7JA
 Andrew Ng lecture on RNN:
https://www.coursera.org/lecture/nlp-sequence-models/why-sequence-models-0h7gT
 Recurrent Highway Networks (RHN)
 LSTMs for Language Models (Lecture 07)
 Bengio et al,. "On the difficulty of training recurrent neural networks." (2012)
 Geoffrey et al, “Improving Performance of Recurrent Neural Network with ReLU nonlinearity”
 Geoffrey et al, “A Simple Way to Initialize Recurrent Networks of Rectified Linear Units”
 Cooijmans, Tim, et al. "Recurrent batch normalization."(2016).
 Dropout : A Probabilistic Theory of Deep Learning, Ankit B. Patel, Tan Nguyen, Richard G. Baraniuk.
 Barth (2016) : “Semenuita et al. 2016. “Recurrent dropout without memory loss”
 Andrej Karpathy, Blog on “Unreasonable Effectiveness of Recurrent Neural Networks”
 Ilya Sutskever, et al. 2014. “Sequence to Sequence Learning with Neural Networks”
 Bahdanau et al. 2014. “Neural Machine Translation by Jointly Learning to Align and Translate”
 Hierarchical Attention Networks for Document Classification, 2016.
 Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification, 2016
 Good Resource: http://slazebni.cs.illinois.edu/spring17/lec20_rnn.pdf
120
References, Resources and Further
Reading
 Lecture from the course Neural Networks for Machine Learning by Geoffrey Hinton
 Lecture by Richard Socher: https://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf
 Understanding LSTM: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
 Recursive NN: http://www.iro.umontreal.ca/~bengioy/talks/gss2012-YB6-NLP-recursive.pdf
 Attention: https://medium.com/syncedreview/a-brief-overview-of-attention-mechanism-13c578ba9129
 Gupta, 2015. Master Thesis on “Deep Learning Methods for the Extraction of Relations in Natural Language Text”
 Gupta et al., 2016. Table Filling Multi-Task Recurrent Neural Network for Joint Entity and Relation Extraction.
 Vu et al., 2016. Combining recurrent and convolutional neural networks for relation classification.
 Vu et al., 2016. Bi-directional recurrent neural network with ranking loss for spoken language understanding.
 Gupta et al. 2018. Deep Temporal-Recurrent-Replicated-Softmax for Topical Trends over Time
 Gupta et al., 2018. LISA: Explaining Recurrent Neural Network Judgments via Layer-wIse Semantic Accumulation and
Example to Pattern Transformation.
 Gupta et al., 2018. Replicated Siamese LSTM in Ticketing System for Similarity Learning and Retrieval in Asymmetric Texts.
 Gupta et al., 2019. Neural Relation Extraction Within and Across Sentence Boundaries
 Talk/slides: https://vimeo.com/277669869

121
