MODULE 4
RECURRENT NEURAL NETWORK [RNN]
AIML - C6 - DEEP LEARNING
Seetha Parameswaran
Asst Prof, BITS Pilani
The instructor gratefully acknowledges the authors who made their course materials freely available online.
In feedforward and convolutional neural networks:
The size of the input is always fixed.
Each input to the network is independent of the previous or future inputs.
The computations, outputs and decisions for two successive inputs / images are completely independent of each other.
This is not true in many applications. Example: Auto-completion.
The size of the input is not always fixed.
Successive inputs may not be independent of each other.
Each network (blue - orange - green structure) is performing the same task: input: character, output: character.
[Figure: character-level auto-completion of the word "deep".]
In This Segment
1 Sequence Learning
2 Recurrent Neural Network (RNN)
3 Types of RNN
4 Learning in RNN
5 Issues in RNN
6 Long Short Term Memory Unit (LSTM)
7 Gated Recurrent Unit (GRU)
Sequence Learning Problems
To model a sequence we need to:
Process an input or a sequence of inputs.
Handle inputs that may be dependent on each other.
Maintain the sequence order; each input corresponds to one time step.
Keep track of long-term dependencies.
Produce an output or a sequence of outputs.
Learn in a supervised setting.
Share parameters across the sequence.
A small sketch of encoding a sentence into per-time-step inputs follows below.
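As a concrete illustration of "each input corresponds to one time step", here is a minimal Python sketch that turns a sentence into per-time-step integer inputs; the vocabulary and sentence are illustrative choices, not part of the slides.

# Toy example: one input token per time step.
sentence = "the movie was boring and long".split()

# Illustrative vocabulary built from the sentence itself.
vocab = {word: idx for idx, word in enumerate(sorted(set(sentence)))}

# Each time step t receives the integer index of the t-th word.
inputs = [vocab[word] for word in sentence]   # -> [4, 3, 5, 1, 0, 2]
print(list(enumerate(inputs)))                # (time step, input) pairs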
Sequence Model
[Figure: examples of sequence models. Courtesy: Andrew Ng]
Part of Speech Tagging
The task is to predict the part-of-speech tag (noun, adverb, adjective, verb) of each word in a sentence.
When we see an adjective, we are almost sure the next word will be a noun.
The current output depends on the current input as well as the previous input.
The size of the input is not fixed: sentences can have any number of words.
An output is produced at the end of each time step.
Each network is performing the same task: input: word, output: tag.
[Figure: the sentence "Apple is a red fruit" tagged as noun verb article adj noun.]
Sentiment Analysis
The task is to predict the sentiment of a whole sentence.
The input is the entire sequence of words.
An output is not produced at the end of each time step; a single output is produced after the whole sequence has been read.
Each network is performing the same task: input: word, output: polarity (+/−).
[Figure: the sentence "The movie was boring and long" mapped to negative (−) polarity.]
Recurrent Neural Network (RNN)
Accounts for a variable number of inputs.
Accounts for dependencies between inputs.
Accounts for a variable number of outputs.
Ensures that the same function is executed at each time step.
The features learned across the inputs at different time steps are shared.
RNN I
The function learned at each time step:
t = time step
xt = input at time step t
st = σ(Uxt + b)
ŷt = g(Vst + c)
Since the same function has to be executed at each time step, we share the same network, i.e., the same parameters (U, V, b, c), at each time step.
[Figure: one copy of the network per time step, with inputs x1, x2, x3, states s1, s2, s3 (through U) and outputs ŷ1, ŷ2, ŷ3 (through V).]
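A minimal numpy sketch of this per-time-step function, assuming illustrative dimensions (input 4, state 3, output 2) and choosing sigmoid for σ and softmax for g; these choices are assumptions for the sketch, not fixed by the slides.

import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(3, 4))   # input-to-state weights
V = rng.normal(scale=0.1, size=(2, 3))   # state-to-output weights
b, c = np.zeros(3), np.zeros(2)

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

def step(x_t):
    # The same parameters U, V, b, c are reused at every time step.
    s_t = sigma(U @ x_t + b)       # st = sigma(U xt + b)
    y_t = softmax(V @ s_t + c)     # yt = g(V st + c)
    return s_t, y_t

outputs = [step(x)[1] for x in (rng.normal(size=4) for _ in range(3))]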
RNN II
The parameter sharing ensures that
- the network becomes invariant to the length of the input;
- the number of time steps doesn't matter.
Create multiple copies of the network and execute them at each time step,
- i.e., create a loop effect;
- i.e., add a recurrent connection in the network.
[Figure: the unrolled network with inputs x1 . . . xn, states s1 . . . sn connected through recurrent weights W, input weights U and output weights V, and outputs ŷ1 . . . ŷn.]
Types of RNN
[Figure: RNN input-output configurations (one to one, one to many, many to one, many to many). Courtesy: Andrej Karpathy]
Types of RNN and Applications
One to one – Generic neural network, Image classification
One to many – Music generation, Image Captioning
Many to one – Movie review or Sentiment Analysis
Many to many – Machine translation
Synced Many to many – Video classification
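As a rough guide to how these configurations show up in code, the sketch below lists typical input/output tensor shapes, using illustrative conventions (batch size B, time steps T, feature size F, output size K); actual frameworks differ in the details.

# Typical (illustrative) input/output shapes for each configuration.
shapes = {
    "one to one":           {"input": "(B, F)",    "output": "(B, K)"},
    "one to many":          {"input": "(B, F)",    "output": "(B, T, K)"},
    "many to one":          {"input": "(B, T, F)", "output": "(B, K)"},
    "many to many":         {"input": "(B, T, F)", "output": "(B, T', K)"},  # T' may differ (translation)
    "synced many to many":  {"input": "(B, T, F)", "output": "(B, T, K)"},
}
for name, s in shapes.items():
    print(f"{name:22s} input {s['input']:10s} output {s['output']}")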
Forward Propagation in RNN
st is the state of the network at time step t.
s0 = 0
st = σ(Uxt + Wst−1 + b)
ŷt = g(Vst + c)
or, equivalently,
ŷt = f(xt, st−1, W, U, V, b, c)
The parameters W, U, V, b, c are shared across time steps.
[Figure: the unrolled RNN with inputs x1 . . . xTx, states s0 . . . sTx and outputs ŷ1 . . . ŷTy.]
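A minimal numpy sketch of this forward pass, assuming sigmoid for σ, softmax for g and small illustrative dimensions; this is a sketch of the recurrence, not a production implementation.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): input dim 4, state dim 3, output dim 2.
U = rng.normal(scale=0.1, size=(3, 4))   # input-to-state
W = rng.normal(scale=0.1, size=(3, 3))   # state-to-state (recurrent)
V = rng.normal(scale=0.1, size=(2, 3))   # state-to-output
b, c = np.zeros(3), np.zeros(2)

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

def forward(xs):
    """Run the RNN over a sequence xs = [x1, ..., xT] with shared W, U, V, b, c."""
    s = np.zeros(3)                   # s0 = 0
    states, outputs = [], []
    for x in xs:
        s = sigma(U @ x + W @ s + b)  # st = sigma(U xt + W st-1 + b)
        y = softmax(V @ s + c)        # yt_hat = g(V st + c)
        states.append(s)
        outputs.append(y)
    return states, outputs

states, outputs = forward([rng.normal(size=4) for _ in range(5)])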
Back Propagation in RNN
Loss function at each time step: Lt(ŷt, yt). The model factorises the probability of the predicted sequence as the product over t = 1, . . . , Ty of P(ŷt | ŷt−1, . . . , ŷ1).
Overall loss: L(ŷ, y) = Σ Lt(ŷt, yt), summed over t = 1, . . . , Ty.
[Figure: the unrolled RNN, with a loss computed at each output ŷ1 . . . ŷTy.]
Back Propagation in RNN
[Figure: the unrolled RNN with per-time-step losses L1, L2, . . . , LTy accumulated into the total loss L.]
The gradient of the total loss L is propagated backwards through the unrolled network: back-propagation through time (BPTT).
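A small self-contained sketch of the loss accumulation, assuming one-hot targets and a cross-entropy loss at each time step. Only the gradient with respect to V is shown, to illustrate how per-time-step contributions are summed; the gradients with respect to W and U additionally flow back through earlier states, which is exactly what back-propagation through time computes.

import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(3, 4))
W = rng.normal(scale=0.1, size=(3, 3))
V = rng.normal(scale=0.1, size=(2, 3))
b, c = np.zeros(3), np.zeros(2)

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

xs = [rng.normal(size=4) for _ in range(5)]            # x1 ... xT
ys = [np.eye(2)[rng.integers(2)] for _ in range(5)]    # one-hot targets y1 ... yT

# Forward pass, keeping states and predictions for the backward pass.
s, states, preds = np.zeros(3), [], []
for x in xs:
    s = sigma(U @ x + W @ s + b)
    states.append(s)
    preds.append(softmax(V @ s + c))

# Overall loss = sum of per-time-step cross-entropy losses.
L = sum(-np.log(p @ y) for p, y in zip(preds, ys))

# Gradient wrt V: per-step contributions (p_t - y_t) s_t^T, summed over time.
dV = sum(np.outer(p - y, s) for p, y, s in zip(preds, ys, states))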
Issue of Maintaining States
The old information gets morphed by the current input at each new time step.
After t steps, the information stored at time step t − k (for some k < t) gets morphed so much that it is impossible to extract the original information stored at time step t − k.
This also makes it very hard to assign the responsibility of the error caused at time step t to the events that occurred at time step t − k.
How quickly this happens basically depends on the size of the memory (state) that is available.
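A toy numpy illustration of this morphing, assuming a scalar state with a sigmoid update (the weights and inputs are illustrative): the sensitivity of the current state to the state seen many steps earlier shrinks rapidly, which is why the old information becomes impossible to recover and blame assignment becomes hard.

import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

# Scalar RNN state: s_t = sigma(w * s_{t-1} + u * x_t), illustrative weights.
w, u = 0.9, 1.0
xs = np.ones(20)

# ds_t/ds_{t-1} = sigma'(a_t) * w, so the influence of s_1 on s_T is a product
# of such factors; each factor is at most 0.25 * |w| for the sigmoid.
s, influence = 0.0, 1.0
for t, x in enumerate(xs, start=1):
    a = w * s + u * x
    s = sigma(a)
    if t > 1:
        influence *= sigma(a) * (1 - sigma(a)) * w   # chain-rule factor
print(f"after {len(xs)} steps, d s_T / d s_1 is about {influence:.2e}")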
Strategy to Maintain States
Selectively write on the states.
Selectively read the already written content.
Selectively forget (erase) some content.
Sentiment Analysis
The RNN reads the document from left to right and updates the state after every word.
By the time we reach the end of the document, the information obtained from the first few words is completely lost.
Ideally we want to
- forget the information added by stop words (a, the, etc.);
- selectively read the information added by previous sentiment-bearing words (awesome, amazing, etc.);
- selectively write new information from the current word to the state.
Courtesy: Mitesh M. Khapra
Selective Write
Recall that in RNNs we use st−1 to compute st:
st = σ(Wst−1 + Uxt + b)
Selective Write
Introduce a vector ot−1 which decides what fraction of each element of st−1 should be passed to the next state.
Each element of ot−1 gets multiplied with the corresponding element of st−1.
Each element of ot−1 is restricted to be between 0 and 1.
The RNN has to learn ot−1 along with the other parameters (W, U, V).
Selective Write
Compute ot−1 and ht−1 as
ot−1 = σ(Wo ht−2 + Uo xt−1 + bo)
ht−1 = ot−1 ⊙ σ(st−1)
where ⊙ denotes element-wise multiplication.
The parameters (Wo, Uo, bo) are learned along with the existing parameters (W, U, V).
The sigmoid function ensures that the gate values are between 0 and 1.
ot is called the output gate as it decides how much to pass (write) to the next time step.
Compute State
ht−1 and xt are used to compute the (candidate) new state at the next time step:
s̃t = σ(Wht−1 + Uxt + b)
Selective Read
s̃t captures all the information from the previous state ht−1 and the current input xt.
To read selectively, introduce another gate called the input gate:
it = σ(Wi ht−1 + Ui xt + bi)
Selective read: it ⊙ s̃t
Selective Read
The new state combines the previous state with the selectively read candidate:
st = st−1 + it ⊙ s̃t
Selective Forget
To forget selectively, introduce another gate called the forget gate:
ft = σ(Wf ht−1 + Uf xt + bf)
st = ft ⊙ st−1 + it ⊙ s̃t
Full LSTM
3 gates:
ot = σ(Wo ht−1 + Uo xt + bo)
it = σ(Wi ht−1 + Ui xt + bi)
ft = σ(Wf ht−1 + Uf xt + bf)
3 states:
s̃t = σ(Wht−1 + Uxt + b)
st = ft ⊙ st−1 + it ⊙ s̃t
ht = ot ⊙ σ(st)
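A compact numpy sketch of one step of this cell, following the slides' equations (including σ for the candidate state and the final squashing, where standard LSTM implementations use tanh); the dimensions and initialisation are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 4, 3   # illustrative input and state sizes (assumptions)

def init(shape):
    return rng.normal(scale=0.1, size=shape)

# Gate parameters (Wo, Uo, bo), (Wi, Ui, bi), (Wf, Uf, bf) and state parameters (W, U, b).
Wo, Uo, bo = init((d_h, d_h)), init((d_h, d_x)), np.zeros(d_h)
Wi, Ui, bi = init((d_h, d_h)), init((d_h, d_x)), np.zeros(d_h)
Wf, Uf, bf = init((d_h, d_h)), init((d_h, d_x)), np.zeros(d_h)
W,  U,  b  = init((d_h, d_h)), init((d_h, d_x)), np.zeros(d_h)

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))   # slides use sigma throughout

def lstm_step(x_t, h_prev, s_prev):
    o_t = sigma(Wo @ h_prev + Uo @ x_t + bo)     # output gate
    i_t = sigma(Wi @ h_prev + Ui @ x_t + bi)     # input gate
    f_t = sigma(Wf @ h_prev + Uf @ x_t + bf)     # forget gate
    s_tilde = sigma(W @ h_prev + U @ x_t + b)    # candidate state
    s_t = f_t * s_prev + i_t * s_tilde           # selective forget + selective read
    h_t = o_t * sigma(s_t)                       # selective write
    return h_t, s_t

h, s = np.zeros(d_h), np.zeros(d_h)
for x in [rng.normal(size=d_x) for _ in range(5)]:
    h, s = lstm_step(x, h, s)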
Long Short Term Memory Unit (LSTM)
Another representation
3 gates are used: the update gate Γu, the forget gate Γf and the output gate Γo.
c̃<t> = tanh(Wc [a<t−1>, x<t>] + bc)
Γu = σ(Wu [a<t−1>, x<t>] + bu)
Γf = σ(Wf [a<t−1>, x<t>] + bf)
Γo = σ(Wo [a<t−1>, x<t>] + bo)
c<t> = Γu ∗ c̃<t> + Γf ∗ c<t−1>
a<t> = Γo ∗ tanh(c<t>)
LSTM
[Figure: LSTM cell diagram. The previous cell state c<t−1> is combined with the forget and update gates to form c<t>; tanh(c<t>), gated by the output gate, gives a<t>, and a softmax on a<t> produces y<t>. The inputs to the cell are a<t−1> and x<t>.]
[Figure: two copies of the LSTM cell connected across consecutive time steps, with c<t> and a<t> passed from one cell to the next.]
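In practice the cell is rarely written by hand; a framework layer is used instead. Below is a minimal Keras sketch of a many-to-one sentiment model built around an LSTM layer, echoing the earlier sentiment-analysis example; the vocabulary size, embedding size, unit count and sequence length are illustrative assumptions.

from tensorflow import keras
from tensorflow.keras import layers

# Illustrative hyperparameters (assumptions).
vocab_size, embed_dim, lstm_units, max_len = 10000, 64, 32, 100

model = keras.Sequential([
    keras.Input(shape=(max_len,)),            # integer word indices, one sequence per example
    layers.Embedding(vocab_size, embed_dim),  # word index -> dense vector per time step
    layers.LSTM(lstm_units),                  # many-to-one: only the final hidden state is kept
    layers.Dense(1, activation="sigmoid"),    # polarity +/-
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()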
Gated Recurrent Unit (GRU)
Introduce a memory cell c<t> = a<t>.
The candidate for replacing c<t> is c̃<t>.
The decision whether to update c<t> with c̃<t> is made by the update gate Γu.
Γu takes a value between 0 and 1, and is intuitively close to either 0 or 1.
Γr is a relevance gate that decides how much of c<t−1> is used when computing the candidate.
c̃<t> = tanh(Wc [Γr ∗ c<t−1>, x<t>] + bc)
Γu = σ(Wu [c<t−1>, x<t>] + bu)
Γr = σ(Wr [c<t−1>, x<t>] + br)
c<t> = Γu ∗ c̃<t> + (1 − Γu) ∗ c<t−1>
a<t> = c<t>
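A minimal numpy sketch of one GRU step in this notation, where [c<t−1>, x<t>] is implemented as vector concatenation and ∗ as element-wise multiplication; the dimensions and initialisation are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
d_x, d_c = 4, 3   # illustrative input and memory-cell sizes (assumptions)

def init():
    return rng.normal(scale=0.1, size=(d_c, d_c + d_x)), np.zeros(d_c)

Wc, bc = init()
Wu, bu = init()
Wr, br = init()

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, c_prev):
    gamma_u = sigma(Wu @ np.concatenate([c_prev, x_t]) + bu)              # update gate
    gamma_r = sigma(Wr @ np.concatenate([c_prev, x_t]) + br)              # relevance gate
    c_tilde = np.tanh(Wc @ np.concatenate([gamma_r * c_prev, x_t]) + bc)  # candidate
    c_t = gamma_u * c_tilde + (1 - gamma_u) * c_prev                      # gated update
    return c_t                                                            # a<t> = c<t>

c = np.zeros(d_c)
for x in [rng.normal(size=d_x) for _ in range(5)]:
    c = gru_step(x, c)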
Gated Recurrent Unit (GRU)
[Figure: GRU cell diagram. c<t−1> and x<t> feed a tanh unit producing the candidate c̃<t> and a σ unit producing the update gate Γu; their combination gives c<t>, from which y<t> is produced.]
[Figure: three copies of the GRU cell connected across consecutive time steps, with c<t> passed from one cell to the next.]
References
1 Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville. https://www.deeplearningbook.org/
2 Deep Learning with Python by Francois Chollet. https://livebook.manning.com/book/deep-learning-with-python/