
NLP: Recurrent Neural Networks

Slides references: © MIT 6.S191: Introduction to Deep Learning
[Link]
Andrew NG, Deep Learning Specializations.
Lecture content:
• Introduction to NLP
• Motivation
• Recurrent Neural Network: Methodology
• Exploding and vanishing gradient problems
• Variants – Long Short Term Memory (LSTM), Gated Recurrent Units (GRUs)
• Popular RNN Models
• Applications of RNN
What is NLP?

Fundamental goal:
• Deep understanding of broad language, not just string processing or keyword matching!
End systems that we want to build:
• Simple: spelling correction, text categorization…
• Complex: speech recognition, machine translation, information extraction, dialog
interfaces, question answering…
• Unknown: human-level comprehension
Areas being investigated, where NLP is thought to play a key role:
• Business Intelligence on the Internet Platform
• Opinion Mining
• Reputation Management
• Sentiment Analysis
• Machine translation
• Text summarization
• Information retrieval
• Question answering
• Chat bot …


NLP faces 3 major challenges:
• Ambiguity
• Co-reference resolution (anaphora is a kind of it)
• Ellipsis
Ambiguity
Example: the word "chair" can refer to a piece of furniture or to the chairperson of a meeting.
Co-reference Resolution
Sequence of commands to the robot:
"Place the pen on the table. Then paint it."
What does "it" refer to?

Ellipsis
Sequence of commands to the robot:
"Move the table to the corner. Also, the chair."
The second command needs completing by using the first part of the previous command.
Three Views of NLP and the Associated Challenges:
1. Classical View
2. Statistical/Machine Learning View
3. Neural Network View
Motivation: Sentiment Analysis
➢ Let us try to classify the following text as positive or negative
❑ I like this phone – Positive
❑ This phone is good – Positive
❑ This phone is not okay – Negative
❑ I do not like this phone because battery is not charging properly – Negative
Sentiment Analysis
➢ Let us try to classify the following text as positive or negative
❑ I like this phone – Positive
❑ This phone is good – Positive
❑ This phone is not okay – Negative
❑ I do not like this phone because battery is not charging properly - Negative

Feed forward networks accept a fixed-sized vector as input !


Using Bag-of-Words
➢ Represent the text using Bag-of-Words
❑ I like this phone
❑ This phone is good
❑ This phone is not okay
❑ I do not like this phone because battery is not charging properly

       battery because charging do good I is like not okay phone properly this
Doc 1     0       0        0     0    0  1  0   1   0    0    1      0       1
Doc 2     0       0        0     0    1  0  1   0   0    0    1      0       1
Doc 3     0       0        0     0    0  0  1   0   1    1    1      0       1
Doc 4     1       1        1     1    0  1  1   1   1    0    1      1       1
Applying ANN
       battery because charging do good I is like not okay phone properly this
Doc 1     0       0        0     0    0  1  0   1   0    0    1      0       1
Doc 2     0       0        0     0    1  0  1   0   0    0    1      0       1
Doc 3     0       0        0     0    0  0  1   0   1    1    1      0       1
Doc 4     1       1        1     1    0  1  1   1   1    0    1      1       1

Each Bag-of-Words vector is fed to a feed-forward ANN whose output is Positive/Negative.
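As a concrete illustration, here is a minimal Keras sketch of such a feed-forward classifier over Bag-of-Words vectors. The layer sizes, optimizer and training setup are illustrative assumptions rather than something prescribed in the slides.

import numpy as np
import tensorflow as tf

# Bag-of-Words vectors for the 4 example documents (13-word vocabulary)
X = np.array([
    [0,0,0,0,0,1,0,1,0,0,1,0,1],   # "I like this phone"
    [0,0,0,0,1,0,1,0,0,0,1,0,1],   # "This phone is good"
    [0,0,0,0,0,0,1,0,1,1,1,0,1],   # "This phone is not okay"
    [1,1,1,1,0,1,1,1,1,0,1,1,1],   # "I do not like this phone because battery is not charging properly"
], dtype="float32")
y = np.array([1, 1, 0, 0], dtype="float32")  # 1 = Positive, 0 = Negative

# A small feed-forward network: fixed-size BoW input -> Positive/Negative
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(13,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=50, verbose=0)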
Drawback of BoW
➢ Let's try to represent the following text using Bag-of-Words
❑ This phone is no good - Negative
❑ No this phone is good - Positive

       good is no phone this
Doc 1    1   1  1    1    1
Doc 2    1   1  1    1    1

➢ Feed Forward Neural Networks with the Bag-of-Words (BoW) model do not consider the position of words in the input!
Drawbacks of NN for Sequence Analysis
Feed forward networks accept a fixed-sized vector as input and produce a fixed-sized vector as output.

So, feed forward networks cannot process sequential data of variable length.

Feed forward networks do not consider the order of the data.

A standard neural network therefore does not work well for sequence models.
Other situations where sequence matters
• Stock price today will be more or less similar to yesterday's price
• Tomorrow's temperature will be close to today's temperature
Sequence Application Variation
Audio Signal to Sequence - Speech Recognition
Nothing to Sequence or Single Parameter to Sequence - Music Generation
Sequence to Single Output - Sentiment Classification
Sequence to Sequence - Machine Translation
Video Frame Sequence to Output - Activity Recognition
Sub-Sequence from a Sequence - Finding a Specific Protein in a DNA Sequence
Outlining specific parts of a sequence - Named Entity Recognition

Solution for Sequence Analysis - RNN
Recurrent Neural Networks allow us to operate over sequences of vectors.

Recurrent, because the previous output/state is also used together with the current input.

An RNN can also be viewed as having a "memory".

Unlike a traditional deep network, an RNN shares the same parameters across all steps.

This greatly reduces the total number of parameters we need to learn.

An RNN is not a feed forward neural network, as a cycle is formed in the hidden units.
Notation Understanding
X: Rama Conquered Ravana to install the virtue of dharma
   x<1>  x<2>      x<3>   ...            x<t> ...       x<9>
Tx = 9 (length of the training sequence)
xi<t> : t-th word of the i-th training sequence

Y: 1    0    1    0    ......    0    0    0    0
   y<1> y<2> y<3>      ......    y<t> ......    y<9>
Ty = 9 (length of the output sequence)
yi<t> : t-th element of the i-th output sequence
Representing words and one-hot encoding
X: Rama Conquered Ravana to install the virtue of dharma

Vocabulary (size 10,000) with word indices, for example:
   A          1
   Conquered  329
   Install    4521
   Rama       7689
   Ravana     7900
   ZZZ        10000

Each word is represented by a 10,000-dimensional one-hot vector:
   "Rama"   -> all zeros except a 1 at position 7689
   "Ravana" -> all zeros except a 1 at position 7900
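A minimal sketch of this one-hot representation (the word indices are the illustrative ones from the slide; in practice they would come from a tokenizer built over the training corpus):

import numpy as np

VOCAB_SIZE = 10_000
# Illustrative word-to-index mapping taken from the slide
word_index = {"A": 1, "Conquered": 329, "Install": 4521, "Rama": 7689, "Ravana": 7900}

def one_hot(word: str) -> np.ndarray:
    """Return a 10,000-dim vector with a single 1 at the word's index."""
    vec = np.zeros(VOCAB_SIZE, dtype=np.float32)
    vec[word_index[word] - 1] = 1.0   # the slide uses 1-based indices
    return vec

x1 = one_hot("Rama")     # x<1>
x3 = one_hot("Ravana")   # x<3>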
NN vs RNN
[Diagram: a feed-forward NN maps input directly to output; an RNN additionally feeds the previous output/state back in together with the current input.]


History of RNN
Recurrent Neural Networks were introduced in the late 80s.

Hochreiter identified the 'vanishing gradients' problem in 1991.

Long Short Term Memory (LSTM) was published in 1997.

LSTM is a recurrent network designed to overcome these problems.

A more recent variant, the GRU, was published in 2014.


Recurrent Neural Networks
RNNs are networks with loops in them, allowing information to persist.

Recurrent Neural Networks have loops.


What happens at every time step
[Diagram: the output o(t) is computed from the hidden state s(t) multiplied by the weights V; the hidden state s(t) is computed from the input x(t) multiplied by the weights U plus the previous hidden state s(t−1), fed back through a delay unit, multiplied by the weights W.]
Notations
❖ x : Input
❖ o : Output
❖ s : state of the hidden unit
❖ U, V and W : Weights to be learned
❖ U : weights used for hidden state computation (from input)
❖ V : weights used for output computation
❖ W : weights used for hidden state computation (from previous
hidden state)
Inside RNN

An example RNN with 4-dimensional input and output layers, and a hidden layer of 3 units (neurons). This diagram shows the
activations in the forward pass when the RNN is fed the characters "hell" as input. The output layer contains confidences the RNN
assigns for the next character (vocabulary is "h,e,l,o"); We want the green numbers to be high and red numbers to be low.
Important Notes
• 4 output units, 3 hidden-layer (HL) units, 4 input units
• These are not single units; each block shown in the diagram is a whole layer
Unrolled RNN with parameters

The recurrent network can be converted into a feed forward network by unfolding
over time
Input to RNN
❖ xt is the input at time step t. For example, x1 could be a one-hot vector corresponding to the first word of a sentence.
Input to RNN
❖ In text classification, input xt can be a one-hot vector corresponding to the
word of a sentence at iteration t
❖ In speech recognition, input xt can be audio features at time t
❖ In stock prediction, input xt can be numerical values of high, low, etc.
❖ In weather prediction, input xt can be wind speed, low and high
temperatures, etc
❖ In video classification, input xt can be a single video frame or its features
State of Hidden Unit
❖ st is the hidden state at time step t. st is calculated based on the previous hidden state and
the input at the current step: st=f(Uxt + Wst-1).
❖ It’s the “memory” of the network. The function f() usually is a nonlinearity such as tanh,
sigmoid or ReLU
State of previous Hidden Unit
❖ s0, the previous state required to calculate the first hidden state, is typically initialized to all zeroes
State of Hidden Unit
❖ State of the hidden unit is considered as “Memory” which is important in the
RNNs
❖ They are the actual memory helpful in passing useful information until last
element of the input is processed
❖ At each iteration, some unit states will be forgotten while others will be updated based on the input
Output at a particular time step
❖ The output at step t is ot = f(Vst)
❖ For example, if we wanted to predict the next word in a sentence it would be a vector of
probabilities across our vocabulary. f() can be sigmoid or softmax() function.
Weights in RNN
❖ U, V and W are weights to be learned while training the network
RNN Forward Pass
❖ Step 1: the input x is given at time t
❖ Step 2: the current hidden state s at time t is computed using
   s(t) = fh(U x(t) + W s(t−1))
❖ Step 3: the current output o at time t is computed using
   o(t) = fo(V s(t))

❖ Note: an output will not necessarily be generated for every t; it depends on the application. A speech recognition RNN will output words at every iteration, whereas an opinion classification RNN will output a label only at the end of the sentence.
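A minimal NumPy sketch of this forward pass (the choice of tanh for fh and softmax for fo, and the dimensions, are illustrative assumptions):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs, U, W, V):
    """xs: list of input vectors, one per time step. Returns hidden states and outputs."""
    s = np.zeros(W.shape[0])          # s(0): initial hidden state
    states, outputs = [], []
    for x in xs:
        s = np.tanh(U @ x + W @ s)    # s(t) = fh(U x(t) + W s(t-1))
        o = softmax(V @ s)            # o(t) = fo(V s(t))
        states.append(s)
        outputs.append(o)
    return states, outputs

# Example with random weights: 4-dim one-hot inputs, 3 hidden units, 4 outputs
rng = np.random.default_rng(0)
U, W, V = rng.random((3, 4)), rng.random((3, 3)), rng.random((4, 3))
xs = [np.eye(4)[i] for i in (0, 1, 2, 2)]   # one-hot inputs for "h", "e", "l", "l"
states, outputs = rnn_forward(xs, U, W, V)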
RNN Forward Pass with Example
➢ The inputs are one-hot encoded. Our entire vocabulary is {h,e,l,o}, so we can easily one-hot encode the inputs "h e l l" (each column below is one letter):
      h  e  l  l
      1  0  0  0
      0  1  0  0
      0  0  1  1
      0  0  0  0

➢ The input neuron transforms the input to the hidden state using the weights U. We have randomly initialized U as a 3*4 matrix:
   U = 0.287027  0.84606   0.572392  0.486813
       0.902874  0.871522  0.691079  0.18998
       0.537524  0.09224   0.558159  0.491528
Step 1
➢ Now for the letter "h", for the hidden state we need U·xt. Since xt is the one-hot vector for "h", i.e. [1, 0, 0, 0]ᵀ, the matrix multiplication simply selects the first column of U:
   U·xt = [0.287027, 0.902874, 0.537524]ᵀ
Step 2
➢ Now moving to the recurrent neuron, we have W as the weight, which is a 1*1 matrix with value 0.427043, and the bias, which is also a 1*1 matrix with value 0.567001.
➢ For the letter "h", the previous state st-1 is [0, 0, 0] since there is no letter prior to it.
➢ So W·st-1 + bias = [0.567001, 0.567001, 0.567001]ᵀ
Step 3
➢ Now we can get the current state as
   st = tanh(W·st-1 + U·xt + bias)
➢ Since for "h" there is no previous hidden state, this reduces to tanh(U·xt + bias):
   st = tanh([0.287027 + 0.567001, 0.902874 + 0.567001, 0.537524 + 0.567001]ᵀ)
      = tanh([0.854028, 1.469875, 1.104525]ᵀ)
      = [0.693168, 0.899554, 0.802118]ᵀ
Step 4
➢ Now we move on to the next time step: "e" is supplied to the network. The processed output st now becomes st-1, while the one-hot encoded "e" is the new xt. Let's calculate the current state:
   st = tanh(W·st-1 + U·xt + bias)
➢ W·st-1 + bias = 0.427043 * [0.693168, 0.899554, 0.802118]ᵀ + 0.567001
                = [0.863013, 0.951149, 0.909540]ᵀ
➢ U·xt selects the second column of U, since xt is the one-hot vector for "e":
   U·xt = [0.84606, 0.871522, 0.09224]ᵀ
Step 5
➢ Now calculating st for the letter "e":
   st = tanh([0.863013 + 0.84606, 0.951149 + 0.871522, 0.90954 + 0.09224]ᵀ)
      = [0.93653372, 0.94910403, 0.76234056]ᵀ
➢ This becomes st-1 for the next time step, and the recurrent neuron uses it along with the new character to predict the next one.
Step 6
➢ At each time step, the recurrent neural network also produces an output. Let's calculate yt for the letter "e":
   yt = V·st
   V = 0.37168  0.974829459  0.830034886
       0.39141  0.282585823  0.659835709
       0.64985  0.09821557   0.334287084
       0.91266  0.32581642   0.144630018
   yt = V · [0.93653372, 0.94910403, 0.76234056]ᵀ
      = [1.90607732, 1.13779113, 0.95666016, 1.27422602]ᵀ
Step 7
➢ The probability of each letter in the vocabulary can be calculated by applying the softmax function to yt:
   softmax([1.90607732, 1.13779113, 0.95666016, 1.27422602]) = [0.419748, 0.194682, 0.162429, 0.223141]
➢ The letter "h" gets the highest probability.
➢ If we convert these probabilities into a prediction, the model says that the letter after "e" should be "h", since the highest probability is for the letter "h". Does this mean we have done something wrong? No: we have hardly trained the network. We have just shown it two letters, so it pretty much hasn't learnt anything yet.
➢ Now the next BIG question that faces us: how does back propagation work in the case of a recurrent neural network? How are the weights updated when there is a feedback loop?
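The whole walkthrough can be reproduced in a few lines of NumPy. The matrices U, W, bias and V below are the values given in these slides; tanh and softmax are applied exactly as in Steps 1-7:

import numpy as np

U = np.array([[0.287027, 0.84606,  0.572392, 0.486813],
              [0.902874, 0.871522, 0.691079, 0.18998 ],
              [0.537524, 0.09224,  0.558159, 0.491528]])
W, bias = 0.427043, 0.567001
V = np.array([[0.37168, 0.974829459, 0.830034886],
              [0.39141, 0.282585823, 0.659835709],
              [0.64985, 0.09821557,  0.334287084],
              [0.91266, 0.32581642,  0.144630018]])

x_h = np.array([1.0, 0.0, 0.0, 0.0])   # one-hot "h"
x_e = np.array([0.0, 1.0, 0.0, 0.0])   # one-hot "e"

s = np.zeros(3)                         # st-1 = [0, 0, 0] for the first letter
s = np.tanh(U @ x_h + W * s + bias)     # Step 3: ~[0.693168, 0.899554, 0.802118]
s = np.tanh(U @ x_e + W * s + bias)     # Step 5: ~[0.936534, 0.949104, 0.762341]

y = V @ s                               # Step 6: ~[1.906077, 1.137791, 0.956660, 1.274226]
probs = np.exp(y) / np.exp(y).sum()     # Step 7: ~[0.419748, 0.194682, 0.162429, 0.223141]
print(probs)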
Back Propagation Through Time
➢ The BPTT learning algorithm is an extension of standard backpropagation that performs gradient descent on the unfolded network.
➢ The gradient descent weight updates have contributions from each time
step.
➢ The errors have to be back-propagated through time as well as through the
network
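Stated explicitly (this is the standard BPTT expression, added here for reference rather than copied from the slides): the gradient of the loss with respect to a shared weight matrix such as W is a sum over time steps,
   ∂L/∂W = Σ (t = 1..T) ∂L(t)/∂W
and each per-step term contains products of factors ∂s(j)/∂s(j−1) chained back through earlier time steps. These repeated products are what can shrink (vanish) or blow up (explode) as the sequence gets longer.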
RNN using Keras

[Link](cell, return_sequences=False, return_state=False, go_backwards=False, stateful=False, unroll=False)
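For context, a minimal Keras usage sketch (the use of SimpleRNN, the layer sizes and the vocabulary size are illustrative assumptions, not something prescribed by the slides):

import tensorflow as tf

# A tiny sentiment classifier: embed tokens, run a plain RNN over the sequence,
# then emit a single positive/negative probability.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),  # assumed vocabulary size
    tf.keras.layers.SimpleRNN(32),                               # naive RNN layer
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])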
What's wrong with Naïve RNN?
➢ When dealing with a time series, a naive RNN tends to forget old information. When there is a distant relationship of unknown length, we wish the network to keep a "memory" of it.
➢ Limitations of Backprop Through Time
➢ Vanishing Gradients
➢ Exploding Gradients
Vanishing Gradients
➢ When the error is back-propagated through many time steps, the gradient is repeatedly multiplied by small factors, so it shrinks towards zero for the earlier time steps
➢ As a result, the network cannot learn long-range dependencies
Exploding Gradients
➢ In the same way, gradients may explode if the gradient computed at each time step keeps increasing
➢ One solution is to clip the gradients to a standard value
➢ i.e. gradients larger than a certain value are replaced by that maximum gradient value
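A quick sketch of gradient clipping in Keras (the optimizer choice and the clipping thresholds are illustrative assumptions):

import tensorflow as tf

# clipvalue caps each gradient element at +/-1.0;
# clipnorm instead rescales the whole gradient if its norm exceeds 5.0.
opt_value_clipped = tf.keras.optimizers.Adam(learning_rate=1e-3, clipvalue=1.0)
opt_norm_clipped = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=5.0)

# model.compile(optimizer=opt_norm_clipped, loss="binary_crossentropy")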
The Problem of Long-Term Dependencies
➢ If we are trying to predict the last word in “the clouds are in the ___,” we
don’t need any further context – it’s pretty obvious the next word is going to
be sky.
➢ In such cases, where the gap between the relevant information and the
place that it’s needed is small, RNNs can learn to use the past information.
The Problem of Long-Term Dependencies
➢ Consider trying to predict the last word in the text “I grew up in France… I
speak fluent French.” Recent information suggests that the next word is
probably the name of a language, but if we want to narrow down which
language, we need the context of France, from further back. It’s entirely
possible for the gap between the relevant information and the point where it
is needed to become very large.
➢ Unfortunately, as that gap grows, RNNs become unable to learn to connect
the information.
Moving from RNN to LSTM
➢ All recurrent neural networks have the form of a chain of repeating modules of neural
network.
➢ In standard RNNs, this repeating module will have a very simple structure, such as a single
tanh layer.

The repeating module in a standard RNN contains a single layer.


Long Short Term Memory (LSTM)
Long Short Term Memory (LSTM)
➢ LSTMs also have this chain like structure, but the repeating module has a different structure
➢ Instead of having a single neural network layer, there are four, interacting in a special way for
controlling information flow.

The repeating module in an LSTM contains four interacting layers.


Cell State
➢ The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.
➢ The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some
minor linear interactions. It’s very easy for information to just flow along it unchanged.

➢ The LSTM does have the ability to remove or add information to the cell state, carefully regulated
by structures called gates.
Long Short Term Memory (LSTM)
➢ Gates are a way to optionally let information through. They are composed out of a sigmoid neural net
layer and a pointwise multiplication operation.

➢ The sigmoid layer outputs numbers between zero and one, describing how much of each component
should be let through. A value of zero means “let nothing through,” while a value of one means “let
everything through!”
➢ An LSTM has three of these gates, to protect and control the cell state.
Step-by-Step LSTM Walk Through
➢ The first step in our LSTM is to decide what information we’re going to throw away from the cell state.
➢ This decision is made by a sigmoid layer called the “forget gate layer.” It looks at ht−1 and xt, and outputs a
number between 0 and 1 for each number in the cell state Ct−1.
➢ 1 represents “completely keep this” while a 0 represents “completely get rid of this.”
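In the standard formulation (the equation is not printed on the slide but matches the diagram it is based on), the forget gate is:
   ft = σ(Wf · [ht−1, xt] + bf)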
Step-by-Step LSTM Walk Through
➢ The next step is to decide what new information we're going to store in the cell state. First, a sigmoid layer called the "input gate layer" decides which values we'll update. Next, a tanh layer creates a vector of new candidate values, C̃t, that could be added to the state. In the next step, we'll combine these two to create an update to the state.
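In the standard formulation:
   it = σ(Wi · [ht−1, xt] + bi)
   C̃t = tanh(WC · [ht−1, xt] + bC)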
Step-by-Step LSTM Walk Through
➢ It's now time to update the old cell state, Ct−1, into the new cell state Ct. The previous steps already decided what to do; we just need to actually do it.
➢ We multiply the old state by ft, forgetting the things we decided to forget earlier. Then we add it ∗ C̃t. This is the new candidate values, scaled by how much we decided to update each state value.
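In the standard formulation, the cell state update is:
   Ct = ft ∗ Ct−1 + it ∗ C̃t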
Step-by-Step LSTM Walk Through
➢ Finally, we need to decide what we’re going to output. This output will be based on our cell
state, but will be a filtered version.
➢ First, we run a sigmoid layer which decides what parts of the cell state we’re going to output.
Then, we put the cell state through tanh (to push the values to be between −1 and 1) and
multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
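In the standard formulation, the output gate and hidden state are:
   ot = σ(Wo · [ht−1, xt] + bo)
   ht = ot ∗ tanh(Ct)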
LSTM using Keras

[Link](units, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, dropout=0.0, recurrent_dropout=0.0)
Advantages of LSTM
➢ Non-decaying error backpropagation.
➢ For long time lag problems, LSTM can handle noise and continuous values.
➢ No parameter fine tuning.
➢ Memory for long time periods
LSTM Conclusions
➢ RNNs - self connected networks
➢ Vanishing gradients and long memory problems
➢ LSTM solves the vanishing gradient and the long-memory limitation problems
➢ LSTM can learn sequences with more than 1000 time steps.
Gated Recurrent Units (GRUs)
Gated Recurrent Units (GRUs)
➢ A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU,
introduced by Cho, et al. (2014). It combines the forget and input gates into a single “update
gate.”
➢ It also merges the cell state and hidden state, and makes some other changes. The resulting
model is simpler than standard LSTM models, and has been growing increasingly popular.
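In the standard GRU formulation (added here for reference rather than taken from the slides):
   zt = σ(Wz · [ht−1, xt])              (update gate)
   rt = σ(Wr · [ht−1, xt])              (reset gate)
   h̃t = tanh(W · [rt ∗ ht−1, xt])
   ht = (1 − zt) ∗ ht−1 + zt ∗ h̃t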
GRU using Keras

[Link](units, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, dropout=0.0, recurrent_dropout=0.0)
LSTM vs GRU
➢ A GRU has two gates, an LSTM has three gates. What does this tell you?
➢ In GRUs
➢ No internal memory (ct) different from the exposed hidden state.
➢ No output gate as in LSTMs.
➢ The input and forget gates of LSTMs are coupled by an update gate in
GRUs, and the reset gate (GRUs) is applied directly to the previous hidden
state.
➢ GRUs: No nonlinearity when computing the output.
Bidirectional RNNs: motivation
Task: Sentiment Classification

Example sentence (label: positive): "the movie was terribly exciting !"

We can regard the hidden state at "terribly" as a representation of the word "terribly" in the context of this sentence. We call this a contextual representation.

These contextual representations only contain information about the left context (e.g. "the movie was"). What about the right context?

In this example, "exciting" is in the right context, and this modifies the meaning of "terribly" (from negative to positive).
Bidirectional RNNs
This contextual representation of "terribly" has both left and right context!

[Diagram: a Forward RNN and a Backward RNN both run over "the movie was terribly exciting !", and their hidden states are concatenated at each position.]
Bidirectional RNNs

• Note: bidirectional RNNs are only applicable if you have access to the entire input sequence.
• They are not applicable to Language Modeling, because in LM you only have left context available.
• If you do have the entire input sequence (e.g. any kind of encoding), bidirectionality is powerful (you should use it by default).
• For example, BERT (Bidirectional Encoder Representations from Transformers) is a powerful pretrained contextual representation system built on bidirectionality.
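A minimal Keras sketch of a bidirectional sentence encoder for classification (the layer sizes and vocabulary size are illustrative assumptions):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),   # assumed vocabulary size
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),     # forward + backward states concatenated
    tf.keras.layers.Dense(1, activation="sigmoid"),               # e.g. positive/negative sentiment
])
model.summary()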
RNN Variants

• Vanilla mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image classification)
[Link]
• Sequence output (e.g. image captioning takes an image and outputs a sentence of words)
• Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment)
• Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English and then outputs a sentence in French)
• Synced sequence input and output (e.g. video classification where we wish to label each frame of the video)
Applications of RNN
RNN Applications - wherever you have Sequential Data!
Robot control
Time series prediction
Speech recognition
Rhythm learning
Music composition
Grammar learning
Handwriting recognition
Human action recognition
Protein Homology Detection
Predicting subcellular localization of proteins
Prediction tasks in the area of business process management
Prediction in medical care pathways
Sentiment Classification
Neural machine translation
Sequence to sequence chat model
Baidu's speech recognition using RNN
Music Transcription
Image and Video Processing
Natural Language Generation (e.g. generating Shakespeare-style or Wikipedia-style text)
Lab
[Link]

[Link]

[Link]
Thank You
For more information, please visit the following links:

gauravsingal789@[Link]
[Link]
[Link]



Comparison: Keras vs PyTorch vs TensorFlow

                              Keras                            PyTorch                           TensorFlow
API Level                     High                             Low                               High and Low
Architecture                  Simple, concise, readable        Complex, less readable            Not easy to use
Datasets                      Smaller datasets                 Large datasets, high performance  Large datasets, high performance
Debugging                     Simple network, so debugging     Good debugging capabilities       Difficult to conduct debugging
                              is not often needed
Does It Have Trained Models?  Yes                              Yes                               Yes
Popularity                    Most popular                     Third most popular                Second most popular
Speed                         Slow, low performance            Fast, high-performance            Fast, high-performance
Written In                    Python                           Lua                               C++, CUDA, Python

You might also like