
Deep Learning

Dr. Irfan Yousuf


Institute of Data Science, UET, Lahore
(Week 6; February 23, 2025)
Outline
• Long Short Term Memory
Dealing with Vanishing Gradients
Long Short Term Memory (LSTM)
• A type of RNN that maintains both “long-term memory” and
“short-term memory”.

• LSTM networks combat the vanishing gradient, or long-term
dependency, problem of standard RNNs.

• The weights and biases of connections in the network change
once per training episode, analogous to how physiological
changes in synaptic strengths store long-term memories;
activation patterns in the network change once per time step,
analogous to how an instantaneous change in electrical firing
patterns in the brain stores short-term memories.
Long Short Term Memory (LSTM)
• For example, if an RNN is asked to predict the next word in
the phrase “have a pleasant _______,” it will readily
anticipate “day.”

• “I am going to buy a table that is large in size; it’ll cost more,
which means I have to ______ down my budget for the chair.”

• Here the useful context (“cost more,” “budget”) is far from the
blank. An LSTM forgets the parts of the sentence that carry no
valuable information and retains this relevant context, so it can
produce the result “cut” (cut down the budget); a standard RNN
struggles to carry context over such long distances.
RNN vs. LSTM

• Basically, we are feeding in a sequence of inputs. The hope is
that the state of the “cell” contains information from all of
the inputs that have been fed in up to that point, i.e., all of the
Xs that have been fed in have a say in the state of A. Think of
it as A listening, more or less, to every one of the Xs.
RNN vs. LSTM

• X0: “Hey A! *Important info*”
• A: “Okay. Got it.”
• X1: “Hey A! *Irrelevant info*”
• A: “Okay. Got it.”
• X2: “Hey A! *Important info*”
• A: “Okay. Got it.”
RNN vs. LSTM

• So, in all likelihood, it mostly just remembers what the later
Xs said, i.e., the things said by the Xs at the start of the
sequence have little to no influence on what A remembers at
the end.
RNN vs. LSTM

• X0: “Hey A! *Important info*”
• A: “Okay. Got it.”
• X1: “Hey A! *Irrelevant info*”
• A: “Okay. Forget it.”
• X2: “Hey A! *Important info*”
• A: “Okay. Got it.”
LSTM Architecture
• Long Short-Term Memory (LSTM) is a recurrent neural
network architecture designed by Sepp Hochreiter and
Jürgen Schmidhuber in 1997.

• The structure of an LSTM network consists of a series of
LSTM cells, each of which has a set of gates (input, output,
and forget gates) that control the flow of information into
and out of the cell.

• The gates are used to selectively forget or retain
information from the previous time steps, allowing the
LSTM to maintain long-term dependencies in the input
data.
LSTM Architecture
• In an LSTM network, hidden layers are the layers
between the input and output layers where computations
are performed.

• Each hidden layer contains units called neurons or
memory cells.

• These units process input data and store information over
time, using gates to control the flow of information.

• The number of hidden layers and units determines the
ability of the model to learn complex temporal patterns
and dependencies in sequential data.
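As a rough illustration of hidden layers and memory cells in code, here is a minimal sketch using PyTorch's nn.LSTM; the sizes (10 input features, 64 memory cells, 2 layers) are arbitrary choices for the example, not values taken from the slides.

```python
import torch
import torch.nn as nn

# A hypothetical stacked LSTM: 2 hidden layers with 64 memory cells each.
# input_size, hidden_size and num_layers are illustrative choices.
lstm = nn.LSTM(input_size=10, hidden_size=64, num_layers=2, batch_first=True)

x = torch.randn(32, 20, 10)      # 32 sequences, 20 time steps, 10 features per step
output, (h_n, c_n) = lstm(x)
print(output.shape)              # torch.Size([32, 20, 64]) -> hidden state at every time step
print(h_n.shape, c_n.shape)      # torch.Size([2, 32, 64]) each -> final H and C of both layers
```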
Input and Output
• An LSTM unit receives three vectors (three lists of
numbers) as input.
• Two vectors come from the LSTM itself and were
generated by the LSTM at the previous instant (t − 1).
• These are the cell state (C) and the hidden state (H).

• The third vector comes from outside. This is the vector X
(called the input vector) submitted to the LSTM at instant t.
Input and Output
• Given the three input vectors (C, H, X), the LSTM regulates,
through the gates, the internal flow of information and
transforms the values of the cell state and hidden state
vectors.

• Information flow control is done so that the cell state acts


as a long-term memory, while the hidden state acts as a
short-term memory.
Input and Output
• In practice, the LSTM unit uses recent past information
(the short-term memory, H) and new information coming
from the outside (the input vector, X) to update the
long-term memory (the cell state, C).

• Finally, it uses the long-term memory (the cell state, C) to
update the short-term memory (the hidden state, H).

• The hidden state determined at instant t is also the output
of the LSTM unit at instant t. It is what the LSTM provides
to the outside for the performance of a specific task.
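To see that last point concretely, here is a small check with PyTorch's nn.LSTM (again with arbitrary sizes): the per-step output sequence is exactly the sequence of hidden states H, and its last entry equals the final hidden state returned next to the cell state.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
x = torch.randn(2, 5, 4)                   # 2 sequences, 5 time steps, 4 features
output, (h_n, c_n) = lstm(x)               # output = H at every step; (h_n, c_n) = final H and C
print(output.shape)                        # torch.Size([2, 5, 8])
print(torch.allclose(output[:, -1, :], h_n[0]))  # True: the output at t is the hidden state at t
```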
Gates
• The three gates (forget gate, input gate and output gate)
are information selectors. Their task is to create selector
vectors. A selector vector is a vector with values between
zero and one.
• A selector vector is created to be multiplied, element by
element, by another vector of the same size.
• All three gates are neural networks that use the sigmoid
function as the activation function in the output layer.
• All three gates use the input vector (X) and the hidden
state vector coming from the previous instant (t−1)
concatenated together in a single vector. This vector is the
input of all three gates.
Forget Gate
• The first activity of the LSTM unit is executed by the forget
gate. The forget gate decides (based on X_[t] and H_[t−1]
vectors) what information to remove from the cell state
vector coming from time t−1. The outcome of this decision
is a selector vector.
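On the slide the forget gate is shown graphically; in the standard formulation its selector vector is computed from the concatenation of H_[t−1] and X_[t] as

$$
f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)
$$

Each entry of f_t lies between 0 and 1; multiplying it element by element with C_[t−1] decides how much of each cell-state component survives.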
Input Gate and Candidate Memory
• After removing some of the information from the cell state
received as input (C_[t−1]), we can insert new information.
This activity is carried out by two neural networks: the
candidate memory and the input gate. The two neural
networks are independent of each other.
Input Gate and Candidate Memory
• The candidate memory is responsible for generating a
candidate vector: a vector of information that is a
candidate to be added to the cell state.
• The input gate is responsible for generating a selector
vector, which will be multiplied element by element with
the candidate vector.
Input Gate and Candidate Memory
• The result of the multiplication between the candidate
vector and the selector vector is added to the cell state
vector. This adds new information to the cell state.
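In the standard equations this step reads as follows (⊙ denotes element-by-element multiplication):

$$
i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right), \qquad
\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)
$$
$$
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
$$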
Output Gate
• Output generation also works with a multiplication
between a selector vector and a candidate vector.
• We get a hidden state with values between -1 and 1. This
makes it possible to control the stability of the network
over time.
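In the standard equations, the output gate produces the selector, the tanh of the updated cell state serves as the candidate, and their element-wise product is the new hidden state, which therefore has values between −1 and 1:

$$
o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \qquad
h_t = o_t \odot \tanh(C_t)
$$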
Mathematics Behind LSTM
LSTM Architecture

The key to LSTMs is the cell state, the horizontal line running
through the top of the diagram.

The cell state is kind of like a conveyor belt. It runs straight
down the entire chain, with only some minor linear
interactions. It’s very easy for information to just flow along it
unchanged.
LSTM Architecture: Forget Gate
LSTM Architecture: Input Gate
LSTM Architecture: Update Cell State
LSTM Architecture: Output Gate
RNN vs. LSTM
Mathematics of LSTM
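The equation slides are not reproduced here; as a stand-in, the following is a minimal NumPy sketch of a single LSTM time step using the standard equations above (all sizes and the weight initialisation are illustrative).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM time step; every weight matrix acts on the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)          # forget gate: what to erase from c_{t-1}
    i_t = sigmoid(W_i @ z + b_i)          # input gate: what to write
    c_hat = np.tanh(W_c @ z + b_c)        # candidate memory
    c_t = f_t * c_prev + i_t * c_hat      # new cell state (long-term memory)
    o_t = sigmoid(W_o @ z + b_o)          # output gate
    h_t = o_t * np.tanh(c_t)              # new hidden state (short-term memory / output)
    return h_t, c_t

# Toy sizes: 3 input features, 4 memory cells.
n_x, n_h = 3, 4
rng = np.random.default_rng(0)
W_f, W_i, W_c, W_o = (0.1 * rng.standard_normal((n_h, n_h + n_x)) for _ in range(4))
b_f, b_i, b_c, b_o = (np.zeros(n_h) for _ in range(4))

h, c = np.zeros(n_h), np.zeros(n_h)
for t in range(5):                         # feed a short random sequence through the cell
    h, c = lstm_step(rng.standard_normal(n_x), h, c, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o)
print("h:", h)
print("c:", c)
```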
Backpropagation in LSTM
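The backpropagation derivations themselves are on the slides; as a quick, hedged illustration of why they matter, the snippet below compares how much gradient reaches the first time step of a long sequence for an untrained vanilla RNN versus an LSTM. Exact numbers depend on the random initialisation, but the RNN's gradient at the earliest step is typically far smaller.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, input_size, hidden_size = 100, 8, 16
x = torch.randn(1, seq_len, input_size, requires_grad=True)

for name, net in [("RNN", nn.RNN(input_size, hidden_size, batch_first=True)),
                  ("LSTM", nn.LSTM(input_size, hidden_size, batch_first=True))]:
    out, _ = net(x)
    out[:, -1].sum().backward()            # gradient of the last output w.r.t. every input
    print(f"{name}: gradient norm at t=0 = {x.grad[0, 0].norm():.2e}")
    x.grad = None                          # reset before the next model
```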
RNN vs. LSTM
Bi-directional LSTM (BiLSTM)
• A Bidirectional Long Short-Term Memory (BiLSTM) is an
extension of the traditional LSTM (Long Short-Term
Memory) architecture.
• In a regular LSTM, the network processes sequences in
one direction (usually from left to right).
• In contrast, a BiLSTM processes the sequence in two
directions:
• Forward Direction: Left to right (like a regular LSTM).
• Backward Direction: Right to left.
Bi-directional LSTM (BiLSTM)
• By doing this, BiLSTMs are able to capture context from
both past and future for each time step in the sequence,
which can improve performance on tasks that require
understanding the entire context of a sequence, such as in
text processing, machine translation, and speech
recognition.

• In simpler terms, while a regular LSTM only has
information from the previous words, a BiLSTM has access
to both previous and future words in the sequence, which
can help the model understand the full context more
effectively.
Working of BiLSTM
• Two LSTMs: A BiLSTM consists of two LSTM layers:
• One LSTM processes the sequence in the forward
direction (left to right).
• One LSTM processes the sequence in the backward
direction (right to left).

• Hidden States: The outputs from both LSTMs at each time
step are combined, typically by concatenation or
summation, to form the final hidden state at that time
step.
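A minimal sketch of this in PyTorch (sizes are illustrative): with bidirectional=True, the per-step output concatenates the forward and backward hidden states, so its feature dimension doubles.

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=10, hidden_size=64, batch_first=True, bidirectional=True)

x = torch.randn(8, 15, 10)          # 8 sequences, 15 time steps, 10 features
output, (h_n, c_n) = bilstm(x)
print(output.shape)                 # torch.Size([8, 15, 128]): forward + backward states concatenated
print(h_n.shape)                    # torch.Size([2, 8, 64]): final hidden state of each direction
```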
Working of BiLSTM
• Two Contexts:
• The forward LSTM captures information from earlier parts
of the sequence.
• The backward LSTM captures information from the later
parts of the sequence.
Working of BiLSTM
Gated Recurrent Unit
• Gated recurrent units (GRUs) are a gating mechanism in
recurrent neural networks, introduced in 2014.

• The GRU is like a long short-term memory (LSTM) with a
gating mechanism to input or forget certain features, but
lacks a context vector or output gate, resulting in fewer
parameters than the LSTM.

• The GRU's performance on certain tasks of polyphonic music
modeling, speech signal modeling and natural language
processing was found to be similar to that of the LSTM.
GRU Architecture

In a GRU, the memory cell state is replaced with a “candidate
activation vector,” which is updated using two gates: the reset
gate and the update gate.
GRU Architecture

Update Gate: Determines how much of the previous hidden state should
be kept and how much of the new candidate memory should be added to
the hidden state.
Reset Gate: Controls how much of the previous hidden state should be
forgotten when calculating the candidate hidden state.
GRU Architecture
GRU: Reset Gate and Update Gate

The first thing we need to introduce are the reset gate and the
update gate. A reset gate would allow us to control how much
of the previous state we might still want to remember.
Likewise, an update gate would allow us to control how
much of the new state is just a copy of the old state.
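In the standard GRU formulation, both gates are computed from the input vector and the previous hidden state, exactly as the LSTM gates are:

$$
z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t] + b_z\right) \quad \text{(update gate)}
$$
$$
r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t] + b_r\right) \quad \text{(reset gate)}
$$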
GRU: Reset Gate and Update Gate
GRU Architecture
GRU: Candidate Hidden State
GRU Architecture
GRU: Hidden State
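The slide's equations are not reproduced here; in the usual formulation the reset gate scales the previous hidden state inside the candidate, and the update gate interpolates between the old state and the candidate (conventions differ only in whether z_t or 1 − z_t multiplies the old state):

$$
\tilde{h}_t = \tanh\left(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h\right)
$$
$$
h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
$$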
GRU
RNN vs. LSTM vs. GRU
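To make the parameter-count comparison concrete, here is a small PyTorch check (sizes are arbitrary): with the same input and hidden sizes, the vanilla RNN has one set of weights per layer, the GRU roughly three times as many (two gates plus the candidate), and the LSTM roughly four times as many (three gates plus the candidate).

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

kwargs = dict(input_size=10, hidden_size=64, batch_first=True)
for name, cls in [("RNN", nn.RNN), ("GRU", nn.GRU), ("LSTM", nn.LSTM)]:
    print(f"{name:4s}: {n_params(cls(**kwargs))} parameters")
```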
Summary
• Long Short Term Memory
