How Recurrent Neural Networks
and Long Short-Term Memory
Work – By Example
2017-2021
Based on the notes from Brandon Rohrer
Explanation using examples
We will attempt to explain the functionality of
• RNNs
• LSTMs
by using a few examples
RNN – Guess what we have for Dinner tonight?
• Every night for dinner, we have either:
₋ Pizza, or
₋ Sushi, or
₋ Waffles
• and the cycle repeats
Guess the dinner tonight?
[Diagram: inputs go through a voting process to produce a prediction]
Outputs (3 choices):
• pizza
• sushi
• waffles
Inputs (whatever can affect what
we have for dinner), for
example:
• day of the week
• month
• a late meeting
Pizza, Sushi, Waffles, & repeat - Re-examine the data
Let’s simplify our assumptions
Assume that the choice of
dinner does not depend on the
day of the week, month, or late
meetings
Let’s assume that the data
follows a simple pattern of
• pizza,
• sushi,
• waffles, and
• repeat
Therefore, we just need to
know what we had last night.
What happens if we do not know what we had last night?
• e.g., I was not home last night,
I cannot remember,
…
• Then, it will be helpful to have:
• A prediction of what we might have had last night
What do we need to know to make a prediction about tonight's dinner?
• Generally we need
to know:
• A prediction of
what we might
have had last
night
or
• Information
about the dinner
last night
Side note - Vectors
Neural networks understand
vectors best:
vectors are the native
language of NNs
Side note - Vectors as statements
ONE HOT ENCODING
The list (vector) includes
all possibilities for the
days of the week
All of them are ZERO
except the one that is
true; since it is Tuesday,
that entry is ONE
“It is Tuesday”
Side note - One Hot Vector for our example
A vector: a list of values
We have 3 choices for
dinner
-Pizza,
-Sushi,
-Waffles
“we have Sushi”
The one hot vector
representing this
statement is:
[0 1 0]’  (pizza = 0, sushi = 1, waffles = 0)
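As a minimal sketch (my own illustration, not from the slides), the one-hot vector for “we have Sushi” could be built like this in Python with NumPy, assuming the pizza/sushi/waffles ordering used above:

import numpy as np

DINNERS = ["pizza", "sushi", "waffles"]   # fixed ordering assumed above

def one_hot(choice):
    # A vector of zeros with a single 1 at the position of the true choice.
    vec = np.zeros(len(DINNERS))
    vec[DINNERS.index(choice)] = 1.0
    return vec

print(one_hot("sushi"))   # [0. 1. 0.]  -> the statement "we have sushi"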
Input/Output vector
- Input: two vectors
1. A vector for the
prediction of yesterday's
dinner
2. A vector for the actual
dinner yesterday
- Output: one vector
1. A vector for the dinner
prediction for today
Recurrent Neural Networks
RNN - Create a feedback from output to the input
We can now connect
the output to the input
to feed the predicted
vector back with a delay
The dotted line in the diagram
signifies the delay:
if the output vector is
denoted Pt (time t), the
feedback line carries
Pt-1 (time t-1)
Dinner example - Unwrapped recurrent network
Now we can go as
far back as we want
Let’s say we have
the dinner
information from two
weeks ago, for
example
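A minimal sketch of this unrolling (my own illustration; the starting dinner and the 14-night horizon are example values): starting from a dinner we actually know about two weeks ago, the one-step prediction is applied night after night, feeding each prediction back in as the next input:

import numpy as np

DINNERS = ["pizza", "sushi", "waffles"]

# The assumed cycle: pizza -> sushi -> waffles -> pizza -> ...
# Written as a matrix so that next = W @ current (each column maps one
# one-hot dinner to the one-hot dinner that follows it).
W = np.array([[0, 0, 1],    # pizza follows waffles
              [1, 0, 0],    # sushi follows pizza
              [0, 1, 0]])   # waffles follows sushi

def one_hot(choice):
    vec = np.zeros(len(DINNERS))
    vec[DINNERS.index(choice)] = 1.0
    return vec

# Suppose all we know is that we had pizza 14 nights ago.
p = one_hot("pizza")
for night in range(14):
    p = W @ p               # feed yesterday's prediction back in
print(DINNERS[int(np.argmax(p))])   # prediction for tonight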
Example: A network to write a children’s book
The collection and/or
dictionary of the words
that we have to write
this book is rather small:
₋ Doug
₋ Jane
₋ Spot
₋ saw
₋ .
Objective: to put these
words together in the right
order to write a book
RNN to write a book
- 3 vectors:
1. A vector of the
words that we
have now (it)
2. A vector of the
prediction of the
words (Pt)
3. A vector of the
words that may
come next (Pt-1)
The new information (it) indicates what the current
word is, e.g., if it is Doug then the vector is [0 1 0 0 0 0]’
Trained RNN – new information vector (it)
Let’s try to work out
this RNN
After the training is
done, when the new
information is
₋ Jane,
₋ Doug, or
₋ Spot
we expect that the
trained RNN would
point to
₋ saw, or
₋ .
Working out our RNN – prediction vector (Pt-1)
Similarly, if the predicted
word (from the last step) is
- Jane,
- Doug, or
- Spot
we expect that the trained
net would point to
- saw, or
- .
Working out our RNN
If the present word is
- saw, or
- .
the trained net would
point to
- Jane,
- Doug, or
- Spot
since a name should
appear after saw or .
A representation for our RNN
The input is a collection
(concatenation) of the new
information and the
predicted values
The activation function used
here is tanh, which makes
the output behave
well
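A minimal sketch of this representation (my own illustration; the weights are random placeholders, not a trained network): the new information and the previous prediction are concatenated, multiplied by a weight matrix, and squashed with tanh:

import numpy as np

rng = np.random.default_rng(0)
n = 3                              # e.g., the three dinner options

# Untrained placeholder weights; the input is [new info ; previous prediction].
W = rng.normal(size=(n, 2 * n))

def rnn_step(new_info, prev_prediction):
    # One step of the simple RNN: concatenate, apply the weights, squash with tanh.
    x = np.concatenate([new_info, prev_prediction])
    return np.tanh(W @ x)          # every output element ends up between -1 and +1

print(rnn_step(np.eye(n)[1], np.zeros(n)))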
Side note – how does tanh work
Tanh is a squashing function
Regardless of the input,
the output will always be
between -1 & +1 (very
important)
For input values close to
zero the output value is
very close to the
original input
For large positive values
the output approaches +1
For large negative values
the output approaches -1
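A quick numeric check of this behaviour (illustration only):

import numpy as np

x = np.array([-10.0, -2.0, -0.1, 0.0, 0.1, 2.0, 10.0])
print(np.round(np.tanh(x), 3))
# [-1.    -0.964 -0.1    0.     0.1    0.964  1.   ]
# inputs near zero pass through almost unchanged; large ones saturate at -1 or +1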
Why may the RNN not work?
Doug saw Doug.
(after saw we expect a
name that name could be
Doug)
Jane saw Spot saw …
(after saw we expect a
name and after a name we
can expect saw …)
Spot. Doug. Jane.
(after a name we can
expect .)
What may not work so far?
Problem:
We have only short-term
memory:
we only look back one
time step & do not use
the information from
further back
RNN
A simple architecture of
an RNN, with a feedback
delay
Your input is a
combination of:
- the new information
&
- what you predicted in
the last step (time-
wise)
How do we fix this?
We need to modify
the existing
architecture
One solution is to add
memory capabilities
How do we add a
memory component?
Introduction of the memory component
Adding a memory
component
to enable the network
to remember what
happened many steps
ago (from further
back)
Side note - Element-by-Element Addition/Plus Junction
Side note - Element-by-Element Multiplication/Times Junction
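Both junctions are plain element-by-element operations; a minimal sketch with example values of my own:

import numpy as np

a = np.array([0.8, -0.5,  0.2])
b = np.array([0.1,  0.5, -0.2])

print(a + b)   # plus junction:  [ 0.9  0.   0.  ]    corresponding elements added
print(a * b)   # times junction: [ 0.08 -0.25 -0.04]  corresponding elements multiplied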
Gating
We can use the times junction to
control what percentage of
an input (a signal) goes
through, i.e., gating
In this example, the 1st
element of the signal goes
through completely
whereas the 3rd element is
completely masked
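A minimal sketch of gating with a times junction (my own example values): a gate value of 1 lets an element through untouched, 0 masks it completely, and anything in between passes a fraction:

import numpy as np

signal = np.array([0.7, -0.4, 0.9])
gate   = np.array([1.0,  0.5, 0.0])   # 1 = pass completely, 0 = block completely

print(signal * gate)   # [ 0.7 -0.2  0. ]  the 1st element passes, the 3rd is masked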
Side note - Sigmoid Function
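The sigmoid plays the same squashing role as tanh, but it maps everything to values between 0 and 1, which is exactly what a gating signal needs; a quick numeric sketch:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(np.round(sigmoid(x), 3))
# [0.    0.269 0.5   0.731 1.   ]  -> always between 0 and 1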
Memory Component: forget & keep
Memory
component:
given the prediction
from the last
round,
• to forget some
of the previous
prediction and
• to keep the rest
How does the forget gate work?
1. A combination of the previous
prediction & new information
goes through net1 (what to
predict) & a prediction
is made accordingly
2. A copy of the prediction (from
the last round) is given to the
forget gate, net2 (what to
forget)
A part of this will be forgotten &
the remaining part will be added
to the new prediction
Note:
net2 is different from net1 & its
task is to learn what to forget &
when to forget
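A minimal sketch of the forget/keep step under these assumptions (the weights stand in for a trained net2, and the names are my own): net2 looks at the combined previous prediction and new information and produces, through a sigmoid, one keep-fraction per memory element; the memory is then gated element-wise:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n = 3
W_forget = rng.normal(size=(n, 2 * n))     # placeholder for the trained net2

def forget_step(memory, prev_prediction, new_info):
    # Keep only the fraction of each memory element that net2 decides to keep.
    x = np.concatenate([new_info, prev_prediction])
    keep_fraction = sigmoid(W_forget @ x)   # between 0 (forget) and 1 (keep)
    return memory * keep_fraction           # gating via the times junction

memory = np.array([0.9, -0.6, 0.3])
print(forget_step(memory, np.zeros(n), np.eye(n)[1]))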
Add a selection layer – net3
We do not necessarily
need to send the entire
prediction to the
input/output
net3 (what to select)
learns which part of
the prediction goes
back to the
input/output
How does the selection gate work?
In the previous layer
(forget/keep) we combined our
memory with our prediction
1. We need to have a filter to
select which part of the
combined memory +
prediction goes out
2. We also need to add a new
tanh after the element-wise
add to make sure everything
is still between -1 & +1 (the
addition might have pushed
values beyond -1/+1)
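A minimal sketch of the selection step (placeholder weights stand in for a trained net3; the names are my own): the combined memory + prediction is re-squashed with tanh, and a sigmoid gate from net3 decides how much of each element goes out:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
n = 3
W_select = rng.normal(size=(n, 2 * n))    # placeholder for the trained net3

def select_step(combined, prev_prediction, new_info):
    # combined = memory + prediction, i.e., the result of the element-wise add
    x = np.concatenate([new_info, prev_prediction])
    selection = sigmoid(W_select @ x)     # 0..1: how much of each element goes out
    return np.tanh(combined) * selection  # re-squash, then gate

combined = np.array([1.4, -0.2, 0.8])     # the add may have drifted outside [-1, +1]
print(select_step(combined, np.zeros(n), np.eye(n)[0]))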
Where does learning happen so far?
• net1: to learn to PREDICT
• net2: to learn what to FORGET/KEEP
• net3: to learn what to SELECT
Add an ignore/attention layer – net4
To ignore some of
the possible
predictions
net4: what to ignore
How does the ignore layer work?
Some of the possible
predictions that are not
immediately relevant
get ignored,
so that we do not
unnecessarily complicate
the predictions (by having
too many of them) in the
memory going forward
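A minimal sketch of the ignore/attention step (placeholder weights stand in for a trained net4; the names are my own): a sigmoid gate damps the parts of the fresh prediction that are not relevant right now, before they are added to the memory:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
n = 3
W_ignore = rng.normal(size=(n, 2 * n))   # placeholder for the trained net4

def ignore_step(prediction, prev_prediction, new_info):
    # Damp the parts of the fresh prediction that should be ignored for now.
    x = np.concatenate([new_info, prev_prediction])
    attention = sigmoid(W_ignore @ x)    # near 0 = ignore, near 1 = keep
    return prediction * attention

prediction = np.array([0.8, -0.7, 0.1])
print(ignore_step(prediction, np.zeros(n), np.eye(n)[2]))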
Where does learning happen?
• net1: to learn to predict
• net2: to learn what to forget/keep
• net3: to learn what to select
• net4: to learn what to ignore
LSTM Structure
[Diagram: the full LSTM structure, combining net1 (predict), net2 (forget/keep), net3 (select), and net4 (ignore)]
Side note
• A multiplicative input gate unit learns to protect the constant
error flow within the memory cell from perturbation by
irrelevant inputs
• Likewise, a multiplicative output gate unit learns to protect
other units from perturbation by currently irrelevant
memory contents stored in the memory cell
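Putting the four pieces together, here is a minimal sketch of one pass through the cell described above (all four weight matrices are untrained placeholders, and the names are my own; production LSTM implementations also add bias terms and use a slightly different parameterisation):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
n = 3
# Placeholder weights for the four learned networks.
W1 = rng.normal(size=(n, 2 * n))   # net1: what to predict
W2 = rng.normal(size=(n, 2 * n))   # net2: what to forget/keep
W3 = rng.normal(size=(n, 2 * n))   # net3: what to select
W4 = rng.normal(size=(n, 2 * n))   # net4: what to ignore

def lstm_step(new_info, prev_prediction, memory):
    x = np.concatenate([new_info, prev_prediction])
    prediction = np.tanh(W1 @ x)              # net1: raw prediction
    attention  = sigmoid(W4 @ x)              # net4: ignore what is irrelevant
    keep       = sigmoid(W2 @ x)              # net2: forget part of the old memory
    memory     = memory * keep + prediction * attention   # element-wise add
    selection  = sigmoid(W3 @ x)              # net3: choose what goes out
    output     = np.tanh(memory) * selection  # re-squash, then gate the output
    return output, memory

out, mem = lstm_step(np.eye(n)[0], np.zeros(n), np.zeros(n))
print(out, mem)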
Running a simple example
Assume this LSTM is
already trained
net1, net2, net3, net4 are
known
Information going through
① So far we have …
“Jane saw Spot.”
and the new word is “Doug”
② We also know from the
previous prediction that the
next word can be “Doug,
Jane, Spot”
③ We pass this info through
net 1, 2, 3, 4 to
1. Predict
2. Ignore
3. Forget
4. Select
net1 - Prediction Step
④ The new word is “Doug”, net1 should predict that the next word is “saw”
Also, net1 should know that since the new word is “Doug” it should not see the word
“Doug” again very soon
net1 makes 2 predictions:
1. A positive prediction for
“saw”
2. A negative prediction for
“Doug” (do not expect to
see “Doug” in the near
future)
net4 - Ignore Step
This example is simple,
we do not need to focus on
ignoring anything
⑤ This prediction of
₋ “saw”
₋ “not Doug”
is passed forward
net2 - Forget Step
For the sake of
simplicity, assume
there is no memory at
the moment
⑥ Therefore,
• “saw”
• “not Doug”
go forward
net3 - Selection Step
The selection mechanism
(net3) has learned that when
the most recent word was a
name then the next word is either
• “saw” or
• “.”
⑦ net3 blocks any other words
from coming out, so
₋ “not Doug” gets blocked
₋ “saw” goes out
as the prediction for the next
time step
Next Prediction Process
So we take a step forward in
time; now the word “saw” is
both our most recent word and
our most recent prediction
They get passed forward to
all of these neural networks
(net 1, 2, 3, 4) and we get
a new set of predictions
net1 - Prediction Step
Because the word “saw” just
occurred we now predict that
the words
• “Doug”,
• “Jane”, or
• “Spot”
might come next
We will pass over the ignoring/
attention step in this example
again & take those
predictions forward
net2 - Forget Step
Now the other thing that we
need to consider is our
previous set of possibilities
Remember that we already
had the words
• saw
• not Doug
that we maintained internally
from the previous step
They get passed to the
forgetting gate
net2 - Forget Step
At the forgetting gate we know:
the last word that occurred was
the word “saw”, so the
network can forget it, but the
network should keep any
predictions about names
net2 therefore:
• forgets “saw”
• keeps “not Doug”
& now we have:
• a positive vote for “Doug”
(from the new prediction)
• a positive vote for “not Doug”
(i.e., a negative vote for
“Doug”)
They cancel each other out, so
after this point the network has only
“Jane” and “Spot”
Those get passed forward
net3 - Selection Step
The selection gate knows that
• the word “saw” just
occurred and
• a name should happen
next
so it passes through these
predictions for names, and
for the next time step
we get predictions of
• “Jane”
• “Spot”
Some mistakes may not happen
This network can avoid:
• Doug saw Doug.
• Jane saw Spot saw …
• Spot. Doug. Jane.
That is because an LSTM can look back two, three, or many time steps and
use that information to make good predictions about what's going to
happen next.
Note: vanilla recurrent neural networks can actually look back a few
time steps as well, but not very many.
LSTM Applications
• Translation of text from one language to another language
Even though translation is not a word-to-word process (it's a phrase-to-phrase or,
in some cases, a sentence-to-sentence process), LSTMs are able to
represent the grammar structures that are specific to each language. What it
looks like is that they find the higher-level idea and translate it from one
mode of expression to another, using just the bits and pieces that we
walked through.
LSTM Applications
• Translation of speech to text
Speech is just a signal that varies in time. An LSTM takes that signal and uses it to
predict what text (what word) is being spoken, and it can use the history (the
recent history of words) to make a better guess at what's going to come next.
LSTM Applications
• LSTMs are a great fit for any information that is embedded in time,
like audio and video
• Another example: an agent taking in information from a set of sensors and then, based
on that information, making a decision and carrying out an action.
• This is inherently sequential, and actions taken now can influence what is
sensed and what should be done many time steps down the line.
Some interesting applications