Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them
for educational purposes as long as you cite DeepLearning.AI as the source of the slides.
For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode.
Transformers
vs RNNs
deeplearning.ai
Outline
● Issues with RNNs
● Comparison with Transformers
Neural Machine Translation
[Figure: a Seq2Seq RNN encoder-decoder translating "Comment allez-vous" into "How are you"]
● T sequential steps: no parallel computing!
● Loss of information along the sequence
● Vanishing gradient
RNNs vs Transformer: Encoder-Decoder
[Figure: an LSTM-based encoder-decoder with an attention mechanism translating "It's time for tea"; the encoder hidden states h1…h4 are combined by the attention mechanism into a context vector c that is fed to the decoder state s_{i-1}]
Transformers don't use RNNs, such as LSTMs or GRUs
Transformers
Overview
deeplearning.ai
The Transformer Model
https://arxiv.org/abs/1706.03762
Scaled Dot-Product Attention
[Figure: scaled dot-product attention takes queries, keys and values as inputs (Vaswani et al., 2017)]
Multi-Head Attention
● Scaled dot-product attention applied multiple times in parallel
● Linear transformations of the input queries, keys and values
The Encoder
● Provides a contextual representation of each item in the input sequence
Self-Attention
● Every item in the input attends to every other item in the sequence
The Decoder
● Encoder-Decoder Attention: every position from the decoder attends to the outputs from the encoder
● Masked Self-Attention: every position attends to previous positions
RNNs vs Transformer: Positional Encoding
POSITIONAL 0 0 1 1 0.84 0.0001 0.52 1 0.91 0.0002 -0.42 1
ENCODING
EMBEDDINGS
INPUT Je suis content
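The positional values shown in the figure come from a fixed function that is added to the word embeddings. Below is a minimal NumPy sketch of the sinusoidal positional encoding from Vaswani et al. (2017); the embedding values here are random toy data, not the course's.

    import numpy as np

    def positional_encoding(max_len, d_model):
        """Sinusoidal positional encoding (Vaswani et al., 2017)."""
        positions = np.arange(max_len)[:, np.newaxis]        # (max_len, 1)
        dims = np.arange(d_model)[np.newaxis, :]              # (1, d_model)
        # Each pair of dimensions uses a different frequency.
        angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
        angles = positions * angle_rates                       # (max_len, d_model)
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles[:, 0::2])                  # even dimensions
        pe[:, 1::2] = np.cos(angles[:, 1::2])                  # odd dimensions
        return pe

    # "Je suis content": 3 positions, toy embedding size 4
    embeddings = np.random.randn(3, 4)
    encoded_input = embeddings + positional_encoding(3, 4)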
The Transformer
[Figure: the full Transformer model with an encoder and a decoder]
Easy to parallelize!
Summary
● In RNNs parallel computing is difficult to implement
● For long sequences in RNNs there is loss of information
● In RNNs there is the problem of vanishing gradient
● Transformers help with all of the above
Transformer
Applications
deeplearning.ai
Outline
● Transformers applications in NLP
● Some Transformers
● Introduction to T5
Transformer NLP applications
● Translation
● Text summarization
● Chat-bots
● Auto-complete
Other NLP tasks
● Named entity recognition (NER)
● Question answering (Q&A)
● Sentiment analysis
● Market intelligence
● Text classification
● Character recognition
● Spell checking
State of the Art Transformers
● GPT-2: Generative Pre-training Transformer. Radford, A., et al. (2018), OpenAI
● BERT: Bidirectional Encoder Representations from Transformers. Devlin, J., et al. (2018), Google AI Language
● T5: Text-to-Text Transfer Transformer. Raffel, C., et al. (2019), Google
T5: Text-To-Text Transfer Transformer
● Translation: input "translate English to French: I am happy" → output "Je suis content"
● Classification (CoLA*): input "cola sentence: He bought fruits and." → output "unacceptable"
  input "cola sentence: He bought fruits and vegetables." → output "acceptable"
● Q&A: input "question: Which volcano in Tanzania is the highest mountain in Africa?" → output "Mount Kilimanjaro"
*CoLA stands for "Corpus of Linguistic Acceptability"
T5: Text-To-Text Transfer Transformer
● Regression (STS-B): input "stsb sentence1: Cats and dogs are mammals. sentence2: There are four known forces in nature – gravity, electromagnetic, weak and strong." → output 0.0
  input "stsb sentence1: Cats and dogs are mammals. sentence2: Cats, dogs, and cows are domesticated." → output 2.6
● Summarization: input "summarize: State authorities dispatched emergency crews Tuesday to survey the damage after an onslaught of severe weather in Mississippi…" → output "Six people hospitalized after a storm in Attala County"
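As an illustration of this text-to-text interface (not part of these slides), the same task prefixes can be sent to a pretrained T5 checkpoint with the Hugging Face transformers library, assuming it and PyTorch are installed:

    from transformers import T5Tokenizer, T5ForConditionalGeneration

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    prompts = [
        "translate English to French: I am happy",
        "cola sentence: He bought fruits and.",
        "summarize: State authorities dispatched emergency crews Tuesday ...",
    ]
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=32)
        print(tokenizer.decode(output_ids[0], skip_special_tokens=True))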
T5: Demo
Summary
● Transformers are suitable for a wide range of NLP applications
● Some transformers include GPT, BERT and T5
● T5 is a powerful multi-task transformer
Scaled Dot-Product
Attention
deeplearning.ai
Outline
● Revisit scaled dot product attention
● Mathematics behind Attention
Scaled dot-product attention
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V   (Vaswani et al., 2017)
● Scaling by √d_k improves performance
● The softmax weights add up to 1, so the output is a weighted sum of the values V
● Just two matrix multiplications and a softmax!
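A minimal NumPy sketch of the formula above; the function name and shapes are illustrative, not the course's implementation.

    import numpy as np

    def scaled_dot_product_attention(q, k, v):
        """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
        d_k = k.shape[-1]
        scores = q @ k.T / np.sqrt(d_k)                   # (num_queries, num_keys)
        scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)    # each row adds up to 1
        return weights @ v                                # (num_queries, d_v)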
Queries, Keys and Values
[Figure: the words of "Je suis heureux" are embedded and stacked row by row to form the query matrix Q; the words of "I am happy" are embedded and stacked the same way to form the key matrix K]
● The number of columns is the size of the embedding
● V is generally the same as K, with the same number of rows
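A toy sketch of the stacking step; the embedding table below is random and purely illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model = 4                                   # toy embedding size
    # Hypothetical embedding table: one vector per word.
    embed = {w: rng.standard_normal(d_model)
             for w in ["je", "suis", "heureux", "i", "am", "happy"]}

    # Stack one embedding per row: rows = words, columns = embedding size.
    Q = np.stack([embed[w] for w in ["je", "suis", "heureux"]])  # (3, d_model)
    K = np.stack([embed[w] for w in ["i", "am", "happy"]])       # (3, d_model)
    V = K                                          # generally the same as K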
Attention Math
[Figure: the output is one context vector per query; the number of rows equals the number of queries and the number of columns equals the size of the value vector. Each entry of the softmax weight matrix, e.g. the weight assigned to the third key for the second query, says how much that value contributes to that query's context vector]
Summary
● Scaled dot-product attention is essential for the Transformer
● The inputs to attention are queries, keys, and values
● It is just matrix multiplications and a softmax, so it runs efficiently on GPUs and TPUs
Masked
Self-Attention
deeplearning.ai
Outline
● Ways of Attention
● Overview of masked Self-Attention
Encoder-Decoder Attention
Queries from one sentence, keys and values from another
[Figure: weight matrix between the French queries "c'est", "l'heure", "du", "thé" and the English keys "it's time for tea"]
Self-Attention
Queries, keys and values come from the same sentence
[Figure: weight matrix between the words of "it's time for tea" and themselves]
Gives the meaning of each word within the sentence
Masked Self-Attention
Queries, keys and values come from the same sentence. Queries don't attend to future positions.
[Figure: weight matrix for "it's time for tea" with the entries above the diagonal masked out]
Masked self-attention math
Attention(Q, K, V) = softmax(QKᵀ / √d_k + M) V
M is a mask matrix with 0 on and below the diagonal and minus infinity above it, so after the softmax the weights assigned to future positions are equal to 0.
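A sketch of the mask in NumPy, extending the attention sketch above; for simplicity the queries, keys and values are all the raw sequence x (no learned projections).

    import numpy as np

    def masked_self_attention(x):
        """Masked self-attention over one sequence x of shape (seq_len, d_model)."""
        seq_len, d_k = x.shape
        scores = x @ x.T / np.sqrt(d_k)
        # Mask M: 0 on and below the diagonal, -inf above it (future positions).
        mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
        scores = scores + mask
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # future weights become 0
        return weights @ x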
Summary
● There are three main ways of Attention: Encoder/Decoder,
self-attention and masked self-attention.
● In self-attention, queries, keys and values come from the same sentence
● In masked self-attention queries cannot attend to the future
Multi-head
Attention
deeplearning.ai
Outline
● Intuition behind Multi-Head Attention
● Math of Multi-Head Attention
Multi-Head Attention - Overview
[Figure: the original embeddings of "it's time for tea" and "c'est l'heure du thé" are linearly transformed into different sets of queries, keys and values for Head 1 and Head 2, so each head can attend to different relationships between the words]
Multi-Head Attention - Overview
[Figure: the queries, keys and values each go through per-head linear layers (learnable parameters); scaled dot-product attention is applied in parallel for every head; the heads' outputs are concatenated and a final linear layer produces the result]
Multi-Head Attention
[Figure: each head computes attention and produces context vectors for each query; the heads' outputs are concatenated]
Usual choice of dimensions: d_k = d_v = d_model / (number of heads), where d_model is the embedding size
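A compact sketch of the pipeline above, reusing the scaled_dot_product_attention sketch from earlier; the random matrices stand in for learned parameters.

    import numpy as np

    def multi_head_attention(x_q, x_kv, num_heads, rng=np.random.default_rng(0)):
        """Per-head linear projections, parallel scaled dot-product attention,
        concatenation, and a final linear layer (d_model must be divisible
        by num_heads)."""
        d_model = x_q.shape[-1]
        d_head = d_model // num_heads          # usual choice: d_k = d_v = d_model / heads
        heads = []
        for _ in range(num_heads):
            w_q, w_k, w_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
            heads.append(scaled_dot_product_attention(x_q @ w_q, x_kv @ w_k, x_kv @ w_v))
        concat = np.concatenate(heads, axis=-1)  # (num_queries, d_model)
        w_o = rng.standard_normal((d_model, d_model))
        return concat @ w_o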
Summary
● Multi-head attention attends to information from different representation subspaces
● The heads are computed in parallel
● Similar computational cost to single-head attention
Transformer
decoder
deeplearning.ai
Outline
● Overview of Transformer decoder
● Implementation (decoder and feed-forward block)
Transformer decoder Overview
[Figure: the decoder stack: input embedding + positional encoding → N × (multi-head attention → add & norm → feed forward → add & norm) → linear → softmax → output probabilities]
● Input: a sentence or paragraph; we predict the next word
● The sentence gets embedded and positional encoding is added (vectors representing each word and its position)
● Multi-head attention looks at previous words
● Feed-forward layer with ReLU: that's where most parameters are!
● Residual connection with layer normalization
● Repeat N times
● Dense layer and softmax for output (see the sketch below)
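A sketch of this stack in NumPy, reusing positional_encoding from earlier; decoder_blocks is assumed to be a list of callables such as the decoder-block sketch later in this section, and embed_table / w_out stand in for learned parameters.

    import numpy as np

    def transformer_decoder(token_ids, embed_table, decoder_blocks, w_out):
        """Embed, add positional encoding, apply N decoder blocks,
        then a dense layer and softmax over the vocabulary (a sketch)."""
        x = embed_table[token_ids]                        # (seq_len, d_model)
        x = x + positional_encoding(len(token_ids), x.shape[-1])
        for block in decoder_blocks:                      # repeat N times
            x = block(x)
        logits = x @ w_out                                # (seq_len, vocab_size)
        logits -= logits.max(axis=-1, keepdims=True)
        probs = np.exp(logits)
        return probs / probs.sum(axis=-1, keepdims=True)  # next-word probabilities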
Transformer decoder
[Figure: the input "<start> I am happy" goes through the input embedding, positional encoding is added, and the result is fed to the decoder block]
The Transformer decoder
[Figure: inside a decoder block, the positional input embedding goes through multi-head attention, then an add & norm step, LayerNorm(x + MultiHeadAttention(x)); a feed-forward layer is applied to each position, followed by another add & norm, producing the output vector]
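A sketch of one decoder block, reusing the multi_head_attention sketch from earlier; w1, b1, w2, b2 stand in for learned feed-forward weights, and in the real decoder the self-attention is also masked, as described in the masked self-attention section.

    import numpy as np

    def layer_norm(x, eps=1e-6):
        """Normalize each position's vector to zero mean and unit variance."""
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return (x - mean) / (std + eps)

    def decoder_block(x, w1, b1, w2, b2, num_heads=2):
        """Self-attention + add & norm, then position-wise feed forward (ReLU)
        + add & norm."""
        attn = multi_head_attention(x, x, num_heads)  # Q, K, V all come from x
        x = layer_norm(x + attn)                      # add & norm
        ff = np.maximum(0, x @ w1 + b1) @ w2 + b2     # feed forward with ReLU
        return layer_norm(x + ff)                     # add & norm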
The Transformer decoder
Feed forward layer
[Figure: after self-attention, the same feed-forward network with ReLU is applied independently to every position in the sequence]
Summary
● The Transformer decoder mainly consists of three layers
● The decoder block and the feed-forward block are the core of this model's code
● It also includes a module to calculate the cross-entropy loss
Transformer
summarizer
deeplearning.ai
Outline
● Overview of Transformer summarizer
● Technical details for data processing
● Inference with a Language Model
Transformer for summarization
[Figure: the same Transformer decoder; input: the article text, output: its summary]
Technical details for data processing
Model input:
ARTICLE TEXT <EOS> SUMMARY <EOS> <pad> …
Tokenized version:
[2, 3, 5, 2, 1, 3, 4, 7, 8, 2, 5, 1, 2, 3, 6, 2, 1, 0, 0]
Loss weights: 0s until the first <EOS>, then 1s starting at the first word of the summary.
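A sketch of how such an example and its loss weights could be assembled; the function name, token ids, and the choice to also weight the final <EOS> are assumptions for illustration.

    def make_example(article_ids, summary_ids, eos_id, pad_id=0, max_len=None):
        """Concatenate ARTICLE <EOS> SUMMARY <EOS>, pad, and build loss weights."""
        tokens = article_ids + [eos_id] + summary_ids + [eos_id]
        # 0 weight for the article and its <EOS>, 1 for every summary token.
        weights = [0] * (len(article_ids) + 1) + [1] * (len(summary_ids) + 1)
        if max_len is not None:                    # pad to a fixed length
            pad = max_len - len(tokens)
            tokens += [pad_id] * pad
            weights += [0] * pad
        return tokens, weights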
Cost function
Cross entropy loss:
J = -(1/m) Σ_i Σ_j y_j^(i) log ŷ_j^(i)
● j: over the words of the summary
● i: over the batch elements (m examples in the batch)
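A sketch of this weighted cross-entropy computed from the decoder's log-probabilities and the 0/1 loss weights built above; array names and shapes are illustrative.

    import numpy as np

    def weighted_cross_entropy(log_probs, targets, weights):
        """log_probs: (batch, seq_len, vocab) log-probabilities from the decoder
        targets:   (batch, seq_len) token ids of the next words
        weights:   (batch, seq_len) the 0/1 loss weights (summary positions only)."""
        batch, seq_len, _ = log_probs.shape
        picked = log_probs[np.arange(batch)[:, None],
                           np.arange(seq_len)[None, :],
                           targets]                   # log ŷ of the true tokens
        return -(picked * weights).sum() / batch      # the 1/m average over the batch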
Inference with a Language Model
Model input:
[Article] <EOS> [Summary] <EOS>
Inference:
● Provide: [Article] <EOS>
● Generate summary word-by-word
○ until the final <EOS>
● Pick the next word by random sampling
○ each time you get a different summary!
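A sketch of this sampling loop; next_word_probs is a hypothetical stand-in for the trained decoder, assumed to return a probability distribution over the next token given everything generated so far.

    import numpy as np

    def summarize(article_ids, next_word_probs, eos_id, max_len=100,
                  rng=np.random.default_rng()):
        """Generate a summary word-by-word until the final <EOS>."""
        tokens = list(article_ids) + [eos_id]         # provide [Article] <EOS>
        summary = []
        for _ in range(max_len):
            probs = next_word_probs(tokens)
            next_id = rng.choice(len(probs), p=probs)  # random sampling
            if next_id == eos_id:                      # stop at the final <EOS>
                break
            summary.append(next_id)
            tokens.append(next_id)
        return summary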
Summary
● For summarization, a weighted loss function is optimized
● The Transformer decoder summarizes by predicting the next word of the summary, given the article as input
● The transformer uses tokenized versions of the input