Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them
for educational purposes as long as you cite DeepLearning.AI as the source of the slides.
For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode.
Transformers
vs RNNs
deeplearning.ai
Outline
● Issues with RNNs
● Comparison with Transformers
Neural Machine Translation
[Figure: a Seq2Seq RNN encoder-decoder translating "Comment allez-vous" into "How are you"]
● T sequential steps: no parallel computing!
● Loss of information along the sequence
● Vanishing gradient
RNNs vs Transformer: Encoder-Decoder
[Figure: an LSTM-based encoder-decoder with an attention mechanism translating "It's time for tea"; the encoder hidden states h1…h4 are combined by the attention mechanism into a context vector c that is fed to the decoder state s_{i-1}]
Transformers don't use RNNs, such as LSTMs or GRUs
Transformers
Overview
deeplearning.ai
The Transformer Model
https://arxiv.org/abs/1706.03762
Scaled Dot-Product Attention
[Figure: scaled dot-product attention takes queries, keys and values as inputs (Vaswani et al., 2017)]
Multi-Head Attention
● Scaled dot-product attention applied multiple times in parallel
● Linear transformations of the input queries, keys and values
The Encoder
● Provides a contextual representation of each item in the input sequence
Self-Attention
● Every item in the input attends to every other item in the sequence
The Decoder
● Encoder-Decoder Attention: every position from the decoder attends to the outputs from the encoder
● Masked Self-Attention: every position attends to previous positions
RNNs vs Transformer: Positional Encoding
POSITIONAL 0 0 1 1 0.84 0.0001 0.52 1 0.91 0.0002 -0.42 1
ENCODING
EMBEDDINGS
INPUT Je suis content
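The positional values shown in the figure come from a fixed function that is added to the word embeddings. Below is a minimal NumPy sketch of the sinusoidal positional encoding from Vaswani et al. (2017); the embedding values here are random toy data, not the course's.

    import numpy as np

    def positional_encoding(max_len, d_model):
        """Sinusoidal positional encoding (Vaswani et al., 2017)."""
        positions = np.arange(max_len)[:, np.newaxis]        # (max_len, 1)
        dims = np.arange(d_model)[np.newaxis, :]              # (1, d_model)
        # Each pair of dimensions uses a different frequency.
        angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
        angles = positions * angle_rates                       # (max_len, d_model)
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles[:, 0::2])                  # even dimensions
        pe[:, 1::2] = np.cos(angles[:, 1::2])                  # odd dimensions
        return pe

    # "Je suis content": 3 positions, toy embedding size 4
    embeddings = np.random.randn(3, 4)
    encoded_input = embeddings + positional_encoding(3, 4)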
The Transformer
[Figure: the full Transformer model with an encoder and a decoder]
Easy to parallelize!
Summary
● In RNNs parallel computing is difficult to implement
● For long sequences in RNNs there is loss of information
● In RNNs there is the problem of vanishing gradient
● Transformers help with all of the above
Transformer
Applications
deeplearning.ai
Outline
● Transformers applications in NLP
● Some Transformers
● Introduction to T5
Transformer NLP applications
● Translation
● Text summarization
● Chat-bots
● Auto-complete
Other NLP tasks
● Named entity recognition (NER)
● Question answering (Q&A)
● Sentiment analysis
● Market intelligence
● Text classification
● Character recognition
● Spell checking
State of the Art Transformers
● GPT-2: Generative Pre-training Transformer. Radford, A., et al. (2018), OpenAI
● BERT: Bidirectional Encoder Representations from Transformers. Devlin, J., et al. (2018), Google AI Language
● T5: Text-to-Text Transfer Transformer. Raffel, C., et al. (2019), Google
T5: Text-To-Text Transfer Transformer
● Translation: input "translate English to French: I am happy" → output "Je suis content"
● Classification (CoLA*): input "cola sentence: He bought fruits and." → output "unacceptable"
  input "cola sentence: He bought fruits and vegetables." → output "acceptable"
● Q&A: input "question: Which volcano in Tanzania is the highest mountain in Africa?" → output "Mount Kilimanjaro"
*CoLA stands for "Corpus of Linguistic Acceptability"
T5: Text-To-Text Transfer Transformer
● Regression (STS-B): input "stsb sentence1: Cats and dogs are mammals. sentence2: There are four known forces in nature – gravity, electromagnetic, weak and strong." → output 0.0
  input "stsb sentence1: Cats and dogs are mammals. sentence2: Cats, dogs, and cows are domesticated." → output 2.6
● Summarization: input "summarize: State authorities dispatched emergency crews Tuesday to survey the damage after an onslaught of severe weather in Mississippi…" → output "Six people hospitalized after a storm in Attala County"
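As an illustration of this text-to-text interface (not part of these slides), the same task prefixes can be sent to a pretrained T5 checkpoint with the Hugging Face transformers library, assuming it and PyTorch are installed:

    from transformers import T5Tokenizer, T5ForConditionalGeneration

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    prompts = [
        "translate English to French: I am happy",
        "cola sentence: He bought fruits and.",
        "summarize: State authorities dispatched emergency crews Tuesday ...",
    ]
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=32)
        print(tokenizer.decode(output_ids[0], skip_special_tokens=True))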
T5: Demo
Summary
● Transformers are suitable for a wide range of NLP applications
● Some transformers include GPT, BERT and T5
● T5 is a powerful multi-task transformer
Scaled Dot-Product
Attention
deeplearning.ai
Outline
● Revisit scaled dot product attention
● Mathematics behind Attention
Scaled dot-product attention
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V   (Vaswani et al., 2017)
● Scaling by √d_k improves performance
● The softmax weights add up to 1, so the output is a weighted sum of the values V
● Just two matrix multiplications and a softmax!
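A minimal NumPy sketch of the formula above; the function name and shapes are illustrative, not the course's implementation.

    import numpy as np

    def scaled_dot_product_attention(q, k, v):
        """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
        d_k = k.shape[-1]
        scores = q @ k.T / np.sqrt(d_k)                   # (num_queries, num_keys)
        scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)    # each row adds up to 1
        return weights @ v                                # (num_queries, d_v)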
Queries, Keys and Values
[Figure: the words of "Je suis heureux" are embedded and stacked row by row to form the query matrix Q; the words of "I am happy" are embedded and stacked the same way to form the key matrix K]
● The number of columns is the size of the embedding
● V is generally the same as K, with the same number of rows
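A toy sketch of the stacking step; the embedding table below is random and purely illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model = 4                                   # toy embedding size
    # Hypothetical embedding table: one vector per word.
    embed = {w: rng.standard_normal(d_model)
             for w in ["je", "suis", "heureux", "i", "am", "happy"]}

    # Stack one embedding per row: rows = words, columns = embedding size.
    Q = np.stack([embed[w] for w in ["je", "suis", "heureux"]])  # (3, d_model)
    K = np.stack([embed[w] for w in ["i", "am", "happy"]])       # (3, d_model)
    V = K                                          # generally the same as K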
Attention Math
[Figure: the output is one context vector per query; the number of rows equals the number of queries and the number of columns equals the size of the value vector. Each entry of the softmax weight matrix, e.g. the weight assigned to the third key for the second query, says how much that value contributes to that query's context vector]
Summary
● Scaled dot-product attention is essential for the Transformer
● The inputs to attention are queries, keys, and values
● It is just matrix multiplications and a softmax, so it runs efficiently on GPUs and TPUs
Masked
Self-Attention
deeplearning.ai
Outline
● Ways of Attention
● Overview of masked Self-Attention
Encoder-Decoder Attention
Queries from one sentence, keys and values from another
[Figure: weight matrix between the French queries "c'est", "l'heure", "du", "thé" and the English keys "it's time for tea"]
Self-Attention
Queries, keys and values come from the same sentence
[Figure: weight matrix between the words of "it's time for tea" and themselves]
Gives the meaning of each word within the sentence
Masked Self-Attention
Queries, keys and values come from the same sentence. Queries don't attend to future positions.
[Figure: weight matrix for "it's time for tea" with the entries above the diagonal masked out]
Masked self-attention math
Attention(Q, K, V) = softmax(QKᵀ / √d_k + M) V
M is a mask matrix with 0 on and below the diagonal and minus infinity above it, so after the softmax the weights assigned to future positions are equal to 0.
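A sketch of the mask in NumPy, extending the attention sketch above; for simplicity the queries, keys and values are all the raw sequence x (no learned projections).

    import numpy as np

    def masked_self_attention(x):
        """Masked self-attention over one sequence x of shape (seq_len, d_model)."""
        seq_len, d_k = x.shape
        scores = x @ x.T / np.sqrt(d_k)
        # Mask M: 0 on and below the diagonal, -inf above it (future positions).
        mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
        scores = scores + mask
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # future weights become 0
        return weights @ x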
Summary
● There are three main ways of Attention: Encoder/Decoder,
self-attention and masked self-attention.
● In self-attention, queries, keys and values come from the same sentence
● In masked self-attention queries cannot attend to the future
Multi-head
Attention
deeplearning.ai
Outline
● Intuition behind Multi-Head Attention
● Math of Multi-Head Attention
Multi-Head Attention - Overview
[Figure: the original embeddings of "it's time for tea" and "c'est l'heure du thé" are linearly transformed into different sets of queries, keys and values for Head 1 and Head 2, so each head can attend to different relationships between the words]
Multi-Head Attention - Overview
[Figure: the queries, keys and values each go through per-head linear layers (learnable parameters); scaled dot-product attention is applied in parallel for every head; the heads' outputs are concatenated and a final linear layer produces the result]
Multi-Head Attention
[Figure: each head computes attention and produces context vectors for each query; the heads' outputs are concatenated]
Usual choice of dimensions: d_k = d_v = d_model / (number of heads), where d_model is the embedding size
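A compact sketch of the pipeline above, reusing the scaled_dot_product_attention sketch from earlier; the random matrices stand in for learned parameters.

    import numpy as np

    def multi_head_attention(x_q, x_kv, num_heads, rng=np.random.default_rng(0)):
        """Per-head linear projections, parallel scaled dot-product attention,
        concatenation, and a final linear layer (d_model must be divisible
        by num_heads)."""
        d_model = x_q.shape[-1]
        d_head = d_model // num_heads          # usual choice: d_k = d_v = d_model / heads
        heads = []
        for _ in range(num_heads):
            w_q, w_k, w_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
            heads.append(scaled_dot_product_attention(x_q @ w_q, x_kv @ w_k, x_kv @ w_v))
        concat = np.concatenate(heads, axis=-1)  # (num_queries, d_model)
        w_o = rng.standard_normal((d_model, d_model))
        return concat @ w_o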
Summary
● Multi-head attention attends to information from different representation subspaces
● The heads are computed in parallel
● Similar computational cost to single-head attention
Transformer
decoder
deeplearning.ai
Outline
● Overview of Transformer decoder
● Implementation (decoder and feed-forward block)
Transformer decoder Overview
[Figure: the decoder stack: input embedding + positional encoding → N × (multi-head attention → add & norm → feed forward → add & norm) → linear → softmax → output probabilities]
● Input: a sentence or paragraph; we predict the next word
● The sentence gets embedded and positional encoding is added (vectors representing each word and its position)
● Multi-head attention looks at previous words
● Feed-forward layer with ReLU: that's where most parameters are!
● Residual connection with layer normalization
● Repeat N times
● Dense layer and softmax for output (see the sketch below)
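A sketch of this stack in NumPy, reusing positional_encoding from earlier; decoder_blocks is assumed to be a list of callables such as the decoder-block sketch later in this section, and embed_table / w_out stand in for learned parameters.

    import numpy as np

    def transformer_decoder(token_ids, embed_table, decoder_blocks, w_out):
        """Embed, add positional encoding, apply N decoder blocks,
        then a dense layer and softmax over the vocabulary (a sketch)."""
        x = embed_table[token_ids]                        # (seq_len, d_model)
        x = x + positional_encoding(len(token_ids), x.shape[-1])
        for block in decoder_blocks:                      # repeat N times
            x = block(x)
        logits = x @ w_out                                # (seq_len, vocab_size)
        logits -= logits.max(axis=-1, keepdims=True)
        probs = np.exp(logits)
        return probs / probs.sum(axis=-1, keepdims=True)  # next-word probabilities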
Transformer decoder
[Figure: the input "<start> I am happy" goes through the input embedding, positional encoding is added, and the result is fed to the decoder block]
The Transformer decoder
[Figure: inside a decoder block, the positional input embedding goes through multi-head attention, then an add & norm step, LayerNorm(x + MultiHeadAttention(x)); a feed-forward layer is applied to each position, followed by another add & norm, producing the output vector]
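A sketch of one decoder block, reusing the multi_head_attention sketch from earlier; w1, b1, w2, b2 stand in for learned feed-forward weights, and in the real decoder the self-attention is also masked, as described in the masked self-attention section.

    import numpy as np

    def layer_norm(x, eps=1e-6):
        """Normalize each position's vector to zero mean and unit variance."""
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return (x - mean) / (std + eps)

    def decoder_block(x, w1, b1, w2, b2, num_heads=2):
        """Self-attention + add & norm, then position-wise feed forward (ReLU)
        + add & norm."""
        attn = multi_head_attention(x, x, num_heads)  # Q, K, V all come from x
        x = layer_norm(x + attn)                      # add & norm
        ff = np.maximum(0, x @ w1 + b1) @ w2 + b2     # feed forward with ReLU
        return layer_norm(x + ff)                     # add & norm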
The Transformer decoder
Feed forward layer
[Figure: after self-attention, the same feed-forward network with ReLU is applied independently to every position in the sequence]
Summary
● The Transformer decoder mainly consists of three layers
● The decoder block and the feed-forward block are the core of this model's code
● It also includes a module to calculate the cross-entropy loss
Transformer
summarizer
deeplearning.ai
Outline
● Overview of Transformer summarizer
● Technical details for data processing
● Inference with a Language Model
Transformer for summarization
[Figure: the same Transformer decoder; input: the article text, output: its summary]
Technical details for data processing
Model input:
ARTICLE TEXT <EOS> SUMMARY <EOS> <pad> …
Tokenized version:
[2, 3, 5, 2, 1, 3, 4, 7, 8, 2, 5, 1, 2, 3, 6, 2, 1, 0, 0]
Loss weights: 0s until the first <EOS>, then 1s starting at the first word of the summary.
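A sketch of how such an example and its loss weights could be assembled; the function name, token ids, and the choice to also weight the final <EOS> are assumptions for illustration.

    def make_example(article_ids, summary_ids, eos_id, pad_id=0, max_len=None):
        """Concatenate ARTICLE <EOS> SUMMARY <EOS>, pad, and build loss weights."""
        tokens = article_ids + [eos_id] + summary_ids + [eos_id]
        # 0 weight for the article and its <EOS>, 1 for every summary token.
        weights = [0] * (len(article_ids) + 1) + [1] * (len(summary_ids) + 1)
        if max_len is not None:                    # pad to a fixed length
            pad = max_len - len(tokens)
            tokens += [pad_id] * pad
            weights += [0] * pad
        return tokens, weights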
Cost function
Cross entropy loss:
J = -(1/m) Σ_i Σ_j y_j^(i) log ŷ_j^(i)
● j: over the words of the summary
● i: over the batch elements (m examples in the batch)
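A sketch of this weighted cross-entropy computed from the decoder's log-probabilities and the 0/1 loss weights built above; array names and shapes are illustrative.

    import numpy as np

    def weighted_cross_entropy(log_probs, targets, weights):
        """log_probs: (batch, seq_len, vocab) log-probabilities from the decoder
        targets:   (batch, seq_len) token ids of the next words
        weights:   (batch, seq_len) the 0/1 loss weights (summary positions only)."""
        batch, seq_len, _ = log_probs.shape
        picked = log_probs[np.arange(batch)[:, None],
                           np.arange(seq_len)[None, :],
                           targets]                   # log ŷ of the true tokens
        return -(picked * weights).sum() / batch      # the 1/m average over the batch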
Inference with a Language Model
Model input:
[Article] <EOS> [Summary] <EOS>
Inference:
● Provide: [Article] <EOS>
● Generate summary word-by-word
○ until the final <EOS>
● Pick the next word by random sampling
○ each time you get a different summary!
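A sketch of this sampling loop; next_word_probs is a hypothetical stand-in for the trained decoder, assumed to return a probability distribution over the next token given everything generated so far.

    import numpy as np

    def summarize(article_ids, next_word_probs, eos_id, max_len=100,
                  rng=np.random.default_rng()):
        """Generate a summary word-by-word until the final <EOS>."""
        tokens = list(article_ids) + [eos_id]         # provide [Article] <EOS>
        summary = []
        for _ in range(max_len):
            probs = next_word_probs(tokens)
            next_id = rng.choice(len(probs), p=probs)  # random sampling
            if next_id == eos_id:                      # stop at the final <EOS>
                break
            summary.append(next_id)
            tokens.append(next_id)
        return summary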
Summary
● For summarization, a weighted loss function is optimized
● The Transformer decoder summarizes by predicting the next word of the summary, given the article as input
● The transformer uses tokenized versions of the input