Script 2

Slide 20:

So for the second part of my presentation, I want to share what I've learned on some topics. I will be discussing GPT.

Slide 21:
(read the slides) This is the earlier version of ChatGPT, haha, its predecessor.

Slide 22:
Let's briefly discuss the architecture of GPT. These are the key components of the architecture:
Attention mechanism: The attention mechanism is the bread and butter of GPT. It allows the model to focus on different parts of the input sequence, understanding context and relationships between words, even if they are far apart in the text.
Tokenization: Tokenization means breaking the text down into smaller chunks such as subwords or individual characters. GPT-2 uses Byte-Pair Encoding (BPE), available through the tiktoken library.
Embeddings: Embeddings represent the 'meaning' of the tokens in a high-dimensional space. Every token has a 768-dimensional vector in the GPT-2 small model. (A short sketch of tokenization and embeddings follows after this list.)
Transformer blocks: Transformer blocks stack self-attention mechanisms and feed-forward networks, allowing the model to capture both local and long-range dependencies between tokens. They process the embedded tokens and apply the attention mechanism, along with other operations, to build up a rich representation of the text.
Linear head: Finally, a linear head is applied to produce the output, typically a sequence of tokens that form the model's response or prediction.
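
A minimal sketch of the tokenization and embedding steps described above, assuming the tiktoken library and a plain nn.Embedding standing in for the WTE layer:

import tiktoken
import torch
import torch.nn as nn

enc = tiktoken.get_encoding("gpt2")      # Byte-Pair Encoding used by GPT-2
tokens = enc.encode("Hello world")       # text -> list of integer token ids

wte = nn.Embedding(50257, 768)           # 768-dimensional vectors, GPT-2 small vocabulary
x = wte(torch.tensor(tokens))            # shape: (number of tokens, 768)
print(x.shape)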

Slide 22:
This is the overall architecture of the GPT-2 model. As you can see, there are a lot of operations being done here, and this transformer block is repeated multiple times in a single forward pass.

SLIDE 23:
GPT-2 is trained on the OpenWebText dataset, which is a very large corpus of text
scraped from the web. This dataset is massive, containing billions of words, and is
used to train the model to generate human-like text. However, with such a large
corpus, combined with the complex operations performed within the model, training
GPT-2 becomes very expensive in terms of both computational resources and time.
Given this, it's crucial that we focus on optimizations to make training more efficient and feasible.

Slide 24:
I will now discuss some of the optimizations I employed when I trained on a dummy dataset. The optimizations here are actually the default settings mentioned in the original GPT-2 paper.
The optimizer used is AdamW. For weight tying, the weights of the Word Token Embeddings (WTE) are shared with the Language Modeling (LM) head; these are the layers in the red square here. Because this shared matrix is vocab size by embedding size, one of the largest in the model, tying it significantly reduces the number of parameters. Also, due to how our GPUs/CPUs perform their operations, computation is much more efficient when tensor dimensions contain many powers of 2, so the vocab size is padded from 50257 to 50304 (a multiple of 128). Applying these optimizations with my experiment and setup, I got around 1 second per step. This will be the baseline.
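
A minimal sketch of these two tricks, assuming a GPT-2-style model with hypothetical wte and lm_head layers:

import torch
import torch.nn as nn

vocab_size = 50304   # padded up from 50257 to a multiple of 128 for GPU efficiency
n_embd = 768         # embedding size of the GPT-2 small model

wte = nn.Embedding(vocab_size, n_embd)                # word token embeddings
lm_head = nn.Linear(n_embd, vocab_size, bias=False)   # language-modeling head

# Weight tying: the LM head reuses the embedding matrix instead of keeping its own
# copy, removing one vocab_size x n_embd matrix (roughly 38M parameters).
lm_head.weight = wte.weight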

Slide 25:
By default, numbers are in FP32 (32-bit floating point), which is the standard precision for many deep learning models. However, TF32 offers an optimization by reducing the precision slightly, allowing for faster computation without significant loss in accuracy.
With TF32, I was able to achieve a speed of around 360 ms per step, down from 1000 ms.
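
A minimal sketch of enabling TF32 matrix multiplications in PyTorch, assuming an NVIDIA GPU whose tensor cores support TF32:

import torch

# "high" allows matmuls to run on TF32 tensor cores instead of full FP32
torch.set_float32_matmul_precision("high")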

By using torch.autocast(device_type=device, dtype=torch.bfloat16), we let the model automatically choose the right precision based on the operation. This means BF16 is applied to performance-critical operations such as the matrix multiplications in the forward pass, while precision-sensitive operations like the loss computation are kept in higher precision.
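
A minimal sketch of the mixed-precision forward/backward pattern, assuming a CUDA device and hypothetical model, optimizer and data_loader objects:

import torch

device = "cuda"
for x, y in data_loader:                              # hypothetical data loader
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad()
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        logits, loss = model(x, y)                    # forward pass and loss under autocast
    loss.backward()                                   # backward pass runs outside autocast
    optimizer.step()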

Slide 27:
Next, I used torch.compile to optimize the model further. torch.compile converts the model into an optimized intermediate representation, which enhances computational efficiency. It achieves this by fusing operations and eliminating redundant calculations, resulting in faster execution and improved overall performance.

By applying model = torch.compile(model), I was able to reduce the training time to around 130 ms per step, which is a significant improvement compared to the previous methods.
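
A minimal sketch of where the call goes, assuming PyTorch 2.x and a hypothetical MyGPT2Model class:

import torch

model = MyGPT2Model().to("cuda")   # hypothetical model definition
model = torch.compile(model)       # builds an optimized graph; fuses ops, removes redundant work
# The training loop stays unchanged; the first step is slower while compilation happens.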

Slide 29:
Each GPU has its own copy of the model, and they synchronize their gradients during the backward pass to ensure the model updates are consistent across all GPUs. I was not able to try this because my laptop suddenly broke, but it would make training significantly faster; the more GPUs, the faster it gets.
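
A minimal sketch of this multi-GPU setup using PyTorch's DistributedDataParallel, assuming the script is launched with torchrun (so RANK/LOCAL_RANK/WORLD_SIZE are set) and a hypothetical MyGPT2Model class:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyGPT2Model().to(local_rank)             # each GPU gets a full copy of the model
model = DDP(model, device_ids=[local_rank])
# During loss.backward(), gradients are all-reduced (averaged) across GPUs,
# so every copy takes the same optimizer step.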
