
One Step at a Time 📚


This document provides a beginner-friendly explanation of how to understand and train GPT-2. I started by implementing a transformer decoder; you can visit mini-autograd and mini-models for my older work. Now I am slowly graduating to setting up, training, and using a Generative Pre-trained Transformer (GPT-2) model: defining the model architecture, implementing the training loop, and generating text sequences. My learning rate is also 3e-4 because I need a steady caffeine drip! ✨
Manpreet's GitHub repository ☕

Table of Contents
1. Introduction
2. Model Architecture
3. Training Loop
4. Text Generation
5. Loving the Floats
6. Acknowledgements
1. Introduction
The aim is to thoroughly understand how to train a GPT-2 model from scratch. The project leverages the PyTorch library and includes custom implementations of critical components such as the attention mechanism and the transformer blocks. Testing on multiple devices (CPU, Apple MPS, and CUDA GPUs) is in progress.
The Transformer architecture, introduced in the paper "Attention is All You Need" by Vaswani et al., laid the
groundwork for models like GPT-2. Here are the key differences between the generic Transformer
architecture and GPT-2:
- Attention Mechanism: Both use self-attention mechanisms, but GPT-2 applies them causally to ensure tokens only attend to previous tokens in the sequence, maintaining the autoregressive property.

- Layer Normalization and Activation: GPT-2 employs layer normalization before the multi-head attention and feed-forward layers (pre-normalization), whereas the original Transformer applies it after these layers (post-normalization). GPT-2 also uses the GELU activation in its feed-forward layers instead of ReLU (see the sketch after this list).
- Model Depth: GPT-2 uses significantly more layers; the larger GPT-2 variants stack up to 48 transformer blocks, compared to the 6 decoder layers of the original Transformer.
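To make the pre-norm residual wiring concrete, here is a minimal sketch of a GPT-2 style block. The attn and mlp sub-modules are assumed to be passed in, and the class name PreNormBlock is my own; this illustrates the normalization placement rather than the repository's exact Block class.

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """GPT-2 style: normalize *before* each sub-layer, then add the residual."""
    def __init__(self, n_embd, attn, mlp):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.attn = attn  # any module mapping (B, T, C) -> (B, T, C)
        self.mlp = mlp    # any module mapping (B, T, C) -> (B, T, C)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # pre-norm: LN -> attention -> residual add
        x = x + self.mlp(self.ln_2(x))   # pre-norm: LN -> feed-forward -> residual add
        return x

# The original (post-norm) Transformer would instead compute:
#   x = self.ln_1(x + self.attn(x))
#   x = self.ln_2(x + self.mlp(x))
```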
2. Model Architecture
The core architecture of the GPT-2 model is defined in several classes, including GPT, Block, and
CausalSelfAttention.

GPTConfig

The GPTConfig class holds the essential hyperparameters for a GPT model. It specifies a block_size of 1024, which determines the maximum sequence length the model can process. The vocab_size is set to 50,257, covering the 50,000 Byte Pair Encoding (BPE) merge tokens, the 256 raw byte tokens, and one special end-of-text token. The model architecture is further defined by 12 transformer layers (n_layer), each incorporating 12 attention heads (n_head) so that several attention patterns can be computed in parallel. Additionally, the embedding dimension (n_embd) is set to 768, dictating the size of the vectors used to represent tokens. These defaults match the smallest GPT-2 (roughly 124M parameters), balancing complexity and computational efficiency and making the configuration suitable for training a robust language model.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024   # maximum sequence length the model can attend over
    vocab_size: int = 50257  # 50,000 BPE merges + 256 byte tokens + 1 special token
    n_layer: int = 12        # number of transformer blocks
    n_head: int = 12         # attention heads per block
    n_embd: int = 768        # embedding (hidden) dimension
```
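As a quick sanity check, the dataclass can be instantiated with the defaults or overridden per field (a usage sketch based on the definition above; the smaller variant is just a hypothetical example):

```python
config = GPTConfig()                                 # defaults: 12 layers, 12 heads, 768-dim embeddings
small = GPTConfig(n_layer=6, n_head=8, n_embd=512)   # hypothetical smaller experiment
assert config.n_embd % config.n_head == 0            # each head gets n_embd // n_head = 64 dimensions
```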

CausalSelfAttention

The CausalSelfAttention class is a crucial component in the GPT-2 model, implementing the self-
attention mechanism. Upon initialization, it checks that the embedding dimension (n_embd) is divisible by
the number of attention heads (n_head), ensuring a consistent split of dimensions across heads. The class
defines linear transformations for key, query, and value projections (c_attn), as well as an output
projection (c_proj). It registers a lower triangular matrix (bias) to enforce causality, ensuring that each
position can only attend to previous positions, thus preventing information leakage from future tokens.
In the forward method, the input tensor x is processed to extract batch size (B), sequence length (T), and
embedding dimensionality (C). The input is projected into query (q), key (k), and value (v) tensors. These
tensors are then reshaped and transposed to facilitate parallel processing across heads. The attention
mechanism computes the dot product of queries and keys, scales it, and applies a causal mask to maintain
temporal order. The softmax function normalizes these attention scores, which are then used to weight the
values. Finally, the output is recombined and projected back into the original embedding space. This
mechanism allows the model to focus on relevant parts of the input sequence, enhancing its ability to
understand context and generate coherent text.
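Putting those two paragraphs together, a minimal sketch of such a module looks like the following. It follows the description above (a combined c_attn projection, a registered bias mask, scaled dot-product attention with causal masking, softmax, and an output projection); the repository's actual implementation may differ in detail.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, and value projections for all heads, fused into one linear layer
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # output projection back to the embedding dimension
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # lower-triangular mask so position t can only attend to positions <= t
        self.register_buffer(
            "bias",
            torch.tril(torch.ones(config.block_size, config.block_size))
                 .view(1, 1, config.block_size, config.block_size),
        )

    def forward(self, x):
        B, T, C = x.size()  # batch size, sequence length, embedding dimensionality
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # reshape to (B, n_head, T, head_dim) so all heads are processed in parallel
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # scaled dot-product attention with the causal mask applied
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = att @ v                                        # weight the values by attention scores
        y = y.transpose(1, 2).contiguous().view(B, T, C)   # recombine the heads
        return self.c_proj(y)                              # project back to the embedding space
```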