
One Step at a Time 📚


This document provides a beginner-friendly explanation of how to understand and train GPT-2. I started by implementing a transformer decoder; you can visit mini-autograd and mini-models for my older work. Now I am slowly graduating to setting up, training, and using a Generative Pre-trained Transformer (GPT-2) model: defining the model architecture, implementing the training loop, and generating text sequences. My learning rate is also 3e-4 because I need a steady caffeine drip! ✨
Manpreet's GitHub repository ☕

Table of Contents
1. Introduction
2. Model Architecture
3. Training Loop
4. Text Generation
5. Loving the Floats
6. Acknowledgements
1. Introduction
The aim is to thoroughly understand how to train a GPT-2 model from scratch. The project leverages the PyTorch library and includes custom implementations of critical components such as the attention mechanism and the transformer blocks. Testing on multiple devices (CPU, Apple MPS, and CUDA GPUs) is in progress.
The Transformer architecture, introduced in the paper "Attention is All You Need" by Vaswani et al., laid the
groundwork for models like GPT-2. Here are the key differences between the generic Transformer
architecture and GPT-2:
- Attention Mechanism: Both use self-attention mechanisms, but GPT-2 applies them causally to ensure tokens only attend to previous tokens in the sequence, maintaining the autoregressive property.

- Layer Normalization and Activation: GPT-2 employs layer normalization before the multi-head attention and feed-forward layers (pre-normalization), whereas the original Transformer applies it after these layers (post-normalization). GPT-2 also uses the GELU activation in its feed-forward layers instead of ReLU (see the sketch after this list).
- Model Depth: GPT-2 uses significantly more layers; the larger GPT-2 variants stack up to 48 transformer blocks, compared to the 6 decoder layers of the original Transformer.
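To make the pre-norm residual wiring concrete, here is a minimal sketch of a GPT-2 style block. The attn and mlp sub-modules are assumed to be passed in, and the class name PreNormBlock is my own; this illustrates the normalization placement rather than the repository's exact Block class.

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """GPT-2 style: normalize *before* each sub-layer, then add the residual."""
    def __init__(self, n_embd, attn, mlp):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.attn = attn  # any module mapping (B, T, C) -> (B, T, C)
        self.mlp = mlp    # any module mapping (B, T, C) -> (B, T, C)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # pre-norm: LN -> attention -> residual add
        x = x + self.mlp(self.ln_2(x))   # pre-norm: LN -> feed-forward -> residual add
        return x

# The original (post-norm) Transformer would instead compute:
#   x = self.ln_1(x + self.attn(x))
#   x = self.ln_2(x + self.mlp(x))
```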
2. Model Architecture
The core architecture of the GPT-2 model is defined in several classes, including GPT, Block, and
CausalSelfAttention.

GPTConfig

The GPTConfig class holds the essential hyperparameters for a GPT model. It specifies a block_size of 1024, which determines the maximum sequence length the model can process. The vocab_size is set to 50,257, covering the 50,000 Byte Pair Encoding (BPE) merge tokens, the 256 raw byte tokens, and one special end-of-text token. The model architecture is further defined by 12 transformer layers (n_layer), each incorporating 12 attention heads (n_head) so that several attention patterns can be computed in parallel. Additionally, the embedding dimension (n_embd) is set to 768, dictating the size of the vectors used to represent tokens. These defaults match the smallest GPT-2 (roughly 124M parameters), balancing complexity and computational efficiency and making the configuration suitable for training a robust language model.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024   # maximum sequence length the model can attend over
    vocab_size: int = 50257  # 50,000 BPE merges + 256 byte tokens + 1 special token
    n_layer: int = 12        # number of transformer blocks
    n_head: int = 12         # attention heads per block
    n_embd: int = 768        # embedding (hidden) dimension
```
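As a quick sanity check, the dataclass can be instantiated with the defaults or overridden per field (a usage sketch based on the definition above; the smaller variant is just a hypothetical example):

```python
config = GPTConfig()                                 # defaults: 12 layers, 12 heads, 768-dim embeddings
small = GPTConfig(n_layer=6, n_head=8, n_embd=512)   # hypothetical smaller experiment
assert config.n_embd % config.n_head == 0            # each head gets n_embd // n_head = 64 dimensions
```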

CausalSelfAttention

The CausalSelfAttention class is a crucial component in the GPT-2 model, implementing the self-
attention mechanism. Upon initialization, it checks that the embedding dimension (n_embd) is divisible by
the number of attention heads (n_head), ensuring a consistent split of dimensions across heads. The class
defines linear transformations for key, query, and value projections (c_attn), as well as an output
projection (c_proj). It registers a lower triangular matrix (bias) to enforce causality, ensuring that each
position can only attend to previous positions, thus preventing information leakage from future tokens.
In the forward method, the input tensor x is processed to extract batch size (B), sequence length (T), and
embedding dimensionality (C). The input is projected into query (q), key (k), and value (v) tensors. These
tensors are then reshaped and transposed to facilitate parallel processing across heads. The attention
mechanism computes the dot product of queries and keys, scales it, and applies a causal mask to maintain
temporal order. The softmax function normalizes these attention scores, which are then used to weight the
values. Finally, the output is recombined and projected back into the original embedding space. This
mechanism allows the model to focus on relevant parts of the input sequence, enhancing its ability to
understand context and generate coherent text.
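Putting those two paragraphs together, a minimal sketch of such a module looks like the following. It follows the description above (a combined c_attn projection, a registered bias mask, scaled dot-product attention with causal masking, softmax, and an output projection); the repository's actual implementation may differ in detail.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, and value projections for all heads, fused into one linear layer
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # output projection back to the embedding dimension
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # lower-triangular mask so position t can only attend to positions <= t
        self.register_buffer(
            "bias",
            torch.tril(torch.ones(config.block_size, config.block_size))
                 .view(1, 1, config.block_size, config.block_size),
        )

    def forward(self, x):
        B, T, C = x.size()  # batch size, sequence length, embedding dimensionality
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # reshape to (B, n_head, T, head_dim) so all heads are processed in parallel
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # scaled dot-product attention with the causal mask applied
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = att @ v                                        # weight the values by attention scores
        y = y.transpose(1, 2).contiguous().view(B, T, C)   # recombine the heads
        return self.c_proj(y)                              # project back to the embedding space
```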