Script 2

Slide 20:

So for the second part of my presentation, I want to share what I've learned on some topics. I will be discussing GPT.

Slide 21:
(read the slides) This is the earlier version of ChatGPT, haha, its predecessor.

Slide 22:
Let's briefly discuss the architecture of GPT. These are the key components of the architecture:
Attention mechanism: The attention mechanism is the bread and butter of GPT. It allows the model to focus on different parts of the input sequence, understanding context and relationships between words, even if they are far apart in the text.
Tokenization: Tokenization means breaking the text down into smaller chunks such as subwords or individual characters. GPT-2 uses Byte-Pair Encoding (BPE), available through the tiktoken library.
Embeddings: Embeddings represent the 'meaning' of the tokens in a high-dimensional space. Every token has a 768-dimensional vector in the GPT-2 small model. (A short sketch of tokenization and embeddings follows after this list.)
Transformer blocks: Transformer blocks stack self-attention mechanisms and feed-forward networks, allowing the model to capture both local and long-range dependencies between tokens. They process the embedded tokens and apply the attention mechanism, along with other operations, to build up a rich representation of the text.
Linear head: Finally, a linear head is applied to produce the output, typically a sequence of tokens that form the model's response or prediction.
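
A minimal sketch of the tokenization and embedding steps described above, assuming the tiktoken library and a plain nn.Embedding standing in for the WTE layer:

import tiktoken
import torch
import torch.nn as nn

enc = tiktoken.get_encoding("gpt2")      # Byte-Pair Encoding used by GPT-2
tokens = enc.encode("Hello world")       # text -> list of integer token ids

wte = nn.Embedding(50257, 768)           # 768-dimensional vectors, GPT-2 small vocabulary
x = wte(torch.tensor(tokens))            # shape: (number of tokens, 768)
print(x.shape)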

Slide 22:
This is the overall architecture of the GPT-2 model. As you can see, there are a lot of operations being done here, and this transformer block is repeated multiple times in a single forward pass.

SLIDE 23:
GPT-2 is trained on the OpenWebText dataset, which is a very large corpus of text
scraped from the web. This dataset is massive, containing billions of words, and is
used to train the model to generate human-like text. However, with such a large
corpus, combined with the complex operations performed within the model, training
GPT-2 becomes very expensive in terms of both computational resources and time.
Given this, it's crucial that we focus on optimizations to make training more efficient and feasible.

Slide 24:
I will now discuss some of the optimizations I employed when I trained on a dummy dataset. The optimizations here are actually the default settings mentioned in the original GPT-2 paper.
The optimizer used is AdamW. For weight tying, the weights of the Word Token Embeddings (WTE) are shared with the Language Modeling (LM) head; these are the layers in the red square here. Because this shared matrix is vocab size by embedding size, one of the largest in the model, tying it significantly reduces the number of parameters. Also, due to how our GPUs/CPUs perform their operations, computation is much more efficient when tensor dimensions contain many powers of 2, so the vocab size is padded from 50257 to 50304 (a multiple of 128). Applying these optimizations with my experiment and setup, I got around 1 second per step. This will be the baseline.
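
A minimal sketch of these two tricks, assuming a GPT-2-style model with hypothetical wte and lm_head layers:

import torch
import torch.nn as nn

vocab_size = 50304   # padded up from 50257 to a multiple of 128 for GPU efficiency
n_embd = 768         # embedding size of the GPT-2 small model

wte = nn.Embedding(vocab_size, n_embd)                # word token embeddings
lm_head = nn.Linear(n_embd, vocab_size, bias=False)   # language-modeling head

# Weight tying: the LM head reuses the embedding matrix instead of keeping its own
# copy, removing one vocab_size x n_embd matrix (roughly 38M parameters).
lm_head.weight = wte.weight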

Slide 25:
By default, numbers are in FP32 (32-bit floating point), which is the standard precision for many deep learning models. However, TF32 offers an optimization by reducing the precision slightly, allowing for faster computation without significant loss in accuracy.
With TF32, I was able to achieve a speed of around 360 ms per step, down from 1000 ms.
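
A minimal sketch of enabling TF32 matrix multiplications in PyTorch, assuming an NVIDIA GPU whose tensor cores support TF32:

import torch

# "high" allows matmuls to run on TF32 tensor cores instead of full FP32
torch.set_float32_matmul_precision("high")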

By using torch.autocast(device_type=device, dtype=torch.bfloat16), we let the model automatically choose the right precision based on the operation. This means BF16 is applied to performance-critical operations such as the matrix multiplications in the forward pass, while precision-sensitive operations like the loss computation are kept in higher precision.
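
A minimal sketch of the mixed-precision forward/backward pattern, assuming a CUDA device and hypothetical model, optimizer and data_loader objects:

import torch

device = "cuda"
for x, y in data_loader:                              # hypothetical data loader
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad()
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        logits, loss = model(x, y)                    # forward pass and loss under autocast
    loss.backward()                                   # backward pass runs outside autocast
    optimizer.step()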

Slide 27:
Next, I used torch.compile to optimize the model further. torch.compile converts the model into an optimized intermediate representation, which enhances computational efficiency. It achieves this by fusing operations and eliminating redundant calculations, resulting in faster execution and improved overall performance.

By applying model = torch.compile(model), I was able to reduce the training time to around 130 ms per step, which is a significant improvement compared to the previous methods.
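
A minimal sketch of where the call goes, assuming PyTorch 2.x and a hypothetical MyGPT2Model class:

import torch

model = MyGPT2Model().to("cuda")   # hypothetical model definition
model = torch.compile(model)       # builds an optimized graph; fuses ops, removes redundant work
# The training loop stays unchanged; the first step is slower while compilation happens.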

Slide 29:
Each GPU has its own copy of the model, and they synchronize their gradients during the backward pass to ensure the model updates are consistent across all GPUs. I was not able to try this because my laptop suddenly broke, but it would make training significantly faster; the more GPUs, the faster it gets.
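
A minimal sketch of this multi-GPU setup using PyTorch's DistributedDataParallel, assuming the script is launched with torchrun (so RANK/LOCAL_RANK/WORLD_SIZE are set) and a hypothetical MyGPT2Model class:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyGPT2Model().to(local_rank)             # each GPU gets a full copy of the model
model = DDP(model, device_ids=[local_rank])
# During loss.backward(), gradients are all-reduced (averaged) across GPUs,
# so every copy takes the same optimizer step.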
