LLM Crash Course
Srijit Mukherjee
LLM Crash Course Content
Build your Foundations (Day 1)
Mathematics of Deep Learning
Problem Statement
Language Processing
Build LLM from Scratch (Day 2)
GPT Architecture Experimental Setup Data Processing
LLM: Problem Statement
Input (x)
Output (y)
LLM: Problem Statement
θ : unknown parameters
Neural Network
Input (x) -> f(x,θ) -> Output (y)
LLM: Problem Statement
θ : unknown parameters
Neural Network
x = [x1, x2, x3, …, xn] -> f(x,θ) -> y = [y1, y2, y3, …, ym]
Tokenization turns a sequence of words (or characters) into [x1, …, xn].
LLM: Problem Statement
● Given a text “x”, how to define [x1, x2, …, xn] based on x?
○ We answer this in detail in Tokenization.
○ However, here are some examples.
● Assume x = “I love large language models”.
○ x = [‘I’, ‘love’, ‘large’, ‘language’, ‘models’, ‘.’] (word-wise)
○ x = [‘I’, ‘l’ ,‘o’, ‘v’, ‘e’, ‘l’, ‘a’, …, ‘e’, ‘l’, ‘s’, ‘.’] (character-wise)
○ There can be other ways too. What are they?
● The question is similar to a stock market prediction problem.
○ If you know the prices of the last 100 days, can you predict the prices of the next 5 days?
○ Stock market prediction is highly random and hard.
○ Language prediction is comparatively easier: text has much more structure.
LLM: Revising the Problem Statement
θ : unknown parameters
Neural Network
x = [x1, x2, x3, …, xn] -> f(x,θ) -> y = [y1, y2, y3, …, ym]
LLM: Deep Learning Process
θ : unknown parameters
Neural Network
x (tensor) -> f(x,θ) -> y (tensor)
● Goal: Given data (x1,y1), (x2,y2), …., (xn,yn), how do we find the best
parameters θ, such that f(xi,θ) is as close as possible to yi for all i?
Important Questions
● We have a large text. How do we divide it into (xi, yi) input-output pairs?
● In a time series, the pairs (xi, yi) are roughly independent of each other; but in language, the tokens depend on each other.
LLM: Tensors are the math of data.
Vector -> Matrix -> Tensor (an image is a tensor)
● Everything is a tensor: an image is a tensor, and stock market data is a tensor.
● Images have a natural mathematical (intensity-based) setup of a tensor of
dimension (C,H,W) where C = channels, H = height, W = width.
● Important Question: How is language data a tensor?
(Whiteboard sketch: words are symbols with only logical connections, e.g. their ASCII values; meaning lies beyond the symbols. "I want to eat Apple" and "I want to eat Mango" place Apple and Mango in the same company, so the frequency of a word's company hints at its meaning.)
LLM: Statistical and Distributional Semantics
(Sketch: king, queen, man, and woman as points in a vector space.)
Statistical and Distributional Semantics
● Statistical Semantics: Words have positional connections with one another.
“a word is characterized by the company it keeps”
● Distributional Semantics: Words with similar meanings should be close.
“linguistic items with similar distributions have similar meanings”
LLM: Cosine Law: Measuring the Distance between two Words
(Sketch: two word vectors and the angle between them.)
LLM: Distributional Semantics: King - Man + Woman ≈ Queen
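The famous analogy can be checked on hand-built toy vectors. The 2-d coordinates below are an assumption chosen so that one axis encodes gender and the other royalty; real embeddings learn such directions from data.

```python
import numpy as np

# Toy 2-d embeddings: axis 0 = maleness, axis 1 = royalty (hand-chosen).
vec = {
    "king":  np.array([1.0, 1.0]),
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([0.0, 0.0]),
    "queen": np.array([0.0, 1.0]),
}

target = vec["king"] - vec["man"] + vec["woman"]   # remove male, keep royal
nearest = min(vec, key=lambda w: np.linalg.norm(vec[w] - target))
print(nearest)   # queen
```

Subtracting "man" removes the gender direction from "king"; adding "woman" lands exactly on "queen" in this toy space.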
LLM: Language has long context length. Hence position is important.
(Sketch: each word carries both a meaning and a position: 1st, 2nd, ….)
LLM: Attention Mechanism solves Distributional Semantics
(Sketch: the meaning of a word like "Apple" is resolved by weighting the meanings of the other words in its context.)
https://mlspring.beehiiv.com/ https://www.columbia.edu/~jsl2239/transformers.html
LLM: Attention Mechanism solves Distributional Semantics
(Sketch: each token emits a query; the query is matched against the keys of all tokens, giving an N×N score matrix that weights the values.)
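The query/key/value sketch becomes a few lines of numpy. This is a minimal single-head self-attention; the sequence length, width, and random weights are assumptions of the sketch (no causal mask yet).

```python
import numpy as np

np.random.seed(0)
N, d = 4, 8                        # N tokens, model width d (assumed)
x = np.random.randn(N, d)          # token embeddings

# Learned projections (random stand-ins here).
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv   # queries, keys, values: each (N, d)

# Match every query against every key: an (N, N) score matrix.
scores = Q @ K.T / np.sqrt(d)

# Softmax over keys turns scores into attention weights (rows sum to 1).
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

out = weights @ V                  # each token is a weighted mix of values
print(weights.shape, out.shape)    # (4, 4) (4, 8)
```

The entire computation is matrix multiplication, which is why attention parallelizes so well, a point the deck returns to when comparing against convolutions and RNNs.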
LLM: Position Encoding solves Statistical Semantics
(Sketch: each token embedding combines its meaning vector with its position: 1, 2, 3, 4, ….)
https://newsletter.theaiedge.io/
LLM: Position Encoding is important (asymmetric)!
https://newsletter.theaiedge.io/
LLM: Linearity of Positional Encoding for Relative Encoding
https://cs.brown.edu/courses/cs146/assets/files/linearity.pdf
LLM: Why Convolution and RNN/LSTM couldn’t solve it?
(Sketch: the receptive field of a convolution.)
● Convolution -> needs many layers for the receptive field to cover a long context.
● RNN/LSTM -> sequential, so it takes a lot of time.
● Attention solves both problems by parallel computation: its core operation is matrix multiplication, which parallelizes well.
LLM: Deep Learning Process
θ : unknown parameters
Neural Network
x (tensor) -> f(x,θ) -> y (tensor)
● Goal: Given data (x1,y1), (x2,y2), …., (xn,yn), how do we find the best
parameters θ, such that f(xi,θ) is as close as possible to yi for all i?
LLM: Deep Learning Process
θ : unknown parameters
Neural Network
input xi -> f(xi,θ) (predicted) vs yi (true value)
Lossi = Loss(f(xi,θ), yi)  (we want this difference to be small for all i)
Loss(θ) = sum over i: Lossi
● Solution: Given data (x1,y1), (x2,y2), …., (xn,yn), find the parameters θ, such
that Loss(θ) = sum over i: Lossi is minimized!
LLM: Gradient Descent over θ
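Gradient descent repeatedly nudges θ against the gradient of Loss(θ). A toy sketch: fit f(x, θ) = θ·x to data whose true parameter is 2, using a squared loss. The learning rate and step count are arbitrary choices for the sketch.

```python
import numpy as np

xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = 2.0 * xs                    # data generated with true theta = 2

theta, lr = 0.0, 0.01            # initial guess and learning rate (assumed)
for _ in range(200):
    pred = theta * xs            # f(xi, theta) for all i
    # Gradient of mean squared error: d/dtheta mean((pred - y)^2)
    grad = 2 * np.mean((pred - ys) * xs)
    theta -= lr * grad           # step downhill

print(round(theta, 3))           # 2.0
```

Real LLM training is the same loop at scale: the loss is cross-entropy over next-token predictions, the gradient comes from backpropagation, and θ has billions of entries.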
LLM: Changing Data into Train - Test
Satyajit Ray (Bengali pronunciation: [ˈʃotːodʒit ˈrae̯] ⓘ; 2 May 1921 – 23 April 1992)
was an Indian director, screenwriter, documentary filmmaker, author, essayist,
lyricist, magazine editor, illustrator, calligrapher, and composer. Ray is widely
considered one of the greatest and most influential film directors in the history of
cinema.[7][8][9][10][11] He is celebrated for works including The Apu Trilogy
(1955–1959),[12] The Music Room (1958), The Big City (1963) and Charulata
(1964) and the Goopy–Bagha trilogy.
Ray was born in Calcutta to author Sukumar Ray. Starting his career as a
commercial artist, Ray was drawn into independent film-making after meeting
French filmmaker Jean Renoir and viewing Vittorio De Sica's Italian neorealist film
Bicycle Thieves (1948) during a visit to London.
Ray directed 36 films, including feature films, documentaries, and shorts. Ray's first
film, Pather Panchali (1955) won eleven international prizes, including the inaugural
Best Human Document award at the 1956 Cannes Film Festival. This film, along
with Aparajito (1956) and Apur Sansar (The World of Apu) (1959), form The Apu
Trilogy. Ray did the scripting, casting, scoring, and editing, and designed his own
credit titles and publicity material. He also authored several short stories and
novels, primarily for young children and teenagers. Popular characters created by
Ray include Feluda the sleuth, Professor Shonku the scientist, Tarini Khuro the
storyteller, and Lalmohan Ganguly the novelist.
Ray received many major awards in his career, including a record thirty-six Indian
National Film Awards, a Golden Lion, a Golden Bear, two Silver Bears, many
additional awards at international film festivals and ceremonies, and an Academy
Honorary Award in 1992. In 1978, he was awarded an honorary degree by Oxford
University. The Government of India honoured him with the Bharat Ratna, its
highest civilian award in 1992. On the occasion of the birth centenary of Ray, the
International Film Festival of India, in recognition of the auteur's legacy,
rechristened in 2021 its annual Lifetime Achievement award to "Satyajit Ray
Lifetime Achievement Award".
● What is the (x, y) to train on?
● How do we infer (predict)?
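The standard answer, sketched below: slide a fixed-size window over the tokenized text, and let y be x shifted one token ahead, so the model learns next-token prediction. The window size (block_size) is an assumption; the sample sentence is taken from the passage above.

```python
# Turn raw text into (x, y) training pairs for next-token prediction.
text = "Ray directed 36 films , including feature films , documentaries , and shorts"
tokens = text.split()            # crude word-wise tokenization for the sketch

block_size = 4                   # context length (assumed)
pairs = [
    (tokens[i : i + block_size], tokens[i + 1 : i + block_size + 1])
    for i in range(len(tokens) - block_size)
]

x0, y0 = pairs[0]
print(x0)   # ['Ray', 'directed', '36', 'films']
print(y0)   # ['directed', '36', 'films', ',']
```

Every position in y is the "next token" for the corresponding prefix of x, so a single window yields block_size training signals at once. At inference time, the model is fed a prompt and its own predictions are appended one token at a time.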
LLM: Changing Data into Train - Test
https://www.columbia.edu/~jsl2239/transformers.html
The GPT2 Model
LLM: Data Preprocessing
LLM: Experimental Setup
LLM: GPT Model
LLM: Training
LLM: Evaluation
LLM: GPT Architecture: Single Head
LLM: GPT Architecture: Multi Head
LLM: GPT Architecture: Transformer Block
LLM: GPT2 Architecture
Inspiration and References
Andrej Karpathy
Andrej Karpathy
Sebastian Raschka