
LLM Crash Course

Srijit Mukherjee
LLM Crash Course Content

Build your Foundations (Day 1)
● Mathematics of Deep Learning
● Problem Statement
● Language Processing

Build LLM from Scratch (Day 2)
● GPT Architecture
● Experimental Setup
● Data Processing


LLM: Problem Statement

Input (x) → Output (y)
LLM: Problem Statement

θ : unknown parameters

Input (x) → Neural Network f(x,θ) → Output (y)
LLM: Problem Statement

θ : unknown parameters

x = [x1, x2, x3, …, xn] → Neural Network f(x,θ) → y = [y1, y2, y3, …, ym]

Tokenization turns a sequence of characters into a sequence of tokens (e.g., words).
LLM: Problem Statement

● Given a text “x”, how do we define [x1, x2, …, xn] based on x?
  ○ We answer this in detail in Tokenization.
  ○ However, here are some examples, with a sketch after this list.

● Assume x = “I love large language models.”
  ○ x = [‘I’, ‘love’, ‘large’, ‘language’, ‘models’, ‘.’] (word-wise)
  ○ x = [‘I’, ‘l’, ‘o’, ‘v’, ‘e’, ‘l’, ‘a’, …, ‘e’, ‘l’, ‘s’, ‘.’] (character-wise)
  ○ There can be other ways too. What are they?
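A minimal sketch of the two tokenizations above, in plain Python (the regex split for punctuation is one illustrative choice, not the only one):

```python
import re

x = "I love large language models."

# Word-wise: keep runs of word characters, and punctuation as its own token.
word_tokens = re.findall(r"\w+|[^\w\s]", x)
print(word_tokens)  # ['I', 'love', 'large', 'language', 'models', '.']

# Character-wise: every character becomes a token.
char_tokens = list(x)
print(char_tokens)  # ['I', ' ', 'l', 'o', 'v', 'e', ' ', 'l', 'a', ...]
```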

● The question is similar to a stock market prediction problem.
  ○ If you know the prices of the last 100 days, predict the prices of the next 5 days.
  ○ Stock market prediction is more random and therefore harder.
  ○ Language prediction is actually not that hard - it is an easier problem.
LLM: Revising the Problem Statement

θ : unknown parameters

x = [x1, x2, x3, …, xn] → Neural Network f(x,θ) → y = [y1, y2, y3, …, ym]
LLM: Deep Learning Process

θ : unknown parameters

x (tensor) → Neural Network f(x,θ) → y (tensor)

● Goal: Given data (x1,y1), (x2,y2), …, (xn,yn), how do we find the best
parameters θ, such that f(xi,θ) is as close as possible to yi for all i?
Important Questions

● We have a large text. How do we divide it into input–output pairs (xi, yi)?
● In a time series, the samples (xi, yi) are independent of each other. But in
language, the tokens are dependent on each other.
LLM: Tensors are the math of data.

Vector -> Matrix -> Tensor. An image is a tensor.

● Everything is a tensor. An image is a tensor. Stock market data is a tensor.
● Images have a natural mathematical (intensity-based) representation as a tensor of
dimension (C,H,W), where C = channels, H = height, W = width.
● Important Question: How is language data a tensor? (See the sketch below.)
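A minimal sketch of data as tensors (assumes PyTorch is installed; the shapes are illustrative):

```python
import torch

image = torch.rand(3, 224, 224)      # (C, H, W): 3 channels, 224x224 pixels
prices = torch.rand(100)             # 100 days of stock prices: a 1-D tensor
batch = torch.rand(32, 3, 224, 224)  # a batch of images adds a leading dimension

print(image.shape, prices.shape, batch.shape)
```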
[Hand-drawn slide: mapping words to symbol codes (e.g., ASCII values) is not enough;
language carries meaning and logical/frequency connections beyond symbols. Words like
“Apple” and “Mango” appear in similar contexts: “I want to eat Apple” / “I want to eat Mango”.]
LLM: Statistical and Distributional Semantics

[Figure: word vectors for king, man, queen, woman]

Statistical and Distributional Semantics

● Statistical Semantics: Words have positional connections with one another.
“a word is characterized by the company it keeps”
● Distributional Semantics: Words with similar meanings should be close.
“linguistic items with similar distributions have similar meanings”
LLM: Cosine Law: Measuring the Distance between two Words

[Figure: each word is a vector; the cosine of the angle between two word vectors
measures how similar the words are.]
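A minimal sketch of cosine similarity between two word vectors, using NumPy (the toy 3-D vectors are made up for illustration; real embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(angle) = (u · v) / (|u| |v|); 1 = same direction, 0 = orthogonal
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

king = np.array([0.8, 0.6, 0.1])
queen = np.array([0.7, 0.7, 0.2])
print(cosine_similarity(king, queen))  # close to 1 for similar words
```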
LLM: Distributional Semantics: King - Man + Woman ≈ Queen

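A minimal sketch of the analogy king - man + woman ≈ queen, with made-up 3-D embeddings (real embeddings such as word2vec have hundreds of dimensions):

```python
import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.1]),
    "queen": np.array([0.1, 0.8, 0.1]),
}

target = emb["king"] - emb["man"] + emb["woman"]
# the stored vector nearest to the target is "queen"
nearest = min(emb, key=lambda w: np.linalg.norm(emb[w] - target))
print(nearest)  # queen
```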
LLM: Language has a long context length. Hence position is important.

[Figure: each word carries both a meaning and a position (1st, 2nd, …) in the sequence.]
LLM: Attention Mechanism solves Distributional Semantics

[Figure: embeddings of the words in a sentence; attention mixes their meaning values.]

https://mlspring.beehiiv.com/ https://www.columbia.edu/~jsl2239/transformers.html
LLM: Attention Mechanism solves Distributional Semantics

[Hand-drawn slide: each query vector is matched against every key vector; the resulting
N×N matrix of match scores weights the value vectors.]

https://mlspring.beehiiv.com/ https://www.columbia.edu/~jsl2239/transformers.html
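A minimal sketch of single-head scaled dot-product attention in NumPy (shapes and random values are illustrative, not the slide's exact setup):

```python
import numpy as np

def attention(Q, K, V):
    # scores[i, j]: how well query i matches key j
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # softmax over keys turns scores into weights that sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # each output row is a weighted mixture of the value vectors
    return weights @ V

n, d = 4, 8                    # 4 tokens, 8-dimensional embeddings
X = np.random.randn(n, d)      # token embeddings
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
out = attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)               # (4, 8)
```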
LLM: Position Encoding solves Statistical Semantics

[Figure: each token's meaning embedding is combined with an encoding of its
position (1, 2, 3, 4, …) in the sequence.]

https://newsletter.theaiedge.io/
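A minimal sketch of the sinusoidal positional encoding from the original Transformer paper (the slide's figure may use a different scheme; max_len and d_model here are illustrative):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]        # (max_len, 1) positions
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2) dimension indices
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions: cosine
    return pe

print(positional_encoding(4, 8))  # one row per position, added to the embeddings
```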
LLM: Position Encoding is important (asymmetric)!

https://newsletter.theaiedge.io/
LLM: Linearity of Positional Encoding for Relative Encoding

https://cs.brown.edu/courses/cs146/assets/files/linearity.pdf
LLM: Why Convolution and RNN/LSTM couldn’t solve it?

[Figure: the receptive field of stacked convolutions grows slowly with depth.]

Convolution -> takes a lot of layers for the receptive field to cover a long context.
Sequential models (RNN/LSTM) -> take a lot of time, since tokens are processed one by one.

Attention solves it by parallel computation, because matrix multiplication is
parallel computation.
LLM: Deep Learning Process

θ : unknown parameters

x (tensor) → Neural Network f(x,θ) → y (tensor)

● Goal: Given data (x1,y1), (x2,y2), …., (xn,yn), how do we find the best
parameters θ, such that f(xi,θ) is as close as possible to yi for all i?
LLM: Deep Learning Process

θ : unknown parameters

input xi → Neural Network f(x,θ) → prediction f(xi,θ) vs. true value yi

Lossi = Loss(f(xi,θ), yi)   (we want this difference to be small for all i)

Loss(θ) = sum over i: Lossi

● Solution: Given data (x1,y1), (x2,y2), …, (xn,yn), find the parameters θ, such
that Loss(θ) = sum over i: Lossi is minimized!
LLM: Gradient Descent over θ
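This slide is figure-based; below is a minimal sketch of gradient descent on θ for a tiny linear model f(x,θ) = θ·x with squared-error loss (all values illustrative):

```python
import numpy as np

xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.0, 4.0, 6.0])   # true relation: y = 2x
theta = 0.0
lr = 0.05                        # learning rate

for step in range(100):
    preds = theta * xs
    # Loss(θ) = sum_i (f(xi,θ) - yi)^2, so dLoss/dθ = sum_i 2(f(xi,θ) - yi)·xi
    grad = np.sum(2 * (preds - ys) * xs)
    theta -= lr * grad           # step against the gradient

print(theta)                     # ≈ 2.0
```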
LLM: Changing Data into Train - Test

Satyajit Ray (Bengali pronunciation: [ˈʃotːodʒit ˈrae̯]; 2 May 1921 – 23 April 1992)
was an Indian director, screenwriter, documentary filmmaker, author, essayist,
lyricist, magazine editor, illustrator, calligrapher, and composer. Ray is widely
considered one of the greatest and most influential film directors in the history of
cinema.[7][8][9][10][11] He is celebrated for works including The Apu Trilogy
(1955–1959),[12] The Music Room (1958), The Big City (1963) and Charulata
(1964) and the Goopy–Bagha trilogy.

Ray was born in Calcutta to author Sukumar Ray. Starting his career as a
commercial artist, Ray was drawn into independent film-making after meeting
French filmmaker Jean Renoir and viewing Vittorio De Sica's Italian neorealist film
Bicycle Thieves (1948) during a visit to London.

Ray directed 36 films, including feature films, documentaries, and shorts. Ray's first
film, Pather Panchali (1955) won eleven international prizes, including the inaugural
Best Human Document award at the 1956 Cannes Film Festival. This film, along
with Aparajito (1956) and Apur Sansar (The World of Apu) (1959), form The Apu
Trilogy. Ray did the scripting, casting, scoring, and editing, and designed his own
credit titles and publicity material. He also authored several short stories and
novels, primarily for young children and teenagers. Popular characters created by
Ray include Feluda the sleuth, Professor Shonku the scientist, Tarini Khuro the
storyteller, and Lalmohan Ganguly the novelist.

Ray received many major awards in his career, including a record thirty-six Indian
National Film Awards, a Golden Lion, a Golden Bear, two Silver Bears, many
additional awards at international film festivals and ceremonies, and an Academy
Honorary Award in 1992. In 1978, he was awarded an honorary degree by Oxford
University. The Government of India honoured him with the Bharat Ratna, its
highest civilian award in 1992. On the occasion of the birth centenary of Ray, the
International Film Festival of India, in recognition of the auteur's legacy,
rechristened in 2021 its annual Lifetime Achievement award to "Satyajit Ray
Lifetime Achievement Award".

Given a raw text like this, what is the (x, y) to train the neural network on, and
how do we infer (predict)? A sketch follows below.
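A minimal sketch of turning raw text into (x, y) training pairs for next-token prediction, with a context window of 4 tokens (word-level tokenization here just for illustration):

```python
text = "Ray directed 36 films including feature films documentaries and shorts"
tokens = text.split()

context_len = 4
pairs = []
for i in range(len(tokens) - context_len):
    x = tokens[i : i + context_len]   # input: a window of consecutive tokens
    y = tokens[i + context_len]       # target: the very next token
    pairs.append((x, y))

for x, y in pairs[:2]:
    print(x, "->", y)
# ['Ray', 'directed', '36', 'films'] -> including
# ['directed', '36', 'films', 'including'] -> feature
```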


LLM: Changing Data into Train - Test

https://www.columbia.edu/~jsl2239/transformers.html

The GPT2 Model
LLM: Data Preprocessing
LLM: Experimental Setup
LLM: GPT Model
LLM: Training
LLM: Evaluation
LLM: GPT Architecture: Single Head
LLM: GPT Architecture: Multi Head
LLM: GPT Architecture: Transformer Block
LLM: GPT2 Architecture
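The architecture slides above are figure-based; here is a minimal PyTorch sketch of one GPT-style transformer block (pre-LayerNorm variant, as in GPT-2; the dimensions are illustrative, not the exact GPT-2 config):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # causal mask: position i may only attend to positions <= i
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                 # residual connection
        x = x + self.mlp(self.ln2(x))    # residual connection
        return x

block = TransformerBlock()
x = torch.randn(2, 10, 64)               # (batch, seq_len, d_model)
print(block(x).shape)                     # torch.Size([2, 10, 64])
```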
Inspiration and References

● Andrej Karpathy
● Sebastian Raschka
