Transformer Lecture Notes

These notes discuss the architecture and functioning of transformers, focusing on self-attention mechanisms and motivating learned representations through an analogy with movie recommender systems. They emphasize learning feature representations automatically instead of annotating them manually, explain the role of positional embeddings in preserving word order, and cover the normalization and scaling of dot products as well as multi-head attention.


Reference: www.peterbloem.nl blog (transformers)

Self-attention: a sequence-to-sequence operation.

Input: a sequence of vectors x_1, x_2, ..., x_t, each a k-dimensional vector.
Output: a sequence of vectors y_1, y_2, ..., y_t, also k-dimensional.

Each output is a weighted sum over all the inputs:

    y_i = Σ_j w_ij x_j        (sum over all j = 1 .. t)
Just as the inputs can be stacked into a matrix, the weights w_ij form a t×t matrix. Each y_i is nothing but a weighted average over all the input vectors, with weights that sum to 1.

One can say that the x_j's act as a basis: the space spanned by them holds the y_i's, since every output is a combination of the inputs.
Where does this weight matrix come from? The simplest choice is the dot product of the two input vectors:

    w'_ij = x_i^T x_j

Its value can range from -inf to +inf; hence, in order to normalize it, apply a softmax over j:

    w_ij = exp(w'_ij) / Σ_j exp(w'_ij)

[Diagram: every output y_1 ... y_t attends to every input x_1 ... x_t through the weights w_ij.]
Self-attention is the only operation in the transformer that propagates information between the vectors. Other than self-attention, all operations are applied to the individual input vectors.
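A minimal NumPy sketch of this basic self-attention (function names and toy shapes are illustrative, not from the notes):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def basic_self_attention(x):
    """x: (t, k) sequence of t input vectors, each k-dimensional."""
    w_raw = x @ x.T               # raw weights w'_ij = x_i . x_j, shape (t, t)
    w = softmax(w_raw, axis=1)    # normalize each row so the weights sum to 1
    return w @ x                  # y_i = sum_j w_ij x_j, shape (t, k)

x = np.random.randn(4, 8)         # t = 4 vectors, k = 8 dimensions
print(basic_self_attention(x).shape)   # (4, 8)
```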
Problem: movie recommender system.

Annotate each movie with a feature vector m (e.g., how much romance, action, comedy it contains) and each user with a feature vector u (how much they like romance, action, comedy). The match score is the dot product u^T m: if the user likes romance and the movie is a romance, the score gets boosted.

Basically, each movie can be seen as a feature vector, and so can each user; the dot product (a similarity) is being computed between them.
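A tiny illustration of that dot-product score, with hand-annotated (made-up) feature values:

```python
import numpy as np

# hand-annotated features: [romance, action, comedy]
user  = np.array([ 1.0, -0.5,  0.2])   # likes romance, dislikes action
movie = np.array([ 0.9, -0.8,  0.1])   # a romance with little action

score = user @ movie                   # dot product = match score
print(score)                           # positive: the features agree, so the score is boosted
```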
But such feature annotation needs to be done manually; for millions of movies (and users), hand extraction of features is impractical.

Instead of manual annotation, we let the features be learned automatically: fix the hyperparameters (e.g., the number of features) and learn these parameters from data; both the movie and the user feature vectors need to be learned. This takes the problem towards supervised learning: take a (possibly small) dataset with annotations (ratings or class labels) and learn the features that best explain it. This can be seen as learning the best representation.


The same idea applies to words. For the sentence "the cat walks on the street", assign each word an embedding vector:

    v_the, v_cat, v_walks, v_on, v_the, v_street

These vectors are fed into self-attention, which produces outputs

    y_the, y_cat, y_walks, y_on, y_the, y_street

We need to learn a word embedding for every word; similar words should get similar embeddings. But "relatedness" completely depends upon the application, so the embeddings have to be learned for the task at hand. Here relatedness is approximated using the dot product: the dot product tells us how much two vectors are related, and each output is a weighted sum over the whole input sequence with weights determined by these dot products.

Note that, so far, no parameters are learned for self-attention itself. Only the word embeddings are learned, and they are used to get the relatedness (via the dot product).
Self-attention sees its input as a set, NOT as a sequence, even though the input is sequenced. This means that the order of the inputs does not matter: if we permute the input vectors, the outputs are the same up to the same permutation. Self-attention is permutation equivariant.
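A quick numerical check of permutation equivariance (a self-contained sketch with illustrative names):

```python
import numpy as np

def sa(x):
    """Basic self-attention: softmax-normalized dot-product weights."""
    w = np.exp(x @ x.T - (x @ x.T).max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)
    return w @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
perm = rng.permutation(5)

# permuting the inputs just permutes the outputs
print(np.allclose(sa(x)[perm], sa(x[perm])))   # True
```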
Now we introduce additional weights to be learned. For performance, modern transformers add two tricks.

Trick 01: Queries, Keys and Values.
One can see that in the basic self-attention above, every input vector x_i is used in three different ways:

1. It is compared with every other vector to establish the weights for its own output y_i: this role is the query.
2. It is compared with every other vector to establish the weights w_ij for the outputs y_j of the other vectors: this role is the key, used to check relatedness and produce the weights.
3. It is used inside the weighted sum that actually produces each output vector: this role is the value.

In other words: as a query, x_i is compared for relatedness (via the dot product) with all the other vectors in order to establish the weights for its own output; in order to participate in the generation of the other outputs, x_i is used as a key and is again compared with all the queries to get its weighting contribution; and finally x_i is also converted into a value vector, which is multiplied with the weights to compute each of y_1, ..., y_t.

[Diagram: x_i, acting as the query q_i, is compared for relatedness (dot product) with the keys of all the inputs x_1, ..., x_t. Once all the relatedness scores of x_i with every key are computed, they give the weights w_i1, w_i2, ..., and the weighted sum of the values produces the output y_i.]
The weights can be arranged in a matrix [w_ij]: along a row (horizontally), x_i acts as the query; along a column (vertically), x_j acts as the key. For example, the fourth row holds w_41, w_42, w_43, w_44, the weights that produce y_4.
    y_1 = w_11 x_1 + w_12 x_2 + w_13 x_3 + w_14 x_4
    y_2 = w_21 x_1 + w_22 x_2 + w_23 x_3 + w_24 x_4
    y_3 = w_31 x_1 + w_32 x_2 + w_33 x_3 + w_34 x_4
    y_4 = w_41 x_1 + w_42 x_2 + w_43 x_3 + w_44 x_4

Each weight w_ij pairs the query x_i with the key x_j, and x_j enters the sum as a value: its contribution to the generation of y_i.

In general we can write

    y_i = Σ_j w_ij x_j,    where w_ij is the contribution of x_j in the generation of y_i.

To let x_i play its three roles, we introduce three weight matrices W_q, W_k, W_v. These are all k×k matrices, and each can be seen as a linear transformation of the k-dimensional input vector x_i, from R^k to some other k-dimensional transformed vector. These transformations need to be learned, so that any input x_i can adapt to play its different roles appropriately inside the self-attention module.


These W_q, W_k, W_v can be used in many different ways to realize the attention. One such way is:

    q_i = W_q x_i        (query)
    k_i = W_k x_i        (key)
    v_i = W_v x_i        (value)

    w'_ij = q_i^T k_j               query(x_i) · key(x_j)
    w_ij  = softmax_j(w'_ij)
    y_i   = Σ_j w_ij v_j            weighted sum of the values

The self-attention layer is thereby parameterized by W_q, W_k, W_v.
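A minimal NumPy sketch of this parameterized self-attention (single head, no scaling yet; all names and shapes are illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """x: (t, k). Wq, Wk, Wv: (k, k) learned projection matrices."""
    q = x @ Wq                     # queries: a learned linear map of each x_i
    k = x @ Wk                     # keys
    v = x @ Wv                     # values
    w = softmax(q @ k.T, axis=1)   # w_ij = softmax_j(q_i . k_j)
    return w @ v                   # y_i = sum_j w_ij v_j

t, kdim = 4, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(t, kdim))
Wq, Wk, Wv = (rng.normal(size=(kdim, kdim)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)   # (4, 8)
```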
Trick 02: Scaling the dot product.

    w'_ij = (q_i^T k_j) / sqrt(k)

The average magnitude of the dot product keeps growing with the embedding dimension k. The raw weights are then fed to a softmax, which exponentiates them, so this growth can hurt training (the softmax saturates). Hence we need to normalize. Since our input vectors live in R^k, if every entry is roughly some constant c, the Euclidean norm is c·sqrt(k), i.e. it grows as sqrt(k). Hence we normalize the dot product by sqrt(k).
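A quick empirical check of why sqrt(k) is the right scale (random vectors with unit-variance entries; purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
for k in (16, 64, 256, 1024):
    q   = rng.normal(size=(10000, k))
    key = rng.normal(size=(10000, k))
    dots = (q * key).sum(axis=1)
    # the std of the dot product grows like sqrt(k); dividing by sqrt(k) keeps it ~1
    print(k, dots.std().round(1), (dots / np.sqrt(k)).std().round(2))
```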
Multi-head attention.

Case 1: "Teacher gives homework to student" vs. "Student gives homework to teacher." The same words appear, but who gives and who receives differs. With a single self-attention head, all this relational information just gets summed up (averaged) into one weighted sum, even though the same word means different things in different neighbourhoods/contexts. Adding more and more heads addresses this: different heads can attend to different relations rather than blending them into one average.
Formally: run R self-attention heads in parallel, head r with its own matrices W_q^r, W_k^r, W_v^r, giving outputs y_i^1, y_i^2, ..., y_i^R for each position i. Concatenate these per-position outputs and apply a linear transformation to map them back down to k dimensions.

[Diagram: inputs x_1, x_2, x_3 feed R attention heads; the per-head outputs are concatenated and projected to give y_i.]
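A compact NumPy sketch of multi-head self-attention (wide heads: each head sees the full k-dimensional input; the final projection matrix Wo is the usual way to combine heads and is an assumption here):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo):
    """x: (t, k). Wq, Wk, Wv: (R, k, k), one projection per head. Wo: (R*k, k)."""
    t, k = x.shape
    heads = []
    for r in range(Wq.shape[0]):
        q, key, v = x @ Wq[r], x @ Wk[r], x @ Wv[r]
        w = softmax(q @ key.T / np.sqrt(k), axis=1)   # scaled dot-product weights
        heads.append(w @ v)                           # (t, k) output of head r
    return np.concatenate(heads, axis=1) @ Wo         # concat heads, project back to k

R, t, k = 4, 6, 16
rng = np.random.default_rng(0)
x = rng.normal(size=(t, k))
Wq, Wk, Wv = (rng.normal(size=(R, k, k)) for _ in range(3))
Wo = rng.normal(size=(R * k, k))
print(multi_head_attention(x, Wq, Wk, Wv, Wo).shape)  # (6, 16)
```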
Narrow and wide self-attention.

Wide self-attention: each head applies its W_q^r, W_k^r, W_v^r to the whole input vector (e.g., for input dimension 256, every head uses 256×256 projections).
Narrow self-attention: the input vector (of dimension 256, say) is cut into h chunks, and each head works only on its own chunk of 256/h dimensions; this is cheaper, at the cost of each head seeing less of the input.

[Diagram: the input vector split into per-head chunks (narrow) vs. every head receiving the full 256-dimensional vector (wide).]

The transformer is an architecture that can process a set of units (a sequence of tokens, pixels, ...) where the units can only interact via self-attention, applied repeatedly.

Transformer block:
- self-attention layer
- layer normalization
- feed-forward layer: a single-hidden-layer MLP applied to each vector independently
- another layer normalization
- residual connections around the self-attention and the MLP

Normalization and residual learning are standard tricks to train deep networks.
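A sketch of one such block in NumPy. Layer norm and the MLP are written out by hand; the exact ordering of norms and residuals in the notes' diagram is hard to read, so this uses one common (post-norm) arrangement as an assumption:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    w = softmax(q @ k.T / np.sqrt(x.shape[-1]), axis=1)
    return w @ v

def transformer_block(x, Wq, Wk, Wv, W1, b1, W2, b2):
    """x: (t, k); W1: (k, 4k); W2: (4k, k)."""
    x = layer_norm(x + self_attention(x, Wq, Wk, Wv))   # attention + residual + norm
    h = np.maximum(0.0, x @ W1 + b1)                    # per-vector MLP (ReLU hidden layer)
    return layer_norm(x + h @ W2 + b2)                  # MLP + residual + norm
```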

Designing a classification transformer.

Task: IMDb sentiment classification. The dataset consists of movie reviews tokenized into words, each labelled positive or negative: binary classification.

Transformer-based classification architecture: input sequence → word embedding + position embedding → a stack of transformer blocks → pooling over the output sequence → classification layer.

The word embeddings are learned along with the classifier; the positional encoding can either be learned as well or computed by some fixed (possibly complex) function of the position.
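A skeleton of that pipeline (shapes only; the average-pooling-then-linear head is the usual choice for sequence classification and is an assumption about the notes' diagram):

```python
import numpy as np

def classify(tokens, word_emb, pos_emb, blocks, W_cls, b_cls):
    """tokens: (t,) integer ids. word_emb: (V, k). pos_emb: (max_len, k).
    blocks: list of callables mapping (t, k) -> (t, k). W_cls: (k, 2)."""
    t = len(tokens)
    x = word_emb[tokens] + pos_emb[:t]      # word embedding + position embedding
    for block in blocks:                    # stack of transformer blocks
        x = block(x)
    pooled = x.mean(axis=0)                 # average over the sequence
    logits = pooled @ W_cls + b_cls         # two classes: negative / positive
    return np.exp(logits) / np.exp(logits).sum()
```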
Requirement of positional embeddings.

The position of any word w in a sentence s plays a very crucial role: a sentence can be seen as a complex function of its words and of their positions.

An RNN consumes its input one timestep at a time, so the position of each word is implicitly encoded in the order of processing. Self-attention instead processes the whole input at once, which is massively parallel and fast, but in the process it loses the positional information of every word: its encodings are permutation invariant/equivariant over w_1, w_2, ..., w_T. So the positions have to be injected explicitly into the input.
Some basic options are:

1. Integer (decimal) encoding for each t: t = 0, 1, 2, 3, ..., 10,000, ... (some big number). Unbounded.
2. Normalize t to lie between 0 and 1. Bounded, but it does not generalize: each input has a different length, so the same position gets a different value in different sentences, and sentences at test time may be longer than any seen during training.

Basic criteria for an encoding:

1. The position of a word in any sentence must be unique: unique encodings per timestep.
2. The distance between two timesteps Δt must be consistent across sequences of all lengths.
3. The encoding needs to be bounded and to generalize well to longer sequences.
4. It has to be deterministic and fixed for any given input; it should not be a random function.
Hence, for each timestep t in any input sentence s, we require a fixed function f: N → R^d, where d is the dimension of the positional encoding. The i-th element of the encoding PE(t) of time t is built from sinusoids:

    PE(t, i) = sin(w_k t)    when i = 2k        (i even)
    PE(t, i) = cos(w_k t)    when i = 2k + 1    (i odd)

where i ranges over the d dimensions and k indexes the frequencies, k = 1, 2, ..., d/2. Each frequency w_k encodes two dimensions, one with sin(w_k t) and one with cos(w_k t); therefore a total of d/2 frequencies are used to encode the d dimensions.
Q1: Why do you require more than one sinusoid function?

Any one sinusoid can give us unique values only over part of its period; after that the values keep repeating. If, instead of a single sinusoid, one uses two sinusoids at the same frequency, sin(wt) and cos(wt), their combination produces a unique encoding for every timestep over the full time period T = 2π/w without repeating, at the price of a 2-D encoding. Adding more sinusoids (at lower frequencies) increases the overall time period without the sequence of encodings repeating.

As suggested in the paper [1]:

    w_k = 1 / 10000^(2k/d)

which gives, for the time periods T_k = 2π / w_k:

    k = 0:      w_0 = 1,                  T_0 = 2π
    k = 1:      w_1 = 1/10000^(2/d),      T_1 = 2π · 10000^(2/d)
    k = 2:      w_2 = 1/10000^(4/d),      T_2 = 2π · 10000^(4/d)
    ...
    k = d/2:    w_{d/2} = 1/10000,        T_{d/2} = 2π · 10000

So the time periods of the sinusoids increase systematically from 2π up to 2π·10000: the sequence T_0, T_1, ..., T_{d/2} is a geometric progression with ratio 10000^(2/d).

The d-dimensional encoding as a whole only repeats after LCM(T_0, T_1, ..., T_{d/2}) ≥ T_{d/2} = 2π·10000 (when each period divides the next, as in a geometric progression with integer ratio, the LCM is exactly the final term).

Example of combining periods: signals with periods 3, 4 and 2 jointly repeat only after LCM(3, 4, 2) = 12 steps:

    1 2 3 1 2 3 1 2 3 1 2 3 ...
    1 2 3 4 1 2 3 4 1 2 3 4 ...
    1 2 1 2 1 2 1 2 1 2 1 2 ...
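A sketch of these encodings and their periods in NumPy (d and the positions are example values):

```python
import numpy as np

def positional_encoding(max_len, d):
    """Sinusoidal PE: rows are positions t, columns are the d dimensions."""
    t = np.arange(max_len)[:, None]              # (max_len, 1)
    k = np.arange(d // 2)[None, :]               # frequency index
    w = 1.0 / 10000 ** (2 * k / d)               # w_k = 10000^(-2k/d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(w * t)                  # even dimensions: sin(w_k t)
    pe[:, 1::2] = np.cos(w * t)                  # odd dimensions:  cos(w_k t)
    return pe

d = 16
pe = positional_encoding(50, d)
periods = 2 * np.pi * 10000 ** (2 * np.arange(d // 2) / d)
print(pe.shape)           # (50, 16)
print(periods.round(1))   # geometric progression from 2*pi towards 2*pi*10000
```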

Q2: Why divide the range 2π ... 2π·10000 into d/2 frequencies, when even the LCM of just a couple of periods in that range is already ≥ 2π·10000? Instead of taking a few frequencies, why take d/2 of them? Before answering that, let us see how these positional encodings are used.

For any word, say w_t, in any sentence (w_1, ..., w_n), the final embedding fed to the network is

    e(w_t) = WE(w_t) + PE(t)        (word embedding + position embedding)

In order to make this summation possible, the dimensions of the word embeddings and the position embeddings are set equal. They could also be concatenated instead of added (concatenation is arguably more intuitive); a few reasons support addition:

1. It requires less memory than concatenation.
2. With the maximum timescale of 10,000, the encoding only repeats after about 2π·10,000 positions, so no word in a realistic sequence gets the same encoding as the first word.
3. For any sequence of a few hundred tokens with, say, 512-dimensional word embeddings, most of the position-embedding dimensions barely vary and are dominated by the word embedding (WE).
4. Since the WE is trained from scratch in the transformer, the model can learn not to encode important information in the lower dimensions of the WE (where the PE varies most); for smaller sequences, addition then behaves almost like concatenation.
In order to learn the attention between two given embeddings, a primary one x and a secondary one y say, the transformer passes them through a query transformation matrix Q and a key transformation matrix K. The similarity between Qx and Ky, computed using the dot product, is taken as the attention:

    attention(x, y) = (Qx)^T (Ky) = x^T Q^T K y

i.e., how much attention should we pay to word x given word y? The key matrix transforms the secondary input y into a space where it can be compared directly with the transformed query.

Now add the positional encodings, e for x and f for y:

    (Q(x + e))^T (K(y + f)) = x^T Q^T K y + x^T Q^T K f + e^T Q^T K y + e^T Q^T K f

Treating W = Q^T K as a single transform:

    (x + e)^T W (y + f) = x^T W y + x^T W f + e^T W y + e^T W f

The four terms are: attention of word x to word y; attention of word x to the position of y; attention of the position of x to word y; attention of the position of x to the position of y. Hence, when W is learned with added positional encodings, the word embeddings and position embeddings have to serve these four tasks simultaneously.

Concatenation would give explicit orthogonality (separation) between the word subspace and the position subspace, but since both embeddings are high-dimensional, approximate orthogonality is naturally achieved even without concatenation; concatenation would buy the separation at the cost of more parameters.

Q2 (answer): Since we are going to add WE and PE, we need to make their dimensions the same, which is why the PE uses d/2 frequencies to fill all d dimensions. And, as noted above, for any sequence of a few hundred tokens with 512-dimensional word embeddings, most of the position-embedding dimensions barely vary and are dominated by the word embedding (WE).
[Plot: positional-encoding values plotted against the embedding dimension (x-axis: embedding dimension, roughly 0 to 500).]

Q3: How does it address relative positions?

The PEs for t and t + Δt need to be consistently related: for a fixed offset Δt, PE(pos) and PE(pos + Δt) should be linearly related, independently of pos. As we know, for each frequency w_k the encoding has a sin(w_k t) and a cos(w_k t) component.

Let us assume a linear transformation M ∈ R^{2×2}, independent of t, such that

    M [sin(w_k t), cos(w_k t)]^T = [sin(w_k (t + Δt)), cos(w_k (t + Δt))]^T

This ensures that the PE at t + Δt for frequency w_k is linearly related to that at t, and that M depends only on Δt, not on t.

Let us compute such an M. Writing M = [[u_1, v_1], [u_2, v_2]]:

    u_1 sin(w_k t) + v_1 cos(w_k t) = sin(w_k (t + Δt)) = sin(w_k t) cos(w_k Δt) + cos(w_k t) sin(w_k Δt)
    u_2 sin(w_k t) + v_2 cos(w_k t) = cos(w_k (t + Δt)) = cos(w_k t) cos(w_k Δt) − sin(w_k t) sin(w_k Δt)

Matching coefficients:

    u_1 = cos(w_k Δt),    v_1 = sin(w_k Δt)
    u_2 = −sin(w_k Δt),   v_2 = cos(w_k Δt)

    M_Δt = [[cos(w_k Δt), sin(w_k Δt)], [−sin(w_k Δt), cos(w_k Δt)]]

Similarly we get such an M_Δt (a rotation) for every k. Hence a single linear transformation can relate PE(t) and PE(t + Δt) for all t, which makes it possible for the model to learn to attend to relative positions.
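A quick numerical check of this rotation property (the values of w_k, t and Δt are arbitrary for the example):

```python
import numpy as np

w, t, dt = 0.3, 7.0, 5.0                       # one frequency, a position, an offset
M = np.array([[ np.cos(w * dt), np.sin(w * dt)],
              [-np.sin(w * dt), np.cos(w * dt)]])
pe_t   = np.array([np.sin(w * t), np.cos(w * t)])
pe_tdt = np.array([np.sin(w * (t + dt)), np.cos(w * (t + dt))])
print(np.allclose(M @ pe_t, pe_tdt))           # True: M depends only on dt, not on t
```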
[Plot: pairwise dot products / distances between the PEs of all time steps 0 to 60. The diagonal is zero and the off-diagonal entries are symmetric; the distance between neighbouring time steps is consistent and decays smoothly with temporal separation.]
Recap: the transformer architecture uses self-attention. Its units can process tokens, sequences, sets, or pixels, and the units can interact only via self-attention.

[Transformer block diagram: inputs x_1 ... x_t → self-attention → residual connection and layer norm → MLP applied independently to each vector → residual connection and layer norm → outputs y_1 ... y_t.]
How to generate text using transformers?

Each output position depends upon the entire input, so next-character prediction becomes trivial: the model can simply look at the future characters it is supposed to predict. Hence, mask the future data: the attention is only allowed to attend to previous positions.
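A sketch of such a causal mask applied to the raw attention weights (NumPy; the mask sets future positions to −∞ before the softmax):

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """x: (t, k). Each position may only attend to itself and earlier positions."""
    t, k = x.shape
    q, key, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ key.T / np.sqrt(k)
    mask = np.triu(np.ones((t, t), dtype=bool), k=1)   # True above the diagonal = future
    scores = np.where(mask, -np.inf, scores)           # future data gets -inf
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)
    return w @ v
```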
