Transformer Lecture Notes
(based on the peterbloem.nl blog)
Self-attention

The basic operation: the input is a sequence of vectors $x_1, x_2, \ldots, x_t$, each a $k$-dimensional vector, and the output is a sequence of vectors $y_1, y_2, \ldots, y_t$, also $k$-dimensional.
To produce the output $y_i$, self-attention takes a weighted sum over all the input vectors:

    $y_i = \sum_j w_{ij} x_j$

where $j$ indexes the whole sequence. Just like in a fully connected layer, there is a matrix of weights mapping the inputs onto the outputs.
However, this weight matrix is not a set of learned parameters. The raw weight $w'_{ij}$ is derived from how related $x_i$ and $x_j$ are, using the dot product:

    $w'_{ij} = x_i^\top x_j$

Its value can range from $-\infty$ to $+\infty$, hence in order to normalize it we apply a softmax over $j$:

    $w_{ij} = \frac{\exp(w'_{ij})}{\sum_j \exp(w'_{ij})}$

so that for each output the weights are positive and sum to one.
Self-attention is the only operation in the transformer that propagates information between positions; every other operation is applied to each vector independently.
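As a minimal sketch of this basic, parameterless self-attention (the function names and the NumPy setup are ours, not from the notes; the input X holds one k-dimensional vector per row):

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def basic_self_attention(X):
    """Basic self-attention with no learned parameters.

    X: (t, k) array, one k-dimensional vector per position.
    Returns Y of the same shape, where each y_i is a weighted
    average of all x_j with weights softmax_j(x_i . x_j).
    """
    W_prime = X @ X.T             # raw weights w'_ij = x_i . x_j, shape (t, t)
    W = softmax(W_prime, axis=1)  # normalize each row over j
    return W @ X                  # y_i = sum_j w_ij x_j

# Example: six tokens with 4-dimensional embeddings
X = np.random.randn(6, 4)
print(basic_self_attention(X).shape)  # (6, 4)
```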
Problem: movie recommender system

Annotate each movie with a handful of features (e.g. amount of romance, amount of action) and annotate each user with how much they like each of those features. The dot product between the user vector and the movie vector then gives a match score: when the signs agree (the movie is romantic and the user likes romance), that term is positive and the score gets boosted; when they disagree, the term is negative and the score is lowered.

Basically each movie can be seen as a feature vector, as well as each user, and a dot-product similarity is being computed between them.

But such feature annotation needs to be done manually, and for millions of movies (and users) hand extraction of features is impractical.
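A tiny worked example of such a dot-product score; the feature names and numbers are made up purely for illustration:

```python
import numpy as np

# Hypothetical features: [romance, action, comedy]
user  = np.array([ 1.0, -0.5,  0.8])   # likes romance and comedy, dislikes action
movie = np.array([ 0.9,  0.1,  0.7])   # a romantic comedy with little action

score = user @ movie   # matching signs boost the score, mismatches lower it
print(score)           # ~1.41
```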
Example: "the cat walks on the street"

Each word is assigned an embedding vector: v_the, v_cat, v_walks, v_on, v_the, v_street. This sequence is fed to the self-attention layer, which outputs another sequence of vectors: y_the, y_cat, y_walks, y_on, y_the, y_street.

Each output (say y_cat) is a weighted sum over the whole input sequence, with the weights determined by the dot products between the embeddings. The dot product expresses how much two vectors in the input are related, and this relatedness is not fixed by hand: it depends entirely on the task the embeddings are optimized for.
Note that up to this point no parameters are learned for self-attention itself. Only the word embeddings are learned, and they are used to compute the relatedness (the dot products).

Self-attention sees its input as a set, NOT a sequence, even though the input is sequenced: permuting the input vectors permutes the output vectors in the same way, so self-attention is permutation equivariant.
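A quick numerical check of this permutation equivariance, reusing the same basic self-attention as above (all names ours):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def basic_self_attention(X):
    W = softmax(X @ X.T, axis=1)
    return W @ X

X = np.random.randn(6, 4)
perm = np.random.permutation(6)

# Permuting the inputs and then attending gives the same result
# as attending first and then permuting the outputs.
print(np.allclose(basic_self_attention(X[perm]), basic_self_attention(X)[perm]))  # True
```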
Modern transformer tricks: introducing additional weights to be learnt for performance

Trick 1: Queries, Keys and Values

One can see that in the self-attention network above, every input vector $x_i$ is used in three different ways.
Recall that each output $y_i$ is a weighted sum of $x_1, \ldots, x_t$, where the weight $w_{ij}$ for each $x_j$ depends on how much $x_i$ is related (by dot product) to $x_j$. Looking at this computation, each input vector plays three roles:
- As a query, $x_i$ is compared for relatedness (dot product) with all the other vectors in order to establish the weights for its own output $y_i$.
- As a key, $x_i$ is compared with every other vector's query, in order to get the weight of its contribution to the other outputs $y_j$.
- As a value, $x_i$ is used in the weighted sum that actually produces the outputs, once the weights have been established.

In self-attention as described so far, the same vector plays all three roles. The trick is to give each role its own version of $x_i$: a weight matrix $W_q$ produces the query, $W_k$ produces the key (used to check relatedness and obtain the weights), and $W_v$ converts $x_i$ into a value used to produce the weighted sum.
Intuitively, each position announces: "I am the query: I compare myself, using the dot product, with the key of all the inputs to collect the relatedness weights, and every input then contributes its value to my output in proportion to that weight."
Written out for a sequence of four vectors, and reading the weight matrix horizontally (each row corresponds to one query), $y_4$ for example collects the contributions weighted by $w_{41}, w_{42}, w_{43}, w_{44}$:

    $y_1 = w_{11} x_1 + w_{12} x_2 + w_{13} x_3 + w_{14} x_4$
    $y_2 = w_{21} x_1 + w_{22} x_2 + w_{23} x_3 + w_{24} x_4$
    $y_3 = w_{31} x_1 + w_{32} x_2 + w_{33} x_3 + w_{34} x_4$
    $y_4 = w_{41} x_1 + w_{42} x_2 + w_{43} x_3 + w_{44} x_4$
In general we can write:

    $y_i = \sum_j w_{ij} v_j$

where $w_{ij}$ is the contribution of $x_j$'s value to the generation of $y_i$.

The matrices $W_q$, $W_k$, $W_v$ are all $k \times k$, and the full computation becomes:

    query: $q_i = W_q x_i$
    key: $k_i = W_k x_i$
    value: $v_i = W_v x_i$

    $w'_{ij} = q_i^\top k_j$
    $w_{ij} = \operatorname{softmax}_j(w'_{ij})$
    $y_i = \sum_j w_{ij} v_j$
The self-attention layer has thus been parameterized using $W_q$, $W_k$, $W_v$.
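A minimal sketch of this parameterized self-attention (the matrices are random here just to show the shapes; the scaling from the next trick is left out on purpose):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Self-attention with learned projections (no dot-product scaling yet).

    X: (t, k) input sequence; W_q, W_k, W_v: (k, k) parameter matrices.
    """
    Q = X @ W_q.T                 # row i is q_i = W_q x_i
    K = X @ W_k.T                 # row i is k_i = W_k x_i
    V = X @ W_v.T                 # row i is v_i = W_v x_i
    W = softmax(Q @ K.T, axis=1)  # w_ij = softmax_j(q_i . k_j)
    return W @ V                  # y_i = sum_j w_ij v_j

t, k = 6, 4
X = np.random.randn(t, k)
W_q = np.random.randn(k, k)
W_k = np.random.randn(k, k)
W_v = np.random.randn(k, k)
print(self_attention(X, W_q, W_k, W_v).shape)   # (6, 4)
```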
Trick 2: scaling the dot product

    $w'_{ij} = \frac{q_i^\top k_j}{\sqrt{k}}$

The average magnitude of the dot product keeps growing with the embedding dimension. Since the weights then get exponentiated by the softmax, this growth may affect training adversely (the softmax saturates and the gradients vanish). Hence we need to normalize it. For an input vector in $\mathbb{R}^k$ whose components are all around some value $c$, the Euclidean norm is $c\sqrt{k}$, so the growth is on the order of $\sqrt{k}$. Hence we normalize by dividing by $\sqrt{k}$.
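A small numerical check of this growth, assuming standard-normal components (dimensions and sample count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
for k in (16, 64, 256, 1024):
    q = rng.standard_normal((10000, k))
    key = rng.standard_normal((10000, k))
    dots = (q * key).sum(axis=1)
    # std of q . key grows like sqrt(k); dividing by sqrt(k) keeps it near 1
    print(k, round(dots.std(), 2), round((dots / np.sqrt(k)).std(), 2))
```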
Multi-head attention

Case: compare "Teacher gives homework to student" with "Student gives homework to teacher". The word "gives" stands in different relations to "teacher" and "student" in the two sentences (who gives, who receives), and a single self-attention operation mixes all of this information into one weighted sum. To let the model attend to several kinds of relations at once, we run multiple self-attention operations ("heads") in parallel on the same input $x_1, x_2, x_3, \ldots$, each head with its own matrices $W_q^r$, $W_k^r$, $W_v^r$, and combine their outputs.
Narrow and wide self-attention

There are two ways to set up the heads. In wide self-attention, each head applies its projections to the whole input vector (e.g. each head maps the full 256-dimensional embedding to 256-dimensional queries, keys and values) and the concatenated head outputs are projected back down to 256 dimensions. In narrow self-attention, the 256-dimensional vector is first cut into equal chunks (e.g. 8 heads of 32 dimensions each) and each head operates only on its own chunk; this is cheaper but somewhat less expressive.
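A compact sketch of the narrow multi-head variant (head count, dimensions and the output projection W_o are illustrative choices of ours; the dot product is scaled by the per-head dimension):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, h):
    """Narrow multi-head self-attention.

    X: (t, k). W_q, W_k, W_v, W_o: (k, k). h: number of heads, must divide k.
    Each head works on a k/h-dimensional slice of the projected vectors.
    """
    t, k = X.shape
    s = k // h                                   # dimension per head
    Q = (X @ W_q).reshape(t, h, s)
    K = (X @ W_k).reshape(t, h, s)
    V = (X @ W_v).reshape(t, h, s)

    Y = np.empty((t, h, s))
    for r in range(h):                           # one attention operation per head
        W = softmax(Q[:, r] @ K[:, r].T / np.sqrt(s), axis=1)
        Y[:, r] = W @ V[:, r]

    return Y.reshape(t, k) @ W_o                 # concatenate heads, project back

t, k, h = 6, 8, 4
X = np.random.randn(t, k)
W_q, W_k, W_v, W_o = (np.random.randn(k, k) for _ in range(4))
print(multi_head_self_attention(X, W_q, W_k, W_v, W_o, h).shape)  # (6, 8)
```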
The transformer

A transformer is any architecture that can process a set of units (a sequence of tokens, a set of pixels, ...) where the units can only interact via self-attention, applied repeatedly.

Transformer block:
- self-attention layer
- layer normalization
- feed-forward layer (a single-layer MLP applied to each vector independently)
- layer normalization
- residual connections around the self-attention and the feed-forward layer

Normalization and residual learning are standard tricks to train deep networks.
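A sketch of one such block, assuming a pre-norm ordering (the exact placement of layer norm and residuals varies between implementations); the ReLU hidden layer and the 4k hidden width are illustrative choices of ours:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(X, eps=1e-5):
    # normalize each vector to zero mean and unit variance
    return (X - X.mean(-1, keepdims=True)) / np.sqrt(X.var(-1, keepdims=True) + eps)

def self_attention(X, W_q, W_k, W_v):
    k = X.shape[1]
    W = softmax((X @ W_q) @ (X @ W_k).T / np.sqrt(k), axis=1)
    return W @ (X @ W_v)

def transformer_block(X, W_q, W_k, W_v, W1, b1, W2, b2):
    """Self-attention and a per-vector MLP, each with layer norm and a residual."""
    X = X + self_attention(layer_norm(X), W_q, W_k, W_v)   # attention sub-layer
    H = np.maximum(0.0, layer_norm(X) @ W1 + b1)           # single hidden layer, ReLU
    return X + H @ W2 + b2                                 # feed-forward sub-layer

t, k = 6, 8
X = np.random.randn(t, k)
params = [np.random.randn(k, k) for _ in range(3)] + \
         [np.random.randn(k, 4 * k), np.zeros(4 * k),
          np.random.randn(4 * k, k), np.zeros(k)]
print(transformer_block(X, *params).shape)   # (6, 8)
```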
Example: IMDb sentiment classification dataset

The input is a sequence of words from a movie review and the task is binary classification: is the review positive ("good") or negative ("bad")?

Each word is mapped to a word embedding, the sequence is passed through the transformer, and the result is used for classification. On top of the word embedding we also add a positional encoding, computed with some (fairly involved) function described below.
Requirement of positional embeddings

Word order plays a very crucial role: a sentence's meaning is a complex function of the positions of its words. But self-attention treats its input as a set, so the position of each word has to be injected explicitly; otherwise all permutations of a sentence get the same (position-invariant) encoding.
Some basic options are:
1. Use the position index itself: t = 0, 1, 2, 3, ..., up to some big number (say 10,000). Problem: this is unbounded.
2. Normalize it by the sentence length N so that it lies in [0, 1]. This is bounded, but the same value then means different things in sentences of different lengths, and sentences at test time may be longer than any seen in training.
Basic criteria for an encoding:
- The position of a word in any sentence must have a unique encoding (unique time-step encodings).
- The distance between any two time steps must be consistent across sequences of all lengths.
- The encoding needs to be bounded and to generalize well to longer sequences.
We require a function $f : \mathbb{N} \to \mathbb{R}^d$ mapping a position $t$ to a $d$-dimensional encoding. The $i$-th element of the encoding of time step $t$ is:

    $PE(t, i) = \sin(\omega_k t)$  when $i = 2k$ (i.e. $i$ even)
    $PE(t, i) = \cos(\omega_k t)$  when $i = 2k + 1$ (i.e. $i$ odd)

Each frequency $\omega_k$ encodes two dimensions (one sine, one cosine), therefore a total of $d/2$ frequencies is used to encode the $d$ dimensions.

Q: why do you require more than one sinusoid function? A single sinusoid repeats after one period, so distant positions would collide; combining many frequencies, from fast ones (fine resolution) to very slow ones (long range), keeps the encoding unique over long sequences while staying bounded.

As suggested in the original transformer paper, the frequencies form a geometric progression:

    $\omega_k = \frac{1}{10000^{2k/d}}$

So $\omega_0 = 1$ with period $T_0 = 2\pi$, then $\omega_1 = 1/10000^{2/d}$, $\omega_2 = 1/10000^{4/d}$, and so on, down to roughly $1/10000$ for the last pair of dimensions, whose period is about $2\pi \cdot 10000$. The periods $T_0, T_1, T_2, \ldots$ thus range from $2\pi$ up to about $2\pi \cdot 10000$: the first dimensions oscillate quickly with position, the last ones change very slowly.
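A direct NumPy implementation of these formulas (the function name is ours; d must be even):

```python
import numpy as np

def positional_encoding(t_max, d):
    """Sinusoidal positional encodings for positions 0..t_max-1, dimension d (even)."""
    t = np.arange(t_max)[:, None]            # positions, shape (t_max, 1)
    k = np.arange(d // 2)[None, :]           # frequency index, shape (1, d/2)
    omega = 1.0 / 10000 ** (2 * k / d)       # omega_k = 1 / 10000^(2k/d)
    PE = np.empty((t_max, d))
    PE[:, 0::2] = np.sin(omega * t)          # even dimensions: sin(omega_k t)
    PE[:, 1::2] = np.cos(omega * t)          # odd dimensions:  cos(omega_k t)
    return PE

PE = positional_encoding(50, 16)
print(PE.shape)                            # (50, 16)
print(PE.min() >= -1 and PE.max() <= 1)    # True: bounded in [-1, 1]
```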
The final embedding fed to the network is the sum of the word embedding and the position embedding:

    final embedding = word embedding + position embedding

In order to make this summation possible, the dimensions of the word embeddings and the position embeddings are set equal. They could also be concatenated instead of added, which is more intuitive, but a few reasons support addition:
- It requires less memory (the embedding dimension does not grow).
- With a maximum timescale of 10,000, the slow position dimensions barely change within a typical sequence, so every word there has nearly the same encoding as the first word. Assuming sequences far shorter than 10,000 with 512-dimensional word embeddings, most of the position-embedding dimensions are nearly constant and the sum is dominated by the word embedding (WE).
Why does adding the two embeddings not mix them up? The attention between two tokens is computed from the query transformation matrix $W_q$ and the key transformation matrix $W_k$ via a dot product. Take a word embedding $x$ with position embedding $e$, and another word embedding $y$ with position embedding $f$:

    $(W_q(x + e))^\top (W_k(y + f))$

Writing the combined transform as a single matrix $T = W_q^\top W_k$:

    $(x + e)^\top T (y + f) = x^\top T y + x^\top T f + e^\top T y + e^\top T f$

So the attention score simultaneously contains word-word, word-position, position-word and position-position affinities. Concatenation would give exact orthogonality (a clean separation between word and position information), but since both embeddings are high-dimensional, approximate orthogonality is naturally achieved anyway, without concatenation and without its cost of extra parameters.
[Figure: plot with embedding dimension on the x-axis (up to a few hundred), illustrating that randomly chosen high-dimensional vectors become approximately orthogonal as the dimension grows.]
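A quick numerical illustration of that approximate orthogonality for random Gaussian vectors (dimensions and sample count chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (10, 100, 1000, 10000):
    a = rng.standard_normal((1000, d))
    b = rng.standard_normal((1000, d))
    cos = (a * b).sum(1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    # average |cosine similarity| shrinks roughly like 1/sqrt(d)
    print(d, round(np.abs(cos).mean(), 3))
```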
For some fixed offset $\Delta t$, $PE(t)$ and $PE(t + \Delta t)$ need to be linearly related. For each frequency $\omega_k$ we have a $\sin(\omega_k t)$ and a $\cos(\omega_k t)$ component, and:

    $\begin{pmatrix} \sin(\omega_k (t + \Delta t)) \\ \cos(\omega_k (t + \Delta t)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_k \Delta t) & \sin(\omega_k \Delta t) \\ -\sin(\omega_k \Delta t) & \cos(\omega_k \Delta t) \end{pmatrix} \begin{pmatrix} \sin(\omega_k t) \\ \cos(\omega_k t) \end{pmatrix}$

This ensures that, for each frequency, the encoding at $t$ is linearly related to the encoding at $t + \Delta t$ by a rotation matrix that depends only on $\Delta t$ and is independent of $t$. Relating the encodings of different positions linearly in this way makes it easy for the model to learn to attend to relative positions.
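A small check of this rotation property for a handful of positions and an arbitrary offset Δt = 5 (all values illustrative):

```python
import numpy as np

d, dt = 16, 5
omega = 1.0 / 10000 ** (2 * np.arange(d // 2) / d)

for t in (0, 3, 17, 200):
    for w in omega:
        rot = np.array([[ np.cos(w * dt), np.sin(w * dt)],
                        [-np.sin(w * dt), np.cos(w * dt)]])
        pe_t   = np.array([np.sin(w * t),        np.cos(w * t)])
        pe_tdt = np.array([np.sin(w * (t + dt)), np.cos(w * (t + dt))])
        # the same rotation (depending only on dt) maps PE(t) to PE(t + dt)
        assert np.allclose(rot @ pe_t, pe_tdt)
print("rotation check passed")
```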
[Figure: heatmap over time steps 0 to 60 showing, for every pair of time steps, the dot product between their positional encodings. The pattern is symmetric about the diagonal, neighbouring time steps look most alike, and the similarity decays with the distance in time (equivalently, the distance between encodings is zero on the diagonal and grows symmetrically off the diagonal).]
Transformer architecture (recap)

The transformer architecture uses self-attention: it has units that process a set (a sequence of tokens, a set of pixels), and the units can only interact with each other via self-attention. Every other component of the block (the per-position MLPs, the residual connections) acts on each vector $y_1, \ldots, y_t$ independently.
How to generate text using transformers

Each output position depends upon the entire input, so next-character prediction becomes trivial: the model could simply look at the character it is supposed to predict. Hence we mask the future data in the self-attention, so that position i can only attend to positions up to i.
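A minimal sketch of this causal masking applied to the raw attention weights (all names ours; entries set to -inf become zero weights after the softmax):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(X, W_q, W_k, W_v):
    """Self-attention where position i may only attend to positions j <= i."""
    t, k = X.shape
    raw = (X @ W_q) @ (X @ W_k).T / np.sqrt(k)
    mask = np.triu(np.ones((t, t), dtype=bool), k=1)  # True above the diagonal
    raw[mask] = -np.inf                               # future positions get weight 0
    return softmax(raw, axis=1) @ (X @ W_v)

t, k = 5, 4
X = np.random.randn(t, k)
W_q, W_k, W_v = (np.random.randn(k, k) for _ in range(3))
print(masked_self_attention(X, W_q, W_k, W_v).shape)   # (5, 4)
```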