Transformer Architecture: The Positional Encoding
Let's use sinusoidal functions to inject the order of words in our model
Posted by Amirhossein Kazemnejad on September 20, 2019 · 17 mins read
Table of Contents
What is positional encoding and why do we need it in the first place?
Proposed method
The intuition
Other details
Relative Positioning
FAQ
Summary
References
Thanks to the several implementations available in common deep learning frameworks, the Transformer became an easy option for many students (including myself) to experiment with. Making the model more accessible is a great thing, but on the downside it may cause its details to be ignored.
In this article, I don't plan to explain its architecture in depth, as there are currently several great tutorials on this topic (here, here, and here). Instead, I want to discuss one specific part of the Transformer's architecture: the positional encoding.
When I read this part of the paper, it raised some questions in my head which, unfortunately, the authors had not provided sufficient information to answer. So in this article, I want to try to break this module apart and look at how it works.
NOTE: To understand the rest of this post, I highly suggest you read one of those tutorials to get familiar with the Transformer architecture.
Figure 1 - The Transformer Architecture
But the Transformer architecture ditched the recurrence mechanism in favor of a multi-head self-attention mechanism. Avoiding the RNNs' recurrence results in a massive speed-up in training time, and, theoretically, it can capture longer dependencies in a sentence.
As each word in a sentence simultaneously flows through the Transformer's encoder/decoder stack, the model itself doesn't have any sense of position/order for each word. Consequently, there's still the need for a way to incorporate the order of the words into our model.
One possible solution to give the model some sense of order is to add a piece of information to each
word about its position in the sentence. We call this “piece of information”, the positional encoding.
The first idea that might come to mind is to assign a number to each time-step within the [0, 1] range in
which 0 means the first word and 1 is the last time-step. Could you figure out what kind of issues it
would cause? One of the problems it will introduce is that you can’t figure out how many words are
present within a specific range. In other words, time-step delta doesn’t have consistent meaning across
different sentences.
Another idea is to assign a number to each time-step linearly. That is, the first word is given "1", the second word is given "2", and so on. The problem with this approach is that not only could the values get quite large, but our model may also face sentences longer than the ones seen in training. In addition, our model may never see a sample of one specific length, which would hurt its generalization.
Ideally, the following criteria should be satisfied:
1. It should output a unique encoding for each time-step (word's position in a sentence).
2. Distance between any two time-steps should be consistent across sentences of different lengths.
3. Our model should generalize to longer sentences without any effort. Its values should be bounded.
4. It must be deterministic.
Proposed method
The encoding proposed by the authors is a simple yet ingenious technique which satisfies all of those criteria. First of all, it isn't a single number. Instead, it's a d-dimensional vector that contains
information about a specific position in a sentence. And secondly, this encoding is not integrated into
the model itself. Instead, this vector is used to equip each word with information about its position in a
sentence. In other words, we enhance the model’s input to inject the order of words.
Let $t$ be the desired position in an input sentence, $\vec{p_t} \in \mathbb{R}^d$ be its corresponding encoding, and $d$ be the encoding dimension (where $d \equiv_2 0$, i.e. $d$ is even). Then $f : \mathbb{N} \to \mathbb{R}^d$ will be the function that produces the output vector $\vec{p_t}$, and it is defined as follows:

$$
\vec{p_t}^{(i)} = f(t)^{(i)} :=
\begin{cases}
\sin(\omega_k \cdot t), & \text{if } i = 2k \\
\cos(\omega_k \cdot t), & \text{if } i = 2k + 1
\end{cases}
$$

where

$$
\omega_k = \frac{1}{10000^{2k/d}}
$$
As can be seen from the function definition, the frequencies decrease along the vector dimension. Thus they form a geometric progression from 2π to 10000 ⋅ 2π in the wavelengths.
You can also imagine the positional embedding $\vec{p_t}$ as a vector containing pairs of sines and cosines for each frequency (note that $d$ is divisible by 2):
$$
\vec{p_t} =
\begin{bmatrix}
\sin(\omega_1 \cdot t) \\
\cos(\omega_1 \cdot t) \\
\sin(\omega_2 \cdot t) \\
\cos(\omega_2 \cdot t) \\
\vdots \\
\sin(\omega_{d/2} \cdot t) \\
\cos(\omega_{d/2} \cdot t)
\end{bmatrix}_{d \times 1}
$$
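To make the definition concrete, here is a minimal NumPy sketch of how such an encoding table could be computed. The function name and arguments are my own, not from the paper, and k runs from 0 to d/2 - 1 to match the indexing i = 2k.

```python
import numpy as np

def positional_encoding(max_len, d):
    """Return a (max_len, d) table whose row t is the encoding vector p_t.

    Assumes d is even: dimension 2k holds sin(w_k * t) and 2k+1 holds cos(w_k * t),
    with w_k = 1 / 10000^(2k / d).
    """
    t = np.arange(max_len)[:, np.newaxis]        # positions, shape (max_len, 1)
    k = np.arange(d // 2)[np.newaxis, :]         # frequency indices, shape (1, d/2)
    omega = 1.0 / np.power(10000.0, 2 * k / d)   # the frequencies w_k
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(omega * t)              # even dimensions: sines
    pe[:, 1::2] = np.cos(omega * t)              # odd dimensions: cosines
    return pe

pe = positional_encoding(max_len=50, d=128)
print(pe.shape)  # (50, 128)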
The intuition
You may wonder how this combination of sines and cosines could ever represent a position/order. It is actually quite simple. Suppose you want to represent a number in binary format; how would you do that?
0 : 0 0 0 0 8 : 1 0 0 0
1 : 0 0 0 1 9 : 1 0 0 1
2 : 0 0 1 0 10 : 1 0 1 0
3 : 0 0 1 1 11 : 1 0 1 1
4 : 0 1 0 0 12 : 1 1 0 0
5 : 0 1 0 1 13 : 1 1 0 1
6 : 0 1 1 0 14 : 1 1 1 0
7 : 0 1 1 1 15 : 1 1 1 1
You can spot the rate of change between different bits: the lowest bit alternates on every number, the second-lowest bit flips every two numbers, and so on.
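A tiny sketch (my own illustration, not from the original post) that counts how often each bit flips while counting from 0 to 15 confirms this pattern:

```python
# Each bit flips half as often as the one below it: 15, 7, 3, 1 times for bits 0..3.
for i in range(4):                                   # i = bit index, 0 = least significant
    column = [(n >> i) & 1 for n in range(16)]       # the i-th bit of 0..15
    flips = sum(a != b for a, b in zip(column, column[1:]))
    print(f"bit {i} flips {flips} times")
```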
But using binary values would be a waste of space in the world of floats. So instead, we can use their continuous float counterparts: sinusoidal functions. Indeed, they are the equivalent of alternating bits. Moreover, by decreasing their frequencies, we can go from the rapidly flipping low bits to the slowly flipping high ones.
Figure 2 - The 128-dimensional positional encoding for a sentence with a maximum length of 50. Each row represents the embedding vector $\vec{p_t}$.
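A plot like Figure 2 can be reproduced with a few lines of matplotlib; this sketch assumes the `positional_encoding` helper from the earlier snippet:

```python
import matplotlib.pyplot as plt

pe = positional_encoding(max_len=50, d=128)      # helper sketched above
plt.figure(figsize=(10, 4))
plt.pcolormesh(pe, cmap="RdBu")                  # one row per position t, one column per dimension
plt.xlabel("Encoding dimension i")
plt.ylabel("Position t")
plt.colorbar()
plt.show()
```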
Other details
Earlier in this post, I mentioned that positional embeddings are used to equip the input words with
their positional information. But how is it done? In fact, the original paper added the positional
encoding on top of the actual embeddings. That is, for every word $w_t$ in a sentence $[w_1, \ldots, w_n]$:

$$
\psi'(w_t) = \psi(w_t) + \vec{p_t}
$$

To make this summation possible, we keep the positional embedding's dimension equal to the word embeddings' dimension, i.e. $d_{\text{word embedding}} = d_{\text{positional embedding}}$.
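As a toy illustration of this sum (the shapes and the random embedding table are made up, and `positional_encoding` is the helper sketched earlier, not the paper's code):

```python
import numpy as np

d_model, seq_len, vocab_size = 128, 10, 1000
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model))   # stands in for the learned psi

token_ids = rng.integers(0, vocab_size, size=seq_len)      # a hypothetical 10-token sentence
word_embeddings = embedding_table[token_ids]               # psi(w_t) for each t, shape (10, 128)
pos_embeddings = positional_encoding(seq_len, d_model)     # p_t for each t, shape (10, 128)

encoder_input = word_embeddings + pos_embeddings           # psi'(w_t) = psi(w_t) + p_t
print(encoder_input.shape)                                 # (10, 128)
```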
Relative Positioning
Another characteristic of sinusoidal positional encoding is that it allows the model to attend to relative positions effortlessly. Here is a quote from the original paper:
We chose this function because we hypothesized it would allow the model to easily learn to attend by
relative positions, since for any fixed offset k, PEpos+k can be represented as a linear function of PEpos.
But why does this statement hold? To fully understand why, please refer to this great article for the detailed proof. However, I've prepared a shorter version here.

Proof:

For a given frequency $\omega_k$, we look for a $2 \times 2$ matrix $M$ (independent of $t$) such that

$$
M \cdot
\begin{bmatrix} \sin(\omega_k \cdot t) \\ \cos(\omega_k \cdot t) \end{bmatrix}
=
\begin{bmatrix} \sin(\omega_k \cdot (t + \phi)) \\ \cos(\omega_k \cdot (t + \phi)) \end{bmatrix}
$$

Writing $M = \begin{bmatrix} u_1 & v_1 \\ u_2 & v_2 \end{bmatrix}$ and applying the addition theorems for sine and cosine to expand the right-hand side, we can solve for the entries:

$$
u_1 = \cos(\omega_k \cdot \phi) \qquad v_1 = \sin(\omega_k \cdot \phi)
$$
$$
u_2 = -\sin(\omega_k \cdot \phi) \qquad v_2 = \cos(\omega_k \cdot \phi)
$$

So the final transformation matrix $M$ is:

$$
M_{\phi,k} =
\begin{bmatrix}
\cos(\omega_k \cdot \phi) & \sin(\omega_k \cdot \phi) \\
-\sin(\omega_k \cdot \phi) & \cos(\omega_k \cdot \phi)
\end{bmatrix}
$$
As you can see, the final transformation does not depend on $t$. Note also that $M_{\phi,k}$ is very similar to a rotation matrix.

Similarly, we can find $M$ for the other sine-cosine pairs, which eventually allows us to represent $\vec{p}_{t+\phi}$ as a linear function of $\vec{p_t}$ for any fixed offset $\phi$. This property makes it easy for the model to learn to attend by relative positions.

Another property of sinusoidal position encoding is that the distance between neighboring time-steps is symmetric and decays nicely with time.
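As a quick numerical sanity check of the proof above (my own sketch, not from the post), the matrix built from an arbitrary offset phi really does map the sine-cosine pair at position t to the pair at position t + phi, for any t:

```python
import numpy as np

d, k, phi = 128, 3, 5                                  # arbitrary dimension, pair index, and offset
omega = 1.0 / 10000 ** (2 * k / d)

M = np.array([[ np.cos(omega * phi), np.sin(omega * phi)],
              [-np.sin(omega * phi), np.cos(omega * phi)]])

for t in (0, 7, 42):                                   # the same M works for every position t
    pair_t     = np.array([np.sin(omega * t), np.cos(omega * t)])
    pair_t_phi = np.array([np.sin(omega * (t + phi)), np.cos(omega * (t + phi))])
    assert np.allclose(M @ pair_t, pair_t_phi)
print("M depends only on phi and k, not on t")
```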
FAQ
Why are positional embeddings summed with word embeddings instead of being concatenated?
I couldn't find any theoretical reason for this question. Since summation (in contrast to concatenation) saves the model's parameters, it is reasonable to reframe the initial question as "Does adding the positional embeddings to the words have any disadvantages?". I would say, not necessarily!
First, if we pay attention to Figure 2, we will find that only the first few dimensions of the whole embedding are used to store the information about the positions (note that the embedding dimension reported in the paper is 512, unlike our small toy example). And since the embeddings in the Transformer are trained from scratch, the parameters are probably set in a way that the semantics of words do not get stored in the first few dimensions, to avoid interfering with the positional encoding. For the same reason, I think the final Transformer can separate the semantics of words from their positional information. Moreover, there is no reason to consider separability an advantage. Maybe the summation provides a good source of features for the model to learn from.
For more information, I recommend checking these links: link 1, link 2.
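For concreteness, here is a toy comparison of the two options (the shapes are made up, and `positional_encoding` is the earlier sketch; neither line is "the" Transformer implementation): summing keeps the model dimension at d, while concatenating doubles the width that every downstream weight matrix has to consume.

```python
import numpy as np

seq_len, d_model = 10, 128
word_emb = np.random.default_rng(0).normal(size=(seq_len, d_model))
pos_emb = positional_encoding(seq_len, d_model)             # helper sketched earlier

summed = word_emb + pos_emb                                  # shape (10, 128): downstream layers unchanged
concatenated = np.concatenate([word_emb, pos_emb], axis=-1)  # shape (10, 256): every following projection
                                                             # would need twice as many input weights
print(summed.shape, concatenated.shape)
```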
Doesn't the positional information vanish once it reaches the upper layers?
Fortunately, the Transformer architecture is equipped with residual connections. Therefore, the information from the input of the model (which contains the positional embeddings) can efficiently propagate to the later layers, where the more complex interactions are handled.
Summary
Thank you for staying with me until the end of this article. I hope you’ve found this useful for answering
your question. Please feel free to provide any corrections or feedbacks, the comment section is at your
disposal.
Cited as
@article{kazemnejad2019:pencoding,
  title   = "Transformer Architecture: The Positional Encoding",
  author  = "Kazemnejad, Amirhossein",
  journal = "[Link]",
  year    = "2019",
  url     = "[Link]"
}
References
The Illustrated Transformer
Attention Is All You Need - The Transformer
Linear Relationships in the Transformer’s Positional Encoding
position_encoding.ipynb
Tensor2Tensor Github issue #1591
Reddit thread - Positional Encoding in Transformer
Reddit thread - Positional Encoding in Transformer model
Comments
Jiaxuan Wang (5 years ago):
The information is great! I think a more intuitive explanation of positional embedding is to think about it as a clock (as cos and sin are just concepts from the unit circle). Every two dimensions of the positional embedding just specify one of the clock's hands (the hour hand, the minute hand, the second hand, for example). Then moving from one position to the next position is just rotating those hands at different frequencies. Thus, without formal proof, it immediately tells you why a rotation matrix exists.
Reply: Can you please explain it with the help of a diagram? I am not able to visualise it.
Abhishek Kumar Dubey > Jiaxuan Wang (5 years ago):
What happens when the clock hand rolls over? In that case, will two positions have the same encoding?
Reply: I think you misunderstood Jiaxuan Wang's explanation. Let's use K=4 as the embedding dimension for easy
explanation. What the user is saying is that -
a. with the PE in the form of [s1, s2, s3, s4], s1 and s3 are sines, while s2 and s4 are cosines (stating the obvious).
b. s1 and s2 are hour and minute hands in one clock; s3 and s4 are in another. Let's say we are in a 24 hour
timeframe, s1 and s2 are hours and minutes in the first 12 hours, while s3 and s4 are in the second 12 hours.
c. what the blog author demonstrated is that each point in time within that 24 hours can be expressed with these
two clocks combined. Distance in time between any two positions are independent of where those positions are in
actual time.
I honestly don't think we need to borrow the clock metaphor. If you look at the code implementation of PE
(search for 'Jay Alammar illustrated transformer'), latent dimension K dictates the angle rate of rotation, so it's
fixed as long as we don't change K. It has nothing to do with sequence length. Each position is then represented
as a vector in the same K dimension as the word embedding (as the author pointed out, for summation
purposes). The distance between any two positions is degree of counter-clock wise rotations from the beginning
positions to the ending position. Also in this implementation, the geometric progress is from 0 to 9999 * 2(pi),
instead of 2(pi) to 10000*2(pi). If the latent dimension K exceeds 10,000, we may have to update the angle rate
calculation. But that's rarely necessary in real life application
Blazej Fiderek > Abhishek Kumar Dubey (5 years ago, edited):
This kind of behaviour is preserved by having a large value (namely 10000) in the denominator for the biggest w_k (omega). It ensures that for each possible time-step (word position in the sequence), the representation will differ in at least one position of the embedding vector. Rolling over of the clock could occur if you had 10000 in the denominator of omega but the number of time-steps, that is the number of words in the sequence, were higher, say 12000.
Reply: No it won't. For the smallest w_k (namely 1/10000), the wavelength is 10000×2pi, which is bigger than 12000. And even if the sequence length were bigger than 10000×2pi, say 80000, sin(80000/10000) = sin(8) = sin(8-2pi), and since 8-2pi isn't rational it can't be any of the PE of previous positions.
Reply: I believe it's not about the *exact* equality of embedding vectors but about being *approximately* equal. E.g. in your example, the specific value of `sin(80000/10000)`, indeed, would not have been seen before, but `sin(17681/10000)` is equal to it when rounded to 2 decimal places.
Philip Voinea > Yacine Fitta (a year ago):
@Yacine Fitta The transformer can only be so sensitive to small changes in vectors. The difference between sin(80,000/10,000) and sin(17,168/10,000), for example, while not exactly 0, is imperceptible to the transformer because such a tiny difference is at the scale of noise.
Reply: Yes, this is a really elegant intuition. I've featured your comment so that it can help others.
Reply: Actually, the binary bits change is equal to the clock assumption. In binary bits, the last position changes most frequently, which is equal to the second hand.
Yacine Fitta (4 years ago, edited):
Why do we need the whole d-dimensional PE? Doesn't a single pair suffice? They can be linearly transformed to any other PE, and they provide a unique value for any position i (since pi is a transcendental number). It is mentioned in the end that it's because we need to add the word embeddings to the PEs, but why add and not just concatenate? Or why not just add it to the first 2 dimensions?
Matt Hill (5 years ago):
>we will find out that only the first few dimensions of the whole embedding are used to store the information about the positions
This sort of makes sense, in that the positional encoding varies little in the higher dimensions. However, it does alternate between 0 and 1, so when added to the semantic embedding, it seems like it would disrupt / distort the semantic information in half the positions significantly? Why does this not happen?
disqus_0WriWs1f0L (5 years ago):
I have some concern regarding how effective it would be when we do the positional encoding by adding the positional vector to the word embedding vector. I'm not sure if I got this correctly. I understand the position encoding itself has all these great properties. But when it's added with the word vector, it seems the positional information would be mixed with the word embedding information such that I am not sure whether the model could distinguish them. For example, if a word vector is [0.2, -0.1, 0.9, 0.03], with positional vector [0, 1, 0, 1], and another word vector is [-0.5, 0.7, 0.8, 0.7], with positional vector [0.7, 0.2, 0.1, 0.33], then their sum vectors would be the same. In this case how does the model tell the differences between their words and positions? Thanks a lot for your time!
Riccardo Di Sipio (5 years ago):
Hi,
great article! Having a background in physics, I noted that there is a certain similarity between the position encoding and the Fourier transform. You would get different FTs for inputs that differ only in the order of the elements, e.g.:
>>> [Link](x)
<[Link]: shape=(4,), dtype=complex64, numpy=array([12665.999+6.1035156e-05j, -3108.4998+2.1260920e+03j, -3187.4995-2.1901782e+03j, 532.0007+1.2207031e-04j], dtype=complex64)>
>>> [Link](y)
<[Link]: shape=(4,), dtype=complex64, numpy=array([12666.-3.6621094e-04j, -630.4998+2.1260923e+03j, -3187.4995+2.1018433e+03j, -4423.9995-2.4414062e-04j], dtype=complex64)>
And of course, the inverse FT would give you back the original sequence (modulo some rounding due to int <-> float conversion).
Are you aware of any attempt at using such an encoding? Incidentally, the FT of a sequence of n real numbers is a sequence of n complex numbers, i.e. 2n real numbers that can be seen as the cos and sin projections of a vector in the complex plane.
Anish Tondwalkar > Riccardo Di Sipio (5 years ago):
The input is 2-d: one time dimension (position in the sentence) and one _semantic_ dimension (the word embedding). What we're doing here is explicitly including the Fourier transform of the position (remember, the FT of $\delta(\omega - t)$ is $\sin(\omega t) + i \cos(\omega t)$). I'm not sure what it would mean to take a FT in the semantic dimension.
Reply:
Hi,
Thank you so much for your comment.
Wow, that's a very interesting observation. I have not actually seen someone following this path. But I found this paper some time ago that tries to provide a more general framework for positional encodings. I think it can be helpful ([Link]).
Regarding the use of FT in positional encoding, I think there are some gray areas that need more discussion. For example, I am not sure whether a NN can infer positions only by seeing a single FT sequence. I think (please correct me if I'm wrong) the order information is only available when there are multiple FT sequences in comparison to each other. Moreover, the design of a transformer stack that uses an FT sequence is debatable. That is, the stack needs to know about the position t. How can we extract such information from an FT sequence? (Maybe we can feed an FT input that is calculated from a manipulated original sequence based on t, such as removing the word at t.) Additionally, what kinds of benefits will it bring in terms of relative positioning (maybe some particular mathematical property)?
Jauhar (3 months ago):
This is great. Thank you for the explanation. Just a small query however: since P_t is a vector with sin and cosine values, both dependent on t, and since sin and cosine are periodic functions, wouldn't the vectors be the same for many different ts?
rinku jadhav (2 years ago):
You say that the naïve idea of providing position as 1/n, 2/n, 3/n, ..., k/n, ..., n/n is not feasible because n (sequence length) varies over examples. But, practically, transformers have an upper bound on how many tokens they can handle (typically 512 or 1024), so we could just fix the value of n to be either that limit or that limit times some constant so that the largest value is not 1.
I think a potential problem (maybe) then would be that the transformer would need to learn that the extra value provided is not an extension of the embedding but rather a position token.
Zephyr (5 years ago, edited):
Thanks for the blog. I could not understand the sentence "But using binary values would be a waste of space in the world of floats." Is it because an integer is 4 bytes and binary values take only ones and zeroes? If that's the case, we can change the data type to "bit", right?
Reply:
I'm really sorry for the late reply! Yes, my statement can be inaccurate if we go into exact details. At the intuition level, I meant that we are not forced to use the square wave (binary values). Moreover, word embeddings are represented using float32. Therefore, we would always have to cast the hypothetical bit-represented positional encoding to float32 before calculating the sum. From another point of view, if we represented the positional encoding in integers, then we would have to use discrete steps for incrementing the wavelength, which is not the case for floats, since a finer change in values can be detected by neural networks. Thus, it would be an under-usage of floats' potential.
I hope I could clarify that. Please feel free to drop a comment if you have further questions.
soarer (9 months ago):
I don't understand why your first idea (assigning 0/n, 1/n, ...) doesn't work. Transformer models need to have a fixed context length, and shorter input has to be padded anyway, so as long as the time-step size is 1/n, where n is the context length, the problem of inconsistency across sentences isn't a real issue. This scheme clearly preserves relative distance as well. Are there other issues that you can see? Thanks!
Mike Gates (a year ago):
Here I have a supplement that could add to the article, because the author is very modest and hides some explanations.
The positional information is added to its corresponding token embedding, quite naturally. The whole network is a neural network, each layer of which accepts its input linearly, multiplying input by weights. So the network can easily "decompose" and "understand" the (linear) sum of positional vector and embedding vector, through millions of repeated inputs. If the two were concatenated, the NN would have had to spend multiple layers to understand the positions. The linear relationship between positions, the linear addition of the positional vector into the embedding vector, (inherently linear) neural networks, and the residual connections are a good combination that allows us to get rid of RNNs in sequence-to-sequence learning. That's my intuition.
Aryan Pandey (a year ago):
Thank you, got here from a reddit thread from 4-5 years back. I've been trying to understand attention for over 4 months, but have been pushing myself over the past few weeks to finally understand this in depth. This article cleared up a lot of stuff, just like reading about Xavier initialization feels after understanding the basics of neural networks!
David Ireland (a year ago):
Looks like there are some rendering errors with the equations as of 2024?
sirk390 (2 years ago):
Thanks for the great article. Just one comment: you say "But using binary values would be a waste of space in the world of floats." Are you sure this is correct? To me it doesn't seem to waste more space. Is the real reason not the linearity explained later? "we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PEpos+k can be represented as a linear function of PEpos"
MichaelSB (2 years ago):
Very nice explanation, thank you! One question: why not represent positional encoding using binary vectors? For example, concatenate 16 elements to an input vector, where each of those elements is 0 or 1. Such a positional encoding represents a 16-bit integer (0-65,535), where each vector element is 0 or 1. We avoid the problem of having large numerical values in our input embedding, and the number of positions that can be represented is bounded.
One potential problem I see is that the weights corresponding to these 16 vector positions might end up increasing proportionally (e.g. the weight for the MSB position could be 65k times larger than the weight corresponding to the LSB position), so instead of large input values we could end up with large weight values (and require a large dynamic range for the weights, which means more precision). But the same problem should affect the sin/cosine representation, right?
Boris Burkov (3 years ago):
The Fourier transform is ubiquitous, but I have a "theory" that angle encoding in quantum machine learning could've been the source of inspiration for positional encoding: [Link]
Mahdi Amrollahi (3 years ago):
That was so helpful to me; I used to have several questions about why this kind of encoding works. Thanks Amirhossein...
Reply: It means "modulo 2". So "d ≡2 0" means d = 0 modulo 2, i.e. that d is even. :)
Diego Quintana (4 years ago, edited):
Hey, great article, thanks! How did you produce the last plot for the distances? Can you show the code used? Thank you!
sober reflection (4 years ago):
If we want to encode position information that is not a sequence, how can we do it? For example, if we want to use sin/cos to represent the relative position of image pixels, the position embedding may not be friendly to spatial position, because left-bottom, right-bottom, left-up, right-up, and center have strong correlations.
Ben (4 years ago):
Thanks for your insight on why they use summation versus concatenation! Had also been wondering about that myself.
s.i. (4 years ago):
I don't understand how the vector p(t) starts with the first term being sine and then cosine. Making k=1 means i=2, so what about indices i=0 and i=1? Shouldn't things start from k=0? And if they do, then they won't neatly go to k=d/2 but rather to k=d/2 - 1, unless... we start not with i=0 but i=1 and take the first term to be the cosine.
Melvin > s.i. (4 years ago):
If you start with i=2, your last term in the p(t) vector will be with i=d+1 since you need a d-dimensional vector, and you'll get the same vector as in the article. Whether you start with i=0 or i=1 or whatever the value is, you get a unique encoding (p(t)) for each word (t), and that's the whole point of positional embedding.
Leap of Faith (5 years ago):
Why could this linear property make it easy for the model to learn to attend?

Reply: Please refer to my answer to Rajesh R's comment ([Link]) and let me know if you have any further questions.
mb19029 (5 years ago):
Can you please fix the table mapping numbers 0..15 to their binary representation by replacing the second 2 with the intended 10?
kamal fara (5 years ago):
Dear Amirhossein, thanks for the blog. Can you elaborate on "This property makes it easy for the model to learn to attend by relative positions"?

Amirhossein Kazemnejad Mod > kamal fara (5 years ago):
This is an interesting question. Please refer to my answer to Rajesh R's comment. [Link]
kamatikos (5 years ago):
Good explanation that addresses a lot of the questions that were left out of basically all other breakdowns. Thank you.