AI6126 Advanced Computer Vision
Last update: 17 February 2022 4:45pm
Transformers for
Computer Vision
Chen-Change Loy
吕健勤
https://www.mmlab-ntu.com/
https://twitter.com/ccloy
Outline
• Attention
• Sequence-to-Sequence with RNNs and Attention
• Transformers
• Extra – Vision Transformers
Background
• Convolutional neural networks (CNNs) have been dominating
• Greater scale
• More extensive connections
• More sophisticated forms of convolution
• Transformers
• Competitive alternative to CNN
• Generally found to perform best in settings with large amounts of training data
• Enable multi-modality learning
Credits
• A nice blog by Lilian Weng:
• https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
• https://lilianweng.github.io/lil-log/2020/04/07/the-transformer-family.html
• The illustrated Transformer by Jay Alammar
• https://jalammar.github.io/illustrated-transformer/
Attention
"Where's Waldo?"
Visual Attention
Scientists were divided into two camps
• In the first model, the spotlight of attention would track across the page, checking
each detail against a mental image of Waldo's red stocking cap and striped shirt.
• In the second model, the color red and stocking-cap shapes would gradually come to
the foreground and other shapes and colors would recede.
"Where's Waldo?"
• The study by MIT showed that both processes are going on in the same chunk of the
brain and in the same neurons
• The midregion of the V4 visual cortex is known to be important for attention (V4 is tuned
for object features of intermediate complexity, like simple geometric shapes)
Research explains how the brain finds Waldo, MIT News 2005
Attention
The term “visual attention” refers to a set of cognitive operations that mediate the selection of relevant
and the filtering out of irrelevant information from cluttered visual scenes.
• Reduce complexity
• Resource saving
Attention
Attention is a mechanism that allows a model to
make predictions by selectively attending to a given set
of data.
The amount of attention is quantified by learned
weights, and thus the output is usually formed as a
weighted average.
Self-attention is a type of attention mechanism where
the model makes predictions for one part of a data
sample using other parts of the same sample.
What does “it” in this sentence refer to? Is it referring to the
street or to the animal?
When the model is processing the word “it”, self-attention
allows it to associate “it” with “animal”.
Sequence-to-Sequence
with RNNs and Attention
Input-output scenarios
Sequence-to-Sequence with RNNs
Sequence-to-Sequence with RNNs
Sequence-to-Sequence with RNNs
Sequence-to-Sequence with RNNs
Sequence-to-Sequence with RNNs
Sequence-to-Sequence with RNNs
Context vector summarizes all the information needed by the decoder
Sequence-to-Sequence with RNNs
Idea: use a new context vector at each step of the decoder!
Sequence-to-Sequence with RNNs and Attention
Sequence-to-Sequence with RNNs and Attention
To predict how much we should attend to each hidden state of the encoder given the current
hidden state of the decoder, an alignment function produces a scalar score for each encoder hidden state
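As an illustrative sketch only (a plain dot product stands in for the learned alignment function, which is usually a small MLP), the attention weights and context vector for one decoder step can be computed as follows:

```python
import torch
import torch.nn.functional as F

T_enc, H = 5, 16                 # encoder length and hidden size (illustrative values)
h = torch.randn(T_enc, H)        # encoder hidden states h_1 ... h_T
s_prev = torch.randn(H)          # previous decoder hidden state s_{t-1}

e = h @ s_prev                   # alignment scores: one scalar per encoder step
a = F.softmax(e, dim=0)          # attention weights, sum to 1
c = a @ h                        # context vector c_t = sum_i a_{t,i} * h_i
print(a.shape, c.shape)          # torch.Size([5]) torch.Size([16])
```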
Sequence-to-Sequence with RNNs and Attention
Sequence-to-Sequence with RNNs and Attention
Sequence-to-Sequence with RNNs and Attention
Sequence-to-Sequence with RNNs and Attention
Sequence-to-Sequence with RNNs and Attention
Sequence-to-Sequence with RNNs and Attention
Sequence-to-Sequence with RNNs and Attention
Sequence-to-Sequence with RNNs and Attention
Transformers
Transformers
https://transformer.huggingface.co/doc/distil-gpt2
Transformers
Notable for its use of attention to model
long-range dependencies in data
A sequence-to-sequence model
Model of choice in natural language
processing (NLP)
Step-by-step guide to self-attention with illustrations and code https://jalammar.github.io/illustrated-transformer/
Ashish Vaswani et al., Attention Is All You Need, NIPS 2017 (from Google)
Transformers
Like LSTM, Transformer is an architecture for transforming one
sequence into another one with the help of two parts (Encoder
and Decoder)
But it differs from existing sequence-to-sequence models
because it does not use any recurrent networks (GRU, LSTM,
etc.).
• Layer outputs can be calculated in parallel, instead of
sequentially as in an RNN
• Attention-based models allow modeling of dependencies
without regard to their distance in the input or output
sequences
Ashish Vaswani et al., Attention Is All You Need, NIPS 2017 (from Google)
Transformers
It is entirely built on the self-attention
mechanisms without using sequence-aligned
recurrent architecture
Self-attention is a type of attention mechanism
where the model makes predictions for one part of
a data sample using other parts of the same sample.
What does “it” in this sentence refer to? Is it referring to the
street or to the animal?
When the model is processing the word “it”, self-attention
allows it to associate “it” with “animal”.
Transformers
The encoding component is a stack of
encoders
The decoding component is a stack of the same
number of decoders
Transformers
The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other
words in the input sentence as it encodes a specific word
The outputs of the self-attention layer are fed to a feed-forward neural network.
The exact same feed-forward network is independently applied to each position (each word/token).
Transformers
The decoder has both those layers, but between them is an attention layer that helps the decoder focus on
relevant parts of the input sentence
Transformers
Decoder
Encoder
Transformers
Self-attention = Scaled dot-product attention
The output is a weighted sum of the values, where the weight
assigned to each value is determined by the dot-product of the
query with all the keys
Attention(𝐐, 𝐊, 𝐕) = SoftMax(𝐐𝐊^⊤ / √𝑑_k) 𝐕
Scaled Dot-Product Attention
Self-Attention in Detail
First Step
Create three vectors from each of the encoder’s
input vectors (in this case, the embedding of
each word).
So for each word, we create a Query vector, a
Key vector, and a Value vector.
These vectors are created by multiplying the
embedding by three matrices that are learned
during training.
What are the “query”, “key”, and “value” vectors?
Attention(𝐐, 𝐊, 𝐕) = SoftMax(𝐐𝐊^⊤ / √𝑑_k) 𝐕
Self-Attention in Detail
Second Step
Calculate a score for each word of the input
sentence against the word being encoded.
The score determines how much focus to place
on other parts of the input sentence as we
encode a word at a certain position.
The score is calculated by taking the dot product
of the query vector with the key vector of the
respective word we’re scoring.
Attention(𝐐, 𝐊, 𝐕) = SoftMax(𝐐𝐊^⊤ / √𝑑_k) 𝐕
Self-Attention in Detail
Third Step
Divide the scores by √𝑑_k, the square root of the
dimension of the key vectors
This leads to having more stable gradients (large
similarities will cause softmax to saturate and
give vanishing gradients)
Fourth Step
Softmax for normalization
Attention(𝐐, 𝐊, 𝐕) = SoftMax(𝐐𝐊^⊤ / √𝑑_k) 𝐕
Self-Attention in Detail
Fifth Step
Multiply each value vector by the softmax score
Sixth Step
Sum up the weighted value vectors to get the
output of the self-attention layer at this position
(for the first word)
𝐳_1 = 0.88 𝐯_1 + 0.12 𝐯_2

Attention(𝐐, 𝐊, 𝐕) = SoftMax(𝐐𝐊^⊤ / √𝑑_k) 𝐕
Matrix Calculation of Self-Attention
First Step
Calculate the Query, Key, and Value matrices.
Pack our embeddings into a matrix 𝐗 and multiply it by the weight matrices we have trained (𝐖^Q, 𝐖^K, 𝐖^V).
Every row in the 𝐗 matrix corresponds to a word in the input sentence.
Second Step
Calculate the outputs of the self-attention layer.
SoftMax is row-wise
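A minimal PyTorch sketch of this matrix form (function and variable names such as scaled_dot_product_attention, W_Q, W_K, W_V are chosen here for illustration):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, softmax applied row-wise."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # score matrix, one row per query
    weights = F.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ V                              # weighted sum of value vectors

# Toy example: 2 words, embedding dim 4, key/value dim 3
n, d_model, d_k = 2, 4, 3
X = torch.randn(n, d_model)                         # one row per input word
W_Q, W_K, W_V = (torch.randn(d_model, d_k) for _ in range(3))
Z = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(Z.shape)                                      # torch.Size([2, 3])
```

Recent PyTorch versions also ship torch.nn.functional.scaled_dot_product_attention, which performs the same computation with optimized kernels.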
Multi-Head Self-Attention
Multi-Head Self-Attention
Rather than only computing the attention once, the multi-head mechanism
runs through the scaled dot-product attention multiple times in parallel.
The independent attention outputs are simply concatenated and linearly
transformed into the expected dimensions.
Why?
“Multi-head attention allows the model to jointly attend to information from
different representation subspaces at different positions. ”
Example:
Deep learning (also known as deep structured learning) is part of a broader family
of machine learning methods based on artificial neural networks with representation
learning.
Given “representation learning”, the first head attends to “Deep learning” while the
second head attends to the more general term “machine learning methods”
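A compact sketch of multi-head self-attention, assuming a fused QKV projection for brevity (class and variable names are illustrative, not from a specific library):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """h parallel scaled dot-product attentions, concatenated and mixed by W_O."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused W_Q, W_K, W_V
        self.out = nn.Linear(d_model, d_model)       # W_O

    def forward(self, x):                            # x: (batch, seq, d_model)
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split the model dimension into heads: (B, h, N, d_head)
        q, k, v = (t.view(B, N, self.h, self.d_head).transpose(1, 2) for t in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        z = (attn @ v).transpose(1, 2).reshape(B, N, D)   # concatenate the heads
        return self.out(z)                                # linear transform to d_model

x = torch.randn(1, 10, 512)
print(MultiHeadSelfAttention()(x).shape)             # torch.Size([1, 10, 512])
```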
Positional Encoding
Self-attention operation is permutation
equivariant
SelfAtt(𝜋 ⋅ 𝑋) = 𝜋 ⋅ SelfAtt(𝑋)
Self-attention layer works on sets of
vectors and it doesn’t know the order of
the vectors it is processing
The positional encoding has the same
dimension as the input embedding
Adds a vector to each input embedding
to give information about the relative or
absolute position of the tokens in the
sequence
These vectors follow a specific pattern
Positional Encoding
What might this pattern look like?
Each row corresponds to the positional encoding of a vector.
The first row would be the vector we’d add to the embedding of the first word in an input sequence.
Each position is uniquely encoded, and the encoding can deal with sequences longer than any sequence
seen during training.
Sinusoidal positional encoding interweaves two signals:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where pos is the position and i is the dimension.

Figure: Sinusoidal positional encoding with 32 tokens and an embedding dimension of 128. The value is
between -1 (black) and 1 (white) and the value 0 is in gray.
Example: Positional Encoding
Example:
Given the sinusoidal positional encoding defined above, calculate PE(pos = 1) for the first five dimensions [0, 1, 2, 3, 4].
Assume d_model = 512
Solution:
Given pos = 1 and d_model = 512
At dimension 0, 2i = 0 thus i = 0, therefore PE(pos, 2i) = PE(1, 0) = sin(1/10000^(0/512))
At dimension 1, 2i + 1 = 1 thus i = 0, therefore PE(pos, 2i+1) = PE(1, 1) = cos(1/10000^(0/512))
At dimension 2, 2i = 2 thus i = 1, therefore PE(pos, 2i) = PE(1, 2) = sin(1/10000^(2/512))
At dimension 3, 2i + 1 = 3 thus i = 1, therefore PE(pos, 2i+1) = PE(1, 3) = cos(1/10000^(2/512))
At dimension 4, 2i = 4 thus i = 2, therefore PE(pos, 2i) = PE(1, 4) = sin(1/10000^(4/512))
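The worked example can be checked numerically with a short NumPy sketch (the function name sinusoidal_pe is mine):

```python
import numpy as np

def sinusoidal_pe(pos, d_model=512):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(same angle)."""
    pe = np.zeros(d_model)
    i = np.arange(d_model // 2)
    angle = pos / 10000 ** (2 * i / d_model)
    pe[0::2] = np.sin(angle)   # even dimensions
    pe[1::2] = np.cos(angle)   # odd dimensions
    return pe

print(sinusoidal_pe(1)[:5])
# approx [0.8415, 0.5403, 0.8219, 0.5697, 0.8020], i.e.
# sin(1/10000^(0/512)), cos(1/10000^(0/512)), sin(1/10000^(2/512)), ...
```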
Positional Encoding
Other forms of positional encoding exist (e.g., learnable encodings)
Xuanqing Liu et al., Learning to Encode Position for Transformer with Continuous Dynamical Model, ICML 2020
Transformer Encoder
Encoder
• A stack of 𝑁 = 6 identical layers.
• Each layer has a multi-head self-attention
layer and a simple position-wise fully connected
feed-forward network.
If we’re to visualize the vectors and
the layer-norm operation
associated with self attention, it
would look like this:
• The linear transformations are the same across
different positions, but they use different
parameters from layer to layer.
• Each sub-layer adopts a residual connection and
a layer normalization.
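A sketch of one encoder layer under these assumptions (post-LN ordering as in the original Transformer, ReLU in the feed-forward network, and nn.MultiheadAttention standing in for the attention sub-layer):

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Multi-head self-attention and a position-wise FFN, each wrapped with a
    residual connection followed by LayerNorm (post-LN)."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):                         # x: (batch, seq, d_model)
        x = self.ln1(x + self.attn(x, x, x)[0])   # sub-layer 1: self-attention
        x = self.ln2(x + self.ffn(x))             # sub-layer 2: position-wise FFN
        return x

encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])   # N = 6 identical layers
print(encoder(torch.randn(1, 10, 512)).shape)                  # torch.Size([1, 10, 512])
```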
Transformer Encoder
Figure: BN vs LN, illustrated over the input dimension 𝑁 and the channel dimension 𝐷.

Layer Normalization (LN) 1
• The pixels along the red arrow are normalized by the same mean and variance, computed by aggregating the values of these pixels.
• BN is found to be unstable in Transformers 2
• LN works well with RNNs and is now being used in Transformers

Gaussian Error Linear Units (GELU) 3
• Can be thought of as a smoother ReLU
• Used in GPT-3, BERT, and most other Transformers
1 Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton, Layer Normalization, arXiv:1607.06450
2 Sheng Shen, Zhewei Yao, Amir Gholami, Michael Mahoney, Kurt Keutzer, Rethinking Batch Normalization in Transformers, ICML 2020
3 Dan Hendrycks, Kevin Gimpel, Gaussian Error Linear Units (GELUs), arXiv:1606.08415
Transformer Encoder
The outputs of encoder go to the sub-layers of the decoder as well. If we’re to think of a Transformer of two
stacked encoders and decoders, it would look something like this:
The “Encoder-Decoder Attention”
layer works just like multi-head
self-attention, except it creates its
Queries matrix from the layer
below it, and takes the Keys and
Values matrix from the output of
the encoder stack.
Transformer Decoder
How encoder and decoder work together
• The encoder starts by processing the input sequence
• The output of the top encoder is then transformed
into a set of attention vectors K and V.
• These are to be used by each decoder in its
“encoder-decoder attention” layer which helps the
decoder focus on appropriate places in the input
sequence
After finishing the encoding phase, we begin the decoding phase. Each
step in the decoding phase outputs an element from the output sequence
(the English translation sentence in this case).
Transformer Decoder
How encoder and decoder work together
• The output of each step is fed to the bottom
decoder in the next time step
• Embed and add positional encoding to those
decoder inputs. Process the inputs
• Repeat the process until a special symbol is reached
indicating the transformer decoder has completed
its output.
Note:
In the decoder, the self-attention layer is only allowed to attend to earlier
positions in the output sequence. This is done by masking future positions
(setting them to -inf) before the softmax step in the self-attention
calculation.
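A small sketch of this masking step (assuming the raw scores are already computed; the mask blocks attention to future positions before the softmax):

```python
import torch
import torch.nn.functional as F

# Masked self-attention scores for a 4-token output sequence: positions above
# the diagonal (future tokens) are set to -inf, so after the softmax each row
# only attends to the current and earlier positions.
T = 4
scores = torch.randn(T, T)                              # raw Q K^T / sqrt(d_k) scores
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()  # True above the diagonal
weights = F.softmax(scores.masked_fill(mask, float('-inf')), dim=-1)
print(weights)                                          # upper triangle is exactly 0
```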
Transformers
Decoder
Encoder
John Thickstun, The Transformer Model in Equations
Vision Transformers
Vision Transformer (ViT)
Vision Transformer
• Do not have decoder
• Reshape the image 𝐱 ∈ ℝ^(𝐻×𝑊×𝐶) into a sequence of
flattened 2D patches 𝐱_p ∈ ℝ^(𝑁×(𝑃²·𝐶)), where (𝐻, 𝑊)
is the resolution of the original image, 𝐶 is the
number of channels, (𝑃, 𝑃) is the resolution of each
image patch, and 𝑁 = 𝐻𝑊/𝑃² is the resulting
number of patches
• Patch embedding - linearly embed each of them to
𝐷 dimensions with a trainable linear projection 𝐄
• Add learnable position embeddings 𝐄_pos to retain
positional information
Alexey Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021
Vision Transformer (ViT)
Vision Transformer
• Prepend a learnable embedding (𝐳_0^0 = 𝐱_class) to
the sequence of embedded patches
• Feed the resulting sequence of vectors to a
standard Transformer encoder
• A classification head is attached to 𝐳_L^0 (the class token at the encoder output)
Transformer Encoder
Classification Head
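A sketch of this input pipeline, assuming a 224×224 image, 16×16 patches, and D = 768 (variable names are mine; real implementations typically fold the patch projection into a strided convolution):

```python
import torch
import torch.nn as nn

B, C, H, W, P, D = 1, 3, 224, 224, 16, 768
N = (H // P) * (W // P)                                    # N = HW / P^2 = 196 patches

img = torch.randn(B, C, H, W)
patches = img.unfold(2, P, P).unfold(3, P, P)              # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, N, C * P * P)   # flatten patches

embed = nn.Linear(C * P * P, D)                            # trainable projection E
cls_token = nn.Parameter(torch.zeros(1, 1, D))             # learnable x_class
pos_embed = nn.Parameter(torch.zeros(1, N + 1, D))         # learnable E_pos

tokens = torch.cat([cls_token.expand(B, -1, -1), embed(patches)], dim=1) + pos_embed
print(tokens.shape)                                        # torch.Size([1, 197, 768])
```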
Vision Transformer (ViT)
Transformer Encoder
• Consists of a multi-head self-attention module (MSA),
followed by a 2-layer MLP (with GELU)
• LayerNorm (LN) is applied before MSA module and MLP,
and a residual connection is applied after each module.
Transformer Encoder
Vision Transformer (ViT)
The model attends to image regions that are semantically relevant for classification.

The model learns to encode distance within the image - closer patches tend to have more similar position
embeddings. The row-column structure appears - patches in the same row/column have similar embeddings.

Attention Rollout 1 - attention weights of ViT-L/16 averaged across all heads and then recursively multiplied
with the weight matrices of all layers (allows attention to be more meaningfully visualized and interpreted for
deeper layers in a transformer)

1 Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. In ACL, 2020
Vision Transformer (ViT)
Examine the attention distance, analogous to receptive field in CNN
Compute the average distance in image space across which
information is integrated, based on the attention weights.
• The attention distance increases with network depth
• Some heads attend to most of the image already in the lowest layers, showing the capability of ViT
in integrating information globally
• Other attention heads have consistently small attention distances in the low layers
Vision Transformer (ViT)
Performance of ViT
• ViT performs significantly worse than the CNN equivalent (BiT)
when trained on ImageNet (1M images).
• However, on ImageNet-21k (14M images) performance is
comparable, and on JFT (300M images), ViT outperforms BiT.
• ViT overfits the ImageNet task due to its lack of inbuilt
knowledge about images
ViT conducts global self-attention
• Relationships between a token and all other tokens are
computed
• Quadratic complexity with respect to the number of tokens
• Not tractable for dense prediction or to represent a high-
resolution image
Swin Transformers
Swin Transformer
Swin Transformer
• Perform local self-attention thus having linear computational complexity with respect to
the number of tokens
• Shifted window between consecutive self-attention layers to allow connections
• Flexibility to model at various scales with hierarchical representation
Z. Liu et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, ICCV 2021
Swin Transformer
Patch Partition
• Split an input RGB image into non-overlapping patches
• Treat each patch as a “token”; with a 4×4 patch size, the feature dimension of each patch is
4×4×3 = 48 and the number of tokens is (𝐻/4) × (𝑊/4)
• Concatenate raw pixel RGB values as the patch feature

Linear Embedding
• Applied on the raw-valued feature to project it to an arbitrary dimension, denoted as 𝐶
Swin Transformer
Patch Merging
• Reminiscent of pooling in CNNs
• Reduce the dimension, form hierarchical representation
• Concatenate the features of each group of 2 × 2
neighboring patches
• Apply a linear layer on the 4𝐶-dimensional
concatenated features to get an output dimension of 2𝐶
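A sketch of patch merging under these assumptions (channel-last layout; variable names are mine):

```python
import torch
import torch.nn as nn

# Concatenate the features of each 2x2 group of neighboring patches (C -> 4C),
# then apply a linear layer that reduces the dimension to 2C.
# Spatial resolution halves, much like pooling in a CNN.
B, H, W, C = 1, 56, 56, 96
x = torch.randn(B, H, W, C)

x0 = x[:, 0::2, 0::2, :]                       # top-left patch of each 2x2 group
x1 = x[:, 1::2, 0::2, :]                       # bottom-left
x2 = x[:, 0::2, 1::2, :]                       # top-right
x3 = x[:, 1::2, 1::2, :]                       # bottom-right
merged = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)

reduction = nn.Linear(4 * C, 2 * C)            # 4C-dimensional features -> 2C
out = reduction(merged)
print(out.shape)                               # torch.Size([1, 28, 28, 192])
```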
Swin Transformer
Swin Transformer Block
With modified self-attention computation
Repeating the same process of Stage 2 to produce a hierarchical representation
Swin Transformer
Swin Transformer conducts local self-attention
• Limiting self-attention computation to non-
overlapping local windows
• The number of patches in each window is fixed
• Supposing each window contains 𝑀 × 𝑀 patches and
there are ℎ × 𝑤 patches on an image
Conventional multi-head self-attention (MSA), quadratic in the patch number ℎ𝑤 (omitting SoftMax):
Ω(MSA) = 4ℎ𝑤𝐶² + 2(ℎ𝑤)²𝐶

Window-based self-attention (W-MSA), linear in the patch number ℎ𝑤 (omitting SoftMax):
Ω(W-MSA) = 4ℎ𝑤𝐶² + 2𝑀²ℎ𝑤𝐶
Swin Transformer
Supposing each window contains 𝑀 × 𝑀 patches and there are ℎ × 𝑤 patches on an image.

(Side note: if C = AB for an n × m matrix A and an m × p matrix B, then C is an n × p matrix, and computing it
takes Θ(nmp) time in asymptotic notation.)

For conventional multi-head self-attention (MSA):
1. For each patch we need to compute the respective 𝑄, 𝐾, and 𝑉, with 𝐐 = 𝐗𝐖^Q, 𝐊 = 𝐗𝐖^K, and
𝐕 = 𝐗𝐖^V, where 𝐗 ∈ ℝ^(ℎ𝑤×𝐶) and 𝐖 ∈ ℝ^(𝐶×𝐶). Thus, the complexity is 3ℎ𝑤𝐶²
2. Compute 𝐐𝐊^⊤, where 𝐐, 𝐊, 𝐕 ∈ ℝ^(ℎ𝑤×𝐶). Thus, the complexity is (ℎ𝑤)²𝐶
3. Apply SoftMax and multiply with 𝐕 to obtain 𝐙. Since 𝐐𝐊^⊤ ∈ ℝ^(ℎ𝑤×ℎ𝑤), this operation takes (ℎ𝑤)²𝐶
4. The final output is obtained by multiplying 𝐙 with the output matrix 𝐖^O. The complexity is ℎ𝑤𝐶²
Hence, the final complexity is Ω(MSA) = 4ℎ𝑤𝐶² + 2(ℎ𝑤)²𝐶 - quadratic in the patch number ℎ𝑤
Swin Transformer
Window-based self attention (W-MSA) performs self-attention within windows, each of which contains
𝑀 × 𝑀 patches
1. [Same as MSA] the complexity is 3ℎ𝑤𝐶²
2. In W-MSA, there are (ℎ/𝑀) × (𝑤/𝑀) windows. In each window, the complexity of computing 𝐐𝐊^⊤ is
(𝑀²)²𝐶, and thus the total complexity is (ℎ/𝑀) × (𝑤/𝑀) × (𝑀²)²𝐶 = 𝑀²ℎ𝑤𝐶
3. Apply SoftMax and multiply with 𝐕 to obtain 𝐙. For the same reason as above, this operation takes 𝑀²ℎ𝑤𝐶
4. [Same as MSA] the complexity is ℎ𝑤𝐶²
• Hence, the final complexity is Ω(W-MSA) = 4ℎ𝑤𝐶² + 2𝑀²ℎ𝑤𝐶 - linear in the patch number ℎ𝑤
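Plugging representative numbers into the two formulas makes the gap concrete (the values below are illustrative only, e.g., a 56×56 token map with C = 96 and window size M = 7):

```python
# Evaluate the two complexity formulas above for one example configuration.
h = w = 56          # tokens per side (H/4 x W/4 for a 224x224 image with 4x4 patches)
C, M = 96, 7        # channel dimension and window size

msa  = 4 * h * w * C**2 + 2 * (h * w)**2 * C      # Omega(MSA):   quadratic in hw
wmsa = 4 * h * w * C**2 + 2 * M**2 * h * w * C    # Omega(W-MSA): linear in hw

print(f"MSA:   {msa:.3e} mult-adds")   # ~2.0e9, dominated by the (hw)^2 term
print(f"W-MSA: {wmsa:.3e} mult-adds")  # ~1.5e8
```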
Swin Transformer
Shifted window partitioning in successive blocks
• Enhance connections by bridging the windows
of the preceding layer to improve modelling
power
• Alternating between two partitioning
configurations
(Left) A regular window partitioning scheme is adopted, and self-attention is computed within each window.
(Right) The window partitioning is shifted, resulting in new windows. The self-attention computation in the
new windows crosses the boundaries of the previous windows in the preceding layer, providing connections
among them.
Swin Transformer
Successive Swin Transformer Blocks
• The first block consists of a window-based MSA module (W-
MSA), followed by a 2-layer MLP (with GELU)
• The second block consists of a shifted window based MSA
module (SW-MSA), followed by a 2-layer MLP (with GELU)
• Layer normalization (LN) is applied before each MSA
module and each MLP, and a residual connection is applied
after each module.
Two Successive Swin Transformer Blocks
Swin Transformer
Issues with shifted window partitioning
• Results in more windows, from ℎ/𝑀 × 𝑤/𝑀 to (ℎ/𝑀 + 1) × (𝑤/𝑀 + 1) in the shifted configuration.
For instance, from 2×2 to 3×3
• Some of the windows will be smaller than 𝑀×𝑀
• Naïve solution: pad the smaller windows, but the computation is 2.25× greater (for the case 2×2 → 3×3)
Swin Transformer
Efficient batch computation
• The goal is to maintain the same number of batched
windows as that of regular window partitioning
• Cyclic-shifting toward the top-left direction
• Using masked MSA to limit self-attention computation to
within each sub-window
Credit: https://zhuanlan.zhihu.com/p/367111046
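A sketch of the cyclic shift using torch.roll (the attention mask itself is omitted; the shift size and channel-last layout are assumptions made for illustration):

```python
import torch

# Roll the feature map toward the top-left so that the shifted windows can be
# batched exactly like regular windows; a mask then blocks attention between
# patches that were not adjacent before the roll.
B, H, W, C = 1, 56, 56, 96
shift = 3                                         # typically M // 2 for M = 7
x = torch.randn(B, H, W, C)

shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))        # cyclic shift
# ... window partition + masked self-attention would run here ...
restored = torch.roll(shifted, shifts=(shift, shift), dims=(1, 2))   # reverse shift
print(torch.equal(x, restored))                                      # True
```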
Swin Transformer
Image Classification
• The image classification is performed by
applying a global average pooling layer on the
output feature map of the last stage, followed
by a linear classifier
• Found to be as accurate as using an additional
class token as in ViT

Swin Transformer achieves a better speed-accuracy
trade-off than state-of-the-art CNNs
Video Swin Transformer
Extension to spatiotemporal domain
The mechanism of shifted windows is reformulated to process spatiotemporal input
Ze Liu et al., Video Swin Transformer, arXiv:2106.13230
Video Swin Transformer
Extension to spatiotemporal domain
The architecture is also adjusted
Swin Transformer
Video Swin Transformer
Ze Liu et al., Video Swin Transformer, arXiv:2106.13230
Applications of Transformer
Set prediction problem: DETR simplifies the detection pipeline by dropping multiple hand-designed components
that encode prior knowledge, like spatial anchors or non-maximal suppression.
• DETR uses a conventional CNN backbone to learn a 2D representation of an input image.
• The model flattens it and supplements it with a positional encoding before passing it into a transformer
encoder.
• A transformer decoder then takes as input a small fixed number of learned positional embeddings (object
queries), and additionally attends to the encoder output.
• Pass each output embedding of the decoder to a shared feed forward network (FFN) that predicts either a
detection (class and bounding box) or a “no object” class.
Nicolas Carion et al., End-to-End Object Detection with Transformers , ECCV 2020
Applications of Transformer
Restored output
Super-resolves the LR image with the guidance of an additional
high-resolution (HR) reference image. Textures of the HR
reference image are transferred to provide more fine details for
the LR image.
Fuzhi Yang et al., Learning Texture Transformer Network for Image Super-Resolution, CVPR 2020
Applications of Transformer
3D human texture estimation from a single image
Xiangyu Xu, Chen Change Loy, 3D Human Texture Estimation from a Single Image with Transformers, ICCV 2021
Applications of Transformer
3D human texture estimation from a single image
Most existing methods use CNNs to predict the 3D human texture (i.e., a UV map) from
the input image
The input and output do not have strictly-aligned spatial correspondences and may
even have totally different shapes
Convolution layers are by design local operations and inefficient in processing global
information that is crucial in 3D human texture estimation
Xiangyu Xu, Chen Change Loy, 3D Human Texture Estimation from a Single Image with Transformers, ICCV 2021
Applications of Transformer
3D human texture estimation from a single image
Transformer uses the attention mechanism to more
effectively exploit global information, which leads to higher-
quality 3D human texture estimation.
Query = color encoding map obtained from 3D coordinates of a
standard 3D body mesh, and each element in the Query corresponds to
a physical vertex
Xiangyu Xu, Chen Change Loy, 3D Human Texture Estimation from a Single Image with Transformers, ICCV 2021
Pretraining of Transformers
Attain excellent results when pretrained on large-scale datasets
(e.g., JFT-300M)
Lower accuracy than ResNet when trained only on ImageNet

ViT lacks the inductive biases inherent to CNNs
• Inductive biases are the characteristics of learning algorithms
that influence their generalization behaviour, independent of data.
• Translation equivariance, two-dimensional neighborhood
structure, and locality
• ViT thus does not generalize well given insufficient training data

Side note: an equivariant mapping is a mapping which preserves the algebraic structure of a transformation.
A translation equivariant mapping is a mapping which, when the input is translated, produces a
correspondingly translated output.
Transformers vs. CNN
• Do Transformers act like convolutions?
• Do they act like convolutions, learning the same inductive biases from scratch?
• Or are they developing novel task representations?
Maithra Raghu et al., Do Vision Transformers See Like Convolutional Neural Networks?, arXiv:2108.08810
Transformers vs. CNN
Tool: Centered kernel alignment (CKA) to compare the representations (activations) between all pairs of layers -
measure meaningful similarities between representations of higher dimension than the number of data points
CKA(𝐊, 𝐋) = HSIC(𝐊, 𝐋) / √(HSIC(𝐊, 𝐊) · HSIC(𝐋, 𝐋)), where 𝐊 and 𝐋 are the Gram matrices for the two layers
and HSIC is the Hilbert-Schmidt Independence Criterion
S. Kornblith, M. Norouzi, H. Lee, and G. Hinton. Similarity of neural network representations revisited. In ICML, 2019
A. Gretton, O. Bousquet, A.J. Smola, and B. Scholkopf. Measuring statistical dependence with Hilbert Schmidt norms. In ALT
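A minimal NumPy sketch of linear CKA written in terms of the Gram matrices (the function name and toy data are mine):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices (examples x features),
    expressed via the Gram matrices K = X X^T and L = Y Y^T."""
    X = X - X.mean(axis=0)                     # center the features
    Y = Y - Y.mean(axis=0)
    K, L = X @ X.T, Y @ Y.T                    # Gram matrices for the two layers
    hsic = (K * L).sum()                       # inner product <K, L>_F (HSIC up to scaling)
    return hsic / (np.linalg.norm(K) * np.linalg.norm(L))

# Toy usage: similarity of two random "layers" over the same 100 examples
a, b = np.random.randn(100, 64), np.random.randn(100, 128)
print(linear_cka(a, a), linear_cka(a, b))      # 1.0 and a small value
```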
Transformers vs. CNN
ViT has more uniform representations, with greater similarity between lower and higher layers.
Transformers vs. CNN
Cross-model comparisons
The lower half of 60 ResNet layers are similar to
approximately the lowest quarter of ViT layers.
In particular, many more lower layers in the
ResNet are needed to compute similar
representations to the lower layers of ViT.
The top half of the ResNet is approximately
similar to the next third of the ViT layers.
The final third of ViT layers is less similar to all
ResNet layers
Transformers vs. CNN
(i) ViT lower layers compute representations in a different way to lower layers in the ResNet
(ii) ViT also more strongly propagates representations between lower and higher layers
(iii) The highest layers of ViT have quite different representations to ResNet

(Figure: within-model and cross-model CKA similarity)
Transformers vs. CNN
With large-scale pre-training
Two highest layers in the ViT: at higher layers, all self-attention heads are global.

Two lowest layers in the ViT: self-attention layers have a mix of local heads (small distances) and global
heads (large distances). This is in contrast to CNNs, which are hardcoded to attend only locally in the lower
layers.

(Plot: average distance between the query patch position and the locations it attends to, with heads sorted
by their average distance.)
Transformers vs. CNN
With NO large-scale pre-training (just training on ImageNet) – much lower performance
Two highest layers in the ViT: at higher layers, all self-attention heads are global.

Two lowest layers in the ViT: ViT does not learn to attend locally in earlier layers!

Using local information early on for image tasks (which is hardcoded into CNN architectures) is important
for strong performance.

(Plot: average distance between the query patch position and the locations it attends to, with heads sorted
by their average attention distance.)
Further Reading
Credit: Zang Yuhang