AI6126 Advanced Computer Vision
Last update: 17 February 2022 4:45pm
Transformers for
Computer Vision
Chen-Change Loy
吕健勤
https://www.mmlab-ntu.com/
https://twitter.com/ccloy
Outline
• Attention
• Sequence-to-Sequence with RNNs and Attention
• Transformers
• Extra – Vision Transformers
Background
• Convolutional neural networks (CNNs) have been dominating
• Greater scale
• More extensive connections
• More sophisticated forms of convolution
• Transformers
• Competitive alternative to CNN
• Generally found to perform best in settings with large amounts of training data
• Enable multi-modality learning
Credits
• A nice blog by Lilian Weng:
• https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
• https://lilianweng.github.io/lil-log/2020/04/07/the-transformer-family.html
• The illustrated Transformer by Jay Alammar
• https://jalammar.github.io/illustrated-transformer/
Attention
"Where's Waldo?"
Visual Attention
Scientists were divided into two camps
• In the first model, the spotlight of attention would track across the page, checking
each detail against a mental image of Waldo's red stocking cap and striped shirt.
• In the second model, the color red and stocking-cap shapes would gradually come to
the foreground and other shapes and colors would recede.
"Where's Waldo?"
• The study by MIT showed that both processes are going on in the same chunk of the
brain and in the same neurons
• The midregion of the V4 visual cortex is known to be important for attention (V4 is tuned
for object features of intermediate complexity, like simple geometric shapes)
Research explains how the brain finds Waldo, MIT News 2005
Attention
The term “visual attention” refers to a set of cognitive operations that mediate the selection of relevant
and the filtering out of irrelevant information from cluttered visual scenes.
• Reduce complexity
• Resource saving
Attention
Attention is a mechanism that allows a model to
make predictions by selectively attending to a given set
of data.
The amount of attention is quantified by learned
weights, and thus the output is usually formed as a
weighted average.
Self-attention is a type of attention mechanism where
the model makes predictions for one part of a data
sample using other parts of the same sample.
What does “it” in this sentence refer to? Is it referring to the
street or to the animal?
When the model is processing the word “it”, self-attention
allows it to associate “it” with “animal”.
Sequence-to-Sequence
with RNNs and Attention
Input-output scenarios
Sequence-to-Sequence with RNNs
Sequence-to-Sequence with RNNs
Sequence-to-Sequence with RNNs
Sequence-to-Sequence with RNNs
Sequence-to-Sequence with RNNs
Sequence-to-Sequence with RNNs
Context vector summarizes all the information needed by the decoder
Sequence-to-Sequence with RNNs
Idea: use a new context vector at each step of the decoder!
Sequence-to-Sequence with RNNs and Attention
Sequence-to-Sequence with RNNs and Attention
To predict how much we should attend to each hidden state of the encoder given the current
hidden state of the decoder, an alignment function produces a scalar score for each encoder hidden state
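As an illustrative sketch only (a plain dot product stands in for the learned alignment function, which is usually a small MLP), the attention weights and context vector for one decoder step can be computed as follows:

```python
import torch
import torch.nn.functional as F

T_enc, H = 5, 16                 # encoder length and hidden size (illustrative values)
h = torch.randn(T_enc, H)        # encoder hidden states h_1 ... h_T
s_prev = torch.randn(H)          # previous decoder hidden state s_{t-1}

e = h @ s_prev                   # alignment scores: one scalar per encoder step
a = F.softmax(e, dim=0)          # attention weights, sum to 1
c = a @ h                        # context vector c_t = sum_i a_{t,i} * h_i
print(a.shape, c.shape)          # torch.Size([5]) torch.Size([16])
```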
Sequence-to-Sequence with RNNs and Attention
Sequence-to-Sequence with RNNs and Attention
Sequence-to-Sequence with RNNs and Attention
Sequence-to-Sequence with RNNs and Attention
Sequence-to-Sequence with RNNs and Attention
Sequence-to-Sequence with RNNs and Attention
Sequence-to-Sequence with RNNs and Attention
Sequence-to-Sequence with RNNs and Attention
Transformers
Transformers
https://transformer.huggingface.co/doc/distil-gpt2
Transformers
Notable for its use of attention to model
long-range dependencies in data
A sequence-to-sequence model
Model of choice in natural language
processing (NLP)
Step-by-step guide to self-attention with illustrations and code https://jalammar.github.io/illustrated-transformer/
Ashish Vaswani et al., Attention Is All You Need, NIPS 2017 (from Google)
Transformers
Like LSTM, Transformer is an architecture for transforming one
sequence into another one with the help of two parts (Encoder
and Decoder)
But it differs from existing sequence-to-sequence models
because it does not use any recurrent networks (GRU, LSTM,
etc.).
• Layer outputs can be calculated in parallel, instead of
sequentially as in an RNN
• Attention-based models allow modeling of dependencies
without regard to their distance in the input or output
sequences
Ashish Vaswani et al., Attention Is All You Need, NIPS 2017 (from Google)
Transformers
It is entirely built on the self-attention
mechanisms without using sequence-aligned
recurrent architecture
Self-attention is a type of attention mechanism
where the model makes predictions for one part of
a data sample using other parts of the same sample.
What does “it” in this sentence refer to? Is it referring to the
street or to the animal?
When the model is processing the word “it”, self-attention
allows it to associate “it” with “animal”.
Transformers
The encoding component is a stack of
encoders
The decoding component is a stack of the same
number of decoders
Transformers
The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other
words in the input sentence as it encodes a specific word
The outputs of the self-attention layer are fed to a feed-forward neural network.
The exact same feed-forward network is independently applied to each position (each word/token).
Transformers
The decoder has both those layers, but between them is an attention layer that helps the decoder focus on
relevant parts of the input sentence
Transformers
Decoder
Encoder
Transformers
Self-attention = Scaled dot-product attention
The output is a weighted sum of the values, where the weight
assigned to each value is determined by the dot-product of the
query with all the keys
Attention(𝐐, 𝐊, 𝐕) = SoftMax(𝐐𝐊^⊤ / √𝑑_k) 𝐕
Scaled Dot-Product Attention
Self-Attention in Detail
First Step
Create three vectors from each of the encoder’s
input vectors (in this case, the embedding of
each word).
So for each word, we create a Query vector, a
Key vector, and a Value vector.
These vectors are created by multiplying the
embedding by three matrices that are learned
during training.
What are the “query”, “key”, and “value” vectors?
Attention(𝐐, 𝐊, 𝐕) = SoftMax(𝐐𝐊^⊤ / √𝑑_k) 𝐕
Self-Attention in Detail
Second Step
Calculate a score for each word of the input
sentence against the word being encoded.
The score determines how much focus to place
on other parts of the input sentence as we
encode a word at a certain position.
The score is calculated by taking the dot product
of the query vector with the key vector of the
respective word we’re scoring.
Attention(𝐐, 𝐊, 𝐕) = SoftMax(𝐐𝐊^⊤ / √𝑑_k) 𝐕
Self-Attention in Detail
Third Step
Divide the scores by √𝑑_k, the square root of the
dimension of the key vectors
This leads to having more stable gradients (large
similarities will cause softmax to saturate and
give vanishing gradients)
Fourth Step
Softmax for normalization
Attention(𝐐, 𝐊, 𝐕) = SoftMax(𝐐𝐊^⊤ / √𝑑_k) 𝐕
Self-Attention in Detail
Fifth Step
Multiply each value vector by the softmax score
Sixth Step
Sum up the weighted value vectors to get the
output of the self-attention layer at this position
(for the first word)
𝐳_1 = 0.88 𝐯_1 + 0.12 𝐯_2

Attention(𝐐, 𝐊, 𝐕) = SoftMax(𝐐𝐊^⊤ / √𝑑_k) 𝐕
Matrix Calculation of Self-Attention
First Step
Calculate the Query, Key, and Value matrices.
Pack our embeddings into a matrix 𝐗 and multiply it by the weight matrices we have trained (𝐖^Q, 𝐖^K, 𝐖^V).
Every row in the 𝐗 matrix corresponds to a word in the input sentence.
Second Step
Calculate the outputs of the self-attention layer.
SoftMax is row-wise
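A minimal PyTorch sketch of this matrix form (function and variable names such as scaled_dot_product_attention, W_Q, W_K, W_V are chosen here for illustration):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, softmax applied row-wise."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # score matrix, one row per query
    weights = F.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ V                              # weighted sum of value vectors

# Toy example: 2 words, embedding dim 4, key/value dim 3
n, d_model, d_k = 2, 4, 3
X = torch.randn(n, d_model)                         # one row per input word
W_Q, W_K, W_V = (torch.randn(d_model, d_k) for _ in range(3))
Z = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(Z.shape)                                      # torch.Size([2, 3])
```

Recent PyTorch versions also ship torch.nn.functional.scaled_dot_product_attention, which performs the same computation with optimized kernels.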
Multi-Head Self-Attention
Multi-Head Self-Attention
Rather than only computing the attention once, the multi-head mechanism
runs through the scaled dot-product attention multiple times in parallel.
The independent attention outputs are simply concatenated and linearly
transformed into the expected dimensions.
Why?
“Multi-head attention allows the model to jointly attend to information from
different representation subspaces at different positions. ”
Example:
Deep learning (also known as deep structured learning) is part of a broader family
of machine learning methods based on artificial neural networks with representation
learning.
Given “representation learning”, the first head attends to “Deep learning” while the
second head attends to the more general term “machine learning methods”
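A compact sketch of multi-head self-attention, assuming a fused QKV projection for brevity (class and variable names are illustrative, not from a specific library):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """h parallel scaled dot-product attentions, concatenated and mixed by W_O."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused W_Q, W_K, W_V
        self.out = nn.Linear(d_model, d_model)       # W_O

    def forward(self, x):                            # x: (batch, seq, d_model)
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split the model dimension into heads: (B, h, N, d_head)
        q, k, v = (t.view(B, N, self.h, self.d_head).transpose(1, 2) for t in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        z = (attn @ v).transpose(1, 2).reshape(B, N, D)   # concatenate the heads
        return self.out(z)                                # linear transform to d_model

x = torch.randn(1, 10, 512)
print(MultiHeadSelfAttention()(x).shape)             # torch.Size([1, 10, 512])
```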
Positional Encoding
Self-attention operation is permutation
equivariant
SelfAtt(𝜋 ⋅ 𝑋) = 𝜋 ⋅ SelfAtt(𝑋)
Self-attention layer works on sets of
vectors and it doesn’t know the order of
the vectors it is processing
The positional encoding has the same
dimension as the input embedding
Adds a vector to each input embedding
to give information about the relative or
absolute position of the tokens in the
sequence
These vectors follow a specific pattern
Positional Encoding
What might this pattern look like?
Each row corresponds to the positional encoding of a vector.
The first row would be the vector we’d add to the embedding of the first word in an input sequence.
Each position is uniquely encoded, and the encoding can deal with sequences longer than any sequence
seen during training.
Sinusoidal positional encoding interweaves two signals:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where pos is the position and i is the dimension.

Figure: Sinusoidal positional encoding with 32 tokens and an embedding dimension of 128. The value is
between -1 (black) and 1 (white) and the value 0 is in gray.
Example: Positional Encoding
Example:
Given the sinusoidal positional encoding defined above, calculate PE(pos = 1) for the first five dimensions [0, 1, 2, 3, 4].
Assume d_model = 512
Solution:
Given pos = 1 and d_model = 512
At dimension 0, 2i = 0 thus i = 0, therefore PE(pos, 2i) = PE(1, 0) = sin(1/10000^(0/512))
At dimension 1, 2i + 1 = 1 thus i = 0, therefore PE(pos, 2i+1) = PE(1, 1) = cos(1/10000^(0/512))
At dimension 2, 2i = 2 thus i = 1, therefore PE(pos, 2i) = PE(1, 2) = sin(1/10000^(2/512))
At dimension 3, 2i + 1 = 3 thus i = 1, therefore PE(pos, 2i+1) = PE(1, 3) = cos(1/10000^(2/512))
At dimension 4, 2i = 4 thus i = 2, therefore PE(pos, 2i) = PE(1, 4) = sin(1/10000^(4/512))
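The worked example can be checked numerically with a short NumPy sketch (the function name sinusoidal_pe is mine):

```python
import numpy as np

def sinusoidal_pe(pos, d_model=512):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(same angle)."""
    pe = np.zeros(d_model)
    i = np.arange(d_model // 2)
    angle = pos / 10000 ** (2 * i / d_model)
    pe[0::2] = np.sin(angle)   # even dimensions
    pe[1::2] = np.cos(angle)   # odd dimensions
    return pe

print(sinusoidal_pe(1)[:5])
# approx [0.8415, 0.5403, 0.8219, 0.5697, 0.8020], i.e.
# sin(1/10000^(0/512)), cos(1/10000^(0/512)), sin(1/10000^(2/512)), ...
```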
Positional Encoding
Other forms of positional encoding exist (e.g., learnable encodings)
Xuanqing Liu et al., Learning to Encode Position for Transformer with Continuous Dynamical Model, ICML 2020
Transformer Encoder
Encoder
• A stack of 𝑁 = 6 identical layers.
• Each layer has a multi-head self-attention
layer and a simple position-wise fully connected
feed-forward network.
If we’re to visualize the vectors and
the layer-norm operation
associated with self attention, it
would look like this:
• The linear transformations are the same across
different positions, but they use different
parameters from layer to layer.
• Each sub-layer adopts a residual connection and
a layer normalization.
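A sketch of one encoder layer under these assumptions (post-LN ordering as in the original Transformer, ReLU in the feed-forward network, and nn.MultiheadAttention standing in for the attention sub-layer):

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Multi-head self-attention and a position-wise FFN, each wrapped with a
    residual connection followed by LayerNorm (post-LN)."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):                         # x: (batch, seq, d_model)
        x = self.ln1(x + self.attn(x, x, x)[0])   # sub-layer 1: self-attention
        x = self.ln2(x + self.ffn(x))             # sub-layer 2: position-wise FFN
        return x

encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])   # N = 6 identical layers
print(encoder(torch.randn(1, 10, 512)).shape)                  # torch.Size([1, 10, 512])
```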
Transformer Encoder
Figure: BN vs LN, illustrated over the input dimension 𝑁 and the channel dimension 𝐷.

Layer Normalization (LN) 1
• The pixels along the red arrow are normalized by the same mean and variance, computed by aggregating the values of these pixels.
• BN is found to be unstable in Transformers 2
• LN works well with RNNs and is now being used in Transformers

Gaussian Error Linear Units (GELU) 3
• Can be thought of as a smoother ReLU
• Used in GPT-3, BERT, and most other Transformers
1 Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton, Layer Normalization, arXiv:1607.06450
2 Sheng Shen, Zhewei Yao, Amir Gholami, Michael Mahoney, Kurt Keutzer, Rethinking Batch Normalization in Transformers, ICML 2020
3 Dan Hendrycks, Kevin Gimpel, Gaussian Error Linear Units (GELUs), arXiv:1606.08415
Transformer Encoder
The outputs of encoder go to the sub-layers of the decoder as well. If we’re to think of a Transformer of two
stacked encoders and decoders, it would look something like this:
The “Encoder-Decoder Attention”
layer works just like multi-head
self-attention, except it creates its
Queries matrix from the layer
below it, and takes the Keys and
Values matrix from the output of
the encoder stack.
Transformer Decoder
How encoder and decoder work together
• The encoder starts by processing the input sequence
• The output of the top encoder is then transformed
into a set of attention vectors K and V.
• These are to be used by each decoder in its
“encoder-decoder attention” layer which helps the
decoder focus on appropriate places in the input
sequence
After finishing the encoding phase, we begin the decoding phase. Each
step in the decoding phase outputs an element from the output sequence
(the English translation sentence in this case).
Transformer Decoder
How encoder and decoder work together
• The output of each step is fed to the bottom
decoder in the next time step
• Embed and add positional encoding to those
decoder inputs. Process the inputs
• Repeat the process until a special symbol is reached
indicating the transformer decoder has completed
its output.
Note:
In the decoder, the self-attention layer is only allowed to attend to earlier
positions in the output sequence. This is done by masking future positions
(setting them to -inf) before the softmax step in the self-attention
calculation.
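A small sketch of this masking step (assuming the raw scores are already computed; the mask blocks attention to future positions before the softmax):

```python
import torch
import torch.nn.functional as F

# Masked self-attention scores for a 4-token output sequence: positions above
# the diagonal (future tokens) are set to -inf, so after the softmax each row
# only attends to the current and earlier positions.
T = 4
scores = torch.randn(T, T)                              # raw Q K^T / sqrt(d_k) scores
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()  # True above the diagonal
weights = F.softmax(scores.masked_fill(mask, float('-inf')), dim=-1)
print(weights)                                          # upper triangle is exactly 0
```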
Transformers
Decoder
Encoder
John Thickstun, The Transformer Model in Equations
Vision Transformers
Vision Transformer (ViT)
Vision Transformer
• Do not have decoder
• Reshape the image 𝐱 ∈ ℝ^(𝐻×𝑊×𝐶) into a sequence of
flattened 2D patches 𝐱_p ∈ ℝ^(𝑁×(𝑃²·𝐶)), where (𝐻, 𝑊)
is the resolution of the original image, 𝐶 is the
number of channels, (𝑃, 𝑃) is the resolution of each
image patch, and 𝑁 = 𝐻𝑊/𝑃² is the resulting
number of patches
• Patch embedding - linearly embed each of them to
𝐷 dimensions with a trainable linear projection 𝐄
• Add learnable position embeddings 𝐄_pos to retain
positional information
Alexey Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021
Vision Transformer (ViT)
Vision Transformer
• Prepend a learnable embedding (𝐳_0^0 = 𝐱_class) to
the sequence of embedded patches
• Feed the resulting sequence of vectors to a
standard Transformer encoder
• A classification head is attached to 𝐳_L^0 (the class token at the encoder output)
Transformer Encoder
Classification Head
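A sketch of this input pipeline, assuming a 224×224 image, 16×16 patches, and D = 768 (variable names are mine; real implementations typically fold the patch projection into a strided convolution):

```python
import torch
import torch.nn as nn

B, C, H, W, P, D = 1, 3, 224, 224, 16, 768
N = (H // P) * (W // P)                                    # N = HW / P^2 = 196 patches

img = torch.randn(B, C, H, W)
patches = img.unfold(2, P, P).unfold(3, P, P)              # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, N, C * P * P)   # flatten patches

embed = nn.Linear(C * P * P, D)                            # trainable projection E
cls_token = nn.Parameter(torch.zeros(1, 1, D))             # learnable x_class
pos_embed = nn.Parameter(torch.zeros(1, N + 1, D))         # learnable E_pos

tokens = torch.cat([cls_token.expand(B, -1, -1), embed(patches)], dim=1) + pos_embed
print(tokens.shape)                                        # torch.Size([1, 197, 768])
```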
Vision Transformer (ViT)
Transformer Encoder
• Consists of a multi-head self-attention module (MSA),
followed by a 2-layer MLP (with GELU)
• LayerNorm (LN) is applied before MSA module and MLP,
and a residual connection is applied after each module.
Transformer Encoder
Vision Transformer (ViT)
The model attends to image regions that are semantically relevant for classification.

The model learns to encode distance within the image - closer patches tend to have more similar position
embeddings. The row-column structure appears - patches in the same row/column have similar embeddings.

Attention Rollout 1 - attention weights of ViT-L/16 averaged across all heads and then recursively multiplied
with the weight matrices of all layers (allows attention to be more meaningfully visualized and interpreted for
deeper layers in a transformer)

1 Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. In ACL, 2020
Vision Transformer (ViT)
Examine the attention distance, analogous to receptive field in CNN
Compute the average distance in image space across which
information is integrated, based on the attention weights.
• The attention distance increases with network depth
• Some heads attend to most of the image already in the lowest layers, showing the capability of ViT
in integrating information globally
• Other attention heads have consistently small attention distances in the low layers
Vision Transformer (ViT)
Performance of ViT
• ViT performs significantly worse than the CNN equivalent (BiT)
when trained on ImageNet (1M images).
• However, on ImageNet-21k (14M images) performance is
comparable, and on JFT (300M images), ViT outperforms BiT.
• ViT overfits the ImageNet task due to its lack of inbuilt
knowledge about images
ViT conducts global self-attention
• Relationships between a token and all other tokens are
computed
• Quadratic complexity with respect to the number of tokens
• Not tractable for dense prediction or to represent a high-
resolution image
Swin Transformers
Swin Transformer
Swin Transformer
• Perform local self-attention thus having linear computational complexity with respect to
the number of tokens
• Shifted window between consecutive self-attention layers to allow connections
• Flexibility to model at various scales with hierarchical representation
Z. Liu et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, ICCV 2021
Swin Transformer
Patch Partition
• Split an input RGB image into non-overlapping patches
• Treat each patch as a “token”; with a 4×4 patch size, the feature dimension of each patch is
4×4×3 = 48 and the number of tokens is (𝐻/4) × (𝑊/4)
• Concatenate raw pixel RGB values as the patch feature

Linear Embedding
• Applied on the raw-valued feature to project it to an arbitrary dimension, denoted as 𝐶
Swin Transformer
Patch Merging
• Reminiscent of pooling in CNNs
• Reduce the dimension, form hierarchical representation
• Concatenate the features of each group of 2 × 2
neighboring patches
• Apply a linear layer on the 4𝐶-dimensional
concatenated features to get an output dimension of 2𝐶
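A sketch of patch merging under these assumptions (channel-last layout; variable names are mine):

```python
import torch
import torch.nn as nn

# Concatenate the features of each 2x2 group of neighboring patches (C -> 4C),
# then apply a linear layer that reduces the dimension to 2C.
# Spatial resolution halves, much like pooling in a CNN.
B, H, W, C = 1, 56, 56, 96
x = torch.randn(B, H, W, C)

x0 = x[:, 0::2, 0::2, :]                       # top-left patch of each 2x2 group
x1 = x[:, 1::2, 0::2, :]                       # bottom-left
x2 = x[:, 0::2, 1::2, :]                       # top-right
x3 = x[:, 1::2, 1::2, :]                       # bottom-right
merged = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)

reduction = nn.Linear(4 * C, 2 * C)            # 4C-dimensional features -> 2C
out = reduction(merged)
print(out.shape)                               # torch.Size([1, 28, 28, 192])
```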
Swin Transformer
Swin Transformer Block
With modified self-attention computation
Repeating the same process of Stage 2 to produce a hierarchical representation
Swin Transformer
Swin Transformer conducts local self-attention
• Limiting self-attention computation to non-
overlapping local windows
• The number of patches in each window is fixed
• Supposing each window contains 𝑀 × 𝑀 patches and
there are ℎ × 𝑤 patches on an image
Conventional multi-head self-attention (MSA), quadratic in the patch number ℎ𝑤 (omitting SoftMax):
Ω(MSA) = 4ℎ𝑤𝐶² + 2(ℎ𝑤)²𝐶

Window-based self-attention (W-MSA), linear in the patch number ℎ𝑤 (omitting SoftMax):
Ω(W-MSA) = 4ℎ𝑤𝐶² + 2𝑀²ℎ𝑤𝐶
Swin Transformer
Supposing each window contains 𝑀 × 𝑀 patches and there are ℎ × 𝑤 patches on an image.

(Side note: if C = AB for an n × m matrix A and an m × p matrix B, then C is an n × p matrix, and computing it
takes Θ(nmp) time in asymptotic notation.)

For conventional multi-head self-attention (MSA):
1. For each patch we need to compute the respective 𝑄, 𝐾, and 𝑉, with 𝐐 = 𝐗𝐖^Q, 𝐊 = 𝐗𝐖^K, and
𝐕 = 𝐗𝐖^V, where 𝐗 ∈ ℝ^(ℎ𝑤×𝐶) and 𝐖 ∈ ℝ^(𝐶×𝐶). Thus, the complexity is 3ℎ𝑤𝐶²
2. Compute 𝐐𝐊^⊤, where 𝐐, 𝐊, 𝐕 ∈ ℝ^(ℎ𝑤×𝐶). Thus, the complexity is (ℎ𝑤)²𝐶
3. Apply SoftMax and multiply with 𝐕 to obtain 𝐙. Since 𝐐𝐊^⊤ ∈ ℝ^(ℎ𝑤×ℎ𝑤), this operation takes (ℎ𝑤)²𝐶
4. The final output is obtained by multiplying 𝐙 with the output matrix 𝐖^O. The complexity is ℎ𝑤𝐶²
Hence, the final complexity is Ω(MSA) = 4ℎ𝑤𝐶² + 2(ℎ𝑤)²𝐶 - quadratic in the patch number ℎ𝑤
Swin Transformer
Window-based self attention (W-MSA) performs self-attention within windows, each of which contains
𝑀 × 𝑀 patches
1. [Same as MSA] the complexity is 3ℎ𝑤𝐶²
2. In W-MSA, there are (ℎ/𝑀) × (𝑤/𝑀) windows. In each window, the complexity of computing 𝐐𝐊^⊤ is
(𝑀²)²𝐶, and thus the total complexity is (ℎ/𝑀) × (𝑤/𝑀) × (𝑀²)²𝐶 = 𝑀²ℎ𝑤𝐶
3. Apply SoftMax and multiply with 𝐕 to obtain 𝐙. For the same reason as above, this operation takes 𝑀²ℎ𝑤𝐶
4. [Same as MSA] the complexity is ℎ𝑤𝐶²
• Hence, the final complexity is Ω(W-MSA) = 4ℎ𝑤𝐶² + 2𝑀²ℎ𝑤𝐶 - linear in the patch number ℎ𝑤
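Plugging representative numbers into the two formulas makes the gap concrete (the values below are illustrative only, e.g., a 56×56 token map with C = 96 and window size M = 7):

```python
# Evaluate the two complexity formulas above for one example configuration.
h = w = 56          # tokens per side (H/4 x W/4 for a 224x224 image with 4x4 patches)
C, M = 96, 7        # channel dimension and window size

msa  = 4 * h * w * C**2 + 2 * (h * w)**2 * C      # Omega(MSA):   quadratic in hw
wmsa = 4 * h * w * C**2 + 2 * M**2 * h * w * C    # Omega(W-MSA): linear in hw

print(f"MSA:   {msa:.3e} mult-adds")   # ~2.0e9, dominated by the (hw)^2 term
print(f"W-MSA: {wmsa:.3e} mult-adds")  # ~1.5e8
```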
Swin Transformer
Shifted window partitioning in successive blocks
• Enhance connections by bridging the windows
of the preceding layer to improve modelling
power
• Alternating between two partitioning
configurations
(Left) A regular window partitioning scheme is adopted, and self-attention is computed within each window.
(Right) The window partitioning is shifted, resulting in new windows. The self-attention computation in the
new windows crosses the boundaries of the previous windows in the preceding layer, providing connections
among them.
Swin Transformer
Successive Swin Transformer Blocks
• The first block consists of a window-based MSA module (W-
MSA), followed by a 2-layer MLP (with GELU)
• The second block consists of a shifted window based MSA
module (SW-MSA), followed by a 2-layer MLP (with GELU)
• Layer normalization (LN) is applied before each MSA
module and each MLP, and a residual connection is applied
after each module.
Two Successive Swin Transformer Blocks
Swin Transformer
Issues with shifted window partitioning
• Results in more windows, from ℎ/𝑀 × 𝑤/𝑀 to (ℎ/𝑀 + 1) × (𝑤/𝑀 + 1) in the shifted configuration.
For instance, from 2×2 to 3×3
• Some of the windows will be smaller than 𝑀×𝑀
• Naïve solution: pad the smaller windows, but the computation is 2.25× greater (for the case 2×2 → 3×3)
Swin Transformer
Efficient batch computation
• The goal is to maintain the same number of batched
windows as that of regular window partitioning
• Cyclic-shifting toward the top-left direction
• Using masked MSA to limit self-attention computation to
within each sub-window
Credit: https://zhuanlan.zhihu.com/p/367111046
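A sketch of the cyclic shift using torch.roll (the attention mask itself is omitted; the shift size and channel-last layout are assumptions made for illustration):

```python
import torch

# Roll the feature map toward the top-left so that the shifted windows can be
# batched exactly like regular windows; a mask then blocks attention between
# patches that were not adjacent before the roll.
B, H, W, C = 1, 56, 56, 96
shift = 3                                         # typically M // 2 for M = 7
x = torch.randn(B, H, W, C)

shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))        # cyclic shift
# ... window partition + masked self-attention would run here ...
restored = torch.roll(shifted, shifts=(shift, shift), dims=(1, 2))   # reverse shift
print(torch.equal(x, restored))                                      # True
```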
Swin Transformer
Image Classification
• The image classification is performed by
applying a global average pooling layer on the
output feature map of the last stage, followed
by a linear classifier
• Found to be as accurate as using an additional
class token as in ViT

Swin Transformer achieves a better speed-accuracy
trade-off than state-of-the-art CNNs
Video Swin Transformer
Extension to spatiotemporal domain
The mechanism of shifted windows is reformulated to process spatiotemporal input
Ze Liu et al., Video Swin Transformer, arXiv:2106.13230
Video Swin Transformer
Extension to spatiotemporal domain
The architecture is also adjusted
Swin Transformer
Video Swin Transformer
Ze Liu et al., Video Swin Transformer, arXiv:2106.13230
Applications of Transformer
Set prediction problem: DETR simplifies the detection pipeline by dropping multiple hand-designed components
that encode prior knowledge, like spatial anchors or non-maximal suppression.
• DETR uses a conventional CNN backbone to learn a 2D representation of an input image.
• The model flattens it and supplements it with a positional encoding before passing it into a transformer
encoder.
• A transformer decoder then takes as input a small fixed number of learned positional embeddings (object
queries), and additionally attends to the encoder output.
• Pass each output embedding of the decoder to a shared feed forward network (FFN) that predicts either a
detection (class and bounding box) or a “no object” class.
Nicolas Carion et al., End-to-End Object Detection with Transformers , ECCV 2020
Applications of Transformer
Restored output
Super-resolves the LR image with the guidance of an additional
high-resolution (HR) reference image. Textures of the HR
reference image are transferred to provide more fine details for
the LR image.
Fuzhi Yang et al., Learning Texture Transformer Network for Image Super-Resolution, CVPR 2020
Applications of Transformer
3D human texture estimation from a single image
Xiangyu Xu, Chen Change Loy, 3D Human Texture Estimation from a Single Image with Transformers, ICCV 2021
Applications of Transformer
3D human texture estimation from a single image
Most existing methods use CNNs to predict the 3D human texture (i.e., a UV map) from
the input image
The input and output do not have strictly-aligned spatial correspondences and may
even have totally different shapes
Convolution layers are by design local operations and inefficient in processing global
information that is crucial in 3D human texture estimation
Xiangyu Xu, Chen Change Loy, 3D Human Texture Estimation from a Single Image with Transformers, ICCV 2021
Applications of Transformer
3D human texture estimation from a single image
Transformer uses the attention mechanism to more
effectively exploit global information, which leads to higher-
quality 3D human texture estimation.
Query = color encoding map obtained from 3D coordinates of a
standard 3D body mesh, and each element in the Query corresponds to
a physical vertex
Xiangyu Xu, Chen Change Loy, 3D Human Texture Estimation from a Single Image with Transformers, ICCV 2021
Pretraining of Transformers
Attain excellent results when pretrained on large-scale datasets
(e.g., JFT-300M)
Lower accuracy than ResNet when trained only on ImageNet

ViT lacks the inductive biases inherent to CNNs
• Inductive biases are the characteristics of learning algorithms
that influence their generalization behaviour, independent of data.
• Translation equivariance, two-dimensional neighborhood
structure, and locality
• ViT thus does not generalize well given insufficient training data

Side note: an equivariant mapping is a mapping which preserves the algebraic structure of a transformation.
A translation equivariant mapping is a mapping which, when the input is translated, produces a
correspondingly translated output.
Transformers vs. CNN
• Do Transformers act like convolutions?
• Do they act like convolutions, learning the same inductive biases from scratch?
• Or are they developing novel task representations?
Maithra Raghu et al., Do Vision Transformers See Like Convolutional Neural Networks?, arXiv:2108.08810
Transformers vs. CNN
Tool: Centered kernel alignment (CKA) to compare the representations (activations) between all pairs of layers -
measure meaningful similarities between representations of higher dimension than the number of data points
CKA(𝐊, 𝐋) = HSIC(𝐊, 𝐋) / √(HSIC(𝐊, 𝐊) · HSIC(𝐋, 𝐋)), where 𝐊 and 𝐋 are the Gram matrices for the two layers
and HSIC is the Hilbert-Schmidt Independence Criterion
S. Kornblith, M. Norouzi, H. Lee, and G. Hinton. Similarity of neural network representations revisited. In ICML, 2019
A. Gretton, O. Bousquet, A.J. Smola, and B. Scholkopf. Measuring statistical dependence with Hilbert Schmidt norms. In ALT
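A minimal NumPy sketch of linear CKA written in terms of the Gram matrices (the function name and toy data are mine):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices (examples x features),
    expressed via the Gram matrices K = X X^T and L = Y Y^T."""
    X = X - X.mean(axis=0)                     # center the features
    Y = Y - Y.mean(axis=0)
    K, L = X @ X.T, Y @ Y.T                    # Gram matrices for the two layers
    hsic = (K * L).sum()                       # inner product <K, L>_F (HSIC up to scaling)
    return hsic / (np.linalg.norm(K) * np.linalg.norm(L))

# Toy usage: similarity of two random "layers" over the same 100 examples
a, b = np.random.randn(100, 64), np.random.randn(100, 128)
print(linear_cka(a, a), linear_cka(a, b))      # 1.0 and a small value
```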
Transformers vs. CNN
ViT has more uniform representations, with greater similarity between lower and higher layers.
Transformers vs. CNN
Cross-model comparisons
The lower half of 60 ResNet layers are similar to
approximately the lowest quarter of ViT layers.
In particular, many more lower layers in the
ResNet are needed to compute similar
representations to the lower layers of ViT.
The top half of the ResNet is approximately
similar to the next third of the ViT layers.
The final third of ViT layers is less similar to all
ResNet layers
Transformers vs. CNN
(i) ViT lower layers compute representations in a different way to lower layers in the ResNet
(ii) ViT also more strongly propagates representations between lower and higher layers
(iii) The highest layers of ViT have quite different representations to ResNet

(Figure: within-model and cross-model CKA similarity)
Transformers vs. CNN
With large-scale pre-training
Two highest layers in the ViT: at higher layers, all self-attention heads are global.

Two lowest layers in the ViT: self-attention layers have a mix of local heads (small distances) and global
heads (large distances). This is in contrast to CNNs, which are hardcoded to attend only locally in the lower
layers.

(Plot: average distance between the query patch position and the locations it attends to, with heads sorted
by their average distance.)
Transformers vs. CNN
With NO large-scale pre-training (just training on ImageNet) – much lower performance
Two highest layers in the ViT: at higher layers, all self-attention heads are global.

Two lowest layers in the ViT: ViT does not learn to attend locally in earlier layers!

Using local information early on for image tasks (which is hardcoded into CNN architectures) is important
for strong performance.

(Plot: average distance between the query patch position and the locations it attends to, with heads sorted
by their average attention distance.)
Further Reading
Credit: Zang Yuhang