Lecture 6: Transformers
AI6126 Advanced Computer Vision

Last update: 17 February 2022 4:45pm

Transformers for
Computer Vision
Chen-Change Loy
吕健勤

https://www.mmlab-ntu.com/
https://twitter.com/ccloy
Outline
• Attention
• Sequence-to-Sequence with RNNs and Attention
• Transformers
• Extra – Vision Transformers
Background
• Convolutional neural networks (CNNs) have been dominating
• Greater scale
• More extensive connections
• More sophisticated forms of convolution

• Transformers
• Competitive alternative to CNN
• Generally found to perform best in settings with large amounts of training data
• Enable multi-modality learning
Credits
• A nice blog by Lilian Weng:
• https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
• https://lilianweng.github.io/lil-log/2020/04/07/the-transformer-family.html

• The Illustrated Transformer by Jay Alammar
• https://jalammar.github.io/illustrated-transformer/
Attention
"Where's Waldo?"
Visual Attention
Scientists were divided into two camps:
• In the first model, the spotlight of attention would track across the page, checking
each detail against a mental image of Waldo's red stocking cap and striped shirt.
• In the second model, the color red and stocking-cap shapes would gradually come to
the foreground and other shapes and colors would recede.
"Where's Waldo?"

• The MIT study showed that both processes are going on in the same chunk of the brain, and in the same neurons.
• The mid-region of the V4 visual cortex is known to be important to attention (V4 is tuned for object features of intermediate complexity, like simple geometric shapes).
Research explains how the brain finds Waldo, MIT News 2005
Attention

The term “visual attention” refers to a set of cognitive operations that mediate the selection of relevant
and the filtering out of irrelevant information from cluttered visual scenes.
• Reduce complexity
• Resource saving
Attention
Attention is a mechanism that a model can learn to make predictions by selectively attending to a given set of data.

The amount of attention is quantified by learned weights, and thus the output is usually formed as a weighted average.

Self-attention is a type of attention mechanism where the model makes a prediction for one part of a data sample using other parts of the observation about the same sample.

Example (from The Illustrated Transformer): "The animal didn't cross the street because it was too tired." What does "it" in this sentence refer to? Is it referring to the street or to the animal?

When the model is processing the word "it", self-attention allows it to associate "it" with "animal".
Sequence-to-Sequence
with RNNs and Attention
Input-output scenarios
Sequence-to-Sequence with RNNs

(Figure sequence: an encoder RNN processes the input tokens; a decoder RNN with hidden state s_{t-1} generates the output tokens one at a time, conditioned on a single context vector.)

The context vector summarizes all the information needed by the decoder.

Idea: use a new context vector at each step of the decoder!
Sequence-to-Sequence with RNNs and Attention

(Figure sequence: at each decoder step, an alignment score is computed between the current decoder hidden state and every encoder hidden state; the scores are normalized into attention weights and used to form a new context vector.)

Goal: predict how much we should attend to each hidden state of the encoder, given the current hidden state of the decoder. Each alignment score is a scalar (see the sketch below).
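A minimal sketch of this idea, with illustrative tensor names and a simple dot-product scoring function (the slides leave the exact alignment function open):

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: encoder_states [T, H] (one hidden state per input token),
# decoder_state s_{t-1} of size [H].
T, H = 5, 16
encoder_states = torch.randn(T, H)   # h_1 ... h_T
decoder_state  = torch.randn(H)      # s_{t-1}

# One scalar alignment score per encoder hidden state
scores = encoder_states @ decoder_state            # [T]

# Normalize into attention weights, then form the new context vector
attn_weights = F.softmax(scores, dim=0)            # [T], sums to 1
context = attn_weights @ encoder_states            # [H], weighted average

# 'context' is recomputed like this at every decoder step, instead of reusing
# a single fixed context vector for the whole output sequence.
```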
Transformers
Transformers

https://transformer.huggingface.co/doc/distil-gpt2
Transformers
Notable for its use of attention to model long-range dependencies in data.

A sequence-to-sequence model.

The model of choice in natural language processing (NLP).

Step-by-step guide to self-attention with illustrations and code: https://jalammar.github.io/illustrated-transformer/


Ashish Vaswani et al., Attention Is All You Need, NIPS 2017 (from Google)
Transformers

Like the LSTM, the Transformer is an architecture for transforming one sequence into another with the help of two parts (an Encoder and a Decoder).

But it differs from existing sequence-to-sequence models because it does not rely on any recurrent networks (GRU, LSTM, etc.):
• Layer outputs can be calculated in parallel, instead of serially as in an RNN
• Attention-based models allow modeling of dependencies without regard to their distance in the input or output sequences

Ashish Vaswani et al., Attention Is All You Need, NIPS 2017 (from Google)
Transformers

It is built entirely on self-attention mechanisms, without using a sequence-aligned recurrent architecture.

Self-attention is a type of attention mechanism where the model makes a prediction for one part of a data sample using other parts of the observation about the same sample.

Example (from The Illustrated Transformer): "The animal didn't cross the street because it was too tired." What does "it" in this sentence refer to? Is it referring to the street or to the animal?

When the model is processing the word "it", self-attention allows it to associate "it" with "animal".
Transformers
The encoding component is a stack of encoders.

The decoding component is a stack of the same number of decoders.
Transformers
The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other
words in the input sentence as it encodes a specific word
The outputs of the self-attention layer are fed to a feed-forward neural network.
The exact same feed-forward network is independently applied to each position (each word/token).
Transformers
The decoder has both those layers, but between them is an attention layer that helps the decoder focus on
relevant parts of the input sentence
Transformers

(Figure: the Transformer architecture, showing the encoder stack and the decoder stack.)
Transformers
Self-attention = Scaled dot-product attention
The output is a weighted sum of the values, where the weight
assigned to each value is determined by the dot-product of the
query with all the keys

Attention(𝐐, 𝐊, 𝐕) = SoftMax(𝐐𝐊⊤ / √d_k) 𝐕

Scaled Dot-Product Attention


Self-Attention in Detail
First Step
Create three vectors from each of the encoder’s
input vectors (in this case, the embedding of
each word).
So for each word, we create a Query vector, a
Key vector, and a Value vector.
These vectors are created by multiplying the
embedding by three matrices that we trained
during the training process.

What are the “query”, “key”, and “value” vectors?

Attention(𝐐, 𝐊, 𝐕) = SoftMax(𝐐𝐊⊤ / √d_k) 𝐕
Self-Attention in Detail
Second Step
Calculate a score for each word of the input sentence against the word being encoded.
The score determines how much focus to place
on other parts of the input sentence as we
encode a word at a certain position.
The score is calculated by taking the dot product
of the query vector with the key vector of the
respective word we’re scoring.

Attention(𝐐, 𝐊, 𝐕) = SoftMax(𝐐𝐊⊤ / √d_k) 𝐕
Self-Attention in Detail
Third Step
Divide the scores by √d_k, the square root of the dimension of the key vectors.
This leads to having more stable gradients (large
similarities will cause softmax to saturate and
give vanishing gradients)

Fourth Step
Softmax for normalization

Attention(𝐐, 𝐊, 𝐕) = SoftMax(𝐐𝐊⊤ / √d_k) 𝐕
Self-Attention in Detail
Fifth Step
Multiply each value vector by the softmax score

Sixth Step
Sum up the weighted value vectors to get the
output of the self-attention layer at this position
(for the first word)

z₁ = 0.88 v₁ + 0.12 v₂


Attention(𝐐, 𝐊, 𝐕) = SoftMax(𝐐𝐊⊤ / √d_k) 𝐕
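A minimal sketch of the six steps above, assuming toy dimensions and randomly initialized projection matrices (names like W_q are illustrative, not from the paper's code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model, d_k = 2, 512, 64        # e.g. two words, as in the example above

X   = torch.randn(seq_len, d_model)       # input embeddings (one row per word)
W_q = torch.randn(d_model, d_k)           # trained projection matrices
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)

# Step 1: create query, key, value vectors for every position
Q, K, V = X @ W_q, X @ W_k, X @ W_v       # each [seq_len, d_k]

# Step 2: score every position against every other position
scores = Q @ K.T                          # [seq_len, seq_len]

# Step 3: scale by sqrt(d_k) for more stable gradients
scores = scores / d_k ** 0.5

# Step 4: softmax normalization (row-wise)
weights = F.softmax(scores, dim=-1)

# Steps 5-6: weight the value vectors and sum them up
Z = weights @ V                           # [seq_len, d_k]
# e.g. Z[0] = weights[0, 0] * V[0] + weights[0, 1] * V[1]   (cf. z1 = 0.88 v1 + 0.12 v2)
```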
Matrix Calculation of Self-Attention
First Step
Calculate the Query, Key, and Value matrices: pack our embeddings into a matrix X (every row in X corresponds to a word in the input sentence), and multiply it by the weight matrices we've trained (W^Q, W^K, W^V).
Second Step
Calculate the outputs of the self-attention layer.
SoftMax is row-wise
Multi-Head Self-Attention
Multi-Head Self-Attention
Rather than only computing the attention once, the multi-head mechanism
runs through the scaled dot-product attention multiple times in parallel.

The independent attention outputs are simply concatenated and linearly transformed into the expected dimensions.

Why?
“Multi-head attention allows the model to jointly attend to information from
different representation subspaces at different positions. ”

Example:
Deep learning (also known as deep structured learning) is part of a broader family
of machine learning methods based on artificial neural networks with representation
learning.

Given “representation learning”, the first head attends to “Deep learning” while the second head attends to the more general term “machine learning methods”.

(Figure: Multi-Head Attention)
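A minimal sketch of the multi-head mechanism (parallel scaled dot-product attention, concatenation, and a final linear projection); all dimensions and weight tensors are illustrative:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    weights = F.softmax(Q @ K.transpose(-2, -1) / d_k ** 0.5, dim=-1)
    return weights @ V

seq_len, d_model, n_heads = 10, 512, 8
d_head = d_model // n_heads               # 64 dimensions per head

X = torch.randn(seq_len, d_model)
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
W_o = torch.randn(d_model, d_model)       # output projection

# Project, then split the feature dimension into n_heads independent heads
def split_heads(T):                        # [seq_len, d_model] -> [n_heads, seq_len, d_head]
    return T.reshape(seq_len, n_heads, d_head).transpose(0, 1)

Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

# Run attention in all heads in parallel, then concatenate and project
heads = scaled_dot_product_attention(Q, K, V)           # [n_heads, seq_len, d_head]
concat = heads.transpose(0, 1).reshape(seq_len, d_model)
output = concat @ W_o                                    # [seq_len, d_model]
```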
Positional Encoding
Self-attention operation is permutation
equivariant

SelfAtt(𝜋 ⋅ 𝑋) = 𝜋 ⋅ SelfAtt(𝑋)
The self-attention layer works on sets of vectors; it doesn't know the order of the vectors it is processing.
The positional encoding has the same
dimension as the input embedding
Adds a vector to each input embedding
to give information about the relative or
absolute position of the tokens in the
sequence
These vectors follow a specific pattern
Positional Encoding
What might this pattern look like?
Each row corresponds to the positional encoding of a vector.
The first row would be the vector we'd add to the embedding of the first word in an input sequence.
Each position is uniquely encoded, and the encoding can deal with sequences longer than any sequence seen at training time.
Sinusoidal positional encoding interweaves the two signals:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

where pos is the position and i is the dimension.

(Figure: sinusoidal positional encoding with 32 tokens and an embedding dimension of 128. Values range between -1 (black) and 1 (white); 0 is gray.)
Example: Positional Encoding
Example:
Given the sinusoidal positional encoding above, calculate PE(pos = 1) for the first five dimensions [0, 1, 2, 3, 4]. Assume d_model = 512.

Solution:
Given pos = 1 and d_model = 512:
At dimension 0, 2i = 0 thus i = 0, therefore PE(pos, 2i) = PE(1, 0) = sin(1 / 10000^(0/512))
At dimension 1, 2i + 1 = 1 thus i = 0, therefore PE(pos, 2i+1) = PE(1, 1) = cos(1 / 10000^(0/512))
At dimension 2, 2i = 2 thus i = 1, therefore PE(pos, 2i) = PE(1, 2) = sin(1 / 10000^(2/512))
At dimension 3, 2i + 1 = 3 thus i = 1, therefore PE(pos, 2i+1) = PE(1, 3) = cos(1 / 10000^(2/512))
At dimension 4, 2i = 4 thus i = 2, therefore PE(pos, 2i) = PE(1, 4) = sin(1 / 10000^(4/512))
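A minimal NumPy sketch of the sinusoidal encoding used in the worked example (function name and the 50-position table are illustrative choices):

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same argument)."""
    pe = np.zeros((n_positions, d_model))
    positions = np.arange(n_positions)[:, None]             # [n_positions, 1]
    div = 10000 ** (np.arange(0, d_model, 2) / d_model)     # 10000^(2i/d_model)
    pe[:, 0::2] = np.sin(positions / div)                   # even dimensions
    pe[:, 1::2] = np.cos(positions / div)                   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(n_positions=50, d_model=512)
# Reproduces the worked example: PE(1, 0..4)
print(pe[1, :5])   # sin(1/10000^0), cos(1/10000^0), sin(1/10000^(2/512)), cos(...), sin(1/10000^(4/512))
```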
Positional Encoding
Other form of positional encoding (e.g., learnable encoding)

Xuanqing Liu et al., Learning to Encode Position for Transformer with Continuous Dynamical Model, ICML 2020
Transformer Encoder
Encoder

• A stack of N = 6 identical layers.
• Each layer has a multi-head self-attention layer and a simple position-wise fully connected feed-forward network.
• The linear transformations are the same across different positions, but they use different parameters from layer to layer.
• Each sub-layer adopts a residual connection and a layer normalization.

If we were to visualize the vectors and the layer-norm operation associated with self-attention, it would look like this: (figure)
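A minimal PyTorch sketch of one encoder layer as described above (post-norm, residual connections around both sub-layers); the hyperparameters and class name are illustrative:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Multi-head self-attention + position-wise feed-forward network,
    each wrapped with a residual connection and layer normalization."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(              # same MLP applied independently at every position
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: [batch, seq_len, d_model]
        attn_out, _ = self.self_attn(x, x, x)  # queries, keys, values all come from x
        x = self.norm1(x + attn_out)           # residual + LayerNorm
        x = self.norm2(x + self.ffn(x))        # residual + LayerNorm
        return x

layer = EncoderLayer()
tokens = torch.randn(1, 10, 512)               # one sentence of 10 token embeddings
out = layer(tokens)                            # [1, 10, 512]
```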
Transformer Encoder
(Figure: Batch Normalization (BN) vs. Layer Normalization (LN), illustrated over the batch dimension N and the channel dimension D.)

Layer Normalization (LN) [1]
• The values along the channel dimension (the red arrow in the figure) are normalized by the same mean and variance, computed by aggregating those values.
• BN is found to be unstable in Transformers [2].
• LN works well with RNNs and is now being used in Transformers.

Gaussian Error Linear Units (GELU) [3]
• Can be thought of as a smoother ReLU.
• Used in GPT-3, BERT, and most other Transformers.

1 Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton, Layer Normalization, arXiv:1607.06450
2 Sheng Shen, Zhewei Yao, Amir Gholami, Michael Mahoney, Kurt Keutzer, Rethinking Batch Normalization in Transformers, ICML 2020
3 Dan Hendrycks, Kevin Gimpel, Gaussian Error Linear Units (GELUs), arXiv:1606.08415
Transformer Encoder
The outputs of the encoder go to the sub-layers of the decoder as well. If we were to think of a Transformer with two stacked encoders and decoders, it would look something like this:

The “Encoder-Decoder Attention” layer works just like multi-headed self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrices from the output of the encoder stack.
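A minimal sketch of encoder-decoder attention: queries come from the decoder's previous sub-layer, keys and values from the encoder output (dimensions and tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

d_model, d_k = 512, 64
enc_len, dec_len = 12, 7

encoder_output = torch.randn(enc_len, d_model)   # output of the top encoder
decoder_hidden = torch.randn(dec_len, d_model)   # output of the decoder's (masked) self-attention sub-layer

W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))

Q = decoder_hidden @ W_q          # queries from the layer below (decoder side)
K = encoder_output @ W_k          # keys from the encoder stack
V = encoder_output @ W_v          # values from the encoder stack

weights = F.softmax(Q @ K.T / d_k ** 0.5, dim=-1)   # [dec_len, enc_len]
context = weights @ V                                # [dec_len, d_k]
# Each decoder position attends over all encoder positions of the input sentence.
```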
Transformer Decoder
How encoder and decoder work together

• The encoder starts by processing the input sequence.
• The output of the top encoder is then transformed into a set of attention vectors K and V.
• These are used by each decoder in its “encoder-decoder attention” layer, which helps the decoder focus on appropriate places in the input sequence.

After finishing the encoding phase, we begin the decoding phase. Each
step in the decoding phase outputs an element from the output sequence
(the English translation sentence in this case).
Transformer Decoder
How encoder and decoder work together

• The output of each step is fed to the bottom decoder in the next time step.
• Embed and add positional encoding to those decoder inputs, then process the inputs.
• Repeat the process until a special symbol is reached, indicating that the Transformer decoder has completed its output.

Note:
In the decoder, the self-attention layer is only allowed to attend to earlier
positions in the output sequence. This is done by masking future positions
(setting them to -inf) before the softmax step in the self-attention
calculation.
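A minimal sketch of that masking step, with illustrative dimensions: future positions are set to -inf before the softmax, so they receive zero attention weight.

```python
import torch
import torch.nn.functional as F

dec_len, d_k = 5, 64
Q = torch.randn(dec_len, d_k)
K = torch.randn(dec_len, d_k)

scores = Q @ K.T / d_k ** 0.5                          # [dec_len, dec_len]

# Strictly upper-triangular mask: position i may only attend to positions <= i
causal_mask = torch.triu(torch.ones(dec_len, dec_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float('-inf'))

weights = F.softmax(scores, dim=-1)   # each row sums to 1; future positions get weight 0
```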
Transformers
(Figure: the Transformer encoder and decoder, summarized in equations.)

John Thickstun, The Transformer Model in Equations


Vision Transformers
Vision Transformer (ViT)
Vision Transformer
• ViT does not have a decoder.
• Reshape the image 𝐱 ∈ ℝ^(H×W×C) into a sequence of flattened 2D patches 𝐱_p ∈ ℝ^(N×(P²·C)), where (H, W) is the resolution of the original image, C is the number of channels, (P, P) is the resolution of each image patch, and N = HW/P² is the resulting number of patches.
• Patch embedding: linearly embed each patch to D dimensions with a trainable linear projection 𝐄.
• Add learnable position embeddings 𝐄_pos to retain positional information.

Alexey Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021
Vision Transformer (ViT)
Vision Transformer
• Prepend a learnable embedding (𝐳₀⁰ = 𝐱_class) to the sequence of embedded patches.
• Feed the resulting sequence of vectors to a standard Transformer encoder.
• A classification head is attached to 𝐳_L⁰ (the encoder output at the class token position).

(Figure: Transformer Encoder and Classification Head)
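A minimal sketch of the ViT input pipeline (patchify, linearly embed, prepend a class token, add learnable position embeddings); the shapes follow the slide's notation, while the parameter initialization and variable names are illustrative:

```python
import torch
import torch.nn as nn

B, C, H, W = 1, 3, 224, 224      # image batch
P, D = 16, 768                   # patch size and embedding dimension
N = (H // P) * (W // P)          # number of patches = HW / P^2 = 196

img = torch.randn(B, C, H, W)

# Reshape the image into N flattened P x P x C patches
patches = img.unfold(2, P, P).unfold(3, P, P)             # [B, C, H/P, W/P, P, P]
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, N, P * P * C)

# Patch embedding E: trainable linear projection to D dimensions
embed = nn.Linear(P * P * C, D)
tokens = embed(patches)                                    # [B, N, D]

# Prepend a learnable [class] token and add learnable position embeddings E_pos
cls_token = nn.Parameter(torch.zeros(1, 1, D))
pos_embed = nn.Parameter(torch.zeros(1, N + 1, D))
z0 = torch.cat([cls_token.expand(B, -1, -1), tokens], dim=1) + pos_embed   # [B, N+1, D]
# z0 is fed to a standard Transformer encoder; a classification head reads the class token output.
```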
Vision Transformer (ViT)
Transformer Encoder
• Consists of a multi-head self-attention module (MSA),
followed by a 2-layer MLP (with GELU)
• LayerNorm (LN) is applied before MSA module and MLP,
and a residual connection is applied after each module.

Transformer Encoder
Vision Transformer (ViT)

The model attends to image regions that are semantically relevant for classification.

The model learns to encode distance within the image: closer patches tend to have more similar position embeddings. The row-column structure also appears: patches in the same row/column have similar embeddings.

Attention Rollout [1]: average the attention weights of ViT-L/16 across all heads and then recursively multiply the weight matrices of all layers (this allows attention to be more meaningfully visualized and interpreted for deeper layers in a Transformer).

[1] Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. In ACL, 2020.
Vision Transformer (ViT)
Examine the attention distance, analogous to the receptive field in a CNN.

Compute the average distance in image space across which information is integrated, based on the attention weights.

• The attention distance increases with network depth.
• Some heads attend to most of the image already in the lowest layers, showing the capability of ViT in integrating information globally.
• Other attention heads have consistently small attention distances in the low layers.
Vision Transformer (ViT)
Performance of ViT
• ViT performs significantly worse than the CNN equivalent (BiT)
when trained on ImageNet (1M images).
• However, on ImageNet-21k (14M images) performance is
comparable, and on JFT (300M images), ViT outperforms BiT.
• ViT overfits the ImageNet task due to its lack of inbuilt
knowledge about images

ViT conducts global self-attention:
• Relationships between a token and all other tokens are computed
• Quadratic complexity with respect to the number of tokens
• Not tractable for dense prediction or for representing a high-resolution image
Swin Transformers
Swin Transformer
• Perform local self-attention thus having linear computational complexity with respect to
the number of tokens
• Shifted window between consecutive self-attention layers to allow connections
• Flexibility to model at various scales with hierarchical representation

Z. Liu et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, ICCV 2021
Swin Transformer

Patch Partition
• Split an input RGB image into non-overlapping patches.
• Treat each patch as a “token”; with a 4×4 patch size, the feature dimension of each token is 4×4×3 = 48, and the number of tokens is (H/4) × (W/4).
• Concatenate the raw pixel RGB values as the patch feature.

Linear Embedding
• Applied on the raw-valued feature to project it to an arbitrary dimension, denoted as C.
Swin Transformer

Patch Merging
• Reminiscent of pooling in CNNs
• Reduces the dimension and forms a hierarchical representation
• Concatenate the features of each group of 2 × 2 neighboring patches
• Apply a linear layer on the 4C-dimensional concatenated features to get an output dimension of 2C
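A minimal sketch of patch merging (concatenate each 2×2 group of neighboring patch features, then project 4C to 2C); the tensor shapes are illustrative:

```python
import torch
import torch.nn as nn

B, H, W, C = 1, 56, 56, 96             # patch-grid resolution and channel dimension
x = torch.randn(B, H, W, C)            # one feature vector per patch

# Group each 2x2 neighborhood and concatenate along the channel dimension
x0 = x[:, 0::2, 0::2, :]               # top-left patch of each 2x2 group
x1 = x[:, 1::2, 0::2, :]               # bottom-left
x2 = x[:, 0::2, 1::2, :]               # top-right
x3 = x[:, 1::2, 1::2, :]               # bottom-right
merged = torch.cat([x0, x1, x2, x3], dim=-1)      # [B, H/2, W/2, 4C]

# Linear layer reduces the 4C-dimensional features to 2C
reduction = nn.Linear(4 * C, 2 * C)
out = reduction(merged)                            # [B, H/2, W/2, 2C]
```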
Swin Transformer

Swin Transformer Block
• Uses a modified self-attention computation (detailed in the following slides).

Subsequent stages repeat the same process as Stage 2 to produce a hierarchical representation.


Swin Transformer
Swin Transformer conducts local self-attention
• Limiting self-attention computation to non-
overlapping local windows
• The number of patches in each window is fixed
• Supposing each window contains 𝑀 × 𝑀 patches and
there are ℎ × 𝑤 patches on an image

(Figure: Swin Transformer's local windowed attention vs. ViT's global attention.)

Conventional multi-head self-attention (MSA), quadratic in the patch number hw (omitting SoftMax):
Ω(MSA) = 4hwC² + 2(hw)²C

Window-based self-attention (W-MSA), linear in the patch number hw (omitting SoftMax):
Ω(W-MSA) = 4hwC² + 2M²hwC
Swin Transformer
Side note: if C = AB for an n × m matrix A and an m × p matrix B, then C is an n × p matrix, and computing it takes time Θ(nmp) (in asymptotic notation).

Suppose each window contains M × M patches and there are h × w patches on an image.

For conventional multi-head self-attention (MSA):

1. For each patch we need to compute the respective Q, K, and V, with 𝐐 = 𝐗𝐖_Q, 𝐊 = 𝐗𝐖_K, and 𝐕 = 𝐗𝐖_V, where 𝐗 ∈ ℝ^(hw×C) and each 𝐖 ∈ ℝ^(C×C). Thus, the complexity is 3hwC².

2. Compute 𝐐𝐊⊤, where 𝐐, 𝐊, 𝐕 ∈ ℝ^(hw×C). Thus, the complexity is (hw)²C.

3. Apply SoftMax and multiply with 𝐕 to obtain 𝐙. Since 𝐐𝐊⊤ ∈ ℝ^(hw×hw), this operation takes (hw)²C.

4. The final output is obtained by multiplying 𝐙 with the output matrix 𝐖_O. The complexity is hwC².

Hence, the final complexity is Ω(MSA) = 4hwC² + 2(hw)²C, which is quadratic in the patch number hw.


Swin Transformer
Window-based self-attention (W-MSA) performs self-attention within windows, each of which contains M × M patches.

1. [Same as MSA] The complexity is 3hwC².

2. In W-MSA, there are (h/M) × (w/M) windows. In each window, the complexity of computing 𝐐𝐊⊤ is (M²)²C, and thus the total complexity is (h/M) × (w/M) × (M²)²C = M²hwC.

3. Apply SoftMax and multiply with 𝐕 to obtain 𝐙. For the same reason as before, this operation takes M²hwC.

4. [Same as MSA] The complexity is hwC².

Hence, the final complexity is Ω(W-MSA) = 4hwC² + 2M²hwC, which is linear in the patch number hw (see the numeric comparison below).
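A quick numeric check of the two complexity formulas, for illustrative values of h, w, C, and M:

```python
# Illustrative values: a 56 x 56 patch grid, C = 96 channels, M = 7 window size
h, w, C, M = 56, 56, 96, 7
hw = h * w

msa   = 4 * hw * C**2 + 2 * hw**2 * C          # quadratic in the number of patches hw
w_msa = 4 * hw * C**2 + 2 * M**2 * hw * C      # linear in hw

print(f"MSA:   {msa:,.0f} ops")    # ~2.0e9
print(f"W-MSA: {w_msa:,.0f} ops")  # ~1.5e8, far cheaper at this resolution
```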
Swin Transformer
Shifted window partitioning in successive blocks
• Enhance connections by bridging the windows
of the preceding layer to improve modelling
power
• Alternating between two partitioning
configurations

(Left) A regular window partitioning scheme is adopted, and self-attention is computed within each window.

(Right) The window partitioning is shifted, resulting in new windows. The self-attention computation in the new windows crosses the boundaries of the previous windows in the preceding layer, providing connections among them.
Swin Transformer
Successive Swin Transformer Blocks
• The first block consists of a window-based MSA module (W-
MSA), followed by a 2-layer MLP (with GELU)
• The second block consists of a shifted window based MSA
module (SW-MSA), followed by a 2-layer MLP (with GELU)
• Layer normalization (LN) is applied before each MSA
module and each MLP, and a residual connection is applied
after each module.

Two Successive Swin Transformer Blocks


Swin Transformer
Issues with shifted window partitioning
• It results in more windows, from ⌈h/M⌉ × ⌈w/M⌉ to (⌈h/M⌉ + 1) × (⌈w/M⌉ + 1) in the shifted configuration; for instance, from 2×2 to 3×3.
• Some of the windows will be smaller than M×M.
• Naïve solution: pad the smaller windows, but the computation is 2.25× greater (for the case 2×2 → 3×3).
Swin Transformer
Efficient batch computation
• The goal is to maintain the same number of batched
windows as that of regular window partitioning
• Cyclic-shifting toward the top-left direction
• Using masked MSA to limit self-attention computation to
within each sub-window

Credit: https://zhuanlan.zhihu.com/p/367111046
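A minimal sketch of the cyclic shift toward the top-left (and its reverse), assuming a window size M and a shift of M // 2; the masked-MSA step that restricts attention to true neighbors is only indicated in a comment:

```python
import torch

B, H, W, C = 1, 8, 8, 96
M = 4                          # window size; shift by half a window
shift = M // 2

x = torch.randn(B, H, W, C)    # patch features laid out on the 2D grid

# Cyclic shift toward the top-left: keeps the number of batched windows unchanged
shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

# ... window partition + masked MSA would run here, with a mask limiting
# attention to patches that were actually adjacent before the roll ...

# Reverse the cyclic shift afterwards to restore the original layout
restored = torch.roll(shifted, shifts=(shift, shift), dims=(1, 2))
assert torch.equal(restored, x)
```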
Swin Transformer
Image Classification
• The image classification is performed by
applying a global average pooling layer on the
output feature map of the last stage, followed
by a linear classifier

• Found to be as accurate as using an additional class token as in ViT.

(Figure: Swin Transformer achieves a better speed-accuracy trade-off than state-of-the-art CNNs.)
Video Swin Transformer
Extension to spatiotemporal domain
The mechanism of shifted windows is reformulated to process spatiotemporal input

Ze Liu et al., Video Swin Transformer, arXiv:2106.13230


Video Swin Transformer
Extension to the spatiotemporal domain
The architecture is also adjusted.

(Figure: architecture comparison, Swin Transformer vs. Video Swin Transformer.)

Ze Liu et al., Video Swin Transformer, arXiv:2106.13230


Applications of Transformer

Set prediction problem: DETR simplifies the detection pipeline by dropping multiple hand-designed components
that encode prior knowledge, like spatial anchors or non-maximum suppression.

• DETR uses a conventional CNN backbone to learn a 2D representation of an input image.


• The model flattens it and supplements it with a positional encoding before passing it into a transformer
encoder.
• A transformer decoder then takes as input a small fixed number of learned positional embeddings (object
queries), and additionally attends to the encoder output.
• Pass each output embedding of the decoder to a shared feed forward network (FFN) that predicts either a
detection (class and bounding box) or a “no object” class.
Nicolas Carion et al., End-to-End Object Detection with Transformers , ECCV 2020
Applications of Transformer

(Figure: restored output)

Super-resolves the low-resolution (LR) image with the guidance of an additional high-resolution (HR) reference image. Textures of the HR reference image are transferred to provide finer details for the LR image.

Fuzhi Yang et al., Learning Texture Transformer Network for Image Super-Resolution, CVPR 2020
Applications of Transformer
3D human texture estimation from a single image

Xiangyu Xu, Chen Change Loy, 3D Human Texture Estimation from a Single Image with Transformers, ICCV 2021
Applications of Transformer
3D human texture estimation from a single image

Most existing methods use CNNs to predict the 3D human texture (i.e., a UV map) from the input image.

The input and output do not have strictly-aligned spatial correspondences and may
even have totally different shapes

Convolution layers are by design local operations and inefficient in processing global
information that is crucial in 3D human texture estimation

Xiangyu Xu, Chen Change Loy, 3D Human Texture Estimation from a Single Image with Transformers, ICCV 2021
Applications of Transformer
3D human texture estimation from a single image

The Transformer uses the attention mechanism to more effectively exploit global information, which leads to higher-quality 3D human texture estimation.

Query = a color encoding map obtained from the 3D coordinates of a standard 3D body mesh; each element in the Query corresponds to a physical vertex.

Xiangyu Xu, Chen Change Loy, 3D Human Texture Estimation from a Single Image with Transformers, ICCV 2021
Pretraining of Transformers
Attain excellent results when pretrained on large-scale datasets
(e.g., JFT-300M)

Lower accuracy than ResNet when it is trained on ImageNet

Lacks the inductive biases inherent to CNNs
• Inductive biases are the characteristics of learning algorithms that influence their generalization behaviour, independent of data.
• Examples: translation equivariance, two-dimensional neighborhood structure, and locality.
• ViT thus does not generalize well given insufficient training data.

Side note: an equivariant mapping is a mapping which preserves the algebraic structure of a transformation. A translation-equivariant mapping is a mapping which, when the input is translated, produces a correspondingly translated output.
Transformers vs. CNN
• Do Transformers act like convolutions?
• Do they act like convolutions, learning the same inductive biases from scratch?
• Or are they developing novel task representations?

Maithra Raghu et al., Do Vision Transformers See Like Convolutional Neural Networks?, arXiv:2108.08810
Transformers vs. CNN

Tool: centered kernel alignment (CKA) to compare the representations (activations) between all pairs of layers; it measures meaningful similarities between representations whose dimension is higher than the number of data points.

CKA is computed from the Gram matrices of the two layers' representations, normalized with the Hilbert-Schmidt Independence Criterion (HSIC).

S. Kornblith, M. Norouzi, H. Lee, and G. Hinton. Similarity of neural network representations revisited. In ICML, 2019
A. Gretton, O. Bousquet, A. J. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In ALT, 2005
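A minimal sketch of linear CKA between the activation matrices of two layers (cross-covariance Frobenius norms, which is the linear-kernel form of the Gram-matrix/HSIC computation above); the function name and example shapes are illustrative:

```python
import numpy as np

def linear_cka(X, Y):
    """X: [n_examples, d1], Y: [n_examples, d2] activations of two layers."""
    X = X - X.mean(axis=0)          # center the features
    Y = Y - Y.mean(axis=0)
    # With a linear kernel, HSIC reduces to squared Frobenius norms of (cross-)covariances
    hsic_xy = np.linalg.norm(Y.T @ X, 'fro') ** 2
    hsic_xx = np.linalg.norm(X.T @ X, 'fro') ** 2
    hsic_yy = np.linalg.norm(Y.T @ Y, 'fro') ** 2
    return hsic_xy / np.sqrt(hsic_xx * hsic_yy)

# e.g. compare a ViT layer (d1 = 768) with a ResNet layer (d2 = 2048) on 500 examples
acts_vit, acts_resnet = np.random.randn(500, 768), np.random.randn(500, 2048)
print(linear_cka(acts_vit, acts_resnet))
# CKA equals 1 for identical representations and is invariant to rotations
# and isotropic scaling of the features.
```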
Transformers vs. CNN

ViT has more uniform representations, with greater similarity between its lower and higher layers.
Transformers vs. CNN
Cross-model comparisons

• The lower half of the 60 ResNet layers is similar to approximately the lowest quarter of the ViT layers.
• In particular, many more lower layers in the ResNet are needed to compute representations similar to the lower layers of ViT.
• The top half of the ResNet is approximately similar to the next third of the ViT layers.
• The final third of the ViT layers is less similar to all ResNet layers.
Transformers vs. CNN
(i) ViT's lower layers compute representations in a different way from the lower layers of the ResNet.

(ii) ViT also more strongly propagates representations between lower and higher layers.

(iii) The highest layers of ViT have quite different representations from those of the ResNet.

(Figure: CKA similarity heatmaps, within-model and cross-model.)
Transformers vs. CNN
With large-scale pre-training

(Figure: average distance between the query patch position and the locations it attends to, with heads sorted by their average distance.)

Two highest layers in the ViT: at higher layers, all self-attention heads are global.

Two lowest layers in the ViT: self-attention layers have a mix of local heads (small distances) and global heads (large distances). This is in contrast to CNNs, which are hardcoded to attend only locally in the lower layers.


Transformers vs. CNN
With NO large-scale pre-training (just training on ImageNet), performance is much lower.

(Figure: average distance between the query patch position and the locations it attends to, with heads sorted by their average attention distance.)

Two highest layers in the ViT: at higher layers, all self-attention heads are global.

Two lowest layers in the ViT: ViT does not learn to attend locally in the earlier layers!

Using local information early on for image tasks (which is hardcoded into CNN architectures) is important for strong performance.
Further Reading

Credit: Zang Yuhang
