AI VIETNAM
All-in-One Course
Applications of
Transformer Encoder
Quang-Vinh Dinh
Ph.D. in Computer Science
Year 2023
Outline
➢ Layer Normalization
➢ Transformer Block
➢ BERT and Text Classification
➢ Vision Transformer and Image Classification
➢ For Time-series and Tabular Data
Normalization
❖ Overview
$\mu_c = \frac{1}{N \times H \times W} \sum_{i=1}^{N} \sum_{j=1}^{H} \sum_{k=1}^{W} F_{ijk}$
$\sigma_c^2 = \frac{1}{N \times H \times W} \sum_{i=1}^{N} \sum_{j=1}^{H} \sum_{k=1}^{W} \left( F_{ijk} - \mu_c \right)^2$
https://arxiv.org/pdf/1803.08494.pdf
Batch Normalization
Example: batch-size = 2, input_shape = (2, 2, 2, 2), $\epsilon = 10^{-5}$, $\gamma = 1.0$, $\beta = 0.0$

Per-channel batch statistics: $\mu = [2.0, 3.0]$, $\sigma^2 = [3.5, 3.25]$

X:
  sample 1: channel 1 = [[0, 5], [3, 0]], channel 2 = [[6, 4], [5, 2]]
  sample 2: channel 1 = [[1, 4], [3, 0]], channel 2 = [[2, 3], [0, 2]]

Batch-Norm Layer output X̂:
  sample 1: channel 1 = [[-1.1, 1.6], [0.5, -1.1]], channel 2 = [[1.6, 0.5], [1.1, -0.5]]
  sample 2: channel 1 = [[-0.5, 1.1], [0.5, -1.1]], channel 2 = [[-0.5, 0.0], [-1.6, -0.5]]

$Y = \gamma \hat{X} + \beta$
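The same computation can be checked with a few lines of PyTorch. This is a small sketch I added (not from the slides); the (N, C, H, W) channel layout is an assumption.

```python
# Reproduce the batch-norm example above with PyTorch (layout assumed: N, C, H, W).
import torch
import torch.nn as nn

x = torch.tensor([
    [[[0., 5.], [3., 0.]], [[6., 4.], [5., 2.]]],   # sample 1: two 2x2 channels
    [[[1., 4.], [3., 0.]], [[2., 3.], [0., 2.]]],   # sample 2
])                                                   # shape (2, 2, 2, 2)

bn = nn.BatchNorm2d(num_features=2, eps=1e-5, affine=True)  # gamma=1, beta=0 at init
bn.train()                                                   # normalize with batch statistics
y = bn(x)

print(x.mean(dim=(0, 2, 3)))                  # per-channel mean     -> tensor([2., 3.])
print(x.var(dim=(0, 2, 3), unbiased=False))   # per-channel variance -> tensor([3.5000, 3.2500])
print(y)                                      # matches the normalized values above (up to rounding)
```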
Layer Normalization
$\mu_n = \frac{1}{C \times H \times W} \sum_{c=1}^{C} \sum_{j=1}^{H} \sum_{k=1}^{W} F_{cjk}$
$\sigma_n^2 = \frac{1}{C \times H \times W} \sum_{c=1}^{C} \sum_{j=1}^{H} \sum_{k=1}^{W} \left( F_{cjk} - \mu_n \right)^2$
https://arxiv.org/pdf/1607.06450.pdf

Example: shape = (1, 2, 2, 1), Layer Norm with mean = 4 and std = 2.73
X = [[5, 1], [2, 8]]  →  X̂ = [[0.36, -1.09], [-0.73, 1.46]]
Layer Normalization

One sample with two channels, input_shape = (1, 2, 2, 2); Layer Norm normalizes over the whole sample (mean = 3.5, std = 2.54):
X = [[8, 6], [2, 4]], [[0, 2], [1, 5]]  →  X̂ = [[1.76, 0.98], [-0.58, 0.19]], [[-1.37, -0.58], [-0.98, 0.58]]

Two samples with one channel each, input_shape = (2, 2, 2, 1); Layer Norm normalizes each sample separately (mean1 = 5.0, std1 = 2.23; mean2 = 2.0, std2 = 1.87):
X = [[8, 6], [2, 4]] ; [[0, 2], [1, 5]]  →  X̂ = [[1.34, 0.44], [-1.34, -0.44]] ; [[-1.06, 0.0], [-0.53, 1.6]]
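A quick sketch (my addition, not the course code) that checks the first example with torch.nn.LayerNorm; normalized_shape is assumed to cover the whole 2×2 sample.

```python
# Layer norm over one 2x2 sample: mean = 4, std ~ 2.73.
import torch
import torch.nn as nn

x = torch.tensor([[[5., 1.], [2., 8.]]])              # shape (1, 2, 2): one sample
ln = nn.LayerNorm(normalized_shape=(2, 2), eps=1e-5)  # gamma=1, beta=0 at init
print(ln(x))
# ~ [[[ 0.37, -1.10],
#     [-0.73,  1.46]]]
```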
Instance Normalization
$\mu_{instance} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_{ij}$
$\sigma_{instance}^2 = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left( F_{ij} - \mu_{instance} \right)^2$
https://arxiv.org/pdf/1607.08022.pdf

Example: shape = (1, 2, 2, 1), Instance-Norm Layer with $\mu = 2.5$, $\sigma^2 = 4.25$ ($\sigma \approx 2.06$)
X = [[1, 5], [4, 0]]  →  X̂ = [[-0.7276, 1.2127], [0.7276, -1.2127]]
Instance Normalization
Example: batch-size = 1, input_shape = (1, 2, 2, 2), $\gamma = 1.0$, $\beta = 0.0$, $\epsilon = 10^{-5}$

$\mu_{instance} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_{ij}$
$\sigma_{instance}^2 = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left( F_{ij} - \mu_{instance} \right)^2$

Per-channel statistics: $\mu = [2.5, 4.75]$, $\sigma^2 = [4.25, 6.18]$

X (sample 1): channel 1 = [[1, 5], [4, 0]], channel 2 = [[5, 8], [5, 1]]

Instance-Norm Layer output X̂:
  channel 1 = [[-0.72, 1.21], [0.72, -1.21]]
  channel 2 = [[0.1, 1.3], [0.1, -1.5]]
Instance Normalization
Example: batch-size = 2, input_shape = (2, 2, 2, 2), $\gamma = 1.0$, $\beta = 0.0$, $\epsilon = 10^{-5}$

Per-channel statistics:
  sample 1: $\mu = [2.5, 5.0]$, $\sigma^2 = [4.25, 7.5]$
  sample 2: $\mu = [4.25, 1.75]$, $\sigma^2 = [5.68, 2.18]$

X:
  sample 1: channel 1 = [[1, 5], [4, 0]], channel 2 = [[9, 2], [6, 3]]
  sample 2: channel 1 = [[6, 3], [1, 7]], channel 2 = [[0, 2], [1, 4]]

Instance-Norm Layer output X̂:
  sample 1: channel 1 = [[-0.72, 1.21], [0.72, -1.21]], channel 2 = [[1.46, -1.09], [0.36, -0.73]]
  sample 2: channel 1 = [[0.73, -0.52], [-1.36, 1.15]], channel 2 = [[-1.18, 0.17], [-0.51, 1.52]]

$Y = \gamma \hat{X} + \beta$
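For reference, a minimal sketch (my addition) that reproduces the single-sample example with torch.nn.InstanceNorm2d, again assuming the (N, C, H, W) layout.

```python
# Instance norm normalizes every (sample, channel) pair over H x W separately.
import torch
import torch.nn as nn

x = torch.tensor([[[[1., 5.], [4., 0.]],     # sample 1, channel 1
                   [[5., 8.], [5., 1.]]]])   # sample 1, channel 2 -> shape (1, 2, 2, 2)

inorm = nn.InstanceNorm2d(num_features=2, eps=1e-5)  # no affine parameters by default
print(inorm(x))
# channel 1 -> [[-0.73, 1.21], [ 0.73, -1.21]]
# channel 2 -> [[ 0.10, 1.31], [ 0.10, -1.51]]
```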
Group Normalization
Setup: num_samples = 2, num_channels (C) = 4, num_groups (G) = 2; input/output shape = (2, 2, 2, 4)

$\mathcal{S}_i = \left\{ k \mid k_N = i_N, \; \left\lfloor \frac{k_C}{C/G} \right\rfloor = \left\lfloor \frac{i_C}{C/G} \right\rfloor \right\}$
$\mu_{gr} = \frac{1}{m} \sum_{k \in \mathcal{S}_i} x_k, \qquad \sigma_{gr}^2 = \frac{1}{m} \sum_{k \in \mathcal{S}_i} \left( x_k - \mu_{gr} \right)^2$

X1 (channels 1-4): [[1, 5], [4, 7]], [[1, 2], [4, 0]], [[9, 2], [0, 3]], [[6, 3], [1, 8]]
  group 1 (channels 1-2): $\mu = 3$, $\sigma^2 = 5$;  group 2 (channels 3-4): $\mu = 4$, $\sigma^2 = 9.5$

X2 (channels 1-4): [[5, 2], [6, 3]], [[1, 7], [0, 7]], [[0, 2], [3, 3]], [[1, 4], [2, 5]]
  group 1: $\mu = 3.8$, $\sigma^2 = 6.6$;  group 2: $\mu = 2.5$, $\sigma^2 = 2.25$

Group-Norm Layer output:
X̂1 = [[-0.89, 0.89], [0.44, 1.78]], [[-0.9, -0.44], [0.44, -1.3]], [[1.62, -0.64], [-1.29, -0.32]], [[0.64, -0.32], [-0.97, 1.29]]
X̂2 = [[0.43, -0.72], [0.82, -0.34]], [[-1.11, 1.21], [-1.5, 1.21]], [[-1.66, -0.33], [0.33, 0.33]], [[-1.0, 1.0], [-0.33, 1.67]]
Group Normalization

num_channels = 6 with num_groups = 1 (all channels in one group, i.e. layer-norm behaviour) versus num_channels = 6 with num_groups = 6 (one channel per group, i.e. instance-norm behaviour).
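A small sketch (mine, not from the slides) showing how torch.nn.GroupNorm covers all three settings; the tensor shape and channel count below are arbitrary.

```python
# GroupNorm with different num_groups: 1 ~ layer norm over (C, H, W), C ~ instance norm.
import torch
import torch.nn as nn

x = torch.randn(2, 6, 4, 4)                          # (N, C, H, W) with C = 6 channels

gn_2 = nn.GroupNorm(num_groups=2, num_channels=6)    # 2 groups of 3 channels
gn_1 = nn.GroupNorm(num_groups=1, num_channels=6)    # ~ layer norm over (C, H, W)
gn_6 = nn.GroupNorm(num_groups=6, num_channels=6)    # ~ instance norm

for gn in (gn_2, gn_1, gn_6):
    print(gn(x).shape)                               # torch.Size([2, 6, 4, 4])
```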
Outline
➢ Layer Normalization
➢ Transformer Block
➢ BERT and Text Classification
➢ Vision Transformer and Image Classification
➢ For Time-series and Tabular Data
Transformer Block

Input Embedding (N, Seq_len, E_dim)
  → Multi-head Attention
  → Add & Norm
  → Feed Forward
  → Add & Norm
  → Output (N, Seq_len, E_dim)

https://arxiv.org/pdf/1803.08494.pdf
[Figure: two building blocks used above: an MLP + Softmax head (x₁ … xₙ → z₁ … zₙ → ŷ₁ … ŷₙ) and a skip connection (X → Conv → Conv → + X → Y).]
Transformer Block

Input (bs = 1, sequence_length = 2, embed_dim = 3):
X = [[-0.1, 0.1, 0.3], [0.4, -1.1, -0.3]]

Multi-head Attention weights:
W_Q = [[-0.35, 0.51, 0.50], [0.36, -0.47, -0.29], [-0.51, -0.14, -0.56]]
W_K = [[-0.49, -0.68, 0.18], [-0.44, -0.46, 0.18], [0.07, -0.10, 0.44]]
W_V = [[-0.41, 0.39, -0.65], [-0.40, -0.07, -0.34], [-0.55, -0.13, -0.29]]
W_O = [[-0.36, -0.08, 0.32], [0.27, 0.05, 0.15], [-0.05, -0.28, 0.05]]

$Y = \mathrm{softmax}\!\left( \frac{QK^T}{\sqrt{d}} \right) V$

Multi-head Attention output:
[[-0.029, -0.028, 0.065], [-0.025, -0.025, 0.058]]
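The attention output above can be reproduced approximately with the following sketch (my own code, using the weight values printed on the slide; the softmax over QKᵀ/√d followed by the W_O projection is the standard formulation and is assumed here).

```python
# Single-head attention: Q = X W_Q, K = X W_K, V = X W_V, softmax(QK^T / sqrt(d)) V, then W_O.
import torch
import torch.nn.functional as F

X = torch.tensor([[-0.1,  0.1,  0.3],
                  [ 0.4, -1.1, -0.3]])                    # (seq_len=2, embed_dim=3)

W_Q = torch.tensor([[-0.35,  0.51,  0.50], [ 0.36, -0.47, -0.29], [-0.51, -0.14, -0.56]])
W_K = torch.tensor([[-0.49, -0.68,  0.18], [-0.44, -0.46,  0.18], [ 0.07, -0.10,  0.44]])
W_V = torch.tensor([[-0.41,  0.39, -0.65], [-0.40, -0.07, -0.34], [-0.55, -0.13, -0.29]])
W_O = torch.tensor([[-0.36, -0.08,  0.32], [ 0.27,  0.05,  0.15], [-0.05, -0.28,  0.05]])

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
A = F.softmax(Q @ K.T / (Q.shape[-1] ** 0.5), dim=-1)     # attention weights
out = (A @ V) @ W_O
print(out)   # ~ [[-0.029, -0.027, 0.063], [-0.025, -0.024, 0.056]]  (slide values up to rounding)
```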
Transformer Block: Add (residual connection)

Multi-head Attention output + input embedding:
[[-0.029, -0.028, 0.065], [-0.025, -0.025, 0.058]] + [[-0.1, 0.1, 0.3], [0.4, -1.1, -0.3]]
= [[-0.12, 0.07, 0.36], [0.37, -1.12, -0.24]]
Transformer Block: first Add & Norm

Layer Norm([[-0.12, 0.07, 0.36], [0.37, -1.12, -0.24]])
= [[-1.14, -0.15, 1.29], [1.14, -1.29, 0.14]]

https://openaccess.thecvf.com/content/ICCV2021W/NeurArch/papers/Yao_Leveraging_Batch_Normalization_for_Vision_Transformers_ICCVW_2021_paper.pdf
Transformer Block: Feed Forward

W = [[0.15, 0.39, -0.34], [-0.41, 0.54, 0.49], [0.50, -0.07, -0.41]]

Feed Forward([[-1.14, -0.15, 1.29], [1.14, -1.29, 0.14]])
= [[0.53, -0.62, -0.22], [0.78, -0.26, -1.08]]
Transformer Block: second Add & Norm (bs = 1, sequence_length = 2, embed_dim = 3)

Layer Norm(Feed-Forward output + Layer-Norm output)
= Layer Norm([[0.53, -0.62, -0.22], [0.78, -0.26, -1.08]] + [[-1.14, -0.15, 1.29], [1.14, -1.29, 0.14]])
= [[-0.59, -0.81, 1.40], [1.39, -0.90, -0.49]]
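Putting the steps together, here is a compact sketch of the block (my own code, not the lecture's implementation; the feed-forward width, activation, and random initialization are assumptions, so it reproduces the shapes rather than the exact numbers above).

```python
# Encoder block: self-attention -> add & layer-norm -> feed-forward -> add & layer-norm.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim=3, num_heads=1, ff_dim=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(embed_dim, ff_dim), nn.ReLU(),
                                nn.Linear(ff_dim, embed_dim))
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)       # multi-head self-attention
        x = self.norm1(x + attn_out)           # first Add & Norm
        x = self.norm2(x + self.ff(x))         # second Add & Norm
        return x

x = torch.tensor([[[-0.1, 0.1, 0.3], [0.4, -1.1, -0.3]]])   # (1, 2, 3)
print(TransformerBlock()(x).shape)                          # torch.Size([1, 2, 3])
```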
Transformer Models for Text Classification

Input Embedding + Positional Embedding
  → N × [ Multi-head Attention → Add & Norm → Feed Forward → Add & Norm ]
  → Linear
  → Softmax

https://arxiv.org/pdf/1803.08494.pdf
Transformer
❖ Positional encoding
$\boldsymbol{v}_{pos} = \left[ v_{pos}^{0}, v_{pos}^{1}, v_{pos}^{2}, \ldots, v_{pos}^{d-2}, v_{pos}^{d-1} \right] = \left[ \sin(w_0 \, pos), \cos(w_0 \, pos), \sin(w_1 \, pos), \cos(w_1 \, pos), \ldots, \sin(w_{d/2-1} \, pos), \cos(w_{d/2-1} \, pos) \right]$

$v_{pos}^{2k} = \sin(w_k \, pos), \quad v_{pos}^{2k+1} = \cos(w_k \, pos), \quad w_k = \frac{1}{T^{2k/d}}, \quad T = 10000$

[Figure: heatmaps of the positional encoding for T = 10, 100, 1000, 10000 (0 ≤ pos ≤ 99, 0 ≤ d ≤ 127), and for T = 100 with sequence lengths 0 ≤ pos ≤ 9, 99, 199, 499.]
❖ Positional encoding: worked example

$\boldsymbol{v}_{pos} = \left[ \sin(w_0 \, pos), \cos(w_0 \, pos), \sin(w_1 \, pos), \cos(w_1 \, pos), \ldots \right], \quad w_k = \frac{1}{T^{2k/d}}$

With T = 100, 0 ≤ pos ≤ 2, 0 ≤ d ≤ 3 (i.e. d = 4 dimensions):

v_0 = [sin(w_0·0), cos(w_0·0), sin(w_1·0), cos(w_1·0)] = [0.0, 1.0, 0.0, 1.0]
v_1 = [sin(w_0·1), cos(w_0·1), sin(w_1·1), cos(w_1·1)] = [0.84, 0.54, 0.09, 0.99]
v_2 = [sin(w_0·2), cos(w_0·2), sin(w_1·2), cos(w_1·2)] = [0.91, -0.41, 0.19, 0.98]

https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
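A short sketch (my addition) of the sinusoidal encoding; with T = 100 and d = 4 it reproduces the small table above up to rounding.

```python
# Sinusoidal positional encoding: even indices sin(w_k * pos), odd indices cos(w_k * pos).
import numpy as np

def positional_encoding(num_positions, d, T=10000.0):
    pe = np.zeros((num_positions, d))
    for pos in range(num_positions):
        for k in range(d // 2):
            w_k = 1.0 / T ** (2 * k / d)
            pe[pos, 2 * k] = np.sin(w_k * pos)       # even index: sin
            pe[pos, 2 * k + 1] = np.cos(w_k * pos)   # odd index: cos
    return pe

print(np.round(positional_encoding(3, 4, T=100.0), 2))
# [[ 0.    1.    0.    1.  ]
#  [ 0.84  0.54  0.1   1.  ]
#  [ 0.91 -0.42  0.2   0.98]]
```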
Outline
➢ Layer Normalization
➢ Transformer Block
➢ BERT and Text Classification
➢ Vision Transformer and Image Classification
➢ For Time-series and Tabular Data
Text Classification
❖ IMDB dataset: 50,000 movie reviews for sentiment analysis
- 25,000 reviews for training and 25,000 reviews for testing
- Labels: positive and negative, balanced 1 : 1
“A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece …” → positive

“This show was an amazing, fresh & innovative idea in the 70's when it first aired. The first 7 or 8 years were brilliant, but things dropped off after that. By 1990, the show was not really funny anymore, and it's continued its decline further to the complete waste of time it is today….” → negative

“I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer)….” → positive

“BTW Carver gets a very annoying sidekick who makes you wanna shoot him the first three minutes he's on screen.” → negative
Transformer Models for Text Classification
❖ Embedding: Input Embedding + Positional Embedding
Transformer Models for Text Classification
❖ Transformer block: Multi-head Attention → Add & Norm → Feed Forward → Add & Norm
Transformer Models for Text Classification

Input Embedding + Positional Embedding
  → N × [ Multi-head Attention → Add & Norm → Feed Forward → Add & Norm ]
  → Linear
  → Softmax
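As a rough sketch of the architecture drawn above (my own code, not the course implementation; the head count, layer count, and feed-forward width are assumptions):

```python
# Token embedding + positional embedding -> N encoder blocks -> linear classifier head.
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, max_len=500, embed_dim=128,
                 num_heads=4, num_layers=2, num_classes=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(max_len, embed_dim)           # learned positions
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)   # the "N x" stack
        self.classifier = nn.Linear(embed_dim, num_classes)       # softmax is applied in the loss

    def forward(self, token_ids):                                 # (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token_emb(token_ids) + self.pos_emb(positions)
        x = self.encoder(x)
        return self.classifier(x.mean(dim=1))                     # mean-pool over tokens

logits = TextClassifier(vocab_size=10000)(torch.randint(0, 10000, (2, 500)))
print(logits.shape)   # torch.Size([2, 2])
```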
Transformer Models for Text Classification
❖ Results

Transformer trained from scratch on IMDB: test accuracy ≈ 82%

Baseline text deep models (embed_dim = 128, hidden_dim = 64, inputs of 500 words, 2 output classes):
  RNN: test accuracy ≈ 68%
  LSTM: test accuracy ≈ 87%

[Figures: training/test accuracy curves over 20 epochs for each model, and a stacked-LSTM diagram over Word-1 … Word-500.]
Text Classification
❖ Tweets dataset
- Training samples: 7,613
- Disaster tweets: 4,342
- Non-disaster tweets: 3,271
Transformer Models for Text Classification

The same model is applied to the Tweets dataset: Input Embedding + Positional Embedding → N × [ Multi-head Attention → Add & Norm → Feed Forward → Add & Norm ] → Linear → Softmax
Text Classification
❖ Results: Transformer trained from scratch on the Tweets dataset reaches a test accuracy of ≈ 78%.

[Figure: training/test accuracy curves over 40 epochs.]
Model Architecture: BERT (Bidirectional Encoder Representations from Transformers)
- Trained on Wikipedia (~2.5B words) and Google's BooksCorpus (~800M words)
- BERT was trained on 64 TPUs over the course of 4 days
- DistilBERT is a lighter version of BERT; it runs 60% faster while maintaining over 95% of BERT's performance
https://arxiv.org/pdf/1810.04805.pdf

Model Inputs: BERT (Bidirectional Encoder Representations from Transformers)
https://huggingface.co/blog/bert-101
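For the transfer-learning route, a minimal sketch (assuming the Hugging Face transformers library is installed) that loads a pretrained DistilBERT with a classification head; fine-tuning on the task data is still needed to make the logits meaningful.

```python
# Load pretrained DistilBERT with a 2-class classification head and run one forward pass.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

inputs = tokenizer("A wonderful little production.", return_tensors="pt",
                   truncation=True, padding=True)
with torch.no_grad():
    logits = model(**inputs).logits        # shape (1, 2)
print(logits.shape)
```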
Attention Visualization
https://github.com/jessevig/bertviz
Outline
➢ Layer Normalization
➢ Transformer Block
➢ BERT and Text Classification
➢ Vision Transformer and Image Classification
➢ For Time-series and Tabular Data
Vision Transformer

$Y = \mathrm{softmax}\!\left( \frac{QK^T}{\sqrt{d_k}} \right) V, \qquad O = A W_O$

Self-Attention over a token sequence:
word-0, word-1, …, word-n → Embedding → X_0, X_1, …, X_n → (W_Q, W_K, W_V) → Q_i, K_i, V_i → Self-Attention → Y_0, Y_1, …, Y_n
Vision Transformer

$Y = \mathrm{softmax}\!\left( \frac{QK^T}{\sqrt{d_k}} \right) V, \qquad O = A W_O$

The same Self-Attention, but over image patches:
patch-0, patch-1, …, patch-n → Projection and Embedding → X_0, X_1, …, X_n → (W_Q, W_K, W_V) → Q_i, K_i, V_i → Self-Attention → Y_0, Y_1, …, Y_n
Vision Transformer
❖ From text to image
Vision Transformer
❖ Get patches

The image is split into a grid of non-overlapping patches, numbered in row-major order:
1 2 3
4 5 6
7 8 9
Vision Transformer
❖ Patch and position embedding

Each patch is flattened into a 4800-dimensional vector, projected by a shared Linear layer to a 128-dimensional embedding, and added (+) to a position embedding for positions 0 to 8; the nine resulting 128-dimensional tokens form the input sequence.
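A small sketch (my own, with an assumed 120×120 RGB input so that each of the 3×3 patches flattens to 4,800 values) of the patch projection plus position embedding:

```python
# Split an image into 3x3 patches, flatten, project to 128 dims, add position embeddings.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 120, 120)                    # assumed 120x120 RGB image
patch = 40                                             # 3 x 3 grid of 40x40 patches
patches = image.unfold(2, patch, patch).unfold(3, patch, patch)   # (1, 3, 3, 3, 40, 40)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 9, -1)     # (1, 9, 4800)

proj = nn.Linear(3 * patch * patch, 128)               # 4800 -> 128
pos_emb = nn.Embedding(9, 128)                         # positions 0..8
tokens = proj(patches) + pos_emb(torch.arange(9))      # (1, 9, 128), ready for the encoder
print(tokens.shape)
```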
Pretrained Vision Transformer
❖ JFT-300M: 300M images; an internal Google dataset used for training image-classification models

Vision Transformer: from text to image
https://arxiv.org/pdf/2010.11929.pdf

Performance of ViT on the CIFAR-10 dataset:
  Train from scratch: 78%
  Transfer learning: 98%
* You may have different results in your own experiments
Outline
➢ Layer Normalization
➢ Transformer Block
➢ BERT and Text Classification
➢ Vision Transformer and Image Classification
➢ For Time-series and Tabular Data
Patch Time-Series Transformer (PatchTST)
❖ For time-series data
https://github.com/yuqinie98/PatchTST
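The core PatchTST idea can be sketched as follows (my own simplified code, not the repository's; patch length, stride, and model sizes are assumptions):

```python
# Slice a univariate series into overlapping patches, project each patch, encode the sequence.
import torch
import torch.nn as nn

series = torch.randn(1, 1, 336)                         # (batch, channels, time steps)
patch_len, stride = 16, 8
patches = series.unfold(-1, patch_len, stride)          # (1, 1, 41, 16): 41 patches of length 16
patches = patches.squeeze(1)                            # (1, 41, 16)

proj = nn.Linear(patch_len, 128)                        # each patch becomes a token embedding
layer = nn.TransformerEncoderLayer(128, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
tokens = encoder(proj(patches))                         # (1, 41, 128)
print(tokens.shape)
```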
TabTransformer
❖ For Tabular Data: Categorical Embeddings
https://towardsdatascience.com/transformers-for-tabular-data-tabtransformer-deep-dive-5fb2438da820
TabTransformer
❖ For Tabular Data: Contextual Embeddings
https://towardsdatascience.com/transformers-for-tabular-data-tabtransformer-deep-dive-5fb2438da820
https://keras.io/examples/structured_data/tabtransformer/
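A very small sketch of the TabTransformer idea (my own code, loosely following the description above; column cardinalities and sizes are made up for illustration): embed each categorical column, pass the column embeddings through a transformer encoder to obtain contextual embeddings, then concatenate them with the normalized continuous features for a classifier head.

```python
# Categorical columns -> embeddings -> transformer encoder; concatenate with continuous features.
import torch
import torch.nn as nn

num_categories = [10, 5, 7]                 # assumed cardinalities of 3 categorical columns
embed_dim, num_cont = 32, 4

embeddings = nn.ModuleList(nn.Embedding(c, embed_dim) for c in num_categories)
layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
norm = nn.LayerNorm(num_cont)
head = nn.Linear(len(num_categories) * embed_dim + num_cont, 2)

cat = torch.randint(0, 5, (8, 3))           # (batch, categorical columns)
cont = torch.randn(8, num_cont)             # (batch, continuous columns)

tokens = torch.stack([emb(cat[:, i]) for i, emb in enumerate(embeddings)], dim=1)
contextual = encoder(tokens).flatten(1)     # (8, 3 * 32) contextual embeddings
logits = head(torch.cat([contextual, norm(cont)], dim=1))
print(logits.shape)                         # torch.Size([8, 2])
```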