Transformer For Different Data - v3
AI VIETNAM

All-in-One Course

Applications of
Transformer Encoder

Quang-Vinh Dinh
Ph.D. in Computer Science

Year 2023
Outline
➢ Layer Normalization
➢ Transformer Block
➢ BERT and Text Classification
➢ Vision Transformer and Image Classification
➢ For Time-series and Tabular Data
Normalization
❖ Overview

$$\mu_c = \frac{1}{N \times H \times W}\sum_{i=1}^{N}\sum_{j=1}^{H}\sum_{k=1}^{W} F_{ijk}
\qquad
\sigma_c^2 = \frac{1}{N \times H \times W}\sum_{i=1}^{N}\sum_{j=1}^{H}\sum_{k=1}^{W}\left(F_{ijk} - \mu_c\right)^2$$

https://arxiv.org/pdf/1803.08494.pdf
Batch Normalization
𝜖 = 10⁻⁵, 𝛾 = 1.0, 𝛽 = 0.0
Per-channel statistics over the whole batch: 𝜇 = [2.0, 3.0], 𝜎² = [3.5, 3.25]

batch-size = 2, input_shape = (2, 2, 2, 2)

X (sample 1): channel 1 = [[0, 5], [3, 0]], channel 2 = [[6, 4], [5, 2]]
X (sample 2): channel 1 = [[1, 4], [3, 0]], channel 2 = [[2, 3], [0, 2]]

Batch-Norm Layer: X̂ = (X − 𝜇) / √(𝜎² + 𝜖), Ŷ = 𝛾·X̂ + 𝛽

X̂ (sample 1): channel 1 = [[−1.1, 1.6], [0.5, −1.1]], channel 2 = [[1.6, 0.5], [1.1, −0.5]]
X̂ (sample 2): channel 1 = [[−0.5, 1.1], [0.5, −1.1]], channel 2 = [[−0.5, 0.0], [−1.6, −0.5]]
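The batch-norm numbers above can be reproduced with a few lines of NumPy; this is a minimal sketch (the NCHW layout is an assumption that matches the slide's shapes), not the course's own code.

```python
import numpy as np

# Batch normalization check for the slide's example (assumed NCHW layout).
eps = 1e-5
X = np.array([
    [[[0, 5], [3, 0]], [[6, 4], [5, 2]]],   # sample 1: channel 1, channel 2
    [[[1, 4], [3, 0]], [[2, 3], [0, 2]]],   # sample 2: channel 1, channel 2
], dtype=np.float32)                         # shape (N, C, H, W) = (2, 2, 2, 2)

mu = X.mean(axis=(0, 2, 3), keepdims=True)   # per-channel mean     -> [2.0, 3.0]
var = X.var(axis=(0, 2, 3), keepdims=True)   # per-channel variance -> [3.5, 3.25]
X_hat = (X - mu) / np.sqrt(var + eps)
gamma, beta = 1.0, 0.0
Y_hat = gamma * X_hat + beta
print(np.round(Y_hat, 1))
```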
Layer Normalization
$$\mu_n = \frac{1}{C \times H \times W}\sum_{c=1}^{C}\sum_{j=1}^{H}\sum_{k=1}^{W} F_{cjk}
\qquad
\sigma_n^2 = \frac{1}{C \times H \times W}\sum_{c=1}^{C}\sum_{j=1}^{H}\sum_{k=1}^{W}\left(F_{cjk} - \mu_n\right)^2$$

https://arxiv.org/pdf/1607.06450.pdf

X = [[5, 1], [2, 8]], shape = (1, 2, 2, 1)
Layer Norm (mean = 4, std = 2.73):
X̂ = [[0.36, −1.09], [−0.73, 1.46]], shape = (1, 2, 2, 1)
Layer Normalization

One sample with two channels: input_shape = (1, 2, 2, 2)
X: channel 1 = [[8, 6], [2, 4]], channel 2 = [[0, 2], [1, 5]]
Layer Norm (mean = 3.5, std = 2.54):
X̂: channel 1 = [[1.76, 0.98], [−0.58, 0.19]], channel 2 = [[−1.37, −0.58], [−0.98, 0.58]]

Two samples with one channel each: input_shape = (2, 2, 2, 1)
X: sample 1 = [[8, 6], [2, 4]], sample 2 = [[0, 2], [1, 5]]
Layer Norm (mean₁ = 5.0, std₁ = 2.23; mean₂ = 2.0, std₂ = 1.87):
X̂: sample 1 = [[1.34, 0.44], [−1.34, −0.44]], sample 2 = [[−1.06, 0.0], [−0.53, 1.6]]
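A quick NumPy sketch of the second example; each sample uses its own statistics, and the single-channel (N, H, W) layout below is an assumption matching the slide's shapes.

```python
import numpy as np

# Layer normalization: every sample is normalized with its own mean/variance,
# computed over all of its channel and spatial positions.
eps = 1e-5
X = np.array([[[8., 6.], [2., 4.]],      # sample 1
              [[0., 2.], [1., 5.]]])     # sample 2

mu = X.mean(axis=(1, 2), keepdims=True)   # per-sample mean     -> [5.0, 2.0]
var = X.var(axis=(1, 2), keepdims=True)   # per-sample variance -> [5.0, 3.5]
print(np.round((X - mu) / np.sqrt(var + eps), 2))
```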
Instance Normalization

$$\mu_{\text{instance}} = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} F_{ij}
\qquad
\sigma_{\text{instance}}^2 = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W}\left(F_{ij} - \mu_{\text{instance}}\right)^2$$

https://arxiv.org/pdf/1607.08022.pdf

Example (𝜇 = 2.5, 𝜎 ≈ 2.06):
X = [[1, 5], [4, 0]], shape = (1, 2, 2, 1)
Instance-Norm Layer:
X̂ = [[−0.7276, 1.2127], [0.7276, −1.2127]], shape = (1, 2, 2, 1)
Instance Normalization

𝜖 = 10⁻⁵, 𝛾 = 1.0, 𝛽 = 0.0
batch-size = 1, input_shape = (1, 2, 2, 2)
Per-channel statistics within the sample: 𝜇 = [2.5, 4.75], 𝜎² = [4.25, 6.18]

X: channel 1 = [[1, 5], [4, 0]], channel 2 = [[5, 8], [5, 1]]
Instance-Norm Layer:
X̂: channel 1 = [[−0.72, 1.21], [0.72, −1.21]], channel 2 = [[0.1, 1.3], [0.1, −1.5]]
Instance Normalization

𝜖 = 10⁻⁵, 𝛾 = 1.0, 𝛽 = 0.0
batch-size = 2, input_shape = (2, 2, 2, 2)
Per-channel statistics within each sample:
sample 1: 𝜇 = [2.5, 5.0], 𝜎² = [4.25, 7.5]
sample 2: 𝜇 = [4.25, 1.75], 𝜎² = [5.68, 2.18]

X (sample 1): channel 1 = [[1, 5], [4, 0]], channel 2 = [[9, 2], [6, 3]]
X (sample 2): channel 1 = [[6, 3], [1, 7]], channel 2 = [[0, 2], [1, 4]]
Instance-Norm Layer: Ŷ = 𝛾·X̂ + 𝛽 = X̂
X̂ (sample 1): channel 1 = [[−0.72, 1.21], [0.72, −1.21]], channel 2 = [[1.46, −1.09], [0.36, −0.73]]
X̂ (sample 2): channel 1 = [[0.73, −0.52], [−1.36, 1.15]], channel 2 = [[−1.18, 0.16], [−0.5, 1.52]]
Group Normalization

num_channels (C) = 4, num_groups (G) = 2, num_samples (N) = 2

$$S_i = \left\{ k \;\middle|\; k_N = i_N,\ \left\lfloor \frac{k_C}{C/G} \right\rfloor = \left\lfloor \frac{i_C}{C/G} \right\rfloor \right\}
\qquad
\mu_{gr} = \frac{1}{m}\sum_{k \in S_i} x_k
\qquad
\sigma_{gr}^2 = \frac{1}{m}\sum_{k \in S_i}\left(x_k - \mu_{gr}\right)^2$$

Group statistics (each group spans C/G = 2 channels of one sample, over H and W):
sample 1: group 1: 𝜇 = 3, 𝜎² = 5; group 2: 𝜇 = 4, 𝜎² = 9.5
sample 2: group 1: 𝜇 = 3.8, 𝜎² = 6.6; group 2: 𝜇 = 2.5, 𝜎² = 2.25

X₁ (sample 1): ch 1 = [[1, 5], [4, 7]], ch 2 = [[1, 2], [4, 0]], ch 3 = [[9, 2], [0, 3]], ch 4 = [[6, 3], [1, 8]]
X₂ (sample 2): ch 1 = [[5, 2], [6, 3]], ch 2 = [[1, 7], [0, 7]], ch 3 = [[0, 2], [3, 3]], ch 4 = [[1, 4], [2, 5]]

Group-Norm Layer (shape (2, 2, 2, 4) → (2, 2, 2, 4)):
X̂₁: ch 1 = [[−0.89, 0.89], [0.44, 1.78]], ch 2 = [[−0.9, −0.44], [0.44, −1.3]], ch 3 = [[1.62, −0.64], [−1.29, −0.32]], ch 4 = [[0.64, −0.32], [−0.97, 1.29]]
X̂₂: ch 1 = [[0.43, −0.72], [0.82, −0.34]], ch 2 = [[−1.11, 1.21], [−1.5, 1.21]], ch 3 = [[−1.66, −0.33], [0.33, 0.33]], ch 4 = [[−1.0, 1.0], [−0.33, 1.67]]
Group Normalization
num_channels = 6 with num_groups = 1 (all channels in one group, i.e. layer normalization) versus num_channels = 6 with num_groups = 6 (one channel per group, i.e. instance normalization)
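PyTorch's GroupNorm makes this relationship concrete; a small sketch (the tensor sizes below are arbitrary assumptions):

```python
import torch
import torch.nn as nn

# GroupNorm with different group counts reproduces the other per-sample norms
# (up to the shape of the learnable affine parameters).
x = torch.randn(2, 6, 4, 4)                                     # (N, C, H, W)

layer_like = nn.GroupNorm(num_groups=1, num_channels=6)(x)      # 1 group  -> Layer Norm over (C, H, W)
instance_like = nn.GroupNorm(num_groups=6, num_channels=6)(x)   # C groups -> Instance Norm per channel
group = nn.GroupNorm(num_groups=3, num_channels=6)(x)           # 2 channels per group

print(torch.allclose(instance_like, nn.InstanceNorm2d(6)(x), atol=1e-5))  # True
```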
Outline
➢ Layer Normalization
➢ Transformer Block
➢ BERT and Text Classification
➢ Vision Transformer and Image Classification
➢ For Time-series and Tabular Data
Transformer Block

Input Embedding (N, Seq_len, E_dim) → Multi-head Attention → Add & Norm → Feed Forward → Add & Norm → Output (N, Seq_len, E_dim)

https://arxiv.org/pdf/1803.08494.pdf
Figures: an MLP + Softmax head (inputs x₁, …, xₙ → logits z₁, …, zₙ → softmax → predictions ŷ₁, …, ŷₙ) and a skip connection (X → Conv → Conv → f(X), with output Y = f(X) + X).
Transformer Block: multi-head attention on a toy input

bs = 1, sequence_length = 2, embed_dim = 3
Input X = [[−0.1, 0.1, 0.3], [0.4, −1.1, −0.3]]

W_Q = [[−0.35, 0.51, 0.50], [0.36, −0.47, −0.29], [−0.51, −0.14, −0.56]]
W_K = [[−0.49, −0.68, 0.18], [−0.44, −0.46, 0.18], [0.07, −0.10, 0.44]]
W_V = [[−0.41, 0.39, −0.65], [−0.40, −0.07, −0.34], [−0.55, −0.13, −0.29]]
W_O = [[−0.36, −0.08, 0.32], [0.27, 0.05, 0.15], [−0.05, −0.28, 0.05]]

$$Y = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V, \qquad O = Y\,W_O$$

Multi-head Attention output O = [[−0.029, −0.028, 0.065], [−0.025, −0.025, 0.058]]
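The attention output above can be reproduced directly; a minimal single-head sketch in NumPy, assuming row-vector inputs multiplied on the right by the weight matrices (which matches the slide's numbers):

```python
import numpy as np

# Single-head scaled dot-product attention with the slide's toy weights.
def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

X = np.array([[-0.1, 0.1, 0.3], [0.4, -1.1, -0.3]])
W_Q = np.array([[-0.35, 0.51, 0.50], [0.36, -0.47, -0.29], [-0.51, -0.14, -0.56]])
W_K = np.array([[-0.49, -0.68, 0.18], [-0.44, -0.46, 0.18], [0.07, -0.10, 0.44]])
W_V = np.array([[-0.41, 0.39, -0.65], [-0.40, -0.07, -0.34], [-0.55, -0.13, -0.29]])
W_O = np.array([[-0.36, -0.08, 0.32], [0.27, 0.05, 0.15], [-0.05, -0.28, 0.05]])

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
A = softmax(Q @ K.T / np.sqrt(X.shape[-1])) @ V   # scaled dot-product attention
O = A @ W_O                                       # output projection
print(np.round(O, 3))   # close to [[-0.029, -0.028, 0.065], [-0.025, -0.025, 0.058]]
```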
Transformer Block: Add (residual connection)

The attention output is added back to the input embedding:
[[−0.1, 0.1, 0.3], [0.4, −1.1, −0.3]] + [[−0.029, −0.028, 0.065], [−0.025, −0.025, 0.058]] = [[−0.12, 0.07, 0.36], [0.37, −1.12, −0.24]]
Transformer Block: Norm (layer normalization)

Layer Norm of the residual sum [[−0.12, 0.07, 0.36], [0.37, −1.12, −0.24]], computed per token over the embedding dimension, gives [[−1.14, −0.15, 1.29], [1.14, −1.29, 0.14]].

https://openaccess.thecvf.com/content/ICCV2021W/NeurArch/papers/Yao_Leveraging_Batch_Normalization_for_Vision_Transformers_ICCVW_2021_paper.pdf
Transformer Block: Feed Forward

W = [[0.15, 0.39, −0.34], [−0.41, 0.54, 0.49], [0.50, −0.07, −0.41]]

The feed-forward layer maps the normalized tokens [[−1.14, −0.15, 1.29], [1.14, −1.29, 0.14]] to [[0.53, −0.62, −0.22], [0.78, −0.26, −1.08]].
Transformer Block: final Add & Norm

bs = 1, sequence_length = 2, embed_dim = 3
The feed-forward output is added to its input and layer-normalized:
Layer Norm([[0.53, −0.62, −0.22], [0.78, −0.26, −1.08]] + [[−1.14, −0.15, 1.29], [1.14, −1.29, 0.14]]) = [[−0.59, −0.81, 1.40], [1.39, −0.90, −0.49]]
This completes one encoder block: Input Embedding → Multi-head Attention → Add & Norm → Feed Forward → Add & Norm → Output, all of shape (N, Seq_len, E_dim).
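The whole block can be written compactly in PyTorch. This is an illustrative sketch: the module names and the feed-forward hidden size are assumptions, not the course's own code.

```python
import torch
import torch.nn as nn

# Minimal post-norm transformer encoder block matching the slide's structure
# (Multi-head Attention -> Add & Norm -> Feed Forward -> Add & Norm).
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim=3, num_heads=1, ff_dim=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(nn.Linear(embed_dim, ff_dim), nn.ReLU(),
                                 nn.Linear(ff_dim, embed_dim))
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):                      # x: (N, Seq_len, E_dim)
        attn_out, _ = self.attn(x, x, x)       # self-attention
        x = self.norm1(x + attn_out)           # Add & Norm
        x = self.norm2(x + self.ffn(x))        # Feed Forward + Add & Norm
        return x

x = torch.tensor([[[-0.1, 0.1, 0.3], [0.4, -1.1, -0.3]]])   # (1, 2, 3)
print(TransformerBlock()(x).shape)                           # torch.Size([1, 2, 3])
```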
Transformer Models for Text Classification

Input Embedding + Positional Embedding → N× (Multi-head Attention → Add & Norm → Feed Forward → Add & Norm) → Linear → Softmax

https://arxiv.org/pdf/1803.08494.pdf
Transformer
❖ Positional encoding

Each position pos receives a vector $\mathbf{v}_{pos} = \left[\, v_{pos}^{0},\ v_{pos}^{1},\ \ldots,\ v_{pos}^{d-1} \,\right]$ of sines and cosines:

$$\mathbf{v}_{pos} = \begin{bmatrix} \sin(w_0 \cdot pos) \\ \cos(w_0 \cdot pos) \\ \sin(w_1 \cdot pos) \\ \cos(w_1 \cdot pos) \\ \vdots \\ \sin(w_{d/2-1} \cdot pos) \\ \cos(w_{d/2-1} \cdot pos) \end{bmatrix}
\qquad
w_k = \frac{1}{T^{2k/d}}, \quad k = 0, 1, \ldots, \frac{d}{2}-1, \quad T = 10000$$

Plots: positional-encoding matrices for 0 ≤ pos ≤ 99 and dimensions 0 ≤ d ≤ 127, with T = 10, T = 100, T = 1000, T = 10000.
Plots: the same encoding with T = 100 for position ranges 0 ≤ pos ≤ 9, 0 ≤ pos ≤ 99, 0 ≤ pos ≤ 199, and 0 ≤ pos ≤ 499 (dimensions 0 ≤ d ≤ 127).

❖ Positional encoding: worked example with T = 100, 0 ≤ pos ≤ 2, and d = 4 (dimension indices 0–3)

v₀ = [sin(w₀·0), cos(w₀·0), sin(w₁·0), cos(w₁·0)] = [0.0, 1.0, 0.0, 1.0]
v₁ = [sin(w₀·1), cos(w₀·1), sin(w₁·1), cos(w₁·1)] = [0.84, 0.54, 0.09, 0.99]
v₂ = [sin(w₀·2), cos(w₀·2), sin(w₁·2), cos(w₁·2)] = [0.91, −0.41, 0.19, 0.98]

https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
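A short NumPy sketch of the formula above, which reproduces the T = 100, d = 4 worked example up to rounding:

```python
import numpy as np

# Sinusoidal positional encoding: sin on even indices, cos on odd indices.
def positional_encoding(num_positions, d, T=10000.0):
    pe = np.zeros((num_positions, d))
    for pos in range(num_positions):
        for k in range(d // 2):
            w_k = 1.0 / T ** (2 * k / d)
            pe[pos, 2 * k] = np.sin(w_k * pos)       # sine component
            pe[pos, 2 * k + 1] = np.cos(w_k * pos)   # cosine component
    return pe

print(np.round(positional_encoding(3, 4, T=100.0), 2))
# row for pos = 1 is about [0.84, 0.54, 0.1, 1.0], matching the slide up to rounding
```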
Outline
➢ Layer Normalization
➢ Transformer Block
➢ BERT and Text Classification
➢ Vision Transformer and Image Classification
➢ For Time-series and Tabular Data
Text Classification
❖ IMDB dataset: 50,000 movie reviews for sentiment analysis
- 25,000 movie reviews for training and 25,000 for testing
- Labels are balanced: positive : negative = 1 : 1

“A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece …” → positive
“This show was an amazing, fresh & innovative idea in the 70's when it first aired. The first 7 or 8 years were brilliant, but things dropped off after that. By 1990, the show was not really funny anymore, and it's continued its decline further to the complete waste of time it is today….” → negative
“I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer)….” → positive
“BTW Carver gets a very annoying sidekick who makes you wanna shoot him the first three minutes he's on screen.” → negative
Transformer Models for Text Classification
❖ Embedding: each token's Input Embedding is summed with its Positional Embedding before entering the encoder.
Transformer Models for Text Classification
❖ Transformer block: Multi-head Attention → Add & Norm → Feed Forward → Add & Norm
Transformer Models for Text Classification
Full model: Input Embedding + Positional Embedding → N× Transformer Block → Linear → Softmax
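A possible PyTorch sketch of this classifier. The vocabulary size, sequence length, number of heads, and mean pooling over tokens are assumptions for illustration, not the course's configuration.

```python
import torch
import torch.nn as nn

# Token + positional embeddings, a stack of encoder blocks, then a linear head.
class TextTransformerClassifier(nn.Module):
    def __init__(self, vocab_size=20000, max_len=500, embed_dim=128,
                 num_heads=4, num_blocks=1, num_classes=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(max_len, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_blocks)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids):                                 # (N, Seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token_emb(token_ids) + self.pos_emb(positions)   # (N, Seq_len, E_dim)
        x = self.encoder(x).mean(dim=1)                           # average pooling over tokens
        return self.head(x)                                       # logits; softmax at loss time

logits = TextTransformerClassifier()(torch.randint(0, 20000, (2, 500)))
print(logits.shape)   # torch.Size([2, 2])
```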
Transformer Models for Text Classification
❖ Results (IMDB)
- Transformer trained from scratch: test accuracy ≈ 82% (training/test accuracy curves over 20 epochs)

Text Deep Models (baselines: embed_dim = 128, hidden_dim = 64, 500-word sequences, 2 output classes)
- RNN: test accuracy ≈ 68%
- LSTM: test accuracy ≈ 87%
Text Classification
❖ Tweets dataset
- Training samples: 7,613
- Disaster tweets: 4,342
- Non-disaster tweets: 3,271
Transformer Models for Text Classification
The same architecture (Input Embedding + Positional Embedding → N× Transformer Block → Linear → Softmax) is applied to the Tweets dataset.
Text Classification
❖ Results (Tweets)
- Transformer trained from scratch: test accuracy ≈ 78% (training/test accuracy curves over 40 epochs)
Model Architecture
Bidirectional Encoder Representations from Transformers (BERT)
- Trained on Wikipedia (~2.5B words) and Google's BooksCorpus (~800M words)
- 64 TPUs trained BERT over the course of 4 days
- DistilBERT offers a lighter version of BERT; it runs 60% faster while maintaining over 95% of BERT's performance
https://arxiv.org/pdf/1810.04805.pdf

Model Inputs
Bidirectional Encoder Representations from Transformers
https://huggingface.co/blog/bert-101
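As an illustration of using a pretrained BERT-family model for sentiment classification, here is a Hugging Face sketch; the DistilBERT checkpoint named below is the library's standard SST-2 fine-tune, not something taken from the slides.

```python
from transformers import pipeline

# Pretrained DistilBERT sentiment classifier applied to a movie-review snippet.
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("A wonderful little production with a comforting sense of realism."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```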
Attention Visualization

https://github.com/jessevig/bertviz
Outline
➢ Layer Normalization
➢ Transformer Block
➢ BERT and Text Classification
➢ Vision Transformer and Image Classification
➢ For Time-series and Tabular Data
Vision Transformer

Self-attention works the same way as for text. Each input X₀, X₁, …, Xₙ is projected by W_Q, W_K, W_V to queries, keys, and values, and the attention outputs Y₀, Y₁, …, Yₙ are

$$Y = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V, \qquad O = A\,W_O$$

For text, X₀, …, Xₙ come from embedding word-0, word-1, …, word-n.
For images, X₀, …, Xₙ come from projecting and embedding patch-0, patch-1, …, patch-n.


Vision Transformer
❖ From text to image
❖ Get patches: the input image is split into a grid of non-overlapping patches, e.g. a 3×3 grid of 9 patches.
Vision Transformer
❖ Patch and position embedding
Each of the 9 patches is flattened (4800 values per patch), mapped by a Linear projection to a 128-dimensional embedding, and summed with a 128-dimensional position embedding for positions 0–8.
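A PyTorch sketch of this step. The 120×120 image and 40×40 patch size are assumptions chosen so that each flattened patch has 40·40·3 = 4800 values and the grid has 3×3 = 9 patches, matching the slide's numbers.

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 120, 120)                    # (N, C, H, W)
patch_size, embed_dim = 40, 128

# Cut the image into a 3x3 grid of 40x40 patches and flatten each patch.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 9, -1)    # (1, 9, 4800)

project = nn.Linear(patch_size * patch_size * 3, embed_dim)      # 4800 -> 128
pos_emb = nn.Embedding(9, embed_dim)                             # positions 0..8

tokens = project(patches) + pos_emb(torch.arange(9))             # (1, 9, 128)
print(tokens.shape)
```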
Pretrained Vision Transformer
❖ JFT-300M
- 300M images
- Internal Google dataset
- Used for training image classification models

Vision Transformer: from text to image
https://arxiv.org/pdf/2010.11929.pdf

Performance of ViT on the CIFAR-10 dataset:
- Train from scratch: 78%
- Transfer learning: 98%
* You may have different results in your own experiments
Outline
➢ Layer Normalization
➢ Transformer Block
➢ BERT and Text Classification
➢ Vision Transformer and Image Classification
➢ For Time-series and Tabular Data
Patch Time-Series Transformer (PatchTST)
❖ For time-series data
https://github.com/yuqinie98/PatchTST
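The core idea is to turn windows of a series into tokens. A small sketch of that patching step; the series length, patch length, and stride below are assumptions for illustration, not PatchTST's published configuration.

```python
import torch

# Split a univariate series into (possibly overlapping) patches; each patch
# then plays the role of one input token for the transformer encoder.
series = torch.randn(1, 1, 512)          # (batch, channels, time steps)
patch_len, stride = 16, 8

patches = series.unfold(dimension=-1, size=patch_len, step=stride)  # (1, 1, 63, 16)
print(patches.shape)                     # 63 patches of length 16
```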
TabTransformer
❖ For Tabular Data
Categorical features are turned into Categorical Embeddings; a stack of transformer blocks converts them into Contextual Embeddings, which are combined with the continuous features for the final prediction.
https://towardsdatascience.com/transformers-for-tabular-data-tabtransformer-deep-dive-5fb2438da820
https://keras.io/examples/structured_data/tabtransformer/
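A minimal TabTransformer-style sketch of that flow; the column cardinalities, embedding size, and classification head are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Embed each categorical column, contextualize the column embeddings with a
# transformer encoder, then classify together with the numeric features.
cat_cardinalities, num_numeric, embed_dim = [10, 7, 5], 4, 16

embeddings = nn.ModuleList(nn.Embedding(c, embed_dim) for c in cat_cardinalities)
layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, dim_feedforward=32,
                                   batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(len(cat_cardinalities) * embed_dim + num_numeric, 2)

x_cat = torch.randint(0, 5, (8, 3))        # batch of 8 rows, 3 categorical columns
x_num = torch.randn(8, num_numeric)        # 4 numeric columns

tokens = torch.stack([emb(x_cat[:, i]) for i, emb in enumerate(embeddings)], dim=1)
contextual = encoder(tokens).flatten(1)    # contextual embeddings, shape (8, 48)
logits = head(torch.cat([contextual, x_num], dim=1))
print(logits.shape)                        # torch.Size([8, 2])
```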
