AI VIETNAM
All-in-One Course
Applications of
Transformer Encoder
Quang-Vinh Dinh
Ph.D. in Computer Science
Year 2023
Outline
➢ Layer Normalization
➢ Transformer Block
➢ BERT and Text Classification
➢ Vision Transformer and Image Classification
➢ For Time-series and Tabular Data
Normalization
❖ Overview
$\mu_c = \frac{1}{N \times H \times W} \sum_{i=1}^{N} \sum_{j=1}^{H} \sum_{k=1}^{W} F_{ijk}$
$\sigma_c^2 = \frac{1}{N \times H \times W} \sum_{i=1}^{N} \sum_{j=1}^{H} \sum_{k=1}^{W} \left( F_{ijk} - \mu_c \right)^2$
https://arxiv.org/pdf/1803.08494.pdf
Batch Normalization
Example: batch-size = 2, input_shape = (2, 2, 2, 2), $\epsilon = 10^{-5}$, $\gamma = 1.0$, $\beta = 0.0$

Per-channel batch statistics: $\mu = [2.0, 3.0]$, $\sigma^2 = [3.5, 3.25]$

X:
  sample 1: channel 1 = [[0, 5], [3, 0]], channel 2 = [[6, 4], [5, 2]]
  sample 2: channel 1 = [[1, 4], [3, 0]], channel 2 = [[2, 3], [0, 2]]

Batch-Norm Layer output X̂:
  sample 1: channel 1 = [[-1.1, 1.6], [0.5, -1.1]], channel 2 = [[1.6, 0.5], [1.1, -0.5]]
  sample 2: channel 1 = [[-0.5, 1.1], [0.5, -1.1]], channel 2 = [[-0.5, 0.0], [-1.6, -0.5]]

$Y = \gamma \hat{X} + \beta$
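The same computation can be checked with a few lines of PyTorch. This is a small sketch I added (not from the slides); the (N, C, H, W) channel layout is an assumption.

```python
# Reproduce the batch-norm example above with PyTorch (layout assumed: N, C, H, W).
import torch
import torch.nn as nn

x = torch.tensor([
    [[[0., 5.], [3., 0.]], [[6., 4.], [5., 2.]]],   # sample 1: two 2x2 channels
    [[[1., 4.], [3., 0.]], [[2., 3.], [0., 2.]]],   # sample 2
])                                                   # shape (2, 2, 2, 2)

bn = nn.BatchNorm2d(num_features=2, eps=1e-5, affine=True)  # gamma=1, beta=0 at init
bn.train()                                                   # normalize with batch statistics
y = bn(x)

print(x.mean(dim=(0, 2, 3)))                  # per-channel mean     -> tensor([2., 3.])
print(x.var(dim=(0, 2, 3), unbiased=False))   # per-channel variance -> tensor([3.5000, 3.2500])
print(y)                                      # matches the normalized values above (up to rounding)
```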
Layer Normalization
$\mu_n = \frac{1}{C \times H \times W} \sum_{c=1}^{C} \sum_{j=1}^{H} \sum_{k=1}^{W} F_{cjk}$
$\sigma_n^2 = \frac{1}{C \times H \times W} \sum_{c=1}^{C} \sum_{j=1}^{H} \sum_{k=1}^{W} \left( F_{cjk} - \mu_n \right)^2$
https://arxiv.org/pdf/1607.06450.pdf

Example: shape = (1, 2, 2, 1), Layer Norm with mean = 4 and std = 2.73
X = [[5, 1], [2, 8]]  →  X̂ = [[0.36, -1.09], [-0.73, 1.46]]
Layer Normalization

One sample with two channels, input_shape = (1, 2, 2, 2); Layer Norm normalizes over the whole sample (mean = 3.5, std = 2.54):
X = [[8, 6], [2, 4]], [[0, 2], [1, 5]]  →  X̂ = [[1.76, 0.98], [-0.58, 0.19]], [[-1.37, -0.58], [-0.98, 0.58]]

Two samples with one channel each, input_shape = (2, 2, 2, 1); Layer Norm normalizes each sample separately (mean1 = 5.0, std1 = 2.23; mean2 = 2.0, std2 = 1.87):
X = [[8, 6], [2, 4]] ; [[0, 2], [1, 5]]  →  X̂ = [[1.34, 0.44], [-1.34, -0.44]] ; [[-1.06, 0.0], [-0.53, 1.6]]
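A quick sketch (my addition, not the course code) that checks the first example with torch.nn.LayerNorm; normalized_shape is assumed to cover the whole 2×2 sample.

```python
# Layer norm over one 2x2 sample: mean = 4, std ~ 2.73.
import torch
import torch.nn as nn

x = torch.tensor([[[5., 1.], [2., 8.]]])              # shape (1, 2, 2): one sample
ln = nn.LayerNorm(normalized_shape=(2, 2), eps=1e-5)  # gamma=1, beta=0 at init
print(ln(x))
# ~ [[[ 0.37, -1.10],
#     [-0.73,  1.46]]]
```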
Instance Normalization
$\mu_{instance} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_{ij}$
$\sigma_{instance}^2 = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left( F_{ij} - \mu_{instance} \right)^2$
https://arxiv.org/pdf/1607.08022.pdf

Example: shape = (1, 2, 2, 1), Instance-Norm Layer with $\mu = 2.5$, $\sigma^2 = 4.25$ ($\sigma \approx 2.06$)
X = [[1, 5], [4, 0]]  →  X̂ = [[-0.7276, 1.2127], [0.7276, -1.2127]]
Instance Normalization
Example: batch-size = 1, input_shape = (1, 2, 2, 2), $\gamma = 1.0$, $\beta = 0.0$, $\epsilon = 10^{-5}$

$\mu_{instance} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_{ij}$
$\sigma_{instance}^2 = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left( F_{ij} - \mu_{instance} \right)^2$

Per-channel statistics: $\mu = [2.5, 4.75]$, $\sigma^2 = [4.25, 6.18]$

X (sample 1): channel 1 = [[1, 5], [4, 0]], channel 2 = [[5, 8], [5, 1]]

Instance-Norm Layer output X̂:
  channel 1 = [[-0.72, 1.21], [0.72, -1.21]]
  channel 2 = [[0.1, 1.3], [0.1, -1.5]]
Instance Normalization
Example: batch-size = 2, input_shape = (2, 2, 2, 2), $\gamma = 1.0$, $\beta = 0.0$, $\epsilon = 10^{-5}$

Per-channel statistics:
  sample 1: $\mu = [2.5, 5.0]$, $\sigma^2 = [4.25, 7.5]$
  sample 2: $\mu = [4.25, 1.75]$, $\sigma^2 = [5.68, 2.18]$

X:
  sample 1: channel 1 = [[1, 5], [4, 0]], channel 2 = [[9, 2], [6, 3]]
  sample 2: channel 1 = [[6, 3], [1, 7]], channel 2 = [[0, 2], [1, 4]]

Instance-Norm Layer output X̂:
  sample 1: channel 1 = [[-0.72, 1.21], [0.72, -1.21]], channel 2 = [[1.46, -1.09], [0.36, -0.73]]
  sample 2: channel 1 = [[0.73, -0.52], [-1.36, 1.15]], channel 2 = [[-1.18, 0.17], [-0.51, 1.52]]

$Y = \gamma \hat{X} + \beta$
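For reference, a minimal sketch (my addition) that reproduces the single-sample example with torch.nn.InstanceNorm2d, again assuming the (N, C, H, W) layout.

```python
# Instance norm normalizes every (sample, channel) pair over H x W separately.
import torch
import torch.nn as nn

x = torch.tensor([[[[1., 5.], [4., 0.]],     # sample 1, channel 1
                   [[5., 8.], [5., 1.]]]])   # sample 1, channel 2 -> shape (1, 2, 2, 2)

inorm = nn.InstanceNorm2d(num_features=2, eps=1e-5)  # no affine parameters by default
print(inorm(x))
# channel 1 -> [[-0.73, 1.21], [ 0.73, -1.21]]
# channel 2 -> [[ 0.10, 1.31], [ 0.10, -1.51]]
```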
Group Normalization
Setup: num_samples = 2, num_channels (C) = 4, num_groups (G) = 2; input/output shape = (2, 2, 2, 4)

$\mathcal{S}_i = \left\{ k \mid k_N = i_N, \; \left\lfloor \frac{k_C}{C/G} \right\rfloor = \left\lfloor \frac{i_C}{C/G} \right\rfloor \right\}$
$\mu_{gr} = \frac{1}{m} \sum_{k \in \mathcal{S}_i} x_k, \qquad \sigma_{gr}^2 = \frac{1}{m} \sum_{k \in \mathcal{S}_i} \left( x_k - \mu_{gr} \right)^2$

X1 (channels 1-4): [[1, 5], [4, 7]], [[1, 2], [4, 0]], [[9, 2], [0, 3]], [[6, 3], [1, 8]]
  group 1 (channels 1-2): $\mu = 3$, $\sigma^2 = 5$;  group 2 (channels 3-4): $\mu = 4$, $\sigma^2 = 9.5$

X2 (channels 1-4): [[5, 2], [6, 3]], [[1, 7], [0, 7]], [[0, 2], [3, 3]], [[1, 4], [2, 5]]
  group 1: $\mu = 3.8$, $\sigma^2 = 6.6$;  group 2: $\mu = 2.5$, $\sigma^2 = 2.25$

Group-Norm Layer output:
X̂1 = [[-0.89, 0.89], [0.44, 1.78]], [[-0.9, -0.44], [0.44, -1.3]], [[1.62, -0.64], [-1.29, -0.32]], [[0.64, -0.32], [-0.97, 1.29]]
X̂2 = [[0.43, -0.72], [0.82, -0.34]], [[-1.11, 1.21], [-1.5, 1.21]], [[-1.66, -0.33], [0.33, 0.33]], [[-1.0, 1.0], [-0.33, 1.67]]
Group Normalization

num_channels = 6 with num_groups = 1 (all channels in one group, i.e. layer-norm behaviour) versus num_channels = 6 with num_groups = 6 (one channel per group, i.e. instance-norm behaviour).
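A small sketch (mine, not from the slides) showing how torch.nn.GroupNorm covers all three settings; the tensor shape and channel count below are arbitrary.

```python
# GroupNorm with different num_groups: 1 ~ layer norm over (C, H, W), C ~ instance norm.
import torch
import torch.nn as nn

x = torch.randn(2, 6, 4, 4)                          # (N, C, H, W) with C = 6 channels

gn_2 = nn.GroupNorm(num_groups=2, num_channels=6)    # 2 groups of 3 channels
gn_1 = nn.GroupNorm(num_groups=1, num_channels=6)    # ~ layer norm over (C, H, W)
gn_6 = nn.GroupNorm(num_groups=6, num_channels=6)    # ~ instance norm

for gn in (gn_2, gn_1, gn_6):
    print(gn(x).shape)                               # torch.Size([2, 6, 4, 4])
```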
Outline
➢ Layer Normalization
➢ Transformer Block
➢ BERT and Text Classification
➢ Vision Transformer and Image Classification
➢ For Time-series and Tabular Data
Transformer Block

Input Embedding (N, Seq_len, E_dim)
  → Multi-head Attention
  → Add & Norm
  → Feed Forward
  → Add & Norm
  → Output (N, Seq_len, E_dim)

https://arxiv.org/pdf/1803.08494.pdf
[Figure: two building blocks used above: an MLP + Softmax head (x₁ … xₙ → z₁ … zₙ → ŷ₁ … ŷₙ) and a skip connection (X → Conv → Conv → + X → Y).]
Transformer Block

Input (bs = 1, sequence_length = 2, embed_dim = 3):
X = [[-0.1, 0.1, 0.3], [0.4, -1.1, -0.3]]

Multi-head Attention weights:
W_Q = [[-0.35, 0.51, 0.50], [0.36, -0.47, -0.29], [-0.51, -0.14, -0.56]]
W_K = [[-0.49, -0.68, 0.18], [-0.44, -0.46, 0.18], [0.07, -0.10, 0.44]]
W_V = [[-0.41, 0.39, -0.65], [-0.40, -0.07, -0.34], [-0.55, -0.13, -0.29]]
W_O = [[-0.36, -0.08, 0.32], [0.27, 0.05, 0.15], [-0.05, -0.28, 0.05]]

$Y = \mathrm{softmax}\!\left( \frac{QK^T}{\sqrt{d}} \right) V$

Multi-head Attention output:
[[-0.029, -0.028, 0.065], [-0.025, -0.025, 0.058]]
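The attention output above can be reproduced approximately with the following sketch (my own code, using the weight values printed on the slide; the softmax over QKᵀ/√d followed by the W_O projection is the standard formulation and is assumed here).

```python
# Single-head attention: Q = X W_Q, K = X W_K, V = X W_V, softmax(QK^T / sqrt(d)) V, then W_O.
import torch
import torch.nn.functional as F

X = torch.tensor([[-0.1,  0.1,  0.3],
                  [ 0.4, -1.1, -0.3]])                    # (seq_len=2, embed_dim=3)

W_Q = torch.tensor([[-0.35,  0.51,  0.50], [ 0.36, -0.47, -0.29], [-0.51, -0.14, -0.56]])
W_K = torch.tensor([[-0.49, -0.68,  0.18], [-0.44, -0.46,  0.18], [ 0.07, -0.10,  0.44]])
W_V = torch.tensor([[-0.41,  0.39, -0.65], [-0.40, -0.07, -0.34], [-0.55, -0.13, -0.29]])
W_O = torch.tensor([[-0.36, -0.08,  0.32], [ 0.27,  0.05,  0.15], [-0.05, -0.28,  0.05]])

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
A = F.softmax(Q @ K.T / (Q.shape[-1] ** 0.5), dim=-1)     # attention weights
out = (A @ V) @ W_O
print(out)   # ~ [[-0.029, -0.027, 0.063], [-0.025, -0.024, 0.056]]  (slide values up to rounding)
```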
Transformer Block: Add (residual connection)

Multi-head Attention output + input embedding:
[[-0.029, -0.028, 0.065], [-0.025, -0.025, 0.058]] + [[-0.1, 0.1, 0.3], [0.4, -1.1, -0.3]]
= [[-0.12, 0.07, 0.36], [0.37, -1.12, -0.24]]
Transformer Block: first Add & Norm

Layer Norm([[-0.12, 0.07, 0.36], [0.37, -1.12, -0.24]])
= [[-1.14, -0.15, 1.29], [1.14, -1.29, 0.14]]

https://openaccess.thecvf.com/content/ICCV2021W/NeurArch/papers/Yao_Leveraging_Batch_Normalization_for_Vision_Transformers_ICCVW_2021_paper.pdf
Transformer Block: Feed Forward

W = [[0.15, 0.39, -0.34], [-0.41, 0.54, 0.49], [0.50, -0.07, -0.41]]

Feed Forward([[-1.14, -0.15, 1.29], [1.14, -1.29, 0.14]])
= [[0.53, -0.62, -0.22], [0.78, -0.26, -1.08]]
Transformer Block: second Add & Norm (bs = 1, sequence_length = 2, embed_dim = 3)

Layer Norm(Feed-Forward output + Layer-Norm output)
= Layer Norm([[0.53, -0.62, -0.22], [0.78, -0.26, -1.08]] + [[-1.14, -0.15, 1.29], [1.14, -1.29, 0.14]])
= [[-0.59, -0.81, 1.40], [1.39, -0.90, -0.49]]
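Putting the steps together, here is a compact sketch of the block (my own code, not the lecture's implementation; the feed-forward width, activation, and random initialization are assumptions, so it reproduces the shapes rather than the exact numbers above).

```python
# Encoder block: self-attention -> add & layer-norm -> feed-forward -> add & layer-norm.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim=3, num_heads=1, ff_dim=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(embed_dim, ff_dim), nn.ReLU(),
                                nn.Linear(ff_dim, embed_dim))
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)       # multi-head self-attention
        x = self.norm1(x + attn_out)           # first Add & Norm
        x = self.norm2(x + self.ff(x))         # second Add & Norm
        return x

x = torch.tensor([[[-0.1, 0.1, 0.3], [0.4, -1.1, -0.3]]])   # (1, 2, 3)
print(TransformerBlock()(x).shape)                          # torch.Size([1, 2, 3])
```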
Transformer Models for Text Classification

Input Embedding + Positional Embedding
  → N × [ Multi-head Attention → Add & Norm → Feed Forward → Add & Norm ]
  → Linear
  → Softmax

https://arxiv.org/pdf/1803.08494.pdf
Transformer
❖ Positional encoding
$\boldsymbol{v}_{pos} = \left[ v_{pos}^{0}, v_{pos}^{1}, v_{pos}^{2}, \ldots, v_{pos}^{d-2}, v_{pos}^{d-1} \right] = \left[ \sin(w_0 \, pos), \cos(w_0 \, pos), \sin(w_1 \, pos), \cos(w_1 \, pos), \ldots, \sin(w_{d/2-1} \, pos), \cos(w_{d/2-1} \, pos) \right]$

$v_{pos}^{2k} = \sin(w_k \, pos), \quad v_{pos}^{2k+1} = \cos(w_k \, pos), \quad w_k = \frac{1}{T^{2k/d}}, \quad T = 10000$

[Figure: heatmaps of the positional encoding for T = 10, 100, 1000, 10000 (0 ≤ pos ≤ 99, 0 ≤ d ≤ 127), and for T = 100 with sequence lengths 0 ≤ pos ≤ 9, 99, 199, 499.]
❖ Positional encoding: worked example

$\boldsymbol{v}_{pos} = \left[ \sin(w_0 \, pos), \cos(w_0 \, pos), \sin(w_1 \, pos), \cos(w_1 \, pos), \ldots \right], \quad w_k = \frac{1}{T^{2k/d}}$

With T = 100, 0 ≤ pos ≤ 2, 0 ≤ d ≤ 3 (i.e. d = 4 dimensions):

v_0 = [sin(w_0·0), cos(w_0·0), sin(w_1·0), cos(w_1·0)] = [0.0, 1.0, 0.0, 1.0]
v_1 = [sin(w_0·1), cos(w_0·1), sin(w_1·1), cos(w_1·1)] = [0.84, 0.54, 0.09, 0.99]
v_2 = [sin(w_0·2), cos(w_0·2), sin(w_1·2), cos(w_1·2)] = [0.91, -0.41, 0.19, 0.98]

https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
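A short sketch (my addition) of the sinusoidal encoding; with T = 100 and d = 4 it reproduces the small table above up to rounding.

```python
# Sinusoidal positional encoding: even indices sin(w_k * pos), odd indices cos(w_k * pos).
import numpy as np

def positional_encoding(num_positions, d, T=10000.0):
    pe = np.zeros((num_positions, d))
    for pos in range(num_positions):
        for k in range(d // 2):
            w_k = 1.0 / T ** (2 * k / d)
            pe[pos, 2 * k] = np.sin(w_k * pos)       # even index: sin
            pe[pos, 2 * k + 1] = np.cos(w_k * pos)   # odd index: cos
    return pe

print(np.round(positional_encoding(3, 4, T=100.0), 2))
# [[ 0.    1.    0.    1.  ]
#  [ 0.84  0.54  0.1   1.  ]
#  [ 0.91 -0.42  0.2   0.98]]
```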
Outline
➢ Layer Normalization
➢ Transformer Block
➢ BERT and Text Classification
➢ Vision Transformer and Image Classification
➢ For Time-series and Tabular Data
Text Classification
❖ IMDB dataset: 50,000 movie reviews for sentiment analysis
- 25,000 reviews for training and 25,000 reviews for testing
- Labels: positive and negative, balanced 1 : 1
“A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece …” → positive

“This show was an amazing, fresh & innovative idea in the 70's when it first aired. The first 7 or 8 years were brilliant, but things dropped off after that. By 1990, the show was not really funny anymore, and it's continued its decline further to the complete waste of time it is today….” → negative

“I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer)….” → positive

“BTW Carver gets a very annoying sidekick who makes you wanna shoot him the first three minutes he's on screen.” → negative
Transformer Models for Text Classification
❖ Embedding: Input Embedding + Positional Embedding
Transformer Models for Text Classification
❖ Transformer block: Multi-head Attention → Add & Norm → Feed Forward → Add & Norm
Transformer Models for Text Classification

Input Embedding + Positional Embedding
  → N × [ Multi-head Attention → Add & Norm → Feed Forward → Add & Norm ]
  → Linear
  → Softmax
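As a rough sketch of the architecture drawn above (my own code, not the course implementation; the head count, layer count, and feed-forward width are assumptions):

```python
# Token embedding + positional embedding -> N encoder blocks -> linear classifier head.
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, max_len=500, embed_dim=128,
                 num_heads=4, num_layers=2, num_classes=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(max_len, embed_dim)           # learned positions
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)   # the "N x" stack
        self.classifier = nn.Linear(embed_dim, num_classes)       # softmax is applied in the loss

    def forward(self, token_ids):                                 # (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token_emb(token_ids) + self.pos_emb(positions)
        x = self.encoder(x)
        return self.classifier(x.mean(dim=1))                     # mean-pool over tokens

logits = TextClassifier(vocab_size=10000)(torch.randint(0, 10000, (2, 500)))
print(logits.shape)   # torch.Size([2, 2])
```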
Transformer Models for Text Classification
❖ Results

Transformer trained from scratch on IMDB: test accuracy ≈ 82%

Baseline text deep models (embed_dim = 128, hidden_dim = 64, inputs of 500 words, 2 output classes):
  RNN: test accuracy ≈ 68%
  LSTM: test accuracy ≈ 87%

[Figures: training/test accuracy curves over 20 epochs for each model, and a stacked-LSTM diagram over Word-1 … Word-500.]
Text Classification
❖ Tweets dataset
- Training samples: 7,613
- Disaster tweets: 4,342
- Non-disaster tweets: 3,271
Transformer Models for Text Classification

The same model is applied to the Tweets dataset: Input Embedding + Positional Embedding → N × [ Multi-head Attention → Add & Norm → Feed Forward → Add & Norm ] → Linear → Softmax
Text Classification
❖ Results: Transformer trained from scratch on the Tweets dataset reaches a test accuracy of ≈ 78%.

[Figure: training/test accuracy curves over 40 epochs.]
Model Architecture: BERT (Bidirectional Encoder Representations from Transformers)
- Trained on Wikipedia (~2.5B words) and Google's BooksCorpus (~800M words)
- BERT was trained on 64 TPUs over the course of 4 days
- DistilBERT is a lighter version of BERT; it runs 60% faster while maintaining over 95% of BERT's performance
https://arxiv.org/pdf/1810.04805.pdf

Model Inputs: BERT (Bidirectional Encoder Representations from Transformers)
https://huggingface.co/blog/bert-101
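For the transfer-learning route, a minimal sketch (assuming the Hugging Face transformers library is installed) that loads a pretrained DistilBERT with a classification head; fine-tuning on the task data is still needed to make the logits meaningful.

```python
# Load pretrained DistilBERT with a 2-class classification head and run one forward pass.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

inputs = tokenizer("A wonderful little production.", return_tensors="pt",
                   truncation=True, padding=True)
with torch.no_grad():
    logits = model(**inputs).logits        # shape (1, 2)
print(logits.shape)
```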
Attention Visualization
https://github.com/jessevig/bertviz
Outline
➢ Layer Normalization
➢ Transformer Block
➢ BERT and Text Classification
➢ Vision Transformer and Image Classification
➢ For Time-series and Tabular Data
Vision Transformer

$Y = \mathrm{softmax}\!\left( \frac{QK^T}{\sqrt{d_k}} \right) V, \qquad O = A W_O$

Self-Attention over a token sequence:
word-0, word-1, …, word-n → Embedding → X_0, X_1, …, X_n → (W_Q, W_K, W_V) → Q_i, K_i, V_i → Self-Attention → Y_0, Y_1, …, Y_n
Vision Transformer

$Y = \mathrm{softmax}\!\left( \frac{QK^T}{\sqrt{d_k}} \right) V, \qquad O = A W_O$

The same Self-Attention, but over image patches:
patch-0, patch-1, …, patch-n → Projection and Embedding → X_0, X_1, …, X_n → (W_Q, W_K, W_V) → Q_i, K_i, V_i → Self-Attention → Y_0, Y_1, …, Y_n
Vision Transformer
❖ From text to image
Vision Transformer
❖ Get patches

The image is split into a grid of non-overlapping patches, numbered in row-major order:
1 2 3
4 5 6
7 8 9
Vision Transformer
❖ Patch and position embedding

Each patch is flattened into a 4800-dimensional vector, projected by a shared Linear layer to a 128-dimensional embedding, and added (+) to a position embedding for positions 0 to 8; the nine resulting 128-dimensional tokens form the input sequence.
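A small sketch (my own, with an assumed 120×120 RGB input so that each of the 3×3 patches flattens to 4,800 values) of the patch projection plus position embedding:

```python
# Split an image into 3x3 patches, flatten, project to 128 dims, add position embeddings.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 120, 120)                    # assumed 120x120 RGB image
patch = 40                                             # 3 x 3 grid of 40x40 patches
patches = image.unfold(2, patch, patch).unfold(3, patch, patch)   # (1, 3, 3, 3, 40, 40)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 9, -1)     # (1, 9, 4800)

proj = nn.Linear(3 * patch * patch, 128)               # 4800 -> 128
pos_emb = nn.Embedding(9, 128)                         # positions 0..8
tokens = proj(patches) + pos_emb(torch.arange(9))      # (1, 9, 128), ready for the encoder
print(tokens.shape)
```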
Pretrained Vision Transformer
❖ JFT-300M: 300M images; an internal Google dataset used for training image-classification models

Vision Transformer: from text to image
https://arxiv.org/pdf/2010.11929.pdf

Performance of ViT on the CIFAR-10 dataset:
  Train from scratch: 78%
  Transfer learning: 98%
* You may have different results in your own experiments
Outline
➢ Layer Normalization
➢ Transformer Block
➢ BERT and Text Classification
➢ Vision Transformer and Image Classification
➢ For Time-series and Tabular Data
Patch Time-Series Transformer (PatchTST)
❖ For time-series data
https://github.com/yuqinie98/PatchTST
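The core PatchTST idea can be sketched as follows (my own simplified code, not the repository's; patch length, stride, and model sizes are assumptions):

```python
# Slice a univariate series into overlapping patches, project each patch, encode the sequence.
import torch
import torch.nn as nn

series = torch.randn(1, 1, 336)                         # (batch, channels, time steps)
patch_len, stride = 16, 8
patches = series.unfold(-1, patch_len, stride)          # (1, 1, 41, 16): 41 patches of length 16
patches = patches.squeeze(1)                            # (1, 41, 16)

proj = nn.Linear(patch_len, 128)                        # each patch becomes a token embedding
layer = nn.TransformerEncoderLayer(128, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
tokens = encoder(proj(patches))                         # (1, 41, 128)
print(tokens.shape)
```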
TabTransformer
❖ For Tabular Data: Categorical Embeddings
https://towardsdatascience.com/transformers-for-tabular-data-tabtransformer-deep-dive-5fb2438da820
TabTransformer
❖ For Tabular Data: Contextual Embeddings
https://towardsdatascience.com/transformers-for-tabular-data-tabtransformer-deep-dive-5fb2438da820
https://keras.io/examples/structured_data/tabtransformer/
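A very small sketch of the TabTransformer idea (my own code, loosely following the description above; column cardinalities and sizes are made up for illustration): embed each categorical column, pass the column embeddings through a transformer encoder to obtain contextual embeddings, then concatenate them with the normalized continuous features for a classifier head.

```python
# Categorical columns -> embeddings -> transformer encoder; concatenate with continuous features.
import torch
import torch.nn as nn

num_categories = [10, 5, 7]                 # assumed cardinalities of 3 categorical columns
embed_dim, num_cont = 32, 4

embeddings = nn.ModuleList(nn.Embedding(c, embed_dim) for c in num_categories)
layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
norm = nn.LayerNorm(num_cont)
head = nn.Linear(len(num_categories) * embed_dim + num_cont, 2)

cat = torch.randint(0, 5, (8, 3))           # (batch, categorical columns)
cont = torch.randn(8, num_cont)             # (batch, continuous columns)

tokens = torch.stack([emb(cat[:, i]) for i, emb in enumerate(embeddings)], dim=1)
contextual = encoder(tokens).flatten(1)     # (8, 3 * 32) contextual embeddings
logits = head(torch.cat([contextual, norm(cont)], dim=1))
print(logits.shape)                         # torch.Size([8, 2])
```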