http://synergy.ece.gatech.edu
SCALE-Sim
Systolic CNN Accelerator Simulator
Open sourced at
https://github.com/ARM-software/SCALE-Sim
Project website
https://scalesim-project.github.io/
ASPLOS 2021 April 16, 2021
Kind notice
Please note that all sessions of this tutorial are being recorded.
Outline
1. Simulating for DNN accelerator
- Motivation
- Metrics of interest
2. SCALE-Sim
- Overview
- Modelling compute, memory, and interface
- Modelling GEMM
- Modelling Convolutions
- Dataflows
- Outputs
3. Demos
Systolic arrays
- High parallelism
- Efficient data reuse
- Simple implementation
- High scalability
Metrics of interest
Modeling compute
- Performance
- Efficiency
- Scalability
Modeling memory
- Reuse
- Performance
- Efficiency
Modeling interface
- System level implications
- Performance
SCALE-Sim: Overview
Inputs
- Config file with architecture specs
    Parameter          Value
    Array Height       32
    Array Width        32
    IFMAP SRAM Size    1024
    Filter SRAM Size   1024
    OFMAP SRAM Size    128
    Dataflow           WS
- Topology.csv
SCALE-Sim (modelled accelerator)
- IFMAP SRAM (double buffered)
- Filter SRAM (double buffered)
- OFMAP SRAM (double buffered)
- Accelerator interface (SRAM read, SRAM write)
Outputs
- Output log: cycles and interface bandwidth
- Traces: SRAM read/write, DRAM read/write
Inputs: Config file
Array microarchitecture
Memory Sizes
Matrix offsets
Data flow
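To make the four groups of settings concrete, here is a minimal sketch that parses a config of this kind with Python's standard `configparser`. The section name, the key names, and the offset values are assumptions made for illustration (only the array and SRAM values come from the overview slide); check the config shipped with the tool for the exact spelling.

```python
import configparser

# Hypothetical config text: section and key names are assumptions,
# and the offsets are placeholder values.
EXAMPLE_CFG = """
[architecture_presets]
ArrayHeight = 32
ArrayWidth = 32
IfmapSramSz = 1024
FilterSramSz = 1024
OfmapSramSz = 128
IfmapOffset = 0
FilterOffset = 10000000
OfmapOffset = 20000000
Dataflow = ws
"""


def load_arch(cfg_text):
    """Parse the config text into a plain dictionary grouped like the slide."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    sec = parser["architecture_presets"]

    dataflow = sec.get("Dataflow").strip().lower()
    if dataflow not in ("os", "ws", "is"):
        raise ValueError(f"unknown dataflow: {dataflow}")

    return {
        # Array microarchitecture: (rows, cols)
        "array": (sec.getint("ArrayHeight"), sec.getint("ArrayWidth")),
        # Memory sizes, one per operand matrix
        "sram": {
            "ifmap": sec.getint("IfmapSramSz"),
            "filter": sec.getint("FilterSramSz"),
            "ofmap": sec.getint("OfmapSramSz"),
        },
        # Matrix offsets: base address of each operand in the traces
        "offsets": {
            "ifmap": sec.getint("IfmapOffset"),
            "filter": sec.getint("FilterOffset"),
            "ofmap": sec.getint("OfmapOffset"),
        },
        "dataflow": dataflow,
    }


arch = load_arch(EXAMPLE_CFG)
print(arch["array"], arch["dataflow"])
```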
Inputs: Topology CSV file
- Layer-by-layer configuration
- Network hyperparameters
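As an illustration of a layer-by-layer topology file, the sketch below reads one row per layer and derives the output feature map size. The column names and the two sample layers are assumptions for this example, and the output size formula assumes no padding.

```python
import csv
import io

# Hypothetical topology text: column names and layers are illustrative only.
EXAMPLE_TOPOLOGY = """Layer name,IFMAP Height,IFMAP Width,Filter Height,Filter Width,Channels,Num Filter,Strides
Conv1,224,224,7,7,3,64,2
Conv2,56,56,3,3,64,64,1
"""

INT_COLUMNS = (
    "IFMAP Height", "IFMAP Width", "Filter Height", "Filter Width",
    "Channels", "Num Filter", "Strides",
)


def read_topology(text):
    """Return one dictionary of integer hyperparameters per layer."""
    layers = []
    for row in csv.DictReader(io.StringIO(text)):
        layer = {"name": row["Layer name"].strip()}
        for key in INT_COLUMNS:
            layer[key] = int(row[key])
        layers.append(layer)
    return layers


def ofmap_dims(layer):
    """Output height and width, assuming no padding."""
    stride = layer["Strides"]
    height = (layer["IFMAP Height"] - layer["Filter Height"]) // stride + 1
    width = (layer["IFMAP Width"] - layer["Filter Width"]) // stride + 1
    return height, width


for layer in read_topology(EXAMPLE_TOPOLOGY):
    print(layer["name"], ofmap_dims(layer))
```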
SCALE-Sim
Modelling compute
- Support for non-square arrays (Rows x Cols)
- Support for multiple dataflows
- Fast execution by tracking the edges
- Layer by layer execution
- Intrinsic folding of big compute
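Folding can be pictured as tiling an operand that is larger than the array into array-sized pieces. The sketch below counts the folds and the fraction of MAC units kept busy; the function names and this particular definition of mapping efficiency are assumptions for illustration, not necessarily what the tool computes.

```python
import math


def fold_counts(operand_rows, operand_cols, array_rows, array_cols):
    """Number of tiles needed along each dimension of the array."""
    row_folds = math.ceil(operand_rows / array_rows)
    col_folds = math.ceil(operand_cols / array_cols)
    return row_folds, col_folds


def mapping_efficiency(operand_rows, operand_cols, array_rows, array_cols):
    """Mapped MAC units divided by MAC units available over all folds."""
    row_folds, col_folds = fold_counts(
        operand_rows, operand_cols, array_rows, array_cols
    )
    mapped = operand_rows * operand_cols
    available = row_folds * col_folds * array_rows * array_cols
    return mapped / available


# A 27 x 64 operand on a 32 x 32 array needs two folds along the columns.
print(fold_counts(27, 64, 32, 32))
print(mapping_efficiency(27, 64, 32, 32))
```

A non-square array changes only `array_rows` and `array_cols`, so the same count applies to any aspect ratio.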
SCALE-Sim
Modelling on-chip memory
- Double buffered memories
- Models three memory regions, one for each matrix: IFMAP SRAM, Filter SRAM, OFMAP SRAM
- No replication of matrix elements in SRAM buffers
- IFMAP and Filter SRAMs are filled from DRAM and feed the array; the OFMAP SRAM is filled from the array and drains to DRAM
Modelling system interface
- Tool outputs the required DRAM read and write bandwidth
(Figure: accelerator with its SRAMs, DRAM read BW and DRAM write BW.)
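One way to read the link between double buffering and the reported bandwidth: while the array works out of one half of a buffer, the other half has to be refilled from DRAM before the current fold finishes. The sketch below is a deliberately simplified model under that assumption; the function names are hypothetical and the tool's own accounting is more detailed.

```python
import math


def required_dram_read_bw(words_to_prefetch, compute_cycles):
    """Words per cycle needed to refill the idle buffer half in time."""
    if compute_cycles <= 0:
        raise ValueError("compute_cycles must be positive")
    return words_to_prefetch / compute_cycles


def stall_cycles(words_to_prefetch, compute_cycles, available_bw):
    """Cycles the array waits when the interface is slower than required."""
    fetch_cycles = math.ceil(words_to_prefetch / available_bw)
    return max(0, fetch_cycles - compute_cycles)


# Prefetching 2048 words behind 512 compute cycles needs 4 words per cycle.
print(required_dram_read_bw(2048, 512))
print(stall_cycles(2048, 512, available_bw=2))
print(stall_cycles(2048, 512, available_bw=8))
```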
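Modelling GEMM

  | C[0,0] C[0,1] |   | A B C |   | a d |
  | C[1,0] C[1,1] | = | D E F | X | b e |
                                  | c f |

(Figure: 2x2 systolic array of MAC units. The rows of the first operand, A B C and D E F, and the columns of the second, a b c and d e f, are streamed into the array one element at a time.)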
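A minimal sketch of the 2x2 example, assuming an output-stationary mapping in which MAC unit (i, j) accumulates C[i][j] and operands arrive skewed by one cycle per row and per column. The skew rule and the cycle count (MAC cycles only, no drain) are assumptions of this sketch, not the tool's implementation.

```python
def systolic_matmul_os(A, B):
    """Cycle-by-cycle model of an output-stationary systolic GEMM.

    MAC unit (i, j) sees A[i][k] and B[k][j] at cycle t = k + i + j,
    which models operands entering at the array edges and hopping one
    MAC unit per cycle.
    """
    n_rows, depth, n_cols = len(A), len(B), len(B[0])
    acc = [[0] * n_cols for _ in range(n_rows)]
    total_cycles = depth + n_rows + n_cols - 2
    for t in range(total_cycles):
        for i in range(n_rows):
            for j in range(n_cols):
                k = t - i - j
                if 0 <= k < depth:
                    acc[i][j] += A[i][k] * B[k][j]
    return acc, total_cycles


# Numeric stand-ins for the slide's symbols: rows (A B C), (D E F)
# and columns (a b c), (d e f).
A = [[1, 2, 3],
     [4, 5, 6]]
B = [[7, 10],
     [8, 11],
     [9, 12]]
C, cycles = systolic_matmul_os(A, B)
print(C, cycles)
```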
Convolution in CNNs
- Input image: H x H, with C input channels
- Filter: R x R, with C channels; M filters in total
- Output image: E x E, with M output channels (one per filter)
- Each output pixel: dot product of a filter with an input window, followed by partial sum accumulation
(Figure: a filter sliding over the input image to produce the output image, extended first to many input channels, then to many filters and many output channels.)
SCALE-Sim: Modelling convolutions
(Figure: an input feature map with elements a1..y3 and filters with elements A1..I3 are unrolled into operand streams for the MAC units. The first convolution window becomes a1 a2 a3 b1 b2 b3 ... m1 m2 m3, the second b1 b2 b3 c1 c2 c3 ... n1 n2 n3; each filter becomes A1 A2 A3 B1 B2 ... I3.)
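The unrolling in the figure can be reproduced with a few lines of Python. The sketch below rebuilds the slide's example, a 5x5 input with 3 channels labelled a1..y3 and a 3x3 window with stride 1, and flattens each window in row, column, channel order. The function name and data layout are assumptions for illustration.

```python
def unroll_windows(ifmap, window, stride=1):
    """Flatten every convolution window of ifmap[h][w][c] into one list."""
    height, width, channels = len(ifmap), len(ifmap[0]), len(ifmap[0][0])
    windows = []
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            windows.append([
                ifmap[top + r][left + s][c]
                for r in range(window)
                for s in range(window)
                for c in range(channels)
            ])
    return windows


# Label the input feature map like the slide: letters a..y, channels 1..3.
ifmap = [
    [[f"{chr(ord('a') + 5 * h + w)}{c + 1}" for c in range(3)] for w in range(5)]
    for h in range(5)
]

windows = unroll_windows(ifmap, window=3)
print(len(windows), len(windows[0]))  # windows, elements per window
print(" ".join(windows[0]))
print(" ".join(windows[1]))
```

With the windows unrolled this way, the convolution becomes a matrix multiplication between the window matrix and the unrolled filters.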
Dataflows: Output Stationary
- Each MAC unit is responsible for particular output pixels
- Accumulation of partial sums is done locally
- Each column generates pixels from a different output channel
- SCALE-Sim assumes output collection is not on the critical path
Maximum usable dimensions
- Rows: Pixels per output channel (9 rows in the example)
- Cols: Number of filters (m cols)
(Figure: unrolled convolution windows of the input feature map and the weights of the m filters streaming into the array.)
Dataflows: Weight Stationary
- Elements of filters are pre-filled into MAC units
- Every column is assigned a unique filter
- Reduction is done across the rows
- Weights are pre-filled within a column
- Critical path contains time to fill in weights, partial sum generation and reduction time
Maximum usable dimensions
- Rows: Partial sums per pixel (27 rows in the example)
- Cols: Number of filters (m cols)
(Figure: weights of the m filters held in the array while the unrolled convolution windows stream through over time.)
Dataflows: Input Stationary
- Elements of convolution windows are pre-filled into MAC units
- Every column is assigned an output pixel
- Reduction is done across the rows
- Inputs are pre-filled within a column
- Critical path contains time to fill in input elements, partial sum generation and reduction time
Maximum usable dimensions
- Rows: Partial sums per pixel (27 rows in the example)
- Cols: Number of output pixels per output channel (9 cols in the example)
(Figure: convolution windows held in the array while the unrolled weight matrices stream through over time.)
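The "maximum usable dimensions" of the three dataflow slides can be collected in one small helper. This is a sketch with hypothetical names; the example numbers follow the slides' running example (9 output pixels per channel, 27 partial sums per pixel), and the filter count of 4 stands in for m.

```python
def mapped_dimensions(dataflow, ofmap_pixels, window_size, num_filters):
    """Rows and columns an operand occupies on the array, before folding.

    ofmap_pixels: pixels per output channel
    window_size:  partial sums per pixel (filter height x width x channels)
    num_filters:  number of filters, i.e. output channels
    """
    if dataflow == "os":
        return ofmap_pixels, num_filters
    if dataflow == "ws":
        return window_size, num_filters
    if dataflow == "is":
        return window_size, ofmap_pixels
    raise ValueError(f"unknown dataflow: {dataflow}")


# 5x5x3 input, 3x3x3 filters, stride 1: 9 output pixels, 27 partial sums.
for name in ("os", "ws", "is"):
    print(name, mapped_dimensions(name, ofmap_pixels=9, window_size=27, num_filters=4))
```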
Supporting other layer types
Fully connected
- Can be modelled as matrix vector multiplication (Input X Layer weights)
- SCALE-Sim models it as a convolution with input dimension same as weight dimension
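The equivalence is easy to check numerically: a filter with the same shape as its input has exactly one valid position, so the convolution produces a single dot product per filter. The sketch below compares that against a direct fully connected layer; all function names are hypothetical and the layer is kept single-channel for brevity.

```python
def conv2d_valid(ifmap, filt):
    """Single-channel convolution, stride 1, no padding."""
    height, width = len(ifmap), len(ifmap[0])
    fh, fw = len(filt), len(filt[0])
    return [
        [
            sum(ifmap[y + r][x + s] * filt[r][s]
                for r in range(fh) for s in range(fw))
            for x in range(width - fw + 1)
        ]
        for y in range(height - fh + 1)
    ]


def fully_connected(inputs, weights):
    """One dot product per output neuron."""
    return [sum(x * w for x, w in zip(inputs, row)) for row in weights]


def fc_via_conv(inputs, weights):
    """Same layer expressed as convolutions with input-sized filters."""
    ifmap = [list(inputs)]  # 1 x N feature map
    return [conv2d_valid(ifmap, [list(row)])[0][0] for row in weights]


x = [1, 2, 3]
w = [[1, 0, -1],
     [2, 2, 2]]
print(fully_connected(x, w), fc_via_conv(x, w))
```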
LSTMs, Attention
- Modelled as Matrix-Vector or Vector-Vector
Softmax, Elementwise operations, Pooling
- Not efficient on systolic arrays
Console Output
1 Summary of input configurations
2 Run and stall cycles
3 Mapping efficiency and compute utilization
4 Off chip access bandwidth
Generated outputs
Cycle accurate traces per operand
Summary files
Summary Files
Filename                     Attributes
COMPUTE_REPORT.csv           Layer wise compute cycles, stall cycles, mapping utilization, etc.
BANDWIDTH_REPORT.csv         Layer wise SRAM and DRAM access bandwidths for operands
DETAILED_ACCESS_REPORT.csv   Access counts and timing information
Announcement!
SCALE-Sim v2 Release (Beta)
We are releasing a new version of SCALE-Sim : https://github.com/scalesim-project/scale-sim-v2
We will soon have a stable version repo in ARM’s Github
New features
1. Tool can be run in both stall free and bandwidth limited mode
2. New metrics like mapping efficiency, stall count added
3. Modular code
4. Available as python package
5. More enhancements in the pipeline!
We also have a new website
https://scalesim-project.github.io
Demos
We will showcase SCALE-Sim v2 capabilities with 3 tutorials
1. Using SCALE-Sim as a package
Design space exploration of a systolic accelerator
2. Adding new features to Simulator
Adding new buffer hierarchies in SCALE-Sim
3. Using SCALE-Sim as a library to build bigger simulators
Building a Scaled-out simulator using SCALE-Sim API
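To give a flavour of the first demo, the sketch below sweeps array aspect ratios for a fixed number of MAC units and ranks them with a crude stall-free cycle estimate. It does not call the SCALE-Sim package: the cost model (folds multiplied by streaming length plus array skew) and all names are assumptions for illustration only.

```python
import math


def estimate_cycles(operand_rows, operand_cols, stream_len, rows, cols):
    """Crude stall-free estimate: every fold streams stream_len operands
    and pays the array's fill skew of rows + cols - 2 cycles."""
    folds = math.ceil(operand_rows / rows) * math.ceil(operand_cols / cols)
    return folds * (stream_len + rows + cols - 2)


def sweep_shapes(operand_rows, operand_cols, stream_len, total_macs):
    """Try every power-of-two row count that divides total_macs."""
    results = []
    rows = 1
    while rows <= total_macs:
        if total_macs % rows == 0:
            cols = total_macs // rows
            cycles = estimate_cycles(
                operand_rows, operand_cols, stream_len, rows, cols
            )
            results.append((cycles, rows, cols))
        rows *= 2
    return sorted(results)


# Hypothetical layer: 2916 x 64 operand, 576 operands streamed per fold.
for cycles, rows, cols in sweep_shapes(2916, 64, 576, total_macs=1024)[:3]:
    print(f"{rows} x {cols}: {cycles} cycles")
```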