
SCALE-Sim Tutorial ASPLOS2021 2 Overview

SCALE-Sim is an open-source simulator for systolic-array-based convolutional neural network (CNN) accelerators. It models the computation, on-chip memory usage, and the interface between memory and accelerator at a cycle-accurate level to evaluate performance and efficiency. It supports different dataflows and takes the architectural configuration and a layer-by-layer network topology as input; it simulates the execution and outputs performance metrics and memory traffic traces. SCALE-Sim aims to help hardware designers evaluate and optimize systolic-array-based CNN accelerators.

Uploaded by Seuneedhi Singh

http://synergy.ece.gatech.edu
SCALE-Sim
Systolic CNN Accelerator Simulator
Open sourced at
https://github.com/ARM-software/SCALE-Sim

Project website
https://scalesim-project.github.io/

ASPLOS 2021 April 16, 2021


Kind notice

Please note that all sessions of this tutorial are being recorded.

Outline
1. Simulating for DNN accelerator
- Motivation
- Metrics of interest

2. SCALE-Sim
- Overview
- Modelling compute, memory, and interface
- Modelling GEMM
- Modelling Convolutions
- Dataflows
- Outputs

3. Demos

Systolic arrays

- High parallelism
- Efficient data reuse
- Simple implementation
- High scalability

Metrics of interest

- Modeling compute: performance, efficiency, scalability
- Modeling memory: reuse, performance, efficiency
- Modeling interface: system-level implications, performance


SCALE-Sim: Overview

Inputs
- Config file with architecture specs, for example:

  Parameter          Value
  Array Height       32
  Array Width        32
  IFMAP SRAM Size    1024
  Filter SRAM Size   1024
  OFMAP SRAM Size    128
  Dataflow           WS

- Topology.csv

SCALE-Sim
- Filter SRAM (double buffered), read by the accelerator
- IFMAP SRAM (double buffered), read by the accelerator
- OFMAP SRAM (double buffered), written by the accelerator
- Accelerator
- Interface

Outputs
- Output log: cycle and IF BW
- Traces: SRAM read/write, DRAM read/write

Inputs: Config file

- Array microarchitecture
- Memory sizes
- Matrix offsets
- Dataflow

Inputs: Topology CSV file

- Layer-by-layer configuration
- Network hyperparameters
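
To make the idea of a layer-by-layer topology file more concrete, the sketch below parses a small CSV and derives each layer's output size. The column names and the two sample layers are assumptions made for this illustration; they are not necessarily the exact header that SCALE-Sim expects.

```python
import csv
import io

# Hypothetical topology contents; the column names are illustrative assumptions.
SAMPLE_TOPOLOGY = """\
layer,ifmap_h,ifmap_w,filter_h,filter_w,channels,num_filters,stride
conv1,5,5,3,3,3,4,1
conv2,8,8,3,3,4,8,2
"""


def read_topology(text):
    """Parse topology rows into dicts with integer hyperparameters."""
    layers = []
    for row in csv.DictReader(io.StringIO(text)):
        layer = {"layer": row["layer"]}
        for key, value in row.items():
            if key != "layer":
                layer[key] = int(value)
        layers.append(layer)
    return layers


def ofmap_dims(layer):
    """Output height and width of a convolution without padding."""
    out_h = (layer["ifmap_h"] - layer["filter_h"]) // layer["stride"] + 1
    out_w = (layer["ifmap_w"] - layer["filter_w"]) // layer["stride"] + 1
    return out_h, out_w


layers = read_topology(SAMPLE_TOPOLOGY)
for layer in layers:
    print(layer["layer"], ofmap_dims(layer))
```

Each row carries the hyperparameters of one layer, so a simulator can walk the file top to bottom and handle the layers one at a time.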

SCALE-Sim: Modelling compute

[Figure: systolic array of Rows x Cols MAC units]

- Support for non-square arrays
- Support for multiple dataflows
- Fast execution by tracking the edges of the array
- Layer-by-layer execution
- Intrinsic folding of compute that is larger than the array
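
Folding can be pictured as tiling a workload over the array when it needs more rows or columns than the array has. The sketch below counts the folds and the resulting average occupancy under a simple tiling assumption; the function names are hypothetical and the simulator's internal bookkeeping may differ.

```python
import math


def fold_counts(work_rows, work_cols, array_rows, array_cols):
    """Row folds and column folds needed to cover the workload with the array."""
    row_folds = math.ceil(work_rows / array_rows)
    col_folds = math.ceil(work_cols / array_cols)
    return row_folds, col_folds


def mapping_utilization(work_rows, work_cols, array_rows, array_cols):
    """Fraction of MAC units that hold useful work, averaged over all folds."""
    row_folds, col_folds = fold_counts(work_rows, work_cols, array_rows, array_cols)
    total_slots = row_folds * col_folds * array_rows * array_cols
    return (work_rows * work_cols) / total_slots


# A 27 x 64 workload on a 32 x 32 array needs 1 row fold and 2 column folds.
print(fold_counts(27, 64, 32, 32))
print(mapping_utilization(27, 64, 32, 32))
```

Because the rows and columns are counted independently, the same arithmetic covers non-square arrays.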

SCALE-Sim: Modelling on-chip memory

[Figure: the IFMAP SRAM and Filter SRAM are filled from DRAM and feed the array; the OFMAP SRAM is filled from the array and drains to DRAM]

- Double buffered memories
- Models three memory regions, one for each matrix
- No replication of matrix elements in SRAM buffers

Modelling system interface

[Figure: DRAM read BW into the accelerator's SRAMs and DRAM write BW out of the accelerator]

- Tool outputs the required DRAM read and write bandwidth
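
One way to think about a bandwidth requirement is as the number of bytes moved divided by the number of cycles over which they move. The sketch below computes that average from a toy trace. The trace layout (a list of cycle and address-list pairs) and the `bytes_per_word` parameter are assumptions for this example, not the tool's actual trace format.

```python
def average_bandwidth(trace, bytes_per_word=1):
    """Average bytes per cycle over the span covered by a trace.

    `trace` is a list of (cycle, addresses) pairs; this layout is an
    assumption made for the example.
    """
    if not trace:
        return 0.0
    cycles = [cycle for cycle, _ in trace]
    span = max(cycles) - min(cycles) + 1
    words = sum(len(addresses) for _, addresses in trace)
    return words * bytes_per_word / span


# Ten words are read over cycles 0 through 4.
dram_read_trace = [
    (0, [0, 1, 2, 3]),
    (1, [4, 5, 6, 7]),
    (4, [8, 9]),
]
print(average_bandwidth(dram_read_trace))
```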

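Modelling GEMM

[Figure: a matrix multiplication mapped onto a 2x2 systolic array of MAC units. The operand elements (A B C / a b c and D E F / d e f) stream in from the array edges, and the array computes the four outputs C[0,0], C[0,1], C[1,0], and C[1,1].]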
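
The data movement in a small systolic array can be sketched cycle by cycle. The example below assumes an output-stationary arrangement, in which each MAC unit keeps one element of the result while one operand moves left to right and the other moves top to bottom with a one-cycle skew per row and column. That choice, the function name, and the idealized cycle count are assumptions for illustration only.

```python
def systolic_matmul(A, B):
    """Cycle-by-cycle output-stationary multiply of A (M x K) by B (K x N)."""
    M, K, N = len(A), len(A[0]), len(B[0])
    acc = [[0] * N for _ in range(M)]    # one accumulator per MAC unit
    a_reg = [[0] * N for _ in range(M)]  # operand values moving left to right
    b_reg = [[0] * N for _ in range(M)]  # operand values moving top to bottom
    total_cycles = K + M + N - 2
    for t in range(total_cycles):
        # Shift from the far corner backwards so each value moves one hop only.
        for i in reversed(range(M)):
            for j in reversed(range(N)):
                if j > 0:
                    a_reg[i][j] = a_reg[i][j - 1]
                else:
                    k = t - i  # row i starts i cycles late (skew)
                    a_reg[i][j] = A[i][k] if 0 <= k < K else 0
                if i > 0:
                    b_reg[i][j] = b_reg[i - 1][j]
                else:
                    k = t - j  # column j starts j cycles late (skew)
                    b_reg[i][j] = B[k][j] if 0 <= k < K else 0
        # Every MAC unit multiplies what it currently holds and accumulates.
        for i in range(M):
            for j in range(N):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, total_cycles


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
result, cycles = systolic_matmul(A, B)
print(result, cycles)
```

Only the values injected at the left and top edges come from memory; everything else is neighbour-to-neighbour forwarding.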

Convolution in CNNs

[Figure: an R x R filter slides over an H x H input image to produce an E x E output image]

- Each output pixel is a dot product of the filter with one input window, followed by partial sum accumulation
- Many input channels: the input image and each filter have C channels
- Many filters: M filters of size R x R x C produce M output channels
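
The symbols on these slides (H, R, E, C, M) map directly onto a loop nest. The sketch below is a plain, unoptimized convolution that assumes stride 1 and no padding, so that E = H - R + 1; the function name and the nested-list data layout are choices made for this example.

```python
def conv_layer(ifmap, filters):
    """Direct convolution with stride 1 and no padding.

    ifmap:   H x H x C nested lists
    filters: M x R x R x C nested lists
    returns: E x E x M nested lists, with E = H - R + 1
    """
    H = len(ifmap)
    C = len(ifmap[0][0])
    M = len(filters)
    R = len(filters[0])
    E = H - R + 1
    ofmap = [[[0] * M for _ in range(E)] for _ in range(E)]
    for m in range(M):                  # one output channel per filter
        for y in range(E):
            for x in range(E):
                partial_sum = 0
                for r in range(R):      # dot product over one window
                    for s in range(R):
                        for c in range(C):
                            partial_sum += ifmap[y + r][x + s][c] * filters[m][r][s][c]
                ofmap[y][x][m] = partial_sum
    return ofmap


# 5 x 5 x 3 input of ones, two 3 x 3 x 3 filters (all ones, all twos).
ifmap = [[[1] * 3 for _ in range(5)] for _ in range(5)]
filters = [
    [[[1] * 3 for _ in range(3)] for _ in range(3)],
    [[[2] * 3 for _ in range(3)] for _ in range(3)],
]
ofmap = conv_layer(ifmap, filters)
print(len(ofmap), len(ofmap[0]), ofmap[0][0])
```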

SCALE-Sim: Modelling convolutions

[Figure: a 5 x 5 input feature map with three channels (elements a1 to y3) and 3 x 3 filters with three channels (elements A1 to I3) are unrolled into streams. Convolution windows such as "a1 a2 a3 b1 b2 b3 ... m1 m2 m3" and "b1 b2 b3 c1 c2 c3 ... n1 n2 n3" enter the array of MAC units from one edge, and filter elements "A1 A2 A3 B1 B2 ... I3" enter from the other.]
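
The unrolling in the figure amounts to lowering a convolution to a matrix multiplication: every convolution window becomes one row and every filter becomes one column. The sketch below shows that lowering for stride 1 and no padding, with channels as the innermost index to match the "a1 a2 a3 b1 ..." ordering. The helper names are hypothetical.

```python
def unroll_windows(ifmap, R):
    """One row per convolution window; each row has R * R * C elements."""
    H = len(ifmap)
    E = H - R + 1
    rows = []
    for y in range(E):
        for x in range(E):
            window = []
            for r in range(R):
                for s in range(R):
                    window.extend(ifmap[y + r][x + s])  # channels innermost
            rows.append(window)
    return rows


def unroll_filters(filters):
    """Return a (R * R * C) x M matrix with one column per filter."""
    flat = [[w for row in f for pixel in row for w in pixel] for f in filters]
    return [list(column) for column in zip(*flat)]


def matmul(A, B):
    """Plain matrix multiplication on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


H, R, C = 5, 3, 3
ifmap = [[[10 * y + x + 100 * c for c in range(C)] for x in range(H)] for y in range(H)]
ones = [[[1] * C for _ in range(R)] for _ in range(R)]
center = [[[0] * C for _ in range(R)] for _ in range(R)]
center[1][1][0] = 1  # picks the centre pixel of channel 0

windows = unroll_windows(ifmap, R)
weights = unroll_filters([ones, center])
ofmap = matmul(windows, weights)
print(len(windows), len(windows[0]), len(weights), len(weights[0]))
print(ofmap[0])
```

For this 5 x 5 x 3 input and 3 x 3 filters the window matrix is 9 x 27 and the weight matrix is 27 x 2, which are the row and column counts that appear on the dataflow slides.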

Dataflows: Output Stationary

[Figure: m filters (elements A1 to I3 each) and the input feature map (elements a1 to y3) feed an array of 9 rows x m cols. Unrolled convolution windows ("a1 a2 a3 b1 b2 ... m3", "b1 b2 b3 c1 c2 ... n3", "c1 c2 c3 d1 d2 ... o3", ..., "m1 m2 ... y3") stream into the rows, and the unrolled weights of each filter ("A1 A2 A3 B1 B2 ... I3") stream into the columns.]

- Each MAC unit is responsible for particular output pixels
- Accumulation of partial sums is done locally
- Each column generates pixels from a different output channel
- SCALE-Sim assumes output collection is not on the critical path

Maximum usable dimensions
- Rows: pixels per output channel (9 in this example)
- Cols: number of filters (m in this example)

Dataflows: Weight Stationary

[Figure: m filters (elements A1 to I3 each) and the input feature map (elements a1 to y3) feed an array of 27 rows x m cols. Each column holds the 27 weights of one filter (A1 in the first row through I3 in the last). Unrolled convolution windows stream through over time, so the row holding A1 receives a1, b1, c1, f1, ..., m1 and the row holding I3 receives m3, n3, o3, r3, ..., y3.]

- Elements of filters are pre-filled into MAC units
- Every column is assigned a unique filter
- Reduction is done across the rows
- Weights are pre-filled within a column
- Critical path contains the time to fill in weights, partial sum generation, and reduction time

Maximum usable dimensions
- Rows: partial sums per pixel (27 in this example)
- Cols: number of filters (m in this example)
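
A functional view of the weight-stationary dataflow, ignoring timing, is sketched below: weights stay pinned in the array, each unrolled window is streamed across the rows, and the products are reduced down each column. The function name and the small example matrices are hypothetical, and the sketch does not model the fill or reduction latency mentioned above.

```python
def weight_stationary_pass(weights, windows):
    """Functional model of one weight-stationary pass.

    weights: K x M matrix pinned in the array (row k, column m)
    windows: unrolled convolution windows, each of length K
    returns: one list of M outputs per window
    """
    K, M = len(weights), len(weights[0])
    outputs = []
    for window in windows:
        column_sums = [0] * M
        for k in range(K):          # reduction proceeds across the rows
            for m in range(M):      # every column holds a different filter
                column_sums[m] += weights[k][m] * window[k]
        outputs.append(column_sums)
    return outputs


# K = 3 partial sums per pixel, M = 2 filters, two input windows.
weights = [[1, 0], [2, 1], [3, 0]]
windows = [[1, 1, 1], [1, 2, 3]]
outputs = weight_stationary_pass(weights, windows)
print(outputs)
```

The weights are read once for the whole pass, while a new window enters for every output pixel.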

Dataflows: Input Stationary

[Figure: m filters (elements A1 to I3 each) and the input feature map (elements a1 to y3) feed an array of 27 rows x 9 cols. Each column holds the 27 elements of one convolution window (for example a1 through m3). Unrolled weight matrices stream through over time, so the first row receives A1 of each filter and the last row receives I3 of each filter.]

- Elements of convolution windows are pre-filled into MAC units
- Every column is assigned an output pixel
- Reduction is done across the rows
- Inputs are pre-filled within a column
- Critical path contains the time to fill in input elements, partial sum generation, and reduction time

Maximum usable dimensions
- Rows: partial sums per pixel (27 in this example)
- Cols: number of output pixels per output channel (9 in this example)
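
The "maximum usable dimensions" of the three dataflows can be summarized in one small function. The sketch below follows the row and column definitions on the slides and assumes no padding; the function name, argument names, and the "os" / "ws" / "is" labels are choices made for this example.

```python
def usable_dimensions(dataflow, ifmap_h, ifmap_w, filter_h, filter_w,
                      channels, num_filters, stride=1):
    """Maximum usable (rows, cols) of the array for one convolution layer."""
    out_h = (ifmap_h - filter_h) // stride + 1
    out_w = (ifmap_w - filter_w) // stride + 1
    out_pixels = out_h * out_w                     # pixels per output channel
    window_size = filter_h * filter_w * channels   # partial sums per pixel
    if dataflow == "os":
        return out_pixels, num_filters
    if dataflow == "ws":
        return window_size, num_filters
    if dataflow == "is":
        return window_size, out_pixels
    raise ValueError(f"unknown dataflow: {dataflow}")


# The running example: 5 x 5 x 3 input, 3 x 3 x 3 filters, here with m = 4.
for name in ("os", "ws", "is"):
    print(name, usable_dimensions(name, 5, 5, 3, 3, 3, 4))
```

With these inputs the function reproduces the 9, 27, and 27 x 9 figures shown on the three dataflow slides.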

Supporting other layer types

Fully connected
- Can be modelled as matrix-vector multiplication (input x layer weights)
- SCALE-Sim models it as a convolution with input dimension the same as the weight dimension

LSTMs, Attention
- Modelled as matrix-vector or vector-vector operations

Softmax, elementwise operations, Pooling
- Not efficient on systolic arrays
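
The fully connected case can be illustrated with a few lines of arithmetic: if the filter covers the whole input, each filter produces exactly one output value, which is one neuron of the layer. The 1 x N layout with a single channel used below is an assumption for this sketch, as are the function and key names; other layouts with the same filter and input size would work equally well.

```python
def fc_as_conv(in_features, out_features):
    """Describe a fully connected layer as a convolution whose filter
    has the same size as its input (hypothetical 1 x N layout)."""
    return {
        "ifmap_h": 1, "ifmap_w": in_features,
        "filter_h": 1, "filter_w": in_features,
        "channels": 1, "num_filters": out_features, "stride": 1,
    }


def conv_macs(layer):
    """Multiply-accumulate count of a convolution without padding."""
    out_h = (layer["ifmap_h"] - layer["filter_h"]) // layer["stride"] + 1
    out_w = (layer["ifmap_w"] - layer["filter_w"]) // layer["stride"] + 1
    window = layer["filter_h"] * layer["filter_w"] * layer["channels"]
    return out_h * out_w * window * layer["num_filters"]


layer = fc_as_conv(512, 10)
print(conv_macs(layer))
```

The MAC count equals in_features x out_features, the same as the matrix-vector product it stands in for.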

Console Output

1. Summary of input configurations
2. Run and stall cycles
3. Mapping efficiency and compute utilization
4. Off-chip access bandwidth

Generated outputs

- Cycle-accurate traces per operand
- Summary files

Summary Files

Filename                     Attributes
COMPUTE_REPORT.csv           Layer-wise compute cycles, stall cycles, mapping utilization, etc.
BANDWIDTH_REPORT.csv         Layer-wise SRAM and DRAM access bandwidths for operands
DETAILED_ACCESS_REPORT.csv   Access counts and timing information

Announcement!
SCALE-Sim v2 Release (Beta)
We are releasing a new version of SCALE-Sim: https://github.com/scalesim-project/scale-sim-v2
We will soon have a stable version repo in ARM’s GitHub.

New features
1. Tool can be run in both stall-free and bandwidth-limited modes
2. New metrics such as mapping efficiency and stall count added
3. Modular code
4. Available as a Python package
5. More enhancements in the pipeline!

We also have a new website: https://scalesim-project.github.io

Demos
We will showcase SCALE-Sim v2 capabilities with 3 tutorials

1. Using SCALE-Sim as a package
   Design space exploration of a systolic accelerator

2. Adding new features to the simulator
   Adding new buffer hierarchies in SCALE-Sim

3. Using SCALE-Sim as a library to build bigger simulators
   Building a scaled-out simulator using the SCALE-Sim API
