ECE 6775
High-Level Digital Design Automation
Fall 2024
Hardware Specialization
Announcements
▸First reading assignment
– A. Boutros and V. Betz, “FPGA Architecture:
Principles and Progression”, IEEE CAS-M 2021
– Complete reading before Thursday 9/5
▸Lab 1 and an HLS tool setup guide will be
released soon (by Monday)
1
Recap: Our Interpretation of E-D-A
▸ Exponential in complexity (or Extreme scale)
▸ Diverse: increasing system heterogeneity
▸ Algorithmic: intrinsically computational
2
Significance of EDA: Another Proof
Productivity innovation: reduce custom design (structured synthesis)
[Chart: "# of Customs over Time" — normalized count of custom-designed blocks, showing a >10x reduction over 5 processor generations. Milestone: digital logic in 22nm server-class microprocessors is 99% synthesized and signed off at the gate level, with custom-like results achieved via data-flow alignment*.]
Ruchir Puri, High Performance Microprocessor Design and Synthesis Automation: Challenges and Opportunities, TAU 2013 keynote.
3
*ISPD 2013 Best Paper Award: "Network Flow Based Datapath Bit Slicing", H. Xiang et al.
Agenda
▸Motivation for hardware specialization
– Key driving forces from applications and technology
– Main sources of inefficiency in general-purpose
computing
▸A taxonomy of common specialization
techniques
▸Introduction to fixed-point types
4
A Golden Age of Hardware Specialization
▸ Higher demand for efficient compute acceleration,
especially for machine learning (ML) workloads
▸ Lower barrier to entry, as open-source hardware and
accelerators in the cloud come of age
5
Rising Computational Demands of Emerging
Applications
▸ Deep neural networks (DNNs) require an enormous amount of compute
– Consider ResNet50, a 70-layer model that performs 7.7 billion operations
to classify an image (a relatively small model by today's standards)
[Figure: compute demand of ML models over time (LeNet, decision tree, LSTM, AlexNet, NPLM, ResNet, Minerva, Transformer, AlphaGo, GPT-3, PaLM), growing far faster than the capability of contemporary hardware (Intel 386, Intel Pentium 4, NVIDIA Kepler, Intel Haswell 18-Core, NVIDIA H100, Intel SPR 60-Core)]
6
Figure source: Cornell Zhang Research Group
On a Crash Course with the End of “Cheap”
Technology Scaling
7
Dennard Scaling in a Nutshell
▸ Classical Dennard scaling
– Frequency increases at constant power profiles
– Performance improves “for free”!
Dennard scaling (scale factor S per generation):
– Transistor (trans.) count: S^2
– Capacitance / trans.: 1/S
– Voltage (Vdd): 1/S
– Frequency: S
– Total power: 1 (constant)
Note: dynamic power ∝ C·V^2·F
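Combining the rows with the dynamic-power relation shows why performance came “for free”:
\[ P_{\text{total}} \;\propto\; S^{2} \cdot \frac{C}{S} \cdot \left(\frac{V}{S}\right)^{2} \cdot (S\,F) \;=\; C\,V^{2}F \]
S^2 times as many transistors switch S times faster at the same total power.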
8
End of Dennard Scaling and its Implications
▸ Power limited scaling
– Vth scaling halted due to exponentially increasing leakage power
– VDD scaling nearly stopped as well to maintain performance
Leakage-limited scaling (scale factor S per generation):
– Transistor (trans.) count: S^2
– Capacitance / trans.: 1/S
– Voltage (Vdd): ~1
– Frequency: ~1
– Total power: S
Note: dynamic power ∝ C·V^2·F
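With Vdd and frequency held roughly constant, the same bookkeeping gives power that grows by S every generation:
\[ P_{\text{total}} \;\propto\; S^{2} \cdot \frac{C}{S} \cdot V^{2} \cdot F \;=\; S \cdot C\,V^{2}F \]
which is the power-limited regime behind the “dark silicon” concern below.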
▸ Implication: “Dark silicon”?
– Power limits restrict how much of the chip can be activated simultaneously
– Can no longer activate 100% of the chip without more power
9
Trade-off Between Flexibility and Efficiency
FLEXIBILITY ← CPUs — GPUs — FPGAs — ASICs → EFFICIENCY
[Diagram: a general-purpose CPU shown as a control unit (CU), an arithmetic logic unit (ALU), and registers]
Why are general-purpose
CPUs less energy efficient?
10
CPU Core Architecture
▸ Core = complex control + limited # of compute units + large caches
– Scalar & vector instructions
• Backward-compatible ISA
– Complex control logic: decoding, hazard detection, exception handling, etc.
[Diagram: a single core with a control block, a few ALUs, and a cache]
▸ Mainly optimized to reduce latency of running serial code
– Shallow pipelines (< 30 stages)
– Superscalar, OOO, speculative execution, branch prediction,
prefetching, etc.
– Low throughput, even with multithreading
11
Poll & Discussion
Shallow vs. Deep Pipelining in the context of CPU design
http://pollev.com/ece6775
Sign in or register using your Cornell email
12
Multi-Core Architecture
[Diagram: four cores, each with its own control logic, ALUs, and private L1/L2 cache, all sharing a last-level cache (LLC)]
With four cores, should we expect a 4x speedup
on an arbitrary application?
13
Graphics Processing Unit (GPU)
▸ A GPU has thousands of cores to run many threads in parallel
– Cores are simpler (compared to CPU cores)
• No support for superscalar, OOO, speculative execution, etc.
• ISA is not backward compatible
– Amortize overhead with SIMD + single instruction, multiple threads (SIMT)
▸ Optimized to increase the throughput of data-parallel applications
– Initially targeted graphics code
– Latency tolerant, with many concurrent threads in flight
[Diagram: CPU with a few large cores and a big cache vs. GPU with many small cores]
14
It’s Not Just About Performance: Computing’s
Energy Problem
Reading two 32b words from:
– DRAM: 1.3 nJ (≈65,000x the energy of an integer add)
– Large SRAM (256MB): 58 pJ
– Small SRAM (8KB): 5 pJ (≈250x the energy of an integer add)
Moving two 32b words by:
– 40mm (across a 400mm^2 chip): 77 pJ
– 1mm (local communication): 1.9 pJ
Arithmetic on two 32b words:
– FMA (float fused multiply-add): 1.2 pJ
– IADD (integer add): 0.02 pJ
Data from [1], based on a 14nm process
Data supply far outweighs arithmetic operations in energy cost
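The highlighted ratios follow directly from the table entries:
\[ \frac{1.3\ \text{nJ}}{0.02\ \text{pJ}} = \frac{1300\ \text{pJ}}{0.02\ \text{pJ}} = 65{,}000 \qquad \frac{5\ \text{pJ}}{0.02\ \text{pJ}} = 250 \]
i.e., a single DRAM access costs as much energy as tens of thousands of integer adds.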
15
[1] William Dally and Uzi Vishkin, On the Model of Computation, CACM’2022.
Rough Energy Breakdown for an Instruction
>20pJ 5pJ Control
I-Cache Register Control overheads 32-bit
access file access (clocking, decoding, ALU
pipeline control, ….)
Diagram adapted from W. Qadeer et al., Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing, ISCA 2013.
16
Principles for Improving Energy Efficiency
Do less work!
– Amortize overhead in control and data supply across
multiple instructions
17
Amortizing the Overhead
A sequence of energy-inefficient instructions: each instruction pays for an I-cache access, register file (RF) access, and control, for just one arithmetic operation
Single instruction, multiple data (SIMD): tens of arithmetic operations share one instruction's I-cache/RF/control overhead
Further specialization (what we achieve using accelerators): hundreds of operations or more share that same overhead
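A rough way to quantify this amortization (a back-of-the-envelope model, where N is the number of arithmetic operations sharing one instruction's overhead):
\[ E_{\text{per op}} \;\approx\; \frac{E_{\text{I-cache}} + E_{\text{RF}} + E_{\text{ctrl}}}{N} \;+\; E_{\text{arith}} \]
As N grows from 1 (scalar) to tens (SIMD) to hundreds or more (accelerators), the overhead term shrinks toward the cost of the arithmetic itself.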
Diagram adapted from W. Qadeer et al., Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing, ISCA 2013.
18
Principles for Improving Energy Efficiency
Do less work!
– Amortize overhead in control and data supply across
multiple instructions
Do even less work!
– Use smaller (or simpler) data => cheaper operations,
lower storage & communication costs
– Move data locally and directly
• Store data nearby in simpler memory (e.g., scratchpads are
cheaper than cache)
• Wire compute units for direct (or even combinational)
communication when possible
19
Tensor Processing Unit (TPU)
▸ A domain-specific accelerator specialized for deep learning
– Main focus: accelerating matrix multiplication (MatMul) with a systolic array
• Uses CISC-style instructions: a single MatMul instruction may run for thousands of cycles
– TPUv1 does 8-bit integer (INT8) inference; TPUv2 supports a customized floating-
point type (bfloat16) for training
[Diagram: a 2-D grid of processing elements (PEs) — the 256x256 systolic array in Google TPU v1]
20
Source: Jouppi et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit”, ISCA 2017
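For reference, the computation the MatMul unit performs is an INT8 matrix multiply with wider accumulation; a minimal C++ sketch of that computation (the function name and the plain loops are illustrative — the systolic array evaluates the same multiply-accumulates by streaming operands through the PE grid instead):

#include <cstdint>

const int N = 256;  // matches the 256x256 array dimension above

// C = A x B with 8-bit integer inputs and 32-bit accumulation (TPUv1-style inference)
void matmul_int8(const int8_t A[N][N], const int8_t B[N][N], int32_t C[N][N]) {
  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < N; ++j) {
      int32_t acc = 0;
      for (int k = 0; k < N; ++k) {
        acc += (int32_t)A[i][k] * (int32_t)B[k][j];
      }
      C[i][j] = acc;
    }
  }
}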
Common Hardware Specialization Techniques:
A Taxonomy
▸ Custom Compute Units: Use complex instructions to
amortize overhead (e.g., SIMD, “ASIC”-in-an-instruction)
▸ Custom Numeric Types: Trade off accuracy and
efficiency with data types that use smaller bit widths or
simpler arithmetic
▸ Custom Memory Hierarchy: Exploit data access
patterns to reduce energy per memory operation
▸ Custom Communication Architecture: Tailor on-chip
networks to data movement patterns
21
Common Hardware Specialization Techniques:
A Taxonomy
▸ Custom Compute Units: Use complex instructions to
amortize overhead (e.g., SIMD, “ASIC”-in-an-instruction)
▸ Custom Numeric Types: Trade off accuracy and
efficiency with data types that use smaller bit widths or
simpler arithmetic
▸ Custom Memory Hierarchy: Exploit data access
patterns to reduce energy per memory operation
▸ Custom Communication Architecture: Tailor on-chip
networks to data movement patterns
22
Customizing Compute Units: An Intuitive View
[Diagram: datapath of a simple single-cycle CPU — program counter (PC), instruction RAM, instruction decoder, register file (RF), ALU, and data RAM, together with the usual control and mux-select signals]
A simple single-cycle CPU
23
Evaluating a Simple Expression on CPU
Cycle-by-cycle CPU activities: the expression R8 = R9 + (R1*R3 - R2*R4) is evaluated one instruction at a time, each passing through PC update, instruction decode, register file (RF) access, the ALU, and data RAM
R5 <= R1 * R3
R6 <= R2 * R4
R7 <= R5 - R6
R8 <= R9 + R7
24
Source: Adapted from Desh Singh’s talk at HCP’14 workshop
“Unrolling” the Instruction Execution
1. Replicate the CPU hardware: one copy per instruction, laid out in space
Each copy executes a single, fixed instruction, so its fetch and decode logic can be disabled
CPU1: R5 <= R1 * R3
CPU2: R6 <= R2 * R4
CPU3: R7 <= R5 - R6
CPU4: R8 <= R9 + R7
25
Source: Adapted from Desh Singh’s talk at HCP’14 workshop
Removing Unused Logic
2. Remove unused logic: each copy keeps only its register file (RF) and the one functional unit it needs, so the ALU is simplified as well
R5 <= R1 * R3 → RF + multiplier (x)
R6 <= R2 * R4 → RF + multiplier (x)
R7 <= R5 - R6 → RF + subtractor (–)
R8 <= R9 + R7 → RF + adder (+)
26
Source: Adapted from Desh Singh’s talk at HCP’14 workshop
An Application-Specific Compute Unit
3. Wire up the registers and functional units directly:
R1, R3 → multiplier → R5;   R2, R4 → multiplier → R6
R5, R6 → subtractor → R7 (a combinational connection can be used when timing constraints allow, e.g., for R7)
R9, R7 → adder → R8
27
Source: Adapted from Desh Singh’s talk at HCP’14 workshop
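For intuition, the specialized unit evaluates the same expression as the four instructions did, but as a single fixed-function operation; a minimal C-style sketch (variable names simply mirror the register names on the slides):

// The whole expression becomes one "operation": two multipliers, a subtractor,
// and an adder wired together, with no instruction fetch or decode.
int custom_unit(int r1, int r2, int r3, int r4, int r9) {
  int r5 = r1 * r3;   // multiplier
  int r6 = r2 * r4;   // multiplier
  int r7 = r5 - r6;   // subtractor (can be a purely combinational path)
  int r8 = r9 + r7;   // adder
  return r8;
}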
Common Hardware Specialization Techniques:
A Taxonomy
▸ Custom Compute Units: Use complex instructions to
amortize overhead (e.g., SIMD, “ASIC”-in-an-instruction)
▸ Custom Numeric Types: Trade off accuracy and
efficiency with data types that use smaller bit widths or
simpler arithmetic
▸ Custom Memory Hierarchy: Exploit data access
patterns to reduce energy per memory operation
▸ Custom Communication Architecture: Tailor on-chip
networks to data movement patterns
28
Customized Data Types
▸ Using custom numeric types tailored for a given
application/domain improves performance & efficiency
[Figure: bit layouts (sign / exponent / mantissa) of example numeric types]
– Half float (fp16), bfloat16, block floating point (block-fp)
– fixed<9,4>, int4, uint256, …, uint1 (covered in lectures & labs)
29
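As a preview of the types used in the labs, here is a minimal sketch assuming the Vivado/Vitis HLS arbitrary-precision headers, where ap_fixed<W, I> has W total bits with I integer bits, and ap_int<W>/ap_uint<W> are W-bit signed/unsigned integers:

#include <ap_fixed.h>
#include <ap_int.h>

ap_uint<4>     u = 11;     // 4-bit unsigned: 1011 = 11
ap_int<4>      s = -5;     // 4-bit two's complement: 1011 = -5
ap_fixed<9, 4> x = 1.25;   // 9 bits total: 4 integer bits, 5 fractional bits

// Fixed-point multiply maps to cheap integer hardware; the result type is
// widened here to keep all integer bits of the product.
ap_fixed<16, 8> scale(ap_fixed<9, 4> a, ap_fixed<9, 4> b) {
  return a * b;
}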
Binary Representation – Positional Encoding
▸ Unsigned number: the most significant bit (MSB) has a place value (weight) of 2^(n-1); the binary point is implicit
– Weights 2^3 2^2 2^1 2^0: 1 0 1 1 = 11 (unsigned)
▸ Two's complement: the MSB weight is -2^(n-1)
– Weights -2^3 2^2 2^1 2^0: 1 0 1 1 = -5 (two's complement)
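Writing out the two interpretations of the same 4-bit pattern:
\[ 1011_2 = 2^{3} + 2^{1} + 2^{0} = 11 \ \text{(unsigned)} \qquad 1011_2 = -2^{3} + 2^{1} + 2^{0} = -5 \ \text{(two's complement)} \]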
30
Fixed-Point Representation of Fractional Numbers
▸ The positional binary encoding can also represent fractional values,
by using a fixed position of the binary point and place values with
negative exponents
(–) Less convenient to use in software, compared to floating point
(+) Much more efficient in hardware
Integer part (4 bits), fractional part (2 bits); place values 2^3 2^2 2^1 2^0 . 2^-1 2^-2, with the binary point between the two parts
– Unsigned fixed-point number: 1 0 1 1 . 0 1 = 11.25
– Signed (two's complement) fixed-point number: 1 0 1 1 . 0 1 = ??
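The unsigned value works out as follows; the signed interpretation uses the same digits with the MSB weighted -2^3 instead (the slide leaves that value as a question):
\[ 1011.01_2 = 2^{3} + 2^{1} + 2^{0} + 2^{-2} = 8 + 2 + 1 + 0.25 = 11.25 \]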
31
Next Lecture
▸More Hardware Specialization
32
Acknowledgements
▸These slides contain/adapt materials developed
by
– Bill Dally, NVIDIA
– System for AI Education Resource by Microsoft
Research
33