
ECE 6775

High-Level Digital Design Automation


Fall 2024

Hardware Specialization
Announcements

▸ First reading assignment
  – A. Boutros and V. Betz, “FPGA Architecture: Principles and Progression”, IEEE CAS-M 2021
  – Complete the reading before Thursday 9/5

▸ Lab 1 and an HLS tool setup guide will be released soon (by Monday)
Recap: Our Interpretation of E-D-A

▸ Exponential: in complexity (or Extreme scale)

▸ Diverse: increasing system heterogeneity

▸ Algorithmic: intrinsically computational
Significance of EDA: Another Proof
Productivity innovation: reduce custom design (structured synthesis)

[Chart: “# of Customs over Time”, showing a >10x reduction in custom-designed logic over 5 processor generations]

Milestone: digital logic in 22nm server-class microprocessors is 99% synthesized and signed off at the gate level, with custom-like results achieved through data-flow alignment.*

Ruchir Puri, “High Performance Microprocessor Design and Synthesis Automation: Challenges and Opportunities”, TAU 2013 keynote.

* ISPD 2013 Best Paper Award: “Network Flow Based Datapath Bit Slicing”, H. Xiang et al.
Agenda

▸ Motivation for hardware specialization
  – Key driving forces from applications and technology
  – Main sources of inefficiency in general-purpose computing

▸ A taxonomy of common specialization techniques

▸ Introduction to fixed-point types
A Golden Age of Hardware Specialization
▸ Higher demand for efficient compute acceleration, esp. for machine learning (ML) workloads

▸ Lower barrier, with open-source hardware and accelerators in the cloud coming of age
Rising Computational Demands of Emerging
Applications
▸ Deep neural networks (DNNs) require an enormous amount of compute
  – Consider ResNet50, a 50-layer model that performs 7.7 billion operations to classify a single image (a relatively small model by today's standards)

[Plot: compute demand of ML models over time (LeNet, decision trees, LSTM, NPLM, AlexNet, Minerva, ResNet, Transformer, AlphaGo, GPT-3, PaLM) versus the peak throughput of processors (Intel 386, Intel Pentium 4, 18-core Intel Haswell, NVIDIA Kepler, 60-core Intel SPR, NVIDIA H100); model compute demand grows far faster than processor capability]

Figure source: Cornell Zhang Research Group
On a Crash Course with the End of “Cheap”
Technology Scaling
Dennard Scaling in a Nutshell

▸ Classical Dennard scaling
  – Frequency increases at a constant power profile
  – Performance improves “for free”!

  Dennard scaling (scale factor S per generation):
    Transistor count            S²
    Capacitance / transistor    1/S
    Voltage (Vdd)               1/S
    Frequency                   S
    Total power                 1

  Note: dynamic power ∝ CV²F
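As a quick sanity check of the constant-power claim (a worked step, using the factors in the table above):

  P_total = (# trans.) · C · V² · F  ->  S² · (C/S) · (V/S)² · (S·F) = C·V²·F

Total dynamic power stays constant, while raw capability (transistors × frequency) grows by S³ per generation.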

End of Dennard Scaling and its Implications

▸ Power-limited scaling
  – Vth scaling halted due to exponentially increasing leakage power
  – VDD scaling nearly stopped as well, to maintain performance

  Leakage-limited scaling (scale factor S per generation):
    Transistor count            S²
    Capacitance / transistor    1/S
    Voltage (Vdd)               ~1
    Frequency                   ~1
    Total power                 S

  Note: dynamic power ∝ CV²F
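Repeating the same check with the post-Dennard factors shows where the power wall comes from:

  P_total  ->  S² · (C/S) · V² · F = S · C·V²·F

At roughly constant voltage and frequency, total power grows by S per generation, so a fixed power budget can no longer keep every transistor active.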

▸ Implication: “dark silicon”?
  – Power limits restrict how much of the chip can be activated simultaneously
  – No longer 100% of the chip, without exceeding the power budget
Trade-off Between Flexibility and Efficiency

FLEXIBILITY  <------  CPUs    GPUs    FPGAs    ASICs  ------>  EFFICIENCY

[Diagram: a CPU block with its control unit (CU), arithmetic logic unit (ALU), and registers, contrasted with increasingly specialized devices along the spectrum]

Why are general-purpose CPUs less energy efficient?
CPU Core Architecture

▸ Core = complex control + limited # of compute units + large caches
  – Scalar & vector instructions
    • Backward-compatible ISA
  – Complex control logic: decoding, hazard detection, exception handling, etc.

[Diagram: a CPU core with control logic, a few ALUs, and a large cache]

▸ Mainly optimized to reduce the latency of running serial code
  – Shallow pipelines (< 30 stages)
  – Superscalar, OOO, speculative execution, branch prediction, prefetching, etc.
  – Low throughput, even with multithreading
Poll & Discussion

Shallow vs. deep pipelining in the context of CPU design

http://pollev.com/ece6775
Sign in or register using your Cornell email
Multi-Core Architecture

[Diagram: four cores, each with its own control logic, ALUs, and private L1/L2 cache, all sharing a last-level cache (LLC)]

With four cores, should we expect a 4x speedup on an arbitrary application?
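(A hint, not on the original slide: the answer is bounded by Amdahl's law. If only a fraction p of the program is parallelizable, then Speedup(N) = 1 / ((1 - p) + p/N); with p = 0.9 and N = 4 cores, the speedup is only about 3.1x.)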

Graphics Processing Unit (GPU)
▸ GPU has thousands of cores to run many threads in parallel
  – Cores are simpler (compared to a CPU)
    • No support for superscalar, OOO, speculative execution, etc.
    • ISA not backward compatible
  – Amortize overhead with SIMD + single instruction multiple threads (SIMT)

▸ Optimized to increase the throughput of running data-parallel applications
  – Initially targeting graphics code
  – Latency tolerant, with many concurrent threads

[Diagram: a CPU with a few large cores and a big cache vs. a GPU with many small cores]
It’s Not Just About Performance: Computing’s
Energy Problem
Reading two 32b words from        Energy
  DRAM                             1.3 nJ
  Large SRAM (256MB)               58 pJ
  Small SRAM (8KB)                 5 pJ

Moving two 32b words by           Energy
  40mm (across a 400mm² chip)      77 pJ
  1mm (local communication)        1.9 pJ

Arithmetic on two 32b words       Energy
  FMA (float fused multiply-add)   1.2 pJ
  IADD (integer add)               0.02 pJ

A DRAM read costs ~65,000x the energy of an integer add; even a small SRAM read costs ~250x.
Data from [1], based on a 14nm process.

Data supply far outweighs arithmetic operations in energy cost.

[1] William Dally and Uzi Vishkin, “On the Model of Computation”, CACM 2022.
Rough Energy Breakdown for an Instruction

[Diagram: energy breakdown of a single 32-bit instruction: I-cache access (>20 pJ), register file access (5 pJ), control overheads (clocking, decoding, pipeline control, ...), and finally the 32-bit ALU operation itself]

Diagram adapted from W. Qadeer et al., “Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing”, ISCA 2013.
Principles for Improving Energy Efficiency
Do less work!
  – Amortize overhead in control and data supply across multiple instructions
Amortizing the Overhead
A sequence of energy-inefficient instructions:
  [ I-Cache | RF | Control | Arithmetic ]
  [ I-Cache | RF | Control | Arithmetic ]
  [ I-Cache | RF | Control | Arithmetic ]

Single instruction multiple data (SIMD): tens of operations per instruction
  [ I-Cache | RF | Control | Arithmetic x tens ]

Further specialization (what we achieve using accelerators):
  [ I-Cache | RF | Control | Arithmetic x hundreds or more ]

Diagram adapted from W. Qadeer et al., “Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing”, ISCA 2013.
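The effect can be captured with a simple cost model (an added worked step, not on the original slide): if each instruction pays a fixed overhead E_ov for instruction fetch, register access, and control, and performs N useful operations of energy E_op each, then

  E_per-op = E_op + E_ov / N

Driving N from 1 (scalar) to tens (SIMD) to hundreds or more (accelerators) makes the overhead term vanish.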
Principles for Improving Energy Efficiency
Do less work!
  – Amortize overhead in control and data supply across multiple instructions

Do even less work!
  – Use smaller (or simpler) data => cheaper operations, lower storage & communication costs
  – Move data locally and directly
    • Store data nearby in simpler memory (e.g., scratchpads are cheaper than caches)
    • Wire compute units for direct (or even combinational) communication when possible
Tensor Processing Unit (TPU)
▸ A domain-specific accelerator specialized for deep learning
  – Main focus: accelerating matrix multiplication (MatMul) with a systolic array
    • Uses CISC instructions: a single MatMul instruction may take thousands of cycles
  – TPUv1 does 8-bit integer (INT8) inference; TPUv2 supports a customized floating-point type (bfloat16) for training

[Diagram: a 256x256 systolic array of processing elements (PEs), as in Google TPU v1]

Source: Jouppi et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit”, ISCA 2017.
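To make the systolic idea concrete, below is a minimal NumPy simulation of a systolic matrix multiply (an illustrative sketch, not the TPU's actual weight-stationary INT8 design): operands enter skewed from the left and top edges, and each PE performs one multiply-accumulate per cycle while forwarding its inputs to its right and bottom neighbors.

import numpy as np

def systolic_matmul(A, B):
    """Compute C = A @ B on a simulated N x M output-stationary systolic array.

    Rows of A stream in from the left and columns of B from the top, each
    skewed by one cycle per row/column; every PE multiplies the pair of
    values passing through it, adds the product to its local accumulator,
    and forwards the operands right/down on the next cycle.
    """
    N, K = A.shape
    K2, M = B.shape
    assert K == K2
    acc = np.zeros((N, M))       # one accumulator per PE
    a_reg = np.zeros((N, M))     # operand moving rightward through each PE
    b_reg = np.zeros((N, M))     # operand moving downward through each PE
    for t in range(K + (N - 1) + (M - 1)):
        a_new = np.zeros_like(a_reg)
        b_new = np.zeros_like(b_reg)
        a_new[:, 1:] = a_reg[:, :-1]   # shift right
        b_new[1:, :] = b_reg[:-1, :]   # shift down
        for i in range(N):             # inject skewed A rows at the left edge
            k = t - i
            a_new[i, 0] = A[i, k] if 0 <= k < K else 0.0
        for j in range(M):             # inject skewed B columns at the top edge
            k = t - j
            b_new[0, j] = B[k, j] if 0 <= k < K else 0.0
        a_reg, b_reg = a_new, b_new
        acc += a_reg * b_reg           # every PE does one MAC per cycle
    return acc

A, B = np.random.rand(4, 6), np.random.rand(6, 3)
assert np.allclose(systolic_matmul(A, B), A @ B)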
Common Hardware Specialization Techniques:
A Taxonomy
▸ Custom Compute Units: Use complex instructions to amortize overhead (e.g., SIMD, “ASIC”-in-an-instruction)

▸ Custom Numeric Types: Trade off accuracy and efficiency with data types that use smaller bit widths or simpler arithmetic

▸ Custom Memory Hierarchy: Exploit data access patterns to reduce energy per memory operation

▸ Custom Communication Architecture: Tailor on-chip networks to data movement patterns
Customizing Compute Units: An Intuitive View

+2 0
1
SE(OFF,0) Adder
MP
Fm … F0
DR
SA RF
DataA
Inst. RAM

Decoder

SB M_address
IMM LD
DataB ALU Data
PC

MB SA Data_in 0
FS
0 RAM
SB 1
MD 1
DR
LD
D_in SE
MW VCZN MW MD
BS
MB
IMM

A simple single-cycle CPU

23
Evaluating a Simple Expression on CPU
Evaluating R8 = R9 + (R1*R3 - R2*R4), one instruction per cycle:

  Cycle 1:  R5 <= R1 * R3
  Cycle 2:  R6 <= R2 * R4
  Cycle 3:  R7 <= R5 - R6
  Cycle 4:  R8 <= R9 + R7

[Diagram: cycle-by-cycle CPU activity; every instruction exercises the PC, decoder, register file, ALU, and RAM]

Source: Adapted from Desh Singh’s talk at HCP’14 workshop
“Unrolling” the Instruction Execution
1. Replicate the CPU hardware, one copy per instruction; with each copy's instruction fixed, the fetch and decode logic can be disabled

  CPU1:  R5 <= R1 * R3
  CPU2:  R6 <= R2 * R4
  CPU3:  R7 <= R5 - R6
  CPU4:  R8 <= R9 + R7

[Diagram: four copies of the CPU datapath laid out in space, each hardwired to a single instruction]

Source: Adapted from Desh Singh’s talk at HCP’14 workshop
Removing Unused Logic
2. Remove unused logic from each copy; the ALU is also simplified to just the one operation it performs

  R5 <= R1 * R3   ->  RF + multiplier
  R6 <= R2 * R4   ->  RF + multiplier
  R7 <= R5 - R6   ->  RF + subtractor
  R8 <= R9 + R7   ->  RF + adder

Source: Adapted from Desh Singh’s talk at HCP’14 workshop
An Application-Specific Compute Unit

3. Wire up registers and functional units directly; use combinational connections when timing constraints allow (e.g., R7)

  R1  R3    R2  R4
   \  /      \  /
    (x)       (x)
     |         |
     R5        R6
      \       /
        (-)          <-- R7 realized as a combinational connection
         |
   R9    R7
     \   /
      (+)
       |
       R8

Source: Adapted from Desh Singh’s talk at HCP’14 workshop
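A software analogy of this three-step transformation (a sketch, not from the slides; all names are illustrative): the CPU executes the four instructions one at a time, paying fetch/decode overhead per step, while the specialized unit evaluates the whole expression as a single fixed dataflow.

def run_program(regs):
    """CPU-style execution: decode and execute one instruction per cycle."""
    program = [
        ("mul", "R5", "R1", "R3"),
        ("mul", "R6", "R2", "R4"),
        ("sub", "R7", "R5", "R6"),
        ("add", "R8", "R9", "R7"),
    ]
    ops = {"mul": lambda a, b: a * b,
           "sub": lambda a, b: a - b,
           "add": lambda a, b: a + b}
    for opcode, dst, src1, src2 in program:   # "fetch & decode" every cycle
        regs[dst] = ops[opcode](regs[src1], regs[src2])
    return regs["R8"]

def specialized_unit(r1, r2, r3, r4, r9):
    """Accelerator-style execution: two multipliers, a subtractor, and an
    adder wired directly, evaluated as one dataflow."""
    return r9 + (r1 * r3 - r2 * r4)

regs = {"R1": 2, "R2": 3, "R3": 4, "R4": 5, "R9": 10}
assert run_program(dict(regs)) == specialized_unit(2, 3, 4, 5, 10)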
Common Hardware Specialization Techniques:
A Taxonomy (recap; next: Custom Numeric Types)
Customized Data Types
▸ Using custom numeric types tailored for a given application/domain improves performance & efficiency

[Diagram: bit layouts (sign, exponent, mantissa) of half float (fp16), bfloat16, block floating point, fixed<9,4>, int4, uint256, ..., uint1; the fixed-point and integer types are covered in lectures & labs]
Binary Representation – Positional Encoding

Unsigned number:
▸ MSB (most significant bit) has a place value (weight) of 2^(n-1)

Two’s complement:
▸ MSB weight = -2^(n-1)

The binary point is implicit, to the right of the LSB.

  Unsigned:           2³ 2² 2¹ 2⁰          Two’s complement:  -2³ 2² 2¹ 2⁰
                       1  0  1  1 = 11                          1  0  1  1 = -5
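A quick way to check these weightings (a small helper, not from the slides):

def bits_value(bits, signed=False):
    """Value of a bit string under unsigned or two's-complement weighting."""
    val = int(bits, 2)
    if signed and bits[0] == "1":
        val -= 1 << len(bits)      # MSB carries weight -2^(n-1)
    return val

print(bits_value("1011"))               # 11  (unsigned)
print(bits_value("1011", signed=True))  # -5  (two's complement)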
Fixed-Point Representation of Fractional Numbers

▸ The positional binary encoding can also represent fractional values, by using a fixed position of the binary point and place values with negative exponents
  (–) Less convenient to use in software, compared to floating point
  (+) Much more efficient in hardware

  Integer part (4 bits) . Fractional part (2 bits)

  Unsigned fixed-point:         2³ 2² 2¹ 2⁰ . 2⁻¹ 2⁻²
                                 1  0  1  1 .  0   1   = 11.25

  Signed (2’s complement)
  fixed-point:                 -2³ 2² 2¹ 2⁰ . 2⁻¹ 2⁻²
                                 1  0  1  1 .  0   1   = ??
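The same check extends to fixed point (again a sketch; the labs use HLS types such as fixed<9,4>): interpret the bits as an integer, then scale by 2^-frac_bits.

def fixed_value(bits, frac_bits, signed=False):
    """Bit string as a fixed-point value: an unsigned or two's-complement
    integer scaled down by 2**frac_bits."""
    val = int(bits, 2)
    if signed and bits[0] == "1":
        val -= 1 << len(bits)      # MSB weight is -2^(n-1)
    return val / (1 << frac_bits)

print(fixed_value("101101", 2))               # 11.25 (matches the slide)
print(fixed_value("101101", 2, signed=True))  # evaluates the "??" above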
Next Lecture

▸ More Hardware Specialization
Acknowledgements

▸ These slides contain/adapt materials developed by
  – Bill Dally, NVIDIA
  – System for AI Education Resource by Microsoft Research
