ISSCC 2021 / SESSION 3 / HIGHLIGHTED CHIP RELEASES: MODERN DIGITAL SoCs / 3.2
3.2 The A100 Datacenter GPU and Ampere Architecture

Jack Choquette, Edward Lee, Ronny Krashinsky, Vishnu Balan, Brucek Khailany

Nvidia, Santa Clara, CA
The diversity of compute-intensive applications in modern cloud data centers has driven the explosion of GPU-accelerated cloud computing. Such applications include AI deep learning training and inference, data analytics, scientific computing, genomics, edge video analytics and 5G services, graphics rendering, and cloud gaming. The A100 GPU introduces several features targeting these workloads: a 3rd-generation Tensor Core with support for fine-grained sparsity; new BFloat16 (BF16), TensorFloat-32 (TF32), and FP64 datatypes; scale-out support with Multi-Instance GPU (MIG) virtualization; and scale-up support with a 3rd-generation 50Gbps NVLink I/O interface (NVLink3) and NVSwitch inter-GPU communication. As shown in Fig. 3.2.1, A100 contains 108 Streaming Multiprocessors (SMs) and 6912 CUDA cores. The SMs are fed by a 40MB L2 cache and 1.56TB/s of HBM2 memory bandwidth (BW). At 1.41GHz, A100 provides an effective peak of 1248TOPS (8b integer), 624TFLOPS (FP16), and 312TFLOPS (TF32) when including sparsity optimizations. Implemented in a TSMC 7nm N7 process, the A100 die (Fig. 3.2.7) contains 54B transistors and measures 826mm².
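These headline rates follow from the per-SM Tensor Core throughput. As a rough consistency check, assuming the published per-SM rate of 1024 dense FP16 FMA operations per clock (256 per Tensor Core, 4 Tensor Cores per SM; this figure is not stated in this section) and the 2x gain from the fine-grained sparsity feature:

\begin{align*}
\text{FP16 (dense)}  &\approx 108\ \text{SMs} \times 1024\,\tfrac{\text{FMA}}{\text{SM}\cdot\text{clk}} \times 2\,\tfrac{\text{flop}}{\text{FMA}} \times 1.41\,\text{GHz} \approx 312\ \text{TFLOPS}\\
\text{FP16 (sparse)} &\approx 2 \times 312 \approx 624\ \text{TFLOPS}\\
\text{INT8 (sparse)} &\approx 2 \times 624 \approx 1248\ \text{TOPS}\\
\text{TF32 (sparse)} &\approx \tfrac{1}{2} \times 624 \approx 312\ \text{TFLOPS}
\end{align*}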
Tensor Cores were first introduced in the Volta GPU architecture for accelerating matrix multiplies. The 3rd-generation A100 Tensor Core provides 2× higher throughput on dense FP16 matrix multiplies per SM per clock cycle, provides another 2× boost from fine-grained sparsity (Fig. 3.2.2), adds comprehensive datatype support, and improves energy efficiency vs. V100. Combined with a 1.35× increase in SM count, A100 supports 2.5× higher FP16 TFLOPS per GPU for dense data, and 5× for sparse data, vs. V100.
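On the software side, Tensor Cores are reached from CUDA either through libraries or through the warp-level WMMA API. The following is a minimal illustrative sketch, not code from the paper, of one warp computing a single 16x16x16 FP16 tile with FP32 accumulation; the kernel name, array names, and single-tile framing are assumptions made for brevity.

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A*B + 0 for a single 16x16x16 tile:
// FP16 inputs, FP32 accumulation, A row-major, B col-major, leading dimension 16.
__global__ void tile_mma(const half *a, const half *b, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);                 // zero the accumulator
    wmma::load_matrix_sync(a_frag, a, 16);               // cooperative warp-wide loads
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // Tensor Core matrix multiply-accumulate
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}

A real kernel launches many warps, tiles over the K dimension, and pipelines the loads; this fragment only shows the instruction-level interface.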
The A100 Tensor Cores also support new datatypes (Fig. 3.2.2). For DL training, A100 supports BF16 data and a new TF32 format for Tensor Core input operands that contains 1 sign, 8 exponent, and 10 mantissa bits. With TF32, everything outside of the Tensor Core remains standard FP32, including accumulators and memory storage. TF32 enables AI training to use Tensor Cores by default for FP32 data without programmer effort, providing 10× throughput on dense FP32 data and 20× on sparse FP32 data vs. V100. For DL inference, A100 supports 8b integer, 4b integer, and binary datatypes with 32b integer accumulation at even higher throughputs. For HPC, A100 introduces FP64 Tensor Cores, delivering a 2.5× FLOPS increase over V100.
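Because TF32 keeps FP32's 8-bit exponent, it preserves FP32 dynamic range and only reduces mantissa precision at the Tensor Core input. A host-side sketch of that precision reduction by simple mantissa truncation (illustrative only; the function name is ours and this does not reproduce the hardware's exact rounding behavior):

#include <cstdint>
#include <cstring>
#include <cstdio>

// Reduce an FP32 value to TF32 precision: keep the sign (1b), the exponent (8b),
// and the top 10 of the 23 mantissa bits; the low 13 mantissa bits are dropped.
float to_tf32_precision(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFFE000u;            // clear 23 - 10 = 13 low mantissa bits
    float y;
    std::memcpy(&y, &bits, sizeof(y));
    return y;
}

int main() {
    float x = 4.0f / 3.0f;
    std::printf("fp32: %.9f  tf32-precision: %.9f\n", x, to_tf32_precision(x));
    return 0;
}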
The new A100 Tensor Cores consume 2× the data BW per SM vs. V100 for dense data and 3× for sparse data (compressed weights and uncompressed activations). As shown in Fig. 3.2.3, A100 adds several data movement features to keep the Tensor Cores fed and to improve performance and energy efficiency. First, the wider Tensor Cores exploit more data reuse. When 4 warps (32-thread groups) worked together on a matrix multiply in V100, its 8-thread Tensor Cores would load data 4 times from shared memory into different threads' registers. In A100, the 32-thread Tensor Cores halve the shared-memory load BW. Second, in V100, load-global instructions transfer data from L2 and DRAM, through the L1 cache, and into the register file (RF), before store-shared instructions write the data back to shared memory. A100 has a new combined load-global-store-shared instruction that performs an asynchronous copy to transfer data directly into shared memory, bypassing the RF. The asynchronous copy is programmed with a new ISO C++20 asynchronous barrier [5], supported in CUDA 11.0.
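In CUDA 11 this path is exposed through libcu++'s cuda::barrier together with cuda::memcpy_async. A minimal sketch, assuming a simple one-tile-per-block staging pattern (the kernel structure and names are illustrative, not from the paper):

#include <cuda/barrier>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Stage a tile of floats from global memory into shared memory without routing
// the data through the register file, then compute on the shared-memory copy.
__global__ void staged_kernel(const float *gmem, float *out, int tile_elems) {
    extern __shared__ float smem[];
    auto block = cg::this_thread_block();

    // Block-scoped barrier with ISO C++20 std::barrier semantics.
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    if (block.thread_rank() == 0) init(&bar, block.size());
    block.sync();

    // Collective asynchronous copy; on A100 this can compile to the new
    // load-global-store-shared instruction described above.
    cuda::memcpy_async(block, smem, gmem + blockIdx.x * tile_elems,
                       sizeof(float) * tile_elems, bar);

    bar.arrive_and_wait();              // wait for the copy to complete

    // ... compute on smem; a trivial stand-in:
    if (block.thread_rank() == 0) out[blockIdx.x] = smem[0];
}

The kernel is launched with tile_elems * sizeof(float) bytes of dynamic shared memory; overlapping the copy with compute on a previously staged tile is what hides the transfer latency.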
In aggregate, A100's Tensor Cores run 2.5× faster than V100 for dense FP16 data, but we set a design goal to not rely on weak scaling (growing the workload size by increasing the NN size or batch size) for achieving speedups. Instead, A100 targeted strong scaling, with a 2.5× speedup on fixed-size NNs, which constrains the available parallelism and requires more L2 cache BW per SM. To achieve this, the L2 was split into partitions using a hierarchical crossbar structure. Each L2 partition also caches data nearer to the accessing SMs, lowering latency. Hardware cache coherence still maintains the memory consistency supported by CUDA across the full GPU. A100 also adds memory-system features that increase the available DRAM BW and reduce required memory traffic. An additional HBM2 site and faster clocks provide 1.56TB/s, a 1.7× increase over V100. To reduce memory traffic, A100 increases L2 capacity by almost 7× over V100 and adds L2 controls that give programmers the ability to manage the on-chip residency of data (see the sketch after this paragraph). A100 also provides compute data compression to exploit the available unstructured sparsity in NN activations, typically over 50%. A100's compression HW can compress these activation tensors by 2-4× in DRAM and by up to 2× in the L2.
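The L2 residency controls are exposed in CUDA 11 as an access-policy window attached to a stream. A brief host-side sketch (the helper name and sizing here are illustrative, not a recipe from the paper):

#include <cuda_runtime.h>

// Ask the GPU to keep a frequently reused buffer resident in L2 for work
// launched on `stream`. The persisting L2 carve-out is reserved first.
void pin_buffer_in_l2(cudaStream_t stream, void *buf, size_t bytes) {
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, bytes);     // reserve persisting L2

    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = buf;
    attr.accessPolicyWindow.num_bytes = bytes;                     // address window covered by the policy
    attr.accessPolicyWindow.hitRatio  = 1.0f;                      // fraction of accesses given hitProp
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}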
Figure 3.2.4 shows the speedups achieved per chip for A100 vs. V100. For AI workloads, on MLPerf v0.7 training, A100 achieves 1.5-to-2.5× speedups vs. V100 and outperforms all other commercially available solutions. The MLPerf benchmarks also demonstrate A100's breadth of support for AI: it is the only system able to run all of the benchmarks, and it runs them with high performance. The CUDA programmability and strong-scaling features in A100 also benefit general-purpose datacenter workloads, including HPC. A100 runs 1.5-to-2× faster than V100 on key workloads from molecular dynamics, physics, engineering, and geoscience.
A100 also adds Multi-Instance GPU (MIG) features to address under-utilization of the system with small batches of work in datacenters. Each A100 can function as up to 7 isolated GPUs, reconfigurable on the fly to meet instantaneous demand. Two types of MIG instances are supported. One type isolates compute resources, but not the memory system, enabling an OS to schedule processes with lightweight administration. The other type further provides functional and performance isolation in the memory system. In this scenario, A100 assigns each MIG instance physical pathways through the GPU (including the on-chip crossbar, the L2 cache, and the memory interface) that are not shared with any other MIG instance, providing flexible security boundaries for cloud computing providers.
To further support strong scaling of A100 in multi-GPU systems, I/O BW was also increased. A100's NVLink3 doubles the BW per GPU to 300GB/s in each direction through a combination of faster signaling and an increase in the number of links per GPU to 12. NVLink3 is a long-reach (LR) differential interface operating at a raw bitrate of 50Gbps (Fig. 3.2.5). Unlike the prevailing 50Gbps LR standards, which use PAM4 signaling and require forward error correction (FEC) to achieve low BER, the NVLink3 PHY uses NRZ signaling and achieves 1e-15 BER without requiring a FEC. This saves tens of nanoseconds of round-trip latency. In the LR mode, the NVLink3 PHY uses a combination of a 3-tap Tx FIR filter, an Rx Continuous-Time Linear Equalizer (CTLE), and Partial-Response Maximum-Likelihood (PRML) sequence detection to equalize the channel and boost the SNR. The Tx FIR filter is trained through a back-channel supported by the NVLink protocol. The CTLE has equalization tunability at multiple frequency bands to closely track the channel response. To relax the equalization requirement, a partial-response target is used. The inter-symbol correlation introduced by a partial-response target can be exploited by using sequence detection instead of symbol-by-symbol detection to boost the SNR. A Viterbi detector is used to achieve maximum-likelihood sequence detection. A baud-rate clock-recovery architecture is used to reduce the power and area of the receiver.
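To illustrate why sequence detection beats symbol-by-symbol slicing on a partial-response channel, the toy detector below assumes a 1+D (duobinary) target over NRZ symbols; the paper does not specify the target polynomial, and this sketch is not the PHY's implementation. With a 1+D target, each received sample depends on two consecutive symbols, so a two-state Viterbi search over the resulting trellis recovers the maximum-likelihood symbol sequence.

#include <vector>
#include <array>
#include <limits>

// Toy 2-state Viterbi detector for a 1+D partial-response target over NRZ symbols (+1/-1).
// State = previous symbol; the noiseless branch output is prev + cur (one of -2, 0, +2).
std::vector<int> viterbi_1_plus_D(const std::vector<double> &rx) {
    const int symbols[2] = {-1, +1};                 // symbol/state alphabet
    std::array<double, 2> metric = {0.0, 0.0};       // path metrics; either starting state allowed
    std::vector<std::array<int, 2>> prev_state(rx.size());

    for (size_t k = 0; k < rx.size(); ++k) {
        std::array<double, 2> next = {std::numeric_limits<double>::infinity(),
                                      std::numeric_limits<double>::infinity()};
        std::array<int, 2> best_prev = {0, 0};
        for (int s = 0; s < 2; ++s) {                // previous symbol (current state)
            for (int c = 0; c < 2; ++c) {            // current symbol (next state)
                double expect = symbols[s] + symbols[c];              // partial-response output
                double m = metric[s] + (rx[k] - expect) * (rx[k] - expect);
                if (m < next[c]) { next[c] = m; best_prev[c] = s; }
            }
        }
        metric = next;
        prev_state[k] = best_prev;
    }

    // Trace back the maximum-likelihood sequence from the best final state.
    int s = (metric[0] < metric[1]) ? 0 : 1;
    std::vector<int> detected(rx.size());
    for (size_t k = rx.size(); k-- > 0;) {
        detected[k] = symbols[s];
        s = prev_state[k][s];
    }
    return detected;
}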
Figure 3.2.6 shows a DGX A100 system with 8 A100 GPUs and 6 NVSwitch chips. Each GPU connects to each of the NVSwitch chips using 2 NVLink3 links, for a total of 12 NVLink3 connections supporting 600GB/s of NVLink interconnect for each A100. The system supports full-BW, non-blocking, pairwise communication between all the GPUs. It also contains multiple high-speed Mellanox HDR 200Gb/s InfiniBand NICs for connecting to massively parallel multi-GPU systems via a full fat-tree switched interconnect.

Acknowledgements:
The authors would like to thank the countless architects and engineers who designed and built the A100 and its products, and who contributed to the content of this paper.

References:
[1] J. Choquette, "NVIDIA's Volta GPU: Programmability and Performance for GPU Computing," Hot Chips, 2017.
[2] V. Balan et al., "A 130mW 20Gb/s Half-Duplex Serial Link in 28nm CMOS," ISSCC, pp. 438-439, 2014.
[3] P. Mattson et al., "MLPerf Training Benchmark," arXiv:1910.01500.
[4] A. Ishii et al., "NVSwitch and DGX-2: NVLink-Switching Chip and Scale-Up Compute Server," Hot Chips, 2018.
[5] International Standard ISO/IEC 14882:2020 – Programming Languages – C++.
Figure 3.2.1: A100 architecture block diagram: 108 SMs (6912 CUDA cores), 40MB L2 (6.7× capacity vs. V100), 1.56TB/s HBM2 (1.7× BW vs. V100).
Figure 3.2.2: A100 Tensor Core input/output formats and performance. The TOPS column indicates TFLOPS for floating-point ops and TOPS for integer ops; Sparse TOPS represents effective TOPS/TFLOPS using the new sparsity feature.
Figure 3.2.3: A100 Streaming Multiprocessor (SM) architecture and data movement efficiency for strong scaling: A100 improves SM BW efficiency compared to V100 with a new load-global-store-shared asynchronous copy instruction that bypasses the L1 cache and register file (RF). Additionally, A100's more efficient Tensor Cores reduce shared memory (SMEM) loads.
Figure 3.2.4: A100 speedups on HPC and MLPerf v0.7 vs. V100 and alternatives.
Figure 3.2.5: NVLink3 link architecture.
Figure 3.2.6: DGX-A100 system block diagram.
Figure 3.2.7: A100 die photo.