William Stallings
Computer Organization and Architecture
10th Edition
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Chapter 19
General-Purpose Graphic Processing Units
Compute Unified Device Architecture (CUDA)
A parallel computing platform and programming model created by NVIDIA
and implemented by the graphics processing units (GPUs) that it produces
CUDA C is a C/C++ based language
A program can be divided into three general sections:
Code to be run on the host (CPU)
Code to be run on the device (GPU)
Code related to the transfer of data between the host and the device
The data-parallel code to be run on the GPU is called a kernel
A kernel typically has few to no branching statements
Branching statements in a kernel result in serial execution of the threads on the GPU hardware (see the divergence sketch after Figure 19.6)
A thread is a single instance of the kernel function
The programmer defines the number of threads launched when the kernel function is called (a minimal sketch follows this list)
The total number of threads is typically in the thousands, both to maximize utilization of the GPU processor cores and to maximize the available speedup
The programmer specifies how these threads are to be bundled
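As a minimal sketch of the three sections (host code, device kernel, and host-device transfer), not taken from the text; the kernel name vecAdd and the sizes chosen are illustrative:

```c
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Device code: the kernel. Each thread is one instance of this function
// and computes a single element, so the only branch is a bounds check.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;                 // ~1M elements: thousands of threads
    size_t bytes = n * sizeof(float);

    // Host code: allocate and fill CPU buffers
    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Transfer code: allocate GPU buffers and copy the inputs over
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch: the programmer bundles the threads into blocks here
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);   // results back
    printf("c[0] = %f\n", h_c[0]);                         // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```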
[Figure 19.1 here: a grid composed of a 3 × 2 arrangement of blocks, Block(0,0) through Block(2,1), with each block composed of a 4 × 3 arrangement of threads, Thread(0,0) through Thread(3,2)]
Figure 19.1 Relationship Among Threads, Blocks, and a Grid
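As a hypothetical illustration of the figure's layout in code (the kernel whoAmI is invented for this sketch), the grid and block dimensions below reproduce the 3 × 2 grid of 4 × 3 blocks:

```c
#include <cuda_runtime.h>
#include <stdio.h>

// Each of the 72 threads prints its block, thread, and global coordinates.
__global__ void whoAmI(void)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. 11
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // 0 .. 5
    printf("block(%d,%d) thread(%d,%d) -> global(%d,%d)\n",
           blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y, x, y);
}

int main(void)
{
    dim3 grid(3, 2);    // Block(0,0) .. Block(2,1), as in Figure 19.1
    dim3 block(4, 3);   // Thread(0,0) .. Thread(3,2) within each block
    whoAmI<<<grid, block>>>();
    cudaDeviceSynchronize();   // wait for the device-side printf output
    return 0;
}
```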
Table 19.1 CUDA Terms to GPU’s Hardware Components Equivalence Mapping

CUDA Term | Definition | Equivalent GPU Hardware Component
Kernel | Parallel code in the form of a function to be run on the GPU | Not applicable
Thread | An instance of the kernel on the GPU | GPU/CUDA processor core
Block | A group of threads assigned to a particular SM | CUDA multiprocessor (SM)
Grid | The GPU | GPU
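This mapping can be inspected at run time. A small sketch (not from the text) that queries the hardware side of Table 19.1 through the CUDA runtime:

```c
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0

    // Blocks are assigned to SMs; threads run on CUDA cores in warps.
    printf("Device:            %s\n", prop.name);
    printf("SMs (multiproc.):  %d\n", prop.multiProcessorCount);
    printf("Warp size:         %d threads\n", prop.warpSize);
    printf("Max threads/block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```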
[Figure 19.2 here: the CPU devotes much of its silicon area to control logic and cache with a few ALUs, while the GPU devotes most of its area to ALUs; each is attached to its own DRAM]
Figure 19.2 CPU vs. GPU Silicon Area/Transistor Dedication
[Figure 19.3 here: a plot of theoretical GFLOPS versus time (Sep 2002 to Aug 2013, vertical axis 500 to 5500) with four series: NVIDIA GPU single precision, NVIDIA GPU double precision, Intel CPU single precision, and Intel CPU double precision]
Figure 19.3 Floating-Point Operations per Second for CPU and GPU
GPU Architecture Overview
The historical evolution can be divided into three major phases:
The first phase covers the early 1980s to the late 1990s, when the GPU was composed of fixed, nonprogrammable, specialized processing stages
The second phase covers the iterative modification of the resulting Phase I GPU architecture from a fixed, specialized hardware pipeline to a fully programmable processor (early to mid-2000s)
The third phase covers how the GPU/GPGPU architecture makes an excellent and affordable, highly parallelized SIMD coprocessor for accelerating the run times of some nongraphics-related programs, along with how a GPGPU language maps to this architecture
[Figure 19.4 here: the Fermi chip layout, with a host interface, a GigaThread scheduler, the SMs arranged around a shared L2 cache, and six DRAM interfaces]
Figure 19.4 NVIDIA Fermi Architecture
[Figure 19.5 here: a single SM containing an instruction cache; two warp schedulers, each with a dispatch unit; a register file (32k × 32-bit); 32 CUDA cores, each with a dispatch port, operand collector, FP unit, INT unit, and result queue; 16 load/store (Ld/St) units; four special function units (SFUs); an interconnect network; 64-kB shared memory/L1 cache; and a uniform cache]
Figure 19.5 Single SM Architecture
[Figure 19.6 here: two warp schedulers, each feeding an instruction dispatch unit; over time, each scheduler interleaves instructions from several warps (warps 8, 2, and 14 on one scheduler; warps 9, 3, and 15 on the other)]
Figure 19.6 Dual Warp Schedulers and Instruction Dispatch Units Run Example
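Warps matter for the branching caveat noted earlier: a scheduler issues one instruction to an entire 32-thread warp, so a data-dependent branch that splits a warp forces the two paths to execute serially. A hypothetical sketch (both kernel names are invented for this illustration):

```c
// Divergent: threads in the same 32-thread warp take different paths,
// so the hardware serializes the two sides of the branch.
__global__ void divergent(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (threadIdx.x % 2 == 0)         // splits every warp
        out[i] = i * 2.0f;
    else
        out[i] = i * 0.5f;
}

// Uniform: branching on the warp index keeps each warp on one path,
// so no divergence occurs (the warp size on NVIDIA GPUs is 32).
__global__ void uniform(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if ((threadIdx.x / 32) % 2 == 0)  // whole warps agree on the branch
        out[i] = i * 2.0f;
    else
        out[i] = i * 0.5f;
}
```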
CUDA Cores
The NVIDIA GPU processor cores are also known as CUDA cores
There are a total of 32 CUDA cores dedicated to each SM in the Fermi architecture
Each CUDA core has two separate pipelines, or data paths:
An integer (INT) unit pipeline, capable of 32-bit, 64-bit, and extended-precision integer and logic/bitwise operations
A floating-point (FP) unit pipeline, which can perform a single-precision FP operation, while a double-precision FP operation requires two CUDA cores
Table 19.2 GPU’s Memory Hierarchy Attributes

Memory Type | Relative Access Times | Access Type | Scope | Data Lifetime
Registers | Fastest; on-chip | R/W | Single thread | Thread
Shared | Fast; on-chip | R/W | All threads in a block | Block
Local | 100× to 150× slower than shared & registers; off-chip | R/W | Single thread | Thread
Global | 100× to 150× slower than shared & registers; off-chip | R/W | All threads & host | Application
Constant | 100× to 150× slower than shared & registers; off-chip | R | All threads & host | Application
Texture | 100× to 150× slower than shared & registers; off-chip | R | All threads & host | Application
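As a sketch of how the fastest two levels of Table 19.2 are used from CUDA C (illustrative, not from the text): the kernel below stages global-memory data into per-block shared memory and reduces it there, while the scalar locals live in registers.

```c
#define BLOCK 256

// Each block sums BLOCK elements of the global input array.
// Assumes the input length is a multiple of BLOCK.
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float tile[BLOCK];       // shared: all threads in the block

    int tid = threadIdx.x;              // tid and i live in registers
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[tid] = in[i];                  // global -> shared (one slow read)
    __syncthreads();                    // wait for the whole block

    // Tree reduction entirely in fast shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            tile[tid] += tile[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = tile[0];      // shared -> global (one slow write)
}
```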
[Figure 19.8 here: a device grid containing two blocks; each block has its own shared memory and per-thread registers for its two threads, while global memory and constant memory are shared by all threads and are also accessible to the host]
Figure 19.8 CUDA Representation of a GPU’s Basic Architecture. The example GPU shown has two SMs and two CUDA cores per SM.
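The constant and global spaces in the figure are both populated from the host. A hypothetical sketch using constant memory (the symbol coeff and kernel scale are invented for this illustration):

```c
#include <cuda_runtime.h>

// Constant memory: read-only on the device, written by the host,
// and alive for the whole application (Table 19.2 / Figure 19.8).
__constant__ float coeff[4];

__global__ void scale(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = coeff[0] * in[i] + coeff[1];   // every thread reads coeff
}

int main(void)
{
    float h_coeff[4] = {2.0f, 1.0f, 0.0f, 0.0f};

    // The host writes constant memory by symbol, not through a pointer.
    cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));

    // ... allocate global-memory buffers with cudaMalloc, copy inputs
    // with cudaMemcpy, and launch scale<<<blocks, threads>>>(...) ...
    return 0;
}
```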
[Figure 19.9 here: an execution unit (EU) with an instruction fetch stage, a thread arbiter, seven superscalar pipelines, and send, branch, and two SIMD FPU function units]
Figure 19.9 Intel Gen8 Execution Unit
[Figure 19.10 here: a subslice of 8 EUs with an instruction cache, a local thread dispatcher, a sampler with its own L1 and L2 sampler caches, and a data port]
Figure 19.10 Intel Gen8 Subslice
[Figure 19.11 here: a slice of 24 EUs, made up of three 8-EU subslices, together with fixed function units, function logic, an L3 data cache, and shared local memory]
Figure 19.11 Intel Gen8 Slice
[Figure 19.12 here: an Intel Core M processor SoC combining two CPU cores, the Gen8 graphics slice (24 EUs in three subslices), a system agent containing display and memory controllers and PCIe, and LLC cache slices, all joined by the SoC ring interconnect; the graphics attach through the GTI, and the slice's L3 data cache and shared local memory support atomics and barriers]
Figure 19.12 Intel Core M Processor SoC
Summary
Chapter 19: General-Purpose Graphic Processing Units
CUDA basics
GPU versus CPU
Basic differences between CPU and GPU architectures
Performance and performance per watt comparison
GPU architecture overview
Baseline GPU architecture
Full chip layout
Streaming multiprocessor architecture details
Importance of knowing and programming to your memory types
Intel’s Gen8 GPU