PART19

The document discusses Compute Unified Device Architecture (CUDA) and how it allows parallel processing on GPUs. CUDA C is used to define code sections for the CPU host and GPU device. Kernels contain parallel code run on the GPU by threads organized in a grid of blocks. GPUs have more processing units than CPUs and are well-suited for parallel tasks like graphics rendering.

William Stallings
Computer Organization
and Architecture
10th Edition
© 2016 Pearson Education, Inc., Hoboken,
NJ. All rights reserved.
+ Chapter 19
General-Purpose Graphic Processing Units
+ Compute Unified Device Architecture (CUDA)

 A parallel computing platform and programming model created by NVIDIA and implemented by the graphics processing units (GPUs) that they produce
 CUDA C is a C/C++-based language
 A program can be divided into three general sections (see the sketch below):
   Code to be run on the host (CPU)
   Code to be run on the device (GPU)
   Code related to the transfer of data between the host and the device
 The data-parallel code to be run on the GPU is called a kernel
   A kernel typically has few to no branching statements
   Branching statements in the kernel result in serial execution of the threads in the GPU hardware
 A thread is a single instance of the kernel function
   The programmer defines the number of threads launched when the kernel function is called
   The total number of threads is typically in the thousands, to maximize the utilization of the GPU processor cores as well as the available speedup
   The programmer specifies how these threads are to be bundled
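The following is a minimal sketch (not from the text) of how the three sections look in CUDA C, assuming a simple element-wise vector addition; the kernel name, array size, and block size are illustrative choices. Each thread computes one element, so the kernel body needs no loop and only a single guard branch.

    // Hedged sketch: host code, device (kernel) code, and host<->device transfers.
    #include <stdio.h>
    #include <cuda_runtime.h>

    // Device section: the kernel. Each thread handles one element.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // guard against surplus threads
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        const int n = 1 << 20;                 // ~1M elements -> thousands of threads
        size_t bytes = n * sizeof(float);

        // Host section: ordinary CPU code and data.
        float *h_a = (float *)malloc(bytes);
        float *h_b = (float *)malloc(bytes);
        float *h_c = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        // Transfer section: allocate device memory and copy inputs host -> device.
        float *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, bytes);
        cudaMalloc(&d_b, bytes);
        cudaMalloc(&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        // The programmer bundles the threads into blocks at launch time.
        int threadsPerBlock = 256;
        int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
        vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);

        // Transfer section: copy the result device -> host.
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %f\n", h_c[0]);         // expect 3.0

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_b); free(h_c);
        return 0;
    }

Launching 256 threads per block over roughly a million elements yields the thousands of threads the slide calls for.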
[Figure: a grid of 3×2 blocks, Block(0,0) through Block(2,1); Block(1,1) is expanded to show its 4×3 array of threads, Thread(0,0) through Thread(3,2)]

Figure 19.1 Relationship Among Threads, Blocks, and a Grid
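As a hedged illustration of the figure, the sketch below shows how a kernel launched with the 3×2 grid of 4×3-thread blocks from Figure 19.1 derives a unique (x, y) position from blockIdx, blockDim, and threadIdx; the kernel name and the 12×6 element array are assumptions, not taken from the text.

    // Hedged sketch: 2D indexing for the grid/block layout of Figure 19.1.
    __global__ void scale2D(float *m, int width, int height, float s)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  // column within the grid
        int y = blockIdx.y * blockDim.y + threadIdx.y;  // row within the grid
        if (x < width && y < height)
            m[y * width + x] *= s;                      // one element per thread
    }

    // Host-side launch matching the figure (illustrative):
    //   dim3 grid(3, 2);     // Block(0,0) .. Block(2,1)
    //   dim3 block(4, 3);    // Thread(0,0) .. Thread(3,2) within each block
    //   scale2D<<<grid, block>>>(d_m, 12, 6, 2.0f);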
Table 19.1  CUDA Terms to GPU's Hardware Components Equivalence Mapping

CUDA Term | Definition                                              | Equivalent GPU Hardware Component
Kernel    | Parallel code in the form of a function to be run on GPU | Not applicable
Thread    | An instance of the kernel on the GPU                     | GPU/CUDA processor core
Block     | A group of threads assigned to a particular SM           | CUDA multiprocessor (SM)
Grid      | The GPU                                                  | GPU
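A small, hedged illustration of the hardware side of this mapping: the standard CUDA runtime call cudaGetDeviceProperties reports the number of SMs (to which blocks are assigned) and related limits. The program below only queries device 0; the printed wording is ours.

    // Hedged sketch: inspecting the hardware components named in Table 19.1.
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);          // device 0

        printf("GPU (the grid runs here):     %s\n", prop.name);
        printf("SMs (blocks map to these):    %d\n", prop.multiProcessorCount);
        printf("Warp size (threads per warp): %d\n", prop.warpSize);
        printf("Max threads per block:        %d\n", prop.maxThreadsPerBlock);
        return 0;
    }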


[Figure: side-by-side silicon layouts; the CPU devotes most of its area to control logic and cache alongside a few ALUs, while the GPU devotes most of its area to a large array of ALUs with comparatively little control logic and cache; each chip is backed by its own DRAM]

Figure 19.2 CPU vs. GPU Silicon Area/Transistor Dedication


[Figure: theoretical GFLOPS (0 to 5500) versus time (Sep-02 through Aug-13) for four series: NVIDIA GPU single precision, NVIDIA GPU double precision, Intel CPU single precision, and Intel CPU double precision]

Figure 19.3 Floating-Point Operations per Second for CPU and GPU
GPU Architecture Overview

The historical evolution can be divided into three major phases:

 The first phase covers the early 1980s to the late 1990s, when the GPU was composed of fixed, nonprogrammable, specialized processing stages.
 The second phase covers the iterative modification of the resulting Phase I GPU architecture from a fixed, specialized hardware pipeline to a fully programmable processor (early to mid-2000s).
 The third phase covers how the GPU/GPGPU architecture makes an excellent and affordable, highly parallelized SIMD coprocessor for accelerating the run times of some nongraphics-related programs, along with how a GPGPU language maps to this architecture.


[Figure: full-chip layout of the NVIDIA Fermi GPU, with a host interface and the GigaThread scheduler, the streaming multiprocessors arranged around a shared L2 cache, and six DRAM interfaces along the chip edges]

Figure 19.4 NVIDIA Fermi Architecture
[Figure: a single streaming multiprocessor (SM) containing an instruction cache, two warp schedulers each with a dispatch unit, a 32k x 32-bit register file, 32 CUDA cores (each with a dispatch port, operand collector, FP unit, INT unit, and result queue), 16 load/store (Ld/St) units, 4 special function units (SFUs), an interconnect network, 64-kB shared memory/L1 cache, and a uniform cache]

Figure 19.5 Single SM Architecture
[Figure: two warp schedulers, each with its own instruction dispatch unit, issuing instructions from different warps over time; for example, warp 8 instruction 11 alongside warp 9 instruction 11, then warp 2 instruction 42 alongside warp 3 instruction 33, warp 14 instruction 95 alongside warp 15 instruction 95, and so on]

Figure 19.6 Dual Warp Schedulers and Instruction Dispatch Units Run Example
+ CUDA Cores

 The NVIDIA GPU processor cores are also known as CUDA cores
 There are a total of 32 CUDA cores dedicated to each SM in the Fermi architecture
 Each CUDA core has two separate pipelines, or data paths (see the sketch below):
   An integer (INT) unit pipeline, capable of 32-bit, 64-bit, and extended-precision integer and logic/bitwise operations
   A floating-point (FP) unit pipeline, which can perform a single-precision FP operation, while a double-precision FP operation requires two CUDA cores
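A hedged sketch of what the two pipelines see in practice: the index arithmetic in the kernels below exercises the INT unit, the multiply-add exercises the FP unit, and, per the slide, the double-precision variant ties up two CUDA cores per operation on Fermi-class hardware, roughly halving peak throughput. The kernel names are illustrative.

    // Hedged sketch: the same a*x + y computation in single and double precision.
    __global__ void axpy_f32(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // integer (INT) pipeline work
        if (i < n) y[i] = a * x[i] + y[i];              // single-precision FP pipeline
    }

    __global__ void axpy_f64(int n, double a, const double *x, double *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];              // double precision: two CUDA cores per operation
    }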


Table 19.2  GPU's Memory Hierarchy Attributes

Memory Type | Relative Access Times                                   | Access Type | Scope                  | Data Lifetime
Registers   | Fastest. On-chip                                        | R/W         | Single thread          | Thread
Shared      | Fast. On-chip                                           | R/W         | All threads in a block | Block
Local       | 100× to 150× slower than shared & register. Off-chip    | R/W         | Single thread          | Thread
Global      | 100× to 150× slower than shared & register. Off-chip    | R/W         | All threads & host     | Application
Constant    | 100× to 150× slower than shared & register. Off-chip    | R           | All threads & host     | Application
Texture     | 100× to 150× slower than shared & register. Off-chip    | R           | All threads & host     | Application
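A hedged sketch tying the table's memory types to their CUDA C declarations; the kernel, symbol names, and block size of 256 are illustrative assumptions.

    // Hedged sketch: registers, shared, constant, and global memory in one kernel.
    __constant__ float coeff[16];         // constant memory: read-only on the device,
                                          // written by the host, application lifetime

    __global__ void smooth(const float *in, float *out, int n)  // in/out: global memory
    {
        __shared__ float tile[256];       // shared memory: one copy per block, block lifetime

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v;                          // register: private to this thread, thread lifetime

        if (i < n) tile[threadIdx.x] = in[i];
        __syncthreads();                  // make the block's shared-memory writes visible

        if (i < n) {
            v = tile[threadIdx.x] * coeff[0];
            out[i] = v;                   // result written back to global memory
        }
    }
    // Host side (illustrative): cudaMemcpyToSymbol(coeff, h_coeff, sizeof(coeff)); before launch.
    // Large per-thread arrays that spill out of registers land in "local" memory, which
    // despite its name is off-chip and as slow as global memory (see Table 19.2).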


[Figure: the (device) grid containing two blocks, Block(0,0) and Block(1,0); each block has its own shared memory and per-thread registers for its threads, Thread(0,0) and Thread(1,0); all threads and the host access global memory and constant memory]

Figure 19.8 CUDA Representation of a GPU's Basic Architecture.
The example GPU shown has two SMs and two CUDA cores per SM.
EU: Execution Unit

[Figure: an Intel Gen8 execution unit (EU) with instruction fetch, a thread arbiter feeding seven superscalar pipelines, and functional units including a send unit, a branch unit, and two SIMD FPUs]

Figure 19.9 Intel Gen8 Execution Unit
[Figure: an Intel Gen8 subslice containing 8 EUs, an instruction cache, a local thread dispatcher, a sampler with its own L1 and L2 sampler caches, and a data port]

Figure 19.10 Intel Gen8 Subslice
[Figure: an Intel Gen8 slice made up of three 8-EU subslices (24 EUs total) plus fixed-function units, with shared slice resources: function logic, an L3 data cache, and shared local memory]

Figure 19.11 Intel Gen8 Slice
[Figure: an Intel Core M processor SoC combining Intel Processor Graphics Gen8 (a 24-EU slice with fixed-function units, L3 data cache, atomics/barriers, and shared local memory) with two CPU cores and a system agent (display controller, memory controller, PCIe); the graphics connect through the GTI, and all agents share LLC cache slices over the SoC ring interconnect]

Figure 19.12 Intel Core M Processor SoC


+ Summary
Chapter 19: General-Purpose Graphic Processing Units

 CUDA basics
 GPU versus CPU
   Basic differences between CPU and GPU architectures
   Performance and performance-per-watt comparison
 GPU architecture overview
   Baseline GPU architecture
   Full chip layout
   Streaming multiprocessor architecture details
   Importance of knowing and programming to your memory types
 Intel's Gen8 GPU
