William Stallings
Computer Organization and Architecture
10th Edition
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Chapter 19
General-Purpose Graphic Processing Units
Compute Unified Device Architecture (CUDA)
A parallel computing platform and programming model created by NVIDIA
and implemented by the graphics processing units (GPUs) that it produces
CUDA C is a C/C++ based language
A program can be divided into three general sections:
Code to be run on the host (CPU)
Code to be run on the device (GPU)
Code related to the transfer of data between the host and the device
The data-parallel code to be run on the GPU is called a kernel
A kernel typically has few to no branching statements
Branching statements in a kernel result in serial execution of the threads on the GPU hardware (see the divergence sketch after Figure 19.6)
A thread is a single instance of the kernel function
The programmer defines the number of threads launched when the kernel function is called (a minimal sketch follows this list)
The total number of threads is typically in the thousands, both to maximize utilization of the GPU processor cores and to maximize the available speedup
The programmer specifies how these threads are to be bundled
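As a minimal sketch of the three sections (host code, device kernel, and host-device transfer), not taken from the text; the kernel name vecAdd and the sizes chosen are illustrative:

```c
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Device code: the kernel. Each thread is one instance of this function
// and computes a single element, so the only branch is a bounds check.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;                 // ~1M elements: thousands of threads
    size_t bytes = n * sizeof(float);

    // Host code: allocate and fill CPU buffers
    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Transfer code: allocate GPU buffers and copy the inputs over
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch: the programmer bundles the threads into blocks here
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);   // results back
    printf("c[0] = %f\n", h_c[0]);                         // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```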
[Figure 19.1 here: a grid composed of a 3 × 2 arrangement of blocks, Block(0,0) through Block(2,1), with each block composed of a 4 × 3 arrangement of threads, Thread(0,0) through Thread(3,2)]
Figure 19.1 Relationship Among Threads, Blocks, and a Grid
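As a hypothetical illustration of the figure's layout in code (the kernel whoAmI is invented for this sketch), the grid and block dimensions below reproduce the 3 × 2 grid of 4 × 3 blocks:

```c
#include <cuda_runtime.h>
#include <stdio.h>

// Each of the 72 threads prints its block, thread, and global coordinates.
__global__ void whoAmI(void)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. 11
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // 0 .. 5
    printf("block(%d,%d) thread(%d,%d) -> global(%d,%d)\n",
           blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y, x, y);
}

int main(void)
{
    dim3 grid(3, 2);    // Block(0,0) .. Block(2,1), as in Figure 19.1
    dim3 block(4, 3);   // Thread(0,0) .. Thread(3,2) within each block
    whoAmI<<<grid, block>>>();
    cudaDeviceSynchronize();   // wait for the device-side printf output
    return 0;
}
```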
Table 19.1 CUDA Terms to GPU’s Hardware Components Equivalence Mapping

CUDA Term | Definition | Equivalent GPU Hardware Component
Kernel | Parallel code in the form of a function to be run on the GPU | Not applicable
Thread | An instance of the kernel on the GPU | GPU/CUDA processor core
Block | A group of threads assigned to a particular SM | CUDA multiprocessor (SM)
Grid | The GPU | GPU
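This mapping can be inspected at run time. A small sketch (not from the text) that queries the hardware side of Table 19.1 through the CUDA runtime:

```c
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0

    // Blocks are assigned to SMs; threads run on CUDA cores in warps.
    printf("Device:            %s\n", prop.name);
    printf("SMs (multiproc.):  %d\n", prop.multiProcessorCount);
    printf("Warp size:         %d threads\n", prop.warpSize);
    printf("Max threads/block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```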
[Figure 19.2 here: the CPU devotes much of its silicon area to control logic and cache with a few ALUs, while the GPU devotes most of its area to ALUs; each is attached to its own DRAM]
Figure 19.2 CPU vs. GPU Silicon Area/Transistor Dedication
[Figure 19.3 here: a plot of theoretical GFLOPS versus time (Sep 2002 to Aug 2013, vertical axis 500 to 5500) with four series: NVIDIA GPU single precision, NVIDIA GPU double precision, Intel CPU single precision, and Intel CPU double precision]
Figure 19.3 Floating-Point Operations per Second for CPU and GPU
GPU Architecture Overview
The historical evolution can be divided into three major phases:
The first phase covers the early 1980s to the late 1990s, when the GPU was composed of fixed, nonprogrammable, specialized processing stages
The second phase covers the iterative modification of the resulting Phase I GPU architecture from a fixed, specialized hardware pipeline to a fully programmable processor (early to mid-2000s)
The third phase covers how the GPU/GPGPU architecture makes an excellent and affordable, highly parallelized SIMD coprocessor for accelerating the run times of some nongraphics-related programs, along with how a GPGPU language maps to this architecture
[Figure 19.4 here: the Fermi chip layout, with a host interface, a GigaThread scheduler, the SMs arranged around a shared L2 cache, and six DRAM interfaces]
Figure 19.4 NVIDIA Fermi Architecture
[Figure 19.5 here: a single SM containing an instruction cache; two warp schedulers, each with a dispatch unit; a register file (32k × 32-bit); 32 CUDA cores, each with a dispatch port, operand collector, FP unit, INT unit, and result queue; 16 load/store (Ld/St) units; four special function units (SFUs); an interconnect network; 64-kB shared memory/L1 cache; and a uniform cache]
Figure 19.5 Single SM Architecture
[Figure 19.6 here: two warp schedulers, each feeding an instruction dispatch unit; over time, each scheduler interleaves instructions from several warps (warps 8, 2, and 14 on one scheduler; warps 9, 3, and 15 on the other)]
Figure 19.6 Dual Warp Schedulers and Instruction Dispatch Units Run Example
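Warps matter for the branching caveat noted earlier: a scheduler issues one instruction to an entire 32-thread warp, so a data-dependent branch that splits a warp forces the two paths to execute serially. A hypothetical sketch (both kernel names are invented for this illustration):

```c
// Divergent: threads in the same 32-thread warp take different paths,
// so the hardware serializes the two sides of the branch.
__global__ void divergent(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (threadIdx.x % 2 == 0)         // splits every warp
        out[i] = i * 2.0f;
    else
        out[i] = i * 0.5f;
}

// Uniform: branching on the warp index keeps each warp on one path,
// so no divergence occurs (the warp size on NVIDIA GPUs is 32).
__global__ void uniform(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if ((threadIdx.x / 32) % 2 == 0)  // whole warps agree on the branch
        out[i] = i * 2.0f;
    else
        out[i] = i * 0.5f;
}
```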
CUDA Cores
The NVIDIA GPU processor cores are also known as CUDA cores
There are a total of 32 CUDA cores dedicated to each SM in the Fermi architecture
Each CUDA core has two separate pipelines, or data paths:
An integer (INT) unit pipeline, capable of 32-bit, 64-bit, and extended-precision integer and logic/bitwise operations
A floating-point (FP) unit pipeline, which can perform a single-precision FP operation, while a double-precision FP operation requires two CUDA cores
Table 19.2 GPU’s Memory Hierarchy Attributes

Memory Type | Relative Access Times | Access Type | Scope | Data Lifetime
Registers | Fastest; on-chip | R/W | Single thread | Thread
Shared | Fast; on-chip | R/W | All threads in a block | Block
Local | 100× to 150× slower than shared & registers; off-chip | R/W | Single thread | Thread
Global | 100× to 150× slower than shared & registers; off-chip | R/W | All threads & host | Application
Constant | 100× to 150× slower than shared & registers; off-chip | R | All threads & host | Application
Texture | 100× to 150× slower than shared & registers; off-chip | R | All threads & host | Application
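As a sketch of how the fastest two levels of Table 19.2 are used from CUDA C (illustrative, not from the text): the kernel below stages global-memory data into per-block shared memory and reduces it there, while the scalar locals live in registers.

```c
#define BLOCK 256

// Each block sums BLOCK elements of the global input array.
// Assumes the input length is a multiple of BLOCK.
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float tile[BLOCK];       // shared: all threads in the block

    int tid = threadIdx.x;              // tid and i live in registers
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[tid] = in[i];                  // global -> shared (one slow read)
    __syncthreads();                    // wait for the whole block

    // Tree reduction entirely in fast shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            tile[tid] += tile[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = tile[0];      // shared -> global (one slow write)
}
```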
[Figure 19.8 here: a device grid containing two blocks; each block has its own shared memory and per-thread registers for its two threads, while global memory and constant memory are shared by all threads and are also accessible to the host]
Figure 19.8 CUDA Representation of a GPU’s Basic Architecture. The example GPU shown has two SMs and two CUDA cores per SM.
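The constant and global spaces in the figure are both populated from the host. A hypothetical sketch using constant memory (the symbol coeff and kernel scale are invented for this illustration):

```c
#include <cuda_runtime.h>

// Constant memory: read-only on the device, written by the host,
// and alive for the whole application (Table 19.2 / Figure 19.8).
__constant__ float coeff[4];

__global__ void scale(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = coeff[0] * in[i] + coeff[1];   // every thread reads coeff
}

int main(void)
{
    float h_coeff[4] = {2.0f, 1.0f, 0.0f, 0.0f};

    // The host writes constant memory by symbol, not through a pointer.
    cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));

    // ... allocate global-memory buffers with cudaMalloc, copy inputs
    // with cudaMemcpy, and launch scale<<<blocks, threads>>>(...) ...
    return 0;
}
```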
[Figure 19.9 here: an execution unit (EU) with an instruction fetch stage, a thread arbiter, seven superscalar pipelines, and send, branch, and two SIMD FPU function units]
Figure 19.9 Intel Gen8 Execution Unit
[Figure 19.10 here: a subslice of 8 EUs with an instruction cache, a local thread dispatcher, a sampler with its own L1 and L2 sampler caches, and a data port]
Figure 19.10 Intel Gen8 Subslice
[Figure 19.11 here: a slice of 24 EUs, made up of three 8-EU subslices, together with fixed function units, function logic, an L3 data cache, and shared local memory]
Figure 19.11 Intel Gen8 Slice
[Figure 19.12 here: an Intel Core M processor SoC combining two CPU cores, the Gen8 graphics slice (24 EUs in three subslices), a system agent containing display and memory controllers and PCIe, and LLC cache slices, all joined by the SoC ring interconnect; the graphics attach through the GTI, and the slice's L3 data cache and shared local memory support atomics and barriers]
Figure 19.12 Intel Core M Processor SoC
Summary
Chapter 19: General-Purpose Graphic Processing Units
CUDA basics
GPU versus CPU
Basic differences between CPU and GPU architectures
Performance and performance per watt comparison
GPU architecture overview
Baseline GPU architecture
Full chip layout
Streaming multiprocessor architecture details
Importance of knowing and programming to your memory types
Intel’s Gen8 GPU