CKV
Advanced VLSI Architecture
MEL G624
Lecture 29: Data Level Parallelism
Graphics Processing Units
Basic Idea
Heterogeneous execution model
CPU is the host, GPU is the device
Develop a C-like programming language for the GPU
Compute Unified Device Architecture (CUDA) from NVIDIA
OpenCL is a vendor-independent alternative
Unify all forms of GPU parallelism as the CUDA thread
Programming model is "Single Instruction, Multiple Thread" (SIMT)
NVIDIA GPU Architecture
Similarities to vector machines:
Works well with data‐level parallel problems
Scatter‐gather transfers from memory into local store
Mask registers
Large register files
Differences:
No scalar processor (scalar work is left to the host)
Uses multithreading to hide memory latency
Has many functional units, as opposed to a few deeply pipelined units like a vector processor
Example
Multiply two vectors of length 8192
Code that works over all elements is the grid
Thread blocks break this down into manageable sizes
512 elements per block
SIMD instruction executes 32 elements (per thread) at a time
Thus grid size = 16 blocks
Block is analogous to a strip-mined vector loop (vector length = 32 elements)
Block is assigned to a multithreaded SIMD processor by the thread block scheduler
(Figure: the grid for A[i] = B[i] * C[i], i = 0..8191, decomposed into 16 thread blocks.
Block 0 covers A[0..511]: SIMD thread 0 computes A[0]=B[0]*C[0] through A[31]=B[31]*C[31], SIMD thread 1 ends at A[63]=B[63]*C[63], ..., SIMD thread 15 ends at A[511]=B[511]*C[511].
Block 15 covers the last 512 elements, ending with A[8191]=B[8191]*C[8191].)
(Figure: Fermi GTX 480. The Thread Block Scheduler assigns each thread block to a multithreaded SIMD processor (SM); a per-SM thread scheduler then dispatches that block's SIMD threads.)
Terminology
Threads of SIMD instructions
Each has its own PC
Thread scheduler uses a scoreboard to dispatch ready instructions
No data dependences between different SIMD threads!
Keeps track of up to 48 threads of SIMD instructions
Hides memory latency
Thread block scheduler schedules blocks to SIMD processors
Within each SIMD processor:
16 SIMD lanes
Wide and shallow compared to vector processors
Thread Scheduler
GPU Register Example
NVIDIA GPU has 32,768 registers
Divided logically into lanes (2048 32-bit registers per lane)
Each SIMD thread is limited to 64 vector-like registers
SIMD thread has up to:
64 vector registers of 32 32-bit elements, or
32 vector registers of 32 64-bit elements
Each CUDA thread gets one element of each vector register
How many registers do the CUDA threads of a thread block use?
Multi-Threaded SIMD Processor
(Figure: a multithreaded SIMD processor.
Each SIMD thread's instruction works on 32 elements, spread across the SIMD lanes.
The scheduler keeps track of 48 independent SIMD threads.
Each CUDA thread corresponds to one element (one lane position).
2048 32-bit registers per lane are dynamically allocated to SIMD threads.)
Multithreading
(Figure: an SM exploits multithreading, MIMD, SIMD, and ILP together.)
Programming the GPU
Challenges for the GPU programmer:
Getting good performance
Coordinating the scheduling of computation on the system processor and the GPU
Managing the transfer of data between system memory and GPU memory
Programming the GPU
To distinguish between functions for the GPU (device) and the system processor (host), CUDA uses qualifiers:
__device__ or __global__ => GPU (device)
__host__ => system processor (host)
Variables declared in __device__ or __global__ functions are allocated to GPU memory
Function call: name<<<dimGrid, dimBlock>>>(...parameter list...)
threadIdx: identifier for a thread within its block
blockIdx: identifier for a block within the grid
blockDim: number of threads per block
Programming the GPU
//Invoke DAXPY
daxpy(n, 2.0, x, y);

//DAXPY in C
void daxpy(int n, double a, double* x, double* y) {
  for (int i = 0; i < n; i++)
    y[i] = a*x[i] + y[i];
}

//Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

//DAXPY in CUDA
__global__
void daxpy(int n, double a, double* x, double* y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}
Thank You for Attending