CUDA Execution model
Understanding the Device
- Understanding the device internals - what the device looks like and which parts of the device
affect the execution of certain types of kernels - lets us come to logical conclusions about
performance beforehand.
- The GPU architecture is built around a scalable array of Streaming Multiprocessors (SMs), and
GPU hardware parallelism is achieved by replicating this architectural building block.
- The SM (Streaming Multiprocessor) building block contains:
  * IC - Instruction Cache
  * Warp Scheduler
  * Dispatch Unit
  * Register File
  * Cores
  * LD/ST (load/store) units
  * SFUs (special function units)
  * Interconnect Network
  * 64KB Shared Memory + L1 Cache
  * Uniform Cache
Volta adds more specialized cores per SM - Tensor Cores, INT32 cores, FP32 and FP64 cores.
- Computer architecture classification (Flynn's taxonomy):
  SISD (Single Instruction, Single Data)
  SIMD (Single Instruction, Multiple Data)
  MISD (Multiple Instruction, Single Data - fault tolerant, e.g. space shuttle flight control)
  MIMD (Multiple Instruction, Multiple Data)
CUDA follows a different model - SIMT (Single Instruction, Multiple Threads): a single instruction is issued to multiple threads.
CUDA design
- A thread block executes on a single SM. Multiple thread blocks (TBs) may or may not be able to
execute simultaneously on the same SM, depending on the resource limitations of that SM.
- One thread block cannot be split across multiple SMs.
WARPS
We cannot execute, say, a million threads at the same time.
So the thread block is divided into smaller units called warps, each containing 32 consecutive threads.
number of warps per block = block size / warp size
e.g. 512 / 32 = 16 ==> the warp is the basic unit of execution in an SM.
Once a thread block is scheduled to an SM, the threads in the TB are further partitioned into warps.
The warp index within a block is calculated as threadIdx.x / 32.
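As an illustration (not from the original notes - the kernel name and launch configuration are mine),
a minimal CUDA sketch that prints each thread's block-local warp index and lane:

#include <cstdio>

// Prints, for each thread, its global id, its warp index within the block, and its lane.
__global__ void print_warp_ids(void)
{
    int gid     = blockIdx.x * blockDim.x + threadIdx.x; // global thread id (1D launch assumed)
    int warp_id = threadIdx.x / 32;                      // warp index within the block
    int lane_id = threadIdx.x % 32;                      // position of the thread inside its warp
    printf("gid %3d | warp %d | lane %2d\n", gid, warp_id, lane_id);
}

int main(void)
{
    print_warp_ids<<<1, 64>>>();   // one block of 64 threads => 64 / 32 = 2 warps
    cudaDeviceSynchronize();       // wait for device printf output before exiting
    return 0;
}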
Warp Divergence
A warp is a single execution unit in the GPU - what if we write code that forces some threads in
the warp to execute different instructions?
SIMT issues a single instruction in principle - so under what conditions do threads of one warp
need different instructions?
if (tid < 16) { ... }
else { ... }
Condition checks like this, where threads of the same warp take different paths, cause a performance penalty.
What causes warps to diverge?
Multiple paths of execution within the same warp. Condition checks in which all threads of a warp
take the same path will NOT induce warp divergence.
ex: if (tid / 32 < 1) ==> every thread of a given warp evaluates this the same way, so it does not cause warp divergence (see the sketch below).
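A hedged illustration (kernel names are mine): the first kernel forces divergence within every warp
because the branch depends on tid % 2, while the second branches at warp granularity so each warp
follows a single path.

// Divergent: even and odd threads of the SAME warp take different paths.
__global__ void divergence_kernel(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0) out[tid] = 100.0f;
    else              out[tid] = 200.0f;
}

// No divergence: the condition is uniform across each warp (all 32 lanes agree).
__global__ void no_divergence_kernel(float *out)
{
    int tid     = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_id = tid / 32;
    if (warp_id % 2 == 0) out[tid] = 100.0f;
    else                  out[tid] = 200.0f;
}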
Warp divergence and the resulting performance degradation can be measured using nvprof - how is
branch efficiency calculated?
branch efficiency = 100 * (total branches - divergent branches) / total branches
nvprof --metrics branch_efficiency ./fname.out
nvcc optimizes short divergent code paths using branch predication.
This compiler optimization can be disabled by compiling with nvcc -G.
Resource Partitioning and latency hiding
- The GPU schedules blocks onto an SM depending on the resource limitations of that SM. Let us
understand what those resource limitations are and how exactly the scheduling works.
- The local execution context of a warp mainly consists of the following resources:
program counters, registers and shared memory.
- The execution context of each warp processed by an SM is maintained on-chip during the entire
lifetime of the warp. Thus switching from one execution context to another has no cost.
This zero-cost context switching is achieved by making GPU threads very lightweight and limiting
their capabilities.
- Of the above resources, the number of registers and the amount of shared memory can be directly
controlled by the programmer. Each SM has a set of 32-bit registers stored in a register file.
This register file is partitioned among threads, and there is a fixed amount of shared memory that
is partitioned among thread blocks.
- The number of warps that can simultaneously reside on an SM for a given kernel depends on the
number of registers and the amount of shared memory available on the SM.
So if each thread uses more registers, fewer warps can reside on the SM; likewise, if each thread
block uses more shared memory, fewer thread blocks of the grid can reside on that SM.
- The maximum registers and shared memory per SM vary for devices of different compute
capabilities and architectures, so we need to optimize the kernel's performance for the device it
is going to execute on.
The programmer can control registers and shared memory - the register file is partitioned among
threads and the shared memory is partitioned among thread blocks.
Warp and Block Categories
A thread block is called active when compute resources such as registers and shared memory have
been allocated to it.
Warps in such an active block are called active warps.
Active warps are further categorized as: selected, stalled, eligible.
The warp scheduler on an SM selects active warps on each execution cycle and dispatches them to
the execution units.
- A warp that is actively executing is called a selected warp.
- A warp that is ready for execution but not currently executing is an eligible warp.
- A warp that is not ready for execution is a stalled warp.
- A warp is eligible for execution if both of the following conditions are met: 32 CUDA cores are
available for execution, and all arguments of the current instruction are ready.
- If a warp stalls, the warp scheduler picks an eligible warp to execute in its place.
- Compute resource allocation for a warp is static and is kept on-chip during the entire warp
lifetime, therefore warp context switching costs no execution cycles.
The SM is a resource-constrained environment - excessive usage of resources such as registers and
shared memory by a single thread causes the SM to have only a small number of active warps
residing on it at a given time, resulting in underutilization of compute resources.
By understanding the concepts of latency and latency hiding we shall learn why enough active warps
are needed to utilize the computational power optimally.
What is latency (instruction latency)? The number of clock cycles between when an instruction is
issued and when it completes => applies to both arithmetic and memory operations.
On a chip there are two kinds of units - the Control Unit (CU) and the Arithmetic Logic Unit (ALU).
The control unit issues instructions to the ALU, and the ALU takes some clock cycles to complete
each instruction. What do these instructions look like?
1. Add  2. Max  3. Min  4. MAD (multiply-and-add)  5. Mul  6. Div (> 200 clock cycles)
7. Remainder  8. Memory load  9. Memory store
Also, the different types of memory in the GPU have different latencies - GLOBAL, LOCAL, SHARED,
TEXTURE, CONSTANT.
e.g. L1 cache - latency of 194 cycles, DRAM - 350 cycles, shared memory - 28 cycles.
* CUDA latency hiding mechanisms
- At the instruction level, arithmetic/logic and memory instructions can be issued in parallel.
- If a single warp holds the SM for a long time, that means performance degradation.
- The execution context of a warp stays valid for the entire lifetime of the warp.
- Switching between warps is zero cost.
- So, if one warp stalls and a large number of other warps are eligible, the warp scheduler can
dispatch another eligible warp to the cores.
- Basically, a core whose warp has stalled can continue executing a new warp, and when the results
for the previous warp are ready, the previous warp can resume its execution.
CUDA Applications are either compute bound or memory bound
Occupancy
- Ratio of active warps to the maximum number of warps allowed per SM.
- The effectiveness of latency hiding is measured through this ratio.
- The maximum number of warps allowed per SM is given in the device documentation.
- On a single core, instructions are executed sequentially and parallelism is at the instruction level.
- On the other hand, the active warp count per SM depends on the resources available on the device
and the resource consumption of your kernel.
- An Excel sheet with formulas (the CUDA Occupancy Calculator) is available to compute occupancy -
one enters the kernel's resource usage, such as the shared memory amount.
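A quick worked example (the 64-warp limit here is an assumption for illustration - check your
device's documentation): if the SM allows at most 64 resident warps and the kernel's register and
shared memory usage limits it to 48 active warps, then occupancy = 48 / 64 = 0.75, i.e. 75%.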
Profiling with NVPROF
- Profiling gives us insight into our CUDA program's execution - the variables are different grid
configurations and different resource usages.
nvprof is used for profile-driven optimization - i.e. change the resource usage and grid
configuration through command line arguments and measure the performance.
nvprof has 3 modes of operation:
- summary mode
- gpu trace mode
- api trace mode
nvcc -g (host) and -G (device) are used to disable compiler optimizations for debugging.
- In summary mode (the default) we can see:
  - GPU activity
  - kernel executions
  - memory copy functions etc.
Metric examples - gld_efficiency, achieved_occupancy.
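Example invocations for the three modes and for metrics (the executable name ./app is just a placeholder):
nvprof ./app                                               (summary mode, the default)
nvprof --print-gpu-trace ./app                             (gpu trace mode)
nvprof --print-api-trace ./app                             (api trace mode)
nvprof --metrics gld_efficiency,achieved_occupancy ./app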
Thread Synchronization
In parallel programming, synchronization is a vital aspect. Here we are working with more than one
thread, and often we need to order the operations performed by these threads.
The CUDA primitives used are:
1. cudaDeviceSynchronize() - global synchronization between host and device
2. __syncthreads() - synchronization between threads of a block
Usually a CUDA program consists of asynchronous calls to launch kernels from the host.
Asynchronous means non-blocking, i.e. after launching the kernel on the device, the host program
continues its execution. To make the host wait until device execution finishes - i.e. to make the
call synchronous - we use the cudaDeviceSynchronize() function after launching the kernel.
The __syncthreads() function is used for synchronization within a block, across all its warps, and
it can be called only from device code. It forces all threads in that block to wait until every
thread in the block has reached that point in the code.
Also, global and shared memory writes made before the barrier are visible to all threads of the
block after the synchronization point.
As we discussed earlier, CUDA follows the SIMT execution paradigm. Within a warp, all 32 threads
execute the same instruction, but between warps there are no such guarantees. So let's say warp 2
starts execution first; this execution is independent of the other warps.
But if our kernel has a __syncthreads() call in the code, then warp 2 has to wait at that
statement until all other warps reach that point.
So all threads in the block reach the synchronization point, and only then can any warp of the
block execute further instructions.
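A minimal sketch (names and sizes are mine, not from the original notes) showing both primitives:
__syncthreads() guards a shared-memory exchange inside the block, and cudaDeviceSynchronize()
makes the host wait for the asynchronous kernel launch.

#include <cstdio>

__global__ void sync_demo(int *data)
{
    __shared__ int smem[64];
    int tid = threadIdx.x;
    smem[tid] = data[tid];                    // each thread writes one element to shared memory
    __syncthreads();                          // barrier: every write above is now visible to the whole block
    data[tid] = smem[(tid + 1) % blockDim.x]; // safe to read an element written by another warp
}

int main(void)
{
    const int n = 64;
    int h[n];
    for (int i = 0; i < n; i++) h[i] = i;

    int *d;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    sync_demo<<<1, n>>>(d);                   // kernel launch is asynchronous w.r.t. the host
    cudaDeviceSynchronize();                  // host blocks here until the device has finished

    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h[0] = %d (expected 1)\n", h[0]);
    cudaFree(d);
    return 0;
}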
The reduction problem - applying any commutative and associative operation across a vector is
known as the reduction problem.
The sum reduction problem in the parallel paradigm: neighbored pair approach
We first partition the input vector into smaller data chunks, each data chunk is summed up
separately, and finally we add these partial sums to get the total sum. The diagram demonstrates
exactly the steps we are going to follow. In the implementation we use the thread block size as
our data chunk size, so each thread block is responsible for adding up its data chunk. After a
thread block finishes its accumulation,
the result is stored in a separate array; each slot in this new array is populated with the
partial sum of one thread block.
See the Section 2 sync reduction code - we use __syncthreads() in each iteration of the for loop.
This is because warps can execute in any order.
So when we have a thread block with multiple warps, we have no other way of guaranteeing this
ordering restriction.
This is where the __syncthreads() function comes into action: after each iteration, we call
__syncthreads().
We call __syncthreads() to make sure that all threads in a block reach one point before any thread
of that block moves forward.
But if we use __syncthreads() inside a condition check, that condition might not allow some of the
threads in the thread block to reach the __syncthreads() statement.
This results in undefined behavior, so be mindful when using __syncthreads(), especially in the
presence of condition checks.
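A minimal sketch of the kind of kernel described above (the name is mine; the host side allocates
the input and a per-block partial-sum array and does the final accumulation on the CPU, as
described in the notes). Note that __syncthreads() sits outside the if, so every thread of the
block reaches the barrier in every iteration.

// Naive neighbored-pair reduction: in iteration k only threads whose id is a
// multiple of 2^(k+1) do work, which spreads the active threads across all warps.
__global__ void reduce_neighbored(int *g_idata, int *g_odata, unsigned int n)
{
    unsigned int tid = threadIdx.x;
    unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int *idata = g_idata + blockIdx.x * blockDim.x;    // this block's data chunk

    if (gid >= n) return;                              // assumes n is a multiple of blockDim.x,
                                                       // so the whole block returns together

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
    {
        if (tid % (2 * stride) == 0)                   // divergent: active threads sit in every warp
            idata[tid] += idata[tid + stride];
        __syncthreads();                               // outside the if: every thread reaches the barrier
    }

    if (tid == 0)
        g_odata[blockIdx.x] = idata[0];                // partial sum of this thread block
}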
Warp divergence in the parallel reduction example
As already explained, if the threads within a warp execute along different paths, there is warp
divergence in the kernel, and the device will serialize these paths and execute them one after the
other, which results in
heavy performance penalties. Unfortunately our previous parallel reduction implementation also has
warp divergence. For example, in this diagram you can see 128 threads belonging to a single thread
block; each slot in
the table belongs to one thread, and the thread ids are written in each slot as well. In our
previous implementation, in the first iteration only threads with an even thread id perform any
effective work.
So only 50% of the threads perform any work, and these threads are spread across all four warps of
the thread block.
In the second iteration only threads whose thread id is a multiple of 4 perform any summation, so
only a quarter of the threads perform the summation, and those threads are also spread across
all the warps. So in each iteration, other than the last couple, we have warp divergence in every
warp of the thread block. In the final two iterations,
only a couple of warps show warp divergence, since the threads performing the sum are spread
across only a couple of warps.
So our previous reduction implementation has a lot of divergent code. Luckily there is more than
one way of avoiding warp divergence in this type of kernel. Here
we are going to implement the reduction algorithm in two new ways to avoid warp divergence. As the
first solution we will make neighboring threads perform the
effective work in each iteration.
In the second method we will use the interleaved pair approach as a solution for divergence in the
reduction algorithm. Let's look at each of these approaches now.
In the first approach we make sure that neighboring threads are the ones performing the summation.
For example, in the previous neighbored pair
approach, when we ran the algorithm on an 8-element block, in the first iteration
only threads with an even thread id performed the summation; threads with an odd thread id did not
add any values. In our new approach, however, we make sure that neighboring threads perform the
summation.
So in the corresponding first iteration, the first four consecutive threads will perform the
summation.
Let's see an example of this approach. Say we have a data block with 8 elements. In the first
iteration, the first four threads perform the
summation: thread T0 adds the first two elements and stores the result back to the first index, T1
adds the next two elements and stores the result back to the
second index, and so on. In the next iteration, consecutive threads are still the ones executing
the summation -
in this case T0 and T1. (In our previous implementation, T0 and T4 were the ones executing the
summation in the second iteration.)
In the third iteration only T0 executes the summation. If we use 128 as the data block size, our
thread block size will also be 128.
So in the first iteration the first 64 threads, or the first two warps, perform the summation and
the last two warps do nothing. But there is still no warp divergence, since within each warp all
the threads follow the
same path. In the next iteration only the first 32 threads, or the first warp, perform the
summation and the other three warps do nothing. In the 3rd iteration,
however, only the first 16 threads, or half of the first warp, perform the summation, so from this
iteration onwards there will be warp divergence.
But notice that here we have only one warp with divergence per iteration from this point onwards,
whereas in our initial neighbored pair approach
with a 128-element data block all four warps had warp divergence until the last two iterations.
Now let's see the implementation.
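Below is a hedged sketch of a kernel along the lines described in the following walkthrough
(kernel only; the name is mine, and the main function is assumed to match the previous neighbored
pairs implementation).

// Neighbored-pair reduction with forced neighboring threads: in each iteration the
// working threads are consecutive (index = 2 * stride * tid), so whole warps either
// work or idle, and divergence only appears once fewer than 32 threads remain active.
__global__ void reduce_neighbored_less(int *g_idata, int *g_odata, unsigned int n)
{
    unsigned int tid = threadIdx.x;
    unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int *idata = g_idata + blockIdx.x * blockDim.x;    // this block's data chunk

    if (gid >= n) return;                              // assumes n is a multiple of blockDim.x

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
    {
        unsigned int index = 2 * stride * tid;         // consecutive tids map to the working slots
        if (index < blockDim.x)
            idata[index] += idata[index + stride];
        __syncthreads();                               // all threads finish this iteration first
    }

    if (tid == 0)
        g_odata[blockIdx.x] = idata[0];                // partial sum of this thread block
}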
The main function for this implementation is almost the same as in the previous neighbored pairs
implementation, so I will not go through it.
Let's look at the kernel. In the kernel, we first set the local thread id value and then calculate
the global thread id value.
Then we set a local memory pointer with the proper offset for this thread block, so that the block
can access global memory through this local pointer.
Then we have the usual boundary check. After that we perform the usual iteration, multiplying the
offset by two in each iteration.
Now we need consecutive threads to perform the summation. For that we calculate an index value for
each thread in the block based on the thread id and the offset value.
We use this condition check to limit the threads that perform the summation. We also need all
threads in the block to finish one iteration before any thread in that block
moves on to the next one, so we have a __syncthreads() call here as well. After all iterations
end, the first element of the thread block's data chunk
holds the partial sum for this thread block, so we store it into our partial sum array. Now let's
run this implementation and check its validity.
In the output you can see that it prints that the GPU and CPU results are the same, so our
implementation is valid. Now let's move on to the next way of solving
warp divergence: the interleaved pair approach. In this approach we also force the summation
of elements to happen in consecutive threads, but the offset between the elements that are added
together in an iteration starts from the
reverse order compared to the previous approaches. For example, in the first
iteration we set the offset value to half of the block size.
So if we consider a data block with 8 elements, the offset value will be 4; hence in the first
iteration the first thread accumulates the first and fifth elements of the data block, the second
thread accumulates the
second and sixth elements, and so on. In the second iteration we halve the offset. So in the
second
iteration the offset value will be 2, hence the first thread accumulates the first and third
elements, and the second thread accumulates the second and fourth elements of the current data
block.
Notice that we are performing an in-place reduction, so the output of one iteration is the input
to the next iteration:
the first element going into the second iteration contains the sum of the first and fifth elements
of the original array,
and the third element contains the sum of the third and seventh elements of the original array.
After the second iteration the first element of the array
holds the sum of the first, third, fifth and seventh elements of the original array.
In the last iteration the offset is one, and once the offset reaches one we stop iterating;
the first thread accumulates the first and second elements of the current data
block. After this step, the first element of our array contains the sum of all 8 elements.
Now let's see the implementation. It is almost the same as the previous implementation. The
difference is that we initialize the offset value to half of the block size and then keep dividing
it by 2 in each iteration. In the condition check, since we need consecutive
threads to execute the accumulation steps, we check whether the thread id is
less than the offset. So in the first iteration only the first half of the thread block
performs the summation, in the next iteration only the
first quarter of the threads in the block perform the summation, and so on. Here also we need all
threads in a thread block to finish one iteration before moving on to the next one,
so I have used the __syncthreads() function here as well. Then we store
the block's sum into the partial sum array.
If you run this example, you will see that the GPU and CPU results are the same, so this
implementation is also a valid one.
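And a hedged sketch of the interleaved pair version described above (same caveats; the kernel name
is mine and the host code is assumed unchanged).

// Interleaved-pair reduction: the offset starts at half the block size and is halved
// each iteration; only threads with tid < offset work, so the active threads are
// always the first consecutive ones.
__global__ void reduce_interleaved(int *g_idata, int *g_odata, unsigned int n)
{
    unsigned int tid = threadIdx.x;
    unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int *idata = g_idata + blockIdx.x * blockDim.x;    // this block's data chunk

    if (gid >= n) return;                              // assumes n is a multiple of blockDim.x

    for (unsigned int offset = blockDim.x / 2; offset > 0; offset /= 2)
    {
        if (tid < offset)                              // first half, then first quarter, ...
            idata[tid] += idata[tid + offset];
        __syncthreads();                               // barrier before halving the offset
    }

    if (tid == 0)
        g_odata[blockIdx.x] = idata[0];                // partial sum of this thread block
}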