CUDA Execution model
Understanding the Device
- Understanding the device internals - what the device looks like and which parts of the device
affect the execution of certain types of kernels - lets us come to logical conclusions about
performance beforehand.
- The GPU architecture is built around a scalable array of Streaming Multiprocessors (SMs), and
GPU hardware parallelism is achieved by replicating this architectural building block.
- The SM (Streaming Multiprocessor) building block contains:
  * IC - Instruction Cache
  * Warp Scheduler
  * Dispatch Unit
  * Register File
  * Cores
  * LD/ST (load/store) units
  * SFUs (special function units)
  * Interconnect Network
  * 64KB Shared Memory + L1 Cache
  * Uniform Cache
Volta adds more specialized cores per SM - Tensor Cores, INT32 cores, FP32 and FP64 cores.
- Computer architecture classification (Flynn's taxonomy):
  SISD (Single Instruction, Single Data)
  SIMD (Single Instruction, Multiple Data)
  MISD (Multiple Instruction, Single Data - fault tolerant, e.g. space shuttle flight control)
  MIMD (Multiple Instruction, Multiple Data)
CUDA follows a different model - SIMT (Single Instruction, Multiple Threads): a single instruction is issued to multiple threads.
CUDA design
- A thread block executes on a single SM. Multiple thread blocks (TBs) may or may not be able to
execute simultaneously on the same SM, depending on the resource limitations of that SM.
- One thread block cannot be split across multiple SMs.
WARPS
We cannot execute, say, a million threads at the same time.
So the thread block is divided into smaller units called warps, each containing 32 consecutive threads.
number of warps per block = block size / warp size
e.g. 512 / 32 = 16 ==> the warp is the basic unit of execution in an SM.
Once a thread block is scheduled to an SM, the threads in the TB are further partitioned into warps.
The warp index within a block is calculated as threadIdx.x / 32.
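As an illustration (not from the original notes - the kernel name and launch configuration are mine),
a minimal CUDA sketch that prints each thread's block-local warp index and lane:

#include <cstdio>

// Prints, for each thread, its global id, its warp index within the block, and its lane.
__global__ void print_warp_ids(void)
{
    int gid     = blockIdx.x * blockDim.x + threadIdx.x; // global thread id (1D launch assumed)
    int warp_id = threadIdx.x / 32;                      // warp index within the block
    int lane_id = threadIdx.x % 32;                      // position of the thread inside its warp
    printf("gid %3d | warp %d | lane %2d\n", gid, warp_id, lane_id);
}

int main(void)
{
    print_warp_ids<<<1, 64>>>();   // one block of 64 threads => 64 / 32 = 2 warps
    cudaDeviceSynchronize();       // wait for device printf output before exiting
    return 0;
}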
Warp Divergence
A warp is a single execution unit in the GPU - what if we write code that forces some threads in
the warp to execute different instructions?
SIMT issues a single instruction in principle - so under what conditions do threads of one warp
need different instructions?
if (tid < 16) { ... }
else { ... }
Condition checks like this, where threads of the same warp take different paths, cause a performance penalty.
What causes warps to diverge?
Multiple paths of execution within the same warp. Condition checks in which all threads of a warp
take the same path will NOT induce warp divergence.
ex: if (tid / 32 < 1) ==> every thread of a given warp evaluates this the same way, so it does not cause warp divergence (see the sketch below).
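A hedged illustration (kernel names are mine): the first kernel forces divergence within every warp
because the branch depends on tid % 2, while the second branches at warp granularity so each warp
follows a single path.

// Divergent: even and odd threads of the SAME warp take different paths.
__global__ void divergence_kernel(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0) out[tid] = 100.0f;
    else              out[tid] = 200.0f;
}

// No divergence: the condition is uniform across each warp (all 32 lanes agree).
__global__ void no_divergence_kernel(float *out)
{
    int tid     = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_id = tid / 32;
    if (warp_id % 2 == 0) out[tid] = 100.0f;
    else                  out[tid] = 200.0f;
}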
Warp divergence and the resulting performance degradation can be measured using nvprof - how is
branch efficiency calculated?
branch efficiency = 100 * (total branches - divergent branches) / total branches
nvprof --metrics branch_efficiency ./fname.out
nvcc optimizes short divergent code paths using branch predication.
This compiler optimization can be disabled by compiling with nvcc -G.
Resource Partitioning and latency hiding
- The GPU schedules blocks onto an SM depending on the resource limitations of that SM. Let us
understand what those resource limitations are and how exactly the scheduling works.
- The local execution context of a warp mainly consists of the following resources:
program counters, registers and shared memory.
- The execution context of each warp processed by an SM is maintained on-chip during the entire
lifetime of the warp. Thus switching from one execution context to another has no cost.
This zero-cost context switching is achieved by making GPU threads very lightweight and limiting
their capabilities.
- Of the above resources, the number of registers and the amount of shared memory can be directly
controlled by the programmer. Each SM has a set of 32-bit registers stored in a register file.
This register file is partitioned among threads, and there is a fixed amount of shared memory that
is partitioned among thread blocks.
- The number of warps that can simultaneously reside on an SM for a given kernel depends on the
number of registers and the amount of shared memory available on the SM.
So if each thread uses more registers, fewer warps can reside on the SM; likewise, if each thread
block uses more shared memory, fewer thread blocks of the grid can reside on that SM.
- The maximum registers and shared memory per SM vary for devices of different compute
capabilities and architectures, so we need to optimize the kernel's performance for the device it
is going to execute on.
The programmer can control registers and shared memory - the register file is partitioned among
threads and the shared memory is partitioned among thread blocks.
Warp and Block Categories
A thread block is called active when compute resources such as registers and shared memory have
been allocated to it.
Warps in such an active block are called active warps.
Active warps are further categorized as: selected, stalled, eligible.
The warp scheduler on an SM selects active warps on each execution cycle and dispatches them to
the execution units.
- A warp that is actively executing is called a selected warp.
- A warp that is ready for execution but not currently executing is an eligible warp.
- A warp that is not ready for execution is a stalled warp.
- A warp is eligible for execution if both of the following conditions are met: 32 CUDA cores are
available for execution, and all arguments of the current instruction are ready.
- If a warp stalls, the warp scheduler picks an eligible warp to execute in its place.
- Compute resource allocation for a warp is static and is kept on-chip during the entire warp
lifetime, therefore warp context switching costs no execution cycles.
The SM is a resource-constrained environment - excessive usage of resources such as registers and
shared memory by a single thread causes the SM to have only a small number of active warps
residing on it at a given time, resulting in underutilization of compute resources.
By understanding the concepts of latency and latency hiding we shall learn why enough active warps
are needed to utilize the computational power optimally.
What is latency (instruction latency)? The number of clock cycles between when an instruction is
issued and when it completes => applies to both arithmetic and memory operations.
On a chip there are two kinds of units - the Control Unit (CU) and the Arithmetic Logic Unit (ALU).
The control unit issues instructions to the ALU, and the ALU takes some clock cycles to complete
each instruction. What do these instructions look like?
1. Add  2. Max  3. Min  4. MAD (multiply-and-add)  5. Mul  6. Div (> 200 clock cycles)
7. Remainder  8. Memory load  9. Memory store
Also, the different types of memory in the GPU have different latencies - GLOBAL, LOCAL, SHARED,
TEXTURE, CONSTANT.
e.g. L1 cache - latency of 194 cycles, DRAM - 350 cycles, shared memory - 28 cycles.
* CUDA latency hiding mechanisms
- At the instruction level, arithmetic/logic and memory instructions can be issued in parallel.
- If a single warp holds the SM for a long time, that means performance degradation.
- The execution context of a warp stays valid for the entire lifetime of the warp.
- Switching between warps is zero cost.
- So, if one warp stalls and a large number of other warps are eligible, the warp scheduler can
dispatch another eligible warp to the cores.
- Basically, a core whose warp has stalled can continue executing a new warp, and when the results
for the previous warp are ready, the previous warp can resume its execution.
CUDA Applications are either compute bound or memory bound
Occupancy
- Ratio of active warps to the maximum number of warps allowed per SM.
- The effectiveness of latency hiding is measured through this ratio.
- The maximum number of warps allowed per SM is given in the device documentation.
- On a single core, instructions are executed sequentially and parallelism is at the instruction level.
- On the other hand, the active warp count per SM depends on the resources available on the device
and the resource consumption of your kernel.
- An Excel sheet with formulas (the CUDA Occupancy Calculator) is available to compute occupancy -
one enters the kernel's resource usage, such as the shared memory amount.
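A quick worked example (the 64-warp limit here is an assumption for illustration - check your
device's documentation): if the SM allows at most 64 resident warps and the kernel's register and
shared memory usage limits it to 48 active warps, then occupancy = 48 / 64 = 0.75, i.e. 75%.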
Profiling with NVPROF
- Profiling gives us insight into our CUDA program's execution - the variables are different grid
configurations and different resource usages.
nvprof is used for profile-driven optimization - i.e. change the resource usage and grid
configuration through command line arguments and measure the performance.
nvprof has 3 modes of operation:
- summary mode
- gpu trace mode
- api trace mode
nvcc -g (host) and -G (device) are used to disable compiler optimizations for debugging.
- In summary mode (the default) we can see:
  - GPU activity
  - kernel executions
  - memory copy functions etc.
Metric examples - gld_efficiency, achieved_occupancy.
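Example invocations for the three modes and for metrics (the executable name ./app is just a placeholder):
nvprof ./app                                               (summary mode, the default)
nvprof --print-gpu-trace ./app                             (gpu trace mode)
nvprof --print-api-trace ./app                             (api trace mode)
nvprof --metrics gld_efficiency,achieved_occupancy ./app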
Thread Synchronization
In parallel programming, synchronization is a vital aspect. Here we are working with more than one
thread, and often we need to order the operations performed by these threads.
The CUDA primitives used are:
1. cudaDeviceSynchronize() - global synchronization between host and device
2. __syncthreads() - synchronization between threads of a block
Usually a CUDA program consists of asynchronous calls to launch kernels from the host.
Asynchronous means non-blocking, i.e. after launching the kernel on the device, the host program
continues its execution. To make the host wait until device execution finishes - i.e. to make the
call synchronous - we use the cudaDeviceSynchronize() function after launching the kernel.
The __syncthreads() function is used for synchronization within a block, across all its warps, and
it can be called only from device code. It forces all threads in that block to wait until every
thread in the block has reached that point in the code.
Also, global and shared memory writes made before the barrier are visible to all threads of the
block after the synchronization point.
As we discussed earlier, CUDA follows the SIMT execution paradigm. Within a warp, all 32 threads
execute the same instruction, but between warps there are no such guarantees. So let's say warp 2
starts execution first; this execution is independent of the other warps.
But if our kernel has a __syncthreads() call in the code, then warp 2 has to wait at that
statement until all other warps reach that point.
So all threads in the block reach the synchronization point, and only then can any warp of the
block execute further instructions.
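A minimal sketch (names and sizes are mine, not from the original notes) showing both primitives:
__syncthreads() guards a shared-memory exchange inside the block, and cudaDeviceSynchronize()
makes the host wait for the asynchronous kernel launch.

#include <cstdio>

__global__ void sync_demo(int *data)
{
    __shared__ int smem[64];
    int tid = threadIdx.x;
    smem[tid] = data[tid];                    // each thread writes one element to shared memory
    __syncthreads();                          // barrier: every write above is now visible to the whole block
    data[tid] = smem[(tid + 1) % blockDim.x]; // safe to read an element written by another warp
}

int main(void)
{
    const int n = 64;
    int h[n];
    for (int i = 0; i < n; i++) h[i] = i;

    int *d;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    sync_demo<<<1, n>>>(d);                   // kernel launch is asynchronous w.r.t. the host
    cudaDeviceSynchronize();                  // host blocks here until the device has finished

    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h[0] = %d (expected 1)\n", h[0]);
    cudaFree(d);
    return 0;
}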
The reduction problem - applying any commutative and associative operation across a vector is
known as the reduction problem.
The sum reduction problem in the parallel paradigm: neighbored pair approach
We first partition the input vector into smaller data chunks, each data chunk is summed up
separately, and finally we add these partial sums to get the total sum. The diagram demonstrates
exactly the steps we are going to follow. In the implementation we use the thread block size as
our data chunk size, so each thread block is responsible for adding up its data chunk. After a
thread block finishes its accumulation,
the result is stored in a separate array; each slot in this new array is populated with the
partial sum of one thread block.
See the Section 2 sync reduction code - we use __syncthreads() in each iteration of the for loop.
This is because warps can execute in any order.
So when we have a thread block with multiple warps, we have no other way of guaranteeing this
ordering restriction.
This is where the __syncthreads() function comes into action: after each iteration, we call
__syncthreads().
We call __syncthreads() to make sure that all threads in a block reach one point before any thread
of that block moves forward.
But if we use __syncthreads() inside a condition check, that condition might not allow some of the
threads in the thread block to reach the __syncthreads() statement.
This results in undefined behavior, so be mindful when using __syncthreads(), especially in the
presence of condition checks.
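A minimal sketch of the kind of kernel described above (the name is mine; the host side allocates
the input and a per-block partial-sum array and does the final accumulation on the CPU, as
described in the notes). Note that __syncthreads() sits outside the if, so every thread of the
block reaches the barrier in every iteration.

// Naive neighbored-pair reduction: in iteration k only threads whose id is a
// multiple of 2^(k+1) do work, which spreads the active threads across all warps.
__global__ void reduce_neighbored(int *g_idata, int *g_odata, unsigned int n)
{
    unsigned int tid = threadIdx.x;
    unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int *idata = g_idata + blockIdx.x * blockDim.x;    // this block's data chunk

    if (gid >= n) return;                              // assumes n is a multiple of blockDim.x,
                                                       // so the whole block returns together

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
    {
        if (tid % (2 * stride) == 0)                   // divergent: active threads sit in every warp
            idata[tid] += idata[tid + stride];
        __syncthreads();                               // outside the if: every thread reaches the barrier
    }

    if (tid == 0)
        g_odata[blockIdx.x] = idata[0];                // partial sum of this thread block
}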
Warp divergence in the parallel reduction example
As already explained, if the threads within a warp execute along different paths, there is warp
divergence in the kernel, and the device will serialize these paths and execute them one after the
other, which results in
heavy performance penalties. Unfortunately our previous parallel reduction implementation also has
warp divergence. For example, in this diagram you can see 128 threads belonging to a single thread
block; each slot in
the table belongs to one thread, and the thread ids are written in each slot as well. In our
previous implementation, in the first iteration only threads with an even thread id perform any
effective work.
So only 50% of the threads perform any work, and these threads are spread across all four warps of
the thread block.
In the second iteration only threads whose thread id is a multiple of 4 perform any summation, so
only a quarter of the threads perform the summation, and those threads are also spread across
all the warps. So in each iteration, other than the last couple, we have warp divergence in every
warp of the thread block. In the final two iterations,
only a couple of warps show warp divergence, since the threads performing the sum are spread
across only a couple of warps.
So our previous reduction implementation has a lot of divergent code. Luckily there is more than
one way of avoiding warp divergence in this type of kernel. Here
we are going to implement the reduction algorithm in two new ways to avoid warp divergence. As the
first solution we will make neighboring threads perform the
effective work in each iteration.
In the second method we will use the interleaved pair approach as a solution for divergence in the
reduction algorithm. Let's look at each of these approaches now.
In the first approach we make sure that neighboring threads are the ones performing the summation.
For example, in the previous neighbored pair
approach, when we ran the algorithm on an 8-element block, in the first iteration
only threads with an even thread id performed the summation; threads with an odd thread id did not
add any values. In our new approach, however, we make sure that neighboring threads perform the
summation.
So in the corresponding first iteration, the first four consecutive threads will perform the
summation.
Let's see an example of this approach. Say we have a data block with 8 elements. In the first
iteration, the first four threads perform the
summation: thread T0 adds the first two elements and stores the result back to the first index, T1
adds the next two elements and stores the result back to the
second index, and so on. In the next iteration, consecutive threads are still the ones executing
the summation -
in this case T0 and T1. (In our previous implementation, T0 and T4 were the ones executing the
summation in the second iteration.)
In the third iteration only T0 executes the summation. If we use 128 as the data block size, our
thread block size will also be 128.
So in the first iteration the first 64 threads, or the first two warps, perform the summation and
the last two warps do nothing. But there is still no warp divergence, since within each warp all
the threads follow the
same path. In the next iteration only the first 32 threads, or the first warp, perform the
summation and the other three warps do nothing. In the 3rd iteration,
however, only the first 16 threads, or half of the first warp, perform the summation, so from this
iteration onwards there will be warp divergence.
But notice that here we have only one warp with divergence per iteration from this point onwards,
whereas in our initial neighbored pair approach
with a 128-element data block all four warps had warp divergence until the last two iterations.
Now let's see the implementation.
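Below is a hedged sketch of a kernel along the lines described in the following walkthrough
(kernel only; the name is mine, and the main function is assumed to match the previous neighbored
pairs implementation).

// Neighbored-pair reduction with forced neighboring threads: in each iteration the
// working threads are consecutive (index = 2 * stride * tid), so whole warps either
// work or idle, and divergence only appears once fewer than 32 threads remain active.
__global__ void reduce_neighbored_less(int *g_idata, int *g_odata, unsigned int n)
{
    unsigned int tid = threadIdx.x;
    unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int *idata = g_idata + blockIdx.x * blockDim.x;    // this block's data chunk

    if (gid >= n) return;                              // assumes n is a multiple of blockDim.x

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
    {
        unsigned int index = 2 * stride * tid;         // consecutive tids map to the working slots
        if (index < blockDim.x)
            idata[index] += idata[index + stride];
        __syncthreads();                               // all threads finish this iteration first
    }

    if (tid == 0)
        g_odata[blockIdx.x] = idata[0];                // partial sum of this thread block
}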
The main function for this implementation is almost the same as in the previous neighbored pairs
implementation, so I will not go through it.
Let's look at the kernel. In the kernel, we first set the local thread id value and then calculate
the global thread id value.
Then we set a local memory pointer with the proper offset for this thread block, so that the block
can access global memory through this local pointer.
Then we have the usual boundary check. After that we perform the usual iteration, multiplying the
offset by two in each iteration.
Now we need consecutive threads to perform the summation. For that we calculate an index value for
each thread in the block based on the thread id and the offset value.
We use this condition check to limit the threads that perform the summation. We also need all
threads in the block to finish one iteration before any thread in that block
moves on to the next one, so we have a __syncthreads() call here as well. After all iterations
end, the first element of the thread block's data chunk
holds the partial sum for this thread block, so we store it into our partial sum array. Now let's
run this implementation and check its validity.
In the output you can see that it prints that the GPU and CPU results are the same, so our
implementation is valid. Now let's move on to the next way of solving
warp divergence: the interleaved pair approach. In this approach we also force the summation
of elements to happen in consecutive threads, but the offset between the elements that are added
together in an iteration starts from the
reverse order compared to the previous approaches. For example, in the first
iteration we set the offset value to half of the block size.
So if we consider a data block with 8 elements, the offset value will be 4; hence in the first
iteration the first thread accumulates the first and fifth elements of the data block, the second
thread accumulates the
second and sixth elements, and so on. In the second iteration we halve the offset. So in the
second
iteration the offset value will be 2, hence the first thread accumulates the first and third
elements, and the second thread accumulates the second and fourth elements of the current data
block.
Notice that we are performing an in-place reduction, so the output of one iteration is the input
to the next iteration:
the first element going into the second iteration contains the sum of the first and fifth elements
of the original array,
and the third element contains the sum of the third and seventh elements of the original array.
After the second iteration the first element of the array
holds the sum of the first, third, fifth and seventh elements of the original array.
In the last iteration the offset is one, and once the offset reaches one we stop iterating;
the first thread accumulates the first and second elements of the current data
block. After this step, the first element of our array contains the sum of all 8 elements.
Now let's see the implementation. It is almost the same as the previous implementation. The
difference is that we initialize the offset value to half of the block size and then keep dividing
it by 2 in each iteration. In the condition check, since we need consecutive
threads to execute the accumulation steps, we check whether the thread id is
less than the offset. So in the first iteration only the first half of the thread block
performs the summation, in the next iteration only the
first quarter of the threads in the block perform the summation, and so on. Here also we need all
threads in a thread block to finish one iteration before moving on to the next one,
so I have used the __syncthreads() function here as well. Then we store
the block's sum into the partial sum array.
If you run this example, you will see that the GPU and CPU results are the same, so this
implementation is also a valid one.
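And a hedged sketch of the interleaved pair version described above (same caveats; the kernel name
is mine and the host code is assumed unchanged).

// Interleaved-pair reduction: the offset starts at half the block size and is halved
// each iteration; only threads with tid < offset work, so the active threads are
// always the first consecutive ones.
__global__ void reduce_interleaved(int *g_idata, int *g_odata, unsigned int n)
{
    unsigned int tid = threadIdx.x;
    unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int *idata = g_idata + blockIdx.x * blockDim.x;    // this block's data chunk

    if (gid >= n) return;                              // assumes n is a multiple of blockDim.x

    for (unsigned int offset = blockDim.x / 2; offset > 0; offset /= 2)
    {
        if (tid < offset)                              // first half, then first quarter, ...
            idata[tid] += idata[tid + offset];
        __syncthreads();                               // barrier before halving the offset
    }

    if (tid == 0)
        g_odata[blockIdx.x] = idata[0];                // partial sum of this thread block
}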