07b - CUDA Parallel Patterns + Notes
Parallel Computing with GPUs
Parallel Patterns
Dr Robert Chisholm
A higher-level view for design: think at the pattern level rather than in terms of blocks, warps, grids, etc.
Gather
❑Multiple inputs and single coalesced output
❑Might have sequential loading or random access
❑Affects memory performance
❑Differs from map due to the multiple inputs
❑Read from a number of locations
❑Random access load
[Figure: gather operation — threads (threadIdx.x 0–7) each read from several memory values/locations]
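As a minimal sketch of the pattern (the kernel and array names here are illustrative, not from the lecture code), each thread performs several, possibly random, reads and one coalesced write:

// Hypothetical gather kernel: two (possibly random) reads per thread via an
// index array of 2*N entries, and a single coalesced write per thread.
__global__ void gather_sum(const float *input, const int *indices, float *output, int N){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N){
        // Loads may be sequential or scattered depending on the index pattern,
        // which affects memory performance
        float a = input[indices[2 * i]];
        float b = input[indices[2 * i + 1]];
        // Consecutive threads write consecutive elements (coalesced)
        output[i] = a + b;
    }
}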
Scatter
❑Reads from a single input and writes to one or many
❑Can be implemented in CUDA using atomics or block/warp co-operation
❑Write pattern will determine performance
Scatter refers to the non-coalesced write pattern rather than the number of writes
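For illustration only (names are hypothetical, not the lecture's code), a scatter implemented with atomics so that colliding writes from different threads combine safely might look like:

// Hypothetical scatter kernel: one coalesced read per thread, one write to a
// computed (possibly conflicting) location. Assumes 'output' is zero-initialised.
__global__ void scatter_add(const float *input, const int *dest, float *output, int N){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N){
        // atomicAdd resolves the case where several threads target the same element
        atomicAdd(&output[dest[i]], input[i]);
    }
}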
Reduction
❑A reduction is where all elements of a set have a common binary associative
operator (⊕) applied to them to “reduce” the set to a single value
❑Binary associative = the order in which the operation is applied to pairs of inputs does not matter
❑ E.g. (1 + 2) + 3 + 4 == 1 + (2 + 3) + 4 == 10
❑Example operators
❑Most obvious example is addition (Summation)
❑Other examples, Maximum, Minimum, product
❑Serial example is trivial but how does this work in parallel?
int data[N];
int r = 0;
for (int i = 0; i < N; i++){
    r = reduce(r, data[i]);
}

OR

int data[N];
int r = 0;
for (int i = N-1; i >= 0; i--){
    r = reduce(r, data[i]);
}
Binary associative means that changing the order has no impact on the result
Parallel Reduction
❑Order of operations does not matter so we don’t have to think serially.
❑A tree-based approach can be used
❑At each step data is reduced by a factor of 2
N Elements: 9 5 2 1 2 3 8 1
Step 1 (⊕): 14 3 5 9
Step 2 (⊕): 17 14
Step 3 (⊕): 31
Log2(N) steps in total
9 5 2 1 2 3 8 1
Kernel Launch 1 (⊕): 14 3 5 9
Kernel Launch 2 (⊕): 17 14
Kernel Launch 3 (⊕): 31
__global__ void sum_reduction(int *input, int *results){
    extern __shared__ int sdata[];
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread loads one element into shared memory
    sdata[threadIdx.x] = input[i];
    __syncthreads();
    // Even-indexed threads sum adjacent pairs; the output is half the size
    if (i % 2 == 0){
        results[i / 2] = sdata[threadIdx.x] + sdata[threadIdx.x+1];
    }
}
    // Thread 0 writes this block's partial sum out to global memory
    if (threadIdx.x == 0)
        block_results[blockIdx.x] = sdata[0];
}
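For context, the fragment above is the tail of a block-level shared-memory reduction. A sketch of what the full kernel might look like (the index setup and loop are assumptions rather than the lecture's exact code):

__global__ void block_sum_reduction(const int *input, int *block_results){
    extern __shared__ int sdata[];
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread loads one element into shared memory
    sdata[threadIdx.x] = input[i];
    __syncthreads();
    // Pairwise tree reduction within the block
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2){
        if (threadIdx.x % (2 * stride) == 0)
            sdata[threadIdx.x] += sdata[threadIdx.x + stride];
        __syncthreads();
    }
    // Thread 0 writes this block's partial sum out to global memory
    if (threadIdx.x == 0)
        block_results[blockIdx.x] = sdata[0];
}

A second, much smaller launch (or one of the approaches below) is then needed to combine the per-block results.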
❑2-way bank conflict
❑Try evaluating this using a spreadsheet
[Table: type size 4 bytes, stride 2 — the threads' accesses land on only 16 of the 32 shared memory banks (Banks Used = 16), with each bank hit twice (Max Conflicts = 2)]
[Table: type size 4 bytes, stride 1 — accesses cover all 32 banks (Banks Used = 32) with at most one access per bank (Max Conflicts = 1), i.e. conflict-free]
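This analysis is what motivates sequential addressing inside the reduction loop: strided indexing accesses shared memory with a growing stride and causes bank conflicts, whereas stride-1 accesses by the active threads are conflict-free. A sketch of the two loop variants, assuming the shared array sdata from the earlier kernels:

// Strided indexing: with stride = 1 the active threads access sdata with a
// stride of 2 words, giving the 2-way bank conflict analysed above.
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2){
    unsigned int index = 2 * stride * threadIdx.x;
    if (index < blockDim.x)
        sdata[index] += sdata[index + stride];
    __syncthreads();
}

// Sequential addressing: active threads access consecutive words, so each
// access falls in a distinct bank and there are no conflicts.
for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1){
    if (threadIdx.x < stride)
        sdata[threadIdx.x] += sdata[threadIdx.x + stride];
    __syncthreads();
}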
How does this work for problems bigger than a single block? Suggestions? Hint: Our
very first version was suitable for global reductions…
    // Thread 0 atomically adds this block's partial sum to the global result
    if (threadIdx.x == 0)
        atomicAdd(result, sdata[0]);
}
The only difference here is the atomic add to global memory of the block's result.
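A sketch of what the complete kernel might look like with this change; only the atomic tail above comes from the slide, and the load and in-block reduction loop (sequential addressing) are assumptions:

__global__ void sum_reduction(float *input, float *result){
    extern __shared__ float sdata[];
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread loads one element into shared memory
    sdata[threadIdx.x] = input[i];
    __syncthreads();
    // In-block reduction with sequential addressing
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1){
        if (threadIdx.x < stride)
            sdata[threadIdx.x] += sdata[threadIdx.x + stride];
        __syncthreads();
    }
    // Thread 0 atomically adds this block's partial sum to the global result,
    // removing the need for further kernel launches
    if (threadIdx.x == 0)
        atomicAdd(result, sdata[0]);
}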
Further Optimisation?
❑Can we improve our technique further?
__global__ void sum_reduction(float *input, float *result){
    extern __shared__ float sdata[];
    // ... load and in-block reduction as before ...
    if (threadIdx.x == 0)
        atomicAdd(result, sdata[0]);
}
Further Optimisation?
❑Can we improve our technique further? Yes
❑Other optimisations
❑Loop unrolling
❑Increasing Instruction Level Parallelism: do more work in each thread (see the sketch below)
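As an illustration of the last point (a sketch, not the lecture's code), each thread can accumulate several elements in a register with a grid-stride loop before the shared-memory reduction starts, so the grid is launched with far fewer threads than there are elements:

__global__ void sum_reduction_ilp(const float *input, float *result, int N){
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    float sum = 0.0f;
    // Grid-stride loop: each thread sums many elements in a register first
    for (unsigned int i = blockIdx.x * blockDim.x + tid; i < N; i += gridDim.x * blockDim.x)
        sum += input[i];
    sdata[tid] = sum;
    __syncthreads();
    // The in-block reduction is then only over blockDim.x partial sums
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1){
        if (tid < stride)
            sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        atomicAdd(result, sdata[0]);
}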
What is a Scan?
❑Consider the following …
❑Remove empty elements from an array (compact): 2 0 5 6 3 0 1 0 → 2 5 6 3 1
❑Output variable numbers of values per thread: per-thread counts 2 0 1 2 1 3 0 0 → packed output A B C D E F G H I
❑Split elements from an array based on a condition (split): 1 2 5 1 3 4 3 8 → 1 5 1 3 3 2 4 8
Sparse data: 2 0 5 6 3 0 1 0
Compacted data: 2 5 6 3 1
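Scan is what makes this compaction parallel: flag the elements to keep, take an exclusive prefix sum of the flags to obtain each kept element's output address, then scatter. A sketch of the final scatter step, assuming hypothetical flags and addresses arrays that have already been computed:

// 'flags[i]' is 1 if input[i] should be kept (e.g. non-zero), 0 otherwise;
// 'addresses' holds the exclusive prefix sum of 'flags'.
__global__ void compact(const int *input, const int *flags, const int *addresses,
                        int *output, int N){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N && flags[i]){
        // The exclusive scan value is exactly this element's position in the
        // compacted output
        output[addresses[i]] = input[i];
    }
}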
Inclusive Scan
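An inclusive scan (prefix sum) gives, at each position, the reduction of all elements up to and including that position; for the sparse data above, the inclusive sum scan is 2 2 7 13 16 16 17 17. A minimal serial sketch with addition as the operator:

// Serial inclusive scan: output[i] = input[0] + input[1] + ... + input[i]
void inclusive_scan(const int *input, int *output, int N){
    int running = 0;
    for (int i = 0; i < N; i++){
        running += input[i];
        output[i] = running;
    }
}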
Just like reduction, this can be specialised for the warp level using warp shuffles.
❑Exactly the same as the block level technique but at warp level
❑The warp's prefix sum ends up in the lane where threadIdx.x % 32 == 31 (see the sketch below)
❑Either use shared memory to combine the per-warp results
❑Or consider the following global scan approaches.
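A sketch of the warp-level building block using shuffles (assuming full 32-lane warps and an inclusive sum; the function name is illustrative):

// Warp-level inclusive scan using shuffles (Hillis-Steele style).
// After the loop, the lane with threadIdx.x % 32 == 31 holds the warp's total.
__device__ int warp_inclusive_scan(int value){
    for (int offset = 1; offset < 32; offset *= 2){
        // Fetch the value held by the lane 'offset' positions lower
        int n = __shfl_up_sync(0xffffffff, value, offset);
        // Only lanes that actually have such a neighbour accumulate it
        if ((threadIdx.x % 32) >= offset)
            value += n;
    }
    return value;
}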
Each requires a final update stage to increment the individual values with block totals
    // Atomically add this block's total to the totals of all following blocks
    // (i.e. blocks after this one)
    // Could improve this by splitting the work across more threads
    if (threadIdx.x == 0){
        for (int i = blockIdx.x + 1; i < gridDim.x; i++)
            atomicAdd(&block_sums[i], s_data[blockDim.x - 1]);
    }
}
This is the same technique as the global-level reduction, where the block total was atomically added to a device symbol.
    // One thread fetches the running total of all preceding blocks
    if (threadIdx.x == 0)
        block_total = block_sums[blockIdx.x];
    __syncthreads();
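In context, the final update kernel might look something like the following sketch; only the fragment above comes from the slide, block_total is assumed to be a __shared__ variable, and data is a hypothetical name for the per-block scan results:

__global__ void scan_final_update(int *data, const int *block_sums){
    __shared__ int block_total;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    // One thread fetches the running total of all preceding blocks...
    if (threadIdx.x == 0)
        block_total = block_sums[blockIdx.x];
    __syncthreads();
    // ...and every thread offsets its locally scanned value by it
    data[i] += block_total;
}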
Summary
❑Parallel Patterns
❑Define a parallel pattern as building blocks for parallel applications
❑Give examples of common patterns
❑Reduction
❑Present the process of performing parallel reduction
❑Explore the performance implications of parallel reduction implementations
❑Analyse block level and atomic approaches for reduction
❑Scan
❑Give motivating examples of parallel prefix sum (scan)
❑Describe the serial and parallel approaches towards scan
❑Compare block level and atomic approaches to the parallel prefix sum
algorithm
❑More on these in Lecture 9: CUDA libraries that provide highly optimised parallel primitives.
Further Reading
❑[Link]
❑All about application of warp shuffles to reduction
❑[Link]
s2018/stanford/lectures/lecture_6/parallel_patterns_1.pdf
❑Scan material based loosely on this lecture
❑[Link]
❑Reduction material is based on this fantastic lecture by Mark Harris (NVIDIA)