07b - CUDA Parallel Patterns + Notes

The document discusses parallel computing with GPUs, focusing on parallel patterns such as Map, Gather, Scatter, and Reduction. It emphasizes the importance of shared memory, memory coalescing, and occupancy in optimizing performance. The lecture also covers the implementation of parallel reduction techniques and the challenges associated with them, including bandwidth issues and bank conflicts in shared memory access.


Parallel Computing
With GPUs
Parallel Patterns

Dr Robert Chisholm

COM4521/COM6521 - Parallel Computing with GPUs

We learnt about shared memory

❑Very powerful for block level computations
❑Excellent for improving performance by reducing memory bandwidth
❑User controlled caching, which needs careful consideration of bank conflicts and boundary conditions
❑Memory coalescing: vital for good memory bandwidth performance
❑Need to be aware of cache usage and line size
❑Occupancy can be changed by modifying block sizes, registers and shared memory usage

This week:
❑How exactly are warps scheduled?
❑Can we program at the warp level?
❑What mechanisms are there for communication between threads?


This Lecture (Learning Objectives)


❑Parallel Patterns
❑Define a parallel pattern as building blocks for parallel applications
❑Give examples of common patterns
❑Reduction
❑Present the process of performing parallel reduction
❑Explore the performance implications of parallel reduction implementations
❑Analyse block level and atomic approaches for reduction
❑Prefix Sum / Scan
❑Give motivating examples of parallel prefix sum (scan)
❑Describe the serial and parallel approaches towards scan
❑Compare block level and atomic approaches to the parallel prefix sum
algorithm

COM4521/COM6521 - Parallel Computing with GPUs


What are Parallel Patterns


❑Parallel patterns are high level building blocks that can be used to
create algorithms
❑Implementation is abstracted to give a higher-level view
❑Patterns describe techniques suited to parallelism
❑Allows algorithms to be built with parallelism from the ground up
❑Top-down approach might not parallelise very easily…
❑Consider the simplest parallel pattern: Map
❑Takes the input list i
❑Applies a function f
❑Writes the result list o by applying f to all members of i
❑Equivalent to a CUDA kernel where i and o are memory locations determined
by threadIdx etc.
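
As a concrete illustration, here is a minimal sketch of the Map pattern as a CUDA kernel; the function square and the names d_in/d_out are illustrative assumptions, not from the slides:

// Map pattern sketch: each thread applies f to one input element and
// writes one output element. Here f is a simple square function.
__device__ float square(float x) { return x * x; }

__global__ void map_square(const float *d_in, float *d_out, unsigned int n) {
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = square(d_in[i]);   // o[i] = f(i[i])
}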

COM4521/COM6521 - Parallel Computing with GPUs

Ground up rather than top down

Higher level view for design: think at the pattern level rather than blocks, warps or grids etc.

Map aka Transform


Gather
❑Multiple inputs and single coalesced output
❑Might have sequential loading or random access, which affects memory performance
❑Differs from map due to multiple inputs

[Figure: Gather operation diagrams — each thread (threadIdx.x) reads from a number of memory values/locations]
❑ Read from a number of locations
❑ Random access load

COM4521/COM6521 - Parallel Computing with GPUs

We have seen this in many places,

Image blur, matrix mul, reduction etc.
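
A minimal sketch of a Gather kernel; the index array d_idx and the two-reads-per-thread combination are assumptions for illustration:

// Gather pattern sketch: each thread reads from multiple (possibly
// non-contiguous) input locations and writes a single coalesced output.
__global__ void gather(const float *d_in, const unsigned int *d_idx,
                       float *d_out, unsigned int n) {
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) {
        // The read locations depend on d_idx, so loads may not be
        // coalesced, which affects memory performance.
        d_out[i] = d_in[d_idx[2*i]] + d_in[d_idx[2*i + 1]];
    }
}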


Scatter
❑Reads from a single input and writes to one or many
❑Can be implemented in CUDA using atomics or block/warp co-operation
❑Write pattern will determine performance

[Figure: Scatter operation diagrams — each thread (threadIdx.x) writes to a number of memory values/locations]
❑ Write to a number of locations
❑ Collision on write
❑ Random access write

COM4521/COM6521 - Parallel Computing with GPUs

Scatter might be guaranteed to have no clashing writes.

Scatter refers to the non-coalesced write pattern rather than the number of writes.
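
A minimal sketch of a Scatter kernel using atomics; the index array d_idx is an assumption, and atomicAdd is used to handle the case where two threads write to the same location:

// Scatter pattern sketch: each thread writes its value to a location
// chosen at run time; collisions on write are resolved with atomicAdd.
__global__ void scatter(const float *d_in, const unsigned int *d_idx,
                        float *d_out, unsigned int n) {
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&d_out[d_idx[i]], d_in[i]);   // random access write
}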


Other Parallel Patterns


❑Stencil
❑Gather a fixed pattern, usually based on locality
❑See 2D shared memory examples (e.g. image kernels)
❑Reduce
❑Reduce many values to a single value
❑Combined with Map to form Map Reduce
❑ (often with intermediate shuffle or sort)
❑Scan
❑Compute a cumulative sum across a set
❑Sort (later)
❑Sort values or <key, value> pairs

COM4521/COM6521 - Parallel Computing with GPUs


Reduction
❑A reduction is where all elements of a set have a common binary associative
operator (⊕) applied to them to “reduce” the set to a single value
❑Binary associative = the order in which the operation is applied to pairs of inputs does not matter
❑ E.g. (1 + 2) + 3 + 4 == 1 + (2 + 3) + 4 == 10
❑Example operators
❑Most obvious example is addition (Summation)
❑Other examples, Maximum, Minimum, product
❑Serial example is trivial but how does this work in parallel?
int data[N];
int r = 0;
for (int i = 0; i < N; i++){
    r = reduce(r, data[i]);
}

OR

int data[N];
int r = 0;
for (int i = N-1; i >= 0; i--){
    r = reduce(r, data[i]);
}

int reduce(int r, int i){
    return r + i;
}

COM4521/COM6521 - Parallel Computing with GPUs

Summation already considered at a warp level (warp shuffles)

Binary associative means that changing the order has no impact on the result


Parallel Reduction
❑Order of operations does not matter so we don’t have to think serially.
❑A tree-based approach can be used
❑At each step data is reduced by a factor of 2
[Figure: Tree-based reduction of N elements in Log2(N) steps]
Input:  9 5 2 1 2 3 8 1
Step 1: 14 3 5 9
Step 2: 17 14
Step 3: 31
COM4521/COM6521 - Parallel Computing with GPUs


Parallel Reduction in CUDA


❑No global synchronisation so how do multiple blocks perform
reduction?
❑Split the execution into multiple stages
❑Recursive method

[Figure: Recursive reduction split across kernel launches]
Input:           9 5 2 1 2 3 8 1
Kernel Launch 1: 14 3 5 9
Kernel Launch 2: 17 14
Kernel Launch 3: 31
COM4521/COM6521 - Parallel Computing with GPUs

Each stage has same implementation


Recursive Reduction Problems


❑ What might be some problems with the following?
__global__ void sum_reduction(float *input, float *results){
    extern __shared__ float sdata[];

    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

    sdata[threadIdx.x] = input[i];
    __syncthreads();

    if (i % 2 == 0){
        results[i / 2] = sdata[threadIdx.x] + sdata[threadIdx.x+1];
    }
}

COM4521/COM6521 - Parallel Computing with GPUs

Using shared memory but not very well.

Problem is bandwidth bound. Need to reduce the global memory reads/writes!

Can anyone think of a way to improve this?

Last lecture warp reductions!


Recursive Reduction Problems


❑Small kernel has a high launch overhead
❑Lots of writes to global memory
❑Poor use of shared memory or caching
❑Only half the threads in each warp are active after __syncthreads()

__global__ void sum_reduction(float *input, float *results){
    extern __shared__ float sdata[];

    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

    sdata[threadIdx.x] = input[i];
    __syncthreads();

    if (i % 2 == 0){
        results[i / 2] = sdata[threadIdx.x] + sdata[threadIdx.x+1];
    }
}

COM4521/COM6521 - Parallel Computing with GPUs

Using shared memory but not very well.

Problem is bandwidth bound. Need to reduce the global memory reads/writes!

Can anyone think of a way to improve this?


Block Level Reduction


❑Lower launch overhead (reduction within block)
❑Much better use of shared memory
❑One thread per block writes the result to global memory
__global__ void sum_reduction(float *input, float *block_results){
    extern __shared__ float sdata[];

    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

    sdata[threadIdx.x] = input[i];
    __syncthreads();

    for (unsigned int stride = 1; stride < blockDim.x; stride*=2){
        unsigned int strided_i = threadIdx.x * 2 * stride;
        if (strided_i < blockDim.x){
            sdata[strided_i] += sdata[strided_i + stride];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0)
        block_results[blockIdx.x] = sdata[0];
}
COM4521/COM6521 - Parallel Computing with GPUs

Reduction of a whole block reduces memory reads

Reduce by factor of 32 the number of launches and read/writes


Block Level Recursive Reduction


Shared Memory Values: 9 5 2 1 2 3 8 1
Loop 1 (threadIdx.x 0–3, stride = 1):
Shared Memory Values: 14 5 3 1 5 3 9 1
Loop 2 (threadIdx.x 0–1, stride = 2):
Shared Memory Values: 17 5 3 1 14 3 9 1
Loop 3 (threadIdx.x 0, stride = 4):
Shared Memory Values: 31 5 3 1 14 3 9 1

for (unsigned int stride = 1; stride < blockDim.x; stride*=2){
    unsigned int strided_i = threadIdx.x * 2 * stride;
    if (strided_i < blockDim.x){
        sdata[strided_i] += sdata[strided_i + stride];
    }
    __syncthreads();
}
COM4521/COM6521 - Parallel Computing with GPUs

Same reduction pattern as for warp shuffles (but on the block)


Block Level Reduction


❑Is this shared memory access pattern bank conflict
free?

for (unsigned int stride = 1; stride < blockDim.x; stride*=2){
    unsigned int strided_i = threadIdx.x * 2 * stride;
    if (strided_i < blockDim.x){
        sdata[strided_i] += sdata[strided_i + stride];
    }
    __syncthreads();
}

COM4521/COM6521 - Parallel Computing with GPUs

Block Level Reduction

❑Is this shared memory access pattern conflict free? No
❑Each thread accesses a SM bank using the following
❑sm_bank = (threadIdx.x * 2 * stride + stride) % 32
❑Between each thread there is therefore strided access across SM banks
❑2-way bank conflict
❑Try evaluating this using a spreadsheet
❑To avoid bank conflicts the SM stride between threads should be 1

for (unsigned int stride = 1; stride < blockDim.x; stride*=2){
    unsigned int strided_i = threadIdx.x * 2 * stride;
    if (strided_i < blockDim.x){
        sdata[strided_i] += sdata[strided_i + stride];
    }
    __syncthreads();
}

[Table: spreadsheet of threadIdx.x (0–31) against shared memory index and bank, for a 4-byte type and a stride of 2 between threads. Threads 0–15 hit banks 1, 3, 5, …, 31 and threads 16–31 wrap around to the same odd banks: only 16 banks are used, with a maximum of 2-way conflicts.]

COM4521/COM6521 - Parallel Computing with GPUs

Index for this example is shown only for the read.

Bank = index * sm_stride % 32

sm_stride is different to the loop stride in this example; in this case it is 2 (even).
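
As the note above suggests evaluating the pattern in a spreadsheet, here is a small host-side sketch (not from the slides) that prints the bank touched by each thread's read in the first loop iteration, assuming 4-byte elements and 32 banks:

#include <stdio.h>

int main(void) {
    unsigned int stride = 1;                          // first loop iteration
    for (unsigned int t = 0; t < 32; t++) {
        unsigned int index = t * 2 * stride + stride; // sdata[strided_i + stride]
        printf("thread %2u -> index %2u -> bank %2u\n", t, index, index % 32);
    }
    return 0;                                         // only the 16 odd banks appear
}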


Block Level Reduction (Sequential Addressing)


Shared Memory Values: 9 5 2 1 2 3 8 1
Loop 1 (threadIdx.x 0–3, stride = 4):
Shared Memory Values: 11 8 10 2 2 3 8 1
Loop 2 (threadIdx.x 0–1, stride = 2):
Shared Memory Values: 21 10 10 2 2 3 8 1
Loop 3 (threadIdx.x 0, stride = 1):
Shared Memory Values: 31 10 10 2 2 3 8 1

for (unsigned int stride = blockDim.x/2; stride > 0; stride>>=1){   // bitshift form of stride /= 2
    if (threadIdx.x < stride){
        sdata[threadIdx.x] += sdata[threadIdx.x + stride];
    }
    __syncthreads();
}
COM4521/COM6521 - Parallel Computing with GPUs

Different gather pattern

❑Now conflict free regardless of the reduction loop stride
❑The stride between shared memory variable accesses for threads is always sequential

[Table: spreadsheet of threadIdx.x (0–31) against shared memory index and bank, for a 4-byte type and a stride of 1 between threads. Threads access consecutive indices (16–47 in this example), so all 32 banks are used with a maximum of 1 access per bank (no conflicts).]
COM4521/COM6521 - Parallel Computing with GPUs

How does this work for problems bigger than a single block? Suggestions? Hint: Our
very first version was suitable for global reductions…

Atomics or recursive kernel calls.


Global Reduction Approach


❑Use the recursive method
❑Our block level reduction can be applied to the result
❑At some stage it may be more effective to simply sum the final block on the CPU
❑Or use atomics on block results

[Figure: recursive global reduction — each stage reduces the data by the thread block width]

COM4521/COM6521 - Parallel Computing with GPUs

Each stage has the same implementation.

Different architectures might favour different implementations, so benchmarking is important.
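
A minimal host-side sketch (an assumption, not shown on the slides) of the recursive launch approach, using the block-level sum_reduction kernel from earlier and assuming the element count is a multiple of the block size at every stage:

// Repeatedly launch the block-level reduction: each launch produces one
// partial sum per block, which becomes the input of the next stage.
void recursive_reduce(float *d_data, float *d_partial, unsigned int n,
                      unsigned int block_size) {
    while (n > 1) {
        unsigned int blocks = n / block_size;          // one partial sum per block
        sum_reduction<<<blocks, block_size, block_size * sizeof(float)>>>(d_data, d_partial);
        float *tmp = d_data; d_data = d_partial; d_partial = tmp;  // partials become input
        n = blocks;
    }
    // After the loop the single result is in d_data[0] on the device.
}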


Global Reduction Atomics

__global__ void sum_reduction(float *input, float *result){
    extern __shared__ float sdata[];

    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

    sdata[threadIdx.x] = input[i];
    __syncthreads();

    for (unsigned int stride = blockDim.x/2; stride > 0; stride>>=1){
        if (threadIdx.x < stride){
            sdata[threadIdx.x] += sdata[threadIdx.x + stride];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicAdd(result, sdata[0]);
}

COM4521/COM6521 - Parallel Computing with GPUs

Only difference here is the atomic operation to global with the block’s result.


Further Optimisation?
❑Can we improve our technique further?
__global__ void sum_reduction(float *input, float *result){
    extern __shared__ float sdata[];

    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

    sdata[threadIdx.x] = input[i];
    __syncthreads();

    for (unsigned int stride = blockDim.x/2; stride > 0; stride>>=1){
        if (threadIdx.x < stride){
            sdata[threadIdx.x] += sdata[threadIdx.x + stride];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicAdd(result, sdata[0]);
}

COM4521/COM6521 - Parallel Computing with GPUs

ILP: more work per thread.

Writing cross-architecture approaches is hard! Many optimisations, all favouring different hardware.


Further Optimisation?
❑Can we improve our technique further? Yes
❑We could optimise for the warp level (see the sketch after this list)
❑Warp Level: Shuffles for reduction (see last lecture)
❑Block Level: Shared Memory reduction (or SM atomics)
❑Grid Level: Recursive Kernel Launches or Global Atomics
❑Other optimisations
❑Loop unrolling
❑Increasing Instruction Level Parallelism, do more work in each thread
❑Different architectures, data-types and problem sizes may favour different implementations/optimisations
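
A minimal sketch (an assumption, not the module's reference implementation) of combining these levels for summation: warp-level shuffles, then one global atomic per warp:

// Warp level: shuffle-based reduction; lane 0 ends up with the warp total.
__inline__ __device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xFFFFFFFF, val, offset);
    return val;
}

// Grid level: the first lane of each warp atomically adds its total to the
// global result (no shared memory or recursive launches needed).
__global__ void sum_reduction_warp(const float *input, float *result) {
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    float val = warp_reduce_sum(input[i]);
    if (threadIdx.x % 32 == 0)
        atomicAdd(result, val);
}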

COM4521/COM6521 - Parallel Computing with GPUs

ILP: more work per thread.

Writing cross-architecture approaches is hard! Many optimisations, all favouring different hardware.


What is a Scan?
❑Consider the following …
❑Remove empty elements from an array (compact): 2 0 5 6 3 0 1 0 → 2 5 6 3 1
❑Output a variable number of values per thread: counts 2 0 1 2 1 3 0 0 → packed output A B C D E F G H I
❑Split elements of an array based on a condition (split): 1 2 5 1 3 4 3 8 → 1 5 1 3 3 | 2 4 8
COM4521/COM6521 - Parallel Computing with GPUs


What is a Scan?
❑Consider the following …
❑Remove empty elements from an array (compact): 2 0 5 6 3 0 1 0 → 2 5 6 3 1
❑Output a variable number of values per thread: counts 2 0 1 2 1 3 0 0 → packed output A B C D E F G H I
❑Split elements of an array based on a condition (split): 1 2 5 1 3 4 3 8 → 1 5 1 3 3 | 2 4 8

❑Each has the same problem
❑Not even considered for sequential programs!
❑Where to write output in parallel?
COM4521/COM6521 - Parallel Computing with GPUs


Parallel Prefix Sum (Scan)


❑Where to write output in parallel?
❑Each thread needs to know the output location(s) it can write to, to avoid conflicts.

Thread/Read index:                                0 1 2 3 4 5 6 7
Output/Write index (running sum of binary state): 0 0 1 2 3 3 4 4
Sparse data:                                      2 0 5 6 3 0 1 0
Compacted data:                                   2 5 6 3 1

❑The solution is a parallel prefix sum (or scan)


❑Given the input A = [a0, a1, …, an-1] and a binary associative operator ⊕
❑Scan(A) = [0, a0, (a0⊕a1), …, (a0⊕a1⊕…⊕an-2)]

COM4521/COM6521 - Parallel Computing with GPUs


Serial Prefix Sum Example


❑E.g. Given the input and the addition operator
❑A = [2, 6, 2, 4, 7, 2, 1, 5]
❑Scan(A) = [0, 2, 2+6, 2+6+2, 2+6+2+4, …]
❑Scan(A) = [0, 2, 8, 10, 14, 21, 23, 24]
❑More generally a serial implementation of an additive scan using a
running sum looks like…
int A[8] = { 2, 6, 2, 4, 7, 2, 1, 5 };
int scan_A[8];
int running_sum = 0;
for (int i = 0; i < 8; ++i){
    scan_A[i] = running_sum;
    running_sum += A[i];
}

COM4521/COM6521 - Parallel Computing with GPUs


Serial Scan for Compaction


int Input[8] = { 2, 0, 5, 6, 3, 0, 1, 0 };
int A[8] = { 2, 0, 5, 6, 3, 0, 1, 0 };
int scan_A[8];
int output[5];
int running_sum = 0;

for (int i = 0; i < 8; ++i){     // generate scan input
    A[i] = Input[i] > 0;         // A = {1, 0, 1, 1, 1, 0, 1, 0}
}

for (int i = 0; i < 8; ++i){     // scan
    scan_A[i] = running_sum;
    running_sum += A[i];         // scan_A = {0, 1, 1, 2, 3, 4, 4, 5}
}

for (int i = 0; i < 8; ++i){
    // We could test either Input[i] or A[i] to find empty values
    int input = Input[i];
    if (input > 0){              // scattered write
        int idx = scan_A[i];     // output = {2, 5, 6, 3, 1}
        output[idx] = input;
    }
}
COM4521/COM6521 - Parallel Computing with GPUs


Parallel Local (Shared Memory) Scan


After loop n, each sum holds its local value plus the preceding 2^n − 1 values

Shared Memory Values: 2 6 2 4 7 2 1 5
Loop 1 (threadIdx.x 0–6, stride = 1):
Shared Memory Values: 2 8 8 6 11 9 3 6
Loop 2 (threadIdx.x 0–5, stride = 2):
Shared Memory Values: 2 8 10 14 19 15 14 15
Loop 3 (threadIdx.x 0–3, stride = 4):
Shared Memory Values: 2 8 10 14 21 23 24 29

Log2(N) steps in total; the final row is the Inclusive Scan.

COM4521/COM6521 - Parallel Computing with GPUs

Can be designed like a reduction with a tree


Parallel Local (Shared Memory) Scan

Shared Memory Values: 2 6 2 4 7 2 1 5
Loop 1 (threadIdx.x 0–6, stride = 1):
Shared Memory Values: 2 8 8 6 11 9 3 6
Loop 2 (threadIdx.x 0–5, stride = 2):
Shared Memory Values: 2 8 10 14 19 15 14 15
Loop 3 (threadIdx.x 0–3, stride = 4):
Shared Memory Values: 2 8 10 14 21 23 24 29   (Inclusive scan, Log2(N) steps)

0 2 8 10 14 21 23 24 29   (Exclusive scan + reduction)

COM4521/COM6521 - Parallel Computing with GPUs

Reduction as the final value is the sum


Implementing Local Scan with Shared Memory


__global__ void scan(float *input) {
    extern __shared__ float s_data[];
    s_data[threadIdx.x] = input[threadIdx.x + blockIdx.x*blockDim.x];

    for (int stride = 1; stride<blockDim.x; stride<<=1) {
        __syncthreads();
        float s_value = (threadIdx.x >= stride) ? s_data[threadIdx.x - stride] : 0;
        __syncthreads();
        s_data[threadIdx.x] += s_value;
    }

    // move the result to global?
}

❑No bank conflicts (stride of 1 between threads)


❑Synchronisation required between read and write

COM4521/COM6521 - Parallel Computing with GPUs

Bit shift (<<= 1) performs the same as *= 2

Just like reduction. This can be specialised for warp level using warp shuffles


Implementing Local Scan (at warp level)


__global__ void scan(float *input) {
    __shared__ float s_data[32];
    float val1, val2;

    val1 = input[threadIdx.x + blockIdx.x*blockDim.x];

    for (int s = 1; s < 32; s <<= 1) {
        val2 = __shfl_up_sync(0xFFFFFFFF, val1, s);
        if (threadIdx.x % 32 >= s)
            val1 += val2;
    }

    // store/process warp level results
}

❑Exactly the same as the block level technique but at warp level
❑The warp's full prefix sum (the warp total) is held by the lane where threadIdx.x % 32 == 31
❑Either use shared memory to reduce between warps
❑Or consider the following global scan approaches.
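
For the additive case, one possible way (an assumption, not from the slides) to obtain a per-thread exclusive value and the warp total from val1 after the shuffle loop above:

// After the loop, val1 holds the inclusive scan for this lane.
float my_input   = input[threadIdx.x + blockIdx.x*blockDim.x]; // value loaded earlier
float exclusive  = val1 - my_input;                     // remove own contribution
float warp_total = __shfl_sync(0xFFFFFFFF, val1, 31);   // broadcast lane 31's value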
COM4521/COM6521 - Parallel Computing with GPUs

No need for sync, except on Volta.


Implementing Global Scan


                        Thread Block 1            Thread Block 2
Shared Memory Values:   2 6 2 4 7 2 1 5           4 2 2 4 1 0 1 2
After Loop 1:           2 8 8 6 11 9 3 6          4 6 4 6 5 1 1 3
After Loop 2:           2 8 10 14 19 15 14 15     4 6 8 12 9 7 6 4
Local Scan Result:      2 8 10 14 21 23 24 29     4 6 8 12 13 13 14 16
Global Scan Result:     2 8 10 14 21 23 24 29     33 35 37 41 42 42 43 45

COM4521/COM6521 - Parallel Computing with GPUs

Need to propagate sums to next blocks


Implementing Global Scan


❑Same problem as reduction when scaling to grid/device level
❑Each block is required to add the reduction value from the preceding blocks

❑Global scan therefore requires either:
1. Recursive scan kernel on the results of the local scan
❑ Additional kernel to add the sums of preceding blocks
2. Atomic Increments
❑ Increment a counter for block level results
❑ Additional kernel to add the sums of preceding blocks to each value

COM4521/COM6521 - Parallel Computing with GPUs

Each requires a final update stage to increment the individual values with block totals


Global Level Scan (Atomics Part 1)


__device__ float block_sums[BLOCK_DIM];

__global__ void scan(float *input, float *local_result) {
    extern __shared__ float s_data[];
    s_data[threadIdx.x] = input[threadIdx.x + blockIdx.x*blockDim.x];

    for (int stride = 1; stride<blockDim.x; stride<<=1) {
        __syncthreads();
        float s_value = (threadIdx.x >= stride) ? s_data[threadIdx.x - stride] : 0;
        __syncthreads();
        s_data[threadIdx.x] += s_value;
    }

    // store local scan result for each thread
    local_result[threadIdx.x + blockIdx.x*blockDim.x] = s_data[threadIdx.x];

    // atomic store to all following block totals (e.g. blocks after this block)
    // Could improve this by splitting work across more threads
    if (threadIdx.x == 0){
        for (int i=blockIdx.x+1; i<gridDim.x; i++)
            atomicAdd(&block_sums[i], s_data[blockDim.x-1]);
    }
}

COM4521/COM6521 - Parallel Computing with GPUs

Same technique for global level reduction, the block total is added to a device
symbol.


Global Level Scan (Atomics Part 2)


❑After completion of the first kernel, block sums are all synchronised
❑Use first thread in block to load block total into shared memory
❑Increment local result
__device__ float block_sums[BLOCK_DIM];

__global__ void scan_update(float *local_result, float *global_result) {
    __shared__ float block_total;
    int idx = threadIdx.x + blockIdx.x*blockDim.x;

    if (threadIdx.x == 0)
        block_total = block_sums[blockIdx.x];

    __syncthreads();

    global_result[idx] = local_result[idx] + block_total;
}

COM4521/COM6521 - Parallel Computing with GPUs

One thread loads the value into shared memory.

All threads add the block total to their previous result and store it.
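
A minimal host-side sketch (an assumption) of driving the two kernels above; the device pointers d_input, d_local and d_output are hypothetical, and block_sums is assumed to be zero-initialised before the first launch:

// Stage 1: block-local scans; each block also atomically adds its total
// to the block_sums entries of all following blocks.
scan<<<blocks, threads, threads * sizeof(float)>>>(d_input, d_local);
// Stage 2: every thread adds its block's offset to its local scan result.
scan_update<<<blocks, threads>>>(d_local, d_output);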


Summary
❑Parallel Patterns
❑Define a parallel pattern as building blocks for parallel applications
❑Give examples of common patterns
❑Reduction
❑Present the process of performing parallel reduction
❑Explore the performance implications of parallel reduction implementations
❑Analyse block level and atomic approaches for reduction
❑Scan
❑Give motivating examples of parallel prefix sum (scan)
❑Describe the serial and parallel approaches towards scan
❑Compare block level and atomic approaches to the parallel prefix sum
algorithm
❑More on these in Lecture 9: CUDA libraries that provide highly optimised parallel primitives.
COM4521/COM6521 - Parallel Computing with GPUs


Further Reading
❑[Link]
❑All about application of warp shuffles to reduction
❑[Link] s2018/stanford/lectures/lecture_6/parallel_patterns_1.pdf
❑Scan material based loosely on this lecture
❑[Link]
❑Reduction material is based on this fantastic lecture by Mark Harris (NVIDIA)

COM4521/COM6521 - Parallel Computing with GPUs
