07b - CUDA Parallel Patterns + Notes
Parallel Computing with GPUs
Parallel Patterns
Dr Robert Chisholm
A higher-level view for design: think at the pattern level rather than in terms of blocks, warps, grids, etc.
Gather
❑Multiple inputs and single coalesced output
❑Might have sequential loading or random access
❑Affects memory performance
❑Differs from map due to the multiple inputs
❑Read from a number of locations
❑Random access load
[Figure: gather operation — threads (threadIdx.x 0–7) each read from several memory values/locations]
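As a minimal sketch of the pattern (the kernel and array names here are illustrative, not from the lecture code), each thread performs several, possibly random, reads and one coalesced write:

// Hypothetical gather kernel: two (possibly random) reads per thread via an
// index array of 2*N entries, and a single coalesced write per thread.
__global__ void gather_sum(const float *input, const int *indices, float *output, int N){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N){
        // Loads may be sequential or scattered depending on the index pattern,
        // which affects memory performance
        float a = input[indices[2 * i]];
        float b = input[indices[2 * i + 1]];
        // Consecutive threads write consecutive elements (coalesced)
        output[i] = a + b;
    }
}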
Scatter
❑Reads from a single input and writes to one or many
❑Can be implemented in CUDA using atomics or block/warp co-operation
❑Write pattern will determine performance
Scatter refers to the non-coalesced write pattern rather than the number of writes
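For illustration only (names are hypothetical, not the lecture's code), a scatter implemented with atomics so that colliding writes from different threads combine safely might look like:

// Hypothetical scatter kernel: one coalesced read per thread, one write to a
// computed (possibly conflicting) location. Assumes 'output' is zero-initialised.
__global__ void scatter_add(const float *input, const int *dest, float *output, int N){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N){
        // atomicAdd resolves the case where several threads target the same element
        atomicAdd(&output[dest[i]], input[i]);
    }
}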
Reduction
❑A reduction is where all elements of a set have a common binary associative
operator (⊕) applied to them to “reduce” the set to a single value
❑Binary associative = the order in which the operation is applied to pairs of inputs does not matter
❑ E.g. (1 + 2) + 3 + 4 == 1 + (2 + 3) + 4 == 10
❑Example operators
❑Most obvious example is addition (Summation)
❑Other examples, Maximum, Minimum, product
❑Serial example is trivial but how does this work in parallel?
int data[N];
int r = 0;
for (int i = 0; i < N; i++){
    r = reduce(r, data[i]);
}

OR

int data[N];
int r = 0;
for (int i = N-1; i >= 0; i--){
    r = reduce(r, data[i]);
}
Binary associative means that changing the order has no impact on the result
Parallel Reduction
❑Order of operations does not matter so we don’t have to think serially.
❑A tree-based approach can be used
❑At each step data is reduced by a factor of 2
N Elements: 9 5 2 1 2 3 8 1
Step 1 (⊕): 14 3 5 9
Step 2 (⊕): 17 14
Step 3 (⊕): 31
Log2(N) steps in total
9 5 2 1 2 3 8 1
Kernel Launch 1 (⊕): 14 3 5 9
Kernel Launch 2 (⊕): 17 14
Kernel Launch 3 (⊕): 31
__global__ void sum_reduction(int *input, int *results){
    extern __shared__ int sdata[];
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread loads one element into shared memory
    sdata[threadIdx.x] = input[i];
    __syncthreads();
    // Even-indexed threads sum adjacent pairs; the output is half the size
    if (i % 2 == 0){
        results[i / 2] = sdata[threadIdx.x] + sdata[threadIdx.x+1];
    }
}
    // Thread 0 writes this block's partial sum out to global memory
    if (threadIdx.x == 0)
        block_results[blockIdx.x] = sdata[0];
}
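For context, the fragment above is the tail of a block-level shared-memory reduction. A sketch of what the full kernel might look like (the index setup and loop are assumptions rather than the lecture's exact code):

__global__ void block_sum_reduction(const int *input, int *block_results){
    extern __shared__ int sdata[];
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread loads one element into shared memory
    sdata[threadIdx.x] = input[i];
    __syncthreads();
    // Pairwise tree reduction within the block
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2){
        if (threadIdx.x % (2 * stride) == 0)
            sdata[threadIdx.x] += sdata[threadIdx.x + stride];
        __syncthreads();
    }
    // Thread 0 writes this block's partial sum out to global memory
    if (threadIdx.x == 0)
        block_results[blockIdx.x] = sdata[0];
}

A second, much smaller launch (or one of the approaches below) is then needed to combine the per-block results.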
❑2-way bank conflict
❑Try evaluating this using a spreadsheet
[Table: type size 4 bytes, stride 2 — the threads' accesses land on only 16 of the 32 shared memory banks (Banks Used = 16), with each bank hit twice (Max Conflicts = 2)]
[Table: type size 4 bytes, stride 1 — accesses cover all 32 banks (Banks Used = 32) with at most one access per bank (Max Conflicts = 1), i.e. conflict-free]
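This analysis is what motivates sequential addressing inside the reduction loop: strided indexing accesses shared memory with a growing stride and causes bank conflicts, whereas stride-1 accesses by the active threads are conflict-free. A sketch of the two loop variants, assuming the shared array sdata from the earlier kernels:

// Strided indexing: with stride = 1 the active threads access sdata with a
// stride of 2 words, giving the 2-way bank conflict analysed above.
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2){
    unsigned int index = 2 * stride * threadIdx.x;
    if (index < blockDim.x)
        sdata[index] += sdata[index + stride];
    __syncthreads();
}

// Sequential addressing: active threads access consecutive words, so each
// access falls in a distinct bank and there are no conflicts.
for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1){
    if (threadIdx.x < stride)
        sdata[threadIdx.x] += sdata[threadIdx.x + stride];
    __syncthreads();
}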
How does this work for problems bigger than a single block? Suggestions? Hint: Our
very first version was suitable for global reductions…
    // Thread 0 atomically adds this block's partial sum to the global result
    if (threadIdx.x == 0)
        atomicAdd(result, sdata[0]);
}
The only difference here is the atomic add to global memory of the block's result.
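A sketch of what the complete kernel might look like with this change; only the atomic tail above comes from the slide, and the load and in-block reduction loop (sequential addressing) are assumptions:

__global__ void sum_reduction(float *input, float *result){
    extern __shared__ float sdata[];
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread loads one element into shared memory
    sdata[threadIdx.x] = input[i];
    __syncthreads();
    // In-block reduction with sequential addressing
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1){
        if (threadIdx.x < stride)
            sdata[threadIdx.x] += sdata[threadIdx.x + stride];
        __syncthreads();
    }
    // Thread 0 atomically adds this block's partial sum to the global result,
    // removing the need for further kernel launches
    if (threadIdx.x == 0)
        atomicAdd(result, sdata[0]);
}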
Further Optimisation?
❑Can we improve our technique further?
__global__ void sum_reduction(float *input, float *result){
    extern __shared__ float sdata[];
    // ... load and in-block reduction as before ...
    if (threadIdx.x == 0)
        atomicAdd(result, sdata[0]);
}
Further Optimisation?
❑Can we improve our technique further? Yes
❑Other optimisations
❑Loop unrolling
❑Increasing Instruction Level Parallelism: do more work in each thread (see the sketch below)
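As an illustration of the last point (a sketch, not the lecture's code), each thread can accumulate several elements in a register with a grid-stride loop before the shared-memory reduction starts, so the grid is launched with far fewer threads than there are elements:

__global__ void sum_reduction_ilp(const float *input, float *result, int N){
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    float sum = 0.0f;
    // Grid-stride loop: each thread sums many elements in a register first
    for (unsigned int i = blockIdx.x * blockDim.x + tid; i < N; i += gridDim.x * blockDim.x)
        sum += input[i];
    sdata[tid] = sum;
    __syncthreads();
    // The in-block reduction is then only over blockDim.x partial sums
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1){
        if (tid < stride)
            sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        atomicAdd(result, sdata[0]);
}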
What is a Scan?
❑Consider the following …
❑Remove empty elements from an array (compact): 2 0 5 6 3 0 1 0 → 2 5 6 3 1
❑Output variable numbers of values per thread: per-thread counts 2 0 1 2 1 3 0 0 → packed output A B C D E F G H I
❑Split elements from an array based on a condition (split): 1 2 5 1 3 4 3 8 → 1 5 1 3 3 2 4 8
Sparse data: 2 0 5 6 3 0 1 0
Compacted data: 2 5 6 3 1
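Scan is what makes this compaction parallel: flag the elements to keep, take an exclusive prefix sum of the flags to obtain each kept element's output address, then scatter. A sketch of the final scatter step, assuming hypothetical flags and addresses arrays that have already been computed:

// 'flags[i]' is 1 if input[i] should be kept (e.g. non-zero), 0 otherwise;
// 'addresses' holds the exclusive prefix sum of 'flags'.
__global__ void compact(const int *input, const int *flags, const int *addresses,
                        int *output, int N){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N && flags[i]){
        // The exclusive scan value is exactly this element's position in the
        // compacted output
        output[addresses[i]] = input[i];
    }
}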
Inclusive Scan
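An inclusive scan (prefix sum) gives, at each position, the reduction of all elements up to and including that position; for the sparse data above, the inclusive sum scan is 2 2 7 13 16 16 17 17. A minimal serial sketch with addition as the operator:

// Serial inclusive scan: output[i] = input[0] + input[1] + ... + input[i]
void inclusive_scan(const int *input, int *output, int N){
    int running = 0;
    for (int i = 0; i < N; i++){
        running += input[i];
        output[i] = running;
    }
}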
Just like reduction, this can be specialised for the warp level using warp shuffles.
❑Exactly the same as the block level technique but at warp level
❑The warp's prefix sum ends up in the lane where threadIdx.x % 32 == 31 (see the sketch below)
❑Either use shared memory to combine the per-warp results
❑Or consider the following global scan approaches.
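A sketch of the warp-level building block using shuffles (assuming full 32-lane warps and an inclusive sum; the function name is illustrative):

// Warp-level inclusive scan using shuffles (Hillis-Steele style).
// After the loop, the lane with threadIdx.x % 32 == 31 holds the warp's total.
__device__ int warp_inclusive_scan(int value){
    for (int offset = 1; offset < 32; offset *= 2){
        // Fetch the value held by the lane 'offset' positions lower
        int n = __shfl_up_sync(0xffffffff, value, offset);
        // Only lanes that actually have such a neighbour accumulate it
        if ((threadIdx.x % 32) >= offset)
            value += n;
    }
    return value;
}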
Each requires a final update stage to increment the individual values with block totals
    // Atomically add this block's total to the totals of all following blocks
    // (i.e. blocks after this one)
    // Could improve this by splitting the work across more threads
    if (threadIdx.x == 0){
        for (int i = blockIdx.x + 1; i < gridDim.x; i++)
            atomicAdd(&block_sums[i], s_data[blockDim.x - 1]);
    }
}
This is the same technique as the global-level reduction, where the block total was atomically added to a device symbol.
    // One thread fetches the running total of all preceding blocks
    if (threadIdx.x == 0)
        block_total = block_sums[blockIdx.x];
    __syncthreads();
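In context, the final update kernel might look something like the following sketch; only the fragment above comes from the slide, block_total is assumed to be a __shared__ variable, and data is a hypothetical name for the per-block scan results:

__global__ void scan_final_update(int *data, const int *block_sums){
    __shared__ int block_total;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    // One thread fetches the running total of all preceding blocks...
    if (threadIdx.x == 0)
        block_total = block_sums[blockIdx.x];
    __syncthreads();
    // ...and every thread offsets its locally scanned value by it
    data[i] += block_total;
}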
Summary
❑Parallel Patterns
❑Define a parallel pattern as building blocks for parallel applications
❑Give examples of common patterns
❑Reduction
❑Present the process of performing parallel reduction
❑Explore the performance implications of parallel reduction implementations
❑Analyse block level and atomic approaches for reduction
❑Scan
❑Give motivating examples of parallel prefix sum (scan)
❑Describe the serial and parallel approaches towards scan
❑Compare block level and atomic approaches to the parallel prefix sum
algorithm
❑More on these in Lecture 9: CUDA libraries that provide highly optimised parallel primitives.
Further Reading
❑[Link]
❑All about application of warp shuffles to reduction
❑[Link]
s2018/stanford/lectures/lecture_6/parallel_patterns_1.pdf
❑Scan material based loosely on this lecture
❑[Link]
❑Reduction material is based on this fantastic lecture by Mark Harris (NVIDIA)