0% found this document useful (0 votes)

56 views40 pages

Lecture 10

Uploaded by

Thanh Võ Thị Hoài

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

56 views40 pages

Lecture 10

Uploaded by

Thanh Võ Thị Hoài

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Scan

Parallel Prefix Sum – Scan

Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes

Scan

Objective
Ø To master parallel Prefix Sum (Scan) algorithms
Ø Frequently used for parallel work assignment and resource
allocation
Ø A key primitive in many parallel algorithms to convert serial
computation into parallel computation
Ø Based on reduction tree and reverse reduction tree

Ø Reading – Mark Harris, Parallel Prefix Sum with CUDA

Ø [Link]
projects/scan/doc/[Link]

2
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

(Inclusive) Prefix-Sum (Scan) Definition

Definition: The all-prefix-sums operation takes a binary
associative operator ⊕, and an array of n elements!
[x0, x1, …, xn-1],!
!
and returns the array!
!
! ![x0, (x0 ⊕ x1), …, (x0 ⊕ x1 ⊕ … ⊕ xn-1)].!
!
Example: If ⊕ is addition, then the all-prefix-sums operation
on the array ! [3 1 7 0 4 1 6 3],!
would return! [3 4 11 11 15 16 22 25].!

3
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Inclusive Scan Application Example

Ø Assume we have a 100-inch sandwich to feed 10
Ø We know how many inches each person wants
Ø [3 5 2 7 28 4 3 0 8 1]
Ø How do we cut the sandwich quickly?
Ø How much will be left?

Ø Method 1: cut the sections sequentially: 3 inches first, 5

inches second, 2 inches third, etc.
Ø Method 2: calculate Prefix scan and cut in parallel
Ø [3, 8, 10, 17, 45, 49, 52, 52, 60, 61] (39 inches left)

4
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Typical Applications of Scan

Ø Scan is a simple and useful parallel building block
Ø Convert recurrences from sequential :
for(j=1;j<n;j++)
out[j] = out[j-1] + f(j);
Ø into parallel:
forall(j) { temp[j] = f(j) };
scan(out, temp);
Ø Useful for many parallel algorithms:
•Radix sort •Polynomial evaluation
•Quicksort •Solving recurrences
•String comparison •Tree operations
•Lexical analysis •Histograms
•Stream compaction •Etc.

Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes

Scan

Other Applications
Ø Assigning space in farmers market
Ø Allocating memory to parallel threads
Ø Allocating memory buffer for
communication channels
Ø …

6
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

A Inclusive Sequential Prefix-Sum

Given a sequence [x0, x1, x2, ... ]
Calculate output [y0, y1, y2, ... ]

Such that y0 = x0
y1 = x0 + x1
y 2 = x 0 + x 1+ x 2
…
Using a recursive definition
yi = yi − 1 + xi
7
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

A Work Efficient C Implementation

y[0] = x[0];
for (i=1; i < Max_i; i++)
y[i] = y[i-1] + x[i];

Computationally efficient:

N additions needed for N elements - O(N)

8
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

A Naïve Inclusive Parallel Scan

Ø Assign one thread to calculate each y
element
Ø Have every thread add up all x elements
needed for the y element
y0 = x0
y1 = x0 + x1
y 2 = x 0 + x 1+ x 2

Parallel programming is easy as long as you

don’t care about performance.
9
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

A Slightly Better Parallel Inclusive Scan Algorithm

1. Read input
T0 3 1 7 0 4 1 6 3 from device
memory to
shared
memory
Each thread reads one value from the input array
in device memory into shared memory array T0.
Thread 0 writes 0 into shared memory array.

10
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

T0 3 1 7 0 4 1 6 3 1. Read input from

device memory to
Stride 1 shared memory
2. Iterate log(n)
T1 3 4 8 7 4 5 7 9 times: Threads
stride to n: Add pairs
of elements stride
elements apart.
Double stride at
each iteration. (note:
must double buffer
shared mem arrays)

• Active threads: stride to n-1 (n-stride threads)

Iterate #1 • Thread j adds elements j and j-stride from T0 and
Stride = 1 writes result into shared memory buffer T1 (ping-pong) 11
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

T0 3 1 7 0 4 1 6 3 1. (Read input from

• Active threads: stride to n-1 (n-stride threads)

Iterate #2 • Thread j adds elements j and j-stride from T1 and
Stride = 2 writes result into shared memory buffer T0 (ping-pong) 12
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

T0 3 1 7 0 4 1 6 3 1. (Read input from

device memory to
Stride 1 shared memory
2. Iterate log(n)
T1 3 4 8 7 4 5 7 9 times: Threads
Stride 2 stride to n: Add pairs
of elements stride
T0 3 4 11 11 12 12 11 14 elements apart.
Double stride at
Stride 4 each iteration. (note:
must double buffer
T1 3 4 11 11 15 16 22 25 shared mem arrays)
3. Write output from
shared memory to
device memory
Iterate #3
Stride = 4 13
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Work Efficiency Considerations

Ø The first-attempt Scan executes log(n) parallel

iterations
Ø The steps do (n-1), (n-2), (n-4),..(n - n/2) adds each
Ø Total adds: n * log(n) - (n-1) à O(n*log(n)) work

Ø This scan algorithm is not very work efficient

Ø Sequential scan algorithm does n adds
Ø A factor of log(n) hurts: 20x for 10^6 elements!

Ø A parallel algorithm can be slow when execution

resources are saturated due to low work
efficiency
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Improving Efficiency
Ø A common parallel algorithm pattern:
Balanced Trees
Ø Build a balanced binary tree on the input data and sweep it to
and from the root
Ø Tree is not an actual data structure, but a concept to determine
what each thread does at each step
Ø For scan:
Ø Traverse down from leaves to root building partial sums at
internal nodes in the tree
Ø Root holds sum of all leaves
Ø Traverse back up the tree building the scan from the partial sums

Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes

Scan

Let’s Look at the Reduction Tree Again

3 1 7 0 4 1 6 3

+ + + +

4 7 5 9

+ +

11 14

+
25 16
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Parallel Scan – Reduction Step

x0 x1 x2 x3 x4 x5 x6 x7

+ + + +

Time! ∑x0..x1 ∑x2..x3 ∑x4..x5 ∑x6..x7

+ +

∑x0..x3 ∑x4..x7

+
In place calculation ! ∑x x
0.. 7
Final value after reduce 17
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Inclusive Post Scan Step
x0 ∑x0..x1 x2 ∑x0..x3 x4 ∑x4..x5 x6 ∑x0..x7

∑x0..x5

Move (add) a critical value to a

central location where it is
needed!

18
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Inclusive Post Scan Step
x0 ∑x0..x1 x2 ∑x0..x3 x4 ∑x4..x5 x6 ∑x0..x7

∑x0..x5

+ + +

∑x0..x2 ∑x0..x4 ∑x0..x6

19
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Putting it Together

20
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Reduction Step Kernel Code

// scan_array[2*BLOCK_SIZE] is in shared memory

int stride = 1;
while(stride <= BLOCK_SIZE)
{
int index = (threadIdx.x+1)*stride*2 - 1;
if(index < 2*BLOCK_SIZE)
scan_array[index] += scan_array[index-stride];
stride = stride*2;
threadIdx.x+1 = 1, 2, 3, 4….!
__syncthreads(); stride = 1, index = !
}

21
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Sum of left half

Copyright © 2013 by Yong

22 Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Post Scan Step

int stride = BLOCK_SIZE/2;
while(stride > 0)
{
int index = (threadIdx.x+1)*stride*2 - 1;
if((index+stride) < 2*BLOCK_SIZE)
{
scan_array[index+stride] += scan_array[index];
}
stride = stride/2;
__syncthreads();
}

23
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

(Exclusive) Prefix-Sum (Scan) Definition

Definition: The all-prefix-sums operation takes a binary

associative operator ⊕, and an array of n elements!
[x0, x1, …, xn-1],!
!
and returns the array!
!
! ![0, x0, (x0 ⊕ x1), …, (x0 ⊕ x1 ⊕ … ⊕ xn-2)].!
!
Example: If ⊕ is addition, then the all-prefix-sums operation
on the array ! ! [3 1 7 0 4 1 6 3],!
would return! [0 3 4 11 11 15 16 22].!

24
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Why Exclusive Scan

Ø To find the beginning address of allocated buffers

Ø Inclusive and Exclusive scans can be easily derived

from each other; it is a matter of convenience

! ![3 1 7 0 4 1 6 3]!
!
Exclusive ![0 3 4 11 11 15 16 22]!
!
Inclusive ![3 4 11 11 15 16 22 25]!
!

25
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Exclusive Post Scan Step
(Add-move Operation)
x0 ∑x0..x1 x2 ∑x0..x3 x4 ∑x4..x5 x6 0

0 +

∑x0..x3

26
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Exclusive Post Scan Step
x0 ∑x0..x1 x2 ∑x0..x3 x4 ∑x4..x5 x6 0

0 +

∑x0..x3

+ +

0 ∑x0..x1 ∑x0..x3 ∑x0..x5

+ + + +

0 x0 ∑x0..x1 ∑x0..x2 ∑x0..x3 ∑x0..x4 ∑x0..x5 27∑x0..x6

Exclusive Post Scan Step

if (threadIdx.x==0) scan_array[2*blockDim.x-1] = 0;
int stride = BLOCK_SIZE;
while(stride > 0)
{
int index = (threadIdx.x+1)*stride*2 - 1;
if(index < 2* BLOCK_SIZE)
{
float temp = scan_array[index];
scan_array[index] += scan_array[index-stride];
scan_array[index-stride] = temp;
}
stride = stride / 2;
__syncthreads();
}

Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes

Scan

Exclusive Scan Example – Reduction Step

T 3 1 7 0 4 1 6 3
Assume array is already in shared memory

29
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Reduction Step (cont.)

T 3 1 7 0 4 1 6 3
Stride 1 Iteration 1, n/2 threads

T 3 4 7 7 4 5 6 9

Each corresponds
to a single thread.

Iterate log(n) times. Each thread adds value stride elements away to its own value

Reduction Step (cont.)

T 3 1 7 0 4 1 6 3
Stride 1

T 3 4 7 7 4 5 6 9
Stride 2 Iteration 2, n/4 threads

T 3 4 7 11 4 5 6 14

Each corresponds
to a single thread.

Iterate log(n) times. Each thread adds value stride elements away to its own value

Reduction Step (cont.)

T 3 1 7 0 4 1 6 3
Stride 1

T 3 4 7 7 4 5 6 9
Stride 2

T 3 4 7 11 4 5 6 14
Stride 4 Iteration log(n), 1 thread

T 3 4 7 11 4 5 6 25 Each corresponds
to a single thread.

Iterate log(n) times. Each thread adds value stride elements away to its own value.

Note that this algorithm operates in-place: no need for double buffering

Zero the Last Element

T 3 4 7 11 4 5 6 0

We now have an array of partial sums. Since this is an exclusive scan,

set the last element to zero. It will propagate back to the first element.

Post Scan Step from Partial Sums

T 3 4 7 11 4 5 6 0

Post Scan Step from Partial Sums

T 3 4 7 11 4 5 6 0
Stride 4 Iteration 1
1 thread
T 3 4 7 0 4 5 6 11

Each corresponds
to a single thread.

Iterate log(n) times. Each thread adds value stride elements away to its own value,
and sets the value stride elements away to its own previous value.

Post Scan Step from Partial Sums

T 3 4 7 11 4 5 6 0
Stride 4

T 3 4 7 0 4 5 6 11
Iteration 2
Stride 2
2 threads

T 3 0 7 4 4 11 6 16

Each corresponds
to a single thread.

Iterate log(n) times. Each thread adds value stride elements away to its own value,
and sets the value stride elements away to its own previous value.

Post Scan Step from Partial Sums

T 3 4 7 11 4 5 6 0
Stride 4

T 3 4 7 0 4 5 6 11
Stride 2

T 3 0 7 4 4 11 6 16
Iteration log(n)
Stride 1
n/2 threads

T 0 3 4 11 11 15 16 22 Each corresponds
to a single thread.

Done! We now have a completed scan that we can write out to device memory.

Total steps: 2 * log(n).

Work Analysis
Ø The parallel Inclusive Scan executes 2*log(n) parallel iterations
Ø log(n) in reduction and log(n) in post scan
Ø The iterations do n/2, n/4,..1, 1, …., n/4, n/2 adds
Ø Total adds: 2* (n-1) à O(n) work

Ø The total number of adds is no more than twice that

done in the efficient sequential algorithm
Ø The benefit of parallelism can easily overcome the 2X work
when there is sufficient hardware

Working on Arbitrary Length Input

Ø Build on the scan kernel that handles up to
2*blockDim.x elements
Ø Assign each section of 2*blockDim elements to a
block
Ø Have each block write the sum of its section into a
Sum array indexed by blockIdx.x
Ø Run parallel scan on the Sum array
Ø May need to break down Sum into multiple sections if it is too big
for a block
Ø Add the scanned Sum array values to the elements of
corresponding sections

Scan

Overall Flow of Complete Scan

CUDA Parallel Prefix Sum Guide
No ratings yet
CUDA Parallel Prefix Sum Guide
21 pages
Parallel Prefix Sum (Scan) With CUDA: Mark Harris
No ratings yet
Parallel Prefix Sum (Scan) With CUDA: Mark Harris
21 pages
08 Dataparallel
No ratings yet
08 Dataparallel
51 pages
Lecture 03-Parallel Prefix
No ratings yet
Lecture 03-Parallel Prefix
6 pages
ECE408 MT2 Review FA24
No ratings yet
ECE408 MT2 Review FA24
58 pages
CUDA for Parallel Computing Experts
No ratings yet
CUDA for Parallel Computing Experts
33 pages
Parallel Scan Algorithm in Julia
No ratings yet
Parallel Scan Algorithm in Julia
32 pages
Optimized Scan Primitives on CRAY Y-MP
No ratings yet
Optimized Scan Primitives on CRAY Y-MP
10 pages
Parallel Algorithms for Shared Memory
No ratings yet
Parallel Algorithms for Shared Memory
23 pages
Parallel Algorithms: Theory and Practice
No ratings yet
Parallel Algorithms: Theory and Practice
44 pages
GPU Algorithms: Reduce, Scan, Histogram
No ratings yet
GPU Algorithms: Reduce, Scan, Histogram
61 pages
CS4230 Parallel Programming Introduction To Parallel Algorithms
No ratings yet
CS4230 Parallel Programming Introduction To Parallel Algorithms
25 pages
Parallel Reduction Trees in Programming
No ratings yet
Parallel Reduction Trees in Programming
45 pages
5 - ParallelAlgorithms
No ratings yet
5 - ParallelAlgorithms
13 pages
Pram
No ratings yet
Pram
23 pages
Parallel Random Access Machine (PRAM) : Control
No ratings yet
Parallel Random Access Machine (PRAM) : Control
9 pages
qt6j57h5zw Nosplash
No ratings yet
qt6j57h5zw Nosplash
2 pages
Co 2
No ratings yet
Co 2
22 pages
Data - Parallel Algorithms On Gpus
No ratings yet
Data - Parallel Algorithms On Gpus
31 pages
Parallel Algorithms Underlying MPI Implementations
No ratings yet
Parallel Algorithms Underlying MPI Implementations
55 pages
Web GPU
0% (1)
Web GPU
40 pages
RG2 ParallelizationPrinciples HPCAI Jan2020
No ratings yet
RG2 ParallelizationPrinciples HPCAI Jan2020
40 pages
HW 3
No ratings yet
HW 3
12 pages
Lecture Parallelism DC PDF
No ratings yet
Lecture Parallelism DC PDF
7 pages
ECE645 - Lecture3 - Fast - Adders and HW Approaches To Implementation
No ratings yet
ECE645 - Lecture3 - Fast - Adders and HW Approaches To Implementation
51 pages
Parallel and Distributed Algorithms
No ratings yet
Parallel and Distributed Algorithms
21 pages
Ebook Fundations of Paralllel Programming
No ratings yet
Ebook Fundations of Paralllel Programming
109 pages
CSE524sp10 01
No ratings yet
CSE524sp10 01
62 pages
Scan Primitives
No ratings yet
Scan Primitives
11 pages
1.1 Parallelism Is Ubiquitous
No ratings yet
1.1 Parallelism Is Ubiquitous
3 pages
Parallel Algorithms Underlying MPI Implementations
No ratings yet
Parallel Algorithms Underlying MPI Implementations
55 pages
02 Basicarch
No ratings yet
02 Basicarch
83 pages
PDC LAB Experiment 2
No ratings yet
PDC LAB Experiment 2
12 pages
UNIT-8 Forms of Parallelism: 8.1 Simple Parallel Computation: Example 1: Numerical Integration Over Two Variables
No ratings yet
UNIT-8 Forms of Parallelism: 8.1 Simple Parallel Computation: Example 1: Numerical Integration Over Two Variables
12 pages
Com - 612 Exam
No ratings yet
Com - 612 Exam
13 pages
Teaching Parallelism in Computing
No ratings yet
Teaching Parallelism in Computing
37 pages
7-Tree Sum Parallel Algorithm & Applications
No ratings yet
7-Tree Sum Parallel Algorithm & Applications
23 pages
Global Sum and Tree Sum New
No ratings yet
Global Sum and Tree Sum New
18 pages
PRAM Parallel Computing Algorithms
No ratings yet
PRAM Parallel Computing Algorithms
49 pages
Pda 4
No ratings yet
Pda 4
82 pages
Lab 7
No ratings yet
Lab 7
3 pages
Problem Statement
No ratings yet
Problem Statement
5 pages
MIPS Assembly: Integer Arrays Guide
No ratings yet
MIPS Assembly: Integer Arrays Guide
13 pages
Practical 5
No ratings yet
Practical 5
23 pages
MPI Parallel Programming Guide
No ratings yet
MPI Parallel Programming Guide
49 pages
217 Lec10
No ratings yet
217 Lec10
27 pages
Parallel Numerical Methods Overview
No ratings yet
Parallel Numerical Methods Overview
46 pages
CG Practical
No ratings yet
CG Practical
15 pages
Parellel Computing Module1 Notes
No ratings yet
Parellel Computing Module1 Notes
29 pages
Assignment 1
No ratings yet
Assignment 1
3 pages
New Microsoft Word Document
No ratings yet
New Microsoft Word Document
29 pages
04 Progbasics
No ratings yet
04 Progbasics
43 pages
Super Linear Speedup in Parallel Algorithms
33% (3)
Super Linear Speedup in Parallel Algorithms
4 pages
CSC 429 Mid Fall 12
No ratings yet
CSC 429 Mid Fall 12
6 pages
DPC Lab Manual
No ratings yet
DPC Lab Manual
21 pages
Midterm
No ratings yet
Midterm
5 pages
++probleme Tot
No ratings yet
++probleme Tot
22 pages
st2100 1849957
No ratings yet
st2100 1849957
39 pages
Ansible Notes Imp
No ratings yet
Ansible Notes Imp
59 pages
Django Guide for Developers
No ratings yet
Django Guide for Developers
14 pages
Ifix Embedded Ds Gfa1287a
No ratings yet
Ifix Embedded Ds Gfa1287a
2 pages
Datasheet
No ratings yet
Datasheet
3 pages
Understanding UV Mapping-TheBasics
No ratings yet
Understanding UV Mapping-TheBasics
23 pages
Ds Smartquake
No ratings yet
Ds Smartquake
2 pages
AI Trends and Opportunities 2023
No ratings yet
AI Trends and Opportunities 2023
24 pages
OpenStreetMap Contribution Guide
No ratings yet
OpenStreetMap Contribution Guide
8 pages
9003d Car Carplay Screen Manual
No ratings yet
9003d Car Carplay Screen Manual
12 pages
Data Mining: A Four-Phase Guide
No ratings yet
Data Mining: A Four-Phase Guide
1 page
Titoe CNC 3018pro Router User Guide
No ratings yet
Titoe CNC 3018pro Router User Guide
10 pages
What Does 12 - or 16-Bit Resolution Mean-2018
No ratings yet
What Does 12 - or 16-Bit Resolution Mean-2018
2 pages
Harris Velocity
No ratings yet
Harris Velocity
3 pages
Comtrend Ar 5381u Firmware PDF
No ratings yet
Comtrend Ar 5381u Firmware PDF
3 pages
iOS App Security Analysis Report
No ratings yet
iOS App Security Analysis Report
10 pages
Mangesh Amul Milk Market Research and Sales Development
No ratings yet
Mangesh Amul Milk Market Research and Sales Development
110 pages
01-MA5600T Product Introduction ISSUE1.00
No ratings yet
01-MA5600T Product Introduction ISSUE1.00
48 pages
Digital Integrated Circuits
No ratings yet
Digital Integrated Circuits
58 pages
B.Tech CSE 5th & 6th Semester Scheme
No ratings yet
B.Tech CSE 5th & 6th Semester Scheme
56 pages
File Management Explained
No ratings yet
File Management Explained
5 pages
Tutorial On Seismic Reflection CDP Data Processing in The Radexpro Plus Software
No ratings yet
Tutorial On Seismic Reflection CDP Data Processing in The Radexpro Plus Software
58 pages
CAD-CAM Report - 24-02-2024
No ratings yet
CAD-CAM Report - 24-02-2024
4 pages
SQL Commands
No ratings yet
SQL Commands
9 pages
Step by Step Encrypting User Data With EFS in Windows Server 2012 R2
No ratings yet
Step by Step Encrypting User Data With EFS in Windows Server 2012 R2
21 pages
Prep - VN
No ratings yet
Prep - VN
6 pages
Business Analytics Notes
No ratings yet
Business Analytics Notes
5 pages
Understanding Backhaul Networks
No ratings yet
Understanding Backhaul Networks
10 pages
examSectionGuide1987 - AdmitCard - 2025 03 06 03 08
No ratings yet
examSectionGuide1987 - AdmitCard - 2025 03 06 03 08
4 pages
Dell PowerEdge R420xr Technical Guide
No ratings yet
Dell PowerEdge R420xr Technical Guide
47 pages

Lecture 10

Uploaded by

Lecture 10

Uploaded by

Scan

Parallel Prefix Sum – Scan

Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes

Ø Reading – Mark Harris, Parallel Prefix Sum with CUDA

(Inclusive) Prefix-Sum (Scan) Definition

Inclusive Scan Application Example

Ø Method 1: cut the sections sequentially: 3 inches first, 5

Typical Applications of Scan

Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes

A Inclusive Sequential Prefix-Sum

A Work Efficient C Implementation

N additions needed for N elements - O(N)

A Naïve Inclusive Parallel Scan

Parallel programming is easy as long as you

A Slightly Better Parallel Inclusive Scan Algorithm

T0 3 1 7 0 4 1 6 3 1. Read input from

• Active threads: stride to n-1 (n-stride threads)

T0 3 1 7 0 4 1 6 3 1. (Read input from

• Active threads: stride to n-1 (n-stride threads)

T0 3 1 7 0 4 1 6 3 1. (Read input from

Work Efficiency Considerations

Ø The first-attempt Scan executes log(n) parallel

Ø This scan algorithm is not very work efficient

Ø A parallel algorithm can be slow when execution

Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes

Let’s Look at the Reduction Tree Again

Parallel Scan – Reduction Step

Time! ∑x0..x1 ∑x2..x3 ∑x4..x5 ∑x6..x7

Move (add) a critical value to a

∑x0..x2 ∑x0..x4 ∑x0..x6

Reduction Step Kernel Code

Sum of left half

Copyright © 2013 by Yong

Post Scan Step

(Exclusive) Prefix-Sum (Scan) Definition

Definition: The all-prefix-sums operation takes a binary

Why Exclusive Scan

Ø Inclusive and Exclusive scans can be easily derived

0 ∑x0..x1 ∑x0..x3 ∑x0..x5

0 x0 ∑x0..x1 ∑x0..x2 ∑x0..x3 ∑x0..x4 ∑x0..x5 27∑x0..x6

Exclusive Post Scan Step

Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes

Exclusive Scan Example – Reduction Step

Reduction Step (cont.)

Reduction Step (cont.)

Reduction Step (cont.)

Zero the Last Element

We now have an array of partial sums. Since this is an exclusive scan,

Post Scan Step from Partial Sums

Post Scan Step from Partial Sums

Post Scan Step from Partial Sums

Post Scan Step from Partial Sums

Total steps: 2 * log(n).

Ø The total number of adds is no more than twice that

Working on Arbitrary Length Input

Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes

Overall Flow of Complete Scan

You might also like