Scan
Parallel Prefix Sum – Scan
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Objective
Ø To master parallel Prefix Sum (Scan) algorithms
Ø Frequently used for parallel work assignment and resource
allocation
Ø A key primitive in many parallel algorithms to convert serial
computation into parallel computation
Ø Based on reduction tree and reverse reduction tree
Ø Reading – Mark Harris, Parallel Prefix Sum with CUDA
Ø [Link]
projects/scan/doc/[Link]
2
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
(Inclusive) Prefix-Sum (Scan) Definition
Definition: The all-prefix-sums operation takes a binary
associative operator ⊕, and an array of n elements!
[x0, x1, …, xn-1],!
!
and returns the array!
!
! ![x0, (x0 ⊕ x1), …, (x0 ⊕ x1 ⊕ … ⊕ xn-1)].!
!
Example: If ⊕ is addition, then the all-prefix-sums operation
on the array ! [3 1 7 0 4 1 6 3],!
would return! [3 4 11 11 15 16 22 25].!
3
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Inclusive Scan Application Example
Ø Assume we have a 100-inch sandwich to feed 10
Ø We know how many inches each person wants
Ø [3 5 2 7 28 4 3 0 8 1]
Ø How do we cut the sandwich quickly?
Ø How much will be left?
Ø Method 1: cut the sections sequentially: 3 inches first, 5
inches second, 2 inches third, etc.
Ø Method 2: calculate Prefix scan and cut in parallel
Ø [3, 8, 10, 17, 45, 49, 52, 52, 60, 61] (39 inches left)
4
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Typical Applications of Scan
Ø Scan is a simple and useful parallel building block
Ø Convert recurrences from sequential :
for(j=1;j<n;j++)
out[j] = out[j-1] + f(j);
Ø into parallel:
forall(j) { temp[j] = f(j) };
scan(out, temp);
Ø Useful for many parallel algorithms:
•Radix sort •Polynomial evaluation
•Quicksort •Solving recurrences
•String comparison •Tree operations
•Lexical analysis •Histograms
•Stream compaction •Etc.
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Other Applications
Ø Assigning space in farmers market
Ø Allocating memory to parallel threads
Ø Allocating memory buffer for
communication channels
Ø …
6
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
A Inclusive Sequential Prefix-Sum
Given a sequence [x0, x1, x2, ... ]
Calculate output [y0, y1, y2, ... ]
Such that y0 = x0
y1 = x0 + x1
y 2 = x 0 + x 1+ x 2
…
Using a recursive definition
yi = yi − 1 + xi
7
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
A Work Efficient C Implementation
y[0] = x[0];
for (i=1; i < Max_i; i++)
y[i] = y[i-1] + x[i];
Computationally efficient:
N additions needed for N elements - O(N)
8
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
A Naïve Inclusive Parallel Scan
Ø Assign one thread to calculate each y
element
Ø Have every thread add up all x elements
needed for the y element
y0 = x0
y1 = x0 + x1
y 2 = x 0 + x 1+ x 2
Parallel programming is easy as long as you
don’t care about performance.
9
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
A Slightly Better Parallel Inclusive Scan Algorithm
1. Read input
T0 3 1 7 0 4 1 6 3 from device
memory to
shared
memory
Each thread reads one value from the input array
in device memory into shared memory array T0.
Thread 0 writes 0 into shared memory array.
10
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
T0 3 1 7 0 4 1 6 3 1. Read input from
device memory to
Stride 1 shared memory
2. Iterate log(n)
T1 3 4 8 7 4 5 7 9 times: Threads
stride to n: Add pairs
of elements stride
elements apart.
Double stride at
each iteration. (note:
must double buffer
shared mem arrays)
• Active threads: stride to n-1 (n-stride threads)
Iterate #1 • Thread j adds elements j and j-stride from T0 and
Stride = 1 writes result into shared memory buffer T1 (ping-pong) 11
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
T0 3 1 7 0 4 1 6 3 1. (Read input from
device memory to
Stride 1 shared memory
2. Iterate log(n)
T1 3 4 8 7 4 5 7 9 times: Threads
Stride 2 stride to n: Add pairs
of elements stride
T0 3 4 11 11 12 12 11 14 elements apart.
Double stride at
each iteration. (note:
must double buffer
shared mem arrays)
• Active threads: stride to n-1 (n-stride threads)
Iterate #2 • Thread j adds elements j and j-stride from T1 and
Stride = 2 writes result into shared memory buffer T0 (ping-pong) 12
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
T0 3 1 7 0 4 1 6 3 1. (Read input from
device memory to
Stride 1 shared memory
2. Iterate log(n)
T1 3 4 8 7 4 5 7 9 times: Threads
Stride 2 stride to n: Add pairs
of elements stride
T0 3 4 11 11 12 12 11 14 elements apart.
Double stride at
Stride 4 each iteration. (note:
must double buffer
T1 3 4 11 11 15 16 22 25 shared mem arrays)
3. Write output from
shared memory to
device memory
Iterate #3
Stride = 4 13
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Work Efficiency Considerations
Ø The first-attempt Scan executes log(n) parallel
iterations
Ø The steps do (n-1), (n-2), (n-4),..(n - n/2) adds each
Ø Total adds: n * log(n) - (n-1) à O(n*log(n)) work
Ø This scan algorithm is not very work efficient
Ø Sequential scan algorithm does n adds
Ø A factor of log(n) hurts: 20x for 10^6 elements!
Ø A parallel algorithm can be slow when execution
resources are saturated due to low work
efficiency
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Improving Efficiency
Ø A common parallel algorithm pattern:
Balanced Trees
Ø Build a balanced binary tree on the input data and sweep it to
and from the root
Ø Tree is not an actual data structure, but a concept to determine
what each thread does at each step
Ø For scan:
Ø Traverse down from leaves to root building partial sums at
internal nodes in the tree
Ø Root holds sum of all leaves
Ø Traverse back up the tree building the scan from the partial sums
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Let’s Look at the Reduction Tree Again
3 1 7 0 4 1 6 3
+ + + +
4 7 5 9
+ +
11 14
+
25 16
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Parallel Scan – Reduction Step
x0 x1 x2 x3 x4 x5 x6 x7
+ + + +
Time! ∑x0..x1 ∑x2..x3 ∑x4..x5 ∑x6..x7
+ +
∑x0..x3 ∑x4..x7
+
In place calculation ! ∑x x
0.. 7
Final value after reduce 17
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Inclusive Post Scan Step
x0 ∑x0..x1 x2 ∑x0..x3 x4 ∑x4..x5 x6 ∑x0..x7
∑x0..x5
Move (add) a critical value to a
central location where it is
needed!
18
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Inclusive Post Scan Step
x0 ∑x0..x1 x2 ∑x0..x3 x4 ∑x4..x5 x6 ∑x0..x7
∑x0..x5
+ + +
∑x0..x2 ∑x0..x4 ∑x0..x6
19
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Putting it Together
20
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Reduction Step Kernel Code
// scan_array[2*BLOCK_SIZE] is in shared memory
int stride = 1;
while(stride <= BLOCK_SIZE)
{
int index = (threadIdx.x+1)*stride*2 - 1;
if(index < 2*BLOCK_SIZE)
scan_array[index] += scan_array[index-stride];
stride = stride*2;
threadIdx.x+1 = 1, 2, 3, 4….!
__syncthreads(); stride = 1, index = !
}
21
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Sum of left half
Copyright © 2013 by Yong
22 Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Post Scan Step
int stride = BLOCK_SIZE/2;
while(stride > 0)
{
int index = (threadIdx.x+1)*stride*2 - 1;
if((index+stride) < 2*BLOCK_SIZE)
{
scan_array[index+stride] += scan_array[index];
}
stride = stride/2;
__syncthreads();
}
23
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
(Exclusive) Prefix-Sum (Scan) Definition
Definition: The all-prefix-sums operation takes a binary
associative operator ⊕, and an array of n elements!
[x0, x1, …, xn-1],!
!
and returns the array!
!
! ![0, x0, (x0 ⊕ x1), …, (x0 ⊕ x1 ⊕ … ⊕ xn-2)].!
!
Example: If ⊕ is addition, then the all-prefix-sums operation
on the array ! ! [3 1 7 0 4 1 6 3],!
would return! [0 3 4 11 11 15 16 22].!
24
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Why Exclusive Scan
Ø To find the beginning address of allocated buffers
Ø Inclusive and Exclusive scans can be easily derived
from each other; it is a matter of convenience
! ![3 1 7 0 4 1 6 3]!
!
Exclusive ![0 3 4 11 11 15 16 22]!
!
Inclusive ![3 4 11 11 15 16 22 25]!
!
25
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Exclusive Post Scan Step
(Add-move Operation)
x0 ∑x0..x1 x2 ∑x0..x3 x4 ∑x4..x5 x6 0
0 +
∑x0..x3
26
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Exclusive Post Scan Step
x0 ∑x0..x1 x2 ∑x0..x3 x4 ∑x4..x5 x6 0
0 +
∑x0..x3
+ +
0 ∑x0..x1 ∑x0..x3 ∑x0..x5
+ + + +
0 x0 ∑x0..x1 ∑x0..x2 ∑x0..x3 ∑x0..x4 ∑x0..x5 27∑x0..x6
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Exclusive Post Scan Step
if (threadIdx.x==0) scan_array[2*blockDim.x-1] = 0;
int stride = BLOCK_SIZE;
while(stride > 0)
{
int index = (threadIdx.x+1)*stride*2 - 1;
if(index < 2* BLOCK_SIZE)
{
float temp = scan_array[index];
scan_array[index] += scan_array[index-stride];
scan_array[index-stride] = temp;
}
stride = stride / 2;
__syncthreads();
}
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Exclusive Scan Example – Reduction Step
T 3 1 7 0 4 1 6 3
Assume array is already in shared memory
29
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Reduction Step (cont.)
T 3 1 7 0 4 1 6 3
Stride 1 Iteration 1, n/2 threads
T 3 4 7 7 4 5 6 9
Each corresponds
to a single thread.
Iterate log(n) times. Each thread adds value stride elements away to its own value
30
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Reduction Step (cont.)
T 3 1 7 0 4 1 6 3
Stride 1
T 3 4 7 7 4 5 6 9
Stride 2 Iteration 2, n/4 threads
T 3 4 7 11 4 5 6 14
Each corresponds
to a single thread.
Iterate log(n) times. Each thread adds value stride elements away to its own value
31
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Reduction Step (cont.)
T 3 1 7 0 4 1 6 3
Stride 1
T 3 4 7 7 4 5 6 9
Stride 2
T 3 4 7 11 4 5 6 14
Stride 4 Iteration log(n), 1 thread
T 3 4 7 11 4 5 6 25 Each corresponds
to a single thread.
Iterate log(n) times. Each thread adds value stride elements away to its own value.
Note that this algorithm operates in-place: no need for double buffering
32
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Zero the Last Element
T 3 4 7 11 4 5 6 0
We now have an array of partial sums. Since this is an exclusive scan,
set the last element to zero. It will propagate back to the first element.
33
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Post Scan Step from Partial Sums
T 3 4 7 11 4 5 6 0
34
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Post Scan Step from Partial Sums
T 3 4 7 11 4 5 6 0
Stride 4 Iteration 1
1 thread
T 3 4 7 0 4 5 6 11
Each corresponds
to a single thread.
Iterate log(n) times. Each thread adds value stride elements away to its own value,
and sets the value stride elements away to its own previous value.
35
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Post Scan Step from Partial Sums
T 3 4 7 11 4 5 6 0
Stride 4
T 3 4 7 0 4 5 6 11
Iteration 2
Stride 2
2 threads
T 3 0 7 4 4 11 6 16
Each corresponds
to a single thread.
Iterate log(n) times. Each thread adds value stride elements away to its own value,
and sets the value stride elements away to its own previous value.
36
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Post Scan Step from Partial Sums
T 3 4 7 11 4 5 6 0
Stride 4
T 3 4 7 0 4 5 6 11
Stride 2
T 3 0 7 4 4 11 6 16
Iteration log(n)
Stride 1
n/2 threads
T 0 3 4 11 11 15 16 22 Each corresponds
to a single thread.
Done! We now have a completed scan that we can write out to device memory.
Total steps: 2 * log(n).
Total work: 2 * (n-1) adds = O(n) Work Efficient!
37
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Work Analysis
Ø The parallel Inclusive Scan executes 2*log(n) parallel iterations
Ø log(n) in reduction and log(n) in post scan
Ø The iterations do n/2, n/4,..1, 1, …., n/4, n/2 adds
Ø Total adds: 2* (n-1) à O(n) work
Ø The total number of adds is no more than twice that
done in the efficient sequential algorithm
Ø The benefit of parallelism can easily overcome the 2X work
when there is sufficient hardware
38
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Working on Arbitrary Length Input
Ø Build on the scan kernel that handles up to
2*blockDim.x elements
Ø Assign each section of 2*blockDim elements to a
block
Ø Have each block write the sum of its section into a
Sum array indexed by blockIdx.x
Ø Run parallel scan on the Sum array
Ø May need to break down Sum into multiple sections if it is too big
for a block
Ø Add the scanned Sum array values to the elements of
corresponding sections
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Overall Flow of Complete Scan
40
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes