0% found this document useful (0 votes)
56 views40 pages

Lecture 10

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
56 views40 pages

Lecture 10

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Scan

Parallel Prefix Sum – Scan

Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes


Scan

Objective
Ø To master parallel Prefix Sum (Scan) algorithms
Ø Frequently used for parallel work assignment and resource
allocation
Ø A key primitive in many parallel algorithms to convert serial
computation into parallel computation
Ø Based on reduction tree and reverse reduction tree

Ø Reading – Mark Harris, Parallel Prefix Sum with CUDA


Ø [Link]
projects/scan/doc/[Link]

2
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

(Inclusive) Prefix-Sum (Scan) Definition


Definition: The all-prefix-sums operation takes a binary
associative operator ⊕, and an array of n elements!
[x0, x1, …, xn-1],!
!
and returns the array!
!
! ![x0, (x0 ⊕ x1), …, (x0 ⊕ x1 ⊕ … ⊕ xn-1)].!
!
Example: If ⊕ is addition, then the all-prefix-sums operation
on the array ! [3 1 7 0 4 1 6 3],!
would return! [3 4 11 11 15 16 22 25].!

3
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Inclusive Scan Application Example


Ø Assume we have a 100-inch sandwich to feed 10
Ø We know how many inches each person wants
Ø [3 5 2 7 28 4 3 0 8 1]
Ø How do we cut the sandwich quickly?
Ø How much will be left?

Ø Method 1: cut the sections sequentially: 3 inches first, 5


inches second, 2 inches third, etc.
Ø Method 2: calculate Prefix scan and cut in parallel
Ø [3, 8, 10, 17, 45, 49, 52, 52, 60, 61] (39 inches left)

4
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Typical Applications of Scan


Ø Scan is a simple and useful parallel building block
Ø Convert recurrences from sequential :
for(j=1;j<n;j++)
out[j] = out[j-1] + f(j);
Ø into parallel:
forall(j) { temp[j] = f(j) };
scan(out, temp);
Ø Useful for many parallel algorithms:
•Radix sort •Polynomial evaluation
•Quicksort •Solving recurrences
•String comparison •Tree operations
•Lexical analysis •Histograms
•Stream compaction •Etc.

Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes


Scan

Other Applications
Ø Assigning space in farmers market
Ø Allocating memory to parallel threads
Ø Allocating memory buffer for
communication channels
Ø …

6
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

A Inclusive Sequential Prefix-Sum


Given a sequence [x0, x1, x2, ... ]
Calculate output [y0, y1, y2, ... ]

Such that y0 = x0
y1 = x0 + x1
y 2 = x 0 + x 1+ x 2

Using a recursive definition
yi = yi − 1 + xi
7
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

A Work Efficient C Implementation

y[0] = x[0];
for (i=1; i < Max_i; i++)
y[i] = y[i-1] + x[i];

Computationally efficient:

N additions needed for N elements - O(N)

8
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

A Naïve Inclusive Parallel Scan


Ø Assign one thread to calculate each y
element
Ø Have every thread add up all x elements
needed for the y element
y0 = x0
y1 = x0 + x1
y 2 = x 0 + x 1+ x 2

Parallel programming is easy as long as you


don’t care about performance.
9
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

A Slightly Better Parallel Inclusive Scan Algorithm


1. Read input
T0 3 1 7 0 4 1 6 3 from device
memory to
shared
memory
Each thread reads one value from the input array
in device memory into shared memory array T0.
Thread 0 writes 0 into shared memory array.

10
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

T0 3 1 7 0 4 1 6 3 1. Read input from


device memory to
Stride 1 shared memory
2. Iterate log(n)
T1 3 4 8 7 4 5 7 9 times: Threads
stride to n: Add pairs
of elements stride
elements apart.
Double stride at
each iteration. (note:
must double buffer
shared mem arrays)

• Active threads: stride to n-1 (n-stride threads)


Iterate #1 • Thread j adds elements j and j-stride from T0 and
Stride = 1 writes result into shared memory buffer T1 (ping-pong) 11
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

T0 3 1 7 0 4 1 6 3 1. (Read input from


device memory to
Stride 1 shared memory
2. Iterate log(n)
T1 3 4 8 7 4 5 7 9 times: Threads
Stride 2 stride to n: Add pairs
of elements stride
T0 3 4 11 11 12 12 11 14 elements apart.
Double stride at
each iteration. (note:
must double buffer
shared mem arrays)

• Active threads: stride to n-1 (n-stride threads)


Iterate #2 • Thread j adds elements j and j-stride from T1 and
Stride = 2 writes result into shared memory buffer T0 (ping-pong) 12
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

T0 3 1 7 0 4 1 6 3 1. (Read input from


device memory to
Stride 1 shared memory
2. Iterate log(n)
T1 3 4 8 7 4 5 7 9 times: Threads
Stride 2 stride to n: Add pairs
of elements stride
T0 3 4 11 11 12 12 11 14 elements apart.
Double stride at
Stride 4 each iteration. (note:
must double buffer
T1 3 4 11 11 15 16 22 25 shared mem arrays)
3. Write output from
shared memory to
device memory
Iterate #3
Stride = 4 13
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Work Efficiency Considerations

Ø The first-attempt Scan executes log(n) parallel


iterations
Ø The steps do (n-1), (n-2), (n-4),..(n - n/2) adds each
Ø Total adds: n * log(n) - (n-1) à O(n*log(n)) work

Ø This scan algorithm is not very work efficient


Ø Sequential scan algorithm does n adds
Ø A factor of log(n) hurts: 20x for 10^6 elements!

Ø A parallel algorithm can be slow when execution


resources are saturated due to low work
efficiency
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Improving Efficiency
Ø A common parallel algorithm pattern:
Balanced Trees
Ø Build a balanced binary tree on the input data and sweep it to
and from the root
Ø Tree is not an actual data structure, but a concept to determine
what each thread does at each step
Ø For scan:
Ø Traverse down from leaves to root building partial sums at
internal nodes in the tree
Ø Root holds sum of all leaves
Ø Traverse back up the tree building the scan from the partial sums

Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes


Scan

Let’s Look at the Reduction Tree Again


3 1 7 0 4 1 6 3

+ + + +

4 7 5 9

+ +

11 14

+
25 16
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Parallel Scan – Reduction Step


x0 x1 x2 x3 x4 x5 x6 x7

+ + + +

Time! ∑x0..x1 ∑x2..x3 ∑x4..x5 ∑x6..x7

+ +

∑x0..x3 ∑x4..x7

+
In place calculation ! ∑x x
0.. 7
Final value after reduce 17
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Inclusive Post Scan Step
x0 ∑x0..x1 x2 ∑x0..x3 x4 ∑x4..x5 x6 ∑x0..x7

∑x0..x5

Move (add) a critical value to a


central location where it is
needed!

18
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Inclusive Post Scan Step
x0 ∑x0..x1 x2 ∑x0..x3 x4 ∑x4..x5 x6 ∑x0..x7

∑x0..x5

+ + +

∑x0..x2 ∑x0..x4 ∑x0..x6

19
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Putting it Together

20
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Reduction Step Kernel Code


// scan_array[2*BLOCK_SIZE] is in shared memory

int stride = 1;
while(stride <= BLOCK_SIZE)
{
int index = (threadIdx.x+1)*stride*2 - 1;
if(index < 2*BLOCK_SIZE)
scan_array[index] += scan_array[index-stride];
stride = stride*2;
threadIdx.x+1 = 1, 2, 3, 4….!
__syncthreads(); stride = 1, index = !
}

21
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Sum of left half

Copyright © 2013 by Yong


22 Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Post Scan Step


int stride = BLOCK_SIZE/2;
while(stride > 0)
{
int index = (threadIdx.x+1)*stride*2 - 1;
if((index+stride) < 2*BLOCK_SIZE)
{
scan_array[index+stride] += scan_array[index];
}
stride = stride/2;
__syncthreads();
}

23
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

(Exclusive) Prefix-Sum (Scan) Definition

Definition: The all-prefix-sums operation takes a binary


associative operator ⊕, and an array of n elements!
[x0, x1, …, xn-1],!
!
and returns the array!
!
! ![0, x0, (x0 ⊕ x1), …, (x0 ⊕ x1 ⊕ … ⊕ xn-2)].!
!
Example: If ⊕ is addition, then the all-prefix-sums operation
on the array ! ! [3 1 7 0 4 1 6 3],!
would return! [0 3 4 11 11 15 16 22].!

24
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Why Exclusive Scan


Ø To find the beginning address of allocated buffers

Ø Inclusive and Exclusive scans can be easily derived


from each other; it is a matter of convenience

! ![3 1 7 0 4 1 6 3]!
!
Exclusive ![0 3 4 11 11 15 16 22]!
!
Inclusive ![3 4 11 11 15 16 22 25]!
!

25
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Exclusive Post Scan Step
(Add-move Operation)
x0 ∑x0..x1 x2 ∑x0..x3 x4 ∑x4..x5 x6 0

0 +

∑x0..x3

26
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan
Exclusive Post Scan Step
x0 ∑x0..x1 x2 ∑x0..x3 x4 ∑x4..x5 x6 0

0 +

∑x0..x3

+ +

0 ∑x0..x1 ∑x0..x3 ∑x0..x5

+ + + +

0 x0 ∑x0..x1 ∑x0..x2 ∑x0..x3 ∑x0..x4 ∑x0..x5 27∑x0..x6


Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Exclusive Post Scan Step


if (threadIdx.x==0) scan_array[2*blockDim.x-1] = 0;
int stride = BLOCK_SIZE;
while(stride > 0)
{
int index = (threadIdx.x+1)*stride*2 - 1;
if(index < 2* BLOCK_SIZE)
{
float temp = scan_array[index];
scan_array[index] += scan_array[index-stride];
scan_array[index-stride] = temp;
}
stride = stride / 2;
__syncthreads();
}

Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes


Scan

Exclusive Scan Example – Reduction Step


T 3 1 7 0 4 1 6 3
Assume array is already in shared memory

29
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Reduction Step (cont.)


T 3 1 7 0 4 1 6 3
Stride 1 Iteration 1, n/2 threads

T 3 4 7 7 4 5 6 9

Each corresponds
to a single thread.

Iterate log(n) times. Each thread adds value stride elements away to its own value

30
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Reduction Step (cont.)


T 3 1 7 0 4 1 6 3
Stride 1

T 3 4 7 7 4 5 6 9
Stride 2 Iteration 2, n/4 threads

T 3 4 7 11 4 5 6 14

Each corresponds
to a single thread.

Iterate log(n) times. Each thread adds value stride elements away to its own value

31
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Reduction Step (cont.)


T 3 1 7 0 4 1 6 3
Stride 1

T 3 4 7 7 4 5 6 9
Stride 2

T 3 4 7 11 4 5 6 14
Stride 4 Iteration log(n), 1 thread

T 3 4 7 11 4 5 6 25 Each corresponds
to a single thread.

Iterate log(n) times. Each thread adds value stride elements away to its own value.

Note that this algorithm operates in-place: no need for double buffering

32
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Zero the Last Element

T 3 4 7 11 4 5 6 0

We now have an array of partial sums. Since this is an exclusive scan,


set the last element to zero. It will propagate back to the first element.

33
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Post Scan Step from Partial Sums


T 3 4 7 11 4 5 6 0

34
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Post Scan Step from Partial Sums


T 3 4 7 11 4 5 6 0
Stride 4 Iteration 1
1 thread
T 3 4 7 0 4 5 6 11

Each corresponds
to a single thread.

Iterate log(n) times. Each thread adds value stride elements away to its own value,
and sets the value stride elements away to its own previous value.

35
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Post Scan Step from Partial Sums


T 3 4 7 11 4 5 6 0
Stride 4

T 3 4 7 0 4 5 6 11
Iteration 2
Stride 2
2 threads

T 3 0 7 4 4 11 6 16

Each corresponds
to a single thread.

Iterate log(n) times. Each thread adds value stride elements away to its own value,
and sets the value stride elements away to its own previous value.

36
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Post Scan Step from Partial Sums


T 3 4 7 11 4 5 6 0
Stride 4

T 3 4 7 0 4 5 6 11
Stride 2

T 3 0 7 4 4 11 6 16
Iteration log(n)
Stride 1
n/2 threads

T 0 3 4 11 11 15 16 22 Each corresponds
to a single thread.

Done! We now have a completed scan that we can write out to device memory.

Total steps: 2 * log(n).


Total work: 2 * (n-1) adds = O(n) Work Efficient!
37
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Work Analysis
Ø The parallel Inclusive Scan executes 2*log(n) parallel iterations
Ø log(n) in reduction and log(n) in post scan
Ø The iterations do n/2, n/4,..1, 1, …., n/4, n/2 adds
Ø Total adds: 2* (n-1) à O(n) work

Ø The total number of adds is no more than twice that


done in the efficient sequential algorithm
Ø The benefit of parallelism can easily overcome the 2X work
when there is sufficient hardware

38
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes
Scan

Working on Arbitrary Length Input


Ø Build on the scan kernel that handles up to
2*blockDim.x elements
Ø Assign each section of 2*blockDim elements to a
block
Ø Have each block write the sum of its section into a
Sum array indexed by blockIdx.x
Ø Run parallel scan on the Sum array
Ø May need to break down Sum into multiple sections if it is too big
for a block
Ø Add the scanned Sum array values to the elements of
corresponding sections

Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes


Scan

Overall Flow of Complete Scan

40
Copyright © 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes

You might also like