Data parallel algorithms,
algorithmic building blocks,
precision vs. accuracy
Robert Strzodka
ARCS 2008 Architecture of Computing Systems
GPGPU and CUDA Tutorials
Dresden, Germany, February 25, 2008
Overview
Parallel Processing on GPUs
Types of Parallel Data Flow
Parallel Prefix or Scan
Precision and Accuracy
The GPU is a Fast, Parallel Array Processor
Input arrays: 2D (typical), also 1D and 3D
Output arrays: 2D (typical), also 1D and 3D (slice)
Vertex Processor (VP): kernel changes index regions of output arrays
Rasterizer: creates data streams from index regions, i.e. a stream of array elements in unknown order
Fragment Processor (FP): kernel changes each datum independently, reads more input arrays
The GPU is a Fast, Highly Multi-Threaded Processor
Input arrays: nD
Output arrays: nD
Start thousands of parallel threads in groups of m, e.g. 32
Each group operates in a SIMD fashion, with predication if necessary
In general all threads are independent, but certain collections of groups may use local memory to exchange data
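As a minimal CUDA sketch of this execution model (kernel and array names are made up for illustration), each thread handles one array element, threads are launched in blocks that the hardware schedules in SIMD groups (warps) of 32, and the bounds check acts as the predication for the last partial group:

    // One thread per output element (illustrative example).
    __global__ void saxpy(int n, float a, const float* x, const float* y, float* out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // predication for out-of-range threads
            out[i] = a * x[i] + y[i];
    }

    // Host side: start thousands of threads in blocks of 256.
    // saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y, d_out);

Threads of the same block can additionally exchange data through __shared__ (local) memory.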
Native Memory Layout: Data Locality
CPU: 1D input and 1D output; other dimensions addressed with offsets
GPU: 2D input and 2D output; other dimensions addressed with offsets
(Figure: input and output arrays with color-coded locality, red = near, blue = far.)
Primitive Index Regions in Output Arrays
Quads and triangles: fastest option
Line segments: slower; try to pair lines into 2xh or wx2 quads
Point clouds: slowest; try to gather points into larger forms
GPUs are Optimized for Local Data Access
Memory access types: cache, sequential, random
CPU: large cache, few processing elements, optimized for spatial and temporal data reuse
GPU: small cache, many processing elements, optimized for sequential (streaming) data access
(Chart: memory access performance of a GeForce 7800 GTX vs. a Pentium 4; chart courtesy of Ian Buck.)
Input and Output Arrays
CPU: input and output arrays may overlap
GPU: input and output arrays must not overlap
Configuration Overhead
(Chart: configuration-limited vs. computation-limited regimes; chart courtesy of Ian Buck.)
Overview
Parallel Processing on GPUs
Types of Parallel Data Flow
Parallel Prefix or Scan
Precision and Accuracy
Parallel Data-Flow: Map, Gather and Scatter
Map: x = f(a)
Gather: x = f( a(1,2), a(3,5), ... )
Scatter: x(2,3) = f(a), x(6,7) = g(a), ...
General: x(2,3) = f( a(1,2), a(3,5), ... ), x(6,7) = f( a(6,2), a(7,5), ... ), ...
(Figures: input and output arrays for each data-flow type.)
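A hedged CUDA sketch of the three basic data-flow types (array names and index arrays are illustrative): map and gather use one thread per output element, scatter uses one thread per input element.

    // Map: each output element depends only on the input element with the same index.
    __global__ void map_kernel(int n, const float* a, float* x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = 2.0f * a[i];                    // x = f(a)
    }

    // Gather: each output element reads several input elements at computed positions.
    __global__ void gather_kernel(int n, const float* a, const int* p, const int* q, float* x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = a[p[i]] + a[q[i]];              // x = f( a(p), a(q), ... )
    }

    // Scatter: each input element writes to a computed output position.
    __global__ void scatter_kernel(int n, const float* a, const int* dest, float* x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[dest[i]] = 3.0f * a[i];              // x(dest) = f(a); colliding writes need extra care
    }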
Performance of Gather and Scatter
(Chart: gather vs. scatter performance; chart courtesy of Naga Govindaraju.)
Parallel Reduction, e.g. Maximum of an Array
For each output element, gather a 2x2 region of the input and store its maximum: an N x N input yields an N/2 x N/2 array of intermediates.
Repeat the 2x2 gather-and-reduce on the intermediates until a single result, the maximum, remains.
For commutative, associative operators (e.g. +, *, max) this is encapsulated into a single function call.
For a more detailed discussion see Mark Harris' CUDA optimization talk from SC 2007: [Link]
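A hedged 1D CUDA sketch of one reduction pass (identifiers are illustrative; the 2D 2x2 gather from the slide works analogously): each block reduces its tile in shared memory and writes one intermediate value, and the kernel is re-launched on the intermediates until a single value remains.

    #include <float.h>  // FLT_MAX

    // One pass of a max reduction: each block writes the maximum of its 256-element tile.
    __global__ void max_reduce_pass(int n, const float* in, float* out)
    {
        __shared__ float tile[256];                     // assumes blockDim.x == 256
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;
        tile[tid] = (i < n) ? in[i] : -FLT_MAX;         // pad out-of-range threads
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction in shared memory
            if (tid < s) tile[tid] = fmaxf(tile[tid], tile[tid + s]);
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = tile[0];        // one intermediate value per block
    }

    // Re-launch on the intermediates until one value remains, e.g.
    // max_reduce_pass<<<numBlocks, 256>>>(n, d_in, d_intermediate);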
Overview
Parallel Processing on GPUs
Types of Parallel Data Flow
Parallel Prefix or Scan
Precision and Accuracy
(Scan slides courtesy of Shubho Sengupta.)
Motivation
(Figures: stream compaction and split examples.)
Common scenarios in parallel computing:
Variable output per thread
Threads want to perform a split (radix sort, building trees)
What came before/after me?
Where do I start writing my data?
Scan answers these questions.
Scan
Exclusive scan: each element is the sum of all the elements to the left of it.
Inclusive scan: each element is the sum of all the elements to the left of it, and itself.

Input:      3  1  7  0  4  1  6  3
Exclusive:  0  3  4 11 11 15 16 22
Inclusive:  3  4 11 11 15 16 22 25
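A short sequential C++ reference (illustrative names) makes the two definitions concrete; it also reproduces the example rows above.

    #include <vector>

    // Exclusive scan: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
    std::vector<int> exclusive_scan(const std::vector<int>& in) {
        std::vector<int> out(in.size());
        int sum = 0;
        for (std::size_t i = 0; i < in.size(); ++i) { out[i] = sum; sum += in[i]; }
        return out;
    }

    // Inclusive scan: out[i] = in[0] + ... + in[i].
    std::vector<int> inclusive_scan(const std::vector<int>& in) {
        std::vector<int> out(in.size());
        int sum = 0;
        for (std::size_t i = 0; i < in.size(); ++i) { sum += in[i]; out[i] = sum; }
        return out;
    }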
Scan the past
First proposed in APL (1962)
Used as a data parallel primitive in the Connection
Machine (1990)
Guy Blelloch used scan as a primitive for various
parallel algorithms (1990)
Scan the present
First GPU implementation by Daniel Horn (2004), O(n log n)
Subsequent GPU implementations by Hensley (2005) O(n log n), Sengupta (2006) O(n), Greß (2006) O(n) 2D
NVIDIA CUDA implementation by Mark Harris (2007), O(n), space efficient
Scan - Reduce (Up-Sweep)
(Figure: up-sweep over the example array, building pairwise partial sums up a tree to the total 25 at the root.)
log n steps
Work halves each step
O(n) work
In place, space efficient
Scan - Down Sweep
(Figure: down-sweep over the reduction tree, producing the exclusive scan 0 3 4 11 11 15 16 22 of the example input.)
log n steps
Work doubles each step
O(n) work
In place, space efficient
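A hedged CUDA sketch of both phases for a single block (identifiers are illustrative; n must be a power of two that fits into shared memory, and bank-conflict avoidance is omitted), following the classic work-efficient formulation:

    // Exclusive scan of n elements within one block, using n/2 threads.
    __global__ void blelloch_scan(float* data, int n)
    {
        extern __shared__ float temp[];                 // n floats of shared memory
        int tid = threadIdx.x;
        temp[2*tid]   = data[2*tid];                    // each thread loads two elements
        temp[2*tid+1] = data[2*tid+1];

        int offset = 1;
        for (int d = n >> 1; d > 0; d >>= 1) {          // up-sweep: work halves each step
            __syncthreads();
            if (tid < d) {
                int ai = offset*(2*tid+1) - 1;
                int bi = offset*(2*tid+2) - 1;
                temp[bi] += temp[ai];
            }
            offset <<= 1;
        }

        if (tid == 0) temp[n-1] = 0;                    // clear the root for an exclusive scan

        for (int d = 1; d < n; d <<= 1) {               // down-sweep: work doubles each step
            offset >>= 1;
            __syncthreads();
            if (tid < d) {
                int ai = offset*(2*tid+1) - 1;
                int bi = offset*(2*tid+2) - 1;
                float t   = temp[ai];
                temp[ai]  = temp[bi];
                temp[bi] += t;
            }
        }
        __syncthreads();
        data[2*tid]   = temp[2*tid];                    // write back the exclusive scan
        data[2*tid+1] = temp[2*tid+1];
    }

    // Launch (illustrative): blelloch_scan<<<1, n/2, n * sizeof(float)>>>(d_data, n);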
Segmented Scan
Scan within each segment in parallel.
(Figure: example input array with segment boundaries and the per-segment scan output.)
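A sequential C++ reference for the semantics (illustrative names; flags[i] = 1 marks the first element of a segment): the running sum simply restarts at every segment boundary.

    #include <vector>

    // Exclusive segmented scan: the sum restarts at 0 wherever flags[i] == 1.
    std::vector<int> segmented_exclusive_scan(const std::vector<int>& in,
                                              const std::vector<int>& flags) {
        std::vector<int> out(in.size());
        int sum = 0;
        for (std::size_t i = 0; i < in.size(); ++i) {
            if (flags[i]) sum = 0;                      // a new segment starts here
            out[i] = sum;
            sum += in[i];
        }
        return out;
    }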
Segmented Scan
Introduced by Schwartz (1980)
Forms the basis for a wide variety of algorithms:
Radix sort, Quicksort
Sparse Matrix-Vector Multiply
Convex Hull
Solving recurrences
Tree operations
Segmented Scan: Large Input
(Figure only.)
Segmented Scan: Advantages
Operations run in parallel over all the segments
Handles irregular workloads, since segments can be of any length
Can simulate divide-and-conquer recursion, since additional segments can be generated
Segmented Scan Variant: Tridiagonal Solver
Tridiagonal system of n rows solved in parallel
Then for each of the m columns in parallel
Read pattern is similar to but more complex than scan
Overview
Parallel Processing on GPUs
Types of Parallel Data Flow
Parallel Prefix or Scan
Precision and Accuracy
The Erratic Roundoff Error
Roundoff error for 0 = f(a) := |(1+a)^3 - (1+3a^2) - (3a+a^3)|
(Plot: y = log2(f(a)), with exact zeros mapped to 2^-100, against x = log2(1/a), i.e. a = 1/2^x, for x = 10..50, in single and double precision; smaller is better.)
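A small host-side C sketch reproduces the experiment behind the plot (the exact error curves depend on compiler flags and rounding): f is mathematically zero, so everything printed is roundoff.

    #include <math.h>
    #include <stdio.h>

    // f(a) = (1+a)^3 - (1+3a^2) - (3a+a^3) == 0; the printed values are pure roundoff error.
    int main(void)
    {
        for (int x = 10; x <= 50; ++x) {
            double a  = ldexp(1.0, -x);   // a = 1 / 2^x
            float  af = (float)a;
            float  ef = fabsf((1.0f+af)*(1.0f+af)*(1.0f+af) - (1.0f+3.0f*af*af) - (3.0f*af + af*af*af));
            double ed = fabs ((1.0+a)*(1.0+a)*(1.0+a) - (1.0+3.0*a*a) - (3.0*a + a*a*a));
            printf("x = %2d   single: %g   double: %g\n", x, (double)ef, ed);
        }
        return 0;
    }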
Precision and Accuracy
There is no monotonic relation between the computational precision and the accuracy of the final result.
Increasing precision can decrease accuracy!
Increasing or decreasing the precision in different parts of a computation can have very different impacts on the accuracy.
This can be exploited to significantly reduce the precision in parts of a computation without a loss in accuracy.
We obtain a mixed precision method.
Quantization - Preserving Accuracy
Watch out for cancellation:
  a ≈ b: prefer r = c(a-b) over r = c*a - c*b
Maximize operations on the same scale:
  a ∈ [0,1], b,c ∈ [10^-4, 10^-3]: prefer r = a+(b+c) over r = a+b+c
Make implicit relations between constants explicit:
  a_i = 0.01, i = 0..99: set a_99 = 1 - sum_{i<99} a_i, so that r = sum_{i<100} a_i = 1 exactly rather than r = sum_{i<100} a_i ≈ 1
Use symmetric intervals for multiplication:
  a ∈ [-1,1]: prefer r = 0.1134*a + 0.1134 over r = 0.1134*(a+1)
Minimize the number of multiplications:
  prefer r = 0.1(a+b) + 0.15(a+c) over r = 0.25a + 0.1b + 0.15c
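A tiny single precision example of the first rule (the numbers are arbitrary): the factored form rounds the small difference only once, while c*a - c*b rounds the two large products before cancelling them, so the two evaluations differ noticeably.

    #include <stdio.h>

    int main(void)
    {
        float a = 1.0000001f, b = 1.0f, c = 1.0e8f;   // a is close to b, c is a large common factor
        float r1 = c*a - c*b;                         // large products are rounded, then cancelled
        float r2 = c*(a - b);                         // the small difference is formed first
        printf("c*a - c*b = %.8g   c*(a-b) = %.8g\n", r1, r2);
        return 0;
    }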
Resources for Signed Integer Operations
(Table: area and latency of min(r,0), max(r,0), add(r1,r2), sub(r1,r2), add(r1,r2,r3), add(r4,r5), mult(r1,r2), sqr(r) and sqrt(r), given as formulas in b and c such as b+1, 2b, b(b-2), b·ld(b), 2c(c-5) and c(c+3).)
b: bit length of argument, c: bit length of result
FPGA Results: Conjugate Gradient (CG)
(Plot: area in number of slices of s??e11 float kernels on the xc2v500/xc2v8000, for the adder, the multiplier and the CG kernel normalized by 1/30, against 20 to 50 bits of mantissa; smaller is better.)
[Göddeke et al., Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations, IJPEDS 2007]
High Precision Emulation
Given an m x m bit unsigned integer multiplier, we want to build an n x n multiplier with an n = k*m bit result:

\sum_{i=1}^{k}\sum_{j=1}^{k} 2^{im} a_i \, 2^{jm} b_j = \sum_{i,j=1;\ i+j \le k+1}^{k} 2^{(i+j)m} a_i b_j + \sum_{i,j=1;\ i+j > k+1}^{k} 2^{(i+j)m} a_i b_j

The evaluation of the first sum requires k(k+1)/2 multiplications; the evaluation of the second depends on the rounding mode.
For floating point numbers additional operations are necessary because of the mantissa/exponent distinction.
A double emulation with two aligned s23e8 single floats is less complex than an exact s52e11 double emulation, achieves s46e8 precision, and still requires 10-20 single float operations.
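A hedged CUDA sketch of the two-float idea (the standard double-single addition, with illustrative names; special cases such as overflow are ignored): a value is kept as an unevaluated sum hi + lo of two aligned single floats.

    // Double-single number: value = hi + lo with |lo| much smaller than |hi|.
    __device__ float2 ds_add(float2 a, float2 b)
    {
        float s = a.x + b.x;                      // add the leading parts
        float v = s - a.x;
        float e = (a.x - (s - v)) + (b.x - v);    // exact rounding error of that addition (two-sum)
        e += a.y + b.y;                           // fold in the low-order parts
        float hi = s + e;                         // renormalize the pair
        float lo = e - (hi - s);
        return make_float2(hi, lo);
    }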
Precision Performance Rough Estimates
Reconfigurable device, e.g. FPGA:
  2x float add ≈ 1x double add
  4x float mul ≈ 1x double mul
Hardware emulation (compute area limited), e.g. GPU:
  2x float add ≈ 1x double add
  5x float mul ≈ 1x double mul
Hardware emulation (data path limited), e.g. CPU:
  2x float add ≈ 1x double add
  2x float mul ≈ 1x double mul
Software emulation:
  10x float add ≈ 1x double add
  20x float mul ≈ 1x double mul
Mixed Precision Iterative Refinement for Ax = b
Exploit the speed of low precision and obtain a result of high accuracy:
  d_k = b - A x_k          compute in high precision (cheap)
  A c_k = d_k              solve in low precision (fast)
  x_{k+1} = x_k + c_k      correct in high precision (cheap)
  k = k + 1                iterate until convergence in high precision
The low precision solution is used as a pre-conditioner in a high precision iterative method.
A is small and dense: solve A c_k = d_k directly.
A is large and sparse: solve A c_k = d_k (approximately) with an iterative method itself.
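A hedged host-side sketch of this loop for the dense case (the solver and BLAS-like helpers are hypothetical placeholders, not a particular library API): only the inner solve runs in single precision, residual and correction stay in double.

    #include <vector>
    using std::vector;

    // Hypothetical helpers for an n x n dense system (declared only, definitions elsewhere).
    void residual_double(const vector<double>& A, const vector<double>& b,
                         const vector<double>& x, int n, vector<double>& d); // d = b - A*x in double
    void solve_single(const vector<float>& A32, const vector<float>& d32,
                      int n, vector<float>& c32);                            // solve A*c = d in single
    double norm(const vector<double>& v);

    void mixed_precision_refine(const vector<double>& A, const vector<double>& b,
                                vector<double>& x, int n, double tol)
    {
        vector<float>  A32(A.begin(), A.end());          // one low precision copy of A
        vector<double> d(n);
        vector<float>  d32(n), c32(n);
        for (int k = 0; k < 100; ++k) {
            residual_double(A, b, x, n, d);              // compute in high precision (cheap)
            if (norm(d) < tol) break;                    // converged in high precision
            for (int i = 0; i < n; ++i) d32[i] = (float)d[i];
            solve_single(A32, d32, n, c32);              // solve in low precision (fast)
            for (int i = 0; i < n; ++i) x[i] += (double)c32[i];  // correct in high precision (cheap)
        }
    }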
CPU Results: LU Solver
(Chart: LU solver performance; larger is better; chart courtesy of Jack Dongarra.)
[Langou et al., Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy (revisiting iterative refinement for linear systems), SC 2006]
GPU Results: Conjugate Gradient (CG) and Multigrid (MG)
Performance of double precision CPU and mixed precision CPU-GPU solvers.
(Plot: seconds per grid node, smaller is better, against data level, for CG CPU, CG GPU, MG2+2 CPU and MG2+2 GPU.)
[Göddeke et al., Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations, IJPEDS 2007]
Conclusions
Parallel processing on GPUs is about identifying independent work and preserving data locality.
Map, gather and scatter are basic types of parallel data-flow.
Parallel prefix (scan) enables the parallelization of many seemingly inherently sequential algorithms.
Precision ≠ accuracy! Mixed precision methods can reduce resource requirements quadratically.