1
So I found this Batman image on the net. I had to put it somewhere in the presentation. Here it is.
Nvidia has a page on its CUDA Zone showing applications and speedups; it's a bit silly and not so advertised anymore nowadays, but still useful as an introduction:
http://www.nvidia.com/object/cuda-apps-flash-new.html#
Intel's Knights Corner news:
http://www.intel.com/pressroom/archive/releases/2010/20100531comp.htm
http://en.wikipedia.org/wiki/Intel_MIC#Knights_Corner
A wild, and not even particularly creative, example... More computation could enable entirely novel scenarios...
The past couple of Unreal Engine demos were architected to show some of what could be possible. I like the idea of motion graphs:
http://cg.cis.upenn.edu/hms/research/wcMG/
Also, imperfect shadow maps are cool:
http://cg.ibds.kit.edu/english/publications.php
The two images at the top right are for megameshes (Lionhead) and megatextures (id).
Fight Night Round 4 and Fight Night Champion are the first games using EA's Real AI, which is trained from examples.
Naughty Dog famously hand-optimizes SPU kernels. How many games pack so much computation, even today, onto SPUs? And still, these are pretty uniform workloads, on a processor that can handle non-uniform loads better than GPU compute (lower SIMD width).
Heart bokeh image taken from http://www.augustinefou.com/2009/08/bokeh-filter-turns-blurry-lights-into.html. I like how the woman is pushed out of the frame to show the bokeh, as a metaphor for how we did some effects early on 360 and PS3.
* Data parallel is harder for more high-level tasks, but it's far from impossible. In fact, we still rely on lots of data; we just let this data be combed through with mostly handmade serial logic.
Use learning, classification, probabilistic and optimization methods for games... E.g. in animation we have data: go from blend trees to animation clips all the way down to working with individual poses, and use the animation data itself and its metrics to drive animation, not logic!
http://graphics.stanford.edu/projects/ccclde/ccclde.pdf
http://www.stanford.edu/~svlevine/
Another avenue: simulation, modeling, physics. The last image is from
http://www.creativeapplications.net/theory/meibach-and-posavec-data-visualization-poetry-and-sculpture/
Think about what you can do with all the data we already have.
Usually also superscalar, with multiple pipes and SIMD, but for now we don't care.
Ok, ok, the last Batman.
...but often it is not; it's surprising (though consider that the compiler often does not know whether a given loop is executed often or not) how much manual unrolling (and prefetching, on platforms without strong cache predictors) in tight loops still pays, allowing the compiler to then compute a good, dependency-free schedule for the generated assembly.
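A minimal sketch of the idea in C-like code (a hypothetical array sum, not taken from the slides): unrolling with independent accumulators exposes work the compiler can schedule without waiting on a single dependency chain.

    // Naive loop: every iteration depends on the one accumulator 's'.
    float sum_naive(const float* a, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; ++i)
            s += a[i];
        return s;
    }

    // Manually unrolled by four with independent accumulators: the four adds
    // per iteration have no dependencies between them, so the compiler can
    // schedule them back to back. (Assumes n is a multiple of 4 to keep the
    // sketch short.)
    float sum_unrolled(const float* a, int n) {
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        for (int i = 0; i < n; i += 4) {
            s0 += a[i + 0];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }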
10
11
12
First insight we learned.
13
Again, same reasoning.
14
15
Fibers and coroutines are used, for example, to hide I/O and network access in web servers (a popular example is node.js).
16
For a sequential CPU loop, all this work is done *in the specified order*. The compiler may reschedule instructions without breaking the logical order of operations, and an out-of-order CPU (e.g. on PC, but not on PS3 or 360) will re-order instructions (well, micro-instructions) to get rid of dependencies, but the processing order is essentially preserved. By contrast, on the GPU all you're saying is that you want the whole array processed; you don't care about the order.
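To make the contrast concrete, a minimal sketch with hypothetical names (CUDA-style for the GPU side): the CPU loop fixes an order, while the kernel only states that every element must be processed, in whatever order the hardware likes.

    // Sequential CPU version: elements are visited in a specified order.
    void scale_cpu(float* data, int n, float k) {
        for (int i = 0; i < n; ++i)   // i = 0, 1, 2, ... exactly in this order
            data[i] *= k;
    }

    // GPU kernel: one logical thread per element, no ordering implied.
    // The hardware is free to process elements in any order, in parallel.
    __global__ void scale_gpu(float* data, int n, float k) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= k;
    }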
17
Of course there is other stuff too, and never mistake a logical model for a physical one, but this is enough for everything we'll need.
18
Vectors are usually even larger than the number of ALUs: the same instruction is repeated a number of times to process an entire vector. This allows the ALUs to be pipelined; e.g. a vector of 16 float4 elements might be processed by four float4 ALUs issuing 4 at a time, where each ALU pipeline stage can compute a single float per cycle, and with four stages we achieve float4 processing with a throughput of one per cycle and a latency of four.
We say 'threads' here, but they are more similar to lightweight fibers. Also, confusingly, different GPU languages name the same concepts in different ways.
The more computational threads we can pack into the registers, the more independent work we'll have to hide latencies!
This lightweight threading mechanism is fundamental for performance on manycores; see for example http://www.linleygroup.com/newsletters/newsletter_detail.php?num=4747&year=2011&tag=3, which seems to be a derivative of the Tilera Tile64: http://www.tilera.com/products/processors/TILE64
19
Illustration stolen from 'Running Code at a Teraflop: Overview of GPU Architecture', Kayvon Fatahalian, Siggraph 2009.
Some GPUs do not support variable-length loops: everything gets unrolled and inlined...
20
E.g. Pixar's Reyes rendering algorithm had interpreted, very parallel shaders. The cost of the interpreter is zero when you repeat the same instruction over thousands of elements. Reyes shaders are essentially the same as the hardware shaders we have on modern GPUs.
Note that GPU shaders have some explicit SIMD instructions in HLSL and similar languages, supporting vectors of elements up to 4-wide. In practice, many modern GPUs process only scalar code, turning vector instructions into scalars. This is because the SIMD parallelism that GPUs implement is much wider than the 4-wide instructions of the language, coupling many (e.g. 16 or more) execution elements together into a big execution vector. This holds regardless of whether each element is scalar or 4-way; some GPUs even have a 4+1 configuration (one 4-way and one scalar instruction in parallel per execution element). In general, when we talk about GPU vectors we don't mean the actual vector instructions in the shader code, but the aggregation of many ALUs following the same code.
21
22
23
Skinning images from http://blog.wolfire.com/2009/11/volumetric-heat-diffusion-skinning/
If you've done rendering on the GPU, you might recognize that the pseudo-code looks nothing like a vertex shader, where you don't fetch data: it is passed in as the parameters of the kernel's main function. The data is laid out in GPU memory and bound via streams, constants and other means. I didn't include a real vertex shader code example, even though it would not have been harder to read, because it might even be misleading. The way DirectX or OpenGL expose programmable functionality to the user might be close to the hardware, but nowadays it most often is not; e.g. shaders might indeed compile to include explicit fetches...
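As a concrete flavor of that 'explicit fetch' style, here is a sketch in CUDA-like code (hypothetical names, not the slides' actual pseudo-code): a compute-style skinning kernel reads its vertex data from buffers in GPU memory, where a vertex shader would instead receive the same data as input parameters.

    // Hypothetical linear-blend skinning kernel (one bone per vertex, to keep
    // it short): every input is fetched explicitly from global GPU memory.
    struct Mat34 { float4 r0, r1, r2; };   // 3x4 bone matrix, stored as rows

    __device__ float dot4(float4 a, float4 b) {
        return a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w;
    }

    __global__ void skin_positions(const float4* positions,   // x, y, z, 1
                                   const int*    boneIndex,   // one bone per vertex
                                   const Mat34*  bones,
                                   float4*       outPositions,
                                   int vertexCount) {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= vertexCount) return;

        float4 p = positions[v];            // explicit fetch
        Mat34  m = bones[boneIndex[v]];     // explicit, indirect fetch

        outPositions[v] = make_float4(dot4(m.r0, p), dot4(m.r1, p), dot4(m.r2, p), 1.0f);
    }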
Pixel shaders can kill their output, so, to be pedantic, they can output one or zero elements. Also, pixel processing might be culled early; this is very important.
The independence that I assert is not entirely true. Pixel shaders need derivative information for all their variables; these derivatives are computed using finite differences from the neighboring pixels, which not only constrains pixels to always be processed in 2x2 units (quads) but also establishes, through the derivatives (ddx, ddy), a means of communication (see Eric Penner's 'Shader Amortization using Pixel Quad Message Passing', GPU Pro 2).
24
We'll see some more accurate pseudo-code later; for now let's still pretend to work with a generic in-parallel construct.
25
26
* At least in our representation of the GPU... In reality compilers can decide to spill some of the local state to the shared memory (we'll see this later) and allow for a stack and so on, but that is quite a different thing...
E.g. Nvidia Fermi has a limit of 64 temporary registers in kernel code (the number of registers per core is higher, but they will be used for multiple contexts).
27
28
ALL THE PSEUDO-CODE OMITS BOUNDS CHECKS ☺
29
http://en.wikipedia.org/wiki/Sorting_network
http://www.cs.brandeis.edu/~hugues/sorting_networks.html
http://www.sciencedirect.com/science/article/pii/0020019089901968
http://www.springerlink.com/content/g002019156467227/
Tangentially related, but interesting: http://www.pvk.ca/Blog/2012/08/27/tabasco-sort-super-optimal-merge-sort/
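Since the slide only shows links, a minimal sketch of what a sorting network is (a hypothetical 4-element example, in the same C-like style as the rest of the code): a fixed, data-independent sequence of compare-exchange operations, which is exactly the kind of branch-free pattern that maps well to SIMD execution.

    // Compare-exchange: afterwards, a holds the smaller value and b the larger.
    __host__ __device__ inline void cswap(float& a, float& b) {
        if (a > b) { float t = a; a = b; b = t; }
    }

    // Optimal 5-comparator sorting network for 4 elements: the sequence of
    // comparisons is fixed and does not depend on the input data.
    __host__ __device__ inline void sort4(float v[4]) {
        cswap(v[0], v[1]);  cswap(v[2], v[3]);   // layer 1
        cswap(v[0], v[2]);  cswap(v[1], v[3]);   // layer 2
        cswap(v[1], v[2]);                       // layer 3
    }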
30
31
Rendering is a non-uniform workload; fully automated scheduling has its benefits. Early-Z and early-stencil rejection perform a hardware, automated stream compaction! Fundamental for rendering...
32
At first it's easy to complain about bank rules and coalescing issues when trying to access shared and global memory... But we have to consider that the GPU offers the ability to access a different memory location per SIMD element, on many of them, on many cores... This is a rather impressive feat: on standard CPUs, to load data into SIMD vectors you have to do it serially, one element at a time. Memory has addressing limitations; it's a limit of the technology.
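A minimal sketch of that per-lane addressing (hypothetical kernel): every thread in the SIMD vector fetches through its own, arbitrary index in a single gather, where a CPU SIMD loop would typically assemble the vector one scalar load at a time.

    // Hypothetical gather kernel: each element reads through its own index.
    __global__ void gather(const float* src, const int* indices, float* dst, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        dst[i] = src[indices[i]];   // a different address per SIMD lane
    }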
SIMD communication via voting: all, any, ballot.
Atomics: Add, Sub, Inc, Dec, And, Or, Xor, Exch, Min, Max, CompareAndSwap.
Append (or) consume structured buffer views (DX11).
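For instance, a global atomic counter is enough to build an append buffer by hand; a minimal sketch (hypothetical kernel, not the DX11 API):

    // Threads that pass a predicate reserve a slot in the output with
    // atomicAdd and write their element there. Output order is not guaranteed.
    __global__ void append_positive(const float* in, int n,
                                    float* out, int* outCount) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float v = in[i];
        if (v > 0.0f) {
            int slot = atomicAdd(outCount, 1);   // unique slot for this thread
            out[slot] = v;
        }
    }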
33
If we didn't have control over scheduling, we obviously could not access core-local memory, as we would not know which elements are scheduled where.
34
For convenience, in compute, indices can be generated on 1D, 2D or 3D grids.
ResetCoreMemory: the memory is not really reset, but we can't rely on its contents; the initialization has to be done in the kernel, and any output has to go into GLOBAL_MEM, so I created this pseudo-function to illustrate this in a C-like way.
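In CUDA the same ideas might look like the following sketch (hypothetical kernel; __shared__ plays the role of the core-local memory and is indeed undefined on entry):

    __global__ void example_kernel(const float* in, float* out, int n) {
        __shared__ float tile[256];                    // core-local memory, undefined on entry
        int local  = threadIdx.x;                      // index within the core
        int global = blockIdx.x * blockDim.x + local;  // 1D grid index

        // Explicit initialization (the "ResetCoreMemory" idea): we cannot rely
        // on whatever a previous group left in shared memory.
        tile[local] = (global < n) ? in[global] : 0.0f;
        __syncthreads();                               // tile is now visible to the whole group

        // ...work on tile[] here...

        if (global < n)
            out[global] = tile[local];                 // output must go back to global memory
    }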
35
36
37
In the picture, the first Connection Machine, a 65536-processor supercomputer of the '80s.
38
Shared memory is not exposed to pixel and vertex shaders, as that would require manual scheduling of elements onto cores. It's still used internally by the graphics card to store the results of rasterization and other shared information.
39
40
Note how widening the number of outputs per element looks like unrolling the computation... Such scheduling decisions have an impact on latency hiding (switching contexts), on the ability to use registers as a cache (e.g. in this computation, as we average columns, going wide on columns will enable reusing fetches from registers), on the ability to synchronize different tasks in a single kernel (persistent threads), and so on...
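A minimal sketch of the register-reuse point on a hypothetical 1D 3-tap average (not the slides' exact computation): letting each thread produce two adjacent outputs keeps the overlapping fetches live in registers, at the cost of having fewer threads around to hide latency, which is exactly the trade-off above. Boundary handling is omitted.

    // One output per thread: 3 loads per output.
    __global__ void box3_narrow(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < 1 || i >= n - 1) return;
        out[i] = (in[i - 1] + in[i] + in[i + 1]) * (1.0f / 3.0f);
    }

    // Two outputs per thread: 4 loads for 2 outputs, b and c reused from registers.
    __global__ void box3_wide(const float* in, float* out, int n) {
        int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
        if (i < 1 || i + 2 >= n) return;
        float a = in[i - 1], b = in[i], c = in[i + 1], d = in[i + 2];
        out[i]     = (a + b + c) * (1.0f / 3.0f);
        out[i + 1] = (b + c + d) * (1.0f / 3.0f);
    }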
41
Also, there are many libraries worth checking out: Thrust, NVIDIA Performance Primitives, CUDPP, libCL, etc. Even more so for tasks that are inherently parallel: linear algebra, imaging and so on.
42
43
44
45
All the arrows here represent sums into the existing values; the 'self' part of each addition is not shown, so as not to clutter the figure with too many arrows.
http://en.wikipedia.org/wiki/Prefix_sum
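The diagram is the step-efficient (Hillis-Steele) flavor of scan; a minimal sketch of one block's worth of it in CUDA-like code (hypothetical kernel: a single block, power-of-two size, double-buffered in shared memory to keep it simple):

    #define N 256   // block size, power of two

    // Inclusive prefix sum of N elements: log2(N) steps, each step adding
    // the value 'offset' positions to the left.
    __global__ void scan_block(const float* in, float* out) {
        __shared__ float buf[2][N];
        int t = threadIdx.x;
        int cur = 0;

        buf[cur][t] = in[t];
        __syncthreads();

        for (int offset = 1; offset < N; offset *= 2) {
            int next = 1 - cur;
            buf[next][t] = (t >= offset) ? buf[cur][t] + buf[cur][t - offset]
                                         : buf[cur][t];
            cur = next;
            __syncthreads();
        }
        out[t] = buf[cur][t];
    }

Launched as scan_block<<<1, N>>>(d_in, d_out); real implementations scan per block and then combine the per-block totals.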
46
All the black arrows here represent sums into the existing values; the 'self' part of each addition is not shown, as in the previous slide. The red arrows mark element moves.
http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
www.umiacs.umd.edu/~ramani/cmsc828e_gpusci/ScanTalk.pdf
47
The moderngpu website has lots of code, and it takes a scan-centric approach to parallel computation:
http://www.moderngpu.com/intro/scan.html
http://www.moderngpu.com/scan/segscan.html
48
The Blelloch paper references many application domains: www.cs.cmu.edu/~guyb/papers/Ble93.pdf
http://back40computing.googlecode.com/svn-history/r225/wiki/documents/RadixSortTR.pdf
49
50
http://www.cse.chalmers.se/~olaolss/papers/Efficient%20Stream%20Compaction%20on%20Wide%20SIMD%20Many-Core%20Architectures.pdf
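Stream compaction is one of the classic uses of scan; a minimal sketch of the idea (hypothetical kernels, assuming an exclusive scan of the flags into offsets is done in between, e.g. with Thrust or the scan sketched earlier):

    // Pass 1: one flag per element, 1 if it survives the predicate.
    __global__ void mark(const float* in, int* flags, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) flags[i] = (in[i] > 0.0f) ? 1 : 0;
    }

    // ...exclusive scan of flags[] into offsets[] happens here...

    // Pass 2: surviving elements scatter to their scanned offset,
    // producing a densely packed output.
    __global__ void scatter(const float* in, const int* flags,
                            const int* offsets, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && flags[i]) out[offsets[i]] = in[i];
    }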
51
52
Images from the original paper: 'On-the-fly Point Clouds through Histogram Pyramids', Gernot Ziegler, Art Tevs, Christian Theobalt, Hans-Peter Seidel.
http://www.mpi-inf.mpg.de/~gziegler/gpu_pointlist/paper17_gpu_pointclouds.pdf
http://www.mpi-inf.mpg.de/%7Egziegler/gpu_pointlist/slides_vmv2006.pdf
http://www.astro.lu.se/compugpu2010/resources/bonus_histopyramid.pdf
Marching cubes using HP: http://diglib.eg.org/EG/DL/CGF/volume27/issue8/v27i8pp2028-2039.pdf.abstract.pdf
53
Good rendering: https://graphics.stanford.edu/wikis/cs448s-10/FrontPage?action=AttachFile&do=get&target=08-GPUArchII.pdf
54
Some of the compromises; there are many differences in the implementations. See 'Debunking the 100x GPU vs CPU Myth' for more details: http://www.hwsw.hu/kepek/hirek/2010/06/p451-lee.pdf
55
There are many languages trying to target both GPUs and CPUs; OpenCL has CPU backends, without even considering CPUs and GPUs on the same chip like AMD Fusion (see e.g. synergy.cs.vt.edu/pubs/papers/daga-saahpc11-apu-efficacy.pdf), and there are languages that target the CPU for wide SIMD computing, like Intel's open-source SPMD compiler ISPC: http://ispc.github.com/
http://tbex.twbbs.org/~tbex/pad/CA_papers/Twin%20Peaks.pdf
Throughput computing is everywhere, not only for high-performance calculations but also servers, low power and so on. Amdahl for multicores: research.cs.wisc.edu/multifacet/papers/ieeecomputer08_amdahl_multicore.pdf
56
57
Some bonus reads.
BASICS
Nvidia CUDA programming guide: http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
N-Body Simulation with CUDA, from GPU Gems 3: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch31.html
ADVANCED
Efficient Primitives and Algorithms for Many-core Architectures: http://www.cs.ucdavis.edu/research/tech-reports/2011/CSE-2011-4.pdf
Persistent threads programming: http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0157-GTC2012-Persistent-Threads-Computing.pdf
More on PT and Task Parallelism: http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0138-GTC2012-Parallelism-Primitives-Applications.pdf
Registers or shared memory? http://mc.stanford.edu/cgi-bin/images/6/65/SC08_Volkov_GPU.pdf
Optimization tips: http://www.cs.virginia.edu/~skadron/Papers/cuda_tuning_bof_sc09_final.pdf
DATA STRUCTURES
TONS of spatial ones, very few general purpose.
NOTE: attributes of parallel data structures:
58