Lecture 3:
Communication Architectures and Parallel Programming Models
CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012)
Announcements
▪ Office hours
- Fatahalian: Tue/Thurs 1:30-2:30PM, GHC 7005
- Papamichael: Wed 1-2PM, HH A-312
- Mu: Mon 4-5PM, West Wing cluster
▪ Assignment 1 out today!
- Due: Jan 31st
- No late handin (1 week is more than enough time)
▪ Course discussions, message boards hosted on piazza.com
- I’ve heard good things
- Please go to piazza.com and enroll in 15-418
Review: Kayvon’s fictitious multi-core chip
We talked about forms of parallel/concurrent execution.
Finishing up from last time
Review: thought experiment
Task: element-wise multiplication of two vectors A and B
Assume vectors contain millions of elements
1. Load input A[i]
2. Load input B[i]
3. Compute A[i] × B[i]
4. Store result into C[i]
Three memory operations (12 bytes) for every MUL
NVIDIA GTX 480 GPU can do 480 MULs per clock (1.4 GHz)
Need ~8.4 TB/sec of bandwidth to keep functional units busy
~ 2% efficiency… but 8x faster than CPU!
(3 GHz Core i7 quad-core CPU: similar efficiency on this computation)
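A back-of-the-envelope check of that bandwidth figure, using the numbers above (it lands in the same ballpark as the ~8.4 TB/sec quoted; the exact value depends on how the peak rate is counted):
\[
480 \,\tfrac{\mathrm{MULs}}{\mathrm{clock}} \times 1.4\times 10^{9} \,\tfrac{\mathrm{clocks}}{\mathrm{sec}} \times 12 \,\tfrac{\mathrm{bytes}}{\mathrm{MUL}} \;\approx\; 8 \times 10^{12} \,\tfrac{\mathrm{bytes}}{\mathrm{sec}} \;\approx\; 8\ \mathrm{TB/sec}
\]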
Bandwidth limited!
If processors request data at too high a rate, the memory system cannot keep up.
No amount of latency hiding helps this.
Overcoming bandwidth limits is a common challenge for
application developers on throughput-optimized systems.
Bandwidth is a critical resource
Performant parallel programs will:
▪ Fetch data from memory less often
- Reuse data in a thread (traditional locality optimizations)
- Share data across threads (cooperation)
▪ Request data less often (instead, do more math: it is “free”)
- Term: “arithmetic intensity” - ratio of math to data access
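As a concrete instance of this ratio, for the element-wise vector multiply above:
\[
\text{arithmetic intensity} \;=\; \frac{\text{math operations}}{\text{bytes accessed}} \;=\; \frac{1\ \mathrm{MUL}}{12\ \mathrm{bytes}} \;\approx\; 0.08\ \mathrm{ops/byte}
\]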
Entire system view: CPU + discrete GPU
(Figure: a multi-core CPU, with per-core L1 and L2 caches and a shared L3 cache (8 MB), reaches DDR3 DRAM (gigabytes) at 21 GB/sec; a PCIe x16 bus (8 GB/sec each direction) connects it to a multi-core GPU, whose execution contexts and L1/L2 caches (~1 MB) reach GDDR5 DRAM (1-2 GB) at 177 GB/sec.)

2011 and future: co-locating CPU+GPU on the same chip avoids the PCIe bus bottleneck
(also reduces communication latency)
Programming with ISPC
(quick tutorial for assignment 1)
ISPC
▪ Intel SPMD Program Compiler (ISPC)
▪ SPMD: single *program* multiple data
▪ http://ispc.github.com/
sin(x) in ISPC
Compute sin(x) using a Taylor expansion: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
C++ code: main.cpp

    #include "sinx_ispc.h"

    int N = 1024;
    int terms = 5;
    float* x = new float[N];
    float* result = new float[N];

    // initialize x here

    // execute ISPC code
    sinx(N, terms, x, result);

ISPC code: sinx.ispc

    export void sinx(
        uniform int N,
        uniform int terms,
        uniform float* x,
        uniform float* result)
    {
        // assume N % programCount = 0
        for (uniform int i=0; i<N; i+=programCount)
        {
            int idx = i + programIndex;
            float value = x[idx];
            float numer = x[idx] * x[idx] * x[idx];
            uniform int denom = 6;  // 3!
            uniform int sign = -1;

            for (uniform int j=1; j<=terms; j++)
            {
                value += sign * numer / denom;
                numer *= x[idx] * x[idx];
                denom *= (j+3) * (j+4);
                sign *= -1;
            }
            result[idx] = value;
        }
    }

SPMD abstraction:
Spawn “gang” of ISPC program instances
All instances run ISPC code in parallel
Upon return, all instances have completed
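For reference, a minimal sequential C++ sketch of the same computation (the name sinx_serial is illustrative): the ISPC version above distributes exactly these outer-loop iterations across the gang's program instances.

    // Sequential C++ reference: Taylor-series sin(x), one element at a time.
    void sinx_serial(int N, int terms, float* x, float* result) {
        for (int i = 0; i < N; i++) {
            float value = x[i];
            float numer = x[i] * x[i] * x[i];
            int denom = 6;   // 3!
            int sign = -1;
            for (int j = 1; j <= terms; j++) {
                value += sign * numer / denom;   // add the next Taylor term
                numer *= x[i] * x[i];
                denom *= (j + 3) * (j + 4);
                sign *= -1;
            }
            result[i] = value;
        }
    }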
sin(x) in ISPC
Compute sin(x) using a Taylor expansion: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
C++ code: main.cpp

    #include "sinx_ispc.h"

    int N = 1024;
    int terms = 5;
    float* x = new float[N];
    float* result = new float[N];

    // initialize x here

    // execute ISPC code
    sinx(N, terms, x, result);

(Figure: sequential execution (C code) up to the call to sinx(); the call begins executing programCount instances of sinx() (ISPC code), shown as instances 1-8; sinx() returns upon completion of all ISPC program instances, and sequential execution (C code) resumes.)

SPMD abstraction:
Spawn “gang” of ISPC program instances
All instances run ISPC code in parallel
Upon return, all instances have completed

SIMD implementation:
Number of instances in a gang is the SIMD width of the hardware (or a small multiple of it)
ISPC compiler generates a binary (.o) with SIMD instructions
C++ code links against the object file as usual
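For reference, a typical build flow looks roughly like the following (a sketch: file names match this example, and exact flags and optional target settings depend on your ISPC version):

    # compile the ISPC source to an object file and a C/C++ header
    ispc -O2 sinx.ispc -o sinx_ispc.o -h sinx_ispc.h
    # compile the C++ driver and link against the generated object file
    g++ main.cpp sinx_ispc.o -o sinx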
sin(x) in ISPC
Interleaved assignment of elements to instances
C++ code: main.cpp and ISPC code: sinx.ispc (same code as on the previous slides)

ISPC Keywords:

programCount: number of simultaneously executing instances in the gang (a uniform value)

programIndex: id of the current instance in the gang (a non-uniform value: “varying”)

uniform: a type modifier. All instances have the same value for this variable. Its use is purely an optimization; it is not needed for correctness.
sin(x) in ISPC
Blocked assignment of elements to instances
C++ code: main.cpp (same as on the previous slides)

ISPC code: sinx.ispc

    export void sinx(
        uniform int N,
        uniform int terms,
        uniform float* x,
        uniform float* result)
    {
        // assume N % programCount = 0
        uniform int count = N / programCount;
        int start = programIndex * count;

        for (uniform int i=0; i<count; i++)
        {
            int idx = start + i;
            float value = x[idx];
            float numer = x[idx] * x[idx] * x[idx];
            uniform int denom = 6;  // 3!
            uniform int sign = -1;

            for (uniform int j=1; j<=terms; j++)
            {
                value += sign * numer / denom;
                numer *= x[idx] * x[idx];
                denom *= (j+3) * (j+4);
                sign *= -1;
            }
            result[idx] = value;
        }
    }
sin(x) in ISPC
Compute sin(x) using a Taylor expansion: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
C++ code: main.cpp (same as on the previous slides)

ISPC code: sinx.ispc

    export void sinx(
        uniform int N,
        uniform int terms,
        uniform float* x,
        uniform float* result)
    {
        foreach (i = 0 ... N)
        {
            float value = x[i];
            float numer = x[i] * x[i] * x[i];
            uniform int denom = 6;  // 3!
            uniform int sign = -1;

            for (uniform int j=1; j<=terms; j++)
            {
                value += sign * numer / denom;
                numer *= x[i] * x[i];
                denom *= (j+3) * (j+4);
                sign *= -1;
            }
            result[i] = value;
        }
    }

foreach: key language construct
Declares parallel loop iterations
ISPC assigns iterations to program instances in the gang
(The ISPC implementation will perform static assignment, but nothing in the abstraction prevents a dynamic assignment)
ISPC: abstraction vs. implementation
▪ Single program, multiple data (SPMD) programming model
- This is the programming abstraction
▪ Single instruction, multiple data (SIMD) implementation
- Compiler emits vector instructions
- Handles mapping of conditional control flow to vector instructions
    export uniform float sumall1(
        uniform int N,
        uniform float* x)
    {
        uniform float sum = 0.0f;
        foreach (i = 0 ... N)
        {
            sum += x[i];
        }
        return sum;
    }

    export uniform float sumall2(
        uniform int N,
        uniform float* x)
    {
        uniform float sum;
        float partial = 0.0f;
        foreach (i = 0 ... N)
        {
            partial += x[i];
        }
        // from ISPC math library
        sum = reduceAdd(partial);
        return sum;
    }

▪ Semantics of ISPC can be tricky
- SPMD abstraction + uniform values (allows implementation details to peek through the abstraction a bit)
- What does sumall1 do?
- What does sumall2 do?
- Which one is correct?
Today
▪ Three parallel programming models
- Abstractions presented to the programmer
- Influence how programmers think when writing programs
▪ Three machine architectures
- Abstraction presented by the HW to low-level software
- Typically reflect implementation
▪ Focus on communication and cooperation
▪ Reading: Textbook section 1.2
System layers: interface, implementation, interface, ...
Parallel Applications
Abstractions for describing parallel computation and abstractions for describing communication: the “programming model” (way of thinking about things)
Language or API primitives/mechanisms
Compiler and/or runtime
System call API
Operating system support
Hardware Architecture (HW/SW boundary)
Micro-architecture (HW implementation)

(In the figure, blue italic text marks concepts, red italic text marks system abstractions, and black text marks system implementations.)
Example: expressing parallelism (pthreads)
Parallel Applications
Abstraction for describing parallel computation: thread (the programming model)
pthread_create()
pthread library
System call API
OS support: kernel thread management
x86-64 modern multi-core CPU
Example: expressing parallelism (ISPC)
Parallel Applications
Abstractions for describing parallel computation (the programming model):
1. Specifying simultaneous execution (imperative: true parallelism)
2. Specifying independent work (declarative: potentially parallel)
ISPC language (call ISPC function, foreach construct)
ISPC compiler
System call API
OS support
x86-64 (including AVX vector instructions), single core of CPU
Note: ISPC has additional language primitives for multi-core execution (not discussed here)
Three models of communication
1. Shared address space
2. Message passing
3. Data parallel
Shared address space model (abstraction)
▪ Threads communicate by:
- Reading/writing to shared variables
- Interprocessor communication is implicit in memory operations
- Thread 1 stores to X. Thread 2 reads X (observes update)
- Manipulating synchronization primitives
- e.g., mutual exclusion using locks
▪ Natural extension of sequential programming model
- In fact, all our discussions have assumed a shared address space so far
▪ Think: shared variables are like a big bulletin board
- Any thread can read or write
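A minimal sketch of this model using pthreads (illustrative names, error checking omitted): two threads communicate through a shared counter and synchronize with a mutex.

    #include <pthread.h>
    #include <cstdio>

    // Shared variables: visible to every thread in the process.
    static int counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void* worker(void*) {
        for (int i = 0; i < 1000; i++) {
            pthread_mutex_lock(&lock);    // mutual exclusion
            counter++;                    // communicate via a shared variable
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main() {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %d\n", counter);   // 2000: each thread observes the other's updates
        return 0;
    }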
Shared address space (implementation)
▪ Option 1: threads share an address space (all data is sharable)
▪ Option 2: each thread has its own virtual address space, shared portion of
address spaces maps to same physical location
(like processes, described this way in book)
(Figure: several per-thread virtual address spaces whose shared portions map to the same physical memory.)
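A hedged sketch of option 2 on Linux (illustrative, error checking omitted): two processes with separate virtual address spaces map the same physical pages, so a store by one is observed by the other.

    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        // Anonymous shared mapping: the same physical pages appear in both
        // the parent's and the child's (separate) virtual address spaces.
        int* shared = (int*)mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        *shared = 0;

        if (fork() == 0) {       // child process: its own virtual address space
            *shared = 42;        // store through the shared mapping
            _exit(0);
        }
        wait(NULL);              // parent waits for the child...
        printf("%d\n", *shared); // ...then observes the child's write: prints 42
        munmap(shared, sizeof(int));
        return 0;
    }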
Shared address space HW implementation
Any processor can directly reference any memory location
“Dance-hall” organization: processors, each with a local cache, connect through an interconnect to shared memories.
Interconnect examples: a shared bus, a crossbar, a multi-stage network.
▪ Symmetric (shared-memory) multi-processor (SMP):
- Uniform memory access time: cost of accessing an uncached* memory address is the same
for all processors
(* caching introduces non-uniform access times, but we’ll talk about that later)
Shared address space architectures
Commodity x86 examples
(Figures: Intel Core i7 (quad core), with four cores connected by an on-chip ring network to a memory controller and memory; AMD Phenom II (six core).)
SUN Niagara 2
Note size of crossbar: about die area of one core
(Figure: eight processors connect through a crossbar switch to banked L2 caches and their memories.)
Non-uniform memory access (NUMA)
All processors can access any memory location, but cost of memory access is
different for different processors
(Figure: each processor has a local cache and its own locally attached memory; processors are connected by an interconnect.)
▪ Problem with preserving uniform access time: scalability
- Costs are uniform, but memory is uniformly far away
▪ NUMA designs are more scalable
- High bandwidth to local memory; BW scales with number of nodes if most accesses local
- Low latency access to local memory
▪ Increased programmer effort: performance tuning
- Finding, exploiting locality
Non-uniform memory access (NUMA)
Example: latency to access location X is higher from cores 5-8 than cores 1-4
Example: a modern dual-socket configuration
(Figure: two chips, each with four cores (cores 1-4 and cores 5-8), an on-chip network, a memory controller, and locally attached memory; location X resides in the first chip's memory; the chips are connected by AMD HyperTransport / Intel QuickPath.)
SGI Altix UV 1000 (PSC Blacklight)
▪ 256 blades, 2 CPUs per blade, 8 cores per CPU = 4K cores
▪ Single shared address space
▪ Interconnect
- Fat tree of 256 CPUs (15 GB/sec links)
- 2D torus to scale up another factor of 2
(Figures: the fat tree and 2D torus interconnect topologies.)
Shared address space summary
▪ Communication abstraction
- Threads read/write shared variables
- Manipulate synchronization primitives: locks, semaphores, etc.
- Extension of uniprocessor programming
- But NUMA implementation requires reasoning about locality for perf
▪ Hardware support
- Any processor can load and store from any address
- NUMA designs more scalable than uniform memory access
- Even so, costly to scale (see cost of Blacklight)
Message passing model (abstraction)
▪ Threads operate within independent address spaces
▪ Threads communicate by sending/receiving messages
- Explicit, point-to-point
- send: specifies buffer to be transmitted, recipient, optional message “tag”
- receive: specifies buffer to store data, sender, and (optional) message tag
- May be synchronous or asynchronous
(Figure: Thread 1 calls send(X, 2, tag) on a buffer at address X in its address space; Thread 2 calls recv(Y, 1, tag) into a buffer at address Y in its address space; the matching tag pairs the send with the receive: “match!”)
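The same send/receive pattern sketched with MPI (illustrative; error handling omitted):

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int tag = 0;
        float X[4] = {1, 2, 3, 4};   // each rank has its own copy (separate address spaces)
        float Y[4];

        if (rank == 0) {
            // send: buffer, count, element type, recipient, tag
            MPI_Send(X, 4, MPI_FLOAT, 1, tag, MPI_COMM_WORLD);
        } else if (rank == 1) {
            // receive: buffer, count, element type, sender, tag
            MPI_Recv(Y, 4, MPI_FLOAT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %f\n", Y[0]);
        }
        MPI_Finalize();
        return 0;
    }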
Message passing (implementation)
▪ Popular library: MPI (message passing interface)
▪ Challenges: buffering messages (until application initiates receive), minimizing cost of
memory copies
▪ HW need not implement system-wide loads and stores
- Connect complete (often commodity) systems together
- Clusters!
(Photos: a cluster of workstations connected by an Infiniband network; the IBM Blue Gene/P supercomputer.)
Correspondence between models and
machines is fuzzy
▪ Common to implement message passing abstractions on
machines that support a shared address space in hardware
▪ Can implement shared address space abstraction on machines
that do not support it in HW (less efficient SW solution)
- Mark all pages with shared variables as invalid
- Page-fault handler issues appropriate network requests
▪ Reminder: keep in mind what the programming model (abstraction) is, and what the HW implementation is
Data-parallel model
▪ Rigid computation structure
▪ Historically: same operation on each element of an array
- Operation ≈ instruction
- Matched capabilities of 80’s SIMD supercomputers
- Connection Machine (CM-1, CM-2): thousands of processors, one instruction
- And also Cray supercomputer vector processors
- Add(A, B, n) ⟵ this was an instruction on vectors A, B of length n
- Matlab is another good example: A + B (A, B are vectors of the same length)
▪ Today: often takes form of SPMD programming
- map(function, collection)
- Where function may be a complicated sequence of logic (e.g., a loop body)
- Application of function to each element of collection is independent
- In pure form: no communication during map
- Synchronization is implicit at the end of the map
Data parallelism example in ISPC
// main C++ code:

    const int N = 1024;
    float* x = new float[N];
    float* y = new float[N];

    // initialize N elements of x here

    absolute_value(N, x, y);

// ISPC code:

    export void absolute_value(
        uniform int N,
        uniform float* x,
        uniform float* y)
    {
        foreach (i = 0 ... N)
        {
            if (x[i] < 0)
                y[i] = -x[i];
            else
                y[i] = x[i];
        }
    }

Think of the loop body as a function
The foreach construct is a map
The collection is implicitly defined by the array indexing logic
Data parallelism example in ISPC
// main C++ code:

    const int N = 1024;
    float* x = new float[N/2];
    float* y = new float[N];

    // initialize N/2 elements of x here

    absolute_repeat(N/2, x, y);

// ISPC code:

    export void absolute_repeat(
        uniform int N,
        uniform float* x,
        uniform float* y)
    {
        foreach (i = 0 ... N)
        {
            if (x[i] < 0)
                y[2*i] = -x[i];
            else
                y[2*i] = x[i];
            y[2*i+1] = y[2*i];
        }
    }

Think of the loop body as a function
The foreach construct is a map
The collection is implicitly defined by the array indexing logic

Also a valid program!
Takes the absolute value of the elements of x and repeats each twice in the output vector y
Data parallelism example in ISPC
// main C++ code:

    const int N = 1024;
    float* x = new float[N];
    float* y = new float[N];

    // initialize N elements of x

    shift_negative(N, x, y);

// ISPC code:

    export void shift_negative(
        uniform int N,
        uniform float* x,
        uniform float* y)
    {
        foreach (i = 0 ... N)
        {
            if (i > 1 && x[i] < 0)
                y[i-1] = x[i];
            else
                y[i] = x[i];
        }
    }

Think of the loop body as a function
The foreach construct is a map
The collection is implicitly defined by the array indexing logic

This program is non-deterministic!
Possibility for multiple iterations of the loop body to write to the same memory location
(As described, the data-parallel model provides no primitives for fine-grained mutual exclusion/synchronization)
Data parallelism the more formal way
Note: this is not ISPC syntax
// main program:

    const int N = 1024;
    stream<float> x(N);   // define collection
    stream<float> y(N);   // define collection

    // initialize N elements of x here

    // map absolute_value onto x, y
    absolute_value(x, y);

// “kernel” definition

    void absolute_value(
        float x,
        float y)
    {
        if (x < 0)
            y = -x;
        else
            y = x;
    }

Data-parallelism expressed this way is sometimes referred to as the stream programming model

Streams: collections of elements. Elements can be processed independently

Kernels: side-effect-free functions. Operate element-wise on elements of collections

Think of kernel inputs, outputs, and temporaries for each invocation as its own address space

If you’ve ever written OpenGL shader code (e.g., 15-462), you’ve coded in the stream programming system
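A rough C++ analogue of the map idea (not any particular stream system's API): std::transform applies a side-effect-free kernel to each element independently.

    #include <algorithm>
    #include <vector>

    int main() {
        std::vector<float> x(1024, -1.0f), y(1024);

        // "Kernel": a side-effect-free function applied to one element.
        auto absolute_value = [](float v) { return v < 0 ? -v : v; };

        // "Map": apply the kernel to every element of the collection x,
        // writing results into y; every application is independent.
        std::transform(x.begin(), x.end(), y.begin(), absolute_value);
        return 0;
    }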
Stream programming benefits
// main program:

    const int N = 1024;
    stream<float> input(N);
    stream<float> output(N);
    stream<float> tmp(N);

    foo(input, tmp);
    bar(tmp, output);

(Figure: data flows from input through foo into tmp, then through bar into output.)

Functions really are side-effect free!
(cannot write a non-deterministic program)

Program data flow is known:

Allows prefetching. Inputs and outputs of each invocation are known in advance, so prefetching can be employed to hide latency.

Producer-consumer locality. Can structure code so that outputs of the first function feed immediately into the second function and are never written to memory.

Requires sophisticated compiler analysis.
Stream programming drawbacks
// main program:

    const int N = 1024;
    stream<float> input(N/2);
    stream<float> tmp(N);
    stream<float> output(N);

    stream_repeat(2, input, tmp);
    absolute_value(tmp, output);

// ISPC code:

    export void absolute_value(
        uniform int N,
        uniform float* x,
        uniform float* y)
    {
        foreach (i = 0 ... N)
        {
            if (x[i] < 0)
                y[2*i] = -x[i];
            else
                y[2*i] = x[i];
            y[2*i+1] = y[2*i];
        }
    }

Need a library of ad-hoc operators to describe complex data flows. (See use of the repeat operator above to obtain the same behavior as the ISPC indexing code.)

Cross fingers and hope the compiler generates code intelligently

Kayvon’s experience:
This is the achilles heel of all “proper” data-parallel/stream programming systems.
“I just need one more operator”...
Gather/scatter:
Two key data-parallel communication primitives
// main program (gather):

    const int N = 1024;
    stream<float> input(N);
    stream<int> indices;
    stream<float> tmp_input(N);
    stream<float> output(N);

    stream_gather(input, indices, tmp_input);
    absolute_value(tmp_input, output);

Gather (ISPC equivalent):

    export void absolute_value(
        uniform int N,
        uniform float* input,
        uniform float* output,
        uniform int* indices)
    {
        foreach (i = 0 ... N)
        {
            float tmp = input[indices[i]];
            if (tmp < 0)
                output[i] = -tmp;
            else
                output[i] = tmp;
        }
    }

// main program (scatter):

    const int N = 1024;
    stream<float> input(N);
    stream<int> indices;
    stream<float> tmp_output(N);
    stream<float> output(N);

    absolute_value(input, tmp_output);
    stream_scatter(tmp_output, indices, output);

Scatter (ISPC equivalent):

    export void absolute_value(
        uniform int N,
        uniform float* input,
        uniform float* output,
        uniform int* indices)
    {
        foreach (i = 0 ... N)
        {
            if (input[i] < 0)
                output[indices[i]] = -input[i];
            else
                output[indices[i]] = input[i];
        }
    }
Gather operation:
gather(R1, R0, mem_base);

(Figure: an array in memory at base address mem_base with elements at indices 0-15; the index vector R0 holds 3, 12, 4, 9, 9, 15, 13, 0, and the result vector R1 receives the corresponding array elements.)
SSE, AVX instruction sets do not directly support SIMD gather/scatter
(must implement as scalar loop)
Hardware supported gather/scatter does exist on GPUs.
(still an expensive operation)
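A scalar-loop C++ sketch of the two primitives, which is what “implement as a scalar loop” amounts to (the function signatures and explicit length parameter n are illustrative):

    // Gather: read from computed addresses. R0 holds indices, R1 receives values.
    void gather(float* R1, const int* R0, const float* mem_base, int n) {
        for (int i = 0; i < n; i++)
            R1[i] = mem_base[R0[i]];
    }

    // Scatter: write to computed addresses. R0 holds indices, R1 holds values.
    void scatter(const float* R1, const int* R0, float* mem_base, int n) {
        for (int i = 0; i < n; i++)
            mem_base[R0[i]] = R1[i];
    }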
Data-parallel model summary
▪ Data-parallelism is about program structure
▪ In spirit, map a single program onto a large collection of data
- Functional: side-effect free executions
- No communication among invocations
▪ In practice that’s how most programs work
▪ But... most popular languages do not enforce this
- OpenCL, CUDA, ISPC, etc.
- Choose flexibility/familiarity of imperative syntax over safety and complex
compiler optimizations required for functional syntax
- It’s been their key to success (and the recent adoption of parallel programming)
- Hear that PL folks! (lighten up!)
Three parallel programming models
▪ Shared address space
- Communication is unstructured, implicit in loads and stores
- Natural way of programming, but can shoot yourself in the foot easily
- Program might be correct, but not scale
▪ Message passing
- Structured communication as messages
- Often harder to get first correct program than shared address space
- Structure often helpful in getting to first correct, scalable program
▪ Data parallel
- Structure computation as a big map
- Assumes a shared address space from which to load inputs/store results, but
severely limits communication within the map (preserve independent
processing)
- Modern embodiments encourage, don’t enforce, this structure
Trend: hybrid programming models
▪ Shared address space within a multi-core node of a cluster,
message passing between nodes
- Very, very common
- Use convenience of shared address space where it can be implemented
efficiently (within a node)
▪ Data-parallel programming models support synchronization
primitives in kernels (CUDA, OpenCL)
- Permits limited forms of communication
▪ CUDA/OpenCL use data parallel model to scale to many cores, but
adopt shared-address space model for threads in a single core.
Los Alamos National Laboratory: Roadrunner
Fastest computer in the world in 2008 (no longer true)
3,240 node cluster. Heterogeneous nodes.
(Figure: one cluster node contains dual-core AMD CPUs sharing 16 GB of memory (one address space) and four IBM Cell CPUs, each with eight cores and its own 4 GB of memory (a separate address space per Cell); nodes are connected by a network.)
Summary
▪ Programming models provide a way to think about parallel
programs. They are abstractions that admit many possible
implementations.
▪ But restrictions imposed by models reflect the realities of hardware communication costs
- Shared address space machines
- Message passing machines
- Usually wise to keep abstraction distance low (performance predictability). But
want it high enough for flexibility/portability
▪ In practice, you’ll need to be able to think in a variety of ways
- Modern machines provide different types of communication at different scales
- Different models fit the machine best at the various scales