Lecture 3:
Communication Architectures and Parallel Programming Models
CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012)
Announcements
▪ Office hours
- Fatahalian: Tue/Thurs 1:30-2:30PM, GHC 7005
- Papamichael: Wed 1-2PM, HH A-312
- Mu: Mon 4-5PM, West Wing cluster
▪ Assignment 1 out today!
- Due: Jan 31st
- No late handin (1 week is more than enough time)
▪ Course discussions, message boards hosted on piazza.com
- I’ve heard good things
- Please go to piazza.com and enroll in 15-418
Review: Kayvon’s fictitious multi-core chip
We talked about forms of parallel/concurrent execution.
Finishing up from last time
Review: thought experiment
Task: element-wise multiplication of two vectors A and B
Assume vectors contain millions of elements
1. Load input A[i]
2. Load input B[i]
3. Compute A[i] × B[i]
4. Store result into C[i]
Three memory operations (12 bytes) for every MUL
NVIDIA GTX 480 GPU can do 480 MULs per clock (1.4 GHz)
Need ~8.4 TB/sec of bandwidth to keep functional units busy
~ 2% efficiency… but 8x faster than CPU!
(3 GHz Core i7 quad-core CPU: similar efficiency on this computation)
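A back-of-the-envelope check of that bandwidth figure, using the numbers above (it lands in the same ballpark as the ~8.4 TB/sec quoted; the exact value depends on how the peak rate is counted):
\[
480 \,\tfrac{\mathrm{MULs}}{\mathrm{clock}} \times 1.4\times 10^{9} \,\tfrac{\mathrm{clocks}}{\mathrm{sec}} \times 12 \,\tfrac{\mathrm{bytes}}{\mathrm{MUL}} \;\approx\; 8 \times 10^{12} \,\tfrac{\mathrm{bytes}}{\mathrm{sec}} \;\approx\; 8\ \mathrm{TB/sec}
\]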
Bandwidth limited!
If processors request data at too high a rate, the memory system cannot keep up.
No amount of latency hiding helps this.
Overcoming bandwidth limits is a common challenge for
application developers on throughput-optimized systems.
Bandwidth is a critical resource
Performant parallel programs will:
▪ Fetch data from memory less often
- Reuse data in a thread (traditional locality optimizations)
- Share data across threads (cooperation)
▪ Request data less often (instead, do more math: it is “free”)
- Term: “arithmetic intensity” - ratio of math to data access
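As a concrete instance of this ratio, for the element-wise vector multiply above:
\[
\text{arithmetic intensity} \;=\; \frac{\text{math operations}}{\text{bytes accessed}} \;=\; \frac{1\ \mathrm{MUL}}{12\ \mathrm{bytes}} \;\approx\; 0.08\ \mathrm{ops/byte}
\]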
Entire system view: CPU + discrete GPU
(Figure: a multi-core CPU, with per-core L1 and L2 caches and a shared L3 cache (8 MB), reaches DDR3 DRAM (gigabytes) at 21 GB/sec; a PCIe x16 bus (8 GB/sec each direction) connects it to a multi-core GPU, whose execution contexts and L1/L2 caches (~1 MB) reach GDDR5 DRAM (1-2 GB) at 177 GB/sec.)

2011 and future: co-locating CPU+GPU on the same chip avoids the PCIe bus bottleneck
(also reduces communication latency)
Programming with ISPC
(quick tutorial for assignment 1)
ISPC
▪ Intel SPMD Program Compiler (ISPC)
▪ SPMD: single *program* multiple data
▪ http://ispc.github.com/
sin(x) in ISPC
Compute sin(x) using a Taylor expansion: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
C++ code: main.cpp

    #include "sinx_ispc.h"

    int N = 1024;
    int terms = 5;
    float* x = new float[N];
    float* result = new float[N];

    // initialize x here

    // execute ISPC code
    sinx(N, terms, x, result);

ISPC code: sinx.ispc

    export void sinx(
        uniform int N,
        uniform int terms,
        uniform float* x,
        uniform float* result)
    {
        // assume N % programCount = 0
        for (uniform int i=0; i<N; i+=programCount)
        {
            int idx = i + programIndex;
            float value = x[idx];
            float numer = x[idx] * x[idx] * x[idx];
            uniform int denom = 6;  // 3!
            uniform int sign = -1;

            for (uniform int j=1; j<=terms; j++)
            {
                value += sign * numer / denom;
                numer *= x[idx] * x[idx];
                denom *= (j+3) * (j+4);
                sign *= -1;
            }
            result[idx] = value;
        }
    }

SPMD abstraction:
Spawn “gang” of ISPC program instances
All instances run ISPC code in parallel
Upon return, all instances have completed
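For reference, a minimal sequential C++ sketch of the same computation (the name sinx_serial is illustrative): the ISPC version above distributes exactly these outer-loop iterations across the gang's program instances.

    // Sequential C++ reference: Taylor-series sin(x), one element at a time.
    void sinx_serial(int N, int terms, float* x, float* result) {
        for (int i = 0; i < N; i++) {
            float value = x[i];
            float numer = x[i] * x[i] * x[i];
            int denom = 6;   // 3!
            int sign = -1;
            for (int j = 1; j <= terms; j++) {
                value += sign * numer / denom;   // add the next Taylor term
                numer *= x[i] * x[i];
                denom *= (j + 3) * (j + 4);
                sign *= -1;
            }
            result[i] = value;
        }
    }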
sin(x) in ISPC
Compute sin(x) using a Taylor expansion: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
C++ code: main.cpp

    #include "sinx_ispc.h"

    int N = 1024;
    int terms = 5;
    float* x = new float[N];
    float* result = new float[N];

    // initialize x here

    // execute ISPC code
    sinx(N, terms, x, result);

(Figure: sequential execution (C code) up to the call to sinx(); the call begins executing programCount instances of sinx() (ISPC code), shown as instances 1-8; sinx() returns upon completion of all ISPC program instances, and sequential execution (C code) resumes.)

SPMD abstraction:
Spawn “gang” of ISPC program instances
All instances run ISPC code in parallel
Upon return, all instances have completed

SIMD implementation:
Number of instances in a gang is the SIMD width of the hardware (or a small multiple of it)
ISPC compiler generates a binary (.o) with SIMD instructions
C++ code links against the object file as usual
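For reference, a typical build flow looks roughly like the following (a sketch: file names match this example, and exact flags and optional target settings depend on your ISPC version):

    # compile the ISPC source to an object file and a C/C++ header
    ispc -O2 sinx.ispc -o sinx_ispc.o -h sinx_ispc.h
    # compile the C++ driver and link against the generated object file
    g++ main.cpp sinx_ispc.o -o sinx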
sin(x) in ISPC
Interleaved assignment of elements to instances
C++ code: main.cpp and ISPC code: sinx.ispc (same code as on the previous slides)

ISPC Keywords:

programCount: number of simultaneously executing instances in the gang (a uniform value)

programIndex: id of the current instance in the gang (a non-uniform value: “varying”)

uniform: a type modifier. All instances have the same value for this variable. Its use is purely an optimization; it is not needed for correctness.
sin(x) in ISPC
Blocked assignment of elements to instances
C++ code: main.cpp (same as on the previous slides)

ISPC code: sinx.ispc

    export void sinx(
        uniform int N,
        uniform int terms,
        uniform float* x,
        uniform float* result)
    {
        // assume N % programCount = 0
        uniform int count = N / programCount;
        int start = programIndex * count;

        for (uniform int i=0; i<count; i++)
        {
            int idx = start + i;
            float value = x[idx];
            float numer = x[idx] * x[idx] * x[idx];
            uniform int denom = 6;  // 3!
            uniform int sign = -1;

            for (uniform int j=1; j<=terms; j++)
            {
                value += sign * numer / denom;
                numer *= x[idx] * x[idx];
                denom *= (j+3) * (j+4);
                sign *= -1;
            }
            result[idx] = value;
        }
    }
sin(x) in ISPC
Compute sin(x) using a Taylor expansion: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
C++ code: main.cpp (same as on the previous slides)

ISPC code: sinx.ispc

    export void sinx(
        uniform int N,
        uniform int terms,
        uniform float* x,
        uniform float* result)
    {
        foreach (i = 0 ... N)
        {
            float value = x[i];
            float numer = x[i] * x[i] * x[i];
            uniform int denom = 6;  // 3!
            uniform int sign = -1;

            for (uniform int j=1; j<=terms; j++)
            {
                value += sign * numer / denom;
                numer *= x[i] * x[i];
                denom *= (j+3) * (j+4);
                sign *= -1;
            }
            result[i] = value;
        }
    }

foreach: key language construct
Declares parallel loop iterations
ISPC assigns iterations to program instances in the gang
(The ISPC implementation will perform static assignment, but nothing in the abstraction prevents a dynamic assignment)
ISPC: abstraction vs. implementation
▪ Single program, multiple data (SPMD) programming model
- This is the programming abstraction
▪ Single instruction, multiple data (SIMD) implementation
- Compiler emits vector instructions
- Handles mapping of conditional control flow to vector instructions
    export uniform float sumall1(
        uniform int N,
        uniform float* x)
    {
        uniform float sum = 0.0f;
        foreach (i = 0 ... N)
        {
            sum += x[i];
        }
        return sum;
    }

    export uniform float sumall2(
        uniform int N,
        uniform float* x)
    {
        uniform float sum;
        float partial = 0.0f;
        foreach (i = 0 ... N)
        {
            partial += x[i];
        }
        // from ISPC math library
        sum = reduceAdd(partial);
        return sum;
    }

▪ Semantics of ISPC can be tricky
- SPMD abstraction + uniform values (allows implementation details to peek through the abstraction a bit)
- What does sumall1 do?
- What does sumall2 do?
- Which one is correct?
Today
▪ Three parallel programming models
- Abstractions presented to the programmer
- Influence how programmers think when writing programs
▪ Three machine architectures
- Abstraction presented by the HW to low-level software
- Typically reflect implementation
▪ Focus on communication and cooperation
▪ Reading: Textbook section 1.2
System layers: interface, implementation, interface, ...
Parallel Applications
Abstractions for describing parallel computation and abstractions for describing communication: the “programming model” (way of thinking about things)
Language or API primitives/mechanisms
Compiler and/or runtime
System call API
Operating system support
Hardware Architecture (HW/SW boundary)
Micro-architecture (HW implementation)

(In the figure, blue italic text marks concepts, red italic text marks system abstractions, and black text marks system implementations.)
Example: expressing parallelism (pthreads)
Parallel Applications
Abstraction for describing parallel computation: thread (the programming model)
pthread_create()
pthread library
System call API
OS support: kernel thread management
x86-64 modern multi-core CPU
Example: expressing parallelism (ISPC)
Parallel Applications
Abstractions for describing parallel computation (the programming model):
1. Specifying simultaneous execution (imperative: true parallelism)
2. Specifying independent work (declarative: potentially parallel)
ISPC language (call ISPC function, foreach construct)
ISPC compiler
System call API
OS support
x86-64 (including AVX vector instructions), single core of CPU
Note: ISPC has additional language primitives for multi-core execution (not discussed here)
Three models of communication
1. Shared address space
2. Message passing
3. Data parallel
Shared address space model (abstraction)
▪ Threads communicate by:
- Reading/writing to shared variables
- Interprocessor communication is implicit in memory operations
- Thread 1 stores to X. Thread 2 reads X (observes update)
- Manipulating synchronization primitives
- e.g., mutual exclusion using locks
▪ Natural extension of sequential programming model
- In fact, all our discussions have assumed a shared address space so far
▪ Think: shared variables are like a big bulletin board
- Any thread can read or write
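A minimal sketch of this model using pthreads (illustrative names, error checking omitted): two threads communicate through a shared counter and synchronize with a mutex.

    #include <pthread.h>
    #include <cstdio>

    // Shared variables: visible to every thread in the process.
    static int counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void* worker(void*) {
        for (int i = 0; i < 1000; i++) {
            pthread_mutex_lock(&lock);    // mutual exclusion
            counter++;                    // communicate via a shared variable
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main() {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %d\n", counter);   // 2000: each thread observes the other's updates
        return 0;
    }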
Shared address space (implementation)
▪ Option 1: threads share an address space (all data is sharable)
▪ Option 2: each thread has its own virtual address space, shared portion of
address spaces maps to same physical location
(like processes, described this way in book)
(Figure: several per-thread virtual address spaces whose shared portions map to the same physical memory.)
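A hedged sketch of option 2 on Linux (illustrative, error checking omitted): two processes with separate virtual address spaces map the same physical pages, so a store by one is observed by the other.

    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        // Anonymous shared mapping: the same physical pages appear in both
        // the parent's and the child's (separate) virtual address spaces.
        int* shared = (int*)mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        *shared = 0;

        if (fork() == 0) {       // child process: its own virtual address space
            *shared = 42;        // store through the shared mapping
            _exit(0);
        }
        wait(NULL);              // parent waits for the child...
        printf("%d\n", *shared); // ...then observes the child's write: prints 42
        munmap(shared, sizeof(int));
        return 0;
    }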
Shared address space HW implementation
Any processor can directly reference any memory location
“Dance-hall” organization: processors, each with a local cache, connect through an interconnect to shared memories.
Interconnect examples: a shared bus, a crossbar, a multi-stage network.
▪ Symmetric (shared-memory) multi-processor (SMP):
- Uniform memory access time: cost of accessing an uncached* memory address is the same
for all processors
(* caching introduces non-uniform access times, but we’ll talk about that later)
Shared address space architectures
Commodity x86 examples
(Figures: Intel Core i7 (quad core), with four cores connected by an on-chip ring network to a memory controller and memory; AMD Phenom II (six core).)
SUN Niagara 2
Note size of crossbar: about die area of one core
(Figure: eight processors connect through a crossbar switch to banked L2 caches and their memories.)
Non-uniform memory access (NUMA)
All processors can access any memory location, but cost of memory access is
different for different processors
(Figure: each processor has a local cache and its own locally attached memory; processors are connected by an interconnect.)
▪ Problem with preserving uniform access time: scalability
- Costs are uniform, but memory is uniformly far away
▪ NUMA designs are more scalable
- High bandwidth to local memory; BW scales with number of nodes if most accesses local
- Low latency access to local memory
▪ Increased programmer effort: performance tuning
- Finding, exploiting locality
Non-uniform memory access (NUMA)
Example: latency to access location X is higher from cores 5-8 than cores 1-4
Example: a modern dual-socket configuration
(Figure: two chips, each with four cores (cores 1-4 and cores 5-8), an on-chip network, a memory controller, and locally attached memory; location X resides in the first chip's memory; the chips are connected by AMD HyperTransport / Intel QuickPath.)
SGI Altix UV 1000 (PSC Blacklight)
▪ 256 blades, 2 CPUs per blade, 8 cores per CPU = 4K cores
▪ Single shared address space
▪ Interconnect
- Fat tree of 256 CPUs (15 GB/sec links)
- 2D torus to scale up another factor of 2
(Figures: the fat tree and 2D torus interconnect topologies.)
Shared address space summary
▪ Communication abstraction
- Threads read/write shared variables
- Manipulate synchronization primitives: locks, semaphores, etc.
- Extension of uniprocessor programming
- But NUMA implementation requires reasoning about locality for perf
▪ Hardware support
- Any processor can load and store from any address
- NUMA designs more scalable than uniform memory access
- Even so, costly to scale (see cost of Blacklight)
Message passing model (abstraction)
▪ Threads operate within independent address spaces
▪ Threads communicate by sending/receiving messages
- Explicit, point-to-point
- send: specifies buffer to be transmitted, recipient, optional message “tag”
- receive: specifies buffer to store data, sender, and (optional) message tag
- May be synchronous or asynchronous
(Figure: Thread 1 calls send(X, 2, tag) on a buffer at address X in its address space; Thread 2 calls recv(Y, 1, tag) into a buffer at address Y in its address space; the matching tag pairs the send with the receive: “match!”)
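The same send/receive pattern sketched with MPI (illustrative; error handling omitted):

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int tag = 0;
        float X[4] = {1, 2, 3, 4};   // each rank has its own copy (separate address spaces)
        float Y[4];

        if (rank == 0) {
            // send: buffer, count, element type, recipient, tag
            MPI_Send(X, 4, MPI_FLOAT, 1, tag, MPI_COMM_WORLD);
        } else if (rank == 1) {
            // receive: buffer, count, element type, sender, tag
            MPI_Recv(Y, 4, MPI_FLOAT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %f\n", Y[0]);
        }
        MPI_Finalize();
        return 0;
    }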
Message passing (implementation)
▪ Popular library: MPI (message passing interface)
▪ Challenges: buffering messages (until application initiates receive), minimizing cost of
memory copies
▪ HW need not implement system-wide loads and stores
- Connect complete (often commodity) systems together
- Clusters!
(Photos: a cluster of workstations connected by an Infiniband network; the IBM Blue Gene/P supercomputer.)
Correspondence between models and
machines is fuzzy
▪ Common to implement message passing abstractions on
machines that support a shared address space in hardware
▪ Can implement shared address space abstraction on machines
that do not support it in HW (less efficient SW solution)
- Mark all pages with shared variables as invalid
- Page-fault handler issues appropriate network requests
▪ Reminder: keep in mind what the programming model (abstraction) is, and what the HW implementation is
Data-parallel model
▪ Rigid computation structure
▪ Historically: same operation on each element of an array
- Operation ≈ instruction
- Matched capabilities of 80’s SIMD supercomputers
- Connection Machine (CM-1, CM-2): thousands of processors, one instruction
- And also Cray supercomputer vector processors
- Add(A, B, n) ⟵ this was an instruction on vectors A, B of length n
- Matlab is another good example: A + B (A, B are vectors of the same length)
▪ Today: often takes form of SPMD programming
- map(function, collection)
- Where function may be a complicated sequence of logic (e.g., a loop body)
- Application of function to each element of collection is independent
- In pure form: no communication during map
- Synchronization is implicit at the end of the map
Data parallelism example in ISPC
// main C++ code:

    const int N = 1024;
    float* x = new float[N];
    float* y = new float[N];

    // initialize N elements of x here

    absolute_value(N, x, y);

// ISPC code:

    export void absolute_value(
        uniform int N,
        uniform float* x,
        uniform float* y)
    {
        foreach (i = 0 ... N)
        {
            if (x[i] < 0)
                y[i] = -x[i];
            else
                y[i] = x[i];
        }
    }

Think of the loop body as a function
The foreach construct is a map
The collection is implicitly defined by the array indexing logic
Data parallelism example in ISPC
// main C++ code:

    const int N = 1024;
    float* x = new float[N/2];
    float* y = new float[N];

    // initialize N/2 elements of x here

    absolute_repeat(N/2, x, y);

// ISPC code:

    export void absolute_repeat(
        uniform int N,
        uniform float* x,
        uniform float* y)
    {
        foreach (i = 0 ... N)
        {
            if (x[i] < 0)
                y[2*i] = -x[i];
            else
                y[2*i] = x[i];
            y[2*i+1] = y[2*i];
        }
    }

Think of the loop body as a function
The foreach construct is a map
The collection is implicitly defined by the array indexing logic

Also a valid program!
Takes the absolute value of the elements of x and repeats each twice in the output vector y
Data parallelism example in ISPC
// main C++ code:

    const int N = 1024;
    float* x = new float[N];
    float* y = new float[N];

    // initialize N elements of x

    shift_negative(N, x, y);

// ISPC code:

    export void shift_negative(
        uniform int N,
        uniform float* x,
        uniform float* y)
    {
        foreach (i = 0 ... N)
        {
            if (i > 1 && x[i] < 0)
                y[i-1] = x[i];
            else
                y[i] = x[i];
        }
    }

Think of the loop body as a function
The foreach construct is a map
The collection is implicitly defined by the array indexing logic

This program is non-deterministic!
Possibility for multiple iterations of the loop body to write to the same memory location
(As described, the data-parallel model provides no primitives for fine-grained mutual exclusion/synchronization)
Data parallelism the more formal way
Note: this is not ISPC syntax
// main program:

    const int N = 1024;
    stream<float> x(N);   // define collection
    stream<float> y(N);   // define collection

    // initialize N elements of x here

    // map absolute_value onto x, y
    absolute_value(x, y);

// “kernel” definition

    void absolute_value(
        float x,
        float y)
    {
        if (x < 0)
            y = -x;
        else
            y = x;
    }

Data-parallelism expressed this way is sometimes referred to as the stream programming model

Streams: collections of elements. Elements can be processed independently

Kernels: side-effect-free functions. Operate element-wise on elements of collections

Think of kernel inputs, outputs, and temporaries for each invocation as its own address space

If you’ve ever written OpenGL shader code (e.g., 15-462), you’ve coded in the stream programming system
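A rough C++ analogue of the map idea (not any particular stream system's API): std::transform applies a side-effect-free kernel to each element independently.

    #include <algorithm>
    #include <vector>

    int main() {
        std::vector<float> x(1024, -1.0f), y(1024);

        // "Kernel": a side-effect-free function applied to one element.
        auto absolute_value = [](float v) { return v < 0 ? -v : v; };

        // "Map": apply the kernel to every element of the collection x,
        // writing results into y; every application is independent.
        std::transform(x.begin(), x.end(), y.begin(), absolute_value);
        return 0;
    }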
Stream programming benefits
// main program:

    const int N = 1024;
    stream<float> input(N);
    stream<float> output(N);
    stream<float> tmp(N);

    foo(input, tmp);
    bar(tmp, output);

(Figure: data flows from input through foo into tmp, then through bar into output.)

Functions really are side-effect free!
(cannot write a non-deterministic program)

Program data flow is known:

Allows prefetching. Inputs and outputs of each invocation are known in advance, so prefetching can be employed to hide latency.

Producer-consumer locality. Can structure code so that outputs of the first function feed immediately into the second function and are never written to memory.

Requires sophisticated compiler analysis.
Stream programming drawbacks
// main program:

    const int N = 1024;
    stream<float> input(N/2);
    stream<float> tmp(N);
    stream<float> output(N);

    stream_repeat(2, input, tmp);
    absolute_value(tmp, output);

// ISPC code:

    export void absolute_value(
        uniform int N,
        uniform float* x,
        uniform float* y)
    {
        foreach (i = 0 ... N)
        {
            if (x[i] < 0)
                y[2*i] = -x[i];
            else
                y[2*i] = x[i];
            y[2*i+1] = y[2*i];
        }
    }

Need a library of ad-hoc operators to describe complex data flows. (See use of the repeat operator above to obtain the same behavior as the ISPC indexing code.)

Cross fingers and hope the compiler generates code intelligently

Kayvon’s experience:
This is the achilles heel of all “proper” data-parallel/stream programming systems.
“I just need one more operator”...
Gather/scatter:
Two key data-parallel communication primitives
// main program (gather):

    const int N = 1024;
    stream<float> input(N);
    stream<int> indices;
    stream<float> tmp_input(N);
    stream<float> output(N);

    stream_gather(input, indices, tmp_input);
    absolute_value(tmp_input, output);

Gather (ISPC equivalent):

    export void absolute_value(
        uniform int N,
        uniform float* input,
        uniform float* output,
        uniform int* indices)
    {
        foreach (i = 0 ... N)
        {
            float tmp = input[indices[i]];
            if (tmp < 0)
                output[i] = -tmp;
            else
                output[i] = tmp;
        }
    }

// main program (scatter):

    const int N = 1024;
    stream<float> input(N);
    stream<int> indices;
    stream<float> tmp_output(N);
    stream<float> output(N);

    absolute_value(input, tmp_output);
    stream_scatter(tmp_output, indices, output);

Scatter (ISPC equivalent):

    export void absolute_value(
        uniform int N,
        uniform float* input,
        uniform float* output,
        uniform int* indices)
    {
        foreach (i = 0 ... N)
        {
            if (input[i] < 0)
                output[indices[i]] = -input[i];
            else
                output[indices[i]] = input[i];
        }
    }
Gather operation:
gather(R1, R0, mem_base);

(Figure: an array in memory at base address mem_base with elements at indices 0-15; the index vector R0 holds 3, 12, 4, 9, 9, 15, 13, 0, and the result vector R1 receives the corresponding array elements.)
SSE, AVX instruction sets do not directly support SIMD gather/scatter
(must implement as scalar loop)
Hardware supported gather/scatter does exist on GPUs.
(still an expensive operation)
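A scalar-loop C++ sketch of the two primitives, which is what “implement as a scalar loop” amounts to (the function signatures and explicit length parameter n are illustrative):

    // Gather: read from computed addresses. R0 holds indices, R1 receives values.
    void gather(float* R1, const int* R0, const float* mem_base, int n) {
        for (int i = 0; i < n; i++)
            R1[i] = mem_base[R0[i]];
    }

    // Scatter: write to computed addresses. R0 holds indices, R1 holds values.
    void scatter(const float* R1, const int* R0, float* mem_base, int n) {
        for (int i = 0; i < n; i++)
            mem_base[R0[i]] = R1[i];
    }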
Data-parallel model summary
▪ Data-parallelism is about program structure
▪ In spirit, map a single program onto a large collection of data
- Functional: side-effect free executions
- No communication among invocations
▪ In practice that’s how most programs work
▪ But... most popular languages do not enforce this
- OpenCL, CUDA, ISPC, etc.
- Choose flexibility/familiarity of imperative syntax over safety and complex
compiler optimizations required for functional syntax
- It’s been their key to success (and the recent adoption of parallel programming)
- Hear that PL folks! (lighten up!)
Three parallel programming models
▪ Shared address space
- Communication is unstructured, implicit in loads and stores
- Natural way of programming, but can shoot yourself in the foot easily
- Program might be correct, but not scale
▪ Message passing
- Structured communication as messages
- Often harder to get first correct program than shared address space
- Structure often helpful in getting to first correct, scalable program
▪ Data parallel
- Structure computation as a big map
- Assumes a shared address space from which to load inputs/store results, but
severely limits communication within the map (preserve independent
processing)
- Modern embodiments encourage, don’t enforce, this structure
Trend: hybrid programming models
▪ Shared address space within a multi-core node of a cluster,
message passing between nodes
- Very, very common
- Use convenience of shared address space where it can be implemented
efficiently (within a node)
▪ Data-parallel programming models support synchronization
primitives in kernels (CUDA, OpenCL)
- Permits limited forms of communication
▪ CUDA/OpenCL use data parallel model to scale to many cores, but
adopt shared-address space model for threads in a single core.
Los Alamos National Laboratory: Roadrunner
Fastest computer in the world in 2008 (no longer true)
3,240 node cluster. Heterogeneous nodes.
(Figure: one cluster node contains dual-core AMD CPUs sharing 16 GB of memory (one address space) and four IBM Cell CPUs, each with eight cores and its own 4 GB of memory (a separate address space per Cell); nodes are connected by a network.)
Summary
▪ Programming models provide a way to think about parallel
programs. They are abstractions that admit many possible
implementations.
▪ But restrictions imposed by models reflect the realities of hardware communication costs
- Shared address space machines
- Message passing machines
- Usually wise to keep abstraction distance low (performance predictability). But
want it high enough for flexibility/portability
▪ In practice, you’ll need to be able to think in a variety of ways
- Modern machines provide different types of communication at different scales
- Different models fit the machine best at the various scales