Parallel Processing
Concurrency and Mapping
Samer Arandi
[email protected]Parallel Processing 66523
Computer Engineering Department
An-Najah National University
Outline
Characteristics of tasks and interactions
task generation, granularity, and context
characteristics of task interactions
Mapping techniques for load balancing
static mappings
dynamic mappings
Methods for minimizing interaction overheads
Parallel algorithm design templates
Characteristics of Tasks
Key characteristics
generation strategy
associated work
associated data size
These characteristics impact the choice and performance of parallel algorithms
Task Generation
Static task generation
identify concurrent tasks a priori
typically data or recursive decomposition leads to static task generation
examples
- matrix operations
- graph algorithms
- image processing applications
- other regularly structured problems
Dynamic task generation
identify concurrent tasks as the computation unfolds (tasks & dependency graph)
the rules governing the generation of tasks are known as part of the algorithm
typically a result of exploratory or speculative decompositions
examples
- puzzle solving (15-puzzle)
- game playing
recursive decomposition can also lead to dynamic task generation (quicksort)
exploratory decomposition can also lead to static task generation (15-puzzle)
Task Sizes
Task size: the amount of time required for completion
Uniform: all tasks the same size (example?)
Non-uniform
sometimes sizes are known or can be estimated a priori
sometimes not
- example: tasks in quicksort
size of each partition depends upon pivot selected
Implications on mapping?
Size of Data Associated with Tasks
Data may be small or large compared to the computation
size(input) < size(computation), e.g., 15-puzzle
size(input) = size(computation) > size(output), e.g., min
size(input) = size(output) <= size(computation), e.g., sort
Implications
small data: task can easily migrate to another process
large data: ties the task to a process
- possibly can avoid communicating the task context
reconstruct/recompute the context elsewhere
Characteristics of Task Interactions
Orthogonal classification criteria
Static vs. dynamic
Regular vs. irregular
Read-only vs. read-write
One-sided vs. two-sided
Characteristics of Task Interactions
Static interactions
tasks and interactions are known a priori
simpler to code
Dynamic interactions
the timing or the set of interacting tasks cannot be determined a priori
harder to code
- especially using two-sided message passing APIs
Characteristics of Task Interactions
Regular interactions
interactions have a pattern that can be described with a function
- e.g. mesh, ring
regular patterns can be exploited for efficient implementation
- e.g. schedule communication to avoid conflicts on network links
Irregular interactions
lack a well-defined topology
modeled by a graph
Static Regular Task Interaction Pattern
Image operations, e.g. edge detection
Nearest-neighbor interactions on a 2D mesh
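A minimal C sketch (sequential kernel only; the function and array names are illustrative, not from the slides) of why edge detection induces this pattern: each output pixel reads only its 3x3 neighborhood, so under a 2D block decomposition a task interacts only with the tasks owning adjacent blocks.

    /* 3x3 edge filter: out[i][j] reads only the 8 neighbours of in[i][j].
       With a 2D block decomposition of the image, a task therefore needs only
       the boundary rows/columns of adjacent blocks: a static, regular pattern. */
    void edge_filter(int n, float in[n][n], float out[n][n]) {
        for (int i = 1; i < n - 1; i++)
            for (int j = 1; j < n - 1; j++)
                out[i][j] = 8.0f * in[i][j]
                          - in[i-1][j-1] - in[i-1][j] - in[i-1][j+1]
                          - in[i][j-1]                - in[i][j+1]
                          - in[i+1][j-1] - in[i+1][j] - in[i+1][j+1];
    }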
Static Irregular Task Interaction Pattern
Sparse matrix-vector multiply
A task must scan its associated row(s) of A to know which entries of vector b it requires (this determines the tasks it needs to interact with)
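A C sketch of this, assuming the matrix is stored in CSR form (the names rowptr, col, val are illustrative): the column indices stored for row i are exactly the entries of b that the task owning row i must obtain.

    /* CSR sparse matrix-vector product y = A*b. For the task owning row i,
       the indices col[k], k in [rowptr[i], rowptr[i+1]), identify exactly which
       entries of b it needs, i.e. which tasks it interacts with: the pattern is
       static (fixed by the sparsity structure) but irregular. */
    void spmv_csr(int n, const int *rowptr, const int *col, const double *val,
                  const double *b, double *y) {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = rowptr[i]; k < rowptr[i+1]; k++)
                sum += val[k] * b[col[k]];
            y[i] = sum;
        }
    }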
Characteristics of Task Interactions
Read-only interactions
tasks only read data associated with other tasks
example: matrix multiplication (A and B are shared)
Read-write interactions
read and modify data associated with other tasks
example: a shared task priority queue
harder to code: requires synchronization
- need to avoid ordering races (read-write and write-write, etc)
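A minimal pthreads sketch (the pool structure and names are assumed for illustration) of why read-write interactions need synchronization: every access to the shared pool is serialized with a lock to avoid read-write and write-write races.

    #include <pthread.h>

    /* Shared pool of task indices; a read-write interaction, so a lock is required. */
    typedef struct { int next, ntasks; pthread_mutex_t lock; } pool_t;

    int pool_get(pool_t *p) {             /* returns a task index, or -1 if empty */
        pthread_mutex_lock(&p->lock);     /* serialize the concurrent read-modify-write */
        int t = (p->next < p->ntasks) ? p->next++ : -1;
        pthread_mutex_unlock(&p->lock);
        return t;
    }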
Characteristics of Task Interactions
One-sided
initiated & completed independently by 1 of 2 interacting tasks
- GET
- PUT
Two-sided
both tasks coordinate in an interaction
- SEND + RECV
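A hedged MPI sketch contrasting the two styles (ranks, tags and variable names are assumed; it expects at least two ranks): in the two-sided exchange both ranks take part, while in the one-sided exchange rank 1 GETs the value on its own from a window that rank 0 merely exposes.

    #include <mpi.h>

    void one_vs_two_sided(MPI_Comm comm) {
        int rank, x = 42, y = 0;
        MPI_Comm_rank(comm, &rank);

        /* two-sided: SEND + RECV, both tasks coordinate */
        if (rank == 0) MPI_Send(&x, 1, MPI_INT, 1, 0, comm);
        if (rank == 1) MPI_Recv(&y, 1, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);

        /* one-sided: GET through an RMA window; rank 0 issues no matching call */
        MPI_Win win;
        MPI_Win_create(&x, sizeof(int), sizeof(int), MPI_INFO_NULL, comm, &win);
        MPI_Win_fence(0, win);
        if (rank == 1) MPI_Get(&y, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
        MPI_Win_fence(0, win);
        MPI_Win_free(&win);
    }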
Outline
Characteristics of tasks and interactions
task generation, granularity, and context
characteristics of task interactions
Mapping techniques for load balancing
static mappings
dynamic mappings
Methods for minimizing interaction overheads
Parallel algorithm design templates
Mapping Techniques
Map concurrent tasks to processes for execution
Goal: all tasks complete in the shortest possible time
Overheads of mappings
serialization (idling) - due to uneven load balance or dependencies
communication
A good mapping tries to minimize both sources of overhead
Conflicting objectives: minimizing one tends to increase the other
assigning all work to one processor (going to the extreme)
- minimizes communication
- significant idling
minimizing serialization introduces communication
Mapping Techniques for Minimum Idling
Overall load balancing alone doesn't necessarily minimize idling
The task-dependency graph determines when a task can run
Must balance computation and interactions at each stage
Mapping Techniques for Minimum Idling
Static vs. dynamic mappings
Static mapping
a priori mapping of tasks to processes
requirements
- a good estimate of task size
- even so, finding an optimal mapping may be NP-complete
e.g., multiple knapsack problem
Dynamic mapping
map tasks to processes at runtime
why?
- tasks are generated at runtime, or
- their sizes are unknown
need to make sure the cost of moving data doesn't outweigh the benefit of dynamic mapping
Factors that influence the choice of mapping
size of data associated with a task
nature of the underlying domain
Schemes for Static Mapping
Data partitioning
Task graph partitioning
Hybrid strategies
Mappings Based on Data Partitioning
Partition computation using a combination of
data partitioning
the owner-computes rule
Example: 1-D block distribution for dense matrices
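A C sketch of the idea under assumed names (p, nprocs, compute_row): each process owns a contiguous block of rows and, by the owner-computes rule, performs exactly the computation that writes the data it owns.

    void compute_row(int i);                       /* assumed application kernel */

    /* 1-D block distribution + owner-computes: process p computes rows [lo, hi). */
    void block_owner_computes(int n, int p, int nprocs) {
        int chunk = (n + nprocs - 1) / nprocs;     /* ceiling(n / nprocs) */
        int lo = p * chunk;
        int hi = (lo + chunk < n) ? lo + chunk : n;
        for (int i = lo; i < hi; i++)
            compute_row(i);                        /* owner of row i updates row i */
    }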
Block Array Distribution Schemes
Multi-dimensional block distributions
Multi-dimensional partitioning enables the use of a larger number of processes
Block Array Distribution Example
Multiplying two dense matrices C = A x B
Partition the output matrix C using a block decomposition
Give each task the same number of elements of C
each element of C corresponds to a dot product
this gives an even load balance
Obvious choices: 1D or 2D decomposition
Select the one that minimizes the associated communication overhead
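A C sketch, with illustrative names, of the 2D choice: on a q x q process grid (q = sqrt(P)) each process owns one (n/q) x (n/q) block of C, so it needs only a block row of A and a block column of B rather than all of A or all of B as in a 1D decomposition.

    /* 2D block decomposition of the n x n output C over a q x q process grid.
       Process (pr, pc) owns rows [*row_lo, *row_hi) and columns [*col_lo, *col_hi)
       of C; assume q divides n for simplicity. */
    void my_c_block(int n, int q, int pr, int pc,
                    int *row_lo, int *row_hi, int *col_lo, int *col_hi) {
        int b = n / q;
        *row_lo = pr * b;  *row_hi = (pr + 1) * b;   /* needs this block row of A */
        *col_lo = pc * b;  *col_hi = (pc + 1) * b;   /* needs this block column of B */
    }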
Imbalance and Block Array Distributions
Consider a block distribution for LU decomposition
Computing different blocks requires different amounts of work
If we map all tasks associated with a given block onto one process of a 9-process ensemble => imbalance => significant idle time
Another computation with similar distribution challenges
Gaussian Elimination
Block Cyclic Distribution
Variant of the block distribution scheme that can be used to alleviate the load-imbalance and idling
Steps
1. partition the array into many more blocks than the number of available processes
2. assign blocks to processes in a round-robin manner
- each process gets several non-adjacent blocks
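A one-line C sketch of the 1-D case (names assumed): with block size bs, block k goes to process k mod nprocs, so each process receives several non-adjacent blocks and, in computations like LU, a share of both the cheap and the expensive ones.

    /* 1-D block-cyclic ownership: row i lives in block i/bs, assigned round-robin. */
    int owner_of_row(int i, int bs, int nprocs) {
        return (i / bs) % nprocs;
    }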
Block-Cyclic Distribution
(a) 1D block-cyclic
(b) 2D block-cyclic
In certain cases even a block-cyclic distribution results in imbalance:
- use a randomized block distribution instead
Decomposition by Graph Partitioning
Data partitioning is very effective for problems that use dense matrices and have regular interaction patterns.
However, some problems use sparse matrices and have data-dependent, irregular interaction patterns
example: sparse matrix-vector multiply
Graph of the matrix is useful for decomposition
work ~ number of edges
communication for a node ~ node degree
Goal: balance work & minimize communication
Partition the graph
assign an equal number of nodes to each process
minimize the edge-cut of the partition
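A C sketch of the quantity being minimized (the adjacency arrays and partition vector are assumed names): the edge-cut, i.e. the number of edges whose endpoints are assigned to different processes, which is a proxy for the communication volume of the mapping.

    /* Edge-cut of a partition: graph in CSR-like adjacency form (adjptr, adj),
       part[v] = process that owns vertex v. */
    int edge_cut(int nvert, const int *adjptr, const int *adj, const int *part) {
        int cut = 0;
        for (int u = 0; u < nvert; u++)
            for (int k = adjptr[u]; k < adjptr[u+1]; k++)
                if (part[u] != part[adj[k]])
                    cut++;
        return cut / 2;    /* each undirected edge is seen from both endpoints */
    }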
Partitioning a Graph of Lake Superior
Random partitioning (8 processes)
Partitioning for minimum edge-cut (8 processes)
Mappings Based on Task Partitioning
Partitioning a task-dependency graph
Optimal partitioning for general task-dependency graph
NP-complete problem
Excellent heuristics exist for structured graphs
Mapping a Binary Tree Dependency Graph
Dependency graph for quicksort
Task assignment to processes in a hypercube*
*hypercube: node numbers that differ in 1 bit are adjacent
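A small C sketch of the hypercube property used by this mapping (names assumed): the neighbours of a node are obtained by flipping each of the d bits of its label in turn.

    /* Neighbours of node `me` in a d-dimensional hypercube: flip one bit at a time. */
    void hypercube_neighbors(int me, int d, int nbr[]) {
        for (int k = 0; k < d; k++)
            nbr[k] = me ^ (1 << k);   /* differs from me in exactly bit k */
    }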
Task Partitioning: Mapping a Sparse Graph
17 items to communicate
13 items to communicate
Hierarchical Mappings
Sometimes a single mapping is inadequate
e.g., task mapping of a binary tree cannot readily use a large number of processors (e.g. parallel quicksort).
Hierarchical approach
use a task mapping at the top level
data partitioning within each level
Schemes for Dynamic Mapping
Dynamic mapping AKA dynamic load balancing
load balancing is the primary motivation for dynamic mapping
Styles
centralized
distributed
Centralized Dynamic Mapping
Processes are designated as master(s) or slaves
General strategy
when a slave runs out of work, it requests more from the master
Challenge
master may become bottleneck for large # of processes
Approach
chunk scheduling: a process picks up several tasks at once (sketched below)
however
- large chunk sizes may cause significant load imbalance
- gradually decrease the chunk size as the computation progresses
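A hedged OpenMP sketch of the same idea in shared memory (process_task is an assumed kernel; this is an analogue, not the slides' master-slave scheme): schedule(dynamic, chunk) hands out fixed-size chunks on demand, while schedule(guided) starts with large chunks and shrinks them as the loop progresses, which is exactly the "gradually decrease chunk size" strategy.

    #include <omp.h>

    void process_task(int i);                /* assumed application kernel */

    void run_tasks(int ntasks) {
        /* guided scheduling: chunk sizes start large and decrease toward the end,
           reducing scheduler traffic early and load imbalance late */
        #pragma omp parallel for schedule(guided)
        for (int i = 0; i < ntasks; i++)
            process_task(i);
    }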
Distributed Dynamic Mapping
All processes act as peers
Each process can send work to or receive work from other processes
avoids centralized bottleneck
Four critical design questions
how are sending and receiving processes paired together?
who initiates the work transfer?
how much work is transferred?
when is a transfer triggered?
Ideal answers can be application specific
Cilk uses a distributed dynamic mapping based on work stealing (see the sketch below)
Suitability for distributed- vs. shared-memory architectures
- for message-passing computers, the computation size should be >> the data size
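A hedged sketch of runtime-managed dynamic mapping in the Cilk style, written here with OpenMP tasks as a shared-memory analogue (partition is an assumed helper): each recursive call becomes a task and the runtime decides at run time which thread executes it, with idle workers obtaining tasks from busy ones.

    int partition(int *a, int lo, int hi);   /* assumed helper: returns pivot index */

    /* Call from inside "#pragma omp parallel" / "#pragma omp single".
       a, lo and p are captured firstprivate by the task (copied values). */
    void quicksort_tasks(int *a, int lo, int hi) {
        if (lo >= hi) return;
        int p = partition(a, lo, hi);
        #pragma omp task                     /* left half becomes a schedulable task */
        quicksort_tasks(a, lo, p - 1);
        quicksort_tasks(a, p + 1, hi);       /* right half handled by the current thread */
        #pragma omp taskwait                 /* wait for the child task to finish */
    }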
Outline
Characteristics of tasks and interactions
task generation, granularity, and context
characteristics of task interactions
Mapping techniques for load balancing
static mappings
dynamic mappings
Methods for minimizing interaction overheads
Parallel algorithm design templates
Minimizing Interaction Overheads (1)
Rules of thumb
Maximize data locality
don't fetch data you already have
restructure the computation to reuse data promptly
Minimize volume of data exchange
partition interaction graph to minimize edge crossings
Minimize frequency of communication
try to aggregate messages where possible
Minimize contention and hot-spots
use decentralized techniques (avoidance)
Minimizing Interaction Overheads (2)
Techniques
Overlap communication with computation
use non-blocking communication primitives
- overlap communication with your own computation
- one-sided: prefetch remote data to hide latency
multithread code on a processor
- overlap communication with another thread's computation
Replicate data or computation to reduce communication
Use group communication instead of point-to-point primitives
Issue multiple communications and overlap their latency
(reduces exposed latency)
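A hedged MPI sketch of the first technique (buffers, neighbour ranks and the helper kernels are assumed): start the exchange with non-blocking calls, compute on data that does not depend on the incoming message, and wait only before the dependent part.

    #include <mpi.h>

    void compute_interior(void);              /* assumed: work independent of recvbuf */
    void compute_boundary(double *ghost);     /* assumed: work that needs received data */

    void exchange_and_compute(double *sendbuf, double *recvbuf, int n,
                              int left, int right, MPI_Comm comm) {
        MPI_Request req[2];
        MPI_Irecv(recvbuf, n, MPI_DOUBLE, left,  0, comm, &req[0]);
        MPI_Isend(sendbuf, n, MPI_DOUBLE, right, 0, comm, &req[1]);

        compute_interior();                       /* overlap: communication in flight */

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE); /* complete before touching recvbuf */
        compute_boundary(recvbuf);
    }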
Outline
Characteristics of tasks and interactions
task generation, granularity, and context
characteristics of task interactions
Mapping techniques for load balancing
static mappings
dynamic mappings
Methods for minimizing interaction overheads
Parallel algorithm design templates
Parallel Algorithm Model
Definition: a way of structuring a parallel algorithm
Aspects of a model
decomposition
mapping technique
strategy to minimize interactions
Common Parallel Algorithm Models
Data parallel
each task performs similar operations on different data
tasks are typically mapped statically to processes
Task graph
use task dependency graph relationships to
- promote locality, or reduce interaction costs
Master-slave
one or more master processes generate work and allocate it to worker processes
allocation may be static or dynamic
Pipeline / producer-consumer
pass a stream of data through a sequence of processes
each process performs some operation on the stream
Hybrid
apply multiple models hierarchically, or apply multiple models in sequence to different phases
References
Slides originally from John Mellor-Crummey (Rice), COMP 422
Adapted from the slides "Principles of Parallel Algorithm Design" by Ananth Grama
Based on Chapter 3 of Introduction to Parallel Computing by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Addison-Wesley, 2003