Algorithms with
distributed memory
Introduction
Distributed vs. PRAM
In the PRAM model we assumed all processors were
connected to the same memory.
A more common architecture is that each processor has its
own memory.
Connection network
[Diagram: processors P, each with its own memory M, linked by the connection network]
Now concepts like reduction, broadcast, all-to-all broadcast
become relevant.
We will have communication costs of the form t_s · a + t_w · b,
where t_s is the per-message startup time and t_w the per-word
transfer time, as we have seen in a previous lesson.
This makes the analysis of linear scaling more involved.
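As a rough sketch, this cost model is easy to write down in code; the snippet below is illustrative (the default constants, and the assumption of a binomial-tree broadcast taking ceil(log2 p) message steps, are ours, not from the lesson):

```python
import math

# Illustrative cost model: total time = t_s * (number of messages)
#                                     + t_w * (number of words sent).
# The default constants are made up for the sketch, not measured.
def comm_cost(messages, words, t_s=1e-6, t_w=1e-9):
    """Time to send `messages` messages carrying `words` words in total."""
    return t_s * messages + t_w * words

# A broadcast of m words to p processors along a binomial tree takes
# ceil(log2 p) steps on the critical path, each a message of m words.
def broadcast_cost(p, m, t_s=1e-6, t_w=1e-9):
    steps = math.ceil(math.log2(p))
    return comm_cost(steps, steps * m, t_s, t_w)
```

Note that sending many small messages is dominated by the t_s term, which is what complicates the scaling analysis.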
Structured problems
• A well-structured problem, in which the
interactions between (blocks of) data follow a
regular pattern, may allow a simple mapping to
different processors.
• We will discuss this in the case of matrix-vector and
matrix-matrix multiplication.
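For example, a row-block mapping for matrix-vector multiplication can be sketched as a serial simulation (the block boundaries and the loop over ranks stand in for actual processors; in a real distributed code, the full vector x would first be gathered on every processor):

```python
import numpy as np

def block_row_matvec(A, x, p):
    """Simulate y = A @ x with A distributed by contiguous row blocks
    over p processors. Each "processor" rank owns one row block and
    needs all of x (an all-gather / broadcast in a distributed code)."""
    n = A.shape[0]
    bounds = np.linspace(0, n, p + 1).astype(int)  # row-block boundaries
    y = np.empty(n)
    for rank in range(p):                  # one iteration per "processor"
        lo, hi = bounds[rank], bounds[rank + 1]
        y[lo:hi] = A[lo:hi] @ x            # purely local computation
    return y
```

The task interaction graph here is regular: every rank interacts with the others only through the shared vector, which is why the mapping is so simple.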
Unstructured problems
• A more general problem is not necessarily
structured and can have any shape.
• We then map different tasks to processors
with the help of a task interaction graph.
• The problem then consists of partitioning the task
interaction graph in a good way.
• Our goal is to partition so as to maximize locality.
• In general this is an NP-hard problem.
Graph partitioning
Partitioning of a graph
• The problem of distributing tasks over processors
can be interpreted as the problem of partitioning a
graph.
• Each node represents some computational work,
and each edge represents some communication.
Graph partitioning
The optimization problem has two goals
• Minimize the number of edges cut by the partitioning.
• Try to divide the nodes evenly.
Abstractly:
Input: Graph G = (V, E), number of partitions P
Output: Partition V=V0 ∪ V1 ∪ V2 ∪ ⋯ ∪ VP-1 such that
1) the {Vi} are pairwise disjoint
2) work is balanced; if each node represents the same
amount of work, this means |Vi| ≈ |Vj|
3) with Ecut = {(u,v) | u ∈ Vi, v ∈ Vj, i ≠ j}, minimize |Ecut|
NP-hard, so solved heuristically!
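The two objectives are easy to measure for a given partition; a minimal sketch (the function names and the dictionary-based representation are our own):

```python
def edge_cut(edges, part):
    """Number of edges whose endpoints lie in different partitions.

    `edges` is a list of (u, v) pairs; `part[v]` gives v's partition index."""
    return sum(1 for u, v in edges if part[u] != part[v])

def balance(part, P):
    """Largest partition size divided by the ideal size |V| / P.

    1.0 means perfectly balanced; larger values mean imbalance."""
    sizes = [0] * P
    for p in part.values():
        sizes[p] += 1
    return max(sizes) * P / len(part)
```

A heuristic partitioner then tries to drive edge_cut down while keeping balance close to 1.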
Graph partitioning via BFS
Run BFS from any vertex.
Stop when about half the vertices have been visited;
assign the visited vertices to one partition and the
unvisited ones to the other.
Repeat for each partition (divide and conquer) until
you have P partitions.
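These steps can be sketched as recursive bisection (a simplified version assuming P is a power of two and a connected graph given as an adjacency-list dictionary; the helper names are our own):

```python
from collections import deque

def bfs_half(adj, start):
    """Visit vertices in BFS order from `start` until about half are seen."""
    target = len(adj) // 2
    seen = {start}
    queue = deque([start])
    order = [start]
    while queue and len(order) < target:
        for w in adj[queue.popleft()]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
                order.append(w)
                if len(order) == target:
                    break
    return set(order)

def bfs_partition(adj, P):
    """Recursively bisect the graph until P parts remain (P a power of two)."""
    parts = [set(adj)]
    while len(parts) < P:
        part = parts.pop(0)
        # Restrict the adjacency lists to the current part before bisecting.
        sub = {v: [w for w in adj[v] if w in part] for v in part}
        half = bfs_half(sub, next(iter(part)))
        parts += [half, part - half]
    return parts
```

Each bisection takes O(|V| + |E|) time, so the whole heuristic is cheap compared to the NP-hard exact problem.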
Graph partitioning
• BFS ensures that the visited part of each bisection is
connected (the unvisited part need not be).
• More sophisticated partitioning strategies:
• Coarsen, partition, refine.
• Maximal and maximum matchings
• Spectral partitioning
• Kernighan-Lin algorithm: find local improvements
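As an illustration of the Kernighan-Lin idea, a single greedy swap step might look like the sketch below (simplified: the real algorithm performs a full pass of tentative swaps with vertex locking; the function names are hypothetical):

```python
def gain(adj, part, a, b):
    """Kernighan-Lin gain of swapping vertices a and b between two parts.

    D(v) = external cost - internal cost; the swap gain is
    D(a) + D(b) - 2 if a and b are adjacent, else D(a) + D(b)."""
    def D(v):
        ext = sum(1 for w in adj[v] if part[w] != part[v])
        internal = sum(1 for w in adj[v] if part[w] == part[v] and w != v)
        return ext - internal
    return D(a) + D(b) - 2 * (b in adj[a])

def improve_once(adj, part):
    """Perform the single best swap if it reduces the edge cut."""
    side0 = [v for v in part if part[v] == 0]
    side1 = [v for v in part if part[v] == 1]
    g, a, b = max((gain(adj, part, a, b), a, b)
                  for a in side0 for b in side1)
    if g > 0:
        part[a], part[b] = part[b], part[a]
    return g
```

Because a swap moves one vertex each way, it improves the cut without disturbing the balance of the partition sizes.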
Overview
• With distributed memory, we need to map tasks to
processors such that the number of interactions
between processors is minimized.
• Matrix calculations tend to be well-structured (the task
interaction graph is often just a mesh), at least in the
case of dense matrices!
This allows for a simple way to map.
• In general this is not the case and you would need to
take the task interaction graph into account and
partition it.
• This is a difficult problem, but several heuristics exist.
• One heuristic would be graph partitioning via BFS.