CS 584
Load Balancing
Goal: All processors working all the time
Efficiency of 1
Distribute the load (work) to meet the goal
Two types of load balancing
Static
Dynamic
Load Balancing
The load balancing problem can be reduced to the bin-packing problem
NP-complete
For simple cases we can do well, but heterogeneity complicates things
Different types of resources
Processor
Network, etc.
Evaluation of load balancing
Efficiency
Are the processors always working?
How much processing overhead is associated with the load balancing algorithm?
Communication
Does load balancing introduce or affect the communication pattern?
How much communication overhead is associated with the load balancing algorithm?
How many edges are cut in the communication graph?
Partitioning Techniques
Regular grids (-: Easy :-)
striping
blocking
use processing power to divide load more
fairly
Generalized Graphs
Levelization
Scattered Decomposition
Recursive Bisection
Levelization
Begin with a boundary
Number these nodes level 1
All nodes connected to a level 1 node are labeled level 2, etc.
Partitioning is performed
determine the number of nodes per processor
count off the nodes of a level until exhausted
proceed to the next level
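A minimal Python sketch of the level-numbering and counting-off steps above, assuming the graph is a connected adjacency dict and the boundary is a list of starting nodes; the names levelize and partition_by_levels are illustrative, not from the slides:

    from collections import deque

    def levelize(adj, boundary):
        # Label nodes by level: boundary nodes are level 1, their unvisited
        # neighbors level 2, and so on (a breadth-first sweep).
        level = {v: 1 for v in boundary}
        frontier = deque(boundary)
        while frontier:
            v = frontier.popleft()
            for u in adj[v]:
                if u not in level:
                    level[u] = level[v] + 1
                    frontier.append(u)
        return level

    def partition_by_levels(adj, boundary, p):
        # Walk the levels in order and count off roughly n/p nodes per processor.
        level = levelize(adj, boundary)
        order = sorted(level, key=level.get)          # nodes grouped by level
        per_proc = max(1, len(order) // p)            # target nodes per processor
        return {v: min(i // per_proc, p - 1) for i, v in enumerate(order)}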
Levelization
We want to ensure nearest-neighbor communication.
Let p be the number of processors and n the number of nodes.
Let r_i be the sum of the number of nodes in contiguous levels i and i + 1
Let r = max_i { r_i }
Nearest-neighbor communication is assured if n/p > r
Scattered Decomposition
Used for highly irregular grids
Partition load into a large number r of
rectangular clusters such that r >> p
Each processor is given a disjoint set
of r/p clusters.
Communication overhead can be a
problem for highly irregular problems.
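A small sketch of the assignment step above, assuming the r rectangular clusters have already been formed; dealing them out round-robin is one simple way to give each processor a disjoint, scattered set of about r/p clusters:

    def scatter(clusters, p):
        # Deal the r clusters out to the p processors like cards, so each
        # processor ends up with about r/p clusters spread over the domain.
        assignment = {q: [] for q in range(p)}
        for i, cluster in enumerate(clusters):
            assignment[i % p].append(cluster)
        return assignment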
Recursive Bisection
Recursively divide the domain in
two pieces at each step.
3 Methods
Recursive Coordinate Bisection
Recursive Graph Bisection
Recursive Spectral Bisection
Recursive Coordinate Bisection
Divide the domain based on the physical coordinates of the nodes.
Pick a dimension and divide in half.
RCB uses no connectivity information
lots of edges crossing boundaries
partitions may be disconnected
Some new research based on graph
separators overcomes some problems.
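A rough Python sketch of RCB, assuming node coordinates are stored in a dict; splitting at the median of the dimension with the largest spread is one common choice, not something the slides prescribe. Note that connectivity never appears, which is exactly why edges get cut:

    def rcb(nodes, coords, nparts):
        # Recursive coordinate bisection: cut the node set in half at the
        # median of the dimension with the largest spread, then recurse.
        if nparts == 1:
            return [set(nodes)]
        dims = range(len(coords[nodes[0]]))
        d = max(dims, key=lambda k: max(coords[v][k] for v in nodes)
                                  - min(coords[v][k] for v in nodes))
        ordered = sorted(nodes, key=lambda v: coords[v][d])
        half = len(ordered) // 2
        return (rcb(ordered[:half], coords, nparts // 2) +
                rcb(ordered[half:], coords, nparts - nparts // 2))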
Inertial Bisection
Often, coordinate bisection is susceptible to the orientation of the mesh
Solution: find the principal axis of the communication graph
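One way to realize this is to take the principal axis of the node coordinates: project onto the dominant eigenvector of the inertia (covariance) matrix and cut at the median projection. A sketch using NumPy; the function name and the use of coordinates are assumptions of this illustration:

    import numpy as np

    def inertial_bisect(nodes, coords):
        # Split along the principal axis: project each node onto the
        # largest-eigenvalue eigenvector of the inertia matrix and cut
        # the sorted projections at the median.
        X = np.array([coords[v] for v in nodes], dtype=float)
        X -= X.mean(axis=0)                   # center on the centroid
        _, vecs = np.linalg.eigh(X.T @ X)     # eigenvalues in ascending order
        axis = vecs[:, -1]                    # principal axis
        proj = X @ axis
        order = np.argsort(proj)
        half = len(nodes) // 2
        left = {nodes[i] for i in order[:half]}
        return left, set(nodes) - left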
Graph Theory Based Algorithms
Geometric algorithms are generally low quality
they don't take into account connectivity
Graph theory algorithms apply
what we know about generalized
graphs to the partitioning problem
Hopefully, they reduce the cut size
Greedy Bisection
Start with a vertex of the smallest degree
least number of edges
Mark all its neighbors
Mark all its neighbors' neighbors, etc.
The first n/p marked vertices form one subdomain
Apply the algorithm to the remaining vertices
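A short sketch of one greedy step, assuming an adjacency-dict graph; re-apply it to the unmarked remainder to carve out the next subdomain (the name greedy_piece is illustrative):

    from collections import deque

    def greedy_piece(adj, target):
        # Grow one subdomain by breadth-first marking from a minimum-degree
        # vertex until it holds `target` (about n/p) vertices.
        start = min(adj, key=lambda v: len(adj[v]))   # smallest degree
        marked, frontier = {start}, deque([start])
        while frontier and len(marked) < target:
            v = frontier.popleft()
            for u in adj[v]:
                if u not in marked and len(marked) < target:
                    marked.add(u)
                    frontier.append(u)
        return marked, set(adj) - marked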
Recursive Graph Bisection
Based on graph distance rather than coordinate distance.
Determine the two furthest-separated nodes
Organize and partition nodes according to their distance from these extremities.
Computationally expensive
Can use approximation methods.
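A sketch of one bisection step, where a double breadth-first sweep stands in for finding the truly furthest-separated pair (the exact search is what makes the method expensive); the graph representation and names are assumptions:

    from collections import deque

    def bfs_distances(adj, src):
        # Hop-count distances from src.
        dist = {src: 0}
        frontier = deque([src])
        while frontier:
            v = frontier.popleft()
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    frontier.append(u)
        return dist

    def rgb_bisect(adj):
        # Approximate an extremity with a double sweep, then order nodes by
        # graph distance from it and cut the ordering in half.
        start = next(iter(adj))
        a = max(bfs_distances(adj, start).items(), key=lambda kv: kv[1])[0]
        dist_a = bfs_distances(adj, a)
        ordered = sorted(dist_a, key=dist_a.get)
        half = len(ordered) // 2
        return set(ordered[:half]), set(ordered[half:])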
Recursive Spectral Bisection
Uses the discrete Laplacian
Let A be the adjacency matrix
Let D be the diagonal matrix where D[i,i] is the degree of node i
L_G = A - D
Recursive Spectral Bisection
L_G is negative semidefinite
Its largest eigenvalue is zero, and the corresponding eigenvector is all ones.
The magnitude of the second largest eigenvalue gives a measure of the connectivity of the graph.
Its corresponding eigenvector gives a measure of the distances between nodes.
Recursive Spectral Bisection
The eigenvector corresponding to the second largest eigenvalue is the Fiedler vector.
Calculation of the Fiedler vector is computationally intensive.
RSB yields connected partitions that are very well balanced.
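A dense-matrix sketch of one spectral bisection step with NumPy, following the definitions above (L_G = A - D, Fiedler vector = eigenvector of the second largest eigenvalue); a full implementation would typically use a sparse iterative eigensolver instead:

    import numpy as np

    def rsb_bisect(nodes, adj):
        # Build L_G = A - D, take the Fiedler vector, and split the nodes
        # at the median of its components.
        index = {v: i for i, v in enumerate(nodes)}
        n = len(nodes)
        A = np.zeros((n, n))
        for v in nodes:
            for u in adj[v]:
                A[index[v], index[u]] = 1.0
        L = A - np.diag(A.sum(axis=1))        # discrete Laplacian as defined above
        _, vecs = np.linalg.eigh(L)           # eigenvalues in ascending order
        fiedler = vecs[:, -2]                 # eigenvector of the second largest eigenvalue
        order = np.argsort(fiedler)
        half = n // 2
        left = {nodes[i] for i in order[:half]}
        return left, set(nodes) - left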
Example
RSB: 299 edges cut
RCB: 529 edges cut
RGB: 618 edges cut
Global vs Local Partitioning
Global methods produce a good
partitioning
Local methods can then be used to
improve the partitioning
The Kernighan-Lin Algorithm
Swap pairs of nodes to decrease the cut
Allows intermediate increases in the cut size to avoid certain local minima
Loop
choose the pair of nodes with the largest benefit of swapping
logically exchange them (not for real)
lock those nodes
until all nodes are locked
Find the sequence of swaps that yields the largest accumulated benefit
Perform those swaps for real
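A compact single-pass sketch of the loop above, assuming an undirected graph stored as adjacency sets; for clarity it recomputes gains from scratch at every pick rather than keeping the gain-bucket data structures a real Kernighan-Lin implementation would use:

    def kl_pass(adj, part_a, part_b):
        # One Kernighan-Lin pass over two equal-sized parts.
        def d(v, mine, other):
            # external minus internal edges for v under the current (logical) partition
            return sum(u in other for u in adj[v]) - sum(u in mine for u in adj[v])

        a, b = set(part_a), set(part_b)
        unlocked_a, unlocked_b = set(a), set(b)
        swaps, gains = [], []
        while unlocked_a and unlocked_b:
            # pick the unlocked pair whose swap helps the cut most (gain may be negative)
            x, y = max(((x, y) for x in unlocked_a for y in unlocked_b),
                       key=lambda p: d(p[0], a, b) + d(p[1], b, a) - 2 * (p[1] in adj[p[0]]))
            gains.append(d(x, a, b) + d(y, b, a) - 2 * (y in adj[x]))
            a.remove(x); b.add(x); b.remove(y); a.add(y)      # logical exchange
            unlocked_a.discard(x); unlocked_b.discard(y)      # lock the pair
            swaps.append((x, y))
        # keep only the prefix of swaps with the largest accumulated benefit
        best_len, best_total, total = 0, 0, 0
        for i, g in enumerate(gains, 1):
            total += g
            if total > best_total:
                best_total, best_len = total, i
        new_a, new_b = set(part_a), set(part_b)
        for x, y in swaps[:best_len]:                         # perform those swaps for real
            new_a.remove(x); new_a.add(y)
            new_b.remove(y); new_b.add(x)
        return new_a, new_b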
Helpful-Sets
Two Steps
Find a set of nodes in one partition and move it to the other partition to decrease the cut size
Rebalance the load
The set of nodes moved must be helpful
The helpfulness of a node is the decrease in cut size when the node is moved
Helpful-Sets
All of these sets are 2-helpful
The Helpful-Sets Algorithm: Theory
If there is a bisection and its cut size is not too small, then there exists a small 4-helpful set on one side or the other
This 4-helpful set can be moved and will reduce the cut by 4
If the imbalance is not too large and the cut of the unbalanced partition is not too small, then it is possible to rebalance without increasing the cut size by more than 2
Apply the theory iteratively until one of the "too small" conditions is met.
Multi-level Hybrid Methods
For very large graphs, time to partition can be extremely costly
Reduce time by coarsening the graph
shrink a large graph to a smaller one that has similar characteristics
Coarsen by
heavy edge matching
simple partitioning heuristics
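A sketch of one coarsening level by heavy edge matching, assuming adjacency sets and a symmetric edge-weight dict keyed by (u, v) pairs; matched pairs (and leftover singletons) are mapped to consecutive coarse-vertex ids:

    import random

    def heavy_edge_matching(adj, weight):
        # Visit vertices in random order; match each unmatched vertex with the
        # unmatched neighbor joined by the heaviest edge.
        matched = {}
        order = list(adj)
        random.shuffle(order)
        for v in order:
            if v in matched:
                continue
            free = [u for u in adj[v] if u not in matched]
            if free:
                u = max(free, key=lambda u: weight[(v, u)])
                matched[v], matched[u] = u, v
            else:
                matched[v] = v                 # stays a singleton coarse vertex
        # collapse each pair (or singleton) into one coarse vertex id
        coarse, next_id = {}, 0
        for v in order:
            if v not in coarse:
                coarse[v] = coarse[matched[v]] = next_id
                next_id += 1
        return coarse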
Comparisons
Edge cut comparison; (x.xx) denotes run time in seconds
ML: Multilevel (spectral on the coarse graph, KL on intermediate levels)
IN: Inertial
Party: 5 or 6 different methods
(Table: edge cuts and run times for ML, Chaco IN, Chaco IN+KL, Metis PMetis, Party all, and Party all+HS on the test graphs airfoil (|v| = 4253, |e| = 12289), crack (10240, 30380), wave, mat, and DEBR)
Dynamic Load Balancing
Load is statically partitioned initially
Adjust load when an imbalance is
detected.
Objectives
rebalance the load
keep edge cut minimized (communication)
avoid having too much overhead
Dynamic Load Balancing
Consider adaptive algorithms
After an interval of computation, the mesh is adjusted according to an estimate of the discretization error
coarsened in some areas
refined in others
Mesh adjustment causes load imbalance
Dynamic Load Balancing
After refinement, node 1 ends up with more work
Centralized DLB
Control of the load is centralized
Two approaches
Master-worker (Task scheduling)
Tasks are kept in central location
Workers ask for tasks
Requires lots of tasks with weak locality requirements and no major communication between workers (see the sketch below)
Load Monitor
Periodically, monitor load on the processors
Adjust load to keep optimal balance
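A minimal master-worker sketch with Python's multiprocessing queues; the squaring of each task is just a stand-in for the real computation, and the worker and task counts are arbitrary:

    import multiprocessing as mp

    def worker(tasks, results):
        # Pull tasks from the central queue until a sentinel arrives.
        while True:
            task = tasks.get()
            if task is None:                  # sentinel: no more work
                break
            results.put(task * task)          # stand-in for the real computation

    if __name__ == "__main__":
        tasks, results = mp.Queue(), mp.Queue()
        workers = [mp.Process(target=worker, args=(tasks, results)) for _ in range(4)]
        for w in workers:
            w.start()
        for t in range(100):                  # many small, independent tasks
            tasks.put(t)
        for _ in workers:
            tasks.put(None)                   # one sentinel per worker
        answers = [results.get() for _ in range(100)]   # arrive in completion order
        for w in workers:
            w.join()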
Repartitioning
Consider: the dynamic situation is simply a sequence of static situations
Solution: repartition the load after each change
some partitioning algorithms are very quick
Issues
scalability problems
how different are the current and the new load distributions?
data dependencies
Decentralized DLB
Generally focused on a work pool
Two approaches
Hierarchy
Fully distributed
Fully Distributed DLB
Lower overhead than centralized schemes
No global information
Load is locally optimized
Propagation is slow
Load balance may not be as good as with a centralized scheme
Three steps
Flow calculation (How much to move)
Mesh node selection (Which work to move)
Actual mesh node migration
Flow calculation
View it as a network flow problem on the processor communication graph
Add source and sink nodes
Connect the source to all nodes
edge value is the current load
Connect the sink to all nodes
edge value is the mean load
Flow calculation
Many network flow algorithms exist, but they are
more intensive than necessary
not parallel
Use simpler, more scalable algorithms
Random Matchings
pick random neighboring processes
exchange some load
eventually you may get there
Diffusion
Each processor balances its load with all its neighbors
How much work should I have?
w_p^{t+1} = w_p^t − Σ_{q : {p,q} ∈ F} α_{pq} (w_p^t − w_q^t)
How much to send on an edge?
l_{pq}^{t+1} = α_{pq} (w_p^t − w_q^t)
Repeat until all load is balanced
O( log(1/ε) / (1 − λ_2) ) steps
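A sequential simulation of first-order diffusion, assuming the processor communication graph is an adjacency dict and using the common uniform choice α = 1/(max degree + 1); the function and parameter names are illustrative:

    def diffuse(load, adj, tol=1e-3, max_steps=10000):
        # Each step, processor p exchanges alpha * (w_p - w_q) with every
        # neighbor q, so load flows from heavier to lighter processors.
        alpha = 1.0 / (1 + max(len(adj[p]) for p in adj))   # safe uniform alpha_pq
        w = dict(load)
        for _ in range(max_steps):
            delta = {p: 0.0 for p in w}
            for p in w:
                for q in adj[p]:
                    delta[p] -= alpha * (w[p] - w[q])       # matches the update above
            for p in w:
                w[p] += delta[p]
            if max(w.values()) - min(w.values()) < tol:     # balanced enough
                break
        return w

For example, diffuse({0: 12, 1: 0, 2: 0, 3: 0}, {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}) gradually spreads the 12 units along a chain of four processors toward 3 units each.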
Diffusion
Convergence to load balance can be slow
Can be improved with over-relaxation
Monitor what is sent in each step
Determine how much to send based on current imbalance and how much was sent in previous steps
Diffuses load in O( log(1/ε) / √(1 − λ_2) ) steps
Dimension Exchange
Rather than communicate with all neighbors each round, only communicate with one
Comes from the dimensions of a hypercube
Use edge coloring for general graphs
Exchange load with the neighbor along a dimension
l = (l_i + l_j) / 2
Will converge in d steps on a hypercube
Some graphs may need a different factor to converge faster
l = l_i * a + l_j * (1 − a)
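A sequential simulation of the hypercube case, assuming P = 2^d processors indexed 0..P-1 so that the partner in round k is found by flipping bit k of the processor id:

    def dimension_exchange(loads):
        # In round k each processor pairs with the neighbor whose id differs
        # in bit k and they average their loads: l = (li + lj) / 2.
        p = len(loads)
        d = p.bit_length() - 1                # hypercube dimension (P = 2**d assumed)
        for k in range(d):
            for i in range(p):
                j = i ^ (1 << k)              # partner along dimension k
                if i < j:                     # handle each pair once
                    avg = (loads[i] + loads[j]) / 2
                    loads[i] = loads[j] = avg
        return loads

After the d rounds every entry equals the global mean, which is the "converges in d steps" claim above.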
Diffusion & Dimension Exchange
Can view
diffusion as a Jacobi method
dimension exchange as Gauss-Seidel
Can use multi-level variants
Divide the processor communication graph in half
Determine the load to shift across the cut
Recursively rebalance each half
Mesh node selection
Must identify which mesh nodes to migrate
minimize edge cut and overhead
Very dependent on the problem
Shape & size of the partition may play a role in accuracy
Aspect ratio maintenance
Move items that are farther away from the center of gravity.
Load Balancing Schemes
(Who do I request work from?)
Asynchronous Round Robin
each processor maintains its own target
Ask the target, then increment the target
Global Round Robin
the target is maintained by a master node
Random Polling
randomly select a donor
each processor has an equal probability of being selected
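A sketch of donor selection under the three schemes, with an illustrative state dict standing in for the per-processor target (asynchronous round robin) and the master-held counter (global round robin):

    import random

    def pick_donor(my_rank, nprocs, scheme, state):
        # Decide which processor to ask for work.
        if scheme == "arr":                            # asynchronous round robin
            target = state["target"][my_rank]
            state["target"][my_rank] = (target + 1) % nprocs
        elif scheme == "grr":                          # global round robin (shared counter)
            target = state["counter"]
            state["counter"] = (target + 1) % nprocs
        else:                                          # random polling
            target = random.randrange(nprocs)
        if target == my_rank:                          # never poll yourself
            target = (target + 1) % nprocs
        return target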