Problem statement
1. You have an array of N = 16 elements, indexed 1..16. There are P = 4 processors: P0,
P1, P2, P3.
Each element i requires cost(i) = i units of work (so later indices are heavier).
Compare these three partitioning schemes: Block, Cyclic, and Block-Cyclic (block size B =
2). For each scheme:
1. List which elements each processor receives.
2. Compute the total work (sum of element indices) assigned to each processor (show
arithmetic).
3. Compute average work per processor and an imbalance metric max_load /
average_load.
4. Conclude which scheme is best for this workload.
Solution:
Solution — step by step
Total work (for verification)
Sum of integers 1 through 16:
1 + 2 + 3 + … + 16 = (16 × 17) / 2 = 136.
Average per processor = 136 / 4 = 34.
A. Block partitioning
Rule: divide into contiguous chunks of size N/P = 16/4 = 4.
Assignments:
P0: indices 1, 2, 3, 4
P1: indices 5, 6, 7, 8
P2: indices 9, 10, 11, 12
P3: indices 13, 14, 15, 16
Compute totals (showing arithmetic):
P0 total = 1 + 2 + 3 + 4
= (1 + 2) + (3 + 4) = 3 + 7 = 10
P1 total = 5 + 6 + 7 + 8
= (5 + 6) + (7 + 8) = 11 + 15 = 26
P2 total = 9 + 10 + 11 + 12
= (9 + 10) + (11 + 12) = 19 + 23 = 42
P3 total = 13 + 14 + 15 + 16
= (13 + 14) + (15 + 16) = 27 + 31 = 58
Check: 10 + 26 + 42 + 58 = (10 + 26) + (42 + 58) = 36 + 100 = 136 (matches total).
Imbalance metric:
max_load = 58, average = 34
ratio = 58 / 34 = divide step-by-step:
o 34 × 1 = 34 → remainder 24
o 240 / 34 ≈ 7 → 34×7=238 → remainder 2 → so decimal ≈ 1.705882...
So ratio ≈ 1.7059 (≈ 170.59% of average).
Interpretation: one processor (P3) does ~70.6% more work than average → poor for
this skewed-cost case.
B. Cyclic partitioning
Rule: assign index i to processor (i − 1) mod 4 (round-robin).
Assignments:
P0: 1, 5, 9, 13
P1: 2, 6, 10, 14
P2: 3, 7, 11, 15
P3: 4, 8, 12, 16
Compute totals:
P0 total = 1 + 5 + 9 + 13
= (1 + 5) + (9 + 13) = 6 + 22 = 28
P1 total = 2 + 6 + 10 + 14
= (2 + 6) + (10 + 14) = 8 + 24 = 32
P2 total = 3 + 7 + 11 + 15
= (3 + 7) + (11 + 15) = 10 + 26 = 36
P3 total = 4 + 8 + 12 + 16
= (4 + 8) + (12 + 16) = 12 + 28 = 40
Check: 28 + 32 + 36 + 40 = (28 + 32) + (36 + 40) = 60 + 76 = 136 (matches total).
Imbalance metric:
max_load = 40, average = 34
ratio = 40 / 34 = divide step-by-step:
o 34 × 1 = 34 → remainder 6 → decimal 6/34 = 0.176470588...
ratio ≈ 1.17647 (≈ 117.65% of average).
Interpretation: only ~17.65% more than average → good balance, but data locality is
poor (elements for a processor are scattered).
C. Block-Cyclic partitioning (block size B = 2)
Rule: break array into blocks of 2 contiguous elements, then assign blocks cyclically among
processors.
Owner of index i is ((i − 1) // B) mod P.
Blocks (B=2):
Block0 = indices (1,2), Block1 = (3,4), Block2 = (5,6), Block3 = (7,8), Block4 = (9,10),
Block5 = (11,12), Block6 = (13,14), Block7 = (15,16).
Assign blocks cyclically to P0..P3:
Block0 → P0 (1,2)
Block1 → P1 (3,4)
Block2 → P2 (5,6)
Block3 → P3 (7,8)
Block4 → P0 (9,10)
Block5 → P1 (11,12)
Block6 → P2 (13,14)
Block7 → P3 (15,16)
So processor elements:
P0: 1, 2, 9, 10
P1: 3, 4, 11, 12
P2: 5, 6, 13, 14
P3: 7, 8, 15, 16
Compute totals:
P0 total = 1 + 2 + 9 + 10
= (1 + 2) + (9 + 10) = 3 + 19 = 22
P1 total = 3 + 4 + 11 + 12
= (3 + 4) + (11 + 12) = 7 + 23 = 30
P2 total = 5 + 6 + 13 + 14
= (5 + 6) + (13 + 14) = 11 + 27 = 38
P3 total = 7 + 8 + 15 + 16
= (7 + 8) + (15 + 16) = 15 + 31 = 46
Check: 22 + 30 + 38 + 46 = (22 + 30) + (38 + 46) = 52 + 84 = 136.
Imbalance metric:
max_load = 46, average = 34
ratio = 46 / 34 = divide:
o 34 × 1 = 34 → remainder 12 → decimal 12/34 ≈ 0.352941176...
ratio ≈ 1.35294 (≈ 135.29% of average).
Interpretation: better than pure block but worse than cyclic in this particular skewed-
cost pattern — but has better locality than cyclic.
Summary table (quick view)
Scheme P0 P1 P2 P3 max_load avg max/avg
Block 10 26 42 58 58 34 1.7059
Cyclic 28 32 36 40 40 34 1.1765
Block-Cyclic B=2 22 30 38 46 46 34 1.3529
Conclusion & recommendation
For this specific workload where cost increases with index (later elements are
heavier), Cyclic gives the best load balance (lowest max/avg) because it spreads
heavy and light elements evenly across processors.
Block performs worst here because it groups heavy elements together (P3 gets the
heaviest chunk).
Block-Cyclic (B=2) is a compromise: better locality than Cyclic and better balance
than Block — useful when you want both locality and reasonable balance. Its
effectiveness depends on chosen block size B.
In real HPC applications you choose B based on a tradeoff between
communication/locality and load balance; libraries like ScaLAPACK use block-cyclic
tiling for that reason.