This project benchmarks a parallel prefix sum (exclusive scan) built on a reusable OpenMP loop helper and compares it against a sequential baseline. It also includes scripts to run benchmarks, generate plots, and produce a report-style PDF.
The focus of the performance discussion is the speedup of the parallel prefix sum over the sequential implementation. The same patterns and conclusions apply to other data-parallel algorithms, such as parallel mergesort.
You will need:

- A C and C++ compiler with OpenMP support (for example, `gcc` and `g++`)
- `make`
- Bash
- For plotting, either:
  - `gnuplot`, or
  - Python 3 with `matplotlib` and `pandas`
From the project root:
```sh
cd prefixsum-parallelization-study
make libgen.a
```

This builds the common static library used by both the sequential and parallel implementations. Most sub-Makefiles will auto-build `../libgen.a` if it is missing, but running it once at the root makes things explicit.
From the project root:
```sh
cd sequential
make
make bench   # or: bash ./queue.sh
```

What this does:

- Builds `prefixsum_seq` from `prefixsum_seq.cpp`.
- Runs `bench_sequential.sh`, which:
  - Reads `PREFIXSUM_NS` from `../params.sh`.
  - For each `N` in that list, runs `./prefixsum_seq N >/dev/null 2> result/prefixsum_seq_N`.
  - Each `result/prefixsum_seq_N` contains the total runtime in seconds, as written to `stderr`.
Files written:

- `sequential/result/prefixsum_seq_10000`
- `sequential/result/prefixsum_seq_1000000`
- `sequential/result/prefixsum_seq_100000000`
- `sequential/result/prefixsum_seq_1000000000`
Each file contains a single floating point number (time in seconds).
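For reference, the core of the sequential baseline is a single linear pass over the input. A minimal sketch (hypothetical names; the actual `prefixsum_seq.cpp` may differ in details):

```cpp
#include <cstddef>
#include <vector>

// Exclusive scan: out[i] = sum of in[0..i-1], so out[0] = 0 and
// out[n] holds the total sum of all n input elements.
std::vector<long long> exclusiveScanSeq(const std::vector<int> &in) {
    std::vector<long long> out(in.size() + 1);
    out[0] = 0;
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i + 1] = out[i] + in[i];
    return out;
}
```

In the real benchmark, the runtime measured around this loop is what ends up in the `result/prefixsum_seq_N` files via `stderr`.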
From the project root:
```sh
cd prefixsum
make        # builds ./prefixsum using OpenMP
make test   # optional: correctness and basic timing sanity
make bench  # or: bash ./queue.sh
```

Key parts of `prefixsum.cpp`:

- Data generation:

  ```cpp
  int *arr = new int[n];
  generatePrefixSumData(arr, n);
  ```

- Output array:

  ```cpp
  int *pr = new int[n + 1];
  pr[0] = 0;
  ```

- Parallel work is done in three passes using `OmpLoop`:
  1. A per-block scan to compute local prefix sums and block totals.
  2. A sequential scan over the block totals to compute per-block offsets.
  3. A parallel pass that adds each block's offset to its local prefix sums.

- The `OmpLoop` helper wraps an OpenMP `for` with dynamic scheduling and a user-settable chunk size:

  ```cpp
  OmpLoop loop;
  loop.setNbThread(nbthreads);
  loop.setGranularity(1);
  // Example usage:
  loop.parfor(0, n, B, [&](int start) {
      // work on range [start, start + B)
  });
  ```

- The program prints timing information only to `stderr` and the check status (`checked` or `incorrect`) to `stdout`.
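Put together, the three passes can be sketched with plain OpenMP pragmas in place of the `OmpLoop` helper. This is an illustrative reimplementation, not the project's actual code; the block size `B` and all names here are assumptions:

```cpp
#include <algorithm>
#include <vector>

// Block-wise exclusive scan in three passes, mirroring the scheme above.
std::vector<long long> parallelExclusiveScan(const std::vector<int> &in, int B) {
    const int n = static_cast<int>(in.size());
    const int nblocks = (n + B - 1) / B;
    std::vector<long long> pr(n + 1);
    std::vector<long long> totals(nblocks);

    // Pass 1: local exclusive scan within each block, plus the block total.
    #pragma omp parallel for schedule(dynamic)
    for (int b = 0; b < nblocks; ++b) {
        const int start = b * B;
        const int end = std::min(start + B, n);
        long long sum = 0;
        for (int i = start; i < end; ++i) {
            pr[i] = sum;      // local exclusive prefix
            sum += in[i];
        }
        totals[b] = sum;
    }

    // Pass 2: sequential exclusive scan over block totals -> per-block offsets.
    long long offset = 0;
    for (int b = 0; b < nblocks; ++b) {
        const long long t = totals[b];
        totals[b] = offset;
        offset += t;
    }
    pr[n] = offset;           // grand total

    // Pass 3: add each block's offset to its local prefix sums.
    #pragma omp parallel for schedule(dynamic)
    for (int b = 0; b < nblocks; ++b) {
        const int start = b * B;
        const int end = std::min(start + B, n);
        for (int i = start; i < end; ++i)
            pr[i] += totals[b];
    }
    return pr;
}
```

Pass 2 is the only sequential step, and it touches one value per block rather than one per element, which is why it stops mattering once `N` is large.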
Benchmarking is driven by `bench_prefixsum.sh`, which:

- Reads `PREFIXSUM_NS` and `THREADS` from `../params.sh`.
- For each combination of `N` and `t`, runs `./prefixsum N t >/dev/null 2> result/prefixsum_N_t`.

The parallel timings live in `prefixsum/result/` with filenames like:

- `prefixsum_10000_1`, `prefixsum_10000_4`, `prefixsum_10000_16`, `prefixsum_10000_64`
- `prefixsum_1000000_1`, `prefixsum_1000000_4`, `prefixsum_1000000_16`, `prefixsum_1000000_64`
- And so on, following the pattern `prefixsum_<N>_<threads>`.
After running both sequential and parallel benchmarks, you can generate plots.
From `prefixsum/`:

```sh
make plot
```

The `plot` target checks whether `gnuplot` is available:

- If `gnuplot` is found, it runs `plot.sh` and produces PDF plots.
- If `gnuplot` is not found, it falls back to Python: `python3 ./plot_py.py`.
To generate the PNG plots (which this README embeds), run:
```sh
cd prefixsum
python3 plots.py
```

`plots.py` will:

- Read the existing timing results.
- Build the plots if needed.
- Write the following PNGs to `prefixsum/plots/`:
  - `prefixsum_speedup_n.png`
  - `prefixsum_speedup_thread.png`
  - `prefixsum_time_vs_threads.png`
  - `prefixsum_efficiency_vs_threads.png`
It also writes PDFs with the same stems and a combined plots.pdf.
Once the PNGs exist in prefixsum/plots/, they can be embedded directly in this README.
This plot shows speedup as a function of the number of threads for each problem size N.
This plot shows speedup as a function of N (on a log scale) for each fixed thread count.
This plot shows the absolute parallel runtime (par_time) as a function of thread count for each N.
This plot shows parallel efficiency (speedup / threads) as a function of thread count for each N.
The CSV file prefixsum/plots/prefixsum_speedup_table.csv contains rows of the form:
N,threads,seq_time,par_time,speedup,efficiency
where:

- `seq_time` is the runtime of `sequential/prefixsum_seq` for that `N`.
- `par_time` is the runtime of `prefixsum/prefixsum` for that `N` and thread count.
- `speedup = seq_time / par_time`
- `efficiency = speedup / threads`
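The two derived columns follow directly from the raw times; a tiny helper makes the definitions concrete (illustrative code, not part of the project):

```cpp
struct Timing {
    double seq_time;  // seconds, from sequential/result/...
    double par_time;  // seconds, from prefixsum/result/...
    int threads;
};

// speedup > 1 means the parallel run beat the sequential baseline.
double speedup(const Timing &t) { return t.seq_time / t.par_time; }

// efficiency = 1 would be perfect linear scaling.
double efficiency(const Timing &t) { return speedup(t) / t.threads; }
```

For example, a sequential time of 2.0 s and a parallel time of 0.5 s on 8 threads gives a speedup of 4.0 and an efficiency of 0.5.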
A few representative rows from that file:
| N | threads | speedup (seq/par) | efficiency (speedup per thread) |
|---|---|---|---|
| 10 000 | 1 | 0.83 | 0.83 |
| 10 000 | 4 | 0.10 | 0.03 |
| 10 000 | 16 | 0.03 | 0.00 |
| 1 000 000 | 1 | 1.17 | 1.17 |
| 1 000 000 | 4 | 2.02 | 0.51 |
| 1 000 000 | 16 | 1.33 | 0.08 |
| 100 000 000 | 1 | 1.04 | 1.04 |
| 100 000 000 | 4 | 2.84 | 0.71 |
| 100 000 000 | 12 | 4.51 | 0.38 |
| 100 000 000 | 64 | 7.84 | 0.12 |
| 1 000 000 000 | 1 | 0.97 | 0.97 |
| 1 000 000 000 | 4 | 3.13 | 0.78 |
| 1 000 000 000 | 8 | 4.44 | 0.56 |
| 1 000 000 000 | 64 | 12.56 | 0.20 |
(Values rounded for readability.)
- **Small inputs do not benefit from parallelism.**
  - For `N = 10 000`, even 1 thread in the parallel version is slower than the straightforward sequential loop, with a speedup of about `0.83`.
  - As the thread count increases, overhead from thread creation, scheduling, and extra memory traffic dominates the tiny amount of work, so speedup drops further.
  - This is typical for small arrays: the critical path is too short relative to the parallel overhead.
- **Medium inputs achieve modest speedups.**
  - For `N = 1 000 000`, the best speedup is about `2.02`, with 4 threads.
  - Beyond 4 threads, speedup does not improve much; overhead and memory contention start to dominate.
  - The parallel algorithm is useful here, but scaling with thread count is far from linear.
- **Large inputs benefit from more threads, but efficiency falls.**
  - For `N = 100 000 000`, speedup continues to grow with thread count:
    - roughly `2.84` at 4 threads,
    - around `4.51` at 12 threads,
    - around `7.84` at 64 threads.
  - However, efficiency falls from about `0.71` at 4 threads to about `0.12` at 64 threads.
  - At this scale, the algorithm becomes memory-bound: more threads can hide latency and use more cores, but each thread spends much of its time waiting on memory, so adding threads produces diminishing returns.
- **Very large inputs show the highest speedups.**
  - For `N = 1 000 000 000`:
    - 4 threads give speedup ≈ `3.13` (efficiency ≈ `0.78`),
    - 8 threads give speedup ≈ `4.44` (efficiency ≈ `0.56`),
    - 64 threads give speedup ≈ `12.56` (efficiency ≈ `0.20`).
  - The work is large enough that thread and scheduling overhead is amortized, so the parallel implementation clearly beats the sequential baseline.
  - But even with 64 threads, the speedup is far from 64, again because of memory-bandwidth limits and synchronization.
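These diminishing returns are what a simple Amdahl-style model predicts: if only a fraction `f` of the runtime parallelizes perfectly (the rest being the sequential offset scan and memory-bound work), speedup is capped at `1 / (1 - f)` no matter how many threads are used. A sketch with an illustrative, not measured, value of `f`:

```cpp
// Amdahl's law: predicted speedup on p threads when a fraction f
// of the runtime is perfectly parallelizable.
double amdahlSpeedup(double f, int p) {
    return 1.0 / ((1.0 - f) + f / p);
}

// With an assumed f = 0.95, amdahlSpeedup gives roughly 3.5x at
// 4 threads, 5.9x at 8, and 15.4x at 64 -- the same flattening
// shape as the measured 3.13 / 4.44 / 12.56 above.
```

The model is deliberately crude (it ignores memory bandwidth entirely), but it shows why speedup grows sublinearly even when every pass over the data is parallel.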



