This project benchmarks a parallel prefix sum (exclusive scan) built on a reusable OpenMP loop helper and compares it against a sequential baseline. It also includes scripts to run benchmarks, generate plots, and produce a report-style PDF.
The focus of the performance discussion is the speedup of the parallel prefix sum over the sequential implementation. The same patterns and conclusions apply to other data-parallel algorithms, such as parallel mergesort.
You will need:

- A C and C++ compiler with OpenMP support (for example, `gcc` and `g++`)
- `make`
- Bash
- For plotting, either:
  - `gnuplot`, or
  - Python 3 with `matplotlib` and `pandas`
From the project root:
```sh
cd prefixsum-parallelization-study
make libgen.a
```

This builds the common static library used by both the sequential and parallel implementations. Most sub-Makefiles will auto-build `../libgen.a` if it is missing, but running it once at the root makes things explicit.
From the project root:
```sh
cd sequential
make
make bench   # or: bash ./queue.sh
```

What this does:

- Builds `prefixsum_seq` from `prefixsum_seq.cpp`.
- Runs `bench_sequential.sh`, which:
  - Reads `PREFIXSUM_NS` from `../params.sh`.
  - For each `N` in that list, runs `./prefixsum_seq N >/dev/null 2> result/prefixsum_seq_N`.
  - Each `result/prefixsum_seq_N` contains the total runtime in seconds, as written to `stderr`.
Files written:

- `sequential/result/prefixsum_seq_10000`
- `sequential/result/prefixsum_seq_1000000`
- `sequential/result/prefixsum_seq_100000000`
- `sequential/result/prefixsum_seq_1000000000`
Each file contains a single floating point number (time in seconds).
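For reference, the core of the sequential baseline is a single linear pass over the input. A minimal sketch (hypothetical names; the actual `prefixsum_seq.cpp` may differ in details):

```cpp
#include <cstddef>
#include <vector>

// Exclusive scan: out[i] = sum of in[0..i-1], so out[0] = 0 and
// out[n] holds the total sum of all n input elements.
std::vector<long long> exclusiveScanSeq(const std::vector<int> &in) {
    std::vector<long long> out(in.size() + 1);
    out[0] = 0;
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i + 1] = out[i] + in[i];
    return out;
}
```

In the real benchmark, the runtime measured around this loop is what ends up in the `result/prefixsum_seq_N` files via `stderr`.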
From the project root:
```sh
cd prefixsum
make        # builds ./prefixsum using OpenMP
make test   # optional: correctness and basic timing sanity
make bench  # or: bash ./queue.sh
```

Key parts of `prefixsum.cpp`:

- Data generation:

  ```cpp
  int *arr = new int[n];
  generatePrefixSumData(arr, n);
  ```

- Output array:

  ```cpp
  int *pr = new int[n + 1];
  pr[0] = 0;
  ```

- Parallel work is done in three passes using `OmpLoop`:
  1. A per-block scan to compute local prefix sums and block totals.
  2. A sequential scan over the block totals to compute per-block offsets.
  3. A parallel pass that adds each block's offset to its local prefix sums.

- The `OmpLoop` helper wraps an OpenMP `for` with dynamic scheduling and a user-settable chunk size:

  ```cpp
  OmpLoop loop;
  loop.setNbThread(nbthreads);
  loop.setGranularity(1);
  // Example usage:
  loop.parfor(0, n, B, [&](int start) {
      // work on range [start, start + B)
  });
  ```

- The program prints timing information only to `stderr` and the check status (`checked` or `incorrect`) to `stdout`.
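Put together, the three passes can be sketched with plain OpenMP pragmas in place of the `OmpLoop` helper. This is an illustrative reimplementation, not the project's actual code; the block size `B` and all names here are assumptions:

```cpp
#include <algorithm>
#include <vector>

// Block-wise exclusive scan in three passes, mirroring the scheme above.
std::vector<long long> parallelExclusiveScan(const std::vector<int> &in, int B) {
    const int n = static_cast<int>(in.size());
    const int nblocks = (n + B - 1) / B;
    std::vector<long long> pr(n + 1);
    std::vector<long long> totals(nblocks);

    // Pass 1: local exclusive scan within each block, plus the block total.
    #pragma omp parallel for schedule(dynamic)
    for (int b = 0; b < nblocks; ++b) {
        const int start = b * B;
        const int end = std::min(start + B, n);
        long long sum = 0;
        for (int i = start; i < end; ++i) {
            pr[i] = sum;      // local exclusive prefix
            sum += in[i];
        }
        totals[b] = sum;
    }

    // Pass 2: sequential exclusive scan over block totals -> per-block offsets.
    long long offset = 0;
    for (int b = 0; b < nblocks; ++b) {
        const long long t = totals[b];
        totals[b] = offset;
        offset += t;
    }
    pr[n] = offset;           // grand total

    // Pass 3: add each block's offset to its local prefix sums.
    #pragma omp parallel for schedule(dynamic)
    for (int b = 0; b < nblocks; ++b) {
        const int start = b * B;
        const int end = std::min(start + B, n);
        for (int i = start; i < end; ++i)
            pr[i] += totals[b];
    }
    return pr;
}
```

Pass 2 is the only sequential step, and it touches one value per block rather than one per element, which is why it stops mattering once `N` is large.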
Benchmarking is driven by `bench_prefixsum.sh`, which:

- Reads `PREFIXSUM_NS` and `THREADS` from `../params.sh`.
- For each combination of `N` and `t`, runs `./prefixsum N t >/dev/null 2> result/prefixsum_N_t`.

The parallel timings live in `prefixsum/result/` with filenames like:

- `prefixsum_10000_1`, `prefixsum_10000_4`, `prefixsum_10000_16`, `prefixsum_10000_64`
- `prefixsum_1000000_1`, `prefixsum_1000000_4`, `prefixsum_1000000_16`, `prefixsum_1000000_64`
- And so on, following the pattern `prefixsum_<N>_<threads>`.
After running both sequential and parallel benchmarks, you can generate plots.
From `prefixsum/`:

```sh
make plot
```

The `plot` target checks whether `gnuplot` is available:

- If `gnuplot` is found, it runs `plot.sh` and produces PDF plots.
- If `gnuplot` is not found, it falls back to Python: `python3 ./plot_py.py`.
To generate the PNG plots (which this README embeds), run:
```sh
cd prefixsum
python3 plots.py
```

`plots.py` will:

- Read the existing timing results.
- Build the plots if needed.
- Write the following PNGs to `prefixsum/plots/`:
  - `prefixsum_speedup_n.png`
  - `prefixsum_speedup_thread.png`
  - `prefixsum_time_vs_threads.png`
  - `prefixsum_efficiency_vs_threads.png`
It also writes PDFs with the same stems and a combined plots.pdf.
Once the PNGs exist in prefixsum/plots/, they can be embedded directly in this README.
This plot shows speedup as a function of the number of threads for each problem size N.
This plot shows speedup as a function of N (on a log scale) for each fixed thread count.
This plot shows the absolute parallel runtime (par_time) as a function of thread count for each N.
This plot shows parallel efficiency (speedup / threads) as a function of thread count for each N.
The CSV file prefixsum/plots/prefixsum_speedup_table.csv contains rows of the form:
N,threads,seq_time,par_time,speedup,efficiency
where:

- `seq_time` is the runtime of `sequential/prefixsum_seq` for that `N`.
- `par_time` is the runtime of `prefixsum/prefixsum` for that `N` and thread count.
- `speedup = seq_time / par_time`
- `efficiency = speedup / threads`
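The two derived columns follow directly from the raw times; a tiny helper makes the definitions concrete (illustrative code, not part of the project):

```cpp
struct Timing {
    double seq_time;  // seconds, from sequential/result/...
    double par_time;  // seconds, from prefixsum/result/...
    int threads;
};

// speedup > 1 means the parallel run beat the sequential baseline.
double speedup(const Timing &t) { return t.seq_time / t.par_time; }

// efficiency = 1 would be perfect linear scaling.
double efficiency(const Timing &t) { return speedup(t) / t.threads; }
```

For example, a sequential time of 2.0 s and a parallel time of 0.5 s on 8 threads gives a speedup of 4.0 and an efficiency of 0.5.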
A few representative rows from that file:
| N | threads | speedup (seq/par) | efficiency (speedup per thread) |
|---|---|---|---|
| 10 000 | 1 | 0.83 | 0.83 |
| 10 000 | 4 | 0.10 | 0.03 |
| 10 000 | 16 | 0.03 | 0.00 |
| 1 000 000 | 1 | 1.17 | 1.17 |
| 1 000 000 | 4 | 2.02 | 0.51 |
| 1 000 000 | 16 | 1.33 | 0.08 |
| 100 000 000 | 1 | 1.04 | 1.04 |
| 100 000 000 | 4 | 2.84 | 0.71 |
| 100 000 000 | 12 | 4.51 | 0.38 |
| 100 000 000 | 64 | 7.84 | 0.12 |
| 1 000 000 000 | 1 | 0.97 | 0.97 |
| 1 000 000 000 | 4 | 3.13 | 0.78 |
| 1 000 000 000 | 8 | 4.44 | 0.56 |
| 1 000 000 000 | 64 | 12.56 | 0.20 |
(Values rounded for readability.)
- **Small inputs do not benefit from parallelism.**
  - For `N = 10 000`, even 1 thread in the parallel version is slower than the straightforward sequential loop, with a speedup of about `0.83`.
  - As the thread count increases, overhead from thread creation, scheduling, and extra memory traffic dominates the tiny amount of work, so speedup drops further.
  - This is typical for small arrays: the critical path is too short relative to the parallel overhead.
- **Medium inputs achieve modest speedups.**
  - For `N = 1 000 000`, the best speedup is about `2.02`, with 4 threads.
  - Beyond 4 threads, speedup does not improve much; overhead and memory contention start to dominate.
  - The parallel algorithm is useful here, but scaling with thread count is far from linear.
- **Large inputs benefit from more threads, but efficiency falls.**
  - For `N = 100 000 000`, speedup continues to grow with thread count:
    - roughly `2.84` at 4 threads,
    - around `4.51` at 12 threads,
    - around `7.84` at 64 threads.
  - However, efficiency falls from about `0.71` at 4 threads to about `0.12` at 64 threads.
  - At this scale, the algorithm becomes memory-bound: more threads can hide latency and use more cores, but each thread spends much of its time waiting on memory, so adding threads produces diminishing returns.
- **Very large inputs show the highest speedups.**
  - For `N = 1 000 000 000`:
    - 4 threads give speedup ≈ `3.13` (efficiency ≈ `0.78`),
    - 8 threads give speedup ≈ `4.44` (efficiency ≈ `0.56`),
    - 64 threads give speedup ≈ `12.56` (efficiency ≈ `0.20`).
  - The work is large enough that thread and scheduling overhead is amortized, so the parallel implementation clearly beats the sequential baseline.
  - But even with 64 threads, the speedup is far from 64, again because of memory-bandwidth limits and synchronization.
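These diminishing returns are what a simple Amdahl-style model predicts: if only a fraction `f` of the runtime parallelizes perfectly (the rest being the sequential offset scan and memory-bound work), speedup is capped at `1 / (1 - f)` no matter how many threads are used. A sketch with an illustrative, not measured, value of `f`:

```cpp
// Amdahl's law: predicted speedup on p threads when a fraction f
// of the runtime is perfectly parallelizable.
double amdahlSpeedup(double f, int p) {
    return 1.0 / ((1.0 - f) + f / p);
}

// With an assumed f = 0.95, amdahlSpeedup gives roughly 3.5x at
// 4 threads, 5.9x at 8, and 15.4x at 64 -- the same flattening
// shape as the measured 3.13 / 4.44 / 12.56 above.
```

The model is deliberately crude (it ignores memory bandwidth entirely), but it shows why speedup grows sublinearly even when every pass over the data is parallel.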



