ak811/prefixsum-openmp-benchmark


Parallel Prefix Sum With OpenMP

This project benchmarks a parallel prefix sum (exclusive scan) built on a reusable OpenMP loop helper and compares it against a sequential baseline. It also includes scripts to run benchmarks, generate plots, and produce a report-style PDF.

The focus of the performance discussion is the speedup between the sequential and parallel implementations of prefix sum. The same patterns and conclusions apply to other data-parallel algorithms such as parallel mergesort.


Build prerequisites

You will need:

  • A C and C++ compiler with OpenMP support (for example, gcc and g++)
  • make
  • Bash
  • For plotting:
    • Either gnuplot, or
    • Python 3 with matplotlib and pandas

One-time setup

From the project root:

cd prefixsum-parallelization-study
make libgen.a

This builds the common static library used by both sequential and parallel implementations.

Most sub-Makefiles will automatically build ../libgen.a if it is missing, but running this once at the root makes the dependency explicit.


Sequential baseline

From the project root:

cd sequential
make
make bench        # or: bash ./queue.sh

What this does:

  • Builds prefixsum_seq from prefixsum_seq.cpp.
  • Runs bench_sequential.sh, which:
    • Reads PREFIXSUM_NS from ../params.sh.

    • For each N in that list, runs

      ./prefixsum_seq N >/dev/null 2> result/prefixsum_seq_N
    • Each result/prefixsum_seq_N contains the total runtime in seconds, captured from stderr.

Files written:

  • sequential/result/prefixsum_seq_10000
  • sequential/result/prefixsum_seq_1000000
  • sequential/result/prefixsum_seq_100000000
  • sequential/result/prefixsum_seq_1000000000

Each file contains a single floating point number (time in seconds).


Parallel prefix sum

From the project root:

cd prefixsum
make             # builds ./prefixsum using OpenMP
make test        # optional: correctness and basic timing sanity
make bench       # or: bash ./queue.sh

Key parts of prefixsum.cpp:

  • Data generation:

    int *arr = new int[n];
    generatePrefixSumData(arr, n);
  • Output array:

    int *pr = new int[n + 1];
    pr[0] = 0;
  • Parallel work is done in three passes using OmpLoop:

    1. Per block scan to compute local prefix sums and block totals.
    2. Sequential scan over block totals to compute offsets.
    3. Parallel pass that adds each block’s offset to its local prefix sums.

    The OmpLoop helper wraps an OpenMP parallel for loop with dynamic scheduling and a user-settable chunk size:

    OmpLoop loop;
    loop.setNbThread(nbthreads);
    loop.setGranularity(1);
    
    // Example usage:
    loop.parfor(0, n, B, [&](int start) {
        // work on range [start, start + B)
    });
  • The program prints timing information only to stderr and the check status (checked or incorrect) to stdout.

Benchmarking is driven by bench_prefixsum.sh, which:

  • Reads PREFIXSUM_NS and THREADS from ../params.sh.

  • For each combination of N and t:

    ./prefixsum N t >/dev/null 2> result/prefixsum_N_t

The parallel timings live in prefixsum/result/ with filenames like:

  • prefixsum_10000_1, prefixsum_10000_4, prefixsum_10000_16, prefixsum_10000_64
  • prefixsum_1000000_1, prefixsum_1000000_4, prefixsum_1000000_16, prefixsum_1000000_64
  • And so on, following the pattern prefixsum_<N>_<threads>.

Plots and derived data

After running both sequential and parallel benchmarks, you can generate plots.

Generating plots

From prefixsum/:

make plot

The plot target checks if gnuplot is available:

  • If gnuplot is found, it runs plot.sh and produces PDF plots.

  • If gnuplot is not found, it falls back to Python:

    python3 ./plot_py.py

To generate the PNG plots (which this README embeds), run:

cd prefixsum
python3 plots.py

plots.py will:

  • Read existing timing results.
  • Build the plots if needed.
  • Write the following PNGs to prefixsum/plots/:
    • prefixsum_speedup_n.png
    • prefixsum_speedup_thread.png
    • prefixsum_time_vs_threads.png
    • prefixsum_efficiency_vs_threads.png

It also writes PDFs with the same stems and a combined plots.pdf.


Embedded plots (PNG)

Once the PNGs exist in prefixsum/plots/, they can be embedded directly in this README.

Speedup vs threads

Prefix sum speedup vs threads

This plot shows speedup as a function of the number of threads for each problem size N.


Speedup vs problem size

Prefix sum speedup vs problem size

This plot shows speedup as a function of N (on a log scale) for each fixed thread count.


Parallel time vs threads

Prefix sum parallel time vs threads

This plot shows the absolute parallel runtime (par_time) as a function of thread count for each N.


Parallel efficiency vs threads

Prefix sum parallel efficiency vs threads

This plot shows parallel efficiency (speedup / threads) as a function of thread count for each N.


Measured speedup: sequential vs parallel prefix sum

The CSV file prefixsum/plots/prefixsum_speedup_table.csv contains rows of the form:

N,threads,seq_time,par_time,speedup,efficiency

where:

  • seq_time is the runtime of sequential/prefixsum_seq for that N.
  • par_time is the runtime of prefixsum/prefixsum for that N and thread count.
  • speedup = seq_time / par_time.
  • efficiency = speedup / threads.

A few representative rows from that file:

N              threads  speedup (seq/par)  efficiency (speedup per thread)
10 000         1        0.83               0.83
10 000         4        0.10               0.03
10 000         16       0.03               0.00
1 000 000      1        1.17               1.17
1 000 000      4        2.02               0.51
1 000 000      16       1.33               0.08
100 000 000    1        1.04               1.04
100 000 000    4        2.84               0.71
100 000 000    12       4.51               0.38
100 000 000    64       7.84               0.12
1 000 000 000  1        0.97               0.97
1 000 000 000  4        3.13               0.78
1 000 000 000  8        4.44               0.56
1 000 000 000  64       12.56              0.20

(Values rounded for readability.)

Key observations

  1. Small inputs do not benefit from parallelism

    • For N = 10 000, even 1 thread in the parallel version is slower than the straightforward sequential loop, with a speedup of about 0.83.
    • As thread count increases, overhead from thread creation, scheduling, and extra memory traffic dominates the tiny amount of work, so speedup drops further.
    • This is typical for small arrays: the critical path is too short compared to parallel overhead.
  2. Medium inputs achieve modest speedups

    • For N = 1 000 000, the best speedup is about 2.02 with 4 threads.
    • Beyond 4 threads, speedup does not improve much; overhead and memory contention start to dominate.
    • The parallel algorithm is useful here, but you do not get linear scaling with thread count.
  3. Large inputs benefit from more threads, but efficiency falls

    • For N = 100 000 000, speedup continues to grow with thread count:
      • Roughly 2.84 at 4 threads.
      • Around 4.51 at 12 threads.
      • Around 7.84 at 64 threads.
    • However, efficiency falls from about 0.71 at 4 threads to about 0.12 at 64 threads.
    • At this scale, the algorithm becomes memory bound: more threads can hide latency and use more cores, but each thread spends a lot of time waiting on memory, so adding threads produces diminishing returns.
  4. Very large inputs show the highest speedups

    • For N = 1 000 000 000:
      • 4 threads give speedup ≈ 3.13 (efficiency ≈ 0.78).
      • 8 threads give speedup ≈ 4.44 (efficiency ≈ 0.56).
      • 64 threads give speedup ≈ 12.56 (efficiency ≈ 0.20).
    • The work is large enough that thread and scheduling overhead are amortized, so the parallel implementation clearly beats the sequential baseline.
    • But even with 64 threads, the speedup is far from 64, again because of memory bandwidth limitations and synchronization.
