
Add benchmarking package with earth_ocean case#5317

Merged
simone-silvestri merged 37 commits into main from glw/benchmarks-again
Mar 18, 2026

Conversation

@glwagner
Member

Summary

  • New benchmarking/ directory with a benchmarking framework following the Breeze pattern
  • First canonical case: earth_ocean — HydrostaticFreeSurfaceModel with TripolarGrid, realistic ETOPO2022 bathymetry (via NumericalEarth), CATKE, TEOS10, SplitExplicitFreeSurface (30 substeps)
  • Single CLI entry point (run_benchmarks.jl) with ArgParse supporting parameter sweeps over grid size, float type, advection schemes, closures
  • JSON result accumulation and markdown report generation
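As a rough illustration, a parameter sweep driven by a CLI like this could be scripted as a dry run that just prints the commands it would launch. The flag names (`--size`, `--closure`) are taken from examples later in this thread; the rest of the interface is an assumption, not the script's actual API:

```shell
# Dry-run sketch of a parameter sweep over run_benchmarks.jl invocations.
# Prints the commands instead of launching Julia, so no GPU or Julia
# installation is needed to inspect the sweep.
CMDS=""
for size in 72x36x10 360x180x50; do
    for closure in nothing CATKE; do
        CMDS="${CMDS}julia --project run_benchmarks.jl --size=${size} --closure=${closure}
"
    done
done
printf '%s' "$CMDS"
```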

Test plan

  • Tested --closure=nothing on CPU with 72x36x10 grid
  • Tested --closure=CATKE on CPU with 72x36x10 grid
  • Test on GPU
  • Test parameter sweep mode (multiple sizes/closures)

🤖 Generated with Claude Code

New `benchmarking/` directory implementing a benchmarking framework
following the Breeze pattern. Includes:

- `run_benchmarks.jl`: CLI entry point with ArgParse supporting
  parameter sweeps (grid size, float type, advection, closure)
- `earth_ocean` case: HydrostaticFreeSurfaceModel with TripolarGrid,
  realistic ETOPO2022 bathymetry via NumericalEarth, CATKE vertical
  mixing, seawater buoyancy (TEOS10), SplitExplicitFreeSurface (30
  substeps), WENO momentum/tracer advection
- Benchmark and simulate modes with JSON result accumulation
- System metadata capture (Julia version, GPU info, etc.)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…using statements

Replace custom synchronize_device with Oceananigans' sync_device!, remove
unnecessary qualified imports for symbols already exported from Oceananigans,
and simplify GPU type checking in metadata.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Collaborator

@simone-silvestri left a comment


Nice, probably good to merge this before all the optimizations.
How would we run this to automatically upload the results? Would it be possible to also use nsight to identify the bottlenecks?

@glwagner
Member Author

Nice, probably good to merge this before all the optimizations. How would we run this to automatically upload the results? Would it be possible to also use nsight to identify the bottlenecks?

From what I saw, we may want to separate benchmarks and profiling as two different workflows... the benchmarks use a GHA tool. But I think we could adapt the benchmarks to also be useful for profiling, and then add an additional workflow for the profiles. What do you think?

momentum_advection = WENOVectorInvariant(order=9),
tracer_advection = WENO(order=7),
closure = CATKEVerticalDiffusivity(),
timestepper = :QuasiAdamsBashforth2)
Member Author


Probably should use RK3 by default, right @simone-silvestri?

major_basins = 2
)

grid = ImmersedBoundaryGrid(underlying_grid, GridFittedBottom(bottom_height);
Member Author


Let's do PartialCellBottom? What do you think @simone-silvestri?

Collaborator


Yeah, let's use partial cells; we ought to start using them.

@simone-silvestri
Collaborator

Ok, so benchmarks will provide an overall timing of the total time step and profiling will go more in depth?

@glwagner
Member Author

glwagner commented Feb 20, 2026

Ok, so benchmarks will provide an overall timing of the total time step and profiling will go more in depth?

Yes, although actually I am not sure this is required.

Help me understand what it looks like to use nsight. For example, here is the workflow in Breeze:

  run-benchmarks:
    name: "Run benchmarks - dynamics ${{ matrix.dynamics }} - microphysics ${{ matrix.microphysics }} - grid ${{ matrix.grid }}"
    permissions:
      actions: write
      contents: write
      pull-requests: read
      statuses: write
    runs-on: aws-linux-nvidia-gpu-l4
    strategy:
      fail-fast: false
      matrix:
        include:
          - dynamics: 'anelastic'
            microphysics: 'nothing'
            grid: '128^3, 512x512x256, 768x768x256'
          - dynamics: 'compressible_splitexplicit'
            microphysics: 'nothing'
            grid: '512x512x256'
          - dynamics: 'anelastic'
            microphysics: 'MixedPhaseEquilibrium, 1M_MixedEquilibrium, 1M_MixedNonEquilibrium'
            grid: '512x512x256'
    defaults:
      run:
        shell: bash
        working-directory: ./benchmarking
    container:
      image: 'ghcr.io/numericalearth/breeze-docker-images:benchmarking-julia_1.12.4'
      options: --gpus=all
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v6
      - name: Instantiate benchmarking environment
        shell: julia --color=yes --project {0}
        run: |
          using Pkg
          Pkg.instantiate()
      - name: Run benchmarks
        timeout-minutes: 45
        run: |
          earlyoom -m 3 -s 100 -r 300 --prefer 'julia' &
          ARGS=()
          if [[ -n "${{ matrix.dynamics }}" ]]; then
              ARGS+=(--dynamics "${{ matrix.dynamics }}")
          fi
          if [[ -n "${{ matrix.microphysics }}" ]]; then
              ARGS+=(--microphysics "${{ matrix.microphysics }}")
          fi
          julia --color=yes --project run_benchmarks.jl --device GPU "${ARGS[@]}" --size "${{ matrix.grid }}"
      - uses: actions/upload-artifact@v6
        with:
          name: 'benchmarks-dynamics_${{ matrix.dynamics }}-microphysics_${{ matrix.microphysics }}-grid_${{ matrix.grid }}'
          path: benchmarking/benchmark_results.*
          retention-days: 90
          overwrite: false
      - name: Create job summary
        timeout-minutes: 1
        run:
          cat "benchmark_results.md" >> "${GITHUB_STEP_SUMMARY}"

@glwagner
Member Author

I renamed the benchmark directory to legacy_benchmark. I think we should delete it only once we have distributed benchmarking.

@simone-silvestri
Collaborator

To use nsight we just need to wrap `julia --project` with `nsys profile --trace=cuda --output=filename`. This will create a report we can inspect, which will give us accurate timings for all kernels (on the GPU only).
This is how the benchmark pipeline is set up at the moment:

      # Profile each benchmark group, save output in txt and remove profiles
      for BENCHMARK_GROUP in "\${BENCHMARK_GROUPS[@]}"; do
        # Run benchmarks
        export BENCHMARK_GROUP
        OUTPUT_PREFIX="\${BENCHMARK_GROUP}_output"

        $NSYS profile --output=\${OUTPUT_PREFIX} --force-overwrite true --trace=cuda $TARTARUS_HOME/julia-$JULIA_VERSION/bin/julia --color=yes --project --check-bounds=no test/benchmark_tests.jl
        $NSYS stats \${OUTPUT_PREFIX}.nsys-rep > \${OUTPUT_PREFIX}.txt

        # Remove generated output files
        rm \${OUTPUT_PREFIX}.nsys-rep
        rm \${OUTPUT_PREFIX}.sqlite
      done

`nsys stats` converts the report file (with extension .nsys-rep) into a file that can be parsed and read through.

@glwagner
Member Author

do we also get an accurate benchmark when we do that?

@glwagner
Member Author

I would argue that nsight is for profiling not benchmarking. But if we can do both profiling and benchmarking simultaneously, that's great. However, I would suspect that we actually want to separate the two. For example, we want to benchmark many more cases than what we want to profile (it would produce too much information).

@glwagner
Member Author

@giordano let us know if you have opinions

@glwagner
Member Author

I think separating benchmarking and profiling is the right approach. They have different goals and constraints:

  1. Overhead: nsys profile instruments CUDA calls, which adds overhead. Even if small, it contaminates the wall-clock timings we're trying to measure. Benchmark numbers should be clean and comparable across runs/commits.

  2. Scope: We want to benchmark many configurations (resolution sweeps, closures, timesteppers) to track regressions. Profiling all of those would produce overwhelming amounts of data. Profiling is most useful on 1-2 canonical cases to identify kernel-level bottlenecks.

  3. Output format: Benchmarks produce simple JSON/markdown with aggregate timings (time per step, grid points/s). Profiles produce .nsys-rep files converted via nsys stats into ranked kernel-time tables — fundamentally different artifacts.

  4. Workflow: Benchmarks should run on every PR (or on-demand) to catch regressions. Profiling is more investigative — run it when actively optimizing, not on every commit.

Practical suggestion: The benchmark cases (earth_ocean, etc.) can be shared between both workflows. run_benchmarks.jl stays focused on timing. A separate run_profiles.jl (or a --profile flag) would set up the same model but run fewer time steps and skip timing/reporting, since nsight handles that externally via nsys profile --trace=cuda julia --project run_profiles.jl. The GHA workflows would be two separate jobs.

That way nsight profiling wraps the same canonical cases but doesn't pollute benchmark numbers.

Collaborator

@giordano left a comment


We also need to set up the CI infrastructure. How would this work? Run on some of the buildkite runners? If so, we then need to add a GitHub action to fetch the artifact produced by the buildkite run (example from CUDA.jl, for reference) and publish it to a website (like we do in Breeze.jl). I'm happy to help with this stuff.

For the record, this would fix #4851.

Comment on lines +19 to +24
function benchmark_time_stepping(model;
time_steps = 100,
Δt = 60,
warmup_steps = 10,
name = "benchmark",
verbose = true)
Collaborator


There's indeed lots of duplication with the Breeze benchmark suite, but also some small but crucial differences. For example, just today I added extra arguments to this function, but some of them, like microphysics, I don't think would make sense for Oceananigans, so we'd need some abstraction (perhaps just allowing the user to supply a dictionary for "extra info"?).

I'd need to have a closer look at this to detect the differences, but I'd suggest we keep them separate for the time being to see how they evolve, and in a few weeks (when hopefully they'll have matured a little) we can try to combine them.

Member Author


This code here is probably the main part that can be shared, right?

We also have to think about coupled modeling; in that case there would be two or more Simulations coupled together, so the total number of grid points is the sum over the component Simulations.
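For the coupled case, the throughput metric could sum grid points across components before dividing by wall time. This is only a sketch with invented component names and numbers, not the framework's actual bookkeeping:

```shell
# Hypothetical throughput calculation for a coupled setup: total grid
# points are summed across component simulations (component names and
# sizes below are invented for illustration).
ocean_points=$((360 * 180 * 50))
ice_points=$((360 * 180 * 1))
total_points=$((ocean_points + ice_points))
time_per_step_s="0.25"   # example wall time per coupled step, in seconds
# grid points per second; awk handles the non-integer time step
throughput=$(awk -v p="$total_points" -v t="$time_per_step_s" 'BEGIN { printf "%d", p / t }')
echo "$throughput grid points per second"
```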

Collaborator

@giordano commented Feb 20, 2026


This code here is probably the main part that can be shared, right?

Well, as I said above this function was changed in Breeze.jl just today in a way that's not useful as-is for Oceananigans. But it points to the need for an abstraction.

Member Author


ah, I didn't see those changes

Member Author


I think those extra arguments could be wrapped up in a generic "model_config" object that is specific to the model being benchmarked perhaps

Collaborator


Yes, as I said above it could just be a dictionary, easy peasy.

Member Author


(perhaps just allowing the user to supply a dictionary for "extra info"?).

OK, I see this now. I didn't see it on the first read.

@glwagner
Member Author

We also need to set up the CI infrastructure. How would this work? Run on some of the buildkite runners? If so, we then need to add a GitHub action to fetch the artifact produced by the buildkite run (example from CUDA.jl, for reference) and publish it to a website (like we do in Breeze.jl). I'm happy to help with this stuff.

For the record, this would fix #4851.

I believe we basically want to refactor this pipeline: https://github.com/CliMA/Oceananigans.jl/blob/main/.buildkite/pipeline-benchmarks.yml

which uses an agent on a dedicated GPU on tartarus.

@simone-silvestri
Collaborator

do we also get an accurate benchmark when we do that?

yes, with the added benefit of having more information that can be used (without any overhead!)

I would argue that nsight is for profiling not benchmarking. But if we can do both profiling and benchmarking simultaneously, that's great. However, I would suspect that we actually want to separate the two. For example, we want to benchmark many more cases than what we want to profile (it would produce too much information).

Why is that? There is more information (why would that be a negative?), and you decide what information to keep, especially because it comes at no extra cost.

The best and most reliable way to benchmark on a cuda gpu is to add an NVTX range around the time step and profile it with nsys.

@glwagner
Member Author

glwagner commented Feb 21, 2026

do we also get an accurate benchmark when we do that?

yes, with the added benefit of having more information that can be used (without any overhead!)

I would argue that nsight is for profiling not benchmarking. But if we can do both profiling and benchmarking simultaneously, that's great. However, I would suspect that we actually want to separate the two. For example, we want to benchmark many more cases than what we want to profile (it would produce too much information).

Why is that? There is more information (why would that be a negative?), and you decide what information to keep, especially because it comes at no extra cost.

The best and most reliable way to benchmark on a cuda gpu is to add an NVTX range around the time step and profile it with nsys.

Ok, so you think that it's ok to generate e.g. O(100) nsight reports (as an example: consider benchmarks all across NumericalEarth)? And then we have some logic to figure out which ones we want to upload?

It just sounds more complicated than having a simple benchmark pipeline to do that, and then profile a couple of cases.

@simone-silvestri
Collaborator

simone-silvestri commented Feb 21, 2026

Ok, so you think that it's ok to generate e.g. O(100) nsight reports (as an example: consider benchmarks all across NumericalEarth)?

How would you be able to do that on one GPU at the same time without polluting the benchmarking results?
Sorry, but I really do not understand the difference between doing `nsys profile julia --project` and just `julia --project`.

@simone-silvestri
Collaborator

have some logic to figure out which ones we want to upload?

Where is the difficulty in that? The result is one number.

@glwagner
Member Author

Ok, so you think that it's ok to generate e.g. O(100) nsight reports (as an example: consider benchmarks all across NumericalEarth)?

How would you be able to do that on one GPU at the same time without polluting the benchmarking results? Sorry, but I really do not understand the difference between doing `nsys profile julia --project` and just `julia --project`.

We have a setup where we run benchmarks on command line, like

julia --project run_benchmarks --options...

This saves data into a JSON file, automatically appending each subsequent case.

After all the jobs run, the results are sent over to another package, where they are served on the website.

We can encode in the workflow some logic about which cases we definitely want to upload a profile for. If it really is no extra cost then it is not wasteful. But since the benchmarks are cheap, it might be easier to have a separate workflow line that does the profile, and then we upload all of the profiles.

It doesn't really matter to me, honestly. Just trying to make life easy.
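The append-style JSON accumulation described above could look like the following pure-shell sketch. The file name and field names are invented for illustration, not the framework's actual schema:

```shell
# Append one result per benchmark case to a JSON-lines file, then pick
# which entries to publish. File name and field names are hypothetical.
RESULTS=benchmark_results.jsonl
: > "$RESULTS"
append_result() {
    # $1 = case name, $2 = seconds per time step
    printf '{"case": "%s", "seconds_per_step": %s}\n' "$1" "$2" >> "$RESULTS"
}
append_result earth_ocean_72x36x10 0.012
append_result earth_ocean_360x180x50 0.310
# Select only the cases we want to upload a profile for, e.g. one canonical case.
upload=$(grep '"case": "earth_ocean_360x180x50"' "$RESULTS")
echo "$upload"
rm "$RESULTS"
```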

@glwagner
Member Author

have some logic to figure out which ones we want to upload?

Where is the difficulty in that? The result is one number

Not sure what you mean. We need to run many benchmarks: hydrostatic and nonhydrostatic, immersed and not, different time-steppers, closures, advection schemes, etc.

@simone-silvestri
Collaborator

Not sure what you mean. We need to run many benchmarks: hydrostatic and nonhydrostatic, immersed and not, different time-steppers, closures, advection schemes, etc.

nsys produces a report with the timing for each kernel. It also has the option to wrap parts of the code with a custom NVTX range that measures the desired timings. Every run will generate a report from which the timings can be easily extracted (a grep with the kernel name / custom range name is enough). If the main time loop is the only result we are interested in, that is the only one that needs to be uploaded. However, as a bonus (if it is desired) we can also upload the timings of the largest kernels. I prefer to have a complete picture of what is going on, which means uploading at least the largest kernels. As a further bonus, it is possible to wrap other parts of the code with NVTX (such as I/O, for example) to see if there is any regression.
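The grep-style extraction described above could look like this sketch. The `stats_sample.txt` content is an invented stand-in for the real `nsys stats` table layout (which differs in detail); only the extraction pattern is the point:

```shell
# Extract one timing from nsys-stats-style text output by matching an
# NVTX range name. The sample table below is illustrative, not the
# actual nsys stats format.
cat > stats_sample.txt <<'EOF'
 Time (%)  Total Time (ns)  Instances  Range
 --------  ---------------  ---------  -----------
     92.1       1234567890        100  time_step!
      4.3         57600000        100  fill_halo!
EOF
# Pull the total time (ns) for the range we care about.
step_ns=$(awk '$NF == "time_step!" { print $2 }' stats_sample.txt)
echo "time_step! total: ${step_ns} ns"
rm stats_sample.txt
```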

@glwagner
Member Author

Not sure what you mean. We need to run many benchmarks: hydrostatic and nonhydrostatic, immersed and not, different time-steppers, closures, advection schemes, etc.

nsys produces a report with the timing for each kernel. It also has the option to wrap parts of the code with a custom NVTX range that measures the desired timings. Every run will generate a report from which the timings can be easily extracted (a grep with the kernel name / custom range name is enough). If the main time loop is the only result we are interested in, that is the only one that needs to be uploaded. However, as a bonus (if it is desired) we can also upload the timings of the largest kernels. I prefer to have a complete picture of what is going on, which means uploading at least the largest kernels. As a further bonus, it is possible to wrap other parts of the code with NVTX (such as I/O, for example) to see if there is any regression.

But just to be clear, you only want profiles for a few cases right? Not every single case we will run?

@simone-silvestri
Collaborator

I can give it a try and set it up so that we have more info for each benchmark run; it would only be possible on GPUs, though.

@glwagner
Member Author

I can give it a try and set it up so that we have more info for each benchmark run; it would only be possible on GPUs, though.

It's important that we can benchmark on other systems as well, like AMD, or using Reactant.

@simone-silvestri
Collaborator

But just to be clear, you only want profiles for a few cases right? Not every single case we will run?

Well, if we upload data on the benchmarks it would be nice to have info on the kernel timings in addition to the total time per time step.

@glwagner
Member Author

But just to be clear, you only want profiles for a few cases right? Not every single case we will run?

Well, if we upload data on the benchmarks it would be nice to have info on the kernel timings in addition to the total time per time step.

If there is no problem storing ~1 GB for each CI run or commit to main, that will work and scale to the infrastructure we are envisioning, assuming that each profile is ~20 MB.

@simone-silvestri
Collaborator

I think it might be fixed (we need to wait ~5 hrs for the benchmarks to run to find out).

@giordano
Collaborator

giordano commented Mar 9, 2026

This seems to be working now, but I still believe we can save quite a lot of time by running the benchmarks in the same session, avoiding a lot of latency.

echo " Group 6: Tracer count sweep (1/2 deg)"
echo "========================================"
$JULIA $JULIA_ARGS $BENCHMARK_SCRIPT --size=720x360x50 --tracers="T,S,C1" --output=$OUTPUT_FILE
$JULIA $JULIA_ARGS $BENCHMARK_SCRIPT --size=720x360x50 --tracers="T,S,C1,C2,C3" --output=$OUTPUT_FILE
Collaborator


Indentation of this line was off

Suggested change
$JULIA $JULIA_ARGS $BENCHMARK_SCRIPT --size=720x360x50 --tracers="T,S,C1,C2,C3" --output=$OUTPUT_FILE
$JULIA $JULIA_ARGS $BENCHMARK_SCRIPT --size=720x360x50 --tracers="T,S,C1,C2,C3" --output=$OUTPUT_FILE

@simone-silvestri
Collaborator

It crashed when I tried; I don't think it is set up correctly. I need to fix something to make it work, but I wanted to finish pushing to the repo to see the online benchmarks first.

@glwagner
Member Author

What work remains to merge this? @simone-silvestri @giordano

@giordano
Collaborator

giordano commented Mar 17, 2026

We just had a chat to discuss the way forward 😃 I think Simone is going to push some fixes soon

@glwagner
Member Author

OK. Also, we can merge this without the NVTX profiles; I think it's fine to continue the work in a new PR. Just having basic benchmarking will be useful.

@simone-silvestri
Collaborator

Yeah, the problem is actually the pushing to the repository and the setup of the pipeline (some small details). I am on it, with some help from @giordano.

@simone-silvestri
Collaborator

OK, the summary is produced: https://github.com/CliMA/Oceananigans.jl/actions/runs/23208086564?pr=5317
Let's merge this and see what happens to the website!

@simone-silvestri merged commit 3458de0 into main Mar 18, 2026
3 of 8 checks passed
@simone-silvestri deleted the glw/benchmarks-again branch March 18, 2026 07:01
@navidcy
Member

navidcy commented Mar 18, 2026

Is the Performance benchmarks Docs section now outdated or obsolete? The link to https://github.com/CliMA/Oceananigans.jl/tree/main/benchmark is definitely broken (and causes all docs builds to fail).

@navidcy
Member

navidcy commented Mar 19, 2026

Why did we keep the legacy_benchmarks? Perhaps just delete them? They will live in GitHub history if we want to come back to them.

@glwagner
Member Author

I think we can delete them. It was just a temporary change in the course of development.

@navidcy
Member

navidcy commented Mar 19, 2026

I think we can delete them. It was just a temporary change in the course of development.

2f35d80

briochemc added a commit to briochemc/Oceananigans.jl that referenced this pull request Mar 21, 2026
* glw/prescribed-free-surface:
  Clean up update_prescribed_∂t_σ! method signatures
  Fix ∂t_σ = 0 bug for PrescribedFreeSurface + ZStarCoordinate
  Revise citation for De Abreu and Timmermans (CliMA#5411)
  Add benchmarking package with earth_ocean case (CliMA#5317)
  (0.105.5) Add unit test for zero momentum flux at immersed peripheral nodes (CliMA#5402)
  (0.105.5) Fix `initialize_closure_fields!()` use with `DynamicSmagorinsky` (CliMA#5366)
  Fix the reference list in the Langmuir example (CliMA#5400)
@giordano linked an issue Mar 27, 2026 that may be closed by this pull request

Labels

benchmark performance: runs preconfigured benchmarks and spits out timing


Development

Successfully merging this pull request may close these issues.

Collect benchmark results and host them on a website

5 participants