Add benchmarking package with earth_ocean case #5317
Conversation
New `benchmarking/` directory implementing a benchmarking framework following the Breeze pattern. Includes:

- `run_benchmarks.jl`: CLI entry point with ArgParse supporting parameter sweeps (grid size, float type, advection, closure)
- `earth_ocean` case: HydrostaticFreeSurfaceModel with TripolarGrid, realistic ETOPO2022 bathymetry via NumericalEarth, CATKE vertical mixing, seawater buoyancy (TEOS10), SplitExplicitFreeSurface (30 substeps), WENO momentum/tracer advection
- Benchmark and simulate modes with JSON result accumulation
- System metadata capture (Julia version, GPU info, etc.)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
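For readers unfamiliar with the pattern, an ArgParse-based entry point like the one described might be sketched as follows (the argument names beyond those listed above, and the sweep over comma-separated sizes, are assumptions for illustration, not the actual `run_benchmarks.jl`):

```julia
using ArgParse  # CLI parsing, as in the Breeze benchmarking pattern

function parse_benchmark_args()
    settings = ArgParseSettings(description = "Oceananigans benchmark sweeps")
    @add_arg_table! settings begin
        "--size"
            help = "comma-separated grid sizes, e.g. 72x36x10,360x180x50"
            default = "72x36x10"
        "--float-type"
            help = "Float32 or Float64"
            default = "Float64"
        "--closure"
            help = "turbulence closure, e.g. CATKE or nothing"
            default = "CATKE"
    end
    return parse_args(settings)
end

args = parse_benchmark_args()
# Sweep over every requested grid size
for size_str in split(args["size"], ",")
    Nx, Ny, Nz = parse.(Int, split(size_str, "x"))
    @info "Benchmarking grid" Nx Ny Nz
end
```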
…using statements

Replace custom synchronize_device with Oceananigans' sync_device!, remove unnecessary qualified imports for symbols already exported from Oceananigans, and simplify GPU type checking in metadata.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
simone-silvestri
left a comment
Nice, probably good to merge this before all the optimizations.
How would we run this to automatically upload the results? Would it be possible to also use nsight to be able to compute the bottlenecks?
From what I saw, we may want to separate benchmarks and profiling into two different workflows... the benchmarks use a GHA tool. But I think we could adapt the benchmarks to also be useful for profiling, and then add an additional workflow for the profiles. What do you think?
benchmarking/src/earth_ocean.jl
Outdated
```julia
momentum_advection = WENOVectorInvariant(order=9),
tracer_advection = WENO(order=7),
closure = CATKEVerticalDiffusivity(),
timestepper = :QuasiAdamsBashforth2)
```
Probably should use RK3 by default, right @simone-silvestri?
benchmarking/src/earth_ocean.jl
Outdated
```julia
major_basins = 2
)

grid = ImmersedBoundaryGrid(underlying_grid, GridFittedBottom(bottom_height);
```
Let's do PartialCellBottom? What do you think @simone-silvestri
Yeah let's use Partial cells, we ought to start using them.
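For reference, the switch being suggested is a one-line change. A sketch, reusing the `underlying_grid` and `bottom_height` from the diff above; the keyword shown is Oceananigans' `minimum_fractional_cell_height` (its default value here is an assumption to illustrate the knob):

```julia
# PartialCellBottom shrinks the bottommost cell to follow the bathymetry,
# instead of stair-stepping it like GridFittedBottom
grid = ImmersedBoundaryGrid(underlying_grid,
                            PartialCellBottom(bottom_height; minimum_fractional_cell_height=0.2))
```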
Ok, so benchmarks will provide an overall timing of the total time step, and profiling will go more in depth?
Yes, although actually I am not sure this is required. Help me understand what it looks like to use nsight. For example, here is the workflow in Breeze:

```yaml
run-benchmarks:
  name: "Run benchmarks - dynamics ${{ matrix.dynamics }} - microphysics ${{ matrix.microphysics }} - grid ${{ matrix.grid }}"
  permissions:
    actions: write
    contents: write
    pull-requests: read
    statuses: write
  runs-on: aws-linux-nvidia-gpu-l4
  strategy:
    fail-fast: false
    matrix:
      include:
        - dynamics: 'anelastic'
          microphysics: 'nothing'
          grid: '128^3, 512x512x256, 768x768x256'
        - dynamics: 'compressible_splitexplicit'
          microphysics: 'nothing'
          grid: '512x512x256'
        - dynamics: 'anelastic'
          microphysics: 'MixedPhaseEquilibrium, 1M_MixedEquilibrium, 1M_MixedNonEquilibrium'
          grid: '512x512x256'
  defaults:
    run:
      shell: bash
      working-directory: ./benchmarking
  container:
    image: 'ghcr.io/numericalearth/breeze-docker-images:benchmarking-julia_1.12.4'
    options: --gpus=all
  timeout-minutes: 30
  steps:
    - uses: actions/checkout@v6
    - name: Instantiate benchmarking environment
      shell: julia --color=yes --project {0}
      run: |
        using Pkg
        Pkg.instantiate()
    - name: Run benchmarks
      timeout-minutes: 45
      run: |
        earlyoom -m 3 -s 100 -r 300 --prefer 'julia' &
        ARGS=()
        if [[ -n "${{ matrix.dynamics }}" ]]; then
          ARGS+=(--dynamics "${{ matrix.dynamics }}")
        fi
        if [[ -n "${{ matrix.microphysics }}" ]]; then
          ARGS+=(--microphysics "${{ matrix.microphysics }}")
        fi
        julia --color=yes --project run_benchmarks.jl --device GPU "${ARGS[@]}" --size "${{ matrix.grid }}"
    - uses: actions/upload-artifact@v6
      with:
        name: 'benchmarks-dynamics_${{ matrix.dynamics }}-microphysics_${{ matrix.microphysics }}-grid_${{ matrix.grid }}'
        path: benchmarking/benchmark_results.*
        retention-days: 90
        overwrite: false
    - name: Create job summary
      timeout-minutes: 1
      run:
        cat "benchmark_results.md" >> "${GITHUB_STEP_SUMMARY}"
```
I renamed the …

To use nsight, we just need to wrap the …

Do we also get an accurate benchmark when we do that?

I would argue that nsight is for profiling, not benchmarking. But if we can do both profiling and benchmarking simultaneously, that's great. However, I would suspect that we actually want to separate the two. For example, we want to benchmark many more cases than we want to profile (it would produce too much information).

@giordano let us know if you have opinions

I think separating benchmarking and profiling is the right approach. They have different goals and constraints: …

Practical suggestion: the benchmark cases (…) That way nsight profiling wraps the same canonical cases but doesn't pollute benchmark numbers.
giordano
left a comment
We also need to set up the CI infrastructure. How would this work? Run on some of the buildkite runners? If so, we then need to add a GitHub action to fetch the artifact produced by the buildkite run (example from CUDA.jl, for reference) and publish it to a website (like we do in Breeze.jl). I'm happy to help with this stuff.
For the record, this would fix #4851.
```julia
function benchmark_time_stepping(model;
                                 time_steps = 100,
                                 Δt = 60,
                                 warmup_steps = 10,
                                 name = "benchmark",
                                 verbose = true)
```
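For readers following along, a helper with this signature typically warms up, then times a fixed number of steps. This is only a sketch of the shape of such a function, not the actual implementation; a real GPU version would also call Oceananigans' `sync_device!` before and after the timed loop so that asynchronous kernels are fully counted:

```julia
using Oceananigans

function benchmark_time_stepping(model; time_steps = 100, Δt = 60,
                                 warmup_steps = 10, name = "benchmark", verbose = true)
    for _ in 1:warmup_steps
        time_step!(model, Δt)      # trigger compilation outside the timed region
    end

    elapsed = @elapsed for _ in 1:time_steps
        time_step!(model, Δt)
    end

    time_per_step = elapsed / time_steps
    verbose && @info "$name: $(time_per_step) s per time step"
    return time_per_step
end
```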
There's indeed lots of duplication with the Breeze benchmark suite, but also some small but crucial differences. For example, just today I added extra arguments to this function, and some of them, like microphysics, wouldn't make sense for Oceananigans, so we'd need some abstraction (perhaps just allowing the user to supply a dictionary for "extra info"?).
I'd need to have a closer look at this to pin down the differences, but I'd suggest we keep them separate for the time being to see how they evolve, and in a few weeks (when hopefully they've matured a little bit) we try to combine them.
This code here is probably the main part that can be shared, right?
Also, we have to think about coupled modeling; in that case there would be two or more Simulations coupled together, so the total number of grid points is the sum over the component Simulations.
> This code here is probably the main that can be shared, right?
Well, as I said above this function was changed in Breeze.jl just today in a way that's not useful as-is for Oceananigans. But it points to the need for an abstraction.
ah, I didn't see those changes
I think those extra arguments could be wrapped up in a generic "model_config" object that is specific to the model being benchmarked perhaps
Yes, as I said above it could just be a dictionary, easy peasy.
> (perhaps just allowing the user to supply a dictionary for "extra info"?).
ok I see this now. I didn't see it on the first read.
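The "extra info" dictionary floated above could look something like this (a sketch only; none of these function or key names exist in either package yet):

```julia
# Shared benchmark record, with model-specific fields supplied by the caller
function benchmark_record(name, time_per_step; extra_info = Dict{String,Any}())
    record = Dict{String,Any}(
        "name" => name,
        "time_per_step" => time_per_step,
        "julia_version" => string(VERSION),
    )
    # Breeze might pass Dict("microphysics" => "1M_MixedEquilibrium");
    # Oceananigans might pass Dict("closure" => "CATKE", "free_surface" => "SplitExplicit")
    return merge!(record, extra_info)
end
```

The shared timing machinery then never needs to know about package-specific options; each benchmark suite attaches whatever metadata it cares about.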
I believe we basically want to refactor this pipeline: https://github.com/CliMA/Oceananigans.jl/blob/main/.buildkite/pipeline-benchmarks.yml, which uses an agent on a dedicated GPU on tartarus.
yes, with the added benefit of having more information that can be used (without any overhead!)
Why is that? The information is richer (why would that be a negative?) and you decide what information to keep, especially because it comes at no extra cost. The best and most reliable way to benchmark on a CUDA GPU is to add an NVTX range around the time step and profile it with nsys.
Ok, so you think it's ok to generate, e.g., O(100) nsight reports (as an example, consider benchmarks all across NumericalEarth)? And then we have some logic to figure out which ones we want to upload? It just sounds more complicated than having a simple benchmark pipeline do that, and then profiling a couple of cases.
How would you be able to do that on one GPU at the same time without polluting the benchmarking results?
Where is the difficulty in that? The result is one number.
We have a setup where we run benchmarks on the command line, like `julia --project run_benchmarks.jl --options...`; this saves data into a JSON file, automatically appending each subsequent case. After all the jobs run, the results are sent over to another package, where they are served on the website. We can encode in the workflow some logic about which cases we want to upload a profile for, sure. If it really is no extra cost then it is not wasteful. But since the benchmarks are cheap, it might be easier to just have a separate workflow that does the profiling, and then we upload all of the profiles. It doesn't really matter to me honestly, just trying to make life easy.
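The append-to-JSON accumulation described here could be sketched as follows (assuming JSON3.jl for serialization; the actual file layout used by `run_benchmarks.jl` may differ):

```julia
using JSON3

# Append one benchmark result to a growing JSON array on disk
function append_result!(path, result)
    # Read any existing results, or start a fresh array
    results = isfile(path) ? collect(Any, JSON3.read(read(path, String))) : Any[]
    push!(results, result)
    open(path, "w") do io
        JSON3.write(io, results)
    end
    return results
end
```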
Not sure what you mean. We need to run many benchmarks: hydrostatic and nonhydrostatic, immersed and not, different time-steppers, closures, advection schemes, etc.
nsys produces a report with the timing for each kernel. It also has the option to wrap parts of the code with a custom NVTX range that measures the desired timings. Every run will generate a report from which the timings can be easily extracted (a grep with the kernel name / custom range name is enough). If the main time loop is the only result we are interested in, that is the only one that needs to be uploaded. However, as a bonus (if desired) we can also upload the timings of the largest kernels. I prefer to have a complete picture of what is going on, which means uploading at least the largest kernels. As a further bonus, it is possible to wrap other parts of the code with NVTX as well (I/O, for example) to see if there is any regression.
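Concretely, the NVTX wrapping described here might look like this (a sketch using NVTX.jl; the range names are illustrative, and the program would be run under `nsys profile julia ...` so the ranges show up in the report):

```julia
using NVTX  # NVTX.jl: emits ranges that nsys records at essentially zero cost

for n in 1:time_steps
    NVTX.@range "time_step!" begin
        time_step!(model, Δt)
    end
end

# Other phases can be wrapped too, e.g. to watch for I/O regressions:
NVTX.@range "output_writing" begin
    write_output!(simulation)   # hypothetical I/O call, for illustration only
end
```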
But just to be clear, you only want profiles for a few cases, right? Not every single case we will run?
I can give it a try to set it up so that we have more info for each benchmark run; it would be possible only on GPUs though.
It's important that we can benchmark on other systems as well, like AMD, or using Reactant.
Well, if we upload data on the benchmarks, it would be nice to have info on the kernel timings in addition to the total time per time step.
If there is no problem storing ~1 GB for each CI run or commit to main, that will work and scale to the infrastructure we are envisioning, assuming that each profile is ~20 MB.
…jl into glw/benchmarks-again
I think it might be fixed (we need to wait the ~5 hrs that the benchmarks take to run to figure it out)
This seems to be working now, but I still believe we can save quite a lot of time by running benchmarks in the same session, avoiding a lot of latency.
.buildkite/pipeline-benchmarks.yml
Outdated
```shell
echo " Group 6: Tracer count sweep (1/2 deg)"
echo "========================================"
$JULIA $JULIA_ARGS $BENCHMARK_SCRIPT --size=720x360x50 --tracers="T,S,C1" --output=$OUTPUT_FILE
$JULIA $JULIA_ARGS $BENCHMARK_SCRIPT --size=720x360x50 --tracers="T,S,C1,C2,C3" --output=$OUTPUT_FILE
```
Indentation of this line was off
```suggestion
$JULIA $JULIA_ARGS $BENCHMARK_SCRIPT --size=720x360x50 --tracers="T,S,C1,C2,C3" --output=$OUTPUT_FILE
```
It crashed when I tried; I don't think it is set up correctly. I need to fix something to make it work. But I wanted to finish pushing to the repo to see the online benchmarks first.
What work remains to merge this? @simone-silvestri @giordano
We just had a chat to discuss the way forward 😃 I think Simone is going to push some fixes soon
Ok. Also, we can merge this without the NVTX profiles; I think it's fine to continue that work in a new PR. Just having basic benchmarking will be useful.
Yeah, the problem is actually the pushing to the repository and the setup of the pipeline, some small details; I am on it, with some help from @giordano
Ok, the summary is produced: https://github.com/CliMA/Oceananigans.jl/actions/runs/23208086564?pr=5317
Is the Performance benchmarks section of the docs now outdated or obsolete? The link to https://github.com/CliMA/Oceananigans.jl/tree/main/benchmark is definitely broken (and makes the whole docs build fail).
Why did we keep the …
I think we can delete them. It was just a temporary change in the course of development.
* glw/prescribed-free-surface:
  - Clean up update_prescribed_∂t_σ! method signatures
  - Fix ∂t_σ = 0 bug for PrescribedFreeSurface + ZStarCoordinate
  - Revise citation for De Abreu and Timmermans (CliMA#5411)
  - Add benchmarking package with earth_ocean case (CliMA#5317) (0.105.5)
  - Add unit test for zero momentum flux at immersed peripheral nodes (CliMA#5402) (0.105.5)
  - Fix `initialize_closure_fields!()` use with `DynamicSmagorinsky` (CliMA#5366)
  - Fix the reference list in the Langmuir example (CliMA#5400)
Summary

- `benchmarking/` directory with a benchmarking framework following the Breeze pattern
- `earth_ocean` case: HydrostaticFreeSurfaceModel with TripolarGrid, realistic ETOPO2022 bathymetry (via NumericalEarth), CATKE, TEOS10, SplitExplicitFreeSurface (30 substeps)
- CLI entry point (`run_benchmarks.jl`) with ArgParse supporting parameter sweeps over grid size, float type, advection schemes, closures

Test plan

- `--closure=nothing` on CPU with 72x36x10 grid
- `--closure=CATKE` on CPU with 72x36x10 grid

🤖 Generated with Claude Code