
Add restart verification script #5379

Merged
glwagner merged 18 commits into main from eq/manual_restart_verification
Mar 20, 2026

Collaborator

@ewquon ewquon commented Mar 6, 2026

This is an updated version of #5372, with the driver script now written in Julia.

Usage:

julia verify_restart.jl /path/to/my_simulation.jl

To test all examples:

cd Oceananigans.jl/examples/utils
julia restart_verification.jl ../*.jl

This will create a subdirectory per simulation script provided, which will generally look like:

baroclinic_adjustment_0.jl  # auto-generated no-restart script
baroclinic_adjustment_1.jl  # auto-generated restart script
compare_restart.log         # empty if norestart_iteration200.jld2 and restarted_iteration200.jld2 are identical
log.run0
log.run1
norestart_iteration0.jld2
norestart_iteration100.jld2
norestart_iteration200.jld2
restarted_iteration200.jld2
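
Why the iteration-200 comparison is a meaningful test can be sketched with a toy model. The snippet below is a hypothetical illustration, not Oceananigans code: `tendency` and `advance` are invented names, and the model is a scalar decay equation stepped with second-order Adams-Bashforth. It compares an uninterrupted run against a restart from iteration 100, once with the previous tendency saved in the "checkpoint" and once without it.

```julia
# Toy model (hypothetical, not Oceananigans): dx/dt = -x stepped with
# second-order Adams-Bashforth, which needs the previous tendency.
tendency(x) = -x

function advance(x, Gprev, nsteps; dt = 0.01)
    for _ in 1:nsteps
        G = tendency(x)
        # Fall back to forward Euler when no previous tendency is available.
        x += isnothing(Gprev) ? dt * G : dt * (1.5 * G - 0.5 * Gprev)
        Gprev = G
    end
    return x, Gprev
end

norestart, _ = advance(1.0, nothing, 200)      # uninterrupted run to iteration 200
x100, G100   = advance(1.0, nothing, 100)      # state at the iteration-100 "checkpoint"

full_restart, _    = advance(x100, G100, 100)     # checkpoint stores x and the tendency
partial_restart, _ = advance(x100, nothing, 100)  # checkpoint stores x only

@assert full_restart === norestart    # bitwise identical
@assert partial_restart != norestart  # missing time-stepper state shows up as a difference
```

In this sketch the restart is bitwise identical only when the checkpoint carries all of the time-stepper state, which is the kind of omission a restart comparison is meant to expose.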

The checkpoint comparison is automatically performed by the accompanying utility script, which runs quietly by default. It can also be run on its own:

Usage: julia compare_checkpoints.jl <filepath1> <filepath2> [-v|--verbose]
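
As a rough sketch of what a field-by-field comparison involves (this is an assumption for illustration, not the logic of `compare_checkpoints.jl`; `differing_fields` is a hypothetical helper and plain `Dict`s stand in for checkpoint files), the snippet below lists the fields that are not identical. It uses `isequal` rather than `==` so that a `NaN` in the same position of both arrays counts as a match.

```julia
# Hypothetical helper: return the sorted names of fields that are missing from
# one collection or whose arrays are not identical. `isequal` treats NaN as
# equal to NaN, whereas `==` would report every NaN as a mismatch.
function differing_fields(a::AbstractDict, b::AbstractDict)
    mismatched = String[]
    for name in sort(collect(union(Set(keys(a)), Set(keys(b)))))
        if !haskey(a, name) || !haskey(b, name) || !isequal(a[name], b[name])
            push!(mismatched, name)
        end
    end
    return mismatched
end

# Stand-ins for the contents of two checkpoint files (made-up values).
norestart = Dict("u" => [1.0, 2.0],         "T" => [20.0, NaN])
restarted = Dict("u" => [1.0, 2.0 + 1e-13], "T" => [20.0, NaN])

@assert differing_fields(norestart, restarted) == ["u"]
@assert maximum(abs, norestart["u"] .- restarted["u"]) < 1e-12
```

Reporting the maximum absolute difference alongside the mismatch, as in the last line, distinguishes round-off-level discrepancies from genuine restart bugs.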

Current results:

| Example Problem | Restart is Bitwise Identical | Notes |
|---|---|---|
| baroclinic_adjustment.jl | | |
| convecting_plankton.jl | | |
| horizontal_convection.jl | | |
| hydrostatic_lock_exchange.jl | | |
| internal_tide.jl | | |
| internal_wave.jl | | |
| kelvin_helmholtz_instability.jl | | manually tested restart |
| langmuir_turbulence.jl | | ran on CPU, reduced problem to 64x64x32; warning that halo size was increased to (5,5,5) for ImmersedBoundaryGrid |
| ocean_wind_mixing_and_convection.jl | ☑️ abs diff ≲ O(1e-13) | ran on CPU, reduced problem to 64x64x32; warning that halo size was increased to (4,4,4) for ImmersedBoundaryGrid |
| one_dimensional_diffusion.jl | | |
| shallow_water_Bickley_jet.jl | | |
| spherical_baroclinic_instability.jl (lat-lon) | | ran on CPU |
| spherical_baroclinic_instability.jl (tripolar) | ✅* | ran on CPU; velocities, tracers, and free_surface fields identical but NaN found in timestepper.Gⁿ.u |
| spherical_baroclinic_instability.jl (rotated lat-lon) | | ran on CPU |
| two_dimensional_turbulence.jl | | |

Collaborator Author

ewquon commented Mar 6, 2026

I think I've sufficiently beaten this to death. The only example that I didn't run the restart test on is "tilted_bottom_boundary_layer", which gave NaN when I attempted to run it.

@glwagner glwagner requested a review from giordano March 6, 2026 20:47
Member

glwagner commented Mar 6, 2026

Not sure where to put this in the repo as we haven't had something quite like this before. Perhaps a directory at the top-level called /utilities? Curious what @giordano has to say.

Collaborator

@giordano giordano left a comment

> Perhaps a directory at the top-level called /utilities?

In Julia's repository there's a top-level directory called contrib, which follows an old convention of collecting scripts and utilities "contributed" by the community. This could fit the same scheme; however, I suspect this one in particular may benefit from living inside examples/, because you could run it with

julia --project restart_verification.jl ../*.jl

and it'd correctly use the right environment. If this was in contrib/ you'd have to use

julia --project=../examples restart_verification.jl ../examples/*.jl

which is a bit more verbose. That'd be fine for me, but perhaps slightly less user-friendly. I don't have strong opinions either way.

Member

glwagner commented Mar 8, 2026

> > Perhaps a directory at the top-level called /utilities?
>
> In Julia's repository there's a top-level directory called contrib, which follows an old convention of collecting scripts and utilities "contributed" by the community. This could fit the same scheme; however, I suspect this one in particular may benefit from living inside examples/, because you could run it with
>
>     julia --project restart_verification.jl ../*.jl
>
> and it'd correctly use the right environment. If this was in contrib/ you'd have to use
>
>     julia --project=../examples restart_verification.jl ../examples/*.jl
>
> which is a bit more verbose. That'd be fine for me, but perhaps slightly less user-friendly. I don't have strong opinions either way.

I was assuming that the long-term use case for this script is not really to test the examples (though that is useful if we change the examples or add new ones) but rather to test new scripts and user scripts. In that case, users might copy/paste it into the repo they are using for that work.

@glwagner glwagner requested a review from giordano March 18, 2026 02:17
Collaborator

@giordano giordano left a comment

Haven't looked at it in super-duper detail, but it looks good overall if it does the job 🚀

return nothing
end

"""(If only Comonicon worked)
Collaborator

Was this supposed to be a comment? Reads weird in a docstring 😅

ewquon and others added 7 commits March 19, 2026 11:58
…ananigans.jl into eq/manual_restart_verification
- Add `Base.isapprox` for `Clock` structs (compares time fields approximately,
  iteration and stage exactly) with tests
- Refactor `compare_checkpoints.jl` into `CheckpointComparison.jl` module;
  `compare_all` now returns success/failure boolean
- Fix `verify_restart.jl`: propagate `--project` to child Julia processes,
  rename output directory to `<name>_restart_verification`, add usage docstring,
  and report success/failure at the end

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
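
The "time fields approximately, iteration and stage exactly" rule from the commit message above can be sketched as follows. `ToyClock` and its fields are hypothetical stand-ins chosen for illustration; the actual `Clock` struct in Oceananigans and the method added in this PR may differ.

```julia
# Hypothetical stand-in for a simulation clock; NOT the Oceananigans `Clock` type.
struct ToyClock{T}
    time :: T
    last_dt :: T
    iteration :: Int
    stage :: Int
end

# Floating-point fields are compared approximately; integer counters exactly.
# Keyword arguments such as `rtol` and `atol` are forwarded to the scalar checks.
function Base.isapprox(a::ToyClock, b::ToyClock; kwargs...)
    return isapprox(a.time, b.time; kwargs...) &&
           isapprox(a.last_dt, b.last_dt; kwargs...) &&
           a.iteration == b.iteration &&
           a.stage == b.stage
end

a = ToyClock(1.0, 0.1, 200, 1)
b = ToyClock(1.0 + 1e-15, 0.1, 200, 1)  # round-off-sized time perturbation
c = ToyClock(1.0, 0.1, 201, 1)          # iteration differs by one

@assert a ≈ b      # tiny time perturbation is tolerated
@assert !(a ≈ c)   # iteration mismatch is not
```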
isapprox(0.0, 1e-15) is false with default (relative) tolerances because
the reference value is zero. Use time=1.0 as the base so that small
perturbations are correctly detected as approximately equal.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
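
The behavior described in this commit message can be checked directly in Julia. With the default keyword arguments, `isapprox` on `Float64` values uses `atol = 0` and `rtol = sqrt(eps(Float64))`, so a comparison against zero only succeeds for an exact match. A minimal check:

```julia
# Default tolerances for Float64: atol = 0, rtol = sqrt(eps(Float64)) ≈ 1.5e-8.
@assert !isapprox(0.0, 1e-15)              # relative test fails when one side is zero
@assert isapprox(1.0, 1.0 + 1e-15)         # same perturbation passes on a nonzero base
@assert isapprox(0.0, 1e-15; atol = 1e-12) # an explicit absolute tolerance also works
```

The last line shows the alternative to changing the base value: passing an explicit `atol` when values near zero are expected.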

codecov bot commented Mar 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.49%. Comparing base (96f04c9) to head (00a694c).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5379      +/-   ##
==========================================
+ Coverage   73.48%   73.49%   +0.01%     
==========================================
  Files         398      398              
  Lines       22671    22673       +2     
==========================================
+ Hits        16660    16664       +4     
+ Misses       6011     6009       -2     
Flag Coverage Δ
buildkite 68.84% <100.00%> (+0.01%) ⬆️
julia 68.84% <100.00%> (+0.01%) ⬆️


@glwagner glwagner merged commit 370ca5f into main Mar 20, 2026
66 of 67 checks passed
@glwagner glwagner deleted the eq/manual_restart_verification branch March 20, 2026 03:12
@glwagner
Member

nice work @ewquon !

briochemc added a commit to briochemc/Oceananigans.jl that referenced this pull request Mar 23, 2026
…ine-ACCESS-OM2

* bp-claude/distributed-FPivot-TripolarGrid:
  Replace reverse() with reversed-range views in fold halo fills
  Reinstate Docs/Benchmarks (CliMA#5419)
  Update fill_halo_regions.jl (CliMA#5415)
  Temporarily drop Benchmark section from Docs + delete `legacy_benchmarks` (CliMA#5412)
  Add restart verification script (CliMA#5379)
  Rework support for reduction operations on Metal GPU to avoid materialization of the interior (CliMA#5329)
  Fix typo with density perturbation in docs (CliMA#5398)
  Implement ReactantCore.materialize_traced_array for Field (CliMA#5409)
  Remove Oceananigans dependency from Project.toml (CliMA#5414)
  (0.106) Log checkpoint file and mtime when restoring simulations (CliMA#5355)
3 participants