(v0.104.0) Checkpointing simulations by ali-ramadhan · Pull Request #4892 · CliMA/Oceananigans.jl

ali-ramadhan · 2025-10-30T18:27:11Z

This PR refactors the Checkpointer so that it checkpoints simulations rather than just models. This is needed because a simulation (along with its output writers, callbacks, etc.) contains crucial information needed to properly restore a simulation and continue time stepping.

Basic design idea:

- We now have two new functions: `prognostic_state(obj)`, which returns a named tuple corresponding to the prognostic state of `obj`, and `restore_prognostic_state!(obj, state)`, which restores `obj` based on information contained in `state` (which is a named tuple and is read from a checkpoint file).
- Objects are checkpointed recursively by serializing prognostic information to the JLD2 checkpoint file.
- The goal is for checkpointing to be flexible enough that we can very easily use it for different types of simulations, e.g. coupled simulations in ClimaOcean.jl, by just defining `prognostic_state` and `restore_prognostic_state!`.
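
A minimal sketch of the recursive design described above, using hypothetical toy types (`ToyClock`, `ToySchedule`, `ToySimulation`) rather than Oceananigans' own structs. Only the function names `prognostic_state` and `restore_prognostic_state!` come from this PR; the method bodies are assumptions, and a plain named tuple stands in for the JLD2 file.

```julia
# Hypothetical toy types; these are not the Oceananigans definitions.
mutable struct ToyClock
    time :: Float64
    iteration :: Int
end

mutable struct ToySchedule
    interval :: Float64      # part of the setup, not checkpointed
    actuations :: Int        # changes as the simulation runs
end

struct ToySimulation
    clock :: ToyClock
    schedule :: ToySchedule
    field :: Vector{Float64}
end

# Each object reports only the state that changes while time stepping.
prognostic_state(clock::ToyClock) = (time = clock.time, iteration = clock.iteration)
prognostic_state(schedule::ToySchedule) = (actuations = schedule.actuations,)
prognostic_state(field::Vector{Float64}) = copy(field)

# Containers recurse into their components.
prognostic_state(sim::ToySimulation) = (clock = prognostic_state(sim.clock),
                                        schedule = prognostic_state(sim.schedule),
                                        field = prognostic_state(sim.field))

function restore_prognostic_state!(clock::ToyClock, state)
    clock.time = state.time
    clock.iteration = state.iteration
    return clock
end

function restore_prognostic_state!(schedule::ToySchedule, state)
    schedule.actuations = state.actuations
    return schedule
end

restore_prognostic_state!(field::Vector{Float64}, state) = copyto!(field, state)

function restore_prognostic_state!(sim::ToySimulation, state)
    restore_prognostic_state!(sim.clock, state.clock)
    restore_prognostic_state!(sim.schedule, state.schedule)
    restore_prognostic_state!(sim.field, state.field)
    return sim
end

# "Checkpoint" a running simulation, then restore into a freshly built one.
running = ToySimulation(ToyClock(3.5, 7), ToySchedule(0.5, 7), [1.0, 2.0, 3.0])
checkpoint = prognostic_state(running)

fresh = ToySimulation(ToyClock(0.0, 0), ToySchedule(0.5, 0), zeros(3))
restore_prognostic_state!(fresh, checkpoint)
```

Note that `fresh` must be constructed with the same setup (here the same `interval` and field size) before restoring; the sketch only carries over what changes during the run.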

Right now I've only implemented proper checkpointing for the nonhydrostatic model, but it looks like it'll be straightforward to do the same for the hydrostatic and shallow water models. I'm working on adding comprehensive testing too.

Will continue working on this PR, but any feedback is very welcome!

Resolves #1249
Resolves #2866
Resolves #3670
Resolves #3845
Resolves #4516
Resolves #4857

Rhetorical aside

In general, the checkpointer is assuming that the simulation setup is the same. So only prognostic state information that changes will be checkpointed (e.g. field data, TimeInterval.actuations, etc.). The approach I have been taking (based on #4857) is to only checkpoint the prognostic state.

Should we operate under this assumption? I think so because not doing so can lead to a lot of undefined behavior. The checkpointer should not be responsible for checking that you set up the same simulation as the one that was checkpointed.

For example, take the SpecifiedTimes schedule. It has two properties times and previous_actuation. Since previous_actuation changes as the simulation runs, only previous_actuation needs to be checkpointed.

This leads to the possibility of the user changing times then picking up previous_actuation which can lead to undefined behavior. I think this is fine, because the checkpointer only works assuming you set up the same simulation as the one that was checkpointed.

Checkpointing both times and previous_actuation allows us to check that times is the same when restoring. But I don't think this is the checkpointer's responsibility.
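
To make the trade-off concrete, here is a small sketch with a hypothetical `ToySpecifiedTimes` type (not the Oceananigans `SpecifiedTimes`). Option A checkpoints only `previous_actuation`, as argued above; option B, with the made-up helpers `checked_state` and `restore_checked!`, also stores `times` so a mismatch can be detected.

```julia
# Hypothetical stand-in for a SpecifiedTimes-like schedule.
mutable struct ToySpecifiedTimes
    times :: Vector{Float64}        # setup: chosen by the user
    previous_actuation :: Int       # prognostic: advances during the run
end

# Option A: checkpoint only what changes.
prognostic_state(s::ToySpecifiedTimes) = (previous_actuation = s.previous_actuation,)

function restore_prognostic_state!(s::ToySpecifiedTimes, state)
    s.previous_actuation = state.previous_actuation
    return s
end

# Option B: also store `times` so that a mismatch can be detected on restore.
checked_state(s::ToySpecifiedTimes) = (times = copy(s.times),
                                       previous_actuation = s.previous_actuation)

function restore_checked!(s::ToySpecifiedTimes, state)
    s.times == state.times || error("schedule times differ from the checkpointed run")
    s.previous_actuation = state.previous_actuation
    return s
end

original = ToySpecifiedTimes([1.0, 2.0, 4.0], 2)
modified = ToySpecifiedTimes([1.0, 3.0, 4.0], 0)   # user changed `times`

# Option A restores silently, even though `times` no longer matches.
restore_prognostic_state!(modified, prognostic_state(original))

# Option B refuses to restore into the modified setup.
mismatch_detected = try
    restore_checked!(ToySpecifiedTimes([1.0, 3.0, 4.0], 0), checked_state(original))
    false
catch
    true
end
```

Option A keeps the checkpoint small and the restore logic simple, at the cost of trusting the user to rebuild the same simulation.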

ali-ramadhan · 2025-11-13T18:05:53Z

Looks like tests will all pass 🎉 I'll start testing the checkpointing of increasingly complex simulations while iterating on the design! This way we'll be able to weed out most bugs and issues.

glwagner · 2026-01-09T19:41:44Z

Ok so I think checkpointing now works for all turbulence closures!

In general, we shouldn't be calling initialize!(model) when restoring from a checkpoint because this could overwrite/corrupt closure fields. And we should use update_state!(model; compute_tendencies=false) when restoring from a checkpoint.

I also had to add a very small absolute tolerance to model equality tests when testing checkpointing turbulence closures, usually for fields like e and ε that may accumulate floating point differences due to differing order of operations (reductions?). But it's tiny, like atol = 1e-20.
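
As an illustration of what such a tolerance does and does not allow, here is a sketch in plain Julia (no Oceananigans types; `fields_match` is a hypothetical helper, not the comparison used in the test suite). With `atol > 0`, Julia's `isapprox` defaults to `rtol = 0`, so `atol = 1e-20` only forgives differences in values that are themselves tiny.

```julia
# Two values that differ by one unit in the last place (about 2e-28 here).
e_continuous = 1e-12
e_restored = nextfloat(e_continuous)

exactly_equal = e_continuous == e_restored                      # false
within_atol = isapprox(e_continuous, e_restored; atol = 1e-20)  # true

# The same absolute tolerance does not hide a one-ulp difference at O(1),
# because the default rtol is zero whenever atol > 0.
order_one_passes = isapprox(1.0, nextfloat(1.0); atol = 1e-20)  # false

# Element-wise comparison of two hypothetical field arrays.
fields_match(a, b; atol = 1e-20) = all(isapprox.(a, b; atol = atol))
```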

Smagorinsky closures

Smagorinsky and SmagorinskyLilly don't need checkpointing.

In DirectionallyAveragedDynamicSmagorinsky we technically may not need to checkpoint 𝒥ᴸᴹ and 𝒥ᴹᴹ since they're recomputed from LM and MM but I did it to ensure exact reproducibility.

LagrangianAveragedDynamicSmagorinsky needed some changes to be checkpointed because it maintains time history for Lagrangian trajectory averaging. I had to also checkpoint previous_compute_time along with 𝒥ᴸᴹ⁻ and 𝒥ᴹᴹ⁻.

When using a multi-stage timestepper like RK3, I think it should only compute coefficients at the final stage. Lagrangian averaging was being applied 3 times with small fractional Δt values instead of once with the full Δt. Not doing so leaves the closure in an inconsistent state so when you pickup you can't recreate the same sequence of intermediate states. I think the main issue is that t⁻ is updated at every substep. The checkpoint captures t⁻ from stage=1, but the next computation in the continuous run happens at stage=2.

But please let me know if this makes sense for LagrangianAveragedDynamicSmagorinsky + RK3. If not, I can revert the changes and we can fix this properly in another PR. This is not an issue with QAB2. cc @tomchor @glwagner @simone-silvestri
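
A toy sketch of the difference between updating at every stage and only at the final stage. Everything here is hypothetical (`ToyLagrangianAverage`, `relax!`, `step!`, the relaxation weight, and the timescale `T`); it is not the closure's actual code, only an illustration of why t⁻ (`previous_compute_time`) ends up at a substep time in one case and at the full-step time in the other.

```julia
# Hypothetical closure state; the field names mirror those mentioned above.
mutable struct ToyLagrangianAverage
    J :: Float64                      # running average (stands in for 𝒥ᴸᴹ)
    previous_compute_time :: Float64  # t⁻
    updates :: Int
end

# Exponential relaxation toward `source` over the elapsed time Δt = t - t⁻.
# The weight formula and the timescale T are assumptions for this sketch.
function relax!(avg, source, t; T = 1.0)
    Δt = t - avg.previous_compute_time
    ϵ = (Δt / T) / (1 + Δt / T)
    avg.J = ϵ * source + (1 - ϵ) * avg.J
    avg.previous_compute_time = t
    avg.updates += 1
    return avg
end

# Update at every stage, or only at the final stage of the step.
function step!(avg, source, stage_times; final_stage_only)
    for (stage, t) in enumerate(stage_times)
        if !final_stage_only || stage == length(stage_times)
            relax!(avg, source, t)
        end
    end
    return avg
end

stage_times = [0.3, 0.7, 1.0]   # illustrative substep times within one full step

every_stage = step!(ToyLagrangianAverage(0.0, 0.0, 0), 1.0, stage_times; final_stage_only = false)
final_only  = step!(ToyLagrangianAverage(0.0, 0.0, 0), 1.0, stage_times; final_stage_only = true)
```

In the `final_only` case the average is relaxed once with the full Δt and t⁻ only ever takes full-step values, so a checkpoint written between time steps stores a t⁻ that the restored run can continue from.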

RiBasedVerticalDiffusivity

This was relatively easy to checkpoint. Gotta checkpoint some closure fields but otherwise just needed to keep track of the previous_compute_time.

CATKE

I had to add some checks like making sure we only call time_step_catke_equation! and compute_average_surface_buoyancy_flux! on new iterations (Δt > 0) and making sure they work with split RK3. I also added a _skip_next_compute property to CATKEDiffusivityFields because we need to

TKEDissipationVerticalDiffusivity

Also similar to CATKE, I had to introduce a previous_compute_time and _skip_next_compute. Should also work once TKE-dissipation supports RK3 (#5127).
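
A sketch of the "only step on new iterations (Δt > 0)" guard mentioned for CATKE and TKE-dissipation, using a hypothetical `ToyClosureFields` type and a made-up growth law. It shows only the role of `previous_compute_time`; it does not attempt to reproduce `_skip_next_compute`.

```julia
# Hypothetical closure fields; not the CATKE or TKE-dissipation structs.
mutable struct ToyClosureFields
    e :: Float64                      # stands in for a prognostic closure field
    previous_compute_time :: Float64
end

# Advance `e` only when model time has moved past the last computation.
function maybe_time_step!(fields, t; growth_rate = 0.1)
    Δt = t - fields.previous_compute_time
    Δt > 0 || return false            # same time as before: nothing to do
    fields.e += growth_rate * Δt
    fields.previous_compute_time = t
    return true
end

fields = ToyClosureFields(1.0, 0.0)
stepped_first  = maybe_time_step!(fields, 0.5)   # new time: steps
stepped_repeat = maybe_time_step!(fields, 0.5)   # repeated call at the same time: skipped
```

Because `previous_compute_time` decides whether the field advances, it has to be checkpointed along with the field itself; otherwise a restored run would take an extra update or miss one.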

I think this is great!

We could consider a refactor in the future --- for example adding a new function that's something like time_step_closure_fields!(closure, model). With such an interface, we can distinguish between true "auxiliary state updates" (via update_state!) and prognostic state updates.
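
One way such an interface could look, as a purely hypothetical sketch: the toy types, the extra `Δt` argument, and `update_auxiliary_state!` are all invented for illustration, and only the name `time_step_closure_fields!` comes from the suggestion above.

```julia
abstract type AbstractToyClosure end
struct ToyDiagnosticClosure <: AbstractToyClosure end   # nothing to advance in time
struct ToyPrognosticClosure <: AbstractToyClosure end   # carries its own evolving field

mutable struct ToyModel{C}
    closure :: C
    time :: Float64
    e :: Float64        # prognostic closure field
    ν :: Float64        # auxiliary (diagnosed) viscosity
end

# Prognostic update: a no-op unless the closure has state to advance.
time_step_closure_fields!(::AbstractToyClosure, model, Δt) = nothing
time_step_closure_fields!(::ToyPrognosticClosure, model, Δt) = (model.e += Δt * 0.1; nothing)

# Auxiliary update: recompute diagnosed quantities from the current state.
# It advances nothing, so it is safe to call right after restoring a checkpoint.
update_auxiliary_state!(model) = (model.ν = 2 * model.e; nothing)

function toy_time_step!(model, Δt)
    time_step_closure_fields!(model.closure, model, Δt)
    model.time += Δt
    update_auxiliary_state!(model)
    return model
end

model = toy_time_step!(ToyModel(ToyPrognosticClosure(), 0.0, 1.0, 0.0), 0.5)
e_after_step = model.e
update_auxiliary_state!(model)   # calling again leaves the prognostic field untouched

diagnostic_model = toy_time_step!(ToyModel(ToyDiagnosticClosure(), 0.0, 1.0, 0.0), 0.5)
```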

ali-ramadhan · 2026-01-10T06:00:50Z

@simone-silvestri ZStarCoordinate checkpointing works now (and is tested)! I didn't have to checkpoint grids, just model.vertical_coordinate and also pass the grid to prognostic_state and restore_prognostic_state!. But if we want to checkpoint grids in the future it'll be easy to do.

I think this PR is ready to be merged now so just re-requesting review. I'd also like to wait to hear from @tomchor about whether the changes I made to DynamicSmagorinsky are okay.

codecov · 2026-01-12T12:25:37Z

Codecov Report

❌ Patch coverage is 83.80282% with 69 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.22%. Comparing base (5a0553f) to head (9f0e493).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/OutputWriters/checkpointer.jl | 84.76% | 16 Missing ⚠️ |
| src/OutputWriters/windowed_time_average.jl | 50.00% | 15 Missing ⚠️ |
| src/Utils/schedules.jl | 62.50% | 9 Missing ⚠️ |
| ...rostaticFreeSurfaceModels/explicit_free_surface.jl | 0.00% | 6 Missing ⚠️ |
| src/Simulations/run.jl | 84.84% | 5 Missing ⚠️ |
| ...gianParticleTracking/LagrangianParticleTracking.jl | 50.00% | 3 Missing ⚠️ |
| src/Fields/field.jl | 60.00% | 2 Missing ⚠️ |
| ...rostaticFreeSurfaceModels/implicit_free_surface.jl | 60.00% | 2 Missing ⚠️ |
| src/Simulations/callback.jl | 66.66% | 2 Missing ⚠️ |
| ...e_implementations/ri_based_vertical_diffusivity.jl | 87.50% | 2 Missing ⚠️ |

... and 6 more

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #4892       +/-   ##
===========================================
+ Coverage   32.87%   73.22%   +40.35%     
===========================================
  Files         383      391        +8     
  Lines       21105    21897      +792     
===========================================
+ Hits         6938    16034     +9096     
+ Misses      14167     5863     -8304

Flag	Coverage Δ
buildkite	`68.57% <88.58%> (?)`
julia	`68.57% <88.58%> (?)`

tomchor · 2026-01-12T12:39:33Z

> Smagorinsky and SmagorinskyLilly don't need checkpointing.
>
> In DirectionallyAveragedDynamicSmagorinsky we technically may not need to checkpoint 𝒥ᴸᴹ and 𝒥ᴹᴹ since they're recomputed from LM and MM but I did it to ensure exact reproducibility.
>
> LagrangianAveragedDynamicSmagorinsky needed some changes to be checkpointed because it maintains time history for Lagrangian trajectory averaging. I had to also checkpoint previous_compute_time along with 𝒥ᴸᴹ⁻ and 𝒥ᴹᴹ⁻.
>
> When using a multi-stage timestepper like RK3, I think it should only compute coefficients at the final stage. Lagrangian averaging was being applied 3 times with small fractional Δt values instead of once with the full Δt. Not doing so leaves the closure in an inconsistent state so when you pickup you can't recreate the same sequence of intermediate states. I think the main issue is that t⁻ is updated at every substep. The checkpoint captures t⁻ from stage=1, but the next computation in the continuous run happens at stage=2.
>
> But please let me know if this makes sense for LagrangianAveragedDynamicSmagorinsky + RK3. If not, I can revert the changes and we can fix this properly in another PR. This is not an issue with QAB2. cc @tomchor @glwagner @simone-silvestri

I tried to find some information on this online but couldn't. What I can say is that the models I'm aware of will usually update the closure every n time steps (usually 5), and when it comes time to update the closure, they compute the averaging at every substep. Given that, and given that substeps are very short (and perhaps not a "real" solution of the model), I think your changes are okay. Perhaps even an improvement! Great work!

tomchor

Nice work! I see some outstanding comments (mostly the formatting ones, which if you want I can help with), but I'll leave it up to you if you think they're necessary. Looking forward to seeing this merged!

ali-ramadhan added 6 commits October 30, 2025 07:11

- First stab at starting to support checkpointing simulations (390f24e)
- Start working on some new tests (751072c)
- Parameterize a couple of tests (bc39dd5)
- Replace old tests (d131070)
- Fix archs for checkpointer tests (ee79883)
- Merge branch 'main' into ali/checkpointing-that-works (30d4ccf)

navidcy added the output 💾 label Nov 1, 2025

ali-ramadhan added 5 commits November 12, 2025 16:57

- Merge branch 'main' into ali/checkpointing-that-works (0f79241)
- Checkpointing output writers (c3838da)
- Checkpointing and restoring Lagrangian particles (d721d9b)
- Checkpoint the hydrostatic model (50cd623)
- Merge branch 'ali/checkpointing-that-works' of github.com:CliMA/Oceananigans.jl into ali/checkpointing-that-works (629381f)

glwagner reviewed Nov 13, 2025

- src/Models/HydrostaticFreeSurfaceModels/hydrostatic_free_surface_model.jl (outdated)

ali-ramadhan and others added 3 commits November 12, 2025 22:34

- Update src/Models/HydrostaticFreeSurfaceModels/hydrostatic_free_surface_model.jl (f6d8bfc), co-authored by Gregory L. Wagner
- Nonhydrostatic diffusivity fields are now called closure fields (e155376)
- Fix model prognostic_state (71cffaa)

glwagner reviewed Nov 13, 2025

- src/Models/NonhydrostaticModels/nonhydrostatic_model.jl (outdated)

glwagner reviewed Nov 13, 2025

- src/Models/NonhydrostaticModels/nonhydrostatic_model.jl (outdated)

ali-ramadhan added 3 commits November 13, 2025 09:19

- Checkpointing MultiRegionObject (5a1e461)
- Checkpointing for free surfaces (d4c25bd)
- Properly checkpoint simulation to not override new stop criteria (1f6814c)

ali-ramadhan added 3 commits November 13, 2025 17:25

- Merge branch 'main' into ali/checkpointing-that-works (0802088)
- Checkpoint SplitRungeKutta3TimeStepper (3b3eb39)
- Test checkpointing hydrostatic models (c512323)

ali-ramadhan mentioned this pull request Nov 14, 2025

Checkpointing doesn't seem to be bit-for-bit with NonhydrostaticModel and variable z spacing #4904

Open

ali-ramadhan added 4 commits November 14, 2025 09:42

- Merge branch 'ali/checkpointing-that-works' of github.com:CliMA/Oceananigans.jl into ali/checkpointing-that-works (4af2871)
- Get rid of checkpointer properties (bf03663)
- Checkpoint shallow water models (6a7f654)
- Test checkpointing shallow water models (d2ef109)

ali-ramadhan added 2 commits January 9, 2026 22:57

- Add checkpointing support for ZStarCoordinate (1888b15)
- Test ZStarCoordinate checkpointing and re-organize checkpointing tests (ead3eaf)

ali-ramadhan requested review from glwagner, simone-silvestri and tomchor January 10, 2026 05:57

- Merge branch 'main' into ali/checkpointing-that-works (90f299f)
- Clean up all checkpoint files (5872a00)

tomchor mentioned this pull request Jan 12, 2026

(0.104.0) Reformulate hydrostatic model timestepping #4811

Merged

9 tasks

tomchor reviewed Jan 12, 2026

- docs/src/simulations/checkpointing.md (outdated)

tomchor reviewed Jan 12, 2026

- docs/src/simulations/checkpointing.md (outdated)

tomchor reviewed Jan 12, 2026

- src/Models/LagrangianParticleTracking/LagrangianParticleTracking.jl (outdated)

tomchor approved these changes Jan 12, 2026

ali-ramadhan and others added 4 commits January 12, 2026 10:39

- Update docs/src/simulations/checkpointing.md (5c012e2), co-authored by Tomás Chor
- Update src/Models/LagrangianParticleTracking/LagrangianParticleTracking.jl (ae44228), co-authored by Tomás Chor
- Merge branch 'main' into ali/checkpointing-that-works (7b99807)
- Update checkpointing docs (4e5617b)

simone-silvestri approved these changes Jan 12, 2026

giordano and others added 5 commits January 12, 2026 18:56

- Merge branch 'main' into ali/checkpointing-that-works (f8e5559)
- No need to checkpoint the split-explicit free surface filtered state (00afd7a)
- Minor fix (690a86b)
- Merge branch 'ali/checkpointing-that-works' of github.com:CliMA/Oceananigans.jl into ali/checkpointing-that-works (9f0e493)
- Respect existing coding style (5782a44)

ali-ramadhan merged commit c8ae9d4 into main Jan 13, 2026
68 of 72 checks passed

ali-ramadhan deleted the ali/checkpointing-that-works branch January 13, 2026 00:22

tomchor mentioned this pull request Feb 6, 2026

Dynamic Smagorinsky always produces zero values for eddy viscosity #5257

Closed

ali-ramadhan merged 109 commits into main from ali/checkpointing-that-works
6 participants