Skip to content

(v0.104.0) Checkpointing simulations#4892

Merged
ali-ramadhan merged 109 commits intomainfrom
ali/checkpointing-that-works
Jan 13, 2026
Merged

(v0.104.0) Checkpointing simulations#4892
ali-ramadhan merged 109 commits intomainfrom
ali/checkpointing-that-works

Conversation

@ali-ramadhan
Copy link
Copy Markdown
Member

@ali-ramadhan ali-ramadhan commented Oct 30, 2025

This PR refactors how the Checkpointer works by now checkpointing simulations, rather than just models. This is needed as the simulations (+ output writers, callbacks, etc.) all contain crucial information needed to properly restore/pickup a simulation and continue time stepping.

Basic design idea:

  • We now have two new functions: prognostic_state(obj) which returns a named tuple corresponding to the prognostic state of obj and restore_prognostic_state!(obj, state) which restores obj based on information contained in state (which is a named tuple and is read from a checkpoint file).
  • Objects are checkpointed recursively by serializing prognostic information to the JLD2 checkpoint file.
  • The goal is for checkpointing to be flexible enough that we can very easily use it for different types of simulations, e.g. coupled simulations in ClimaOcean.jl by just defining prognostic_state and restore_prognostic_state!.

Right now I've only implemented proper checkpointing for non-hydrostatic model but it looks like it'll be straightforward to do it for hydrostatic and shallow water models. I'm working on adding comprehensive testing too.

Will continue working on this PR, but any feedback is very welcome!

Resolves #1249
Resolves #2866
Resolves #3670
Resolves #3845
Resolves #4516
Resolves #4857


Rhetorical aside

In general, the checkpointer is assuming that the simulation setup is the same. So only prognostic state information that changes will be checkpointed (e.g. field data, TimeInterval.actuations, etc.). The approach I have been taking (based on #4857) is to only checkpoint the prognostic state.

Should we operate under this assumption? I think so because not doing so can lead to a lot of undefined behavior. The checkpointer should not be responsible for checking that you set up the same simulation as the one that was checkpointed.

For example, take the SpecifiedTimes schedule. It has two properties times and previous_actuation. Since previous_actuation changes as the simulation runs, only previous_actuation needs to be checkpointed.

This leads to the possibility of the user changing times then picking up previous_actuation which can lead to undefined behavior. I think this is fine, because the checkpointer only works assuming you set up the same simulation as the one that was checkpointed.

Checkpointing both times and previous_actuation allows us to check that times is the same when restoring. But I don't think this is the checkpointer's responsibility.

@ali-ramadhan
Copy link
Copy Markdown
Member Author

ali-ramadhan commented Nov 13, 2025

Looks like tests will all pass 🎉 I'll start testing the checkpointing of increasingly complex simulations while iterating on the design! This way we'll be able to weed out most bugs and issues.

@glwagner
Copy link
Copy Markdown
Member

glwagner commented Jan 9, 2026

Ok so I think checkpointing now works for all turbulence closures!

In general, we shouldn't be calling initialize!(model) when restoring from a checkpoint because this could overwrite/corrupt closure fields. And we should use update_state!(model; compute_tendencies=false) when restoring from a checkpoint.

I also had to add a very small absolute tolerance to model equality tests when testing checkpointing turbulence closures, usually for fields like e and ε that may accumulate floating point differences due to differing order of operations (reductions?). But it's tiny, like atol = 1e-20.

Smagorinsky closures

Smagorinsky and SmagorinskyLilly don't need checkpointing.

In DirectionallyAveragedDynamicSmagorinsky we technically may not need to checkpoint 𝒥ᴸᴹ and 𝒥ᴹᴹ since they're recomputed from LM and MM but I did it to ensure exact reproducibility.

LagrangianAveragedDynamicSmagorinsky needed some changes to be checkpointed because it maintains time history for Lagrangian trajectory averaging. I had to also checkpoint previous_compute_time along with 𝒥ᴸᴹ⁻ and 𝒥ᴹᴹ⁻.

When using a multi-stage timestepper like RK3, I think it should only compute coefficients at the final stage. Lagrangian averaging was being applied 3 times with small fractional Δt values instead of once with the full Δt. Not doing so leaves the closure in an inconsistent state so when you pickup you can't recreate the same sequence of intermediate states. I think the main issue is that t⁻ is updated at every substep. The checkpoint captures t⁻ from stage=1, but the next computation in the continuous run happens at stage=2.

But please let me know if this makes sense for LagrangianAveragedDynamicSmagorinsky + RK3. If not, I can revert the changes and we can fix this properly in another PR. This is not an issue with QAB2. cc @tomchor @glwagner @simone-silvestri

RiBasedVerticalDiffusivity

This was relatively easy to checkpoint. Gotta checkpoint some closure fields but otherwise just needed to keep track of the previous_compute_time.

CATKE

I had to add some checks like making sure we only call time_step_catke_equation! and compute_average_surface_buoyancy_flux! on new iterations (Δt > 0) and making sure they work with split RK3. I also added a _skip_next_compute property to CATKEDiffusivityFields because we need to

TKEDissipationVerticalDiffusivity

Also similar to CATKE, I had to introduce a previous_compute_time and _skip_next_compute. Should also work once TKE-dissipation supports RK3 (#5127).

I think this is great!

We could consider a refactor in the future --- for example adding a new function that's something like time_step_closure_fields!(closure, model). With such an interface, we can distinguish between true "auxiliary state updates" (via update_state!) and prognostic state updates.

@ali-ramadhan
Copy link
Copy Markdown
Member Author

@simone-silvestri ZStarCoordinate checkpointing works now (and is tested)! I didn't have to checkpoint grids, just model.vertical_coordinate and also pass the grid to prognostic_state and restore_prognostic_state!. But if we want to checkpoint grids in the future it'll be easy to do.

I think this PR is ready to be merged now so just re-requesting review. I'd also like to wait to hear from @tomchor about whether the changes I made to DynamicSmagorinsky are okay.

@codecov
Copy link
Copy Markdown

codecov bot commented Jan 12, 2026

Codecov Report

❌ Patch coverage is 83.80282% with 69 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.22%. Comparing base (5a0553f) to head (9f0e493).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/OutputWriters/checkpointer.jl 84.76% 16 Missing ⚠️
src/OutputWriters/windowed_time_average.jl 50.00% 15 Missing ⚠️
src/Utils/schedules.jl 62.50% 9 Missing ⚠️
...rostaticFreeSurfaceModels/explicit_free_surface.jl 0.00% 6 Missing ⚠️
src/Simulations/run.jl 84.84% 5 Missing ⚠️
...gianParticleTracking/LagrangianParticleTracking.jl 50.00% 3 Missing ⚠️
src/Fields/field.jl 60.00% 2 Missing ⚠️
...rostaticFreeSurfaceModels/implicit_free_surface.jl 60.00% 2 Missing ⚠️
src/Simulations/callback.jl 66.66% 2 Missing ⚠️
...e_implementations/ri_based_vertical_diffusivity.jl 87.50% 2 Missing ⚠️
... and 6 more
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #4892       +/-   ##
===========================================
+ Coverage   32.87%   73.22%   +40.35%     
===========================================
  Files         383      391        +8     
  Lines       21105    21897      +792     
===========================================
+ Hits         6938    16034     +9096     
+ Misses      14167     5863     -8304     
Flag Coverage Δ
buildkite 68.57% <88.58%> (?)
julia 68.57% <88.58%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@tomchor
Copy link
Copy Markdown
Member

tomchor commented Jan 12, 2026

Smagorinsky and SmagorinskyLilly don't need checkpointing.

In DirectionallyAveragedDynamicSmagorinsky we technically may not need to checkpoint 𝒥ᴸᴹ and 𝒥ᴹᴹ since they're recomputed from LM and MM but I did it to ensure exact reproducibility.

LagrangianAveragedDynamicSmagorinsky needed some changes to be checkpointed because it maintains time history for Lagrangian trajectory averaging. I had to also checkpoint previous_compute_time along with 𝒥ᴸᴹ⁻ and 𝒥ᴹᴹ⁻.

When using a multi-stage timestepper like RK3, I think it should only compute coefficients at the final stage. Lagrangian averaging was being applied 3 times with small fractional Δt values instead of once with the full Δt. Not doing so leaves the closure in an inconsistent state so when you pickup you can't recreate the same sequence of intermediate states. I think the main issue is that t⁻ is updated at every substep. The checkpoint captures t⁻ from stage=1, but the next computation in the continuous run happens at stage=2.

But please let me know if this makes sense for LagrangianAveragedDynamicSmagorinsky + RK3. If not, I can revert the changes and we can fix this properly in another PR. This is not an issue with QAB2. cc @tomchor @glwagner @simone-silvestri

I tried to find some info on this online but couldn't. What I can say is that the models (I'm aware of) will usually update the closure every n time steps (usually 5), and when comes time to update the closure, they compute the averaging at every substep. Given that, and given the fact that substeps are very short (and perhaps not a "real" solution of the model), I think your changes are okay. Perhaps even an improvement! Great work!

Copy link
Copy Markdown
Member

@tomchor tomchor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! I see some outstanding comments (mostly the formatting ones, which if you want I can help with), but I'll leave it up to you if you think they're necessary. Looking forward to seeing this merged!

@ali-ramadhan ali-ramadhan merged commit c8ae9d4 into main Jan 13, 2026
68 of 72 checks passed
@ali-ramadhan ali-ramadhan deleted the ali/checkpointing-that-works branch January 13, 2026 00:22
navidcy referenced this pull request Jan 18, 2026
* fix it

* bugfix

* remove trailing white spaces

* Add missing comma

* Apply suggestion from @giordano

* add empty line

* fix grammar

* reorder imports

* fix multi_dimensional_reconstruction

* Apply suggestions from code review

* combine grid tests

---------

Co-authored-by: Mosè Giordano <[email protected]>
Co-authored-by: Navid C. Constantinou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking change 💔 Concerning a change which breaks the API output 💾

Projects

None yet

6 participants