(v0.104.0) Checkpointing simulations#4892
Conversation
…anigans.jl into ali/checkpointing-that-works
src/Models/HydrostaticFreeSurfaceModels/hydrostatic_free_surface_model.jl
Outdated
Show resolved
Hide resolved
…ce_model.jl Co-authored-by: Gregory L. Wagner <[email protected]>
|
Looks like tests will all pass 🎉 I'll start testing the checkpointing of increasingly complex simulations while iterating on the design! This way we'll be able to weed out most bugs and issues. |
…anigans.jl into ali/checkpointing-that-works
I think this is great! We could consider a refactor in the future --- for example adding a new function that's something like |
|
@simone-silvestri I think this PR is ready to be merged now so just re-requesting review. I'd also like to wait to hear from @tomchor about whether the changes I made to |
src/Models/LagrangianParticleTracking/LagrangianParticleTracking.jl
Outdated
Show resolved
Hide resolved
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4892 +/- ##
===========================================
+ Coverage 32.87% 73.22% +40.35%
===========================================
Files 383 391 +8
Lines 21105 21897 +792
===========================================
+ Hits 6938 16034 +9096
+ Misses 14167 5863 -8304
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
I tried to find some info on this online but couldn't. What I can say is that the models (I'm aware of) will usually update the closure every |
tomchor
left a comment
There was a problem hiding this comment.
Nice work! I see some outstanding comments (mostly the formatting ones, which if you want I can help with), but I'll leave it up to you if you think they're necessary. Looking forward to seeing this merged!
Co-authored-by: Tomás Chor <[email protected]>
…ng.jl Co-authored-by: Tomás Chor <[email protected]>
…anigans.jl into ali/checkpointing-that-works
* fix it * bugfix * remove trailing white spaces * Add missing comma * Apply suggestion from @giordano * add empty line * fix grammar * reorder imports * fix multi_dimensional_reconstruction * Apply suggestions from code review * combine grid tests --------- Co-authored-by: Mosè Giordano <[email protected]> Co-authored-by: Navid C. Constantinou <[email protected]>
This PR refactors how the
Checkpointerworks by now checkpointing simulations, rather than just models. This is needed as the simulations (+ output writers, callbacks, etc.) all contain crucial information needed to properly restore/pickup a simulation and continue time stepping.Basic design idea:
prognostic_state(obj)which returns a named tuple corresponding to the prognostic state ofobjandrestore_prognostic_state!(obj, state)which restoresobjbased on information contained instate(which is a named tuple and is read from a checkpoint file).prognostic_stateandrestore_prognostic_state!.Right now I've only implemented proper checkpointing for non-hydrostatic model but it looks like it'll be straightforward to do it for hydrostatic and shallow water models. I'm working on adding comprehensive testing too.
Will continue working on this PR, but any feedback is very welcome!
Resolves #1249
Resolves #2866
Resolves #3670
Resolves #3845
Resolves #4516
Resolves #4857
Rhetorical aside
In general, the checkpointer is assuming that the simulation setup is the same. So only prognostic state information that changes will be checkpointed (e.g. field data,
TimeInterval.actuations, etc.). The approach I have been taking (based on #4857) is to only checkpoint the prognostic state.Should we operate under this assumption? I think so because not doing so can lead to a lot of undefined behavior. The checkpointer should not be responsible for checking that you set up the same simulation as the one that was checkpointed.
For example, take the
SpecifiedTimesschedule. It has two propertiestimesandprevious_actuation. Sinceprevious_actuationchanges as the simulation runs, onlyprevious_actuationneeds to be checkpointed.This leads to the possibility of the user changing
timesthen picking upprevious_actuationwhich can lead to undefined behavior. I think this is fine, because the checkpointer only works assuming you set up the same simulation as the one that was checkpointed.Checkpointing both
timesandprevious_actuationallows us to check thattimesis the same when restoring. But I don't think this is the checkpointer's responsibility.