Skip to content

Implement checkpointing for AtmosphereModel#529

Open
ewquon wants to merge 13 commits intomainfrom
add_checkpt
Open

Implement checkpointing for AtmosphereModel#529
ewquon wants to merge 13 commits intomainfrom
add_checkpt

Conversation

@ewquon
Copy link
Copy Markdown
Collaborator

@ewquon ewquon commented Feb 27, 2026

Addresses #443

Following https://github.com/CliMA/Oceananigans.jl/blob/51c67b5dc9445fe3b9015b3ccad4ebdfb89258f1/docs/src/simulations/checkpointing.md, I've independently verified that Oceananigans restarts work flawlessly -- bitwise agreement.

The same workflow applied to the free convection demo (https://numericalearth.github.io/BreezeDocumentation/stable/#Quick-Start)

simulation = Simulation(model, Δt=10, stop_time=2hours)
simulation.output_writers[:checkpointer] = Checkpointer(model, schedule=IterationInterval(100))
conjure_time_step_wizard!(simulation, cfl=0.7)
run!(simulation, checkpoint_at_end=true)

--vs--

simulation = Simulation(model, Δt=10, stop_iteration=2000)
simulation.output_writers[:checkpointer] = Checkpointer(model, schedule=IterationInterval(100))
conjure_time_step_wizard!(simulation, cfl=0.7)
run!(simulation)
# ...
simulation = Simulation(model, Δt=10, stop_time=2hours)
simulation.output_writers[:checkpointer] = Checkpointer(model, schedule=IterationInterval(100))
conjure_time_step_wizard!(simulation, cfl=0.7)
run!(simulation, pickup="checkpoint_iteration2000.jld2", checkpoint_at_end=true)

gives qualitatively identical results. The fields differ by:

ρu.data DIFFER!? max abs/rel diff : 0.0002235832290651274 0.027438062136977973
ρw.data DIFFER!? max abs/rel diff : 0.0001403733258555917 0.03648357117818405
ρθ.data are approximately equal
ρqᵗ.data DIFFER!? max abs/rel diff : 6.206636658928621e-7 4.801171466458838e-5

@ewquon
Copy link
Copy Markdown
Collaborator Author

ewquon commented Feb 27, 2026

Attaching my checkpoint comparison script
compare_checkpoints.jl.txt

@ewquon
Copy link
Copy Markdown
Collaborator Author

ewquon commented Feb 27, 2026

I've added $U^0$ and $G^n$ timestepper outputs for completeness (following Oceananigans) but AFAIK it's actually not needed for a successfull restart.

@codecov
Copy link
Copy Markdown

codecov bot commented Feb 27, 2026

Codecov Report

❌ Patch coverage is 0% with 17 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/AtmosphereModels/atmosphere_model.jl 0.00% 10 Missing ⚠️
src/TimeSteppers/TimeSteppers.jl 0.00% 7 Missing ⚠️

📢 Thoughts on this report? Let us know!

@giordano
Copy link
Copy Markdown
Member

Thanks! It'd be good to also add a test (idea: setting up a small simulation, running it til the end while dumping a checkpoint, then also resume the checkpoint and compare the result of the first run and the resumed one?)

@giordano giordano added the enhancement ✨ ideas and requests for new features label Feb 27, 2026
@giordano giordano linked an issue Feb 27, 2026 that may be closed by this pull request
to generalize prognostic_state and restore_prognostic_state! to all
timesteppers
@ewquon
Copy link
Copy Markdown
Collaborator Author

ewquon commented Feb 27, 2026

@giordano I've added a test per your suggestion. Verified that I get the same diffs (no bitwise or approx agreement at this point) as my previous workflow

Comment on lines +94 to +100
all_match_approx &= all(d -> d ≈ 0, momentum_diffs)
all_match_bitwise &= all(d -> d == 0, momentum_diffs)

println(all_match_bitwise ? "\nPASS: restart is bitwise identical to no-restart." :
"\nFAIL: restart differs from no-restart.")
println(all_match_approx ? "\nPASS: restart is approximately identical to no-restart." :
"\nFAIL: restart significantly differs from no-restart.")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than printing PASS or FAIL, use the @test macro from the Test standard library, which would loudly error out in case of a check failure. See the other test files in this directory for inspiration about the test organisation (just a random example: test/advection_schemes.jl)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the tip @giordano, this is still a WIP -- trying to get Breeze restart to be bitwise equal like Oceananigans at the moment, then I'll cleanup the testing...


# ── Compare final states ─────────────────────────────────────────────────────

println("\nComparing no-restart vs restarted final states:")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For testing Breeze we use ParallelTestRunner.jl which allows us run the tests in parallel. As a side effect, though, this capture all the output printed to screen, and shows it only at the end rather than live, making prints typically a bit useless as they aren't shown while the tests are running anyway. But specifically to this case, I think most of the prints below should be replaced by @tests, as suggested above.

- `implicit_solver`: Optional implicit solver for diffusion
"""
struct SSPRungeKutta3{FT, U0, TG, TI} <: AbstractTimeStepper
struct SSPRungeKutta3{FT, U0, TG, TI} <: AbstractBreezeTimeStepper
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we will want to move this upstream to Oceananigans at some point. What makes this time-stepper specific to Breeze?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Oceananigans timesteppers store $G^n$ and $G^-$ tendencies; Breeze timesteppers only store $G^n$.

@ewquon ewquon self-assigned this Mar 3, 2026
# By default, don't compute tendencies, cf. Oceananigans:
# "After restoring from a checkpoint, skip tendency computation since the restored
# tendencies are already correct."
function TimeSteppers.update_state!(model::AtmosphereModel, callbacks=[]; compute_tendencies=false)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is what broke the geostrophic_subsidence_forcings and forcing_and_boundary_conditions tests, right?

@ewquon
Copy link
Copy Markdown
Collaborator Author

ewquon commented Mar 10, 2026

Further restart testing using CliMA/Oceananigans.jl#5379

Example Problem Bitwise Abs Diff Rel Diff Notes
acoustic_wave.jl
bomex.jl ρu ~ O(1e-1) ρv,ρw ~ O(1)
boussinesq_bomex.jl v ~ O(1e-1) v,w ~ O(1)
cloudy_kelvin_helmholtz.jl ☑️ ρθ ~ O(1e-7) ρw ~ O(1e-2)
cloudy_thermal_bubble.jl ☑️ ρθ ~ O(1e-7) ρu ~ O(1)
dry_thermal_bubble.jl ρe ~ O(1e-4) ρu ~ O(1)
inertia_gravity_wave.jl WIP
kinematic_driver.jl ρθ ~ O(1) ρqᵗ ~ O(1e-2)
mountain_wave.jl ?? NaN
prescribed_sea_surface_temperature.jl ρu ~ O(1e-4) ρqᵗ ~ O(1)
rico.jl ρθ ~ O(1e-2) ρw ~ O(1)
rising_parcels.jl WIP
splitting_supercell.jl ρu ~ O(1e-1) ρv,ρw,ρqᶜˡ,ρqʳ ~ O(1)
stationary_parcel_model.jl WIP
tropical_cyclone_world.jl ρθ ~ O(1e-2) ρu,ρv,ρw ~ O(1)

@giordano
Copy link
Copy Markdown
Member

I think relative difference is more informative than the absolute one, which depends on the scale of the numbers involved.

@ewquon
Copy link
Copy Markdown
Collaborator Author

ewquon commented Mar 10, 2026

Agree that relative diff often adds context @giordano but I don't think it changes the story here. I think the diffs are significant -- see the updated table.

@giordano
Copy link
Copy Markdown
Member

That seems to be even worse 😄

function Oceananigans.prognostic_state(model::AtmosphereModel)
state = (clock = prognostic_state(model.clock),
timestepper = prognostic_state(model.timestepper))
return merge(state, prognostic_fields(model))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that model.forcing can also have a prognostic state, which is missing here (in particular the SubsidenceForcing, which depends on horizontal averages?)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For SubsidenceForcing, the averaged_field is diagnosed every update_state!

@glwagner
Copy link
Copy Markdown
Member

The dry thermal bubble is probably a good one to focus on at first since it's relatively simple and has no forcing or microphysics...

The fact that the acoustic wave works but thermal bubble does not is interesting. The difference is that the acoustic wave uses CompressibleDynamics and thermal bubble uses AnelasticDynamics.

@glwagner
Copy link
Copy Markdown
Member

Also, it might be sufficient for this PR to enable checkpointing for one of the dynamics + simple models. We can work on ensuring that all of the dynamics, forcings, boundary conditions, etc are supported in future PRs.

@giordano
Copy link
Copy Markdown
Member

Question: are you comparing using the CPU or the GPU architecture? I'm mildly sure our GPU examples aren't fully reproducible even on the same machine because we don't fix the GPU RNG seed correctly (I believe CUDA.jl documentation about this isn't accurate, or maybe just out of date).

@ewquon
Copy link
Copy Markdown
Collaborator Author

ewquon commented Mar 10, 2026

Question: are you comparing using the CPU or the GPU architecture? I'm mildly sure our GPU examples aren't fully reproducible even on the same machine because we don't fix the GPU RNG seed correctly (I believe CUDA.jl documentation about this isn't accurate, or maybe just out of date).

CPU

@ewquon
Copy link
Copy Markdown
Collaborator Author

ewquon commented Mar 11, 2026

The dry thermal bubble is probably a good one to focus on at first since it's relatively simple and has no forcing or microphysics...

The fact that the acoustic wave works but thermal bubble does not is interesting. The difference is that the acoustic wave uses CompressibleDynamics and thermal bubble uses AnelasticDynamics.

Changing the thermal bubble from anelastic to compressible results in a perfect restart.

@giordano
Copy link
Copy Markdown
Member

Ok, that's interesting: so there seems to be something not right with one dynamics but not the other? At least that's partially reassuring 😁

@glwagner
Copy link
Copy Markdown
Member

The dry thermal bubble is probably a good one to focus on at first since it's relatively simple and has no forcing or microphysics...
The fact that the acoustic wave works but thermal bubble does not is interesting. The difference is that the acoustic wave uses CompressibleDynamics and thermal bubble uses AnelasticDynamics.

Changing the thermal bubble from anelastic to compressible results in a perfect restart.

may need restore_prognostic_state! for the dynamics

@ewquon
Copy link
Copy Markdown
Collaborator Author

ewquon commented Mar 19, 2026

The dry thermal bubble is probably a good one to focus on at first since it's relatively simple and has no forcing or microphysics...
The fact that the acoustic wave works but thermal bubble does not is interesting. The difference is that the acoustic wave uses CompressibleDynamics and thermal bubble uses AnelasticDynamics.

Changing the thermal bubble from anelastic to compressible results in a perfect restart.

may need restore_prognostic_state! for the dynamics

@glwagner I don't think there are any prognostic variables in dynamics.

AtmosphereModels.prognostic_dynamics_field_names(::AnelasticDynamics) = ()

@ewquon
Copy link
Copy Markdown
Collaborator Author

ewquon commented Mar 19, 2026

Adding another data point. This is the RICO example run for 24 hours. "run1" and "run2" is the same case, but run on different machines -- effectively different realizations of the same conditions. "run2" was run from 0-24h (continuous) and also 0-8h, 8-24h (restarted). Presented profiles are averages over the last four hours. I think this actually looks pretty good, any thoughts @glwagner @giordano?

rico_profiles_24h

@glwagner
Copy link
Copy Markdown
Member

@glwagner I don't think there are any prognostic variables in dynamics.

AtmosphereModels.prognostic_dynamics_field_names(::AnelasticDynamics) = ()

but there are for CompressibleDynamics right?

@ewquon
Copy link
Copy Markdown
Collaborator Author

ewquon commented Mar 19, 2026

@glwagner I don't think there are any prognostic variables in dynamics.

AtmosphereModels.prognostic_dynamics_field_names(::AnelasticDynamics) = ()

but there are for CompressibleDynamics right?

CompressibleDynamics is capable of a perfect restart, Anelastic (what I'm testing) is not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement ✨ ideas and requests for new features

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Checkpointing for AtmosphereModel

3 participants