Implement checkpointing for `AtmosphereModel` by ewquon · Pull Request #529 · NumericalEarth/Breeze.jl

ewquon · 2026-02-27T17:47:38Z

Addresses #443

Following https://github.com/CliMA/Oceananigans.jl/blob/51c67b5dc9445fe3b9015b3ccad4ebdfb89258f1/docs/src/simulations/checkpointing.md, I've independently verified that Oceananigans restarts work flawlessly -- bitwise agreement.

The same workflow applied to the free convection demo (https://numericalearth.github.io/BreezeDocumentation/stable/#Quick-Start)

simulation = Simulation(model, Δt=10, stop_time=2hours)
simulation.output_writers[:checkpointer] = Checkpointer(model, schedule=IterationInterval(100))
conjure_time_step_wizard!(simulation, cfl=0.7)
run!(simulation, checkpoint_at_end=true)

--vs--

simulation = Simulation(model, Δt=10, stop_iteration=2000)
simulation.output_writers[:checkpointer] = Checkpointer(model, schedule=IterationInterval(100))
conjure_time_step_wizard!(simulation, cfl=0.7)
run!(simulation)
# ...
simulation = Simulation(model, Δt=10, stop_time=2hours)
simulation.output_writers[:checkpointer] = Checkpointer(model, schedule=IterationInterval(100))
conjure_time_step_wizard!(simulation, cfl=0.7)
run!(simulation, pickup="checkpoint_iteration2000.jld2", checkpoint_at_end=true)

gives qualitatively identical results. The fields differ by:

ρu.data DIFFER!? max abs/rel diff : 0.0002235832290651274 0.027438062136977973
ρw.data DIFFER!? max abs/rel diff : 0.0001403733258555917 0.03648357117818405
ρθ.data are approximately equal
ρqᵗ.data DIFFER!? max abs/rel diff : 6.206636658928621e-7 4.801171466458838e-5
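
For readers who want to reproduce this kind of report, here is a minimal sketch in plain Julia. It is not the attached comparison script: `compare_fields` is a hypothetical helper, the input arrays are synthetic, and the relative difference is assumed to be the maximum absolute difference divided by the largest magnitude in the reference array, which may not match the definition used for the numbers above.

```julia
using LinearAlgebra  # the array method of `isapprox` is defined in this standard library

# Hypothetical helper: compare the same field taken from two runs.
# Assumption: rel diff = max|a - b| / max|a|; the real script may define it differently.
function compare_fields(a::AbstractArray, b::AbstractArray)
    size(a) == size(b) || throw(DimensionMismatch("fields must have the same size"))
    max_abs = maximum(abs, a .- b)
    scale   = maximum(abs, a)
    max_rel = scale == 0 ? max_abs : max_abs / scale
    # `a == b` is an elementwise `==`, so a NaN anywhere makes it false.
    return (exact = a == b, approx = isapprox(a, b), max_abs = max_abs, max_rel = max_rel)
end

reference = [1.0, 2.0, 4.0]
restarted = [1.0, 2.0, 4.5]   # synthetic data, not model output
result = compare_fields(reference, restarted)
```

Here `result.max_abs` is `0.5` and `result.max_rel` is `0.125`, and both `result.exact` and `result.approx` are `false`.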

ewquon · 2026-02-27T17:48:36Z

Attaching my checkpoint comparison script
compare_checkpoints.jl.txt

ewquon · 2026-02-27T17:50:04Z

I've added $U^0$ and $G^n$ timestepper outputs for completeness (following Oceananigans), but AFAIK they're not actually needed for a successful restart.

codecov · 2026-02-27T18:02:25Z

Codecov Report

❌ Patch coverage is 0% with 17 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/AtmosphereModels/atmosphere_model.jl	0.00%	10 Missing ⚠️
src/TimeSteppers/TimeSteppers.jl	0.00%	7 Missing ⚠️

giordano · 2026-02-27T18:13:36Z

Thanks! It'd be good to also add a test. One idea: set up a small simulation and run it to the end while writing a checkpoint, then resume from that checkpoint and compare the result of the first run with that of the resumed one.

ewquon · 2026-02-27T20:55:50Z

@giordano I've added a test per your suggestion. Verified that I get the same diffs (no bitwise or approx agreement at this point) as my previous workflow

giordano · 2026-02-27T23:37:08Z

test/checkpoint_restart.jl

+all_match_approx  &= all(d -> d ≈ 0, momentum_diffs)
+all_match_bitwise &= all(d -> d == 0, momentum_diffs)
+
+println(all_match_bitwise ? "\nPASS: restart is bitwise identical to no-restart." :
+                            "\nFAIL: restart differs from no-restart.")
+println(all_match_approx  ? "\nPASS: restart is approximately identical to no-restart." :
+                            "\nFAIL: restart significantly differs from no-restart.")


Rather than printing PASS or FAIL, use the @test macro from the Test standard library, which would loudly error out in case of a check failure. See the other test files in this directory for inspiration about the test organisation (just a random example: test/advection_schemes.jl)

Thanks for the tip @giordano, this is still a WIP -- trying to get Breeze restart to be bitwise equal like Oceananigans at the moment, then I'll clean up the testing...
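
A sketch of how the PASS/FAIL prints could become `@test` checks. To stay self-contained it uses a toy forward-Euler integrator rather than an `AtmosphereModel`: `euler_step` and `integrate` are hypothetical stand-ins, and the in-memory `checkpoint` variable stands in for a checkpoint file.

```julia
using Test
using LinearAlgebra  # provides `≈` (isapprox) for arrays

# Toy stand-in for one model time step: forward Euler on du/dt = -u.
euler_step(u, Δt) = u .+ Δt .* (-u)

# Advance `u` by `n` steps and return the new state.
function integrate(u, Δt, n)
    for _ in 1:n
        u = euler_step(u, Δt)
    end
    return u
end

@testset "Checkpoint restart (toy model)" begin
    u₀, Δt = [1.0, 2.0, 3.0], 0.01

    continuous = integrate(u₀, Δt, 200)          # uninterrupted run

    checkpoint = integrate(u₀, Δt, 100)          # state saved halfway
    restarted  = integrate(checkpoint, Δt, 100)  # resumed from the saved state

    @test restarted ≈ continuous                 # approximate agreement
    @test restarted == continuous                # exact agreement
end
```

With `@test`, a failed comparison is recorded as a test failure and the test set throws once it finishes, so a regression cannot pass silently the way a printed FAIL line can.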

giordano · 2026-02-27T23:42:23Z

test/checkpoint_restart.jl

+
+# ── Compare final states ─────────────────────────────────────────────────────
+
+println("\nComparing no-restart vs restarted final states:")


For testing Breeze we use ParallelTestRunner.jl, which allows us to run the tests in parallel. As a side effect, though, it captures all output printed to the screen and shows it only at the end rather than live, so prints are typically of little use because they aren't shown while the tests are running anyway. In this case specifically, I think most of the prints below should be replaced by @test checks, as suggested above.

glwagner · 2026-02-27T23:59:31Z

src/TimeSteppers/ssp_runge_kutta_3.jl

 - `implicit_solver`: Optional implicit solver for diffusion
 """
-struct SSPRungeKutta3{FT, U0, TG, TI} <: AbstractTimeStepper
+struct SSPRungeKutta3{FT, U0, TG, TI} <: AbstractBreezeTimeStepper


I believe we will want to move this upstream to Oceananigans at some point. What makes this time-stepper specific to Breeze?

The Oceananigans timesteppers store $G^n$ and $G^-$ tendencies; Breeze timesteppers only store $G^n$.
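
To illustrate why stored tendencies can matter for restarts: a multi-step scheme such as second-order Adams-Bashforth needs the previous tendency $G^-$ to continue exactly, whereas a scheme that recomputes every tendency from the current state does not. The toy sketch below uses textbook AB2 on a scalar ODE in plain Julia. It is not the Oceananigans or Breeze implementation, and all names are hypothetical.

```julia
tendency(u) = -u   # toy right-hand side, du/dt = -u

# Textbook AB2: u⁺ = u + Δt (3/2 Gⁿ - 1/2 G⁻).
# When no previous tendency is available (G⁻ === nothing), fall back to forward Euler.
function ab2_run(u, G⁻, Δt, nsteps)
    for _ in 1:nsteps
        Gⁿ = tendency(u)
        u  = G⁻ === nothing ? u + Δt * Gⁿ : u + Δt * (1.5 * Gⁿ - 0.5 * G⁻)
        G⁻ = Gⁿ
    end
    return u, G⁻
end

Δt = 0.1
continuous, _ = ab2_run(1.0, nothing, Δt, 20)   # uninterrupted run

# "Checkpoint" after 10 steps: the state plus the previous tendency.
u₁₀, G⁻₁₀ = ab2_run(1.0, nothing, Δt, 10)

with_G, _    = ab2_run(u₁₀, G⁻₁₀, Δt, 10)      # restart with G⁻ restored
without_G, _ = ab2_run(u₁₀, nothing, Δt, 10)   # restart with G⁻ discarded
```

In this toy, `with_G == continuous` holds, while `without_G` differs because the first step after the restart falls back to forward Euler.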

src/AtmosphereModels/update_atmosphere_model_state.jl

This reverts commit 433d31b.

giordano · 2026-03-03T07:19:55Z

src/AtmosphereModels/update_atmosphere_model_state.jl

+# By default, don't compute tendencies, cf. Oceananigans:
+# "After restoring from a checkpoint, skip tendency computation since the restored
+#  tendencies are already correct."
+function TimeSteppers.update_state!(model::AtmosphereModel, callbacks=[]; compute_tendencies=false)


I believe this is what broke the geostrophic_subsidence_forcings and forcing_and_boundary_conditions tests, right?

ewquon · 2026-03-10T16:23:47Z

Further restart testing using CliMA/Oceananigans.jl#5379

Example Problem	Bitwise	Abs Diff	Rel Diff	Notes
acoustic_wave.jl	✅
bomex.jl	❌	ρu ~ O(1e-1)	ρv,ρw ~ O(1)
boussinesq_bomex.jl	❌	v ~ O(1e-1)	v,w ~ O(1)
cloudy_kelvin_helmholtz.jl	☑️	ρθ ~ O(1e-7)	ρw ~ O(1e-2)
cloudy_thermal_bubble.jl	☑️	ρθ ~ O(1e-7)	ρu ~ O(1)
dry_thermal_bubble.jl	❌	ρe ~ O(1e-4)	ρu ~ O(1)
inertia_gravity_wave.jl				WIP
kinematic_driver.jl	❌	ρθ ~ O(1)	ρqᵗ ~ O(1e-2)
mountain_wave.jl	??			NaN
prescribed_sea_surface_temperature.jl	❌	ρu ~ O(1e-4)	ρqᵗ ~ O(1)
rico.jl	❌	ρθ ~ O(1e-2)	ρw ~ O(1)
rising_parcels.jl				WIP
splitting_supercell.jl	❌	ρu ~ O(1e-1)	ρv,ρw,ρqᶜˡ,ρqʳ ~ O(1)
stationary_parcel_model.jl				WIP
tropical_cyclone_world.jl	❌	ρθ ~ O(1e-2)	ρu,ρv,ρw ~ O(1)

giordano · 2026-03-10T16:29:02Z

I think relative difference is more informative than the absolute one, which depends on the scale of the numbers involved.

ewquon · 2026-03-10T17:28:27Z

Agree that relative diff often adds context @giordano but I don't think it changes the story here. I think the diffs are significant -- see the updated table.

giordano · 2026-03-10T17:37:43Z

That seems to be even worse 😄

glwagner · 2026-03-10T19:32:19Z

src/AtmosphereModels/atmosphere_model.jl

+function Oceananigans.prognostic_state(model::AtmosphereModel)
+    state = (clock = prognostic_state(model.clock),
+             timestepper = prognostic_state(model.timestepper))
+    return merge(state, prognostic_fields(model))


I believe that model.forcing can also have a prognostic state, which is missing here (in particular the SubsidenceForcing, which depends on horizontal averages?)

For SubsidenceForcing, the averaged_field is diagnosed every update_state!
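
For readers unfamiliar with the pattern in the diff above, `merge` on NamedTuples appends the entries of the second tuple to those of the first. A minimal sketch with placeholder values follows. The `forcing` entry and its `some_state` field are purely hypothetical and only show how an extra component could be attached if it turned out to carry prognostic state.

```julia
# Placeholder values stand in for the real clock, timestepper, and field states.
state  = (clock = (time = 0.0, iteration = 0), timestepper = (Gⁿ = [0.0],))
fields = (ρu = [1.0], ρθ = [300.0])

merged = merge(state, fields)
keys(merged)   # (:clock, :timestepper, :ρu, :ρθ)

# Hypothetical extension: attach a forcing state under its own key.
with_forcing = merge(merged, (forcing = (some_state = [0.0],),))
```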

glwagner · 2026-03-10T19:34:50Z

The dry thermal bubble is probably a good one to focus on at first since it's relatively simple and has no forcing or microphysics...

The fact that the acoustic wave works but thermal bubble does not is interesting. The difference is that the acoustic wave uses CompressibleDynamics and thermal bubble uses AnelasticDynamics.

glwagner · 2026-03-10T19:35:37Z

Also, it might be sufficient for this PR to enable checkpointing for one of the dynamics + simple models. We can work on ensuring that all of the dynamics, forcings, boundary conditions, etc are supported in future PRs.

giordano · 2026-03-10T19:42:14Z

Question: are you comparing using the CPU or the GPU architecture? I'm mildly sure our GPU examples aren't fully reproducible even on the same machine because we don't fix the GPU RNG seed correctly (I believe CUDA.jl documentation about this isn't accurate, or maybe just out of date).
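
For comparison, CPU-side reproducibility of random initial perturbations can be sketched with the `Random` standard library. This shows the CPU behaviour only and makes no claim about how the GPU generator is seeded.

```julia
using Random

Random.seed!(1234)
a = rand(4)

Random.seed!(1234)   # reseed with the same value
b = rand(4)

a == b   # true: the same seed reproduces the same CPU random sequence
```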

ewquon · 2026-03-10T19:43:58Z

> Question: are you comparing using the CPU or the GPU architecture? I'm mildly sure our GPU examples aren't fully reproducible even on the same machine because we don't fix the GPU RNG seed correctly (I believe CUDA.jl documentation about this isn't accurate, or maybe just out of date).

CPU

ewquon · 2026-03-11T21:42:25Z

> The dry thermal bubble is probably a good one to focus on at first since it's relatively simple and has no forcing or microphysics...
>
> The fact that the acoustic wave works but thermal bubble does not is interesting. The difference is that the acoustic wave uses CompressibleDynamics and thermal bubble uses AnelasticDynamics.

Changing the thermal bubble from anelastic to compressible results in a perfect restart.

giordano · 2026-03-11T23:22:07Z

Ok, that's interesting: so there seems to be something not right with one dynamics but not the other? At least that's partially reassuring 😁

glwagner · 2026-03-11T23:36:35Z

> > The dry thermal bubble is probably a good one to focus on at first since it's relatively simple and has no forcing or microphysics...
> > The fact that the acoustic wave works but thermal bubble does not is interesting. The difference is that the acoustic wave uses CompressibleDynamics and thermal bubble uses AnelasticDynamics.
>
> Changing the thermal bubble from anelastic to compressible results in a perfect restart.

may need restore_prognostic_state! for the dynamics

ewquon · 2026-03-19T17:02:05Z

> > > The dry thermal bubble is probably a good one to focus on at first since it's relatively simple and has no forcing or microphysics...
> > > The fact that the acoustic wave works but thermal bubble does not is interesting. The difference is that the acoustic wave uses CompressibleDynamics and thermal bubble uses AnelasticDynamics.
> >
> > Changing the thermal bubble from anelastic to compressible results in a perfect restart.
>
> may need restore_prognostic_state! for the dynamics

@glwagner I don't think there are any prognostic variables in dynamics.

Breeze.jl/src/AnelasticEquations/anelastic_dynamics.jl

Line 112 in 0251766

AtmosphereModels.prognostic_dynamics_field_names(::AnelasticDynamics) = ()

ewquon · 2026-03-19T17:08:34Z

Adding another data point. This is the RICO example run for 24 hours. "run1" and "run2" are the same case run on different machines, so they are effectively different realizations of the same conditions. "run2" was run both continuously (0-24h) and with a restart (0-8h, then 8-24h). The profiles presented are averages over the last four hours. I think this actually looks pretty good. Any thoughts, @glwagner @giordano?

glwagner · 2026-03-19T17:21:31Z

> @glwagner I don't think there are any prognostic variables in dynamics.
>
> Breeze.jl/src/AnelasticEquations/anelastic_dynamics.jl
>
> Line 112 in 0251766
>
> AtmosphereModels.prognostic_dynamics_field_names(::AnelasticDynamics) = ()

but there are for CompressibleDynamics right?

ewquon · 2026-03-19T19:44:12Z

> > @glwagner I don't think there are any prognostic variables in dynamics.
> >
> > Breeze.jl/src/AnelasticEquations/anelastic_dynamics.jl
> >
> > Line 112 in 0251766
> >
> > AtmosphereModels.prognostic_dynamics_field_names(::AnelasticDynamics) = ()
>
> but there are for CompressibleDynamics right?

CompressibleDynamics is capable of a perfect restart, Anelastic (what I'm testing) is not.

ewquon added 2 commits February 27, 2026 10:35

Checkpoint output success

5c2eb74

Pickup success

6a2333b

giordano added the enhancement ✨ ideas and requests for new features label Feb 27, 2026

giordano linked an issue Feb 27, 2026 that may be closed by this pull request

Checkpointing for AtmosphereModel #443

Open

ewquon added 3 commits February 27, 2026 11:52

Add AbstractBreezeTimeStepper

60d2a2c

to generalize prognostic_state and restore_prognostic_state! to all timesteppers

Merge branch 'main' into add_checkpt

2b09a1b

Add checkpoint restart test

c8fa1bf

Add run!(; write_pickup_state=filename) kwarg for testing

433d31b