(0.106) Log checkpoint file and mtime when restoring simulations#5355

Merged
navidcy merged 21 commits into main from ts-codex/add-checkpointer-info
Mar 19, 2026
Conversation

@taimoorsohail
Collaborator

I only stumbled upon this problem recently: if I am not diligent about cleaning up my checkpointer files, the model may pick up older checkpoint files (which have a higher iteration number but were created a long time ago) without my realising.

Obviously we can't foolproof everything, but this particular issue is insidious because it is completely silent, and thus may easily go unnoticed. To make it more obvious to the end-user which checkpoint file is being picked up, I have added an info statement that reports the file name and last-modified time of the checkpoint being restored. This should help in understanding what is happening. Keen to add if people think it is useful!
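
A minimal sketch of what such a message could look like, assuming a hypothetical helper `checkpoint_info_message` (not an Oceananigans function) and using only Julia's `mtime` and the `Dates` standard library. The wording mirrors the log line quoted later in this thread; the code in the PR itself may well differ.

```julia
using Dates

# Hypothetical helper: build the info message for a checkpoint file.
# `mtime` returns seconds since the Unix epoch, and `unix2datetime`
# converts that to a UTC DateTime, floored here to whole seconds.
function checkpoint_info_message(filepath)
    modified = floor(unix2datetime(mtime(filepath)), Second)
    return "Picking up simulation from checkpoint file $filepath; last modified (UTC): $modified"
end

# Demonstrate on a throwaway temporary file.
path, io = mktemp()
close(io)
@info checkpoint_info_message(path)
```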

@glwagner
Member

glwagner commented Mar 2, 2026

interesting! should we load the most recently created checkpoint, instead of the latest iteration?

@taimoorsohail
Collaborator Author

taimoorsohail commented Mar 2, 2026

I think a default behaviour of highest iteration number makes some sense. Though I guess then set!(simulation; checkpoint=:latest) is a bit misleading, cos it might not be the latest. We could remove the run!(simulation, pickup=true) functionality and explicitly make it run!(simulation, pickup=:latest), meaning the most recently created file, or run!(simulation, pickup=:highest), meaning the highest iteration number? This would also align with set! functionality...

@glwagner
Member

glwagner commented Mar 2, 2026

how about something like

run!(simulation, pickup=:iteration)
run!(simulation, pickup=:time_stamp)

@navidcy
Member

navidcy commented Mar 3, 2026

or the more verbose

run!(simulation, pickup=:latest_iteration)
run!(simulation, pickup=:recent_time_stamp)

?
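
To make the two proposed modes concrete, here is a minimal sketch of how the selection could work. `iteration_number` and `select_checkpoint` are hypothetical helpers, not Oceananigans functions; the mode names follow the suggestion just above, and the sketch assumes file names end in `iteration<N>.jld2`. The `modified` keyword is an assumption added only so the example can run without touching the filesystem.

```julia
# Hypothetical helper: parse N from a name ending in "iteration<N>.jld2".
function iteration_number(filepath)
    m = match(r"iteration(\d+)\.jld2$", filepath)
    m === nothing && error("no iteration number found in $filepath")
    return parse(Int, m.captures[1])
end

# Hypothetical helper: pick a checkpoint either by the iteration number in
# its name or by its modification time (seconds since the Unix epoch).
function select_checkpoint(filepaths, mode::Symbol; modified=mtime)
    isempty(filepaths) && error("no checkpoint files to choose from")
    if mode === :latest_iteration
        return argmax(iteration_number, filepaths)
    elseif mode === :recent_time_stamp
        return argmax(modified, filepaths)
    else
        error("unknown pickup mode $mode")
    end
end

# A stale file with a high iteration number next to a fresh one.
files = ["run_iteration300.jld2", "run_iteration20.jld2"]
timestamps = Dict("run_iteration300.jld2" => 1.0e9,
                  "run_iteration20.jld2"  => 1.7e9)

by_iteration = select_checkpoint(files, :latest_iteration)
by_time = select_checkpoint(files, :recent_time_stamp; modified = f -> timestamps[f])
```

In this toy case the two modes disagree: selecting by iteration returns the stale `run_iteration300.jld2`, while selecting by time stamp returns `run_iteration20.jld2`, which is the situation described at the top of the thread.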

@taimoorsohail
Collaborator Author

OK this is renamed now. Note the default behaviour of pickup=true is now to use :recent_time_stamp. This is different behaviour from before so may trip people up?

@glwagner
Member

glwagner commented Mar 3, 2026

> OK this is renamed now. Note the default behaviour of pickup=true is now to use :recent_time_stamp. This is different behaviour from before so may trip people up?

If we move forward then we should bump to 0.106.0 since it's a breaking change

@navidcy
Member

navidcy commented Mar 4, 2026

> > OK this is renamed now. Note the default behaviour of pickup=true is now to use :recent_time_stamp. This is different behaviour from before so may trip people up?
>
> If we move forward then we should bump to 0.106.0 since it's a breaking change

Or is this a patch release?

@taimoorsohail
Collaborator Author

Tests are failing here?

@navidcy
Member

navidcy commented Mar 10, 2026

Yes, seems like test_checkpointer.jl fails somewhere here:

@testset "Edge cases [$(typeof(arch))]" begin
    @info " Testing edge cases [$(typeof(arch))]..."
    test_checkpoint_empty_tracers(arch)
    test_checkpoint_missing_file_warning(arch)
    test_pickup_mode_selection_and_default(arch)
end

Try running it locally?

@navidcy
Member

navidcy commented Mar 10, 2026

If it's helpful, here are the errors I get when running locally:

[2026/03/10 08:56:02.631] INFO  Model iteration 3 equals or exceeds stop iteration 3.
Edge cases [CPU]: Log Test Failed at /Users/navid/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:1535
  Expression: set!(iter_sim; iteration = 2)
  Log Pattern: (:info, r"iteration2\\.jld2")
  Captured Logs: 
    LogRecord(Info, "Picking up simulation from checkpoint file /Users/navid/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/pickup_mode_selection_CPU_13367600781762233934_iteration2.jld2; last modified (UTC): 2026-03-10T06:56:03", Oceananigans.Simulations, :run, :Oceananigans_Simulations_935ebb30, "/Users/navid/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Simulations/run.jl", 106, Base.Pairs{Symbol, Union{}, Nothing, @NamedTuple{}}())

Stacktrace:
 [1] backtrace()
   @ Base ./error.jl:124
 [2] record(ts::Test.DefaultTestSet, t::Test.LogTestFailure)
   @ Test ~/.julia/juliaup/julia-1.12.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.12/Test/src/logging.jl:166
 [3] test_pickup_mode_selection_and_default(arch::CPU)
   @ Main ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:1535
 [4] macro expansion
   @ ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:2004 [inlined]
 [5] macro expansion
   @ ~/.julia/juliaup/julia-1.12.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.12/Test/src/Test.jl:1776 [inlined]
 [6] top-level scope
   @ ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:2001
Edge cases [CPU]: Log Test Failed at /Users/navid/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:1544
  Expression: set!(recent_sim; checkpoint = :recent_time_stamp)
  Log Pattern: (:info, r"iteration1\\.jld2")
  Captured Logs: 
    LogRecord(Info, "Picking up simulation from checkpoint file /Users/navid/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/pickup_mode_selection_CPU_13367600781762233934_iteration1.jld2; last modified (UTC): 2026-03-10T06:56:04", Oceananigans.Simulations, :run, :Oceananigans_Simulations_935ebb30, "/Users/navid/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Simulations/run.jl", 106, Base.Pairs{Symbol, Union{}, Nothing, @NamedTuple{}}())

Stacktrace:
 [1] record(ts::Test.DefaultTestSet, t::Test.LogTestFailure)
   @ Test ~/.julia/juliaup/julia-1.12.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.12/Test/src/logging.jl:166
 [2] test_pickup_mode_selection_and_default(arch::CPU)
   @ Main ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:1544
 [3] macro expansion
   @ ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:2004 [inlined]
 [4] macro expansion
   @ ~/.julia/juliaup/julia-1.12.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.12/Test/src/Test.jl:1776 [inlined]
 [5] top-level scope
   @ ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:2001
Edge cases [CPU]: Log Test Failed at /Users/navid/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:1553
  Expression: set!(highest_sim; checkpoint = :highest_iteration)
  Log Pattern: (:info, r"iteration3\\.jld2")
  Captured Logs: 
    LogRecord(Info, "Picking up simulation from checkpoint file /Users/navid/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/pickup_mode_selection_CPU_13367600781762233934_iteration3.jld2; last modified (UTC): 2026-03-10T06:56:03", Oceananigans.Simulations, :run, :Oceananigans_Simulations_935ebb30, "/Users/navid/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Simulations/run.jl", 106, Base.Pairs{Symbol, Union{}, Nothing, @NamedTuple{}}())

Stacktrace:
 [1] record(ts::Test.DefaultTestSet, t::Test.LogTestFailure)
   @ Test ~/.julia/juliaup/julia-1.12.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.12/Test/src/logging.jl:166
 [2] test_pickup_mode_selection_and_default(arch::CPU)
   @ Main ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:1553
 [3] macro expansion
   @ ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:2004 [inlined]
 [4] macro expansion
   @ ~/.julia/juliaup/julia-1.12.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.12/Test/src/Test.jl:1776 [inlined]
 [5] top-level scope
   @ ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:2001
[2026/03/10 08:56:05.392] INFO  Picking up simulation from checkpoint file /Users/navid/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/pickup_mode_selection_CPU_13367600781762233934_iteration1.jld2; last modified (UTC): 2026-03-10T06:56:04
[2026/03/10 08:56:05.393] INFO  Initializing simulation...
[2026/03/10 08:56:05.393] INFO      ... simulation initialization complete (287.875 μs)
[2026/03/10 08:56:05.393] INFO  Executing initial time step...
[2026/03/10 08:56:05.397] INFO  Simulation is stopping after running for 0 seconds.
[2026/03/10 08:56:05.397] INFO  Model iteration 2 equals or exceeds stop iteration 0.
[2026/03/10 08:56:05.398] INFO      ... initial time step complete (4.420 ms).
Edge cases [CPU]: Test Failed at /Users/navid/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:1563
  Expression: iteration(default_sim) == 1
   Evaluated: 2 == 1

Stacktrace:
 [1] macro expansion
   @ ~/.julia/juliaup/julia-1.12.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.12/Test/src/Test.jl:680 [inlined]
 [2] test_pickup_mode_selection_and_default(arch::CPU)
   @ Main ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:1563
 [3] macro expansion
   @ ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:2004 [inlined]
 [4] macro expansion
   @ ~/.julia/juliaup/julia-1.12.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.12/Test/src/Test.jl:1776 [inlined]
 [5] top-level scope
   @ ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:2001
Test Summary:    | Pass  Fail  Total   Time
Edge cases [CPU] |   21     4     25  47.5s
RNG of the outermost testset: Xoshiro(0x082ec9b98c045886, 0xea36ed04c5171b57, 0xbdb26f978ceddc10, 0x639acb8732c67b9f, 0x348b4f6fff5c5834)
ERROR: LoadError: Some tests did not pass: 21 passed, 4 failed, 0 errored, 0 broken.
in expression starting at /Users/navid/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:1822

@taimoorsohail
Collaborator Author

Thanks, fixed :)

@simone-silvestri
Collaborator

Looks like Reactant also exports Periodic

@navidcy
Member

navidcy commented Mar 12, 2026

> Looks like Reactant also exports Periodic

hm... it does not.
But also I don't see the connection with this PR?

@simone-silvestri
Collaborator

There is none. The tests are failing only because Reactant started to export Periodic as well, so we are getting name conflicts. However, I think they are changing it in Reactant, so in my opinion we can merge even if the tests do not pass.

@navidcy
Member

navidcy commented Mar 13, 2026

Gotcha

I guess that #5391 deals with the issue with Periodic?

@glwagner
Member

> Gotcha
>
> I guess that #5391 deals with the issue with Periodic?

correct

@taimoorsohail
Collaborator Author

So shall we merge given the errors are unrelated to this PR?

@taimoorsohail
Collaborator Author

Also is it a patch release?

@navidcy
Member

navidcy commented Mar 14, 2026

It's not a breaking API change, but I guess that, since the behaviour changes, @glwagner was suggesting we bump the minor release so that people notice it?

@glwagner
Member

The same setup will execute differently after this PR, right? Therefore it is breaking; you cannot expect existing code to run identically. (Which is a good thing --- the prior behavior as described by @taimoorsohail was undesirable.)

@navidcy
Member

navidcy commented Mar 17, 2026

Fair! Done.

@taimoorsohail
Collaborator Author

taimoorsohail commented Mar 18, 2026

Can someone approve so I can merge? @simone-silvestri @glwagner @navidcy

@navidcy navidcy changed the title from "Log checkpoint file and mtime when restoring simulations" to "(0.106) Log checkpoint file and mtime when restoring simulations" Mar 18, 2026
@taimoorsohail
Collaborator Author

I think I might need someone with elevated permissions to merge it as tests are failing for unrelated reasons...

@navidcy
Member

navidcy commented Mar 18, 2026

Let me ensure that they are unrelated reasons indeed.
Then, count on me: I got you covered ;)

@navidcy
Member

navidcy commented Mar 19, 2026

Only the distributed CI is failing. I'm pretty sure that's unrelated.

@navidcy navidcy merged commit cd51861 into main Mar 19, 2026
68 of 80 checks passed
@navidcy navidcy deleted the ts-codex/add-checkpointer-info branch March 19, 2026 09:03
navidcy referenced this pull request Mar 20, 2026
…lization of the interior (#5329)

* Enhance MetalGPU support: add device handling for AbstractArray and remove maybe_copy_interior

* Fix initialization flag in reduction operations for fields

* Remove unused device method for AbstractArray

* Refactor device function to accept AbstractArray for broader compatibility

* Refactor device function to use Metal.device for Base.ReshapedArray

* Add comment about extension for Metal.device to support Base.ReshapedArray

* Metal 1.9.3 fix mapreduce device check
@giordano giordano added the breaking change 💔 label Mar 20, 2026
briochemc added a commit to briochemc/Oceananigans.jl that referenced this pull request Mar 23, 2026
…ine-ACCESS-OM2

* bp-claude/distributed-FPivot-TripolarGrid:
  Replace reverse() with reversed-range views in fold halo fills
  Reinstate Docs/Benchmarks (CliMA#5419)
  Update fill_halo_regions.jl (CliMA#5415)
  Temporarily drop Benchmark section from Docs + delete `legacy_benchmarks` (CliMA#5412)
  Add restart verification script (CliMA#5379)
  Rework support for reduction operations on Metal GPU to avoid materialization of the interior (CliMA#5329)
  Fix typo with density perturbation in docs (CliMA#5398)
  Implement ReactantCore.materialize_traced_array for Field (CliMA#5409)
  Remove Oceananigans dependency from Project.toml (CliMA#5414)
  (0.106) Log checkpoint file and mtime when restoring simulations (CliMA#5355)

Labels

breaking change 💔 (Concerning a change which breaks the API), output 💾

5 participants