(0.106) Log checkpoint file and mtime when restoring simulations#5355
Conversation
|
interesting! should we load the most recently created checkpoint, instead of the latest iteration? |
|
I think a default behaviour of highest iteration number makes some sense. Though I guess then |
|
how about something like |
|
or the more verbose run!(simulation, pickup=:latest_iteration)
run!(simulation, pickup=:recent_time_stamp)? |
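For reference, the two selection rules being discussed could be sketched roughly as below. This is only an illustration and not Oceananigans' implementation: `checkpoint_iteration` and `select_checkpoint` are hypothetical helper names, the mode symbols are the ones proposed in this comment, and the `<prefix>_iteration<N>.jld2` naming is assumed from the file names that appear in the logs in this thread.

```julia
# Hypothetical helpers for illustration only; not part of Oceananigans' API.

# Parse N from a file named "<prefix>_iteration<N>.jld2".
# Returns `nothing` when the name does not follow that pattern.
function checkpoint_iteration(path::AbstractString)
    m = match(r"_iteration(\d+)\.jld2$", basename(path))
    return m === nothing ? nothing : parse(Int, m.captures[1])
end

# Pick a checkpoint in `dir` either by highest iteration number or by
# most recent modification time. Returns `nothing` if no file matches.
function select_checkpoint(dir::AbstractString, prefix::AbstractString;
                           mode::Symbol = :latest_iteration)
    files = filter(readdir(dir; join = true)) do f
        startswith(basename(f), prefix) && checkpoint_iteration(f) !== nothing
    end

    isempty(files) && return nothing

    if mode === :latest_iteration
        return argmax(checkpoint_iteration, files)  # largest N in the file name
    elseif mode === :recent_time_stamp
        return argmax(mtime, files)                 # newest file on disk
    else
        throw(ArgumentError("unknown mode: $mode"))
    end
end
```

The two modes disagree exactly in the situation described in the PR: a stale file with a high iteration number sitting next to a newer file with a lower one.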
|
OK this is renamed now. Note the default behaviour of |
If we move forward then we should bump to 0.106.0 since it's a breaking change |
Or is this a patch release? |
|
Tests are failing here? |
|
Yes, it seems to be in Oceananigans.jl/test/test_checkpointer.jl, lines 1998 to 2003 in 6466f00. Try running it locally? |
|
If it's helpful, here are the errors I get when running locally:
[2026/03/10 08:56:02.631] INFO Model iteration 3 equals or exceeds stop iteration 3.
Edge cases [CPU]: Log Test Failed at /Users/navid/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:1535
Expression: set!(iter_sim; iteration = 2)
Log Pattern: (:info, r"iteration2\\.jld2")
Captured Logs:
LogRecord(Info, "Picking up simulation from checkpoint file /Users/navid/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/pickup_mode_selection_CPU_13367600781762233934_iteration2.jld2; last modified (UTC): 2026-03-10T06:56:03", Oceananigans.Simulations, :run, :Oceananigans_Simulations_935ebb30, "/Users/navid/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Simulations/run.jl", 106, Base.Pairs{Symbol, Union{}, Nothing, @NamedTuple{}}())
Stacktrace:
[1] backtrace()
@ Base ./error.jl:124
[2] record(ts::Test.DefaultTestSet, t::Test.LogTestFailure)
@ Test ~/.julia/juliaup/julia-1.12.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.12/Test/src/logging.jl:166
[3] test_pickup_mode_selection_and_default(arch::CPU)
@ Main ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:1535
[4] macro expansion
@ ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:2004 [inlined]
[5] macro expansion
@ ~/.julia/juliaup/julia-1.12.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.12/Test/src/Test.jl:1776 [inlined]
[6] top-level scope
@ ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:2001
Edge cases [CPU]: Log Test Failed at /Users/navid/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:1544
Expression: set!(recent_sim; checkpoint = :recent_time_stamp)
Log Pattern: (:info, r"iteration1\\.jld2")
Captured Logs:
LogRecord(Info, "Picking up simulation from checkpoint file /Users/navid/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/pickup_mode_selection_CPU_13367600781762233934_iteration1.jld2; last modified (UTC): 2026-03-10T06:56:04", Oceananigans.Simulations, :run, :Oceananigans_Simulations_935ebb30, "/Users/navid/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Simulations/run.jl", 106, Base.Pairs{Symbol, Union{}, Nothing, @NamedTuple{}}())
Stacktrace:
[1] record(ts::Test.DefaultTestSet, t::Test.LogTestFailure)
@ Test ~/.julia/juliaup/julia-1.12.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.12/Test/src/logging.jl:166
[2] test_pickup_mode_selection_and_default(arch::CPU)
@ Main ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:1544
[3] macro expansion
@ ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:2004 [inlined]
[4] macro expansion
@ ~/.julia/juliaup/julia-1.12.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.12/Test/src/Test.jl:1776 [inlined]
[5] top-level scope
@ ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:2001
Edge cases [CPU]: Log Test Failed at /Users/navid/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:1553
Expression: set!(highest_sim; checkpoint = :highest_iteration)
Log Pattern: (:info, r"iteration3\\.jld2")
Captured Logs:
LogRecord(Info, "Picking up simulation from checkpoint file /Users/navid/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/pickup_mode_selection_CPU_13367600781762233934_iteration3.jld2; last modified (UTC): 2026-03-10T06:56:03", Oceananigans.Simulations, :run, :Oceananigans_Simulations_935ebb30, "/Users/navid/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/src/Simulations/run.jl", 106, Base.Pairs{Symbol, Union{}, Nothing, @NamedTuple{}}())
Stacktrace:
[1] record(ts::Test.DefaultTestSet, t::Test.LogTestFailure)
@ Test ~/.julia/juliaup/julia-1.12.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.12/Test/src/logging.jl:166
[2] test_pickup_mode_selection_and_default(arch::CPU)
@ Main ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:1553
[3] macro expansion
@ ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:2004 [inlined]
[4] macro expansion
@ ~/.julia/juliaup/julia-1.12.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.12/Test/src/Test.jl:1776 [inlined]
[5] top-level scope
@ ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:2001
[2026/03/10 08:56:05.392] INFO Picking up simulation from checkpoint file /Users/navid/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/pickup_mode_selection_CPU_13367600781762233934_iteration1.jld2; last modified (UTC): 2026-03-10T06:56:04
[2026/03/10 08:56:05.393] INFO Initializing simulation...
[2026/03/10 08:56:05.393] INFO ... simulation initialization complete (287.875 μs)
[2026/03/10 08:56:05.393] INFO Executing initial time step...
[2026/03/10 08:56:05.397] INFO Simulation is stopping after running for 0 seconds.
[2026/03/10 08:56:05.397] INFO Model iteration 2 equals or exceeds stop iteration 0.
[2026/03/10 08:56:05.398] INFO ... initial time step complete (4.420 ms).
Edge cases [CPU]: Test Failed at /Users/navid/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:1563
Expression: iteration(default_sim) == 1
Evaluated: 2 == 1
Stacktrace:
[1] macro expansion
@ ~/.julia/juliaup/julia-1.12.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.12/Test/src/Test.jl:680 [inlined]
[2] test_pickup_mode_selection_and_default(arch::CPU)
@ Main ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:1563
[3] macro expansion
@ ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:2004 [inlined]
[4] macro expansion
@ ~/.julia/juliaup/julia-1.12.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.12/Test/src/Test.jl:1776 [inlined]
[5] top-level scope
@ ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:2001
Test Summary: | Pass Fail Total Time
Edge cases [CPU] | 21 4 25 47.5s
RNG of the outermost testset: Xoshiro(0x082ec9b98c045886, 0xea36ed04c5171b57, 0xbdb26f978ceddc10, 0x639acb8732c67b9f, 0x348b4f6fff5c5834)
ERROR: LoadError: Some tests did not pass: 21 passed, 4 failed, 0 errored, 0 broken.
in expression starting at /Users/navid/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl-v3/test/test_checkpointer.jl:1822 |
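One possible reading of the three "Log Test Failed" entries above, not confirmed anywhere in this thread: each captured message does contain the expected file name, but the printed pattern has a doubled backslash (`\\.`). In a Julia `r"..."` literal, that matches a literal backslash followed by any character rather than a literal dot. The sketch below uses a shortened, made-up message to show the difference. It does not address the fourth failure (`iteration(default_sim) == 1`).

```julia
using Test

# Shortened, illustrative message; the real one includes the full path and mtime.
msg = "Picking up simulation from checkpoint file pickup_mode_selection_iteration2.jld2"

# `\.` matches a literal dot.
single = r"iteration2\.jld2"

# `\\.` matches a literal backslash followed by any one character,
# so it cannot match a file name that contains no backslash.
double = r"iteration2\\.jld2"

occursin(single, msg)  # true
occursin(double, msg)  # false

# With the single-backslash pattern, the log test passes.
@test_logs (:info, single) @info msg
```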
|
Thanks, fixed :) |
|
Looks like Reactant also exports |
hm... it does not. |
|
There is none, the problem of the tests failing is just that reactant started to export |
|
Gotcha I guess that #5391 deals with the issue with |
correct |
|
So shall we merge given the errors are unrelated to this PR? |
|
Also is it a patch release? |
|
It's not a breaking API change, but since the behaviour changes, I guess @glwagner was suggesting we bump the minor release so that people notice it? |
|
The same setup will execute differently after this PR, right? Therefore it is breaking; you cannot expect existing code to run identically. (Which is a good thing --- the prior behavior as described by @taimoorsohail was undesirable.) |
|
Fair! Done. |
|
Can someone approve so I can merge? @simone-silvestri @glwagner @navidcy |
|
I think I might need someone with elevated permissions to merge it as tests are failing for unrelated reasons... |
|
Let me ensure that they are unrelated reasons indeed. |
…ananigans.jl into ts-codex/add-checkpointer-info
|
Only the distributed CI is failing. I'm pretty sure that's unrelated. |
I only stumbled upon this problem recently: if I am not diligent about cleaning up my checkpoint files, the model may pick up older checkpoint files (which have a higher iteration number but were created a long time ago) without my realising.
Obviously we can't foolproof everything, but this particular issue is insidious because it is completely silent, and thus may easily go unnoticed. To make it more obvious to the end user which checkpoint file is being picked up, I have added an info statement with the name and last-modified time of that file. This should make it easier to understand what is happening. Keen to add if people think it is useful!
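
A minimal sketch of what such an info statement might look like, using only Julia's standard library. The helper name `checkpoint_pickup_message` is hypothetical and this is not the code added in the PR. The message wording follows the log lines quoted in the conversation. Note that `mtime` reports the last modification time, which is not necessarily the creation time.

```julia
using Dates

# Hypothetical helper for illustration; not Oceananigans' actual code.
function checkpoint_pickup_message(filepath::AbstractString)
    # `mtime` returns seconds since the Unix epoch; `unix2datetime` gives a UTC DateTime.
    modified = floor(unix2datetime(mtime(filepath)), Second)
    return "Picking up simulation from checkpoint file $filepath; last modified (UTC): $modified"
end

# Usage with a throwaway file standing in for a checkpoint:
path, io = mktemp()
close(io)
@info checkpoint_pickup_message(path)
```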