
Reinvoke sharding tests and refactor Clock constructor for sharded grids#5389

Merged
glwagner merged 58 commits into main from glw/sharding-fixes-1
Mar 23, 2026

Conversation

@glwagner
Member

No description provided.

glwagner and others added 10 commits March 11, 2026 20:32
- Update CI sharding job to Julia 1.11.9, add XLA_FLAGS for 4 virtual devices
- Add reconstruct_global_field, synchronize_communication!, set_to_field! for sharded fields
- Add complete_communication_and_compute_buffer! and interior_tendency_kernel_parameters
  no-ops for ShardedDistributed to disable pipelining
- Rename DistributedTripolarGridOfSomeKind to MPITripolarGridOfSomeKind to distinguish
  MPI-based distributed from Reactant sharded distributed
- Make synchronize_communication! conditional on AsynchronousDistributed
- Create run_sharding_tests.jl as a centralized test runner for all partition configs
- Rewrite test_sharded_lat_lon.jl and test_sharded_tripolar.jl to use run_sharding_tests.jl
- Use tolerance-based comparisons (sqrt(eps)) instead of strict equality
- Uncomment sharding tests in runtests.jl
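The tolerance-based comparison can be sketched as follows (a minimal illustration assuming Float64 fields; the helper name is hypothetical, not the PR's actual test code):

```julia
# Hypothetical helper: compare two arrays up to sqrt(eps) tolerance.
# sqrt(eps(Float64)) ≈ 1.5e-8 tolerates reordered floating-point
# operations across shards while still catching real divergence.
function fields_match(a::AbstractArray, b::AbstractArray)
    tol = sqrt(eps(Float64))
    return all(isapprox.(a, b; atol=tol, rtol=tol))
end
```

A test would then use `@test fields_match(reconstructed, reference)` in place of a strict `reconstructed == reference`.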

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…nings

Remove --project flag from mpiexec subprocess commands to match existing
MPI test patterns. The subprocess inherits JULIA_LOAD_PATH from Pkg.test()
which already includes MPI. Add include guard to distributed_tests_utils.jl
to prevent method redefinition warnings when included from multiple test files.
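The include guard can be sketched like this (a hypothetical pattern, not necessarily the exact code added to distributed_tests_utils.jl):

```julia
# Guard against re-evaluating method definitions when this file is
# include()d from several test files, which would otherwise print
# method redefinition warnings.
if !@isdefined(DISTRIBUTED_TESTS_UTILS_LOADED)
    const DISTRIBUTED_TESTS_UTILS_LOADED = true

    # ... shared test utilities are defined here ...
end
```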

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Move Reactant-specific run_distributed_simulation (with @compile) into
run_sharding_tests.jl. The shared distributed_tests_utils.jl is also used
by MPI tests which don't have Reactant available.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Define ReactantTestGrid type alias and dispatch run_distributed_simulation
on it, rather than overriding the generic method.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@navidcy added the extensions 🧬 and reactant ∇ all day I dream about MLIR labels on Mar 12, 2026
@glwagner changed the title from "Refactor Clock function for sharding support" to "Reinvoke sharding tests and add Clock constructor for sharded grids" on Mar 17, 2026
@glwagner changed the title from "Reinvoke sharding tests and add Clock constructor for sharded grids" to "Reinvoke sharding tests and refactor Clock constructor for sharded grids" on Mar 17, 2026
glwagner and others added 4 commits March 17, 2026 20:39
…@test

Remove ShardedDistributedField set_to_function! override that launched a KA
kernel with closures, causing InvalidIRError. Falls through to ReactantField
method which does CPU fallback instead.

Also remove stale @test isnothing(flat_grid.z) — grid.z is now a
StaticVerticalDiscretization even for Flat topologies.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Introduce PeriodicFillHalo struct that captures Val{N}, Val{H} as type
  parameters, avoiding grid access inside periodic halo kernels. This
  fixes Reactant MLIR compilation failures from runtime grid field access.

- Rename fill_halo_kernel! → fill_halo_kernel (non-mutating builder)

- Split periodic_size_and_offset into periodic_size and periodic_offset
  with improved comments explaining windowed field offset correction

- Remove stale _fill_north_halo! overrides from test/distributed_tests_utils.jl

- Update OceananigansEnzymeExt for the renamed functions

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Add explicit Tuple{Any, Any} and Tuple{Any} signatures to resolve
method ambiguity with the generic fill_halo_event! methods.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@glwagner
Member Author

glwagner commented Mar 21, 2026

@simone-silvestri this cleans up halo filling a bit --- there was some code in there that was badly golfed

# Use optimization_barrier to avoid aliasing last_Δt and last_stage_Δt,
# which causes XLA buffer donation errors in loops.
# (Δt + zero(Δt) gets folded by XLA, so we need a barrier.)
(last_stage_Δt,) = Reactant.Ops.optimization_barrier(Δt)
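For context, the barrier pattern quoted above can be sketched in isolation (illustrative only; assumes Reactant is loaded and Δt is a traced number, with made-up surrounding names):

```julia
# Without the barrier, assigning the same traced value to two clock
# fields lets MLIR collapse them into one returned value, which later
# trips XLA buffer donation inside @trace loops.
function split_timesteps(Δt)
    last_Δt = Δt
    # Δt + zero(Δt) is constant-folded away by XLA, so a barrier is
    # needed to force a genuinely distinct value:
    (last_stage_Δt,) = Reactant.Ops.optimization_barrier(Δt)
    return last_Δt, last_stage_Δt
end
```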
Collaborator

this is a bit weird, is there a reason last_Δt and last_stage_Δt shouldn't alias at the start [for ease]

Collaborator

alternatively, we should do something less strong than an optimization barrier [and add a new op]. what happens is that in mlir if we see multiple returns of the same value, we just return one value for both [as they're the same]. Either we shouldn't do that, or alternatively we can add a new op that means "force_buffer_return" or something. The reason being that we actually do still want this to get optimized inside of loops, etc.

cc @Pangoraw

Member Author

they don't alias at the start but I think the issue is something like

function tick!(clock, dt)
    last_dt.mlir_data = dt.mlir_data
    last_stage_dt.mlir_data = dt.mlir_data
    return nothing
end

so they alias after this update in tick!, we need to copy the mlir_data?

Collaborator

Can we make them alias at the start?

Member Author

I think the issue here is that whether or not we actually need last_stage_dt depends on the time-stepping method. For RK3 these are different but for an earlier time-stepping method (which we are using for ocean in these tests and on gb25) they are the same

Member Author

asking claude to figure it out

Member Author

I have to run but Claude is working on this (maybe will get it wrong), feel free to commit here in meantime

Collaborator

can this part be factored out in a follow up and the rest merged [since I know blocking some other tests]

Collaborator

in particular I think you'll want to revert ext/OceananigansReactantExt/OceananigansReactantExt.jl back to the way it was before [aka]

function Oceananigans.TimeSteppers.tick!(clock::Oceananigans.TimeSteppers.Clock{<:Any, <:Any, <:Reactant.TracedRNumber}, Δt; stage=false)
    Oceananigans.TimeSteppers.tick_time!(clock, Δt)

    if stage # tick a stage update
        clock.stage += 1
        clock.last_stage_Δt = Δt
    else # tick an iteration and reset stage
        clock.iteration.mlir_data = (clock.iteration + 1).mlir_data
        clock.stage = 1
    end

    return nothing
end

and we can fix the time stepper generality in a follow up [as this PR otherwise contains many necessary unrelated fixes]

Member Author

I'll see if this gets tests to pass on gb25. But I think the reason we changed it is because the gb25 tests were not passing? One issue is that we have changed this function so there is no longer a stage kwarg. I can figure out how to rewrite it though.

glwagner added a commit that referenced this pull request Mar 21, 2026
materialize_clock! in the constructor doesn't prevent MLIR-level aliasing
that occurs at runtime inside tick! when both last_Δt and last_stage_Δt
are assigned the same Δt value. The optimization_barrier is still needed
until Reactant provides a proper buffer copy op.

See #5389 (comment)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@codecov

codecov bot commented Mar 21, 2026

Codecov Report

❌ Patch coverage is 45.64103% with 106 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.91%. Comparing base (04fd2ed) to head (e1dbdb6).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/Operators/spacings_and_areas_and_volumes.jl 1.78% 55 Missing ⚠️
ext/OceananigansReactantExt/TimeSteppers.jl 40.42% 28 Missing ⚠️
src/BoundaryConditions/fill_halo_kernels.jl 82.92% 7 Missing ⚠️
ext/OceananigansReactantExt/Fields.jl 0.00% 3 Missing ⚠️
...c/BoundaryConditions/fill_halo_regions_periodic.jl 0.00% 3 Missing ⚠️
ext/OceananigansEnzymeExt.jl 0.00% 2 Missing ⚠️
ext/OceananigansReactantExt/Models.jl 0.00% 2 Missing ⚠️
...OceananigansReactantExt/OceananigansReactantExt.jl 83.33% 2 Missing ⚠️
ext/OceananigansReactantExt/OutputReaders.jl 0.00% 2 Missing ⚠️
...rthogonalSphericalShellGrids/distributed_zipper.jl 90.90% 1 Missing ⚠️
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5389      +/-   ##
==========================================
+ Coverage   68.84%   73.91%   +5.07%     
==========================================
  Files         398      398              
  Lines       21963    22732     +769     
==========================================
+ Hits        15121    16803    +1682     
+ Misses       6842     5929     -913     
Flag Coverage Δ
buildkite 68.76% <27.60%> (-0.09%) ⬇️
julia 68.76% <27.60%> (-0.09%) ⬇️
reactant_unit 3.79% <15.46%> (?)

Flags with carried forward coverage won't be shown.


clock.stage = 1
clock.last_Δt.mlir_data = Δt.mlir_data
clock.last_stage_Δt.mlir_data = Δt.mlir_data
clock.last_stage_Δt.mlir_data = promote_to_traced(Δt).mlir_data
Member Author

@wsmoses how about this (create a copy of dt and use that)

Collaborator

no this will not work, it'll still be seen as the same value at the end

Member Author

Collaborator

Yeah I would expect it to fail

Collaborator

Can we restore this to the old functionality, merge all the sharding stuff, then come back to this when @Pangoraw adds the attr stuff?

@wsmoses
Collaborator

wsmoses commented Mar 23, 2026

Avoiding the mlir_data assignment won't work either, and an upstream fix is necessary; I think you just need to revert, merge this, and wait for the upstream fix

@glwagner
Member Author

we cleared the reactant tests, and also the issue on gb25 is resolved so I am crossing my fingers this will get merged soon

@glwagner
Member Author

note on the above discussion: the solution was to add a materialize_clock!(clock, timestepper) switch --- for AB2, which has no stages, we force last_dt and last_stage_dt to alias initially and this solves the problem. The fix is in src not the extension, since the fix is innocuous for non-reactant situations. materialize_clock! is called in the constructor for HydrostaticFreeSurfaceModel.
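Pieced together from the commit messages, the materialize_clock! switch can be sketched roughly as follows (a paraphrase, not the exact source; assumes Oceananigans' QuasiAdamsBashforth2TimeStepper type is in scope):

```julia
# Generic fallback: most time steppers need no special clock handling.
materialize_clock!(clock, timestepper) = nothing

# AB2 has no stages, so last_Δt and last_stage_Δt always agree; aliasing
# them up front lets Reactant's tracer see a single buffer and avoids
# XLA buffer donation errors in compiled loops. setfield! bypasses any
# setproperty! override defined on the clock type.
function materialize_clock!(clock, ::QuasiAdamsBashforth2TimeStepper)
    setfield!(clock, :last_stage_Δt, getfield(clock, :last_Δt))
    return nothing
end
```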

@glwagner glwagner merged commit 4dd6137 into main Mar 23, 2026
67 of 68 checks passed
@glwagner glwagner deleted the glw/sharding-fixes-1 branch March 23, 2026 19:20
end

@kernel function _fill_periodic_bottom_and_top_halo!(c, bottom_bc, top_bc, loc, grid, args)
@kernel function _fill_periodic_bottom_and_top_halo!(c, ::Val{N}, ::Val{H}) where {N, H}
Collaborator

why passing Vals here?

Member Author

It seems that knowing the loop size at compile time leads to successful compilation, whereas not knowing it does not. This is actually needed for Julia 1.12 with Reactant.
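The idea is that with ::Val{N}, ::Val{H} arguments the loop bounds become type parameters, so the kernel body needs no runtime grid access. A simplified, non-kernel 1D sketch with hypothetical names:

```julia
# N interior points with H halo points on each side (length N + 2H).
# Because N and H are type parameters, the loop bounds are known at
# compile time and the function specializes fully per (N, H) pair.
function fill_periodic_halo!(c, ::Val{N}, ::Val{H}) where {N, H}
    for j in 1:H
        c[j]         = c[N + j]  # west halo <- east end of interior
        c[N + H + j] = c[H + j]  # east halo <- west end of interior
    end
    return c
end

c = zeros(8); c[3:6] .= 1:4          # N = 4, H = 2
fill_periodic_halo!(c, Val(4), Val(2))
```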

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah ok, cool

glwagner added a commit that referenced this pull request Mar 26, 2026
* Fix XLA buffer donation error from aliased clock fields in tick!

In tick!, both clock.last_Δt and clock.last_stage_Δt were set to the
same Δt.mlir_data, causing them to alias. Inside @trace for loops,
this triggers "Attempt to donate a buffer which is also used by the
same call to Execute()" because XLA tries to donate one while reading
the other in the while loop carry. Break the alias by creating a copy
via Δt + zero(Δt) for last_stage_Δt.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Remove initialization_update_state! and clean up initialization

Replace the confusing `initialization_update_state!` with a clear separation
of concerns between `reconcile_state!`, `initialize!`, and `update_state!`:

- `reconcile_state!`: ensures auxiliary state (barotropic velocities, vertical
  coordinate scaling) is consistent with prognostic fields. Called in `set!`
  and `initialize!`. Idempotent.

- `initialize!`: one-time setup before first time step. Calls `reconcile_state!`,
  synchronous halo fills, and `initialize_closure_fields!`. Allowed to be
  non-idempotent.

- `update_state!`: idempotent recomputation of diagnostics/derived quantities.
  Called every time step and after `set!`.

Renames:
- `initialize_free_surface!` → `reconcile_free_surface!`
- `initialize_vertical_coordinate!` → `reconcile_vertical_coordinate!`
- `maybe_initialize_state!` → `maybe_initialize!` (now also calls `initialize!`)

The `set!` function gains a `reconcile_state` kwarg (default `true`) for
advanced users who set barotropic velocities independently.

Bump version to 0.106.1.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Remove initialize! call from maybe_initialize!

maybe_initialize! should only call update_state!, not initialize!.
Calling initialize! inside time_step! would cause double-initialization
when used with Simulation (which already calls initialize! in
Simulation.initialize!).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Rename maybe_initialize! to maybe_prepare_first_time_step!

Also move reconcile_state! definition to TimeSteppers (where update_state!
lives) and call it from maybe_prepare_first_time_step! so that bare
time_step!(model, Δt) reconciles state on the first step.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Use optimization_barrier instead of +zero to break clock field alias

XLA folds Δt + zero(Δt) → Δt, so the aliasing between last_Δt and
last_stage_Δt persists. Use Reactant.Ops.optimization_barrier which
XLA cannot fold, ensuring distinct buffers for the while loop carry.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Add materialize_clock! and remove optimization_barrier from tick!

Add `materialize_clock!(clock, timestepper)` called in HFSM and NH model
constructors. For QuasiAdamsBashforth2TimeStepper, this sets
`clock.last_Δt = clock.last_stage_Δt` to ensure they are distinct objects.
This is needed for Reactant, where aliased ConcreteRNumber fields cause
XLA buffer donation errors in compiled loops.

With materialize_clock! breaking the alias at construction time, the
optimization_barrier workaround in the Reactant tick! override is no
longer needed and is removed.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Restore optimization_barrier in Reactant tick!

materialize_clock! in the constructor doesn't prevent MLIR-level aliasing
that occurs at runtime inside tick! when both last_Δt and last_stage_Δt
are assigned the same Δt value. The optimization_barrier is still needed
until Reactant provides a proper buffer copy op.

See #5389 (comment)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Remove optimization_barrier from Reactant tick!

materialize_clock! in the model constructor ensures last_Δt and
last_stage_Δt are distinct ConcreteRNumber objects. Assigning the same
value to both in tick! via .mlir_data does not re-alias the objects.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Move materialize_clock! QAB2 method to Reactant extension

For QAB2, last_Δt and last_stage_Δt always hold the same value after
tick!. In the Reactant extension, materialize_clock! aliases them via
setfield! (bypassing the ReactantClock setproperty! override) so that
Reactant's tracer sees one buffer, avoiding XLA buffer donation errors.

The src/ QAB2 method is removed since aliasing is only needed for
Reactant's ConcreteRNumber fields (normal Float64 fields are immutable
values with no aliasing concern).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* rm stale import

* fill free surface halos before timestepping

* remove stale imports

* add fill halos in reconcile_state!

* now just need to fix multi-region

* cubed sphere implementation

* fix the dispatch

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Co-authored-by: Simone Silvestri <[email protected]>

Labels

extensions 🧬 reactant ∇ all day I dream about MLIR

6 participants