Skip to content

[ai] add a debug_assert on shard_number#8846

Merged
tellet-q merged 1 commit into
resharding-shard-holder-idempotentfrom
resharding-shard-holder-idempotent-assert
Apr 30, 2026
Merged

[ai] add a debug_assert on shard_number#8846
tellet-q merged 1 commit into
resharding-shard-holder-idempotentfrom
resharding-shard-holder-idempotent-assert

Conversation

@tellet-q

Copy link
Copy Markdown
Contributor

Check that shard_number is either shard_id or shard_id + 1. Fires if in future for some reason we'll allow removing a shard "in the middle".

@tellet-q tellet-q requested a review from timvisee April 30, 2026 09:26
@tellet-q tellet-q merged commit a52e6e5 into resharding-shard-holder-idempotent Apr 30, 2026
14 checks passed
@tellet-q tellet-q deleted the resharding-shard-holder-idempotent-assert branch April 30, 2026 09:54
timvisee added a commit that referenced this pull request Apr 30, 2026
* Make resharding state transitions idempotent on replay

Why: consensus entries may be re-applied after a crash (partial state
on disk) or during raft recovery. The unchecked state-transition
helpers used `debug_assert!` to require a specific starting state, so
a replay would panic in debug or silently overwrite in release.

How to apply: use write_optional so the state file is only touched
when the in-memory state needs to change. This also avoids unnecessary
fsyncs when a replay is a no-op.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* Skip replica set creation on resharding start replay

Why: on replay, `create_replica_set` would call `create_shard_dir`,
which removes and recreates the existing shard directory -- wiping
any shard contents that had already been migrated or written since
the first apply.

How to apply: check `contains_shard(shard_id)` before creating the
replica set and pass `None` to `start_resharding_unchecked` when a
replica set with the target shard id is already present. Also
relaxes `check_start_resharding` to return `Ok` (instead of a
swallowed `bad_request`) when a matching resharding state is
persisted, so the caller can fall through each idempotent step.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* Derive resharding shard count from shard id for idempotency

Why: the start/finish/abort paths used ++/-- on the persisted shard
number with a `debug_assert` pinning the expected starting value. On
replay (e.g. after a crash between the shard holder mutation and the
config save) this either panics or produces a wrong count.

Since resharding always targets the last shard (auto sharding assigns
contiguous ids from zero), the target count is a pure function of the
shard id: `shard_id + 1` for start up, `shard_id` for finish down and
abort up. Set it directly and skip the save when it already matches,
so replay converges to the same value without touching the file.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* Fall through resharding finish/abort steps on replay

Why: the early return on `resharding_state.is_none()` assumed that a
missing state means the operation was fully applied. But a crash can
leave the state cleared while the shard drop, key mapping removal, or
shard count update are still pending. Short-circuiting then skips the
reconciling work those replays need to do.

How to apply: drop the early return so every step runs; each step is
already individually idempotent (check_*, drop_and_remove_shard,
remove_shard_from_key_mapping, the set-based shard count update).
The abort_resharding down-invalidation path now reads nodes from the
router regardless of its variant, so a replay over an already-rolled-
back ring doesn't trip the removed debug_assert!s.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* Skip shard key mapping write when nothing to remove

Why: `write_optional` returns `Some` whenever the shard key is
present, even if the shard id we're removing is already gone. That
still triggers a JSON rewrite and a cache notification on every
replay of a finished finish/abort.

How to apply: check that the shard id is actually in the set before
returning `Some`. If the set is missing or already doesn't contain
the id, return `None` so the on-disk file and in-memory data are
left untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* Add tests for resharding replay idempotency

Covers the paths that must not error or panic on replay:
- `check_start_resharding` when matching state is already present
- `start_resharding_unchecked` preserves matching state verbatim
- `finish_resharding_unchecked` when state is already cleared

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* Simplify finish_resharding_unchecked closure

Replace a manual `match` on `Option` with `as_ref().map(...)` to
satisfy clippy::manual_map.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* test: add a test for resharding replay (#8839)

* [ai] add a debug_assert on shard_number (#8846)

* Update lib/collection/src/shards/shard_holder/resharding.rs

Co-authored-by: Roman Titov <[email protected]>

* Reformat

---------

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
Co-authored-by: tellet-q <[email protected]>
Co-authored-by: Roman Titov <[email protected]>
timvisee added a commit that referenced this pull request May 8, 2026
* Make resharding state transitions idempotent on replay

Why: consensus entries may be re-applied after a crash (partial state
on disk) or during raft recovery. The unchecked state-transition
helpers used `debug_assert!` to require a specific starting state, so
a replay would panic in debug or silently overwrite in release.

How to apply: use write_optional so the state file is only touched
when the in-memory state needs to change. This also avoids unnecessary
fsyncs when a replay is a no-op.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* Skip replica set creation on resharding start replay

Why: on replay, `create_replica_set` would call `create_shard_dir`,
which removes and recreates the existing shard directory -- wiping
any shard contents that had already been migrated or written since
the first apply.

How to apply: check `contains_shard(shard_id)` before creating the
replica set and pass `None` to `start_resharding_unchecked` when a
replica set with the target shard id is already present. Also
relaxes `check_start_resharding` to return `Ok` (instead of a
swallowed `bad_request`) when a matching resharding state is
persisted, so the caller can fall through each idempotent step.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* Derive resharding shard count from shard id for idempotency

Why: the start/finish/abort paths used ++/-- on the persisted shard
number with a `debug_assert` pinning the expected starting value. On
replay (e.g. after a crash between the shard holder mutation and the
config save) this either panics or produces a wrong count.

Since resharding always targets the last shard (auto sharding assigns
contiguous ids from zero), the target count is a pure function of the
shard id: `shard_id + 1` for start up, `shard_id` for finish down and
abort up. Set it directly and skip the save when it already matches,
so replay converges to the same value without touching the file.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* Fall through resharding finish/abort steps on replay

Why: the early return on `resharding_state.is_none()` assumed that a
missing state means the operation was fully applied. But a crash can
leave the state cleared while the shard drop, key mapping removal, or
shard count update are still pending. Short-circuiting then skips the
reconciling work those replays need to do.

How to apply: drop the early return so every step runs; each step is
already individually idempotent (check_*, drop_and_remove_shard,
remove_shard_from_key_mapping, the set-based shard count update).
The abort_resharding down-invalidation path now reads nodes from the
router regardless of its variant, so a replay over an already-rolled-
back ring doesn't trip the removed debug_assert!s.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* Skip shard key mapping write when nothing to remove

Why: `write_optional` returns `Some` whenever the shard key is
present, even if the shard id we're removing is already gone. That
still triggers a JSON rewrite and a cache notification on every
replay of a finished finish/abort.

How to apply: check that the shard id is actually in the set before
returning `Some`. If the set is missing or already doesn't contain
the id, return `None` so the on-disk file and in-memory data are
left untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* Add tests for resharding replay idempotency

Covers the paths that must not error or panic on replay:
- `check_start_resharding` when matching state is already present
- `start_resharding_unchecked` preserves matching state verbatim
- `finish_resharding_unchecked` when state is already cleared

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* Simplify finish_resharding_unchecked closure

Replace a manual `match` on `Option` with `as_ref().map(...)` to
satisfy clippy::manual_map.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* test: add a test for resharding replay (#8839)

* [ai] add a debug_assert on shard_number (#8846)

* Update lib/collection/src/shards/shard_holder/resharding.rs

Co-authored-by: Roman Titov <[email protected]>

* Reformat

---------

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
Co-authored-by: tellet-q <[email protected]>
Co-authored-by: Roman Titov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants