test: add a test for resharding replay#8839
Merged
tellet-q merged 1 commit intoApr 29, 2026
Merged
Conversation
timvisee
approved these changes
Apr 29, 2026
5bfca31
into
resharding-shard-holder-idempotent
19 of 20 checks passed
10 tasks
timvisee
added a commit
that referenced
this pull request
Apr 30, 2026
* Make resharding state transitions idempotent on replay Why: consensus entries may be re-applied after a crash (partial state on disk) or during raft recovery. The unchecked state-transition helpers used `debug_assert!` to require a specific starting state, so a replay would panic in debug or silently overwrite in release. How to apply: use write_optional so the state file is only touched when the in-memory state needs to change. This also avoids unnecessary fsyncs when a replay is a no-op. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * Skip replica set creation on resharding start replay Why: on replay, `create_replica_set` would call `create_shard_dir`, which removes and recreates the existing shard directory -- wiping any shard contents that had already been migrated or written since the first apply. How to apply: check `contains_shard(shard_id)` before creating the replica set and pass `None` to `start_resharding_unchecked` when a replica set with the target shard id is already present. Also relaxes `check_start_resharding` to return `Ok` (instead of a swallowed `bad_request`) when a matching resharding state is persisted, so the caller can fall through each idempotent step. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * Derive resharding shard count from shard id for idempotency Why: the start/finish/abort paths used ++/-- on the persisted shard number with a `debug_assert` pinning the expected starting value. On replay (e.g. after a crash between the shard holder mutation and the config save) this either panics or produces a wrong count. Since resharding always targets the last shard (auto sharding assigns contiguous ids from zero), the target count is a pure function of the shard id: `shard_id + 1` for start up, `shard_id` for finish down and abort up. Set it directly and skip the save when it already matches, so replay converges to the same value without touching the file. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * Fall through resharding finish/abort steps on replay Why: the early return on `resharding_state.is_none()` assumed that a missing state means the operation was fully applied. But a crash can leave the state cleared while the shard drop, key mapping removal, or shard count update are still pending. Short-circuiting then skips the reconciling work those replays need to do. How to apply: drop the early return so every step runs; each step is already individually idempotent (check_*, drop_and_remove_shard, remove_shard_from_key_mapping, the set-based shard count update). The abort_resharding down-invalidation path now reads nodes from the router regardless of its variant, so a replay over an already-rolled- back ring doesn't trip the removed debug_assert!s. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * Skip shard key mapping write when nothing to remove Why: `write_optional` returns `Some` whenever the shard key is present, even if the shard id we're removing is already gone. That still triggers a JSON rewrite and a cache notification on every replay of a finished finish/abort. How to apply: check that the shard id is actually in the set before returning `Some`. If the set is missing or already doesn't contain the id, return `None` so the on-disk file and in-memory data are left untouched. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * Add tests for resharding replay idempotency Covers the paths that must not error or panic on replay: - `check_start_resharding` when matching state is already present - `start_resharding_unchecked` preserves matching state verbatim - `finish_resharding_unchecked` when state is already cleared Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * Simplify finish_resharding_unchecked closure Replace a manual `match` on `Option` with `as_ref().map(...)` to satisfy clippy::manual_map. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * test: add a test for resharding replay (#8839) * [ai] add a debug_assert on shard_number (#8846) * Update lib/collection/src/shards/shard_holder/resharding.rs Co-authored-by: Roman Titov <[email protected]> * Reformat --------- Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]> Co-authored-by: tellet-q <[email protected]> Co-authored-by: Roman Titov <[email protected]>
timvisee
added a commit
that referenced
this pull request
May 8, 2026
* Make resharding state transitions idempotent on replay Why: consensus entries may be re-applied after a crash (partial state on disk) or during raft recovery. The unchecked state-transition helpers used `debug_assert!` to require a specific starting state, so a replay would panic in debug or silently overwrite in release. How to apply: use write_optional so the state file is only touched when the in-memory state needs to change. This also avoids unnecessary fsyncs when a replay is a no-op. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * Skip replica set creation on resharding start replay Why: on replay, `create_replica_set` would call `create_shard_dir`, which removes and recreates the existing shard directory -- wiping any shard contents that had already been migrated or written since the first apply. How to apply: check `contains_shard(shard_id)` before creating the replica set and pass `None` to `start_resharding_unchecked` when a replica set with the target shard id is already present. Also relaxes `check_start_resharding` to return `Ok` (instead of a swallowed `bad_request`) when a matching resharding state is persisted, so the caller can fall through each idempotent step. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * Derive resharding shard count from shard id for idempotency Why: the start/finish/abort paths used ++/-- on the persisted shard number with a `debug_assert` pinning the expected starting value. On replay (e.g. after a crash between the shard holder mutation and the config save) this either panics or produces a wrong count. Since resharding always targets the last shard (auto sharding assigns contiguous ids from zero), the target count is a pure function of the shard id: `shard_id + 1` for start up, `shard_id` for finish down and abort up. Set it directly and skip the save when it already matches, so replay converges to the same value without touching the file. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * Fall through resharding finish/abort steps on replay Why: the early return on `resharding_state.is_none()` assumed that a missing state means the operation was fully applied. But a crash can leave the state cleared while the shard drop, key mapping removal, or shard count update are still pending. Short-circuiting then skips the reconciling work those replays need to do. How to apply: drop the early return so every step runs; each step is already individually idempotent (check_*, drop_and_remove_shard, remove_shard_from_key_mapping, the set-based shard count update). The abort_resharding down-invalidation path now reads nodes from the router regardless of its variant, so a replay over an already-rolled- back ring doesn't trip the removed debug_assert!s. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * Skip shard key mapping write when nothing to remove Why: `write_optional` returns `Some` whenever the shard key is present, even if the shard id we're removing is already gone. That still triggers a JSON rewrite and a cache notification on every replay of a finished finish/abort. How to apply: check that the shard id is actually in the set before returning `Some`. If the set is missing or already doesn't contain the id, return `None` so the on-disk file and in-memory data are left untouched. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * Add tests for resharding replay idempotency Covers the paths that must not error or panic on replay: - `check_start_resharding` when matching state is already present - `start_resharding_unchecked` preserves matching state verbatim - `finish_resharding_unchecked` when state is already cleared Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * Simplify finish_resharding_unchecked closure Replace a manual `match` on `Option` with `as_ref().map(...)` to satisfy clippy::manual_map. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * test: add a test for resharding replay (#8839) * [ai] add a debug_assert on shard_number (#8846) * Update lib/collection/src/shards/shard_holder/resharding.rs Co-authored-by: Roman Titov <[email protected]> * Reformat --------- Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]> Co-authored-by: tellet-q <[email protected]> Co-authored-by: Roman Titov <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
A new consensus_test to assert an idempotency of a resharding op.