[ai] add a debug_assert on shard_number by tellet-q · Pull Request #8846 · qdrant/qdrant

tellet-q · 2026-04-30T09:26:52Z

Check that shard_number is either shard_id or shard_id + 1. Fires if in future for some reason we'll allow removing a shard "in the middle".

* Make resharding state transitions idempotent on replay Why: consensus entries may be re-applied after a crash (partial state on disk) or during raft recovery. The unchecked state-transition helpers used `debug_assert!` to require a specific starting state, so a replay would panic in debug or silently overwrite in release. How to apply: use write_optional so the state file is only touched when the in-memory state needs to change. This also avoids unnecessary fsyncs when a replay is a no-op. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * Skip replica set creation on resharding start replay Why: on replay, `create_replica_set` would call `create_shard_dir`, which removes and recreates the existing shard directory -- wiping any shard contents that had already been migrated or written since the first apply. How to apply: check `contains_shard(shard_id)` before creating the replica set and pass `None` to `start_resharding_unchecked` when a replica set with the target shard id is already present. Also relaxes `check_start_resharding` to return `Ok` (instead of a swallowed `bad_request`) when a matching resharding state is persisted, so the caller can fall through each idempotent step. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * Derive resharding shard count from shard id for idempotency Why: the start/finish/abort paths used ++/-- on the persisted shard number with a `debug_assert` pinning the expected starting value. On replay (e.g. after a crash between the shard holder mutation and the config save) this either panics or produces a wrong count. Since resharding always targets the last shard (auto sharding assigns contiguous ids from zero), the target count is a pure function of the shard id: `shard_id + 1` for start up, `shard_id` for finish down and abort up. Set it directly and skip the save when it already matches, so replay converges to the same value without touching the file. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * Fall through resharding finish/abort steps on replay Why: the early return on `resharding_state.is_none()` assumed that a missing state means the operation was fully applied. But a crash can leave the state cleared while the shard drop, key mapping removal, or shard count update are still pending. Short-circuiting then skips the reconciling work those replays need to do. How to apply: drop the early return so every step runs; each step is already individually idempotent (check_*, drop_and_remove_shard, remove_shard_from_key_mapping, the set-based shard count update). The abort_resharding down-invalidation path now reads nodes from the router regardless of its variant, so a replay over an already-rolled- back ring doesn't trip the removed debug_assert!s. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * Skip shard key mapping write when nothing to remove Why: `write_optional` returns `Some` whenever the shard key is present, even if the shard id we're removing is already gone. That still triggers a JSON rewrite and a cache notification on every replay of a finished finish/abort. How to apply: check that the shard id is actually in the set before returning `Some`. If the set is missing or already doesn't contain the id, return `None` so the on-disk file and in-memory data are left untouched. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * Add tests for resharding replay idempotency Covers the paths that must not error or panic on replay: - `check_start_resharding` when matching state is already present - `start_resharding_unchecked` preserves matching state verbatim - `finish_resharding_unchecked` when state is already cleared Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * Simplify finish_resharding_unchecked closure Replace a manual `match` on `Option` with `as_ref().map(...)` to satisfy clippy::manual_map. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * test: add a test for resharding replay (#8839) * [ai] add a debug_assert on shard_number (#8846) * Update lib/collection/src/shards/shard_holder/resharding.rs Co-authored-by: Roman Titov <[email protected]> * Reformat --------- Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]> Co-authored-by: tellet-q <[email protected]> Co-authored-by: Roman Titov <[email protected]>

[ai] add a debug_assert on shard_number

69bc2d2

tellet-q requested a review from timvisee April 30, 2026 09:26

timvisee approved these changes Apr 30, 2026

View reviewed changes

tellet-q merged commit a52e6e5 into resharding-shard-holder-idempotent Apr 30, 2026
14 checks passed

tellet-q deleted the resharding-shard-holder-idempotent-assert branch April 30, 2026 09:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ai] add a debug_assert on shard_number#8846

[ai] add a debug_assert on shard_number#8846
tellet-q merged 1 commit into
resharding-shard-holder-idempotentfrom
resharding-shard-holder-idempotent-assert

tellet-q commented Apr 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tellet-q commented Apr 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants