Skip to content

Clear shard data before snapshot recovery transfer#8782

Merged
timvisee merged 6 commits into
devfrom
clear-shard-data-before-recovery-2
Apr 30, 2026
Merged

Clear shard data before snapshot recovery transfer#8782
timvisee merged 6 commits into
devfrom
clear-shard-data-before-recovery-2

Conversation

@timvisee
Copy link
Copy Markdown
Member

Alternative for #8697

When sending a shard snapshot transfer to an existing shard, extra headroom is needed on the receiver. Specifically, the snapshot is first downloaded and then replaces the existing data. It means that we need double the size of a shard in terms of disk space.

This is a blocker on some customer deployments that have very large shards. And we'd like to start enabling snapshot transfer by default everywhere because it has better performance characteristics.

This PR adjusts the shard snapshot recovery process. It now clears the target shard and then downloads the snapshot. This is safe because the shard transfer itself marks the shard as dead, and so it will not be used in reads. Clearing first prevents needing the headroom.

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using cargo clippy --workspace --all-features command?

Changes to Core Features:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully ran tests with your changes locally?

@timvisee timvisee mentioned this pull request Apr 23, 2026
11 tasks
@timvisee timvisee requested a review from agourlay April 23, 2026 16:04
@timvisee timvisee marked this pull request as ready for review April 23, 2026 16:04
coderabbitai[bot]

This comment was marked as resolved.

@timvisee
Copy link
Copy Markdown
Member Author

timvisee commented Apr 24, 2026

I've seen this failure multiple times now:

=========================== short test summary info ============================
FAILED tests/consensus_tests/test_failed_snapshot_recovery.py::test_dirty_shard_handling_with_active_replicas[snapshot] - ValueError: not enough values to unpack (expected 1, got 0)

Currently investigating.

E.g.: https://github.com/qdrant/qdrant/actions/runs/24879554764/job/72844257977#step:11:844

@timvisee timvisee mentioned this pull request Apr 24, 2026
9 tasks
@qdrant qdrant deleted a comment from coderabbitai Bot Apr 24, 2026
Comment thread lib/collection/src/shards/replica_set/snapshots.rs Outdated
@timvisee
Copy link
Copy Markdown
Member Author

timvisee commented Apr 24, 2026

I've seen this failure multiple times now:

=========================== short test summary info ============================
FAILED tests/consensus_tests/test_failed_snapshot_recovery.py::test_dirty_shard_handling_with_active_replicas[snapshot] - ValueError: not enough values to unpack (expected 1, got 0)

Currently investigating.

E.g.: https://github.com/qdrant/qdrant/actions/runs/24879554764/job/72844257977#step:11:844

Fixed in cd99636

A user request might temporarily see the replica disappear while it is cleared. In my opinion this is expected behavior. This resulted in the test falling through the 'wait all to be active' check.

I'm not 100% confident this fixed all CI issues yet, because it's hard to track what actually happens here. However, local runs don't show the problem anymore.

AI also couldn't find any more issues:

I could not reproduce locally over 30+ runs — fast CPU, CPU-loaded, single core (taskset -c 0), and 20% CPU quota via systemd-run. The failure window is timing-sensitive to CI's scheduler.

Let's consider it fixed and take another look if CI trips on it again.

@qdrant qdrant deleted a comment from coderabbitai Bot Apr 24, 2026
@timvisee timvisee force-pushed the clear-shard-data-before-recovery-2 branch from 028d85a to e9280e7 Compare April 29, 2026 13:27
coderabbitai[bot]

This comment was marked as resolved.

@timvisee timvisee merged commit 4087b37 into dev Apr 30, 2026
15 checks passed
@timvisee timvisee deleted the clear-shard-data-before-recovery-2 branch April 30, 2026 09:26
timvisee added a commit that referenced this pull request May 8, 2026
* [ai] On shard snapshot transfer recovery, drop existing shard before recovery

* [ai] Add integration test to assert clearing behavior

* [ai] Debug assert our replica is not active when we clear it

* [ai] Tweak assertion

* Fix flaky test, replica may temporarily not be visible

* Replace debug assertion with runtime error
@timvisee timvisee mentioned this pull request May 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants