Clear shard data before snapshot recovery transfer by timvisee · Pull Request #8782 · qdrant/qdrant

timvisee · 2026-04-23T15:41:48Z

Alternative for #8697

When sending a shard snapshot transfer to an existing shard, extra headroom is needed on the receiver. The snapshot is first downloaded in full and only then replaces the existing data, so the receiver temporarily needs about double the shard's size in disk space.

This is a blocker on some customer deployments that have very large shards. We'd also like to start enabling snapshot transfer by default everywhere, because it has better performance characteristics.

This PR adjusts the shard snapshot recovery process: it now clears the target shard first and then downloads the snapshot. This is safe because the shard transfer itself marks the shard as dead, so it is not used for reads. Clearing first removes the need for the extra headroom.
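To make the headroom argument concrete, here is a minimal sketch of peak disk usage under the two orderings. It is a toy model, not qdrant code: the function names and sizes are hypothetical, the snapshot is assumed to be about as large as the shard, and any temporary space needed to unpack the snapshot is ignored.

```python
GIB = 1024**3


def peak_download_then_replace(shard_bytes: int, snapshot_bytes: int) -> int:
    """Old ordering: the snapshot is downloaded while the old shard data still exists."""
    return shard_bytes + snapshot_bytes


def peak_clear_then_download(shard_bytes: int, snapshot_bytes: int) -> int:
    """New ordering: the old shard data is removed before the download starts."""
    return max(shard_bytes, snapshot_bytes)


shard = 500 * GIB     # hypothetical size of the existing shard
snapshot = 500 * GIB  # hypothetical snapshot size, assumed similar to the shard

old_peak = peak_download_then_replace(shard, snapshot)
new_peak = peak_clear_then_download(shard, snapshot)
print(old_peak // GIB, new_peak // GIB)  # prints "1000 500"
```

Under these assumptions the receiver's peak usage drops from roughly twice the shard size to roughly the shard size itself.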

All Submissions:

Contributions should target the dev branch. Did you create your branch from dev?
Have you followed the guidelines in our Contributing document?
Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

Does your submission pass tests?
Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
Have you checked your code using cargo clippy --workspace --all-features command?

Changes to Core Features:

Have you added an explanation of what your changes do and why you'd like us to include them?
Have you written new tests for your core changes, as applicable?
Have you successfully run tests with your changes locally?

timvisee · 2026-04-24T08:36:30Z

I've seen this failure multiple times now:

=========================== short test summary info ============================
FAILED tests/consensus_tests/test_failed_snapshot_recovery.py::test_dirty_shard_handling_with_active_replicas[snapshot] - ValueError: not enough values to unpack (expected 1, got 0)

Currently investigating.

E.g.: https://github.com/qdrant/qdrant/actions/runs/24879554764/job/72844257977#step:11:844

timvisee · 2026-04-24T14:28:01Z

Fixed in cd99636

A user request might temporarily see the replica disappear while it is cleared. In my opinion this is expected behavior. This resulted in the test falling through the 'wait all to be active' check.
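The race can be illustrated with a small, self-contained sketch. This is not the actual test code or the fix from the commit: `FakeCluster`, the replica dictionaries, and the field and state names are hypothetical stand-ins. It shows one way a wait that only inspects the replicas currently listed can pass while a replica is momentarily absent, and how also requiring the expected replica count avoids that.

```python
import time
from typing import Callable


class FakeCluster:
    """Hypothetical stand-in that replays a fixed sequence of replica listings."""

    def __init__(self, responses: list) -> None:
        self._responses = iter(responses)
        self.latest: list = []

    def replicas(self) -> list:
        # Return the next listing, or keep repeating the last one.
        self.latest = next(self._responses, self.latest)
        return self.latest


def all_listed_active(replicas: list) -> bool:
    # Naive check: says nothing about replicas missing from the list.
    return all(r["state"] == "Active" for r in replicas)


def all_expected_active(replicas: list, expected_count: int) -> bool:
    # Stricter check: a temporarily missing replica counts as "not ready yet".
    return len(replicas) == expected_count and all_listed_active(replicas)


def wait_for(check: Callable[[], bool], timeout: float = 5.0, interval: float = 0.01) -> None:
    deadline = time.monotonic() + timeout
    while not check():
        if time.monotonic() > deadline:
            raise TimeoutError("condition not met in time")
        time.sleep(interval)


# The replica on peer 2 is absent while it is cleared, then reappears.
cluster = FakeCluster([
    [{"peer_id": 1, "state": "Active"}],
    [{"peer_id": 1, "state": "Active"}, {"peer_id": 2, "state": "Partial"}],
    [{"peer_id": 1, "state": "Active"}, {"peer_id": 2, "state": "Active"}],
])

first = cluster.replicas()
naive_passed_early = all_listed_active(first)  # True, although peer 2 is missing

try:
    [missing] = [r for r in first if r["peer_id"] == 2]
    unpack_error = None
except ValueError as exc:
    unpack_error = str(exc)  # same kind of error as in the CI log

# Waiting on the expected count only returns once peer 2 is back and active.
wait_for(lambda: all_expected_active(cluster.replicas(), expected_count=2))
[recovered] = [r for r in cluster.latest if r["peer_id"] == 2]
```

In this sketch the naive check passes on the very first listing, and unpacking the missing replica then raises the same `ValueError` message seen in CI.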

I'm not 100% confident this fixed all CI issues yet, because it's hard to track what actually happens here. However, local runs don't show the problem anymore.

AI also couldn't find any more issues:

I could not reproduce locally over 30+ runs — fast CPU, CPU-loaded, single core (taskset -c 0), and 20% CPU quota via systemd-run. The failure window is timing-sensitive to CI's scheduler.

Let's consider it fixed and take another look if CI trips on it again.

…recovery

* [ai] On shard snapshot transfer recovery, drop existing shard before recovery
* [ai] Add integration test to assert clearing behavior
* [ai] Debug assert our replica is not active when we clear it
* [ai] Tweak assertion
* Fix flaky test, replica may temporarily not be visible
* Replace debug assertion with runtime error

timvisee mentioned this pull request Apr 23, 2026

Clear shard data before recovery #8697

Closed

11 tasks

timvisee added the release:1.18.0 label Apr 23, 2026

timvisee requested a review from agourlay April 23, 2026 16:04

timvisee marked this pull request as ready for review April 23, 2026 16:04

qdrant deleted a comment from coderabbitai Bot Apr 23, 2026

github-actions Bot mentioned this pull request Apr 23, 2026

Flaky test hnsw_quantized_search_test::hnsw_quantized_search_euclid_test #8735

Closed

qdrant deleted a comment from coderabbitai Bot Apr 24, 2026

timvisee requested a review from ffuugoo April 24, 2026 08:26

timvisee mentioned this pull request Apr 24, 2026

Use snapshot transfers by default #8784

Merged

9 tasks

qdrant deleted a comment from coderabbitai Bot Apr 24, 2026

agourlay reviewed Apr 24, 2026

Comment thread lib/collection/src/shards/replica_set/snapshots.rs Outdated

qdrant deleted a comment from coderabbitai Bot Apr 24, 2026

timvisee assigned ffuugoo and timvisee Apr 24, 2026

timvisee added 6 commits April 29, 2026 15:27

[ai] On shard snapshot transfer recovery, drop existing shard before …

23ad7a3

…recovery

[ai] Add integration test to assert clearing behavior

35c7935

[ai] Debug assert our replica is not active when we clear it

6e5cc58

[ai] Tweak assertion

8318837

Fix flaky test, replica may temporarily not be visible

4fd3071

Replace debug assertion with runtime error

e9280e7

timvisee force-pushed the clear-shard-data-before-recovery-2 branch from 028d85a to e9280e7 Compare April 29, 2026 13:27

qdrant deleted a comment from coderabbitai Bot Apr 29, 2026

github-actions Bot mentioned this pull request Apr 29, 2026

Flaky test hnsw_quantized_search_test::hnsw_turbo_quantization_manhattan_bits2_test #8838

Closed

timvisee requested review from agourlay and generall April 29, 2026 14:07

agourlay reviewed Apr 30, 2026

Comment thread tests/consensus_tests/test_shard_snapshot_transfer_clear_and_restart.py

agourlay approved these changes Apr 30, 2026

timvisee merged commit 4087b37 into dev Apr 30, 2026
15 checks passed

timvisee deleted the clear-shard-data-before-recovery-2 branch April 30, 2026 09:26

timvisee mentioned this pull request May 8, 2026

Bump version to 1.18.0 #8959

Merged

This was referenced May 21, 2026

Hotfix: don't clear shard before snapshot recovery transfer #9119

Closed

When clearing shard for snapshot transfer, use temporary dummy shard #9122

Merged
