test(consensus): snapshot transfer with set-payload for missing points by generall · Pull Request #9125 · qdrant/qdrant

generall · 2026-05-21T20:53:05Z

What

Adds a consensus test that reproduces a shard transfer abort triggered by set-payload (and other partial-update) operations targeting non-existing points while a snapshot transfer is in progress.

This is a red-first test: it is expected to fail on current dev and to pass once the receiver tolerates missing-point operations during shard-transfer recovery. The fix is proposed separately.

Why this happens

Production symptom:

ERROR collection::shards::transfer::driver: Failed to transfer shard ...: status: NotFound, message: "Not found: No point with id ... found"

Mechanism:

A partial-update op (set_payload/update_vectors/delete_payload/…) for a point that does not exist is written to the WAL before the point-existence check rejects it (LocalShard::submit_update → wal.lock_and_write).
During a snapshot transfer the queue proxy replays buffered WAL operations to the receiver.
The receiver applies them with WriteOrdering::Weak + clock_tag.force = true, which routes through update_local and bypasses the missing-point tolerance in handle_failed_replicas (the tolerance only guards the source's replica-set coordination, not the forwarded queue-proxy path).
The op hard-fails with PointNotFound. The only safety net is bounded retries (queue-level BATCH_RETRIES, then driver-level MAX_RETRY_COUNT). Under sustained load these are exhausted → the receiver replica is marked Dead and the transfer is aborted.

Test design notes

Heavy, unthrottled, wait=false load with id ranges spanning all shards keeps a missing-point op in every transfer attempt, so the failure is reliable rather than a lost race.
The transferred replica's state is checked while the load is still running. This is essential: consensus auto-recovers Dead replicas, and if the load is stopped first, the next recovery transfer has no failing ops to replay and succeeds — masking the bug.
Pass criterion: the transferred replica reaches Active; fail criterion: it goes Dead / never reaches Active within the timeout.

Verification

Current dev (buggy): fails reliably (2/2 runs), saw Dead state: True, driver logs the No point with id ... found abort. ~105s.
With a minimal receiver-side tolerance fix: passes (~19s), confirming the test is discriminating and not always-red.

Run locally:

tests/venv/bin/python -m pytest tests/consensus_tests/test_shard_snapshot_transfer_missing_point.py -p no:xdist -o addopts=""

🤖 Generated with Claude Code

Add a consensus test reproducing a shard transfer abort caused by set-payload (and other partial-update) operations targeting non-existing points. Such operations are written to the WAL before the point-existence check rejects them, so the queue proxy replays them to the receiver during a snapshot transfer. The receiver applies them with force=true, bypassing the missing-point tolerance in handle_failed_replicas, and the operation hard-fails with `NotFound: No point with id ... found`. Under sustained load the bounded queue/driver retries are exhausted, the receiver replica is marked Dead and the transfer is aborted. The test keeps the missing-point load running while checking the result, because consensus auto-recovers Dead replicas: stopping the load first would let the next recovery transfer succeed and mask the bug. It must FAIL on current code and PASS once the receiver tolerates missing-point operations during recovery. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

#9126) During a shard transfer the queue proxy replays operations from the sender's WAL to the receiver. Some of these are partial updates (set_payload, update_vectors, ...) that the receiver rejects with a non-transient error - most commonly `NotFound: No point with id ...` for a point that does not exist on the receiver, but also any other client-caused bad request. These operations were replayed from the WAL, meaning they were already applied (and rejected the same way) on the sender, so the sender's state reflects them as no-ops. Propagating the error aborted the whole transfer; under sustained load the bounded queue/driver retries were exhausted and the receiver replica was marked Dead. Handle the error where the semantic context lives - the transmitter (`transfer_operations_batch`): skip operations the remote rejects with a non-transient error and keep going, while still propagating transient errors so the caller retries delivery. Because the batch update API aborts at the first failing operation, a non-transient batch error falls back to one-by-one sending to isolate and skip the offending operation(s). This complements PR #5991, which handles missing points on the live forwarded-update path (handle_failed_replicas) but not the WAL replay path. Fixes the abort reproduced by test_shard_snapshot_transfer_with_missing_point_updates. Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>

coderabbitai · 2026-05-21T22:04:34Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 57df76a7-45c7-457a-9f66-740d81bc9442

📥 Commits

Reviewing files that changed from the base of the PR and between cd9fbc6 and b8b7084.

📒 Files selected for processing (2)

lib/collection/src/shards/queue_proxy_shard.rs
tests/consensus_tests/test_shard_snapshot_transfer_missing_point.py

📝 Walkthrough

Walkthrough

This PR refines error handling during WAL operation forwarding in shard snapshot transfers and introduces a corresponding integration test. The Rust change in queue_proxy_shard.rs distinguishes transient from non-transient errors when forwarding batch and one-by-one updates: transient errors bubble up for caller retry, while non-transient errors either trigger fallback to one-by-one transfer (batch path) or cause the operation to be skipped with transfer continuing (one-by-one path). The new Python integration test concurrently generates missing-point payloads and upserts during snapshot-based shard replication, verifying the recovery mechanism works under stress and that cluster convergence is achieved.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Failed test_shard_wal_delta_transfer_manual_recovery_chain #8346: The selective error handling for transient vs. non-transient failures during operation forwarding directly addresses WAL-delta transfer divergence scenarios in shard replication.

Suggested reviewers

agourlay
timvisee

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch test/snapshot-transfer-missing-point

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

#9125) * test(consensus): snapshot transfer with set-payload for missing points Add a consensus test reproducing a shard transfer abort caused by set-payload (and other partial-update) operations targeting non-existing points. Such operations are written to the WAL before the point-existence check rejects them, so the queue proxy replays them to the receiver during a snapshot transfer. The receiver applies them with force=true, bypassing the missing-point tolerance in handle_failed_replicas, and the operation hard-fails with `NotFound: No point with id ... found`. Under sustained load the bounded queue/driver retries are exhausted, the receiver replica is marked Dead and the transfer is aborted. The test keeps the missing-point load running while checking the result, because consensus auto-recovers Dead replicas: stopping the load first would let the next recovery transfer succeed and mask the bug. It must FAIL on current code and PASS once the receiver tolerates missing-point operations during recovery. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * fix(transfer): skip non-transient errors during queue proxy WAL replay (#9126) During a shard transfer the queue proxy replays operations from the sender's WAL to the receiver. Some of these are partial updates (set_payload, update_vectors, ...) that the receiver rejects with a non-transient error - most commonly `NotFound: No point with id ...` for a point that does not exist on the receiver, but also any other client-caused bad request. These operations were replayed from the WAL, meaning they were already applied (and rejected the same way) on the sender, so the sender's state reflects them as no-ops. Propagating the error aborted the whole transfer; under sustained load the bounded queue/driver retries were exhausted and the receiver replica was marked Dead. Handle the error where the semantic context lives - the transmitter (`transfer_operations_batch`): skip operations the remote rejects with a non-transient error and keep going, while still propagating transient errors so the caller retries delivery. Because the batch update API aborts at the first failing operation, a non-transient batch error falls back to one-by-one sending to isolate and skip the offending operation(s). This complements PR #5991, which handles missing points on the live forwarded-update path (handle_failed_replicas) but not the WAL replay path. Fixes the abort reproduced by test_shard_snapshot_transfer_with_missing_point_updates. Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]> --------- Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>

generall mentioned this pull request May 21, 2026

fix(transfer): skip non-transient errors during queue proxy WAL replay #9126

Merged

generall marked this pull request as ready for review May 21, 2026 22:02

generall merged commit 64558ac into dev May 21, 2026
12 checks passed

generall deleted the test/snapshot-transfer-missing-point branch May 21, 2026 22:03

This was referenced Jun 2, 2026

test(consensus): batch initial upsert in snapshot-transfer missing-point test #9261

Merged

test(consensus): de-flake replicate_points_stream_transfer_updates #9263

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(consensus): snapshot transfer with set-payload for missing points#9125

test(consensus): snapshot transfer with set-payload for missing points#9125
generall merged 2 commits into
devfrom
test/snapshot-transfer-missing-point

generall commented May 21, 2026

Uh oh!

Uh oh!

coderabbitai Bot commented May 21, 2026

Review failed

Walkthrough

Estimated code review effort

Possibly related issues

Suggested reviewers

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

generall commented May 21, 2026

What

Why this happens

Test design notes

Verification

Uh oh!

Uh oh!

coderabbitai Bot commented May 21, 2026

Review failed

Walkthrough

Estimated code review effort

Possibly related issues

Suggested reviewers

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant