fix(transfer): skip non-transient errors during queue proxy WAL replay by generall · Pull Request #9126 · qdrant/qdrant

generall · 2026-05-21T20:58:37Z

What

Fix for the shard-transfer abort reproduced by the test in #9125.

Base branch is test/snapshot-transfer-missing-point on purpose, so this PR's CI runs test_shard_snapshot_transfer_with_missing_point_updates with the fix and shows it green (it fails on the test branch / dev alone). Rebase onto dev before merging.

Problem

During a shard transfer the queue proxy replays operations from the sender's WAL to the receiver (transfer_operations_batch). Some are partial updates (set_payload, update_vectors, …) that the receiver rejects with a non-transient error — most commonly NotFound: No point with id ... for a point that does not exist on the receiver, but equally any other client-caused bad request.

These ops are forwarded with WriteOrdering::Weak + force = true, which routes through update_local and bypasses the missing-point tolerance in handle_failed_replicas (added in #5991 for the live forwarded-update path). The error propagates and aborts the transfer:

ERROR collection::shards::transfer::driver: Failed to transfer shard ...: NotFound "No point with id ... found"

The only safety net is bounded retries (queue-level BATCH_RETRIES, driver-level MAX_RETRY_COUNT). Under sustained load they are exhausted, the receiver replica is marked Dead, and the transfer aborts (consensus auto-restarts it, failing the same way while load continues).

Fix

Handle the error where the semantic context lives — the transmitter (transfer_operations_batch):

Skip non-transient errors and continue. A replayed WAL op that the remote rejects non-transiently was already applied (and rejected the same way) on the sender, so the sender's state reflects it as a no-op → skipping keeps both sides consistent.
Propagate transient errors so the caller retries delivery (network/timeout/unavailable).
The batch update API aborts at the first failing op and can't tell us which one, so a non-transient batch error falls back to one-by-one sending to isolate and skip the offending op(s).
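
The three rules above can be sketched as follows. This is a minimal illustration, not the code from this PR: `SendError`, `Remote`, `FakeRemote`, and `transfer_batch` are hypothetical names, operations are reduced to plain `u64` ids, and the sketch assumes that resending an operation the receiver has already applied is harmless.

```rust
use std::collections::{BTreeSet, HashSet};

/// Hypothetical error classification; qdrant's real error type is richer.
#[derive(Debug, Clone, PartialEq)]
enum SendError {
    /// Network failure, timeout, unavailable peer: delivery may succeed later.
    Transient(String),
    /// Deterministic rejection, e.g. a missing point or another bad request.
    NonTransient(String),
}

/// Hypothetical stand-in for the receiving shard.
trait Remote {
    /// Applies ops in order and aborts at the first failure,
    /// without reporting which op caused it.
    fn send_batch(&mut self, ops: &[u64]) -> Result<(), SendError>;
    fn send_one(&mut self, op: u64) -> Result<(), SendError>;
}

/// Sends a batch and returns the ops that were skipped.
fn transfer_batch<R: Remote>(remote: &mut R, ops: &[u64]) -> Result<Vec<u64>, SendError> {
    match remote.send_batch(ops) {
        Ok(()) => return Ok(Vec::new()),
        // Transient: propagate so the caller retries delivery.
        Err(err @ SendError::Transient(_)) => return Err(err),
        // Non-transient: fall through and isolate the offending op(s).
        Err(SendError::NonTransient(_)) => {}
    }

    let mut skipped = Vec::new();
    for &op in ops {
        match remote.send_one(op) {
            Ok(()) => {}
            // The sender rejected this op the same way, so skipping it
            // keeps both sides consistent.
            Err(SendError::NonTransient(_)) => skipped.push(op),
            Err(err @ SendError::Transient(_)) => return Err(err),
        }
    }
    Ok(skipped)
}

/// In-memory receiver used only to exercise the sketch.
struct FakeRemote {
    offline: bool,
    rejected: HashSet<u64>,
    applied: BTreeSet<u64>,
}

impl Remote for FakeRemote {
    fn send_batch(&mut self, ops: &[u64]) -> Result<(), SendError> {
        ops.iter().try_for_each(|&op| self.send_one(op))
    }

    fn send_one(&mut self, op: u64) -> Result<(), SendError> {
        if self.offline {
            return Err(SendError::Transient("receiver unavailable".to_string()));
        }
        if self.rejected.contains(&op) {
            return Err(SendError::NonTransient(format!("No point with id {op} found")));
        }
        self.applied.insert(op);
        Ok(())
    }
}
```

Trying the batch first keeps the common case to a single call; the slower one-by-one path runs only after a non-transient batch failure.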

Design notes

Transmitter, not receiver: the queue proxy is the component that knows it's replaying historical WAL ops; the receiver's update_local shouldn't assume its caller's intent.
Non-transient, not just missing-point: #5991 (Do not deactivate partial/recovery replica on missing point) deliberately narrowed its tolerance to missing points on the live path, where ops aren't pre-adjudicated. Replayed ops, by contrast, have already been adjudicated on the sender, so any non-transient rejection is safe to skip. This also covers bad requests beyond missing points.
Covers WAL-delta too, since it forwards through the same queue-proxy path.

Verification

Test branch alone (no fix): test fails reliably (replica goes Dead, never Active).
With this fix: test passes (~25s). Logs show ~1.2k `Skipping operation permanently rejected ... No point with id ...` messages and 0 driver-level transfer aborts.

🤖 Generated with Claude Code

During a shard transfer the queue proxy replays operations from the sender's WAL to the receiver. Some of these are partial updates (set_payload, update_vectors, ...) that the receiver rejects with a non-transient error - most commonly `NotFound: No point with id ...` for a point that does not exist on the receiver, but also any other client-caused bad request.

These operations were replayed from the WAL, meaning they were already applied (and rejected the same way) on the sender, so the sender's state reflects them as no-ops. Propagating the error aborted the whole transfer; under sustained load the bounded queue/driver retries were exhausted and the receiver replica was marked Dead.

Handle the error where the semantic context lives - the transmitter (`transfer_operations_batch`): skip operations the remote rejects with a non-transient error and keep going, while still propagating transient errors so the caller retries delivery. Because the batch update API aborts at the first failing operation, a non-transient batch error falls back to one-by-one sending to isolate and skip the offending operation(s).

This complements PR #5991, which handles missing points on the live forwarded-update path (handle_failed_replicas) but not the WAL replay path.

Fixes the abort reproduced by test_shard_snapshot_transfer_with_missing_point_updates.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

generall · 2026-05-21T22:02:22Z

FYI @timvisee: the bug was observed in prod, and I reproduced it in cloud and in a test. It would be nice if you took a look, but I'm merging it without review for a faster release.

Additionally validated (but didn't include in the PR) that for streaming transfer those non-transient errors are properly handled, as we always first try to update the local shard and only then proxy to the remote.
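
The ordering described above can be illustrated with a small sketch. The helper `update_then_forward` and its closure parameters are hypothetical and stand in for the real streaming-transfer proxy, which is more involved; the point is only that an operation rejected locally never reaches the remote.

```rust
/// Hypothetical helper: apply the update to the local shard first and
/// forward it to the remote only if the local update succeeded.
fn update_then_forward<L, R>(apply_local: L, forward_remote: R) -> Result<(), String>
where
    L: FnOnce() -> Result<(), String>,
    R: FnOnce() -> Result<(), String>,
{
    // A local rejection (e.g. a missing point) returns early,
    // so the remote never sees the rejected operation.
    apply_local()?;
    forward_remote()
}
```

Under this ordering the receiver is never asked to apply an operation the sender itself rejected, which is why this path does not need the skip logic.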

#9125)

* test(consensus): snapshot transfer with set-payload for missing points

Add a consensus test reproducing a shard transfer abort caused by set-payload (and other partial-update) operations targeting non-existing points.

Such operations are written to the WAL before the point-existence check rejects them, so the queue proxy replays them to the receiver during a snapshot transfer. The receiver applies them with force=true, bypassing the missing-point tolerance in handle_failed_replicas, and the operation hard-fails with `NotFound: No point with id ... found`. Under sustained load the bounded queue/driver retries are exhausted, the receiver replica is marked Dead and the transfer is aborted.

The test keeps the missing-point load running while checking the result, because consensus auto-recovers Dead replicas: stopping the load first would let the next recovery transfer succeed and mask the bug. It must FAIL on current code and PASS once the receiver tolerates missing-point operations during recovery.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix(transfer): skip non-transient errors during queue proxy WAL replay (#9126)

During a shard transfer the queue proxy replays operations from the sender's WAL to the receiver. Some of these are partial updates (set_payload, update_vectors, ...) that the receiver rejects with a non-transient error - most commonly `NotFound: No point with id ...` for a point that does not exist on the receiver, but also any other client-caused bad request.

These operations were replayed from the WAL, meaning they were already applied (and rejected the same way) on the sender, so the sender's state reflects them as no-ops. Propagating the error aborted the whole transfer; under sustained load the bounded queue/driver retries were exhausted and the receiver replica was marked Dead.

Handle the error where the semantic context lives - the transmitter (`transfer_operations_batch`): skip operations the remote rejects with a non-transient error and keep going, while still propagating transient errors so the caller retries delivery. Because the batch update API aborts at the first failing operation, a non-transient batch error falls back to one-by-one sending to isolate and skip the offending operation(s).

This complements PR #5991, which handles missing points on the live forwarded-update path (handle_failed_replicas) but not the WAL replay path.

Fixes the abort reproduced by test_shard_snapshot_transfer_with_missing_point_updates.

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>

generall force-pushed the fix/snapshot-transfer-missing-point branch from b0605f8 to 687d0d2 Compare May 21, 2026 21:37

generall changed the title ~~fix(transfer): tolerate missing-point updates on recovering replica~~ fix(transfer): skip non-transient errors during queue proxy WAL replay May 21, 2026

generall requested a review from timvisee May 21, 2026 21:58

generall marked this pull request as ready for review May 21, 2026 21:59

generall merged commit b8b7084 into test/snapshot-transfer-missing-point May 21, 2026
14 checks passed

generall deleted the fix/snapshot-transfer-missing-point branch May 21, 2026 22:02
