fix(transfer): skip non-transient errors during queue proxy WAL replay #9126

Merged
generall merged 1 commit into test/snapshot-transfer-missing-point from fix/snapshot-transfer-missing-point
May 21, 2026

@generall generall commented May 21, 2026

What

Fix for the shard-transfer abort reproduced by the test in #9125.

Base branch is test/snapshot-transfer-missing-point on purpose, so this PR's CI runs test_shard_snapshot_transfer_with_missing_point_updates with the fix and shows it green (it fails on the test branch / dev alone). Rebase onto dev before merging.

Problem

During a shard transfer the queue proxy replays operations from the sender's WAL to the receiver (transfer_operations_batch). Some are partial updates (set_payload, update_vectors, …) that the receiver rejects with a non-transient error — most commonly NotFound: No point with id ... for a point that does not exist on the receiver, but equally any other client-caused bad request.

These ops are forwarded with WriteOrdering::Weak + force = true, which routes through update_local and bypasses the missing-point tolerance in handle_failed_replicas (added in #5991 for the live forwarded-update path). The error propagates and aborts the transfer:

ERROR collection::shards::transfer::driver: Failed to transfer shard ...: NotFound "No point with id ... found"

The only safety net is bounded retries (queue-level BATCH_RETRIES, driver-level MAX_RETRY_COUNT). Under sustained load they are exhausted, the receiver replica is marked Dead, and the transfer aborts (consensus auto-restarts it, failing the same way while load continues).
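
To illustrate why bounded retries are not a real safety net here, a minimal sketch follows. The names `SendError` and `retry_bounded` and the retry limit used are hypothetical; they do not reflect the actual values of `BATCH_RETRIES` or `MAX_RETRY_COUNT`, nor Qdrant's real retry code. The point is only that a retry loop which ignores the error kind spends its whole budget on an operation that fails deterministically.

```rust
/// Hypothetical error type for this sketch; not Qdrant's real error enum.
#[derive(Debug, Clone, PartialEq)]
enum SendError {
    /// Delivery problem (network, timeout): retrying may help.
    Transient(String),
    /// The receiver rejected the operation itself: retrying cannot help.
    NonTransient(String),
}

/// Calls `send` up to `max_retries + 1` times and returns the number of
/// attempts used on success, or the last error once the budget is spent.
/// The loop does not inspect the error kind, so a deterministic rejection
/// exhausts the budget exactly like a flaky network would.
fn retry_bounded<F>(max_retries: usize, mut send: F) -> Result<usize, SendError>
where
    F: FnMut() -> Result<(), SendError>,
{
    let mut last_err = None;
    for attempt in 0..=max_retries {
        match send() {
            Ok(()) => return Ok(attempt + 1),
            Err(err) => last_err = Some(err),
        }
    }
    Err(last_err.expect("loop body runs at least once"))
}
```

With a transient failure the loop can eventually succeed; with a `NonTransient` rejection every attempt fails the same way, which matches the behaviour described above.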

Fix

Handle the error where the semantic context lives — the transmitter (transfer_operations_batch):

  • Skip non-transient errors and continue. A replayed WAL op that the remote rejects non-transiently was already applied (and rejected the same way) on the sender, so the sender's state reflects it as a no-op → skipping keeps both sides consistent.
  • Propagate transient errors so the caller retries delivery (network/timeout/unavailable).
  • The batch update API aborts at the first failing op and can't tell us which one, so a non-transient batch error falls back to one-by-one sending to isolate and skip the offending op(s).
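
The three rules above can be sketched as follows. This is not the code from this PR: `RemoteError`, `SetPayload`, `FakeReceiver` and `replay_batch` are hypothetical names, and the receiver is an in-memory stand-in. The sketch also assumes that replayed operations are idempotent, because the one-by-one fallback re-sends operations that the aborted batch may already have applied.

```rust
use std::collections::HashSet;

/// Hypothetical error type for this sketch; not Qdrant's real error enum.
#[derive(Debug, Clone, PartialEq)]
enum RemoteError {
    Transient(String),
    NonTransient(String),
}

/// A replayed partial update, reduced to the point id it targets.
#[derive(Debug, Clone, Copy, PartialEq)]
struct SetPayload {
    point_id: u64,
}

/// In-memory stand-in for the receiver (hypothetical).
struct FakeReceiver {
    points: HashSet<u64>,
    applied: Vec<u64>,
    reachable: bool,
}

impl FakeReceiver {
    /// Applies ops in order and aborts at the first failing op,
    /// without reporting which op in the batch failed.
    fn update_batch(&mut self, ops: &[SetPayload]) -> Result<(), RemoteError> {
        if !self.reachable {
            return Err(RemoteError::Transient("receiver unavailable".to_string()));
        }
        for op in ops {
            if !self.points.contains(&op.point_id) {
                return Err(RemoteError::NonTransient(format!(
                    "No point with id {} found",
                    op.point_id
                )));
            }
            self.applied.push(op.point_id);
        }
        Ok(())
    }
}

/// Replays a batch and returns how many ops were skipped.
fn replay_batch(receiver: &mut FakeReceiver, ops: &[SetPayload]) -> Result<usize, RemoteError> {
    match receiver.update_batch(ops) {
        Ok(()) => return Ok(0),
        // Transient: propagate so the caller retries delivery.
        Err(err @ RemoteError::Transient(_)) => return Err(err),
        // Non-transient: fall through to one-by-one sending.
        Err(RemoteError::NonTransient(_)) => {}
    }
    let mut skipped = 0;
    for op in ops {
        match receiver.update_batch(std::slice::from_ref(op)) {
            Ok(()) => {}
            // The sender rejected this op the same way, so skip it.
            Err(RemoteError::NonTransient(_)) => skipped += 1,
            Err(err @ RemoteError::Transient(_)) => return Err(err),
        }
    }
    Ok(skipped)
}
```

In this sketch, a batch targeting points 1, 2 and 3 against a receiver that only has points 1 and 3 skips one operation and applies the rest, while an unreachable receiver still surfaces a transient error to the caller.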

Design notes

  • Transmitter, not receiver: the queue proxy is the component that knows it's replaying historical WAL ops; the receiver's update_local shouldn't assume its caller's intent.
  • Non-transient, not just missing-point: unlike #5991 ("Do not deactivate partial/recovery replica on missing point"), which deliberately narrowed to missing-point on the live path, where ops aren't pre-adjudicated, replayed ops have already been adjudicated on the sender, so any non-transient rejection is safe to skip. This also covers bad requests beyond missing points.
  • Covers WAL-delta too, since it forwards through the same queue-proxy path.

Verification

  • Test branch alone (no fix): test fails reliably (replica goes Dead, never Active).
  • With this fix: test passes (~25s). Logs show ~1.2k `Skipping operation permanently rejected ... No point with id ...` messages and 0 driver-level transfer aborts.

🤖 Generated with Claude Code

During a shard transfer the queue proxy replays operations from the
sender's WAL to the receiver. Some of these are partial updates
(set_payload, update_vectors, ...) that the receiver rejects with a
non-transient error - most commonly `NotFound: No point with id ...`
for a point that does not exist on the receiver, but also any other
client-caused bad request.

These operations were replayed from the WAL, meaning they were already
applied (and rejected the same way) on the sender, so the sender's state
reflects them as no-ops. Propagating the error aborted the whole transfer;
under sustained load the bounded queue/driver retries were exhausted and
the receiver replica was marked Dead.

Handle the error where the semantic context lives - the transmitter
(`transfer_operations_batch`): skip operations the remote rejects with a
non-transient error and keep going, while still propagating transient
errors so the caller retries delivery. Because the batch update API aborts
at the first failing operation, a non-transient batch error falls back to
one-by-one sending to isolate and skip the offending operation(s).

This complements PR #5991, which handles missing points on the live
forwarded-update path (handle_failed_replicas) but not the WAL replay path.

Fixes the abort reproduced by
test_shard_snapshot_transfer_with_missing_point_updates.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@generall generall force-pushed the fix/snapshot-transfer-missing-point branch from b0605f8 to 687d0d2 May 21, 2026 21:37
@generall generall changed the title fix(transfer): tolerate missing-point updates on recovering replica fix(transfer): skip non-transient errors during queue proxy WAL replay May 21, 2026
@generall generall requested a review from timvisee May 21, 2026 21:58
@generall generall marked this pull request as ready for review May 21, 2026 21:59
@generall

FYI @timvisee: the bug was observed in prod, and I reproduced it in cloud and in a test. It would be nice if you could take a look, but I am merging it without review for a faster release.

I additionally validated (but did not include in the PR) that for streaming transfer those non-transient errors are properly handled, as we always first try to update the local shard and only then proxy to the remote.
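
A rough sketch of the local-first ordering described in this comment, under the assumption that a local rejection stops the operation from being proxied at all. `Shard`, `set_payload` and `forward_update` are hypothetical names for illustration, not Qdrant's API.

```rust
use std::collections::HashSet;

/// Hypothetical shard that only tracks which point ids exist.
struct Shard {
    points: HashSet<u64>,
    /// Number of operations this shard has been asked to apply.
    received: usize,
}

impl Shard {
    fn set_payload(&mut self, id: u64) -> Result<(), String> {
        self.received += 1;
        if self.points.contains(&id) {
            Ok(())
        } else {
            Err(format!("No point with id {} found", id))
        }
    }
}

/// Local-first forwarding: the remote only sees operations that the
/// local shard accepted, so a locally rejected op is never proxied.
fn forward_update(local: &mut Shard, remote: &mut Shard, id: u64) -> Result<(), String> {
    local.set_payload(id)?;
    remote.set_payload(id)
}
```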

@generall generall merged commit b8b7084 into test/snapshot-transfer-missing-point May 21, 2026
14 checks passed
@generall generall deleted the fix/snapshot-transfer-missing-point branch May 21, 2026 22:02
generall added a commit that referenced this pull request May 21, 2026
#9125)

* test(consensus): snapshot transfer with set-payload for missing points

Add a consensus test reproducing a shard transfer abort caused by
set-payload (and other partial-update) operations targeting non-existing
points.

Such operations are written to the WAL before the point-existence check
rejects them, so the queue proxy replays them to the receiver during a
snapshot transfer. The receiver applies them with force=true, bypassing
the missing-point tolerance in handle_failed_replicas, and the operation
hard-fails with `NotFound: No point with id ... found`. Under sustained
load the bounded queue/driver retries are exhausted, the receiver replica
is marked Dead and the transfer is aborted.
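
The log-then-check ordering described above can be sketched as below. This is an illustrative toy model, not Qdrant's WAL or shard code: `Op`, `Shard`, `update` and `apply` are hypothetical names, and payloads are plain strings.

```rust
use std::collections::HashMap;

/// Hypothetical operations for this sketch.
#[derive(Debug, Clone, PartialEq)]
enum Op {
    Upsert { id: u64, payload: String },
    SetPayload { id: u64, payload: String },
}

/// Toy shard: an append-only log plus the point storage.
#[derive(Default)]
struct Shard {
    wal: Vec<Op>,
    points: HashMap<u64, String>,
}

impl Shard {
    /// Logs the operation first, then applies it. A rejected operation
    /// therefore stays in the log even though it changed nothing.
    fn update(&mut self, op: Op) -> Result<(), String> {
        self.wal.push(op.clone());
        self.apply(op)
    }

    /// Applies an operation without logging it (used for replay).
    fn apply(&mut self, op: Op) -> Result<(), String> {
        match op {
            Op::Upsert { id, payload } => {
                self.points.insert(id, payload);
                Ok(())
            }
            Op::SetPayload { id, payload } => match self.points.get_mut(&id) {
                Some(slot) => {
                    *slot = payload;
                    Ok(())
                }
                None => Err(format!("No point with id {} found", id)),
            },
        }
    }
}
```

Replaying such a log onto an empty receiver reproduces the same rejection for the same operation, and both sides end up with identical points, which is why the rejection is a no-op rather than a divergence.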

The test keeps the missing-point load running while checking the result,
because consensus auto-recovers Dead replicas: stopping the load first
would let the next recovery transfer succeed and mask the bug. It must
FAIL on current code and PASS once the receiver tolerates missing-point
operations during recovery.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix(transfer): skip non-transient errors during queue proxy WAL replay (#9126)

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
generall added a commit that referenced this pull request May 22, 2026
#9125)
