fix(transfer): skip non-transient errors during queue proxy WAL replay #9126
Merged
generall merged 1 commit intoMay 21, 2026
During a shard transfer the queue proxy replays operations from the sender's WAL to the receiver. Some of these are partial updates (set_payload, update_vectors, ...) that the receiver rejects with a non-transient error - most commonly `NotFound: No point with id ...` for a point that does not exist on the receiver, but also any other client-caused bad request.

These operations were replayed from the WAL, meaning they were already applied (and rejected the same way) on the sender, so the sender's state reflects them as no-ops. Propagating the error aborted the whole transfer; under sustained load the bounded queue/driver retries were exhausted and the receiver replica was marked Dead.

Handle the error where the semantic context lives - the transmitter (`transfer_operations_batch`): skip operations the remote rejects with a non-transient error and keep going, while still propagating transient errors so the caller retries delivery. Because the batch update API aborts at the first failing operation, a non-transient batch error falls back to one-by-one sending to isolate and skip the offending operation(s).

This complements PR #5991, which handles missing points on the live forwarded-update path (handle_failed_replicas) but not the WAL replay path.

Fixes the abort reproduced by test_shard_snapshot_transfer_with_missing_point_updates.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
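The batch-then-one-by-one fallback described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `SendError`, `Remote`, `MockRemote` and `transfer_batch` are hypothetical names, operations are reduced to bare point ids, and the sketch assumes that re-sending an operation that already succeeded in the failed batch is harmless (idempotent).

```rust
use std::collections::HashSet;

/// Hypothetical error classification: is a retry worth it or not?
#[derive(Debug, Clone, PartialEq)]
enum SendError {
    /// E.g. a timeout or lost connection: retrying may succeed.
    Transient(String),
    /// E.g. a bad request or missing point: retrying fails the same way.
    NonTransient(String),
}

/// Hypothetical stand-in for the receiving shard.
trait Remote {
    /// Applies operations in order and aborts at the first failing one.
    fn send_batch(&mut self, ops: &[u64]) -> Result<(), SendError>;
}

/// Sends a batch; on a non-transient batch error, falls back to sending the
/// operations one by one and skips the ones the remote rejects.
/// Returns how many operations were skipped, or the first transient error.
fn transfer_batch<R: Remote>(remote: &mut R, ops: &[u64]) -> Result<usize, SendError> {
    match remote.send_batch(ops) {
        Ok(()) => return Ok(0),
        // Transient errors are propagated so the caller can retry delivery.
        Err(err @ SendError::Transient(_)) => return Err(err),
        // The batch error does not say which operation failed: isolate it below.
        Err(SendError::NonTransient(_)) => {}
    }

    let mut skipped = 0;
    for op in ops {
        match remote.send_batch(std::slice::from_ref(op)) {
            Ok(()) => {}
            Err(SendError::NonTransient(_)) => skipped += 1,
            Err(err @ SendError::Transient(_)) => return Err(err),
        }
    }
    Ok(skipped)
}

/// Mock receiver: a partial update fails if the point id is unknown.
struct MockRemote {
    existing: HashSet<u64>,
    applied: Vec<u64>,
}

impl Remote for MockRemote {
    fn send_batch(&mut self, ops: &[u64]) -> Result<(), SendError> {
        for &id in ops {
            if !self.existing.contains(&id) {
                return Err(SendError::NonTransient(format!("No point with id {}", id)));
            }
            self.applied.push(id);
        }
        Ok(())
    }
}
```

With a receiver that knows points 1 and 2, sending the batch `[1, 99, 2]` fails at 99, the fallback re-sends each operation individually, and only 99 is skipped. Note that point 1 is applied twice in this sketch, which is why the idempotency assumption matters.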
b0605f8 to
687d0d2
Compare
FYI @timvisee: the bug was observed in prod, and I reproduced it in cloud and in a test. It would be nice if you could take a look, but I'm merging it without review for a faster release. I additionally validated (but didn't include in the PR) that for streaming transfer these non-transient errors are handled properly, because we always try to update the local shard first and only then proxy to the remote.
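The local-first ordering mentioned for streaming transfer can be illustrated with a small sketch. It is an assumption-laden model, not the actual code path: `Replica`, `RejectReason` and `update_local_then_proxy` are hypothetical names, and a replica is reduced to a set of known point ids plus counters.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum RejectReason {
    PointNotFound(u64),
}

/// Hypothetical replica that only tracks which point ids exist.
struct Replica {
    points: HashSet<u64>,
    /// Number of update requests that reached this replica.
    requests: usize,
    /// Number of payload updates that were actually applied.
    payload_updates: usize,
}

impl Replica {
    fn new(ids: &[u64]) -> Self {
        Replica {
            points: ids.iter().copied().collect(),
            requests: 0,
            payload_updates: 0,
        }
    }

    /// A partial update: rejected if the point does not exist.
    fn set_payload(&mut self, id: u64) -> Result<(), RejectReason> {
        self.requests += 1;
        if !self.points.contains(&id) {
            return Err(RejectReason::PointNotFound(id));
        }
        self.payload_updates += 1;
        Ok(())
    }
}

/// Applies the update locally first; only a locally accepted update is
/// proxied to the remote, so a client-caused rejection never reaches it.
fn update_local_then_proxy(
    local: &mut Replica,
    remote: &mut Replica,
    id: u64,
) -> Result<(), RejectReason> {
    local.set_payload(id)?;
    remote.set_payload(id)
}
```

In this model an update for a missing point is rejected by the local replica and returned to the caller before the remote sees any request.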
b8b7084
into
test/snapshot-transfer-missing-point
14 checks passed
generall
added a commit
that referenced
this pull request
May 21, 2026
#9125)

* test(consensus): snapshot transfer with set-payload for missing points

Add a consensus test reproducing a shard transfer abort caused by set-payload (and other partial-update) operations targeting non-existing points.

Such operations are written to the WAL before the point-existence check rejects them, so the queue proxy replays them to the receiver during a snapshot transfer. The receiver applies them with force=true, bypassing the missing-point tolerance in handle_failed_replicas, and the operation hard-fails with `NotFound: No point with id ... found`. Under sustained load the bounded queue/driver retries are exhausted, the receiver replica is marked Dead and the transfer is aborted.

The test keeps the missing-point load running while checking the result, because consensus auto-recovers Dead replicas: stopping the load first would let the next recovery transfer succeed and mask the bug. It must FAIL on current code and PASS once the receiver tolerates missing-point operations during recovery.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix(transfer): skip non-transient errors during queue proxy WAL replay (#9126)

During a shard transfer the queue proxy replays operations from the sender's WAL to the receiver. Some of these are partial updates (set_payload, update_vectors, ...) that the receiver rejects with a non-transient error - most commonly `NotFound: No point with id ...` for a point that does not exist on the receiver, but also any other client-caused bad request.

These operations were replayed from the WAL, meaning they were already applied (and rejected the same way) on the sender, so the sender's state reflects them as no-ops. Propagating the error aborted the whole transfer; under sustained load the bounded queue/driver retries were exhausted and the receiver replica was marked Dead.

Handle the error where the semantic context lives - the transmitter (`transfer_operations_batch`): skip operations the remote rejects with a non-transient error and keep going, while still propagating transient errors so the caller retries delivery. Because the batch update API aborts at the first failing operation, a non-transient batch error falls back to one-by-one sending to isolate and skip the offending operation(s).

This complements PR #5991, which handles missing points on the live forwarded-update path (handle_failed_replicas) but not the WAL replay path.

Fixes the abort reproduced by test_shard_snapshot_transfer_with_missing_point_updates.

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
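To make the failure mode in the test commit concrete, here is a toy model of "written to the WAL before the point-existence check rejects them". It is only a sketch under simplifying assumptions: `Op`, `UpdateError`, `Shard` and `replay_strict` are hypothetical names, the WAL is a plain vector, and the receiver starts empty instead of being restored from a snapshot.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum Op {
    Upsert { id: u64, payload: String },
    SetPayload { id: u64, payload: String },
}

#[derive(Debug, PartialEq)]
enum UpdateError {
    NotFound(u64),
}

/// Hypothetical shard: an append-only operation log plus point storage.
#[derive(Default)]
struct Shard {
    wal: Vec<Op>,
    points: HashMap<u64, String>,
}

impl Shard {
    /// Applies an operation to storage; partial updates need an existing point.
    fn apply(&mut self, op: &Op) -> Result<(), UpdateError> {
        match op {
            Op::Upsert { id, payload } => {
                self.points.insert(*id, payload.clone());
                Ok(())
            }
            Op::SetPayload { id, payload } => match self.points.get_mut(id) {
                Some(existing) => {
                    *existing = payload.clone();
                    Ok(())
                }
                None => Err(UpdateError::NotFound(*id)),
            },
        }
    }

    /// The operation is logged first and validated afterwards, so a rejected
    /// operation still ends up in the WAL.
    fn update(&mut self, op: Op) -> Result<(), UpdateError> {
        self.wal.push(op.clone());
        self.apply(&op)
    }
}

/// Replays the sender's WAL and stops at the first error, which models the
/// behavior before the fix.
fn replay_strict(sender: &Shard, receiver: &mut Shard) -> Result<(), UpdateError> {
    for op in &sender.wal {
        receiver.apply(op)?;
    }
    Ok(())
}
```

In this model a `SetPayload` for a missing point is rejected on the sender but stays in its WAL, and a strict replay onto the receiver then fails with the same `NotFound` error.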
generall
added a commit
that referenced
this pull request
May 22, 2026
What
Fix for the shard-transfer abort reproduced by the test in #9125.

Base branch is `test/snapshot-transfer-missing-point` on purpose, so this PR's CI runs `test_shard_snapshot_transfer_with_missing_point_updates` with the fix and shows it green (it fails on the test branch / `dev` alone). Rebase onto `dev` before merging.

Problem
During a shard transfer the queue proxy replays operations from the sender's WAL to the receiver (`transfer_operations_batch`). Some are partial updates (`set_payload`, `update_vectors`, …) that the receiver rejects with a non-transient error, most commonly `NotFound: No point with id ...` for a point that does not exist on the receiver, but equally any other client-caused bad request.

These ops are forwarded with `WriteOrdering::Weak` + `force = true`, which routes through `update_local` and bypasses the missing-point tolerance in `handle_failed_replicas` (added in #5991 for the live forwarded-update path). The error propagates and aborts the transfer.

The only safety net is bounded retries (queue-level `BATCH_RETRIES`, driver-level `MAX_RETRY_COUNT`). Under sustained load they are exhausted, the receiver replica is marked `Dead`, and the transfer aborts (consensus auto-restarts it, and it fails the same way while the load continues).

Fix
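Handle the error where the semantic context lives, the transmitter (`transfer_operations_batch`): skip operations the remote rejects with a non-transient error and keep going, while still propagating transient errors so the caller retries delivery. Because the batch update API aborts at the first failing operation, a non-transient batch error falls back to one-by-one sending to isolate and skip the offending operation(s).

Design notes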
`update_local` shouldn't assume its caller's intent.

Verification
`Dead`, never `Active`).
`Skipping operation permanently rejected ... No point with id ...` and 0 driver-level transfer aborts.

🤖 Generated with Claude Code