test(consensus): snapshot transfer with set-payload for missing points#9125
Conversation
Add a consensus test reproducing a shard transfer abort caused by set-payload (and other partial-update) operations targeting non-existing points. Such operations are written to the WAL before the point-existence check rejects them, so the queue proxy replays them to the receiver during a snapshot transfer. The receiver applies them with force=true, bypassing the missing-point tolerance in handle_failed_replicas, and the operation hard-fails with `NotFound: No point with id ... found`. Under sustained load the bounded queue/driver retries are exhausted, the receiver replica is marked Dead and the transfer is aborted. The test keeps the missing-point load running while checking the result, because consensus auto-recovers Dead replicas: stopping the load first would let the next recovery transfer succeed and mask the bug. It must FAIL on current code and PASS once the receiver tolerates missing-point operations during recovery. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
#9126) During a shard transfer the queue proxy replays operations from the sender's WAL to the receiver. Some of these are partial updates (set_payload, update_vectors, ...) that the receiver rejects with a non-transient error - most commonly `NotFound: No point with id ...` for a point that does not exist on the receiver, but also any other client-caused bad request. These operations were replayed from the WAL, meaning they were already applied (and rejected the same way) on the sender, so the sender's state reflects them as no-ops. Propagating the error aborted the whole transfer; under sustained load the bounded queue/driver retries were exhausted and the receiver replica was marked Dead. Handle the error where the semantic context lives - the transmitter (`transfer_operations_batch`): skip operations the remote rejects with a non-transient error and keep going, while still propagating transient errors so the caller retries delivery. Because the batch update API aborts at the first failing operation, a non-transient batch error falls back to one-by-one sending to isolate and skip the offending operation(s). This complements PR #5991, which handles missing points on the live forwarded-update path (handle_failed_replicas) but not the WAL replay path. Fixes the abort reproduced by test_shard_snapshot_transfer_with_missing_point_updates. Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThis PR refines error handling during WAL operation forwarding in shard snapshot transfers and introduces a corresponding integration test. The Rust change in Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Suggested reviewers
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
#9125) * test(consensus): snapshot transfer with set-payload for missing points Add a consensus test reproducing a shard transfer abort caused by set-payload (and other partial-update) operations targeting non-existing points. Such operations are written to the WAL before the point-existence check rejects them, so the queue proxy replays them to the receiver during a snapshot transfer. The receiver applies them with force=true, bypassing the missing-point tolerance in handle_failed_replicas, and the operation hard-fails with `NotFound: No point with id ... found`. Under sustained load the bounded queue/driver retries are exhausted, the receiver replica is marked Dead and the transfer is aborted. The test keeps the missing-point load running while checking the result, because consensus auto-recovers Dead replicas: stopping the load first would let the next recovery transfer succeed and mask the bug. It must FAIL on current code and PASS once the receiver tolerates missing-point operations during recovery. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * fix(transfer): skip non-transient errors during queue proxy WAL replay (#9126) During a shard transfer the queue proxy replays operations from the sender's WAL to the receiver. Some of these are partial updates (set_payload, update_vectors, ...) that the receiver rejects with a non-transient error - most commonly `NotFound: No point with id ...` for a point that does not exist on the receiver, but also any other client-caused bad request. These operations were replayed from the WAL, meaning they were already applied (and rejected the same way) on the sender, so the sender's state reflects them as no-ops. Propagating the error aborted the whole transfer; under sustained load the bounded queue/driver retries were exhausted and the receiver replica was marked Dead. Handle the error where the semantic context lives - the transmitter (`transfer_operations_batch`): skip operations the remote rejects with a non-transient error and keep going, while still propagating transient errors so the caller retries delivery. Because the batch update API aborts at the first failing operation, a non-transient batch error falls back to one-by-one sending to isolate and skip the offending operation(s). This complements PR #5991, which handles missing points on the live forwarded-update path (handle_failed_replicas) but not the WAL replay path. Fixes the abort reproduced by test_shard_snapshot_transfer_with_missing_point_updates. Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]> --------- Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
What
Adds a consensus test that reproduces a shard transfer abort triggered by
set-payload(and other partial-update) operations targeting non-existing points while a snapshot transfer is in progress.This is a red-first test: it is expected to fail on current
devand to pass once the receiver tolerates missing-point operations during shard-transfer recovery. The fix is proposed separately.Why this happens
Production symptom:
Mechanism:
set_payload/update_vectors/delete_payload/…) for a point that does not exist is written to the WAL before the point-existence check rejects it (LocalShard::submit_update→wal.lock_and_write).WriteOrdering::Weak+clock_tag.force = true, which routes throughupdate_localand bypasses the missing-point tolerance inhandle_failed_replicas(the tolerance only guards the source's replica-set coordination, not the forwarded queue-proxy path).PointNotFound. The only safety net is bounded retries (queue-levelBATCH_RETRIES, then driver-levelMAX_RETRY_COUNT). Under sustained load these are exhausted → the receiver replica is markedDeadand the transfer is aborted.Test design notes
wait=falseload with id ranges spanning all shards keeps a missing-point op in every transfer attempt, so the failure is reliable rather than a lost race.Deadreplicas, and if the load is stopped first, the next recovery transfer has no failing ops to replay and succeeds — masking the bug.Active; fail criterion: it goesDead/ never reachesActivewithin the timeout.Verification
dev(buggy): fails reliably (2/2 runs),saw Dead state: True, driver logs theNo point with id ... foundabort. ~105s.Run locally:
tests/venv/bin/python -m pytest tests/consensus_tests/test_shard_snapshot_transfer_missing_point.py -p no:xdist -o addopts=""🤖 Generated with Claude Code