Skip to content

test(consensus): snapshot transfer with set-payload for missing points#9125

Merged
generall merged 2 commits into
devfrom
test/snapshot-transfer-missing-point
May 21, 2026
Merged

test(consensus): snapshot transfer with set-payload for missing points#9125
generall merged 2 commits into
devfrom
test/snapshot-transfer-missing-point

Conversation

@generall
Copy link
Copy Markdown
Member

What

Adds a consensus test that reproduces a shard transfer abort triggered by set-payload (and other partial-update) operations targeting non-existing points while a snapshot transfer is in progress.

This is a red-first test: it is expected to fail on current dev and to pass once the receiver tolerates missing-point operations during shard-transfer recovery. The fix is proposed separately.

Why this happens

Production symptom:

ERROR collection::shards::transfer::driver: Failed to transfer shard ...: status: NotFound, message: "Not found: No point with id ... found"

Mechanism:

  1. A partial-update op (set_payload/update_vectors/delete_payload/…) for a point that does not exist is written to the WAL before the point-existence check rejects it (LocalShard::submit_updatewal.lock_and_write).
  2. During a snapshot transfer the queue proxy replays buffered WAL operations to the receiver.
  3. The receiver applies them with WriteOrdering::Weak + clock_tag.force = true, which routes through update_local and bypasses the missing-point tolerance in handle_failed_replicas (the tolerance only guards the source's replica-set coordination, not the forwarded queue-proxy path).
  4. The op hard-fails with PointNotFound. The only safety net is bounded retries (queue-level BATCH_RETRIES, then driver-level MAX_RETRY_COUNT). Under sustained load these are exhausted → the receiver replica is marked Dead and the transfer is aborted.

Test design notes

  • Heavy, unthrottled, wait=false load with id ranges spanning all shards keeps a missing-point op in every transfer attempt, so the failure is reliable rather than a lost race.
  • The transferred replica's state is checked while the load is still running. This is essential: consensus auto-recovers Dead replicas, and if the load is stopped first, the next recovery transfer has no failing ops to replay and succeeds — masking the bug.
  • Pass criterion: the transferred replica reaches Active; fail criterion: it goes Dead / never reaches Active within the timeout.

Verification

  • Current dev (buggy): fails reliably (2/2 runs), saw Dead state: True, driver logs the No point with id ... found abort. ~105s.
  • With a minimal receiver-side tolerance fix: passes (~19s), confirming the test is discriminating and not always-red.

Run locally:

tests/venv/bin/python -m pytest tests/consensus_tests/test_shard_snapshot_transfer_missing_point.py -p no:xdist -o addopts=""

🤖 Generated with Claude Code

Add a consensus test reproducing a shard transfer abort caused by
set-payload (and other partial-update) operations targeting non-existing
points.

Such operations are written to the WAL before the point-existence check
rejects them, so the queue proxy replays them to the receiver during a
snapshot transfer. The receiver applies them with force=true, bypassing
the missing-point tolerance in handle_failed_replicas, and the operation
hard-fails with `NotFound: No point with id ... found`. Under sustained
load the bounded queue/driver retries are exhausted, the receiver replica
is marked Dead and the transfer is aborted.

The test keeps the missing-point load running while checking the result,
because consensus auto-recovers Dead replicas: stopping the load first
would let the next recovery transfer succeed and mask the bug. It must
FAIL on current code and PASS once the receiver tolerates missing-point
operations during recovery.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
#9126)

During a shard transfer the queue proxy replays operations from the
sender's WAL to the receiver. Some of these are partial updates
(set_payload, update_vectors, ...) that the receiver rejects with a
non-transient error - most commonly `NotFound: No point with id ...`
for a point that does not exist on the receiver, but also any other
client-caused bad request.

These operations were replayed from the WAL, meaning they were already
applied (and rejected the same way) on the sender, so the sender's state
reflects them as no-ops. Propagating the error aborted the whole transfer;
under sustained load the bounded queue/driver retries were exhausted and
the receiver replica was marked Dead.

Handle the error where the semantic context lives - the transmitter
(`transfer_operations_batch`): skip operations the remote rejects with a
non-transient error and keep going, while still propagating transient
errors so the caller retries delivery. Because the batch update API aborts
at the first failing operation, a non-transient batch error falls back to
one-by-one sending to isolate and skip the offending operation(s).

This complements PR #5991, which handles missing points on the live
forwarded-update path (handle_failed_replicas) but not the WAL replay path.

Fixes the abort reproduced by
test_shard_snapshot_transfer_with_missing_point_updates.

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
@generall generall marked this pull request as ready for review May 21, 2026 22:02
@generall generall merged commit 64558ac into dev May 21, 2026
12 checks passed
@generall generall deleted the test/snapshot-transfer-missing-point branch May 21, 2026 22:03
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 21, 2026

Review Change Stack

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 57df76a7-45c7-457a-9f66-740d81bc9442

📥 Commits

Reviewing files that changed from the base of the PR and between cd9fbc6 and b8b7084.

📒 Files selected for processing (2)
  • lib/collection/src/shards/queue_proxy_shard.rs
  • tests/consensus_tests/test_shard_snapshot_transfer_missing_point.py

📝 Walkthrough

Walkthrough

This PR refines error handling during WAL operation forwarding in shard snapshot transfers and introduces a corresponding integration test. The Rust change in queue_proxy_shard.rs distinguishes transient from non-transient errors when forwarding batch and one-by-one updates: transient errors bubble up for caller retry, while non-transient errors either trigger fallback to one-by-one transfer (batch path) or cause the operation to be skipped with transfer continuing (one-by-one path). The new Python integration test concurrently generates missing-point payloads and upserts during snapshot-based shard replication, verifying the recovery mechanism works under stress and that cluster convergence is achieved.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Suggested reviewers

  • agourlay
  • timvisee
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch test/snapshot-transfer-missing-point

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

generall added a commit that referenced this pull request May 22, 2026
#9125)

* test(consensus): snapshot transfer with set-payload for missing points

Add a consensus test reproducing a shard transfer abort caused by
set-payload (and other partial-update) operations targeting non-existing
points.

Such operations are written to the WAL before the point-existence check
rejects them, so the queue proxy replays them to the receiver during a
snapshot transfer. The receiver applies them with force=true, bypassing
the missing-point tolerance in handle_failed_replicas, and the operation
hard-fails with `NotFound: No point with id ... found`. Under sustained
load the bounded queue/driver retries are exhausted, the receiver replica
is marked Dead and the transfer is aborted.

The test keeps the missing-point load running while checking the result,
because consensus auto-recovers Dead replicas: stopping the load first
would let the next recovery transfer succeed and mask the bug. It must
FAIL on current code and PASS once the receiver tolerates missing-point
operations during recovery.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix(transfer): skip non-transient errors during queue proxy WAL replay (#9126)

During a shard transfer the queue proxy replays operations from the
sender's WAL to the receiver. Some of these are partial updates
(set_payload, update_vectors, ...) that the receiver rejects with a
non-transient error - most commonly `NotFound: No point with id ...`
for a point that does not exist on the receiver, but also any other
client-caused bad request.

These operations were replayed from the WAL, meaning they were already
applied (and rejected the same way) on the sender, so the sender's state
reflects them as no-ops. Propagating the error aborted the whole transfer;
under sustained load the bounded queue/driver retries were exhausted and
the receiver replica was marked Dead.

Handle the error where the semantic context lives - the transmitter
(`transfer_operations_batch`): skip operations the remote rejects with a
non-transient error and keep going, while still propagating transient
errors so the caller retries delivery. Because the batch update API aborts
at the first failing operation, a non-transient batch error falls back to
one-by-one sending to isolate and skip the offending operation(s).

This complements PR #5991, which handles missing points on the live
forwarded-update path (handle_failed_replicas) but not the WAL replay path.

Fixes the abort reproduced by
test_shard_snapshot_transfer_with_missing_point_updates.

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant