fix: notify pending consensus ops on snapshot apply by generall · Pull Request #8990 · qdrant/qdrant

generall · 2026-05-11T12:22:44Z

Summary

Fixes flaky test_peer_snapshot_bootstrap (tests/consensus_tests/test_peer_snapshot_bootstrap.py), which intermittently fails with:

Timeout waiting for condition peer_is_online to be satisfied in 30 seconds

caused by the joining peer panicking with:

Failed to add peer: Service internal error:
Waiting for consensus operation commit failed. Timeout set at: 10 seconds

Root cause

When a peer proposes an AddPeer / RemovePeer / UpdatePeerMetadata / UpdateClusterMetadata operation via propose_consensus_op_with_await, it registers an awaiter in on_consensus_op_apply and blocks on a broadcast channel. The channel is normally signalled when the corresponding log entry is processed by apply_conf_change_entry / apply_normal_entry.

If the operation is committed remotely and the leader truncates past it before we apply our local copy of the log entry, the entry is delivered back as part of a raft snapshot instead. Snapshot application went through update_from_snapshot only — it never consulted on_consensus_op_apply, so the awaiter blocked until the 10s timeout fired and the joining peer's add_peer_to_known gRPC call failed.

In the failing CI run (/tmp/flacky/test_peer_snapshot_bootstrap/):

The restarted bootstrap peer (Follower) proposes AddLearnerNode for the new peer.
The leader commits it at index 15, but before the Follower applies index 15 from its WAL it receives a MsgSnapshot (at index 14 first, then at index 15) — both clear the WAL.
The snapshot at index 15 already reflects the new peer in address_by_id and conf_state.learners, but the awaiter is never woken.

Fix

In ConsensusManager::apply_snapshot, after update_from_snapshot, walk on_consensus_op_apply and resolve any awaiter whose effect is now directly observable in persistent state:

AddPeer { peer_id, uri } — address_by_id[peer_id] == uri
RemovePeer(peer_id) — peer_id no longer in address_by_id
UpdatePeerMetadata { peer_id, metadata } — metadata_by_id[peer_id] == metadata
UpdateClusterMetadata { key, value } — cluster_metadata[key] == value

Other variants (CollectionMeta, RequestSnapshot, ReportSnapshot) are listed explicitly and left pending — the match is exhaustive, so adding a new ConsensusOperations variant will force a compile error here.

Test plan

New unit apply_snapshot_notifies_pending_add_peer: registers a pending AddPeer waiter, applies a snapshot containing the peer, asserts the waiter receives Ok(true) and is removed from the pending map. Confirmed this test fails on master (panic: awaiter must be notified by snapshot: Empty) and passes with the fix.
New unit apply_snapshot_does_not_notify_unrelated_add_peer: applies a snapshot that does not satisfy the awaited operation, asserts the awaiter stays pending — guards against spurious notifications.
All 17 tests in content_manager::consensus_manager::tests pass.
CI test_peer_snapshot_bootstrap runs cleanly.

🤖 Generated with Claude Code

When an `AddPeer` / `RemovePeer` / `UpdatePeerMetadata` / `UpdateClusterMetadata` operation is proposed locally and then committed remotely, the resulting log entry can be delivered back to us as part of a raft snapshot rather than as a regular log entry. In that case the operation never flows through `apply_conf_change_entry` / `apply_normal_entry`, so the awaiter registered by `propose_consensus_op_with_await` was never woken and timed out after 10s — manifesting as flaky failures of `test_peer_snapshot_bootstrap` ("Failed to add peer: ... Waiting for consensus operation commit failed"). Resolve awaiters in `apply_snapshot` whenever their effect is directly observable in the new persistent state. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

timvisee · 2026-05-11T13:50:42Z

Thanks for the clear write up. That made this PR very easily digestable. 🙏

When an `AddPeer` / `RemovePeer` / `UpdatePeerMetadata` / `UpdateClusterMetadata` operation is proposed locally and then committed remotely, the resulting log entry can be delivered back to us as part of a raft snapshot rather than as a regular log entry. In that case the operation never flows through `apply_conf_change_entry` / `apply_normal_entry`, so the awaiter registered by `propose_consensus_op_with_await` was never woken and timed out after 10s — manifesting as flaky failures of `test_peer_snapshot_bootstrap` ("Failed to add peer: ... Waiting for consensus operation commit failed"). Resolve awaiters in `apply_snapshot` whenever their effect is directly observable in the new persistent state. Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>

generall requested a review from ffuugoo May 11, 2026 13:07

qdrant deleted a comment from coderabbitai Bot May 11, 2026

timvisee approved these changes May 11, 2026

View reviewed changes

timvisee merged commit ec66b87 into dev May 11, 2026
15 checks passed

timvisee deleted the fix/consensus-snapshot-notify-pending branch May 11, 2026 13:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: notify pending consensus ops on snapshot apply#8990

fix: notify pending consensus ops on snapshot apply#8990
timvisee merged 1 commit into
devfrom
fix/consensus-snapshot-notify-pending

generall commented May 11, 2026

Uh oh!

timvisee commented May 11, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

generall commented May 11, 2026

Summary

Root cause

Fix

Test plan

Uh oh!

timvisee commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

timvisee commented May 11, 2026 •

edited

Loading