Skip to content

fix: notify pending consensus ops on snapshot apply#8990

Merged
timvisee merged 1 commit into
devfrom
fix/consensus-snapshot-notify-pending
May 11, 2026
Merged

fix: notify pending consensus ops on snapshot apply#8990
timvisee merged 1 commit into
devfrom
fix/consensus-snapshot-notify-pending

Conversation

@generall
Copy link
Copy Markdown
Member

Summary

Fixes flaky test_peer_snapshot_bootstrap (tests/consensus_tests/test_peer_snapshot_bootstrap.py), which intermittently fails with:

Timeout waiting for condition peer_is_online to be satisfied in 30 seconds

caused by the joining peer panicking with:

Failed to add peer: Service internal error:
Waiting for consensus operation commit failed. Timeout set at: 10 seconds

Root cause

When a peer proposes an AddPeer / RemovePeer / UpdatePeerMetadata / UpdateClusterMetadata operation via propose_consensus_op_with_await, it registers an awaiter in on_consensus_op_apply and blocks on a broadcast channel. The channel is normally signalled when the corresponding log entry is processed by apply_conf_change_entry / apply_normal_entry.

If the operation is committed remotely and the leader truncates past it before we apply our local copy of the log entry, the entry is delivered back as part of a raft snapshot instead. Snapshot application went through update_from_snapshot only — it never consulted on_consensus_op_apply, so the awaiter blocked until the 10s timeout fired and the joining peer's add_peer_to_known gRPC call failed.

In the failing CI run (/tmp/flacky/test_peer_snapshot_bootstrap/):

  1. The restarted bootstrap peer (Follower) proposes AddLearnerNode for the new peer.
  2. The leader commits it at index 15, but before the Follower applies index 15 from its WAL it receives a MsgSnapshot (at index 14 first, then at index 15) — both clear the WAL.
  3. The snapshot at index 15 already reflects the new peer in address_by_id and conf_state.learners, but the awaiter is never woken.

Fix

In ConsensusManager::apply_snapshot, after update_from_snapshot, walk on_consensus_op_apply and resolve any awaiter whose effect is now directly observable in persistent state:

  • AddPeer { peer_id, uri }address_by_id[peer_id] == uri
  • RemovePeer(peer_id)peer_id no longer in address_by_id
  • UpdatePeerMetadata { peer_id, metadata }metadata_by_id[peer_id] == metadata
  • UpdateClusterMetadata { key, value }cluster_metadata[key] == value

Other variants (CollectionMeta, RequestSnapshot, ReportSnapshot) are listed explicitly and left pending — the match is exhaustive, so adding a new ConsensusOperations variant will force a compile error here.

Test plan

  • New unit apply_snapshot_notifies_pending_add_peer: registers a pending AddPeer waiter, applies a snapshot containing the peer, asserts the waiter receives Ok(true) and is removed from the pending map. Confirmed this test fails on master (panic: awaiter must be notified by snapshot: Empty) and passes with the fix.
  • New unit apply_snapshot_does_not_notify_unrelated_add_peer: applies a snapshot that does not satisfy the awaited operation, asserts the awaiter stays pending — guards against spurious notifications.
  • All 17 tests in content_manager::consensus_manager::tests pass.
  • CI test_peer_snapshot_bootstrap runs cleanly.

🤖 Generated with Claude Code

When an `AddPeer` / `RemovePeer` / `UpdatePeerMetadata` /
`UpdateClusterMetadata` operation is proposed locally and then committed
remotely, the resulting log entry can be delivered back to us as part of
a raft snapshot rather than as a regular log entry. In that case the
operation never flows through `apply_conf_change_entry` /
`apply_normal_entry`, so the awaiter registered by
`propose_consensus_op_with_await` was never woken and timed out after
10s — manifesting as flaky failures of `test_peer_snapshot_bootstrap`
("Failed to add peer: ... Waiting for consensus operation commit failed").

Resolve awaiters in `apply_snapshot` whenever their effect is directly
observable in the new persistent state.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@generall generall requested a review from ffuugoo May 11, 2026 13:07
@qdrant qdrant deleted a comment from coderabbitai Bot May 11, 2026
@timvisee
Copy link
Copy Markdown
Member

timvisee commented May 11, 2026

Thanks for the clear write up. That made this PR very easily digestable. 🙏

@timvisee timvisee merged commit ec66b87 into dev May 11, 2026
15 checks passed
@timvisee timvisee deleted the fix/consensus-snapshot-notify-pending branch May 11, 2026 13:50
generall added a commit that referenced this pull request May 22, 2026
When an `AddPeer` / `RemovePeer` / `UpdatePeerMetadata` /
`UpdateClusterMetadata` operation is proposed locally and then committed
remotely, the resulting log entry can be delivered back to us as part of
a raft snapshot rather than as a regular log entry. In that case the
operation never flows through `apply_conf_change_entry` /
`apply_normal_entry`, so the awaiter registered by
`propose_consensus_op_with_await` was never woken and timed out after
10s — manifesting as flaky failures of `test_peer_snapshot_bootstrap`
("Failed to add peer: ... Waiting for consensus operation commit failed").

Resolve awaiters in `apply_snapshot` whenever their effect is directly
observable in the new persistent state.

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants