Fix missed-wakeup race in SaveOnDisk::wait_for#8847
Merged
Merged
Conversation
Adds test_wait_for_no_missed_wakeup plus a #[cfg(test)]-only test_set_pre_park_sleep_ms hook that injects a sleep inside wait_for between releasing the read guard on data and parking on the condvar. This widens the race window from nanoseconds to ~100ms so the writer thread's notify_all reliably fires *before* the waiter parks, exposing the bug as a hard failure (notification lost, wait_for hits timeout). This commit only adds the test and the instrumentation; the bug is still present, so the new test fails. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
write and write_optional were calling change_notification.notify_all without holding notification_lock. A waiter that had released its read guard on data but had not yet parked on the condvar would miss the notification and wait the full timeout — even though the condition was already true. Acquire notification_lock around notify_all (in a new notify_change helper). The waiter holds notification_lock across the release-and-park gap, so notify_change blocks until the waiter has actually parked, guaranteeing delivery. This matches the standard parking_lot Condvar pattern where the mutex that protects the predicate is held while signalling. The new test_wait_for_no_missed_wakeup test (added in the previous commit) now passes; manifests in production as the flaky test_fix_reshard_down_without_shard_key consensus test, where wait_for_shard_key_activation timed out waiting on a replica state that had already become Active. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Removes the #[cfg(test)] test_pre_park_sleep_ms hook and the test_wait_for_no_missed_wakeup test that depended on it. The instrumentation existed only to deterministically reproduce the missed-wakeup race in the unfixed wait_for; with the fix in place (notify_change holds notification_lock around notify_all) the race is closed and the hook has no remaining purpose. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Member
|
Doesn't this have potential to
When waiting we hold the guard here: qdrant/lib/common/common/src/save_on_disk.rs Lines 96 to 102 in 9d4e96e |
9 tasks
Member
|
Here's my change proposal: #8849 It uses a subscribe-check-then-wait mechanism, and removes the notification lock. |
Member
I completely forgot about Condvar wait semantics. It releases the given mutex guard during the wait so that others can obtain the lock. That means this implementation is perfectly fine. |
timvisee
approved these changes
Apr 30, 2026
timvisee
pushed a commit
that referenced
this pull request
May 8, 2026
* test: reproduce missed-wakeup race in SaveOnDisk::wait_for Adds test_wait_for_no_missed_wakeup plus a #[cfg(test)]-only test_set_pre_park_sleep_ms hook that injects a sleep inside wait_for between releasing the read guard on data and parking on the condvar. This widens the race window from nanoseconds to ~100ms so the writer thread's notify_all reliably fires *before* the waiter parks, exposing the bug as a hard failure (notification lost, wait_for hits timeout). This commit only adds the test and the instrumentation; the bug is still present, so the new test fails. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * fix: prevent missed wakeup in SaveOnDisk::wait_for write and write_optional were calling change_notification.notify_all without holding notification_lock. A waiter that had released its read guard on data but had not yet parked on the condvar would miss the notification and wait the full timeout — even though the condition was already true. Acquire notification_lock around notify_all (in a new notify_change helper). The waiter holds notification_lock across the release-and-park gap, so notify_change blocks until the waiter has actually parked, guaranteeing delivery. This matches the standard parking_lot Condvar pattern where the mutex that protects the predicate is held while signalling. The new test_wait_for_no_missed_wakeup test (added in the previous commit) now passes; manifests in production as the flaky test_fix_reshard_down_without_shard_key consensus test, where wait_for_shard_key_activation timed out waiting on a replica state that had already become Active. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> * test: remove missed-wakeup race instrumentation Removes the #[cfg(test)] test_pre_park_sleep_ms hook and the test_wait_for_no_missed_wakeup test that depended on it. The instrumentation existed only to deterministically reproduce the missed-wakeup race in the unfixed wait_for; with the fix in place (notify_change holds notification_lock around notify_all) the race is closed and the hook has no remaining purpose. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> --------- Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
Merged
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a missed-wakeup race in
SaveOnDisk::wait_for(lib/common/common/src/save_on_disk.rs) that surfaces in CI as the flakytests/consensus_tests/test_resharding_extras.py::test_fix_reshard_down_without_shard_key(and any other consensus path that callswait_for_shard_key_activation).write/write_optionalwere callingchange_notification.notify_all()without holdingnotification_lock. A waiter that had released its read guard ondatabut had not yet parked on the condvar would miss the notification and wait the full timeout — even though the condition was already true.In the failing run, the leader applies
SetShardReplicaState (Initializing → Active)for all 4 replicas in ~10ms; every duplicate proposal that arrives afterwards fails the precondition check, so no furtherreplica_state.write()happens. A single missed wakeup in this window then forceswait_for_shard_key_activationto time out 9.81s later, producing the 408 response observed in CI.The fix introduces a small
notify_change()helper that takesnotification_lockaroundnotify_all. Since the waiter holdsnotification_lockacross the release-and-park gap, the writer's notify now blocks until the waiter has actually parked, guaranteeing delivery. This matches the standardparking_lot::Condvarpattern.Commits
The change is split into three commits to make the bug visible in CI:
test: reproduce missed-wakeup race in SaveOnDisk::wait_for— addstest_wait_for_no_missed_wakeupplus a#[cfg(test)]-onlytest_set_pre_park_sleep_mshook that injects a 100ms sleep insidewait_forbetween releasing the read guard ondataand parking on the condvar. Widens the race window from nanoseconds to a clearly observable size; the new test fails on this commit alone, demonstrating the bug.fix: prevent missed wakeup in SaveOnDisk::wait_for— adds thenotify_change()helper, drops the data write guard before notifying. Test from commit 1 now passes; instrumentation untouched.test: remove missed-wakeup race instrumentation— deletes the#[cfg(test)]hook and the test that depended on it.Test plan
cargo test -p common --lib save_on_disk— passes after commit 2 and commit 3.cargo check --workspace— clean.tests/consensus_tests/test_resharding_extras.py::test_fix_reshard_down_without_shard_keyin CI; expected to no longer flake on this race.🤖 Generated with Claude Code