Skip to content

Fix missed-wakeup race in SaveOnDisk::wait_for#8847

Merged
timvisee merged 3 commits into
devfrom
fix-flacky
Apr 30, 2026
Merged

Fix missed-wakeup race in SaveOnDisk::wait_for#8847
timvisee merged 3 commits into
devfrom
fix-flacky

Conversation

@generall
Copy link
Copy Markdown
Member

Summary

Fixes a missed-wakeup race in SaveOnDisk::wait_for (lib/common/common/src/save_on_disk.rs) that surfaces in CI as the flaky tests/consensus_tests/test_resharding_extras.py::test_fix_reshard_down_without_shard_key (and any other consensus path that calls wait_for_shard_key_activation).

write / write_optional were calling change_notification.notify_all() without holding notification_lock. A waiter that had released its read guard on data but had not yet parked on the condvar would miss the notification and wait the full timeout — even though the condition was already true.

In the failing run, the leader applies SetShardReplicaState (Initializing → Active) for all 4 replicas in ~10ms; every duplicate proposal that arrives afterwards fails the precondition check, so no further replica_state.write() happens. A single missed wakeup in this window then forces wait_for_shard_key_activation to time out 9.81s later, producing the 408 response observed in CI.

The fix introduces a small notify_change() helper that takes notification_lock around notify_all. Since the waiter holds notification_lock across the release-and-park gap, the writer's notify now blocks until the waiter has actually parked, guaranteeing delivery. This matches the standard parking_lot::Condvar pattern.

Commits

The change is split into three commits to make the bug visible in CI:

  1. test: reproduce missed-wakeup race in SaveOnDisk::wait_for — adds test_wait_for_no_missed_wakeup plus a #[cfg(test)]-only test_set_pre_park_sleep_ms hook that injects a 100ms sleep inside wait_for between releasing the read guard on data and parking on the condvar. Widens the race window from nanoseconds to a clearly observable size; the new test fails on this commit alone, demonstrating the bug.
  2. fix: prevent missed wakeup in SaveOnDisk::wait_for — adds the notify_change() helper, drops the data write guard before notifying. Test from commit 1 now passes; instrumentation untouched.
  3. test: remove missed-wakeup race instrumentation — deletes the #[cfg(test)] hook and the test that depended on it.

Test plan

  • cargo test -p common --lib save_on_disk — passes after commit 2 and commit 3.
  • cargo check --workspace — clean.
  • Re-run tests/consensus_tests/test_resharding_extras.py::test_fix_reshard_down_without_shard_key in CI; expected to no longer flake on this race.

🤖 Generated with Claude Code

generall and others added 3 commits April 30, 2026 11:18
Adds test_wait_for_no_missed_wakeup plus a #[cfg(test)]-only
test_set_pre_park_sleep_ms hook that injects a sleep inside wait_for
between releasing the read guard on data and parking on the condvar.
This widens the race window from nanoseconds to ~100ms so the writer
thread's notify_all reliably fires *before* the waiter parks, exposing
the bug as a hard failure (notification lost, wait_for hits timeout).

This commit only adds the test and the instrumentation; the bug is
still present, so the new test fails.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
write and write_optional were calling change_notification.notify_all
without holding notification_lock. A waiter that had released its read
guard on data but had not yet parked on the condvar would miss the
notification and wait the full timeout — even though the condition was
already true.

Acquire notification_lock around notify_all (in a new notify_change
helper). The waiter holds notification_lock across the release-and-park
gap, so notify_change blocks until the waiter has actually parked,
guaranteeing delivery.

This matches the standard parking_lot Condvar pattern where the mutex
that protects the predicate is held while signalling. The new
test_wait_for_no_missed_wakeup test (added in the previous commit) now
passes; manifests in production as the flaky
test_fix_reshard_down_without_shard_key consensus test, where
wait_for_shard_key_activation timed out waiting on a replica state
that had already become Active.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Removes the #[cfg(test)] test_pre_park_sleep_ms hook and the
test_wait_for_no_missed_wakeup test that depended on it. The
instrumentation existed only to deterministically reproduce the
missed-wakeup race in the unfixed wait_for; with the fix in place
(notify_change holds notification_lock around notify_all) the race is
closed and the hook has no remaining purpose.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@qdrant qdrant deleted a comment from coderabbitai Bot Apr 30, 2026
@generall generall requested a review from timvisee April 30, 2026 09:36
@timvisee
Copy link
Copy Markdown
Member

timvisee commented Apr 30, 2026

Doesn't this have potential to dead livelock?

  • T1: call wait_for
  • T1: hold notification_lock guard
  • T1: wait for change notification
  • T2: call write
  • T2: use self.notify_change()
  • T2: blocked on notification_lock, can't immediately notify and needs to wait on wait_for cycle

When waiting we hold the guard here:

let notification_guard = self.notification_lock.lock();
// Based on https://github.com/Amanieu/parking_lot/issues/165
RwLockReadGuard::unlocked(&mut data_read_guard, || {
// Move the guard in so it gets unlocked before we re-lock the RwLock read guard
let mut guard = notification_guard;
self.change_notification.wait_for(&mut guard, remaining);
});

@timvisee
Copy link
Copy Markdown
Member

Here's my change proposal: #8849

It uses a subscribe-check-then-wait mechanism, and removes the notification lock.

@timvisee
Copy link
Copy Markdown
Member

timvisee commented Apr 30, 2026

Doesn't this have potential to dead livelock?

* T1: call `wait_for`

* T1: hold notification_lock guard

* T1: wait for change notification

* T2: call `write`

* T2: use `self.notify_change()`

* T2: blocked on notification_lock, can't immediately notify and needs to wait on wait_for cycle

When waiting we hold the guard here:

let notification_guard = self.notification_lock.lock();
// Based on https://github.com/Amanieu/parking_lot/issues/165
RwLockReadGuard::unlocked(&mut data_read_guard, || {
// Move the guard in so it gets unlocked before we re-lock the RwLock read guard
let mut guard = notification_guard;
self.change_notification.wait_for(&mut guard, remaining);
});

I completely forgot about Condvar wait semantics. It releases the given mutex guard during the wait so that others can obtain the lock.

That means this implementation is perfectly fine.

@timvisee timvisee merged commit 7486a57 into dev Apr 30, 2026
15 checks passed
@timvisee timvisee deleted the fix-flacky branch April 30, 2026 13:29
timvisee pushed a commit that referenced this pull request May 8, 2026
* test: reproduce missed-wakeup race in SaveOnDisk::wait_for

Adds test_wait_for_no_missed_wakeup plus a #[cfg(test)]-only
test_set_pre_park_sleep_ms hook that injects a sleep inside wait_for
between releasing the read guard on data and parking on the condvar.
This widens the race window from nanoseconds to ~100ms so the writer
thread's notify_all reliably fires *before* the waiter parks, exposing
the bug as a hard failure (notification lost, wait_for hits timeout).

This commit only adds the test and the instrumentation; the bug is
still present, so the new test fails.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix: prevent missed wakeup in SaveOnDisk::wait_for

write and write_optional were calling change_notification.notify_all
without holding notification_lock. A waiter that had released its read
guard on data but had not yet parked on the condvar would miss the
notification and wait the full timeout — even though the condition was
already true.

Acquire notification_lock around notify_all (in a new notify_change
helper). The waiter holds notification_lock across the release-and-park
gap, so notify_change blocks until the waiter has actually parked,
guaranteeing delivery.

This matches the standard parking_lot Condvar pattern where the mutex
that protects the predicate is held while signalling. The new
test_wait_for_no_missed_wakeup test (added in the previous commit) now
passes; manifests in production as the flaky
test_fix_reshard_down_without_shard_key consensus test, where
wait_for_shard_key_activation timed out waiting on a replica state
that had already become Active.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* test: remove missed-wakeup race instrumentation

Removes the #[cfg(test)] test_pre_park_sleep_ms hook and the
test_wait_for_no_missed_wakeup test that depended on it. The
instrumentation existed only to deterministically reproduce the
missed-wakeup race in the unfixed wait_for; with the fix in place
(notify_change holds notification_lock around notify_all) the race is
closed and the hook has no remaining purpose.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
@timvisee timvisee mentioned this pull request May 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants