vcpu lifecycle: fix race-condition in resume()-pause() cycle #7397

phip1611 · 2025-10-06T12:29:11Z

List of Changes

vmm: remove noop
vmm: refactor poor naming
vmm: fix CpuManager::resume(): gracefully wait for run vCPU loop ACK

Hints for Reviewers

Please review the commits one-by-one, as they include a comprehensive description.

rbradford · 2025-10-09T07:29:53Z

Thanks @phip1611 - i've seen this. I'm going to need a bit of time with a clear head to give this a proper review. Thanks for your patience.

phip1611 · 2025-10-09T08:20:44Z

Thanks @phip1611 - i've seen this. I'm going to need a bit of time with a clear head to give this a proper review. Thanks for your patience.

In the commit message there is a reproducer! We discovered and fixed this in a four person -two-day workshop 😁 So no worries, it is hard to debug these issues.

I can tell you that since we have this patch, we've successfully ran tens of thousands of live-migrations in our test setup including vCPU auto-converging (vCPU throttling - not upstream).

vmm/src/cpu.rs

likebreath

Thanks for reporting and fixing the bug with detailed information. Overall look good to me. Some minor comments below for clarify.

vmm/src/cpu.rs

phip1611 · 2025-10-21T06:20:15Z

I think we should be good to go now.

The vCPU lifecycle is already complicated. Let's remove dead code (the impl is a no-op). Signed-off-by: Philipp Schuster <[email protected]> On-behalf-of: SAP [email protected]

These bindings actually refer to atomic bool shared across all vCPUs to instruct single vCPUs with their next action. As there are already enough Arc<AtomicBool>, this helps while debugging things to see that different bindings refer to the same atomic bool. In other words: This naming really confused us while debugging. Signed-off-by: Philipp Schuster <[email protected]> On-behalf-of: SAP [email protected]

Fix a race condition that happens in resume()-pause() cycles. It is odd that for pause(), the CpuManager waited via `state.paused` for the vCPU thread to ACK the state change but not for `resume()`. In the `resume()` case, oddly CpuManager "owned" the state change in `state.paused`. This commit changes this so that the vCPU ACKs its state change itself in `state.paused` when it transitions from pause->run. Further, `CpuManager::resume()` now gracefully waits for the vCPU to be resumed. More technical: This change ensures proper synchronization and prevents situations in that park() follows right after unpark(), causing deadlocks and other weird behavior due to race conditions. Calling resume() now takes slightly longer, very similar to pause(). This is, however, even for 254 vCPUs in the range of less than 10ms, and ultimately we now have correct behaviour. ## Reproducer Since [0] is merged, the underlying problem can be tested without this commit by modifying the pause() API call to run `CpuManager::pause()` and `CpuManager::resume()` in a loop a thousand times. `ch-remote --api-socket ... pause` ```patch diff --git a/vmm/src/vm.rs b/vmm/src/vm.rs index d7bba25..35557d58f 100644 --- a/vmm/src/vm.rs +++ b/vmm/src/vm.rs @@ -2687,6 +2687,10 @@ impl Pausable for Vm { MigratableError::Pause(anyhow!("Error activating pending virtio devices: {:?}", e)) })?; + for _ in 0..1000 { + self.cpu_manager.lock().unwrap().pause()?; + self.cpu_manager.lock().unwrap().resume()?; + } self.cpu_manager.lock().unwrap().pause()?; self.device_manager.lock().unwrap().pause()?; ``` ## Outlook Decades of experience in VMM development showed us that using many AtomicBools is a footgun. They are not synchronized with each other at all. On the long term, we might want to refactor things to have a single shared AtomicU64 with different bits having different meanings. [0] cloud-hypervisor#7290 Signed-off-by: Philipp Schuster <[email protected]> On-behalf-of: SAP [email protected]

phip1611 · 2025-10-22T04:38:58Z

Rebased - hopefully the flakyness in the net_hotplug test is gone now.

phip1611 · 2025-10-22T04:43:26Z

vmm/src/cpu.rs

+                                while vcpus_pause_signalled.load(Ordering::SeqCst) {
                                    thread::park();
                                }
+                                vcpu_paused.store(false, Ordering::SeqCst);


btw: this is the line were the individual vCPU tells the cPU manager is has been resumed

phip1611 · 2025-10-22T09:15:35Z

more CI flakiness.. now the test_fw_cfg test fails in multiple PRs 😞

phip1611 requested a review from a team as a code owner October 6, 2025 12:29

phip1611 force-pushed the upstream-run-vcpu-loop branch 3 times, most recently from 1be7611 to dec7e69 Compare October 6, 2025 14:18

rbradford approved these changes Oct 9, 2025

View reviewed changes

vmm/src/cpu.rs Show resolved Hide resolved

likebreath reviewed Oct 9, 2025

View reviewed changes

vmm/src/cpu.rs Show resolved Hide resolved

vmm/src/cpu.rs Show resolved Hide resolved

vmm/src/cpu.rs Outdated Show resolved Hide resolved

phip1611 force-pushed the upstream-run-vcpu-loop branch from dec7e69 to bf75a81 Compare October 21, 2025 06:20

phip1611 force-pushed the upstream-run-vcpu-loop branch from bf75a81 to f9965a8 Compare October 21, 2025 08:02

rbradford approved these changes Oct 21, 2025

View reviewed changes

rbradford added this pull request to the merge queue Oct 21, 2025

github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Oct 21, 2025

likebreath approved these changes Oct 21, 2025

View reviewed changes

likebreath added this pull request to the merge queue Oct 21, 2025

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 21, 2025

phip1611 added 3 commits October 22, 2025 06:38

vmm: remove noop

2e92191

The vCPU lifecycle is already complicated. Let's remove dead code (the impl is a no-op). Signed-off-by: Philipp Schuster <[email protected]> On-behalf-of: SAP [email protected]

phip1611 force-pushed the upstream-run-vcpu-loop branch from f9965a8 to bb3c012 Compare October 22, 2025 04:38

phip1611 commented Oct 22, 2025

View reviewed changes

rbradford added this pull request to the merge queue Oct 22, 2025

Merged via the queue into cloud-hypervisor:main with commit d39b565 Oct 22, 2025
42 of 46 checks passed

phip1611 deleted the upstream-run-vcpu-loop branch October 22, 2025 11:06

likebreath added the bug-fix Bug fix to include in release notes label Oct 27, 2025

likebreath added this to Cloud Hypervisor Roadmap Nov 6, 2025

likebreath moved this to ✅ Done in Cloud Hypervisor Roadmap Nov 6, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

vcpu lifecycle: fix race-condition in resume()-pause() cycle #7397

vcpu lifecycle: fix race-condition in resume()-pause() cycle #7397

Uh oh!

phip1611 commented Oct 6, 2025 •

edited

Loading

Uh oh!

rbradford commented Oct 9, 2025

Uh oh!

phip1611 commented Oct 9, 2025 •

edited

Loading

Uh oh!

Uh oh!

likebreath left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

phip1611 commented Oct 21, 2025

Uh oh!

Uh oh!

Uh oh!

phip1611 commented Oct 22, 2025

Uh oh!

phip1611 Oct 22, 2025

Uh oh!

phip1611 commented Oct 22, 2025 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

vcpu lifecycle: fix race-condition in resume()-pause() cycle #7397

vcpu lifecycle: fix race-condition in resume()-pause() cycle #7397

Uh oh!

Conversation

phip1611 commented Oct 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

List of Changes

Hints for Reviewers

Uh oh!

rbradford commented Oct 9, 2025

Uh oh!

phip1611 commented Oct 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

likebreath left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

phip1611 commented Oct 21, 2025

Uh oh!

Uh oh!

Uh oh!

phip1611 commented Oct 22, 2025

Uh oh!

phip1611 Oct 22, 2025

Choose a reason for hiding this comment

Uh oh!

phip1611 commented Oct 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

phip1611 commented Oct 6, 2025 •

edited

Loading

phip1611 commented Oct 9, 2025 •

edited

Loading

phip1611 commented Oct 22, 2025 •

edited

Loading