Skip to content

Conversation

@phip1611
Copy link
Member

@phip1611 phip1611 commented Oct 6, 2025

List of Changes

  • vmm: remove noop
  • vmm: refactor poor naming
  • vmm: fix CpuManager::resume(): gracefully wait for run vCPU loop ACK

Hints for Reviewers

Please review the commits one-by-one, as they include a comprehensive description.

@phip1611 phip1611 requested a review from a team as a code owner October 6, 2025 12:29
@phip1611 phip1611 force-pushed the upstream-run-vcpu-loop branch 3 times, most recently from 1be7611 to dec7e69 Compare October 6, 2025 14:18
@rbradford
Copy link
Member

Thanks @phip1611 - i've seen this. I'm going to need a bit of time with a clear head to give this a proper review. Thanks for your patience.

@phip1611
Copy link
Member Author

phip1611 commented Oct 9, 2025

Thanks @phip1611 - i've seen this. I'm going to need a bit of time with a clear head to give this a proper review. Thanks for your patience.

In the commit message there is a reproducer! We discovered and fixed this in a four person -two-day workshop 😁 So no worries, it is hard to debug these issues.

I can tell you that since we have this patch, we've successfully ran tens of thousands of live-migrations in our test setup including vCPU auto-converging (vCPU throttling - not upstream).

Copy link
Member

@likebreath likebreath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reporting and fixing the bug with detailed information. Overall look good to me. Some minor comments below for clarify.

@phip1611 phip1611 force-pushed the upstream-run-vcpu-loop branch from dec7e69 to bf75a81 Compare October 21, 2025 06:20
@phip1611
Copy link
Member Author

I think we should be good to go now.

@phip1611 phip1611 force-pushed the upstream-run-vcpu-loop branch from bf75a81 to f9965a8 Compare October 21, 2025 08:02
@rbradford rbradford added this pull request to the merge queue Oct 21, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Oct 21, 2025
@likebreath likebreath added this pull request to the merge queue Oct 21, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 21, 2025
The vCPU lifecycle is already complicated. Let's remove dead
code (the impl is a no-op).

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
These bindings actually refer to atomic bool shared across all
vCPUs to instruct single vCPUs with their next action. As there
are already enough Arc<AtomicBool>, this helps while debugging
things to see that different bindings refer to the same atomic
bool.

In other words: This naming really confused us while debugging.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
Fix a race condition that happens in resume()-pause() cycles.

It is odd that for pause(), the CpuManager waited via `state.paused`
for the vCPU thread to ACK the state change but not for `resume()`.
In the `resume()` case, oddly CpuManager "owned" the state change
in `state.paused`. This commit changes this so that the vCPU ACKs
its state change itself in `state.paused` when it transitions from
pause->run. Further, `CpuManager::resume()` now gracefully waits for
the vCPU to be resumed.

More technical:

This change ensures proper synchronization and prevents situations
in that park() follows right after unpark(), causing deadlocks and
other weird behavior due to race conditions.

Calling resume() now takes slightly longer, very similar to pause().
This is, however, even for 254 vCPUs in the range of less than 10ms,
and ultimately we now have correct behaviour.

## Reproducer

Since [0] is merged, the underlying problem can be tested without this
commit by modifying the pause() API call to run `CpuManager::pause()`
and `CpuManager::resume()` in a loop a thousand times.

`ch-remote --api-socket ... pause`

```patch
diff --git a/vmm/src/vm.rs b/vmm/src/vm.rs
index d7bba25..35557d58f 100644
--- a/vmm/src/vm.rs
+++ b/vmm/src/vm.rs
@@ -2687,6 +2687,10 @@ impl Pausable for Vm {
             MigratableError::Pause(anyhow!("Error activating pending virtio devices: {:?}", e))
         })?;

+        for _ in 0..1000 {
+            self.cpu_manager.lock().unwrap().pause()?;
+            self.cpu_manager.lock().unwrap().resume()?;
+        }
         self.cpu_manager.lock().unwrap().pause()?;
         self.device_manager.lock().unwrap().pause()?;
```

## Outlook

Decades of experience in VMM development showed us that using
many AtomicBools is a footgun. They are not synchronized with each
other at all. On the long term, we might want to refactor things to
have a single shared AtomicU64 with different bits having different
meanings.

[0] cloud-hypervisor#7290

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
@phip1611 phip1611 force-pushed the upstream-run-vcpu-loop branch from f9965a8 to bb3c012 Compare October 22, 2025 04:38
@phip1611
Copy link
Member Author

Rebased - hopefully the flakyness in the net_hotplug test is gone now.

while vcpus_pause_signalled.load(Ordering::SeqCst) {
thread::park();
}
vcpu_paused.store(false, Ordering::SeqCst);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw: this is the line were the individual vCPU tells the cPU manager is has been resumed

@rbradford rbradford added this pull request to the merge queue Oct 22, 2025
@phip1611
Copy link
Member Author

phip1611 commented Oct 22, 2025

more CI flakiness.. now the test_fw_cfg test fails in multiple PRs 😞

Merged via the queue into cloud-hypervisor:main with commit d39b565 Oct 22, 2025
42 of 46 checks passed
@phip1611 phip1611 deleted the upstream-run-vcpu-loop branch October 22, 2025 11:06
@likebreath likebreath added the bug-fix Bug fix to include in release notes label Oct 27, 2025
@likebreath likebreath moved this to ✅ Done in Cloud Hypervisor Roadmap Nov 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix Bug fix to include in release notes

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants