Conversation


@phip1611 phip1611 commented Aug 20, 2025

vmm: cpu_manager: massively accelerate .pause()

With 254 vCPUs, pausing now takes ~4ms instead of >254ms. This
improvement is visible when running `ch-remote pause` and is
particularly important for live migration, where every millisecond
of downtime matters.

For the wait logic, it is fine to stick to the approach of
sleeping 1ms on the first missed ACK, as:

  1. we have to wait anyway
  2. we give time to the OS, enabling it to schedule a vCPU thread next

A minimal sketch of this pattern follows.
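
The sketch below uses hypothetical `VcpuState` fields
(`pause_requested`, `paused`) in place of the actual `cpu_manager`
state and signalling; the point is the two-phase structure, not the
exact names:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

/// Hypothetical per-vCPU state: the manager sets `pause_requested`,
/// and the vCPU thread sets `paused` once it has actually parked.
struct VcpuState {
    pause_requested: AtomicBool,
    paused: AtomicBool,
}

fn pause_all(vcpus: &[Arc<VcpuState>]) {
    // Phase 1: signal every vCPU up front, instead of signalling and
    // then waiting one vCPU at a time (which cost >1ms per vCPU).
    for vcpu in vcpus {
        vcpu.pause_requested.store(true, Ordering::SeqCst);
    }
    // Phase 2: collect the ACKs. The 1ms sleeps of all vCPUs now
    // overlap, so the total wait is roughly the slowest vCPU rather
    // than the sum over all vCPUs.
    for vcpu in vcpus {
        while !vcpu.paused.load(Ordering::SeqCst) {
            thread::sleep(Duration::from_millis(1));
        }
    }
}
```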

@phip1611 phip1611 requested a review from a team as a code owner August 20, 2025 11:25
@phip1611 phip1611 force-pushed the accelerate-vcpus-pause branch 5 times, most recently from 454da2c to 7c6dffb on August 20, 2025 11:37

@alyssais alyssais left a comment


Nice find!

@phip1611 phip1611 force-pushed the accelerate-vcpus-pause branch from 7c6dffb to 59eabe4 on August 20, 2025 12:01
@phip1611 phip1611 requested a review from alyssais August 20, 2025 12:01
@alyssais alyssais enabled auto-merge August 20, 2025 12:48
@alyssais alyssais added this pull request to the merge queue Aug 20, 2025
Merged via the queue into cloud-hypervisor:main with commit c1f4df6 Aug 20, 2025
40 checks passed
@phip1611 phip1611 deleted the accelerate-vcpus-pause branch August 20, 2025 13:44
phip1611 added a commit to phip1611/cloud-hypervisor that referenced this pull request Sep 1, 2025
It is odd that for pause(), the CpuManager waited via `state.paused`
for the vCPU thread to ACK the state change, but not for `resume()`.
In the `resume()` case, the CpuManager oddly "owned" the state change
in `state.paused` itself. This commit changes this so that the vCPU's
ACK of its state change from pause->run is also gracefully recognized
by `CpuManager::resume()`.

This change ensures proper synchronization and prevents situations
in which park() follows right after unpark(), causing deadlocks and
other erratic behavior due to race conditions.

Calling resume() now takes slightly longer, very similar to pause().
Even for 254 vCPUs, however, this remains below 10ms.
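
As a rough sketch of the new synchronization, reusing the hypothetical
`VcpuState` from the pause() sketch earlier in this thread, extended
with a parked `Thread` handle (the real `CpuManager` code differs in
detail):

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread::{self, Thread};
use std::time::Duration;

// Hypothetical per-vCPU state; `paused` is owned by the vCPU thread,
// which sets it when it parks and clears it once it runs again.
struct VcpuState {
    pause_requested: AtomicBool,
    paused: AtomicBool,
    thread: Thread, // handle of the vCPU thread, for unpark()
}

fn resume_all(vcpus: &[Arc<VcpuState>]) {
    for vcpu in vcpus {
        vcpu.pause_requested.store(false, Ordering::SeqCst);
        vcpu.thread.unpark();
    }
    // New: wait for each vCPU to ACK the pause->run transition by
    // clearing `paused` itself. Previously the manager cleared the
    // flag on the vCPU's behalf, so a pause() issued right after
    // resume() could park() again before the vCPU had even woken up.
    for vcpu in vcpus {
        while vcpu.paused.load(Ordering::SeqCst) {
            thread::sleep(Duration::from_millis(1));
        }
    }
}
```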

## Reproducer

`ch-remote --api-socket ... pause`

```patch
diff --git a/vmm/src/vm.rs b/vmm/src/vm.rs
index d7bba25..35557d58f 100644
--- a/vmm/src/vm.rs
+++ b/vmm/src/vm.rs
@@ -2687,6 +2687,10 @@ impl Pausable for Vm {
             MigratableError::Pause(anyhow!("Error activating pending virtio devices: {:?}", e))
         })?;

+        for _ in 0..1000 {
+            self.cpu_manager.lock().unwrap().pause()?;
+            self.cpu_manager.lock().unwrap().resume()?;
+        }
         self.cpu_manager.lock().unwrap().pause()?;
         self.device_manager.lock().unwrap().pause()?;
```

Since [0] is merged, this fix can be tested, for example, by modifying
the pause() API call to run pause() and resume() in a loop a thousand
times, as in the patch above. With this change, things no longer get stuck.

## Outlook
Decades of experience in VMM development have shown that using
many AtomicBools is a footgun: they are not synchronized with each
other at all. In the long term, we might want to refactor things to
have a single AtomicU64 whose bits carry different meanings.

[0] cloud-hypervisor#7290
phip1611 added a commit to phip1611/cloud-hypervisor that referenced this pull request Sep 1, 2025
phip1611 added a commit to phip1611/cloud-hypervisor that referenced this pull request Sep 1, 2025
phip1611 added a commit to phip1611/cloud-hypervisor that referenced this pull request Sep 2, 2025
olivereanderson pushed a commit to cyberus-technology/cloud-hypervisor that referenced this pull request Sep 2, 2025
phip1611 added a commit to phip1611/cloud-hypervisor that referenced this pull request Sep 15, 2025
phip1611 added a commit to phip1611/cloud-hypervisor that referenced this pull request Sep 15, 2025
phip1611 added a commit to phip1611/cloud-hypervisor that referenced this pull request Sep 16, 2025
phip1611 added a commit to phip1611/cloud-hypervisor that referenced this pull request Oct 6, 2025
phip1611 added a commit to phip1611/cloud-hypervisor that referenced this pull request Oct 6, 2025
phip1611 added a commit to phip1611/cloud-hypervisor that referenced this pull request Oct 6, 2025
phip1611 added a commit to phip1611/cloud-hypervisor that referenced this pull request Oct 6, 2025
phip1611 added a commit to phip1611/cloud-hypervisor that referenced this pull request Oct 21, 2025
phip1611 added a commit to phip1611/cloud-hypervisor that referenced this pull request Oct 21, 2025
phip1611 added a commit to phip1611/cloud-hypervisor that referenced this pull request Oct 22, 2025
github-merge-queue bot pushed a commit that referenced this pull request Oct 22, 2025
Fix a race condition that happens in resume()-pause() cycles.

It is odd that for pause(), the CpuManager waited via `state.paused`
for the vCPU thread to ACK the state change, but not for `resume()`.
In the `resume()` case, the CpuManager oddly "owned" the state change
in `state.paused` itself. This commit changes this so that the vCPU
ACKs its state change itself in `state.paused` when it transitions from
pause->run. Further, `CpuManager::resume()` now gracefully waits for
the vCPU to be resumed.

More technically:

This change ensures proper synchronization and prevents situations
in which park() follows right after unpark(), causing deadlocks and
other erratic behavior due to race conditions.

Calling resume() now takes slightly longer, very similar to pause().
Even for 254 vCPUs, however, this remains below 10ms, and ultimately
we now have correct behaviour.

## Reproducer

Since [0] is merged, the underlying problem can be tested without this
commit by modifying the pause() API call to run `CpuManager::pause()`
and `CpuManager::resume()` in a loop a thousand times.

`ch-remote --api-socket ... pause`

```patch
diff --git a/vmm/src/vm.rs b/vmm/src/vm.rs
index d7bba25..35557d58f 100644
--- a/vmm/src/vm.rs
+++ b/vmm/src/vm.rs
@@ -2687,6 +2687,10 @@ impl Pausable for Vm {
             MigratableError::Pause(anyhow!("Error activating pending virtio devices: {:?}", e))
         })?;

+        for _ in 0..1000 {
+            self.cpu_manager.lock().unwrap().pause()?;
+            self.cpu_manager.lock().unwrap().resume()?;
+        }
         self.cpu_manager.lock().unwrap().pause()?;
         self.device_manager.lock().unwrap().pause()?;
```

## Outlook

Decades of experience in VMM development have shown that using
many AtomicBools is a footgun: they are not synchronized with each
other at all. In the long term, we might want to refactor things to
have a single shared AtomicU64 whose bits carry different meanings.
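
As a rough illustration of that outlook, a single shared state word
could look like this (bit names and layout are made up):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical bit layout for a combined vCPU state word. Because all
// bits live in one AtomicU64, every load observes one mutually
// consistent snapshot, unlike a set of independent AtomicBools.
const PAUSE_REQUESTED: u64 = 1 << 0;
const PAUSED: u64 = 1 << 1;
const KILL_REQUESTED: u64 = 1 << 2;

struct VcpuFlags(AtomicU64);

impl VcpuFlags {
    fn set(&self, mask: u64) {
        self.0.fetch_or(mask, Ordering::SeqCst);
    }
    fn clear(&self, mask: u64) {
        self.0.fetch_and(!mask, Ordering::SeqCst);
    }
    fn is_set(&self, mask: u64) -> bool {
        self.0.load(Ordering::SeqCst) & mask != 0
    }
}
```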

[0] #7290

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]