
Conversation

@rbradford (Member)

  • vmm: properly unset immediate_exit on -EINTR
  • vmm: cpu: Remove (now) unnecessary clearing of KVM set_immediate_exit()
  • vmm: Use set_immediate_exit to avoid a race on pausing the vCPU
  • vmm: cpu: Clear immediate_exit flag on resume

phip1611 and others added 4 commits November 3, 2025 15:20
This is now done as part of handling exits that result in an EINTR
error.

Signed-off-by: Rob Bradford <[email protected]>
A race can occur if the signal to the vCPU thread is sent between the
shared state atomic being checked (e.g. for paused) and entering
KVM_RUN. As such we can use KVM's immediate_exit boolean to make the
KVM_RUN exit immediately.

We only conditionally try to take the lock, because we can end up in a
deadlock from a priority inversion situation where the ordering of
signals doesn't match the ordering in which we take the locks.

Fixes: cloud-hypervisor#7427

Signed-off-by: Rob Bradford <[email protected]>
This avoids an extra KVM_RUN exit upon resume where the immediate_exit
flag is still set.

Signed-off-by: Rob Bradford <[email protected]>
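
To make the race described in these commit messages concrete, here is a minimal sketch of a vCPU run loop, assuming the kvm-ioctls crate (vcpu_loop and the paused atomic are illustrative names, not the actual cloud-hypervisor code). KVM's immediate_exit flag lives in the mmap'd kvm_run structure and is re-checked inside the KVM_RUN ioctl itself, so setting it from the VMM thread closes the window between the pause check and entering the ioctl:

    use std::sync::atomic::{AtomicBool, Ordering};

    use kvm_ioctls::{VcpuExit, VcpuFd};

    fn vcpu_loop(vcpu_fd: &mut VcpuFd, paused: &AtomicBool) {
        loop {
            if paused.load(Ordering::SeqCst) {
                break; // Parked here until the VMM thread resumes us
            }
            // A pause signal landing right here would be consumed by the
            // no-op handler and lost; because immediate_exit is re-read
            // inside the ioctl, setting it still forces an exit.
            match vcpu_fd.run() {
                Ok(VcpuExit::Hlt) => break,
                Ok(_) => continue, // handle MMIO/PIO exits, then re-enter
                Err(e) if e.errno() == libc::EINTR => {
                    // Kicked out by the signal or by immediate_exit: clear
                    // the flag and loop around to re-check the pause state.
                    vcpu_fd.set_kvm_immediate_exit(0);
                    continue;
                }
                Err(e) => panic!("KVM_RUN failed: {e}"),
            }
        }
    }
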
@rbradford rbradford force-pushed the 2025-11-03-vcpu-immediate-exit-on-pause branch from a5786b5 to 8b1ac03 on November 4, 2025 16:07
@rbradford rbradford marked this pull request as ready for review November 4, 2025 16:46
@rbradford rbradford requested a review from a team as a code owner November 4, 2025 16:46
    Ok(mut vcpu) => {
        vcpu.vcpu.set_immediate_exit(true);
    }
    Err(std::sync::TryLockError::WouldBlock) => {}
Member

Would it be better (e.g. reducing the race window further) if we called into signal_thread() here, rather than waiting to loop over all vCPUs? Essentially, we would be using vcpu.try_lock() to detect from the VMM thread whether the vCPU thread is executing KVM_RUN, and pick either set_immediate_exit(true) or signal_thread() to kick the vCPU thread out.

@rbradford (Member, Author) commented Nov 4, 2025

I think picking between two paths would actually increase the chance of a race (and setting immediate_exit or signalling are both cheap) - one of the reasons we have to do it like this is that these loops are split. If we recombine them like this, I think we can avoid the race entirely (this reduces the chance further, but I think it could still theoretically happen):

    fn signal_vcpus(&mut self) {
        // If the lock is unavailable it is almost certainly due to being in KVM_RUN itself so the
        // signal will kick it out.
        for vcpu_id in 0..self.vcpus.len() {
            match self.vcpus[vcpu_id].try_lock() {
                Ok(mut vcpu) => {
                    vcpu.vcpu.set_immediate_exit(true);
                }
                Err(std::sync::TryLockError::WouldBlock) => {}
                Err(e) => {
                    unreachable!("{e:?}")
                }
            }

            self.vcpu_states[vcpu_id].signal_thread();

            // In this case it's fine to wait for the vCPU run to make the lock available for the taking
            self.vcpus[vcpu_id]
                .lock()
                .unwrap()
                .vcpu
                .set_immediate_exit(true);

            self.vcpu_states[vcpu_id].wait_until_signal_acknowledged();
        }
    }

We can't do the two attempts - one non-blocking, one blocking - with the split-loop variant, as you get a deadlock due to the signals not being handled in vCPU order.

@rbradford (Member, Author)

These were split into two loops in this commit - c1f4df6

@likebreath (Member) commented Nov 4, 2025

> I think picking between two paths would actually increase the chance of a race

Hmm... my view is the opposite. I thought that with such a change we would be able to further reduce the time window between the VMM thread detecting that the vCPU thread is in KVM_RUN and the VMM thread kicking the vCPU thread with a signal, which is the racing window. Reducing the race window would reduce the chances of the race, wouldn't it?

The change I am thinking is:

    fn signal_vcpus(&self) {
        // If the lock is unavailable it is almost certainly due to being in KVM_RUN itself so the
        // signal will kick it out.
        for vcpu_id in 0..self.vcpus.len() {
            match self.vcpus[vcpu_id].try_lock() {
                Ok(mut vcpu) => {
                    // vCPU thread is not in KVM_RUN, kick it out with set_immediate_exit
                    vcpu.vcpu.set_immediate_exit(true);
                }
                Err(std::sync::TryLockError::WouldBlock) => {
                    // vCPU thread is in KVM_RUN, kick it out with signal
                    self.vcpu_states[vcpu_id].signal_thread();
                }
                Err(e) => {
                    unreachable!("{e:?}")
                }
            }
        }

        for state in self.vcpu_states.iter() {
            state.wait_until_signal_acknowledged();
        }
    }

@rbradford (Member, Author)

Right - but the thing is that KVM_RUN isn't the only place that the lock is taken - it's just the most likely one. I think conditionally sending the signal could result in situations where wait_until_signal_acknowledged() spins forever.

@rbradford (Member, Author)

I've gone ahead and added even more "defence in depth" - by making the signalling check for acknowledgement and retry.

Member

> Right - but the thing is that KVM_RUN isn't the only place that the lock is taken

I see. How about the following change? It should help with reducing the race further, while not assuming KVM_RUN is the only place where the vcpu lock is held.

    fn signal_vcpus(&self) {
        // If the lock is unavailable it is almost certainly due to being in KVM_RUN itself so the
        // signal will kick it out.
        for vcpu_id in 0..self.vcpus.len() {
            match self.vcpus[vcpu_id].try_lock() {
                Ok(mut vcpu) => {
                    // Try to avoid the race mentioned above, when the vcpu lock is not taken.
                    vcpu.vcpu.set_immediate_exit(true);
                }
                Err(std::sync::TryLockError::WouldBlock) => {}
                Err(e) => {
                    unreachable!("{e:?}")
                }
            }

            // Always send a signal to kick out the vCPU thread, as KVM_RUN is not
            // the only place where the vCPU lock could be taken.
            self.vcpu_states[vcpu_id].signal_thread();
        }

        for state in self.vcpu_states.iter() {
            state.wait_until_signal_acknowledged();
        }
    }

@rbradford (Member, Author)

Sure - let me incorporate that change - will tackle that tomorrow.

@rbradford rbradford force-pushed the 2025-11-03-vcpu-immediate-exit-on-pause branch from 6061b65 to 273b3ec on November 4, 2025 19:46
vmm/src/cpu.rs Outdated
Comment on lines 714 to 717
    ///
    /// Returns whether the vCPU was interrupted within the 1 second timeout
    fn wait_until_signal_acknowledged(&self) -> bool {
        let start = Instant::now();
Member

Let's split this commit out into a different PR, as the change addresses a separate topic.

Also, we should report at least a warning if a timeout occurs here, as this likely suggests abnormal behavior that warrants our attention.

@rbradford (Member, Author) commented Nov 4, 2025

I've actually improved it since - but yes, let me split it out (and tbh, this change on its own would probably fix the lost signal issue in #7427 itself).

By retrying the signalling on the vCPU thread if it doesn't respond
within 1 second, the race between pause being requested and the KVM_RUN
ioctl returning can be mitigated.

Signed-off-by: Rob Bradford <[email protected]>
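
A minimal sketch of that retry mechanism (the VcpuState struct, its signal_acknowledged field and the kick_vcpu_with_retry helper are illustrative assumptions, not the actual cloud-hypervisor types; the 1-second timeout and the wait_until_signal_acknowledged name come from the snippets above):

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::time::{Duration, Instant};

    struct VcpuState {
        // Set by the vCPU thread once it has observed the kick signal.
        signal_acknowledged: AtomicBool,
        // ... plus the thread id used to signal the vCPU thread ...
    }

    impl VcpuState {
        fn signal_thread(&self) {
            // pthread_kill() on the vCPU thread in the real code.
        }

        /// Returns whether the vCPU acknowledged within the 1 second timeout.
        fn wait_until_signal_acknowledged(&self) -> bool {
            let start = Instant::now();
            while start.elapsed() < Duration::from_secs(1) {
                if self.signal_acknowledged.load(Ordering::Acquire) {
                    return true;
                }
                std::hint::spin_loop();
            }
            false
        }

        // Re-send the kick until the vCPU thread acknowledges, so a signal
        // lost in the pause/KVM_RUN race is eventually delivered again.
        fn kick_vcpu_with_retry(&self) {
            loop {
                self.signal_thread();
                if self.wait_until_signal_acknowledged() {
                    break;
                }
                eprintln!("vCPU did not acknowledge the signal within 1s; retrying");
            }
        }
    }
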
@rbradford rbradford force-pushed the 2025-11-03-vcpu-immediate-exit-on-pause branch from 273b3ec to b8a5afc on November 4, 2025 20:25
@phip1611 (Member) commented Nov 5, 2025

Just to make sure that I get it right: your high-level idea is that, instead of doing (unsafe) stuff in the signal handler, we stick to the NOOP signal handler, and you instead tackle the challenge at the places that cause the signal to be sent?

I think this could be a possible way forward, but we need some time to think about it and evaluate it. Thanks for working on this!
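
For context, a minimal sketch of such a no-op handler, assuming raw libc and SIGRTMIN as the kick signal (register_kick_signal is an illustrative name; cloud-hypervisor's actual registration code may differ). The handler body is deliberately empty: delivery of the signal alone is enough to make a blocked KVM_RUN ioctl return -EINTR, and an empty body keeps the handler trivially async-signal-safe:

    extern "C" fn noop_handler(_sig: libc::c_int) {}

    fn register_kick_signal() {
        // SAFETY: we install a handler whose body performs no work at all,
        // so it cannot violate async-signal-safety.
        unsafe {
            let mut action: libc::sigaction = std::mem::zeroed();
            action.sa_sigaction = noop_handler as extern "C" fn(libc::c_int) as usize;
            libc::sigaction(libc::SIGRTMIN(), &action, std::ptr::null_mut());
        }
    }
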

@phip1611 (Member)

I've added our (=@cyberus-technology) analysis, statement, and recommendation in the issue #7427 (comment).

@rbradford (Member, Author)

Closing as @phip1611 is going to raise a new PR with a proposed solution.

@rbradford rbradford closed this Nov 11, 2025