Conversation

@rbradford
Member

Problem is described in excellent detail in #7427. In essence it was
possible for the pause to be triggered between when we check for the
pause atomic being signalled and when we enter KVM_RUN. This would then
result in that pause being lost until the next KVM_EXIT. The code
already made use of KVM set_immediate_exit() for handling the
completion of I/O exits and the RT signal for breaking the execution of
KVM_RUN.

Using the code from Firecracker:
firecracker-microvm/firecracker@b359722

Implement Thread Local Storage (TLS) for the Vcpu object and then use
the RT signal to trigger KVM set_immediate_exit(). The use of TLS is
necessary because the signal handler is global and otherwise has no safe
way to reach the Vcpu belonging to the thread it interrupts.

Fixes: #7427

Signed-off-by: Rob Bradford [email protected]
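
A minimal sketch of the TLS pattern described above - all names here are illustrative stand-ins, not the actual Cloud Hypervisor code. Each vCPU thread stores a raw pointer to its own Vcpu in a thread-local slot, and the RT-signal handler, which runs on the interrupted thread, reads that slot back to find the right vCPU:

use std::cell::Cell;
use std::ptr;

// Hypothetical stand-in for the real Vcpu type.
struct Vcpu { /* vcpu fd, run state, ... */ }

thread_local! {
    // One slot per vCPU thread; null until init_thread_local_data() runs.
    static TLS_VCPU_PTR: Cell<*mut Vcpu> = const { Cell::new(ptr::null_mut()) };
}

impl Vcpu {
    // Called once on the vCPU thread before entering the run loop.
    fn init_thread_local_data(&mut self) {
        TLS_VCPU_PTR.with(|cell| cell.set(self as *mut Vcpu));
    }
}

// Handler for the RT "kick" signal; it runs on whichever vCPU thread was
// signalled, so the thread-local lookup yields that thread's own Vcpu.
extern "C" fn vcpu_kick_handler(_signo: i32) {
    TLS_VCPU_PTR.with(|cell| {
        let vcpu_ptr = cell.get();
        if !vcpu_ptr.is_null() {
            // In the PR, this is where set_immediate_exit(true) is applied,
            // so a pending or upcoming KVM_RUN returns immediately.
        }
    });
}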

@rbradford requested a review from a team as a code owner on October 22, 2025 15:52
@rbradford force-pushed the 2025-10-22-vcpu-tls-kick branch from f5552e0 to 914954c on October 22, 2025 16:04
@rbradford
Member Author

This seems to fail when I do a shutdown - it's interacting badly with the SIGWINCH handler:

It's stuck in the poll():

#4  0x00007fa9d4965c1b in vmm::sigwinch_listener::sigwinch_listener_main::{closure#1} (tx=0x7fa9d43ff570) at vmm/src/sigwinch_listener.rs:196
196	        while unsafe { poll(&mut pollfd, 1, -1) } == -1 {

@alyssais - Any thoughts?

@rbradford force-pushed the 2025-10-22-vcpu-tls-kick branch from 914954c to 44cf38e on October 22, 2025 16:15
@rbradford
Member Author

rbradford commented Oct 22, 2025

> This seems to fail when I do a shutdown - it's interacting badly with the SIGWINCH handler:
>
> It's stuck in the poll():
>
> #4  0x00007fa9d4965c1b in vmm::sigwinch_listener::sigwinch_listener_main::{closure#1} (tx=0x7fa9d43ff570) at vmm/src/sigwinch_listener.rs:196
> 196	        while unsafe { poll(&mut pollfd, 1, -1) } == -1 {
>
> @alyssais - Any thoughts?

This is not to do with that - I was getting confused because of the second process used for the SIGWINCH handler - gdb was dropping me into that. Rather it's the signalling as part of the shutdown which is going wrong (hence why some of the tests - those that reboot in the middle - are failing.)

@rbradford force-pushed the 2025-10-22-vcpu-tls-kick branch 3 times, most recently from 085e71b to c729acb on October 22, 2025 17:48
@rbradford
Member Author

> This is not to do with that - I was getting confused because of the second process used for the SIGWINCH handler - gdb was dropping me into that. Rather it's the signalling as part of the shutdown which is going wrong (hence why some of the tests - those that reboot in the middle - are failing.)

All fixed now.

@phip1611 (Member) left a comment

Thanks for coming up with a quick first draft for this! Your implementation using thread-local storage looks much cleaner and less hacky than I initially expected - nicely done!

I have some remarks, though.

  • I’d like to run it in our test suite to verify everything works as intended.

unsafe {
    let _ = Vcpu::run_on_thread_local(|vcpu| {
        vcpu.vcpu.set_immediate_exit(true);
        fence(Ordering::Release);
Member

? please explain

Member Author

Explain what - the fence? I copied it from Firecracker.

Member

Yes, the fence. Well, do we genuinely understand the need for the fence there? We discussed this internally and we are not sure. That said: If you see something that we do not see, please write a short comment. Otherwise, I think we should remove it

Member Author

QEMU has a barrier after this too - it makes sense that this would be necessary on systems without TSO, i.e. with a weaker memory model. The KVM_RUN ioctl could be performed on a different physical core from the one where this value was updated, leaving the change in that core's cache but not yet visible to the running core. The memory barrier ensures all cores see all writes prior to the barrier before any reads that happen after it.
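
For illustration, this is the classic message-passing pattern that the release fence is one half of; the names below are made up, and in the PR the "reader" side is the kernel's KVM_RUN path (the syscall boundary is assumed to provide the matching ordering) rather than Rust code:

use std::sync::atomic::{fence, AtomicBool, AtomicU8, Ordering};

// IMMEDIATE_EXIT models the field in the mmap'd kvm_run struct;
// PUBLISHED models whatever later operation the reader observes.
static IMMEDIATE_EXIT: AtomicU8 = AtomicU8::new(0);
static PUBLISHED: AtomicBool = AtomicBool::new(false);

fn signal_handler_side() {
    IMMEDIATE_EXIT.store(1, Ordering::Relaxed); // the write to publish
    fence(Ordering::Release); // writes above cannot be reordered past this
    PUBLISHED.store(true, Ordering::Relaxed);
}

fn kvm_run_side() {
    if PUBLISHED.load(Ordering::Relaxed) {
        fence(Ordering::Acquire); // pairs with the release fence above
        // Guaranteed to observe the write published before the fence.
        assert_eq!(IMMEDIATE_EXIT.load(Ordering::Relaxed), 1);
    }
}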

Contributor

@rbradford Can you elaborate on the case where multiple CPUs would be involved? Are you worried about this thread migrating (doesn't need atomics, just signal safety) or are there multiple threads writing to KVM_RUN (needs atomics)?

Member

The former - the memory fence is to ensure the write to immediate_exit (from the signal handler) is visible to the upcoming KVM_RUN ioctl (which could be running on a different physical core from the signal handler).

@rbradford
Member Author

@phip1611 Don't worry too much about this version of the PR - I've got some ideas on how to improve it and will update it.

@rbradford force-pushed the 2025-10-22-vcpu-tls-kick branch 2 times, most recently from f612b6d to 1c332a6 on October 24, 2025 18:33
rbradford and others added 4 commits October 25, 2025 13:16
This is now done as part of handling exits that result in an EINTR
error.

Signed-off-by: Rob Bradford <[email protected]>
@rbradford force-pushed the 2025-10-22-vcpu-tls-kick branch from 1c332a6 to 3227f5f on October 25, 2025 12:16
(libc::SYS_recvfrom, vec![]),
(libc::SYS_recvmsg, vec![]),
(libc::SYS_sched_yield, vec![]),
(libc::SYS_sendto, vec![]),
Member

This commit message is missing the "why" part.

/// Should be called if the current `self` had called `init_thread_local_data()` and
/// now needs to move to a different thread.
///
/// Fails if `self` was not previously associated with the current thread.
Member

> Fails if `self` was not previously associated with the current thread.

In what realistic scenarios can this happen?

Member

To me, that's likely a programming error, similar to Error::VcpuTlsInit (where we panic from the caller side with an expect()). Perhaps we should panic too with Error::VcpuTlsNotPresent from the caller, say from Vcpu::drop().
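
A rough illustration of the suggested call-site pattern - hypothetical code; reset_thread_local_data() is an assumed counterpart to init_thread_local_data():

// Treat both TLS errors as programming errors and panic at the caller
// instead of propagating a Result further up.
vcpu.init_thread_local_data()
    .expect("vCPU TLS already initialised for this thread");

// ... and later, e.g. from Vcpu::drop():
self.reset_thread_local_data()
    .expect("vCPU TLS not present: Vcpu dropped on a different thread");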

/// Cannot clean init vcpu TLS.
#[error("Failed to initialise vCPU TLS")]
VcpuTlsInit,
#[error("vCPU TLS not present")]
Member

I think we should add a little more context, something like:

Suggested change:
- #[error("vCPU TLS not present")]
+ #[error("vCPU was moved to another thread early and TLS is no longer present")]

@phip1611 (Member) Oct 27, 2025

Please note that this discussion is still open. Why can the vCPU be moved to another thread here? Not sure how useful this error message is.

If we have a hard error here, I'd rather panic?

Member

I don't think a vCPU instance (of the struct Vcpu) will be moved to a different thread. To avoid confusion: the discussion you referred to is about the same thread being executed on two different physical cores (e.g. thread migration).

@phip1611
Member

Following up on #7428 (comment): Either I'm blind or you somehow missed unregistering the signal handler? 😀
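
For reference, unregistering could be as simple as restoring the default disposition when the vCPU thread winds down - a sketch under assumed names (the kick signal number is whatever RT signal the VMM registered):

// Hypothetical cleanup: reset the RT kick signal to its default disposition
// so a late signal cannot invoke the handler against a dead Vcpu.
fn unregister_kick_handler(vcpu_kick_signal: libc::c_int) {
    // SAFETY: sigaction with SIG_DFL and an otherwise zeroed struct is a
    // well-defined way to restore a signal's default disposition.
    unsafe {
        let mut act: libc::sigaction = std::mem::zeroed();
        act.sa_sigaction = libc::SIG_DFL;
        libc::sigaction(vcpu_kick_signal, &act, std::ptr::null_mut());
    }
}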

if let Some(vcpu_ptr) = cell.get() {
    // SAFETY: Dereferencing here is safe since `TLS_VCPU_PTR` is populated/non-empty,
    // and it is being cleared on `Vcpu::drop` so there is no dangling pointer.
    let vcpu_ref: &mut Vcpu = unsafe { &mut *vcpu_ptr };
@olivereanderson (Contributor) Oct 27, 2025

I fear this will break Rust's aliasing rules.

What happens if you are in vcpu.run() here and then the signal handler gets triggered, which will then run this function? Then you have two mutable references to the same vCPU resource, don't you?
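
A reduced illustration of the concern (illustrative only, not the PR's code): under Stacked Borrows, re-asserting the original `&mut` invalidates any reference later derived from the stored raw pointer, so the interleaving a signal handler produces is exactly the pattern Miri rejects:

fn aliasing_concern_demo() {
    let mut vcpu_state = 0u32; // stands in for the Vcpu
    let original: &mut u32 = &mut vcpu_state; // like `&mut self` inside vcpu.run()
    let raw: *mut u32 = original; // like the pointer stashed in TLS
    let from_signal: &mut u32 = unsafe { &mut *raw }; // like the handler's deref
    *from_signal += 1; // fine so far: derived from `original`
    *original += 1; // using `original` again invalidates `from_signal`...
    *from_signal += 1; // ...so this use is UB under Stacked Borrows (Miri errors here)
}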

@phip1611 (Member) Oct 27, 2025

Good catch! However, it might be that there is a "loophole" for us in the undefined behavior spectrum - as we have different execution contexts (normal thread, thread handling the signal) and different stack variables, I think it could be okay.

I don't know how to verify either, though.

@phip1611 (Member) Oct 27, 2025

From our internal discussion:

  • It might be okay to have an imperfect fix for now with potential UB (thanks Rob for working on this, really!)
  • Assuming we indeed have UB here, a possible path forward for a correct solution could be (see the sketch after this comment):
    • add functionality to VcpuFd in kvm-ioctls to create new distinct mappings of KVM_RUN
    • add an atomic way to read/write immediate_exit from rust
    • use this abstraction in CHV and create a reasonably nice interface within CHV for that
    • CHV would then use a distinct KVM_RUN mapping (which holds immediate_exit) in its vCPU TLS

Further:

  • If we indeed have UB here, Firecracker runs into the same UB problem.
  • While Firecracker may not serve as the ultimate reference implementation, it successfully runs millions of VMs in production, which suggests that this is probably not "critical undefined behavior".
  • It might be that the Rust language/compiler team one day says what we are doing here is not UB but a valid corner-case - at this point in time this is not clear.

@olivereanderson is asking the Rust team upstream (link will be added ASAP) for their take on our UB discussion. Perhaps we could wait for a reply there before we merge this?
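
A rough sketch of the distinct-mapping idea from the list above - hypothetical types and offsets, with the real plumbing belonging in kvm-ioctls. The point is that the signal handler would go through an atomic view of the kvm_run mapping and never form a `&mut` to the Vcpu:

use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical: an independent view of the vCPU's mmap'd kvm_run page.
// `immediate_exit_offset` would come from the kvm_run struct layout.
struct KvmRunView {
    base: *mut u8,
    immediate_exit_offset: usize,
}

// SAFETY contract (assumed): `base` stays mapped for the view's lifetime.
unsafe impl Send for KvmRunView {}
unsafe impl Sync for KvmRunView {}

impl KvmRunView {
    // Atomic store through the mapping: no `&mut Vcpu` is ever formed,
    // sidestepping the aliasing question discussed above.
    fn set_immediate_exit(&self) {
        // SAFETY: assumed in-bounds; a single byte is always aligned for AtomicU8.
        let flag =
            unsafe { &*(self.base.add(self.immediate_exit_offset) as *const AtomicU8) };
        flag.store(1, Ordering::Release);
    }
}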

@phip1611 (Member) Oct 27, 2025

I think the TL;DR of the rust-lang discussion is: it is very simple. We just need to define a new formal memory model covering typical low-level patterns, implement this model in LLVM and Rust, and then adapt the Cloud Hypervisor code to it.

Just kidding. It is inherently complex. I'm in favor of doing what works technically so far - this is what Firecracker uses as well, and it is what Rob used here. Although I'm not 100% happy with that.

Member

Just because Firecracker has got away with it so far doesn't mean it isn't liable to break at any time — the several instances of UB I've found in Cloud Hypervisor have all become apparent because a rustc/LLVM upgrade suddenly made it do something nonsensical, where before it was fine.

Member

I'd focus on finding the best solution for solving #7427 with this PR. If carrying a known UB is the best solution we have for now, I'd go with it, with clear documentation of the UB of course.


> Just because Firecracker has got away with it so far doesn't mean it isn't liable to break at any time

Firecracker stopped doing the "vcpu pointer in TLS" thing a few months ago for exactly this reason.

@likebreath (Member) left a comment

Overall this looks good to me. I'd vote to land this PR once @phip1611 confirms it fixes #7427.

The open comments are only around better handling of Error::VcpuTlsNotPresent.

// SAFETY: This is safe because it's temporarily aliasing the `Vcpu` object, but we are
// only reading `vcpu.fd` which does not change for the lifetime of the `Vcpu`.
unsafe {
    let _ = Vcpu::run_on_thread_local(|vcpu| {
Member

A similar comment here regarding error handling: perhaps it is better to panic with Error::VcpuTlsNotPresent, since the error indicates a programming error.

@rbradford
Member Author

I have an alternative here - #7455 that doesn't require TLS.


Labels

bug-fix: Bug fix to include in release notes

Development

Successfully merging this pull request may close these issues.

KVM: Invalid immediate_exit handling
