Conversation

@rbradford
Member

Problem is described in excellent detail in #7427. In essence it was
possible for the pause to be triggered between when we check for the
pause atomic being signalled and when we enter KVM_RUN. This would then
result in that pause being lost until the next KVM_EXIT. The code
already made use of KVM set_immediate_exit() for handling the
completion of I/O exits and the RT signal for breaking the execution of
KVM_RUN.

Using the code from Firecracker:
firecracker-microvm/firecracker@b359722

Implement Thread Local Storage (TLS) for the Vcpu object and then use
the RT signal to trigger KVM set_immediate_exit(). The use of TLS is
necessary because the signal handler is global and otherwise has no safe
way to reach the Vcpu belonging to the thread it interrupts.

Fixes: #7427

Signed-off-by: Rob Bradford [email protected]
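
A minimal sketch of the TLS pattern described above - all names here are illustrative stand-ins, not the actual Cloud Hypervisor code. Each vCPU thread stores a raw pointer to its own Vcpu in a thread-local slot, and the RT-signal handler, which runs on the interrupted thread, reads that slot back to find the right vCPU:

use std::cell::Cell;
use std::ptr;

// Hypothetical stand-in for the real Vcpu type.
struct Vcpu { /* vcpu fd, run state, ... */ }

thread_local! {
    // One slot per vCPU thread; null until init_thread_local_data() runs.
    static TLS_VCPU_PTR: Cell<*mut Vcpu> = const { Cell::new(ptr::null_mut()) };
}

impl Vcpu {
    // Called once on the vCPU thread before entering the run loop.
    fn init_thread_local_data(&mut self) {
        TLS_VCPU_PTR.with(|cell| cell.set(self as *mut Vcpu));
    }
}

// Handler for the RT "kick" signal; it runs on whichever vCPU thread was
// signalled, so the thread-local lookup yields that thread's own Vcpu.
extern "C" fn vcpu_kick_handler(_signo: i32) {
    TLS_VCPU_PTR.with(|cell| {
        let vcpu_ptr = cell.get();
        if !vcpu_ptr.is_null() {
            // In the PR, this is where set_immediate_exit(true) is applied,
            // so a pending or upcoming KVM_RUN returns immediately.
        }
    });
}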

@rbradford requested a review from a team as a code owner on October 22, 2025 15:52
@rbradford force-pushed the 2025-10-22-vcpu-tls-kick branch from f5552e0 to 914954c on October 22, 2025 16:04
@rbradford
Member Author

This seems to fail when I do a shutdown - it's interacting badly with the SIGWINCH handler:

It's stuck in the poll():

#4  0x00007fa9d4965c1b in vmm::sigwinch_listener::sigwinch_listener_main::{closure#1} (tx=0x7fa9d43ff570) at vmm/src/sigwinch_listener.rs:196
196	        while unsafe { poll(&mut pollfd, 1, -1) } == -1 {

@alyssais - Any thoughts?

@rbradford force-pushed the 2025-10-22-vcpu-tls-kick branch from 914954c to 44cf38e on October 22, 2025 16:15
@rbradford
Member Author

rbradford commented Oct 22, 2025

> This seems to fail when I do a shutdown - it's interacting badly with the SIGWINCH handler:
>
> It's stuck in the poll():
>
> #4  0x00007fa9d4965c1b in vmm::sigwinch_listener::sigwinch_listener_main::{closure#1} (tx=0x7fa9d43ff570) at vmm/src/sigwinch_listener.rs:196
> 196	        while unsafe { poll(&mut pollfd, 1, -1) } == -1 {
>
> @alyssais - Any thoughts?

This is not to do with that - I was getting confused because of the second process used for the SIGWINCH handler - gdb was dropping me into that. Rather it's the signalling as part of the shutdown which is going wrong (hence why some of the tests - those that reboot in the middle - are failing.)

@rbradford force-pushed the 2025-10-22-vcpu-tls-kick branch 3 times, most recently from 085e71b to c729acb on October 22, 2025 17:48
@rbradford
Member Author

> This is not to do with that - I was getting confused because of the second process used for the SIGWINCH handler - gdb was dropping me into that. Rather it's the signalling as part of the shutdown which is going wrong (hence why some of the tests - those that reboot in the middle - are failing.)

All fixed now.

@phip1611 (Member) left a comment

Thanks for coming up with a quick first draft for this! Your implementation using thread-local storage looks much cleaner and less hacky than I initially expected - nicely done!

I have some remarks, though.

  • I’d like to run it in our test suite to verify everything works as intended.

unsafe {
    let _ = Vcpu::run_on_thread_local(|vcpu| {
        vcpu.vcpu.set_immediate_exit(true);
        fence(Ordering::Release);
Member

? please explain

Member Author

Explain what - the fence? I copied it from Firecracker.

Member

Yes, the fence. Well, do we genuinely understand the need for the fence there? We discussed this internally and we are not sure. That said: If you see something that we do not see, please write a short comment. Otherwise, I think we should remove it

Member Author

QEMU has a barrier after this too - it makes sense that this would be necessary on systems without TSO, i.e. with a weaker memory model. The KVM_RUN ioctl could be performed on a different physical core from the one where this value was updated, leaving the change in that core's cache but not yet visible to the running core. The memory barrier ensures all cores see all writes prior to the barrier before any reads that happen after it.
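
For illustration, this is the classic message-passing pattern that the release fence is one half of; the names below are made up, and in the PR the "reader" side is the kernel's KVM_RUN path (the syscall boundary is assumed to provide the matching ordering) rather than Rust code:

use std::sync::atomic::{fence, AtomicBool, AtomicU8, Ordering};

// IMMEDIATE_EXIT models the field in the mmap'd kvm_run struct;
// PUBLISHED models whatever later operation the reader observes.
static IMMEDIATE_EXIT: AtomicU8 = AtomicU8::new(0);
static PUBLISHED: AtomicBool = AtomicBool::new(false);

fn signal_handler_side() {
    IMMEDIATE_EXIT.store(1, Ordering::Relaxed); // the write to publish
    fence(Ordering::Release); // writes above cannot be reordered past this
    PUBLISHED.store(true, Ordering::Relaxed);
}

fn kvm_run_side() {
    if PUBLISHED.load(Ordering::Relaxed) {
        fence(Ordering::Acquire); // pairs with the release fence above
        // Guaranteed to observe the write published before the fence.
        assert_eq!(IMMEDIATE_EXIT.load(Ordering::Relaxed), 1);
    }
}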

Contributor

@rbradford Can you elaborate on the case where multiple CPUs would be involved? Are you worried about this thread migrating (doesn't need atomics, just signal safety) or are there multiple threads writing to KVM_RUN (needs atomics)?

Member

The former - the memory fence is to ensure the write to immediate_exit (from the signal handler) is visible to the upcoming KVM_RUN ioctl (which could be running on a different physical core from the signal handler).

@rbradford
Member Author

@phip1611 Don't worry too much about this version of the PR - I've got some ideas on how to improve it and will update it.

@rbradford force-pushed the 2025-10-22-vcpu-tls-kick branch 2 times, most recently from f612b6d to 1c332a6 on October 24, 2025 18:33
rbradford and others added 4 commits October 25, 2025 13:16
This is now done as part of handling exits that result in an EINTR
error.

Signed-off-by: Rob Bradford <[email protected]>
@rbradford force-pushed the 2025-10-22-vcpu-tls-kick branch from 1c332a6 to 3227f5f on October 25, 2025 12:16
(libc::SYS_recvfrom, vec![]),
(libc::SYS_recvmsg, vec![]),
(libc::SYS_sched_yield, vec![]),
(libc::SYS_sendto, vec![]),
Member

This commit message is missing the "why" part.

/// Should be called if the current `self` had called `init_thread_local_data()` and
/// now needs to move to a different thread.
///
/// Fails if `self` was not previously associated with the current thread.
Member

> Fails if `self` was not previously associated with the current thread.

In what realistic scenarios can this happen?

Member

To me, that's likely a programming error, similar to Error::VcpuTlsInit (where we panic from the caller side with an expect()). Perhaps we should panic too with Error::VcpuTlsNotPresent from the caller, say from Vcpu::drop().
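
A rough illustration of the suggested call-site pattern - hypothetical code; reset_thread_local_data() is an assumed counterpart to init_thread_local_data():

// Treat both TLS errors as programming errors and panic at the caller
// instead of propagating a Result further up.
vcpu.init_thread_local_data()
    .expect("vCPU TLS already initialised for this thread");

// ... and later, e.g. from Vcpu::drop():
self.reset_thread_local_data()
    .expect("vCPU TLS not present: Vcpu dropped on a different thread");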

/// Cannot clean init vcpu TLS.
#[error("Failed to initialise vCPU TLS")]
VcpuTlsInit,
#[error("vCPU TLS not present")]
Member

I think we should add a little more context, something like:

Suggested change:
- #[error("vCPU TLS not present")]
+ #[error("vCPU was moved to another thread early and TLS is no longer present")]

@phip1611 (Member) Oct 27, 2025

Please note that this discussion is still open. Why can the vCPU be moved to another thread here? Not sure how useful this error message is.

If we have a hard error here, I'd rather panic?

Member

I don't think a vCPU instance (of the struct Vcpu) will be moved to a different thread. To avoid confusion: the discussion you referred to is about the same thread being executed on two different physical cores (e.g. thread migration).

@phip1611
Member

Following up on #7428 (comment): Either I'm blind or you somehow missed unregistering the signal handler? 😀
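
For reference, unregistering could be as simple as restoring the default disposition when the vCPU thread winds down - a sketch under assumed names (the kick signal number is whatever RT signal the VMM registered):

// Hypothetical cleanup: reset the RT kick signal to its default disposition
// so a late signal cannot invoke the handler against a dead Vcpu.
fn unregister_kick_handler(vcpu_kick_signal: libc::c_int) {
    // SAFETY: sigaction with SIG_DFL and an otherwise zeroed struct is a
    // well-defined way to restore a signal's default disposition.
    unsafe {
        let mut act: libc::sigaction = std::mem::zeroed();
        act.sa_sigaction = libc::SIG_DFL;
        libc::sigaction(vcpu_kick_signal, &act, std::ptr::null_mut());
    }
}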

if let Some(vcpu_ptr) = cell.get() {
    // SAFETY: Dereferencing here is safe since `TLS_VCPU_PTR` is populated/non-empty,
    // and it is being cleared on `Vcpu::drop` so there is no dangling pointer.
    let vcpu_ref: &mut Vcpu = unsafe { &mut *vcpu_ptr };
@olivereanderson (Contributor) Oct 27, 2025

I fear this will break Rust's aliasing rules.

What happens if you are in vcpu.run() here and then the signal handler gets triggered, which will then run this function? Then you have two mutable references to the same vCPU resource, don't you?
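
A reduced illustration of the concern (illustrative only, not the PR's code): under Stacked Borrows, re-asserting the original `&mut` invalidates any reference later derived from the stored raw pointer, so the interleaving a signal handler produces is exactly the pattern Miri rejects:

fn aliasing_concern_demo() {
    let mut vcpu_state = 0u32; // stands in for the Vcpu
    let original: &mut u32 = &mut vcpu_state; // like `&mut self` inside vcpu.run()
    let raw: *mut u32 = original; // like the pointer stashed in TLS
    let from_signal: &mut u32 = unsafe { &mut *raw }; // like the handler's deref
    *from_signal += 1; // fine so far: derived from `original`
    *original += 1; // using `original` again invalidates `from_signal`...
    *from_signal += 1; // ...so this use is UB under Stacked Borrows (Miri errors here)
}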

@phip1611 (Member) Oct 27, 2025

Good catch! However, it might be that there is a "loophole" for us in the undefined behavior spectrum - as we have different execution contexts (normal thread, thread handling the signal) and different stack variables, I think it could be okay.

I don't know how to verify either, though.

@phip1611 (Member) Oct 27, 2025

From our internal discussion:

  • It might be okay to have an imperfect fix for now with potential UB (thanks Rob for working on this, really!)
  • Assuming we indeed have UB here, a possible path forward for a correct solution could be (see the sketch after this comment):
    • add functionality to VcpuFd in kvm-ioctls to create new distinct mappings of KVM_RUN
    • add an atomic way to read/write immediate_exit from rust
    • use this abstraction in CHV and create a reasonably nice interface within CHV for that
    • CHV would then use a distinct KVM_RUN mapping (which holds immediate_exit) in its vCPU TLS

Further:

  • If we indeed have UB here, Firecracker runs into the same UB problem.
  • While Firecracker may not serve as the ultimate reference implementation, it successfully runs millions of VMs in production, which suggests that this is probably not "critical undefined behavior".
  • It might be that the Rust language/compiler team one day says what we are doing here is not UB but a valid corner-case - at this point in time this is not clear.

@olivereanderson is asking the Rust team upstream (link will be added ASAP) for their take on our UB discussion. Perhaps we could wait for a reply there before we merge this?
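
A rough sketch of the distinct-mapping idea from the list above - hypothetical types and offsets, with the real plumbing belonging in kvm-ioctls. The point is that the signal handler would go through an atomic view of the kvm_run mapping and never form a `&mut` to the Vcpu:

use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical: an independent view of the vCPU's mmap'd kvm_run page.
// `immediate_exit_offset` would come from the kvm_run struct layout.
struct KvmRunView {
    base: *mut u8,
    immediate_exit_offset: usize,
}

// SAFETY contract (assumed): `base` stays mapped for the view's lifetime.
unsafe impl Send for KvmRunView {}
unsafe impl Sync for KvmRunView {}

impl KvmRunView {
    // Atomic store through the mapping: no `&mut Vcpu` is ever formed,
    // sidestepping the aliasing question discussed above.
    fn set_immediate_exit(&self) {
        // SAFETY: assumed in-bounds; a single byte is always aligned for AtomicU8.
        let flag =
            unsafe { &*(self.base.add(self.immediate_exit_offset) as *const AtomicU8) };
        flag.store(1, Ordering::Release);
    }
}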

@phip1611 (Member) Oct 27, 2025

I think the TL;DR of the rust-lang discussion is: it is very simple. We just need to define a new formal memory model covering typical low-level patterns, implement this model in LLVM and Rust, and then adapt the Cloud Hypervisor code to it.

Just kidding. It is inherently complex. I'm in favor of doing what works technically so far - this is what Firecracker uses as well, and it is what Rob used here. Although I'm not 100% happy with that.

Member

Just because Firecracker has got away with it so far doesn't mean it isn't liable to break at any time — the several instances of UB I've found in Cloud Hypervisor have all become apparent because a rustc/LLVM upgrade suddenly made it do something nonsensical, where before it was fine.

Member

I'd focus on finding the best solution for solving #7427 with this PR. If carrying a known UB is the best solution we have for now, I'd go with it, with clear documentation of the UB of course.


> Just because Firecracker has got away with it so far doesn't mean it isn't liable to break at any time

Firecracker stopped doing the "vcpu pointer in TLS" thing a few months ago for exactly this reason.

@likebreath (Member) left a comment

Overall this looks good to me. I'd vote to land this PR once @phip1611 confirms it fixes #7427.

The open comments are only around better handling of Error::VcpuTlsNotPresent.

// SAFETY: This is safe because it's temporarily aliasing the `Vcpu` object, but we are
// only reading `vcpu.fd` which does not change for the lifetime of the `Vcpu`.
unsafe {
    let _ = Vcpu::run_on_thread_local(|vcpu| {
Member

A similar comment here regarding error handling: perhaps it is better to panic with Error::VcpuTlsNotPresent, since the error indicates a programming error.

@rbradford
Member Author

I have an alternative here - #7455 that doesn't require TLS.


Labels

bug-fix: Bug fix to include in release notes

Development

Successfully merging this pull request may close these issues.

KVM: Invalid immediate_exit handling
