vmm: cpu: Use set_immediate_exit to avoid a pause race #7428
Conversation
Force-pushed from f5552e0 to 914954c
This seems to fail when I do a shutdown - it's interacting badly with the SIGWINCH handler: It's stuck in the

@alyssais - Any thoughts?
Force-pushed from 914954c to 44cf38e
This is not to do with that - I was getting confused because of the second process used for the SIGWINCH handler - gdb was dropping me into that. Rather it's the signalling as part of the shutdown which is going wrong (hence why some of the tests - those that reboot in the middle - are failing.)
Force-pushed from 085e71b to c729acb
All fixed now.
phip1611 left a comment
Thanks for coming up with a quick first draft for this! Your implementation using thread-local storage looks much cleaner and less hacky than I initially expected - nicely done!
I have some remarks, though.
- I’d like to run it in our test suite to verify everything works as intended.
```rust
unsafe {
    let _ = Vcpu::run_on_thread_local(|vcpu| {
        vcpu.vcpu.set_immediate_exit(true);
        fence(Ordering::Release);
```
? please explain
Explain what - the fence? I copied it from firecracker.
Yes, the fence. Well, do we genuinely understand the need for the fence there? We discussed this internally and we are not sure. That said: If you see something that we do not see, please write a short comment. Otherwise, I think we should remove it
QEMU has a barrier after this too - it makes sense that this would be necessary on systems without TSO, i.e. with a weaker memory model, since the KVM_RUN ioctl could be performed on a different physical core to the one where this value was updated (leaving the change in that core's cache but not yet seen by the running core). The memory barrier ensures all cores see all writes prior to the barrier before any reads that might happen after the barrier.
@rbradford Can you elaborate on the case where multiple CPUs would be involved? Are you worried about this thread migrating (doesn't need atomics, just signal safety) or are there multiple threads writing to KVM_RUN (needs atomics)?
The former - the memory fence is to ensure the write to immediate_exit (from the signal handler) is visible to the upcoming KVM_RUN ioctl (which could be running on a different physical core from the signal handler).
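For reference, a minimal model of the visibility guarantee being discussed, assuming the flag is represented as an atomic (the real code writes it through set_immediate_exit() and then issues fence(Ordering::Release); the names below are illustrative, not Cloud Hypervisor's):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Stand-in for kvm_run.immediate_exit in this model.
static IMMEDIATE_EXIT: AtomicBool = AtomicBool::new(false);

// Runs inside the RT signal handler delivered to the vCPU thread.
fn on_signal() {
    // Release store: writes made before this point become visible to whichever
    // physical core later observes the flag with an acquire load.
    IMMEDIATE_EXIT.store(true, Ordering::Release);
}

// Runs on the vCPU thread just before (re-)entering KVM_RUN.
fn should_exit_immediately() -> bool {
    IMMEDIATE_EXIT.load(Ordering::Acquire)
}
```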
@phip1611 Don't worry too much about this version of the PR - I've got some ideas on how to improve it and will update it.
Force-pushed from f612b6d to 1c332a6
Problem is described in excellent detail in cloud-hypervisor#7427. In essence it was possible for the pause to be triggered between when we check for the pause atomic being signalled and when we enter KVM_RUN. This would then result in that pause being lost until the next KVM_EXIT. The code already made use of KVM set_immediate_exit() for handling the completion of I/O exits and the RT signal for breaking the execution of KVM_RUN.

Using the code from Firecracker: firecracker-microvm/firecracker@b359722

Implement Thread Local Storage (TLS) for the Vcpu object and then use the RT signal to trigger KVM set_immediate_exit(). The use of TLS is necessary as signals are not thread safe.

Fixes: cloud-hypervisor#7427

Signed-off-by: Rob Bradford <[email protected]>
Also see [0] for more info.

[0] https://docs.kernel.org/virt/kvm/api.html#the-kvm-run-structure

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
This is now done as part of handling exits that result in an EINTR error.

Signed-off-by: Rob Bradford <[email protected]>
Signed-off-by: Rob Bradford <[email protected]>
Force-pushed from 1c332a6 to 3227f5f
```rust
(libc::SYS_recvfrom, vec![]),
(libc::SYS_recvmsg, vec![]),
(libc::SYS_sched_yield, vec![]),
(libc::SYS_sendto, vec![]),
```
This commit is missing the "why" part.
```rust
/// Should be called if the current `self` had called `init_thread_local_data()` and
/// now needs to move to a different thread.
///
/// Fails if `self` was not previously associated with the current thread.
```
> Fails if `self` was not previously associated with the current thread.

In what realistic scenarios can this happen?
To me, that's likely a programming error, similar to Error::VcpuTlsInit (where we did panic from its caller side with an expect()). Perhaps, we should panic too with Error::VcpuTlsNotPresent from the caller, say from Vcpu::drop().
```rust
/// Cannot clean init vcpu TLS.
#[error("Failed to initialise vCPU TLS")]
VcpuTlsInit,
#[error("vCPU TLS not present")]
```
I think we should add a little more context, something like:

```diff
-    #[error("vCPU TLS not present")]
+    #[error("vCPU was moved to another thread early and TLS is no longer present")]
```
Please note that this discussion is still open. Why can the vCPU be moved to another thread here? I'm not sure how useful this error message is.
If we have a hard error here, I'd rather panic?
I don't think a vCPU instance (of the struct Vcpu) will be moved to a different thread. To avoid confusion: the discussion you referred to is about the same thread being executed on two different physical cores (e.g. thread migration).
Following up on #7428 (comment): Either I'm blind or you somehow forgot to unregister the signal handler? 😀
```rust
if let Some(vcpu_ptr) = cell.get() {
    // SAFETY: Dereferencing here is safe since `TLS_VCPU_PTR` is populated/non-empty,
    // and it is being cleared on `Vcpu::drop` so there is no dangling pointer.
    let vcpu_ref: &mut Vcpu = unsafe { &mut *vcpu_ptr };
```
I fear this will break Rust's aliasing rules.
What happens if you are in vcpu.run() here and then the signal handler gets triggered, which will then run this function? Then you have two mutable references to the same vCPU resource, don't you?
Good catch! However, it might be that there is a "loophole" for us in the undefined behavior spectrum - as we have different execution contexts (normal thread, thread handling the signal) and different stack variables, I think it could be okay.
I don't know how to verify either, though.
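To make the aliasing concern concrete, here is a schematic of the pattern in question (all names and types are illustrative placeholders, not the actual Cloud Hypervisor code):

```rust
use std::cell::Cell;

struct Vcpu; // placeholder for the real vmm Vcpu
impl Vcpu {
    fn kick(&mut self) { /* e.g. call set_immediate_exit(true) */ }
    fn run(&mut self) { /* KVM_RUN loop; `&mut self` stays live here */ }
}

thread_local! {
    // Raw pointer to the vCPU owned by this thread, set by init_thread_local_data().
    static TLS_VCPU_PTR: Cell<*mut Vcpu> = Cell::new(std::ptr::null_mut());
}

// Invoked when the RT signal interrupts the vCPU thread.
fn on_signal() {
    TLS_VCPU_PTR.with(|cell| {
        let ptr = cell.get();
        if !ptr.is_null() {
            // Aliasing hazard: `Vcpu::run(&mut self)` may still be live further
            // down this thread's stack, so this creates a second `&mut` to the
            // same Vcpu while the first one exists.
            let vcpu: &mut Vcpu = unsafe { &mut *ptr };
            vcpu.kick();
        }
    });
}
```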
From our internal discussion:
- It might be okay to have an imperfect fix for now with potential UB (thanks Rob for working on this, really!)
- Assuming we have UB here indeed, a possible path forward for a correct solution could be:
  - add functionality to `VcpuFd` in `kvm-ioctls` to create new distinct mappings of KVM_RUN
  - add an atomic way to read/write `immediate_exit` from Rust
  - use this abstraction in CHV and create a reasonably nice interface within CHV for that
  - CHV would then use a distinct KVM_RUN mapping (which holds `immediate_exit`) in its vCPU TLS

Further:
- If we indeed have UB here, Firecracker runs into the same UB problem
- While Firecracker may not serve as the ultimate reference implementation, it successfully runs millions of VMs in production, which suggests that this is probably not "critical undefined behavior".
- It might be that the Rust language/compiler team one day says what we are doing here is not UB but a valid corner case - at this point in time this is not clear

@olivereanderson asks the Rust team upstream (link will be added ASAP) what their take on our UB discussion is. Perhaps we could wait on a reply there before we merge this?
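As a rough illustration of the "atomic immediate_exit on a distinct mapping" idea from the list above, a hedged sketch (request_immediate_exit is a hypothetical helper, not an existing kvm-ioctls API; the field comes from the kvm_run layout in kvm-bindings):

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Hypothetical helper: arm immediate_exit without forming a `&mut` to the
/// whole kvm_run area or to the Vcpu object.
///
/// # Safety
/// `kvm_run_ptr` must point to the live, mmap'ed kvm_run region of the vCPU.
unsafe fn request_immediate_exit(kvm_run_ptr: *mut kvm_bindings::kvm_run) {
    // Take the address of the single `immediate_exit` byte without creating a
    // reference to the surrounding struct.
    let flag: *mut u8 = std::ptr::addr_of_mut!((*kvm_run_ptr).immediate_exit);
    // Treat that byte as an atomic so the store remains well-defined even if it
    // races with the vCPU thread entering KVM_RUN on another core.
    AtomicU8::from_ptr(flag).store(1, Ordering::Release);
}
```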
I think the TL;DR of the rust-lang discussion is: it is very simple. We just need to define a new formal memory model covering typical low-level patterns, implement this model in LLVM and Rust, and then adapt the Cloud Hypervisor code to it.
Just kidding. It is inherently complex. I'm in favor of doing what works technically so far - this is what Firecracker uses as well and what Rob used here too. Although I'm not 100% happy with that.
Just because Firecracker has got away with it so far doesn't mean it isn't liable to break at any time — the several instances of UB I've found in Cloud Hypervisor have all become apparent because a rustc/LLVM upgrade suddenly made it do something nonsensical, where before it was fine.
I'd focus on finding the best solution for solving #7427 with this PR. If carrying a known UB is the best solution we have for now, I'd go with it, with clear documentation on the UB of course.
> Just because Firecracker has got away with it so far doesn't mean it isn't liable to break at any time

Firecracker stopped doing the "vcpu pointer in TLS" thing a few months ago for exactly this reason.
likebreath left a comment
```rust
// SAFETY: This is safe because it's temporarily aliasing the `Vcpu` object, but we are
// only reading `vcpu.fd` which does not change for the lifetime of the `Vcpu`.
unsafe {
    let _ = Vcpu::run_on_thread_local(|vcpu| {
```
A similar comment here regarding error handling: perhaps it is better to panic with Error::VcpuTlsNotPresent, since the error indicates programming errors.
I have an alternative here - #7455 - that doesn't require TLS.
Problem is described in excellent detail in #7427. In essence it was
possible for the pause to be triggered between when we check for the
pause atomic being signalled and when we enter KVM_RUN. This would then
result in that pause being lost until the next KVM_EXIT. The code
already made use of KVM set_immediate_exit() for handling the
completion of I/O exits and the RT signal for breaking the execution of
KVM_RUN.
Using the code from Firecracker:
firecracker-microvm/firecracker@b359722
Implement Thread Local Storage (TLS) for the Vcpu object and then use
the RT signal to trigger KVM set_immediate_exit(). The use of TLS is
necessary as signals are not thread safe.
Fixes: #7427
Signed-off-by: Rob Bradford [email protected]
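For illustration, a hedged sketch of the race window the commit message describes; the types and function names below are placeholders, not the actual Cloud Hypervisor vCPU loop:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Illustrative placeholders, not the real Cloud Hypervisor types.
struct Vcpu;
impl Vcpu {
    fn wait_until_resumed(&mut self) { /* park until the pause is lifted */ }
    fn enter_kvm_run(&mut self) { /* performs the KVM_RUN ioctl */ }
}

fn vcpu_loop(pause_requested: &AtomicBool, vcpu: &mut Vcpu) {
    loop {
        if pause_requested.load(Ordering::SeqCst) {
            vcpu.wait_until_resumed();
        }
        // A pause signalled right here, after the check but before KVM_RUN is
        // entered, used to be lost until the guest produced the next KVM_EXIT.
        // Arming set_immediate_exit() from the RT signal handler closes exactly
        // this window: KVM_RUN then returns immediately with EINTR.
        vcpu.enter_kvm_run();
    }
}
```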