-
Notifications
You must be signed in to change notification settings - Fork 565
vcpu lifecycle: fix race-condition in resume()-pause() cycle #7397
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vcpu lifecycle: fix race-condition in resume()-pause() cycle #7397
Conversation
1be7611 to
dec7e69
Compare
|
Thanks @phip1611 - i've seen this. I'm going to need a bit of time with a clear head to give this a proper review. Thanks for your patience. |
In the commit message there is a reproducer! We discovered and fixed this in a four person -two-day workshop 😁 So no worries, it is hard to debug these issues. I can tell you that since we have this patch, we've successfully ran tens of thousands of live-migrations in our test setup including vCPU auto-converging (vCPU throttling - not upstream). |
likebreath
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reporting and fixing the bug with detailed information. Overall look good to me. Some minor comments below for clarify.
dec7e69 to
bf75a81
Compare
|
I think we should be good to go now. |
bf75a81 to
f9965a8
Compare
The vCPU lifecycle is already complicated. Let's remove dead code (the impl is a no-op). Signed-off-by: Philipp Schuster <[email protected]> On-behalf-of: SAP [email protected]
These bindings actually refer to atomic bool shared across all vCPUs to instruct single vCPUs with their next action. As there are already enough Arc<AtomicBool>, this helps while debugging things to see that different bindings refer to the same atomic bool. In other words: This naming really confused us while debugging. Signed-off-by: Philipp Schuster <[email protected]> On-behalf-of: SAP [email protected]
Fix a race condition that happens in resume()-pause() cycles. It is odd that for pause(), the CpuManager waited via `state.paused` for the vCPU thread to ACK the state change but not for `resume()`. In the `resume()` case, oddly CpuManager "owned" the state change in `state.paused`. This commit changes this so that the vCPU ACKs its state change itself in `state.paused` when it transitions from pause->run. Further, `CpuManager::resume()` now gracefully waits for the vCPU to be resumed. More technical: This change ensures proper synchronization and prevents situations in that park() follows right after unpark(), causing deadlocks and other weird behavior due to race conditions. Calling resume() now takes slightly longer, very similar to pause(). This is, however, even for 254 vCPUs in the range of less than 10ms, and ultimately we now have correct behaviour. ## Reproducer Since [0] is merged, the underlying problem can be tested without this commit by modifying the pause() API call to run `CpuManager::pause()` and `CpuManager::resume()` in a loop a thousand times. `ch-remote --api-socket ... pause` ```patch diff --git a/vmm/src/vm.rs b/vmm/src/vm.rs index d7bba25..35557d58f 100644 --- a/vmm/src/vm.rs +++ b/vmm/src/vm.rs @@ -2687,6 +2687,10 @@ impl Pausable for Vm { MigratableError::Pause(anyhow!("Error activating pending virtio devices: {:?}", e)) })?; + for _ in 0..1000 { + self.cpu_manager.lock().unwrap().pause()?; + self.cpu_manager.lock().unwrap().resume()?; + } self.cpu_manager.lock().unwrap().pause()?; self.device_manager.lock().unwrap().pause()?; ``` ## Outlook Decades of experience in VMM development showed us that using many AtomicBools is a footgun. They are not synchronized with each other at all. On the long term, we might want to refactor things to have a single shared AtomicU64 with different bits having different meanings. [0] cloud-hypervisor#7290 Signed-off-by: Philipp Schuster <[email protected]> On-behalf-of: SAP [email protected]
f9965a8 to
bb3c012
Compare
|
Rebased - hopefully the flakyness in the net_hotplug test is gone now. |
| while vcpus_pause_signalled.load(Ordering::SeqCst) { | ||
| thread::park(); | ||
| } | ||
| vcpu_paused.store(false, Ordering::SeqCst); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw: this is the line were the individual vCPU tells the cPU manager is has been resumed
|
more CI flakiness.. now the |
d39b565
List of Changes
Hints for Reviewers
Please review the commits one-by-one, as they include a comprehensive description.