
Conversation

@rbradford (Member)

  • vmm: properly unset immediate_exit on -EINTR
  • vmm: cpu: Remove (now) unnecessary clearing of KVM set_immediate_exit()
  • vmm: Use set_immediate_exit to avoid a race on pausing the vCPU
  • vmm: cpu: Clear immediate_exit flag on resume

phip1611 and others added 4 commits November 3, 2025 15:20
This is now done as part of handling exits that result in an EINTR
error.

Signed-off-by: Rob Bradford <[email protected]>
A race can occur if the signal to the vCPU thread is sent between the
shared state atomic being checked (e.g. for paused) and entering
KVM_RUN. As such we can use KVM's immediate_exit boolean to make the
KVM_RUN exit immediately.

We only conditionally try to take the lock, because we can end up in a
deadlock from a priority inversion situation where the ordering of
signals doesn't match the ordering in which we take the locks.

Fixes: cloud-hypervisor#7427

Signed-off-by: Rob Bradford <[email protected]>
This avoids an extra KVM_RUN exit upon resume where the immediate_exit
flag is still set.

Signed-off-by: Rob Bradford <[email protected]>
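
To make the race described in these commit messages concrete, here is a minimal sketch of a vCPU run loop, assuming the kvm-ioctls crate (vcpu_loop and the paused atomic are illustrative names, not the actual cloud-hypervisor code). KVM's immediate_exit flag lives in the mmap'd kvm_run structure and is re-checked inside the KVM_RUN ioctl itself, so setting it from the VMM thread closes the window between the pause check and entering the ioctl:

    use std::sync::atomic::{AtomicBool, Ordering};

    use kvm_ioctls::{VcpuExit, VcpuFd};

    fn vcpu_loop(vcpu_fd: &mut VcpuFd, paused: &AtomicBool) {
        loop {
            if paused.load(Ordering::SeqCst) {
                break; // Parked here until the VMM thread resumes us
            }
            // A pause signal landing right here would be consumed by the
            // no-op handler and lost; because immediate_exit is re-read
            // inside the ioctl, setting it still forces an exit.
            match vcpu_fd.run() {
                Ok(VcpuExit::Hlt) => break,
                Ok(_) => continue, // handle MMIO/PIO exits, then re-enter
                Err(e) if e.errno() == libc::EINTR => {
                    // Kicked out by the signal or by immediate_exit: clear
                    // the flag and loop around to re-check the pause state.
                    vcpu_fd.set_kvm_immediate_exit(0);
                    continue;
                }
                Err(e) => panic!("KVM_RUN failed: {e}"),
            }
        }
    }
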
@rbradford rbradford force-pushed the 2025-11-03-vcpu-immediate-exit-on-pause branch from a5786b5 to 8b1ac03 on November 4, 2025 16:07
@rbradford rbradford marked this pull request as ready for review November 4, 2025 16:46
@rbradford rbradford requested a review from a team as a code owner November 4, 2025 16:46
    Ok(mut vcpu) => {
        vcpu.vcpu.set_immediate_exit(true);
    }
    Err(std::sync::TryLockError::WouldBlock) => {}
Member

Would it be better (e.g. reducing the race window further) if we called into signal_thread() here, rather than waiting to loop over all vCPUs? Essentially, we would be using vcpu.try_lock() to detect from the VMM thread whether the vCPU thread is executing KVM_RUN, and pick either set_immediate_exit(true) or signal_thread() to kick the vCPU thread out.

@rbradford (Member, Author) commented Nov 4, 2025

I think picking between two paths would actually increase the chance of a race (and setting immediate_exit or signalling are both cheap) - one of the reasons we have to do it like this is that these loops are split. If we recombine them like this, I think we can avoid the race entirely (this reduces the chance further, but I think it could still theoretically happen):

    fn signal_vcpus(&mut self) {
        // If the lock is unavailable it is almost certainly due to being in KVM_RUN itself so the
        // signal will kick it out.
        for vcpu_id in 0..self.vcpus.len() {
            match self.vcpus[vcpu_id].try_lock() {
                Ok(mut vcpu) => {
                    vcpu.vcpu.set_immediate_exit(true);
                }
                Err(std::sync::TryLockError::WouldBlock) => {}
                Err(e) => {
                    unreachable!("{e:?}")
                }
            }

            self.vcpu_states[vcpu_id].signal_thread();

            // In this case it's fine to wait for the vCPU run to make the lock available for the taking
            self.vcpus[vcpu_id]
                .lock()
                .unwrap()
                .vcpu
                .set_immediate_exit(true);

            self.vcpu_states[vcpu_id].wait_until_signal_acknowledged();
        }
    }

We can't do the two attempts - one non-blocking, one blocking - with the split-loop variant, as you get a deadlock due to the signals not being handled in vCPU order.

@rbradford (Member, Author)

These were split into two loops in this commit - c1f4df6

@likebreath (Member) commented Nov 4, 2025

> I think picking between two paths would actually increase the chance of a race

Hmm... my view is the opposite. I thought that with such a change we would be able to further reduce the time window between the VMM thread detecting that the vCPU thread is in KVM_RUN and the VMM thread kicking the vCPU thread with a signal, which is the racing window. Reducing the race window would reduce the chances of the race, wouldn't it?

The change I am thinking is:

    fn signal_vcpus(&self) {
        // If the lock is unavailable it is almost certainly due to being in KVM_RUN itself so the
        // signal will kick it out.
        for vcpu_id in 0..self.vcpus.len() {
            match self.vcpus[vcpu_id].try_lock() {
                Ok(mut vcpu) => {
                    // vCPU thread is not in KVM_RUN, kick it out with set_immediate_exit
                    vcpu.vcpu.set_immediate_exit(true);
                }
                Err(std::sync::TryLockError::WouldBlock) => {
                    // vCPU thread is in KVM_RUN, kick it out with signal
                    self.vcpu_states[vcpu_id].signal_thread();
                }
                Err(e) => {
                    unreachable!("{e:?}")
                }
            }
        }

        for state in self.vcpu_states.iter() {
            state.wait_until_signal_acknowledged();
        }
    }

@rbradford (Member, Author)

Right - but the thing is that KVM_RUN isn't the only place that the lock is taken - it's just the most likely one. I think conditionally sending the signal could result in situations where wait_until_signal_acknowledged() spins forever.

@rbradford (Member, Author)

I've gone ahead and added even more "defence in depth" - by making the signalling check for acknowledgement and retry.

Member

> Right - but the thing is that KVM_RUN isn't the only place that the lock is taken

I see. How about the following change? It should help with reducing the race further, while not assuming KVM_RUN is the only place where the vcpu lock is held.

    fn signal_vcpus(&self) {
        // If the lock is unavailable it is almost certainly due to being in KVM_RUN itself so the
        // signal will kick it out.
        for vcpu_id in 0..self.vcpus.len() {
            match self.vcpus[vcpu_id].try_lock() {
                Ok(mut vcpu) => {
                    // Try to avoid the race mentioned above, when the vcpu lock is not taken.
                    vcpu.vcpu.set_immediate_exit(true);
                }
                Err(std::sync::TryLockError::WouldBlock) => {}
                Err(e) => {
                    unreachable!("{e:?}")
                }
            }

            // Always send a signal to kick out the vCPU thread, as KVM_RUN is not
            // the only place where the vCPU lock could be taken.
            self.vcpu_states[vcpu_id].signal_thread();
        }

        for state in self.vcpu_states.iter() {
            state.wait_until_signal_acknowledged();
        }
    }

@rbradford (Member, Author)

Sure - let me incorporate that change - will tackle that tomorrow.

@rbradford rbradford force-pushed the 2025-11-03-vcpu-immediate-exit-on-pause branch from 6061b65 to 273b3ec on November 4, 2025 19:46
vmm/src/cpu.rs Outdated
Comment on lines 714 to 717
    ///
    /// Returns whether the vCPU was interrupted within the 1 second timeout
    fn wait_until_signal_acknowledged(&self) -> bool {
        let start = Instant::now();
Member

Let's split this commit out into a different PR, as the change addresses a separate topic.

Also, we should report at least a warning if a timeout occurs here, as this likely suggests abnormal behavior that warrants our attention.

@rbradford (Member, Author) commented Nov 4, 2025

I've actually improved it since - but yes, let me split it out (and tbh, this change on its own would probably fix the lost signal issue in #7427 itself).

By retrying the signalling on the vCPU thread if it doesn't respond
within 1 second, the race between pause being requested and the KVM_RUN
ioctl returning can be mitigated.

Signed-off-by: Rob Bradford <[email protected]>
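
A minimal sketch of that retry mechanism (the VcpuState struct, its signal_acknowledged field and the kick_vcpu_with_retry helper are illustrative assumptions, not the actual cloud-hypervisor types; the 1-second timeout and the wait_until_signal_acknowledged name come from the snippets above):

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::time::{Duration, Instant};

    struct VcpuState {
        // Set by the vCPU thread once it has observed the kick signal.
        signal_acknowledged: AtomicBool,
        // ... plus the thread id used to signal the vCPU thread ...
    }

    impl VcpuState {
        fn signal_thread(&self) {
            // pthread_kill() on the vCPU thread in the real code.
        }

        /// Returns whether the vCPU acknowledged within the 1 second timeout.
        fn wait_until_signal_acknowledged(&self) -> bool {
            let start = Instant::now();
            while start.elapsed() < Duration::from_secs(1) {
                if self.signal_acknowledged.load(Ordering::Acquire) {
                    return true;
                }
                std::hint::spin_loop();
            }
            false
        }

        // Re-send the kick until the vCPU thread acknowledges, so a signal
        // lost in the pause/KVM_RUN race is eventually delivered again.
        fn kick_vcpu_with_retry(&self) {
            loop {
                self.signal_thread();
                if self.wait_until_signal_acknowledged() {
                    break;
                }
                eprintln!("vCPU did not acknowledge the signal within 1s; retrying");
            }
        }
    }
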
@rbradford rbradford force-pushed the 2025-11-03-vcpu-immediate-exit-on-pause branch from 273b3ec to b8a5afc on November 4, 2025 20:25
@phip1611 (Member) commented Nov 5, 2025

Just to make sure that I get it right: your high-level idea is that, instead of doing (unsafe) stuff in the signal handler, we stick to the NOOP signal handler, and you instead tackle the challenge at the places that cause the signal to be sent?

I think this could be a possible way forward, but we need some time to think about it and evaluate it. Thanks for working on this!
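
For context, a minimal sketch of such a no-op handler, assuming raw libc and SIGRTMIN as the kick signal (register_kick_signal is an illustrative name; cloud-hypervisor's actual registration code may differ). The handler body is deliberately empty: delivery of the signal alone is enough to make a blocked KVM_RUN ioctl return -EINTR, and an empty body keeps the handler trivially async-signal-safe:

    extern "C" fn noop_handler(_sig: libc::c_int) {}

    fn register_kick_signal() {
        // SAFETY: we install a handler whose body performs no work at all,
        // so it cannot violate async-signal-safety.
        unsafe {
            let mut action: libc::sigaction = std::mem::zeroed();
            action.sa_sigaction = noop_handler as extern "C" fn(libc::c_int) as usize;
            libc::sigaction(libc::SIGRTMIN(), &action, std::ptr::null_mut());
        }
    }
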

@phip1611 (Member)

I've added our (=@cyberus-technology) analysis, statement, and recommendation in the issue #7427 (comment).

@rbradford (Member, Author)

Closing as @phip1611 is going to raise a new PR with a proposed solution.

@rbradford rbradford closed this Nov 11, 2025