Skip to content

Conversation

@phip1611
Copy link
Member

@phip1611 phip1611 commented Mar 11, 2025

About

Add advisory file locks for disk images, aligning the behavior with QEMU. Please check out the commit messages for more details.

Hints for Reviewers

Please review this commit-by-commit.

Steps to Undraft

  • basic functionality
  • add command line option to make this configurable (context)
  • solution for live migration

@phip1611 phip1611 requested a review from a team as a code owner March 11, 2025 15:50
@phip1611 phip1611 force-pushed the disk-lock branch 2 times, most recently from bceac2b to 2b03eb1 Compare March 11, 2025 15:53
@iggy
Copy link
Contributor

iggy commented Mar 11, 2025

This should either be opt-in (for backwards compat) or at the very least have a way to opt-out. I'm pretty sure there's a test that migrates on the same host that will fail.

@phip1611 phip1611 force-pushed the disk-lock branch 2 times, most recently from 8560a88 to 17f6654 Compare March 11, 2025 16:04
@phip1611
Copy link
Member Author

phip1611 commented Mar 11, 2025

@iggy I think you're right. I'll look into how QEMU solves this and add a second commit soon.

@phip1611 phip1611 force-pushed the disk-lock branch 2 times, most recently from 1767409 to 0d71e3f Compare March 11, 2025 16:08
@rbradford
Copy link
Member

There is an intentional use case here where you

@iggy I think you're right. I'll look into how QEMU solves this and add a second commit soon.

Yes - you will need to add an option to --block - I propose lock with an enum of "off, warn, on" . Default to warn for this this release and then move to "on" from a future release.

@phip1611 phip1611 marked this pull request as draft March 11, 2025 19:28
@phip1611
Copy link
Member Author

phip1611 commented Mar 11, 2025

From what I've seen, there is no code I could reuse for a tristate parsing (off|warn|on), it also sounds a little feature creep to me 🤔. I'd much rather stick with two states so that I can reuse the Toggle abstraction.

I'd like to suggest that the default behavior from now on is:

  • off => always log::warn if the file is locked but don't throw an error
  • on => return an error

What do you think, @rbradford?

phip1611 added a commit to phip1611/cloud-hypervisor that referenced this pull request Mar 11, 2025
We discussed [0] that the new behaviour should be made configurable with
a tristate: off,warn,on. We add it to the next release with "warn" by
default and "on" for the release after that.

[0]
cloud-hypervisor#6974 (comment)
@rbradford
Copy link
Member

From what I've seen, there is no code I could reuse for a tristate parsing (off|warn|on), it also sounds a little feature creep to me 🤔. I'd much rather stick with two states so that I can reuse the Toggle abstraction.

I'd like to suggest that the default behavior from now on is:

* `off` => always `log::warn` if the file is locked but don't throw an error

* `on` => return an error

What do you think, @rbradford?

I think I would prefer the enum of three different values as if you know what you're doing and you don't want a warn spamming you then you can choose "off". You can use vm_config::HotplugMethod as an example on how to implement. If you give it a general name like WarnToggle it might get used elsewhere.

phip1611 added a commit to phip1611/cloud-hypervisor that referenced this pull request Mar 12, 2025
We discussed [0] that the new behaviour should be made configurable with
a tristate: off,warn,on.

 off  => ignore
 warn => Warn when disk-images are already used by other processes and
         this VM wants to have exclusive access. Don't acquire any locks.
 on   => Enforce the acquiring of advisory file-system locks for disk
        image files.

 "warn" is the default for now and we will soon switch to "on"

[0]
cloud-hypervisor#6974 (comment)
phip1611 added a commit to phip1611/cloud-hypervisor that referenced this pull request Mar 12, 2025
We discussed [0] that the new behaviour should be made configurable with
a tristate: off,warn,on.

 off  => ignore
 warn => Warn when disk-images are already used by other processes and
         this VM wants to have exclusive access. Don't acquire any locks.
 on   => Enforce the acquiring of advisory file-system locks for disk
        image files.

 "warn" is the default for now and we will soon switch to "on"

[0]
cloud-hypervisor#6974 (comment)

Signed-off-by: Philipp Schuster <[email protected]>
phip1611 added a commit to phip1611/cloud-hypervisor that referenced this pull request Mar 12, 2025
We discussed [0] that the new behaviour should be made configurable with
a tristate: off,warn,on.

 off  => ignore
 warn => Warn when disk-images are already used by other processes and
         this VM wants to have exclusive access. Don't acquire any
         locks.
 on   => Enforce the acquiring of advisory file-system locks for disk
        image files.

 "warn" is the default for now and we will soon switch to "on"

[0]
cloud-hypervisor#6974 (comment)

Signed-off-by: Philipp Schuster <[email protected]>
@phip1611 phip1611 force-pushed the disk-lock branch 2 times, most recently from ba8b35c to d404fd4 Compare March 12, 2025 09:50
@phip1611
Copy link
Member Author

phip1611 commented Mar 12, 2025

I adjusted the CLI as discussed. I'm happy for a review of that part. I'll look into a solution for the live-migration as last step.

It is unfortunate that gitlint doesn't ignore links. How do we solve the failing CI?
Fixed that in a dedicated commit, hope this is fine.

phip1611 added a commit to phip1611/cloud-hypervisor that referenced this pull request Mar 12, 2025
We discussed [0] that the new behaviour should be made configurable with
a tristate: off,warn,on.

 off  => ignore
 warn => Warn when disk-images are already used by other processes and
         this VM wants to have exclusive access. Don't acquire any
         locks.
 on   => Enforce the acquiring of advisory file-system locks for disk
        image files.

 "warn" is the default for now and we will soon switch to "on"

[0]
cloud-hypervisor#6974 (comment)

Signed-off-by: Philipp Schuster <[email protected]>
@phip1611 phip1611 force-pushed the disk-lock branch 2 times, most recently from 0ec8668 to 1ca0780 Compare March 12, 2025 10:21
@rbradford
Copy link
Member

@blitz What's the reason this PR is still draft? Can we move ahead with it?

@phip1611 phip1611 marked this pull request as draft May 14, 2025 10:53
@phip1611
Copy link
Member Author

phip1611 commented May 14, 2025

Yikes, took quite a while. The CI integration tests are finally green.

Since your last major review, the following changes have been added:

  • Device hot plugging of disks and also unplugging
  • DiskFile::fd() is safer to use

@rbradford @likebreath

A comprehensive write-up is in the commit message. Excerpt:

block: introduce advisory locks for disk image files

# What

This commit introduces file-based advisory locking for the files backing
up the block devices by using the fcntl() syscall with OFD locks. The
per-open-file-descriptor (OFD) locks are more robust than traditional
POSIX locks (F_SETLK) as they are not tied to process IDs and avoid
common issues in multithreaded or multi-fd scenarios [1]. Therefore,
we don't use `std::fs::File::try_lock()`, which is backed by F_SETLKW.

The locking mechanism is aware of the `readonly` property and allows
`n` readers or `1` writer (exclusive mode).

As the locks are advisory, multiple cloud-hypervisor processes can
prevent themselves from writing to the same file. However, this is not
a system-wide file-system level locking mechanism preventing to open()
a file.

The introduced new locking mechanism does not cover vhost-user devices.

# Why

To prevent misconfiguration and improve safety, it is good practice to
protect disk image files with a locking mechanism. Experience and common
best practices suggest that advisory locks are preferable over mandatory
locks due to better compatibility and fewer pitfalls (in fs space).

The introduced functionality is aligned with the approach taken by
QEMU [0], and is also recommended in [1].

# Implementation Details

We need to ensure that not only normal operation keeps working but also
state save/resume and live-migration. Especially for live migration,
it is crucial that the sender VMM releases the locks when the VM stops
so the receiver VMM can acquire them right after that.

Therefore, the locking and releasing happen directly on the block
device struct. The device manager knows all block devices and can
forward requests to these types.

Last but not least, this commit uses on explicit lock acquiring
but implicit lock releasing (FD close). It only explicitly releases
the locks where this integrates more smoothly into the existing
code.

# Testing

I tested
- normal operation
- state save/resume,
- device hot plugging,
- and live-migration
with read/shared and write/exclusive locks.

One can use the `fcntl-tool` to test if locks are actually acquired
or released [2].

# Links

[0] https://github.com/qemu/qemu/blob/825b96dbcee23d134b691fc75618b59c5f53da32/util/osdep.c#L266
[1] https://apenwarr.ca/log/20101213
[2] https://crates.io/crates/fcntl-tool

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP <[email protected]>

@phip1611 phip1611 marked this pull request as ready for review May 14, 2025 12:42
@phip1611 phip1611 force-pushed the disk-lock branch 2 times, most recently from e117b5f to 6ef9f73 Compare May 14, 2025 13:17
@phip1611 phip1611 requested a review from rbradford May 14, 2025 13:17
Copy link
Member

@rbradford rbradford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor feedback about the last commit.

phip1611 added 7 commits May 16, 2025 09:18
This simplifies debugging of running VMs.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
This is a prerequisite for the next steps.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
As we can't use BorrowedFd, we should at least create a similar
safe alternative.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
This is a prerequisite for the next steps.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
# What

This commit introduces file-based advisory locking for the files backing
up the block devices by using the fcntl() syscall with OFD locks. The
per-open-file-descriptor (OFD) locks are more robust than traditional
POSIX locks (F_SETLK) as they are not tied to process IDs and avoid
common issues in multithreaded or multi-fd scenarios [1]. Therefore,
we don't use `std::fs::File::try_lock()`, which is backed by F_SETLKW.

The locking mechanism is aware of the `readonly` property and allows
`n` readers or `1` writer (exclusive mode).

As the locks are advisory, multiple cloud-hypervisor processes can
prevent themselves from writing to the same file. However, this is not
a system-wide file-system level locking mechanism preventing to open()
a file.

The introduced new locking mechanism does not cover vhost-user devices.

# Why

To prevent misconfiguration and improve safety, it is good practice to
protect disk image files with a locking mechanism. Experience and common
best practices suggest that advisory locks are preferable over mandatory
locks due to better compatibility and fewer pitfalls (in fs space).

The introduced functionality is aligned with the approach taken by
QEMU [0], and is also recommended in [1].

# Implementation Details

We need to ensure that not only normal operation keeps working but also
state save/resume and live-migration. Especially for live migration,
it is crucial that the sender VMM releases the locks when the VM stops
so the receiver VMM can acquire them right after that.

Therefore, the locking and releasing happen directly on the block
device struct. The device manager knows all block devices and can
forward requests to these types.

Last but not least, this commit uses on explicit lock acquiring
but implicit lock releasing (FD close). It only explicitly releases
the locks where this integrates more smoothly into the existing
code.

# Testing

I tested
- normal operation
- state save/resume,
- device hot plugging,
- and live-migration
with read/shared and write/exclusive locks.

One can use the `fcntl-tool` to test if locks are actually acquired
or released [2].

# Links

[0] https://github.com/qemu/qemu/blob/825b96dbcee23d134b691fc75618b59c5f53da32/util/osdep.c#L266
[1] https://apenwarr.ca/log/20101213
[2] https://crates.io/crates/fcntl-tool

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
The new locking behavior uncovered that unfortunate test situation:
Many tests running in parallel access the same disk image with
rw permissions. Luckily, none of the tests actually writes to
the disk. Therefore, we can set it to readonly=true. In case this
changes, the test needs to be moved to the sequential test module.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
@rbradford rbradford added this pull request to the merge queue May 16, 2025
Merged via the queue into cloud-hypervisor:main with commit d4718a9 May 16, 2025
39 checks passed
@phip1611 phip1611 deleted the disk-lock branch May 16, 2025 09:14
@phip1611
Copy link
Member Author

Great that this is merged! Feel free to ping me in case any bug reports are incoming; I'm happy to help out

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

6 participants