@phip1611
Member

@phip1611 phip1611 commented Jun 19, 2025

This adds support for the live migration of virtio-net devices backed by
network FDs along with some pre-requisites and QoL improvements.

So far, this was tested successfully with a patched libvirt setup (patches will be upstreamed soon). @hertrste is the responsible person here.

Closes #7054 #7291.

Hints for Reviewers

  • Please review this commit-by-commit

Steps to Undraft

@phip1611 phip1611 requested a review from a team as a code owner June 19, 2025 12:24
@phip1611 phip1611 force-pushed the network-fd-livemig branch 2 times, most recently from 2771036 to 07dbecf Compare June 19, 2025 12:56
@phip1611 phip1611 marked this pull request as draft June 19, 2025 12:56
@phip1611 phip1611 force-pushed the network-fd-livemig branch 2 times, most recently from dce61a3 to 4943a69 Compare June 19, 2025 13:00
@phip1611
Member Author

phip1611 commented Jun 19, 2025

I'm not sure how to address the live-migration of virtio-net devices in ch-remote. Passing FDs there, such as ch-remote receive-migration 'receiver_url=tcp:127.0.0.1:1337,net_fds=[net1@[42]]', doesn't really make sense, as we can't actually send FDs from ch-remote to the cloud-hypervisor process.

In the case of libvirt, things work because there is a UNIX domain socket over which one can send FDs in an SCM_RIGHTS message alongside the HTTP REST JSON request.
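
For readers unfamiliar with the mechanism, here is a minimal Python sketch of passing a descriptor over a UNIX domain socket with SCM_RIGHTS. It is not Cloud Hypervisor or libvirt code: a socketpair stands in for the API socket, a pipe stands in for the tap FD, and the JSON payload is a placeholder. It assumes Python 3.9+ on a Unix-like system.

```python
import os
import socket

# Assumption: a connected socket pair stands in for the hypervisor's API socket.
sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# Assumption: a pipe stands in for the tap device descriptor.
read_end, write_end = os.pipe()

# The payload travels as regular data, the descriptor as SCM_RIGHTS ancillary data.
socket.send_fds(sender, [b'{"example": "request"}'], [write_end])

# The receiver gets the payload plus a new descriptor number in its own FD table.
payload, fds, _flags, _addr = socket.recv_fds(receiver, 1024, 1)
received_fd = fds[0]

# Writing through the received descriptor reaches the same underlying pipe.
os.write(received_fd, b"ping")
echoed = os.read(read_end, 4)
print(payload, echoed, received_fd != write_end)
```

The received descriptor number generally differs from the one that was sent, but both refer to the same open file, which is why the receiving process can use it as if it had opened the device itself.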

My current preferred approach: When a user provides net_fds in ch-remote, just fail with an error message. Something like "feature not supported because ...". Seems reasonable, what do you think?

@phip1611 phip1611 marked this pull request as ready for review June 19, 2025 13:21
@phip1611 phip1611 marked this pull request as draft June 19, 2025 13:22
@phip1611 phip1611 force-pushed the network-fd-livemig branch 5 times, most recently from f4c1a89 to 85c47f5 Compare June 19, 2025 16:29
@phip1611 phip1611 changed the title from "add support for live-migration of virtio-net devices & network FDs" to "[WIP] add support for live-migration of virtio-net devices & network FDs" Jun 19, 2025
@phip1611 phip1611 force-pushed the network-fd-livemig branch 6 times, most recently from 6434161 to 6e6888f Compare June 20, 2025 08:40
@alyssais
Member

I'm not sure how to address the live-migration of virtio-net devices in ch-remote. Passing FDs there, such as ch-remote receive-migration 'receiver_url=tcp:127.0.0.1:1337,net_fds=[net1@[42]]', doesn't really make sense, as we can't actually send FDs from ch-remote to the cloud-hypervisor process.

Why not?

@phip1611
Member Author

phip1611 commented Jun 23, 2025

I'm not sure how to address the live-migration of virtio-net devices in ch-remote. Passing FDs there, such as ch-remote receive-migration 'receiver_url=tcp:127.0.0.1:1337,net_fds=[net1@[42]]', doesn't really make sense, as we can't actually send FDs from ch-remote to the cloud-hypervisor process.

Why not?

Well, how do you open these FDs in ch-remote?

  • If ch-remote opens files by interface name, we can drop that step entirely => forward the Tap device name to Cloud Hypervisor
  • If we send FDs to ch-remote via a UNIX socket with an SCM_RIGHTS message, we can drop ch-remote and send the FDs directly to Cloud Hypervisor (as done in libvirt (to be specific: our not yet upstreamed patches for libvirt))

The only way I can think of is the following bash script to send FDs via ch-remote to Cloud Hypervisor:

# Open tap0 as FD 44
exec 44<>/tap/tap0
# Note that the API Receive Magic in CHV might replace this FD
# with an FD that is available in the FD space of the cloud
# hypervisor process; semantically it is the same FD tho
ch-remote --api-socket /tmp/chv-dst.sock \
  receive-migration receiver_url=/tmp/chv-migration.sock,net_fds=[net1@[44]]

but I don't know how practical or common this use-case is.
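
To make the net_fds syntax used above concrete, the following sketch parses such a value into a mapping from device ID to FD numbers. The function name parse_net_fds and the support for several FDs per device are assumptions made for illustration; this is not the parser used by ch-remote or Cloud Hypervisor.

```python
import re


def parse_net_fds(value: str) -> dict[str, list[int]]:
    """Parse a string such as '[net1@[44]]' into {'net1': [44]}."""
    if not (value.startswith("[") and value.endswith("]")):
        raise ValueError("expected a bracketed list, e.g. [net1@[44]]")
    result: dict[str, list[int]] = {}
    # Each entry has the form <device id>@[<fd>,<fd>,...].
    for match in re.finditer(r"([^@,\[\]]+)@\[([0-9,\s]*)\]", value[1:-1]):
        dev_id, fd_list = match.group(1).strip(), match.group(2)
        result[dev_id] = [int(fd) for fd in fd_list.split(",") if fd.strip()]
    return result


print(parse_net_fds("[net1@[44]]"))
```

The numbers are only meaningful in the process that holds the descriptors, which is the crux of the discussion here.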

@alyssais
Member

The only way I can think of is the following bash script to send FDs via ch-remote to Cloud Hypervisor:

I wouldn't do it in bash, but I'd be very interested in doing exactly this: invoking ch-remote with file descriptors set up, and referring to those descriptors on its command line.

@phip1611
Member Author

But how would you get these FDs to ch-remote? Why would a management layer do that when it can talk directly to CHV via the socket?

@alyssais
Member

Because it might be a script (I'd likely use execline, which is basically designed for doing this sort of thing, but could be a shell), or it could be some other simple thing where the overhead of doing JSON over HTTP is just higher than opening a file and then running a binary.
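
The workflow described here relies on ordinary FD inheritance: the parent opens a file, starts a binary, and names the descriptor on the command line. A minimal Python sketch of that pattern follows. The child is a small Python one-liner acting as a hypothetical stand-in for ch-remote, and a pipe stands in for the tap device.

```python
import os
import subprocess
import sys

# Assumption: a pipe stands in for the already opened tap device.
read_end, write_end = os.pipe()

# Hypothetical child program: takes an FD number from argv and writes to it.
child_code = "import os, sys; os.write(int(sys.argv[1]), b'hello from child')"

# pass_fds keeps the descriptor open in the child under the same number,
# so the number given on the command line is valid there.
subprocess.run(
    [sys.executable, "-c", child_code, str(write_end)],
    pass_fds=[write_end],
    check=True,
)

os.close(write_end)
message = os.read(read_end, 64)
print(message)
```

This only gets the descriptor into the child. Forwarding it from there to an already running process would still need a mechanism such as SCM_RIGHTS.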

@phip1611
Member Author

phip1611 commented Jun 23, 2025

Interesting, never heard of it. Thanks for the pointer! IMHO: I think what you proposed should be solved independently of this PR in a dedicated PR, as

  • one also needs to do this for the save state/resume path
  • it splits the complexity into reviewable units

Does that sound like a fair plan?

@alyssais
Member

Yeah, fine by me.

During development this new message helps to quickly parse the
logs for success. In case this message is not shown but the last
message is not an error, one can assume that a livelock
(locking a contended lock) is likely the cause of the problem.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]

One can call `to_vec()` anyway if one needs an owned copy. This change
further helps to prevent needless copies in upcoming changes.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]

To ease debugging of networking in the field, especially in context of
libvirt, state save/resume, and live-migration, more logging helps to
identify what happens behind the scenes in certain corner-cases as well
as (apparently) normal operation.

For example: There are multiple ways to create virtio-net devices:
- with vhost-user backend
- from provided network FDs
- from the provided interface name of a Tap device
- Tap device is created by CHV (fallback)

To confirm that the expected behavior occurs—especially in the more
complex case of network file descriptors—these logging statements
provide valuable insight into the system's internal operations.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
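
The four creation paths listed in the commit message above can be illustrated with a small sketch that logs which path was taken. The function name, its parameters, and the order of the checks are assumptions for illustration (the order simply follows the list and may not match the precedence in Cloud Hypervisor), and Python's logging module stands in for the Rust logging macros.

```python
import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("virtio-net")


def describe_backend(dev_id, vhost_user=False, fds=None, tap=None):
    """Return (and log) which hypothetical creation path a config would take."""
    if vhost_user:
        source = "vhost-user backend"
    elif fds:
        source = f"provided network FDs {fds}"
    elif tap:
        source = f"existing tap device '{tap}'"
    else:
        source = "tap device created by the hypervisor (fallback)"
    # One log line per device makes the chosen path visible in the field.
    log.debug("virtio-net %s: %s", dev_id, source)
    return source


describe_backend("net1", fds=[44])
describe_backend("net2")
```
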
Live migration, state save/resume, and hotplug are not trivial when it
comes to virtio-net devices backed by network FDs. As the mechanism
behind it can be considered as quite "multi-step magic" even for
experienced programmers, it makes sense to thoroughly document this to
ease debugging and to improve the mental model of developers working on
this in the future.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]

(1) The old messages are missing the "why" part. With this change, users
    of Cloud Hypervisor have somewhat more context and people looking at
    the code know exactly what's going on.

(2) Using warn! implies that the user should take action, but in this
    case, there’s nothing the user can do. If the API is used correctly
    and file descriptors are passed via an SCM_RIGHTS message over a
    UNIX domain socket, everything works as intended. In that case,
    there's no need to issue a warning — debug! is sufficient.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]

Deserializing values as `-1` makes sense to prevent errors, so let's
keep it. However, serializing them differently adds confusion. For
example, a `ch-remote info` call should not report `-1` but the actual
FDs.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
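
The asymmetry described in the commit message above (placeholders on deserialization, real values on serialization) can be sketched as follows. The NetConfig class and its JSON layout are hypothetical and do not mirror the real Cloud Hypervisor structs; the explanation in the comments for why `-1` is used is likewise an assumption.

```python
import json
from dataclasses import dataclass, field


@dataclass
class NetConfig:  # hypothetical stand-in, not the real config struct
    id: str
    fds: list[int] = field(default_factory=list)

    def to_json(self) -> str:
        # Serialize the descriptors that are actually in use, so that an
        # info-style query reports real values.
        return json.dumps({"id": self.id, "fds": self.fds})

    @classmethod
    def from_json(cls, text: str) -> "NetConfig":
        raw = json.loads(text)
        # Assumption: FD numbers written by another process are not valid
        # here, so each one becomes the placeholder -1 until real
        # descriptors are attached.
        return cls(id=raw["id"], fds=[-1 for _ in raw.get("fds", [])])


config = NetConfig("net1", [44])
print(config.to_json())
print(NetConfig.from_json(config.to_json()))
```
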
This unifies error handling, the implementation, and logging. Otherwise,
we have code repetition, especially with the upcoming support for the
VmReceiveMigration patch.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]

This adds support for the live migration of virtio-net devices backed by
network FDs.

So far, this was tested successfully with a patched libvirt setup.
`receive-migration` via ch-remote is not supported as transferring any
FDs from there to Cloud Hypervisor isn't really sensible in typical
workflows.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]

Development

Successfully merging this pull request may close these issues.

Support Live Migration of VMs using Network Interfaces configured with explicit Net FDs

3 participants