Conversation

@blitz
Member

@blitz blitz commented Oct 28, 2025

Summary

This PR implements VM migration using multiple TCP connections.

The send-migration HTTP command now accepts a connections parameter (defaults to 1) that specifies how many connections to use for migration.

Ticket: https://github.com/cobaltcore-dev/cobaltcore/issues/205

Design

If connections is larger than 1, the sender will attempt to establish additional TCP connections to the same migration destination. The main (initial) connection handles most of the migration protocol. Additional connections handle only Memory commands for transferring chunks of VM memory.

We use a thread-per-connection model. This adds the burden of creating, tracking, and tearing down threads. But given the I/O model of Cloud Hypervisor, this seemed to be the only feasible option.

For each iteration of sending memory, the MemoryTable that describes dirty memory will be split into chunks of fixed size (CHUNK_SIZE). These chunks will then be distributed among the available threads in a round-robin fashion. Threads can have a configurable backlog of outstanding chunks to send (BUFFERED_REQUESTS_PER_THREAD). This is currently 1 to avoid accumulating too much backlog on connections that encountered throughput issues.
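
To make the chunking and round-robin distribution concrete, here is a minimal, self-contained sketch; MemoryRange and the helper names are simplified stand-ins for illustration, not the actual types used in this PR.

// Illustrative sketch: split dirty ranges into chunks of at most `chunk_size`
// bytes and assign chunks to sender threads round-robin.
#[derive(Clone, Copy)]
struct MemoryRange {
    gpa: u64,
    length: u64,
}

fn chunk_ranges(ranges: &[MemoryRange], chunk_size: u64) -> Vec<Vec<MemoryRange>> {
    let mut chunks = Vec::new();
    let mut current = Vec::new();
    let mut current_len = 0;

    for range in ranges {
        let mut gpa = range.gpa;
        let mut remaining = range.length;
        while remaining > 0 {
            // Take as much of this range as still fits into the current chunk.
            let take = remaining.min(chunk_size - current_len);
            current.push(MemoryRange { gpa, length: take });
            current_len += take;
            gpa += take;
            remaining -= take;
            if current_len == chunk_size {
                chunks.push(std::mem::take(&mut current));
                current_len = 0;
            }
        }
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    let dirty = [
        MemoryRange { gpa: 0x0, length: 3 * 1024 * 1024 },
        MemoryRange { gpa: 0x40_0000, length: 5 * 1024 * 1024 },
    ];
    let chunks = chunk_ranges(&dirty, 4 * 1024 * 1024);

    // Round-robin distribution over the per-connection sender threads.
    let num_threads = 4;
    for (i, chunk) in chunks.iter().enumerate() {
        println!("chunk {i} ({} ranges) -> thread {}", chunk.len(), i % num_threads);
    }
}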

We still use the original request-response scheme. Since we don't pipeline requests but always wait for the other side to acknowledge them, there is a fundamental limit on the throughput we can reach. The original code only expected one ACK for the whole dirty memory table; we now have one ACK per chunk.

I've come up with this formula for the upper bound of throughput per connection:

effective_throughput = chunk_size / (chunk_size / throughput_per_connection + round_trip_time)

This formula is also in the code. I've played around with this, and with large enough chunks the impact seems negligible, especially since we can scale up the number of connections. Feel free to plug in your favorite numbers.
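
As a quick sanity check, here is a minimal sketch that plugs in made-up example numbers (4 MiB chunks, a 10 Gbit/s link, 0.5 ms round-trip time); the function mirrors the formula above, while the constants are assumptions, not measurements.

/// Upper bound on per-connection throughput for the stop-and-wait scheme.
fn effective_throughput(chunk_size: f64, throughput_per_connection: f64, rtt: f64) -> f64 {
    chunk_size / (chunk_size / throughput_per_connection + rtt)
}

fn main() {
    let chunk_size = 4.0 * 1024.0 * 1024.0; // bytes
    let line_rate = 1.25e9; // bytes/s, roughly 10 Gbit/s
    let rtt = 0.5e-3; // seconds

    let eff = effective_throughput(chunk_size, line_rate, rtt);
    println!(
        "effective: {:.2} GB/s ({:.0}% of line rate)",
        eff / 1e9,
        100.0 * eff / line_rate
    );
    // Prints roughly: effective: 1.09 GB/s (87% of line rate)
}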

In a better world, we would pipeline requests and not wait for responses. The good thing: the sender is in full control of the parameters, so we can make them configurable if necessary. But I would rather keep this complexity away from users.

Noteworthy Issues

New

  • Teardown of threads will hang if the sender stops sending data on a connection. But then the whole VM migration will not succeed anyhow. This is difficult to fix without rewriting some of the I/O code and probably not worth it for now.

Still There

  • The code scans the whole dirty bitmap on a single thread with an unoptimized loop and then stores the result in a large vector (a naive version of this scan is sketched after this list). The dirty bitmap can reach hundreds of MB in size on huge instances, and with lots of non-contiguous writes the resulting MemoryTable can be huge as well. Speed up dirty_log() (a lot) #36
  • The code only starts sending memory once the whole bitmap is scanned.

These issues give the VM time to mutate more memory and thus lengthen the time needed for migration.
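
For context on the first bullet, this is roughly what such a naive single-threaded scan looks like (an illustrative sketch, not the actual dirty_log() code): it walks a page-granular dirty bitmap bit by bit and collects contiguous dirty pages into (gpa, length) ranges, which is where the large vector comes from.

const PAGE_SIZE: u64 = 4096;

fn bitmap_to_ranges(bitmap: &[u64]) -> Vec<(u64, u64)> {
    let mut ranges: Vec<(u64, u64)> = Vec::new();

    for (word_idx, word) in bitmap.iter().enumerate() {
        for bit in 0u64..64 {
            if (word & (1u64 << bit)) == 0 {
                continue;
            }
            let gpa = (word_idx as u64 * 64 + bit) * PAGE_SIZE;
            match ranges.last_mut() {
                // Extend the previous range if this page is contiguous with it.
                Some((start, len)) if *start + *len == gpa => *len += PAGE_SIZE,
                // Otherwise start a new range; with many non-contiguous dirty
                // pages this vector grows very large.
                _ => ranges.push((gpa, PAGE_SIZE)),
            }
        }
    }
    ranges
}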

Noteworthy Changes in the Code

More Consistent Error Handling

Previously, the protocol handling had some issues, such as not sending responses for unknown commands. Some errors resulted in error responses, while others caused the process to die. Now we always send a response, error handling is consistent, and it can be changed in one place.

What's still missing is handling payloads for all migration commands. For now, any payload bytes are ignored in the most inconvenient way: sender and receiver go out of sync.

State Machine for Receiver

Moved to #34.

You may ask where the additional connections come into play: when we enter the Configured state, the sender will establish more connections to send memory and tear them down once we are done sending memory.

Left to be done

  • Address first round of feedback
  • Fail gracefully if someone sends connections: 0
  • Split state machine into separate PR: State Machine for Migration Receive Code #34
  • Merge State Machine for Migration Receive Code #34
  • Fail gracefully if the other end doesn't accept multiple connections
  • Update src/api/openapi/cloud-hypervisor.yaml for connections parameter
  • Add tests to the test suite
    • Test for multiple connections
    • Test for forward/backward compatibility? (maybe optional)
  • Run test suite
    • Regression tests (libvirt-tests, do we break usage that doesn't use multiple connections?)
    • With new tests
  • Test on some big boxes with lots of guest RAM

@blitz blitz changed the base branch from main to gardenlinux October 28, 2025 14:23
Member

@phip1611 phip1611 left a comment


Wow, excellent work, Julian! Given the time you had and the state of the code, I think you've come up with an excellent trade-off between refactoring and adding the new functionality. I absolutely love the state machine you introduced! Further, kudos for the awesome git commit hygiene!

Regarding my review: I mostly left some nits. If the open discussion points are resolved, I'm happy to merge this.

/// destination.
///
/// When this function returns, all memory has been sent and acknowledged.
fn send_memory(
Member


this handles all memory of a certain memory iteration, right?

Member Author


That's how it is currently used. But it doesn't have to be.

@phip1611
Member

Could you split this into three PRs, please?

  • unrelated misc improvements
  • state machine
  • multiple parallel connections?

I think the first two would be merged very easily. I only see potential for further discussions in the last one.

@blitz blitz force-pushed the multiple-tcp branch 7 times, most recently from 92aad24 to 3908982 on November 4, 2025 14:09
@blitz
Member Author

blitz commented Nov 5, 2025

@phip1611 I've updated the todo list above with the current status. Let's chat tomorrow about whether you need anything else for the handover.

@blitz blitz mentioned this pull request Nov 7, 2025
@phip1611
Member

Regarding the open TODOs by Julian: I think we can ignore some of them, for example the OpenAPI stuff; it is sufficient if we do this when we upstream the change. Regarding the remaining testing: I'll try to take care of it as part of our hackathon this week.

Comment on lines +1146 to +1148
let (send, recv) = std::sync::mpsc::sync_channel::<SendMemoryThreadMessage>(
    Self::BUFFERED_REQUESTS_PER_THREAD,
);


Maybe a stupid question, but: Do I understand it correctly, that you create a mpsc channel per sending thread? And if yes: wouldn't it be more efficient to have some sort of a single producer multiple consumer channel? (like async channel). Then you could just put all partitions into that channel and the threads will take them when they are free.

Member Author

@blitz blitz Nov 11, 2025


That's also possible. I didn't know whether we want to add new dependencies. But this would also solve the problem of buffering too many chunks per thread, which is a problem if one connection becomes slow.
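
For reference, a minimal sketch of the single-queue alternative discussed here, assuming we would add a dependency like crossbeam-channel (names and structure are illustrative only): all sender threads pull from one bounded MPMC channel, so a slow connection simply takes fewer chunks instead of accumulating its own backlog.

use crossbeam_channel::bounded;
use std::thread;

fn main() {
    // Pretend each "chunk" is just an index; in the real code it would be a
    // chunk of the memory table.
    let (tx, rx) = bounded::<usize>(1);

    let workers: Vec<_> = (0..4)
        .map(|id| {
            let rx = rx.clone();
            thread::spawn(move || {
                // Each worker pulls the next chunk as soon as it is free.
                while let Ok(chunk) = rx.recv() {
                    println!("thread {id} sends chunk {chunk}");
                }
                // recv() fails once all senders are dropped: clean teardown.
            })
        })
        .collect();

    for chunk in 0..16 {
        tx.send(chunk).expect("a worker is still running");
    }
    drop(tx); // Close the channel so the workers can exit.

    for w in workers {
        w.join().unwrap();
    }
}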



Yeah I think it would have some advantages. @phip1611 what do you think? Maybe it would be worthwhile to change this?

Member


I think this is a follow-up / next step 🤔


I think so too. Please open a ticket.

Member


@amphi could you please open a ticket here?


blitz added 11 commits November 12, 2025 08:40
This is not wired up to anywhere yet. We will use this to establish
multiple connections for live migration.

On-behalf-of: SAP [email protected]
Signed-off-by: Julian Stecklina <[email protected]>
This has no functional change, but it is a requirement to remove the
lock that used to obtain the MemoryManager instance.

On-behalf-of: SAP [email protected]
Signed-off-by: Julian Stecklina <[email protected]>
... to avoid having to grab a lock when we receive a chunk of memory
over the migration socket. This will come in handy when we have
multiple threads for receiving memory.

On-behalf-of: SAP [email protected]
Signed-off-by: Julian Stecklina <[email protected]>
To allow for accepting more connections in the migration receive code
paths, we need to keep track of the listener. This commit adds a thin
abstraction to be able to hold on to it regardless of whether it is a
UNIX domain or TCP socket.

On-behalf-of: SAP [email protected]
Signed-off-by: Julian Stecklina <[email protected]>
We keep the listening socket around and accept as many connections as
the sender wants to open.

There are still some problems: We never tear these threads down
again. We will handle this in subsequent commits.

On-behalf-of: SAP [email protected]
Signed-off-by: Julian Stecklina <[email protected]>
In anticipation of using multiple threads for sending memory, refactor
the sending code to be in a single place.

On-behalf-of: SAP [email protected]
Signed-off-by: Julian Stecklina <[email protected]>
... to be able to re-use it when establishing multiple send
connections. I moved the receive socket creation out for symmetry.

On-behalf-of: SAP [email protected]
Signed-off-by: Julian Stecklina <[email protected]>
... to simplify sending memory from multiple connections in future
commits.

On-behalf-of: SAP [email protected]
Signed-off-by: Julian Stecklina <[email protected]>
For sending memory over multiple connections, we need a way to split
up the work. With these changes, we can take a memory table and chop
it into same-sized chunks for transmit.

On-behalf-of: SAP [email protected]
Signed-off-by: Julian Stecklina <[email protected]>
This does not actually use the additional connections yet, but we are
getting closer!

On-behalf-of: SAP [email protected]
Signed-off-by: Julian Stecklina <[email protected]>
This will stop us from listening for more connections on the TCP socket
when migration has finished.

Tearing down the individual connections will come in a subsequent
commit.

On-behalf-of: SAP [email protected]
Signed-off-by: Julian Stecklina <[email protected]>
... after the VM migration finishes.

On-behalf-of: SAP [email protected]
Signed-off-by: Julian Stecklina <[email protected]>
We don't die if we don't manage to establish more than the initial
connection to the other side. To limit the weird failure cases, we do
die when the other side only accepts some connections, but not all.

On-behalf-of: SAP [email protected]
Signed-off-by: Julian Stecklina <[email protected]>
This restores the old behavior/performance in case we don't use multiple connections.

On-behalf-of: SAP [email protected]
Signed-off-by: Julian Stecklina <[email protected]>
@tpressure

While testing this branch, I realized that we see a

Error:vmm/src/lib.rs:1006 -- Failed to wake up other threads: Invalid argument (os error 22)

This happens on every live migration :(

@phip1611 phip1611 marked this pull request as ready for review November 12, 2025 10:36
@phip1611 phip1611 marked this pull request as draft November 12, 2025 10:36
@phip1611
Member

I cannot edit or push to the branch of this PR. Before we waste time fixing this, I'll open a new one.
