Multiple TCP Connections for Migration #30
Conversation
Wow, excellent work, Julian! Given the time you had and the state of the code, I think you've come up with an excellent trade-off between refactoring and adding the new functionality. I absolutely love the state machine you introduced! Further, kudos for the awesome git commit hygiene!
Regarding my review: I mostly left some nits. If the open discussion points are resolved, I'm happy to merge this.
/// destination.
///
/// When this function returns, all memory has been sent and acknowledged.
fn send_memory(
this handles all memory of a certain memory iteration, right?
That's how it is currently used. But it doesn't have to be.
Could you split this into three PRs, please?
I think the first two would be merged very easily. I only see potential for further discussion in the last one.
Force-pushed from 92aad24 to 3908982.
@phip1611 I've updated the todo list above with the current status. Let's chat tomorrow about whether you need anything else for the handover.
Regarding the open TODOs by Julian: I think we can ignore some, for example the OpenAPI stuff. It is sufficient if we do this when we upstream this. Regarding the remaining testing: I'll try to take care of it as part of our hackathon this week.
let (send, recv) = std::sync::mpsc::sync_channel::<SendMemoryThreadMessage>(
    Self::BUFFERED_REQUESTS_PER_THREAD,
);
Maybe a stupid question, but do I understand it correctly that you create an mpsc channel per sending thread? And if yes: wouldn't it be more efficient to have some sort of single-producer, multiple-consumer channel (like async-channel)? Then you could just put all partitions into that channel and the threads would take them when they are free.
That's also possible. I didn't know whether we want to add new dependencies. But this would also solve the problem of buffering too many chunks per thread, which is a problem if one connection becomes slow.
Yeah I think it would have some advantages. @phip1611 what do you think? Maybe it would be worthwhile to change this?
I think this is a follow-up / next step 🤔
I think so too. Please open a ticket.
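For illustration, here is a minimal sketch of the single-queue idea discussed above. It assumes the crossbeam-channel crate (one possible new dependency) and a made-up MemoryChunk type; it is not the code in this PR.

```rust
use std::thread;

use crossbeam_channel::bounded; // hypothetical new dependency

/// Hypothetical stand-in for one chunk of the dirty-memory table.
struct MemoryChunk {
    offset: u64,
    length: u64,
}

fn send_chunk(conn_id: usize, chunk: &MemoryChunk) {
    // Placeholder for writing a `Memory` command to connection `conn_id`.
    println!("conn {conn_id}: sending {} bytes at {:#x}", chunk.length, chunk.offset);
}

fn main() {
    // Bounded queue: the producer blocks once the backlog is full, so a slow
    // connection cannot accumulate an unbounded pile of chunks.
    let (tx, rx) = bounded::<MemoryChunk>(4);

    let workers: Vec<_> = (0..3)
        .map(|conn_id| {
            let rx = rx.clone(); // crossbeam receivers are cloneable (MPMC)
            thread::spawn(move || {
                // Each thread takes the next chunk as soon as it is free.
                for chunk in rx.iter() {
                    send_chunk(conn_id, &chunk);
                }
            })
        })
        .collect();

    for i in 0..16 {
        tx.send(MemoryChunk { offset: i * 0x20_0000, length: 0x20_0000 }).unwrap();
    }
    drop(tx); // closing the sender ends the `rx.iter()` loops

    for w in workers {
        w.join().unwrap();
    }
}
```

A bounded shared queue would also provide the backpressure behavior mentioned above: whichever thread is free pulls the next chunk, instead of chunks being pre-assigned to a possibly slow connection.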
This is not wired up anywhere yet. We will use this to establish multiple connections for live migration. On-behalf-of: SAP [email protected] Signed-off-by: Julian Stecklina <[email protected]>
This has no functional change, but it is a requirement for removing the lock that was used to obtain the MemoryManager instance. On-behalf-of: SAP [email protected] Signed-off-by: Julian Stecklina <[email protected]>
... to avoid having to grab a lock when we receive a chunk of memory over the migration socket. This will come in handy when we have multiple threads for receiving memory. On-behalf-of: SAP [email protected] Signed-off-by: Julian Stecklina <[email protected]>
To allow for accepting more connections in the migration receive code paths, we need to keep track of the listener. This commit adds a thin abstraction to be able to hold on to it regardless of whether it is a UNIX domain or TCP socket. On-behalf-of: SAP [email protected] Signed-off-by: Julian Stecklina <[email protected]>
We keep the listening socket around and accept as many connections as the sender wants to open. There are still some problems: We never tear these threads down again. We will handle this in subsequent commits. On-behalf-of: SAP [email protected] Signed-off-by: Julian Stecklina <[email protected]>
In anticipation of using multiple threads for sending memory, refactor the sending code to be in a single place. On-behalf-of: SAP [email protected] Signed-off-by: Julian Stecklina <[email protected]>
... to be able to re-use it when establishing multiple send connections. I moved the receive socket creation out for symmetry. On-behalf-of: SAP [email protected] Signed-off-by: Julian Stecklina <[email protected]>
... to simplify sending memory from multiple connections in future commits. On-behalf-of: SAP [email protected] Signed-off-by: Julian Stecklina <[email protected]>
For sending memory over multiple connections, we need a way to split up the work. With these changes, we can take a memory table and chop it into same-sized chunks for transmit. On-behalf-of: SAP [email protected] Signed-off-by: Julian Stecklina <[email protected]>
This does not actually use the additional connections yet, but we are getting closer! On-behalf-of: SAP [email protected] Signed-off-by: Julian Stecklina <[email protected]>
On-behalf-of: SAP [email protected] Signed-off-by: Julian Stecklina <[email protected]>
This will stop us from listening for more connections on the TCP socket when migration has finished. Tearing down the individual connections will come in a subsequent commit. On-behalf-of: SAP [email protected] Signed-off-by: Julian Stecklina <[email protected]>
... after the VM migration finishes. On-behalf-of: SAP [email protected] Signed-off-by: Julian Stecklina <[email protected]>
We don't die if we don't manage to establish more than the initial connection to the other side. To limit the weird failure cases, we do die when the other side only accepts some connections, but not all. On-behalf-of: SAP [email protected] Signed-off-by: Julian Stecklina <[email protected]>
This restores the old behavior/performance in case we don't use multiple connections. On-behalf-of: SAP [email protected] Signed-off-by: Julian Stecklina <[email protected]>
While testing this branch, I realized that we see a
This happens on every live migration :(
I cannot edit or push to the branch of this PR. Before we waste time fixing this, I'll open a new one.
Summary
This PR implements VM migration using multiple TCP connections.
The send-migration HTTP command now accepts a `connections` parameter (defaults to 1) that specifies how many connections to use for migration.

Ticket: https://github.com/cobaltcore-dev/cobaltcore/issues/205
Design
If `connections` is larger than 1, the sender will attempt to establish additional TCP connections to the same migration destination. The main (initial) connection handles most of the migration protocol. Additional connections handle only `Memory` commands for transferring chunks of VM memory.

We use a thread-per-connection model. This adds the burden of creating, keeping track of, and tearing down threads. But given the I/O model of Cloud Hypervisor, this seemed to be the only feasible option.
For each iteration of sending memory, the `MemoryTable` that describes dirty memory is split into chunks of fixed size (`CHUNK_SIZE`). These chunks are then distributed among the available threads in a round-robin fashion. Threads can have a configurable backlog of outstanding chunks to send (`BUFFERED_REQUESTS_PER_THREAD`). This is currently 1 to avoid accumulating too much backlog on connections that have run into throughput issues.

We still use the original request-response scheme. Since we don't pipeline requests, but always wait for the other side to acknowledge them, there is a fundamental limit on the throughput we can reach. The original code expected only one ACK for the whole dirty memory table; we now have one ACK per chunk.
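As a rough sketch of the chunking and round-robin distribution described above (with hypothetical MemoryRange/MemoryTable types and an arbitrary chunk size, not the actual Cloud Hypervisor structures):

```rust
/// Hypothetical simplification of a dirty-memory table: a list of
/// (guest address, length) ranges.
#[derive(Clone, Debug)]
struct MemoryRange {
    gpa: u64,
    length: u64,
}

type MemoryTable = Vec<MemoryRange>;

/// Split a table into chunks of at most `chunk_size` bytes each.
fn split_into_chunks(table: &MemoryTable, chunk_size: u64) -> Vec<MemoryTable> {
    let mut chunks = Vec::new();
    let mut current = MemoryTable::new();
    let mut current_size = 0;

    for range in table {
        let mut gpa = range.gpa;
        let mut remaining = range.length;
        while remaining > 0 {
            // Take as much of this range as still fits into the current chunk.
            let take = remaining.min(chunk_size - current_size);
            current.push(MemoryRange { gpa, length: take });
            current_size += take;
            gpa += take;
            remaining -= take;
            if current_size == chunk_size {
                chunks.push(std::mem::take(&mut current));
                current_size = 0;
            }
        }
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    let table = vec![
        MemoryRange { gpa: 0x0, length: 5 << 20 },
        MemoryRange { gpa: 0x4000_0000, length: 3 << 20 },
    ];
    let chunk_size = 4 << 20; // illustrative; the real CHUNK_SIZE may differ

    // Round-robin distribution of chunks over the sending threads.
    let threads = 3;
    for (i, chunk) in split_into_chunks(&table, chunk_size).iter().enumerate() {
        println!("thread {}: {:?}", i % threads, chunk);
    }
}
```

The real code additionally bounds how many chunks each thread may have outstanding (`BUFFERED_REQUESTS_PER_THREAD`), which this sketch leaves out.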
I've come up with this formula for the upper bound of throughput per connection:

effective_throughput = chunk_size / (chunk_size / throughput_per_connection + round_trip_time)

This formula is also in the code. I've played around with it: with large enough chunks, the impact seems negligible, especially since we can scale up connections. Feel free to plug in your favorite numbers.
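To make the formula concrete, here is the arithmetic with purely illustrative numbers (4 MiB chunks, roughly 1.25 GB/s per connection, 0.5 ms round trip; not measurements from this PR):

```rust
fn main() {
    let chunk_size = 4.0 * 1024.0 * 1024.0; // bytes
    let throughput_per_connection = 1.25e9; // bytes/s (~10 Gbit/s)
    let round_trip_time = 0.5e-3; // seconds

    let effective_throughput =
        chunk_size / (chunk_size / throughput_per_connection + round_trip_time);

    // Prints roughly 1.09e9 bytes/s, i.e. about 87% of the raw per-connection rate.
    println!("{:.3e} bytes/s", effective_throughput);
}
```

With smaller chunks or higher latency the per-connection loss grows, which is what scaling up the number of connections is meant to compensate for.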
In a better world, we would pipeline requests and not wait for responses. The good thing: the sender is in full control of the parameters, so we can make them configurable if necessary. But I would rather keep this complexity away from users.
Noteworthy Issues
New
Still There
The code scans the whole dirty bitmap with one thread in an unoptimized loop and then stores the result in a large vector (Speed up dirty_log() (a lot) #36). The dirty bitmap can reach hundreds of MB in size on huge instances. For lots of non-contiguous writes, the size of the resulting `MemoryTable` can be huge as well.

These issues give the VM time to mutate more memory and thus lengthen the time needed for migration.
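For context on #36, here is a minimal sketch of the kind of word-at-a-time scan that avoids testing every bit of a mostly-clean bitmap individually (hypothetical layout, not the actual dirty_log() code):

```rust
/// Minimal sketch, not the actual dirty_log() implementation: turn a dirty
/// bitmap into contiguous page ranges, skipping clean 64-page words entirely.
fn dirty_ranges(bitmap: &[u64]) -> Vec<std::ops::Range<u64>> {
    let mut ranges: Vec<std::ops::Range<u64>> = Vec::new();
    for (word_idx, &word) in bitmap.iter().enumerate() {
        if word == 0 {
            continue; // fast path: skip 64 clean pages at once
        }
        let mut bits = word;
        while bits != 0 {
            let page = word_idx as u64 * 64 + bits.trailing_zeros() as u64;
            // Either extend the previous range or start a new one.
            if matches!(ranges.last(), Some(last) if last.end == page) {
                ranges.last_mut().unwrap().end = page + 1;
            } else {
                ranges.push(page..page + 1);
            }
            bits &= bits - 1; // clear the lowest set bit
        }
    }
    ranges
}

fn main() {
    // Pages 1, 2, 3 and 64 are dirty; everything else is clean.
    let bitmap = [0b1110_u64, 0b1];
    println!("{:?}", dirty_ranges(&bitmap)); // [1..4, 64..65]
}
```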
Noteworthy Changes in the Code
More Consistent Error Handling
Previously, the protocol handling had some issues: unknown commands did not get a response, some errors resulted in error responses, and some errors resulted in dying. Now we always send a response; the behavior should be consistent and can be changed in one place.
What's still missing is handling payloads for all migration commands. For now, any payload bytes are ignored in the most inconvenient way possible: sender and receiver go out of sync.
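As a sketch of the "one place" idea (hypothetical Command/Response types, not the actual migration protocol code), every received command funnels into a single match whose fallback arm still produces a response:

```rust
/// Hypothetical, simplified versions of the protocol types.
#[derive(Debug)]
enum Command {
    Start,
    Memory,
    Complete,
    Unknown(u16),
}

#[derive(Debug)]
enum Response {
    Ok,
    Error(String),
}

/// Every command produces exactly one response; unknown commands no longer
/// fall through silently, so error behavior can be adjusted in one place.
fn handle_command(cmd: Command) -> Response {
    match cmd {
        Command::Start => Response::Ok,
        Command::Memory => Response::Ok, // receive and apply the memory chunk
        Command::Complete => Response::Ok,
        Command::Unknown(id) => Response::Error(format!("unknown command {id}")),
    }
}

fn main() {
    for cmd in [Command::Start, Command::Unknown(42), Command::Complete] {
        println!("{:?}", handle_command(cmd));
    }
}
```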
State Machine for Receiver
Moved to #34.
You may ask where the additional connections come into play: when we enter the `Configured` state, the sender establishes more connections to send memory and tears them down once we are done sending memory.

Left to be done
- Handle `connections: 0`
- Update `src/api/openapi/cloud-hypervisor.yaml` for the `connections` parameter