@blitz blitz commented Nov 4, 2025

This is split off from #30. See there for the longer discussion.

The receiver is now implemented via an explicit state machine. This restricts the order in which we accept migration commands, but it makes it much easier to carry state from one protocol step to the next.

The state machine is this:

```mermaid
stateDiagram-v2
    direction TB
    [*] --> Started: Start
    Started --> MemoryFdsReceived: MemoryFd
    MemoryFdsReceived --> MemoryFdsReceived: MemoryFd
    Started --> Configured: Config
    MemoryFdsReceived --> Configured: Config
    Configured --> Configured: Memory
    Configured --> StateReceived: State
    StateReceived --> Completed: Complete
```

This state machine still accepts migrations from previous Cloud Hypervisor versions. Note, however, that previous versions were much more liberal about the order in which they accepted migration commands.
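The diagram above can be sketched as a transition function in Rust. This is an illustrative sketch only: the type and variant names mirror the diagram, not necessarily the actual identifiers in vmm/src/lib.rs.

```rust
// Hypothetical sketch of the receiver state machine from the diagram.
// Names mirror the diagram; the real implementation may differ.

#[derive(Debug, PartialEq)]
enum ReceiverState {
    Started,
    MemoryFdsReceived,
    Configured,
    StateReceived,
    Completed,
}

#[derive(Debug)]
enum Command {
    MemoryFd,
    Config,
    Memory,
    State,
    Complete,
}

/// Advance the state machine, or reject the command with an error message.
fn step(state: ReceiverState, cmd: Command) -> Result<ReceiverState, String> {
    use ReceiverState::*;
    Ok(match (state, cmd) {
        (Started | MemoryFdsReceived, Command::MemoryFd) => MemoryFdsReceived,
        (Started | MemoryFdsReceived, Command::Config) => Configured,
        (Configured, Command::Memory) => Configured,
        (Configured, Command::State) => StateReceived,
        (StateReceived, Command::Complete) => Completed,
        // Every other combination is invalid and produces an error
        // instead of silently being accepted.
        (state, cmd) => return Err(format!("unexpected {cmd:?} in state {state:?}")),
    })
}
```

Encoding the transitions as a single `match` is what makes the "fits on a screen" property below possible: every legal edge of the diagram is one arm, and everything else falls through to an error.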

blitz added 4 commits November 4, 2025 10:54
On-behalf-of: SAP [email protected]
Signed-off-by: Julian Stecklina <[email protected]>
This will be useful later when we rebuild the live migration code as a
state machine.

On-behalf-of: SAP [email protected]
Signed-off-by: Julian Stecklina <[email protected]>
... and nuke some Option<> while I was there. Given that HashMap has a
usable default and we end up passing an empty HashMap anyway, just get
rid of the Option.

On-behalf-of: SAP [email protected]
Signed-off-by: Julian Stecklina <[email protected]>
Previously, state that we accumulated during the migration process in
the receiver was kept in `mut Option` variables or HashMaps. The
problem is that it is unclear in the code when these variables can be
safely used. It's also difficult to add new state, such as the state
we need to handle the upcoming feature for performing the migration
using multiple connections.

To solve this, I've refactored the code to use the state machine
pattern. Each state carries the data it needs. Any data that has not
arrived yet (memory_files, memory_manager) cannot be accessed until we
are in the proper state.
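The idea of states carrying their own data could look like the following sketch. The field and type names here are illustrative stand-ins, not the actual definitions from vmm/src/lib.rs.

```rust
use std::collections::HashMap;

// Illustrative stand-in for a file descriptor, to keep the sketch
// platform-neutral; the real code would use a proper owned fd type.
type RawFd = i32;

// Hypothetical sketch: each state owns exactly the data the protocol
// has accumulated so far, so memory file descriptors simply do not
// exist before they have been received.
enum ReceiveMigrationState {
    Started,
    MemoryFdsReceived {
        memory_files: HashMap<String, RawFd>,
    },
    Configured {
        memory_files: HashMap<String, RawFd>,
        // In the real code this state would also carry e.g. the memory
        // manager built while handling the Config command.
    },
}

impl ReceiveMigrationState {
    /// Memory files are only reachable in states that actually carry them.
    fn memory_files(&self) -> Option<&HashMap<String, RawFd>> {
        match self {
            ReceiveMigrationState::Started => None,
            ReceiveMigrationState::MemoryFdsReceived { memory_files }
            | ReceiveMigrationState::Configured { memory_files } => Some(memory_files),
        }
    }
}
```

The compiler now enforces what the old `mut Option` variables only documented: code that runs in `Started` has no way to touch memory files, because the state simply does not contain them.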

Some benefits that fall out of this:

- We now respond to all requests, even invalid ones, with at least an
  error message.
- Any error while handling a request results in an error message being
  sent.
- There is only a single place where responses are sent and thus it's
  very hard to mess this up in the code.
- The main protocol state machine fits on a screen.

I would argue that the error cases in particular are now much more
consistent. There is still a lot to be done: state is still
transferred via self.vm and similar fields, and in an ideal world this
would also be carried by the state machine. I also want to see better
handling of payloads, which are still handled all over the place, but
this change is already too big. :)

On-behalf-of: SAP [email protected]
Signed-off-by: Julian Stecklina <[email protected]>
@blitz blitz mentioned this pull request Nov 4, 2025

blitz commented Nov 4, 2025

Testing Done

  • Cloud Hypervisor test suite -> Too unreliable 😿
  • libvirt-tests

@blitz blitz marked this pull request as ready for review November 4, 2025 15:35

/// Handle a migration command and advance the protocol state machine.
///
/// **Note**: This function is responsible for consuming any payloads! It also must

What do you mean by responsible for consuming any payloads?


Each Request can have a payload. This is indicated by req.length() != 0. At the moment, this function must make sure that these payloads are read from the socket. If it doesn't, the generic request handling code loses "synchronization" and mixes up request headers and payloads.

This is a deficiency in the design that existed before as well. I considered introducing an abstraction to make this harder to misuse, but it would be very intrusive to the existing code (hence rebase hell). The idea would be to wrap the SocketStream in a Payload struct. Code could then read the payload via the struct, or, when the struct goes out of scope, its Drop could either consume/ignore the payload or panic if the payload was not read.

The simpler design of just reading the payload in one place doesn't work, because for Memory requests the payload is very large and we don't want to copy it multiple times.
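The Payload wrapper idea described above might look roughly like this. This is a sketch under assumed names (`Payload`, `read_to_vec`, `skip`), not the actual design being proposed for the codebase.

```rust
use std::io::{self, Read};

// Hypothetical sketch: wrap the socket so a request's payload must be
// explicitly consumed (or deliberately skipped) before the wrapper is
// dropped. Forgetting either would desynchronize the protocol stream,
// so Drop turns that bug into a loud failure.
struct Payload<'a, R: Read> {
    socket: &'a mut R,
    remaining: usize,
}

impl<'a, R: Read> Payload<'a, R> {
    fn new(socket: &'a mut R, len: usize) -> Self {
        Payload { socket, remaining: len }
    }

    /// Read the whole payload into a buffer, consuming the wrapper.
    fn read_to_vec(mut self) -> io::Result<Vec<u8>> {
        let len = self.remaining;
        self.remaining = 0;
        let mut buf = vec![0u8; len];
        self.socket.read_exact(&mut buf)?;
        Ok(buf)
    }

    /// Explicitly discard an unwanted payload, keeping the stream in sync.
    fn skip(mut self) -> io::Result<()> {
        let len = self.remaining as u64;
        self.remaining = 0;
        io::copy(&mut self.socket.by_ref().take(len), &mut io::sink())?;
        Ok(())
    }
}

impl<R: Read> Drop for Payload<'_, R> {
    fn drop(&mut self) {
        // Silently lost payload bytes would mix up subsequent request
        // headers and payloads, so fail loudly instead.
        assert_eq!(self.remaining, 0, "request payload was not consumed");
    }
}
```

Note that `read_to_vec` would not fit the Memory request, whose large payload should be streamed to its destination rather than copied through an intermediate buffer; a real version would need a streaming accessor as well.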


@olivereanderson olivereanderson left a comment


Nice state machine!

The following is a very minor nit that you can choose to ignore: I noticed that vm_receive_migration_step assumes it is never called with the Completed or Aborted states. If you want, you can enforce that in the type system by splitting those two states off into a new enum MigrationTermination and changing the return type of vm_receive_migration_step to

Result<ControlFlow<MigrationTermination, ReceiveMigrationState>, Error>
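This suggestion can be sketched as follows. The types here are illustrative stand-ins (the real `ReceiveMigrationState` and `Error` are defined in the PR); the point is that terminal states live in their own enum, so the step function can neither receive nor return `Completed`/`Aborted` as an ordinary state.

```rust
use std::ops::ControlFlow;

// Terminal outcomes get their own type, unrepresentable as ordinary states.
#[derive(Debug, PartialEq)]
enum MigrationTermination {
    Completed,
    Aborted,
}

// Stand-in for the PR's state enum: non-terminal states only.
enum ReceiveMigrationState {
    Started,
    Configured,
}

#[derive(Debug)]
struct Error;

fn vm_receive_migration_step(
    state: ReceiveMigrationState,
) -> Result<ControlFlow<MigrationTermination, ReceiveMigrationState>, Error> {
    // Toy transition logic standing in for real command handling:
    // Continue carries the next state, Break carries the termination.
    Ok(match state {
        ReceiveMigrationState::Started => {
            ControlFlow::Continue(ReceiveMigrationState::Configured)
        }
        ReceiveMigrationState::Configured => {
            ControlFlow::Break(MigrationTermination::Completed)
        }
    })
}

/// Drive the state machine until it terminates.
fn run(mut state: ReceiveMigrationState) -> Result<MigrationTermination, Error> {
    loop {
        match vm_receive_migration_step(state)? {
            ControlFlow::Continue(next) => state = next,
            ControlFlow::Break(termination) => return Ok(termination),
        }
    }
}
```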

vmm/src/lib.rs Outdated
if existing_memory_files.is_none() {
    existing_memory_files = Some(HashMap::default())
}
existing_memory_files.push(Self::vm_receive_memory_fd(&mut socket)?);


Nit: Why not just insert the result of Self::vm_receive_memory_fd into the given hashmap?

I see you removed this code in the next commit so probably not relevant anyway ...


blitz commented Nov 6, 2025

That's a good suggestion, @olivereanderson! I would leave it as is for now, but feel free to try implementing your idea.

@blitz blitz merged commit 1fa6a03 into cyberus-technology:gardenlinux Nov 6, 2025
14 checks passed
