@blitz blitz commented Nov 4, 2025

This is split off from #30. See there for the longer discussion.

The receiver is now implemented via an explicit state machine. This restricts the order in which we accept migration commands, but it makes it much easier to carry state from one protocol step to the next.

The state machine is this:

```mermaid
stateDiagram-v2
    direction TB
    [*] --> Started: Start
    Started --> MemoryFdsReceived: MemoryFd
    MemoryFdsReceived --> MemoryFdsReceived: MemoryFd
    Started --> Configured: Config
    MemoryFdsReceived --> Configured: Config
    Configured --> Configured: Memory
    Configured --> StateReceived: State
    StateReceived --> Completed: Complete
```

This state machine still accepts migrations from previous Cloud Hypervisor versions. Note, however, that previous versions were much more liberal about the order in which they accepted migration commands.
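The diagram above can be sketched as a transition function in Rust. This is an illustrative sketch only: the type and variant names mirror the diagram, not necessarily the actual identifiers in vmm/src/lib.rs.

```rust
// Hypothetical sketch of the receiver state machine from the diagram.
// Names mirror the diagram; the real implementation may differ.

#[derive(Debug, PartialEq)]
enum ReceiverState {
    Started,
    MemoryFdsReceived,
    Configured,
    StateReceived,
    Completed,
}

#[derive(Debug)]
enum Command {
    MemoryFd,
    Config,
    Memory,
    State,
    Complete,
}

/// Advance the state machine, or reject the command with an error message.
fn step(state: ReceiverState, cmd: Command) -> Result<ReceiverState, String> {
    use ReceiverState::*;
    Ok(match (state, cmd) {
        (Started | MemoryFdsReceived, Command::MemoryFd) => MemoryFdsReceived,
        (Started | MemoryFdsReceived, Command::Config) => Configured,
        (Configured, Command::Memory) => Configured,
        (Configured, Command::State) => StateReceived,
        (StateReceived, Command::Complete) => Completed,
        // Every other combination is invalid and produces an error
        // instead of silently being accepted.
        (state, cmd) => return Err(format!("unexpected {cmd:?} in state {state:?}")),
    })
}
```

Encoding the transitions as a single `match` is what makes the "fits on a screen" property below possible: every legal edge of the diagram is one arm, and everything else falls through to an error.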

blitz added 4 commits November 4, 2025 10:54
On-behalf-of: SAP [email protected]
Signed-off-by: Julian Stecklina <[email protected]>
This will be useful later when we rebuild the live migration code as a
state machine.

On-behalf-of: SAP [email protected]
Signed-off-by: Julian Stecklina <[email protected]>
... and nuke some Option<> while I was there. Given that HashMap has a
usable default and we end up passing an empty HashMap anyway, just get
rid of the Option.

On-behalf-of: SAP [email protected]
Signed-off-by: Julian Stecklina <[email protected]>
Previously, state that we accumulated during the migration process in
the receiver was kept in `mut Option` variables or HashMaps. The
problem is that it is unclear in the code when these variables can be
safely used. It's also difficult to add new state, such as the state
we need to handle the upcoming feature for performing the migration
using multiple connections.

To solve this, I've refactored the code to use the state machine
pattern. Each state carries the data it needs. Any data that has not
arrived yet (memory_files, memory_manager) cannot be accessed until we
are in the proper state.
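The idea of states carrying their own data could look like the following sketch. The field and type names here are illustrative stand-ins, not the actual definitions from vmm/src/lib.rs.

```rust
use std::collections::HashMap;

// Illustrative stand-in for a file descriptor, to keep the sketch
// platform-neutral; the real code would use a proper owned fd type.
type RawFd = i32;

// Hypothetical sketch: each state owns exactly the data the protocol
// has accumulated so far, so memory file descriptors simply do not
// exist before they have been received.
enum ReceiveMigrationState {
    Started,
    MemoryFdsReceived {
        memory_files: HashMap<String, RawFd>,
    },
    Configured {
        memory_files: HashMap<String, RawFd>,
        // In the real code this state would also carry e.g. the memory
        // manager built while handling the Config command.
    },
}

impl ReceiveMigrationState {
    /// Memory files are only reachable in states that actually carry them.
    fn memory_files(&self) -> Option<&HashMap<String, RawFd>> {
        match self {
            ReceiveMigrationState::Started => None,
            ReceiveMigrationState::MemoryFdsReceived { memory_files }
            | ReceiveMigrationState::Configured { memory_files } => Some(memory_files),
        }
    }
}
```

The compiler now enforces what the old `mut Option` variables only documented: code that runs in `Started` has no way to touch memory files, because the state simply does not contain them.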

Some benefits that fall out of this:

- We now respond to all requests, even invalid ones, with at least an
  error message.
- Any error while handling a request results in an error message being
  sent.
- There is only a single place where responses are sent and thus it's
  very hard to mess this up in the code.
- The main protocol state machine fits on a screen.

I would argue that the error cases in particular are now much more
consistent. There is still a lot to be done: state is still
transferred via self.vm and similar fields, and in an ideal world this
would also be carried by the state machine. I also want to see better
handling of payloads, which are still handled all over the place, but
this change is already too big. :)

On-behalf-of: SAP [email protected]
Signed-off-by: Julian Stecklina <[email protected]>
@blitz blitz mentioned this pull request Nov 4, 2025

blitz commented Nov 4, 2025

Testing Done

  • Cloud Hypervisor test suite -> Too unreliable 😿
  • libvirt-tests

@blitz blitz marked this pull request as ready for review November 4, 2025 15:35

/// Handle a migration command and advance the protocol state machine.
///
/// **Note**: This function is responsible for consuming any payloads! It also must

What do you mean by responsible for consuming any payloads?


Each Request can have a payload. This is indicated by req.length() != 0. At the moment, this function must make sure that these payloads are read from the socket. If it doesn't, the generic request handling code loses "synchronization" and mixes up request headers and payloads.

This is a deficiency in the design that existed before as well. I considered introducing an abstraction to make this harder to misuse, but it would be very intrusive to the existing code (hence rebase hell). The idea would be to wrap the SocketStream in a Payload struct. Code could then read the payload via the struct, or, when the struct goes out of scope, its Drop could either consume/ignore the payload or panic if the payload was not read.

The simpler design of just reading the payload in one place doesn't work, because for Memory requests the payload is very large and we don't want to copy it multiple times.
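The Payload wrapper idea described above might look roughly like this. This is a sketch under assumed names (`Payload`, `read_to_vec`, `skip`), not the actual design being proposed for the codebase.

```rust
use std::io::{self, Read};

// Hypothetical sketch: wrap the socket so a request's payload must be
// explicitly consumed (or deliberately skipped) before the wrapper is
// dropped. Forgetting either would desynchronize the protocol stream,
// so Drop turns that bug into a loud failure.
struct Payload<'a, R: Read> {
    socket: &'a mut R,
    remaining: usize,
}

impl<'a, R: Read> Payload<'a, R> {
    fn new(socket: &'a mut R, len: usize) -> Self {
        Payload { socket, remaining: len }
    }

    /// Read the whole payload into a buffer, consuming the wrapper.
    fn read_to_vec(mut self) -> io::Result<Vec<u8>> {
        let len = self.remaining;
        self.remaining = 0;
        let mut buf = vec![0u8; len];
        self.socket.read_exact(&mut buf)?;
        Ok(buf)
    }

    /// Explicitly discard an unwanted payload, keeping the stream in sync.
    fn skip(mut self) -> io::Result<()> {
        let len = self.remaining as u64;
        self.remaining = 0;
        io::copy(&mut self.socket.by_ref().take(len), &mut io::sink())?;
        Ok(())
    }
}

impl<R: Read> Drop for Payload<'_, R> {
    fn drop(&mut self) {
        // Silently lost payload bytes would mix up subsequent request
        // headers and payloads, so fail loudly instead.
        assert_eq!(self.remaining, 0, "request payload was not consumed");
    }
}
```

Note that `read_to_vec` would not fit the Memory request, whose large payload should be streamed to its destination rather than copied through an intermediate buffer; a real version would need a streaming accessor as well.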


@olivereanderson olivereanderson left a comment


Nice state machine!

The following is a very minor nit that you can choose to ignore: I noticed that vm_receive_migration_step assumes it is never called with the Completed or Aborted states. If you want, you can enforce that in the type system by splitting those two states off into a new enum MigrationTermination and changing the return type of vm_receive_migration_step to

Result<ControlFlow<MigrationTermination, ReceiveMigrationState>, Error>
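This suggestion can be sketched as follows. The types here are illustrative stand-ins (the real `ReceiveMigrationState` and `Error` are defined in the PR); the point is that terminal states live in their own enum, so the step function can neither receive nor return `Completed`/`Aborted` as an ordinary state.

```rust
use std::ops::ControlFlow;

// Terminal outcomes get their own type, unrepresentable as ordinary states.
#[derive(Debug, PartialEq)]
enum MigrationTermination {
    Completed,
    Aborted,
}

// Stand-in for the PR's state enum: non-terminal states only.
enum ReceiveMigrationState {
    Started,
    Configured,
}

#[derive(Debug)]
struct Error;

fn vm_receive_migration_step(
    state: ReceiveMigrationState,
) -> Result<ControlFlow<MigrationTermination, ReceiveMigrationState>, Error> {
    // Toy transition logic standing in for real command handling:
    // Continue carries the next state, Break carries the termination.
    Ok(match state {
        ReceiveMigrationState::Started => {
            ControlFlow::Continue(ReceiveMigrationState::Configured)
        }
        ReceiveMigrationState::Configured => {
            ControlFlow::Break(MigrationTermination::Completed)
        }
    })
}

/// Drive the state machine until it terminates.
fn run(mut state: ReceiveMigrationState) -> Result<MigrationTermination, Error> {
    loop {
        match vm_receive_migration_step(state)? {
            ControlFlow::Continue(next) => state = next,
            ControlFlow::Break(termination) => return Ok(termination),
        }
    }
}
```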

vmm/src/lib.rs Outdated
if existing_memory_files.is_none() {
    existing_memory_files = Some(HashMap::default())
}
existing_memory_files.push(Self::vm_receive_memory_fd(&mut socket)?);


Nit: Why not just insert the result of Self::vm_receive_memory_fd into the given hashmap?

I see you removed this code in the next commit so probably not relevant anyway ...


blitz commented Nov 6, 2025

That's a good suggestion, @olivereanderson! I would leave it as is for now, but feel free to try implementing your idea.

@blitz blitz merged commit 1fa6a03 into cyberus-technology:gardenlinux Nov 6, 2025
14 checks passed
