
Conversation


@ljrcore ljrcore commented Nov 27, 2024

Summary

This pull request updates the vm-memory crate from version 0.16.0 to 0.16.1, picking up the ReadVolatile and WriteVolatile implementations for TcpStream. Building on that, it introduces TCP-based cross-host live migration support in Cloud Hypervisor, improving migration performance and simplifying the migration flow. The live migration documentation has also been updated with instructions for performing cross-host live migrations over TCP connections.

Changes Made

build: Bump vm-memory from 0.16.0 to 0.16.1

  • Updated the vm-memory crate dependency to version 0.16.1 to include the implementations of ReadVolatile and WriteVolatile for TcpStream, which are necessary for TCP-based live migration.
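
  As a rough illustration (a sketch, not the PR's code; the helper name is made up), these impls let migration transport code stay generic over vm-memory's volatile I/O traits, with both stream types satisfying the bounds:

      use std::net::TcpStream;
      use std::os::unix::net::UnixStream;

      use vm_memory::{ReadVolatile, WriteVolatile};

      // Compile-time check that a type can serve as a migration transport.
      fn assert_transport<T: ReadVolatile + WriteVolatile>() {}

      fn main() {
          assert_transport::<UnixStream>();
          // New in vm-memory 0.16.1: TcpStream also implements both traits.
          assert_transport::<TcpStream>();
      }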

vmm: Add support for cross-host live migration over TCP

  • Modified vm_send_migration and vm_receive_migration functions to support destination_url and receiver_url starting with tcp:, allowing live migration over TCP sockets.
  • Moved socket connection logic into vm_send_migration and vm_receive_migration, simplifying the migration functions.
  • Improved migration efficiency by transmitting migration data over a direct TCP connection between the hosts.
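
  At a high level (shapes assumed from the review snippets further down, not authoritative), the two transports are wrapped in a small enum so the rest of the migration path stays transport-agnostic:

      use std::net::TcpStream;
      use std::os::unix::net::UnixStream;

      // One wrapper for both migration transports; callers match on the
      // variant only where behavior genuinely differs (e.g. fd passing).
      enum SocketStream {
          Unix(UnixStream),
          Tcp(TcpStream),
      }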

docs: Add documentation for cross-host TCP live migration

  • Updated the live migration documentation to include instructions for performing cross-host live migrations over TCP connections.
  • Provided detailed steps, commands, and considerations for users to perform TCP-based cross-host migrations.
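
  For example, a cross-host migration over TCP would be driven roughly like this (a sketch: the API socket paths and addresses are illustrative, and the tcp: URL scheme follows the test logs below):

      # On the destination host: listen for the incoming migration.
      ch-remote --api-socket=/tmp/ch-dst.sock receive-migration tcp:0.0.0.0:6000

      # On the source host: connect to the destination and start sending.
      ch-remote --api-socket=/tmp/ch-src.sock send-migration tcp:192.168.102.2:6000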

Testing

We conducted tests to compare the performance of the new TCP-based cross-host live migration with the existing Unix socket migration using socat for cross-host communication. The key metrics measured were the Total Migration Time and Downtime experienced during the migration process.

Testing Environment

  • Host Machines: The source and destination hosts have identical hardware specifications.

  • Virtual Machine Under Test:

    • CPU: 4 vCPUs
    • Memory: 8 GB RAM
  • Memory Pressure: Inside the guest VM, we applied memory pressure using the following command:

      stress -m 4 --vm-bytes 256M

Estimating Migration Time and Downtime from Logs

The migration time and downtime were estimated using timestamps from the Cloud Hypervisor logs. The methodology is as follows:

  • Migration Time: Calculated as the difference between the timestamps when the migration started and when it completed.
    • Migration Start Time: The timestamp when the VmSendMigration API request is received.
    • Migration End Time: The timestamp when the migration is marked as complete (Migration complete log entry).
  • Downtime: Calculated as the difference between the timestamps when the VM is paused for migration and when the migration is complete.
    • Downtime Start Time: The timestamp when the VM is paused (VM has been paused log entry).
    • Downtime End Time: The same as the Migration End Time.

Log Examples and Calculations

TCP Migration Example

Start Migration: cloud-hypervisor: 12.852263s:  INFO:vmm/src/api/mod.rs:1262 -- API request event: VmSendMigration VmSendMigrationData { destination_url: "tcp:192.168.102.2:6000", local: false }
Start Downtime: cloud-hypervisor: 17.796833s: INFO:vmm/src/vm.rs:2488 -- VM has been paused
End Migration: cloud-hypervisor: 18.226345s:  INFO:vmm/src/lib.rs:1167 -- Migration complete

Calculations:

  • Migration Time = Migration End Time - Migration Start Time
    • Migration Time = 18.226345 s - 12.852263 s = 5.374082 s = 5374.082 ms
  • Downtime = Migration End Time - Downtime Start Time
    • Downtime = 18.226345 s - 17.796833 s = 0.429512 s = 429.512 ms

Test Results

TCP Migration Results

| Test    | Migration Time (ms) | Downtime (ms) |
|---------|---------------------|---------------|
| 1       | 5374.082            | 429.512       |
| 2       | 5347.307            | 353.986       |
| 3       | 5364.191            | 404.331       |
| Average | 5361.860            | 395.943       |

Unix Socket Migration with Socat Results

| Test    | Migration Time (ms) | Downtime (ms) |
|---------|---------------------|---------------|
| 1       | 11797.579           | 845.931       |
| 2       | 11211.548           | 786.769       |
| 3       | 11689.248           | 790.308       |
| Average | 11566.125           | 807.669       |

Analysis

  • Migration Time:
    • TCP migration shows significantly faster migration times, averaging approximately 5,362 ms.
    • Unix socket migration with socat takes longer, averaging approximately 11,566 ms.
    • TCP migration reduces the total migration time by about 53% compared to Unix socket migration.
  • Downtime:
    • Downtime during TCP migration is shorter, averaging approximately 396 ms.
    • Downtime during Unix socket migration averages around 808 ms.
    • TCP migration reduces downtime by about 51% compared to Unix socket migration.

The implementation of TCP-based cross-host live migration significantly improves migration performance compared to Unix socket migration using socat. By reducing both the total migration time and the downtime by over 50% in these tests, this enhancement improves the efficiency of live migrations in cross-host scenarios.

@ljrcore ljrcore requested a review from a team as a code owner November 27, 2024 11:53
vmm/src/vm.rs Outdated
    impl Pausable for Vm {
        fn pause(&mut self) -> std::result::Result<(), MigratableError> {
            event!("vm", "pausing");
            info!("Pause VM for migration");
Contributor

This log message may not be right, since the VM may also be paused via an API request.

@ljrcore ljrcore Nov 28, 2024


The new version has fixed this problem:)

Thanks

@ljrcore ljrcore force-pushed the migration branch 2 times, most recently from a8dbb4c to 8425ec4 on November 28, 2024 09:39
@rbradford rbradford left a comment


Please add integration tests.

@ljrcore

ljrcore commented Nov 29, 2024

> Please add integration tests.

Integration tests have been added in the newly pushed revision :)

Thanks

@TimePrinciple

TimePrinciple commented Nov 29, 2024

@ljrcore Would you mind rebasing to apply the new clippy fix?

@ljrcore

ljrcore commented Dec 2, 2024

> @ljrcore Would you mind rebasing to apply the new clippy fix?

Done.

Thanks

@ljrcore

ljrcore commented Dec 2, 2024

@blitz The live migration documentation has been updated.

vmm/src/lib.rs Outdated
            vm.send_memory_fds(unix_socket)?;
        }
        SocketStream::Tcp(tcp_socket) => {
            vm.send_memory_fds_tcp(tcp_socket)?;
Member

The purpose of --local is to not need to read/write the memory and instead use SCM_RIGHTS to move the fd across the process boundaries. --local makes no sense with TCP and should generate an error.
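
A minimal sketch of such a guard (placement and error wording assumed, not the PR's final code):

    // --local relies on SCM_RIGHTS fd passing, which only works over
    // AF_UNIX sockets on the same host; reject it for TCP destinations.
    if send_data_migration.local && send_data_migration.destination_url.starts_with("tcp:") {
        return Err(MigratableError::MigrateSend(anyhow!(
            "--local is not supported with TCP migration"
        )));
    }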

@ljrcore ljrcore Dec 3, 2024


You are right. I modified the code as you suggested, so if the user attempts a local migration over TCP they will now get an error message :)

Thanks

vmm/src/lib.rs Outdated
                ))
            })?;
        }
        SocketStream::Tcp(tcp_socket) => {
Member

This case should be an error - MemoryFd command should not be sent over TCP.

@ljrcore ljrcore Dec 3, 2024


> This case should be an error - MemoryFd command should not be sent over TCP.

Yes, TCP does not use MemoryFd. The TCP-related code has been removed here :)

Thanks

vmm/src/vm.rs Outdated

    impl Pausable for Vm {
        fn pause(&mut self) -> std::result::Result<(), MigratableError> {
            info!("VM has been paused");
Member

IMHO this is not the right place for this message, and it contradicts the line just below it; would line 2476 and the lines after be more appropriate 🧐

Author

> IMHO this is not the right place for this message, and it contradicts the line just below it; would line 2476 and the lines after be more appropriate 🧐

I agree:)

vmm/src/lib.rs Outdated
Comment on lines 1279 to 1291
                    // For TCP sockets, we cannot transfer file descriptors
                    warn!("MemoryFd command received over TCP socket, which is not supported");
                    Response::error().write_to(&mut socket).map_err(|e| {
                        MigratableError::MigrateReceive(anyhow!("Error sending response: {}", e))
                    })?;
                    return Err(MigratableError::MigrateReceive(anyhow!(
                        "MemoryFd command not supported over TCP sockets"
                    )));
Member

Since send_memory_fd would (and should) never be invoked on TcpSocket, should we change the whole section to unreachable!("MemoryFd command not supported over TCP sockets")
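
That suggestion would look roughly like this (a sketch, reusing the match arm from the snippet above):

    SocketStream::Tcp(_) => {
        // Memory fds can only be passed over AF_UNIX via SCM_RIGHTS;
        // the sender never issues MemoryFd on a TCP connection.
        unreachable!("MemoryFd command not supported over TCP sockets")
    }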

@ljrcore ljrcore Dec 3, 2024


> Since send_memory_fd would (and should) never be invoked on TcpSocket, should we change the whole section to unreachable!("MemoryFd command not supported over TCP sockets")

You are right. TCP does not use MemoryFd, so I will remove the TCP related code here:)

@ljrcore ljrcore force-pushed the migration branch 2 times, most recently from e0c509f to c3ea7fa on December 4, 2024 11:55
@ljrcore

ljrcore commented Dec 4, 2024

Hi, @rbradford

A check failed, how can I fix it?

@rbradford rbradford left a comment


Looking good - just some refactoring I think would be helpful!

vmm/src/lib.rs Outdated
Comment on lines 2239 to 2258
        // Set up the socket connection
        let socket = if send_data_migration.destination_url.starts_with("tcp:") {
            let address = &send_data_migration.destination_url[4..];
            info!("Connecting to TCP socket at {}", address);
            let socket = TcpStream::connect(address).map_err(|e| {
                MigratableError::MigrateSend(anyhow!("Error connecting to TCP socket: {}", e))
            })?;
            SocketStream::Tcp(socket)
        } else {
            let path = Self::socket_url_to_path(&send_data_migration.destination_url)?;
            info!("Connecting to UNIX socket at {:?}", path);
            let socket = UnixStream::connect(path).map_err(|e| {
                MigratableError::MigrateSend(anyhow!("Error connecting to UNIX socket: {}", e))
            })?;
            SocketStream::Unix(socket)
        };

Member

Please extract this into a method that returns `Result<SocketStream, ...>`
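
A sketch of the suggested extraction (signature assumed, body lifted from the snippet above; written as an associated fn since socket_url_to_path is on Self):

    // Connect to the destination, choosing the transport from the URL scheme.
    fn connect_to_destination(destination_url: &str) -> Result<SocketStream, MigratableError> {
        if let Some(address) = destination_url.strip_prefix("tcp:") {
            info!("Connecting to TCP socket at {}", address);
            let socket = TcpStream::connect(address).map_err(|e| {
                MigratableError::MigrateSend(anyhow!("Error connecting to TCP socket: {}", e))
            })?;
            Ok(SocketStream::Tcp(socket))
        } else {
            let path = Self::socket_url_to_path(destination_url)?;
            info!("Connecting to UNIX socket at {:?}", path);
            let socket = UnixStream::connect(path).map_err(|e| {
                MigratableError::MigrateSend(anyhow!("Error connecting to UNIX socket: {}", e))
            })?;
            Ok(SocketStream::Unix(socket))
        }
    }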

vmm/src/lib.rs Outdated
Comment on lines 2179 to 2203
        // Determine whether to use TCP or UNIX listener based on receiver_url
        let listener = if receive_data_migration.receiver_url.starts_with("tcp:") {
            let address = &receive_data_migration.receiver_url[4..];
            let listener = TcpListener::bind(address).map_err(|e| {
                MigratableError::MigrateReceive(anyhow!("Error binding to TCP socket: {}", e))
            })?;
            SocketListener::Tcp(listener)
        } else {
            let path = Self::socket_url_to_path(&receive_data_migration.receiver_url)?;
            let listener = UnixListener::bind(&path).map_err(|e| {
                MigratableError::MigrateReceive(anyhow!("Error binding to UNIX socket: {}", e))
            })?;
            SocketListener::Unix(listener, path)
        };

                    Response::ok().write_to(&mut socket)?;
                }
                Command::Complete => {
                    info!("Complete Command Received");
                    if let Some(ref mut vm) = self.vm.as_mut() {
                        vm.resume()?;
                        Response::ok().write_to(&mut socket)?;
                    } else {
                        warn!("VM not created yet");
                        Response::error().write_to(&mut socket)?;
                    }
                    break;
                }
                Command::Abandon => {
                    info!("Abandon Command Received");
                    self.vm = None;
                    self.vm_config = None;
                    Response::ok().write_to(&mut socket).ok();
                    break;
                }

        // Accept the connection and get the socket
        let socket = match listener {
            SocketListener::Unix(listener, path) => {
                let (socket, _addr) = listener.accept().map_err(|e| {
                    MigratableError::MigrateReceive(anyhow!("Error unlinking UNIX socket: {}", e))
                })?;
                std::fs::remove_file(path).map_err(|e| {
                    MigratableError::MigrateReceive(anyhow!("Error accepting on TCP socket: {}", e))
                })?;
                SocketStream::Unix(socket)
Member

Ditto, I think this could be extracted to a function.
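
Roughly like this (a sketch with assumed names; the error strings here are matched to their operations, which the next comment also touches on):

    // Accept one migration connection and normalize to SocketStream.
    fn accept_connection(listener: SocketListener) -> Result<SocketStream, MigratableError> {
        match listener {
            SocketListener::Unix(listener, path) => {
                let (socket, _addr) = listener.accept().map_err(|e| {
                    MigratableError::MigrateReceive(anyhow!("Error accepting on UNIX socket: {}", e))
                })?;
                // The listening socket file is no longer needed once connected.
                std::fs::remove_file(path).map_err(|e| {
                    MigratableError::MigrateReceive(anyhow!("Error unlinking UNIX socket: {}", e))
                })?;
                Ok(SocketStream::Unix(socket))
            }
            SocketListener::Tcp(listener) => {
                let (socket, _addr) = listener.accept().map_err(|e| {
                    MigratableError::MigrateReceive(anyhow!("Error accepting on TCP socket: {}", e))
                })?;
                Ok(SocketStream::Tcp(socket))
            }
        }
    }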

vmm/src/lib.rs Outdated
    let socket = match listener {
        SocketListener::Unix(listener, path) => {
            let (socket, _addr) = listener.accept().map_err(|e| {
                MigratableError::MigrateReceive(anyhow!("Error unlinking UNIX socket: {}", e))
Member

Not sure about this error message?

@TimePrinciple

> Hi, @rbradford
> A check failed, how can I fix it?

It could be a flake; you could ping me to restart the failed job to find out.

@ljrcore

ljrcore commented Dec 5, 2024

> > Hi, @rbradford
> > A check failed, how can I fix it?
>
> It could be a flake; you could ping me to restart the failed job to find out.

Got it, thanks:)

@ljrcore

ljrcore commented Dec 5, 2024

> Looking good - just some refactoring I think would be helpful!

Thanks for your suggestion, the code has been updated.

@likebreath likebreath left a comment


Thank you for the contribution. Generally looking good to me. Some comments below.

One note: I can see that using a TCP socket can reduce the live migration downtime compared with a Unix socket, but I wouldn't draw the conclusion that it reduces the downtime by 50%. The live migration downtime depends a lot on the workload being run.

@ljrcore

ljrcore commented Dec 6, 2024

> Thank you for the contribution. Generally looking good to me. Some comments below.
>
> One note: I can see that using a TCP socket can reduce the live migration downtime compared with a Unix socket, but I wouldn't draw the conclusion that it reduces the downtime by 50%. The live migration downtime depends a lot on the workload being run.

Thanks for your review. The live migration downtime does depend a lot on the workload being run. I got this data while applying memory pressure with the following command:

    stress -m 4 --vm-bytes 256M

If the memory pressure changes, the data will change accordingly. It may also depend on the test environment; I used a 4-vCPU/8 GB virtual machine for testing.

@likebreath likebreath left a comment


I believe you accidentally dropped the commit that adds an integration test.

I took a closer look at the PR and have some additional comments below.

Bump vm-memory from 0.16.0 to 0.16.1 to include the implementations of
ReadVolatile and WriteVolatile for TcpStream.

Signed-off-by: Jinrong Liang <[email protected]>
@ljrcore

ljrcore commented Dec 10, 2024

Thanks for your patience and careful review of my code. I have modified and updated the code according to your suggestions. I appreciate your time and effort in this process.

@ljrcore

ljrcore commented Dec 10, 2024

> > Hi, @rbradford
> > A check failed, how can I fix it?
>
> It could be a flake; you could ping me to restart the failed job to find out.

@TimePrinciple Hi, Ruoqing

Could you restart failed jobs?

@likebreath likebreath left a comment


Almost forgot this last issue. Other than that, I think this PR is ready to land.

Jinrong Liang added 3 commits December 12, 2024 10:35
Add support for cross-host live migration over TCP, which
significantly improves performance in cross-host live migration
scenarios compared to the Unix socket forwarding method.

Signed-off-by: Jinrong Liang <[email protected]>
Updated the live migration documentation to include instructions for
performing cross-host live migrations over TCP connections.

Signed-off-by: Jinrong Liang <[email protected]>
Add integration test for tcp live migration to ensure live migration
functions as expected.

Signed-off-by: Jinrong Liang <[email protected]>
@rbradford rbradford added this pull request to the merge queue Dec 18, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 18, 2024
@rbradford rbradford added this pull request to the merge queue Dec 18, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 18, 2024
@rbradford rbradford added this pull request to the merge queue Dec 18, 2024
Merged via the queue into cloud-hypervisor:main with commit bbd8d3b Dec 18, 2024
40 checks passed
