vm-migration: Add support for downtime limits #7033
Conversation
rbradford
left a comment
Wow - this is very interesting! Perhaps these two features could be separate PRs. That might make them easier to land and review.
vmm/src/lib.rs
Outdated
```rust
// Update statistics
s.pending_size = final_table.regions().iter().map(|range| range.length).sum();
s.total_transferred_bytes += s.pending_size;
s.current_dirty_pages = s.pending_size / PAGE_SIZE as u64;
```
Is `pending_size` guaranteed to be a multiple of `PAGE_SIZE`? Otherwise we should perhaps use `.div_ceil()`?
```rust
// Update the number of dirty pages
s.total_transferred_bytes += s.pending_size;
s.current_dirty_pages = s.pending_size / PAGE_SIZE as u64;
s.total_transferred_dirty_pages += s.current_dirty_pages;
```
Ditto.
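The reviewer's point in miniature (an illustrative sketch, not the PR's actual code): plain integer division truncates, so a partial final page would be under-counted, while `div_ceil` rounds up so it still counts as a dirty page.

```rust
// Illustrative sketch: div_ceil rounds up, so a partial final page
// still counts as one dirty page; plain division would drop it.
const PAGE_SIZE: u64 = 4096;

fn dirty_pages(pending_size: u64) -> u64 {
    pending_size.div_ceil(PAGE_SIZE)
}
```

For example, `dirty_pages(4097)` yields 2 with `div_ceil`, whereas `4097 / 4096` would yield 1.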
Thanks for your feedback. That makes sense. I will split these two features into separate PRs to make the review process easier. I have submitted a separate PR about asynchronous migration (#7038). After the asynchronous PR is merged, I will modify this PR to submit only the downtime limit feature.
likebreath
left a comment
Just a heads-up: we will need to update our docs and RESTful API spec, and also add integration tests for the new feature added to /vm.send-migration.
For the record: We used this as a base to implement auto-converging (vCPU throttling) relatively easily! Well done so far!
@phip1611 That's really great. The live migration integration test for this PR is timing out and I haven't figured out why. Do you have any suggestions?
In case things fail suspiciously: Add … You can run the test locally if you execute …
Since I did not have enough time to devote to live migration, the subsequent live migration contributions were carried out by Songqian @lisongqian.
Hi everyone, the PR is ready.
```rust
src_api_socket: &str,
dest_api_socket: &str,
local: bool,
downtime: u64,
```
```diff
-downtime: u64,
+downtime: Duration,
```
This would make the code much clearer, and people wouldn't need to remember: "is this seconds? Is this minutes? Is this milliseconds?"
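A small sketch of the suggestion (hypothetical function name): a `Duration`-typed parameter carries its unit in the type, so every call site states it explicitly instead of passing an ambiguous bare `u64`.

```rust
use std::time::Duration;

// Hypothetical sketch: the unit is fixed by the Duration constructor at the
// call site (from_millis, from_secs, ...), not by out-of-band convention.
fn downtime_millis(downtime: Duration) -> u128 {
    downtime.as_millis()
}
```

A caller then writes `downtime_millis(Duration::from_millis(300))`, which cannot be misread as seconds.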
This commit encapsulates the memory copy stage of cross-host live migration ops as a function do_memory_migration.

Signed-off-by: Jinrong Liang <[email protected]>
Signed-off-by: Songqian Li <[email protected]>

Add handling of migration timeout failures to provide more flexible live migration options. Implement downtime limiting logic to minimize service disruptions. Support for setting downtime thresholds and migration timeouts.

Signed-off-by: Jinrong Liang <[email protected]>
Signed-off-by: Songqian Li <[email protected]>

Updated live migration documentation to include migration timeout controls and downtime limits.

Signed-off-by: Jinrong Liang <[email protected]>
Signed-off-by: Songqian Li <[email protected]>

Signed-off-by: Jinrong Liang <[email protected]>
Signed-off-by: Songqian Li <[email protected]>
Current (squashed) state of: https://github.com/cloud-hypervisor/cloud-hypervisor/pull/7033/commits

vm-migration: Add support for downtime limits

Add handling of migration timeout failures to provide more flexible live migration options. Implement downtime limiting logic to minimize service disruptions. Support for setting downtime thresholds and migration timeouts.

Signed-off-by: Jinrong Liang <[email protected]>
Signed-off-by: Songqian Li <[email protected]>

docs: Add migration parameters to live migration document

Updated live migration documentation to include migration timeout controls and downtime limits.

Signed-off-by: Jinrong Liang <[email protected]>
Signed-off-by: Songqian Li <[email protected]>

tests: Add downtime and migration timeout tests

Signed-off-by: Jinrong Liang <[email protected]>
Signed-off-by: Songqian Li <[email protected]>
phip1611
left a comment
Let's not block this any longer with more nit-picks. We think this is a very good step forward. Approved. All further non-critical concerns and nitpicks will be handled in a follow-up.
@lisongqian @phip1611 You are on the right path with separating the changes on do_memory_migration into their own commit. There is a lot more to be done for breaking changes into logical and self-contained commits, particularly commit 61aaaf2.
To summarize my detailed comments below, my general suggestions moving forward:
- Drop non-essential changes from the PR, particularly about providing statistics (let's do it later in a separate PR);
- Separate API changes (e.g. introduction of `downtime` and `migration_timeout`, changes to the yaml file, ch-remote, etc.) from the core functional changes (e.g. changes around `memory_copy_iterations()`);
- Break down the core functional changes as suggested below;
```rust
/// Microsecond level downtime
#[serde(default = "default_downtime")]
pub downtime: u64,
/// Second level migration timeout
```
The comment needs improvement to clarify what this is, particularly how it differs from `downtime`.
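One way to address this (hypothetical wording and struct name, and note the quoted code above says "microsecond" while the validation elsewhere in the PR treats `downtime` as milliseconds, so the units also need reconciling) is to spell out the unit and semantics of each field in its doc comment:

```rust
// Sketch (hypothetical wording): doc comments that make the units and the
// difference between the two fields explicit.
struct SendMigrationData {
    /// Maximum tolerated VM pause during the final stop-and-copy phase,
    /// in milliseconds; memory is copied iteratively until the remaining
    /// dirty data fits within this window.
    downtime: u64,
    /// Overall wall-clock budget for the entire migration, in seconds;
    /// if it elapses before convergence, the migration is aborted.
    migration_timeout: u64,
}
```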
```rust
// Define the maximum allowed downtime 2000 seconds (2000000 milliseconds)
const MAX_MIGRATE_DOWNTIME: u64 = 2000000;

// Verify that downtime must be between 1 and MAX_MIGRATE_DOWNTIME
if send_data_migration.downtime == 0 || send_data_migration.downtime > MAX_MIGRATE_DOWNTIME
{
    return Err(MigratableError::MigrateSend(anyhow!(
        "downtime_limit must be an integer in the range of 1 to {} ms",
        MAX_MIGRATE_DOWNTIME
    )));
}

// Now pause VM
let migration_timeout = Duration::from_secs(send_data_migration.migration_timeout);
let migrate_downtime_limit = Duration::from_millis(send_data_migration.downtime);

// Verify that downtime must be less than the migration timeout
if !migration_timeout.is_zero() && migrate_downtime_limit >= migration_timeout {
    return Err(MigratableError::MigrateSend(anyhow!(
        "downtime_limit {}ms must be less than migration_timeout {}ms",
        send_data_migration.downtime,
        send_data_migration.migration_timeout * 1000
    )));
}
```
Configuration validation should be done at the entry of the HTTP API handler; in this case that is vm_send_migration().
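A sketch of this suggestion (hypothetical helper name, mirroring the checks quoted above): validate the parameters once at the API entry point, before any migration work begins, rather than deep inside the send path.

```rust
// Hypothetical validation helper for the API entry point. A zero timeout
// means "no overall deadline", matching the is_zero() check in the PR.
fn validate_send_migration(downtime_ms: u64, migration_timeout_s: u64) -> Result<(), String> {
    if downtime_ms == 0 {
        return Err("downtime must be non-zero".to_string());
    }
    let timeout_ms = migration_timeout_s.saturating_mul(1000);
    if timeout_ms != 0 && downtime_ms >= timeout_ms {
        return Err(format!(
            "downtime {downtime_ms}ms must be less than migration_timeout {timeout_ms}ms"
        ));
    }
    Ok(())
}
```

Rejecting bad input here means the caller gets an HTTP-level error immediately and the migration thread never starts with invalid settings.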
```rust
// Define the maximum allowed downtime 2000 seconds (2000000 milliseconds)
const MAX_MIGRATE_DOWNTIME: u64 = 2000000;
```
Not sure how this value is set or whether such a configuration limit is useful, given the user dictates their preferred downtime. I'd think we can drop it.
```rust
struct MigrationState {
    current_dirty_pages: u64,
    downtime: Duration,
    downtime_start: Instant,
    iteration: u64,
    iteration_cost_time: Duration,
    iteration_start_time: Instant,
    mb_per_sec: f64,
    pages_per_second: u64,
    pending_size: u64,
    start_time: Instant,
    threshold_size: u64,
    total_time: Duration,
    total_transferred_bytes: u64,
    total_transferred_dirty_pages: u64,
}
```
I understand such a struct is mostly for providing statistics with live migration, and this is good to have. However, I don't see that we need it for downtime-limit support: most of the members are internal to `memory_copy_iterations` and never used anywhere else.
To keep the change small and self-contained, I'd drop this struct along with the related changes on providing statistics.
```rust
/// Try to find a timing to send dirty log. As the VM keeps running, this function continuously
/// transmits the memory delta in multiple iterations until the delta is small enough to fulfil
/// the desired downtime. The final dirty log is not transmitted by this function.
fn memory_copy_iterations(
    vm: &mut Vm,
    socket: &mut SocketStream,
    s: &mut MigrationState,
    migration_timeout: Duration,
    migrate_downtime_limit: Duration,
) -> result::Result<MemoryRangeTable, MigratableError> {
```
Here are the core functional changes, which are mixed with many things that I believe can be broken down.

With non-essential changes removed, I can see this function essentially provides the following functionality:

- Check for convergence based on the downtime limit and bandwidth;
- Send dirty pages;
- Measure and update bandwidth;

I believe these functionalities are best implemented by gradually reusing and extending vm_maybe_send_dirty_pages(). I'd suggest you break down the commits as follows:

1. vmm: Support measurement of bandwidth with live migration

   Essentially extending it to:

   ```rust
   fn vm_maybe_send_dirty_pages(
       vm: &mut Vm,
       socket: &mut SocketStream,
       bandwidth: &mut f64, // initial value would be 0, updated on each iteration (i.e. each call to this function)
   ) -> result::Result<bool, MigratableError>
   ```

2. vmm: Check convergence with bandwidth and downtime limit

   Essentially, making it return `true` only when the convergence criteria are met.

3. vmm: Converge live-migration based on downtime limit

   Essentially, adapting the caller `send_migration()` and switching to rely on the newly added convergence check to pause the VM, rather than using the hard-coded 5-iteration approach.
In case this is not clear: the changes for aborting on migration_timeout should be separated from the above changes (e.g. downtime limit support) and done in their own commit(s).
@phip1611 Do we still need this PR? I know you've been working on this problem. Are you close to submitting your version?
@rbradford We depend on it rather than having a completely custom, distinct solution. I'll try to contact the original author and ask if it's okay for us to take over this PR. We'd love to have this merged very soon.
Feel free to modify this MR as per your preference, thank you.
Thanks! As 1) there were requests to split this into smaller PRs and 2) this PR has already received so many comments that it is hard to keep an overview of everything, I'm in favor of closing it. I'd prefer to open new PRs under my control (with respect to …).
Hi,
These patches introduce two important features for live migration:
1. Downtime limits
In live migration scenarios, service downtime directly affects user experience and application availability. Without configurable downtime limits, migration may cause unpredictable or excessive service interruptions, especially under high load or network fluctuations. By introducing downtime thresholds, users can:
2. Asynchronous migration
Previously, the migration process was handled synchronously, blocking the VMM thread, causing clh to be unable to respond to other migration-related requests. By refactoring the migration process to run asynchronously in a separate thread, clh can now:
Together, these features make live migration more robust and user-friendly, especially in production environments with strict uptime and management requirements.
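The asynchronous design described in point 2 can be sketched roughly as follows (a hypothetical helper, not the VMM's actual code): the long-running migration loop moves to its own thread, while the control thread stays free to answer API requests and can observe progress over a channel.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical sketch of async migration: the control thread spawns the
// migration loop and keeps a handle plus a progress channel, so it can
// poll status or join for the final result without ever blocking the API.
fn spawn_migration() -> (thread::JoinHandle<Result<(), String>>, mpsc::Receiver<u64>) {
    let (progress_tx, progress_rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        for iteration in 0..3u64 {
            // ... one round of dirty-memory copying would happen here ...
            let _ = progress_tx.send(iteration);
        }
        Ok(())
    });
    (handle, progress_rx)
}
```

The control thread can then drain `progress_rx` to report statistics and call `handle.join()` only when the migration finishes or is aborted.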
Looking forward to your comments and feedback.
Thanks