fix: prevent snapshot COW child disks from being deleted by Disk::Drop #330

Merged
DorianZheng merged 1 commit into boxlite-ai:main from acmerfight:fix-snapshot-cow-disk-leak
Mar 4, 2026

acmerfight commented Mar 2, 2026

Impact

All snapshot operations are broken. After snapshot.create() or snapshot.restore(), the box becomes un-startable — the operation returns success, but the box's disk files (disk.qcow2, guest-rootfs.qcow2) are silently deleted within the same function call. The next box.start() fails because the required disk files no longer exist.

  • Affected: LiteBox::snapshots().create(), LiteBox::snapshots().restore()
  • Not affected: clone, export (they already call .leak() correctly)
  • Reproducibility: 100% — every snapshot operation triggers the bug

Root cause

Disk is an RAII type (boxlite/src/disk/image.rs) that deletes its file on drop when persistent=false. create_cow_child_disk() always returns Disk { persistent: false }. Callers must call .leak() to prevent the destructor from deleting the file.
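
The drop-versus-leak behaviour described above can be sketched as follows. This is a minimal illustration, not boxlite's actual code: the Disk struct, its create() constructor, the return type of leak(), and the drop_vs_leak() helper are all assumptions made for the example.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for boxlite's Disk: it owns a file path and
/// removes the file on drop unless it has been marked persistent.
pub struct Disk {
    path: PathBuf,
    persistent: bool,
}

impl Disk {
    /// Creates an empty file at `path` and returns a non-persistent handle
    /// (assumed constructor, for illustration only).
    pub fn create(path: &Path) -> io::Result<Disk> {
        fs::File::create(path)?;
        Ok(Disk { path: path.to_path_buf(), persistent: false })
    }

    /// Consumes the handle and keeps the file on disk.
    pub fn leak(mut self) -> PathBuf {
        self.persistent = true;
        // `self` is dropped when this function returns, but Drop now
        // sees persistent == true and leaves the file alone.
        self.path.clone()
    }
}

impl Drop for Disk {
    fn drop(&mut self) {
        if !self.persistent {
            // Best-effort cleanup; errors are ignored in this sketch.
            let _ = fs::remove_file(&self.path);
        }
    }
}

/// Returns (exists_after_drop, exists_after_leak) for two scratch files.
pub fn drop_vs_leak(dir: &Path) -> io::Result<(bool, bool)> {
    let dropped = dir.join("dropped.qcow2");
    let leaked = dir.join("leaked.qcow2");
    {
        let _disk = Disk::create(&dropped)?;
        // `_disk` goes out of scope here and its file is removed.
    }
    Disk::create(&leaked)?.leak();
    Ok((dropped.exists(), leaked.exists()))
}
```

Under these assumptions, the file whose handle is simply dropped disappears, while the file whose handle is leaked survives.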

In local_snapshot.rs, all 4 calls to create_cow_child_disk() neglect to call .leak():

  • do_snapshot_create() uses if let Err(e) = create_cow_child_disk(...) — the Ok(Disk) is never bound, so it drops immediately
  • do_snapshot_restore() uses create_cow_child_disk(...)?; — the Disk is unwrapped by ? then dropped at the semicolon
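
Both patterns discard the Ok value, so its destructor runs before the next statement. The sketch below reproduces that timing with a hypothetical Guard type that records its own drop. Guard and make_guard() are stand-ins for Disk and create_cow_child_disk(), not the real API.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical guard that records when it is dropped, standing in for a
/// non-persistent Disk whose Drop would delete the file.
struct Guard(Rc<Cell<bool>>);

impl Drop for Guard {
    fn drop(&mut self) {
        self.0.set(true);
    }
}

/// Stand-in for create_cow_child_disk(): always succeeds in this sketch.
fn make_guard(dropped: &Rc<Cell<bool>>) -> Result<Guard, String> {
    Ok(Guard(Rc::clone(dropped)))
}

/// Pattern from do_snapshot_create(): only the Err arm is bound, so the
/// Ok(Guard) temporary is dropped when the `if let` statement ends.
fn if_let_err_pattern() -> bool {
    let dropped = Rc::new(Cell::new(false));
    if let Err(e) = make_guard(&dropped) {
        eprintln!("rollback: {e}");
    }
    dropped.get()
}

/// Pattern from do_snapshot_restore(): `?` unwraps the Guard, and the
/// unbound value is dropped at the semicolon.
fn question_mark_pattern() -> Result<bool, String> {
    let dropped = Rc::new(Cell::new(false));
    make_guard(&dropped)?;
    Ok(dropped.get())
}
```

In both functions the flag is already set by the time it is read, which mirrors how the snapshot code returned success after the child disk had been deleted.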

Every other call site in the codebase handles this correctly:

  • container_rootfs.rs:233 calls temp_disk.leak()
  • guest_rootfs.rs:150 calls temp_disk.leak()
  • qcow2.rs:827,833 (clone_disk_pair) calls create_cow_child_disk(...)?.leak()

Fix

Add .leak() to all 4 create_cow_child_disk() return values in local_snapshot.rs:

| Location | Change |
| --- | --- |
| do_snapshot_create() container disk (L248) | if let Err → match with Ok(disk) => { disk.leak(); }; error branch keeps rollback |
| do_snapshot_create() guest disk (L288) | Same match pattern; error branch keeps full rollback, including container cleanup |
| do_snapshot_restore() container disk (L375) | )?; → )?.leak(); |
| do_snapshot_restore() guest disk (L393) | )?; → )?.leak(); |
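
The two fixed shapes can be sketched as below. This is an approximation under stated assumptions: Guard, make_guard(), create_fixed(), and restore_fixed() are hypothetical names, and the rollback step is reduced to returning an error.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for Disk: sets `deleted` on drop unless leaked.
struct Guard {
    deleted: Rc<Cell<bool>>,
    persistent: bool,
}

impl Guard {
    /// Consumes the guard so that its Drop becomes a no-op.
    fn leak(mut self) {
        self.persistent = true;
    }
}

impl Drop for Guard {
    fn drop(&mut self) {
        if !self.persistent {
            self.deleted.set(true);
        }
    }
}

/// Stand-in for create_cow_child_disk(); `fail` forces the error path.
fn make_guard(deleted: &Rc<Cell<bool>>, fail: bool) -> Result<Guard, String> {
    if fail {
        return Err("qemu-img failed".to_string());
    }
    Ok(Guard { deleted: Rc::clone(deleted), persistent: false })
}

/// Create-style fix: bind the Ok value and leak it; the error arm still
/// has room for rollback (reduced here to returning an error).
fn create_fixed(deleted: &Rc<Cell<bool>>, fail: bool) -> Result<(), String> {
    match make_guard(deleted, fail) {
        Ok(disk) => {
            disk.leak();
        }
        Err(e) => {
            // Rollback of earlier steps would run here.
            return Err(format!("rolled back: {e}"));
        }
    }
    Ok(())
}

/// Restore-style fix: leak directly on the unwrapped value.
fn restore_fixed(deleted: &Rc<Cell<bool>>) -> Result<(), String> {
    make_guard(deleted, false)?.leak();
    Ok(())
}
```

The match form is needed in the create path because the error arm has work to do. The restore path only propagates the error, so chaining .leak() after ? is enough.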

Tests

Added boxlite/tests/snapshot.rs — 5 integration tests following clone_export_import.rs conventions:

| Test | Assertion type | Verifies |
| --- | --- | --- |
| test_cow_child_disks_exist_after_snapshot_create | Filesystem | disk.qcow2 and guest-rootfs.qcow2 exist in box dir after create |
| test_box_restartable_after_snapshot_create | Functional | Box starts and runs echo alive after create |
| test_cow_child_disks_exist_after_snapshot_restore | Filesystem | Both disks exist after restore |
| test_box_startable_after_snapshot_restore | Functional | Box starts after restore, cat /root/ver.txt returns snapshot content |
| test_snapshot_list_returns_created_snapshot | Metadata | Snapshot appears in list() after create |

The tests use PerTestBoxHome::new() for isolation and exercise the public LiteBox::snapshots() API. The VM integration tests require make runtime-debug.
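
A filesystem-type assertion of the kind listed above could look roughly like this. The missing_disks() helper is hypothetical and is not taken from boxlite/tests/snapshot.rs; only the two disk file names come from the description.

```rust
use std::path::Path;

/// Disk file names taken from the PR description.
const REQUIRED_DISKS: [&str; 2] = ["disk.qcow2", "guest-rootfs.qcow2"];

/// Returns the required disk files that are missing from `box_dir`.
/// Hypothetical helper; not part of boxlite's test suite.
pub fn missing_disks(box_dir: &Path) -> Vec<&'static str> {
    REQUIRED_DISKS
        .iter()
        .copied()
        .filter(|name| !box_dir.join(name).is_file())
        .collect()
}
```

A test could then assert that missing_disks(box_dir) is empty after a snapshot create or restore, which reports every missing file at once instead of stopping at the first.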

CI results (same commit ac6f182, run on fork)

| Check | Platform | Result |
| --- | --- | --- |
| Rust formatting (cargo fmt --check) | ubuntu | pass |
| Clippy (cargo clippy -- -D warnings) | ubuntu | pass |
| Clippy | macos-15 (ARM64) | pass |
| Rust Tests (boxlite-shared unit) | linux-x64-gnu | pass |
| Rust Tests (boxlite-shared unit) | darwin-arm64 | pass |

Full CI run: https://github.com/acmerfight/boxlite/actions/runs/22585287612

Local verification

  • cargo fmt --check — pass
  • cargo clippy -p boxlite --tests -- -D warnings — 0 warnings
  • cargo test -p boxlite --lib -- litebox::local_snapshot — 12/12 unit tests pass
  • cargo test -p boxlite --test snapshot — requires VM runtime (make runtime-debug)

create_cow_child_disk() returns Disk with persistent=false. When the
Disk goes out of scope, its Drop impl deletes the file via
std::fs::remove_file. In do_snapshot_create() and do_snapshot_restore(),
the returned Disk was never leaked, causing the newly created COW child
QCOW2 files to be immediately deleted.

This made the box un-startable after snapshot create or restore, because
the required disk.qcow2 and guest-rootfs.qcow2 files no longer existed
on disk.

Fix: call .leak() on all four create_cow_child_disk() return values in
local_snapshot.rs, matching the correct pattern already used in
container_rootfs.rs, guest_rootfs.rs, and qcow2.rs::clone_disk_pair().

Added boxlite/tests/snapshot.rs with 5 integration tests covering disk
integrity and functional verification after snapshot create and restore.

DorianZheng merged commit 2f8bb0f into boxlite-ai:main on Mar 4, 2026
18 checks passed