Skip to content

Conversation

@weltling
Copy link
Member

@weltling weltling commented Dec 3, 2025

This PR addresses cluster leaks in QCOW2 format handling through:

  • Fix cluster leak when refcount blocks are replaced - recursively decrement refcounts when refcount blocks need reallocation
  • Fix cluster leak when converting compressed clusters - decrement refcounts for all clusters spanned over the same compressed data

These fixes prevent disk space leaks when working with compressed clusters and during refcount table operations.

@weltling weltling requested a review from a team as a code owner December 3, 2025 13:08
@weltling weltling changed the title QCOW2 Cluster Leak Fixes block: qcow: Cluster Leak Fixes Dec 3, 2025
@weltling weltling force-pushed the qcow-fixes-cluster-leaks2 branch from db8285f to 0dc0367 Compare December 3, 2025 13:13
@phip1611
Copy link
Member

phip1611 commented Dec 3, 2025

Great! Is there some way to test this?

@weltling
Copy link
Member Author

weltling commented Dec 3, 2025

Right! There was still #7527 I've been preparing separately. I was just checking that one once again and it seems the refcount leaks is what is only remaining now (rebased to include the previous data corruption fix).

I can actually pull those integration test changes over here. Most of the common_parallel::test_virtio_block_qcow2* are relevant. In case there are no unrelated issues in the integrated tests, these patches should be fully covered.

Thanks

@phip1611
Copy link
Member

phip1611 commented Dec 3, 2025

I don't have a strong opinion - I'm slightly in favor of having the tests in the same PR.

@weltling weltling force-pushed the qcow-fixes-cluster-leaks2 branch from 34b63ea to ebbcb2e Compare December 3, 2025 15:29
@weltling
Copy link
Member Author

weltling commented Dec 3, 2025

I don't have a strong opinion - I'm slightly in favor of having the tests in the same PR.

Usually it's the norm. This deviation has been just because the initial extent of the issue was unknown, so a patch set might have become quite unwieldy.

Thanks

@rbradford
Copy link
Member

rbradford commented Dec 3, 2025

@weltling Yeh - if the tests in your other PR are now passing do please pull those into this PR and close that one.

@weltling
Copy link
Member Author

weltling commented Dec 3, 2025

@weltling Yeh - if the tests in your other PR are now passing do please pull those into this PR and close that one.

Yup, for what it matters the test changes are passing now, covering the actual bugfixes. I just closed the other PR.

Thanks

Copy link
Member

@rbradford rbradford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will need to rebase for the actions change in #7544.

When a refcount block is evicted from cache and replaced with a new
one, the old refcount block cluster was added to unref_clusters but
its refcount was never decremented to 0 on disk. This left the cluster
with refcount=1 while no metadata referenced it, causing errors in
qemu-img check

`Leaked cluster X refcount=1 reference=0`

This fix recursively calls set_cluster_refcount(freed_cluster, 0) to
properly decrement the freed refcount block's refcount on disk. The
recursion handles cascading replacements where freeing one refcount
block may trigger the replacement of another.

Signed-off-by: Anatol Belski <[email protected]>
When converting a compressed cluster to standard during write
operations, the old compressed cluster's refcount was never
decremented, causing leak warnings by `qemu-img check ..`

`Leaked cluster X refcount=N reference=M`

Additionally, compressed data can span multiple physical clusters,
not just one. The compressed cluster address and size are encoded
in the L2 entry, and the data may cross cluster boundaries.

The proper handling is implemented as follows:
- Extract compressed cluster address and size before overwriting
  L2 entry
- Identify all clusters occupied by the compressed data
- Decrement refcount for each cluster in the range

Signed-off-by: Anatol Belski <[email protected]>
Freed clusters correctly have refcount=0. Remove the assertion that
expected no clusters with zero refcount, as it was validating the
buggy behavior.

Signed-off-by: Anatol Belski <[email protected]>
The image passed for the guest construction is copied. Previously,
check-img has been checking the unchanged image from the workspace dir,
which is supposed to be error free.

Signed-off-by: Anatol Belski <[email protected]>
@weltling weltling force-pushed the qcow-fixes-cluster-leaks2 branch from ebbcb2e to 4e1aef9 Compare December 5, 2025 14:48
@weltling
Copy link
Member Author

weltling commented Dec 5, 2025

You will need to rebase for the actions change in #7544.

Done, thanks.

@rbradford rbradford enabled auto-merge December 5, 2025 15:01
@rbradford rbradford added this pull request to the merge queue Dec 5, 2025
Copy link
Member

@phip1611 phip1611 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very great contribution, thank you!

Merged via the queue into cloud-hypervisor:main with commit f56adb8 Dec 5, 2025
43 checks passed
@weltling weltling deleted the qcow-fixes-cluster-leaks2 branch December 6, 2025 10:26
@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in Cloud Hypervisor Roadmap Dec 11, 2025
@likebreath likebreath added the bug-fix Bug fix to include in release notes label Dec 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix Bug fix to include in release notes

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants