block: virtio-blk: report IO errors to the guest #7107

gjolly · 2025-05-28T20:17:25Z

Instead of exiting on IO errors, report the errors to the guest with VIRTIO_BLK_S_IOERR. For example, the guest kernel will log something similar to this if the nbd behind /dev/vdc is unexpectedly disconnected:

[ 166.033957] I/O error, dev vdc, sector 264 op 0x1:(WRITE) flags 0x9800 phys_seg 1 prio class 2
[ 166.035083] Aborting journal on device vdc-8.
[ 166.037307] Buffer I/O error on dev vdc, logical block 9, lost sync page write
[ 166.038471] JBD2: I/O error when updating journal superblock for vdc-8.
[...]
[ 174.234470] EXT4-fs (vdc): I/O error while writing superblock

In case the rootfs is not located on the affected block device, this will not crash the host.

Fixes: #6995

likebreath

Thank you for the contribution. Besides the comment below, I believe this patch only handles the async block backend, while we also need to have the same behavior for the synchronous backend. Please take a look at process_queue_submit(), particularly the call to request.execut_async() which returns false on no error when using synchronous backend.

virtio-devices/src/block.rs

gjolly · 2025-05-29T09:05:54Z

As far as I can tell, the only synchronous operation handled by execute_async() is GetDeviceId which returns the serial of the device to the host and doesn't involve any IO.

likebreath · 2025-05-29T18:21:56Z

As far as I can tell, the only synchronous operation handled by execute_async() is GetDeviceId which returns the serial of the device to the host and doesn't involve any IO.

Please checkout the different implementations of pub trait AsyncIo, and you will see most of them are actually using synchronous IO backends. The only async IO backends we use are io_uring and aio both for raw images only.

likebreath

Changes now look good to me. Just kicked off the CI pipeline.

Would you please look into if it is possible to add an integration test for this, e.g. manually trigger such error on the host and check the guest is still alive with expected errors reported to the guest kernel log? Thank you.

gjolly · 2025-06-05T11:10:29Z

@likebreath thank you for your review. I am working on the integration tests.

I have a few questions:

I see that the link check and the fuzz build have failed. It seems to be unrelated with my change. Are these known issues?
The log lines in the commit message exceed the limit of 72 characters per line. Should I drop them or is it ok?
I haven't looked everywhere but I don't see any obvious documentation on how to run the integration tests. Does such doc exist?

rbradford · 2025-06-05T15:51:55Z

@likebreath thank you for your review. I am working on the integration tests.

I have a few questions:
1. I see that the link check and the fuzz build have failed. It seems to be unrelated with my change. Are these known issues?

They won't hold up merging your PR (if you rebase you will see the fuzz failures gone I think.)

2. The log lines in the commit message exceed the limit of 72 characters per line. Should I drop them or is it ok?

That's fine - the check is to ensure that normal paragraphs are appropriately wrapped.

3. I haven't looked everywhere but I don't see any obvious documentation on how to run the integration tests. Does such doc exist?

It's not easy to do but you can use scripts/dev_cli.sh to run them

rbradford

I imagine this would be very tricky to test with an integration test - the code looks okay to me.

rbradford · 2025-06-05T15:55:24Z

virtio-devices/src/block.rs

+                let status = match result {
+                    Ok(_) => VIRTIO_BLK_S_OK,
+                    Err(e) => {
+                        error!("Request failed: {:x?} {:?}", request, e);


Do we think error! might be too aggressive here - it's no longer fatal for the VMM so maybe it should be warn! @likebreath WDYT?

I think that makes sense, both for the sync and async code path.

gjolly · 2025-06-05T20:31:08Z

I imagine this would be very tricky to test with an integration test

My reproducer setup involves NBD and qemu-nbd, which seems to require root. So it's not as simple to add to the tests as I originally thought.

likebreath · 2025-06-06T17:06:58Z

I imagine this would be very tricky to test with an integration test

My reproducer setup involves NBD and qemu-nbd, which seems to require root. So it's not as simple to add to the tests as I originally thought.

@gjolly Thanks for looking into this. That's where I was not sure and wanted you to help and find out. For such case, I am fine with not adding a integration test.

Can you please manually validate both the async and sync cases? To enforce the sync backend of block devices, you can either use the internal options of --disk (for testing purpose only) like below:

cloud-hypervisor/tests/integration.rs

Lines 3254 to 3257 in 59e11f1

    
           "path={},readonly=on,direct=on,num_queues=4,_disable_io_uring={},_disable_aio={}", 
        
           blk_file_path.to_str().unwrap(), 
        
           disable_io_uring, 
        
           disable_aio,

Can you please also rebase that should resolve the fuzz build error? Thank you.

likebreath · 2025-06-06T17:24:00Z

For the commit message, I believe you meant to say:

In case the rootfs is not located on the affected block device, this will not crash the ~~host~~ guest.

rbradford · 2025-06-06T18:50:50Z

I imagine this would be very tricky to test with an integration test

My reproducer setup involves NBD and qemu-nbd, which seems to require root. So it's not as simple to add to the tests as I originally thought.

@gjolly Thanks for looking into this. That's where I was not sure and wanted you to help and find out. For such case, I am fine with not adding a integration test.

I agree - we don't need an integration test if you've manually verified it.

Instead of exiting on IO errors, report the errors to the guest with VIRTIO_BLK_S_IOERR. For example, the guest kernel will log something similar to this if the nbd behind /dev/vdc is unexpectedly disconnected: [ 166.033957] I/O error, dev vdc, sector 264 op 0x1:(WRITE) flags 0x9800 phys_seg 1 prio class 2 [ 166.035083] Aborting journal on device vdc-8. [ 166.037307] Buffer I/O error on dev vdc, logical block 9, lost sync page write [ 166.038471] JBD2: I/O error when updating journal superblock for vdc-8. [...] [ 174.234470] EXT4-fs (vdc): I/O error while writing superblock In case the rootfs is not located on the affected block device, this will not crash the guest. Fixes: cloud-hypervisor#6995 Signed-off-by: Gauthier Jolly <[email protected]>

gjolly · 2025-06-09T14:03:31Z

I rebased and I think I fixed all the things that were mentioned (commit message and error level)
For the tests, here is what I did:

truncate -s 1M disk.raw
qemu-nbd --connect /dev/nbd0 disk.raw
./cloud-hypervisor -disk path=/dev/nbd0 ...

Then when the VM is running I simply do:

(guest) echo "foo" > /dev/vdc # that should not raise any error
(host) qemu-nbd --disconnect /dev/nbd0
(guest) echo "foo" > /dev/vdc # that should trigger the kernel errors + the hypervisor warnings

At the end the guest should still be running.

The same process can be repeated but with --disk path=/dev/nbd0,_disable_aio=true,_disable_io_uring=true for the sync case.

rbradford · 2025-06-09T16:36:32Z

@likebreath If you're happy with the changes could you approve so this can be merged.

gjolly requested a review from a team as a code owner May 28, 2025 20:17

likebreath requested changes May 28, 2025

View reviewed changes

virtio-devices/src/block.rs Outdated Show resolved Hide resolved

gjolly force-pushed the virtio-blk_report_err branch from 753d09b to 59af0df Compare May 29, 2025 09:02

gjolly force-pushed the virtio-blk_report_err branch from 59af0df to 39f1335 Compare May 30, 2025 14:40

likebreath reviewed Jun 4, 2025

View reviewed changes

rbradford reviewed Jun 5, 2025

View reviewed changes

likebreath added this to Cloud Hypervisor Roadmap Jun 6, 2025

likebreath moved this to 🏗 In progress in Cloud Hypervisor Roadmap Jun 6, 2025

gjolly force-pushed the virtio-blk_report_err branch from 39f1335 to 4c9713d Compare June 9, 2025 13:18

gjolly force-pushed the virtio-blk_report_err branch from 4c9713d to fdf7339 Compare June 9, 2025 13:53

rbradford approved these changes Jun 9, 2025

View reviewed changes

rbradford enabled auto-merge June 9, 2025 15:22

likebreath approved these changes Jun 9, 2025

View reviewed changes

rbradford added this pull request to the merge queue Jun 9, 2025

github-project-automation bot moved this from 🏗 In progress to 📋 Backlog in Cloud Hypervisor Roadmap Jun 9, 2025

Merged via the queue into cloud-hypervisor:main with commit 3d78662 Jun 9, 2025
46 of 48 checks passed

github-project-automation bot moved this from 📋 Backlog to ✅ Done in Cloud Hypervisor Roadmap Jun 9, 2025

block: virtio-blk: report IO errors to the guest #7107

block: virtio-blk: report IO errors to the guest #7107

Uh oh!

Conversation

gjolly commented May 28, 2025

Uh oh!

likebreath left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gjolly commented May 29, 2025

Uh oh!

likebreath commented May 29, 2025

Uh oh!

likebreath left a comment

Choose a reason for hiding this comment

Uh oh!

gjolly commented Jun 5, 2025

Uh oh!

rbradford commented Jun 5, 2025

Uh oh!

rbradford left a comment

Choose a reason for hiding this comment

Uh oh!

rbradford Jun 5, 2025

Choose a reason for hiding this comment

Uh oh!

likebreath Jun 6, 2025

Choose a reason for hiding this comment

Uh oh!

gjolly commented Jun 5, 2025

Uh oh!

likebreath commented Jun 6, 2025

Uh oh!

likebreath commented Jun 6, 2025

Uh oh!

rbradford commented Jun 6, 2025

Uh oh!

gjolly commented Jun 9, 2025

Uh oh!

rbradford commented Jun 9, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants