Skip to content

Conversation

@gjolly
Copy link
Contributor

@gjolly gjolly commented May 28, 2025

Instead of exiting on IO errors, report the errors to the guest with VIRTIO_BLK_S_IOERR. For example, the guest kernel will log something similar to this if the nbd behind /dev/vdc is unexpectedly disconnected:

[ 166.033957] I/O error, dev vdc, sector 264 op 0x1:(WRITE) flags 0x9800 phys_seg 1 prio class 2
[ 166.035083] Aborting journal on device vdc-8.
[ 166.037307] Buffer I/O error on dev vdc, logical block 9, lost sync page write
[ 166.038471] JBD2: I/O error when updating journal superblock for vdc-8.
[...]
[ 174.234470] EXT4-fs (vdc): I/O error while writing superblock

In case the rootfs is not located on the affected block device, this will not crash the host.

Fixes: #6995

@gjolly gjolly requested a review from a team as a code owner May 28, 2025 20:17
Copy link
Member

@likebreath likebreath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution. Besides the comment below, I believe this patch only handles the async block backend, while we also need to have the same behavior for the synchronous backend. Please take a look at process_queue_submit(), particularly the call to request.execut_async() which returns false on no error when using synchronous backend.

@gjolly gjolly force-pushed the virtio-blk_report_err branch from 753d09b to 59af0df Compare May 29, 2025 09:02
@gjolly
Copy link
Contributor Author

gjolly commented May 29, 2025

As far as I can tell, the only synchronous operation handled by execute_async() is GetDeviceId which returns the serial of the device to the host and doesn't involve any IO.

@likebreath
Copy link
Member

As far as I can tell, the only synchronous operation handled by execute_async() is GetDeviceId which returns the serial of the device to the host and doesn't involve any IO.

Please checkout the different implementations of pub trait AsyncIo, and you will see most of them are actually using synchronous IO backends. The only async IO backends we use are io_uring and aio both for raw images only.

@gjolly gjolly force-pushed the virtio-blk_report_err branch from 59af0df to 39f1335 Compare May 30, 2025 14:40
Copy link
Member

@likebreath likebreath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes now look good to me. Just kicked off the CI pipeline.

Would you please look into if it is possible to add an integration test for this, e.g. manually trigger such error on the host and check the guest is still alive with expected errors reported to the guest kernel log? Thank you.

@gjolly
Copy link
Contributor Author

gjolly commented Jun 5, 2025

@likebreath thank you for your review. I am working on the integration tests.

I have a few questions:

  1. I see that the link check and the fuzz build have failed. It seems to be unrelated with my change. Are these known issues?
  2. The log lines in the commit message exceed the limit of 72 characters per line. Should I drop them or is it ok?
  3. I haven't looked everywhere but I don't see any obvious documentation on how to run the integration tests. Does such doc exist?

@rbradford
Copy link
Member

@likebreath thank you for your review. I am working on the integration tests.

I have a few questions:

1. I see that the link check and the fuzz build have failed. It seems to be unrelated with my change. Are these known issues?

They won't hold up merging your PR (if you rebase you will see the fuzz failures gone I think.)

2. The log lines in the commit message exceed the limit of 72 characters per line. Should I drop them or is it ok?

That's fine - the check is to ensure that normal paragraphs are appropriately wrapped.

3. I haven't looked everywhere but I don't see any obvious documentation on how to run the integration tests. Does such doc exist?

It's not easy to do but you can use scripts/dev_cli.sh to run them

Copy link
Member

@rbradford rbradford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine this would be very tricky to test with an integration test - the code looks okay to me.

let status = match result {
Ok(_) => VIRTIO_BLK_S_OK,
Err(e) => {
error!("Request failed: {:x?} {:?}", request, e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we think error! might be too aggressive here - it's no longer fatal for the VMM so maybe it should be warn! @likebreath WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that makes sense, both for the sync and async code path.

@gjolly
Copy link
Contributor Author

gjolly commented Jun 5, 2025

I imagine this would be very tricky to test with an integration test

My reproducer setup involves NBD and qemu-nbd, which seems to require root. So it's not as simple to add to the tests as I originally thought.

@likebreath
Copy link
Member

I imagine this would be very tricky to test with an integration test

My reproducer setup involves NBD and qemu-nbd, which seems to require root. So it's not as simple to add to the tests as I originally thought.

@gjolly Thanks for looking into this. That's where I was not sure and wanted you to help and find out. For such case, I am fine with not adding a integration test.

Can you please manually validate both the async and sync cases? To enforce the sync backend of block devices, you can either use the internal options of --disk (for testing purpose only) like below:

"path={},readonly=on,direct=on,num_queues=4,_disable_io_uring={},_disable_aio={}",
blk_file_path.to_str().unwrap(),
disable_io_uring,
disable_aio,

Can you please also rebase that should resolve the fuzz build error? Thank you.

@likebreath
Copy link
Member

For the commit message, I believe you meant to say:

In case the rootfs is not located on the affected block device, this will not crash the host guest.

@rbradford
Copy link
Member

I imagine this would be very tricky to test with an integration test

My reproducer setup involves NBD and qemu-nbd, which seems to require root. So it's not as simple to add to the tests as I originally thought.

@gjolly Thanks for looking into this. That's where I was not sure and wanted you to help and find out. For such case, I am fine with not adding a integration test.

I agree - we don't need an integration test if you've manually verified it.

@gjolly gjolly force-pushed the virtio-blk_report_err branch from 39f1335 to 4c9713d Compare June 9, 2025 13:18
Instead of exiting on IO errors, report the errors to the guest with
VIRTIO_BLK_S_IOERR. For example, the guest kernel will log something
similar to this if the nbd behind /dev/vdc is unexpectedly disconnected:

[  166.033957] I/O error, dev vdc, sector 264 op 0x1:(WRITE) flags 0x9800 phys_seg 1 prio class 2
[  166.035083] Aborting journal on device vdc-8.
[  166.037307] Buffer I/O error on dev vdc, logical block 9, lost sync page write
[  166.038471] JBD2: I/O error when updating journal superblock for vdc-8.
[...]
[  174.234470] EXT4-fs (vdc): I/O error while writing superblock

In case the rootfs is not located on the affected block device, this
will not crash the guest.

Fixes: cloud-hypervisor#6995

Signed-off-by: Gauthier Jolly <[email protected]>
@gjolly gjolly force-pushed the virtio-blk_report_err branch from 4c9713d to fdf7339 Compare June 9, 2025 13:53
@gjolly
Copy link
Contributor Author

gjolly commented Jun 9, 2025

I rebased and I think I fixed all the things that were mentioned (commit message and error level)
For the tests, here is what I did:

truncate -s 1M disk.raw
qemu-nbd --connect /dev/nbd0 disk.raw
./cloud-hypervisor -disk path=/dev/nbd0 ...

Then when the VM is running I simply do:

(guest) echo "foo" > /dev/vdc # that should not raise any error
(host) qemu-nbd --disconnect /dev/nbd0
(guest) echo "foo" > /dev/vdc # that should trigger the kernel errors + the hypervisor warnings

At the end the guest should still be running.

The same process can be repeated but with --disk path=/dev/nbd0,_disable_aio=true,_disable_io_uring=true for the sync case.

@rbradford rbradford enabled auto-merge June 9, 2025 15:22
@rbradford
Copy link
Member

@likebreath If you're happy with the changes could you approve so this can be merged.

@rbradford rbradford added this pull request to the merge queue Jun 9, 2025
@github-project-automation github-project-automation bot moved this from 🏗 In progress to 📋 Backlog in Cloud Hypervisor Roadmap Jun 9, 2025
Merged via the queue into cloud-hypervisor:main with commit 3d78662 Jun 9, 2025
46 of 48 checks passed
@github-project-automation github-project-automation bot moved this from 📋 Backlog to ✅ Done in Cloud Hypervisor Roadmap Jun 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Support error reporting to the guest for virtio-blk

3 participants