Skip to content

Conversation

@phip1611
Copy link
Member

@phip1611 phip1611 commented May 15, 2025

Building on the foundations introduced with #7061, #7069, #7085, #7089, and some more, we can now print nice error messages when Cloud Hypervisor or ch-remote exit with an error.

This is a major step forward in UX for both users and developers. It will be clearer where an error originates from and still all debugging information will be preserved.

Changes

  • streamline Display impl: don't repeat display of sub error. This was achieved using a Python script (see below):
  • Switch error print style on failure exit in Cloud Hypervisor
  • Switch error print style on failure exit in ch-remote
  • Adapt how printed errors are propagated via the HTTP API

Hints for Reviewers

Please review this commit by commit. The interesting commits are easily detectable as they don't follow the series naming scheme.

Steps to Undraft and to Merge

Python Script to change Display::fmt() of all Errors

The python script is embedded into the first commit message of the series changing the behavior.

Old Style (Cloud Hypervisor Invocation Fails)

`target/release/cloud-hypervisor --api-socket /tmp/chv2.sock --kernel /etc/bootitems/linux/kernel_minimal/stable.bzImage --cmdline console=ttyS0 --serial tty --console off --disk path=img.raw --initramfs /etc/bootitems/linux/initrd_minimal/default`

Error booting VM: VmBoot(LockingError(BlockError(LockDiskImage(AlreadyLocked)))

New Style (Cloud Hypervisor Invocation Fails)

! NOTE THAT THE FOLLOWING EXAMPLES MIGHT BE OUTDATED AS THE PR PROGRESSES !

`target/release/cloud-hypervisor --api-socket /tmp/chv2.sock --kernel /etc/bootitems/linux/kernel_minimal/stable.bzImage --cmdline console=ttyS0 --serial tty --console off --disk path=img.raw --initramfs 
/etc/bootitems/linux/initrd_minimal/default`

Error: Cloud Hypervisor exited with the following chain of errors:
  0: Error booting VM
  1: The VM could not boot
  2: Error locking disk images: Another instance likely holds a lock
  3: Cannot lock images of all block devices
  4: Failed to get Write lock for disk image: ./raw_disk.bin
  5: The file is already locked

Debug Info: VmBoot(VmBoot(LockingError(DiskLockError(LockDiskImage { error: AlreadyLocked, lock_type: Write, path: "./raw_disk.bin" })))

New Style (ch-remote Invocation Fails)

cargo run --bin ch-remote -- --api-socket=/tmp/chv.sock resize --cpus abc
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.13s
     Running `target/debug/ch-remote --api-socket=/tmp/chv.sock resize --cpus abc`

Error: ch-remote exited with the following chain of errors:
  0: Error parsing CPU count
  1: invalid digit found in string

Debug Info: InvalidCpuCount(ParseIntError { kind: InvalidDigit })

New Style (ch-remote shows API failure on from remote)

cargo run --bin ch-remote -- --api-socket=/tmp/chv.sock add-disk path=./raw_disk.bin,id=disk2
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.13s
     Running `target/debug/ch-remote --api-socket=/tmp/chv.sock add-disk path=./raw_disk.bin,id=disk2`

Error: ch-remote exited with the following chain of errors:
  0: http client error
  1: Server responded with InternalServerError
  2: Error from API
  3: The disk could not be added to the VM
  4: Failed to validate config
  5: Identifier disk1 is not unique

Debug Info: HttpApiClient(ServerResponse(InternalServerError, Some("Error from API<br>The disk could not be added to the VM<br>Failed to validate config<br>Identifier disk1 is not unique")))

@phip1611 phip1611 requested a review from a team as a code owner May 15, 2025 05:54
@phip1611 phip1611 changed the title vmm: Nice Error Messages vmm: Nice Error Messages on Exit May 15, 2025
@phip1611 phip1611 marked this pull request as draft May 15, 2025 05:59
@phip1611 phip1611 changed the title vmm: Nice Error Messages on Exit [WAITING FOR OTHER PRs] vmm: Nice Error Messages on Exit May 15, 2025
@phip1611 phip1611 changed the title [WAITING FOR OTHER PRs] vmm: Nice Error Messages on Exit [WIP]] vmm: Nice Error Messages on Exit May 15, 2025
@phip1611 phip1611 changed the title [WIP]] vmm: Nice Error Messages on Exit [WIP] vmm: Nice Error Messages on Exit May 15, 2025
@phip1611

This comment was marked as outdated.

@snue
Copy link
Contributor

snue commented May 15, 2025

As the Debug impl often contains valuable information, I added the following extension. As soon as -v is set, Cloud Hypervisor additionally prints the debug output. Like that:

Cloud Hypervisor exited with the following nested error:
  0: Error booting VM
     └VmBoot(VmBoot(LockingError(DiskLockError(LockDiskImage { error: AlreadyLocked, lock_type: Write, path: "nixos-livemig-vm--base-image-backup.raw" }))))
  1: The VM could not boot
     └VmBoot(LockingError(DiskLockError(LockDiskImage { error: AlreadyLocked, lock_type: Write, path: "nixos-livemig-vm--base-image-backup.raw" })))
  2: Error locking the disk images
     └LockingError(DiskLockError(LockDiskImage { error: AlreadyLocked, lock_type: Write, path: "nixos-livemig-vm--base-image-backup.raw" }))
  3: Cannot lock images of all block device
     └DiskLockError(LockDiskImage { error: AlreadyLocked, lock_type: Write, path: "nixos-livemig-vm--base-image-backup.raw" })
  4: Failed to get Write lock for disk image nixos-livemig-vm--base-image-backup.raw
     └LockDiskImage { error: AlreadyLocked, lock_type: Write, path: "nixos-livemig-vm--base-image-backup.raw" }
  5: The file is already locked
     └AlreadyLocked

Question: But I think it should be also enough to print the Debug output of the top most error once, right? Otherwise, we will get repeated information anyway

I like the new verbose nested error stack message. At first glance, I find it a bit less obvious that there was an actual error than with the older output, though. I would find it more appealing if the message was prefixed:

Error: Cloud Hypervisor exited with the following nested error:
 0: ....

The additional debug output looks pretty redundant, so having it once for the outermost error in a single line should be sufficient from my perspective. Being able to see the actual image path in the error stack is definitely a big plus (but you don't need the debug info for that)!

That's my 2ct. I don't have very strong opinions on this.

@phip1611

This comment was marked as outdated.

@rbradford
Copy link
Member

@phip1611 Cool idea - one issue might be that currently send the "Error C: Error B: Error A" style output in the HTTP messages when an error occurs - can you have a think how this would work for those situations?

src/main.rs Outdated
eprintln!(" 0: {top_error}");
let mut level = 1;
let mut next_error: &dyn StdError = &top_error;
while let Some(sub_error) = next_error.source() {
Copy link
Contributor

@blitz blitz May 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Maybe a combination of std::iter::successors and enumerate can streamline this a bit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know std::iter::successors, cool! Thanks

Copy link
Member Author

@phip1611 phip1611 May 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record: This lead to the discovery of a Rust compiler bug 😀

rust-lang/rust#141673

Nevertheless, a neat suggestion and now used in the code - thanks!

@phip1611 phip1611 mentioned this pull request May 22, 2025
@phip1611 phip1611 force-pushed the nice-error-messages branch from 24771c9 to 0c1defb Compare May 23, 2025 09:33
@phip1611 phip1611 changed the title [WIP] vmm: Nice Error Messages on Exit [WIP] Nice Error Messages on Exit for Cloud Hypervisor and ch-remote May 27, 2025
@phip1611 phip1611 force-pushed the nice-error-messages branch from 3410083 to b245038 Compare May 27, 2025 16:12
@phip1611
Copy link
Member Author

phip1611 commented May 28, 2025

one issue might be that currently send the "Error C: Error B: Error A" style output in the HTTP messages when an error occurs - can you have a think how this would work for those situations?

@rbradford @likebreath While the prerequisites aren't all merged yet, this still has progressed. The changes (IMHO massively) improve the quality and helpfulness of error messages on exit for:

  • failing direct invocations of Cloud Hypervisor
  • Cloud Hypervisor failures during runtime (not all possible paths tested yet, but I think if no panic occurs, the respective code paths should be taken)
  • failing direct invocations of ch-remote
  • failure responses received by ch-remote

All of that is reflected in the examples in the PR description.

@phip1611 phip1611 force-pushed the nice-error-messages branch from b245038 to 519bd9c Compare June 2, 2025 14:29
phip1611 added 14 commits June 13, 2025 13:42
The changes were mostly automatically applied using the Python
script mentioned in the first commit of this series.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
The changes were mostly automatically applied using the Python
script mentioned in the first commit of this series.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
The changes were mostly automatically applied using the Python
script mentioned in the first commit of this series.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
The changes were mostly automatically applied using the Python
script mentioned in the first commit of this series.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
The changes were mostly automatically applied using the Python
script mentioned in the first commit of this series.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
The changes were mostly automatically applied using the Python
script mentioned in the first commit of this series.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
The changes were mostly automatically applied using the Python
script mentioned in the first commit of this series.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
The changes were mostly automatically applied using the Python
script mentioned in the first commit of this series.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
The changes were mostly automatically applied using the Python
script mentioned in the first commit of this series.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
The changes were mostly automatically applied using the Python
script mentioned in the first commit of this series.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
With the foundations of each error type implementing std::error::Error,
we can now nicely walk the `.source()` chain and print an error trace.

This commit introduces improved user-facing error printing when:
- Cloud Hypervisor fails with an error
- ch-remote fails (client error)
- ch-remote fails (remote error)

The additional context is a clear improvement in UX for both users and
developers. In the following example, the new behaviour is shown for
a direct invocation of Cloud Hypervisor leading to a failure. This
looks similar for ch-remote.

```
Old Style
`target/release/cloud-hypervisor --api-socket /tmp/chv2.sock --kernel /etc/bootitems/linux/kernel_minimal/stable.bzImage --cmdline console=ttyS0 --serial tty --console off --disk path=img.raw --initramfs /etc/bootitems/linux/initrd_minimal/default`

Error booting VM: VmBoot(LockingError(BlockError(LockDiskImage(AlreadyLocked)))
```

```
`target/release/cloud-hypervisor --api-socket /tmp/chv2.sock --kernel /etc/bootitems/linux/kernel_minimal/stable.bzImage --cmdline console=ttyS0 --serial tty --console off --disk path=img.raw --initramfs
/etc/bootitems/linux/initrd_minimal/default`

Error: Cloud Hypervisor exited with the following chain of errors:
  0: Error booting VM
  1: The VM could not boot
  2: Error locking disk images: Another instance likely holds a lock
  3: Cannot lock images of all block devices
  4: Failed to get Write lock for disk image: ./img.raw
  5: The file is already locked

Debug Info: VmBoot(VmBoot(LockingError(DiskLockError(LockDiskImage { error: AlreadyLocked, lock_type: Write, path: "./raw_disk.bin" })))
```

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
While working on this, I found a subtle but severe compiler bug [0].
To fight the bug with explicitness rather than implicitness (to
prevent weird stuff in the future), this change is beneficial.

The bug is at least in Rust stable 1.34..1.87.

[0]: rust-lang/rust#141673

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
Remote server errors are transferred as raw HTTP body. This way,
we lose the nested structured error information.

This is an attempt to retrieve the errors from the HTTP response
and to align the output with the normal error output.

For example, this produces the following chain of errors. Note
that everything after level 0 was retrieved from the HTTP server
response:

```
Error: ch-remote exited with the following chain of errors:
  0: http client error
  1: Server responded with InternalServerError
  2: Error from API
  3: The disk could not be added to the VM
  4: Failed to validate config
  5: Identifier disk1 is not unique

Debug Info: HttpApiClient(ServerResponse(InternalServerError, Some("Error from API<br>The disk could not be added to the VM<br>Failed to validate config<br>Identifier disk1 is not unique")))
```

In case the JSON can't be parsed properly, ch-remote will print:

```
Error: ch-remote exited with the following chain of errors:
  0: http client error
  X: Can't get remote's error messages from JSON response: EOF while parsing a value at line 1 column 0: body=''

Debug Info: HttpApiClient(ServerResponse(InternalServerError, Some("")))
```

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
@phip1611 phip1611 force-pushed the nice-error-messages branch from cfdac58 to 234ca64 Compare June 13, 2025 11:43
@rbradford rbradford added this pull request to the merge queue Jun 13, 2025
Merged via the queue into cloud-hypervisor:main with commit 190a11f Jun 13, 2025
40 checks passed
@github-project-automation github-project-automation bot moved this from 📋 Backlog to ✅ Done in Cloud Hypervisor Roadmap Jun 13, 2025
@phip1611 phip1611 deleted the nice-error-messages branch June 16, 2025 08:21
@phip1611
Copy link
Member Author

phip1611 commented Jun 16, 2025

Very cool that we got this merged. A major UX improvement for users and developers.

One remark tho: The description of the PR and the commit message still contain the old style with <br>, e.g. ... Debug Info: HttpApiClient(ServerResponse(InternalServerError, Some("Error from API<br>The disk could not be added to the VM<br>Failed to validate config<br>Identifier disk1 is not unique"))), but the code actually uses the JSON approach as discussed. So please don't get confused - all good!

phip1611 added a commit to phip1611/cloud-hypervisor that referenced this pull request Aug 11, 2025
This partially reverts
ed8f347 from cloud-hypervisor#7183 and
6277d7d from cloud-hypervisor#7201.

# Output how it was merged for v47 (cloud-hypervisor#7066)
```
Error: Cloud Hypervisor exited with the following chain of errors:
 0: Error booting VM
 1: The VM could not boot
 2: Error manipulating firmware file
 3: No such file or directory (os error 2)

Debug Info: VmBoot(VmBoot(FirmwareFile(Os { code: 2, kind: NotFound, message: "No such file or directory" })))
```

# Output after cloud-hypervisor#7183 and cloud-hypervisor#7201
```
cloud-hypervisor: 31.385730ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:27 -- Error: Cloud Hypervisor exited with the following chain of errors:
cloud-hypervisor: 31.417961ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:39 --   0: Error booting VM
cloud-hypervisor: 31.448078ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:39 --   1: The VM could not boot
cloud-hypervisor: 31.486711ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:39 --   2: Error manipulating firmware file
cloud-hypervisor: 31.513331ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:39 --   3: No such file or directory (os error 2)
cloud-hypervisor: 31.548037ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:44 --
cloud-hypervisor: 31.568045ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:45 -- Debug Info: VmBoot(VmBoot(FirmwareFile(Os { code: 2, kind: NotFound, message: "No such file or directory" })))
```

The "proper logger" has indeed the advantage that messages can
be gracefully redirected to log files etc. However, this makes the
error message hardly readable.

Therefore, I propose to use error!() only for runtime errors that
are non-critical, i.e., messages whose error cause will not lead to
a VMM shutdown.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
phip1611 added a commit to phip1611/cloud-hypervisor that referenced this pull request Aug 11, 2025
This partially reverts
ed8f347 from cloud-hypervisor#7183 and
6277d7d from cloud-hypervisor#7201.

# Output how it was merged for v47 (cloud-hypervisor#7066)
```
Error: Cloud Hypervisor exited with the following chain of errors:
 0: Error booting VM
 1: The VM could not boot
 2: Error manipulating firmware file
 3: No such file or directory (os error 2)

Debug Info: VmBoot(VmBoot(FirmwareFile(Os { code: 2, kind: NotFound, message: "No such file or directory" })))
```

# Output after cloud-hypervisor#7183 and cloud-hypervisor#7201
```
cloud-hypervisor: 31.385730ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:27 -- Error: Cloud Hypervisor exited with the following chain of errors:
cloud-hypervisor: 31.417961ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:39 --   0: Error booting VM
cloud-hypervisor: 31.448078ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:39 --   1: The VM could not boot
cloud-hypervisor: 31.486711ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:39 --   2: Error manipulating firmware file
cloud-hypervisor: 31.513331ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:39 --   3: No such file or directory (os error 2)
cloud-hypervisor: 31.548037ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:44 --
cloud-hypervisor: 31.568045ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:45 -- Debug Info: VmBoot(VmBoot(FirmwareFile(Os { code: 2, kind: NotFound, message: "No such file or directory" })))
```

The "proper logger" has indeed the advantage that messages can
be gracefully redirected to log files etc. However, this makes the
error message hardly readable.

Therefore, I propose to use error!() only for runtime errors messages
but not a pretty-printed version of those.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
phip1611 added a commit to phip1611/cloud-hypervisor that referenced this pull request Aug 11, 2025
This partially reverts
ed8f347 from cloud-hypervisor#7183 and
6277d7d from cloud-hypervisor#7201.

# Output how it was merged for v47 (cloud-hypervisor#7066)
```
Error: Cloud Hypervisor exited with the following chain of errors:
 0: Error booting VM
 1: The VM could not boot
 2: Error manipulating firmware file
 3: No such file or directory (os error 2)

Debug Info: VmBoot(VmBoot(FirmwareFile(Os { code: 2, kind: NotFound, message: "No such file or directory" })))
```

# Output after cloud-hypervisor#7183 and cloud-hypervisor#7201
```
cloud-hypervisor: 31.385730ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:27 -- Error: Cloud Hypervisor exited with the following chain of errors:
cloud-hypervisor: 31.417961ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:39 --   0: Error booting VM
cloud-hypervisor: 31.448078ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:39 --   1: The VM could not boot
cloud-hypervisor: 31.486711ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:39 --   2: Error manipulating firmware file
cloud-hypervisor: 31.513331ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:39 --   3: No such file or directory (os error 2)
cloud-hypervisor: 31.548037ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:44 --
cloud-hypervisor: 31.568045ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:45 -- Debug Info: VmBoot(VmBoot(FirmwareFile(Os { code: 2, kind: NotFound, message: "No such file or directory" })))
```

The "proper logger" has indeed the advantage that messages can
be gracefully redirected to log files etc. However, this makes the
error message hardly readable.

Therefore, I propose to use error!() only for runtime errors messages
but not a pretty-printed version of those.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
github-merge-queue bot pushed a commit that referenced this pull request Aug 11, 2025
This partially reverts
ed8f347 from #7183 and
6277d7d from #7201.

# Output how it was merged for v47 (#7066)
```
Error: Cloud Hypervisor exited with the following chain of errors:
 0: Error booting VM
 1: The VM could not boot
 2: Error manipulating firmware file
 3: No such file or directory (os error 2)

Debug Info: VmBoot(VmBoot(FirmwareFile(Os { code: 2, kind: NotFound, message: "No such file or directory" })))
```

# Output after #7183 and #7201
```
cloud-hypervisor: 31.385730ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:27 -- Error: Cloud Hypervisor exited with the following chain of errors:
cloud-hypervisor: 31.417961ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:39 --   0: Error booting VM
cloud-hypervisor: 31.448078ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:39 --   1: The VM could not boot
cloud-hypervisor: 31.486711ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:39 --   2: Error manipulating firmware file
cloud-hypervisor: 31.513331ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:39 --   3: No such file or directory (os error 2)
cloud-hypervisor: 31.548037ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:44 --
cloud-hypervisor: 31.568045ms: <main> ERROR:/home/pschuster/dev/cloud-hypervisor/src/lib.rs:45 -- Debug Info: VmBoot(VmBoot(FirmwareFile(Os { code: 2, kind: NotFound, message: "No such file or directory" })))
```

The "proper logger" has indeed the advantage that messages can
be gracefully redirected to log files etc. However, this makes the
error message hardly readable.

Therefore, I propose to use error!() only for runtime errors messages
but not a pretty-printed version of those.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
@phip1611 phip1611 mentioned this pull request Sep 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants