Fix silent initialization failures #15

Merged
DorianZheng merged 1 commit into main from fix/silent-init-errors on Dec 16, 2025

Conversation

@DorianZheng (Member)

Summary

  • Add tracing::error! logging in Rust before errors propagate, so failures are visible even when caught by retry loops
  • Fix Python wait_until_ready() to distinguish transient vs fatal errors

Changes

  • spawn.rs: Log subprocess spawn failures (e.g., "Permission denied")
  • pipeline.rs: Log failures for all 6 init stages with box_id and stage context
  • shim.rs: Log guest ready timeout and socket binding errors
  • computerbox.py: Only retry transient errors (ExecError, ConnectionError, OSError, TimeoutError); propagate fatal errors immediately
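
The retry policy described for computerbox.py can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `probe` callable, the `interval`, `clock`, and `sleep` parameters, and the local `ExecError` class are assumptions made so the example is self-contained. Only the list of transient exception types and the 60s timeout come from the description above.

```python
import time


class ExecError(Exception):
    """Hypothetical stand-in for the SDK's ExecError so the sketch runs alone."""


# Transient error types named in the PR description; anything else is fatal.
TRANSIENT_ERRORS = (ExecError, ConnectionError, OSError, TimeoutError)


def wait_until_ready(probe, timeout=60.0, interval=0.5,
                     clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it succeeds or the timeout elapses.

    Transient errors are swallowed and retried. Any other exception is not
    caught here, so it propagates to the caller on the first attempt.
    """
    deadline = clock() + timeout
    last_error = None
    while clock() < deadline:
        try:
            probe()
            return
        except TRANSIENT_ERRORS as exc:
            last_error = exc  # remember the cause for the final timeout error
            sleep(interval)
    raise TimeoutError(f"not ready after {timeout}s") from last_error
```

Catching only a fixed tuple of exception types is what lets a fatal error surface on the first attempt instead of being masked until the timeout expires.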

Test plan

  • Run with RUST_LOG=error and verify init failures are logged
  • Simulate spawn failure (e.g., remove execute permission from boxlite-shim) and verify error propagates immediately instead of timing out after 60s

Add error logging before errors propagate so failures are visible even
when caught by retry loops:

- spawn.rs: Log subprocess spawn failures (e.g., permission denied)
- pipeline.rs: Log failures for all 6 init stages with box_id and stage
- shim.rs: Log guest ready timeout and socket errors

Also fix Python wait_until_ready() to distinguish transient from fatal
errors: transient errors are retried, fatal errors propagate immediately.
DorianZheng merged commit 77d9b0c into main on Dec 16, 2025
4 checks passed
DorianZheng deleted the fix/silent-init-errors branch on December 19, 2025 13:13
DorianZheng added a commit that referenced this pull request Feb 20, 2026
…ion metrics

- Fix FIFREEZE/FITHAW ioctl numbers: replace nix::ioctl_write_int! (_IOW)
  with raw libc::ioctl using correct _IOWR constants (0xC0045877/0xC0045878)
- Split export into two phases: only disk flatten runs inside quiesce bracket
  (VM paused ~550ms), checksum+archive run after VM resumes (~6.5s saved)
- Add Instant-based timing metrics to all operations (start, stop, clone,
  export, import, snapshot, quiesce phases)
- Remove debug artifacts (eprintln, sleep) from with_quiesce_async
- Add CLAUDE.md rule #15: no sleep for events
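
The two constants quoted in the commit message above can be checked by hand. The sketch below reproduces the generic Linux ioctl number encoding (the layout used on x86 and ARM; a few architectures use different bit widths) and assumes the kernel header definitions `FIFREEZE = _IOWR('X', 119, int)` and `FITHAW = _IOWR('X', 120, int)`. The helper names are illustrative only.

```python
import struct

# Generic Linux ioctl encoding: | dir (2 bits) | size (14) | type (8) | nr (8) |
_IOC_NRSHIFT = 0
_IOC_TYPESHIFT = 8
_IOC_SIZESHIFT = 16
_IOC_DIRSHIFT = 30
_IOC_WRITE = 1
_IOC_READ = 2


def _ioc(direction, type_char, nr, size):
    """Pack the four fields into a single ioctl request number."""
    return ((direction << _IOC_DIRSHIFT)
            | (size << _IOC_SIZESHIFT)
            | (ord(type_char) << _IOC_TYPESHIFT)
            | (nr << _IOC_NRSHIFT))


def iow(type_char, nr, size):
    """_IOW: write-only direction bit."""
    return _ioc(_IOC_WRITE, type_char, nr, size)


def iowr(type_char, nr, size):
    """_IOWR: both read and write direction bits."""
    return _ioc(_IOC_READ | _IOC_WRITE, type_char, nr, size)


INT_SIZE = struct.calcsize("i")  # size of a C int, 4 bytes on common platforms

FIFREEZE = iowr("X", 119, INT_SIZE)
FITHAW = iowr("X", 120, INT_SIZE)

print(hex(FIFREEZE), hex(FITHAW))   # 0xc0045877 0xc0045878
print(hex(iow("X", 119, INT_SIZE)))  # 0x40045877, the _IOW encoding
```

The `_IOW` and `_IOWR` encodings differ only in the top two direction bits (0x4... versus 0xC...), so a request built with the wrong macro does not match the number the kernel expects for these operations.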
DorianZheng added a commit that referenced this pull request Feb 20, 2026
)

* feat: auto-quiesce clone/export on running boxes + REST import_box

Replace require_stopped_for with SIGSTOP/SIGCONT-based PauseGuard for
clone and export operations, allowing them to work on running boxes
without requiring a manual stop. Disable snapshot APIs (to be
re-enabled later). Implement REST client import_box to enable
cross-backend archive portability.

Key changes:
- PauseGuard: RAII guard that freezes VM via SIGSTOP, resumes on drop
- clone_box/export work on running boxes (auto-quiesce via PauseGuard)
- Snapshot APIs return unimplemented error (temporarily disabled)
- REST import_box client: reads archive, POSTs to /boxes/import
- Fix clone_cow Disk RAII leak (.leak() prevents auto-delete)
- Python examples for clone/export/import and local-to-REST migration
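
The guard pattern behind PauseGuard can be sketched in Python with a context manager, which plays the role that drop-based RAII plays in Rust. This is an analogue under stated assumptions, not the project's implementation: the name `paused` is hypothetical, and the real guard operates on the VM process from Rust.

```python
import os
import signal
from contextlib import contextmanager


@contextmanager
def paused(pid):
    """Freeze the process with SIGSTOP; always resume it with SIGCONT on exit."""
    os.kill(pid, signal.SIGSTOP)
    try:
        yield
    finally:
        # Runs even if the body raises, so the process is never left frozen.
        os.kill(pid, signal.SIGCONT)
```

A caller would wrap only the work that needs a quiescent process, for example `with paused(vm_pid): flatten_disk()`, where `vm_pid` and `flatten_disk` are placeholders.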

* feat: fix FIFREEZE ioctl, minimize VM pause during export, add operation metrics

- Fix FIFREEZE/FITHAW ioctl numbers: replace nix::ioctl_write_int! (_IOW)
  with raw libc::ioctl using correct _IOWR constants (0xC0045877/0xC0045878)
- Split export into two phases: only disk flatten runs inside quiesce bracket
  (VM paused ~550ms), checksum+archive run after VM resumes (~6.5s saved)
- Add Instant-based timing metrics to all operations (start, stop, clone,
  export, import, snapshot, quiesce phases)
- Remove debug artifacts (eprintln, sleep) from with_quiesce_async
- Add CLAUDE.md rule #15: no sleep for events

* refactor: make local_to_rest_migration example self-contained

Start the reference server as a subprocess on a random port
instead of requiring manual server startup. Removes a common
source of confusion and stale-build errors.
lilongen pushed a commit to lilongen/boxlite that referenced this pull request Mar 26, 2026
…ven detection

ProcessMonitor::wait_for_exit() used a 500ms sleep-based polling loop
(tokio::time::sleep + try_wait), violating the project's Rule boxlite-ai#15:
"No Sleep for Events." This added up to 500ms latency to VM crash
detection during startup (used in guest_connect.rs select! race).

Replace with platform-native event-driven mechanisms:
- Linux: pidfd_open() (kernel 5.3+) + tokio AsyncFd
- macOS: kqueue + EVFILT_PROC + NOTE_EXIT + tokio AsyncFd
- Fallback: 100ms polling for older kernels (< 5.3)

Key design decisions:
- OwnedFd wraps raw FDs immediately after creation (leak-free by construction)
- fcntl O_NONBLOCK checked; falls back to polling on failure
- Block scope for kevent struct (contains *mut c_void, not Send)
- Best-effort race guard via is_alive() after FD setup
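
The Linux path with its polling fallback can be illustrated in Python, which exposes the same system call as `os.pidfd_open` (Python 3.9+, Linux 5.3+). This is a simplified, blocking sketch rather than the tokio-based code in the commit: the function name and parameters are chosen for the example, `select` stands in for AsyncFd, and the macOS kqueue path is omitted.

```python
import os
import select
import subprocess
import time


def wait_for_exit(proc: subprocess.Popen, timeout=None, poll_interval=0.1):
    """Wait for proc to exit; return its exit code, or None on timeout."""
    fd = None
    if hasattr(os, "pidfd_open"):
        try:
            fd = os.pidfd_open(proc.pid)
        except OSError:
            fd = None  # e.g. older kernel or already-reaped child: use polling

    if fd is not None:
        try:
            # A pidfd becomes readable when the process terminates, so this
            # wakes on the exit event itself instead of on a timer.
            readable, _, _ = select.select([fd], [], [], timeout)
        finally:
            os.close(fd)  # close on every path so the descriptor cannot leak
        return proc.wait() if readable else None

    # Fallback: short polling loop, mirroring the 100ms fallback in the commit.
    deadline = None if timeout is None else time.monotonic() + timeout
    while proc.poll() is None:
        if deadline is not None and time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)
    return proc.returncode
```

With the pidfd path, the wait returns as soon as the kernel reports the exit, so detection latency is no longer bounded below by the polling interval.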

Co-Authored-By: Claude Opus 4.6 <[email protected]>