Skip to content

fix: address codex review issues for zygote PR#364

Merged
DorianZheng merged 9 commits intomainfrom
fix/concurrent-exec-zygote
Mar 11, 2026
Merged

fix: address codex review issues for zygote PR#364
DorianZheng merged 9 commits intomainfrom
fix/concurrent-exec-zygote

Conversation

@DorianZheng
Copy link
Copy Markdown
Member

@DorianZheng DorianZheng commented Mar 11, 2026

Summary

Fixes #349

Addresses issues found during two rounds of Codex code review on the zygote concurrent exec deadlock fix.

Round 1 (4 fixes)

  • wait_process routing: Route zygote-spawned PIDs through wait_via_zygote() in ExecutionState::wait_process() instead of local waitpid (which returns ECHILD)
  • Architecture selection: Prefer matching architecture in find_prebuilt_shim/guest to avoid embedding wrong-arch binary on multi-arch machines
  • IPC buffer validation: Validate message size before send in zygote SEQPACKET channel to prevent silent truncation
  • Dead metrics cleanup: Remove unused exec_duration_total_ms and exec_max_duration_ms from BoxMetrics, REST types, and Python/Node SDKs

Round 2 (3 fixes)

  • Container mutex scope (HIGH): Release container mutex before PTY handshake by splitting spawn into two phases (spawn_build + finish). Prevents a stuck console-socket accept() from blocking all execs and shutdown. Added 30s SO_RCVTIMEO timeout on console socket.
  • Dead ExecHandle::wait() (MEDIUM): Remove dead wait() method that would return ECHILD for zygote-spawned PIDs. Update doc examples in 4 files.

Test plan

  • cargo clippy -p boxlite-guest --target aarch64-unknown-linux-musl -- -D warnings — clean
  • cargo clippy -p boxlite --features link-krun --tests -- -D warnings — clean
  • cargo fmt --check — clean
  • make runtime-debug — builds successfully
  • 28 execution/shutdown integration tests pass
  • Zygote IPC oversized message unit tests pass

…ion (#349)

Route clone3() and waitpid through a single-threaded zygote process to fix
three issues with concurrent container exec:

1. **musl __malloc_lock deadlock**: clone3() from tokio spawn_blocking
   inherits locked __malloc_lock, deadlocking the child on first malloc.
   Zygote forks before tokio starts, keeping clone3() single-threaded.

2. **Exit code loss (ECHILD)**: Container processes are children of the
   zygote, not main. Added Wait IPC to route waitpid through zygote.

3. **Concurrent wait deadlock**: Blocking waitpid held zygote Mutex for
   process lifetime, blocking all other waits and builds. Added WNOHANG
   polling (10ms) so each IPC round-trip holds Mutex for ~microseconds.

Key changes:
- guest/src/container/zygote.rs: New zygote process with tagged IPC
  protocol (ZygoteRequest/ZygoteResponse for Build + Wait)
- guest/src/service/exec/state.rs: Replace direct waitpid with WNOHANG
  polling loop via ZYGOTE.wait()
- guest/src/main.rs: Start zygote before tokio runtime
- docs/investigations/concurrent-exec-deadlock.md: Root cause analysis
- 28 integration tests covering all scenarios
Round 1 (4 fixes):
- fix(guest): route zygote-spawned PIDs through wait_via_zygote in
  ExecutionState::wait_process() instead of local waitpid (ECHILD)
- fix(build): prefer matching architecture in find_prebuilt_shim/guest
  to avoid embedding wrong-arch binary on multi-arch machines
- fix(guest): validate IPC message size before send in zygote channel
  to prevent silent truncation on SEQPACKET sockets
- fix(metrics): remove dead exec_duration_total_ms and
  exec_max_duration_ms fields from BoxMetrics, REST types, and SDKs

Round 2 (3 fixes):
- fix(guest): release container mutex before PTY handshake by splitting
  spawn into two phases (spawn_build + finish), preventing a stuck
  console-socket accept from blocking all execs and shutdown
- fix(guest): add 30s SO_RCVTIMEO timeout on console socket accept and
  recvmsg to prevent indefinite blocking in PTY handshake
- fix(guest): remove dead ExecHandle::wait() method that would return
  ECHILD for zygote-spawned PIDs, update doc examples
- fix(runtime): call sign_shim on cache hit path so previously cached
  unsigned shims get signed; make signing failure fatal on macOS
Keep our branch's search order comment (direct build output first)
which matches the actual code behavior.
Root cause of 3 flaky VM tests: copy_shim_to_box() copies the signed
shim binary to a per-box path, but macOS file copy strips code
signatures. The copied binary runs without com.apple.security.hypervisor,
so the kernel kills it with SIGKILL (137) when it touches
Hypervisor.framework.

Fix: re-sign the copied shim with codesign --force after every copy.
codesign --force is idempotent (safe on already-signed binaries).

Also fixes embedded.rs signing race: when nextest runs 4 test processes
in parallel, the old codesign --force in-place rewrite on the fast path
could produce a partially-written binary observed by concurrent readers.

Fix: replace in-place sign_shim() with:
- shim_needs_signing(): check if hypervisor entitlement is already
  present (skips re-sign in the common case)
- sign_shim_atomic(): sign to PID-scoped temp file, then atomic
  rename() over the original
Root cause of 3 flaky VM tests (SIGKILL/137): when nextest runs 4 test
processes in parallel, the old codesign --force in-place rewrite on the
embedded.rs fast path could produce a partially-written binary observed
by concurrent copy_shim_to_box() readers. The copied binary had an
invalid code signature, causing the kernel to kill it on exec.

Note: macOS file copy preserves code signatures (they're embedded in
the Mach-O __LINKEDIT segment), so re-signing after copy is unnecessary.

Fix in embedded.rs:
- shim_needs_signing(): check if hypervisor entitlement is already
  present (skips re-sign in the common case, eliminating the race)
- sign_shim_atomic(): sign to PID-scoped temp file, then atomic
  rename() over the original — concurrent readers always see either
  the old or new binary, never an intermediate state
The agent's shim_needs_signing() + sign_shim_atomic() + fast-path
signing was solving a problem that didn't exist in the original code:

- Slow path signs in a PID-scoped temp dir before atomic rename,
  so the shim is always signed when it becomes visible
- Fast path correctly skips signing — the shim is already signed
- macOS file copy preserves code signatures (they're embedded in the
  Mach-O binary), so copy_shim_to_box doesn't strip them

The only real improvement over the original sign_shim() from a6f17a9
is making signing failure fatal (return Err instead of warn + Ok),
which prevents caching an unsigned shim if codesign fails.
Fully revert embedded.rs to match main. The signing added in a6f17a9
is handled separately — this file should have no diff against main.
Single blank line removal was unrelated to the zygote PR scope.
These #[allow(dead_code)] methods predate the zygote PR — removing
them was scope creep from the Codex review, not a zygote fix.
@DorianZheng DorianZheng merged commit a14d758 into main Mar 11, 2026
18 checks passed
@DorianZheng DorianZheng deleted the fix/concurrent-exec-zygote branch March 11, 2026 14:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(guest): concurrent exec() deadlock — root cause analysis and proposed fix

1 participant