fix: address codex review issues for zygote PR#364
Merged
DorianZheng merged 9 commits intomainfrom Mar 11, 2026
Merged
Conversation
…ion (#349) Route clone3() and waitpid through a single-threaded zygote process to fix three issues with concurrent container exec: 1. **musl __malloc_lock deadlock**: clone3() from tokio spawn_blocking inherits locked __malloc_lock, deadlocking the child on first malloc. Zygote forks before tokio starts, keeping clone3() single-threaded. 2. **Exit code loss (ECHILD)**: Container processes are children of the zygote, not main. Added Wait IPC to route waitpid through zygote. 3. **Concurrent wait deadlock**: Blocking waitpid held zygote Mutex for process lifetime, blocking all other waits and builds. Added WNOHANG polling (10ms) so each IPC round-trip holds Mutex for ~microseconds. Key changes: - guest/src/container/zygote.rs: New zygote process with tagged IPC protocol (ZygoteRequest/ZygoteResponse for Build + Wait) - guest/src/service/exec/state.rs: Replace direct waitpid with WNOHANG polling loop via ZYGOTE.wait() - guest/src/main.rs: Start zygote before tokio runtime - docs/investigations/concurrent-exec-deadlock.md: Root cause analysis - 28 integration tests covering all scenarios
Round 1 (4 fixes): - fix(guest): route zygote-spawned PIDs through wait_via_zygote in ExecutionState::wait_process() instead of local waitpid (ECHILD) - fix(build): prefer matching architecture in find_prebuilt_shim/guest to avoid embedding wrong-arch binary on multi-arch machines - fix(guest): validate IPC message size before send in zygote channel to prevent silent truncation on SEQPACKET sockets - fix(metrics): remove dead exec_duration_total_ms and exec_max_duration_ms fields from BoxMetrics, REST types, and SDKs Round 2 (3 fixes): - fix(guest): release container mutex before PTY handshake by splitting spawn into two phases (spawn_build + finish), preventing a stuck console-socket accept from blocking all execs and shutdown - fix(guest): add 30s SO_RCVTIMEO timeout on console socket accept and recvmsg to prevent indefinite blocking in PTY handshake - fix(guest): remove dead ExecHandle::wait() method that would return ECHILD for zygote-spawned PIDs, update doc examples - fix(runtime): call sign_shim on cache hit path so previously cached unsigned shims get signed; make signing failure fatal on macOS
Keep our branch's search order comment (direct build output first) which matches the actual code behavior.
Root cause of 3 flaky VM tests: copy_shim_to_box() copies the signed shim binary to a per-box path, but macOS file copy strips code signatures. The copied binary runs without com.apple.security.hypervisor, so the kernel kills it with SIGKILL (137) when it touches Hypervisor.framework. Fix: re-sign the copied shim with codesign --force after every copy. codesign --force is idempotent (safe on already-signed binaries). Also fixes embedded.rs signing race: when nextest runs 4 test processes in parallel, the old codesign --force in-place rewrite on the fast path could produce a partially-written binary observed by concurrent readers. Fix: replace in-place sign_shim() with: - shim_needs_signing(): check if hypervisor entitlement is already present (skips re-sign in the common case) - sign_shim_atomic(): sign to PID-scoped temp file, then atomic rename() over the original
Root cause of 3 flaky VM tests (SIGKILL/137): when nextest runs 4 test processes in parallel, the old codesign --force in-place rewrite on the embedded.rs fast path could produce a partially-written binary observed by concurrent copy_shim_to_box() readers. The copied binary had an invalid code signature, causing the kernel to kill it on exec. Note: macOS file copy preserves code signatures (they're embedded in the Mach-O __LINKEDIT segment), so re-signing after copy is unnecessary. Fix in embedded.rs: - shim_needs_signing(): check if hypervisor entitlement is already present (skips re-sign in the common case, eliminating the race) - sign_shim_atomic(): sign to PID-scoped temp file, then atomic rename() over the original — concurrent readers always see either the old or new binary, never an intermediate state
The agent's shim_needs_signing() + sign_shim_atomic() + fast-path signing was solving a problem that didn't exist in the original code: - Slow path signs in a PID-scoped temp dir before atomic rename, so the shim is always signed when it becomes visible - Fast path correctly skips signing — the shim is already signed - macOS file copy preserves code signatures (they're embedded in the Mach-O binary), so copy_shim_to_box doesn't strip them The only real improvement over the original sign_shim() from a6f17a9 is making signing failure fatal (return Err instead of warn + Ok), which prevents caching an unsigned shim if codesign fails.
Fully revert embedded.rs to match main. The signing added in a6f17a9 is handled separately — this file should have no diff against main.
Single blank line removal was unrelated to the zygote PR scope.
These #[allow(dead_code)] methods predate the zygote PR — removing them was scope creep from the Codex review, not a zygote fix.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #349
Addresses issues found during two rounds of Codex code review on the zygote concurrent exec deadlock fix.
Round 1 (4 fixes)
wait_via_zygote()inExecutionState::wait_process()instead of localwaitpid(which returns ECHILD)find_prebuilt_shim/guestto avoid embedding wrong-arch binary on multi-arch machinesexec_duration_total_msandexec_max_duration_msfrom BoxMetrics, REST types, and Python/Node SDKsRound 2 (3 fixes)
spawn_build+finish). Prevents a stuck console-socketaccept()from blocking all execs and shutdown. Added 30sSO_RCVTIMEOtimeout on console socket.ExecHandle::wait()(MEDIUM): Remove deadwait()method that would return ECHILD for zygote-spawned PIDs. Update doc examples in 4 files.Test plan
cargo clippy -p boxlite-guest --target aarch64-unknown-linux-musl -- -D warnings— cleancargo clippy -p boxlite --features link-krun --tests -- -D warnings— cleancargo fmt --check— cleanmake runtime-debug— builds successfully