fix(guest): concurrent exec() deadlock — root cause analysis and proposed fix

## Problem

When multiple `exec()` calls run concurrently on the same box (via `join_all` / `asyncio.gather`), the program hangs permanently. This is the primary pattern for AI agent scenarios — running multiple tools concurrently in a single sandbox.

```rust
// This deadlocks
let results = futures::future::join_all(vec![
    handle.exec(BoxCommand::new("echo").arg("task A")),
    handle.exec(BoxCommand::new("echo").arg("task B")),
    handle.exec(BoxCommand::new("echo").arg("task C")),
    handle.exec(BoxCommand::new("echo").arg("task D")),
]).await;
```

Sequential exec works fine (N=1 verified 5/5 pass). Only concurrent exec (N >= 2) triggers the hang.

## Root Cause

Three compounding issues were identified through systematic investigation on macOS ARM64:

### 1. Tokio worker thread starvation

`ContainerCommand::build_and_spawn()` calls libcontainer's `TenantContainerBuilder::build()` — a 100% synchronous blocking function (clone3, waitpid, blocking pipe read) — directly on a tokio worker thread. The four methods in the call chain (`spawn`, `spawn_with_pipes`, `spawn_with_pty`, `build_and_spawn`) are marked `async fn` but contain **zero `.await` yield points**.

On a VM with C cores (and therefore C tokio workers), N concurrent execs with N >= C block every worker at once. No worker is left to poll other tasks, so the runtime deadlocks.

### 2. Process-global `chdir()` race in libcontainer

libcontainer 0.5.7 uses `chdir()` to work around Unix socket 108-char path limits (`notify_socket.rs:63`, `notify_socket.rs:136`, `tty.rs:91`). This is a process-global operation. Concurrent `build()` calls race on CWD, causing socket operations in wrong directories. No file-level locks exist in the tenant exec path.
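
To illustrate why a process-wide lock is needed around any CWD-changing section, here is a minimal sketch using only the standard library. It does not reproduce libcontainer's code: `CWD_LOCK`, `create_relative`, and `run_concurrent` are hypothetical names, and a plain file stands in for the Unix socket that would be bound through a short relative path.

```rust
use std::env;
use std::fs::{self, File};
use std::io;
use std::path::{Path, PathBuf};
use std::sync::Mutex;
use std::thread;

// The CWD is shared by every thread in the process, so one lock guards it.
static CWD_LOCK: Mutex<()> = Mutex::new(());

/// Stand-in for "chdir into a long directory, then bind a short relative
/// socket name". Without the lock, another thread could change the CWD
/// between `set_current_dir` and `File::create`.
fn create_relative(dir: &Path, name: &str) -> io::Result<()> {
    let _guard = CWD_LOCK.lock().unwrap_or_else(|e| e.into_inner());
    let previous = env::current_dir()?;
    env::set_current_dir(dir)?;
    let result = File::create(name).map(|_| ());
    // Restore the CWD even if the create failed.
    env::set_current_dir(previous)?;
    result
}

/// Runs `n` concurrent callers, each targeting its own directory, and
/// returns how many files ended up in the directory they were meant for.
fn run_concurrent(n: usize) -> io::Result<usize> {
    let root = env::temp_dir().join(format!("cwd-race-sketch-{}", std::process::id()));
    let dirs: Vec<PathBuf> = (0..n).map(|i| root.join(format!("box-{i}"))).collect();
    for dir in &dirs {
        fs::create_dir_all(dir)?;
    }
    let handles: Vec<_> = dirs
        .iter()
        .cloned()
        .map(|dir| thread::spawn(move || create_relative(&dir, "notify.sock")))
        .collect();
    for handle in handles {
        handle.join().expect("worker panicked")?;
    }
    let placed = dirs.iter().filter(|d| d.join("notify.sock").exists()).count();
    fs::remove_dir_all(&root)?;
    Ok(placed)
}
```

With the guard in place every file lands in its intended directory. Removing the `_guard` line makes the outcome depend on thread interleaving, which is the class of race described above.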

### 3. Single vCPU starvation

Default `BoxOptions` allocates 1 vCPU. The tokio runtime, blocking threads, and gRPC handler all compete for one core.

## Full Analysis

Complete investigation with binary SHA256 verification, controlled experiments across 5 configurations, and architecture diagrams:

**https://gist.github.com/acmerfight/08c159cded0cdaab824e3eaee7b736ef**

## Proposed Fix

Three changes targeting the three root causes:

### 1. `guest/src/container/command.rs` — Remove false async

Convert `spawn()`, `spawn_with_pipes()`, `spawn_with_pty()`, `build_and_spawn()` from `async fn` to `fn` (zero actual async operations).
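
The effect of an `async fn` with no `.await` can be seen by counting polls. The following sketch uses a hand-rolled poller rather than tokio; `poll_count`, `false_async`, `real_async`, and `YieldOnce` are illustrative names that do not exist in the codebase.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::time::Duration;

/// Waker that does nothing; enough for a busy-polling loop.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls `fut` to completion and returns (output, number of polls).
fn poll_count<F: Future>(fut: F) -> (F::Output, usize) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return (out, polls);
        }
    }
}

/// Marked async but never awaits: the whole body, including the blocking
/// sleep, runs inside a single poll on the polling thread.
async fn false_async() -> u32 {
    std::thread::sleep(Duration::from_millis(20));
    7
}

/// Future that returns Pending once, giving the executor a yield point.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Same result, but with one real yield point.
async fn real_async() -> u32 {
    YieldOnce(false).await;
    7
}
```

`false_async` finishes in one poll, so an executor never regains control while it runs. Dropping the `async` keyword makes that blocking behavior explicit at the call site.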

### 2. `guest/src/service/exec/executor.rs` — `spawn_blocking` + static lock

```rust
async fn spawn(&self, req: &ExecRequest) -> BoxliteResult<ExecHandle> {
    let cmd = {
        let container = self.container.lock().await;
        // ... build ContainerCommand (quick, no I/O) ...
        cmd
    }; // Container lock released — gRPC stays concurrent.

    static SPAWN_LOCK: std::sync::Mutex<()> = std::sync::Mutex::new(());
    tokio::task::spawn_blocking(move || {
        let _guard = SPAWN_LOCK.lock().unwrap_or_else(|e| e.into_inner());
        cmd.spawn()
    })
    .await
    .map_err(|e| BoxliteError::Internal(format!("spawn_blocking join failed: {}", e)))?
}
```

Key: the `std::sync::Mutex` is only ever locked on blocking-pool threads, so it has **zero impact on the tokio async runtime**. At 2 vCPU this variant passed 70% (7/10), versus 60% (3/5) when holding a `tokio::sync::Mutex` across `.await`. The sample sizes are small, so the difference is indicative rather than conclusive.
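
The `unwrap_or_else(|e| e.into_inner())` in the snippet above matters because a panic while the guard is held poisons a `std::sync::Mutex`, and a plain `lock().unwrap()` would then fail every later exec. A minimal sketch of that recovery path, with a simulated panic in place of a real spawn failure (`poisoned_lock_still_usable` is a hypothetical helper):

```rust
use std::sync::Mutex;
use std::thread;

static SPAWN_LOCK: Mutex<()> = Mutex::new(());

/// Simulates one spawn that panics while holding the lock, then reports
/// (lock_is_poisoned, later_caller_acquired_the_lock).
fn poisoned_lock_still_usable() -> (bool, bool) {
    let crashed = thread::spawn(|| -> () {
        let _guard = SPAWN_LOCK.lock().unwrap_or_else(|e| e.into_inner());
        panic!("simulated spawn failure");
    })
    .join();
    assert!(crashed.is_err());

    let poisoned = SPAWN_LOCK.is_poisoned();

    // A later caller ignores the poison flag and takes the guard anyway.
    let acquired = thread::spawn(|| {
        let _guard = SPAWN_LOCK.lock().unwrap_or_else(|e| e.into_inner());
        true
    })
    .join()
    .unwrap_or(false);

    (poisoned, acquired)
}
```

Ignoring poison is reasonable here because the mutex guards `()` and carries no state that a panic could leave half-updated.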

### 3. Test: use 2 vCPUs

```rust
let opts = BoxOptions { cpus: Some(2), ..common::alpine_opts() };
```

## Measured Results

All results come from real test runs on macOS ARM64, each preceded by `make runtime-debug`:

| Configuration | Pass rate |
|---------------|-----------|
| Main branch (no fix), N=4, 1 vCPU | 0% (0/1) |
| N=1 (single exec), 1 vCPU | 100% (5/5) |
| spawn_blocking, no lock, 1 vCPU | 20% (1/5) |
| spawn_blocking, no lock, 2 vCPU | 80% (4/5) |
| spawn_blocking + tokio Mutex, 2 vCPU | 60% (3/5) |
| **spawn_blocking + static std Mutex, 2 vCPU** | **70% (7/10)** |

The remaining 30% of runs fail because of intermittent libkrun/Hypervisor.framework issues: the first exec's `build()` occasionally hangs indefinitely inside `spawn_blocking`. This was verified with a 120s timeout, and other integration tests also fail with VM-level errors on both the main and PR branches.
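
Because each configuration was run only 5 to 10 times, the pass rates carry wide uncertainty. As a rough aid for reading the table, the sketch below computes a Wilson score interval for a pass count. This is an added illustration, not part of the original investigation, and the function names are hypothetical.

```rust
/// Wilson score interval for `passes` successes out of `runs` trials.
/// `z` is the normal quantile (1.96 for roughly 95% coverage).
fn wilson_interval(passes: u32, runs: u32, z: f64) -> (f64, f64) {
    let n = f64::from(runs);
    let p = f64::from(passes) / n;
    let z2 = z * z;
    let denom = 1.0 + z2 / n;
    let center = (p + z2 / (2.0 * n)) / denom;
    let half = z * (p * (1.0 - p) / n + z2 / (4.0 * n * n)).sqrt() / denom;
    (center - half, center + half)
}

/// True when two closed intervals share at least one point.
fn overlaps(a: (f64, f64), b: (f64, f64)) -> bool {
    a.0 <= b.1 && b.0 <= a.1
}
```

For example, `wilson_interval(7, 10, 1.96)` gives roughly 0.40 to 0.89 and `wilson_interval(3, 5, 1.96)` roughly 0.23 to 0.88. The two ranges overlap heavily, so more runs would be needed to rank the lock variants with confidence.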

## Related

- PR [#5](https://github.com/acmerfight/boxlite/pull/5) — initial Mutex-only fix (addresses part of root cause 1, does not address root causes 2 and 3)
- [Build system bug](https://gist.github.com/acmerfight/6a794f596ef5af81b1a64df7065e75c2) — `cargo test` overwrites `boxlite-shim` with default features (separate issue discovered during investigation)
