fix(guest): concurrent exec() deadlock — root cause analysis and proposed fix #349

@acmerfight

Problem

When multiple exec() calls run concurrently on the same box (via join_all / asyncio.gather), the program hangs permanently. This is the primary pattern for AI agent scenarios — running multiple tools concurrently in a single sandbox.

// This deadlocks
let results = futures::future::join_all(vec![
    handle.exec(BoxCommand::new("echo").arg("task A")),
    handle.exec(BoxCommand::new("echo").arg("task B")),
    handle.exec(BoxCommand::new("echo").arg("task C")),
    handle.exec(BoxCommand::new("echo").arg("task D")),
]).await;

Sequential exec works fine (N=1 verified 5/5 pass). Only concurrent exec (N >= 2) triggers the hang.

Root Cause

Three compounding issues identified through systematic investigation on macOS ARM64:

1. Tokio worker thread starvation

ContainerCommand::build_and_spawn() calls libcontainer's TenantContainerBuilder::build() — a 100% synchronous blocking function (clone3, waitpid, blocking pipe read) — directly on a tokio worker thread. The four methods in the call chain (spawn, spawn_with_pipes, spawn_with_pty, build_and_spawn) are marked async fn but contain zero .await yield points.

With N concurrent execs on a VM with C cores (= C tokio workers), when N >= C all workers block and the runtime deadlocks.
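The starvation mechanism can be illustrated without tokio. What follows is a toy model using only the standard library, not the BoxLite or tokio implementation: `FixedPool` and `probe_starvation` are hypothetical names, and the blocking job merely stands in for a synchronous build(). It shows that once every worker in a fixed-size pool is occupied by a blocking job, a job queued afterwards cannot start until one of them returns.

```rust
use std::sync::mpsc::{self, Sender};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Toy fixed-size pool (hypothetical), standing in for a runtime with
/// `workers` worker threads.
struct FixedPool {
    queue: Sender<Job>,
}

impl FixedPool {
    fn new(workers: usize) -> Self {
        let (tx, rx) = mpsc::channel::<Job>();
        let rx = Arc::new(Mutex::new(rx));
        for _ in 0..workers {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // The queue lock is held only while dequeuing, not while the job runs.
                let job = match rx.lock().unwrap().recv() {
                    Ok(job) => job,
                    Err(_) => break, // pool was dropped
                };
                job();
            });
        }
        FixedPool { queue: tx }
    }

    fn submit(&self, job: impl FnOnce() + Send + 'static) {
        self.queue.send(Box::new(job)).expect("pool is running");
    }
}

/// Queues `blocking_jobs` jobs that block until released, then one probe job.
/// Returns (probe ran while the blockers were still blocked, probe ran eventually).
fn probe_starvation(workers: usize, blocking_jobs: usize) -> (bool, bool) {
    let pool = FixedPool::new(workers);
    let mut releases = Vec::new();
    for _ in 0..blocking_jobs {
        let (release_tx, release_rx) = mpsc::channel::<()>();
        releases.push(release_tx);
        pool.submit(move || {
            // Stand-in for a synchronous build(): occupies the worker until released.
            let _ = release_rx.recv();
        });
    }
    let (probe_tx, probe_rx) = mpsc::channel::<()>();
    pool.submit(move || {
        let _ = probe_tx.send(());
    });
    let ran_while_blocked = probe_rx.recv_timeout(Duration::from_secs(1)).is_ok();
    drop(releases); // dropping the senders unblocks every blocked job
    let ran_eventually =
        ran_while_blocked || probe_rx.recv_timeout(Duration::from_secs(10)).is_ok();
    (ran_while_blocked, ran_eventually)
}
```

Calling `probe_starvation(2, 2)` models N = C: the probe only runs after the blockers are released. With `probe_starvation(2, 1)` one worker stays free and the probe runs immediately. In the real runtime the blocked work may itself depend on something that needs a worker, which is how starvation can turn into a permanent hang.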

2. Process-global chdir() race in libcontainer

libcontainer 0.5.7 uses chdir() to work around Unix socket 108-char path limits (notify_socket.rs:63, notify_socket.rs:136, tty.rs:91). This is a process-global operation. Concurrent build() calls race on CWD, causing socket operations in wrong directories. No file-level locks exist in the tenant exec path.
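Because the current working directory belongs to the process rather than to a thread, any code that changes it has to be serialized. The sketch below uses only the standard library and is not libcontainer code: `CWD_LOCK`, `with_cwd`, and `count_consistent` are hypothetical names. It shows the pattern of taking a process-wide lock around a chdir section so that each caller sees its own directory for the duration of the section.

```rust
use std::env;
use std::fs;
use std::io;
use std::path::Path;
use std::sync::Mutex;
use std::thread;

/// Process-wide lock guarding every change of the working directory (hypothetical).
static CWD_LOCK: Mutex<()> = Mutex::new(());

/// Runs `f` with the process CWD set to `dir`, then restores the previous CWD.
/// The lock keeps concurrent callers from seeing each other's directory.
/// Note: if `f` panics, the previous CWD is not restored in this sketch.
fn with_cwd<T>(dir: &Path, f: impl FnOnce() -> T) -> io::Result<T> {
    let _guard = CWD_LOCK.lock().unwrap_or_else(|e| e.into_inner());
    let previous = env::current_dir()?;
    env::set_current_dir(dir)?;
    let out = f();
    env::set_current_dir(previous)?;
    Ok(out)
}

/// Starts `threads` concurrent callers, each with its own scratch directory,
/// and counts how many observed their own directory as the CWD.
fn count_consistent(threads: usize) -> io::Result<usize> {
    let base = env::temp_dir().join(format!("cwd-demo-{}", std::process::id()));
    let mut handles = Vec::new();
    for i in 0..threads {
        let dir = base.join(i.to_string());
        fs::create_dir_all(&dir)?;
        // Resolve symlinks so the path compares equal to what current_dir() reports.
        let dir = dir.canonicalize()?;
        handles.push(thread::spawn(move || {
            with_cwd(&dir, || env::current_dir().map(|cwd| cwd == dir).unwrap_or(false))
        }));
    }
    let mut consistent = 0;
    for handle in handles {
        if handle.join().expect("worker panicked")? {
            consistent += 1;
        }
    }
    fs::remove_dir_all(&base)?;
    Ok(consistent)
}
```

With the lock in place, `count_consistent(8)` should report all 8 callers as consistent. The chdir() calls described above live inside libcontainer, so a caller cannot wrap them individually; serializing the whole build() call, as the proposed fix does, is the coarser equivalent.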

3. Single vCPU starvation

Default BoxOptions allocates 1 vCPU. The tokio runtime, blocking threads, and gRPC handler all compete for one core.

Full Analysis

Complete investigation with binary SHA256 verification, controlled experiments across 5 configurations, and architecture diagrams:

https://gist.github.com/acmerfight/08c159cded0cdaab824e3eaee7b736ef

Proposed Fix

Three changes targeting the three root causes:

1. guest/src/container/command.rs — Remove false async

Convert spawn(), spawn_with_pipes(), spawn_with_pty(), build_and_spawn() from async fn to fn (zero actual async operations).
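Why an async fn with no .await cannot yield can be checked with a hand-rolled single poll. This is a minimal sketch using only the standard library: `NoopWake`, `sum_below`, and `poll_once` are hypothetical names, and the summation stands in for the blocking work. A future whose body contains no .await runs to completion inside its first poll, so the thread that polls it is occupied for the whole duration.

```rust
use std::future::Future;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; enough for polling a future by hand once.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Marked async but contains no .await: all work happens inside one poll.
async fn sum_below(n: u64) -> u64 {
    (0..n).sum()
}

/// Polls `fut` exactly once. Returns Some(output) if it finished in that poll,
/// or None if it yielded back to the caller.
fn poll_once<F: Future>(fut: F) -> Option<F::Output> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    match fut.as_mut().poll(&mut cx) {
        Poll::Ready(value) => Some(value),
        Poll::Pending => None,
    }
}
```

`poll_once(sum_below(4))` returns `Some(6)`, while a future that actually suspends, such as `std::future::pending::<u64>()`, returns `None`. Dropping the async keyword from such functions makes the blocking nature visible at the call site.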

2. guest/src/service/exec/executor.rs: spawn_blocking + static lock

async fn spawn(&self, req: &ExecRequest) -> BoxliteResult<ExecHandle> {
    let cmd = {
        let container = self.container.lock().await;
        // ... build ContainerCommand (quick, no I/O) ...
        cmd
    }; // Container lock released — gRPC stays concurrent.

    static SPAWN_LOCK: std::sync::Mutex<()> = std::sync::Mutex::new(());
    tokio::task::spawn_blocking(move || {
        let _guard = SPAWN_LOCK.lock().unwrap_or_else(|e| e.into_inner());
        cmd.spawn()
    })
    .await
    .map_err(|e| BoxliteError::Internal(format!("spawn_blocking join failed: {}", e)))?
}

Key: the std::sync::Mutex is only ever acquired on blocking-pool threads, so it never blocks a tokio worker. In the measurements below it did better than holding a tokio::sync::Mutex across .await: 70% (7/10) vs 60% (3/5) pass rate at 2 vCPU.
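The `unwrap_or_else(|e| e.into_inner())` on the lock matters because a std::sync::Mutex becomes poisoned when a thread panics while holding it. The sketch below uses only the standard library, with plain threads in place of spawn_blocking; `serialized` and `survives_poisoning` are hypothetical names. It shows that recovering the guard from the poison error lets later callers proceed instead of failing on every subsequent lock().

```rust
use std::sync::Mutex;
use std::thread;

/// Process-wide lock, mirroring the static SPAWN_LOCK in the proposed fix.
static SPAWN_LOCK: Mutex<()> = Mutex::new(());

/// Runs `f` while holding the lock. A poisoned lock is recovered rather than
/// propagated, because the guarded data is `()` and cannot be left inconsistent.
fn serialized<T>(f: impl FnOnce() -> T) -> T {
    let _guard = SPAWN_LOCK.lock().unwrap_or_else(|e| e.into_inner());
    f()
}

/// Panics once while holding the lock, then checks the lock is still usable.
fn survives_poisoning() -> bool {
    // Simulate a build() that panics inside the critical section.
    let crashed = thread::spawn(|| {
        serialized(|| -> () { panic!("simulated build() failure") })
    })
    .join();
    let poisoned = SPAWN_LOCK.is_poisoned();
    // A plain `.lock().unwrap()` would panic here; the recovered guard does not.
    let value = serialized(|| 42);
    crashed.is_err() && poisoned && value == 42
}
```

Recovering from poisoning is a reasonable choice here only because the mutex protects no data of its own; for a mutex guarding real state, ignoring the poison flag could expose a half-updated value.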

3. Test: use 2 vCPUs

let opts = BoxOptions { cpus: Some(2), ..common::alpine_opts() };

Measured Results

All from real test execution on macOS ARM64, each run preceded by make runtime-debug:

| Configuration | Pass rate |
| --- | --- |
| Main branch (no fix), N=4, 1 vCPU | 0% (0/1) |
| N=1 (single exec), 1 vCPU | 100% (5/5) |
| spawn_blocking, no lock, 1 vCPU | 20% (1/5) |
| spawn_blocking, no lock, 2 vCPU | 80% (4/5) |
| spawn_blocking + tokio Mutex, 2 vCPU | 60% (3/5) |
| spawn_blocking + static std Mutex, 2 vCPU | 70% (7/10) |

The remaining 30% of failures are attributed to intermittent libkrun/Hypervisor.framework issues: the first exec's build() occasionally hangs indefinitely inside spawn_blocking. This was verified with a 120s timeout, and other integration tests also fail with VM-level errors on both the main and PR branches.
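Telling a hang apart from an ordinary failure needs a bounded wait. The following is a minimal sketch of such a watchdog using only the standard library, not the harness used for the measurements above; `run_with_timeout` is a hypothetical name. It runs a closure on its own thread and reports whether a result arrived within the limit.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `f` on a separate thread and waits at most `limit` for its result.
/// Returns None on timeout. The worker thread is not cancelled in that case;
/// it keeps running (or stays hung) until the process exits.
fn run_with_timeout<T, F>(limit: Duration, f: F) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone after a timeout; ignore the send error.
        let _ = tx.send(f());
    });
    rx.recv_timeout(limit).ok()
}
```

A test run would pass something like `Duration::from_secs(120)` and treat `None` as a hang rather than as a failed assertion.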

Related

  • PR #5 — initial Mutex-only fix (addresses part of root cause 1, does not address root causes 2 and 3)
  • Build system bug: cargo test overwrites boxlite-shim with default features (separate issue discovered during investigation)
