
bug(stability): persistent-host crashes — bootstrap.py double-fork breaks launchd + state.db FD leak #1458

@nesquena-hermes

Description


Summary

User reports at least one crash per day running a persistent WebUI on a Mac mini under launchd. Their agent has been able to recover by restarting, but the underlying issues are reproducible and architectural — not transient.

Two distinct bugs, both confirmed against current master.

Bug #1 — bootstrap.py exits after spawning child, breaking launchd KeepAlive

Symptom

launchd keeps respawning bootstrap.py because it exits after starting the child server, while the child keeps holding port 8787.

Root cause (verified)

bootstrap.py:243-268:

proc = subprocess.Popen(
    [python_exe, str(REPO_ROOT / "server.py")],
    cwd=str(agent_dir or REPO_ROOT),
    env=env,
    stdout=log_file,
    stderr=subprocess.STDOUT,
    start_new_session=True,         # ← decouples child from parent
)

health_url = f"http://{args.host}:{args.port}/health"
if not wait_for_health(health_url):
    raise RuntimeError(...)

info(f"Web UI is ready: {app_url}")
...
return 0                            # ← parent exits here

This is a double-fork daemon pattern. It works fine for an interactive CLI run (user runs bash start.sh, gets a working server, terminal returns to prompt). It's broken for any process supervisor (launchd, systemd, runit, supervisord) because the supervisor sees:

  1. Parent process (bootstrap.py) exits with code 0 → supervisor thinks "program completed"
  2. Server child still running in a new process group, holding port 8787
  3. Under KeepAlive=true / Restart=always, supervisor respawns bootstrap.py
  4. New bootstrap.py calls Popen([python_exe, "server.py"]) → server child fails to bind 8787 (still held by orphan) → exit
  5. New bootstrap.py raises RuntimeError from wait_for_health → exit non-zero → supervisor respawns → loop

Eventually the orphan crashes for some other reason (FD exhaustion per Bug #2), the next respawn finds 8787 free, and the loop "self-heals." The user's "agent fixes it" report is this loop intermittently succeeding.

Fix shapes

Option (a) — Foreground mode flag (preferred).
Add a --foreground flag (or auto-detect supervisor env vars such as LAUNCHD_SOCKET / INVOCATION_ID / NOTIFY_SOCKET) that runs the server inside the bootstrap process instead of spawning a child. Pattern:

if args.foreground or _detect_supervisor():
    # Replace current process with server (or run inline)
    os.execv(python_exe, [python_exe, str(REPO_ROOT / "server.py")])

os.execv() replaces the current process image, so launchd sees the long-lived server as the original child and KeepAlive works correctly.
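A sketch of the detection helper, keying off the env vars named above — treat the exact variable list as an assumption to verify (particularly the launchd marker) before shipping:

import os

# Hypothetical helper for Option (a). Returns True when the environment
# suggests a process supervisor started us; the variable list is a guess.
_SUPERVISOR_ENV_VARS = ("LAUNCHD_SOCKET", "INVOCATION_ID", "NOTIFY_SOCKET")

def _detect_supervisor() -> bool:
    return any(var in os.environ for var in _SUPERVISOR_ENV_VARS)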

Documentation update: example launchd plist using bash start.sh --foreground.
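A minimal plist sketch for those docs, assuming an install at /Users/me/agent-webui and the flag landing as --foreground (label, paths, and log location are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.example.agent-webui</string>
    <key>ProgramArguments</key>
    <array>
        <string>/bin/bash</string>
        <string>/Users/me/agent-webui/start.sh</string>
        <string>--foreground</string>
    </array>
    <key>WorkingDirectory</key>
    <string>/Users/me/agent-webui</string>
    <key>KeepAlive</key>
    <true/>
    <key>StandardOutPath</key>
    <string>/Users/me/Library/Logs/agent-webui.log</string>
    <key>StandardErrorPath</key>
    <string>/Users/me/Library/Logs/agent-webui.log</string>
</dict>
</plist>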

Option (b) — Wait for child in default path.
Change the default Popen + return 0 flow to Popen + proc.wait() so bootstrap stays alive as the server's parent. Backward-incompatible for users who relied on bootstrap returning to the prompt; gate the old behavior behind a --detach flag.
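A rough shape of that default path — the --detach flag name and the signal set are assumptions:

import signal   # new import alongside the existing subprocess import

# Sketch only: keep bootstrap alive as the server's parent so a
# supervisor tracks the right PID.
proc = subprocess.Popen(
    [python_exe, str(REPO_ROOT / "server.py")],
    cwd=str(agent_dir or REPO_ROOT),
    env=env,
    stdout=log_file,
    stderr=subprocess.STDOUT,
)
if args.detach:
    return 0  # old behavior: hand the terminal back immediately

# Forward supervisor-sent termination to the child, then mirror its exit.
for sig in (signal.SIGTERM, signal.SIGINT):
    signal.signal(sig, lambda signum, frame: proc.terminate())
return proc.wait()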

Recommend (a) — opt-in flag plus auto-detection. Doesn't break any existing user; fixes the supervisor case.

Out of scope

  • Whether the launchd plist itself should be in our repo. Probably yes as a docs/example, but separate issue.
  • Migrating start.sh to systemd-style Type=notify. Overkill.

Bug #2 — File-descriptor / state.db handle leak

Symptom

Server PID had 454 file descriptors and loads of leaked state.db handles.

Context

We shipped #1421 in v0.50.259 specifically for a SessionDB WAL handle leak — that fix released SessionDB on cached-agent reuse and on LRU eviction. But the user is reporting 454 FDs and "loads of leaked state.db handles" on a current build. Either #1421 didn't cover all paths, or a second leak shape surfaced afterward.

Investigation needed (do not blind-fix)

Possible suspects to audit:

  1. SessionDB instances created outside the streaming/cached-agent path.

    • api/streaming.py:1715-1720 creates a fresh SessionDB() per request and, since #1421 ("close previous SessionDB before replacing on cached agent"), closes the cached agent's old one. But search for SessionDB( across the codebase — any other callsite that opens one without .close() is a candidate.
    • This includes any background thread, batch processor, gateway hook, or cron worker that touches sessions.
  2. _session_db reassignment in error paths.

    • The agent's _session_db field is reassigned per turn. If an exception fires between the agent._session_db = _session_db assignment and request completion, the previous _session_db may never be closed. Check try/finally coverage around streaming; see the sketch after this list.
  3. Gateway-side session opens.

    • If gateway processes share state via the same state.db, each gateway call may open a connection. If the gateway runs in-process (it usually doesn't, but check), those connections count against the WebUI's process FD limit.
  4. WAL checkpoint never running.

    • SQLite WAL files (state.db-wal, state.db-shm) accumulate the write log when no checkpoint runs. Running PRAGMA wal_checkpoint(TRUNCATE) periodically would shrink the WAL file, but note the handle leak is a separate problem from WAL size.
  5. Other open files.

    • 454 FDs is high enough that it might not be only state.db. The user mentioned "loads of leaked state.db handles" but the total is 454. Worth asking the user for lsof -p <pid> | awk '{print $NF}' | sort | uniq -c | sort -rn | head -20 to see which files dominate.
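For suspect 2 above, the fix shape is plain try/finally coverage. A sketch — only SessionDB and agent._session_db come from this issue; the turn driver is assumed:

# Hypothetical: guarantee the outgoing handle closes no matter where
# the turn fails, instead of relying on the happy path to swap it out.
old_db, agent._session_db = agent._session_db, SessionDB()
try:
    await run_turn(agent, request)   # hypothetical turn driver
finally:
    # Runs on success and on any mid-turn exception, so the replaced
    # handle can never be stranded by an error path.
    if old_db is not None:
        old_db.close()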

Repro

Suggest the user run:

# When the WebUI is in a "running but unhealthy" state:
lsof -p $(pgrep -f 'server.py' | head -1) | wc -l
lsof -p $(pgrep -f 'server.py' | head -1) | grep state.db | wc -l
lsof -p $(pgrep -f 'server.py' | head -1) | awk '{print $NF}' | sort | uniq -c | sort -rn | head -20

Plus uptime since the last restart, approximate session count, and approximate request count over the run.

Fix shape (after diagnosis)

Once we know which open path leaks: add an explicit .close() in the missing path, plus a defensive watchdog that compares the observed FD count against len(SESSION_AGENT_CACHE) × expected-FDs-per-agent and warns when observed exceeds 2× expected. That gives us a self-diagnosing leak signal for future regressions.
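A sketch of that watchdog — the cache name matches the text above, but the baseline and per-agent budget are placeholders to be measured, not derived:

import os
import logging

log = logging.getLogger(__name__)

BASELINE_FDS = 64            # sockets, logs, stdlib machinery (measure this)
EXPECTED_FDS_PER_AGENT = 4   # state.db + wal/shm + spare (measure this)

def check_fd_budget(session_agent_cache: dict) -> None:
    expected = BASELINE_FDS + len(session_agent_cache) * EXPECTED_FDS_PER_AGENT
    observed = len(os.listdir("/dev/fd"))  # own-process FD count, macOS/Linux
    if observed > 2 * expected:
        log.warning(
            "possible FD leak: %d open FDs vs ~%d expected (%d cached agents)",
            observed, expected, len(session_agent_cache),
        )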

Out of scope

  • Switching off SQLite WAL mode (rollback is worse, this isn't the answer)
  • Hard FD limit / ulimit -n increase (band-aid, not a fix)

What we should give the user now

While the underlying fixes are in flight, three immediate mitigations:

  1. Cron job: an lsof | wc -l watchdog with auto-restart at an FD threshold (sketch after this list). Recipe goes in the docs; ~5 LOC of bash. Their agent can run this while we land the proper fixes.
  2. launchd plist that uses bash start.sh --foreground once that flag exists.
  3. Workaround for the bootstrap.py exit-on-success today: invoke python server.py directly under launchd, bypassing bootstrap, and accept that auto-install-agent and dep-check won't run on supervisor restarts. Document in a "running under launchd" section of the README.
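A sketch of the mitigation-1 watchdog — threshold, match pattern, and log path are placeholders:

#!/usr/bin/env bash
# Run every few minutes via cron or a launchd StartInterval job.
THRESHOLD=400
PID=$(pgrep -f 'server.py' | head -1)
[ -z "$PID" ] && exit 0
FDS=$(lsof -p "$PID" 2>/dev/null | wc -l)
if [ "$FDS" -gt "$THRESHOLD" ]; then
  echo "$(date): $FDS FDs on pid $PID, restarting" >> "$HOME/webui-watchdog.log"
  kill "$PID"   # the supervisor's KeepAlive brings the server back up
fi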

Reporter / context

Reported May 02 2026 by a user running a persistent WebUI on a Mac mini under launchd. At least one crash per day. The user's agent has been recovering it ad hoc; root causes are identified above.

This is a stability issue affecting any user who runs the WebUI as a persistent service, which is a setup we want to support — particularly for the Mac mini / always-on-server use case.

Suggested labels

bug, stability, mac, priority:high

Estimated effort
