
bug(stability): persistent-host crashes — bootstrap.py double-fork breaks launchd + state.db FD leak #1458

@nesquena-hermes

Description


Summary

User reports at least one crash per day running a persistent WebUI on a Mac mini under launchd. Their agent has been able to recover by restarting, but the underlying issues are reproducible and architectural — not transient.

Two distinct bugs, both confirmed against current master.

Bug #1 — bootstrap.py exits after spawning child, breaking launchd KeepAlive

Symptom

launchd keeps respawning bootstrap.py because it exits after starting the child server, while the child keeps holding port 8787.

Root cause (verified)

bootstrap.py:243-268:

proc = subprocess.Popen(
    [python_exe, str(REPO_ROOT / "server.py")],
    cwd=str(agent_dir or REPO_ROOT),
    env=env,
    stdout=log_file,
    stderr=subprocess.STDOUT,
    start_new_session=True,         # ← decouples child from parent
)

health_url = f"http://{args.host}:{args.port}/health"
if not wait_for_health(health_url):
    raise RuntimeError(...)

info(f"Web UI is ready: {app_url}")
...
return 0                            # ← parent exits here

This is a double-fork daemon pattern. It works fine for an interactive CLI run (user runs bash start.sh, gets a working server, terminal returns to prompt). It's broken for any process supervisor (launchd, systemd, runit, supervisord) because the supervisor sees:

  1. Parent process (bootstrap.py) exits with code 0 → supervisor thinks "program completed"
  2. Server child still running in a new process group, holding port 8787
  3. Under KeepAlive=true / Restart=always, supervisor respawns bootstrap.py
  4. New bootstrap.py calls Popen([python_exe, "server.py"]) → server child fails to bind 8787 (still held by orphan) → exit
  5. New bootstrap.py raises RuntimeError from wait_for_health → exit non-zero → supervisor respawns → loop

Eventually the orphan crashes for some other reason (FD exhaustion per Bug #2), the next respawn finds 8787 free, and the loop "self-heals." The user's "agent fixes it" report is this loop intermittently succeeding.

Fix shapes

Option (a) — Foreground mode flag (preferred).
Add a --foreground flag (or auto-detect supervisor env vars such as LAUNCHD_SOCKET / INVOCATION_ID / NOTIFY_SOCKET) that runs the server inside the bootstrap process instead of spawning a child. Pattern:

if args.foreground or _detect_supervisor():
    # Replace current process with server (or run inline)
    os.execv(python_exe, [python_exe, str(REPO_ROOT / "server.py")])

os.execv() replaces the current process image, so launchd sees the long-lived server as the original child and KeepAlive works correctly.
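A sketch of the detection helper, keying off the env vars named above — treat the exact variable list as an assumption to verify (particularly the launchd marker) before shipping:

import os

# Hypothetical helper for Option (a). Returns True when the environment
# suggests a process supervisor started us; the variable list is a guess.
_SUPERVISOR_ENV_VARS = ("LAUNCHD_SOCKET", "INVOCATION_ID", "NOTIFY_SOCKET")

def _detect_supervisor() -> bool:
    return any(var in os.environ for var in _SUPERVISOR_ENV_VARS)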

Documentation update: example launchd plist using bash start.sh --foreground.
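A minimal plist sketch for those docs, assuming an install at /Users/me/agent-webui and the flag landing as --foreground (label, paths, and log location are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.example.agent-webui</string>
    <key>ProgramArguments</key>
    <array>
        <string>/bin/bash</string>
        <string>/Users/me/agent-webui/start.sh</string>
        <string>--foreground</string>
    </array>
    <key>WorkingDirectory</key>
    <string>/Users/me/agent-webui</string>
    <key>KeepAlive</key>
    <true/>
    <key>StandardOutPath</key>
    <string>/Users/me/Library/Logs/agent-webui.log</string>
    <key>StandardErrorPath</key>
    <string>/Users/me/Library/Logs/agent-webui.log</string>
</dict>
</plist>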

Option (b) — Wait for child in default path.
Change the default Popen + return 0 flow to Popen + proc.wait() so bootstrap stays alive as the server's parent. Backward-incompatible for users who relied on bootstrap returning to the prompt; gate the old behavior behind a --detach flag.
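A rough shape of that default path — the --detach flag name and the signal set are assumptions:

import signal   # new import alongside the existing subprocess import

# Sketch only: keep bootstrap alive as the server's parent so a
# supervisor tracks the right PID.
proc = subprocess.Popen(
    [python_exe, str(REPO_ROOT / "server.py")],
    cwd=str(agent_dir or REPO_ROOT),
    env=env,
    stdout=log_file,
    stderr=subprocess.STDOUT,
)
if args.detach:
    return 0  # old behavior: hand the terminal back immediately

# Forward supervisor-sent termination to the child, then mirror its exit.
for sig in (signal.SIGTERM, signal.SIGINT):
    signal.signal(sig, lambda signum, frame: proc.terminate())
return proc.wait()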

Recommend (a) — opt-in flag plus auto-detection. Doesn't break any existing user; fixes the supervisor case.

Out of scope

  • Whether the launchd plist itself should be in our repo. Probably yes as a docs/example, but separate issue.
  • Migrating start.sh to systemd-style Type=notify. Overkill.

Bug #2 — File-descriptor / state.db handle leak

Symptom

Server PID had 454 file descriptors and loads of leaked state.db handles.

Context

We shipped #1421 in v0.50.259 specifically for a SessionDB WAL handle leak — that fix released SessionDB on cached-agent reuse and on LRU eviction. But the user is reporting 454 FDs and "loads of leaked state.db handles" on a current build. Either #1421 didn't cover all paths, or a second leak shape surfaced afterward.

Investigation needed (do not blind-fix)

Possible suspects to audit:

  1. SessionDB instances created outside the streaming/cached-agent path.

    • api/streaming.py:1715-1720 creates a fresh SessionDB() per request and, since #1421 ("close previous SessionDB before replacing on cached agent"), closes the cached agent's old one. But search for SessionDB( across the codebase — any other callsite that opens one without .close() is a candidate.
    • This includes any background thread, batch processor, gateway hook, or cron worker that touches sessions.
  2. _session_db reassignment in error paths.

    • The agent's _session_db field is reassigned per turn. If an exception fires between the agent._session_db = _session_db assignment and request completion, the previous _session_db may never be closed. Check try/finally coverage around streaming; see the sketch after this list.
  3. Gateway-side session opens.

    • If gateway processes share state via the same state.db, each gateway call may open a connection. If the gateway runs in-process (it usually doesn't, but check), those connections count against the WebUI's process FD limit.
  4. WAL checkpoint never running.

    • SQLite WAL files (state.db-wal, state.db-shm) accumulate the write log when no checkpoint runs. Running PRAGMA wal_checkpoint(TRUNCATE) periodically would shrink the WAL file, but note the handle leak is a separate problem from WAL size.
  5. Other open files.

    • 454 FDs is high enough that it might not be only state.db. The user mentioned "loads of leaked state.db handles" but the total is 454. Worth asking the user for lsof -p <pid> | awk '{print $NF}' | sort | uniq -c | sort -rn | head -20 to see which files dominate.
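For suspect 2 above, the fix shape is plain try/finally coverage. A sketch — only SessionDB and agent._session_db come from this issue; the turn driver is assumed:

# Hypothetical: guarantee the outgoing handle closes no matter where
# the turn fails, instead of relying on the happy path to swap it out.
old_db, agent._session_db = agent._session_db, SessionDB()
try:
    await run_turn(agent, request)   # hypothetical turn driver
finally:
    # Runs on success and on any mid-turn exception, so the replaced
    # handle can never be stranded by an error path.
    if old_db is not None:
        old_db.close()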

Repro

Suggest the user run:

# When the WebUI is in a "running but unhealthy" state:
lsof -p $(pgrep -f 'server.py' | head -1) | wc -l
lsof -p $(pgrep -f 'server.py' | head -1) | grep state.db | wc -l
lsof -p $(pgrep -f 'server.py' | head -1) | awk '{print $NF}' | sort | uniq -c | sort -rn | head -20

Plus uptime since the last restart, approximate session count, and approximate request count over the run.

Fix shape (after diagnosis)

Once we know which open path leaks: add an explicit .close() in the missing path, plus a defensive watchdog that compares the observed FD count against len(SESSION_AGENT_CACHE) × expected-FDs-per-agent and warns when observed exceeds 2× expected. That gives us a self-diagnosing leak signal for future regressions.
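A sketch of that watchdog — the cache name matches the text above, but the baseline and per-agent budget are placeholders to be measured, not derived:

import os
import logging

log = logging.getLogger(__name__)

BASELINE_FDS = 64            # sockets, logs, stdlib machinery (measure this)
EXPECTED_FDS_PER_AGENT = 4   # state.db + wal/shm + spare (measure this)

def check_fd_budget(session_agent_cache: dict) -> None:
    expected = BASELINE_FDS + len(session_agent_cache) * EXPECTED_FDS_PER_AGENT
    observed = len(os.listdir("/dev/fd"))  # own-process FD count, macOS/Linux
    if observed > 2 * expected:
        log.warning(
            "possible FD leak: %d open FDs vs ~%d expected (%d cached agents)",
            observed, expected, len(session_agent_cache),
        )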

Out of scope

  • Switching off SQLite WAL mode (rollback is worse, this isn't the answer)
  • Hard FD limit / ulimit -n increase (band-aid, not a fix)

What we should give the user now

While the underlying fixes are in flight, three immediate mitigations:

  1. Cron job: an lsof | wc -l watchdog with auto-restart at an FD threshold (sketch after this list). Recipe goes in the docs; ~5 LOC of bash. Their agent can run this while we land the proper fixes.
  2. launchd plist that uses bash start.sh --foreground once that flag exists.
  3. Workaround for the bootstrap.py exit-on-success today: invoke python server.py directly under launchd, bypassing bootstrap, and accept that auto-install-agent and dep-check won't run on supervisor restarts. Document in a "running under launchd" section of the README.
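A sketch of the mitigation-1 watchdog — threshold, match pattern, and log path are placeholders:

#!/usr/bin/env bash
# Run every few minutes via cron or a launchd StartInterval job.
THRESHOLD=400
PID=$(pgrep -f 'server.py' | head -1)
[ -z "$PID" ] && exit 0
FDS=$(lsof -p "$PID" 2>/dev/null | wc -l)
if [ "$FDS" -gt "$THRESHOLD" ]; then
  echo "$(date): $FDS FDs on pid $PID, restarting" >> "$HOME/webui-watchdog.log"
  kill "$PID"   # the supervisor's KeepAlive brings the server back up
fi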

Reporter / context

Reported May 02 2026 by a user running a persistent WebUI on a Mac mini under launchd. At least one crash per day. The user's agent has been recovering it ad hoc; root causes are identified above.

This is a stability issue affecting any user who runs the WebUI as a persistent service, which is a setup we want to support — particularly for the Mac mini / always-on-server use case.

Suggested labels

bug, stability, mac, priority:high

Estimated effort
