Summary
User reports at least one crash per day running a persistent WebUI on a Mac mini under launchd. Their agent has been able to recover by restarting, but the underlying issues are reproducible and architectural — not transient.
Two distinct bugs, both confirmed against current master.
Bug #1 — bootstrap.py exits after spawning child, breaking launchd KeepAlive
Symptom
launchd was also respawning bootstrap.py because it exited after starting the child server, while the child kept holding port 8787.
Root cause (verified)
bootstrap.py:243-268:

```python
proc = subprocess.Popen(
    [python_exe, str(REPO_ROOT / "server.py")],
    cwd=str(agent_dir or REPO_ROOT),
    env=env,
    stdout=log_file,
    stderr=subprocess.STDOUT,
    start_new_session=True,  # ← decouples child from parent
)
health_url = f"http://{args.host}:{args.port}/health"
if not wait_for_health(health_url):
    raise RuntimeError(...)
info(f"Web UI is ready: {app_url}")
...
return 0  # ← parent exits here
```
This is a double-fork daemon pattern. It works fine for an interactive CLI run (user runs bash start.sh, gets a working server, terminal returns to prompt). It's broken for any process supervisor (launchd, systemd, runit, supervisord) because the supervisor sees:
1. Parent process (bootstrap.py) exits with code 0 → supervisor thinks "program completed".
2. Server child is still running in a new process group, holding port 8787.
3. Under KeepAlive=true / Restart=always, the supervisor respawns bootstrap.py.
4. The new bootstrap.py calls Popen([python_exe, "server.py"]) → server child fails to bind 8787 (still held by the orphan) → exit.
5. The new bootstrap.py raises RuntimeError from wait_for_health → non-zero exit → supervisor respawns → loop.
Eventually the orphan crashes for some other reason (FD exhaustion per Bug #2), the next respawn finds 8787 free, and the loop "self-heals." User's "agent fixes it" is observing this loop intermittently succeed.
Fix shapes
Option (a) — Foreground mode flag (preferred).
Add a --foreground flag (or auto-detect the LAUNCHD_SOCKET / INVOCATION_ID / NOTIFY_SOCKET env vars) that runs the server inside the bootstrap process instead of spawning a child. Pattern:
```python
if args.foreground or _detect_supervisor():
    # Replace current process with server (or run inline)
    os.execv(python_exe, [python_exe, str(REPO_ROOT / "server.py")])
```
os.execv() replaces the current process image, so launchd sees the long-lived server as the original child and KeepAlive works correctly.
Documentation update: example launchd plist using bash start.sh --foreground.
Option (b) — Wait for child in default path.
Change the default Popen + return 0 flow to Popen + proc.wait() so bootstrap stays alive as parent of the server. Backward-incompatible for users who relied on bootstrap returning to prompt; gate behind --detach flag for the old behavior.
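Option (b) could look like the following sketch; the function name, the `detach` flag, and the interrupt handling are illustrative, not existing code:

```python
import subprocess
import sys

def run_server(server_path: str, detach: bool = False) -> int:
    """Spawn server.py and, by default, stay alive as its parent.

    `detach` is the hypothetical opt-in flag restoring the old
    fire-and-forget behavior.
    """
    proc = subprocess.Popen([sys.executable, server_path])
    if detach:
        return 0  # old behavior: bootstrap returns, child keeps running
    try:
        return proc.wait()  # supervisor now tracks a live process tree
    except KeyboardInterrupt:
        proc.terminate()
        return proc.wait()
```

With `detach=False` the supervisor sees a long-lived process and KeepAlive behaves as intended.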
Recommend (a) — opt-in flag plus auto-detection. Doesn't break any existing user; fixes the supervisor case.
Out of scope
Whether the launchd plist itself should be in our repo. Probably yes as a docs/example, but separate issue.
Migrating start.sh to systemd-style Type=notify. Overkill.
Bug #2 — File-descriptor / state.db handle leak
Symptom
Server PID had 454 file descriptors and loads of leaked state.db handles.
Context
We shipped #1421 in v0.50.259 specifically for a SessionDB WAL handle leak — that fix released SessionDB on cached-agent reuse and on LRU eviction. But the user is reporting >454 FDs and "loads of leaked state.db handles" on a current build. Either #1421 didn't cover all paths, or there's a second leak shape that surfaced after.
Investigation needed (do not blind-fix)
Possible suspects to audit:
SessionDB instances created outside the streaming/cached-agent path.
api/streaming.py:1715-1720 creates a fresh SessionDB() per request and, after #1421 ("close previous SessionDB before replacing on cached agent"), closes the cached agent's old one. But search for SessionDB( across the codebase — any other callsite that opens one without .close() is a candidate.
Includes any background thread, batch processing, gateway hook, cron worker that touches sessions.
_session_db reassignment in error paths.
The agent's _session_db field is reassigned per turn. If an exception fires between the agent._session_db = _session_db line and the request completion, the previous _session_db may not be closed. Check try/finally coverage around streaming.
Gateway-side session opens.
If gateway processes share state via the same state.db, each gateway call may open a connection. If the gateway runs in-process (it usually doesn't, but check), those connections inherit the WebUI's process FD limit.
WAL checkpoint never running.
SQLite WAL files (state.db-wal, state.db-shm) accumulate the write-ahead log when no checkpointing happens. Running PRAGMA wal_checkpoint(TRUNCATE) periodically would shrink the WAL file, but the handle leak is a separate issue from WAL size.
Other open files.
454 FDs is high enough that it might not be only state.db. The user mentioned "loads of leaked state.db handles" but the total is 454. Worth asking the user for lsof -p <pid> | sort | uniq -c | sort -rn | head -20 to see what dominates.
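Of these suspects, the _session_db reassignment hazard is the easiest to illustrate with a toy model (class and helper names below are stand-ins, not the real WebUI API):

```python
class FakeSessionDB:
    """Stand-in for SessionDB: records whether close() was called."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

class FakeAgent:
    """Stand-in for a cached agent holding a _session_db handle."""
    def __init__(self):
        self._session_db = None

def attach_session_db(agent, new_db):
    """Swap in a new handle, closing the old one immediately.

    Closing before any later code that can raise means an exception
    mid-turn cannot orphan the previous state.db connection.
    """
    old, agent._session_db = agent._session_db, new_db
    if old is not None:
        old.close()
    return agent._session_db
```

Any path that reassigns the field without this close-on-replace step leaks one handle per turn.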
Repro
Suggest the user run:
```sh
# When the WebUI is in a "running but unhealthy" state:
lsof -p $(pgrep -f 'server.py' | head -1) | wc -l
lsof -p $(pgrep -f 'server.py' | head -1) | grep state.db | wc -l
lsof -p $(pgrep -f 'server.py' | head -1) | sort | uniq -c | sort -rn | head -20
```
Plus uptime since last restart, approximate session count, approximate request count over the run.
Fix shape (after diagnosis)
Once we know which open path leaks: add explicit .close() in the missing path, plus a defensive watchdog that tracks len(SESSION_AGENT_CACHE) × expected-FDs-per-agent and logs a warning if observed FD count is more than 2× expected. That gives us a self-diagnosing leak signal in future regressions.
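A possible shape for that watchdog; the per-agent FD estimate and baseline below are assumptions that would be tuned against a healthy process:

```python
import os

def open_fd_count() -> int:
    """Count this process's open file descriptors (POSIX only)."""
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        if os.path.isdir(fd_dir):
            return len(os.listdir(fd_dir))
    raise OSError("no fd directory available on this platform")

def fd_budget_ok(cache_size: int, fds_per_agent: int = 4,
                 baseline: int = 64) -> bool:
    """Warn (return False) when open FDs exceed 2x the expected budget.

    Expected budget = baseline + cache_size * fds_per_agent, where
    fds_per_agent and baseline are illustrative guesses.
    """
    expected = baseline + cache_size * fds_per_agent
    observed = open_fd_count()
    if observed > 2 * expected:
        print(f"possible FD leak: {observed} open, ~{expected} expected")
        return False
    return True
```

Calling this on a timer (or per N requests) turns a silent leak into a log line long before FD exhaustion.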
Out of scope
Switching off SQLite WAL mode (rollback is worse, this isn't the answer)
Hard FD limit / ulimit -n increase (band-aid, not a fix)
What we should give the user now
While the underlying fixes are in flight, three immediate mitigations:
Cron job: lsof | wc watchdog with auto-restart at FD threshold. Recipe in docs. ~5 LOC bash. Their agent can run this while we land the proper fixes.
launchd plist that uses bash start.sh --foreground once that flag exists.
Workaround for bootstrap.py exit-on-success today: invoke python server.py directly under launchd, bypass bootstrap, accept that auto-install-agent and dep-check don't run on supervisor restarts. Document in a "running under launchd" section of the README.
Reporter / context
Reported May 02 2026 by user running persistent WebUI on Mac mini under launchd. At least one crash per day. User's agent has been recovering it ad-hoc; root causes identified above.
This is a stability issue affecting any user who runs the WebUI as a persistent service, which is a setup we want to support — particularly for the Mac mini / always-on-server use case.
Suggested labels
bug, stability, mac, priority:high
Estimated effort