Follow-up to #1458. That issue was closed by the bootstrap supervisor fix in #1483, but the production failure still reproduced after that class of fix: WebUI process alive, port listening, HTTP requests reset, with a large number of leaked `state.db` file descriptors.
## Summary

Persistent WebUI on macOS launchd can become process-alive and port-listening while all HTTP requests fail with `Connection reset by peer`. In the observed case, the server process had hundreds of open file descriptors, dominated by repeated `~/.hermes/state.db`, `state.db-wal`, and `state.db-shm` handles.

I found one concrete remaining leak path in the WebUI's raw sqlite usage. Several callsites use `with sqlite3.connect(...) as conn:`, but Python's sqlite connection context manager only commits or rolls back; it does not close the connection. In a long-running process with sidebar polling, this leaks FDs.
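For anyone unfamiliar with this footgun, a minimal demonstration (in-memory DB purely for illustration):

```python
import sqlite3

# `with conn:` scopes a transaction: commit on success, rollback on error.
with sqlite3.connect(":memory:") as conn:
    conn.execute("CREATE TABLE t (x INTEGER)")

# The block has exited, yet the connection is still open and usable.
conn.execute("SELECT 1")  # would raise ProgrammingError if it were closed
conn.close()
```

On a file-backed database, every such un-closed connection keeps its `state.db` (and, under WAL, `-wal`/`-shm`) descriptors alive until the interpreter happens to garbage-collect it.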
## Environment

- macOS, Mac mini
- launchd service
- WebUI bound to `0.0.0.0:8787`
- Persistent, always-on WebUI
- Local repo before patch: v0.50.266-dirty; checked against `origin/master` at `7fddc33`
## Observed failure

```
$ curl -sv http://127.0.0.1:8787/health
* Connected to 127.0.0.1 (127.0.0.1) port 8787
> GET /health HTTP/1.1
* Recv failure: Connection reset by peer
```
Launchd and socket state at the same time:
```
$ launchctl print gui/501/ai.hermes.webui
    state = running
    program = /Users/jeremygavin/.hermes/hermes-webui/start.sh
    runs = 2
    pid = 22461

$ lsof -nP -iTCP:8787 -sTCP:LISTEN
Python  22468 ... TCP *:8787 (LISTEN)
```
FD state:
```
server pid: 22468
$ lsof -p 22468 | wc -l                    -> 366
$ lsof -p 22468 | grep state.db | wc -l    -> 238
```
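For anyone reproducing this, the growth is easy to watch live; a loop like the following (pid taken from the capture above) shows the `state.db` handle count climbing while the sidebar polls:

```sh
# Sample the state.db handle count every 5 seconds for the server pid.
while sleep 5; do lsof -p 22468 | grep -c state.db; done
```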
## Source audit

Current `origin/master` still has raw sqlite context-manager use in at least these files:

```python
# api/agent_sessions.py
with sqlite3.connect(str(db_path)) as conn:
    ...

# api/models.py
with sqlite3.connect(str(db_path)) as conn:
    ...
```
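A rough way to enumerate remaining raw callsites (the pattern is approximate and will miss aliased imports or multi-line calls):

```sh
grep -rn "with sqlite3.connect" api/ | grep -v "closing("
```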
Specific callsites patched locally:

- `api/agent_sessions.py:read_importable_agent_session_rows()`
- `api/agent_sessions.py:read_session_lineage_metadata()`
- `api/models.py:get_cli_session_messages()`
- `api/models.py:delete_cli_session()`
## Local fix shape

```python
from contextlib import closing

with closing(sqlite3.connect(str(db_path))) as conn:
    conn.row_factory = sqlite3.Row
    ...
```
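One nuance for the write paths (e.g. `delete_cli_session()`): the old context manager was what provided commit-on-success, and `closing(...)` does not. Nesting both keeps each behavior; a sketch, with an illustrative table name rather than the real schema:

```python
from contextlib import closing
import sqlite3

# closing() guarantees the connection is closed; the inner `with conn:`
# still commits on success and rolls back on exception.
with closing(sqlite3.connect(str(db_path))) as conn:
    with conn:
        conn.execute("DELETE FROM sessions WHERE id = ?", (session_id,))
```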
## Verification after local patch

After patching those sqlite callsites, cleanly restarting the launchd service, and checkpointing the WAL:

```
GET /health on 127.0.0.1:8787    -> 200 {"status": "ok"}
GET /health on 100.78.69.27:8787 -> 200 {"status": "ok"}
GET /api/sessions                -> 200
GET /api/projects                -> 200
Browser loaded root page, title: Hermes
Browser console errors: 0
```
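For reference, a WAL checkpoint can be forced with the standard pragma via the sqlite3 CLI (the exact invocation used locally may have differed):

```sh
sqlite3 ~/.hermes/state.db "PRAGMA wal_checkpoint(TRUNCATE);"
```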
Stress loop against session-list polling paths:
```sh
for batch in 1 2 3 4 5; do
  for i in $(seq 1 20); do
    curl -fsS http://127.0.0.1:8787/api/sessions >/dev/null
    curl -fsS http://127.0.0.1:8787/api/projects >/dev/null
  done
  lsof -p "$PID" | wc -l
  lsof -p "$PID" | grep -c state.db
done
```
After patch:
```
batch=1 fd=92 state_handles=0
batch=2 fd=92 state_handles=0
batch=3 fd=92 state_handles=0
batch=4 fd=92 state_handles=0
batch=5 fd=92 state_handles=0
```
Before the patch, repeated sidebar/session polling increased the FD count and the number of open `state.db` handles steadily.
## Expected behavior

Repeated `/api/sessions` polling should not increase the process FD count or keep new `state.db` handles open.
## Suggested fix

- Wrap every raw `sqlite3.connect()` in WebUI long-lived process code with `contextlib.closing(...)`, unless the connection is explicitly closed elsewhere.
- Add a regression test around the session-list codepath that catches connection leaks. If FD count is hard to assert portably, monkeypatch `sqlite3.connect()` with a close-tracking fake and assert all connections are closed (see the test sketch after this list).
- Consider adding a deep health endpoint that exercises `/api/sessions` or the session-store path, since `/health` alone can miss the main UI becoming unusable (see the endpoint sketch after this list).
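A sketch of the close-tracking test, assuming pytest plus an HTTP test-client fixture named `client` from the WebUI's existing harness (fixture names here are placeholders):

```python
import sqlite3

def test_session_polling_closes_connections(monkeypatch, client):
    opened = []
    real_connect = sqlite3.connect

    def tracking_connect(*args, **kwargs):
        conn = real_connect(*args, **kwargs)
        opened.append(conn)
        return conn

    # The leaking callsites call sqlite3.connect(...) directly, so patching
    # the module attribute intercepts all of them.
    monkeypatch.setattr(sqlite3, "connect", tracking_connect)

    for _ in range(20):
        assert client.get("/api/sessions").status_code == 200

    # A closed sqlite3.Connection raises ProgrammingError on any use.
    leaked = 0
    for conn in opened:
        try:
            conn.execute("SELECT 1")
            leaked += 1
        except sqlite3.ProgrammingError:
            pass  # closed, as expected
    assert leaked == 0, f"{leaked} sqlite connection(s) left open"
```

And the deep health endpoint could be as small as the following, assuming a FastAPI-style app (`DB_PATH` and the import wiring are illustrative; the real WebUI setup may differ):

```python
from fastapi import FastAPI, HTTPException

from api.agent_sessions import read_importable_agent_session_rows

app = FastAPI()
DB_PATH = "~/.hermes/state.db"  # placeholder; use the real resolved path

@app.get("/health/deep")
def deep_health():
    # Exercise the same sqlite path the sidebar uses, so a broken session
    # store fails this check even while plain /health still returns 200.
    try:
        rows = read_importable_agent_session_rows(DB_PATH)
    except Exception as exc:
        raise HTTPException(status_code=503, detail=str(exc))
    return {"status": "ok", "sessions": len(rows)}
```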
## Relationship to #1458
This is one of the unresolved #1458 failure modes. #1483 appears to address the bootstrap supervisor/double-fork bug, but not this sqlite FD leak. The user-visible symptom can be the same: launchd says the service is running, the port is listening, but the WebUI is broken.