bug: state.db FD leak in sidebar session polling can wedge WebUI HTTP handling #1494

@insecurejezza

Description

Follow-up to #1458. That issue was closed by the bootstrap supervisor fix in #1483, but the production failure reproduced again with that fix in place: WebUI process alive, port listening, HTTP requests reset, and a large number of leaked state.db file descriptors.

Summary

Persistent WebUI on macOS launchd can become process-alive and port-listening while all HTTP requests fail with Connection reset by peer. In the observed case, the server process had hundreds of open file descriptors, dominated by repeated ~/.hermes/state.db, state.db-wal, and state.db-shm handles.

I found one concrete remaining leak path in the WebUI's raw sqlite usage. Several callsites use with sqlite3.connect(...) as conn:, but Python's sqlite3 connection context manager only commits or rolls back the active transaction; it does not close the connection. In a long-running process that serves sidebar polling, every such call leaks a file descriptor.
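
For reference, a minimal standalone sketch of that behavior (the /tmp path is just an illustrative throwaway database): the connection is still usable after the with block exits, which is only possible if it was never closed.

import sqlite3

with sqlite3.connect("/tmp/leak_demo.db") as conn:
    conn.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")

# The with block committed the transaction but did not close the connection;
# a closed connection would raise sqlite3.ProgrammingError here.
conn.execute("SELECT 1")
conn.close()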

Environment

  • macOS, Mac mini
  • launchd service
  • WebUI bound to 0.0.0.0:8787
  • Persistent, always-on WebUI
  • Local repo before patch was v0.50.266-dirty, remote checked against origin/master at 7fddc33

Observed failure

curl -sv http://127.0.0.1:8787/health
* Connected to 127.0.0.1 (127.0.0.1) port 8787
> GET /health HTTP/1.1
* Recv failure: Connection reset by peer

Launchd and socket state at the same time:

launchctl print gui/501/ai.hermes.webui
  state = running
  program = /Users/jeremygavin/.hermes/hermes-webui/start.sh
  runs = 2
  pid = 22461

lsof -nP -iTCP:8787 -sTCP:LISTEN
  Python 22468 ... TCP *:8787 (LISTEN)

FD state:

server pid: 22468
lsof -p 22468 | wc -l                  -> 366
lsof -p 22468 | grep state.db | wc -l  -> 238

Source audit

Current origin/master still has raw sqlite context-manager use in at least these files:

# api/agent_sessions.py
with sqlite3.connect(str(db_path)) as conn:
    ...

# api/models.py
with sqlite3.connect(str(db_path)) as conn:
    ...

Specific callsites patched locally:

  • api/agent_sessions.py:read_importable_agent_session_rows()
  • api/agent_sessions.py:read_session_lineage_metadata()
  • api/models.py:get_cli_session_messages()
  • api/models.py:delete_cli_session()

Local fix shape

import sqlite3
from contextlib import closing

with closing(sqlite3.connect(str(db_path))) as conn:
    conn.row_factory = sqlite3.Row
    ...
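
One note on the fix shape (an assumption about the write paths, not something this patch verified): closing(...) only guarantees close(); it does not replace the commit-on-success behavior of the bare connection context manager. Write callsites such as delete_cli_session() can keep that behavior by nesting both managers; session_id and the SQL below are illustrative only.

import sqlite3
from contextlib import closing

with closing(sqlite3.connect(str(db_path))) as conn:
    with conn:  # commit on success, rollback on exception
        # Illustrative statement; the real table/columns live in state.db.
        conn.execute("DELETE FROM sessions WHERE id = ?", (session_id,))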

Verification after local patch

After patching those sqlite callsites, cleanly restarting launchd, and checkpointing the WAL:

GET /health on 127.0.0.1:8787      -> 200 {"status": "ok"}
GET /health on 100.78.69.27:8787   -> 200 {"status": "ok"}
GET /api/sessions                  -> 200
GET /api/projects                  -> 200
Browser loaded root page, title: Hermes
Browser console errors: 0

Stress loop against session-list polling paths:

PID=22468  # WebUI server pid (see lsof output above)
for batch in 1 2 3 4 5; do
  for i in $(seq 1 20); do
    curl -fsS http://127.0.0.1:8787/api/sessions >/dev/null
    curl -fsS http://127.0.0.1:8787/api/projects >/dev/null
  done
  lsof -p "$PID" | wc -l
  lsof -p "$PID" | grep -c state.db
done

After patch:

batch=1 fd=92 state_handles=0
batch=2 fd=92 state_handles=0
batch=3 fd=92 state_handles=0
batch=4 fd=92 state_handles=0
batch=5 fd=92 state_handles=0

Before patch, repeated sidebar/session polling increased FD count and state.db handles steadily.

Expected behavior

Repeated /api/sessions polling should not increase process FD count or keep new state.db handles open.

Suggested fix

  1. Wrap every raw sqlite3.connect() in WebUI long-lived process code with contextlib.closing(...), unless the connection is explicitly closed elsewhere.
  2. Add a regression test around the session-list codepath that catches connection leaks. If FD count is hard to assert portably, monkeypatch sqlite3.connect() with a close-tracking fake and assert all connections are closed (see the sketch after this list).
  3. Consider adding a deep health endpoint that exercises /api/sessions or the session-store path, since /health alone can miss the main UI becoming unusable.
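
A hedged sketch of what the test in point 2 could look like, assuming pytest and that the module is importable as api.agent_sessions; CloseTrackingConnection and the schema-seeding step are illustrative, not existing fixtures.

import sqlite3
import api.agent_sessions as agent_sessions

class CloseTrackingConnection:
    # Minimal fake: wraps a real sqlite3 connection and records close() calls.
    def __init__(self, real):
        object.__setattr__(self, "_real", real)
        object.__setattr__(self, "closed", False)

    def close(self):
        object.__setattr__(self, "closed", True)
        self._real.close()

    def __getattr__(self, name):
        return getattr(self._real, name)

    def __setattr__(self, name, value):
        # Forward attribute writes (e.g. row_factory) to the real connection.
        setattr(self._real, name, value)

    def __enter__(self):
        self._real.__enter__()
        return self

    def __exit__(self, *exc):
        return self._real.__exit__(*exc)

def test_session_list_closes_connections(monkeypatch, tmp_path):
    opened = []
    real_connect = sqlite3.connect

    def tracking_connect(*args, **kwargs):
        conn = CloseTrackingConnection(real_connect(*args, **kwargs))
        opened.append(conn)
        return conn

    monkeypatch.setattr(sqlite3, "connect", tracking_connect)

    db_path = tmp_path / "state.db"
    # Illustrative call; a real test would seed db_path with the WebUI schema first.
    try:
        agent_sessions.read_importable_agent_session_rows(db_path)
    except Exception:
        pass  # the leak assertion below is meaningful even if the query itself fails

    assert opened, "session-list codepath should open at least one connection"
    assert all(c.closed for c in opened)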

Relationship to #1458

This is one of the unresolved #1458 failure modes. #1483 appears to address the bootstrap supervisor/double-fork bug, but not this sqlite FD leak. The user-visible symptom can be the same: launchd says the service is running, the port is listening, but the WebUI is broken.
