Skip to content

fix(sqlite): close state.db connections explicitly to stop FD leak in sidebar polling (#1494)#1495

Merged
1 commit merged intomasterfrom
fix/state-db-fd-leak-1494
May 3, 2026
Merged

fix(sqlite): close state.db connections explicitly to stop FD leak in sidebar polling (#1494)#1495
1 commit merged intomasterfrom
fix/state-db-fd-leak-1494

Conversation

@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Summary

Production fix for Bug #2 of #1458, reported with full diagnostic + verified fix shape in #1494 by @insecurejezza.

What was happening

Persistent WebUI on macOS launchd became process-alive, port-listening, HTTP-unhealthy after the bootstrap supervisor fix in #1483. Every request reset before a response:

$ curl -sv http://127.0.0.1:8787/health
* Connected to 127.0.0.1 (127.0.0.1) port 8787
> GET /health HTTP/1.1
* Recv failure: Connection reset by peer

The reporter's lsof showed FD exhaustion driven entirely by state.db handles:

server pid: 22468
lsof -p 22468 | wc -l                  -> 366
lsof -p 22468 | grep state.db | wc -l  -> 238

After macOS's 256 default soft FD limit, every fresh request handler that called sqlite3.connect() raised before producing any bytes — handler returned, ThreadingHTTPServer closed the conn, kernel RST'd the client. Process stayed alive, accept loop stayed healthy, port stayed in LISTEN.

Root cause

Four sqlite callsites use Python's with sqlite3.connect(...) as conn: pattern. sqlite3.Connection.__exit__ only commits or rolls back. It does not close the connection. From the Python docs:

A Connection object can be used as a context manager that automatically commits or rolls back open transactions when leaving the body of the context manager. […] The context manager neither implicitly opens a new transaction nor closes the connection.

/api/sessions polling calls these on every sidebar refresh, so each poll leaks one or more open state.db / state.db-wal / state.db-shm FDs.

Files patched

from contextlib import closing

with closing(sqlite3.connect(str(db_path))) as conn:
    conn.row_factory = sqlite3.Row
    ...

Applied at:

  • api/agent_sessions.py:read_importable_agent_session_rows — sidebar agent-session listing
  • api/agent_sessions.py:read_session_lineage_metadata — sidebar lineage collapse (called on every poll)
  • api/models.py:get_cli_session_messages — CLI session detail view
  • api/models.py:delete_cli_session — CLI session deletion (explicit conn.commit() retained)

These four callsites match exactly the audit @insecurejezza did against origin/master at 7fddc33. I re-ran the audit (rg 'sqlite3\.connect' --glob '!tests/') and confirmed there are no other production callsites of the leaking pattern.

The pre-existing SessionDB cache fix from #1421 is orthogonal — it covered cached connection reassignment and LRU eviction, not these raw sqlite uses.

Verification

Reporter's stress loop (post-fix)

for batch in 1 2 3 4 5; do
  for i in $(seq 1 20); do
    curl -fsS http://127.0.0.1:8787/api/sessions >/dev/null
    curl -fsS http://127.0.0.1:8787/api/projects >/dev/null
  done
  lsof -p "$PID" | wc -l
  lsof -p "$PID" | grep -c state.db
done
batch=1 fd=92 state_handles=0
batch=2 fd=92 state_handles=0
batch=3 fd=92 state_handles=0
batch=4 fd=92 state_handles=0
batch=5 fd=92 state_handles=0

Pre-fix the same loop made both numbers climb monotonically.

Regression tests

tests/test_issue1494_state_db_fd_leak.py — 4 tests, one per callsite. Monkeypatches sqlite3.connect with a _TrackingConn wrapper that records .close() calls, then asserts every connection opened by the function under test is explicitly closed.

The tests follow the same lightweight-tracking-conn pattern already used in test_pr1370_lineage_metadata_perf_and_orphan.py, kept inline so the regression survives refactors there.

$ pytest tests/test_issue1494_state_db_fd_leak.py -v
tests/test_issue1494_state_db_fd_leak.py::test_read_importable_agent_session_rows_closes_connection PASSED
tests/test_issue1494_state_db_fd_leak.py::test_read_session_lineage_metadata_closes_connection PASSED
tests/test_issue1494_state_db_fd_leak.py::test_get_cli_session_messages_closes_connection PASSED
tests/test_issue1494_state_db_fd_leak.py::test_delete_cli_session_closes_connection PASSED
============================== 4 passed in 1.94s ===============================

I also confirmed the tests catch the bug — reverting just the closing() wrap in read_importable_agent_session_rows produces:

AssertionError: read_importable_agent_session_rows leaked 5 of 5 sqlite connection(s)
— context-manager-only `with sqlite3.connect()` does not close.
Wrap in contextlib.closing(). See #1494.

Full suite

$ pytest tests/ -q
3870 passed, 2 skipped, 3 xpassed, 8 subtests passed in 87.03s

(3866 → 3870, the 4 new regression tests.)

Scope discipline — what this PR does NOT do

#1458 catalogs three failure modes. This PR fixes Bug #2 only.

Commit message uses Refs #1458 (Bug #2 of 3) rather than Closes #1458 so the umbrella stays open until Bug #3 lands.

Attribution

@insecurejezza did all the production diagnosis (wedge reproduction, lsof analysis, source audit, exact callsite list, fix shape, verification stress loop). I just turned the verified fix into a PR with regression tests. Co-authored-by trailer on the commit.

Closes #1494
Refs #1458 (Bug #2 of 3)

… sidebar polling (#1494)

Production WebUI on macOS launchd reproduced an HTTP-unhealthy wedge after
#1483 closed the bootstrap supervisor double-fork: process alive, port
listening, every HTTP request reset by peer before a response. The reporter
(@insecurejezza) traced it to FD exhaustion — 366 open FDs on the wedged
process, 238 of them `~/.hermes/state.db`, `state.db-wal`, and `state.db-shm`.

Root cause: four sqlite callsites use `with sqlite3.connect(...) as conn:`.
Python's sqlite3 connection context manager only commits or rolls back on
exit; it does NOT close the connection. `/api/sessions` polling calls these
on every sidebar refresh, so each poll leaked one or more open state.db FDs
until the process hit macOS's soft FD limit and new sqlite3.connect() calls
inside fresh request handlers raised before any response bytes were written.

Fix: wrap each `sqlite3.connect(...)` in `contextlib.closing(...)` so the
connection is explicitly closed on scope exit, in addition to the auto-
commit / rollback semantics that `Connection.__exit__` already provides.

Callsites patched:
- api/agent_sessions.py:read_importable_agent_session_rows
- api/agent_sessions.py:read_session_lineage_metadata
- api/models.py:get_cli_session_messages
- api/models.py:delete_cli_session

Reporter's verification (post-patch, 100-request stress loop against
/api/sessions and /api/projects):

  batch=1 fd=92 state_handles=0
  batch=2 fd=92 state_handles=0
  ...
  batch=5 fd=92 state_handles=0

Pre-patch the same loop made FD count and state.db handle count climb
monotonically.

4 regression tests in tests/test_issue1494_state_db_fd_leak.py monkeypatch
sqlite3.connect with a tracking wrapper that records .close() calls and
assert every connection opened by each of the four functions is explicitly
closed. Verified to fail (catching the original bug) when the closing()
wrap is reverted: "leaked 5 of 5 sqlite connection(s) — context-manager-
only `with sqlite3.connect()` does not close. Wrap in contextlib.closing()."

This addresses Bug #2 of the umbrella issue #1458. Bug #3 (HTTP-unhealthy
wedge in the absence of FD exhaustion) remains open pending separate
diagnostic data — explicit scope discipline.

Closes #1494
Refs #1458 (Bug #2 of 3)

Co-authored-by: insecurejezza <[email protected]>
Copy link
Copy Markdown
Owner

@nesquena nesquena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — end-to-end ✅ (clean approve, behavioural harness confirms bug + fix)

Production fix for #1494 / Bug #2 of #1458. Wraps four with sqlite3.connect(...) as conn: callsites in contextlib.closing(...) because Python's sqlite3 connection context manager only commits/rolls back on exit — it does not close the connection. /api/sessions polling on a long-running webui leaked one or more state.db* FDs per poll until the process hit the macOS 256-FD soft limit and new request handlers RST'd before producing bytes.

What this ships

Traced against upstream hermes-agent

Webui-only change. The CLI / agent has its own hermes_state.SessionDB layer with proper connection lifecycle — verified state.db write paths in the agent tarball don't share these four functions. The SessionDB cache fix from #1421 is orthogonal (covered cached-connection LRU eviction, not raw sqlite3.connect() callers). No cross-tool surface. ✓

Behavioural harness — bug and fix verified

$ python3 -c "..." # see review for full script
BUG: connection still open after with block (this is the bug)
FIX OK: connection closed after closing(): Cannot operate on a closed database.

Confirms:

  1. with sqlite3.connect(p) as c: does NOT close — the connection remains usable after the block (FD still open).
  2. with closing(sqlite3.connect(p)) as c: DOES close — cursor() raises ProgrammingError: Cannot operate on a closed database. after the block.

Audit — exhaustive sweep of production callsites

$ grep -rn "sqlite3.connect" api/ --include="*.py"
api/agent_sessions.py:238:    with closing(sqlite3.connect(str(db_path))) as conn:
api/agent_sessions.py:310:        with closing(sqlite3.connect(str(db_path))) as conn:
api/models.py:1105:        with closing(sqlite3.connect(str(db_path))) as conn:
api/models.py:1146:        with closing(sqlite3.connect(str(db_path))) as conn:

All 4 production callsites in api/ are wrapped. No other sqlite3.connect() invocations exist outside the test directory. The PR description's audit claim ("re-ran the audit and confirmed there are no other production callsites of the leaking pattern") is verified.

End-to-end trace

Pre-fix request flow on a long-running webui:

  1. Browser polls /api/sessions every few seconds.
  2. Each call invokes read_importable_agent_session_rows (api/agent_sessions.py:235) and read_session_lineage_metadata (api/agent_sessions.py:307).
  3. Each function used with sqlite3.connect(str(db_path)) as conn:__exit__ commits, never closes.
  4. Connection's underlying FDs (state.db, state.db-wal, state.db-shm) stay open.
  5. After ~80 polls (3 FDs × ~80 calls = ~240 + base ~25 = ~256), the process hits macOS soft FD limit.
  6. Next request handler thread tries sqlite3.connect()OSError: [Errno 24] Too many open files.
  7. Handler returns without writing bytes; ThreadingHTTPServer closes the socket; kernel RSTs the client.
  8. Process stays alive (accept loop is fine), port stays in LISTEN, every request RSTs. The exact symptoms reported.

Post-fix: every connection is closed on scope exit; FD count flat across stress tests (reporter measured 92 across 100 requests).

Edge-case trace

Scenario Pre-fix Post-fix
100 sidebar polls on idle webui FD count climbs by 1-3 per poll flat at base ✅
Process hits 256 FD limit requests RST, server appears dead doesn't reach limit ✅
read_* exits via exception connection leaked closing() still closes ✅
delete_cli_session with explicit commit commit + connection leaked commit + close ✅
delete_cli_session exception before commit connection leaked, partial state close, no half-write ✅
Concurrent delete_cli_session from two threads each leaks each closes; no shared state ✅
Read-only path (3 callsites) — no auto-commit needed works works ✅

Security audit

  • ✅ No XSS, no SSRF, no path traversal — pure sqlite connection lifecycle fix.
  • ✅ No new endpoints; no auth changes; no config.yaml writes.
  • closing() is from stdlib contextlib, well-tested.
  • ✅ The _TrackingConn wrapper in tests is test-only (under tests/) and is not production code.

Tests

  • tests/test_issue1494_state_db_fd_leak.py — 4/4 pass:
    • test_read_importable_agent_session_rows_closes_connection — calls 5×, asserts 5/5 closed
    • test_read_session_lineage_metadata_closes_connection — calls 5×, asserts 5/5 closed
    • test_get_cli_session_messages_closes_connection — calls 5×, asserts 2 messages returned + all closed (sanity-checks the real query path is exercised, not just open/close)
    • test_delete_cli_session_closes_connection — first call deletes (rowcount > 0), second is idempotent no-op; verifies commit semantics survived via a separate sqlite3.Connection that re-queries the table afterward
  • PR description claims the test catches the bug. Verified by reverting the closing() wrap on read_importable_agent_session_rows — produces:
    AssertionError: read_importable_agent_session_rows leaked 5 of 5 sqlite connection(s) — context-manager-only `with sqlite3.connect()` does not close. Wrap in contextlib.closing(). See #1494.
    
  • Full suite: 3817 passed, 55 skipped, 3 xpassed, 0 failed in 18.64s on the PR branch.
  • CI: 3.11 / 3.12 / 3.13 all green.

Race / state analysis

  • Concurrent calls to the patched functions (e.g. two sidebar polls hitting /api/sessions near-simultaneously): each opens its own connection, runs SELECTs, closes. No shared connection state between threads. ✓
  • delete_cli_session two-call race: explicit conn.commit() before return; second call sees the deleted state on its own connection. SQLite WAL mode handles this. ✓
  • Exception during SELECT inside with closing(...): closing.__exit__ calls close() regardless of exception status. FD released. ✓

Minor observations (non-blocking)

  • closing() skips the original __exit__'s auto-commit/rollback semantics. contextlib.closing(thing) only calls thing.close() on exit — it does not delegate to thing.__exit__. For the 3 read-only callsites this is a non-issue (no transaction to commit). For delete_cli_session, the PR correctly retains an explicit conn.commit() (api/models.py:1150). A future contributor adding a write op inside one of these blocks expecting auto-commit would silently get a no-op. Could be tightened with the layered-context idiom (with closing(sqlite3.connect(...)) as conn: with conn: ...) but the current shape is correct for what's there now. Worth a one-line code comment near the closing() wrap noting "auto-commit semantics are dropped — add explicit conn.commit() if you need it." Not blocking.

  • The 4 leak sites are all read-mostly. The agent's own write-heavy SessionDB layer was already correctly closed via #1421's LRU eviction fix — this PR is the missing piece for the raw sqlite3.connect() callers that didn't go through SessionDB.

  • PR description test count off by ~53 (3870 claimed vs 3817 local). Counting drift consistent with prior batches; not blocking.

  • Reporter attribution is well done@insecurejezza did the production diagnosis (lsof analysis, source audit, fix shape, verification stress loop) and gets a Co-authored-by trailer. Good external-contributor credit.

Recommendation

Approved. Textbook FD-leak fix: correct diagnosis (verified independently via behavioural harness), correct fix at all 4 production callsites (verified via exhaustive grep), regression tests prove the fix and are confirmed to catch the bug when reverted. Cross-tool clean. No security implications. Reporter's stress-loop verification (FD count flat at 92 across 100 requests vs. monotonic growth pre-fix) is the kind of evidence reviews want.

This unblocks the persistent-host wedge that #1483 surfaced after fixing the supervisor double-fork. Bug #3 of #1458 (HTTP-unhealthy without FD exhaustion) still tracked separately — correct scope discipline.

Parked at approval — ready for the release agent's merge/tag pipeline.

@nesquena-hermes nesquena-hermes closed this pull request by merging all changes into master in 6c3ff3f May 3, 2026
@nesquena-hermes nesquena-hermes deleted the fix/state-db-fd-leak-1494 branch May 3, 2026 01:43
pull Bot pushed a commit to soitun/hermes-webui that referenced this pull request May 3, 2026
pull Bot pushed a commit to soitun/hermes-webui that referenced this pull request May 3, 2026
…ix + P0 polish bundle (nesquena#1466 nesquena#1494 nesquena#1469 nesquena#1484 nesquena#1486)

3 PRs in this batch (3866 → 3874 tests, +8):

- nesquena#1493 (@dso2ng) — sidebar Stop response cancels row's stream not active pane's (closes nesquena#1466, follow-up to nesquena#1480)
- nesquena#1495 (self-built; reported by @insecurejezza in nesquena#1494) — state.db connection FD leak in sidebar polling (closes nesquena#1494, addresses Bug #2 of nesquena#1458)
- nesquena#1492 (@bergeouss) — P0 bugfixes bundle: tool-card args readability + CLI rename persistence + scroll pinning + sw.js relative-path regression test (closes nesquena#1469 nesquena#1484 nesquena#1486)

This release closes Bug #2 of the umbrella issue nesquena#1458. Bug #1 was closed by v0.50.269 (nesquena#1483) + v0.50.270 (nesquena#1487). Bug #3 (HTTP-unhealthy without FD exhaustion) is the remaining work item.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: state.db FD leak in sidebar session polling can wedge WebUI HTTP handling

2 participants