fix(sqlite): close state.db connections explicitly to stop FD leak in sidebar polling (#1494) by nesquena-hermes · Pull Request #1495 · nesquena/hermes-webui

nesquena-hermes · 2026-05-03T01:16:13Z

Summary

Production fix for Bug #2 of #1458, reported with full diagnostic + verified fix shape in #1494 by @insecurejezza.

What was happening

Persistent WebUI on macOS launchd became process-alive, port-listening, HTTP-unhealthy after the bootstrap supervisor fix in #1483. Every request reset before a response:

$ curl -sv http://127.0.0.1:8787/health
* Connected to 127.0.0.1 (127.0.0.1) port 8787
> GET /health HTTP/1.1
* Recv failure: Connection reset by peer

The reporter's lsof showed FD exhaustion driven entirely by state.db handles:

server pid: 22468
lsof -p 22468 | wc -l                  -> 366
lsof -p 22468 | grep state.db | wc -l  -> 238

After macOS's 256 default soft FD limit, every fresh request handler that called sqlite3.connect() raised before producing any bytes — handler returned, ThreadingHTTPServer closed the conn, kernel RST'd the client. Process stayed alive, accept loop stayed healthy, port stayed in LISTEN.

Root cause

Four sqlite callsites use Python's with sqlite3.connect(...) as conn: pattern. sqlite3.Connection.__exit__ only commits or rolls back. It does not close the connection. From the Python docs:

A Connection object can be used as a context manager that automatically commits or rolls back open transactions when leaving the body of the context manager. […] The context manager neither implicitly opens a new transaction nor closes the connection.

/api/sessions polling calls these on every sidebar refresh, so each poll leaks one or more open state.db / state.db-wal / state.db-shm FDs.

Files patched

from contextlib import closing

with closing(sqlite3.connect(str(db_path))) as conn:
    conn.row_factory = sqlite3.Row
    ...

Applied at:

api/agent_sessions.py:read_importable_agent_session_rows — sidebar agent-session listing
api/agent_sessions.py:read_session_lineage_metadata — sidebar lineage collapse (called on every poll)
api/models.py:get_cli_session_messages — CLI session detail view
api/models.py:delete_cli_session — CLI session deletion (explicit conn.commit() retained)

These four callsites match exactly the audit @insecurejezza did against origin/master at 7fddc33. I re-ran the audit (rg 'sqlite3\.connect' --glob '!tests/') and confirmed there are no other production callsites of the leaking pattern.

The pre-existing SessionDB cache fix from #1421 is orthogonal — it covered cached connection reassignment and LRU eviction, not these raw sqlite uses.

Verification

Reporter's stress loop (post-fix)

for batch in 1 2 3 4 5; do
  for i in $(seq 1 20); do
    curl -fsS http://127.0.0.1:8787/api/sessions >/dev/null
    curl -fsS http://127.0.0.1:8787/api/projects >/dev/null
  done
  lsof -p "$PID" | wc -l
  lsof -p "$PID" | grep -c state.db
done

batch=1 fd=92 state_handles=0
batch=2 fd=92 state_handles=0
batch=3 fd=92 state_handles=0
batch=4 fd=92 state_handles=0
batch=5 fd=92 state_handles=0

Pre-fix the same loop made both numbers climb monotonically.

Regression tests

tests/test_issue1494_state_db_fd_leak.py — 4 tests, one per callsite. Monkeypatches sqlite3.connect with a _TrackingConn wrapper that records .close() calls, then asserts every connection opened by the function under test is explicitly closed.

The tests follow the same lightweight-tracking-conn pattern already used in test_pr1370_lineage_metadata_perf_and_orphan.py, kept inline so the regression survives refactors there.

$ pytest tests/test_issue1494_state_db_fd_leak.py -v
tests/test_issue1494_state_db_fd_leak.py::test_read_importable_agent_session_rows_closes_connection PASSED
tests/test_issue1494_state_db_fd_leak.py::test_read_session_lineage_metadata_closes_connection PASSED
tests/test_issue1494_state_db_fd_leak.py::test_get_cli_session_messages_closes_connection PASSED
tests/test_issue1494_state_db_fd_leak.py::test_delete_cli_session_closes_connection PASSED
============================== 4 passed in 1.94s ===============================

I also confirmed the tests catch the bug — reverting just the closing() wrap in read_importable_agent_session_rows produces:

AssertionError: read_importable_agent_session_rows leaked 5 of 5 sqlite connection(s)
— context-manager-only `with sqlite3.connect()` does not close.
Wrap in contextlib.closing(). See #1494.

Full suite

$ pytest tests/ -q
3870 passed, 2 skipped, 3 xpassed, 8 subtests passed in 87.03s

(3866 → 3870, the 4 new regression tests.)

Scope discipline — what this PR does NOT do

#1458 catalogs three failure modes. This PR fixes Bug #2 only.

Bug Portability #1 (bootstrap supervisor double-fork) — already fixed by v0.50.269 — Bootstrap supervisor fix (#1478) + #1473 follow-ups (#1479, #1480) #1483 + v0.50.270 — Bootstrap launcher import validation (#1315) + Opus follow-up #1487.
Bug Hermes Web UI — Sprints 11-14: multi-provider models, settings, sessi… #2 (state.db FD leak) — this PR.
Bug fix(frontend): use URL origin for fetch/EventSource to support revers… #3 (HTTP-unhealthy wedge in the absence of FD exhaustion) — still open. Diagnosis-in-progress in the bug(stability): persistent-host crashes — bootstrap.py double-fork breaks launchd + state.db FD leak #1458 thread; deferred until a concrete root cause lands. The two suggestions from that thread (deep /health?deep=1 probe, accept-loop heartbeat counter) are useful but speculative without a confirmed wedge mode they would have caught. Not bundling.

Commit message uses Refs #1458 (Bug #2 of 3) rather than Closes #1458 so the umbrella stays open until Bug #3 lands.

Attribution

@insecurejezza did all the production diagnosis (wedge reproduction, lsof analysis, source audit, exact callsite list, fix shape, verification stress loop). I just turned the verified fix into a PR with regression tests. Co-authored-by trailer on the commit.

Closes #1494
Refs #1458 (Bug #2 of 3)

@insecurejezza

… sidebar polling (#1494) Production WebUI on macOS launchd reproduced an HTTP-unhealthy wedge after #1483 closed the bootstrap supervisor double-fork: process alive, port listening, every HTTP request reset by peer before a response. The reporter (@insecurejezza) traced it to FD exhaustion — 366 open FDs on the wedged process, 238 of them `~/.hermes/state.db`, `state.db-wal`, and `state.db-shm`. Root cause: four sqlite callsites use `with sqlite3.connect(...) as conn:`. Python's sqlite3 connection context manager only commits or rolls back on exit; it does NOT close the connection. `/api/sessions` polling calls these on every sidebar refresh, so each poll leaked one or more open state.db FDs until the process hit macOS's soft FD limit and new sqlite3.connect() calls inside fresh request handlers raised before any response bytes were written. Fix: wrap each `sqlite3.connect(...)` in `contextlib.closing(...)` so the connection is explicitly closed on scope exit, in addition to the auto- commit / rollback semantics that `Connection.__exit__` already provides. Callsites patched: - api/agent_sessions.py:read_importable_agent_session_rows - api/agent_sessions.py:read_session_lineage_metadata - api/models.py:get_cli_session_messages - api/models.py:delete_cli_session Reporter's verification (post-patch, 100-request stress loop against /api/sessions and /api/projects): batch=1 fd=92 state_handles=0 batch=2 fd=92 state_handles=0 ... batch=5 fd=92 state_handles=0 Pre-patch the same loop made FD count and state.db handle count climb monotonically. 4 regression tests in tests/test_issue1494_state_db_fd_leak.py monkeypatch sqlite3.connect with a tracking wrapper that records .close() calls and assert every connection opened by each of the four functions is explicitly closed. Verified to fail (catching the original bug) when the closing() wrap is reverted: "leaked 5 of 5 sqlite connection(s) — context-manager- only `with sqlite3.connect()` does not close. Wrap in contextlib.closing()." This addresses Bug #2 of the umbrella issue #1458. Bug #3 (HTTP-unhealthy wedge in the absence of FD exhaustion) remains open pending separate diagnostic data — explicit scope discipline. Closes #1494 Refs #1458 (Bug #2 of 3) Co-authored-by: insecurejezza <[email protected]>

nesquena

Review — end-to-end ✅ (clean approve, behavioural harness confirms bug + fix)

Production fix for #1494 / Bug #2 of #1458. Wraps four with sqlite3.connect(...) as conn: callsites in contextlib.closing(...) because Python's sqlite3 connection context manager only commits/rolls back on exit — it does not close the connection. /api/sessions polling on a long-running webui leaked one or more state.db* FDs per poll until the process hit the macOS 256-FD soft limit and new request handlers RST'd before producing bytes.

What this ships

api/agent_sessions.py:238 — read_importable_agent_session_rows (sidebar agent-session listing)
api/agent_sessions.py:310 — read_session_lineage_metadata (sidebar lineage collapse, runs every poll)
api/models.py:1105 — get_cli_session_messages (CLI session detail)
api/models.py:1146 — delete_cli_session (with explicit conn.commit() retained — see Minor observations)
tests/test_issue1494_state_db_fd_leak.py — 234 LOC, 4 tests using a _TrackingConn wrapper that monkeypatches sqlite3.connect and asserts every connection opened by each function is .close()'d.

Traced against upstream hermes-agent

Webui-only change. The CLI / agent has its own hermes_state.SessionDB layer with proper connection lifecycle — verified state.db write paths in the agent tarball don't share these four functions. The SessionDB cache fix from #1421 is orthogonal (covered cached-connection LRU eviction, not raw sqlite3.connect() callers). No cross-tool surface. ✓

Behavioural harness — bug and fix verified

$ python3 -c "..." # see review for full script
BUG: connection still open after with block (this is the bug)
FIX OK: connection closed after closing(): Cannot operate on a closed database.

Confirms:

with sqlite3.connect(p) as c: does NOT close — the connection remains usable after the block (FD still open).
with closing(sqlite3.connect(p)) as c: DOES close — cursor() raises ProgrammingError: Cannot operate on a closed database. after the block.

Audit — exhaustive sweep of production callsites

$ grep -rn "sqlite3.connect" api/ --include="*.py"
api/agent_sessions.py:238:    with closing(sqlite3.connect(str(db_path))) as conn:
api/agent_sessions.py:310:        with closing(sqlite3.connect(str(db_path))) as conn:
api/models.py:1105:        with closing(sqlite3.connect(str(db_path))) as conn:
api/models.py:1146:        with closing(sqlite3.connect(str(db_path))) as conn:

All 4 production callsites in api/ are wrapped. No other sqlite3.connect() invocations exist outside the test directory. The PR description's audit claim ("re-ran the audit and confirmed there are no other production callsites of the leaking pattern") is verified.

End-to-end trace

Pre-fix request flow on a long-running webui:

Browser polls /api/sessions every few seconds.
Each call invokes read_importable_agent_session_rows (api/agent_sessions.py:235) and read_session_lineage_metadata (api/agent_sessions.py:307).
Each function used with sqlite3.connect(str(db_path)) as conn: — __exit__ commits, never closes.
Connection's underlying FDs (state.db, state.db-wal, state.db-shm) stay open.
After ~80 polls (3 FDs × ~80 calls = ~240 + base ~25 = ~256), the process hits macOS soft FD limit.
Next request handler thread tries sqlite3.connect() → OSError: [Errno 24] Too many open files.
Handler returns without writing bytes; ThreadingHTTPServer closes the socket; kernel RSTs the client.
Process stays alive (accept loop is fine), port stays in LISTEN, every request RSTs. The exact symptoms reported.

Post-fix: every connection is closed on scope exit; FD count flat across stress tests (reporter measured 92 across 100 requests).

Edge-case trace

Scenario	Pre-fix	Post-fix
100 sidebar polls on idle webui	FD count climbs by 1-3 per poll	flat at base ✅
Process hits 256 FD limit	requests RST, server appears dead	doesn't reach limit ✅
`read_*` exits via exception	connection leaked	closing() still closes ✅
`delete_cli_session` with explicit commit	commit + connection leaked	commit + close ✅
`delete_cli_session` exception before commit	connection leaked, partial state	close, no half-write ✅
Concurrent `delete_cli_session` from two threads	each leaks	each closes; no shared state ✅
Read-only path (3 callsites) — no auto-commit needed	works	works ✅

Security audit

✅ No XSS, no SSRF, no path traversal — pure sqlite connection lifecycle fix.
✅ No new endpoints; no auth changes; no config.yaml writes.
✅ closing() is from stdlib contextlib, well-tested.
✅ The _TrackingConn wrapper in tests is test-only (under tests/) and is not production code.

Tests

tests/test_issue1494_state_db_fd_leak.py — 4/4 pass:
- test_read_importable_agent_session_rows_closes_connection — calls 5×, asserts 5/5 closed
- test_read_session_lineage_metadata_closes_connection — calls 5×, asserts 5/5 closed
- test_get_cli_session_messages_closes_connection — calls 5×, asserts 2 messages returned + all closed (sanity-checks the real query path is exercised, not just open/close)
- test_delete_cli_session_closes_connection — first call deletes (rowcount > 0), second is idempotent no-op; verifies commit semantics survived via a separate sqlite3.Connection that re-queries the table afterward

PR description claims the test catches the bug. Verified by reverting the closing() wrap on read_importable_agent_session_rows — produces:

AssertionError: read_importable_agent_session_rows leaked 5 of 5 sqlite connection(s) — context-manager-only `with sqlite3.connect()` does not close. Wrap in contextlib.closing(). See #1494.

Full suite: 3817 passed, 55 skipped, 3 xpassed, 0 failed in 18.64s on the PR branch.
CI: 3.11 / 3.12 / 3.13 all green.

Race / state analysis

Concurrent calls to the patched functions (e.g. two sidebar polls hitting /api/sessions near-simultaneously): each opens its own connection, runs SELECTs, closes. No shared connection state between threads. ✓
delete_cli_session two-call race: explicit conn.commit() before return; second call sees the deleted state on its own connection. SQLite WAL mode handles this. ✓
Exception during SELECT inside with closing(...): closing.__exit__ calls close() regardless of exception status. FD released. ✓

Minor observations (non-blocking)

closing() skips the original __exit__'s auto-commit/rollback semantics. contextlib.closing(thing) only calls thing.close() on exit — it does not delegate to thing.__exit__. For the 3 read-only callsites this is a non-issue (no transaction to commit). For delete_cli_session, the PR correctly retains an explicit conn.commit() (api/models.py:1150). A future contributor adding a write op inside one of these blocks expecting auto-commit would silently get a no-op. Could be tightened with the layered-context idiom (with closing(sqlite3.connect(...)) as conn: with conn: ...) but the current shape is correct for what's there now. Worth a one-line code comment near the closing() wrap noting "auto-commit semantics are dropped — add explicit conn.commit() if you need it." Not blocking.
The 4 leak sites are all read-mostly. The agent's own write-heavy SessionDB layer was already correctly closed via #1421's LRU eviction fix — this PR is the missing piece for the raw sqlite3.connect() callers that didn't go through SessionDB.
PR description test count off by ~53 (3870 claimed vs 3817 local). Counting drift consistent with prior batches; not blocking.
Reporter attribution is well done — @insecurejezza did the production diagnosis (lsof analysis, source audit, fix shape, verification stress loop) and gets a Co-authored-by trailer. Good external-contributor credit.

Recommendation

✅ Approved. Textbook FD-leak fix: correct diagnosis (verified independently via behavioural harness), correct fix at all 4 production callsites (verified via exhaustive grep), regression tests prove the fix and are confirmed to catch the bug when reverted. Cross-tool clean. No security implications. Reporter's stress-loop verification (FD count flat at 92 across 100 requests vs. monotonic growth pre-fix) is the kind of evidence reviews want.

This unblocks the persistent-host wedge that #1483 surfaced after fixing the supervisor double-fork. Bug #3 of #1458 (HTTP-unhealthy without FD exhaustion) still tracked separately — correct scope discipline.

Parked at approval — ready for the release agent's merge/tag pipeline.

…#2 of nesquena#1458)

@dso2ng

…ix + P0 polish bundle (nesquena#1466 nesquena#1494 nesquena#1469 nesquena#1484 nesquena#1486) 3 PRs in this batch (3866 → 3874 tests, +8): - nesquena#1493 (@dso2ng) — sidebar Stop response cancels row's stream not active pane's (closes nesquena#1466, follow-up to nesquena#1480) - nesquena#1495 (self-built; reported by @insecurejezza in nesquena#1494) — state.db connection FD leak in sidebar polling (closes nesquena#1494, addresses Bug #2 of nesquena#1458) - nesquena#1492 (@bergeouss) — P0 bugfixes bundle: tool-card args readability + CLI rename persistence + scroll pinning + sw.js relative-path regression test (closes nesquena#1469 nesquena#1484 nesquena#1486) This release closes Bug #2 of the umbrella issue nesquena#1458. Bug #1 was closed by v0.50.269 (nesquena#1483) + v0.50.270 (nesquena#1487). Bug #3 (HTTP-unhealthy without FD exhaustion) is the remaining work item.

nesquena-hermes mentioned this pull request May 3, 2026

bug: state.db FD leak in sidebar session polling can wedge WebUI HTTP handling #1494

Closed

nesquena approved these changes May 3, 2026

View reviewed changes

nesquena-hermes mentioned this pull request May 3, 2026

Stage 272: 3 PRs — #1493 sidebar cancel + #1495 state.db FD leak fix + #1492 P0 polish bundle (closes #1466 #1469 #1484 #1486 #1494; refs #1458 Bug #2) #1496

Merged

nesquena-hermes closed this pull request by merging all changes into master in 6c3ff3f May 3, 2026

nesquena-hermes mentioned this pull request May 3, 2026

bug(stability): persistent-host crashes — bootstrap.py double-fork breaks launchd + state.db FD leak #1458

Closed

nesquena-hermes deleted the fix/state-db-fd-leak-1494 branch May 3, 2026 01:43

pull Bot pushed a commit to soitun/hermes-webui that referenced this pull request May 3, 2026

Stage 272: PR nesquena#1495 — state.db FD leak fix (nesquena#1494, Bug …

1d41522

…#2 of nesquena#1458)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sqlite): close state.db connections explicitly to stop FD leak in sidebar polling (#1494)#1495

fix(sqlite): close state.db connections explicitly to stop FD leak in sidebar polling (#1494)#1495
1 commit merged intomasterfrom
fix/state-db-fd-leak-1494

nesquena-hermes commented May 3, 2026

Uh oh!

nesquena left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

nesquena-hermes commented May 3, 2026

Summary

What was happening

Root cause

Files patched

Verification

Reporter's stress loop (post-fix)

Regression tests

Full suite

Scope discipline — what this PR does NOT do

Attribution

Uh oh!

nesquena left a comment

Choose a reason for hiding this comment

Review — end-to-end ✅ (clean approve, behavioural harness confirms bug + fix)

What this ships

Traced against upstream hermes-agent

Behavioural harness — bug and fix verified

Audit — exhaustive sweep of production callsites

End-to-end trace

Edge-case trace

Security audit

Tests

Race / state analysis

Minor observations (non-blocking)

Recommendation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants