fix(sqlite): close state.db connections explicitly to stop FD leak in sidebar polling (#1494)#1495
Conversation
… sidebar polling (#1494) Production WebUI on macOS launchd reproduced an HTTP-unhealthy wedge after #1483 closed the bootstrap supervisor double-fork: process alive, port listening, every HTTP request reset by peer before a response. The reporter (@insecurejezza) traced it to FD exhaustion — 366 open FDs on the wedged process, 238 of them `~/.hermes/state.db`, `state.db-wal`, and `state.db-shm`. Root cause: four sqlite callsites use `with sqlite3.connect(...) as conn:`. Python's sqlite3 connection context manager only commits or rolls back on exit; it does NOT close the connection. `/api/sessions` polling calls these on every sidebar refresh, so each poll leaked one or more open state.db FDs until the process hit macOS's soft FD limit and new sqlite3.connect() calls inside fresh request handlers raised before any response bytes were written. Fix: wrap each `sqlite3.connect(...)` in `contextlib.closing(...)` so the connection is explicitly closed on scope exit, in addition to the auto- commit / rollback semantics that `Connection.__exit__` already provides. Callsites patched: - api/agent_sessions.py:read_importable_agent_session_rows - api/agent_sessions.py:read_session_lineage_metadata - api/models.py:get_cli_session_messages - api/models.py:delete_cli_session Reporter's verification (post-patch, 100-request stress loop against /api/sessions and /api/projects): batch=1 fd=92 state_handles=0 batch=2 fd=92 state_handles=0 ... batch=5 fd=92 state_handles=0 Pre-patch the same loop made FD count and state.db handle count climb monotonically. 4 regression tests in tests/test_issue1494_state_db_fd_leak.py monkeypatch sqlite3.connect with a tracking wrapper that records .close() calls and assert every connection opened by each of the four functions is explicitly closed. Verified to fail (catching the original bug) when the closing() wrap is reverted: "leaked 5 of 5 sqlite connection(s) — context-manager- only `with sqlite3.connect()` does not close. Wrap in contextlib.closing()." This addresses Bug #2 of the umbrella issue #1458. Bug #3 (HTTP-unhealthy wedge in the absence of FD exhaustion) remains open pending separate diagnostic data — explicit scope discipline. Closes #1494 Refs #1458 (Bug #2 of 3) Co-authored-by: insecurejezza <[email protected]>
nesquena
left a comment
There was a problem hiding this comment.
Review — end-to-end ✅ (clean approve, behavioural harness confirms bug + fix)
Production fix for #1494 / Bug #2 of #1458. Wraps four with sqlite3.connect(...) as conn: callsites in contextlib.closing(...) because Python's sqlite3 connection context manager only commits/rolls back on exit — it does not close the connection. /api/sessions polling on a long-running webui leaked one or more state.db* FDs per poll until the process hit the macOS 256-FD soft limit and new request handlers RST'd before producing bytes.
What this ships
- api/agent_sessions.py:238 —
read_importable_agent_session_rows(sidebar agent-session listing) - api/agent_sessions.py:310 —
read_session_lineage_metadata(sidebar lineage collapse, runs every poll) - api/models.py:1105 —
get_cli_session_messages(CLI session detail) - api/models.py:1146 —
delete_cli_session(with explicitconn.commit()retained — see Minor observations) - tests/test_issue1494_state_db_fd_leak.py — 234 LOC, 4 tests using a
_TrackingConnwrapper that monkeypatchessqlite3.connectand asserts every connection opened by each function is.close()'d.
Traced against upstream hermes-agent
Webui-only change. The CLI / agent has its own hermes_state.SessionDB layer with proper connection lifecycle — verified state.db write paths in the agent tarball don't share these four functions. The SessionDB cache fix from #1421 is orthogonal (covered cached-connection LRU eviction, not raw sqlite3.connect() callers). No cross-tool surface. ✓
Behavioural harness — bug and fix verified
$ python3 -c "..." # see review for full script
BUG: connection still open after with block (this is the bug)
FIX OK: connection closed after closing(): Cannot operate on a closed database.
Confirms:
with sqlite3.connect(p) as c:does NOT close — the connection remains usable after the block (FD still open).with closing(sqlite3.connect(p)) as c:DOES close —cursor()raisesProgrammingError: Cannot operate on a closed database.after the block.
Audit — exhaustive sweep of production callsites
$ grep -rn "sqlite3.connect" api/ --include="*.py"
api/agent_sessions.py:238: with closing(sqlite3.connect(str(db_path))) as conn:
api/agent_sessions.py:310: with closing(sqlite3.connect(str(db_path))) as conn:
api/models.py:1105: with closing(sqlite3.connect(str(db_path))) as conn:
api/models.py:1146: with closing(sqlite3.connect(str(db_path))) as conn:All 4 production callsites in api/ are wrapped. No other sqlite3.connect() invocations exist outside the test directory. The PR description's audit claim ("re-ran the audit and confirmed there are no other production callsites of the leaking pattern") is verified.
End-to-end trace
Pre-fix request flow on a long-running webui:
- Browser polls
/api/sessionsevery few seconds. - Each call invokes
read_importable_agent_session_rows(api/agent_sessions.py:235) andread_session_lineage_metadata(api/agent_sessions.py:307). - Each function used
with sqlite3.connect(str(db_path)) as conn:—__exit__commits, never closes. - Connection's underlying FDs (
state.db,state.db-wal,state.db-shm) stay open. - After ~80 polls (3 FDs × ~80 calls = ~240 + base ~25 = ~256), the process hits macOS soft FD limit.
- Next request handler thread tries
sqlite3.connect()→OSError: [Errno 24] Too many open files. - Handler returns without writing bytes;
ThreadingHTTPServercloses the socket; kernel RSTs the client. - Process stays alive (accept loop is fine), port stays in
LISTEN, every request RSTs. The exact symptoms reported.
Post-fix: every connection is closed on scope exit; FD count flat across stress tests (reporter measured 92 across 100 requests).
Edge-case trace
| Scenario | Pre-fix | Post-fix |
|---|---|---|
| 100 sidebar polls on idle webui | FD count climbs by 1-3 per poll | flat at base ✅ |
| Process hits 256 FD limit | requests RST, server appears dead | doesn't reach limit ✅ |
read_* exits via exception |
connection leaked | closing() still closes ✅ |
delete_cli_session with explicit commit |
commit + connection leaked | commit + close ✅ |
delete_cli_session exception before commit |
connection leaked, partial state | close, no half-write ✅ |
Concurrent delete_cli_session from two threads |
each leaks | each closes; no shared state ✅ |
| Read-only path (3 callsites) — no auto-commit needed | works | works ✅ |
Security audit
- ✅ No XSS, no SSRF, no path traversal — pure sqlite connection lifecycle fix.
- ✅ No new endpoints; no auth changes; no
config.yamlwrites. - ✅
closing()is from stdlibcontextlib, well-tested. - ✅ The
_TrackingConnwrapper in tests is test-only (undertests/) and is not production code.
Tests
tests/test_issue1494_state_db_fd_leak.py— 4/4 pass:test_read_importable_agent_session_rows_closes_connection— calls 5×, asserts 5/5 closedtest_read_session_lineage_metadata_closes_connection— calls 5×, asserts 5/5 closedtest_get_cli_session_messages_closes_connection— calls 5×, asserts 2 messages returned + all closed (sanity-checks the real query path is exercised, not just open/close)test_delete_cli_session_closes_connection— first call deletes (rowcount > 0), second is idempotent no-op; verifies commit semantics survived via a separatesqlite3.Connectionthat re-queries the table afterward
- PR description claims the test catches the bug. Verified by reverting the
closing()wrap onread_importable_agent_session_rows— produces:AssertionError: read_importable_agent_session_rows leaked 5 of 5 sqlite connection(s) — context-manager-only `with sqlite3.connect()` does not close. Wrap in contextlib.closing(). See #1494. - Full suite: 3817 passed, 55 skipped, 3 xpassed, 0 failed in 18.64s on the PR branch.
- CI: 3.11 / 3.12 / 3.13 all green.
Race / state analysis
- Concurrent calls to the patched functions (e.g. two sidebar polls hitting
/api/sessionsnear-simultaneously): each opens its own connection, runs SELECTs, closes. No shared connection state between threads. ✓ delete_cli_sessiontwo-call race: explicitconn.commit()before return; second call sees the deleted state on its own connection. SQLite WAL mode handles this. ✓- Exception during SELECT inside
with closing(...):closing.__exit__callsclose()regardless of exception status. FD released. ✓
Minor observations (non-blocking)
-
closing()skips the original__exit__'s auto-commit/rollback semantics.contextlib.closing(thing)only callsthing.close()on exit — it does not delegate tothing.__exit__. For the 3 read-only callsites this is a non-issue (no transaction to commit). Fordelete_cli_session, the PR correctly retains an explicitconn.commit()(api/models.py:1150). A future contributor adding a write op inside one of these blocks expecting auto-commit would silently get a no-op. Could be tightened with the layered-context idiom (with closing(sqlite3.connect(...)) as conn: with conn: ...) but the current shape is correct for what's there now. Worth a one-line code comment near the closing() wrap noting "auto-commit semantics are dropped — add explicit conn.commit() if you need it." Not blocking. -
The 4 leak sites are all read-mostly. The agent's own write-heavy
SessionDBlayer was already correctly closed via #1421's LRU eviction fix — this PR is the missing piece for the rawsqlite3.connect()callers that didn't go throughSessionDB. -
PR description test count off by ~53 (3870 claimed vs 3817 local). Counting drift consistent with prior batches; not blocking.
-
Reporter attribution is well done — @insecurejezza did the production diagnosis (lsof analysis, source audit, fix shape, verification stress loop) and gets a
Co-authored-bytrailer. Good external-contributor credit.
Recommendation
✅ Approved. Textbook FD-leak fix: correct diagnosis (verified independently via behavioural harness), correct fix at all 4 production callsites (verified via exhaustive grep), regression tests prove the fix and are confirmed to catch the bug when reverted. Cross-tool clean. No security implications. Reporter's stress-loop verification (FD count flat at 92 across 100 requests vs. monotonic growth pre-fix) is the kind of evidence reviews want.
This unblocks the persistent-host wedge that #1483 surfaced after fixing the supervisor double-fork. Bug #3 of #1458 (HTTP-unhealthy without FD exhaustion) still tracked separately — correct scope discipline.
Parked at approval — ready for the release agent's merge/tag pipeline.
…ix + P0 polish bundle (nesquena#1466 nesquena#1494 nesquena#1469 nesquena#1484 nesquena#1486) 3 PRs in this batch (3866 → 3874 tests, +8): - nesquena#1493 (@dso2ng) — sidebar Stop response cancels row's stream not active pane's (closes nesquena#1466, follow-up to nesquena#1480) - nesquena#1495 (self-built; reported by @insecurejezza in nesquena#1494) — state.db connection FD leak in sidebar polling (closes nesquena#1494, addresses Bug #2 of nesquena#1458) - nesquena#1492 (@bergeouss) — P0 bugfixes bundle: tool-card args readability + CLI rename persistence + scroll pinning + sw.js relative-path regression test (closes nesquena#1469 nesquena#1484 nesquena#1486) This release closes Bug #2 of the umbrella issue nesquena#1458. Bug #1 was closed by v0.50.269 (nesquena#1483) + v0.50.270 (nesquena#1487). Bug #3 (HTTP-unhealthy without FD exhaustion) is the remaining work item.
Summary
Production fix for Bug #2 of #1458, reported with full diagnostic + verified fix shape in #1494 by @insecurejezza.
What was happening
Persistent WebUI on macOS launchd became process-alive, port-listening, HTTP-unhealthy after the bootstrap supervisor fix in #1483. Every request reset before a response:
The reporter's
lsofshowed FD exhaustion driven entirely bystate.dbhandles:After macOS's 256 default soft FD limit, every fresh request handler that called
sqlite3.connect()raised before producing any bytes — handler returned,ThreadingHTTPServerclosed the conn, kernel RST'd the client. Process stayed alive, accept loop stayed healthy, port stayed inLISTEN.Root cause
Four sqlite callsites use Python's
with sqlite3.connect(...) as conn:pattern.sqlite3.Connection.__exit__only commits or rolls back. It does not close the connection. From the Python docs:/api/sessionspolling calls these on every sidebar refresh, so each poll leaks one or more openstate.db/state.db-wal/state.db-shmFDs.Files patched
Applied at:
api/agent_sessions.py:read_importable_agent_session_rows— sidebar agent-session listingapi/agent_sessions.py:read_session_lineage_metadata— sidebar lineage collapse (called on every poll)api/models.py:get_cli_session_messages— CLI session detail viewapi/models.py:delete_cli_session— CLI session deletion (explicitconn.commit()retained)These four callsites match exactly the audit @insecurejezza did against
origin/masterat7fddc33. I re-ran the audit (rg 'sqlite3\.connect' --glob '!tests/') and confirmed there are no other production callsites of the leaking pattern.The pre-existing
SessionDBcache fix from #1421 is orthogonal — it covered cached connection reassignment and LRU eviction, not these raw sqlite uses.Verification
Reporter's stress loop (post-fix)
Pre-fix the same loop made both numbers climb monotonically.
Regression tests
tests/test_issue1494_state_db_fd_leak.py— 4 tests, one per callsite. Monkeypatchessqlite3.connectwith a_TrackingConnwrapper that records.close()calls, then asserts every connection opened by the function under test is explicitly closed.The tests follow the same lightweight-tracking-conn pattern already used in
test_pr1370_lineage_metadata_perf_and_orphan.py, kept inline so the regression survives refactors there.I also confirmed the tests catch the bug — reverting just the
closing()wrap inread_importable_agent_session_rowsproduces:Full suite
(3866 → 3870, the 4 new regression tests.)
Scope discipline — what this PR does NOT do
#1458 catalogs three failure modes. This PR fixes Bug #2 only.
/health?deep=1probe, accept-loop heartbeat counter) are useful but speculative without a confirmed wedge mode they would have caught. Not bundling.Commit message uses
Refs #1458 (Bug #2 of 3)rather thanCloses #1458so the umbrella stays open until Bug #3 lands.Attribution
@insecurejezza did all the production diagnosis (wedge reproduction,
lsofanalysis, source audit, exact callsite list, fix shape, verification stress loop). I just turned the verified fix into a PR with regression tests.Co-authored-bytrailer on the commit.Closes #1494
Refs #1458 (Bug #2 of 3)