Skip to content

fix: close previous SessionDB before replacing on cached agent#1421

Merged
1 commit merged intonesquena:masterfrom
wali-reheman:fix/sessiondb-wal-handle-leak
May 1, 2026
Merged

fix: close previous SessionDB before replacing on cached agent#1421
1 commit merged intonesquena:masterfrom
wali-reheman:fix/sessiondb-wal-handle-leak

Conversation

@wali-reheman
Copy link
Copy Markdown

Problem

SessionDB WAL handles leak when _run_agent_streaming creates a new SessionDB instance per request and replaces the cached agent's _session_db without closing the old one.

Each orphaned SessionDB connection holds 2 FDs (state.db + state.db-wal). After ~73 messages, leaked handles exhaust the 256 FD default limit causing EMFILE crashes.

Root cause

In api/streaming.py, when reusing a cached agent:

agent._session_db = _session_db  # old _session_db never closed

Fix

Close the previous _session_db before replacing it on cached agents:

if hasattr(agent, '_session_db') and agent._session_db is not None:
    try:
        agent._session_db.close()
    except Exception:
        pass
agent._session_db = _session_db

Testing

  • Fresh server after fix: 2 DB handles (.db + .db-wal + .db-shm) vs 73 handles before
  • EMFILE crash at 16:28 ET reproduced the 73-handle leak exactly
  • Server has been stable since restart

Closes #streaming FD leak.

SessionDB WAL handles leak when streaming.py creates a new SessionDB
instance per request and replaces the cached agent's _session_db without
closing the old one. Each orphaned connection holds 2 FDs (.db +
.db-wal), causing FD exhaustion and EMFILE crashes after ~73 messages.

Fix: close the previous _session_db before replacing it on cached
agents, mirroring the close-before-replace pattern used elsewhere in the
codebase.
@wali-reheman wali-reheman reopened this May 1, 2026
@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Thanks @wali-reheman — this is a good catch.

Confirmed the leak: in api/streaming.py ~L1885, the cached-agent reuse path overwrites agent._session_db without closing the previous instance. Each SessionDB.__init__ opens a SQLite connection that holds three FDs once WAL kicks in (state.db, state.db-wal, state.db-shm), and they're never released because nothing else holds a reference to the old object once agent._session_db is rebound. After enough chat turns on a long-lived agent, EMFILE is the predictable end state.

Your fix shape matches what SessionDB.close() is designed for — it grabs the internal lock, runs PRAGMA wal_checkpoint(PASSIVE), then closes the connection and nulls _conn. Wrapping it in try/except Exception: pass is right: a busted WAL checkpoint shouldn't block the new session from being installed.

Two things I'll verify on my side before merging:

  1. No double-close races. Want to make sure no other code path (cancel handler, finally block in the streaming generator, agent destructor) closes the same _session_db after we've already swapped it. SessionDB.close() is idempotent (if self._conn:), so worst case is benign, but I'd like to confirm the swap point is the only owner.
  2. FD-count regression test. Going to add a small test that drives N streaming requests through the cached-agent reuse path and asserts the open-FD count for state.db* stays bounded.

Will run the full pytest suite (3604 tests on master) plus the browser harness today and aim to merge in the next batch release.

Repro detail you mentioned — 73 handles at the EMFILE crash matching exactly the message count — is solid evidence and worth pinning in the test. Thanks again.

@nesquena-hermes nesquena-hermes closed this pull request by merging all changes into nesquena:master in c0d50b3 May 1, 2026
pull Bot pushed a commit to JamesWilliam1977/hermes-webui that referenced this pull request May 1, 2026
pull Bot pushed a commit to JamesWilliam1977/hermes-webui that referenced this pull request May 1, 2026
…tion + CHANGELOG + 5 regression tests

PR nesquena#1421 (SessionDB WAL handle leak fix on cached-agent reuse path) had a
sibling leak at the LRU eviction site that I caught during pre-review:

api/streaming.py SESSION_AGENT_CACHE.popitem(last=False) was discarding
the evicted entry with `evicted_sid, _ = ...`. The agent's _session_db
was dropped on the floor and only released when GC eventually finalized
the agent — which on a long-running server may be never (cyclic refs,
extension types holding C handles, etc.).

Same fix shape as nesquena#1421: capture the evicted entry, call
_evicted_agent._session_db.close() explicitly. SessionDB.close() is
idempotent + thread-safe (with self._lock: if self._conn:), so the
double-close-is-benign property still holds.

5 regression tests in test_v050259_sessiondb_fd_leak.py:
- Source-level: cached-agent reuse path closes before replace
- Source-level: LRU eviction path captures + closes evicted agent
- Behavioral: SessionDB.close() is idempotent (3 calls safe)
- Behavioral: cached-agent reuse with mock — close called exactly once
- Behavioral: LRU eviction with mock — only evicted agent's DB closes

Full suite: 3615 passed, 0 failed.

Nathan explicitly authorized 'just go ahead and merge it as a small release'
since the PR is 9 LOC, focused, has Opus pre-release follow-up + tests, and
matches the empirically-confirmed leak shape (73-handle leak at EMFILE).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants