bug: state.db FD leak in sidebar session polling can wedge WebUI HTTP handling #1494

@insecurejezza

Description

Follow-up to #1458. That issue was closed by the bootstrap supervisor fix in #1483, but the production failure reproduced again with that fix in place: WebUI process alive, port listening, HTTP requests reset, and a large number of leaked state.db file descriptors.

Summary

Persistent WebUI on macOS launchd can become process-alive and port-listening while all HTTP requests fail with Connection reset by peer. In the observed case, the server process had hundreds of open file descriptors, dominated by repeated ~/.hermes/state.db, state.db-wal, and state.db-shm handles.

I found one concrete remaining leak path in the WebUI's raw sqlite usage. Several callsites use with sqlite3.connect(...) as conn:, but Python's sqlite3 connection context manager only commits or rolls back the active transaction; it does not close the connection. In a long-running process that serves sidebar polling, every such call leaks a file descriptor.
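
For reference, a minimal standalone sketch of that behavior (the /tmp path is just an illustrative throwaway database): the connection is still usable after the with block exits, which is only possible if it was never closed.

import sqlite3

with sqlite3.connect("/tmp/leak_demo.db") as conn:
    conn.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")

# The with block committed the transaction but did not close the connection;
# a closed connection would raise sqlite3.ProgrammingError here.
conn.execute("SELECT 1")
conn.close()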

Environment

  • macOS, Mac mini
  • launchd service
  • WebUI bound to 0.0.0.0:8787
  • Persistent, always-on WebUI
  • Local repo before patch was v0.50.266-dirty, remote checked against origin/master at 7fddc33

Observed failure

curl -sv http://127.0.0.1:8787/health
* Connected to 127.0.0.1 (127.0.0.1) port 8787
> GET /health HTTP/1.1
* Recv failure: Connection reset by peer

Launchd and socket state at the same time:

launchctl print gui/501/ai.hermes.webui
  state = running
  program = /Users/jeremygavin/.hermes/hermes-webui/start.sh
  runs = 2
  pid = 22461

lsof -nP -iTCP:8787 -sTCP:LISTEN
  Python 22468 ... TCP *:8787 (LISTEN)

FD state:

server pid: 22468
lsof -p 22468 | wc -l                  -> 366
lsof -p 22468 | grep state.db | wc -l  -> 238

Source audit

Current origin/master still has raw sqlite context-manager use in at least these files:

# api/agent_sessions.py
with sqlite3.connect(str(db_path)) as conn:
    ...

# api/models.py
with sqlite3.connect(str(db_path)) as conn:
    ...

Specific callsites patched locally:

  • api/agent_sessions.py:read_importable_agent_session_rows()
  • api/agent_sessions.py:read_session_lineage_metadata()
  • api/models.py:get_cli_session_messages()
  • api/models.py:delete_cli_session()

Local fix shape

import sqlite3
from contextlib import closing

with closing(sqlite3.connect(str(db_path))) as conn:
    conn.row_factory = sqlite3.Row
    ...
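
One note on the fix shape (an assumption about the write paths, not something this patch verified): closing(...) only guarantees close(); it does not replace the commit-on-success behavior of the bare connection context manager. Write callsites such as delete_cli_session() can keep that behavior by nesting both managers; session_id and the SQL below are illustrative only.

import sqlite3
from contextlib import closing

with closing(sqlite3.connect(str(db_path))) as conn:
    with conn:  # commit on success, rollback on exception
        # Illustrative statement; the real table/columns live in state.db.
        conn.execute("DELETE FROM sessions WHERE id = ?", (session_id,))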

Verification after local patch

After patching those sqlite callsites, cleanly restarting launchd, and checkpointing the WAL:

GET /health on 127.0.0.1:8787      -> 200 {"status": "ok"}
GET /health on 100.78.69.27:8787   -> 200 {"status": "ok"}
GET /api/sessions                  -> 200
GET /api/projects                  -> 200
Browser loaded root page, title: Hermes
Browser console errors: 0

Stress loop against session-list polling paths:

PID=22468  # WebUI server pid (see lsof output above)
for batch in 1 2 3 4 5; do
  for i in $(seq 1 20); do
    curl -fsS http://127.0.0.1:8787/api/sessions >/dev/null
    curl -fsS http://127.0.0.1:8787/api/projects >/dev/null
  done
  lsof -p "$PID" | wc -l
  lsof -p "$PID" | grep -c state.db
done

After patch:

batch=1 fd=92 state_handles=0
batch=2 fd=92 state_handles=0
batch=3 fd=92 state_handles=0
batch=4 fd=92 state_handles=0
batch=5 fd=92 state_handles=0

Before patch, repeated sidebar/session polling increased FD count and state.db handles steadily.

Expected behavior

Repeated /api/sessions polling should not increase process FD count or keep new state.db handles open.

Suggested fix

  1. Wrap every raw sqlite3.connect() in WebUI long-lived process code with contextlib.closing(...), unless the connection is explicitly closed elsewhere.
  2. Add a regression test around the session-list codepath that catches connection leaks. If FD count is hard to assert portably, monkeypatch sqlite3.connect() with a close-tracking fake and assert all connections are closed (see the sketch after this list).
  3. Consider adding a deep health endpoint that exercises /api/sessions or the session-store path, since /health alone can miss the main UI becoming unusable.
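
A hedged sketch of what the test in point 2 could look like, assuming pytest and that the module is importable as api.agent_sessions; CloseTrackingConnection and the schema-seeding step are illustrative, not existing fixtures.

import sqlite3
import api.agent_sessions as agent_sessions

class CloseTrackingConnection:
    # Minimal fake: wraps a real sqlite3 connection and records close() calls.
    def __init__(self, real):
        object.__setattr__(self, "_real", real)
        object.__setattr__(self, "closed", False)

    def close(self):
        object.__setattr__(self, "closed", True)
        self._real.close()

    def __getattr__(self, name):
        return getattr(self._real, name)

    def __setattr__(self, name, value):
        # Forward attribute writes (e.g. row_factory) to the real connection.
        setattr(self._real, name, value)

    def __enter__(self):
        self._real.__enter__()
        return self

    def __exit__(self, *exc):
        return self._real.__exit__(*exc)

def test_session_list_closes_connections(monkeypatch, tmp_path):
    opened = []
    real_connect = sqlite3.connect

    def tracking_connect(*args, **kwargs):
        conn = CloseTrackingConnection(real_connect(*args, **kwargs))
        opened.append(conn)
        return conn

    monkeypatch.setattr(sqlite3, "connect", tracking_connect)

    db_path = tmp_path / "state.db"
    # Illustrative call; a real test would seed db_path with the WebUI schema first.
    try:
        agent_sessions.read_importable_agent_session_rows(db_path)
    except Exception:
        pass  # the leak assertion below is meaningful even if the query itself fails

    assert opened, "session-list codepath should open at least one connection"
    assert all(c.closed for c in opened)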

Relationship to #1458

This is one of the unresolved #1458 failure modes. #1483 appears to address the bootstrap supervisor/double-fork bug, but not this sqlite FD leak. The user-visible symptom can be the same: launchd says the service is running, the port is listening, but the WebUI is broken.
