Skip to content

v0.50.269 — Bootstrap supervisor fix (#1478) + #1473 follow-ups (#1479, #1480)#1483

Merged
nesquena-hermes merged 8 commits intomasterfrom
stage-269
May 2, 2026
Merged

v0.50.269 — Bootstrap supervisor fix (#1478) + #1473 follow-ups (#1479, #1480)#1483
nesquena-hermes merged 8 commits intomasterfrom
stage-269

Conversation

@nesquena-hermes
Copy link
Copy Markdown
Collaborator

v0.50.269 — Bootstrap supervisor fix + 2 v0.50.267 follow-ups

3-PR batch: 1 self-built nesquena APPROVED + 2 contributor follow-ups to #1473 (which shipped in v0.50.267). All green CI, all mergeable, no Opus blockers.

Constituent PRs (all auto-closing on merge)

PR Author Files LOC Summary
#1478 @nesquena-hermes (self-built, nesquena APPROVED) 4 +814/-9 bootstrap.py --foreground mode for process supervisors (closes #1458 Bug #1) + Opus pre-merge MUST-FIX (XPC_SERVICE_NAME noise filter for macOS Terminal) + 2 SHOULD-FIX
#1479 @Thanatos-Z 3 +6/-0 Session.compact() includes pending_user_message — surgical follow-up to #1473
#1480 @dso2ng 3 +106/-21 bfcache pageshow restores active session via loadSession() + checkInflightOnBoot()

Highlights

#1478 fixes a real production crash loop described by the issue reporter as "the agent fixes it eventually." bootstrap.py was double-forking and exiting 0, which broke every process supervisor (launchd, systemd, supervisord, runit, s6) — they saw the parent exit, marked the program "completed," and respawned it while the orphaned server still owned port 8787. New --foreground flag (auto-detected via INVOCATION_ID / JOURNAL_STREAM / NOTIFY_SOCKET / XPC_SERVICE_NAME / etc.) replaces the bootstrap process image with server.py via os.execv so KeepAlive / Restart=always work correctly.

The Opus pre-merge MUST-FIX caught a serious regression that would have shipped without the noise filter: macOS launchd sets XPC_SERVICE_NAME in every Terminal-spawned shell with values like "0" (truthy in Python!) and "application.com.apple.Terminal.<UUID>". Without filtering, every Mac dev running interactive ./start.sh would silently auto-promote to foreground mode, losing the /health probe, browser open, log file redirect, and hanging the shell.

Also includes a runnable docs/supervisor.md reference: launchd plist + systemd .service + supervisord conf + diagnostic lsof/ppid recipe. Bugs #2 (state.db FD leak) and #3 (HTTP-unhealthy wedge) under #1458 remain open awaiting diagnostic data — explicit scope discipline.

#1479 is a 6-line completeness fix following up on v0.50.267 #1473. The frontend filter was already checking both active_stream_id AND pending_user_message to keep mid-restore sessions visible, but only active_stream_id was in the API payload. Not user-visible because the two are set/cleared atomically together in master, but the fix prevents future drift.

#1480 patches a bfcache regression: navigating away from the WebUI and back via browser history left the in-flight pane stale because the v0.50.267 reload-recovery only ran on fresh page loads. The pageshow handler now refreshes through the normal load path.

Test results

Diffstat (excluding tests)

 .gitignore           |   1 +
 api/models.py        |   1 +
 bootstrap.py         | 130 +++++++++-
 docs/supervisor.md   | 237 +++++++++++++++++++  (new)
 static/boot.js       |  13 +-
 static/i18n.js       |   1 +
 6 files changed, ~380 insertions(+), ~10 deletions(-)

Closes

Closes #1478 (self-built), #1479, #1480 on merge. Issue #1458 Bug #1 closes via #1478. Bugs #2 and #3 remain open under #1458 awaiting diagnostic data.

Independent review

Hermes Bot and others added 8 commits May 2, 2026 17:37
…Bug #1)

Issue #1458 reports persistent-host crashes (≥1/day) when running the WebUI
under launchd KeepAlive on macOS. Root cause: `bootstrap.py` calls
`subprocess.Popen([python, "server.py"], start_new_session=True)`, probes
/health, then exits 0. Under any process supervisor (launchd, systemd,
supervisord, runit, s6), the supervisor sees its tracked PID exit, marks
the program as "completed," and respawns it. The new bootstrap fails to
bind port 8787 (orphaned server still has it), exits non-zero, supervisor
respawns again — loop until the orphan crashes for some other reason and
the next respawn finds the port free.

This PR addresses Bug #1 of the three failure modes tracked in #1458:
the `bootstrap.py` double-fork breaking process supervisors. Bug #2
(state.db FD leak) and Bug #3 (HTTP-unhealthy wedge) remain open under
the same issue — they need diagnosis data before a fix can land.

Changes
-------

1. `bootstrap.py`:
   - New `--foreground` argparse flag with help text mentioning launchd /
     systemd / supervisord.
   - New `_detect_supervisor()` that returns the env var name for any
     supervisor it detects: `INVOCATION_ID` / `JOURNAL_STREAM` /
     `NOTIFY_SOCKET` (systemd, s6), `XPC_SERVICE_NAME` (launchd),
     `SUPERVISOR_ENABLED` (supervisord), or `HERMES_WEBUI_FOREGROUND` for
     the explicit user opt-in. Truthy values for the explicit opt-in:
     `1` / `true` / `yes` / `on` (case-insensitive).
   - `main()` branches on `args.foreground or _detect_supervisor()`:
     - **Foreground path:** chdir to `agent_dir or REPO_ROOT`, then
       `os.execv(python, [python, server_path])` to replace the bootstrap
       process image with the server. The supervisor sees the long-lived
       server as the original child. No `wait_for_health` probe — the
       supervisor's KeepAlive / Restart=on-failure handles liveness.
     - **Default path:** unchanged. Spawn server as detached child via
       `Popen + start_new_session=True`, probe /health, return 0. This
       still works for interactive `bash start.sh` invocations.
   - Resolved env vars (HOST/PORT/STATE_DIR/AGENT_DIR) are now mutated on
     `os.environ` directly instead of into a local `env` copy so they
     are inherited across `os.execv`.

2. `docs/supervisor.md` (new): runnable launchd plist, systemd .service,
   and supervisord conf examples + a diagnostic recipe (`lsof` + ppid
   chain) for catching the orphan-loop in production.

3. `.gitignore`: allowlist `docs/supervisor.md` (the directory uses an
   opt-in pattern; matches the existing `!docs/docker.md` precedent).

4. `tests/test_bootstrap_foreground.py` (new): 35 regression tests
   covering the argparse flag, `_detect_supervisor()` behavior across all
   five supervisor env vars, the explicit opt-in's truthy/falsy values,
   and `main()`'s execv-vs-Popen routing decision under each input
   combination. `os.execv` is monkeypatched in the routing tests — we
   pin the structural choice (which call is made, with which args, in
   which cwd, with which env) not the post-exec behavior.

Why this scope and no more
--------------------------

Bug #2 (state.db FD leak) lists 5 candidate paths and asks the reporter
for `lsof -p <pid> | sort | uniq -c | sort -rn | head -20` output to
disambiguate. Until that data lands, any "fix" would be speculative —
explicitly out of scope per the contributor-pickup comment on the issue.

Bug #3 (launchd-running, port-listening, HTTP-unhealthy) was added in
@stefanpieter's reply comment. Diagnosis is in flight; no concrete fix
shape yet. Also out of scope.

Running locally end-to-end verifies the behavior:

```
[bootstrap] Starting Hermes Web UI on http://127.0.0.1:8789 (foreground mode: --foreground)
$ pgrep -af 'server.py'
2997632 /home/.../python /tmp/wt-fix-1458/server.py
$ ps -o ppid -p 2997632
2997581   ← bash that ran bootstrap.py — same PID as the original bootstrap
$ ps -p 2997581 -o cmd
... bootstrap.py ...   ← but exec'd into server.py
```

The same PID that bash forked for `bootstrap.py` is now `server.py`.
A supervisor watching that PID would correctly observe the long-lived
server. No double-fork.

Verification
------------

- 3811 tests pass (`pytest tests/` — full suite, +51 from this PR plus
  master-merge-in)
- All 35 new bootstrap-foreground tests pass
- `bash scripts/run-browser-tests.sh` PASS (HTTP API checks against worktree)
- `bash scripts/webui_qa_agent.sh 8789` PASS (23/23 visual QA)
- Live verified: server starts cleanly under both `--foreground` and
  `HERMES_WEBUI_FOREGROUND=1`; PID lineage confirms no double-fork

Closes #1458 (Bug #1 only). Bugs #2 and #3 remain tracked under the
issue.
Three changes from the pre-merge Opus review:

**MUST-FIX** — XPC_SERVICE_NAME false-positive on macOS Terminal

macOS launchd sets `XPC_SERVICE_NAME` in EVERY Terminal-spawned shell, not
just real services. Typical noise values: `"0"` (truthy in Python!) and
`"application.com.apple.Terminal.<UUID>"`. A bare `os.environ.get(name)`
existence check would auto-promote interactive `./start.sh` runs to
foreground mode on every Mac dev machine — silently breaking the most
common installation path (no /health probe, no browser open, no log file,
hanging shell).

Fix: new `_is_real_supervisor_value()` helper that filters noise. For
`XPC_SERVICE_NAME` specifically, reject `"0"` and any `"application.*"`
prefix. Real launchd plists use reverse-DNS Label form (`com.<rdns>.<svc>`)
which still triggers correctly.

7 new tests in `TestXPCServiceNameNoiseFilter`:
- 4 noise values (`0`, Terminal.app, iTerm2, VSCode) → no detection
- 3 real Label forms → correct detection
- Mixed env with XPC noise + real INVOCATION_ID → falls through to systemd

**SHOULD-FIX 1** — Test env leakage

The original `clean_env` fixture stripped supervisor-detection env vars
but not the resolved bootstrap vars (HERMES_WEBUI_HOST/PORT/AGENT_DIR)
that `main()` mutates onto `os.environ`. After
`test_foreground_exports_resolved_env_vars` ran, later tests would import
bootstrap with polluted defaults (DEFAULT_HOST="0.0.0.0" instead of
"127.0.0.1"). Existing assertions still passed (tautological vs DEFAULT_*),
but it was a footgun for future tests.

Fix: extend `clean_env` to also `delenv` the three resolved vars before
each test.

**SHOULD-FIX 2** — Pre-execv executability guard

If `discover_launcher_python` returns a path that doesn't exist or isn't
executable, `os.execv` raises OSError → wrapper catches → SystemExit(1)
→ supervisor restarts → loop forever. That's exactly the failure mode
this PR is supposed to eliminate.

Fix: `os.access(python_exe, os.X_OK)` check before execv. Converts
infinite supervisor loop into a single visible RuntimeError.

1 new test in `TestForegroundExecutabilityGuard` pinning that the guard
fires before execv when the python path is non-executable.

**Docs** — supervisor.md updates

- New section explaining the XPC_SERVICE_NAME noise filter and what
  values trigger / don't trigger detection
- New section listing supervisors that are NOT auto-detected (runit,
  daemontools, PM2, Foreman/Honcho, custom shell-script supervisors)
  with explicit recommendation to set HERMES_WEBUI_FOREGROUND=1

Verification

- 3820 tests pass (+9 from this commit's new tests vs the original PR
  push of 3811)
- Filter manually verified end-to-end with the live os.environ:
  XPC=0 → None, XPC=application.* → None, XPC=com.example.foo → triggers
- run-browser-tests.sh ALL CHECKS PASSED on the worktree

Items deferred from the Opus review

- #4 chdir target may not exist: REPO_ROOT comes from __file__.resolve()
  so it's stable; not a real concern in practice
- #6 two startup messages in foreground mode: cosmetic, useful for
  diagnostics
- #7 stricter explicit-only mode: leaves user the override of just not
  passing --foreground (current behavior)
- #8 test stub return value: trivial, can fix later if regression surface
- #9 argparse positional-after-option ordering: test reads fine

These can be follow-up issues if anyone hits them.
…267 follow-ups

- CHANGELOG.md: v0.50.269 entry detailing #1478 #1479 #1480
- ROADMAP.md: bump to v0.50.269, 3847 tests collected
- TESTING.md: bump header + total to 3847

#1478: nesquena APPROVED self-built bootstrap.py --foreground mode
       (closes #1458 Bug #1, +Opus follow-ups: XPC noise filter, executability guard)
#1479: surgical follow-up to #1473 — Session.compact() now includes pending_user_message
#1480: bfcache pageshow restores active session via loadSession + checkInflightOnBoot

3847 tests pass (+47 net). Opus advisor on stage diff: no blockers.
@nesquena-hermes nesquena-hermes merged commit b8a346f into master May 2, 2026
3 checks passed
@nesquena-hermes nesquena-hermes deleted the stage-269 branch May 2, 2026 18:14
pull Bot pushed a commit to soitun/hermes-webui that referenced this pull request May 3, 2026
… sidebar polling (nesquena#1494)

Production WebUI on macOS launchd reproduced an HTTP-unhealthy wedge after
nesquena#1483 closed the bootstrap supervisor double-fork: process alive, port
listening, every HTTP request reset by peer before a response. The reporter
(@insecurejezza) traced it to FD exhaustion — 366 open FDs on the wedged
process, 238 of them `~/.hermes/state.db`, `state.db-wal`, and `state.db-shm`.

Root cause: four sqlite callsites use `with sqlite3.connect(...) as conn:`.
Python's sqlite3 connection context manager only commits or rolls back on
exit; it does NOT close the connection. `/api/sessions` polling calls these
on every sidebar refresh, so each poll leaked one or more open state.db FDs
until the process hit macOS's soft FD limit and new sqlite3.connect() calls
inside fresh request handlers raised before any response bytes were written.

Fix: wrap each `sqlite3.connect(...)` in `contextlib.closing(...)` so the
connection is explicitly closed on scope exit, in addition to the auto-
commit / rollback semantics that `Connection.__exit__` already provides.

Callsites patched:
- api/agent_sessions.py:read_importable_agent_session_rows
- api/agent_sessions.py:read_session_lineage_metadata
- api/models.py:get_cli_session_messages
- api/models.py:delete_cli_session

Reporter's verification (post-patch, 100-request stress loop against
/api/sessions and /api/projects):

  batch=1 fd=92 state_handles=0
  batch=2 fd=92 state_handles=0
  ...
  batch=5 fd=92 state_handles=0

Pre-patch the same loop made FD count and state.db handle count climb
monotonically.

4 regression tests in tests/test_issue1494_state_db_fd_leak.py monkeypatch
sqlite3.connect with a tracking wrapper that records .close() calls and
assert every connection opened by each of the four functions is explicitly
closed. Verified to fail (catching the original bug) when the closing()
wrap is reverted: "leaked 5 of 5 sqlite connection(s) — context-manager-
only `with sqlite3.connect()` does not close. Wrap in contextlib.closing()."

This addresses Bug #2 of the umbrella issue nesquena#1458. Bug #3 (HTTP-unhealthy
wedge in the absence of FD exhaustion) remains open pending separate
diagnostic data — explicit scope discipline.

Closes nesquena#1494
Refs nesquena#1458 (Bug #2 of 3)

Co-authored-by: insecurejezza <[email protected]>
pull Bot pushed a commit to soitun/hermes-webui that referenced this pull request May 3, 2026
…ix + P0 polish bundle (nesquena#1466 nesquena#1494 nesquena#1469 nesquena#1484 nesquena#1486)

3 PRs in this batch (3866 → 3874 tests, +8):

- nesquena#1493 (@dso2ng) — sidebar Stop response cancels row's stream not active pane's (closes nesquena#1466, follow-up to nesquena#1480)
- nesquena#1495 (self-built; reported by @insecurejezza in nesquena#1494) — state.db connection FD leak in sidebar polling (closes nesquena#1494, addresses Bug #2 of nesquena#1458)
- nesquena#1492 (@bergeouss) — P0 bugfixes bundle: tool-card args readability + CLI rename persistence + scroll pinning + sw.js relative-path regression test (closes nesquena#1469 nesquena#1484 nesquena#1486)

This release closes Bug #2 of the umbrella issue nesquena#1458. Bug #1 was closed by v0.50.269 (nesquena#1483) + v0.50.270 (nesquena#1487). Bug #3 (HTTP-unhealthy without FD exhaustion) is the remaining work item.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(stability): persistent-host crashes — bootstrap.py double-fork breaks launchd + state.db FD leak

2 participants