Skip to content

fix(bootstrap): add --foreground mode for process supervisors (closes #1458 Bug #1)#1478

Closed
nesquena-hermes wants to merge 2 commits intomasterfrom
fix/issue-1458-bootstrap-foreground
Closed

fix(bootstrap): add --foreground mode for process supervisors (closes #1458 Bug #1)#1478
nesquena-hermes wants to merge 2 commits intomasterfrom
fix/issue-1458-bootstrap-foreground

Conversation

@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Closes #1458 (Bug #1 only)

Self-built fix for the bootstrap.py double-fork that breaks process supervisors. Bugs #2 (state.db FD leak) and #3 (HTTP-unhealthy wedge) remain open under #1458 — they need diagnosis data before a non-speculative fix can land.

Problem

Reporter: at least one crash per day running a persistent WebUI on a Mac mini under launchd KeepAlive.

Root cause (verified against master): bootstrap.py:243-268 calls subprocess.Popen([python_exe, "server.py"], start_new_session=True), probes /health, then exits 0. Under any process supervisor (launchd, systemd, supervisord, runit, s6), the supervisor sees the parent exit, marks the program as "completed," and respawns it. The new bootstrap fails to bind 8787 (orphaned server still has it), exits non-zero, supervisor respawns again — loop until something else crashes the orphan and the next respawn finds the port free. Reporter described this as "the agent fixes it eventually" — that's the loop intermittently succeeding.

Fix

Add --foreground flag (and supervisor-environment auto-detection). In foreground mode, replace the bootstrap process image with server.py via os.execv so the supervisor sees the long-lived server as the original child. KeepAlive / Restart=always now work correctly.

Auto-detected env vars

Env var Set by
INVOCATION_ID systemd (every service activation)
JOURNAL_STREAM systemd (stdio wired to journald)
NOTIFY_SOCKET systemd Type=notify, s6 sd_notify
XPC_SERVICE_NAME launchd (set to plist Label)
SUPERVISOR_ENABLED supervisord
HERMES_WEBUI_FOREGROUND explicit user opt-in (1 / true / yes / on)

Routing decision

foreground = args.foreground or _detect_supervisor()
if foreground:
    os.chdir(server_cwd)
    os.execv(python_exe, [python_exe, server_path])  # never returns
else:
    Popen + wait_for_health + return 0   # legacy, unchanged

Resolved env vars (HOST/PORT/STATE_DIR/AGENT_DIR) are mutated on os.environ directly so they survive os.execv.

Files changed

  • bootstrap.py (+82/-9): argparse flag, _detect_supervisor(), two-path main()
  • docs/supervisor.md (new, +185 LOC): runnable launchd plist + systemd .service + supervisord conf + diagnostic recipe
  • .gitignore (+1): allowlist docs/supervisor.md (matches !docs/docker.md precedent)
  • tests/test_bootstrap_foreground.py (new, +380 LOC): 35 regression tests

Tests

35 new tests in tests/test_bootstrap_foreground.py:

  • TestForegroundFlag (3) — argparse recognition, default false, help text mentions all 3 supervisors
  • TestDetectSupervisor (15) — clean env returns None; each of 5 supervisor env vars triggers; explicit opt-in accepts 7 truthy values; explicit opt-in falls through on 7 falsy/empty values; explicit opt-in takes precedence over supervisor var
  • TestMainForegroundRouting (8) — default path uses Popen not execv; --foreground uses execv not Popen; each of 5 supervisor env vars auto-promotes to execv; HERMES_WEBUI_FOREGROUND=1 auto-promotes to execv
  • TestForegroundEnvAndCwd (3) — chdirs to agent_dir or REPO_ROOT before exec; exports HOST/PORT/AGENT_DIR/STATE_DIR to os.environ; skips wait_for_health in foreground mode

os.execv is monkeypatched in the routing tests — we pin the structural choice (which path, which cwd, which env) not the post-exec behavior. Existing test_bootstrap_dotenv.py (16 tests) continues to pass.

Verification

  • pytest tests/ -q3811 passed, 2 skipped, 3 xpassed (was 3760 before this PR + master-merge-in)
  • bash scripts/run-browser-tests.sh → ALL CHECKS PASSED (HTTP API contract checks against the worktree)
  • bash scripts/webui_qa_agent.sh 878923/23 visual QA passed against a live server started with HERMES_WEBUI_FOREGROUND=1
  • Live PID-lineage verified: with HERMES_WEBUI_FOREGROUND=1, the python process serving on 8789 is the same PID that bash originally forked for bootstrap.py. No double-fork.
$ HERMES_WEBUI_FOREGROUND=1 python bootstrap.py
[bootstrap] Starting Hermes Web UI on http://127.0.0.1:8789 (foreground mode: HERMES_WEBUI_FOREGROUND)
...
$ lsof -tiTCP:8789 -sTCP:LISTEN
2997632
$ ps -p 2997632 -o pid,ppid,cmd
2997632  2997581  /home/.../python /tmp/wt-fix-1458/server.py   ← exec'd from bootstrap.py
$ ps -p 2997581 -o pid,cmd
2997581  /usr/bin/bash ... bootstrap.py   ← parent shell, not a forked python

Backward compatibility

Default behavior (interactive bash start.sh) is unchanged. The new path is opt-in via flag or supervisor-env detection. No breaking changes.

Out of scope (explicitly)

Reviewer notes

  • The os.environ mutation (instead of local env copy) is a behavioral change in the legacy Popen path too: previously the Popen got a copy with HOST/PORT etc. baked in; now os.environ itself has them set. This is actually a bug fix — if someone's chained scripts inspected os.environ['HERMES_WEBUI_PORT'] after bootstrap returned, they'd previously have seen the raw user value, not the resolved port. Verified the legacy path's behavior is otherwise identical via the existing test_bootstrap_dotenv.py suite.
  • os.execv argv convention: argv[0] is the program name, argv[1..] are the actual script and args. We pass [python_exe, server_path] where argv[0] == argv[1] (path repeated) — this is correct and matches CPython's convention for os.execv.

Hermes Bot added 2 commits May 2, 2026 17:37
…Bug #1)

Issue #1458 reports persistent-host crashes (≥1/day) when running the WebUI
under launchd KeepAlive on macOS. Root cause: `bootstrap.py` calls
`subprocess.Popen([python, "server.py"], start_new_session=True)`, probes
/health, then exits 0. Under any process supervisor (launchd, systemd,
supervisord, runit, s6), the supervisor sees its tracked PID exit, marks
the program as "completed," and respawns it. The new bootstrap fails to
bind port 8787 (orphaned server still has it), exits non-zero, supervisor
respawns again — loop until the orphan crashes for some other reason and
the next respawn finds the port free.

This PR addresses Bug #1 of the three failure modes tracked in #1458:
the `bootstrap.py` double-fork breaking process supervisors. Bug #2
(state.db FD leak) and Bug #3 (HTTP-unhealthy wedge) remain open under
the same issue — they need diagnosis data before a fix can land.

Changes
-------

1. `bootstrap.py`:
   - New `--foreground` argparse flag with help text mentioning launchd /
     systemd / supervisord.
   - New `_detect_supervisor()` that returns the env var name for any
     supervisor it detects: `INVOCATION_ID` / `JOURNAL_STREAM` /
     `NOTIFY_SOCKET` (systemd, s6), `XPC_SERVICE_NAME` (launchd),
     `SUPERVISOR_ENABLED` (supervisord), or `HERMES_WEBUI_FOREGROUND` for
     the explicit user opt-in. Truthy values for the explicit opt-in:
     `1` / `true` / `yes` / `on` (case-insensitive).
   - `main()` branches on `args.foreground or _detect_supervisor()`:
     - **Foreground path:** chdir to `agent_dir or REPO_ROOT`, then
       `os.execv(python, [python, server_path])` to replace the bootstrap
       process image with the server. The supervisor sees the long-lived
       server as the original child. No `wait_for_health` probe — the
       supervisor's KeepAlive / Restart=on-failure handles liveness.
     - **Default path:** unchanged. Spawn server as detached child via
       `Popen + start_new_session=True`, probe /health, return 0. This
       still works for interactive `bash start.sh` invocations.
   - Resolved env vars (HOST/PORT/STATE_DIR/AGENT_DIR) are now mutated on
     `os.environ` directly instead of into a local `env` copy so they
     are inherited across `os.execv`.

2. `docs/supervisor.md` (new): runnable launchd plist, systemd .service,
   and supervisord conf examples + a diagnostic recipe (`lsof` + ppid
   chain) for catching the orphan-loop in production.

3. `.gitignore`: allowlist `docs/supervisor.md` (the directory uses an
   opt-in pattern; matches the existing `!docs/docker.md` precedent).

4. `tests/test_bootstrap_foreground.py` (new): 35 regression tests
   covering the argparse flag, `_detect_supervisor()` behavior across all
   five supervisor env vars, the explicit opt-in's truthy/falsy values,
   and `main()`'s execv-vs-Popen routing decision under each input
   combination. `os.execv` is monkeypatched in the routing tests — we
   pin the structural choice (which call is made, with which args, in
   which cwd, with which env) not the post-exec behavior.

Why this scope and no more
--------------------------

Bug #2 (state.db FD leak) lists 5 candidate paths and asks the reporter
for `lsof -p <pid> | sort | uniq -c | sort -rn | head -20` output to
disambiguate. Until that data lands, any "fix" would be speculative —
explicitly out of scope per the contributor-pickup comment on the issue.

Bug #3 (launchd-running, port-listening, HTTP-unhealthy) was added in
@stefanpieter's reply comment. Diagnosis is in flight; no concrete fix
shape yet. Also out of scope.

Running locally end-to-end verifies the behavior:

```
[bootstrap] Starting Hermes Web UI on http://127.0.0.1:8789 (foreground mode: --foreground)
$ pgrep -af 'server.py'
2997632 /home/.../python /tmp/wt-fix-1458/server.py
$ ps -o ppid -p 2997632
2997581   ← bash that ran bootstrap.py — same PID as the original bootstrap
$ ps -p 2997581 -o cmd
... bootstrap.py ...   ← but exec'd into server.py
```

The same PID that bash forked for `bootstrap.py` is now `server.py`.
A supervisor watching that PID would correctly observe the long-lived
server. No double-fork.

Verification
------------

- 3811 tests pass (`pytest tests/` — full suite, +51 from this PR plus
  master-merge-in)
- All 35 new bootstrap-foreground tests pass
- `bash scripts/run-browser-tests.sh` PASS (HTTP API checks against worktree)
- `bash scripts/webui_qa_agent.sh 8789` PASS (23/23 visual QA)
- Live verified: server starts cleanly under both `--foreground` and
  `HERMES_WEBUI_FOREGROUND=1`; PID lineage confirms no double-fork

Closes #1458 (Bug #1 only). Bugs #2 and #3 remain tracked under the
issue.
Three changes from the pre-merge Opus review:

**MUST-FIX** — XPC_SERVICE_NAME false-positive on macOS Terminal

macOS launchd sets `XPC_SERVICE_NAME` in EVERY Terminal-spawned shell, not
just real services. Typical noise values: `"0"` (truthy in Python!) and
`"application.com.apple.Terminal.<UUID>"`. A bare `os.environ.get(name)`
existence check would auto-promote interactive `./start.sh` runs to
foreground mode on every Mac dev machine — silently breaking the most
common installation path (no /health probe, no browser open, no log file,
hanging shell).

Fix: new `_is_real_supervisor_value()` helper that filters noise. For
`XPC_SERVICE_NAME` specifically, reject `"0"` and any `"application.*"`
prefix. Real launchd plists use reverse-DNS Label form (`com.<rdns>.<svc>`)
which still triggers correctly.

7 new tests in `TestXPCServiceNameNoiseFilter`:
- 4 noise values (`0`, Terminal.app, iTerm2, VSCode) → no detection
- 3 real Label forms → correct detection
- Mixed env with XPC noise + real INVOCATION_ID → falls through to systemd

**SHOULD-FIX 1** — Test env leakage

The original `clean_env` fixture stripped supervisor-detection env vars
but not the resolved bootstrap vars (HERMES_WEBUI_HOST/PORT/AGENT_DIR)
that `main()` mutates onto `os.environ`. After
`test_foreground_exports_resolved_env_vars` ran, later tests would import
bootstrap with polluted defaults (DEFAULT_HOST="0.0.0.0" instead of
"127.0.0.1"). Existing assertions still passed (tautological vs DEFAULT_*),
but it was a footgun for future tests.

Fix: extend `clean_env` to also `delenv` the three resolved vars before
each test.

**SHOULD-FIX 2** — Pre-execv executability guard

If `discover_launcher_python` returns a path that doesn't exist or isn't
executable, `os.execv` raises OSError → wrapper catches → SystemExit(1)
→ supervisor restarts → loop forever. That's exactly the failure mode
this PR is supposed to eliminate.

Fix: `os.access(python_exe, os.X_OK)` check before execv. Converts
infinite supervisor loop into a single visible RuntimeError.

1 new test in `TestForegroundExecutabilityGuard` pinning that the guard
fires before execv when the python path is non-executable.

**Docs** — supervisor.md updates

- New section explaining the XPC_SERVICE_NAME noise filter and what
  values trigger / don't trigger detection
- New section listing supervisors that are NOT auto-detected (runit,
  daemontools, PM2, Foreman/Honcho, custom shell-script supervisors)
  with explicit recommendation to set HERMES_WEBUI_FOREGROUND=1

Verification

- 3820 tests pass (+9 from this commit's new tests vs the original PR
  push of 3811)
- Filter manually verified end-to-end with the live os.environ:
  XPC=0 → None, XPC=application.* → None, XPC=com.example.foo → triggers
- run-browser-tests.sh ALL CHECKS PASSED on the worktree

Items deferred from the Opus review

- #4 chdir target may not exist: REPO_ROOT comes from __file__.resolve()
  so it's stable; not a real concern in practice
- #6 two startup messages in foreground mode: cosmetic, useful for
  diagnostics
- #7 stricter explicit-only mode: leaves user the override of just not
  passing --foreground (current behavior)
- #8 test stub return value: trivial, can fix later if regression surface
- #9 argparse positional-after-option ordering: test reads fine

These can be follow-up issues if anyone hits them.
@nesquena-hermes
Copy link
Copy Markdown
Collaborator Author

Pre-merge Opus review pass complete. Three changes folded in (commit 6a26e82):

MUST-FIX — XPC_SERVICE_NAME false-positive on macOS Terminal

Opus caught that macOS launchd sets XPC_SERVICE_NAME in every Terminal-spawned shell, with values like "0" (truthy in Python!) and "application.com.apple.Terminal.<UUID>". Without filtering, every Mac dev running interactive ./start.sh would silently auto-promote to foreground mode — losing the /health probe, browser open, log file redirect, and hanging the shell.

Fix: new _is_real_supervisor_value() helper. For XPC_SERVICE_NAME specifically, reject "0" and any "application.*" prefix. Real launchd plists use reverse-DNS Label form (com.<rdns>.<svc>) which still triggers correctly.

7 new tests in TestXPCServiceNameNoiseFilter:

  • 4 noise values (bare 0, Terminal.app, iTerm2, VSCode) → no detection
  • 3 real Label forms (com.example.foo, com.acme.production-server, io.github.user.my-service) → correct detection
  • Mixed env (XPC noise + INVOCATION_ID set) → falls through to systemd, not blocked by the noise

SHOULD-FIX 1 — Test env leakage

The original clean_env fixture stripped supervisor-detection vars but not the resolved bootstrap vars (HERMES_WEBUI_HOST/PORT/AGENT_DIR) that main() mutates onto os.environ. Footgun for future tests. Extended fixture to delenv all three.

SHOULD-FIX 2 — Pre-execv executability guard

If discover_launcher_python returns a non-executable path, os.execv raises OSError → wrapper exits 1 → supervisor restarts → loop forever. That's exactly the failure mode this PR is supposed to eliminate. Added os.access(python_exe, os.X_OK) check before execv with a clear error pointing at HERMES_WEBUI_PYTHON. New test TestForegroundExecutabilityGuard.test_non_executable_python_raises_runtime_error pins it.

Docs updates

  • New section in docs/supervisor.md explaining the XPC_SERVICE_NAME noise filter
  • New section listing supervisors that are NOT auto-detected (runit pure, daemontools, PM2, Foreman/Honcho, custom shell-script supervisors) with explicit HERMES_WEBUI_FOREGROUND=1 recommendation

Items Opus flagged but deferred

These can be follow-up issues if anyone hits them.

Final verification

  • 3820 tests pass (+9 from this commit's new tests vs the original 3811)
  • Filter manually verified end-to-end with live os.environ: XPC=0 → None, XPC=application.* → None, XPC=com.example.foo → triggers
  • run-browser-tests.sh ALL CHECKS PASSED on the worktree
  • CI green: 3.11/3.12/3.13 all SUCCESS on the rebased SHA

Ready for independent review.

Copy link
Copy Markdown
Owner

@nesquena nesquena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — end-to-end ✅ (clean approve, behavioural harness validates noise filter and routing)

P1 fix for issue #1458 (Bug #1 only). Adds --foreground flag and supervisor-env auto-detection so bootstrap.py replaces its process image with server.py via os.execv under launchd/systemd/supervisord, instead of double-forking and exiting (which causes supervisor KeepAlive/Restart=always to respawn bootstrap.py while the orphaned server still holds port 8787 — the loop the issue describes).

Bugs #2 (state.db FD leak) and #3 (HTTP-unhealthy wedge) explicitly remain open under #1458 awaiting diagnostic data — correct scope discipline per the maintainer's pickup-comment.

What this ships

  • bootstrap.py (+121/-9): --foreground argparse flag, _SUPERVISOR_ENV_VARS tuple, _is_real_supervisor_value() XPC noise filter, _detect_supervisor(), and a two-path main() that exec's vs. Popens.
  • docs/supervisor.md (new, +237 LOC): runnable launchd plist + systemd .service + supervisord conf + diagnostic lsof/ppid recipe.
  • .gitignore: allowlist docs/supervisor.md (matches !docs/docker.md precedent).
  • tests/test_bootstrap_foreground.py (new, +455 LOC): 44 regression tests.

Traced against upstream hermes-agent

bootstrap.py is webui-only — verified by grepping the fresh tarball: agent-side bootstrap references are unrelated (launchctl bootstrap subcommand calls in hermes_cli/gateway.py). The CLI does not import or invoke webui's bootstrap.py. No cross-tool regression surface.

End-to-end trace

Entry points:

  • bash start.sh --foregroundexec "${PYTHON}" bootstrap.py --no-browser --foreground (start.sh:25) — exec replaces the shell, then bootstrap exec's into server.py. Result: same PID lineage from shell → server.
  • launchd plist with KeepAlive=true → spawns bash start.sh --foreground → same chain.
  • systemd Type=simple with ExecStart=...start.sh --foreground → bash exec'd by systemd, then bash exec's python, then bootstrap exec's server.py. Same lineage.

Routing decision at bootstrap.py:321:

foreground_reason = "--foreground" if args.foreground else _detect_supervisor()
if foreground_reason:
    # chdir → executability guard → os.execv (never returns)
else:
    # legacy: Popen + wait_for_health → return 0

Foreground path (bootstrap.py:322-347):

  1. Log starting message naming the trigger reason
  2. os.chdir(server_cwd) with OSError → RuntimeError translation
  3. os.access(python_exe, os.X_OK) guard — without this, a missing python would loop the supervisor (the exact failure mode the PR fixes)
  4. os.execv(python_exe, [python_exe, server_path]) — the argv[0] == argv[1] repetition is correct CPython convention
  5. Unreachable RuntimeError after execv

Env propagation (bootstrap.py:307-312): os.environ is mutated directly (instead of a local env dict copy). This is required for os.execv since execv inherits os.environ — a local copy would be lost. The legacy Popen path now uses env=os.environ.copy() (bootstrap.py:358) which captures the same final state. Net effect for legacy path: identical to before.

Security audit

  • ✅ No XSS, no SQL, no path traversal — bootstrap is local-only, no untrusted input.
  • os.execv(python_exe, ...) argument comes from discover_launcher_python() which respects HERMES_WEBUI_PYTHON env, then probes for venv/bin/python under known dirs. No user-supplied input reaches execv beyond local config.
  • ✅ The pre-execv os.access check is informational/defensive — there's no security boundary; this is just to convert silent loop into visible error.
  • ✅ No new endpoints, no auth changes, no config.yaml writes.

Race / state analysis

  • ✅ Pure synchronous code, no shared dicts, no locks. os.environ mutation is single-threaded at startup (before any subprocess or thread spawns).
  • ✅ Idempotent: os.environ.setdefault("HERMES_WEBUI_STATE_DIR", ...) doesn't override an externally provided value, but state_dir was computed from that same env var, so the value is consistent either way.

Behavioural harness — XPC noise filter

✅ XPC_SERVICE_NAME='0'                                      → False (Mac launchd descendant noise)
✅ XPC_SERVICE_NAME='application.com.apple.Terminal.UUID'    → False (Terminal.app)
✅ XPC_SERVICE_NAME=''                                       → False (empty)
✅ XPC_SERVICE_NAME='com.example.foo'                        → True  (real Label)
✅ XPC_SERVICE_NAME='application'                            → True  (no dot prefix — passes; theoretical edge)
✅ XPC_SERVICE_NAME='applicationCo.foo'                      → True  (no dot — passes; theoretical edge)
✅ INVOCATION_ID='0'                                         → True  (no narrowing for non-XPC vars)
✅ INVOCATION_ID=''                                          → False (empty)
✅ NOTIFY_SOCKET='/run/systemd/notify'                       → True

The filter correctly rejects Apple's known Terminal/IDE noise values. The "exactly application (no dot)" and "applicationCo.foo" cases are theoretical — Apple's actual noise always uses application.<bundle-id> form. Acceptable.

Edge-case matrix

Scenario Pre-PR Post-PR
bash start.sh (interactive Mac, XPC_SERVICE_NAME=0) Popen, healthy Popen (XPC=0 rejected) ✅
bash start.sh (interactive Mac, XPC=application.com.apple.Terminal) Popen, healthy Popen (rejected) ✅
launchd plist with Label com.example.foo (KeepAlive=true) Loops forever (orphan + respawn) execv → supervisor sees long-lived server ✅
systemd Type=simple with Restart=always Loops forever execv (auto-detected via INVOCATION_ID) ✅
supervisord with autorestart=true Loops forever execv (auto-detected via SUPERVISOR_ENABLED) ✅
runit / daemontools (no env var set) Loops forever Loops forever unless user sets HERMES_WEBUI_FOREGROUND=1 ⚠️ documented
HERMES_WEBUI_FOREGROUND=1 (no flag) Popen execv ✅
HERMES_WEBUI_FOREGROUND=0 (no flag, no supervisor env) Popen Popen (falsy falls through) ✅
Both --foreground and INVOCATION_ID set n/a execv, reason='--foreground' (flag wins) ✅
python_exe missing or non-executable in foreground mode execv raises → wrapper SystemExit(1) → supervisor restart loop RuntimeError before execv with actionable message ✅
--foreground + positional port: bootstrap.py --foreground 9999 n/a argparse handles store_true + nargs='?' interleaving ✅
Legacy Popen path env content local env copy with HOST/PORT mutated os.environ.copy() after os.environ mutation — identical net contents ✅

Tests

  • tests/test_bootstrap_foreground.py — 44/44 pass:
    • TestForegroundFlag (3): argparse recognition + default + help text mentions all 3 supervisors
    • TestDetectSupervisor (15): clean env, each of 5 supervisor vars, 7 truthy + 7 falsy explicit-opt-in values, precedence
    • TestXPCServiceNameNoiseFilter (8): 4 noise values rejected, 3 real Labels accepted, mixed-env fallthrough
    • TestMainForegroundRouting (8): default→Popen, --foreground→execv, each supervisor var auto-promotes, explicit-opt-in auto-promotes
    • TestForegroundEnvAndCwd (3): chdir to agent_dir, exports resolved env to os.environ, skips wait_for_health
    • TestForegroundExecutabilityGuard (1): non-executable python raises RuntimeError before execv
  • Existing test_bootstrap_dotenv.py continues to pass.
  • Full suite: 3767 passed, 55 skipped, 3 xpassed, 0 failed in 18.84s on the PR branch (PR description claims 3811/3820; counting drift is consistent with prior batches and not blocking).
  • CI: 3.11 / 3.12 / 3.13 all green.

Other audit — things that are correct already

  • argv[0] == argv[1] for os.execv([python, [python, server.py]]) is correct CPython convention.
  • start.sh already uses exec "${PYTHON}" ... "$@" so --foreground passes through with the bash process replaced (clean PID lineage).
  • ✅ Legacy Popen path env=os.environ.copy() is a snapshot taken AFTER the os.environ mutations, so the resolved values still flow to the child.
  • _detect_supervisor() short-circuits on --foreground flag — no double-evaluation cost or precedence ambiguity.
  • _load_repo_dotenv() runs at import time before DEFAULT_HOST/DEFAULT_PORT are read; tests freshly re-import bootstrap to avoid module-level state bleed.
  • ✅ The clean_env fixture strips HERMES_WEBUI_HOST/PORT/AGENT_DIR after the agent-flagged leak — verified at tests/test_bootstrap_foreground.py:73-83.

Minor observations (non-blocking)

  • Executability guard converts loop to error, but supervisor still restarts on RuntimeError → SystemExit(1). The improvement is observability (the stderr message points at HERMES_WEBUI_PYTHON), not loop prevention — the supervisor will still respawn until the user fixes the python path. A one-time loud failure is much better than silent flapping, but the comment "Convert to a single visible error" slightly oversells: it's a single clearer error per restart. Worth a tiny doc tweak in supervisor.md ("if you see this error in journald, fix HERMES_WEBUI_PYTHON — the supervisor will keep restarting until you do") but not blocking.

  • XPC_SERVICE_NAME filter accepts application (no dot) and applicationCo.foo. The value.startswith("application.") check requires a literal dot, so the bare word "application" or words starting with "applicationXxx" would pass through as if they were real Labels. In practice Apple always uses application.<bundle-id> form — no observed false negative. Could tighten to value == "application" or value.startswith("application.") but the value is bounded by what launchd actually emits.

  • No equivalent of HERMES_WEBUI_FOREGROUND for forcing OFF. If a user is in a weird env that auto-detects (say, a Docker container that happened to inherit INVOCATION_ID) and wants to force the legacy Popen path, the only escape is unset INVOCATION_ID && bash start.sh. A HERMES_WEBUI_FOREGROUND=0 doesn't currently force-disable detection (the falsy values fall through to env probing — by design per the test suite). Consider adding a HERMES_WEBUI_NO_FOREGROUND=1 override later if anyone hits this. Documented in supervisor.md anti-recipe section is sufficient for now.

  • PR description test count off by ~50 (3811/3820 claimed vs 3767 local). Same drift pattern as prior batches.

  • docs/supervisor.md runit/daemontools recommendation is sound — those supervisors don't set any of our detected env vars, so the explicit HERMES_WEBUI_FOREGROUND=1 instruction is correct.

Recommendation

Approved. The fix correctly converts the supervisor double-fork loop into a clean os.execv replacement so KeepAlive/Restart=always work as intended. The XPC noise filter prevents auto-promotion on every Mac dev machine — a real regression risk caught in pre-merge review. The executability guard converts a silent restart loop into a visible error, which is the right tradeoff. Behavioural harness on the noise filter and 44/44 regression tests confirm structural correctness; live PID-lineage verification in the PR description proves end-to-end behavior.

Bugs #2 and #3 explicitly out of scope per maintainer guidance — correct discipline.

Parked at approval — ready for the release agent's merge/tag pipeline.

@nesquena-hermes
Copy link
Copy Markdown
Collaborator Author

Shipped in v0.50.269 via release PR #1483. The work landed in master via the stage-269 batch — GitHub didn't auto-mark this PR as "merged" through the no-ff merge, so closing manually. The actual merge commit is 715a805 fix(bootstrap): --foreground mode for process supervisors (#1478) on master; tag v0.50.269 is live.

Closes #1458 Bug #1. Bugs #2 (state.db FD leak) and #3 (HTTP-unhealthy wedge) remain open under #1458 awaiting diagnostic data per the original PR's scope discipline.

nesquena-hermes pushed a commit to ccqqlo/hermes-webui that referenced this pull request May 2, 2026
nesquena-hermes pushed a commit to ccqqlo/hermes-webui that referenced this pull request May 2, 2026
…267 follow-ups

- CHANGELOG.md: v0.50.269 entry detailing nesquena#1478 nesquena#1479 nesquena#1480
- ROADMAP.md: bump to v0.50.269, 3847 tests collected
- TESTING.md: bump header + total to 3847

nesquena#1478: nesquena APPROVED self-built bootstrap.py --foreground mode
       (closes nesquena#1458 Bug nesquena#1, +Opus follow-ups: XPC noise filter, executability guard)
nesquena#1479: surgical follow-up to nesquena#1473 — Session.compact() now includes pending_user_message
nesquena#1480: bfcache pageshow restores active session via loadSession + checkInflightOnBoot

3847 tests pass (+47 net). Opus advisor on stage diff: no blockers.
nesquena-hermes pushed a commit to ccqqlo/hermes-webui that referenced this pull request May 2, 2026
…nto v0.50.269

The PR added an `agent_dir` parameter to ensure_python_has_webui_deps. The
test_bootstrap_foreground.py tests (added in nesquena#1478) had `lambda p: p` stubs
that were 1-arg only. Widened to `lambda *a, **kw: a[0]` so the stubs
accept the new signature on the rebased base.

Co-authored-by: ccqqlo <[email protected]>
nesquena-hermes pushed a commit to ccqqlo/hermes-webui that referenced this pull request May 2, 2026
…nto v0.50.269

The PR added an `agent_dir` parameter to ensure_python_has_webui_deps. The
test_bootstrap_foreground.py tests (added in nesquena#1478) had `lambda p: p` stubs
that were 1-arg only. Widened to `lambda *a, **kw: a[0]` so the stubs
accept the new signature on the rebased base.

Co-authored-by: ccqqlo <[email protected]>
pull Bot pushed a commit to soitun/hermes-webui that referenced this pull request May 2, 2026
nesquena#1315)

- CHANGELOG.md: v0.50.270 entry detailing nesquena#1315 + maintainer follow-ups
- ROADMAP.md: bump to v0.50.270, 3849 tests collected
- TESTING.md: bump header + total to 3849
- bootstrap.py: Opus advisor optional-followup — PYTHONPATH prepend comment

nesquena#1315 by @ccqqlo (113 LOC): bootstrap.py validates launcher Python can
import both yaml and run_agent.AIAgent. Companion fix to v0.50.269's nesquena#1478
— addresses the start-healthy-then-cryptic-fail mode (different from nesquena#1478's
supervisor-respawn loop).

3849 tests pass. Opus advisor verdict: ship as-is. CI green on contributor
branch + on local stage. QA harness all green.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(stability): persistent-host crashes — bootstrap.py double-fork breaks launchd + state.db FD leak

2 participants