Skip to content

fix: harden persistent-host health checks#1657

Merged
1 commit merged intonesquena:masterfrom
Michaelyklam:fix/issue-1458-stability-hardening
May 4, 2026
Merged

fix: harden persistent-host health checks#1657
1 commit merged intonesquena:masterfrom
Michaelyklam:fix/issue-1458-stability-hardening

Conversation

@Michaelyklam
Copy link
Copy Markdown
Contributor

Thinking Path

  • bug(stability): persistent-host crashes — bootstrap.py double-fork breaks launchd + state.db FD leak #1458's bootstrap and state.db FD leak fixes are already shipped; the remaining production concern is the process-alive / port-listening / HTTP-unhealthy wedge.
  • The issue comments called out independent hardening that would make that final failure mode diagnosable/recoverable: deep health, an accept-loop heartbeat, supervisor watchdog docs, and more FD headroom.
  • This PR keeps the change narrowly scoped to those persistent-host safeguards instead of speculating about an unreported post-v0.50.272 root cause.
  • The deep probe exercises the same state surfaces that made the UI unusable during the wedge (STREAMS_LOCK, sidebar/session state, projects, and state.db) while plain /health stays cheap.
  • Closes #1458 once merged because the concrete fixed bugs are already covered and this adds the final residual hardening path.

What Changed

  • Added accept-loop heartbeat counters to QuietHTTPServer and exposed them under /health.accept_loop.
  • Added /health?deep=1 readiness checks for the streams lock, session/sidebar path, project state, and Hermes state.db connectivity.
  • Made health return 503 with status: "degraded" when the stream lock probe cannot complete, so watchdogs do not treat a wedged process as healthy.
  • Raised the server process RLIMIT_NOFILE soft limit to 4096 on supported platforms as defense in depth for persistent hosts.
  • Documented a launchd/systemd-compatible HTTP watchdog recipe and the deep-health/FD-limit behavior in docs/supervisor.md.
  • Added regression coverage in tests/test_issue1458_stability_hardening.py.

Why It Matters

The previous fixes removed the confirmed FD leaks and supervisor double-fork loop. This PR addresses the remaining operational gap: a process supervisor can only restart a crashed process, not one that is still alive but no longer serving usable HTTP responses. Deep health plus an accept-loop heartbeat gives persistent Mac mini / launchd deployments a concrete watchdog signal, and the FD soft-limit raise provides diagnostic headroom if a future leak regresses.

Verification

/home/michael/.hermes/hermes-agent/venv/bin/python -m pytest tests/test_issue1458_stability_hardening.py -q
/home/michael/.hermes/hermes-agent/venv/bin/python -m py_compile server.py api/routes.py
/home/michael/.hermes/hermes-agent/venv/bin/python -m pytest tests/test_issue1458_stability_hardening.py tests/test_sprint7.py tests/test_sprint19.py -q
git diff --check
/home/michael/.hermes/hermes-agent/venv/bin/python -m pytest tests/ -q

Result:

RED check before implementation: 3 failed as expected (missing accept_loop, missing deep checks, missing _raise_fd_soft_limit).
Targeted final: 3 passed in 1.61s.
Related health/security targeted: 35 passed in 1.86s.
Full suite: 4287 passed, 2 skipped, 3 xpassed, 1 warning, 8 subtests passed in 403.36s (0:06:43).
git diff --check: clean.

Manual verification, if applicable:

  • Inspected the final diff for unrelated files and credential patterns; no secrets or unrelated work included.

UI media, if applicable:

  • Not applicable — server health/readiness and docs change only; no UI/UX surface changed.

Risks / Follow-ups

  • /health?deep=1 intentionally does more work than plain /health; watchdogs should use reasonable timeouts/intervals rather than polling it aggressively.
  • The residual RST+LISTEN wedge has not been freshly reproduced on post-v0.50.272 builds. This PR adds observability/recovery hardening rather than claiming a new distinct root cause was found.
  • If a new post-fix wedge report arrives with FD counts/thread samples, it should be tracked as a new targeted issue with that diagnostic bundle.

Model Used

AI assisted.

  • Provider: OpenRouter
  • Model: openai/gpt-5.4-mini
  • Notable tool use: GitHub issue inspection via gh, isolated git worktree, strict RED/GREEN pytest loop, full local pytest suite, git diff/security review, PR creation via gh.

Closes #1458

@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Scope is well-chosen — this targets the residual wedge mode that supervisor restarts can't reach (process alive, port listening, HTTP unhealthy) without speculating about a fresh root cause. Deep probe + accept-loop heartbeat + FD headroom is the right operational toolkit for that failure class.

A few notes from reading the description:

  1. /health?deep=1 and 503 semantics. Returning 503 when the streams-lock probe can't complete is the right call — that's what watchdogs need to see to act. Worth confirming the plain /health (no deep) path stays cheap and doesn't acquire STREAMS_LOCK, otherwise an aggressive watchdog poll could itself contribute to lock contention. The PR body implies this is the case ("plain /health stays cheap") but the diff is the source of truth.

  2. Accept-loop heartbeat counter monotonicity. If the counter is exposed under /health.accept_loop, watchdogs will compare now vs last_seen. Worth documenting (in docs/supervisor.md) the expected tick cadence and a recommended staleness threshold so users don't pick something arbitrary like 1s and flap on GIL pauses or GC pauses.

  3. RLIMIT_NOFILE 4096. Defense in depth, fine. On Linux the soft limit is usually already 1024 for non-root and 4096 is well within hard. On macOS launchd the hard limit can be unlimited per-process but the kernel kern.maxfilesperproc (default 24576 on recent macOS, 10240 on older) caps it — should be safe at 4096 universally. If the raise fails (EPERM / ValueError), please ensure it logs but doesn't abort startup.

  4. Watchdog recipe in docs. Good addition. One thing to call out explicitly: a watchdog that polls /health?deep=1 will itself be a load source — recommend e.g. 30–60s intervals, not 5s, and pair with a separate cheap /health for fast liveness.

  5. Issue bug(stability): persistent-host crashes — bootstrap.py double-fork breaks launchd + state.db FD leak #1458 closure language. The body says "Closes bug(stability): persistent-host crashes — bootstrap.py double-fork breaks launchd + state.db FD leak #1458 once merged because the concrete fixed bugs are already covered and this adds the final residual hardening path." That's the right framing — bootstrap and FD-leak fixes already shipped; this is the diagnosability layer. If a future post-v0.50.272 wedge reproduces with diagnostics, that becomes a fresh issue with FD counts + thread sample, not a bug(stability): persistent-host crashes — bootstrap.py double-fork breaks launchd + state.db FD leak #1458 reopen.

The 4287-passing full suite plus the targeted RED→GREEN harness for the three new contracts (accept_loop, deep checks, _raise_fd_soft_limit) is solid coverage. Approach LGTM.

@nesquena-hermes nesquena-hermes closed this pull request by merging all changes into nesquena:master in 134433f May 4, 2026
nesquena-hermes pushed a commit to ai-ag2026/hermes-webui that referenced this pull request May 4, 2026
nesquena-hermes pushed a commit to ai-ag2026/hermes-webui that referenced this pull request May 4, 2026
…ps absorbed

Constituent PRs:
  nesquena#1659 by @bergeouss — Docker readonly false-positive (closes nesquena#1658, fixes v0.50.295 regression)
  nesquena#1653 by @nesquena — OAuth cancel race fix (follow-up to v0.50.296 nesquena#1652)
  nesquena#1657 by @Michaelyklam — health diagnostics + watchdog hardening (refs nesquena#1458 Bug nesquena#3)

Opus advisor SHIP verdict on stage-297. Two follow-ups absorbed in-release:
- _deep_health_checks(stream_check=...) reuses pre-computed lock probe
- _handle_request_noblock docstring documents single-thread safety

PR nesquena#1656 closed as superseded by nesquena#1657 (same author, both target nesquena#1458,
nesquena#1657 is functional superset).

4284 → 4288 tests passing (+4).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(stability): persistent-host crashes — bootstrap.py double-fork breaks launchd + state.db FD leak

2 participants