fix: harden persistent-host health checks by Michaelyklam · Pull Request #1657 · nesquena/hermes-webui

Michaelyklam · 2026-05-04T22:31:16Z

Thinking Path

bug(stability): persistent-host crashes — bootstrap.py double-fork breaks launchd + state.db FD leak #1458's bootstrap and state.db FD leak fixes are already shipped; the remaining production concern is the process-alive / port-listening / HTTP-unhealthy wedge.
The issue comments called out independent hardening that would make that final failure mode diagnosable/recoverable: deep health, an accept-loop heartbeat, supervisor watchdog docs, and more FD headroom.
This PR keeps the change narrowly scoped to those persistent-host safeguards instead of speculating about an unreported post-v0.50.272 root cause.
The deep probe exercises the same state surfaces that made the UI unusable during the wedge (STREAMS_LOCK, sidebar/session state, projects, and state.db) while plain /health stays cheap.
Closes #1458 once merged because the concrete fixed bugs are already covered and this adds the final residual hardening path.

What Changed

Added accept-loop heartbeat counters to QuietHTTPServer and exposed them under /health.accept_loop.
Added /health?deep=1 readiness checks for the streams lock, session/sidebar path, project state, and Hermes state.db connectivity.
Made health return 503 with status: "degraded" when the stream lock probe cannot complete, so watchdogs do not treat a wedged process as healthy.
Raised the server process RLIMIT_NOFILE soft limit to 4096 on supported platforms as defense in depth for persistent hosts.
Documented a launchd/systemd-compatible HTTP watchdog recipe and the deep-health/FD-limit behavior in docs/supervisor.md.
Added regression coverage in tests/test_issue1458_stability_hardening.py.

Why It Matters

The previous fixes removed the confirmed FD leaks and supervisor double-fork loop. This PR addresses the remaining operational gap: a process supervisor can only restart a crashed process, not one that is still alive but no longer serving usable HTTP responses. Deep health plus an accept-loop heartbeat gives persistent Mac mini / launchd deployments a concrete watchdog signal, and the FD soft-limit raise provides diagnostic headroom if a future leak regresses.

Verification

/home/michael/.hermes/hermes-agent/venv/bin/python -m pytest tests/test_issue1458_stability_hardening.py -q
/home/michael/.hermes/hermes-agent/venv/bin/python -m py_compile server.py api/routes.py
/home/michael/.hermes/hermes-agent/venv/bin/python -m pytest tests/test_issue1458_stability_hardening.py tests/test_sprint7.py tests/test_sprint19.py -q
git diff --check
/home/michael/.hermes/hermes-agent/venv/bin/python -m pytest tests/ -q

Result:

RED check before implementation: 3 failed as expected (missing accept_loop, missing deep checks, missing _raise_fd_soft_limit).
Targeted final: 3 passed in 1.61s.
Related health/security targeted: 35 passed in 1.86s.
Full suite: 4287 passed, 2 skipped, 3 xpassed, 1 warning, 8 subtests passed in 403.36s (0:06:43).
git diff --check: clean.

Manual verification, if applicable:

Inspected the final diff for unrelated files and credential patterns; no secrets or unrelated work included.

UI media, if applicable:

Not applicable — server health/readiness and docs change only; no UI/UX surface changed.

Risks / Follow-ups

/health?deep=1 intentionally does more work than plain /health; watchdogs should use reasonable timeouts/intervals rather than polling it aggressively.
The residual RST+LISTEN wedge has not been freshly reproduced on post-v0.50.272 builds. This PR adds observability/recovery hardening rather than claiming a new distinct root cause was found.
If a new post-fix wedge report arrives with FD counts/thread samples, it should be tracked as a new targeted issue with that diagnostic bundle.

Model Used

AI assisted.

Provider: OpenRouter
Model: openai/gpt-5.4-mini
Notable tool use: GitHub issue inspection via gh, isolated git worktree, strict RED/GREEN pytest loop, full local pytest suite, git diff/security review, PR creation via gh.

Closes #1458

nesquena-hermes · 2026-05-04T22:52:05Z

Scope is well-chosen — this targets the residual wedge mode that supervisor restarts can't reach (process alive, port listening, HTTP unhealthy) without speculating about a fresh root cause. Deep probe + accept-loop heartbeat + FD headroom is the right operational toolkit for that failure class.

A few notes from reading the description:

/health?deep=1 and 503 semantics. Returning 503 when the streams-lock probe can't complete is the right call — that's what watchdogs need to see to act. Worth confirming the plain /health (no deep) path stays cheap and doesn't acquire STREAMS_LOCK, otherwise an aggressive watchdog poll could itself contribute to lock contention. The PR body implies this is the case ("plain /health stays cheap") but the diff is the source of truth.
Accept-loop heartbeat counter monotonicity. If the counter is exposed under /health.accept_loop, watchdogs will compare now vs last_seen. Worth documenting (in docs/supervisor.md) the expected tick cadence and a recommended staleness threshold so users don't pick something arbitrary like 1s and flap on GIL pauses or GC pauses.
RLIMIT_NOFILE 4096. Defense in depth, fine. On Linux the soft limit is usually already 1024 for non-root and 4096 is well within hard. On macOS launchd the hard limit can be unlimited per-process but the kernel kern.maxfilesperproc (default 24576 on recent macOS, 10240 on older) caps it — should be safe at 4096 universally. If the raise fails (EPERM / ValueError), please ensure it logs but doesn't abort startup.
Watchdog recipe in docs. Good addition. One thing to call out explicitly: a watchdog that polls /health?deep=1 will itself be a load source — recommend e.g. 30–60s intervals, not 5s, and pair with a separate cheap /health for fast liveness.
Issue bug(stability): persistent-host crashes — bootstrap.py double-fork breaks launchd + state.db FD leak #1458 closure language. The body says "Closes bug(stability): persistent-host crashes — bootstrap.py double-fork breaks launchd + state.db FD leak #1458 once merged because the concrete fixed bugs are already covered and this adds the final residual hardening path." That's the right framing — bootstrap and FD-leak fixes already shipped; this is the diagnosability layer. If a future post-v0.50.272 wedge reproduces with diagnostics, that becomes a fresh issue with FD counts + thread sample, not a bug(stability): persistent-host crashes — bootstrap.py double-fork breaks launchd + state.db FD leak #1458 reopen.

The 4287-passing full suite plus the targeted RED→GREEN harness for the three new contracts (accept_loop, deep checks, _raise_fd_soft_limit) is solid coverage. Approach LGTM.

@Michaelyklam

…rdening (refs nesquena#1458) by @Michaelyklam

@bergeouss

…ps absorbed Constituent PRs: nesquena#1659 by @bergeouss — Docker readonly false-positive (closes nesquena#1658, fixes v0.50.295 regression) nesquena#1653 by @nesquena — OAuth cancel race fix (follow-up to v0.50.296 nesquena#1652) nesquena#1657 by @Michaelyklam — health diagnostics + watchdog hardening (refs nesquena#1458 Bug nesquena#3) Opus advisor SHIP verdict on stage-297. Two follow-ups absorbed in-release: - _deep_health_checks(stream_check=...) reuses pre-computed lock probe - _handle_request_noblock docstring documents single-thread safety PR nesquena#1656 closed as superseded by nesquena#1657 (same author, both target nesquena#1458, nesquena#1657 is functional superset). 4284 → 4288 tests passing (+4).

fix: harden persistent WebUI health checks

ca135c2

This was referenced May 4, 2026

fix: add deep health diagnostics #1656

Closed

Release v0.50.297 — 3-PR batch (Docker regression fix + OAuth cancel race + persistent-host health hardening) #1661

Merged

nesquena-hermes closed this pull request by merging all changes into nesquena:master in 134433f May 4, 2026

nesquena-hermes pushed a commit to ai-ag2026/hermes-webui that referenced this pull request May 4, 2026

Stage 297: PR nesquena#1657 — Health diagnostics + persistent-host ha…

c3d6a2d

…rdening (refs nesquena#1458) by @Michaelyklam

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: harden persistent-host health checks#1657

fix: harden persistent-host health checks#1657
1 commit merged intonesquena:masterfrom
Michaelyklam:fix/issue-1458-stability-hardening

Michaelyklam commented May 4, 2026

Uh oh!

nesquena-hermes commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Michaelyklam commented May 4, 2026

Thinking Path

What Changed

Why It Matters

Verification

Risks / Follow-ups

Model Used

Uh oh!

nesquena-hermes commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants