fix: harden persistent-host health checks #1657
Scope is well-chosen — this targets the residual wedge mode that supervisor restarts can't reach (process alive, port listening, HTTP unhealthy) without speculating about a fresh root cause. Deep probe + accept-loop heartbeat + FD headroom is the right operational toolkit for that failure class. A few notes from reading the description:
The 4287-passing full suite plus the targeted RED→GREEN harness for the three new contracts (accept_loop, deep checks, …
134433f
…ps absorbed
Constituent PRs:
- nesquena#1659 by @bergeouss: Docker readonly false-positive (closes nesquena#1658, fixes v0.50.295 regression)
- nesquena#1653 by @nesquena: OAuth cancel race fix (follow-up to v0.50.296 nesquena#1652)
- nesquena#1657 by @Michaelyklam: health diagnostics + watchdog hardening (refs nesquena#1458 Bug nesquena#3)
Opus advisor SHIP verdict on stage-297. Two follow-ups absorbed in-release:
- _deep_health_checks(stream_check=...) reuses pre-computed lock probe
- _handle_request_noblock docstring documents single-thread safety
PR nesquena#1656 closed as superseded by nesquena#1657 (same author, both target nesquena#1458, nesquena#1657 is functional superset).
4284 → 4288 tests passing (+4).
Thinking Path
- state.db FD leak fixes are already shipped; the remaining production concern is the process-alive / port-listening / HTTP-unhealthy wedge.
- … STREAMS_LOCK, sidebar/session state, projects, and state.db) while plain /health stays cheap.
- Closes #1458 once merged because the concrete fixed bugs are already covered and this adds the final residual hardening path.
What Changed
- … QuietHTTPServer and exposed them under /health.accept_loop.
- … /health?deep=1 readiness checks for the streams lock, session/sidebar path, project state, and Hermes state.db connectivity.
- … 503 with status: "degraded" when the stream lock probe cannot complete, so watchdogs do not treat a wedged process as healthy.
- … RLIMIT_NOFILE soft limit to 4096 on supported platforms as defense in depth for persistent hosts.
- … docs/supervisor.md.
- … tests/test_issue1458_stability_hardening.py.
Why It Matters
The previous fixes removed the confirmed FD leaks and supervisor double-fork loop. This PR addresses the remaining operational gap: a process supervisor can only restart a crashed process, not one that is still alive but no longer serving usable HTTP responses. Deep health plus an accept-loop heartbeat gives persistent Mac mini / launchd deployments a concrete watchdog signal, and the FD soft-limit raise provides diagnostic headroom if a future leak regresses.
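As a rough illustration of two of the mechanisms described above, the following Python sketch shows a timed lock probe that maps to a 503 "degraded" result, and a best-effort raise of the RLIMIT_NOFILE soft limit. It is not the PR's implementation: probe_lock, deep_health, raise_nofile_soft_limit, the payload shape, and the 0.5-second timeout are hypothetical names and values chosen for this example, and the module-level STREAMS_LOCK is a stand-in assumed to be a plain threading.Lock.

```python
import threading

try:
    import resource  # Unix-only; absent on Windows
except ImportError:
    resource = None

# Hypothetical stand-in for the STREAMS_LOCK named in the PR description.
STREAMS_LOCK = threading.Lock()


def probe_lock(lock, timeout=0.5):
    """Return True if the lock can be acquired within `timeout` seconds."""
    acquired = lock.acquire(timeout=timeout)
    if acquired:
        lock.release()
    return acquired


def deep_health(lock=STREAMS_LOCK, timeout=0.5):
    """Build an (http_status, payload) pair for a deep readiness check."""
    stream_ok = probe_lock(lock, timeout)
    payload = {
        "status": "ok" if stream_ok else "degraded",
        "checks": {"streams_lock": stream_ok},  # assumed payload shape
    }
    return (200 if stream_ok else 503), payload


def raise_nofile_soft_limit(target=4096):
    """Raise the soft FD limit toward `target`; return the resulting soft limit."""
    if resource is None:
        return None  # unsupported platform, nothing to do
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)  # the soft limit may not exceed the hard limit
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already at or above the target
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        return soft  # leave the limit unchanged if the OS refuses
    return target
```

A handler for /health?deep=1 could call deep_health() and send the returned status code with the payload as JSON. The probe uses a timeout so the check cannot itself hang on the lock it is testing. With a re-entrant lock, the probe would always succeed from the thread that already holds it, so it would need to run on a different thread.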
Verification
Result:
Manual verification, if applicable:
UI media, if applicable:
Risks / Follow-ups
- /health?deep=1 intentionally does more work than plain /health; watchdogs should use reasonable timeouts/intervals rather than polling it aggressively.
Model Used
AI assisted.
- … openai/gpt-5.4-mini
- … gh, isolated git worktree, strict RED/GREEN pytest loop, full local pytest suite, git diff/security review, PR creation via gh.
Closes #1458