feat(server): parent-death watchdog self-SIGTERMs on reparent (closes #440) #442
…440)

Investigation in #440 (comments) established that the MCP stdio client (Claude Code observed, likely others) sometimes exits without closing our stdio unix sockets OR sending SIGTERM. `memtomem-server` is then left alive holding ~/.memtomem/.server.pid, blocking the next client's spawn. Server-side tests confirmed the server already handles stdin-EOF and SIGTERM correctly; the failure mode is purely "signal never arrives". Fixing the client is out of scope; this change makes the server resilient to any client that fails to clean up after itself.

Approach: polling os.getppid() at 10s intervals. When the parent process is gone (POSIX reparents to init/launchd, so ppid changes), self-SIGTERM to trigger the existing sigterm handler from #439, which unlinks the XDG and legacy pid files and os._exit(0)s. Using self-SIGTERM rather than os._exit here is deliberate: bypassing the sigterm handler would re-create the stale-pid-file class of bugs #437/#439 closed.

Design notes:
- POSIX-portable (macOS launchd + Linux PID 1), no PR_SET_PDEATHSIG dependency.
- False-positive-free: PPID change = parent exited, guaranteed.
- Gated by MEMTOMEM_PARENT_WATCHDOG env (default on); set to "off"/"0"/"false" to disable for supervised or daemonised deployments where reparenting is expected.
- Poll interval configurable via MEMTOMEM_PARENT_WATCHDOG_INTERVAL (default 10s). Invalid values fall back to the default.
- Task lifecycle: created in app_lifespan startup, cancelled + awaited in finally. Any exception during teardown (including CancelledError leaking out) is swallowed so main lifespan cleanup still runs.

Tested:
- 17 new tests in test_server_parent_watchdog.py:
  * env gating (6 params: default, 5 disable values, truthy values)
  * interval override + invalid-value fallback
  * coroutine no-ops while ppid stable
  * coroutine self-SIGTERMs when ppid changes (via monkey-patched getppid iterator)
  * cancellation path returns cleanly (no stray exception)
  * end-to-end subprocess test: grandparent spawns parent+server, SIGKILLs parent, asserts server pid file is unlinked within 5s (exercises the full watchdog → self-SIGTERM → sigterm handler → pid file unlink chain)
- Full CI-filter suite: 2268 passed (previously 2251, +17 new).
- ruff check/format, mypy: clean.

Co-Authored-By: Claude <[email protected]>
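For readers without the diff open, the approach and task lifecycle described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `parent_watchdog` and `lifespan_with_watchdog` are hypothetical names, and the injectable `getppid`/`kill` parameters are an assumption made here so the coroutine can be exercised without signalling a real process.

```python
import asyncio
import contextlib
import os
import signal
from typing import Callable


async def parent_watchdog(
    interval: float = 10.0,
    getppid: Callable[[], int] = os.getppid,
    kill: Callable[[int, int], None] = os.kill,
) -> None:
    """Poll the parent PID and self-SIGTERM once it changes (hypothetical sketch)."""
    original_ppid = getppid()
    while True:
        await asyncio.sleep(interval)
        if getppid() != original_ppid:
            # Go through the process's own SIGTERM handler instead of
            # os._exit, so whatever cleanup that handler does still runs.
            kill(os.getpid(), signal.SIGTERM)
            return


@contextlib.asynccontextmanager
async def lifespan_with_watchdog(interval: float = 10.0):
    """Start the watchdog on entry; cancel and await it on exit."""
    task = asyncio.create_task(parent_watchdog(interval))
    try:
        yield task
    finally:
        task.cancel()
        # Swallow CancelledError (and anything else) so the caller's
        # remaining teardown is not interrupted.
        with contextlib.suppress(asyncio.CancelledError, Exception):
            await task
```

Passing an iterator-backed `getppid` (as the monkey-patched test above does) drives the reparent branch deterministically, and a recording `kill` shows which signal would have been sent.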
Closing after empirical validation that v0.1.26 already handles both orphan-generating paths without this watchdog:

Experiments (all run on v0.1.26 from PyPI)

1. Normal …
2. Parent SIGKILL with …

Why both work in 0.1.26: kernel closes parent's fd on process death → unix socket peer closes → server's stdio read returns EOF → …

So when would this watchdog have helped?

Only in a hypothetical scenario where: …

I couldn't produce this state with any plausible user action. The empirical root cause of today's reported orphan turned out to be pre-0.1.26 processes still running in memory after an in-place upgrade (disk bytes new, process memory old …

Decision

[1]: will link in follow-up
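The EOF step in the chain above can be reproduced in isolation. The snippet below is only an illustration under assumptions: it uses `socket.socketpair()` as a stand-in for the client/server stdio sockets and an explicit `close()` in place of the kernel closing a dead process's descriptors. It also assumes no other process still holds a copy of the client's end, since EOF is only delivered once every copy is closed.

```python
import socket

# Stand-ins for the client's and server's ends of one stdio unix socket.
client_end, server_end = socket.socketpair()

client_end.sendall(b"ping")
first = server_end.recv(1024)        # data written before the close is still delivered

client_end.close()                   # the kernel does the equivalent for a dead process's fds
after_close = server_end.recv(1024)  # an empty read is how the peer's close shows up

server_end.close()
```

Here `after_close` is the empty bytes object, which is the condition a stdio read loop treats as EOF.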
Summary
Closes #440. Makes `memtomem-server` self-terminate when its MCP stdio client parent process disappears, instead of lingering as an orphan holding `~/.memtomem/.server.pid`.

Background from the #440 investigation:

- `lsof` shows the stdio unix sockets to `claude` stay open across the shutdown path that produces orphans. When the `claude` process disappears without a clean socket close, the server has no client-provided signal to act on.
- When the parent exits, POSIX reparents the server (`os.getppid()` changes to 1 on Linux, launchd's PID on macOS). Polling that is a portable, client-agnostic orphan signal.

Approach

A 10s-interval `asyncio.create_task` watchdog is launched in `app_lifespan`. On reparent detection it sends `SIGTERM` to its own PID rather than calling `os._exit`. This is deliberate: `_install_sigterm_handler` (#439) unlinks every pid file we own. Going around it would re-create the stale-file class of bugs #437/#439 closed.

Design choices

- Gated by the `MEMTOMEM_PARENT_WATCHDOG` env (default `on`). Disable with `off`/`0`/`false` for supervised / daemonised deployments where reparenting is expected.
- Poll interval configurable via `MEMTOMEM_PARENT_WATCHDOG_INTERVAL` (default `10`). Invalid values fall back to the default, so the server never crashes on a bad env.
- `os.kill(os.getpid(), SIGTERM)` → existing sigterm handler → pid files unlinked → `os._exit(0)`. No new teardown code.
- The task lives in the `app_lifespan` context; cancellation is wrapped so main teardown still runs.

Test plan

- `uv run pytest packages/memtomem/tests/test_server_parent_watchdog.py -v`: 17 passed
- End-to-end subprocess test: grandparent spawns parent+server, `SIGKILL`s parent, asserts server exits within 5s and pid file is unlinked. Exercises the full watchdog → self-SIGTERM → #439 handler → pid file unlink chain
- `uv run pytest packages/memtomem/tests/ -m "not ollama"`: 2268 passed (previously 2251; +17)
- `ruff check` / `ruff format --check` / `mypy` on `server/lifespan.py`: clean

Out of scope (stays on #440 wishlist, not in this PR)

- … `_try_hold_legacy_flock` failure (dead-pid detection and unlink). Same rationale: orphan rate should drop sharply.

Either follow-up can land in a separate PR if still worth doing after real-world usage.
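As a rough illustration of the env gating and interval fallback listed under the design choices, parsing along these lines would give the described behaviour. The helper names are hypothetical, and two details are assumptions not stated in this PR: matching disable values case-insensitively, and treating non-positive or non-finite intervals as invalid.

```python
import math
import os
from typing import Mapping

DEFAULT_INTERVAL_SECONDS = 10.0
# Only the three values named in the PR description; the tests mention five.
DISABLE_VALUES = {"off", "0", "false"}


def watchdog_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Default on; only an explicit disable value turns the watchdog off."""
    raw = env.get("MEMTOMEM_PARENT_WATCHDOG", "on")
    # Assumption: comparison ignores case and surrounding whitespace.
    return raw.strip().lower() not in DISABLE_VALUES


def watchdog_interval(env: Mapping[str, str] = os.environ) -> float:
    """Return the poll interval in seconds, falling back on bad input."""
    raw = env.get("MEMTOMEM_PARENT_WATCHDOG_INTERVAL")
    if raw is None:
        return DEFAULT_INTERVAL_SECONDS
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_INTERVAL_SECONDS
    # Assumption: zero, negative, NaN and infinite values count as invalid.
    if not math.isfinite(value) or value <= 0:
        return DEFAULT_INTERVAL_SECONDS
    return value
```

Taking the environment as a `Mapping` argument keeps both helpers testable with a plain dict, without patching `os.environ`.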
🤖 Generated with Claude Code