
feat(server): parent-death watchdog self-SIGTERMs on reparent (closes #440) #442

Closed
memtomem wants to merge 1 commit into main from fix/440-parent-watchdog

Conversation

@memtomem
Owner

Summary

Closes #440. Makes memtomem-server self-terminate when its MCP stdio client parent process disappears, instead of lingering as an orphan holding ~/.memtomem/.server.pid.

Background from the #440 investigation:

  • Server side is clean: three shutdown tests (stdin EOF / init+EOF / SIGTERM) all show the server exits properly and unlinks its pid file post-#439 (fix(server): unlink legacy .server.pid on atexit and SIGTERM, closes #437).
  • Client side doesn't always close stdio: live lsof shows the stdio unix sockets to claude stay open across the shutdown path that produces orphans. When the claude process disappears without a clean socket close, the server has no client-provided signal to act on.
  • POSIX gives us one anyway: when a child's parent exits, the child is reparented, so os.getppid() changes (typically to 1, which is init on Linux and launchd on macOS; on Linux it can instead be a subreaper's PID). Polling for that change is a portable, client-agnostic orphan signal.

Approach

A watchdog task, created with asyncio.create_task in app_lifespan, polls os.getppid() every 10s. On detecting reparenting it sends SIGTERM to its own PID rather than calling os._exit. This is deliberate: _install_sigterm_handler (#439) unlinks every pid file we own, and going around it would re-create the stale-file class of bugs that #437/#439 closed.
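
A minimal sketch of the polling loop described above, not the code from this PR. The function name `parent_watchdog` and the injectable `getppid` / `kill` parameters are assumptions added here so the loop can be exercised without killing the real process.

```python
import asyncio
import os
import signal
from typing import Callable


async def parent_watchdog(
    interval: float = 10.0,
    getppid: Callable[[], int] = os.getppid,          # injectable for tests (assumption)
    kill: Callable[[int, int], None] = os.kill,       # injectable for tests (assumption)
) -> None:
    """Poll the parent PID and self-SIGTERM once it changes (reparenting)."""
    original_ppid = getppid()
    while True:
        await asyncio.sleep(interval)
        if getppid() != original_ppid:
            # Go through the SIGTERM handler so the existing pid-file
            # cleanup runs, instead of exiting directly with os._exit.
            kill(os.getpid(), signal.SIGTERM)
            return
```

Comparing against the PID captured at startup, rather than against a literal 1, keeps the check independent of which process adopts the orphan.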

Design choices

  • Gating: MEMTOMEM_PARENT_WATCHDOG env (default on). Disable with off / 0 / false for supervised / daemonised deployments where reparenting is expected.
  • Interval: MEMTOMEM_PARENT_WATCHDOG_INTERVAL (default 10, in seconds). Invalid values fall back to the default, so the server never crashes on a bad env value.
  • Exit path: os.kill(os.getpid(), SIGTERM) → existing sigterm handler → pid files unlinked → os._exit(0). No new teardown code.
  • Task lifecycle: created/cancelled in the same app_lifespan context, cancellation wrapped so main teardown still runs.
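
The gating, interval, and task-lifecycle choices above could look roughly like the sketch below. Only the two environment variable names and the off / 0 / false values come from this PR; the helper names, the case-insensitive matching, the rejection of non-positive intervals, and the `lifespan_with_watchdog` wrapper are assumptions for illustration.

```python
import asyncio
import contextlib
import os
from typing import Any, AsyncIterator, Callable, Coroutine, Mapping

DEFAULT_INTERVAL = 10.0
# Only the values named in the PR text; the real set may be larger.
_DISABLE_VALUES = {"off", "0", "false"}


def watchdog_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Default on; disabled only by an explicit disable value."""
    raw = env.get("MEMTOMEM_PARENT_WATCHDOG", "")
    return raw.strip().lower() not in _DISABLE_VALUES


def watchdog_interval(env: Mapping[str, str] = os.environ) -> float:
    """Parse the poll interval, falling back to the default on bad input."""
    raw = env.get("MEMTOMEM_PARENT_WATCHDOG_INTERVAL", "")
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_INTERVAL
    # Assumption: zero, negative, NaN and infinite values count as invalid.
    if not (0 < value < float("inf")):
        return DEFAULT_INTERVAL
    return value


@contextlib.asynccontextmanager
async def lifespan_with_watchdog(
    watchdog: Callable[[float], Coroutine[Any, Any, None]],
    env: Mapping[str, str] = os.environ,
) -> AsyncIterator[None]:
    """Create the watchdog task on entry; cancel and await it on exit."""
    task = None
    if watchdog_enabled(env):
        task = asyncio.create_task(watchdog(watchdog_interval(env)))
    try:
        yield
    finally:
        if task is not None:
            task.cancel()
            # Swallow CancelledError and any other error so the rest of
            # the lifespan teardown still runs.
            with contextlib.suppress(asyncio.CancelledError, Exception):
                await task
```

Passing `env` explicitly keeps the helpers testable without mutating the real process environment.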

Test plan

  • uv run pytest packages/memtomem/tests/test_server_parent_watchdog.py -v → 17 passed
    • 6× env-gating parametrized
    • 3× interval (default / override / invalid fallback)
    • Coroutine no-ops while parent alive
    • Coroutine self-SIGTERMs when ppid changes (monkey-patched iterator)
    • Cancellation returns cleanly
    • End-to-end: grandparent spawns parent+server, SIGKILLs parent, asserts server exits within 5s and pid file is unlinked. This exercises the full watchdog → self-SIGTERM → #439 handler → pid file unlink chain
  • uv run pytest packages/memtomem/tests/ -m "not ollama" → 2268 passed (previously 2251; +17)
  • ruff check / ruff format --check / mypy on server/lifespan.py — clean

Out of scope (stays on #440 wishlist, not in this PR)

  • Stderr message cleanup for the legacy-flock error ("pre-0.1.25 install" phrasing). With this watchdog live, that error should become rare enough that rewording is lower priority.
  • Liveness probe on _try_hold_legacy_flock failure (dead-pid detection and unlink). Same rationale — orphan rate should drop sharply.

Either follow-up can land in a separate PR if still worth doing after real-world usage.

🤖 Generated with Claude Code

…440)

Investigation in #440 (comments) established that the MCP stdio client
(Claude Code observed, likely others) sometimes exits without closing
our stdio unix sockets OR sending SIGTERM. `memtomem-server` is then
left alive holding ~/.memtomem/.server.pid, blocking the next client's
spawn. Server-side tests confirmed the server already handles stdin-EOF
and SIGTERM correctly — the failure mode is purely "signal never
arrives". Fixing the client is out of scope; this change makes the
server resilient to any client that fails to clean up after itself.

Approach: polling os.getppid() at 10s intervals. When the parent process
is gone (POSIX reparents to init/launchd, so ppid changes), self-SIGTERM
to trigger the existing sigterm handler from #439, which unlinks the
XDG and legacy pid files and os._exit(0)s. Using self-SIGTERM rather
than os._exit here is deliberate — bypassing the sigterm handler would
re-create the stale-pid-file class of bugs #437/#439 closed.

Design notes:
- POSIX-portable (macOS launchd + Linux PID 1), no PR_SET_PDEATHSIG
  dependency.
- False-positive-free: PPID change = parent exited, guaranteed.
- Gated by MEMTOMEM_PARENT_WATCHDOG env (default on); set to "off"/"0"/
  "false" to disable for supervised or daemonised deployments where
  reparenting is expected.
- Poll interval configurable via MEMTOMEM_PARENT_WATCHDOG_INTERVAL
  (default 10s). Invalid values fall back to the default.
- Task lifecycle: created in app_lifespan startup, cancelled + awaited
  in finally. Any exception during teardown (including CancelledError
  leaking out) is swallowed so main lifespan cleanup still runs.

Tested:
- 17 new tests in test_server_parent_watchdog.py:
  * env gating (6 params: default, 5 disable values, truthy values)
  * interval override + invalid-value fallback
  * coroutine no-ops while ppid stable
  * coroutine self-SIGTERMs when ppid changes (via monkey-patched
    getppid iterator)
  * cancellation path returns cleanly (no stray exception)
  * end-to-end subprocess test: grandparent spawns parent+server,
    SIGKILLs parent, asserts server pid file is unlinked within 5s
    (exercises the full watchdog → self-SIGTERM → sigterm handler →
     pid file unlink chain)
- Full CI-filter suite: 2268 passed (previously 2251, +17 new).
- ruff check/format, mypy: clean.

Co-Authored-By: Claude <[email protected]>
@memtomem
Owner Author

Closing after empirical validation that v0.1.26 already handles both orphan-generating paths without this watchdog:

Experiments (all run on v0.1.26 from PyPI)

1. Normal /exit shutdown (primary use case)

  • User /exits a Claude Code session holding a memtomem-server child
  • Observed: server process disappears + both pid files unlinked within 1s ✅

2. Parent SIGKILL with MEMTOMEM_PARENT_WATCHDOG=off (worst-case simulation)

  • Parent python process gets SIGKILL — no chance to close stdio or send signal
  • Observed: server exits + pid files unlinked within 1s ✅

Why both work in 0.1.26: kernel closes parent's fd on process death → unix socket peer closes → server's stdio read returns EOF → mcp.run() exits → lifespan finally → atexit unlinks. The chain from #439 is sufficient.
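
The middle link of that chain (peer closes, read returns EOF) can be illustrated in isolation. This is a sketch, not the server's code: `socket.socketpair()` stands in for the stdio unix socket, and the explicit `close()` stands in for the kernel closing the dead parent's file descriptors. The function name `demo_peer_close_eof` is hypothetical.

```python
import socket


def demo_peer_close_eof() -> tuple[bytes, bytes]:
    """Return (data read while the peer is alive, data read after it closes)."""
    parent_end, child_end = socket.socketpair()
    try:
        parent_end.sendall(b"ping")
        alive = child_end.recv(16)      # normal data while the peer is open
        parent_end.close()              # stand-in for process death closing the fd
        after_close = child_end.recv(16)  # empty bytes means EOF
        return alive, after_close
    finally:
        child_end.close()
```

An empty read is what lets a stdio read loop exit without any explicit signal from the client.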

So when would this watchdog have helped?

Only in a hypothetical scenario where:

  • Parent claude process is alive
  • But the stdio socket to the child is dead/leaked

I couldn't produce this state with any plausible user action. The empirical root cause of today's reported orphan turned out to be pre-0.1.26 processes still running in memory after an in-place upgrade (disk bytes new, process memory old — uv tool install doesn't touch running Python processes). That's an upgrade-hygiene problem, not a runtime-watchdog problem, and it's filed as [#443][1] for an mm upgrade subcommand.

Decision

[1]: will link in follow-up

@memtomem memtomem closed this Apr 24, 2026
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 24, 2026
@memtomem memtomem deleted the fix/440-parent-watchdog branch April 27, 2026 14:59

Development

Successfully merging this pull request may close these issues.

server: legacy flock UX follow-up — liveness probe, clearer stderr, orphan lifecycle (follow-up to #437)
