
server: legacy flock UX follow-up — liveness probe, clearer stderr, orphan lifecycle (follow-up to #437) #440


Description

@memtomem

Follow-up to #437 (PR #439 closed that). PR #439 fixed the stale-file axis — when the server exits cleanly, ~/.memtomem/.server.pid is now unlinked on both atexit and SIGTERM, so a later probe won't race on a leftover file.

But a second axis of the same user symptom remains: a live orphan holding the lock. Observed today, immediately after the fix landed:

Memtomem MCP Server
Status:           ✘ failed
Command:          memtomem-server
Config location:  /Users/pdstudio/.claude.json

Diagnosis at that moment:

$ lsof ~/.memtomem/.server.pid
COMMAND     PID     USER   FD   TYPE ...
python3.1 98059 pdstudio    3u   REG ...
$ ps 98059
98059  memtomem-server    (started ~5 min earlier)

So the server process is alive and holding the fd; this is the real "lock is held by another process" case, not a stale-file race. Claude Code's handshake failed but the child stayed up (either Claude Code doesn't SIGTERM the orphan on handshake failure, or it does and the child swallows it), and every subsequent Reconnect spawns a new child that legitimately contends with the orphan.

User-visible symptoms repeat indefinitely:

  • Status: ✘ failed until the user manually runs pkill memtomem-server.
  • /mcp → "Failed to reconnect to memtomem." on every retry.
  • stderr of the new child says "pre-0.1.25 install" — which is still misleading: the holder is a current-build orphan, not an old install.

Three distinct improvements, probably separable

1. Liveness probe before aborting

When _try_hold_legacy_flock's fcntl.flock(LOCK_EX|LOCK_NB) fails, read the PID from the file (_lock_fp.read() before close, or re-open the path for reading) and probe it with os.kill(pid, 0); see the sketch after the guardrail note below.

Guardrail: the PID we read could have been recycled to a different process after the original holder died. As a second check, confirm the holder is actually a memtomem server: read /proc/<pid>/comm on Linux, or fall back to ps -p <pid> -o comm= (works on macOS too). Accepting an occasional false-positive abort is fine; a false negative would let two writers run.
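
A minimal sketch of both checks, assuming the PID has already been read out of the pid file inside _try_hold_legacy_flock; the helper names and EXPECTED_COMM are illustrative, not existing code:

import os
import subprocess

EXPECTED_COMM = "memtomem-server"  # assumed holder name; illustrative

def holder_is_alive(pid: int) -> bool:
    # Signal 0 delivers nothing; it only asks the kernel whether pid exists.
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False   # no such process: the pid file is stale
    except PermissionError:
        return True    # process exists but belongs to another user
    return True

def holder_looks_like_server(pid: int) -> bool:
    # Guard against PID reuse: confirm the holder's command name.
    try:
        with open(f"/proc/{pid}/comm") as f:  # Linux only
            return EXPECTED_COMM in f.read()
    except FileNotFoundError:
        pass  # macOS has no /proc; fall through to ps
    out = subprocess.run(
        ["ps", "-p", str(pid), "-o", "comm="],
        capture_output=True, text=True,
    ).stdout
    return EXPECTED_COMM in out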

2. Clearer stderr message

Today's message blames "a pre-0.1.25 install". Two better shapes:

  • Stale file path: "error: stale pid file at ~/.memtomem/.server.pid (no live holder). Retrying." — and self-heal via (1).
  • Real contention path: "error: another memtomem-server is already running (pid {N}). If this is an orphan from a failed MCP handshake, run kill {N} and reconnect."

The "pre-0.1.25 install" wording should only fire when the caller can actually verify the holder is pre-0.1.25 — which isn't practical post-#412, so probably drop it entirely.

3. Child-lifecycle hygiene

When memtomem-server detects that its controlling MCP client is gone (stdin EOF, or stdio writes starting to error), it should exit proactively. mcp.run() should do this already via the JSON-RPC exit notification, but some clients (Claude Code, observed here) seem to just stop reading without sending exit once they've decided the handshake failed. Options (the first is sketched after this list):

  • Add a stdin-watchdog: a thread/task that reads from stdin and exits the process on EOF.
  • Use an SO_KEEPALIVE equivalent for stdio: poll stdin reads with a short timeout and exit after N minutes with no traffic.
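
A sketch of the stdin-watchdog, with one caveat: in the real server the MCP stdio transport owns stdin, so a second reader would steal protocol bytes; there, the equivalent hook is the transport's own read loop seeing EOF. Self-SIGTERM rather than os._exit so the #439 cleanup still unlinks the pid file:

import os
import signal
import sys
import threading

def start_stdin_watchdog() -> threading.Thread:
    # Standalone illustration only; see the caveat above about the
    # transport owning stdin in the real server.
    def watch() -> None:
        while True:
            if not sys.stdin.buffer.read(4096):  # b"" means the client closed the pipe
                # SIGTERM ourselves so the #439 handler unlinks the pid file.
                os.kill(os.getpid(), signal.SIGTERM)
                return

    t = threading.Thread(target=watch, name="stdin-watchdog", daemon=True)
    t.start()
    return t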

(3) is the broadest fix but also the most invasive; (1) + (2) together would resolve the user-visible loop without changing lifecycle semantics.

Suggested scoping

  • Must: (1) + (2) as one PR — liveness probe + message rewrite. Small and self-contained in _try_hold_legacy_flock.
  • Should: (3) as a separate follow-up (or filed against the upstream modelcontextprotocol/python-sdk if the issue is the SDK not handling client-departure).
