Follow-up to #437 (PR #439 closed that). PR #439 fixed the stale-file axis — when the server exits cleanly, `~/.memtomem/.server.pid` is now unlinked on both `atexit` and SIGTERM, so a later probe won't race on a leftover file.
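For context, a minimal sketch of that cleanup pairing (handler and path names here are illustrative, not the actual PR #439 code):

```python
import atexit
import os
import signal
import sys

PID_FILE = os.path.expanduser("~/.memtomem/.server.pid")

def _unlink_pid_file() -> None:
    try:
        os.unlink(PID_FILE)
    except FileNotFoundError:
        pass  # already gone (e.g. the SIGTERM path already ran) -- nothing to do

atexit.register(_unlink_pid_file)                      # clean interpreter exit
signal.signal(signal.SIGTERM, lambda *_: sys.exit(0))  # SIGTERM -> SystemExit, so atexit runs
```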
But a second axis of the same user symptom remains: live orphan holder. Observed today immediately after the fix landed:
```
Memtomem MCP Server
Status: ✘ failed
Command: memtomem-server
Config location: /Users/pdstudio/.claude.json
```
Diagnosis at that moment:
```
$ lsof ~/.memtomem/.server.pid
COMMAND    PID    USER      FD  TYPE ...
python3.1  98059  pdstudio  3u  REG  ...
$ ps 98059
98059  memtomem-server   (started ~5 min earlier)
```
So the server process is alive and holding the fd — this is the real "lock is held by another process" case, not a stale-file race. Claude Code's handshake failed but the child stayed up (Claude Code doesn't SIGTERM the orphan on handshake failure — or it does and the child swallows it), and every subsequent Reconnect spawns a new child that legitimately contends with the orphan.
User-visible symptoms repeat indefinitely:
- `Status: ✘ failed` until the user manually runs `pkill memtomem-server`.
- `/mcp` → "Failed to reconnect to memtomem." on every retry.
- stderr of the new child says "pre-0.1.25 install" — which is still misleading: the holder is a current-build orphan, not an old install.
Three distinct improvements, probably separable
1. Liveness probe before aborting
When `_try_hold_legacy_flock`'s `fcntl.flock(LOCK_EX|LOCK_NB)` fails, read the PID from the file (`_lock_fp.read()` before close, or open-for-read the path) and check `os.kill(pid, 0)`:
- PID dead → treat the file as stale: unlink it and retry the lock (self-heal).
- PID alive and it is a memtomem-server → abort with a live-holder message (see (2)).
Guardrail: the PID we read could have been recycled to a different process after the original died. Second check: on macOS/Linux, `/proc/<pid>/comm` or `ps -p <pid> -o comm=` to confirm it's actually a memtomem server. Accepting some false-positive aborts is OK — a false negative would let two writers run.
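A rough sketch of what (1) could look like, assuming `_try_hold_legacy_flock` keeps the PID file open as `_lock_fp` (the helper names below are hypothetical):

```python
import os
import subprocess

def _holder_is_live_memtomem(pid: int) -> bool:
    """Hypothetical helper: True only if `pid` is alive AND looks like a memtomem server."""
    try:
        os.kill(pid, 0)          # signal 0: existence check only, nothing is delivered
    except ProcessLookupError:
        return False             # no such process -> the pid file is stale
    except PermissionError:
        pass                     # process exists but is owned by another user
    # Guardrail against PID reuse: confirm the command name before aborting.
    out = subprocess.run(["ps", "-p", str(pid), "-o", "comm="],
                         capture_output=True, text=True)
    return out.returncode == 0 and "memtomem" in out.stdout

def _probe_after_flock_failure(lock_fp) -> str:
    """Hypothetical helper: classify a LOCK_EX|LOCK_NB failure as 'stale' or 'live-holder'."""
    lock_fp.seek(0)
    raw = lock_fp.read().strip()
    if not raw.isdigit():
        return "stale"           # unreadable PID -> treat as stale, unlink and retry
    return "live-holder" if _holder_is_live_memtomem(int(raw)) else "stale"
```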
2. Clearer stderr message
Today's message blames "a pre-0.1.25 install". Two better shapes:
- Stale file path:
"error: stale pid file at ~/.memtomem/.server.pid (no live holder). Retrying." — and self-heal via (1).
- Real contention path:
"error: another memtomem-server is already running (pid {N}). If this is an orphan from a failed MCP handshake, run kill {N} and reconnect."
The "pre-0.1.25 install" wording should only fire when the caller can actually verify the holder is pre-0.1.25 — which isn't practical post-#412, so probably drop it entirely.
3. Child-lifecycle hygiene
When `memtomem-server` detects that its controlling MCP client is gone (stdin EOF, or stdio writes erroring), it should exit proactively. `mcp.run()` should do this already via the JSON-RPC `exit` notification, but some clients (Claude Code observed here) seem to just stop reading without sending `exit` when they decide the handshake has failed. Options:
- Add a stdin-watchdog: a thread/task that reads from stdin and exits the process on EOF (a sketch follows below).
- Use an `SO_KEEPALIVE`-equivalent for stdio — poll stdin reads with a short timeout; if no traffic for N minutes, exit.
(3) is the broadest fix but also the most invasive; (1) + (2) together would resolve the user-visible loop without changing lifecycle semantics.
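A minimal sketch of the stdin-watchdog option, assuming it can be wired in alongside the server's stdio transport (in practice it must not race the SDK's JSON-RPC reader for bytes, so it would likely live inside that transport rather than as a separate reader):

```python
import os
import signal
import sys
import threading

def start_stdin_watchdog() -> None:
    """Exit the server once the MCP client goes away (stdin EOF)."""
    def _watch() -> None:
        # Blocking read on the raw fd; returns b"" once the client closes its end of the pipe.
        while os.read(sys.stdin.fileno(), 4096):
            pass
        # EOF: signal ourselves so the SIGTERM cleanup from PR #439 unlinks the pid file.
        os.kill(os.getpid(), signal.SIGTERM)

    threading.Thread(target=_watch, daemon=True, name="stdin-watchdog").start()
```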
Suggested scoping
- Must: (1) + (2) as one PR — liveness probe + message rewrite. Small and self-contained in `_try_hold_legacy_flock`.
- Should: (3) as a separate follow-up (or filed against the upstream `modelcontextprotocol/python-sdk` if the issue is the SDK not handling client departure).
Reference
- #437: `.server.pid` flock locked; reconnects loop.
- `feedback_advisory_flock_unlink_pair.md` captures the general principle.