
server: legacy flock UX follow-up — liveness probe, clearer stderr, orphan lifecycle (follow-up to #437) #440


Description

@memtomem

Follow-up to #437 (PR #439 closed that). PR #439 fixed the stale-file axis — when the server exits cleanly, ~/.memtomem/.server.pid is now unlinked on both atexit and SIGTERM, so a later probe won't race on a leftover file.

But a second axis of the same user symptom remains: a live orphan holding the lock. Observed today, immediately after the fix landed:

Memtomem MCP Server
Status:           ✘ failed
Command:          memtomem-server
Config location:  /Users/pdstudio/.claude.json

Diagnosis at that moment:

$ lsof ~/.memtomem/.server.pid
COMMAND     PID     USER   FD   TYPE ...
python3.1 98059 pdstudio    3u   REG ...
$ ps 98059
98059  memtomem-server    (started ~5 min earlier)

So the server process is alive and holding the fd; this is the real "lock is held by another process" case, not a stale-file race. Claude Code's handshake failed but the child stayed up (either Claude Code doesn't SIGTERM the orphan on handshake failure, or it does and the child swallows it), and every subsequent Reconnect spawns a new child that legitimately contends with the orphan.

User-visible symptoms repeat indefinitely:

  • Status: ✘ failed until the user manually runs pkill memtomem-server.
  • /mcp → "Failed to reconnect to memtomem." on every retry.
  • stderr of the new child says "pre-0.1.25 install" — which is still misleading: the holder is a current-build orphan, not an old install.

Three distinct improvements, probably separable

1. Liveness probe before aborting

When _try_hold_legacy_flock's fcntl.flock(LOCK_EX|LOCK_NB) fails, read the PID from the file (_lock_fp.read() before close, or re-open the path for reading) and probe it with os.kill(pid, 0); see the sketch after the guardrail note below.

Guardrail: the PID we read could have been recycled to a different process after the original holder died. As a second check, confirm the holder is actually a memtomem server: read /proc/<pid>/comm on Linux, or fall back to ps -p <pid> -o comm= (works on macOS too). Accepting an occasional false-positive abort is fine; a false negative would let two writers run.
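
A minimal sketch of both checks, assuming the PID has already been read out of the pid file inside _try_hold_legacy_flock; the helper names and EXPECTED_COMM are illustrative, not existing code:

import os
import subprocess

EXPECTED_COMM = "memtomem-server"  # assumed holder name; illustrative

def holder_is_alive(pid: int) -> bool:
    # Signal 0 delivers nothing; it only asks the kernel whether pid exists.
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False   # no such process: the pid file is stale
    except PermissionError:
        return True    # process exists but belongs to another user
    return True

def holder_looks_like_server(pid: int) -> bool:
    # Guard against PID reuse: confirm the holder's command name.
    try:
        with open(f"/proc/{pid}/comm") as f:  # Linux only
            return EXPECTED_COMM in f.read()
    except FileNotFoundError:
        pass  # macOS has no /proc; fall through to ps
    out = subprocess.run(
        ["ps", "-p", str(pid), "-o", "comm="],
        capture_output=True, text=True,
    ).stdout
    return EXPECTED_COMM in out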

2. Clearer stderr message

Today's message blames "a pre-0.1.25 install". Two better shapes:

  • Stale file path: "error: stale pid file at ~/.memtomem/.server.pid (no live holder). Retrying." — and self-heal via (1).
  • Real contention path: "error: another memtomem-server is already running (pid {N}). If this is an orphan from a failed MCP handshake, run kill {N} and reconnect."

The "pre-0.1.25 install" wording should only fire when the caller can actually verify the holder is pre-0.1.25 — which isn't practical post-#412, so probably drop it entirely.

3. Child-lifecycle hygiene

When memtomem-server detects that its controlling MCP client is gone (stdin EOF, or stdio writes starting to error), it should exit proactively. mcp.run() should do this already via the JSON-RPC exit notification, but some clients (Claude Code, observed here) seem to just stop reading without sending exit once they've decided the handshake failed. Options (the first is sketched after this list):

  • Add a stdin-watchdog: a thread/task that reads from stdin and exits the process on EOF.
  • Use an SO_KEEPALIVE equivalent for stdio: poll stdin reads with a short timeout and exit after N minutes with no traffic.
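
A sketch of the stdin-watchdog, with one caveat: in the real server the MCP stdio transport owns stdin, so a second reader would steal protocol bytes; there, the equivalent hook is the transport's own read loop seeing EOF. Self-SIGTERM rather than os._exit so the #439 cleanup still unlinks the pid file:

import os
import signal
import sys
import threading

def start_stdin_watchdog() -> threading.Thread:
    # Standalone illustration only; see the caveat above about the
    # transport owning stdin in the real server.
    def watch() -> None:
        while True:
            if not sys.stdin.buffer.read(4096):  # b"" means the client closed the pipe
                # SIGTERM ourselves so the #439 handler unlinks the pid file.
                os.kill(os.getpid(), signal.SIGTERM)
                return

    t = threading.Thread(target=watch, name="stdin-watchdog", daemon=True)
    t.start()
    return t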

(3) is the broadest fix but also the most invasive; (1) + (2) together would resolve the user-visible loop without changing lifecycle semantics.

Suggested scoping

  • Must: (1) + (2) as one PR — liveness probe + message rewrite. Small and self-contained in _try_hold_legacy_flock.
  • Should: (3) as a separate follow-up (or filed against the upstream modelcontextprotocol/python-sdk if the issue is the SDK not handling client-departure).
