server: failed-handshake leaves legacy .server.pid flock locked; reconnects loop #437

Summary

When a memtomem-server child process started by an MCP client (Claude Code) fails its stdio handshake but the process stays alive, it continues to hold the legacy ~/.memtomem/.server.pid flock. Every subsequent reconnect attempt from the client spawns a fresh child that aborts immediately with:

error: another memtomem-server holds a lock at /Users/.../.memtomem/.server.pid (likely a pre-0.1.25 install). Stop it before starting a new server; `mm uninstall` will also refuse until it is gone.

Manual ps+kill is required to recover. The error message blames "pre-0.1.25" but the lock holder here is a current-build server — the legacy path just happens to still be the one that gets held.

Repro (observed live; not yet minimized)

  1. Claude Code opens a project with .mcp.json pointing at memtomem-server.
  2. First server child spawns, acquires legacy flock at ~/.memtomem/.server.pid.
  3. Handshake fails for some reason (exact cause TBD — the child stayed up as an orphan), Claude Code's UI shows ✘ failed.
  4. Reconnect → new child hits the flock held by (1) → aborts with the message above.
  5. Loop until the user manually kills the orphan.

Reliable trigger path not yet isolated. Filed on one repro because the recovery story is poor regardless of how the handshake fails: a dead-on-arrival server handshake shouldn't produce a lock that requires manual cleanup.

Where (starting points)

  • packages/memtomem/src/memtomem/server/__init__.py:241-255: _try_hold_legacy_flock(legacy_server_pid_path()) is acquired early, before the MCP stdio handshake. If the server exits via an unhandled path (or stays alive despite handshake failure) the lock can outlive a useful session.
  • _runtime_paths.py:148-161: server_pid_path() (the new $XDG_RUNTIME_DIR/memtomem/server.pid) vs legacy_server_pid_path() (~/.memtomem/.server.pid). Both are held in parallel during migration (server: relocate .server.pid to $XDG_RUNTIME_DIR so ~/.memtomem/ stays lazy, #412).
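For context, a minimal sketch of what the early flock acquisition and a leak-proof release could look like. This is an assumption, not the actual _try_hold_legacy_flock body (which isn't reproduced here); only the cited names are from the source, everything else (try_hold_flock, release_flock, run_server) is hypothetical, using standard fcntl.flock semantics:

```python
import fcntl
import os

def try_hold_flock(path):
    """Open (creating if needed) and take an exclusive, non-blocking
    flock on `path`. Returns the open fd on success, None if another
    process (or open file description) already holds the lock."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    os.ftruncate(fd, 0)
    os.write(fd, str(os.getpid()).encode())
    return fd

def release_flock(fd):
    """Release the lock and close the fd. Safe from any teardown path."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)

def run_server(pid_path):
    """Hypothetical entry point: the key property is that the release
    sits in a `finally`, so it runs even when the handshake raises
    something outside Exception (CancelledError, SystemExit)."""
    fd = try_hold_flock(pid_path)
    if fd is None:
        raise SystemExit("another server holds the lock")
    try:
        pass  # handshake + serve loop would go here
    finally:
        release_flock(fd)
```

Note that flock dies with the process, so the lock can only outlive its usefulness when the holder stays alive (the orphan case in this issue) or when the exit path skips the release.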

Suggested investigation (hypothesis)

  1. Confirm whether orphan processes here are the original handshake failure or a reconnect storm that the lock itself perpetuates.
  2. If the handshake fails, the server should release the legacy flock on teardown. Check the exit-path and signal handling; this matches prior experience in feedback_asyncio_swallows_systemexit.md and feedback_cancelled_error_except_gap.md: except Exception misses CancelledError, so teardowns need except BaseException plus selective re-raise.
  3. Consider replacing the "refuse to start" behavior with a liveness probe: if the PID on the legacy file is no longer a live memtomem-server process, take over the lock rather than abort. The current "pre-0.1.25 install" message is misleading when the holder is a current-build orphan.
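The liveness probe in (3) could be sketched as below. This is a hypothetical helper (lock_holder_alive is not an existing function), assuming the lock file records the holder's PID; a real implementation should also verify the process is actually a memtomem-server (e.g. via its command line) to guard against PID reuse:

```python
import os

def lock_holder_alive(pid_path):
    """Best-effort liveness probe: read the PID recorded in the lock
    file and check whether that process still exists. Returns False
    when the file is missing/garbled or the PID is gone, i.e. when
    the lock is safe to take over instead of aborting."""
    try:
        with open(pid_path) as f:
            pid = int(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        return False
    try:
        os.kill(pid, 0)  # signal 0: existence check, sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True
```

Note this only fixes the stale-file case and the misleading "pre-0.1.25" message; in the repro above the orphan is alive, so the probe would correctly report a live holder and the real fix remains releasing the lock on handshake failure.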

Out of scope

Recovery (for anyone hitting this now)

pgrep -f memtomem-server          # identify the orphan
kill <pid>
rm -f ~/.memtomem/.server.pid     # only after confirming no live holder

Labels: bug