mempalace.mcp_server has no single-instance guard — concurrent client spawns race on the palace and trigger HNSW corruption (defense-in-depth for #976) #1229

@Seph396

Description


Claude Desktop's MCP loader can spawn a second python -m mempalace.mcp_server process against the same palace before the previous process has finished its shutdown handshake. Two writer processes then operate concurrently on ~/.mempalace/palace/<collection-uuid>/, which on mempalace 3.3.2 (i.e. without PR #976's hnsw:num_threads=1 pin) reliably triggers the HNSW link_lists.bin corruption pattern from #974.

PR #976 is the correct ChromaDB-side fix and resolves the corruption itself. This issue is the complementary MemPalace-side defense: the server has no single-instance guard, so any client whose lifecycle misbehaves (Desktop today, potentially other harnesses tomorrow) can put two writers on the same palace. With #976 in place corruption is prevented, but two processes still race on the SQLite metadata layer, double-load the embedding model, and waste RAM/CPU on every overlap. A startup-time PID file + advisory lock would make MemPalace robust against any misbehaving client without requiring fixes upstream of MemPalace.

How this is different from #976 / #974

#974 documented the corruption symptom and #976 is the ChromaDB-side fix; this issue asks for the complementary MemPalace-side guard described above. Cross-ref: this is also the same family of "concurrent writer" bug as #1202 (Stop hook firing a second mine while one is already running). #1202 added a hook-side lock; this issue proposes the equivalent guard inside mcp_server itself, so any client — Desktop, Code, a third-party MCP host, a future hook variant — gets the same protection for free.

Steps to reproduce

  1. Install mempalace 3.3.2 (the version captured in evidence; behavior is structurally present on all current versions since there's no PID guard in mcp_server).
  2. Configure the MemPalace MCP server in ~/Library/Application Support/Claude/claude_desktop_config.json per the standard install instructions.
  3. Use Claude Desktop normally. The exact trigger from the client side is not yet pinned down — see "What I haven't been able to isolate" below — but in the captured 30-hour window the second-process spawn happens in two distinct patterns:
    • Pattern A (no shutdown logged): a second Initializing server... line appears ~1h17m after the first, with no Shutting down server... between them.
    • Pattern B (rapid re-spawn after intentional shutdown): an intentional shutdown is logged and then a new Initializing server... appears ~1m later, followed by another spawn ~18m after that, all without the original process being cleanly torn down.
  4. On 3.3.2 (no hnsw:num_threads pin), this overlap is the reproducer for the #974 corruption fixed by #976 — link_lists.bin bloats unboundedly (peaked at 55 GB in our environment for ~50K vectors; expected size ~30 MB).

Observed behavior — evidence from a 30-hour Desktop transport log

Concurrent-spawn instances captured in the sanitized Claude Desktop MCP transport log:

| First server `Initializing server...` | Second `Initializing server...` (overlapping) | First server final close |
| --- | --- | --- |
| 2026-04-25T02:32:02.572Z | 2026-04-25T03:49:08.358Z (1h17m later, no `Shutting down server...` between them) | 2026-04-25T04:07:23.637Z (closed unexpectedly — operator pkill during recovery) |
| 2026-04-26T06:35:37.955Z | 2026-04-26T07:10:47.037Z, then again 2026-04-26T07:29:08.614Z | 2026-04-26T08:05:41.878Z (closed unexpectedly — operator pkill during recovery) |

Full sanitized log (username/drawer-counts/AAAK content redacted, all timestamps and JSON-RPC events preserved): https://gist.github.com/Seph396/d8f724e58f066201b3cb527d0c7ffcc0
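The overlap pattern above can be checked mechanically. A minimal sketch, assuming each transport-log line begins with an ISO-8601 timestamp and contains the literal lifecycle strings (the real Desktop log may wrap these in JSON, so treat this as a starting point, not the exact parser used):

```python
import re

# Lifecycle markers as they appear in the transport log (assumed literal).
INIT = "Initializing server..."
SHUTDOWN = "Shutting down server..."
TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T[0-9:.]+Z)")

def find_overlaps(lines):
    """Return (first_init_ts, second_init_ts) pairs where a second
    'Initializing server...' appears with no intervening shutdown."""
    overlaps, live_since = [], None
    for line in lines:
        m = TS.match(line)
        ts = m.group(1) if m else "?"
        if INIT in line:
            if live_since is not None:
                # Second spawn while the first server is still live.
                overlaps.append((live_since, ts))
            live_since = ts
        elif SHUTDOWN in line:
            live_since = None
    return overlaps
```

Running this over the sanitized gist should flag the 04-25 and 04-26 pairs listed in the table.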

Important context for reading the log: the closed unexpectedly lines at 04-25 04:07:23 and 04-26 08:05:41 are not server-side crashes — they correspond to manual pkill mempalace invocations during recovery from the runaway link_lists.bin growth. The bug being reported here is the concurrent spawn, not the unexpected close.

Why this is a MemPalace-side problem (and not just a Claude Desktop problem)

I want to be careful with the scope of the claim, since I haven't independently audited Claude Desktop's MCP loader.

What I observed in this environment: the Desktop transport log shows multiple Initializing server... events without intervening Shutting down server... events for the same server tag, which is sufficient evidence that two server processes were live concurrently against the same palace.

What is publicly reported about Claude Desktop's MCP lifecycle: there is an open upstream report (anthropics/claude-code#53134) describing two internal managers (directMcpHost and LocalMcpServerManager) spawning every configured MCP server twice on Windows MSIX builds without coordinating. The Cursor community (forum thread) has reported a related "spawns happen faster than a PID lock can be established" race in their MCP loader. Independent fixes for this class of issue exist (Cresnova/claude-desktop-mcp-fix). My environment is macOS Tahoe, not Windows MSIX, so I'm not asserting it's the same exact bug — only that the pattern of MCP host loaders double-spawning servers is a known, documented class of behavior in the broader ecosystem.

Why this still belongs in MemPalace: even if every MCP host eventually fixes its own lifecycle, MemPalace today has no defense against a misbehaving client. A single-instance guard inside mempalace.mcp_server is the structural fix — any future client (Desktop today, Cursor, Continue, a third-party MCP host, a hook variant) that misbehaves can't damage the palace if the second process refuses to start.

Expected behavior

When python -m mempalace.mcp_server is invoked while another mcp_server process is already attached to the same palace, the second invocation should:

  1. Detect the existing process via PID file + advisory lock.
  2. Refuse to start, emit a clear stderr message naming the holding PID and palace path, and exit non-zero.
  3. As a result, the MCP host's transport log surfaces the failure cleanly (a closed unexpectedly event with a useful stderr trail) instead of the host silently double-attaching.

Suggested fix

A startup guard inside mempalace/mcp_server/__init__.py (or wherever the entrypoint lives — happy to PR once direction is confirmed):

  1. On startup, compute a palace-keyed lock path, e.g. <palace_root>/.mcp_server.lock.
  2. Open the lock file and attempt fcntl.flock(fd, LOCK_EX | LOCK_NB) (POSIX) / msvcrt.locking (Windows).
  3. If the lock is already held: log "MemPalace MCP server already running for palace <path> (held by PID <pid>) — refusing to start" to stderr and exit 1.
  4. If acquired: write the current PID into the file, register an atexit / signal handler to release on clean shutdown.
  5. Stale lock recovery: if the PID in the file is no longer alive (kernel panic, pkill -9), reclaim the lock — flock releases automatically when the holder dies, but the PID-file content will be stale and the new process should overwrite it.

One caveat from prior art: the Cursor forum report notes that some loaders can spawn processes "faster than a PID lock can be established." flock-based locking is the right primitive here precisely because the kernel atomically arbitrates the contending opens — this is hopefully a non-issue on macOS/Linux, but worth verifying on Windows if MemPalace supports it.

Deferring final implementation choice to maintainer judgment — the structural ask is "make mcp_server refuse to second-start against a palace it doesn't own."

What I haven't been able to isolate

  • Exact client trigger. I can confirm from the Desktop log that two server processes existed concurrently, but I haven't been able to pin down which user-side action causes the second spawn. It is not consistently correlated with Desktop restarts, sleep/wake, or specific MCP tool calls in the captured window. A maintainer fix doesn't depend on knowing the trigger — the lock guard is correct regardless — but I want to flag the gap honestly.
  • Whether this reproduces on Claude Code. The captured log is Desktop-only (~/Library/Logs/Claude/); Claude Code's MCP traffic logs to ~/.claude/projects/.../*.jsonl instead and I haven't examined those for the same pattern. The lock guard would protect both.
  • Behavior post-#976. The captured log ends 5 minutes before I pinned the install to #976's commit 0d9929c0, so I don't have empirical evidence of how a num_threads=1 build behaves under the same concurrent-spawn condition. Theory says the corruption is prevented but the resource waste / SQLite contention remains.

Related upstream

  • anthropics/claude-code#53134: two internal managers (directMcpHost and LocalMcpServerManager) spawning every configured MCP server twice on Windows MSIX builds.
  • Cursor community forum thread: MCP server spawns happening faster than a PID lock can be established.
  • Cresnova/claude-desktop-mcp-fix: independent community fix for this class of double-spawn behavior.

Environment

mempalace 3.3.2 (uv tool install, ~/.local/share/uv/tools/mempalace/)
chromadb 1.5.x (whatever 3.3.2 pins)
Python 3.11
MCP protocol 2025-11-25
Claude Desktop (claude-ai client v0.1.0 per protocol handshake)
macOS Tahoe 26.3.1
Apple M4 Mac

Sanitized 30-hour Desktop MCP transport log: https://gist.github.com/Seph396/d8f724e58f066201b3cb527d0c7ffcc0

Labels: area/mcp, bug, storage