mempalace.mcp_server has no single-instance guard — concurrent client spawns race on the palace and trigger HNSW corruption (defense-in-depth for #976) #1229

@Seph396

Description


Claude Desktop's MCP loader can spawn a second python -m mempalace.mcp_server process against the same palace before the previous process has finished its shutdown handshake. Two writer processes then operate concurrently on ~/.mempalace/palace/<collection-uuid>/, which on mempalace 3.3.2 (i.e. without PR #976's hnsw:num_threads=1 pin) reliably triggers the HNSW link_lists.bin corruption pattern from #974.

PR #976 is the correct ChromaDB-side fix and resolves the corruption itself. This issue is the complementary MemPalace-side defense: the server has no single-instance guard, so any client whose lifecycle misbehaves (Desktop today, potentially other harnesses tomorrow) can put two writers on the same palace. With #976 in place corruption is prevented, but two processes still race on the SQLite metadata layer, double-load the embedding model, and waste RAM/CPU on every overlap. A startup-time PID file + advisory lock would make MemPalace robust against any misbehaving client without requiring fixes upstream of MemPalace.

How this is different from #976 / #974

#974 documented the corruption symptom and #976 is the ChromaDB-side fix; this issue asks for the complementary MemPalace-side guard described above. Cross-ref: this is also the same family of "concurrent writer" bug as #1202 (Stop hook firing a second mine while one is already running). #1202 added a hook-side lock; this issue proposes the equivalent guard inside mcp_server itself, so any client — Desktop, Code, a third-party MCP host, a future hook variant — gets the same protection for free.

Steps to reproduce

  1. Install mempalace 3.3.2 (the version captured in evidence; behavior is structurally present on all current versions since there's no PID guard in mcp_server).
  2. Configure the MemPalace MCP server in ~/Library/Application Support/Claude/claude_desktop_config.json per the standard install instructions.
  3. Use Claude Desktop normally. The exact trigger from the client side is not yet pinned down — see "What I haven't been able to isolate" below — but in the captured 30-hour window the second-process spawn happens in two distinct patterns:
    • Pattern A (no shutdown logged): a second Initializing server... line appears ~1h17m after the first, with no Shutting down server... between them.
    • Pattern B (rapid re-spawn after intentional shutdown): an intentional shutdown is logged and then a new Initializing server... appears ~1m later, followed by another spawn ~18m after that, all without the original process being cleanly torn down.
  4. On 3.3.2 (no hnsw:num_threads pin), this overlap is the reproducer for the #974 corruption fixed by #976 — link_lists.bin bloats unboundedly (peaked at 55 GB in our environment for ~50K vectors; expected size ~30 MB).

Observed behavior — evidence from a 30-hour Desktop transport log

Concurrent-spawn instances captured in the sanitized Claude Desktop MCP transport log:

| First server `Initializing server...` | Second `Initializing server...` (overlapping) | First server final close |
| --- | --- | --- |
| 2026-04-25T02:32:02.572Z | 2026-04-25T03:49:08.358Z (1h17m later, no `Shutting down server...` between them) | 2026-04-25T04:07:23.637Z (closed unexpectedly — operator pkill during recovery) |
| 2026-04-26T06:35:37.955Z | 2026-04-26T07:10:47.037Z, then again 2026-04-26T07:29:08.614Z | 2026-04-26T08:05:41.878Z (closed unexpectedly — operator pkill during recovery) |

Full sanitized log (username/drawer-counts/AAAK content redacted, all timestamps and JSON-RPC events preserved): https://gist.github.com/Seph396/d8f724e58f066201b3cb527d0c7ffcc0
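The overlap pattern above can be checked mechanically. A minimal sketch, assuming each transport-log line begins with an ISO-8601 timestamp and contains the literal lifecycle strings (the real Desktop log may wrap these in JSON, so treat this as a starting point, not the exact parser used):

```python
import re

# Lifecycle markers as they appear in the transport log (assumed literal).
INIT = "Initializing server..."
SHUTDOWN = "Shutting down server..."
TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T[0-9:.]+Z)")

def find_overlaps(lines):
    """Return (first_init_ts, second_init_ts) pairs where a second
    'Initializing server...' appears with no intervening shutdown."""
    overlaps, live_since = [], None
    for line in lines:
        m = TS.match(line)
        ts = m.group(1) if m else "?"
        if INIT in line:
            if live_since is not None:
                # Second spawn while the first server is still live.
                overlaps.append((live_since, ts))
            live_since = ts
        elif SHUTDOWN in line:
            live_since = None
    return overlaps
```

Running this over the sanitized gist should flag the 04-25 and 04-26 pairs listed in the table.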

Important context for reading the log: the closed unexpectedly lines at 04-25 04:07:23 and 04-26 08:05:41 are not server-side crashes — they correspond to manual pkill mempalace invocations during recovery from the runaway link_lists.bin growth. The bug being reported here is the concurrent spawn, not the unexpected close.

Why this is a MemPalace-side problem (and not just a Claude Desktop problem)

I want to be careful with the scope of the claim, since I haven't independently audited Claude Desktop's MCP loader.

What I observed in this environment: the Desktop transport log shows multiple Initializing server... events without intervening Shutting down server... events for the same server tag, which is sufficient evidence that two server processes were live concurrently against the same palace.

What is publicly reported about Claude Desktop's MCP lifecycle: there is an open upstream report (anthropics/claude-code#53134) describing two internal managers (directMcpHost and LocalMcpServerManager) spawning every configured MCP server twice on Windows MSIX builds without coordinating. The Cursor community (forum thread) has reported a related "spawns happen faster than a PID lock can be established" race in their MCP loader. Independent fixes for this class of issue exist (Cresnova/claude-desktop-mcp-fix). My environment is macOS Tahoe, not Windows MSIX, so I'm not asserting it's the same exact bug — only that the pattern of MCP host loaders double-spawning servers is a known, documented class of behavior in the broader ecosystem.

Why this still belongs in MemPalace: even if every MCP host eventually fixes its own lifecycle, MemPalace today has no defense against a misbehaving client. A single-instance guard inside mempalace.mcp_server is the structural fix — any future client (Desktop today, Cursor, Continue, a third-party MCP host, a hook variant) that misbehaves can't damage the palace if the second process refuses to start.

Expected behavior

When python -m mempalace.mcp_server is invoked while another mcp_server process is already attached to the same palace, the second invocation should:

  1. Detect the existing process via PID file + advisory lock.
  2. Refuse to start, emit a clear stderr message naming the holding PID and palace path, and exit non-zero.
  3. As a result, the MCP host's transport log surfaces the failure cleanly (a closed unexpectedly event with a useful stderr trail) instead of the host silently double-attaching.

Suggested fix

A startup guard inside mempalace/mcp_server/__init__.py (or wherever the entrypoint lives — happy to PR once direction is confirmed):

  1. On startup, compute a palace-keyed lock path, e.g. <palace_root>/.mcp_server.lock.
  2. Open the lock file and attempt fcntl.flock(fd, LOCK_EX | LOCK_NB) (POSIX) / msvcrt.locking (Windows).
  3. If the lock is already held: log "MemPalace MCP server already running for palace <path> (held by PID <pid>) — refusing to start" to stderr and exit 1.
  4. If acquired: write the current PID into the file, register an atexit / signal handler to release on clean shutdown.
  5. Stale lock recovery: if the PID in the file is no longer alive (kernel panic, pkill -9), reclaim the lock — flock releases automatically when the holder dies, but the PID-file content will be stale and the new process should overwrite it.

One caveat from prior art: the Cursor forum report notes that some loaders can spawn processes "faster than a PID lock can be established." flock-based locking is the right primitive here precisely because the kernel atomically arbitrates the contending opens — this is hopefully a non-issue on macOS/Linux, but worth verifying on Windows if MemPalace supports it.

Deferring final implementation choice to maintainer judgment — the structural ask is "make mcp_server refuse to second-start against a palace it doesn't own."

What I haven't been able to isolate

  • Exact client trigger. I can confirm from the Desktop log that two server processes existed concurrently, but I haven't been able to pin down which user-side action causes the second spawn. It is not consistently correlated with Desktop restarts, sleep/wake, or specific MCP tool calls in the captured window. A maintainer fix doesn't depend on knowing the trigger — the lock guard is correct regardless — but I want to flag the gap honestly.
  • Whether this reproduces on Claude Code. The captured log is Desktop-only (~/Library/Logs/Claude/); Claude Code's MCP traffic logs to ~/.claude/projects/.../*.jsonl instead and I haven't examined those for the same pattern. The lock guard would protect both.
  • Behavior post-#976. The captured log ends 5 minutes before I pinned the install to #976's commit 0d9929c0, so I don't have empirical evidence of how a num_threads=1 build behaves under the same concurrent-spawn condition. Theory says the corruption is prevented but the resource waste / SQLite contention remains.

Related upstream

  • anthropics/claude-code#53134: two internal managers (directMcpHost and LocalMcpServerManager) spawning every configured MCP server twice on Windows MSIX builds.
  • Cursor community forum thread: MCP server spawns happening faster than a PID lock can be established.
  • Cresnova/claude-desktop-mcp-fix: independent community fix for this class of double-spawn behavior.

Environment

mempalace 3.3.2 (uv tool install, ~/.local/share/uv/tools/mempalace/)
chromadb 1.5.x (whatever 3.3.2 pins)
Python 3.11
MCP protocol 2025-11-25
Claude Desktop (claude-ai client v0.1.0 per protocol handshake)
macOS Tahoe 26.3.1
Apple M4 Mac

Sanitized 30-hour Desktop MCP transport log: https://gist.github.com/Seph396/d8f724e58f066201b3cb527d0c7ffcc0

Labels: area/mcp, bug, storage