Skip to content

fix(mcp_server): split get_or_create_collection on reopen (follow-up to #1262)#1289

Merged
igorls merged 1 commit intodevelopfrom
fix/mcp-server-collection-reopen-crash
May 1, 2026
Merged

fix(mcp_server): split get_or_create_collection on reopen (follow-up to #1262)#1289
igorls merged 1 commit intodevelopfrom
fix/mcp-server-collection-reopen-crash

Conversation

@igorls
Copy link
Copy Markdown
Member

@igorls igorls commented May 1, 2026

Summary

Companion to #1262.

#1262 split get_or_create_collection into get_collection + fallback create_collection inside ChromaBackend.get_collection, fixing the chromadb 1.5.x Rust-binding SIGSEGV that fires when stored collection metadata differs from the call-site's _HNSW_BLOAT_GUARD payload.

The MCP server has the same payload at a parallel call site that #1262 did not touch:

mempalace/mcp_server.py:287

raw = client.get_or_create_collection(
    _config.collection_name,
    metadata={
        \"hnsw:space\": \"cosine\",
        \"hnsw:num_threads\": 1,
        **_HNSW_BLOAT_GUARD,
    },
)

_get_collection(create=True) is reached by tool_add_drawer (line 794) and tool_diary_write (line 1118). The Stop hook fires mempalace_diary_write at session end — exactly the crash path #1089 was filed for. Without this follow-up, legacy palaces whose stored metadata predates the bloat-guard expansion still SIGSEGV the MCP server on reopen.

Fix

Apply the same try/except split here. New behaviour:

try:
    raw = client.get_collection(_config.collection_name)
except _ChromaNotFoundError:
    raw = client.create_collection(
        _config.collection_name,
        metadata={
            \"hnsw:space\": \"cosine\",
            \"hnsw:num_threads\": 1,
            **_HNSW_BLOAT_GUARD,
        },
    )
_pin_hnsw_threads(raw)

_pin_hnsw_threads still runs after both branches, preserving the legacy-palace retrofit comment that already lived above this block.

Test plan

  • New regression test test_get_collection_create_true_avoids_get_or_create_on_reopen patches get_or_create_collection at the chromadb client class level (not the instance — chromadb's mtime-change detection rebuilds the client between calls, so an instance-level spy doesn't survive) and asserts the second _get_collection(create=True) call never reaches it.
  • Verified the test fails on develop tip without this fix (assert col2 is not None — None returned because the spy raises and the outer try/except swallows it).
  • All 65 tests in tests/test_mcp_server.py pass with the fix.
  • ruff check + ruff format --check clean (CI-pinned 0.4.x).

#1262)

#1262 split `get_or_create_collection` into `get_collection` + fallback
`create_collection` inside `ChromaBackend.get_collection`, fixing the
chromadb 1.5.x Rust-binding SIGSEGV that fires when stored collection
metadata differs from the call-site's `_HNSW_BLOAT_GUARD` payload.

The MCP server's `_get_collection(create=True)` carries the same metadata
payload at `mcp_server.py:287` and routes through chromadb's Python
client directly, bypassing the backend layer. Both `tool_add_drawer`
and `tool_diary_write` reach this site on every invocation, and the
Stop hook fires `mempalace_diary_write` at session end — which was
exactly the crash path #1089 named.

Apply the same try/except split here so legacy palaces whose stored
metadata predates the bloat-guard expansion no longer crash on the
MCP-server reopen path. Regression test patches
`get_or_create_collection` at the chromadb client class level (not the
instance — chromadb's mtime-change detection rebuilds the client between
calls, so an instance-level spy doesn't survive) and asserts the second
`_get_collection(create=True)` call never reaches it.
Copilot AI review requested due to automatic review settings May 1, 2026 01:35
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Follow-up to #1262 that applies the same ChromaDB 1.5.x crash-avoidance pattern in the MCP server: avoid get_or_create_collection(..., metadata=...) on reopen by trying get_collection() first and only creating with metadata when missing.

Changes:

  • Update mcp_server._get_collection(create=True) to use get_collection with a create_collection fallback on _ChromaNotFoundError.
  • Add a regression test ensuring reopen does not call get_or_create_collection (the metadata-mismatch SIGSEGV path).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
mempalace/mcp_server.py Splits get_or_create_collection into get_collection + create_collection fallback to avoid ChromaDB 1.5.x metadata-mismatch segfaults on reopen.
tests/test_mcp_server.py Adds regression coverage to ensure _get_collection(create=True) avoids get_or_create_collection on reopen.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@igorls igorls merged commit 5e540da into develop May 1, 2026
10 checks passed
igorls added a commit that referenced this pull request May 1, 2026
The same try/except split that #1262 applied at the backend layer
(ChromaBackend.get_collection) was needed at the parallel call site
in mcp_server._get_collection(create=True), which carries the same
metadata payload directly to chromadb's Python client. Both reopen
paths in mempalace now bypass get_or_create_collection on existing
collections, closing the SIGSEGV class for both tool_add_drawer
and tool_diary_write (the Stop hook's path).
xcarbo added a commit to xcarbo/mempalace that referenced this pull request May 1, 2026
Catches up xdev-patches with 112 commits from MemPalace/develop, including:
- v3.3.4 release
- MemPalace#1262/MemPalace#1289 ChromaDB collection-reopen crash fix (relevant to long-running
  MCP server & mempalace-api)
- MemPalace#1287 HNSW divergence floor fix
- MemPalace#1288 BLOB seq_id decode in repair
- MemPalace#1180 cross-wing tunnels by shared topics
- MemPalace#1194 wing-slug normalization for hyphenated dirs

Conflict resolution: hooks_cli.py and mcp_server.py both had local patches
(6ef44cb route CC transcripts via convo_miner; 3fad61d allow leading dash)
that overlap with upstream fixes (MemPalace#1231, MemPalace#1194). Took upstream entirely on
those two files — upstream's version handles separate transcript/project
ingest, uses _mempalace_python(), and adds _pin_hnsw_threads. The local
config.py regex relaxation auto-merged cleanly and is preserved.

Safety tag: pre-upstream-merge-20260501-091227 (rollback target).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
mjc pushed a commit to mjc/mempalace that referenced this pull request May 2, 2026
…lace#1299)

`mcp_server._get_collection` bypassed `ChromaBackend.get_collection`
and called `client.get_collection` / `client.create_collection` without
`embedding_function=`. ChromaDB 1.x does not persist the EF identity
with the collection, so the MCP server's reopen silently bound
chromadb's built-in `DefaultEmbeddingFunction` while the miner / Stop
hook ingest path bound `mempalace.embedding.get_embedding_function()`.

On bleeding-edge interpreters (python 3.14 + chromadb 1.5.x on Apple
Silicon, per MemPalace#1299) the default EF's lazy ONNX provider selection could
SIGSEGV the host process on first `col.add()`, killing the MCP stdio
server and leaving every subsequent tool call returning
`Connection closed` until Claude Code was relaunched. Reads worked
because `col.get(ids=...)` and metadata fetches don't invoke the EF;
the auto-ingest path worked because mining routes through the backend
abstraction. Diary writes were the consistent failure surface.

Resolve the EF up front (matching `ChromaBackend._resolve_embedding_function`)
and pass it into both reopen branches. Falls back to the chromadb default
only if `mempalace.embedding.get_embedding_function` itself raises.

Regression test patches the chromadb client class to capture
`embedding_function=` on every `get_collection` / `create_collection`
call from `_get_collection(create=True)` and `_get_collection()`, and
fails if any call omits it.

Follow-up to MemPalace#1262 / MemPalace#1289 (which fixed the metadata-mismatch SIGSEGV
path); this addresses the EF-mismatch SIGSEGV path on the same surface.
Scorpion1221 pushed a commit to Scorpion1221/mempalace that referenced this pull request May 2, 2026
…SEGV (upstream MemPalace#1289)

Mirrors the backend-layer fix from MemPalace#1262 at the MCP server's _get_collection
cache rebuild path. Also imports _HNSW_BLOAT_GUARD for correct metadata on
fresh collection creation.
Scorpion1221 pushed a commit to Scorpion1221/mempalace that referenced this pull request May 2, 2026
Sync with upstream/develop via cherry-pick of 9 critical bugfixes:
- HNSW index bloat prevention (MemPalace#1191)
- SIGSEGV guards on collection reopen (MemPalace#1262, MemPalace#1289)
- blob-seq marker fast-path (MemPalace#1177)
- palace_graph None-metadata guard (MemPalace#1201)
- palace_graph security tunnels (MemPalace#1168)
- hyphenated wing name normalization (MemPalace#1197)
- searcher _tokenize None guard (MemPalace#1198)
- mcp_server diary topic sanitize (MemPalace#936)

Tests: 1459 default + 113 benchmark + 6/7 stress all pass.
(random-kill stress test was failing on dev pre-merge; not regressed.)
jphein added a commit to jphein/mempalace that referenced this pull request May 3, 2026
…Palace#1289/MemPalace#1303

After the 2026-05-03 develop sync to commit 1888b67:
- Row 15 (_get_client get-then-create guard) clears: MemPalace#1262 fixed
  ChromaBackend; MemPalace#1289 fixed mcp_server's parallel call site; MemPalace#1303
  plumbed embedding_function= through the reopen path.
- Trailing tracking paragraph updated (sync target, count of merged/
  open/closed PRs, list of features brought in).
- Upstream-PRs table: MemPalace#1262 → merged, MemPalace#1286 flagged CONFLICTING with
  rebase queued, MemPalace#1289 + MemPalace#1303 added as merged.

Note: fork-only _get_session_recovery_collection still uses the older
get_or_create_collection pattern; theoretical SIGSEGV exposure on
legacy recovery collections only. Tracked as follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein added a commit to jphein/mempalace that referenced this pull request May 3, 2026
…attern to recovery collection

`_get_session_recovery_collection` (fork-only, introduced for the
checkpoint collection split in row 23) used the older
`get_or_create_collection` pattern that MemPalace#1262/MemPalace#1289/MemPalace#1303 just
fixed in `_get_collection`. Same theoretical SIGSEGV class on
chromadb 1.5.x when stored metadata diverges from the call site,
plus the `embedding_function=` omission MemPalace#1303 caught.

Both branches now resolve EF via `ChromaBackend._resolve_embedding_function()`
and the create-path uses try `get_collection` / except
`NotFoundError` → `create_collection`. Recovery collection's
metadata is intentionally lighter than the main collection's
(no `_HNSW_BLOAT_GUARD`); preserved that asymmetry — recovery is
small enough that segment bloat isn't a concern.

Existing `test_session_recovery.py` covers happy paths; 83/83 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein added a commit to jphein/mempalace that referenced this pull request May 3, 2026
A small improvement to `_get_collection`: log the exception on failure
(instead of returning `None` silently) and retry once after clearing
the cached client and collection. Behaviour on the happy path is
unchanged.

## Why

The current shape (`try: ... except Exception: return None`) has two
costs in practice on a long-running deployment:

1. Invisible incidents. When a transient ChromaDB error poisons the
   cached `_client_cache` / `_collection_cache`, every subsequent
   call returns `None` but the operator never sees the underlying
   exception in the log. Debugging has to start from the symptoms —
   404s, "no palace" responses — without the actual stack.
2. No self-healing. Once the cache is stale, the only way back to a
   working state is a process restart. A single retry that clears
   the caches first resolves the most common shape (a transient
   error left bad state behind) without operator intervention.

## What changes

- One `logger.exception(...)` per failed attempt, including the
  palace path so multi-palace deployments can disambiguate.
- One retry after the first failure, with `_client_cache`,
  `_collection_cache`, `_metadata_cache`, `_metadata_cache_time`
  reset before the second attempt so the retry rebuilds from
  scratch rather than reusing a poisoned handle.
- Happy path (first attempt succeeds) is unchanged: same return,
  same caching, no extra log line.

## Why this is still useful after MemPalace#1289 / MemPalace#1303

MemPalace#1289 split `get_or_create_collection` into `get_collection` +
`create_collection` (the metadata-mismatch SIGSEGV path); MemPalace#1303
plumbed `embedding_function=` through both reopen branches. Those
fix specific *crash* shapes. This PR addresses the *aftermath* —
when an error of any class poisons the client/collection cache and
every subsequent call returns `None` invisibly. Retry-once
self-heals; logging makes the underlying exception visible. The
retry loop sits *above* the get-then-create logic, so retries
benefit from MemPalace#1289's correct call shape and MemPalace#1303's EF binding.

## Rebased on top of MemPalace#1289 + MemPalace#1303

This PR was filed before MemPalace#1289/MemPalace#1303 merged. The original two
commits (2038269 + 757eb7a) wrapped the older `get_or_create`
body. Rebased onto the new `try get_collection / except
NotFoundError → create_collection` body — the retry/log envelope
is unchanged, only the inner body picks up the merged fixes.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants