fix(storage): stop ChromaDB from crashing when reopening an existing … by Legion345 · Pull Request #1262 · MemPalace/mempalace

Legion345 · 2026-04-28T20:18:58Z

What does this PR do?

When opening an existing palace, get_or_create_collection was called with hnsw metadata on every open. In ChromaDB 1.5.x, if that metadata differs from what's already stored, the Rust bindings segfault — no traceback, just a core dump. This is what causes the stop hook to crash at session end.

Fix: try get_collection first, fall back to create_collection only when the collection doesn't exist yet. Existing palaces open without touching their metadata; new ones are created with the full settings as before.

Credit to @jphein who described this exact approach in #1089 and has been running it on their fork since April 10 with zero crashes across 400+ starts.

How to test

Open a palace twice in the same session:

from mempalace.backends.chroma import ChromaBackend
import tempfile, os

with tempfile.TemporaryDirectory() as d:
    palace = os.path.join(d, "palace")
    backend = ChromaBackend()
    backend.get_collection(palace, collection_name="mempalace_drawers", create=True)
    backend.get_collection(palace, collection_name="mempalace_drawers", create=True)
    print("no crash")

Or run the new regression tests directly:
python -m pytest tests/test_backends.py -v -k "idempotent or preserves_existing"

Checklist

Tests pass (python -m pytest tests/ -v)
No hardcoded paths
Linter passes (ruff check .)

…palace

@Legion345

… main The "fix: _get_client() get-then-create guard" row in the fork-ahead table was misleading — the guard landed in fork as d3a2d22 on 2026-04-18 but was silently dropped during the develop merge-conflict resolution in 7e18a70 on 2026-04-25. Current fork main does not carry it; we lean on metadata consistency (hnsw:space + hnsw:num_threads:1 + _HNSW_BLOAT_GUARD applied uniformly across opens) to sidestep the ChromaDB 1.5.x metadata-mismatch SIGSEGV without the defensive guard. Issue MemPalace#1089 documented the pattern on 2026-04-21. PR MemPalace#1262 (filed 2026-04-28 by @Legion345) implements the canonical version — try get_collection / except chromadb.errors.NotFoundError → create — which is an improvement over our older InvalidCollectionException reference (NotFoundError is the renamed-in-1.5.x exception class). Once MemPalace#1262 merges, this row clears via develop sync. Until then the fork is exposed if upstream ever changes default hnsw:* metadata between releases — the same drift shape that caused the original crash.

jphein · 2026-04-30T20:36:49Z

Thanks for picking this up @Legion345 — and for the kind credit, that was generous.

A couple of corrections about the field data, since you anchored the PR description on it and I want to make sure the maintainer isn't reviewing under a wrong impression:

The "since April 10" / "400+ starts, zero crashes" line came from my #1089 issue body, and both numbers are off. The guard actually landed on our fork at d3a2d22 on 2026-04-18 (8 days later than I wrote there), and it was dropped during a develop merge-conflict resolution at 7e18a70 on 2026-04-25. So the guard was alive in fork main for 7 days, not 20. I've edited #1089 with the correction so the record there is right.

What's actually true today (and apologies — this paragraph has been edited twice already; the prior versions overcorrected in two different directions). Tracing the call graph: chromadb's get_or_create_collection is called from two places on current fork main — backends/chroma.py:1061 and mcp_server.py:295 — both with identical metadata (hnsw:space, hnsw:num_threads: 1, plus the bloat-guard keys from #1191). The fork's 2-arg ChromaBackend.get_or_create_collection(path, name) legacy shim at chroma.py:1101 (used by migrate.py:237 and cli.py:1002) delegates through get_collection(create=True) to the same metadata-passing line, so metadata is consistent across opens. The residual exposure is legacy palaces created under older chromadb whose stored metadata differs from the current expanded set — a per-palace-history risk, not a structural one. I don't have a clean recurrence count to share. Net: this PR is still genuinely needed because the chromadb-side crash class isn't fixed, and the "validated by JP's fork" framing isn't load-bearing in either direction.

On the diff itself: the approach is right. One small thing I'd flag as a positive — you're catching chromadb.errors.NotFoundError, which is the 1.5.x-renamed exception. Our older code referenced InvalidCollectionException, and yours is the right one going forward. Both regression tests cover the case cleanly. Once this merges I'll pull it in via the next develop sync.

Thanks again for taking this on, and apologies for the back-and-forth on this comment — I shouldn't have edited my own correction with another wrong claim.

…call sites The earlier 2026-04-30 wording said current fork main calls get_or_create_collection with stable/consistent metadata, sidestepping the ChromaDB 1.5.x metadata-mismatch SIGSEGV without carrying the defensive guard. That's not actually true. Audited the four get_or_create_collection call sites on current fork main: - backends/chroma.py:1040 full hnsw:* set (space + num_threads + bloat-guard) - mcp_server.py:295 full hnsw:* set - migrate.py:235 no metadata - cli.py:905 no metadata The trigger condition for the 1.5.x metadata-mismatch SIGSEGV (call- site metadata diverging from stored collection metadata) is structurally present. The fork's apparent stability is presumably because the no-metadata sites operate on different collections (mempalace_compressed, the migrate temp collection) than the full- metadata sites (mempalace_drawers) — so the divergence never reaches the same row in the same metadata table. That's a load-bearing assumption worth flagging: it's circumstantial, not structural. Once MemPalace#1262 (Legion345) lands, audit all four sites together so the guard wraps every site rather than just chroma.py.

The 2026-04-30 wordings in 393ac89 ("metadata is consistent across opens sidesteps the SIGSEGV") and 260e2a2 ("non-uniform metadata across four call sites, trigger condition structurally present") both missed the mark. Third try here, traced from the code: - chromadb's get_or_create_collection is called directly from two places: backends/chroma.py:1061 and mcp_server.py:295. Both pass identical metadata (hnsw:space, hnsw:num_threads: 1, bloat-guard). - The fork's ChromaBackend.get_or_create_collection(path, name) at chroma.py:1101 is a 2-arg legacy shim that delegates via self.get_collection(palace_path, name, create=True) to the same metadata-passing line. - migrate.py:237 and cli.py:1002 call that 2-arg shim, so they reach chromadb with full metadata too. So metadata IS consistent across all five reachable call sites, not divergent. The earlier "non-uniform" claim mistook the fork's API surface for chromadb's call signature. The residual SIGSEGV exposure is legacy palaces created under older chromadb whose stored metadata differs from the current expanded set — a per-palace-history risk, not a structural call-graph one. Live MemPalace#1262 comment and MemPalace#1089 issue body have been re-PATCHed to match. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

… update sync status MemPalace#1262 (Legion345) is path 1 of MemPalace#1089's "interim guard PR" — adds get-then-create wrapping in chromadb backend. Shepherding via review comment posted 2026-04-30. Once it merges, fork-ahead Row 15 clears via develop sync. MemPalace#1286 (our PR, filed 2026-04-30) is the _get_collection retry-once + log-on-failure improvement. Adjacent to Row 15 — when both MemPalace#1262 and MemPalace#1286 land, the _get_collection path is both crash-resilient and self-healing. Status line refreshed: develop has moved to fdfaf01 (Gemini CLI normalize MemPalace#1234, privacy consent MemPalace#1233, both 2026-04-27); next sync will clear those plus row 15 once MemPalace#1262 merges. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

…pen-crash fix(mcp_server): split get_or_create_collection on reopen (follow-up to #1262)

Three fixes landed on develop after the initial release-prep cut and were brought in via the develop merge. Document them in the 3.3.4 Bug Fixes section so the release notes reflect what users will actually receive. - #1287 - HNSW divergence floor scales with hnsw:sync_threshold (resolves a silent-fallback regression introduced by the interaction between #1191 and #1227 in this release) - #1262 - ChromaBackend get_or_create_collection split, fixing the stop-hook SIGSEGV class on legacy palaces with mismatched stored metadata (#1089) - #1288 / #1254 - repair --mode max-seq-id heuristic now decodes BLOB-typed embeddings.seq_id, restoring the un-poison path added in #1135 for palaces where chromadb 1.5.x writes seq_ids natively

The release was originally cut on 2026-04-27 but did not tag that day. Three additional bug fixes have been folded in since then (#1262, #1287, #1288) and the actual tag will happen on 2026-04-30. Update the header date to match.

The same try/except split that #1262 applied at the backend layer (ChromaBackend.get_collection) was needed at the parallel call site in mcp_server._get_collection(create=True), which carries the same metadata payload directly to chromadb's Python client. Both reopen paths in mempalace now bypass get_or_create_collection on existing collections, closing the SIGSEGV class for both tool_add_drawer and tool_diary_write (the Stop hook's path).

Catches up xdev-patches with 112 commits from MemPalace/develop, including: - v3.3.4 release - MemPalace#1262/MemPalace#1289 ChromaDB collection-reopen crash fix (relevant to long-running MCP server & mempalace-api) - MemPalace#1287 HNSW divergence floor fix - MemPalace#1288 BLOB seq_id decode in repair - MemPalace#1180 cross-wing tunnels by shared topics - MemPalace#1194 wing-slug normalization for hyphenated dirs Conflict resolution: hooks_cli.py and mcp_server.py both had local patches (6ef44cb route CC transcripts via convo_miner; 3fad61d allow leading dash) that overlap with upstream fixes (MemPalace#1231, MemPalace#1194). Took upstream entirely on those two files — upstream's version handles separate transcript/project ingest, uses _mempalace_python(), and adds _pin_hnsw_threads. The local config.py regex relaxation auto-merged cleanly and is preserved. Safety tag: pre-upstream-merge-20260501-091227 (rollback target). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

…lace#1299) `mcp_server._get_collection` bypassed `ChromaBackend.get_collection` and called `client.get_collection` / `client.create_collection` without `embedding_function=`. ChromaDB 1.x does not persist the EF identity with the collection, so the MCP server's reopen silently bound chromadb's built-in `DefaultEmbeddingFunction` while the miner / Stop hook ingest path bound `mempalace.embedding.get_embedding_function()`. On bleeding-edge interpreters (python 3.14 + chromadb 1.5.x on Apple Silicon, per MemPalace#1299) the default EF's lazy ONNX provider selection could SIGSEGV the host process on first `col.add()`, killing the MCP stdio server and leaving every subsequent tool call returning `Connection closed` until Claude Code was relaunched. Reads worked because `col.get(ids=...)` and metadata fetches don't invoke the EF; the auto-ingest path worked because mining routes through the backend abstraction. Diary writes were the consistent failure surface. Resolve the EF up front (matching `ChromaBackend._resolve_embedding_function`) and pass it into both reopen branches. Falls back to the chromadb default only if `mempalace.embedding.get_embedding_function` itself raises. Regression test patches the chromadb client class to capture `embedding_function=` on every `get_collection` / `create_collection` call from `_get_collection(create=True)` and `_get_collection()`, and fails if any call omits it. Follow-up to MemPalace#1262 / MemPalace#1289 (which fixed the metadata-mismatch SIGSEGV path); this addresses the EF-mismatch SIGSEGV path on the same surface.

… (upstream MemPalace#1262)

…SEGV (upstream MemPalace#1289) Mirrors the backend-layer fix from MemPalace#1262 at the MCP server's _get_collection cache rebuild path. Also imports _HNSW_BLOAT_GUARD for correct metadata on fresh collection creation.

Sync with upstream/develop via cherry-pick of 9 critical bugfixes: - HNSW index bloat prevention (MemPalace#1191) - SIGSEGV guards on collection reopen (MemPalace#1262, MemPalace#1289) - blob-seq marker fast-path (MemPalace#1177) - palace_graph None-metadata guard (MemPalace#1201) - palace_graph security tunnels (MemPalace#1168) - hyphenated wing name normalization (MemPalace#1197) - searcher _tokenize None guard (MemPalace#1198) - mcp_server diary topic sanitize (MemPalace#936) Tests: 1459 default + 113 benchmark + 6/7 stress all pass. (random-kill stress test was failing on dev pre-merge; not regressed.)

…Palace#1289/MemPalace#1303 After the 2026-05-03 develop sync to commit 1888b67: - Row 15 (_get_client get-then-create guard) clears: MemPalace#1262 fixed ChromaBackend; MemPalace#1289 fixed mcp_server's parallel call site; MemPalace#1303 plumbed embedding_function= through the reopen path. - Trailing tracking paragraph updated (sync target, count of merged/ open/closed PRs, list of features brought in). - Upstream-PRs table: MemPalace#1262 → merged, MemPalace#1286 flagged CONFLICTING with rebase queued, MemPalace#1289 + MemPalace#1303 added as merged. Note: fork-only _get_session_recovery_collection still uses the older get_or_create_collection pattern; theoretical SIGSEGV exposure on legacy recovery collections only. Tracked as follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

…attern to recovery collection `_get_session_recovery_collection` (fork-only, introduced for the checkpoint collection split in row 23) used the older `get_or_create_collection` pattern that MemPalace#1262/MemPalace#1289/MemPalace#1303 just fixed in `_get_collection`. Same theoretical SIGSEGV class on chromadb 1.5.x when stored metadata diverges from the call site, plus the `embedding_function=` omission MemPalace#1303 caught. Both branches now resolve EF via `ChromaBackend._resolve_embedding_function()` and the create-path uses try `get_collection` / except `NotFoundError` → `create_collection`. Recovery collection's metadata is intentionally lighter than the main collection's (no `_HNSW_BLOAT_GUARD`); preserved that asymmetry — recovery is small enough that segment bloat isn't a concern. Existing `test_session_recovery.py` covers happy paths; 83/83 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

fix(storage): stop ChromaDB from crashing when reopening an existing …

d7f4638

…palace

Legion345 requested review from bensig, igorls and milla-jovovich as code owners April 28, 2026 20:18

igorls merged commit 73541d1 into MemPalace:develop May 1, 2026
6 checks passed

igorls mentioned this pull request May 1, 2026

fix(mcp_server): split get_or_create_collection on reopen (follow-up to #1262) #1289

Merged

4 tasks

igorls added a commit that referenced this pull request May 1, 2026

Merge pull request #1289 from MemPalace/fix/mcp-server-collection-reo…

5e540da

…pen-crash fix(mcp_server): split get_or_create_collection on reopen (follow-up to #1262)

bespined mentioned this pull request May 1, 2026

MCP server SIGSEGV on tool_diary_write in 3.3.4 — Connection closed; server unrecoverable in-session #1299

Closed

igorls mentioned this pull request May 1, 2026

fix(mcp_server): pass embedding_function= on collection reopen (#1299) #1303

Merged

5 tasks

Scorpion1221 pushed a commit to Scorpion1221/mempalace that referenced this pull request May 2, 2026

fix(chroma): split get_or_create_collection to prevent reopen SIGSEGV…

c8b8a46

… (upstream MemPalace#1262)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(storage): stop ChromaDB from crashing when reopening an existing …#1262

fix(storage): stop ChromaDB from crashing when reopening an existing …#1262
igorls merged 1 commit intoMemPalace:developfrom
Legion345:fix/stop-hook-crash

Legion345 commented Apr 28, 2026

Uh oh!

jphein commented Apr 30, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Legion345 commented Apr 28, 2026

What does this PR do?

How to test

Uh oh!

jphein commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jphein commented Apr 30, 2026 •

edited

Loading