Skip to content

fix(storage): stop ChromaDB from crashing when reopening an existing …#1262

Merged
igorls merged 1 commit intoMemPalace:developfrom
Legion345:fix/stop-hook-crash
May 1, 2026
Merged

fix(storage): stop ChromaDB from crashing when reopening an existing …#1262
igorls merged 1 commit intoMemPalace:developfrom
Legion345:fix/stop-hook-crash

Conversation

@Legion345
Copy link
Copy Markdown
Contributor

What does this PR do?

Fixes #1089.

When opening an existing palace, get_or_create_collection was called with hnsw metadata on every open. In ChromaDB 1.5.x, if that metadata differs from what's already stored, the Rust bindings segfault — no traceback, just a core dump. This is what causes the stop hook to crash at session end.

Fix: try get_collection first, fall back to create_collection only when the collection doesn't exist yet. Existing palaces open without touching their metadata; new ones are created with the full settings as before.

Credit to @jphein who described this exact approach in #1089 and has been running it on their fork since April 10 with zero crashes across 400+ starts.

How to test

Open a palace twice in the same session:

from mempalace.backends.chroma import ChromaBackend
import tempfile, os

with tempfile.TemporaryDirectory() as d:
    palace = os.path.join(d, "palace")
    backend = ChromaBackend()
    backend.get_collection(palace, collection_name="mempalace_drawers", create=True)
    backend.get_collection(palace, collection_name="mempalace_drawers", create=True)
    print("no crash") 

Or run the new regression tests directly:
python -m pytest tests/test_backends.py -v -k "idempotent or preserves_existing"

Checklist

  • Tests pass (python -m pytest tests/ -v)
  • No hardcoded paths
  • Linter passes (ruff check .)

jphein added a commit to jphein/mempalace that referenced this pull request Apr 30, 2026
… main

The "fix: _get_client() get-then-create guard" row in the fork-ahead
table was misleading — the guard landed in fork as d3a2d22 on
2026-04-18 but was silently dropped during the develop merge-conflict
resolution in 7e18a70 on 2026-04-25. Current fork main does not carry
it; we lean on metadata consistency (hnsw:space + hnsw:num_threads:1
+ _HNSW_BLOAT_GUARD applied uniformly across opens) to sidestep the
ChromaDB 1.5.x metadata-mismatch SIGSEGV without the defensive guard.

Issue MemPalace#1089 documented the pattern on 2026-04-21. PR MemPalace#1262 (filed
2026-04-28 by @Legion345) implements the canonical version — try
get_collection / except chromadb.errors.NotFoundError → create — which
is an improvement over our older InvalidCollectionException reference
(NotFoundError is the renamed-in-1.5.x exception class).

Once MemPalace#1262 merges, this row clears via develop sync. Until then the
fork is exposed if upstream ever changes default hnsw:* metadata
between releases — the same drift shape that caused the original
crash.
@jphein
Copy link
Copy Markdown
Collaborator

jphein commented Apr 30, 2026

Thanks for picking this up @Legion345 — and for the kind credit, that was generous.

A couple of corrections about the field data, since you anchored the PR description on it and I want to make sure the maintainer isn't reviewing under a wrong impression:

The "since April 10" / "400+ starts, zero crashes" line came from my #1089 issue body, and both numbers are off. The guard actually landed on our fork at d3a2d22 on 2026-04-18 (8 days later than I wrote there), and it was dropped during a develop merge-conflict resolution at 7e18a70 on 2026-04-25. So the guard was alive in fork main for 7 days, not 20. I've edited #1089 with the correction so the record there is right.

What's actually true today (and apologies — this paragraph has been edited twice already; the prior versions overcorrected in two different directions). Tracing the call graph: chromadb's get_or_create_collection is called from two places on current fork main — backends/chroma.py:1061 and mcp_server.py:295 — both with identical metadata (hnsw:space, hnsw:num_threads: 1, plus the bloat-guard keys from #1191). The fork's 2-arg ChromaBackend.get_or_create_collection(path, name) legacy shim at chroma.py:1101 (used by migrate.py:237 and cli.py:1002) delegates through get_collection(create=True) to the same metadata-passing line, so metadata is consistent across opens. The residual exposure is legacy palaces created under older chromadb whose stored metadata differs from the current expanded set — a per-palace-history risk, not a structural one. I don't have a clean recurrence count to share. Net: this PR is still genuinely needed because the chromadb-side crash class isn't fixed, and the "validated by JP's fork" framing isn't load-bearing in either direction.

On the diff itself: the approach is right. One small thing I'd flag as a positive — you're catching chromadb.errors.NotFoundError, which is the 1.5.x-renamed exception. Our older code referenced InvalidCollectionException, and yours is the right one going forward. Both regression tests cover the case cleanly. Once this merges I'll pull it in via the next develop sync.

Thanks again for taking this on, and apologies for the back-and-forth on this comment — I shouldn't have edited my own correction with another wrong claim.

jphein added a commit to jphein/mempalace that referenced this pull request Apr 30, 2026
…call sites

The earlier 2026-04-30 wording said current fork main calls
get_or_create_collection with stable/consistent metadata, sidestepping
the ChromaDB 1.5.x metadata-mismatch SIGSEGV without carrying the
defensive guard. That's not actually true.

Audited the four get_or_create_collection call sites on current fork
main:

- backends/chroma.py:1040     full hnsw:* set (space + num_threads + bloat-guard)
- mcp_server.py:295           full hnsw:* set
- migrate.py:235              no metadata
- cli.py:905                  no metadata

The trigger condition for the 1.5.x metadata-mismatch SIGSEGV (call-
site metadata diverging from stored collection metadata) is
structurally present. The fork's apparent stability is presumably
because the no-metadata sites operate on different collections
(mempalace_compressed, the migrate temp collection) than the full-
metadata sites (mempalace_drawers) — so the divergence never reaches
the same row in the same metadata table.

That's a load-bearing assumption worth flagging: it's circumstantial,
not structural. Once MemPalace#1262 (Legion345) lands, audit all four sites
together so the guard wraps every site rather than just chroma.py.
jphein added a commit to jphein/mempalace that referenced this pull request Apr 30, 2026
The 2026-04-30 wordings in 393ac89 ("metadata is consistent across opens
sidesteps the SIGSEGV") and 260e2a2 ("non-uniform metadata across four
call sites, trigger condition structurally present") both missed the
mark. Third try here, traced from the code:

- chromadb's get_or_create_collection is called directly from two
  places: backends/chroma.py:1061 and mcp_server.py:295. Both pass
  identical metadata (hnsw:space, hnsw:num_threads: 1, bloat-guard).

- The fork's ChromaBackend.get_or_create_collection(path, name) at
  chroma.py:1101 is a 2-arg legacy shim that delegates via
  self.get_collection(palace_path, name, create=True) to the same
  metadata-passing line.

- migrate.py:237 and cli.py:1002 call that 2-arg shim, so they reach
  chromadb with full metadata too.

So metadata IS consistent across all five reachable call sites, not
divergent. The earlier "non-uniform" claim mistook the fork's API
surface for chromadb's call signature.

The residual SIGSEGV exposure is legacy palaces created under older
chromadb whose stored metadata differs from the current expanded set —
a per-palace-history risk, not a structural call-graph one.

Live MemPalace#1262 comment and MemPalace#1089 issue body have been re-PATCHed to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein added a commit to jphein/mempalace that referenced this pull request Apr 30, 2026
… update sync status

MemPalace#1262 (Legion345) is path 1 of MemPalace#1089's "interim guard PR" — adds
get-then-create wrapping in chromadb backend. Shepherding via review
comment posted 2026-04-30. Once it merges, fork-ahead Row 15 clears
via develop sync.

MemPalace#1286 (our PR, filed 2026-04-30) is the _get_collection retry-once +
log-on-failure improvement. Adjacent to Row 15 — when both MemPalace#1262 and
MemPalace#1286 land, the _get_collection path is both crash-resilient and
self-healing.

Status line refreshed: develop has moved to fdfaf01 (Gemini CLI
normalize MemPalace#1234, privacy consent MemPalace#1233, both 2026-04-27); next sync
will clear those plus row 15 once MemPalace#1262 merges.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@igorls igorls merged commit 73541d1 into MemPalace:develop May 1, 2026
6 checks passed
igorls added a commit that referenced this pull request May 1, 2026
…pen-crash

fix(mcp_server): split get_or_create_collection on reopen (follow-up to #1262)
igorls added a commit that referenced this pull request May 1, 2026
Three fixes landed on develop after the initial release-prep cut and
were brought in via the develop merge. Document them in the 3.3.4
Bug Fixes section so the release notes reflect what users will
actually receive.

- #1287 - HNSW divergence floor scales with hnsw:sync_threshold
  (resolves a silent-fallback regression introduced by the
  interaction between #1191 and #1227 in this release)
- #1262 - ChromaBackend get_or_create_collection split, fixing the
  stop-hook SIGSEGV class on legacy palaces with mismatched stored
  metadata (#1089)
- #1288 / #1254 - repair --mode max-seq-id heuristic now decodes
  BLOB-typed embeddings.seq_id, restoring the un-poison path added
  in #1135 for palaces where chromadb 1.5.x writes seq_ids natively
igorls added a commit that referenced this pull request May 1, 2026
The release was originally cut on 2026-04-27 but did not tag that day.
Three additional bug fixes have been folded in since then (#1262,
#1287, #1288) and the actual tag will happen on 2026-04-30. Update
the header date to match.
igorls added a commit that referenced this pull request May 1, 2026
The same try/except split that #1262 applied at the backend layer
(ChromaBackend.get_collection) was needed at the parallel call site
in mcp_server._get_collection(create=True), which carries the same
metadata payload directly to chromadb's Python client. Both reopen
paths in mempalace now bypass get_or_create_collection on existing
collections, closing the SIGSEGV class for both tool_add_drawer
and tool_diary_write (the Stop hook's path).
xcarbo added a commit to xcarbo/mempalace that referenced this pull request May 1, 2026
Catches up xdev-patches with 112 commits from MemPalace/develop, including:
- v3.3.4 release
- MemPalace#1262/MemPalace#1289 ChromaDB collection-reopen crash fix (relevant to long-running
  MCP server & mempalace-api)
- MemPalace#1287 HNSW divergence floor fix
- MemPalace#1288 BLOB seq_id decode in repair
- MemPalace#1180 cross-wing tunnels by shared topics
- MemPalace#1194 wing-slug normalization for hyphenated dirs

Conflict resolution: hooks_cli.py and mcp_server.py both had local patches
(6ef44cb route CC transcripts via convo_miner; 3fad61d allow leading dash)
that overlap with upstream fixes (MemPalace#1231, MemPalace#1194). Took upstream entirely on
those two files — upstream's version handles separate transcript/project
ingest, uses _mempalace_python(), and adds _pin_hnsw_threads. The local
config.py regex relaxation auto-merged cleanly and is preserved.

Safety tag: pre-upstream-merge-20260501-091227 (rollback target).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
mjc pushed a commit to mjc/mempalace that referenced this pull request May 2, 2026
…lace#1299)

`mcp_server._get_collection` bypassed `ChromaBackend.get_collection`
and called `client.get_collection` / `client.create_collection` without
`embedding_function=`. ChromaDB 1.x does not persist the EF identity
with the collection, so the MCP server's reopen silently bound
chromadb's built-in `DefaultEmbeddingFunction` while the miner / Stop
hook ingest path bound `mempalace.embedding.get_embedding_function()`.

On bleeding-edge interpreters (python 3.14 + chromadb 1.5.x on Apple
Silicon, per MemPalace#1299) the default EF's lazy ONNX provider selection could
SIGSEGV the host process on first `col.add()`, killing the MCP stdio
server and leaving every subsequent tool call returning
`Connection closed` until Claude Code was relaunched. Reads worked
because `col.get(ids=...)` and metadata fetches don't invoke the EF;
the auto-ingest path worked because mining routes through the backend
abstraction. Diary writes were the consistent failure surface.

Resolve the EF up front (matching `ChromaBackend._resolve_embedding_function`)
and pass it into both reopen branches. Falls back to the chromadb default
only if `mempalace.embedding.get_embedding_function` itself raises.

Regression test patches the chromadb client class to capture
`embedding_function=` on every `get_collection` / `create_collection`
call from `_get_collection(create=True)` and `_get_collection()`, and
fails if any call omits it.

Follow-up to MemPalace#1262 / MemPalace#1289 (which fixed the metadata-mismatch SIGSEGV
path); this addresses the EF-mismatch SIGSEGV path on the same surface.
Scorpion1221 pushed a commit to Scorpion1221/mempalace that referenced this pull request May 2, 2026
Scorpion1221 pushed a commit to Scorpion1221/mempalace that referenced this pull request May 2, 2026
…SEGV (upstream MemPalace#1289)

Mirrors the backend-layer fix from MemPalace#1262 at the MCP server's _get_collection
cache rebuild path. Also imports _HNSW_BLOAT_GUARD for correct metadata on
fresh collection creation.
Scorpion1221 pushed a commit to Scorpion1221/mempalace that referenced this pull request May 2, 2026
Sync with upstream/develop via cherry-pick of 9 critical bugfixes:
- HNSW index bloat prevention (MemPalace#1191)
- SIGSEGV guards on collection reopen (MemPalace#1262, MemPalace#1289)
- blob-seq marker fast-path (MemPalace#1177)
- palace_graph None-metadata guard (MemPalace#1201)
- palace_graph security tunnels (MemPalace#1168)
- hyphenated wing name normalization (MemPalace#1197)
- searcher _tokenize None guard (MemPalace#1198)
- mcp_server diary topic sanitize (MemPalace#936)

Tests: 1459 default + 113 benchmark + 6/7 stress all pass.
(random-kill stress test was failing on dev pre-merge; not regressed.)
jphein added a commit to jphein/mempalace that referenced this pull request May 3, 2026
…Palace#1289/MemPalace#1303

After the 2026-05-03 develop sync to commit 1888b67:
- Row 15 (_get_client get-then-create guard) clears: MemPalace#1262 fixed
  ChromaBackend; MemPalace#1289 fixed mcp_server's parallel call site; MemPalace#1303
  plumbed embedding_function= through the reopen path.
- Trailing tracking paragraph updated (sync target, count of merged/
  open/closed PRs, list of features brought in).
- Upstream-PRs table: MemPalace#1262 → merged, MemPalace#1286 flagged CONFLICTING with
  rebase queued, MemPalace#1289 + MemPalace#1303 added as merged.

Note: fork-only _get_session_recovery_collection still uses the older
get_or_create_collection pattern; theoretical SIGSEGV exposure on
legacy recovery collections only. Tracked as follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein added a commit to jphein/mempalace that referenced this pull request May 3, 2026
…attern to recovery collection

`_get_session_recovery_collection` (fork-only, introduced for the
checkpoint collection split in row 23) used the older
`get_or_create_collection` pattern that MemPalace#1262/MemPalace#1289/MemPalace#1303 just
fixed in `_get_collection`. Same theoretical SIGSEGV class on
chromadb 1.5.x when stored metadata diverges from the call site,
plus the `embedding_function=` omission MemPalace#1303 caught.

Both branches now resolve EF via `ChromaBackend._resolve_embedding_function()`
and the create-path uses try `get_collection` / except
`NotFoundError` → `create_collection`. Recovery collection's
metadata is intentionally lighter than the main collection's
(no `_HNSW_BLOAT_GUARD`); preserved that asymmetry — recovery is
small enough that segment bloat isn't a concern.

Existing `test_session_recovery.py` covers happy paths; 83/83 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SIGSEGV on get_or_create_collection when call-site metadata differs from stored collection metadata (chromadb 1.5.x)

3 participants