Conversation
…event SIGSEGV on stale HNSW (#1121, #1132, #1263) PR #1173 wired quarantine_stale_hnsw into the static make_client() helper but not into the instance _client() method. As a result every non-MCP entry point (CLI mining, search, repair, status) — which all use get_collection / _get_or_create_collection / _client() — skipped the cold-start quarantine pass and could SIGSEGV on a stale HNSW segment left over from a partial flush, replicated palace, or crashed-mid-write. Refactor: extract the (_fix_blob_seq_ids + gated quarantine_stale_hnsw) pre-open pass into a single private static helper ChromaBackend._prepare_palace_for_open(). Both make_client() and _client() now route through it, so the _quarantined_paths once-per- palace-per-process gate is preserved (no runtime thrash on hot paths) and behaviour stays identical — the fix is purely about extending the existing protection to the path that was missing it. Tests: - test_client_quarantines_corrupt_segment_on_first_open mirrors the existing make_client test and verifies _client() actually renames a corrupt segment on first open. - test_client_quarantines_only_on_first_call_per_palace verifies the cache gate prevents re-running quarantine across repeated _client() calls — important because _client() is hit on every backend op. Closes #1121. Closes #1132. Closes #1263. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
CI requires whole-file format on touched files; pre-existing drift only.
xcarbo
added a commit
to xcarbo/mempalace
that referenced
this pull request
May 4, 2026
Catches up on a month of upstream work. Highlights pulled in: - MemPalace#1306 searcher: hybrid candidate union (vector ∪ BM25 reranking pool) - MemPalace#1325 mcp security: omit absolute paths from tool_get_drawer / tool_status - MemPalace#1322 chroma: wire quarantine_stale_hnsw to prevent SIGSEGV on stale HNSW - MemPalace#1320 mcp: forward valid_to / source params in kg_add / kg_invalidate - MemPalace#1321 cli: honor --palace flag in cmd_init - MemPalace#1314 kg temporal params fix - MemPalace#1244 cli: cmd_compress writes to mempalace_closets so palace can read - MemPalace#1243 mcp: case-insensitive agent name in diary_write/diary_read - MemPalace#1303 mcp_server: pass embedding_function= on collection reopen - MemPalace#1076/MemPalace#1077 hooks: quote CLAUDE_PLUGIN_ROOT / CODEX_PLUGIN_ROOT in hooks.json - Various ruff format passes on touched files Conflict resolution (CHANGELOG.md only — code files all auto-merged): - 3.3.5 unreleased section header from upstream kept above 3.3.4 - 3.3.4 section: kept our 2026-04-30 release date; merged upstream's new MemPalace#1299 SIGSEGV-on-default-EF entry in alongside our existing topic-tunnels (MemPalace#1194/MemPalace#1195/MemPalace#1197), HNSW-bloat (MemPalace#1191), max_seq_id (MemPalace#1135), and auto-ingest (MemPalace#1230/MemPalace#1231) entries. Kept our richer topic-tunnels detail (upstream's version was a strict subset). xdev patches preserved (still on this branch, untouched by merge): - 6ef44cb fix(hooks): route CC transcripts via convo_miner with cwd-based wings - 3fad61d fix(config): allow leading dash in wing names Not pushed to origin — run tests locally and decide when to push. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
4 tasks
This was referenced May 6, 2026
YulelogPagoda
pushed a commit
to YulelogPagoda/mempalace
that referenced
this pull request
May 6, 2026
…1286) A transient chromadb exception inside `_get_collection` was swallowed by the bare `except Exception: return None`, leaving every subsequent tool call hitting the same poisoned cache silently. The fix wraps the body in a `for attempt in range(2)` loop: on attempt 0 failure, log via `logger.exception(...)` and clear `_client_cache` / `_collection_cache` / `_metadata_cache` so the next iteration forces `_get_client()` to rebuild from scratch — that path now re-runs `quarantine_stale_hnsw` (per MemPalace#1322), so the second attempt heals the common stale-handle case automatically. If both attempts fail, return `None` (matches the prior contract for permanent failures). Two new tests in `tests/test_mcp_server.py::TestCacheInvalidation`: - `test_get_collection_retries_once_on_exception` — first attempt raises via a monkeypatched `_get_client`, second attempt succeeds; assert the caller gets the collection back, not None. - `test_get_collection_returns_none_after_two_failures` — both attempts fail, assert we exhaust the loop and return None (no infinite retry). Surgical extraction from PR MemPalace#1286, which carried the same fix idea (plus a fork-sync bundle that couldn't be merged); credit to the original author below. Co-authored-by: Jeffrey Hein <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #1121. Closes #1132. Closes #1263.
Root cause
PR #1173 wired
quarantine_stale_hnswinto the staticChromaBackend.make_client()helper, but not into the instance_client()method thatget_collection/_get_or_create_collectionroute through. As a result every non-MCP entry point — CLI mining, search, repair, status — skipped the cold-start quarantine pass and could SIGSEGV on a stale HNSW segment left over from a partial flush, a replicated palace, or a crashed-mid-write.Three issues, one missed wire.
Fix
Extract the
(_fix_blob_seq_ids + gated quarantine_stale_hnsw)pre-open pass into a single private static helperChromaBackend._prepare_palace_for_open(). Bothmake_client()and_client()now route through it.Behaviour preserved:
_quarantined_pathsonce-per-palace-per-process gate still applies, so no runtime thrash on hot paths (_client()is hit on every backend op).make_client()semantics are byte-identical to before (it now just delegates)._client()runs the prep pass only inside its existing rebuild block (cached is None or inode_changed or mtime_changed or mtime_appeared), so cached fast-path callers pay nothing.Diff shape
Tests
Two new tests mirror the existing
make_clientcoverage:test_client_quarantines_corrupt_segment_on_first_open— builds a palace with a stale + corrupt HNSW segment, calls_client(), asserts the segment was renamed to*.drift-*.test_client_quarantines_only_on_first_call_per_palace— three back-to-back_client()calls firequarantine_stale_hnswexactly once (gate works on the hot path).Test plan
pytest tests/ -v --ignore=tests/benchmarks -x -k "client or quarantine or chroma or hnsw"— 94 passedpytest tests/ --ignore=tests/benchmarks -x— 1472 passed, 1 skippedruff check .— cleanruff format --checkclean for the two files I touched (pre-existing format drift on develop in unrelated files is left alone)🤖 Generated with Claude Code