Skip to content

fix(backends/chroma): wire quarantine_stale_hnsw into _client() (#1121 #1132 #1263)#1322

Merged
igorls merged 3 commits intodevelopfrom
fix/1121-1132-1263-client-quarantine
May 3, 2026
Merged

fix(backends/chroma): wire quarantine_stale_hnsw into _client() (#1121 #1132 #1263)#1322
igorls merged 3 commits intodevelopfrom
fix/1121-1132-1263-client-quarantine

Conversation

@igorls
Copy link
Copy Markdown
Member

@igorls igorls commented May 3, 2026

Closes #1121. Closes #1132. Closes #1263.

Root cause

PR #1173 wired quarantine_stale_hnsw into the static ChromaBackend.make_client() helper, but not into the instance _client() method that get_collection / _get_or_create_collection route through. As a result every non-MCP entry point — CLI mining, search, repair, status — skipped the cold-start quarantine pass and could SIGSEGV on a stale HNSW segment left over from a partial flush, a replicated palace, or a crashed-mid-write.

Three issues, one missed wire.

Fix

Extract the (_fix_blob_seq_ids + gated quarantine_stale_hnsw) pre-open pass into a single private static helper ChromaBackend._prepare_palace_for_open(). Both make_client() and _client() now route through it.

Behaviour preserved:

  • _quarantined_paths once-per-palace-per-process gate still applies, so no runtime thrash on hot paths (_client() is hit on every backend op).
  • make_client() semantics are byte-identical to before (it now just delegates).
  • _client() runs the prep pass only inside its existing rebuild block (cached is None or inode_changed or mtime_changed or mtime_appeared), so cached fast-path callers pay nothing.
Diff shape
@staticmethod
def _prepare_palace_for_open(palace_path: str) -> None:
    _fix_blob_seq_ids(palace_path)
    if palace_path not in ChromaBackend._quarantined_paths:
        quarantine_stale_hnsw(palace_path)
        ChromaBackend._quarantined_paths.add(palace_path)

@staticmethod
def make_client(palace_path: str):
    ChromaBackend._prepare_palace_for_open(palace_path)
    return chromadb.PersistentClient(path=palace_path)

def _client(self, palace_path: str):
    ...
    if cached is None or inode_changed or mtime_changed or mtime_appeared:
        ChromaBackend._prepare_palace_for_open(palace_path)
        cached = chromadb.PersistentClient(path=palace_path)
        ...

Tests

Two new tests mirror the existing make_client coverage:

  • test_client_quarantines_corrupt_segment_on_first_open — builds a palace with a stale + corrupt HNSW segment, calls _client(), asserts the segment was renamed to *.drift-*.
  • test_client_quarantines_only_on_first_call_per_palace — three back-to-back _client() calls fire quarantine_stale_hnsw exactly once (gate works on the hot path).

Test plan

  • pytest tests/ -v --ignore=tests/benchmarks -x -k "client or quarantine or chroma or hnsw" — 94 passed
  • pytest tests/ --ignore=tests/benchmarks -x — 1472 passed, 1 skipped
  • ruff check . — clean
  • ruff format --check clean for the two files I touched (pre-existing format drift on develop in unrelated files is left alone)

🤖 Generated with Claude Code

…event SIGSEGV on stale HNSW (#1121, #1132, #1263)

PR #1173 wired quarantine_stale_hnsw into the static make_client() helper
but not into the instance _client() method. As a result every non-MCP
entry point (CLI mining, search, repair, status) — which all use
get_collection / _get_or_create_collection / _client() — skipped the
cold-start quarantine pass and could SIGSEGV on a stale HNSW segment
left over from a partial flush, replicated palace, or crashed-mid-write.

Refactor: extract the (_fix_blob_seq_ids + gated quarantine_stale_hnsw)
pre-open pass into a single private static helper
ChromaBackend._prepare_palace_for_open(). Both make_client() and
_client() now route through it, so the _quarantined_paths once-per-
palace-per-process gate is preserved (no runtime thrash on hot paths)
and behaviour stays identical — the fix is purely about extending the
existing protection to the path that was missing it.

Tests:

- test_client_quarantines_corrupt_segment_on_first_open mirrors the
  existing make_client test and verifies _client() actually renames a
  corrupt segment on first open.
- test_client_quarantines_only_on_first_call_per_palace verifies the
  cache gate prevents re-running quarantine across repeated _client()
  calls — important because _client() is hit on every backend op.

Closes #1121. Closes #1132. Closes #1263.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Copilot AI review requested due to automatic review settings May 3, 2026 01:56
@igorls igorls added bug Something isn't working storage labels May 3, 2026
@igorls igorls added this to the v3.3.5 milestone May 3, 2026
igorls added 2 commits May 2, 2026 22:58
CI requires whole-file format on touched files; pre-existing drift only.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@igorls igorls merged commit 6f88b2a into develop May 3, 2026
6 checks passed
xcarbo added a commit to xcarbo/mempalace that referenced this pull request May 4, 2026
Catches up on a month of upstream work. Highlights pulled in:

- MemPalace#1306 searcher: hybrid candidate union (vector ∪ BM25 reranking pool)
- MemPalace#1325 mcp security: omit absolute paths from tool_get_drawer / tool_status
- MemPalace#1322 chroma: wire quarantine_stale_hnsw to prevent SIGSEGV on stale HNSW
- MemPalace#1320 mcp: forward valid_to / source params in kg_add / kg_invalidate
- MemPalace#1321 cli: honor --palace flag in cmd_init
- MemPalace#1314 kg temporal params fix
- MemPalace#1244 cli: cmd_compress writes to mempalace_closets so palace can read
- MemPalace#1243 mcp: case-insensitive agent name in diary_write/diary_read
- MemPalace#1303 mcp_server: pass embedding_function= on collection reopen
- MemPalace#1076/MemPalace#1077 hooks: quote CLAUDE_PLUGIN_ROOT / CODEX_PLUGIN_ROOT in hooks.json
- Various ruff format passes on touched files

Conflict resolution (CHANGELOG.md only — code files all auto-merged):

- 3.3.5 unreleased section header from upstream kept above 3.3.4
- 3.3.4 section: kept our 2026-04-30 release date; merged upstream's new
  MemPalace#1299 SIGSEGV-on-default-EF entry in alongside our existing topic-tunnels
  (MemPalace#1194/MemPalace#1195/MemPalace#1197), HNSW-bloat (MemPalace#1191), max_seq_id (MemPalace#1135), and
  auto-ingest (MemPalace#1230/MemPalace#1231) entries. Kept our richer topic-tunnels detail
  (upstream's version was a strict subset).

xdev patches preserved (still on this branch, untouched by merge):
- 6ef44cb fix(hooks): route CC transcripts via convo_miner with cwd-based wings
- 3fad61d fix(config): allow leading dash in wing names

Not pushed to origin — run tests locally and decide when to push.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
YulelogPagoda pushed a commit to YulelogPagoda/mempalace that referenced this pull request May 6, 2026
…1286)

A transient chromadb exception inside `_get_collection` was swallowed by
the bare `except Exception: return None`, leaving every subsequent tool
call hitting the same poisoned cache silently. The fix wraps the body
in a `for attempt in range(2)` loop: on attempt 0 failure, log via
`logger.exception(...)` and clear `_client_cache` / `_collection_cache`
/ `_metadata_cache` so the next iteration forces `_get_client()` to
rebuild from scratch — that path now re-runs `quarantine_stale_hnsw`
(per MemPalace#1322), so the second attempt heals the common stale-handle case
automatically. If both attempts fail, return `None` (matches the prior
contract for permanent failures).

Two new tests in `tests/test_mcp_server.py::TestCacheInvalidation`:
- `test_get_collection_retries_once_on_exception` — first attempt raises
  via a monkeypatched `_get_client`, second attempt succeeds; assert the
  caller gets the collection back, not None.
- `test_get_collection_returns_none_after_two_failures` — both attempts
  fail, assert we exhaust the loop and return None (no infinite retry).

Surgical extraction from PR MemPalace#1286, which carried the same fix idea
(plus a fork-sync bundle that couldn't be merged); credit to the
original author below.

Co-authored-by: Jeffrey Hein <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working storage

Projects

None yet

2 participants