fix(repair): detect HNSW capacity divergence and fall back to BM25 (#1222) by igorls · Pull Request #1227 · MemPalace/mempalace

igorls · 2026-04-26T22:54:25Z

Closes #1222.

Summary

When chromadb's HNSW segment freezes at a stale max_elements while sqlite keeps accumulating embeddings (the failure mode reported in #1222: 16 384 vs 192 997), the next segment load segfaults the MCP server on every tool call. This PR adds detection + graceful degradation so the palace stays reachable.

Detection — hnsw_capacity_status (mempalace/backends/chroma.py) compares sqlite's embedding count against the HNSW pickle's id_to_label size. Pure-filesystem read: no chromadb client, no hnswlib import, no segment load.
Read-only CLI — mempalace repair-status prints the comparison and recommends repair rebuild when divergence exceeds the flush-lag threshold (2× chromadb's sync_threshold of 1000).
MCP startup pre-flight — mcp_server runs the probe at startup and on every reconnect. When divergent, it sets _vector_disabled and surfaces the state through tool_status / tool_reconnect responses.
BM25-only fallback — _bm25_only_via_sqlite (mempalace/searcher.py) reads drawers via chromadb's own FTS5 trigram index and re-ranks with the existing Okapi-BM25, so search keeps working even when vector search is unsafe to invoke.

The pickle reader uses a tight-allowlist unpickler (_SafePersistentDataUnpickler) — only chromadb.segment.impl.vector.local_persistent_hnsw.PersistentData resolves; everything else, including os.system, is rejected before find_class returns.

Threshold rationale

ChromaDB 1.5.x flushes HNSW asynchronously, so the index lags sqlite by up to sync_threshold (default 1000) records under steady write load. Two windows worth (2000 absolute, or 10% of sqlite_count, whichever is greater) is the floor for "real divergence." The #1222 case (176 613 missing of 192 997) blasts past either; a typical post-mine palace shows a few hundred missing.

Test plan

17 new tests in tests/test_hnsw_capacity.py covering segment-id lookup, the unpickler allowlist (incl. an os.system rejection probe), all four capacity-status branches, BM25 fallback shape/filtering, and the repair-status CLI output
Full pytest suite: 1408 passed
ruff check + ruff format --check (CI-pinned 0.4.x): clean
Manual verification on a real diverged palace (cannot reproduce the 16 384/192 997 case locally; probe and fallback do work on a live 1798/1000 palace as a softer divergence)

…1222) When chromadb's HNSW segment freezes at a stale max_elements while sqlite keeps accumulating embeddings, the next chromadb open segfaults the MCP server on every tool call. Adds a pure-filesystem capacity probe (zero chromadb interaction), a `mempalace repair-status` read-only health check, and a BM25-only sqlite fallback so the palace stays reachable even when vector search is unavailable. * `hnsw_capacity_status` reads sqlite + index_metadata.pickle directly via a tight-allowlist unpickler — no hnswlib import, no segment load. * MCP server runs the probe at startup and after every reconnect; sets `_vector_disabled` and routes search to the sqlite FTS5 + BM25 path. * `tool_status` and `tool_reconnect` surface the fallback state. * Threshold tuned for chromadb 1.5.x async-flush lag (2× sync_threshold).

Copilot

Pull request overview

Adds a safety probe and fallback path to keep MemPalace usable when ChromaDB’s persistent HNSW index metadata diverges from the SQLite embeddings table (the #1222 segfault-on-load failure mode), plus a CLI status command and MCP-server surfacing of the degraded state.

Changes:

Add a read-only HNSW capacity probe (hnsw_capacity_status) that compares SQLite embedding counts vs. HNSW pickle element counts without loading the segment.
Add BM25-only search fallback that reads from chroma.sqlite3 + FTS5 trigram candidates when vector search is disabled.
Add mempalace repair-status command and MCP-server preflight/reconnect logic to detect divergence and report a “vector disabled” state.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.

Show a summary per file

File	Description
`mempalace/backends/chroma.py`	Implements SQLite/HNSW metadata divergence detection using a restricted unpickler.
`mempalace/searcher.py`	Adds sqlite-only BM25 fallback search and a `vector_disabled` switch in `search_memories()`.
`mempalace/mcp_server.py`	Adds preflight probing and surfaces/uses the vector-disabled flag in MCP tool responses.
`mempalace/repair.py`	Adds a read-only `status()` operation to print divergence info for drawers/closets.
`mempalace/cli.py`	Adds `repair-status` CLI command wiring to the new repair status function.
`tests/test_hnsw_capacity.py`	Adds tests for the probe, restricted unpickler behavior, BM25 fallback, and CLI output.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-26T23:01:25Z

+            "vector_disabled_reason": _vector_disabled_reason,
+            "hint": (
+                "duplicate detection requires vector search; run "
+                "`mempalace repair rebuild` to restore"


This hint recommends mempalace repair rebuild, but there is no such CLI subcommand (the CLI has mempalace repair / mempalace repair-status). Update the hint text to the actual repair command so users aren’t sent to a non-existent invocation.

Suggested change

"`mempalace repair rebuild` to restore"

"`mempalace repair` to restore"

Copilot · 2026-04-26T23:01:25Z

+            rows = conn.execute(
+                """
+                SELECT e.id
+                FROM embeddings e
+                JOIN segments s ON e.segment_id = s.id
+                JOIN collections c ON s.collection = c.id
+                WHERE c.name = 'mempalace_drawers'
+                ORDER BY e.created_at DESC
+                LIMIT ?
+                """,
+                (max_candidates,),
+            ).fetchall()
+            candidate_ids = [r[0] for r in rows]


The recency-window query isn’t wrapped in try/except sqlite3.Error, so if the palace’s SQLite schema differs (missing embedding_fulltext_search, created_at, etc.) this fallback path can still raise and break search. Catch sqlite errors here too and fall back to a safer ordering (e.g., by e.id/rowid) or return an empty result set with an error field.

Suggested change

rows = conn.execute(

"""

SELECT e.id

FROM embeddings e

JOIN segments s ON e.segment_id = s.id

JOIN collections c ON s.collection = c.id

WHERE c.name = 'mempalace_drawers'

ORDER BY e.created_at DESC

LIMIT ?

""",

(max_candidates,),

).fetchall()

candidate_ids = [r[0] for r in rows]

try:

rows = conn.execute(

"""

SELECT e.id

FROM embeddings e

JOIN segments s ON e.segment_id = s.id

JOIN collections c ON s.collection = c.id

WHERE c.name = 'mempalace_drawers'

ORDER BY e.created_at DESC

LIMIT ?

""",

(max_candidates,),

).fetchall()

candidate_ids = [r[0] for r in rows]

except sqlite3.Error:

logger.debug(

"Recency fallback query failed; using id-ordered fallback",

exc_info=True,

)

try:

rows = conn.execute(

"""

SELECT e.id

FROM embeddings e

JOIN segments s ON e.segment_id = s.id

JOIN collections c ON s.collection = c.id

WHERE c.name = 'mempalace_drawers'

ORDER BY e.id DESC

LIMIT ?

""",

(max_candidates,),

).fetchall()

candidate_ids = [r[0] for r in rows]

except sqlite3.Error:

logger.debug(

"Id-ordered fallback query failed; returning no SQLite candidates",

exc_info=True,

)

candidate_ids = []

Copilot · 2026-04-26T23:01:26Z

+    captured = capsys.readouterr().out
+    assert "DIVERGED" in captured
+    assert "mempalace repair rebuild" in captured
+    assert out["drawers"]["diverged"] is True


These assertions lock the output to mempalace repair rebuild, but that command isn’t provided by the CLI (it’s mempalace repair / mempalace repair-status, or python -m mempalace.repair rebuild). Once the user-facing guidance is corrected, update the expected strings here accordingly.

Copilot · 2026-04-26T23:01:26Z

+    repair_status(palace_path=str(tmp_path))
+    captured = capsys.readouterr().out
+    assert "DIVERGED" not in captured
+    assert "mempalace repair rebuild" not in captured


This assertion also expects the non-existent mempalace repair rebuild command string. After updating the status output to a valid CLI invocation, adjust this test to match.

Copilot · 2026-04-26T23:01:26Z

+# successful repair via :func:`tool_reconnect` (which re-runs the probe).
+_vector_disabled = False
+_vector_disabled_reason = ""
+_vector_capacity_status: dict | None = None


dict | None is Python 3.10+ syntax. This project declares Python >=3.9, so this annotation will raise a SyntaxError on 3.9 runtimes. Use Optional[dict] (or dict[str, Any] | None behind from __future__ import annotations) to keep 3.9 compatibility.

Copilot · 2026-04-26T23:01:26Z

+    # Ensure the capacity probe has been run at least once before we
+    # decide which path to take — _get_client triggers it on first open.
+    _get_client()


Calling _get_client() here will create a ChromaDB PersistentClient even when the probe already decided vector search is unsafe. If the #1222 segfault can be triggered by client/collection open, this defeats the fallback. Prefer re-running _refresh_vector_disabled_flag() (pure sqlite/pickle) and only touching ChromaDB when _vector_disabled is false.

Suggested change

# Ensure the capacity probe has been run at least once before we

# decide which path to take — _get_client triggers it on first open.

_get_client()

# Ensure the vector-disabled probe has been run using the safe

# sqlite/pickle path before deciding whether vector search is allowed.

_refresh_vector_disabled_flag()

Copilot · 2026-04-26T23:01:27Z

+    at a stale ``max_elements`` while sqlite keeps accumulating rows.
+    Once the divergence is large enough, every tool call segfaults when
+    chromadb tries to load the undersized HNSW. Running ``mempalace
+    repair status`` *before* opening the segment lets the operator


The docstring says to run mempalace repair status, but the CLI addition in this PR is mempalace repair-status (and python -m mempalace.repair status for standalone). Update the command name here to match the actual entrypoint so operators don’t copy/paste an invalid command.

Suggested change

repair status`` *before* opening the segment lets the operator

repair-status`` *before* opening the segment lets the operator

Copilot · 2026-04-26T23:01:27Z

+            print(f"    note:           {info['message']}")
+
+    if drawers["diverged"] or closets["diverged"]:
+        print("\n  Recommended: run `mempalace repair rebuild` to rebuild the index.")


This recommendation prints mempalace repair rebuild, but the CLI only provides mempalace repair (rebuild) and mempalace repair-status. Suggesting a non-existent command will confuse operators; update the suggested invocation to a valid one.

Suggested change

print("\n Recommended: run `mempalace repair rebuild` to rebuild the index.")

print("\n Recommended: run `mempalace repair` to rebuild the index.")

Copilot · 2026-04-26T23:01:27Z

        "protocol": PALACE_PROTOCOL,
        "aaak_dialect": AAAK_SPEC,
    }
+    if _vector_disabled:
+        # Surface the #1222 fallback state so the AI knows search results
+        # are BM25-only and recommends the operator run repair.
+        result["vector_disabled"] = True
+        result["vector_disabled_reason"] = _vector_disabled_reason
+        if _vector_capacity_status:


Even when _vector_disabled is set, tool_status() will already have opened the Chroma collection / called count() earlier in the function. In the #1222 failure mode (segfault on segment load), that still risks crashing the MCP server before you can return this fallback state. Consider short-circuiting before any ChromaDB access when vector is disabled (or when the probe reports divergence), and compute status via direct sqlite reads instead.

Copilot · 2026-04-26T23:01:27Z

+        if not _vector_disabled:
+            logger.warning(
+                "HNSW capacity divergence detected (%s) — routing search to "
+                "BM25-only sqlite fallback. Run `mempalace repair rebuild` to restore "


The log message recommends mempalace repair rebuild, but the CLI only defines mempalace repair and mempalace repair-status (no repair rebuild subcommand). This guidance will fail if copy/pasted; update the message to point at the actual supported command (or at python -m mempalace.repair rebuild if that’s what you intend).

Suggested change

"BM25-only sqlite fallback. Run `mempalace repair rebuild` to restore "

"BM25-only sqlite fallback. Run `mempalace repair` to restore "

Five Copilot review issues + the Python 3.9 CI failure rolled into one follow-up: * Replace ``dict | None`` annotated assignment with a type-comment so module load doesn't evaluate PEP 604 syntax on Python 3.9 (CI red). * Drop ``mempalace repair rebuild`` — the CLI only ships ``mempalace repair`` (rebuild) and ``mempalace repair-status``. Updated all user-facing messages, docstrings, and test assertions. * Replace ``_get_client()`` in ``tool_search`` with the safe ``_refresh_vector_disabled_flag`` probe so the fallback isn't defeated by the very chromadb client load it's trying to avoid. * Short-circuit ``tool_status`` to a pure-sqlite reader (``_tool_status_via_sqlite``) when divergence is detected so wing / room counts come back without ever opening the persistent client. * Wrap the recency-window query in ``_bm25_only_via_sqlite`` with an ``id``-ordered fallback so legacy schemas missing ``created_at`` don't break BM25 search. New test covers the sqlite-status short-circuit. 1409 tests pass.

igorls · 2026-04-27T01:01:20Z

Addressed Copilot review + the Python 3.9 CI red in 57ac669:

Python 3.9 CI: replaced the dict | None annotated assignment in mcp_server.py with a type-comment so module load doesn't evaluate PEP 604 syntax on 3.9.
Wrong CLI command in user-facing strings: mempalace repair rebuild doesn't exist (only mempalace repair and mempalace repair-status). Fixed every message, docstring, and test assertion.
tool_search defeated its own fallback: was calling _get_client() to trigger the probe, which constructs the very chromadb client the fallback is meant to avoid. Switched to _refresh_vector_disabled_flag() (pure sqlite/pickle, no chromadb).
tool_status opened the chromadb collection unconditionally: added _tool_status_via_sqlite short-circuit that runs when _vector_disabled is set — wing/room counts come back via direct embedding_metadata reads so status stays reachable in the segfault scenario.
Recency-window query was unguarded: wrapped in sqlite3.Error handling with an id-ordered fallback for legacy schemas missing created_at.

CI is green across lint + 3.9/3.11/3.13/macOS/Windows. 1409 tests pass locally.

22 commits from upstream/develop, including: - HNSW capacity divergence detection + BM25-only sqlite fallback when vector layer is unloadable (MemPalace#1222 / MemPalace#1227 — vector_disabled kwarg routes search through chromadb-bypass path on segfault risk). - HNSW index bloat prevention via batch_size + sync_threshold metadata on collection create (MemPalace#1191). - Hooks: always mine the active transcript as convos, additive to MEMPAL_DIR (MemPalace#1230 / MemPalace#1231). Restructured into `_get_mine_targets()` list approach + separate `_ingest_transcript()`. Daemon-strict gate preserved on each entry point. - Wing-name normalization for hyphenated dirs across miner / convo_miner / palace_graph (MemPalace#1194 / MemPalace#1195 / MemPalace#1197). - `narrow _fix_blob_seq_ids` shim + `repair --mode max-seq-id` for legacy 0.6.x BLOB-poisoned palaces (MemPalace#1135). Conflict resolution: - README.md: kept fork-shaped narrative, dropped upstream's sweep tip injection (per fork-readme handling memory). - hooks/README.md: adopted upstream's accurate `MEMPAL_DIR` additive description; kept `MEMPAL_PYTHON` env-var name (matches actual `hooks/*.sh` scripts in both forks). - mempalace/cli.py: consolidated duplicate `--mode` argparse declarations into one with all 4 choices (rebuild/legacy/reorganize/max-seq-id). - mempalace/hooks_cli.py: adopted upstream's `_get_mine_targets()` + `_ingest_transcript()` shape; added `_daemon_strict()` guard at entry of `_maybe_auto_ingest`, `_mine_sync`, and `_ingest_transcript` so the daemon-strict architecture still skips local writes. - mempalace/mcp_server.py: kept both `kind=` and `vector_disabled=` kwargs on the `search_memories` call. - mempalace/searcher.py: kept fork's `_count_in_scope`, `_sqlite_fallback_and_scope`, `_apply_kind_text_filter` AND upstream's `_bm25_only_via_sqlite`. Both `kind` and `vector_disabled` parameters on `search_memories`. Tests: 1510 passing (up from 1366 — upstream brought new test suites).

Five Copilot review issues + the Python 3.9 CI failure rolled into one follow-up: * Replace ``dict | None`` annotated assignment with a type-comment so module load doesn't evaluate PEP 604 syntax on Python 3.9 (CI red). * Drop ``mempalace repair rebuild`` — the CLI only ships ``mempalace repair`` (rebuild) and ``mempalace repair-status``. Updated all user-facing messages, docstrings, and test assertions. * Replace ``_get_client()`` in ``tool_search`` with the safe ``_refresh_vector_disabled_flag`` probe so the fallback isn't defeated by the very chromadb client load it's trying to avoid. * Short-circuit ``tool_status`` to a pure-sqlite reader (``_tool_status_via_sqlite``) when divergence is detected so wing / room counts come back without ever opening the persistent client. * Wrap the recency-window query in ``_bm25_only_via_sqlite`` with an ``id``-ordered fallback so legacy schemas missing ``created_at`` don't break BM25 search. New test covers the sqlite-status short-circuit. 1409 tests pass.

The capacity probe added in #1227 hardcoded a 2,000-row floor for the "diverged" decision. The comment justifying that number explicitly tied it to chromadb's *default* sync_threshold of 1,000 — "Two synchronization windows worth (2 × sync_threshold = 2000) is a safe steady-state ceiling". #1191 then bumped sync_threshold to 50,000 via _HNSW_BLOAT_GUARD without updating the floor. Result: any palace created with the bloat guard flips between OK and DIVERGED on every flush cycle. Steady-state divergence sits at 0–50K (the natural queue depth), and the 2,000 floor trips the guardrail the moment the queue exceeds 10% of sqlite_count. The MCP server then routes search to BM25-only and disables duplicate detection for ~80% of the write cycle on actively-mined ≥100K palaces, even though chromadb is behaving correctly. This change reads the configured `hnsw:sync_threshold` from `collection_metadata` per palace and scales the floor to 2 × that value. The 10% relative term and the original #1222 detection capability are unchanged — a 91%-missing-of-192K palace (the actual #1222 reproducer) still trips, regardless of whether the collection was created with sync_threshold=1000 or 50000. Behavior summary: | Collection's sync_threshold | New floor | Old floor | |---|---|---| | Missing (legacy palace) | 2000 | 2000 (unchanged) | 1000 (chromadb default) | 2000 | 2000 (unchanged) | 50000 (#1191 bloat guard) | 100000 | 2000 (the bug) Tests: - test_capacity_status_tolerates_lag_under_large_sync_threshold (regression for the #1191/#1227 conflict — 100K sqlite + 50K HNSW + sync=50K → OK) - test_capacity_status_still_flags_real_corruption_under_large_sync (#1222 shape with bloat-guard collection — still detects corruption) - test_capacity_status_default_threshold_when_no_sync_metadata (legacy palaces without the metadata row use the 2000 fallback floor) - test_unflushed_path_also_uses_dynamic_floor (the never-flushed branch scales too — 30K under sync_threshold=50000 is no longer flagged) All 18 pre-existing tests in tests/test_hnsw_capacity.py and 45 tests in tests/test_backends.py still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Three fixes landed on develop after the initial release-prep cut and were brought in via the develop merge. Document them in the 3.3.4 Bug Fixes section so the release notes reflect what users will actually receive. - #1287 - HNSW divergence floor scales with hnsw:sync_threshold (resolves a silent-fallback regression introduced by the interaction between #1191 and #1227 in this release) - #1262 - ChromaBackend get_or_create_collection split, fixing the stop-hook SIGSEGV class on legacy palaces with mismatched stored metadata (#1089) - #1288 / #1254 - repair --mode max-seq-id heuristic now decodes BLOB-typed embeddings.seq_id, restoring the un-poison path added in #1135 for palaces where chromadb 1.5.x writes seq_ids natively

Copilot AI review requested due to automatic review settings April 26, 2026 22:54

igorls requested review from bensig and milla-jovovich as code owners April 26, 2026 22:54

Copilot started reviewing on behalf of igorls April 26, 2026 22:54 View session

Copilot AI reviewed Apr 26, 2026

View reviewed changes

igorls merged commit 9dbb4ce into develop Apr 27, 2026
6 checks passed

This was referenced Apr 27, 2026

fix: prevent HNSW index bloat from resize+persist cycles #1191

Merged

fix: narrow _fix_blob_seq_ids + add repair --mode max-seq-id #1135

Merged

Seph396 mentioned this pull request Apr 28, 2026

repair: silent exit 0 when SQLite advances but HNSW remains stale; missing quarantine_stale_hnsw() invocation #1263

Closed

Wolfreaper85 mentioned this pull request Apr 29, 2026

fix(search): hint at mempalace repair when filtered query hits HNSW index mismatch (#1035) #1081

Open

4 tasks

messelink mentioned this pull request May 1, 2026

fix(repair): scale HNSW divergence floor with hnsw:sync_threshold #1287

Merged

briankingdomlight mentioned this pull request May 2, 2026

fix(repair): rebuild_index bails on col.count() — exactly the call HNSW corruption breaks #1307

Closed

potterdigital mentioned this pull request May 2, 2026

fix(repair): rebuild_index bails on col.count() — exactly the call HNSW corruption breaks #1308

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(repair): detect HNSW capacity divergence and fall back to BM25 (#1222)#1227

fix(repair): detect HNSW capacity divergence and fall back to BM25 (#1222)#1227
igorls merged 2 commits intodevelopfrom
fix/hnsw-capacity-divergence-1222

igorls commented Apr 26, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 26, 2026

Uh oh!

Copilot AI Apr 26, 2026

Uh oh!

Copilot AI Apr 26, 2026

Uh oh!

Copilot AI Apr 26, 2026

Uh oh!

Copilot AI Apr 26, 2026

Uh oh!

Copilot AI Apr 26, 2026

Uh oh!

Copilot AI Apr 26, 2026

Uh oh!

Copilot AI Apr 26, 2026

Uh oh!

Copilot AI Apr 26, 2026

Uh oh!

Copilot AI Apr 26, 2026

Uh oh!

Uh oh!

igorls commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	"`mempalace repair rebuild` to restore"
	"`mempalace repair` to restore"

	repair status`` before opening the segment lets the operator
	repair-status`` before opening the segment lets the operator

	print("\n Recommended: run `mempalace repair rebuild` to rebuild the index.")
	print("\n Recommended: run `mempalace repair` to rebuild the index.")

	"BM25-only sqlite fallback. Run `mempalace repair rebuild` to restore "
	"BM25-only sqlite fallback. Run `mempalace repair` to restore "

Conversation

igorls commented Apr 26, 2026

Summary

Threshold rationale

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

igorls commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants