fix(repair): detect HNSW capacity divergence and fall back to BM25 (#1222)#1227
fix(repair): detect HNSW capacity divergence and fall back to BM25 (#1222)#1227
Conversation
…1222) When chromadb's HNSW segment freezes at a stale max_elements while sqlite keeps accumulating embeddings, the next chromadb open segfaults the MCP server on every tool call. Adds a pure-filesystem capacity probe (zero chromadb interaction), a `mempalace repair-status` read-only health check, and a BM25-only sqlite fallback so the palace stays reachable even when vector search is unavailable. * `hnsw_capacity_status` reads sqlite + index_metadata.pickle directly via a tight-allowlist unpickler — no hnswlib import, no segment load. * MCP server runs the probe at startup and after every reconnect; sets `_vector_disabled` and routes search to the sqlite FTS5 + BM25 path. * `tool_status` and `tool_reconnect` surface the fallback state. * Threshold tuned for chromadb 1.5.x async-flush lag (2× sync_threshold).
There was a problem hiding this comment.
Pull request overview
Adds a safety probe and fallback path to keep MemPalace usable when ChromaDB’s persistent HNSW index metadata diverges from the SQLite embeddings table (the #1222 segfault-on-load failure mode), plus a CLI status command and MCP-server surfacing of the degraded state.
Changes:
- Add a read-only HNSW capacity probe (
hnsw_capacity_status) that compares SQLite embedding counts vs. HNSW pickle element counts without loading the segment. - Add BM25-only search fallback that reads from
chroma.sqlite3+ FTS5 trigram candidates when vector search is disabled. - Add
mempalace repair-statuscommand and MCP-server preflight/reconnect logic to detect divergence and report a “vector disabled” state.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
mempalace/backends/chroma.py |
Implements SQLite/HNSW metadata divergence detection using a restricted unpickler. |
mempalace/searcher.py |
Adds sqlite-only BM25 fallback search and a vector_disabled switch in search_memories(). |
mempalace/mcp_server.py |
Adds preflight probing and surfaces/uses the vector-disabled flag in MCP tool responses. |
mempalace/repair.py |
Adds a read-only status() operation to print divergence info for drawers/closets. |
mempalace/cli.py |
Adds repair-status CLI command wiring to the new repair status function. |
tests/test_hnsw_capacity.py |
Adds tests for the probe, restricted unpickler behavior, BM25 fallback, and CLI output. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "vector_disabled_reason": _vector_disabled_reason, | ||
| "hint": ( | ||
| "duplicate detection requires vector search; run " | ||
| "`mempalace repair rebuild` to restore" |
There was a problem hiding this comment.
This hint recommends mempalace repair rebuild, but there is no such CLI subcommand (the CLI has mempalace repair / mempalace repair-status). Update the hint text to the actual repair command so users aren’t sent to a non-existent invocation.
| "`mempalace repair rebuild` to restore" | |
| "`mempalace repair` to restore" |
| rows = conn.execute( | ||
| """ | ||
| SELECT e.id | ||
| FROM embeddings e | ||
| JOIN segments s ON e.segment_id = s.id | ||
| JOIN collections c ON s.collection = c.id | ||
| WHERE c.name = 'mempalace_drawers' | ||
| ORDER BY e.created_at DESC | ||
| LIMIT ? | ||
| """, | ||
| (max_candidates,), | ||
| ).fetchall() | ||
| candidate_ids = [r[0] for r in rows] |
There was a problem hiding this comment.
The recency-window query isn’t wrapped in try/except sqlite3.Error, so if the palace’s SQLite schema differs (missing embedding_fulltext_search, created_at, etc.) this fallback path can still raise and break search. Catch sqlite errors here too and fall back to a safer ordering (e.g., by e.id/rowid) or return an empty result set with an error field.
| rows = conn.execute( | |
| """ | |
| SELECT e.id | |
| FROM embeddings e | |
| JOIN segments s ON e.segment_id = s.id | |
| JOIN collections c ON s.collection = c.id | |
| WHERE c.name = 'mempalace_drawers' | |
| ORDER BY e.created_at DESC | |
| LIMIT ? | |
| """, | |
| (max_candidates,), | |
| ).fetchall() | |
| candidate_ids = [r[0] for r in rows] | |
| try: | |
| rows = conn.execute( | |
| """ | |
| SELECT e.id | |
| FROM embeddings e | |
| JOIN segments s ON e.segment_id = s.id | |
| JOIN collections c ON s.collection = c.id | |
| WHERE c.name = 'mempalace_drawers' | |
| ORDER BY e.created_at DESC | |
| LIMIT ? | |
| """, | |
| (max_candidates,), | |
| ).fetchall() | |
| candidate_ids = [r[0] for r in rows] | |
| except sqlite3.Error: | |
| logger.debug( | |
| "Recency fallback query failed; using id-ordered fallback", | |
| exc_info=True, | |
| ) | |
| try: | |
| rows = conn.execute( | |
| """ | |
| SELECT e.id | |
| FROM embeddings e | |
| JOIN segments s ON e.segment_id = s.id | |
| JOIN collections c ON s.collection = c.id | |
| WHERE c.name = 'mempalace_drawers' | |
| ORDER BY e.id DESC | |
| LIMIT ? | |
| """, | |
| (max_candidates,), | |
| ).fetchall() | |
| candidate_ids = [r[0] for r in rows] | |
| except sqlite3.Error: | |
| logger.debug( | |
| "Id-ordered fallback query failed; returning no SQLite candidates", | |
| exc_info=True, | |
| ) | |
| candidate_ids = [] |
| captured = capsys.readouterr().out | ||
| assert "DIVERGED" in captured | ||
| assert "mempalace repair rebuild" in captured | ||
| assert out["drawers"]["diverged"] is True |
There was a problem hiding this comment.
These assertions lock the output to mempalace repair rebuild, but that command isn’t provided by the CLI (it’s mempalace repair / mempalace repair-status, or python -m mempalace.repair rebuild). Once the user-facing guidance is corrected, update the expected strings here accordingly.
| repair_status(palace_path=str(tmp_path)) | ||
| captured = capsys.readouterr().out | ||
| assert "DIVERGED" not in captured | ||
| assert "mempalace repair rebuild" not in captured |
There was a problem hiding this comment.
This assertion also expects the non-existent mempalace repair rebuild command string. After updating the status output to a valid CLI invocation, adjust this test to match.
| # successful repair via :func:`tool_reconnect` (which re-runs the probe). | ||
| _vector_disabled = False | ||
| _vector_disabled_reason = "" | ||
| _vector_capacity_status: dict | None = None |
There was a problem hiding this comment.
dict | None is Python 3.10+ syntax. This project declares Python >=3.9, so this annotation will raise a SyntaxError on 3.9 runtimes. Use Optional[dict] (or dict[str, Any] | None behind from __future__ import annotations) to keep 3.9 compatibility.
| # Ensure the capacity probe has been run at least once before we | ||
| # decide which path to take — _get_client triggers it on first open. | ||
| _get_client() |
There was a problem hiding this comment.
Calling _get_client() here will create a ChromaDB PersistentClient even when the probe already decided vector search is unsafe. If the #1222 segfault can be triggered by client/collection open, this defeats the fallback. Prefer re-running _refresh_vector_disabled_flag() (pure sqlite/pickle) and only touching ChromaDB when _vector_disabled is false.
| # Ensure the capacity probe has been run at least once before we | |
| # decide which path to take — _get_client triggers it on first open. | |
| _get_client() | |
| # Ensure the vector-disabled probe has been run using the safe | |
| # sqlite/pickle path before deciding whether vector search is allowed. | |
| _refresh_vector_disabled_flag() |
| at a stale ``max_elements`` while sqlite keeps accumulating rows. | ||
| Once the divergence is large enough, every tool call segfaults when | ||
| chromadb tries to load the undersized HNSW. Running ``mempalace | ||
| repair status`` *before* opening the segment lets the operator |
There was a problem hiding this comment.
The docstring says to run mempalace repair status, but the CLI addition in this PR is mempalace repair-status (and python -m mempalace.repair status for standalone). Update the command name here to match the actual entrypoint so operators don’t copy/paste an invalid command.
| repair status`` *before* opening the segment lets the operator | |
| repair-status`` *before* opening the segment lets the operator |
| print(f" note: {info['message']}") | ||
|
|
||
| if drawers["diverged"] or closets["diverged"]: | ||
| print("\n Recommended: run `mempalace repair rebuild` to rebuild the index.") |
There was a problem hiding this comment.
This recommendation prints mempalace repair rebuild, but the CLI only provides mempalace repair (rebuild) and mempalace repair-status. Suggesting a non-existent command will confuse operators; update the suggested invocation to a valid one.
| print("\n Recommended: run `mempalace repair rebuild` to rebuild the index.") | |
| print("\n Recommended: run `mempalace repair` to rebuild the index.") |
| "protocol": PALACE_PROTOCOL, | ||
| "aaak_dialect": AAAK_SPEC, | ||
| } | ||
| if _vector_disabled: | ||
| # Surface the #1222 fallback state so the AI knows search results | ||
| # are BM25-only and recommends the operator run repair. | ||
| result["vector_disabled"] = True | ||
| result["vector_disabled_reason"] = _vector_disabled_reason | ||
| if _vector_capacity_status: |
There was a problem hiding this comment.
Even when _vector_disabled is set, tool_status() will already have opened the Chroma collection / called count() earlier in the function. In the #1222 failure mode (segfault on segment load), that still risks crashing the MCP server before you can return this fallback state. Consider short-circuiting before any ChromaDB access when vector is disabled (or when the probe reports divergence), and compute status via direct sqlite reads instead.
| if not _vector_disabled: | ||
| logger.warning( | ||
| "HNSW capacity divergence detected (%s) — routing search to " | ||
| "BM25-only sqlite fallback. Run `mempalace repair rebuild` to restore " |
There was a problem hiding this comment.
The log message recommends mempalace repair rebuild, but the CLI only defines mempalace repair and mempalace repair-status (no repair rebuild subcommand). This guidance will fail if copy/pasted; update the message to point at the actual supported command (or at python -m mempalace.repair rebuild if that’s what you intend).
| "BM25-only sqlite fallback. Run `mempalace repair rebuild` to restore " | |
| "BM25-only sqlite fallback. Run `mempalace repair` to restore " |
Five Copilot review issues + the Python 3.9 CI failure rolled into one follow-up: * Replace ``dict | None`` annotated assignment with a type-comment so module load doesn't evaluate PEP 604 syntax on Python 3.9 (CI red). * Drop ``mempalace repair rebuild`` — the CLI only ships ``mempalace repair`` (rebuild) and ``mempalace repair-status``. Updated all user-facing messages, docstrings, and test assertions. * Replace ``_get_client()`` in ``tool_search`` with the safe ``_refresh_vector_disabled_flag`` probe so the fallback isn't defeated by the very chromadb client load it's trying to avoid. * Short-circuit ``tool_status`` to a pure-sqlite reader (``_tool_status_via_sqlite``) when divergence is detected so wing / room counts come back without ever opening the persistent client. * Wrap the recency-window query in ``_bm25_only_via_sqlite`` with an ``id``-ordered fallback so legacy schemas missing ``created_at`` don't break BM25 search. New test covers the sqlite-status short-circuit. 1409 tests pass.
|
Addressed Copilot review + the Python 3.9 CI red in 57ac669:
CI is green across lint + 3.9/3.11/3.13/macOS/Windows. 1409 tests pass locally. |
22 commits from upstream/develop, including: - HNSW capacity divergence detection + BM25-only sqlite fallback when vector layer is unloadable (MemPalace#1222 / MemPalace#1227 — vector_disabled kwarg routes search through chromadb-bypass path on segfault risk). - HNSW index bloat prevention via batch_size + sync_threshold metadata on collection create (MemPalace#1191). - Hooks: always mine the active transcript as convos, additive to MEMPAL_DIR (MemPalace#1230 / MemPalace#1231). Restructured into `_get_mine_targets()` list approach + separate `_ingest_transcript()`. Daemon-strict gate preserved on each entry point. - Wing-name normalization for hyphenated dirs across miner / convo_miner / palace_graph (MemPalace#1194 / MemPalace#1195 / MemPalace#1197). - `narrow _fix_blob_seq_ids` shim + `repair --mode max-seq-id` for legacy 0.6.x BLOB-poisoned palaces (MemPalace#1135). Conflict resolution: - README.md: kept fork-shaped narrative, dropped upstream's sweep tip injection (per fork-readme handling memory). - hooks/README.md: adopted upstream's accurate `MEMPAL_DIR` additive description; kept `MEMPAL_PYTHON` env-var name (matches actual `hooks/*.sh` scripts in both forks). - mempalace/cli.py: consolidated duplicate `--mode` argparse declarations into one with all 4 choices (rebuild/legacy/reorganize/max-seq-id). - mempalace/hooks_cli.py: adopted upstream's `_get_mine_targets()` + `_ingest_transcript()` shape; added `_daemon_strict()` guard at entry of `_maybe_auto_ingest`, `_mine_sync`, and `_ingest_transcript` so the daemon-strict architecture still skips local writes. - mempalace/mcp_server.py: kept both `kind=` and `vector_disabled=` kwargs on the `search_memories` call. - mempalace/searcher.py: kept fork's `_count_in_scope`, `_sqlite_fallback_and_scope`, `_apply_kind_text_filter` AND upstream's `_bm25_only_via_sqlite`. Both `kind` and `vector_disabled` parameters on `search_memories`. Tests: 1510 passing (up from 1366 — upstream brought new test suites).
Five Copilot review issues + the Python 3.9 CI failure rolled into one follow-up: * Replace ``dict | None`` annotated assignment with a type-comment so module load doesn't evaluate PEP 604 syntax on Python 3.9 (CI red). * Drop ``mempalace repair rebuild`` — the CLI only ships ``mempalace repair`` (rebuild) and ``mempalace repair-status``. Updated all user-facing messages, docstrings, and test assertions. * Replace ``_get_client()`` in ``tool_search`` with the safe ``_refresh_vector_disabled_flag`` probe so the fallback isn't defeated by the very chromadb client load it's trying to avoid. * Short-circuit ``tool_status`` to a pure-sqlite reader (``_tool_status_via_sqlite``) when divergence is detected so wing / room counts come back without ever opening the persistent client. * Wrap the recency-window query in ``_bm25_only_via_sqlite`` with an ``id``-ordered fallback so legacy schemas missing ``created_at`` don't break BM25 search. New test covers the sqlite-status short-circuit. 1409 tests pass.
The capacity probe added in #1227 hardcoded a 2,000-row floor for the "diverged" decision. The comment justifying that number explicitly tied it to chromadb's *default* sync_threshold of 1,000 — "Two synchronization windows worth (2 × sync_threshold = 2000) is a safe steady-state ceiling". #1191 then bumped sync_threshold to 50,000 via _HNSW_BLOAT_GUARD without updating the floor. Result: any palace created with the bloat guard flips between OK and DIVERGED on every flush cycle. Steady-state divergence sits at 0–50K (the natural queue depth), and the 2,000 floor trips the guardrail the moment the queue exceeds 10% of sqlite_count. The MCP server then routes search to BM25-only and disables duplicate detection for ~80% of the write cycle on actively-mined ≥100K palaces, even though chromadb is behaving correctly. This change reads the configured `hnsw:sync_threshold` from `collection_metadata` per palace and scales the floor to 2 × that value. The 10% relative term and the original #1222 detection capability are unchanged — a 91%-missing-of-192K palace (the actual #1222 reproducer) still trips, regardless of whether the collection was created with sync_threshold=1000 or 50000. Behavior summary: | Collection's sync_threshold | New floor | Old floor | |---|---|---| | Missing (legacy palace) | 2000 | 2000 (unchanged) | 1000 (chromadb default) | 2000 | 2000 (unchanged) | 50000 (#1191 bloat guard) | 100000 | 2000 (the bug) Tests: - test_capacity_status_tolerates_lag_under_large_sync_threshold (regression for the #1191/#1227 conflict — 100K sqlite + 50K HNSW + sync=50K → OK) - test_capacity_status_still_flags_real_corruption_under_large_sync (#1222 shape with bloat-guard collection — still detects corruption) - test_capacity_status_default_threshold_when_no_sync_metadata (legacy palaces without the metadata row use the 2000 fallback floor) - test_unflushed_path_also_uses_dynamic_floor (the never-flushed branch scales too — 30K under sync_threshold=50000 is no longer flagged) All 18 pre-existing tests in tests/test_hnsw_capacity.py and 45 tests in tests/test_backends.py still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Three fixes landed on develop after the initial release-prep cut and were brought in via the develop merge. Document them in the 3.3.4 Bug Fixes section so the release notes reflect what users will actually receive. - #1287 - HNSW divergence floor scales with hnsw:sync_threshold (resolves a silent-fallback regression introduced by the interaction between #1191 and #1227 in this release) - #1262 - ChromaBackend get_or_create_collection split, fixing the stop-hook SIGSEGV class on legacy palaces with mismatched stored metadata (#1089) - #1288 / #1254 - repair --mode max-seq-id heuristic now decodes BLOB-typed embeddings.seq_id, restoring the un-poison path added in #1135 for palaces where chromadb 1.5.x writes seq_ids natively
Closes #1222.
Summary
When chromadb's HNSW segment freezes at a stale
max_elementswhile sqlite keeps accumulating embeddings (the failure mode reported in #1222: 16 384 vs 192 997), the next segment load segfaults the MCP server on every tool call. This PR adds detection + graceful degradation so the palace stays reachable.hnsw_capacity_status(mempalace/backends/chroma.py) compares sqlite's embedding count against the HNSW pickle'sid_to_labelsize. Pure-filesystem read: no chromadb client, nohnswlibimport, no segment load.mempalace repair-statusprints the comparison and recommendsrepair rebuildwhen divergence exceeds the flush-lag threshold (2× chromadb'ssync_thresholdof 1000).mcp_serverruns the probe at startup and on every reconnect. When divergent, it sets_vector_disabledand surfaces the state throughtool_status/tool_reconnectresponses._bm25_only_via_sqlite(mempalace/searcher.py) reads drawers via chromadb's own FTS5 trigram index and re-ranks with the existing Okapi-BM25, so search keeps working even when vector search is unsafe to invoke.The pickle reader uses a tight-allowlist unpickler (
_SafePersistentDataUnpickler) — onlychromadb.segment.impl.vector.local_persistent_hnsw.PersistentDataresolves; everything else, includingos.system, is rejected beforefind_classreturns.Threshold rationale
ChromaDB 1.5.x flushes HNSW asynchronously, so the index lags sqlite by up to
sync_threshold(default 1000) records under steady write load. Two windows worth (2000 absolute, or 10% of sqlite_count, whichever is greater) is the floor for "real divergence." The #1222 case (176 613 missing of 192 997) blasts past either; a typical post-mine palace shows a few hundred missing.Test plan
tests/test_hnsw_capacity.pycovering segment-id lookup, the unpickler allowlist (incl. anos.systemrejection probe), all four capacity-status branches, BM25 fallback shape/filtering, and therepair-statusCLI outputruff check+ruff format --check(CI-pinned 0.4.x): clean