Skip to content

fix(repair): detect HNSW capacity divergence and fall back to BM25 (#1222)#1227

Merged
igorls merged 2 commits intodevelopfrom
fix/hnsw-capacity-divergence-1222
Apr 27, 2026
Merged

fix(repair): detect HNSW capacity divergence and fall back to BM25 (#1222)#1227
igorls merged 2 commits intodevelopfrom
fix/hnsw-capacity-divergence-1222

Conversation

@igorls
Copy link
Copy Markdown
Member

@igorls igorls commented Apr 26, 2026

Closes #1222.

Summary

When chromadb's HNSW segment freezes at a stale max_elements while sqlite keeps accumulating embeddings (the failure mode reported in #1222: 16 384 vs 192 997), the next segment load segfaults the MCP server on every tool call. This PR adds detection + graceful degradation so the palace stays reachable.

  • Detectionhnsw_capacity_status (mempalace/backends/chroma.py) compares sqlite's embedding count against the HNSW pickle's id_to_label size. Pure-filesystem read: no chromadb client, no hnswlib import, no segment load.
  • Read-only CLImempalace repair-status prints the comparison and recommends repair rebuild when divergence exceeds the flush-lag threshold (2× chromadb's sync_threshold of 1000).
  • MCP startup pre-flightmcp_server runs the probe at startup and on every reconnect. When divergent, it sets _vector_disabled and surfaces the state through tool_status / tool_reconnect responses.
  • BM25-only fallback_bm25_only_via_sqlite (mempalace/searcher.py) reads drawers via chromadb's own FTS5 trigram index and re-ranks with the existing Okapi-BM25, so search keeps working even when vector search is unsafe to invoke.

The pickle reader uses a tight-allowlist unpickler (_SafePersistentDataUnpickler) — only chromadb.segment.impl.vector.local_persistent_hnsw.PersistentData resolves; everything else, including os.system, is rejected before find_class returns.

Threshold rationale

ChromaDB 1.5.x flushes HNSW asynchronously, so the index lags sqlite by up to sync_threshold (default 1000) records under steady write load. Two windows worth (2000 absolute, or 10% of sqlite_count, whichever is greater) is the floor for "real divergence." The #1222 case (176 613 missing of 192 997) blasts past either; a typical post-mine palace shows a few hundred missing.

Test plan

  • 17 new tests in tests/test_hnsw_capacity.py covering segment-id lookup, the unpickler allowlist (incl. an os.system rejection probe), all four capacity-status branches, BM25 fallback shape/filtering, and the repair-status CLI output
  • Full pytest suite: 1408 passed
  • ruff check + ruff format --check (CI-pinned 0.4.x): clean
  • Manual verification on a real diverged palace (cannot reproduce the 16 384/192 997 case locally; probe and fallback do work on a live 1798/1000 palace as a softer divergence)

…1222)

When chromadb's HNSW segment freezes at a stale max_elements while
sqlite keeps accumulating embeddings, the next chromadb open segfaults
the MCP server on every tool call. Adds a pure-filesystem capacity probe
(zero chromadb interaction), a `mempalace repair-status` read-only
health check, and a BM25-only sqlite fallback so the palace stays
reachable even when vector search is unavailable.

* `hnsw_capacity_status` reads sqlite + index_metadata.pickle directly
  via a tight-allowlist unpickler — no hnswlib import, no segment load.
* MCP server runs the probe at startup and after every reconnect; sets
  `_vector_disabled` and routes search to the sqlite FTS5 + BM25 path.
* `tool_status` and `tool_reconnect` surface the fallback state.
* Threshold tuned for chromadb 1.5.x async-flush lag (2× sync_threshold).
Copilot AI review requested due to automatic review settings April 26, 2026 22:54
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a safety probe and fallback path to keep MemPalace usable when ChromaDB’s persistent HNSW index metadata diverges from the SQLite embeddings table (the #1222 segfault-on-load failure mode), plus a CLI status command and MCP-server surfacing of the degraded state.

Changes:

  • Add a read-only HNSW capacity probe (hnsw_capacity_status) that compares SQLite embedding counts vs. HNSW pickle element counts without loading the segment.
  • Add BM25-only search fallback that reads from chroma.sqlite3 + FTS5 trigram candidates when vector search is disabled.
  • Add mempalace repair-status command and MCP-server preflight/reconnect logic to detect divergence and report a “vector disabled” state.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
mempalace/backends/chroma.py Implements SQLite/HNSW metadata divergence detection using a restricted unpickler.
mempalace/searcher.py Adds sqlite-only BM25 fallback search and a vector_disabled switch in search_memories().
mempalace/mcp_server.py Adds preflight probing and surfaces/uses the vector-disabled flag in MCP tool responses.
mempalace/repair.py Adds a read-only status() operation to print divergence info for drawers/closets.
mempalace/cli.py Adds repair-status CLI command wiring to the new repair status function.
tests/test_hnsw_capacity.py Adds tests for the probe, restricted unpickler behavior, BM25 fallback, and CLI output.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mempalace/mcp_server.py Outdated
"vector_disabled_reason": _vector_disabled_reason,
"hint": (
"duplicate detection requires vector search; run "
"`mempalace repair rebuild` to restore"
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This hint recommends mempalace repair rebuild, but there is no such CLI subcommand (the CLI has mempalace repair / mempalace repair-status). Update the hint text to the actual repair command so users aren’t sent to a non-existent invocation.

Suggested change
"`mempalace repair rebuild` to restore"
"`mempalace repair` to restore"

Copilot uses AI. Check for mistakes.
Comment thread mempalace/searcher.py Outdated
Comment on lines +431 to +443
rows = conn.execute(
"""
SELECT e.id
FROM embeddings e
JOIN segments s ON e.segment_id = s.id
JOIN collections c ON s.collection = c.id
WHERE c.name = 'mempalace_drawers'
ORDER BY e.created_at DESC
LIMIT ?
""",
(max_candidates,),
).fetchall()
candidate_ids = [r[0] for r in rows]
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The recency-window query isn’t wrapped in try/except sqlite3.Error, so if the palace’s SQLite schema differs (missing embedding_fulltext_search, created_at, etc.) this fallback path can still raise and break search. Catch sqlite errors here too and fall back to a safer ordering (e.g., by e.id/rowid) or return an empty result set with an error field.

Suggested change
rows = conn.execute(
"""
SELECT e.id
FROM embeddings e
JOIN segments s ON e.segment_id = s.id
JOIN collections c ON s.collection = c.id
WHERE c.name = 'mempalace_drawers'
ORDER BY e.created_at DESC
LIMIT ?
""",
(max_candidates,),
).fetchall()
candidate_ids = [r[0] for r in rows]
try:
rows = conn.execute(
"""
SELECT e.id
FROM embeddings e
JOIN segments s ON e.segment_id = s.id
JOIN collections c ON s.collection = c.id
WHERE c.name = 'mempalace_drawers'
ORDER BY e.created_at DESC
LIMIT ?
""",
(max_candidates,),
).fetchall()
candidate_ids = [r[0] for r in rows]
except sqlite3.Error:
logger.debug(
"Recency fallback query failed; using id-ordered fallback",
exc_info=True,
)
try:
rows = conn.execute(
"""
SELECT e.id
FROM embeddings e
JOIN segments s ON e.segment_id = s.id
JOIN collections c ON s.collection = c.id
WHERE c.name = 'mempalace_drawers'
ORDER BY e.id DESC
LIMIT ?
""",
(max_candidates,),
).fetchall()
candidate_ids = [r[0] for r in rows]
except sqlite3.Error:
logger.debug(
"Id-ordered fallback query failed; returning no SQLite candidates",
exc_info=True,
)
candidate_ids = []

Copilot uses AI. Check for mistakes.
Comment on lines +348 to +351
captured = capsys.readouterr().out
assert "DIVERGED" in captured
assert "mempalace repair rebuild" in captured
assert out["drawers"]["diverged"] is True
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These assertions lock the output to mempalace repair rebuild, but that command isn’t provided by the CLI (it’s mempalace repair / mempalace repair-status, or python -m mempalace.repair rebuild). Once the user-facing guidance is corrected, update the expected strings here accordingly.

Copilot uses AI. Check for mistakes.
Comment thread tests/test_hnsw_capacity.py Outdated
Comment on lines +360 to +363
repair_status(palace_path=str(tmp_path))
captured = capsys.readouterr().out
assert "DIVERGED" not in captured
assert "mempalace repair rebuild" not in captured
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion also expects the non-existent mempalace repair rebuild command string. After updating the status output to a valid CLI invocation, adjust this test to match.

Copilot uses AI. Check for mistakes.
Comment thread mempalace/mcp_server.py Outdated
# successful repair via :func:`tool_reconnect` (which re-runs the probe).
_vector_disabled = False
_vector_disabled_reason = ""
_vector_capacity_status: dict | None = None
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dict | None is Python 3.10+ syntax. This project declares Python >=3.9, so this annotation will raise a SyntaxError on 3.9 runtimes. Use Optional[dict] (or dict[str, Any] | None behind from __future__ import annotations) to keep 3.9 compatibility.

Copilot uses AI. Check for mistakes.
Comment thread mempalace/mcp_server.py Outdated
Comment on lines +526 to +528
# Ensure the capacity probe has been run at least once before we
# decide which path to take — _get_client triggers it on first open.
_get_client()
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling _get_client() here will create a ChromaDB PersistentClient even when the probe already decided vector search is unsafe. If the #1222 segfault can be triggered by client/collection open, this defeats the fallback. Prefer re-running _refresh_vector_disabled_flag() (pure sqlite/pickle) and only touching ChromaDB when _vector_disabled is false.

Suggested change
# Ensure the capacity probe has been run at least once before we
# decide which path to take — _get_client triggers it on first open.
_get_client()
# Ensure the vector-disabled probe has been run using the safe
# sqlite/pickle path before deciding whether vector search is allowed.
_refresh_vector_disabled_flag()

Copilot uses AI. Check for mistakes.
Comment thread mempalace/repair.py Outdated
at a stale ``max_elements`` while sqlite keeps accumulating rows.
Once the divergence is large enough, every tool call segfaults when
chromadb tries to load the undersized HNSW. Running ``mempalace
repair status`` *before* opening the segment lets the operator
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring says to run mempalace repair status, but the CLI addition in this PR is mempalace repair-status (and python -m mempalace.repair status for standalone). Update the command name here to match the actual entrypoint so operators don’t copy/paste an invalid command.

Suggested change
repair status`` *before* opening the segment lets the operator
repair-status`` *before* opening the segment lets the operator

Copilot uses AI. Check for mistakes.
Comment thread mempalace/repair.py Outdated
print(f" note: {info['message']}")

if drawers["diverged"] or closets["diverged"]:
print("\n Recommended: run `mempalace repair rebuild` to rebuild the index.")
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This recommendation prints mempalace repair rebuild, but the CLI only provides mempalace repair (rebuild) and mempalace repair-status. Suggesting a non-existent command will confuse operators; update the suggested invocation to a valid one.

Suggested change
print("\n Recommended: run `mempalace repair rebuild` to rebuild the index.")
print("\n Recommended: run `mempalace repair` to rebuild the index.")

Copilot uses AI. Check for mistakes.
Comment thread mempalace/mcp_server.py Outdated
Comment on lines +378 to +386
"protocol": PALACE_PROTOCOL,
"aaak_dialect": AAAK_SPEC,
}
if _vector_disabled:
# Surface the #1222 fallback state so the AI knows search results
# are BM25-only and recommends the operator run repair.
result["vector_disabled"] = True
result["vector_disabled_reason"] = _vector_disabled_reason
if _vector_capacity_status:
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even when _vector_disabled is set, tool_status() will already have opened the Chroma collection / called count() earlier in the function. In the #1222 failure mode (segfault on segment load), that still risks crashing the MCP server before you can return this fallback state. Consider short-circuiting before any ChromaDB access when vector is disabled (or when the probe reports divergence), and compute status via direct sqlite reads instead.

Copilot uses AI. Check for mistakes.
Comment thread mempalace/mcp_server.py Outdated
if not _vector_disabled:
logger.warning(
"HNSW capacity divergence detected (%s) — routing search to "
"BM25-only sqlite fallback. Run `mempalace repair rebuild` to restore "
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log message recommends mempalace repair rebuild, but the CLI only defines mempalace repair and mempalace repair-status (no repair rebuild subcommand). This guidance will fail if copy/pasted; update the message to point at the actual supported command (or at python -m mempalace.repair rebuild if that’s what you intend).

Suggested change
"BM25-only sqlite fallback. Run `mempalace repair rebuild` to restore "
"BM25-only sqlite fallback. Run `mempalace repair` to restore "

Copilot uses AI. Check for mistakes.
Five Copilot review issues + the Python 3.9 CI failure rolled into one
follow-up:

* Replace ``dict | None`` annotated assignment with a type-comment so
  module load doesn't evaluate PEP 604 syntax on Python 3.9 (CI red).
* Drop ``mempalace repair rebuild`` — the CLI only ships ``mempalace
  repair`` (rebuild) and ``mempalace repair-status``. Updated all
  user-facing messages, docstrings, and test assertions.
* Replace ``_get_client()`` in ``tool_search`` with the safe
  ``_refresh_vector_disabled_flag`` probe so the fallback isn't
  defeated by the very chromadb client load it's trying to avoid.
* Short-circuit ``tool_status`` to a pure-sqlite reader
  (``_tool_status_via_sqlite``) when divergence is detected so wing /
  room counts come back without ever opening the persistent client.
* Wrap the recency-window query in ``_bm25_only_via_sqlite`` with an
  ``id``-ordered fallback so legacy schemas missing ``created_at``
  don't break BM25 search.

New test covers the sqlite-status short-circuit. 1409 tests pass.
@igorls igorls merged commit 9dbb4ce into develop Apr 27, 2026
6 checks passed
@igorls
Copy link
Copy Markdown
Member Author

igorls commented Apr 27, 2026

Addressed Copilot review + the Python 3.9 CI red in 57ac669:

  • Python 3.9 CI: replaced the dict | None annotated assignment in mcp_server.py with a type-comment so module load doesn't evaluate PEP 604 syntax on 3.9.
  • Wrong CLI command in user-facing strings: mempalace repair rebuild doesn't exist (only mempalace repair and mempalace repair-status). Fixed every message, docstring, and test assertion.
  • tool_search defeated its own fallback: was calling _get_client() to trigger the probe, which constructs the very chromadb client the fallback is meant to avoid. Switched to _refresh_vector_disabled_flag() (pure sqlite/pickle, no chromadb).
  • tool_status opened the chromadb collection unconditionally: added _tool_status_via_sqlite short-circuit that runs when _vector_disabled is set — wing/room counts come back via direct embedding_metadata reads so status stays reachable in the segfault scenario.
  • Recency-window query was unguarded: wrapped in sqlite3.Error handling with an id-ordered fallback for legacy schemas missing created_at.

CI is green across lint + 3.9/3.11/3.13/macOS/Windows. 1409 tests pass locally.

jphein added a commit to jphein/mempalace that referenced this pull request Apr 27, 2026
22 commits from upstream/develop, including:

- HNSW capacity divergence detection + BM25-only sqlite fallback when
  vector layer is unloadable (MemPalace#1222 / MemPalace#1227 — vector_disabled kwarg
  routes search through chromadb-bypass path on segfault risk).
- HNSW index bloat prevention via batch_size + sync_threshold metadata
  on collection create (MemPalace#1191).
- Hooks: always mine the active transcript as convos, additive to
  MEMPAL_DIR (MemPalace#1230 / MemPalace#1231). Restructured into `_get_mine_targets()`
  list approach + separate `_ingest_transcript()`. Daemon-strict gate
  preserved on each entry point.
- Wing-name normalization for hyphenated dirs across miner / convo_miner
  / palace_graph (MemPalace#1194 / MemPalace#1195 / MemPalace#1197).
- `narrow _fix_blob_seq_ids` shim + `repair --mode max-seq-id` for
  legacy 0.6.x BLOB-poisoned palaces (MemPalace#1135).

Conflict resolution:
- README.md: kept fork-shaped narrative, dropped upstream's sweep tip
  injection (per fork-readme handling memory).
- hooks/README.md: adopted upstream's accurate `MEMPAL_DIR` additive
  description; kept `MEMPAL_PYTHON` env-var name (matches actual
  `hooks/*.sh` scripts in both forks).
- mempalace/cli.py: consolidated duplicate `--mode` argparse declarations
  into one with all 4 choices (rebuild/legacy/reorganize/max-seq-id).
- mempalace/hooks_cli.py: adopted upstream's `_get_mine_targets()` +
  `_ingest_transcript()` shape; added `_daemon_strict()` guard at entry
  of `_maybe_auto_ingest`, `_mine_sync`, and `_ingest_transcript` so
  the daemon-strict architecture still skips local writes.
- mempalace/mcp_server.py: kept both `kind=` and `vector_disabled=`
  kwargs on the `search_memories` call.
- mempalace/searcher.py: kept fork's `_count_in_scope`,
  `_sqlite_fallback_and_scope`, `_apply_kind_text_filter` AND upstream's
  `_bm25_only_via_sqlite`. Both `kind` and `vector_disabled` parameters
  on `search_memories`.

Tests: 1510 passing (up from 1366 — upstream brought new test suites).
lealvona pushed a commit to lealvona/mempalace that referenced this pull request Apr 29, 2026
Five Copilot review issues + the Python 3.9 CI failure rolled into one
follow-up:

* Replace ``dict | None`` annotated assignment with a type-comment so
  module load doesn't evaluate PEP 604 syntax on Python 3.9 (CI red).
* Drop ``mempalace repair rebuild`` — the CLI only ships ``mempalace
  repair`` (rebuild) and ``mempalace repair-status``. Updated all
  user-facing messages, docstrings, and test assertions.
* Replace ``_get_client()`` in ``tool_search`` with the safe
  ``_refresh_vector_disabled_flag`` probe so the fallback isn't
  defeated by the very chromadb client load it's trying to avoid.
* Short-circuit ``tool_status`` to a pure-sqlite reader
  (``_tool_status_via_sqlite``) when divergence is detected so wing /
  room counts come back without ever opening the persistent client.
* Wrap the recency-window query in ``_bm25_only_via_sqlite`` with an
  ``id``-ordered fallback so legacy schemas missing ``created_at``
  don't break BM25 search.

New test covers the sqlite-status short-circuit. 1409 tests pass.
igorls pushed a commit that referenced this pull request May 1, 2026
The capacity probe added in #1227 hardcoded a 2,000-row floor for the
"diverged" decision. The comment justifying that number explicitly tied
it to chromadb's *default* sync_threshold of 1,000 — "Two synchronization
windows worth (2 × sync_threshold = 2000) is a safe steady-state ceiling".

#1191 then bumped sync_threshold to 50,000 via _HNSW_BLOAT_GUARD without
updating the floor. Result: any palace created with the bloat guard
flips between OK and DIVERGED on every flush cycle. Steady-state
divergence sits at 0–50K (the natural queue depth), and the 2,000 floor
trips the guardrail the moment the queue exceeds 10% of sqlite_count.
The MCP server then routes search to BM25-only and disables duplicate
detection for ~80% of the write cycle on actively-mined ≥100K palaces,
even though chromadb is behaving correctly.

This change reads the configured `hnsw:sync_threshold` from
`collection_metadata` per palace and scales the floor to 2 × that value.
The 10% relative term and the original #1222 detection capability are
unchanged — a 91%-missing-of-192K palace (the actual #1222 reproducer)
still trips, regardless of whether the collection was created with
sync_threshold=1000 or 50000.

Behavior summary:

| Collection's sync_threshold | New floor | Old floor |
|---|---|---|
| Missing (legacy palace)     | 2000      | 2000 (unchanged)
| 1000 (chromadb default)     | 2000      | 2000 (unchanged)
| 50000 (#1191 bloat guard)   | 100000    | 2000 (the bug)

Tests:
- test_capacity_status_tolerates_lag_under_large_sync_threshold (regression
  for the #1191/#1227 conflict — 100K sqlite + 50K HNSW + sync=50K → OK)
- test_capacity_status_still_flags_real_corruption_under_large_sync (#1222
  shape with bloat-guard collection — still detects corruption)
- test_capacity_status_default_threshold_when_no_sync_metadata (legacy
  palaces without the metadata row use the 2000 fallback floor)
- test_unflushed_path_also_uses_dynamic_floor (the never-flushed branch
  scales too — 30K under sync_threshold=50000 is no longer flagged)

All 18 pre-existing tests in tests/test_hnsw_capacity.py and 45 tests
in tests/test_backends.py still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
igorls added a commit that referenced this pull request May 1, 2026
Three fixes landed on develop after the initial release-prep cut and
were brought in via the develop merge. Document them in the 3.3.4
Bug Fixes section so the release notes reflect what users will
actually receive.

- #1287 - HNSW divergence floor scales with hnsw:sync_threshold
  (resolves a silent-fallback regression introduced by the
  interaction between #1191 and #1227 in this release)
- #1262 - ChromaBackend get_or_create_collection split, fixing the
  stop-hook SIGSEGV class on legacy palaces with mismatched stored
  metadata (#1089)
- #1288 / #1254 - repair --mode max-seq-id heuristic now decodes
  BLOB-typed embeddings.seq_id, restoring the un-poison path added
  in #1135 for palaces where chromadb 1.5.x writes seq_ids natively
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HNSW max_elements frozen at 16K while collection grows to 100K+ entries — MCP server segfaults on every tool call

2 participants