Skip to content

fix: stale HNSW index detection when external process writes to chroma#625

Closed
yukinoli wants to merge 2 commits intoMemPalace:developfrom
yukinoli:fix/hnsw-stale-index
Closed

fix: stale HNSW index detection when external process writes to chroma#625
yukinoli wants to merge 2 commits intoMemPalace:developfrom
yukinoli:fix/hnsw-stale-index

Conversation

@yukinoli
Copy link
Copy Markdown

Problem

When another process (e.g. mempalace mine from a CLI, or a second MCP session pointed at the same palace) writes embeddings to chroma.sqlite3 while an MCP server is already serving queries, the in-process HNSW index becomes stale. Subsequent vector searches fail with:

Error finding id <uuid> in HNSW index

Repro: run mempalace mine ~/some-dir in one shell while the MCP server holds a cached PersistentClient in another. Vector search crashes on the next query.

I hit this in a setup where an autonomous agent runs the MCP server continuously and a cron job does mempalace mine every few hours — the MCP server would start failing silently mid-session until restarted.

Fix

On every _get_client() call, compare the raw SQLite embedding count (read straight from chroma.sqlite3 via sqlite3, bypassing chroma's own cache) against a cached snapshot stored in a new _last_known_count module global.

  • First call: build a fresh client and snapshot the count. No rebuild.
  • Subsequent calls with the same count: return the cached client (no change from prior behavior).
  • Subsequent calls with a changed count: clear chroma's system cache, tear down the cached PersistentClient, drop the _collection_cache, and build a fresh client. Update the snapshot.

The check is cheap — one SELECT COUNT(*) FROM embeddings against SQLite — so it won't slow down hot paths.

Tests

New tests/test_hnsw_stale.py with five regression tests:

  • TestSqliteEmbeddingCount::test_returns_zero_when_db_missing — helper handles a palace directory with no chroma.sqlite3 yet.
  • TestSqliteEmbeddingCount::test_returns_actual_count_after_add — helper returns the true count once embeddings are present.
  • TestStaleIndexDetection::test_rebuilds_when_external_process_adds_embeddings — main regression. Seeds the palace, caches a client via _get_client(), simulates an external process by opening a second chromadb.PersistentClient and adding more embeddings, then verifies the next _get_client() call returns a new client instance and clears _collection_cache.
  • TestStaleIndexDetection::test_no_rebuild_when_count_unchanged — stable count returns the cached client identity.
  • TestStaleIndexDetection::test_rebuild_skipped_on_first_call_even_if_db_exists — first call snapshots without a spurious rebuild.

All 33 existing test_mcp_server.py tests still pass locally.

Notes

  • The fix is localized to _get_client() / _sqlite_embedding_count() in mempalace/mcp_server.py — no API changes, no new dependencies.
  • _last_known_count starts at 0; the first-call branch explicitly returns early after snapshotting to avoid treating "was 0, is now N" as a stale-index event on cold start.

🤖 Generated with Claude Code

yukinoli and others added 2 commits April 11, 2026 14:31
When another process (e.g. mempalace mine from OpenClaw) writes
embeddings to chroma.sqlite3 while the MCP server is running, the
in-process HNSW index becomes stale. Vector queries then fail with
"Error finding id" because the index references old IDs.

Fix: on every _get_client() call, compare the SQLite embedding count
against the last known value. On mismatch, tear down the cached
PersistentClient, clear chroma's system cache, and build a fresh one.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Three test classes cover the fix:

- TestSqliteEmbeddingCount: the raw count helper handles a missing DB
  and returns the true count after embeddings are added.
- TestStaleIndexDetection: the main regression — simulates an external
  chromadb client (representing `mempalace mine` or another MCP session)
  writing to the same palace, and verifies _get_client() rebuilds its
  cached client and clears the collection cache.
- Also verifies: no rebuild when the count is unchanged, and the very
  first _get_client() call snapshots the count without rebuilding even
  when the DB already contains data.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@bensig bensig changed the base branch from main to develop April 11, 2026 22:21
@bensig bensig requested a review from igorls as a code owner April 11, 2026 22:21
@jphein
Copy link
Copy Markdown
Collaborator

jphein commented Apr 12, 2026

We hit this exact bug independently and solved it differently — happy to compare notes.

Our approach (in our fork): _get_client() calls os.stat('chroma.sqlite3') and compares both st_ino (catches file replacement from repair/nuke) and st_mtime (catches in-place external writes). If either changes by more than an epsilon, we tear down and reconnect.

st = os.stat(db_path)
inode_changed = st.st_ino and st.st_ino != _palace_db_inode
mtime_changed = st.st_mtime and abs(st.st_mtime - _palace_db_mtime) > 0.01
if _client_cache is None or inode_changed or mtime_changed:
    # tear down and reconnect

A few tradeoffs worth discussing:

  • No SQL query: stat() is slightly cheaper than SELECT COUNT(*) FROM embeddings, and avoids opening a second SQLite connection on every _get_client() call. On hot paths (every MCP tool call) that adds up.
  • Catches more than embedding adds: mtime changes on metadata updates and deletes too, where the embedding count stays the same. The COUNT approach would miss those cases.
  • Doesn't catch content-only rewrites: if an external process upserts the same number of embeddings, COUNT stays flat but ours catches it via mtime. Tradeoff is roughly symmetric.
  • FAT/exFAT caveat: we guard current_inode != 0 since those filesystems return 0 — same safe fallback applies to mtime (we skip reconnect if stat returns 0.0).

We also added a mempalace_reconnect MCP tool for the edge case where col.update() changes metadata without reliably bumping mtime — useful for giving the AI a manual escape hatch during a long session.

We're planning to submit our approach as a separate PR for comparison. Happy to collaborate on combining both — e.g. COUNT as a belt-and-suspenders check where mtime isn't reliable, or using your test suite as the foundation since the five regression tests here are solid.

@bensig
Copy link
Copy Markdown
Collaborator

bensig commented Apr 12, 2026

Closing — this is superseded by recently merged PRs to develop. Thank you for the contribution!

@bensig bensig closed this Apr 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants