fix: stale HNSW index detection when external process writes to chroma by yukinoli · Pull Request #625 · MemPalace/mempalace

yukinoli · 2026-04-11T12:41:49Z

Problem

When another process (e.g. mempalace mine from a CLI, or a second MCP session pointed at the same palace) writes embeddings to chroma.sqlite3 while an MCP server is already serving queries, the in-process HNSW index becomes stale. Subsequent vector searches fail with:

Error finding id <uuid> in HNSW index

Repro: run mempalace mine ~/some-dir in one shell while the MCP server holds a cached PersistentClient in another. Vector search crashes on the next query.

I hit this in a setup where an autonomous agent runs the MCP server continuously and a cron job does mempalace mine every few hours — the MCP server would start failing silently mid-session until restarted.

Fix

On every _get_client() call, compare the raw SQLite embedding count (read straight from chroma.sqlite3 via sqlite3, bypassing chroma's own cache) against a cached snapshot stored in a new _last_known_count module global.

First call: build a fresh client and snapshot the count. No rebuild.
Subsequent calls with the same count: return the cached client (no change from prior behavior).
Subsequent calls with a changed count: clear chroma's system cache, tear down the cached PersistentClient, drop the _collection_cache, and build a fresh client. Update the snapshot.

The check is cheap — one SELECT COUNT(*) FROM embeddings against SQLite — so it won't slow down hot paths.

Tests

New tests/test_hnsw_stale.py with five regression tests:

TestSqliteEmbeddingCount::test_returns_zero_when_db_missing — helper handles a palace directory with no chroma.sqlite3 yet.
TestSqliteEmbeddingCount::test_returns_actual_count_after_add — helper returns the true count once embeddings are present.
TestStaleIndexDetection::test_rebuilds_when_external_process_adds_embeddings — main regression. Seeds the palace, caches a client via _get_client(), simulates an external process by opening a second chromadb.PersistentClient and adding more embeddings, then verifies the next _get_client() call returns a new client instance and clears _collection_cache.
TestStaleIndexDetection::test_no_rebuild_when_count_unchanged — stable count returns the cached client identity.
TestStaleIndexDetection::test_rebuild_skipped_on_first_call_even_if_db_exists — first call snapshots without a spurious rebuild.

All 33 existing test_mcp_server.py tests still pass locally.

Notes

The fix is localized to _get_client() / _sqlite_embedding_count() in mempalace/mcp_server.py — no API changes, no new dependencies.
_last_known_count starts at 0; the first-call branch explicitly returns early after snapshotting to avoid treating "was 0, is now N" as a stale-index event on cold start.

🤖 Generated with Claude Code

When another process (e.g. mempalace mine from OpenClaw) writes embeddings to chroma.sqlite3 while the MCP server is running, the in-process HNSW index becomes stale. Vector queries then fail with "Error finding id" because the index references old IDs. Fix: on every _get_client() call, compare the SQLite embedding count against the last known value. On mismatch, tear down the cached PersistentClient, clear chroma's system cache, and build a fresh one. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Three test classes cover the fix: - TestSqliteEmbeddingCount: the raw count helper handles a missing DB and returns the true count after embeddings are added. - TestStaleIndexDetection: the main regression — simulates an external chromadb client (representing `mempalace mine` or another MCP session) writing to the same palace, and verifies _get_client() rebuilds its cached client and clears the collection cache. - Also verifies: no rebuild when the count is unchanged, and the very first _get_client() call snapshots the count without rebuilding even when the DB already contains data. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

jphein · 2026-04-12T02:40:31Z

We hit this exact bug independently and solved it differently — happy to compare notes.

Our approach (in our fork): _get_client() calls os.stat('chroma.sqlite3') and compares both st_ino (catches file replacement from repair/nuke) and st_mtime (catches in-place external writes). If either changes by more than an epsilon, we tear down and reconnect.

st = os.stat(db_path)
inode_changed = st.st_ino and st.st_ino != _palace_db_inode
mtime_changed = st.st_mtime and abs(st.st_mtime - _palace_db_mtime) > 0.01
if _client_cache is None or inode_changed or mtime_changed:
    # tear down and reconnect

A few tradeoffs worth discussing:

No SQL query: stat() is slightly cheaper than SELECT COUNT(*) FROM embeddings, and avoids opening a second SQLite connection on every _get_client() call. On hot paths (every MCP tool call) that adds up.
Catches more than embedding adds: mtime changes on metadata updates and deletes too, where the embedding count stays the same. The COUNT approach would miss those cases.
Doesn't catch content-only rewrites: if an external process upserts the same number of embeddings, COUNT stays flat but ours catches it via mtime. Tradeoff is roughly symmetric.
FAT/exFAT caveat: we guard current_inode != 0 since those filesystems return 0 — same safe fallback applies to mtime (we skip reconnect if stat returns 0.0).

We also added a mempalace_reconnect MCP tool for the edge case where col.update() changes metadata without reliably bumping mtime — useful for giving the AI a manual escape hatch during a long session.

We're planning to submit our approach as a separate PR for comparison. Happy to collaborate on combining both — e.g. COUNT as a belt-and-suspenders check where mtime isn't reliable, or using your test suite as the foundation since the five regression tests here are solid.

bensig · 2026-04-12T04:42:04Z

Closing — this is superseded by recently merged PRs to develop. Thank you for the contribution!

yukinoli and others added 2 commits April 11, 2026 14:31

yukinoli requested review from bensig and milla-jovovich as code owners April 11, 2026 12:41

Ari4ka approved these changes Apr 11, 2026

View reviewed changes

bensig changed the base branch from main to develop April 11, 2026 22:21

bensig requested a review from igorls as a code owner April 11, 2026 22:21

jphein mentioned this pull request Apr 12, 2026

fix: detect mtime changes in _get_collection to prevent stale HNSW index #663

Closed

bensig closed this Apr 12, 2026

Ari4ka approved these changes Apr 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: stale HNSW index detection when external process writes to chroma#625

fix: stale HNSW index detection when external process writes to chroma#625
yukinoli wants to merge 2 commits intoMemPalace:developfrom
yukinoli:fix/hnsw-stale-index

yukinoli commented Apr 11, 2026

Uh oh!

jphein commented Apr 12, 2026

Uh oh!

bensig commented Apr 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

yukinoli commented Apr 11, 2026

Problem

Fix

Tests

Notes

Uh oh!

jphein commented Apr 12, 2026

Uh oh!

bensig commented Apr 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants