fix: stale HNSW index detection when external process writes to chroma#625
fix: stale HNSW index detection when external process writes to chroma#625yukinoli wants to merge 2 commits intoMemPalace:developfrom
Conversation
When another process (e.g. mempalace mine from OpenClaw) writes embeddings to chroma.sqlite3 while the MCP server is running, the in-process HNSW index becomes stale. Vector queries then fail with "Error finding id" because the index references old IDs. Fix: on every _get_client() call, compare the SQLite embedding count against the last known value. On mismatch, tear down the cached PersistentClient, clear chroma's system cache, and build a fresh one. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Three test classes cover the fix: - TestSqliteEmbeddingCount: the raw count helper handles a missing DB and returns the true count after embeddings are added. - TestStaleIndexDetection: the main regression — simulates an external chromadb client (representing `mempalace mine` or another MCP session) writing to the same palace, and verifies _get_client() rebuilds its cached client and clears the collection cache. - Also verifies: no rebuild when the count is unchanged, and the very first _get_client() call snapshots the count without rebuilding even when the DB already contains data. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
We hit this exact bug independently and solved it differently — happy to compare notes. Our approach (in our fork): st = os.stat(db_path)
inode_changed = st.st_ino and st.st_ino != _palace_db_inode
mtime_changed = st.st_mtime and abs(st.st_mtime - _palace_db_mtime) > 0.01
if _client_cache is None or inode_changed or mtime_changed:
# tear down and reconnectA few tradeoffs worth discussing:
We also added a We're planning to submit our approach as a separate PR for comparison. Happy to collaborate on combining both — e.g. COUNT as a belt-and-suspenders check where mtime isn't reliable, or using your test suite as the foundation since the five regression tests here are solid. |
|
Closing — this is superseded by recently merged PRs to develop. Thank you for the contribution! |
Problem
When another process (e.g.
mempalace minefrom a CLI, or a second MCP session pointed at the same palace) writes embeddings tochroma.sqlite3while an MCP server is already serving queries, the in-process HNSW index becomes stale. Subsequent vector searches fail with:Repro: run
mempalace mine ~/some-dirin one shell while the MCP server holds a cachedPersistentClientin another. Vector search crashes on the next query.I hit this in a setup where an autonomous agent runs the MCP server continuously and a cron job does
mempalace mineevery few hours — the MCP server would start failing silently mid-session until restarted.Fix
On every
_get_client()call, compare the raw SQLite embedding count (read straight fromchroma.sqlite3viasqlite3, bypassing chroma's own cache) against a cached snapshot stored in a new_last_known_countmodule global.PersistentClient, drop the_collection_cache, and build a fresh client. Update the snapshot.The check is cheap — one
SELECT COUNT(*) FROM embeddingsagainst SQLite — so it won't slow down hot paths.Tests
New
tests/test_hnsw_stale.pywith five regression tests:TestSqliteEmbeddingCount::test_returns_zero_when_db_missing— helper handles a palace directory with no chroma.sqlite3 yet.TestSqliteEmbeddingCount::test_returns_actual_count_after_add— helper returns the true count once embeddings are present.TestStaleIndexDetection::test_rebuilds_when_external_process_adds_embeddings— main regression. Seeds the palace, caches a client via_get_client(), simulates an external process by opening a secondchromadb.PersistentClientand adding more embeddings, then verifies the next_get_client()call returns a new client instance and clears_collection_cache.TestStaleIndexDetection::test_no_rebuild_when_count_unchanged— stable count returns the cached client identity.TestStaleIndexDetection::test_rebuild_skipped_on_first_call_even_if_db_exists— first call snapshots without a spurious rebuild.All 33 existing
test_mcp_server.pytests still pass locally.Notes
_get_client()/_sqlite_embedding_count()inmempalace/mcp_server.py— no API changes, no new dependencies._last_known_countstarts at 0; the first-call branch explicitly returns early after snapshotting to avoid treating "was 0, is now N" as a stale-index event on cold start.🤖 Generated with Claude Code