fix: detect mtime changes in _get_client to prevent stale HNSW index#757
`_get_collection()` cached the ChromaDB collection wrapper but only invalidated on process restart. When external tools (CLI mining, scripts) wrote to the palace database in-place, the in-memory HNSW index became stale — vector searches returned stale results until the MCP server process restarted. Fix: check both inode and mtime of `chroma.sqlite3` on every `_get_collection()` call. Inode changes catch full rebuilds (repair/nuke); mtime changes catch in-place writes. Epsilon comparison (0.01s) avoids spurious reconnects from filesystem timestamp granularity. Also adds a `mempalace_reconnect` MCP tool for manual cache flush — useful after metadata-only updates (`col.update()`) that may not reliably change mtime.
… tests

Addresses Copilot review feedback on #663.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- `tests/test_mcp_server.py`: remove unused `result =` assignment (ruff F841) and gate `test_missing_db_invalidates_cache` on non-Windows platforms — on Windows, ChromaDB keeps `chroma.sqlite3` open through the cached client, so `os.remove()` raises `PermissionError` before the cache-invalidation path runs.
- `mempalace/mcp_server.py`: ruff-format multi-global declaration.
#757 landed mtime/inode cache invalidation and `mempalace_reconnect` in `mcp_server._get_client()`. Both are Chroma-specific (stat of `chroma.sqlite3`). They should migrate into `ChromaBackend.get_collection` and `ChromaBackend.close_palace` during the §10 cleanup so the freshness contract lives inside the backend, not in the caller.
Pull request overview
This PR improves MCP server reliability when the underlying ChromaDB SQLite file is modified externally by detecting on-disk DB changes (mtime + inode) and forcing cache invalidation to avoid serving results from a stale in-memory HNSW index.
Changes:
- Extend `_get_client()` cache-invalidation logic to trigger on `chroma.sqlite3` mtime drift (with epsilon) in addition to inode changes, and handle "DB file disappeared" scenarios.
- Add a new MCP maintenance tool `mempalace_reconnect` to explicitly flush caches and reconnect.
- Add tests covering mtime/inode invalidation, missing-DB invalidation, and reconnect tool responses (with a Windows skip for the file-lock case).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `mempalace/mcp_server.py` | Adds mtime-based stale detection in `_get_client()` and introduces `tool_reconnect` exposed as `mempalace_reconnect`. |
| `tests/test_mcp_server.py` | Adds cache-invalidation and reconnect tests; skips the DB-delete test on Windows due to Chroma file locking. |
```python
# Prime the cache
col1 = mcp_server._get_collection()
assert col1 is not None

# Simulate an external write changing the mtime
old_mtime = mcp_server._palace_db_mtime
monkeypatch.setattr(mcp_server, "_palace_db_mtime", old_mtime - 10.0)

# _get_collection should detect the mtime drift and reconnect
col2 = mcp_server._get_collection()
assert col2 is not None
```
The test doesn’t currently assert that the cache was actually invalidated/reconnected—col2 could be the same cached object and this would still pass. Consider asserting that _client_cache and/or _collection_cache was replaced (e.g., identity change) and that _palace_db_mtime was updated to the current stat value after the call.
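The identity-based assertion pattern this comment asks for can be demonstrated with a toy stand-in for the cached-collection logic. `FakeServer` is purely illustrative — it is not the mempalace API — but it mirrors the cache-rebuild-on-drift behavior and shows why `is not` on the returned object proves invalidation where `is not None` does not.

```python
class FakeServer:
    """Toy model of a _get_collection() cache that rebuilds when mtime drifts."""
    MTIME_EPSILON = 0.01

    def __init__(self):
        self._collection_cache = None
        self._palace_db_mtime = 0.0
        self._disk_mtime = 100.0  # stands in for os.stat(...).st_mtime

    def _get_collection(self):
        if (self._collection_cache is None
                or abs(self._disk_mtime - self._palace_db_mtime) > self.MTIME_EPSILON):
            self._collection_cache = object()          # "reconnect": new wrapper object
            self._palace_db_mtime = self._disk_mtime   # record current freshness
        return self._collection_cache

srv = FakeServer()
col1 = srv._get_collection()
assert srv._get_collection() is col1        # warm cache: same identity, no rebuild
srv._palace_db_mtime -= 10.0                # inject drift, as the test does
col2 = srv._get_collection()
assert col2 is not col1                     # identity change proves invalidation
assert srv._palace_db_mtime == srv._disk_mtime  # mtime re-synced after reconnect
```

Both `assert col2 is not None` checks in the original test would pass even if the stale object were returned; the `is not col1` check cannot.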
```python
assert col1 is not None

# Simulate a rebuild that changes the inode
monkeypatch.setattr(mcp_server, "_palace_db_inode", 99999)

col2 = mcp_server._get_collection()
assert col2 is not None
```
Similar to the mtime test, this only checks that the collection is non-None; it doesn’t verify that an inode mismatch caused a reconnect. Add an assertion that the cached client/collection object identity changes (or that _collection_cache was cleared and recreated) when _palace_db_inode is forced to a different value.
Suggested change:

```diff
 assert col1 is not None
+cached_col1 = mcp_server._collection_cache
+assert cached_col1 is col1
 # Simulate a rebuild that changes the inode
 monkeypatch.setattr(mcp_server, "_palace_db_inode", 99999)
 col2 = mcp_server._get_collection()
 assert col2 is not None
+assert col2 is not col1
+assert mcp_server._collection_cache is col2
+assert mcp_server._collection_cache is not cached_col1
```
```python
mcp_server._get_collection()
# The key assertion: the old cached collection was dropped
```
The comment says _get_collection should return None after deleting chroma.sqlite3, but the test never asserts the return value nor that _collection_cache was cleared. Add assertions that the call returns None and that _collection_cache is None to ensure the stale cached collection can’t be served.
Suggested change:

```diff
-mcp_server._get_collection()
+col2 = mcp_server._get_collection()
+assert col2 is None
 # The key assertion: the old cached collection was dropped
+assert mcp_server._collection_cache is None
```
```python
global \
    _client_cache, \
    _collection_cache, \
    _palace_db_inode, \
    _palace_db_mtime, \
    _metadata_cache, \
    _metadata_cache_time
```
Avoid using backslash line continuations in a global statement; it’s brittle and inconsistent with the rest of this module’s style. Prefer two global ... statements (or otherwise restructure) so formatting/linting doesn’t depend on trailing backslashes.
Suggested change:

```diff
-global \
-    _client_cache, \
-    _collection_cache, \
-    _palace_db_inode, \
-    _palace_db_mtime, \
-    _metadata_cache, \
-    _metadata_cache_time
+global _client_cache, _collection_cache, _palace_db_inode, _palace_db_mtime
+global _metadata_cache, _metadata_cache_time
```
```python
global _collection_cache, _palace_db_inode, _palace_db_mtime
_collection_cache = None
_palace_db_inode = 0
_palace_db_mtime = 0.0
try:
    col = _get_collection()
    if col is None:
```
tool_reconnect() is described as forcing a reconnect, but it only clears _collection_cache and relies on inode/mtime detection to recreate _client_cache. If os.stat() fails (permissions/transient errors), _get_client() will reuse the existing _client_cache, so this tool won’t reliably reconnect. Consider explicitly clearing _client_cache and the metadata cache/time here to guarantee a fresh handle.
```python
    }
    return {"success": True, "message": "Reconnected to palace", "drawers": col.count()}
except Exception as e:
    return {"success": False, "error": str(e)}
```
The exception path returns {success: False, error: ...} but omits the message/drawers keys that are present in the other return paths. For easier client handling, return a consistent response shape across success/failure cases (e.g., always include message and drawers).
Suggested change:

```diff
-return {"success": False, "error": str(e)}
+return {
+    "success": False,
+    "message": "Failed to reconnect to palace",
+    "drawers": 0,
+    "error": str(e),
+}
```
PR #761 bumped pyproject.toml to 3.2.0 but missed three other version strings, causing test_version_consistency to fail on develop CI (macos, linux 3.11, windows).

- mempalace/version.py: 3.1.0 → 3.2.0 (unblocks test_version_consistency)
- README.md: version badge shield 3.1.0 → 3.2.0
- integrations/openclaw/SKILL.md: 3.1.0 → 3.2.0
- CHANGELOG.md: rename [Unreleased] → [3.2.0] — 2026-04-13, add entries for #685, #690, #707, #716, #734, #755, #757, #761

Verified locally: 689/689 tests pass, ruff clean.
Main had 9 commits that never back-merged into develop after the v3.2.0 release cycle. Resolving conflicts as follows:

- mempalace/version.py: keep develop (3.3.0 release target)
- README.md: keep develop (Milla's #835 audit is authoritative — main had stale 19 tools / 170 tokens / "30x lossless" / v3.0.0 label)
- hooks/mempal_{save,precompact}_hook.sh: keep develop (#786 reversed the #666 "decision=block" behavior intentionally to stop hooks from making agents write in chat)
- pyproject.toml: auto-merged — keeps develop's 3.3.0 and picks up main's chromadb upper-bound removal (#690)
- CONTRIBUTING.md, mempalace/hooks_cli.py: auto-merged cleanly — picks up main's improvements (fork-first clone, more detailed block-reason strings that name MemPalace and specific tools)
- integrations/openclaw/SKILL.md: bumped 3.2.0 → 3.3.0 (version tracks the package per #761 convention)
- CHANGELOG.md: manual merge — kept develop's preamble + Unreleased v3.3.0 section + footer links; folded main's richer v3.2.0 entries (Packaging section for #690/#761; Bug Fixes #685/#677/#716/#707/#755/#757; Documentation #734/#733) into the v3.2.0 section; deduped the split Documentation sections that auto-merge produced
…nd registry (RFC 001 §10)

Advances RFC 001 §10 cleanup so backend-author PRs (#574 LanceDB, #665 Postgres, #700 Qdrant, #697 hosted, #643 PalaceStore, #381 Qdrant) have a stable target to align against.

Scope (this PR):

- Typed `QueryResult` / `GetResult` dataclasses replace Chroma's dict shape at the `BaseCollection` boundary (§1.3). A transitional `_DictCompatMixin` keeps existing callers working while the attribute-access migration proceeds.
- `BaseCollection` is now kwargs-only across add/upsert/query/get/delete/update, with ABC defaults for `estimated_count`/`close`/`health` and a non-atomic default `update()` (§1.1–1.2).
- `PalaceRef` replaces raw path strings at the backend boundary (§2.2).
- `BaseBackend` ABC with `get_collection`/`close_palace`/`close`/`health`/`detect` (§2.3).
- `mempalace.backends` entry-point group + in-tree registry, with `resolve_backend_for_palace` priority order matching §3.2–3.3.
- `ChromaCollection` normalizes chroma returns into typed results; unknown where-clause operators raise `UnsupportedFilterError` (no silent drop, §1.4).
- `ChromaBackend` absorbs the inode/mtime client-cache freshness check previously duplicated in `mcp_server._get_client()` (§10 + PR #757).
- `searcher.py` migrated to typed-attribute access as the reference call site; remaining callers land in a follow-up.
- pyproject: chroma registered via `[project.entry-points."mempalace.backends"]`.

Out of scope (explicit follow-ups):

- Full caller migration off the dict-compat shim across palace.py, mcp_server.py, miner.py, convo_miner.py, dedup.py, repair.py, exporter.py, palace_graph.py, cli.py, closet_llm.py.
- Embedder injection + three-state `EmbedderIdentityMismatchError` check (§1.5).
- `maintenance_state()` / `run_maintenance()` benchmark hooks (§7.3).
- `AbstractBackendContractSuite` full coverage (§7.1–7.2).
- `mempalace migrate` / `mempalace verify` CLI rewrites through `BaseCollection` (§8).
Tests: 970 passed (up from 967 on develop); new coverage for typed results, empty-result outer-shape preservation, `$regex` rejection, registry lookup, priority resolver, and PalaceRef-kwarg `ChromaBackend.get_collection`.

Refs: #743 (RFC 001), #989 (RFC 002 tracking issue).
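The typed-result shape described above might look roughly like this. The field names follow Chroma's query-return keys and the `empty()` / `embeddings_requested` behavior described in this thread; the exact mempalace definitions may differ, so treat this as a sketch.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueryResult:
    """Typed stand-in for Chroma's dict-of-lists query return:
    one outer list entry per query vector, inner lists per hit."""
    ids: list[list[str]]
    documents: list[list[str]]
    metadatas: list[list[dict]]
    distances: list[list[float]]
    embeddings: Optional[list[list[list[float]]]] = None

    @classmethod
    def empty(cls, n_queries: int = 1, embeddings_requested: bool = False) -> "QueryResult":
        """Preserve the outer-query dimension even when no hits come back."""
        def mk():
            return [[] for _ in range(n_queries)]
        return cls(
            ids=mk(), documents=mk(), metadatas=mk(), distances=mk(),
            embeddings=mk() if embeddings_requested else None,
        )

# Attribute access replaces dict-key access at call sites such as searcher.py:
r = QueryResult.empty(n_queries=2, embeddings_requested=True)
assert len(r.ids) == 2 and r.ids[0] == []
assert r.embeddings is not None and len(r.embeddings) == 2
```

The `embeddings_requested` flag encodes the spec rule from item 3 of the follow-up fixes: a field the caller included carries the outer shape even when every inner hit list is empty.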
Four defects surfaced by the automated review, fixed with targeted tests:

1. `BaseCollection.update()` default now validates that documents / metadatas / embeddings lengths match ids, raising `ValueError` instead of silently misaligning pairs or raising `IndexError` (base.py).
2. `ChromaCollection.query()` now rejects the two ambiguous input shapes up front — neither or both of `query_texts` / `query_embeddings`, and empty input lists — with clear `ValueError` messages rather than delegating to chromadb's less-obvious errors (chroma.py).
3. `QueryResult.empty()` accepts `embeddings_requested=True` to preserve the outer-query dimension with empty hit lists when the caller asked for embeddings, matching the spec rule that included fields carry the outer shape even when empty (base.py). `ChromaCollection.query()` threads this through on the empty-result path (chroma.py).
4. `ChromaBackend` cache-freshness check now matches the semantics from `mcp_server._get_client` (merged via #757) on three edge cases Copilot called out: (a) invalidate when `chroma.sqlite3` disappears while a cached client is held, (b) treat a 0→nonzero stat transition as a change so a cache built when the DB did not yet exist is refreshed, (c) re-stat after `PersistentClient` constructs the DB lazily so freshness reflects the post-creation state (chroma.py).

Tests: 978 passed (up from 970), 8 new tests covering the fixes.
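The length validation in item 1 can be sketched like this. The helper name and signature are simplified stand-ins for the real `BaseCollection.update()` default, which this thread does not quote.

```python
def validate_update_lengths(ids, documents=None, metadatas=None, embeddings=None):
    """Raise ValueError up front (not a late IndexError mid-write) when any
    parallel list's length disagrees with ids, so pairs can never silently
    misalign."""
    for name, values in (("documents", documents),
                        ("metadatas", metadatas),
                        ("embeddings", embeddings)):
        if values is not None and len(values) != len(ids):
            raise ValueError(
                f"{name} has {len(values)} entries but ids has {len(ids)}"
            )

validate_update_lengths(["a", "b"], documents=["x", "y"])  # matching lengths: OK
try:
    validate_update_lengths(["a", "b"], metadatas=[{"k": 1}])
except ValueError as e:
    assert "metadatas" in str(e)  # mismatch named explicitly in the error
```

Checking all three lists before touching storage also means a partially valid call fails atomically rather than after some pairs have been written.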
Every remaining row in "Still ahead of upstream" now carries a status so the reader can tell at a glance whether each change is being upstreamed, pending a PR, or deliberately fork-local.

Dropped:

- "Guard ChromaDB 1.5.x metadata-mismatch segfault" — this row was overstated. The memory file for today's debugging notes that the try-get/except-create pattern is defensive code that never reproduced a specific crash (the actual crashes traced to HNSW drift). Leaving it in "Still ahead" implied an upstream-candidate fix, which it isn't. Code stays in place as defensive, but the README no longer claims it as a fork-ahead feature.

Moved to Superseded:

- "Stale HNSW mtime detection + mempalace_reconnect" — upstream took a different approach in MemPalace#757. Our broader inode+mtime detection and the mempalace_reconnect MCP tool remain as fork-local convenience; they're just not "ahead of upstream" anymore.

Statuses now populated:

- Linked PR number for the 7 changes with active upstream PRs (MemPalace#659, MemPalace#660, MemPalace#661, MemPalace#673 with APPROVED note, MemPalace#999, MemPalace#1000, MemPalace#1005).
- "PR pending" for 3 items that are good candidates but unfiled: epsilon mtime comparison, max_distance parameter, tool output mining.
- "fork-only" for 2 items we keep intentionally without pitching upstream: .blob_seq_ids_migrated marker (narrow), bulk_check_mined (complementary to upstream's MemPalace#784 file-locking).

Legend sentence added above the table explains the three status values. 42 README-claim tests pass.
Replacement for #663 — rebased on `develop` with CI fixes. Authorship preserved (@jphein on the feature commits).

Problem
When external tools write to the palace database (CLI mining, scripts), the MCP server's cached ChromaDB collection becomes stale — its in-memory HNSW index doesn't know about new vectors. Searches return incomplete results until the MCP server process restarts.
`develop` already detects inode changes in `_get_client()` (catches full rebuilds via repair/nuke/purge), but not mtime changes (misses in-place writes from `mempalace mine` and external scripts).

Changes
- `_get_client()` now compares both `st_ino` and `st_mtime` of `chroma.sqlite3`; invalidates the cached client on either change. Epsilon comparison (> 0.01s) avoids spurious reconnects from filesystem timestamp rounding. The `current_inode != 0` guard preserves safe behavior on FAT/exFAT.
- `mempalace_reconnect` MCP tool — explicit cache flush for cases where automatic detection misses (e.g., metadata-only `col.update()` calls) and for tests/scripts needing a guaranteed-fresh handle.

CI fixes in this PR (beyond #663)
- `tests/test_mcp_server.py` — drop unused `result =` assignment (F841).
- `tests/test_mcp_server.py::test_missing_db_invalidates_cache` — `@pytest.mark.skipif(sys.platform == "win32", ...)`. On Windows, ChromaDB keeps `chroma.sqlite3` open via the cached client, so `os.remove()` raises `PermissionError` before the cache-invalidation path runs. Linux/macOS still exercise the path.
- `mempalace/mcp_server.py` — ruff-format the new multi-global declaration.

Relation to RFC 001 (#743)
This fix lands in `mcp_server._get_client()` today. Under RFC 001 §2.6, handle-caching and freshness belong inside each backend, and the equivalent of `tool_reconnect` becomes `backend.close_palace(palace_ref)`. The §10 cleanup PR will migrate this logic into `ChromaBackend.get_collection()` / `ChromaBackend.close_palace()` when it lands. I'll add this PR to the §11 in-flight table in a follow-up to RFC 001.

Credit
Original design + implementation: @jphein (#663). Original approval: @Ari4ka on #663. The feature commits on this branch carry jphein's authorship; only the CI fix commit is mine.
Test plan

- `ruff check .` clean
- `ruff format --check .` clean
- Test suite passes (Windows skips `test_missing_db_invalidates_cache`)

Closes #663.