fix: detect mtime changes in _get_client to prevent stale HNSW index#757
`_get_collection()` cached the ChromaDB collection wrapper but only invalidated on process restart. When external tools (CLI mining, scripts) wrote to the palace database in-place, the in-memory HNSW index became stale — vector searches returned stale results until the MCP server process restarted. Fix: check both inode and mtime of `chroma.sqlite3` on every `_get_collection()` call. Inode changes catch full rebuilds (repair/nuke); mtime changes catch in-place writes. Epsilon comparison (0.01s) avoids spurious reconnects from filesystem timestamp granularity. Also adds a `mempalace_reconnect` MCP tool for manual cache flush — useful after metadata-only updates (`col.update()`) that may not reliably change mtime.
… tests

Addresses Copilot review feedback on #663.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- `tests/test_mcp_server.py`: remove unused `result =` assignment (ruff F841) and gate `test_missing_db_invalidates_cache` on non-Windows platforms — on Windows, ChromaDB keeps `chroma.sqlite3` open through the cached client, so `os.remove()` raises `PermissionError` before the cache-invalidation path runs.
- `mempalace/mcp_server.py`: ruff-format multi-global declaration.
#757 landed mtime/inode cache invalidation and `mempalace_reconnect` in `mcp_server._get_client()`. Both are Chroma-specific (stat of `chroma.sqlite3`). They should migrate into `ChromaBackend.get_collection` and `ChromaBackend.close_palace` during the §10 cleanup so the freshness contract lives inside the backend, not in the caller.
Pull request overview
This PR improves MCP server reliability when the underlying ChromaDB SQLite file is modified externally by detecting on-disk DB changes (mtime + inode) and forcing cache invalidation to avoid serving results from a stale in-memory HNSW index.
Changes:
- Extend `_get_client()` cache-invalidation logic to trigger on `chroma.sqlite3` mtime drift (with epsilon) in addition to inode changes, and handle "DB file disappeared" scenarios.
- Add a new MCP maintenance tool `mempalace_reconnect` to explicitly flush caches and reconnect.
- Add tests covering mtime/inode invalidation, missing-DB invalidation, and reconnect tool responses (with a Windows skip for the file-lock case).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `mempalace/mcp_server.py` | Adds mtime-based stale detection in `_get_client()` and introduces `tool_reconnect` exposed as `mempalace_reconnect`. |
| `tests/test_mcp_server.py` | Adds cache-invalidation and reconnect tests; skips the DB-delete test on Windows due to Chroma file locking. |
```python
# Prime the cache
col1 = mcp_server._get_collection()
assert col1 is not None

# Simulate an external write changing the mtime
old_mtime = mcp_server._palace_db_mtime
monkeypatch.setattr(mcp_server, "_palace_db_mtime", old_mtime - 10.0)

# _get_collection should detect the mtime drift and reconnect
col2 = mcp_server._get_collection()
assert col2 is not None
```
The test doesn’t currently assert that the cache was actually invalidated/reconnected—col2 could be the same cached object and this would still pass. Consider asserting that _client_cache and/or _collection_cache was replaced (e.g., identity change) and that _palace_db_mtime was updated to the current stat value after the call.
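The identity-based assertion pattern this comment asks for can be demonstrated with a toy stand-in for the cached-collection logic. `FakeServer` is purely illustrative — it is not the mempalace API — but it mirrors the cache-rebuild-on-drift behavior and shows why `is not` on the returned object proves invalidation where `is not None` does not.

```python
class FakeServer:
    """Toy model of a _get_collection() cache that rebuilds when mtime drifts."""
    MTIME_EPSILON = 0.01

    def __init__(self):
        self._collection_cache = None
        self._palace_db_mtime = 0.0
        self._disk_mtime = 100.0  # stands in for os.stat(...).st_mtime

    def _get_collection(self):
        if (self._collection_cache is None
                or abs(self._disk_mtime - self._palace_db_mtime) > self.MTIME_EPSILON):
            self._collection_cache = object()          # "reconnect": new wrapper object
            self._palace_db_mtime = self._disk_mtime   # record current freshness
        return self._collection_cache

srv = FakeServer()
col1 = srv._get_collection()
assert srv._get_collection() is col1        # warm cache: same identity, no rebuild
srv._palace_db_mtime -= 10.0                # inject drift, as the test does
col2 = srv._get_collection()
assert col2 is not col1                     # identity change proves invalidation
assert srv._palace_db_mtime == srv._disk_mtime  # mtime re-synced after reconnect
```

Both `assert col2 is not None` checks in the original test would pass even if the stale object were returned; the `is not col1` check cannot.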
```python
assert col1 is not None

# Simulate a rebuild that changes the inode
monkeypatch.setattr(mcp_server, "_palace_db_inode", 99999)

col2 = mcp_server._get_collection()
assert col2 is not None
```
Similar to the mtime test, this only checks that the collection is non-None; it doesn’t verify that an inode mismatch caused a reconnect. Add an assertion that the cached client/collection object identity changes (or that _collection_cache was cleared and recreated) when _palace_db_inode is forced to a different value.
Suggested change:

```diff
 assert col1 is not None
+cached_col1 = mcp_server._collection_cache
+assert cached_col1 is col1
 # Simulate a rebuild that changes the inode
 monkeypatch.setattr(mcp_server, "_palace_db_inode", 99999)
 col2 = mcp_server._get_collection()
 assert col2 is not None
+assert col2 is not col1
+assert mcp_server._collection_cache is col2
+assert mcp_server._collection_cache is not cached_col1
```
```python
mcp_server._get_collection()
# The key assertion: the old cached collection was dropped
```
The comment says _get_collection should return None after deleting chroma.sqlite3, but the test never asserts the return value nor that _collection_cache was cleared. Add assertions that the call returns None and that _collection_cache is None to ensure the stale cached collection can’t be served.
Suggested change:

```diff
-mcp_server._get_collection()
+col2 = mcp_server._get_collection()
+assert col2 is None
 # The key assertion: the old cached collection was dropped
+assert mcp_server._collection_cache is None
```
```python
global \
    _client_cache, \
    _collection_cache, \
    _palace_db_inode, \
    _palace_db_mtime, \
    _metadata_cache, \
    _metadata_cache_time
```
Avoid using backslash line continuations in a global statement; it’s brittle and inconsistent with the rest of this module’s style. Prefer two global ... statements (or otherwise restructure) so formatting/linting doesn’t depend on trailing backslashes.
Suggested change:

```diff
-global \
-    _client_cache, \
-    _collection_cache, \
-    _palace_db_inode, \
-    _palace_db_mtime, \
-    _metadata_cache, \
-    _metadata_cache_time
+global _client_cache, _collection_cache, _palace_db_inode, _palace_db_mtime
+global _metadata_cache, _metadata_cache_time
```
```python
global _collection_cache, _palace_db_inode, _palace_db_mtime
_collection_cache = None
_palace_db_inode = 0
_palace_db_mtime = 0.0
try:
    col = _get_collection()
    if col is None:
```
tool_reconnect() is described as forcing a reconnect, but it only clears _collection_cache and relies on inode/mtime detection to recreate _client_cache. If os.stat() fails (permissions/transient errors), _get_client() will reuse the existing _client_cache, so this tool won’t reliably reconnect. Consider explicitly clearing _client_cache and the metadata cache/time here to guarantee a fresh handle.
```python
    }
    return {"success": True, "message": "Reconnected to palace", "drawers": col.count()}
except Exception as e:
    return {"success": False, "error": str(e)}
```
The exception path returns {success: False, error: ...} but omits the message/drawers keys that are present in the other return paths. For easier client handling, return a consistent response shape across success/failure cases (e.g., always include message and drawers).
Suggested change:

```diff
-return {"success": False, "error": str(e)}
+return {
+    "success": False,
+    "message": "Failed to reconnect to palace",
+    "drawers": 0,
+    "error": str(e),
+}
```
PR #761 bumped pyproject.toml to 3.2.0 but missed three other version strings, causing test_version_consistency to fail on develop CI (macos, linux 3.11, windows).

- mempalace/version.py: 3.1.0 → 3.2.0 (unblocks test_version_consistency)
- README.md: version badge shield 3.1.0 → 3.2.0
- integrations/openclaw/SKILL.md: 3.1.0 → 3.2.0
- CHANGELOG.md: rename [Unreleased] → [3.2.0] — 2026-04-13, add entries for #685, #690, #707, #716, #734, #755, #757, #761

Verified locally: 689/689 tests pass, ruff clean.
Main had 9 commits that never back-merged into develop after the v3.2.0 release cycle. Resolving conflicts as follows:

- mempalace/version.py: keep develop (3.3.0 release target)
- README.md: keep develop (Milla's #835 audit is authoritative — main had stale 19 tools / 170 tokens / "30x lossless" / v3.0.0 label)
- hooks/mempal_{save,precompact}_hook.sh: keep develop (#786 reversed the #666 "decision=block" behavior intentionally to stop hooks from making agents write in chat)
- pyproject.toml: auto-merged — keeps develop's 3.3.0 and picks up main's chromadb upper-bound removal (#690)
- CONTRIBUTING.md, mempalace/hooks_cli.py: auto-merged cleanly — picks up main's improvements (fork-first clone, more detailed block-reason strings that name MemPalace and specific tools)
- integrations/openclaw/SKILL.md: bumped 3.2.0 → 3.3.0 (version tracks the package per #761 convention)
- CHANGELOG.md: manual merge — kept develop's preamble + Unreleased v3.3.0 section + footer links; folded main's richer v3.2.0 entries (Packaging section for #690/#761; Bug Fixes #685/#677/#716/#707/#755/#757; Documentation #734/#733) into the v3.2.0 section; deduped the split Documentation sections that auto-merge produced
…nd registry (RFC 001 §10)

Advances RFC 001 §10 cleanup so backend-author PRs (#574 LanceDB, #665 Postgres, #700 Qdrant, #697 hosted, #643 PalaceStore, #381 Qdrant) have a stable target to align against.

Scope (this PR):

- Typed `QueryResult` / `GetResult` dataclasses replace Chroma's dict shape at the `BaseCollection` boundary (§1.3). A transitional `_DictCompatMixin` keeps existing callers working while the attribute-access migration proceeds.
- `BaseCollection` is now kwargs-only across add/upsert/query/get/delete/update, with ABC defaults for `estimated_count`/`close`/`health` and a non-atomic default `update()` (§1.1–1.2).
- `PalaceRef` replaces raw path strings at the backend boundary (§2.2).
- `BaseBackend` ABC with `get_collection`/`close_palace`/`close`/`health`/`detect` (§2.3).
- `mempalace.backends` entry-point group + in-tree registry, with `resolve_backend_for_palace` priority order matching §3.2–3.3.
- `ChromaCollection` normalizes chroma returns into typed results; unknown where-clause operators raise `UnsupportedFilterError` (no silent drop, §1.4).
- `ChromaBackend` absorbs the inode/mtime client-cache freshness check previously duplicated in `mcp_server._get_client()` (§10 + PR #757).
- `searcher.py` migrated to typed-attribute access as the reference call site; remaining callers land in a follow-up.
- pyproject: chroma registered via `[project.entry-points."mempalace.backends"]`.

Out of scope (explicit follow-ups):

- Full caller migration off the dict-compat shim across palace.py, mcp_server.py, miner.py, convo_miner.py, dedup.py, repair.py, exporter.py, palace_graph.py, cli.py, closet_llm.py.
- Embedder injection + three-state `EmbedderIdentityMismatchError` check (§1.5).
- `maintenance_state()` / `run_maintenance()` benchmark hooks (§7.3).
- `AbstractBackendContractSuite` full coverage (§7.1–7.2).
- `mempalace migrate` / `mempalace verify` CLI rewrites through `BaseCollection` (§8).
Tests: 970 passed (up from 967 on develop); new coverage for typed results, empty-result outer-shape preservation, `$regex` rejection, registry lookup, priority resolver, and PalaceRef-kwarg `ChromaBackend.get_collection`.

Refs: #743 (RFC 001), #989 (RFC 002 tracking issue).
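The typed-result shape described above might look roughly like this. The field names follow Chroma's query-return keys and the `empty()` / `embeddings_requested` behavior described in this thread; the exact mempalace definitions may differ, so treat this as a sketch.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueryResult:
    """Typed stand-in for Chroma's dict-of-lists query return:
    one outer list entry per query vector, inner lists per hit."""
    ids: list[list[str]]
    documents: list[list[str]]
    metadatas: list[list[dict]]
    distances: list[list[float]]
    embeddings: Optional[list[list[list[float]]]] = None

    @classmethod
    def empty(cls, n_queries: int = 1, embeddings_requested: bool = False) -> "QueryResult":
        """Preserve the outer-query dimension even when no hits come back."""
        def mk():
            return [[] for _ in range(n_queries)]
        return cls(
            ids=mk(), documents=mk(), metadatas=mk(), distances=mk(),
            embeddings=mk() if embeddings_requested else None,
        )

# Attribute access replaces dict-key access at call sites such as searcher.py:
r = QueryResult.empty(n_queries=2, embeddings_requested=True)
assert len(r.ids) == 2 and r.ids[0] == []
assert r.embeddings is not None and len(r.embeddings) == 2
```

The `embeddings_requested` flag encodes the spec rule from item 3 of the follow-up fixes: a field the caller included carries the outer shape even when every inner hit list is empty.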
Four defects surfaced by the automated review, fixed with targeted tests:

1. `BaseCollection.update()` default now validates that documents / metadatas / embeddings lengths match ids, raising `ValueError` instead of silently misaligning pairs or raising `IndexError` (base.py).
2. `ChromaCollection.query()` now rejects the two ambiguous input shapes up front — neither or both of `query_texts` / `query_embeddings`, and empty input lists — with clear `ValueError` messages rather than delegating to chromadb's less-obvious errors (chroma.py).
3. `QueryResult.empty()` accepts `embeddings_requested=True` to preserve the outer-query dimension with empty hit lists when the caller asked for embeddings, matching the spec rule that included fields carry the outer shape even when empty (base.py). `ChromaCollection.query()` threads this through on the empty-result path (chroma.py).
4. `ChromaBackend` cache-freshness check now matches the semantics from `mcp_server._get_client` (merged via #757) on three edge cases Copilot called out: (a) invalidate when `chroma.sqlite3` disappears while a cached client is held, (b) treat a 0→nonzero stat transition as a change so a cache built when the DB did not yet exist is refreshed, (c) re-stat after `PersistentClient` constructs the DB lazily so freshness reflects the post-creation state (chroma.py).

Tests: 978 passed (up from 970), 8 new tests covering the fixes.
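The length validation in item 1 can be sketched like this. The helper name and signature are simplified stand-ins for the real `BaseCollection.update()` default, which this thread does not quote.

```python
def validate_update_lengths(ids, documents=None, metadatas=None, embeddings=None):
    """Raise ValueError up front (not a late IndexError mid-write) when any
    parallel list's length disagrees with ids, so pairs can never silently
    misalign."""
    for name, values in (("documents", documents),
                        ("metadatas", metadatas),
                        ("embeddings", embeddings)):
        if values is not None and len(values) != len(ids):
            raise ValueError(
                f"{name} has {len(values)} entries but ids has {len(ids)}"
            )

validate_update_lengths(["a", "b"], documents=["x", "y"])  # matching lengths: OK
try:
    validate_update_lengths(["a", "b"], metadatas=[{"k": 1}])
except ValueError as e:
    assert "metadatas" in str(e)  # mismatch named explicitly in the error
```

Checking all three lists before touching storage also means a partially valid call fails atomically rather than after some pairs have been written.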
Every remaining row in "Still ahead of upstream" now carries a status so the reader can tell at a glance whether each change is being upstreamed, pending a PR, or deliberately fork-local.

Dropped:

- "Guard ChromaDB 1.5.x metadata-mismatch segfault" — this row was overstated. The memory file for today's debugging notes that the try-get/except-create pattern is defensive code that never reproduced a specific crash (the actual crashes traced to HNSW drift). Leaving it in "Still ahead" implied an upstream-candidate fix, which it isn't. Code stays in place as defensive, but the README no longer claims it as a fork-ahead feature.

Moved to Superseded:

- "Stale HNSW mtime detection + mempalace_reconnect" — upstream took a different approach in MemPalace#757. Our broader inode+mtime detection and the mempalace_reconnect MCP tool remain as fork-local convenience; they're just not "ahead of upstream" anymore.

Statuses now populated:

- Linked PR number for the 7 changes with active upstream PRs (MemPalace#659, MemPalace#660, MemPalace#661, MemPalace#673 with APPROVED note, MemPalace#999, MemPalace#1000, MemPalace#1005).
- "PR pending" for 3 items that are good candidates but unfiled: epsilon mtime comparison, max_distance parameter, tool output mining.
- "fork-only" for 2 items we keep intentionally without pitching upstream: .blob_seq_ids_migrated marker (narrow), bulk_check_mined (complementary to upstream's MemPalace#784 file-locking).

Legend sentence added above the table explains the three status values. 42 README-claim tests pass.
Replacement for #663 — rebased on `develop` with CI fixes. Authorship preserved (@jphein on the feature commits).

Problem
When external tools write to the palace database (CLI mining, scripts), the MCP server's cached ChromaDB collection becomes stale — its in-memory HNSW index doesn't know about new vectors. Searches return incomplete results until the MCP server process restarts.
`develop` already detects inode changes in `_get_client()` (catches full rebuilds via repair/nuke/purge), but not mtime changes (misses in-place writes from `mempalace mine` and external scripts).

Changes
- `_get_client()` now compares both `st_ino` and `st_mtime` of `chroma.sqlite3`; invalidates the cached client on either change. Epsilon comparison (> 0.01s) avoids spurious reconnects from filesystem timestamp rounding. The `current_inode != 0` guard preserves safe behavior on FAT/exFAT.
- `mempalace_reconnect` MCP tool — explicit cache flush for cases where automatic detection misses (e.g., metadata-only `col.update()` calls) and for tests/scripts needing a guaranteed-fresh handle.

CI fixes in this PR (beyond #663)
- `tests/test_mcp_server.py` — drop unused `result =` assignment (F841).
- `tests/test_mcp_server.py::test_missing_db_invalidates_cache` — `@pytest.mark.skipif(sys.platform == "win32", ...)`. On Windows, ChromaDB keeps `chroma.sqlite3` open via the cached client, so `os.remove()` raises `PermissionError` before the cache-invalidation path runs. Linux/macOS still exercise the path.
- `mempalace/mcp_server.py` — ruff-format the new multi-global declaration.

Relation to RFC 001 (#743)
This fix lands in `mcp_server._get_client()` today. Under RFC 001 §2.6, handle-caching and freshness belong inside each backend, and the equivalent of `tool_reconnect` becomes `backend.close_palace(palace_ref)`. The §10 cleanup PR will migrate this logic into `ChromaBackend.get_collection()` / `ChromaBackend.close_palace()` when it lands. I'll add this PR to the §11 in-flight table in a follow-up to RFC 001.

Credit
Original design + implementation: @jphein (#663). Original approval: @Ari4ka on #663. The feature commits on this branch carry jphein's authorship; only the CI fix commit is mine.
Test plan

- `ruff check .` clean
- `ruff format --check .` clean
- Test suite passes (Windows skips `test_missing_db_invalidates_cache`)

Closes #663.