Skip to content

fix: detect mtime changes in _get_client to prevent stale HNSW index#757

Merged
igorls merged 3 commits intodevelopfrom
fix/mcp-mtime-reconnect-develop
Apr 13, 2026
Merged

fix: detect mtime changes in _get_client to prevent stale HNSW index#757
igorls merged 3 commits intodevelopfrom
fix/mcp-mtime-reconnect-develop

Conversation

@igorls
Copy link
Copy Markdown
Member

@igorls igorls commented Apr 13, 2026

Replacement for #663 — rebased on develop with CI fixes. Authorship preserved (@jphein on the feature commits).

Problem

When external tools write to the palace database (CLI mining, scripts), the MCP server's cached ChromaDB collection becomes stale — its in-memory HNSW index doesn't know about new vectors. Searches return incomplete results until the MCP server process restarts.

develop already detects inode changes in _get_client() (catches full rebuilds via repair/nuke/purge), but not mtime changes (misses in-place writes from mempalace mine and external scripts).

Changes

  1. mtime-based stale index detection_get_client() now compares both st_ino and st_mtime of chroma.sqlite3; invalidates the cached client on either change. Epsilon comparison (> 0.01s) avoids spurious reconnects from filesystem timestamp rounding. The current_inode != 0 guard preserves safe behavior on FAT/exFAT.
  2. mempalace_reconnect MCP tool — explicit cache flush for cases where automatic detection misses (e.g., metadata-only col.update() calls) and for tests/scripts needing a guaranteed-fresh handle.

CI fixes in this PR (beyond #663)

  • tests/test_mcp_server.py — drop unused result = assignment (F841).
  • tests/test_mcp_server.py::test_missing_db_invalidates_cache@pytest.mark.skipif(sys.platform == \"win32\", ...). On Windows, ChromaDB keeps chroma.sqlite3 open via the cached client, so os.remove() raises PermissionError before the cache-invalidation path runs. Linux/macOS still exercise the path.
  • mempalace/mcp_server.py — ruff-format the new multi-global declaration.

Relation to RFC 001 (#743)

This fix lands in mcp_server._get_client() today. Under RFC 001 §2.6, handle-caching and freshness belong inside each backend, and the equivalent of tool_reconnect becomes backend.close_palace(palace_ref). The §10 cleanup PR will migrate this logic into ChromaBackend.get_collection() / ChromaBackend.close_palace() when it lands. I'll add this PR to the §11 in-flight table in a follow-up to RFC 001.

Credit

Original design + implementation: @jphein (#663). Original approval: @Ari4ka on #663. The feature commits on this branch carry jphein's authorship; only the CI fix commit is mine.

Test plan

  • ruff check . clean
  • ruff format --check . clean
  • 5 cache-invalidation tests pass locally on Linux (including test_missing_db_invalidates_cache)
  • Full test_mcp_server.py suite runs — no regressions
  • CI green on all platforms (Windows will skip the file-lock test)

Closes #663.

jphein and others added 3 commits April 13, 2026 01:51
_get_collection() cached the ChromaDB collection wrapper but only
invalidated on process restart. When external tools (CLI mining,
scripts) wrote to the palace database in-place, the in-memory HNSW
index became stale — vector searches returned stale results until the
MCP server process restarted.

Fix: check both inode and mtime of chroma.sqlite3 on every
_get_collection() call. Inode changes catch full rebuilds (repair/nuke);
mtime changes catch in-place writes. Epsilon comparison (0.01s) avoids
spurious reconnects from filesystem timestamp granularity.

Also adds mempalace_reconnect MCP tool for manual cache flush — useful
after metadata-only updates (col.update()) that may not reliably change
mtime.
… tests

Addresses Copilot review feedback on #663.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- tests/test_mcp_server.py: remove unused `result =` assignment
  (ruff F841) and gate test_missing_db_invalidates_cache on
  non-Windows platforms — on Windows, ChromaDB keeps
  chroma.sqlite3 open through the cached client, so os.remove()
  raises PermissionError before the cache-invalidation path runs.
- mempalace/mcp_server.py: ruff-format multi-global declaration.
Copilot AI review requested due to automatic review settings April 13, 2026 04:52
@igorls igorls merged commit e200ce2 into develop Apr 13, 2026
8 checks passed
igorls added a commit that referenced this pull request Apr 13, 2026
#757 landed mtime/inode cache invalidation and mempalace_reconnect
in mcp_server._get_client(). Both are Chroma-specific (stat of
chroma.sqlite3). They should migrate into ChromaBackend.get_collection
and ChromaBackend.close_palace during the §10 cleanup so the freshness
contract lives inside the backend, not in the caller.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves MCP server reliability when the underlying ChromaDB SQLite file is modified externally by detecting on-disk DB changes (mtime + inode) and forcing cache invalidation to avoid serving results from a stale in-memory HNSW index.

Changes:

  • Extend _get_client() cache invalidation logic to trigger on chroma.sqlite3 mtime drift (with epsilon) in addition to inode changes, and handle “DB file disappeared” scenarios.
  • Add a new MCP maintenance tool mempalace_reconnect to explicitly flush caches and reconnect.
  • Add tests covering mtime/inode invalidation, missing DB invalidation, and reconnect tool responses (with Windows skip for the file-lock case).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
mempalace/mcp_server.py Adds mtime-based stale detection in _get_client() and introduces tool_reconnect exposed as mempalace_reconnect.
tests/test_mcp_server.py Adds cache invalidation + reconnect tests; skips the DB-delete test on Windows due to Chroma file locking.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/test_mcp_server.py
Comment on lines +662 to +672
# Prime the cache
col1 = mcp_server._get_collection()
assert col1 is not None

# Simulate an external write changing the mtime
old_mtime = mcp_server._palace_db_mtime
monkeypatch.setattr(mcp_server, "_palace_db_mtime", old_mtime - 10.0)

# _get_collection should detect the mtime drift and reconnect
col2 = mcp_server._get_collection()
assert col2 is not None
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test doesn’t currently assert that the cache was actually invalidated/reconnected—col2 could be the same cached object and this would still pass. Consider asserting that _client_cache and/or _collection_cache was replaced (e.g., identity change) and that _palace_db_mtime was updated to the current stat value after the call.

Copilot uses AI. Check for mistakes.
Comment thread tests/test_mcp_server.py
Comment on lines +684 to +690
assert col1 is not None

# Simulate a rebuild that changes the inode
monkeypatch.setattr(mcp_server, "_palace_db_inode", 99999)

col2 = mcp_server._get_collection()
assert col2 is not None
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the mtime test, this only checks that the collection is non-None; it doesn’t verify that an inode mismatch caused a reconnect. Add an assertion that the cached client/collection object identity changes (or that _collection_cache was cleared and recreated) when _palace_db_inode is forced to a different value.

Suggested change
assert col1 is not None
# Simulate a rebuild that changes the inode
monkeypatch.setattr(mcp_server, "_palace_db_inode", 99999)
col2 = mcp_server._get_collection()
assert col2 is not None
assert col1 is not None
cached_col1 = mcp_server._collection_cache
assert cached_col1 is col1
# Simulate a rebuild that changes the inode
monkeypatch.setattr(mcp_server, "_palace_db_inode", 99999)
col2 = mcp_server._get_collection()
assert col2 is not None
assert col2 is not col1
assert mcp_server._collection_cache is col2
assert mcp_server._collection_cache is not cached_col1

Copilot uses AI. Check for mistakes.
Comment thread tests/test_mcp_server.py
Comment on lines +717 to +718
mcp_server._get_collection()
# The key assertion: the old cached collection was dropped
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says _get_collection should return None after deleting chroma.sqlite3, but the test never asserts the return value nor that _collection_cache was cleared. Add assertions that the call returns None and that _collection_cache is None to ensure the stale cached collection can’t be served.

Suggested change
mcp_server._get_collection()
# The key assertion: the old cached collection was dropped
col2 = mcp_server._get_collection()
assert col2 is None
# The key assertion: the old cached collection was dropped
assert mcp_server._collection_cache is None

Copilot uses AI. Check for mistakes.
Comment thread mempalace/mcp_server.py
Comment on lines +141 to +147
global \
_client_cache, \
_collection_cache, \
_palace_db_inode, \
_palace_db_mtime, \
_metadata_cache, \
_metadata_cache_time
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid using backslash line continuations in a global statement; it’s brittle and inconsistent with the rest of this module’s style. Prefer two global ... statements (or otherwise restructure) so formatting/linting doesn’t depend on trailing backslashes.

Suggested change
global \
_client_cache, \
_collection_cache, \
_palace_db_inode, \
_palace_db_mtime, \
_metadata_cache, \
_metadata_cache_time
global _client_cache, _collection_cache, _palace_db_inode, _palace_db_mtime
global _metadata_cache, _metadata_cache_time

Copilot uses AI. Check for mistakes.
Comment thread mempalace/mcp_server.py
Comment on lines +1019 to +1025
global _collection_cache, _palace_db_inode, _palace_db_mtime
_collection_cache = None
_palace_db_inode = 0
_palace_db_mtime = 0.0
try:
col = _get_collection()
if col is None:
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tool_reconnect() is described as forcing a reconnect, but it only clears _collection_cache and relies on inode/mtime detection to recreate _client_cache. If os.stat() fails (permissions/transient errors), _get_client() will reuse the existing _client_cache, so this tool won’t reliably reconnect. Consider explicitly clearing _client_cache and the metadata cache/time here to guarantee a fresh handle.

Copilot uses AI. Check for mistakes.
Comment thread mempalace/mcp_server.py
}
return {"success": True, "message": "Reconnected to palace", "drawers": col.count()}
except Exception as e:
return {"success": False, "error": str(e)}
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exception path returns {success: False, error: ...} but omits the message/drawers keys that are present in the other return paths. For easier client handling, return a consistent response shape across success/failure cases (e.g., always include message and drawers).

Suggested change
return {"success": False, "error": str(e)}
return {
"success": False,
"message": "Failed to reconnect to palace",
"drawers": 0,
"error": str(e),
}

Copilot uses AI. Check for mistakes.
igorls added a commit that referenced this pull request Apr 13, 2026
PR #761 bumped pyproject.toml to 3.2.0 but missed three other version strings,
causing test_version_consistency to fail on develop CI (macos, linux 3.11, windows).

- mempalace/version.py: 3.1.0 → 3.2.0 (unblocks test_version_consistency)
- README.md: version badge shield 3.1.0 → 3.2.0
- integrations/openclaw/SKILL.md: 3.1.0 → 3.2.0
- CHANGELOG.md: rename [Unreleased] → [3.2.0] — 2026-04-13, add entries
  for #685, #690, #707, #716, #734, #755, #757, #761

Verified locally: 689/689 tests pass, ruff clean.
@igorls igorls mentioned this pull request Apr 13, 2026
4 tasks
@igorls igorls deleted the fix/mcp-mtime-reconnect-develop branch April 13, 2026 22:37
sha2fiddy pushed a commit to sha2fiddy/mempalace that referenced this pull request Apr 13, 2026
PR MemPalace#761 bumped pyproject.toml to 3.2.0 but missed three other version strings,
causing test_version_consistency to fail on develop CI (macos, linux 3.11, windows).

- mempalace/version.py: 3.1.0 → 3.2.0 (unblocks test_version_consistency)
- README.md: version badge shield 3.1.0 → 3.2.0
- integrations/openclaw/SKILL.md: 3.1.0 → 3.2.0
- CHANGELOG.md: rename [Unreleased] → [3.2.0] — 2026-04-13, add entries
  for MemPalace#685, MemPalace#690, MemPalace#707, MemPalace#716, MemPalace#734, MemPalace#755, MemPalace#757, MemPalace#761

Verified locally: 689/689 tests pass, ruff clean.
sha2fiddy pushed a commit to sha2fiddy/mempalace that referenced this pull request Apr 14, 2026
PR MemPalace#761 bumped pyproject.toml to 3.2.0 but missed three other version strings,
causing test_version_consistency to fail on develop CI (macos, linux 3.11, windows).

- mempalace/version.py: 3.1.0 → 3.2.0 (unblocks test_version_consistency)
- README.md: version badge shield 3.1.0 → 3.2.0
- integrations/openclaw/SKILL.md: 3.1.0 → 3.2.0
- CHANGELOG.md: rename [Unreleased] → [3.2.0] — 2026-04-13, add entries
  for MemPalace#685, MemPalace#690, MemPalace#707, MemPalace#716, MemPalace#734, MemPalace#755, MemPalace#757, MemPalace#761

Verified locally: 689/689 tests pass, ruff clean.
igorls added a commit that referenced this pull request Apr 14, 2026
Main had 9 commits that never back-merged into develop after the v3.2.0
release cycle. Resolving conflicts as follows:

- mempalace/version.py: keep develop (3.3.0 release target)
- README.md: keep develop (Milla's #835 audit is authoritative — main
  had stale 19 tools / 170 tokens / "30x lossless" / v3.0.0 label)
- hooks/mempal_{save,precompact}_hook.sh: keep develop (#786 reversed
  the #666 "decision=block" behavior intentionally to stop hooks from
  making agents write in chat)
- pyproject.toml: auto-merged — keeps develop's 3.3.0 and picks up
  main's chromadb upper-bound removal (#690)
- CONTRIBUTING.md, mempalace/hooks_cli.py: auto-merged cleanly —
  picks up main's improvements (fork-first clone, more detailed
  block reason strings that name MemPalace and specific tools)
- integrations/openclaw/SKILL.md: bumped 3.2.0 → 3.3.0 (version
  tracks the package per #761 convention)
- CHANGELOG.md: manual merge — kept develop's preamble + Unreleased
  v3.3.0 section + footer links; folded main's richer v3.2.0 entries
  (Packaging section for #690/#761; Bug Fixes #685/#677/#716/#707/
  #755/#757; Documentation #734/#733) into the v3.2.0 section;
  deduped the split Documentation sections that auto-merge produced
igorls added a commit that referenced this pull request Apr 18, 2026
…nd registry (RFC 001 §10)

Advances RFC 001 §10 cleanup so backend-author PRs (#574 LanceDB, #665 Postgres,
#700 Qdrant, #697 hosted, #643 PalaceStore, #381 Qdrant) have a stable target
to align against.

Scope (this PR):

- Typed QueryResult / GetResult dataclasses replace Chroma's dict shape at
  the BaseCollection boundary (§1.3). A transitional _DictCompatMixin keeps
  existing callers working while the attribute-access migration proceeds.
- BaseCollection is now kwargs-only across add/upsert/query/get/delete/update
  with ABC defaults for estimated_count/close/health and a non-atomic default
  update() (§1.1–1.2).
- PalaceRef replaces raw path strings at the backend boundary (§2.2).
- BaseBackend ABC with get_collection/close_palace/close/health/detect (§2.3).
- mempalace.backends entry-point group + in-tree registry with
  resolve_backend_for_palace priority order matching §3.2–3.3.
- ChromaCollection normalizes chroma returns into typed results; unknown
  where-clause operators raise UnsupportedFilterError (no silent drop, §1.4).
- ChromaBackend absorbs the inode/mtime client-cache freshness check
  previously duplicated in mcp_server._get_client() (§10 + PR #757).
- searcher.py migrated to typed-attribute access as the reference call
  site; remaining callers land in a follow-up.
- pyproject: chroma registered via [project.entry-points."mempalace.backends"].

Out of scope (explicit follow-ups):

- Full caller migration off the dict-compat shim across palace.py,
  mcp_server.py, miner.py, convo_miner.py, dedup.py, repair.py, exporter.py,
  palace_graph.py, cli.py, closet_llm.py.
- Embedder injection + three-state EmbedderIdentityMismatchError check (§1.5).
- maintenance_state() / run_maintenance() benchmark hooks (§7.3).
- AbstractBackendContractSuite full coverage (§7.1–7.2).
- mempalace migrate / mempalace verify CLI rewrites through BaseCollection (§8).

Tests: 970 passed (up from 967 on develop); new coverage for typed results,
empty-result outer-shape preservation, \$regex rejection, registry lookup,
priority resolver, and PalaceRef-kwarg ChromaBackend.get_collection.

Refs: #743 (RFC 001), #989 (RFC 002 tracking issue).
igorls added a commit that referenced this pull request Apr 18, 2026
Four defects surfaced by the automated review, fixed with targeted tests:

1. BaseCollection.update() default now validates that documents / metadatas /
   embeddings lengths match ids, raising ValueError instead of silently
   misaligning pairs or raising IndexError (base.py).

2. ChromaCollection.query() now rejects the two ambiguous input shapes up
   front — neither or both of query_texts / query_embeddings, and empty input
   lists — with clear ValueError messages rather than delegating to chromadb's
   less-obvious errors (chroma.py).

3. QueryResult.empty() accepts embeddings_requested=True to preserve the
   outer-query dimension with empty hit lists when the caller asked for
   embeddings, matching the spec rule that included fields carry the outer
   shape even when empty (base.py). ChromaCollection.query() threads this
   through on the empty-result path (chroma.py).

4. ChromaBackend cache-freshness check now matches the semantics from
   mcp_server._get_client (merged via #757) on three edge cases Copilot
   called out: (a) invalidate when chroma.sqlite3 disappears while a cached
   client is held, (b) treat a 0→nonzero stat transition as a change so a
   cache built when the DB did not yet exist is refreshed, (c) re-stat
   after PersistentClient constructs the DB lazily so freshness reflects
   the post-creation state (chroma.py).

Tests: 978 passed (up from 970), 8 new tests covering the fixes.
jphein added a commit to jphein/mempalace that referenced this pull request Apr 18, 2026
Every remaining row in "Still ahead of upstream" now carries a status
so the reader can tell at a glance whether each change is being
upstreamed, pending a PR, or deliberately fork-local.

Dropped:
- "Guard ChromaDB 1.5.x metadata-mismatch segfault" — this row was
  overstated. The memory file for today's debugging notes that the
  try-get/except-create pattern is defensive code that never
  reproduced a specific crash (the actual crashes traced to HNSW
  drift). Leaving it in "Still ahead" implied an upstream-candidate
  fix, which it isn't. Code stays in place as defensive, but the
  README no longer claims it as a fork-ahead feature.

Moved to Superseded:
- "Stale HNSW mtime detection + mempalace_reconnect" — upstream took
  a different approach in MemPalace#757. Our broader inode+mtime detection
  and the mempalace_reconnect MCP tool remain as fork-local
  convenience; they're just not "ahead of upstream" anymore.

Statuses now populated:
- Linked PR number for the 7 changes with active upstream PRs
  (MemPalace#659, MemPalace#660, MemPalace#661, MemPalace#673 with APPROVED note, MemPalace#999, MemPalace#1000, MemPalace#1005).
- "PR pending" for 3 items that are good candidates but unfiled:
  epsilon mtime comparison, max_distance parameter, tool output
  mining.
- "fork-only" for 2 items we keep intentionally without pitching
  upstream: .blob_seq_ids_migrated marker (narrow), bulk_check_mined
  (complementary to upstream's MemPalace#784 file-locking).

Legend sentence added above the table explains the three status
values. 42 README-claim tests pass.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants