Skip to content

fix(mcp): lazy per-path KnowledgeGraph cache (#1136)#1160

Merged
igorls merged 8 commits intoMemPalace:developfrom
mvalentsev:fix/mcp-kg-lazy-per-path-cache
May 6, 2026
Merged

fix(mcp): lazy per-path KnowledgeGraph cache (#1136)#1160
igorls merged 8 commits intoMemPalace:developfrom
mvalentsev:fix/mcp-kg-lazy-per-path-cache

Conversation

@mvalentsev
Copy link
Copy Markdown
Contributor

@mvalentsev mvalentsev commented Apr 24, 2026

Summary

Replaces the module-level _kg = KnowledgeGraph(...) singleton in
mcp_server.py with a lazy, per-path cache. The KG path is now resolved
on each tool invocation, so a multi-tenant host can rotate
MEMPALACE_PALACE_PATH between calls and have each call address the
correct sqlite file. The ChromaDB side of the server was already
per-call via _get_client(); this brings the KG side in line.

Closes #1136.

Before / After

Before (module scope, import-time):

if _args.palace:
    _kg = KnowledgeGraph(db_path=...)
else:
    _kg = KnowledgeGraph()
# tool call: _kg.add_triple(...)

After:

_kg_by_path: dict[str, KnowledgeGraph] = {}
_kg_cache_lock = threading.Lock()

def _get_kg() -> KnowledgeGraph:
    path = os.path.abspath(_resolve_kg_path())
    kg = _kg_by_path.get(path)       # lock-free hot path
    if kg is not None:
        return kg
    with _kg_cache_lock:              # cold path, double-checked
        kg = _kg_by_path.get(path)
        if kg is None:
            kg = KnowledgeGraph(db_path=path)
            _kg_by_path[path] = kg
    return kg
# tool call: _get_kg().add_triple(...)

Follows Option A from the reporter in #1136.

Context: earlier attempt

A previous attempt in #167 introduced a lazy single-instance init. It
was closed on 2026-04-12 with a note saying it had been reworked in
#647 and #667, but neither of those PRs actually touched the singleton
pattern. That is why #1136 landed on current develop with the eager
init still in place. Easy to verify against upstream/develop at
mempalace/mcp_server.py:101 before this PR.

Scope notes

  • Strict backward-compat for the default path. When the server starts
    without --palace, KG keeps writing to
    ~/.mempalace/knowledge_graph.sqlite3 regardless of
    MEMPALACE_PALACE_PATH. Multi-tenant path rotation applies only
    when --palace was given at startup. The "no --palace but env var
    set" case is the subject of fix: KG path mismatch between MCP server and CLI #540 and is explicitly out of scope here.
  • No atexit cleanup of KG connections; same posture as before this PR.
  • No eviction policy on the cache. Per-process palace count is small;
    each KnowledgeGraph holds one sqlite connection plus a lock.

Test plan

  • _patch_mcp_server helper patches _get_kg instead of _kg
  • test_lazy_init_no_import_side_effect: fresh subprocess import
    under HOME=tmp does not create the sqlite file
  • test_get_kg_returns_same_instance: cache hits return is-equal
  • test_get_kg_different_paths_different_instances: rotating env
    gives distinct KGs
  • test_multi_tenant_env_switch: reporter's scenario (write under
    A, query under B is empty, query under A again returns the fact)
  • test_cache_thread_safe: 16 concurrent _get_kg() end up with
    one shared instance and one cache entry
  • Full pytest tests/ --ignore=tests/benchmarks: 1080 passed
  • ruff check + ruff format --check clean

Breaking changes (internal)

Direct attribute access to mempalace.mcp_server._kg is removed. This
is an underscore-prefixed private attribute; the only in-tree callers
were three test files, all migrated in this PR.

@mvalentsev mvalentsev marked this pull request as ready for review April 24, 2026 08:20
@igorls igorls added bug Something isn't working area/mcp MCP server and tools area/kg Knowledge graph labels Apr 24, 2026
@mvalentsev mvalentsev force-pushed the fix/mcp-kg-lazy-per-path-cache branch 2 times, most recently from 2f93e6c to e9dc63f Compare May 1, 2026 20:33
@Qodo-Free-For-OSS
Copy link
Copy Markdown

Hi, tool_reconnect() clears only the ChromaDB client/collection caches and leaves the per-path KnowledgeGraph cache intact, so after external edits/replacement of knowledge_graph.sqlite3 the server can keep using an old open SQLite connection.

Severity: action required | Category: reliability

How to fix: Invalidate KG cache on reconnect

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

tool_reconnect() currently resets only ChromaDB caches. With the new per-path KG cache, KnowledgeGraph instances can remain open and be reused after an external KG file replacement/rebuild, leading to stale results and lingering handles.

Issue Context

_get_kg() caches KnowledgeGraph(db_path=path) objects in _kg_by_path. KnowledgeGraph holds an open sqlite3.Connection until close() is called.

Fix Focus Areas

  • mempalace/mcp_server.py[108-129]
  • mempalace/mcp_server.py[1342-1381]

Suggested change

In tool_reconnect(), iterate over _kg_by_path.values() and call close() (best-effort), then clear _kg_by_path (optionally under _kg_cache_lock) so the next tool call reopens the KG from disk.

We noticed a couple of other issues in this PR as well - happy to share if helpful.


Spotted by Qodo code review - free for open-source projects.

@mvalentsev mvalentsev force-pushed the fix/mcp-kg-lazy-per-path-cache branch from e9dc63f to 0de41d7 Compare May 2, 2026 13:00
@igorls igorls added this to the v3.3.5 milestone May 2, 2026
mvalentsev added 6 commits May 3, 2026 17:43
…1136)

Swap the module-level KnowledgeGraph singleton for a lazy, per-path
cache keyed by the resolved sqlite path. Import no longer creates a
sqlite file as a side effect, and MCP servers started with --palace
now route KG calls to the correct tenant when MEMPALACE_PALACE_PATH
changes between calls, matching the per-call behavior of _get_client()
on the ChromaDB side.

Default-path behavior is preserved: without --palace at startup, KG
stays on DEFAULT_KG_PATH regardless of env var. The "no --palace but
env var set" case is MemPalace#540's scope and is not changed here.
Direct module-attribute patching of _kg is obsolete after the lazy
cache refactor. Switch test helpers to patch _get_kg instead so the
fixture KG replaces the factory rather than a now-missing singleton.

- tests/test_mcp_server.py: _patch_mcp_server helper
- tests/benchmarks/test_mcp_bench.py: _patch_mcp_config helper
- tests/benchmarks/test_memory_profile.py: inline patch in test_tool_status_repeated_calls
TestKGLazyCache covers the scenarios behind the lazy per-path refactor:

- test_lazy_init_no_import_side_effect: a fresh subprocess import does
  not create ~/.mempalace/knowledge_graph.sqlite3 (what closed PR MemPalace#167
  was aiming at).
- test_get_kg_returns_same_instance: two _get_kg() calls under the same
  resolved path return the same object, cache has one entry.
- test_get_kg_different_paths_different_instances: rotating env var
  produces distinct KGs.
- test_multi_tenant_env_switch: the exact scenario from MemPalace#1136 — write
  under path A, query under path B returns empty, switching back to A
  sees the fact.
- test_cache_thread_safe: 16 threads racing _get_kg() end up with one
  shared instance and one cache entry.
Passing a stripped env dict without SYSTEMROOT/WINDIR breaks Python
bootstrap on Windows (_Py_HashRandomization_Init). Inherit the parent
env and strip MEMPAL* vars instead, then override HOME/USERPROFILE to
the tmp dir.
Inline comments referencing MemPalace#1136 and MemPalace#540 add no information the
identifiers do not already convey. PR description carries the context;
code stays quiet.
tool_reconnect cleared ChromaDB caches but left _kg_by_path entries
intact. After an external replacement of knowledge_graph.sqlite3 the
server kept serving the old open sqlite3.Connection, returning stale
results.

Now iterate _kg_by_path under _kg_cache_lock, call close() best-effort,
and clear the dict so the next tool call reopens the KG from disk.
Two new tests in TestKGLazyCache verify cache invalidation and that a
failing close() does not block the clear.
@mvalentsev mvalentsev force-pushed the fix/mcp-kg-lazy-per-path-cache branch from 0de41d7 to 0a62658 Compare May 3, 2026 12:44
mvalentsev added 2 commits May 3, 2026 21:43
Race scenario: a KG tool handler calls _get_kg() and gets a live
KnowledgeGraph; another thread fires tool_reconnect() between that
return and the handler's kg.add_triple()/kg.query_entity()/etc call.
tool_reconnect drains _kg_by_path and closes the underlying
sqlite3.Connection; the handler then raises sqlite3.ProgrammingError:
'Cannot operate on a closed database', which surfaces as a -32000
to the MCP client even though the user just asked for a reconnect.

New _call_kg(op) helper wraps each handler's kg call in a one-shot
retry: catch exactly sqlite3.ProgrammingError, evict the stale entry
(only if the cache slot still points at the closed instance — another
thread may have already replaced it), and rerun op against a fresh
_get_kg(). Beyond one retry give up so a sustained close-stream
surfaces clearly instead of looping.

All five KG handlers (tool_kg_query, tool_kg_add, tool_kg_invalidate,
tool_kg_timeline, tool_kg_stats) now route through _call_kg.

Tests pin the contract:
  * retries with a fresh KG and returns the second result
  * non-ProgrammingError exceptions propagate without retry
  * gives up after exactly one retry on sustained close
@mvalentsev
Copy link
Copy Markdown
Contributor Author

Heads-up: b8816e0 adds a _call_kg(op) retry wrapper for the concurrent-close race between a KG handler and tool_reconnect. All five KG handlers (tool_kg_query / add / invalidate / timeline / stats) now route through _call_kg(lambda kg: kg.method(...)). On sqlite3.ProgrammingError: Cannot operate on a closed database the wrapper evicts the stale entry from _kg_by_path (only if the slot still points at the closed instance -- another thread may have already replaced it) and retries once with a fresh _get_kg(). Beyond one retry the error propagates -- a sustained close stream is a different bug and a hung loop is worse than a clear failure surface.

Adds import sqlite3 plus three tests in tests/test_mcp_server.py::TestKGLazyCache:

  • test_call_kg_retries_after_concurrent_close -- closed KG -> retry returns from a fresh KG.
  • test_call_kg_does_not_retry_on_other_errors -- non-ProgrammingError exceptions propagate without retry; we don't want the guard masking real bugs.
  • test_call_kg_gives_up_after_one_retry -- the loop is bounded; sustained closes raise.

The body's _get_kg() call-site example is still the right way to access the KG; this is the supplementary safety layer between _get_kg() and the actual sqlite3 call.

@igorls igorls merged commit 53675dd into MemPalace:develop May 6, 2026
6 checks passed
igorls added a commit that referenced this pull request May 6, 2026
…1282 #1167 #1160

Bundled CHANGELOG entries for the seven Tier-1 PRs merged today, including
the behavior-change call-out for #1167 (KG date validators now reject
non-ISO inputs that previously produced silent empty results).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/kg Knowledge graph area/mcp MCP server and tools bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

KnowledgeGraph module-level singleton prevents safe multi-tenant use

3 participants