Lazy DB init: defer ~/.memtomem/memtomem.db creation until first tool call #399

@memtomem


Continuation of the docs-only fix in #381 — addressing the broader behavior the comment thread on that PR identified.

Problem

Any MCP client connecting to memtomem — not just the memtomem-server verify command — creates ~/.memtomem/memtomem.db on handshake. #381 addressed only the docs verify command; the broader vector is that every client (Claude Code, Cursor, Windsurf, Gemini CLI, any MCP session) connecting to memtomem after registration instantiates the DB before the user runs mm init.

Reproduction (observed 2026-04-23, memtomem==0.1.23):

rm -rf ~/.memtomem
claude mcp add memtomem -s user -- uvx --from memtomem memtomem-server
ls ~/.memtomem                  # still absent
claude mcp list                 # → ~/.memtomem suddenly populated

Observed: the directory appeared at the moment of claude mcp list. The exact spawner is not yet pinpointed — claude mcp list itself does a health-check spawn, and a concurrent Claude Code session may also rescan and reconnect when registration lands in ~/.claude.json. Both paths execute the same memtomem-server startup.

Root cause

Startup eagerly creates state at two layers:

  1. server/__init__.py — line 149 pid_dir.mkdir(...) creates ~/.memtomem/; lines 150–173 open + flock .server.pid (advisory lock). Runs on every memtomem-server spawn, including short-lived health-check spawns.
  2. server/lifespan.py:105 — app_lifespan calls create_components(), which runs await storage.initialize() (server/component_factory.py:66), creating ~/.memtomem/memtomem.db with all schemas before the MCP initialize handshake yields.

MCP initialize + tools/list + resources/list do not need the DB — tool metadata is bound at import time in server/__init__.py:42-105 (decorators) and filtered at lines 127–139; neither path depends on component initialization. Only tool handlers and resource handlers actually touch the DB (mem_add / mem_index / mem_search / mem_recall / mem_list / mem_read and the memtomem://* resources in server/resources.py:12-78, all app.storage.* calls).

Note on config read paths (out of scope for lazy init): Mem2MemConfig() + load_config_d() + load_config_overrides() read ~/.memtomem/config.json and ~/.memtomem/config.d/*.json, but both are read-only and no-op when absent (config.py:787, config.py:947). They don't write or create the directory.

Proposal

Defer create_components() from lifespan startup to the first tool call that needs it. Sketch:

  • AppContext gains _components: Components | None, _init_lock: asyncio.Lock, async ensure_initialized() -> Components.
  • app_lifespan loads config, sets up logging, creates webhook manager (storage-free), yields. No create_components, no watcher / scheduler / watchdog start.
  • Tool handlers and resource handlers call await app.ensure_initialized() before touching storage / embedder / index_engine / search_pipeline.
  • First-call init runs create_components, then starts watcher / consolidation scheduler / policy scheduler / health watchdog — the tail of today's app_lifespan moves inside ensure_initialized.
  • embedding_broken moves from AppContext field to a property reading _components.embedding_broken post-init; gate helpers like _check_embedding_mismatch need to ensure-init before checking (or be called from already-init'd handlers).
  • Shutdown in app_lifespan.finally inspects _components; if still None, only webhook cleanup runs.
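The bullets above can be sketched roughly as follows. This is a minimal illustration, not the real implementation: Components, create_components, and the scheduler-startup tail are stand-ins, and only the lock/double-check shape is the point.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Components:
    """Stand-in for the real component bundle (storage, embedder, ...)."""
    storage: object = None
    embedding_broken: bool = False


async def create_components() -> Components:
    # Stand-in for the real factory, which is what opens
    # ~/.memtomem/memtomem.db and runs storage.initialize().
    return Components()


@dataclass
class AppContext:
    _components: Optional[Components] = None
    _init_lock: asyncio.Lock = field(default_factory=asyncio.Lock)

    async def ensure_initialized(self) -> Components:
        # Fast path: already initialized, no lock needed.
        if self._components is not None:
            return self._components
        async with self._init_lock:
            # Double-check after acquiring the lock: a concurrent
            # first call may have finished init while we waited.
            if self._components is None:
                self._components = await create_components()
                # The tail of today's app_lifespan (watcher /
                # scheduler / watchdog startup) would move here.
            return self._components

    @property
    def embedding_broken(self) -> bool:
        # Post-init property; False until the first tool call initializes.
        return self._components is not None and self._components.embedding_broken
```

The double-checked pattern keeps the fast path lock-free after the first call while the lock serializes concurrent first calls into a single create_components invocation.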

Implementation staging

This is not a single-file change. Suggested PR breakdown (or commit ladder if single PR):

  1. Plumbing: AppContext._components/_init_lock/ensure_initialized + _get_app_initialized helper. No behavior change yet (call ensure_initialized from lifespan startup so existing tests pass). Tests for the lock semantics + concurrent first-call.
  2. Handler migration: every @mcp.tool and @mcp.resource and @register action handler — ~30+ files under server/tools/ and server/resources.py — switches from _get_app to _get_app_initialized (or inserts await app.ensure_initialized()). Same-PR audit: embedding_broken reads in tools migrate to property accessor.
  3. Lifespan slimming: remove create_components + watcher / scheduler / watchdog startup from app_lifespan; move into ensure_initialized. Now ~/.memtomem/memtomem.db is no longer created on handshake.
  4. Tests: fresh-state acceptance tests (DB absent after handshake), concurrent first-call race, in-process MCP client doing handshake-only.

Open questions / accepted regressions

  1. main()'s .server.pid + mkdir stays eager: the advisory lock needs early acquisition. This leaves ~/.memtomem/ present after a claude mcp list spawn (though empty, after the atexit unlink at server/__init__.py:173). Follow-up: relocate .server.pid to $XDG_RUNTIME_DIR (ties into #384 "mm uninstall liveness check only sees MCP server pid — silently ignores mm web and other DB writers" and #387 "memtomem-server leaves stale .server.pid on exit — risks mm uninstall liveness false-positive via PID recycling").

  2. Degraded-mode startup-warning visibility regression (#349, "MCP server should degrade gracefully on embedding mismatch instead of fail-fast crash"). Currently _log.warning("Embedding dimension mismatch detected at startup ...") (component_factory.py:78-84) fires when the server boots, so users see it on stderr immediately. Under lazy init the warning fires on the first tool call, and the user only learns of the mismatch via the actionable error in that tool's response. Recovery tools (mem_embedding_reset, mem_status, mem_stats) remain callable. Accepted regression — document in changelog.

  3. Background scheduler/watcher start delay is a real regression, not "near-zero." Today, an idle server (e.g. editor opened in evening, no tool calls until morning) still runs consolidation / policy / health-watchdog in the background. Under lazy init, schedulers don't start until first tool call. Two paths:

    • (a) Decoupled scheduler startup: spawn the schedulers as a separate lifespan task that itself calls ensure_initialized — but then the schedulers immediately trigger DB creation, defeating the goal.
    • (b) Accept the regression: schedulers start on first tool call. An idle server with zero tool calls does no maintenance — consistent with "no DB to maintain."

    (b) is simpler and consistent. Accepted regression — document in changelog; if anyone needs background-without-tool-calls semantics later, that's a separate feature.

  4. Concurrent first-call race. Two tools arriving simultaneously both see _components is None; _init_lock serializes them. Tricky path: the degraded-mode storage swap (component_factory.py:85-99) on second-storage-open failure — this needs to be exercised in a concurrent test.

  5. mem_status on uninitialized state. Simplest: ensure_initialized triggers init. Alternative (return "no state yet" without creating DB) makes status and real tools disagree. Pick the first.

  6. Lazy-init failure observability. When ensure_initialized raises (e.g. DB permission error, embedding provider import failure), the failure surfaces in the requesting tool's response. Format: stderr log via existing logger.error(...) + structured tool error {"error": "initialization_failed", "reason": "<message>"} so MCP clients can render. Record this in docs/troubleshoot.md.

Acceptance

  • rm -rf ~/.memtomem && claude mcp add memtomem -s user -- uvx --from memtomem memtomem-server && sleep 2 && ls ~/.memtomem/memtomem.db → file not found.
  • In-process MCP client integration test: handshake (initialize) + tools/list + resources/list + shutdown leaves ~/.memtomem/memtomem.db absent.
  • In-process MCP client integration test: handshake + ping (if supported) + shutdown leaves DB absent.
  • First mem_search (or any handler-bearing tool) on fresh state creates the DB and returns a result.
  • First memtomem://sources resource fetch on fresh state creates the DB.
  • Concurrent first-call from two tools → single create_components invocation, both responses succeed.
  • mem_embedding_reset callable on legacy-DB fresh state with config.embedding.provider != none (the #349 "degrade gracefully on embedding mismatch" scenario is still recoverable, just discovered on first call instead of at startup).
  • Changelog entry documenting (a) the #349 mismatch startup-warning visibility moving to the first tool call, (b) background schedulers starting on the first tool call rather than at handshake.
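The fresh-state acceptance behavior can be expressed as an in-process test sketch. The App class here is a toy stand-in for the server, not its real API: handshake touches nothing, and the first tool call creates the DB file under a temp dir standing in for ~/.memtomem.

```python
import asyncio
from pathlib import Path


class App:
    """Toy lazy-init app mirroring the proposed behavior: the handshake
    reads only import-time metadata; the first tool call creates
    <home>/memtomem.db (path layout mirrors the issue)."""

    def __init__(self, home: Path):
        self.home = home
        self.db = home / "memtomem.db"

    async def handshake(self) -> list:
        # initialize + tools/list: metadata only, no storage access.
        return ["mem_add", "mem_search"]

    async def mem_search(self, query: str) -> list:
        # First call that needs storage triggers DB creation.
        self.home.mkdir(parents=True, exist_ok=True)
        self.db.touch()
        return []


async def fresh_state_acceptance(home: Path) -> None:
    app = App(home)
    await app.handshake()
    assert not app.db.exists(), "handshake must not create the DB"
    await app.mem_search("anything")
    assert app.db.exists(), "first tool call creates the DB"
```

The real version of this test would drive an in-process MCP client against the actual server with HOME pointed at a temp dir, per the acceptance bullets above.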
