Skip to content

feat: add update command for incremental palace sync#353

Open
phobicdotno wants to merge 25 commits intoMemPalace:developfrom
phobicdotno:feat/update-command
Open

feat: add update command for incremental palace sync#353
phobicdotno wants to merge 25 commits intoMemPalace:developfrom
phobicdotno:feat/update-command

Conversation

@phobicdotno
Copy link
Copy Markdown

Summary

Adds mempalace update <dir> — an incremental sync that detects new, changed, and deleted files without requiring a full re-mine.

  • New files on disk are mined into the palace
  • Changed files (detected via source_mtime) have old drawers removed and are re-mined
  • Deleted files have their stale drawers cleaned up
  • Unchanged files are skipped

Usage

mempalace update ~/projects/my_app
mempalace update ~/projects/my_app --dry-run    # preview changes

Supports --wing, --no-gitignore, --include-ignored, --agent, --dry-run — same flags as mine.

Changes

  • mempalace/miner.py — Added update() function using existing scan_project(), process_file(), and source_mtime tracking
  • mempalace/cli.py — Added update subparser and cmd_update() handler
  • tests/test_miner.py — Added test_update_detects_changes() covering new/changed/deleted file detection

Test plan

  • All tests pass (pytest tests/ -v)
  • Ruff format and lint clean
  • No new dependencies
  • Tests run without API keys or network access

Copy link
Copy Markdown

@web3guru888 web3guru888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: mempalace update — Incremental Sync

Great addition — incremental sync is something I've been wanting for our integration project. The approach of leveraging existing scan_project() and process_file() is exactly right (no parallel reimplementation), and the new/changed/deleted classification is clean. A few observations:

1. Hardcoded limit=100000 on collection.get()

results = collection.get(
    where={"wing": wing},
    include=["metadatas"],
    limit=100000,
)

ChromaDB's get() with limit=None (or omitted) returns all matching results — no pagination needed. The hardcoded 100K silently truncates large palaces: wing gets partially synced, some files appear as "new" every run, and stale drawers never get cleaned up. Since there's no warning when the limit is hit, this would be a hard-to-debug issue.

Suggestion: just drop the limit parameter entirely:

results = collection.get(
    where={"wing": wing},
    include=["metadatas"],
)

2. Silent except Exception on wing query

try:
    results = collection.get(
        where={"wing": wing},
        include=["metadatas"],
        limit=100000,
    )
except Exception:
    results = {"ids": [], "metadatas": []}

If the wing query fails (corrupt DB, permission error, ChromaDB bug), this silently treats the entire palace as empty. Every file becomes "new", gets re-mined, and you end up with duplicate drawers alongside the originals. That's the opposite of what a user expects from update.

At minimum, logging the exception would help. But arguably, if we can't read the existing state, the update should abort rather than silently double everything:

try:
    results = collection.get(
        where={"wing": wing},
        include=["metadatas"],
    )
except Exception as e:
    print(f"  ✗ Failed to read wing '{wing}': {e}")
    print("  Cannot determine current state — aborting update.")
    return

3. Float equality for mtime comparison

if stored_mtime is not None and float(stored_mtime) == current_mtime:
    unchanged += 1

I tested this and ChromaDB does round-trip Python floats bit-exactly (at least on 0.6.x with the SQLite backend), so this works in practice today. But it's worth noting that this is an implementation detail of ChromaDB's storage layer, not a documented guarantee. If they ever switch to a format that loses precision (JSON serialization, for example), this comparison would start seeing every file as "changed".

A small epsilon would make this robust against future storage changes:

if stored_mtime is not None and abs(float(stored_mtime) - current_mtime) < 0.01:
    unchanged += 1

10ms is well below any real file modification granularity (most filesystems have 1-second or coarser resolution). Low priority since it works today, but it's cheap insurance.

4. Test coverage gaps

The single integration test is solid for the happy path, but a couple of edge cases would be nice:

  • Dry-run mode: Verify that --dry-run reports changes but doesn't modify the collection. A quick assert col.count() == initial_count after a dry-run update would do it.
  • Deletion-only path: A test where only files are deleted (no new/changed) to exercise the cleanup path in isolation.
  • Empty palace: Calling update before any mine — currently the non-dry-run path would hit get_collection() which creates the collection, then see everything as new. That's probably fine, but worth a test to document the behavior.

5. Minor: redundant float() cast

palace_files[sf]["source_mtime"] = float(meta["source_mtime"])

…and later:

if stored_mtime is not None and float(stored_mtime) == current_mtime:

The value is already cast to float when building palace_files, so the second float(stored_mtime) is a no-op. Minor, but removing it reduces cognitive load.


Overall this is a clean, well-scoped feature that reuses existing infrastructure correctly. The main things I'd want addressed before merge are #1 (the silent 100K truncation) and #2 (the silent exception on wing query). The rest are nice-to-haves.

Nice work 👍

@phobicdotno phobicdotno force-pushed the feat/update-command branch from 0a5649a to 782f537 Compare April 9, 2026 09:16
@phobicdotno
Copy link
Copy Markdown
Author

Thanks for the thorough review — all great catches. Pushed fixes:

#1 (limit=100000) — Dropped the limit parameter entirely. ChromaDB returns all matching results without it.

#2 (silent except) — Now prints the error and aborts instead of silently treating the palace as empty. No more phantom duplicates.

#3 (float mtime epsilon) — Added abs(stored_mtime - current_mtime) < 0.01 instead of exact equality. Also removed the redundant float() cast (#5).

#4 (tests) — Added three new tests:

  • test_sync_dry_run_does_not_modify — verifies --dry-run doesn't touch the collection
  • test_sync_deletion_only — deletion-only path, confirms all drawers removed
  • test_sync_on_empty_palace — calling sync before any mine, treats everything as new

All 16 tests pass, ruff clean.

Addresses review feedback:
- Remove hardcoded limit=100000 on collection.get()
- Abort on wing query failure instead of silently treating palace as empty
- Use epsilon comparison for mtime floats
- Remove redundant float() cast
- Add tests: dry-run, deletion-only, empty-palace edge cases
@phobicdotno phobicdotno force-pushed the feat/update-command branch from 782f537 to 8b9cbde Compare April 9, 2026 10:41
@phobicdotno
Copy link
Copy Markdown
Author

Also addressed #5 — applied the same epsilon comparison (abs(float(stored_mtime) - current_mtime) < 0.01) to file_already_mined() for consistency with the sync function. The float() cast is kept here since the value comes directly from ChromaDB metadata (may be stored as string).

virgil-at-biocompute and others added 22 commits April 10, 2026 12:09
The initialize handler hardcoded protocolVersion "2024-11-05", which
causes newer MCP clients (e.g. Claude Code) to reject the connection
when they negotiate "2025-11-25" or later.

Echo the client's requested version if it is in the supported set,
otherwise fall back to the latest supported version. This keeps
backwards compatibility with older clients while allowing newer ones
to connect.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Fix command injection in hook script (pass paths via sys.argv)
- Add sanitize_name/sanitize_content validators in config.py
- Add 10MB file size guard + symlink skip in miners
- Fix SQLite connection leak in knowledge_graph.py (reuse connection)
- Use `with conn:` for proper transaction handling
- Consolidate shared palace operations into palace.py
- Add write-ahead log for audit trail on writes/deletes
- Add metadata cache with 30s TTL for status/taxonomy calls
- Upgrade md5 → sha256 for drawer/triple IDs
- Harden file permissions (0o700/0o600)
- Pin chromadb>=0.5.0,<0.7

Based on PR MemPalace#252 by @anthonyonazure with lint fixes applied.

Co-Authored-By: anthonyonazure <[email protected]>
…_DIRS

- _client → _client_cache to match conftest.py reset fixture
- _get_collection now uses _get_client() return value instead of stale ref
- Restore .pytest_cache and other dirs missing from palace.py SKIP_DIRS
The 30s TTL metadata cache returned stale data between test runs and
after write operations. Reverted to direct col.get() reads which match
the original behavior and pass all tests.
Remove datetime.now() from drawer_id hash so same content + wing + room
always produces the same ID. This enables the idempotency check that
returns "already_exists" on duplicate writes.
… harden security

Review fixes (from Sage's review):
- Restore mtime check in file_already_mined (check_mtime=True for miner)
- Restore limit=10000 on MCP metadata fetches to prevent OOM on large palaces
- Apply _SAFE_NAME_RE regex in sanitize_name (was dead code)
- Drop raw_aaak metadata duplication in diary_write
- chmod 0o700 on WAL dir, 0o600 on WAL file
- Add check_same_thread=False on KnowledgeGraph SQLite connection
- Remove __del__ (unreliable) and dead PRAGMA foreign_keys=ON
Covers the check_mtime=True path in palace.py to meet 85% coverage threshold.
Three critical bugfixes:

1. MCP server hangs on null arguments (MemPalace#394) — `params.get("arguments", {})`
   returns None when JSON has `"arguments": null`. Changed to `or {}`.

2. cmd_repair infinite recursion (MemPalace#395) — trailing slash on palace_path
   caused backup_path to be inside the source dir. Strip trailing sep.

3. OOM on large transcript files (MemPalace#396) — split_mega_files.py and
   normalize.py load entire files into memory. Added 500MB safety limit
   with clear skip/error messages.

Closes MemPalace#394, MemPalace#395, MemPalace#396.
…ixes

- Move file size check before try block so IOError propagates cleanly
  (not caught by the except OSError handler below it)
- Wrap os.path.getsize in its own try/except to preserve existing
  test_normalize_io_error behavior on missing files
- Add test_normalize_rejects_large_file (mocked getsize)
- Add test_null_arguments_does_not_hang (MemPalace#394)
- Add test_cmd_repair_trailing_slash_does_not_recurse (MemPalace#395)

532 tests pass locally, 0 regressions.
PyPI release cut covering 39 merged PRs since v3.0.0 on 2026-04-06.
Highlights: Claude/Codex plugin packaging (MemPalace#270), security hardening (MemPalace#387),
honest AAAK stats + benchmark corrections (MemPalace#147), Windows compatibility fixes,
Knowledge Graph WAL mode + batching, 10K limit safety caps, and much more.

See GitHub release notes for full changelog.

Co-authored-by: milla-jovovich <[email protected]>
bump-plugin-version.yml has been failing on every merge to main since
today's security + plugin-packaging work, because it tries to push
directly to main and branch protection blocks it. It also conflicts
with the manual version-management pattern we're currently using
(manual bumps in PRs like MemPalace#409 for 3.1.0).

Renaming to .yml.disabled so GitHub Actions skips it. If we want
auto-bumps later, the workflow needs to open a PR instead of pushing
directly, and coordinate with manual version bumps.

Co-authored-by: milla-jovovich <[email protected]>
* fix: add mcp command with setup guidance

* fix: include --palace guidance in mcp command output

* fix: make mcp guidance commands copy-pastable

---------

Co-authored-by: Milla J <[email protected]>
Complete OpenClaw skill exposing all MCP tools with session protocol,
auto-install spec, and setup instructions for OpenClaw + other MCP hosts.

Covers all 20 tools: search, check_duplicate, status, list_wings,
list_rooms, get_taxonomy, get_aaak_spec, kg_query, kg_add,
kg_invalidate, kg_timeline, kg_stats, traverse, find_tunnels,
graph_stats, add_drawer, delete_drawer, diary_write, diary_read.

Based on PR MemPalace#207 by @wanikua — updated to v3.1.0, added missing tools
(check_duplicate, get_aaak_spec), expanded parameter docs, added
OpenClaw CLI setup command.

Co-Authored-By: wanikua <[email protected]>
Addresses review feedback:
- Remove hardcoded limit=100000 on collection.get()
- Abort on wing query failure instead of silently treating palace as empty
- Use epsilon comparison for mtime floats
- Remove redundant float() cast
- Add tests: dry-run, deletion-only, empty-palace edge cases
@bensig bensig changed the base branch from main to develop April 11, 2026 22:22
@igorls igorls added area/cli CLI commands area/hooks Claude Code hook scripts (Stop, PreCompact, SessionStart) area/install pip/uv/pipx/plugin install and packaging area/kg Knowledge graph area/mcp MCP server and tools area/mining File and conversation mining area/search Search and retrieval enhancement New feature or request labels Apr 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cli CLI commands area/hooks Claude Code hook scripts (Stop, PreCompact, SessionStart) area/install pip/uv/pipx/plugin install and packaging area/kg Knowledge graph area/mcp MCP server and tools area/mining File and conversation mining area/search Search and retrieval enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants