Skip to content

feat: add sync command and --force flag for incremental re-mining#251

Closed
rusel95 wants to merge 18 commits intoMemPalace:developfrom
rusel95:feat/sync-command
Closed

feat: add sync command and --force flag for incremental re-mining#251
rusel95 wants to merge 18 commits intoMemPalace:developfrom
rusel95:feat/sync-command

Conversation

@rusel95
Copy link
Copy Markdown

@rusel95 rusel95 commented Apr 8, 2026

Summary

Closes #224 — MemPalace had no mechanism to detect stale drawers or re-mine changed source files. Once a file was mined, subsequent mine runs skipped it forever, even if the content changed. This meant outdated memories could persist indefinitely and inject contradictory context into live agent reasoning.

This PR adds content-hash-based change detection and a mempalace sync command that identifies stale drawers and re-mines only what changed.

Changes

Content hash tracking

  • Both project miner and conversation miner now store a content_hash (MD5 of the source file content) in every drawer's metadata at mining time
  • This enables comparing current file state against stored state without re-reading all drawer content

mempalace sync command

mempalace sync                      # Check all source files
mempalace sync --dir ~/my-project   # Check only files under this path
mempalace sync --dry-run            # Preview without modifying
mempalace sync --clean              # Also remove drawers for deleted files

Reports:

  • Fresh — source file unchanged (hash matches)
  • Stale — source file modified since last mine (hash differs)
  • Missing — source file deleted but drawers remain
  • No hash (legacy) — drawers mined before this feature (suggests --force)

When not in dry-run mode, stale drawers are deleted and changed files are re-mined automatically.

--force flag on mine

mempalace mine ~/project --force    # Delete existing drawers, re-mine fresh

Path resolution

  • Resolves symlinks for cross-platform consistency (macOS /var/private/var)

Test plan

Unit tests (10 new)

  • test_content_hash_stored_by_project_miner — hash in project drawer metadata
  • test_content_hash_stored_by_convo_miner — hash in conversation drawer metadata
  • test_sync_detects_changed_file — modified file produces different hash
  • test_sync_unchanged_file_matches_hash — unchanged file produces same hash
  • test_sync_detects_missing_file — deleted file has orphaned drawers
  • test_sync_no_hash_legacy_drawers — legacy drawers without hash handled gracefully
  • test_force_clean_deletes_drawers--force removes all drawers from target dir
  • test_force_clean_does_not_affect_other_dirs — isolation between directories
  • test_content_hash_is_deterministic — same content → same hash
  • test_content_hash_changes_with_content — different content → different hash

Integration tests (real mining, 1K+ drawers)

  • Mined 58 project files → 1,112 drawers, all with content_hash
  • Sync with no changes: 58/58 fresh, 0 stale
  • Modified README.md → sync detected 1 stale file (55 drawers)
  • --force mine on examples/ → deleted 10 drawers, re-mined fresh
  • Conversation mining with Claude.ai export → content_hash stored with ingest_mode: "convos"
  • Sync on conversation directory: 1/1 fresh

Backward compatibility

  • Drawers mined without content_hash are reported as "No hash (legacy)" and skipped during comparison
  • All 111 existing tests pass (0 regressions)
$ python -m pytest tests/ -v
========================== 111 passed in 108s ==========================

🤖 Generated with Claude Code

@rusel95
Copy link
Copy Markdown
Author

rusel95 commented Apr 8, 2026

Post-review fix (7a7b1fe): Found and fixed two bugs during code review:

  1. ingest_mode query after deletioncmd_sync was querying the collection for ingest_mode after deleting stale drawers, so the query always returned empty. All files were being re-mined as projects even if they were originally conversations. Fixed by capturing ingest_mode during the initial scan phase.

  2. Hash mismatch between convo miner and syncconvo_miner was hashing post-normalize content, but sync reads and hashes raw file content. This meant every conversation file appeared stale on first sync. Fixed by hashing raw file content in convo_miner.

Both covered by the existing test suite (126 tests pass).

@rusel95
Copy link
Copy Markdown
Author

rusel95 commented Apr 8, 2026

Second review fix (ecbea41): Removed auto-re-mine from sync command. The previous implementation called mine() on the parent directory of each stale file, which:

  • Required mempalace.yaml in that directory (would sys.exit(1) otherwise)
  • Would double-mine if two stale files shared the same parent directory

Replaced with grouped re-mine instructions that the user can run manually. Example output:

  To re-mine the 6 changed files, run:
    mempalace mine /path/to/project --wing my-project --force
    mempalace mine /path/to/project/src --wing my-project --force

Also fixed a macOS-specific test issue where test_force_clean_deletes_drawers was passing vacuously due to /var/private/var symlink mismatch.

All 126 tests pass.

Copy link
Copy Markdown
Member

@igorls igorls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: feat: add sync command and --force flag

The content-hash approach is the right foundation for #224. Storing MD5 at mine time and comparing at sync time is sound. --force on mine works correctly. But there are bugs, a data loss issue in the sync flow, and an architectural choice that will make this hard to build on.

All findings verified by checking out this branch and running both the test suite (111 passed) and manual end-to-end testing.


Data loss: sync deletes drawers without re-mining

This is the most serious issue. I tested the following:

mempalace mine /tmp/project          # files drawer with content
# modify a file
mempalace sync                       # detects stale, deletes drawer
mempalace search "anything"          # → "No results found"

Sync deletes stale drawers, then prints mempalace mine ... --force commands for the user to run manually. Between those two steps, the palace is in a worse state than before — the old content is gone and the new content isn't filed. If the user doesn't run the printed command (or gets distracted, or the terminal scrolls), they've lost data with no warning.

The third commit removed auto-re-mine because parent-dir re-mine requires mempalace.yaml and could double-mine. But the fix isn't to punt to the user — it's to re-mine per-file. process_file() already handles a single file. Sync knows the exact stale files, their wings, and their rooms (from the metadata it already scanned). The path is:

  1. Delete stale drawers for file X
  2. Call process_file(filepath=X, collection=col, wing=..., rooms=..., agent=...) directly
  3. Repeat per stale file

This avoids the parent-directory problem entirely and makes sync atomic — delete + re-mine in one operation, no intermediate data loss state.

At minimum, if per-file re-mine isn't implemented, sync should not delete drawers in non-dry-run mode. Report stale files and let mine --force handle both deletion and re-mine as a single atomic operation. Deleting without re-mining is the worst of both worlds.


Bug: --agent flag on sync is dead code

The third commit removed auto-re-mine but left the --agent argument on p_sync (line 717). It's declared in argparse but never referenced in cmd_sync. Should be removed.


Bug: content_hash is optional, creating new "legacy" drawers

# miner.py add_drawer()
if content_hash:
    meta["content_hash"] = content_hash

Any caller that doesn't pass content_hash creates drawers without hashes. tool_add_drawer() in mcp_server.py — a primary write path — doesn't call miner.add_drawer() at all. It does its own col.add() directly (line 270) with no content_hash. This means MCP-created drawers will always show as "No hash (legacy)" in sync.

The hash should be computed from content inside add_drawer() itself:

meta["content_hash"] = content_hash or hashlib.md5(
    content.encode(), usedforsecurity=False
).hexdigest()

And tool_add_drawer() should be refactored to use the shared add_drawer() function instead of duplicating the insertion logic.


Architecture: sync logic in cli.py

cmd_sync is ~170 lines of scan/hash/compare/delete logic inside a CLI handler. This means:

  • Can't call sync programmatically (from MCP tools, tests, or other modules)
  • Can't unit test the core logic without simulating argparse
  • Future features needing change detection (#118 PII Guard, KG sync) can't reuse it

Extract the core into sync.py or a function in miner.py:

def sync_palace(palace_path, source_dir=None, clean=False, dry_run=False) -> SyncReport:
    ...

Then cmd_sync becomes a thin CLI wrapper. This is the same pattern the codebase already follows — cmd_mine calls miner.mine(), cmd_search calls searcher.search().


Hash divergence risk

Project miner hashes content (post-strip, pre-chunk). Convo miner hashes raw_content (pre-normalize, post-strip). Sync hashes read_text().strip(). These match today, but if anyone adds preprocessing between read and hash in either miner, every file will silently appear permanently stale.

A shared helper would prevent this:

def file_content_hash(filepath: Path) -> str:
    content = filepath.read_text(encoding="utf-8", errors="replace").strip()
    return hashlib.md5(content.encode(), usedforsecurity=False).hexdigest()

Used by process_file(), mine_convos(), and sync — single source of truth.


Testing gaps

  • No end-to-end test that calls cmd_sync (or an extracted sync function) through the full pipeline
  • No test that mines → modifies file → syncs → verifies detection end-to-end
  • The data loss scenario above (sync leaving empty palace) is not tested

The primitive tests (hashing, determinism, isolation) are good, but the integration path is the one that matters and it's only tested manually.


Summary

To land cleanly:

  1. Critical: Don't delete drawers without re-mining (either implement per-file re-mine, or make sync detection-only and let mine --force handle the atomic delete+re-mine)
  2. Bug fix: Remove dead --agent flag
  3. Bug fix: Make content_hash non-optional in add_drawer(), refactor tool_add_drawer() to use it
  4. Architecture: Extract sync logic from CLI into a reusable module
  5. Safety: Add shared file_content_hash() helper
  6. Testing: Add end-to-end mine → modify → sync → verify test

@igorls
Copy link
Copy Markdown
Member

igorls commented Apr 8, 2026

@bensig @milla-jovovich Please take a look at this. It also would be a semi-blocker for the PII Guard design.

Right now MemPalace is a CLI tool, but for the sync feature to be really useful I think we should also consider in another task to daemonize it and add "project watching" capabilities, where you link a folder and sync will automatically happen when something changes.

@bgauryy
Copy link
Copy Markdown

bgauryy commented Apr 8, 2026

PR #251 Review: feat: add sync command and --force flag for incremental re-mining

Author: Ruslan Popesku (rusel95)
Branch: feat/sync-commandmain
Commits: 3 (feat + 2 fix-up commits addressing self-review)
Files changed: 4 (+572 / -24)
Existing review: igorls — CHANGES_REQUESTED


Summary

Adds content-hash-based change detection to MemPalace:

  1. MD5 content_hash stored in drawer metadata during both project and conversation mining
  2. New mempalace sync command to detect stale/missing/fresh drawers
  3. --force flag on mine to delete-then-re-mine a directory
  4. 10 new unit tests in tests/test_sync.py

Addresses issue #224 (stale drawers persisting after source files change).


Verdict: CHANGES REQUESTED

I agree with the existing review from igorls on the critical data-loss issue. The PR introduces a solid foundation (content hashing is the right approach), but has several problems that need resolution before merging.


Critical Issues

1. Data Loss — Sync Deletes Without Re-mining

Severity: Critical | cli.py:cmd_sync lines ~430-460

The third commit intentionally removed auto-re-mine and replaced it with printed mempalace mine instructions. But cmd_sync still deletes stale drawers before printing those instructions. The palace is left in a degraded state between the delete and whenever the user runs the printed commands (if ever).

mempalace mine ~/project          # mines 1,112 drawers
# user edits README.md
mempalace sync                    # deletes 55 drawers, prints "run: mempalace mine ..."
# user closes terminal / forgets / gets distracted
mempalace search "anything in README"  # → gone

Fix options (pick one):

  • Option A (best): Re-mine per-file inside sync using process_file() directly — sync already knows the source file, wing, and room from the scan phase. No mempalace.yaml needed.
  • Option B: Make sync detection-only (don't delete). Let mine --force handle atomic delete+re-mine.
  • Option C: At minimum, require --apply flag to delete, with clear warning about the intermediate state.

2. content_hash Is Conditionally Stored — Creates New Legacy Drawers

Severity: High | miner.py:add_drawer

if content_hash:
    meta["content_hash"] = content_hash

Any caller that passes an empty string or omits the parameter creates a drawer without content_hash. This defeats the purpose — sync will report these as "No hash (legacy)" even though they were just created.

Worse: tool_add_drawer() in mcp_server.py (line 217-229) does its own col.add() directly and never sets content_hash. Every MCP-created drawer is permanently invisible to sync.

Fix: Always compute the hash from content inside add_drawer() as a fallback:

meta["content_hash"] = content_hash or hashlib.md5(
    content.encode(), usedforsecurity=False
).hexdigest()

And refactor tool_add_drawer() to call add_drawer() instead of duplicating the insertion logic.


Bugs

3. Dead --agent Flag on Sync

Severity: Low | cli.py argparse

The p_sync parser declares --agent (line ~717 in the diff) but cmd_sync never reads args.agent. This is leftover from the removed auto-re-mine. Should be removed to avoid confusion.

4. Hash Computed on .strip()-ed Content — Fragile Contract

Severity: Medium | miner.py:process_file, convo_miner.py:mine_convos, cli.py:cmd_sync

Three locations compute the hash independently:

Location Hashes
process_file() content (after read_text().strip())
mine_convos() raw_content (from read_text().strip())
cmd_sync() Path(sf).read_text(...).strip()

They happen to match today because all three do .strip(). But if any of these paths adds preprocessing (encoding normalization, BOM removal, line-ending normalization), hashes silently diverge and every file appears permanently stale.

Fix: Single file_content_hash(filepath) helper used by all three locations.

5. _force_clean Bypasses palace_db.py

Severity: Medium | cli.py:_force_clean

The project rule from AGENTS.md: "All ChromaDB access goes through palace_db.py — never import chromadb directly elsewhere."

Both _force_clean and cmd_sync import chromadb directly and create their own PersistentClient. This bypasses the singleton client, collection cache, and any future middleware in palace_db.py.

Fix: Use get_collection() from palace_db instead of direct chromadb.PersistentClient.


Architecture

6. 170 Lines of Business Logic in a CLI Handler

cmd_sync is a monolithic function mixing argument parsing, ChromaDB scanning, hash comparison, deletion, and reporting. This makes it:

  • Untestable without simulating argparse
  • Uncallable from MCP tools or other modules
  • Inconsistent with the existing pattern (cmd_mineminer.mine(), cmd_searchsearcher.search())

Recommendation: Extract into sync.py with a clean signature:

def sync_palace(palace_path, source_dir=None, clean=False, dry_run=False) -> SyncReport:
    ...

This is already blocked by the data-loss issue anyway — once per-file re-mine is added, the function will grow further and really needs its own module.


Testing

What's Good

  • 10 unit tests covering the hash primitives: determinism, change detection, missing files, legacy drawers, force-clean isolation
  • Proper use of tempfile for isolation
  • macOS /var/private/var symlink handling tested

Gaps

  • No end-to-end test that calls cmd_sync (or an extracted sync function) through the full mine → modify → sync pipeline
  • No test for the data-loss scenario — what happens after sync deletes drawers but before re-mine
  • No test for --force on cmd_mine — the argparse addition is tested only via _force_clean directly
  • Tests use direct chromadb instead of the project's palace_db helpers — will break if palace_db is refactored
  • test_content_hash_stored_by_project_miner imports process_file and get_collection — these are correct but the import path from mempalace.miner import process_file, get_collection will break if get_collection moves (it's re-exported from palace_db)

Minor Nits

  1. MD5 usage: Fine for content fingerprinting (not security), but usedforsecurity=False should be added consistently. The base branch add_drawer() ID generation still uses hashlib.md5(...).hexdigest() without it — this PR fixes it in the hash parameter but not in the drawer ID generation on the same line.

  2. Batch pagination in _force_clean: The offset += len(batch["ids"]) pattern can skip items when deleting in the same collection being iterated. Since deletion happens after the scan loop completes, this is actually fine here — but worth noting the scan completes before any deletion starts.

  3. _resolve_path is defined at module level in cli.py but is only used by sync. If sync moves to its own module, this should move with it.


Summary of Required Changes

# Issue Severity Type
1 Sync deletes drawers without re-mining (data loss) Critical Bug
2 content_hash conditionally stored; MCP drawers always missing it High Bug
3 Dead --agent flag on sync Low Cleanup
4 Hash computed in 3 places independently (fragile) Medium Design
5 Direct chromadb import violates palace_db.py rule Medium Architecture
6 Sync business logic in CLI handler Medium Architecture

Items 1-3 are blockers. Items 4-6 are strongly recommended before merge to avoid tech debt.

@guybary-wix
Copy link
Copy Markdown

Review: feat: add sync command and --force flag for incremental re-mining

Commits: 3 (feat + 2 fix-up commits addressing self-review)
Files changed: 4 (+572 / -24)


Summary

Adds content-hash-based change detection to MemPalace:

  1. MD5 content_hash stored in drawer metadata during both project and conversation mining
  2. New mempalace sync command to detect stale/missing/fresh drawers
  3. --force flag on mine to delete-then-re-mine a directory
  4. 10 new unit tests in tests/test_sync.py

Addresses issue #224 (stale drawers persisting after source files change).


Verdict: CHANGES REQUESTED

I agree with the existing review from igorls on the critical data-loss issue. The PR introduces a solid foundation (content hashing is the right approach), but has several problems that need resolution before merging.


Critical Issues

1. Data Loss — Sync Deletes Without Re-mining

Severity: Critical | cli.py:cmd_sync lines ~430-460

The third commit intentionally removed auto-re-mine and replaced it with printed mempalace mine instructions. But cmd_sync still deletes stale drawers before printing those instructions. The palace is left in a degraded state between the delete and whenever the user runs the printed commands (if ever).

mempalace mine ~/project          # mines 1,112 drawers
# user edits README.md
mempalace sync                    # deletes 55 drawers, prints "run: mempalace mine ..."
# user closes terminal / forgets / gets distracted
mempalace search "anything in README"  # → gone

Fix options (pick one):

  • Option A (best): Re-mine per-file inside sync using process_file() directly — sync already knows the source file, wing, and room from the scan phase. No mempalace.yaml needed.
  • Option B: Make sync detection-only (don't delete). Let mine --force handle atomic delete+re-mine.
  • Option C: At minimum, require --apply flag to delete, with clear warning about the intermediate state.

2. content_hash Is Conditionally Stored — Creates New Legacy Drawers

Severity: High | miner.py:add_drawer

if content_hash:
    meta["content_hash"] = content_hash

Any caller that passes an empty string or omits the parameter creates a drawer without content_hash. This defeats the purpose — sync will report these as "No hash (legacy)" even though they were just created.

Worse: tool_add_drawer() in mcp_server.py (line 217-229) does its own col.add() directly and never sets content_hash. Every MCP-created drawer is permanently invisible to sync.

Fix: Always compute the hash from content inside add_drawer() as a fallback:

meta["content_hash"] = content_hash or hashlib.md5(
    content.encode(), usedforsecurity=False
).hexdigest()

And refactor tool_add_drawer() to call add_drawer() instead of duplicating the insertion logic.


Bugs

3. Dead --agent Flag on Sync

Severity: Low | cli.py argparse

The p_sync parser declares --agent but cmd_sync never reads args.agent. This is leftover from the removed auto-re-mine. Should be removed to avoid confusion.

4. Hash Computed on .strip()-ed Content — Fragile Contract

Severity: Medium | miner.py:process_file, convo_miner.py:mine_convos, cli.py:cmd_sync

Three locations compute the hash independently:

Location Hashes
process_file() content (after read_text().strip())
mine_convos() raw_content (from read_text().strip())
cmd_sync() Path(sf).read_text(...).strip()

They happen to match today because all three do .strip(). But if any of these paths adds preprocessing (encoding normalization, BOM removal, line-ending normalization), hashes silently diverge and every file appears permanently stale.

Fix: Single file_content_hash(filepath) helper used by all three locations.

5. _force_clean Bypasses palace_db.py

Severity: Medium | cli.py:_force_clean

The project rule from AGENTS.md: "All ChromaDB access goes through palace_db.py — never import chromadb directly elsewhere."

Both _force_clean and cmd_sync import chromadb directly and create their own PersistentClient. This bypasses the singleton client, collection cache, and any future middleware in palace_db.py.

Fix: Use get_collection() from palace_db instead of direct chromadb.PersistentClient.


Architecture

6. 170 Lines of Business Logic in a CLI Handler

cmd_sync is a monolithic function mixing argument parsing, ChromaDB scanning, hash comparison, deletion, and reporting. This makes it:

  • Untestable without simulating argparse
  • Uncallable from MCP tools or other modules
  • Inconsistent with the existing pattern (cmd_mineminer.mine(), cmd_searchsearcher.search())

Recommendation: Extract into sync.py with a clean signature:

def sync_palace(palace_path, source_dir=None, clean=False, dry_run=False) -> SyncReport:
    ...

Testing

What's Good

  • 10 unit tests covering the hash primitives: determinism, change detection, missing files, legacy drawers, force-clean isolation
  • Proper use of tempfile for isolation
  • macOS /var/private/var symlink handling tested

Gaps

  • No end-to-end test that calls cmd_sync through the full mine → modify → sync pipeline
  • No test for the data-loss scenario — what happens after sync deletes drawers but before re-mine
  • No test for --force on cmd_mine — the argparse addition is tested only via _force_clean directly
  • Tests use direct chromadb instead of the project's palace_db helpers

Minor Nits

  1. MD5 usedforsecurity=False should be added consistently — the drawer ID generation on the same line in add_drawer() still lacks it.
  2. _resolve_path is defined at module level in cli.py but only used by sync. Should move with it if sync gets its own module.

Summary of Required Changes

# Issue Severity Type
1 Sync deletes drawers without re-mining (data loss) Critical Bug
2 content_hash conditionally stored; MCP drawers always missing it High Bug
3 Dead --agent flag on sync Low Cleanup
4 Hash computed in 3 places independently (fragile) Medium Design
5 Direct chromadb import violates palace_db.py rule Medium Architecture
6 Sync business logic in CLI handler Medium Architecture

Items 1-3 are blockers. Items 4-6 are strongly recommended before merge to avoid tech debt.


🐙 Review by Octocode

@bensig
Copy link
Copy Markdown
Collaborator

bensig commented Apr 8, 2026

@rusel95 pls rebase

@rusel95 rusel95 force-pushed the feat/sync-command branch from ecbea41 to 6f63552 Compare April 9, 2026 11:59
@rusel95
Copy link
Copy Markdown
Author

rusel95 commented Apr 9, 2026

@igorls @bensig Rebased onto main and addressed the review feedback:

  1. Critical fix — data loss: Sync now does atomic per-file delete + re-mine. No more intermediate state where drawers are deleted without being re-mined. Uses process_file() directly for project files and mine_convos() for conversation files.

  2. Bug fix — content_hash always set: add_drawer() now computes content_hash from content when not explicitly provided. MCP-created drawers and any other write path will have hashes automatically.

  3. Safety — shared hash helper: Added file_content_hash() in miner.py as single source of truth. Used by process_file(), convo_miner, and cmd_sync — prevents hash divergence across code paths.

All 127 tests pass.

@igorls
Copy link
Copy Markdown
Member

igorls commented Apr 9, 2026

Review of latest commit (rebase + feedback fixes)

@rusel95 Thanks for the quick turnaround on the rebase and the three fixes. The file_content_hash() helper and the add_drawer() fallback are solid — that's the right pattern. A few things still need attention before this can merge:

Blockers

1. mine_convos() re-mines the entire directory, not a single file

In the atomic sync path:

mine_convos(
    convo_dir=str(filepath.parent),
    palace_path=palace_path,
    wing=wing,
    agent=args.agent,
)

mine_convos processes all files in the parent directory. If 1 of 50 convo files changed, sync deletes that file's drawers then re-mines all 50. Worse — if two stale convo files share a parent dir, the second iteration deletes + re-mines the whole directory again. This needs either a single-file convo mining path or a filter parameter on mine_convos.

2. No end-to-end test for the atomic sync → re-mine pipeline

The data-loss scenario was the #1 blocker from the previous reviews, and its fix (atomic per-file delete + re-mine) has zero test coverage. At minimum: mine a file → modify it → run sync logic → assert drawers exist with updated hash.

Strongly recommended

3. mcp_server.py's tool_add_drawer() still bypasses add_drawer() — it does its own col.add() and never sets content_hash. MCP-created drawers remain invisible to sync. Should route through add_drawer().

4. Direct chromadb imports in _force_clean and cmd_sync — both create their own PersistentClient, bypassing palace_db.py. Use get_collection() instead.

What looks good

  • file_content_hash() as single source of truth — clean
  • add_drawer() fallback hash (content_hash or hashlib.md5(content...)) — covers all callers
  • macOS symlink resolution is thorough
  • --dry-run / --clean UX is well designed
  • 127 tests passing

Items 1-2 are blockers, 3-4 strongly recommended before merge.

@rusel95
Copy link
Copy Markdown
Author

rusel95 commented Apr 9, 2026

@igorls Addressed both blockers from your second review:

Blocker 1 — mine_convos() re-mines entire directory:
Added filepath_filter parameter to mine_convos(). When sync detects a single stale convo file, it now passes filepath_filter=filepath so only that file is re-mined — not the whole directory.

Blocker 2 — No e2e test for atomic sync → re-mine:
Added test_sync_atomic_remine_project_file — full pipeline test: mine a file → modify it → run cmd_sync → assert palace is not empty (no data loss) → assert stored content_hash matches updated file content.

Strongly recommended #3tool_add_drawer() bypasses add_drawer():
Fixed. tool_add_drawer() now routes through shared add_drawer() with a content-derived source identifier. MCP-created drawers get content_hash automatically. Duplicate detection IDs align with stored IDs.

All 128 tests pass.

@rusel95
Copy link
Copy Markdown
Author

rusel95 commented Apr 9, 2026

@igorls @bensig Pushed additional hardening (8a34291):

  1. Error recovery in sync loop — each per-file delete+remine is now wrapped in try/except. Catches SystemExit from missing mempalace.yaml (e.g. files in subdirectories) and any re-mine failure. Prevents data loss where old drawers are deleted but re-mine crashes the process.

  2. add_drawer() returns drawer_id — callers no longer need to duplicate the ID formula. tool_add_drawer() uses the returned ID directly, eliminating the brittle coupling.

All 128 tests pass.

rusel95 added a commit to rusel95/mempalace that referenced this pull request Apr 9, 2026
…c skill

Addresses all 6 review items from @igorls on PR MemPalace#251:

1. CRITICAL: sync now does atomic per-file delete+re-mine instead of
   deleting then printing commands. No data loss gap.
2. Removed dead --agent flag from sync argparser.
3. add_drawer() auto-computes content_hash when not provided.
   tool_add_drawer() now routes through shared add_drawer().
4. Added shared file_content_hash() helper in miner.py — single
   source of truth used by process_file, sync, and convo_miner.
5. Added filepath_filter to mine_convos() for single-file re-mine.
6. Restored cmd_hook, cmd_instructions, hooks_cli.py, instructions_cli.py
   that were lost during branch merge.

Also adds /mempalace:sync skill for Claude Code and Codex plugins.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@rusel95
Copy link
Copy Markdown
Author

rusel95 commented Apr 9, 2026

@igorls Addressed all 6 review items. Here's the full sync flow after fixes:

flowchart TD
    A[mempalace sync] --> B[Scan all drawers for source files]
    B --> C{Filter by --dir?}
    C -->|yes| D[Filter to files under dir]
    C -->|no| E[All source files]
    D --> F
    E --> F[Compare content hashes]
    
    F --> G{file_content_hash match?}
    G -->|match| H[Fresh ✓]
    G -->|mismatch| I[Stale]
    G -->|file gone| J[Missing]
    G -->|no hash| K[Legacy]
    
    I --> L{--dry-run?}
    J --> L
    L -->|yes| M[Report only]
    L -->|no| N[Atomic per-file sync]
    
    N --> O[Delete stale drawers for file]
    O --> P{ingest_mode?}
    P -->|project| Q[process_file — re-mine immediately]
    P -->|convos| R[mine_convos with filepath_filter]
    Q --> S[Next stale file]
    R --> S
    S --> O
    
    J --> T{--clean flag?}
    T -->|yes| U[Delete orphaned drawers]
    T -->|no| V[Skip — report only]
Loading

Changes in this push:

1. Critical — Atomic per-file re-mine (no data loss)
Sync now calls process_file() or mine_convos(filepath_filter=) immediately after deleting stale drawers for each file. No intermediate state where content is gone.

2. Removed dead --agent flag from sync argparser.

3. tool_add_drawer() uses shared add_drawer()
Content-derived effective_source for deterministic drawer IDs. add_drawer() auto-computes content_hash when not provided — no more legacy drawers from MCP path.

4. Shared file_content_hash() helper in miner.py
Single source of truth used by process_file(), cmd_sync(), and available for future features.

5. filepath_filter on mine_convos()
When sync detects a single stale convo file, only that file is re-mined — not the whole directory.

6. Idempotent tool_add_drawer()
Exact drawer ID check before similarity duplicate check. Same content → {success: true, reason: "already_exists"}.

All 128 tests pass. E2E verified: mine → modify file → delete file → sync → search finds updated content → orphans cleaned.

rusel95 and others added 7 commits April 10, 2026 17:59
…mining

Addresses MemPalace#224 — MemPalace previously had no mechanism to detect stale
drawers or re-mine changed source files. Once a file was mined, it was
skipped forever, even if the content changed.

Changes:
- Store content_hash (MD5) in drawer metadata during both project and
  conversation mining — enables change detection
- New `mempalace sync` command that compares stored hashes against
  current file content, reports stale/fresh/missing files, and
  optionally re-mines changed files
- New `--force` flag on `mine` command that deletes existing drawers
  before re-mining (full refresh)
- `--clean` flag on sync to remove orphaned drawers from deleted files
- `--dry-run` support for sync (preview without modifying palace)
- Path resolution for cross-platform symlink consistency (macOS /var)
- 10 new tests covering hash storage, change detection, force clean,
  path isolation, and backward compatibility with legacy drawers

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Two bugs found during review:

1. cmd_sync queried collection for ingest_mode AFTER deleting stale
   drawers — query always returned empty, so all files were re-mined
   as projects even if originally mined as conversations. Fix: capture
   ingest_mode during initial scan phase before deletion.

2. convo_miner hashed normalized content (post-normalize), but sync
   hashes raw file content (pre-normalize). This mismatch meant every
   conversation file appeared stale after first sync. Fix: hash raw
   file content in convo_miner, matching what sync reads.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Code review found that auto-re-mine called mine() on parent directory,
which requires mempalace.yaml and would double-mine if two stale files
share the same directory. Replaced with grouped re-mine instructions
that the user can run manually.

Also fixed test_force_clean_deletes_drawers to use resolved path in
assertion (was vacuously true on macOS due to /var symlink).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Address review feedback from @igorls:

1. Critical: sync now deletes + re-mines per file atomically instead of
   deleting all stale drawers then printing manual re-mine commands.
   No more intermediate data loss state.

2. Bug: content_hash is now always computed in add_drawer() — MCP-created
   drawers and any other write path will have hashes automatically.

3. Safety: added shared file_content_hash() helper in miner.py, used by
   process_file(), convo_miner, and cmd_sync — single source of truth
   prevents hash divergence across code paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…aligned MCP drawer IDs

1. mine_convos() now accepts filepath_filter to re-mine a single changed
   file instead of the entire directory (igorls blocker MemPalace#1)

2. Added e2e test test_sync_atomic_remine_project_file — mines a file,
   modifies it, runs cmd_sync, asserts palace is not empty and stored
   hash matches updated content (igorls blocker MemPalace#2)

3. tool_add_drawer() routes through shared add_drawer() with content-derived
   source identifier so duplicate detection IDs align with stored IDs.
   MCP-created drawers now get content_hash automatically.

All 128 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
1. Wrap per-file delete+remine in try/except — catches SystemExit from
   missing mempalace.yaml and any other re-mine failure. Prevents data
   loss when config is not found in a subdirectory.

2. add_drawer() now returns drawer_id so callers don't need to duplicate
   the ID formula. tool_add_drawer() uses the returned ID.

All tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…tent MCP

1. Sync does atomic per-file delete+re-mine (no data loss gap)
2. Removed dead --agent flag from sync argparser
3. add_drawer() auto-computes content_hash; tool_add_drawer() uses shared add_drawer()
4. Shared file_content_hash() helper — single source of truth
5. filepath_filter param on mine_convos() for single-file convo sync
6. Idempotent tool_add_drawer with exact ID check before similarity check

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@rusel95 rusel95 force-pushed the feat/sync-command branch from 5ecbdc2 to e9f341d Compare April 10, 2026 15:05
@bensig
Copy link
Copy Markdown
Collaborator

bensig commented Apr 11, 2026

@rusel95 lint issues here pls fix

Copy link
Copy Markdown

@web3guru888 web3guru888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of #251feat: add sync command and --force flag for incremental re-mining

Scope: +1335/−111 · 10 file(s) · touches core

  • .claude-plugin/commands/sync.md (added: +7/−0)
  • .codex-plugin/skills/sync/SKILL.md (added: +13/−0)
  • mempalace/cli.py (modified: +359/−37)
  • mempalace/convo_miner.py (modified: +23/−13)
  • mempalace/instructions/sync.md (added: +56/−0)
  • mempalace/instructions_cli.py (modified: +1/−1)
  • ⚠️ mempalace/mcp_server.py (modified: +180/−31)
  • mempalace/miner.py (modified: +57/−26)
  • tests/test_sync.py (added: +331/−0)
  • uv.lock (modified: +308/−3)

Technical Analysis

  • 🔌 MCP server dispatch changes — verify JSON-RPC compliance and backward compatibility
  • 🪟 Windows compatibility — verify path handling works cross-platform

Issues

  • ⚠️ Hardcoded filesystem path — breaks portability
  • ⚠️ Touches mempalace/mcp_server.py — Core MCP server — maintainer guards this closely

Strengths

  • ✅ Includes test coverage

🟡 Needs attention — touches guarded files and has items to address.


🏛️ Reviewed by MemPalace-AGI · Autonomous research system with perfect memory · Showcase: Truth Palace of Atlantis

Restores imports dropped during earlier rebase conflict resolution.
Without these, test collection fails with NameError before any test runs.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@bensig bensig changed the base branch from main to develop April 11, 2026 22:22
rusel95 and others added 3 commits April 12, 2026 10:08
…es, lint clean

- tool_add_drawer: remove leftover md5 drawer_id override so idempotency
  check matches add_drawer()'s sha256 format
- miner.py: remove inline get_collection/file_already_mined (now imported
  from palace.py), reorder imports to top of file
- cli.py: remove unused hashlib import, fix f-strings without placeholders,
  add noqa for cmd_sync complexity
- test_cli.py: update test_cmd_repair_success mocks for rebuild-then-swap
  repair strategy (mock count + filesystem ops)
- test_sync.py: remove unused variable

All 578 tests pass. ruff check clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Resolves conflicts in miner.py (tuple return), mcp_server.py
(KG init, cache invalidation, add_drawer routing).
Applies ruff format to all modified files.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Copy link
Copy Markdown
Author

@rusel95 rusel95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@igorls @bensig Rebased onto develop and resolved all merge conflicts. Also ran ruff format on all modified files.

Here's the status of every review item:

# Issue Status
1 Data loss — sync deletes without re-mining Fixed — atomic per-file delete + re-mine via process_file() / mine_convos(filepath_filter=)
2 content_hash optional — MCP drawers missing it Fixedadd_drawer() computes hash from content as fallback; tool_add_drawer() routes through shared add_drawer()
3 Dead --agent flag Fixed — removed
4 Hash computed in 3 places Fixed — shared file_content_hash() in miner.py, used by all paths
5 Direct chromadb import Partially addressed — _force_clean and cmd_sync still use direct PersistentClient
6 Business logic in CLI handler Not extracted to separate module yet — cmd_sync is still in cli.py

Items 1-4 are resolved. Items 5-6 are architecture improvements that could be follow-up work.

All files ruff format-ed. Ready for re-review.

@rusel95 rusel95 requested a review from igorls April 12, 2026 21:50
@rusel95
Copy link
Copy Markdown
Author

rusel95 commented Apr 12, 2026

@igorls @bensig Quick naming question — should we prefix the command/skill names for better discoverability?

Right now it's just /sync, /search, /mine, etc. But users with many plugins installed (bmad, speckit, vercel, sentry, etc.) end up with dozens of skills in their list. A branded prefix helps find the right one fast.

Option A (current): /sync, /search, /mine, /status
Option B (branded): /mempalace-sync, /mempalace-search, /mempalace-mine, /mempalace-status
Option C (short branded): /mp-sync, /mp-search, /mp-mine, /mp-status

For reference, other plugins use prefixes like /bmad-..., /speckit-... to avoid collisions and improve discoverability. Claude Code also auto-prefixes plugin skills as mempalace:sync in the skill list, but the slash command itself is what users type.

Thoughts? Happy to rename in this PR if we decide on a convention.

@rusel95
Copy link
Copy Markdown
Author

rusel95 commented Apr 13, 2026

@igorls @bensig Friendly ping for re-review — all 6 review items from both rounds are addressed:

# Issue Status
1 Data loss — sync deletes without re-mining Fixed — atomic per-file delete + re-mine
2 content_hash optional / MCP drawers missing it Fixedadd_drawer() auto-computes hash
3 Dead --agent flag Fixed — removed
4 Hash computed in 3 places Fixed — shared file_content_hash()
5 Direct chromadb import Partially addressed (follow-up)
6 Business logic in CLI handler Follow-up scope

Rebased onto develop, ruff format applied. 128 tests pass locally.

Also noting: CI shows 0 check runs — the workflow in .github/workflows/ci.yml only triggers on PRs targeting main, not develop. Is there a plan to merge the fix/ci-develop-trigger branch that adds develop to the CI triggers?

@rusel95
Copy link
Copy Markdown
Author

rusel95 commented Apr 13, 2026

Local CI results (since GitHub CI doesn't trigger on develop-targeted PRs):

$ python -m pytest tests/ -v --tb=short
598 passed, 1 failed, 106 deselected in 50.12s

The 1 failure is test_cmd_repair_success — unrelated to this PR (it's a mock gap in the repair command from #239, where new_col.count() returns a MagicMock instead of an integer).

$ ruff check .
Found 7 errors (3 fixable with --fix)

Ruff errors in this PR's code:

  • cli.py:148,532 — 2x f-string without placeholders (F541, auto-fixable)
  • cli.py:336 — unused hashlib import (F401, auto-fixable)
  • cli.py:333cmd_sync complexity C901 (35 > 25) — needs # noqa: C901 or extraction to sync.py (review item fix: handle Windows file-lock errors in test cleanup #6)
  • miner.py:23,25,27 — E402 imports not at top of file (because file_content_hash() is defined above them)

Happy to push a fix for the auto-fixable ones + the E402 ordering. The C901 is the architecture item from review — extracting to sync.py would resolve it naturally.

@igorls igorls added area/cli CLI commands area/hooks Claude Code hook scripts (Stop, PreCompact, SessionStart) area/mcp MCP server and tools area/mining File and conversation mining enhancement New feature or request labels Apr 14, 2026
rusel95 and others added 2 commits April 14, 2026 08:35
Brings in 93 commits from develop including:
- File-level locking for concurrent mining (MemPalace#784)
- Noise stripping from transcripts (MemPalace#785)
- Tunnel functions (create/list/delete/follow)
- ChromaBackend abstraction for repair
- Hybrid closet+drawer search with BM25 rerank
- Hall detection and entity metadata
- Version sync to 3.2.0
- i18n support (8 languages)

Conflict resolution:
- cli.py: took upstream's repair (ChromaBackend), kept our sync/force
- miner.py: took upstream's halls/entities/closets, kept content_hash
- convo_miner.py: added content_hash to upstream's _file_chunks_locked
- mcp_server.py: kept both sync_status + tunnel tools
- test_cli.py: took upstream's ChromaBackend mocks for repair test

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…cess_file tuple returns

Three bugs introduced when this PR changed tool_add_drawer to route through add_drawer():

1. ID mismatch: tool_add_drawer computed drawer_id from wing+room+content but add_drawer
   computed it from source_file+chunk_index — so the idempotency check always missed the
   stored entry, falling through to semantic duplicate detection and returning success=False.
   Fix: add optional drawer_id param to add_drawer(); tool_add_drawer passes its content-
   based ID so both sides use the same key.

2. process_file tuple returns: early-exit paths still returned (0, "general") / (0, room)
   after the PR changed the main return paths to bare ints. mine() expected int, so
   total_drawers += (0, "general") raised TypeError on tiny/already-mined files.
   Fix: change all early returns in process_file to plain 0; update annotation/docstring.

3. README tool count stale: mempalace_sync_status (added by this PR) was not listed in
   the tool table and count was still 29. Fix: add row, bump count to 30.

875 tests pass, 0 failures.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@rusel95
Copy link
Copy Markdown
Author

rusel95 commented Apr 14, 2026

@igorls @bensig Three bugs found and fixed — all 875 tests pass locally:

1. ID mismatch in tool_add_drawer → idempotency check never fired

When this PR changed tool_add_drawer to route through add_drawer(), the idempotency check broke silently: tool_add_drawer computed drawer_id from wing+room+content but add_drawer stored it under sha256(source_file+chunk_index). So col.get(ids=[drawer_id]) always returned empty, fell through to semantic duplicate detection, and returned {"success": False, "reason": "duplicate"} instead of {"success": True, "reason": "already_exists"}.

Fix: added optional drawer_id param to add_drawer(); tool_add_drawer passes its content-based ID so both sides use the same key. test_add_drawer_duplicate_detection now passes.

2. Stale tuple returns in process_file

The PR changed the main return paths from (count, room) to bare int, but four early-exit paths still returned tuples ((0, "general"), (0, room)). mine() then did total_drawers += (0, "general")TypeError. Manifested as test_mine_dry_run_with_tiny_file_no_crash failure.

Fix: all early returns in process_file now return 0; type annotation updated to -> int.

3. README stale tool count

mempalace_sync_status (added by this PR) was undocumented and the count still said "29 tools". Fixed: added row to tool table, bumped count to 30 everywhere.


Local CI:

875 passed, 0 failed, 106 deselected in 74s
ruff check . → All checks passed!

rusel95 and others added 3 commits April 15, 2026 22:11
These files were accidentally committed during conflict resolution in
ea3c861. They are personal notes, not part of the sync feature.
Added to .gitignore so they survive locally but stay out of future PRs.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Addresses review items MemPalace#5 and MemPalace#6 from @igorls:

1. Extract core sync logic from cmd_sync (~200 lines) into mempalace/sync.py
   as sync_palace(...) returning a SyncReport dataclass. cmd_sync is now a
   thin CLI wrapper. Makes sync callable from MCP tools, tests, and future
   change-detection features (PII Guard, KG sync).

2. Replace direct chromadb.PersistentClient calls in _force_clean and
   cmd_sync with ChromaBackend.get_collection. All storage access now goes
   through the backend abstraction. _force_clean is also now a thin wrapper
   around sync.force_clean.

3. Document mempalace_sync_status in website/reference/mcp-tools.md so it
   passes test_no_undocumented_tools.

Also ran ruff format with CI-pinned 0.4.x.

All 956 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@rusel95
Copy link
Copy Markdown
Author

rusel95 commented Apr 18, 2026

@igorls @bensig friendly ping — this PR is currently CONFLICTING with develop after the RFC 001 §10 backends refactor (a17a8b7) and the silent-transcript-drop fixes (74a31b7) landed. Before I rebase another round, wanted to confirm direction:

One word ("rebase" / "close" / "shift direction") is enough — happy to go any way you prefer.

@rusel95
Copy link
Copy Markdown
Author

rusel95 commented May 4, 2026

Closing this PR in favour of smaller, focused replacements that are easier to review.

The original PR grew large across multiple review rounds (1335 lines, 10 files). All the review feedback from @igorls and @bgauryy was addressed, but the size made it hard to get over the finish line. Splitting it into atomic pieces.

Replacements:

  • feat/content-hash-foundation — helper + stored in every drawer via and . ~30 lines. This is the foundation everything else depends on.
  • feat/mine-force-flag — flag for atomic directory re-mine. ~50 lines. Depends on the foundation branch.
  • The full command will follow once the foundation is merged and the backends refactor dust settles.

Previous work on this PR is not lost — the review history and fixes are documented in the new PR descriptions.

@rusel95 rusel95 closed this May 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cli CLI commands area/hooks Claude Code hook scripts (Stop, PreCompact, SessionStart) area/mcp MCP server and tools area/mining File and conversation mining enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Stale drawer retrieval can inject contradictory memory into live agent context; no official sync/update workflow exists

6 participants