Skip to content

feat(cli): mempalace mined + purge --source-file#7

Merged
jphein merged 2 commits intomainfrom
feat/mine-by-source-file
May 6, 2026
Merged

feat(cli): mempalace mined + purge --source-file#7
jphein merged 2 commits intomainfrom
feat/mine-by-source-file

Conversation

@jphein
Copy link
Copy Markdown
Owner

@jphein jphein commented May 6, 2026

Reopened after parent #2 squash-merge closed the original PR (#4).

Closes the 'removing manually mined data' half of JP's mining-management ask. Adds:

  • mempalace purge --source-file <path> (extends existing purge with a third filter)
  • mempalace mined (lists mined source files grouped by wing × source_file)

Addresses 4 Copilot findings from original PR #4:

  • wing filter pushed into chromadb where clause (no full-collection scan)
  • argparse rejects negative --limit at parse time
  • module __doc__ epilog updated with new commands
  • new dispatch test for mempalace mined

Original PR: #4 (closed by parent merge)
Tests: 1566 pass, ruff clean.

🤖 Generated with Claude Code

jphein and others added 2 commits May 5, 2026 18:19
Adds two ergonomic gaps in the mining-management surface JP asked for:

1. **mempalace purge --source-file <path>** — extends the existing
   purge command with a third filter alongside --wing and --room.
   Single-filter or combined ($and) usage works the same way wing/room
   do; the filter accepts a metadata.source_file exact-match. End-to-
   end test exercises a real palace and confirms only the matching
   drawers are deleted, leaving siblings intact.

2. **mempalace mined** — companion to `mempalace status`, but groups
   by wing × source_file rather than wing × room. Answers "which
   files have I mined into this wing?" so an operator can pick
   targets for the new --source-file purge. Skips drawers without a
   source_file metadata key (diary entries, kg drawers, etc.).
   Honors --wing and --limit (default 50 sources per wing, --limit 0
   to show all). Uses the same paginated col.get() pattern as
   miner.status() to handle 100K+ drawer palaces without tripping
   sqlite's max-variable limit.

Together these close the "removing manually mined data" half of JP's
ask. The "automining when files change" half is queued for a separate
file-watcher daemon PR.

Pre-existing `mempalace mine <dir>` already covers "adding" (and
re-mines on mtime change via bulk_check_mined dedup), so no command
needed for refresh.

Tests: 8 new test cases (3 for purge --source-file with mock + real
palace; 3 for cmd_mined including end-to-end groupBy + wing filter +
no-palace error path; 2 updated to add the new arg to the test
namespace builder). Full suite: 1564 passed, 1 skipped, lint clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Four findings from Copilot's review pass on the mining-management CLI:

1. **--wing filter pushed into the query** (Copilot cli.py:800).
   Previously cmd_mined fetched every drawer in the palace and filtered
   in Python, doing O(total) work even for "mempalace mined --wing X"
   on a 150K-drawer palace where X has 12K. Pre-fetch IDs via
   col.get(where=...) to size the loop, then pass where= on each
   batch get(). On the canonical palace, --wing filters now scan
   ~10x less data.

2. **Reject negative --limit at parse time** (Copilot cli.py:818).
   argparse type=int silently accepted -1, producing nonsensical
   "... -2 more" output. Add a typed _nonneg_int validator that
   raises ArgumentTypeError on negatives. argparse turns that into
   a clean exit code 2 with usage error message.

3. **Update module __doc__** (Copilot cli.py:1495). main() uses the
   module docstring as the help epilog. The hard-coded command list
   stopped at "status" — added "mined" and "purge --source-file"
   so `mempalace --help` reflects the new commands.

4. **Dispatch test for `mempalace mined`** (Copilot cli.py:1542).
   Sibling commands have main()-level dispatch tests; mined didn't.
   A typo or registration regression would have slipped through unit
   coverage. Added test_main_mined_dispatches and a paired
   test_main_mined_rejects_negative_limit (verifies #2).

Tests: 68 cli tests pass; full suite green at 1566 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Copilot AI review requested due to automatic review settings May 6, 2026 01:19
@jphein jphein merged commit 2e6ced9 into main May 6, 2026
7 checks passed
@jphein jphein deleted the feat/mine-by-source-file branch May 6, 2026 01:19
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends the MemPalace CLI with operator tooling to (1) list mined sources grouped by wing and (2) purge drawers based on the source_file metadata, closing the “remove manually mined data” half of the mining-management request.

Changes:

  • Add mempalace purge --source-file <path> and update purge filtering/error messaging to support wing/room/source-file criteria.
  • Add mempalace mined to aggregate and display mined source_file entries grouped by wing (with --wing and --limit support, including argparse-time rejection of negative limits).
  • Add/extend CLI tests covering the new purge filter, mined output behavior, mined dispatch, and negative --limit parsing.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
mempalace/cli.py Implements cmd_mined, extends cmd_purge with --source-file, updates CLI help and dispatch, adds non-negative --limit parsing.
tests/test_cli.py Adds coverage for purge-by-source-file, mined grouping/filtering, mined dispatch, and argparse rejection of negative --limit.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mempalace/cli.py
Comment on lines +787 to +816
where = {"wing": args.wing} if args.wing else None
if where is not None:
try:
scope = col.get(where=where, include=[])
scope_ids = scope.get("ids") if isinstance(scope, dict) else getattr(scope, "ids", [])
total = len(scope_ids or [])
except Exception:
total = col.count()
else:
total = col.count()
wing_sources: dict = defaultdict(lambda: defaultdict(int))
batch_size = 5000
offset = 0
while offset < total:
kwargs = {"limit": batch_size, "offset": offset, "include": ["metadatas"]}
if where is not None:
kwargs["where"] = where
r = col.get(**kwargs)
batch = r.get("metadatas") if isinstance(r, dict) else getattr(r, "metadatas", [])
if not batch:
break
for m in batch:
m = m or {}
src = m.get("source_file")
if not src:
continue
wing = m.get("wing", "?")
wing_sources[wing][src] += 1
offset += len(batch)

jphein added a commit that referenced this pull request May 6, 2026
Four findings, batched into one cleanup PR.

PR #6 — hooks_cli.hook_precompact docstring lie. Said "mine the
transcript synchronously" but the local-mode _ingest_transcript
uses subprocess.Popen (background) and daemon-strict mode is
fire-and-forget HTTP. Rewrite docstring to describe the actual
two-step flow: (1) transcript ingest (best-effort, may still be
running when compaction starts), (2) _mine_sync of MEMPAL_DIR
(real synchronous subprocess.run).

PR #7 — cmd_mined unpaginated col.get + silent truncation. Old
flow called col.get(where=..., include=[]) once to size the loop,
then paginated with the same where clause. ChromaDB's get() has
a silent ~10K id truncation, so the upfront sizing call would
undercount on a wing >10K drawers. Drop the upfront sizing
entirely; loop on empty-batch instead. Plus an early-exit when
a batch is shorter than batch_size (saves one trailing empty
get on exact-multiple wing counts).

PR #8 — cmd_repair docstring stale. Said "rebuild palace vector
index" but the function still dispatches max-seq-id mode. Rewrite
to describe both modes + acknowledge the deleted reorganize mode.

PR #10 — non-Projects dashed-project divergence. Documented
limitation. Tried adding -dev-/-code-/etc. markers; reverted
because those are also ambiguous (`~/dev/<project>` and
`~/dev/<parent>/<project>` encode identically — a -dev-<rest>
match would catch parent+project, breaking the existing
test_wing_from_transcript_path_non_projects_layout test that
expects the last-token fallback for `-dev-MemPalace-mempalace`
→ `mempalace`). Instead document the limitation in the docstring
and add a regression test pinning the current behavior so future
"fixes" don't break the layouts that DO want the fallback.

Tests: +1 (regression test for the documented limitation). Full
suite: 1552 passed, 1 skipped, ruff clean.

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants