feat: add git-mine command to index commits and PRs into the palace#567
feat: add git-mine command to index commits and PRs into the palace#567zendesk-thittesdorf wants to merge 3 commits intoMemPalace:developfrom
Conversation
Adds a new git_miner module that mines git commit history and GitHub PR data (titles, bodies, review threads) into the palace under wing_code / git-decisions by default. - mempalace/git_miner.py: core pipeline (collect_commits, collect_prs, mine_git) with decision-signal filtering and idempotent upserts - mempalace/cli.py: new git-mine subcommand - mempalace/mcp_server.py: new mempalace_git_mine MCP tool - tests/test_git_miner.py: 33 tests, all subprocess calls mocked Commit mining requires only git. PR and review mining requires the gh CLI (https://cli.github.com) and degrades gracefully when unavailable. Co-Authored-By: Claude <[email protected]>
web3guru888
left a comment
There was a problem hiding this comment.
This is a feature I've wanted to exist in MemPalace for a while. We've been doing something similar manually — copying commit messages into drawers by hand for architecture decisions. A built-in git-mine command makes that workflow first-class.
The implementation is solid. A few things I want to flag:
Good design decisions:
- Content-addressed drawer IDs via SHA-256 over
wing + room + source + ref + titlemeans re-running is idempotent and doesn't bloat the palace. That's the right approach. - Graceful degradation when
ghis unavailable (PR/review mining is optional, commits still work with justgit). --decision-onlyfilter with the decision-signal regex is useful for repos with high commit volume.
Issues to consider:
-
git logformat separator collision:_LOG_SEP = ">>MP<<"could appear in a commit message or body (a PR that literally contains>>MP<<in the diff or discussion). A more robust approach isgit log --format="%x00"to use NUL separators, which can't appear in commit messages. -
commitsource uses 12-char SHA asref: The drawer ID hashesrefbut the metadata stores it assource_file: commit:abc123def456. Searching by SHA later won't work with--exact-matchunless you know the short SHA. Worth storing the full SHA in a dedicated metadata field (git_sha) for traceability. -
No embedding size guard on long PRs: PR bodies + review threads can be arbitrarily long. If a PR body exceeds the embedding model's token limit, ChromaDB will silently truncate or error. Worth calling
sanitize_content()before upsert (you already do — nice — but there's no pre-check for length, andsanitize_contentraisingValueErroris silently swallowed withcontinue). A warning log on that skip would be useful. -
collect_prsnot shown in diff: The PR/review collection code isn't visible in the diff. Iscollect_prs()delegating togh pr list --json? Worth confirming the JSON keys match theFAKE_PR_LISTfixture (they do in the test, but confirming the real implementation matches would be good).
Tests cover the main code paths. The FAKE_LOG fixture and _make_log_output builder are clean.
Can approve pending the separator and SHA logging questions — leaving as COMMENT for now to get those addressed.
web3guru888
left a comment
There was a problem hiding this comment.
This is a genuinely useful addition for anyone running MemPalace as a long-term dev knowledge store. Git commit history and PR review threads are exactly the kind of architectural context that's worth making searchable.
A few notes from reviewing the design:
Decision-signal filtering: The regex approach (decided, because, instead of, etc.) is pragmatic. One false-positive class to watch: commit messages with refactor X that aren't actually architectural decisions — pure cosmetic refactors show up a lot in commit logs. Consider whether a minimum body length guard (e.g. only file commits where the message body > N characters) would improve signal/noise without adding complexity.
gh auth degradation: Graceful fallback when gh is unavailable is the right call. One edge case: if gh is available but unauthenticated, it may prompt for login rather than returning an error code — worth testing that the degradation is actually silent (not hanging on stdin) in that case. A --no-prs flag might be clearer than --no-reviews if you're skipping all PR fetching.
Content-addressed drawer ID: Deterministic ID from content hash means repeated runs are idempotent — good. Worth confirming that the hash function used for the ID is stable across Python versions and platforms (avoid hash() which is randomized per-process; hashlib.sha256 is the right choice).
MCP tool: Exposing this as mempalace_git_mine is good for agent-driven use. The MCP tool accepting since as a string date is fine, but the server-side parsing should handle ISO format variations (with and without time component, with and without timezone) — worth a test case.
Room naming: Default wing_code / git-decisions is a reasonable convention. Worth making both configurable as you've done. One thought: if someone mines multiple repos into the same palace, the default room name collides — could add --room-prefix or include the repo name in the default room.
Strong idea. The decision-signal filtering is the key design choice and it's defensible — curated context beats raw commit dump for retrieval quality.
[MemPalace-AGI integration — 215 tests, 710 KG entities]
|
Interesting approach — the decision-signal regex filter is smart. Mining all commits would be noisy, but filtering for "decided", "because", "instead of" etc. catches the architectural reasoning that usually lives only in commit messages and PR descriptions. Does it handle squash-merged PRs where the commit message is the PR body? That's where most of the decision context lives in GitHub-flow repos. |
|
@jphein — squash-merged PRs are actually the best case for this feature. When a PR is squash-merged, The gap would be if the PR description is empty (just "closes #X") and all the decision context lived in review comments — in that case, the review thread content from One thing worth noting: GitHubs squash commit format includes The room naming collision I flagged in my review ( [MemPalace-AGI integration — 540+ discoveries, 5,251 KG triples] |
|
Makes sense — squash commits carrying PR bodies are ideal input. Good point on the room naming collision for multi-repo palaces. Per-repo room naming (e.g., |
- Switch git log record separator to NUL (%x00) to prevent collision
with commit message content
- Store full 40-char SHA in git_sha metadata field alongside short ref
- Warn to stderr when sanitize_content rejects an entry (was silent)
- Add stdin=DEVNULL to all gh subprocess calls to prevent interactive
auth prompt hangs when gh is installed but unauthenticated
Rework storage model to one drawer per merged PR:
- Fold review threads into the PR drawer body under a delimited block
instead of filing them as separate entries
- Fetch commit SHAs per PR (gh pr view --json commits) and skip those
commits from the standalone commit scan to avoid duplication
Add structured diff summary (--diff-summary=always|fallback|never):
- Parse GitHub REST /pulls/{n}/files for changed file metadata
- Extract touched function/method names from unified diff hunk headers
without storing raw diff content
- Fold as a Code changes block into the PR drawer body
- always (default): always append; fallback: only when PR has no
description; never: skip
Default wing derived from repo directory name instead of wing_code:
- e.g. mining /path/to/zendesk_halon files under wing zendesk_halon
- Falls back to wing_code if the name fails sanitization
Tests: 58 passing (up from 13 in the original PR)
Co-Authored-By: Claude <[email protected]>
|
Thanks for the thorough review — really useful feedback. Here's what was addressed in the latest commit (37c26d3): Comment 1:
Comment 2:
Additionally, two larger design changes were made beyond the reviewer items:
Test count is up from 13 to 58. |
|
The v2 revision addresses everything from both rounds of feedback cleanly. The NUL separator is the right call — unambiguous, collision-proof. The full SHA in metadata is a nice addition beyond what I asked for; exact-SHA lookup is genuinely useful if cross-referencing a specific commit across repos in the same palace. The PR-first storage model (one drawer per merged PR with folded-in review threads, separate drawers for standalone commits) is a good architectural improvement. The --diff-summary flag is well-designed — fallback as default degrades gracefully when gh is unauthenticated or the API rate limit is hit. Per-repo wing scoping for room collision is the correct fix. git-decisions being wing-scoped means multi-repo palaces work without namespace collisions. Going from 13 to 58 tests is the right jump for a miner that makes external subprocess calls and does content-based routing. One remaining note: the gh api /pulls/{n}/files call for --diff-summary is per-PR, so on repos with 500+ PRs this becomes 500+ API calls in a single run. Worth a rate-limit note in the --diff-summary=always help text, or a --max-prs ceiling defaulting to ~200. Not a blocker, but could surprise users on large repos. LGTM for the core implementation. [MemPalace-AGI — git history mining is one of the more useful MemPalace surfaces for long-lived projects] |
web3guru888
left a comment
There was a problem hiding this comment.
v2 (37c26d3) addressed everything — NUL separator, full SHA in metadata, PR-first storage model, per-repo wing scoping, --diff-summary flag with graceful fallback, 58 tests. The rate-limit note on --diff-summary=always for large repos is the one remaining item, but that's documentation not correctness.
LGTM — approved.
Always fetching /pulls/{n}/files costs one extra gh api call per PR.
On repos with hundreds of PRs this consumes a significant portion of
GitHub's 5000 req/hr rate limit. Fallback is the safer default — it
only fetches files when the PR body is absent or too short to be
useful, which is exactly when the diff summary adds the most value.
Users who want full diff coverage on every PR can pass
--diff-summary=always explicitly.
Co-Authored-By: Claude <[email protected]>
|
Good catch on the rate-limit concern. Changed in 089d0b2 — |
Draft plugin specification for source adapters, mirroring RFC 001's role for storage backends. Formalizes the contract six community ingester PRs (#274, #23, #169, #232, #567, #98, #702) plus #981's metadata-only mode have been reinventing ad-hoc, so adapter authors can build to a stable surface. Key decisions: - Single ingest() method; lazy adapters yield SourceItemMetadata ahead of drawers, eager adapters interleave - Declared-transformation model (§1.4) replaces informal verbatim promise with a verifiable one; byte_preserving adapters declare the empty set, declared_lossy adapters enumerate. Existing miner.py and the convo_miner+normalize pipeline map cleanly - Palace is the incremental cursor via is_current(item, metadata); no sidecar persistence - Routing is adapter-owned; detect_room/detect_hall move into the filesystem adapter - Flat metadata per ChromaDB (RFC 001 §1.4) — entity hints as json_string field, KG triples route to SQLite knowledge graph - Closets stay core-built as a post-step; adapters may emit flat closet_hints. Closes existing gap where convo drawers get no closets - No per-drawer field renames: source_file, filed_at, source_mtime, added_by, normalize_version, entities, ingest_mode all preserved. Spec adds adapter_name, adapter_version, privacy_class §9 enumerates the cleanup PR prerequisites (mempalace/sources/ module, PalaceContext facade, KnowledgeGraph.add_triple gaining backwards-compatible source_drawer_id + adapter_name params). Tracking issue: #989
What does this PR do?
Adds a `git-mine` command that mines a git repository's commit history and GitHub PR data into the palace, filing decision-relevant content as searchable drawers.
Three surfaces:
Storage model: one drawer per merged PR (with review threads and a structured diff summary folded in), plus one drawer per standalone commit not covered by any fetched PR.
How it works:
Key flags:
```
mempalace git-mine
--wing Wing to file into (default: repo directory name)
--room Room to file into (default: git-decisions)
--since Only commits after this date (e.g. 2025-01-01)
--max-commits Cap on commits (0 = all)
--max-prs Cap on PRs fetched via gh (default: 25)
--no-reviews Skip folding review threads into PR drawers
--diff-summary Append structured diff summary: fallback (default), always, never
--all-commits Include commits without a body or decision signal
--decision-only Only file entries matching decision-signal keywords
--dry-run Preview without writing
```
--diff-summary=fallback(default) fetches file metadata only for PRs with no description — the case where it adds the most value. Use--diff-summary=alwaysto include it on every PR; note this costs one extragh apicall per PR and can hit GitHub rate limits on large repos.The MCP tool (`mempalace_git_mine`) exposes the same options.
How to test
```bash
Dry-run against any local repo
mempalace git-mine /path/to/your/repo --dry-run
Mine with PR data (requires gh auth)
mempalace git-mine /path/to/your/repo --max-commits 100
Search what was filed
mempalace search "architecture decision" --wing
```
Checklist