Skip to content

feat: add git-mine command to index commits and PRs into the palace#567

Open
zendesk-thittesdorf wants to merge 3 commits intoMemPalace:developfrom
zendesk-thittesdorf:feat/git-mine
Open

feat: add git-mine command to index commits and PRs into the palace#567
zendesk-thittesdorf wants to merge 3 commits intoMemPalace:developfrom
zendesk-thittesdorf:feat/git-mine

Conversation

@zendesk-thittesdorf
Copy link
Copy Markdown

@zendesk-thittesdorf zendesk-thittesdorf commented Apr 10, 2026

What does this PR do?

Adds a `git-mine` command that mines a git repository's commit history and GitHub PR data into the palace, filing decision-relevant content as searchable drawers.

Three surfaces:

  • `mempalace/git_miner.py` — core pipeline
  • `mempalace git-mine ` — CLI subcommand
  • `mempalace_git_mine` — MCP tool

Storage model: one drawer per merged PR (with review threads and a structured diff summary folded in), plus one drawer per standalone commit not covered by any fetched PR.

How it works:

  1. Fetches merged PRs via `gh pr list`; fetches review threads and commit SHAs per PR via `gh pr view --json reviews,commits`
  2. Optionally fetches `gh api /pulls/{n}/files` to build a structured diff summary (changed files + touched function names from hunk headers)
  3. Parses `git log`; skips commits whose SHA belongs to a fetched PR to avoid duplication
  4. Filters entries with a decision-signal regex (`decided`, `because`, `instead of`, `refactor`, etc.)
  5. Files content as idempotent upserts using a deterministic content-addressed drawer ID

Key flags:
```
mempalace git-mine
--wing Wing to file into (default: repo directory name)
--room Room to file into (default: git-decisions)
--since Only commits after this date (e.g. 2025-01-01)
--max-commits Cap on commits (0 = all)
--max-prs Cap on PRs fetched via gh (default: 25)
--no-reviews Skip folding review threads into PR drawers
--diff-summary Append structured diff summary: fallback (default), always, never
--all-commits Include commits without a body or decision signal
--decision-only Only file entries matching decision-signal keywords
--dry-run Preview without writing
```

--diff-summary=fallback (default) fetches file metadata only for PRs with no description — the case where it adds the most value. Use --diff-summary=always to include it on every PR; note this costs one extra gh api call per PR and can hit GitHub rate limits on large repos.

The MCP tool (`mempalace_git_mine`) exposes the same options.

How to test

```bash

Dry-run against any local repo

mempalace git-mine /path/to/your/repo --dry-run

Mine with PR data (requires gh auth)

mempalace git-mine /path/to/your/repo --max-commits 100

Search what was filed

mempalace search "architecture decision" --wing
```

Checklist

  • Tests pass (`python -m pytest tests/ -v`) — 58 tests
  • No hardcoded paths
  • Linter passes (`ruff check .`)

Adds a new git_miner module that mines git commit history and GitHub PR
data (titles, bodies, review threads) into the palace under wing_code /
git-decisions by default.

- mempalace/git_miner.py: core pipeline (collect_commits, collect_prs,
  mine_git) with decision-signal filtering and idempotent upserts
- mempalace/cli.py: new git-mine subcommand
- mempalace/mcp_server.py: new mempalace_git_mine MCP tool
- tests/test_git_miner.py: 33 tests, all subprocess calls mocked

Commit mining requires only git. PR and review mining requires the gh
CLI (https://cli.github.com) and degrades gracefully when unavailable.

Co-Authored-By: Claude <[email protected]>
Copy link
Copy Markdown

@web3guru888 web3guru888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a feature I've wanted to exist in MemPalace for a while. We've been doing something similar manually — copying commit messages into drawers by hand for architecture decisions. A built-in git-mine command makes that workflow first-class.

The implementation is solid. A few things I want to flag:

Good design decisions:

  • Content-addressed drawer IDs via SHA-256 over wing + room + source + ref + title means re-running is idempotent and doesn't bloat the palace. That's the right approach.
  • Graceful degradation when gh is unavailable (PR/review mining is optional, commits still work with just git).
  • --decision-only filter with the decision-signal regex is useful for repos with high commit volume.

Issues to consider:

  1. git log format separator collision: _LOG_SEP = ">>MP<<" could appear in a commit message or body (a PR that literally contains >>MP<< in the diff or discussion). A more robust approach is git log --format="%x00" to use NUL separators, which can't appear in commit messages.

  2. commit source uses 12-char SHA as ref: The drawer ID hashes ref but the metadata stores it as source_file: commit:abc123def456. Searching by SHA later won't work with --exact-match unless you know the short SHA. Worth storing the full SHA in a dedicated metadata field (git_sha) for traceability.

  3. No embedding size guard on long PRs: PR bodies + review threads can be arbitrarily long. If a PR body exceeds the embedding model's token limit, ChromaDB will silently truncate or error. Worth calling sanitize_content() before upsert (you already do — nice — but there's no pre-check for length, and sanitize_content raising ValueError is silently swallowed with continue). A warning log on that skip would be useful.

  4. collect_prs not shown in diff: The PR/review collection code isn't visible in the diff. Is collect_prs() delegating to gh pr list --json? Worth confirming the JSON keys match the FAKE_PR_LIST fixture (they do in the test, but confirming the real implementation matches would be good).

Tests cover the main code paths. The FAKE_LOG fixture and _make_log_output builder are clean.

Can approve pending the separator and SHA logging questions — leaving as COMMENT for now to get those addressed.

Copy link
Copy Markdown

@web3guru888 web3guru888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a genuinely useful addition for anyone running MemPalace as a long-term dev knowledge store. Git commit history and PR review threads are exactly the kind of architectural context that's worth making searchable.

A few notes from reviewing the design:

Decision-signal filtering: The regex approach (decided, because, instead of, etc.) is pragmatic. One false-positive class to watch: commit messages with refactor X that aren't actually architectural decisions — pure cosmetic refactors show up a lot in commit logs. Consider whether a minimum body length guard (e.g. only file commits where the message body > N characters) would improve signal/noise without adding complexity.

gh auth degradation: Graceful fallback when gh is unavailable is the right call. One edge case: if gh is available but unauthenticated, it may prompt for login rather than returning an error code — worth testing that the degradation is actually silent (not hanging on stdin) in that case. A --no-prs flag might be clearer than --no-reviews if you're skipping all PR fetching.

Content-addressed drawer ID: Deterministic ID from content hash means repeated runs are idempotent — good. Worth confirming that the hash function used for the ID is stable across Python versions and platforms (avoid hash() which is randomized per-process; hashlib.sha256 is the right choice).

MCP tool: Exposing this as mempalace_git_mine is good for agent-driven use. The MCP tool accepting since as a string date is fine, but the server-side parsing should handle ISO format variations (with and without time component, with and without timezone) — worth a test case.

Room naming: Default wing_code / git-decisions is a reasonable convention. Worth making both configurable as you've done. One thought: if someone mines multiple repos into the same palace, the default room name collides — could add --room-prefix or include the repo name in the default room.

Strong idea. The decision-signal filtering is the key design choice and it's defensible — curated context beats raw commit dump for retrieval quality.


[MemPalace-AGI integration — 215 tests, 710 KG entities]

@jphein
Copy link
Copy Markdown
Collaborator

jphein commented Apr 11, 2026

Interesting approach — the decision-signal regex filter is smart. Mining all commits would be noisy, but filtering for "decided", "because", "instead of" etc. catches the architectural reasoning that usually lives only in commit messages and PR descriptions.

Does it handle squash-merged PRs where the commit message is the PR body? That's where most of the decision context lives in GitHub-flow repos.

@web3guru888
Copy link
Copy Markdown

@jphein — squash-merged PRs are actually the best case for this feature. When a PR is squash-merged, git log sees a single commit whose message is the PR title + description (GitHubs default squash format). The decision-signal regex will catch the architectural intent from the PR description, and the collect_prs() path would also have fetched the full PR body and review threads independently.

The gap would be if the PR description is empty (just "closes #X") and all the decision context lived in review comments — in that case, the review thread content from gh pr view --json reviews is the only source. The implementation already handles this via the --include-reviews flag.

One thing worth noting: GitHubs squash commit format includes Co-authored-by lines but not the individual commits that were squashed. If the squashed commits had good messages, theyre gone. For repos that squash everything, mining PR reviews directly is often more valuable than commit history.

The room naming collision I flagged in my review (git-decisions collides across repos) becomes more important here — for a multi-repo palace, include the repo name in the default room or the squash commits from different repos will land in the same room with no way to filter them.


[MemPalace-AGI integration — 540+ discoveries, 5,251 KG triples]

@jphein
Copy link
Copy Markdown
Collaborator

jphein commented Apr 11, 2026

Makes sense — squash commits carrying PR bodies are ideal input. Good point on the room naming collision for multi-repo palaces. Per-repo room naming (e.g., git-decisions-mempalace) would solve it cleanly.

- Switch git log record separator to NUL (%x00) to prevent collision
  with commit message content
- Store full 40-char SHA in git_sha metadata field alongside short ref
- Warn to stderr when sanitize_content rejects an entry (was silent)
- Add stdin=DEVNULL to all gh subprocess calls to prevent interactive
  auth prompt hangs when gh is installed but unauthenticated

Rework storage model to one drawer per merged PR:
- Fold review threads into the PR drawer body under a delimited block
  instead of filing them as separate entries
- Fetch commit SHAs per PR (gh pr view --json commits) and skip those
  commits from the standalone commit scan to avoid duplication

Add structured diff summary (--diff-summary=always|fallback|never):
- Parse GitHub REST /pulls/{n}/files for changed file metadata
- Extract touched function/method names from unified diff hunk headers
  without storing raw diff content
- Fold as a Code changes block into the PR drawer body
- always (default): always append; fallback: only when PR has no
  description; never: skip

Default wing derived from repo directory name instead of wing_code:
- e.g. mining /path/to/zendesk_halon files under wing zendesk_halon
- Falls back to wing_code if the name fails sanitization

Tests: 58 passing (up from 13 in the original PR)

Co-Authored-By: Claude <[email protected]>
@zendesk-thittesdorf
Copy link
Copy Markdown
Author

Thanks for the thorough review — really useful feedback. Here's what was addressed in the latest commit (37c26d3):

Comment 1:

  • NUL separator_LOG_SEP switched from ">>MP<<" to "\x00" with %x00 in the git log format string. NUL can't appear in commit messages so collision is impossible.
  • Full SHA in metadataGitEntry gained a git_sha slot carrying the full 40-char SHA. The ChromaDB upsert now stores it as a git_sha metadata field alongside the existing short ref. You can now look up entries by exact SHA.
  • Warning on silent skip — the except ValueError: continue block now prints to stderr with the entry source, ref, and the exception message before skipping.
  • collect_prs JSON keys — confirmed: the real implementation calls gh pr list --json number,title,body,author,createdAt which maps exactly to the FAKE_PR_LIST fixture keys.

Comment 2:

  • gh unauthenticated stdin hang — both gh pr list and gh pr view calls now pass stdin=subprocess.DEVNULL. An unauthenticated gh will fail with a non-zero exit code rather than prompting interactively.
  • hashlib.sha256 stability — confirmed, hashlib.sha256 is used throughout drawer_id(). Python's built-in hash() is never used.
  • since ISO date parsing — no Python-side parsing needed: --since is passed directly to git log which accepts all ISO 8601 variants natively (date-only, with time, with timezone offset). Verified locally.
  • Room naming collision — resolved by the default-wing change: each repo now gets its own wing (e.g. zendesk_halon) so git-decisions is scoped per repo. Multiple repos no longer collide.

Additionally, two larger design changes were made beyond the reviewer items:

  • PR-first storage model — one drawer per merged PR with review threads folded in, plus one drawer per standalone commit not associated with any fetched PR. Commit SHAs belonging to a PR are collected from gh pr view --json commits and excluded from the commit scan to avoid duplication.
  • Structured diff summary--diff-summary=always|fallback|never appends a --- Code changes --- block to PR drawers listing changed files and touched function names extracted from unified diff hunk headers (via gh api /pulls/{n}/files). No raw diff content stored.

Test count is up from 13 to 58.

@web3guru888
Copy link
Copy Markdown

The v2 revision addresses everything from both rounds of feedback cleanly.

The NUL separator is the right call — unambiguous, collision-proof. The full SHA in metadata is a nice addition beyond what I asked for; exact-SHA lookup is genuinely useful if cross-referencing a specific commit across repos in the same palace.

The PR-first storage model (one drawer per merged PR with folded-in review threads, separate drawers for standalone commits) is a good architectural improvement. The --diff-summary flag is well-designed — fallback as default degrades gracefully when gh is unauthenticated or the API rate limit is hit.

Per-repo wing scoping for room collision is the correct fix. git-decisions being wing-scoped means multi-repo palaces work without namespace collisions.

Going from 13 to 58 tests is the right jump for a miner that makes external subprocess calls and does content-based routing.

One remaining note: the gh api /pulls/{n}/files call for --diff-summary is per-PR, so on repos with 500+ PRs this becomes 500+ API calls in a single run. Worth a rate-limit note in the --diff-summary=always help text, or a --max-prs ceiling defaulting to ~200. Not a blocker, but could surprise users on large repos.

LGTM for the core implementation.


[MemPalace-AGI — git history mining is one of the more useful MemPalace surfaces for long-lived projects]

Copy link
Copy Markdown

@web3guru888 web3guru888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v2 (37c26d3) addressed everything — NUL separator, full SHA in metadata, PR-first storage model, per-repo wing scoping, --diff-summary flag with graceful fallback, 58 tests. The rate-limit note on --diff-summary=always for large repos is the one remaining item, but that's documentation not correctness.

LGTM — approved.

@bensig bensig changed the base branch from main to develop April 11, 2026 22:21
@bensig bensig requested a review from igorls as a code owner April 11, 2026 22:21
Always fetching /pulls/{n}/files costs one extra gh api call per PR.
On repos with hundreds of PRs this consumes a significant portion of
GitHub's 5000 req/hr rate limit. Fallback is the safer default — it
only fetches files when the PR body is absent or too short to be
useful, which is exactly when the diff summary adds the most value.

Users who want full diff coverage on every PR can pass
--diff-summary=always explicitly.

Co-Authored-By: Claude <[email protected]>
@zendesk-thittesdorf
Copy link
Copy Markdown
Author

Good catch on the rate-limit concern. Changed in 089d0b2--diff-summary now defaults to fallback instead of always. Files are only fetched for PRs with no description (the case where the diff summary provides the most value). Users who want full diff coverage on every PR can opt in explicitly with --diff-summary=always.

@igorls igorls added area/cli CLI commands area/mcp MCP server and tools area/mining File and conversation mining enhancement New feature or request labels Apr 14, 2026
bensig added a commit that referenced this pull request Apr 18, 2026
Draft plugin specification for source adapters, mirroring RFC 001's
role for storage backends. Formalizes the contract six community
ingester PRs (#274, #23, #169, #232, #567, #98, #702) plus #981's
metadata-only mode have been reinventing ad-hoc, so adapter authors
can build to a stable surface.

Key decisions:
- Single ingest() method; lazy adapters yield SourceItemMetadata
  ahead of drawers, eager adapters interleave
- Declared-transformation model (§1.4) replaces informal verbatim
  promise with a verifiable one; byte_preserving adapters declare
  the empty set, declared_lossy adapters enumerate. Existing
  miner.py and the convo_miner+normalize pipeline map cleanly
- Palace is the incremental cursor via is_current(item, metadata);
  no sidecar persistence
- Routing is adapter-owned; detect_room/detect_hall move into the
  filesystem adapter
- Flat metadata per ChromaDB (RFC 001 §1.4) — entity hints as
  json_string field, KG triples route to SQLite knowledge graph
- Closets stay core-built as a post-step; adapters may emit flat
  closet_hints. Closes existing gap where convo drawers get no
  closets
- No per-drawer field renames: source_file, filed_at, source_mtime,
  added_by, normalize_version, entities, ingest_mode all preserved.
  Spec adds adapter_name, adapter_version, privacy_class

§9 enumerates the cleanup PR prerequisites (mempalace/sources/
module, PalaceContext facade, KnowledgeGraph.add_triple gaining
backwards-compatible source_drawer_id + adapter_name params).

Tracking issue: #989
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cli CLI commands area/mcp MCP server and tools area/mining File and conversation mining enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants