feat(init): scan manifests and git authors for real entity signal (v1) #1148
The pattern-matching detector had several systematic false positives that
crowded the init review with nonsense. Concrete fixes:
- CamelCase extraction: add `[A-Z][a-z]+(?:[A-Z][a-z]+|[A-Z]{2,})+` to
candidate patterns so `MemPalace`, `ChromaDB`, `OpenAI`, `ChatGPT` are
visible. Previously `MemPalace` fragmented into `Mem` + `Palace`.
- Dialogue `^NAME:\s` requires >=2 matches to count. A single metadata
line like `Created: 2026-04-21` was scoring as dialogue and classifying
`Created` as a person.
- Versioned/hyphenated pattern tightened to `\b{name}[-_]v?\d+(?:\.\d+)*\b`
(version-only). The previous `\b{name}[-v]\w+` matched `context-manager`,
`multi-word`, etc. - every hyphenated compound.
- Skip LICENSE/COPYING/NOTICE/AUTHORS/PATENTS files during scan. They
produce pure-English-prose noise (`Contributor`, `Software`, `Covered`,
`Before`).
- Extra SKIP_DIRS: `.terraform`, `vendor`, `target`.
- Expand stopword list with capitalized participles/descriptors that
commonly appear at sentence start: `created`, `updated`, `extracted`,
`processed`, `total`, `summary`, `auto`, `multi`, `hybrid`, `context`,
`bridge`, `batch`, `local`, `native`, `never`, `before`, `after`, etc.
- classify_entity: high-pronoun single-category signal now classifies as
person. A diary's main character gets referenced with pronouns, not
dialogue markers - requiring two signal categories demoted `Lu` (16
pronoun hits across 30 mentions) to uncertain. Gate on
`pronoun_hits >= 5 AND pronoun_hits / frequency >= 0.2` so common
sentence-start words (`Never`, `Before`) with incidental proximity
stay uncertain.
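
The pattern changes and the pronoun gate listed above can be checked in isolation. The following is a minimal sketch, not the code in `entity_detector.py`: the regexes are copied from the bullets, while the helper names (`versioned_pattern`, `old_versioned_pattern`, `is_pronoun_person`) are hypothetical and the gate is reduced to the two numeric conditions only.

```python
import re

# CamelCase candidate pattern quoted in the commit message.
CAMEL_CASE = re.compile(r"[A-Z][a-z]+(?:[A-Z][a-z]+|[A-Z]{2,})+")


def versioned_pattern(name):
    """Tightened pattern: the name followed by a version suffix only."""
    return re.compile(rf"\b{re.escape(name)}[-_]v?\d+(?:\.\d+)*\b")


def old_versioned_pattern(name):
    """Previous pattern, kept here only to show the false positive."""
    return re.compile(rf"\b{re.escape(name)}[-v]\w+")


def is_pronoun_person(pronoun_hits, frequency):
    """Hypothetical helper: the gate from the last bullet, nothing more."""
    if frequency <= 0:
        return False
    return pronoun_hits >= 5 and pronoun_hits / frequency >= 0.2


text = "MemPalace stores vectors in ChromaDB; OpenAI built ChatGPT."
camel_hits = CAMEL_CASE.findall(text)

old_hit = old_versioned_pattern("context").search("context-manager")
new_hit = versioned_pattern("context").search("context-manager")
version_hit = versioned_pattern("mempalace").search("mempalace-v3.3.3")
```

With the figures quoted for `Lu` (16 pronoun hits across 30 mentions) the gate passes, while a word with 5 pronoun hits across 100 mentions fails the ratio test and stays uncertain.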
`mempalace init` previously leaned entirely on regex-based entity extraction from prose. That path works for text-only folders but wastes signal in any codebase: the project's own name is already in `package.json` / `pyproject.toml` / `Cargo.toml` / `go.mod`, and the people who worked on it are in `git log`.

This adds `project_scanner.py`, which becomes the primary signal source when real signal is available, with the regex detector preserved as the fallback for prose-only folders (diaries, research notes, writing).

What it does:
- Walks the target directory, parses manifests for canonical project names, and detects git repos by the presence of a `.git` directory.
- For each repo, reads `git log` for authors and filters obvious bots (`[bot]`, `dependabot`, `renovate`, `github-actions`, names ending in `bot`, `-autoroll`). Importantly does NOT filter `@users.noreply.github.com` - that's GitHub's privacy-protected human email, used by real contributors.
- Resolves author aliases with a union-find: commits that share a name OR an email collapse into one person. Picks the most-frequent real-name variant as display, ignoring handles and single-token usernames.
- Flags "mine" projects: user is top-5 committer OR has >=10% of commits OR >=20 commits. Ordered by user_commits in the UX.
- `discover_entities()` merges scanner results with the regex detector case-insensitively (so `mempalace` from pyproject absorbs `MemPalace` from docs), and suppresses the regex `uncertain` bucket when real signal is already found - the user doesn't need to adjudicate prose noise when the answer is already in git.

Integration: `cmd_init` now calls `discover_entities` instead of running the regex detector directly. Same output shape, so `confirm_entities` works unchanged.

Ships with 39 new tests covering manifest parsing, bot filtering, union-find dedup, git repo discovery, scan integration, and merge/fallback behavior. Existing 56 regex-detector tests all pass.
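
The union-find alias resolution described above can be sketched as follows. This is an illustration under assumptions, not the code in `project_scanner.py`: the function names are hypothetical, "real name" is approximated as "contains a space", and the sample authors and addresses are made up.

```python
from collections import Counter, defaultdict


def find(parent, node):
    """Return the root of node, halving paths along the way."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def union(parent, a, b):
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def dedupe_authors(commits):
    """Collapse (name, email) pairs that share a name OR an email."""
    parent = {}
    for name, email in commits:
        name_key = ("name", name.strip().lower())
        email_key = ("email", email.strip().lower())
        parent.setdefault(name_key, name_key)
        parent.setdefault(email_key, email_key)
        union(parent, name_key, email_key)

    # Count name variants per connected component.
    groups = defaultdict(Counter)
    for name, _email in commits:
        root = find(parent, ("name", name.strip().lower()))
        groups[root][name.strip()] += 1

    people = []
    for variants in groups.values():
        # Assumption: multi-token variants are real names; fall back to
        # all variants when only handles are available.
        real = Counter({v: c for v, c in variants.items() if " " in v})
        pool = real or variants
        display = pool.most_common(1)[0][0]
        people.append((display, sum(variants.values())))
    return sorted(people, key=lambda person: -person[1])


commits = [
    ("Jane Doe", "jane@example.com"),
    ("jdoe", "jane@example.com"),      # same email, different handle
    ("Jane Doe", "jane@work.example"),  # same name, different email
    ("Bob Roe", "bob@example.com"),
]
people = dedupe_authors(commits)
```

Keying names and emails as separate node types keeps a name string from colliding with an email string, and lets either kind of match merge two identities.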
Pull request overview
Updates mempalace init to discover projects and people using higher-signal sources (package manifests + git history) and only fall back to regex-based prose entity detection when needed.
Changes:
- Add `mempalace/project_scanner.py` to scan manifests for project names and `git log` for real commit authors, then merge with regex results.
- Improve regex entity detection (CamelCase candidates, tighter versioned pattern, better dialogue handling, skip boilerplate files, expanded stopwords, pronoun-only person classification tweak).
- Wire `cmd_init` to use `discover_entities()` and add comprehensive tests for scanning/merging behavior.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| `mempalace/project_scanner.py` | New scanner module for manifest + git-author entity discovery and merging with regex results. |
| `mempalace/cli.py` | Switches init entity discovery to `discover_entities()` while keeping confirmation/output shape. |
| `mempalace/entity_detector.py` | Tightens patterns and heuristics; skips boilerplate files; improves pronoun-only classification behavior. |
| `mempalace/i18n/en.json` | Expands candidate extraction to CamelCase and expands stopwords to reduce noise. |
| `tests/test_project_scanner.py` | Adds tests for manifest parsing, git repo discovery, bot filtering, dedupe, scan, and merge behavior. |
| `tests/test_entity_detector.py` | Adds regression tests for the updated pronoun-signal classification logic. |
| `uv.lock` | Bumps local package version entry. |
    total_commits = len(authors)
    user_commits = 0
    author_counts: dict[str, int] = {}
    for name, email in authors:
        if _is_bot(name, email):
            continue

    if (root / ".git").is_dir():
        # Root is a repo — still walk for nested repos (submodules, etc.)
        repos.append(root)

    if dirpath != repo_root and (dirpath / ".git").is_dir():
        dirs.clear()
        continue

    def _walk(root: Path, max_depth: int = MAX_DEPTH):
        for dirpath, dirs, files in os.walk(root):
            dirs[:] = [d for d in dirs if d not in SKIP_DIRS and not d.startswith(".")]
            rel = Path(dirpath).relative_to(root) if dirpath != str(root) else Path(".")

    NOT walked into (we stop at the first repo boundary when descending)."""
    (tmp_path / ".git").mkdir()
    deep = tmp_path / "a" / "b" / "nested-repo"
    deep.mkdir(parents=True)
    (deep / ".git").mkdir()
    repos = find_git_repos(tmp_path)
    # Root IS found; nested still discovered on its own branch (not inside root's .git)
    assert tmp_path in repos

    if (dirpath / ".git").is_dir():
        repos.append(dirpath)
        dirs.clear()  # don't descend into this repo's contents from here

    manifests = _collect_manifest_names(repo)
    root_level = [m for m in manifests if m[2] == repo]
    if root_level:
        manifest_file, proj_name, _ = root_level[0]
    elif manifests:
        manifest_file, proj_name, _ = manifests[0]
    else:
        manifest_file, proj_name = None, repo.name

    def _init_git_repo(path: Path, name: str = "Jane Doe", email: str = "[email protected]"):
        """Helper: init a git repo with one commit."""
        subprocess.run(["git", "init", "-q"], cwd=path, check=True)
        subprocess.run(["git", "config", "user.name", name], cwd=path, check=True)

    subprocess.run(
        ["git", "commit", "-q", "-m", "initial"],
        cwd=path,
        check=True,
        env={"GIT_COMMITTER_NAME": name, "GIT_COMMITTER_EMAIL": email, "PATH": "/usr/bin:/bin"},

    def _parse_toml(path: Path) -> dict:
        if tomllib is None:
            return {}
        try:
            with open(path, "rb") as f:
                return tomllib.load(f)
        except (OSError, Exception):
            return {}
`tomllib` is stdlib only in Python 3.11+. On Python 3.9/3.10 (and the macOS runner) the scanner's toml parsing returned empty, so manifest lookups for `pyproject.toml` / `Cargo.toml` produced no name. CI surfaced this via 4 test_project_scanner.py failures on the 3.9 matrix. Add `tomli>=2.0.0` as a conditional dependency for `python_version < '3.11'` and fall back to it in `project_scanner.py`. The project still declares `requires-python = ">=3.9"` so the fallback is the correct shape.
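
The fallback described in this commit message follows a common import pattern. A minimal sketch, assuming `tomli` is installed on interpreters older than 3.11; the function name `parse_toml` and the demo manifest are illustrative and not taken from the PR:

```python
import sys
import tempfile
from pathlib import Path

if sys.version_info >= (3, 11):
    import tomllib  # stdlib from Python 3.11 onward
else:
    try:
        import tomli as tomllib  # backport with the same load()/loads() API
    except ImportError:
        tomllib = None  # no parser available: callers get empty results


def parse_toml(path):
    """Return the parsed TOML document, or {} if it cannot be read."""
    if tomllib is None:
        return {}
    try:
        with open(path, "rb") as handle:  # load() requires a binary file
            return tomllib.load(handle)
    except (OSError, ValueError):  # TOMLDecodeError subclasses ValueError
        return {}


# Small demonstration on a throwaway manifest.
with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "pyproject.toml"
    manifest.write_text('[project]\nname = "demo"\n', encoding="utf-8")
    parsed = parse_toml(manifest)
    missing = parse_toml(Path(tmp) / "absent.toml")
```

Aliasing `tomli` to `tomllib` keeps the call sites identical on every supported interpreter, so only the import block needs to know which one is in use.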
@copilot apply changes based on the comments in this thread

Applied the review-thread fixes in 9486d8b.
…oritative sources

Addresses issues found while reviewing the initial phase-2 implementation against real data:

**Bug: uncertain bucket starved from the LLM.** `discover_entities` was dropping the regex-uncertain bucket whenever real git/manifest signal existed - which is exactly when `--llm` is most useful for cleaning up prose noise. The uncertain candidates never reached the refinement step. Fixed: only drop when `llm_provider is None`.

**Context collection: word boundaries, not substring.** `_collect_contexts` used substring matching on lower-cased lines, so the name "Go" matched "good", "going", "forgot". Switched to a `(?<!\w)…(?!\w)` regex so short names only match at token boundaries.

**Authoritative-source detection replaces confidence threshold.** Previously the refinement step skipped entries with `confidence >= 0.95` to avoid second-guessing manifest-backed projects. That threshold was fragile: the regex detector produces 0.99 confidence for things like `code file reference (5x)` on framework names (OpenAPI, etc.), so those skipped the LLM despite being regex-only noise. New helpers `_is_authoritative_person` / `_is_authoritative_project` look at the actual signal strings (commits, package.json, etc.) to decide.

**Now also refines regex-derived people.** After #1148's high-pronoun-signal fix, the regex detector can promote non-people to the `people` bucket (e.g. a capitalized common noun that happened to appear near pronouns). The LLM now gets a chance to clean those up, while git-authored people are still skipped.

**Robust JSON extraction.** Small local models routinely wrap JSON output in prose ("Sure, here's the classification: {…}"). The previous code-fence stripper failed on that. `_extract_json_candidates` now does balanced-bracket extraction with string-aware quote handling, so it recovers JSON from:
- raw responses
- markdown fenced blocks
- JSON embedded inside surrounding text
- multiple candidate objects/arrays

**Prompt guidance for frameworks vs user projects.** Added an explicit instruction: frameworks, runtimes, APIs, cloud services, and third-party vendors (Angular, OpenAPI, Terraform, Bun, Google, etc.) are TOPIC unless the context clearly says it's the user's own codebase. Directly addresses a false-positive pattern observed during dev runs.

**Defensive mtime.** `convo_scanner._safe_mtime` catches OSError during `stat()` (permission changes, filesystem races, broken symlinks) and sorts the affected file to the end of the newest-first order rather than crashing the scan.

**Cosmetic:** merged two adjacent f-strings on the same line in `backends/chroma.py` and `llm_client.py` (no behaviour change).

15 new tests cover the OSError fallback, word-boundary matching, JSON extraction variants, authoritative-source helpers, refining high-confidence regex projects, and end-to-end LLM refinement preserving the uncertain bucket.
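
Balanced-bracket extraction with string-aware quote handling, as described under "Robust JSON extraction", can be sketched like this. It is an independent illustration, not the PR's `_extract_json_candidates`: the function names and sample responses are hypothetical, and the scan is deliberately simple rather than optimized.

```python
import json

PAIRS = {"{": "}", "[": "]"}


def match_close(text, start):
    """Index of the bracket closing text[start], or None if unbalanced."""
    stack = []
    in_string = False
    escaped = False
    for pos in range(start, len(text)):
        ch = text[pos]
        if in_string:
            # Brackets inside a JSON string must not affect the stack.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in PAIRS:
            stack.append(PAIRS[ch])
        elif ch in "}]":
            if not stack or stack.pop() != ch:
                return None
            if not stack:
                return pos
    return None


def extract_json_candidates(text):
    """Return every JSON object or array that parses, in order."""
    results = []
    i = 0
    while i < len(text):
        if text[i] not in PAIRS:
            i += 1
            continue
        end = match_close(text, i)
        if end is None:
            i += 1
            continue
        try:
            results.append(json.loads(text[i : end + 1]))
            i = end + 1  # skip past the value we just consumed
        except ValueError:
            i += 1  # balanced but not JSON, e.g. "[see note]"
    return results


wrapped = 'Sure, here\'s the classification: {"name": "Lu", "note": "a } here"} Done.'
fenced = "```json\n[1, 2]\n```"
several = 'first {"x": 1} then [2, 3] and [see note]'
```

Tracking string state is what lets the `}` inside `"a } here"` pass through without closing the object early, which a plain bracket counter would get wrong.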
…to develop

MemPalace#1148, MemPalace#1150, and MemPalace#1157 were reviewed and merged on GitHub, but the two stacked children landed on their parent feature branches (now stale) rather than on develop. Only MemPalace#1148's commits reached develop via the direct merge. Release PR MemPalace#1159 (develop → main for v3.3.3) is therefore missing the LLM refinement, Claude-conversation scanner, and miner-registry wire-up that were ostensibly part of the release.

This merge brings the stale `feat/llm-entity-refine` branch (which contains the rolled-up merge commit for MemPalace#1157 → MemPalace#1150 → everything below) into develop so the release tag includes it. No code changes here, only history recovery.
Adds entries to the 3.3.3 section for the work that landed via MemPalace#1148, MemPalace#1150, MemPalace#1157, and MemPalace#1175 (rescued from stacked feature branches into develop via MemPalace#1175). Without these entries the 3.3.3 release notes on main would advertise only the hook/diary/search fixes that made it to develop through the first direct merge.

Covers:
- Manifest + git-author entity detection (MemPalace#1148)
- Regex detector accuracy improvements (MemPalace#1148)
- Optional --llm classification with Ollama / openai-compat / Anthropic provider abstraction and interactive UX (MemPalace#1150)
- Claude Code conversation scanner (MemPalace#1150)
- Init → miner registry wire-up so confirmed entities actually reach drawer metadata tagging (MemPalace#1157)
- Case-insensitive project dedup across all sources (MemPalace#1175)
- `mempalace mine` skips the generated entities.json artifact
… state

Three stale sections updated:
- Fork change queue: row 8 (.blob_seq_ids_migrated marker) struck through → FILED as MemPalace#1177. Two new rows added for segfault fixes discovered today (MemPalace#1171 concurrent-write lock, MemPalace#1173 quarantine in make_client) that weren't in the queue because the bugs surfaced today, not during the original 2026-04-21 triage.
- Open upstream PRs: was showing 3 of 10 PRs. Now shows all 10 with current CI/review state. All rebased onto current upstream/develop and MERGEABLE as of today.
- Merged since v3.3.1: added v3.3.3 release (2026-04-24) with its constituent merges: MemPalace#942, MemPalace#833, MemPalace#1097, MemPalace#1145, MemPalace#1147, MemPalace#1148/1150/1157 entity-detection overhaul (via @igorls's MemPalace#1175 stacked-PR rescue), MemPalace#1166 palace-path security, MemPalace#340/MemPalace#1093 install regression, plus MemPalace#851 from the 2026-04-22 batch.
Sets `hnsw:batch_size` and `hnsw:sync_threshold` to 50_000 on collection creation in both `get_collection(..., create=True)` and the legacy `create_collection()` path. Preserves existing `hnsw:space` and `hnsw:num_threads=1` (race fix from MemPalace#976) and the `**ef_kwargs` plumbing for embedding-function injection (perf fix from MemPalace#1148/a4868a3).

Without these defaults, mining ~10K+ drawers triggers ~30 HNSW index resizes and hundreds of persistDirty() calls. persistDirty uses relative seek positioning in link_lists.bin; accumulated seek drift across resize cycles causes the OS to extend the sparse file with zero-filled regions, each cycle compounding the next. Result: link_lists.bin grows into hundreds of GB sparse, after which `status`, `search`, and `repair` all segfault and the palace is unrecoverable.

Empirical: rebuilt a palace from scratch on 39,792 drawers across 5 wings with this fix applied. Final palace 376 MB, link_lists.bin stays at 0 bytes across both Chroma collection dirs, status and search both return cleanly. Same workload without the fix bloated the palace to 565 GB sparse (30 GB on disk) and segfaulted at ~15K drawers.

Migration note: chromadb treats HNSW config as immutable post-creation, so existing bloated palaces still need to be nuked and re-mined; this only protects fresh collections.

Tests assert both keys land on the persisted collection metadata in both code paths, which also covers the MemPalace#1161 "config silently dropped" concern at CI time.

Closes MemPalace#344
Supersedes MemPalace#346
Co-authored-by: robot-rocket-science <[email protected]>
Summary
`mempalace init` now gets project names from package manifests (`package.json`, `pyproject.toml`, `Cargo.toml`, `go.mod`) and people from git commit authors, rather than inferring both from regex over prose. The regex detector stays as a fallback for prose-only folders.

This is version 1: it fixes the codebase/project case. The prose-only case (diaries, transcripts, research notes with no git and no manifests) still depends on the regex detector and remains limited; phase 2 is an LLM-assisted entity pass for that use case.
Motivation
Running `mempalace init` on a codebase produced systematic false positives from the regex-only path. Examples from a local sanity run:
- The `versioned/hyphenated` regex fired on any hyphenated compound (e.g. a prefix like `multi` matching `multi-word`, a descriptor like `hybrid` matching `hybrid-search`).
- Capitalized sentence-start words were picked up as entities (`Never`, `Before`, past-tense verbs like `Created`, `Extracted`, `Processed`).
- The candidate pattern `[A-Z][a-z]{1,19}` requires a lowercase run after one capital and fragmented `FooBar` into `Foo` and `Bar` separately.
- False matches from the `said/laughed/asked` verb pattern in commit messages or READMEs.

After this PR, a codebase init pulls the canonical project name from its manifest and ranks contributors by commit volume, with the regex path filling in extra names only mentioned in prose.
What's in this PR
New module `mempalace/project_scanner.py`:
- Detects `.git` directories to find repo boundaries
- Reads authors from `git log`, filters bots by name pattern (`[bot]`, `dependabot`, `renovate`, `-autoroll`, trailing `bot`, …)
- Does not filter `@users.noreply.github.com` identities: that's GitHub's privacy-protected format for real humans, not a bot marker
- `discover_entities()` merges scanner results with the regex detector case-insensitively (so `foo` from a manifest absorbs `Foo` from docs) and suppresses the regex `uncertain` bucket when real signal is already present

Regex detector improvements (apply to the prose-only fallback):
- CamelCase extraction so `FooBar`, `FooBAR`, `FooAPI` are visible
- Dialogue `^NAME:\s` requires ≥2 matches to count. A single metadata line like `Created: 2026-04-21` was scoring as dialogue and classifying the field name as a person.
- Versioned/hyphenated pattern tightened to `\b{name}[-_]v?\d+(?:\.\d+)*\b` (version-only). The previous `\b{name}[-v]\w+` matched every hyphenated compound.
- Extra SKIP_DIRS: `.terraform`, `vendor`, `target`
- Expanded stopwords (`created`, `updated`, `extracted`, `processed`, `total`, `auto`, `multi`, `hybrid`, `context`, `batch`, `never`, `before`, `after`, …)
- `classify_entity`: high-pronoun single-category signal now classifies as person. The previous "require two signal categories" rule demoted a character referenced mostly with pronouns (the common case in narrative prose) to uncertain despite overwhelming evidence. New gate: `pronoun_hits >= 5 AND ratio >= 0.2` so common sentence-start words with incidental proximity stay uncertain.

Integration: `cli.py` `cmd_init` switches to `discover_entities()`; same output shape so `confirm_entities` is unchanged.

Tests:
- New tests (`tests/test_project_scanner.py`): manifest parsing, bot filtering, union-find dedup, git repo discovery, scan + merge + fallback
- Regression tests for `classify_entity`

Known limitations (addressed in phase 2)
Phase 2 plan: optional local-model or user-configured endpoint at init time to classify the top capitalized terms in a sampled text window. Consistent with MemPalace's local-first / zero-API-required default — opt-in, not required.
Test plan
- `uv run pytest tests/ --ignore=tests/benchmarks`: full suite passes
- `ruff check mempalace/ tests/`: clean
- `ruff format --check mempalace/ tests/`: clean
- `python -m mempalace.project_scanner .` on this repo: canonical project name from `pyproject.toml` plus contributor list from `git log`