feat(init): scan manifests and git authors for real entity signal (v1) by igorls · Pull Request #1148 · MemPalace/mempalace

igorls · 2026-04-24T03:21:41Z

Summary

mempalace init now gets project names from package manifests (package.json, pyproject.toml, Cargo.toml, go.mod) and people from git commit authors, rather than inferring both from regex over prose. The regex detector stays as a fallback for prose-only folders.

This is version 1 — it fixes the codebase/project case. The prose-only case (diaries, transcripts, research notes with no git and no manifests) still depends on the regex detector and remains limited; phase 2 is an LLM-assisted entity pass for that use case.

Motivation

Running mempalace init on a codebase produced systematic false positives from the regex-only path. Examples from a local sanity run:

"Projects" included common English words caught by the old versioned/hyphenated regex on any hyphenated compound (e.g. a prefix like multi matching multi-word, a descriptor like hybrid matching hybrid-search).
"Uncertain" was a dumping ground for capitalized sentence-starts (Never, Before, past-tense verbs like Created, Extracted, Processed).
CamelCase project names were invisible: the candidate regex [A-Z][a-z]{1,19} requires a lowercase run after a single capital, so FooBar was fragmented into Foo and Bar.
The regex path produced zero people for code repositories — there is no said/laughed/asked verb pattern in commit messages or READMEs.

After this PR, a codebase init pulls the canonical project name from its manifest and ranks contributors by commit volume, with the regex path filling in extra names only mentioned in prose.

What's in this PR

New module mempalace/project_scanner.py:

Parses manifests for canonical project names
Walks for .git directories to find repo boundaries
Collects authors via git log, filters bots by name pattern ([bot], dependabot, renovate, -autoroll, trailing bot, …)
Keeps @users.noreply.github.com identities — that's GitHub's privacy-protected format for real humans, not a bot marker
Union-find dedup: commits sharing a name OR an email collapse into one person, so the same contributor with multiple display-name / email aliases appears once
Marks "mine" projects: current git user is top-5 committer OR has ≥10% of commits OR ≥20 commits
Public entry point discover_entities() merges scanner results with the regex detector case-insensitively (so foo from a manifest absorbs Foo from docs) and suppresses the regex uncertain bucket when real signal is already present
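
The union-find dedup described in the list above can be sketched as follows. This is a minimal illustration, not the code in project_scanner.py: the function name `dedupe_authors` and the input shape (a list of `(name, email)` pairs, one per commit) are assumptions, and display-name selection is simplified to "most frequent variant" without the handle filtering the PR mentions.

```python
from collections import Counter, defaultdict


def dedupe_authors(commits):
    """Collapse (name, email) pairs that share a name OR an email into one person.

    `commits` is an assumed input shape: one (name, email) tuple per commit.
    Returns [(display_name, commit_count), ...] sorted by commit volume.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    def name_key(name):
        # Keys are tagged so a name can never collide with an email string.
        return ("name", name.strip().lower())

    def email_key(email):
        return ("email", email.strip().lower())

    # Each commit links its name node to its email node.
    for name, email in commits:
        union(name_key(name), email_key(email))

    # Count display-name variants per connected component.
    groups = defaultdict(Counter)
    for name, _email in commits:
        groups[find(name_key(name))][name] += 1

    people = [
        (variants.most_common(1)[0][0], sum(variants.values()))
        for variants in groups.values()
    ]
    return sorted(people, key=lambda person: (-person[1], person[0]))
```

Linking names and emails as nodes of one graph means two aliases merge even when they never appear on the same commit, as long as a chain of shared names or emails connects them.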

Regex detector improvements (apply to the prose-only fallback):

CamelCase extraction pattern — multi-capital names like FooBar, FooBAR, FooAPI are visible
Dialogue ^NAME:\s requires ≥2 matches to count. A single metadata line like Created: 2026-04-21 was scoring as dialogue and classifying the field name as a person.
Versioned/hyphenated pattern tightened to \b{name}[-_]v?\d+(?:\.\d+)*\b (version-only). The previous \b{name}[-v]\w+ matched every hyphenated compound.
Skip LICENSE/COPYING/NOTICE/AUTHORS/PATENTS files (pure-prose noise)
Extra SKIP_DIRS: .terraform, vendor, target
Stopword expansion: capitalized participles/descriptors common at sentence start (created, updated, extracted, processed, total, auto, multi, hybrid, context, batch, never, before, after, …)
classify_entity: high-pronoun single-category signal now classifies as person. The previous "require two signal categories" rule demoted a character referenced mostly with pronouns (the common case in narrative prose) to uncertain despite overwhelming evidence. New gate: pronoun_hits >= 5 AND ratio >= 0.2 so common sentence-start words with incidental proximity stay uncertain.
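
As a rough sketch of the new gate, assuming `ratio` means pronoun hits divided by total mentions of the candidate: the function name and signature below are hypothetical, and the real classify_entity weighs other signals that are not shown here.

```python
def passes_pronoun_gate(pronoun_hits: int, frequency: int) -> bool:
    """Return True when pronoun evidence alone is strong enough for 'person'.

    Thresholds are the ones quoted in the PR description; `frequency` is
    assumed to be the total number of mentions of the candidate name.
    """
    if frequency <= 0:
        return False
    return pronoun_hits >= 5 and pronoun_hits / frequency >= 0.2


# Example inputs: a diary character with many pronoun references passes,
# while a sentence-start word with a couple of incidental hits does not.
print(passes_pronoun_gate(16, 30))
print(passes_pronoun_gate(2, 40))
```

Requiring both an absolute count and a ratio keeps frequent words with a few nearby pronouns in the uncertain bucket.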

Integration:

cli.py cmd_init switches to discover_entities(); same output shape so confirm_entities is unchanged.
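
The merge behaviour can be illustrated with a small sketch. The dictionary shape (`projects`, `people`, `uncertain` lists) and the name `merge_entities` are assumptions made for the example; the PR only states that the output shape matches what confirm_entities already expects.

```python
def merge_entities(scanned: dict, detected: dict) -> dict:
    """Merge scanner results with regex-detector results.

    Both arguments use a hypothetical shape:
    {"projects": [...], "people": [...], "uncertain": [...]}.
    """

    def merge(primary, secondary):
        # First spelling wins, so scanner names absorb regex variants
        # that differ only by case.
        seen = {}
        for name in [*primary, *secondary]:
            seen.setdefault(name.casefold(), name)
        return list(seen.values())

    has_real_signal = bool(scanned.get("projects") or scanned.get("people"))
    return {
        "projects": merge(scanned.get("projects", []), detected.get("projects", [])),
        "people": merge(scanned.get("people", []), detected.get("people", [])),
        # Prose noise is not worth adjudicating when git/manifests answered.
        "uncertain": [] if has_real_signal else list(detected.get("uncertain", [])),
    }
```

With no scanner results at all, the regex output passes through unchanged, which is the prose-only fallback path.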

Tests:

39 new tests (tests/test_project_scanner.py) — manifest parsing, bot filtering, union-find dedup, git repo discovery, scan + merge + fallback
2 new regression tests for classify_entity
All existing tests still pass; ruff clean

Known limitations (addressed in phase 2)

Prose-only folders still limited. A folder with no git and no manifests falls back to the regex detector. Common-word noise is reduced by the stopword expansion but fundamentally unsolvable without semantic understanding — a short capitalized token can be a project name, a city, or a character.
Sample size. The regex detector reads 10 files × 5KB each. Fine for sampling, but misses the tail in large prose corpora. Scanner path has no such limit for git/manifests.
No NER / no LLM. Either would properly disambiguate ambiguous tokens.

Phase 2 plan: optional local-model or user-configured endpoint at init time to classify the top capitalized terms in a sampled text window. Consistent with MemPalace's local-first / zero-API-required default — opt-in, not required.

Test plan

uv run pytest tests/ --ignore=tests/benchmarks — full suite passes
ruff check mempalace/ tests/ — clean
ruff format --check mempalace/ tests/ — clean
python -m mempalace.project_scanner . on this repo — canonical project name from pyproject.toml plus contributor list from git log
Test on a prose-only folder (diary/transcript) to surface where the regex fallback still falls short — feeds phase 2 scope

The pattern-matching detector had several systematic false positives that crowded the init review with nonsense. Concrete fixes:

- CamelCase extraction: add `[A-Z][a-z]+(?:[A-Z][a-z]+|[A-Z]{2,})+` to candidate patterns so `MemPalace`, `ChromaDB`, `OpenAI`, `ChatGPT` are visible. Previously `MemPalace` fragmented into `Mem` + `Palace`.
- Dialogue `^NAME:\s` requires >=2 matches to count. A single metadata line like `Created: 2026-04-21` was scoring as dialogue and classifying `Created` as a person.
- Versioned/hyphenated pattern tightened to `\b{name}[-_]v?\d+(?:\.\d+)*\b` (version-only). The previous `\b{name}[-v]\w+` matched `context-manager`, `multi-word`, etc. - every hyphenated compound.
- Skip LICENSE/COPYING/NOTICE/AUTHORS/PATENTS files during scan. They produce pure-English-prose noise (`Contributor`, `Software`, `Covered`, `Before`).
- Extra SKIP_DIRS: `.terraform`, `vendor`, `target`.
- Expand stopword list with capitalized participles/descriptors that commonly appear at sentence start: `created`, `updated`, `extracted`, `processed`, `total`, `summary`, `auto`, `multi`, `hybrid`, `context`, `bridge`, `batch`, `local`, `native`, `never`, `before`, `after`, etc.
- classify_entity: high-pronoun single-category signal now classifies as person. A diary's main character gets referenced with pronouns, not dialogue markers - requiring two signal categories demoted `Lu` (16 pronoun hits across 30 mentions) to uncertain. Gate on `pronoun_hits >= 5 AND pronoun_hits / frequency >= 0.2` so common sentence-start words (`Never`, `Before`) with incidental proximity stay uncertain.
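
To make the two regex changes concrete, the patterns quoted in this commit message can be exercised directly. The helper names below are made up for the demonstration, and `re.escape` is an added precaution that the commit message does not mention.

```python
import re

# CamelCase candidate pattern, copied from the commit message.
CAMEL_CASE = re.compile(r"[A-Z][a-z]+(?:[A-Z][a-z]+|[A-Z]{2,})+")


def versioned_pattern(name: str) -> re.Pattern:
    """Tightened pattern: only a version suffix counts (name-1.2, name_v3)."""
    return re.compile(rf"\b{re.escape(name)}[-_]v?\d+(?:\.\d+)*\b")


def old_versioned_pattern(name: str) -> re.Pattern:
    """Previous pattern, kept here only to show why it over-matched."""
    return re.compile(rf"\b{re.escape(name)}[-v]\w+")


text = "MemPalace stores drawers in ChromaDB. Created by fans of OpenAI and ChatGPT."
print(CAMEL_CASE.findall(text))

# The old pattern treats any hyphenated compound as a versioned project name.
print(bool(old_versioned_pattern("multi").search("multi-word")))
print(bool(versioned_pattern("multi").search("multi-word")))
print(bool(versioned_pattern("mempalace").search("mempalace-v3.3.3")))
```

A plain capitalized word such as `Created` never matches the CamelCase pattern because the pattern needs at least one further capitalized segment after the first.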

`mempalace init` previously leaned entirely on regex-based entity extraction from prose. That path works for text-only folders but wastes signal in any codebase: the project's own name is already in `package.json` / `pyproject.toml` / `Cargo.toml` / `go.mod`, and the people who worked on it are in `git log`.

This adds `project_scanner.py`, which becomes the primary signal source when real signal is available, with the regex detector preserved as the fallback for prose-only folders (diaries, research notes, writing).

What it does:

- Walks the target directory, parses manifests for canonical project names, and detects git repos by the presence of a `.git` directory.
- For each repo, reads `git log` for authors and filters obvious bots (`[bot]`, `dependabot`, `renovate`, `github-actions`, names ending in `bot`, `-autoroll`). Importantly does NOT filter `@users.noreply.github.com` - that's GitHub's privacy-protected human email, used by real contributors.
- Resolves author aliases with a union-find: commits that share a name OR an email collapse into one person. Picks the most-frequent real-name variant as display, ignoring handles and single-token usernames.
- Flags "mine" projects: user is top-5 committer OR has >=10% of commits OR >=20 commits. Ordered by user_commits in the UX.
- `discover_entities()` merges scanner results with the regex detector case-insensitively (so `mempalace` from pyproject absorbs `MemPalace` from docs), and suppresses the regex `uncertain` bucket when real signal is already found - the user doesn't need to adjudicate prose noise when the answer is already in git.

Integration: `cmd_init` now calls `discover_entities` instead of running the regex detector directly. Same output shape, so `confirm_entities` works unchanged.

Ships with 39 new tests covering manifest parsing, bot filtering, union-find dedup, git repo discovery, scan integration, and merge/fallback behavior. Existing 56 regex-detector tests all pass.
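
The bot filtering described in this commit message might look roughly like the sketch below. The markers are the ones the message lists; the exact regex, the `is_bot` name, and the extra check for `[bot]` inside the email address are assumptions rather than the scanner's actual `_is_bot`.

```python
import re

# Name markers listed in the commit message, matched case-insensitively.
BOT_NAME = re.compile(
    r"\[bot\]|dependabot|renovate|github-actions|-autoroll|bot$",
    re.IGNORECASE,
)


def is_bot(name: str, email: str = "") -> bool:
    """Heuristic bot check on the author name (and a `[bot]` tag in the email).

    A @users.noreply.github.com address is deliberately NOT treated as a bot
    marker: it is the privacy-protected address of a human contributor.
    """
    if BOT_NAME.search(name.strip()):
        return True
    return "[bot]" in email.lower()


authors = [
    ("dependabot[bot]", "49699333+dependabot[bot]@users.noreply.github.com"),
    ("Jane Doe", "12345+jane@users.noreply.github.com"),
    ("chromium-autoroll", "autoroll@example.com"),
]
humans = [name for name, email in authors if not is_bot(name, email)]
print(humans)
```

The "names ending in bot" rule is a trade-off: it catches unlabeled automation accounts but would also drop a human whose display name happens to end in those letters.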

Copilot

Pull request overview

Updates mempalace init to discover projects and people using higher-signal sources (package manifests + git history) and only fall back to regex-based prose entity detection when needed.

Changes:

Add mempalace/project_scanner.py to scan manifests for project names and git log for real commit authors, then merge with regex results.
Improve regex entity detection (CamelCase candidates, tighter versioned pattern, better dialogue handling, skip boilerplate files, expanded stopwords, pronoun-only person classification tweak).
Wire cmd_init to use discover_entities() and add comprehensive tests for scanning/merging behavior.
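
For illustration, reading a canonical name from two of the manifest types could be sketched as below. `manifest_project_name` is a hypothetical helper; taking the last path segment of a Go module path as the project name is an assumption, and TOML manifests are left out of this sketch.

```python
import json
import re
from pathlib import Path
from typing import Optional


def manifest_project_name(path: Path) -> Optional[str]:
    """Return the project name declared in package.json or go.mod, if any."""
    try:
        text = path.read_text(encoding="utf-8", errors="replace")
    except OSError:
        return None

    if path.name == "package.json":
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            return None
        name = data.get("name") if isinstance(data, dict) else None
        return name if isinstance(name, str) and name else None

    if path.name == "go.mod":
        # The module directive looks like: module github.com/acme/widget
        match = re.search(r"^module\s+(\S+)", text, re.MULTILINE)
        if match:
            return match.group(1).rstrip("/").rsplit("/", 1)[-1]

    return None
```

Returning `None` on unreadable or malformed files lets a caller fall back to the directory name, which is what the manifest-selection hunk later in this review does with `repo.name`.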

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 10 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `mempalace/project_scanner.py` | New scanner module for manifest + git-author entity discovery and merging with regex results. |
| `mempalace/cli.py` | Switches `init` entity discovery to `discover_entities()` while keeping confirmation/output shape. |
| `mempalace/entity_detector.py` | Tightens patterns and heuristics; skips boilerplate files; improves pronoun-only classification behavior. |
| `mempalace/i18n/en.json` | Expands candidate extraction to CamelCase and expands stopwords to reduce noise. |
| `tests/test_project_scanner.py` | Adds tests for manifest parsing, git repo discovery, bot filtering, dedupe, scan, and merge behavior. |
| `tests/test_entity_detector.py` | Adds regression tests for the updated pronoun-signal classification logic. |
| `uv.lock` | Bumps local package version entry. |


+        total_commits = len(authors)
+        user_commits = 0
+        author_counts: dict[str, int] = {}
+        for name, email in authors:
+            if _is_bot(name, email):
+                continue
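
The hunk above counts commits per author after skipping bots. The ownership rule from the PR description (top-5 committer OR at least 10% of commits OR at least 20 commits) could then be applied to such counts roughly as follows. `is_mine` and its argument shape are hypothetical, and the sketch assumes `author_counts` already excludes bots, in line with the follow-up fix mentioned later in this thread.

```python
def is_mine(user: str, author_counts: dict) -> bool:
    """Decide whether `user` counts as an owner of the repository.

    `author_counts` is an assumed shape: non-bot author -> commit count.
    """
    user_commits = author_counts.get(user, 0)
    if user_commits == 0:
        return False
    total = sum(author_counts.values())
    top_five = sorted(author_counts, key=author_counts.get, reverse=True)[:5]
    return (
        user in top_five
        or user_commits / total >= 0.10
        or user_commits >= 20
    )
```

The absolute threshold matters on large repositories, where a contributor with a few dozen commits may be far outside both the top five and the 10% share.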


+    if (root / ".git").is_dir():
+        # Root is a repo — still walk for nested repos (submodules, etc.)
+        repos.append(root)


+        if dirpath != repo_root and (dirpath / ".git").is_dir():
+            dirs.clear()
+            continue


+def _walk(root: Path, max_depth: int = MAX_DEPTH):
+    for dirpath, dirs, files in os.walk(root):
+        dirs[:] = [d for d in dirs if d not in SKIP_DIRS and not d.startswith(".")]
+        rel = Path(dirpath).relative_to(root) if dirpath != str(root) else Path(".")


+    NOT walked into (we stop at the first repo boundary when descending)."""
+    (tmp_path / ".git").mkdir()
+    deep = tmp_path / "a" / "b" / "nested-repo"
+    deep.mkdir(parents=True)
+    (deep / ".git").mkdir()
+    repos = find_git_repos(tmp_path)
+    # Root IS found; nested still discovered on its own branch (not inside root's .git)
+    assert tmp_path in repos


+        if (dirpath / ".git").is_dir():
+            repos.append(dirpath)
+            dirs.clear()  # don't descend into this repo's contents from here


+        manifests = _collect_manifest_names(repo)
+        root_level = [m for m in manifests if m[2] == repo]
+        if root_level:
+            manifest_file, proj_name, _ = root_level[0]
+        elif manifests:
+            manifest_file, proj_name, _ = manifests[0]
+        else:
+            manifest_file, proj_name = None, repo.name


+def _init_git_repo(path: Path, name: str = "Jane Doe", email: str = "[email protected]"):
+    """Helper: init a git repo with one commit."""
+    subprocess.run(["git", "init", "-q"], cwd=path, check=True)
+    subprocess.run(["git", "config", "user.name", name], cwd=path, check=True)


+    subprocess.run(
+        ["git", "commit", "-q", "-m", "initial"],
+        cwd=path,
+        check=True,
+        env={"GIT_COMMITTER_NAME": name, "GIT_COMMITTER_EMAIL": email, "PATH": "/usr/bin:/bin"},


+def _parse_toml(path: Path) -> dict:
+    if tomllib is None:
+        return {}
+    try:
+        with open(path, "rb") as f:
+            return tomllib.load(f)
+    except (OSError, Exception):
+        return {}


`tomllib` is stdlib only in Python 3.11+. On Python 3.9/3.10 (and the macOS runner) the scanner's toml parsing returned empty, so manifest lookups for `pyproject.toml` / `Cargo.toml` produced no name. CI surfaced this via 4 test_project_scanner.py failures on the 3.9 matrix. Add `tomli>=2.0.0` as a conditional dependency for `python_version < '3.11'` and fall back to it in `project_scanner.py`. The project still declares `requires-python = ">=3.9"` so the fallback is the correct shape.
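
A minimal sketch of the fallback import described in this commit message, assuming the `tomli` backport is installed on interpreters older than 3.11. The helper `toml_project_name` and the tables it checks (`[project]`, `[tool.poetry]`, `[package]`) are illustrative choices, not necessarily what project_scanner.py does.

```python
try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:
    try:
        import tomli as tomllib  # backport with the same API
    except ModuleNotFoundError:
        tomllib = None  # no TOML parser available; callers get None


def toml_project_name(text: str):
    """Return the project name from pyproject.toml or Cargo.toml content."""
    if tomllib is None:
        return None
    try:
        data = tomllib.loads(text)
    except tomllib.TOMLDecodeError:
        return None

    candidates = [
        data.get("project"),                    # PEP 621 pyproject.toml
        data.get("tool", {}).get("poetry") if isinstance(data.get("tool"), dict) else None,
        data.get("package"),                    # Cargo.toml
    ]
    for table in candidates:
        if isinstance(table, dict) and isinstance(table.get("name"), str):
            return table["name"]
    return None
```

Catching `TOMLDecodeError` specifically, rather than every exception, keeps genuine programming errors visible while still tolerating malformed manifests.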

igorls · 2026-04-24T03:44:56Z

@copilot apply changes based on the comments in this thread

Agent-Logs-Url: https://github.com/MemPalace/mempalace/sessions/3c277c46-20b3-4a43-8eb7-8ee2eb3cb55a Co-authored-by: igorls <[email protected]>

Copilot · 2026-04-24T03:54:10Z

@copilot apply changes based on the comments in this thread

Applied the review-thread fixes in 9486d8b. This follow-up covers .git file repo boundaries, deterministic manifest selection, non-bot commit totals for ownership, clarified/expanded scanner tests, and portable git test helpers. Validation: python -m pytest tests/test_project_scanner.py tests/test_entity_detector.py -v, python -m ruff check mempalace/project_scanner.py tests/test_project_scanner.py, and python -m ruff format --check mempalace/project_scanner.py tests/test_project_scanner.py all pass.
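
On the `.git` file point: in git worktrees and submodules, `.git` is a regular file whose content starts with `gitdir:` and points at the real git directory, so a check based only on `is_dir()` misses those repos. A hedged sketch of a boundary check that handles both forms (the function name is made up for the example):

```python
from pathlib import Path


def is_repo_root(path: Path) -> bool:
    """True if `path` holds a .git directory or a .git 'gitdir:' pointer file."""
    git_entry = path / ".git"
    if git_entry.is_dir():
        return True
    if git_entry.is_file():
        try:
            content = git_entry.read_text(encoding="utf-8", errors="replace")
        except OSError:
            return False
        # Worktrees and submodules store "gitdir: <path>" in a plain file.
        return content.lstrip().startswith("gitdir:")
    return False
```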

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

chroma-onnx-models.s3.amazonaws.com
- Triggering command: /usr/bin/python python -m pytest tests/ -v --ignore=tests/benchmarks (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Configure Actions setup steps to set up my environment, which run before the firewall is enabled
Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

…oritative sources

Addresses issues found while reviewing the initial phase-2 implementation against real data:

**Bug: uncertain bucket starved from the LLM.** `discover_entities` was dropping the regex-uncertain bucket whenever real git/manifest signal existed, which is exactly when `--llm` is most useful for cleaning up prose noise. The uncertain candidates never reached the refinement step. Fixed: only drop when `llm_provider is None`.

**Context collection: word boundaries, not substring.** `_collect_contexts` used substring matching on lower-cased lines, so the name "Go" matched "good", "going", "forgot". Switched to a `(?<!\w)…(?!\w)` regex so short names only match at token boundaries.

**Authoritative-source detection replaces confidence threshold.** Previously the refinement step skipped entries with `confidence >= 0.95` to avoid second-guessing manifest-backed projects. That threshold was fragile: the regex detector produces 0.99 confidence for things like `code file reference (5x)` on framework names (OpenAPI, etc.), so those skipped the LLM despite being regex-only noise. New helpers `_is_authoritative_person` / `_is_authoritative_project` look at the actual signal strings (commits, package.json, etc.) to decide.

**Now also refines regex-derived people.** After #1148's high-pronoun-signal fix, the regex detector can promote non-people to the `people` bucket (e.g. a capitalized common noun that happened to appear near pronouns). The LLM now gets a chance to clean those up, while git-authored people are still skipped.

**Robust JSON extraction.** Small local models routinely wrap JSON output in prose ("Sure, here's the classification: {…}"). The previous code-fence stripper failed on that. `_extract_json_candidates` now does balanced-bracket extraction with string-aware quote handling, so it recovers JSON from:

- raw responses
- markdown fenced blocks
- JSON embedded inside surrounding text
- multiple candidate objects/arrays

**Prompt guidance for frameworks vs user projects.** Added an explicit instruction: frameworks, runtimes, APIs, cloud services, and third-party vendors (Angular, OpenAPI, Terraform, Bun, Google, etc.) are TOPIC unless the context clearly says it's the user's own codebase. Directly addresses a false-positive pattern observed during dev runs.

**Defensive mtime.** `convo_scanner._safe_mtime` catches OSError during `stat()` (permission changes, filesystem races, broken symlinks) and sorts the affected file to the end of the newest-first order rather than crashing the scan.

**Cosmetic:** merged two adjacent f-strings on the same line in `backends/chroma.py` and `llm_client.py` (no behaviour change).

15 new tests cover the OSError fallback, word-boundary matching, JSON extraction variants, authoritative-source helpers, refining high-confidence regex projects, and end-to-end LLM refinement preserving the uncertain bucket.
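
Balanced-bracket extraction with string-aware quote handling can be sketched as follows. This is an independent illustration of the technique, not the project's `_extract_json_candidates`: the function names, the return type (a list of parsed values), and the skip-ahead strategy are all assumptions.

```python
import json


def _matching_end(text: str, start: int):
    """Return the index closing the bracket at `start`, or None if unbalanced."""
    closers = {"{": "}", "[": "]"}
    stack = []
    in_string = False
    escaped = False
    for pos in range(start, len(text)):
        ch = text[pos]
        if in_string:
            # Brackets inside JSON strings must not affect nesting depth.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in closers:
            stack.append(closers[ch])
        elif ch in "}]":
            if not stack or stack.pop() != ch:
                return None
            if not stack:
                return pos
    return None


def extract_json_candidates(text: str) -> list:
    """Collect every parseable JSON object/array embedded anywhere in `text`."""
    results = []
    i = 0
    while i < len(text):
        if text[i] in "{[":
            end = _matching_end(text, i)
            if end is not None:
                try:
                    results.append(json.loads(text[i : end + 1]))
                    i = end + 1  # skip past the value we just consumed
                    continue
                except json.JSONDecodeError:
                    pass  # balanced but not JSON, e.g. "[note]"
        i += 1
    return results
```

Because scanning starts at each opening bracket rather than at a code fence, the same routine handles raw responses, fenced blocks, and JSON wrapped in chatty prose.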

…to develop

MemPalace#1148, MemPalace#1150, and MemPalace#1157 were reviewed and merged on GitHub, but the two stacked children landed on their parent feature branches (now stale) rather than on develop. Only MemPalace#1148's commits reached develop via the direct merge. Release PR MemPalace#1159 (develop → main for v3.3.3) is therefore missing the LLM refinement, Claude-conversation scanner, and miner-registry wire-up that were ostensibly part of the release.

This merge brings the stale `feat/llm-entity-refine` branch (which contains the rolled-up merge commit for MemPalace#1157 → MemPalace#1150 → everything below) into develop so the release tag includes it. No code changes here, only history recovery.

Adds entries to the 3.3.3 section for the work that landed via MemPalace#1148, MemPalace#1150, MemPalace#1157, and MemPalace#1175 (rescued from stacked feature branches into develop via MemPalace#1175). Without these entries the 3.3.3 release notes on main would advertise only the hook/diary/search fixes that made it to develop through the first direct merge.

Covers:

- Manifest + git-author entity detection (MemPalace#1148)
- Regex detector accuracy improvements (MemPalace#1148)
- Optional --llm classification with Ollama / openai-compat / Anthropic provider abstraction and interactive UX (MemPalace#1150)
- Claude Code conversation scanner (MemPalace#1150)
- Init → miner registry wire-up so confirmed entities actually reach drawer metadata tagging (MemPalace#1157)
- Case-insensitive project dedup across all sources (MemPalace#1175)
- `mempalace mine` skips the generated entities.json artifact

@igorls

… state

Three stale sections updated:

- Fork change queue: row 8 (.blob_seq_ids_migrated marker) struck through → FILED as MemPalace#1177. Two new rows added for segfault fixes discovered today (MemPalace#1171 concurrent-write lock, MemPalace#1173 quarantine in make_client) that weren't in the queue because the bugs surfaced today, not during the original 2026-04-21 triage.
- Open upstream PRs: was showing 3 of 10 PRs. Now shows all 10 with current CI/review state. All rebased onto current upstream/develop and MERGEABLE as of today.
- Merged since v3.3.1: added v3.3.3 release (2026-04-24) with its constituent merges: MemPalace#942, MemPalace#833, MemPalace#1097, MemPalace#1145, MemPalace#1147, MemPalace#1148/1150/1157 entity-detection overhaul (via @igorls's MemPalace#1175 stacked-PR rescue), MemPalace#1166 palace-path security, MemPalace#340/MemPalace#1093 install regression, plus MemPalace#851 from the 2026-04-22 batch.

Sets `hnsw:batch_size` and `hnsw:sync_threshold` to 50_000 on collection creation in both `get_collection(..., create=True)` and the legacy `create_collection()` path. Preserves existing `hnsw:space` and `hnsw:num_threads=1` (race fix from MemPalace#976) and the `**ef_kwargs` plumbing for embedding-function injection (perf fix from MemPalace#1148/a4868a3).

Without these defaults, mining ~10K+ drawers triggers ~30 HNSW index resizes and hundreds of persistDirty() calls. persistDirty uses relative seek positioning in link_lists.bin; accumulated seek drift across resize cycles causes the OS to extend the sparse file with zero-filled regions, each cycle compounding the next. Result: link_lists.bin grows into hundreds of GB sparse, after which `status`, `search`, and `repair` all segfault and the palace is unrecoverable.

Empirical: rebuilt a palace from scratch on 39,792 drawers across 5 wings with this fix applied. Final palace 376 MB, link_lists.bin stays at 0 bytes across both Chroma collection dirs, status and search both return cleanly. Same workload without the fix bloated the palace to 565 GB sparse (30 GB on disk) and segfaulted at ~15K drawers.

Migration note: chromadb treats HNSW config as immutable post-creation, so existing bloated palaces still need to be nuked and re-mined; this only protects fresh collections.

Tests assert both keys land on the persisted collection metadata in both code paths, which also covers the MemPalace#1161 "config silently dropped" concern at CI time.

Closes MemPalace#344
Supersedes MemPalace#346
Co-authored-by: robot-rocket-science <[email protected]>

igorls added 2 commits April 24, 2026 00:20

Copilot AI review requested due to automatic review settings April 24, 2026 03:21

igorls requested review from bensig and milla-jovovich as code owners April 24, 2026 03:21

Copilot started reviewing on behalf of igorls April 24, 2026 03:22

Copilot AI reviewed Apr 24, 2026

igorls mentioned this pull request Apr 24, 2026

feat(init): optional LLM-assisted entity classification (phase 2) #1149

Open

Copilot started work on behalf of igorls April 24, 2026 03:45

igorls mentioned this pull request Apr 24, 2026

feat(init): optional LLM-assisted entity refinement + Claude Code convo scanner (phase 2) #1150

Merged

7 tasks

Copilot AI and others added 5 commits April 24, 2026 03:48

fix(project-scanner): address review feedback

70d4c54

Agent-Logs-Url: https://github.com/MemPalace/mempalace/sessions/3c277c46-20b3-4a43-8eb7-8ee2eb3cb55a Co-authored-by: igorls <[email protected]>

test(project-scanner): tighten git helper env handling

851ebeb

Agent-Logs-Url: https://github.com/MemPalace/mempalace/sessions/3c277c46-20b3-4a43-8eb7-8ee2eb3cb55a Co-authored-by: igorls <[email protected]>

refactor(project-scanner): tidy manifest priority helpers

ec9084f

Agent-Logs-Url: https://github.com/MemPalace/mempalace/sessions/3c277c46-20b3-4a43-8eb7-8ee2eb3cb55a Co-authored-by: igorls <[email protected]>

test(project-scanner): harden git helper execution

d4cc367

Agent-Logs-Url: https://github.com/MemPalace/mempalace/sessions/3c277c46-20b3-4a43-8eb7-8ee2eb3cb55a Co-authored-by: igorls <[email protected]>

test(project-scanner): make gitdir fixtures portable

9486d8b

Agent-Logs-Url: https://github.com/MemPalace/mempalace/sessions/3c277c46-20b3-4a43-8eb7-8ee2eb3cb55a Co-authored-by: igorls <[email protected]>

Copilot finished work on behalf of igorls April 24, 2026 03:55

igorls mentioned this pull request Apr 24, 2026

feat(init): wire confirmed entities into the miner's known-entities registry #1157

Merged

5 tasks

igorls added the enhancement New feature or request label Apr 24, 2026

igorls merged commit a851c7a into develop Apr 24, 2026
7 checks passed

This was referenced Apr 24, 2026

chore: rescue merged stacked PRs #1150 and #1157 into develop #1175

Merged

docs(changelog): document init entity-detection overhaul in 3.3.3 #1176

Merged

feat(init): scan manifests and git authors for real entity signal (v1) #1148
igorls merged 8 commits into develop from feat/project-scanner-entity-detection
