feat(init): scan manifests and git authors for real entity signal (v1) #1148

Merged
igorls merged 8 commits into develop from feat/project-scanner-entity-detection
Apr 24, 2026

igorls commented Apr 24, 2026

Summary

mempalace init now gets project names from package manifests (package.json, pyproject.toml, Cargo.toml, go.mod) and people from git commit authors, rather than inferring both from regex over prose. The regex detector stays as a fallback for prose-only folders.

This is version 1 — it fixes the codebase/project case. The prose-only case (diaries, transcripts, research notes with no git and no manifests) still depends on the regex detector and remains limited; phase 2 is an LLM-assisted entity pass for that use case.

Motivation

Running mempalace init on a codebase produced systematic false positives from the regex-only path. Examples from a local sanity run:

  • "Projects" included common English words caught by the old versioned/hyphenated regex on any hyphenated compound (e.g. a prefix like multi matching multi-word, a descriptor like hybrid matching hybrid-search).
  • "Uncertain" was a dumping ground for capitalized sentence-starts (Never, Before, past-tense verbs like Created, Extracted, Processed).
  • CamelCase project names were invisible because the candidate regex [A-Z][a-z]{1,19} matches one capital followed by a lowercase run, so FooBar was fragmented into the separate candidates Foo and Bar.
  • The regex path produced zero people for code repositories — there is no said/laughed/asked verb pattern in commit messages or READMEs.
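The CamelCase fragmentation above is easy to reproduce. This is a minimal sketch, assuming the old candidate pattern was applied without word-boundary anchors (the description quotes only the bare pattern). The second pattern is the CamelCase extraction pattern quoted in this PR's first commit message, and the sample sentence is made up.

```python
import re

# Old candidate pattern as quoted in the PR description (assumed unanchored).
OLD_CANDIDATE = re.compile(r"[A-Z][a-z]{1,19}")
# CamelCase pattern quoted in the PR's first commit message.
CAMEL_CASE = re.compile(r"[A-Z][a-z]+(?:[A-Z][a-z]+|[A-Z]{2,})+")

text = "MemPalace stores embeddings in ChromaDB."  # made-up sample sentence

old_hits = OLD_CANDIDATE.findall(text)   # fragments each CamelCase name
new_hits = CAMEL_CASE.findall(text)      # keeps multi-capital names whole

print(old_hits)
print(new_hits)
```

The old pattern stops at the second capital, which is why a name such as MemPalace never appeared as a single candidate.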

After this PR, a codebase init pulls the canonical project name from its manifest and ranks contributors by commit volume, with the regex path filling in extra names only mentioned in prose.

What's in this PR

New module mempalace/project_scanner.py:

  • Parses manifests for canonical project names
  • Walks for .git directories to find repo boundaries
  • Collects authors via git log, filters bots by name pattern ([bot], dependabot, renovate, -autoroll, trailing bot, …)
  • Keeps @users.noreply.github.com identities — that's GitHub's privacy-protected format for real humans, not a bot marker
  • Union-find dedup: commits sharing a name OR an email collapse into one person, so the same contributor with multiple display-name / email aliases appears once
  • Marks "mine" projects: current git user is top-5 committer OR has ≥10% of commits OR ≥20 commits
  • Public entry point discover_entities() merges scanner results with the regex detector case-insensitively (so foo from a manifest absorbs Foo from docs) and suppresses the regex uncertain bucket when real signal is already present
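The bot filter in the list above can be sketched roughly as follows. The function name `looks_like_bot` and the exact regex are hypothetical, modelled on the patterns named in this PR; the real `_is_bot` in `project_scanner.py` may differ. The point it illustrates is that the decision is made on the author name, so a `@users.noreply.github.com` email never marks a human as a bot.

```python
import re

# Hypothetical pattern list modelled on the PR text; not the module's actual regex.
BOT_NAME_RE = re.compile(
    r"\[bot\]|dependabot|renovate|github-actions|-autoroll|bot$",
    re.IGNORECASE,
)


def looks_like_bot(name: str, email: str) -> bool:
    """Return True when the author *name* matches a known bot pattern.

    The email is accepted to mirror (name, email) pairs from git log, but it
    is deliberately not inspected: noreply addresses belong to real people.
    """
    return bool(BOT_NAME_RE.search(name.strip()))


authors = [
    ("dependabot[bot]", "49699333+dependabot[bot]@users.noreply.github.com"),
    ("Jane Doe", "12345+jane@users.noreply.github.com"),
    ("chromium-autoroll", "autoroll@example.com"),
]
humans = [a for a in authors if not looks_like_bot(*a)]
print(humans)
```

A trailing-`bot` rule is a heuristic and can misfire on real surnames, so a production filter would likely need an allow-list.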

Regex detector improvements (apply to the prose-only fallback):

  • CamelCase extraction pattern — multi-capital names like FooBar, FooBAR, FooAPI are visible
  • Dialogue ^NAME:\s requires ≥2 matches to count. A single metadata line like Created: 2026-04-21 was scoring as dialogue and classifying the field name as a person.
  • Versioned/hyphenated pattern tightened to \b{name}[-_]v?\d+(?:\.\d+)*\b (version-only). The previous \b{name}[-v]\w+ matched every hyphenated compound.
  • Skip LICENSE/COPYING/NOTICE/AUTHORS/PATENTS files (pure-prose noise)
  • Extra SKIP_DIRS: .terraform, vendor, target
  • Stopword expansion: capitalized participles/descriptors common at sentence start (created, updated, extracted, processed, total, auto, multi, hybrid, context, batch, never, before, after, …)
  • classify_entity: high-pronoun single-category signal now classifies as person. The previous "require two signal categories" rule demoted a character referenced mostly with pronouns (the common case in narrative prose) to uncertain despite overwhelming evidence. New gate: pronoun_hits >= 5 AND ratio >= 0.2 so common sentence-start words with incidental proximity stay uncertain.
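The effect of tightening the versioned/hyphenated pattern can be checked directly. This is a small sketch using the two patterns quoted above; the helper names and the use of `re.escape` on the candidate name are assumptions, not confirmed details of `entity_detector.py`.

```python
import re


def old_versioned(name: str) -> re.Pattern:
    # Previous pattern: any hyphen (or "v") followed by word characters.
    return re.compile(rf"\b{re.escape(name)}[-v]\w+")


def new_versioned(name: str) -> re.Pattern:
    # Tightened pattern: separator, optional "v", then a dotted version number.
    return re.compile(rf"\b{re.escape(name)}[-_]v?\d+(?:\.\d+)*\b")


prose = "a multi-word phrase"          # made-up sample
release = "upgrade to mempalace-v3.3.3 today"  # made-up sample

print(bool(old_versioned("multi").search(prose)))   # old pattern fires on prose
print(bool(new_versioned("multi").search(prose)))   # new pattern does not
print(new_versioned("mempalace").search(release).group(0))
```

Only a separator followed by digits now counts as project evidence, so ordinary hyphenated compounds no longer promote their first word.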

Integration:

  • cli.py cmd_init switches to discover_entities(); same output shape so confirm_entities is unchanged.

Tests:

  • 39 new tests (tests/test_project_scanner.py) — manifest parsing, bot filtering, union-find dedup, git repo discovery, scan + merge + fallback
  • 2 new regression tests for classify_entity
  • All existing tests still pass; ruff clean

Known limitations (addressed in phase 2)

  1. Prose-only folders still limited. A folder with no git and no manifests falls back to the regex detector. Common-word noise is reduced by the stopword expansion but fundamentally unsolvable without semantic understanding — a short capitalized token can be a project name, a city, or a character.
  2. Sample size. The regex detector reads 10 files × 5KB each. Fine for sampling, but misses the tail in large prose corpora. Scanner path has no such limit for git/manifests.
  3. No NER / no LLM. Either would properly disambiguate ambiguous tokens.

Phase 2 plan: optional local-model or user-configured endpoint at init time to classify the top capitalized terms in a sampled text window. Consistent with MemPalace's local-first / zero-API-required default — opt-in, not required.

Test plan

  • uv run pytest tests/ --ignore=tests/benchmarks — full suite passes
  • ruff check mempalace/ tests/ — clean
  • ruff format --check mempalace/ tests/ — clean
  • python -m mempalace.project_scanner . on this repo — canonical project name from pyproject.toml plus contributor list from git log
  • Test on a prose-only folder (diary/transcript) to surface where the regex fallback still falls short — feeds phase 2 scope

igorls added 2 commits April 24, 2026 00:20
The pattern-matching detector had several systematic false positives that
crowded the init review with nonsense. Concrete fixes:

- CamelCase extraction: add `[A-Z][a-z]+(?:[A-Z][a-z]+|[A-Z]{2,})+` to
  candidate patterns so `MemPalace`, `ChromaDB`, `OpenAI`, `ChatGPT` are
  visible. Previously `MemPalace` fragmented into `Mem` + `Palace`.
- Dialogue `^NAME:\s` requires >=2 matches to count. A single metadata
  line like `Created: 2026-04-21` was scoring as dialogue and classifying
  `Created` as a person.
- Versioned/hyphenated pattern tightened to `\b{name}[-_]v?\d+(?:\.\d+)*\b`
  (version-only). The previous `\b{name}[-v]\w+` matched `context-manager`,
  `multi-word`, etc. - every hyphenated compound.
- Skip LICENSE/COPYING/NOTICE/AUTHORS/PATENTS files during scan. They
  produce pure-English-prose noise (`Contributor`, `Software`, `Covered`,
  `Before`).
- Extra SKIP_DIRS: `.terraform`, `vendor`, `target`.
- Expand stopword list with capitalized participles/descriptors that
  commonly appear at sentence start: `created`, `updated`, `extracted`,
  `processed`, `total`, `summary`, `auto`, `multi`, `hybrid`, `context`,
  `bridge`, `batch`, `local`, `native`, `never`, `before`, `after`, etc.
- classify_entity: high-pronoun single-category signal now classifies as
  person. A diary's main character gets referenced with pronouns, not
  dialogue markers - requiring two signal categories demoted `Lu` (16
  pronoun hits across 30 mentions) to uncertain. Gate on
  `pronoun_hits >= 5 AND pronoun_hits / frequency >= 0.2` so common
  sentence-start words (`Never`, `Before`) with incidental proximity
  stay uncertain.
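The pronoun gate described in the last item amounts to a two-part threshold. A minimal sketch, assuming `frequency` is the total mention count for the candidate; the function name is hypothetical and the numbers for `Never` are invented for illustration.

```python
def passes_pronoun_gate(pronoun_hits: int, frequency: int) -> bool:
    """Single-category person signal: enough pronoun hits AND a high enough ratio."""
    if frequency <= 0:
        return False  # no mentions, nothing to classify
    return pronoun_hits >= 5 and pronoun_hits / frequency >= 0.2


# `Lu` from the commit message: 16 pronoun hits across 30 mentions.
print(passes_pronoun_gate(16, 30))
# A sentence-start word with incidental pronoun proximity (invented counts).
print(passes_pronoun_gate(3, 40))
```

Both conditions matter: the absolute floor rejects rare words, and the ratio rejects frequent words that only occasionally sit near a pronoun.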

`mempalace init` previously leaned entirely on regex-based entity
extraction from prose. That path works for text-only folders but wastes
signal in any codebase: the project's own name is already in
`package.json` / `pyproject.toml` / `Cargo.toml` / `go.mod`, and the
people who worked on it are in `git log`.

This adds `project_scanner.py`, which becomes the primary signal source
when real signal is available, with the regex detector preserved as the
fallback for prose-only folders (diaries, research notes, writing).

What it does:
- Walks the target directory, parses manifests for canonical project
  names, and detects git repos by the presence of a `.git` directory.
- For each repo, reads `git log` for authors and filters obvious bots
  (`[bot]`, `dependabot`, `renovate`, `github-actions`, names ending in
  `bot`, `-autoroll`). Importantly does NOT filter
  `@users.noreply.github.com` - that's GitHub's privacy-protected human
  email, used by real contributors.
- Resolves author aliases with a union-find: commits that share a name
  OR an email collapse into one person. Picks the most-frequent
  real-name variant as display, ignoring handles and single-token
  usernames.
- Flags "mine" projects: user is top-5 committer OR has >=10% of
  commits OR >=20 commits. Ordered by user_commits in the UX.
- `discover_entities()` merges scanner results with the regex detector
  case-insensitively (so `mempalace` from pyproject absorbs `MemPalace`
  from docs), and suppresses the regex `uncertain` bucket when real
  signal is already found - the user doesn't need to adjudicate prose
  noise when the answer is already in git.
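The alias resolution step can be sketched with a small union-find over name and email nodes. This is an illustration only: `dedup_authors` is a hypothetical name, lower-casing is an assumed normalization, and the display-name choice here is simply the most frequent variant, without the handle and single-token filtering the commit message describes.

```python
from collections import Counter, defaultdict


def dedup_authors(commits):
    """Group (name, email) commit authors that share a name OR an email."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Each commit links its name node to its email node.
    for name, email in commits:
        union(("name", name.strip().lower()), ("email", email.strip().lower()))

    # Count display-name variants per connected component.
    groups = defaultdict(Counter)
    for name, _email in commits:
        groups[find(("name", name.strip().lower()))][name] += 1

    people = [(c.most_common(1)[0][0], sum(c.values())) for c in groups.values()]
    return sorted(people, key=lambda p: (-p[1], p[0]))


commits = [  # made-up history
    ("Jane Doe", "jane@example.com"),
    ("Jane Doe", "jane@work.example"),
    ("jdoe", "jane@work.example"),
    ("Bob Ray", "bob@example.com"),
]
print(dedup_authors(commits))
```

Because names and emails are both nodes, `jdoe` joins `Jane Doe` through the shared work address even though the two never share a display name.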

Integration: `cmd_init` now calls `discover_entities` instead of
running the regex detector directly. Same output shape, so
`confirm_entities` works unchanged.
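A rough sketch of the merge behaviour, under loud assumptions: the result shape (a dict with `people`, `projects`, and `uncertain` lists of plain names) and the function name `merge_entities` are hypothetical, chosen only to show the case-insensitive absorption and the `uncertain` suppression. The real `discover_entities()` output carries more detail.

```python
def merge_entities(scanned: dict, regex: dict) -> dict:
    """Merge scanner and regex results; scanner spelling wins on a case clash."""
    merged = {"people": [], "projects": [], "uncertain": []}
    for bucket in ("people", "projects"):
        seen = set()
        # Scanner results come first so `mempalace` absorbs `MemPalace`.
        for name in list(scanned.get(bucket, [])) + list(regex.get(bucket, [])):
            key = name.casefold()
            if key not in seen:
                seen.add(key)
                merged[bucket].append(name)
    # Keep the regex `uncertain` bucket only when the scanner found nothing.
    has_real_signal = bool(scanned.get("people") or scanned.get("projects"))
    if not has_real_signal:
        merged["uncertain"] = list(regex.get("uncertain", []))
    return merged


scanned = {"people": ["Jane Doe"], "projects": ["mempalace"]}
regex = {"people": [], "projects": ["MemPalace", "ChromaDB"], "uncertain": ["Never"]}
print(merge_entities(scanned, regex))
```

In a prose-only folder the scanner returns nothing, so the same function passes the regex buckets through untouched, including `uncertain`.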

Ships with 39 new tests covering manifest parsing, bot filtering,
union-find dedup, git repo discovery, scan integration, and
merge/fallback behavior. Existing 56 regex-detector tests all pass.
Copilot AI review requested due to automatic review settings April 24, 2026 03:21

Copilot AI left a comment

Pull request overview

Updates mempalace init to discover projects and people using higher-signal sources (package manifests + git history) and only fall back to regex-based prose entity detection when needed.

Changes:

  • Add mempalace/project_scanner.py to scan manifests for project names and git log for real commit authors, then merge with regex results.
  • Improve regex entity detection (CamelCase candidates, tighter versioned pattern, better dialogue handling, skip boilerplate files, expanded stopwords, pronoun-only person classification tweak).
  • Wire cmd_init to use discover_entities() and add comprehensive tests for scanning/merging behavior.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| mempalace/project_scanner.py | New scanner module for manifest + git-author entity discovery and merging with regex results. |
| mempalace/cli.py | Switches init entity discovery to discover_entities() while keeping confirmation/output shape. |
| mempalace/entity_detector.py | Tightens patterns and heuristics; skips boilerplate files; improves pronoun-only classification behavior. |
| mempalace/i18n/en.json | Expands candidate extraction to CamelCase and expands stopwords to reduce noise. |
| tests/test_project_scanner.py | Adds tests for manifest parsing, git repo discovery, bot filtering, dedupe, scan, and merge behavior. |
| tests/test_entity_detector.py | Adds regression tests for the updated pronoun-signal classification logic. |
| uv.lock | Bumps local package version entry. |

Comment thread mempalace/project_scanner.py Outdated
Comment on lines +446 to +451
total_commits = len(authors)
user_commits = 0
author_counts: dict[str, int] = {}
for name, email in authors:
    if _is_bot(name, email):
        continue
Comment thread mempalace/project_scanner.py Outdated
Comment on lines +306 to +308
if (root / ".git").is_dir():
    # Root is a repo - still walk for nested repos (submodules, etc.)
    repos.append(root)
Comment thread mempalace/project_scanner.py Outdated
Comment on lines +325 to +327
if dirpath != repo_root and (dirpath / ".git").is_dir():
    dirs.clear()
    continue
Comment thread mempalace/project_scanner.py Outdated
def _walk(root: Path, max_depth: int = MAX_DEPTH):
    for dirpath, dirs, files in os.walk(root):
        dirs[:] = [d for d in dirs if d not in SKIP_DIRS and not d.startswith(".")]
        rel = Path(dirpath).relative_to(root) if dirpath != str(root) else Path(".")
Comment thread tests/test_project_scanner.py Outdated
Comment on lines +188 to +195
NOT walked into (we stop at the first repo boundary when descending)."""
(tmp_path / ".git").mkdir()
deep = tmp_path / "a" / "b" / "nested-repo"
deep.mkdir(parents=True)
(deep / ".git").mkdir()
repos = find_git_repos(tmp_path)
# Root IS found; nested still discovered on its own branch (not inside root's .git)
assert tmp_path in repos
Comment thread mempalace/project_scanner.py Outdated
Comment on lines +312 to +314
if (dirpath / ".git").is_dir():
    repos.append(dirpath)
    dirs.clear()  # don't descend into this repo's contents from here
Comment on lines +436 to +443
manifests = _collect_manifest_names(repo)
root_level = [m for m in manifests if m[2] == repo]
if root_level:
    manifest_file, proj_name, _ = root_level[0]
elif manifests:
    manifest_file, proj_name, _ = manifests[0]
else:
    manifest_file, proj_name = None, repo.name
Comment thread tests/test_project_scanner.py Outdated
Comment on lines +205 to +208
def _init_git_repo(path: Path, name: str = "Jane Doe", email: str = "[email protected]"):
    """Helper: init a git repo with one commit."""
    subprocess.run(["git", "init", "-q"], cwd=path, check=True)
    subprocess.run(["git", "config", "user.name", name], cwd=path, check=True)
Comment thread tests/test_project_scanner.py Outdated
Comment on lines +213 to +217
subprocess.run(
    ["git", "commit", "-q", "-m", "initial"],
    cwd=path,
    check=True,
    env={"GIT_COMMITTER_NAME": name, "GIT_COMMITTER_EMAIL": email, "PATH": "/usr/bin:/bin"},
Comment on lines +127 to +134
def _parse_toml(path: Path) -> dict:
if tomllib is None:
return {}
try:
with open(path, "rb") as f:
return tomllib.load(f)
except (OSError, Exception):
return {}
`tomllib` is stdlib only in Python 3.11+. On Python 3.9/3.10 (and the
macOS runner) the scanner's toml parsing returned empty, so manifest
lookups for `pyproject.toml` / `Cargo.toml` produced no name. CI
surfaced this via 4 test_project_scanner.py failures on the 3.9 matrix.

Add `tomli>=2.0.0` as a conditional dependency for `python_version <
'3.11'` and fall back to it in `project_scanner.py`. The project still
declares `requires-python = ">=3.9"` so the fallback is the correct
shape.
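The fallback described in this commit message typically looks like the sketch below. The import order and the `None` sentinel follow the `_parse_toml` snippet quoted in the review; the function name `parse_toml` and the narrower `except` clause are this sketch's own choices, not a claim about the merged code.

```python
from pathlib import Path

try:
    import tomllib  # stdlib on Python 3.11+
except ModuleNotFoundError:
    try:
        import tomli as tomllib  # backport, same API, for Python 3.9/3.10
    except ModuleNotFoundError:
        tomllib = None  # neither available: TOML manifests yield no name


def parse_toml(path: Path) -> dict:
    """Return the parsed TOML document, or {} when it cannot be read."""
    if tomllib is None:
        return {}
    try:
        with open(path, "rb") as f:  # both parsers require binary mode
            return tomllib.load(f)
    except (OSError, tomllib.TOMLDecodeError):
        return {}
```

With `tomli>=2.0.0` declared under the `python_version < '3.11'` marker, the `None` branch should only be reached in an environment installed without its dependencies.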

igorls commented Apr 24, 2026

@copilot apply changes based on the comments in this thread

Copilot AI commented Apr 24, 2026

@copilot apply changes based on the comments in this thread

Applied the review-thread fixes in 9486d8b. This follow-up covers .git file repo boundaries, deterministic manifest selection, non-bot commit totals for ownership, clarified/expanded scanner tests, and portable git test helpers. Validation: python -m pytest tests/test_project_scanner.py tests/test_entity_detector.py -v, python -m ruff check mempalace/project_scanner.py tests/test_project_scanner.py, and python -m ruff format --check mempalace/project_scanner.py tests/test_project_scanner.py all pass.
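For reference, the "mine" rule from the PR description, evaluated against non-bot commit totals as this follow-up describes, could be sketched as below. The function name, the 1-based `user_rank` parameter, and the zero-commit guard are assumptions made for the example.

```python
def is_mine(user_commits: int, total_commits: int, user_rank: int) -> bool:
    """Apply the ownership rule: top-5 committer OR >=10% of commits OR >=20 commits.

    total_commits is assumed to count non-bot commits only; user_rank is the
    user's 1-based position among non-bot authors ordered by commit count.
    """
    if user_commits <= 0:
        return False  # assumed guard: no commits means not the user's project
    share = user_commits / total_commits if total_commits else 0.0
    return user_rank <= 5 or share >= 0.10 or user_commits >= 20


print(is_mine(3, 1000, 4))    # few commits, but a top-5 committer
print(is_mine(25, 5000, 12))  # small share, but at least 20 commits
print(is_mine(2, 1000, 40))   # none of the three conditions hold
```

Counting bots in `total_commits` would shrink the user's share in heavily automated repos, which is why the denominator matters for the 10% branch.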

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • chroma-onnx-models.s3.amazonaws.com
    • Triggering command: /usr/bin/python python -m pytest tests/ -v --ignore=tests/benchmarks (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

igorls added a commit that referenced this pull request Apr 24, 2026
…oritative sources

Addresses issues found while reviewing the initial phase-2 implementation
against real data:

**Bug: uncertain bucket starved from the LLM.**
`discover_entities` was dropping the regex-uncertain bucket whenever real
git/manifest signal existed — which is exactly when `--llm` is most useful
for cleaning up prose noise. The uncertain candidates never reached the
refinement step. Fixed: only drop when `llm_provider is None`.

**Context collection: word boundaries, not substring.**
`_collect_contexts` used substring matching on lower-cased lines, so the
name "Go" matched "good", "going", "forgot". Switched to a
`(?<!\w)…(?!\w)` regex so short names only match at token boundaries.
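A small sketch of the token-boundary match, assuming the name is escaped with `re.escape` and matched case-insensitively (the commit message specifies only the lookaround shape). The helper names and sample lines are made up.

```python
import re


def name_pattern(name: str) -> re.Pattern:
    # Lookarounds reject a match that touches a word character on either side.
    return re.compile(rf"(?<!\w){re.escape(name)}(?!\w)", re.IGNORECASE)


def collect_contexts(name: str, lines: list) -> list:
    """Return only the lines that mention `name` as a standalone token."""
    pattern = name_pattern(name)
    return [line for line in lines if pattern.search(line)]


lines = [
    "We rewrote the service in Go.",
    "That was a good idea.",
    "I forgot about going there.",
]
print(collect_contexts("Go", lines))
```

Lookarounds are used rather than `\b` so the pattern still behaves when a name starts or ends with a non-word character.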

**Authoritative-source detection replaces confidence threshold.**
Previously the refinement step skipped entries with `confidence >= 0.95`
to avoid second-guessing manifest-backed projects. That threshold was
fragile — the regex detector produces 0.99 confidence for things like
`code file reference (5x)` on framework names (OpenAPI, etc.), so those
skipped the LLM despite being regex-only noise. New helpers
`_is_authoritative_person` / `_is_authoritative_project` look at the
actual signal strings (commits, package.json, etc.) to decide.

**Now also refines regex-derived people.**
After #1148's high-pronoun-signal fix, the regex detector can promote
non-people to the `people` bucket (e.g. a capitalized common noun that
happened to appear near pronouns). The LLM now gets a chance to clean
those up, while git-authored people are still skipped.

**Robust JSON extraction.**
Small local models routinely wrap JSON output in prose ("Sure, here's
the classification: {…}"). The previous code-fence stripper failed on
that. `_extract_json_candidates` now does balanced-bracket extraction
with string-aware quote handling, so it recovers JSON from:
- raw responses
- markdown fenced blocks
- JSON embedded inside surrounding text
- multiple candidate objects/arrays
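The balanced-bracket approach can be illustrated with a short scanner. This is a sketch of the general technique, not the project's `_extract_json_candidates`: the function names are hypothetical, and it leans on `json.loads` to reject spans whose brackets balance but are not valid JSON.

```python
import json


def find_balanced_end(text: str, start: int):
    """Index of the bracket closing the one at `start`, ignoring brackets in strings."""
    depth, in_string, escaped = 0, False, False
    for j in range(start, len(text)):
        ch = text[j]
        if in_string:
            if escaped:
                escaped = False          # character after a backslash is literal
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
            if depth == 0:
                return j
    return None


def extract_json_candidates(text: str) -> list:
    """Return every top-level JSON object or array found in `text`, in order."""
    results, i = [], 0
    while i < len(text):
        if text[i] in "{[":
            end = find_balanced_end(text, i)
            if end is not None:
                try:
                    results.append(json.loads(text[i:end + 1]))
                    i = end + 1
                    continue
                except json.JSONDecodeError:
                    pass  # balanced but not JSON; keep scanning
        i += 1
    return results


reply = 'Sure, here is the classification: {"name": "Lu", "note": "uses } and ]"} Done.'
print(extract_json_candidates(reply))
```

Tracking string state is what lets the `}` and `]` inside the `note` value pass without closing the object early.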

**Prompt guidance for frameworks vs user projects.**
Added an explicit instruction: frameworks, runtimes, APIs, cloud
services, and third-party vendors (Angular, OpenAPI, Terraform, Bun,
Google, etc.) are TOPIC unless the context clearly says it's the user's
own codebase. Directly addresses a false-positive pattern observed
during dev runs.

**Defensive mtime.**
`convo_scanner._safe_mtime` catches OSError during `stat()` — permission
changes, filesystem races, broken symlinks — and sorts the affected file
to the end of the newest-first order rather than crashing the scan.
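One way to get the described ordering is to return a sentinel that sorts below every real timestamp. A minimal sketch; `safe_mtime` and `newest_first` are hypothetical names, and the choice of negative infinity as the sentinel is an assumption about how `_safe_mtime` achieves the same effect.

```python
import os


def safe_mtime(path) -> float:
    """Modification time, or -inf when stat() fails (missing file, permissions, race)."""
    try:
        return os.stat(path).st_mtime
    except OSError:
        return float("-inf")


def newest_first(paths):
    # Files that could not be stat()-ed carry -inf and therefore sort last.
    return sorted(paths, key=safe_mtime, reverse=True)
```

Calling `newest_first` on a list that includes a broken symlink or a file deleted mid-scan returns the readable files in newest-first order with the unreadable ones at the end, instead of raising.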

**Cosmetic:** merged two adjacent f-strings on the same line in
`backends/chroma.py` and `llm_client.py` (no behaviour change).

15 new tests cover the OSError fallback, word-boundary matching, JSON
extraction variants, authoritative-source helpers, refining high-
confidence regex projects, and end-to-end LLM refinement preserving the
uncertain bucket.
igorls added the enhancement label Apr 24, 2026
igorls merged commit a851c7a into develop Apr 24, 2026
7 checks passed
shrhoads pushed a commit to shrhoads/mempalace that referenced this pull request Apr 24, 2026
…to develop

MemPalace#1148, MemPalace#1150, and MemPalace#1157 were reviewed and merged on GitHub, but the two
stacked children landed on their parent feature branches (now stale)
rather than on develop. Only MemPalace#1148's commits reached develop via the
direct merge. Release PR MemPalace#1159 (develop → main for v3.3.3) is therefore
missing the LLM refinement, Claude-conversation scanner, and miner-
registry wire-up that were ostensibly part of the release.

This merge brings the stale `feat/llm-entity-refine` branch (which
contains the rolled-up merge commit for MemPalace#1157 → MemPalace#1150 → everything
below) into develop so the release tag includes it.

No code changes here — only history recovery.
shrhoads pushed a commit to shrhoads/mempalace that referenced this pull request Apr 24, 2026
Adds entries to the 3.3.3 section for the work that landed via MemPalace#1148,
MemPalace#1150, MemPalace#1157, and MemPalace#1175 (rescued from stacked feature branches into
develop via MemPalace#1175). Without these entries the 3.3.3 release notes on
main would advertise only the hook/diary/search fixes that made it to
develop through the first direct merge.

Covers:
- Manifest + git-author entity detection (MemPalace#1148)
- Regex detector accuracy improvements (MemPalace#1148)
- Optional --llm classification with Ollama / openai-compat / Anthropic
  provider abstraction and interactive UX (MemPalace#1150)
- Claude Code conversation scanner (MemPalace#1150)
- Init → miner registry wire-up so confirmed entities actually reach
  drawer metadata tagging (MemPalace#1157)
- Case-insensitive project dedup across all sources (MemPalace#1175)
- `mempalace mine` skips the generated entities.json artifact
jphein added a commit to jphein/mempalace that referenced this pull request Apr 24, 2026
… state

Three stale sections updated:

- Fork change queue: row 8 (.blob_seq_ids_migrated marker) struck
  through → FILED as MemPalace#1177. Two new rows added for segfault fixes
  discovered today (MemPalace#1171 concurrent-write lock, MemPalace#1173 quarantine in
  make_client) that weren't in the queue because the bugs surfaced
  today, not during the original 2026-04-21 triage.

- Open upstream PRs: was showing 3 of 10 PRs. Now shows all 10 with
  current CI/review state. All rebased onto current upstream/develop
  and MERGEABLE as of today.

- Merged since v3.3.1: added v3.3.3 release (2026-04-24) with its
  constituent merges — MemPalace#942, MemPalace#833, MemPalace#1097, MemPalace#1145, MemPalace#1147, MemPalace#1148/1150/1157
  entity-detection overhaul (via @igorls's MemPalace#1175 stacked-PR rescue),
  MemPalace#1166 palace-path security, MemPalace#340/MemPalace#1093 install regression, plus MemPalace#851
  from the 2026-04-22 batch.
funguf added a commit to funguf/mempalace that referenced this pull request Apr 25, 2026
Sets `hnsw:batch_size` and `hnsw:sync_threshold` to 50_000 on collection
creation in both `get_collection(..., create=True)` and the legacy
`create_collection()` path. Preserves existing `hnsw:space` and
`hnsw:num_threads=1` (race fix from MemPalace#976) and the `**ef_kwargs` plumbing
for embedding-function injection (perf fix from MemPalace#1148/a4868a3).

Without these defaults, mining ~10K+ drawers triggers ~30 HNSW index
resizes and hundreds of persistDirty() calls. persistDirty uses relative
seek positioning in link_lists.bin; accumulated seek drift across resize
cycles causes the OS to extend the sparse file with zero-filled regions,
each cycle compounding the next. Result: link_lists.bin grows into
hundreds of GB sparse, after which `status`, `search`, and `repair` all
segfault and the palace is unrecoverable.

Empirical: rebuilt a palace from scratch on 39,792 drawers across 5
wings with this fix applied. Final palace 376 MB, link_lists.bin stays
at 0 bytes across both Chroma collection dirs, status and search both
return cleanly. Same workload without the fix bloated the palace to
565 GB sparse (30 GB on disk) and segfaulted at ~15K drawers.

Migration note: chromadb treats HNSW config as immutable post-creation,
so existing bloated palaces still need to be nuked and re-mined; this
only protects fresh collections.

Tests assert both keys land on the persisted collection metadata in
both code paths, which also covers the MemPalace#1161 "config silently dropped"
concern at CI time.

Closes MemPalace#344
Supersedes MemPalace#346

Co-authored-by: robot-rocket-science <[email protected]>