feat(init): wire confirmed entities into the miner's known-entities registry#1157

Merged
igorls merged 2 commits into feat/llm-entity-refine from feat/wire-entities-to-miner
Apr 24, 2026
igorls commented Apr 24, 2026

Summary

The init step's entity output was a dead file. miner.py has always read ~/.mempalace/known_entities.json to tag drawer metadata with recognized names, but nothing ever wrote it — so every improvement to init detection (manifest/git/regex/LLM) stopped at <project>/entities.json and never reached the path that actually uses it.

This wires init → registry. Per-project file is kept as an audit trail.

Measured value

On a representative prose snippet (eight sentences mentioning six real people and four real projects):

| Registry state | Entities recognized |
| --- | --- |
| Empty (current) | 0 |
| Populated by init | 12, all correct |

Multi-word names (Alice Example, Bob Sample) fail the frequency-threshold fallback because each word only appears once. Lowercase / hyphenated project names (my-lib, foo-bar) don't match the CamelCase regex. Both categories were completely invisible to the miner until now. Every recognized name becomes a semicolon-separated tag on the drawer, which ChromaDB uses for entity-filtered search.
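The gap described above can be illustrated with a small sketch. The regex, the threshold, and both function names below are hypothetical stand-ins, since the miner's actual fallback heuristics are not shown in this PR; only the example names come from the description.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the miner's fallback heuristics; the real
# regex and threshold are not shown in this PR.
CAMEL_CASE = re.compile(r"\b[A-Z][a-z]+(?:[A-Z][a-z]+)+\b")
MIN_FREQUENCY = 2


def fallback_entities(text: str) -> list[str]:
    """Guess entities without a registry: CamelCase tokens plus frequent capitalized words."""
    found = set(CAMEL_CASE.findall(text))
    counts = Counter(re.findall(r"\b[A-Z][a-z]+\b", text))
    found.update(word for word, n in counts.items() if n >= MIN_FREQUENCY)
    return sorted(found)


def registry_entities(text: str, known: list[str]) -> list[str]:
    """Match registered names as whole phrases, case-insensitively."""
    hits = []
    for name in known:
        # Lookarounds keep "my-lib" from matching inside "my-library".
        pattern = r"(?<![\w-])" + re.escape(name) + r"(?![\w-])"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append(name)
    return hits


text = "Alice Example reviewed the my-lib patch before Bob Sample merged foo-bar."
known = ["Alice Example", "Bob Sample", "my-lib", "foo-bar"]

print(fallback_entities(text))                    # no name clears either heuristic
print(";".join(registry_entities(text, known)))   # semicolon-separated tag string
```

In this sketch each capitalized word appears once, so the frequency check finds nothing, and neither hyphenated project name is CamelCase. The registry lookup finds all four.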

Why stacked on #1150

Logically independent from the LLM refinement, but each earlier PR in the stack improves the input to the registry: #1148 added manifest/git authors, #1150 adds LLM-classified topics/people. This PR ensures all of that work reaches the one code path that uses it. Merging order is linear: #1148 → #1150 → this.

Implementation

miner.add_to_known_entities({category: [names]}) -> str (new):

  • Reads the existing registry, unions each category, writes back.
  • Case-insensitive dedup, preserves first-seen casing.
  • Tolerant of both on-disk shapes the miner already supports: list of names, or dict mapping name → code (dialect-style). In the dict case, new names are added as keys with None values so existing codes aren't overwritten.
  • Preserves untouched categories — merging {people: [...]} does not clobber an existing places or projects category.
  • Invalidates the in-process mtime cache so cmd_init → cmd_mine in one run sees the write immediately.
  • Writes with ensure_ascii=False so non-ASCII names stay readable.
  • chmod 0o600 — the registry mirrors the user's confirm-step PII.
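The merge behavior listed above might look roughly like the following. This is a minimal sketch, not the code in mempalace/miner.py: the function name, the explicit `registry_path` parameter, and the assumption that the returned string is the path written are all hypothetical, and the cache invalidation step is omitted.

```python
import json
import os
import tempfile
from pathlib import Path


def add_to_known_entities_sketch(new: dict[str, list[str]], registry_path: Path) -> str:
    """Union `new` into the JSON registry at `registry_path`; return the path written."""
    try:
        data = json.loads(registry_path.read_text(encoding="utf-8"))
        if not isinstance(data, dict):
            data = {}  # non-dict file: start fresh
    except (OSError, ValueError):
        data = {}  # missing or malformed file: start fresh

    for category, names in new.items():
        current = data.get(category)
        if not isinstance(current, (list, dict)):
            current = []
        # Iterating a dict yields its keys, so both on-disk shapes work here.
        seen = {str(n).casefold() for n in current}
        for name in names:
            if not name or str(name).casefold() in seen:
                continue  # empty or already present: keep first-seen casing
            seen.add(str(name).casefold())
            if isinstance(current, dict):
                current[str(name)] = None  # new key, existing codes untouched
            else:
                current.append(str(name))
        data[category] = current  # categories absent from `new` are left as-is

    registry_path.parent.mkdir(parents=True, exist_ok=True)
    registry_path.write_text(
        json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    os.chmod(registry_path, 0o600)  # owner read/write only
    return str(registry_path)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "known_entities.json"
    add_to_known_entities_sketch({"people": ["Alice Example"]}, path)
    add_to_known_entities_sketch(
        {"people": ["alice example", "Gergő Móricz"], "projects": ["my-lib"]}, path
    )
    merged = json.loads(path.read_text(encoding="utf-8"))

print(merged)
```

The second call skips "alice example" because it differs from the stored name only by case, and it adds the projects category without touching the people already stored.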

cmd_init now calls it at the end of the confirm-entities step, after the per-project entities.json is written.
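The reason a same-process init followed by mine needs explicit cache invalidation can be sketched as below. The cache layout and both helper names are assumptions for illustration, not the miner's real internals. A cache keyed only on mtime could return stale data if a write lands within the filesystem's timestamp resolution, so the writer drops the entry itself.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical module-level cache: path -> (mtime, parsed registry).
_cache: dict[str, tuple[float, dict]] = {}


def load_registry(path: Path) -> dict:
    """Return the parsed registry, re-reading only when the file's mtime changes."""
    key = str(path)
    mtime = path.stat().st_mtime
    cached = _cache.get(key)
    if cached and cached[0] == mtime:
        return cached[1]
    data = json.loads(path.read_text(encoding="utf-8"))
    _cache[key] = (mtime, data)
    return data


def write_registry(path: Path, data: dict) -> None:
    """Write the registry and drop the cached copy so the next read hits disk."""
    path.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")
    _cache.pop(str(path), None)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "known_entities.json"
    write_registry(path, {"people": []})
    before = load_registry(path)   # populates the cache
    write_registry(path, {"people": ["Alice Example"]})
    after = load_registry(path)    # cache was dropped, so this re-reads the file

print(before, after)
```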

Tests

17 new tests (tests/test_known_entities_registry.py), all offline:

  • Fresh-file creation
  • List-category union + case-insensitive dedup
  • Preservation of untouched categories
  • Dict-format registries (new keys added with None; existing codes preserved)
  • Malformed / non-dict file recovery (starts fresh)
  • Cache invalidation (same-process read reflects the write)
  • Unicode round-trip
  • End-to-end verification that _extract_entities_for_metadata picks up every registered name

Full suite: 1221 passed, ruff clean.

Test plan

  • uv run pytest tests/ --ignore=tests/benchmarks — full suite passes
  • ruff check mempalace/ tests/ — clean
  • ruff format --check mempalace/ tests/ — clean
  • Measured: 0 → 12 entity recognition delta on representative prose sample
  • Reviewer verification: run mempalace init <repo> then mempalace mine <repo>, confirm drawer metadata contains the registered names

…egistry

The init step's output was a dead file. miner.py has always read
`~/.mempalace/known_entities.json` to tag drawer metadata with
recognized names, but nothing ever wrote it — so init's careful
manifest + git + LLM detection work stopped at `<project>/entities.json`
and never reached the path that actually uses it.

Measured delta on a representative prose snippet (eight sentences
mentioning six real people and four real projects):
- Empty registry: 0 entities recognized (multi-word names fail the
  frequency threshold; lowercase/hyphenated project names don't match
  the CamelCase regex).
- Registry populated by init: 12 entities recognized (all correct, zero
  false positives).

Every recognized name becomes a semicolon-separated metadata tag on the
drawer, which ChromaDB uses for entity-filtered search.

Implementation:

- `miner.add_to_known_entities({category: [names]})` reads the existing
  registry, unions each category (case-insensitively, preserving first-
  seen casing), and writes back. The function is tolerant of the two
  on-disk shapes miner already supports: list of names, or dict mapping
  name → code (dialect-style). In the dict case new names are added as
  keys with `None` values so existing codes aren't overwritten.
- Invalidates the in-process mtime cache so same-process callers
  (`cmd_init` → `cmd_mine` in one run) see the write immediately.
- Writes with `ensure_ascii=False` so non-ASCII names (Gergő Móricz,
  Arturo Domínguez, etc.) stay readable on disk.
- Chmods 0o600 — the registry mirrors confirm-step PII from the user's
  git authors and local paths.

cmd_init now calls this at the end of the confirm-entities step, after
the per-project `entities.json` is written (which is kept as an audit
trail the user can inspect or hand-edit). The per-project file is still
excluded from mining via `SKIP_FILENAMES` from the earlier fix.

17 new tests cover: fresh-file creation, list-category union, case-
insensitive dedup, preservation of untouched categories, dict-format
registries, malformed/non-dict file recovery, cache invalidation,
unicode round-trip, and an end-to-end verification that the miner's
`_extract_entities_for_metadata` picks up every registered name.
Copilot AI review requested due to automatic review settings April 24, 2026 05:09
Copilot AI left a comment

Pull request overview

Wires mempalace init’s confirmed entities into the global known-entity registry (~/.mempalace/known_entities.json) so the miner can tag drawer metadata with those names during mining, while still keeping <project>/entities.json as a per-project audit trail.

Changes:

  • Add miner.add_to_known_entities() to merge confirmed entities into the global registry and invalidate the in-process registry cache.
  • Update cmd_init to write entities.json with ensure_ascii=False and to call add_to_known_entities() after confirmation.
  • Add a dedicated offline test suite covering registry merge behavior, error tolerance, cache invalidation, Unicode, and miner metadata extraction.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| mempalace/miner.py | Adds registry merge helper (add_to_known_entities) and cache invalidation for same-process init→mine workflows. |
| mempalace/cli.py | Updates init to persist confirmed entities as UTF-friendly JSON and merge them into the global registry. |
| tests/test_known_entities_registry.py | Adds comprehensive tests for registry creation/merge semantics and end-to-end miner metadata tagging recall. |

Comment thread mempalace/miner.py Outdated
Comment on lines +522 to +525
    elif isinstance(current, dict):
        for n in names:
            if n and n not in current:
                current[n] = None
Comment thread mempalace/cli.py Outdated
    # global registry the miner reads at mine time.
    if confirmed["people"] or confirmed["projects"]:
        entities_path = Path(args.dir).expanduser().resolve() / "entities.json"
        with open(entities_path, "w") as f:
igorls commented Apr 24, 2026

@copilot apply changes based on the comments in this thread

Copilot AI commented Apr 24, 2026

@copilot apply changes based on the comments in this thread

Applied the requested review fixes in 1b1854e: dict-format registry merges now dedupe case-insensitively while stringifying new names, and entities.json is written with encoding="utf-8". I also added a regression test for the dict-format case and re-ran the targeted pytest + ruff checks.
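For context, the reviewed snippet tested `n not in current`, which is an exact-match check on dict keys, so a name differing only by case would be added as a second key. A minimal sketch of the case-insensitive, stringifying merge described in the reply follows. It is not the code from 1b1854e; the function name and the "ALC" code value are hypothetical.

```python
def merge_into_dict_registry(current: dict, names: list) -> dict:
    """Add unseen names as keys with None values; existing codes stay untouched."""
    seen = {str(key).casefold() for key in current}
    for name in names:
        if not name:
            continue  # skip empty strings and None
        text = str(name)  # stringify so non-str inputs become valid JSON keys
        if text.casefold() in seen:
            continue  # case-insensitive duplicate: keep the first-seen casing
        seen.add(text.casefold())
        current[text] = None
    return current


people = {"Alice Example": "ALC"}
merge_into_dict_registry(people, ["alice example", "Bob Sample", "BOB SAMPLE", ""])
print(people)
```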

igorls added the enhancement (New feature or request) and area/mining (File and conversation mining) labels Apr 24, 2026
igorls merged commit 61d6c3c into feat/llm-entity-refine Apr 24, 2026
igorls added a commit that referenced this pull request Apr 24, 2026
…-develop

chore: rescue merged stacked PRs #1150 and #1157 into develop
shrhoads pushed a commit to shrhoads/mempalace that referenced this pull request Apr 24, 2026
…to develop

MemPalace#1148, MemPalace#1150, and MemPalace#1157 were reviewed and merged on GitHub, but the two
stacked children landed on their parent feature branches (now stale)
rather than on develop. Only MemPalace#1148's commits reached develop via the
direct merge. Release PR MemPalace#1159 (develop → main for v3.3.3) is therefore
missing the LLM refinement, Claude-conversation scanner, and miner-
registry wire-up that were ostensibly part of the release.

This merge brings the stale `feat/llm-entity-refine` branch (which
contains the rolled-up merge commit for MemPalace#1157 → MemPalace#1150 → everything
below) into develop so the release tag includes it.

No code changes here — only history recovery.
shrhoads pushed a commit to shrhoads/mempalace that referenced this pull request Apr 24, 2026
Adds entries to the 3.3.3 section for the work that landed via MemPalace#1148,
MemPalace#1150, MemPalace#1157, and MemPalace#1175 (rescued from stacked feature branches into
develop via MemPalace#1175). Without these entries the 3.3.3 release notes on
main would advertise only the hook/diary/search fixes that made it to
develop through the first direct merge.

Covers:
- Manifest + git-author entity detection (MemPalace#1148)
- Regex detector accuracy improvements (MemPalace#1148)
- Optional --llm classification with Ollama / openai-compat / Anthropic
  provider abstraction and interactive UX (MemPalace#1150)
- Claude Code conversation scanner (MemPalace#1150)
- Init → miner registry wire-up so confirmed entities actually reach
  drawer metadata tagging (MemPalace#1157)
- Case-insensitive project dedup across all sources (MemPalace#1175)
- `mempalace mine` skips the generated entities.json artifact