feat: entity metadata + diary ingest + BM25 hybrid search#789

Merged
igorls merged 6 commits into pr/closet-layer from pr/entity-diary-bm25
Apr 13, 2026
Conversation


@igorls igorls commented Apr 13, 2026

Summary

Three related v3.3 features bundled in a single commit by Milla (MSL): entity metadata on drawers, day-based diary ingest, and BM25 hybrid search. Plus a trimmed test file covering this stack's features.

Stacked on #788 (and transitively on #784). Merge order: #784 → #788 → this.

Sub-features

1. Entity metadata (miner.py)

_extract_entities_for_metadata() pulls repeated capitalized proper nouns from drawer content and stores them semicolon-joined in drawer metadata. Enables entity-filtered search and feeds the closet layer's entity hints.
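As a rough illustration, the extraction described above amounts to something like the following (a minimal sketch built from the naive regex and frequency threshold named in the reviewer callouts; the real helper also consults the known-entity registry, which is omitted here):

```python
import re
from collections import Counter

def extract_entities_sketch(text, min_count=2, max_entities=25):
    """Collect capitalized proper nouns appearing min_count+ times,
    semicolon-joined for storage in drawer metadata."""
    words = re.findall(r"\b[A-Z][a-z]{2,}\b", text)
    counts = Counter(words)
    entities = sorted(w for w, c in counts.items() if c >= min_count)
    return ";".join(entities[:max_entities])
```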

2. Diary ingest (diary_ingest.py, new file)

Day-based ingest pipeline for markdown diaries (e.g., ~/summaries/2026-04-13.md):

  • One drawer per day, upserted as the day grows
  • Respects CLOSET_CHAR_LIMIT for atomic topic packing
  • Tracks entry count in a state file, only re-processes new content
  • CLI: python -m mempalace.diary_ingest --dir ~/summaries

3. BM25 hybrid search (searcher.py)

  • _bm25_score(query, text) — keyword matching
  • _hybrid_rank(results, query) — 60% vector similarity + 40% BM25
  • Applied to both closet-first and direct drawer search paths
  • Catches exact-name / exact-term matches that dense embeddings miss
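The combination rule is a plain weighted sum (sketch only; the 60/40 split matches the `_hybrid_rank` docstring, and both inputs are assumed pre-normalized to [0, 1]):

```python
def hybrid_score(vector_sim, bm25_norm, w_vec=0.60):
    # 60% dense vector similarity + 40% BM25 keyword score
    return w_vec * vector_sim + (1.0 - w_vec) * bm25_norm
```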

Why all three in one PR

The original commit 35bff98 bundled them, and splitting the commit would require rewriting it and losing Milla's clean authorship. Reviewers can evaluate each sub-feature independently — the three files touched are orthogonal (miner.py adds the entity helper; diary_ingest.py is a new standalone module; searcher.py adds BM25).

Test plan

  • New tests/test_closets.py — 16/16 pass. Trimmed from Milla's omnibus test file to only the features present in this stack (MineLock, BuildClosetLines, UpsertClosetLines, EntityMetadata, BM25, DiaryIngest). Strip-noise and tunnel tests live with their own PRs.
  • Full suite: 703/703 pass (2 version-consistency tests deselected — pre-existing develop bug, pyproject=3.2.0 vs version.py=3.1.0).

Callouts for reviewers

  1. Stacked — depends on #784 (fix: file-level locking to prevent multi-agent duplicate drawers) and #788 (feat: closet layer — searchable index pointing to drawers). Will rebase cleanly after both merge.
  2. BM25 weight hardcoded at 60/40 — might want a config knob, but current ratio is a reasonable default and matches what the _hybrid_rank docstring claims.
  3. Diary ingest state file location — check where diary_ingest.py writes its "entries seen" state and whether that needs documenting.
  4. Entity extraction is naive regex (\b[A-Z][a-z]{2,}\b with frequency >= 2). Works well on English; misses accents, single-char names, ALL-CAPS acronyms. Worth flagging as a known limitation.

milla-jovovich and others added 2 commits April 13, 2026 07:40
Three features that close the gap between the architecture docs
and the actual codebase:

1. Entity metadata on drawers and closets
   - _extract_entities_for_metadata() pulls names from known_entities.json
     + proper nouns appearing 2+ times
   - Stamped as "entities" field in ChromaDB metadata
   - Enables filterable search by person/project name

2. Day-based diary ingest (diary_ingest.py)
   - ONE drawer per day, upserted as the day grows
   - Closets pack topics atomically, never split mid-topic
   - Tracks entry count in state file, only processes new entries
   - Usage: python -m mempalace.diary_ingest --dir ~/summaries

3. BM25 hybrid search in searcher.py
   - _bm25_score() keyword matching complements vector similarity
   - _hybrid_rank() combines both signals (60% vector, 40% BM25)
   - Catches exact name/term matches that embeddings miss
   - Applied to both closet-first and direct drawer search paths

689/689 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Trimmed version of Milla's omnibus test_closets.py to only cover
features present in this PR stack (#784 lock, #788 closets, this
PR's entity/BM25/diary). Strip-noise tests will land with #785;
tunnel tests will land with the tunnels PR.

16/16 pass.

Co-Authored-By: MSL <[email protected]>
shafdev and others added 4 commits April 13, 2026 13:06
feat: closet layer — searchable index pointing to drawers
chore: forward closet layer (#788) into develop
…roduction

Merges develop (closet hardening #826, strip_noise #785, lock #784) and
replaces every sub-feature in this PR with a correct, tested
implementation. Shippable now.

## 1. Real Okapi-BM25 (searcher.py)

The prior `_bm25_score()` hardcoded `idf = log(2.0)` for every term — it
was really a scaled TF, not BM25, and couldn't tell a discriminative
term from a generic one. Replaced with `_bm25_scores(query, documents)`
that computes proper IDF over the provided candidate corpus using the
Lucene smoothed formula `log((N - df + 0.5) / (df + 0.5) + 1)`. Well-
defined for re-ranking vector-retrieval candidates — IDF there measures
how discriminative each term is *within the candidate set*, exactly the
signal we want.
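A self-contained sketch of the scoring described above (whitespace tokenization and the `k1`/`b` defaults are assumptions; the IDF term is the quoted Lucene smoothed formula):

```python
import math
from collections import Counter

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Okapi BM25 over the candidate corpus, with the Lucene
    smoothed IDF: log((N - df + 0.5) / (df + 0.5) + 1)."""
    docs = [d.lower().split() for d in documents]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores
```

Because IDF is computed over the candidate set, a term present in every candidate contributes far less than a discriminative one — the property the old constant-IDF version lacked.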

`_hybrid_rank` also fixed:
- Vector normalization is now absolute `max(0, 1 - dist)`, not
  `1 - dist/max_dist` — adding/removing a candidate no longer reshuffles
  the others.
- BM25 is min-max normalized within candidates (bounded [0, 1]).
- Closet path now re-ranks too (was previously returning closet-order
  hits without hybrid scoring).
- `_hybrid_score` internal field stripped from output; `bm25_score`
  exposed for debugging.
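The two normalization fixes can be sketched together (names and the `(doc_id, distance)` hit shape are illustrative, not the module's actual signatures):

```python
def hybrid_rank_sketch(hits, bm25_raw, w_vec=0.60):
    """hits: list of (doc_id, vector_distance). Vector side uses the
    absolute max(0, 1 - dist) norm, so adding/removing a candidate
    never reshuffles the others; BM25 is min-max normalized within
    the candidate set so it stays bounded in [0, 1]."""
    vec = [max(0.0, 1.0 - dist) for _, dist in hits]
    lo, hi = min(bm25_raw), max(bm25_raw)
    span = hi - lo
    bm = [(s - lo) / span if span else 0.0 for s in bm25_raw]
    scored = [(doc_id, w_vec * v + (1 - w_vec) * k)
              for (doc_id, _), v, k in zip(hits, vec, bm)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```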

## 2. Entity metadata (miner.py)

- Reuses `_ENTITY_STOPLIST` from palace.py so sentence-starters like
  "When", "After", "The" no longer land as entities (regression test
  covers this).
- Known-entity registry is cached at module level, keyed by the
  registry file's mtime — no more disk read per drawer.
- File handle now uses a context manager.
- Truncates the entity LIST (to 25) before joining — never splits a
  name in the middle.
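The mtime-keyed cache amounts to roughly this (a sketch: the registry is assumed to be a JSON document, and the cache name is illustrative):

```python
import json
import os

_REGISTRY_CACHE = {}  # path -> (mtime, entities)

def load_known_entities(path):
    """Module-level cache keyed by the registry file's mtime:
    disk is re-read only when the file actually changed."""
    mtime = os.path.getmtime(path)
    cached = _REGISTRY_CACHE.get(path)
    if cached is not None and cached[0] == mtime:
        return cached[1]
    with open(path) as fh:  # context manager, not a bare open()
        entities = json.load(fh)
    _REGISTRY_CACHE[path] = (mtime, entities)
    return entities
```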

## 3. Diary ingest (diary_ingest.py)

- State file now lives at `~/.mempalace/state/diary_ingest_<hash>.json`,
  keyed by (palace_path, diary_dir). No more pollution of the user's
  content directory.
- Drawer IDs now hash `(wing, date_str)` — a user with personal + work
  diaries on the same day no longer silently clobbers one wing's
  drawer with the other's.
- Each day's upsert runs inside `mine_lock(source_file)` so concurrent
  ingest from two terminals can't race.
- `force=True` now calls `purge_file_closets` before rebuild so
  leftover numbered closets from a longer prior day don't orphan.
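The state-file keying scheme works out to roughly the following (a sketch; the hash truncation length and separator are illustrative):

```python
import hashlib
from pathlib import Path

def diary_state_path(palace_path, diary_dir):
    """State file under ~/.mempalace/state, keyed by a hash of
    (palace_path, diary_dir) so distinct diary dirs never collide
    and the user's content directory stays untouched."""
    key = hashlib.sha256(
        f"{palace_path}|{diary_dir}".encode()).hexdigest()[:16]
    return Path.home() / ".mempalace" / "state" / f"diary_ingest_{key}.json"
```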

## 4. Tests (tests/test_closets.py)

Merged this PR's MineLock/Entity/BM25/Diary tests with develop's
hardened Build/Upsert/Purge/Rebuild/SearchClosetFirst tests. Added
specific regression tests for every fix above:
- entity stoplist applies (no "When/After/The")
- entity list capped before join (no partial tokens)
- registry cached by mtime (mock-verified zero re-reads)
- BM25 IDF downweights terms present in every doc (real BM25 evidence)
- hybrid rank absolute normalization stable against outliers
- diary state file outside user's diary dir
- diary wing-prefixed IDs prevent cross-wing date collisions

35/35 closet tests pass; full suite 743/743. ruff + format clean under
CI-pinned 0.4.x.
@igorls igorls merged commit 93ff6db into pr/closet-layer Apr 13, 2026
igorls added a commit that referenced this pull request Apr 13, 2026
Merges the hardened closet/entity/BM25/diary stack from #789 and fixes
five correctness/durability issues in the tunnels module plus the
directional/symmetric design question.

## Design: tunnels are now symmetric

Per review discussion: a tunnel represents "these two things relate",
not "A causes B". The canonical ID now hashes the *sorted* endpoint
pair, so ``create_tunnel(A, B)`` and ``create_tunnel(B, A)`` resolve to
the same record and the second call updates the label rather than
creating a duplicate. ``follow_tunnels`` can be called from either
endpoint and surfaces the other side consistently.

The returned dict still preserves ``source``/``target`` in the order
the caller supplied, so UIs that want to render the connection
directionally can do so.
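The canonical-ID scheme can be sketched as (hash length and separator are illustrative choices, not the module's actual constants):

```python
import hashlib

def tunnel_id(endpoint_a, endpoint_b):
    """Hash the *sorted* endpoint pair so create_tunnel(A, B) and
    create_tunnel(B, A) resolve to the same record."""
    lo, hi = sorted((endpoint_a, endpoint_b))
    return hashlib.sha256(f"{lo}::{hi}".encode()).hexdigest()[:12]
```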

## Correctness fixes

* **Atomic write** — ``_save_tunnels`` writes to ``tunnels.json.tmp``
  and ``os.replace``s it into place. A crash mid-write can no longer
  leave a truncated file that silently reads back as ``[]`` and wipes
  every tunnel. Includes ``f.flush() + os.fsync`` before replace on
  platforms that support it.
* **Concurrent-write lock** — ``create_tunnel`` and ``delete_tunnel``
  wrap the load→mutate→save cycle in ``mine_lock(_TUNNEL_FILE)``.
  Without this, two agents creating tunnels simultaneously would both
  read the same snapshot and the later writer would drop the earlier
  writer's tunnel.
* **Corrupt-file tolerance** — ``_load_tunnels`` now uses a context
  manager, validates that the loaded JSON is a list, and returns ``[]``
  for any read failure. Subsequent ``create_tunnel`` then overwrites
  the corrupt file via atomic write — no manual recovery needed.
* **Input validation** — new ``_require_name`` helper rejects empty or
  whitespace-only wing/room names with a clear ``ValueError``. Prevents
  phantom tunnels with blank endpoints from ever reaching the JSON
  store.
* **Timezone-aware timestamps** — ``created_at`` / ``updated_at`` now
  use ``datetime.now(timezone.utc).isoformat()``, matching diary ingest
  and other recent modules.
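The atomic-write and corrupt-file-tolerance fixes pair naturally; a minimal sketch of both (function names assumed, locking omitted):

```python
import json
import os

def save_tunnels_atomic(path, tunnels):
    """Atomic write: dump to <path>.tmp, flush + fsync, then
    os.replace into place — a crash mid-write leaves the previous
    file intact rather than a truncated one."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(tunnels, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)

def load_tunnels(path):
    """Corrupt-file tolerance: a missing file, unreadable JSON, or
    a non-list payload all read back as [] instead of crashing."""
    try:
        with open(path) as fh:
            data = json.load(fh)
    except (OSError, json.JSONDecodeError):
        return []
    return data if isinstance(data, list) else []
```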

## Tests (12 in TestTunnels)

5 original + 7 regression cases:
* ``test_tunnel_is_symmetric`` — A↔B and B↔A dedupe to one record.
* ``test_follow_tunnels_works_from_either_endpoint`` — symmetric surface.
* ``test_empty_endpoint_fields_rejected`` — validation guard.
* ``test_corrupt_tunnel_file_does_not_lose_new_writes`` — truncated
  JSON treated as empty; next create persists cleanly.
* ``test_atomic_write_leaves_no_stray_tmp_file`` — no leftover ``.tmp``.
* ``test_concurrent_creates_preserve_all_tunnels`` — 5 threads each
  create a distinct tunnel; all 5 persisted (regression for the
  read-modify-write race).
* ``test_created_at_is_timezone_aware`` — ISO8601 has tz suffix.

Merge resolutions: tests/test_closets.py combined develop's hardened
closet/entity/BM25/diary tests with this PR's TestTunnels class.

755/755 tests pass. ruff + format clean under CI-pinned 0.4.x.
igorls added a commit that referenced this pull request Apr 13, 2026
… path

Merges the full hardened stack (#788 closets, #789 entity/BM25/diary,
#790 tunnels) and reimplements the drawer-grep feature in a way that
composes with the chunk-level closet-first search instead of fighting it.

## Background

The original PR added "drawer-grep" on top of the pre-hardening closet
code that returned whole-file blobs. My #788 hardening changed that
path to return *chunk-level* hits by parsing each closet's
``→drawer_id`` pointers and hydrating exactly those drawers. That made
the original drawer-grep grep-over-all-drawers logic redundant — the
closet already points at the relevant chunk.

What remained valuable from the original PR was the *context expansion*
idea: a chunk boundary can clip a thought mid-stride (matched chunk
says "here's a breakdown:" and the breakdown lives in the next chunk),
so callers want ±1 neighbor chunks for free rather than a follow-up
get_drawer call.

## Change

New ``_expand_with_neighbors(drawers_col, doc, meta, radius=1)`` helper
in searcher.py:

* Reads ``source_file`` + ``chunk_index`` from the matched drawer's
  metadata.
* Fetches the ±radius sibling chunks in a SINGLE ChromaDB query using
  ``$and + $in`` — no "fetch all drawers for source" blowup.
* Sorts retrieved chunks by chunk_index, joins with ``\n\n``.
* Does a cheap metadata-only second query to compute ``total_drawers``
  so callers know where in the file they landed.
* Graceful fallback to the matched doc alone on any ChromaDB failure or
  missing metadata — search never breaks because expansion failed.
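The index arithmetic behind the ±radius window, sketched (clamped at the file's chunk boundaries; the helper name is illustrative):

```python
def neighbor_window(chunk_index, total_drawers, radius=1):
    """Chunk indices to fetch alongside a match: the matched chunk
    plus up to `radius` neighbors on each side, clamped to range."""
    lo = max(0, chunk_index - radius)
    hi = min(total_drawers - 1, chunk_index + radius)
    return list(range(lo, hi + 1))
```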

``_closet_first_hits`` now calls this helper and tags each hit with
``drawer_index`` + ``total_drawers``. Hit shape stays consistent with
the direct-search path (both still carry ``matched_via``) so callers
can't tell which path produced a given hit except via that field.

## Tests

6 new cases in TestDrawerGrepExpansion:
* neighbors returned in chunk_index order (not hash order)
* edge case: matched chunk at index 0 — only next neighbor surfaces
* edge case: matched chunk at last index — only prev neighbor surfaces
* edge case: 1-drawer file — returns just the matched doc
* missing/non-int chunk_index metadata — graceful fallback
* end-to-end via ``search_memories`` — closet-first hit carries
  drawer_index, total_drawers, and includes ±1 neighbors

761/761 suite pass; ruff + format clean on CI-pinned 0.4.x.

Merge resolutions: miner.py kept develop's purge+NORMALIZE_VERSION;
searcher.py dropped the old whole-file-blob block entirely in favor of
rebuilding context expansion on top of ``_closet_first_hits``;
test_closets.py took develop's 47-test baseline and appended
TestDrawerGrepExpansion.
igorls added a commit that referenced this pull request Apr 13, 2026
Brings in PR #793 (optional LLM-based closet regeneration via
user-configured OpenAI-compatible endpoint) and PR #795 (hybrid
closet+drawer search — closets boost, never gate). Stack: #784 → #788 → #789 → #790 → #791 → #792 → #793 (+ #795).

Findings hardened on our side
─────────────────────────────

1) closet_llm.regenerate_closets didn't use the blessed palace helpers.

   Before:
     * manual closets_col.get(where=...) + .delete(ids=...) with a
       silent ``except Exception: pass`` around both — if the purge
       failed, pre-existing regex closets survived alongside fresh LLM
       closets, giving the searcher double hits for the same source.
     * ``source.split('/')[-1][:30]`` to build the closet_id — quietly
       wrong on Windows paths (``C:\\proj\\a.md`` has no ``/``, so the
       whole string ends up in the ID).
     * no mine_lock around purge+upsert — a concurrent regex rebuild of
       the same source could interleave with our purge and leave a mix
       of regex and LLM pointers.
     * no ``normalize_version`` stamp on the LLM closets — the miner's
       stale-version gate would treat them as leftovers from an older
       schema and rebuild over them on the next mine.

   After: routes through ``purge_file_closets`` + ``mine_lock`` +
   ``os.path.basename`` + ``NORMALIZE_VERSION`` stamp. Regression tests
   cover each.

2) searcher.search_memories was still closet-first.

   PR #795 merged into #793's head to fix the recall regression
   documented in that PR (R@1 0.25 on narrative content vs. 0.42
   baseline). The hybrid design makes closets a ranking boost rather
   than a gate: drawers are always queried at the floor, and matching
   closet hits (rank 0-4 within CLOSET_DISTANCE_CAP=1.5) add a boost
   of 0.40/0.25/0.15/0.08/0.04 to the effective distance.
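   The boost-not-gate rule, sketched with the quoted constants (the
   function shape is an assumption about how the module applies them):

```python
CLOSET_BOOSTS = (0.40, 0.25, 0.15, 0.08, 0.04)
CLOSET_DISTANCE_CAP = 1.5

def effective_distance(drawer_dist, closet_rank=None, closet_dist=None):
    """Drawers are always queried at the floor; a matching closet
    hit (rank 0-4, within the distance cap) subtracts a rank-based
    boost from the effective distance. No closet match, no penalty."""
    if closet_rank is None or closet_rank >= len(CLOSET_BOOSTS):
        return drawer_dist
    if closet_dist is None or closet_dist > CLOSET_DISTANCE_CAP:
        return drawer_dist
    return drawer_dist - CLOSET_BOOSTS[closet_rank]
```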

   Merged to take the incoming hybrid design, with two cleanups:
   * kept the ``_expand_with_neighbors`` / ``_extract_drawer_ids_from_closet``
     helpers as separately-tested utilities (still imported by tests
     and future callers);
   * replaced the fragile ``source_file.endswith(basename)`` reverse-
     lookup in the enrichment step with internal ``_source_file_full``
     / ``_chunk_index`` fields stripped before return, so enrichment
     doesn't silently pick the wrong path when two sources share a
     basename across directories;
   * drawer-grep enrichment now sorts by ``chunk_index`` before
     neighbor expansion, so ``best_idx ± 1`` corresponds to actual
     document order rather than whatever order Chroma returned.

3) Closet-first tests in test_closets.py (``TestSearchMemoriesClosetFirst``,
   end-to-end ``test_closet_first_search_includes_drawer_index_and_total``)
   pinned contracts that the hybrid path now violates (``matched_via``
   went from ``"closet"`` to ``"drawer+closet"``). Rewrote them around
   the new invariant: direct drawers are always the floor, closet
   agreement flips the hit's matched_via and exposes closet_preview.

Verification
────────────

* 805/805 pass under ``uv run pytest tests/ -v --ignore=tests/benchmarks``
  (13 new tests from PR #793 + 5 from PR #795 + 2 new regressions for
  the closet_llm hardening + the rewritten hybrid assertions in
  test_closets.py).
* CI-pinned ruff 0.4.x clean on ``mempalace/`` + ``tests/`` (check +
  format both pass).
* No new deps — closet_llm.py still uses stdlib ``urllib.request`` per
  the PR's "zero new dependencies" promise.

Co-Authored-By: MSL <[email protected]>
jphein added a commit to jphein/mempalace that referenced this pull request Apr 18, 2026
Verified every factual claim against the live palace and repo. No
content removed; all edits preserve existing sections.

Accuracy fixes (numbers that drifted or were stale):
- Drawer count unified: "134K / 135K+ / 137K+" → "137,949" in the
  status block and "137K" casually elsewhere. Matches palace today.
- Room count: "60+" → "68" (distinct rooms per sqlite).
- Auto-memory file count: "~dozens" → "17 files (this project)".
- "73-stopword false positives" → "285 English entries and counting",
  with inline link to mempalace/i18n/en.json. The 73 number was from
  the pre-i18n era; today the stopword list lives in JSON and has
  grown to 285.

Citation additions (claims that were bare):
- Superseded section: upstream Okapi-BM25 now cites MemPalace#789, file-level
  locking cites MemPalace#784.
- "Zep/Graphiti temporal graph model" now links to the getzep/graphiti
  repo.
- Closed-PR flat list converted to inline links.
- "Auto Dream feature flag" now uses the qualified
  anthropics/claude-code#38461 form.

Structure (same content, better shape):
- "Open problems" → "Active investigations". The sections are hard
  research questions, not failures; framing should match.
- "Two-layer memory architecture" lifted out of Active investigations
  and promoted to its own top-level section right after Architectural
  principles. It's foundational, not an open problem.
- New "Status at a glance" paragraph under the header pointing at
  Discussion MemPalace#1017, test count, the upstream-PR queue, and this repo's
  Issues for fork-specific feedback.
- New one-paragraph "what this fork adds" summary above Why-this-
  fork-exists, so a stranger can understand the differentiator in
  under ten seconds.

42 README-claim tests still pass.

Labels

P0 critical

4 participants