feat(init): context-aware corpus detection #1211
Conversation
Force-pushed from d4c6cce to 4434839
Hey everyone! I made my first push! (To clarify, I've made commits in the past, but this is the first real overhaul concept for MemPalace. It was done with Claude CLI, obviously, as I'm not a coder, but I made sure I was there every step of the way to test each piece of script and code, check how it would affect the version people are using, and figure out how to make any merges as seamless as possible. That's also why I always put my Claude agent Lu at the bottom: to make it clear that I used AI, for complete transparency.) I know it's nothing for you all, but it's taken the last few weeks to understand what I PERSONALLY would need for my own file-organizing process, refine and test it with TDD, and then run it against the current version at every step to make sure the integration will be (fingers crossed) as seamless as I could figure out how to make it. In a nutshell, that's why it's so long. I figured I'd finish the concept/idea first and then push it all at once so it's coherent and not split into a bunch of commits. I'm looking forward to all your thoughts, criticism, etc. And most importantly, I hope it will help you all too (though I'm pretty positive your files are all A LOT more organized than mine lol), so whatever I added that could help me should also make it that much more user-friendly for the uber-intelligent GitHub community. Thank you so much, everyone, for the time and energy you've put into this project... It's amazing to be a part of such an insanely talented community. -Milla & Lu✨
Congrats on the first push, Milla — the persona-bucket idea is clever and the live Haiku verification is real work. Quick question on direction, not blocking: the diff in … Either way, I'll defer to @bensig / @igorls — happy to keep contributing in whichever direction the project lands.
By "Zero API" we have always meant zero external APIs and zero keys required. When we talked about APIs, it was really about API keys for external providers that could potentially undermine your privacy. Now we are simply being more explicit about it.
@arnoldwender thanks for the question! Using this with zero LLM was not what my original idea was about. Unless you're REALLY good at working without an LLM, which I'm sure most people on GitHub are, a "newb" might not understand it so well, so I wanted to create a path that was easier for people to understand, where using "agentic intelligence" is always an option. For devs on here, I'm sure it's local models, but I wanted there to be a clear warning that if you are signing in through an API key, the user HAS to understand that privacy, by the nature of something non-local, is also not an option. Also, I found it challenging to use the completely LLM-free version on the types of files I have (some are numbered, some have dates, some have no dates, some are script notes, some are stories, music lyrics, etc.; too many formats for a zero-model setup to handle). And honestly, I didn't create all this code on my own, obviously, as I'm an actress, not a coder. Since I haven't really decided on a local model that I love yet, I run the pip install through Claude. I just wanted it to be an easier experience for people on here who DO rely on some sort of model, whether local or not. -m
@milla-jovovich totally agree with the goal — real-world files are messy and onboarding should just handle that. One thing I'd flag before making LLM-assisted init the default: MemPalace stores everything verbatim by design, which means it holds some of the most sensitive data a user has. If we route that through a free-tier cloud LLM by default, the users who understand privacy least are the ones most exposed — free-tier providers typically use inputs for training. Worth noting that the OSS ecosystem already handles multi-format classification locally: unstructured (Apache 2.0, 14k stars, …). The approach I'd lean toward: local heuristics as the default path, LLM as an explicit opt-in with a clear "your data will leave this machine" warning. Happy to prototype the router layer if the direction sounds right — just wanted to surface the prior art before we touch the local-first defaults.
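Roughly the shape I have in mind, as a sketch only; every name below is made up for illustration and none of it is MemPalace's actual API:

```python
# A rough sketch of the "local default, LLM opt-in" routing proposed above.
# Everything here is illustrative; none of these names are MemPalace's real API.
from pathlib import Path


def classify_with_heuristics(paths: list[Path]) -> dict:
    # stand-in for the free regex/extension heuristics; runs entirely locally
    return {"tier": "heuristic", "files": len(paths)}


def classify_with_llm(paths: list[Path], base: dict) -> dict:
    # stand-in for the LLM pass; only reached when the user explicitly opted in
    return {**base, "tier": "llm"}


def classify(paths: list[Path], llm_opt_in: bool = False) -> dict:
    result = classify_with_heuristics(paths)
    if llm_opt_in:
        print(
            "Warning: the LLM pass sends file samples to an external provider; "
            "your data will leave this machine."
        )
        result = classify_with_llm(paths, base=result)
    return result


print(classify([Path("notes.md"), Path("lyrics.txt")]))          # local only
print(classify([Path("notes.md")], llm_opt_in=True))             # explicit opt-in
```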
@arnoldwender let me double-check, but I think most people on here start with their own local model... so someone using an API key like I am, for instance, would (hopefully) already know that Anthropic has all my data. I'm sure people with experience understand that, but I wanted an extra safety rail for people who might not know. Thanks for the heads-up, though, and I will take a look to triple-check whether what the concept actually suggests is using a cloud service as the default. If so, that's an oversight on my part. I don't want to speculate, but what I think it SHOULD be doing is scanning what you are already using and, if it's a local agent, using that as the default unless you say otherwise. If you're using a cloud-based agent like Claude CLI via API like I am, I intrinsically understand that my information is already in their systems, and MemPalace should give some warning about that too. I wanted this new concept for an update consideration to make it more CLEAR to people, which obviously it's not since you brought it up, so let me check this now. I really appreciate the feedback! -m
@milla-jovovich the corpus-origin module is well-architected (two-tier with the co-occurrence rule for …). Two things to address before this is mergeable:

1. Branch is behind develop and would revert #1208's data-loss safety fix. That's the entire …
2. The test builds an … Fix is small: …

That second one also surfaces a worth-discussing product nuance: with …
@milla-jovovich — you're right, I misread the intent. The diff defaults to the local Ollama provider, not a cloud LLM. The remaining nit is cosmetic: users without Ollama installed get a silent heuristic-only run with no indication that the LLM tier was attempted and skipped. A one-liner noting that the LLM tier was skipped would close that gap.
10 files changed. 2,563 insertions, 30 deletions. 48 new tests, including end-to-end coverage live-tested with Anthropic Haiku 4.5.

This PR overhauls the first-run experience of `mempalace init` end-to-end, ships a new corpus-origin detection module from scratch, wires it into entity classification and LLM refinement, adds a graceful-fallback path that means `init` never crashes on a missing LLM, and ships a meta-test that prevents internal-coordination jargon from leaking into source or tests.

The headline change is that `mempalace init` now understands what kind of folder you're pointing it at — AI conversations, regular writing, code, narrative — and adapts how it classifies entities accordingly. The same folder containing `Echo`, `Sparrow`, and `Cipher` (names you've assigned to AI agents) used to dump those into your "people" list alongside biological humans. Now they go into a separate `agent_personas` bucket, and your `people` list stays clean.

But the broader change is that `mempalace init` got upgraded across the board — smarter defaults, smarter degradation, smarter classification, smarter persistence, and a new way to refresh as your folder grows. Built and live-verified with Anthropic Haiku 4.5; runs unmodified on the local LLM runtimes mempalace already supports.

## What changes for users (in order, from `pip install` onwards)

**Install** — `pip install mempalace` is unchanged. The package itself didn't shift.

**First run — `mempalace init <folder>`:**

1. **`init` examines your folder before classifying anything.** A free regex heuristic decides in milliseconds: AI conversations, regular writing, narrative, or code? If an LLM is reachable, a second pass extracts the corpus author's name and any agent persona names from the dialogue. v3.3.3 had no such step — it dove straight into entity detection with no corpus context.
2. **LLM-assisted classification is now ON by default.** v3.3.3 made `--llm` opt-in. The LLM-assisted path is qualitatively better (extracts persona names, refines ambiguous classifications, gives the model corpus context) so it now runs by default. The provider abstraction is unchanged from v3.3.3 — three buckets are supported by `mempalace.llm_client`:
   - **Anthropic** (`--llm-provider anthropic` + `ANTHROPIC_API_KEY`) — the official Messages API. **This is the path live-verified end-to-end in this PR with Haiku 4.5.** Cost: ~$0.01 per `init`.
   - **Ollama** (`--llm-provider ollama` — the default) — local models via `http://localhost:11434`. Fully offline. Honors the "zero-API required" promise.
   - **OpenAI-compatible** (`--llm-provider openai-compat` + `--llm-endpoint`) — per the v3.3.3 `mempalace/llm_client.py` docstring, this covers "OpenRouter, LM Studio, llama.cpp server, vLLM, Groq, Fireworks, Together, and most self-hosted setups." We did not test each of those individually as part of this PR; the abstraction has been stable since v3.3.3. If you try this PR with a specific provider and hit a quirk, please file an issue or comment here.
3. **`init` never blocks on a missing LLM.** No Ollama running, no API key set? `init` prints a one-line message pointing at `--no-llm` and falls through to the heuristic-only path. New default behavior, new graceful fallback to support it. `--no-llm` is the new explicit opt-out.
4. **`init` shows you what it detected.** A one-line banner — `Detected: Claude (Anthropic) (user: Jordan, agents: Echo, Sparrow, Cipher)` or `Corpus origin: not AI-dialogue (confidence: 0.98)` — tells you at a glance whether mempalace understood your folder.
5. **Entity classification gets smarter across the board.** Even non-persona candidates benefit: the LLM has corpus context (this is AI-dialogue, this is the user's name, these are agent names) and uses it to disambiguate ambiguous candidates that aren't personas at all.
6. **Agent personas live in their own bucket.** Names you've assigned to AI agents (Echo, Sparrow, Cipher) go into a new `agent_personas` bucket instead of your `people` list. Your real-person entity list stays clean.
7. **Detection result persists to `<palace>/.mempalace/origin.json`** with a `schema_version: 1` envelope, so downstream tools can read it (see the sketch just below this list).
8. **Re-running `init` is now idempotent.** Bug fix — running `init` twice on the same folder used to give different classification results because the detection step was sampling its own `entities.json` output. Caught by integration testing during this PR.

**Later — when your folder grows:**

9. **`mempalace mine --redetect-origin`** is a new flag for refreshing the stored detection without redoing the whole `init`. Heuristic-only by design (the flag is meant to be cheap). If you want the full LLM-extracted detection refreshed (persona names, user name, etc.), run `mempalace init <yourfolder>` again — `init` is now idempotent (item 8), so re-running it on the same folder is safe.
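Here is a minimal sketch of what reading the persisted detection looks like for a downstream tool. The fields beyond `schema_version` (`likely_ai_dialogue`, `confidence`, `primary_platform`, `user_name`, `agent_persona_names`) match the ones discussed in this thread, but treat the exact shape as illustrative rather than a frozen schema.

```python
# Illustrative reader for <palace>/.mempalace/origin.json. Field names other
# than schema_version are taken from this PR discussion and may not match the
# final schema exactly; the palace path below is just an example.
import json
from pathlib import Path


def load_corpus_origin(palace: Path) -> dict | None:
    origin_path = palace / ".mempalace" / "origin.json"
    if not origin_path.exists():
        return None  # init has not run yet (or pre-dates this PR)
    data = json.loads(origin_path.read_text())
    if data.get("schema_version") != 1:
        raise ValueError(f"unsupported origin.json schema: {data.get('schema_version')}")
    return data


origin = load_corpus_origin(Path("~/notes-palace").expanduser())
if origin and origin.get("likely_ai_dialogue"):
    print(
        "AI-dialogue corpus:", origin.get("primary_platform"),
        "personas:", origin.get("agent_persona_names", []),
    )
```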
## Behind the changes

- **New module** `mempalace/corpus_origin.py` (422 lines) with two-tier detection: regex heuristic with co-occurrence rule (suppresses ambiguous terms like `Claude` / `Gemini` / `Haiku` when no unambiguous AI signal is present, so French novels, astrology forums, poetry corpora, llama-rancher journals don't false-positive), and LLM tier that extracts `user_name` and `agent_persona_names` from dialogue structure with belt-and-suspenders user-vs-agent disambiguation.
- **Entity-classification consumer wiring.** `entity_detector.detect_entities` and `project_scanner.discover_entities` accept an optional `corpus_origin` kwarg. When present and the corpus is identified as AI-dialogue, candidates whose name case-insensitively matches an `agent_persona_name` are routed into the `agent_personas` bucket instead of `people`. Per-entity `type` is rewritten to `"agent_persona"`. (A sketch of this routing appears at the end of this description.)
- **LLM-refine consumer wiring.** `llm_refine.refine_entities` accepts the same `corpus_origin` kwarg and prepends a `CORPUS CONTEXT` preamble to its system prompt giving the LLM the platform / user / persona context. Existing `TOPIC` / `PERSON` / `PROJECT` / `COMMON_WORD` / `AMBIGUOUS` labels are unchanged.
- **`init` overhaul.** Pass 0 (corpus-origin detection) inserted before existing Pass 1 (entity discovery). `--llm` flipped to default-on. `--no-llm` added. Graceful-fallback path replaces the previous hard-error on missing LLM. Provider precedence unchanged from the existing `llm_client` module.
- **`mine` flag.** `mempalace mine --redetect-origin` re-runs corpus-origin detection on the current corpus state and overwrites `<palace>/.mempalace/origin.json`.
- **`CLAUDE.md` design principle reworded** — "Local-first, zero external API by default." Local LLMs running on `localhost` (Ollama, LM Studio, llama.cpp, vLLM, unsloth studio) are part of the user's machine, not external APIs. External BYOK providers (Anthropic, OpenAI, Google) are supported but always opt-in, never default, never silent fallback.

## Cost story

- **Anthropic (verified path):** ~$0.01 per `init` via Haiku 4.5 with `ANTHROPIC_API_KEY`.
- **Ollama / local LLM runtime:** zero cost. Fully offline.
- **OpenAI-compatible service:** depends entirely on the service. The abstraction supports any service speaking the standard `/v1/chat/completions` API; specific quirks vary per provider. Try it and tell us how it goes.
- **No LLM at all:** graceful fallback to heuristic-only. Zero cost. `init` never blocks.

## Backwards compatibility

- All public function signatures gained the `corpus_origin` kwarg as optional (default `None`). Callers that don't pass it see the v3.3.3 return shape unchanged — no `agent_personas` key, no behavioral change.
- The `--llm` CLI flag is preserved as a deprecated alias of the default. Existing scripts that pass it continue to work.
- `corpus_origin=None` keeps `llm_refine.SYSTEM_PROMPT` byte-identical to v3.3.3.

## Test coverage

- **19 unit tests** in `tests/test_corpus_origin.py` covering both tiers, the co-occurrence rule, ambiguous-term suppression, word-boundary brand matching, and user/persona disambiguation.
- **29 integration tests** in `tests/test_corpus_origin_integration.py` covering end-to-end through `mempalace init`, persona reclassification, the `--redetect-origin` flag, the `--llm` default flip, graceful fallback paths, and re-init idempotency. Of those 29, five specifically cover the intersection with develop's other in-flight work (Pass 0 ↔ auto-mine ordering, topics + agent_personas bucket coexistence, entities.json shape, the `wing=` kwarg threading, llm_refine TOPIC label + corpus_origin preamble composition).
- **1354 total mempalace tests pass.** 2 pre-existing environmental failures (`test_mcp_stdio_protection` — chromadb optional dep) unrelated to this change; they fail on plain `develop` too.
- **Live-smoke-tested** with real Anthropic Haiku 4.5 on AI-dialogue and narrative fixtures.

## Hygiene guardrail

This PR also adds a meta-test (`test_no_internal_coordination_jargon_in_source_or_tests`) that walks the source tree and asserts no internal-coordination jargon (e.g. development-phase markers, internal review-section references) leaks into runtime code, comments, docstrings, or LLM prompts. RED if anything slips in. Allowlist for legitimate RFC/spec section citations in `sources/`, `backends/`, `knowledge_graph.py`, and `i18n/`.
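To make the consumer wiring concrete, below is a stripped-down sketch of the `agent_personas` routing. It is illustrative only: the helper name and the candidate/bucket shapes are invented for this example. What it mirrors from the PR is the behavior: case-insensitive persona match, routing into `agent_personas` instead of `people`, the `type` rewrite, and unchanged behavior when `corpus_origin` is `None`.

```python
# Sketch of the persona-bucket routing. bucket_candidate and the dict shapes
# are invented for illustration; only the routing rules mirror the PR
# (case-insensitive persona match, agent_personas bucket, type rewrite,
# v3.3.3 behavior preserved when corpus_origin is None). Simplified: assumes
# the candidate was already headed for the "people" bucket.
def bucket_candidate(candidate: dict, buckets: dict, corpus_origin: dict | None) -> None:
    if corpus_origin and corpus_origin.get("likely_ai_dialogue"):
        personas = {p.lower() for p in corpus_origin.get("agent_persona_names", [])}
        if candidate["name"].lower() in personas:
            candidate["type"] = "agent_persona"
            buckets.setdefault("agent_personas", []).append(candidate)
            return
    buckets.setdefault("people", []).append(candidate)


buckets: dict = {}
origin = {"likely_ai_dialogue": True, "agent_persona_names": ["Echo", "Sparrow", "Cipher"]}
for name in ["Echo", "Jordan"]:
    bucket_candidate({"name": name, "type": "person"}, buckets, origin)
print(buckets)  # Echo lands in agent_personas, Jordan stays in people
```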
Force-pushed from 4434839 to b99e545
@igorls thank you so much! Already fixed, and I'm pretty sure you've merged now. I will test out the option 3 I brought up and add it as a rebase, or an add-on... Anyway, I haven't had time to practice my GitHub skills or terminology, as this week has been spent doing the "feature init" overhaul for the MemPalace update. I'll have time to do some GitHub learning when the kids are in school on Monday lol! Thank you again for your tireless work over the last few weeks. You've been a rockstar and I'm so grateful to have you on the team!🙏
@arnoldwender yes, I know lol! I've been deep into figuring out the concept and model structure all week for the update, so I missed the last few commits, but those were already fixed via @igorls, who is part of our tiny MemPalace squad, so the issues were already being handled as I was answering you earlier! About your users-without-Ollama fix: it's already been flagged internally, and I've got a rebase that I will be adding in a few minutes to remain "agent agnostic" on the MemPalace platform! That said, thank you again for being so vigilant; that's why this community is incredible. I'm so happy that MemPalace resonates with so many people and that they take the time to point out these kinds of issues. It's so thoughtful. best, m
@arnoldwender So about the cosmetic nit: we actually already shipped that exact one-liner. See mempalace/cli.py:252-262 in #1211 (lines 254-256 for the check_available()=False path, 259-261 for the LLMError exception path). Happy to discuss the wording if you think the message could be clearer in some other way!
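For anyone reading along without the diff open, the shape of those two branches is roughly the following. This is a paraphrase, not the literal cli.py lines; the provider stand-in and the exact message wording are illustrative, while `check_available()` and `LLMError` are the hooks referenced above.

```python
# Paraphrase of the two fallback branches referenced above; NOT the literal
# cli.py code. Provider and the message wording are stand-ins; only the two
# code paths (check_available() returns False, LLMError raised) mirror what shipped.
class LLMError(Exception):
    pass


class Provider:
    def check_available(self) -> bool:
        return False  # simulate: no Ollama running, no API key set

    def detect_origin(self, sample: str) -> dict:
        raise LLMError("provider unreachable")


def run_llm_tier(provider: Provider, sample: str, heuristic_result: dict) -> dict:
    try:
        if not provider.check_available():
            print("LLM tier unavailable; continuing with heuristic-only detection.")
            return heuristic_result
        return provider.detect_origin(sample)
    except LLMError as exc:
        print(f"LLM detection failed ({exc}); continuing with heuristic-only detection.")
        return heuristic_result


print(run_llm_tier(Provider(), "sample text", {"likely_ai_dialogue": False}))
```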
You're right, my mistake — I missed those two branches on my pass through cli.py. Thanks for the patience walking me through it, and congrats on the merge — the persona-bucket work is going to land well with users.
…f replacing

2 files changed, 260 insertions, 7 deletions. 4 new tests (all RED-first).

Per @igorls's review of PR MemPalace#1211 (MemPalace#1211 (comment)): the corpus-origin Pass 0 currently lets a Tier 2 LLM result REPLACE the heuristic result wholesale. With ``--llm`` default-on (since MemPalace#1211) and a small local model like Ollama gemma4:e4b, the LLM can return a wrong ``likely_ai_dialogue=False, confidence=0.90`` that overrides a confident heuristic ``True``. Tier 2's persona/user/platform extraction is the whole reason to run it; the YES/NO call should stay with the heuristic.

This PR changes ``_run_pass_zero`` in ``mempalace/cli.py`` to merge fields instead of replacing:

- ``likely_ai_dialogue`` → KEEP heuristic's (don't let weak LLM flip)
- ``confidence`` → KEEP heuristic's (paired with the bool above)
- ``primary_platform`` → TAKE LLM's when LLM provides one
- ``user_name`` → TAKE LLM's when LLM provides one
- ``agent_persona_names`` → TAKE LLM's when LLM provides any
- ``evidence`` → COMBINE both signal trails

This preserves the persona-extraction value of Tier 2 (the whole point of running it) while preventing a weak local model from flipping a confident heuristic.

TDD: 4 tests added in tests/test_corpus_origin_integration.py covering the four state combinations:

1. test_merge_tier_fields_heuristic_yes_llm_no_keeps_heuristic_bool — the exact failure mode Igor caught. Heuristic confidently flags AI-dialogue; mocked LLM contradicts. Asserts the merged result keeps the heuristic's True AND merges the LLM's persona/user/platform fields. This test was the RED that drove the implementation.
2. test_merge_tier_fields_heuristic_no_no_personas_leak — both tiers agree NOT AI-dialogue, both report empty personas. Pins that the merge doesn't accidentally introduce personas.
3. test_merge_tier_fields_heuristic_yes_llm_yes_combines_evidence — both tiers agree AI-dialogue, LLM extracts personas. Pins that evidence from BOTH tiers ends up in the merged audit trail and persona/user/platform come from the LLM.
4. test_merge_tier_fields_no_llm_provider_returns_heuristic_only — backwards compat: with no LLM provider (``--no-llm`` path), the merge logic doesn't fire and behavior is identical to v3.3.4.

Tests: 1367 pass on the full mempalace suite. 2 pre-existing environmental failures unrelated to this change (chromadb optional dep). Ruff check + format both clean.
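A minimal sketch of the merge rule this commit describes. The helper name (`merge_tier_results`) and the plain-dict result shape are illustrative, not the actual `_run_pass_zero` code; only the per-field keep/take/combine rules come from the commit message.

```python
# Sketch of the field merge described in this commit; merge_tier_results and
# the plain-dict shape are illustrative, not the actual _run_pass_zero code.
# Only the per-field keep/take/combine rules mirror the commit message.
def merge_tier_results(heuristic: dict, llm: dict | None) -> dict:
    if llm is None:                       # --no-llm path: heuristic-only, unchanged
        return dict(heuristic)
    merged = dict(heuristic)              # KEEP likely_ai_dialogue and confidence
    for field in ("primary_platform", "user_name", "agent_persona_names"):
        if llm.get(field):                # TAKE the LLM's value only when it provides one
            merged[field] = llm[field]
    merged["evidence"] = list(heuristic.get("evidence", [])) + list(llm.get("evidence", []))
    return merged


heuristic = {"likely_ai_dialogue": True, "confidence": 0.95, "evidence": ["role markers"]}
llm = {"likely_ai_dialogue": False, "confidence": 0.90,
       "agent_persona_names": ["Echo"], "evidence": ["model self-reference"]}
print(merge_tier_results(heuristic, llm))
# keeps the heuristic's True / 0.95, takes the LLM's personas, combines evidence
```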
Brings in upstream's corpus-origin + privacy-warning track (PRs MemPalace#1211 MemPalace#1221 MemPalace#1223 MemPalace#1224 MemPalace#1225) plus the canonical merged versions of our four PRs that landed today (21:22-21:41 UTC):

MemPalace#1173 quarantine_stale_hnsw on make_client + cold-start gate + integrity sniff-test (PROTO/STOP byte check, no deserialization)
MemPalace#1177 .blob_seq_ids_migrated marker guard, closes MemPalace#1090
MemPalace#1198 _tokenize None-document guard in BM25 reranker
MemPalace#1201 palace_graph.build_graph skips None metadata

Conflict resolution:

* mempalace/backends/chroma.py — took upstream as base (it has the igorls-review pickle-protocol docstring, thread-safety paragraph, and Path(marker).touch() style nit), then re-applied MemPalace#1094's _coerce_none_metas in query()/get() since MemPalace#1094 is still open and not yet in develop.
* mempalace/mcp_server.py — took upstream's clean form. Dropped the fork-only `palace_path=` kwarg from four ChromaCollection() call sites: the kwarg was load-bearing for MemPalace#1171's per-collection write lock, but MemPalace#1171 closed in favor of MemPalace#976's mine_global_lock + daemon-strict, so the kwarg has no remaining consumer. ChromaCollection.__init__ in upstream/develop is back to (self, collection); calling it with palace_path= raised TypeError → silently swallowed by the broad except in _get_collection() → returned None → tool_status() returned _no_palace(). 41 mcp_server tests went from failing-with-KeyError to passing.
* mempalace/cli.py — dropped fork-only `workers=args.workers` from the cmd_mine -> miner.mine() call. Pre-existing fork-side bug: the `--workers` argparse arg landed in 5cd14bd but miner.mine() never accepted a workers param, so production `mempalace mine` TypeError'd on every invocation. Removed the broken plumbing; tests/test_cli.py updated to match.
* CHANGELOG.md — took upstream verbatim. Fork-specific changelog lives in FORK_CHANGELOG.md (canonical: docs/fork-changes.yaml).
* CLAUDE.md — kept ours. Fork's CLAUDE.md is operational; upstream's added a "Design Principles / Contributing" charter, which lives in README.md on the fork.
* tests/test_backends.py — took upstream's ruff-formatted line widths.

docs/fork-changes.yaml flips the two MemPalace#1173 entries (hnsw-integrity-gate, hnsw-cold-start-gate) and the MemPalace#1201 entry (palace-graph-none-guard) from OPEN to MERGED 2026-04-26. MemPalace#1173 MemPalace#1177 MemPalace#1198 MemPalace#1201 added to the merged_upstream archive at the bottom. FORK_CHANGELOG.md regenerated.

scripts/check-docs.sh: 4/4 clean. Test suite: 1460/1460.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
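For readers following the mcp_server.py note: the failure chain (unexpected kwarg, TypeError swallowed by a broad except, None collection, `_no_palace()` status) is easy to reproduce in miniature. The class and function names below are stand-ins, not the fork's actual code.

```python
# Miniature reproduction of the failure chain described above; Collection,
# get_collection, and status are stand-ins, not the fork's actual classes.
class Collection:
    def __init__(self, collection):                # upstream signature: no palace_path kwarg
        self.collection = collection


def get_collection():
    try:
        return Collection("entities", palace_path="/tmp/palace")  # fork-only kwarg
    except Exception:                              # broad except swallows the TypeError
        return None


def status():
    return "no_palace" if get_collection() is None else "ok"


print(status())  # -> "no_palace", even though the palace exists
```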