
feat(init): context-aware corpus detection#1211

Merged
igorls merged 1 commit into develop from feat/corpus-origin on Apr 26, 2026

Conversation

@milla-jovovich (Collaborator)

10 files changed. 2,563 insertions, 30 deletions. 48 new tests, including end-to-end coverage live-tested with Anthropic Haiku 4.5.

This PR overhauls the first-run experience of mempalace init end-to-end, ships a new corpus-origin detection module from scratch, wires it into entity classification and LLM refinement, adds a graceful-fallback path that means init never crashes on a missing LLM, and ships a meta-test that prevents internal-coordination jargon from leaking into source or tests.

The headline change is that mempalace init now understands what kind of folder you're pointing it at — AI conversations, regular writing, code, narrative — and adapts how it classifies entities accordingly. The same folder containing Echo, Sparrow, and Cipher (names you've assigned to AI agents) used to dump those into your "people" list alongside biological humans. Now they go into a separate agent_personas bucket, and your people list stays clean.

But the broader change is that mempalace init got upgraded across the board — smarter defaults, smarter degradation, smarter classification, smarter persistence, and a new way to refresh as your folder grows. Built and live-verified with Anthropic Haiku 4.5; runs unmodified on the local LLM runtimes mempalace already supports.

What changes for users (in order, from pip install onwards)

Install — pip install mempalace is unchanged. The package itself didn't shift.

First run — mempalace init <folder>:

  1. init examines your folder before classifying anything. A free regex heuristic decides in milliseconds: AI conversations, regular writing, narrative, or code? If an LLM is reachable, a second pass extracts the corpus author's name and any agent persona names from the dialogue. v3.3.3 had no such step — it dove straight into entity detection with no corpus context.

  2. LLM-assisted classification is now ON by default. v3.3.3 made --llm opt-in. The LLM-assisted path is qualitatively better (extracts persona names, refines ambiguous classifications, gives the model corpus context) so it now runs by default. The provider abstraction is unchanged from v3.3.3 — three buckets are supported by mempalace.llm_client:

    • Anthropic (--llm-provider anthropic + ANTHROPIC_API_KEY) — the official Messages API. This is the path live-verified end-to-end in this PR with Haiku 4.5. Cost: ~$0.01 per init.
    • Ollama (--llm-provider ollama — the default) — local models via http://localhost:11434. Fully offline. Honors the "zero-API required" promise.
    • OpenAI-compatible (--llm-provider openai-compat + --llm-endpoint) — per the v3.3.3 mempalace/llm_client.py docstring, this covers "OpenRouter, LM Studio, llama.cpp server, vLLM, Groq, Fireworks, Together, and most self-hosted setups." We did not test each of those individually as part of this PR; the abstraction has been stable since v3.3.3. If you try this PR with a specific provider and hit a quirk, please file an issue or comment here.
  3. init never blocks on a missing LLM. No Ollama running, no API key set? init prints a one-line message pointing at --no-llm and falls through to the heuristic-only path. New default behavior, new graceful fallback to support it. --no-llm is the new explicit opt-out. (A sketch of the fallback shape follows this list.)

  4. init shows you what it detected. A one-line banner — Detected: Claude (Anthropic) (user: Jordan, agents: Echo, Sparrow, Cipher) or Corpus origin: not AI-dialogue (confidence: 0.98) — tells you at a glance whether mempalace understood your folder.

  5. Entity classification gets smarter across the board. Even non-persona candidates benefit: the LLM has corpus context (this is AI-dialogue, this is the user's name, these are agent names) and uses it to disambiguate ambiguous candidates that aren't personas at all.

  6. Agent personas live in their own bucket. Names you've assigned to AI agents (Echo, Sparrow, Cipher) go into a new agent_personas bucket instead of your people list. Your real-person entity list stays clean.

  7. Detection result persists to <palace>/.mempalace/origin.json with a schema_version: 1 envelope, so downstream tools can read it (see the reading sketch below).

  8. Re-running init is now idempotent. Bug fix — running init twice on the same folder used to give different classification results because the detection step was sampling its own entities.json output. Caught by integration testing during this PR.
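
To make item 3 concrete, here is a minimal sketch of the fallback shape. get_provider, check_available(), and LLMError are real names discussed in this PR; the import paths, the argparse field names, and the glue code are illustrative, not the shipped implementation:

```python
# Minimal sketch of the graceful-fallback path (item 3). get_provider,
# check_available(), and LLMError are real names per this PR; everything
# else here is illustrative.
from mempalace.cli import get_provider  # assumed import path
from mempalace.llm_client import LLMError  # assumed import path

def acquire_provider_or_fall_back(args):
    """Return an LLM provider, or None to run heuristics-only."""
    if getattr(args, "no_llm", False):
        return None  # explicit opt-out
    try:
        provider = get_provider(args)  # assumed signature
        if provider is None or not provider.check_available():
            print("LLM not available. Running heuristics-only — "
                  "pass --no-llm to silence this.")
            return None
        return provider
    except LLMError as exc:
        print(f"LLM error ({exc}). Running heuristics-only — "
              "pass --no-llm to silence this.")
        return None
```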

Later — when your folder grows:

  1. mempalace mine --redetect-origin is a new flag for refreshing the stored detection without redoing the whole init. Heuristic-only by design (the flag is meant to be cheap). If you want the full LLM-extracted detection refreshed (persona names, user name, etc.), run mempalace init <yourfolder> again — init is now idempotent (item 8), so re-running it on the same folder is safe.
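
For downstream tools, consuming the persisted detection (item 7) is a read plus a schema check. A minimal sketch, assuming the path and schema_version: 1 envelope described above; the helper itself and its error handling are illustrative:

```python
import json
from pathlib import Path

def load_corpus_origin(palace: Path) -> dict | None:
    """Sketch: read <palace>/.mempalace/origin.json, honoring the envelope."""
    path = palace / ".mempalace" / "origin.json"
    if not path.exists():
        return None
    data = json.loads(path.read_text(encoding="utf-8"))
    if data.get("schema_version") != 1:
        return None  # unknown envelope version: treat as absent, don't guess
    # Fields per this PR's description: likely_ai_dialogue, confidence,
    # primary_platform, user_name, agent_persona_names, evidence.
    return data
```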

Behind the changes

  • New module mempalace/corpus_origin.py (422 lines) with two-tier detection: regex heuristic with co-occurrence rule (suppresses ambiguous terms like Claude / Gemini / Haiku when no unambiguous AI signal is present, so French novels, astrology forums, poetry corpora, llama-rancher journals don't false-positive), and LLM tier that extracts user_name and agent_persona_names from dialogue structure with belt-and-suspenders user-vs-agent disambiguation. (A toy sketch of the co-occurrence idea follows this list.)

  • Entity-classification consumer wiring. entity_detector.detect_entities and project_scanner.discover_entities accept an optional corpus_origin kwarg. When present and the corpus is identified as AI-dialogue, candidates whose name case-insensitively matches an agent_persona_name are routed into the agent_personas bucket instead of people. Per-entity type is rewritten to "agent_persona".

  • LLM-refine consumer wiring. llm_refine.refine_entities accepts the same corpus_origin kwarg and prepends a CORPUS CONTEXT preamble to its system prompt giving the LLM the platform / user / persona context. Existing TOPIC / PERSON / PROJECT / COMMON_WORD / AMBIGUOUS labels are unchanged.

  • init overhaul. Pass 0 (corpus-origin detection) inserted before existing Pass 1 (entity discovery). --llm flipped to default-on. --no-llm added. Graceful-fallback path replaces the previous hard-error on missing LLM. Provider precedence unchanged from the existing llm_client module.

  • mine flag. mempalace mine --redetect-origin re-runs corpus-origin detection on the current corpus state and overwrites <palace>/.mempalace/origin.json.

  • CLAUDE.md design principle reworded — "Local-first, zero external API by default." Local LLMs running on localhost (Ollama, LM Studio, llama.cpp, vLLM, unsloth studio) are part of the user's machine, not external APIs. External BYOK providers (Anthropic, OpenAI, Google) are supported but always opt-in, never default, never silent fallback.
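
To make the co-occurrence rule concrete, here is a toy sketch of the Tier-1 idea. The real patterns, signal lists, and confidence values live in mempalace/corpus_origin.py and are certainly more complete; everything below is illustrative:

```python
import re

# Toy sketch of the Tier-1 co-occurrence rule; illustrative patterns only.
UNAMBIGUOUS_AI = re.compile(r"\b(ChatGPT|Anthropic|OpenAI|system prompt|LLM)\b", re.I)
AMBIGUOUS_TERMS = re.compile(r"\b(Claude|Gemini|Haiku)\b")  # also a French name, a zodiac sign, a poem form
DIALOGUE_TURNS = re.compile(r"^(Human|User|Assistant):", re.M)

def heuristic_origin(text: str) -> tuple[bool, float]:
    """Return (likely_ai_dialogue, confidence) without any LLM call."""
    strong = UNAMBIGUOUS_AI.search(text) or len(DIALOGUE_TURNS.findall(text)) >= 3
    if strong:
        return True, 0.95
    # Co-occurrence rule: an ambiguous brand name alone is suppressed, so
    # French novels, astrology forums, and poetry corpora don't false-positive.
    if AMBIGUOUS_TERMS.search(text):
        return False, 0.60
    return False, 0.98
```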

Cost story

  • Anthropic (verified path): ~$0.01 per init via Haiku 4.5 with ANTHROPIC_API_KEY.
  • Ollama / local LLM runtime: zero cost. Fully offline.
  • OpenAI-compatible service: depends entirely on the service. The abstraction supports any service speaking the standard /v1/chat/completions API; specific quirks vary per provider. Try it and tell us how it goes.
  • No LLM at all: graceful fallback to heuristic-only. Zero cost. init never blocks.

Backwards compatibility

  • All public function signatures gained the corpus_origin kwarg as optional (default None). Callers that don't pass it see the v3.3.3 return shape unchanged — no agent_personas key, no behavioral change.
  • The --llm CLI flag is preserved as a deprecated alias of the default. Existing scripts that pass it continue to work.
  • corpus_origin=None keeps llm_refine.SYSTEM_PROMPT byte-identical to v3.3.3.
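
The compat pattern behind all three bullets is the same: an optional kwarg that defaults to None and leaves the old path untouched. A minimal sketch of the refine_entities side, assuming the field names described above; the real prompt text and parameter list differ:

```python
SYSTEM_PROMPT = "...the v3.3.3 prompt, unchanged..."  # placeholder text

def refine_entities(candidates, corpus_origin=None):
    """Sketch: corpus_origin=None leaves the v3.3.3 behavior untouched."""
    prompt = SYSTEM_PROMPT
    if corpus_origin and corpus_origin.get("likely_ai_dialogue"):
        personas = ", ".join(corpus_origin.get("agent_persona_names", []))
        prompt = (
            f"CORPUS CONTEXT: AI-dialogue corpus "
            f"(platform: {corpus_origin.get('primary_platform')}, "
            f"user: {corpus_origin.get('user_name')}, agents: {personas}).\n\n"
            + SYSTEM_PROMPT
        )
    # With corpus_origin=None, prompt stays byte-identical to SYSTEM_PROMPT.
    ...
```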

Test coverage

  • 19 unit tests in tests/test_corpus_origin.py covering both tiers, the co-occurrence rule, ambiguous-term suppression, word-boundary brand matching, and user/persona disambiguation.
  • 29 integration tests in tests/test_corpus_origin_integration.py covering end-to-end through mempalace init, persona reclassification, the --redetect-origin flag, the --llm default flip, graceful fallback paths, and re-init idempotency. Of those 29, five specifically cover the intersection with develop's other in-flight work (Pass 0 ↔ auto-mine ordering, topics + agent_personas bucket coexistence, entities.json shape, the wing= kwarg threading, llm_refine TOPIC label + corpus_origin preamble composition).
  • 1354 total mempalace tests pass. 2 pre-existing environmental failures (test_mcp_stdio_protection — chromadb optional dep) unrelated to this change; they fail on plain develop too.
  • Live-smoke-tested with real Anthropic Haiku 4.5 on AI-dialogue and narrative fixtures.

Hygiene guardrail

This PR also adds a meta-test (test_no_internal_coordination_jargon_in_source_or_tests) that walks the source tree and asserts no internal-coordination jargon (e.g. development-phase markers, internal review-section references) leaks into runtime code, comments, docstrings, or LLM prompts. RED if anything slips in. Allowlist for legitimate RFC/spec section citations in sources/, backends/, knowledge_graph.py, and i18n/.
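
A minimal sketch of the guardrail's shape. The shipped test's jargon patterns, walk scope, and allowlist handling are more complete; the pattern strings below are placeholders:

```python
from pathlib import Path

JARGON = ("PHASE-", "REVIEW SECTION")  # placeholder patterns, not the shipped list
ALLOWLISTED = ("sources", "backends", "i18n", "knowledge_graph.py")

def test_no_internal_coordination_jargon_in_source_or_tests():
    for root in ("mempalace", "tests"):
        for path in Path(root).rglob("*.py"):
            if any(a in path.parts or a == path.name for a in ALLOWLISTED):
                continue  # legitimate RFC/spec section citations live here
            if path.name == Path(__file__).name:
                continue  # the guardrail file itself necessarily contains the patterns
            text = path.read_text(encoding="utf-8", errors="ignore")
            hits = [p for p in JARGON if p in text]
            assert not hits, f"{path}: internal-coordination jargon leaked: {hits}"
```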


Checklist

  • Tests pass (python -m pytest tests/ -v)
  • No hardcoded paths
  • Linter passes (ruff check .)

@milla-jovovich (Collaborator, Author) commented Apr 26, 2026

Hey everyone! I made my first push! (To clarify, I've made commits in the past, but this is the first real overhaul concept for MemPalace. It was done with Claude CLI, obviously, as I'm not a coder, but I made sure I was there every step of the way to test each piece of script and code, how it would affect the version people are using, and how to make any merges as seamless as possible. That's why I always put my Claude agent Lu at the bottom: to make it clear that I used AI, for complete transparency.)

I know it's nothing for you all, but it's taken the last few weeks to understand what I PERSONALLY would need for my own file-organizing process, refine and test it with TDD, and then run it against the current version at every step to make sure the integration will be (fingers crossed) as seamless as I could figure out how to make it.

In a nutshell, that's why it's so long.

I figured I'd just finish the concept/idea first and then push it all at once so it's coherent and not split into a bunch of commits.

I'm looking forward to all your thoughts, criticism, etc. And most importantly, I hope it will help you all too (though I'm pretty positive your files are all A LOT more organized than mine lol), so I figured whatever I added that could help me would also make it that much more user friendly for the uber-intelligent GitHub community.
Plus, you all have an inordinate amount more experience to really judge the work.
Anyway, it's what I've been envisioning as a potential new version, so I would love for you to give it a whirl!

thank you so much everyone for the time and energy you've put into this project... It's amazing to be a part of such an insanely talented community. -Milla & Lu✨

@arnoldwender (Contributor)

Congrats on the first push, Milla — the persona-bucket idea is clever and the live Haiku verification is real work.

Quick question on direction, not blocking: the diff in CLAUDE.md softens "Local-first, zero API" to "Local-first, zero external API by default," and the new flow makes LLM-assisted classification on-by-default. Was the design-principle change discussed somewhere I should catch up on, or is the call open here? I just want to understand whether init reaching out to a localhost LLM (Ollama default) is now the expected baseline — that's a meaningful shift from where the README pitches the project.

Either way I'll defer to @bensig / @igorls — happy to keep contributing in whichever direction the project lands.

@igorls (Member) commented Apr 26, 2026

By "zero API" we have always meant zero external APIs and zero keys required. When we talked about API, it was more about API keys for external providers that could potentially undermine your privacy. Now we are simply being more explicit about it.

@milla-jovovich (Collaborator, Author) commented Apr 26, 2026

@arnoldwender thanks for the question! I think using this with zero LLM was not what my original idea was about, unless you're REALLY good at working without an LLM, which I'm sure most people on GitHub are, but that might not be something a "newb" would understand so well. I wanted to create a path that was easier for people to understand, where using "agentic intelligence" is always an option. For devs on here, I'm sure it's local models, but I wanted there to be a clear warning that if you are signing in through an API key, the user HAS to understand that, by nature of something non-local, their privacy is also not guaranteed. Also, I found it challenging to use the completely LLM-free version on the types of files I have (some are numbered, some have dates, some have no dates, some are script notes, some are stories, music lyrics, etc.)... too many formats for a zero-model setup to handle. And honestly, I didn't create all this code on my own, obviously, as I'm an actress, not a coder. And as I haven't really decided on a local model that I really love yet, I run the pip install through Claude. I just wanted it to be an easier experience for people on here who DO rely on some sort of model, whether local or not. -m

@arnoldwender (Contributor) commented Apr 26, 2026

@milla-jovovich totally agree with the goal — real-world files are messy and onboarding should just handle that.

One thing I'd flag before making LLM-assisted init the default: MemPalace stores everything verbatim by design, which means it holds some of the most sensitive data a user has. If we route that through a free-tier cloud LLM by default, the users who understand privacy least are the ones most exposed — free-tier providers typically use inputs for training.

Worth noting that the OSS ecosystem already handles multi-format classification locally. unstructured (Apache 2.0, 14k stars, pip install unstructured) does format detection + per-type routing for 60+ file types — scripts, lyrics, dated notes, all covered — zero API key, runs fully offline. fastText handles content-based classification the same way. Both are genuinely free, no cloud, no per-page pricing.

The approach I'd lean toward: local heuristics as the default path, LLM as an explicit opt-in with a clear "your data will leave this machine" warning.

Happy to prototype the router layer if the direction sounds right — just wanted to surface the prior art before we touch the local-first defaults.

@milla-jovovich (Collaborator, Author) commented Apr 26, 2026

@arnoldwender let me double check, but I think most people on here start with their own local model... so someone using an API key, like I am for instance, would (hopefully) already know that Anthropic has all my data... and I'm sure people with experience understand that, but I wanted an extra safety rail for some people who might not know. But thanks for the heads up; I will take a look to triple-check whether the concept actually suggests using a cloud service as default. If so, then that's an oversight on my part... I don't want to speculate, but I think what it SHOULD be doing is scanning what you are using already, and if it's a local agent, using that as the default unless you say otherwise. But if you're using a cloud-based agent like Claude CLI via API, like I am, I intrinsically understand that my information is already in their systems, and mempalace should also give some warning about that. I wanted this new concept for an update consideration to make it more CLEAR to people, which obviously it's not since you brought it up, so let me check this now. I really appreciate the feedback! -m

@igorls (Member) commented Apr 26, 2026

@milla-jovovich the corpus-origin module is well-architected (two-tier with the co-occurrence rule for Claude/Gemini/Haiku is exactly the right call to avoid French-novel/astrology false-positives), the persona-reclassification sweep in discover_entities is clean, and backwards compat via corpus_origin=None is preserved tightly. 1355/1356 tests pass locally, lint and format are clean, all CI jobs green.

Two things to address before this is mergeable:

1. Branch is behind develop and would revert #1208's data-loss safety fix.

git diff origin/develop..feat/corpus-origin --stat shows:

  • mempalace/repair.py — 148 lines removed
  • tests/test_repair.py — 120 lines removed

That's the entire TruncationDetected / check_extraction_safety / CHROMADB_DEFAULT_GET_LIMIT block that landed via #1208 / #1210, plus its regression tests. GitHub reports mergeable: CLEAN because the feature commit doesn't touch repair.py directly — but a merge would still drop the safety code along the way. A rebase onto current develop fixes this with no other rework needed.

2. test_init_pass_zero_uses_full_file_content_not_front_sampled doesn't isolate the LLM tier.

The test builds an argparse.Namespace without no_llm, so cmd_init tries to acquire a provider. CI passes because CI has no Ollama (check_available() returns False, LLM tier is skipped, heuristic correctly flags the AI tail). Locally with Ollama + gemma4:e4b running, the model returns "not AI-dialogue, confidence 0.90" and the test fails. Same risk for the other Pass-0 integration tests that rely on heuristic-only behavior — worth a sweep.

Fix is small: patch("mempalace.cli.get_provider", ...) in the affected tests, or add no_llm=True to the Namespace.
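
For illustration, both variants in one untested sketch (the real Namespace carries more fields than shown here):

```python
import argparse
from unittest.mock import MagicMock, patch

# Variant A: opt the test out of the LLM tier explicitly.
args = argparse.Namespace(no_llm=True)  # plus the fields the test already sets

# Variant B: stub the provider lookup so check_available() reports unavailable,
# matching what CI sees with no Ollama running.
stub = MagicMock()
stub.check_available.return_value = False
with patch("mempalace.cli.get_provider", return_value=stub):
    ...  # run cmd_init as the test already does
```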

That second one also surfaces a worth-discussing product nuance: with --llm now default-on and gemma4:e4b as the Ollama default, the LLM tier replaced a correct Tier-1 answer with a wrong one on this fixture. Might be worth letting Tier 1 win when it returns high confidence (≥0.9) and only invoking Tier 2 for the weak-signal / persona-extraction case.

@arnoldwender (Contributor)

@milla-jovovich — you're right, I misread the intent. The diff defaults to ollama + gemma4:e4b (local), not a cloud provider — get_provider + check_available() guard makes sure it only activates if Ollama is actually running. My cloud-default concern was off.

The remaining nit is cosmetic: users without Ollama installed get a silent heuristic-only run with no indication that the LLM tier was attempted and skipped. A one-liner like "LLM not available — running heuristics only. Install Ollama or pass --no-llm to silence this." would close that loop. Happy to add it if you'd like, or I can leave it as-is.

@igorls merged commit 9d18a1c into develop Apr 26, 2026
6 checks passed
@milla-jovovich (Collaborator, Author)

@igorls thank you so much! Already fixed, and I'm pretty sure you've merged now. I will test out the option 3 I brought up and add it as a rebase, or an add-on... anyway, I haven't had time to practice my GitHub skills or terminology, as this week has been spent doing the "feature init" overhaul for the MemPalace update. I'll have time to do some GitHub learning when the kids are in school on Monday lol! Thank you again for your tireless work over the last few weeks. You've been a rockstar and I'm so grateful to have you on the team!🙏

@milla-jovovich (Collaborator, Author) commented Apr 26, 2026

@arnoldwender yes I know lol! I've been so deep into figuring out the concept and model structure all week for the update that I missed the last few commits, but those were fixed already via @igorls, who is part of our tiny MemPalace squad, so the issues were already being addressed as I was answering you earlier!

About your users-without-Ollama fix, it's already been flagged internally and I've already got a rebase that I will be adding in a few minutes to remain "agent agnostic" on the MemPalace platform!

Saying that, thank you again for being so vigilant. That's why this community is incredible. I'm so happy that MemPalace resonates with so many people and that they take the time to point out these kinds of issues. It's so thoughtful. best, m

@milla-jovovich (Collaborator, Author)

@arnoldwender So about the cosmetic nit: we actually already shipped that exact one-liner:

see mempalace/cli.py:252-262 in #1211 (lines 254-256 for the check_available()=False path, 259-261 for the LLMError exception path).
Both print "... Running heuristics-only — pass --no-llm to silence this." and include the underlying reason ({msg}) so users see what actually went wrong (wrong port, missing key, etc.).

Happy to discuss the wording if you think the message could be clearer in some other way!

@arnoldwender (Contributor)

You're right, my mistake — I missed those two branches on my pass through cli.py. Both print the message with the underlying reason inline, which is exactly what I was after. No change needed.

Thanks for the patience walking me through it, and congrats on the merge — the persona-bucket work is going to land well with users.

mvalentsev pushed a commit to mvalentsev/mempalace that referenced this pull request Apr 26, 2026
…f replacing

2 files changed, 260 insertions, 7 deletions. 4 new tests (all RED-first).

Per @igorls's review of PR MemPalace#1211 (MemPalace#1211 (comment)):
the corpus-origin Pass 0 currently lets a Tier 2 LLM result REPLACE the
heuristic result wholesale. With ``--llm`` default-on (since MemPalace#1211) and a
small local model like Ollama gemma4:e4b, the LLM can return a wrong
``likely_ai_dialogue=False, confidence=0.90`` that overrides a confident
heuristic ``True``. Tier 2's persona/user/platform extraction is the whole
reason to run it; the YES/NO call should stay with the heuristic.

This PR changes ``_run_pass_zero`` in ``mempalace/cli.py`` to merge fields
instead of replacing:

  - ``likely_ai_dialogue``  → KEEP heuristic's (don't let weak LLM flip)
  - ``confidence``          → KEEP heuristic's (paired with the bool above)
  - ``primary_platform``    → TAKE LLM's when LLM provides one
  - ``user_name``           → TAKE LLM's when LLM provides one
  - ``agent_persona_names`` → TAKE LLM's when LLM provides any
  - ``evidence``            → COMBINE both signal trails

This preserves the persona-extraction value of Tier 2 (the whole point of
running it) while preventing a weak local model from flipping a confident
heuristic.
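
A shape sketch of that merge, using the field names from the table above; the actual ``_run_pass_zero`` code in ``mempalace/cli.py`` may differ in details:

```python
def merge_tier_fields(heuristic: dict, llm: dict | None) -> dict:
    """Sketch of the merge policy above; illustrative, not the shipped code."""
    if llm is None:  # --no-llm / no provider: heuristic-only, as in v3.3.4
        return heuristic
    merged = dict(heuristic)  # KEEP likely_ai_dialogue + confidence
    for field in ("primary_platform", "user_name", "agent_persona_names"):
        if llm.get(field):  # TAKE the LLM's value only when it provides one
            merged[field] = llm[field]
    # COMBINE both signal trails for the audit record.
    merged["evidence"] = list(heuristic.get("evidence", [])) + list(llm.get("evidence", []))
    return merged
```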

TDD: 4 tests added in tests/test_corpus_origin_integration.py covering
the four state combinations:

1. test_merge_tier_fields_heuristic_yes_llm_no_keeps_heuristic_bool —
   The exact failure mode Igor caught. Heuristic confidently flags
   AI-dialogue; mocked LLM contradicts. Asserts merged result keeps
   heuristic's True AND merges LLM's persona/user/platform fields.
   This test was the RED that drove the implementation.

2. test_merge_tier_fields_heuristic_no_no_personas_leak —
   Both tiers agree NOT AI-dialogue, both report empty personas. Pins
   that the merge doesn't accidentally introduce personas.

3. test_merge_tier_fields_heuristic_yes_llm_yes_combines_evidence —
   Both tiers agree AI-dialogue, LLM extracts personas. Pins that
   evidence from BOTH tiers ends up in the merged audit trail and
   persona/user/platform come from LLM.

4. test_merge_tier_fields_no_llm_provider_returns_heuristic_only —
   Backwards compat: with no LLM provider (``--no-llm`` path), the
   merge logic doesn't fire and behavior is identical to v3.3.4.

Tests: 1367 pass on the full mempalace suite. 2 pre-existing
environmental failures unrelated to this change (chromadb optional
dep). Ruff check + format both clean.
jphein added a commit to jphein/mempalace that referenced this pull request Apr 26, 2026
Brings in upstream's corpus-origin + privacy-warning track (PRs MemPalace#1211
MemPalace#1221 MemPalace#1223 MemPalace#1224 MemPalace#1225) plus the canonical merged versions of our four
PRs that landed today (21:22-21:41 UTC):

  MemPalace#1173  quarantine_stale_hnsw on make_client + cold-start gate +
         integrity sniff-test (PROTO/STOP byte check, no deserialization)
  MemPalace#1177  .blob_seq_ids_migrated marker guard, closes MemPalace#1090
  MemPalace#1198  _tokenize None-document guard in BM25 reranker
  MemPalace#1201  palace_graph.build_graph skips None metadata

Conflict resolution:

* mempalace/backends/chroma.py — took upstream as base (it has the
  igorls-review pickle-protocol docstring, thread-safety paragraph, and
  Path(marker).touch() style nit), then re-applied MemPalace#1094's _coerce_none_metas
  in query()/get() since MemPalace#1094 is still open and not yet in develop.

* mempalace/mcp_server.py — took upstream's clean form. Dropped the
  fork-only `palace_path=` kwarg from four ChromaCollection() call sites:
  the kwarg was load-bearing for MemPalace#1171's per-collection write lock, but
  MemPalace#1171 closed in favor of MemPalace#976's mine_global_lock + daemon-strict, so
  the kwarg has no remaining consumer. ChromaCollection.__init__ in
  upstream/develop is back to (self, collection); calling it with
  palace_path= raised TypeError → silently swallowed by the broad except
  in _get_collection() → returned None → tool_status() returned _no_palace().
  41 mcp_server tests went from failing-with-KeyError to passing.

* mempalace/cli.py — dropped fork-only `workers=args.workers` from the
  cmd_mine -> miner.mine() call. Pre-existing fork-side bug: the
  `--workers` argparse arg landed in 5cd14bd but miner.mine() never
  accepted a workers param, so production `mempalace mine` TypeError'd
  on every invocation. Removed the broken plumbing; tests/test_cli.py
  updated to match.

* CHANGELOG.md — took upstream verbatim. Fork-specific changelog lives
  in FORK_CHANGELOG.md (canonical: docs/fork-changes.yaml).

* CLAUDE.md — kept ours. Fork's CLAUDE.md is operational; upstream's
  added a "Design Principles / Contributing" charter, which lives in
  README.md on the fork.

* tests/test_backends.py — took upstream's ruff-formatted line widths.

docs/fork-changes.yaml flips the two MemPalace#1173 entries (hnsw-integrity-gate,
hnsw-cold-start-gate) and the MemPalace#1201 entry (palace-graph-none-guard) from
OPEN to MERGED 2026-04-26. MemPalace#1173 MemPalace#1177 MemPalace#1198 MemPalace#1201 added to the
merged_upstream archive at the bottom. FORK_CHANGELOG.md regenerated.

scripts/check-docs.sh: 4/4 clean.
Test suite: 1460/1460.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
lealvona pushed a commit to lealvona/mempalace that referenced this pull request Apr 29, 2026
…f replacing
