… scripts

Python's \b is a \w/non-\w transition. Devanagari vowel signs (matras) like ा ी ु are Unicode combining marks (category Mc, Mark, Spacing Combining, for ा and ी; category Mn, Mark, Nonspacing, for ु), which \w does not match. This means \b splits mid-word on every matra: names like अनीता (Anita) truncate to अनीत, and person-verb patterns like \bराज\s+ने\s+कहा\b never match because \b fails after the final matra of कहा. The same issue affects Arabic, Hebrew, Thai, Tamil, and every other script whose words contain combining marks.

Fix: locales with combining-mark scripts declare a boundary_chars field in their entity section (e.g. "\\w\\u0900-\\u097F" for Hindi). The i18n loader replaces every \b in that locale's patterns with a script-aware lookaround that treats the declared characters as "inside-word", and pre-wraps candidate/multi_word patterns with the same boundary.

Default behavior (no boundary_chars) keeps standard \b, so en, pt-br, ru, it are unchanged.

Changes:
- mempalace/i18n/__init__.py: add _script_boundary, _expand_b, _wrap_candidate, _collect_entity_section; candidate_patterns are now returned fully-wrapped (boundary + capture group applied)
- mempalace/entity_detector.py: extract_candidates compiles pre-wrapped candidate patterns directly instead of re-wrapping with \b
- tests/test_entity_detector.py: 5 new tests for Devanagari boundaries (name extraction with/without boundary_chars, person-verb firing, English regression)
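The boundary failure and the lookaround fix described in this commit message can be reproduced in a few lines. This is a minimal sketch, not the code from `mempalace/i18n/__init__.py`: the helper names `script_boundary` and `expand_b` and the sample name pattern are assumptions chosen for illustration, and the real `_script_boundary` / `_expand_b` may build the lookaround differently.

```python
import re


def script_boundary(boundary_chars: str) -> str:
    """Build a zero-width pattern that behaves like \\b for a custom class body.

    Hypothetical helper: `boundary_chars` is the body of a character class
    listing everything that counts as "inside a word".
    """
    inside = f"[{boundary_chars}]"
    leading = f"(?<!{inside})(?={inside})"   # outside -> inside transition
    trailing = f"(?<={inside})(?!{inside})"  # inside -> outside transition
    return f"(?:{leading}|{trailing})"


def expand_b(pattern: str, boundary_chars: str) -> str:
    # Naive textual replacement; a production version would have to skip
    # an escaped backslash that happens to be followed by "b".
    return pattern.replace(r"\b", script_boundary(boundary_chars))


HINDI = r"\w\u0900-\u097F"  # the boundary_chars value quoted for Hindi

# A matra is not a \w character in Python's re module.
assert re.fullmatch(r"\w", "ा") is None

# Name extraction: plain \b stops before the final matra.
name_pattern = r"\b[\u0900-\u097F]+\b"  # illustrative candidate pattern
text = "अनीता ने कहा"
plain = re.search(name_pattern, text).group()                   # truncated: अनीत
aware = re.search(expand_b(name_pattern, HINDI), text).group()  # full name: अनीता

# Person-verb pattern: plain \b can never succeed after the final matra of कहा.
verb_pattern = r"\bराज\s+ने\s+कहा\b"
sentence = "राज ने कहा"
plain_hit = re.search(verb_pattern, sentence)                   # None
aware_hit = re.search(expand_b(verb_pattern, HINDI), sentence)  # a match object
```

Because the lookaround only inspects one character on each side, it stays zero-width like `\b` and can be substituted into existing patterns without changing their capture groups.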
mvalentsev added a commit to mvalentsev/mempalace that referenced this pull request on Apr 16, 2026
lmanchu added a commit to lmanchu/mempalace that referenced this pull request on Apr 16, 2026
zh-TW and zh-CN previously had no `entity` section. Calling
`detect_entities(..., languages=("zh-TW",))` silently fell back to
English patterns (i18n/__init__.py:231-233), so no Chinese names
were ever extracted — Chinese-speaking users got zero people or
projects detected from their own notes.
This adds entity sections for both locales:
- `candidate_pattern`: common-surname-prefixed CJK n-grams (~100
surnames covering >95% of Taiwanese / PRC names), length capped
at {1,2} trailing chars so greedy matches don't swallow the
trailing verb character (e.g. 朱宜振說).
- `boundary_chars`: `\u4E00-\u9FFF` so the i18n loader's
script-aware wrap (introduced in MemPalace#932) fires `\b` at CJK↔non-CJK
transitions. This is the same mechanism used for Devanagari,
applied to the CJK range.
- `person_verb_patterns`: Chinese verbs attach directly to the
name with no whitespace, so patterns are written as `{name}說`,
`{name}問`, `{name}決定` — no `\b` or `\s+` separators.
- `dialogue_patterns`: full-width colon `：`, Chinese quotes
「」『』, plus the standard Latin forms.
- `pronoun_patterns`: 他 / 她 / 它 / 他們 / 她們 / 您 / 咱.
- `stopwords`: ~140 common particles, pronouns, time expressions,
question words, conjunctions, UI nouns, and politeness forms.
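The `person_verb_patterns` entry above can be illustrated with a short sketch. It assumes the `{name}` placeholder is filled by plain string substitution with an escaped name; the helper `expand_person_verbs` is hypothetical and not taken from the mempalace code.

```python
import re


def expand_person_verbs(templates, name):
    """Substitute an escaped name into each template and compile it.

    Hypothetical helper: uses str.replace rather than str.format so that
    regex quantifiers such as {1,2} in a template would be left alone.
    """
    escaped = re.escape(name)
    return [re.compile(t.replace("{name}", escaped)) for t in templates]


# Templates as described in the commit message: the verb attaches directly
# to the name, so there is no \b or \s+ between them.
templates = ["{name}說", "{name}問", "{name}決定"]
patterns = expand_person_verbs(templates, "朱宜振")

text = "朱宜振說這個方案可行，後來朱宜振決定先上線。"
hits = sum(len(p.findall(text)) for p in patterns)  # 說 once, 決定 once
```

A whitespace-separated template in the style of the Hindi patterns would not fire on this text, since nothing separates the name from the verb.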
**Known limitation** (explicitly covered by a test): CJK scripts
have no word delimiters, so a name flanked by CJK on both sides
with no punctuation or whitespace break is not extracted. This
is a fundamental limit of regex-based CJK entity detection —
resolving it would require a dictionary tokeniser. Realistic
Chinese technical writing contains enough non-CJK neighbours
(bullet lines, inline English, full-width punctuation, newlines)
that 3+ occurrences normally produce matches. Verified against a
realistic zh-TW PKM note: 朱宜振 extracted 11x from 8 sentences
with 0.99 person-classification confidence.
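The known limitation can be seen in a small sketch. Several things here are assumptions rather than the shipped configuration: the three-character surname list stands in for the roughly 100 surnames in the locale file, and the boundary is assumed to be applied on both sides of the candidate as a lookaround over the `\u4E00-\u9FFF` range.

```python
import re

CJK = r"\u4E00-\u9FFF"  # boundary_chars value from the commit message
SURNAMES = "朱陳林"      # hypothetical stand-in for the full surname list

# Surname, then 1-2 further CJK characters, with a CJK/non-CJK boundary
# required on both sides of the candidate (assumed wrap).
candidate = re.compile(
    rf"(?<![{CJK}])([{SURNAMES}][{CJK}]{{1,2}})(?![{CJK}])"
)

# Non-CJK neighbours (full-width colon and comma): the name is extracted.
with_breaks = candidate.findall("負責人：朱宜振，2026")

# CJK on both sides with no punctuation or whitespace: nothing is extracted.
no_breaks = candidate.findall("今天朱宜振說明了計畫")
```

Without a dictionary tokeniser the regex has no way to tell where the name ends in the second sentence, which is the limitation the commit message describes.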
**Follow-ups** (separate PRs): same pattern for `ja` and `ko`,
both of which currently share the silent fallback-to-English bug.
Tests: 7 new tests in `tests/test_entity_detector.py`:
- `test_zh_tw_candidate_extraction_at_boundaries`
- `test_zh_tw_person_classification`
- `test_zh_tw_stopwords_filter_common_particles`
- `test_zh_tw_falls_back_to_english_for_non_cjk_names`
- `test_zh_cn_candidate_extraction`
- `test_zh_cn_and_zh_tw_union_covers_both_variants`
- `test_zh_tw_known_limitation_inline_name_no_boundary`
Full suite: 957 passed, 0 failed.
igorls added a commit that referenced this pull request on Apr 16, 2026
Bumps version across pyproject.toml, mempalace/version.py, README badge, and uv.lock. Finalizes the 3.3.0 CHANGELOG section (was still labeled 'Unreleased') and adds a 3.3.1 section covering the multi-language entity-detection infra and the five new locales landed since 2026-04-13.

Highlights:
- Multi-language entity detection infra (#911) + script-aware word boundaries for combining-mark scripts (#932) + BCP 47 case-insensitive locale resolution (#928) + i18n patterns wired into miner/palace/entity_registry (#931)
- Five new fully-supported locales: pt-br (#156), ru (#760), it (#907), hi (#773), id (#778)
- UTF-8 encoding fix on read_text() calls for non-UTF-8 Windows locales (#946)
- KnowledgeGraph lock correctness (#884, #887)
- Various smaller fixes and improvements
shafdev pushed a commit to shafdev/mempalace that referenced this pull request on Apr 17, 2026
Bumps version across pyproject.toml, mempalace/version.py, README badge, and uv.lock. Finalizes the 3.3.0 CHANGELOG section (was still labeled 'Unreleased') and adds a 3.3.1 section covering the multi-language entity-detection infra and the five new locales landed since 2026-04-13.

Highlights:
- Multi-language entity detection infra (MemPalace#911) + script-aware word boundaries for combining-mark scripts (MemPalace#932) + BCP 47 case-insensitive locale resolution (MemPalace#928) + i18n patterns wired into miner/palace/entity_registry (MemPalace#931)
- Five new fully-supported locales: pt-br (MemPalace#156), ru (MemPalace#760), it (MemPalace#907), hi (MemPalace#773), id (MemPalace#778)
- UTF-8 encoding fix on read_text() calls for non-UTF-8 Windows locales (MemPalace#946)
- KnowledgeGraph lock correctness (MemPalace#884, MemPalace#887)
- Various smaller fixes and improvements
Scorpion1221 pushed a commit to Scorpion1221/mempalace that referenced this pull request on Apr 18, 2026
* merge-upstream-2026-04-17: (70 commits)
  fix: diary and hook prompts now guide same-language writing as user
  fix: use smaller batch size for repair with API embedding models
  feat: pluggable embedding model with Gemini support
  fix: keep query rewrite in user's language for better vector search recall
  fix: replace AAAK guidance with natural language in diary_write and stop hooks
  feat: enhance recall gate with prioritized rules, query rewrite guidance, and few-shot examples
  fix: improve recall gate prompt to reduce false positives on CJK continuations
  fix: skip session-local continue recall
  feat: decide recall in one llm call
  feat: use previous assistant replies for recall hooks
  fix(website): correct false claims and stale numbers in live docs
  chore(website): add Google Analytics
  new landing page pt 2
  new landing page
  fix: add explicit UTF-8 encoding to read_text() calls (MemPalace#776)
  feat: Update Indonesian translations
  Add Indonesian language support
  remove unnecessary comment
  fix: use pre-wrapped candidate patterns after MemPalace#932 refactor
  fix: use i18n candidate patterns for entity extraction in miner and palace
  ...
Summary
- Python's `\b` is a `\w`/non-`\w` transition. Devanagari vowel signs (matras like ा ी ु) are Unicode combining marks (category `Mc`, Mark, Spacing Combining, for ा and ी; category `Mn`, Mark, Nonspacing, for ु), which `\w` does not match. This causes `\b` to split mid-word on every matra: names like `अनीता` (Anita) truncate to `अनीत`, and person-verb patterns like `\bराज\s+ने\s+कहा\b` never match.
- Locales with combining-mark scripts declare a `boundary_chars` field (e.g. `"\\w\\u0900-\\u097F"` for Hindi); the i18n loader expands `\b` into a script-aware lookaround and pre-wraps candidate patterns.

Why now
PR #773 (Hindi) added a complete `entity` section per the #911 standard, but during review we verified that none of the patterns actually fire due to the `\b` issue. This fix unblocks #773 and any future Indic/Arabic/Hebrew/Thai locale.

What changed
- `mempalace/i18n/__init__.py`: `_script_boundary()` builds a lookaround boundary; `_expand_b()` replaces `\b` when `boundary_chars` is set; `_wrap_candidate()` pre-wraps with boundary + capture group; `_collect_entity_section()` applies expansion during load so merge works transparently
- `mempalace/entity_detector.py`: `extract_candidates` compiles pre-wrapped patterns directly instead of re-wrapping with `\b`
- `tests/test_entity_detector.py`: 5 new tests: name extraction with `boundary_chars`, person-verb scoring fires, truncation without `boundary_chars`, English regression unchanged

API note: `get_entity_patterns()["candidate_patterns"]` now returns fully-wrapped regex strings (boundary + capture group included). The only consumer is `extract_candidates`, which was updated to not double-wrap. Future callers should compile these directly.

How locales use it
Add one field to the `entity` section in `mempalace/i18n/<lang>.json`, for example `"boundary_chars": "\\w\\u0900-\\u097F"` for Hindi. That's it. Every `\b` in that locale's patterns is expanded automatically. Locales without `boundary_chars` (en, pt-br, ru, it) are completely unchanged.

Test plan
- New tests: name extraction with `boundary_chars`, person-verb patterns fire on Hindi text, English behavior unchanged
- `ruff check .` clean
- `ruff format --check` clean
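To make the "How locales use it" and "API note" sections concrete, here is a hedged sketch of a locale fragment and of a caller compiling a pre-wrapped candidate pattern directly. Only the `boundary_chars` value is taken from this PR; the surrounding JSON fragment, the wrapped pattern, and the variable names are assumptions for illustration, and the real loader goes through `get_entity_patterns()` rather than `json.loads` on a string.

```python
import json
import re

# Hypothetical fragment of mempalace/i18n/<lang>.json. In JSON the backslashes
# are doubled so that the decoded value is a regex character-class body.
locale_json = r'''
{
  "entity": {
    "boundary_chars": "\\w\\u0900-\\u097F"
  }
}
'''

entity = json.loads(locale_json)["entity"]
boundary_chars = entity["boundary_chars"]  # decoded: \w\u0900-\u097F

# Assumed shape of a "fully-wrapped" candidate pattern: boundary lookarounds
# plus a capture group, already applied by the loader.
inside = f"[{boundary_chars}]"
wrapped = rf"(?<!{inside})([\u0900-\u097F]+)(?!{inside})"

# Per the API note, a caller compiles the string as-is and adds no extra \b.
compiled = re.compile(wrapped)
names = compiled.findall("अनीता और राज")
```

Wrapping such a string in `\b` again would reintroduce the truncation this PR removes, which is why callers are asked to compile the returned patterns directly.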