fix(entity_detector): script-aware word boundaries for combining-mark scripts by igorls · Pull Request #932 · MemPalace/mempalace

igorls · 2026-04-16T01:20:15Z

Summary

Python's \b is a \w/non-\w transition. Devanagari vowel signs (matras like ा ी ु) are Unicode category Mc (Mark, Spacing Combining) — not \w. This causes \b to split mid-word on every matra: names like अनीता (Anita) truncate to अनीत, and person-verb patterns like \bराज\s+ने\s+कहा\b never match.
Same issue affects Arabic, Hebrew, Thai, Tamil, Burmese, Khmer — every script with combining marks.
Fix: locales declare an optional boundary_chars field (e.g. "\\w\\u0900-\\u097F" for Hindi); the i18n loader expands \b into a script-aware lookaround and pre-wraps candidate patterns.
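
The truncation can be reproduced with the standard library alone. The snippet below is a minimal sketch, not code from this PR: the candidate character class and the lookaround form are assumptions chosen to mirror the description above.

```python
import re
import unicodedata

name = "अनीता"  # Anita

# The vowel signs in the name are category Mc, which Python's \w does not match.
marks = [ch for ch in name if unicodedata.category(ch) == "Mc"]
marks_are_word_chars = [bool(re.fullmatch(r"\w", ch)) for ch in marks]

# Illustrative candidate pattern: a run of Devanagari code points between \b.
# The trailing \b cannot fire after the final matra, so the engine backtracks
# to the last consonant and drops the vowel sign.
naive = re.findall(r"\b[\u0900-\u097F]+\b", name)
print(naive)  # -> ['अनीत']

# Same run, but delimited by lookarounds that treat the whole Devanagari block
# (plus \w) as "inside-word".
inside = r"[\w\u0900-\u097F]"
aware = re.findall(rf"(?<!{inside})[\u0900-\u097F]+(?!{inside})", name)
print(aware)  # -> ['अनीता']
```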

Why now

PR #773 (Hindi) added a complete entity section per the #911 standard, but during review we verified that none of the patterns actually fire due to the \b issue. This fix unblocks #773 and any future Indic/Arabic/Hebrew/Thai locale.

What changed

| File | Lines | What |
|---|---|---|
| `mempalace/i18n/__init__.py` | +143 / -48 | `_script_boundary()` builds a lookaround boundary; `_expand_b()` replaces `\b` when `boundary_chars` is set; `_wrap_candidate()` pre-wraps with boundary + capture group; `_collect_entity_section()` applies expansion during load so merge works transparently |
| `mempalace/entity_detector.py` | +2 / -2 | `extract_candidates` compiles pre-wrapped patterns directly instead of re-wrapping with `\b` |
| `tests/test_entity_detector.py` | +85 | 5 new tests: Devanagari name extraction with/without `boundary_chars`, person-verb scoring fires, truncation without `boundary_chars`, English regression unchanged |

API note: get_entity_patterns()["candidate_patterns"] now returns fully-wrapped regex strings (boundary + capture group included). The only consumer is extract_candidates, which was updated to not double-wrap. Future callers should compile these directly.
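
One way the three helpers could fit together is sketched below. This is an assumption-laden illustration, not the merged implementation: the function names drop the leading underscore to mark them as hypothetical, and the plain string replacement ignores edge cases such as an escaped backslash before `b` or `[\b]` inside a character class.

```python
import re
from typing import Optional


def script_boundary(boundary_chars: str) -> str:
    """Zero-width boundary between an 'inside-word' char and anything else."""
    inside = f"[{boundary_chars}]"
    # Either a word starts here, or a word ends here.
    return f"(?:(?<!{inside})(?={inside})|(?<={inside})(?!{inside}))"


def expand_b(pattern: str, boundary_chars: Optional[str]) -> str:
    """Replace every \\b in pattern; leave it untouched when no chars are declared."""
    if not boundary_chars:
        return pattern
    return pattern.replace(r"\b", script_boundary(boundary_chars))


def wrap_candidate(raw: str, boundary_chars: Optional[str]) -> str:
    """Wrap a raw candidate pattern with a boundary and one capture group."""
    boundary = script_boundary(boundary_chars) if boundary_chars else r"\b"
    return f"{boundary}({raw}){boundary}"


HINDI = r"\w\u0900-\u097F"
person_verb = r"\bराज\s+ने\s+कहा\b"
text = "राज ने कहा"

print(re.search(person_verb, text))                          # None: trailing \b never fires
print(re.search(expand_b(person_verb, HINDI), text) is not None)  # True
```

Because the wrapped candidate already carries its boundary and capture group, a caller in this sketch would pass it straight to `re.compile` and read group 1, which is the contract the API note describes.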

How locales use it

Add one field to the entity section in mempalace/i18n/<lang>.json:

"entity": {
  "boundary_chars": "\\w\\u0900-\\u097F",
  ...
}

That's it. Every \b in that locale's patterns is expanded automatically. Locales without boundary_chars (en, pt-br, ru, it) are completely unchanged.
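
The doubled backslashes matter: JSON decodes `\\w\\u0900-\\u097F` to the literal text `\w\u0900-\u097F`, and it is the regex engine, not the JSON parser, that later interprets `\u0900`. The sketch below shows that round trip on an inline document; the inline JSON and variable names are illustrative and do not reproduce the real loader.

```python
import json
import re

# Inline stand-in for a locale file such as mempalace/i18n/<lang>.json.
locale_json = r'''
{
  "entity": {
    "boundary_chars": "\\w\\u0900-\\u097F"
  }
}
'''

entity = json.loads(locale_json)["entity"]

# Optional field: locales that omit it keep the standard \b behaviour.
boundary_chars = entity.get("boundary_chars")
print(boundary_chars)  # -> \w\u0900-\u097F

# The decoded string drops straight into a character class.
inside = re.compile(f"[{boundary_chars}]")
print(bool(inside.fullmatch("ा")))  # -> True  (matra counts as inside-word)
print(bool(inside.fullmatch(" ")))  # -> False
```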

Test plan

5 new tests covering: name extraction with matras preserved, truncation regression without boundary_chars, person-verb patterns fire on Hindi text, English behavior unchanged
Full suite: 950 passed, 0 regressions
Lint: ruff check . clean
Format: ruff format --check clean

… scripts

Python's \b is a \w/non-\w transition. Devanagari vowel signs (matras) like ा ी ु are Unicode category Mc (Mark, Spacing Combining) — not \w. This means \b splits mid-word on every matra: names like अनीता (Anita) truncate to अनीत, and person-verb patterns like \bराज\s+ने\s+कहा\b never match because \b fails after the final matra of कहा.

Same issue affects Arabic, Hebrew, Thai, Tamil, and every other script whose words contain combining marks.

Fix: locales with combining-mark scripts declare a boundary_chars field in their entity section (e.g. "\\w\\u0900-\\u097F" for Hindi). The i18n loader replaces every \b in that locale's patterns with a script-aware lookaround that treats the declared characters as "inside-word", and pre-wraps candidate/multi_word patterns with the same boundary.

Default behavior (no boundary_chars) keeps standard \b — en, pt-br, ru, it are unchanged.

Changes:
- mempalace/i18n/__init__.py: add _script_boundary, _expand_b, _wrap_candidate, _collect_entity_section; candidate_patterns are now returned fully-wrapped (boundary + capture group applied)
- mempalace/entity_detector.py: extract_candidates compiles pre-wrapped candidate patterns directly instead of re-wrapping with \b
- tests/test_entity_detector.py: 5 new tests for Devanagari boundaries (name extraction with/without boundary_chars, person-verb firing, English regression)

zh-TW and zh-CN previously had no `entity` section. Calling `detect_entities(..., languages=("zh-TW",))` silently fell back to English patterns (i18n/__init__.py:231-233), so no Chinese names were ever extracted — Chinese-speaking users got zero people or projects detected from their own notes.

This adds entity sections for both locales:
- `candidate_pattern`: common-surname-prefixed CJK n-grams (~100 surnames covering >95% of Taiwanese / PRC names), length capped at {1,2} trailing chars so greedy matches don't swallow the trailing verb character (e.g. 朱宜振說).
- `boundary_chars`: `\u4E00-\u9FFF` so the i18n loader's script-aware wrap (introduced in MemPalace#932) fires `\b` at CJK↔non-CJK transitions. This is the same mechanism used for Devanagari, applied to the CJK range.
- `person_verb_patterns`: Chinese verbs attach directly to the name with no whitespace, so patterns are written as `{name}說`, `{name}問`, `{name}決定` — no `\b` or `\s+` separators.
- `dialogue_patterns`: full-width colon `：`, Chinese quotes 「」『』, plus the standard Latin forms.
- `pronoun_patterns`: 他 / 她 / 它 / 他們 / 她們 / 您 / 咱.
- `stopwords`: ~140 common particles, pronouns, time expressions, question words, conjunctions, UI nouns, and politeness forms.

**Known limitation** (explicitly covered by a test): CJK scripts have no word delimiters, so a name flanked by CJK on both sides with no punctuation or whitespace break is not extracted. This is a fundamental limit of regex-based CJK entity detection — resolving it would require a dictionary tokeniser. Realistic Chinese technical writing contains enough non-CJK neighbours (bullet lines, inline English, full-width punctuation, newlines) that 3+ occurrences normally produce matches. Verified against a realistic zh-TW PKM note: 朱宜振 extracted 11x from 8 sentences with 0.99 person-classification confidence.

**Follow-ups** (separate PRs): same pattern for `ja` and `ko`, both of which currently share the silent fallback-to-English bug.

Tests: 7 new tests in `tests/test_entity_detector.py`:
- `test_zh_tw_candidate_extraction_at_boundaries`
- `test_zh_tw_person_classification`
- `test_zh_tw_stopwords_filter_common_particles`
- `test_zh_tw_falls_back_to_english_for_non_cjk_names`
- `test_zh_cn_candidate_extraction`
- `test_zh_cn_and_zh_tw_union_covers_both_variants`
- `test_zh_tw_known_limitation_inline_name_no_boundary`

Full suite: 957 passed, 0 failed.
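
The CJK behaviour described above, including the known limitation, can be sketched as follows. Everything here is illustrative: the four-surname list stands in for the ~100 surnames in the real locale files, and `cjk_boundary` and `person_verb` are hypothetical helpers rather than functions from the codebase.

```python
import re

CJK = r"\u4E00-\u9FFF"
SURNAMES = "朱王李陳"  # tiny illustrative subset, not the shipped list


def cjk_boundary() -> str:
    """Boundary that fires only at CJK <-> non-CJK transitions."""
    inside = f"[{CJK}]"
    return f"(?:(?<!{inside})(?={inside})|(?<={inside})(?!{inside}))"


# Surname followed by one or two more CJK characters, wrapped with the boundary.
candidate = rf"{cjk_boundary()}([{SURNAMES}][{CJK}]{{1,2}}){cjk_boundary()}"


def person_verb(name: str) -> str:
    """Verb attaches directly to the name: no \\b, no \\s+."""
    return re.escape(name) + "說"


# Full-width punctuation on both sides gives the boundary something to fire on.
print(re.findall(candidate, "負責人：朱宜振，"))  # -> ['朱宜振']

# Known limitation: CJK neighbours on both sides, so no candidate is extracted.
print(re.findall(candidate, "昨天朱宜振說好"))  # -> []

# Once the name is known, the person-verb pattern still matches inline.
print(bool(re.search(person_verb("朱宜振"), "昨天朱宜振說好")))  # -> True
```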

Bumps version across pyproject.toml, mempalace/version.py, README badge, and uv.lock. Finalizes the 3.3.0 CHANGELOG section (was still labeled 'Unreleased') and adds a 3.3.1 section covering the multi-language entity-detection infra and the five new locales landed since 2026-04-13.

Highlights:
- Multi-language entity detection infra (#911) + script-aware word boundaries for combining-mark scripts (#932) + BCP 47 case-insensitive locale resolution (#928) + i18n patterns wired into miner/palace/entity_registry (#931)
- Five new fully-supported locales: pt-br (#156), ru (#760), it (#907), hi (#773), id (#778)
- UTF-8 encoding fix on read_text() calls for non-UTF-8 Windows locales (#946)
- KnowledgeGraph lock correctness (#884, #887)
- Various smaller fixes and improvements

* merge-upstream-2026-04-17: (70 commits)
  fix: diary and hook prompts now guide same-language writing as user
  fix: use smaller batch size for repair with API embedding models
  feat: pluggable embedding model with Gemini support
  fix: keep query rewrite in user's language for better vector search recall
  fix: replace AAAK guidance with natural language in diary_write and stop hooks
  feat: enhance recall gate with prioritized rules, query rewrite guidance, and few-shot examples
  fix: improve recall gate prompt to reduce false positives on CJK continuations
  fix: skip session-local continue recall
  feat: decide recall in one llm call
  feat: use previous assistant replies for recall hooks
  fix(website): correct false claims and stale numbers in live docs
  chore(website): add Google Analytics
  new landing page pt 2
  new landing page
  fix: add explicit UTF-8 encoding to read_text() calls (MemPalace#776)
  feat: Update Indonesian translations
  Add Indonesian language support
  remove unnecessary comment
  fix: use pre-wrapped candidate patterns after MemPalace#932 refactor
  fix: use i18n candidate patterns for entity extraction in miner and palace
  ...

igorls requested review from bensig and milla-jovovich as code owners April 16, 2026 01:20

igorls added the bug (Something isn't working) and area/i18n (Multilingual, Unicode, non-English embeddings) labels Apr 16, 2026

igorls merged commit d4c9424 into develop Apr 16, 2026
6 checks passed

igorls mentioned this pull request Apr 16, 2026

feat: add Hindi language support to i18n module #773

Merged

mvalentsev added a commit to mvalentsev/mempalace that referenced this pull request Apr 16, 2026

fix: use pre-wrapped candidate patterns after MemPalace#932 refactor

973bd62

mvalentsev mentioned this pull request Apr 16, 2026

fix: use i18n candidate patterns for entity extraction in miner and palace #931

Merged

3 tasks

lmanchu mentioned this pull request Apr 16, 2026

feat(i18n): add Traditional + Simplified Chinese entity detection #945

Merged

5 tasks

igorls mentioned this pull request Apr 16, 2026

release: v3.3.1 #957

Merged

8 tasks

mvalentsev mentioned this pull request Apr 19, 2026

feat(i18n): add Hebrew language support #1031

Closed

3 tasks

igorls merged 1 commit into develop from fix/entity-detector-non-latin-boundaries
