fix(entity_detector): script-aware word boundaries for combining-mark scripts #932

Merged
igorls merged 1 commit into develop from fix/entity-detector-non-latin-boundaries
Apr 16, 2026

@igorls igorls commented Apr 16, 2026

Summary

  • Python's \b is a \w/non-\w transition. Devanagari vowel signs (matras like ा ी ु) are Unicode combining marks (category Mc, Mark, Spacing Combining, or Mn, Mark, Nonspacing), and Python's \w does not match either category. As a result \b splits mid-word at every matra: names like अनीता (Anita) truncate to अनीत, and person-verb patterns like \bराज\s+ने\s+कहा\b never match.
  • The same issue affects Arabic, Hebrew, Thai, Tamil, Burmese, and Khmer: any script whose words contain combining marks.
  • Fix: locales declare an optional boundary_chars field (e.g. "\\w\\u0900-\\u097F" for Hindi); the i18n loader expands \b into a script-aware lookaround and pre-wraps candidate patterns.
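
The truncation can be reproduced with the standard library alone. The sketch below is illustrative only: the candidate pattern and the `DEVANAGARI` and `INSIDE` classes are assumptions chosen for the demo, not the patterns shipped in the locale files.

```python
import re
import unicodedata

# "अनीता" (Anita), spelled with escapes so the code points are explicit.
ANITA = "\u0905\u0928\u0940\u0924\u093e"
TEXT = ANITA + " \u0928\u0947 \u0915\u0939\u093e"  # "अनीता ने कहा"

# The final matra (U+093E) is a spacing combining mark, which \w rejects.
LAST_CATEGORY = unicodedata.category(ANITA[-1])
MATRA_IS_WORD_CHAR = re.fullmatch(r"\w", ANITA[-1]) is not None

# Hypothetical candidate pattern wrapped in plain \b.
DEVANAGARI = r"\u0900-\u097F"
plain = re.compile(rf"\b([{DEVANAGARI}]+)\b")

# Same pattern, but the boundary is a lookaround over an explicit
# "inside-word" class that also covers the Devanagari block.
INSIDE = rf"[\w{DEVANAGARI}]"
aware = re.compile(rf"(?<!{INSIDE})([{DEVANAGARI}]+)(?!{INSIDE})")

plain_first = plain.search(TEXT).group(1)  # stops before the final matra
aware_first = aware.search(TEXT).group(1)  # keeps the whole name
```

With plain \b the regex engine backtracks to the last position where a letter is followed by a matra, which is why the final vowel sign is dropped.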

Why now

PR #773 (Hindi) added a complete entity section per the #911 standard, but during review we verified that none of the patterns actually fire due to the \b issue. This fix unblocks #773 and any future Indic/Arabic/Hebrew/Thai locale.

What changed

| File | Lines | What |
| --- | --- | --- |
| `mempalace/i18n/__init__.py` | +143 / -48 | _script_boundary() builds a lookaround boundary; _expand_b() replaces \b when boundary_chars is set; _wrap_candidate() pre-wraps with boundary + capture group; _collect_entity_section() applies expansion during load so merge works transparently |
| `mempalace/entity_detector.py` | +2 / -2 | extract_candidates compiles pre-wrapped patterns directly instead of re-wrapping with \b |
| `tests/test_entity_detector.py` | +85 | 5 new tests: Devanagari name extraction with/without boundary_chars, person-verb scoring fires, truncation without boundary_chars, English regression unchanged |

API note: get_entity_patterns()["candidate_patterns"] now returns fully-wrapped regex strings (boundary + capture group included). The only consumer is extract_candidates, which was updated to not double-wrap. Future callers should compile these directly.
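
A rough sketch of how such a loader could expand \b and pre-wrap candidates is shown here. The function names echo the helpers listed above, but the bodies, signatures, and the sample patterns are assumptions for illustration, not the merged code.

```python
import re
from typing import Optional


def script_boundary(boundary_chars: str) -> str:
    """Zero-width pattern that acts like \\b for a custom inside-word class."""
    inside = f"[{boundary_chars}]"
    return f"(?:(?<!{inside})(?={inside})|(?<={inside})(?!{inside}))"


def expand_b(pattern: str, boundary_chars: Optional[str]) -> str:
    """Replace each \\b escape; leave the pattern untouched without boundary_chars.

    Escapes are consumed in pairs so an escaped backslash followed by "b"
    is not mistaken for \\b. Character classes are not special-cased here.
    """
    if not boundary_chars:
        return pattern
    boundary = script_boundary(boundary_chars)
    out, i = [], 0
    while i < len(pattern):
        if pattern[i] == "\\" and i + 1 < len(pattern):
            pair = pattern[i:i + 2]
            out.append(boundary if pair == r"\b" else pair)
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)


def wrap_candidate(pattern: str, boundary_chars: Optional[str]) -> str:
    """Return boundary + capture group + boundary, ready for re.compile()."""
    edge = script_boundary(boundary_chars) if boundary_chars else r"\b"
    return f"{edge}({pattern}){edge}"


HINDI = r"\w\u0900-\u097F"
RAJ, NE, KAHA = "\u0930\u093e\u091c", "\u0928\u0947", "\u0915\u0939\u093e"
text = f"{RAJ} {NE} {KAHA}."  # "राज ने कहा."

person_verb = rf"\b{RAJ}\s+{NE}\s+{KAHA}\b"
plain_hit = re.search(person_verb, text) is not None
expanded_hit = re.search(expand_b(person_verb, HINDI), text) is not None

# A consumer compiles the pre-wrapped string directly, with no extra \b.
candidates = re.compile(wrap_candidate(r"[\u0900-\u097F]+", HINDI)).findall(text)
```

Because the wrapped string already contains the capture group, a caller that added its own \b or parentheses would change both the match positions and the group numbering.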

How locales use it

Add one field to the entity section in mempalace/i18n/<lang>.json:

"entity": {
  "boundary_chars": "\\w\\u0900-\\u097F",
  ...
}

That's it. Every \b in that locale's patterns is expanded automatically. Locales without boundary_chars (en, pt-br, ru, it) are completely unchanged.
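
The doubled backslashes in the JSON matter: after decoding, the value is a regex fragment with single backslashes that can be dropped into a character class. A minimal sketch, assuming the locale file is read with the standard `json` module (the inline `LOCALE_JSON` document is a stand-in for a real locale file):

```python
import json
import re

# Stand-in for mempalace/i18n/<lang>.json; only the new field is shown.
LOCALE_JSON = r'''
{
  "entity": {
    "boundary_chars": "\\w\\u0900-\\u097F"
  }
}
'''

entity = json.loads(LOCALE_JSON)["entity"]
boundary_chars = entity.get("boundary_chars")  # decoded to \w\u0900-\u097F

# The decoded value is usable as the body of a regex character class.
inside_word = re.compile(f"[{boundary_chars}]")

# A locale that omits the field yields None, which signals "keep plain \b".
legacy = json.loads('{"entity": {}}')["entity"].get("boundary_chars")
```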

Test plan

  • 5 new tests covering: name extraction with matras preserved, truncation regression without boundary_chars, person-verb patterns fire on Hindi text, English behavior unchanged
  • Full suite: 950 passed, 0 regressions
  • Lint: ruff check . clean
  • Format: ruff format --check clean

… scripts

Python's \b is a \w/non-\w transition. Devanagari vowel signs (matras)
like ा ी ु are Unicode combining marks (categories Mc and Mn), not \w.
This means \b splits mid-word on every matra: names like अनीता (Anita)
truncate to अनीत, and person-verb patterns like \bराज\s+ने\s+कहा\b
never match because \b fails after the final matra of कहा.

Same issue affects Arabic, Hebrew, Thai, Tamil, and every other script
whose words contain combining marks.

Fix: locales with combining-mark scripts declare a boundary_chars field
in their entity section (e.g. "\\w\\u0900-\\u097F" for Hindi). The i18n
loader replaces every \b in that locale's patterns with a script-aware
lookaround that treats the declared characters as "inside-word", and
pre-wraps candidate/multi_word patterns with the same boundary.

Default behavior (no boundary_chars) keeps standard \b — en, pt-br, ru,
it are unchanged.

Changes:
- mempalace/i18n/__init__.py: add _script_boundary, _expand_b,
  _wrap_candidate, _collect_entity_section; candidate_patterns are now
  returned fully-wrapped (boundary + capture group applied)
- mempalace/entity_detector.py: extract_candidates compiles pre-wrapped
  candidate patterns directly instead of re-wrapping with \b
- tests/test_entity_detector.py: 5 new tests for Devanagari boundaries
  (name extraction with/without boundary_chars, person-verb firing,
  English regression)
@igorls igorls added the bug (Something isn't working) and area/i18n (Multilingual, Unicode, non-English embeddings) labels Apr 16, 2026
@igorls igorls merged commit d4c9424 into develop Apr 16, 2026
6 checks passed
mvalentsev added a commit to mvalentsev/mempalace that referenced this pull request Apr 16, 2026
lmanchu added a commit to lmanchu/mempalace that referenced this pull request Apr 16, 2026
zh-TW and zh-CN previously had no `entity` section. Calling
`detect_entities(..., languages=("zh-TW",))` silently fell back to
English patterns (i18n/__init__.py:231-233), so no Chinese names
were ever extracted — Chinese-speaking users got zero people or
projects detected from their own notes.

This adds entity sections for both locales:

- `candidate_pattern`: common-surname-prefixed CJK n-grams (~100
  surnames covering >95% of Taiwanese / PRC names), length capped
  at {1,2} trailing chars so greedy matches don't swallow the
  trailing verb character (e.g. 朱宜振說).
- `boundary_chars`: `\u4E00-\u9FFF` so the i18n loader's
  script-aware wrap (introduced in MemPalace#932) fires `\b` at CJK↔non-CJK
  transitions. This is the same mechanism used for Devanagari,
  applied to the CJK range.
- `person_verb_patterns`: Chinese verbs attach directly to the
  name with no whitespace, so patterns are written as `{name}說`,
  `{name}問`, `{name}決定` — no `\b` or `\s+` separators.
- `dialogue_patterns`: full-width colon `：`, Chinese quotes
  「」『』, plus the standard Latin forms.
- `pronoun_patterns`: 他 / 她 / 它 / 他們 / 她們 / 您 / 咱.
- `stopwords`: ~140 common particles, pronouns, time expressions,
  question words, conjunctions, UI nouns, and politeness forms.

**Known limitation** (explicitly covered by a test): CJK scripts
have no word delimiters, so a name flanked by CJK on both sides
with no punctuation or whitespace break is not extracted. This
is a fundamental limit of regex-based CJK entity detection —
resolving it would require a dictionary tokeniser. Realistic
Chinese technical writing contains enough non-CJK neighbours
(bullet lines, inline English, full-width punctuation, newlines)
that 3+ occurrences normally produce matches. Verified against a
realistic zh-TW PKM note: 朱宜振 extracted 11x from 8 sentences
with 0.99 person-classification confidence.
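
The limitation can be illustrated with a small sketch. The three-entry surname list and the exact shape of the wrapped pattern are assumptions for the demo; the real locale files carry roughly 100 surnames and may wrap the pattern differently.

```python
import re

CJK = r"\u4E00-\u9FFF"
SURNAMES = "朱|王|陳"  # hypothetical, heavily shortened surname list

# Surname plus one or two trailing characters, bounded by non-CJK on both sides.
candidate = re.compile(
    rf"(?<![{CJK}])((?:{SURNAMES})[{CJK}]{{1,2}})(?![{CJK}])"
)


def extract(text: str) -> list:
    """Return every candidate name found in text."""
    return candidate.findall(text)


bulleted = extract("- 朱宜振 (PM)")       # non-CJK neighbours on both sides
inline = extract("今天朱宜振說要開會")     # CJK on both sides, no delimiter
```

The bulleted line yields the name because a space sits on each side, while the inline sentence yields nothing: the lookbehind sees 天 immediately before the surname and rejects the match.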

**Follow-ups** (separate PRs): same pattern for `ja` and `ko`,
both of which currently share the silent fallback-to-English bug.

Tests: 7 new tests in `tests/test_entity_detector.py`:
- `test_zh_tw_candidate_extraction_at_boundaries`
- `test_zh_tw_person_classification`
- `test_zh_tw_stopwords_filter_common_particles`
- `test_zh_tw_falls_back_to_english_for_non_cjk_names`
- `test_zh_cn_candidate_extraction`
- `test_zh_cn_and_zh_tw_union_covers_both_variants`
- `test_zh_tw_known_limitation_inline_name_no_boundary`

Full suite: 957 passed, 0 failed.
igorls added a commit that referenced this pull request Apr 16, 2026
Bumps version across pyproject.toml, mempalace/version.py, README badge,
and uv.lock. Finalizes the 3.3.0 CHANGELOG section (was still labeled
'Unreleased') and adds a 3.3.1 section covering the multi-language
entity-detection infra and the five new locales landed since 2026-04-13.

Highlights:
- Multi-language entity detection infra (#911) + script-aware word
  boundaries for combining-mark scripts (#932) + BCP 47 case-insensitive
  locale resolution (#928) + i18n patterns wired into miner/palace/
  entity_registry (#931)
- Five new fully-supported locales: pt-br (#156), ru (#760), it (#907),
  hi (#773), id (#778)
- UTF-8 encoding fix on read_text() calls for non-UTF-8 Windows locales
  (#946)
- KnowledgeGraph lock correctness (#884, #887)
- Various smaller fixes and improvements
@igorls igorls mentioned this pull request Apr 16, 2026
8 tasks
shafdev pushed a commit to shafdev/mempalace that referenced this pull request Apr 17, 2026
Bumps version across pyproject.toml, mempalace/version.py, README badge,
and uv.lock. Finalizes the 3.3.0 CHANGELOG section (was still labeled
'Unreleased') and adds a 3.3.1 section covering the multi-language
entity-detection infra and the five new locales landed since 2026-04-13.

Highlights:
- Multi-language entity detection infra (MemPalace#911) + script-aware word
  boundaries for combining-mark scripts (MemPalace#932) + BCP 47 case-insensitive
  locale resolution (MemPalace#928) + i18n patterns wired into miner/palace/
  entity_registry (MemPalace#931)
- Five new fully-supported locales: pt-br (MemPalace#156), ru (MemPalace#760), it (MemPalace#907),
  hi (MemPalace#773), id (MemPalace#778)
- UTF-8 encoding fix on read_text() calls for non-UTF-8 Windows locales
  (MemPalace#946)
- KnowledgeGraph lock correctness (MemPalace#884, MemPalace#887)
- Various smaller fixes and improvements
Scorpion1221 pushed a commit to Scorpion1221/mempalace that referenced this pull request Apr 18, 2026
* merge-upstream-2026-04-17: (70 commits)
  fix: diary and hook prompts now guide same-language writing as user
  fix: use smaller batch size for repair with API embedding models
  feat: pluggable embedding model with Gemini support
  fix: keep query rewrite in user's language for better vector search recall
  fix: replace AAAK guidance with natural language in diary_write and stop hooks
  feat: enhance recall gate with prioritized rules, query rewrite guidance, and few-shot examples
  fix: improve recall gate prompt to reduce false positives on CJK continuations
  fix: skip session-local continue recall
  feat: decide recall in one llm call
  feat: use previous assistant replies for recall hooks
  fix(website): correct false claims and stale numbers in live docs
  chore(website): add Google Analytics
  new landing page pt 2
  new landing page
  fix: add explicit UTF-8 encoding to read_text() calls (MemPalace#776)
  feat: Update Indonesian translations
  Add Indonesian language support
  remove unnecessary comment
  fix: use pre-wrapped candidate patterns after MemPalace#932 refactor
  fix: use i18n candidate patterns for entity extraction in miner and palace
  ...