feat(i18n): add Traditional + Simplified Chinese entity detection #945
Merged
igorls merged 2 commits into MemPalace:develop on Apr 21, 2026
zh-TW and zh-CN previously had no `entity` section. Calling
`detect_entities(..., languages=("zh-TW",))` silently fell back to
English patterns (i18n/__init__.py:231-233), so no Chinese names
were ever extracted — Chinese-speaking users got zero people or
projects detected from their own notes.
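The silent fallback can be sketched as follows. This is a minimal illustration, not the loader's actual code: the `LOCALES` table and the `candidate_pattern` helper are hypothetical names, and the English pattern is used bare, without whatever boundary wrapping the real loader applies.

```python
import re

# Hypothetical locale table: before this change, zh-TW had no "entity" section.
LOCALES = {
    "en": {"entity": {"candidate_pattern": r"[A-Z][a-z]{1,19}"}},
    "zh-TW": {},
}


def candidate_pattern(lang: str) -> str:
    """Return the locale's candidate pattern, silently falling back to English."""
    entity = LOCALES.get(lang, {}).get("entity")
    if entity is None:
        # No error and no warning: the caller cannot tell the fallback happened.
        entity = LOCALES["en"]["entity"]
    return entity["candidate_pattern"]


text = "朱宜振說 Alice 的設計需要調整。朱宜振決定重寫。"
found = re.findall(candidate_pattern("zh-TW"), text)
print(found)  # ['Alice']
```

Only the Latin-script name survives; the Chinese name appears twice in the text and is never proposed as a candidate.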
This adds entity sections for both locales:
- `candidate_pattern`: common-surname-prefixed CJK n-grams (~100
surnames covering >95% of Taiwanese / PRC names), length capped
at {1,2} trailing chars so greedy matches don't swallow the
trailing verb character (e.g. 朱宜振說).
- `boundary_chars`: `\u4E00-\u9FFF` so the i18n loader's
script-aware wrap (introduced in MemPalace#932) fires `\b` at CJK↔non-CJK
transitions. This is the same mechanism used for Devanagari,
applied to the CJK range.
- `person_verb_patterns`: Chinese verbs attach directly to the
name with no whitespace, so patterns are written as `{name}說`,
`{name}問`, `{name}決定` — no `\b` or `\s+` separators.
- `dialogue_patterns`: full-width colon `：`, Chinese quotes
「」『』, plus the standard Latin forms.
- `pronoun_patterns`: 他 / 她 / 它 / 他們 / 她們 / 您 / 咱.
- `stopwords`: ~140 common particles, pronouns, time expressions,
question words, conjunctions, UI nouns, and politeness forms.
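As a rough sketch of the first and third items above: the snippet below builds a surname-prefixed candidate pattern with the `{1,2}` cap, compares it with a looser cap, and fills the `{name}` verb templates. The five-surname list is an illustrative subset (the real list is described as ~100 surnames), and `count_person_signals` is a hypothetical helper, not a function from the codebase.

```python
import re

# Illustrative subset only; the locale files are described as listing ~100 surnames.
SURNAMES = "朱王李陳林"
CJK = r"\u4E00-\u9FFF"

# Surname followed by one or two further CJK characters.
capped = re.compile(rf"[{SURNAMES}][{CJK}]{{1,2}}")
# Same pattern with a looser cap, to show the over-match the cap prevents.
loose = re.compile(rf"[{SURNAMES}][{CJK}]{{1,3}}")

print(capped.findall("朱宜振說"))  # ['朱宜振']
print(loose.findall("朱宜振說"))   # ['朱宜振說']

# Verb templates attach directly to the name: no \b and no \s+ in between.
VERB_TEMPLATES = ["{name}說", "{name}問", "{name}決定"]


def count_person_signals(name: str, text: str) -> int:
    """Count occurrences of the name immediately followed by a person verb."""
    return sum(
        len(re.findall(template.format(name=re.escape(name)), text))
        for template in VERB_TEMPLATES
    )


note = "朱宜振說要重構。朱宜振決定先寫測試。"
print(count_person_signals("朱宜振", note))  # 2
```

The looser cap returns the name with the verb 說 glued on, which is the failure the `{1,2}` limit is meant to avoid.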
**Known limitation** (explicitly covered by a test): CJK scripts
have no word delimiters, so a name flanked by CJK on both sides
with no punctuation or whitespace break is not extracted. This
is a fundamental limit of regex-based CJK entity detection —
resolving it would require a dictionary tokeniser. Realistic
Chinese technical writing contains enough non-CJK neighbours
(bullet lines, inline English, full-width punctuation, newlines)
that 3+ occurrences normally produce matches. Verified against a
realistic zh-TW PKM note: 朱宜振 extracted 11x from 8 sentences
with 0.99 person-classification confidence.
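The limitation can be illustrated with a small sketch. It assumes the script-aware boundary behaves like a pair of lookarounds that forbid a CJK character on either side of the candidate; the loader's real wrap may be implemented differently. The surname list is again an illustrative subset.

```python
import re

SURNAMES = "朱王李陳林"  # illustrative subset, not the real list
CJK = r"\u4E00-\u9FFF"

# Assumed emulation of a CJK-aware boundary: no CJK character may sit
# directly before or after the candidate.
bounded = re.compile(rf"(?<![{CJK}])[{SURNAMES}][{CJK}]{{1,2}}(?![{CJK}])")

# Name next to full-width punctuation, a newline, or a bullet: matchable.
matchable = "負責人：朱宜振\n- 朱宜振 (owner)"
# Name embedded in a run of CJK characters: no boundary on either side.
inline = "今天和朱宜振討論了設計"

print(bounded.findall(matchable))  # ['朱宜振', '朱宜振']
print(bounded.findall(inline))     # []
```

Under this assumption the inline occurrence is invisible to the regex, while the same name on a bullet line or after a full-width colon is picked up.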
**Follow-ups** (separate PRs): same pattern for `ja` and `ko`,
both of which currently share the silent fallback-to-English bug.
Tests: 7 new tests in `tests/test_entity_detector.py`:
- `test_zh_tw_candidate_extraction_at_boundaries`
- `test_zh_tw_person_classification`
- `test_zh_tw_stopwords_filter_common_particles`
- `test_zh_tw_falls_back_to_english_for_non_cjk_names`
- `test_zh_cn_candidate_extraction`
- `test_zh_cn_and_zh_tw_union_covers_both_variants`
- `test_zh_tw_known_limitation_inline_name_no_boundary`
Full suite: 957 passed, 0 failed.
Collapse implicit string concatenation to single-line strings to satisfy `ruff format --check` in CI.
Co-Authored-By: Claude <[email protected]>
I think the new ASCII command-style project patterns in Specifically, patterns like As a result, the expanded regex no longer matches plain ASCII text at all. So these project signals are effectively dead code in the current implementation. I’d suggest either:
Everything else in the PR looked consistent to me, but I don’t think these two patterns currently do what the PR intends.
This was referenced Apr 23, 2026
jphein pushed a commit to jphein/mempalace that referenced this pull request on Apr 24, 2026

Restore-integrity release. Unbreaks fresh `pip install mempalace` from v3.3.2 by re-tagging current develop, which carries both the plugin.json consumer (shipped in 3.3.2) and the matching mempalace-mcp entry point in pyproject.toml (added on develop ~10h after the 3.3.2 tag via MemPalace#340 by @messelink). MemPalace#1093 diagnosed by @jphein.

Bumps (all 5 sources agree per Version Guard / CLAUDE.md):
- mempalace/version.py 3.3.2 → 3.3.3
- pyproject.toml 3.3.2 → 3.3.3
- .claude-plugin/plugin.json 3.3.2 → 3.3.3
- .claude-plugin/marketplace.json 3.3.2 → 3.3.3
- .codex-plugin/plugin.json 3.3.2 → 3.3.3
- CHANGELOG.md new [3.3.3] entry

No code changes. The fix for MemPalace#1093 is already on develop via merged PRs MemPalace#340, MemPalace#1021, MemPalace#851, MemPalace#942, MemPalace#833, MemPalace#673, MemPalace#661, MemPalace#659, MemPalace#1097, MemPalace#1051, MemPalace#1001, MemPalace#945.

Branch name intentionally outside the `release/*` ruleset so follow-up CI-fix commits aren't gated behind a nested PR. (Supersedes MemPalace#1143, which was closed for exactly that reason after it missed 3 of 5 version files.)

Smoke-tested locally from a fresh develop clone:
    grep mempalace-mcp pyproject.toml .claude-plugin/plugin.json  # both ✓
    python -m build --wheel  # ✓
    pip install …-py3-none-any.whl  # ✓
    which mempalace-mcp  # ✓
    mempalace-mcp --help  # ✓
Problem
`zh-TW` and `zh-CN` are shipped in `mempalace/i18n/` but have no `entity` section. When a Chinese-speaking user calls `detect_entities(..., languages=("zh-TW",))`, `get_entity_patterns()` silently falls back to English (i18n/__init__.py:231-233), so the English candidate pattern `[A-Z][a-z]{1,19}` is applied to Chinese text. Result: zero Chinese names extracted, only Latin-script names embedded in the Chinese document. `ja` and `ko` share the same bug (follow-up PRs).

Reproduction (before this PR)

Approach
Add `entity` sections to `zh-TW.json` and `zh-CN.json` that work within the current framework's constraints:
- `candidate_pattern`: common-surname-prefixed CJK n-grams. ~100 surnames covering >95% of Taiwanese and PRC names. Length is capped at `{1,2}` trailing chars so greedy matching doesn't swallow the trailing verb (e.g. `朱宜振說` → entity `朱宜振說` is wrong).
- `boundary_chars: \u4E00-\u9FFF`: reuses the script-aware `\b` infrastructure from "fix(entity_detector): script-aware word boundaries for combining-mark scripts" #932. Applied to CJK, `\b` fires at CJK↔non-CJK transitions, the same mechanism Devanagari uses.
- `person_verb_patterns`: Chinese verbs attach directly to the name with no whitespace, so patterns are written as `{name}說`, `{name}問`, `{name}決定`, with no `\b` or `\s+` between them.
- `dialogue_patterns`: full-width colon `：`, Chinese quotes 「」『』, plus the standard Latin forms.
- `pronoun_patterns`: 他 / 她 / 它 / 他們 / 她們 / 您 / 咱.
- `stopwords`: ~140 entries covering particles, pronouns, time expressions, question words, conjunctions, UI nouns, and politeness forms.

What you get

Known Limitation (documented in tests)
CJK scripts have no word delimiters. A name flanked by CJK on both sides with no punctuation or whitespace break is not extracted: the framework's `\b(...)\b` wrap can't fire between two CJK characters without a dictionary tokeniser. A test covers this adversarial case explicitly (`test_zh_tw_known_limitation_inline_name_no_boundary`).

In practice this rarely degrades recall: realistic Chinese technical writing has many non-CJK neighbours (bullet lines, inline English, full-width punctuation, newlines), so names that appear 3+ times across a document almost always land at a matchable boundary somewhere. Verified on a realistic zh-TW PKM note: `朱宜振`, appearing in 8 sentences, was extracted 11x with 0.99 person-classification confidence.

Testing
- 7 new tests in `tests/test_entity_detector.py`:
  - `test_zh_tw_candidate_extraction_at_boundaries`
  - `test_zh_tw_person_classification`
  - `test_zh_tw_stopwords_filter_common_particles`
  - `test_zh_tw_falls_back_to_english_for_non_cjk_names`
  - `test_zh_cn_candidate_extraction`
  - `test_zh_cn_and_zh_tw_union_covers_both_variants`
  - `test_zh_tw_known_limitation_inline_name_no_boundary`
- Full suite: 957 passed, 0 failed (`pytest tests/ -q`).
- Lint: `ruff check mempalace/i18n/ tests/test_entity_detector.py`.

Follow-ups (separate PRs)
- `ja.json`: same treatment (currently falls back to English).
- `ko.json`: same treatment.

Checklist
- Tests (`pytest tests/ -v`)
- Lint (`ruff check`)
- Targets `develop` per CONTRIBUTING.md