fix: use i18n candidate patterns for entity extraction in miner and palace by mvalentsev · Pull Request #931 · MemPalace/mempalace

mvalentsev · 2026-04-16T00:24:07Z

Summary

#911 refactored entity_detector.py to load candidate patterns from i18n
locale JSON, supporting non-Latin scripts. But three other code paths
still hardcode ASCII-only [A-Z][a-z]{2,} for entity name extraction,
silently missing Cyrillic, accented Latin, and other non-Latin names:

miner.py _extract_entities_for_metadata() -- drawer metadata tags
palace.py build_closet_lines() -- closet index entity tags
entity_registry.py extract_unknown_candidates() -- Wikipedia lookup

For example, mining a file with "Михаил написал код" produces zero entity
metadata because [A-Z] never matches Cyrillic uppercase.

Changes

Add _candidate_entity_words() helper in palace.py that loads
candidate patterns from get_entity_patterns() (same i18n source as
entity_detector), with lazy-cached compiled regexes
Replace hardcoded ASCII regex in all 3 locations with the shared helper
diary_ingest.py imports _extract_entities_for_metadata from miner,
so it gets the fix automatically
Add regression test: Cyrillic name appears in entity metadata when
entity_languages includes "ru"

Test plan

New test: test_entity_metadata_finds_cyrillic_names
Full suite: 945 passed
ruff check + ruff format --check: clean

…alace entity_detector.py was refactored in MemPalace#911 to load candidate patterns from i18n locale JSON files, supporting non-Latin scripts (Cyrillic, accented Latin, etc.). But three other code paths still hardcoded the ASCII-only regex [A-Z][a-z]{2,}, silently missing non-Latin entity names in metadata tagging, closet indexing, and registry lookups. Replace the hardcoded regex with a shared _candidate_entity_words() helper that reuses the same i18n candidate_patterns as entity_detector.

mvalentsev · 2026-04-16T05:37:47Z

Rebased on develop after #932 landed. candidate_patterns from get_entity_patterns() are now pre-wrapped with boundary + capture group, so _candidate_entity_words() compiles them directly without re-wrapping. Tests pass on all platforms.

Qodo-Free-For-OSS · 2026-04-16T13:36:06Z

Hi, _candidate_entity_words() swallows re.error during pattern compilation and permanently caches the partial/empty compiled list, which can silently disable entity extraction for configured languages.

Severity: action required | Category: reliability

How to fix: Log and avoid caching failures

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

mempalace.palace._candidate_entity_words() silently drops invalid regex patterns (re.error) and then caches the resulting compiled regex list forever. This makes i18n entity extraction fail silently and non-recoverably within the process.

Issue Context

Patterns are loaded from i18n JSON via get_entity_patterns(). A single bad pattern (or any unexpected pattern content) currently yields no visibility and may leave _CANDIDATE_RX_CACHE empty.

Fix Focus Areas

mempalace/palace.py[134-160]

Log which pattern failed to compile (include language tuple if available)

Consider not caching when compilation yields an empty set (or cache keyed by language; see separate finding)

Optionally fall back to English candidate patterns if compilation yields none

We noticed a couple of other issues in this PR as well — happy to share if helpful.

Found by Qodo code review

mvalentsev · 2026-04-16T13:58:14Z

Fair point on the silent re.error. The try/except is intentionally defensive -- skip a broken pattern rather than crash the whole extraction pipeline. In practice the patterns are simple character classes from static JSON files ([A-Z][a-z]{1,19} and similar), so re.error is not really reachable here.

A warning log would be a reasonable addition but out of scope for this PR, which just swaps ASCII-only regex for the i18n-aware version.

Bumps version across pyproject.toml, mempalace/version.py, README badge, and uv.lock. Finalizes the 3.3.0 CHANGELOG section (was still labeled 'Unreleased') and adds a 3.3.1 section covering the multi-language entity-detection infra and the five new locales landed since 2026-04-13. Highlights: - Multi-language entity detection infra (#911) + script-aware word boundaries for combining-mark scripts (#932) + BCP 47 case-insensitive locale resolution (#928) + i18n patterns wired into miner/palace/ entity_registry (#931) - Five new fully-supported locales: pt-br (#156), ru (#760), it (#907), hi (#773), id (#778) - UTF-8 encoding fix on read_text() calls for non-UTF-8 Windows locales (#946) - KnowledgeGraph lock correctness (#884, #887) - Various smaller fixes and improvements

Bumps version across pyproject.toml, mempalace/version.py, README badge, and uv.lock. Finalizes the 3.3.0 CHANGELOG section (was still labeled 'Unreleased') and adds a 3.3.1 section covering the multi-language entity-detection infra and the five new locales landed since 2026-04-13. Highlights: - Multi-language entity detection infra (MemPalace#911) + script-aware word boundaries for combining-mark scripts (MemPalace#932) + BCP 47 case-insensitive locale resolution (MemPalace#928) + i18n patterns wired into miner/palace/ entity_registry (MemPalace#931) - Five new fully-supported locales: pt-br (MemPalace#156), ru (MemPalace#760), it (MemPalace#907), hi (MemPalace#773), id (MemPalace#778) - UTF-8 encoding fix on read_text() calls for non-UTF-8 Windows locales (MemPalace#946) - KnowledgeGraph lock correctness (MemPalace#884, MemPalace#887) - Various smaller fixes and improvements

mvalentsev marked this pull request as ready for review April 16, 2026 00:28

mvalentsev requested review from bensig, igorls and milla-jovovich as code owners April 16, 2026 00:28

igorls added bug Something isn't working area/i18n Multilingual, Unicode, non-English embeddings area/mining File and conversation mining labels Apr 16, 2026

mvalentsev added 2 commits April 16, 2026 10:35

fix: use pre-wrapped candidate patterns after MemPalace#932 refactor

973bd62

mvalentsev force-pushed the fix/i18n-entity-metadata branch from 58db004 to 973bd62 Compare April 16, 2026 05:37

remove unnecessary comment

cde0f5b

igorls merged commit 55a004f into MemPalace:develop Apr 16, 2026
6 checks passed

igorls mentioned this pull request Apr 16, 2026

release: v3.3.1 #957

Merged

8 tasks

mvalentsev mentioned this pull request Apr 19, 2026

feat(i18n): add Hebrew language support #1031

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: use i18n candidate patterns for entity extraction in miner and palace#931

fix: use i18n candidate patterns for entity extraction in miner and palace#931
igorls merged 3 commits intoMemPalace:developfrom
mvalentsev:fix/i18n-entity-metadata

mvalentsev commented Apr 16, 2026

Uh oh!

mvalentsev commented Apr 16, 2026

Uh oh!

Qodo-Free-For-OSS commented Apr 16, 2026

Issue description

Issue Context

Fix Focus Areas

Uh oh!

mvalentsev commented Apr 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

mvalentsev commented Apr 16, 2026

Summary

Changes

Test plan

Uh oh!

mvalentsev commented Apr 16, 2026

Uh oh!

Qodo-Free-For-OSS commented Apr 16, 2026

Issue description

Issue Context

Fix Focus Areas

Uh oh!

mvalentsev commented Apr 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants