Skip to content

fix: use i18n candidate patterns for entity extraction in miner and palace#931

Merged
igorls merged 3 commits intoMemPalace:developfrom
mvalentsev:fix/i18n-entity-metadata
Apr 16, 2026
Merged

fix: use i18n candidate patterns for entity extraction in miner and palace#931
igorls merged 3 commits intoMemPalace:developfrom
mvalentsev:fix/i18n-entity-metadata

Conversation

@mvalentsev
Copy link
Copy Markdown
Contributor

Summary

#911 refactored entity_detector.py to load candidate patterns from i18n
locale JSON, supporting non-Latin scripts. But three other code paths
still hardcode ASCII-only [A-Z][a-z]{2,} for entity name extraction,
silently missing Cyrillic, accented Latin, and other non-Latin names:

  • miner.py _extract_entities_for_metadata() -- drawer metadata tags
  • palace.py build_closet_lines() -- closet index entity tags
  • entity_registry.py extract_unknown_candidates() -- Wikipedia lookup

For example, mining a file with "Михаил написал код" produces zero entity
metadata because [A-Z] never matches Cyrillic uppercase.

Changes

  • Add _candidate_entity_words() helper in palace.py that loads
    candidate patterns from get_entity_patterns() (same i18n source as
    entity_detector), with lazy-cached compiled regexes
  • Replace hardcoded ASCII regex in all 3 locations with the shared helper
  • diary_ingest.py imports _extract_entities_for_metadata from miner,
    so it gets the fix automatically
  • Add regression test: Cyrillic name appears in entity metadata when
    entity_languages includes "ru"

Test plan

  • New test: test_entity_metadata_finds_cyrillic_names
  • Full suite: 945 passed
  • ruff check + ruff format --check: clean

@mvalentsev mvalentsev marked this pull request as ready for review April 16, 2026 00:28
@igorls igorls added bug Something isn't working area/i18n Multilingual, Unicode, non-English embeddings area/mining File and conversation mining labels Apr 16, 2026
…alace

entity_detector.py was refactored in MemPalace#911 to load candidate patterns
from i18n locale JSON files, supporting non-Latin scripts (Cyrillic,
accented Latin, etc.). But three other code paths still hardcoded the
ASCII-only regex [A-Z][a-z]{2,}, silently missing non-Latin entity
names in metadata tagging, closet indexing, and registry lookups.

Replace the hardcoded regex with a shared _candidate_entity_words()
helper that reuses the same i18n candidate_patterns as entity_detector.
@mvalentsev mvalentsev force-pushed the fix/i18n-entity-metadata branch from 58db004 to 973bd62 Compare April 16, 2026 05:37
@mvalentsev
Copy link
Copy Markdown
Contributor Author

Rebased on develop after #932 landed. candidate_patterns from get_entity_patterns() are now pre-wrapped with boundary + capture group, so _candidate_entity_words() compiles them directly without re-wrapping. Tests pass on all platforms.

@Qodo-Free-For-OSS
Copy link
Copy Markdown

Hi, _candidate_entity_words() swallows re.error during pattern compilation and permanently caches the partial/empty compiled list, which can silently disable entity extraction for configured languages.

Severity: action required | Category: reliability

How to fix: Log and avoid caching failures

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

mempalace.palace._candidate_entity_words() silently drops invalid regex patterns (re.error) and then caches the resulting compiled regex list forever. This makes i18n entity extraction fail silently and non-recoverably within the process.

Issue Context

Patterns are loaded from i18n JSON via get_entity_patterns(). A single bad pattern (or any unexpected pattern content) currently yields no visibility and may leave _CANDIDATE_RX_CACHE empty.

Fix Focus Areas

  • mempalace/palace.py[134-160]
    • Log which pattern failed to compile (include language tuple if available)
    • Consider not caching when compilation yields an empty set (or cache keyed by language; see separate finding)
    • Optionally fall back to English candidate patterns if compilation yields none

We noticed a couple of other issues in this PR as well — happy to share if helpful.


Found by Qodo code review

@mvalentsev
Copy link
Copy Markdown
Contributor Author

Fair point on the silent re.error. The try/except is intentionally defensive -- skip a broken pattern rather than crash the whole extraction pipeline. In practice the patterns are simple character classes from static JSON files ([A-Z][a-z]{1,19} and similar), so re.error is not really reachable here.

A warning log would be a reasonable addition but out of scope for this PR, which just swaps ASCII-only regex for the i18n-aware version.

@igorls igorls merged commit 55a004f into MemPalace:develop Apr 16, 2026
6 checks passed
igorls added a commit that referenced this pull request Apr 16, 2026
Bumps version across pyproject.toml, mempalace/version.py, README badge,
and uv.lock. Finalizes the 3.3.0 CHANGELOG section (was still labeled
'Unreleased') and adds a 3.3.1 section covering the multi-language
entity-detection infra and the five new locales landed since 2026-04-13.

Highlights:
- Multi-language entity detection infra (#911) + script-aware word
  boundaries for combining-mark scripts (#932) + BCP 47 case-insensitive
  locale resolution (#928) + i18n patterns wired into miner/palace/
  entity_registry (#931)
- Five new fully-supported locales: pt-br (#156), ru (#760), it (#907),
  hi (#773), id (#778)
- UTF-8 encoding fix on read_text() calls for non-UTF-8 Windows locales
  (#946)
- KnowledgeGraph lock correctness (#884, #887)
- Various smaller fixes and improvements
@igorls igorls mentioned this pull request Apr 16, 2026
8 tasks
shafdev pushed a commit to shafdev/mempalace that referenced this pull request Apr 17, 2026
Bumps version across pyproject.toml, mempalace/version.py, README badge,
and uv.lock. Finalizes the 3.3.0 CHANGELOG section (was still labeled
'Unreleased') and adds a 3.3.1 section covering the multi-language
entity-detection infra and the five new locales landed since 2026-04-13.

Highlights:
- Multi-language entity detection infra (MemPalace#911) + script-aware word
  boundaries for combining-mark scripts (MemPalace#932) + BCP 47 case-insensitive
  locale resolution (MemPalace#928) + i18n patterns wired into miner/palace/
  entity_registry (MemPalace#931)
- Five new fully-supported locales: pt-br (MemPalace#156), ru (MemPalace#760), it (MemPalace#907),
  hi (MemPalace#773), id (MemPalace#778)
- UTF-8 encoding fix on read_text() calls for non-UTF-8 Windows locales
  (MemPalace#946)
- KnowledgeGraph lock correctness (MemPalace#884, MemPalace#887)
- Various smaller fixes and improvements
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/i18n Multilingual, Unicode, non-English embeddings area/mining File and conversation mining bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants