fix(i18n): language code lookup is case-sensitive — '--lang PT-BR' silently falls back to English #927

@arnoldwender

Bug

Language file lookup in mempalace/i18n/__init__.py is case-sensitive and the locale files use inconsistent casing for the region subtag:

mempalace/i18n/de.json
mempalace/i18n/en.json
mempalace/i18n/es.json
mempalace/i18n/fr.json
mempalace/i18n/it.json
mempalace/i18n/ja.json
mempalace/i18n/ko.json
mempalace/i18n/pt-br.json   ← lowercase region
mempalace/i18n/ru.json
mempalace/i18n/zh-CN.json   ← uppercase region
mempalace/i18n/zh-TW.json   ← uppercase region

On case-sensitive filesystems (Linux, macOS volumes formatted as case-sensitive APFS, Windows when Python's path comparison is strict), passing --lang PT-BR, --lang Pt-Br, --lang zh-cn, or --lang ZH-TW silently misses the file. _load_entity_section() returns {}, the merge falls back to English, and entity detection runs in the wrong language with no warning.

The cache key in get_entity_patterns() is the raw tuple(languages), so ("PT-BR",) ≠ ("pt-br",): mixing case across calls bypasses the cache and re-runs the wrong fallback for each spelling. cli.py:81 also persists the wrong-case value to the user's config via cfg.set_entity_languages(languages).

Reproduction

mempalace mine ~/docs --lang PT-BR
# Expected: Brazilian Portuguese entity patterns merged
# Actual: silently runs English-only entity detection; config persists "PT-BR"

Why it matters

  • BCP 47 language tags are case-insensitive by spec (RFC 5646 §2.1.1).
  • The inconsistent file naming (pt-br vs zh-CN) is a footgun for users who copy from documentation that uses the canonical capitalization.
  • The silent English fallback on a multilingual user's content means missed entities and broken cross-room search — the symptom is invisible.
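
For reference, the casing that RFC 5646 §2.1.1 recommends (lowercase by default, two-letter subtags uppercase and four-letter subtags titlecase unless they start the tag or follow a singleton) can be sketched as below. canonical_case() and same_tag() are illustrative names, not existing mempalace functions, and the sketch does no validation of the tag.

```python
def canonical_case(tag):
    """Apply the RFC 5646 recommended casing to a language tag (no validation)."""
    out = []
    after_singleton = False
    for index, subtag in enumerate(tag.split("-")):
        if index > 0 and not after_singleton and len(subtag) == 2:
            out.append(subtag.upper())       # region, e.g. BR
        elif index > 0 and not after_singleton and len(subtag) == 4:
            out.append(subtag.capitalize())  # script, e.g. Hans
        else:
            out.append(subtag.lower())
        if len(subtag) == 1:
            after_singleton = True           # extension / private-use part
    return "-".join(out)


def same_tag(a, b):
    """Tags are case-insensitive, so compare them case-folded."""
    return a.casefold() == b.casefold()


print(canonical_case("pt-br"))      # pt-BR
print(canonical_case("ZH-hans-cn")) # zh-Hans-CN
print(same_tag("PT-BR", "pt-br"))   # True
```

Under this convention a user copying from documentation would type pt-BR, which matches neither the pt-br.json filename nor the case-sensitive lookup.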

Suggested fix

Resolve language codes via a case-insensitive lookup against the actual filenames in _LANG_DIR. Normalize the cache key the same way so callers using different casing share the same merged dict. Apply the same normalization in load_lang().
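
A rough sketch of that approach, not the drafted PR itself: resolve_lang_file() and normalized_key() are hypothetical helper names, and lang_dir stands in for _LANG_DIR. The idea is to match the requested code against the real file stems case-insensitively and to build the cache key from the resolved stems.

```python
from pathlib import Path
from typing import Iterable, Optional, Tuple


def resolve_lang_file(lang_dir: Path, code: str) -> Optional[Path]:
    """Return the locale file whose stem matches `code`, ignoring case."""
    wanted = code.strip().casefold()
    # sorted() keeps the result deterministic if two files differ only in case.
    for path in sorted(lang_dir.glob("*.json")):
        if path.stem.casefold() == wanted:
            return path
    return None


def normalized_key(lang_dir: Path, languages: Iterable[str]) -> Tuple[str, ...]:
    """Build a cache key from on-disk stems so every casing shares one entry."""
    key = []
    for code in languages:
        path = resolve_lang_file(lang_dir, code)
        # Unknown codes are case-folded so they still normalize consistently.
        key.append(path.stem if path is not None else code.strip().casefold())
    return tuple(key)
```

With pt-br.json and zh-CN.json on disk, normalized_key(lang_dir, ["PT-BR", "zh-cn"]) would yield ("pt-br", "zh-CN"), and that normalized value is also what should be written to the user's config. A lookup that resolves to None is a natural place to emit a warning instead of falling back silently.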

Happy to send a PR — already drafted on fix/i18n-lang-case-insensitive.

Labels: area/i18n, bug
