feat(i18n): add Ukrainian language support #994

Open
alpiua wants to merge 1 commit into MemPalace:develop from alpiua:feat/i18n-uk

@alpiua
Contributor

@alpiua alpiua commented Apr 18, 2026

feat(i18n): Add Ukrainian language support

Description

This PR adds comprehensive native Ukrainian language support to the MemPalace entity detection pipeline by introducing the uk.json locale.

Motivation & Context

With the multi-language, JSON-based entity detection architecture introduced in v3.3.1, adding a locale is now entirely configuration-driven. This PR adds mappings under the BCP 47 tag uk to handle Cyrillic Ukrainian input natively, so teams working with Ukrainian documentation or notes can use agentic memory and knowledge graph extraction.

Key Additions

  • Complete Lexical Patterns: Full coverage of Ukrainian pronouns, dialogue markers (e.g., сказав, прокоментував), action verbs (e.g., відчуває, вирішив), and direct addresses (e.g., добрий день {name}).
  • Hybrid Character Classes: The candidate_pattern combines Cyrillic and Latin ranges in its character classes ([А-ЯІЇЄҐA-Z][а-яіїєґ'a-z]). Ukrainian workspaces often mix English tech terminology with Ukrainian prose, so this keeps the candidate extractor from dropping Latin project names embedded in Cyrillic sentences.
  • Project Context Synonyms: Native action patterns dedicated to IT projects: репозиторій, пайплайн, задеплоїв, деплою, розгорнув etc., dynamically mapped to the {name} variable.
  • AAAK Prompt Localization: Translated the AAAK indexing instruction to direct an LLM in native Ukrainian for better context retention when compressing documents.
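
As a rough illustration of the hybrid character classes described above, the sketch below applies the two classes to a mixed Ukrainian/English sentence. The trailing `+` quantifier, the `find_candidates` helper, and the sample sentences are assumptions made for this example; the actual candidate_pattern in uk.json may be anchored or quantified differently.

```python
import re

# Assumed shape: one uppercase Cyrillic-or-Latin letter followed by one or
# more lowercase letters (the apostrophe covers names such as Мар'яна).
# The "+" quantifier is an assumption, not taken from uk.json.
CANDIDATE = re.compile(r"[А-ЯІЇЄҐA-Z][а-яіїєґ'a-z]+")


def find_candidates(text: str) -> list[str]:
    """Return every capitalised Cyrillic or Latin word found in the text."""
    return CANDIDATE.findall(text)


# A Cyrillic name and a Latin project name in the same sentence.
print(find_candidates("Олена задеплоїла Mempalace на сервер."))
# A name containing an apostrophe.
print(find_candidates("Мар'яна переглянула звіт."))
```

With a Cyrillic-only class, the Latin project name in the first sentence would not be picked up as a candidate, which is the case the hybrid ranges are meant to cover.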

How to Test

Verified locally by running detect_entities against markdown files containing conversational Ukrainian mixed with English tool names, successfully categorizing Cyrillic names and Latin project identifiers.

@mvalentsev
Contributor

mvalentsev commented Apr 18, 2026

uk.json declares direct_address_patterns (plural, list of 10 regexes) under entity. The loader at mempalace/i18n/__init__.py:209-210 only reads the singular key direct_address_pattern as a single |-alternation string; the plural key is never consulted. Every working locale (en, ru, it, pt-br, hi, id) uses the singular form. Effect after merge: all 10 direct-address patterns in this PR are silently dropped, and Ukrainian users get no direct-address entity detection until the schema matches.

The fix is to collapse the 10 alternatives into one regex and rename the key:

"direct_address_pattern": "\\bпривіт\\s+{name}\\b|\\bдякую\\s+{name}\\b|\\bдобридень\\s+{name}\\b|\\bдобрий\\s+день\\s+{name}\\b|\\bшановний\\s+{name}\\b|\\bшановна\\s+{name}\\b|\\bдорогий\\s+{name}\\b|\\bдорога\\s+{name}\\b|\\bhey\\s+{name}\\b|\\bhi\\s+{name}\\b"

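The collapse itself is mechanical. A minimal sketch, assuming the list entries are plain regex strings and that `{name}` is filled in by simple string substitution (the substitution step and the `matches_direct_address` helper are hypothetical; the real loader may do this differently):

```python
import re

# Three of the ten alternatives, as they would appear in the plural list form.
patterns = [
    r"\bпривіт\s+{name}\b",
    r"\bдякую\s+{name}\b",
    r"\bhi\s+{name}\b",
]

# Collapse the list into the single "|"-alternation string that the
# singular direct_address_pattern key expects.
combined = "|".join(patterns)


def matches_direct_address(text: str, name: str) -> bool:
    """Hypothetical check: substitute the name, then search case-insensitively."""
    regex = re.compile(combined.replace("{name}", re.escape(name)), re.IGNORECASE)
    return regex.search(text) is not None


print(combined)
print(matches_direct_address("Привіт Олена, як справи?", "Олена"))
```
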
Minor, separately: two person_verb_patterns have a literal hyphen in their alternation group:

"\\b{name}\\s+засмія(-|в|л)(ся|ась|ись)?\\b"
"\\b{name}\\s+посміхну(-|в|л)(ся|ась|ись)?\\b"

(-|в|л) matches a literal - or в or л. Natural past-tense forms are засміявся / посміхнувся without a hyphen, so the intended group is (в|л).
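
To make the difference concrete, here is a small sketch comparing the two groups. The `build` helper and the `{name}` substitution via `str.replace` are assumptions for illustration only, not the project's actual template handling.

```python
import re

BUGGY = r"\b{name}\s+засмія(-|в|л)(ся|ась|ись)?\b"
FIXED = r"\b{name}\s+засмія(в|л)(ся|ась|ись)?\b"


def build(template: str, name: str) -> re.Pattern:
    """Hypothetical helper: fill in the name and compile the pattern."""
    return re.compile(template.replace("{name}", re.escape(name)))


# Both versions accept the natural past-tense forms.
print(bool(build(BUGGY, "Тарас").search("Тарас засміявся")))
print(bool(build(FIXED, "Олена").search("Олена засміялась")))

# Only the buggy version also accepts a stray literal hyphen.
print(bool(build(BUGGY, "Олена").search("Олена засмія-ся")))
print(bool(build(FIXED, "Олена").search("Олена засмія-ся")))
```

Since both alternatives are single characters, the group could equally be written as the character class `[вл]`.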

Otherwise topic_pattern, quote_pattern («»), candidate_pattern, and the stop_words / stopwords split align with ru.json. Once the direct-address key is corrected the file should load cleanly.

@alpiua
Contributor Author

alpiua commented Apr 18, 2026

@mvalentsev sorry, missed that.
thank you for checking.

@igorls
Member

igorls commented Apr 21, 2026

Thanks for the Ukrainian locale — content is structurally solid, and the hybrid Cyrillic+Latin candidate_pattern is a nice touch for UK/English mixed tech prose. Two small things before I can merge:

  1. Drop multi-word stopwords. "будь ласка" in the stopwords list can never fire — the BM25 tokenizer splits on whitespace (\w{2,} tokens), so space-containing entries don't match any single token. Either split it into separate single-word entries or remove it. Same for any other multi-word phrases in the list.

  2. Add a uk sample to tests/test_i18n.py::test_dialect_compress_samples — every other shipped locale has one, and it's a useful smoke test that Dialect round-trips your locale correctly. Something like:

    "uk": "Ми вирішили перейти з SQLite на PostgreSQL для кращого паралельного запису. Бен учора схвалив PR.",

I'll kick off CI once you push the fixes.
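
The stopword point in item 1 can be checked with a short sketch. It assumes the tokenizer is equivalent to lowercasing plus `\w{2,}` matching, as described above; the `tokenize` and `dead_stopwords` helpers are hypothetical names for this example, not functions from the codebase.

```python
import re

# Assumed tokenizer behaviour: lowercase, then keep runs of 2+ word characters.
TOKEN = re.compile(r"\w{2,}")


def tokenize(text: str) -> list[str]:
    return TOKEN.findall(text.lower())


def dead_stopwords(stopwords: list[str]) -> list[str]:
    """Return entries that can never equal a single token under this tokenizer."""
    return [w for w in stopwords if tokenize(w) != [w.lower()]]


tokens = tokenize("Будь ласка, перевір PR")
print(tokens)                  # the phrase arrives as two separate tokens
print("будь ласка" in tokens)  # the multi-word entry never matches one token
print(dead_stopwords(["будь ласка", "дякую", "та"]))
```

A check along these lines could be run over a locale's stopword list before pushing, to catch space-containing entries that would otherwise sit in the file without effect.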

@igorls added the enhancement (New feature or request) and area/i18n (Multilingual, Unicode, non-English embeddings) labels Apr 24, 2026
marc252 added a commit to marc252/mempalace that referenced this pull request May 4, 2026
- Drop 5 dead entries from entity.stopwords:
  - 4 single-char ("a", "i", "o", "u") never match candidates that
    require >=2 chars per candidate_pattern
  - 1 multi-word ("si us plau") never matches single-token splits
- Add "ca" sample to test_dialect_compress_samples so the locale
  is exercised on every CI run

Same shape as the fixes applied to MemPalace#994 (uk locale).
@alpiua alpiua requested a review from igorls as a code owner May 6, 2026 13:55