feat(i18n): add Ukrainian language support #994
alpiua wants to merge 1 commit into MemPalace:develop
The fix is to collapse the 10 alternatives into one regex and rename the key:

"direct_address_pattern": "\\bпривіт\\s+{name}\\b|\\bдякую\\s+{name}\\b|\\bдобридень\\s+{name}\\b|\\bдобрий\\s+день\\s+{name}\\b|\\bшановний\\s+{name}\\b|\\bшановна\\s+{name}\\b|\\bдорогий\\s+{name}\\b|\\bдорога\\s+{name}\\b|\\bhey\\s+{name}\\b|\\bhi\\s+{name}\\b"

Minor, separately: two

"\\b{name}\\s+засмія(-|в|л)(ся|ась|ись)?\\b"
"\\b{name}\\s+посміхну(-|в|л)(ся|ась|ись)?\\b"
Otherwise
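For readers following the review thread, here is a minimal sketch of how a single-alternation direct_address_pattern with a {name} placeholder might be consumed. The helper names (compile_direct_address, is_directly_addressed) and the shortened three-alternative pattern are hypothetical and are not taken from the MemPalace codebase; the actual loader may substitute and compile the pattern differently.

```python
import re

# Hypothetical, shortened version of the single-alternation pattern from the
# review comment (three alternatives instead of ten, for readability).
DIRECT_ADDRESS_PATTERN = r"\bпривіт\s+{name}\b|\bдякую\s+{name}\b|\bhey\s+{name}\b"


def compile_direct_address(name: str) -> re.Pattern:
    """Substitute an escaped name into the template and compile it."""
    # str.replace (not str.format) leaves any other regex braces untouched.
    pattern = DIRECT_ADDRESS_PATTERN.replace("{name}", re.escape(name))
    # IGNORECASE is an assumption here so that "Привіт" and "привіт" both match.
    return re.compile(pattern, re.IGNORECASE)


def is_directly_addressed(text: str, name: str) -> bool:
    """Return True if any greeting alternative is followed by the name."""
    return compile_direct_address(name).search(text) is not None
```

With one string holding all alternatives, the name is substituted and the regex compiled once per candidate, rather than once per list entry.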
@mvalentsev sorry, missed that.
Thanks for the Ukrainian locale — content is structurally solid, and the hybrid Cyrillic+Latin
I'll kick off CI once you push the fixes.
- Drop 5 dead entries from entity.stopwords:
- 4 single-char ("a", "i", "o", "u") never match candidates that
require >=2 chars per candidate_pattern
- 1 multi-word ("si us plau") never matches single-token splits
- Add "ca" sample to test_dialect_compress_samples so the locale
is exercised on every CI run
Same shape as the fixes applied to MemPalace#994 (uk locale).
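
The dead-entry reasoning in the commit message above can be checked mechanically. The sketch below is an illustration only: find_dead_stopwords is a hypothetical helper, the candidate pattern is an assumed stand-in (a capitalised token of at least two characters), and it assumes stopwords are compared against single, case-normalised tokens.

```python
import re

# Assumed stand-in for a locale's candidate_pattern: one uppercase letter
# followed by at least one lowercase letter or apostrophe (>= 2 chars total).
CANDIDATE_PATTERN = re.compile(r"[A-Z][a-z']+")


def find_dead_stopwords(stopwords, candidate_pattern=CANDIDATE_PATTERN):
    """Return stopwords that could never equal an extracted candidate."""
    dead = []
    for word in stopwords:
        # Multi-word entries cannot match a single-token candidate.
        if len(word.split()) > 1:
            dead.append(word)
            continue
        # Entries whose capitalised form the pattern rejects (for example
        # single characters) are never produced by the extractor.
        if candidate_pattern.fullmatch(word.capitalize()) is None:
            dead.append(word)
    return dead
```

Running it on a list such as ["a", "i", "per", "si us plau", "amb", "o", "u"] flags the four single-character entries and the multi-word entry, and leaves "per" and "amb" alone.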
feat(i18n): Add Ukrainian language support
Description
This PR adds comprehensive native Ukrainian language support to the MemPalace entity detection pipeline by introducing the uk.json locale.

Motivation & Context
With the introduction of the multi-language JSON-based entity detection architecture in v3.3.1, adding robust localization is now entirely configuration-driven. This PR introduces the BCP 47 uk mappings to natively handle Cyrillic Ukrainian inputs, enabling teams working with Ukrainian documentation or notes to fully utilize agentic memory and knowledge graph extraction.

Key Additions
- сказав, прокоментував), action verbs (e.g., відчуває, вирішив), and direct addresses (e.g., добрий день {name}).
- candidate_pattern leverages the combined [А-ЯІЇЄҐA-Z][а-яіїєґ'a-z] regex range. Since Ukrainian workspaces often mix English tech terminology with Ukrainian prose, this guarantees that Latin project names embedded inside Cyrillic sentences won't be dropped by the candidate extractor.
- репозиторій, пайплайн, задеплоїв, деплою, розгорнув, etc., dynamically mapped to the {name} variable.

How to Test
Verified locally by running detect_entities against markdown files containing conversational Ukrainian mixed with English tool names, successfully categorizing Cyrillic names and Latin project identifiers.
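
To illustrate the hybrid Cyrillic+Latin candidate_pattern described under Key Additions, here is a minimal sketch. It assumes the two character classes are followed by a `+` quantifier (the description shows only the classes), and extract_candidates is a hypothetical helper, not the project's detect_entities.

```python
import re

# The two character classes come from the PR description; the trailing "+"
# quantifier is an assumption made for this sketch.
CANDIDATE_PATTERN = re.compile(r"[А-ЯІЇЄҐA-Z][а-яіїєґ'a-z]+")


def extract_candidates(text: str) -> list[str]:
    """Return capitalised Cyrillic or Latin tokens found in the text."""
    return CANDIDATE_PATTERN.findall(text)


# A Ukrainian sentence with a Latin tool name embedded in it.
sample = "Вчора Олена задеплоїла Redis на сервер"
print(extract_candidates(sample))
```

In this sketch, both the Cyrillic name and the Latin tool name come back as candidates, while lowercase words such as задеплоїла are skipped. The apostrophe in the second class lets names like Мар'яна match as a single token. Under the assumed quantifier, a CamelCase name such as MemPalace would be returned as two candidates, Mem and Palace.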