fix: allow Unicode in sanitize_name() — Latvian, CJK, Cyrillic (#637)#683
fix: allow Unicode in sanitize_name() — Latvian, CJK, Cyrillic (#637)#683bensig merged 2 commits intoMemPalace:developfrom
Conversation
…emPalace#637) _SAFE_NAME_RE was ASCII-only ([a-zA-Z0-9]), rejecting valid Unicode names like "Jānis" or "太郎". Changed to \w which matches Unicode word characters (letters, digits, underscore) in Python 3. Co-Authored-By: Claude Opus 4.6 <[email protected]>
There was a problem hiding this comment.
Pull request overview
Updates name sanitization to accept Unicode characters in wing/room/entity names (e.g., Latvian diacritics, CJK, Cyrillic) by changing the safe-name regex.
Changes:
- Broaden
_SAFE_NAME_REfrom ASCII-only alphanumerics to Unicode-aware word characters.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| MAX_NAME_LENGTH = 128 | ||
| _SAFE_NAME_RE = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9_ .'-]{0,126}[a-zA-Z0-9]?$") | ||
| _SAFE_NAME_RE = re.compile(r"^[\w][\w .'-]{0,126}[\w]?$") |
There was a problem hiding this comment.
This change expands allowed names beyond Unicode letters/digits: \w includes _, so the new pattern now permits leading/trailing underscores (e.g. _foo, foo_, or even _). Previously the first/last characters were restricted to alphanumerics only. If you only intended to add Unicode support while keeping that constraint, use a first/last-character class that excludes _ (e.g. an alphanumeric-only Unicode class) while still allowing _ in the middle.
| _SAFE_NAME_RE = re.compile(r"^[\w][\w .'-]{0,126}[\w]?$") | |
| _SAFE_NAME_RE = re.compile(r"^(?:[^\W_]|[^\W_][\w .'-]{0,126}[^\W_])$") |
There was a problem hiding this comment.
Fixed — applied the suggested [^\W_] pattern for first/last chars. Leading/trailing underscores are now rejected while Unicode letters pass. f7c4715
|
|
||
| MAX_NAME_LENGTH = 128 | ||
| _SAFE_NAME_RE = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9_ .'-]{0,126}[a-zA-Z0-9]?$") | ||
| _SAFE_NAME_RE = re.compile(r"^[\w][\w .'-]{0,126}[\w]?$") |
There was a problem hiding this comment.
The validation behavior for wing/room/entity names changes here, but there are no direct unit tests covering sanitize_name() (existing tests/test_config*.py only cover config loading). Please add tests that assert representative Unicode names (Latvian/CJK/Cyrillic) are accepted and that invalid characters/path traversal are still rejected, so future regex tweaks don’t silently change behavior.
There was a problem hiding this comment.
Added 7 tests: Latvian (Jānis), CJK (太郎), Cyrillic (Алексей), leading underscore rejection, path traversal rejection, empty string rejection, and basic ASCII. f7c4715
Use [^\W_] for first/last char to allow Unicode letters/digits but reject leading/trailing underscores (Copilot feedback). Add 7 tests covering Latvian, CJK, Cyrillic, path traversal, and edge cases. Co-Authored-By: Claude Opus 4.6 <[email protected]>
bensig
left a comment
There was a problem hiding this comment.
Code review + security audit clean.
Upstream merged MemPalace#682-684 (our splits), MemPalace#687 (dry-run None room), MemPalace#695/MemPalace#708 (convo_miner full response), MemPalace#732 (0-chunk re-processing), plus VitePress docs site. Conflicts: - config.py: take upstream's [^\W_] regex (our MemPalace#683 merged version) - miner.py: integrate upstream's early-return for tiny files, dedupe dry-run read path - test_miner.py: keep our detect_room tests + upstream's dry-run test - CONTRIBUTING.md: take upstream's org URL update Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ged, add MemPalace#681, bump test count to 715 Co-Authored-By: Claude Opus 4.6 <[email protected]>
Summary
_SAFE_NAME_REwas ASCII-only ([a-zA-Z0-9]), rejecting valid Unicode names like "Jānis", "太郎", or "Алексей"\wwhich matches Unicode word characters (letters, digits, underscore) in Python 3Files changed
mempalace/config.py— line 19Test plan
sanitize_name("Jānis")) are accepted🤖 Generated with Claude Code