Skip to content

fix: allow Unicode in sanitize_name() — Latvian, CJK, Cyrillic (#637)#683

Merged
bensig merged 2 commits intoMemPalace:developfrom
jphein:fix/sanitize-unicode
Apr 12, 2026
Merged

fix: allow Unicode in sanitize_name() — Latvian, CJK, Cyrillic (#637)#683
bensig merged 2 commits intoMemPalace:developfrom
jphein:fix/sanitize-unicode

Conversation

@jphein
Copy link
Copy Markdown
Collaborator

@jphein jphein commented Apr 12, 2026

Summary

  • _SAFE_NAME_RE was ASCII-only ([a-zA-Z0-9]), rejecting valid Unicode names like "Jānis", "太郎", or "Алексей"
  • Changed to \w which matches Unicode word characters (letters, digits, underscore) in Python 3

Files changed

  • mempalace/config.py — line 19

Test plan

  • Existing tests pass
  • Verify names with Unicode characters (e.g., sanitize_name("Jānis")) are accepted

🤖 Generated with Claude Code

…emPalace#637)

_SAFE_NAME_RE was ASCII-only ([a-zA-Z0-9]), rejecting valid Unicode
names like "Jānis" or "太郎". Changed to \w which matches Unicode
word characters (letters, digits, underscore) in Python 3.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Copilot AI review requested due to automatic review settings April 12, 2026 07:08
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates name sanitization to accept Unicode characters in wing/room/entity names (e.g., Latvian diacritics, CJK, Cyrillic) by changing the safe-name regex.

Changes:

  • Broaden _SAFE_NAME_RE from ASCII-only alphanumerics to Unicode-aware word characters.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mempalace/config.py Outdated

MAX_NAME_LENGTH = 128
_SAFE_NAME_RE = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9_ .'-]{0,126}[a-zA-Z0-9]?$")
_SAFE_NAME_RE = re.compile(r"^[\w][\w .'-]{0,126}[\w]?$")
Copy link

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change expands allowed names beyond Unicode letters/digits: \w includes _, so the new pattern now permits leading/trailing underscores (e.g. _foo, foo_, or even _). Previously the first/last characters were restricted to alphanumerics only. If you only intended to add Unicode support while keeping that constraint, use a first/last-character class that excludes _ (e.g. an alphanumeric-only Unicode class) while still allowing _ in the middle.

Suggested change
_SAFE_NAME_RE = re.compile(r"^[\w][\w .'-]{0,126}[\w]?$")
_SAFE_NAME_RE = re.compile(r"^(?:[^\W_]|[^\W_][\w .'-]{0,126}[^\W_])$")

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — applied the suggested [^\W_] pattern for first/last chars. Leading/trailing underscores are now rejected while Unicode letters pass. f7c4715

Comment thread mempalace/config.py Outdated

MAX_NAME_LENGTH = 128
_SAFE_NAME_RE = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9_ .'-]{0,126}[a-zA-Z0-9]?$")
_SAFE_NAME_RE = re.compile(r"^[\w][\w .'-]{0,126}[\w]?$")
Copy link

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation behavior for wing/room/entity names changes here, but there are no direct unit tests covering sanitize_name() (existing tests/test_config*.py only cover config loading). Please add tests that assert representative Unicode names (Latvian/CJK/Cyrillic) are accepted and that invalid characters/path traversal are still rejected, so future regex tweaks don’t silently change behavior.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 7 tests: Latvian (Jānis), CJK (太郎), Cyrillic (Алексей), leading underscore rejection, path traversal rejection, empty string rejection, and basic ASCII. f7c4715

Use [^\W_] for first/last char to allow Unicode letters/digits but
reject leading/trailing underscores (Copilot feedback). Add 7 tests
covering Latvian, CJK, Cyrillic, path traversal, and edge cases.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@jphein jphein requested a review from igorls as a code owner April 12, 2026 07:26
Copy link
Copy Markdown
Collaborator

@bensig bensig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review + security audit clean.

@bensig bensig merged commit 6e2ced3 into MemPalace:develop Apr 12, 2026
6 checks passed
jphein added a commit to jphein/mempalace that referenced this pull request Apr 12, 2026
Upstream merged MemPalace#682-684 (our splits), MemPalace#687 (dry-run None room),
MemPalace#695/MemPalace#708 (convo_miner full response), MemPalace#732 (0-chunk re-processing),
plus VitePress docs site. Conflicts:
- config.py: take upstream's [^\W_] regex (our MemPalace#683 merged version)
- miner.py: integrate upstream's early-return for tiny files, dedupe
  dry-run read path
- test_miner.py: keep our detect_room tests + upstream's dry-run test
- CONTRIBUTING.md: take upstream's org URL update

Co-Authored-By: Claude Opus 4.6 <[email protected]>
jphein added a commit to jphein/mempalace that referenced this pull request Apr 12, 2026
…ged, add MemPalace#681, bump test count to 715

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@igorls igorls mentioned this pull request Apr 13, 2026
4 tasks
@jphein jphein deleted the fix/sanitize-unicode branch April 19, 2026 03:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants