
fix: extend FTS5 special-char regex to cover dot, slash, backslash, angle brackets, tilde#715

Open

Giggitycountless wants to merge 1 commit into memtomem:main from Giggitycountless:fix/fts5-query-sanitization

Conversation

@Giggitycountless

Summary

_FTS5_SPECIAL_RE was missing 6 FTS5 special characters: ., /, \, <, >, ~. Tokens containing these characters — URLs (https://example.com), filesystem paths (a/b/c), dotted filenames (file.name.ext), YAML frontmatter (key: value), and proximity queries (word~n) — leaked into the FTS5 query parser unquoted and triggered fts5: syntax error.

The error was caught silently (logged at WARNING, HTTP 200, dense-only fallback), so users had no indication that keyword search had failed.
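The failure mode is easy to reproduce with the stdlib `sqlite3` module (a minimal sketch, independent of memtomem's code; the table and column names here are made up):

```python
import sqlite3

# Minimal FTS5 repro (assumes your SQLite build includes FTS5).
# Table/column names are illustrative, not from the project.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
conn.execute("INSERT INTO docs(body) VALUES ('see https://example.com for details')")

# Unsanitized token: '.', '/', ':' are FTS5 query-syntax characters,
# so this raises sqlite3.OperationalError (an fts5 parse error).
try:
    conn.execute("SELECT body FROM docs WHERE docs MATCH ?",
                 ("https://example.com",)).fetchall()
except sqlite3.OperationalError as e:
    print("unquoted:", e)

# Wrapping the token in double quotes makes it a literal phrase
# ("https example com" after unicode61 tokenization), which matches.
rows = conn.execute("SELECT body FROM docs WHERE docs MATCH ?",
                    ('"https://example.com"',)).fetchall()
print("quoted matches:", len(rows))
```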

Changes

fts_tokenizer.py: Extend _FTS5_SPECIAL_RE to the full unicode61 token-class special set. Words containing any special character are now wrapped in double quotes (literal phrase match) instead of receiving a * prefix wildcard.

test_fts_tokenizer.py (new): 45 regression tests covering:

  • Regex matches every FTS5 special character
  • _apply_prefix_wildcard quotes URLs, paths, dotted names, code spans, tilde
  • tokenize_for_fts end-to-end query path for special-character tokens
  • Plain words still get wildcarded, empty input handled

How to test

uv run pytest packages/memtomem/tests/test_fts_tokenizer.py -v

Fixes #697

…ngle brackets, tilde

The _FTS5_SPECIAL_RE character class was missing ., /, \, <, >, and ~
(among others).  Tokens containing these characters — URLs, filesystem
paths, dotted filenames, YAML frontmatter — leaked into the FTS5 query
parser unquoted and triggered 'fts5: syntax error near "."'.

The error was caught silently (HTTP 200, dense-only fallback), so users
had no indication that keyword search had failed.

Extend the regex to the full unicode61 token-class special set and add
45 regression tests covering the regex, the quoting logic, and the
end-to-end tokenize_for_fts query path.

Fixes memtomem#697
@github-actions
Contributor

github-actions Bot commented May 2, 2026

Thank you for your contribution! Before we can merge, please sign the Contributor License Agreement.

To sign, comment on this pull request with the statement below. You only need to sign once per GitHub account.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

@memtomem
Owner

memtomem commented May 2, 2026

Thanks for the careful fix and the thorough test file — really appreciate the parametrized regex coverage plus the end-to-end cases.

I ran the tests locally on the PR branch:

  • pytest test_fts_tokenizer.py → 45/45 pass
  • pre-existing test_usability_fixes.py::TestFTS5HyphenQuoting → 4/4 still pass
  • broader pytest -k "fts or tokeniz or bm25 or search" → 218/218 pass, no adjacent regressions
  • mutation check (revert the regex to the old pattern, re-run the new tests) → 25 of 45 fail — confirms the tests genuinely pin the new behavior

Required before merge

  • Sign the CLA — the bot comment up-thread has the exact text to copy.

  • uv run ruff format — ruff format --check currently fails on three spots, all auto-fixable in one command:

    1. fts_tokenizer.py:18 — single-quoted regex string → ruff prefers double-quoted (r"...").
    2. test_fts_tokenizer.py — class definitions need two blank lines before them (PEP 8) at lines 24, 46, 95.
    3. test_fts_tokenizer.py:90 — trailing whitespace on the 'say* """hello"""' assertion line.

    Note: ruff check already passes — it's only ruff format --check that flags these. Both are gating in CI.

Suggestion (non-blocking)

Could you add a ### Fixed line under [Unreleased] in the top-level CHANGELOG.md describing the sanitization fix?
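An illustrative entry (wording is only a suggestion):

```markdown
### Fixed

- FTS5 query sanitization: quote tokens containing `.` `/` `\` `<` `>` `~`, so
  URLs, paths, and dotted filenames no longer silently fall back to
  dense-only search (#697)
```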

Out of scope (just noting for future, not for this PR)

  • The kiwipiepy backend at fts_tokenizer.py:101 drops tokens matching the regex rather than quoting them. That pre-dates this PR, but the silent-drop surface widens as the regex grows — worth a follow-up.
  • The if safe: branch at fts_tokenizer.py:123 is dead (text.split() never yields empty, .replace only grows). Pre-existing.

Approve

Once the format pass + CLA land, happy to approve and merge. Thanks again for picking this up!


Development

Successfully merging this pull request may close these issues.

BM25 FTS5 query sanitization misses . / \ < > (silent degradation)
