
fix: extend FTS5 special-char regex to cover dot, slash, backslash, angle brackets, tilde#715

Open

Giggitycountless wants to merge 1 commit into memtomem:main from Giggitycountless:fix/fts5-query-sanitization

Conversation

@Giggitycountless

Summary

_FTS5_SPECIAL_RE was missing 6 FTS5 special characters: ., /, \, <, >, ~. Tokens containing these characters — URLs (https://example.com), filesystem paths (a/b/c), dotted filenames (file.name.ext), YAML frontmatter (key: value), and proximity queries (word~n) — leaked into the FTS5 query parser unquoted and triggered fts5: syntax error.

The error was caught silently (logged at WARNING, HTTP 200, dense-only fallback), so users had no indication that keyword search had failed.
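The failure mode is easy to reproduce with the stdlib `sqlite3` module (a minimal sketch, independent of memtomem's code; the table and column names here are made up):

```python
import sqlite3

# Minimal FTS5 repro (assumes your SQLite build includes FTS5).
# Table/column names are illustrative, not from the project.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
conn.execute("INSERT INTO docs(body) VALUES ('see https://example.com for details')")

# Unsanitized token: '.', '/', ':' are FTS5 query-syntax characters,
# so this raises sqlite3.OperationalError (an fts5 parse error).
try:
    conn.execute("SELECT body FROM docs WHERE docs MATCH ?",
                 ("https://example.com",)).fetchall()
except sqlite3.OperationalError as e:
    print("unquoted:", e)

# Wrapping the token in double quotes makes it a literal phrase
# ("https example com" after unicode61 tokenization), which matches.
rows = conn.execute("SELECT body FROM docs WHERE docs MATCH ?",
                    ('"https://example.com"',)).fetchall()
print("quoted matches:", len(rows))
```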

Changes

fts_tokenizer.py: Extend _FTS5_SPECIAL_RE to the full unicode61 token-class special set. Words containing any special character are now wrapped in double quotes (literal phrase match) instead of receiving a * prefix wildcard.

test_fts_tokenizer.py (new): 45 regression tests covering:

  • Regex matches every FTS5 special character
  • _apply_prefix_wildcard quotes URLs, paths, dotted names, code spans, tilde
  • tokenize_for_fts end-to-end query path for special-character tokens
  • Plain words still get wildcarded, empty input handled

How to test

uv run pytest packages/memtomem/tests/test_fts_tokenizer.py -v

Fixes #697

…ngle brackets, tilde

The _FTS5_SPECIAL_RE character class was missing ., /, \, <, >, and ~
(among others).  Tokens containing these characters — URLs, filesystem
paths, dotted filenames, YAML frontmatter — leaked into the FTS5 query
parser unquoted and triggered 'fts5: syntax error near "."'.

The error was caught silently (HTTP 200, dense-only fallback), so users
had no indication that keyword search had failed.

Extend the regex to the full unicode61 token-class special set and add
45 regression tests covering the regex, the quoting logic, and the
end-to-end tokenize_for_fts query path.

Fixes memtomem#697
@github-actions
Contributor

github-actions Bot commented May 2, 2026

Thank you for your contribution! Before we can merge, please sign the Contributor License Agreement.

To sign, comment on this pull request with the statement below. You only need to sign once per GitHub account.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

@memtomem
Owner

memtomem commented May 2, 2026

Thanks for the careful fix and the thorough test file — really appreciate the parametrized regex coverage plus the end-to-end cases.

I ran the tests locally on the PR branch:

  • pytest test_fts_tokenizer.py → 45/45 pass
  • pre-existing test_usability_fixes.py::TestFTS5HyphenQuoting → 4/4 still pass
  • broader pytest -k "fts or tokeniz or bm25 or search" → 218/218 pass, no adjacent regressions
  • mutation check (revert the regex to the old pattern, re-run the new tests) → 25 of 45 fail — confirms the tests genuinely pin the new behavior

Required before merge

  • Sign the CLA — the bot comment up-thread has the exact text to copy.

  • uv run ruff format — ruff format --check currently fails on three spots, all auto-fixable in one command:

    1. fts_tokenizer.py:18 — single-quoted regex string → ruff prefers double-quoted (r"...").
    2. test_fts_tokenizer.py — class definitions need two blank lines before them (PEP 8) at lines 24, 46, 95.
    3. test_fts_tokenizer.py:90 — trailing whitespace on the 'say* """hello"""' assertion line.

    Note: ruff check already passes — it's only ruff format --check that flags these. Both are gating in CI.

Suggestion (non-blocking)

Could you add a ### Fixed line under [Unreleased] in the top-level CHANGELOG.md describing the sanitization fix?
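An illustrative entry (wording is only a suggestion):

```markdown
### Fixed

- FTS5 query sanitization: quote tokens containing `.` `/` `\` `<` `>` `~`, so
  URLs, paths, and dotted filenames no longer silently fall back to
  dense-only search (#697)
```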

Out of scope (just noting for future, not for this PR)

  • The kiwipiepy backend at fts_tokenizer.py:101 drops tokens matching the regex rather than quoting them. That pre-dates this PR, but the silent-drop surface widens as the regex grows — worth a follow-up.
  • The if safe: branch at fts_tokenizer.py:123 is dead (text.split() never yields empty, .replace only grows). Pre-existing.

Approve

Once the format pass + CLA land, happy to approve and merge. Thanks again for picking this up!


Development

Successfully merging this pull request may close these issues.

BM25 FTS5 query sanitization misses . / \ < > (silent degradation)
