fix: extend FTS5 special-char regex to cover dot, slash, backslash, angle brackets, tilde #715
The _FTS5_SPECIAL_RE character class was missing ., /, \, <, >, and ~ (among others). Tokens containing these characters (URLs, filesystem paths, dotted filenames, YAML frontmatter) leaked into the FTS5 query parser unquoted and triggered 'fts5: syntax error near "."'. The error was caught silently (HTTP 200, dense-only fallback), so users had no indication that keyword search had failed.
Extend the regex to the full unicode61 token-class special set and add 45 regression tests covering the regex, the quoting logic, and the end-to-end tokenize_for_fts query path.
Fixes memtomem#697
Thank you for your contribution! Before we can merge, please sign the Contributor License Agreement. To sign, comment on this pull request with the statement below. You only need to sign once per GitHub account.
I have read the CLA Document and I hereby sign the CLA
You can retrigger this bot by commenting recheck in this Pull Request.
Posted by the CLA Assistant Lite bot.
Thanks for the careful fix and the thorough test file — really appreciate the parametrized regex coverage plus the end-to-end cases. I ran the tests locally on the PR branch:
Required before merge
Suggestion (non-blocking)
Could you add a
Out of scope (just noting for future, not for this PR)
Approve
Once the format pass + CLA land, happy to approve and merge. Thanks again for picking this up!
Summary
_FTS5_SPECIAL_RE was missing six FTS5 special characters: ., /, \, <, >, ~. Tokens containing these characters, such as URLs (https://example.com), filesystem paths (a/b/c), dotted filenames (file.name.ext), YAML frontmatter (key: value), and proximity queries (word~n), leaked into the FTS5 query parser unquoted and triggered fts5: syntax error.
The error was caught silently (logged at WARNING, HTTP 200, dense-only fallback), so users had no indication that keyword search had failed.
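The failure mode can be reproduced outside the project with Python's standard sqlite3 module. This is a minimal sketch, assuming the local SQLite build includes the FTS5 extension; the table name docs and the helper fts5_accepts are hypothetical and do not come from this PR.

```python
import sqlite3


def fts5_accepts(query: str) -> bool:
    """Return True if the FTS5 query parser accepts `query` as a MATCH expression.

    Hypothetical helper for illustration only. Assumes SQLite was built with FTS5.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
        conn.execute(
            "INSERT INTO docs(body) VALUES (?)",
            ("see file.name.ext and https://example.com",),
        )
        try:
            conn.execute(
                "SELECT rowid FROM docs WHERE docs MATCH ?", (query,)
            ).fetchall()
            return True
        except sqlite3.OperationalError:
            # FTS5 reports query syntax problems as OperationalError.
            return False
    finally:
        conn.close()


if __name__ == "__main__":
    for q in ["file.name.ext", '"file.name.ext"', "a/b/c", '"a/b/c"']:
        print(f"{q!r}: accepted={fts5_accepts(q)}")
```

The unquoted forms are rejected by the parser, while the same text wrapped in double quotes is treated as a phrase and accepted.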
Changes
- fts_tokenizer.py: Extend _FTS5_SPECIAL_RE to the full unicode61 token-class special set. Words containing any special character are now wrapped in double quotes (literal phrase match) instead of receiving a * prefix wildcard.
- test_fts_tokenizer.py (new): 45 regression tests covering:
  - _apply_prefix_wildcard quotes URLs, paths, dotted names, code spans, tilde
  - tokenize_for_fts end-to-end query path for special-character tokens
How to test
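For experimenting with the quoting rule from the Changes list without checking out the branch, here is a minimal standalone sketch. SPECIAL_RE and quote_or_prefix are hypothetical stand-ins: the character class is an assumption that includes the six characters named in the summary plus a few other FTS5 syntax characters, and it is not the actual contents of _FTS5_SPECIAL_RE or _apply_prefix_wildcard.

```python
import re

# Assumed character class, for illustration only (not the real _FTS5_SPECIAL_RE):
# the six characters named in the summary, plus some other FTS5 syntax characters.
SPECIAL_RE = re.compile(r'[./\\<>~:"*^+\-(){}]')


def quote_or_prefix(word: str) -> str:
    """Quote words containing special characters; otherwise add a prefix wildcard."""
    if SPECIAL_RE.search(word):
        # Inside an FTS5 string, a literal double quote is escaped by doubling it.
        escaped = word.replace('"', '""')
        return f'"{escaped}"'
    return f"{word}*"


if __name__ == "__main__":
    for w in ["search", "file.name.ext", "a/b/c", "https://example.com", "word~3"]:
        print(w, "->", quote_or_prefix(w))
```

Quoting trades prefix matching for safety: a quoted token is matched as a literal phrase, so it cannot be misread as query syntax.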
Fixes #697