BM25 FTS5 query sanitization misses . / \ < > (silent degradation) #697

@memtomem

Summary

User search queries containing punctuation common in Markdown, YAML, URLs, and filesystem paths can make the BM25 query fail with fts5: syntax error near ".". The error is caught and logged, but the HTTP response is still 200, so users get degraded (dense-only) results without any UI signal.

Found by Playwright UX review of v0.1.34 prod (2026-05-02). See docs/reports/mm-web-prod-v0.1.34-playwright-review.md (P1 — BM25 search can fail on raw markdown/YAML-like queries).

Evidence

Sanitization gap: packages/memtomem/src/memtomem/storage/fts_tokenizer.py:18

_FTS5_SPECIAL_RE = re.compile(r'[*"()\-+^:]')

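Missing characters that FTS5 does not accept in unquoted query text: . / \ < >. FTS5 barewords may contain only ASCII letters, digits, underscores, and non-ASCII characters; anything else has to sit inside a quoted string. URLs (https://example.com), filesystem paths (a/b/c), dotted filenames (file.name.ext), and YAML/frontmatter (key: value, ---) all flow through unquoted and trip the parser.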

Silent degradation: packages/memtomem/src/memtomem/search/pipeline.py:468-474

if use_bm25:
    try:
        bm25_results = await bm25_task
    except Exception as exc:
        logger.warning("BM25 search failed: %s", exc)
        bm25_results = []
        bm25_error = str(exc)

bm25_error lands in RetrievalStats but the web UI does not surface it. Users get dense-only results and have no signal that keyword search broke.
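One possible shape for carrying the failure through to the response, as a rough sketch only. RetrievalStats here is a hypothetical stand-in (the issue names only its bm25_error field), and failing_bm25, search, degradation_warning, and the "warning" payload key are illustrative names, not the project's API.

```python
from __future__ import annotations

import asyncio
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class RetrievalStats:
    """Hypothetical stand-in; only the bm25_error field is named in the issue."""
    bm25_error: str | None = None


def degradation_warning(stats: RetrievalStats) -> str | None:
    """Return a user-facing, non-blocking warning when BM25 failed, else None."""
    if stats.bm25_error is None:
        return None
    return "Keyword search degraded; using vector results only"


async def failing_bm25() -> list[str]:
    # Simulates the FTS5 parser rejecting an unsanitized query.
    raise RuntimeError('fts5: syntax error near "."')


async def search(dense_results: list[str]) -> dict:
    stats = RetrievalStats()
    try:
        bm25_results = await failing_bm25()
    except Exception as exc:  # same broad catch as the pipeline excerpt above
        logger.warning("BM25 search failed: %s", exc)
        bm25_results = []
        stats.bm25_error = str(exc)
    # The response still succeeds, but now carries a signal the UI can render.
    return {
        "results": dense_results + bm25_results,
        "warning": degradation_warning(stats),
    }


payload = asyncio.run(search(["doc-1"]))
```

The point of the sketch is that the status code can stay 200 while the payload tells the client that keyword search did not contribute.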

Tests: there are no regression cases for FTS5 syntax in queries under packages/memtomem/tests/. Queries containing frontmatter, code spans, URLs, paths, or other punctuation are not covered.

Suggested fix

  1. Extend _FTS5_SPECIAL_RE to include . / \ < > (and re-audit the rest of the FTS5 special set against the sqlite docs).
  2. Add regression cases to tests/ that pass each of: frontmatter (---\nkey: value), URL (https://example.com/path), dotted filename (file.name.ext), unix path (a/b/c), and a code-span fragment.
  3. Surface bm25_error in the web UI as a non-blocking warning (e.g., "Keyword search degraded — using vector results only") rather than swallowing it. This matches the repo's loud-vs-silent invariant.
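Items 1 and 2 could be exercised with a harness along these lines. This is a sketch under assumptions: sanitize is a hypothetical stand-in for the real sanitizer, the docs table and body column are made up, and the pattern adds the backtick on top of the issue's list so that the code-span case passes (an example of what the re-audit might turn up). It uses an in-memory SQLite FTS5 table and reports which cases still raise.

```python
import re
import sqlite3

# Assumption: extended set from item 1, plus the backtick for the code-span case.
_SPECIAL_RE = re.compile(r'[*"()\-+^:./\\<>`]')

# The five query shapes listed in item 2 (the code-span text is illustrative).
CASES = {
    "frontmatter": "---\nkey: value",
    "url": "https://example.com/path",
    "dotted filename": "file.name.ext",
    "unix path": "a/b/c",
    "code span": "`foo.bar(baz)`",
}


def sanitize(query):
    """Hypothetical sanitizer: strip special characters, collapse whitespace."""
    return " ".join(_SPECIAL_RE.sub(" ", query).split())


def failing_cases(sanitizer):
    """Return the names of cases whose sanitized query FTS5 rejects.

    Returns None if this SQLite build has no FTS5 module.
    """
    conn = sqlite3.connect(":memory:")
    try:
        try:
            conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
        except sqlite3.OperationalError:
            return None
        failures = []
        for name, query in CASES.items():
            try:
                conn.execute(
                    "SELECT rowid FROM docs WHERE docs MATCH ?",
                    (sanitizer(query),),
                ).fetchall()
            except sqlite3.OperationalError:
                failures.append(name)
        return failures
    finally:
        conn.close()


if __name__ == "__main__":
    print("failing with sanitizer:", failing_cases(sanitize))
    print("failing without sanitizer:", failing_cases(lambda q: q))
```

In the real test suite each case would more naturally be a parametrized test that calls the project's own sanitizer rather than this stand-in.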

Notes for first-time contributors

Items (1) and (2) together make a self-contained change of good first issue size. Item (3) touches the web layer and is a separate follow-up; keep it out of this PR unless requested.

References

  • Review: docs/reports/mm-web-prod-v0.1.34-playwright-review.md
  • Tracking umbrella: TBD (linked once opened)

Labels: bug, good first issue