Skip to content

fix(search): normalize separators in source_filter match (#720)#722

Merged
memtomem merged 3 commits intomainfrom
fix/720-source-filter-windows
May 2, 2026
Merged

fix(search): normalize separators in source_filter match (#720)#722
memtomem merged 3 commits intomainfrom
fix/720-source-filter-windows

Conversation

@memtomem
Copy link
Copy Markdown
Owner

@memtomem memtomem commented May 2, 2026

Summary

Apply the same Windows separator fold at every source_filter comparison site, with three contract-specific helpers so the fold rule lives in exactly three places (each POSIX-pinned) rather than scattered inline at every call site:

Tool File Contract Approach
mem_search search/pipeline.py substring + glob match_source_filter
mem_list server/tools/browse.py substring + glob calls match_source_filter
mem_consolidate server/tools/consolidation.py substring + glob calls match_source_filter
mem_decay search/decay.py:expire_chunks substring-only calls match_source_filter_substring
mem_auto_tag tools/auto_tag.py substring-only calls match_source_filter_substring
mem_export_chunks tools/export_import.py substring-only (negated) calls match_source_filter_substring
mem_entity_scan server/tools/entity.py glob-only calls match_source_filter_glob

Why

Sibling of #647 (closed via #717) but in a different code path. The search-pipeline source_filter parameter (and the same parameter on six other MCP tools) was substring-matched (or fnmatch-globbed) directly against the chunk's stored source_file string with no separator normalisation. On Windows that meant a caller-supplied filter like /tmp/keep/ (POSIX literal — what tests and most users type) never matched a chunk whose stored path was \tmp\keep\policy.md.

Symptom on Windows CI (caught after #717 merged):

FAILED packages/memtomem/tests/test_temporal_validity.py::TestValidityFilterPipelineWiring::test_and_with_source_filter
  AssertionError: assert [] == ['chunk-ok']

That test was the visible failure, but the bug is family-wide: review of the first commit found 5 more sites with the same shape. Fixing only mem_search would have been a partial fix — mem_list(source_filter="docs/") on Windows would still return empty.

Approach

Match-side normalise (read-time fold) — same pattern as #716 (categorize_memory_dir) and #717 (norm_dir_prefix via os.sep). Three contract-specific helpers in search/pipeline.py, one parametrized pin per helper, all callers route through them:

  • match_source_filter — substring + glob auto-detect (existing). Pinned by 9-row TestMatchSourceFilter.
  • match_source_filter_substring — substring-only, no glob fallback. Pinned by 7-row TestMatchSourceFilterSubstring.
  • match_source_filter_glob — glob-only, no substring fallback. Pinned by 5-row TestMatchSourceFilterGlob.

The contract differences stay explicit at the call sites: substring-only callers can't accidentally gain glob, glob-only callers can't accidentally gain substring. feedback_default_off_when_opt_out_exists.md regression class avoided.

norm_path (storage/sqlite_helpers.py) is unchanged — persisted chunks.source_file rows stay native-form on every platform; only the comparison picks a canonical form.

POSIX edge case (documented)

Backslash is a legal filename character on POSIX, so a chunk indexed under foo\bar.md would match a filter foo/bar after the fold. Rare in practice. Documented in match_source_filter's docstring as the explicit trade-off for Windows portability.

What's not in this PR

  • No norm_path change. Same reasoning as fix(indexing): use os.sep for memory_dir prefix so Windows paths match (#647) #717 — write+read symmetry already collapses drive case, 8.3, and \\?\ long-path prefix to a canonical form via Path.resolve().
  • No promotion of the helper module out of search/pipeline.py. It currently lives there because that's where match_source_filter was already defined. If a future site lives outside search/ AND the layering of server.toolssearch becomes painful, moving the trio to e.g. storage/sqlite_helpers.py is a clean follow-up.

Test plan

  • uv run pytest packages/memtomem/tests/test_pipeline.py::TestMatchSourceFilter packages/memtomem/tests/test_pipeline.py::TestMatchSourceFilterSubstring packages/memtomem/tests/test_pipeline.py::TestMatchSourceFilterGlob -v — 21 passed (9 + 7 + 5).
  • Mutation-validated all three helpers (per feedback_pin_test_mutation_validation.md): locally reverted each helper's .replace() calls and confirmed Windows-shape rows fail on macOS.
  • uv run pytest packages/memtomem/tests/{test_pipeline,test_temporal_validity,test_user_workflows,test_server_tools_core,test_storage_extended,test_server_tools_advanced,test_export_import_roundtrip,test_auto_tag_llm,test_entity_extraction_llm}.py -m "not ollama and not llm" — 316 passed (POSIX no-op confirmed across every source_filter consumer including the inline-site test files).
  • uv run ruff check and ruff format --check on search/, server/tools/, tools/, tests/test_pipeline.py — clean.
  • Windows-CI leg: test_temporal_validity::test_and_with_source_filter turns green; the 6 other source_filter consumers (mem_list, mem_consolidate, mem_decay, mem_auto_tag, mem_export_chunks, mem_entity_scan) integration tests stay green.

Closes #720. Refs #647 (closed via #717), #716, #634 (Windows CI matrix umbrella).

🤖 Generated with Claude Code

pandas-studio and others added 3 commits May 2, 2026 16:25
Sibling of #647 in a different code path. The search-pipeline
``source_filter`` parameter was substring-matched (or ``fnmatch``-globbed)
directly against the chunk's stored ``source_file`` string, with no
separator normalisation on either side. On Windows that meant a
caller-supplied filter like ``/tmp/keep/`` (POSIX literal — what tests
and most users type) never matched a chunk whose stored path was
``\tmp\keep\policy.md``. Symptom on Windows CI:

    FAILED test_temporal_validity::test_and_with_source_filter
    AssertionError: assert [] == ['chunk-ok']

The test was meant to exercise validity-AND-source semantics but never
got that far — the source filter alone rejected everything.

Fix: in both ``_match_source`` (pipeline.py) and the inlined substring
check in ``expire_chunks`` (decay.py), fold both sides through
``str.replace("\\", "/")`` before the substring/glob comparison. POSIX
is a no-op (backslashes don't appear in native paths); Windows now
matches POSIX-shaped filters against native source strings, and the
glob branch keeps working because the canonical form on both sides
uses ``/`` which ``fnmatch`` already treats literally.

This mirrors the read-side normalisation pattern from PR #716
(``categorize_memory_dir``) — normalise where the comparison happens,
not where the data is stored. ``norm_path`` (sqlite_helpers) is
unchanged; persisted ``chunks.source_file`` rows stay native-form.

Pin: ``TestMatchSource::test_separator_normalised_both_sides`` —
9-row parametrize over (filter shape × source shape × substring/glob
branch). Mutation-validated locally per
``feedback_pin_test_mutation_validation.md``: removing the
``.replace("\\", "/")`` calls makes the 4 Windows-shape rows fail
on macOS, exactly the regression the pin needs to catch.

Closes #720.
Refs #647 (closed via #717), #716, #634 (Windows CI matrix umbrella).

Co-Authored-By: Claude <[email protected]>
#720)

Review of #722's first commit found three additional sibling sites
that share the same Windows separator-mismatch bug as
``_match_source``: ``mem_list`` (browse.py), ``mem_consolidate``
(consolidation.py), ``mem_extract_entities`` (entity.py),
``mem_export_chunks`` (export_import.py), and ``mem_auto_tag``
(auto_tag.py). Without fixing them, ``mem_list(source_filter="docs/")``
on Windows would still return empty after #720 lands — the
``test_temporal_validity`` symptom would just shift to the next user
that types a forward-slash filter.

Two structural changes:

- Promote ``_match_source`` to public ``match_source_filter`` in
  ``search/pipeline.py``. ``mem_list`` (browse.py) and
  ``mem_consolidate`` (consolidation.py) had a literal copy of the
  substring + glob branching inlined; they now call the shared helper
  so the contract and the separator-fold rule live in exactly one
  place. Internal caller ``SearchPipeline.search`` is updated to the
  new name.
- For the three contract-different sites — ``mem_extract_entities``
  (glob-only by design), ``mem_export_chunks`` (substring-only with
  negation), and ``mem_auto_tag`` (substring-only) — apply the same
  ``str.replace("\\", "/")`` fold inline at each call site. Sharing
  ``match_source_filter`` would silently broaden their contract
  (substring-only sites would gain glob support, glob-only sites would
  gain substring fallback), which is a behaviour change beyond #720
  scope.

Also note the POSIX edge case in ``match_source_filter``'s docstring:
backslash is a legal POSIX filename character, so a chunk indexed
under ``foo\bar.md`` would match a filter ``foo/bar`` after the fold.
Rare in practice; documented as the trade-off for portability.

The 9-row pin test (renamed ``TestMatchSourceFilter``) covers the
shared helper. The four contract-different inlined sites have only
the import-path change to verify (no logic change worth a separate
unit test); existing integration tests under
``test_server_tools_*.py`` exercise the call paths.

Co-Authored-By: Claude <[email protected]>
…eview)

Review of commit 5dd21c8 raised that the four inline sites
(``mem_decay``, ``mem_auto_tag``, ``mem_export_chunks``,
``mem_entity_scan``) had no POSIX-catchable pin — a future revert of
``.replace("\\", "/")`` at any one of them would only fail the
Windows CI leg, not the Linux/macOS legs that block merge today
(per ``feedback_pin_test_mutation_validation.md``).

Rather than add per-site pin tests (which need per-tool fake-storage
setup), extract two contract-specific helpers next to the existing
``match_source_filter`` and have the four sites call them. The
separator-fold rule now lives in exactly three functions, each with a
parametrized POSIX-runnable pin:

- ``match_source_filter`` (substring + glob auto-detect) — used by
  ``mem_search``, ``mem_list``, ``mem_consolidate``. Existing 9-row
  ``TestMatchSourceFilter`` pin.
- ``match_source_filter_substring`` (substring-only) — new. Used by
  ``mem_decay`` / ``expire_chunks``, ``mem_auto_tag``,
  ``mem_export_chunks``. New 7-row ``TestMatchSourceFilterSubstring``
  pin (mutation-validated: reverting ``.replace()`` makes the
  Windows-shape rows fail on macOS).
- ``match_source_filter_glob`` (glob-only) — new. Used by
  ``mem_entity_scan``. New 5-row ``TestMatchSourceFilterGlob`` pin.

The contract differences stay explicit: substring-only callers can't
accidentally gain glob, glob-only callers can't accidentally gain
substring. ``feedback_default_off_when_opt_out_exists.md`` regression
class avoided.

Stale ``from fnmatch import fnmatch`` import in ``entity.py`` removed
(only call site replaced).

POSIX sweep across all source_filter consumers (test_pipeline,
test_temporal_validity, test_user_workflows, test_server_tools_*,
test_storage_extended, test_export_import_roundtrip, test_auto_tag_llm,
test_entity_extraction_llm): 316 passed, 0 regressions.

Co-Authored-By: Claude <[email protected]>
@memtomem memtomem merged commit 88c3e81 into main May 2, 2026
8 of 9 checks passed
@memtomem memtomem deleted the fix/720-source-filter-windows branch May 2, 2026 07:44
@github-actions github-actions Bot locked and limited conversation to collaborators May 2, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Windows: search source_filter substring match ignores separator (sibling of #647)

2 participants