fix(search): normalize separators in source_filter match (#720)#722
Merged
fix(search): normalize separators in source_filter match (#720)#722
Conversation
Sibling of #647 in a different code path. The search-pipeline ``source_filter`` parameter was substring-matched (or ``fnmatch``-globbed) directly against the chunk's stored ``source_file`` string, with no separator normalisation on either side. On Windows that meant a caller-supplied filter like ``/tmp/keep/`` (POSIX literal — what tests and most users type) never matched a chunk whose stored path was ``\tmp\keep\policy.md``. Symptom on Windows CI: FAILED test_temporal_validity::test_and_with_source_filter AssertionError: assert [] == ['chunk-ok'] The test was meant to exercise validity-AND-source semantics but never got that far — the source filter alone rejected everything. Fix: in both ``_match_source`` (pipeline.py) and the inlined substring check in ``expire_chunks`` (decay.py), fold both sides through ``str.replace("\\", "/")`` before the substring/glob comparison. POSIX is a no-op (backslashes don't appear in native paths); Windows now matches POSIX-shaped filters against native source strings, and the glob branch keeps working because the canonical form on both sides uses ``/`` which ``fnmatch`` already treats literally. This mirrors the read-side normalisation pattern from PR #716 (``categorize_memory_dir``) — normalise where the comparison happens, not where the data is stored. ``norm_path`` (sqlite_helpers) is unchanged; persisted ``chunks.source_file`` rows stay native-form. Pin: ``TestMatchSource::test_separator_normalised_both_sides`` — 9-row parametrize over (filter shape × source shape × substring/glob branch). Mutation-validated locally per ``feedback_pin_test_mutation_validation.md``: removing the ``.replace("\\", "/")`` calls makes the 4 Windows-shape rows fail on macOS, exactly the regression the pin needs to catch. Closes #720. Refs #647 (closed via #717), #716, #634 (Windows CI matrix umbrella). Co-Authored-By: Claude <[email protected]>
#720) Review of #722's first commit found three additional sibling sites that share the same Windows separator-mismatch bug as ``_match_source``: ``mem_list`` (browse.py), ``mem_consolidate`` (consolidation.py), ``mem_extract_entities`` (entity.py), ``mem_export_chunks`` (export_import.py), and ``mem_auto_tag`` (auto_tag.py). Without fixing them, ``mem_list(source_filter="docs/")`` on Windows would still return empty after #720 lands — the ``test_temporal_validity`` symptom would just shift to the next user that types a forward-slash filter. Two structural changes: - Promote ``_match_source`` to public ``match_source_filter`` in ``search/pipeline.py``. ``mem_list`` (browse.py) and ``mem_consolidate`` (consolidation.py) had a literal copy of the substring + glob branching inlined; they now call the shared helper so the contract and the separator-fold rule live in exactly one place. Internal caller ``SearchPipeline.search`` is updated to the new name. - For the three contract-different sites — ``mem_extract_entities`` (glob-only by design), ``mem_export_chunks`` (substring-only with negation), and ``mem_auto_tag`` (substring-only) — apply the same ``str.replace("\\", "/")`` fold inline at each call site. Sharing ``match_source_filter`` would silently broaden their contract (substring-only sites would gain glob support, glob-only sites would gain substring fallback), which is a behaviour change beyond #720 scope. Also note the POSIX edge case in ``match_source_filter``'s docstring: backslash is a legal POSIX filename character, so a chunk indexed under ``foo\bar.md`` would match a filter ``foo/bar`` after the fold. Rare in practice; documented as the trade-off for portability. The 9-row pin test (renamed ``TestMatchSourceFilter``) covers the shared helper. The four contract-different inlined sites have only the import-path change to verify (no logic change worth a separate unit test); existing integration tests under ``test_server_tools_*.py`` exercise the call paths. Co-Authored-By: Claude <[email protected]>
…eview) Review of commit 5dd21c8 raised that the four inline sites (``mem_decay``, ``mem_auto_tag``, ``mem_export_chunks``, ``mem_entity_scan``) had no POSIX-catchable pin — a future revert of ``.replace("\\", "/")`` at any one of them would only fail the Windows CI leg, not the Linux/macOS legs that block merge today (per ``feedback_pin_test_mutation_validation.md``). Rather than add per-site pin tests (which need per-tool fake-storage setup), extract two contract-specific helpers next to the existing ``match_source_filter`` and have the four sites call them. The separator-fold rule now lives in exactly three functions, each with a parametrized POSIX-runnable pin: - ``match_source_filter`` (substring + glob auto-detect) — used by ``mem_search``, ``mem_list``, ``mem_consolidate``. Existing 9-row ``TestMatchSourceFilter`` pin. - ``match_source_filter_substring`` (substring-only) — new. Used by ``mem_decay`` / ``expire_chunks``, ``mem_auto_tag``, ``mem_export_chunks``. New 7-row ``TestMatchSourceFilterSubstring`` pin (mutation-validated: reverting ``.replace()`` makes the Windows-shape rows fail on macOS). - ``match_source_filter_glob`` (glob-only) — new. Used by ``mem_entity_scan``. New 5-row ``TestMatchSourceFilterGlob`` pin. The contract differences stay explicit: substring-only callers can't accidentally gain glob, glob-only callers can't accidentally gain substring. ``feedback_default_off_when_opt_out_exists.md`` regression class avoided. Stale ``from fnmatch import fnmatch`` import in ``entity.py`` removed (only call site replaced). POSIX sweep across all source_filter consumers (test_pipeline, test_temporal_validity, test_user_workflows, test_server_tools_*, test_storage_extended, test_export_import_roundtrip, test_auto_tag_llm, test_entity_extraction_llm): 316 passed, 0 regressions. Co-Authored-By: Claude <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Apply the same Windows separator fold at every
source_filtercomparison site, with three contract-specific helpers so the fold rule lives in exactly three places (each POSIX-pinned) rather than scattered inline at every call site:mem_searchsearch/pipeline.pymatch_source_filtermem_listserver/tools/browse.pymatch_source_filtermem_consolidateserver/tools/consolidation.pymatch_source_filtermem_decaysearch/decay.py:expire_chunksmatch_source_filter_substringmem_auto_tagtools/auto_tag.pymatch_source_filter_substringmem_export_chunkstools/export_import.pymatch_source_filter_substringmem_entity_scanserver/tools/entity.pymatch_source_filter_globWhy
Sibling of #647 (closed via #717) but in a different code path. The search-pipeline
source_filterparameter (and the same parameter on six other MCP tools) was substring-matched (orfnmatch-globbed) directly against the chunk's storedsource_filestring with no separator normalisation. On Windows that meant a caller-supplied filter like/tmp/keep/(POSIX literal — what tests and most users type) never matched a chunk whose stored path was\tmp\keep\policy.md.Symptom on Windows CI (caught after #717 merged):
That test was the visible failure, but the bug is family-wide: review of the first commit found 5 more sites with the same shape. Fixing only
mem_searchwould have been a partial fix —mem_list(source_filter="docs/")on Windows would still return empty.Approach
Match-side normalise (read-time fold) — same pattern as #716 (
categorize_memory_dir) and #717 (norm_dir_prefixviaos.sep). Three contract-specific helpers insearch/pipeline.py, one parametrized pin per helper, all callers route through them:match_source_filter— substring + glob auto-detect (existing). Pinned by 9-rowTestMatchSourceFilter.match_source_filter_substring— substring-only, no glob fallback. Pinned by 7-rowTestMatchSourceFilterSubstring.match_source_filter_glob— glob-only, no substring fallback. Pinned by 5-rowTestMatchSourceFilterGlob.The contract differences stay explicit at the call sites: substring-only callers can't accidentally gain glob, glob-only callers can't accidentally gain substring.
feedback_default_off_when_opt_out_exists.mdregression class avoided.norm_path(storage/sqlite_helpers.py) is unchanged — persistedchunks.source_filerows stay native-form on every platform; only the comparison picks a canonical form.POSIX edge case (documented)
Backslash is a legal filename character on POSIX, so a chunk indexed under
foo\bar.mdwould match a filterfoo/barafter the fold. Rare in practice. Documented inmatch_source_filter's docstring as the explicit trade-off for Windows portability.What's not in this PR
norm_pathchange. Same reasoning as fix(indexing): use os.sep for memory_dir prefix so Windows paths match (#647) #717 — write+read symmetry already collapses drive case, 8.3, and\\?\long-path prefix to a canonical form viaPath.resolve().search/pipeline.py. It currently lives there because that's wherematch_source_filterwas already defined. If a future site lives outsidesearch/AND the layering ofserver.tools→searchbecomes painful, moving the trio to e.g.storage/sqlite_helpers.pyis a clean follow-up.Test plan
uv run pytest packages/memtomem/tests/test_pipeline.py::TestMatchSourceFilter packages/memtomem/tests/test_pipeline.py::TestMatchSourceFilterSubstring packages/memtomem/tests/test_pipeline.py::TestMatchSourceFilterGlob -v— 21 passed (9 + 7 + 5).feedback_pin_test_mutation_validation.md): locally reverted each helper's.replace()calls and confirmed Windows-shape rows fail on macOS.uv run pytest packages/memtomem/tests/{test_pipeline,test_temporal_validity,test_user_workflows,test_server_tools_core,test_storage_extended,test_server_tools_advanced,test_export_import_roundtrip,test_auto_tag_llm,test_entity_extraction_llm}.py -m "not ollama and not llm"— 316 passed (POSIX no-op confirmed across everysource_filterconsumer including the inline-site test files).uv run ruff checkandruff format --checkonsearch/,server/tools/,tools/,tests/test_pipeline.py— clean.test_temporal_validity::test_and_with_source_filterturns green; the 6 othersource_filterconsumers (mem_list,mem_consolidate,mem_decay,mem_auto_tag,mem_export_chunks,mem_entity_scan) integration tests stay green.Closes #720. Refs #647 (closed via #717), #716, #634 (Windows CI matrix umbrella).
🤖 Generated with Claude Code