Skip to content

Windows: search source_filter substring match ignores separator (sibling of #647) #720

@memtomem

Description

@memtomem

Background

Sibling of #647 (closed via #717) but in a different code path: the search pipeline's source_filter parameter does plain substring matching against the chunk's source_file, with no separator normalisation. On Windows that means a caller-supplied filter like /tmp/keep/ (which a test or a user types in POSIX form) never matches a chunk whose stored source_file is \tmp\keep\policy.md — separators differ, substring fails.

Affected code

  • packages/memtomem/src/memtomem/search/pipeline.py:40-44_match_source:
    def _match_source(filter_str: str, source_path: str) -> bool:
        """Match source_filter: glob when pattern chars present, substring otherwise."""
        if any(c in filter_str for c in ("*", "?", "[")):
            return fnmatch(source_path, filter_str)
        return filter_str in source_path
  • packages/memtomem/src/memtomem/search/decay.py:111-112expire_chunks:
    if source_filter:
        sources = {s for s in sources if source_filter in str(s)}

Both compare a user/caller-supplied substring against a stored native-form path. Neither normalises separators on either side.

Symptom on Windows CI

FAILED packages/memtomem/tests/test_temporal_validity.py::TestValidityFilterPipelineWiring::test_and_with_source_filter
  AssertionError: assert [] == ['chunk-ok']

The test sets source_filter="/tmp/keep/" and indexes a chunk with source_file=Path("/tmp/keep/policy.md"). On Windows str(WindowsPath("/tmp/keep/policy.md")) is \tmp\keep\policy.md; the substring "/tmp/keep/" is not in "\tmp\keep\policy.md" → the filter rejects the chunk → empty result.

This is not what test_and_with_source_filter is exercising — it's exercising the AND between validity and source filter, but the test never gets a chance to assert AND-semantics because the source filter alone already rejects everything on Windows.

Why this didn't get fixed by #717

#717 normalised the trailing separator inside norm_dir_prefix (os.sep instead of a hardcoded "/"). That helper is only used by engine.py:resolve_owning_memory_dir, engine.py:memory_dir_stats, and the delete_chunks=true branch of web/routes/system.py:remove_memory_dir. Search-pipeline source_filter is a separate code path: it bypasses norm_dir_prefix and does its own ad-hoc in str(...) / fnmatch(...) matching.

norm_path itself is symmetric (write side and read side both go through Path.resolve() + NFC, so the stored chunks.source_file is in native form on every OS), so the asymmetry is purely the caller's input — they pass a separator-naive substring/glob.

Fix direction

Two options, equivalent in correctness:

A) Caller-side normalise — the MemSearchInput / CLI / API layer that accepts source_filter should normalise separators to native form before passing in (e.g. source_filter.replace("/", os.sep) when not a glob, or the equivalent for fnmatch patterns). One layer to change, but every entry point has to remember.

B) Match-side normalise (recommended) — in _match_source and expire_chunks, normalise both filter_str and source_path to a single canonical form before comparing. Mirrors PR #716's read-time .replace("\\", "/") pattern in categorize_memory_dir. One helper, all callers covered. Suggested:

def _match_source(filter_str: str, source_path: str) -> bool:
    """Match source_filter: glob when pattern chars present, substring otherwise.

    Both sides are folded to forward-slash form before comparing so a
    user-typed `/tmp/keep/` matches a Windows-stored `\\tmp\\keep\\file`
    (#NNN, sibling of #647).
    """
    norm_filter = filter_str.replace("\\", "/")
    norm_source = source_path.replace("\\", "/")
    if any(c in norm_filter for c in ("*", "?", "[")):
        return fnmatch(norm_source, norm_filter)
    return norm_filter in norm_source

The decay.py:expire_chunks site should call into the same helper (currently inlined).

Test plan

  • The Windows-CI failure of test_temporal_validity::test_and_with_source_filter is the existing validation surface — should turn green once _match_source normalises.
  • Add a small parametrized unit test that exercises both forward-slash and backslash inputs against both forward-slash and backslash sources, mutation-validated against a hardcoded-separator regression (per the pattern in fix(indexing): use os.sep for memory_dir prefix so Windows paths match (#647) #717's TestNormDirPrefix).
  • POSIX is a no-op: .replace("\\", "/") on a POSIX path is identity.

Refs

🤖 Generated with Claude Code

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions