sec(privacy): scan full content at LTM trust boundary, not first 10K chars by memtomem · Pull Request #792 · memtomem/memtomem

memtomem · 2026-05-05T01:11:44Z

Summary

PR-3 of the security hardening plan. privacy.scan() previously truncated at the first 10 K chars (_SCAN_WINDOW = 10_000) to mirror STM's compression-side scanner. At the LTM trust boundary that cap is a silent bypass: any secret pasted past the 10 K mark wrote through enforce_write_guard unredacted, hitting storage with the matched bytes intact.

This change drops the truncation and scans the full input. The asymmetry with STM is intentional and one-directional — STM's window is a routing signal, the LTM scan is a write-rejection gate. The trust boundary lives in this module.
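The bypass and the fix can be sketched as follows. This is a minimal illustration, not the actual contents of privacy.py: `_SK_PATTERN`, `scan_truncated`, and `scan_full` are hypothetical stand-ins, and the single `sk-` regex is an assumed shape for one entry of the pattern set.

```python
import re

# Hypothetical stand-in for one entry of DEFAULT_PATTERNS (assumption).
_SK_PATTERN = re.compile(r"sk-[A-Za-z0-9]{20,}")
_SCAN_WINDOW = 10_000  # the former cap


def scan_truncated(text: str) -> list[str]:
    """Pre-fix shape: only the first _SCAN_WINDOW chars are inspected."""
    return [m.group(0) for m in _SK_PATTERN.finditer(text[:_SCAN_WINDOW])]


def scan_full(text: str) -> list[str]:
    """Post-fix shape: re.finditer runs over the whole input."""
    return [m.group(0) for m in _SK_PATTERN.finditer(text)]


# A made-up token placed after 12,000 chars of ordinary prose.
secret = "sk-" + "A1b2C3d4E5" * 3
payload = "note line\n" * 1_200 + secret

print(scan_truncated(payload))  # the secret sits past the window and is missed
print(scan_full(payload))       # the secret is reported
```

A write gate built on `scan_truncated` would let `payload` through unredacted; one built on `scan_full` would reject it.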

Changes

packages/memtomem/src/memtomem/privacy.py — remove _SCAN_WINDOW, run re.finditer over full text. Updated docstring captures the STM-asymmetry rationale so a future contributor doesn't try to "restore parity" by re-adding the cap.
packages/memtomem/src/memtomem/web/static/app.js — replace the stale "client scans entire textarea while server caps at 10 K" comment block on the compose-mode confirm dialog with the post-fix shape (both sides cover the full content). No behavior change in the SPA path.
packages/memtomem/tests/test_privacy.py — invert the existing test_scan_window_capped_at_10k pin into test_secret_past_former_10k_window_is_seen. See "Test-fixture note" below for why the original test was actually testing the wrong invariant.
packages/memtomem/tests/test_privacy_long_content.py (new) — pin the full-content contract at 12 K / 100 K / 1 MB with positive + negative pairs, an interior-position case, and a 1 MB perf ceiling.
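The shape of the long-content pins might look like the sketch below. It is an assumption-laden illustration rather than the contents of test_privacy_long_content.py: the `scan` function, the `sk-` pattern, the `filler` helper, and the `check_*` names are all hypothetical.

```python
import re
import time

# Hypothetical pattern and scan(); the real ones live in privacy.py.
_PATTERN = re.compile(r"sk-[A-Za-z0-9]{20,}")
SECRET = "sk-" + "Z9y8X7w6V5" * 3
SIZES = {"12K": 12_000, "100K": 100_000, "1MB": 1_000_000}


def scan(text: str) -> list[str]:
    return [m.group(0) for m in _PATTERN.finditer(text)]


def filler(n: int) -> str:
    """Clean prose of exactly n chars, containing nothing secret-shaped."""
    unit = "plain words only. "
    return (unit * (n // len(unit) + 1))[:n]


def check_size(n: int) -> None:
    # Positive: a secret trailing n chars of prose must be seen.
    assert scan(filler(n) + " " + SECRET) == [SECRET]
    # Negative: the same shape without a secret must produce zero hits.
    assert scan(filler(n) + " " + "x" * len(SECRET)) == []


def check_interior(n: int) -> None:
    # Secret in the middle: a "scan only the last N chars" rewrite fails here.
    half = n // 2
    assert scan(filler(half) + " " + SECRET + " " + filler(half)) == [SECRET]


def check_perf(n: int, ceiling_s: float = 0.2) -> float:
    # Wall-clock ceiling on a clean input of n chars; returns the elapsed time.
    text = filler(n)
    start = time.perf_counter()
    scan(text)
    elapsed = time.perf_counter() - start
    assert elapsed < ceiling_s
    return elapsed


for size in SIZES.values():
    check_size(size)
check_interior(SIZES["100K"])
```

Pairing each positive case with a negative of identical shape keeps the pin from passing on a scanner that flags everything.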

Test plan

uv run pytest -m "not ollama" — 4060 passed, 11 skipped, 46 deselected (full CI filter, no regressions)
uv run pytest packages/memtomem/tests/test_privacy.py packages/memtomem/tests/test_privacy_long_content.py packages/memtomem/tests/test_memory_crud_redaction.py packages/memtomem/tests/test_redaction_write_surfaces.py -q — 80 passed
uv run ruff check packages/memtomem/src && uv run ruff format --check packages/memtomem/src — all checks passed
Mutation validation (per feedback_pin_test_mutation_validation.md): temporarily reintroduced text = text[:10_000] in scan() and confirmed the new tests fail with the expected shape — 6 failed, 15 passed in the privacy suite, all 6 failures on the new pin (3 size variants of the trailing-secret case + the interior-position case + the 1 MB perf-sanity case + the inverted test in test_privacy.py). Mutation reverted before commit.
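The mutation-validation step can be illustrated with a small sketch: run the same pin against the fixed scanner and against a mutant that reintroduces the truncation, and confirm only the mutant fails. All names here (`scan_full`, `scan_mutant`, `pin_holds`, `failing_sizes`) and the `sk-` pattern are hypothetical and chosen for illustration.

```python
import re
from typing import Callable

_PATTERN = re.compile(r"sk-[A-Za-z0-9]{20,}")  # hypothetical pattern
SECRET = "sk-" + "Q1w2E3r4T5" * 3

Scan = Callable[[str], list[str]]


def scan_full(text: str) -> list[str]:
    return [m.group(0) for m in _PATTERN.finditer(text)]


def scan_mutant(text: str) -> list[str]:
    text = text[:10_000]  # the temporarily reintroduced mutation
    return [m.group(0) for m in _PATTERN.finditer(text)]


def pin_holds(scan: Scan, size: int) -> bool:
    """True when scan() sees a secret placed after `size` chars of padding."""
    return scan("." * size + SECRET) == [SECRET]


def failing_sizes(scan: Scan) -> list[int]:
    """Sizes at which the trailing-secret pin fails for the given scanner."""
    return [n for n in (12_000, 100_000, 1_000_000) if not pin_holds(scan, n)]


print(failing_sizes(scan_full))    # no failures expected
print(failing_sizes(scan_mutant))  # every size past the former cap fails
```

A pin that only places the secret inside the first 10 K chars would pass against both scanners, which is why the sizes span the former cap.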

Test-fixture note

Investigating the existing test_scan_window_capped_at_10k showed it was asserting scan("a" * 10_001 + "AKIAIOSFODNN7EXAMPLE") == [] and attributing the empty result to the truncation. The real cause was the AKIA pattern's \b word boundary failing to fire between a (word char) and A (word char) — the test would have returned [] even without truncation. The replacement uses the sk- prefix shape (DEFAULT_PATTERNS index 2, no boundary anchor) so the only failure mode is "scan did not cover the position." A separate test test_secret_within_first_10k_is_seen keeps the within-10 K baseline pinned for symmetry.
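The word-boundary effect is easy to reproduce with the standard `re` module. The two regexes below are assumed shapes for illustration only; the actual DEFAULT_PATTERNS entries may differ.

```python
import re

# Hypothetical pattern shapes (assumptions, not copied from privacy.py).
AKIA = re.compile(r"\bAKIA[0-9A-Z]{16}\b")
SK = re.compile(r"sk-[A-Za-z0-9]{20,}")

padding = "a" * 10_001
key = "AKIAIOSFODNN7EXAMPLE"

# "a" and "A" are both word characters, so \b cannot match between them.
glued = AKIA.findall(padding + key)

# A single space in front of the key restores the boundary.
spaced = AKIA.findall(padding + " " + key)

# The sk- shape has no boundary anchor, so glued padding does not hide it.
sk_hits = SK.findall(padding + "sk-" + "0" * 24)

print(glued)    # empty, with no truncation involved
print(spaced)   # the key is found
print(sk_hits)  # the sk- token is found
```

With an unanchored fixture, an empty result can only mean the scan did not reach the position, which is the invariant the replacement test is meant to pin.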

Out of scope (intentional)

Pattern set itself is unchanged. The PII / secret-class split rule in the module docstring still applies; broadening to PII patterns would reverse the false-positive profile and is a separate decision pass.
STM-side _SCAN_WINDOW stays. STM's scanner is routing-only; the LTM scan is the trust boundary.
PRs 4–10 from the hardening plan (DNS rebinding, supported_extensions constraint, SHA pin drift detection, mm reset env override, etc.) are independent and tracked separately per feedback_one_change_per_pr.md.

🤖 Generated with Claude Code

…K chars

``privacy.scan()`` previously truncated at ``_SCAN_WINDOW = 10_000`` chars to mirror STM's compression-side scanner. At the LTM trust boundary that cap is a silent bypass: any secret pasted past the 10K mark (pasted ``.env`` files, transcripts with embedded tokens, multi-screen notes) wrote through ``enforce_write_guard`` unredacted, hitting the storage layer with the matched bytes intact.

Drop the truncation. ``scan()`` now runs ``re.finditer`` over the full input. The asymmetry with STM is intentional and one-directional:

- STM's window is a compression-routing signal (does this block contain anything sensitive enough to skip routing).
- LTM's scan is a write-rejection gate.

The two contracts diverge, and the trust boundary lives in this module.

All current ``DEFAULT_PATTERNS`` are short, prefix-anchored regexes (provider tokens, PEM headers, etc.); the linear-scan claim is pinned by ``test_one_megabyte_scan_under_perf_ceiling`` (1MB completes well under 50ms on CI hardware, ceiling set to 200ms to absorb jitter).

Pin the post-fix contract at three sizes spanning the former cap (12K / 100K / 1MB), each with a paired negative (clean prose of identical shape must produce zero hits) and an interior-position case so a "scan only the last N chars" rewrite would also fail.

Per ``feedback_pin_invert_symmetric_assertion.md`` and ``feedback_pin_test_mutation_validation.md``, the pin was mutation-validated before commit: temporarily reintroducing ``text = text[:10_000]`` made the new tests fail with the expected shape, then the mutation was reverted.

Test-fixture aside: the existing ``test_scan_window_capped_at_10k`` was asserting ``privacy.scan("a" * 10_001 + "AKIAIOSFODNN7EXAMPLE") == []``, attributing the empty result to the truncation. Investigation showed the real cause was the pattern's ``\b`` word boundary failing to fire between ``a`` (word char) and ``A`` (word char): the test would have returned ``[]`` even without truncation. The replacement test uses the ``sk-`` prefix shape (DEFAULT_PATTERNS index 2, no boundary anchor) so the only failure mode is "scan did not cover the position."

The web-UI compose-mode confirm dialog (``app.js:3753``) carried a stale comment about scan-window asymmetry between client and server ``re.test()`` / ``privacy.scan()``. Replace with the post-fix shape: both sides now cover the full content; the server remains the source of truth and the client check is a UX-time hint that fires before the request goes out. No behavior change in the SPA path.

Out of scope (deliberately not bundled):

- Pattern set itself is unchanged. The PII / secret-class split rule in the module docstring still applies; broadening to PII patterns would reverse the false-positive profile and is a separate decision pass.
- STM-side ``_SCAN_WINDOW`` stays. STM's scanner is routing-only; the LTM scan is the trust boundary.

Per ``feedback_one_change_per_pr.md``: this PR is the third item in the hardening plan (PR-3) and is independent of PRs 4–10.

Co-Authored-By: Claude <[email protected]>

…t scan

Codex review of the parent commit (PR #792) flagged that the tool-facing docstrings on ``mem_add`` and ``mem_batch_add`` still advertised the old "first 10,000 characters" scan window, even though the underlying ``privacy.scan()`` was just rewritten to cover the full input. Stale contract surfaces visible to agents/operators are exactly the kind of "silent policy enforcement gap": the change shipped, the docs lied.

Fixes from the review:

- ``mem_add`` (memory_crud.py:202): replace the 10,000-char-window paragraph with the full-content guarantee, and capture the STM asymmetry rationale (routing signal vs write-rejection gate).
- ``mem_batch_add`` (memory_crud.py:428): same fix, scoped to per-entry scanning.
- ``privacy.scan`` (privacy.py:278): drop the misleading "50 ms ceiling" number. The actual test ceiling is 200 ms and local 1 MB runs measure ~64-70 ms, so the docstring number was both too tight and inaccurate. Replace with a linear-time claim that points at the test as the authoritative perf pin, and reword "prefix-anchored" to "simple short" since some patterns (AKIA, JWT) are word-boundary-anchored rather than prefix-anchored.

Pure docstring changes: no code-path or test changes. Existing 80-test privacy suite stays green; ``ruff check`` + ``ruff format --check`` clean.

Per ``feedback_subagent_review_verification.md``: the review was re-validated by re-reading both call sites in ``memory_crud.py`` before applying the fix, so the line numbers and stale prose were confirmed in tree rather than taken on faith from the review output.

Co-Authored-By: Claude <[email protected]>

pandas-studio and others added 2 commits May 5, 2026 10:11

memtomem merged commit 038b32a into main May 5, 2026
11 checks passed

memtomem deleted the sec/privacy-scan-full-content branch May 5, 2026 01:36

github-actions Bot locked and limited conversation to collaborators May 5, 2026

memtomem merged 2 commits into main from sec/privacy-scan-full-content
