
sec(privacy): scan full content at LTM trust boundary, not first 10K chars #792

Merged
memtomem merged 2 commits into main from
sec/privacy-scan-full-content
May 5, 2026

Conversation

@memtomem memtomem commented May 5, 2026

Summary

PR-3 of the security hardening plan. privacy.scan() previously truncated at the first 10 K chars (_SCAN_WINDOW = 10_000) to mirror STM's compression-side scanner. At the LTM trust boundary that cap is a silent bypass: any secret pasted past the 10 K mark wrote through enforce_write_guard unredacted, hitting storage with the matched bytes intact.

This change drops the truncation and scans the full input. The asymmetry with STM is intentional and one-directional — STM's window is a routing signal, the LTM scan is a write-rejection gate. The trust boundary lives in this module.
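A minimal sketch of the before/after shape, assuming a simplified one-entry pattern list and a scanner that returns matched strings. `PATTERNS`, `scan_capped`, and `scan_full` are hypothetical names for illustration only; the real `privacy.scan()` signature, return type, and `DEFAULT_PATTERNS` contents may differ.

```python
import re

# Hypothetical stand-in for DEFAULT_PATTERNS; the real set lives in privacy.py.
PATTERNS = [re.compile(r"sk-[A-Za-z0-9]{20,}")]

_SCAN_WINDOW = 10_000  # the former cap


def scan_capped(text: str) -> list[str]:
    """Pre-fix shape: only the first 10K chars are inspected."""
    text = text[:_SCAN_WINDOW]
    return [m.group(0) for p in PATTERNS for m in p.finditer(text)]


def scan_full(text: str) -> list[str]:
    """Post-fix shape: re.finditer runs over the whole input."""
    return [m.group(0) for p in PATTERNS for m in p.finditer(text)]


secret = "sk-" + "A1b2C3d4E5" * 3  # synthetic token, not a real key
payload = "lorem ipsum " * 1_000 + secret  # secret starts at offset 12_000

print(scan_capped(payload))  # the capped scan misses the trailing secret
print(scan_full(payload))    # the full scan reports it
```

With the cap in place a write guard built on the scanner sees no hits and lets the payload through, which is the bypass described above.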

Changes

  • packages/memtomem/src/memtomem/privacy.py — remove _SCAN_WINDOW, run re.finditer over full text. Updated docstring captures the STM-asymmetry rationale so a future contributor doesn't try to "restore parity" by re-adding the cap.
  • packages/memtomem/src/memtomem/web/static/app.js — replace the stale "client scans entire textarea while server caps at 10 K" comment block on the compose-mode confirm dialog with the post-fix shape (both sides cover the full content). No behavior change in the SPA path.
  • packages/memtomem/tests/test_privacy.py — invert the existing test_scan_window_capped_at_10k pin into test_secret_past_former_10k_window_is_seen. See "Test-fixture note" below for why the original test was actually testing the wrong invariant.
  • packages/memtomem/tests/test_privacy_long_content.py (new) — pin the full-content contract at 12 K / 100 K / 1 MB with positive + negative pairs, an interior-position case, and a 1 MB perf ceiling.
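The shape of the long-content pin could look roughly like the sketch below. It uses plain asserts instead of the project's pytest layout, and `scan`, `filler`, `SECRET`, and `check_full_content_contract` are hypothetical stand-ins, not the contents of `test_privacy_long_content.py`.

```python
import re

# Hypothetical stand-in for the real scanner; assumes scan() returns a list of hits.
_SK = re.compile(r"sk-[A-Za-z0-9]{20,}")


def scan(text: str) -> list[str]:
    return [m.group(0) for m in _SK.finditer(text)]


SECRET = "sk-" + "x9" * 16  # synthetic token
SIZES = (12_000, 100_000, 1_000_000)


def filler(n: int) -> str:
    """Clean prose of exactly n characters."""
    unit = "plain words only. "
    return (unit * (n // len(unit) + 1))[:n]


def check_full_content_contract() -> None:
    for n in SIZES:
        clean = filler(n)
        # Negative: clean prose of the same shape produces zero hits.
        assert scan(clean) == []
        # Positive: a secret appended past the former cap is seen.
        assert scan(clean + " " + SECRET) == [SECRET]
        # Interior: a secret in the middle is seen, so a
        # "scan only the last N chars" rewrite would also fail.
        half = n // 2
        assert scan(clean[:half] + " " + SECRET + " " + clean[half:]) == [SECRET]


check_full_content_contract()
```

Pairing every positive case with a negative of identical shape keeps the pin from passing on a scanner that simply flags everything.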

Test plan

  • uv run pytest -m "not ollama": 4060 passed, 11 skipped, 46 deselected (full CI filter, no regressions)
  • uv run pytest packages/memtomem/tests/test_privacy.py packages/memtomem/tests/test_privacy_long_content.py packages/memtomem/tests/test_memory_crud_redaction.py packages/memtomem/tests/test_redaction_write_surfaces.py -q: 80 passed
  • uv run ruff check packages/memtomem/src && uv run ruff format --check packages/memtomem/src — all checks passed
  • Mutation validation (per feedback_pin_test_mutation_validation.md): temporarily reintroduced text = text[:10_000] in scan() and confirmed the new tests fail with the expected shape — 6 failed, 15 passed in the privacy suite, all 6 failures on the new pin (3 size variants of the trailing-secret case + the interior-position case + the 1 MB perf-sanity case + the inverted test in test_privacy.py). Mutation reverted before commit.
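The mutation check can be illustrated in miniature: run the same pin against the fixed scanner and against a copy with the cap reintroduced, and confirm only the mutated copy fails. This is a sketch under assumptions; `scan`, `scan_mutated`, and `pin_holds` are hypothetical names, and the real validation was done by editing `scan()` in place and running the pytest suite.

```python
import re

_SK = re.compile(r"sk-[A-Za-z0-9]{20,}")  # hypothetical pattern


def scan(text: str) -> list[str]:
    return [m.group(0) for m in _SK.finditer(text)]


def scan_mutated(text: str) -> list[str]:
    text = text[:10_000]  # the reintroduced cap
    return scan(text)


def pin_holds(scan_fn) -> bool:
    """True when a secret placed past the former 10K cap is reported."""
    secret = "sk-" + "Zq7" * 10
    return scan_fn("a" * 12_000 + " " + secret) == [secret]


print(pin_holds(scan))          # True: the pin passes on the fixed scanner
print(pin_holds(scan_mutated))  # False: the pin fails once the cap is back
```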

Test-fixture note

Investigating the existing test_scan_window_capped_at_10k showed it was asserting scan("a" * 10_001 + "AKIAIOSFODNN7EXAMPLE") == [] and attributing the empty result to the truncation. The real cause was the AKIA pattern's \b word boundary failing to fire between a (word char) and A (word char) — the test would have returned [] even without truncation. The replacement uses the sk- prefix shape (DEFAULT_PATTERNS index 2, no boundary anchor) so the only failure mode is "scan did not cover the position." A separate test test_secret_within_first_10k_is_seen keeps the within-10 K baseline pinned for symmetry.
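The word-boundary effect is easy to reproduce with the standard `re` module. The two pattern shapes below are assumptions chosen to match the description (a `\b`-anchored AKIA pattern and an unanchored `sk-` pattern); the actual `DEFAULT_PATTERNS` entries may differ.

```python
import re

# Assumed pattern shapes; the real DEFAULT_PATTERNS entries may differ.
AKIA = re.compile(r"\bAKIA[0-9A-Z]{16}\b")
SK = re.compile(r"sk-[A-Za-z0-9]{20,}")

key = "AKIAIOSFODNN7EXAMPLE"  # AWS's documented example key ID

# No truncation anywhere, yet no hit: "a" and "A" are both word
# characters, so \b cannot match between them.
print(AKIA.findall("a" * 10_001 + key))        # []

# A non-word separator restores the boundary.
print(AKIA.findall("a" * 10_001 + " " + key))  # ['AKIAIOSFODNN7EXAMPLE']

# The sk- shape has no boundary anchor, so padding cannot hide it.
token = "sk-" + "0123456789" * 2
print(SK.findall("a" * 10_001 + token) == [token])  # True
```

Because the first call returns an empty list with no cap involved, the old assertion could not distinguish "truncated" from "boundary never matched".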

Out of scope (intentional)

  • Pattern set itself is unchanged. The PII / secret-class split rule in the module docstring still applies; broadening to PII patterns would reverse the false-positive profile and is a separate decision pass.
  • STM-side _SCAN_WINDOW stays. STM's scanner is routing-only; the LTM scan is the trust boundary.
  • PRs 4–10 from the hardening plan (DNS rebinding, supported_extensions constraint, SHA pin drift detection, mm reset env override, etc.) are independent and tracked separately per feedback_one_change_per_pr.md.

🤖 Generated with Claude Code

pandas-studio and others added 2 commits May 5, 2026 10:11
…K chars

``privacy.scan()`` previously truncated at ``_SCAN_WINDOW = 10_000`` chars
to mirror STM's compression-side scanner. At the LTM trust boundary that
cap is a silent bypass: any secret pasted past the 10K mark — pasted
``.env`` files, transcripts with embedded tokens, multi-screen notes —
wrote through ``enforce_write_guard`` unredacted, hitting the storage
layer with the matched bytes intact.

Drop the truncation. ``scan()`` now runs ``re.finditer`` over the full
input. The asymmetry with STM is intentional and one-directional:

- STM's window is a compression-routing signal (does this block contain
  anything sensitive enough to skip routing).
- LTM's scan is a write-rejection gate. The two contracts diverge, and
  the trust boundary lives in this module.

All current ``DEFAULT_PATTERNS`` are short, prefix-anchored regexes
(provider tokens, PEM headers, etc.); the linear-scan claim is pinned by
``test_one_megabyte_scan_under_perf_ceiling`` (1MB completes well under
50ms on CI hardware, ceiling set to 200ms to absorb jitter).

Pin the post-fix contract at three sizes spanning the former cap (12K /
100K / 1MB), each with a paired negative (clean prose of identical shape
must produce zero hits) and an interior-position case so a "scan only
the last N chars" rewrite would also fail. Per
``feedback_pin_invert_symmetric_assertion.md`` and
``feedback_pin_test_mutation_validation.md``, the pin was mutation-
validated before commit: temporarily reintroducing ``text = text[:10_000]``
made the new tests fail with the expected shape, then the mutation was
reverted.

Test-fixture aside: the existing ``test_scan_window_capped_at_10k`` was
asserting ``privacy.scan("a" * 10_001 + "AKIAIOSFODNN7EXAMPLE") == []``,
attributing the empty result to the truncation. Investigation showed
the real cause was the pattern's ``\b`` word boundary failing to fire
between ``a`` (word char) and ``A`` (word char) — the test would have
returned ``[]`` even without truncation. The replacement test uses the
``sk-`` prefix shape (DEFAULT_PATTERNS index 2, no boundary anchor) so
the only failure mode is "scan did not cover the position."

The web-UI compose-mode confirm dialog (``app.js:3753``) carried a
stale comment about scan-window asymmetry between the client's
``re.test()`` and the server's ``privacy.scan()``. Replace it with the
post-fix shape:
both sides now cover the full content; the server remains the source
of truth and the client check is a UX-time hint that fires before the
request goes out. No behavior change in the SPA path.

Out of scope (deliberately not bundled):

- Pattern set itself is unchanged. The PII / secret-class split rule in
  the module docstring still applies; broadening to PII patterns would
  reverse the false-positive profile and is a separate decision pass.
- STM-side ``_SCAN_WINDOW`` stays. STM's scanner is routing-only; the
  LTM scan is the trust boundary.

Per ``feedback_one_change_per_pr.md``: this PR is the third item in the
hardening plan (PR-3) and is independent of PRs 4–10.

Co-Authored-By: Claude <[email protected]>
…t scan

Codex review of the parent commit (PR #792) flagged that the tool-facing
docstrings on ``mem_add`` and ``mem_batch_add`` still advertised the old
"first 10,000 characters" scan window, even though the underlying
``privacy.scan()`` was just rewritten to cover the full input. Stale
contract surfaces visible to agents/operators are themselves a
"silent policy enforcement gap": the change shipped, the docs lied.

Fixes from the review:

- ``mem_add`` (memory_crud.py:202): replace the 10,000-char-window
  paragraph with the full-content guarantee, and capture the STM
  asymmetry rationale (routing signal vs write-rejection gate).
- ``mem_batch_add`` (memory_crud.py:428): same fix, scoped to per-entry
  scanning.
- ``privacy.scan`` (privacy.py:278): drop the misleading "50 ms ceiling"
  number — the actual test ceiling is 200 ms and local 1 MB runs measure
  ~64-70 ms, so the docstring number was both too tight and inaccurate.
  Replace with a linear-time claim that points at the test as the
  authoritative perf pin, and reword "prefix-anchored" to "simple
  short" since some patterns (AKIA, JWT) are word-boundary-anchored
  rather than prefix-anchored.

Pure docstring changes — no code-path or test changes. Existing
80-test privacy suite stays green; ``ruff check`` + ``ruff format
--check`` clean.

Per ``feedback_subagent_review_verification.md``: the review was
re-validated by re-reading both call sites in ``memory_crud.py``
before applying the fix, so the line numbers and stale prose were
confirmed in tree rather than taken on faith from the review output.

Co-Authored-By: Claude <[email protected]>
@memtomem memtomem merged commit 038b32a into main May 5, 2026
11 checks passed
@memtomem memtomem deleted the sec/privacy-scan-full-content branch May 5, 2026 01:36
@github-actions github-actions Bot locked and limited conversation to collaborators May 5, 2026
