web/upload: redaction-blocked retry duplicates clean files in mixed batches #803

@memtomem

Description

Background

PR #802 wired the SPA confirm-and-retry UX for the trust-boundary redaction
guard introduced in PR #784. For /api/upload the helper
uploadFilesWithRedactionRetry() (packages/memtomem/src/memtomem/web/static/app.js)
re-issues the original FormData with ?force_unsafe=true after the
user confirms the bypass.

The PR comment notes the trade-off:

  Re-uploading already-OK files is idempotent thanks to the server's
  _{mtime_ns} collision suffix (system.py:1121). Acceptable trade-off
  given the rarity of mixed batches.

That framing is too generous: in every mixed batch (≥1 clean + ≥1
blocked file) the clean files are silently indexed twice — once on the
first pass, once on the retry under a different filename — not just in
rare cases.

Reproduction

  1. mm web → Index → Upload mode.
  2. Select two files in the same batch:
    • clean.md — plain prose, no secret-like patterns.
    • secret.md — contains a token matching the redaction patterns
      (e.g. sk-ant-…).
  3. Click Upload.
  4. First response: clean.md indexed and written to disk; secret.md
    returns error="redaction_blocked (hits=N)".
  5. Confirm the bypass dialog.
  6. Inspect the upload directory: clean.md exists twice — once under its
    original name, once under clean_{mtime_ns}.md. Both are indexed.

Root cause

uploadFilesWithRedactionRetry() posts the unmodified original
FormData on retry; the helper has no awareness of which entries were
blocked vs which were already persisted. The server's _{mtime_ns}
suffix in system.py:1121 was designed for genuine name collisions, not
as a deduplication mechanism for retry batches.
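The reported flow can be pictured as a minimal sketch. Names here are hypothetical (the real helper is uploadFilesWithRedactionRetry() in app.js), and `post(url, formData)` stands in for the fetch call:

```javascript
// Sketch of the retry flow as reported (hypothetical names).
// `post` stands in for the fetch to /api/upload; `confirmBypass`
// stands in for the SPA's confirmation dialog.
async function uploadWithRetry(formData, post, confirmBypass) {
  const first = await post("/api/upload", formData);
  const blocked = first.results.some(
    (r) => r.error && r.error.startsWith("redaction_blocked")
  );
  if (blocked && (await confirmBypass())) {
    // Bug: the retry re-posts the ORIGINAL FormData, so clean files
    // already written on the first pass are uploaded again and land
    // under a _{mtime_ns} collision suffix.
    return post("/api/upload?force_unsafe=true", formData);
  }
  return first;
}
```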

Fix options

  1. Client-side scope reduction (preferred):
    In uploadFilesWithRedactionRetry(), after parsing per-file errors,
    rebuild a new FormData containing only the blocked files (matched by
    filename against the per-file error rows) and POST that to the
    ?force_unsafe=true URL. Clean files remain written from the first
    pass; the retry only adds the bypassed ones.

  2. Server-side skip-if-written:
    Add a check in the upload handler that, when force_unsafe=true and a
    file with the same content hash / mtime is already indexed, treats it
    as a no-op rather than rewriting under _{mtime_ns}. This has a wider
    blast radius and changes existing collision semantics; it is only
    worth it if another caller hits the same problem.

Option 1 is local to the SPA and matches how a thoughtful user would
interpret "retry the bypass" (retry the blocked part, not the whole
batch).
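
Option 1 can be sketched roughly as follows. The names are hypothetical (the real code lives in uploadFilesWithRedactionRetry(), and the exact shape of the per-file error rows is an assumption based on the error string quoted above):

```javascript
// Pure helper: given the originally selected files and the per-file
// error rows from the first response, return only the blocked files.
// Assumes rows look like { filename, error } (hypothetical shape).
function blockedSubset(files, errorRows) {
  const blockedNames = new Set(
    errorRows
      .filter((row) => row.error && row.error.startsWith("redaction_blocked"))
      .map((row) => row.filename)
  );
  return files.filter((file) => blockedNames.has(file.name));
}

// Retry path: rebuild a fresh FormData from the blocked subset only,
// then POST it to the ?force_unsafe=true URL (fetch call elided).
function buildRetryFormData(files, errorRows) {
  const fd = new FormData();
  for (const file of blockedSubset(files, errorRows)) {
    fd.append("files", file, file.name);
  }
  return fd;
}
```

With this shape, clean files stay written from the first pass and never travel twice, so the _{mtime_ns} suffix is never exercised by the retry.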

Test plan

  • New Playwright spec in tests/web/test_redaction_blocked_retry.py:
    mixed batch (1 clean + 1 blocked) → first pass → confirm → assert the
    retry FormData contains only the blocked filename.
  • Manual mm web smoke: confirm only the blocked file appears in the
    upload directory after retry; clean file is not duplicated.

Out of scope

Refs PR #802, PR #784, issue #785.
