
fix(storage): block duplicate chunk inserts via UNIQUE + INSERT OR IGNORE (#691) #705

Merged
memtomem merged 2 commits into main from fix/691-duplicate-chunks-storage-uniqueness on May 2, 2026

Conversation

memtomem (Owner) commented May 2, 2026

Summary

  • Closes #691 (Indexing creates duplicate chunks within a single run; no UNIQUE on content_hash). Multi-process indexing (mm web watcher + mm CLI / MCP server) on the same SQLite DB used to insert chunk rows that shared (namespace, source_file, content_hash, start_line) but differed only in id. Each process held its own asyncio.Lock; without a storage-layer guard, both INSERTs succeeded with fresh UUIDs, and the differ then reused only one of the IDs per re-index, leaving the rest as silent ghosts that survived every subsequent run.
  • Fix lands at the storage layer (UNIQUE(namespace, source_file, content_hash, start_line) + INSERT OR IGNORE; see the sketch after this list), with an idempotent startup migration that collapses each pre-existing duplicate group to its most-actively-used row before installing the index.
  • start_line is part of the key so legitimate same-content-different-position chunks (e.g. an identical TODO block reused at two locations in one file) are still preserved; race-induced dups always share start_line and are caught.
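
The resulting invariant, as a toy sqlite3 demo (illustrative schema and values, not the real chunks table):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE chunks (id TEXT PRIMARY KEY, namespace TEXT, "
        "source_file TEXT, content_hash TEXT, start_line INTEGER)"
    )
    con.execute(
        "CREATE UNIQUE INDEX idx_chunks_unique_content "
        "ON chunks(namespace, source_file, content_hash, start_line)"
    )

    ins = "INSERT OR IGNORE INTO chunks VALUES (?, ?, ?, ?, ?)"
    con.execute(ins, ("uuid-a", "ns", "doc.md", "hash-1", 10))
    con.execute(ins, ("uuid-b", "ns", "doc.md", "hash-1", 10))  # race loser: silently ignored
    con.execute(ins, ("uuid-c", "ns", "doc.md", "hash-1", 42))  # same content, new position: kept
    print(con.execute("SELECT COUNT(*) FROM chunks").fetchone()[0])  # -> 2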

What was happening

User-reported on a real DB:

total_chunks            1546
distinct_content_hash   1469
duplicate_groups          77
duplicate_rows           154

Every duplicate group was same-namespace + same-source. 64 of 77 groups had identical created_at to second precision, ruling out user-driven re-indexing. Search results visibly repeated content under different chunk ids; chunk counts in mem_stats were inflated.
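The reporting query behind those numbers isn't part of this PR; a sketch that reproduces the same four stats (grouping on bare content_hash, as the issue title's framing does, even though the enforced key is the four-column tuple; the DB path is hypothetical):

    import sqlite3

    con = sqlite3.connect("mem.db")  # hypothetical path to the affected DB
    total, distinct = con.execute(
        "SELECT COUNT(*), COUNT(DISTINCT content_hash) FROM chunks"
    ).fetchone()
    groups, dup_rows = con.execute(
        "SELECT COUNT(*), COALESCE(SUM(n), 0) FROM ("
        "  SELECT COUNT(*) AS n FROM chunks"
        "  GROUP BY content_hash HAVING COUNT(*) > 1)"
    ).fetchone()
    print(total, distinct, groups, dup_rows)  # e.g. 1546 1469 77 154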

Reproduction with two multiprocessing.Process workers indexing the same RFC file against the same DB on main:

total=26 distinct_hash=13 duplicate_groups=13   # bug

Same script after this PR:

total=13 distinct_hash=13 duplicate_groups=0    # fixed

The reproduced hashes match the prod DB bit-for-bit, confirming the multi-process race is the actual cause (not a chunker bug — MarkdownChunker.chunk_file on the same file with the same config emits 0 hash duplicates, and so does the full pre-diff pipeline including _merge_short_chunks and _add_overlap).
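
The repro script isn't part of the diff; a self-contained stand-in that exercises the same race shape (the paths, the 'default' namespace, and paragraph-based chunking are simplifications standing in for the real indexer):

    import hashlib
    import multiprocessing
    import sqlite3
    import uuid

    DB, SRC = "/tmp/repro_691.db", "/tmp/rfc.txt"  # hypothetical paths

    def index_once() -> None:
        # One "mm process": chunk the file deterministically, insert every
        # chunk under a fresh UUID. On main nothing coordinates across
        # processes, so both sets of rows land.
        con = sqlite3.connect(DB, timeout=30)
        with open(SRC) as f:
            paragraphs = f.read().split("\n\n")
        for n, text in enumerate(paragraphs):
            con.execute(
                "INSERT INTO chunks VALUES (?, 'default', ?, ?, ?)",
                (str(uuid.uuid4()), SRC, hashlib.sha256(text.encode()).hexdigest(), n),
            )
        con.commit()
        con.close()

    if __name__ == "__main__":
        setup = sqlite3.connect(DB)
        setup.execute(
            "CREATE TABLE IF NOT EXISTS chunks (id TEXT PRIMARY KEY, namespace TEXT, "
            "source_file TEXT, content_hash TEXT, start_line INTEGER)"
        )
        setup.close()
        procs = [multiprocessing.Process(target=index_once) for _ in range(2)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        # Point the duplicate-group diagnostic above at DB to see the counts.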

What changed

packages/memtomem/src/memtomem/storage/sqlite_schema.py

  • New _migrate_chunks_uniqueness helper, gated on whether idx_chunks_unique_content already exists. On a clean DB (or already-migrated DB) it's a single index lookup — zero-cost.
  • When the index is missing it groups rows by (namespace, source_file, content_hash, start_line), picks the keeper as the row with the highest (access_count + use_count) so we preserve whichever row the differ has been actively reusing across re-indexes (tie-break: oldest created_at, then id), and deletes the loser rowids from chunks_fts and chunks_vec (when present) before deleting from chunks. Sidecars first so an interrupted run never leaves an FTS/vec entry pointing at a non-existent chunks.rowid.
  • Then CREATE UNIQUE INDEX IF NOT EXISTS idx_chunks_unique_content ON chunks(namespace, source_file, content_hash, start_line). A condensed sketch of the whole helper follows this list.
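
A condensed sketch of the helper under stated assumptions (connection in autocommit on entry; engine plumbing and error reporting omitted). The BEGIN IMMEDIATE wrapper and the re-check inside the transaction come from the review follow-up commit included below:

    import sqlite3

    def _migrate_chunks_uniqueness(con: sqlite3.Connection) -> None:
        probe = (
            "SELECT 1 FROM sqlite_master "
            "WHERE type = 'index' AND name = 'idx_chunks_unique_content'"
        )
        if con.execute(probe).fetchone():
            return  # fast path: clean or already-migrated DB

        con.execute("BEGIN IMMEDIATE")  # serialise concurrent startups
        try:
            if con.execute(probe).fetchone():
                con.execute("COMMIT")  # startup-race loser: work already done
                return
            # ROW_NUMBER() needs SQLite >= 3.25; the keeper in each group is rn = 1.
            losers = [row[0] for row in con.execute(
                "SELECT rowid FROM ("
                "  SELECT rowid, ROW_NUMBER() OVER ("
                "    PARTITION BY namespace, source_file, content_hash, start_line"
                "    ORDER BY access_count + use_count DESC, created_at ASC, id ASC"
                "  ) AS rn FROM chunks"
                ") WHERE rn > 1"
            )]
            if losers:
                marks = ",".join("?" * len(losers))
                # Sidecars first: an interrupted run must never leave an FTS/vec
                # entry pointing at a missing chunks.rowid.
                for sidecar in ("chunks_fts", "chunks_vec"):
                    if con.execute(
                        "SELECT 1 FROM sqlite_master WHERE name = ?", (sidecar,)
                    ).fetchone():
                        con.execute(f"DELETE FROM {sidecar} WHERE rowid IN ({marks})", losers)
                con.execute(f"DELETE FROM chunks WHERE rowid IN ({marks})", losers)
            con.execute(
                "CREATE UNIQUE INDEX IF NOT EXISTS idx_chunks_unique_content "
                "ON chunks(namespace, source_file, content_hash, start_line)"
            )
            con.execute("COMMIT")
        except Exception:
            con.execute("ROLLBACK")
            raise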

packages/memtomem/src/memtomem/storage/sqlite_backend.py

  • upsert_chunks: chunk INSERT becomes INSERT OR IGNORE. Race losers are dropped silently at insert time.
  • The follow-up SELECT id, rowid FROM chunks WHERE id IN (...) already filters by actual row existence, so chunks_fts and chunks_vec inserts naturally skip dropped rows. The sidecar block is now wrapped in if new_rowids: for the case where every to_insert row was a race loser (otherwise placeholders(0) raised). A paraphrased slice follows this list.
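
The relevant slice of upsert_chunks after this change, paraphrased (aiosqlite-style API assumed; column list trimmed to the uniqueness key, sidecar writes elided):

    async def upsert_chunks(db, to_insert):
        await db.executemany(
            "INSERT OR IGNORE INTO chunks "
            "(id, namespace, source_file, content_hash, start_line) "
            "VALUES (?, ?, ?, ?, ?)",
            [(c.id, c.namespace, c.source_file, c.content_hash, c.start_line)
             for c in to_insert],
        )
        # Race losers never landed, so they simply don't come back here.
        ids = [c.id for c in to_insert]
        marks = ",".join("?" * len(ids))
        cur = await db.execute(
            f"SELECT id, rowid FROM chunks WHERE id IN ({marks})", ids
        )
        new_rowids = dict(await cur.fetchall())
        if new_rowids:  # every to_insert row may have been a race loser
            ...          # chunks_fts / chunks_vec inserts keyed off new_rowids
        await db.commit()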

packages/memtomem/src/memtomem/tools/export_import.py

  • on_conflict="duplicate" mode kept for back-compat but documented and commented to reflect that the storage invariant is now authoritative — the mode no longer materialises duplicate rows on disk.

Tests

  • test_storage.py::TestChunkUniqueness — sequential and within-batch twin chunks (same content_hash, different id) collapse to one row.
  • test_sqlite_schema.py::TestDuplicateChunksMigration — pre-#691 dup rows are cleaned up on first startup with the most-active row preserved; second startup is a no-op.
  • test_export_import_roundtrip.py::TestOnConflictModes::test_duplicate_mode_dedups_at_storage_layer — replaces the old test_duplicate_mode_backcompat; pins the new dedup behaviour.
  • Seed helpers in test_chunk_links_{reader,writer,schema}.py were inserting multiple rows with content_hash='' and source_file='' under the same namespace — which now correctly violates the UNIQUE index. Helpers updated to key content_hash off chunk_id so seed-only tests don't double as smoke tests for the constraint (sketched after this list).
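
The seed-helper fix, in spirit (helper name and column list are illustrative, not the actual helper signature):

    def seed_chunk(con, chunk_id: str, namespace: str = "ns") -> None:
        # Before: every seeded row used content_hash='' and source_file='',
        # so a second row in the same namespace now trips the UNIQUE index.
        # Keying the hash off chunk_id keeps each seeded row distinct.
        con.execute(
            "INSERT INTO chunks (id, namespace, source_file, content_hash, start_line) "
            "VALUES (?, ?, '', ?, 0)",
            (chunk_id, namespace, f"hash-{chunk_id}"),
        )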

Test plan

  • uv run pytest -m "not ollama" — full suite, 3891 passed.
  • uv run ruff check packages/memtomem/src packages/memtomem/tests
  • uv run ruff format --check packages/memtomem/src packages/memtomem/tests
  • Two-process race repro on a clean DB shows duplicate_groups=0 after fix (was 13).
  • Migration test pins keeper selection: the row with the higher access_count + use_count is preserved over the older but unused row (a minimal version is sketched below).
  • Manual smoke: create_tables runs twice on the same DB without error (idempotency).
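
A minimal pin of the keeper rule, reusing the _migrate_chunks_uniqueness sketch above (the real test in TestDuplicateChunksMigration is fuller and also checks the sidecars):

    import sqlite3

    def test_keeper_is_most_active_row(tmp_path):
        con = sqlite3.connect(tmp_path / "m.db")
        con.execute(
            "CREATE TABLE chunks (id TEXT PRIMARY KEY, namespace TEXT, "
            "source_file TEXT, content_hash TEXT, start_line INTEGER, "
            "access_count INTEGER, use_count INTEGER, created_at TEXT)"
        )
        # Older-but-idle row vs newer row the differ kept reusing.
        con.execute("INSERT INTO chunks VALUES ('idle', 'ns', 'f.md', 'h', 1, 0, 0, '2026-01-01')")
        con.execute("INSERT INTO chunks VALUES ('hot',  'ns', 'f.md', 'h', 1, 5, 3, '2026-02-01')")
        con.commit()  # the migration sketch assumes autocommit on entry
        _migrate_chunks_uniqueness(con)
        assert [r[0] for r in con.execute("SELECT id FROM chunks")] == ["hot"]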

🤖 Generated with Claude Code

pandas-studio and others added 2 commits May 2, 2026 13:46
…NORE (#691)

Multi-process indexing (mm web watcher + mm CLI / MCP server) on the same
SQLite DB used to insert rows that shared
(namespace, source_file, content_hash, start_line) but differed only in id.
Each process held its own asyncio.Lock and there was no storage-layer
guard, so both INSERT calls succeeded with fresh UUIDs. The differ then
reused only one of the IDs per re-indexing pass, leaving the rest as
silent ghosts that survived every subsequent run.

Multi-engine race reproduces the exact prod hash on a clean DB:

  $ python /tmp/repro_691.py
  total=26 distinct_hash=13 duplicate_groups=13     # before
  total=13 distinct_hash=13 duplicate_groups=0      # after

Fix is at the storage layer where coordination is actually shared:

  * sqlite_schema.py: idempotent migration that, when
    idx_chunks_unique_content is missing, collapses each group to its
    most-actively-used row (highest access_count + use_count, tie-break
    on oldest created_at, then id) and removes the chunks_fts /
    chunks_vec sidecar entries before installing the UNIQUE INDEX.
    Subsequent startups short-circuit on the index check.

  * sqlite_backend.py upsert_chunks: switches the chunk INSERT to
    INSERT OR IGNORE so multi-process race losers are dropped silently
    at insert time. The follow-up SELECT id, rowid WHERE id IN (...)
    already filters by actual existence, so chunks_fts and chunks_vec
    inserts naturally skip dropped rows. Guarded the sidecar block on
    a non-empty new_rowid_map for the case where every row in to_insert
    was a race loser.

  * tools/export_import.py: documents that on_conflict="duplicate" no
    longer materialises duplicate rows on disk — the storage invariant
    is now authoritative. Mode is preserved for back-compat.

start_line is part of the key so two paragraphs that happen to be
identical at different positions in the same file are still both
indexed; race-induced dups always share start_line and are therefore
caught.

Existing seed helpers in test_chunk_links_{reader,writer,schema}.py
were inserting multiple rows with content_hash='' and source_file=''
under the same namespace, which now correctly violates the UNIQUE
index. Helpers updated to key content_hash off chunk_id so seed-only
tests don't double as smoke tests for the constraint.

Regression coverage:
  * test_storage.py::TestChunkUniqueness — sequential and within-batch
    twins collapse to one row.
  * test_sqlite_schema.py::TestDuplicateChunksMigration — pre-#691
    dups are cleaned up on first startup with the most-active row
    preserved; second startup is a no-op.
  * test_export_import_roundtrip.py — duplicate mode now dedups via
    storage invariant.

Co-Authored-By: Claude <[email protected]>
…car (#691)

Review follow-ups on PR #705:

* Concurrent-startup safety: two processes booting simultaneously
  could each see ``already_migrated=False`` on the cheap fast-path,
  then both attempt the cleanup. Wrap the migration body in
  ``BEGIN IMMEDIATE`` / ``COMMIT`` so the second process blocks on
  SQLite's RESERVED lock until the first commits, then re-check the
  index existence inside the transaction so the loser short-circuits.
  Idempotent ``DELETE`` + ``CREATE UNIQUE INDEX IF NOT EXISTS`` keep
  the migration crash-safe across restarts.

* Docstring spells out the SQLite ≥ 3.25 requirement for
  ``ROW_NUMBER()`` (Python 3.12's bundled sqlite3 is well past).

* Negative-pin assertion in
  ``test_collapses_existing_dup_rows_and_installs_unique_index``
  (per ``feedback_pin_invert_symmetric_assertion``): asserting the
  keeper survived isn't enough. Seed now writes matching ``chunks_fts``
  rows alongside ``chunks`` (mirroring the prod write path through
  ``upsert_chunks``) so the test can pin both halves of the cleanup —
  loser id is gone from ``chunks`` and the loser's rowid is also gone
  from ``chunks_fts``.

Co-Authored-By: Claude <[email protected]>
memtomem force-pushed the fix/691-duplicate-chunks-storage-uniqueness branch from 01c1ca4 to 2165d51 on May 2, 2026 04:46
memtomem merged commit 1076c13 into main on May 2, 2026 (8 of 9 checks passed)
github-actions bot locked and limited conversation to collaborators on May 2, 2026
memtomem deleted the fix/691-duplicate-chunks-storage-uniqueness branch on May 2, 2026 04:59
