
fix(storage): block duplicate chunk inserts via UNIQUE + INSERT OR IGNORE (#691) #705

Merged
memtomem merged 2 commits into main from fix/691-duplicate-chunks-storage-uniqueness on May 2, 2026

Conversation

memtomem (Owner) commented May 2, 2026

Summary

  • Closes #691 (Indexing creates duplicate chunks within a single run; no UNIQUE on content_hash). Multi-process indexing (mm web watcher + mm CLI / MCP server) on the same SQLite DB used to insert chunk rows that shared (namespace, source_file, content_hash, start_line) but differed only in id. Each process held its own asyncio.Lock; without a storage-layer guard, both INSERTs succeeded with fresh UUIDs, and the differ then reused only one of the IDs per re-index, leaving the rest as silent ghosts that survived every subsequent run.
  • Fix lands at the storage layer (UNIQUE(namespace, source_file, content_hash, start_line) + INSERT OR IGNORE; see the sketch after this list), with an idempotent startup migration that collapses each pre-existing duplicate group to its most-actively-used row before installing the index.
  • start_line is part of the key so legitimate same-content-different-position chunks (e.g. an identical TODO block reused at two locations in one file) are still preserved; race-induced dups always share start_line and are caught.
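
The resulting invariant, as a toy sqlite3 demo (illustrative schema and values, not the real chunks table):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE chunks (id TEXT PRIMARY KEY, namespace TEXT, "
        "source_file TEXT, content_hash TEXT, start_line INTEGER)"
    )
    con.execute(
        "CREATE UNIQUE INDEX idx_chunks_unique_content "
        "ON chunks(namespace, source_file, content_hash, start_line)"
    )

    ins = "INSERT OR IGNORE INTO chunks VALUES (?, ?, ?, ?, ?)"
    con.execute(ins, ("uuid-a", "ns", "doc.md", "hash-1", 10))
    con.execute(ins, ("uuid-b", "ns", "doc.md", "hash-1", 10))  # race loser: silently ignored
    con.execute(ins, ("uuid-c", "ns", "doc.md", "hash-1", 42))  # same content, new position: kept
    print(con.execute("SELECT COUNT(*) FROM chunks").fetchone()[0])  # -> 2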

What was happening

User-reported on a real DB:

total_chunks            1546
distinct_content_hash   1469
duplicate_groups          77
duplicate_rows           154

Every duplicate group was same-namespace + same-source. 64 of 77 groups had identical created_at to second precision, ruling out user-driven re-indexing. Search results visibly repeated content under different chunk ids; chunk counts in mem_stats were inflated.
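The reporting query behind those numbers isn't part of this PR; a sketch that reproduces the same four stats (grouping on bare content_hash, as the issue title's framing does, even though the enforced key is the four-column tuple; the DB path is hypothetical):

    import sqlite3

    con = sqlite3.connect("mem.db")  # hypothetical path to the affected DB
    total, distinct = con.execute(
        "SELECT COUNT(*), COUNT(DISTINCT content_hash) FROM chunks"
    ).fetchone()
    groups, dup_rows = con.execute(
        "SELECT COUNT(*), COALESCE(SUM(n), 0) FROM ("
        "  SELECT COUNT(*) AS n FROM chunks"
        "  GROUP BY content_hash HAVING COUNT(*) > 1)"
    ).fetchone()
    print(total, distinct, groups, dup_rows)  # e.g. 1546 1469 77 154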

Reproduction with two multiprocessing.Process workers indexing the same RFC file against the same DB on main:

total=26 distinct_hash=13 duplicate_groups=13   # bug

Same script after this PR:

total=13 distinct_hash=13 duplicate_groups=0    # fixed

The reproduced hashes match the prod DB bit-for-bit, confirming the multi-process race is the actual cause (not a chunker bug — MarkdownChunker.chunk_file on the same file with the same config emits 0 hash duplicates, and so does the full pre-diff pipeline including _merge_short_chunks and _add_overlap).
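
The repro script isn't part of the diff; a self-contained stand-in that exercises the same race shape (the paths, the 'default' namespace, and paragraph-based chunking are simplifications standing in for the real indexer):

    import hashlib
    import multiprocessing
    import sqlite3
    import uuid

    DB, SRC = "/tmp/repro_691.db", "/tmp/rfc.txt"  # hypothetical paths

    def index_once() -> None:
        # One "mm process": chunk the file deterministically, insert every
        # chunk under a fresh UUID. On main nothing coordinates across
        # processes, so both sets of rows land.
        con = sqlite3.connect(DB, timeout=30)
        with open(SRC) as f:
            paragraphs = f.read().split("\n\n")
        for n, text in enumerate(paragraphs):
            con.execute(
                "INSERT INTO chunks VALUES (?, 'default', ?, ?, ?)",
                (str(uuid.uuid4()), SRC, hashlib.sha256(text.encode()).hexdigest(), n),
            )
        con.commit()
        con.close()

    if __name__ == "__main__":
        setup = sqlite3.connect(DB)
        setup.execute(
            "CREATE TABLE IF NOT EXISTS chunks (id TEXT PRIMARY KEY, namespace TEXT, "
            "source_file TEXT, content_hash TEXT, start_line INTEGER)"
        )
        setup.close()
        procs = [multiprocessing.Process(target=index_once) for _ in range(2)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        # Point the duplicate-group diagnostic above at DB to see the counts.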

What changed

packages/memtomem/src/memtomem/storage/sqlite_schema.py

  • New _migrate_chunks_uniqueness helper, gated on whether idx_chunks_unique_content already exists. On a clean DB (or already-migrated DB) it's a single index lookup — zero-cost.
  • When the index is missing it groups rows by (namespace, source_file, content_hash, start_line), picks the keeper as the row with the highest (access_count + use_count) so we preserve whichever row the differ has been actively reusing across re-indexes (tie-break: oldest created_at, then id), and deletes the loser rowids from chunks_fts and chunks_vec (when present) before deleting from chunks. Sidecars first so an interrupted run never leaves an FTS/vec entry pointing at a non-existent chunks.rowid.
  • Then CREATE UNIQUE INDEX IF NOT EXISTS idx_chunks_unique_content ON chunks(namespace, source_file, content_hash, start_line). A condensed sketch of the whole helper follows this list.
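
A condensed sketch of the helper under stated assumptions (connection in autocommit on entry; engine plumbing and error reporting omitted). The BEGIN IMMEDIATE wrapper and the re-check inside the transaction come from the review follow-up commit included below:

    import sqlite3

    def _migrate_chunks_uniqueness(con: sqlite3.Connection) -> None:
        probe = (
            "SELECT 1 FROM sqlite_master "
            "WHERE type = 'index' AND name = 'idx_chunks_unique_content'"
        )
        if con.execute(probe).fetchone():
            return  # fast path: clean or already-migrated DB

        con.execute("BEGIN IMMEDIATE")  # serialise concurrent startups
        try:
            if con.execute(probe).fetchone():
                con.execute("COMMIT")  # startup-race loser: work already done
                return
            # ROW_NUMBER() needs SQLite >= 3.25; the keeper in each group is rn = 1.
            losers = [row[0] for row in con.execute(
                "SELECT rowid FROM ("
                "  SELECT rowid, ROW_NUMBER() OVER ("
                "    PARTITION BY namespace, source_file, content_hash, start_line"
                "    ORDER BY access_count + use_count DESC, created_at ASC, id ASC"
                "  ) AS rn FROM chunks"
                ") WHERE rn > 1"
            )]
            if losers:
                marks = ",".join("?" * len(losers))
                # Sidecars first: an interrupted run must never leave an FTS/vec
                # entry pointing at a missing chunks.rowid.
                for sidecar in ("chunks_fts", "chunks_vec"):
                    if con.execute(
                        "SELECT 1 FROM sqlite_master WHERE name = ?", (sidecar,)
                    ).fetchone():
                        con.execute(f"DELETE FROM {sidecar} WHERE rowid IN ({marks})", losers)
                con.execute(f"DELETE FROM chunks WHERE rowid IN ({marks})", losers)
            con.execute(
                "CREATE UNIQUE INDEX IF NOT EXISTS idx_chunks_unique_content "
                "ON chunks(namespace, source_file, content_hash, start_line)"
            )
            con.execute("COMMIT")
        except Exception:
            con.execute("ROLLBACK")
            raise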

packages/memtomem/src/memtomem/storage/sqlite_backend.py

  • upsert_chunks: chunk INSERT becomes INSERT OR IGNORE. Race losers are dropped silently at insert time.
  • The follow-up SELECT id, rowid FROM chunks WHERE id IN (...) already filters by actual row existence, so chunks_fts and chunks_vec inserts naturally skip dropped rows. The sidecar block is now wrapped in if new_rowids: for the case where every to_insert row was a race loser (otherwise placeholders(0) raised). A paraphrased slice follows this list.
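
The relevant slice of upsert_chunks after this change, paraphrased (aiosqlite-style API assumed; column list trimmed to the uniqueness key, sidecar writes elided):

    async def upsert_chunks(db, to_insert):
        await db.executemany(
            "INSERT OR IGNORE INTO chunks "
            "(id, namespace, source_file, content_hash, start_line) "
            "VALUES (?, ?, ?, ?, ?)",
            [(c.id, c.namespace, c.source_file, c.content_hash, c.start_line)
             for c in to_insert],
        )
        # Race losers never landed, so they simply don't come back here.
        ids = [c.id for c in to_insert]
        marks = ",".join("?" * len(ids))
        cur = await db.execute(
            f"SELECT id, rowid FROM chunks WHERE id IN ({marks})", ids
        )
        new_rowids = dict(await cur.fetchall())
        if new_rowids:  # every to_insert row may have been a race loser
            ...          # chunks_fts / chunks_vec inserts keyed off new_rowids
        await db.commit()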

packages/memtomem/src/memtomem/tools/export_import.py

  • on_conflict="duplicate" mode kept for back-compat but documented and commented to reflect that the storage invariant is now authoritative — the mode no longer materialises duplicate rows on disk.

Tests

  • test_storage.py::TestChunkUniqueness — sequential and within-batch twin chunks (same content_hash, different id) collapse to one row.
  • test_sqlite_schema.py::TestDuplicateChunksMigration — pre-#691 dup rows are cleaned up on first startup with the most-active row preserved; second startup is a no-op.
  • test_export_import_roundtrip.py::TestOnConflictModes::test_duplicate_mode_dedups_at_storage_layer — replaces the old test_duplicate_mode_backcompat; pins the new dedup behaviour.
  • Seed helpers in test_chunk_links_{reader,writer,schema}.py were inserting multiple rows with content_hash='' and source_file='' under the same namespace — which now correctly violates the UNIQUE index. Helpers updated to key content_hash off chunk_id so seed-only tests don't double as smoke tests for the constraint (sketched after this list).
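
The seed-helper fix, in spirit (helper name and column list are illustrative, not the actual helper signature):

    def seed_chunk(con, chunk_id: str, namespace: str = "ns") -> None:
        # Before: every seeded row used content_hash='' and source_file='',
        # so a second row in the same namespace now trips the UNIQUE index.
        # Keying the hash off chunk_id keeps each seeded row distinct.
        con.execute(
            "INSERT INTO chunks (id, namespace, source_file, content_hash, start_line) "
            "VALUES (?, ?, '', ?, 0)",
            (chunk_id, namespace, f"hash-{chunk_id}"),
        )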

Test plan

  • uv run pytest -m "not ollama" — full suite, 3891 passed.
  • uv run ruff check packages/memtomem/src packages/memtomem/tests
  • uv run ruff format --check packages/memtomem/src packages/memtomem/tests
  • Two-process race repro on a clean DB shows duplicate_groups=0 after fix (was 13).
  • Migration test pins keeper selection: the row with the higher access_count + use_count is preserved over the older but unused row (a minimal version is sketched below).
  • Manual smoke: create_tables runs twice on the same DB without error (idempotency).
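
A minimal pin of the keeper rule, reusing the _migrate_chunks_uniqueness sketch above (the real test in TestDuplicateChunksMigration is fuller and also checks the sidecars):

    import sqlite3

    def test_keeper_is_most_active_row(tmp_path):
        con = sqlite3.connect(tmp_path / "m.db")
        con.execute(
            "CREATE TABLE chunks (id TEXT PRIMARY KEY, namespace TEXT, "
            "source_file TEXT, content_hash TEXT, start_line INTEGER, "
            "access_count INTEGER, use_count INTEGER, created_at TEXT)"
        )
        # Older-but-idle row vs newer row the differ kept reusing.
        con.execute("INSERT INTO chunks VALUES ('idle', 'ns', 'f.md', 'h', 1, 0, 0, '2026-01-01')")
        con.execute("INSERT INTO chunks VALUES ('hot',  'ns', 'f.md', 'h', 1, 5, 3, '2026-02-01')")
        con.commit()  # the migration sketch assumes autocommit on entry
        _migrate_chunks_uniqueness(con)
        assert [r[0] for r in con.execute("SELECT id FROM chunks")] == ["hot"]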

🤖 Generated with Claude Code

pandas-studio and others added 2 commits May 2, 2026 13:46
…NORE (#691)

Multi-process indexing (mm web watcher + mm CLI / MCP server) on the same
SQLite DB used to insert rows that shared
(namespace, source_file, content_hash, start_line) but differed only in id.
Each process held its own asyncio.Lock and there was no storage-layer
guard, so both INSERT calls succeeded with fresh UUIDs. The differ then
reused only one of the IDs per re-indexing pass, leaving the rest as
silent ghosts that survived every subsequent run.

Multi-engine race reproduces the exact prod hash on a clean DB:

  $ python /tmp/repro_691.py
  total=26 distinct_hash=13 duplicate_groups=13     # before
  total=13 distinct_hash=13 duplicate_groups=0      # after

Fix is at the storage layer where coordination is actually shared:

  * sqlite_schema.py: idempotent migration that, when
    idx_chunks_unique_content is missing, collapses each group to its
    most-actively-used row (highest access_count + use_count, tie-break
    on oldest created_at, then id) and removes the chunks_fts /
    chunks_vec sidecar entries before installing the UNIQUE INDEX.
    Subsequent startups short-circuit on the index check.

  * sqlite_backend.py upsert_chunks: switches the chunk INSERT to
    INSERT OR IGNORE so multi-process race losers are dropped silently
    at insert time. The follow-up SELECT id, rowid WHERE id IN (...)
    already filters by actual existence, so chunks_fts and chunks_vec
    inserts naturally skip dropped rows. Guarded the sidecar block on
    a non-empty new_rowid_map for the case where every row in to_insert
    was a race loser.

  * tools/export_import.py: documents that on_conflict="duplicate" no
    longer materialises duplicate rows on disk — the storage invariant
    is now authoritative. Mode is preserved for back-compat.

start_line is part of the key so two paragraphs that happen to be
identical at different positions in the same file are still both
indexed; race-induced dups always share start_line and are therefore
caught.

Existing seed helpers in test_chunk_links_{reader,writer,schema}.py
were inserting multiple rows with content_hash='' and source_file=''
under the same namespace, which now correctly violates the UNIQUE
index. Helpers updated to key content_hash off chunk_id so seed-only
tests don't double as smoke tests for the constraint.

Regression coverage:
  * test_storage.py::TestChunkUniqueness — sequential and within-batch
    twins collapse to one row.
  * test_sqlite_schema.py::TestDuplicateChunksMigration — pre-#691
    dups are cleaned up on first startup with the most-active row
    preserved; second startup is a no-op.
  * test_export_import_roundtrip.py — duplicate mode now dedups via
    storage invariant.

Co-Authored-By: Claude <[email protected]>
…car (#691)

Review follow-ups on PR #705:

* Concurrent-startup safety: two processes booting simultaneously
  could each see ``already_migrated=False`` on the cheap fast-path,
  then both attempt the cleanup. Wrap the migration body in
  ``BEGIN IMMEDIATE`` / ``COMMIT`` so the second process blocks on
  SQLite's RESERVED lock until the first commits, then re-check the
  index existence inside the transaction so the loser short-circuits.
  Idempotent ``DELETE`` + ``CREATE UNIQUE INDEX IF NOT EXISTS`` keep
  the migration crash-safe across restarts.

* Docstring spells out the SQLite ≥ 3.25 requirement for
  ``ROW_NUMBER()`` (Python 3.12's bundled sqlite3 is well past).

* Negative-pin assertion in
  ``test_collapses_existing_dup_rows_and_installs_unique_index``
  (per ``feedback_pin_invert_symmetric_assertion``): asserting the
  keeper survived isn't enough. Seed now writes matching ``chunks_fts``
  rows alongside ``chunks`` (mirroring the prod write path through
  ``upsert_chunks``) so the test can pin both halves of the cleanup —
  loser id is gone from ``chunks`` and the loser's rowid is also gone
  from ``chunks_fts``.

Co-Authored-By: Claude <[email protected]>
memtomem force-pushed the fix/691-duplicate-chunks-storage-uniqueness branch from 01c1ca4 to 2165d51 on May 2, 2026 04:46
memtomem merged commit 1076c13 into main on May 2, 2026 (8 of 9 checks passed)
github-actions bot locked and limited conversation to collaborators on May 2, 2026
memtomem deleted the fix/691-duplicate-chunks-storage-uniqueness branch on May 2, 2026 04:59
