Conversation
memtomem pushed commits that referenced this pull request on May 2, 2026:
…car (#691)

Review follow-ups on PR #705:

* Concurrent-startup safety: two processes booting simultaneously could each see ``already_migrated=False`` on the cheap fast-path, then both attempt the cleanup. Wrap the migration body in ``BEGIN IMMEDIATE`` / ``COMMIT`` so the second process blocks on SQLite's RESERVED lock until the first commits, then re-check the index existence inside the transaction so the loser short-circuits. Idempotent ``DELETE`` + ``CREATE UNIQUE INDEX IF NOT EXISTS`` keep the migration crash-safe across restarts.

* Docstring spells out the SQLite ≥ 3.25 requirement for ``ROW_NUMBER()`` (Python 3.12's bundled sqlite3 is well past that).

* Negative-pin assertion in ``test_collapses_existing_dup_rows_and_installs_unique_index`` (per ``feedback_pin_invert_symmetric_assertion``): asserting the keeper survived isn't enough. Seed now writes matching ``chunks_fts`` rows alongside ``chunks`` (mirroring the prod write path through ``upsert_chunks``) so the test can pin both halves of the cleanup — loser id is gone from ``chunks`` and the loser's rowid is also gone from ``chunks_fts``.

Co-Authored-By: Claude <[email protected]>
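A minimal sketch of that locking pattern, assuming a plain `sqlite3` connection opened with `isolation_level=None` (autocommit) so the explicit `BEGIN IMMEDIATE` is ours to manage; names mirror the commit message, not the project's actual code:

```python
import sqlite3

def migrate_chunks_uniqueness(conn: sqlite3.Connection) -> None:
    """Sketch of a concurrency-safe, idempotent startup migration."""
    def index_exists() -> bool:
        return conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'index' AND name = ?",
            ("idx_chunks_unique_content",),
        ).fetchone() is not None

    if index_exists():  # cheap fast-path on already-migrated DBs
        return
    # BEGIN IMMEDIATE takes the RESERVED lock up front; a second process
    # booting concurrently blocks here until the first COMMITs.
    conn.execute("BEGIN IMMEDIATE")
    try:
        # Re-check inside the transaction: the race loser now sees the
        # index the winner installed and short-circuits.
        if not index_exists():
            # ... idempotent DELETE of duplicate rows goes here ...
            conn.execute(
                "CREATE UNIQUE INDEX IF NOT EXISTS idx_chunks_unique_content"
                " ON chunks(namespace, source_file, content_hash, start_line)"
            )
        conn.execute("COMMIT")
    except BaseException:
        conn.execute("ROLLBACK")
        raise
```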
…NORE (#691)

Multi-process indexing (mm web watcher + mm CLI / MCP server) on the same SQLite DB used to insert rows that shared (namespace, source_file, content_hash, start_line) but differed only in id. Each process held its own asyncio.Lock and there was no storage-layer guard, so both INSERT calls succeeded with fresh UUIDs. The differ then reused only one of the IDs per re-indexing pass, leaving the rest as silent ghosts that survived every subsequent run.

Multi-engine race reproduces the exact prod hash on a clean DB:

    $ python /tmp/repro_691.py
    total=26 distinct_hash=13 duplicate_groups=13   # before
    total=13 distinct_hash=13 duplicate_groups=0    # after

Fix is at the storage layer where coordination is actually shared:

* sqlite_schema.py: idempotent migration that, when idx_chunks_unique_content is missing, collapses each group to its most-actively-used row (highest access_count + use_count, tie-break on oldest created_at, then id) and removes the chunks_fts / chunks_vec sidecar entries before installing the UNIQUE INDEX. Subsequent startups short-circuit on the index check.

* sqlite_backend.py upsert_chunks: switches the chunk INSERT to INSERT OR IGNORE so multi-process race losers are dropped silently at insert time. The follow-up SELECT id, rowid WHERE id IN (...) already filters by actual existence, so chunks_fts and chunks_vec inserts naturally skip dropped rows. Guarded the sidecar block on a non-empty new_rowid_map for the case where every row in to_insert was a race loser.

* tools/export_import.py: documents that on_conflict="duplicate" no longer materialises duplicate rows on disk — the storage invariant is now authoritative. Mode is preserved for back-compat.

start_line is part of the key so two paragraphs that happen to be identical at different positions in the same file are still both indexed; race-induced dups always share start_line and are therefore caught.

Existing seed helpers in test_chunk_links_{reader,writer,schema}.py were inserting multiple rows with content_hash='' and source_file='' under the same namespace, which now correctly violates the UNIQUE index. Helpers updated to key content_hash off chunk_id so seed-only tests don't double as smoke tests for the constraint.

Regression coverage:

* test_storage.py::TestChunkUniqueness — sequential and within-batch twins collapse to one row.
* test_sqlite_schema.py::TestDuplicateChunksMigration — pre-#691 dups are cleaned up on first startup with the most-active row preserved; second startup is a no-op.
* test_export_import_roundtrip.py — duplicate mode now dedups via storage invariant.

Co-Authored-By: Claude <[email protected]>
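The insert-time half of the fix, as a hedged sketch — a simplified table and column list, not the real `upsert_chunks` signature:

```python
import sqlite3

def upsert_chunks_sketch(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    # Race losers violate idx_chunks_unique_content and are dropped silently.
    conn.executemany(
        "INSERT OR IGNORE INTO chunks"
        " (id, namespace, source_file, content_hash, start_line, content)"
        " VALUES (?, ?, ?, ?, ?, ?)",
        rows,
    )
    ids = [r[0] for r in rows]
    marks = ", ".join("?" * len(ids))
    # Only rows that actually landed come back, so the sidecar inserts
    # naturally skip the dropped ones.
    new_rowid_map = dict(
        conn.execute(f"SELECT id, rowid FROM chunks WHERE id IN ({marks})", ids)
    )
    if new_rowid_map:  # every row may have been a race loser
        ...  # insert into chunks_fts / chunks_vec keyed by the surviving rowids
    conn.commit()
```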
Force-pushed from 01c1ca4 to 2165d51.
Summary
Multi-process indexing (mm web watcher + mm CLI / MCP server) on the same SQLite DB used to insert chunk rows that shared `(namespace, source_file, content_hash, start_line)` but differed only in `id`. Each process held its own `asyncio.Lock`; without a storage-layer guard both `INSERT`s succeeded with fresh UUIDs, and the differ then reused only one of the IDs per re-index, leaving the rest as silent ghosts that survived every subsequent run.

The fix is storage-layer uniqueness (`UNIQUE(namespace, source_file, content_hash, start_line)` + `INSERT OR IGNORE`), with an idempotent startup migration that collapses each pre-existing duplicate group to its most-actively-used row before installing the index. `start_line` is part of the key, so legitimate same-content-different-position chunks (e.g. an identical TODO block reused at two locations in one file) are still preserved; race-induced dups always share `start_line` and are caught.
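A self-contained demonstration of the invariant (table shape reduced to the key columns; not the project's schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks (id TEXT PRIMARY KEY, namespace TEXT,"
    " source_file TEXT, content_hash TEXT, start_line INTEGER)"
)
conn.execute(
    "CREATE UNIQUE INDEX idx_chunks_unique_content"
    " ON chunks(namespace, source_file, content_hash, start_line)"
)
key = ("ns", "doc.md", "abc123", 10)
conn.execute("INSERT OR IGNORE INTO chunks VALUES ('uuid-1', ?, ?, ?, ?)", key)
# Same logical chunk, fresh UUID — the race loser is dropped silently.
conn.execute("INSERT OR IGNORE INTO chunks VALUES ('uuid-2', ?, ?, ?, ?)", key)
print(conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0])  # 1
```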
What was happening

User-reported on a real DB:
Every duplicate group was same-namespace + same-source. 64 of 77 groups had identical `created_at` to second precision, ruling out user-driven re-indexing. Search results visibly repeated content under different chunk ids, and chunk counts in `mem_stats` were inflated.
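A diagnostic query of the kind that surfaces these groups; a sketch assuming the `chunks` columns named above, with a hypothetical DB path:

```python
import sqlite3

conn = sqlite3.connect("memtomem.db")  # hypothetical path
groups = conn.execute(
    """
    SELECT namespace, source_file, content_hash, start_line,
           COUNT(*) AS copies,
           COUNT(DISTINCT created_at) AS distinct_created_at
    FROM chunks
    GROUP BY namespace, source_file, content_hash, start_line
    HAVING COUNT(*) > 1
    """
).fetchall()
print(f"duplicate_groups={len(groups)}")
```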
Reproduction with two `multiprocessing.Process` workers indexing the same RFC file against the same DB on `main`:

```
$ python /tmp/repro_691.py
total=26 distinct_hash=13 duplicate_groups=13
```

Same script after this PR:

```
$ python /tmp/repro_691.py
total=13 distinct_hash=13 duplicate_groups=0
```
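`/tmp/repro_691.py` drives the real indexing pipeline, so it isn't reproduced here; the stand-in below shows the same race shape with nothing but `sqlite3` and `multiprocessing` (all names and counts illustrative):

```python
import multiprocessing
import sqlite3
import uuid

DB = "/tmp/race_demo.db"  # illustrative path

def index_once(_: int) -> None:
    # Each process generates fresh UUIDs for the same logical chunks,
    # exactly like the two racing indexers did.
    conn = sqlite3.connect(DB, timeout=30)
    for i in range(13):
        conn.execute(
            "INSERT INTO chunks (id, namespace, source_file, content_hash, start_line)"
            " VALUES (?, 'ns', 'rfc.md', ?, ?)",
            (str(uuid.uuid4()), f"hash-{i:02d}", i * 10),
        )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    setup = sqlite3.connect(DB)
    setup.execute("DROP TABLE IF EXISTS chunks")
    setup.execute(
        "CREATE TABLE chunks (id TEXT PRIMARY KEY, namespace TEXT,"
        " source_file TEXT, content_hash TEXT, start_line INTEGER)"
    )
    setup.commit()
    workers = [multiprocessing.Process(target=index_once, args=(n,)) for n in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    total, distinct = setup.execute(
        "SELECT COUNT(*), COUNT(DISTINCT content_hash) FROM chunks"
    ).fetchone()
    # Without the UNIQUE index: total=26 distinct_hash=13 (the ghost rows).
    print(f"total={total} distinct_hash={distinct}")
```

With the UNIQUE index plus `INSERT OR IGNORE` from the Summary installed, the same two workers leave `total` at 13.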
The reproduced hashes match the prod DB bit-for-bit, confirming that the multi-process race is the actual cause (not a chunker bug: `MarkdownChunker.chunk_file` on the same file with the same config emits 0 hash duplicates, and so does the full pre-diff pipeline including `_merge_short_chunks` and `_add_overlap`).

What changed
`packages/memtomem/src/memtomem/storage/sqlite_schema.py`

- `_migrate_chunks_uniqueness` helper, gated on whether `idx_chunks_unique_content` already exists. On a clean (or already-migrated) DB it's a single index lookup, effectively zero-cost.
- Groups rows by `(namespace, source_file, content_hash, start_line)` and picks the keeper as the row with the highest `(access_count + use_count)`, so we preserve whichever row the differ has been actively reusing across re-indexes (tie-break: oldest `created_at`, then `id`). Deletes the loser rowids from `chunks_fts` and `chunks_vec` (when present) before deleting from `chunks`; sidecars first, so an interrupted run never leaves an FTS/vec entry pointing at a non-existent `chunks.rowid`. (A sketch of the keeper selection follows this list.)
- `CREATE UNIQUE INDEX IF NOT EXISTS idx_chunks_unique_content ON chunks(namespace, source_file, content_hash, start_line)`.

`packages/memtomem/src/memtomem/storage/sqlite_backend.py`

- `upsert_chunks`: the chunk INSERT becomes `INSERT OR IGNORE`, so race losers are dropped silently at insert time.
- The follow-up `SELECT id, rowid FROM chunks WHERE id IN (...)` already filters by actual row existence, so `chunks_fts` and `chunks_vec` inserts naturally skip dropped rows. Wrapped the sidecar block in `if new_rowids:` for the case where every `to_insert` row was a race loser (otherwise `placeholders(0)` raised).

`packages/memtomem/src/memtomem/tools/export_import.py`

- `on_conflict="duplicate"` mode kept for back-compat but documented and commented to reflect that the storage invariant is now authoritative; the mode no longer materialises duplicate rows on disk.
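How the keeper selection can be expressed in one statement; a sketch (SQLite ≥ 3.25 for `ROW_NUMBER()`), with the sidecar cleanup and surrounding transaction omitted:

```python
# rn = 1 marks the keeper per duplicate group; everything else is deleted.
DEDUP_SQL = """
DELETE FROM chunks WHERE id IN (
    SELECT id FROM (
        SELECT id,
               ROW_NUMBER() OVER (
                   PARTITION BY namespace, source_file, content_hash, start_line
                   ORDER BY (access_count + use_count) DESC,  -- most actively used
                            created_at ASC,                   -- then oldest
                            id ASC                            -- then stable tie-break
               ) AS rn
        FROM chunks
    )
    WHERE rn > 1
)
"""
```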
Tests

- `test_storage.py::TestChunkUniqueness`: sequential and within-batch twin chunks (same `content_hash`, different `id`) collapse to one row.
- `test_sqlite_schema.py::TestDuplicateChunksMigration`: pre-#691 dup rows are cleaned up on first startup with the most-active row preserved; second startup is a no-op.
- `test_export_import_roundtrip.py::TestOnConflictModes::test_duplicate_mode_dedups_at_storage_layer`: replaces the old `test_duplicate_mode_backcompat` and pins the new dedup behaviour.
- Existing seed helpers in `test_chunk_links_{reader,writer,schema}.py` were inserting multiple rows with `content_hash=''` and `source_file=''` under the same namespace, which now correctly violates the UNIQUE index. Helpers updated to key `content_hash` off `chunk_id` so seed-only tests don't double as smoke tests for the constraint.
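The negative-pin idea from the follow-up commit, sketched with hypothetical fixture names: assert the loser is gone from both `chunks` and `chunks_fts`, not merely that the keeper survived:

```python
def test_loser_purged_from_table_and_sidecar(conn, keeper_id, loser_id, loser_rowid):
    assert conn.execute(
        "SELECT 1 FROM chunks WHERE id = ?", (keeper_id,)
    ).fetchone() is not None  # keeper survived
    assert conn.execute(
        "SELECT 1 FROM chunks WHERE id = ?", (loser_id,)
    ).fetchone() is None      # loser gone from chunks
    assert conn.execute(
        "SELECT 1 FROM chunks_fts WHERE rowid = ?", (loser_rowid,)
    ).fetchone() is None      # and from the FTS sidecar
```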
Test plan

- `uv run pytest -m "not ollama"`: full suite, 3891 passed.
- `uv run ruff check packages/memtomem/src packages/memtomem/tests`
- `uv run ruff format --check packages/memtomem/src packages/memtomem/tests`
- Repro script shows `duplicate_groups=0` after the fix (was `13`).
- Migration keeps the row with the higher `access_count + use_count` over the older but unused row.
- `create_tables` runs twice on the same DB without error (idempotency).

🤖 Generated with Claude Code