Indexing creates duplicate chunks within a single run (no UNIQUE on content_hash) #691

@memtomem

Description

What

Indexing produces multiple rows in the chunks table that share the same
content_hash, namespace, source_file, and line range, differing only in id.
This is not a re-index artifact — the duplicate rows are inserted within the
same indexing run.

Surfaced via mm web search QA: identical body / line range repeated across
results in /api/search?q=memtomem&top_k=3.

Evidence (local DB)

total_chunks            1546
distinct_content_hash   1469
duplicate_groups          77
duplicate_rows           154
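The counts above can be recomputed with a single grouped query. A runnable
sketch against a tiny in-memory stand-in for the chunks table (in practice
you would connect to the local DB file instead of `:memory:`):

```python
import sqlite3

# Minimal stand-in for the chunks table, with one same-run duplicate.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE chunks (
        id TEXT PRIMARY KEY,
        content_hash TEXT NOT NULL,
        source_file TEXT NOT NULL,
        namespace TEXT NOT NULL DEFAULT 'default'
    )
""")
con.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?, ?)",
    [
        ("id-1", "hash-a", "a.md", "default"),
        ("id-2", "hash-a", "a.md", "default"),  # same-run duplicate
        ("id-3", "hash-b", "a.md", "default"),
    ],
)

total = con.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
distinct = con.execute(
    "SELECT COUNT(DISTINCT content_hash) FROM chunks"
).fetchone()[0]
# Groups with >1 row per (namespace, source_file, content_hash),
# and the total number of rows participating in those groups.
dup_groups, dup_rows = con.execute("""
    SELECT COUNT(*), COALESCE(SUM(n), 0) FROM (
        SELECT COUNT(*) AS n FROM chunks
        GROUP BY namespace, source_file, content_hash
        HAVING COUNT(*) > 1
    )
""").fetchone()
print(total, distinct, dup_groups, dup_rows)  # 3 2 1 2
```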

Every duplicate group is same-namespace + same-source-file:

bucket                    groups
same-ns + same-source         77
cross-ns, same-source          0
same-ns, cross-source          0
cross-ns + cross-source        0

Of the 77 groups, 64 share the exact same created_at (second precision)
across all duplicate rows; the remaining 13 fall within 60s of each other.
This rules out user-driven re-indexing as the cause.

Sample group (one of 77):

id                                    start_line  end_line  created_at
f61516dc-fa29-4013-bb35-e762aea8c7a7  102         92        2026-04-30T03:08:58+00:00
4c842f74-5a42-4f9b-97aa-b45b3ebf25c2  102         92        2026-04-30T03:08:58+00:00

(Note: start_line=102, end_line=92 looks inverted — that's a separate
chunker oddity not in scope here.)

Where it bites

  • Search: BM25 + dense both pull duplicates independently; RRF / rerank
    preserve them, so user-visible result lists repeat the same content under
    different chunk ids. No dedup pass exists in pipeline.search()
    (packages/memtomem/src/memtomem/search/pipeline.py:357).
  • Stats: mem_stats chunk counts are inflated proportionally.
  • Existing dedup tooling (/api/dedup, scheduler/jobs.py:142) is a
    manual scan-and-merge surface — it surfaces these but does not prevent
    insertion.
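Until insertion is fixed, a post-retrieval dedup keyed on content_hash would
hide the symptom in search results. A minimal sketch — Result is a
hypothetical stand-in for the pipeline's result type, not the actual class
in pipeline.py:

```python
from dataclasses import dataclass

@dataclass
class Result:
    chunk_id: str
    content_hash: str
    score: float

def dedup_results(results: list[Result]) -> list[Result]:
    """Keep only the first (highest-ranked) result per content_hash.

    Assumes `results` is already sorted by final rank (post-RRF/rerank).
    """
    seen: set[str] = set()
    out: list[Result] = []
    for r in results:
        if r.content_hash in seen:
            continue  # duplicate chunk under a different id
        seen.add(r.content_hash)
        out.append(r)
    return out
```

This only papers over the duplicate rows; the stats inflation remains until
the storage-layer fix lands.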

Schema

chunks table:

CREATE TABLE chunks (
    id TEXT PRIMARY KEY,
    content_hash TEXT NOT NULL,
    source_file TEXT NOT NULL,
    namespace TEXT NOT NULL DEFAULT 'default',
    ...
);
CREATE INDEX idx_chunks_hash ON chunks(content_hash);

content_hash has a lookup index but no UNIQUE constraint — neither
standalone nor composite on (namespace, source_file, content_hash).
Duplicate inserts within a single run are therefore not rejected at the
storage layer.
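For illustration, the defensive variant of the schema with the composite
UNIQUE plus INSERT OR IGNORE behaves as follows (sketch only; a real
migration would need to clean up the existing duplicate rows first, since
SQLite cannot add a UNIQUE constraint to a table that already violates it):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Same columns as the current schema, plus the composite UNIQUE.
con.execute("""
    CREATE TABLE chunks (
        id TEXT PRIMARY KEY,
        content_hash TEXT NOT NULL,
        source_file TEXT NOT NULL,
        namespace TEXT NOT NULL DEFAULT 'default',
        UNIQUE (namespace, source_file, content_hash)
    )
""")
rows = [
    ("id-1", "hash-a", "a.md", "default"),
    ("id-2", "hash-a", "a.md", "default"),  # duplicate: silently skipped
]
con.executemany("INSERT OR IGNORE INTO chunks VALUES (?, ?, ?, ?)", rows)
count = con.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
print(count)  # 1
```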

Hypothesis (needs verification before fix)

Two plausible causes, in priority order:

  1. The indexing pipeline inserts the same Chunk twice within one batch:
    e.g. a parser yielding the same section under two paths, or a write loop
    not deduping the in-memory list before the executemany. The same-second
    timestamps strongly point here.
  2. Concurrent indexers racing on the same source — file watcher + CLI
    mm index overlap. Less likely given the same-second clustering, but
    possible if both share the run's "now()".

What's needed

Before deciding on a fix:

  • Reproduce on a clean DB by indexing a single representative file
    (one that has duplicates today) and check whether duplicates appear
    from one mm index invocation.
  • Trace the chunker → storage write path to find where the same
    content_hash enters the insert batch.
  • Decide dedup point: in-batch dedup before executemany (cheapest),
    a UNIQUE(namespace, source_file, content_hash) schema constraint
    with INSERT OR IGNORE (defensive), or both.
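The "in-batch dedup before executemany" option is a small filter over the
pending rows. A sketch — the dict shape and dedup_batch helper are assumed
for illustration, not taken from the actual write path:

```python
def dedup_batch(chunks: list[dict]) -> list[dict]:
    """Drop later chunks whose (namespace, source_file, content_hash) key
    was already seen in this batch, keeping the first occurrence."""
    seen: set[tuple[str, str, str]] = set()
    out: list[dict] = []
    for c in chunks:
        key = (c["namespace"], c["source_file"], c["content_hash"])
        if key in seen:
            continue  # same content yielded twice in one run
        seen.add(key)
        out.append(c)
    return out
```

Cheap and local, but it only guards one batch; the schema constraint is
what closes the concurrent-indexer race in hypothesis 2.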

A backfill / cleanup pass for the 77 existing groups is a separate concern;
/api/dedup already covers manual cleanup.

Severity

Medium. Not data loss, but degrades search result quality (visible
repetition) and inflates chunk counts. ~5% of chunks affected on the DB
where this was observed (77 dup groups / 1469 distinct hashes).
