What
Indexing produces multiple chunks rows that share the same content_hash,
namespace, source_file, and line range, differing only in id. This is
not a re-index artifact — the duplicate rows are inserted within the same
indexing run.
Surfaced via mm web search QA: identical body / line range repeated across
results in /api/search?q=memtomem&top_k=3.
Evidence (local DB)
- total_chunks: 1546
- distinct_content_hash: 1469
- duplicate_groups: 77
- duplicate_rows: 154
Every duplicate group is same-namespace + same-source-file:

| bucket | groups |
| --- | --- |
| same-ns + same-source | 77 |
| cross-ns same-source | 0 |
| same-ns cross-source | 0 |
| cross-ns + cross-source | 0 |
Of the 77 groups, 64 share the exact same created_at (second precision)
across all duplicate rows; the remaining 13 fall within 60s of each other.
This rules out user-driven re-indexing as the cause.
Sample group (one of 77):

| id | start_line | end_line | created_at |
| --- | --- | --- | --- |
| f61516dc-fa29-4013-bb35-e762aea8c7a7 | 102 | 92 | 2026-04-30T03:08:58+00:00 |
| 4c842f74-5a42-4f9b-97aa-b45b3ebf25c2 | 102 | 92 | 2026-04-30T03:08:58+00:00 |
(Note: start_line=102, end_line=92 looks inverted — that's a separate
chunker oddity not in scope here.)
Where it bites
- Search: BM25 + dense both pull duplicates independently; RRF / rerank preserve them, so user-visible result lists repeat the same content under different chunk ids. No dedup pass exists in pipeline.search() (packages/memtomem/src/memtomem/search/pipeline.py:357); a possible stopgap is sketched after this list.
- Stats: mem_stats chunk counts are inflated proportionally.
- Existing dedup tooling (/api/dedup, scheduler/jobs.py:142) is a manual scan-and-merge surface; it surfaces these duplicates but does not prevent insertion.
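A read-path stopgap could collapse results that share (namespace, source_file, content_hash) after ranking, keeping only the best-ranked row. A minimal sketch, assuming each result exposes those three fields as dict keys; this is not the current pipeline.search() API:

```python
from typing import Iterable


def dedup_search_results(results: Iterable[dict]) -> list[dict]:
    """Drop later results that repeat an earlier result's content.

    Keyed on (namespace, source_file, content_hash); the first
    (best-ranked) occurrence of each key is kept.
    """
    seen: set[tuple[str, str, str]] = set()
    unique: list[dict] = []
    for result in results:
        key = (result["namespace"], result["source_file"], result["content_hash"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(result)
    return unique
```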
Schema
chunks table:
```sql
CREATE TABLE chunks (
    id TEXT PRIMARY KEY,
    content_hash TEXT NOT NULL,
    source_file TEXT NOT NULL,
    namespace TEXT NOT NULL DEFAULT 'default',
    ...
);

CREATE INDEX idx_chunks_hash ON chunks(content_hash);
```
content_hash has a lookup index but no UNIQUE constraint, neither
standalone nor on (namespace, source_file, content_hash). Same-run duplicate
inserts are therefore not rejected at the storage layer.
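For illustration, the missing storage-layer guard would look roughly like this; a sketch only (the index name is invented, and the 77 existing duplicate groups would have to be cleaned up before it could be applied):

```sql
-- Reject same-run duplicates at the storage layer.
CREATE UNIQUE INDEX idx_chunks_ns_source_hash
    ON chunks(namespace, source_file, content_hash);

-- Paired with INSERT OR IGNORE (SQLite), a repeated chunk in the same
-- batch becomes a no-op instead of a second row. Column list abbreviated
-- to the schema excerpt above.
INSERT OR IGNORE INTO chunks (id, content_hash, source_file, namespace)
VALUES (:id, :content_hash, :source_file, :namespace);
```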
Hypothesis (needs verification before fix)
Two plausible causes, in priority order:
- Indexing pipeline inserts the same Chunk twice within one batch, e.g. a parser yielding the same section under two paths, or a write loop not deduping the in-memory list before the executemany. The same-second timestamps strongly point here.
- Concurrent indexers racing on the same source: file watcher + CLI mm index overlap. Less likely given the same-second clustering, but possible if both share the run's "now()".
What's needed
Before deciding on a fix:
- Reproduce against a source file (one that has duplicates today) and check whether duplicates appear from one mm index invocation.
- Trace where the duplicate content_hash enters the insert batch.
- Decide between deduping the batch before the executemany (cheapest), a UNIQUE(namespace, source_file, content_hash) schema constraint with INSERT OR IGNORE (defensive), or both; the batch-dedup option is sketched below.

A backfill / cleanup pass for the 77 existing groups is a separate concern; /api/dedup already covers manual cleanup.
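A minimal sketch of the in-memory option, assuming the write loop hands a list of row tuples to executemany; the row shape and column list are taken from the schema excerpt above for illustration, not from the actual indexing code:

```python
import sqlite3
from typing import Iterable

# Assumed row shape for illustration: (id, content_hash, source_file, namespace).
Row = tuple[str, str, str, str]


def insert_chunks_deduped(conn: sqlite3.Connection, rows: Iterable[Row]) -> int:
    """Insert chunk rows, skipping in-batch repeats of the same content.

    Keyed on (namespace, source_file, content_hash), mirroring the proposed
    UNIQUE constraint. Returns the number of rows actually written.
    """
    seen: set[tuple[str, str, str]] = set()
    batch: list[Row] = []
    for chunk_id, content_hash, source_file, namespace in rows:
        key = (namespace, source_file, content_hash)
        if key in seen:
            continue
        seen.add(key)
        batch.append((chunk_id, content_hash, source_file, namespace))
    conn.executemany(
        "INSERT INTO chunks (id, content_hash, source_file, namespace) "
        "VALUES (?, ?, ?, ?)",
        batch,
    )
    return len(batch)
```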
Severity
Medium. Not data loss, but degrades search result quality (visible
repetition) and inflates chunk counts. ~5% of chunks affected on the DB
where this was observed (77 dup groups / 1469 distinct hashes).