Indexing creates duplicate chunks within a single run (no UNIQUE on content_hash) #691

@memtomem

Description

What

Indexing produces multiple rows in the chunks table that share the same
content_hash, namespace, source_file, and line range, differing only in id.
This is not a re-index artifact — the duplicate rows are inserted within the
same indexing run.

Surfaced via mm web search QA: identical body / line range repeated across
results in /api/search?q=memtomem&top_k=3.

Evidence (local DB)

total_chunks            1546
distinct_content_hash   1469
duplicate_groups          77
duplicate_rows           154
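The counts above can be recomputed with a single grouped query. A runnable
sketch against a tiny in-memory stand-in for the chunks table (in practice
you would connect to the local DB file instead of `:memory:`):

```python
import sqlite3

# Minimal stand-in for the chunks table, with one same-run duplicate.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE chunks (
        id TEXT PRIMARY KEY,
        content_hash TEXT NOT NULL,
        source_file TEXT NOT NULL,
        namespace TEXT NOT NULL DEFAULT 'default'
    )
""")
con.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?, ?)",
    [
        ("id-1", "hash-a", "a.md", "default"),
        ("id-2", "hash-a", "a.md", "default"),  # same-run duplicate
        ("id-3", "hash-b", "a.md", "default"),
    ],
)

total = con.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
distinct = con.execute(
    "SELECT COUNT(DISTINCT content_hash) FROM chunks"
).fetchone()[0]
# Groups with >1 row per (namespace, source_file, content_hash),
# and the total number of rows participating in those groups.
dup_groups, dup_rows = con.execute("""
    SELECT COUNT(*), COALESCE(SUM(n), 0) FROM (
        SELECT COUNT(*) AS n FROM chunks
        GROUP BY namespace, source_file, content_hash
        HAVING COUNT(*) > 1
    )
""").fetchone()
print(total, distinct, dup_groups, dup_rows)  # 3 2 1 2
```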

Every duplicate group is same-namespace + same-source-file:

bucket                    groups
same-ns + same-source         77
cross-ns, same-source          0
same-ns, cross-source          0
cross-ns + cross-source        0

Of the 77 groups, 64 share the exact same created_at (second precision)
across all duplicate rows; the remaining 13 fall within 60s of each other.
This rules out user-driven re-indexing as the cause.

Sample group (one of 77):

id                                    start_line  end_line  created_at
f61516dc-fa29-4013-bb35-e762aea8c7a7  102         92        2026-04-30T03:08:58+00:00
4c842f74-5a42-4f9b-97aa-b45b3ebf25c2  102         92        2026-04-30T03:08:58+00:00

(Note: start_line=102, end_line=92 looks inverted — that's a separate
chunker oddity not in scope here.)

Where it bites

  • Search: BM25 + dense both pull duplicates independently; RRF / rerank
    preserve them, so user-visible result lists repeat the same content under
    different chunk ids. No dedup pass exists in pipeline.search()
    (packages/memtomem/src/memtomem/search/pipeline.py:357).
  • Stats: mem_stats chunk counts are inflated proportionally.
  • Existing dedup tooling (/api/dedup, scheduler/jobs.py:142) is a
    manual scan-and-merge surface — it surfaces these but does not prevent
    insertion.
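Until insertion is fixed, a post-retrieval dedup keyed on content_hash would
hide the symptom in search results. A minimal sketch — Result is a
hypothetical stand-in for the pipeline's result type, not the actual class
in pipeline.py:

```python
from dataclasses import dataclass

@dataclass
class Result:
    chunk_id: str
    content_hash: str
    score: float

def dedup_results(results: list[Result]) -> list[Result]:
    """Keep only the first (highest-ranked) result per content_hash.

    Assumes `results` is already sorted by final rank (post-RRF/rerank).
    """
    seen: set[str] = set()
    out: list[Result] = []
    for r in results:
        if r.content_hash in seen:
            continue  # duplicate chunk under a different id
        seen.add(r.content_hash)
        out.append(r)
    return out
```

This only papers over the duplicate rows; the stats inflation remains until
the storage-layer fix lands.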

Schema

chunks table:

CREATE TABLE chunks (
    id TEXT PRIMARY KEY,
    content_hash TEXT NOT NULL,
    source_file TEXT NOT NULL,
    namespace TEXT NOT NULL DEFAULT 'default',
    ...
);
CREATE INDEX idx_chunks_hash ON chunks(content_hash);

content_hash has a lookup index but no UNIQUE constraint — neither
standalone nor composite on (namespace, source_file, content_hash).
Duplicate inserts within a single run are therefore not rejected at the
storage layer.
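For illustration, the defensive variant of the schema with the composite
UNIQUE plus INSERT OR IGNORE behaves as follows (sketch only; a real
migration would need to clean up the existing duplicate rows first, since
SQLite cannot add a UNIQUE constraint to a table that already violates it):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Same columns as the current schema, plus the composite UNIQUE.
con.execute("""
    CREATE TABLE chunks (
        id TEXT PRIMARY KEY,
        content_hash TEXT NOT NULL,
        source_file TEXT NOT NULL,
        namespace TEXT NOT NULL DEFAULT 'default',
        UNIQUE (namespace, source_file, content_hash)
    )
""")
rows = [
    ("id-1", "hash-a", "a.md", "default"),
    ("id-2", "hash-a", "a.md", "default"),  # duplicate: silently skipped
]
con.executemany("INSERT OR IGNORE INTO chunks VALUES (?, ?, ?, ?)", rows)
count = con.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
print(count)  # 1
```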

Hypothesis (needs verification before fix)

Two plausible causes, in priority order:

  1. The indexing pipeline inserts the same Chunk twice within one batch:
    e.g. a parser yielding the same section under two paths, or a write loop
    not deduping the in-memory list before the executemany. The same-second
    timestamps strongly point here.
  2. Concurrent indexers racing on the same source — file watcher + CLI
    mm index overlap. Less likely given the same-second clustering, but
    possible if both share the run's "now()".

What's needed

Before deciding on a fix:

  • Reproduce on a clean DB by indexing a single representative file
    (one that has duplicates today) and check whether duplicates appear
    from one mm index invocation.
  • Trace the chunker → storage write path to find where the same
    content_hash enters the insert batch.
  • Decide dedup point: in-batch dedup before executemany (cheapest),
    a UNIQUE(namespace, source_file, content_hash) schema constraint
    with INSERT OR IGNORE (defensive), or both.
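The "in-batch dedup before executemany" option is a small filter over the
pending rows. A sketch — the dict shape and dedup_batch helper are assumed
for illustration, not taken from the actual write path:

```python
def dedup_batch(chunks: list[dict]) -> list[dict]:
    """Drop later chunks whose (namespace, source_file, content_hash) key
    was already seen in this batch, keeping the first occurrence."""
    seen: set[tuple[str, str, str]] = set()
    out: list[dict] = []
    for c in chunks:
        key = (c["namespace"], c["source_file"], c["content_hash"])
        if key in seen:
            continue  # same content yielded twice in one run
        seen.add(key)
        out.append(c)
    return out
```

Cheap and local, but it only guards one batch; the schema constraint is
what closes the concurrent-indexer race in hypothesis 2.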

A backfill / cleanup pass for the 77 existing groups is a separate concern;
/api/dedup already covers manual cleanup.

Severity

Medium. Not data loss, but degrades search result quality (visible
repetition) and inflates chunk counts. ~5% of chunks affected on the DB
where this was observed (77 dup groups / 1469 distinct hashes).
