
fix(repair): scale HNSW divergence floor with hnsw:sync_threshold#1287

Merged
igorls merged 1 commit into MemPalace:develop from messelink:fix/hnsw-divergence-scales-with-sync-threshold on May 1, 2026

Conversation

@messelink
Contributor

Summary

The capacity probe added in #1227 hardcoded a 2,000-row floor for the "DIVERGED" decision. The comment justifying that number explicitly tied it to chromadb's default sync_threshold of 1,000:

Two synchronization windows worth (2 × sync_threshold = 2000) is a safe steady-state ceiling

#1191 then set hnsw:sync_threshold = 50_000 via _HNSW_BLOAT_GUARD (to prevent index bloat) without updating the divergence floor to track. The two changes shipped about a week apart and the dependency wasn't caught.

Symptom

Any palace created via ChromaBackend.create_collection after #1191 carries hnsw:sync_threshold=50_000 in collection_metadata. ChromaDB then naturally accumulates up to 50,000 entries in its WAL (embeddings_queue) between HNSW flushes — that's by design. The on-disk index_metadata.pickle only updates at flush time, so between flushes:

  • sqlite_count grows with every write
  • hnsw_count (read from the pickle) stays frozen until the next flush
  • Divergence climbs from 0 → ~50K → flush → back to 0 → repeat

With the 2,000 floor, the divergence threshold becomes max(2000, 10% × sqlite_count). For a 100K-drawer palace that's 10,000 — and the queue crosses 10K maybe a fifth of the way through every 50K-write cycle. Result: the MCP server (mcp_server._refresh_vector_disabled_flag) routes vector search to the BM25 fallback and disables duplicate detection for ~80% of the write cycle on any actively-mined ≥100K palace, even though chromadb is behaving correctly.
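The pre-fix guardrail math above can be sketched in a few lines (illustrative only — the constant names mirror the PR, but the helper below is not the real implementation):

```python
# Pre-fix guardrail math (sketch). Constants as described in #1227.
_HNSW_DIVERGENCE_FLOOR = 2_000       # hardcoded floor from #1227
_HNSW_DIVERGENCE_FRACTION = 0.10     # 10% relative term


def old_threshold(sqlite_count: int) -> int:
    """Divergence threshold before this fix: max of the fixed floor
    and 10% of the sqlite row count."""
    return max(_HNSW_DIVERGENCE_FLOOR,
               int(sqlite_count * _HNSW_DIVERGENCE_FRACTION))


# 100K-drawer palace with sync_threshold = 50_000: the WAL queue climbs
# 0 -> 50_000 between flushes, but the guardrail trips as soon as the
# queue exceeds the threshold -- i.e. for most of each write cycle.
threshold = old_threshold(100_000)                 # 10_000
tripped_fraction = (50_000 - threshold) / 50_000   # 0.8
```

With the queue over threshold for 40K of every 50K writes, the palace sits in DIVERGED roughly 80% of the time, matching the symptom above.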

Concretely on a 100K-drawer palace I just rebuilt:

$ mempalace repair-status
  [drawers]
    sqlite count:   98,176
    hnsw count:     50,000
    divergence:     48,176
    status:         DIVERGED
    note:           HNSW index holds 50,000 elements but sqlite has 98,176
                    embeddings — 48,176 drawers (49%) are invisible to vector
                    search. Run `mempalace repair` to rebuild.

The "49% invisible" framing is misleading: those 48,176 drawers are queued for the next HNSW flush, not corrupt. ChromaDB's WAL replay surfaces them on every fresh open — count() returns the full 98,176, and a CLI mempalace search returns hybrid (cosine + BM25) results across the full set. The MCP server's stricter guardrail is what disables vector search.

Fix

Read the configured hnsw:sync_threshold from collection_metadata per palace and scale the floor to 2 × sync_threshold. The 10% relative term, the unflushed-pickle branch, and the original #1222 detection capability are unchanged.

sync_threshold = _read_sync_threshold(palace_path, collection_name)
divergence_floor = max(_HNSW_DIVERGENCE_FALLBACK_FLOOR, 2 * sync_threshold)
threshold = max(divergence_floor, int(sqlite_count * _HNSW_DIVERGENCE_FRACTION))

A 91%-missing-of-192,997 palace (the actual #1222 reproducer) still trips, regardless of whether the collection was created with sync_threshold=1000 or 50000.
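For illustration, `_read_sync_threshold` could look roughly like the sketch below. This is an assumption, not the PR's actual helper: the table and column names (`collections`, `collection_metadata`, `int_value`) follow chromadb's sqlite schema as of recent 1.x releases, and the fallback of 1,000 matches chromadb's documented default.

```python
# Hedged sketch of a sync-threshold reader; the real helper lives in the PR.
# Schema names (collections, collection_metadata) are an assumption based on
# chromadb's sqlite layout.
import sqlite3
from pathlib import Path

_FALLBACK_SYNC_THRESHOLD = 1_000  # chromadb default


def read_sync_threshold(palace_path: Path, collection_name: str) -> int:
    """Return the collection's configured hnsw:sync_threshold, or the
    chromadb default if the palace predates the metadata row."""
    db = palace_path / "chroma.sqlite3"
    if not db.exists():
        return _FALLBACK_SYNC_THRESHOLD
    conn = sqlite3.connect(db)
    try:
        row = conn.execute(
            """
            SELECT cm.int_value
            FROM collection_metadata cm
            JOIN collections c ON c.id = cm.collection_id
            WHERE c.name = ? AND cm.key = 'hnsw:sync_threshold'
            """,
            (collection_name,),
        ).fetchone()
    finally:
        conn.close()
    if row is None or row[0] is None:
        return _FALLBACK_SYNC_THRESHOLD
    return int(row[0])
```

Falling back to the default (rather than erroring) keeps legacy palaces on the original 2,000 floor, which is exactly the "Missing" row in the behavior summary below.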

Behavior summary

| Collection's `hnsw:sync_threshold` | Floor (after fix) | Floor (before) |
|---|---|---|
| Missing (legacy palace from before #1191) | 2,000 | 2,000 (unchanged) |
| 1,000 (chromadb default) | 2,000 | 2,000 (unchanged) |
| 50,000 (`_HNSW_BLOAT_GUARD`) | 100,000 | 2,000 (the bug) |

Tests

Four new tests in tests/test_hnsw_capacity.py:

  • test_capacity_status_tolerates_lag_under_large_sync_threshold — regression for the #1191/#1227 conflict (100K sqlite + 50K HNSW + sync=50K → OK)
  • test_capacity_status_still_flags_real_corruption_under_large_sync — #1222 shape with a bloat-guard collection still detects corruption
  • test_capacity_status_default_threshold_when_no_sync_metadata — legacy palaces without the metadata row use the 2,000 fallback floor
  • test_unflushed_path_also_uses_dynamic_floor — the never-flushed branch scales too (30K under sync_threshold=50000 is no longer flagged)

_seed_chroma_db in the test fixture grew an optional sync_threshold parameter that, when provided, seeds an hnsw:sync_threshold row into a new collection_metadata table. All 18 pre-existing tests in test_hnsw_capacity.py continue to pass; the broader test_backends.py (45 tests) shows no regressions.

Related

The capacity probe added in MemPalace#1227 hardcoded a 2,000-row floor for the
"diverged" decision. The comment justifying that number explicitly tied
it to chromadb's *default* sync_threshold of 1,000 — "Two synchronization
windows worth (2 × sync_threshold = 2000) is a safe steady-state ceiling".

MemPalace#1191 then bumped sync_threshold to 50,000 via _HNSW_BLOAT_GUARD without
updating the floor. Result: any palace created with the bloat guard
flips between OK and DIVERGED on every flush cycle. Steady-state
divergence sits at 0–50K (the natural queue depth), and the 2,000 floor
trips the guardrail the moment the queue exceeds 10% of sqlite_count.
The MCP server then routes search to BM25-only and disables duplicate
detection for ~80% of the write cycle on actively-mined ≥100K palaces,
even though chromadb is behaving correctly.

This change reads the configured `hnsw:sync_threshold` from
`collection_metadata` per palace and scales the floor to 2 × that value.
The 10% relative term and the original MemPalace#1222 detection capability are
unchanged — a 91%-missing-of-192K palace (the actual MemPalace#1222 reproducer)
still trips, regardless of whether the collection was created with
sync_threshold=1000 or 50000.

Behavior summary:

| Collection's sync_threshold | New floor | Old floor |
|---|---|---|
| Missing (legacy palace)     | 2000      | 2000 (unchanged) |
| 1000 (chromadb default)     | 2000      | 2000 (unchanged) |
| 50000 (MemPalace#1191 bloat guard) | 100000 | 2000 (the bug) |

Tests:
- test_capacity_status_tolerates_lag_under_large_sync_threshold (regression
  for the MemPalace#1191/MemPalace#1227 conflict — 100K sqlite + 50K HNSW + sync=50K → OK)
- test_capacity_status_still_flags_real_corruption_under_large_sync (MemPalace#1222
  shape with bloat-guard collection — still detects corruption)
- test_capacity_status_default_threshold_when_no_sync_metadata (legacy
  palaces without the metadata row use the 2000 fallback floor)
- test_unflushed_path_also_uses_dynamic_floor (the never-flushed branch
  scales too — 30K under sync_threshold=50000 is no longer flagged)

All 18 pre-existing tests in tests/test_hnsw_capacity.py and 45 tests
in tests/test_backends.py still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
igorls merged commit 96bb80a into MemPalace:develop on May 1, 2026
6 checks passed
igorls added a commit that referenced this pull request May 1, 2026
Three fixes landed on develop after the initial release-prep cut and
were brought in via the develop merge. Document them in the 3.3.4
Bug Fixes section so the release notes reflect what users will
actually receive.

- #1287 - HNSW divergence floor scales with hnsw:sync_threshold
  (resolves a silent-fallback regression introduced by the
  interaction between #1191 and #1227 in this release)
- #1262 - ChromaBackend get_or_create_collection split, fixing the
  stop-hook SIGSEGV class on legacy palaces with mismatched stored
  metadata (#1089)
- #1288 / #1254 - repair --mode max-seq-id heuristic now decodes
  BLOB-typed embeddings.seq_id, restoring the un-poison path added
  in #1135 for palaces where chromadb 1.5.x writes seq_ids natively
igorls added a commit that referenced this pull request May 1, 2026
The release was originally cut on 2026-04-27 but did not tag that day.
Three additional bug fixes have been folded in since then (#1262,
#1287, #1288) and the actual tag will happen on 2026-04-30. Update
the header date to match.
xcarbo added a commit to xcarbo/mempalace that referenced this pull request May 1, 2026
Catches up xdev-patches with 112 commits from MemPalace/develop, including:
- v3.3.4 release
- MemPalace#1262/MemPalace#1289 ChromaDB collection-reopen crash fix (relevant to long-running
  MCP server & mempalace-api)
- MemPalace#1287 HNSW divergence floor fix
- MemPalace#1288 BLOB seq_id decode in repair
- MemPalace#1180 cross-wing tunnels by shared topics
- MemPalace#1194 wing-slug normalization for hyphenated dirs

Conflict resolution: hooks_cli.py and mcp_server.py both had local patches
(6ef44cb route CC transcripts via convo_miner; 3fad61d allow leading dash)
that overlap with upstream fixes (MemPalace#1231, MemPalace#1194). Took upstream entirely on
those two files — upstream's version handles separate transcript/project
ingest, uses _mempalace_python(), and adds _pin_hnsw_threads. The local
config.py regex relaxation auto-merged cleanly and is preserved.

Safety tag: pre-upstream-merge-20260501-091227 (rollback target).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
