
fix(repair): scale HNSW divergence floor with hnsw:sync_threshold#1287

Merged
igorls merged 1 commit into MemPalace:develop from messelink:fix/hnsw-divergence-scales-with-sync-threshold on May 1, 2026

Conversation

@messelink
Contributor

Summary

The capacity probe added in #1227 hardcoded a 2,000-row floor for the "DIVERGED" decision. The comment justifying that number explicitly tied it to chromadb's default sync_threshold of 1,000:

Two synchronization windows worth (2 × sync_threshold = 2000) is a safe steady-state ceiling

#1191 then set hnsw:sync_threshold = 50_000 via _HNSW_BLOAT_GUARD (to prevent index bloat) without updating the divergence floor to track. The two changes shipped about a week apart and the dependency wasn't caught.

Symptom

Any palace created via ChromaBackend.create_collection after #1191 carries hnsw:sync_threshold=50_000 in collection_metadata. ChromaDB then naturally accumulates up to 50,000 entries in its WAL (embeddings_queue) between HNSW flushes — that's by design. The on-disk index_metadata.pickle only updates at flush time, so between flushes:

  • sqlite_count grows with every write
  • hnsw_count (read from the pickle) stays frozen until the next flush
  • Divergence climbs from 0 → ~50K → flush → back to 0 → repeat

With the 2,000 floor, the divergence threshold becomes max(2000, 10% × sqlite_count). For a 100K-drawer palace that's 10,000 — and the queue crosses 10K maybe a fifth of the way through every 50K-write cycle. Result: the MCP server (mcp_server._refresh_vector_disabled_flag) routes vector search to the BM25 fallback and disables duplicate detection for ~80% of the write cycle on any actively-mined ≥100K palace, even though chromadb is behaving correctly.
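The pre-fix guardrail math above can be sketched in a few lines (illustrative only — the constant names mirror the PR, but the helper below is not the real implementation):

```python
# Pre-fix guardrail math (sketch). Constants as described in #1227.
_HNSW_DIVERGENCE_FLOOR = 2_000       # hardcoded floor from #1227
_HNSW_DIVERGENCE_FRACTION = 0.10     # 10% relative term


def old_threshold(sqlite_count: int) -> int:
    """Divergence threshold before this fix: max of the fixed floor
    and 10% of the sqlite row count."""
    return max(_HNSW_DIVERGENCE_FLOOR,
               int(sqlite_count * _HNSW_DIVERGENCE_FRACTION))


# 100K-drawer palace with sync_threshold = 50_000: the WAL queue climbs
# 0 -> 50_000 between flushes, but the guardrail trips as soon as the
# queue exceeds the threshold -- i.e. for most of each write cycle.
threshold = old_threshold(100_000)                 # 10_000
tripped_fraction = (50_000 - threshold) / 50_000   # 0.8
```

With the queue over threshold for 40K of every 50K writes, the palace sits in DIVERGED roughly 80% of the time, matching the symptom above.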

Concretely on a 100K-drawer palace I just rebuilt:

$ mempalace repair-status
  [drawers]
    sqlite count:   98,176
    hnsw count:     50,000
    divergence:     48,176
    status:         DIVERGED
    note:           HNSW index holds 50,000 elements but sqlite has 98,176
                    embeddings — 48,176 drawers (49%) are invisible to vector
                    search. Run `mempalace repair` to rebuild.

The "49% invisible" framing is misleading: those 48,176 drawers are queued for the next HNSW flush, not corrupt. ChromaDB's WAL replay surfaces them on every fresh open — count() returns the full 98,176, and a CLI mempalace search returns hybrid (cosine + BM25) results across the full set. The MCP server's stricter guardrail is what disables vector search.

Fix

Read the configured hnsw:sync_threshold from collection_metadata per palace and scale the floor to 2 × sync_threshold. The 10% relative term, the unflushed-pickle branch, and the original #1222 detection capability are unchanged.

sync_threshold = _read_sync_threshold(palace_path, collection_name)
divergence_floor = max(_HNSW_DIVERGENCE_FALLBACK_FLOOR, 2 * sync_threshold)
threshold = max(divergence_floor, int(sqlite_count * _HNSW_DIVERGENCE_FRACTION))

A 91%-missing-of-192,997 palace (the actual #1222 reproducer) still trips, regardless of whether the collection was created with sync_threshold=1000 or 50000.
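For illustration, `_read_sync_threshold` could look roughly like the sketch below. This is an assumption, not the PR's actual helper: the table and column names (`collections`, `collection_metadata`, `int_value`) follow chromadb's sqlite schema as of recent 1.x releases, and the fallback of 1,000 matches chromadb's documented default.

```python
# Hedged sketch of a sync-threshold reader; the real helper lives in the PR.
# Schema names (collections, collection_metadata) are an assumption based on
# chromadb's sqlite layout.
import sqlite3
from pathlib import Path

_FALLBACK_SYNC_THRESHOLD = 1_000  # chromadb default


def read_sync_threshold(palace_path: Path, collection_name: str) -> int:
    """Return the collection's configured hnsw:sync_threshold, or the
    chromadb default if the palace predates the metadata row."""
    db = palace_path / "chroma.sqlite3"
    if not db.exists():
        return _FALLBACK_SYNC_THRESHOLD
    conn = sqlite3.connect(db)
    try:
        row = conn.execute(
            """
            SELECT cm.int_value
            FROM collection_metadata cm
            JOIN collections c ON c.id = cm.collection_id
            WHERE c.name = ? AND cm.key = 'hnsw:sync_threshold'
            """,
            (collection_name,),
        ).fetchone()
    finally:
        conn.close()
    if row is None or row[0] is None:
        return _FALLBACK_SYNC_THRESHOLD
    return int(row[0])
```

Falling back to the default (rather than erroring) keeps legacy palaces on the original 2,000 floor, which is exactly the "Missing" row in the behavior summary below.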

Behavior summary

| Collection's `hnsw:sync_threshold` | Floor (after fix) | Floor (before) |
|---|---|---|
| Missing (legacy palace from before #1191) | 2,000 | 2,000 (unchanged) |
| 1,000 (chromadb default) | 2,000 | 2,000 (unchanged) |
| 50,000 (`_HNSW_BLOAT_GUARD`) | 100,000 | 2,000 (the bug) |

Tests

Four new tests in tests/test_hnsw_capacity.py:

  • test_capacity_status_tolerates_lag_under_large_sync_threshold — regression for the #1191/#1227 conflict (100K sqlite + 50K HNSW + sync=50K → OK)
  • test_capacity_status_still_flags_real_corruption_under_large_sync — #1222 shape with a bloat-guard collection still detects corruption
  • test_capacity_status_default_threshold_when_no_sync_metadata — legacy palaces without the metadata row use the 2,000 fallback floor
  • test_unflushed_path_also_uses_dynamic_floor — the never-flushed branch scales too (30K under sync_threshold=50000 is no longer flagged)

_seed_chroma_db in the test fixture grew an optional sync_threshold parameter that, when provided, seeds an hnsw:sync_threshold row into a new collection_metadata table. All 18 pre-existing tests in test_hnsw_capacity.py continue to pass; the broader test_backends.py (45 tests) shows no regressions.

Related

The capacity probe added in MemPalace#1227 hardcoded a 2,000-row floor for the
"diverged" decision. The comment justifying that number explicitly tied
it to chromadb's *default* sync_threshold of 1,000 — "Two synchronization
windows worth (2 × sync_threshold = 2000) is a safe steady-state ceiling".

MemPalace#1191 then bumped sync_threshold to 50,000 via _HNSW_BLOAT_GUARD without
updating the floor. Result: any palace created with the bloat guard
flips between OK and DIVERGED on every flush cycle. Steady-state
divergence sits at 0–50K (the natural queue depth), and the 2,000 floor
trips the guardrail the moment the queue exceeds 10% of sqlite_count.
The MCP server then routes search to BM25-only and disables duplicate
detection for ~80% of the write cycle on actively-mined ≥100K palaces,
even though chromadb is behaving correctly.

This change reads the configured `hnsw:sync_threshold` from
`collection_metadata` per palace and scales the floor to 2 × that value.
The 10% relative term and the original MemPalace#1222 detection capability are
unchanged — a 91%-missing-of-192K palace (the actual MemPalace#1222 reproducer)
still trips, regardless of whether the collection was created with
sync_threshold=1000 or 50000.

Behavior summary:

| Collection's sync_threshold | New floor | Old floor |
|---|---|---|
| Missing (legacy palace)     | 2000      | 2000 (unchanged) |
| 1000 (chromadb default)     | 2000      | 2000 (unchanged) |
| 50000 (MemPalace#1191 bloat guard) | 100000 | 2000 (the bug) |

Tests:
- test_capacity_status_tolerates_lag_under_large_sync_threshold (regression
  for the MemPalace#1191/MemPalace#1227 conflict — 100K sqlite + 50K HNSW + sync=50K → OK)
- test_capacity_status_still_flags_real_corruption_under_large_sync (MemPalace#1222
  shape with bloat-guard collection — still detects corruption)
- test_capacity_status_default_threshold_when_no_sync_metadata (legacy
  palaces without the metadata row use the 2000 fallback floor)
- test_unflushed_path_also_uses_dynamic_floor (the never-flushed branch
  scales too — 30K under sync_threshold=50000 is no longer flagged)

All 18 pre-existing tests in tests/test_hnsw_capacity.py and 45 tests
in tests/test_backends.py still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
igorls merged commit 96bb80a into MemPalace:develop on May 1, 2026
6 checks passed
igorls added a commit that referenced this pull request May 1, 2026
Three fixes landed on develop after the initial release-prep cut and
were brought in via the develop merge. Document them in the 3.3.4
Bug Fixes section so the release notes reflect what users will
actually receive.

- #1287 - HNSW divergence floor scales with hnsw:sync_threshold
  (resolves a silent-fallback regression introduced by the
  interaction between #1191 and #1227 in this release)
- #1262 - ChromaBackend get_or_create_collection split, fixing the
  stop-hook SIGSEGV class on legacy palaces with mismatched stored
  metadata (#1089)
- #1288 / #1254 - repair --mode max-seq-id heuristic now decodes
  BLOB-typed embeddings.seq_id, restoring the un-poison path added
  in #1135 for palaces where chromadb 1.5.x writes seq_ids natively
igorls added a commit that referenced this pull request May 1, 2026
The release was originally cut on 2026-04-27 but did not tag that day.
Three additional bug fixes have been folded in since then (#1262,
#1287, #1288) and the actual tag will happen on 2026-04-30. Update
the header date to match.
xcarbo added a commit to xcarbo/mempalace that referenced this pull request May 1, 2026
Catches up xdev-patches with 112 commits from MemPalace/develop, including:
- v3.3.4 release
- MemPalace#1262/MemPalace#1289 ChromaDB collection-reopen crash fix (relevant to long-running
  MCP server & mempalace-api)
- MemPalace#1287 HNSW divergence floor fix
- MemPalace#1288 BLOB seq_id decode in repair
- MemPalace#1180 cross-wing tunnels by shared topics
- MemPalace#1194 wing-slug normalization for hyphenated dirs

Conflict resolution: hooks_cli.py and mcp_server.py both had local patches
(6ef44cb route CC transcripts via convo_miner; 3fad61d allow leading dash)
that overlap with upstream fixes (MemPalace#1231, MemPalace#1194). Took upstream entirely on
those two files — upstream's version handles separate transcript/project
ingest, uses _mempalace_python(), and adds _pin_hnsw_threads. The local
config.py regex relaxation auto-merged cleanly and is preserved.

Safety tag: pre-upstream-merge-20260501-091227 (rollback target).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
