fix(repair): scale HNSW divergence floor with hnsw:sync_threshold #1287
Merged
igorls merged 1 commit into MemPalace:develop on May 1, 2026
Conversation
The capacity probe added in MemPalace#1227 hardcoded a 2,000-row floor for the "diverged" decision. The comment justifying that number explicitly tied it to chromadb's *default* sync_threshold of 1,000 — "Two synchronization windows worth (2 × sync_threshold = 2000) is a safe steady-state ceiling". MemPalace#1191 then bumped sync_threshold to 50,000 via _HNSW_BLOAT_GUARD without updating the floor.

Result: any palace created with the bloat guard flips between OK and DIVERGED on every flush cycle. Steady-state divergence sits at 0–50K (the natural queue depth), and the 2,000 floor trips the guardrail the moment the queue exceeds 10% of sqlite_count. The MCP server then routes search to BM25-only and disables duplicate detection for ~80% of the write cycle on actively-mined ≥100K palaces, even though chromadb is behaving correctly.

This change reads the configured `hnsw:sync_threshold` from `collection_metadata` per palace and scales the floor to 2 × that value. The 10% relative term and the original MemPalace#1222 detection capability are unchanged — a palace missing 91% of its 192K drawers (the actual MemPalace#1222 reproducer) still trips, regardless of whether the collection was created with sync_threshold=1000 or 50000.
Behavior summary:

| Collection's sync_threshold | New floor | Old floor |
|---|---|---|
| Missing (legacy palace) | 2000 | 2000 (unchanged) |
| 1000 (chromadb default) | 2000 | 2000 (unchanged) |
| 50000 (MemPalace#1191 bloat guard) | 100000 | 2000 (the bug) |

Tests:

- test_capacity_status_tolerates_lag_under_large_sync_threshold (regression for the MemPalace#1191/MemPalace#1227 conflict — 100K sqlite + 50K HNSW + sync=50K → OK)
- test_capacity_status_still_flags_real_corruption_under_large_sync (MemPalace#1222 shape with a bloat-guard collection — still detects corruption)
- test_capacity_status_default_threshold_when_no_sync_metadata (legacy palaces without the metadata row use the 2000 fallback floor)
- test_unflushed_path_also_uses_dynamic_floor (the never-flushed branch scales too — 30K under sync_threshold=50000 is no longer flagged)

All 18 pre-existing tests in tests/test_hnsw_capacity.py and 45 tests in tests/test_backends.py still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
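As a quick sanity check, the new floor rule in the table above can be expressed in a few lines. This is a sketch, not the actual MemPalace code; the function name is hypothetical:

```python
def divergence_floor(sync_threshold=None):
    """Scale the "diverged" floor to two sync windows (2 x sync_threshold).

    Falls back to the legacy 2,000-row floor when the collection carries no
    hnsw:sync_threshold metadata (pre-#1191 palaces).
    """
    if sync_threshold is None:
        return 2000
    return max(2000, 2 * sync_threshold)

# The three rows of the behavior summary:
assert divergence_floor() == 2000           # missing (legacy palace)
assert divergence_floor(1000) == 2000       # chromadb default
assert divergence_floor(50_000) == 100_000  # bloat-guard collection
```

The `max(2000, ...)` keeps the legacy floor as a lower bound, so collections with an unusually small configured threshold never get a floor below the old behavior.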
igorls added a commit that referenced this pull request on May 1, 2026
Three fixes landed on develop after the initial release-prep cut and were brought in via the develop merge. Document them in the 3.3.4 Bug Fixes section so the release notes reflect what users will actually receive.

- #1287 - HNSW divergence floor scales with hnsw:sync_threshold (resolves a silent-fallback regression introduced by the interaction between #1191 and #1227 in this release)
- #1262 - ChromaBackend get_or_create_collection split, fixing the stop-hook SIGSEGV class on legacy palaces with mismatched stored metadata (#1089)
- #1288 / #1254 - repair --mode max-seq-id heuristic now decodes BLOB-typed embeddings.seq_id, restoring the un-poison path added in #1135 for palaces where chromadb 1.5.x writes seq_ids natively
xcarbo added a commit to xcarbo/mempalace that referenced this pull request on May 1, 2026
Catches up xdev-patches with 112 commits from MemPalace/develop, including:

- v3.3.4 release
- MemPalace#1262/MemPalace#1289 ChromaDB collection-reopen crash fix (relevant to long-running MCP server & mempalace-api)
- MemPalace#1287 HNSW divergence floor fix
- MemPalace#1288 BLOB seq_id decode in repair
- MemPalace#1180 cross-wing tunnels by shared topics
- MemPalace#1194 wing-slug normalization for hyphenated dirs

Conflict resolution: hooks_cli.py and mcp_server.py both had local patches (6ef44cb route CC transcripts via convo_miner; 3fad61d allow leading dash) that overlap with upstream fixes (MemPalace#1231, MemPalace#1194). Took upstream entirely on those two files — upstream's version handles separate transcript/project ingest, uses _mempalace_python(), and adds _pin_hnsw_threads. The local config.py regex relaxation auto-merged cleanly and is preserved.

Safety tag: pre-upstream-merge-20260501-091227 (rollback target).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Summary
The capacity probe added in #1227 hardcoded a 2,000-row floor for the "DIVERGED" decision. The comment justifying that number explicitly tied it to chromadb's default `sync_threshold` of 1,000: "Two synchronization windows worth (2 × sync_threshold = 2000) is a safe steady-state ceiling". #1191 then set `hnsw:sync_threshold = 50_000` via `_HNSW_BLOAT_GUARD` (to prevent index bloat) without updating the divergence floor to track it. The two changes shipped about a week apart and the dependency wasn't caught.

Symptom
Any palace created via `ChromaBackend.create_collection` after #1191 carries `hnsw:sync_threshold=50_000` in `collection_metadata`. ChromaDB then naturally accumulates up to 50,000 entries in its WAL (`embeddings_queue`) between HNSW flushes — that's by design. The on-disk `index_metadata.pickle` only updates at flush time, so between flushes:

- `sqlite_count` grows with every write
- `hnsw_count` (read from the pickle) stays frozen until the next flush

With the 2,000 floor, the divergence threshold becomes `max(2000, 10% × sqlite_count)`. For a 100K-drawer palace that's 10,000 — and the queue crosses 10K maybe a fifth of the way through every 50K-write cycle. Result: the MCP server (`mcp_server._refresh_vector_disabled_flag`) routes vector search to the BM25 fallback and disables duplicate detection for ~80% of the write cycle on any actively-mined ≥100K palace, even though chromadb is behaving correctly.

Concretely, on a 100K-drawer palace I just rebuilt:
The "49% invisible" framing is misleading: those 48,176 drawers are queued for the next HNSW flush, not corrupt. ChromaDB's WAL replay surfaces them on every fresh open — `count()` returns the full 98,176, and a CLI `mempalace search` returns hybrid (cosine + BM25) results across the full set. The MCP server's stricter guardrail is what disables vector search.

Fix
Read the configured `hnsw:sync_threshold` from `collection_metadata` per palace and scale the floor to `2 × sync_threshold`. The 10% relative term, the unflushed-pickle branch, and the original #1222 detection capability are unchanged. A 91%-missing-of-192,997 palace (the actual #1222 reproducer) still trips, regardless of whether the collection was created with `sync_threshold=1000` or `50000`.

Behavior summary
| Collection's `hnsw:sync_threshold` | New floor | Old floor |
|---|---|---|
| Missing (legacy palace) | 2000 | 2000 (unchanged) |
| 1000 (chromadb default) | 2000 | 2000 (unchanged) |
| 50000 (`_HNSW_BLOAT_GUARD`) | 100000 | 2000 (the bug) |
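The fix described above can be sketched end to end. This assumes `collection_metadata` is the usual chromadb sqlite table keyed by `(collection_id, key)` with numeric values in an `int_value` column; that layout and the function names here are assumptions for illustration, not MemPalace's actual code:

```python
import sqlite3

def read_sync_threshold(db_path, collection_id):
    """Read hnsw:sync_threshold for one collection, defaulting to 1000."""
    con = sqlite3.connect(db_path)
    try:
        row = con.execute(
            "SELECT int_value FROM collection_metadata "
            "WHERE collection_id = ? AND key = 'hnsw:sync_threshold'",
            (collection_id,),
        ).fetchone()
    finally:
        con.close()
    return int(row[0]) if row and row[0] is not None else 1000

def capacity_status(sqlite_count, hnsw_count, sync_threshold):
    """Post-fix decision: floor scales to two sync windows."""
    floor = max(2000, 2 * sync_threshold)
    threshold = max(floor, sqlite_count // 10)
    return "DIVERGED" if sqlite_count - hnsw_count > threshold else "OK"

# Steady-state lag under the bloat guard is now tolerated...
assert capacity_status(98_176, 50_000, 50_000) == "OK"
# ...while the #1222 shape (91% of 192,997 missing) still trips.
assert capacity_status(192_997, 17_000, 50_000) == "DIVERGED"
```

Keeping the 10% relative term inside the `max()` preserves the original detection behavior on very large palaces where 10% of `sqlite_count` exceeds two sync windows.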
Four new tests in
tests/test_hnsw_capacity.py:test_capacity_status_tolerates_lag_under_large_sync_threshold— regression for the fix: prevent HNSW index bloat from resize+persist cycles #1191/fix(repair): detect HNSW capacity divergence and fall back to BM25 (#1222) #1227 conflict: 100K sqlite + 50K HNSW +sync_threshold=50_000→OK. (Pre-fix: would flag DIVERGED.)test_capacity_status_still_flags_real_corruption_under_large_sync— HNSW max_elements frozen at 16K while collection grows to 100K+ entries — MCP server segfaults on every tool call #1222-shape (200K sqlite + 16K HNSW) on async_threshold=50_000collection still trips correctly.test_capacity_status_default_threshold_when_no_sync_metadata— legacy palaces without the metadata row use the 2,000 fallback floor (no behavior change).test_unflushed_path_also_uses_dynamic_floor— the never-flushed branch (no pickle yet) also scales: a 30K-drawer collection undersync_threshold=50_000is no longer flagged DIVERGED before its first flush._seed_chroma_dbin the test fixture grew an optionalsync_thresholdparameter that, when provided, seeds anhnsw:sync_thresholdrow into a newcollection_metadatatable. All 18 pre-existing tests intest_hnsw_capacity.pycontinue to pass; the broadertest_backends.py(45 tests) shows no regressions.Related
- #1191 — `_HNSW_BLOAT_GUARD` with `sync_threshold=50_000`