
fix: prevent HNSW index bloat from resize+persist cycles#1191

Merged
igorls merged 1 commit into MemPalace:develop from funguf:fix/hnsw-index-bloat-rebased on Apr 27, 2026

Conversation

@funguf
Contributor

@funguf funguf commented Apr 25, 2026

Summary

Rebases #346 onto current develop and verifies the fix on a fresh 39,792-drawer workload. The original PR has been stalled awaiting rebase since 2026-04-11 (@bensig requested rebase, @robot-rocket-science agreed but went silent).

Fixes #344. Supersedes #346 — credit retained via Co-authored-by: trailer.

What changed

mempalace/backends/chroma.py — adds _HNSW_BLOAT_GUARD = {"hnsw:batch_size": 50_000, "hnsw:sync_threshold": 50_000} and merges it into the collection metadata in both get_collection(create=True) and the legacy create_collection, alongside the existing hnsw:space.

mempalace/miner.py process_file — replaces the per-chunk add_drawer() loop with a single batched collection.upsert(documents=..., ids=..., metadatas=...) per file. The batched path recomputes the same deterministic drawer id (drawer_{wing}_{room}_{sha256(source_file + chunk_index)[:24]}) and preserves every existing metadata field (wing, room, source_file, chunk_index, added_by, filed_at, normalize_version, source_mtime, hall, entities) so re-mines remain on-disk-compatible with previously-filed drawers. The surrounding mine_lock, file_already_mined re-check, collection.delete(where=...) purge, and closet-building are untouched. add_drawer() is kept for any external callers.

Empirical verification

Rebuilt a palace from scratch on 39,792 drawers across 5 wings with this fix applied — that's ~4× past the threshold where the bug was reproducible.

| Metric | Before fix (~15K drawers) | After fix (39,792 drawers) |
|---|---|---|
| `du -sh palace/` | 30 GB on disk | 376 MB |
| `link_lists.bin` (apparent size) | 614 GB sparse | 0 bytes |
| `mempalace status` | Segfault (exit 139) | Returns instantly |
| `mempalace search` | Segfault | Returns ranked results |
| `mempalace repair` | Segfault | n/a |

Both Chroma collection dirs in the post-fix palace have link_lists.bin at literal 0 bytes (zero, not zero-filled-sparse). The metadata-persistence concern raised in #1161 was reproduced as a 5-line test against chromadb 1.5.8 and does not affect the keys this PR sets — hnsw:batch_size and hnsw:sync_threshold survive PersistentClient reopen cleanly.

Migration note

The HNSW config is only honoured at collection-create time. Users on already-bloated palaces still need to nuke and re-mine — there's no in-place fix because Chroma treats HNSW metadata as immutable post-creation. Worth a release-note line; happy to add a one-time warning in _get_collection() if maintainers want one.

Test plan

  • Mine <1K drawer wing — behaviour unchanged (esp32-tuya-gw, 717 drawers)
  • Mine ~1K drawer wing — behaviour unchanged (chirpa, 1367 drawers)
  • Mine ~10K drawer wing — no bloat (home, 9,712 drawers)
  • Mine ~28K drawer wing — no bloat (git-home, 27,862 drawers)
  • Sequential mine 5 wings → 39,792 drawers total → palace 376 MB, link_lists.bin 0 bytes
  • mempalace status and mempalace search no segfault on post-fix palace
  • Drawer id formula and metadata fields verified byte-identical against pre-fix add_drawer() for re-mine compatibility
  • chromadb 1.5.8 persists batch_size + sync_threshold through reopen (verified separately; addresses the #1161 concern that HNSW config is not persisted by chromadb 1.5.x)

🤖 Generated with Claude Code

@igorls
Member

igorls commented Apr 25, 2026

Review

mergeable: dirty against develop — rebasing onto current develop will not be a quiet refresh; the same area was substantially restructured on 2026-04-24 by a4868a3 ("perf(mining): batch per-chunk upserts and add optional GPU acceleration").

chroma.py — valuable, but needs adjustment on rebase

The HNSW guard is the real, novel fix develop doesn't have. The diff still bases on the older signature though — develop's get_collection and create_collection now pass **ef_kwargs to plumb the GPU/CoreML/DML embedding function. A naïve rebase would silently drop that. The merged form should be:

client.get_or_create_collection(
    collection_name,
    metadata={"hnsw:space": hnsw_space, **_HNSW_BLOAT_GUARD},
    **ef_kwargs,
)

…at both call sites (get_collection and create_collection).

miner.py — superseded, please drop the hunk

a4868a3 already batches per-file upserts. The current shape on develop is materially better than this PR's shape:

  • A _build_drawer_metadata() helper factors the metadata construction. The PR re-inlines it, duplicating logic.
  • The on-develop loop uses bounded sub-batches of DRAWER_UPSERT_BATCH_SIZE = 1000. The PR collapses to a single unbounded upsert per file, regressing the deliberate trade-off (a bad chunk fails its sub-batch instead of the whole file).
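The bounded sub-batch pattern amounts to slicing the per-file arrays into fixed-size windows. A rough sketch (the recorder stub and helper name are illustrative; only `DRAWER_UPSERT_BATCH_SIZE = 1000` comes from develop):

```python
DRAWER_UPSERT_BATCH_SIZE = 1000  # bound from a4868a3


def upsert_in_batches(collection, ids, documents, metadatas,
                      batch_size=DRAWER_UPSERT_BATCH_SIZE):
    """Upsert in bounded windows so one bad chunk fails its own
    sub-batch instead of failing the whole file."""
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        collection.upsert(ids=ids[start:end],
                          documents=documents[start:end],
                          metadatas=metadatas[start:end])


class _Recorder:
    """Test double standing in for a Chroma collection."""
    def __init__(self):
        self.calls = []

    def upsert(self, **kw):
        self.calls.append(len(kw["ids"]))


rec = _Recorder()
upsert_in_batches(rec, list(range(2500)), ["doc"] * 2500, [{}] * 2500)
# Three sub-batches: 1000 + 1000 + 500
```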

The HNSW-bloat root cause is resize/persist drift driven by the default thresholds — that's fixed by the chroma.py metadata, not by switching from "loop calling upsert" to "one upsert" (both forms hit the same hnswlib internals once batch/sync are large).
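A back-of-envelope way to see this: the number of persist cycles scales with items divided by sync_threshold, independent of how the upsert calls are grouped. The model below is a deliberate simplification of hnswlib's actual behaviour:

```python
def persist_cycles(n_items: int, sync_threshold: int) -> int:
    """Approximate number of HNSW persist flushes while inserting
    n_items: one flush each time the dirty queue crosses the threshold."""
    return n_items // sync_threshold


# chromadb's default sync_threshold of 1000 vs. the 50_000 guard,
# at the 39,792-drawer workload verified in this PR:
default_cycles = persist_cycles(39_792, 1_000)   # dozens of flush cycles
guarded_cycles = persist_cycles(39_792, 50_000)  # threshold never crossed
```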

Other notes

  • The new try-block reduces to:
    try:
        collection.upsert(...)
        drawers_added = len(drawer_ids)
    except Exception:
        raise
    That's a no-op handler and should be removed (moot if the whole hunk is dropped).
  • The drawer-id formula and the metadata-field set are byte-identical to add_drawer() — the re-mine compatibility claim holds.
  • A small regression test asserting the two HNSW keys land on a freshly-created collection's metadata would lock the contract and cover the #1161 "config not persisted" concern at CI time:
    col = backend.get_collection(palace_ref, "drawers", create=True)
    assert col._raw.metadata["hnsw:batch_size"] == 50_000
    assert col._raw.metadata["hnsw:sync_threshold"] == 50_000
  • Migration note is correct — HNSW metadata is immutable post-creation, so already-bloated palaces need a fresh mine. A one-time warning in _get_collection() when the keys are absent would be friendly but isn't a blocker.

Suggested next steps

  1. Rebase onto current develop.
  2. Drop the miner.py hunk — the batched-upsert path already exists upstream with bounded sub-batches and a shared metadata helper.
  3. Preserve **ef_kwargs when adding _HNSW_BLOAT_GUARD to the two metadata={...} call sites in chroma.py.
  4. Add a test asserting the two HNSW keys land on the created collection's metadata.

The chroma.py change is the load-bearing fix and should ship.

@funguf force-pushed the fix/hnsw-index-bloat-rebased branch from a93b59d to 7538299 on April 25, 2026 at 13:05
@funguf requested a review from igorls as a code owner on April 25, 2026 at 13:05
@funguf
Contributor Author

funguf commented Apr 25, 2026

@igorls thank you for the careful review — agreed on every point, force-pushed accordingly:

  1. Rebased onto current develop (0d9929c). PR is now MERGEABLE.
  2. Dropped the miner.py hunk entirely. You're right that a4868a3's bounded DRAWER_UPSERT_BATCH_SIZE=1000 sub-batches with the shared _build_drawer_metadata() helper is strictly better than my single-unbounded-upsert. Re-inlining the metadata construction would have been a regression and the failure-isolation argument is convincing.
  3. _HNSW_BLOAT_GUARD merged into the metadata dict at both call sites, alongside the existing hnsw:space, hnsw:num_threads=1, and **ef_kwargs. No silent drop of either the race fix or the embedding-function plumbing.
  4. Added two regression tests (test_chroma_backend_sets_hnsw_bloat_guard_on_creation for the RFC 001 path, test_chroma_backend_create_collection_sets_hnsw_bloat_guard for the legacy create_collection path). Both assert hnsw:batch_size and hnsw:sync_threshold land on the persisted collection metadata, which doubles as a CI-time guard against the #1161 silent-drop concern. Full tests/test_backends.py suite still passes (34/34).
  5. No-op try/except: raise removed — moot now that the miner.py hunk is gone.

Happy to add the one-time warning in _get_collection() for pre-fix palaces (when the keys are absent in already-existing collection metadata) as a follow-up; let me know if you'd rather it land here or as a separate PR.

@funguf force-pushed the fix/hnsw-index-bloat-rebased branch from 7538299 to 9a1e929 on April 25, 2026 at 13:17
@funguf
Contributor Author

funguf commented Apr 25, 2026

Self-review caught two more nits before this gets re-reviewed; folded them in (force-push, single commit 9a1e929):

  1. mempalace/mcp_server.py:227 was a third collection-creation call site bypassing ChromaBackend and would have created palaces without the guard if the MCP server ran before any CLI mine. Imported _HNSW_BLOAT_GUARD and merged it into that metadata= dict alongside the existing hnsw:space and hnsw:num_threads=1. Verified all 67 tests in tests/test_mcp_server.py + tests/test_mcp_stdio_protection.py still pass, and added a smoke test confirming the metadata round-trips through PersistentClient reopen.

  2. The "immutable post-creation" migration claim was wrong on chromadb 1.5.x. Verified empirically that UpdateHNSWConfiguration (defined at chromadb/api/collection_configuration.py:460-466) exposes batch_size, sync_threshold, num_threads, ef_search, resize_factor as updatable via collection.modify(configuration={"hnsw":{...}}). I'd cargo-culted the claim from the #344 discussion. Reworded the PR body migration note and the _HNSW_BLOAT_GUARD docstring to acknowledge that the retrofit path exists but isn't useful here (by the time link_lists.bin has bloated, the index is already corrupt and a live retrofit won't recover it). The test docstring was also softened.

Diff is small (mempalace/backends/chroma.py | 8, mempalace/mcp_server.py | 13, tests/test_backends.py | 8).

Collaborator

@bensig bensig left a comment


Approve. The empirical verification table makes this an easy yes — 30 GB → 376 MB on the same drawer count is the kind of result that's hard to argue with.

What I checked

  • Diff is tight. Three call sites get `**_HNSW_BLOAT_GUARD` merged into their collection-create metadata: `ChromaBackend.get_collection(create=True)`, `ChromaBackend.create_collection`, and `mcp_server._get_collection`. No drift between the three.
  • chromadb 1.5.x persistence. The body claims `hnsw:batch_size` + `hnsw:sync_threshold` survive `PersistentClient` reopen (verified separately against #1161's concern). That's the right thing to verify; `hnsw:num_threads` was the one that didn't persist on 1.5.x and #976 had to add the explicit retrofit. Worth a comment in the constant block noting which keys are persistence-safe vs which need the retrofit.
  • Migration honesty. The PR body says "users on already-bloated palaces still need to nuke and re-mine — there's no in-place fix." That's right: chromadb's HNSW config is immutable post-creation, and the rebuild path (#935 + #1210) handles the recovery. Worth a release-note line so users with pre-existing bloat understand why repair alone won't reclaim disk.

Tests

`pytest tests/test_backends.py tests/test_mcp_server.py -k "hnsw or bloat or guard or batch_size"` — 8 passed on this branch. The three new metadata-assertion tests (`get_collection_includes_hnsw_bloat_guard`, `create_collection_includes_hnsw_bloat_guard`, `get_collection_persists_hnsw_metadata_through_reopen`) cover the surface this PR changes.

Two minor observations

  1. PR body has stale scope. It mentions `process_file` batching changes, but those landed via #1185 and aren't in this diff. Worth trimming the body before merge so a reader doesn't search for the change and wonder why it isn't there.
  2. Two-key vs three-key set. This PR's guard is `{batch_size: 50_000, sync_threshold: 50_000}`. #976's existing pin is `{num_threads: 1}`. Both end up in the same metadata dict via `**_HNSW_BLOAT_GUARD` and the explicit `hnsw:num_threads: 1` line. They don't conflict, but the dual sources for HNSW config are a small maintenance trap if someone later assumes everything HNSW lives in one constant. Could promote `num_threads` into `_HNSW_BLOAT_GUARD` as a follow-up.

Closes #344. Approving.

Sets `hnsw:batch_size` and `hnsw:sync_threshold` to 50_000 at every
collection-creation call site:

* `mempalace/backends/chroma.py` — `get_collection(create=True)` and the
  legacy `create_collection()` path. Preserves existing `hnsw:space`,
  `hnsw:num_threads=1` (race fix from MemPalace#976), and `**ef_kwargs`
  (embedding-function plumbing from a4868a3).
* `mempalace/mcp_server.py` — the direct `client.get_or_create_collection`
  path used when a palace is first opened by the MCP server. Without this
  third site, MCP-bootstrapped palaces would skip the guard and could
  still trigger the original bloat.

Without these defaults, mining ~10K+ drawers triggers ~30 HNSW index
resizes and hundreds of persistDirty() calls. persistDirty uses relative
seek positioning in link_lists.bin; accumulated seek drift across resize
cycles causes the OS to extend the sparse file with zero-filled regions,
each cycle compounding the next. Result: link_lists.bin grows into
hundreds of GB sparse, after which `status`, `search`, and `repair` all
segfault and the palace is unrecoverable.

Empirical: rebuilt a palace from scratch on 39,792 drawers across 5
wings with this fix applied. Final palace 376 MB, link_lists.bin stays
at 0 bytes across both Chroma collection dirs, status and search both
return cleanly. Same workload without the fix bloated the palace to
565 GB sparse (30 GB on disk) and segfaulted at ~15K drawers.

Migration note: chromadb 1.5.x exposes a
`collection.modify(configuration={"hnsw": {...}})` retrofit path for
already-created collections (`UpdateHNSWConfiguration`), but this PR
doesn't pursue it — by the time link_lists.bin has bloated the index
is already corrupt and the only known recovery is a fresh mine.

Tests assert both keys land on the persisted collection metadata in
both `ChromaBackend` code paths, which also covers the MemPalace#1161 "config
silently dropped" concern at CI time. A separate smoke test was used
to verify the metadata round-trips through `chromadb.PersistentClient`
reopen on chromadb 1.5.8.

Closes MemPalace#344
Supersedes MemPalace#346

Co-authored-by: robot-rocket-science <[email protected]>
@igorls force-pushed the fix/hnsw-index-bloat-rebased branch from 9a1e929 to 88a53b2 on April 27, 2026 at 05:54
@igorls
Member

igorls commented Apr 27, 2026

Rebased onto current develop (88a53b2) and pushed. The conflict was in the `from .backends.chroma import` block in mcp_server.py: `_HNSW_BLOAT_GUARD` from this PR needed to sit alongside `hnsw_capacity_status` from #1227 (merged 2026-04-26). Both symbols are now imported and used; no behaviour change.

Author preserved on the commit; committer is mine (force-pushed via maintainer-edit access). Tests: 1420 passed, 1 skipped on the rebased branch; ruff check + format clean.

Thanks for picking up this stalled rebase from #346 — the empirical 30 GB → 376 MB result is the kind of evidence that closes the loop on the bug class.

@igorls igorls merged commit de7801e into MemPalace:develop Apr 27, 2026
6 checks passed
igorls added a commit that referenced this pull request Apr 27, 2026
Bumps every version source from 3.3.3 to 3.3.4:
- pyproject.toml
- mempalace/version.py (canonical)
- .claude-plugin/plugin.json
- .claude-plugin/marketplace.json
- .codex-plugin/plugin.json
- README.md badge

Dates the CHANGELOG section and adds entries for the bug fixes that
landed this cycle (#1135, #1191, #1230, #1231) plus expands the #1194
entry to credit the lookup-side recovery path from #1197.

Pre-tag verification:
- 1441 passed, 1 skipped (full suite minus benchmarks, all platforms)
- ruff check + format clean
- 44/44 in test_version_consistency + test_readme_claims (6-file sync)
- JPH invariant: pyproject.toml + .claude-plugin/plugin.json both
  reference mempalace-mcp
- Wheel build + fresh-venv install: mempalace --version reports 3.3.4,
  mempalace-mcp --help works (catches the v3.3.2-class regression)
@igorls igorls mentioned this pull request Apr 27, 2026
jphein added a commit to jphein/mempalace that referenced this pull request Apr 27, 2026
22 commits from upstream/develop, including:

- HNSW capacity divergence detection + BM25-only sqlite fallback when
  vector layer is unloadable (MemPalace#1222 / MemPalace#1227 — vector_disabled kwarg
  routes search through chromadb-bypass path on segfault risk).
- HNSW index bloat prevention via batch_size + sync_threshold metadata
  on collection create (MemPalace#1191).
- Hooks: always mine the active transcript as convos, additive to
  MEMPAL_DIR (MemPalace#1230 / MemPalace#1231). Restructured into `_get_mine_targets()`
  list approach + separate `_ingest_transcript()`. Daemon-strict gate
  preserved on each entry point.
- Wing-name normalization for hyphenated dirs across miner / convo_miner
  / palace_graph (MemPalace#1194 / MemPalace#1195 / MemPalace#1197).
- narrow `_fix_blob_seq_ids` shim + `repair --mode max-seq-id` for
  legacy 0.6.x BLOB-poisoned palaces (MemPalace#1135).

Conflict resolution:
- README.md: kept fork-shaped narrative, dropped upstream's sweep tip
  injection (per fork-readme handling memory).
- hooks/README.md: adopted upstream's accurate `MEMPAL_DIR` additive
  description; kept `MEMPAL_PYTHON` env-var name (matches actual
  `hooks/*.sh` scripts in both forks).
- mempalace/cli.py: consolidated duplicate `--mode` argparse declarations
  into one with all 4 choices (rebuild/legacy/reorganize/max-seq-id).
- mempalace/hooks_cli.py: adopted upstream's `_get_mine_targets()` +
  `_ingest_transcript()` shape; added `_daemon_strict()` guard at entry
  of `_maybe_auto_ingest`, `_mine_sync`, and `_ingest_transcript` so
  the daemon-strict architecture still skips local writes.
- mempalace/mcp_server.py: kept both `kind=` and `vector_disabled=`
  kwargs on the `search_memories` call.
- mempalace/searcher.py: kept fork's `_count_in_scope`,
  `_sqlite_fallback_and_scope`, `_apply_kind_text_filter` AND upstream's
  `_bm25_only_via_sqlite`. Both `kind` and `vector_disabled` parameters
  on `search_memories`.

Tests: 1510 passing (up from 1366 — upstream brought new test suites).
mjc added a commit to mjc/mempalace that referenced this pull request Apr 28, 2026
Follow-up to MemPalace#1222.
Adjacent to the upstream HNSW index-bloat work in MemPalace#1191.

Co-authored-by: Copilot <[email protected]>
mjc added a commit to mjc/mempalace that referenced this pull request Apr 28, 2026
Follow-up to MemPalace#1222.
Directly related to MemPalace#1124.
Adjacent to the upstream HNSW index-bloat work in MemPalace#1191.

Co-authored-by: Copilot <[email protected]>
mjc added a commit to mjc/mempalace that referenced this pull request Apr 28, 2026
Follow-up to MemPalace#1222.
Directly related to MemPalace#1124.
Adjacent to the upstream HNSW index-bloat work in MemPalace#1191.

Co-authored-by: Copilot <[email protected]>
mjc added a commit to mjc/mempalace that referenced this pull request Apr 28, 2026
Follow-up to MemPalace#1222.
Directly related to MemPalace#1124.
Adjacent to the upstream HNSW index-bloat work in MemPalace#1191.

Co-authored-by: Copilot <[email protected]>
mjc added a commit to mjc/mempalace that referenced this pull request Apr 28, 2026
Follow-up to MemPalace#1222.
Directly related to MemPalace#1124.
Adjacent to the upstream HNSW index-bloat work in MemPalace#1191.

Co-authored-by: Copilot <[email protected]>
mjc added a commit to mjc/mempalace that referenced this pull request Apr 28, 2026
Follow-up to MemPalace#1222.
Directly related to MemPalace#1124.
Adjacent to the upstream HNSW index-bloat work in MemPalace#1191.

Co-authored-by: Copilot <[email protected]>
peterwangsc added a commit to peterwangsc/mempalace that referenced this pull request Apr 30, 2026
…lush

PR MemPalace#1191's `_HNSW_BLOAT_GUARD` set `hnsw:batch_size=50_000` and
`hnsw:sync_threshold=50_000` at every collection-creation site to
prevent the link_lists.bin resize-cycle bloat that hits during large
mining workloads. The empirical validation was on a 39,792-drawer
rebuild — i.e. the `mempalace_drawers` collection.

`mempalace_closets` legitimately stays small (~6,000 entries on a
~110k-drawer palace, capped by source-file count). With the same
50_000 threshold, its HNSW segment never reaches a sync point, so
`index_metadata.pickle` never lands on disk. Once sqlite crosses the
2,000-row absolute divergence threshold, `hnsw_capacity_status`
reports the closet collection DIVERGED with `hnsw_count=None`, search
falls back to BM25-only, and `mempalace repair --mode legacy` is the
recommended remedy — but rebuilding a never-flushed segment hits
chromadb 1.5.x's missing-pickle path and SIGSEGVs.

Splits the guard into a default and a small-collection variant
(batch_size=100, sync_threshold=100), selected by collection name via
`_hnsw_metadata_for(collection_name)`. Drawers keep the 50_000 guard;
closets get the lower threshold and flush during normal upsert
batches without re-introducing resize-cycle bloat (which only fires
past ~10k items per PR MemPalace#344).

Adds `tests/test_small_collection_hnsw_flush.py` — 6 regression tests
covering the pickle-flush contract for small collections, the
`hnsw_capacity_status` false-positive that motivated the production
incident, a control test pinning the cause to the threshold setting,
and a slow-marked counter-test confirming PR MemPalace#1191's bloat protection
still holds for collections that cross 50_000.
peterwangsc added a commit to peterwangsc/mempalace that referenced this pull request Apr 30, 2026
Two defenses that, combined, prevent a mid-write-corrupt HNSW segment
from reaching chromadb's Rust loader and tripping the SIGSEGV
documented at chroma-core/chroma#6852 (open, unfixed in 1.5.7 — same
null-pointer-deref signature reproduced today via lldb backtrace).

1. ``_segment_appears_healthy`` no longer treats a missing
   ``index_metadata.pickle`` as universally healthy. The function
   previously assumed "no pickle = fresh, never-flushed" and returned
   True so ``quarantine_stale_hnsw`` would leave the segment in place.
   That's correct only when ``data_level0.bin`` is also empty/absent.
   When ``data_level0.bin`` holds real HNSW data and the pickle is
   missing, the segment is mid-write corruption — chromadb wrote HNSW
   bytes but never the metadata, either because it crashed mid-flush
   or because ``sync_threshold`` was never crossed (PR MemPalace#1191's uniform
   50_000 threshold caused exactly this on closet collections, see
   commit 8488358). Loading such a segment via ``PersistentClient``
   SIGSEGVs in the Rust HNSW loader. The function now checks
   ``data_level0.bin``'s size and returns False when it's non-empty
   without an accompanying pickle, so quarantine catches it.

2. ``ChromaBackend._client`` now invokes ``quarantine_stale_hnsw``
   once per palace per process, mirroring the protection ``make_client``
   already provides. The two paths previously diverged: ``make_client``
   (used by some long-lived callers) gated quarantine on
   ``_quarantined_paths`` and called it on first open; ``_client``
   (used by every ``get_collection`` invocation, including
   ``mempalace repair --mode legacy``) skipped quarantine entirely.
   That made repair the most exposed code path to chromadb's
   stale-HNSW SIGSEGV — the exact failure observed today when repair
   was run against a palace whose closets segment had the mid-write-
   corruption shape from MemPalace#1. ``_client`` now uses the same gate so
   the cost is amortized and steady-write daemons don't thrash on
   reopen.

Adds 9 regression tests in ``tests/test_segment_quarantine_defenses.py``:

* Three ``_segment_appears_healthy`` cases pinning the truly-fresh
  vs. mid-write-corrupt vs. flushed-with-pickle distinction, plus the
  truncated-pickle edge.
* Three ``quarantine_stale_hnsw`` end-to-end cases: corrupt-and-stale
  → quarantined; flushed-and-stale → left alone (chromadb's async
  flush makes drift the steady state); fresh-and-stale → left alone
  (renaming an empty segment would orphan nothing).
* Two ``ChromaBackend.get_collection`` integration cases: a corrupt
  segment alongside a freshly-created palace gets quarantined on
  first open; subsequent opens within the same process do not
  re-fire the scan (``_quarantined_paths`` gate).
igorls pushed a commit that referenced this pull request May 1, 2026
The capacity probe added in #1227 hardcoded a 2,000-row floor for the
"diverged" decision. The comment justifying that number explicitly tied
it to chromadb's *default* sync_threshold of 1,000 — "Two synchronization
windows worth (2 × sync_threshold = 2000) is a safe steady-state ceiling".

#1191 then bumped sync_threshold to 50,000 via _HNSW_BLOAT_GUARD without
updating the floor. Result: any palace created with the bloat guard
flips between OK and DIVERGED on every flush cycle. Steady-state
divergence sits at 0–50K (the natural queue depth), and the 2,000 floor
trips the guardrail the moment the queue exceeds 10% of sqlite_count.
The MCP server then routes search to BM25-only and disables duplicate
detection for ~80% of the write cycle on actively-mined ≥100K palaces,
even though chromadb is behaving correctly.

This change reads the configured `hnsw:sync_threshold` from
`collection_metadata` per palace and scales the floor to 2 × that value.
The 10% relative term and the original #1222 detection capability are
unchanged — a 91%-missing-of-192K palace (the actual #1222 reproducer)
still trips, regardless of whether the collection was created with
sync_threshold=1000 or 50000.

Behavior summary:

| Collection's sync_threshold | New floor | Old floor |
|---|---|---|
| Missing (legacy palace) | 2000 | 2000 (unchanged) |
| 1000 (chromadb default) | 2000 | 2000 (unchanged) |
| 50000 (#1191 bloat guard) | 100000 | 2000 (the bug) |

Tests:
- test_capacity_status_tolerates_lag_under_large_sync_threshold (regression
  for the #1191/#1227 conflict — 100K sqlite + 50K HNSW + sync=50K → OK)
- test_capacity_status_still_flags_real_corruption_under_large_sync (#1222
  shape with bloat-guard collection — still detects corruption)
- test_capacity_status_default_threshold_when_no_sync_metadata (legacy
  palaces without the metadata row use the 2000 fallback floor)
- test_unflushed_path_also_uses_dynamic_floor (the never-flushed branch
  scales too — 30K under sync_threshold=50000 is no longer flagged)

All 18 pre-existing tests in tests/test_hnsw_capacity.py and 45 tests
in tests/test_backends.py still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
igorls added a commit that referenced this pull request May 1, 2026
Three fixes landed on develop after the initial release-prep cut and
were brought in via the develop merge. Document them in the 3.3.4
Bug Fixes section so the release notes reflect what users will
actually receive.

- #1287 - HNSW divergence floor scales with hnsw:sync_threshold
  (resolves a silent-fallback regression introduced by the
  interaction between #1191 and #1227 in this release)
- #1262 - ChromaBackend get_or_create_collection split, fixing the
  stop-hook SIGSEGV class on legacy palaces with mismatched stored
  metadata (#1089)
- #1288 / #1254 - repair --mode max-seq-id heuristic now decodes
  BLOB-typed embeddings.seq_id, restoring the un-poison path added
  in #1135 for palaces where chromadb 1.5.x writes seq_ids natively
Scorpion1221 pushed a commit to Scorpion1221/mempalace that referenced this pull request May 2, 2026
… (upstream MemPalace#1191)

Adapted upstream fix to our chroma.py call sites (get_or_create + create_collection).
Scorpion1221 pushed a commit to Scorpion1221/mempalace that referenced this pull request May 2, 2026
Sync with upstream/develop via cherry-pick of 9 critical bugfixes:
- HNSW index bloat prevention (MemPalace#1191)
- SIGSEGV guards on collection reopen (MemPalace#1262, MemPalace#1289)
- blob-seq marker fast-path (MemPalace#1177)
- palace_graph None-metadata guard (MemPalace#1201)
- palace_graph security tunnels (MemPalace#1168)
- hyphenated wing name normalization (MemPalace#1197)
- searcher _tokenize None guard (MemPalace#1198)
- mcp_server diary topic sanitize (MemPalace#936)

Tests: 1459 default + 113 benchmark + 6/7 stress all pass.
(random-kill stress test was failing on dev pre-merge; not regressed.)
xcarbo added a commit to xcarbo/mempalace that referenced this pull request May 4, 2026
Catches up on a month of upstream work. Highlights pulled in:

- MemPalace#1306 searcher: hybrid candidate union (vector ∪ BM25 reranking pool)
- MemPalace#1325 mcp security: omit absolute paths from tool_get_drawer / tool_status
- MemPalace#1322 chroma: wire quarantine_stale_hnsw to prevent SIGSEGV on stale HNSW
- MemPalace#1320 mcp: forward valid_to / source params in kg_add / kg_invalidate
- MemPalace#1321 cli: honor --palace flag in cmd_init
- MemPalace#1314 kg temporal params fix
- MemPalace#1244 cli: cmd_compress writes to mempalace_closets so palace can read
- MemPalace#1243 mcp: case-insensitive agent name in diary_write/diary_read
- MemPalace#1303 mcp_server: pass embedding_function= on collection reopen
- MemPalace#1076/MemPalace#1077 hooks: quote CLAUDE_PLUGIN_ROOT / CODEX_PLUGIN_ROOT in hooks.json
- Various ruff format passes on touched files

Conflict resolution (CHANGELOG.md only — code files all auto-merged):

- 3.3.5 unreleased section header from upstream kept above 3.3.4
- 3.3.4 section: kept our 2026-04-30 release date; merged upstream's new
  MemPalace#1299 SIGSEGV-on-default-EF entry in alongside our existing topic-tunnels
  (MemPalace#1194/MemPalace#1195/MemPalace#1197), HNSW-bloat (MemPalace#1191), max_seq_id (MemPalace#1135), and
  auto-ingest (MemPalace#1230/MemPalace#1231) entries. Kept our richer topic-tunnels detail
  (upstream's version was a strict subset).

xdev patches preserved (still on this branch, untouched by merge):
- 6ef44cb fix(hooks): route CC transcripts via convo_miner with cwd-based wings
- 3fad61d fix(config): allow leading dash in wing names

Not pushed to origin — run tests locally and decide when to push.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>


Development

Successfully merging this pull request may close these issues.

HNSW index bloat: link_lists.bin grows to 441GB when mining >10K drawers

3 participants