fix: HNSW graph corruption, PreCompact deadlock, mine fan-out (closes #974, #965, #955) #976
Conversation
Pull request overview
Fixes several concurrency-triggered failure modes by reducing parallel HNSW mutation risk, preventing runaway concurrent mine fan-out, and ensuring /compact can eventually proceed even when context is exhausted.
Changes:
- Pin Chroma HNSW inserts to single-threaded mode (`hnsw:num_threads=1`) when creating collections.
- Add a non-blocking, process-wide global lock around `miner.mine` to collapse concurrent `mempalace mine` spawns.
- Cap `hook_precompact` "block" decisions per session to avoid indefinite compaction deadlocks.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| mempalace/palace.py | Introduces MineAlreadyRunning and mine_global_lock() for non-blocking process-wide mine serialization. |
| mempalace/miner.py | Wraps the mining pipeline in mine_global_lock() (except dry_run) and exits cleanly if another mine is running. |
| mempalace/hooks_cli.py | Adds per-session attempt tracking to limit how many times precompact can block before passing through. |
| mempalace/backends/chroma.py | Sets hnsw:num_threads=1 in collection metadata on create paths to avoid unsafe parallel inserts. |
```python
lock_dir = os.path.join(os.path.expanduser("~"), ".mempalace", "locks")
os.makedirs(lock_dir, exist_ok=True)
lock_path = os.path.join(lock_dir, "mine_global.lock")
```
mine_global_lock() uses a single global lock file (~/.mempalace/locks/mine_global.lock), which serializes all mempalace mine runs across all palaces/projects (even if they use different --palace paths). If the intent is only to prevent concurrent writes to the same palace, consider keying the lock filename by palace_path (e.g., hash of resolved palace path) or otherwise scoping the lock so independent palaces can be mined in parallel.
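A minimal sketch of the suggested per-palace keying (helper name hypothetical; the keying the PR later shipped has this same shape):

```python
import hashlib
import os

def palace_lock_path(palace_path: str) -> str:
    # Hypothetical helper: one lock file per resolved palace path, so
    # independent palaces mine in parallel while same-palace runs serialize.
    resolved = os.path.normcase(os.path.realpath(os.path.expanduser(palace_path)))
    key = hashlib.sha256(resolved.encode("utf-8")).hexdigest()[:16]
    lock_dir = os.path.join(os.path.expanduser("~"), ".mempalace", "locks")
    os.makedirs(lock_dir, exist_ok=True)
    return os.path.join(lock_dir, f"mine_{key}.lock")
```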
```python
@contextlib.contextmanager
def mine_global_lock():
    """Process-wide non-blocking lock around the full `mine` pipeline.

    The per-file `mine_lock` only protects delete+insert interleave for a
```
mine_global_lock() introduces an important cross-process synchronization mechanism, but there doesn't appear to be automated coverage for its non-blocking behavior (second acquire raises MineAlreadyRunning) and reusability after release. Since the repo already tests mine_lock inter-process behavior, adding similar tests for mine_global_lock would help prevent future regressions.
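A sketch of the missing coverage, assuming flock-style semantics where a second open()+acquire on the held lock file fails fast even within one process:

```python
import pytest

from mempalace.palace import MineAlreadyRunning, mine_global_lock

def test_second_acquire_raises_while_held():
    with mine_global_lock():
        with pytest.raises(MineAlreadyRunning):
            with mine_global_lock():
                pass

def test_lock_is_reusable_after_release():
    with mine_global_lock():
        pass
    with mine_global_lock():  # must succeed again after a clean release
        pass
```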
```diff
 collection = client.get_or_create_collection(
-    collection_name, metadata={"hnsw:space": "cosine"}
+    collection_name,
+    metadata={"hnsw:space": "cosine", "hnsw:num_threads": 1},
 )
```
Passing metadata={... "hnsw:num_threads": 1} to get_or_create_collection() typically does not update metadata for an already-existing collection, so palaces created before this change may keep the unsafe default thread setting. To make the fix effective for existing users, consider explicitly updating the collection after retrieval (e.g., collection.modify(metadata=...) when supported, or at least detect missing/incorrect metadata and attempt to set it) for both the create and non-create code paths.
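A hedged sketch of the suggested retrofit (helper name hypothetical; `collection.modify(metadata=...)` is a documented chromadb API, but support varies by version):

```python
def ensure_num_threads_pinned(collection) -> None:
    # get_or_create_collection() does not rewrite metadata on an existing
    # collection, so re-apply the pin after retrieval, best-effort.
    meta = dict(collection.metadata or {})
    if meta.get("hnsw:num_threads") != 1:
        meta["hnsw:num_threads"] = 1
        try:
            collection.modify(metadata=meta)
        except Exception:
            # Older chromadb releases may reject metadata updates; the
            # create-path pin still protects fresh palaces.
            pass
```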
```diff
 collection = self._client(palace_path).create_collection(
-    collection_name, metadata={"hnsw:space": hnsw_space}
+    collection_name,
+    metadata={"hnsw:space": hnsw_space, "hnsw:num_threads": 1},
 )
```
This metadata pin only applies when callers go through ChromaBackend. The MCP server still creates collections directly via chromadb.PersistentClient.get_or_create_collection(..., metadata={"hnsw:space": "cosine"}) (see mempalace/mcp_server.py, around _get_collection), so that code path would continue to omit hnsw:num_threads=1 and may still hit the HNSW parallel-insert crash. Consider routing MCP server collection creation through ChromaBackend or applying the same metadata update there so the fix covers the primary crash surface described in #974/#965.
Thanks for the thorough review — all six comments were actionable. Pushed follow-up commit 4976e71.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
```diff
+# hnsw:num_threads=1 disables ChromaDB's multi-threaded ParallelFor
+# HNSW insert path, which has a race in repairConnectionsForUpdate /
+# addPoint (see issues #974, #965). The setting is only honored at
+# collection creation time — pre-existing palaces created before
+# this fix keep the unsafe default; users must `mempalace nuke` +
+# re-mine to get the protection on legacy palaces.
 _collection_cache = ChromaCollection(
     client.get_or_create_collection(
-        _config.collection_name, metadata={"hnsw:space": "cosine"}
+        _config.collection_name,
+        metadata={"hnsw:space": "cosine", "hnsw:num_threads": 1},
     )
 )
```
The comment notes that pre-existing palaces must be nuked/re-mined to pick up hnsw:num_threads=1, which means upgrading does not actually protect existing users from #974/#965 unless they manually rebuild. If ChromaDB supports updating metadata on an existing collection (e.g., collection.modify(metadata=...)), consider applying it after get_or_create_collection so older palaces are automatically migrated to single-threaded inserts (best-effort, with graceful fallback if unsupported).
Heads-up on the merge conflict. It's specifically in `hooks_cli.py`: #863 removed the unconditional block entirely. In current develop, `hook_precompact` mines synchronously and passes through unconditionally. After you rebase, the attempt-cap hunk may end up dead code.

Separately on #965, there is an existing open PR #346 (yoshi280) that attributes the sparse `link_lists.bin` blowup to missing `hnsw:batch_size`/`hnsw:sync_threshold` defaults. The …
Thanks — pushed d8a26cd addressing both actionable points from the second review. Response to all four comments: …
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
mempalace/backends/chroma.py:134
`get_collection(..., create=False)` returns an existing Chroma collection without ensuring `hnsw:num_threads` is pinned to 1. Since `get_or_create_collection(..., metadata=...)` does not update metadata for pre-existing collections, palaces created before this change will keep the unsafe default and may still hit the HNSW race described in #974/#965. Consider a best-effort metadata update for existing collections (e.g., calling `collection.modify(metadata={...})` after `get_collection`) or otherwise adjust the "closes #974/#965" claim/documentation to reflect that legacy palaces need manual intervention.
```python
client = self._client(palace_path)
if create:
    collection = client.get_or_create_collection(
        collection_name,
        metadata={"hnsw:space": "cosine", "hnsw:num_threads": 1},
    )
else:
    collection = client.get_collection(collection_name)
return ChromaCollection(collection)
```
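The follow-up commit resolves this with a `_pin_hnsw_threads()` helper; a sketch of that shape (the `UpdateCollectionConfiguration` import path varies across chromadb versions, hence the broad guard):

```python
def _pin_hnsw_threads(collection) -> None:
    # Best-effort retrofit for legacy palaces: re-apply num_threads=1 on
    # every retrieval. Never raises; collection access must not break.
    try:
        from chromadb.api.collection_configuration import (
            UpdateCollectionConfiguration,
            UpdateHNSWConfiguration,
        )

        collection.modify(
            configuration=UpdateCollectionConfiguration(
                hnsw=UpdateHNSWConfiguration(num_threads=1)
            )
        )
    except Exception:
        pass
```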
```python
    The lock file is keyed by sha256(palace_path) so mines against
    *different* palaces can still run in parallel — we only serialize
    writes into the same palace, which is the correctness boundary.

    The key is derived from a fully normalized form of the path:
    `realpath` resolves symlinks and `..` segments, and `normcase` folds
    case on Windows (which has a case-insensitive filesystem). Without
    normcase, `C:\\Palace` and `c:\\palace` would hash to different keys
    on Windows and let two concurrent mines touch the same on-disk palace.
```
The docstring says the lock file is keyed by sha256(palace_path), but the implementation hashes a normalized, fully-resolved path (normcase(realpath(expanduser(palace_path)))). Please update the docstring to match the actual key derivation to avoid misleading future callers/debugging.
Suggested change:

```diff
-    The lock file is keyed by sha256(palace_path) so mines against
-    *different* palaces can still run in parallel — we only serialize
-    writes into the same palace, which is the correctness boundary.
-
-    The key is derived from a fully normalized form of the path:
-    `realpath` resolves symlinks and `..` segments, and `normcase` folds
-    case on Windows (which has a case-insensitive filesystem). Without
-    normcase, `C:\\Palace` and `c:\\palace` would hash to different keys
-    on Windows and let two concurrent mines touch the same on-disk palace.
+    The lock file is keyed by the first 16 hex chars of
+    `sha256(normcase(realpath(expanduser(palace_path))))` so mines against
+    *different* palaces can still run in parallel — we only serialize
+    writes into the same palace, which is the correctness boundary.
+
+    The key is derived from a fully normalized form of the path:
+    `expanduser` resolves `~`, `realpath` resolves symlinks and `..`
+    segments, and `normcase` folds case on Windows (which has a
+    case-insensitive filesystem). Without normcase, `C:\\Palace` and
+    `c:\\palace` would hash to different keys on Windows and let two
+    concurrent mines touch the same on-disk palace.
```
```python
# Backward-compatible alias (previous patch iteration used a single global
# lock). Kept so third-party callers that imported it continue to work; new
# code should use `mine_palace_lock(palace_path)` for per-palace scoping.
mine_global_lock = mine_palace_lock
```
PR description calls this a process-wide mine_global_lock(); the code implements a per-palace lock (mine_palace_lock(palace_path)) and keeps mine_global_lock only as an alias. Please align the PR description/public API docs with the shipped behavior (per-palace serialization), and consider clarifying the alias comment/name so callers don’t assume it serializes all mines regardless of palace path.
| """Pick a start method that works on every CI runner. | ||
|
|
||
| `fork` is cheaper (no re-import) but is unavailable on Windows, so we fall | ||
| back to `spawn` there. `spawn` inherits ``os.environ`` (including the | ||
| monkeypatched ``HOME``) and re-imports the ``mempalace`` package in the | ||
| child, which is sufficient for the lock-file semantics exercised here. | ||
| """ | ||
| start_method = "spawn" if os.name == "nt" else "fork" | ||
| return multiprocessing.get_context(start_method) |
_get_mp_context() uses fork on non-Windows platforms. Elsewhere in the test suite (e.g., tests/test_closets.py:115) multiprocessing tests explicitly use spawn for consistent, portable semantics across macOS/Linux/Windows and to avoid fork-related flakiness. Consider switching this helper to always use spawn (or matching the existing convention) so CI behavior is uniform across runners.
| """Pick a start method that works on every CI runner. | |
| `fork` is cheaper (no re-import) but is unavailable on Windows, so we fall | |
| back to `spawn` there. `spawn` inherits ``os.environ`` (including the | |
| monkeypatched ``HOME``) and re-imports the ``mempalace`` package in the | |
| child, which is sufficient for the lock-file semantics exercised here. | |
| """ | |
| start_method = "spawn" if os.name == "nt" else "fork" | |
| return multiprocessing.get_context(start_method) | |
| """Use a multiprocessing start method with consistent CI semantics. | |
| Always use `spawn` so these tests run with the same process model on | |
| macOS, Linux, and Windows. `spawn` inherits ``os.environ`` (including the | |
| monkeypatched ``HOME``) and re-imports the ``mempalace`` package in the | |
| child, which is sufficient for the lock-file semantics exercised here. | |
| """ | |
| return multiprocessing.get_context("spawn") |
Force-pushed d8a26cd → 404d73f.
Follow-up: addressing remaining Copilot review items. Copilot #1 — …
…stant

Addresses remaining PR MemPalace#976 review items after rebase on develop.

## Copilot Review MemPalace#3 (chroma.py:134) — retrofit legacy collections

`get_collection(create=False)` previously returned existing collections without re-applying `hnsw:num_threads=1`, so palaces created before the fix kept the unsafe parallel-insert path. Add `_pin_hnsw_threads()` helper that calls `collection.modify(configuration=UpdateCollectionConfiguration(hnsw=UpdateHNSWConfiguration(num_threads=1)))` best-effort on every `get_collection` call (including the MCP server's `_get_collection`). In chromadb 1.5.x the runtime config does not persist to disk across `PersistentClient` reopens, so the retrofit is re-applied each process start rather than being a one-shot migration. Fresh palaces keep the metadata-based pin as primary defense; legacy palaces now also get per-session protection without requiring `mempalace nuke` + re-mine.

## mvalentsev feedback — drop dead `MAX_PRECOMPACT_BLOCK_ATTEMPTS`

After the rebase on develop, `hook_precompact` delegates to `_mine_sync` and no longer emits `decision: block`, so the attempt-cap constant was orphaned. Grep confirms 0 usages in the repo — remove it.

## Tests

- `_pin_hnsw_threads` retrofits legacy collection (num_threads None -> 1)
- `_pin_hnsw_threads` swallows all errors (never raises)
- `ChromaBackend.get_collection(create=False)` applies retrofit on legacy palace
- 62 tests pass (10 backends + 6 palace locks + 46 hooks_cli)
Follow-up: merge-conflict cleanup + Copilot Review #3. Pushed commit ffa9e8e addressing the two remaining items. @mvalentsev — dead `MAX_PRECOMPACT_BLOCK_ATTEMPTS` constant removed.
@felipetruman pls rebase
Code review — found 3 issues:

- mempalace/backends/chroma.py, lines 158 to 170 in ffa9e8e
- lines 313 to 392 in ffa9e8e
- lines 756 to 801 in ffa9e8e

🤖 Generated with Claude Code
Hi, new tests assert that …

Severity: action required | Category: correctness

How to fix: skip tests for chromadb<1.5.
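A sketch of the suggested guard, assuming the retrofit tests need `collection.modify(configuration=...)` from chromadb 1.5 (version floor per the review; adjust if the actual API floor differs):

```python
import chromadb
import pytest
from packaging.version import Version

requires_modify_config = pytest.mark.skipif(
    Version(chromadb.__version__) < Version("1.5"),
    reason="UpdateCollectionConfiguration retrofit needs chromadb>=1.5",
)

@requires_modify_config
def test_pin_hnsw_threads_retrofits_legacy_collection():
    ...  # body as in the retrofit tests added by the follow-up commit
```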
@felipetruman heads up — this branch has merge conflicts against develop and needs a rebase before it can ship.
Scanned all 233 open upstream PRs today against our open PRs and fork-ahead / planned-work items. Findings merged into README:

- P2 (decay) and P3 Tier-0 (LLM rerank): both covered by MemPalace#1032 (@zackchiutw, MERGEABLE, 2026-04-19 — Weibull decay + 4-stage rerank pipeline). Older simpler version at MemPalace#337. Dropped as fork work; watching MemPalace#1032.
- P7 (alternative storage): formally out of scope. RFC 001 MemPalace#743 (@igorls) defines the plugin contract; four backend PRs already in flight (MemPalace#700, MemPalace#381 Qdrant; MemPalace#574, MemPalace#575 LanceDB). Fork consumes, does not rebuild.
- P0 (multi-label tags): still fork/upstream candidate. MemPalace#1033 (@zackchiutw) ships adjacent privacy-tag + progressive disclosure but not the full multi-label scheme.
- Merged MemPalace#1023 section acknowledges complementary MemPalace#976 (felipetruman) which adds broader mine_global_lock() + HNSW num_threads pin.

Gives future-us a map so we don't re-file MemPalace#1036-style duplicates.
A narrower slice of the mine fan-out fix (hook-triggered only) landed on develop yesterday as #1023 — …

Separately: the …
Changes (all 7 approved items):

1. Tighten line-15 status line to keep only non-redundant bits (test count, Discussion MemPalace#1017, issues link)
2. Remove "Fork Changes / Headlines" subsection — duplicated the three differentiators already in the lead paragraph
3. Remove "Planned work / Done" — 5 bullets of PR status already in the "Open upstream PRs" table
4. Trim P2 "original design notes" — MemPalace#1032 is MERGEABLE; kept the prune CLI fork-opportunity note
5. Tighten MemPalace#1023 row in "Merged upstream" — moved the MemPalace#976 mapping detail to a single compact sentence
6. Collapse "Pulled in from upstream v3.3.1" from 6 bullets to one paragraph pointing at the release notes
7. Prune three oldest "Superseded by upstream" bullets (epsilon mtime, .jsonl, max_distance — all upstream-authored or shipped by upstream before the fork claimed them as contributions)
8. Drop P7 "Original fork-mode design notes" (flashcard/AAAK/diary modes) now that the section explicitly marks P7 as dropped fork work

Net: 345 → 316 lines. No unique info removed; anything cut is either duplicated elsewhere in the same README or retrievable from the cited PR/release link.
…HNSW fixes

Bring in 29 commits from upstream/develop since the last merge (2026-04-23):

Major absorbed changes:
- MemPalace#976 (Felipe Truman): HNSW graph corruption fix, mine_global_lock for fan-out, MAX_PRECOMPACT_BLOCK_ATTEMPTS=2 for /compact deadlocks. Closes MemPalace#974/MemPalace#965/MemPalace#955; likely resolves MemPalace#1172 too.
- MemPalace#1179 (Igor): CLI mempalace search routes through _hybrid_rank, legacy-metric warning + _warn_if_legacy_metric, invariant tests on hnsw:space=cosine across all 5 collection-creation paths.
- MemPalace#1180/MemPalace#1183/MemPalace#1184: cross-wing topic tunnels, init mine UX, --auto-mine.
- MemPalace#1185 (perf/batched-upsert-gpu): batched ChromaDB upserts, GPU device detection via mempalace/embedding.py.
- MemPalace#1182: graceful Ctrl-C during mempalace mine.
- MemPalace#1168: tunnel permissions security fix.

Conflict resolutions (8 files):
- searcher.py: kept fork's CLI delegation through search_memories (cleaner single-source-of-truth path); upstream's inline-retrieval branch dropped. Merged upstream's print formatting (cosine= + bm25=) with fork's matched_via reporting from sqlite BM25 fallback.
- backends/chroma.py: kept fork's _BLOB_FIX_MARKER + palace_path arg to ChromaCollection (MemPalace#1171 write lock); merged upstream's **ef_kwargs (embedding-function support from MemPalace#1185). Removed duplicate _pin_hnsw_threads (fork already cherry-picked Felipe's earlier).
- mcp_server.py: kept fork's palace_path arg + upstream's clearer comment on hnsw:num_threads=1 rationale.
- miner.py: took upstream's serial mine() flow (mine_global_lock + Ctrl-C), RESTORED fork's strict detect_room — substring matching from upstream breaks fork-only test_detect_room_priority1_no_substring_match. Added `import re` for word-boundary regex matching. Fork-ahead concurrent mining (workers=, ThreadPoolExecutor from 5cd14bd) is dropped — daemon migration deprioritizes local concurrent mining; can re-port if needed.
- CHANGELOG.md: combined fork's segfault-trio narrative with upstream's v3.3.4 release notes.
- HOOKS_TUTORIAL.md: took MEMPALACE_PYTHON env var name (fork README was stale; hooks already use this name per fork-ahead MemPalace#19).
- test_convo_miner_unit.py: took both contextlib + pytest imports.
- test_searcher.py: imported upstream's 3 new TestSearchCLI tests but skipped them with TODOs — they assume upstream's inline-retrieval CLI with simpler mocks. Rewrite needed for fork's delegated search_memories path (sqlite BM25 fallback + scope counting).

Test result: 1334 passed, 3 skipped (the upstream tests above), 3 warnings.

Implications for open fork PRs:
- MemPalace#1171 (cross-process write lock at adapter) becomes structurally redundant given MemPalace#976's mine_global_lock + daemon-strict serialization. Slated for close with thank-you to Felipe.
- MemPalace#1173/MemPalace#1177 still relevant (quarantine threshold + blob_seq marker).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
After today's work:
- Merged upstream/develop (29 commits), absorbing MemPalace#976 (HNSW + mine_global_lock + PreCompact attempt-cap), MemPalace#1179 (CLI BM25), MemPalace#1180/MemPalace#1182/MemPalace#1183/MemPalace#1185 features.
- Filed PR MemPalace#1198 (`_tokenize` None guard) upstream.
- Closed PR MemPalace#1171 (adapter-level flock) — superseded by MemPalace#976 + daemon-strict.
- palace-daemon migration commits (c09582c + 0e97b19) make the daemon the single canonical writer; in-tree adapter flock no longer carries its weight.

README:
- Lead paragraph: 2026-04-21 → 2026-04-25 sync date, 165,632 → 151,420 drawer count, daemon-on-disks topology, listed absorbed upstream PRs.
- Fork change queue: dropped MemPalace#1171 row, added MemPalace#1198 row.
- "Multi-client coordination" section: rewrote — palace-daemon is now the fork's primary answer (was "deferred"), MemPalace#1171 retired, narrative shifted to defense-in-depth around MemPalace#976 + daemon.
- Two-layer memory model: storage path is now palace-daemon HTTP, not ~/.mempalace/palace.
- Ecosystem palace-daemon row: marks integration as complete (commits cited).
- Open upstream PRs table: refreshed to MemPalace org URLs, added MemPalace#1198 row, removed MemPalace#1171, added 2026-04-25 develop sync to merged list, added MemPalace#1171 to closed list with rationale.
- Test count 1096 → 1334.

CLAUDE.md:
- Row 20 (MemPalace#1198): "Upstream PR pending" → "Filed as MemPalace#1198 on 2026-04-24".
- Row MemPalace#1171 in PR table: open → closed (with MemPalace#976 supersession).
- Added MemPalace#1198 row to PR table.
- Top-of-section count: "14 merged, 7 open, 9 closed" → "14 merged, 7 open (including MemPalace#1198), 10 closed (added MemPalace#1171)" + sync date 2026-04-25.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…ng rebase

@mvalentsev pointed out on MemPalace#1172 that the PR MemPalace#976 description still claims a `MAX_PRECOMPACT_BLOCK_ATTEMPTS=2` cap, but during the develop rebase Felipe Truman explicitly dropped that hunk:
- 40d7958 — removed the attempt-cap tests (would always fail because hook_precompact now mines synchronously and passes through unconditionally)
- 8df944a — removed the attempt-cap constant (orphaned after rebase)

What actually merged in MemPalace#976 for hooks_cli.py: just the constant removal. Current hook_precompact() on develop is pure MemPalace#863 logic (merged 2026-04-15, @mvalentsev): ingest, sync mine, _output({}). So the right attribution: MemPalace#863 closes MemPalace#1172, not MemPalace#976.

Updating the CHANGELOG and README entries that quoted the unmerged PR description as if it were merged code. Lesson recorded as a feedback memory.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Stop-hook auto-save diary entries (topic=checkpoint, text starting "CHECKPOINT:") were dominating mempalace_search results — they're short, word-dense, and outrank substantive content under cosine similarity. The actual decisions/conversations they summarize get buried below the summary entries.

Validated against JP's 151K-drawer palace via RLM smoke test on 2026-04-25: default search returned 5 CHECKPOINT entries when asked about "palace-daemon migration architecture," producing a confident but misleading RLM answer that quoted user-prompt fragments instead of the actual decision (MemPalace#1171 closed, MemPalace#976 absorbed, daemon-strict mode added). Filtering checkpoints at retrieval flips this — now the same query surfaces technical-room drawers with file paths and code snippets.

Changes:

mempalace/searcher.py
- New `kind` parameter on search_memories / search:
  - "content" (default): excludes topic IN ("checkpoint", "auto-save")
  - "checkpoint": only checkpoints (recovery/audit)
  - "all": disables filter (pre-2026-04-25 behavior)
- build_where_filter accepts the kind argument and composes it with wing/room via $and. Uses ChromaDB's $nin/$in operators (RFC 001 validated supported).
- Defense-in-depth post-filter on text prefix: catches drawers written without topic metadata (legacy data, partial writes, metadata=None upstream-MemPalace#1020 territory) so the exclusion is bulletproof even when the where clause can't see the field.
- _CHECKPOINT_TOPICS constant: ("checkpoint", "auto-save"). The second name is a legacy synonym from older palace-daemon hook.py callers; both are filtered until canonical hook clients have fully rolled out.
- Result dicts now surface `topic` so consumers (familiar.realm.watch's deterministic pipeline, RLM tools, debugging humans) can route on it.

mempalace/mcp_server.py
- mempalace_search MCP tool accepts `kind` param with input-schema enum ["content", "checkpoint", "all"].

mempalace/layers.py
- Layer2/Layer3 callers pass kind="all" explicitly to preserve their pre-existing contract (their tests verify exact where-clause shape; revisit whether wake-up paths should default-exclude checkpoints in a separate PR).

tests
- TestCheckpointFilter (8 cases): unit tests for build_where_filter across all kind values, integration tests for search_memories text-prefix defense-in-depth, topic field surfacing, ValueError on invalid kind.
- 1345 passed, 0 failed, 0 skipped.

Companion change in palace-daemon repo: clients/hook.py wrote topic="auto-save" (one-off legacy); standardized on topic=CHECKPOINT_TOPIC ("checkpoint") so all hook paths agree.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
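A sketch of the composed where-filter described above (constant and function names from the commit message; exact signature illustrative):

```python
_CHECKPOINT_TOPICS = ("checkpoint", "auto-save")

def build_where_filter(wing=None, kind="content"):
    # Compose scope (wing) with the checkpoint filter via $and; ChromaDB
    # rejects $and with fewer than two clauses, so unwrap singletons.
    clauses = []
    if wing:
        clauses.append({"wing": wing})
    if kind == "content":
        clauses.append({"topic": {"$nin": list(_CHECKPOINT_TOPICS)}})
    elif kind == "checkpoint":
        clauses.append({"topic": {"$in": list(_CHECKPOINT_TOPICS)}})
    elif kind != "all":
        raise ValueError(f"unknown kind: {kind!r}")
    if not clauses:
        return None
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}
```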
- New scripts/check-docs.sh: passive drift detection for fork docs. Checks (1) test count in README matches `pytest --collect-only`, (2) every fork commit hash referenced in CLAUDE.md / README.md / FORK_CHANGELOG.md resolves via `git cat-file -e`, (3) every #NNNN PR reference has a state matching what the doc claims (skipped gracefully if `gh` isn't authenticated). Cross-repo URL stripping + multi-PR-on-one-line skip avoid false positives. Documented as Layer 1 in the doc-update conversation.
- Fix typo `552d0d5` → `552a0d7` in README (the actual fork commit cherry-pick of upstream MemPalace#976's HNSW num_threads pin) — caught by the new check-docs.sh on first run.

Layer 3 (canonical-source rendering for FORK_CHANGELOG / README fork-change-queue table / CLAUDE.md row inventory / promises tracker) is the next commit; this lands the lint first so the canonical renderer can use it as part of its own validation.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…#1173 status

CLAUDE.md row 28 documents the canonical-source pipeline (5a01aec) that landed earlier today. PR table updated to reflect bensig's 2026-04-26 approvals on four of our open PRs:
- MemPalace#1173 (HNSW quarantine wire): approved on the original 1-commit shape, then force-pushed two safety commits (cold-start gate + integrity sniff-test) after a production cold-start incident destroyed three healthy 253MB segments. Now mergeable=CONFLICTING against develop (which moved with MemPalace#1210, MemPalace#1205, MemPalace#976) — needs rebase + re-review.
- MemPalace#1177, MemPalace#1198, MemPalace#1201: approved, mergeable, awaiting merge.

YAML manifest gets a new entry for the doc pipeline itself; FORK_CHANGELOG regenerated to match. promises.md (in claude-config, not this repo) got a long log entry covering today's full output.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Restructure from feature-list to thesis-shape per the 2026-04-26 evening
conversation. Skeleton at scratch/readme-overhaul-skeleton.md; this is
the prose pass.
Major changes:
- New lede surfaces the empirical result (632/3 → 974/1267 token
convergence) in the first 200 words instead of burying it in a
table cell.
- New "The thesis" section names three principles the fork has
converged on: corpus-shape-eats-algorithm, verbatim-vs-derivative,
right-to-measure. Each links out to the long-form essay rather than
re-deriving.
- New "What this fork ships" section organizes work by axis (structural
retrieval / single-writer / deterministic hooks) instead of bag of
features.
- Quickstart added near top — was buried at line 426.
- Architectural principles kept (light edits) and now cross-reference
the new thesis.
- Roadmap reorganized around verbatim-vs-derivative axis. Adds P8
("Corpus partitioning by purpose") as first-class item; the
checkpoint split is its first instance, future transcript-mine
separation and KG-triple-store moves extend it.
- Active investigations rolled forward: engram-2 critique partially
closed (one concrete instance fixed empirically); LongMemEval-S E2E
deferred until chunk-size + embedding-model alignment lands.
- New "Composition with upstream" section makes explicit that the fork
cherry-picks in-flight upstream PRs, contributes back, and closes
fork-only attempts when upstream's version is at the right layer
(MemPalace#1171 → MemPalace#976).
- Fork-change inventory moves from headline to bottom — same content,
reframed as reference for verifying claims rather than the lede.
- Open upstream PRs table updated for 4 bensig approvals 2026-04-26.
- Test count, palace size, deploy commit, post-migration numbers all
refreshed against live state.
Net: 480 → 383 lines (~20% shorter), all four check-docs.sh stages
pass clean (15 fork hashes resolve, 84 PR refs match upstream state).
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Brings in upstream's corpus-origin + privacy-warning track (PRs MemPalace#1211 MemPalace#1221 MemPalace#1223 MemPalace#1224 MemPalace#1225) plus the canonical merged versions of our four PRs that landed today (21:22-21:41 UTC):

- MemPalace#1173 quarantine_stale_hnsw on make_client + cold-start gate + integrity sniff-test (PROTO/STOP byte check, no deserialization)
- MemPalace#1177 .blob_seq_ids_migrated marker guard, closes MemPalace#1090
- MemPalace#1198 _tokenize None-document guard in BM25 reranker
- MemPalace#1201 palace_graph.build_graph skips None metadata

Conflict resolution:
* mempalace/backends/chroma.py — took upstream as base (it has the igorls-review pickle-protocol docstring, thread-safety paragraph, and Path(marker).touch() style nit), then re-applied MemPalace#1094's _coerce_none_metas in query()/get() since MemPalace#1094 is still open and not yet in develop.
* mempalace/mcp_server.py — took upstream's clean form. Dropped the fork-only `palace_path=` kwarg from four ChromaCollection() call sites: the kwarg was load-bearing for MemPalace#1171's per-collection write lock, but MemPalace#1171 closed in favor of MemPalace#976's mine_global_lock + daemon-strict, so the kwarg has no remaining consumer. ChromaCollection.__init__ in upstream/develop is back to (self, collection); calling it with palace_path= raised TypeError → silently swallowed by the broad except in _get_collection() → returned None → tool_status() returned _no_palace(). 41 mcp_server tests went from failing-with-KeyError to passing.
* mempalace/cli.py — dropped fork-only `workers=args.workers` from the cmd_mine -> miner.mine() call. Pre-existing fork-side bug: the `--workers` argparse arg landed in 5cd14bd but miner.mine() never accepted a workers param, so production `mempalace mine` TypeError'd on every invocation. Removed the broken plumbing; tests/test_cli.py updated to match.
* CHANGELOG.md — took upstream verbatim. Fork-specific changelog lives in FORK_CHANGELOG.md (canonical: docs/fork-changes.yaml).
* CLAUDE.md — kept ours. Fork's CLAUDE.md is operational; upstream's added a "Design Principles / Contributing" charter, which lives in README.md on the fork.
* tests/test_backends.py — took upstream's ruff-formatted line widths.

docs/fork-changes.yaml flips the two MemPalace#1173 entries (hnsw-integrity-gate, hnsw-cold-start-gate) and the MemPalace#1201 entry (palace-graph-none-guard) from OPEN to MERGED 2026-04-26. MemPalace#1173 MemPalace#1177 MemPalace#1198 MemPalace#1201 added to the merged_upstream archive at the bottom. FORK_CHANGELOG.md regenerated.

scripts/check-docs.sh: 4/4 clean. Test suite: 1460/1460.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Reproduced this exact corruption pattern on …

Trigger pattern observed: Claude Desktop's MCP loader spawns a new …

Disk impact in our case: …

Verification of the fix: pinned to commit …

Full sanitized transport log + analysis: https://gist.github.com/Seph396/d8f724e58f066201b3cb527d0c7ffcc0

A separate observation: the concurrent-spawn behavior on the Claude Desktop client side is what makes this race triggerable in the first place. The …

Looking forward to v3.3.4. Happy to provide additional logs or repro steps if useful.
MemPalace#976 protects `mempalace mine`, but MCP/direct backend writers still call ChromaCollection.add/upsert/update/delete without the palace lock. This moves the lock boundary to the Chroma backend seam so all Chroma writes share the same palace-level serialization, with a re-entrant guard for miner paths that already hold the lock.

mine_palace_lock(palace_path) gains a per-thread re-entrant guard (threading.local + pid-tag against fork inheritance) so ChromaCollection write methods can take the lock without self-deadlocking when called from inside miner.mine()'s outer hold.

ChromaCollection.__init__ accepts an optional palace_path; when set, add/upsert/update/delete wrap their underlying chromadb call with mine_palace_lock(palace_path). palace_path=None preserves the legacy no-lock behaviour for direct callers and tests. ChromaBackend's get_collection/create_collection pass palace_path through; mcp_server._get_collection forwards _config.palace_path so all MCP write tools inherit the wrapping.

Tests: 5 new in tests/test_chroma_collection_lock.py covering opt-in, writer-blocks-during-mine, re-entrant-inside-mine, two-process serialization, and a source-level read-path-not-locked pin. Plus 1 new + 1 rewritten in tests/test_palace_locks.py for the re-entrant semantics. 52 passed in 1.01s including the existing test_backends.py regression suite.

Refs MemPalace#1161.
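A sketch of the re-entrant guard this commit describes (`_acquire_file_lock` is stubbed with a null context here; the real implementation takes the per-palace flock):

```python
import contextlib
import os
import threading

_held = threading.local()

def _acquire_file_lock(palace_path):
    # Placeholder for the real flock acquisition; stubbed out to keep the
    # re-entrancy logic isolated in this sketch.
    return contextlib.nullcontext()

@contextlib.contextmanager
def mine_palace_lock(palace_path):
    # Tag the held flag with the pid so a forked child does not inherit
    # stale "already held" state from its parent.
    tag = (os.getpid(), palace_path)
    if getattr(_held, "tag", None) == tag:
        yield  # already held by this thread in this process: no-op re-entry
        return
    with _acquire_file_lock(palace_path):
        _held.tag = tag
        try:
            yield
        finally:
            _held.tag = None
```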
Sets `hnsw:batch_size` and `hnsw:sync_threshold` to 50_000 at every collection-creation call site:

* `mempalace/backends/chroma.py` — `get_collection(create=True)` and the legacy `create_collection()` path. Preserves existing `hnsw:space`, `hnsw:num_threads=1` (race fix from MemPalace#976), and `**ef_kwargs` (embedding-function plumbing from a4868a3).
* `mempalace/mcp_server.py` — the direct `client.get_or_create_collection` path used when a palace is first opened by the MCP server. Without this third site, MCP-bootstrapped palaces would skip the guard and could still trigger the original bloat.

Without these defaults, mining ~10K+ drawers triggers ~30 HNSW index resizes and hundreds of persistDirty() calls. persistDirty uses relative seek positioning in link_lists.bin; accumulated seek drift across resize cycles causes the OS to extend the sparse file with zero-filled regions, each cycle compounding the next. Result: link_lists.bin grows into hundreds of GB sparse, after which `status`, `search`, and `repair` all segfault and the palace is unrecoverable.

Empirical: rebuilt a palace from scratch on 39,792 drawers across 5 wings with this fix applied. Final palace 376 MB, link_lists.bin stays at 0 bytes across both Chroma collection dirs, status and search both return cleanly. Same workload without the fix bloated the palace to 565 GB sparse (30 GB on disk) and segfaulted at ~15K drawers.

Migration note: chromadb 1.5.x exposes a `collection.modify(configuration={"hnsw": {...}})` retrofit path for already-created collections (`UpdateHNSWConfiguration`), but this PR doesn't pursue it — by the time link_lists.bin has bloated the index is already corrupt and the only known recovery is a fresh mine.

Tests assert both keys land on the persisted collection metadata in both `ChromaBackend` code paths, which also covers the MemPalace#1161 "config silently dropped" concern at CI time. A separate smoke test was used to verify the metadata round-trips through `chromadb.PersistentClient` reopen on chromadb 1.5.8.

Closes MemPalace#344
Supersedes MemPalace#346

Co-authored-by: robot-rocket-science <[email protected]>
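Illustrative creation-time metadata combining the pins this commit describes (collection name and path are placeholders):

```python
import chromadb

client = chromadb.PersistentClient(path="/tmp/palace-demo")
collection = client.get_or_create_collection(
    "drawers",
    metadata={
        "hnsw:space": "cosine",
        "hnsw:num_threads": 1,          # MemPalace#976 parallel-insert race fix
        "hnsw:batch_size": 50_000,      # fewer HNSW index resizes while mining
        "hnsw:sync_threshold": 50_000,  # fewer persistDirty() flush cycles
    },
)
```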
Summary
Three related problems — same failure mode (runaway HNSW writes + deadlocked session), different trigger paths. Observed in the wild this week on a Nobara install that wrote ~1.5 TB of sparse `link_lists.bin` before crashing, and on a macOS M4 Pro install that segfaulted intermittently during `add_drawer`.

1. `backends/chroma.py` — pin `hnsw:num_threads: 1` on collection creation. Disables ChromaDB's multi-threaded `ParallelFor` insert path, which has a race in `repairConnectionsForUpdate`/`addPoint` that is the root cause of both #974 ("SIGSEGV in HNSW parallel inserts — missing hnsw:num_threads on collection creation") and #965 ("Critical Data write loop, HNSW compaction bug", the sparse link_lists.bin blowup). MemPalace ingests drawers one at a time anyway, so this path was never a throughput win — only a latent correctness hazard.
2. `palace.py` + `miner.py` — add `mine_global_lock()`, a non-blocking process-wide lock wrapping `miner.mine`. The existing `mine_lock` is per source-file (sha256 of path); it does not prevent N copies of `mempalace mine <dir>` from running concurrently, which is the fan-out pattern that feeds problem (1). The global lock uses `fcntl.LOCK_EX | LOCK_NB` on Unix and `msvcrt.LK_NBLCK` on Windows — if the lock is held, raise `MineAlreadyRunning`, print a short notice, and exit cleanly. Concurrent spawns collapse to a single runner instead of queuing as waiters.
3. `hooks_cli.py` — cap `hook_precompact` block attempts per session (`MAX_PRECOMPACT_BLOCK_ATTEMPTS = 2`, tracked in `STATE_DIR/{session_id}_precompact_attempts`). Unconditional blocking creates the #955 ("Compaction Blocking") deadlock at context exhaustion: the hook demands a save the assistant has no budget to perform. After the cap the hook passes through (`_output({})`) so `/compact` can make progress. Per-session state means a fresh session gets a fresh budget.

Closes #974, #965, #955.
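A minimal sketch of the non-blocking acquire described in (2); the shipped version lives in mempalace/palace.py and may differ in detail:

```python
import contextlib
import os

class MineAlreadyRunning(RuntimeError):
    """Another `mempalace mine` already holds the global lock."""

@contextlib.contextmanager
def mine_global_lock():
    lock_dir = os.path.join(os.path.expanduser("~"), ".mempalace", "locks")
    os.makedirs(lock_dir, exist_ok=True)
    fh = open(os.path.join(lock_dir, "mine_global.lock"), "a+")
    try:
        try:
            if os.name == "nt":
                import msvcrt
                msvcrt.locking(fh.fileno(), msvcrt.LK_NBLCK, 1)
            else:
                import fcntl
                fcntl.flock(fh.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            raise MineAlreadyRunning("another mine is already running")
        yield
    finally:
        fh.close()  # closing the handle releases the lock on both platforms
```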
Test plan
- `python3 -c "import ast; ast.parse(open(f).read())"` on all four modified files
- Import smoke test: `from mempalace.palace import mine_global_lock, MineAlreadyRunning` + `from mempalace.hooks_cli import MAX_PRECOMPACT_BLOCK_ATTEMPTS`
- Manual check that a held `mine_global_lock` makes a second acquire raise `MineAlreadyRunning`
- `pytest tests/test_backends.py tests/test_cli.py` → 49 passed, 0 failed
- Not yet automated: `hook_precompact` attempt capping and a concurrency test that spawns two `mempalace mine` subprocesses against the same palace (happy to add if desired).

Notes
- `dry_run` bypasses `mine_global_lock` (no writes, no point serializing).
- `MineAlreadyRunning` inherits from `RuntimeError` so it is a caught error, not a typing surprise.
- The per-file `mine_lock` is preserved — it still guards delete+insert interleave within a single mine run.
- The fixes compose: without the `hnsw:num_threads=1` fix, even a single mine can corrupt the graph; without the global lock, the correctness fix is leaky under hook fan-out.
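A minimal sketch of the per-session attempt budget from (3), assuming the state-file naming in the description (STATE_DIR location illustrative):

```python
import os

MAX_PRECOMPACT_BLOCK_ATTEMPTS = 2
STATE_DIR = os.path.expanduser("~/.mempalace/state")  # illustrative location

def should_block(session_id: str) -> bool:
    # Block at most MAX_PRECOMPACT_BLOCK_ATTEMPTS times per session, then
    # pass through so /compact can finally make progress.
    os.makedirs(STATE_DIR, exist_ok=True)
    path = os.path.join(STATE_DIR, f"{session_id}_precompact_attempts")
    attempts = 0
    if os.path.exists(path):
        with open(path) as fh:
            attempts = int(fh.read() or 0)
    if attempts >= MAX_PRECOMPACT_BLOCK_ATTEMPTS:
        return False  # budget exhausted: caller emits _output({})
    with open(path, "w") as fh:
        fh.write(str(attempts + 1))
    return True
```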