
fix: HNSW graph corruption, PreCompact deadlock, mine fan-out (closes #974, #965, #955)#976

Merged
igorls merged 6 commits into MemPalace:develop from felipetruman:fix/hnsw-race-and-fanout
Apr 25, 2026
Conversation

@felipetruman
Contributor

Summary

Three related problems — same failure mode (runaway HNSW writes plus a deadlocked session), different trigger paths. Observed in the wild this week: a Nobara install wrote ~1.5 TB of sparse link_lists.bin before crashing, and a macOS M4 Pro install segfaulted intermittently during add_drawer.

  1. backends/chroma.py — pin hnsw:num_threads: 1 on collection creation. Disables ChromaDB's multi-threaded ParallelFor insert path, which has a race in repairConnectionsForUpdate / addPoint that is the root cause of both SIGSEGV in HNSW parallel inserts — missing hnsw:num_threads on collection creation #974 (SIGSEGV) and Critical Data write loop, HNSW compaction bug #965 (sparse link_lists.bin blowup). MemPalace ingests drawers one at a time anyway, so this path was never a throughput win — only a latent correctness hazard.
  2. palace.py + miner.py — add mine_global_lock(), a non-blocking process-wide lock wrapping miner.mine. The existing mine_lock is per source-file (sha256 of path); it does not prevent N copies of mempalace mine <dir> from running concurrently, which is the fan-out pattern that feeds problem (1). The global lock uses fcntl.LOCK_EX | LOCK_NB on Unix and msvcrt.LK_NBLCK on Windows — if the lock is held, raise MineAlreadyRunning, print a short notice, exit cleanly. Concurrent spawns collapse to a single runner instead of queuing as waiters.
  3. hooks_cli.py — cap hook_precompact block attempts per session (MAX_PRECOMPACT_BLOCK_ATTEMPTS = 2, tracked in STATE_DIR/{session_id}_precompact_attempts). Unconditional blocking creates the Compaction Blocking #955 deadlock at context exhaustion: the hook demands a save the assistant has no budget to perform. After the cap the hook passes through (_output({})) so /compact can make progress. Per-session state means a fresh session gets a fresh budget.
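For orientation, here is a minimal sketch of the cap in (3). It is not the verbatim diff: STATE_DIR and _output() are the hook module's existing names per the description above, while the counter helper and the block-reason string are illustrative.

import os

# STATE_DIR and _output() come from hooks_cli; this helper is hypothetical.
MAX_PRECOMPACT_BLOCK_ATTEMPTS = 2

def _bump_precompact_attempts(session_id: str) -> int:
    """Return how many blocks this session already issued, then record one more."""
    path = os.path.join(STATE_DIR, f"{session_id}_precompact_attempts")
    try:
        with open(path) as fh:
            count = int(fh.read().strip() or "0")
    except (OSError, ValueError):
        count = 0
    with open(path, "w") as fh:
        fh.write(str(count + 1))
    return count

def hook_precompact(session_id: str) -> None:
    if _bump_precompact_attempts(session_id) < MAX_PRECOMPACT_BLOCK_ATTEMPTS:
        _output({"decision": "block", "reason": "save session memories first"})
    else:
        _output({})  # cap reached: let /compact make progress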

Closes #974, #965, #955

Test plan

  • python3 -c \"import ast; ast.parse(open(f).read())\" on all four modified files
  • Import smoke test: from mempalace.palace import mine_global_lock, MineAlreadyRunning + from mempalace.hooks_cli import MAX_PRECOMPACT_BLOCK_ATTEMPTS
  • Functional test of mine_global_lock:
    • first acquire succeeds
    • second concurrent acquire raises MineAlreadyRunning
    • lock is reusable after release
  • pytest tests/test_backends.py tests/test_cli.py: 49 passed, 0 failed
  • Would welcome reviewers running the full suite + a targeted test for hook_precompact attempt capping and a concurrency test that spawns two mempalace mine subprocesses against the same palace (happy to add if desired).
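A sketch of that last test, under stated assumptions: the mempalace console script is on PATH, --palace selects the palace (both appear elsewhere in this thread), and the assertion is deliberately lenient because the second spawn can finish before the first even starts.

import subprocess

def test_concurrent_mines_collapse(tmp_path):
    palace = str(tmp_path / "palace")
    cmd = ["mempalace", "mine", str(tmp_path), "--palace", palace]
    p1 = subprocess.Popen(cmd, stderr=subprocess.PIPE, text=True)
    p2 = subprocess.Popen(cmd, stderr=subprocess.PIPE, text=True)
    errs = [p.communicate()[1] for p in (p1, p2)]
    # At most one process should print the already-running notice, and
    # both must exit cleanly (the loser exits 0 by design).
    assert sum("already running" in e for e in errs) <= 1
    assert p1.returncode == 0 and p2.returncode == 0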

Notes

  • dry_run bypasses mine_global_lock (no writes, no point serializing).
  • MineAlreadyRunning inherits from RuntimeError, so existing except RuntimeError handlers catch it; callers don't have to special-case a new exception type.
  • The existing per-file mine_lock is preserved — it still guards delete+insert interleave within a single mine run.
  • Commit is a single logical change spanning four files; splitting would obscure the fact that the three fixes are interlocked (without the hnsw:num_threads=1 fix, even a single mine can corrupt the graph; without the global lock, the correctness fix is leaky under hook fan-out).

Copilot AI review requested due to automatic review settings April 17, 2026 18:47
Contributor

Copilot AI left a comment


Pull request overview

Fixes several concurrency-triggered failure modes by reducing parallel HNSW mutation risk, preventing runaway concurrent mine fan-out, and ensuring /compact can eventually proceed even when context is exhausted.

Changes:

  • Pin Chroma HNSW inserts to single-threaded mode (hnsw:num_threads=1) when creating collections.
  • Add a non-blocking, process-wide global lock around miner.mine to collapse concurrent mempalace mine spawns.
  • Cap hook_precompact “block” decisions per session to avoid indefinite compaction deadlocks.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

Changed files:

  • mempalace/palace.py: Introduces MineAlreadyRunning and mine_global_lock() for non-blocking process-wide mine serialization.
  • mempalace/miner.py: Wraps the mining pipeline in mine_global_lock() (except dry_run) and exits cleanly if another mine is running.
  • mempalace/hooks_cli.py: Adds per-session attempt tracking to limit how many times precompact can block before passing through.
  • mempalace/backends/chroma.py: Sets hnsw:num_threads=1 in collection metadata on create paths to avoid unsafe parallel inserts.


Comment thread mempalace/hooks_cli.py Outdated
Comment thread mempalace/hooks_cli.py Outdated
Comment thread mempalace/palace.py
Comment on lines +302 to +305
lock_dir = os.path.join(os.path.expanduser("~"), ".mempalace", "locks")
os.makedirs(lock_dir, exist_ok=True)
lock_path = os.path.join(lock_dir, "mine_global.lock")


Copilot AI Apr 17, 2026


mine_global_lock() uses a single global lock file (~/.mempalace/locks/mine_global.lock), which serializes all mempalace mine runs across all palaces/projects (even if they use different --palace paths). If the intent is only to prevent concurrent writes to the same palace, consider keying the lock filename by palace_path (e.g., hash of resolved palace path) or otherwise scoping the lock so independent palaces can be mined in parallel.

Comment thread mempalace/palace.py
Comment on lines +287 to +291
@contextlib.contextmanager
def mine_global_lock():
    """Process-wide non-blocking lock around the full `mine` pipeline.

    The per-file `mine_lock` only protects delete+insert interleave for a

Copilot AI Apr 17, 2026


mine_global_lock() introduces an important cross-process synchronization mechanism, but there doesn't appear to be automated coverage for its non-blocking behavior (second acquire raises MineAlreadyRunning) and reusability after release. Since the repo already tests mine_lock inter-process behavior, adding similar tests for mine_global_lock would help prevent future regressions.

Comment on lines 128 to 131

  collection = client.get_or_create_collection(
-     collection_name, metadata={"hnsw:space": "cosine"}
+     collection_name,
+     metadata={"hnsw:space": "cosine", "hnsw:num_threads": 1},
  )

Copilot AI Apr 17, 2026


Passing metadata={... "hnsw:num_threads": 1} to get_or_create_collection() typically does not update metadata for an already-existing collection, so palaces created before this change may keep the unsafe default thread setting. To make the fix effective for existing users, consider explicitly updating the collection after retrieval (e.g., collection.modify(metadata=...) when supported, or at least detect missing/incorrect metadata and attempt to set it) for both the create and non-create code paths.

Comment on lines 150 to 153

  collection = self._client(palace_path).create_collection(
-     collection_name, metadata={"hnsw:space": hnsw_space}
+     collection_name,
+     metadata={"hnsw:space": hnsw_space, "hnsw:num_threads": 1},
  )

Copilot AI Apr 17, 2026


This metadata pin only applies when callers go through ChromaBackend. The MCP server still creates collections directly via chromadb.PersistentClient.get_or_create_collection(..., metadata={"hnsw:space": "cosine"}) (see mempalace/mcp_server.py, around _get_collection), so that code path would continue to omit hnsw:num_threads=1 and may still hit the HNSW parallel-insert crash. Consider routing MCP server collection creation through ChromaBackend or applying the same metadata update there so the fix covers the primary crash surface described in #974/#965.

@felipetruman felipetruman requested a review from igorls as a code owner April 17, 2026 19:03
@felipetruman
Contributor Author

Thanks for the thorough review — all six comments were actionable. Pushed follow-up commit 4976e71.

Concern → Resolution:

  • #6: MCP server _get_collection bypasses ChromaBackend, missing the hnsw:num_threads=1 fix on the primary crash surface. Fixed: hnsw:num_threads=1 is now passed at the mcp_server.py _get_collection create site as well.
  • #3: mine_global_lock over-serializes mines across unrelated palaces. Fixed: replaced with mine_palace_lock(palace_path), keyed by sha256(os.path.abspath(palace_path)). The same palace is still serialized; different palaces run in parallel. mine_global_lock is kept as a back-compat alias.
  • #1: except OSError around subprocess.run(timeout=60) misses subprocess.TimeoutExpired. Fixed: now catches (OSError, subprocess.TimeoutExpired) so the hook always emits a JSON decision even when mining exceeds 60s.
  • #2: no test for the per-session precompact attempt cap. Fixed: added test_precompact_first_two_attempts_block, test_precompact_passes_through_after_cap, and test_precompact_counter_is_per_session in tests/test_hooks_cli.py.
  • #4: no test for mine_palace_lock non-blocking / reuse behavior. Fixed: new tests/test_palace_locks.py covers single-acquire, reuse-after-release, cross-process serialization on the same palace, non-interference across different palaces, path normalization, and the back-compat alias.
  • #5: get_or_create_collection(metadata=...) doesn't update existing collections; suggested collection.modify(metadata=...). Documented as a known limitation; not auto-fixed. I verified against chromadb 1.5.7: modify(metadata=...) replaces metadata rather than merging, and re-passing hnsw:space="cosine" raises ValueError: Changing the distance function of a collection once it is created is not supported currently. The HNSW runtime config (configuration_json) also doesn't expose num_threads in chromadb 1.5.x, so the flag is read-only post-creation. Rather than paper over this with a modify that silently drops hnsw:space, I documented in a code comment at the mcp_server create site that pre-existing palaces need a mempalace nuke + re-mine to gain the protection. Fresh palaces are always protected. Open to adding a one-shot CLI helper (mempalace repair-hnsw-threading or similar) that detects this and prints a warning — happy to take a follow-up issue for that if you'd prefer not to leave it as silent behavior.
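The shape of the #1 fix, as a sketch (the helper name is hypothetical; only the exception tuple and the always-emit-a-JSON-decision contract come from the row above):

import json
import subprocess
import sys

def _mine_transcript_with_timeout(cmd):
    try:
        subprocess.run(cmd, timeout=60, check=False)
    except (OSError, subprocess.TimeoutExpired):
        pass  # a spawn failure or a >60s hang must not leave the hook silent
    json.dump({}, sys.stdout)  # the hook always emits a JSON decision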

Testing

  • pytest tests/test_palace_locks.py tests/test_hooks_cli.py tests/test_backends.py tests/test_cli.py: 98 passed, 0 failed.
  • Runtime validation with two concurrent mempalace mine calls:
    • Different palaces → both complete in parallel ✓
    • Same palace → one completes, the other exits with "mempalace: another `mine` is already running against <palace> — exiting cleanly." ✓

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.



Comment thread tests/test_palace_locks.py Outdated
Comment thread tests/test_palace_locks.py Outdated
Comment thread mempalace/mcp_server.py
Comment on lines +195 to 206

+ # hnsw:num_threads=1 disables ChromaDB's multi-threaded ParallelFor
+ # HNSW insert path, which has a race in repairConnectionsForUpdate /
+ # addPoint (see issues #974, #965). The setting is only honored at
+ # collection creation time — pre-existing palaces created before
+ # this fix keep the unsafe default; users must `mempalace nuke` +
+ # re-mine to get the protection on legacy palaces.
  _collection_cache = ChromaCollection(
      client.get_or_create_collection(
-         _config.collection_name, metadata={"hnsw:space": "cosine"}
+         _config.collection_name,
+         metadata={"hnsw:space": "cosine", "hnsw:num_threads": 1},
      )
  )

Copilot AI Apr 17, 2026


The comment notes that pre-existing palaces must be nuked/re-mined to pick up hnsw:num_threads=1, which means upgrading does not actually protect existing users from #974/#965 unless they manually rebuild. If ChromaDB supports updating metadata on an existing collection (e.g., collection.modify(metadata=...)), consider applying it after get_or_create_collection so older palaces are automatically migrated to single-threaded inserts (best-effort, with graceful fallback if unsupported).

Comment thread mempalace/palace.py Outdated
@mvalentsev
Contributor

Heads-up on the merge conflict. It's specifically in hooks_cli.hook_precompact, and the reason is that #863 landed on develop on 2026-04-15 with a different approach to #955.

#863 removed the unconditional block entirely. In current develop, hook_precompact runs the transcript mine synchronously and then emits _output({}), so there is no decision: block left to cap attempts against. Docstring on develop is "Precompact hook: mine transcript synchronously, then allow compaction."

After you rebase, the MAX_PRECOMPACT_BLOCK_ATTEMPTS hunk becomes unreachable. Two options:

  1. Drop that hunk: Compaction Blocking #955 is already resolved on develop.
  2. Keep it as defense-in-depth for harnesses that somehow re-introduce the block, but it would be dead code against the current hook body.

Separately on #965, there is an existing open PR #346 (yoshi280) that attributes the sparse link_lists.bin bloat to persistDirty() seek drift on default batch_size=100 / sync_threshold=1000 rather than to the ParallelFor race. It sets batch_size=50000, sync_threshold=50000 on collection creation and batches per-file upserts. Different root-cause than yours, so the two fixes are orthogonal and could land together: pin threads to cover the SIGSEGV path, raise thresholds to cover the resize-drift path. #346 is dirty against develop (bensig asked for a rebase on 04-11) but not closed.

The mine_global_lock piece looks independent. No overlap there.
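For concreteness, the two orthogonal fixes would compose at the same collection-creation call. A sketch: the 50000 values are #346's as described above, and treating hnsw:batch_size / hnsw:sync_threshold as accepted ChromaDB metadata keys is an assumption here.

collection = client.get_or_create_collection(
    collection_name,
    metadata={
        "hnsw:space": "cosine",
        "hnsw:num_threads": 1,      # this PR: close the ParallelFor race
        "hnsw:batch_size": 50000,   # #346: cover the persistDirty() resize-drift path
        "hnsw:sync_threshold": 50000,
    },
)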

@felipetruman
Contributor Author

Thanks — pushed d8a26cd addressing both actionable points from the 2nd review. Response to all four comments:

Concern → Resolution:

  • #7, #8: multiprocessing.get_context("fork") is unavailable on Windows CI. Fixed: added a _get_mp_context() helper that returns "spawn" on Windows (os.name == "nt") and "fork" elsewhere. Both test_same_palace_serializes_across_processes and test_different_palaces_dont_conflict now use it. Spawn inherits os.environ from the parent (so monkeypatch.setenv("HOME", ...) still propagates), which is all these tests need.
  • #10: palace_key not normalized → Windows case-insensitive bypass. Fixed: switched from os.path.abspath(os.path.expanduser(palace_path)) to os.path.normcase(os.path.realpath(os.path.expanduser(palace_path))). realpath canonicalises symlinks and ..; normcase folds case on Windows (no-op on POSIX). C:\Palace and c:\palace now hash to the same palace_key.
  • #9: auto-migrate existing palaces via collection.modify(...). Deferred to follow-up, with an honest note on the tradeoff. I re-verified against chromadb 1.5.7: the modify(configuration=UpdateCollectionConfiguration(hnsw=UpdateHNSWConfiguration(num_threads=1))) path does add num_threads=1 to the in-memory collection config without raising (unlike the modify(metadata=...) path, which is what my original reply was about). So a per-session retrofit at every ChromaBackend.get_collection() / MCP _get_collection() startup is technically possible. What I still observe is that the setting does not persist to disk across PersistentClient reopens in 1.5.7 — configuration_json["hnsw"].get("num_threads") returns None after the next session. So the retrofit would have to be re-applied on every process start rather than being a one-shot migration. Given that fresh palaces are already protected and existing users only lose parallelism (not correctness) on the first insert path before the retrofit runs, I'd rather land this PR as the correctness-boundary fix and open a follow-up issue for "per-session HNSW threading retrofit for legacy collections" so it gets its own review. Happy to push it into this PR instead if you'd prefer it all in one — just say the word.

Testing

pytest tests/test_palace_locks.py tests/test_hooks_cli.py tests/test_backends.py tests/test_cli.py: 98 passed, 0 failed on Linux/Python 3.13. Windows CI will exercise the spawn path on push.

@felipetruman felipetruman requested a review from Copilot April 17, 2026 19:43
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

mempalace/backends/chroma.py:134

  • get_collection(..., create=False) returns an existing Chroma collection without ensuring hnsw:num_threads is pinned to 1. Since get_or_create_collection(..., metadata=...) does not update metadata for pre-existing collections, palaces created before this change will keep the unsafe default and may still hit the HNSW race described in #974/#965. Consider a best-effort metadata update for existing collections (e.g., calling collection.modify(metadata={...}) after get_collection) or otherwise adjust the "closes #974/#965" claim/documentation to reflect that legacy palaces need manual intervention.
        client = self._client(palace_path)
        if create:
            collection = client.get_or_create_collection(
                collection_name,
                metadata={"hnsw:space": "cosine", "hnsw:num_threads": 1},
            )
        else:
            collection = client.get_collection(collection_name)
        return ChromaCollection(collection)


Comment thread mempalace/palace.py
Comment on lines +298 to +306
The lock file is keyed by sha256(palace_path) so mines against
*different* palaces can still run in parallel — we only serialize
writes into the same palace, which is the correctness boundary.

The key is derived from a fully normalized form of the path:
`realpath` resolves symlinks and `..` segments, and `normcase` folds
case on Windows (which has a case-insensitive filesystem). Without
normcase, `C:\\Palace` and `c:\\palace` would hash to different keys
on Windows and let two concurrent mines touch the same on-disk palace.

Copilot AI Apr 17, 2026


The docstring says the lock file is keyed by sha256(palace_path), but the implementation hashes a normalized, fully-resolved path (normcase(realpath(expanduser(palace_path)))). Please update the docstring to match the actual key derivation to avoid misleading future callers/debugging.

Suggested change

- The lock file is keyed by sha256(palace_path) so mines against
- *different* palaces can still run in parallel — we only serialize
- writes into the same palace, which is the correctness boundary.
-
- The key is derived from a fully normalized form of the path:
- `realpath` resolves symlinks and `..` segments, and `normcase` folds
- case on Windows (which has a case-insensitive filesystem). Without
- normcase, `C:\\Palace` and `c:\\palace` would hash to different keys
- on Windows and let two concurrent mines touch the same on-disk palace.
+ The lock file is keyed by the first 16 hex chars of
+ `sha256(normcase(realpath(expanduser(palace_path))))` so mines against
+ *different* palaces can still run in parallel — we only serialize
+ writes into the same palace, which is the correctness boundary.
+
+ The key is derived from a fully normalized form of the path:
+ `expanduser` resolves `~`, `realpath` resolves symlinks and `..`
+ segments, and `normcase` folds case on Windows (which has a
+ case-insensitive filesystem). Without normcase, `C:\\Palace` and
+ `c:\\palace` would hash to different keys on Windows and let two
+ concurrent mines touch the same on-disk palace.

Comment thread mempalace/palace.py
Comment on lines +359 to +362
# Backward-compatible alias (previous patch iteration used a single global
# lock). Kept so third-party callers that imported it continue to work; new
# code should use `mine_palace_lock(palace_path)` for per-palace scoping.
mine_global_lock = mine_palace_lock

Copilot AI Apr 17, 2026


PR description calls this a process-wide mine_global_lock(); the code implements a per-palace lock (mine_palace_lock(palace_path)) and keeps mine_global_lock only as an alias. Please align the PR description/public API docs with the shipped behavior (per-palace serialization), and consider clarifying the alias comment/name so callers don’t assume it serializes all mines regardless of palace path.

Comment on lines +26 to +34
"""Pick a start method that works on every CI runner.

`fork` is cheaper (no re-import) but is unavailable on Windows, so we fall
back to `spawn` there. `spawn` inherits ``os.environ`` (including the
monkeypatched ``HOME``) and re-imports the ``mempalace`` package in the
child, which is sufficient for the lock-file semantics exercised here.
"""
start_method = "spawn" if os.name == "nt" else "fork"
return multiprocessing.get_context(start_method)

Copilot AI Apr 17, 2026


_get_mp_context() uses fork on non-Windows platforms. Elsewhere in the test suite (e.g., tests/test_closets.py:115) multiprocessing tests explicitly use spawn for consistent, portable semantics across macOS/Linux/Windows and to avoid fork-related flakiness. Consider switching this helper to always use spawn (or matching the existing convention) so CI behavior is uniform across runners.

Suggested change

- """Pick a start method that works on every CI runner.
-
- `fork` is cheaper (no re-import) but is unavailable on Windows, so we fall
- back to `spawn` there. `spawn` inherits ``os.environ`` (including the
- monkeypatched ``HOME``) and re-imports the ``mempalace`` package in the
- child, which is sufficient for the lock-file semantics exercised here.
- """
- start_method = "spawn" if os.name == "nt" else "fork"
- return multiprocessing.get_context(start_method)
+ """Use a multiprocessing start method with consistent CI semantics.
+
+ Always use `spawn` so these tests run with the same process model on
+ macOS, Linux, and Windows. `spawn` inherits ``os.environ`` (including the
+ monkeypatched ``HOME``) and re-imports the ``mempalace`` package in the
+ child, which is sufficient for the lock-file semantics exercised here.
+ """
+ return multiprocessing.get_context("spawn")

@felipetruman felipetruman force-pushed the fix/hnsw-race-and-fanout branch from d8a26cd to 404d73f Compare April 17, 2026 20:17
@felipetruman
Contributor Author

Follow-up: addressing remaining Copilot review items

Copilot #1 — subprocess.TimeoutExpired not caught in hooks_cli.py

Resolved by the merge-conflict resolution against develop. hook_precompact now delegates to _mine_sync(), which already catches (OSError, subprocess.TimeoutExpired). The test test_precompact_with_timeout (which verifies the silent behavior on timeout) is green.

Copilot #2 — Regression test for the attempt cap

Not applicable after the rebase. PR #863 on develop removed hook_precompact's blocking behavior (it now mines synchronously and passes through). Our branch's attempt cap would be dead code after the rebase. The tests test_precompact_first_two_attempts_block and test_precompact_counter_is_per_session were removed as incompatible with develop's approach.

Copilot #4 — Tests for mine_global_lock (now mine_palace_lock)

Resolved. Added tests/test_palace_locks.py with 6 tests covering:

  • Simple acquire without error
  • Same palace serializes across processes (MineAlreadyRunning on the 2nd acquire)
  • Different palaces don't conflict (parallelism preserved)
  • Lock reusable after release
  • Path normalization (relative and absolute paths map to the same lock)
  • mine_global_lock alias for backward compatibility

Full suite: 52 tests passing.
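A sketch of the cross-process case from the list above (event plumbing, child body, and the HOME fixture details are assumptions; the test name matches the one cited earlier):

import multiprocessing

import pytest

from mempalace.palace import MineAlreadyRunning, mine_palace_lock

def _hold_lock(palace, acquired, release):
    with mine_palace_lock(palace):
        acquired.set()
        release.wait()

def test_same_palace_serializes_across_processes(tmp_path):
    ctx = multiprocessing.get_context("spawn")
    acquired, release = ctx.Event(), ctx.Event()
    child = ctx.Process(target=_hold_lock, args=(str(tmp_path), acquired, release))
    child.start()
    try:
        assert acquired.wait(timeout=10)
        with pytest.raises(MineAlreadyRunning):
            with mine_palace_lock(str(tmp_path)):
                pass
    finally:
        release.set()
        child.join()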

felipetruman added a commit to felipetruman/mempalace that referenced this pull request Apr 17, 2026
…stant

Addresses remaining PR MemPalace#976 review items after rebase on develop.

## Copilot Review MemPalace#3 (chroma.py:134) — retrofit legacy collections

`get_collection(create=False)` previously returned existing collections without
re-applying `hnsw:num_threads=1`, so palaces created before the fix kept the
unsafe parallel-insert path. Add `_pin_hnsw_threads()` helper that calls
`collection.modify(configuration=UpdateCollectionConfiguration(
hnsw=UpdateHNSWConfiguration(num_threads=1)))` best-effort on every
`get_collection` call (including the MCP server's `_get_collection`).

In chromadb 1.5.x the runtime config does not persist to disk across
`PersistentClient` reopens, so the retrofit is re-applied each process start
rather than being a one-shot migration. Fresh palaces keep the metadata-based
pin as primary defense; legacy palaces now also get per-session protection
without requiring `mempalace nuke` + re-mine.

## mvalentsev feedback — drop dead `MAX_PRECOMPACT_BLOCK_ATTEMPTS`

After the rebase on develop, `hook_precompact` delegates to `_mine_sync` and
no longer emits `decision: block`, so the attempt-cap constant was orphaned.
Grep confirms 0 usages in the repo — remove it.

## Tests

- `_pin_hnsw_threads` retrofits legacy collection (num_threads None -> 1)
- `_pin_hnsw_threads` swallows all errors (never raises)
- `ChromaBackend.get_collection(create=False)` applies retrofit on legacy palace
- 62 tests pass (10 backends + 6 palace locks + 46 hooks_cli)
@felipetruman
Contributor Author

Follow-up: merge-conflict cleanup + Copilot Review #3

Pushed commit ffa9e8e addressing the two remaining items.

@mvalentsev — dead MAX_PRECOMPACT_BLOCK_ATTEMPTS constant

You were right. After rebasing on develop (#863 landed), hook_precompact now delegates to _mine_sync and never emits decision: block, so the attempt-cap constant was orphaned. I had already removed the counter logic in 404d73f, but the bare constant MAX_PRECOMPACT_BLOCK_ATTEMPTS = 2 survived. Removed — grep confirms 0 usages.

Also noted your pointer to PR #346 (yoshi280): agreed it's orthogonal — thread-pin covers the SIGSEGV path, the batch_size/sync_threshold bump covers the resize-drift path. Happy to help coordinate if #346 rebases.

Copilot Review #3 (low-confidence) — chroma.py:134 retrofit for legacy palaces

get_collection(create=False) returns an existing Chroma collection without ensuring hnsw:num_threads is pinned to 1

Fixed with a best-effort retrofit. Added _pin_hnsw_threads(collection) that calls collection.modify(configuration=UpdateCollectionConfiguration(hnsw=UpdateHNSWConfiguration(num_threads=1))). Applied on every get_collection code path — both ChromaBackend.get_collection() (create=False and create=True) and mcp_server._get_collection().

Caveat I verified empirically against chromadb 1.5.7 (repro below):

import chromadb
from chromadb.api.collection_configuration import (
    UpdateCollectionConfiguration, UpdateHNSWConfiguration)

c = chromadb.PersistentClient(path="...")  # chromadb 1.5.7
coll = c.create_collection('legacy', metadata={'hnsw:space': 'cosine'})
# config before: {..., no num_threads}
coll.modify(configuration=UpdateCollectionConfiguration(
    hnsw=UpdateHNSWConfiguration(num_threads=1)))
# config after modify: {..., 'num_threads': 1}  ← applied
# reopen client, get_collection(...)
# config after reopen: {..., no num_threads}  ← NOT persisted

So the retrofit does not persist to disk across PersistentClient reopens in 1.5.x — it runs in memory each process start. Net effect:

  • Fresh palaces → metadata={"hnsw:num_threads": 1} at creation (persisted, primary defense).
  • Legacy palaces → per-session retrofit at get_collection (non-persisted, applies every process start).

Both paths are protected during any active session, which is when the SIGSEGV actually fires. Legacy users no longer need mempalace nuke + re-mine to get the fix. Updated the inline comment in mcp_server.py accordingly.

The retrofit is wrapped in try/except that swallows all errors (including ImportError on pre-1.5 chromadb without UpdateCollectionConfiguration), so it's strictly additive.
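Reconstructing the helper's shape from this description (a sketch, not the merged diff; the chromadb.api.collection_configuration import path is the one the version-gating discussion below also references):

def _pin_hnsw_threads(collection) -> None:
    """Best-effort retrofit: force single-threaded HNSW inserts for this session."""
    try:
        from chromadb.api.collection_configuration import (
            UpdateCollectionConfiguration,
            UpdateHNSWConfiguration,
        )

        collection.modify(
            configuration=UpdateCollectionConfiguration(
                hnsw=UpdateHNSWConfiguration(num_threads=1)
            )
        )
    except Exception:
        # Strictly additive: pre-1.5 chromadb lacks these classes (ImportError),
        # and any other failure must never break collection access.
        pass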

Testing

pytest tests/test_backends.py tests/test_palace_locks.py tests/test_hooks_cli.py: 62 passed.

New tests in tests/test_backends.py:

  • test_pin_hnsw_threads_retrofits_legacy_collection — legacy collection (no num_threads) → retrofit → num_threads == 1.
  • test_pin_hnsw_threads_swallows_all_errors — retrofit on a collection whose modify() raises RuntimeError — helper must not propagate.
  • test_get_collection_applies_retrofit_on_existing_palace — bootstrap a legacy palace with plain metadata={"hnsw:space":"cosine"}, reopen via ChromaBackend.get_collection(create=False), assert num_threads == 1 on the returned wrapper.

One pre-existing failure in tests/test_mcp_stdio_protection.py::test_module_import_redirects_stdout_to_stderr (missing overrides module in my linuxbrew python3.14 sandbox) — reproducible on develop clean, unrelated to this PR.

@bensig
Collaborator

bensig commented Apr 19, 2026

@felipetruman pls rebase

@bensig
Collaborator

bensig commented Apr 19, 2026

Code review

Found 3 issues:

  1. Duplicate of open PR fix: set hnsw:num_threads to 1 for thread-safe HNSW inserts #991. The hnsw:num_threads=1 change on backends/chroma.py is a strict subset of this PR; fix: set hnsw:num_threads to 1 for thread-safe HNSW inserts #991 is still OPEN with the same primary fix on the same file. A reviewer on fix: set hnsw:num_threads to 1 for thread-safe HNSW inserts #991 already flagged that this PR's chroma.py hunk supersedes it. Before merging either, one should be closed/redirected — shipping both invites a merge-order footgun where the narrower PR lands first, then this one's broader retrofit (_pin_hnsw_threads at mcp_server._get_collection, legacy-palace coverage) silently rebases out the duplicate hunks. For a v3.3.2 release cut, pick one path and close the other explicitly.

client = self._client(palace_path)
if create:
    collection = client.get_or_create_collection(
        collection_name,
        metadata={"hnsw:space": "cosine", "hnsw:num_threads": 1},
    )
else:
    collection = client.get_collection(collection_name)
_pin_hnsw_threads(collection)
return ChromaCollection(collection)

def get_or_create_collection(
    self, palace_path: str, collection_name: str
  2. PR title/body still claims closes #955 but the compaction fix was removed on rebase. Per felipetruman's own follow-up: fix(hooks): stop precompact hook from blocking compaction (#856, #858) #863 already landed in develop and removed hook_precompact's blocking behavior, so the attempt-cap logic (and the MAX_PRECOMPACT_BLOCK_ATTEMPTS constant) were deleted from this branch. hooks_cli.py no longer appears in the PR's changed files. The PR title and Summary still advertise a three-for-one fix including Compaction Blocking #955 — a reviewer triaging for v3.3.2 will expect compaction-deadlock coverage that isn't here. Update the title/body to reflect the actual scope (closes #974, #965 only) and note that Compaction Blocking #955 is resolved independently by fix(hooks): stop precompact hook from blocking compaction (#856, #858) #863, so the issue should be closed via fix(hooks): stop precompact hook from blocking compaction (#856, #858) #863, not this PR.

class MineAlreadyRunning(RuntimeError):
    """Raised when another `mempalace mine` already holds the per-palace lock."""


@contextlib.contextmanager
def mine_palace_lock(palace_path: str):
    """Per-palace non-blocking lock around the full `mine` pipeline.

    The per-file `mine_lock` only protects delete+insert interleave for a
    single source; it does not prevent N copies of `mempalace mine <dir>`
    from being spawned concurrently by hooks. When that happens, each copy
    drives ChromaDB HNSW inserts in parallel against the same palace,
    which (combined with chromadb's multi-threaded ParallelFor) can
    corrupt the HNSW graph and produce sparse link_lists.bin blowups.

    The lock file is keyed by sha256(palace_path) so mines against
    *different* palaces can still run in parallel — we only serialize
    writes into the same palace, which is the correctness boundary.

    The key is derived from a fully normalized form of the path:
    `realpath` resolves symlinks and `..` segments, and `normcase` folds
    case on Windows (which has a case-insensitive filesystem). Without
    normcase, `C:\\Palace` and `c:\\palace` would hash to different keys
    on Windows and let two concurrent mines touch the same on-disk palace.

    Non-blocking: if another `mine` is already writing to this palace,
    raise MineAlreadyRunning so the caller can exit cleanly instead of
    piling up as a waiting worker.
    """
    lock_dir = os.path.join(os.path.expanduser("~"), ".mempalace", "locks")
    os.makedirs(lock_dir, exist_ok=True)
    resolved = os.path.realpath(os.path.expanduser(palace_path))
    lock_key_source = os.path.normcase(resolved)
    palace_key = hashlib.sha256(lock_key_source.encode()).hexdigest()[:16]
    lock_path = os.path.join(lock_dir, f"mine_palace_{palace_key}.lock")
    lf = open(lock_path, "w")
    acquired = False
    try:
        if os.name == "nt":
            import msvcrt

            try:
                msvcrt.locking(lf.fileno(), msvcrt.LK_NBLCK, 1)
                acquired = True
            except OSError as exc:
                raise MineAlreadyRunning(
                    f"another `mempalace mine` is already running against {resolved}"
                ) from exc
        else:
            import fcntl

            try:
                fcntl.flock(lf, fcntl.LOCK_EX | fcntl.LOCK_NB)
                acquired = True
            except BlockingIOError as exc:
                raise MineAlreadyRunning(
                    f"another `mempalace mine` is already running against {resolved}"
                ) from exc
        yield
    finally:
        if acquired:
            try:
                if os.name == "nt":
                    import msvcrt

                    msvcrt.locking(lf.fileno(), msvcrt.LK_UNLCK, 1)
                else:
                    import fcntl

                    fcntl.flock(lf, fcntl.LOCK_UN)
            except Exception:
                pass
        lf.close()


# Backward-compatible alias (previous patch iteration used a single global
# lock). Kept so third-party callers that imported it continue to work; new
# code should use `mine_palace_lock(palace_path)` for per-palace scoping.
mine_global_lock = mine_palace_lock

  3. Merge state is DIRTY (needs-rebase) against develop. mergeable=CONFLICTING, mergeStateStatus=DIRTY; maintainer has explicitly requested a rebase. The PR cannot be evaluated against a stable base in its current state — the last substantive refactor in develop (fix(hooks): stop precompact hook from blocking compaction (#856, #858) #863) already invalidated one of the three advertised fixes, so a second rebase pass is likely to surface further scope trimming. Rebase before this lands or before a deeper review round.

):
    """Mine a project directory into the palace."""
    if dry_run:
        return _mine_impl(
            project_dir,
            palace_path,
            wing_override=wing_override,
            agent=agent,
            limit=limit,
            dry_run=dry_run,
            respect_gitignore=respect_gitignore,
            include_ignored=include_ignored,
        )
    try:
        with mine_palace_lock(palace_path):
            return _mine_impl(
                project_dir,
                palace_path,
                wing_override=wing_override,
                agent=agent,
                limit=limit,
                dry_run=dry_run,
                respect_gitignore=respect_gitignore,
                include_ignored=include_ignored,
            )
    except MineAlreadyRunning:
        print(
            f"mempalace: another `mine` is already running against "
            f"{palace_path} — exiting cleanly.",
            file=sys.stderr,
        )
        return


def _mine_impl(
    project_dir: str,
    palace_path: str,
    wing_override: str = None,
    agent: str = "mempalace",
    limit: int = 0,
    dry_run: bool = False,
    respect_gitignore: bool = True,
    include_ignored: list = None,
):

🤖 Generated with Claude Code


@Qodo-Free-For-OSS
Copy link
Copy Markdown

Hi. New tests assert that _pin_hnsw_threads() updates configuration_json, but with the repo's pinned chromadb==0.6.3 the function returns early (ImportError) and no retrofit occurs, causing the assertions to fail and blocking CI.

Severity: action required | Category: correctness

How to fix: Skip tests for chromadb<1.5

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

tests/test_backends.py includes new tests that require ChromaDB 1.5+ collection configuration update APIs and expect configuration_json["hnsw"]["num_threads"] to be set after calling _pin_hnsw_threads(). The repository lockfile pins chromadb==0.6.3, where _pin_hnsw_threads() returns early due to ImportError, so these tests will fail in CI.

Issue Context

_pin_hnsw_threads() is intentionally best-effort and version-gated; tests must reflect that behavior under the currently pinned dependency set.

Fix Focus Areas

  • tests/test_backends.py[150-197]
  • mempalace/backends/chroma.py[14-35]
  • uv.lock[342-345]

Suggested fix

Choose one:

  1. Add a version guard in the new tests (e.g., pytest.importorskip/skipif) so they only run when chromadb.api.collection_configuration is available.
  2. Alternatively, update the pinned ChromaDB version used in CI/dev lockfile to a compatible 1.5+ version if that’s now the intended baseline.
  3. Or mock the 1.5+ Update* classes in tests so the behavior can be validated without changing the pinned dependency.


@bensig
Collaborator

bensig commented Apr 19, 2026

@felipetruman heads up — this branch has merge conflicts against develop and needs a rebase before it can ship.

jphein added a commit to jphein/mempalace that referenced this pull request Apr 19, 2026
Scanned all 233 open upstream PRs today against our open PRs and
fork-ahead / planned-work items. Findings merged into README:

- P2 (decay) and P3 Tier-0 (LLM rerank): both covered by MemPalace#1032
  (@zackchiutw, MERGEABLE, 2026-04-19 — Weibull decay + 4-stage
  rerank pipeline). Older simpler version at MemPalace#337. Dropped as
  fork work; watching MemPalace#1032.
- P7 (alternative storage): formally out of scope. RFC 001 MemPalace#743
  (@igorls) defines the plugin contract; four backend PRs already
  in flight (MemPalace#700, MemPalace#381 Qdrant; MemPalace#574, MemPalace#575 LanceDB). Fork consumes,
  does not rebuild.
- P0 (multi-label tags): still fork/upstream candidate. MemPalace#1033
  (@zackchiutw) ships adjacent privacy-tag + progressive disclosure
  but not the full multi-label scheme.
- Merged MemPalace#1023 section acknowledges complementary MemPalace#976 (felipetruman)
  which adds broader mine_global_lock() + HNSW num_threads pin.

Gives future-us a map so we don't re-file MemPalace#1036-style duplicates.
@jphein
Collaborator

jphein commented Apr 19, 2026

A narrower slice of the mine fan-out fix (hook-triggered only) landed on develop yesterday as #1023: _mine_already_running() with a PID file guard in hooks_cli.py. Your broader mine_global_lock() still adds real value by catching direct-CLI fan-out too (hooks aren't the only way mine gets spawned twice). The hook-path hunk should conflict-resolve cleanly against #1023's version on rebase.

Separately: the hnsw:num_threads: 1 pin is the right conservative default given MemPalace ingests one drawer at a time. Worth landing on its own if the combined PR stays hard to review.

jphein added a commit to jphein/mempalace that referenced this pull request Apr 19, 2026
Changes (all 7 approved items):
1. Tighten line-15 status line to keep only non-redundant bits
   (test count, Discussion MemPalace#1017, issues link)
2. Remove "Fork Changes / Headlines" subsection — duplicated the
   three differentiators already in the lead paragraph
3. Remove "Planned work / Done" — 5 bullets of PR status already
   in the "Open upstream PRs" table
4. Trim P2 "original design notes" — MemPalace#1032 is MERGEABLE; kept
   the prune CLI fork-opportunity note
5. Tighten MemPalace#1023 row in "Merged upstream" — moved the MemPalace#976 mapping
   detail to a single compact sentence
6. Collapse "Pulled in from upstream v3.3.1" from 6 bullets to
   one paragraph pointing at the release notes
7. Prune three oldest "Superseded by upstream" bullets (epsilon
   mtime, .jsonl, max_distance — all upstream-authored or shipped
   by upstream before the fork claimed them as contributions)
8. Drop P7 "Original fork-mode design notes" (flashcard/AAAK/
   diary modes) now that the section explicitly marks P7 as
   dropped fork work

Net: 345 → 316 lines. No unique info removed; anything cut is
either duplicated elsewhere in the same README or retrievable
from the cited PR/release link.
jphein added a commit to jphein/mempalace that referenced this pull request Apr 25, 2026
…HNSW fixes

Bring in 29 commits from upstream/develop since the last merge (2026-04-23):

Major absorbed changes:
- MemPalace#976 (Felipe Truman): HNSW graph corruption fix, mine_global_lock for
  fan-out, MAX_PRECOMPACT_BLOCK_ATTEMPTS=2 for /compact deadlocks. Closes
  MemPalace#974/MemPalace#965/MemPalace#955; likely resolves MemPalace#1172 too.
- MemPalace#1179 (Igor): CLI mempalace search routes through _hybrid_rank,
  legacy-metric warning + _warn_if_legacy_metric, invariant tests on
  hnsw:space=cosine across all 5 collection-creation paths.
- MemPalace#1180/MemPalace#1183/MemPalace#1184: cross-wing topic tunnels, init mine UX, --auto-mine.
- MemPalace#1185 (perf/batched-upsert-gpu): batched ChromaDB upserts, GPU device
  detection via mempalace/embedding.py.
- MemPalace#1182: graceful Ctrl-C during mempalace mine.
- MemPalace#1168: tunnel permissions security fix.

Conflict resolutions (8 files):
- searcher.py: kept fork's CLI delegation through search_memories (cleaner
  single-source-of-truth path); upstream's inline-retrieval branch dropped.
  Merged upstream's print formatting (cosine= + bm25=) with fork's
  matched_via reporting from sqlite BM25 fallback.
- backends/chroma.py: kept fork's _BLOB_FIX_MARKER + palace_path arg to
  ChromaCollection (MemPalace#1171 write lock); merged upstream's **ef_kwargs
  (embedding_function support from MemPalace#1185). Removed duplicate
  _pin_hnsw_threads (fork already cherry-picked Felipe's earlier).
- mcp_server.py: kept fork's palace_path arg + upstream's clearer comment
  on hnsw:num_threads=1 rationale.
- miner.py: took upstream's serial mine() flow (mine_global_lock + Ctrl-C),
  RESTORED fork's strict detect_room — substring matching from upstream
  breaks fork-only test_detect_room_priority1_no_substring_match. Added
  `import re` for word-boundary regex matching. Fork-ahead concurrent
  mining (workers=, ThreadPoolExecutor from 5cd14bd) is dropped — daemon
  migration deprioritizes local concurrent mining; can re-port if needed.
- CHANGELOG.md: combined fork's segfault-trio narrative with upstream's
  v3.3.4 release notes.
- HOOKS_TUTORIAL.md: took MEMPALACE_PYTHON env var name (fork README was
  stale; hooks already use this name per fork-ahead MemPalace#19).
- test_convo_miner_unit.py: took both contextlib + pytest imports.
- test_searcher.py: imported upstream's 3 new TestSearchCLI tests but
  skipped them with TODOs — they assume upstream's inline-retrieval CLI
  with simpler mocks. Rewrite needed for fork's delegated search_memories
  path (sqlite BM25 fallback + scope counting).

Test result: 1334 passed, 3 skipped (the upstream tests above), 3 warnings.

Implications for open fork PRs:
- MemPalace#1171 (cross-process write lock at adapter) becomes structurally
  redundant given MemPalace#976's mine_global_lock + daemon-strict serialization.
  Slated for close with thank-you to Felipe.
- MemPalace#1173/MemPalace#1177 still relevant (quarantine threshold + blob_seq marker).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein added a commit to jphein/mempalace that referenced this pull request Apr 25, 2026
After today's work:

- Merged upstream/develop (29 commits), absorbing MemPalace#976 (HNSW + mine_global_lock + PreCompact attempt-cap), MemPalace#1179 (CLI BM25), MemPalace#1180/MemPalace#1182/MemPalace#1183/MemPalace#1185 features.
- Filed PR MemPalace#1198 (`_tokenize` None guard) upstream.
- Closed PR MemPalace#1171 (adapter-level flock) — superseded by MemPalace#976 + daemon-strict.
- palace-daemon migration commits (c09582c + 0e97b19) make the daemon the
  single canonical writer; in-tree adapter flock no longer carries its weight.

README:
- Lead paragraph: 2026-04-21 → 2026-04-25 sync date, 165,632 → 151,420
  drawer count, daemon-on-disks topology, listed absorbed upstream PRs.
- Fork change queue: dropped MemPalace#1171 row, added MemPalace#1198 row.
- "Multi-client coordination" section: rewrote — palace-daemon is now the
  fork's primary answer (was "deferred"), MemPalace#1171 retired, narrative shifted
  to defense-in-depth around MemPalace#976 + daemon.
- Two-layer memory model: storage path is now palace-daemon HTTP, not
  ~/.mempalace/palace.
- Ecosystem palace-daemon row: marks integration as complete (commits
  cited).
- Open upstream PRs table: refreshed to MemPalace org URLs, added MemPalace#1198
  row, removed MemPalace#1171, added 2026-04-25 develop sync to merged list, added
  MemPalace#1171 to closed list with rationale.
- Test count 1096 → 1334.

CLAUDE.md:
- Row 20 (MemPalace#1198): "Upstream PR pending" → "Filed as MemPalace#1198 on 2026-04-24".
- Row MemPalace#1171 in PR table: open → closed (with MemPalace#976 supersession).
- Added MemPalace#1198 row to PR table.
- Top-of-section count: "14 merged, 7 open, 9 closed" → "14 merged, 7 open
  (including MemPalace#1198), 10 closed (added MemPalace#1171)" + sync date 2026-04-25.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein added a commit to jphein/mempalace that referenced this pull request Apr 25, 2026
…ng rebase

@mvalentsev pointed out on MemPalace#1172 that the PR MemPalace#976 description still
claims a `MAX_PRECOMPACT_BLOCK_ATTEMPTS=2` cap, but during the develop
rebase Felipe Truman explicitly dropped that hunk:

- 40d7958 — removed the attempt-cap tests (would always fail because
  hook_precompact now mines synchronously and passes through
  unconditionally)
- 8df944a — removed the attempt-cap constant (orphaned after rebase)

What actually merged in MemPalace#976 for hooks_cli.py: just the constant
removal. Current hook_precompact() on develop is pure MemPalace#863 logic
(merged 2026-04-15, @mvalentsev): ingest, sync mine, _output({}).

So the right attribution: MemPalace#863 closes MemPalace#1172, not MemPalace#976. Updating the
CHANGELOG and README entries that quoted the unmerged PR description
as if it were merged code. Lesson recorded as a feedback memory.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein added a commit to jphein/mempalace that referenced this pull request Apr 25, 2026
Stop-hook auto-save diary entries (topic=checkpoint, text starting
"CHECKPOINT:") were dominating mempalace_search results — they're
short, word-dense, and outrank substantive content under cosine
similarity. The actual decisions/conversations they summarize get
buried below the summary entries.

Validated against JP's 151K-drawer palace via RLM smoke test on
2026-04-25: default search returned 5 CHECKPOINT entries when asked
about "palace-daemon migration architecture," producing a confident
but misleading RLM answer that quoted user-prompt fragments instead
of the actual decision (MemPalace#1171 closed, MemPalace#976 absorbed, daemon-strict
mode added). Filtering checkpoints at retrieval flips this — now
the same query surfaces technical-room drawers with file paths and
code snippets.

Changes:

mempalace/searcher.py
- New `kind` parameter on search_memories / search:
  - "content" (default): excludes topic IN ("checkpoint", "auto-save")
  - "checkpoint": only checkpoints (recovery/audit)
  - "all": disables filter (pre-2026-04-25 behavior)
- build_where_filter accepts the kind argument and composes it with
  wing/room via $and. Uses ChromaDB's $nin/$in operators (RFC 001
  validated supported).
- Defense-in-depth post-filter on text prefix: catches drawers
  written without topic metadata (legacy data, partial writes,
  metadata=None upstream-MemPalace#1020 territory) so the exclusion is
  bulletproof even when the where clause can't see the field.
- _CHECKPOINT_TOPICS constant: ("checkpoint", "auto-save"). The
  second name is a legacy synonym from older palace-daemon hook.py
  callers; both are filtered until canonical hook clients have
  fully rolled out.
- Result dicts now surface `topic` so consumers (familiar.realm.watch's
  deterministic pipeline, RLM tools, debugging humans) can route
  on it.

mempalace/mcp_server.py
- mempalace_search MCP tool accepts `kind` param with input-schema
  enum ["content", "checkpoint", "all"].

mempalace/layers.py
- Layer2/Layer3 callers pass kind="all" explicitly to preserve their
  pre-existing contract (their tests verify exact where-clause
  shape; revisit whether wake-up paths should default-exclude
  checkpoints in a separate PR).

tests
- TestCheckpointFilter (8 cases): unit tests for build_where_filter
  across all kind values, integration tests for search_memories
  text-prefix defense-in-depth, topic field surfacing, ValueError
  on invalid kind.
- 1345 passed, 0 failed, 0 skipped.

Companion change in palace-daemon repo: clients/hook.py wrote
topic="auto-save" (one-off legacy); standardized on
topic=CHECKPOINT_TOPIC ("checkpoint") so all hook paths agree.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
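The kind-filter composition the commit describes, as a hedged sketch (the signature and clause shapes are inferred from the message above, not copied from the committed code):

_CHECKPOINT_TOPICS = ("checkpoint", "auto-save")  # second name is the legacy synonym

def build_where_filter(wing=None, room=None, kind="content"):
    clauses = []
    if wing:
        clauses.append({"wing": wing})
    if room:
        clauses.append({"room": room})
    if kind == "content":
        clauses.append({"topic": {"$nin": list(_CHECKPOINT_TOPICS)}})
    elif kind == "checkpoint":
        clauses.append({"topic": {"$in": list(_CHECKPOINT_TOPICS)}})
    elif kind != "all":
        raise ValueError(f"invalid kind: {kind!r}")
    if not clauses:
        return None
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}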
jphein added a commit to jphein/mempalace that referenced this pull request Apr 26, 2026
- New scripts/check-docs.sh: passive drift detection for fork docs.
  Checks (1) test count in README matches `pytest --collect-only`,
  (2) every fork commit hash referenced in CLAUDE.md / README.md /
  FORK_CHANGELOG.md resolves via `git cat-file -e`, (3) every #NNNN
  PR reference has a state matching what the doc claims (skipped
  gracefully if `gh` isn't authenticated). Cross-repo URL stripping
  + multi-PR-on-one-line skip avoid false positives. Documented as
  Layer 1 in the doc-update conversation.
- Fix typo `552d0d5` → `552a0d7` in README (the actual fork commit
  cherry-pick of upstream MemPalace#976's HNSW num_threads pin) — caught by
  the new check-docs.sh on first run.

Layer 3 (canonical-source rendering for FORK_CHANGELOG / README
fork-change-queue table / CLAUDE.md row inventory / promises tracker)
is the next commit; this lands the lint first so the canonical
renderer can use it as part of its own validation.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein added a commit to jphein/mempalace that referenced this pull request Apr 26, 2026
…#1173 status

CLAUDE.md row 28 documents the canonical-source pipeline (5a01aec)
that landed earlier today. PR table updated to reflect bensig's
2026-04-26 approvals on four of our open PRs:

  - MemPalace#1173 (HNSW quarantine wire): approved on the original 1-commit
    shape, then force-pushed two safety commits (cold-start gate +
    integrity sniff-test) after a production cold-start incident
    destroyed three healthy 253MB segments. Now mergeable=CONFLICTING
    against develop (which moved with MemPalace#1210, MemPalace#1205, MemPalace#976) — needs
    rebase + re-review.
  - MemPalace#1177, MemPalace#1198, MemPalace#1201: approved, mergeable, awaiting merge.

YAML manifest gets a new entry for the doc pipeline itself; FORK_CHANGELOG
regenerated to match. promises.md (in claude-config, not this repo)
got a long log entry covering today's full output.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein added a commit to jphein/mempalace that referenced this pull request Apr 26, 2026
Restructure from feature-list to thesis-shape per the 2026-04-26 evening
conversation. Skeleton at scratch/readme-overhaul-skeleton.md; this is
the prose pass.

Major changes:

- New lede surfaces the empirical result (632/3 → 974/1267 token
  convergence) in the first 200 words instead of burying it in a
  table cell.
- New "The thesis" section names three principles the fork has
  converged on: corpus-shape-eats-algorithm, verbatim-vs-derivative,
  right-to-measure. Each links out to the long-form essay rather than
  re-deriving.
- New "What this fork ships" section organizes work by axis (structural
  retrieval / single-writer / deterministic hooks) instead of bag of
  features.
- Quickstart added near top — was buried at line 426.
- Architectural principles kept (light edits) and now cross-reference
  the new thesis.
- Roadmap reorganized around verbatim-vs-derivative axis. Adds P8
  ("Corpus partitioning by purpose") as first-class item; the
  checkpoint split is its first instance, future transcript-mine
  separation and KG-triple-store moves extend it.
- Active investigations rolled forward: engram-2 critique partially
  closed (one concrete instance fixed empirically); LongMemEval-S E2E
  deferred until chunk-size + embedding-model alignment lands.
- New "Composition with upstream" section makes explicit that the fork
  cherry-picks in-flight upstream PRs, contributes back, and closes
  fork-only attempts when upstream's version is at the right layer
  (MemPalace#1171 → MemPalace#976).
- Fork-change inventory moves from headline to bottom — same content,
  reframed as reference for verifying claims rather than the lede.
- Open upstream PRs table updated for 4 bensig approvals 2026-04-26.
- Test count, palace size, deploy commit, post-migration numbers all
  refreshed against live state.

Net: 480 → 383 lines (~20% shorter), all four check-docs.sh stages
pass clean (15 fork hashes resolve, 84 PR refs match upstream state).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein added a commit to jphein/mempalace that referenced this pull request Apr 26, 2026
Brings in upstream's corpus-origin + privacy-warning track (PRs MemPalace#1211
MemPalace#1221 MemPalace#1223 MemPalace#1224 MemPalace#1225) plus the canonical merged versions of our four
PRs that landed today (21:22-21:41 UTC):

  MemPalace#1173  quarantine_stale_hnsw on make_client + cold-start gate +
         integrity sniff-test (PROTO/STOP byte check, no deserialization)
  MemPalace#1177  .blob_seq_ids_migrated marker guard, closes MemPalace#1090
  MemPalace#1198  _tokenize None-document guard in BM25 reranker
  MemPalace#1201  palace_graph.build_graph skips None metadata

Conflict resolution:

* mempalace/backends/chroma.py — took upstream as base (it has the
  igorls-review pickle-protocol docstring, thread-safety paragraph, and
  Path(marker).touch() style nit), then re-applied MemPalace#1094's _coerce_none_metas
  in query()/get() since MemPalace#1094 is still open and not yet in develop.

* mempalace/mcp_server.py — took upstream's clean form. Dropped the
  fork-only `palace_path=` kwarg from four ChromaCollection() call sites:
  the kwarg was load-bearing for MemPalace#1171's per-collection write lock, but
  MemPalace#1171 closed in favor of MemPalace#976's mine_global_lock + daemon-strict, so
  the kwarg has no remaining consumer. ChromaCollection.__init__ in
  upstream/develop is back to (self, collection); calling it with
  palace_path= raised TypeError → silently swallowed by the broad except
  in _get_collection() → returned None → tool_status() returned _no_palace().
  41 mcp_server tests went from failing-with-KeyError to passing.

* mempalace/cli.py — dropped fork-only `workers=args.workers` from the
  cmd_mine -> miner.mine() call. Pre-existing fork-side bug: the
  `--workers` argparse arg landed in 5cd14bd but miner.mine() never
  accepted a workers param, so production `mempalace mine` TypeError'd
  on every invocation. Removed the broken plumbing; tests/test_cli.py
  updated to match.

* CHANGELOG.md — took upstream verbatim. Fork-specific changelog lives
  in FORK_CHANGELOG.md (canonical: docs/fork-changes.yaml).

* CLAUDE.md — kept ours. Fork's CLAUDE.md is operational; upstream's
  added a "Design Principles / Contributing" charter, which lives in
  README.md on the fork.

* tests/test_backends.py — took upstream's ruff-formatted line widths.
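
The mcp_server.py failure chain above reduces to this self-contained
reconstruction (class and function bodies are illustrative, not the actual
source):

```python
class ChromaCollection:
    def __init__(self, collection):  # upstream signature: no palace_path
        self.collection = collection

def _get_collection(collection, palace_path):
    try:
        # Fork-only call shape: palace_path= no longer exists upstream,
        # so this raises TypeError...
        return ChromaCollection(collection, palace_path=palace_path)
    except Exception:
        # ...which the broad except turns into None; downstream,
        # tool_status() reads None as "no palace open".
        return None

assert _get_collection(object(), "/tmp/palace") is None
```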

docs/fork-changes.yaml flips the two MemPalace#1173 entries (hnsw-integrity-gate,
hnsw-cold-start-gate) and the MemPalace#1201 entry (palace-graph-none-guard) from
OPEN to MERGED 2026-04-26. MemPalace#1173 MemPalace#1177 MemPalace#1198 MemPalace#1201 added to the
merged_upstream archive at the bottom. FORK_CHANGELOG.md regenerated.

scripts/check-docs.sh: 4/4 clean.
Test suite: 1460/1460.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@Seph396 commented Apr 27, 2026

Reproduced this exact corruption pattern on mempalace 3.3.2 running under Claude Desktop, and confirmed the fix in this PR resolves it.

Trigger pattern observed: Claude Desktop's MCP loader spawns a new mempalace.mcp_server process before the previous one finishes its shutdown handshake. Two concurrent server processes operating against the same Chroma collection caused the multi-threaded HNSW writer race this PR fixes (hnsw:num_threads=1).

Disk impact in our case: ~/.mempalace/palace/<drawers-uuid>/link_lists.bin bloated to 55 GB (1900× expected size for our ~50K vectors), which triggered a kernel-level watchdog timeout on a 1 TB Mac when free disk dropped below ~6 GB. Recovery required surgical removal of the corrupted collection only — chroma.sqlite3 and the second collection's HNSW segment were intact, so no source data was lost.

Verification of the fix: Pinned to commit 0d9929c0dd31f5f8368d55300c2e71d8eb0f4e46 via uv tool install --reinstall 'mempalace @ git+https://github.com/MemPalace/mempalace@0d9929c0dd31f5f8368d55300c2e71d8eb0f4e46'. Confirmed the changes at mcp_server.py:229 (collection config sets hnsw:num_threads=1) and backends/chroma.py:134 (auto-retrofit for existing collections). 24+ hours of mixed concurrent use across Claude Desktop, Claude Code, and Cowork since the pin — no recurrence.

Full sanitized transport log + analysis: https://gist.github.com/Seph396/d8f724e58f066201b3cb527d0c7ffcc0

A separate observation: the concurrent-spawn behavior on the Claude Desktop client side is what makes this race triggerable in the first place. The num_threads=1 fix in this PR closes the race window structurally (no concurrency in the writer), but a single-instance lock or PID file inside mempalace.mcp_server would also defend against any future client that exhibits the same spawn pattern. Filing that as a separate issue: #1229
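
For concreteness, a minimal Unix-only sketch of such a guard, using an fcntl-locked PID file (file name and placement are hypothetical; Windows would need the msvcrt equivalent):

```python
import fcntl
import os
import sys

def acquire_single_instance(pid_file="~/.mempalace/mcp_server.pid"):
    """Exit quietly if another mcp_server instance holds the lock."""
    fh = open(os.path.expanduser(pid_file), "w")
    try:
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        print("mempalace.mcp_server already running; exiting.", file=sys.stderr)
        sys.exit(0)
    fh.write(str(os.getpid()))
    fh.flush()
    return fh  # keep the handle (and thus the lock) alive for process lifetime
```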

Looking forward to v3.3.4. Happy to provide additional logs or repro steps if useful.

imtylervo added a commit to imtylervo/mempalace that referenced this pull request Apr 27, 2026
MemPalace#976 protects `mempalace mine`, but MCP/direct backend writers still call
ChromaCollection.add/upsert/update/delete without the palace lock. This
moves the lock boundary to the Chroma backend seam so all Chroma writes
share the same palace-level serialization, with a re-entrant guard for
miner paths that already hold the lock.

mine_palace_lock(palace_path) gains a per-thread re-entrant guard
(threading.local + pid-tag against fork inheritance) so
ChromaCollection write methods can take the lock without
self-deadlocking when called from inside miner.mine()'s outer hold.
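
A sketch of that re-entrant shape (lock-file name and layout are
illustrative; the actual diff may differ):

```python
import contextlib
import fcntl
import os
import threading

_held = threading.local()  # per-thread hold state

@contextlib.contextmanager
def mine_palace_lock(palace_path):
    # Tag the hold with (path, pid): a forked child inherits the parent's
    # threading.local value but has a new pid, so it re-acquires properly.
    tag = (str(palace_path), os.getpid())
    if getattr(_held, "tag", None) == tag:
        yield  # re-entrant: an outer frame in this thread already holds it
        return
    with open(os.path.join(palace_path, ".palace.lock"), "w") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)
        _held.tag = tag
        try:
            yield
        finally:
            _held.tag = None
            fcntl.flock(fh, fcntl.LOCK_UN)
```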

ChromaCollection.__init__ accepts an optional palace_path; when set,
add/upsert/update/delete wrap their underlying chromadb call with
mine_palace_lock(palace_path). palace_path=None preserves the legacy
no-lock behaviour for direct callers and tests. ChromaBackend's
get_collection/create_collection pass palace_path through;
mcp_server._get_collection forwards _config.palace_path so all MCP
write tools inherit the wrapping.
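
With that guard, the opt-in wrapping reduces to roughly (hypothetical
shape; real signatures may differ):

```python
class ChromaCollection:
    def __init__(self, collection, palace_path=None):
        self._collection = collection
        self._palace_path = palace_path  # None preserves legacy no-lock behaviour

    def add(self, *args, **kwargs):
        if self._palace_path is None:
            return self._collection.add(*args, **kwargs)
        with mine_palace_lock(self._palace_path):  # re-entrant inside miner.mine()
            return self._collection.add(*args, **kwargs)

    # upsert/update/delete wrap their underlying calls identically
```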

Tests: 5 new in tests/test_chroma_collection_lock.py covering opt-in,
writer-blocks-during-mine, re-entrant-inside-mine, two-process
serialization, and a source-level read-path-not-locked pin. Plus 1 new
+ 1 rewritten in tests/test_palace_locks.py for the re-entrant
semantics. 52 passed in 1.01s including the existing test_backends.py
regression suite.

Refs MemPalace#1161.
igorls pushed a commit to funguf/mempalace that referenced this pull request Apr 27, 2026
Sets `hnsw:batch_size` and `hnsw:sync_threshold` to 50_000 at every
collection-creation call site (see the sketch after this list):

* `mempalace/backends/chroma.py` — `get_collection(create=True)` and the
  legacy `create_collection()` path. Preserves existing `hnsw:space`,
  `hnsw:num_threads=1` (race fix from MemPalace#976), and `**ef_kwargs`
  (embedding-function plumbing from a4868a3).
* `mempalace/mcp_server.py` — the direct `client.get_or_create_collection`
  path used when a palace is first opened by the MCP server. Without this
  third site, MCP-bootstrapped palaces would skip the guard and could
  still trigger the original bloat.
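
Illustratively, the creation call then carries (collection name, path, and
the hnsw:space value are placeholders; real call sites preserve whatever
already exists):

```python
import chromadb

client = chromadb.PersistentClient(path="/path/to/palace")  # placeholder path
collection = client.get_or_create_collection(
    name="drawers",  # placeholder name
    metadata={
        "hnsw:space": "cosine",         # existing value preserved, not changed here
        "hnsw:num_threads": 1,          # race fix carried over from MemPalace#976
        "hnsw:batch_size": 50_000,      # keep inserts in the pre-index buffer longer
        "hnsw:sync_threshold": 50_000,  # defer persistDirty until 50K vectors
    },
)
```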

Without these defaults, mining ~10K+ drawers triggers ~30 HNSW index
resizes and hundreds of persistDirty() calls. persistDirty uses relative
seek positioning in link_lists.bin; accumulated seek drift across resize
cycles causes the OS to extend the sparse file with zero-filled regions,
each cycle compounding the next. Result: link_lists.bin grows into
hundreds of GB sparse, after which `status`, `search`, and `repair` all
segfault and the palace is unrecoverable.

Empirical: rebuilt a palace from scratch on 39,792 drawers across 5
wings with this fix applied. Final palace 376 MB, link_lists.bin stays
at 0 bytes across both Chroma collection dirs, status and search both
return cleanly. Same workload without the fix bloated the palace to
565 GB sparse (30 GB on disk) and segfaulted at ~15K drawers.

Migration note: chromadb 1.5.x exposes a
`collection.modify(configuration={"hnsw": {...}})` retrofit path for
already-created collections (`UpdateHNSWConfiguration`), but this PR
doesn't pursue it — by the time link_lists.bin has bloated the index
is already corrupt and the only known recovery is a fresh mine.

Tests assert both keys land on the persisted collection metadata in
both `ChromaBackend` code paths, which also covers the MemPalace#1161 "config
silently dropped" concern at CI time. A separate smoke test was used
to verify the metadata round-trips through `chromadb.PersistentClient`
reopen on chromadb 1.5.8.

Closes MemPalace#344
Supersedes MemPalace#346

Co-authored-by: robot-rocket-science <[email protected]>
lealvona pushed a commit to lealvona/mempalace that referenced this pull request Apr 29, 2026
…stant

Addresses remaining PR MemPalace#976 review items after rebase on develop.

`get_collection(create=False)` previously returned existing collections without
re-applying `hnsw:num_threads=1`, so palaces created before the fix kept the
unsafe parallel-insert path. Add `_pin_hnsw_threads()` helper that calls
`collection.modify(configuration=UpdateCollectionConfiguration(
hnsw=UpdateHNSWConfiguration(num_threads=1)))` best-effort on every
`get_collection` call (including the MCP server's `_get_collection`).
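
A sketch of that helper (the import path for the configuration classes is
an assumption and varies across chromadb 1.5.x releases):

```python
from chromadb.api.collection_configuration import (
    UpdateCollectionConfiguration,
    UpdateHNSWConfiguration,
)

def _pin_hnsw_threads(collection) -> None:
    """Best-effort retrofit: force single-threaded HNSW writes on an
    already-open collection. Never raises; a failed retrofit must not
    block collection access."""
    try:
        collection.modify(
            configuration=UpdateCollectionConfiguration(
                hnsw=UpdateHNSWConfiguration(num_threads=1)
            )
        )
    except Exception:
        pass
```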

In chromadb 1.5.x the runtime config does not persist to disk across
`PersistentClient` reopens, so the retrofit is re-applied each process start
rather than being a one-shot migration. Fresh palaces keep the metadata-based
pin as primary defense; legacy palaces now also get per-session protection
without requiring `mempalace nuke` + re-mine.

After the rebase on develop, `hook_precompact` delegates to `_mine_sync` and
no longer emits `decision: block`, so the attempt-cap constant was orphaned.
Grep confirms 0 usages in the repo — remove it.

- `_pin_hnsw_threads` retrofits legacy collection (num_threads None -> 1)
- `_pin_hnsw_threads` swallows all errors (never raises)
- `ChromaBackend.get_collection(create=False)` applies retrofit on legacy palace
- 62 tests pass (10 backends + 6 palace locks + 46 hooks_cli)

Labels

area/hooks: Claude Code hook scripts (Stop, PreCompact, SessionStart)
area/mining: File and conversation mining
bug: Something isn't working
storage

8 participants