hnswlib updatePoint race on modified-file re-mine: EXC_BAD_ACCESS in repairConnectionsForUpdate (macOS ARM64, Python 3.13, chromadb 0.6.3) #521

@StefanKremen

Description

Environment

  • mempalace 3.1.0 (pipx venv, installed directly from the Claude Code plugin marketplace source, not PyPI)
  • chromadb 0.6.3
  • Python 3.13 (Homebrew)
  • macOS 15.x ARM64 (Apple M1, 8-core)

Symptom

A background python3 -m mempalace mine $MEMPAL_DIR subprocess spawned by mempalace/hooks_cli.py::_maybe_auto_ingest crashes with EXC_BAD_ACCESS (SIGSEGV) inside hnswlib.cpython-313-darwin.so. The Stop hook itself exits cleanly — the mining subprocess dies detached, so Claude Code / VSCode sees a successful hook. The crash surfaces as a macOS crash report, not as a mempalace error message.

Crash fingerprint (unambiguous)

Exception Type:  EXC_BAD_ACCESS (SIGSEGV)

Thread <N> Crashed:
0  hnswlib.cpython-313-darwin.so  searchBaseLayer<...>
1  hnswlib.cpython-313-darwin.so  repairConnectionsForUpdate<...>
2  hnswlib.cpython-313-darwin.so  updatePoint<...>
3  hnswlib.cpython-313-darwin.so  addPoint(void const*, unsigned long, int)
4  hnswlib.cpython-313-darwin.so  ParallelFor<...addItems...>

The unambiguous marker is repairConnectionsForUpdate in the crashed thread's stack — that function is only called on the HNSW update path (addItems on an already-existing label).
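Since the crash never surfaces through mempalace itself, checking for it means grepping crash reports. A minimal stdlib sketch of that check (the helper names are mine, not part of mempalace) that scans report text for the update-path marker:

```python
from pathlib import Path

# Marker unique to the HNSW update path (see the stack above): this function
# is only reached via addPoint on an already-existing label.
UPDATE_PATH_MARKER = "repairConnectionsForUpdate"

def is_update_path_crash(report_text: str) -> bool:
    """True if a macOS crash report matches this issue's fingerprint:
    EXC_BAD_ACCESS with repairConnectionsForUpdate on the crashed stack."""
    return "EXC_BAD_ACCESS" in report_text and UPDATE_PATH_MARKER in report_text

def scan_diagnostic_reports(reports_dir: Path) -> list[Path]:
    """Scan *.ips / *.crash files under reports_dir (normally
    ~/Library/Logs/DiagnosticReports/) for the fingerprint."""
    hits = []
    for path in sorted(reports_dir.glob("*")):
        if path.suffix not in {".ips", ".crash"}:
            continue
        try:
            text = path.read_text(errors="replace")
        except OSError:
            continue
        if is_update_path_crash(text):
            hits.append(path)
    return hits
```

Matching on `repairConnectionsForUpdate` alone is deliberate: per the note above, it is the one frame that cannot appear on the pure-insert path.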

Root cause

Three independently-reasonable design choices combine into a race:

  1. Deterministic drawer_id (mempalace/miner.py:377):

    drawer_id = f"drawer_{wing}_{room}_{hashlib.sha256((source_file + str(chunk_index)).encode()).hexdigest()[:24]}"

    Same file path + chunk index ⇒ same ID forever.

  2. Upsert, not add (mempalace/miner.py:392):

    collection.upsert(...)

    Upsert on an existing ID pushes ChromaDB through hnswlib's addItems → addPoint existing_internal_id branch → updatePoint → repairConnectionsForUpdate, which has a long-standing thread-safety bug in nmslib/hnswlib: searchBaseLayer dereferences a neighbor pointer that another worker thread is concurrently mutating.

  3. File-level mtime skip (mempalace/miner.py:420 + mempalace/palace.py:51-71):

    file_already_mined(collection, source_file, check_mtime=True)

    This is a file-level check, not per-chunk. A single mtime change on one file ⇒ every chunk in that file is re-upserted with its unchanged deterministic drawer_id, producing a batch of update-path operations that ChromaDB runs concurrently through hnswlib's std::thread-based ParallelFor (≈ hardware_concurrency() workers, ~8 on M1).
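Ingredients 1 and 3 combine mechanically: because the ID is a pure function of path and chunk index, a re-mine of a modified file regenerates the exact same IDs. A self-contained sketch mirroring the deterministic scheme from miner.py:377 (the function name and wing/room values here are illustrative):

```python
import hashlib

def drawer_id(wing: str, room: str, source_file: str, chunk_index: int) -> str:
    # Same file path + chunk index always hashes to the same drawer_id,
    # so a re-mine upserts onto labels that already exist in the index.
    digest = hashlib.sha256((source_file + str(chunk_index)).encode()).hexdigest()[:24]
    return f"drawer_{wing}_{room}_{digest}"

# One mtime bump on one file re-mines *all* of its chunks; every drawer_id
# is unchanged, so every upsert in the batch lands on an existing label and
# takes hnswlib's updatePoint path -- concurrently, via ParallelFor.
first_mine  = [drawer_id("w", "r", "src/app.py", i) for i in range(4)]
second_mine = [drawer_id("w", "r", "src/app.py", i) for i in range(4)]
assert first_mine == second_mine
```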

Masking history

The race existed in earlier 3.x code but was masked because the mtime check was broken and file_already_mined always returned True. Commit c2308a1 "fix: address code review — restore mtime check, bound metadata reads, harden security" in the 2026-04-09 critical-bugfixes merge (PR #399, shipped as 3.1.0) re-enabled re-mining and unmasked the crash.

Reproducer

  1. Install mempalace 3.1.0 on macOS ARM64 with Python 3.13 and chromadb 0.6.3
  2. Mine any project: mempalace mine <dir>
  3. Modify a file inside the mined tree (e.g., touch <file> or echo "x" >> <file>)
  4. Re-run mempalace mine <dir> — either manually or by letting the Claude Code Stop hook fire (if MEMPAL_DIR is exported)
  5. Check ~/Library/Logs/DiagnosticReports/ for a crashed Python process with the fingerprint above
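Step 3 only needs the file's mtime to change; the content is irrelevant to file_already_mined's check. A stdlib sketch of that trigger (the helper is hypothetical, for illustration):

```python
import os
import tempfile

def bump_mtime(path: str) -> tuple[float, float]:
    """Simulate `touch <file>` one second in the future; returns (old, new).
    This alone is enough to make a check_mtime=True file-level check treat
    the file as stale and re-upsert every one of its chunks."""
    before = os.path.getmtime(path)
    os.utime(path, (before + 1.0, before + 1.0))
    return before, os.path.getmtime(path)

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("print('hello')\n")
    demo_path = f.name

old, new = bump_mtime(demo_path)
assert new > old  # file now looks stale; content never changed
os.remove(demo_path)
```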

On-disk safety

The crash does not corrupt the palace index. ChromaDB has not flushed hnswlib state at the time of the segfault; ~/.mempalace/palace/<uuid>/data_level0.bin and siblings stay at the last successful mine's mtime. No restore needed. Confirmed on this machine: palace dir timestamps stayed at the pre-crash mine time after the crash.

Non-solutions

  • OMP_NUM_THREADS=1 does not help. hnswlib uses its own std::thread pool via ParallelFor, not OpenMP. The crash stack shows std::__1::thread::join with zero OpenMP frames.
  • Wiping ~/.mempalace/palace and re-mining from scratch only buys time — the race fires on the next modified file.
  • Downgrading to 3.0.x doesn't help; 3.0.x has the same code paths and the masking mtime-check bug wasn't a safety net.

Proposed fix

One-hunk patch in mempalace/miner.py::process_file, right before the for chunk in chunks loop:

# Purge stale drawers for this file before re-inserting the fresh chunks.
# Converts modified-file re-mines from upsert-over-existing-IDs (which hits
# hnswlib's thread-unsafe updatePoint path and can segfault on macOS ARM with
# chromadb 0.6.3) into a clean delete+insert, bypassing the update path
# entirely.
try:
    collection.delete(where={"source_file": source_file})
except Exception:
    pass

This converts every re-mine into a pure INSERT path that bypasses updatePoint / repairConnectionsForUpdate entirely. PR to follow.
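Why delete-then-upsert sidesteps the race can be shown with a stub in-memory collection (FakeCollection is a hypothetical stand-in, not chromadb's API; "update op" here means an upsert that lands on an already-existing ID, i.e. the operations that would take hnswlib's updatePoint path):

```python
class FakeCollection:
    """Stub modeling only the ID-collision behavior relevant to this bug."""

    def __init__(self):
        self.docs = {}        # id -> (metadata, document)
        self.update_ops = 0   # upserts that hit an existing ID

    def upsert(self, ids, metadatas, documents):
        for i, meta, doc in zip(ids, metadatas, documents):
            if i in self.docs:
                self.update_ops += 1  # would take updatePoint in hnswlib
            self.docs[i] = (meta, doc)

    def delete(self, where):
        key, value = next(iter(where.items()))
        self.docs = {i: (m, d) for i, (m, d) in self.docs.items()
                     if m.get(key) != value}

def mine(col, source_file, chunks, purge_first):
    if purge_first:
        col.delete(where={"source_file": source_file})
    ids = [f"drawer_{i}" for i in range(len(chunks))]
    col.upsert(ids, [{"source_file": source_file}] * len(chunks), chunks)

# Without the purge, a re-mine of a modified file is all update ops:
col = FakeCollection()
mine(col, "a.py", ["c0", "c1"], purge_first=False)   # initial mine: inserts
mine(col, "a.py", ["c0", "c1x"], purge_first=False)  # re-mine: 2 update ops
assert col.update_ops == 2

# With the purge, every re-mine is pure inserts -- no update path at all:
col2 = FakeCollection()
mine(col2, "a.py", ["c0", "c1"], purge_first=True)
mine(col2, "a.py", ["c0", "c1x"], purge_first=True)
assert col2.update_ops == 0
```

The trade-off is a brief window during re-mining where the file's chunks are absent from the index, which seems acceptable for a background mining job.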

Related but distinct

Metadata

Labels: area/mining (File and conversation mining) · bug (Something isn't working)