hnswlib updatePoint race on modified-file re-mine: EXC_BAD_ACCESS in repairConnectionsForUpdate (macOS ARM64, Python 3.13, chromadb 0.6.3) #521

@StefanKremen

Description

Environment

  • mempalace 3.1.0 (pipx venv, installed directly from the Claude Code plugin marketplace source, not PyPI)
  • chromadb 0.6.3
  • Python 3.13 (Homebrew)
  • macOS 15.x ARM64 (Apple M1, 8-core)

Symptom

A background python3 -m mempalace mine $MEMPAL_DIR subprocess spawned by mempalace/hooks_cli.py::_maybe_auto_ingest crashes with EXC_BAD_ACCESS (SIGSEGV) inside hnswlib.cpython-313-darwin.so. The Stop hook itself exits cleanly — the mining subprocess dies detached, so Claude Code / VSCode sees a successful hook. The crash surfaces as a macOS crash report, not as a mempalace error message.

Crash fingerprint (unambiguous)

Exception Type:  EXC_BAD_ACCESS (SIGSEGV)

Thread <N> Crashed:
0  hnswlib.cpython-313-darwin.so  searchBaseLayer<...>
1  hnswlib.cpython-313-darwin.so  repairConnectionsForUpdate<...>
2  hnswlib.cpython-313-darwin.so  updatePoint<...>
3  hnswlib.cpython-313-darwin.so  addPoint(void const*, unsigned long, int)
4  hnswlib.cpython-313-darwin.so  ParallelFor<...addItems...>

The unambiguous marker is repairConnectionsForUpdate in the crashed thread's stack — that function is only called on the HNSW update path (addItems on an already-existing label).
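Since the crash never surfaces through mempalace itself, checking for it means grepping crash reports. A minimal stdlib sketch of that check (the helper names are mine, not part of mempalace) that scans report text for the update-path marker:

```python
from pathlib import Path

# Marker unique to the HNSW update path (see the stack above): this function
# is only reached via addPoint on an already-existing label.
UPDATE_PATH_MARKER = "repairConnectionsForUpdate"

def is_update_path_crash(report_text: str) -> bool:
    """True if a macOS crash report matches this issue's fingerprint:
    EXC_BAD_ACCESS with repairConnectionsForUpdate on the crashed stack."""
    return "EXC_BAD_ACCESS" in report_text and UPDATE_PATH_MARKER in report_text

def scan_diagnostic_reports(reports_dir: Path) -> list[Path]:
    """Scan *.ips / *.crash files under reports_dir (normally
    ~/Library/Logs/DiagnosticReports/) for the fingerprint."""
    hits = []
    for path in sorted(reports_dir.glob("*")):
        if path.suffix not in {".ips", ".crash"}:
            continue
        try:
            text = path.read_text(errors="replace")
        except OSError:
            continue
        if is_update_path_crash(text):
            hits.append(path)
    return hits
```

Matching on `repairConnectionsForUpdate` alone is deliberate: per the note above, it is the one frame that cannot appear on the pure-insert path.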

Root cause

Three independently-reasonable design choices combine into a race:

  1. Deterministic drawer_id (mempalace/miner.py:377):

    drawer_id = f"drawer_{wing}_{room}_{hashlib.sha256((source_file + str(chunk_index)).encode()).hexdigest()[:24]}"

    Same file path + chunk index ⇒ same ID forever.

  2. Upsert, not add (mempalace/miner.py:392):

    collection.upsert(...)

    Upsert on an existing ID pushes ChromaDB through hnswlib's addItems → addPoint existing_internal_id branch → updatePoint → repairConnectionsForUpdate, which has a long-standing thread-safety bug in nmslib/hnswlib: searchBaseLayer dereferences a neighbor pointer that another worker thread is concurrently mutating.

  3. File-level mtime skip (mempalace/miner.py:420 + mempalace/palace.py:51-71):

    file_already_mined(collection, source_file, check_mtime=True)

    This is a file-level check, not per-chunk. A single mtime change on one file ⇒ every chunk in that file is re-upserted with its unchanged deterministic drawer_id, producing a batch of update-path operations that ChromaDB runs concurrently through hnswlib's std::thread-based ParallelFor (≈ hardware_concurrency() workers, ~8 on M1).
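Ingredients 1 and 3 combine mechanically: because the ID is a pure function of path and chunk index, a re-mine of a modified file regenerates the exact same IDs. A self-contained sketch mirroring the deterministic scheme from miner.py:377 (the function name and wing/room values here are illustrative):

```python
import hashlib

def drawer_id(wing: str, room: str, source_file: str, chunk_index: int) -> str:
    # Same file path + chunk index always hashes to the same drawer_id,
    # so a re-mine upserts onto labels that already exist in the index.
    digest = hashlib.sha256((source_file + str(chunk_index)).encode()).hexdigest()[:24]
    return f"drawer_{wing}_{room}_{digest}"

# One mtime bump on one file re-mines *all* of its chunks; every drawer_id
# is unchanged, so every upsert in the batch lands on an existing label and
# takes hnswlib's updatePoint path -- concurrently, via ParallelFor.
first_mine  = [drawer_id("w", "r", "src/app.py", i) for i in range(4)]
second_mine = [drawer_id("w", "r", "src/app.py", i) for i in range(4)]
assert first_mine == second_mine
```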

Masking history

The race existed in earlier 3.x code but was masked because the mtime check was broken and file_already_mined always returned True. Commit c2308a1 "fix: address code review — restore mtime check, bound metadata reads, harden security" in the 2026-04-09 critical-bugfixes merge (PR #399, shipped as 3.1.0) re-enabled re-mining and unmasked the crash.

Reproducer

  1. Install mempalace 3.1.0 on macOS ARM64 with Python 3.13 and chromadb 0.6.3
  2. Mine any project: mempalace mine <dir>
  3. Modify a file inside the mined tree (e.g., touch <file> or echo "x" >> <file>)
  4. Re-run mempalace mine <dir> — either manually or by letting the Claude Code Stop hook fire (if MEMPAL_DIR is exported)
  5. Check ~/Library/Logs/DiagnosticReports/ for a crashed Python process with the fingerprint above
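Step 3 only needs the file's mtime to change; the content is irrelevant to file_already_mined's check. A stdlib sketch of that trigger (the helper is hypothetical, for illustration):

```python
import os
import tempfile

def bump_mtime(path: str) -> tuple[float, float]:
    """Simulate `touch <file>` one second in the future; returns (old, new).
    This alone is enough to make a check_mtime=True file-level check treat
    the file as stale and re-upsert every one of its chunks."""
    before = os.path.getmtime(path)
    os.utime(path, (before + 1.0, before + 1.0))
    return before, os.path.getmtime(path)

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("print('hello')\n")
    demo_path = f.name

old, new = bump_mtime(demo_path)
assert new > old  # file now looks stale; content never changed
os.remove(demo_path)
```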

On-disk safety

The crash does not corrupt the palace index. ChromaDB has not flushed hnswlib state at the time of the segfault; ~/.mempalace/palace/<uuid>/data_level0.bin and siblings stay at the last successful mine's mtime. No restore needed. Confirmed on this machine: palace dir timestamps stayed at the pre-crash mine time after the crash.

Non-solutions

  • OMP_NUM_THREADS=1 does not help. hnswlib uses its own std::thread pool via ParallelFor, not OpenMP. The crash stack shows std::__1::thread::join with zero OpenMP frames.
  • Wiping ~/.mempalace/palace and re-mining from scratch only buys time — the race fires on the next modified file.
  • Downgrading to 3.0.x doesn't help; 3.0.x has the same code paths and the masking mtime-check bug wasn't a safety net.

Proposed fix

One-hunk patch in mempalace/miner.py::process_file, right before the for chunk in chunks loop:

# Purge stale drawers for this file before re-inserting the fresh chunks.
# Converts modified-file re-mines from upsert-over-existing-IDs (which hits
# hnswlib's thread-unsafe updatePoint path and can segfault on macOS ARM with
# chromadb 0.6.3) into a clean delete+insert, bypassing the update path
# entirely.
try:
    collection.delete(where={"source_file": source_file})
except Exception:
    pass

This converts every re-mine into a pure INSERT path that bypasses updatePoint / repairConnectionsForUpdate entirely. PR to follow.
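Why delete-then-upsert sidesteps the race can be shown with a stub in-memory collection (FakeCollection is a hypothetical stand-in, not chromadb's API; "update op" here means an upsert that lands on an already-existing ID, i.e. the operations that would take hnswlib's updatePoint path):

```python
class FakeCollection:
    """Stub modeling only the ID-collision behavior relevant to this bug."""

    def __init__(self):
        self.docs = {}        # id -> (metadata, document)
        self.update_ops = 0   # upserts that hit an existing ID

    def upsert(self, ids, metadatas, documents):
        for i, meta, doc in zip(ids, metadatas, documents):
            if i in self.docs:
                self.update_ops += 1  # would take updatePoint in hnswlib
            self.docs[i] = (meta, doc)

    def delete(self, where):
        key, value = next(iter(where.items()))
        self.docs = {i: (m, d) for i, (m, d) in self.docs.items()
                     if m.get(key) != value}

def mine(col, source_file, chunks, purge_first):
    if purge_first:
        col.delete(where={"source_file": source_file})
    ids = [f"drawer_{i}" for i in range(len(chunks))]
    col.upsert(ids, [{"source_file": source_file}] * len(chunks), chunks)

# Without the purge, a re-mine of a modified file is all update ops:
col = FakeCollection()
mine(col, "a.py", ["c0", "c1"], purge_first=False)   # initial mine: inserts
mine(col, "a.py", ["c0", "c1x"], purge_first=False)  # re-mine: 2 update ops
assert col.update_ops == 2

# With the purge, every re-mine is pure inserts -- no update path at all:
col2 = FakeCollection()
mine(col2, "a.py", ["c0", "c1"], purge_first=True)
mine(col2, "a.py", ["c0", "c1x"], purge_first=True)
assert col2.update_ops == 0
```

The trade-off is a brief window during re-mining where the file's chunks are absent from the index, which seems acceptable for a background mining job.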

Related but distinct

Metadata

Labels: area/mining (File and conversation mining) · bug (Something isn't working)