
perf: batch ChromaDB inserts in miner - 10-30x faster mining #1085

Closed
midweste wants to merge 2 commits into MemPalace:develop from midweste:midweste

Conversation


@midweste midweste commented Apr 21, 2026

perf: batch ChromaDB inserts in miner — 10-30x faster mining

Problem

process_file() called add_drawer() once per chunk, resulting in:

  • N separate ONNX embedding passes per file (one per chunk)
  • N separate ChromaDB upsert calls per file
  • N redundant datetime.now() and os.path.getmtime() syscalls per file

On large files (1000+ chunks), this made mining slow.
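The per-chunk hot loop described above can be sketched like this (a minimal reconstruction using a stub collection and illustrative names; the real miner's code may differ):

```python
from datetime import datetime
import os

class _CountingCollection:
    """Stub standing in for a ChromaDB collection; counts upsert calls."""
    def __init__(self):
        self.upserts = 0
    def upsert(self, ids, documents, metadatas):
        self.upserts += 1  # in real ChromaDB each call also embeds the documents

def mine_per_chunk(collection, path, chunks):
    # Pre-patch shape: one upsert (and one embedding pass inside ChromaDB)
    # per chunk, plus two redundant syscalls per chunk.
    for i, chunk in enumerate(chunks):
        meta = {
            "mined_at": datetime.now().isoformat(),  # called N times per file
            "source_mtime": os.path.getmtime(path),  # called N times per file
        }
        collection.upsert(ids=[f"{path}:{i}"], documents=[chunk], metadatas=[meta])

col = _CountingCollection()
mine_per_chunk(col, ".", ["chunk a", "chunk b", "chunk c"])
```

Three chunks mean three separate upserts; on a 1000+ chunk file that overhead dominates mining time.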

Solution

1. DRY metadata construction via _build_drawer() helper

Extracted the drawer ID generation and metadata assembly into a shared _build_drawer() function used by both:

  • add_drawer() — single-insert path (unchanged contract, used by MCP tools)
  • add_drawers() — new batch-insert path (used by process_file())

2. Batched upserts with sub-batching

add_drawers() collects all chunks into batch lists and upserts in groups of CHROMA_BATCH_LIMIT (5000) to stay under ChromaDB's hard cap of 5,461.
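The sub-batching loop can be sketched like this (a simplified version with a stub collection; the real add_drawers() also builds the metadata lists itself):

```python
CHROMA_BATCH_LIMIT = 5000  # stays under ChromaDB's hard cap of 5,461

def add_drawers(collection, ids, documents, metadatas):
    """Batch-insert path (sketch): one upsert per CHROMA_BATCH_LIMIT-sized
    slice instead of one upsert per chunk."""
    for start in range(0, len(ids), CHROMA_BATCH_LIMIT):
        end = start + CHROMA_BATCH_LIMIT
        collection.upsert(
            ids=ids[start:end],
            documents=documents[start:end],
            metadatas=metadatas[start:end],
        )

class _BatchCounter:
    """Stub collection that records the size of each upsert batch."""
    def __init__(self):
        self.batch_sizes = []
    def upsert(self, ids, documents, metadatas):
        self.batch_sizes.append(len(ids))

# The 18,803-chunk benchmark file from the PR lands in 4 upserts.
col = _BatchCounter()
n = 18803
add_drawers(col, [str(i) for i in range(n)], ["x"] * n, [{}] * n)
```

18,803 chunks split into batches of 5000, 5000, 5000, and 3803: four DB writes instead of 18,803.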

3. Hoisted syscalls

datetime.now() and os.path.getmtime() are called once per file and shared across all chunks via _build_drawer(now=..., source_mtime=...).

Results

| Metric | Before | After |
| --- | --- | --- |
| ONNX passes per file | N (per chunk) | ⌈N/5000⌉ |
| DB writes per file | N | ⌈N/5000⌉ |
| Syscalls per file | 2N | 2 |
| Metadata construction | duplicated in add_drawer + process_file | single _build_drawer helper |

Tested against a 499-file project (~70k drawers) with zero errors, including an 18,803-chunk benchmark file.

Breaking Changes

None. add_drawer() retains its existing signature and behavior. add_drawers() is additive.

@igorls igorls added performance Performance improvements area/mining File and conversation mining storage labels Apr 24, 2026
jphein added a commit to jphein/mempalace that referenced this pull request Apr 26, 2026
…nserts)

README fork-change-queue gets a Performance row above the L1
importance pre-filter row. CLAUDE.md gets a row 27 noting the
conflict-resolution choices (kept fork's DRAWER_UPSERT_BATCH_SIZE=1000,
aliased upstream's CHROMA_BATCH_LIMIT to it).

The cherry-pick is already on upstream develop, so this row exists
mainly to document the merge-conflict resolution for the next
upstream→main sync. Will become a no-op when develop ships.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein added a commit to jphein/mempalace that referenced this pull request Apr 26, 2026
I wrote "already in upstream develop" without running gh pr view, and
JP caught it. MemPalace#1085 was created 2026-04-21, still open against develop.
Re-framed: fork cherry-picked an open upstream PR; the row clears when
MemPalace#1085 lands and we sync develop→main. We don't file a competing
fork-side PR — the proposal is @midweste's; we wait.

Lesson re-applied from feedback_verify_merged_not_described.md: always
read upstream state via gh, never infer from the README's PR list.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein added a commit to jphein/mempalace that referenced this pull request Apr 26, 2026
Upstream CHANGELOG.md tracks released versions of MemPalace and is
merged from upstream — mixing fork-ahead changes into it would create
upstream-sync conflicts.

This file is the supplement: date-based sections (no semver — fork
doesn't cut its own release tags), Keep-a-Changelog buckets (Added /
Changed / Fixed / Performance). When a fork-ahead row lands upstream,
the entry moves to the "Merged into upstream" section at the bottom,
kept ~2 weeks then trimmed.

Backfilled with 2026-04-25 + 2026-04-26 entries covering: the
checkpoint collection split (phases A–D, PreCompact incorporation,
the new MCP tool), the kind= filter (transitional safety net), the
HNSW cold-start gate + integrity gate, drawer_id surfacing, the
deploy script, the MemPalace#1085 cherry-pick, palace_graph None guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
midwestE added 2 commits May 5, 2026 04:10
Prevent HNSW index corruption by deferring Ctrl+C until the current
file's ChromaDB upsert completes, then flushing the background
compactor before process exit.

- Add safe_mine_session context manager (palace.py) centralizing
  SIGINT deferral + close_palace flush + signal restoration order
- Apply to both mine() and mine_convos() entry points
- Replace silent except-pass with logger.warning on close failure
- Add 18 tests covering lifecycle, signal handling, and close paths

Closes: HNSW corruption on Ctrl+C during mining
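The deferral pattern this commit describes can be sketched as a context manager (a minimal sketch: the real safe_mine_session lives in palace.py, and `flush` here stands in for the close_palace call; note signal handlers can only be installed from the main thread):

```python
import signal
from contextlib import contextmanager

@contextmanager
def safe_mine_session(flush):
    """Defer SIGINT until the current ChromaDB upsert completes, then
    flush and restore the original handler, in that order."""
    pending = []
    def _defer(signum, frame):
        pending.append(signum)  # remember Ctrl+C; act on it after the flush
    previous = signal.signal(signal.SIGINT, _defer)
    try:
        yield
    finally:
        flush()                                 # finish in-flight work, flush compactor
        signal.signal(signal.SIGINT, previous)  # restore only after flushing
        if pending:
            raise KeyboardInterrupt             # honour the deferred Ctrl+C
```

Restoring the handler after the flush (not before) is the key ordering: a Ctrl+C arriving mid-flush is still deferred rather than interrupting the write.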
- Register atexit.register(_DEFAULT_BACKEND.close) in palace.py to flush
  ChromaDB on MCP server process termination (SIGTERM, sys.exit, etc.)
- Add ChromaBackend._cache_key() to normalize palace paths via
  Path.resolve(), preventing silent cache misses when close_palace()
  receives a different path format than _client() used
- Add 8 tests covering atexit, close idempotency, and path normalization

midweste commented May 5, 2026

Closing this PR — the core optimization (batched ChromaDB upserts) was independently implemented in #1185 (perf/batched-upsert-gpu), which expanded on the same approach with GPU acceleration support and hardware-aware embedding functions.

All key improvements from this PR are now covered in develop:

- DRY metadata builder (_build_drawer_metadata)
- Batched collection.upsert() per file instead of per-chunk
- Hoisted os.path.getmtime() out of the per-chunk loop
- Sub-batching via DRAWER_UPSERT_BATCH_SIZE to stay under ChromaDB's hard cap

Thanks for landing the expanded version! 🎉

@midweste midweste closed this May 5, 2026
