
perf: batch ChromaDB inserts in miner - 10-30x faster mining #1085

Closed
midweste wants to merge 2 commits into MemPalace:develop from midweste:midweste

Conversation


@midweste midweste commented Apr 21, 2026

perf: batch ChromaDB inserts in miner — 10-30x faster mining

Problem

process_file() called add_drawer() once per chunk, resulting in:

  • N separate ONNX embedding passes per file (one per chunk)
  • N separate ChromaDB upsert calls per file
  • N redundant datetime.now() and os.path.getmtime() syscalls per file

On large files (1000+ chunks), this made mining slow.
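The per-chunk hot loop described above can be sketched like this (a minimal reconstruction using a stub collection and illustrative names; the real miner's code may differ):

```python
from datetime import datetime
import os

class _CountingCollection:
    """Stub standing in for a ChromaDB collection; counts upsert calls."""
    def __init__(self):
        self.upserts = 0
    def upsert(self, ids, documents, metadatas):
        self.upserts += 1  # in real ChromaDB each call also embeds the documents

def mine_per_chunk(collection, path, chunks):
    # Pre-patch shape: one upsert (and one embedding pass inside ChromaDB)
    # per chunk, plus two redundant syscalls per chunk.
    for i, chunk in enumerate(chunks):
        meta = {
            "mined_at": datetime.now().isoformat(),  # called N times per file
            "source_mtime": os.path.getmtime(path),  # called N times per file
        }
        collection.upsert(ids=[f"{path}:{i}"], documents=[chunk], metadatas=[meta])

col = _CountingCollection()
mine_per_chunk(col, ".", ["chunk a", "chunk b", "chunk c"])
```

Three chunks mean three separate upserts; on a 1000+ chunk file that overhead dominates mining time.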

Solution

1. DRY metadata construction via _build_drawer() helper

Extracted the drawer ID generation and metadata assembly into a shared _build_drawer() function used by both:

  • add_drawer() — single-insert path (unchanged contract, used by MCP tools)
  • add_drawers() — new batch-insert path (used by process_file())

2. Batched upserts with sub-batching

add_drawers() collects all chunks into batch lists and upserts in groups of CHROMA_BATCH_LIMIT (5000) to stay under ChromaDB's hard cap of 5,461.
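The sub-batching loop can be sketched like this (a simplified version with a stub collection; the real add_drawers() also builds the metadata lists itself):

```python
CHROMA_BATCH_LIMIT = 5000  # stays under ChromaDB's hard cap of 5,461

def add_drawers(collection, ids, documents, metadatas):
    """Batch-insert path (sketch): one upsert per CHROMA_BATCH_LIMIT-sized
    slice instead of one upsert per chunk."""
    for start in range(0, len(ids), CHROMA_BATCH_LIMIT):
        end = start + CHROMA_BATCH_LIMIT
        collection.upsert(
            ids=ids[start:end],
            documents=documents[start:end],
            metadatas=metadatas[start:end],
        )

class _BatchCounter:
    """Stub collection that records the size of each upsert batch."""
    def __init__(self):
        self.batch_sizes = []
    def upsert(self, ids, documents, metadatas):
        self.batch_sizes.append(len(ids))

# The 18,803-chunk benchmark file from the PR lands in 4 upserts.
col = _BatchCounter()
n = 18803
add_drawers(col, [str(i) for i in range(n)], ["x"] * n, [{}] * n)
```

18,803 chunks split into batches of 5000, 5000, 5000, and 3803: four DB writes instead of 18,803.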

3. Hoisted syscalls

datetime.now() and os.path.getmtime() are called once per file and shared across all chunks via _build_drawer(now=..., source_mtime=...).

Results

| Metric | Before | After |
| --- | --- | --- |
| ONNX passes per file | N (per chunk) | ⌈N/5000⌉ |
| DB writes per file | N | ⌈N/5000⌉ |
| Syscalls per file | 2N | 2 |
| Metadata construction | duplicated in add_drawer + process_file | single _build_drawer helper |

Tested against a 499-file project (~70k drawers) with zero errors, including an 18,803-chunk benchmark file.

Breaking Changes

None. add_drawer() retains its existing signature and behavior. add_drawers() is additive.

@igorls igorls added performance Performance improvements area/mining File and conversation mining storage labels Apr 24, 2026
jphein added a commit to jphein/mempalace that referenced this pull request Apr 26, 2026
…nserts)

README fork-change-queue gets a Performance row above the L1
importance pre-filter row. CLAUDE.md gets a row 27 noting the
conflict-resolution choices (kept fork's DRAWER_UPSERT_BATCH_SIZE=1000,
aliased upstream's CHROMA_BATCH_LIMIT to it).

The cherry-pick is already on upstream develop, so this row exists
mainly to document the merge-conflict resolution for the next
upstream→main sync. Will become a no-op when develop ships.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein added a commit to jphein/mempalace that referenced this pull request Apr 26, 2026
I wrote "already in upstream develop" without running gh pr view, and
JP caught it. MemPalace#1085 was created 2026-04-21, still open against develop.
Re-framed: fork cherry-picked an open upstream PR; the row clears when
MemPalace#1085 lands and we sync develop→main. We don't file a competing
fork-side PR — the proposal is @midweste's; we wait.

Lesson re-applied from feedback_verify_merged_not_described.md: always
read upstream state via gh, never infer from the README's PR list.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein added a commit to jphein/mempalace that referenced this pull request Apr 26, 2026
Upstream CHANGELOG.md tracks released versions of MemPalace and is
merged from upstream — mixing fork-ahead changes into it would create
upstream-sync conflicts.

This file is the supplement: date-based sections (no semver — fork
doesn't cut its own release tags), Keep-a-Changelog buckets (Added /
Changed / Fixed / Performance). When a fork-ahead row lands upstream,
the entry moves to the "Merged into upstream" section at the bottom,
kept ~2 weeks then trimmed.

Backfilled with 2026-04-25 + 2026-04-26 entries covering: the
checkpoint collection split (phases A–D, PreCompact incorporation,
the new MCP tool), the kind= filter (transitional safety net), the
HNSW cold-start gate + integrity gate, drawer_id surfacing, the
deploy script, the MemPalace#1085 cherry-pick, palace_graph None guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
midwestE added 2 commits May 5, 2026 04:10
Prevent HNSW index corruption by deferring Ctrl+C until the current
file's ChromaDB upsert completes, then flushing the background
compactor before process exit.

- Add safe_mine_session context manager (palace.py) centralizing
  SIGINT deferral + close_palace flush + signal restoration order
- Apply to both mine() and mine_convos() entry points
- Replace silent except-pass with logger.warning on close failure
- Add 18 tests covering lifecycle, signal handling, and close paths

Closes: HNSW corruption on Ctrl+C during mining
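The deferral pattern this commit describes can be sketched as a context manager (a minimal sketch: the real safe_mine_session lives in palace.py, and `flush` here stands in for the close_palace call; note signal handlers can only be installed from the main thread):

```python
import signal
from contextlib import contextmanager

@contextmanager
def safe_mine_session(flush):
    """Defer SIGINT until the current ChromaDB upsert completes, then
    flush and restore the original handler, in that order."""
    pending = []
    def _defer(signum, frame):
        pending.append(signum)  # remember Ctrl+C; act on it after the flush
    previous = signal.signal(signal.SIGINT, _defer)
    try:
        yield
    finally:
        flush()                                 # finish in-flight work, flush compactor
        signal.signal(signal.SIGINT, previous)  # restore only after flushing
        if pending:
            raise KeyboardInterrupt             # honour the deferred Ctrl+C
```

Restoring the handler after the flush (not before) is the key ordering: a Ctrl+C arriving mid-flush is still deferred rather than interrupting the write.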
- Register atexit.register(_DEFAULT_BACKEND.close) in palace.py to flush
  ChromaDB on MCP server process termination (SIGTERM, sys.exit, etc.)
- Add ChromaBackend._cache_key() to normalize palace paths via
  Path.resolve(), preventing silent cache misses when close_palace()
  receives a different path format than _client() used
- Add 8 tests covering atexit, close idempotency, and path normalization

midweste commented May 5, 2026

Closing this PR — the core optimization (batched ChromaDB upserts) was independently implemented in #1185 (perf/batched-upsert-gpu), which expanded on the same approach with GPU acceleration support and hardware-aware embedding functions.

All key improvements from this PR are now covered in develop:

- DRY metadata builder (_build_drawer_metadata)
- Batched collection.upsert() per file instead of per-chunk
- Hoisted os.path.getmtime() out of the per-chunk loop
- Sub-batching via DRAWER_UPSERT_BATCH_SIZE to stay under ChromaDB's hard cap

Thanks for landing the expanded version! 🎉

@midweste midweste closed this May 5, 2026
