perf: batch ChromaDB inserts in miner - 10-30x faster mining#1085
Closed
midweste wants to merge 2 commits intoMemPalace:developfrom
Closed
perf: batch ChromaDB inserts in miner - 10-30x faster mining#1085midweste wants to merge 2 commits intoMemPalace:developfrom
midweste wants to merge 2 commits intoMemPalace:developfrom
Conversation
jphein
added a commit
to jphein/mempalace
that referenced
this pull request
Apr 26, 2026
…nserts) README fork-change-queue gets a Performance row above the L1 importance pre-filter row. CLAUDE.md gets a row 27 noting the conflict-resolution choices (kept fork's DRAWER_UPSERT_BATCH_SIZE=1000, aliased upstream's CHROMA_BATCH_LIMIT to it). The cherry-pick is already on upstream develop, so this row exists mainly to document the merge-conflict resolution for the next upstream→main sync. Will become a no-op when develop ships. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein
added a commit
to jphein/mempalace
that referenced
this pull request
Apr 26, 2026
I wrote "already in upstream develop" without running gh pr view, and JP caught it. MemPalace#1085 was created 2026-04-21, still open against develop. Re-framed: fork cherry-picked an open upstream PR; the row clears when MemPalace#1085 lands and we sync develop→main. We don't file a competing fork-side PR — the proposal is @midweste's; we wait. Lesson re-applied from feedback_verify_merged_not_described.md: always read upstream state via gh, never infer from the README's PR list. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein
added a commit
to jphein/mempalace
that referenced
this pull request
Apr 26, 2026
Upstream CHANGELOG.md tracks released versions of MemPalace and is merged from upstream — mixing fork-ahead changes into it would create upstream-sync conflicts. This file is the supplement: date-based sections (no semver — fork doesn't cut its own release tags), Keep-a-Changelog buckets (Added / Changed / Fixed / Performance). When a fork-ahead row lands upstream, the entry moves to the "Merged into upstream" section at the bottom, kept ~2 weeks then trimmed. Backfilled with 2026-04-25 + 2026-04-26 entries covering: the checkpoint collection split (phases A–D, PreCompact incorporation, the new MCP tool), the kind= filter (transitional safety net), the HNSW cold-start gate + integrity gate, drawer_id surfacing, the deploy script, the MemPalace#1085 cherry-pick, palace_graph None guard. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
added 2 commits
May 5, 2026 04:10
Prevent HNSW index corruption by deferring Ctrl+C until the current file's ChromaDB upsert completes, then flushing the background compactor before process exit. - Add safe_mine_session context manager (palace.py) centralizing SIGINT deferral + close_palace flush + signal restoration order - Apply to both mine() and mine_convos() entry points - Replace silent except-pass with logger.warning on close failure - Add 18 tests covering lifecycle, signal handling, and close paths Closes: HNSW corruption on Ctrl+C during mining
- Register atexit.register(_DEFAULT_BACKEND.close) in palace.py to flush ChromaDB on MCP server process termination (SIGTERM, sys.exit, etc.) - Add ChromaBackend._cache_key() to normalize palace paths via Path.resolve(), preventing silent cache misses when close_palace() receives a different path format than _client() used - Add 8 tests covering atexit, close idempotency, and path normalization
Author
|
Closing this PR — the core optimization (batched ChromaDB upserts) was independently implemented in #1185 (perf/batched-upsert-gpu), which expanded on the same approach with GPU acceleration support and hardware-aware embedding functions. All key improvements from this PR are now covered in develop: DRY metadata builder (_build_drawer_metadata) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
perf: batch ChromaDB inserts in miner — 10-30x faster mining
Problem
process_file()calledadd_drawer()once per chunk, resulting in:datetime.now()andos.path.getmtime()syscalls per fileOn large files (1000+ chunks), this made mining slow.
Solution
1. DRY metadata construction via
_build_drawer()helperExtracted the drawer ID generation and metadata assembly into a shared
_build_drawer()function used by both:add_drawer()— single-insert path (unchanged contract, used by MCP tools)add_drawers()— new batch-insert path (used byprocess_file())2. Batched upserts with sub-batching
add_drawers()collects all chunks into batch lists and upserts in groups ofCHROMA_BATCH_LIMIT(5000) to stay under ChromaDB's hard cap of 5,461.3. Hoisted syscalls
datetime.now()andos.path.getmtime()are called once per file and shared across all chunks via_build_drawer(now=..., source_mtime=...).Results
add_drawer+process_file_build_drawerhelperTested against a 499-file project (~70k drawers) with zero errors, including an 18,803-chunk benchmark file.
Breaking Changes
None.
add_drawer()retains its existing signature and behavior.add_drawers()is additive.