perf: batch writes, concurrent mining, bulk mtime pre-fetch #629
jphein wants to merge 3 commits into MemPalace:develop from
Conversation
Pull request overview
This PR focuses on improving mining performance and correctness in MemPalace by reducing ChromaDB call volume, enabling concurrent file preparation, and fixing mtime/collection defaults that previously caused unnecessary re-mining and negative similarity behavior.
Changes:
- Added bulk prefetch of mined mtimes + epsilon mtime comparison, and defaulted new collections to cosine distance.
- Implemented concurrent mining (ThreadPoolExecutor) and batched upserts for higher throughput.
- Extended miner configurability and filtering (chunk sizing via config; skip patterns/junk size), and expanded entity detector stopwords.
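The epsilon mtime comparison mentioned above can be sketched as follows. This is a minimal illustration, not the PR's actual code; the function name, the `0.01` threshold, and the `stored_mtime is None` convention for never-mined files are assumptions drawn from the PR description:

```python
# Hypothetical sketch of the epsilon-based mtime check described in this PR.
# Exact float equality (==) breaks when mtimes round-trip through metadata
# storage; an epsilon comparison tolerates small representation drift.
_MTIME_EPSILON = 0.01  # threshold stated in the PR description

def needs_remining(path_mtime: float, stored_mtime: "float | None") -> bool:
    """Return True if the file should be (re-)mined."""
    if stored_mtime is None:
        return True  # no record of this file: mine it
    return abs(path_mtime - stored_mtime) >= _MTIME_EPSILON
```

With this check, a file whose stored mtime differs only by sub-centisecond noise is skipped instead of being re-mined on every run.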
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/test_miner.py | Adds unit tests for detect_room() priority/word-boundary behavior and chunk_text() chunking invariants. |
| mempalace/palace.py | Defaults newly created collections to cosine space; fixes mtime dedup via epsilon; adds bulk_check_mined() for paginated metadata prefetch. |
| mempalace/miner.py | Adds skip filters, configurable chunking, concurrent mining path, and bulk mined-map skip logic; refactors file processing to support batching. |
| mempalace/entity_detector.py | Expands STOPWORDS with common technical terms to reduce false-positive entity extraction. |
| mempalace/convo_miner.py | Switches convo ingestion to batched upserts instead of per-chunk writes. |
| mempalace/config.py | Adds chunk sizing/overlap/min-size settings to user config with defaults. |
```python
collection.upsert(
    documents=batch_docs,
    ids=batch_ids,
    metadatas=batch_metas,
)
```
process_file() now upserts all chunks from a file in a single request. With MAX_FILE_SIZE = 10MB and chunk_size=800, a single file can produce thousands of chunks, which defeats the intended “~100 docs/call” batching and can create very large request payloads/memory spikes. Consider slicing batch_docs/ids/metadatas into _UPSERT_BATCH_SIZE batches (as in convo_miner.py).
Good point. The _UPSERT_BATCH_SIZE constant was intended to cap this, but the single-file path doesn't enforce it. Will add chunked upsert for the single-file path.
Fixed — single-file path now chunks upserts into _UPSERT_BATCH_SIZE slices via for i in range(0, len(batch_docs), _UPSERT_BATCH_SIZE). 79e3d22
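The fix described in this thread can be sketched as a small helper. This is an illustration of the slicing pattern, not the PR's code; the helper name is hypothetical and the batch size stands in for `_UPSERT_BATCH_SIZE`:

```python
def upsert_in_batches(collection, docs, ids, metas, batch_size=100):
    """Slice one file's chunks so no single upsert exceeds the batch cap.

    Hypothetical helper illustrating the range-step loop from the thread;
    batch_size plays the role of _UPSERT_BATCH_SIZE.
    """
    for i in range(0, len(docs), batch_size):
        collection.upsert(
            documents=docs[i:i + batch_size],
            ids=ids[i:i + batch_size],
            metadatas=metas[i:i + batch_size],
        )
```

A 10MB file chunked at 800 characters then issues many ~100-doc requests instead of one multi-thousand-doc payload.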
```python
# Flush when batch is large enough
if len(pending_docs) >= _UPSERT_BATCH_SIZE:
    collection.upsert(
        documents=pending_docs,
        ids=pending_ids,
        metadatas=pending_metas,
    )
    pending_docs = []
    pending_ids = []
    pending_metas = []
```
The concurrent-path flush can still upsert far more than _UPSERT_BATCH_SIZE in a single call: pending_docs may jump from <100 to 300+ after extending with one large file’s chunks, and then the code upserts the entire list. To enforce batch size, flush in a loop and only upsert fixed-size slices until len(pending_docs) < _UPSERT_BATCH_SIZE.
Suggested change:

```diff
-# Flush when batch is large enough
-if len(pending_docs) >= _UPSERT_BATCH_SIZE:
-    collection.upsert(
-        documents=pending_docs,
-        ids=pending_ids,
-        metadatas=pending_metas,
-    )
-    pending_docs = []
-    pending_ids = []
-    pending_metas = []
+# Flush in fixed-size batches so no single upsert exceeds the limit
+while len(pending_docs) >= _UPSERT_BATCH_SIZE:
+    collection.upsert(
+        documents=pending_docs[:_UPSERT_BATCH_SIZE],
+        ids=pending_ids[:_UPSERT_BATCH_SIZE],
+        metadatas=pending_metas[:_UPSERT_BATCH_SIZE],
+    )
+    pending_docs = pending_docs[_UPSERT_BATCH_SIZE:]
+    pending_ids = pending_ids[_UPSERT_BATCH_SIZE:]
+    pending_metas = pending_metas[_UPSERT_BATCH_SIZE:]
```
Agreed — the extend-then-check pattern has a TOCTOU gap. Will restructure to flush in fixed-size batches.
Fixed — changed if to while loop that drains all full batches, slicing at _UPSERT_BATCH_SIZE boundaries. This prevents the TOCTOU where a single large future result could exceed the batch size. 79e3d22
```python
files = files[:limit]

if workers <= 0:
    workers = min(8, os.cpu_count() or 4)
```
mine() auto-enables concurrency by default (workers is set when <= 0). The CLI currently calls mine() without a workers argument and does not expose a --workers flag, so users can’t control/disable concurrency. Either default to workers=1 unless explicitly set, or add a CLI option and update help/text to match.
Suggested change:

```diff
-    workers = min(8, os.cpu_count() or 4)
+    workers = 1
```
The CLI does expose --workers — see the argparse addition in cli.py in this same PR (p_mine.add_argument('--workers', ...)).
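For context, the CLI flag referred to in this reply would be wired up along these lines. The `--workers` name comes from the thread; the subcommand layout, default, and help text are assumptions, not the actual cli.py contents:

```python
import argparse

# Hypothetical reconstruction of the --workers flag discussed above.
parser = argparse.ArgumentParser(prog="mempalace")
sub = parser.add_subparsers(dest="command")
p_mine = sub.add_parser("mine")
p_mine.add_argument(
    "--workers", type=int, default=0,
    help="Thread count for concurrent mining; <=0 auto-selects min(8, cpu_count)",
)

args = parser.parse_args(["mine", "--workers", "4"])
```

A `default=0` paired with the `workers <= 0` auto-selection in `mine()` lets users pin a worker count or pass `--workers 1` to disable concurrency.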
```python
total_drawers += drawers
room_counts[room or "general"] += 1
if not dry_run:
    print(f" \u2713 [{i:4}/{len(files)}] {filepath.name[:50]:50} +{drawers}")
```
This progress print uses U+2713 ("✓"). That character raises UnicodeEncodeError on Windows consoles using cp1251/cp1252 (non-UTF-8) encodings. Prefer ASCII output (e.g., "+"/"OK") or gate the symbol based on stdout encoding support.
Suggested change:

```diff
-    print(f" \u2713 [{i:4}/{len(files)}] {filepath.name[:50]:50} +{drawers}")
+    print(f" OK [{i:4}/{len(files)}] {filepath.name[:50]:50} +{drawers}")
```
Will replace with ASCII + to match the convo_miner.py fix in PR #626.
```python
f" \u2713 [{processed_count:4}/{len(files_to_process)}] "
f"{filepath.name[:50]:50} +{len(batch_docs)}"
```
This progress print uses U+2713 ("✓"), which can raise encoding errors on some Windows terminals. Prefer ASCII output (e.g., "+"/"OK") or gate the symbol based on stdout encoding support.
Same fix — will replace with ASCII +.
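Gating the symbol on stdout encoding support, the alternative mentioned in both review comments, could be sketched like this (hypothetical helper, not part of the PR):

```python
import sys

def progress_mark(stream=sys.stdout) -> str:
    """Return '✓' only when the stream's encoding can represent it,
    falling back to ASCII 'OK' on cp1251/cp1252 consoles."""
    enc = getattr(stream, "encoding", None) or "ascii"
    try:
        "\u2713".encode(enc)
        return "\u2713"
    except (UnicodeEncodeError, LookupError):
        return "OK"
```

The PR instead opted for unconditional ASCII output, which is simpler and avoids per-platform behavior differences in CI logs.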
- Batch upserts (100 docs/call) in both general miner and convo_miner instead of one-at-a-time writes — 3-5x faster on large projects
- _prepare_file() separates read/chunk (thread-safe) from write, enabling ThreadPoolExecutor concurrency with --workers flag
- bulk_check_mined() pre-fetches all source_file/mtime pairs in one paginated scan instead of per-file ChromaDB queries
- Epsilon mtime comparison (abs < 0.01) instead of float == (MemPalace#483)
- Cosine distance default for new collections (hnsw:space=cosine)
- SKIP_PATTERNS/JUNK_FILE_SIZE filter minified JS, lockfiles, large dumps
- detect_room() uses word-boundary regex and exact-match priority
- chunk_text() accepts configurable size/overlap/min params
- Entity detector STOPWORDS expanded (73 technical terms)
- Configurable chunk_size/chunk_overlap/min_chunk_size in palace config

Split from MemPalace#562 per maintainer request. Closes MemPalace#483, MemPalace#568.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ent flush

- Single-file path: chunk upserts into _UPSERT_BATCH_SIZE slices
- Concurrent path: while loop drains all full batches (was: if, only one)
- Replace Unicode ✓ with ASCII + for Windows CI compatibility

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Upstream bumped pyproject.toml to 3.2.0 in MemPalace#761 but missed version.py. Aligns the two so test_version_consistency passes.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Closing this as mostly superseded by upstream work that landed in parallel:
The remaining novel bits here (
MemPalace#629 and MemPalace#632 were closed upstream as fully superseded — all three features in MemPalace#632 (`--version`, `purge`, `repair`) shipped in v3.3.0; MemPalace#629's core (batching, file locking, skip patterns, stopwords) landed independently via upstream's own paths. Move both to the closed list and update the live/closed counts.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Summary
Performance improvements for mining and storage. Split from #562 per maintainer request (PR 2 of 6).
- `--workers` flag for concurrent file processing
- Epsilon mtime comparison (`abs < 0.01`) instead of float `==` — fixes BUG: Float mtime comparison breaks file deduplication — every file re-mined on each run #475, fix: use epsilon comparison for mtime dedup + add bulk pre-fetch #483
- Cosine distance default (`hnsw:space=cosine`) — fixes fix: default collection to cosine distance and clamp similarity scores #568

Closes
Related
Test plan
- `python -m pytest tests/test_miner.py -v` — 37 tests including new batch/concurrent tests
- `python -m pytest tests/ -v` — full suite (613 pass on this branch)
- `--workers 4` and verify speedup