fix: batch ChromaDB upserts and parallelize file processing for Windows #298

RasimAbiyev wants to merge 4 commits into MemPalace:main
Conversation
PR Review: fix: batch ChromaDB upserts and parallelize file processing for Windows

Executive Summary

Business Impact: Mining speed improvement on Windows benefits all users onboarding large projects. However, the semantic change silently overwrites previously-mined drawers instead of skipping them.

Flow Changes: The mine pipeline changes from serial per-chunk adds to batched upserts with parallel file processing.

High Priority Issues

🐛 #1:
Hi @bgauryy, thank you for the detailed review! I've addressed all the issues:
Ready for re-review!
…ws performance (fixes MemPalace#19)
…(), add error handling

force-pushed ace899d to 4259566
web3guru888 left a comment
Batch upserts are the right fix for the per-chunk round-trip overhead. The performance improvement will be significant for large projects. A few concerns:
Batch upsert (`add_drawers_batch`):

- Uses `collection.add()` instead of `collection.upsert()`, which means re-mining the same file will raise a duplicate ID error. The catch for `"already exists"` handles it, but returns 0 (the entire batch is lost) even if only one chunk was a duplicate. Consider using `upsert()`, which is idempotent, or falling back to per-chunk adds when the batch fails.
- The function builds all IDs/docs/metadatas in memory before calling `add()`. For very large files (e.g., a 10MB transcript that produces 200 chunks) this is fine, but worth noting there's no batch-size cap. ChromaDB can handle it, but documenting a max batch size would be good hygiene.
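A minimal sketch of what an idempotent, capped batch write could look like. This is not the PR's actual code: `chunks` is assumed to be a list of `(chunk_id, text, metadata)` tuples, and `MAX_BATCH` is a made-up cap, but `Collection.upsert(ids=..., documents=..., metadatas=...)` is a real ChromaDB method.

```python
MAX_BATCH = 500  # hypothetical cap, per the batch-size note above

def add_drawers_batch(collection, chunks):
    """Write chunks in capped batches; returns the number written."""
    written = 0
    for start in range(0, len(chunks), MAX_BATCH):
        batch = chunks[start:start + MAX_BATCH]
        ids = [c[0] for c in batch]
        docs = [c[1] for c in batch]
        metas = [c[2] for c in batch]
        # upsert() is idempotent: re-mining the same file overwrites
        # existing IDs instead of raising a duplicate-ID error
        collection.upsert(ids=ids, documents=docs, metadatas=metas)
        written += len(batch)
    return written
```

Because `upsert()` never raises on duplicates, a partial-duplicate batch no longer loses the whole batch, and the cap keeps memory bounded for very large files.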
Thread safety concerns with `ThreadPoolExecutor`:

- `process_file()` now returns just `drawers` (int) instead of `(drawers, room)`. The room is then re-detected in the main loop via `detect_room(filepath, "", rooms, project_path)`, with an empty string for `content`. This means room detection in the summary can differ from the room used during actual processing (where the file content was available), so the `room_counts` dict will be inaccurate.
- `total_drawers` and `files_skipped` are modified from the main thread after `as_completed()`, so there is no race condition there; that part is correct.
- `print()` from multiple threads will interleave output. Consider collecting results and printing in order, or using a lock.
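A sketch of the fixes suggested above: the worker returns the room it actually used, all counters are touched only by the main thread, and printing happens once at the end instead of from worker threads. `detect_room` and `index_chunks` here are simplified placeholders, not the project's real functions (the real `detect_room` also takes `rooms` and `project_path`).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed

def detect_room(filepath, content):
    # placeholder room detection on the real file content
    return "notes" if "note" in content else "misc"

def index_chunks(filepath, content):
    # placeholder: pretend each whitespace-separated token is a chunk
    return len(content.split())

def process_file(filepath):
    with open(filepath, encoding="utf-8", errors="replace") as f:
        content = f.read()
    room = detect_room(filepath, content)      # room from real content
    drawers = index_chunks(filepath, content)  # chunk count
    return drawers, room                       # return BOTH to the caller

def mine(filepaths, workers=4):
    total_drawers = 0
    room_counts = Counter()
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process_file, fp): fp for fp in filepaths}
        for fut in as_completed(futures):
            drawers, room = fut.result()
            total_drawers += drawers          # main thread only: no race
            room_counts[room] += 1
            results.append((futures[fut], drawers, room))
    # print once, in order, instead of interleaving from worker threads
    for fp, drawers, room in sorted(results):
        print(f"{fp}: {drawers} drawers ({room})")
    return total_drawers, room_counts
```

This keeps `room_counts` consistent with the rooms actually used during processing, since the room is detected exactly once, with the content in hand.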
`workers = min(4, os.cpu_count() or 1)`:

- 4 workers is reasonable. But ChromaDB's `PersistentClient` uses SQLite under the hood, and SQLite has a single-writer lock. Multiple threads calling `collection.add()` concurrently will serialize on that write lock anyway. The parallelism helps with file I/O and chunking, but the ChromaDB writes won't truly parallelize. Worth measuring the actual speedup; it may be better to separate the read/chunk phase from the write phase.
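The read/write split suggested above could be sketched like this: parallelize the I/O-bound read-and-chunk work, then issue a single batched write from the main thread so ChromaDB's SQLite writer lock is never contended. `read_and_chunk`, the fixed-size chunking, and `mine_two_phase` are all hypothetical stand-ins for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def read_and_chunk(filepath):
    """Read one file and return (chunk_id, text) pairs; no DB access."""
    with open(filepath, encoding="utf-8", errors="replace") as f:
        text = f.read()
    # naive fixed-size chunking as a placeholder
    return [(f"{filepath}:{i}", text[i:i + 1000])
            for i in range(0, len(text), 1000)]

def mine_two_phase(collection, filepaths, workers=4):
    # Phase 1: parallel reads and chunking (the part that benefits
    # from threads, since it never touches the database)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunk_lists = list(pool.map(read_and_chunk, filepaths))
    # Phase 2: one batched write from the main thread, so threads
    # never serialize on SQLite's single-writer lock
    ids, docs = [], []
    for chunks in chunk_lists:
        for cid, doc in chunks:
            ids.append(cid)
            docs.append(doc)
    if ids:
        collection.upsert(ids=ids, documents=docs)
    return len(ids)
```

Measuring each phase separately would also show how much of the original slowdown was per-chunk round-trips versus file I/O.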
The batch upsert alone is a solid improvement. I'd suggest separating it from the threading change — batch upsert is safe and high-impact, threading needs more testing for correctness.
🔭 Reviewed as part of the MemPalace-AGI integration project — autonomous research with perfect memory. Community interaction updates are posted regularly on the dashboard.
Hi @web3guru888, thank you for the review! Addressed your feedback:
Ready for re-review!
What does this PR do?
Fixes the extremely slow `mempalace mine` performance on Windows (issue #19). Two bottlenecks identified and fixed:

- Per-chunk ChromaDB writes, now batched (`add_drawers_batch()`)
- Serial file processing, now parallelized with a `ThreadPoolExecutor` (up to 4 workers)

How to test
Run `mempalace mine .` on a mid-size project and compare speed before/after. Expected: significantly fewer ChromaDB round-trips and faster overall indexing.
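For the before/after comparison, a tiny wall-clock harness can make the numbers repeatable. The `time_cmd` helper is ours, not part of mempalace; only the `mempalace mine .` invocation comes from the instructions above.

```python
import subprocess
import time

def time_cmd(cmd):
    """Run a command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

# Example (requires mempalace on PATH); run once on the old commit
# and once on this branch, on the same project:
# elapsed = time_cmd(["mempalace", "mine", "."])
```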
Checklist

- Tests pass (`python -m pytest tests/ -v`)
- Lint passes (`ruff check .`)