
fix: batch ChromaDB upserts and parallelize file processing for Windows #298

Closed

RasimAbiyev wants to merge 4 commits into MemPalace:main from RasimAbiyev:fix/windows-mine-performance

Conversation

@RasimAbiyev

What does this PR do?

Fixes the extremely slow mempalace mine performance on Windows (issue #19).
Two bottlenecks identified and fixed:

  1. Each chunk was written to ChromaDB in a separate upsert — replaced with a single batch upsert per file (add_drawers_batch())
  2. Files were processed one by one — replaced with ThreadPoolExecutor (up to 4 workers); both changes are sketched below
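
A rough sketch of both changes, with assumed signatures (the real implementation lives in mempalace/miner.py):

  from concurrent.futures import ThreadPoolExecutor, as_completed
  import os

  def add_drawers_batch(collection, documents, ids, metadatas):
      # One ChromaDB round-trip per file instead of one upsert per chunk
      collection.upsert(documents=documents, ids=ids, metadatas=metadatas)
      return len(ids)

  def mine(files, collection, process_file):
      # Files processed concurrently instead of one by one
      workers = min(4, os.cpu_count() or 1)
      with ThreadPoolExecutor(max_workers=workers) as pool:
          futures = [pool.submit(process_file, f, collection) for f in files]
          return sum(fut.result() for fut in as_completed(futures))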

How to test

Run mempalace mine . on a mid-size project and compare speed before/after.
Expected: significantly fewer ChromaDB round-trips, faster overall indexing.

Checklist

  • Tests pass (python -m pytest tests/ -v)
  • No hardcoded paths
  • Linter passes (ruff check .)

@bgauryy

bgauryy commented Apr 8, 2026

PR Review: fix: batch ChromaDB upserts and parallelize file processing for Windows

Executive Summary

PR Goal: Fix slow mempalace mine on Windows by batching ChromaDB writes and parallelizing file processing
Files Changed: 1 (mempalace/miner.py)
Risk Level: 🔴 HIGH - changes write semantics (add → upsert) and introduces concurrency to a previously serial pipeline
Review Effort: 3 - single file, moderate complexity from threading
Recommendation: 🔄 REQUEST_CHANGES

Affected Areas (miner.py): add_drawers_batch() (new), process_file(), mine()

Business Impact: Mining speed improvement on Windows benefits all users onboarding large projects. However, the semantic change silently overwrites previously-mined drawers instead of skipping them.

Flow Changes: The mine pipeline changes from serial per-chunk collection.add() to batched collection.upsert() per file, and the file loop changes from sequential to concurrent (up to 4 threads).

Ratings

Correctness: 2/5
Security: 4/5
Performance: 4/5
Maintainability: 3/5

PR Health

High Priority Issues

🐛 #1: upsert silently overwrites existing drawers — semantic change from add

Location: mempalace/miner.py, add_drawers_batch() (new function, line ~481 in PR) | Confidence: ✅ HIGH

The existing add_drawer() uses collection.add(), which raises on duplicates (caught and returns False). The new add_drawers_batch() uses collection.upsert(), which silently overwrites existing drawers. This has two consequences:

  1. Re-mining overwrites data — previously, re-mining a file with existing drawers was a no-op per chunk. Now it silently replaces them with potentially different filed_at timestamps and metadata.
  2. Inflated drawer count: add_drawers_batch() returns len(ids) (all chunks), not the count of newly added chunks. The summary output Drawers filed: N will over-report.
- collection.upsert(documents=documents, ids=ids, metadatas=metadatas)
- return len(ids)
+ try:
+     collection.add(documents=documents, ids=ids, metadatas=metadatas)
+     return len(ids)
+ except Exception as e:
+     if "already exists" in str(e).lower() or "duplicate" in str(e).lower():
+         # Fall back to individual upsert to count actual additions
+         collection.upsert(documents=documents, ids=ids, metadatas=metadatas)
+         return 0  # or implement per-id dedup if count accuracy matters
+     raise

Alternatively, keep upsert but acknowledge the semantic change in the PR description and rely on the file_already_mined() early-return in process_file(), which already guards against re-processing entire files.


🐛 #2: ChromaDB PersistentClient concurrent writes — risky on the target platform (Windows)

Location: mempalace/miner.py, mine() ThreadPoolExecutor section | Confidence: ⚠️ MED

Multiple threads call process_file() → add_drawers_batch() → collection.upsert() concurrently on the same collection object. ChromaDB's PersistentClient uses SQLite as its metadata store. SQLite on Windows uses mandatory file locking, which is notoriously problematic for concurrent writers — exactly the platform this PR targets.

While ChromaDB has some internal locking, concurrent upserts from a thread pool can cause:

  • OperationalError: database is locked on Windows
  • HNSW index contention during parallel embedding insertions
  • Serialized writes that negate the parallelism benefit (threads block waiting for the lock)
  workers = 1 if dry_run else min(4, os.cpu_count() or 1)
+ # Consider: parallelize file READING + chunking, but serialize ChromaDB writes
+ # This avoids SQLite contention while still gaining I/O parallelism

Suggested pattern: use ThreadPoolExecutor for the I/O-bound part (read file, chunk, detect room) and batch the ChromaDB writes back on the main thread.
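
A minimal sketch of that split, assuming a hypothetical read_and_chunk() helper in place of the existing read/chunk/room-detection logic (not the PR's actual code):

  from concurrent.futures import ThreadPoolExecutor
  import os

  def read_and_chunk(filepath):
      # Hypothetical stand-in for the I/O-bound half of process_file()
      with open(filepath, encoding="utf-8", errors="replace") as f:
          text = f.read()
      chunks = [text[i:i + 1000] for i in range(0, len(text), 1000)]
      ids = [f"{filepath}:{n}" for n in range(len(chunks))]
      return chunks, ids

  def mine_files(filepaths, collection):
      total = 0
      workers = min(4, os.cpu_count() or 1)
      with ThreadPoolExecutor(max_workers=workers) as pool:
          # Parallel phase: reading and chunking only, no ChromaDB access
          for chunks, ids in pool.map(read_and_chunk, filepaths):
              # Serial phase: every write happens on the main thread,
              # so SQLite never sees concurrent writers
              if chunks:
                  collection.upsert(documents=chunks, ids=ids)
                  total += len(ids)
      return total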


Medium Priority Issues

🚨 #3: No exception handling around future.result() — one file failure crashes the entire mine run

Location: mempalace/miner.py, as_completed loop in mine() | Confidence: ✅ HIGH

If any worker thread raises (e.g., ChromaDB write error, permission denied), future.result() re-raises in the main thread. The as_completed loop has no try/except, so one bad file aborts the entire mining operation and produces an incorrect summary.

  for future in as_completed(futures):
-     i, filepath, drawers = future.result()
+     try:
+         i, filepath, drawers = future.result()
+     except Exception as e:
+         print(f"  ✗ Error processing file: {e}")
+         files_skipped += 1
+         continue
      if drawers == 0 and not dry_run:

🎨 #4: Inconsistent metadata schema — source_mtime only in batch path

Location: mempalace/miner.py, add_drawers_batch() | Confidence: ✅ HIGH

add_drawers_batch() adds source_mtime to drawer metadata, but add_drawer() (still in the codebase, used by other callers) does not. This creates an inconsistent metadata schema across drawers. If source_mtime is valuable (it is — it enables staleness detection), it should be added to add_drawer() as well.
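
A possible alignment of the single-drawer path, assuming add_drawer()'s behavior as described above (add(), duplicates caught, returns False); the filepath parameter and exact signature are hypothetical:

  import os

  def add_drawer(collection, document, drawer_id, metadata, filepath):
      # Mirror the batch path: record the source file's mtime so staleness
      # detection works for drawers filed through either path
      metadata = {**metadata, "source_mtime": os.path.getmtime(filepath)}
      try:
          collection.add(documents=[document], ids=[drawer_id], metadatas=[metadata])
          return True
      except Exception:
          return False  # duplicate ID: drawer already exists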


Low Priority Issues

🎨 #5: Progress output is out-of-order with as_completed

Location: mempalace/miner.py, as_completed loop in mine() | Confidence: ✅ HIGH

as_completed() yields futures in completion order, not submission order. The progress lines ✔ [ 3/100] followed by ✔ [ 1/100] will confuse users expecting sequential output. The original code printed in file order.

- for future in as_completed(futures):
+ for future in futures:
+     # Iterate in submission order; future.result() blocks until each finishes

Alternatively, use executor.map() which preserves order, or collect results and sort before printing.
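
For illustration, executor.map() yields results in submission order even when workers finish out of order (a toy example, not the miner's code):

  from concurrent.futures import ThreadPoolExecutor

  def work(n):
      return n * n  # stand-in for process_file()

  with ThreadPoolExecutor(max_workers=4) as pool:
      # map() preserves submission order, so progress prints sequentially
      for i, result in enumerate(pool.map(work, range(100)), start=1):
          print(f"  ✔ [{i:3d}/100] drawers={result}")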


Flow Impact Analysis

BEFORE (serial):
  mine() → for each file → process_file() → for each chunk → add_drawer() [collection.add()]
  ↓ sequential, safe, slow on Windows I/O

AFTER (parallel):
  mine() → ThreadPoolExecutor(4) → process_file() → add_drawers_batch() [collection.upsert()]
  ↓ concurrent file processing, batch writes, risk of SQLite contention

Key flow changes:

  1. collection.add() → collection.upsert() — overwrites instead of skip-on-duplicate
  2. Sequential → concurrent file processing — ChromaDB thread safety at risk
  3. Per-chunk writes → per-file batch writes — good optimization regardless of threading
  4. Return value semantics changed — count of all chunks vs count of new chunks

Recommendation: The batch upsert optimization (#1 in PR description) is solid and should be kept. The threading (#2) should be restructured to parallelize I/O only, with writes serialized back to the main thread.


Created by Octocode MCP https://octocode.ai 🔍🐙

@RasimAbiyev
Author

Hi @bgauryy, thank you for the detailed review!

I've addressed all the issues:

  1. upsert → add: Replaced collection.upsert() with collection.add() in add_drawers_batch(), with fallback handling for duplicates.
  2. SQLite contention: Restructured threading — file reading and chunking are now parallelized, but ChromaDB writes happen serially on the main thread.
  3. Error handling: Added try/except around future.result() so one bad file doesn't crash the entire mining operation.

Ready for re-review!

@RasimAbiyev force-pushed the fix/windows-mine-performance branch from ace899d to 4259566 on April 9, 2026, 13:15

@web3guru888 left a comment


Batch upserts are the right fix for the per-chunk round-trip overhead. The performance improvement will be significant for large projects. A few concerns:

Batch upsert (add_drawers_batch):

  • Uses collection.add() instead of collection.upsert() — this means re-mining the same file will raise a duplicate ID error. The catch for "already exists" handles it, but returns 0 (entire batch lost) even if only one chunk was a duplicate. Consider using upsert() instead, which is idempotent, or falling back to per-chunk adds when the batch fails (sketched after this list).
  • The function builds all IDs/docs/metadatas in memory before calling add(). For very large files (e.g., a 10MB transcript that produces 200 chunks), this is fine — but worth noting there's no batch-size cap. ChromaDB can handle it, but documenting a max batch size would be good hygiene.
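
A hypothetical shape of that fallback, assuming add() raises on duplicate IDs as the existing add_drawer() relies on:

  def add_drawers_batch(collection, documents, ids, metadatas):
      try:
          # Fast path: one round-trip; raises if any ID already exists
          collection.add(documents=documents, ids=ids, metadatas=metadatas)
          return len(ids)
      except Exception:
          # Slow path: retry chunk by chunk so only true duplicates are
          # skipped and the count of newly filed drawers stays accurate
          added = 0
          for doc, id_, meta in zip(documents, ids, metadatas):
              try:
                  collection.add(documents=[doc], ids=[id_], metadatas=[meta])
                  added += 1
              except Exception:
                  pass  # this chunk was mined before
          return added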

Thread safety concerns with ThreadPoolExecutor:

  • process_file() now returns just drawers (int) instead of (drawers, room). The room is then re-detected in the main loop via detect_room(filepath, "", rooms, project_path) — but with an empty string for content. This means room detection in the summary will be different from the room used during actual processing (where the file content was available). The room_counts dict will be inaccurate.
  • total_drawers and files_skipped are modified from the main thread after as_completed(), so no race condition there — that's correct.
  • print() from multiple threads will interleave output. Consider collecting results and printing in order, or using a lock.

workers = min(4, os.cpu_count() or 1):

  • 4 workers is reasonable. But ChromaDB's PersistentClient uses SQLite under the hood, and SQLite has a single-writer lock. Multiple threads calling collection.add() concurrently will serialize on the write lock anyway. The parallelism helps with file I/O and chunking, but the ChromaDB writes won't truly parallelize. Worth measuring actual speedup — might want to separate the read/chunk phase from the write phase.

The batch upsert alone is a solid improvement. I'd suggest separating it from the threading change — batch upsert is safe and high-impact, threading needs more testing for correctness.

🔭 Reviewed as part of the MemPalace-AGI integration project — autonomous research with perfect memory. Community interaction updates are posted regularly on the dashboard.

@RasimAbiyev
Author

Hi @web3guru888, thank you for the review!

Addressed your feedback:

  1. upsert restored: Switched back to collection.upsert() in add_drawers_batch() — idempotent and correct since file_already_mined() already guards against unnecessary re-processing.
  2. Room detection fixed: process_file() now returns (drawers, room) so the main loop uses the actual room detected during processing — no more empty-string re-detection in the summary.

Ready for re-review!

@bensig
Collaborator

bensig commented Apr 11, 2026

Conflicts with main. Batch upserts and parallel processing are being addressed in pieces by other PRs (#507, #562). Thanks for the Windows work.

