
fix: batch ChromaDB upserts and parallelize file processing for Windows #298

Closed

RasimAbiyev wants to merge 4 commits into MemPalace:main from RasimAbiyev:fix/windows-mine-performance

Conversation

@RasimAbiyev

What does this PR do?

Fixes the extremely slow mempalace mine performance on Windows (issue #19).
Two bottlenecks identified and fixed:

  1. Each chunk was written to ChromaDB in a separate upsert — replaced with a single batch upsert per file (add_drawers_batch())
  2. Files were processed one by one — replaced with ThreadPoolExecutor (up to 4 workers); both changes are sketched below
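
A rough sketch of both changes, with assumed signatures (the real implementation lives in mempalace/miner.py):

  from concurrent.futures import ThreadPoolExecutor, as_completed
  import os

  def add_drawers_batch(collection, documents, ids, metadatas):
      # One ChromaDB round-trip per file instead of one upsert per chunk
      collection.upsert(documents=documents, ids=ids, metadatas=metadatas)
      return len(ids)

  def mine(files, collection, process_file):
      # Files processed concurrently instead of one by one
      workers = min(4, os.cpu_count() or 1)
      with ThreadPoolExecutor(max_workers=workers) as pool:
          futures = [pool.submit(process_file, f, collection) for f in files]
          return sum(fut.result() for fut in as_completed(futures))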

How to test

Run mempalace mine . on a mid-size project and compare speed before/after.
Expected: significantly fewer ChromaDB round-trips, faster overall indexing.

Checklist

  • Tests pass (python -m pytest tests/ -v)
  • No hardcoded paths
  • Linter passes (ruff check .)

@bgauryy

bgauryy commented Apr 8, 2026

PR Review: fix: batch ChromaDB upserts and parallelize file processing for Windows

Executive Summary

PR Goal: Fix slow mempalace mine on Windows by batching ChromaDB writes and parallelizing file processing
Files Changed: 1 (mempalace/miner.py)
Risk Level: 🔴 HIGH - changes write semantics (add → upsert) and introduces concurrency to a previously serial pipeline
Review Effort: 3 - single file, moderate complexity from threading
Recommendation: 🔄 REQUEST_CHANGES

Affected Areas (miner.py): add_drawers_batch() (new), process_file(), mine()

Business Impact: Mining speed improvement on Windows benefits all users onboarding large projects. However, the semantic change silently overwrites previously-mined drawers instead of skipping them.

Flow Changes: The mine pipeline changes from serial per-chunk collection.add() to batched collection.upsert() per file, and the file loop changes from sequential to concurrent (up to 4 threads).

Ratings

Correctness: 2/5
Security: 4/5
Performance: 4/5
Maintainability: 3/5

PR Health

High Priority Issues

🐛 #1: upsert silently overwrites existing drawers — semantic change from add

Location: mempalace/miner.py, add_drawers_batch() (new function, line ~481 in PR) | Confidence: ✅ HIGH

The existing add_drawer() uses collection.add(), which raises on duplicates (caught and returns False). The new add_drawers_batch() uses collection.upsert(), which silently overwrites existing drawers. This has two consequences:

  1. Re-mining overwrites data — previously, re-mining a file with existing drawers was a no-op per chunk. Now it silently replaces them with potentially different filed_at timestamps and metadata.
  2. Inflated drawer count: add_drawers_batch() returns len(ids) (all chunks), not the count of newly added chunks. The summary output Drawers filed: N will over-report.
- collection.upsert(documents=documents, ids=ids, metadatas=metadatas)
- return len(ids)
+ try:
+     collection.add(documents=documents, ids=ids, metadatas=metadatas)
+     return len(ids)
+ except Exception as e:
+     if "already exists" in str(e).lower() or "duplicate" in str(e).lower():
+         # Fall back to individual upsert to count actual additions
+         collection.upsert(documents=documents, ids=ids, metadatas=metadatas)
+         return 0  # or implement per-id dedup if count accuracy matters
+     raise

Alternatively, keep upsert but acknowledge the semantic change in the PR description and rely on the file_already_mined() early-return in process_file(), which already guards against re-processing entire files.


🐛 #2: ChromaDB PersistentClient concurrent writes — risky on the target platform (Windows)

Location: mempalace/miner.py, mine() ThreadPoolExecutor section | Confidence: ⚠️ MED

Multiple threads call process_file() → add_drawers_batch() → collection.upsert() concurrently on the same collection object. ChromaDB's PersistentClient uses SQLite as its metadata store. SQLite on Windows uses mandatory file locking, which is notoriously problematic for concurrent writers — exactly the platform this PR targets.

While ChromaDB has some internal locking, concurrent upserts from a thread pool can cause:

  • OperationalError: database is locked on Windows
  • HNSW index contention during parallel embedding insertions
  • Serialized writes that negate the parallelism benefit (threads block waiting for the lock)
  workers = 1 if dry_run else min(4, os.cpu_count() or 1)
+ # Consider: parallelize file READING + chunking, but serialize ChromaDB writes
+ # This avoids SQLite contention while still gaining I/O parallelism

Suggested pattern: use ThreadPoolExecutor for the I/O-bound part (read file, chunk, detect room) and batch the ChromaDB writes back on the main thread.
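
A minimal sketch of that split, assuming a hypothetical read_and_chunk() helper in place of the existing read/chunk/room-detection logic (not the PR's actual code):

  from concurrent.futures import ThreadPoolExecutor
  import os

  def read_and_chunk(filepath):
      # Hypothetical stand-in for the I/O-bound half of process_file()
      with open(filepath, encoding="utf-8", errors="replace") as f:
          text = f.read()
      chunks = [text[i:i + 1000] for i in range(0, len(text), 1000)]
      ids = [f"{filepath}:{n}" for n in range(len(chunks))]
      return chunks, ids

  def mine_files(filepaths, collection):
      total = 0
      workers = min(4, os.cpu_count() or 1)
      with ThreadPoolExecutor(max_workers=workers) as pool:
          # Parallel phase: reading and chunking only, no ChromaDB access
          for chunks, ids in pool.map(read_and_chunk, filepaths):
              # Serial phase: every write happens on the main thread,
              # so SQLite never sees concurrent writers
              if chunks:
                  collection.upsert(documents=chunks, ids=ids)
                  total += len(ids)
      return total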


Medium Priority Issues

🚨 #3: No exception handling around future.result() — one file failure crashes the entire mine run

Location: mempalace/miner.py, as_completed loop in mine() | Confidence: ✅ HIGH

If any worker thread raises (e.g., ChromaDB write error, permission denied), future.result() re-raises in the main thread. The as_completed loop has no try/except, so one bad file aborts the entire mining operation and produces an incorrect summary.

  for future in as_completed(futures):
-     i, filepath, drawers = future.result()
+     try:
+         i, filepath, drawers = future.result()
+     except Exception as e:
+         print(f"  ✗ Error processing file: {e}")
+         files_skipped += 1
+         continue
      if drawers == 0 and not dry_run:

🎨 #4: Inconsistent metadata schema — source_mtime only in batch path

Location: mempalace/miner.py, add_drawers_batch() | Confidence: ✅ HIGH

add_drawers_batch() adds source_mtime to drawer metadata, but add_drawer() (still in the codebase, used by other callers) does not. This creates an inconsistent metadata schema across drawers. If source_mtime is valuable (it is — it enables staleness detection), it should be added to add_drawer() as well.
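
A possible alignment of the single-drawer path, assuming add_drawer()'s behavior as described above (add(), duplicates caught, returns False); the filepath parameter and exact signature are hypothetical:

  import os

  def add_drawer(collection, document, drawer_id, metadata, filepath):
      # Mirror the batch path: record the source file's mtime so staleness
      # detection works for drawers filed through either path
      metadata = {**metadata, "source_mtime": os.path.getmtime(filepath)}
      try:
          collection.add(documents=[document], ids=[drawer_id], metadatas=[metadata])
          return True
      except Exception:
          return False  # duplicate ID: drawer already exists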


Low Priority Issues

🎨 #5: Progress output is out-of-order with as_completed

Location: mempalace/miner.py, as_completed loop in mine() | Confidence: ✅ HIGH

as_completed() yields futures in completion order, not submission order. The progress lines ✔ [ 3/100] followed by ✔ [ 1/100] will confuse users expecting sequential output. The original code printed in file order.

- for future in as_completed(futures):
+ for future in futures:
+     # Iterate in submission order; future.result() blocks until each finishes

Alternatively, use executor.map() which preserves order, or collect results and sort before printing.
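
For illustration, executor.map() yields results in submission order even when workers finish out of order (a toy example, not the miner's code):

  from concurrent.futures import ThreadPoolExecutor

  def work(n):
      return n * n  # stand-in for process_file()

  with ThreadPoolExecutor(max_workers=4) as pool:
      # map() preserves submission order, so progress prints sequentially
      for i, result in enumerate(pool.map(work, range(100)), start=1):
          print(f"  ✔ [{i:3d}/100] drawers={result}")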


Flow Impact Analysis

BEFORE (serial):
  mine() → for each file → process_file() → for each chunk → add_drawer() [collection.add()]
  ↓ sequential, safe, slow on Windows I/O

AFTER (parallel):
  mine() → ThreadPoolExecutor(4) → process_file() → add_drawers_batch() [collection.upsert()]
  ↓ concurrent file processing, batch writes, risk of SQLite contention

Key flow changes:

  1. collection.add() → collection.upsert() — overwrites instead of skip-on-duplicate
  2. Sequential → concurrent file processing — ChromaDB thread safety at risk
  3. Per-chunk writes → per-file batch writes — good optimization regardless of threading
  4. Return value semantics changed — count of all chunks vs count of new chunks

Recommendation: The batch upsert optimization (#1 in PR description) is solid and should be kept. The threading (#2) should be restructured to parallelize I/O only, with writes serialized back to the main thread.


Created by Octocode MCP https://octocode.ai 🔍🐙

@RasimAbiyev
Author

Hi @bgauryy, thank you for the detailed review!

I've addressed all the issues:

  1. upsert → add: Replaced collection.upsert() with collection.add() in add_drawers_batch(), with fallback handling for duplicates.
  2. SQLite contention: Restructured threading — file reading and chunking are now parallelized, but ChromaDB writes happen serially on the main thread.
  3. Error handling: Added try/except around future.result() so one bad file doesn't crash the entire mining operation.

Ready for re-review!

@RasimAbiyev force-pushed the fix/windows-mine-performance branch from ace899d to 4259566 on April 9, 2026, 13:15

@web3guru888 left a comment


Batch upserts are the right fix for the per-chunk round-trip overhead. The performance improvement will be significant for large projects. A few concerns:

Batch upsert (add_drawers_batch):

  • Uses collection.add() instead of collection.upsert() — this means re-mining the same file will raise a duplicate ID error. The catch for "already exists" handles it, but returns 0 (entire batch lost) even if only one chunk was a duplicate. Consider using upsert() instead, which is idempotent, or falling back to per-chunk adds when the batch fails (sketched after this list).
  • The function builds all IDs/docs/metadatas in memory before calling add(). For very large files (e.g., a 10MB transcript that produces 200 chunks), this is fine — but worth noting there's no batch-size cap. ChromaDB can handle it, but documenting a max batch size would be good hygiene.
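
A hypothetical shape of that fallback, assuming add() raises on duplicate IDs as the existing add_drawer() relies on:

  def add_drawers_batch(collection, documents, ids, metadatas):
      try:
          # Fast path: one round-trip; raises if any ID already exists
          collection.add(documents=documents, ids=ids, metadatas=metadatas)
          return len(ids)
      except Exception:
          # Slow path: retry chunk by chunk so only true duplicates are
          # skipped and the count of newly filed drawers stays accurate
          added = 0
          for doc, id_, meta in zip(documents, ids, metadatas):
              try:
                  collection.add(documents=[doc], ids=[id_], metadatas=[meta])
                  added += 1
              except Exception:
                  pass  # this chunk was mined before
          return added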

Thread safety concerns with ThreadPoolExecutor:

  • process_file() now returns just drawers (int) instead of (drawers, room). The room is then re-detected in the main loop via detect_room(filepath, "", rooms, project_path) — but with an empty string for content. This means room detection in the summary will be different from the room used during actual processing (where the file content was available). The room_counts dict will be inaccurate.
  • total_drawers and files_skipped are modified from the main thread after as_completed(), so no race condition there — that's correct.
  • print() from multiple threads will interleave output. Consider collecting results and printing in order, or using a lock.

workers = min(4, os.cpu_count() or 1):

  • 4 workers is reasonable. But ChromaDB's PersistentClient uses SQLite under the hood, and SQLite has a single-writer lock. Multiple threads calling collection.add() concurrently will serialize on the write lock anyway. The parallelism helps with file I/O and chunking, but the ChromaDB writes won't truly parallelize. Worth measuring actual speedup — might want to separate the read/chunk phase from the write phase.

The batch upsert alone is a solid improvement. I'd suggest separating it from the threading change — batch upsert is safe and high-impact, threading needs more testing for correctness.

🔭 Reviewed as part of the MemPalace-AGI integration project — autonomous research with perfect memory. Community interaction updates are posted regularly on the dashboard.

@RasimAbiyev
Author

Hi @web3guru888, thank you for the review!

Addressed your feedback:

  1. upsert restored: Switched back to collection.upsert() in add_drawers_batch() — idempotent and correct since file_already_mined() already guards against unnecessary re-processing.
  2. Room detection fixed: process_file() now returns (drawers, room) so the main loop uses the actual room detected during processing — no more empty-string re-detection in the summary.

Ready for re-review!

@bensig
Collaborator

bensig commented Apr 11, 2026

Conflicts with main. Batch upserts and parallel processing are being addressed in pieces by other PRs (#507, #562). Thanks for the Windows work.

