perf: batch writes, concurrent mining, bulk mtime pre-fetch #629
jphein wants to merge 3 commits into MemPalace:develop from
Conversation
Pull request overview
This PR focuses on improving mining performance and correctness in MemPalace by reducing ChromaDB call volume, enabling concurrent file preparation, and fixing mtime/collection defaults that previously caused unnecessary re-mining and negative similarity behavior.
Changes:
- Added bulk prefetch of mined mtimes + epsilon mtime comparison, and defaulted new collections to cosine distance.
- Implemented concurrent mining (ThreadPoolExecutor) and batched upserts for higher throughput.
- Extended miner configurability and filtering (chunk sizing via config; skip patterns/junk size), and expanded entity detector stopwords.
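The epsilon mtime comparison mentioned above can be sketched as follows. This is a minimal illustration, not the PR's actual code; the function name, the `0.01` threshold, and the `stored_mtime is None` convention for never-mined files are assumptions drawn from the PR description:

```python
# Hypothetical sketch of the epsilon-based mtime check described in this PR.
# Exact float equality (==) breaks when mtimes round-trip through metadata
# storage; an epsilon comparison tolerates small representation drift.
_MTIME_EPSILON = 0.01  # threshold stated in the PR description

def needs_remining(path_mtime: float, stored_mtime: "float | None") -> bool:
    """Return True if the file should be (re-)mined."""
    if stored_mtime is None:
        return True  # no record of this file: mine it
    return abs(path_mtime - stored_mtime) >= _MTIME_EPSILON
```

With this check, a file whose stored mtime differs only by sub-centisecond noise is skipped instead of being re-mined on every run.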
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/test_miner.py | Adds unit tests for detect_room() priority/word-boundary behavior and chunk_text() chunking invariants. |
| mempalace/palace.py | Defaults newly created collections to cosine space; fixes mtime dedup via epsilon; adds bulk_check_mined() for paginated metadata prefetch. |
| mempalace/miner.py | Adds skip filters, configurable chunking, concurrent mining path, and bulk mined-map skip logic; refactors file processing to support batching. |
| mempalace/entity_detector.py | Expands STOPWORDS with common technical terms to reduce false-positive entity extraction. |
| mempalace/convo_miner.py | Switches convo ingestion to batched upserts instead of per-chunk writes. |
| mempalace/config.py | Adds chunk sizing/overlap/min-size settings to user config with defaults. |
```python
collection.upsert(
    documents=batch_docs,
    ids=batch_ids,
    metadatas=batch_metas,
)
```
process_file() now upserts all chunks from a file in a single request. With MAX_FILE_SIZE = 10MB and chunk_size=800, a single file can produce thousands of chunks, which defeats the intended “~100 docs/call” batching and can create very large request payloads/memory spikes. Consider slicing batch_docs/ids/metadatas into _UPSERT_BATCH_SIZE batches (as in convo_miner.py).
Good point. The _UPSERT_BATCH_SIZE constant was intended to cap this, but the single-file path doesn't enforce it. Will add chunked upsert for the single-file path.
Fixed — single-file path now chunks upserts into _UPSERT_BATCH_SIZE slices via for i in range(0, len(batch_docs), _UPSERT_BATCH_SIZE). 79e3d22
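The fix described in this thread can be sketched as a small helper. This is an illustration of the slicing pattern, not the PR's code; the helper name is hypothetical and the batch size stands in for `_UPSERT_BATCH_SIZE`:

```python
def upsert_in_batches(collection, docs, ids, metas, batch_size=100):
    """Slice one file's chunks so no single upsert exceeds the batch cap.

    Hypothetical helper illustrating the range-step loop from the thread;
    batch_size plays the role of _UPSERT_BATCH_SIZE.
    """
    for i in range(0, len(docs), batch_size):
        collection.upsert(
            documents=docs[i:i + batch_size],
            ids=ids[i:i + batch_size],
            metadatas=metas[i:i + batch_size],
        )
```

A 10MB file chunked at 800 characters then issues many ~100-doc requests instead of one multi-thousand-doc payload.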
```python
# Flush when batch is large enough
if len(pending_docs) >= _UPSERT_BATCH_SIZE:
    collection.upsert(
        documents=pending_docs,
        ids=pending_ids,
        metadatas=pending_metas,
    )
    pending_docs = []
    pending_ids = []
    pending_metas = []
```
The concurrent-path flush can still upsert far more than _UPSERT_BATCH_SIZE in a single call: pending_docs may jump from <100 to 300+ after extending with one large file’s chunks, and then the code upserts the entire list. To enforce batch size, flush in a loop and only upsert fixed-size slices until len(pending_docs) < _UPSERT_BATCH_SIZE.
Suggested change:

```diff
-# Flush when batch is large enough
-if len(pending_docs) >= _UPSERT_BATCH_SIZE:
-    collection.upsert(
-        documents=pending_docs,
-        ids=pending_ids,
-        metadatas=pending_metas,
-    )
-    pending_docs = []
-    pending_ids = []
-    pending_metas = []
+# Flush in fixed-size batches so no single upsert exceeds the limit
+while len(pending_docs) >= _UPSERT_BATCH_SIZE:
+    collection.upsert(
+        documents=pending_docs[:_UPSERT_BATCH_SIZE],
+        ids=pending_ids[:_UPSERT_BATCH_SIZE],
+        metadatas=pending_metas[:_UPSERT_BATCH_SIZE],
+    )
+    pending_docs = pending_docs[_UPSERT_BATCH_SIZE:]
+    pending_ids = pending_ids[_UPSERT_BATCH_SIZE:]
+    pending_metas = pending_metas[_UPSERT_BATCH_SIZE:]
```
Agreed — the extend-then-check pattern has a TOCTOU gap. Will restructure to flush in fixed-size batches.
Fixed — changed if to while loop that drains all full batches, slicing at _UPSERT_BATCH_SIZE boundaries. This prevents the TOCTOU where a single large future result could exceed the batch size. 79e3d22
```python
files = files[:limit]

if workers <= 0:
    workers = min(8, os.cpu_count() or 4)
```
mine() auto-enables concurrency by default (workers is set when <= 0). The CLI currently calls mine() without a workers argument and does not expose a --workers flag, so users can’t control/disable concurrency. Either default to workers=1 unless explicitly set, or add a CLI option and update help/text to match.
Suggested change:

```diff
-    workers = min(8, os.cpu_count() or 4)
+    workers = 1
```
The CLI does expose --workers — see the argparse addition in cli.py in this same PR (p_mine.add_argument('--workers', ...)).
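For context, the CLI flag referred to in this reply would be wired up along these lines. The `--workers` name comes from the thread; the subcommand layout, default, and help text are assumptions, not the actual cli.py contents:

```python
import argparse

# Hypothetical reconstruction of the --workers flag discussed above.
parser = argparse.ArgumentParser(prog="mempalace")
sub = parser.add_subparsers(dest="command")
p_mine = sub.add_parser("mine")
p_mine.add_argument(
    "--workers", type=int, default=0,
    help="Thread count for concurrent mining; <=0 auto-selects min(8, cpu_count)",
)

args = parser.parse_args(["mine", "--workers", "4"])
```

A `default=0` paired with the `workers <= 0` auto-selection in `mine()` lets users pin a worker count or pass `--workers 1` to disable concurrency.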
```python
total_drawers += drawers
room_counts[room or "general"] += 1
if not dry_run:
    print(f" \u2713 [{i:4}/{len(files)}] {filepath.name[:50]:50} +{drawers}")
```
This progress print uses U+2713 ("✓"). That character raises UnicodeEncodeError on Windows consoles using cp1251/cp1252 (non-UTF-8) encodings. Prefer ASCII output (e.g., "+"/"OK") or gate the symbol based on stdout encoding support.
Suggested change:

```diff
-    print(f" \u2713 [{i:4}/{len(files)}] {filepath.name[:50]:50} +{drawers}")
+    print(f" OK [{i:4}/{len(files)}] {filepath.name[:50]:50} +{drawers}")
```
Will replace with ASCII + to match the convo_miner.py fix in PR #626.
```python
f" \u2713 [{processed_count:4}/{len(files_to_process)}] "
f"{filepath.name[:50]:50} +{len(batch_docs)}"
```
This progress print uses U+2713 ("✓"), which can raise encoding errors on some Windows terminals. Prefer ASCII output (e.g., "+"/"OK") or gate the symbol based on stdout encoding support.
Same fix — will replace with ASCII +.
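Gating the symbol on stdout encoding support, the alternative mentioned in both review comments, could be sketched like this (hypothetical helper, not part of the PR):

```python
import sys

def progress_mark(stream=sys.stdout) -> str:
    """Return '✓' only when the stream's encoding can represent it,
    falling back to ASCII 'OK' on cp1251/cp1252 consoles."""
    enc = getattr(stream, "encoding", None) or "ascii"
    try:
        "\u2713".encode(enc)
        return "\u2713"
    except (UnicodeEncodeError, LookupError):
        return "OK"
```

The PR instead opted for unconditional ASCII output, which is simpler and avoids per-platform behavior differences in CI logs.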
- Batch upserts (100 docs/call) in both general miner and convo_miner instead of one-at-a-time writes — 3-5x faster on large projects
- _prepare_file() separates read/chunk (thread-safe) from write, enabling ThreadPoolExecutor concurrency with --workers flag
- bulk_check_mined() pre-fetches all source_file/mtime pairs in one paginated scan instead of per-file ChromaDB queries
- Epsilon mtime comparison (abs < 0.01) instead of float == (MemPalace#483)
- Cosine distance default for new collections (hnsw:space=cosine)
- SKIP_PATTERNS/JUNK_FILE_SIZE filter minified JS, lockfiles, large dumps
- detect_room() uses word-boundary regex and exact-match priority
- chunk_text() accepts configurable size/overlap/min params
- Entity detector STOPWORDS expanded (73 technical terms)
- Configurable chunk_size/chunk_overlap/min_chunk_size in palace config

Split from MemPalace#562 per maintainer request. Closes MemPalace#483, MemPalace#568.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ent flush

- Single-file path: chunk upserts into _UPSERT_BATCH_SIZE slices
- Concurrent path: while loop drains all full batches (was: if, only one)
- Replace Unicode ✓ with ASCII + for Windows CI compatibility

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Upstream bumped pyproject.toml to 3.2.0 in MemPalace#761 but missed version.py. Aligns the two so test_version_consistency passes.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Closing this as mostly superseded by upstream work that landed in parallel:
The remaining novel bits here (
MemPalace#629 and MemPalace#632 were closed upstream as fully superseded — all three features in MemPalace#632 (`--version`, `purge`, `repair`) shipped in v3.3.0; MemPalace#629's core (batching, file locking, skip patterns, stopwords) landed independently via upstream's own paths. Move both to the closed list and update the live/closed counts.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Summary
Performance improvements for mining and storage. Split from #562 per maintainer request (PR 2 of 6).
- `--workers` flag for concurrent file processing
- Epsilon mtime comparison (`abs < 0.01`) instead of float `==` — fixes BUG: Float mtime comparison breaks file deduplication — every file re-mined on each run #475, fix: use epsilon comparison for mtime dedup + add bulk pre-fetch #483
- Cosine distance default (`hnsw:space=cosine`) — fixes fix: default collection to cosine distance and clamp similarity scores #568

Closes
Related
Test plan
- `python -m pytest tests/test_miner.py -v` — 37 tests including new batch/concurrent tests
- `python -m pytest tests/ -v` — full suite (613 pass on this branch)
- `--workers 4` and verify speedup