perf: batch writes, concurrent mining, bulk mtime pre-fetch#629

Closed
jphein wants to merge 3 commits into MemPalace:develop from jphein:pr/performance

Conversation

@jphein
Collaborator

@jphein jphein commented Apr 11, 2026

Summary

Performance improvements for mining and storage. Split from #562 per maintainer request (PR 2 of 6).

Test plan

  • python -m pytest tests/test_miner.py -v — 37 tests including new batch/concurrent tests
  • python -m pytest tests/ -v — full suite (613 pass on this branch)
  • Mine a large project with --workers 4 and verify speedup
  • Verify SKIP_PATTERNS excludes .min.js, .lock, .map files

Copilot AI review requested due to automatic review settings April 11, 2026 14:23
Contributor

Copilot AI left a comment

Pull request overview

This PR focuses on improving mining performance and correctness in MemPalace by reducing ChromaDB call volume, enabling concurrent file preparation, and fixing mtime/collection defaults that previously caused unnecessary re-mining and negative similarity behavior.

Changes:

  • Added bulk prefetch of mined mtimes + epsilon mtime comparison, and defaulted new collections to cosine distance.
  • Implemented concurrent mining (ThreadPoolExecutor) and batched upserts for higher throughput.
  • Extended miner configurability and filtering (chunk sizing via config; skip patterns/junk size), and expanded entity detector stopwords.
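
The epsilon comparison in the first bullet can be sketched as below. This is an illustrative reconstruction, not the PR's actual code: is_already_mined and MTIME_EPSILON are made-up names, and 0.01 matches the abs < 0.01 threshold cited in the commit message.

```python
# Sketch of the epsilon-based mtime check (illustrative names; the real
# logic lives in mempalace/palace.py and miner.py per the PR description).
MTIME_EPSILON = 0.01  # threshold from the commit message (abs < 0.01)

def is_already_mined(mined_mtimes, path, mtime):
    """Return True when the stored mtime matches within epsilon.

    A strict float == comparison breaks after the timestamp round-trips
    through ChromaDB metadata, so every file looked changed on each run.
    """
    stored = mined_mtimes.get(path)
    return stored is not None and abs(stored - mtime) < MTIME_EPSILON
```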

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file

  • tests/test_miner.py: Adds unit tests for detect_room() priority/word-boundary behavior and chunk_text() chunking invariants.
  • mempalace/palace.py: Defaults newly created collections to cosine space; fixes mtime dedup via epsilon; adds bulk_check_mined() for paginated metadata prefetch.
  • mempalace/miner.py: Adds skip filters, configurable chunking, concurrent mining path, and bulk mined-map skip logic; refactors file processing to support batching.
  • mempalace/entity_detector.py: Expands STOPWORDS with common technical terms to reduce false-positive entity extraction.
  • mempalace/convo_miner.py: Switches convo ingestion to batched upserts instead of per-chunk writes.
  • mempalace/config.py: Adds chunk sizing/overlap/min-size settings to user config with defaults.
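
The bulk_check_mined() prefetch described for mempalace/palace.py amounts to one paginated metadata scan instead of a ChromaDB query per file. A rough sketch, assuming ChromaDB's collection.get(limit=..., offset=..., include=...) pagination and the source_file/mtime metadata keys named in this PR; the page size is an assumption:

```python
def bulk_check_mined(collection, page_size=500):
    """Build a {source_file: mtime} map in one paginated scan.

    page_size is an assumed value; the real implementation may differ.
    """
    mined = {}
    offset = 0
    while True:
        # Fetch one page of metadata; stop when the page comes back empty.
        page = collection.get(
            include=["metadatas"], limit=page_size, offset=offset
        )
        metas = page.get("metadatas") or []
        if not metas:
            break
        for meta in metas:
            source = meta.get("source_file")
            if source is not None:
                mined[source] = meta.get("mtime")
        offset += len(metas)
    return mined
```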


Comment thread mempalace/miner.py Outdated
Comment on lines +580 to +584
collection.upsert(
    documents=batch_docs,
    ids=batch_ids,
    metadatas=batch_metas,
)

Copilot AI Apr 11, 2026

process_file() now upserts all chunks from a file in a single request. With MAX_FILE_SIZE = 10MB and chunk_size=800, a single file can produce thousands of chunks, which defeats the intended “~100 docs/call” batching and can create very large request payloads/memory spikes. Consider slicing batch_docs/ids/metadatas into _UPSERT_BATCH_SIZE batches (as in convo_miner.py).

Collaborator Author

Good point. The _UPSERT_BATCH_SIZE constant was intended to cap this, but the single-file path doesn't enforce it. Will add chunked upsert for the single-file path.

Collaborator Author

Fixed — single-file path now chunks upserts into _UPSERT_BATCH_SIZE slices via for i in range(0, len(batch_docs), _UPSERT_BATCH_SIZE). 79e3d22
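
The fix can be sketched as a small slicing loop. upsert_in_batches is an illustrative wrapper (the PR inlines the loop in process_file()), and the batch size of 100 is assumed from the "~100 docs/call" figure above:

```python
_UPSERT_BATCH_SIZE = 100  # assumed from the "~100 docs/call" target above

def upsert_in_batches(collection, docs, ids, metas,
                      batch_size=_UPSERT_BATCH_SIZE):
    """Upsert one file's chunks in fixed-size slices so a single large
    file cannot produce an oversized request payload."""
    for i in range(0, len(docs), batch_size):
        collection.upsert(
            documents=docs[i:i + batch_size],
            ids=ids[i:i + batch_size],
            metadatas=metas[i:i + batch_size],
        )
```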

Comment thread mempalace/miner.py Outdated
Comment on lines +842 to +851
# Flush when batch is large enough
if len(pending_docs) >= _UPSERT_BATCH_SIZE:
    collection.upsert(
        documents=pending_docs,
        ids=pending_ids,
        metadatas=pending_metas,
    )
    pending_docs = []
    pending_ids = []
    pending_metas = []

Copilot AI Apr 11, 2026

The concurrent-path flush can still upsert far more than _UPSERT_BATCH_SIZE in a single call: pending_docs may jump from <100 to 300+ after extending with one large file’s chunks, and then the code upserts the entire list. To enforce batch size, flush in a loop and only upsert fixed-size slices until len(pending_docs) < _UPSERT_BATCH_SIZE.

Suggested change

# Flush when batch is large enough
if len(pending_docs) >= _UPSERT_BATCH_SIZE:
    collection.upsert(
        documents=pending_docs,
        ids=pending_ids,
        metadatas=pending_metas,
    )
    pending_docs = []
    pending_ids = []
    pending_metas = []

# Flush in fixed-size batches so no single upsert exceeds the limit
while len(pending_docs) >= _UPSERT_BATCH_SIZE:
    collection.upsert(
        documents=pending_docs[:_UPSERT_BATCH_SIZE],
        ids=pending_ids[:_UPSERT_BATCH_SIZE],
        metadatas=pending_metas[:_UPSERT_BATCH_SIZE],
    )
    pending_docs = pending_docs[_UPSERT_BATCH_SIZE:]
    pending_ids = pending_ids[_UPSERT_BATCH_SIZE:]
    pending_metas = pending_metas[_UPSERT_BATCH_SIZE:]

Collaborator Author

Agreed — the extend-then-check pattern has a TOCTOU gap. Will restructure to flush in fixed-size batches.

Collaborator Author

Fixed — changed if to while loop that drains all full batches, slicing at _UPSERT_BATCH_SIZE boundaries. This prevents the TOCTOU where a single large future result could exceed the batch size. 79e3d22

Comment thread mempalace/miner.py
files = files[:limit]

if workers <= 0:
    workers = min(8, os.cpu_count() or 4)

Copilot AI Apr 11, 2026

mine() auto-enables concurrency by default (workers is set when <= 0). The CLI currently calls mine() without a workers argument and does not expose a --workers flag, so users can't control or disable concurrency. Either default to workers=1 unless explicitly set, or add a CLI option and update the help text to match.

Suggested change
workers = min(8, os.cpu_count() or 4)
workers = 1

Collaborator Author

The CLI does expose --workers — see the argparse addition in cli.py in this same PR (p_mine.add_argument('--workers', ...)).
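
For context, the prepare-then-write split that --workers controls can be sketched as below. prepare and write are stand-ins for the PR's _prepare_file() and its serialized upsert path (the real signatures differ); only the worker-count default mirrors the diff quoted above:

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

def mine_concurrently(files, prepare, write, workers=0):
    """Read/chunk files in threads; keep the ChromaDB write serialized.

    prepare/write are illustrative hooks, not the PR's real signatures.
    """
    if workers <= 0:
        workers = min(8, os.cpu_count() or 4)  # default from the diff above
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(prepare, f): f for f in files}
        for future in as_completed(futures):
            write(future.result())  # write runs on the main thread only
```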

Comment thread mempalace/miner.py Outdated
total_drawers += drawers
room_counts[room or "general"] += 1
if not dry_run:
    print(f" \u2713 [{i:4}/{len(files)}] {filepath.name[:50]:50} +{drawers}")

Copilot AI Apr 11, 2026

This progress print uses U+2713 ("✓"), which can raise a UnicodeEncodeError on Windows consoles that use cp1251/cp1252 (non-UTF-8) code pages. Prefer ASCII output (e.g., "+"/"OK") or gate the symbol on stdout encoding support.

Suggested change
print(f" \u2713 [{i:4}/{len(files)}] {filepath.name[:50]:50} +{drawers}")
print(f" OK [{i:4}/{len(files)}] {filepath.name[:50]:50} +{drawers}")

Collaborator Author

Will replace with ASCII + to match the convo_miner.py fix in PR #626.

Collaborator Author

Fixed — replaced with ASCII +. 79e3d22
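
The reviewer's alternative (gating the symbol on stdout encoding support) would look roughly like this sketch; the PR chose plain ASCII + instead:

```python
import sys

def progress_mark():
    """Return "✓" only when stdout can encode it, else ASCII "+"."""
    encoding = getattr(sys.stdout, "encoding", None) or "ascii"
    try:
        "\u2713".encode(encoding)
        return "\u2713"
    except (UnicodeEncodeError, LookupError):
        return "+"
```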

Comment thread mempalace/miner.py Outdated
Comment on lines +856 to +857
f" \u2713 [{processed_count:4}/{len(files_to_process)}] "
f"{filepath.name[:50]:50} +{len(batch_docs)}"

Copilot AI Apr 11, 2026

This progress print uses U+2713 ("✓"), which can raise encoding errors on some Windows terminals. Prefer ASCII output (e.g., "+"/"OK") or gate the symbol based on stdout encoding support.

Collaborator Author

Same fix — will replace with ASCII +.

Collaborator Author

Fixed — replaced with ASCII +. 79e3d22

jphein and others added 2 commits April 13, 2026 07:20
- Batch upserts (100 docs/call) in both general miner and convo_miner
  instead of one-at-a-time writes — 3-5x faster on large projects
- _prepare_file() separates read/chunk (thread-safe) from write,
  enabling ThreadPoolExecutor concurrency with --workers flag
- bulk_check_mined() pre-fetches all source_file/mtime pairs in one
  paginated scan instead of per-file ChromaDB queries
- Epsilon mtime comparison (abs < 0.01) instead of float == (MemPalace#483)
- Cosine distance default for new collections (hnsw:space=cosine)
- SKIP_PATTERNS/JUNK_FILE_SIZE filter minified JS, lockfiles, large dumps
- detect_room() uses word-boundary regex and exact-match priority
- chunk_text() accepts configurable size/overlap/min params
- Entity detector STOPWORDS expanded (73 technical terms)
- Configurable chunk_size/chunk_overlap/min_chunk_size in palace config

Split from MemPalace#562 per maintainer request. Closes MemPalace#483, MemPalace#568.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ent flush

- Single-file path: chunk upserts into _UPSERT_BATCH_SIZE slices
- Concurrent path: while loop drains all full batches (was: if, only one)
- Replace Unicode ✓ with ASCII + for Windows CI compatibility

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Upstream bumped pyproject.toml to 3.2.0 in MemPalace#761 but missed version.py.
Aligns the two so test_version_consistency passes.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@igorls igorls added the storage label Apr 14, 2026
@jphein
Collaborator Author

jphein commented Apr 16, 2026

Closing this as mostly superseded by upstream work that landed in parallel:

  • Batch upserts → upstream has batching in cli.py (5000/batch)
  • File-level locking for concurrent miners → fcntl.flock in palace.py
  • Cosine distance default, SKIP_PATTERNS for minified/lockfile exclusions, entity-detector stopword expansion (via the i18n refactor) — all shipped
  • Epsilon mtime comparison → already lives in my fork as a separate change

The remaining novel bits here (--workers ThreadPool mining and bulk_check_mined() paginated pre-fetch) are optional performance tweaks, not bug fixes. If they prove valuable later I'll open them as small standalone PRs rather than drag this 8-file branch forward. Thanks for the reviews along the way.

@jphein jphein closed this Apr 16, 2026
jphein added a commit to jphein/mempalace that referenced this pull request Apr 16, 2026
MemPalace#629 and MemPalace#632 were closed upstream as fully superseded — all three
features in MemPalace#632 (`--version`, `purge`, `repair`) shipped in v3.3.0;
MemPalace#629's core (batching, file locking, skip patterns, stopwords) landed
independently via upstream's own paths. Move both to the closed list
and update the live/closed counts.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@jphein jphein deleted the pr/performance branch April 19, 2026 03:51

Labels

area/kg (Knowledge graph), area/mining (File and conversation mining), performance (Performance improvements), storage

Development

Successfully merging this pull request may close these issues.

BUG: Float mtime comparison breaks file deduplication — every file re-mined on each run

3 participants