perf: batch writes, concurrent mining, entity fixes, configurable chunks #492

jphein wants to merge 41 commits into MemPalace:main
Conversation
Float equality on mtime fails due to JSON round-trip precision loss, causing every file to be re-mined on each run. Use epsilon < 0.01. Also adds bulk_check_mined() for fetching all source_file/mtime pairs in paginated batches — turns 25K individual DB queries into ~5 fetches. Fixes MemPalace#475 Co-Authored-By: Claude Opus 4.6 <[email protected]>
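A minimal sketch of the epsilon comparison this commit describes; the function name, constant, and error handling below are illustrative, not MemPalace's actual API.

```python
import os

# Hedged sketch of the epsilon mtime comparison; names are assumptions.
MTIME_EPSILON = 0.01  # tolerate JSON round-trip float precision loss

def is_already_mined(stored_mtime, filepath):
    """True when the stored mtime matches the file's current mtime within
    epsilon; exact float equality would re-mine every file on each run."""
    try:
        current = os.path.getmtime(filepath)
        return abs(float(stored_mtime) - current) < MTIME_EPSILON
    except (OSError, TypeError, ValueError):
        return False  # missing file or unparsable stored value
```

The `float()` cast matters because metadata round-tripped through JSON can come back as a string.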
…decls
- Clamp tool_search limit to [1, 100] to prevent memory exhaustion
- Replace hardcoded limit=10000 in status/taxonomy tools with paginated _fetch_all_metadata() helper (matches palace_graph.py pattern)
- Remove duplicate _client_cache/_collection_cache declarations

Fixes MemPalace#477, MemPalace#478, MemPalace#479 Co-Authored-By: Claude Opus 4.6 <[email protected]>
Accumulate all chunks for a file into lists, then issue a single collection.upsert() (miner) or collection.add() (convo_miner) call. Reduces 125K-375K individual DB round-trips to ~25K batched calls. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Prevents false positives like Handler, Node, Service, Manager, Client being flagged as project/person entities in code-heavy directories. Fixes MemPalace#476 Co-Authored-By: Claude Opus 4.6 <[email protected]>
Adds min_similarity parameter (L2 distance cutoff) to search_memories() and MCP tool_search (default 1.5). Filters out clearly irrelevant results instead of always returning top-N regardless of quality. Co-Authored-By: Claude Opus 4.6 <[email protected]>
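A sketch of the cutoff this commit adds; the nested-list result shape mirrors chromadb's `query()` return, but the field handling here is illustrative.

```python
# L2 distance: lower means more similar, so we keep hits at or below the cap.
DEFAULT_MAX_DISTANCE = 1.5

def filter_by_distance(results, max_distance=DEFAULT_MAX_DISTANCE):
    """Drop hits whose L2 distance exceeds the cutoff instead of always
    returning the top-N regardless of quality."""
    return [
        (doc, dist)
        for doc, dist in zip(results["documents"][0], results["distances"][0])
        if dist <= max_distance
    ]
```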
- Updated STOP_BLOCK_REASON to instruct AI to use mempalace_diary_write and mempalace_add_drawer instead of generic "memory system"
- Updated PRECOMPACT_BLOCK_REASON with same MCP tool instructions
- Added _ingest_transcript() to mine Claude Code JSONL transcripts into the palace automatically on stop/precompact triggers
- Transcript goes into a "sessions" wing via convo_miner

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Documents fork relationship, key files, development workflow, fork changes, upstream PRs, and integration details. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Mining:
- Added _prepare_file() for thread-safe file processing (read/chunk/route)
- mine() now supports --workers flag (default: min(8, cpu_count))
- Concurrent path: bulk mtime pre-fetch, parallel _prepare_file(), serialized ChromaDB writes in batches of 100. Sequential path unchanged (workers=1).

Room routing:
- Priority 1: exact folder match only (no substring)
- Priority 2: exact filename match only
- Content scan increased from 2KB to 5KB (full file if <10KB)
- Keyword scoring uses word-boundary regex instead of substring count
- Added 13 unit tests for detect_room covering all priority paths

Co-Authored-By: Claude Opus 4.6 <[email protected]>
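The word-boundary scoring change can be sketched as follows; the keyword lists and weights in MemPalace's actual detect_room are not shown, and the names here are illustrative.

```python
import re

def keyword_score(text, keywords):
    """Count whole-word hits only, so keyword 'test' no longer matches
    'latest' or 'testing' the way a substring count does."""
    return sum(
        len(re.findall(rf"\b{re.escape(kw)}\b", text, re.IGNORECASE))
        for kw in keywords
    )
```

Compare against `str.count()`, which would score the same sentence three times for the keyword "test".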
New exporter.py: paginates all drawers, groups by wing/room, writes browsable markdown tree with index.md table of contents. Each drawer becomes a blockquoted section with metadata table. Usage: mempalace export -o ./palace-export Also fixes test_cli.py for new --workers arg on mine subparser. Co-Authored-By: Claude Opus 4.6 <[email protected]>
… cache

I7: Three new MCP tools — get_drawer, list_drawers (paginated), update_drawer (with WAL audit logging and input sanitization).
I8: WAL file chmod(0o600) now only runs on file creation instead of every write call.
I9: 5-second TTL metadata cache for status/wings/taxonomy tools. Eliminates redundant full-palace pagination when tools are called in quick succession.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
I6: Chunk size/overlap/min now configurable via ~/.mempalace/config.json instead of hardcoded constants. Wired through mine() → process_file() → chunk_text().
I11: Layer1.generate() capped at MAX_SCAN=2000 drawers (was unbounded). Reduces wake-up from 250+ ChromaDB round-trips to 4 max.
I12: Extracted _build_where_filter() helper in searcher.py, replaced 5 duplicate where-filter blocks across searcher.py and layers.py.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
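A shared where-filter builder like the one I12 describes might look like this, using Chroma's `$and` operator; the metadata field names (wing, room, source_file) come from this PR, but the helper's signature is an assumption.

```python
def build_where_filter(wing=None, room=None, source_file=None):
    """Combine optional metadata constraints into one Chroma where clause."""
    clauses = []
    if wing:
        clauses.append({"wing": wing})
    if room:
        clauses.append({"room": room})
    if source_file:
        clauses.append({"source_file": source_file})
    if not clauses:
        return None          # no filter: match everything
    if len(clauses) == 1:
        return clauses[0]    # Chroma expects $and only with 2+ clauses
    return {"$and": clauses}
```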
I10: 10 unit tests for chunk_text() covering boundaries, overlap, indices, empty/whitespace input, content preservation.
I13: KG query_entity default direction aligned from "outgoing" to "both" to match the MCP schema default.
I14: Plugin versions synced to 3.1.0 in both .claude-plugin/ and .codex-plugin/.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Pull request overview
This PR focuses on improving mining/search performance and configurability by batching ChromaDB writes, introducing concurrent file processing, refining room routing logic, and making chunk sizing configurable via user config.
Changes:
- Refactors project mining to support concurrent read/chunk/route with sequential batched ChromaDB upserts, plus improved room detection and keyword scoring.
- Deduplicates ChromaDB `where` filter construction across search and memory layers via a shared helper.
- Adds config-driven chunk sizing defaults and expands unit tests around `detect_room()` and `chunk_text()`.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `mempalace/miner.py` | Adds concurrent mining pipeline, batched upserts, improved `detect_room`, and configurable chunking parameters. |
| `mempalace/searcher.py` | Introduces `_build_where_filter()` helper and applies it in search functions. |
| `mempalace/layers.py` | Uses shared where-filter helper and adds an L1 scan cap (`MAX_SCAN`). |
| `mempalace/config.py` | Adds `chunk_size`, `chunk_overlap`, `min_chunk_size` to persisted config and exposes them via properties. |
| `mempalace/convo_miner.py` | Batches conversation chunk inserts into a single `collection.add()` per file. |
| `mempalace/knowledge_graph.py` | Changes `query_entity()` default direction to "both". |
| `mempalace/entity_detector.py` | Expands STOPWORDS with common technical terms to reduce false entities. |
| `tests/test_miner.py` | Adds unit tests for `detect_room()` routing and `chunk_text()` behavior. |
| `.codex-plugin/plugin.json` | Bumps plugin version to 3.1.0. |
| `.claude-plugin/plugin.json` | Bumps plugin version to 3.1.0. |
```diff
 import chromadb

-from .palace import SKIP_DIRS, get_collection, file_already_mined
+from .palace import SKIP_DIRS, get_collection, file_already_mined, bulk_check_mined
```
bulk_check_mined is imported here but is not defined anywhere in the codebase (and mempalace/palace.py currently only exports get_collection and file_already_mined). Since mine() defaults to workers>1, this will raise an ImportError in the common path. Add bulk_check_mined to mempalace/palace.py (or stop importing it and use an existing API).
Fixed — removed the unused filepaths param from bulk_check_mined() and added logging.getLogger with a logger.warning() on partial fetch instead of the bare except: pass. Both the function and its sole caller in mine() are updated.
| """Check if a file is already mined using the bulk-fetched mined_map. | ||
|
|
||
| Compares stored mtime against current file mtime, matching the logic | ||
| in file_already_mined() but without per-file DB queries. | ||
| """ |
Docstring says this matches file_already_mined() logic, but file_already_mined() currently compares mtimes with exact equality (float(stored_mtime) == current_mtime). Either align the implementation (or update file_already_mined() as well) so the behavior and documentation stay consistent.
Good catch. Both file_already_mined() and _is_already_mined() already use the epsilon tolerance (abs(...) < 0.01), so the logic is consistent. Updated the docstring in _is_already_mined to explicitly say 'epsilon tolerance' to make the consistency clear.
```python
try:
    current_mtime = os.path.getmtime(source_file)
except OSError:
    return False
return abs(stored_mtime - current_mtime) < 0.01
```
stored_mtime may come back from Chroma metadata as a string/decimal; this subtraction can raise TypeError and will cause the miner to treat the file as not-mined. Consider casting stored_mtime to float (mirroring file_already_mined) and decide explicitly whether you want a tolerance or exact comparison.
Fixed — added float() cast on stored_mtime and broadened the except to catch TypeError and ValueError alongside OSError, so string/decimal values from metadata won't crash the comparison.
```python
results = []
with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
    futures = {pool.submit(prepare_one, fp): fp for fp in files_to_process}
    for future in concurrent.futures.as_completed(futures):
        filepath, (batch_docs, batch_ids, batch_metas, room) = future.result()
        if batch_docs is None:
            with counter_lock:
                files_skipped += 1
            continue
        results.append((filepath, batch_docs, batch_ids, batch_metas, room))
        with counter_lock:
```
The concurrent path collects every file’s prepared docs/ids/metas into results before doing any writes. For large projects this can spike memory and delays persistence until all threads finish. Consider streaming writes as futures complete (accumulate into the pending batch and flush) instead of storing all results.
Fixed — merged Phase 1 and Phase 2 so writes stream as futures complete instead of buffering all results. Memory is now bounded by _UPSERT_BATCH_SIZE (100 docs) regardless of project size.
```python
from .config import MempalaceConfig
from .searcher import _build_where_filter
```
This imports _build_where_filter (leading underscore) from another module. Since it’s now used cross-module, consider making it a public helper (e.g. build_where_filter) and/or moving it to a shared utility module to avoid signaling a private API that other modules depend on.
Fixed — renamed _build_where_filter → build_where_filter since it's used cross-module (layers.py imports it from searcher.py). Updated all 5 call sites.
| content = "abcdefghij" * (CHUNK_SIZE * 3 // 10) | ||
| chunks = chunk_text(content, "/fake/file.py") | ||
| assert len(chunks) >= 2 | ||
| # Every character in the original must appear in at least one chunk | ||
| all_chunk_text = "".join(c["content"] for c in chunks) | ||
| for ch_pos, ch in enumerate(content): | ||
| assert ch in all_chunk_text, f"Character '{ch}' at position {ch_pos} not in any chunk" |
This test doesn’t actually verify that all original content is covered: assert ch in all_chunk_text only checks character presence, which will always pass for repeated characters and won’t detect missing segments. Use position-unique tokens (e.g. concatenated numeric markers) and assert each token appears in at least one chunk, or assert coverage by reconstructing expected ranges.
Fixed — replaced the repeated-character content with position-unique tokens ([T0000] through [T0299]) so the test actually verifies that every segment appears in a chunk, not just every character.
```diff
 collection.add(
-    documents=[chunk["content"]],
-    ids=[drawer_id],
-    metadatas=[
-        {
-            "wing": wing,
-            "room": chunk_room,
-            "source_file": source_file,
-            "chunk_index": chunk["chunk_index"],
-            "added_by": agent,
-            "filed_at": datetime.now().isoformat(),
-            "ingest_mode": "convos",
-            "extract_mode": extract_mode,
-        }
-    ],
+    documents=batch_docs,
+    ids=batch_ids,
+    metadatas=batch_metas,
 )
```
collection.add() is called once with all chunks from a file. For large conversation exports this can exceed SQLite/Chroma variable limits and fail the whole file insert. Consider splitting into multiple add() calls with a safe batch size (similar to _UPSERT_BATCH_SIZE in miner.py) and/or handling partial failures explicitly.
Fixed — convo_miner now batches collection.add() calls in groups of 100 (matching _UPSERT_BATCH_SIZE in miner.py) instead of one call per file.
```python
if chunk_size is None:
    chunk_size = CHUNK_SIZE
if chunk_overlap is None:
    chunk_overlap = CHUNK_OVERLAP
if min_chunk_size is None:
```
Since chunk_size/chunk_overlap are now configurable, chunk_text() should validate that 0 <= chunk_overlap < chunk_size (otherwise start = end - chunk_overlap can fail to advance and the loop can become infinite). Consider raising a clear error or clamping to a safe value.
Fixed — added validation at the top of chunk_text(): raises ValueError if chunk_overlap < 0 or chunk_overlap >= chunk_size.
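The validation this fix adds can be sketched as follows; the function names and error wording are illustrative.

```python
def chunk_params_valid(chunk_size, chunk_overlap):
    """The chunking loop advances by (chunk_size - chunk_overlap) each step,
    so overlap must be non-negative and strictly smaller than chunk_size or
    the loop never advances."""
    return 0 <= chunk_overlap < chunk_size

def validate_chunk_params(chunk_size, chunk_overlap):
    if not chunk_params_valid(chunk_size, chunk_overlap):
        raise ValueError(
            f"chunk_overlap ({chunk_overlap}) must be >= 0 and "
            f"< chunk_size ({chunk_size})"
        )
```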
**web3guru888** left a comment:
Solid set of improvements here, @jphein. A few thoughts from running MemPalace at scale (208 discoveries, 710 KG entities across 5 wings):
Batch writes — this directly addresses a bottleneck we hit. At our volume, per-chunk upserts to ChromaDB became the dominant cost in mining runs. Batching to one collection.upsert() per file is the right call; the overhead per call is substantial relative to the actual data transfer.
Concurrent mining + sequential write phase — the Phase 1/Phase 2 split is exactly the right architecture for ChromaDB. PersistentClient isn't thread-safe for concurrent writes, so serializing Phase 2 while parallelizing the CPU-bound read/chunk work in Phase 1 gives you the throughput gains without the corruption risk. Good call.
One question on error handling: if a worker thread in Phase 1 raises (e.g., file read error, chunking failure), does it skip that file and continue, or does it abort the whole mine run? At scale you'll inevitably hit encoding issues, locked files, etc. — graceful per-file degradation with logged failures would be preferable to a full abort.
Entity detector additions — the stopword list expansion looks right. "Handler", "Node", "Service", "Manager" etc. were generating noise in our KG extractions too. These are structural/role terms, not entities worth tracking.
Configurable chunk sizes — we ended up hardcoding 800/200 in a local fork to tune for our domain. Having this in config.py is the cleaner approach.
kg_query direction default — "both" makes more sense as the default. "outgoing" was the surprising choice; most callers want full relational context.
Address web3guru888's review feedback across PRs MemPalace#492 and MemPalace#493:
- palace.py: remove unused filepaths param from bulk_check_mined(), replace bare except with logger.warning for partial fetch visibility
- miner.py: wrap future.result() in try/except so one file failure doesn't abort the entire concurrent mining run
- exporter.py: stream drawers in batches instead of loading entire palace into memory — keeps memory bounded for large palaces
- searcher.py: document min_similarity as L2 distance (not cosine) with typical range guidance in docstring

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Merge commit; conflicts resolved in: mempalace/miner.py
- Rename _build_where_filter → build_where_filter (public cross-module API)
- Add float() cast + TypeError/ValueError handling in _is_already_mined
- Add chunk_overlap validation (must be >= 0 and < chunk_size)
- Batch convo_miner adds to 100 docs per call (avoid SQLite limits)
- Stream miner writes as futures complete (bounded memory)
- Remove unused palace_path in hooks_cli
- Remove unused chromadb import in test_exporter
- Sanitize wing/room as path components in exporter (prevent traversal)
- Filter on raw distance before rounding in searcher
- Clamp negative offset in tool_list_drawers
- No-op early return + cache invalidation in tool_update_drawer
- Add min/max schema bounds for search limit and list_drawers limit/offset
- Update CLAUDE.md test count (534 → 562)
- Improve chunk coverage test with position-unique tokens

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Stop and precompact hooks used bare `python3` which resolves to system Python in sessions outside the memorypalace project directory, causing `No module named mempalace` errors. Now uses the venv's Python with fallback to system python3. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Replace hardcoded venv path with a resolution chain: 1. MEMPALACE_PYTHON env var (user override) 2. Plugin root's own venv (development installs) 3. System python3 (pip/pipx installs) Co-Authored-By: Claude Opus 4.6 <[email protected]>
Covers basic CRUD, filtering, pagination, negative offset clamping, not-found errors, and no-op update detection. Addresses review comment on PR MemPalace#493 requesting coverage for the new drawer tools. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Single source of truth for the limit ceiling (100) so operators can adjust without hunting through multiple clamp sites. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
…king MCP Stop hook no longer blocks Claude with MCP tool call instructions every 15 messages. Instead it saves a diary checkpoint directly via the Python API and shows a single-line terminal notification + desktop toast. Fixes MemPalace#554 Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add hooks.silent_save and hooks.desktop_toast to config.json, readable via new mempalace_hook_settings MCP tool (get/set). Stop hook checks config to decide between silent direct save vs legacy blocking MCP. Restore STOP_BLOCK_REASON for legacy mode. Toast is opt-in via config. Co-Authored-By: Claude Opus 4.6 <[email protected]>
stderr from hook subprocesses doesn't reach the Claude Code terminal. Block with a one-liner notification after the direct save completes — save already happened, Claude just continues. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Claude Code shows all hook blocks as "Stop hook error:" with no info level available. Return {} for truly invisible saves.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Hook saves directly, then blocks asking Claude to call mempalace_checkpoint_ack — a zero-param tool returning one line like "✦ Journal entry filed — 30 messages tucked into drawers". Replaces both the verbose MCP diary/drawer calls and the invisible silent mode with a single clean terminal line. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Claude Code labels all hook blocks as "Stop hook error:" with no way to customize. Go fully silent instead — save happens invisibly. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Stop hook now outputs {"systemMessage": "✦ N messages filed away"} which
Claude Code renders as a visible one-line terminal notification — no MCP
tool call needed. Also renames checkpoint_ack → memories_filed_away and
fixes MCP server to silently ignore all notifications/ methods per spec.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
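The hook's stdout contract described above can be sketched as follows; Claude Code renders a `systemMessage` key as a one-line terminal notification, and the message wording here is illustrative.

```python
import json

def stop_hook_response(saved_count):
    """Build the JSON the stop hook writes to stdout."""
    if saved_count <= 0:
        return {}  # nothing saved: stay invisible
    return {"systemMessage": f"✦ {saved_count} messages filed away"}

print(json.dumps(stop_hook_response(10)))  # what the hook emits on stdout
```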
Copilot review caught that hook_stop() updated the last-save marker before _save_diary_direct() ran. If save failed, the marker would still advance and skip the next checkpoint. Move marker write after save confirms success. Also updates CLAUDE.md test count and hook docs. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Stop hook now extracts topic keywords from recent messages and displays them in the notification: "✦ 10 memories woven into the palace — hooks, notifications, MCP". Stopword filtering keeps only distinctive terms. Co-Authored-By: Claude Opus 4.6 <[email protected]>
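Topic-keyword extraction with stopword filtering might be implemented like this minimal sketch; the stopword list, token rule, and top-N are assumptions, not MemPalace's actual code.

```python
import re
from collections import Counter

# Illustrative stopword list; the real one would be larger.
_STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it",
              "that", "this", "for", "on", "with", "was", "are", "you"}

def topic_keywords(messages, top_n=3):
    """Return the most frequent distinctive terms across recent messages."""
    words = re.findall(r"[a-z]{3,}", " ".join(messages).lower())
    counts = Counter(w for w in words if w not in _STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]
```

The output feeds a notification line like the one quoted in the commit message.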
- Rename min_similarity → max_distance (searcher + MCP schema), keep backwards compat alias in MCP tool handler - Fix ingest comment accuracy (async/best-effort, not guaranteed) - Add notification protocol tests (all notifications/* return None, unknown methods without id return None) - 578 tests passing Co-Authored-By: Claude Opus 4.6 <[email protected]>
Superseded by consolidated PR — all changes plus review feedback incorporated into a single clean PR from main.
Summary
- Concurrent mining via `--workers` (default: min(8, cpu_count)). Phase 0 bulk mtime pre-fetch, Phase 1 parallel read/chunk, Phase 2 sequential batched writes
- `chunk_size`, `chunk_overlap`, `min_chunk_size` now in `~/.mempalace/config.json`
- `_build_where_filter()` helper replaces 5 duplicate blocks
- `query_entity` default aligned to "both" (was "outgoing"; the MCP schema default was already "both")
- Unit tests for `chunk_text()`

Test plan
- All tests pass (`python -m pytest tests/ -x -q`)
- `mempalace mine ~/Projects --workers 8` completes without errors
- `chunk_size` override in config.json is respected
- `mempalace wake-up` returns in <1 second with a large palace

🤖 Generated with Claude Code