
feat: batch writes, concurrent mining, MCP tools, hooks, export, search improvements #562

Closed
jphein wants to merge 51 commits into MemPalace:main from jphein:main

Conversation

@jphein
Collaborator

@jphein jphein commented Apr 10, 2026

Summary

Consolidated PR covering performance, reliability, and feature improvements across the codebase. Supersedes closed PRs #492, #493, #556 — all Copilot review feedback has been incorporated.

Performance

  • Batch ChromaDB writes — one upsert per file instead of per chunk in both miners
  • Concurrent mining — ThreadPoolExecutor for parallel file processing
  • Bulk mtime pre-fetch: bulk_check_mined() avoids per-file DB queries during mining

Reliability

  • Epsilon mtime comparison: abs() < 0.01 instead of == for float mtime dedup
  • Entity detector STOPWORDS — 73 technical terms added (Handler, Node, Service, etc.)
  • Search limit capped [1,100] — schema bounds enforced
  • Status/taxonomy tools paginated past 10K drawers
  • Save marker race fix — only advances after confirmed successful save
  • MCP notification handling: notifications/* catch-all returns None per JSON-RPC spec
  • Method guard — handles None/missing method without crash
  • Corrupt checkpoint cleanup — bad last_checkpoint file unlinked on parse error
  • Venv python resolution — background subprocesses find the correct interpreter (Hook scripts should use venv python in it's hooks #545)

Bug fixes from upstream issues

Features

  • Silent stop hook — saves directly via Python API with systemMessage notification instead of blocking MCP calls (UX: Stop hook MCP tool calls clutter terminal every 15 messages #554)
  • Theme extraction — word frequency analysis surfaces conversation topics in notifications
  • Precompact hook — emergency save before context compaction
  • Configurable hook settings: mempalace_hook_settings MCP tool for silent_save/desktop_toast
  • mempalace export — markdown backup of the entire palace
  • New MCP tools — get/list/update drawer, memories_filed_away, hook_settings
  • max_distance parameter — renamed from min_similarity with backwards compat shim
  • Configurable chunk size/overlap in mining

Palace maintenance

Stats

Test plan

  • python -m pytest tests/ -x -q — 615 passed
  • ruff check mempalace/ tests/ — clean
  • mempalace status — correctly shows 46K+ drawers
  • mempalace purge --wing X --room Y — batch deletion works
  • mempalace repair — full nuke+rebuild, verified with query after
  • Junk filter: jewelrycycle went from ~327K junk drawers to 8K useful
  • Palace rebuilt from 3 GB corrupt to 293 MB clean
  • MCP auto-reconnects after palace rebuild
  • mempalace compress --dry-run — no KeyError
  • Stop hook systemMessage renders in Claude Code terminal
  • Background ingest resolves venv python correctly

jphein and others added 30 commits April 9, 2026 19:15
Float equality on mtime fails due to JSON round-trip precision loss,
causing every file to be re-mined on each run. Use epsilon < 0.01.

Also adds bulk_check_mined() for fetching all source_file/mtime pairs
in paginated batches — turns 25K individual DB queries into ~5 fetches.

Fixes MemPalace#475

Co-Authored-By: Claude Opus 4.6 <[email protected]>
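The epsilon comparison described in this commit can be sketched as follows. This is a minimal illustration, not the code from palace.py: the function name `mtime_matches` and the sample timestamps are made up, and the defensive `float()` cast mirrors the handling a later commit in this PR mentions for `_is_already_mined`.

```python
MTIME_EPSILON = 0.01  # tolerance in seconds, taken from the commit message


def mtime_matches(stored_mtime, current_mtime, epsilon=MTIME_EPSILON):
    """Return True when two mtimes are equal within `epsilon` seconds.

    `stored_mtime` may come back from metadata as a string or None, so it is
    cast defensively; anything unparseable counts as "not mined yet".
    """
    try:
        return abs(float(stored_mtime) - float(current_mtime)) < epsilon
    except (TypeError, ValueError):
        return False


# A value that lost a few digits in storage still matches...
print(mtime_matches(1712700000.123, 1712700000.1234567))  # True
# ...while a real modification one second later does not.
print(mtime_matches(1712700000.123, 1712700001.123))      # False
```

With strict `==`, the first pair would compare unequal and the file would be re-mined on every run.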
…decls

- Clamp tool_search limit to [1, 100] to prevent memory exhaustion
- Replace hardcoded limit=10000 in status/taxonomy tools with paginated
  _fetch_all_metadata() helper (matches palace_graph.py pattern)
- Remove duplicate _client_cache/_collection_cache declarations

Fixes MemPalace#477, MemPalace#478, MemPalace#479

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Accumulate all chunks for a file into lists, then issue a single
collection.upsert() (miner) or collection.add() (convo_miner) call.
Reduces 125K-375K individual DB round-trips to ~25K batched calls.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
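A rough sketch of the accumulate-then-write pattern from this commit. `RecordingCollection` is a stand-in that only records call sizes, so the example runs without ChromaDB. The ID scheme, the helper name `upsert_file_chunks`, and the metadata keys are assumptions for illustration. The batch size of 100 comes from a later commit in this PR.

```python
class RecordingCollection:
    """Stand-in for a ChromaDB collection that only counts upsert calls."""

    def __init__(self):
        self.calls = []

    def upsert(self, ids, documents, metadatas):
        self.calls.append(len(ids))


def upsert_file_chunks(collection, source_file, chunks, batch_size=100):
    """Write all chunks of one file in as few upsert calls as possible."""
    # Hypothetical deterministic IDs: same file + index always maps to same ID.
    ids = [f"{source_file}#{i}" for i in range(len(chunks))]
    metadatas = [
        {"source_file": source_file, "chunk_index": i} for i in range(len(chunks))
    ]
    for start in range(0, len(chunks), batch_size):
        end = start + batch_size
        collection.upsert(
            ids=ids[start:end],
            documents=chunks[start:end],
            metadatas=metadatas[start:end],
        )


col = RecordingCollection()
upsert_file_chunks(col, "notes.md", [f"chunk {i}" for i in range(250)])
print(col.calls)  # [100, 100, 50]
```

A file with 250 chunks costs three round-trips here instead of 250.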
Prevents false positives like Handler, Node, Service, Manager, Client
being flagged as project/person entities in code-heavy directories.

Fixes MemPalace#476

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Adds min_similarity parameter (L2 distance cutoff) to search_memories()
and MCP tool_search (default 1.5). Filters out clearly irrelevant
results instead of always returning top-N regardless of quality.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
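The cutoff behaviour can be pictured like this. It is a sketch under assumptions: the helper name `filter_by_distance`, the `(document, distance)` pair layout, and the inclusive `<=` comparison are illustrative, not taken from searcher.py. The parameter is named `max_distance` here because the PR later renames `min_similarity` to that; comparing before rounding follows a later review fix in this PR.

```python
def filter_by_distance(hits, max_distance=1.5, limit=5):
    """Keep hits whose raw distance is within the cutoff, closest first.

    `hits` is a list of (document, distance) pairs. The comparison uses the
    unrounded distance; rounding happens only for display.
    """
    kept = [(doc, dist) for doc, dist in hits if dist <= max_distance]
    kept.sort(key=lambda pair: pair[1])
    return [(doc, round(dist, 3)) for doc, dist in kept[:limit]]


hits = [("a", 0.4), ("b", 1.5004), ("c", 1.2)]
print(filter_by_distance(hits))  # [('a', 0.4), ('c', 1.2)]
```

Hit "b" would round to 1.5 and slip through if the filter ran after rounding, which is why the comparison uses the raw value.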
- Updated STOP_BLOCK_REASON to instruct AI to use mempalace_diary_write
  and mempalace_add_drawer instead of generic "memory system"
- Updated PRECOMPACT_BLOCK_REASON with same MCP tool instructions
- Added _ingest_transcript() to mine Claude Code JSONL transcripts
  into the palace automatically on stop/precompact triggers
- Transcript goes into a "sessions" wing via convo_miner

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Documents fork relationship, key files, development workflow,
fork changes, upstream PRs, and integration details.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Mining:
- Added _prepare_file() for thread-safe file processing (read/chunk/route)
- mine() now supports --workers flag (default: min(8, cpu_count))
- Concurrent path: bulk mtime pre-fetch, parallel _prepare_file(), serialized
  ChromaDB writes in batches of 100. Sequential path unchanged (workers=1).

Room routing:
- Priority 1: exact folder match only (no substring)
- Priority 2: exact filename match only
- Content scan increased from 2KB to 5KB (full file if <10KB)
- Keyword scoring uses word-boundary regex instead of substring count
- Added 13 unit tests for detect_room covering all priority paths

Co-Authored-By: Claude Opus 4.6 <[email protected]>
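The concurrent path described above (parallel preparation, serialized writes) might look roughly like this. It is a simplified sketch, not miner.py: `prepare_file` and `mine` here are illustrative stand-ins, the fixed 800-character chunking is made up, and the per-future try/except reflects a review fix added later in this PR.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def prepare_file(path):
    """Thread-safe stage: read and chunk only, no shared state, no DB access."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        text = fh.read()
    return path, [text[i:i + 800] for i in range(0, len(text), 800)]


def mine(paths, write_chunks, workers=None):
    """Prepare files in parallel; call `write_chunks` only from this thread."""
    workers = workers or min(8, os.cpu_count() or 1)
    failed = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(prepare_file, p): p for p in paths}
        for future in as_completed(futures):
            try:
                path, chunks = future.result()
            except Exception as exc:  # one bad file must not abort the run
                failed.append((futures[future], exc))
                continue
            write_chunks(path, chunks)  # serialized: runs in the caller's thread
    return failed
```

Keeping the database write in the calling thread avoids concurrent writers, while file reads and chunking overlap across workers.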
New exporter.py: paginates all drawers, groups by wing/room, writes
browsable markdown tree with index.md table of contents. Each drawer
becomes a blockquoted section with metadata table.

Usage: mempalace export -o ./palace-export

Also fixes test_cli.py for new --workers arg on mine subparser.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
… cache

I7: Three new MCP tools — get_drawer, list_drawers (paginated),
update_drawer (with WAL audit logging and input sanitization).

I8: WAL file chmod(0o600) now only runs on file creation instead
of every write call.

I9: 5-second TTL metadata cache for status/wings/taxonomy tools.
Eliminates redundant full-palace pagination when tools are called
in quick succession.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
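A short-lived cache like the one in I9 can be sketched as below. The class name `TTLCache` and its interface are hypothetical; only the 5-second lifetime comes from the commit message. The injectable clock is there so the behaviour can be tested without sleeping.

```python
import time


class TTLCache:
    """Cache one computed value for `ttl` seconds."""

    def __init__(self, ttl=5.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._value = None
        self._expires_at = None

    def get(self, compute):
        """Return the cached value, recomputing it once it has expired."""
        now = self.clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = compute()
            self._expires_at = now + self.ttl
        return self._value

    def invalidate(self):
        """Drop the cached value, e.g. after a write changes the metadata."""
        self._expires_at = None
```

Calling status, wings, and taxonomy back to back would then share one full-palace scan, provided all three go through the same cache instance.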
I6: Chunk size/overlap/min now configurable via ~/.mempalace/config.json
instead of hardcoded constants. Wired through mine() → process_file() →
chunk_text().

I11: Layer1.generate() capped at MAX_SCAN=2000 drawers (was unbounded).
Reduces wake-up from 250+ ChromaDB round-trips to 4 max.

I12: Extracted _build_where_filter() helper in searcher.py, replaced
5 duplicate where-filter blocks across searcher.py and layers.py.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
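To make the size/overlap/min parameters from I6 concrete, here is a minimal character-window chunker. It is an assumption-laden sketch: the real chunk_text() may split on different boundaries, and the default values and the treatment of short trailing fragments are invented for illustration. The overlap validation rule (>= 0 and < chunk size) is the one a later commit in this PR adds.

```python
def chunk_text(text, chunk_size=800, chunk_overlap=100, min_chunk_size=50):
    """Split text into overlapping character windows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and < chunk_size")
    text = text.strip()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        # Drop a tiny trailing fragment, but never drop the only chunk.
        if len(piece) < min_chunk_size and chunks:
            break
        if piece:
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghijklmnopqrstuvwxy", chunk_size=10, chunk_overlap=2, min_chunk_size=3))
# ['abcdefghij', 'ijklmnopqr', 'qrstuvwxy']
```

Each chunk repeats the last two characters of the previous one, which is the overlap at work.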
I10: 10 unit tests for chunk_text() covering boundaries, overlap,
indices, empty/whitespace input, content preservation.

I13: KG query_entity default direction aligned from "outgoing" to
"both" to match the MCP schema default.

I14: Plugin versions synced to 3.1.0 in both .claude-plugin/ and
.codex-plugin/.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Address web3guru888's review feedback across PRs MemPalace#492 and MemPalace#493:

- palace.py: remove unused filepaths param from bulk_check_mined(),
  replace bare except with logger.warning for partial fetch visibility
- miner.py: wrap future.result() in try/except so one file failure
  doesn't abort the entire concurrent mining run
- exporter.py: stream drawers in batches instead of loading entire
  palace into memory — keeps memory bounded for large palaces
- searcher.py: document min_similarity as L2 distance (not cosine)
  with typical range guidance in docstring

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Rename _build_where_filter → build_where_filter (public cross-module API)
- Add float() cast + TypeError/ValueError handling in _is_already_mined
- Add chunk_overlap validation (must be >= 0 and < chunk_size)
- Batch convo_miner adds to 100 docs per call (avoid SQLite limits)
- Stream miner writes as futures complete (bounded memory)
- Remove unused palace_path in hooks_cli
- Remove unused chromadb import in test_exporter
- Sanitize wing/room as path components in exporter (prevent traversal)
- Filter on raw distance before rounding in searcher
- Clamp negative offset in tool_list_drawers
- No-op early return + cache invalidation in tool_update_drawer
- Add min/max schema bounds for search limit and list_drawers limit/offset
- Update CLAUDE.md test count (534 → 562)
- Improve chunk coverage test with position-unique tokens

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Stop and precompact hooks used bare `python3` which resolves to system
Python in sessions outside the memorypalace project directory, causing
`No module named mempalace` errors. Now uses the venv's Python with
fallback to system python3.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Replace hardcoded venv path with a resolution chain:
1. MEMPALACE_PYTHON env var (user override)
2. Plugin root's own venv (development installs)
3. System python3 (pip/pipx installs)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
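The three-step resolution chain could be expressed along these lines. The function name `resolve_python` and the venv directory names (`.venv`, `venv`) are assumptions; the commit message only says "plugin root's own venv". The paths are POSIX-style, so a Windows layout would need `Scripts/python.exe` instead.

```python
import os
import shutil
from pathlib import Path


def resolve_python(plugin_root, environ=None):
    """Pick the interpreter for hook subprocesses, most specific first."""
    environ = os.environ if environ is None else environ

    # 1. Explicit user override.
    override = environ.get("MEMPALACE_PYTHON")
    if override:
        return override

    # 2. The plugin root's own venv (development installs).
    root = Path(plugin_root)
    for candidate in (root / ".venv" / "bin" / "python",
                      root / "venv" / "bin" / "python"):
        if candidate.is_file():
            return str(candidate)

    # 3. Whatever python3 is on PATH (pip/pipx installs).
    return shutil.which("python3") or "python3"
```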
Covers basic CRUD, filtering, pagination, negative offset clamping,
not-found errors, and no-op update detection. Addresses review comment
on PR MemPalace#493 requesting coverage for the new drawer tools.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Single source of truth for the limit ceiling (100) so operators can
adjust without hunting through multiple clamp sites.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…king MCP

Stop hook no longer blocks Claude with MCP tool call instructions every 15
messages. Instead it saves a diary checkpoint directly via the Python API
and shows a single-line terminal notification + desktop toast.

Fixes MemPalace#554

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add hooks.silent_save and hooks.desktop_toast to config.json, readable
via new mempalace_hook_settings MCP tool (get/set). Stop hook checks
config to decide between silent direct save vs legacy blocking MCP.
Restore STOP_BLOCK_REASON for legacy mode. Toast is opt-in via config.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
stderr from hook subprocesses doesn't reach the Claude Code terminal.
Block with a one-liner notification after the direct save completes —
save already happened, Claude just continues.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Claude Code shows all hook blocks as "Stop hook error:" with no info
level available. Return {} for truly invisible saves.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Hook saves directly, then blocks asking Claude to call
mempalace_checkpoint_ack — a zero-param tool returning one line
like "✦ Journal entry filed — 30 messages tucked into drawers".
Replaces both the verbose MCP diary/drawer calls and the invisible
silent mode with a single clean terminal line.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Claude Code labels all hook blocks as "Stop hook error:" with no way
to customize. Go fully silent instead — save happens invisibly.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Stop hook now outputs {"systemMessage": "✦ N messages filed away"} which
Claude Code renders as a visible one-line terminal notification — no MCP
tool call needed. Also renames checkpoint_ack → memories_filed_away and
fixes MCP server to silently ignore all notifications/ methods per spec.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
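Both halves of this commit are small enough to sketch. The dispatcher below is a toy, not the MCP server: `handle_request` and `stop_hook_output` are hypothetical names, and returning an Invalid Request error for a missing method with an id is an assumption about how the method guard behaves. What is grounded in the PR: `notifications/*` returns None, id-less unknown methods return None, and the hook emits a `systemMessage` JSON object.

```python
import json


def handle_request(request):
    """Return a JSON-RPC response dict, or None when no reply must be sent."""
    method = request.get("method")
    if not isinstance(method, str):
        # Missing or None method: reply with an error only if there is an id.
        if "id" not in request:
            return None
        return {"jsonrpc": "2.0", "id": request["id"],
                "error": {"code": -32600, "message": "Invalid Request"}}
    if method.startswith("notifications/") or "id" not in request:
        return None  # notifications never get a response
    return {"jsonrpc": "2.0", "id": request["id"], "result": {}}


def stop_hook_output(message_count):
    """Build the JSON the stop hook prints on stdout."""
    return json.dumps(
        {"systemMessage": f"✦ {message_count} messages filed away"},
        ensure_ascii=False,
    )


print(handle_request({"jsonrpc": "2.0", "method": "notifications/initialized"}))  # None
print(stop_hook_output(12))
```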
Copilot review caught that hook_stop() updated the last-save marker
before _save_diary_direct() ran. If save failed, the marker would
still advance and skip the next checkpoint. Move marker write after
save confirms success. Also updates CLAUDE.md test count and hook docs.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
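The ordering fix can be illustrated with a few lines. This is a sketch, not hook_stop(): the function name `checkpoint`, the marker file format, and the `save` callable are assumptions. The only point it demonstrates is that the marker is written after the save returns, never before.

```python
import json
from pathlib import Path


def checkpoint(marker_path, message_count, save):
    """Run `save`, and advance the marker only if it returned successfully."""
    try:
        save()
    except Exception:
        return False  # marker untouched, so the next stop retries the save
    Path(marker_path).write_text(json.dumps({"last_saved_count": message_count}))
    return True
```

If the save raises, the marker still points at the previous checkpoint and the next stop event tries again.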
Stop hook now extracts topic keywords from recent messages and displays
them in the notification: "✦ 10 memories woven into the palace — hooks,
notifications, MCP". Stopword filtering keeps only distinctive terms.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
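Word-frequency theme extraction of the kind described here can be sketched in a few lines. The stopword list, the token pattern, and the function name `extract_themes` are all illustrative; the hook's real list and tokenizer are not shown in this PR conversation.

```python
import re
from collections import Counter

# Illustrative stopword list; the real one is presumably much longer.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it",
             "this", "that", "for", "with", "on", "we", "you", "i", "now"}


def extract_themes(messages, top_n=3):
    """Return the `top_n` most frequent non-stopword terms, most common first."""
    words = re.findall(r"[a-z][a-z0-9_]{2,}", " ".join(messages).lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]


messages = [
    "The hooks now save silently",
    "hooks and notifications",
    "notifications for hooks",
]
print(extract_themes(messages, top_n=2))  # ['hooks', 'notifications']
```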
- Rename min_similarity → max_distance (searcher + MCP schema), keep
  backwards compat alias in MCP tool handler
- Fix ingest comment accuracy (async/best-effort, not guaranteed)
- Add notification protocol tests (all notifications/* return None,
  unknown methods without id return None)
- 578 tests passing

Co-Authored-By: Claude Opus 4.6 <[email protected]>
jphein and others added 2 commits April 10, 2026 19:43
Both save and precompact hooks now auto-mine the JSONL transcript
directly into the palace before blocking the AI. This captures raw
tool output (Bash results, search findings, build errors) that the
AI would otherwise summarize away during its save cycle.

- Add MP_PYTHON auto-detection (MEMPAL_PYTHON env → repo venv → system)
- Add inline normalize → chunk → upsert pipeline to both hooks
- Skip file_already_mined — transcript grows, upsert is idempotent
- Update block reason messages to explicitly request verbatim tool output
- Precompact hook: parse transcript_path from input, fallback to session_id lookup
- Update README with two-layer capture docs and MEMPAL_PYTHON config

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…b upgrade

- CLAUDE.md: test count 648, fork changes 8-13, PR MemPalace#562, hook descriptions
- README.md: hooks section reflects two-layer capture, chromadb >=1.5.4,
  normalize.py captures tool blocks, MEMPAL_PYTHON env var documented
- AGENTS.md: add normalize.py and hooks to key files section
- HOOKS_TUTORIAL.md: new sections for two-layer capture and configuration
- mempalace/README.md: normalize.py description updated, version.py added
- normalize.py: docstring updated for tool_use/tool_result capture

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@jphein
Collaborator Author

jphein commented Apr 11, 2026

Two more commits:

17518c3 — feat: hooks auto-mine transcript for tool output capture

  • Both save and precompact hooks now auto-mine the JSONL transcript directly into the palace before blocking the AI
  • Two-layer capture: auto-mine (deterministic, runs normalize pipeline) + updated block reason (tells AI to save tool output verbatim)
  • MP_PYTHON auto-detection for venv (chromadb isn't on system python)
  • Precompact hook finds transcript via input JSON or session_id fallback
  • Closes the live-session gap from "Claude Code JSONL mining silently drops all tool output (49% content loss)" (#590)

bddc087 — docs: update all docs for tool output mining, hook auto-mine, chromadb upgrade

648 tests pass. This PR now addresses 18+ upstream issues with tool output capture both post-hoc and live.

@web3guru888

The two-layer capture approach (auto-mine + updated block reason) is the right architecture for this. A few observations on the implementation:

Auto-mine in hooks: Running normalize() → chunk_exchanges() → upsert() inline in both save and precompact hooks ensures tool output is captured regardless of what the AI does with the block reason. The idempotent upsert with deterministic IDs means no double-store even if the transcript grows incrementally during a session. Clean.

MEMPAL_PYTHON env var chain: The MEMPAL_PYTHON → repo venv → system python3 fallback is exactly right — this is the exact failure mode we hit early in our setup where chromadb was installed in the venv but the hook fired using system python. The auto-detection saves future integrators from that silent failure.

Precompact transcript fallback: The find by session_id fallback is a good safety net, though worth noting it will fail in multi-project setups where the same session_id appears in multiple project directories (unlikely but possible). A -maxdepth 3 or sorting by mtime would make the fallback more deterministic.
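One way to make the fallback deterministic, sketched in Python rather than the hook's shell: pick the most recently modified match. The function name `find_transcript` and the `<session_id>.jsonl` file naming are assumptions, since the actual lookup is not shown in this thread.

```python
from pathlib import Path


def find_transcript(projects_dir, session_id):
    """Return the newest `<session_id>.jsonl` under `projects_dir`, or None."""
    matches = list(Path(projects_dir).rglob(f"{session_id}.jsonl"))
    if not matches:
        return None
    # Several projects may hold the same session_id; prefer the latest write.
    return max(matches, key=lambda p: p.stat().st_mtime)
```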

This PR is in excellent shape — 648 tests, 18+ upstream issues addressed, the chromadb upgrade, and now live-session capture closed. Full LGTM. Ready for maintainer merge.

@bensig
Collaborator

bensig commented Apr 11, 2026

There's good work in here but 32 files covering batch writes, MCP tools, hooks, export, and search improvements is too much for one PR. Could you split this into separate PRs by feature? We'd merge the individual pieces much faster.

@web3guru888

@jphein — having followed this PR closely I can suggest a logical split that maps to the natural seams in the codebase:

PR 1 — Bug fixes (mergeable today, low risk)
The standalone bug fixes with no cross-dependencies: cosine distance default (#568), emotion regex fix (#536), Unicode checkmark (#535), KG path mismatch (#538), MCP arg filtering (#572), compress dict keys (#569), spellcheck registry (#570), skill docs (#534). These are all single-file patches with existing test coverage. Maintainers should be able to review each in isolation quickly.

PR 2 — Performance (batch writes + concurrent mining)
Batch ChromaDB writes, ThreadPoolExecutor mining, bulk mtime pre-fetch. These three features are tightly coupled — they interact around the batching contract — so they belong together. The epsilon mtime comparison and entity detector STOPWORDS also fit here since they directly affect what gets batched.

PR 3 — Reliability / MCP protocol
MCP notification handling, method guard, save marker race fix, MCP stale cache detection, WAL rotation, corrupt checkpoint cleanup, venv python resolution. These all touch the server reliability surface without adding new tools.

PR 4 — Palace maintenance
Junk file filter, mempalace purge, status 10K cap fix, repair command fix. These are user-facing maintenance features that stand alone.

PR 5 — Hooks
Silent stop hook, precompact hook, configurable hook settings, theme extraction. These are all connected through the hook lifecycle so they ship best as one unit.

PR 6 — New MCP tools + export
mempalace export, get/list/update drawer tools, memories_filed_away, hook_settings MCP tool, max_distance rename. These are purely additive features.

Splitting this way, PRs 1–3 could likely land in the next release cycle. PR 1 especially — those bug fixes close 8 upstream issues and have minimal review surface.

@jphein
Collaborator Author

jphein commented Apr 11, 2026

@bensig — Fair point, will split. @web3guru888's breakdown is solid and maps to the natural seams.

I'll close this mega-PR and open the individual ones. Starting with PR 1 (bug fixes) since those are the lowest risk and close 8 issues immediately. Will reference this PR in each for context.

Thanks for the review time on this — the feedback from both of you made the code better.

@Formatted

On the split — worth noting that the status 10K cap fix (proposed PR 4) overlaps with #482, which is already open and addresses it via a shared iter_metadatas(col, where=None, batch=500) generator in palace.py. That PR also consolidates the inline loops in mcp_server.py and palace_graph.py onto the same helper. Might be worth checking whether the pagination piece from #562 can defer to #482 rather than duplicating it in a new split PR.
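For readers who have not looked at #482, a paginating generator with that signature might look like the following. Only the signature `iter_metadatas(col, where=None, batch=500)` comes from the comment above; the body is a guess at the shape, and `FakeCollection` is a stand-in that mimics the `limit`/`offset`/`include` arguments of a ChromaDB collection's `get()` so the example runs on its own.

```python
def iter_metadatas(col, where=None, batch=500):
    """Yield every metadata dict in `col`, fetching `batch` rows at a time."""
    offset = 0
    while True:
        page = col.get(where=where, limit=batch, offset=offset, include=["metadatas"])
        metadatas = page.get("metadatas") or []
        if not metadatas:
            return
        yield from metadatas
        if len(metadatas) < batch:
            return  # short page means we reached the end
        offset += len(metadatas)


class FakeCollection:
    """In-memory stand-in exposing the subset of `get()` used above."""

    def __init__(self, rows):
        self.rows = rows
        self.fetches = 0

    def get(self, where=None, limit=None, offset=0, include=None):
        self.fetches += 1
        return {"metadatas": self.rows[offset:offset + limit]}


col = FakeCollection([{"wing": "w", "n": i} for i in range(1200)])
print(sum(1 for _ in iter_metadatas(col)), col.fetches)  # 1200 3
```

Because it is a generator, callers such as status or taxonomy tools can aggregate counts without holding the whole palace in memory.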

jphein added a commit to jphein/mempalace that referenced this pull request Apr 11, 2026
…ORDS

Split from MemPalace#562 per maintainer request.

- palace.py: epsilon mtime comparison for float dedup, cosine distance
  space, bulk_check_mined() pre-fetch to eliminate N+1 queries
- miner.py: batch ChromaDB upserts (one per file instead of per chunk),
  ThreadPoolExecutor concurrent mining with --workers flag,
  _prepare_file() helper for thread-safe read/chunk/route, junk file
  filter (SKIP_PATTERNS + JUNK_FILE_SIZE), word-boundary keyword
  matching in detect_room(), configurable chunk params, paginated
  status() past 10K drawers
- convo_miner.py: batch upserts in groups of 100 instead of per-chunk
  writes
- entity_detector.py: 73 technical STOPWORDS (Handler, Node, Service,
  etc.) to reduce false-positive entity detection
- cli.py: --workers argument for parallel mining
@jphein
Collaborator Author

jphein commented Apr 11, 2026

@web3guru888 Good catch on the #482 overlap. I've reviewed that PR — its iter_metadatas() generator in palace.py is the right long-term approach (one shared helper for all pagination consumers). Our _fetch_all_metadata() in mcp_server.py solves the same problem but is scoped to just the MCP server.

Plan for the split: PR 4 (palace maintenance) will not include the status 10K pagination fix — we'll defer that piece to #482's iter_metadatas() landing. PR 4 will focus on: repair nuke-rebuild, purge command, --version flag, and chromadb pin bump. The pagination improvement will come naturally when #482 merges.

Also just merged upstream/main into our fork — query sanitizer (#385) and pagination (#371) are now integrated. Starting the split PRs now.

jphein added a commit to jphein/mempalace that referenced this pull request Apr 11, 2026
- Batch upserts (100 docs/call) in both general miner and convo_miner
  instead of one-at-a-time writes — 3-5x faster on large projects
- _prepare_file() separates read/chunk (thread-safe) from write,
  enabling ThreadPoolExecutor concurrency with --workers flag
- bulk_check_mined() pre-fetches all source_file/mtime pairs in one
  paginated scan instead of per-file ChromaDB queries
- Epsilon mtime comparison (abs < 0.01) instead of float == (MemPalace#483)
- Cosine distance default for new collections (hnsw:space=cosine)
- SKIP_PATTERNS/JUNK_FILE_SIZE filter minified JS, lockfiles, large dumps
- detect_room() uses word-boundary regex and exact-match priority
- chunk_text() accepts configurable size/overlap/min params
- Entity detector STOPWORDS expanded (73 technical terms)
- Configurable chunk_size/chunk_overlap/min_chunk_size in palace config

Split from MemPalace#562 per maintainer request. Closes MemPalace#483, MemPalace#568.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
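The difference between word-boundary matching and substring counting in detect_room() is easy to show. `keyword_score` is a hypothetical helper, not the function in miner.py, and the sample sentence is invented.

```python
import re


def keyword_score(text, keywords):
    """Count whole-word, case-insensitive occurrences of each keyword."""
    total = 0
    for kw in keywords:
        pattern = r"\b" + re.escape(kw) + r"\b"
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total


text = "The API gateway forwards requests; see the therapist notes."
print(keyword_score(text, ["api"]))  # 1  ("therapist" is not counted)
print(text.lower().count("api"))     # 2  (substring count over-matches)
```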
@jphein
Collaborator Author

jphein commented Apr 11, 2026

Update on chromadb 0.6.3 → 1.5.x migration: The migration is not seamless — there's a latent data corruption bug.

Bug

ChromaDB 0.6.x stored seq_id values as big-endian 8-byte BLOBs in the embeddings and max_seq_id tables. ChromaDB 1.5.x expects these to be INTEGER. The auto-migration does not convert existing BLOB seq_id values — it only writes new rows as INTEGER, leaving old rows as BLOBs.

The palace appears to work initially (reads from WAL/cache succeed), but the Rust compactor eventually hits:

InternalError: Error executing plan: Error sending backfill request to compactor:
Error reading from metadata segment reader: error occurred while decoding column 0:
mismatched types; Rust type `u64` (as SQL type `INTEGER`) is not compatible with SQL type `BLOB`

At that point, all ChromaDB API calls fail — count(), get(), query() — the palace is bricked until the BLOBs are fixed.

Impact

Any palace created with ChromaDB 0.6.x and then opened with 1.5.x will hit this. Our 50K palace had 50,796 BLOB seq_id rows in embeddings and 1 in max_seq_id. The initial "648 tests pass" was misleading — tests use fresh databases, never migrated ones.

Fix

Direct SQLite conversion of BLOB seq_ids to INTEGER:

import sqlite3
conn = sqlite3.connect("path/to/chroma.sqlite3")

# Fix embeddings table
cursor = conn.execute(
    "SELECT rowid, seq_id FROM embeddings WHERE typeof(seq_id) = 'blob'"
)
updates = [(int.from_bytes(blob, byteorder='big'), rowid) for rowid, blob in cursor]
conn.executemany("UPDATE embeddings SET seq_id = ? WHERE rowid = ?", updates)

# Fix max_seq_id table
cursor = conn.execute(
    "SELECT rowid, seq_id FROM max_seq_id WHERE typeof(seq_id) = 'blob'"
)
updates = [(int.from_bytes(blob, byteorder='big'), rowid) for rowid, blob in cursor]
conn.executemany("UPDATE max_seq_id SET seq_id = ? WHERE rowid = ?", updates)

conn.commit()
conn.close()
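Before running the conversion on a real palace, it may help to count the affected rows and to confirm the approach on a throwaway database. The sketch below does both against an in-memory SQLite file with a toy one-column schema; the helper `count_blob_seq_ids` is hypothetical and the real ChromaDB tables have more columns than shown here.

```python
import sqlite3


def count_blob_seq_ids(conn, table):
    """Number of rows in `table` whose seq_id is still stored as a BLOB."""
    # `table` is interpolated, so restrict it to the two known table names.
    if table not in ("embeddings", "max_seq_id"):
        raise ValueError(f"unexpected table: {table}")
    row = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE typeof(seq_id) = 'blob'"
    ).fetchone()
    return row[0]


# Toy schema: one legacy big-endian BLOB row and one already-integer row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE embeddings (seq_id)")
conn.execute("INSERT INTO embeddings VALUES (?)", ((42).to_bytes(8, "big"),))
conn.execute("INSERT INTO embeddings VALUES (?)", (43,))
before = count_blob_seq_ids(conn, "embeddings")

rows = conn.execute(
    "SELECT rowid, seq_id FROM embeddings WHERE typeof(seq_id) = 'blob'"
).fetchall()
conn.executemany(
    "UPDATE embeddings SET seq_id = ? WHERE rowid = ?",
    [(int.from_bytes(blob, byteorder="big"), rowid) for rowid, blob in rows],
)
after = count_blob_seq_ids(conn, "embeddings")
print(before, after)  # 1 0
```

On a real palace, copying chroma.sqlite3 aside first would be a sensible precaution, since the update rewrites rows in place.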

Should this be a mempalace repair step? cc @web3guru888

jphein added a commit to jphein/mempalace that referenced this pull request Apr 12, 2026
jphein added a commit to jphein/mempalace that referenced this pull request Apr 12, 2026
jphein added a commit to jphein/mempalace that referenced this pull request Apr 13, 2026


5 participants