fix(hooks): drop transcript-parent fallback that caused unbounded ingest#1199
fix(hooks): drop transcript-parent fallback that caused unbounded ingest#1199rmdes wants to merge 1 commit intoMemPalace:developfrom
Conversation
`_get_mine_dir` fell back to the transcript's parent directory when `MEMPAL_DIR` was unset, so every silent stop hook fire ran two mines against the same Claude Code session JSONLs: 1. `_ingest_transcript` — `mempalace mine ... --mode convos` (good, dedups) 2. `_maybe_auto_ingest` — `mempalace mine ...` (bug, no dedup) The second mine treated each growing session JSONL as a project file and re-chunked the whole transcript on every fire. Witnessed in the wild as ~800 GB of palace data from a single project after a year of sessions. The fix: `_get_mine_dir` now returns "" unless `MEMPAL_DIR` is explicitly set to a real directory. `_maybe_auto_ingest` keeps its intended purpose (mining a separate project codebase the user has opted in to). Conversation mining continues unchanged via `_ingest_transcript`, with dedup. Tests updated to cover the new contract; a regression test asserts the silent stop hook spawns exactly one mine subprocess (the convo-mode one). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein
left a comment
There was a problem hiding this comment.
Pulled the branch and ran pytest tests/test_hooks_cli.py — 67/67 pass on Python 3.12 + the fork's chromadb 1.5.x stack. The actual fix in ae7b65e is +16/-(small) lines in hooks_cli.py and is correct and well-scoped.
(The diff against current develop looks much larger because the branch is behind upstream's #1210 truncation-guard merge from yesterday — it'll narrow back to the hook fix once rebased.)
This composes cleanly with the fork's recent PreCompact incorporation (jphein/mempalace 42817d7 — hook_precompact writes a session-recovery checkpoint via _save_diary_direct before transcript ingest + compaction). With your fix landed:
_save_diary_directkeeps writing the checkpoint diary entry (fork-only routing to a dedicatedmempalace_session_recoverycollection for now)_ingest_transcriptkeeps mining transcripts in--mode convoswith dedup — unchanged_maybe_auto_ingestand_mine_synccorrectly become no-ops withoutMEMPAL_DIR— your fix
So the polluted-mega-wing failure mode from #1083 (raphaelsamy) and the unbounded-ingest from your bug report are both fully eliminated. We'd been working around #1083 with PALACE_DAEMON_STRICT=1 (route hook writes through palace-daemon, skip local mines), but that's a heavier hammer than this single function-level change.
Worth rebasing against current develop so the diff reads as the +16-line fix it actually is. After that, +1 from this side.
Closes #1083 (would be worth referencing in the body alongside the unbounded-ingest case you describe).
…rry-pick of MemPalace#1094) Fork main was carrying the per-site `meta = meta or {}` guards from MemPalace#999 in eight read paths but didn't have the boundary coercion that closes the issue once for all callers. Cherry-picked MemPalace#1094 (open upstream, jp-authored) so the typed QueryResult/GetResult contract — both declare `metadatas: list[dict]`, never `list[Optional[dict]]` — is honored at the chromadb adapter boundary. Three same-family fork-ahead PRs we filed since (MemPalace#1198 _tokenize None-doc, MemPalace#1201 palace_graph None metadata, MemPalace#1199 review) all pointed at gaps that would have been impossible if this pattern had been in place. Future None-metadata bugs of the same shape are now catchable at one site instead of N. Per-site guards are kept as belt-and-suspenders for paths that might bypass the typed wrappers (e.g. legacy `import chromadb` direct calls). A future cleanup PR can remove them once we're confident no caller reaches chromadb directly. 6 new tests in test_backends.py (MemPalace#1094 originals); composes cleanly with fork main's _segment_appears_healthy + _quarantined_paths from earlier today. Full suite 1383/1383. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
|
Thanks @rmdes! Closing as superseded — the transcript-parent fallback this PR removes was already eliminated on
Current Appreciate the early diagnosis on this one — it informed the eventual fix in #1231. |
Summary
The silent stop hook fires two mine subprocesses on every save trigger, and the second one re-ingests the entire growing Claude Code session JSONL on every fire. In a long-running project the resulting palace can balloon by orders of magnitude — observed in the wild at ~800 GB from a single project after a year of sessions, which filled a WSL VHD.
_get_mine_dir(used by both_maybe_auto_ingestand_mine_sync) fell back to the transcript's parent directory whenMEMPAL_DIRwas unset. For Claude Code that parent is~/.claude/projects/-<project>/, a folder of growing JSONL session files. Reading the hook log on a real session shows the redundancy clearly:The second mine treats the JSONL as a project file, has no dedup against transcripts, and re-chunks the whole conversation history on every hook fire. As the session grows, each fire ingests more.
Fix
_get_mine_dirnow returns""unlessMEMPAL_DIRis explicitly set to a real directory._maybe_auto_ingestkeeps its intended purpose — mining a separate project codebase the user has opted into viaMEMPAL_DIR. Conversation mining still happens unchanged via_ingest_transcript(which uses--mode convosand dedups already-filed transcripts).Tests
test_get_mine_dir_no_transcript_fallback— replacestest_get_mine_dir_transcript_fallback; asserts the new contract.test_maybe_auto_ingest_no_mempal_dir_does_not_spawn— replacestest_maybe_auto_ingest_with_transcript; asserts no subprocess spawn withoutMEMPAL_DIR.test_precompact_no_sync_project_mine_without_mempal_dir— replacestest_precompact_mines_transcript_dir; same fix path applied to the precompact hook.test_stop_hook_does_not_double_mine_transcript— new regression test at thehook_stoplevel. Asserts exactly onesubprocess.Popencall (the convo-mode mine), and that the spawned command carries--mode convos.Test plan
python -m pytest tests/ --ignore=tests/benchmarks— 1306 passedruff check mempalace/hooks_cli.py tests/test_hooks_cli.py— cleanruff format --check— clean