
fix: prevent convo_miner from re-processing 0-chunk files on every run (#654) #732

Merged
bensig merged 2 commits into MemPalace:develop from mvalentsev:fix/convo-miner-reprocess-sentinel
Apr 12, 2026

fix: prevent convo_miner from re-processing 0-chunk files on every run (#654)#732
bensig merged 2 commits intoMemPalace:developfrom
mvalentsev:fix/convo-miner-reprocess-sentinel

Conversation

@mvalentsev
Contributor

Closes #654 (Bug 1 only).

mine_convos() has three early-exit paths (OSError during normalize, content below MIN_CHUNK_SIZE, zero chunks from chunk_exchanges) that continue without writing anything to ChromaDB. Since file_already_mined() checks for a document with a matching source_file metadata value, these files return False on every subsequent run and get re-read, re-normalized, and re-chunked -- forever.

With 50 such files in a directory, that is 50 wasted reads on every mine invocation.
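The failing check described above can be sketched as a ChromaDB metadata lookup on source_file. The exact signature is an assumption; the PR text only says the function checks for a document with a matching source_file metadata value:

```python
# Minimal sketch of file_already_mined() as described in the PR body.
# The signature and the where-filter shape are assumptions; only the
# "lookup by source_file metadata" behaviour comes from the PR text.
def file_already_mined(collection, source_file: str) -> bool:
    """True if any document was stored for this source file."""
    hits = collection.get(where={"source_file": source_file}, limit=1)
    return len(hits["ids"]) > 0
```

Under this model, any early-exit path that writes nothing leaves the lookup returning False on every run, which is exactly the bug.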

Fix (2 files, +79/-0):

mempalace/convo_miner.py:

  • Add _register_file() helper that upserts a lightweight sentinel document (room="_registry", ingest_mode="registry") so file_already_mined() returns True on future runs
  • Call it at all three early-exit points, guarded by if not dry_run
  • Uses upsert() (not add()) so repeated runs are idempotent

tests/test_convo_miner.py:

  • test_mine_convos_does_not_reprocess_short_files -- verifies a too-short file gets a sentinel and is skipped on second run
  • test_mine_convos_does_not_reprocess_empty_chunk_files -- verifies a file with no exchange markers gets a sentinel and is skipped

Scope note: Bug 2 from the issue (drawers_added counter always 0) was already resolved upstream via the switch from collection.add() to collection.upsert(). This PR only addresses Bug 1, as @DevOPsJourneyman suggested in the issue thread -- a small focused follow-up separate from the batching logic in #629.

…emPalace#654)

mine_convos() has three early-exit paths (OSError, content too short,
zero chunks) that skip writing anything to ChromaDB. Since
file_already_mined() checks for the presence of a document with a
matching source_file, these files are re-read and re-processed on
every subsequent run.

Add _register_file() that upserts a lightweight sentinel document
(room="_registry", ingest_mode="registry") so file_already_mined()
returns True on future runs.

Note: Bug 2 from the issue (drawers_added counter always 0) was
already resolved upstream via the switch from collection.add() to
collection.upsert().
@mvalentsev mvalentsev marked this pull request as ready for review April 12, 2026 21:10
Collaborator

@bensig bensig left a comment


Code review + security audit clean.

@bensig bensig merged commit 87e8baf into MemPalace:develop Apr 12, 2026
6 checks passed
jphein added a commit to jphein/mempalace that referenced this pull request Apr 12, 2026
Upstream merged MemPalace#682-684 (our splits), MemPalace#687 (dry-run None room),
MemPalace#695/MemPalace#708 (convo_miner full response), MemPalace#732 (0-chunk re-processing),
plus VitePress docs site. Conflicts:
- config.py: take upstream's [^\W_] regex (our MemPalace#683 merged version)
- miner.py: integrate upstream's early-return for tiny files, dedupe
  dry-run read path
- test_miner.py: keep our detect_room tests + upstream's dry-run test
- CONTRIBUTING.md: take upstream's org URL update

Co-Authored-By: Claude Opus 4.6 <[email protected]>
gnusam pushed a commit to gnusam/mempalace-pgsql that referenced this pull request Apr 25, 2026
… 0-chunk files

Three upstream fixes ported together because they're conceptually one
"convo_miner polish" pass on the same exchange-chunking path.

1. Remove ai_lines[:8] truncation (upstream d52d6c9, PR MemPalace#695). The
   _chunk_by_exchange path was silently dropping every line past line 8
   of the AI response, violating the verbatim-storage principle.

2. Split oversize exchanges across drawers (upstream 9b60c6e, PR MemPalace#708).
   Now that the full response is preserved, an exchange that exceeds
   CHUNK_SIZE (800 chars, aligned with miner.py) is split into
   consecutive drawers instead of a single oversized one. Adds
   CHUNK_SIZE module constant.
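The drawer-splitting behaviour in fix 2 can be sketched as follows. CHUNK_SIZE = 800 comes from the commit message; the function name and the simple fixed-width splitting strategy are assumptions, not the upstream implementation:

```python
# Sketch of splitting an oversize exchange into consecutive drawers
# (fix 2 above). CHUNK_SIZE matches the commit text; everything else
# is an illustrative assumption.
CHUNK_SIZE = 800  # chars, aligned with miner.py per the commit message

def split_exchange(text: str, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split an exchange into consecutive chunks of at most chunk_size chars."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)] or [""]
```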

3. Register a no-embedding sentinel for files that produce zero chunks
   (upstream 87e8baf, PR MemPalace#732). mine_convos has three early-exit paths
   (OSError, content too short, zero chunks) that previously wrote
   nothing — file_already_mined() then returned False on the next run
   and the file was re-read every time.

Adapted fix 3 for the PG backend: the upstream sentinel uses
collection.upsert() (ChromaDB API). This fork instead adds a
PalaceDB.register_empty_file() method that inserts a row directly with
embedding=NULL and metadata.ingest_mode='registry', so the sentinel is
free of embedding cost and invisible to vector search. file_already_mined()
already keys on source_file + source_mtime, so the existing path picks
up the sentinel without further changes.
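The PG-backend adaptation described above can be sketched as a single embedding-free insert. Table and column names are assumptions (the commit only names PalaceDB.register_empty_file() and the ingest_mode='registry' metadata), and SQLite stands in for PostgreSQL here, so `?` placeholders replace PG's `%s`:

```python
# Sketch of register_empty_file(): insert a sentinel row with embedding=NULL
# so it costs no embedding and is invisible to vector search. Table/column
# names are hypothetical; sqlite3 stands in for the real PG connection.
import json
import sqlite3  # stand-in for psycopg in this sketch; PG would use %s placeholders

def register_empty_file(conn, source_file: str, source_mtime: float) -> None:
    conn.execute(
        """
        INSERT INTO drawers (source_file, source_mtime, embedding, metadata)
        VALUES (?, ?, NULL, ?)
        ON CONFLICT (source_file) DO UPDATE SET source_mtime = excluded.source_mtime
        """,
        (source_file, source_mtime, json.dumps({"ingest_mode": "registry"})),
    )
```

The upsert-on-conflict keeps repeated runs idempotent, mirroring the role collection.upsert() plays in the ChromaDB version.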

Three behavioural tests added: full AI response preserved, oversize
exchange split across drawers, and the sentinel + file_already_mined
round trip.

Upstream:
  MemPalace@d52d6c9
  MemPalace@9b60c6e
  MemPalace@87e8baf

Co-authored-by: shafdev <[email protected]>
Co-authored-by: Sanjay Ramadugu <[email protected]>
Co-authored-by: Mikhail Valentsev <[email protected]>
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>


Development

Successfully merging this pull request may close these issues.

bug: convo_miner re-processes files every run + drawers_added counter always 0
