fix: prevent convo_miner from re-processing 0-chunk files on every run (#654)#732
Merged
bensig merged 2 commits intoMemPalace:developfrom Apr 12, 2026
Merged
Conversation
…emPalace#654) mine_convos() has three early-exit paths (OSError, content too short, zero chunks) that skip writing anything to ChromaDB. Since file_already_mined() checks for the presence of a document with a matching source_file, these files are re-read and re-processed on every subsequent run. Add _register_file() that upserts a lightweight sentinel document (room="_registry", ingest_mode="registry") so file_already_mined() returns True on future runs. Note: Bug 2 from the issue (drawers_added counter always 0) was already resolved upstream via the switch from collection.add() to collection.upsert().
bensig
approved these changes
Apr 12, 2026
Collaborator
bensig
left a comment
There was a problem hiding this comment.
Code review + security audit clean.
jphein
added a commit
to jphein/mempalace
that referenced
this pull request
Apr 12, 2026
Upstream merged MemPalace#682-684 (our splits), MemPalace#687 (dry-run None room), MemPalace#695/MemPalace#708 (convo_miner full response), MemPalace#732 (0-chunk re-processing), plus VitePress docs site. Conflicts: - config.py: take upstream's [^\W_] regex (our MemPalace#683 merged version) - miner.py: integrate upstream's early-return for tiny files, dedupe dry-run read path - test_miner.py: keep our detect_room tests + upstream's dry-run test - CONTRIBUTING.md: take upstream's org URL update Co-Authored-By: Claude Opus 4.6 <[email protected]>
This was referenced Apr 13, 2026
Merged
gnusam
pushed a commit
to gnusam/mempalace-pgsql
that referenced
this pull request
Apr 25, 2026
… 0-chunk files Three upstream fixes ported together because they're conceptually one "convo_miner polish" pass on the same exchange-chunking path. 1. Remove ai_lines[:8] truncation (upstream d52d6c9, PR MemPalace#695). The _chunk_by_exchange path was silently dropping every line past line 8 of the AI response, violating the verbatim-storage principle. 2. Split oversize exchanges across drawers (upstream 9b60c6e, PR MemPalace#708). Now that the full response is preserved, an exchange that exceeds CHUNK_SIZE (800 chars, aligned with miner.py) is split into consecutive drawers instead of a single oversized one. Adds CHUNK_SIZE module constant. 3. Register a no-embedding sentinel for files that produce zero chunks (upstream 87e8baf, PR MemPalace#732). mine_convos has three early-exit paths (OSError, content too short, zero chunks) that previously wrote nothing — file_already_mined() then returned False on the next run and the file was re-read every time. Adapted MemPalace#3 for the PG backend: the upstream sentinel uses collection.upsert() (ChromaDB API). This fork instead adds a PalaceDB.register_empty_file() method that inserts a row directly with embedding=NULL and metadata.ingest_mode='registry', so the sentinel is free of embedding cost and invisible to vector search. file_already_mined() already keys on source_file + source_mtime, so the existing path picks up the sentinel without further changes. Three behavioural tests added: full AI response preserved, oversize exchange split across drawers, and the sentinel + file_already_mined round trip. Upstream: MemPalace@d52d6c9 MemPalace@9b60c6e MemPalace@87e8baf Co-authored-by: shafdev <[email protected]> Co-authored-by: Sanjay Ramadugu <[email protected]> Co-authored-by: Mikhail Valentsev <[email protected]> Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #654 (Bug 1 only).
mine_convos()has three early-exit paths (OSError during normalize, content belowMIN_CHUNK_SIZE, zero chunks fromchunk_exchanges) thatcontinuewithout writing anything to ChromaDB. Sincefile_already_mined()checks for a document with a matchingsource_filemetadata value, these files returnFalseon every subsequent run and get re-read, re-normalized, and re-chunked -- forever.With 50 such files in a directory, that is 50 wasted reads on every
mineinvocation.Fix (2 files, +79/-0):
mempalace/convo_miner.py:_register_file()helper that upserts a lightweight sentinel document (room="_registry",ingest_mode="registry") sofile_already_mined()returns True on future runsif not dry_runupsert()(notadd()) so repeated runs are idempotenttests/test_convo_miner.py:test_mine_convos_does_not_reprocess_short_files-- verifies a too-short file gets a sentinel and is skipped on second runtest_mine_convos_does_not_reprocess_empty_chunk_files-- verifies a file with no exchange markers gets a sentinel and is skippedScope note: Bug 2 from the issue (drawers_added counter always 0) was already resolved upstream via the switch from
collection.add()tocollection.upsert(). This PR only addresses Bug 1, as @DevOPsJourneyman suggested in the issue thread -- a small focused follow-up separate from the batching logic in #629.