Bug Report
Two related bugs in convo_miner.py that compound each other.
Bug 1 — File registry not updated for 0-chunk files
file_already_mined() works by checking ChromaDB for any drawer with a matching source_file metadata value. This works correctly for files that produce drawers — but files that pass normalize and then produce 0 chunks (content too short, or chunker returns empty list) hit an early continue without ever writing anything to ChromaDB. On every subsequent run, file_already_mined() returns False for these files and they are re-processed indefinitely.
Affected paths in mine_convos():
if not content or len(content.strip()) < MIN_CHUNK_SIZE: continue
if not chunks: continue
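The re-processing loop can be reproduced without ChromaDB. The sketch below is only an illustration under stated assumptions: `FakeCollection`, `mine_once`, the drawer ID format and the `MIN_CHUNK_SIZE` value are hypothetical stand-ins, not the real convo_miner.py code. It models the behaviour described above, where a file is "remembered" only if at least one drawer carries its source_file.

```python
class FakeCollection:
    """Hypothetical in-memory stand-in for a ChromaDB collection."""

    def __init__(self):
        self.rows = {}  # drawer id -> metadata

    def add(self, ids, documents, metadatas):
        for id_, meta in zip(ids, metadatas):
            if id_ in self.rows:
                raise ValueError(f"ID {id_} already exists")
            self.rows[id_] = meta

    def has_source(self, source_file):
        return any(m.get("source_file") == source_file for m in self.rows.values())


MIN_CHUNK_SIZE = 30  # assumed value, for illustration only


def file_already_mined(collection, source_file):
    # As described in the report: any drawer with a matching source_file counts.
    return collection.has_source(source_file)


def mine_once(collection, files):
    """Run one mining pass and return the files that were actually processed."""
    processed = []
    for path, content in files.items():
        if file_already_mined(collection, path):
            continue
        processed.append(path)
        if not content or len(content.strip()) < MIN_CHUNK_SIZE:
            continue  # early exit: nothing is written, so nothing is remembered
        collection.add(
            ids=[f"drawer_{path}"],
            documents=[content],
            metadatas=[{"source_file": path}],
        )
    return processed


files = {"long.txt": "x" * 100, "short.txt": "hi"}
col = FakeCollection()
first = mine_once(col, files)   # both files are processed
second = mine_once(col, files)  # short.txt is processed again
print(first, second)
```

In this model the short file shows up in every pass, because the early `continue` leaves no trace of it in the collection.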
Bug 2 — drawers_added counter always shows 0 on re-runs
When collection.add() raises an "already exists" exception, the increment drawers_added += 1 is skipped. On re-runs (caused by Bug 1), all drawer IDs already exist so all adds throw, leaving the counter at 0 and the output showing +0 for every file.
# current
except Exception as e:
    if "already exists" not in str(e).lower():
        raise
    # drawers_added never incremented here
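The effect on the counter can be sketched in isolation. This is an illustration only: `FakeCollection` and `add_drawers` are hypothetical, and `count_existing` is a made-up switch between the current branch and a variant that also increments on "already exists". The stand-in assumes, as the report describes, that `add()` raises when an ID is already present.

```python
class FakeCollection:
    """Hypothetical stand-in: add() raises if an ID is already present."""

    def __init__(self):
        self.ids = set()

    def add(self, ids, documents, metadatas=None):
        for id_ in ids:
            if id_ in self.ids:
                raise ValueError(f"ID {id_} already exists")
        self.ids.update(ids)


def add_drawers(collection, drawers, count_existing):
    """Add (id, text) drawers one at a time and return the counter value."""
    drawers_added = 0
    for drawer_id, text in drawers:
        try:
            collection.add(ids=[drawer_id], documents=[text])
            drawers_added += 1
        except Exception as e:
            if "already exists" not in str(e).lower():
                raise
            if count_existing:
                # Variant: also count drawers that were already filed.
                drawers_added += 1
    return drawers_added


drawers = [("d1", "first chunk"), ("d2", "second chunk")]
col = FakeCollection()
fresh = add_drawers(col, drawers, count_existing=False)      # first run
rerun_old = add_drawers(col, drawers, count_existing=False)  # re-run, current branch
rerun_new = add_drawers(col, drawers, count_existing=True)   # re-run, incrementing variant
print(fresh, rerun_old, rerun_new)
```

On a re-run every add collides, so the current branch reports 0 while the incrementing variant reports the number of drawers the file maps to.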
Fix
Added _register_file() — writes a sentinel entry (ingest_mode: "registry", room: "_registry") to ChromaDB after processing each file, including the early-exit paths. Since file_already_mined() queries by source_file, it finds the sentinel on subsequent runs.
Also added drawers_added += 1 in the except branch for "already exists".
import hashlib
from datetime import datetime

def _register_file(collection, source_file: str, wing: str, agent: str = "mempalace"):
    # The ID is derived from the path, so the same file always maps to the same sentinel.
    registry_id = f"_reg_{hashlib.sha256(source_file.encode()).hexdigest()[:24]}"
    try:
        collection.add(
            documents=[source_file],
            ids=[registry_id],
            metadatas=[{
                "wing": wing,
                "room": "_registry",
                "source_file": source_file,
                "added_by": agent,
                "filed_at": datetime.now().isoformat(),
                "ingest_mode": "registry",
            }],
        )
    except Exception:
        # Best effort: any failure, including an existing sentinel, is ignored.
        pass
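A short sketch of how the sentinel is expected to interact with file_already_mined(). Everything here is an assumption made for illustration: `FakeCollection` mimics only an exact-match `get(where=...)`, and `register_file` and this `file_already_mined` are simplified stand-ins rather than the mempalace implementations.

```python
import hashlib
from datetime import datetime


class FakeCollection:
    """Hypothetical stand-in supporting add() and an exact-match get(where=...)."""

    def __init__(self):
        self.rows = {}  # id -> metadata

    def add(self, ids, documents, metadatas):
        for id_, meta in zip(ids, metadatas):
            if id_ in self.rows:
                raise ValueError(f"ID {id_} already exists")
            self.rows[id_] = meta

    def get(self, where):
        ids = [
            id_ for id_, meta in self.rows.items()
            if all(meta.get(k) == v for k, v in where.items())
        ]
        return {"ids": ids}


def registry_id(source_file):
    # Same derivation as in the fix: "_reg_" plus 24 hex characters of SHA-256.
    return "_reg_" + hashlib.sha256(source_file.encode()).hexdigest()[:24]


def register_file(collection, source_file, wing):
    """Simplified stand-in for _register_file()."""
    try:
        collection.add(
            ids=[registry_id(source_file)],
            documents=[source_file],
            metadatas=[{
                "wing": wing,
                "room": "_registry",
                "source_file": source_file,
                "filed_at": datetime.now().isoformat(),
                "ingest_mode": "registry",
            }],
        )
    except Exception:
        pass  # a repeated registration collides on the same ID and is ignored


def file_already_mined(collection, source_file):
    return len(collection.get(where={"source_file": source_file})["ids"]) > 0


col = FakeCollection()
path = "chats/short.txt"  # hypothetical 0-chunk file
before = file_already_mined(col, path)
register_file(col, path, wing="demo")
register_file(col, path, wing="demo")  # second call is a no-op
after = file_already_mined(col, path)
print(before, after, len(col.rows))
```

Because the ID is deterministic, registering the same path twice leaves a single sentinel row, and the source_file lookup finds it on the next run.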
Environment: mempalace 3.1.0, chromadb 0.6.3, Python 3.13, Debian 12