Fix silent transcript drop: .jsonl ingestion + 500 MB cap + tandem sweeper#998
Conversation
mempalace/miner.py:READABLE_EXTENSIONS contained `.json` but not
`.jsonl`. Every `.jsonl` file encountered in a mined directory was
silently skipped at miner.py:722:
if filepath.suffix.lower() not in READABLE_EXTENSIONS:
continue
Claude Code transcripts, ChatGPT exports, and every other tool writing
line-delimited JSON ship as `.jsonl`. Users running `mempalace mine`
against a directory of transcripts saw the command complete with no
error and no log line — and their conversations never reached the
palace. Silent data loss.
Add `.jsonl` to the whitelist alongside `.json`. jsonl is plain text,
one JSON record per line; the existing chunking pipeline handles it
the same way it handles any other text file.
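For reference, a minimal sketch of the fixed whitelist check. The extra extensions shown are illustrative; only `.json` (present before) and `.jsonl` (the fix) come from this commit:

```python
from pathlib import Path

# Illustrative contents; the commit only asserts that ".json" was present
# and ".jsonl" was missing before this fix.
READABLE_EXTENSIONS = {".txt", ".md", ".json", ".jsonl"}

def is_readable(filepath: Path) -> bool:
    # The miner.py:722 guard: anything outside the whitelist is skipped.
    # With ".jsonl" whitelisted, line-delimited transcripts pass through.
    return filepath.suffix.lower() in READABLE_EXTENSIONS
```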
Tests: tests/test_miner_jsonl_visibility.py
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Long Claude Code sessions routinely produce transcripts larger than
10 MB. The previous cap at miner.py:65 silently dropped them at line
732 with `if filepath.stat().st_size > MAX_FILE_SIZE: continue` — the
same silent-failure pattern as the .jsonl extension bug.

The cap exists as a safety rail against pathological binaries, not as
a limit on legitimate text. Downstream chunking at 800 chars per
drawer means source file size does not affect storage or embedding
cost. 500 MB leaves headroom for year-long continuous transcripts
while still catching accidental multi-GB binary mines.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Mirrors the miner.py fix in this same branch. convo_miner.py had the
exact same 10 MB cap at line 58 that silently dropped long transcripts
via `continue`. Long Claude Code sessions, multi-year ChatGPT exports,
and lifetime Slack dumps all exceed 10 MB. Same silent-drop pattern,
different file.

Raised to 500 MB to match miner.py for consistency; downstream
chunking means source file size does not affect storage or embedding
cost.

Tests: tests/test_convo_miner_size_cap.py (1 test)

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
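A minimal sketch of the guard both commits touch; the constant value is from the commits, the helper name is hypothetical:

```python
from pathlib import Path

# Raised from 10 MB: a rail against pathological binaries, not a limit on text.
MAX_FILE_SIZE = 500 * 1024 * 1024

def within_size_cap(filepath: Path) -> bool:
    # miner.py:732 (and the convo_miner.py twin) `continue` past files where
    # this is False; at 500 MB, legitimate long transcripts now pass.
    return filepath.stat().st_size <= MAX_FILE_SIZE
```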
The primary miners (miner.py, convo_miner.py) operate at file
granularity and can drop data for several reasons: size caps, silent
OSError on read, dedup false positives, extensions the project miner
does not recognize. Even with tonight's hotfixes, any future bug in
the file-level path risks silent data loss.
The sweeper is a second, cooperating miner that works at MESSAGE
granularity:
- Parses Claude Code .jsonl line by line, yielding only
user/assistant records (filters progress, file-history-snapshot,
etc. noise).
- For each session_id, queries the palace for max(timestamp) and
treats that as the cursor.
- Ingests only messages newer than the cursor, as one small drawer
per exchange (never hits a size cap — each drawer is 1-5 KB).
- Deterministic drawer IDs from session_id + message UUID make
reruns idempotent; crash mid-sweep is safe.
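A sketch of the per-message scheme those bullets describe. The ID separator and helper names are assumptions; only session_id, message UUID, and the max(timestamp) cursor come from this commit:

```python
from typing import Optional

def drawer_id(session_id: str, message_uuid: str) -> str:
    # Deterministic: re-running the sweep upserts the same ID, so a rerun
    # or a crash mid-sweep can never create duplicate drawers.
    return f"{session_id}::{message_uuid}"

def newer_than_cursor(timestamp: str, cursor: Optional[str]) -> bool:
    # cursor is max(timestamp) already in the palace for this session_id;
    # only strictly newer messages are ingested.
    return cursor is None or timestamp > cursor
```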
Tandem coordination is free: if the primary miner committed up to
timestamp T, the sweeper resumes from T. If the primary miner missed
everything, the sweeper catches it all. Neither duplicates the other.
Smoke test on a real Claude Code transcript:
1st run: +39 drawers, 0 already present
2nd run: +0 drawers, 39 already present (perfect idempotence)
Opt-in via:
mempalace sweep <file.jsonl>
mempalace sweep <transcript-dir>
No changes to existing miners. No schema migration. Purely additive.
Tests: tests/test_sweeper.py (7 tests covering parsing, tandem
coordination, idempotency, resume-from-cursor, metadata correctness).
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Pull request overview
This PR addresses a silent transcript-ingestion failure by ensuring .jsonl transcripts are discoverable/ingestable, increasing file-size caps for long transcripts, and adding a new “tandem sweeper” (mempalace sweep) that ingests Claude Code JSONL at message granularity as a safety net.
Changes:
- Add `.jsonl` to the project miner's readable extension whitelist.
- Raise `MAX_FILE_SIZE` from 10 MB to 500 MB in both `miner.py` and `convo_miner.py`.
- Introduce `mempalace/sweeper.py` plus a `mempalace sweep` CLI command and associated tests.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| mempalace/miner.py | Allows .jsonl project files and raises the project scan size cap. |
| mempalace/convo_miner.py | Raises conversation scan size cap to match the project miner. |
| mempalace/sweeper.py | Adds message-level JSONL sweeper with timestamp cursoring and deterministic drawer IDs. |
| mempalace/cli.py | Adds sweep subcommand wiring and exit behavior on per-file failures. |
| tests/test_sweeper.py | Validates sweeper parsing, idempotence, cursor resume, and metadata. |
| tests/test_miner_jsonl_visibility.py / tests/test_convo_miner_size_cap.py | Regression coverage for .jsonl visibility and size-cap behavior. |
```python
def sweep_directory(dir_path: str, palace_path: str) -> dict:
    """Sweep every .jsonl file in a directory (recursive).

    Returns aggregated summary across all files.
    """
    dir_p = Path(dir_path).expanduser().resolve()
    files = sorted(dir_p.rglob("*.jsonl"))

    total_added = 0
    total_skipped = 0
    per_file = []

    failures: list[dict] = []
    for f in files:
        try:
            result = sweep(str(f), palace_path, source_label=str(f))
        except Exception as exc:
            logger.error("sweeper: sweep failed on %s: %s", f, exc)
            print(f" \u26a0 sweep failed on {f}: {exc}", file=sys.stderr)
            failures.append({"file": str(f), "error": str(exc)})
            continue
        total_added += result["drawers_added"]
        total_skipped += result["drawers_skipped"]
        per_file.append(
            {
                "file": str(f),
                "added": result["drawers_added"],
                "skipped": result["drawers_skipped"],
            }
        )

    return {
        "files_processed": len(per_file),
        "drawers_added": total_added,
        "drawers_skipped": total_skipped,
        "per_file": per_file,
        "failures": failures,
    }
```
In sweep_directory, files_processed is reported as len(per_file), which excludes failed files (those end up only in failures). This can make the summary/CLI output misleading (it will say fewer files were swept than were discovered/attempted). Consider returning both total discovered/attempted and succeeded counts, or define files_processed to include failures as well.
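One shape the fix could take. The `files_attempted`/`files_succeeded` names are the ones the follow-up commit on this PR eventually uses; the surrounding variables are from the excerpt above:

```python
    # Replacement for the return in sweep_directory: report discovered vs.
    # completed separately so failed files stay visible in the summary.
    return {
        "files_attempted": len(files),     # every .jsonl discovered
        "files_succeeded": len(per_file),  # subset that swept cleanly
        "drawers_added": total_added,
        "drawers_skipped": total_skipped,
        "per_file": per_file,
        "failures": failures,
    }
```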
```python
for rec in parse_claude_jsonl(jsonl_path):
    sid = rec["session_id"]
    if sid not in cursors:
        cursors[sid] = get_palace_cursor(collection, sid)

    cursor = cursors[sid]
    if cursor is not None and rec["timestamp"] <= cursor:
        drawers_skipped += 1
        continue
```
Cursor logic can drop messages when multiple records share the same timestamp. With cursor = max(timestamp) and the skip condition rec['timestamp'] <= cursor, any message at the max timestamp that wasn’t yet ingested (e.g., crash mid-run, or concurrent writes) will be skipped forever. Consider tracking a tie-breaker (e.g., message_uuid set at max timestamp) or only skipping < cursor and, for == cursor, checking existence by deterministic drawer_id before skipping.
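The follow-up commit on this PR tightens the comparison; a sketch of the changed condition, variable names from the excerpt above:

```python
    # Skip only strictly-older messages. At timestamp == cursor, fall
    # through and re-attempt the upsert: deterministic drawer IDs make
    # the retry idempotent, so a crash at the boundary loses nothing.
    if cursor is not None and rec["timestamp"] < cursor:
        drawers_skipped += 1
        continue
```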
```python
def _flush():
    nonlocal drawers_added
    if not batch_ids:
        return
    collection.upsert(
        ids=batch_ids,
        documents=batch_docs,
        metadatas=batch_metas,
    )
    drawers_added += len(batch_ids)
    batch_ids.clear()
    batch_docs.clear()
    batch_metas.clear()
```
drawers_added is incremented by the batch size after collection.upsert(...), but upsert can overwrite existing IDs. This makes the CLI output and the returned metric potentially misleading (it’s counting upserts, not necessarily new drawers). If you need “added” semantics, consider pre-checking which IDs already exist (e.g., via get(ids=...)) or rename the metric to reflect upserts.
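A sketch of the pre-flight check the follow-up commit describes; `collection.get(ids=...)` is standard Chroma, and the counter names come from that commit's return fields:

```python
    # Which of this batch's deterministic IDs already exist? .get("ids")
    # works on both a raw dict return and Chroma's typed GetResult.
    existing_ids = set(collection.get(ids=batch_ids).get("ids") or [])
    new_in_batch = [i for i in batch_ids if i not in existing_ids]

    collection.upsert(ids=batch_ids, documents=batch_docs, metadatas=batch_metas)
    drawers_added += len(new_in_batch)
    drawers_already_present += len(batch_ids) - len(new_in_batch)
```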
```python
    miner missed. Coordinates via max(timestamp) per session_id, so
    this is safe to run alongside the file-level miners — neither
    duplicates the other's work.
```
The help/docstring claims sweep “coordinates via max(timestamp) per session_id” and “neither duplicates the other’s work”, but the existing miners don’t appear to write session_id/timestamp metadata, so sweeper can only coordinate with prior sweeps (and will still ingest alongside file-level miners, potentially duplicating content under different IDs). Either adjust the wording to match actual behavior or add the required metadata in the primary miners for transcript formats that have it.
Suggested change:

```diff
-    miner missed. Coordinates via max(timestamp) per session_id, so
-    this is safe to run alongside the file-level miners — neither
-    duplicates the other's work.
+    miner missed.
+    Sweep deduplicates against prior sweeps using session/timestamp
+    metadata when available, but it does not guarantee coordination
+    with the primary file-level miners. Running both may still ingest
+    overlapping content unless those miners emit compatible metadata.
```
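For context, a sketch of the per-drawer metadata the sweeper stamps and the primary miners currently don't. The key names are inferred from this thread, not confirmed against the source:

```python
# Per-drawer metadata written by the sweeper. get_palace_cursor can recover
# max(timestamp) per session_id only where these keys exist, which is why
# coordination holds between sweeps but not against the file-level miners.
metadata = {
    "session_id": rec["session_id"],
    "timestamp": rec["timestamp"],  # as found on the transcript line
    "source": source_label,
}
```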
```python
# Long Claude Code sessions and large transcript exports routinely exceed
# 10 MB. The cap exists as a defensive rail against pathological binary
# files, not as a limit on legitimate text. Chunking at 800 chars per
# drawer means source size does not affect storage or embedding cost.
```
The new MAX_FILE_SIZE comment says source size “does not affect storage or embedding cost”, but the number of chunks (and therefore storage/embedding work) still scales with file length; additionally process_file() reads the whole file into memory (read_text), so allowing 500MB increases worst-case memory/time significantly. Consider clarifying the comment and/or adding streaming/format-specific handling for large transcript files.
Suggested change:

```diff
-# drawer means source size does not affect storage or embedding cost.
+# drawer bounds the size of each stored unit, but larger sources still
+# produce more drawers and therefore increase storage, embedding work, and
+# processing time.
```
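One possible streaming shape for the .jsonl case the review alludes to. Illustrative only; the PR as merged keeps whole-file reads:

```python
import json
from pathlib import Path
from typing import Iterator

def iter_jsonl_records(path: Path) -> Iterator[dict]:
    # Stream line-delimited JSON instead of read_text() on a 500 MB file;
    # peak memory stays near one line rather than the whole transcript.
    with path.open("r", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)
```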
```python
# cap at that level silently dropped them with `continue`. Source size
# does not affect storage or embedding cost — chunking happens downstream.
```
Same as miner.py: the MAX_FILE_SIZE comment states source size “does not affect storage or embedding cost”, but chunk count and ingest time still grow with file size, and the miner normalizes/reads full content into memory. Consider correcting the wording and/or guarding large files with streaming or a configurable cap.
Suggested change:

```diff
-# cap at that level silently dropped them with `continue`. Source size
-# does not affect storage or embedding cost — chunking happens downstream.
+# cap at that level silently dropped them with `continue`. Larger source
+# files can still increase ingest time, memory use, and downstream chunk/
+# embedding/storage work because this miner reads and normalizes the full
+# content before chunking it into drawers.
```
…logged failures

Four changes on top of the proposal's initial sweeper draft, driven by the CLAUDE.md design principles:

1. Drop the 500-char truncation on tool_use / tool_result content in _flatten_content. The "verbatim always" principle forbids lossy compression of user-adjacent data; a long code-edit diff handed to the assistant must round-trip intact. Unknown block types now also serialize their full payload instead of just a type marker. New test test_parse_preserves_tool_blocks_verbatim covers a 5000-char input.

2. Use the full session_id in drawer IDs (not session_id[:12]). Rules out cross-session collisions if a transcript source ever uses non-UUID session identifiers or shared prefixes.

3. Replace silent `except Exception: return None` in get_palace_cursor with a logger.warning — the exact anti-pattern this PR otherwise criticizes in miner.py. The fallback behavior is still safe (deterministic IDs make a missed cursor recover on the next run), but the failure is now discoverable.

4. sweep_directory now collects per-file failures into the result dict and the CLI exits non-zero when any file failed, so a partial-sweep outcome is visible rather than swallowed.

Co-Authored-By: MSL <[email protected]>
Force-pushed 5d37f21 to 29ce7c7
…mments

Six items from the automated review on PR #998:

1. **Cursor tie-break bug (correctness).** The skip condition was `rec.timestamp <= cursor`; if multiple messages share the max timestamp and only some were ingested before a crash, the rest would be lost forever. Changed to `< cursor`, relying on deterministic drawer IDs for safe re-attempt at the boundary. Regression test `test_sweep_recovers_untaken_message_at_cursor_timestamp`.

2. **`drawers_added` counted upserts, not adds.** Added a pre-flight `collection.get(ids=batch)` to distinguish new rows from already-present ones. Return value now carries `drawers_added`, `drawers_already_present`, `drawers_upserted`, and `drawers_skipped` separately. Dict-compatible access (`existing.get("ids")`) keeps it working on both the raw Chroma return and the typed `GetResult`.

3. **`sweep_directory` hid failures in the summary.** `files_processed` used to exclude failed files. Replaced with `files_attempted` (all discovered) + `files_succeeded` (subset that completed); CLI output shows `succeeded/attempted`.

4. **Coordination claim was overstated.** The primary miners don't stamp `session_id`/`timestamp` metadata, so the sweeper coordinates only with its own prior runs. Softened docstrings on module and CLI command. Uniform cross-miner metadata is flagged as a follow-up.

5. **MAX_FILE_SIZE comments were misleading.** Said source size "does not affect storage or embedding cost" — true per-drawer, but source size still scales drawer count, embedding work, and memory usage (files are read in full, not streamed). Corrected in both `miner.py` and `convo_miner.py`.

6. Added the tie-break regression test that reproduces the correctness bug from (1).

Tests: 970 passed (was 969), ruff + pre-commit clean.

Co-Authored-By: MSL <[email protected]>
…uard

Merges MemPalace#990 (RFC 002 spec), MemPalace#1014 (BaseSourceAdapter/PalaceContext scaffolding), MemPalace#1013 (Layer3.search_raw None guard), MemPalace#1012 (docs), MemPalace#1010 (chromadb >=1.5.4), and MemPalace#998 (sweeper/tandem transcript safety net).

Fork changes preserved:

- quarantine_stale_hnsw() in chroma.py (guards HNSW/sqlite drift segfault)
- get-then-create instead of get_or_create (guards ChromaDB 1.5.x metadata segfault)
- paginated status() loop (guards SQLite variable limit on large palaces)
- searcher hits-loop, BM25 fallback, _count_in_scope
- .jsonl exempt from JUNK_FILE_SIZE cap (Claude Code transcripts can be large)
- _validate_where() + operator constants taken from upstream

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Version bumps across pyproject.toml, mempalace/version.py, README badge, uv.lock, and plugin manifests (.claude-plugin/*, .codex-plugin/*). CHANGELOG aligned with main (post-3.3.1) and a new [3.3.2] section added covering the 11 PRs merged on develop since v3.3.1 — silent-transcript-drop fix + tandem sweeper (#998), None-metadata guards (#999, #1013), chromadb ≥1.5.4 for Py 3.13/3.14 (#1010), Windows Unicode (#681), HNSW quarantine recovery (#1000), PID stacking guard (#1023), doc-path cleanup (#996, #1012), and RFC 001/002 internal scaffolding (#995, #1014, #990).
Summary
Public v3.3.1 silently drops users' `.jsonl` transcripts. The hook fires, the miner walks the directory, skips every `.jsonl` file via silent `continue`, exits cleanly. No error, no log line. Conversations never reach the palace.

Root cause is three-layered. Fixed in three commits, plus one additive safety net and a production-hardening pass:
1. `.jsonl` added to `READABLE_EXTENSIONS` in `miner.py` — drops every `.jsonl` at miner.py:722 via extension whitelist
2. `MAX_FILE_SIZE` raised 10 MB → 500 MB in `miner.py` — long sessions exceed 10 MB and got silently dropped at miner.py:732
3. Same cap raised in `convo_miner.py` — symmetric silent-drop
4. New sweeper (`mempalace sweep`) — message-granular safety net; deterministic drawer IDs, idempotent, timestamp-coordinated with the primary miners

Production hardening vs. the initial draft
- Dropped the `[:500]` truncation on `tool_use`/`tool_result` content in `_flatten_content`. Long tool inputs (code-edit diffs, etc.) now round-trip intact. New test asserts a 5000-char tool input is preserved.
- Drawer IDs use the full `session_id`, not `session_id[:12]`.
- `get_palace_cursor` now logs at WARNING on backend errors instead of returning `None` silently — the very anti-pattern this PR criticizes elsewhere.
- `sweep_directory` returns a `failures` list; the CLI exits non-zero when any file failed.
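A hedged sketch of the verbatim flattening in the first item above; the function name is from the commit, the block shapes are assumptions about Claude transcript records:

```python
import json

def _flatten_content(blocks) -> str:
    # Serialize every block in full, with no [:500] truncation, so long
    # tool_use/tool_result payloads round-trip intact ("verbatim always").
    parts = []
    for block in blocks:
        if isinstance(block, str):
            parts.append(block)
        elif block.get("type") == "text":
            parts.append(block.get("text", ""))
        else:
            # tool_use / tool_result / unknown: full payload, not just a
            # type marker.
            parts.append(json.dumps(block, ensure_ascii=False))
    return "\n".join(parts)
```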
Why we did NOT rely on the hook

The hook is a Claude Code Stop hook, fired deterministically by the harness — not by the model. So the agent can't "ignore" it. But there are three real reliability gaps that still make a hook-only fix insufficient:
1. The hook must be registered in `settings.local.json` per the hooks/README.md instructions. Anyone who skipped that step gets no ingest at all.
2. The miner is spawned in the background (`python -m mempalace mine ... &` at hooks/mempal_save_hook.sh:156). The hook returns as soon as it's spawned, so if the miner itself crashes or is killed, the hook still looks "successful" and the user sees nothing wrong.
3. `"timeout": 30` kills the hook, and anything it spawned in the foreground with it; large first-time mines of a transcript directory can exceed that.
mempalace sweep ~/.claude/projects/on a cron (or manually) and every transcript gets picked up regardless of whether the hook ran, succeeded, or timed out. Deterministic drawer IDs make it safe to run alongside the hook — neither duplicates the other.Test plan
- `pytest --ignore=tests/benchmarks --ignore=tests/test_mcp_stdio_protection.py` — 969 passed (proposal baseline was 967 + 2 new tests)
- `ruff check` + `ruff format --check` — clean
- New tests: `tests/test_miner_jsonl_visibility.py` (4), `tests/test_convo_miner_size_cap.py` (1), `tests/test_sweeper.py` (8, including verbatim preservation)

Commits
Follow-ups (out of scope)
- `ingest_mode: "project" | "convos" | "sweep"` metadata across all three miners
- `except OSError:` / `except Exception: pass` patterns in `miner.py`
- `install-sweeper` subcommand for cron/launchd install