Fix silent transcript drop: .jsonl ingestion + 500 MB cap + tandem sweeper#998
Conversation
mempalace/miner.py:READABLE_EXTENSIONS contained `.json` but not
`.jsonl`. Every `.jsonl` file encountered in a mined directory was
silently skipped at miner.py:722:
if filepath.suffix.lower() not in READABLE_EXTENSIONS:
continue
Claude Code transcripts, ChatGPT exports, and every other tool writing
line-delimited JSON ship as `.jsonl`. Users running `mempalace mine`
against a directory of transcripts saw the command complete with no
error and no log line — and their conversations never reached the
palace. Silent data loss.
Add `.jsonl` to the whitelist alongside `.json`. jsonl is plain text,
one JSON record per line; the existing chunking pipeline handles it
the same way it handles any other text file.
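For reference, a minimal sketch of the fixed whitelist check. The extra extensions shown are illustrative; only `.json` (present before) and `.jsonl` (the fix) come from this commit:

```python
from pathlib import Path

# Illustrative contents; the commit only asserts that ".json" was present
# and ".jsonl" was missing before this fix.
READABLE_EXTENSIONS = {".txt", ".md", ".json", ".jsonl"}

def is_readable(filepath: Path) -> bool:
    # The miner.py:722 guard: anything outside the whitelist is skipped.
    # With ".jsonl" whitelisted, line-delimited transcripts pass through.
    return filepath.suffix.lower() in READABLE_EXTENSIONS
```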
Tests: tests/test_miner_jsonl_visibility.py
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Long Claude Code sessions routinely produce transcripts larger than
10 MB. The previous cap at miner.py:65 silently dropped them at line
732 with `if filepath.stat().st_size > MAX_FILE_SIZE: continue` — the
same silent-failure pattern as the .jsonl extension bug.

The cap exists as a safety rail against pathological binaries, not as
a limit on legitimate text. Downstream chunking at 800 chars per
drawer means source file size does not affect storage or embedding
cost. 500 MB leaves headroom for year-long continuous transcripts
while still catching accidental multi-GB binary mines.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Mirrors the miner.py fix in this same branch. convo_miner.py had the
exact same 10 MB cap at line 58 that silently dropped long transcripts
via `continue`. Long Claude Code sessions, multi-year ChatGPT exports,
and lifetime Slack dumps all exceed 10 MB. Same silent-drop pattern,
different file.

Raised to 500 MB to match miner.py for consistency; downstream
chunking means source file size does not affect storage or embedding
cost.

Tests: tests/test_convo_miner_size_cap.py (1 test)

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
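A minimal sketch of the guard both commits touch; the constant value is from the commits, the helper name is hypothetical:

```python
from pathlib import Path

# Raised from 10 MB: a rail against pathological binaries, not a limit on text.
MAX_FILE_SIZE = 500 * 1024 * 1024

def within_size_cap(filepath: Path) -> bool:
    # miner.py:732 (and the convo_miner.py twin) `continue` past files where
    # this is False; at 500 MB, legitimate long transcripts now pass.
    return filepath.stat().st_size <= MAX_FILE_SIZE
```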
The primary miners (miner.py, convo_miner.py) operate at file
granularity and can drop data for several reasons: size caps, silent
OSError on read, dedup false positives, extensions the project miner
does not recognize. Even with tonight's hotfixes, any future bug in
the file-level path risks silent data loss.
The sweeper is a second, cooperating miner that works at MESSAGE
granularity:
- Parses Claude Code .jsonl line by line, yielding only
user/assistant records (filters progress, file-history-snapshot,
etc. noise).
- For each session_id, queries the palace for max(timestamp) and
treats that as the cursor.
- Ingests only messages newer than the cursor, as one small drawer
per exchange (never hits a size cap — each drawer is 1-5 KB).
- Deterministic drawer IDs from session_id + message UUID make
reruns idempotent; crash mid-sweep is safe.
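A sketch of the per-message scheme those bullets describe. The ID separator and helper names are assumptions; only session_id, message UUID, and the max(timestamp) cursor come from this commit:

```python
from typing import Optional

def drawer_id(session_id: str, message_uuid: str) -> str:
    # Deterministic: re-running the sweep upserts the same ID, so a rerun
    # or a crash mid-sweep can never create duplicate drawers.
    return f"{session_id}::{message_uuid}"

def newer_than_cursor(timestamp: str, cursor: Optional[str]) -> bool:
    # cursor is max(timestamp) already in the palace for this session_id;
    # only strictly newer messages are ingested.
    return cursor is None or timestamp > cursor
```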
Tandem coordination is free: if the primary miner committed up to
timestamp T, the sweeper resumes from T. If the primary miner missed
everything, the sweeper catches it all. Neither duplicates the other.
Smoke test on a real Claude Code transcript:
1st run: +39 drawers, 0 already present
2nd run: +0 drawers, 39 already present (perfect idempotence)
Opt-in via:
mempalace sweep <file.jsonl>
mempalace sweep <transcript-dir>
No changes to existing miners. No schema migration. Purely additive.
Tests: tests/test_sweeper.py (7 tests covering parsing, tandem
coordination, idempotency, resume-from-cursor, metadata correctness).
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Pull request overview
This PR addresses a silent transcript-ingestion failure by ensuring .jsonl transcripts are discoverable/ingestable, increasing file-size caps for long transcripts, and adding a new “tandem sweeper” (mempalace sweep) that ingests Claude Code JSONL at message granularity as a safety net.
Changes:
- Add `.jsonl` to the project miner's readable extension whitelist.
- Raise `MAX_FILE_SIZE` from 10 MB to 500 MB in both `miner.py` and `convo_miner.py`.
- Introduce `mempalace/sweeper.py` plus a `mempalace sweep` CLI command and associated tests.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| mempalace/miner.py | Allows .jsonl project files and raises the project scan size cap. |
| mempalace/convo_miner.py | Raises conversation scan size cap to match the project miner. |
| mempalace/sweeper.py | Adds message-level JSONL sweeper with timestamp cursoring and deterministic drawer IDs. |
| mempalace/cli.py | Adds sweep subcommand wiring and exit behavior on per-file failures. |
| tests/test_sweeper.py | Validates sweeper parsing, idempotence, cursor resume, and metadata. |
| tests/test_miner_jsonl_visibility.py / tests/test_convo_miner_size_cap.py | Regression coverage for .jsonl visibility and size-cap behavior. |
```python
def sweep_directory(dir_path: str, palace_path: str) -> dict:
    """Sweep every .jsonl file in a directory (recursive).

    Returns aggregated summary across all files.
    """
    dir_p = Path(dir_path).expanduser().resolve()
    files = sorted(dir_p.rglob("*.jsonl"))

    total_added = 0
    total_skipped = 0
    per_file = []

    failures: list[dict] = []
    for f in files:
        try:
            result = sweep(str(f), palace_path, source_label=str(f))
        except Exception as exc:
            logger.error("sweeper: sweep failed on %s: %s", f, exc)
            print(f" \u26a0 sweep failed on {f}: {exc}", file=sys.stderr)
            failures.append({"file": str(f), "error": str(exc)})
            continue
        total_added += result["drawers_added"]
        total_skipped += result["drawers_skipped"]
        per_file.append(
            {
                "file": str(f),
                "added": result["drawers_added"],
                "skipped": result["drawers_skipped"],
            }
        )

    return {
        "files_processed": len(per_file),
        "drawers_added": total_added,
        "drawers_skipped": total_skipped,
        "per_file": per_file,
        "failures": failures,
    }
```
In sweep_directory, files_processed is reported as len(per_file), which excludes failed files (those end up only in failures). This can make the summary/CLI output misleading (it will say fewer files were swept than were discovered/attempted). Consider returning both total discovered/attempted and succeeded counts, or define files_processed to include failures as well.
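One shape the fix could take. The `files_attempted`/`files_succeeded` names are the ones the follow-up commit on this PR eventually uses; the surrounding variables are from the excerpt above:

```python
    # Replacement for the return in sweep_directory: report discovered vs.
    # completed separately so failed files stay visible in the summary.
    return {
        "files_attempted": len(files),     # every .jsonl discovered
        "files_succeeded": len(per_file),  # subset that swept cleanly
        "drawers_added": total_added,
        "drawers_skipped": total_skipped,
        "per_file": per_file,
        "failures": failures,
    }
```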
```python
for rec in parse_claude_jsonl(jsonl_path):
    sid = rec["session_id"]
    if sid not in cursors:
        cursors[sid] = get_palace_cursor(collection, sid)

    cursor = cursors[sid]
    if cursor is not None and rec["timestamp"] <= cursor:
        drawers_skipped += 1
        continue
```
Cursor logic can drop messages when multiple records share the same timestamp. With cursor = max(timestamp) and the skip condition rec['timestamp'] <= cursor, any message at the max timestamp that wasn’t yet ingested (e.g., crash mid-run, or concurrent writes) will be skipped forever. Consider tracking a tie-breaker (e.g., message_uuid set at max timestamp) or only skipping < cursor and, for == cursor, checking existence by deterministic drawer_id before skipping.
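The follow-up commit on this PR tightens the comparison; a sketch of the changed condition, variable names from the excerpt above:

```python
    # Skip only strictly-older messages. At timestamp == cursor, fall
    # through and re-attempt the upsert: deterministic drawer IDs make
    # the retry idempotent, so a crash at the boundary loses nothing.
    if cursor is not None and rec["timestamp"] < cursor:
        drawers_skipped += 1
        continue
```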
```python
def _flush():
    nonlocal drawers_added
    if not batch_ids:
        return
    collection.upsert(
        ids=batch_ids,
        documents=batch_docs,
        metadatas=batch_metas,
    )
    drawers_added += len(batch_ids)
    batch_ids.clear()
    batch_docs.clear()
    batch_metas.clear()
```
drawers_added is incremented by the batch size after collection.upsert(...), but upsert can overwrite existing IDs. This makes the CLI output and the returned metric potentially misleading (it’s counting upserts, not necessarily new drawers). If you need “added” semantics, consider pre-checking which IDs already exist (e.g., via get(ids=...)) or rename the metric to reflect upserts.
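A sketch of the pre-flight check the follow-up commit describes; `collection.get(ids=...)` is standard Chroma, and the counter names come from that commit's return fields:

```python
    # Which of this batch's deterministic IDs already exist? .get("ids")
    # works on both a raw dict return and Chroma's typed GetResult.
    existing_ids = set(collection.get(ids=batch_ids).get("ids") or [])
    new_in_batch = [i for i in batch_ids if i not in existing_ids]

    collection.upsert(ids=batch_ids, documents=batch_docs, metadatas=batch_metas)
    drawers_added += len(new_in_batch)
    drawers_already_present += len(batch_ids) - len(new_in_batch)
```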
```python
    miner missed. Coordinates via max(timestamp) per session_id, so
    this is safe to run alongside the file-level miners — neither
    duplicates the other's work.
```
The help/docstring claims sweep “coordinates via max(timestamp) per session_id” and “neither duplicates the other’s work”, but the existing miners don’t appear to write session_id/timestamp metadata, so sweeper can only coordinate with prior sweeps (and will still ingest alongside file-level miners, potentially duplicating content under different IDs). Either adjust the wording to match actual behavior or add the required metadata in the primary miners for transcript formats that have it.
Suggested change:

```diff
-    miner missed. Coordinates via max(timestamp) per session_id, so
-    this is safe to run alongside the file-level miners — neither
-    duplicates the other's work.
+    miner missed.
+    Sweep deduplicates against prior sweeps using session/timestamp
+    metadata when available, but it does not guarantee coordination
+    with the primary file-level miners. Running both may still ingest
+    overlapping content unless those miners emit compatible metadata.
```
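For context, a sketch of the per-drawer metadata the sweeper stamps and the primary miners currently don't. The key names are inferred from this thread, not confirmed against the source:

```python
# Per-drawer metadata written by the sweeper. get_palace_cursor can recover
# max(timestamp) per session_id only where these keys exist, which is why
# coordination holds between sweeps but not against the file-level miners.
metadata = {
    "session_id": rec["session_id"],
    "timestamp": rec["timestamp"],  # as found on the transcript line
    "source": source_label,
}
```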
```python
# Long Claude Code sessions and large transcript exports routinely exceed
# 10 MB. The cap exists as a defensive rail against pathological binary
# files, not as a limit on legitimate text. Chunking at 800 chars per
# drawer means source size does not affect storage or embedding cost.
```
The new MAX_FILE_SIZE comment says source size “does not affect storage or embedding cost”, but the number of chunks (and therefore storage/embedding work) still scales with file length; additionally process_file() reads the whole file into memory (read_text), so allowing 500MB increases worst-case memory/time significantly. Consider clarifying the comment and/or adding streaming/format-specific handling for large transcript files.
Suggested change:

```diff
-# drawer means source size does not affect storage or embedding cost.
+# drawer bounds the size of each stored unit, but larger sources still
+# produce more drawers and therefore increase storage, embedding work, and
+# processing time.
```
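One possible streaming shape for the .jsonl case the review alludes to. Illustrative only; the PR as merged keeps whole-file reads:

```python
import json
from pathlib import Path
from typing import Iterator

def iter_jsonl_records(path: Path) -> Iterator[dict]:
    # Stream line-delimited JSON instead of read_text() on a 500 MB file;
    # peak memory stays near one line rather than the whole transcript.
    with path.open("r", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)
```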
```python
# cap at that level silently dropped them with `continue`. Source size
# does not affect storage or embedding cost — chunking happens downstream.
```
Same as miner.py: the MAX_FILE_SIZE comment states source size “does not affect storage or embedding cost”, but chunk count and ingest time still grow with file size, and the miner normalizes/reads full content into memory. Consider correcting the wording and/or guarding large files with streaming or a configurable cap.
Suggested change:

```diff
-# cap at that level silently dropped them with `continue`. Source size
-# does not affect storage or embedding cost — chunking happens downstream.
+# cap at that level silently dropped them with `continue`. Larger source
+# files can still increase ingest time, memory use, and downstream chunk/
+# embedding/storage work because this miner reads and normalizes the full
+# content before chunking it into drawers.
```
…logged failures

Four changes on top of the proposal's initial sweeper draft, driven by the CLAUDE.md design principles:

1. Drop the 500-char truncation on tool_use / tool_result content in _flatten_content. The "verbatim always" principle forbids lossy compression of user-adjacent data; a long code-edit diff handed to the assistant must round-trip intact. Unknown block types now also serialize their full payload instead of just a type marker. New test test_parse_preserves_tool_blocks_verbatim covers a 5000-char input.

2. Use the full session_id in drawer IDs (not session_id[:12]). Rules out cross-session collisions if a transcript source ever uses non-UUID session identifiers or shared prefixes.

3. Replace silent `except Exception: return None` in get_palace_cursor with a logger.warning — the exact anti-pattern this PR otherwise criticizes in miner.py. The fallback behavior is still safe (deterministic IDs make a missed cursor recover on the next run), but the failure is now discoverable.

4. sweep_directory now collects per-file failures into the result dict and the CLI exits non-zero when any file failed, so a partial-sweep outcome is visible rather than swallowed.

Co-Authored-By: MSL <[email protected]>
Force-pushed 5d37f21 to 29ce7c7
…mments

Six items from the automated review on PR #998:

1. **Cursor tie-break bug (correctness).** The skip condition was `rec.timestamp <= cursor`; if multiple messages share the max timestamp and only some were ingested before a crash, the rest would be lost forever. Changed to `< cursor`, relying on deterministic drawer IDs for safe re-attempt at the boundary. Regression test `test_sweep_recovers_untaken_message_at_cursor_timestamp`.

2. **`drawers_added` counted upserts, not adds.** Added a pre-flight `collection.get(ids=batch)` to distinguish new rows from already-present ones. Return value now carries `drawers_added`, `drawers_already_present`, `drawers_upserted`, and `drawers_skipped` separately. Dict-compatible access (`existing.get("ids")`) keeps it working on both the raw Chroma return and the typed `GetResult`.

3. **`sweep_directory` hid failures in the summary.** `files_processed` used to exclude failed files. Replaced with `files_attempted` (all discovered) + `files_succeeded` (subset that completed); CLI output shows `succeeded/attempted`.

4. **Coordination claim was overstated.** The primary miners don't stamp `session_id`/`timestamp` metadata, so the sweeper coordinates only with its own prior runs. Softened docstrings on module and CLI command. Uniform cross-miner metadata is flagged as a follow-up.

5. **MAX_FILE_SIZE comments were misleading.** Said source size "does not affect storage or embedding cost" — true per-drawer, but source size still scales drawer count, embedding work, and memory usage (files are read in full, not streamed). Corrected in both `miner.py` and `convo_miner.py`.

6. Added the tie-break regression test that reproduces the correctness bug from (1).

Tests: 970 passed (was 969), ruff + pre-commit clean.

Co-Authored-By: MSL <[email protected]>
…uard

Merges MemPalace#990 (RFC 002 spec), MemPalace#1014 (BaseSourceAdapter/PalaceContext scaffolding), MemPalace#1013 (Layer3.search_raw None guard), MemPalace#1012 (docs), MemPalace#1010 (chromadb >=1.5.4), and MemPalace#998 (sweeper/tandem transcript safety net).

Fork changes preserved:

- quarantine_stale_hnsw() in chroma.py (guards HNSW/sqlite drift segfault)
- get-then-create instead of get_or_create (guards ChromaDB 1.5.x metadata segfault)
- paginated status() loop (guards SQLite variable limit on large palaces)
- searcher hits-loop, BM25 fallback, _count_in_scope
- .jsonl exempt from JUNK_FILE_SIZE cap (Claude Code transcripts can be large)
- _validate_where() + operator constants taken from upstream

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Version bumps across pyproject.toml, mempalace/version.py, README badge, uv.lock, and plugin manifests (.claude-plugin/*, .codex-plugin/*). CHANGELOG aligned with main (post-3.3.1) and a new [3.3.2] section added covering the 11 PRs merged on develop since v3.3.1 — silent-transcript-drop fix + tandem sweeper (#998), None-metadata guards (#999, #1013), chromadb ≥1.5.4 for Py 3.13/3.14 (#1010), Windows Unicode (#681), HNSW quarantine recovery (#1000), PID stacking guard (#1023), doc-path cleanup (#996, #1012), and RFC 001/002 internal scaffolding (#995, #1014, #990).
Summary
Public v3.3.1 silently drops users' `.jsonl` transcripts. The hook fires, the miner walks the directory, skips every `.jsonl` file via silent `continue`, exits cleanly. No error, no log line. Conversations never reach the palace.

Root cause is three-layered. Fixed in three commits, plus one additive safety net and a production-hardening pass:
1. `.jsonl` added to `READABLE_EXTENSIONS` in `miner.py` — drops every `.jsonl` at miner.py:722 via extension whitelist
2. `MAX_FILE_SIZE` raised 10 MB → 500 MB in `miner.py` — long sessions exceed 10 MB and got silently dropped at miner.py:732
3. Same cap raised in `convo_miner.py` — symmetric silent-drop
4. New sweeper (`mempalace sweep`) — message-granular safety net; deterministic drawer IDs, idempotent, timestamp-coordinated with the primary miners

Production hardening vs. the initial draft
- Dropped the `[:500]` truncation on `tool_use`/`tool_result` content in `_flatten_content`. Long tool inputs (code-edit diffs, etc.) now round-trip intact. New test asserts a 5000-char tool input is preserved.
- Drawer IDs use the full `session_id`, not `session_id[:12]`.
- `get_palace_cursor` now logs at WARNING on backend errors instead of returning `None` silently — the very anti-pattern this PR criticizes elsewhere.
- `sweep_directory` returns a `failures` list; the CLI exits non-zero when any file failed.
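A hedged sketch of the verbatim flattening in the first item above; the function name is from the commit, the block shapes are assumptions about Claude transcript records:

```python
import json

def _flatten_content(blocks) -> str:
    # Serialize every block in full, with no [:500] truncation, so long
    # tool_use/tool_result payloads round-trip intact ("verbatim always").
    parts = []
    for block in blocks:
        if isinstance(block, str):
            parts.append(block)
        elif block.get("type") == "text":
            parts.append(block.get("text", ""))
        else:
            # tool_use / tool_result / unknown: full payload, not just a
            # type marker.
            parts.append(json.dumps(block, ensure_ascii=False))
    return "\n".join(parts)
```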
Why we did NOT rely on the hook

The hook is a Claude Code Stop hook, fired deterministically by the harness — not by the model. So the agent can't "ignore" it. But there are three real reliability gaps that still make a hook-only fix insufficient:
1. The hook must be registered in `settings.local.json` per the hooks/README.md instructions. Anyone who skipped that step gets no ingest at all.
2. The miner is spawned in the background (`python -m mempalace mine ... &` at hooks/mempal_save_hook.sh:156). The hook returns as soon as it's spawned, so if the miner itself crashes or is killed, the hook still looks "successful" and the user sees nothing wrong.
3. `"timeout": 30` kills the hook, and anything it spawned in the foreground with it; large first-time mines of a transcript directory can exceed that.
mempalace sweep ~/.claude/projects/on a cron (or manually) and every transcript gets picked up regardless of whether the hook ran, succeeded, or timed out. Deterministic drawer IDs make it safe to run alongside the hook — neither duplicates the other.Test plan
- `pytest --ignore=tests/benchmarks --ignore=tests/test_mcp_stdio_protection.py` — 969 passed (proposal baseline was 967 + 2 new tests)
- `ruff check` + `ruff format --check` — clean
- New tests: `tests/test_miner_jsonl_visibility.py` (4), `tests/test_convo_miner_size_cap.py` (1), `tests/test_sweeper.py` (8, including verbatim preservation)

Commits
Follow-ups (out of scope)
- `ingest_mode: "project" | "convos" | "sweep"` metadata across all three miners
- `except OSError:` / `except Exception: pass` patterns in `miner.py`
- `install-sweeper` subcommand for cron/launchd install