
Fix silent transcript drop: .jsonl ingestion + 500 MB cap + tandem sweeper#998

Merged
igorls merged 6 commits into develop from fix/silent-transcript-drop
Apr 18, 2026

Conversation


@igorls igorls commented Apr 18, 2026

Summary

Public v3.3.1 silently drops users' .jsonl transcripts. The hook fires, the miner walks the directory, skips every .jsonl file via silent continue, exits cleanly. No error, no log line. Conversations never reach the palace.

Root cause is three-layered. Fixed in three commits, plus one additive safety net and a production-hardening pass:

  1. .jsonl added to READABLE_EXTENSIONS in miner.py — the extension whitelist at miner.py:722 dropped every .jsonl file
  2. MAX_FILE_SIZE raised 10 MB → 500 MB in miner.py — long sessions exceed 10 MB and were silently dropped at miner.py:732
  3. Same cap raise in convo_miner.py — the symmetric silent drop in the conversation miner
  4. Tandem sweeper (mempalace sweep) — a message-granular safety net; deterministic drawer IDs, idempotent, timestamp-coordinated with the primary miners
  5. Production hardening on the sweeper (see below)
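Taken together, fixes 1-3 amount to two gates in the per-file walk. A minimal sketch, assuming the constant names from the PR (READABLE_EXTENSIONS, MAX_FILE_SIZE); the whitelist contents and the logging calls are illustrative, since the original code used a bare continue:

```python
import logging
from pathlib import Path

logger = logging.getLogger("mempalace.miner")

# Hypothetical subset of the whitelist; the fix adds ".jsonl" next to ".json".
READABLE_EXTENSIONS = {".txt", ".md", ".json", ".jsonl"}
MAX_FILE_SIZE = 500 * 1024 * 1024  # raised from 10 MB to 500 MB

def should_ingest(filepath: Path) -> bool:
    """Apply both gates the miner checks before reading a file."""
    if filepath.suffix.lower() not in READABLE_EXTENSIONS:
        logger.info("skipping %s: extension not in whitelist", filepath)
        return False
    if filepath.stat().st_size > MAX_FILE_SIZE:
        logger.warning("skipping %s: exceeds %d-byte cap", filepath, MAX_FILE_SIZE)
        return False
    return True
```

Before the fix, a .jsonl transcript failed the first gate and a >10 MB transcript failed the second, in both cases with no log line.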

Production hardening vs. the initial draft

  • Verbatim always. Removed the [:500] truncation on tool_use / tool_result content in _flatten_content. Long tool inputs (code-edit diffs, etc.) now round-trip intact. New test asserts a 5000-char tool input is preserved.
  • No ID collision risk. Drawer IDs use the full session_id, not session_id[:12].
  • No silent exceptions. get_palace_cursor now logs at WARNING on backend errors instead of returning None silently — the very anti-pattern this PR criticizes elsewhere.
  • Per-file failures are visible. sweep_directory returns a failures list; CLI exits non-zero when any file failed.
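The logging change in get_palace_cursor can be sketched like this. The collection API is an assumption (a Chroma-like get(where=..., include=["metadatas"])); only the shape of the fix is shown, not the shipped code:

```python
import logging

logger = logging.getLogger("mempalace.sweeper")

def get_palace_cursor(collection, session_id):
    """Return max(timestamp) already stored for this session, or None."""
    try:
        result = collection.get(
            where={"session_id": session_id}, include=["metadatas"]
        )
    except Exception as exc:
        # Log instead of silently returning None; the fallback stays safe
        # because deterministic drawer IDs recover a missed cursor next run.
        logger.warning("sweeper: cursor lookup failed for %s: %s", session_id, exc)
        return None
    timestamps = [
        m["timestamp"]
        for m in (result.get("metadatas") or [])
        if m and "timestamp" in m
    ]
    return max(timestamps) if timestamps else None
```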

Why we did NOT rely on the hook

The hook is a Claude Code Stop hook, fired deterministically by the harness — not by the model. So the agent can't "ignore" it. But there are three real reliability gaps that still make a hook-only fix insufficient:

  1. Most users never install it. Installation requires hand-editing settings.local.json per the hooks/README.md instructions. Anyone who skipped that step gets no ingest at all.
  2. The miner runs as a background process (python -m mempalace mine ... & at hooks/mempal_save_hook.sh:156). The hook returns as soon as it's spawned, so if the miner itself crashes or is killed, the hook still looks "successful" and the user sees nothing wrong.
  3. Hook timeout (30s) can truncate large ingests. The configured "timeout": 30 kills the hook, along with anything it spawned in the foreground; a large first-time mine of a transcript directory can exceed that.

The sweeper addresses all three structurally: run mempalace sweep ~/.claude/projects/ on a cron (or manually) and every transcript gets picked up regardless of whether the hook ran, succeeded, or timed out. Deterministic drawer IDs make it safe to run alongside the hook — neither duplicates the other.

Test plan

  • pytest --ignore=tests/benchmarks --ignore=tests/test_mcp_stdio_protection.py — 969 passed (proposal baseline was 967 + 2 new tests)
  • ruff check + ruff format --check — clean
  • Pre-commit hooks — clean
  • New tests: tests/test_miner_jsonl_visibility.py (4), tests/test_convo_miner_size_cap.py (1), tests/test_sweeper.py (8, including verbatim preservation)
  • Smoke test on a real Claude Code transcript directory (reviewer to verify)

Commits

5d37f21  Harden sweeper for production: verbatim tool blocks, full session_id, logged failures
fed6993  Add tandem sweeper: message-level safety net for dropped transcripts
6f33d52  Raise convo_miner MAX_FILE_SIZE cap 10 MB → 500 MB
d137d12  Raise MAX_FILE_SIZE cap from 10 MB to 500 MB
560fdbd  Fix silent drop of .jsonl files in project miner

Follow-ups (out of scope)

  • Uniform ingest_mode: "project" | "convos" | "sweep" metadata across all three miners
  • Audit of remaining silent except OSError: / except Exception: pass patterns in miner.py
  • install-sweeper subcommand for cron/launchd install

milla-jovovich and others added 4 commits April 18, 2026 12:52
mempalace/miner.py:READABLE_EXTENSIONS contained `.json` but not
`.jsonl`. Every jsonl file encountered in a mined directory was
silently skipped at miner.py:722:

    if filepath.suffix.lower() not in READABLE_EXTENSIONS:
        continue

Claude Code transcripts, ChatGPT exports, and every other tool writing
line-delimited JSON ship as `.jsonl`. Users running `mempalace mine`
against a directory of transcripts saw the command complete with no
error and no log line — and their conversations never reached the
palace. Silent data loss.

Fix: add `.jsonl` to the whitelist alongside `.json`. JSONL is plain
text, one record per line; the existing chunking pipeline handles it
the same way it handles any other text file.

Tests: tests/test_miner_jsonl_visibility.py

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Long Claude Code sessions routinely produce transcripts larger than 10
MB. The previous cap at miner.py:65 silently dropped them at line 732
with `if filepath.stat().st_size > MAX_FILE_SIZE: continue` — same
silent-failure pattern as the .jsonl extension bug.

The cap exists as a safety rail against pathological binaries, not as
a limit on legitimate text. Downstream chunking at 800 chars per drawer
means source file size does not affect storage or embedding cost.

500 MB leaves headroom for year-long continuous transcripts while still
catching accidental multi-GB binary mines.
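The cost argument rests on downstream chunking keeping each drawer small. A toy version, assuming the 800-char figure from this message (the real chunker may split on smarter boundaries):

```python
def chunk_text(text: str, size: int = 800) -> list[str]:
    """Split text into fixed-size character chunks; each drawer stays bounded."""
    return [text[i : i + size] for i in range(0, len(text), size)]
```

A larger source yields more chunks, so embedding work and storage still scale with file size; what stays constant is the size of any individual drawer.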

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Mirrors the miner.py fix in this same branch. convo_miner.py had the
exact same 10 MB cap at line 58 that silently dropped long transcripts
via continue. Long Claude Code sessions, multi-year ChatGPT exports,
and lifetime Slack dumps all exceed 10 MB. Same silent-drop pattern,
different file.

Raised to 500 MB to match miner.py for consistency; downstream chunking
means source file size does not affect storage or embedding cost.

Tests: tests/test_convo_miner_size_cap.py (1 test)

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The primary miners (miner.py, convo_miner.py) operate at file
granularity and can drop data for several reasons: size caps, silent
OSError on read, dedup false positives, extensions the project miner
does not recognize. Even with tonight's hotfixes, any future bug in
the file-level path risks silent data loss.

The sweeper is a second, cooperating miner that works at MESSAGE
granularity:

  - Parses Claude Code .jsonl line by line, yielding only
    user/assistant records (filters progress, file-history-snapshot,
    etc. noise).
  - For each session_id, queries the palace for max(timestamp) and
    treats that as the cursor.
  - Ingests only messages newer than the cursor, as one small drawer
    per exchange (never hits a size cap — each drawer is 1-5 KB).
  - Deterministic drawer IDs from session_id + message UUID make
    reruns idempotent; crash mid-sweep is safe.
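The parsing step above can be sketched as a generator. The record field names ("type", "sessionId", "timestamp") are assumptions about the Claude Code JSONL shape, not a verified schema:

```python
import json

KEEP_TYPES = {"user", "assistant"}

def parse_claude_jsonl(lines):
    """Yield only user/assistant records from line-delimited JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a partial trailing line from a live session
        if rec.get("type") not in KEEP_TYPES:
            continue  # drop progress, file-history-snapshot, etc.
        yield {
            "session_id": rec.get("sessionId", ""),
            "timestamp": rec.get("timestamp", ""),
            "type": rec["type"],
            "message": rec.get("message"),
        }
```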

Tandem coordination is free: if the primary miner committed up to
timestamp T, the sweeper resumes from T. If the primary miner missed
everything, the sweeper catches it all. Neither duplicates the other.

Smoke test on a real Claude Code transcript:
  1st run: +39 drawers, 0 already present
  2nd run: +0 drawers, 39 already present  (perfect idempotence)

Opt-in via:
  mempalace sweep <file.jsonl>
  mempalace sweep <transcript-dir>

No changes to existing miners. No schema migration. Purely additive.

Tests: tests/test_sweeper.py (7 tests covering parsing, tandem
coordination, idempotency, resume-from-cursor, metadata correctness).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Copilot AI review requested due to automatic review settings April 18, 2026 16:00

Copilot AI left a comment


Pull request overview

This PR addresses a silent transcript-ingestion failure by ensuring .jsonl transcripts are discoverable/ingestable, increasing file-size caps for long transcripts, and adding a new “tandem sweeper” (mempalace sweep) that ingests Claude Code JSONL at message granularity as a safety net.

Changes:

  • Add .jsonl to the project miner’s readable extension whitelist.
  • Raise MAX_FILE_SIZE from 10 MB to 500 MB in both miner.py and convo_miner.py.
  • Introduce mempalace/sweeper.py plus mempalace sweep CLI command and associated tests.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
mempalace/miner.py Allows .jsonl project files and raises the project scan size cap.
mempalace/convo_miner.py Raises conversation scan size cap to match the project miner.
mempalace/sweeper.py Adds message-level JSONL sweeper with timestamp cursoring and deterministic drawer IDs.
mempalace/cli.py Adds sweep subcommand wiring and exit behavior on per-file failures.
tests/test_sweeper.py Validates sweeper parsing, idempotence, cursor resume, and metadata.
tests/test_miner_jsonl_visibility.py / tests/test_convo_miner_size_cap.py Regression coverage for .jsonl visibility and size-cap behavior.


Comment thread mempalace/sweeper.py
Comment on lines +253 to +290
def sweep_directory(dir_path: str, palace_path: str) -> dict:
    """Sweep every .jsonl file in a directory (recursive).

    Returns aggregated summary across all files.
    """
    dir_p = Path(dir_path).expanduser().resolve()
    files = sorted(dir_p.rglob("*.jsonl"))

    total_added = 0
    total_skipped = 0
    per_file = []

    failures: list[dict] = []
    for f in files:
        try:
            result = sweep(str(f), palace_path, source_label=str(f))
        except Exception as exc:
            logger.error("sweeper: sweep failed on %s: %s", f, exc)
            print(f" \u26a0 sweep failed on {f}: {exc}", file=sys.stderr)
            failures.append({"file": str(f), "error": str(exc)})
            continue
        total_added += result["drawers_added"]
        total_skipped += result["drawers_skipped"]
        per_file.append(
            {
                "file": str(f),
                "added": result["drawers_added"],
                "skipped": result["drawers_skipped"],
            }
        )

    return {
        "files_processed": len(per_file),
        "drawers_added": total_added,
        "drawers_skipped": total_skipped,
        "per_file": per_file,
        "failures": failures,
    }

Copilot AI Apr 18, 2026


In sweep_directory, files_processed is reported as len(per_file), which excludes failed files (those end up only in failures). This can make the summary/CLI output misleading (it will say fewer files were swept than were discovered/attempted). Consider returning both total discovered/attempted and succeeded counts, or define files_processed to include failures as well.

Comment thread mempalace/sweeper.py
Comment on lines +215 to +223
for rec in parse_claude_jsonl(jsonl_path):
    sid = rec["session_id"]
    if sid not in cursors:
        cursors[sid] = get_palace_cursor(collection, sid)

    cursor = cursors[sid]
    if cursor is not None and rec["timestamp"] <= cursor:
        drawers_skipped += 1
        continue

Copilot AI Apr 18, 2026


Cursor logic can drop messages when multiple records share the same timestamp. With cursor = max(timestamp) and the skip condition rec['timestamp'] <= cursor, any message at the max timestamp that wasn’t yet ingested (e.g., crash mid-run, or concurrent writes) will be skipped forever. Consider tracking a tie-breaker (e.g., message_uuid set at max timestamp) or only skipping < cursor and, for == cursor, checking existence by deterministic drawer_id before skipping.

Comment thread mempalace/sweeper.py
Comment on lines +201 to +213
def _flush():
    nonlocal drawers_added
    if not batch_ids:
        return
    collection.upsert(
        ids=batch_ids,
        documents=batch_docs,
        metadatas=batch_metas,
    )
    drawers_added += len(batch_ids)
    batch_ids.clear()
    batch_docs.clear()
    batch_metas.clear()

Copilot AI Apr 18, 2026


drawers_added is incremented by the batch size after collection.upsert(...), but upsert can overwrite existing IDs. This makes the CLI output and the returned metric potentially misleading (it’s counting upserts, not necessarily new drawers). If you need “added” semantics, consider pre-checking which IDs already exist (e.g., via get(ids=...)) or rename the metric to reflect upserts.

Comment thread mempalace/cli.py Outdated
Comment on lines +151 to +153
miner missed. Coordinates via max(timestamp) per session_id, so
this is safe to run alongside the file-level miners — neither
duplicates the other's work.

Copilot AI Apr 18, 2026


The help/docstring claims sweep “coordinates via max(timestamp) per session_id” and “neither duplicates the other’s work”, but the existing miners don’t appear to write session_id/timestamp metadata, so sweeper can only coordinate with prior sweeps (and will still ingest alongside file-level miners, potentially duplicating content under different IDs). Either adjust the wording to match actual behavior or add the required metadata in the primary miners for transcript formats that have it.

Suggested change
miner missed. Coordinates via max(timestamp) per session_id, so
this is safe to run alongside the file-level miners — neither
duplicates the other's work.
miner missed.
Sweep deduplicates against prior sweeps using session/timestamp
metadata when available, but it does not guarantee coordination
with the primary file-level miners. Running both may still ingest
overlapping content unless those miners emit compatible metadata.

Comment thread mempalace/miner.py Outdated
# Long Claude Code sessions and large transcript exports routinely exceed
# 10 MB. The cap exists as a defensive rail against pathological binary
# files, not as a limit on legitimate text. Chunking at 800 chars per
# drawer means source size does not affect storage or embedding cost.

Copilot AI Apr 18, 2026


The new MAX_FILE_SIZE comment says source size “does not affect storage or embedding cost”, but the number of chunks (and therefore storage/embedding work) still scales with file length; additionally process_file() reads the whole file into memory (read_text), so allowing 500MB increases worst-case memory/time significantly. Consider clarifying the comment and/or adding streaming/format-specific handling for large transcript files.

Suggested change
# drawer means source size does not affect storage or embedding cost.
# drawer bounds the size of each stored unit, but larger sources still
# produce more drawers and therefore increase storage, embedding work, and
# processing time.

Comment thread mempalace/convo_miner.py Outdated
Comment on lines +61 to +62
# cap at that level silently dropped them with `continue`. Source size
# does not affect storage or embedding cost — chunking happens downstream.

Copilot AI Apr 18, 2026


Same as miner.py: the MAX_FILE_SIZE comment states source size “does not affect storage or embedding cost”, but chunk count and ingest time still grow with file size, and the miner normalizes/reads full content into memory. Consider correcting the wording and/or guarding large files with streaming or a configurable cap.

Suggested change
# cap at that level silently dropped them with `continue`. Source size
# does not affect storage or embedding cost — chunking happens downstream.
# cap at that level silently dropped them with `continue`. Larger source
# files can still increase ingest time, memory use, and downstream chunk/
# embedding/storage work because this miner reads and normalizes the full
# content before chunking it into drawers.

… logged failures

Four changes on top of the proposal's initial sweeper draft, driven by
the CLAUDE.md design principles:

1. Drop the 500-char truncation on tool_use / tool_result content in
   _flatten_content. The "verbatim always" principle forbids lossy
   compression of user-adjacent data; a long code-edit diff handed to
   the assistant must round-trip intact. Unknown block types now also
   serialize their full payload instead of just a type marker. New test
   test_parse_preserves_tool_blocks_verbatim covers a 5000-char input.

2. Use the full session_id in drawer IDs (not session_id[:12]). Rules
   out cross-session collisions if a transcript source ever uses
   non-UUID session identifiers or shared prefixes.

3. Replace silent `except Exception: return None` in get_palace_cursor
   with a logger.warning — the exact anti-pattern this PR otherwise
   criticizes in miner.py. The fallback behavior is still safe
   (deterministic IDs make a missed cursor recover on the next run),
   but the failure is now discoverable.

4. sweep_directory now collects per-file failures into the result dict
   and the CLI exits non-zero when any file failed, so a partial-sweep
   outcome is visible rather than swallowed.

Co-Authored-By: MSL <[email protected]>
@igorls igorls force-pushed the fix/silent-transcript-drop branch from 5d37f21 to 29ce7c7 Compare April 18, 2026 16:14
…mments

Six items from the automated review on PR #998:

1. **Cursor tie-break bug (correctness).** The skip condition was
   `rec.timestamp <= cursor`; if multiple messages share the max
   timestamp and only some were ingested before a crash, the rest
   would be lost forever. Changed to `< cursor`, relying on
   deterministic drawer IDs for safe re-attempt at the boundary.
   Regression test
   `test_sweep_recovers_untaken_message_at_cursor_timestamp`.

2. **`drawers_added` counted upserts, not adds.** Added a pre-flight
   `collection.get(ids=batch)` to distinguish new rows from already-
   present ones. Return value now carries `drawers_added`,
   `drawers_already_present`, `drawers_upserted`, and `drawers_skipped`
   separately. Dict-compatible access (`existing.get("ids")`) keeps it
   working on both the raw Chroma return and the typed `GetResult`.

3. **`sweep_directory` hid failures in the summary.** `files_processed`
   used to exclude failed files. Replaced with `files_attempted` (all
   discovered) + `files_succeeded` (subset that completed); CLI output
   shows `succeeded/attempted`.

4. **Coordination claim was overstated.** The primary miners don't
   stamp `session_id`/`timestamp` metadata, so the sweeper coordinates
   only with its own prior runs. Softened docstrings on module and CLI
   command. Uniform cross-miner metadata is flagged as a follow-up.

5. **MAX_FILE_SIZE comments were misleading.** Said source size "does
   not affect storage or embedding cost" — true per-drawer, but source
   size still scales drawer count, embedding work, and memory usage
   (files are read in full, not streamed). Corrected in both
   `miner.py` and `convo_miner.py`.

6. Added the tie-break regression test that reproduces the correctness
   bug from (1).
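Items (1) and (2) combine into one loop shape. A self-contained sketch under stated assumptions: InMemoryCollection is a stand-in for the real Chroma collection, and only get(ids=...) returning {"ids": [...]} plus upsert(ids=..., documents=...) are assumed:

```python
import hashlib

class InMemoryCollection:
    """Minimal stand-in for a Chroma collection (get/upsert only)."""
    def __init__(self):
        self.rows = {}
    def get(self, ids):
        return {"ids": [i for i in ids if i in self.rows]}
    def upsert(self, ids, documents):
        self.rows.update(zip(ids, documents))

def drawer_id(session_id: str, message_uuid: str) -> str:
    # Full session_id (no [:12] prefix) rules out cross-session collisions.
    return hashlib.sha256(f"{session_id}:{message_uuid}".encode()).hexdigest()

def sweep_messages(collection, messages, cursor):
    added = already_present = skipped = 0
    for msg in messages:
        # Strict '<' at the boundary: messages AT the cursor timestamp are
        # re-attempted; deterministic IDs make the retry idempotent.
        if cursor is not None and msg["timestamp"] < cursor:
            skipped += 1
            continue
        did = drawer_id(msg["session_id"], msg["uuid"])
        # Pre-flight existence check distinguishes adds from re-upserts.
        if collection.get(ids=[did])["ids"]:
            already_present += 1
            continue
        collection.upsert(ids=[did], documents=[msg["text"]])
        added += 1
    return {"added": added, "already_present": already_present, "skipped": skipped}
```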

Tests: 970 passed (was 969), ruff + pre-commit clean.

Co-Authored-By: MSL <[email protected]>
@igorls igorls merged commit 74a31b7 into develop Apr 18, 2026
6 checks passed
@igorls igorls deleted the fix/silent-transcript-drop branch April 18, 2026 17:00
jphein added a commit to jphein/mempalace that referenced this pull request Apr 19, 2026
…uard

Merges MemPalace#990 (RFC 002 spec), MemPalace#1014 (BaseSourceAdapter/PalaceContext scaffolding),
MemPalace#1013 (Layer3.search_raw None guard), MemPalace#1012 (docs), MemPalace#1010 (chromadb >=1.5.4),
and MemPalace#998 (sweeper/tandem transcript safety net).

Fork changes preserved:
- quarantine_stale_hnsw() in chroma.py (guards HNSW/sqlite drift segfault)
- get-then-create instead of get_or_create (guards ChromaDB 1.5.x metadata segfault)
- paginated status() loop (guards SQLite variable limit on large palaces)
- searcher hits-loop, BM25 fallback, _count_in_scope
- .jsonl exempt from JUNK_FILE_SIZE cap (Claude Code transcripts can be large)
- _validate_where() + operator constants taken from upstream

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
igorls added a commit that referenced this pull request Apr 19, 2026
Version bumps across pyproject.toml, mempalace/version.py, README badge,
uv.lock, and plugin manifests (.claude-plugin/*, .codex-plugin/*).

CHANGELOG aligned with main (post-3.3.1) and a new [3.3.2] section added
covering the 11 PRs merged on develop since v3.3.1 — silent-transcript-drop
fix + tandem sweeper (#998), None-metadata guards (#999, #1013),
chromadb ≥1.5.4 for Py 3.13/3.14 (#1010), Windows Unicode (#681),
HNSW quarantine recovery (#1000), PID stacking guard (#1023), doc-path
cleanup (#996, #1012), and RFC 001/002 internal scaffolding (#995, #1014, #990).
@igorls igorls mentioned this pull request Apr 19, 2026
8 tasks