Skip to content

Claude Code JSONL mining silently drops all tool output (49% content loss) #590

@jphein

Description

@jphein

Problem

_extract_content() in normalize.py only extracts type: "text" content blocks from Claude Code JSONL transcripts. Tool use and tool result blocks are silently dropped.

In a typical Claude Code session, tool blocks account for 49% of the meaningful content — Bash command output, search results, file reads, and the commands that produced them. All of this is lost during normalization.

What's dropped

Block type Role Count (sample session) Contains
tool_use assistant 847 Commands run, files read, searches performed
tool_result user 847 Command output, search results, runtime findings
thinking assistant 147 Empty (redacted in JSONL) — not actionable

Impact

Unique runtime findings that exist nowhere else are lost:

  • Bash output: build errors, probe results, test output, system diagnostics
  • Search results: what was found and where
  • The command context that produced each result

File contents from Read and diffs from Edit are less critical since they're typically mined separately as project files or available in git history.

Fix

PR #562 adds tool-aware extraction with per-tool strategies:

Tool Strategy Rationale
Bash Head 20 lines + tail 20 lines, gap marker Unique findings; errors at tail
Read Path-only breadcrumb: [Read /path/to/file.py] Content already mined as project files
Grep/Glob Query + first 20 matches Matches are the finding
Edit/Write Path-only breadcrumb Diff is in git
Other First 2KB, truncation marker Safe default

Tool use blocks are formatted as one-liners ([Bash] lsusb | grep razer) and results are prefixed with . Tool-result-only user messages (no human text, just tool output) are merged into the preceding assistant turn to avoid spurious > markers.

Before

> Check the camera
Let me check. Found it — Razer Kiyo Pro is connected.

After

> Check the camera
Let me check.
[Bash] lsusb | grep razer
→ Bus 002 Device 005: ID 1532:0e05 Razer USA, Ltd Razer Kiyo Pro
Found it — Razer Kiyo Pro is connected.

Tested on a real 847-tool-call session: 6,928 tool output lines recovered, palace growth ~5% (head+tail and path-only strategies keep it tight).

648 tests pass.

Metadata

Metadata

Assignees

No one assigned

    Labels

    area/miningFile and conversation miningbugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions