Problem
_extract_content() in normalize.py only extracts type: "text" content blocks from Claude Code JSONL transcripts. Tool use and tool result blocks are silently dropped.
In a typical Claude Code session, tool blocks account for 49% of the meaningful content — Bash command output, search results, file reads, and the commands that produced them. All of this is lost during normalization.
What's dropped
| Block type |
Role |
Count (sample session) |
Contains |
tool_use |
assistant |
847 |
Commands run, files read, searches performed |
tool_result |
user |
847 |
Command output, search results, runtime findings |
thinking |
assistant |
147 |
Empty (redacted in JSONL) — not actionable |
Impact
Unique runtime findings that exist nowhere else are lost:
- Bash output: build errors, probe results, test output, system diagnostics
- Search results: what was found and where
- The command context that produced each result
File contents from Read and diffs from Edit are less critical since they're typically mined separately as project files or available in git history.
Fix
PR #562 adds tool-aware extraction with per-tool strategies:
| Tool |
Strategy |
Rationale |
| Bash |
Head 20 lines + tail 20 lines, gap marker |
Unique findings; errors at tail |
| Read |
Path-only breadcrumb: [Read /path/to/file.py] |
Content already mined as project files |
| Grep/Glob |
Query + first 20 matches |
Matches are the finding |
| Edit/Write |
Path-only breadcrumb |
Diff is in git |
| Other |
First 2KB, truncation marker |
Safe default |
Tool use blocks are formatted as one-liners ([Bash] lsusb | grep razer) and results are prefixed with →. Tool-result-only user messages (no human text, just tool output) are merged into the preceding assistant turn to avoid spurious > markers.
Before
> Check the camera
Let me check. Found it — Razer Kiyo Pro is connected.
After
> Check the camera
Let me check.
[Bash] lsusb | grep razer
→ Bus 002 Device 005: ID 1532:0e05 Razer USA, Ltd Razer Kiyo Pro
Found it — Razer Kiyo Pro is connected.
Tested on a real 847-tool-call session: 6,928 tool output lines recovered, palace growth ~5% (head+tail and path-only strategies keep it tight).
648 tests pass.
Problem
_extract_content()innormalize.pyonly extractstype: "text"content blocks from Claude Code JSONL transcripts. Tool use and tool result blocks are silently dropped.In a typical Claude Code session, tool blocks account for 49% of the meaningful content — Bash command output, search results, file reads, and the commands that produced them. All of this is lost during normalization.
What's dropped
tool_usetool_resultthinkingImpact
Unique runtime findings that exist nowhere else are lost:
File contents from
Readand diffs fromEditare less critical since they're typically mined separately as project files or available in git history.Fix
PR #562 adds tool-aware extraction with per-tool strategies:
[Read /path/to/file.py]Tool use blocks are formatted as one-liners (
[Bash] lsusb | grep razer) and results are prefixed with→. Tool-result-only user messages (no human text, just tool output) are merged into the preceding assistant turn to avoid spurious>markers.Before
After
Tested on a real 847-tool-call session: 6,928 tool output lines recovered, palace growth ~5% (head+tail and path-only strategies keep it tight).
648 tests pass.