Claude Code JSONL mining: user messages silently dropped + tool-result files pollute palace #111

@tavaresgmg

Summary

When mining Claude Code sessions with mempalace mine --mode convos, two issues cause poor results:

  1. User messages are silently dropped: normalize.py checks msg_type == "human", but Claude Code JSONL uses type: "user", so all user turns are lost.
  2. Tool-result files and metadata pollute the palace: scan_convos() picks up .txt files from tool-results/ dirs (raw grep/bash/file-read outputs, up to 19MB each), .meta.json subagent metadata, and memory/*.md files.

Impact

  • Issue 1: Transcripts have 0 user turns. Only assistant responses are indexed, losing the question-answer pairing that makes exchange chunking meaningful.
  • Issue 2: A single tool-result .txt can generate 1000+ drawers of raw code/terminal output, drowning actual conversations in noise.

Reproduction

# Mine Claude Code sessions
mempalace mine ~/.claude/projects --mode convos

# Check a specific session — 0 user turns
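python -c "
import os
from mempalace.normalize import normalize
# Expand ~ explicitly; Python does not expand it when opening files
path = os.path.expanduser('~/.claude/projects/SESSION_DIR/SESSION_ID.jsonl')
result = normalize(path)
# User turns are counted as lines starting with '>'
quote_count = sum(1 for line in result.split('\n') if line.strip().startswith('>'))
print(f'User turns: {quote_count}')  # prints 0
"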

Claude Code JSONL format

{"type":"user","message":{"content":"fix the bug in auth.py"},...}
{"type":"assistant","message":{"content":[{"type":"thinking",...},{"type":"text","text":"I'll look into..."},{"type":"tool_use",...}]},...}

Note: type: "user" not type: "human".
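To confirm which entry-level type values a given session file actually uses, a quick tally can help. This is a minimal sketch, assuming each line is a standalone JSON object with a top-level type field as in the samples above; count_entry_types is a hypothetical helper and not part of mempalace.

```python
import json
from collections import Counter


def count_entry_types(lines):
    """Tally the top-level "type" field across JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            counts["<invalid>"] += 1
            continue
        if isinstance(entry, dict):
            # str() keeps the key hashable even if "type" is malformed
            counts[str(entry.get("type", "<missing>"))] += 1
        else:
            counts["<invalid>"] += 1
    return counts


# Two entries shaped like the samples above (other fields omitted)
sample = [
    '{"type": "user", "message": {"content": "fix the bug in auth.py"}}',
    '{"type": "assistant", "message": {"content": [{"type": "text", "text": "Looking into auth.py"}]}}',
]
counts = count_entry_types(sample)
print(dict(counts))
```

For a real file, pass an open file handle instead of the sample list. A non-zero count for "user" together with zero for "human" would match the behavior described in this issue.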

Claude Code directory structure

~/.claude/projects/
  PROJECT_DIR/
    SESSION_ID.jsonl              # ← actual conversation (valuable)
    SESSION_ID/
      subagents/
        agent-XXXX.jsonl          # ← subagent conversation (valuable)
        agent-XXXX.meta.json      # ← tiny metadata (noise)
      tool-results/
        toolu_XXXX.txt            # ← raw tool output, up to 19MB (noise)
    memory/
      MEMORY.md                   # ← file-based memory (noise, duplicated)
      user_profile.md             # ← file-based memory (noise, duplicated)

Suggested fix

normalize.py (line 84)

# Before:
if msg_type == "human":

# After:
if msg_type in ("human", "user"):
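The same check can be factored into a small predicate so that both spellings are covered in one place. This is only a sketch: USER_TYPES and is_user_entry are hypothetical names, and the real normalize.py may structure this differently.

```python
# Entry-level type values treated as user turns.
# "user" is what Claude Code JSONL emits; "human" is what normalize.py checks today.
USER_TYPES = ("human", "user")


def is_user_entry(entry: dict) -> bool:
    """Return True if a parsed JSONL entry represents a user turn."""
    return entry.get("type") in USER_TYPES


entries = [
    {"type": "user", "message": {"content": "fix the bug in auth.py"}},
    {"type": "human", "message": {"content": "older export format"}},
    {"type": "assistant", "message": {"content": []}},
]
user_turns = [e for e in entries if is_user_entry(e)]
print(len(user_turns))
```

A tuple is used rather than a set so that a malformed, unhashable type value compares as not matching instead of raising.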

convo_miner.py — SKIP_DIRS

SKIP_DIRS = {
    # ... existing entries ...
    "tool-results",
    "memory",
}

convo_miner.py — scan_convos()

def scan_convos(convo_dir: str) -> list:
    """Find all potential conversation files."""
    convo_path = Path(convo_dir).expanduser().resolve()
    files = []
    for root, dirs, filenames in os.walk(convo_path):
        # Prune skipped directories in place so os.walk never descends into them
        dirs[:] = [d for d in dirs if d not in SKIP_DIRS]
        for filename in filenames:
            # Subagent metadata, not a conversation
            if filename.endswith(".meta.json"):
                continue
            filepath = Path(root) / filename
            if filepath.suffix.lower() in CONVO_EXTENSIONS:
                files.append(filepath)
    return files
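The proposed scan can be exercised against a throwaway tree that mirrors the directory layout described above. This is a self-contained sketch: the SKIP_DIRS and CONVO_EXTENSIONS values here are assumptions for illustration (the real sets live in convo_miner.py), and build_fake_tree is a hypothetical helper.

```python
import os
import tempfile
from pathlib import Path

# Assumed values for illustration only; not the actual mempalace constants.
SKIP_DIRS = {"tool-results", "memory"}
CONVO_EXTENSIONS = {".jsonl", ".json", ".txt", ".md"}


def scan_convos(convo_dir: str) -> list:
    """Copy of the proposed scan, so the example runs on its own."""
    convo_path = Path(convo_dir).expanduser().resolve()
    files = []
    for root, dirs, filenames in os.walk(convo_path):
        dirs[:] = [d for d in dirs if d not in SKIP_DIRS]
        for filename in filenames:
            if filename.endswith(".meta.json"):
                continue
            filepath = Path(root) / filename
            if filepath.suffix.lower() in CONVO_EXTENSIONS:
                files.append(filepath)
    return files


def build_fake_tree(root: Path) -> None:
    """Create empty files matching the Claude Code layout."""
    for rel in [
        "PROJECT_DIR/SESSION_ID.jsonl",
        "PROJECT_DIR/SESSION_ID/subagents/agent-XXXX.jsonl",
        "PROJECT_DIR/SESSION_ID/subagents/agent-XXXX.meta.json",
        "PROJECT_DIR/SESSION_ID/tool-results/toolu_XXXX.txt",
        "PROJECT_DIR/memory/MEMORY.md",
    ]:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("{}")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()  # resolve to match scan_convos() output
    build_fake_tree(root)
    found = sorted(p.relative_to(root).as_posix() for p in scan_convos(tmp))

print(found)
```

Only the session transcript and the subagent transcript should remain in the result; the metadata file, the tool output, and the memory file are filtered out.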

Environment

  • mempalace 3.0.0
  • Python 3.13
  • Windows 11 / macOS
  • Claude Code CLI sessions (~312 JSONL files)

Notes

  • Related to #102 (.mempalace-ignore functionality), but these are default behaviors that should work out of the box rather than require user configuration
  • The _extract_content() function in normalize.py already correctly filters tool_use/tool_result content blocks inside JSONL entries — only the entry-level type field matching is wrong
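For readers unfamiliar with the two content shapes shown in the JSONL samples (a plain string for user entries, a list of typed blocks for assistant entries), block-level filtering can be sketched as follows. This is not the actual _extract_content() implementation: extract_text is a hypothetical stand-in, and keeping only "text" blocks (which also drops "thinking" blocks) is an assumption made for this example.

```python
def extract_text(content) -> str:
    """Return readable text from a message's content field."""
    if isinstance(content, str):
        # User entries in the samples carry a plain string
        return content
    if not isinstance(content, list):
        return ""
    parts = []
    for block in content:
        # Keep only text blocks; tool_use, tool_result, and other
        # block types are skipped
        if isinstance(block, dict) and block.get("type") == "text":
            parts.append(block.get("text", ""))
    return "\n".join(parts)


user_content = "fix the bug in auth.py"
assistant_content = [
    {"type": "thinking", "thinking": "internal reasoning"},
    {"type": "text", "text": "I'll look into..."},
    {"type": "tool_use", "name": "Read", "input": {}},
]
print(extract_text(user_content))
print(extract_text(assistant_content))
```

Filtering at this level only helps once the entry itself is recognized as a user or assistant turn, which is why the entry-level type check is the part that needs fixing.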
