Support claude.ai privacy export format with sender field #605

Open
carlito1979 wants to merge 2 commits into MemPalace:develop from carlito1979:claude/fix-claude-ai-exports-1leWI

@carlito1979 carlito1979 commented Apr 11, 2026

Fixes #602

What does this PR do?

Extends Claude.ai JSON export support to handle the actual privacy export format, which uses sender instead of role and stores rendered messages in a top-level text field alongside structured content blocks.

Key changes:

  • Refactored message extraction into an _extract_claude_ai_message() helper that:
    • Accepts both role and sender fields for author identification
    • Falls back to top-level text field when content blocks are empty
    • Handles both nested (privacy export) and flat message list formats
  • Increased MAX_FILE_SIZE from 10 MB to 100 MB to accommodate typical claude.ai privacy exports (20–50 MB)
  • Added user-visible warnings when files exceed the size limit instead of silently skipping them
  • Improved docstrings to document the supported formats
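
To illustrate the nested versus flat handling listed above, here is a minimal sketch, not the PR's actual code. The `chat_messages` key for the nested layout and the function names `iter_messages` and `author_of` are assumptions made for this example.

```python
def iter_messages(data):
    """Yield raw message dicts from either export layout.

    Assumed layouts (illustrative, not confirmed by the PR):
      nested: [{"chat_messages": [{"sender": ..., "text": ...}, ...]}, ...]
      flat:   [{"role": ..., "content": ...}, ...]
    """
    if not isinstance(data, list):
        return
    for item in data:
        if not isinstance(item, dict):
            continue
        nested = item.get("chat_messages")  # assumed key name
        if isinstance(nested, list):
            # Nested layout: each top-level item is a conversation.
            for message in nested:
                if isinstance(message, dict):
                    yield message
        else:
            # Flat layout: each top-level item is already a message.
            yield item


def author_of(message):
    """Return the author from `role`, or from `sender` if `role` is absent."""
    return message.get("role") or message.get("sender")
```

Because both layouts feed the same per-message lookup, a fix to author detection applies to both paths at once.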

How to test

Run the test suite:

python -m pytest tests/test_normalize.py::test_claude_ai_privacy_export_sender_field -v
python -m pytest tests/test_normalize.py::test_claude_ai_privacy_export_text_field_fallback -v
python -m pytest tests/test_normalize.py::test_claude_ai_flat_messages_sender_field -v
python -m pytest tests/test_convo_miner_unit.py::TestScanConvos::test_scan_warns_on_oversized_file -v
python -m pytest tests/ -v

Checklist

  • Tests pass (python -m pytest tests/ -v)
  • No hardcoded paths
  • Linter passes (ruff check .)

https://claude.ai/code/session_01GUH8MeAt6jAjKpbQ227AcC

Two bugs caused `mine --mode convos` to silently file zero drawers from
claude.ai privacy exports:

1. `_try_claude_ai_json` only looked at `role`, but the privacy export
   uses `sender` ("human" / "assistant"). Now accepts either field, and
   falls back to the message's top-level `text` when the structured
   `content` blocks yield nothing.

2. `convo_miner.MAX_FILE_SIZE` was 10 MB while real claude.ai exports
   routinely run 20–50 MB, so `conversations.json` was dropped before
   parsing with no diagnostic. The default cap is now 100 MB and
   oversize files emit a visible warning to stderr.

Adds unit tests covering the `sender` field, the `text` fallback, and
the new oversize-file warning.

@web3guru888 web3guru888 left a comment

This is exactly the fix I described in #602 — clean, well-scoped, and the implementation is more thorough than the minimal one-liner I suggested.

_extract_claude_ai_message() helper — the right abstraction

Factoring this out as a dedicated helper (rather than inline item.get("role") or item.get("sender")) is the correct choice. Both the flat and nested paths now share the same extraction logic, which eliminates the class of "fix one path and miss the other" bug that would have happened with a patch approach.

The fallback chain is right:

  1. Try role first (legacy exports)
  2. Fall back to sender (current privacy exports)
  3. Try structured content blocks
  4. Fall back to top-level text

That ordering handles all known format versions and will degrade gracefully for future variations.
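
The four-step chain above could look roughly like the following. This is a sketch under stated assumptions, not the helper from the PR: the name `extract_message`, the `(author, text)` return shape, and the `{"type": "text", "text": ...}` block shape are all assumed for illustration.

```python
from typing import Optional, Tuple


def extract_message(item: dict) -> Optional[Tuple[str, str]]:
    """Return (author, text) for one message, or None if either is missing."""
    # Steps 1 and 2: prefer `role`, fall back to `sender`.
    author = item.get("role") or item.get("sender")
    if not author:
        return None

    # Step 3: gather text from structured content blocks.
    content = item.get("content")
    parts = []
    if isinstance(content, str):
        parts.append(content)
    elif isinstance(content, list):
        for block in content:
            if isinstance(block, str):
                parts.append(block)
            elif isinstance(block, dict) and block.get("type") == "text":
                # Assumed block shape: {"type": "text", "text": "..."}
                parts.append(block.get("text") or "")
    text = " ".join(p.strip() for p in parts if p and p.strip())

    # Step 4: fall back to the top-level `text` field.
    if not text:
        text = (item.get("text") or "").strip()

    return (author, text) if text else None
```

A message with an empty `content` list and a populated top-level `text` still yields a result, which is the case the privacy export exercises.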

MAX_FILE_SIZE = 100MB

In #602 I suggested 50MB, but 100MB is reasonable — claude.ai exports scale with conversation volume and 50MB is already uncomfortably close to the observed 38MB exports being reported. 100MB gives headroom without being reckless. The comment "20–50 MB JSON files" is accurate and useful context.

Warning on stderr rather than stdout

Correct: file=sys.stderr keeps stdout clean for piping. The warning format is human-readable and includes both actual size and limit, which is what users need to understand the skip.
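
For readers who have not opened the diff, a size check of this kind might be sketched as follows. The function name `filter_by_size`, the exact byte value of the limit, and the message wording are assumptions for this example and are not taken from the PR.

```python
import sys
from pathlib import Path

# Assumed value: the PR says "100 MB" without giving the exact byte count.
MAX_FILE_SIZE = 100 * 1024 * 1024


def filter_by_size(paths, limit=MAX_FILE_SIZE):
    """Return the paths at or under `limit` bytes, warning about the rest."""
    kept = []
    for path in paths:
        size = Path(path).stat().st_size
        if size > limit:
            # Warn on stderr so stdout stays clean for piping.
            print(
                f"WARNING: skipping {path}: {size} bytes exceeds "
                f"limit of {limit} bytes",
                file=sys.stderr,
            )
            continue
        kept.append(path)
    return kept
```

Reporting both the actual size and the limit lets the user decide whether to raise the cap or split the file.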

Tests

Three normalize tests + the oversized warning test cover the main cases. The patch.object(convo_miner, "MAX_FILE_SIZE", 1) pattern for the warning test is the right way to trigger the condition without writing a 100MB file.
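
As a self-contained illustration of that patching pattern, the sketch below uses a `SimpleNamespace` as a stand-in for the `convo_miner` module. The helper `is_within_limit` and the test name are hypothetical; the real test in the PR may be structured differently.

```python
import io
import sys
import types
from contextlib import redirect_stderr
from unittest.mock import patch

# Stand-in for the real module, so the example runs on its own.
convo_miner = types.SimpleNamespace(MAX_FILE_SIZE=100 * 1024 * 1024)


def is_within_limit(size):
    """Hypothetical check that reads the limit at call time."""
    if size > convo_miner.MAX_FILE_SIZE:
        print(f"WARNING: {size} bytes exceeds limit", file=sys.stderr)
        return False
    return True


def test_warns_on_oversized_file():
    captured = io.StringIO()
    # Shrink the limit to 1 byte for the duration of the block only.
    with patch.object(convo_miner, "MAX_FILE_SIZE", 1), redirect_stderr(captured):
        accepted = is_within_limit(2)
    assert accepted is False
    assert "exceeds" in captured.getvalue()
    # patch.object restores the original value on exit.
    assert convo_miner.MAX_FILE_SIZE == 100 * 1024 * 1024
```

The pattern only works when the code under test looks the constant up on the module at call time; a value captured earlier, such as a default argument, would not see the patch.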

One minor note: the test_scan_default_limit_accepts_typical_claude_ai_export test just checks MAX_FILE_SIZE >= 50MB — that will pass even if someone accidentally sets it to 51MB. It is useful as a regression guard for the constant though.

This closes #602 cleanly. The original report mentioned "10MB" as the issue and both root causes (silent skip + schema mismatch) are addressed.

LGTM. Approving.

@bensig bensig changed the base branch from main to develop April 11, 2026 22:21
@bensig bensig requested a review from igorls as a code owner April 11, 2026 22:21
@igorls igorls added the area/mining File and conversation mining label Apr 14, 2026
Labels

area/mining File and conversation mining

Development

Successfully merging this pull request may close these issues.

mine --mode convos silently skips claude.ai exports due to sender/role field mismatch and 10MB file size limit

4 participants