Support claude.ai privacy export format with sender field#605
carlito1979 wants to merge 2 commits into MemPalace:develop
Two bugs caused `mine --mode convos` to silently file zero drawers from
claude.ai privacy exports:
1. `_try_claude_ai_json` only looked at `role`, but the privacy export
uses `sender` ("human" / "assistant"). Now accepts either field, and
falls back to the message's top-level `text` when the structured
`content` blocks yield nothing.
2. `convo_miner.MAX_FILE_SIZE` was 10 MB while real claude.ai exports
routinely run 20–50 MB, so `conversations.json` was dropped before
parsing with no diagnostic. The default cap is now 100 MB and
oversize files emit a visible warning to stderr.
Adds unit tests covering the `sender` field, the `text` fallback, and
the new oversize-file warning.
web3guru888 left a comment
This is exactly the fix I described in #602 — clean, well-scoped, and the implementation is more thorough than the minimal one-liner I suggested.
_extract_claude_ai_message() helper — the right abstraction
Factoring this out as a dedicated helper (rather than inline item.get("role") or item.get("sender")) is the correct choice. Both the flat and nested paths now share the same extraction logic, which eliminates the class of "fix one path and miss the other" bug that would have happened with a patch approach.
The fallback chain is right:
- Try `role` first (legacy exports)
- Fall back to `sender` (current privacy exports)
- Try structured `content` blocks
- Fall back to top-level `text`
That ordering handles all known format versions and will degrade gracefully for future variations.
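For readers who have not opened the diff, the fallback chain can be sketched roughly as follows. This is an illustration only, not the PR's actual `_extract_claude_ai_message()`: the function name `extract_message_sketch`, the `_ROLE_ALIASES` mapping, and the assumed shape of a `content` block (a dict with a `text` key) are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical mapping; the real helper may normalize roles differently.
_ROLE_ALIASES = {"human": "user", "user": "user", "assistant": "assistant"}


def extract_message_sketch(item: dict) -> Optional[Tuple[str, str]]:
    """Return (role, text) for one exported message, or None if unusable."""
    # Steps 1-2: prefer `role`, then fall back to `sender`.
    raw_role = item.get("role") or item.get("sender")
    if not isinstance(raw_role, str):
        return None
    role = _ROLE_ALIASES.get(raw_role)
    if role is None:
        return None

    # Step 3: collect text from structured `content` blocks, if present.
    parts: List[str] = []
    content = item.get("content")
    if isinstance(content, str):
        parts.append(content)
    elif isinstance(content, list):
        for block in content:
            if isinstance(block, str):
                parts.append(block)
            elif isinstance(block, dict) and isinstance(block.get("text"), str):
                parts.append(block["text"])
    text = "\n".join(p for p in parts if p.strip()).strip()

    # Step 4: fall back to the message's top-level `text` field.
    if not text:
        top_level = item.get("text")
        text = top_level.strip() if isinstance(top_level, str) else ""

    return (role, text) if text else None
```

Because both the flat and nested parsing paths would call one function like this, a future schema change only needs to be handled in one place.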
MAX_FILE_SIZE = 100MB
In #602 I suggested 50MB, but 100MB is reasonable — claude.ai exports scale with conversation volume and 50MB is already uncomfortably close to the observed 38MB exports being reported. 100MB gives headroom without being reckless. The comment "20–50 MB JSON files" is accurate and useful context.
Warning on stderr rather than stdout
Correct: file=sys.stderr keeps stdout clean for piping. The warning format is human-readable and includes both actual size and limit, which is what users need to understand the skip.
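A minimal sketch of the size check and stderr warning described above, assuming a hypothetical helper named `within_size_limit`; the message wording and the function itself are illustrative and are not taken from `convo_miner`.

```python
import os
import sys
from typing import Optional

MAX_FILE_SIZE = 100 * 1024 * 1024  # 100 MB, the new default named in this PR


def within_size_limit(path: str, limit: Optional[int] = None) -> bool:
    """Return True if the file fits under the cap; warn on stderr otherwise."""
    # Read the module-level constant at call time so tests can patch it.
    cap = MAX_FILE_SIZE if limit is None else limit
    size = os.path.getsize(path)
    if size > cap:
        # stderr keeps stdout clean for piping; report both size and limit.
        print(
            f"WARNING: skipping {path}: {size / 1e6:.1f} MB exceeds "
            f"limit of {cap / 1e6:.1f} MB",
            file=sys.stderr,
        )
        return False
    return True
```

Reading the cap inside the function body, rather than binding it as a default argument, is what lets a test override the constant without touching the call site.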
Tests
Three normalize tests + the oversized warning test cover the main cases. The patch.object(convo_miner, "MAX_FILE_SIZE", 1) pattern for the warning test is the right way to trigger the condition without writing a 100MB file.
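The `patch.object` pattern can be illustrated with a self-contained sketch. Here `fake_miner` is a hypothetical stand-in for the real `convo_miner` module and `check_size` is an invented function; the PR's actual test and its assertions may look different (for instance, it may use pytest's `capsys` fixture rather than `redirect_stderr`).

```python
import io
import sys
import types
from contextlib import redirect_stderr
from unittest.mock import patch

# Hypothetical stand-in for the real `convo_miner` module.
fake_miner = types.SimpleNamespace(MAX_FILE_SIZE=100 * 1024 * 1024)


def check_size(size_bytes: int) -> bool:
    """Invented size gate that reads the cap from the stand-in module."""
    if size_bytes > fake_miner.MAX_FILE_SIZE:
        print(
            f"skipping: {size_bytes} bytes exceeds {fake_miner.MAX_FILE_SIZE}",
            file=sys.stderr,
        )
        return False
    return True


def test_oversize_warning_sketch() -> None:
    buf = io.StringIO()
    # Shrink the cap to 1 byte so a 2-byte "file" counts as oversize.
    with patch.object(fake_miner, "MAX_FILE_SIZE", 1), redirect_stderr(buf):
        accepted = check_size(2)
    assert accepted is False
    assert "exceeds" in buf.getvalue()
    # patch.object restores the original value when the block exits.
    assert fake_miner.MAX_FILE_SIZE == 100 * 1024 * 1024
```

The temporary override avoids creating a 100 MB fixture while still exercising the same comparison and warning path.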
One minor note: the test_scan_default_limit_accepts_typical_claude_ai_export test just checks MAX_FILE_SIZE >= 50MB — that will pass even if someone accidentally sets it to 51MB. It is useful as a regression guard for the constant though.
This closes #602 cleanly. The original report mentioned "10MB" as the issue and both root causes (silent skip + schema mismatch) are addressed.
LGTM. Approving.
Fixes #602
What does this PR do?
Extends Claude.ai JSON export support to handle the actual privacy export format, which uses `sender` instead of `role` and stores rendered messages in a top-level `text` field alongside structured `content` blocks.
Key changes:
- `_extract_claude_ai_message()` helper that:
  - accepts both `role` and `sender` fields for author identification
  - falls back to the top-level `text` field when `content` blocks are empty
- Raises `MAX_FILE_SIZE` from 10 MB to 100 MB to accommodate typical claude.ai privacy exports (20–50 MB)
How to test
Run the test suite:
Checklist
- Tests pass (`python -m pytest tests/ -v`)
- Lint passes (`ruff check .`)

https://claude.ai/code/session_01GUH8MeAt6jAjKpbQ227AcC