feat: add Cursor agent transcript JSONL normalizer #232

Closed
marerem wants to merge 2 commits into MemPalace:main from marerem:feat/cursor-jsonl-normalizer

@marerem marerem commented Apr 8, 2026

Summary

Adds _try_cursor_jsonl() parser to normalize.py for importing Cursor IDE agent transcript sessions, addressing the Cursor portion of #59.

  • Format: JSONL files at ~/.cursor/projects/<project>/agent-transcripts/<uuid>/<uuid>.jsonl
  • Line schema: {"role": "user"|"assistant", "message": {"content": [{"type": "text", "text": "..."}]}}
  • Detection: Requires top-level role key (not type) and list-typed content blocks, preventing false positives against Claude Code JSONL and Codex JSONL
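The per-line detection rule above can be sketched as follows. This is a minimal illustration, not the PR's `_try_cursor_jsonl()`: the helper name `looks_like_cursor_line` is hypothetical, and it checks a single line only, whereas the real parser decides over the whole file.

```python
import json

ROLES = {"user", "assistant"}


def looks_like_cursor_line(line: str) -> bool:
    """Return True if one JSONL line has the Cursor transcript shape.

    Hypothetical helper for illustration; not part of mempalace.
    """
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return False  # malformed lines never match
    if not isinstance(entry, dict) or entry.get("role") not in ROLES:
        return False  # Claude Code / Codex lines carry "type", not "role"
    message = entry.get("message")
    if not isinstance(message, dict):
        return False
    # Cursor stores content as a list of typed blocks, not a plain string.
    return isinstance(message.get("content"), list)


cursor_line = '{"role": "user", "message": {"content": [{"type": "text", "text": "hi"}]}}'
claude_code_line = '{"type": "human", "message": {"content": "hi"}}'
print(looks_like_cursor_line(cursor_line))       # True
print(looks_like_cursor_line(claude_code_line))  # False
```

Requiring both the `role` key and list-typed content is what keeps a generic role-based JSONL file with plain-string content from matching.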

Why JSONL only (no SQLite)?

Cursor's state.vscdb SQLite database (~/Library/Application Support/Cursor/User/workspaceStorage/<hash>/state.vscdb) was investigated. It stores composer.composerData (conversation metadata only — IDs, names, timestamps) and aiService.prompts (user prompts without assistant responses). Since there are no paired conversation turns in the SQLite format, the agent transcript JSONL is the only format with complete conversations suitable for MemPalace ingestion.

Changes

| File | Change |
| --- | --- |
| mempalace/normalize.py | Add _try_cursor_jsonl() parser + wire into _try_normalize_json() dispatch chain + update module docstring |
| tests/test_normalize.py | Add 8 test cases: basic, multi-turn, empty content, single-message rejection, Claude Code JSONL rejection, multi-block content, malformed lines, plain-string content rejection |

Test plan

  • All 109 existing + new tests pass (pytest tests/ -v)
  • Zero Ruff lint errors
  • Reviewer: verify parser ordering doesn't interfere with existing Claude Code / Codex JSONL detection
  • Reviewer: confirm has_cursor_structure guard is sufficient to prevent false positives

Closes #59 (Cursor portion)

Add _try_cursor_jsonl() parser for Cursor IDE agent transcript files
(~/.cursor/projects/<proj>/agent-transcripts/<uuid>.jsonl).

Each JSONL line follows {"role": "user"|"assistant", "message":
{"content": [{"type": "text", "text": "..."}]}}. The parser is
discriminated from Claude Code JSONL (top-level "type" key) and
Codex JSONL ("session_meta"/"event_msg" wrappers) via a
has_cursor_structure guard that requires list-typed content blocks.

Includes 8 test cases covering multi-turn conversations, empty
content handling, malformed line tolerance, and false-positive
rejection for other JSONL formats.

Closes MemPalace#59 (Cursor portion)

Made-with: Cursor

bgauryy commented Apr 8, 2026

PR Review: feat: add Cursor agent transcript JSONL normalizer

Executive Summary

| Aspect | Value |
| --- | --- |
| PR Goal | Add _try_cursor_jsonl() parser to import Cursor IDE agent transcript sessions into the mempalace normalization pipeline |
| Files Changed | 2 |
| Risk Level | 🟢 LOW - Pure additive change, reuses existing helpers, well-guarded format discrimination |
| Review Effort | 2 - Clean, focused feature with thorough tests |
| Recommendation | ✅ APPROVE |

Affected Areas: mempalace/normalize.py (parser + dispatch), tests/test_normalize.py (9 new test functions)

Business Impact: Enables mining Cursor IDE conversations into the palace, expanding the supported conversation sources. Addresses part of #59.

Flow Changes: New parser slot in _try_normalize_json() dispatch chain — runs after _try_codex_jsonl, before json.loads(). No existing flows altered.

Ratings

| Aspect | Score |
| --- | --- |
| Correctness | 5/5 |
| Security | 5/5 |
| Performance | 5/5 |
| Maintainability | 4/5 |

PR Health

High Priority Issues

None.

Medium Priority Issues

None.

Low Priority Issues

🐛 #1: has_cursor_structure flag can be set by empty-text entries

Location: mempalace/normalize.py:~178 (new code) | Confidence: ⚠️ MED

The has_cursor_structure = True flag is set whenever raw_content is a list, regardless of whether the entry actually contributes text to messages. In a pathological input where the only list-content entry has empty text (so it gets skipped), but two string-content entries survive, the parser would match — even though no list-typed entry actually contributed to the output.

Practically near-impossible with real Cursor transcripts, but the flag could be tightened:

-        if isinstance(raw_content, list):
-            has_cursor_structure = True
-
         text = _extract_content(raw_content)
         if not text:
             continue
+        if isinstance(raw_content, list):
+            has_cursor_structure = True
         messages.append((role, text))

This moves the flag after the emptiness check, so only entries that actually produce output set it.
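To make the pathological case concrete, here is a self-contained sketch of the tightened guard. The names `extract_text` and `parse_cursor_jsonl` are hypothetical stand-ins for `_extract_content()` and `_try_cursor_jsonl()`; the real helpers in mempalace/normalize.py may behave differently, so treat this as an illustration of the flag ordering only.

```python
import json


def extract_text(raw_content):
    """Stand-in for _extract_content (assumed behavior): join text blocks."""
    if isinstance(raw_content, str):
        return raw_content.strip()
    if isinstance(raw_content, list):
        parts = []
        for block in raw_content:
            if isinstance(block, dict) and block.get("type") == "text":
                text = block.get("text")
                if isinstance(text, str) and text.strip():
                    parts.append(text.strip())
        return " ".join(parts)
    return ""


def parse_cursor_jsonl(content):
    """Return [(role, text), ...] or None if the input is not Cursor-shaped."""
    messages = []
    has_cursor_structure = False
    for line in content.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate malformed lines
        if not isinstance(entry, dict):
            continue
        role = entry.get("role", "")
        message = entry.get("message")
        if role not in ("user", "assistant") or not isinstance(message, dict):
            continue
        raw_content = message.get("content")
        text = extract_text(raw_content)
        if not text:
            continue
        # Tightened: set the flag only for list content that produced text.
        if isinstance(raw_content, list):
            has_cursor_structure = True
        messages.append((role, text))
    if len(messages) >= 2 and has_cursor_structure:
        return messages
    return None


pathological = "\n".join(json.dumps(e) for e in [
    {"role": "user", "message": {"content": [{"type": "text", "text": ""}]}},
    {"role": "user", "message": {"content": "Q"}},
    {"role": "assistant", "message": {"content": "A"}},
])
print(parse_cursor_jsonl(pathological))  # None
```

With the flag set before the emptiness check, the same input would have matched on the strength of an entry that contributed nothing to the output.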


🎨 #2: Missing test for entries with role but no message key

Location: tests/test_normalize.py (new tests section) | Confidence: ✅ HIGH

The parser correctly handles entries like {"role": "user", "other_key": "data"} (no message) — entry.get("message") returns None, isinstance(None, dict) is False, entry is skipped. However, no test covers this edge case. A brief test would document the contract:

def test_cursor_jsonl_skips_entries_without_message_key():
    """Entries with role but no message dict are skipped."""
    path = _write_jsonl([
        {"role": "user"},
        {"role": "user", "message": {"content": [{"type": "text", "text": "Q"}]}},
        {"role": "assistant", "message": {"content": [{"type": "text", "text": "A"}]}},
    ])
    result = normalize(path)
    assert "> Q" in result
    assert "A" in result
    os.unlink(path)

🔗 #3: Discrimination test couples to Claude Code parser output

Location: tests/test_normalize.py:test_cursor_jsonl_ignores_non_cursor_jsonl | Confidence: ⚠️ MED

The test asserts "> Hi from Claude Code" in result — verifying the Claude Code parser picks up the input after the Cursor parser rejects it. This is useful as an integration test but means a future change to _try_claude_code_jsonl output formatting could break this test even though the Cursor parser is correct. Consider splitting into a focused unit assertion (Cursor returns None) and a separate integration check.


Flow Impact Analysis

normalize(filepath)
  └─ _try_normalize_json(content)
       ├─ _try_claude_code_jsonl(content)   ← uses "type" key → no overlap
       ├─ _try_codex_jsonl(content)          ← uses "event_msg" / "session_meta" → no overlap
       ├─ _try_cursor_jsonl(content)         ← NEW: uses "role" + message.content as list
       ├─ json.loads(content) → ...
       │    ├─ _try_claude_ai_json(data)
       │    ├─ _try_chatgpt_json(data)
       │    └─ _try_slack_json(data)
       └─ return None

Discrimination matrix (why parsers don't interfere):

| Format | Key used | Cursor parser sees | Result |
| --- | --- | --- | --- |
| Claude Code JSONL | type: "human" | role = "" → skip all | ✅ No match |
| Codex JSONL | event_msg / session_meta | role = "" → skip all | ✅ No match |
| Cursor JSONL | role: "user" + list content | Matches correctly | ✅ Match |
| Claude Code → Cursor? | Has type, not role | All entries skipped | ✅ Safe |
| Cursor → Claude Code? | Has role, not type | All entries skipped | ✅ Safe |

No cross-contamination risk between any parser pair.
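The first-match-wins dispatch described in the flow and matrix above can be sketched with simplified stand-in parsers. Everything here is illustrative: the function names are hypothetical, each stand-in returns only a format label rather than a transcript, and the assumption that Codex lines carry "event_msg" / "session_meta" as the value of a top-level "type" key is mine, not something the PR states.

```python
import json


def iter_entries(content):
    """Yield dict entries from JSONL text, skipping blank or malformed lines."""
    for line in content.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(entry, dict):
            yield entry


def try_claude_code(content):
    # Stand-in: Claude Code entries are keyed by a top-level "type".
    hits = [e for e in iter_entries(content) if e.get("type") in ("human", "assistant")]
    return "claude_code" if len(hits) >= 2 else None


def try_codex(content):
    # Stand-in: assumes wrapper names appear as the top-level "type" value.
    hits = [e for e in iter_entries(content) if e.get("type") in ("event_msg", "session_meta")]
    return "codex" if hits else None


def try_cursor(content):
    # Stand-in: top-level "role" plus list-typed message.content.
    hits = [
        e for e in iter_entries(content)
        if e.get("role") in ("user", "assistant")
        and isinstance(e.get("message"), dict)
        and isinstance(e["message"].get("content"), list)
    ]
    return "cursor" if len(hits) >= 2 else None


PARSERS = (try_claude_code, try_codex, try_cursor)  # same order as the chain


def detect_format(content):
    """Return the label of the first parser that accepts the content."""
    for parser in PARSERS:
        result = parser(content)
        if result is not None:
            return result
    return None


cursor_sample = "\n".join(json.dumps(e) for e in [
    {"role": "user", "message": {"content": [{"type": "text", "text": "Q"}]}},
    {"role": "assistant", "message": {"content": [{"type": "text", "text": "A"}]}},
])
print(detect_format(cursor_sample))  # cursor
```

Because each stand-in keys on a different top-level field, the order of the first three parsers does not change the outcome for well-formed inputs, which is the property the matrix is asserting.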

Strengths

  • Clean discrimination: The role vs type distinction is simple and bulletproof
  • has_cursor_structure flag: Smart heuristic — requires list-typed content to prevent matching generic role-based JSONL
  • Helper reuse: Leverages existing _extract_content() and _messages_to_transcript() — zero duplication of output logic
  • Excellent test coverage: 9 tests covering happy path, multi-turn, empty content, single message rejection, format discrimination, multiple content blocks, malformed lines, and string-vs-list content
  • Consistent style: Follows the exact same pattern as _try_claude_code_jsonl and _try_codex_jsonl

Created by Octocode MCP https://octocode.ai 🔍🐙

- Set has_cursor_structure only after non-empty text from list-typed content
- Add test for role entries without message; unit test that Claude Code JSONL
  does not match _try_cursor_jsonl; relax integration assertion on format

Made-with: Cursor

@web3guru888 web3guru888 left a comment

Clean implementation. A few observations from working with this format in production:

Detection logic is solid. The dual guard (role in {"user","assistant"} + isinstance(raw_content, list)) correctly discriminates against Claude Code JSONL (top-level type) and Codex JSONL (event_msg wrapper). The has_cursor_structure flag requiring at least one list-typed content block is a good extra check before committing.

The SQLite investigation note is useful context. The explanation of why state.vscdb is incomplete (no paired turns in aiService.prompts, only prompts without responses) saves future contributors from going down that path again. Consider keeping a brief note in the docstring or a comment.

Missing edge case: tool/image content blocks. Cursor agent transcripts can include non-text content blocks ({"type": "tool_result", ...}, {"type": "image", ...}). The current _extract_content will silently skip them (correct behavior), but it's worth a test confirming that a turn containing only non-text blocks doesn't produce a phantom message.

Test coverage is thorough: malformed lines, single-message files, wrong schema, multi-block content. The integration check is a nice touch.

One nit: the _try_cursor_jsonl import in the test file suggests it's part of the public API. If that's intentional (for testing), it's fine; just worth noting that it's now a surface others might rely on.

Overall this is a clean, focused addition that fills a real gap for teams using Cursor. The >=2 messages + list content heuristic is conservative enough to avoid false positives.

@web3guru888

(Addendum to my review above — the shell ate the backtick formatting, sorry for the noise. The cleaned-up version:)

Detection logic is solid. The dual guard (role in {"user","assistant"} + isinstance(raw_content, list)) correctly discriminates against Claude Code JSONL (top-level type) and Codex JSONL (event_msg wrapper). The has_cursor_structure flag requiring at least one list-typed content block is a good extra check before committing.

Missing edge case worth testing: tool/image content blocks. Cursor transcripts can include non-text blocks ({"type": "tool_result", ...}, {"type": "image", ...}). The current _extract_content will silently skip these (correct), but worth adding a test that a turn containing only non-text blocks doesn't produce a phantom empty message in the output.
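A test along these lines might look like the sketch below. It is self-contained for illustration: `text_of` and `collect_messages` are hypothetical stand-ins for `_extract_content()` and the parser under test, and the field layout of the tool_result and image blocks is assumed. In the real suite the same input would presumably go through `normalize()` via the existing `_write_jsonl` helper.

```python
import json


def text_of(blocks):
    """Join text from "text" blocks; other block types contribute nothing."""
    parts = []
    for block in blocks:
        if isinstance(block, dict) and block.get("type") == "text":
            text = block.get("text")
            if isinstance(text, str) and text.strip():
                parts.append(text.strip())
    return " ".join(parts)


def collect_messages(lines):
    """Stand-in parser: keep only turns that yield non-empty text."""
    messages = []
    for line in lines:
        entry = json.loads(line)
        content = entry.get("message", {}).get("content")
        if not isinstance(content, list):
            continue
        text = text_of(content)
        if text:  # a tool-only or image-only turn is dropped here
            messages.append((entry["role"], text))
    return messages


def test_non_text_only_turn_produces_no_message():
    lines = [json.dumps(e) for e in [
        {"role": "user", "message": {"content": [{"type": "text", "text": "Run the tests"}]}},
        # Assumed block shapes; only the "type" values come from the review.
        {"role": "assistant", "message": {"content": [{"type": "tool_result", "content": "ok"}]}},
        {"role": "assistant", "message": {"content": [{"type": "image", "source": "omitted"}]}},
        {"role": "assistant", "message": {"content": [{"type": "text", "text": "All green"}]}},
    ]]
    assert collect_messages(lines) == [
        ("user", "Run the tests"),
        ("assistant", "All green"),
    ]


test_non_text_only_turn_produces_no_message()
```

The assertion on the exact message list is what catches a phantom entry: an empty assistant turn would show up as an extra tuple.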

The SQLite investigation note is useful context. The explanation of why state.vscdb is incomplete (no paired turns in aiService.prompts) saves future contributors from going down that path. Consider preserving it in the docstring.

Test coverage otherwise looks thorough — malformed lines, single-message files, wrong schema, multi-block content all covered. The >=2 messages + list content heuristic is conservative enough to avoid false positives.

@bensig
Collaborator

bensig commented Apr 11, 2026

Conflicts with main. Cursor JSONL normalization is being addressed in #287. Thanks.

@bensig bensig closed this Apr 11, 2026
bensig added a commit that referenced this pull request Apr 18, 2026
Draft plugin specification for source adapters, mirroring RFC 001's
role for storage backends. Formalizes the contract six community
ingester PRs (#274, #23, #169, #232, #567, #98, #702) plus #981's
metadata-only mode have been reinventing ad-hoc, so adapter authors
can build to a stable surface.

Key decisions:
- Single ingest() method; lazy adapters yield SourceItemMetadata
  ahead of drawers, eager adapters interleave
- Declared-transformation model (§1.4) replaces informal verbatim
  promise with a verifiable one; byte_preserving adapters declare
  the empty set, declared_lossy adapters enumerate. Existing
  miner.py and the convo_miner+normalize pipeline map cleanly
- Palace is the incremental cursor via is_current(item, metadata);
  no sidecar persistence
- Routing is adapter-owned; detect_room/detect_hall move into the
  filesystem adapter
- Flat metadata per ChromaDB (RFC 001 §1.4) — entity hints as
  json_string field, KG triples route to SQLite knowledge graph
- Closets stay core-built as a post-step; adapters may emit flat
  closet_hints. Closes existing gap where convo drawers get no
  closets
- No per-drawer field renames: source_file, filed_at, source_mtime,
  added_by, normalize_version, entities, ingest_mode all preserved.
  Spec adds adapter_name, adapter_version, privacy_class

§9 enumerates the cleanup PR prerequisites (mempalace/sources/
module, PalaceContext facade, KnowledgeGraph.add_triple gaining
backwards-compatible source_drawer_id + adapter_name params).

Tracking issue: #989
jphein pushed a commit to jphein/mempalace that referenced this pull request Apr 30, 2026
…Code, MemPalace#274/MemPalace#232 Cursor, MemPalace#169 Pi, MemPalace#702 Cursor+factory.ai)

Updates the multi-agent-support bullet to cite the actual upstream
work instead of just gesturing at it. RFC 002 itself is PR MemPalace#990
(tracking issue MemPalace#989). Existing third-party prototypes already
proposed against the spec:

* OpenCode SQLite — PR MemPalace#23
* Cursor SQLite — issue MemPalace#274
* Cursor JSONL (earlier variant) — PR MemPalace#232
* Pi agent JSONL — PR MemPalace#169
* Combined Cursor + factory.ai — PR MemPalace#702

Each becomes a mempalace-source-<agent> package once RFC 002 lands.
Names the path explicitly: fork unblocks the pattern by helping land
RFC 002; per-agent adapter PRs land from their respective authors.

Aider, Gemini CLI, Codex CLI, and Warp are roadmap targets without
existing adapter PRs and are listed as such (no fabricated PR refs).

https://claude.ai/code/session_01GvwducFnFtN8KYmfbWKMR6

Development

Successfully merging this pull request may close these issues.

feat: add import support for more AI tool session formats (Cursor, Copilot, Codex, Windsurf, Aider, etc.)

4 participants