feat: add Cursor agent transcript JSONL normalizer by marerem · Pull Request #232 · MemPalace/mempalace

marerem · 2026-04-08T10:16:35Z

Summary

Adds _try_cursor_jsonl() parser to normalize.py for importing Cursor IDE agent transcript sessions, addressing the Cursor portion of #59.

Format: JSONL files at ~/.cursor/projects/<project>/agent-transcripts/<uuid>/<uuid>.jsonl
Line schema: {"role": "user"|"assistant", "message": {"content": [{"type": "text", "text": "..."}]}}
Detection: Requires top-level role key (not type) and list-typed content blocks, preventing false positives against Claude Code JSONL and Codex JSONL
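To make the line schema and detection rule above concrete, here is a minimal sketch of a parser with the same shape. It is not the PR's code: `parse_cursor_jsonl` and `extract_text` are hypothetical names, and the "at least two messages" threshold is taken from the review discussion rather than from the diff.

```python
import json
from typing import List, Optional, Tuple


def extract_text(raw_content) -> str:
    """Hypothetical helper: join "text" blocks, or pass a plain string through."""
    if isinstance(raw_content, str):
        return raw_content.strip()
    if isinstance(raw_content, list):
        texts = [
            block["text"].strip()
            for block in raw_content
            if isinstance(block, dict)
            and block.get("type") == "text"
            and isinstance(block.get("text"), str)
        ]
        return " ".join(t for t in texts if t)
    return ""


def parse_cursor_jsonl(content: str) -> Optional[List[Tuple[str, str]]]:
    """Return (role, text) pairs, or None if this does not look like Cursor JSONL."""
    messages = []
    saw_list_content = False
    for line in content.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate malformed lines
        if not isinstance(entry, dict):
            continue
        role = entry.get("role")  # top-level "role", not "type"
        message = entry.get("message")
        if role not in ("user", "assistant") or not isinstance(message, dict):
            continue
        raw_content = message.get("content")
        text = extract_text(raw_content)
        if not text:
            continue
        if isinstance(raw_content, list):
            saw_list_content = True
        messages.append((role, text))
    # Assumed acceptance rule: two or more turns and at least one list-typed block.
    if len(messages) < 2 or not saw_list_content:
        return None
    return messages
```

Lines keyed by `type` (Claude Code) or wrapped in `event_msg` / `session_meta` (Codex) never reach the message list in this sketch, because they carry no top-level `role`.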

Why JSONL only (no SQLite)?

Cursor's state.vscdb SQLite database (~/Library/Application Support/Cursor/User/workspaceStorage/<hash>/state.vscdb) was investigated. It stores composer.composerData (conversation metadata only — IDs, names, timestamps) and aiService.prompts (user prompts without assistant responses). Since there are no paired conversation turns in the SQLite format, the agent transcript JSONL is the only format with complete conversations suitable for MemPalace ingestion.

Changes

| File | Change |
| --- | --- |
| `mempalace/normalize.py` | Add `_try_cursor_jsonl()` parser + wire into `_try_normalize_json()` dispatch chain + update module docstring |
| `tests/test_normalize.py` | Add 8 test cases: basic, multi-turn, empty content, single-message rejection, Claude Code JSONL rejection, multi-block content, malformed lines, plain-string content rejection |

Test plan

All 109 existing + new tests pass (pytest tests/ -v)
Zero Ruff lint errors
Reviewer: verify parser ordering doesn't interfere with existing Claude Code / Codex JSONL detection
Reviewer: confirm has_cursor_structure guard is sufficient to prevent false positives

Closes #59 (Cursor portion)

Add _try_cursor_jsonl() parser for Cursor IDE agent transcript files (~/.cursor/projects/<proj>/agent-transcripts/<uuid>.jsonl). Each JSONL line follows {"role": "user"|"assistant", "message": {"content": [{"type": "text", "text": "..."}]}}.

The parser is discriminated from Claude Code JSONL (top-level "type" key) and Codex JSONL ("session_meta"/"event_msg" wrappers) via a has_cursor_structure guard that requires list-typed content blocks.

Includes 8 test cases covering multi-turn conversations, empty content handling, malformed line tolerance, and false-positive rejection for other JSONL formats.

Closes MemPalace#59 (Cursor portion)

Made-with: Cursor

bgauryy · 2026-04-08T16:33:48Z

PR Review: feat: add Cursor agent transcript JSONL normalizer

Executive Summary

| Aspect | Value |
| --- | --- |
| PR Goal | Add `_try_cursor_jsonl()` parser to import Cursor IDE agent transcript sessions into the mempalace normalization pipeline |
| Files Changed | 2 |
| Risk Level | 🟢 LOW - Pure additive change, reuses existing helpers, well-guarded format discrimination |
| Review Effort | 2 - Clean, focused feature with thorough tests |
| Recommendation | ✅ APPROVE |

Affected Areas: mempalace/normalize.py (parser + dispatch), tests/test_normalize.py (9 new test functions)

Business Impact: Enables mining Cursor IDE conversations into the palace, expanding the supported conversation sources. Addresses part of #59.

Flow Changes: New parser slot in _try_normalize_json() dispatch chain — runs after _try_codex_jsonl, before json.loads(). No existing flows altered.

Ratings

| Aspect | Score |
| --- | --- |
| Correctness | 5/5 |
| Security | 5/5 |
| Performance | 5/5 |
| Maintainability | 4/5 |

PR Health

Has clear description
References ticket/issue (feat: add import support for more AI tool session formats (Cursor, Copilot, Codex, Windsurf, Aider, etc.) #59)
Appropriate size (163 additions, 2 files)
Has relevant tests (9 test functions)

High Priority Issues

None.

Medium Priority Issues

None.

Low Priority Issues

🐛 #1: `has_cursor_structure` flag can be set by empty-text entries

Location: mempalace/normalize.py:~178 (new code) | Confidence: ⚠️ MED

The has_cursor_structure = True flag is set whenever raw_content is a list, regardless of whether the entry actually contributes text to messages. In a pathological input where the only list-content entry has empty text (so it gets skipped), but two string-content entries survive, the parser would match — even though no list-typed entry actually contributed to the output.

Practically near-impossible with real Cursor transcripts, but the flag could be tightened:

-        if isinstance(raw_content, list):
-            has_cursor_structure = True
-
         text = _extract_content(raw_content)
         if not text:
             continue
+        if isinstance(raw_content, list):
+            has_cursor_structure = True
         messages.append((role, text))

This moves the flag after the emptiness check, so only entries that actually produce output set it.
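The effect of the two placements can be checked on the pathological input described above. This is a self-contained sketch under stated assumptions: `extract_text` is a hypothetical stand-in for `_extract_content`, `accepts` is an illustrative name, and the two-message threshold is assumed from the review discussion.

```python
def extract_text(raw_content) -> str:
    """Hypothetical stand-in for _extract_content: text blocks or plain strings."""
    if isinstance(raw_content, str):
        return raw_content.strip()
    if isinstance(raw_content, list):
        texts = [
            block["text"].strip()
            for block in raw_content
            if isinstance(block, dict)
            and block.get("type") == "text"
            and isinstance(block.get("text"), str)
        ]
        return " ".join(t for t in texts if t)
    return ""


def accepts(entries, flag_after_empty_check: bool) -> bool:
    """Return True if (role, raw_content) pairs would be accepted as Cursor JSONL."""
    messages = []
    has_cursor_structure = False
    for role, raw_content in entries:
        is_list = isinstance(raw_content, list)
        if is_list and not flag_after_empty_check:
            has_cursor_structure = True  # original placement: before the emptiness check
        text = extract_text(raw_content)
        if not text:
            continue
        if is_list and flag_after_empty_check:
            has_cursor_structure = True  # suggested placement: only for contributing entries
        messages.append((role, text))
    return has_cursor_structure and len(messages) >= 2


# The only list-typed entry is empty; two plain-string entries survive.
pathological = [
    ("user", [{"type": "text", "text": ""}]),
    ("user", "plain-string question"),
    ("assistant", "plain-string answer"),
]
print(accepts(pathological, flag_after_empty_check=False))  # True
print(accepts(pathological, flag_after_empty_check=True))   # False
```

A genuine transcript, where the list-typed entries carry the text, is accepted under either placement, so the change only affects the pathological case.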

🎨 #2: Missing test for entries with `role` but no `message` key

Location: tests/test_normalize.py (new tests section) | Confidence: ✅ HIGH

The parser correctly handles entries like {"role": "user", "other_key": "data"} (no message) — entry.get("message") returns None, isinstance(None, dict) is False, entry is skipped. However, no test covers this edge case. A brief test would document the contract:

def test_cursor_jsonl_skips_entries_without_message_key():
    """Entries with role but no message dict are skipped."""
    path = _write_jsonl([
        {"role": "user"},
        {"role": "user", "message": {"content": [{"type": "text", "text": "Q"}]}},
        {"role": "assistant", "message": {"content": [{"type": "text", "text": "A"}]}},
    ])
    result = normalize(path)
    assert "> Q" in result
    assert "A" in result
    os.unlink(path)

🔗 #3: Discrimination test couples to Claude Code parser output

Location: tests/test_normalize.py:test_cursor_jsonl_ignores_non_cursor_jsonl | Confidence: ⚠️ MED

The test asserts "> Hi from Claude Code" in result — verifying the Claude Code parser picks up the input after the Cursor parser rejects it. This is useful as an integration test but means a future change to _try_claude_code_jsonl output formatting could break this test even though the Cursor parser is correct. Consider splitting into a focused unit assertion (Cursor returns None) and a separate integration check.
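One way to picture the suggested split, as a sketch only: the parsers below are simplified stand-ins written for this example (not the functions in `mempalace/normalize.py`), the Claude Code line shape is reduced to the `type` key mentioned in this review, and `normalize_text` is a hypothetical two-step dispatch chain.

```python
import json
from typing import Optional


def _entries(content: str):
    """Yield dict entries from JSONL text, skipping malformed lines."""
    for line in content.splitlines():
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(entry, dict):
            yield entry


def try_claude_code_jsonl(content: str) -> Optional[str]:
    """Stand-in: accepts entries keyed by a top-level "type"."""
    out = []
    for entry in _entries(content):
        message = entry.get("message")
        if entry.get("type") not in ("human", "assistant") or not isinstance(message, dict):
            continue
        text = message.get("content")
        if isinstance(text, str) and text.strip():
            out.append(("> " if entry["type"] == "human" else "") + text.strip())
    return "\n".join(out) if len(out) >= 2 else None


def try_cursor_jsonl(content: str) -> Optional[str]:
    """Stand-in: accepts entries keyed by a top-level "role" with list content."""
    out = []
    for entry in _entries(content):
        message = entry.get("message")
        if entry.get("role") not in ("user", "assistant") or not isinstance(message, dict):
            continue
        blocks = message.get("content")
        if not isinstance(blocks, list):
            continue
        text = " ".join(
            b["text"] for b in blocks
            if isinstance(b, dict) and b.get("type") == "text" and isinstance(b.get("text"), str)
        ).strip()
        if text:
            out.append(("> " if entry["role"] == "user" else "") + text)
    return "\n".join(out) if len(out) >= 2 else None


def normalize_text(content: str) -> Optional[str]:
    """Hypothetical dispatch chain: first parser that matches wins."""
    for parser in (try_claude_code_jsonl, try_cursor_jsonl):
        result = parser(content)
        if result is not None:
            return result
    return None


CLAUDE_CODE_SAMPLE = "\n".join(json.dumps(e) for e in [
    {"type": "human", "message": {"content": "Hi from Claude Code"}},
    {"type": "assistant", "message": {"content": "Hello"}},
])


def test_cursor_parser_rejects_claude_code_jsonl():
    # Unit assertion: depends only on the Cursor parser's verdict.
    assert try_cursor_jsonl(CLAUDE_CODE_SAMPLE) is None


def test_claude_code_jsonl_still_normalizes():
    # Integration check: some parser handles the input; the exact prefix is not pinned.
    result = normalize_text(CLAUDE_CODE_SAMPLE)
    assert result is not None and "Hi from Claude Code" in result
```

With this split, a formatting change in the Claude Code parser can only break the second test, which is the one that is actually about that parser's output.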

Flow Impact Analysis

normalize(filepath)
  └─ _try_normalize_json(content)
       ├─ _try_claude_code_jsonl(content)   ← uses "type" key → no overlap
       ├─ _try_codex_jsonl(content)          ← uses "event_msg" / "session_meta" → no overlap
       ├─ _try_cursor_jsonl(content)         ← NEW: uses "role" + message.content as list
       ├─ json.loads(content) → ...
       │    ├─ _try_claude_ai_json(data)
       │    ├─ _try_chatgpt_json(data)
       │    └─ _try_slack_json(data)
       └─ return None

Discrimination matrix (why parsers don't interfere):

| Format | Key used | Cursor parser sees | Result |
| --- | --- | --- | --- |
| Claude Code JSONL | `type: "human"` | `role` = `""` → skip all | ✅ No match |
| Codex JSONL | `event_msg` / `session_meta` | `role` = `""` → skip all | ✅ No match |
| Cursor JSONL | `role: "user"` + list content | Matches correctly | ✅ Match |
| Claude Code → Cursor? | Has `type`, not `role` | All entries skipped | ✅ Safe |
| Cursor → Claude Code? | Has `role`, not `type` | All entries skipped | ✅ Safe |

No cross-contamination risk between any parser pair.

Strengths

Clean discrimination: The role vs type distinction is simple and bulletproof
has_cursor_structure flag: Smart heuristic — requires list-typed content to prevent matching generic role-based JSONL
Helper reuse: Leverages existing _extract_content() and _messages_to_transcript() — zero duplication of output logic
Excellent test coverage: 9 tests covering happy path, multi-turn, empty content, single message rejection, format discrimination, multiple content blocks, malformed lines, and string-vs-list content
Consistent style: Follows the exact same pattern as _try_claude_code_jsonl and _try_codex_jsonl

Created by Octocode MCP https://octocode.ai 🔍🐙

- Set has_cursor_structure only after non-empty text from list-typed content
- Add test for role entries without message; unit test that Claude Code JSONL does not match _try_cursor_jsonl; relax integration assertion on format

Made-with: Cursor

web3guru888

Clean implementation. A few observations from working with this format in production:

Detection logic is solid. The dual guard ( in + ) correctly discriminates against Claude Code JSONL (top-level ) and Codex JSONL ( wrapper). The has_cursor_structure flag requiring at least one list-typed content block is a good extra check before committing.

The SQLite investigation note is useful context. The explanation of why is incomplete (no paired turns in , only prompts without responses) saves future contributors from going down that path again. Consider keeping a brief note in the docstring or a comment.

Missing edge case: tool/image content blocks. Cursor agent transcripts can include non-text content blocks (, ). The current will silently skip them (correct behavior), but it's worth a test confirming that a turn containing only non-text blocks doesn't produce a phantom message. Something like:

Test coverage is thorough — malformed lines, single-message files, wrong schema, multi-block content. The integration check is a nice touch.

One nit: the import in the test file suggests it's part of the public API. If that's intentional (for testing), it's fine — just worth noting that it's now a surface others might rely on.

Overall this is a clean, focused addition that fills a real gap for teams using Cursor. The heuristic is conservative enough to avoid false positives.

web3guru888 · 2026-04-10T13:02:32Z

(Addendum to my review above — the shell ate the backtick formatting, sorry for the noise. The cleaned-up version:)

Detection logic is solid. The dual guard (role in {"user","assistant"} + isinstance(raw_content, list)) correctly discriminates against Claude Code JSONL (top-level type) and Codex JSONL (event_msg wrapper). The has_cursor_structure flag requiring at least one list-typed content block is a good extra check before committing.

Missing edge case worth testing: tool/image content blocks. Cursor transcripts can include non-text blocks ({"type": "tool_result", ...}, {"type": "image", ...}). The current _extract_content will silently skip these (correct), but worth adding a test that a turn containing only non-text blocks doesn't produce a phantom empty message in the output.
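A sketch of what such a test could look like, under stated assumptions: `extract_text` and `collect_messages` are hypothetical stand-ins for the PR's helpers, and the extra fields on the `tool_result` and `image` blocks are invented for illustration, since only their `type` values are named here.

```python
def extract_text(raw_content) -> str:
    """Hypothetical stand-in for _extract_content: keep only "text" blocks."""
    if not isinstance(raw_content, list):
        return ""
    texts = [
        block["text"].strip()
        for block in raw_content
        if isinstance(block, dict)
        and block.get("type") == "text"
        and isinstance(block.get("text"), str)
    ]
    return " ".join(t for t in texts if t)


def collect_messages(entries):
    """Return (role, text) pairs, dropping turns that yield no text."""
    messages = []
    for entry in entries:
        message = entry.get("message")
        if entry.get("role") not in ("user", "assistant") or not isinstance(message, dict):
            continue
        text = extract_text(message.get("content"))
        if text:  # a turn with only non-text blocks produces no message
            messages.append((entry["role"], text))
    return messages


def test_turn_with_only_non_text_blocks_is_dropped():
    entries = [
        {"role": "user", "message": {"content": [{"type": "text", "text": "Q"}]}},
        # Assumed block shapes: only the "type" values come from the review.
        {"role": "assistant", "message": {"content": [
            {"type": "tool_result", "content": "ok"},
            {"type": "image", "source": "placeholder"},
        ]}},
        {"role": "assistant", "message": {"content": [{"type": "text", "text": "A"}]}},
    ]
    assert collect_messages(entries) == [("user", "Q"), ("assistant", "A")]
```

The assertion checks the whole message list rather than a substring, so a phantom empty turn would show up as an extra tuple.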

The SQLite investigation note is useful context. The explanation of why state.vscdb is incomplete (no paired turns in aiService.prompts) saves future contributors from going down that path. Consider preserving it in the docstring.

Test coverage otherwise looks thorough — malformed lines, single-message files, wrong schema, multi-block content all covered. The >=2 messages + list content heuristic is conservative enough to avoid false positives.

bensig · 2026-04-11T05:28:04Z

Conflicts with main. Cursor JSONL normalization is being addressed in #287. Thanks.

Draft plugin specification for source adapters, mirroring RFC 001's role for storage backends. Formalizes the contract six community ingester PRs (#274, #23, #169, #232, #567, #98, #702) plus #981's metadata-only mode have been reinventing ad-hoc, so adapter authors can build to a stable surface.

Key decisions:

- Single ingest() method; lazy adapters yield SourceItemMetadata ahead of drawers, eager adapters interleave
- Declared-transformation model (§1.4) replaces informal verbatim promise with a verifiable one; byte_preserving adapters declare the empty set, declared_lossy adapters enumerate. Existing miner.py and the convo_miner+normalize pipeline map cleanly
- Palace is the incremental cursor via is_current(item, metadata); no sidecar persistence
- Routing is adapter-owned; detect_room/detect_hall move into the filesystem adapter
- Flat metadata per ChromaDB (RFC 001 §1.4): entity hints as json_string field, KG triples route to SQLite knowledge graph
- Closets stay core-built as a post-step; adapters may emit flat closet_hints. Closes existing gap where convo drawers get no closets
- No per-drawer field renames: source_file, filed_at, source_mtime, added_by, normalize_version, entities, ingest_mode all preserved. Spec adds adapter_name, adapter_version, privacy_class

§9 enumerates the cleanup PR prerequisites (mempalace/sources/ module, PalaceContext facade, KnowledgeGraph.add_triple gaining backwards-compatible source_drawer_id + adapter_name params).

Tracking issue: #989

…Code, MemPalace#274/MemPalace#232 Cursor, MemPalace#169 Pi, MemPalace#702 Cursor+factory.ai)

Updates the multi-agent-support bullet to cite the actual upstream work instead of just gesturing at it. RFC 002 itself is PR MemPalace#990 (tracking issue MemPalace#989).

Existing third-party prototypes already proposed against the spec:

* OpenCode SQLite: PR MemPalace#23
* Cursor SQLite: issue MemPalace#274
* Cursor JSONL (earlier variant): PR MemPalace#232
* Pi agent JSONL: PR MemPalace#169
* Combined Cursor + factory.ai: PR MemPalace#702

Each becomes a mempalace-source-<agent> package once RFC 002 lands. Names the path explicitly: fork unblocks the pattern by helping land RFC 002; per-agent adapter PRs land from their respective authors. Aider, Gemini CLI, Codex CLI, and Warp are roadmap targets without existing adapter PRs and are listed as such (no fabricated PR refs).

https://claude.ai/code/session_01GvwducFnFtN8KYmfbWKMR6

web3guru888 reviewed Apr 10, 2026

bensig closed this Apr 11, 2026

nanoscopic mentioned this pull request Apr 16, 2026

feat: add native Cursor SQLite convo ingestion #287

Closed

bensig mentioned this pull request Apr 18, 2026

RFC: Source adapter plugin specification #989

Open

bensig mentioned this pull request Apr 18, 2026

docs: RFC 002 — Source adapter plugin specification #990

Merged

marerem wants to merge 2 commits into MemPalace:main from marerem:feat/cursor-jsonl-normalizer
