Skip to content

feat: add OpenAI Codex CLI JSONL normalizer#61

Merged
bensig merged 1 commit intoMemPalace:mainfrom
adv3nt3:feat/codex-cli-normalizer
Apr 7, 2026
Merged

feat: add OpenAI Codex CLI JSONL normalizer#61
bensig merged 1 commit intoMemPalace:mainfrom
adv3nt3:feat/codex-cli-normalizer

Conversation

@adv3nt3
Copy link
Copy Markdown
Contributor

@adv3nt3 adv3nt3 commented Apr 7, 2026

Summary

Add _try_codex_jsonl parser for OpenAI Codex CLI session files stored at ~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl. This is the 6th normalize format supported by MemPalace, alongside Claude AI JSON, ChatGPT JSON, Claude Code JSONL, Slack JSON, and plain text.

Codex JSONL format

Codex CLI stores session transcripts as JSONL files with one event per line. Each line has:

{"timestamp": "...", "type": "<event_type>", "payload": {...}}

Relevant event types for conversation extraction:

Event type Payload subtype Contains
session_meta Session ID, cwd, model, git info
event_msg user_message Real user prompts (payload.message)
event_msg agent_message Real assistant replies (payload.message)
response_item message API-level items — includes synthetic context, duplicates real messages
event_msg other subtypes Tool calls, turn boundaries, exec results — not conversation content

Design decisions

Only event_msg entries are extracted

The parser uses event_msg with user_message / agent_message subtypes as the sole source of conversation turns. These represent the canonical user-authored prompts and assistant replies.

response_item entries are intentionally skipped

Codex rollout files can contain response_item entries with role: "user" that are not real user input — they include auto-injected <environment_context> blocks and other synthetic setup context. The same assistant reply can also appear both as an event_msg/agent_message and as a response_item with role: "assistant", leading to duplicated turns if both are extracted. Skipping response_item entirely avoids both problems.

session_meta header required

The parser only recognizes a file as a Codex rollout if it contains at least one session_meta event. This prevents false-positive matches on Claude Code JSONL or other JSONL formats that happen to contain event_msg-like structures.

Defensive payload handling

payload.message is checked with isinstance(msg, str) before .strip() to avoid AttributeError if the field is null or non-string in an unexpected rollout variant.

Prior art

  • PR feat: add OpenAI Codex CLI JSONL normalizer #5 (self-closed by author as "not ready") proposed a Codex normalizer that only handled response_item with a role field. That approach would pull in synthetic environment context as real user input and miss the primary event_msg conversation events entirely.
  • Format based on Codex source tests at codex-rs/rollout/src/recorder_tests.rs. This is a conservative interpretation — it extracts conversation messages without attempting full rollout reconstruction.

Changes

1 file changed (mempalace/normalize.py), 53 insertions:

  • New _try_codex_jsonl() parser function
  • Registered in _try_normalize_json() dispatcher after Claude Code JSONL
  • Module docstring updated to list Codex CLI JSONL as a supported format

Test plan

  • ruff check mempalace/normalize.py passes clean
  • ruff format --check already formatted
  • python3 -m py_compile mempalace/normalize.py compiles OK
  • Pyright reports 0 new diagnostics
  • JSONL event structure based on Codex source tests (recorder_tests.rs)
  • session_meta gating verified to prevent false positives on Claude Code JSONL
  • response_item exclusion prevents synthetic context pollution and message duplication

Refs: #59

Add _try_codex_jsonl parser for Codex CLI session files stored at
~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl.

Uses only event_msg entries (user_message / agent_message) which
represent the canonical conversation turns. response_item entries
are intentionally skipped — they include synthetic context injections
(environment_context) and can duplicate real messages when both
representations are present in the same rollout.

Format based on Codex source tests (codex-rs/rollout/src/recorder_tests.rs).
Requires session_meta header to reduce false positives on other JSONL.

Refs: MemPalace#59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants