fix: normalize incomplete Codex sessions and add coverage #334

Open
fuzzymoomoo wants to merge 3 commits into MemPalace:develop from fuzzymoomoo:fix/codex-normalization-coverage

@fuzzymoomoo

Closes #295

This adds Codex-specific normalization coverage and fixes one noisy ingest case that shows up in real Codex session history.

What changed:

  • add end-to-end Codex JSONL normalization tests for multi-turn sessions
  • verify that response_item noise and malformed lines are ignored
  • keep the current session_meta requirement explicit in coverage
  • normalize incomplete-but-real Codex sessions with a single user turn instead of falling back to raw JSONL

Why this matters:

  • _try_codex_jsonl() currently requires at least two messages before it returns a transcript
  • sessions that contain a real user message but no assistant reply yet therefore fall back to raw JSONL
  • when those files are mined, session metadata and event noise can leak into retrieval instead of a clean transcript
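
To make the fallback concrete, here is a minimal sketch of a normalizer gated on message count. It is not the project's `_try_codex_jsonl()`: the function name `normalize_codex_jsonl`, the `min_messages` parameter, and the record layout (`type`, `payload`, `user_message`, `agent_message`) are assumptions chosen for illustration only.

```python
import json


def normalize_codex_jsonl(text, min_messages=1):
    """Return a plain transcript, or None to signal fallback to raw JSONL.

    Hypothetical stand-in for _try_codex_jsonl(); the record layout used
    here is an assumption, not the project's actual schema.
    """
    messages = []
    has_session_meta = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed lines are skipped
        if not isinstance(record, dict):
            continue
        if record.get("type") == "session_meta":
            has_session_meta = True
            continue
        if record.get("type") != "event_msg":
            continue  # any other record type is treated as noise
        payload = record.get("payload")
        if not isinstance(payload, dict):
            continue
        content = payload.get("message")
        if not isinstance(content, str) or not content.strip():
            continue  # events without message text are dropped
        if payload.get("type") == "user_message":
            messages.append("> " + content.strip())
        elif payload.get("type") == "agent_message":
            messages.append(content.strip())
    if len(messages) >= min_messages and has_session_meta:
        return "\n\n".join(messages)
    return None


# A session with one real user turn and no assistant reply yet.
session = "\n".join([
    json.dumps({"type": "session_meta", "payload": {"id": "demo"}}),
    json.dumps({"type": "response_item", "payload": {"type": "reasoning"}}),
    json.dumps({"type": "event_msg",
                "payload": {"type": "user_message",
                            "message": "Why is the build failing?"}}),
])

old_result = normalize_codex_jsonl(session, min_messages=2)  # two-message gate
new_result = normalize_codex_jsonl(session, min_messages=1)  # one-message gate
```

With the two-message gate the sketch returns `None`, so a caller would keep the raw JSONL; with the one-message gate it returns only the user turn.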

Validation:

  • python -m pytest tests/test_normalize.py -q -> 7 passed
  • python -m pytest tests/test_normalize.py tests/test_convo_miner.py tests/test_miner.py -q still hits unrelated Windows file-lock cleanup failures in test_convo_miner.py and test_miner.py

@web3guru888 left a comment

Great catch on the incomplete Codex session handling. The core fix is small but impactful:

# Before: required >= 2 messages — incomplete sessions fell back to raw JSONL
if len(messages) >= 2 and has_session_meta:

# After: any real message with session_meta gets normalized
if messages and has_session_meta:

This matters because raw JSONL fallback leaks session_meta, response_item, and task_started noise into retrieval — exactly the kind of pollution that degrades search relevance over time.
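
As a quick sanity check on the scope of the change, the two conditions can be compared over small inputs. This is only a sketch: `old_guard` and `new_guard` are hypothetical wrappers around the two `if` conditions quoted above, and the new condition is wrapped in `bool()` so that both return a boolean.

```python
from itertools import product


def old_guard(messages, has_session_meta):
    # Condition before the change: at least two messages plus session_meta.
    return len(messages) >= 2 and has_session_meta


def new_guard(messages, has_session_meta):
    # Condition after the change: any message plus session_meta.
    return bool(messages) and has_session_meta


# Collect every (message count, has_session_meta) pair where the guards disagree.
changed = []
for count, meta in product(range(4), (False, True)):
    messages = ["msg"] * count
    if old_guard(messages, meta) != new_guard(messages, meta):
        changed.append((count, meta))

print(changed)  # → [(1, True)]
```

Under these assumptions the guards disagree only for a single message accompanied by session_meta, which is the incomplete-session case this PR targets.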

The test coverage is thorough:

  • Multi-turn session — verifies noise lines (response_item, task_started) are stripped ✅
  • Incomplete session (single user turn) — validates the actual fix ✅
  • No session_meta — confirms the guard still requires it ✅
  • Malformed lines + empty messages — edge cases handled gracefully ✅

One question: does the has_session_meta guard still make sense as-is? If a Codex JSONL file has user/assistant messages but the session_meta line was truncated (e.g., partial write), we'd still fall back to raw. That seems like the right behavior — just confirming the intent.
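
To illustrate the partial-write case, here is a small sketch assuming the parser skips lines that fail `json.loads`. The helper `has_session_meta` and the record layout are hypothetical, not taken from the project's code.

```python
import json


def has_session_meta(lines):
    """Return True if any parseable line is a session_meta record.

    Hypothetical helper; the "type" field name is an assumption.
    """
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # a truncated line cannot be parsed, so it is skipped
        if isinstance(record, dict) and record.get("type") == "session_meta":
            return True
    return False


complete = json.dumps({"type": "session_meta", "payload": {"id": "demo"}})
truncated = complete[: len(complete) // 2]  # simulate a partial write
user_line = json.dumps({"type": "event_msg",
                        "payload": {"type": "user_message", "message": "hi"}})

intact_file = has_session_meta([complete, user_line])
damaged_file = has_session_meta([truncated, user_line])
```

In this sketch the damaged file reports no session_meta even though it contains a real user message, so a guard that requires session_meta would send it to the raw fallback.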

Solid fix with excellent coverage. This kind of normalization hygiene is exactly what keeps palaces clean at scale.

🔭 Reviewed as part of the MemPalace-AGI integration project — autonomous research with perfect memory. Community interaction updates are posted regularly on the dashboard.

@bensig changed the base branch from main to develop on April 11, 2026, 22:22
@igorls added the area/ci (CI/CD and workflows), area/mining (File and conversation mining), and bug (Something isn't working) labels on Apr 14, 2026

Development

Successfully merging this pull request may close these issues.

Codex JSONL normalization quality and test coverage

3 participants