feat: add Gemini CLI session JSON normalizer #155

Open
adv3nt3 wants to merge 1 commit into MemPalace:develop from adv3nt3:feat/gemini-cli-normalizer


adv3nt3 commented Apr 7, 2026

Summary

Add a _try_gemini_json parser for Gemini CLI session files stored at ~/.gemini/tmp/{project_hash}/chats/session-{timestamp}-{id}.json. This is the seventh format the MemPalace normalizer supports, alongside Claude AI JSON, ChatGPT JSON, Claude Code JSONL, Codex CLI JSONL (#61), Slack JSON, and plain text.

Gemini CLI session format

Gemini CLI auto-saves every conversation as a single JSON file per session. Sessions are project-scoped — stored under a hash of the working directory. Retention defaults to 30 days / 100 sessions (configurable via settings.json).

Path: ~/.gemini/tmp/{project_hash}/chats/session-{timestamp}-{short_id}.json

Structure:

{
  "sessionId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "projectHash": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...",
  "startTime": "2026-03-30T10:28:04.070Z",
  "lastUpdated": "2026-03-30T10:28:16.793Z",
  "messages": [
    {
      "id": "xxxxxxxx-...",
      "timestamp": "2026-03-30T10:28:04.070Z",
      "type": "user",
      "content": [{"text": "Quick Terraform question about input validation..."}]
    },
    {
      "id": "xxxxxxxx-...",
      "timestamp": "2026-03-30T10:28:16.793Z",
      "type": "gemini",
      "content": "Yes, the validation is **worth adding**..."
    }
  ],
  "kind": "main"
}

Message types

| type value | content format | Represents |
|---|---|---|
| "user" | List of {"text": "..."} blocks | User prompts |
| "gemini" | Plain string | Assistant replies |

Other message types (model changes, tool calls, etc.) may appear in sessions but are skipped by this parser — only user and gemini carry conversation content.
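The type filter described above can be sketched as follows. This is a minimal illustration, not the code in the PR: the function name conversation_messages, the ROLE_BY_TYPE mapping, and the "info" type used in the usage note are hypothetical, and mapping "gemini" to an "assistant" role is an assumption.

```python
from typing import Any

# Assumed mapping from Gemini message types to generic roles.
# Any type not listed here (model changes, tool calls, ...) is skipped.
ROLE_BY_TYPE = {"user": "user", "gemini": "assistant"}


def conversation_messages(messages: list[Any]) -> list[tuple[str, Any]]:
    """Keep only user/gemini messages, returning (role, raw content) pairs."""
    kept = []
    for msg in messages:
        if not isinstance(msg, dict):
            continue
        msg_type = msg.get("type")
        if not isinstance(msg_type, str):
            continue
        role = ROLE_BY_TYPE.get(msg_type)
        if role is None:
            continue  # operational metadata, not conversation content
        kept.append((role, msg.get("content")))
    return kept
```

For example, a session containing a "user" message, a message of some other type such as "info", and a "gemini" message would yield two pairs, with the middle message dropped.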

Design decisions

Custom content extraction instead of shared _extract_content

Gemini user content blocks are {"text": "..."} without a "type" field. The shared _extract_content helper in normalize.py expects {"type": "text", "text": "..."} (the Claude/OpenAI convention) and returns empty string for Gemini blocks. Rather than modifying the shared helper (which could affect 5 other parsers), _try_gemini_json does its own extraction:

  • Plain string → use directly
  • List of dicts → extract "text" key from each block
  • List of strings → join directly
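The three cases above could be handled roughly like this. It is a sketch under stated assumptions rather than the PR's implementation: the name extract_gemini_text is hypothetical, and joining multiple blocks with a single space is an assumption (the description does not say which separator is used).

```python
from typing import Any


def extract_gemini_text(content: Any) -> str:
    """Flatten a Gemini `content` value into plain text (illustrative only)."""
    # Case 1: plain string, as used by "gemini" replies.
    if isinstance(content, str):
        return content.strip()
    # Cases 2 and 3: a list of {"text": ...} dicts and/or bare strings.
    if isinstance(content, list):
        parts = []
        for block in content:
            if isinstance(block, str):
                parts.append(block)
            elif isinstance(block, dict):
                text = block.get("text")
                if isinstance(text, str):
                    parts.append(text)
        # Separator is an assumption; blocks without a "text" string are ignored.
        return " ".join(parts).strip()
    # Anything else (None, numbers, dicts) carries no extractable text.
    return ""
```

Because the helper never looks for a "type" key, blocks shaped like {"text": "..."} are accepted without touching the shared _extract_content helper.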

Fingerprints on sessionId + messages keys

The parser requires both sessionId and messages in the top-level dict. This prevents false positives on:

  • ChatGPT — has mapping key, no sessionId
  • Claude AI — flat list or chat_messages wrapper, no sessionId
  • Slack — is a list (not dict)
  • Arbitrary JSON — unlikely to have both keys
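A fingerprint check along these lines might look like the sketch below. The function name looks_like_gemini_session is hypothetical, and requiring messages to be a list is an added assumption beyond the "both keys present" rule stated above.

```python
from typing import Any


def looks_like_gemini_session(data: Any) -> bool:
    """Return True only for a top-level dict carrying sessionId and messages."""
    return (
        isinstance(data, dict)
        and "sessionId" in data
        # Assumption: also insist that messages is a list.
        and isinstance(data.get("messages"), list)
    )
```

Under this check, a ChatGPT-style dict with only a mapping key, a Claude-style chat_messages wrapper, and a Slack-style top-level list are all rejected.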

Single JSON file (not JSONL)

Unlike Codex and Claude Code, which store one JSON object per line (JSONL), Gemini stores the entire session as one JSON object with a messages array. The parser is therefore registered in the _try_normalize_json dispatcher alongside the other JSON parsers (after json.loads), not in the JSONL section.
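The single-JSON versus JSONL distinction can be illustrated with the standard library alone. The helper name parse_whole_json and the sample strings are hypothetical; the sketch only shows why a whole-file json.loads succeeds for a Gemini-style file and fails for multi-line JSONL.

```python
import json
from typing import Any, Optional


def parse_whole_json(raw: str) -> Optional[Any]:
    """Parse the entire text as one JSON document, or return None on failure."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Multi-line JSONL lands here: json.loads rejects the extra data
        # that follows the first object.
        return None


gemini_raw = '{"sessionId": "s", "messages": []}'
jsonl_raw = '{"a": 1}\n{"b": 2}\n'

gemini_doc = parse_whole_json(gemini_raw)  # a dict
jsonl_doc = parse_whole_json(jsonl_raw)    # None
```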

What's NOT handled (and why)

  • Checkpoints / forked sessions: Gemini supports /resume save <tag> for manual checkpoints and conversation forking. These may create additional session files. The parser handles them the same as regular sessions — if it has sessionId + messages, it normalizes.
  • Aborted/empty sessions: Sessions with fewer than 2 messages return None (same threshold as all other parsers).
  • Tool call details: Only "user" and "gemini" message types are extracted. Tool calls, model changes, and thinking level changes are skipped — they're operational metadata, not conversation content.
  • /chat share exports: Gemini can export conversations to Markdown or JSON via /chat share. The exported JSON format may differ from the auto-saved session format. This parser targets auto-saved sessions only.
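The two-message threshold can be sketched as below. The function name render_transcript is hypothetical, and the transcript layout (user turns prefixed with "> ", turns separated by a blank line) is an assumption based only on the "> marker transcripts" mentioned in the test plan; the real output format may differ.

```python
from typing import Optional


def render_transcript(turns: list[tuple[str, str]]) -> Optional[str]:
    """Render (role, text) pairs, or return None for aborted/empty sessions."""
    # Fewer than 2 messages: treat as aborted or empty and give up.
    if len(turns) < 2:
        return None
    blocks = []
    for role, text in turns:
        # Assumed layout: user turns carry a "> " marker, replies do not.
        blocks.append(f"> {text}" if role == "user" else text)
    return "\n\n".join(blocks)
```

Returning None rather than an empty string lets a dispatcher fall through to the next parser or report the file as not normalizable.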

Prior art

Changes

1 file changed (mempalace/normalize.py), 47 insertions:

  • New _try_gemini_json() parser function with custom content extraction
  • Registered in _try_normalize_json() dispatcher alongside other JSON parsers
  • Module docstring updated to list Gemini CLI JSON as supported format

Test plan

  • ruff check mempalace/normalize.py passes clean
  • ruff format --check reports the file as already formatted
  • python3 -m py_compile mempalace/normalize.py compiles OK
  • Tested against 2 real local Gemini CLI sessions (2-turn and multi-turn) — produces correct > marker transcripts
  • False positive check — returns None for Claude AI JSON, ChatGPT JSON, Slack JSON, plain dict, empty dict, and list inputs
  • Pyright reports 0 new diagnostics
  • Session storage path and format confirmed via Gemini CLI docs and Context7

Refs: #59


adv3nt3 commented Apr 7, 2026

@bensig Same CI failure as PR #44; it does not come from this PR. Both failing tests are in test_dialect.py, unrelated to normalize.py:

  1. TestCompressionStats::test_stats: KeyError: 'ratio'
  2. TestCompressionStats::test_count_tokens — old heuristic vs new word-based counting

Pre-existing test-vs-code mismatch on main since PR #147 changed the stats API. PR #150 is the fix. All 97 other tests pass.

Add _try_gemini_json parser for Gemini CLI session files stored at
~/.gemini/tmp/{project_hash}/chats/session-{timestamp}-{id}.json.

Gemini sessions are single JSON files (not JSONL) with a messages
array. User messages have type "user" with content as a list of
{"text": "..."} blocks (no "type" key — differs from Claude/OpenAI
content blocks). Assistant messages have type "gemini" with content
as a plain string.

Uses custom content extraction because Gemini content blocks omit
the "type" field that the shared _extract_content helper expects.
Fingerprints on "sessionId" + "messages" keys to avoid false
positives on other JSON formats.

Tested against real local Gemini CLI sessions. Session format
confirmed via Gemini CLI docs (session management, /resume command,
~/.gemini/tmp/{hash}/chats/ path).

Refs: MemPalace#59
adv3nt3 force-pushed the feat/gemini-cli-normalizer branch from c5669b9 to 44c9d2b April 9, 2026 17:53

web3guru888 left a comment

Review of #155: feat: add Gemini CLI session JSON normalizer

Scope: +47/−1 · 1 file(s)

  • mempalace/normalize.py (modified: +47/−1)

Suggestions

  • 💡 No tests included — consider adding coverage for the new code paths

🟢 Approved — clean, well-structured PR. Good work @adv3nt3!


🏛️ Reviewed by MemPalace-AGI · Autonomous research system with perfect memory · Showcase: Truth Palace of Atlantis

bensig changed the base branch from main to develop April 11, 2026 22:23
igorls added the area/cli (CLI commands), area/mining (File and conversation mining), and enhancement (New feature or request) labels Apr 14, 2026