Session stuck in permanent 400 error loop when tool result contains Gemini-invalid content #11475

@teng-lin

Summary

When a tool result contains content that the Gemini API rejects (e.g., large base64/minified JavaScript from reading a dist file), the session becomes permanently stuck in a 400 error loop. Every subsequent API call — including /new and /reset — fails because the full session history (containing the offending content) is sent to the API before the command is processed.

There is no circuit-breaker, auto-recovery, or tool result size guard to prevent or recover from this state.

Root Cause

Three missing safeguards combine to create an unrecoverable failure:

  1. No tool result size limit or content sanitization: session-tool-result-guard.ts appends tool results directly to session history without any truncation or filtering. A single exec tool call that reads a large dist file (containing base64 source maps, minified JS, etc.) can inject hundreds of KB of content that the Gemini API treats as invalid.

  2. No circuit-breaker for consecutive API errors: pi-embedded-runner/run.ts handles context overflow errors (via isContextOverflowError) and auth failover errors, but has no handling for repeated 400 "invalid argument" errors. The session keeps retrying with the same poisoned history indefinitely.

  3. /new command cannot break the loop — While /new does reset the session at the routing layer (commands-core.ts lines 65-107), in group channels the incoming Discord message first triggers an agent turn on the existing session. The agent turn sends the poisoned history to the API, gets a 400 error, and the /new is never processed. The session stays stuck.

Steps to Reproduce

  1. Start a Discord group session using google-gemini-cli/gemini-3-flash-preview
  2. Have the agent read a large compiled JavaScript file via the exec tool (e.g., a bundled dist file with inline source maps)
  3. The tool result (containing base64/minified content) is stored in session history
  4. Every subsequent API call fails with: Cloud Code Assist API error (400): Request contains an invalid argument.
  5. Incoming Discord messages append to the session but all API calls fail
  6. Sending /new in the channel also fails — the command triggers an API call with the poisoned history before the session reset can take effect
  7. Session is permanently broken

Evidence

From session JSONL (2a47c269):

# Line 779: Tool result with base64/minified content from dist file read
role=toolResult content=,CAAC,EAAE,MAAMA,EAAE,CAAC,EAAE,YAAY,GAAG,EAAE,EAAE...

# Line 780: First 400 error — every API call after this fails
role=assistant stopReason=error errorMessage="Cloud Code Assist API error (400): Request contains an invalid argument."

# Lines 782-817: 19 consecutive 400 errors over 3+ hours, interspersed with user messages that cannot be processed

Session was at 925k/1049k tokens (88%) with 817 messages. The only recovery was manually archiving the session file and restarting the gateway.

Impact

  • Critical — Renders the session completely unusable with no self-recovery
  • User sees repeated Cloud Code Assist API error (400) messages in Discord
  • /new and /reset commands cannot break the loop
  • Manual file system intervention required (delete/archive session JSONL)
  • Affects any session where the agent reads large binary/minified content via tools

Proposed Fixes

1. Circuit-breaker for consecutive API errors (high priority)

In pi-embedded-runner/run.ts, add error tracking per session:

// After N consecutive non-overflow 400 errors on the same session,
// auto-compact (drop oldest messages) or auto-rotate to a new session
if (isConsecutive400Error(errorText) && consecutiveErrorCount >= MAX_CONSECUTIVE_ERRORS) {
  // Option A: Auto-compact — drop messages before the last successful exchange
  // Option B: Auto-rotate — create a new session and notify the user
  // Option C: Trim the last N tool results and retry
}
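A minimal sketch of how the per-session counter could look, assuming Option B (auto-rotate). Every name here (ApiErrorBreaker, recordFailure, recordSuccess, the "retry"/"rotate" actions) is hypothetical and not taken from the OpenClaw codebase, and the caller is assumed to have already filtered out context overflow errors, which run.ts handles separately.

```typescript
// Hypothetical sketch: none of these names exist in OpenClaw.
const MAX_CONSECUTIVE_ERRORS = 3; // assumed threshold, tune as needed

type BreakerAction = "retry" | "rotate";

class ApiErrorBreaker {
  // Consecutive 400 count per session id.
  private counts = new Map<string, number>();
  private readonly maxErrors: number;

  constructor(maxErrors: number = MAX_CONSECUTIVE_ERRORS) {
    this.maxErrors = maxErrors;
  }

  /**
   * Record a failed API call. Returns "rotate" once the session has hit
   * maxErrors consecutive 400s, otherwise "retry".
   * Non-400 statuses are ignored: they neither count nor reset the streak.
   */
  recordFailure(sessionId: string, status: number): BreakerAction {
    if (status !== 400) {
      return "retry";
    }
    const next = (this.counts.get(sessionId) ?? 0) + 1;
    if (next >= this.maxErrors) {
      // Threshold reached: clear the counter so the new session starts clean.
      this.counts.delete(sessionId);
      return "rotate";
    }
    this.counts.set(sessionId, next);
    return "retry";
  }

  /** Any successful call ends the streak for that session. */
  recordSuccess(sessionId: string): void {
    this.counts.delete(sessionId);
  }
}
```

With a threshold of 3, the session in the Evidence section would have been rotated after the third 400 instead of accumulating 19 of them.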

2. Tool result size guard (medium priority)

In session-tool-result-guard.ts, add truncation before persisting:

const MAX_TOOL_RESULT_CHARS = 100_000; // ~25k tokens

if (resultText.length > MAX_TOOL_RESULT_CHARS) {
  resultText = resultText.slice(0, MAX_TOOL_RESULT_CHARS) + 
    `\n\n[Truncated: output was ${resultText.length} chars, limit is ${MAX_TOOL_RESULT_CHARS}]`;
}
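A self-contained variant of the same guard, as a sketch only. It keeps both the head and the tail of the output, on the assumption that the end of a command's output often carries the exit summary or error. The function name truncateToolResult is hypothetical, and the 100,000 character default is the value proposed above, not a documented Gemini limit.

```typescript
// Hypothetical helper, not part of session-tool-result-guard.ts.
const MAX_TOOL_RESULT_CHARS = 100_000; // value proposed in this issue

/**
 * Keep at most maxChars of the original text (half from the start, half
 * from the end) and insert a marker describing what was dropped.
 * The marker itself is added on top of maxChars.
 * Note: slicing by UTF-16 code unit can split a surrogate pair at the cut.
 */
function truncateToolResult(
  text: string,
  maxChars: number = MAX_TOOL_RESULT_CHARS,
): string {
  if (text.length <= maxChars) {
    return text;
  }
  const headLen = Math.ceil(maxChars / 2);
  const tailLen = maxChars - headLen;
  const omitted = text.length - maxChars;
  // slice(-0) would return the whole string, so guard the zero case.
  const tail = tailLen > 0 ? text.slice(-tailLen) : "";
  const marker = `[Truncated: ${omitted} of ${text.length} chars omitted, limit is ${maxChars}]`;
  return `${text.slice(0, headLen)}\n\n${marker}\n\n${tail}`;
}
```

For example, truncateToolResult("abcdefghij", 4) keeps "ab" and "ij" around the marker.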

3. /new bypass for poisoned sessions (medium priority)

Process /new and /reset commands at the message routing layer before triggering an agent turn, so they work even when the session history is invalid.
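The ordering could be sketched as below. This assumes a single entry point that sees every incoming message; RouterDeps, isResetCommand, and routeMessage are hypothetical names for illustration, not the actual functions in commands-core.ts.

```typescript
// Hypothetical router sketch: names are placeholders, not OpenClaw APIs.
const RESET_COMMANDS = new Set(["/new", "/reset"]);

interface RouterDeps {
  resetSession(sessionId: string): Promise<void>;
  runAgentTurn(sessionId: string, text: string): Promise<void>;
}

/** True when the first whitespace-delimited token is a reset command. */
function isResetCommand(text: string): boolean {
  const first = text.trim().split(/\s+/)[0] ?? "";
  return RESET_COMMANDS.has(first.toLowerCase());
}

/**
 * Check for reset commands BEFORE any agent turn, so the poisoned
 * history is never sent to the provider for these messages.
 */
async function routeMessage(
  deps: RouterDeps,
  sessionId: string,
  text: string,
): Promise<"reset" | "agent"> {
  if (isResetCommand(text)) {
    await deps.resetSession(sessionId);
    return "reset";
  }
  await deps.runAgentTurn(sessionId, text);
  return "agent";
}
```

The sketch does not handle a leading bot mention in group channels; whether the command text arrives with such a prefix is not established in this report.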

4. Content sanitization for known-bad patterns (low priority)

Strip or replace content that is known to cause issues with specific providers (e.g., large base64 blobs, source maps, minified JS bundles) before persisting tool results.
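A rough sketch of such a filter follows. The report does not establish which exact pattern triggers the 400, so the three patterns below (source map comments, base64 data URIs, long base64-like runs) and the 512 character threshold are assumptions; sanitizeToolResult is a hypothetical name. Raw source map mappings like the ",CAAC,EAAE,..." excerpt in the Evidence section would not be caught by these patterns.

```typescript
// Hypothetical sanitizer: patterns and threshold are assumptions.
const MIN_BLOB_CHARS = 512; // assumed cutoff for a "large" base64-like run

// "//# sourceMappingURL=..." comments, including inline data: URIs.
const SOURCE_MAP_COMMENT = /\/\/[#@]\s*sourceMappingURL=\S+/g;
// data:<type>/<subtype>[;param]*;base64,<payload>
const DATA_URI =
  /data:[\w.+-]+\/[\w.+-]+(?:;[\w=.+-]+)*;base64,[A-Za-z0-9+/=]+/g;
// Long unbroken runs of base64 alphabet characters.
const BASE64_RUN = new RegExp(`[A-Za-z0-9+/]{${MIN_BLOB_CHARS},}={0,2}`, "g");

/** Replace known-bulky encoded content with short placeholders. */
function sanitizeToolResult(text: string): string {
  return text
    .replace(SOURCE_MAP_COMMENT, "[source map reference removed]")
    .replace(DATA_URI, (m) => `[base64 data URI removed: ${m.length} chars]`)
    .replace(BASE64_RUN, (m) => `[base64-like run removed: ${m.length} chars]`);
}
```

Source map comments are replaced first so that an inline map is removed as one unit rather than leaving a dangling comment prefix.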

Workaround

Manually archive or delete the poisoned session file:

mkdir -p ~/.openclaw/agents/main/sessions/_archived
mv ~/.openclaw/agents/main/sessions/<session-id>.jsonl ~/.openclaw/agents/main/sessions/_archived/
# Restart gateway
pkill -USR1 -f "openclaw.*gateway"

Environment

  • OpenClaw version: 2026.2.6-3
  • Channel: Discord (group)
  • Model: google-gemini-cli/gemini-3-flash-preview
  • Platform: macOS Darwin 25.2.0 (arm64) + Linux (same behavior on both)


Labels

  • bug (Something isn't working)
  • stale (Marked as stale due to inactivity)
