Session stuck in permanent 400 error loop when tool result contains Gemini-invalid content #11475

## Summary

When a tool result contains content that the Gemini API rejects (e.g., base64 or minified JavaScript from reading a large dist file), the session becomes **permanently stuck** in a 400 error loop. Every subsequent message, including the `/new` and `/reset` commands, fails because the full session history (containing the offending content) is sent to the API before the command is processed.

There is no circuit-breaker, auto-recovery, or tool result size guard to prevent or recover from this state.

## Root Cause

Three missing safeguards combine to create an unrecoverable failure:

1. **No tool result size limit or content sanitization** — `session-tool-result-guard.ts` appends tool results directly to session history without any truncation or filtering. A single `exec` tool call that reads a large dist file (containing base64 source maps, minified JS, etc.) can inject hundreds of KB of content that the Gemini API treats as invalid.

2. **No circuit-breaker for consecutive API errors** — `pi-embedded-runner/run.ts` handles context overflow errors (via `isContextOverflowError`) and auth failover errors, but has **no handling for repeated 400 "invalid argument" errors**. The session keeps retrying with the same poisoned history indefinitely.

3. **`/new` command cannot break the loop** — While `/new` does reset the session at the routing layer (`commands-core.ts` lines 65-107), in group channels the incoming Discord message first triggers an agent turn on the **existing** session. The agent turn sends the poisoned history to the API, gets a 400 error, and the `/new` is never processed. The session stays stuck.

## Steps to Reproduce

1. Start a Discord group session using `google-gemini-cli/gemini-3-flash-preview`
2. Have the agent read a large compiled JavaScript file via the `exec` tool (e.g., a bundled dist file with inline source maps)
3. The tool result (containing base64/minified content) is stored in session history
4. **Every subsequent API call fails** with: `Cloud Code Assist API error (400): Request contains an invalid argument.`
5. Incoming Discord messages append to the session but all API calls fail
6. Sending `/new` in the channel also fails — the command triggers an API call with the poisoned history before the session reset can take effect
7. Session is permanently broken

## Evidence

From session JSONL (`2a47c269`):

```
# Line 779: Tool result with base64/minified content from dist file read
role=toolResult content=,CAAC,EAAE,MAAMA,EAAE,CAAC,EAAE,YAAY,GAAG,EAAE,EAAE...

# Line 780: First 400 error — every API call after this fails
role=assistant stopReason=error errorMessage="Cloud Code Assist API error (400): Request contains an invalid argument."

# Lines 782-817: 19 consecutive 400 errors over 3+ hours, interspersed with user messages that cannot be processed
```

Session was at 925k/1049k tokens (88%) with 817 messages. The only recovery was manually archiving the session file and restarting the gateway.
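
To locate the offending entry in a session file without reading it by hand, something like the following could help. This is only a sketch: it assumes each JSONL line is a JSON object with top-level `role` and `content` fields (inferred from the excerpt above; the real schema may nest these differently), and `findLargeToolResults` is a hypothetical helper, not part of OpenClaw.

```typescript
interface LargeResult {
  line: number;  // 1-based line number in the JSONL file
  chars: number; // size of the tool result content in characters
}

// Scan JSONL text and report tool results at or above `minChars`.
// ASSUMPTION: entries look like { role: "toolResult", content: ... }.
function findLargeToolResults(jsonl: string, minChars: number): LargeResult[] {
  const hits: LargeResult[] = [];
  jsonl.split("\n").forEach((raw, index) => {
    if (raw.trim() === "") return;
    let entry: unknown;
    try {
      entry = JSON.parse(raw);
    } catch {
      return; // skip lines that are not valid JSON
    }
    if (typeof entry !== "object" || entry === null) return;
    const record = entry as Record<string, unknown>;
    if (record["role"] !== "toolResult") return;
    const content = record["content"];
    // Non-string content is measured by its serialized size.
    const chars =
      typeof content === "string"
        ? content.length
        : JSON.stringify(content ?? "").length;
    if (chars >= minChars) hits.push({ line: index + 1, chars });
  });
  return hits;
}
```

The function takes the file contents as a string, so it can be fed from whatever file-reading mechanism is at hand.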

## Impact

- **Critical** — Renders the session completely unusable with no self-recovery
- User sees repeated `Cloud Code Assist API error (400)` messages in Discord
- `/new` and `/reset` commands cannot break the loop
- Manual file system intervention required (delete/archive session JSONL)
- Affects any session where the agent reads large binary/minified content via tools

## Proposed Fixes

### 1. Circuit-breaker for consecutive API errors (high priority)

In `pi-embedded-runner/run.ts`, add error tracking per session:

```typescript
// After N consecutive non-overflow 400 errors on the same session,
// auto-compact (drop the offending messages) or auto-rotate to a new session
if (isConsecutive400Error(errorText) && consecutiveErrorCount >= MAX_CONSECUTIVE_ERRORS) {
  // Option A: Auto-compact: drop messages added after the last successful exchange
  // Option B: Auto-rotate: create a new session and notify the user
  // Option C: Trim the last N tool results and retry
}
```
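
A slightly fuller sketch of the per-session tracking, in case it helps the discussion. Everything here is hypothetical (`InvalidRequestBreaker`, `isInvalidArgumentError`, and the threshold of 3 are illustrative names and values, not existing code in `run.ts`); the error match is a heuristic based on the message quoted in this issue.

```typescript
const MAX_CONSECUTIVE_ERRORS = 3; // assumed threshold, tune as needed

type BreakerAction = "retry" | "recover";

// Tracks consecutive non-overflow 400 errors per session (hypothetical helper).
class InvalidRequestBreaker {
  private readonly counts = new Map<string, number>();
  private readonly maxErrors: number;

  constructor(maxErrors: number = MAX_CONSECUTIVE_ERRORS) {
    this.maxErrors = maxErrors;
  }

  // Call after a successful API response: the history is known to be valid.
  recordSuccess(sessionId: string): void {
    this.counts.delete(sessionId);
  }

  // Call after a 400 "invalid argument" error; says what the runner should do next.
  recordInvalidRequest(sessionId: string): BreakerAction {
    const next = (this.counts.get(sessionId) ?? 0) + 1;
    if (next >= this.maxErrors) {
      this.counts.delete(sessionId); // the recovered session starts from zero
      return "recover";
    }
    this.counts.set(sessionId, next);
    return "retry";
  }
}

// Heuristic match for the error text quoted in this issue.
function isInvalidArgumentError(errorText: string): boolean {
  return /\(400\)/.test(errorText) && /invalid argument/i.test(errorText);
}
```

When `recordInvalidRequest` returns `"recover"`, the runner would pick one of options A to C above. Which option is right is a separate decision; the breaker only decides when to stop retrying.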

### 2. Tool result size guard (medium priority)

In `session-tool-result-guard.ts`, add truncation before persisting:

```typescript
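// ~25k tokens at roughly 4 chars/token; dense base64 or minified text may tokenize to more
const MAX_TOOL_RESULT_CHARS = 100_000;

if (resultText.length > MAX_TOOL_RESULT_CHARS) {
  // The template below is evaluated before the assignment, so it reports the original length
  resultText = resultText.slice(0, MAX_TOOL_RESULT_CHARS) +
    `\n\n[Truncated: output was ${resultText.length} chars, limit is ${MAX_TOOL_RESULT_CHARS}]`;
}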
```
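
One possible variant keeps both the head and the tail of the output, since `exec` output often ends with the error or exit status. This is a sketch under that assumption; `truncateToolResult` is a hypothetical helper and the 50/50 split is arbitrary.

```typescript
const MAX_TOOL_RESULT_CHARS = 100_000; // same limit as proposed above

// Keep the first and last halves of an oversized result and note what was dropped.
function truncateToolResult(
  text: string,
  limit: number = MAX_TOOL_RESULT_CHARS,
): string {
  if (text.length <= limit) return text;
  const headLen = Math.ceil(limit / 2);
  const tailLen = limit - headLen;
  const omitted = text.length - limit;
  const head = text.slice(0, headLen);
  // Guard against tailLen === 0: slice(text.length - 0) is fine, but be explicit.
  const tail = tailLen > 0 ? text.slice(text.length - tailLen) : "";
  return (
    `${head}\n\n[Truncated: ${omitted} of ${text.length} chars omitted, ` +
    `limit is ${limit}]\n\n${tail}`
  );
}
```

Note that the marker text itself pushes the stored result slightly over `limit`, and slicing by UTF-16 code units can split a surrogate pair at the cut point.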

### 3. `/new` bypass for poisoned sessions (medium priority)

Process `/new` and `/reset` commands at the message routing layer **before** triggering an agent turn, so they work even when the session history is invalid.
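
Roughly, the ordering could look like the sketch below. `SessionOps`, `routeMessage`, and `isResetCommand` are hypothetical names used for illustration; the actual routing code in `commands-core.ts` is not shown in this issue and will differ.

```typescript
// Commands that must work even when the session history is invalid.
const RESET_COMMANDS = new Set(["/new", "/reset"]);

// True when the first whitespace-delimited token is a reset command.
function isResetCommand(messageText: string): boolean {
  const firstToken = messageText.trim().split(/\s+/)[0] ?? "";
  return RESET_COMMANDS.has(firstToken.toLowerCase());
}

// Hypothetical seam over the real session and runner code.
interface SessionOps {
  resetSession(sessionId: string): Promise<void>;
  runAgentTurn(sessionId: string, messageText: string): Promise<void>;
}

// Check for reset commands BEFORE any agent turn touches the existing history.
async function routeMessage(
  ops: SessionOps,
  sessionId: string,
  messageText: string,
): Promise<"reset" | "agent-turn"> {
  if (isResetCommand(messageText)) {
    await ops.resetSession(sessionId);
    return "reset"; // no API call is made with the old history
  }
  await ops.runAgentTurn(sessionId, messageText);
  return "agent-turn";
}
```

In group channels the command may arrive with a leading bot mention, so a real check would probably need to strip that first; this sketch does not.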

### 4. Content sanitization for known-bad patterns (low priority)

Strip or replace content that is known to cause issues with specific providers (e.g., large base64 blobs, source maps, minified JS bundles) before persisting tool results.
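
As a starting point, a filter along these lines might cover two of the patterns mentioned. It is a heuristic sketch only: the issue does not establish exactly which content Gemini rejects, `sanitizeToolResult` is a hypothetical helper, and the 1024-character threshold is an arbitrary assumption.

```typescript
const MIN_BASE64_RUN = 1024; // assumed threshold for "large" base64-like runs

// Inline source map comments such as: //# sourceMappingURL=data:application/json;base64,...
const INLINE_SOURCE_MAP = /\/\/[#@]\s*sourceMappingURL=data:[^\s]+/g;

// Long unbroken runs of base64 alphabet characters, with optional padding.
const LONG_BASE64_RUN = new RegExp(
  `[A-Za-z0-9+/]{${MIN_BASE64_RUN},}={0,2}`,
  "g",
);

// Replace known-noisy payloads with short placeholders before persisting.
function sanitizeToolResult(text: string): string {
  return text
    .replace(INLINE_SOURCE_MAP, "[inline source map removed]")
    .replace(
      LONG_BASE64_RUN,
      (match) => `[base64-like blob removed: ${match.length} chars]`,
    );
}
```

This does not touch minified JavaScript in general, which contains punctuation and so never forms a long base64-like run; the size guard in fix 2 would still be needed for that case.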

## Workaround

Manually archive or delete the poisoned session file:

```bash
mkdir -p ~/.openclaw/agents/main/sessions/_archived
mv ~/.openclaw/agents/main/sessions/<session-id>.jsonl ~/.openclaw/agents/main/sessions/_archived/
# Restart gateway
pkill -USR1 -f "openclaw.*gateway"
```

## Environment

- OpenClaw version: 2026.2.6-3
- Channel: Discord (group)
- Model: `google-gemini-cli/gemini-3-flash-preview`
- Platform: macOS Darwin 25.2.0 (arm64) + Linux (same behavior on both)

## Related Issues

- #8946 — Session should auto-recover when corrupted tool response makes history invalid (same fundamental problem, different trigger)
- #11291 — Tool call formatting + context overflow on model switch (same error loop pattern, filed today)
- #9672 — Compaction orphans tool_result blocks, permanently breaking session
- #6202 — Base64 images in tool results causing context overflow (related: no size guard on tool results)
- #3014 — Orphaned tool_result causes API 400 error loop
- #3154 — Context overflow does not trigger automatic session recovery
- #5430 — Terminated tool calls break all subsequent requests
