Skip to content

Session repair leaves JSONL ending on assistant turn → triggers Anthropic 400 prefill loop #75271

@thierrybwai

Description

@thierrybwai

Summary

agent/embedded session-repair logic leaves session JSONL files ending on a role=assistant message after "repair", which then resubmits to Anthropic with assistant prefill — Anthropic rejects with HTTP 400 (This model does not support assistant message prefill). When the agent's failover chain has only one candidate, this surfaces as a user-visible error and can wedge the embedded agent in a silent loop.

Environment

  • OpenClaw: latest (running via LaunchAgent on macOS)
  • Model: anthropic/claude-opus-4-7
  • Platform: Darwin 25.3.0 arm64, Node v25.8.1

Repro / Observed Behaviour

  1. Session JSONL ends with role=assistant (e.g. previous run interrupted before the next user turn was appended).
  2. agent/embedded attempts repair on session resume.
  3. Repair rewrites the assistant message but file still ends on role=assistant.
  4. Request submitted to Anthropic ends with assistant content → HTTP 400 format error.
  5. Failover decision = surface_error because primary is the only candidate.
  6. Loop continues until process is SIGTERM'd.

Real-world Impact

Out of 877 session files on this gateway, 250 (~28%) had this corruption. Yesterday alone: 300 prefill rejections + 49 repair attempts. Today before manual restart (~7h window): 84 + 14. Continuous, not one-off.

Suggested Fix

Either:

  • Drop trailing assistant entries during repair until the file ends on a role=user turn, OR
  • Append a synthetic user "(continue)" turn before resubmission.

Also worth considering: a session-janitor cron that quarantines any JSONL whose last message entry isn't role=user, run weekly or on startup.

Workaround (current)

  • Quarantined 250 corrupt sessions to ~/.openclaw/agents/<id>/sessions/_quarantine_YYYY-MM-DD/ with a MANIFEST.
  • Added fallbacks array to agents.defaults.model so single-candidate 4xx no longer surfaces directly.

Logs (sanitized)

06:51:36  agent/embedded repair → session e9256c82-...
06:51:38  Anthropic 400: "This model does not support assistant message prefill"
06:51:38  failover decision: surface_error (1 candidate)
06:51:53→06:53:21  90s silence, embedded lane stuck
06:53:21  SIGTERM (manual)

Happy to provide the quarantine MANIFEST or full logs offline if useful.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions