Failed non-retryable chat turn poisons thread history and bricks the thread #15

@sanil-23

Summary

A non-retryable failed chat turn (e.g. a 400/bad-request from the provider, or an image attachment sent to a non-vision model) leaves the offending message in the thread's authoritative history. Every subsequent turn replays that poisoned history and fails the same way, so the thread becomes permanently unusable.

Problem

What happens: When a turn fails on a non-retryable error, the just-appended user (and any partial assistant) message stays in the persisted history. On the next turn the harness reloads that history, replays the same malformed/poison content, and the provider rejects it again — repeating forever.

What's expected: A single failed turn should not brick the thread. The user should be able to keep chatting in the same thread after an error.

Impact: The thread is permanently dead. The failure is not limited to one bad message; the whole conversation is lost. Today the only recovery is to start a new thread, which is what the interim copy from the error-handling spec tells the user to do. It also causes per-replay Sentry noise: the same client-side error pages on every retry.

Steps to reproduce (representative):

  1. In a chat thread, trigger a non-retryable provider 400 (e.g. a model/parameter incompatibility, or attach an image to a non-vision model).
  2. Observe the turn fails.
  3. Send any follow-up message in the same thread.
  4. It fails again with the same error, regardless of the new input — the thread is stuck.

Platform: Desktop (macOS/Windows/Linux), current main.

Solution (optional)

Three layers, in priority order (the "R" items from the inference error-handling plan):

  • R1 — roll back the poison turn (P1): on a non-retryable bad-request failure, remove the just-appended user (+partial assistant) message from authoritative history, or flag it excluded_from_context so it is never replayed. Anchors: src/openhuman/channels/providers/web.rs, src/openhuman/agent/harness/session/runtime.rs, turn.rs.
  • R2 — sanitize history on read (P1): when loading a thread's authoritative history, drop/repair orphan tool_calls, empty messages, and bad role ordering — not just the KV-cache transcript — so an already-poisoned thread heals. Anchor: runtime.rs seed_resume_from_messages, web.rs.
  • R4 — recovery UX (P2): add edit-and-resend / delete-last-turn in the chat UI (app/src) so the user can remove the offending message and keep the thread instead of abandoning it.
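
A minimal sketch of the R1 rollback idea, under stated assumptions: `Thread`, `Message`, `TurnError`, `RollbackPolicy`, and `run_turn` are hypothetical names invented for illustration and are not taken from web.rs, runtime.rs, or turn.rs. It records the history length before the user message is appended and, on a non-retryable failure, either truncates back to that checkpoint or flags the appended messages as `excluded_from_context`.

```rust
// Hypothetical types for illustration; none of these names are taken from
// the openhuman codebase.
#[derive(Debug, Clone, PartialEq)]
pub enum Role {
    User,
    Assistant,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub role: Role,
    pub content: String,
    /// Mirrors the `excluded_from_context` flag proposed in R1.
    pub excluded_from_context: bool,
}

#[derive(Debug)]
pub enum TurnError {
    /// The provider rejected the request itself (e.g. HTTP 400).
    NonRetryable(String),
    /// Transient failure such as a timeout.
    Retryable(String),
}

/// The two rollback options named in R1.
#[derive(Debug, Clone, Copy)]
pub enum RollbackPolicy {
    Remove,
    Exclude,
}

#[derive(Debug, Default)]
pub struct Thread {
    pub history: Vec<Message>,
}

impl Thread {
    /// Messages that would be replayed to the provider on the next turn.
    pub fn context(&self) -> Vec<&Message> {
        self.history
            .iter()
            .filter(|m| !m.excluded_from_context)
            .collect()
    }

    pub fn run_turn<F>(
        &mut self,
        user_input: &str,
        policy: RollbackPolicy,
        provider: F,
    ) -> Result<String, TurnError>
    where
        F: FnOnce(&[&Message]) -> Result<String, TurnError>,
    {
        // Everything at or after this index belongs to the current turn.
        let checkpoint = self.history.len();
        self.history.push(Message {
            role: Role::User,
            content: user_input.to_string(),
            excluded_from_context: false,
        });

        let result = provider(self.context().as_slice());
        match result {
            Ok(reply) => {
                self.history.push(Message {
                    role: Role::Assistant,
                    content: reply.clone(),
                    excluded_from_context: false,
                });
                Ok(reply)
            }
            Err(TurnError::NonRetryable(reason)) => {
                match policy {
                    // Drop the user message and any partial assistant output.
                    RollbackPolicy::Remove => self.history.truncate(checkpoint),
                    // Keep the messages for display but never replay them.
                    RollbackPolicy::Exclude => {
                        for m in &mut self.history[checkpoint..] {
                            m.excluded_from_context = true;
                        }
                    }
                }
                Err(TurnError::NonRetryable(reason))
            }
            // Assumption: retryable failures leave history untouched here.
            Err(other) => Err(other),
        }
    }
}
```

With either policy, a follow-up call to `run_turn` sees a context that no longer contains the rejected message, so the next turn is not doomed by the previous one. How retryable errors should treat the appended message is not specified in this issue and is left open in the sketch.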

Note: R3 (dedup the per-turn Sentry report so a replayed poison pages once, not every retry) and R5 (the "This chat can't continue — please start a new thread" copy) are the interim band-aids; R5 ships with the frontend error-handling work, R3 can ride along here.
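
One possible shape for the R3 dedup, as a sketch only: `ErrorReporter`, `ErrorKey`, and the choice of keying on thread id plus an error fingerprint are all assumptions, since this issue does not specify the dedup key. The `sent` vector stands in for the real Sentry call, which is deliberately not shown.

```rust
use std::collections::HashSet;

/// Hypothetical dedup key: which thread failed, and with which error class.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct ErrorKey {
    pub thread_id: String,
    pub fingerprint: String,
}

/// Hypothetical reporter that forwards each distinct key at most once.
#[derive(Debug, Default)]
pub struct ErrorReporter {
    seen: HashSet<ErrorKey>,
    /// Stand-in for events actually forwarded to the error tracker.
    pub sent: Vec<String>,
}

impl ErrorReporter {
    /// Returns true if the event was forwarded, false if it was suppressed
    /// as a repeat of an already-reported (thread, fingerprint) pair.
    pub fn report(&mut self, thread_id: &str, fingerprint: &str, message: &str) -> bool {
        let key = ErrorKey {
            thread_id: thread_id.to_string(),
            fingerprint: fingerprint.to_string(),
        };
        // `insert` returns false when the key was already present.
        if !self.seen.insert(key) {
            return false;
        }
        // A real implementation would hand the event to the Sentry SDK here.
        self.sent.push(message.to_string());
        true
    }
}
```

In this sketch a replayed poison in the same thread is reported once and suppressed afterwards, while the same error class in a different thread is still reported.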

Acceptance criteria

  • Repro gone — after a non-retryable failed turn, the next turn in the same thread succeeds (the poison turn is rolled back / excluded), and a previously-poisoned thread recovers on reload (read-side sanitize).
  • Regression safety — Rust unit/integration coverage for rollback + history sanitization (orphan tool_calls, empty messages, bad role order); app coverage for any recovery-UX affordance.
  • Diff coverage ≥ 80% — the fix PR meets the changed-lines coverage gate.
  • No per-replay Sentry spam — a replayed client-side error pages at most once per turn (R3).
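
The read-side sanitization cases named in the criteria above (orphan tool_calls, empty messages, bad role order) could be exercised against something like the following sketch. The `Message` shape and `sanitize_history` are hypothetical and not the actual `seed_resume_from_messages` logic. Merging consecutive same-role messages is only one possible repair policy, and tool call/result matching here is by id only and does not check ordering.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, PartialEq)]
pub enum Role {
    User,
    Assistant,
    Tool,
}

/// Hypothetical message shape for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub role: Role,
    pub content: String,
    /// Ids of tool calls requested by an assistant message.
    pub tool_calls: Vec<String>,
    /// Id of the call a tool message answers.
    pub tool_call_id: Option<String>,
}

impl Message {
    pub fn text(role: Role, content: &str) -> Self {
        Message {
            role,
            content: content.to_string(),
            tool_calls: Vec::new(),
            tool_call_id: None,
        }
    }
}

pub fn sanitize_history(history: Vec<Message>) -> Vec<Message> {
    // Ids that have a tool result somewhere in the history.
    let answered: HashSet<String> = history
        .iter()
        .filter_map(|m| m.tool_call_id.clone())
        .collect();
    // Ids that some assistant message actually requested.
    let requested: HashSet<String> = history
        .iter()
        .flat_map(|m| m.tool_calls.iter().cloned())
        .collect();

    let mut out: Vec<Message> = Vec::new();
    for mut msg in history {
        // Orphan tool_calls: strip calls that never received a result.
        msg.tool_calls.retain(|id| answered.contains(id));

        // Orphan tool results: drop results nobody asked for.
        if msg.role == Role::Tool {
            match &msg.tool_call_id {
                Some(id) if requested.contains(id) => {}
                _ => continue,
            }
        }

        // Empty messages: no text and no remaining tool calls.
        if msg.role != Role::Tool
            && msg.content.trim().is_empty()
            && msg.tool_calls.is_empty()
        {
            continue;
        }

        // Bad role order: merge consecutive plain messages of the same role.
        if let Some(prev) = out.last_mut() {
            if prev.role == msg.role
                && msg.role != Role::Tool
                && prev.tool_calls.is_empty()
                && msg.tool_calls.is_empty()
            {
                prev.content.push('\n');
                prev.content.push_str(&msg.content);
                continue;
            }
        }
        out.push(msg);
    }
    out
}
```

A well-formed history, including a complete tool call and its result, passes through this function unchanged, which is the property a regression test would most want to pin down alongside the three repair cases.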
