Failed non-retryable chat turn poisons thread history and bricks the thread

## Summary

A non-retryable failed chat turn (e.g. a `400`/bad-request from the provider, or an image attachment sent to a non-vision model) leaves the offending message in the thread's authoritative history. Every subsequent turn replays that poisoned history and fails the same way, so the thread becomes permanently unusable.

## Problem

**What happens:** When a turn fails on a non-retryable error, the just-appended user (and any partial assistant) message stays in the persisted history. On the next turn the harness reloads that history, replays the same malformed/poison content, and the provider rejects it again — repeating forever.

**What's expected:** A single failed turn should not brick the thread. The user should be able to keep chatting in the same thread after an error.

**Impact:** The whole conversation is permanently dead, not just the one bad message. Today the only recovery is to start a new thread, which is what the interim error copy from the error-handling spec tells the user to do. Each replay also re-reports the same client-side error to Sentry, so a single poisoned turn pages on every retry.

**Steps to reproduce (representative):**
1. In a chat thread, trigger a non-retryable provider 400 (e.g. a model/parameter incompatibility, or an image attached to a non-vision model).
2. Observe the turn fails.
3. Send any follow-up message in the same thread.
4. It fails again with the same error, regardless of the new input — the thread is stuck.

**Platform:** Desktop (macOS/Windows/Linux), current `main`.

## Solution (optional)

Three layers, in priority order (the "R" items from the inference error-handling plan):

- **R1 — roll back the poison turn (P1):** on a non-retryable bad-request failure, remove the just-appended user (+partial assistant) message from authoritative history, or flag it `excluded_from_context` so it is never replayed. Anchors: `src/openhuman/channels/providers/web.rs`, `src/openhuman/agent/harness/session/runtime.rs`, `turn.rs`.
- **R2 — sanitize history on read (P1):** when loading a thread's authoritative history, drop/repair orphan `tool_calls`, empty messages, and bad role ordering — not just the KV-cache transcript — so an already-poisoned thread heals. Anchor: `runtime.rs` `seed_resume_from_messages`, `web.rs`.
- **R4 — recovery UX (P2):** add edit-and-resend / delete-last-turn in the chat UI (`app/src`) so the user can remove the offending message and keep the thread instead of abandoning it.
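A minimal sketch of the R1 rollback, assuming a simplified in-memory history. `ThreadHistory`, `Message`, `TurnError`, and `run_turn` are hypothetical names for illustration and are not the actual types in `runtime.rs` or `web.rs`. It takes a checkpoint before appending the user message and, on a non-retryable failure, flags everything appended since then as `excluded_from_context` so the next turn never replays it.

```rust
// Hypothetical, simplified types; the real harness types may differ.
#[derive(Debug, Clone, PartialEq)]
enum Role {
    User,
    Assistant,
}

#[derive(Debug, Clone, PartialEq)]
struct Message {
    role: Role,
    content: String,
    /// Mirrors the `excluded_from_context` flag proposed in R1.
    excluded_from_context: bool,
}

#[derive(Debug)]
enum TurnError {
    /// The provider rejected the request itself (e.g. HTTP 400); a replay cannot succeed.
    NonRetryable(String),
    /// Transient failure; the same history may succeed on retry.
    Retryable(String),
}

#[derive(Default)]
struct ThreadHistory {
    messages: Vec<Message>,
}

impl ThreadHistory {
    /// Messages that are replayed to the provider on the next turn.
    fn context(&self) -> Vec<&Message> {
        self.messages
            .iter()
            .filter(|m| !m.excluded_from_context)
            .collect()
    }

    /// Runs one turn. `call_provider` stands in for the real provider call.
    fn run_turn<F>(&mut self, user_input: &str, call_provider: F) -> Result<(), TurnError>
    where
        F: FnOnce(&[&Message]) -> Result<String, TurnError>,
    {
        // Everything at or after this index belongs to the current turn.
        let checkpoint = self.messages.len();
        self.messages.push(Message {
            role: Role::User,
            content: user_input.to_string(),
            excluded_from_context: false,
        });

        let outcome = call_provider(self.context().as_slice());
        match outcome {
            Ok(reply) => {
                self.messages.push(Message {
                    role: Role::Assistant,
                    content: reply,
                    excluded_from_context: false,
                });
                Ok(())
            }
            Err(TurnError::NonRetryable(reason)) => {
                // Flag the poison turn (user message plus any partial assistant
                // output) so it stays visible but is never replayed.
                // `self.messages.truncate(checkpoint)` would remove it instead.
                for m in &mut self.messages[checkpoint..] {
                    m.excluded_from_context = true;
                }
                Err(TurnError::NonRetryable(reason))
            }
            // Retryable errors leave the history untouched for the retry path (not shown).
            Err(other) => Err(other),
        }
    }
}
```

Flagging rather than deleting keeps the failed message available to an R4-style edit-and-resend affordance; truncating is simpler if the UI does not need it.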

Note: **R3** (dedup the per-turn Sentry report so a replayed poison pages once, not every retry) and **R5** (the "This chat can't continue — please start a new thread" copy) are the interim band-aids; R5 ships with the frontend error-handling work, R3 can ride along here.
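The R3 dedup could be as small as a set of already-reported keys. The sketch below is an assumption-laden illustration: `ReportDeduper` and the key shape (thread id plus an error fingerprint string) are hypothetical, and the actual Sentry call is left to the caller rather than shown.

```rust
use std::collections::HashSet;

/// Hypothetical deduper; keying on (thread id, error fingerprint) is an assumption.
#[derive(Default)]
struct ReportDeduper {
    seen: HashSet<(String, String)>,
}

impl ReportDeduper {
    /// Returns `true` only the first time a (thread, fingerprint) pair is seen,
    /// so the caller reports once and stays silent on replays.
    fn should_report(&mut self, thread_id: &str, fingerprint: &str) -> bool {
        // `HashSet::insert` returns `false` when the value was already present.
        self.seen
            .insert((thread_id.to_string(), fingerprint.to_string()))
    }
}
```

The caller would wrap its existing report in `if deduper.should_report(thread_id, fingerprint) { ... }`. With R1 in place the replay itself should disappear, so this mainly guards threads that were poisoned before the fix.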

## Acceptance criteria

- [ ] **Repro gone** — after a non-retryable failed turn, the next turn in the same thread succeeds (the poison turn is rolled back / excluded), and a previously-poisoned thread recovers on reload (read-side sanitize).
- [ ] **Regression safety** — Rust unit/integration coverage for rollback + history sanitization (orphan tool_calls, empty messages, bad role order); app coverage for any recovery-UX affordance.
- [ ] **Diff coverage ≥ 80%** — the fix PR meets the changed-lines coverage gate.
- [ ] **No per-replay Sentry spam** — a replayed client-side error pages at most once per turn (R3).
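The history cases named in the regression-safety criterion (orphan `tool_calls`, empty messages, bad role order) can be sketched as a single read-side pass. This is an illustration under stated assumptions, not the behavior of `seed_resume_from_messages`: the `Message` shape and `sanitize_history` are hypothetical, and "bad role order" is interpreted narrowly here as stray tool results and back-to-back user messages.

```rust
use std::collections::HashSet;

// Hypothetical, simplified message shape; the persisted format may differ.
#[derive(Debug, Clone, PartialEq)]
enum Role {
    User,
    Assistant,
    Tool,
}

#[derive(Debug, Clone, PartialEq)]
struct Message {
    role: Role,
    content: String,
    /// IDs of tool calls requested by an assistant message.
    tool_calls: Vec<String>,
    /// For `Role::Tool`: the ID of the call this message answers.
    tool_call_id: Option<String>,
}

/// Small constructor to keep examples and tests short.
fn msg(role: Role, content: &str, tool_calls: &[&str], tool_call_id: Option<&str>) -> Message {
    Message {
        role,
        content: content.to_string(),
        tool_calls: tool_calls.iter().map(|s| s.to_string()).collect(),
        tool_call_id: tool_call_id.map(|s| s.to_string()),
    }
}

/// Returns a cleaned copy of `history` that is safe to replay.
fn sanitize_history(history: &[Message]) -> Vec<Message> {
    // Call IDs requested by an assistant message that was kept.
    let mut requested: HashSet<String> = HashSet::new();
    let mut out: Vec<Message> = Vec::new();

    for (i, original) in history.iter().enumerate() {
        let mut m = original.clone();
        match m.role {
            Role::Assistant => {
                // Strip orphan tool calls: keep a call only if a later tool
                // message answers it.
                let later = &history[i + 1..];
                m.tool_calls.retain(|id| {
                    later.iter().any(|t| {
                        t.role == Role::Tool && t.tool_call_id.as_deref() == Some(id.as_str())
                    })
                });
                // Drop assistant messages left with neither text nor calls.
                if m.content.trim().is_empty() && m.tool_calls.is_empty() {
                    continue;
                }
                requested.extend(m.tool_calls.iter().cloned());
            }
            Role::Tool => {
                // Drop tool results that do not answer an earlier, kept call.
                let known = m
                    .tool_call_id
                    .as_ref()
                    .map_or(false, |id| requested.contains(id));
                if !known {
                    continue;
                }
            }
            Role::User => {
                if m.content.trim().is_empty() {
                    continue;
                }
                // Merge back-to-back user messages into one.
                if let Some(prev) = out.last_mut() {
                    if prev.role == Role::User {
                        prev.content.push('\n');
                        prev.content.push_str(&m.content);
                        continue;
                    }
                }
            }
        }
        out.push(m);
    }
    out
}
```

A unit test for the criterion can feed a deliberately broken history through the pass and assert on the result, for example that an assistant message requesting calls `a` and `b` keeps only `a` when `b` has no matching tool message.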

## Related

- Backend error taxonomy: `tinyhumansai/backend` PR #870 (operator faults → 503, `ApiErrorCode` + `errorCode` envelope, 413/context codes).
- Frontend error-handling spec + F-item PR (`fix/inference-errorcode-classification`) — surfaces the classified errors and ships the interim R5 "start a new thread" copy.
- This issue tracks the deferred R1/R2/R4 (thread recovery) work.

