Summary
A non-retryable failed chat turn (e.g. a 400/bad-request from the provider, or an image attachment sent to a non-vision model) leaves the offending message in the thread's authoritative history. Every subsequent turn replays that poisoned history and fails the same way, so the thread becomes permanently unusable.
Problem
What happens: When a turn fails on a non-retryable error, the just-appended user (and any partial assistant) message stays in the persisted history. On the next turn the harness reloads that history, replays the same malformed/poison content, and the provider rejects it again — repeating forever.
What's expected: A single failed turn should not brick the thread. The user should be able to keep chatting in the same thread after an error.
Impact: The thread is permanently dead — not just one bad message but the whole conversation. Today the only recovery is to start a new thread (the interim copy from the error-handling spec). It also causes per-replay Sentry noise (the same client-side error pages on every retry).
Steps to reproduce (representative):
- In a chat thread, trigger a non-retryable provider 400 (e.g. a model/parameter incompatibility, or attach an image to a non-vision model).
- Observe the turn fails.
- Send any follow-up message in the same thread.
- It fails again with the same error, regardless of the new input — the thread is stuck.
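The loop these steps produce can be modeled in a few lines. This is a minimal sketch, not the harness code: `Msg`, `fake_provider`, and `send_turn` are hypothetical names, and the fake provider stands in for a non-vision model by rejecting any history that contains an image.

```rust
/// Hypothetical message type: only the fields needed for the illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct Msg {
    pub text: String,
    pub has_image: bool,
}

/// Hypothetical stand-in for a provider that rejects images, the way a
/// non-vision model would answer with a non-retryable 400.
pub fn fake_provider(history: &[Msg]) -> Result<String, String> {
    if history.iter().any(|m| m.has_image) {
        Err("400 bad request: model does not accept images".to_string())
    } else {
        Ok(format!("reply to {} message(s)", history.len()))
    }
}

/// Mirrors the failing behavior described above: the message is appended
/// before the call and is never removed when the call fails.
pub fn send_turn(history: &mut Vec<Msg>, msg: Msg) -> Result<String, String> {
    history.push(msg);
    fake_provider(history)
}
```

Once the image message is in `history`, every later `send_turn` call fails regardless of its own input, because the rejected message is replayed each time.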
Platform: Desktop (macOS/Windows/Linux), current main.
Solution (optional)
Three layers, in priority order (the "R" items from the inference error-handling plan):
- R1 - roll back the poison turn (P1): on a non-retryable bad-request failure, remove the just-appended user (+partial assistant) message from authoritative history, or flag it excluded_from_context so it is never replayed. Anchors: src/openhuman/channels/providers/web.rs, src/openhuman/agent/harness/session/runtime.rs, turn.rs.
- R2 - sanitize history on read (P1): when loading a thread's authoritative history (not just the KV-cache transcript), drop/repair orphan tool_calls, empty messages, and bad role ordering, so an already-poisoned thread heals. Anchor: runtime.rs seed_resume_from_messages, web.rs.
- R4 - recovery UX (P2): add edit-and-resend / delete-last-turn in the chat UI (app/src) so the user can remove the offending message and keep the thread instead of abandoning it.
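R1 and R2 could look roughly like the sketch below. Every type and function name here (`Message`, `TurnError`, `rollback_failed_turn`, `sanitize_for_replay`) is an assumption made for illustration; the real structures in runtime.rs and web.rs may differ. The sanitizer covers only excluded messages, empty messages, and unmatched tool calls/results; broader role-ordering repair is left out.

```rust
use std::collections::HashSet;

// All types and function names below are hypothetical, for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub enum Role {
    System,
    User,
    Assistant,
    Tool,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub role: Role,
    pub content: String,
    /// Tool calls requested by an assistant message.
    pub tool_call_ids: Vec<String>,
    /// For a tool result: the call it answers.
    pub tool_call_id: Option<String>,
    /// R1 alternative to deletion: keep for display, never replay.
    pub excluded_from_context: bool,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TurnError {
    Retryable,
    NonRetryableBadRequest,
}

/// R1: on a non-retryable failure, drop everything the failed turn appended
/// by truncating back to the length recorded before the turn started.
pub fn rollback_failed_turn(history: &mut Vec<Message>, len_before_turn: usize, err: TurnError) {
    if err == TurnError::NonRetryableBadRequest {
        history.truncate(len_before_turn);
    }
}

fn is_blank(m: &Message) -> bool {
    m.content.trim().is_empty() && m.tool_call_ids.is_empty() && m.tool_call_id.is_none()
}

/// R2: build a replay-safe copy of stored history.
pub fn sanitize_for_replay(history: &[Message]) -> Vec<Message> {
    let mut out: Vec<Message> = Vec::new();
    for msg in history {
        if msg.excluded_from_context || is_blank(msg) {
            continue;
        }
        if msg.role == Role::Tool {
            // Keep a tool result only if an earlier kept assistant message asked for it.
            let requested = msg.tool_call_id.as_ref().map_or(false, |id| {
                out.iter()
                    .any(|m| m.role == Role::Assistant && m.tool_call_ids.contains(id))
            });
            if !requested {
                continue;
            }
        }
        out.push(msg.clone());
    }
    // Strip assistant tool calls that never received a result (orphan tool_calls).
    let answered: HashSet<String> = out.iter().filter_map(|m| m.tool_call_id.clone()).collect();
    for m in out.iter_mut() {
        if m.role == Role::Assistant {
            m.tool_call_ids.retain(|id| answered.contains(id));
        }
    }
    // An assistant message that consisted only of orphan calls is now empty.
    out.retain(|m| !is_blank(m));
    out
}
```

Truncating to a recorded length removes both the user message and any partial assistant output in one step. The `excluded_from_context` flag is the non-destructive variant: the message stays visible in the thread but is skipped when the context is rebuilt.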
Note: R3 (dedup the per-turn Sentry report so a replayed poison pages once, not every retry) and R5 (the "This chat can't continue — please start a new thread" copy) are the interim band-aids; R5 ships with the frontend error-handling work, R3 can ride along here.
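For R3, one possible shape is a small in-memory guard consulted before each report is sent. This is a sketch under assumptions: `ReportDeduper`, the thread ID, and the fingerprint string are hypothetical, and the actual Sentry call is left to the caller.

```rust
use std::collections::HashSet;

/// Hypothetical dedup guard: remembers which (thread, error fingerprint)
/// pairs have already been reported during this process lifetime.
#[derive(Default)]
pub struct ReportDeduper {
    seen: HashSet<(String, String)>,
}

impl ReportDeduper {
    /// Returns true only the first time a given pair is seen, so the caller
    /// can skip the report for every later replay of the same failure.
    pub fn should_report(&mut self, thread_id: &str, fingerprint: &str) -> bool {
        // HashSet::insert returns false when the value was already present.
        self.seen.insert((thread_id.to_owned(), fingerprint.to_owned()))
    }
}
```

Keying on the thread as well as the error keeps the first occurrence in each thread visible while silencing replays within the same thread.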
Acceptance criteria
Related
- tinyhumansai/backend PR feat(redirect_links): SQLite-backed URL shortener for token-heavy links tinyhumansai/openhuman#870 (operator faults → 503, ApiErrorCode + errorCode envelope, 413/context codes).
- (fix/inference-errorcode-classification): surfaces the classified errors and ships the interim R5 "start a new thread" copy.