Add configurable retry with exponential backoff for LLM provider errors (529/overloaded)

## Problem

When the LLM provider (e.g. Anthropic) returns a transient error like HTTP 529 ("overloaded"), the gateway immediately sends an error message to the user:

> "The AI service is temporarily overloaded. Please try again in a moment."

This is poor UX: the user has to retry manually and has no way to know whether the issue is transient (it usually is).

## Current behavior

- Channel-level retries exist (Telegram 429, Discord rate limits) — see `concepts/retry.md`
- **No retry exists for LLM/completion provider errors** (529, 503, timeouts)
- The raw error is surfaced directly to the end user

## Proposed solution

Add a configurable retry policy for LLM provider calls, similar to the existing channel retry policy:

```json5
{
  providers: {
    anthropic: {
      retry: {
        attempts: 3,          // max retries before surfacing error
        minDelayMs: 2000,     // initial backoff
        maxDelayMs: 30000,    // cap
        jitter: 0.1,          // 10% jitter to avoid thundering herd
        timeoutMs: 60000,     // total timeout across all attempts
      }
    }
  }
}
```
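To make the meaning of each field concrete, here is a minimal sketch of how the delay before each retry could be derived from this config. The names `RetryPolicy` and `backoffDelayMs` are hypothetical (not existing gateway code), the doubling factor is an assumption, and `attempts` is assumed to count total calls including the first.

```typescript
// Hypothetical shape mirroring the proposed `retry` block.
interface RetryPolicy {
  attempts: number;   // assumed: total calls, including the first one
  minDelayMs: number; // delay before the first retry
  maxDelayMs: number; // cap on the un-jittered delay
  jitter: number;     // fraction, e.g. 0.1 for +/-10%
  timeoutMs: number;  // total budget across all attempts
}

// Delay before retry number `retryIndex` (0 = first retry).
// `rand` is injectable so the jitter can be made deterministic in tests.
function backoffDelayMs(
  policy: RetryPolicy,
  retryIndex: number,
  rand: () => number = Math.random,
): number {
  // Double the delay on each retry (assumed factor of 2), capped at maxDelayMs.
  const base = Math.min(policy.maxDelayMs, policy.minDelayMs * 2 ** retryIndex);
  // Spread the delay uniformly over [base * (1 - jitter), base * (1 + jitter)].
  const offset = (rand() * 2 - 1) * policy.jitter * base;
  return Math.round(base + offset);
}

const example: RetryPolicy = {
  attempts: 3, minDelayMs: 2000, maxDelayMs: 30000, jitter: 0.1, timeoutMs: 60000,
};

// Un-jittered schedule for the first five retries (rand fixed at 0.5 gives zero offset).
const schedule = [0, 1, 2, 3, 4].map((i) => backoffDelayMs(example, i, () => 0.5));
console.log(schedule); // [ 2000, 4000, 8000, 16000, 30000 ]
```

With `attempts: 3` only the first two delays (2s, then 4s) would ever be used, so the 30s cap matters only if `attempts` is raised. In this sketch jitter is applied after the cap, so a jittered delay can exceed `maxDelayMs` by up to 10%; clamp the result if a hard cap is wanted.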

### Behavior

- On transient LLM errors (529, 503, 502, ECONNRESET, timeout), retry with exponential backoff
- Respect `Retry-After` header if present
- After all attempts are exhausted or `timeoutMs` is reached, surface the error to the user
- Non-retryable errors (400, 401, 403) should fail immediately (no retry)
- Applies to both main session completions and cron/isolated session completions
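The rules above could be sketched roughly as follows. Everything here is illustrative: `ProviderError`, `isRetryable`, `parseRetryAfterMs`, and `withRetry` are hypothetical names, the error shape is an assumption (a real provider SDK may expose status and headers differently), and "timeout" is mapped to the Node-style `ETIMEDOUT` code as a guess. Jitter is left out for brevity, and `timeoutMs` only bounds the waiting between attempts; it does not abort a call already in flight.

```typescript
// Hypothetical error shape; real SDK errors may differ.
interface ProviderError {
  status?: number;     // HTTP status, when a response was received
  code?: string;       // network error code, e.g. "ECONNRESET"
  retryAfter?: string; // raw Retry-After header value, if any
}

interface RetryOptions {
  attempts: number;    // assumed: total calls, including the first one
  minDelayMs: number;
  maxDelayMs: number;
  timeoutMs: number;
}

const RETRYABLE_STATUS = new Set([502, 503, 529]);
const RETRYABLE_CODES = new Set(["ECONNRESET", "ETIMEDOUT"]);

// 400/401/403 (and any other status not listed) fail immediately.
function isRetryable(err: ProviderError): boolean {
  if (err.status !== undefined) return RETRYABLE_STATUS.has(err.status);
  return err.code !== undefined && RETRYABLE_CODES.has(err.code);
}

// Retry-After carries either a number of seconds or an HTTP date.
function parseRetryAfterMs(value: string | undefined, now: number = Date.now()): number | undefined {
  if (value === undefined || value.trim() === "") return undefined;
  const seconds = Number(value);
  if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);
  const date = Date.parse(value);
  return Number.isNaN(date) ? undefined : Math.max(0, date - now);
}

async function withRetry<T>(
  call: () => Promise<T>,
  opts: RetryOptions,
  // Injectable so tests do not have to wait for real timers.
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((r) => setTimeout(r, ms)),
): Promise<T> {
  const deadline = Date.now() + opts.timeoutMs;
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (e) {
      const err = (e ?? {}) as ProviderError;
      // Fail fast on non-retryable errors or once attempts are used up.
      if (!isRetryable(err) || attempt >= opts.attempts) throw e;
      const backoff = Math.min(opts.maxDelayMs, opts.minDelayMs * 2 ** (attempt - 1));
      // Prefer a server-provided Retry-After over the computed backoff.
      const delay = parseRetryAfterMs(err.retryAfter) ?? backoff;
      // Give up if the wait would run past the total time budget.
      if (Date.now() + delay > deadline) throw e;
      await sleep(delay);
    }
  }
}
```

One open design question this sketch glosses over: when `Retry-After` is longer than the remaining `timeoutMs` budget, it surfaces the error right away instead of waiting for a retry that could not happen in time.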

### Defaults

Sensible defaults that work out of the box:
- 3 attempts, 2s min delay, 30s max delay, 60s total timeout
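As a rough illustration of "work out of the box", a per-provider `retry` block could be merged over these defaults so that users only set the fields they care about. `DEFAULT_RETRY` and `resolveRetryPolicy` are hypothetical names, the validation rules are assumptions, and the default `jitter` of 0.1 is taken from the example config rather than from the list above.

```typescript
interface RetryPolicy {
  attempts: number;
  minDelayMs: number;
  maxDelayMs: number;
  jitter: number;
  timeoutMs: number;
}

// Proposed defaults; jitter is an assumption borrowed from the example config.
const DEFAULT_RETRY: RetryPolicy = {
  attempts: 3,
  minDelayMs: 2000,
  maxDelayMs: 30000,
  jitter: 0.1,
  timeoutMs: 60000,
};

// Merge user overrides over the defaults and reject obviously invalid values.
function resolveRetryPolicy(overrides: Partial<RetryPolicy> = {}): RetryPolicy {
  const policy: RetryPolicy = { ...DEFAULT_RETRY, ...overrides };
  if (!Number.isInteger(policy.attempts) || policy.attempts < 1) {
    throw new RangeError("retry.attempts must be an integer >= 1");
  }
  if (policy.minDelayMs < 0 || policy.maxDelayMs < policy.minDelayMs) {
    throw new RangeError("retry delays must satisfy 0 <= minDelayMs <= maxDelayMs");
  }
  return policy;
}

// Only `attempts` is overridden; every other field falls back to its default.
console.log(resolveRetryPolicy({ attempts: 5 }));
```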

## Context

This is the #1 source of UX friction for end users on Telegram/WhatsApp: transient provider overload is common during peak hours and almost always resolves within seconds. The existing channel retry infrastructure could likely be generalized.

## Related

- Existing retry policy: `concepts/retry.md` (channels only)
- Anthropic 529 errors are common during peak usage hours
