Add configurable retry with exponential backoff for LLM provider errors (529/overloaded) #24321

@davidwter

Problem

When the LLM provider (e.g. Anthropic) returns a transient error like HTTP 529 ("overloaded"), the gateway immediately sends an error message to the user:

"The AI service is temporarily overloaded. Please try again in a moment."

This is poor UX: the user has to retry manually and has no way to know whether the issue is transient (it usually is).

Current behavior

  • Channel-level retries exist (Telegram 429, Discord rate limits) — see concepts/retry.md
  • No retry exists for LLM/completion provider errors (529, 503, timeouts)
  • The raw error is surfaced directly to the end user

Proposed solution

Add a configurable retry policy for LLM provider calls, similar to the existing channel retry policy:

{
  providers: {
    anthropic: {
      retry: {
        attempts: 3,          // max retries before surfacing error
        minDelayMs: 2000,     // initial backoff
        maxDelayMs: 30000,    // cap
        jitter: 0.1,          // 10% jitter to avoid thundering herd
        timeoutMs: 60000,     // total timeout across all attempts
      }
    }
  }
}
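To make the shape of the proposed `retry` block concrete, here is a minimal TypeScript sketch of how the config could be typed and merged with defaults. The names `ProviderRetryPolicy`, `DEFAULT_RETRY`, and `resolveRetryPolicy` are hypothetical and do not exist in the gateway; the field names and default values are taken from the example above, and the validation rules are assumptions.

```typescript
// Hypothetical type for the proposed `retry` block (field names from the config example).
interface ProviderRetryPolicy {
  attempts: number;   // attempt limit before the error is surfaced
  minDelayMs: number; // initial backoff
  maxDelayMs: number; // cap on any single backoff delay
  jitter: number;     // fraction of the delay, e.g. 0.1 = 10%
  timeoutMs: number;  // total budget across all attempts
}

// Default values as proposed in this issue.
const DEFAULT_RETRY: ProviderRetryPolicy = {
  attempts: 3,
  minDelayMs: 2000,
  maxDelayMs: 30000,
  jitter: 0.1,
  timeoutMs: 60000,
};

// Merge a partial user config over the defaults and reject clearly invalid values.
// The specific validation rules are assumptions, not part of the proposal.
function resolveRetryPolicy(user: Partial<ProviderRetryPolicy> = {}): ProviderRetryPolicy {
  const p: ProviderRetryPolicy = { ...DEFAULT_RETRY, ...user };
  if (!Number.isInteger(p.attempts) || p.attempts < 1) {
    throw new RangeError("attempts must be an integer >= 1");
  }
  if (p.minDelayMs < 0 || p.maxDelayMs < p.minDelayMs) {
    throw new RangeError("expected 0 <= minDelayMs <= maxDelayMs");
  }
  if (p.jitter < 0 || p.jitter > 1) {
    throw new RangeError("jitter must be between 0 and 1");
  }
  if (!(p.timeoutMs > 0)) {
    throw new RangeError("timeoutMs must be positive");
  }
  return p;
}
```

With this shape, a user who only sets `attempts: 5` would still get the proposed delay, jitter, and timeout defaults.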

Behavior

  • On transient LLM errors (529, 503, 502, ECONNRESET, timeout), retry with exponential backoff
  • Respect Retry-After header if present
  • After all attempts are exhausted or timeoutMs is reached, surface the error to the user
  • Non-retryable errors (400, 401, 403) should fail immediately (no retry)
  • Applies to both main session completions and cron/isolated session completions
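The behavior listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the gateway's actual code: `ProviderError`, `RetryPolicy`, `isRetryable`, `backoffDelayMs`, and `withRetry` are hypothetical names, the error shape is invented for the example, `attempts` is assumed to count the first call, the backoff is assumed to double on each retry, and a Retry-After value is assumed to be already parsed into milliseconds.

```typescript
// Hypothetical error shape; a real provider SDK will expose these fields differently.
interface ProviderError {
  status?: number;       // HTTP status, if the provider responded
  code?: string;         // network-level code, e.g. "ECONNRESET"
  retryAfterMs?: number; // Retry-After header, already converted to milliseconds
}

interface RetryPolicy {
  attempts: number;
  minDelayMs: number;
  maxDelayMs: number;
  jitter: number;
  timeoutMs: number;
}

const RETRYABLE_STATUS = new Set([502, 503, 529]);
const RETRYABLE_CODES = new Set(["ECONNRESET", "ETIMEDOUT"]);

// Transient errors are retried; everything else (400, 401, 403, ...) fails immediately.
function isRetryable(err: ProviderError): boolean {
  return (
    (err.status !== undefined && RETRYABLE_STATUS.has(err.status)) ||
    (err.code !== undefined && RETRYABLE_CODES.has(err.code))
  );
}

// Delay before retry number `retry` (1-based): minDelayMs * 2^(retry - 1),
// capped at maxDelayMs, then shifted by up to +/- jitter.
function backoffDelayMs(retry: number, p: RetryPolicy, rand: () => number = Math.random): number {
  const base = Math.min(p.maxDelayMs, p.minDelayMs * 2 ** (retry - 1));
  const spread = base * p.jitter * (2 * rand() - 1);
  return Math.max(0, Math.round(base + spread));
}

async function withRetry<T>(
  call: () => Promise<T>,
  p: RetryPolicy,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((r) => setTimeout(r, ms)),
  now: () => number = Date.now,
): Promise<T> {
  const deadline = now() + p.timeoutMs;
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (e) {
      const err = e as ProviderError;
      // Give up on non-retryable errors or once the attempt limit is reached.
      if (!isRetryable(err) || attempt >= p.attempts) throw e;
      // Prefer the provider's Retry-After hint over the computed backoff.
      const delay = err.retryAfterMs ?? backoffDelayMs(attempt, p);
      // Give up if waiting would exceed the total time budget.
      if (now() + delay >= deadline) throw e;
      await sleep(delay);
    }
  }
}
```

Note that this sketch only checks `timeoutMs` between attempts; it does not abort a call that is already in flight, which a real implementation would probably want to handle (for example with an abort signal). The `sleep` and `now` parameters are injectable so the loop can be exercised in tests without real waiting.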

Defaults

Sensible defaults that work out of the box:

  • 3 attempts, 2s min delay, 30s max delay, 10% jitter, 60s total timeout
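As a quick sanity check on these defaults, the sketch below computes the nominal wait schedule. It assumes the backoff doubles on each retry and that `attempts` counts the first call; the issue specifies neither, and jitter is ignored here. `backoffSchedule` is a hypothetical helper written for this example.

```typescript
// Nominal (jitter-free) delays between attempts, assuming the delay doubles each retry.
function backoffSchedule(attempts: number, minDelayMs: number, maxDelayMs: number): number[] {
  const delays: number[] = [];
  // With N attempts there are at most N - 1 waits.
  for (let retry = 1; retry < attempts; retry++) {
    delays.push(Math.min(maxDelayMs, minDelayMs * 2 ** (retry - 1)));
  }
  return delays;
}

const delays = backoffSchedule(3, 2000, 30000);            // [2000, 4000]
const totalWaitMs = delays.reduce((sum, d) => sum + d, 0); // 6000
```

Under these assumptions the default policy spends at most about 6 seconds waiting (plus or minus jitter), so the 60s total timeout is mostly a bound on slow provider calls or long Retry-After values. The 30s cap only comes into play at six or more attempts, where the fifth wait would otherwise be 32s.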

Context

This is the #1 UX friction for end users on Telegram/WhatsApp: transient provider overload is common during peak hours and almost always resolves within seconds. The existing channel retry infrastructure could likely be generalized.

Related

  • Existing retry policy: concepts/retry.md (channels only)
  • Anthropic 529 errors are common during peak usage hours


Labels: stale (marked as stale due to inactivity)
