Problem
When the LLM provider (e.g. Anthropic) returns a transient error like HTTP 529 ("overloaded"), the gateway immediately sends an error message to the user:
"The AI service is temporarily overloaded. Please try again in a moment."
This is a poor UX — the user has to manually retry, and has no way to know if the issue is transient (it usually is).
Current behavior
- Channel-level retries exist (Telegram 429, Discord rate limits) — see concepts/retry.md
- No retry exists for LLM/completion provider errors (529, 503, timeouts)
- The raw error is surfaced directly to the end user
Proposed solution
Add a configurable retry policy for LLM provider calls, similar to the existing channel retry policy:
```jsonc
{
  providers: {
    anthropic: {
      retry: {
        attempts: 3,       // max retries before surfacing the error
        minDelayMs: 2000,  // initial backoff delay
        maxDelayMs: 30000, // cap
        jitter: 0.1,       // 10% jitter to avoid thundering herd
        timeoutMs: 60000,  // total timeout across all attempts
      }
    }
  }
}
```
Behavior
- On transient LLM errors (529, 503, 502, ECONNRESET, timeout), retry with exponential backoff
- Respect the Retry-After header if present
- After all attempts are exhausted OR timeoutMs is reached, surface the error to the user
- Non-retryable errors (400, 401, 403) should fail immediately (no retry)
- Applies to both main session completions and cron/isolated session completions
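The behavior above could be sketched roughly as follows. This is an illustrative sketch only — names like `RetryPolicy`, `isRetryable`, and `completeWithRetry` are hypothetical, not the gateway's actual API:

```typescript
interface RetryPolicy {
  attempts: number;   // max attempts before surfacing the error
  minDelayMs: number; // initial backoff delay
  maxDelayMs: number; // backoff cap
  jitter: number;     // e.g. 0.1 = ±10% randomization
  timeoutMs: number;  // total budget across all attempts
}

// Transient provider errors worth retrying; everything else fails fast.
const RETRYABLE_STATUSES = new Set([502, 503, 529]);

function isRetryable(err: { status?: number; code?: string }): boolean {
  if (err.status !== undefined) return RETRYABLE_STATUSES.has(err.status);
  return err.code === "ECONNRESET" || err.code === "ETIMEDOUT";
}

function backoffDelay(attempt: number, policy: RetryPolicy): number {
  // Exponential backoff: minDelayMs * 2^attempt, capped at maxDelayMs.
  const base = Math.min(policy.minDelayMs * 2 ** attempt, policy.maxDelayMs);
  // Apply ±jitter so synchronized clients don't retry in lockstep.
  const spread = base * policy.jitter;
  return base - spread + Math.random() * 2 * spread;
}

async function completeWithRetry<T>(
  call: () => Promise<T>,
  policy: RetryPolicy,
): Promise<T> {
  const deadline = Date.now() + policy.timeoutMs;
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err: any) {
      // Prefer the provider's Retry-After (assumed pre-parsed to ms) if given.
      const delay = err.retryAfterMs ?? backoffDelay(attempt, policy);
      const exhausted =
        attempt + 1 >= policy.attempts || Date.now() + delay > deadline;
      if (!isRetryable(err) || exhausted) throw err; // surface to the user
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Because the loop is provider-agnostic, the same wrapper could serve both main session and cron/isolated session completions.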
Defaults
Sensible defaults that work out of the box:
- 3 attempts, 2s min delay, 30s max delay, 60s total timeout
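For concreteness, and assuming plain doubling from the minimum delay (the exact backoff curve is an implementation detail), the defaults would yield pre-jitter delays of roughly 2s, 4s, 8s — comfortably inside the 60s total timeout:

```typescript
// Illustrative only: raw exponential delays (before jitter) under the defaults.
const minDelayMs = 2000;
const maxDelayMs = 30000;
const attempts = 3;

const delays = Array.from({ length: attempts }, (_, i) =>
  Math.min(minDelayMs * 2 ** i, maxDelayMs),
);
// delays → [2000, 4000, 8000]
```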
Context
This is the #1 UX friction for end users on Telegram/WhatsApp — transient provider overload is common during peak hours and almost always resolves within seconds. The existing channel retry infrastructure could likely be generalized.
Related
- Existing retry policy: concepts/retry.md (channels only)
- Anthropic 529 errors are common during peak usage hours