fix: treat HTTP 503 as failover-eligible for LLM provider errors by Protocol-zero-0 · Pull Request #21086 · openclaw/openclaw

Protocol-zero-0 · 2026-02-19T16:47:47Z

Summary

When LLM providers (e.g. Google Gemini) return HTTP 503 "Service Unavailable" errors, their SDKs sometimes strip the leading status code from the error message — producing messages like "This model is currently experiencing high demand" or "UNAVAILABLE" instead of "503 ...".

The existing isTransientHttpError in errors.ts only matches messages starting with "503", so these wrapped/rewritten error messages silently fall through without triggering failover (no profile rotation, no model fallback).
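To make the gap concrete, here is a minimal sketch of a prefix-only check. The helper name `startsWithStatus503` and its regex are assumptions for illustration; the real `isTransientHttpError` in errors.ts may be implemented differently. Only the first sample message begins with the status code, so the two SDK-rewritten messages are missed.

```typescript
// Hypothetical stand-in for a prefix-only check; not the project's actual code.
function startsWithStatus503(message: string): boolean {
  // Matches only when the message begins with the literal status code.
  return /^\s*503\b/.test(message);
}

const samples: string[] = [
  "503 Service Unavailable",
  "This model is currently experiencing high demand",
  "UNAVAILABLE",
];

for (const message of samples) {
  const label = startsWithStatus503(message) ? "matched" : "missed";
  console.log(`${label}: ${message}`);
}
```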

This patch closes that gap with two complementary changes:

failover-error.ts: Map HTTP status code 503 → rate_limit in resolveFailoverReasonFromError, covering structured error objects that carry a numeric .status field.
errors.ts: Add three patterns to ERROR_PATTERNS.overloaded — /\b503\b/, "service unavailable", "high demand" — covering message-only classification when the leading status prefix is absent.

Existing isTransientHttpError behavior is unchanged; these additions are strictly complementary and only fire for errors that previously fell through unclassified.

Test plan

Added e2e test: resolveFailoverReasonFromError({ status: 503 }) returns "rate_limit"
Added e2e test: classifyFailoverReason correctly classifies Google "high demand" and generic "service unavailable" messages as "rate_limit"
Verified existing tests still pass

Motivation

Observed frequent 503 errors from Google Gemini API ("This model is currently experiencing high demand") that were not triggering openclaw's failover mechanism, causing tasks to fail instead of rotating to the next auth profile or falling back to an alternative model.

Made with Cursor

Greptile Summary

Maps HTTP 503 errors to failover-eligible classifications, closing a gap where SDK-rewritten error messages were silently failing without triggering profile rotation or model fallback.

Structured errors (status: 503): now map to "timeout" via resolveFailoverReasonFromError
SDK-rewritten messages (lacking numeric prefix): "service unavailable" and "high demand" patterns map to "rate_limit" via ERROR_PATTERNS.overloaded
Both classifications trigger failover; the distinction reflects error semantics (transient HTTP vs. overload)
Comprehensive test coverage added for both code paths
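The two code paths summarized above can be sketched as follows. This is an illustrative approximation, not the repository's code: `sketchResolveReason`, `OVERLOADED_PATTERNS`, and the prefix regex are hypothetical names and shapes, and only the mappings themselves (status 503 to "timeout", the two message patterns to "rate_limit") come from the PR discussion.

```typescript
type FailoverReason = "timeout" | "rate_limit";

// Assumed pattern list, modelled on the two patterns the PR describes.
const OVERLOADED_PATTERNS: ReadonlyArray<string | RegExp> = [
  "service unavailable",
  "high demand",
];

function matchesAny(message: string, patterns: ReadonlyArray<string | RegExp>): boolean {
  const lower = message.toLowerCase();
  return patterns.some((p) => (typeof p === "string" ? lower.includes(p) : p.test(lower)));
}

// Hypothetical classifier combining the structured path and the message path.
function sketchResolveReason(err: { status?: number; message?: string }): FailoverReason | null {
  // Structured path: a numeric status field is available.
  if (err.status === 503) return "timeout";
  const message = err.message ?? "";
  // String path: the status prefix survived in the message.
  if (/^\s*503\b/.test(message)) return "timeout";
  // Message-only path: the SDK rewrote the message and dropped the prefix.
  if (matchesAny(message, OVERLOADED_PATTERNS)) return "rate_limit";
  return null;
}

console.log(sketchResolveReason({ status: 503 }));
console.log(sketchResolveReason({ message: "This model is currently experiencing high demand" }));
```

Either non-null result would make the error eligible for failover in this sketch; the label only records which path recognized it.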

Confidence Score: 5/5

This PR is safe to merge with minimal risk
The changes are focused, well-tested, and address a real production issue (Google Gemini 503 errors not triggering failover). The implementation correctly handles both structured error objects and SDK-rewritten string messages. Previous semantic inconsistency concern was addressed by excluding the /\b503\b/ pattern. Both new classifications ("timeout" and "rate_limit") are failover-eligible, ensuring the core fix works as intended. Test coverage is comprehensive.
No files require special attention

Last reviewed commit: 232d622

When LLM SDKs wrap 503 responses, the leading "503" prefix is lost (e.g. Google Gemini returns "high demand" / "UNAVAILABLE" without a numeric prefix). The existing isTransientHttpError only matches messages starting with "503 ...", so these wrapped errors silently skip failover: no profile rotation, no model fallback.

This patch closes that gap:

- resolveFailoverReasonFromError: map HTTP status 503 → rate_limit (covers structured error objects with a status field)
- ERROR_PATTERNS.overloaded: add /\b503\b/, "service unavailable", "high demand" (covers message-only classification when the leading status prefix is absent)

Existing isTransientHttpError behavior is unchanged; these additions are complementary and only fire for errors that previously fell through unclassified.

greptile-apps

4 files reviewed, 1 comment

src/agents/pi-embedded-helpers/errors.ts

Copilot

Pull request overview

This PR improves failover handling for HTTP 503 "Service Unavailable" errors from LLM providers. When providers return 503 errors, their SDKs sometimes strip the leading status code from error messages, preventing the existing failover mechanism from detecting these as retryable errors. The PR addresses this gap through two complementary approaches: mapping structured 503 status codes to the rate_limit failover reason, and adding error message patterns to catch SDK-generated messages like "high demand" and "service unavailable."

Changes:

Map HTTP status code 503 → rate_limit in structured error objects
Add error message patterns (/\b503\b/, "service unavailable", "high demand") to detect stripped 503 messages
Add comprehensive test coverage for both structured and message-based error detection

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/agents/failover-error.ts | Added status code 503 → `rate_limit` mapping for structured error objects |
| src/agents/pi-embedded-helpers/errors.ts | Extended `ERROR_PATTERNS.overloaded` with `/\b503\b/`, "service unavailable", and "high demand" patterns |
| src/agents/failover-error.e2e.test.ts | Added test verifying `{ status: 503 }` maps to `rate_limit` |
| src/agents/pi-embedded-helpers.isbillingerrormessage.e2e.test.ts | Added tests for "high demand" and "service unavailable" message classification |

src/agents/pi-embedded-helpers.isbillingerrormessage.e2e.test.ts

- Remove `/\b503\b/` from ERROR_PATTERNS.overloaded to resolve the semantic inconsistency noted by reviewers: `isTransientHttpError` already handles messages prefixed with "503" (→ "timeout"), so a redundant overloaded pattern would classify the same class of errors differently depending on message formatting.
- Keep "service unavailable" and "high demand" patterns. These are the real gap-fillers for SDK-rewritten messages that lack a numeric prefix.
- Add test case for JSON-wrapped 503 error body containing "overloaded" to strengthen coverage.
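The inconsistency this commit removes can be illustrated with a small sketch. The `classify` helper and both pattern lists are hypothetical and only approximate the described behavior: with a bare `/\b503\b/` pattern in the overloaded list, the same kind of 503 error gets a different label depending on where the number appears in the message.

```typescript
type Reason = "timeout" | "rate_limit" | null;

// Hypothetical classifier: the prefix check runs first, then the overloaded patterns.
function classify(message: string, overloaded: ReadonlyArray<string | RegExp>): Reason {
  if (/^\s*503\b/.test(message)) return "timeout";
  const lower = message.toLowerCase();
  const hit = overloaded.some((p) => (typeof p === "string" ? lower.includes(p) : p.test(lower)));
  return hit ? "rate_limit" : null;
}

// Pattern list as first proposed, and as kept after review.
const withBare503: ReadonlyArray<string | RegExp> = [/\b503\b/, "service unavailable", "high demand"];
const withoutBare503: ReadonlyArray<string | RegExp> = ["service unavailable", "high demand"];

// Same underlying error, two labels, purely because of message formatting.
console.log(classify("503 upstream error", withBare503));
console.log(classify("upstream error (503)", withBare503));

// Without the bare pattern, a "503" in the middle of a message no longer matches.
console.log(classify("upstream error (503)", withoutBare503));
```

In this sketch, dropping the pattern means a message that mentions 503 mid-string, and contains neither phrase, is left unclassified; whether the real code behaves the same way depends on details not shown in the PR.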

… isTransientHttpError)

resolveFailoverReasonFromError previously mapped status 503 → "rate_limit", while the string-based isTransientHttpError mapped "503 ..." → "timeout". Align both paths: structured {status: 503} now also returns "timeout", matching the existing transient-error convention. Both reasons are failover-eligible, so runtime behavior is unchanged.
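A short sketch of why the relabelling is described as behavior-neutral. All names here (`FAILOVER_ELIGIBLE`, `shouldFailover`, `structuredBefore`, `structuredAfter`, `fromMessage`) are hypothetical; the only facts taken from the commit message are the old and new labels and that both are failover-eligible.

```typescript
type Reason = "timeout" | "rate_limit" | null;

// Assumed eligibility set: both reasons discussed in the PR trigger failover.
const FAILOVER_ELIGIBLE: ReadonlySet<string> = new Set<string>(["timeout", "rate_limit"]);

function shouldFailover(reason: Reason): boolean {
  return reason !== null && FAILOVER_ELIGIBLE.has(reason);
}

// Structured path before and after the unification commit (sketch only).
function structuredBefore(status: number): Reason {
  return status === 503 ? "rate_limit" : null;
}
function structuredAfter(status: number): Reason {
  return status === 503 ? "timeout" : null;
}

// String path for messages that still carry the "503" prefix.
function fromMessage(message: string): Reason {
  return /^\s*503\b/.test(message) ? "timeout" : null;
}

// The label changes, the failover decision does not.
console.log(shouldFailover(structuredBefore(503)), shouldFailover(structuredAfter(503)));
// After the change, both paths agree on the label.
console.log(structuredAfter(503) === fromMessage("503 Service Unavailable"));
```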

Protocol-zero-0 · 2026-02-19T17:36:53Z

Hi @vincentkoc — thanks for maintaining this project!

TL;DR: When Google Gemini returns 503 "high demand" errors, their SDK rewrites the message to just "This model is currently experiencing high demand" — no 503 prefix. The existing isTransientHttpError only matches messages starting with "503", so these errors silently skip failover.

What this PR does (3 commits, +22/−2 net):

failover-error.ts: Map structured {status: 503} → "timeout" (aligned with isTransientHttpError)
errors.ts: Add "service unavailable" + "high demand" to ERROR_PATTERNS.overloaded (catches SDK-rewritten messages)
Two e2e tests covering both paths

Why it's safe: No existing classification is changed. These patterns only fire for errors that previously fell through unclassified. All 40 existing e2e tests pass.

Happy to adjust anything — appreciate your time!

vincentkoc · 2026-02-19T20:42:38Z

@greptileai review

…nclaw#21086)

* fix: treat HTTP 503 as failover-eligible for LLM provider errors

  When LLM SDKs wrap 503 responses, the leading "503" prefix is lost (e.g. Google Gemini returns "high demand" / "UNAVAILABLE" without a numeric prefix). The existing isTransientHttpError only matches messages starting with "503 ...", so these wrapped errors silently skip failover: no profile rotation, no model fallback.

  This patch closes that gap:

  - resolveFailoverReasonFromError: map HTTP status 503 → rate_limit (covers structured error objects with a status field)
  - ERROR_PATTERNS.overloaded: add /\b503\b/, "service unavailable", "high demand" (covers message-only classification when the leading status prefix is absent)

  Existing isTransientHttpError behavior is unchanged; these additions are complementary and only fire for errors that previously fell through unclassified.

* fix: address review feedback - drop /\b503\b/ pattern, add test coverage

  - Remove `/\b503\b/` from ERROR_PATTERNS.overloaded to resolve the semantic inconsistency noted by reviewers: `isTransientHttpError` already handles messages prefixed with "503" (→ "timeout"), so a redundant overloaded pattern would classify the same class of errors differently depending on message formatting.
  - Keep "service unavailable" and "high demand" patterns. These are the real gap-fillers for SDK-rewritten messages that lack a numeric prefix.
  - Add test case for JSON-wrapped 503 error body containing "overloaded" to strengthen coverage.

* fix: unify 503 classification - status 503 → timeout (consistent with isTransientHttpError)

  resolveFailoverReasonFromError previously mapped status 503 → "rate_limit", while the string-based isTransientHttpError mapped "503 ..." → "timeout". Align both paths: structured {status: 503} now also returns "timeout", matching the existing transient-error convention. Both reasons are failover-eligible, so runtime behavior is unchanged.

---------

Co-authored-by: Vincent Koc <[email protected]>

Protocol-zero-0 · 2026-02-20T20:36:50Z

@vincentkoc Glad to see this merged. Treating HTTP 503 as failover-eligible is essential for agent reliability under high load. We're observing the stability of this fix in our production environment and planning further optimizations for multi-model load balancing.

…nclaw#21086)
(cherry picked from commit 2af3415)
# Conflicts:
# src/agents/pi-embedded-helpers/errors.ts

Copilot AI review requested due to automatic review settings February 19, 2026 16:47

openclaw-barnacle bot added the labels "agents" (Agent runtime and tooling) and "size: XS" on Feb 19, 2026

Copilot started reviewing on behalf of Protocol-zero-0 February 19, 2026 16:48

greptile-apps bot reviewed Feb 19, 2026

src/agents/pi-embedded-helpers/errors.ts (resolved)

Copilot AI reviewed Feb 19, 2026

src/agents/pi-embedded-helpers.isbillingerrormessage.e2e.test.ts (resolved)

Protocol-zero-0 added 2 commits February 19, 2026 17:17

Merge branch 'main' into fix/503-failover

232d622

vincentkoc merged commit 2af3415 into openclaw:main Feb 19, 2026
24 checks passed

brandonwise mentioned this pull request Feb 19, 2026

fix: trigger model fallback on 503 UNAVAILABLE and high-demand errors #20331

Closed

7 tasks

vincentkoc mentioned this pull request Feb 19, 2026

chore(changelog): add unreleased fixes for #7734 and #21086 #21254

Merged

github-actions bot mentioned this pull request Mar 1, 2026

cherry-pick: upstream bugfix commits (2026-03-01-0443) hughdidit/DAISy-Agency#140

Closed

6 tasks

Protocol-zero-0 deleted the fix/503-failover branch March 17, 2026 16:26

vincentkoc merged 4 commits into openclaw:main from Protocol-zero-0:fix/503-failover

3 participants
