fix(agents): handle overloaded failover separately by altaywtf · Pull Request #38301 · openclaw/openclaw

altaywtf · 2026-03-06T19:54:49Z

Summary

- introduce overloaded as a first-class failover reason instead of routing overload through rate_limit
- persist overloaded failures as transient auth-profile cooldowns so later turns can observe and probe them deliberately
- add overload-only backoff before failover while keeping bare status-only 503 and generic service unavailable on the conservative timeout path
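
The classification split described in the summary can be sketched roughly as follows. This is a minimal illustration, not the PR's classifier: the reduced `FailoverReason` union, the `ProviderError` shape, the `OVERLOAD_PATTERN` regex, and the 429 mapping are all assumptions made for the example.

```typescript
// Hypothetical sketch; the real classifier lives in
// src/agents/pi-embedded-helpers/errors.ts and is not reproduced here.
type FailoverReason = "overloaded" | "rate_limit" | "timeout";

// Assumed error shape for illustration only.
interface ProviderError {
  status?: number;
  message?: string;
}

// Assumed wording check; the actual matching rules are not shown on this page.
const OVERLOAD_PATTERN = /overloaded/i;

function classifyFailoverReason(err: ProviderError): FailoverReason | null {
  const message = err.message ?? "";
  // HTTP 529 is treated as overload regardless of body.
  if (err.status === 529) return "overloaded";
  // Assumption: 429 stays in the rate_limit bucket.
  if (err.status === 429) return "rate_limit";
  if (err.status === 503) {
    // Only overload-worded 503 bodies count as overloaded; a bare or
    // generic "service unavailable" 503 stays on the timeout path.
    return OVERLOAD_PATTERN.test(message) ? "overloaded" : "timeout";
  }
  return null;
}
```

Under these assumptions a bare `{ status: 503 }` resolves to `"timeout"`, while `{ status: 503, message: "The server is overloaded" }` resolves to `"overloaded"`.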

Changes

classifier and failover plumbing:
- src/agents/pi-embedded-helpers/types.ts
- src/agents/pi-embedded-helpers/errors.ts
- src/agents/failover-error.ts
auth-profile + fallback policy:
- src/agents/auth-profiles/types.ts
- src/agents/auth-profiles/usage.ts
- src/agents/model-fallback.ts
- src/agents/pi-embedded-runner/run.ts
- src/agents/pi-embedded-runner/run/params.ts
user-facing/status surfaces:
- src/commands/models/list.probe.ts
- src/discord/monitor/auto-presence.ts
regression coverage across classifier, runner, fallback, and integration seams, including a new higher-boundary test:
- src/agents/model-fallback.run-embedded.e2e.test.ts

Verification

boundary / policy tests passed:
- src/agents/failover-error.test.ts
- src/agents/pi-embedded-helpers.isbillingerrormessage.test.ts
- src/agents/model-fallback.probe.test.ts
- src/agents/model-fallback.test.ts
- src/commands/models/list.probe.test.ts
- src/discord/monitor/auto-presence.test.ts
- total: 146 tests
higher-boundary e2e / integration tests passed:
- src/agents/pi-embedded-runner.run-embedded-pi-agent.auth-profile-rotation.e2e.test.ts
- src/agents/model-fallback.run-embedded.e2e.test.ts
- src/auto-reply/reply/agent-runner.runreplyagent.e2e.test.ts
- total: 67 tests
focused lint passed for the new runtime/integration seam
pnpm build passed
pnpm check is currently red on unrelated existing Feishu type errors in extensions/feishu/src/media.ts

Linked Issues

greptile-apps · 2026-03-06T20:06:09Z

Greptile Summary

This PR cleanly elevates "overloaded" to a first-class failover reason, separating it from the "rate_limit" bucket it was previously merged into. The classifier, cooldown system, auth-profile recording, model-fallback probe logic, Discord presence surface, and the /models probe view are all updated consistently, and each change is backed by targeted unit or e2e tests.

Key changes:

- FailoverReason / AuthProfileFailureReason: "overloaded" added to both type unions; HTTP 529 and overload-worded 503 bodies now resolve to "overloaded" instead of "rate_limit" or "timeout"
- resolveAuthProfileFailureReason helper filters only null and "timeout" from being persisted; all other failover reasons (including "overloaded") are recorded as transient cooldowns, enabling cross-turn probe/fallback behaviour
- maybeBackoffBeforeOverloadFailover adds a short exponential backoff (250 ms → 1.5 s, ×2, 20% jitter) before any profile-rotation continue or model-fallback throw triggered by an "overloaded" result; the two call sites are in mutually exclusive branches so no double-sleep is possible
- allowRateLimitCooldownProbe renamed to allowTransientCooldownProbe with the probe gate extended to cover "overloaded" alongside "rate_limit"; a straightforward rename cascaded through all call sites
- One subtle behaviour change worth verifying: the new resolveAuthProfileFailureReason helper returns null for a null input, whereas the previous shouldRotate block used ?? "unknown", meaning unrecognized failover errors no longer record an "unknown" cooldown on the responsible profile (see inline comment)
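
The backoff parameters listed above (250 ms initial delay, 1.5 s cap, factor 2, 20% jitter) could be realised along these lines. This is a sketch under stated assumptions: `computeBackoffMs` and `sleepWithAbort` are hypothetical names, and the additive jitter formula is one plausible reading of "20% jitter", not the one confirmed by the PR.

```typescript
// Illustrative only; not the body of maybeBackoffBeforeOverloadFailover.
const INITIAL_MS = 250;
const MAX_MS = 1500;
const FACTOR = 2;
const JITTER = 0.2;

// attempt is zero-based: 0 -> 250 ms, 1 -> 500 ms, 2 -> 1000 ms, then capped at 1500 ms.
function computeBackoffMs(attempt: number, random: () => number = Math.random): number {
  const base = Math.min(MAX_MS, INITIAL_MS * FACTOR ** attempt);
  // Assumed jitter shape: add between 0% and 20% of the base delay.
  return Math.round(base + base * JITTER * random());
}

// Sleep that rejects as soon as the signal aborts, so a cancelled run
// does not wait out the remaining backoff.
function sleepWithAbort(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal?.aborted) {
      reject(new Error("aborted"));
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      reject(new Error("aborted"));
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}
```

A caller would await `sleepWithAbort(computeBackoffMs(attempt), signal)` only when the classified reason is `"overloaded"`, then proceed to rotate the profile or throw for model fallback.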

Confidence Score: 4/5

- Safe to merge, with one behaviour change whose intent is unverified: unrecognized failover errors no longer record "unknown" cooldowns.
- The logic is internally consistent: the classifier, cooldown recorder, probe gate, and backoff path all handle "overloaded" correctly. Mutually exclusive branches ensure no double-backoff. AbortSignal propagation through the new sleep is correct. Tests cover cross-turn probe/fallback, timeout-lane isolation, and abort propagation. Score is 4 rather than 5 solely because of the silent null → no-op change in resolveAuthProfileFailureReason vs the prior ?? "unknown" path, which changes how unrecognized failover messages affect profile cooldown state and is not explicitly called out in the PR description.
- File needing attention: src/agents/pi-embedded-runner/run.ts, specifically resolveAuthProfileFailureReason and the shouldRotate block, to confirm the intent around null failover reasons no longer triggering "unknown" cooldowns.

Last reviewed commit: a343a0e

greptile-apps · 2026-03-06T20:06:17Z

src/agents/pi-embedded-runner/run.ts

          agentDir,
        });
      };
+      const resolveAuthProfileFailureReason = (
+        failoverReason: FailoverReason | null,
+      ): AuthProfileFailureReason | null => {
+        // Timeouts are transport/model-path failures, not auth health signals,
+        // so they should not persist auth-profile failure state.
+        if (!failoverReason || failoverReason === "timeout") {
+          return null;
+        }
+        return failoverReason;
+      };
+      const maybeBackoffBeforeOverloadFailover = async (reason: FailoverReason | null) => {


Silent drop of "unknown" cooldown marks for unrecognized failover errors

resolveAuthProfileFailureReason returns null for a null input (unrecognized reason), but the old code in the shouldRotate block used assistantFailoverReason ?? "unknown", which would fall through to record an "unknown" cooldown on the profile.

With the new code, when failoverFailure is true but classifyFailoverReason returns null (a message that isFailoverAssistantError accepts but no classifier matches), reason becomes null and maybeMarkAuthProfileFailure silently does nothing.

This means genuinely unrecognized failover errors will no longer accumulate exponential backoff on the responsible profile. Whether that is the intended consequence is not stated in the PR description. If it is intentional, a short comment on resolveAuthProfileFailureReason would clarify the decision (e.g., "null/unknown reasons are excluded because they are too ambiguous to warrant a persistent cooldown").

If it is not intentional, the old fallback should be preserved:

// Instead of:
return failoverReason;
// Consider:
return failoverReason ?? "unknown"; // preserve prior unknown-cooldown behaviour
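
To make the difference concrete, here is a small side-by-side sketch. The two unions are reduced stand-ins for the real types, and both function names are hypothetical. Note that in the diff shown above, the early `!failoverReason` check already returns `null`, so restoring the old behaviour would presumably also require narrowing that guard to `"timeout"` only, as the second function assumes.

```typescript
// Reduced stand-ins; the real unions in the repository have more members.
type FailoverReason = "overloaded" | "rate_limit" | "timeout";
type AuthProfileFailureReason = "overloaded" | "rate_limit" | "unknown";

// Behaviour as shown in the reviewed diff: null and "timeout" are not persisted.
function resolveAsInDiff(reason: FailoverReason | null): AuthProfileFailureReason | null {
  if (!reason || reason === "timeout") return null;
  return reason;
}

// Hypothetical variant that keeps the prior "unknown" cooldown for
// unclassified failover errors while still skipping timeouts.
function resolveWithUnknownFallback(
  reason: FailoverReason | null,
): AuthProfileFailureReason | null {
  if (reason === "timeout") return null;
  return reason ?? "unknown";
}
```

With a `null` input the first function yields `null` (no cooldown recorded) and the second yields `"unknown"`; for `"timeout"` and `"overloaded"` the two agree.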

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c4dd0bfa23

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

src/agents/pi-embedded-helpers/errors.ts

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 32abc238ec

src/cron/service/timer.ts

jalehman

Very helpful!

jalehman · 2026-03-06T21:28:31Z

@altaywtf Any idea what's up with the secrets check?

altaywtf · 2026-03-06T22:15:57Z

> @altaywtf Any idea what's up with the secrets check?

nope, attempted several fixes but didn't work. unfortunately they're broken in main branch as well 😞

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: bcced6dab0

src/agents/pi-embedded-runner/run.ts

altaywtf · 2026-03-06T22:42:43Z

@jalehman merged this one, thanks a lot for having a look!

jalehman · 2026-03-06T22:46:49Z

Even read the code myself — been a while since I've done that :)

* fix(agents): skip auth-profile failure on overload
* fix(agents): note overload auth-profile fallback fix
* fix(agents): classify overloaded failures separately
* fix(agents): back off before overload failover
* fix(agents): tighten overload probe and backoff state
* fix(agents): persist overloaded cooldown across runs
* fix(agents): tighten overloaded status handling
* test(agents): add overload regression coverage
* fix(agents): restore runner imports after rebase
* test(agents): add overload fallback integration coverage
* fix(agents): harden overloaded failover abort handling
* test(agents): tighten overload classifier coverage
* test(agents): cover all-overloaded fallback exhaustion
* fix(cron): retry overloaded fallback summaries
* fix(cron): treat HTTP 529 as overloaded retry

(cherry picked from commit 6e962d8)

openclaw-barnacle bot added the channel: discord, commands, agents, size: L, and maintainer labels Mar 6, 2026

altaywtf self-assigned this Mar 6, 2026

greptile-apps bot reviewed Mar 6, 2026

altaywtf force-pushed the fix/overloaded-failover-policy branch from a343a0e to c4dd0bf March 6, 2026 20:26

openclaw-barnacle bot added size: XL and removed size: L labels Mar 6, 2026

chatgpt-codex-connector bot reviewed Mar 6, 2026

src/agents/pi-embedded-helpers/errors.ts (resolved)

jalehman self-assigned this Mar 6, 2026

openclaw-barnacle bot added the docs Improvements or additions to documentation label Mar 6, 2026

altaywtf requested a review from jalehman March 6, 2026 20:52

chatgpt-codex-connector bot reviewed Mar 6, 2026

src/cron/service/timer.ts (outdated, resolved)

jalehman approved these changes Mar 6, 2026

altaywtf mentioned this pull request Mar 6, 2026

fix: unblock check and secrets CI failures #38353

Closed

18 tasks

altaywtf force-pushed the fix/overloaded-failover-policy branch from 33d8976 to bcced6d March 6, 2026 22:16

chatgpt-codex-connector bot reviewed Mar 6, 2026

src/agents/pi-embedded-runner/run.ts (resolved)

altaywtf added 8 commits March 7, 2026 01:36

fix(agents): skip auth-profile failure on overload

68da6e9

fix(agents): note overload auth-profile fallback fix

15f738f

fix(agents): classify overloaded failures separately

492c34b

fix(agents): back off before overload failover

4eda1fd

fix(agents): tighten overload probe and backoff state

19c080e

fix(agents): persist overloaded cooldown across runs

b652437

fix(agents): tighten overloaded status handling

5db0617

test(agents): add overload regression coverage

d2863fa

altaywtf and others added 7 commits March 7, 2026 01:36

fix(agents): restore runner imports after rebase

834dc67

test(agents): add overload fallback integration coverage

e94cb98

fix(agents): harden overloaded failover abort handling

4ee6bf3

test(agents): tighten overload classifier coverage

9c2b59e

test(agents): cover all-overloaded fallback exhaustion

e26d588

fix(cron): retry overloaded fallback summaries

b4c4d67

fix(cron): treat HTTP 529 as overloaded retry

9ce3c9f

altaywtf force-pushed the fix/overloaded-failover-policy branch from bcced6d to 9ce3c9f March 6, 2026 22:36

altaywtf merged commit 6e962d8 into main Mar 6, 2026
29 of 30 checks passed

altaywtf deleted the fix/overloaded-failover-policy branch March 6, 2026 22:42

jalehman restored the fix/overloaded-failover-policy branch March 6, 2026 22:48

github-actions bot mentioned this pull request Mar 7, 2026

📡 Upstream Digest — 2026-03-07 01:15 UTC curtismercier/openclaw-mods#196

Open

Takhoffman mentioned this pull request Mar 7, 2026

fix: Anthropic 529 overloaded_error failover + thinking block corruption on retry #34723

Closed

alexyyyander mentioned this pull request Mar 7, 2026

fix/gateway token mismatch 38617 #38676

Closed

github-actions bot mentioned this pull request Mar 8, 2026

上游更新: v2026.3.7 — 19 P0 + 122 P1 待合并 jiulingyun/openclaw-cn#486

Open

alexey-pelykh mentioned this pull request Mar 10, 2026

Cherry-pick: gateway readiness probes, stale routes, HEIC media, Discord dedup remoteclaw/remoteclaw#845

Closed

This was referenced Mar 10, 2026

fix(agents): fail over on streaming server_error #42562

Closed

[Bug]: Codex Responses API streaming server_error does not trigger model fallback (variant of #24378) #35220

Open

alexey-pelykh pushed a commit to remoteclaw/remoteclaw that referenced this pull request Mar 20, 2026

fix(agents): handle overloaded failover separately (openclaw#38301)

c5117a3

(cherry picked from commit 6e962d8)

alexey-pelykh pushed a commit to remoteclaw/remoteclaw that referenced this pull request Mar 20, 2026

fix(agents): handle overloaded failover separately (openclaw#38301)

2850c79

(cherry picked from commit 6e962d8)

altaywtf merged 15 commits into main from fix/overloaded-failover-policy
