fix(agents): handle overloaded failover separately #38301

Merged
altaywtf merged 15 commits into main from fix/overloaded-failover-policy
Mar 6, 2026

Conversation

Member

@altaywtf altaywtf commented Mar 6, 2026

Summary

  • introduce overloaded as a first-class failover reason instead of routing overload through rate_limit
  • persist overloaded failures as transient auth-profile cooldowns so later turns can observe and probe them deliberately
  • add overload-only backoff before failover while keeping bare status-only 503 and generic service unavailable on the conservative timeout path
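As a rough illustration of the classification described in the summary, the sketch below maps an HTTP status and message to a failover reason. The names `classifyFailoverReasonSketch`, `ErrorLike`, and `OVERLOAD_PATTERN` are hypothetical, and the 429 mapping and the regex are assumptions; this is not the code in src/agents/pi-embedded-helpers/errors.ts.

```typescript
// Hypothetical sketch; names and patterns are illustrative assumptions.
type FailoverReasonSketch = "overloaded" | "rate_limit" | "timeout";

interface ErrorLike {
  status?: number;
  message?: string;
}

// Assumed wording for overload bodies, e.g. "overloaded_error".
const OVERLOAD_PATTERN = /overloaded/i;

function classifyFailoverReasonSketch(err: ErrorLike): FailoverReasonSketch | null {
  const message = err.message ?? "";
  // HTTP 529 is treated as overload regardless of the body.
  if (err.status === 529) return "overloaded";
  if (err.status === 503) {
    // Overload-worded 503 bodies count as overload; bare or generic
    // "service unavailable" 503s stay on the conservative timeout path.
    return OVERLOAD_PATTERN.test(message) ? "overloaded" : "timeout";
  }
  if (err.status === 429) return "rate_limit"; // assumed mapping
  return null; // unclassified
}
```

Keeping the status-only 503 on the timeout path means a generic outage does not get the overload-specific backoff or cooldown.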

Changes

  • classifier and failover plumbing:
    • src/agents/pi-embedded-helpers/types.ts
    • src/agents/pi-embedded-helpers/errors.ts
    • src/agents/failover-error.ts
  • auth-profile + fallback policy:
    • src/agents/auth-profiles/types.ts
    • src/agents/auth-profiles/usage.ts
    • src/agents/model-fallback.ts
    • src/agents/pi-embedded-runner/run.ts
    • src/agents/pi-embedded-runner/run/params.ts
  • user-facing/status surfaces:
    • src/commands/models/list.probe.ts
    • src/discord/monitor/auto-presence.ts
  • regression coverage across classifier, runner, fallback, and integration seams, including a new higher-boundary test:
    • src/agents/model-fallback.run-embedded.e2e.test.ts

Verification

  • boundary / policy tests passed:
    • src/agents/failover-error.test.ts
    • src/agents/pi-embedded-helpers.isbillingerrormessage.test.ts
    • src/agents/model-fallback.probe.test.ts
    • src/agents/model-fallback.test.ts
    • src/commands/models/list.probe.test.ts
    • src/discord/monitor/auto-presence.test.ts
    • total: 146 tests
  • higher-boundary e2e / integration tests passed:
    • src/agents/pi-embedded-runner.run-embedded-pi-agent.auth-profile-rotation.e2e.test.ts
    • src/agents/model-fallback.run-embedded.e2e.test.ts
    • src/auto-reply/reply/agent-runner.runreplyagent.e2e.test.ts
    • total: 67 tests
  • focused lint passed for the new runtime/integration seam
  • pnpm build passed
  • pnpm check is currently red on unrelated existing Feishu type errors in extensions/feishu/src/media.ts

Linked Issues

@openclaw-barnacle openclaw-barnacle bot added channel: discord Channel integration: discord commands Command implementations agents Agent runtime and tooling size: L maintainer Maintainer-authored PR labels Mar 6, 2026
@altaywtf altaywtf self-assigned this Mar 6, 2026
Contributor

greptile-apps bot commented Mar 6, 2026

Greptile Summary

This PR cleanly elevates "overloaded" to a first-class failover reason, separating it from the "rate_limit" bucket it was previously merged into. The classifier, cooldown system, auth-profile recording, model-fallback probe logic, Discord presence surface, and the /models probe view are all updated consistently, and each change is backed by targeted unit or e2e tests.

Key changes:

  • FailoverReason / AuthProfileFailureReason: "overloaded" added to both type unions; HTTP 529 and overload-worded 503 bodies now resolve to "overloaded" instead of "rate_limit" or "timeout"
  • resolveAuthProfileFailureReason helper filters only null and "timeout" from being persisted; all other failover reasons (including "overloaded") are recorded as transient cooldowns, enabling cross-turn probe/fallback behaviour
  • maybeBackoffBeforeOverloadFailover adds a short exponential backoff (250 ms → 1.5 s, ×2, 20% jitter) before any profile-rotation continue or model-fallback throw triggered by an "overloaded" result; the two call sites are in mutually exclusive branches so no double-sleep is possible
  • allowRateLimitCooldownProbe renamed to allowTransientCooldownProbe with the probe gate extended to cover "overloaded" alongside "rate_limit" — a straightforward rename cascaded through all call sites
  • One subtle behaviour change worth verifying: the new resolveAuthProfileFailureReason helper returns null for a null input, whereas the previous shouldRotate block used ?? "unknown" — meaning unrecognized failover errors no longer record an "unknown" cooldown on the responsible profile (see inline comment)
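The backoff schedule mentioned above (250 ms → 1.5 s, ×2, 20% jitter) could be computed roughly as follows. This is a minimal sketch: the function name `overloadBackoffDelayMs`, the 1-based `attempt` parameter, and the additive jitter formula are assumptions, not the implementation of maybeBackoffBeforeOverloadFailover.

```typescript
// Constants taken from the review summary; everything else is assumed.
const INITIAL_MS = 250;
const MAX_MS = 1500;
const FACTOR = 2;
const JITTER = 0.2;

function overloadBackoffDelayMs(
  attempt: number,
  random: () => number = Math.random,
): number {
  // attempt is 1-based: base delays are 250, 500, 1000, then capped at 1500.
  const exponent = Math.max(0, attempt - 1);
  const base = Math.min(MAX_MS, INITIAL_MS * FACTOR ** exponent);
  // Assumed jitter: add up to 20% of the base delay on top of it.
  return Math.round(base + base * JITTER * random());
}
```

With this additive form the jittered delay can exceed the 1.5 s cap by up to 20%; a symmetric jitter would be an equally plausible reading of the summary.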

Confidence Score: 4/5

  • Safe to merge with one unverified intentional behaviour change — unrecognized failover errors no longer record "unknown" cooldowns.
  • The logic is internally consistent: the classifier, cooldown recorder, probe gate, and backoff path all handle "overloaded" correctly. Mutually exclusive branches ensure no double-backoff. AbortSignal propagation through the new sleep is correct. Tests cover cross-turn probe/fallback, timeout-lane isolation, and abort propagation. Score is 4 rather than 5 solely because of the silent null → no-op change in resolveAuthProfileFailureReason vs the prior ?? "unknown" path, which changes how unrecognized failover messages affect profile cooldown state and is not explicitly called out in the PR description.
  • src/agents/pi-embedded-runner/run.ts — specifically resolveAuthProfileFailureReason and the shouldRotate block to confirm the intent around null failover reasons no longer triggering "unknown" cooldowns.
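For context on the abort propagation point, a sleep that honours an AbortSignal might look like the sketch below. The helper name `sleepWithAbort` and its error messages are hypothetical; the PR's actual sleep utility is not shown in this thread.

```typescript
// Hypothetical helper: resolves after `ms`, or rejects as soon as the signal aborts.
function sleepWithAbort(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    // Do not start a timer if the run was already cancelled.
    if (signal?.aborted) {
      reject(new Error("aborted before backoff"));
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      reject(new Error("aborted during backoff"));
    };
    const timer = setTimeout(() => {
      // Clean up the listener so it does not outlive the sleep.
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}
```

The point of rejecting rather than resolving on abort is that a cancelled run should not go on to rotate profiles or fall back to another model after the backoff.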

Last reviewed commit: a343a0e

Comment on lines 751 to +764
    agentDir,
  });
};
const resolveAuthProfileFailureReason = (
  failoverReason: FailoverReason | null,
): AuthProfileFailureReason | null => {
  // Timeouts are transport/model-path failures, not auth health signals,
  // so they should not persist auth-profile failure state.
  if (!failoverReason || failoverReason === "timeout") {
    return null;
  }
  return failoverReason;
};
const maybeBackoffBeforeOverloadFailover = async (reason: FailoverReason | null) => {

Silent drop of "unknown" cooldown marks for unrecognized failover errors

resolveAuthProfileFailureReason returns null for a null input (unrecognized reason), but the old code in the shouldRotate block used assistantFailoverReason ?? "unknown", which would fall through to record an "unknown" cooldown on the profile.

With the new code, when failoverFailure is true but classifyFailoverReason returns null (a message that isFailoverAssistantError accepts but no classifier matches), reason becomes null and maybeMarkAuthProfileFailure silently does nothing.

This means genuinely unrecognized failover errors will no longer accumulate exponential backoff on the responsible profile. Whether that is the intended consequence is not stated in the PR description. If it is intentional, a short comment on resolveAuthProfileFailureReason would clarify the decision (e.g., "null/unknown reasons are excluded because they are too ambiguous to warrant a persistent cooldown").

If it is not intentional, the old fallback should be preserved:

// Instead of:
return failoverReason;

// Consider:
return failoverReason ?? "unknown";   // preserve prior unknown-cooldown behaviour
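
To make the difference concrete, here is a small sketch comparing the merged behaviour with a variant that keeps the "unknown" fallback. The reduced `Reason` union and both function names are hypothetical. Note that, read against the snippet above, `?? "unknown"` on the final return would only take effect if the early `!failoverReason` guard were relaxed as well, which the variant below does.

```typescript
// Reduced, assumed union for illustration; the real types live in the repo.
type Reason = "overloaded" | "rate_limit" | "timeout" | "unknown";

// Behaviour as in the reviewed snippet: null and "timeout" are not persisted.
function resolveAsMerged(reason: Reason | null): Reason | null {
  if (!reason || reason === "timeout") {
    return null;
  }
  return reason;
}

// Hypothetical variant preserving the prior fallback: only "timeout" is
// skipped, and an unclassified (null) reason is recorded as "unknown".
function resolveWithUnknownFallback(reason: Reason | null): Reason | null {
  if (reason === "timeout") {
    return null;
  }
  return reason ?? "unknown";
}
```

With the first function an unclassified failover error leaves the profile's cooldown state untouched; with the second it records an "unknown" cooldown.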

@altaywtf altaywtf force-pushed the fix/overloaded-failover-policy branch from a343a0e to c4dd0bf Compare March 6, 2026 20:26

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c4dd0bfa23

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@jalehman jalehman self-assigned this Mar 6, 2026
@openclaw-barnacle openclaw-barnacle bot added the docs Improvements or additions to documentation label Mar 6, 2026
@altaywtf altaywtf requested a review from jalehman March 6, 2026 20:52

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 32abc238ec

Contributor

@jalehman jalehman left a comment

Very helpful!

Contributor

jalehman commented Mar 6, 2026

@altaywtf Any idea what's up with the secrets check?

Member Author

altaywtf commented Mar 6, 2026

@altaywtf Any idea what's up with the secrets check?

nope, attempted several fixes but didn't work. unfortunately they're broken in main branch as well 😞

@altaywtf altaywtf force-pushed the fix/overloaded-failover-policy branch from 33d8976 to bcced6d Compare March 6, 2026 22:16

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bcced6dab0

@altaywtf altaywtf force-pushed the fix/overloaded-failover-policy branch from bcced6d to 9ce3c9f Compare March 6, 2026 22:36
@altaywtf altaywtf merged commit 6e962d8 into main Mar 6, 2026
29 of 30 checks passed
@altaywtf altaywtf deleted the fix/overloaded-failover-policy branch March 6, 2026 22:42
Member Author

altaywtf commented Mar 6, 2026

@jalehman merged this one, thanks a lot for having a look!

Contributor

jalehman commented Mar 6, 2026

Even read the code myself — been a while since I've done that :)

@jalehman jalehman restored the fix/overloaded-failover-policy branch March 6, 2026 22:48
vincentkoc pushed a commit to BryanTegomoh/openclaw-fork that referenced this pull request Mar 8, 2026
* fix(agents): skip auth-profile failure on overload

* fix(agents): note overload auth-profile fallback fix

* fix(agents): classify overloaded failures separately

* fix(agents): back off before overload failover

* fix(agents): tighten overload probe and backoff state

* fix(agents): persist overloaded cooldown across runs

* fix(agents): tighten overloaded status handling

* test(agents): add overload regression coverage

* fix(agents): restore runner imports after rebase

* test(agents): add overload fallback integration coverage

* fix(agents): harden overloaded failover abort handling

* test(agents): tighten overload classifier coverage

* test(agents): cover all-overloaded fallback exhaustion

* fix(cron): retry overloaded fallback summaries

* fix(cron): treat HTTP 529 as overloaded retry
Saitop pushed a commit to NomiciAI/openclaw that referenced this pull request Mar 8, 2026
jenawant pushed a commit to jenawant/openclaw that referenced this pull request Mar 10, 2026
dhoman pushed a commit to dhoman/chrono-claw that referenced this pull request Mar 11, 2026
senw-developers pushed a commit to senw-developers/va-openclaw that referenced this pull request Mar 17, 2026
V-Gutierrez pushed a commit to V-Gutierrez/openclaw-vendor that referenced this pull request Mar 17, 2026
alexey-pelykh pushed a commit to remoteclaw/remoteclaw that referenced this pull request Mar 20, 2026

Labels

agents Agent runtime and tooling channel: discord Channel integration: discord commands Command implementations docs Improvements or additions to documentation maintainer Maintainer-authored PR size: XL

Development

Successfully merging this pull request may close these issues.

Anthropic overloaded_error (529) not triggering model fallback

2 participants