fix(agents): handle overloaded failover separately#38301
Conversation
Greptile SummaryThis PR cleanly elevates Key changes:
Confidence Score: 4/5
Last reviewed commit: a343a0e |
| agentDir, | ||
| }); | ||
| }; | ||
| const resolveAuthProfileFailureReason = ( | ||
| failoverReason: FailoverReason | null, | ||
| ): AuthProfileFailureReason | null => { | ||
| // Timeouts are transport/model-path failures, not auth health signals, | ||
| // so they should not persist auth-profile failure state. | ||
| if (!failoverReason || failoverReason === "timeout") { | ||
| return null; | ||
| } | ||
| return failoverReason; | ||
| }; | ||
| const maybeBackoffBeforeOverloadFailover = async (reason: FailoverReason | null) => { |
There was a problem hiding this comment.
Silent drop of "unknown" cooldown marks for unrecognized failover errors
resolveAuthProfileFailureReason returns null for a null input (unrecognized reason), but the old code in the shouldRotate block used assistantFailoverReason ?? "unknown", which would fall through to record an "unknown" cooldown on the profile.
With the new code, when failoverFailure is true but classifyFailoverReason returns null (a message that isFailoverAssistantError accepts but no classifier matches), reason becomes null and maybeMarkAuthProfileFailure silently does nothing.
This means genuinely unrecognized failover errors will no longer accumulate exponential backoff on the responsible profile. Whether that is the intended consequence is not stated in the PR description. If it is intentional, a short comment on resolveAuthProfileFailureReason would clarify the decision (e.g., "null/unknown reasons are excluded because they are too ambiguous to warrant a persistent cooldown").
If it is not intentional, the old fallback should be preserved:
// Instead of:
return failoverReason;
// Consider:
return failoverReason ?? "unknown"; // preserve prior unknown-cooldown behaviourPrompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/pi-embedded-runner/run.ts
Line: 751-764
Comment:
**Silent drop of `"unknown"` cooldown marks for unrecognized failover errors**
`resolveAuthProfileFailureReason` returns `null` for a `null` input (unrecognized reason), but the old code in the `shouldRotate` block used `assistantFailoverReason ?? "unknown"`, which would fall through to record an `"unknown"` cooldown on the profile.
With the new code, when `failoverFailure` is true but `classifyFailoverReason` returns `null` (a message that `isFailoverAssistantError` accepts but no classifier matches), `reason` becomes `null` and `maybeMarkAuthProfileFailure` silently does nothing.
This means genuinely unrecognized failover errors will no longer accumulate exponential backoff on the responsible profile. Whether that is the intended consequence is not stated in the PR description. If it is intentional, a short comment on `resolveAuthProfileFailureReason` would clarify the decision (e.g., "null/unknown reasons are excluded because they are too ambiguous to warrant a persistent cooldown").
If it is **not** intentional, the old fallback should be preserved:
```typescript
// Instead of:
return failoverReason;
// Consider:
return failoverReason ?? "unknown"; // preserve prior unknown-cooldown behaviour
```
How can I resolve this? If you propose a fix, please make it concise.a343a0e to
c4dd0bf
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c4dd0bfa23
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 32abc238ec
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@altaywtf Any idea what's up with the secrets check? |
nope, attempted several fixes but didn't work. unfortunately they're broken in main branch as well 😞 |
33d8976 to
bcced6d
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bcced6dab0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
bcced6d to
9ce3c9f
Compare
|
@jalehman merged this one, thanks a lot for having a look! |
|
Even read the code myself — been a while since I've done that :) |
* fix(agents): skip auth-profile failure on overload * fix(agents): note overload auth-profile fallback fix * fix(agents): classify overloaded failures separately * fix(agents): back off before overload failover * fix(agents): tighten overload probe and backoff state * fix(agents): persist overloaded cooldown across runs * fix(agents): tighten overloaded status handling * test(agents): add overload regression coverage * fix(agents): restore runner imports after rebase * test(agents): add overload fallback integration coverage * fix(agents): harden overloaded failover abort handling * test(agents): tighten overload classifier coverage * test(agents): cover all-overloaded fallback exhaustion * fix(cron): retry overloaded fallback summaries * fix(cron): treat HTTP 529 as overloaded retry
* fix(agents): skip auth-profile failure on overload * fix(agents): note overload auth-profile fallback fix * fix(agents): classify overloaded failures separately * fix(agents): back off before overload failover * fix(agents): tighten overload probe and backoff state * fix(agents): persist overloaded cooldown across runs * fix(agents): tighten overloaded status handling * test(agents): add overload regression coverage * fix(agents): restore runner imports after rebase * test(agents): add overload fallback integration coverage * fix(agents): harden overloaded failover abort handling * test(agents): tighten overload classifier coverage * test(agents): cover all-overloaded fallback exhaustion * fix(cron): retry overloaded fallback summaries * fix(cron): treat HTTP 529 as overloaded retry
* fix(agents): skip auth-profile failure on overload * fix(agents): note overload auth-profile fallback fix * fix(agents): classify overloaded failures separately * fix(agents): back off before overload failover * fix(agents): tighten overload probe and backoff state * fix(agents): persist overloaded cooldown across runs * fix(agents): tighten overloaded status handling * test(agents): add overload regression coverage * fix(agents): restore runner imports after rebase * test(agents): add overload fallback integration coverage * fix(agents): harden overloaded failover abort handling * test(agents): tighten overload classifier coverage * test(agents): cover all-overloaded fallback exhaustion * fix(cron): retry overloaded fallback summaries * fix(cron): treat HTTP 529 as overloaded retry
* fix(agents): skip auth-profile failure on overload * fix(agents): note overload auth-profile fallback fix * fix(agents): classify overloaded failures separately * fix(agents): back off before overload failover * fix(agents): tighten overload probe and backoff state * fix(agents): persist overloaded cooldown across runs * fix(agents): tighten overloaded status handling * test(agents): add overload regression coverage * fix(agents): restore runner imports after rebase * test(agents): add overload fallback integration coverage * fix(agents): harden overloaded failover abort handling * test(agents): tighten overload classifier coverage * test(agents): cover all-overloaded fallback exhaustion * fix(cron): retry overloaded fallback summaries * fix(cron): treat HTTP 529 as overloaded retry
* fix(agents): skip auth-profile failure on overload * fix(agents): note overload auth-profile fallback fix * fix(agents): classify overloaded failures separately * fix(agents): back off before overload failover * fix(agents): tighten overload probe and backoff state * fix(agents): persist overloaded cooldown across runs * fix(agents): tighten overloaded status handling * test(agents): add overload regression coverage * fix(agents): restore runner imports after rebase * test(agents): add overload fallback integration coverage * fix(agents): harden overloaded failover abort handling * test(agents): tighten overload classifier coverage * test(agents): cover all-overloaded fallback exhaustion * fix(cron): retry overloaded fallback summaries * fix(cron): treat HTTP 529 as overloaded retry
* fix(agents): skip auth-profile failure on overload * fix(agents): note overload auth-profile fallback fix * fix(agents): classify overloaded failures separately * fix(agents): back off before overload failover * fix(agents): tighten overload probe and backoff state * fix(agents): persist overloaded cooldown across runs * fix(agents): tighten overloaded status handling * test(agents): add overload regression coverage * fix(agents): restore runner imports after rebase * test(agents): add overload fallback integration coverage * fix(agents): harden overloaded failover abort handling * test(agents): tighten overload classifier coverage * test(agents): cover all-overloaded fallback exhaustion * fix(cron): retry overloaded fallback summaries * fix(cron): treat HTTP 529 as overloaded retry
(cherry picked from commit 6e962d8)
(cherry picked from commit 6e962d8)
Summary
overloadedas a first-class failover reason instead of routing overload throughrate_limit503and genericservice unavailableon the conservativetimeoutpathChanges
src/agents/pi-embedded-helpers/types.tssrc/agents/pi-embedded-helpers/errors.tssrc/agents/failover-error.tssrc/agents/auth-profiles/types.tssrc/agents/auth-profiles/usage.tssrc/agents/model-fallback.tssrc/agents/pi-embedded-runner/run.tssrc/agents/pi-embedded-runner/run/params.tssrc/commands/models/list.probe.tssrc/discord/monitor/auto-presence.tssrc/agents/model-fallback.run-embedded.e2e.test.tsVerification
src/agents/failover-error.test.tssrc/agents/pi-embedded-helpers.isbillingerrormessage.test.tssrc/agents/model-fallback.probe.test.tssrc/agents/model-fallback.test.tssrc/commands/models/list.probe.test.tssrc/discord/monitor/auto-presence.test.ts146testssrc/agents/pi-embedded-runner.run-embedded-pi-agent.auth-profile-rotation.e2e.test.tssrc/agents/model-fallback.run-embedded.e2e.test.tssrc/auto-reply/reply/agent-runner.runreplyagent.e2e.test.ts67testspnpm buildpassedpnpm checkis currently red on unrelated existing Feishu type errors inextensions/feishu/src/media.tsLinked Issues