fix(auth): improve multi-account round-robin rotation and 429 handling #342
Closed
mukhtharcm wants to merge 1 commit into openclaw:main from
This commit fixes several issues with multi-account OAuth rotation that
were causing slow responses and inefficient account cycling.
## Changes
### 1. Fix usageStats race condition (auth-profiles.ts)
The `markAuthProfileUsed`, `markAuthProfileCooldown`, `markAuthProfileGood`,
and `clearAuthProfileCooldown` functions were using a stale in-memory store
passed as a parameter. Long-running sessions would overwrite usageStats
updates from concurrent sessions when saving.
**Fix:** Re-read the store from disk before each update to get fresh
usageStats from other sessions, then merge the update.
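The read-then-merge approach described above can be sketched roughly as follows. This is a minimal illustration, not the code in auth-profiles.ts: the store shape, the `readStore` and `updateUsageStats` names, and the JSON file layout are all assumptions made for the example.

```javascript
const fs = require("node:fs");

// Hypothetical store shape (an assumption for this sketch):
// { usageStats: { [profileId]: { lastUsed, cooldownUntil } } }
function readStore(storePath) {
  try {
    return JSON.parse(fs.readFileSync(storePath, "utf8"));
  } catch (err) {
    // A missing file is treated as an empty store.
    if (err.code === "ENOENT") return { usageStats: {} };
    throw err;
  }
}

// Re-read the store from disk, merge only the fields being changed for one
// profile, and write the result back. Entries written by other sessions since
// this process started are preserved instead of being overwritten.
function updateUsageStats(storePath, profileId, patch) {
  const fresh = readStore(storePath);
  const usageStats = { ...(fresh.usageStats ?? {}) };
  usageStats[profileId] = { ...usageStats[profileId], ...patch };
  const next = { ...fresh, usageStats };
  fs.writeFileSync(storePath, JSON.stringify(next, null, 2));
  return next;
}
```

Note that a plain read-modify-write like this only narrows the window in which two sessions can clobber each other; it does not close it. Closing it fully would need file locking or an atomic rename, which this sketch does not attempt.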
### 2. Capture AbortError from waitForCompactionRetry (pi-embedded-runner.ts)
When a request timed out, `session.abort()` was called, which throws an
`AbortError`. The code was structured like this:
```javascript
try {
  await session.prompt(params.prompt);
} catch (err) {
  promptError = err; // Catches AbortError here
}
await waitForCompactionRetry(); // But THIS also throws AbortError!
```
The second `AbortError` from `waitForCompactionRetry()` escaped and
bypassed the rotation/fallback logic entirely.
**Fix:** Wrap `waitForCompactionRetry()` in its own try/catch to capture
the error as `promptError`, enabling proper timeout handling.
Root cause analysis and fix proposed by @erikpr1994 in openclaw#313.
Fixes openclaw#313
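For illustration, the corrected control flow might look like the sketch below. The `runPrompt` wrapper and its parameters are hypothetical stand-ins rather than the actual code in pi-embedded-runner.ts, and keeping the first captured error when both calls throw is an assumption of this example.

```javascript
// Hypothetical wrapper: `session` and `waitForCompactionRetry` are passed in
// so the control flow can be shown in isolation.
async function runPrompt(session, prompt, waitForCompactionRetry) {
  let promptError = null;

  try {
    await session.prompt(prompt);
  } catch (err) {
    promptError = err; // First AbortError is captured here.
  }

  try {
    await waitForCompactionRetry();
  } catch (err) {
    // Second AbortError is captured too instead of escaping.
    // Assumption: if both calls threw, the earlier error is kept.
    promptError = promptError ?? err;
  }

  // The caller can now inspect promptError and run rotation/fallback logic.
  return { promptError };
}
```

Because neither error escapes, whatever code follows can treat a timeout like any other prompt failure and decide whether to rotate accounts.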
### 3. Fail fast on 429 rate limits (pi-ai patch)
The pi-ai library was retrying 429 errors up to 3 times with exponential
backoff before throwing. This meant a rate-limited account would waste
30+ seconds retrying before our rotation code could try the next account.
**Fix:** Patch google-gemini-cli.js to:
- Throw immediately on first 429 (no retries)
- Not catch and retry 429 errors in the network error handler
This allows the caller to rotate to the next account instantly on rate limit.
Note: We submitted this fix upstream (badlogic/pi-mono#504)
but it was closed without merging. Keeping as a local patch for now.
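The fail-fast behavior described in this section can be pictured with a generic retry helper. This is a sketch only and does not reproduce the pi-ai source or the patch to google-gemini-cli.js: `HttpError`, `requestWithRetry`, and the `maxRetries` and `baseDelayMs` options are names invented for the example.

```javascript
// Hypothetical error type carrying an HTTP status code.
class HttpError extends Error {
  constructor(status, message) {
    super(message);
    this.name = "HttpError";
    this.status = status;
  }
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retries transient failures with exponential backoff, but rethrows a 429
// on the first occurrence so the caller can switch accounts right away.
async function requestWithRetry(send, { maxRetries = 3, baseDelayMs = 1000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await send();
    } catch (err) {
      if (err.status === 429) throw err; // Rate limited: no retry, no backoff.
      if (attempt >= maxRetries) throw err; // Retry budget exhausted.
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}
```

The design choice is that a rate limit is tied to one account, so waiting on that account is wasted time when another account is available; other errors are still retried as before.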
## Testing
With 6 Antigravity accounts configured:
- Accounts rotate properly based on lastUsed (round-robin)
- 429s trigger immediate rotation to next account
- usageStats persist correctly across concurrent sessions
- Cooldown tracking works as expected
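As a rough picture of what "rotate based on lastUsed" with cooldown tracking means, the selection step could look like the sketch below. The `pickNextProfile` name and the `lastUsed` / `cooldownUntil` field layout are assumptions for this example, not the project's actual data model.

```javascript
// Picks the least recently used profile that is not in cooldown.
// usageStats is assumed to map profileId -> { lastUsed, cooldownUntil },
// both as millisecond timestamps; missing values are treated as 0.
function pickNextProfile(profileIds, usageStats, now = Date.now()) {
  const available = profileIds.filter(
    (id) => (usageStats[id]?.cooldownUntil ?? 0) <= now
  );
  if (available.length === 0) return null; // Every profile is cooling down.

  // Oldest lastUsed wins; on a tie the earlier entry in profileIds is kept.
  return available.reduce((best, id) =>
    (usageStats[id]?.lastUsed ?? 0) < (usageStats[best]?.lastUsed ?? 0) ? id : best
  );
}
```

Selecting by oldest `lastUsed` gives round-robin behavior as long as each use updates the timestamp, and it keeps working when several sessions share one store.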
## Before/After
- **Before:** Multiple 429 retries on same account, 30-90s delays
- **After:** Instant rotation on 429, responses in seconds
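The "after" behavior amounts to a caller-side loop along these lines. It is a simplified sketch under stated assumptions: `sendWithRotation`, the `onRateLimited` callback, and errors exposing a numeric `status` field are all hypothetical and not taken from the PR's code.

```javascript
// Tries each account in order. A 429 moves straight to the next account;
// any other error is rethrown unchanged.
async function sendWithRotation(accounts, send, onRateLimited = () => {}) {
  let lastError = new Error("no accounts configured");
  for (const account of accounts) {
    try {
      return { account, result: await send(account) };
    } catch (err) {
      if (err.status !== 429) throw err;
      onRateLimited(account); // e.g. record a cooldown for this account
      lastError = err;
    }
  }
  throw lastError; // Every account was rate limited.
}
```

With six accounts, a request only fails with a 429 after all six have been tried once, and no time is spent in backoff on any single account.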
dgarson added a commit to dgarson/clawdbot that referenced this pull request on Feb 9, 2026
dgarson added a commit to dgarson/clawdbot that referenced this pull request on Feb 9, 2026