fix(auth): improve multi-account round-robin rotation and 429 handling by mukhtharcm · Pull Request #342 · openclaw/openclaw

mukhtharcm · 2026-01-06T22:51:00Z

This PR fixes several issues with multi-account OAuth rotation that were causing slow responses and inefficient account cycling.

Changes

1. Fix usageStats race condition (auth-profiles.ts)

The markAuthProfileUsed, markAuthProfileCooldown, markAuthProfileGood, and clearAuthProfileCooldown functions were using a stale in-memory store passed as a parameter. Long-running sessions would overwrite usageStats updates from concurrent sessions when saving.

Fix: Re-read the store from disk before each update to get fresh usageStats from other sessions, then merge the update.
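The re-read-then-merge step can be sketched roughly as follows. This is a minimal illustration, not the code in auth-profiles.ts: the `io` object, `updateUsageStats`, `markProfileUsed`, and the store shape are all assumptions, and an in-memory stand-in replaces the on-disk store so the sketch runs on its own.

```javascript
// Sketch only: `io`, `updateUsageStats`, and `markProfileUsed` are hypothetical
// names, and the store shape { usageStats: { [profileId]: {...} } } is assumed.

// Re-read the persisted store, merge one profile's stats into it, and save.
function updateUsageStats(io, profileId, patch) {
  const fresh = io.loadStore(); // picks up writes made by other sessions
  const current = (fresh.usageStats || {})[profileId] || {};
  fresh.usageStats = {
    ...(fresh.usageStats || {}),
    [profileId]: { ...current, ...patch },
  };
  io.saveStore(fresh);
  return fresh;
}

function markProfileUsed(io, profileId, now = Date.now()) {
  return updateUsageStats(io, profileId, { lastUsed: now });
}

// In-memory stand-in for the on-disk store, so the sketch is self-contained.
function createMemoryIo(initial) {
  let disk = JSON.stringify(initial);
  return {
    loadStore: () => JSON.parse(disk),
    saveStore: (store) => {
      disk = JSON.stringify(store);
    },
  };
}

// Two "sessions" update different profiles against the same backing store.
const io = createMemoryIo({ usageStats: {} });
markProfileUsed(io, "account-a", 1000); // session A
markProfileUsed(io, "account-b", 2000); // session B
console.log(io.loadStore().usageStats); // both entries are still present
```

A plain read-modify-write like this narrows the window for lost updates but does not close it; two processes can still interleave between the read and the write unless the store is also locked or serialized.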

2. Capture AbortError from waitForCompactionRetry (pi-embedded-runner.ts)

When a request timed out, session.abort() was called, which throws an AbortError. The catch around session.prompt() handled the first AbortError, but a second one thrown by waitForCompactionRetry() escaped and bypassed the rotation/fallback logic entirely.

Fix: Wrap waitForCompactionRetry() in its own try/catch to capture the error as promptError, enabling proper timeout handling.
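A minimal sketch of the wrapped structure, under stated assumptions: `runPromptWithCapture` and its parameters are hypothetical, and keeping the first captured error (rather than overwriting it with the second) is a guess about the intended behavior, not something the PR confirms.

```javascript
// Sketch only: `runPromptWithCapture` and its parameters are hypothetical.
async function runPromptWithCapture(session, prompt, waitForCompactionRetry) {
  let promptError = null;
  try {
    await session.prompt(prompt);
  } catch (err) {
    promptError = err; // first AbortError lands here
  }
  try {
    await waitForCompactionRetry();
  } catch (err) {
    // Assumption: keep the first error if one was already captured.
    if (promptError === null) promptError = err;
  }
  return promptError; // the caller decides whether to rotate or fall back
}

// Fake session whose prompt and compaction wait both reject, as on a timeout.
function makeAbortError() {
  const err = new Error("aborted");
  err.name = "AbortError";
  return err;
}
const fakeSession = {
  prompt: async () => {
    throw makeAbortError();
  },
};
const fakeWait = async () => {
  throw makeAbortError();
};

runPromptWithCapture(fakeSession, "hello", fakeWait).then((err) => {
  console.log(err.name); // prints "AbortError"
});
```

Because neither rejection escapes the function, the code after it still runs and can treat the captured error as a timeout.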

Root cause analysis and fix proposed by @erikpr1994 in #313.

Fixes #313

3. Fail fast on 429 rate limits (pi-ai patch)

The pi-ai library was retrying 429 errors up to 3 times with exponential backoff before throwing. This meant a rate-limited account would waste 30+ seconds retrying before our rotation code could try the next account.

Fix: Patch google-gemini-cli.js to:

- Throw immediately on first 429 (no retries)
- Not catch and retry 429 errors in the network error handler

This allows the caller to rotate to the next account instantly on rate limit.

Note: We submitted this fix upstream (badlogic/pi-mono#504) but it was closed without merging. Keeping as a local patch for now.

Testing

With 6 Antigravity accounts configured:

✅ Accounts rotate properly based on lastUsed (round-robin)
✅ 429s trigger immediate rotation to next account
✅ usageStats persist correctly across concurrent sessions
✅ Cooldown tracking works as expected
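For reference, selection by lastUsed with cooldowns could be sketched as below. This is an illustration of the behavior listed above, not the project's selection code: `pickNextProfile` is a hypothetical name, and the `cooldownUntil` field is an assumption (only `lastUsed` is named in this PR).

```javascript
// Sketch only: `pickNextProfile` and the `cooldownUntil` field are assumed.
function pickNextProfile(profileIds, usageStats, now = Date.now()) {
  // Skip profiles whose cooldown has not expired yet.
  const available = profileIds.filter((id) => {
    const stats = usageStats[id] || {};
    return !(stats.cooldownUntil > now);
  });
  if (available.length === 0) return null;

  // Least recently used wins; never-used profiles count as lastUsed = 0.
  return available.reduce((best, id) => {
    const bestUsed = (usageStats[best] || {}).lastUsed || 0;
    const idUsed = (usageStats[id] || {}).lastUsed || 0;
    return idUsed < bestUsed ? id : best;
  });
}

const stats = {
  "account-a": { lastUsed: 300 },
  "account-b": { lastUsed: 100, cooldownUntil: 1000 },
  "account-c": { lastUsed: 200 },
};
const ids = ["account-a", "account-b", "account-c"];

console.log(pickNextProfile(ids, stats, 500)); // account-b is cooling down
console.log(pickNextProfile(ids, stats, 1500)); // cooldown has expired
```

Always picking the least recently used available profile and then updating its lastUsed yields round-robin order once every profile has been used at least once.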

Before/After

Before: Multiple 429 retries on same account, 30-90s delays
After: Instant rotation on 429, responses in seconds

@erikpr1994

This commit fixes several issues with multi-account OAuth rotation that were causing slow responses and inefficient account cycling.

## Changes

### 1. Fix usageStats race condition (auth-profiles.ts)

The `markAuthProfileUsed`, `markAuthProfileCooldown`, `markAuthProfileGood`, and `clearAuthProfileCooldown` functions were using a stale in-memory store passed as a parameter. Long-running sessions would overwrite usageStats updates from concurrent sessions when saving.

**Fix:** Re-read the store from disk before each update to get fresh usageStats from other sessions, then merge the update.

### 2. Capture AbortError from waitForCompactionRetry (pi-embedded-runner.ts)

When a request timed out, `session.abort()` was called which throws an `AbortError`. The code structure was:

```javascript
try {
  await session.prompt(params.prompt);
} catch (err) {
  promptError = err; // Catches AbortError here
}
await waitForCompactionRetry(); // But THIS also throws AbortError!
```

The second `AbortError` from `waitForCompactionRetry()` escaped and bypassed the rotation/fallback logic entirely.

**Fix:** Wrap `waitForCompactionRetry()` in its own try/catch to capture the error as `promptError`, enabling proper timeout handling.

Root cause analysis and fix proposed by @erikpr1994 in openclaw#313.

Fixes openclaw#313

### 3. Fail fast on 429 rate limits (pi-ai patch)

The pi-ai library was retrying 429 errors up to 3 times with exponential backoff before throwing. This meant a rate-limited account would waste 30+ seconds retrying before our rotation code could try the next account.

**Fix:** Patch google-gemini-cli.js to:

- Throw immediately on first 429 (no retries)
- Not catch and retry 429 errors in the network error handler

This allows the caller to rotate to the next account instantly on rate limit.

Note: We submitted this fix upstream (badlogic/pi-mono#504) but it was closed without merging. Keeping as a local patch for now.

## Testing

With 6 Antigravity accounts configured:

- Accounts rotate properly based on lastUsed (round-robin)
- 429s trigger immediate rotation to next account
- usageStats persist correctly across concurrent sessions
- Cooldown tracking works as expected

## Before/After

**Before:** Multiple 429 retries on same account, 30-90s delays
**After:** Instant rotation on 429, responses in seconds

steipete · 2026-01-07T00:33:08Z

Thanks for the PR! This is already on main via eb5f758 (includes the 429 fail-fast patch + auth/profile updates + compaction retry handling). Follow-ups 96d72ff and 19c95d0 hardened the auth-profile concurrency/serialization, so this PR is now superseded. Closing with thanks!


mukhtharcm force-pushed the fix/auth-profile-usage-stats-race branch 2 times, most recently from b518e17 to 387b885 (January 6, 2026 23:11)

mukhtharcm force-pushed the fix/auth-profile-usage-stats-race branch from 387b885 to 046b5ad (January 6, 2026 23:14)

steipete closed this Jan 7, 2026

mukhtharcm wants to merge 1 commit into openclaw:main from mukhtharcm:fix/auth-profile-usage-stats-race
