fix(auth): improve multi-account round-robin rotation and 429 handling #342

Closed
mukhtharcm wants to merge 1 commit into openclaw:main from mukhtharcm:fix/auth-profile-usage-stats-race

@mukhtharcm
Member

This PR fixes several issues with multi-account OAuth rotation that were causing slow responses and inefficient account cycling.

Changes

1. Fix usageStats race condition (auth-profiles.ts)

The markAuthProfileUsed, markAuthProfileCooldown, markAuthProfileGood, and clearAuthProfileCooldown functions were using a stale in-memory store passed as a parameter. Long-running sessions would overwrite usageStats updates from concurrent sessions when saving.

Fix: Re-read the store from disk before each update to get fresh usageStats from other sessions, then merge the update.
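
A minimal sketch of the re-read-then-merge pattern described above, not the actual auth-profiles.ts code. The store shape (`usageStats` keyed by profile ID), the JSON file format, and the helper names `readStore`, `updateUsageStats`, and `markProfileUsed` are assumptions made for illustration.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical shapes; the real store may carry more fields.
interface ProfileUsageStats {
  lastUsed?: number;
  cooldownUntil?: number;
}
interface AuthProfileStore {
  usageStats: Record<string, ProfileUsageStats>;
}

// Load the store from disk, falling back to an empty store if the file is missing or invalid.
function readStore(storePath: string): AuthProfileStore {
  try {
    const parsed = JSON.parse(fs.readFileSync(storePath, "utf8")) as Partial<AuthProfileStore>;
    return { usageStats: parsed.usageStats ?? {} };
  } catch {
    return { usageStats: {} };
  }
}

// Re-read the store right before writing, so updates made by other sessions
// since this session started are kept, then merge in only this profile's patch.
function updateUsageStats(
  storePath: string,
  profileId: string,
  patch: ProfileUsageStats,
): AuthProfileStore {
  const fresh = readStore(storePath);
  fresh.usageStats[profileId] = { ...fresh.usageStats[profileId], ...patch };
  fs.writeFileSync(storePath, JSON.stringify(fresh, null, 2));
  return fresh;
}

function markProfileUsed(storePath: string, profileId: string, now: number = Date.now()): AuthProfileStore {
  return updateUsageStats(storePath, profileId, { lastUsed: now });
}

// Usage: two "sessions" touching different profiles keep each other's stats.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "auth-profiles-"));
const demoPath = path.join(demoDir, "auth-profiles.json");
markProfileUsed(demoPath, "account-a", 1000);
markProfileUsed(demoPath, "account-b", 2000);
console.log(readStore(demoPath).usageStats);
```

Note that a read-modify-write like this narrows the race window but does not close it: two processes can still interleave between the read and the write, so some form of locking or serialization would be needed for a full fix.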

2. Capture AbortError from waitForCompactionRetry (pi-embedded-runner.ts)

When a request timed out, session.abort() was called, which throws an AbortError. The try/catch around session.prompt() captured that first error, but waitForCompactionRetry() then threw a second AbortError, which escaped and bypassed the rotation/fallback logic entirely.

Fix: Wrap waitForCompactionRetry() in its own try/catch to capture the error as promptError, enabling proper timeout handling.

Root cause analysis and fix proposed by @erikpr1994 in #313.

Fixes #313

3. Fail fast on 429 rate limits (pi-ai patch)

The pi-ai library was retrying 429 errors up to 3 times with exponential backoff before throwing. This meant a rate-limited account would waste 30+ seconds retrying before our rotation code could try the next account.

Fix: Patch google-gemini-cli.js to:

  • Throw immediately on first 429 (no retries)
  • Not catch and retry 429 errors in the network error handler

This allows the caller to rotate to the next account instantly on rate limit.
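
The fail-fast behavior can be sketched roughly as follows. This is not the pi-ai patch itself: `decide`, `isRateLimit`, `requestWithFailFast`, the `send` callback (which resolves to an HTTP status code), and the default retry count and delay are all hypothetical stand-ins.

```typescript
type RetryDecision = "return" | "fail-fast" | "retry" | "give-up";

// Pure policy: a 429 is never retried; other failures retry until the budget is spent.
function decide(status: number, attempt: number, maxRetries: number): RetryDecision {
  if (status === 429) return "fail-fast";
  if (status >= 200 && status < 300) return "return";
  return attempt < maxRetries ? "retry" : "give-up";
}

// True for any thrown value that carries status 429.
function isRateLimit(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { status?: unknown }).status === 429;
}

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

async function requestWithFailFast(
  send: () => Promise<number>,
  maxRetries = 3,
  baseDelayMs = 1000,
): Promise<number> {
  for (let attempt = 0; ; attempt++) {
    let status: number;
    try {
      status = await send();
    } catch (err) {
      // Network-level failure: back off and retry, unless it is a rate limit
      // or the retry budget is exhausted.
      if (isRateLimit(err) || attempt >= maxRetries) throw err;
      await sleep(baseDelayMs * 2 ** attempt);
      continue;
    }
    const decision = decide(status, attempt, maxRetries);
    if (decision === "return") return status;
    if (decision === "retry") {
      await sleep(baseDelayMs * 2 ** attempt);
      continue;
    }
    // "fail-fast" (429) and "give-up" both surface the status to the caller.
    throw Object.assign(new Error(`HTTP ${status}`), { status });
  }
}
```

With this shape, the rotation code can check `isRateLimit(err)` on the first failure and move to the next account without waiting out any backoff.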

Note: We submitted this fix upstream (badlogic/pi-mono#504) but it was closed without merging. Keeping as a local patch for now.

Testing

With 6 Antigravity accounts configured:

  • ✅ Accounts rotate properly based on lastUsed (round-robin)
  • ✅ 429s trigger immediate rotation to next account
  • ✅ usageStats persist correctly across concurrent sessions
  • ✅ Cooldown tracking works as expected
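
For reference, a minimal sketch of lastUsed-based round-robin ordering with cooldown skipping, under assumed names. `ProfileUsage`, `orderProfiles`, and the field names `lastUsed` and `cooldownUntil` are illustrative and may not match the real store.

```typescript
// Hypothetical per-profile stats; timestamps are epoch milliseconds.
interface ProfileUsage {
  lastUsed?: number;
  cooldownUntil?: number;
}

// Return the profiles that are not cooling down, least recently used first.
// Profiles that have never been used sort to the front.
function orderProfiles(
  profileIds: string[],
  stats: Record<string, ProfileUsage>,
  now: number,
): string[] {
  return profileIds
    .filter((id) => (stats[id]?.cooldownUntil ?? 0) <= now) // skip accounts still in cooldown
    .sort((a, b) => (stats[a]?.lastUsed ?? 0) - (stats[b]?.lastUsed ?? 0));
}

const exampleStats: Record<string, ProfileUsage> = {
  a: { lastUsed: 300 },
  b: { lastUsed: 100 },
  c: { lastUsed: 50, cooldownUntil: 5000 },
};

console.log(orderProfiles(["a", "b", "c", "d"], exampleStats, 1000)); // c is cooling down
console.log(orderProfiles(["a", "b", "c", "d"], exampleStats, 6000)); // c is eligible again
```

Because each use bumps `lastUsed`, always picking the first entry cycles through the accounts in turn, and a 429 that sets a cooldown removes that account from the candidates until the cooldown expires.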

Before/After

Before: Multiple 429 retries on same account, 30-90s delays
After: Instant rotation on 429, responses in seconds

mukhtharcm force-pushed the fix/auth-profile-usage-stats-race branch 2 times, most recently from b518e17 to 387b885 on January 6, 2026 23:11
This commit fixes several issues with multi-account OAuth rotation that
were causing slow responses and inefficient account cycling.

## Changes

### 1. Fix usageStats race condition (auth-profiles.ts)

The `markAuthProfileUsed`, `markAuthProfileCooldown`, `markAuthProfileGood`,
and `clearAuthProfileCooldown` functions were using a stale in-memory store
passed as a parameter. Long-running sessions would overwrite usageStats
updates from concurrent sessions when saving.

**Fix:** Re-read the store from disk before each update to get fresh
usageStats from other sessions, then merge the update.

### 2. Capture AbortError from waitForCompactionRetry (pi-embedded-runner.ts)

When a request timed out, `session.abort()` was called, which throws an
`AbortError`. The code structure was:

```javascript
try {
  await session.prompt(params.prompt);
} catch (err) {
  promptError = err;  // Catches AbortError here
}
await waitForCompactionRetry();  // But THIS also throws AbortError!
```

The second `AbortError` from `waitForCompactionRetry()` escaped and
bypassed the rotation/fallback logic entirely.

**Fix:** Wrap `waitForCompactionRetry()` in its own try/catch to capture
the error as `promptError`, enabling proper timeout handling.
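
A hedged sketch of the fixed structure, using stand-in callbacks rather than the real session API. `runPromptWithCompaction` and `makeAbortError` are hypothetical names, and keeping the first error when both calls fail is an assumption, not something the PR states.

```typescript
// Build an error that looks like the AbortError thrown after session.abort().
function makeAbortError(): Error {
  const err = new Error("The operation was aborted");
  err.name = "AbortError";
  return err;
}

// Run the prompt, then wait for any compaction retry. Errors from either step
// are captured instead of escaping, so the caller can apply rotation/fallback.
async function runPromptWithCompaction(
  prompt: () => Promise<void>,
  waitForCompactionRetry: () => Promise<void>,
): Promise<unknown> {
  let promptError: unknown = null;
  try {
    await prompt();
  } catch (err) {
    promptError = err;
  }
  try {
    await waitForCompactionRetry();
  } catch (err) {
    // Assumption: keep the first error if the prompt already failed.
    if (promptError === null) promptError = err;
  }
  return promptError; // null means both steps succeeded
}

// Usage: both steps reject, as they would after a timeout-triggered abort.
runPromptWithCompaction(
  async () => { throw makeAbortError(); },
  async () => { throw makeAbortError(); },
).then((err) => console.log("captured:", (err as Error).name));
```

The caller can then inspect the returned error and decide whether to rotate accounts or fall back to another model, instead of having the second rejection propagate past that logic.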

Root cause analysis and fix proposed by @erikpr1994 in openclaw#313.

Fixes openclaw#313

### 3. Fail fast on 429 rate limits (pi-ai patch)

The pi-ai library was retrying 429 errors up to 3 times with exponential
backoff before throwing. This meant a rate-limited account would waste
30+ seconds retrying before our rotation code could try the next account.

**Fix:** Patch google-gemini-cli.js to:
- Throw immediately on first 429 (no retries)
- Not catch and retry 429 errors in the network error handler

This allows the caller to rotate to the next account instantly on rate limit.

Note: We submitted this fix upstream (badlogic/pi-mono#504)
but it was closed without merging. Keeping as a local patch for now.

## Testing

With 6 Antigravity accounts configured:
- Accounts rotate properly based on lastUsed (round-robin)
- 429s trigger immediate rotation to next account
- usageStats persist correctly across concurrent sessions
- Cooldown tracking works as expected

## Before/After

**Before:** Multiple 429 retries on same account, 30-90s delays
**After:** Instant rotation on 429, responses in seconds
mukhtharcm force-pushed the fix/auth-profile-usage-stats-race branch from 387b885 to 046b5ad on January 6, 2026 23:14
@steipete
Contributor

steipete commented Jan 7, 2026

Thanks for the PR! This is already on main via eb5f758 (includes the 429 fail-fast patch + auth/profile updates + compaction retry handling). Follow-ups 96d72ff and 19c95d0 hardened the auth-profile concurrency/serialization, so this PR is now superseded. Closing with thanks!

steipete closed this Jan 7, 2026
dgarson added a commit to dgarson/clawdbot that referenced this pull request Feb 9, 2026
* infra: consolidate tool approval types and clean protocol schema

* infra: bridge tool approval routing config into forwarder

* agents: enrich tool approval decision engine with config resolution and reason codes

* test: update tool approval tests for protocol and decision engine changes

* infra: consolidate tool approval types and clean protocol schema

* infra: bridge tool approval routing config into forwarder

* agents: enrich tool approval decision engine with config resolution and reason codes

* test: update tool approval tests for protocol and decision engine changes

* chore: conflict resolution

* chore: checkout from main

* Tool approvals: preserve exec command
dgarson added a commit to dgarson/clawdbot that referenced this pull request Feb 9, 2026
* feat: tool journal/diagnostics

* feat: journal fixes

* feat(ui): add error boundary component with retry & friendly messages

- New error-boundary.ts component with renderError/renderErrorIf helpers
- Custom element <error-boundary> with auto-retry and exponential backoff
- friendlyError() maps raw errors to user-friendly messages + suggestions
- Supports severity levels (danger/warning/info), compact mode, dismiss
- Collapsible technical details section
- ARIA compliance with role=alert and aria-live
- Replaces all inline callout danger patterns across 23 view files
- Consistent error UX across agents, channels, sessions, config, etc.

* Web: reset retry timers on error changes (openclaw#273)

* Gateway: unify exec approvals with tool approval flow (openclaw#319)

* Gateway: unify exec approvals

* Gateway: guard exec approval resolves

* Feat/pr review monitor (openclaw#313)

* minor fixes

* feat: monitor AI PR review comments

* PR review monitor: add pagination config (openclaw#324)

* Codex/review branch changes and identify issues (openclaw#325)

* minor fixes

* feat: monitor AI PR review comments

* PR review monitor: add pagination config

* UI: reset auto-retry timers on error changes (openclaw#328)

* feat(ui): add error boundary component with retry & friendly messages

- New error-boundary.ts component with renderError/renderErrorIf helpers
- Custom element <error-boundary> with auto-retry and exponential backoff
- friendlyError() maps raw errors to user-friendly messages + suggestions
- Supports severity levels (danger/warning/info), compact mode, dismiss
- Collapsible technical details section
- ARIA compliance with role=alert and aria-live
- Replaces all inline callout danger patterns across 23 view files
- Consistent error UX across agents, channels, sessions, config, etc.

* Web: reset retry timers on error changes (openclaw#273)

* UI: reset auto-retry timers on error changes

* Add execution layer runtime parity gap analysis (openclaw#280)

* Add execution layer runtime parity gap analysis

Comprehensive analysis of Pi Runtime vs Claude Agent SDK feature
gaps in the unified execution layer, with 20 prioritized next steps.

https://claude.ai/code/session_017oEzmayzdirGAKmSw2ryQZ

* Meridia: wire multi-factor scoring into capture hook

* merge/minor fixes for ui/*

* Meridia: add per-capture graph fanout queue with retries

* Meridia: enforce sanitization before persistence and fanout

* Meridia: complete Tier2 vector probing and Postgres vector support

---------

Co-authored-by: Claude <[email protected]>

* Codex/review branch changes and identify issues (openclaw#325)

* minor fixes

* feat: monitor AI PR review comments

* PR review monitor: add pagination config

* Work queue: add heartbeat leases (openclaw#329)

* fix: duplicate lines on main

* Tools: clarify work_item refs and workstream (openclaw#332)

Co-authored-by: Claude Opus 4.6 <[email protected]>

* Config: clarify agents.list placement, accept agents.list in web import, and document guidance (openclaw#331)

* Config: clarify agents.list validation

* Web: tighten agents list import validation

* Sessions: align label limits (openclaw#333)

* Work queue: add work item refs support (openclaw#312)

* Tests: update migration count

* Tools: accept refs in work_item tool

* Work queue: link Codex tasks to PRs (post GitHub comments) (openclaw#337)

* Work queue: link codex tasks to PRs

* Work queue: skip branchPrefix-only PR lookup

* Claude/runtime orchestrator tools eu d uu (openclaw#327)

* feat(agents): add runtime tool-approval orchestrator with approvals.tools config

- Add approvals.tools config types + zod schema (enabled, mode, timeoutMs, policy, routing, classifier)
- Create tool-approval orchestrator module (decision engine, param redaction, gateway integration)
- Integrate orchestrator into before-tool-call wrapper path (runs after plugin hooks, before execution)
- Add ToolApprovalBlockedError with stable machine-readable error shape
- Add 90 tests covering all mode/decision/risk branches
- Backward-compatible: no behavior change when approvals.tools is missing or disabled

* feat: upgrade /approve and Discord handler to canonical tool approvals

- /approve now queries tool.approvals.get for canonical records and resolves
  via tool.approval.resolve (with requestHash); falls back to legacy
  exec.approval.resolve when no canonical record is found
- Discord handler listens for tool.approval.requested/resolved events and
  renders generic tool approval embeds for non-exec tools
- resolveApproval prefers tool.approval.resolve when requestHash is cached,
  keeping legacy exec path for backward compatibility
- Updated command description to 'tool approval requests'
- Added shouldHandleToolApproval for canonical event filtering
- Extended tests with canonical, legacy-fallback, and gateway-error scenarios

* refactor: rename .clawdbrain → .openclaw and fix repo/domain references

- Settings dir: ~/.clawdbrain → ~/.openclaw
- Repo references: openclaw/clawdbrain → dgarson/clawdbrain
- Domain: clawdbrain.bot → openclaw.ai
- CLI command: clawdbrain login → openclaw login
- 48 files changed across src/, docs/, apps/web/, ui/

* cron timeout fixes

* feat(agents): wire tool approval context from config into tool creation path

- Inject approvals.tools config into wrapToolWithBeforeToolCallHook context
- Populate channel field from messageProvider via resolveGatewayMessageChannel
- Wire callGatewayTool as the gateway call adapter for approval requests
- Approval context is only constructed when approvals.tools exists and is enabled

* fix: address review gaps in tool approval handler

- Exec dedup: store canonical request for exec tools and defer embed
  creation by 200ms so the legacy mirror gets first shot; if the mirror
  never arrives, fall back to a generic tool embed (future-proofs against
  legacy event removal)
- Extract sendToolApprovalEmbed to eliminate code duplication
- Add buildApprovalCustomId / parseApprovalData generic aliases (same
  wire format, clearer naming for non-exec tool code paths)
- Add alias identity tests

* fix: minor tool approval request fixes

* auto-reply/approval integration fix

* include exec approval doc

* fix: agent-runner-execution integration into auto-reply, executor/kernel fixes

* more work on agent runner and memory/heartbeat integration

* lots of tests resulting from unification of exec kernel; refactored

* Redact arrays in approval helper

* lancedb fixes

* more fixes/test updates

* fix: minor problem

* fix: restore proper non-throwing session label truncation

---------

Co-authored-by: Claude <[email protected]>

* Tool approval/protocol cleanup (openclaw#334)

* infra: consolidate tool approval types and clean protocol schema

* infra: bridge tool approval routing config into forwarder

* agents: enrich tool approval decision engine with config resolution and reason codes

* test: update tool approval tests for protocol and decision engine changes

* infra: consolidate tool approval types and clean protocol schema

* infra: bridge tool approval routing config into forwarder

* agents: enrich tool approval decision engine with config resolution and reason codes

* test: update tool approval tests for protocol and decision engine changes

* chore: conflict resolution

* chore: checkout from main

* Codex/map paramssummary to exec command field (openclaw#342)

* infra: consolidate tool approval types and clean protocol schema

* infra: bridge tool approval routing config into forwarder

* agents: enrich tool approval decision engine with config resolution and reason codes

* test: update tool approval tests for protocol and decision engine changes

* infra: consolidate tool approval types and clean protocol schema

* infra: bridge tool approval routing config into forwarder

* agents: enrich tool approval decision engine with config resolution and reason codes

* test: update tool approval tests for protocol and decision engine changes

* chore: conflict resolution

* chore: checkout from main

* Tool approvals: preserve exec command

* Codex/add web inbox for tool approvals (openclaw#339)

* Web: add tool approval inbox support

* Web: fallback approval resolution

* Web: fall back to agent approvals when IDs differ (openclaw#263)

* memclawd: scaffold phase 0 service foundation (openclaw#330)

* memclawd: apply oxfmt

* Memclawd: add client samples and align pipeline config

* Codex/implement work item refs system d2mkjz (openclaw#344)

* Tools: clarify work_item refs and workstream

* Tests: update migration count

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>

* Codex/review branch changes and identify issues kuj3uy (openclaw#343)

* Tests: update migration count

* Tools: accept refs in work_item tool

* Work queue: add refs reindex command

* Work queue: align refs migration and add refs-reindex CLI (openclaw#345)

* Tests: update migration count

* Work queue: move refs backfill to 004 migration

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Claude <[email protected]>
Development

Successfully merging this pull request may close these issues.

[Bug]: Model fallback not triggered when Antigravity model times out

2 participants