fix(ai): fail fast on long quota delays in google-gemini-cli #504

Closed
mukhtharcm wants to merge 1 commit into badlogic:main from mukhtharcm:fix/gemini-cli-long-quota-delay

@mukhtharcm

Problem

When the Antigravity API returns a rate limit error with a long retry delay (e.g., "Your quota will reset after 12m10s"), the code calls sleep() for the full duration inside the retry loop.

This caused issues for callers with shorter timeouts (like Clawdbot's 90-second timeout):

  1. Request hits quota-exhausted account
  2. API returns "wait 12 minutes"
  3. extractRetryDelay() returns 731,000ms
  4. sleep(731000) starts waiting
  5. Caller's timeout fires after 90s, marks as "possible rate limit"
  6. Rotates to next account... which might also be exhausted
  7. Repeat until all accounts tried or timeout cascade
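For illustration, step 3 can be sketched as below. This is a minimal sketch, not the actual extractRetryDelay() from pi-ai: the function name parseQuotaResetDelayMs and the regular expressions are assumptions. It returns the raw duration (730,000ms for 12m10s); the 731,000ms figure above suggests the real function adds some padding, which this sketch does not try to reproduce.

```javascript
// Hypothetical helper (not the real extractRetryDelay): converts the duration
// in "Your quota will reset after 12m10s" to milliseconds.
function parseQuotaResetDelayMs(message) {
  // Capture a run of <number><unit> pairs following "reset after".
  const match = /reset after ((?:\d+(?:\.\d+)?(?:ms|h|m|s))+)/.exec(message);
  if (!match) return undefined;

  const unitMs = { h: 3_600_000, m: 60_000, s: 1_000, ms: 1 };
  let total = 0;
  for (const [, value, unit] of match[1].matchAll(/(\d+(?:\.\d+)?)(ms|h|m|s)/g)) {
    total += Number(value) * unitMs[unit];
  }
  return Math.round(total);
}

console.log(parseQuotaResetDelayMs("Your quota will reset after 12m10s")); // 730000
```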

Solution

Don't wait for delays longer than 30 seconds. Instead, throw an error immediately with a clear message indicating it's a rate limit issue. This lets callers like Clawdbot quickly rotate to another account that isn't quota-exhausted.

const MAX_RETRY_WAIT_MS = 30_000;
if (delayMs > MAX_RETRY_WAIT_MS) {
  throw new Error(
    `Cloud Code Assist API rate limit (${response.status}): ` +
    `retry delay ${Math.round(delayMs / 1000)}s exceeds max wait ${MAX_RETRY_WAIT_MS / 1000}s. ${errorText}`
  );
}

Testing

Tested with a quota-exhausted Antigravity account:

  • Before: Request would hang for 90s (timeout), then rotate
  • After: Request fails in ~400ms, immediately rotates to working account

18:31:49 [pi-ai] fetch END: 429 in 419ms
18:31:49 Quota delay too long (54s > 30s), failing fast

When Antigravity API returns a rate limit error with a long retry delay
(e.g., 12+ minutes for quota exhaustion), don't wait inside the retry
loop. Instead, throw immediately so callers (like Clawdbot) can rotate
to another account.

Previously, if the API said 'wait 12 minutes', the code would sleep for
12 minutes inside the request, causing the caller's timeout to fire.
Now, any delay > 30 seconds triggers an immediate error with a clear
message indicating it's a rate limit issue.

This enables multi-account setups to gracefully handle per-account
quota limits by quickly failing over to accounts that aren't exhausted.
@badlogic
Owner

badlogic commented Jan 6, 2026

No, that is something clawdbot has to implement on its side. The sleep receives an abort signal. Use that to abort after the clawdbot specific timeout has passed. I'm afraid I will not add clawdbot specific "fixes" that can be done without modifications to pi.
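
A minimal sketch of the caller-side approach described in this comment, assuming an abort-aware sleep. The sleep and withTimeout functions here are hypothetical stand-ins: the comment only says that pi's sleep receives an abort signal, so its real signature and rejection behavior are not shown in this thread.

```javascript
// Abort-aware sleep: resolves after `ms`, or rejects as soon as `signal` aborts.
function sleep(ms, signal) {
  return new Promise((resolve, reject) => {
    if (signal?.aborted) return reject(signal.reason);
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    function onAbort() {
      clearTimeout(timer); // stop waiting immediately
      reject(signal.reason);
    }
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}

// Hypothetical caller-side wrapper: hands `run` a signal that aborts after timeoutMs.
async function withTimeout(run, timeoutMs) {
  const controller = new AbortController();
  const timer = setTimeout(
    () => controller.abort(new Error(`timed out after ${timeoutMs}ms`)),
    timeoutMs,
  );
  try {
    return await run(controller.signal);
  } finally {
    clearTimeout(timer);
  }
}
```

With this shape, a caller with a 90-second budget would pass the signal into the request, for example `withTimeout((signal) => sleep(731_000, signal), 90_000)`, and the long wait ends when the caller's timeout fires rather than after the full quota delay.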

@badlogic badlogic closed this Jan 6, 2026
mukhtharcm added a commit to mukhtharcm/clawdbot that referenced this pull request Jan 6, 2026
This commit fixes several issues with multi-account OAuth rotation that
were causing slow responses and inefficient account cycling.

## Changes

### 1. Fix usageStats race condition (auth-profiles.ts)

The `markAuthProfileUsed`, `markAuthProfileCooldown`, `markAuthProfileGood`,
and `clearAuthProfileCooldown` functions were using a stale in-memory store
passed as a parameter. Long-running sessions would overwrite usageStats
updates from concurrent sessions when saving.

**Fix:** Re-read the store from disk before each update to get fresh
usageStats from other sessions, then merge the update.
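
A minimal sketch of the re-read-then-merge pattern, under stated assumptions: the store shape, the `updateUsageStats` name, and the injected `loadStore`/`saveStore` functions are hypothetical and stand in for the real disk I/O in auth-profiles.ts.

```javascript
// Hypothetical store shape: { usageStats: { [profileId]: { lastUsed, cooldownUntil } } }.
// loadStore/saveStore stand in for reading and writing the store file.
function updateUsageStats(loadStore, saveStore, profileId, patch) {
  const fresh = loadStore(); // re-read instead of trusting a long-lived in-memory copy
  const usageStats = {
    ...fresh.usageStats,
    [profileId]: { ...fresh.usageStats?.[profileId], ...patch },
  };
  const next = { ...fresh, usageStats };
  saveStore(next);
  return next;
}

// Usage with an in-memory stand-in for the file on disk.
let disk = JSON.stringify({ usageStats: {} });
const load = () => JSON.parse(disk);
const save = (store) => { disk = JSON.stringify(store); };

updateUsageStats(load, save, "account-a", { lastUsed: 1 });      // session 1
updateUsageStats(load, save, "account-b", { cooldownUntil: 2 }); // session 2 keeps session 1's update
```

Re-reading narrows the window for lost updates but is not atomic on its own; two processes can still interleave between the read and the write unless some form of locking is added.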

### 2. Capture AbortError from waitForCompactionRetry (pi-embedded-runner.ts)

When a request timed out, `session.abort()` was called which throws an
`AbortError`. The code structure was:

```javascript
try {
  await session.prompt(params.prompt);
} catch (err) {
  promptError = err;  // Catches AbortError here
}
await waitForCompactionRetry();  // But THIS also throws AbortError!
```

The second `AbortError` from `waitForCompactionRetry()` escaped and
bypassed the rotation/fallback logic entirely.

**Fix:** Wrap `waitForCompactionRetry()` in its own try/catch to capture
the error as `promptError`, enabling proper timeout handling.
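
A minimal sketch of the fixed structure. The `session` and `waitForCompactionRetry` stand-ins below are hypothetical stubs that always reject, and keeping the first error when both calls fail is an assumption of this sketch, not something the commit confirms.

```javascript
// Stand-ins for the real session: both reject with an AbortError, as after a timeout.
function abortError() {
  const err = new Error("This operation was aborted");
  err.name = "AbortError";
  return err;
}
const session = { prompt: async () => { throw abortError(); } };
const waitForCompactionRetry = async () => { throw abortError(); };

async function runPrompt(prompt) {
  let promptError = null;
  try {
    await session.prompt(prompt);
  } catch (err) {
    promptError = err;
  }
  try {
    await waitForCompactionRetry();
  } catch (err) {
    // Assumption: keep the first error if there is one, otherwise record this one.
    promptError ??= err;
  }
  return { promptError }; // rotation/fallback logic can now inspect the error
}
```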

Root cause analysis and fix proposed by @erikpr1994 in openclaw#313.

Fixes openclaw#313

### 3. Fail fast on 429 rate limits (pi-ai patch)

The pi-ai library was retrying 429 errors up to 3 times with exponential
backoff before throwing. This meant a rate-limited account would waste
30+ seconds retrying before our rotation code could try the next account.

**Fix:** Patch google-gemini-cli.js to:
- Throw immediately on first 429 (no retries)
- Not catch and retry 429 errors in the network error handler

This allows the caller to rotate to the next account instantly on rate limit.
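
A minimal sketch of the patched behavior, not the actual google-gemini-cli.js code: `RateLimitError`, `requestWithRetry`, and the retry parameters are hypothetical. A 429 is thrown on the first attempt and is never retried; network errors and 5xx responses still use exponential backoff.

```javascript
// Hypothetical error type so the caller can recognize a rate limit and rotate accounts.
class RateLimitError extends Error {
  constructor(status, body) {
    super(`rate limit (${status}): ${body}`);
    this.name = "RateLimitError";
    this.status = status;
  }
}

// doFetch is any function returning a fetch-like response ({ status, text() }).
async function requestWithRetry(doFetch, maxRetries = 3, baseDelayMs = 1000) {
  for (let attempt = 0; ; attempt++) {
    let failure;
    try {
      const response = await doFetch();
      if (response.status === 429) {
        throw new RateLimitError(response.status, await response.text());
      }
      if (response.status < 500) return response; // success or non-retryable client error
      failure = new Error(`HTTP ${response.status}`);
    } catch (err) {
      if (err instanceof RateLimitError) throw err; // never retried
      failure = err; // network error
    }
    if (attempt >= maxRetries) throw failure;
    await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
  }
}
```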

Note: We submitted this fix upstream (badlogic/pi-mono#504)
but it was closed without merging. Keeping as a local patch for now.

## Testing

With 6 Antigravity accounts configured:
- Accounts rotate properly based on lastUsed (round-robin)
- 429s trigger immediate rotation to next account
- usageStats persist correctly across concurrent sessions
- Cooldown tracking works as expected
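
The rotation behavior checked above can be sketched as follows. This is an illustration only: the `pickNextProfile` name and the profile shape (`id`, `lastUsed`, `cooldownUntil` as millisecond timestamps) are assumptions, not the actual clawdbot data model.

```javascript
// Hypothetical selection: skip profiles in cooldown, then take the least recently used.
function pickNextProfile(profiles, now = Date.now()) {
  return profiles
    .filter((p) => (p.cooldownUntil ?? 0) <= now) // filter() copies, so the input is not reordered
    .sort((a, b) => (a.lastUsed ?? 0) - (b.lastUsed ?? 0))[0];
}

const profiles = [
  { id: "a", lastUsed: 300 },
  { id: "b", lastUsed: 100, cooldownUntil: 9_000 }, // rate limited until t=9000
  { id: "c", lastUsed: 200 },
];
console.log(pickNextProfile(profiles, 1_000)?.id); // "c"
```

When every profile is in cooldown the function returns `undefined`, which leaves the choice between waiting and failing to the caller.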

## Before/After

**Before:** Multiple 429 retries on same account, 30-90s delays
**After:** Instant rotation on 429, responses in seconds
steipete pushed a commit to openclaw/openclaw that referenced this pull request Jan 7, 2026
zooqueen pushed a commit to hanzoai/bot that referenced this pull request Mar 6, 2026
@badlogic badlogic added the possibly-openclaw-clanker User has activity on openclaw/openclaw label Mar 24, 2026