Description
Issue: OAuth provider cooldown creates long wait times when multiple providers fail simultaneously
Problem Description
When multiple OAuth-based providers (e.g., google-gemini-cli, google-antigravity) fail or hit rate limits in quick succession, the exponential backoff cooldown mechanism forces users to wait up to 1 hour before the failover system can successfully switch to a non-OAuth provider (e.g., API key-based providers like zai).
This creates a poor user experience where the assistant becomes essentially unresponsive for extended periods, even though functional fallback providers are available in the configuration.
Reproduction Steps
- Configure multiple OAuth-based providers as fallbacks:

      {
        "agents": {
          "defaults": {
            "model": {
              "primary": "google-gemini-cli/gemini-3-pro-preview",
              "fallbacks": [
                "google-antigravity/claude-opus-4-5-thinking",
                "google-antigravity/gemini-3-pro-high",
                "zai/glm-4.7", // API key provider, no OAuth issues
                ...
              ]
            }
          }
        }
      }

- Trigger failures in the first 2-3 OAuth providers (e.g., expired tokens, rate limits, billing issues)
- Observe the cooldown behavior:
  - 1st error: 1 minute cooldown
  - 2nd error: 5 minute cooldown
  - 3rd error: 25 minute cooldown
  - 4th error onward: 125 minute cooldown (capped at 1 hour)
- Attempt to send a message → system attempts failover but hits cooldown on all OAuth providers repeatedly
- Wait time: ~1 hour before system finally reaches the API key provider successfully
Expected Behavior
When OAuth providers are in cooldown, the failover system should:
- Skip providers that are in cooldown immediately (do not retry them until the cooldown expires)
- Prioritize non-OAuth providers when OAuth providers are unavailable
- Fall through to API key providers without delay
Alternatively, provide a configuration option to:
- Disable cooldown for specific providers (e.g., cooldown.enabled: false)
- Set shorter cooldown windows (e.g., via auth.cooldowns.maxCooldownMs)
- Prioritize certain provider types (OAuth vs API key) during failover
Actual Behavior
From /agents/auth-profiles/usage.js:
    export function calculateAuthProfileCooldownMs(errorCount) {
      const normalized = Math.max(1, errorCount);
      return Math.min(
        60 * 60 * 1000, // 1 hour max
        60 * 1000 * 5 ** Math.min(normalized - 1, 3),
      );
    }

Cooldown times:
- Error 1: 1 minute
- Error 2: 5 minutes
- Error 3: 25 minutes
- Error 4 and later: 125 minutes (capped to 60 minutes)
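As a quick check of the formula quoted above, the function can be evaluated directly for the first few error counts. This is a standalone copy of the snippet (without the export keyword so it runs as a plain script), not code taken from the project's tests. Note that the base term 60 * 1000 * 5 ** 0 is one minute for the first error.

```javascript
// Standalone copy of the formula quoted from usage.js above.
function calculateAuthProfileCooldownMs(errorCount) {
  const normalized = Math.max(1, errorCount);
  return Math.min(
    60 * 60 * 1000, // 1 hour max
    60 * 1000 * 5 ** Math.min(normalized - 1, 3),
  );
}

// Cooldown in minutes for consecutive error counts 1 through 5.
const scheduleMinutes = [1, 2, 3, 4, 5].map(
  (errorCount) => calculateAuthProfileCooldownMs(errorCount) / 60000,
);

console.log(scheduleMinutes); // [ 1, 5, 25, 60, 60 ]
```

The 1 hour cap first applies on the fourth consecutive error, where the uncapped value would be 125 minutes.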
The failover loop (/agents/model-fallback.js) attempts each provider sequentially, but if multiple OAuth providers are in cooldown, it retries them and gets blocked repeatedly, causing:
3:08:21 - FailoverError: No available auth profile for google-gemini-cli
3:08:21 - FailoverError: No available auth profile for google-antigravity
3:08:55 - FailoverError: No available auth profile for google-gemini-cli
3:08:55 - FailoverError: No available auth profile for google-antigravity
[...repeated for ~1 hour]
3:15:09 - Finally succeeds with zai/glm-4.7
Impact
- User experience: ~1 hour of unresponsiveness when OAuth providers fail
- Productivity: Users must manually change primary model to API key provider to avoid this
- Reliability: Failover mechanism exists but doesn't work effectively for this scenario
Suggested Solutions
Option 1: Skip providers in cooldown during failover
Check isProfileInCooldown(store, profileId) before attempting each candidate and skip to the next if in cooldown.
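A minimal sketch of what Option 1 could look like. Only the name isProfileInCooldown(store, profileId) comes from this issue; the store shape, the optional now parameter, the profile IDs, and the selectAvailableCandidates helper are assumptions made for illustration and may not match the real code in model-fallback.js.

```javascript
// Hypothetical store shape: profileId -> { cooldownUntil: epoch milliseconds }.
// The real auth-profile store may look different.
function isProfileInCooldown(store, profileId, now = Date.now()) {
  const until = store[profileId]?.cooldownUntil;
  return typeof until === "number" && until > now;
}

// Hypothetical helper: keep only candidates whose profile is not in cooldown,
// preserving the configured fallback order.
function selectAvailableCandidates(store, candidates, now = Date.now()) {
  return candidates.filter(
    (candidate) => !isProfileInCooldown(store, candidate.profileId, now),
  );
}

const now = 1_000_000;
const store = {
  "google-gemini-cli": { cooldownUntil: now + 60 * 60 * 1000 },
  "google-antigravity": { cooldownUntil: now + 25 * 60 * 1000 },
};
const candidates = [
  { profileId: "google-gemini-cli", model: "google-gemini-cli/gemini-3-pro-preview" },
  { profileId: "google-antigravity", model: "google-antigravity/claude-opus-4-5-thinking" },
  { profileId: "zai", model: "zai/glm-4.7" },
];

const available = selectAvailableCandidates(store, candidates, now);
console.log(available.map((candidate) => candidate.model)); // [ 'zai/glm-4.7' ]
```

With this filter in front of the failover loop, the first attempt would go straight to zai/glm-4.7 while both OAuth profiles are cooling down, and the OAuth candidates would become eligible again once their cooldownUntil time has passed.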
Option 2: Add cooldown configuration options
    {
      "auth": {
        "cooldowns": {
          "maxCooldownMs": 900000, // 15 minutes instead of 1 hour
          "skipCooldownOnFailover": true, // New option
          "providerTypePriority": ["api_key", "oauth"] // Prefer API keys during failover
        }
      }
    }

Option 3: Separate cooldown for auth vs rate_limit
- auth errors: longer cooldown (token issues need time to resolve)
- rate_limit errors: shorter cooldown (quick recovery possible)
- timeout errors: no cooldown or very short (network issues are transient)
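One way Option 3 might be expressed is a per-reason policy table feeding the same 5x backoff. The reason names auth, rate_limit, and timeout come from the list above; the table name, the function name, and every number in it are placeholders chosen for illustration, not project defaults.

```javascript
// Hypothetical per-reason settings; all values are illustrative placeholders.
const COOLDOWN_BY_REASON = {
  auth: { baseMs: 5 * 60 * 1000, maxMs: 60 * 60 * 1000 },
  rate_limit: { baseMs: 30 * 1000, maxMs: 5 * 60 * 1000 },
  timeout: { baseMs: 0, maxMs: 0 },
};

// Hypothetical replacement for a single global cooldown formula.
function calculateCooldownMs(reason, errorCount) {
  // Unknown reasons fall back to the most conservative (auth) policy.
  const policy = COOLDOWN_BY_REASON[reason] ?? COOLDOWN_BY_REASON.auth;
  const normalized = Math.max(1, errorCount);
  // Same 5x growth as the existing formula, with the exponent capped at 3.
  return Math.min(
    policy.maxMs,
    policy.baseMs * 5 ** Math.min(normalized - 1, 3),
  );
}

console.log(calculateCooldownMs("auth", 2)); // 1500000 (25 minutes)
console.log(calculateCooldownMs("rate_limit", 2)); // 150000 (2.5 minutes)
console.log(calculateCooldownMs("timeout", 3)); // 0
```

Keeping the growth factor shared and varying only the base and cap per reason means a transient timeout never blocks a provider, while repeated auth failures still back off to the full hour.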
Option 4: Early exit when API key providers are available
If zai (API key) is in fallbacks and OAuth providers are failing, skip OAuth providers entirely.
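Option 4 could be sketched as a pruning step applied to the fallback list before the failover loop runs. The provider-to-auth-type table and both helper names below are hypothetical; the only facts taken from this issue are that google-gemini-cli and google-antigravity use OAuth and that zai uses an API key.

```javascript
// Hypothetical lookup of how each provider authenticates.
const AUTH_TYPE_BY_PROVIDER = {
  "google-gemini-cli": "oauth",
  "google-antigravity": "oauth",
  zai: "api_key",
};

// The provider name is the part of "provider/model" before the first slash.
function providerOf(modelRef) {
  return modelRef.split("/")[0];
}

// When OAuth providers are failing and at least one API key candidate exists,
// keep only the API key candidates; otherwise return the list unchanged.
function pruneOAuthCandidates(modelRefs, oauthFailing) {
  const apiKeyRefs = modelRefs.filter(
    (ref) => AUTH_TYPE_BY_PROVIDER[providerOf(ref)] === "api_key",
  );
  return oauthFailing && apiKeyRefs.length > 0 ? apiKeyRefs : modelRefs;
}

const fallbacks = [
  "google-gemini-cli/gemini-3-pro-preview",
  "google-antigravity/claude-opus-4-5-thinking",
  "google-antigravity/gemini-3-pro-high",
  "zai/glm-4.7",
];

console.log(pruneOAuthCandidates(fallbacks, true)); // [ 'zai/glm-4.7' ]
console.log(pruneOAuthCandidates(fallbacks, false).length); // 4
```

Returning the original list when no API key candidate exists avoids pruning the configuration down to nothing.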
Environment
- Clawdbot version: 2026.1.24-3
- OS: macOS 26.2 (arm64)
- Node: v25.4.0
Additional Context
The exponential backoff is appropriate for preventing abuse of rate-limited APIs, but the current implementation doesn't account for:
- Multiple OAuth providers failing simultaneously (common when tokens expire in batches)
- Availability of non-OAuth alternatives (API key providers are more stable)
- User experience during failover (1+ hour wait is too long)
Workaround: Set primary model to an API key provider (e.g., zai/glm-4.7) to bypass the issue entirely.
References
- Failover logic: /dist/agents/model-fallback.js
- Cooldown calculation: /dist/agents/auth-profiles/usage.js
- Error classification: /dist/agents/failover-error.js