OAuth provider cooldown creates long wait times when multiple providers fail simultaneously #2142

@YiWang24

Problem Description

When multiple OAuth-based providers (e.g., google-gemini-cli, google-antigravity) fail or hit rate limits in quick succession, the exponential backoff cooldown mechanism forces users to wait up to 1 hour before the failover system can successfully switch to a non-OAuth provider (e.g., API key-based providers like zai).

This creates a poor user experience where the assistant becomes essentially unresponsive for extended periods, even though functional fallback providers are available in the configuration.

Reproduction Steps

  1. Configure multiple OAuth-based providers as fallbacks:

    {
      "agents": {
        "defaults": {
          "model": {
            "primary": "google-gemini-cli/gemini-3-pro-preview",
            "fallbacks": [
              "google-antigravity/claude-opus-4-5-thinking",
              "google-antigravity/gemini-3-pro-high",
              "zai/glm-4.7",  // API key provider, no OAuth issues
              ...
            ]
          }
        }
      }
    }
  2. Trigger failures in the first 2-3 OAuth providers (e.g., expired tokens, rate limits, billing issues)

  3. Observe the cooldown behavior:

    • 1st error: 1 minute cooldown
    • 2nd error: 5 minute cooldown
    • 3rd error: 25 minute cooldown
    • 4th and later errors: 125 minutes by the formula, capped at 1 hour
  4. Attempt to send a message → system attempts failover but hits cooldown on all OAuth providers repeatedly

  5. Wait time: ~1 hour before system finally reaches the API key provider successfully

Expected Behavior

When OAuth providers are in cooldown, the failover system should:

  • Immediately skip providers that are in cooldown (do not retry them until the cooldown expires)
  • Prioritize non-OAuth providers when OAuth providers are unavailable
  • Fall through to API key providers without delay

Alternatively, provide a configuration option to:

  • Disable cooldown for specific providers (e.g., cooldown.enabled: false)
  • Set shorter cooldown windows (e.g., via auth.cooldowns.maxCooldownMs)
  • Prioritize certain provider types (OAuth vs API key) during failover

Actual Behavior

From /agents/auth-profiles/usage.js:

export function calculateAuthProfileCooldownMs(errorCount) {
    const normalized = Math.max(1, errorCount);
    return Math.min(60 * 60 * 1000, // 1 hour max
                    60 * 1000 * 5 ** Math.min(normalized - 1, 3));
}

Cooldown times:

  • Error 1: 1 minute
  • Error 2: 5 minutes
  • Error 3: 25 minutes
  • Error 4 and later: 125 minutes (capped to 60 minutes)
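The schedule can be checked by evaluating the quoted function for successive error counts. The snippet below copies the formula so that it runs on its own; only the `scheduleMinutes` helper variable is new. Because the exponent is `errorCount - 1`, the first error yields the base value of one minute, and the 1 hour cap takes effect from the fourth error onward.

```javascript
// Same formula as the function quoted above, reproduced so the snippet is self-contained.
function calculateAuthProfileCooldownMs(errorCount) {
  const normalized = Math.max(1, errorCount);
  return Math.min(
    60 * 60 * 1000, // 1 hour max
    60 * 1000 * 5 ** Math.min(normalized - 1, 3)
  );
}

// Cooldown in minutes for the first five consecutive errors.
const scheduleMinutes = [1, 2, 3, 4, 5].map(
  (errorCount) => calculateAuthProfileCooldownMs(errorCount) / 60000
);

console.log(scheduleMinutes); // 1, 5, 25, 60, 60
```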

The failover loop (/agents/model-fallback.js) attempts each provider sequentially, but if multiple OAuth providers are in cooldown, it retries them and gets blocked repeatedly, causing:

3:08:21 - FailoverError: No available auth profile for google-gemini-cli
3:08:21 - FailoverError: No available auth profile for google-antigravity
3:08:55 - FailoverError: No available auth profile for google-gemini-cli
3:08:55 - FailoverError: No available auth profile for google-antigravity
[...repeated for ~1 hour]
3:15:09 - Finally succeeds with zai/glm-4.7

Impact

  • User experience: ~1 hour of unresponsiveness when OAuth providers fail
  • Productivity: Users must manually change primary model to API key provider to avoid this
  • Reliability: Failover mechanism exists but doesn't work effectively for this scenario

Suggested Solutions

Option 1: Skip cooldowned providers in failover

Check isProfileInCooldown(store, profileId) before attempting each candidate and skip to the next if in cooldown.
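A minimal sketch of Option 1 follows. It is not the project's implementation: the store shape (`usageStats[profileId].cooldownUntil`), the candidate shape (`{ model, profileId }`), the extra `now` parameter, and the `selectAvailableCandidates` helper are all assumptions made for illustration. The `isProfileInCooldown` defined here is a stand-in with the same name, not the real function.

```javascript
// Stand-in for the real isProfileInCooldown; the store shape is hypothetical.
function isProfileInCooldown(store, profileId, now = Date.now()) {
  const until = store.usageStats?.[profileId]?.cooldownUntil;
  return typeof until === "number" && until > now;
}

// Hypothetical helper: split candidates into usable ones and ones to skip.
function selectAvailableCandidates(candidates, store, now = Date.now()) {
  const available = [];
  const skipped = [];
  for (const candidate of candidates) {
    if (isProfileInCooldown(store, candidate.profileId, now)) {
      skipped.push(candidate); // do not attempt until the cooldown expires
    } else {
      available.push(candidate);
    }
  }
  return { available, skipped };
}

// Example: both OAuth profiles are cooling down, the API key profile is not.
const now = 1_000_000;
const store = {
  usageStats: {
    "google-gemini-cli": { cooldownUntil: now + 60_000 },
    "google-antigravity": { cooldownUntil: now + 300_000 },
  },
};
const candidates = [
  { model: "google-gemini-cli/gemini-3-pro-preview", profileId: "google-gemini-cli" },
  { model: "google-antigravity/gemini-3-pro-high", profileId: "google-antigravity" },
  { model: "zai/glm-4.7", profileId: "zai" },
];

const { available, skipped } = selectAvailableCandidates(candidates, store, now);
console.log(available.map((c) => c.model)); // only zai/glm-4.7 remains
```

With a filter like this in front of the failover loop, the first attempt would go straight to the API key provider instead of cycling through the providers that are still cooling down.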

Option 2: Add cooldown configuration options

{
  "auth": {
    "cooldowns": {
      "maxCooldownMs": 900000,  // 15 minutes (15 * 60 * 1000) instead of 1 hour
      "skipCooldownOnFailover": true,  // New option
      "providerTypePriority": ["api_key", "oauth"]  // Prefer API keys during failover
    }
  }
}
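As a rough illustration of how the proposed `maxCooldownMs` setting could feed into the calculation, the sketch below threads a config object into the existing formula. The function name `calculateCooldownWithConfig` and the default constant are hypothetical, and the option itself is only a proposal in this issue.

```javascript
// Hypothetical default, matching the current hard-coded 1 hour cap.
const DEFAULT_MAX_COOLDOWN_MS = 60 * 60 * 1000;

// Hypothetical variant of the cooldown calculation that honors a configured cap.
function calculateCooldownWithConfig(errorCount, cooldowns = {}) {
  const maxCooldownMs = cooldowns.maxCooldownMs ?? DEFAULT_MAX_COOLDOWN_MS;
  const normalized = Math.max(1, errorCount);
  const backoffMs = 60 * 1000 * 5 ** Math.min(normalized - 1, 3);
  return Math.min(maxCooldownMs, backoffMs);
}

const config = { auth: { cooldowns: { maxCooldownMs: 15 * 60 * 1000 } } };

// Third consecutive error: 25 minutes by the formula, 15 minutes with the cap.
console.log(calculateCooldownWithConfig(3) / 60000); // 25
console.log(calculateCooldownWithConfig(3, config.auth.cooldowns) / 60000); // 15
```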

Option 3: Separate cooldown for auth vs rate_limit

  • auth errors: longer cooldown (token issues need time to resolve)
  • rate_limit errors: shorter cooldown (quick recovery possible)
  • timeout errors: no cooldown or very short (network issues are transient)
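One way Option 3 could look is a per-reason policy table. Every number below is a placeholder chosen for illustration, not a recommendation, and the reason names (`auth`, `rate_limit`, `timeout`) are taken from the list above rather than from the project's error classifier.

```javascript
// Hypothetical policy table: base delay and cap per failure reason.
const COOLDOWN_POLICY = {
  auth: { baseMs: 60_000, maxMs: 60 * 60_000 },     // token issues: long backoff
  rate_limit: { baseMs: 15_000, maxMs: 5 * 60_000 }, // quick recovery possible
  timeout: { baseMs: 0, maxMs: 0 },                  // transient: no cooldown
};

// Hypothetical helper: same exponential shape as today, parameterized by reason.
function cooldownForReason(reason, errorCount) {
  // Unknown reasons fall back to the most conservative policy.
  const policy = COOLDOWN_POLICY[reason] ?? COOLDOWN_POLICY.auth;
  const normalized = Math.max(1, errorCount);
  const backoffMs = policy.baseMs * 5 ** Math.min(normalized - 1, 3);
  return Math.min(policy.maxMs, backoffMs);
}

console.log(cooldownForReason("auth", 2));       // 300000 (5 minutes)
console.log(cooldownForReason("rate_limit", 2)); // 75000 (75 seconds)
console.log(cooldownForReason("timeout", 3));    // 0
```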

Option 4: Early exit when API key providers are available

If zai (API key) is in fallbacks and OAuth providers are failing, skip OAuth providers entirely.
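Option 4 could be approximated by pruning the candidate list before the failover loop runs. The sketch assumes each candidate carries `provider` and `authType` fields and that the caller tracks a set of providers that have recently failed; none of these names come from the codebase.

```javascript
// Hypothetical helper: drop OAuth candidates when at least one of them is
// failing and an API key candidate is available to take over.
function pruneFailingOAuth(candidates, failedProviders) {
  const hasApiKeyFallback = candidates.some((c) => c.authType === "api_key");
  const oauthFailing = candidates.some(
    (c) => c.authType === "oauth" && failedProviders.has(c.provider)
  );
  if (!hasApiKeyFallback || !oauthFailing) {
    return candidates; // nothing to gain from pruning
  }
  return candidates.filter((c) => c.authType !== "oauth");
}

const fallbackChain = [
  { model: "google-gemini-cli/gemini-3-pro-preview", provider: "google-gemini-cli", authType: "oauth" },
  { model: "google-antigravity/gemini-3-pro-high", provider: "google-antigravity", authType: "oauth" },
  { model: "zai/glm-4.7", provider: "zai", authType: "api_key" },
];

const pruned = pruneFailingOAuth(fallbackChain, new Set(["google-gemini-cli"]));
console.log(pruned.map((c) => c.model)); // only zai/glm-4.7 remains
```

This is the most aggressive of the four options: a single failing OAuth provider removes every OAuth candidate, including healthy ones, so it would probably need to be opt-in.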

Environment

  • Clawdbot version: 2026.1.24-3
  • OS: macOS 26.2 (arm64)
  • Node: v25.4.0

Additional Context

The exponential backoff is appropriate for preventing abuse of rate-limited APIs, but the current implementation doesn't account for:

  1. Multiple OAuth providers failing simultaneously (common when tokens expire in batches)
  2. Availability of non-OAuth alternatives (API key providers are more stable)
  3. User experience during failover (1+ hour wait is too long)

Workaround: Set primary model to an API key provider (e.g., zai/glm-4.7) to bypass the issue entirely.

References

  • Failover logic: /dist/agents/model-fallback.js
  • Cooldown calculation: /dist/agents/auth-profiles/usage.js
  • Error classification: /dist/agents/failover-error.js

Metadata


  • Labels: bug (Something isn't working)
