OAuth provider cooldown creates long wait times when multiple providers fail simultaneously

# Issue: OAuth provider cooldown creates long wait times when multiple providers fail simultaneously

## Problem Description

When multiple OAuth-based providers (e.g., `google-gemini-cli`, `google-antigravity`) fail or hit rate limits in quick succession, the exponential backoff cooldown mechanism forces users to wait up to 1 hour before the failover system can successfully switch to a non-OAuth provider (e.g., API key-based providers like `zai`).

This creates a poor user experience where the assistant becomes essentially unresponsive for extended periods, even though functional fallback providers are available in the configuration.

## Reproduction Steps

1. Configure multiple OAuth-based providers as fallbacks:
   ```json5
   {
     "agents": {
       "defaults": {
         "model": {
           "primary": "google-gemini-cli/gemini-3-pro-preview",
           "fallbacks": [
             "google-antigravity/claude-opus-4-5-thinking",
             "google-antigravity/gemini-3-pro-high",
             "zai/glm-4.7",  // API key provider, no OAuth issues
             ...
           ]
         }
       }
     }
   }
   ```

2. Trigger failures in the first 2-3 OAuth providers (e.g., expired tokens, rate limits, billing issues)

3. Observe the cooldown behavior:
   - 1st error: 1 minute cooldown
   - 2nd error: 5 minute cooldown
   - 3rd error: 25 minute cooldown
   - 4th error onward: 125 minutes computed, capped at 1 hour

4. Attempt to send a message → system attempts failover but hits cooldown on all OAuth providers repeatedly

5. Wait time: ~1 hour before the system finally reaches the API key provider successfully

## Expected Behavior

When OAuth providers are in cooldown, the failover system should:
- **Skip providers in cooldown immediately** (do not retry them until the cooldown expires)
- **Prioritize non-OAuth providers** when OAuth providers are unavailable
- **Fall through to API key providers without delay**

Alternatively, provide a configuration option to:
- Disable cooldown for specific providers (e.g., `cooldown.enabled: false`)
- Set shorter cooldown windows (e.g., via `auth.cooldowns.maxCooldownMs`)
- Prioritize certain provider types (OAuth vs API key) during failover

## Actual Behavior

From `/agents/auth-profiles/usage.js`:
```typescript
export function calculateAuthProfileCooldownMs(errorCount) {
    const normalized = Math.max(1, errorCount);
    return Math.min(60 * 60 * 1000, // 1 hour max
                    60 * 1000 * 5 ** Math.min(normalized - 1, 3));
}
```

Cooldown times (as computed by the function above):
- Error 1: 1 minute
- Error 2: 5 minutes
- Error 3: 25 minutes
- Error 4 and later: 125 minutes (capped to 60 minutes)
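
Evaluating the quoted function for the first few error counts makes the schedule concrete. The sketch below copies the function (type annotations added so it compiles as TypeScript) and prints the cooldown in minutes; the `cooldownMinutes` name is only for illustration.

```typescript
// Copy of the function quoted above, with type annotations added.
function calculateAuthProfileCooldownMs(errorCount: number): number {
  const normalized = Math.max(1, errorCount);
  return Math.min(
    60 * 60 * 1000, // 1 hour max
    60 * 1000 * 5 ** Math.min(normalized - 1, 3)
  );
}

// Cooldown in minutes for error counts 1 through 5.
const cooldownMinutes: number[] = [1, 2, 3, 4, 5].map(
  (n) => calculateAuthProfileCooldownMs(n) / 60000
);

console.log(cooldownMinutes); // [ 1, 5, 25, 60, 60 ]
```

The base is 1 minute and each further error multiplies it by 5, so the 1 hour cap is first reached on the fourth consecutive error.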

The failover loop (`/agents/model-fallback.js`) attempts each provider sequentially. When multiple OAuth providers are in cooldown, it keeps retrying them and is blocked each time, producing:

```
3:08:21 - FailoverError: No available auth profile for google-gemini-cli
3:08:21 - FailoverError: No available auth profile for google-antigravity
3:08:55 - FailoverError: No available auth profile for google-gemini-cli
3:08:55 - FailoverError: No available auth profile for google-antigravity
[...repeated for ~1 hour]
3:15:09 - Finally succeeds with zai/glm-4.7
```

## Impact

- **User experience**: ~1 hour of unresponsiveness when OAuth providers fail
- **Productivity**: Users must manually switch the primary model to an API key provider to avoid this
- **Reliability**: Failover mechanism exists but doesn't work effectively for this scenario

## Suggested Solutions

### Option 1: Skip providers in cooldown during failover
Check `isProfileInCooldown(store, profileId)` before attempting each candidate and skip to the next if in cooldown.
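
A minimal sketch of that check, under stated assumptions: only the name `isProfileInCooldown(store, profileId)` comes from this issue. Its body, the optional `now` parameter, the `ProfileStore` and `Candidate` shapes, and `selectAvailableCandidates` are hypothetical stand-ins, and the real store layout may differ.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface ProfileStore {
  cooldownUntil: { [profileId: string]: number | undefined }; // epoch ms
}
interface Candidate {
  model: string;
  profileId: string;
}

// Stand-in for the real helper: true while the cooldown has not expired.
function isProfileInCooldown(
  store: ProfileStore,
  profileId: string,
  now: number = Date.now()
): boolean {
  const until = store.cooldownUntil[profileId];
  return until !== undefined && until > now;
}

// Drop candidates whose profile is cooling down, keeping the configured order.
function selectAvailableCandidates(
  store: ProfileStore,
  candidates: Candidate[],
  now: number = Date.now()
): Candidate[] {
  return candidates.filter((c) => !isProfileInCooldown(store, c.profileId, now));
}

const demoStore: ProfileStore = {
  cooldownUntil: { "google-gemini-cli": 2000, "google-antigravity": 2000 },
};
const fallbackChain: Candidate[] = [
  { model: "google-gemini-cli/gemini-3-pro-preview", profileId: "google-gemini-cli" },
  { model: "google-antigravity/gemini-3-pro-high", profileId: "google-antigravity" },
  { model: "zai/glm-4.7", profileId: "zai" },
];

// At t=1000 both OAuth profiles are still cooling down, so only zai remains.
const availableNow = selectAvailableCandidates(demoStore, fallbackChain, 1000);
console.log(availableNow.map((c) => c.model)); // [ 'zai/glm-4.7' ]
```

Filtering before the attempt loop means a request reaches the first usable provider on the first pass instead of failing once per provider in cooldown.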

### Option 2: Add cooldown configuration options
```json5
{
  "auth": {
    "cooldowns": {
      "maxCooldownMs": 900000,  // 15 minutes instead of 1 hour
      "skipCooldownOnFailover": true,  // New option
      "providerTypePriority": ["api_key", "oauth"]  // Prefer API keys during failover
    }
  }
}
```
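
As a rough illustration of how a configurable cap could feed the cooldown calculation: `maxCooldownMs` is the option proposed in this issue, while `CooldownConfig`, `DEFAULT_MAX_COOLDOWN_MS`, and `cooldownWithCap` are hypothetical names that do not exist in the codebase.

```typescript
interface CooldownConfig {
  maxCooldownMs?: number; // proposed option; defaults to the current 1 hour cap
}

const DEFAULT_MAX_COOLDOWN_MS = 60 * 60 * 1000;

// Same backoff curve as the existing function, with the cap read from config.
function cooldownWithCap(errorCount: number, config: CooldownConfig = {}): number {
  const cap = config.maxCooldownMs ?? DEFAULT_MAX_COOLDOWN_MS;
  const normalized = Math.max(1, errorCount);
  return Math.min(cap, 60 * 1000 * 5 ** Math.min(normalized - 1, 3));
}

const fifteenMinutes: CooldownConfig = { maxCooldownMs: 15 * 60 * 1000 };

console.log(cooldownWithCap(4) / 60000); // 60
console.log(cooldownWithCap(4, fifteenMinutes) / 60000); // 15
console.log(cooldownWithCap(2, fifteenMinutes) / 60000); // 5
```

With a 15 minute cap, the first two errors behave as before and only the longer cooldowns are shortened.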

### Option 3: Separate cooldown for auth vs rate_limit
- `auth` errors: longer cooldown (token issues need time to resolve)
- `rate_limit` errors: shorter cooldown (quick recovery possible)
- `timeout` errors: no cooldown or very short (network issues are transient)
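
One way this split could look, as a sketch only: the reason names follow the list above, but the `COOLDOWN_POLICY` table, its base and cap values, and `cooldownForReason` are illustrative assumptions rather than values taken from the codebase.

```typescript
type FailureReason = "auth" | "rate_limit" | "timeout";

// Illustrative base and cap values per failure reason.
const COOLDOWN_POLICY: Record<FailureReason, { baseMs: number; maxMs: number }> = {
  auth: { baseMs: 60 * 1000, maxMs: 60 * 60 * 1000 }, // current behavior
  rate_limit: { baseMs: 15 * 1000, maxMs: 5 * 60 * 1000 }, // quick recovery
  timeout: { baseMs: 0, maxMs: 0 }, // no cooldown
};

// Exponential backoff (factor 5) with a per-reason base and cap.
function cooldownForReason(reason: FailureReason, errorCount: number): number {
  const { baseMs, maxMs } = COOLDOWN_POLICY[reason];
  const exponent = Math.min(Math.max(1, errorCount) - 1, 3);
  return Math.min(maxMs, baseMs * 5 ** exponent);
}

console.log(cooldownForReason("auth", 3) / 60000); // 25
console.log(cooldownForReason("rate_limit", 3) / 60000); // 5
console.log(cooldownForReason("timeout", 3)); // 0
```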

### Option 4: Early exit when API key providers are available
If `zai` (API key) is in fallbacks and OAuth providers are failing, skip OAuth providers entirely.
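
A possible shape for this early exit, assuming each fallback entry is tagged with its auth kind and a failing flag. `AuthKind`, `FallbackEntry`, and `pruneFailingOAuth` are hypothetical names introduced for this sketch.

```typescript
type AuthKind = "oauth" | "api_key";

interface FallbackEntry {
  model: string;
  authKind: AuthKind;
  failing: boolean; // e.g. currently in cooldown
}

// If a healthy API key provider exists, drop failing OAuth entries up front.
function pruneFailingOAuth(entries: FallbackEntry[]): FallbackEntry[] {
  const hasHealthyApiKey = entries.some(
    (e) => e.authKind === "api_key" && !e.failing
  );
  if (!hasHealthyApiKey) return entries; // nothing better to fall through to
  return entries.filter((e) => !(e.authKind === "oauth" && e.failing));
}

const configuredChain: FallbackEntry[] = [
  { model: "google-gemini-cli/gemini-3-pro-preview", authKind: "oauth", failing: true },
  { model: "google-antigravity/gemini-3-pro-high", authKind: "oauth", failing: true },
  { model: "zai/glm-4.7", authKind: "api_key", failing: false },
];

console.log(pruneFailingOAuth(configuredChain).map((e) => e.model)); // [ 'zai/glm-4.7' ]
```

When no healthy API key provider is configured, the chain is returned unchanged so the existing failover behavior still applies.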

## Environment

- Clawdbot version: 2026.1.24-3
- OS: macOS 26.2 (arm64)
- Node: v25.4.0

## Additional Context

The exponential backoff is appropriate for preventing abuse of rate-limited APIs, but the current implementation doesn't account for:
1. Multiple OAuth providers failing simultaneously (common when tokens expire in batches)
2. Availability of non-OAuth alternatives (API key providers are more stable)
3. User experience during failover (1+ hour wait is too long)

Workaround: Set primary model to an API key provider (e.g., `zai/glm-4.7`) to bypass the issue entirely.

## References

- Failover logic: `/dist/agents/model-fallback.js`
- Cooldown calculation: `/dist/agents/auth-profiles/usage.js`
- Error classification: `/dist/agents/failover-error.js`


OAuth provider cooldown creates long wait times when multiple providers fail simultaneously #2142
