# [Bug] ZAI API provider enters infinite cooldown loop due to rate limit misclassification #40989

## Environment
- **OpenClaw Version**: v2026.3.7
- **Node Version**: v22.22.1
- **OS**: Linux 6.17.0-14-generic (x64)
- **Installation**: Local system (VPS)
- **Auth Method**: API Key (not OAuth)

## Bug Description

The ZAI API provider enters an infinite cooldown loop when encountering rate limit errors, despite having multiple fallback models configured. The cooldown mechanism incorrectly persists across model switches and gateway restarts.

### Expected Behavior
1. Rate limit errors should trigger model fallback to `zai/glm-4.5-flash` or `zai/glm-4.5`
2. Cooldown should be per-profile, not per-provider
3. A gateway restart should clear cooldown state
4. Exponential backoff should prevent repeated hammering of the API

### Actual Behavior
1. **Rate limit errors cause provider-wide cooldown**: All ZAI models (glm-4.7, glm-4.5-flash, glm-4.5) become unavailable simultaneously
2. **Infinite cooldown loop**: Cooldown persists across gateway restarts (verified by PID changes in logs)
3. **No model fallback occurs**: System continues retrying the same rate-limited model instead of switching to fallbacks
4. **Misclassified errors**: Rate limit errors are treated as cooldown errors, preventing proper recovery

## Step-by-step Reproduction

1. Start with default model: `zai/glm-4.7`
2. Make several API calls in quick succession
3. Observe rate limit error: `⚠️ API rate limit reached. Please try again later.`
4. Check model status: All ZAI models show "cooldown" state
5. Restart gateway: Cooldown persists
6. Try to use any ZAI model: All fail with cooldown errors

## Log Evidence

### Key Error Patterns from journalctl:

```
21:02:53 [diagnostic] FailoverError: ⚠️ API rate limit reached. Please try again later.
21:02:53 [diagnostic] FailoverError: No available auth profile for zai (all in cooldown or unavailable).
21:02:53 [agent] Embedded agent failed: All models failed (3): 
  zai/glm-4.7: ⚠️ API rate limit reached. Please try again later. (rate_limit)
  zai/glm-4.5-flash: No available auth profile for zai (all in cooldown or unavailable). (rate_limit)
  zai/glm-4.5: No available auth profile for zai (all in cooldown or unavailable). (rate_limit)
```

### Auth Profile State:
```json
"zai:default": {
  "type": "api_key",
  "provider": "zai",
  "key": "<redacted>"
},
"usageStats": {
  "zai:default": {
    "errorCount": 3,
    "lastFailureAt": 1773059977264,
    "failureCounts": {
      "rate_limit": 3
    },
    "cooldownUntil": 1773061477264  // 25 minutes after lastFailureAt
  }
}
```
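
The length of the recorded cooldown can be checked from the two timestamps in the profile state. A minimal sketch using those values (the variable names are only illustrative); the difference works out to 25 minutes:

```javascript
// Values copied from the usageStats entry for zai:default.
const lastFailureAt = 1773059977264; // ms since Unix epoch
const cooldownUntil = 1773061477264; // ms since Unix epoch

// Cooldown length in minutes: 1,500,000 ms / 60,000 ms per minute.
const cooldownMinutes = (cooldownUntil - lastFailureAt) / 60000;
console.log(cooldownMinutes); // 25
```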

### Timeline of Events:
- **21:02:53**: First rate limit error triggers cooldown
- **21:03:03**: Gateway restart (PID 373433 → 415945)
- **21:03:08**: Gateway back up, but ZAI still in cooldown
- **21:04:57**: Retry attempts continue to fail
- **21:10:38**: Same pattern repeats
- **21:18:53**: Pattern continues
- **21:22:23**: Pattern continues
- **21:32:23**: Timeout errors begin appearing
- **21:39:37**: Still experiencing cooldown issues

## Root Cause Analysis

### 1. Provider vs Profile Cooldown Issue
The system implements provider-level cooldown instead of profile-level cooldown. When `zai:default` hits rate limits, all ZAI models are marked as unavailable.

### 2. Rate Limit Misclassification
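A rate limit error (HTTP 429) on the primary model puts the shared auth profile into cooldown. The fallback models then fail with a cooldown error ("No available auth profile") that is itself reported with reason `rate_limit`, so no real fallback takes place. The error message shows: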
```
zai/glm-4.5-flash: No available auth profile for zai (all in cooldown or unavailable). (rate_limit)
```

### 3. Lack of Exponential Backoff
The system retries failed requests immediately, creating a hammer effect that exacerbates rate limiting.
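
For reference, exponential backoff with jitter could look roughly like the sketch below. This is not OpenClaw code: `backoffCeilingMs`, `backoffDelayMs`, and the 1 s base / 60 s cap are assumptions chosen for illustration.

```javascript
// Hypothetical helper: upper bound on the wait before retry number `attempt` (0-based).
// baseMs and maxMs are illustrative defaults, not OpenClaw settings.
function backoffCeilingMs(attempt, baseMs = 1000, maxMs = 60000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// "Full jitter": pick a uniform delay in [0, ceiling) so that clients
// hitting the same limit do not all retry at the same instant.
function backoffDelayMs(attempt, random = Math.random) {
  return Math.floor(random() * backoffCeilingMs(attempt));
}

// Ceilings for the first eight attempts: doubling, then capped at 60 s.
console.log([0, 1, 2, 3, 4, 5, 6, 7].map((n) => backoffCeilingMs(n)));
// [1000, 2000, 4000, 8000, 16000, 32000, 60000, 60000]
```

If the provider returns a `Retry-After` header on 429 responses, honoring it would likely be preferable to a computed delay.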

### 4. Persistent Cooldown State
Cooldown state persists across gateway restarts, which indicates it is stored in `auth-profiles.json` and not cleared on startup.

## Impact

- **User Experience**: Complete service interruption for 37+ minutes (21:02 to 21:39+ in the timeline above)
- **Productivity**: Unable to complete tasks requiring AI assistance
- **API Costs**: Repeated error responses may still incur costs
- **System Stability**: Gateway restarts do not restore service, so there is no quick recovery path

## Suggested Fixes

### 1. Implement Per-Profile Cooldown
```javascript
// Current: Provider cooldown
cooldownUntil: 1773061477264  // All ZAI models affected

// Should be: Profile cooldown  
cooldownUntil: 1773061477264  // Only zai:default affected
```
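
One way to picture the difference is a cooldown store whose key is narrower than the provider name. The sketch below is hypothetical and does not reflect OpenClaw's actual data model: the `CooldownStore` class and the second profile `zai:backup` are made up for illustration.

```javascript
// Hypothetical sketch: cooldowns tracked per scope key (for example an auth
// profile ID) rather than per provider.
class CooldownStore {
  constructor() {
    this.until = new Map(); // scope key -> epoch ms when the cooldown ends
  }

  // Put a single scope into cooldown for durationMs, starting at `now`.
  set(scope, durationMs, now = Date.now()) {
    this.until.set(scope, now + durationMs);
  }

  // A scope is available when it has no cooldown or the cooldown has expired.
  isAvailable(scope, now = Date.now()) {
    const end = this.until.get(scope);
    return end === undefined || now >= end;
  }
}

const store = new CooldownStore();
const failedAt = 1773059977264; // lastFailureAt from the report
store.set("zai:default", 25 * 60 * 1000, failedAt);

console.log(store.isAvailable("zai:default", failedAt)); // false
console.log(store.isAvailable("zai:backup", failedAt));  // true
console.log(store.isAvailable("zai:default", failedAt + 25 * 60 * 1000)); // true
```

The same structure would work with a finer key such as profile plus model, if rate limits turn out to be enforced per model.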

### 2. Add CLI Commands for Recovery
```bash
openclaw provider reset zai        # Clear cooldown for specific provider
openclaw auth refresh zai          # Revalidate API key
openclaw cooldown status zai       # Check cooldown status
```

### 3. Fix Rate Limit Error Classification
- Distinguish between rate limit errors (429) and auth errors (403)
- Allow fallback models when primary model hits rate limits
- Implement proper exponential backoff for 429 errors
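
A rough sketch of how classification and fallback could fit together. Everything here is an assumption for illustration: `classifyFailure`, `runWithFallback`, the reason strings other than `rate_limit`, and the synchronous `callModel` stand-in are not taken from the OpenClaw codebase.

```javascript
// Hypothetical classifier: map an HTTP status code to a failure reason.
function classifyFailure(status) {
  if (status === 429) return "rate_limit";
  if (status === 401 || status === 403) return "auth";
  if (status >= 500) return "server_error";
  return "unknown";
}

// Hypothetical fallback loop: try each model in order and record why each failed.
// `callModel` stands in for the real request function and returns { ok, status }.
function runWithFallback(models, callModel) {
  const failures = [];
  for (const model of models) {
    const res = callModel(model);
    if (res.ok) return { model, failures };
    const reason = classifyFailure(res.status);
    failures.push({ model, reason });
    // A rate limit on one model moves on to the next model. An auth failure
    // points at the shared key, so trying sibling models would not help.
    if (reason === "auth") break;
  }
  return { model: null, failures };
}

// Simulated provider: only the primary model is rate limited.
const fakeCall = (model) =>
  model === "zai/glm-4.7" ? { ok: false, status: 429 } : { ok: true, status: 200 };

const result = runWithFallback(
  ["zai/glm-4.7", "zai/glm-4.5-flash", "zai/glm-4.5"],
  fakeCall
);
console.log(result.model);    // zai/glm-4.5-flash
console.log(result.failures); // [ { model: 'zai/glm-4.7', reason: 'rate_limit' } ]
```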

### 4. Gateway Restart Behavior
- Add `--clear-cooldowns` flag to gateway restart
- Automatically clear temporary cooldowns on restart
- Add cooldown state validation during startup
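
The startup validation could be as small as the sketch below. It assumes the persisted state has the `usageStats` shape shown in the Auth Profile State section; `pruneCooldowns` and its `clearAll` option (standing in for the proposed `--clear-cooldowns` flag) are hypothetical names.

```javascript
// Hypothetical startup step: drop cooldowns that have already expired, or
// every cooldown when clearAll is set. Failure history is left in place.
function pruneCooldowns(usageStats, { now = Date.now(), clearAll = false } = {}) {
  const cleaned = {};
  for (const [profileId, stats] of Object.entries(usageStats)) {
    const expired =
      typeof stats.cooldownUntil === "number" && stats.cooldownUntil <= now;
    if (clearAll || expired) {
      // Copy everything except the cooldown marker.
      const { cooldownUntil, ...rest } = stats;
      cleaned[profileId] = rest;
    } else {
      cleaned[profileId] = stats;
    }
  }
  return cleaned;
}

// State taken from the report.
const usageStats = {
  "zai:default": {
    errorCount: 3,
    lastFailureAt: 1773059977264,
    failureCounts: { rate_limit: 3 },
    cooldownUntil: 1773061477264,
  },
};

const duringCooldown = 1773060000000; // between lastFailureAt and cooldownUntil
const kept = pruneCooldowns(usageStats, { now: duringCooldown });
const cleared = pruneCooldowns(usageStats, { now: duringCooldown, clearAll: true });

console.log("cooldownUntil" in kept["zai:default"]);    // true
console.log("cooldownUntil" in cleared["zai:default"]); // false
```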

### 5. Monitoring and Diagnostics
- Add `openclaw doctor` command to detect cooldown loops
- Provide clear error messages distinguishing error types
- Add cooldown status to `openclaw models status`

## Additional Context

This issue affects users with:
- Multiple models from the same provider
- API key authentication (not OAuth)
- High-volume usage patterns
- Production environments requiring reliability

The problem is particularly severe for ZAI provider users as it affects all available models simultaneously, unlike other providers that may have better fallback mechanisms.

## Related Issues

- #5240: Provider cooldown reports 'rate_limit' regardless of actual failure reason
- #5159: No exponential backoff on 429 rate limit errors
- #32828: False 'API rate limit reached' on all models despite APIs being fully functional
- #13909: OAuth 403 misclassified as rate_limit → infinite cooldown loop

## Verification Steps

1. Configure multiple ZAI models as fallbacks
2. Trigger rate limit by making rapid API calls
3. Verify all models become unavailable
4. Restart gateway and confirm cooldown persists
5. Check that fallback models don't activate

## Environment Details

- **System**: Linux 6.17.0-14-generic x86_64
- **OpenClaw**: v2026.3.7 (PID changes confirm restarts occurred)
- **Models**: zai/glm-4.7 (default), zai/glm-4.5-flash, zai/glm-4.5
- **Auth**: API key authentication
- **Timeline**: Issues persisting from 21:02 to 21:39+ (37+ minutes)
