[Bug] ZAI API provider enters infinite cooldown loop due to rate limit misclassification #40989

@downwind7clawd-ctrl

Environment

  • OpenClaw Version: v2026.3.7
  • Node Version: v22.22.1
  • OS: Linux 6.17.0-14-generic (x64)
  • Installation: Local system (VPS)
  • Auth Method: API Key (not OAuth)

Bug Description

The ZAI API provider enters an infinite cooldown loop when encountering rate limit errors, despite having multiple fallback models configured. The cooldown mechanism incorrectly persists across model switches and gateway restarts.

Expected Behavior

  1. Rate limit errors should trigger model fallback to zai/glm-4.5-flash or zai/glm-4.5
  2. Cooldown should be per-profile, not per-provider
  3. Gateway restart should clear cooldown state
  4. Exponential backoff should prevent repeated hammering of API

Actual Behavior

  1. Rate limit errors cause provider-wide cooldown: All ZAI models (glm-4.7, glm-4.5-flash, glm-4.5) become unavailable simultaneously
  2. Infinite cooldown loop: Cooldown persists across gateway restarts (verified by PID changes in logs)
  3. No model fallback occurs: System continues retrying the same rate-limited model instead of switching to fallbacks
  4. Misclassified errors: Rate limit errors are treated as cooldown errors, preventing proper recovery

Step-by-step Reproduction

  1. Start with default model: zai/glm-4.7
  2. Make several API calls in quick succession
  3. Observe rate limit error: ⚠️ API rate limit reached. Please try again later.
  4. Check model status: All ZAI models show "cooldown" state
  5. Restart gateway: Cooldown persists
  6. Try to use any ZAI model: All fail with cooldown errors

Log Evidence

Key Error Patterns from journalctl:

21:02:53 [diagnostic] FailoverError: ⚠️ API rate limit reached. Please try again later.
21:02:53 [diagnostic] FailoverError: No available auth profile for zai (all in cooldown or unavailable).
21:02:53 [agent] Embedded agent failed: All models failed (3): 
  zai/glm-4.7: ⚠️ API rate limit reached. Please try again later. (rate_limit)
  zai/glm-4.5-flash: No available auth profile for zai (all in cooldown or unavailable). (rate_limit)
  zai/glm-4.5: No available auth profile for zai (all in cooldown or unavailable). (rate_limit)

Auth Profile State:

"zai:default": {
  "type": "api_key",
  "provider": "zai",
  "key": "<redacted>"
},
"usageStats": {
  "zai:default": {
    "errorCount": 3,
    "lastFailureAt": 1773059977264,
    "failureCounts": {
      "rate_limit": 3
    },
    "cooldownUntil": 1773061477264  // 25 minutes after lastFailureAt
  }
}
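
The length of the cooldown window can be checked from the two timestamps in the state above. A minimal TypeScript sketch, assuming only the field names shown in the JSON; the `UsageStats` type and `cooldownMinutes` helper are hypothetical and not part of OpenClaw:

```typescript
// Hypothetical shape mirroring the usageStats entry quoted above.
interface UsageStats {
  errorCount: number;
  lastFailureAt: number; // epoch milliseconds
  cooldownUntil: number; // epoch milliseconds
}

// Length of the cooldown window in whole minutes, measured from the last failure.
function cooldownMinutes(stats: UsageStats): number {
  return Math.round((stats.cooldownUntil - stats.lastFailureAt) / 60000);
}

const zaiDefault: UsageStats = {
  errorCount: 3,
  lastFailureAt: 1773059977264,
  cooldownUntil: 1773061477264,
};

console.log(cooldownMinutes(zaiDefault)); // 25
```

The difference between the two values is 1,500,000 ms, so the window recorded for zai:default is 25 minutes.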

Timeline of Events:

  • 21:02:53: First rate limit error triggers cooldown
  • 21:03:03: Gateway restart (PID 373433 → 415945)
  • 21:03:08: Gateway is back up, but ZAI is still in cooldown
  • 21:04:57: Retry attempts continue to fail
  • 21:10:38: Same pattern repeats
  • 21:18:53: Pattern continues
  • 21:22:23: Pattern continues
  • 21:32:23: Timeout errors begin appearing
  • 21:39:37: Still experiencing cooldown issues

Root Cause Analysis

1. Provider vs Profile Cooldown Issue

The system implements provider-level cooldown instead of profile-level cooldown. When zai:default hits rate limits, all ZAI models are marked as unavailable.

2. Rate Limit Misclassification

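A rate limit error (429) on one model puts the whole profile into cooldown, and the resulting "no available auth profile" failures on the fallback models are themselves tagged rate_limit, so model fallback never succeeds. The error message shows: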

zai/glm-4.5-flash: No available auth profile for zai (all in cooldown or unavailable). (rate_limit)

3. Lack of Exponential Backoff

The system retries failed requests immediately, creating a hammer effect that exacerbates rate limiting.
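
As an illustration of the kind of backoff meant here, a minimal sketch of capped exponential backoff with optional jitter. The function names and the 1 s base / 60 s cap are assumptions chosen for the example, not values taken from OpenClaw:

```typescript
// Hypothetical helper: delay before retry number `attempt` (0-based).
// The delay doubles from baseMs on each attempt and is capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 60000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// "Full jitter" variant: pick a random delay in [0, backoffDelayMs(attempt))
// so that concurrent clients do not retry in lockstep.
function jitteredDelayMs(
  attempt: number,
  random: () => number = Math.random,
): number {
  return Math.floor(random() * backoffDelayMs(attempt));
}

// Print the un-jittered schedule for the first five retries.
for (let attempt = 0; attempt < 5; attempt++) {
  console.log(attempt, backoffDelayMs(attempt));
}
```

With these defaults the un-jittered delays are 1 s, 2 s, 4 s, 8 s, 16 s, and the cap is reached from the sixth retry onward.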

4. Persistent Cooldown State

Cooldown state persists across gateway restarts, indicating it's stored in auth-profiles.json and not properly cleared.

Impact

  • User Experience: Complete service interruption for 20+ minutes
  • Productivity: Unable to complete tasks requiring AI assistance
  • API Costs: Repeated error responses may still incur costs
  • System Stability: Frequent gateway restarts needed for recovery

Suggested Fixes

1. Implement Per-Profile Cooldown

// Current: Provider cooldown
cooldownUntil: 1773061477264  // All ZAI models affected

// Should be: Profile cooldown  
cooldownUntil: 1773061477264  // Only zai:default affected
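
A rough sketch of what profile-scoped cooldown could look like when selecting credentials. Everything here is hypothetical: the `CooldownMap` type, the `pickProfile` helper, and the second profile `zai:backup` are made up for illustration, and only the `zai:default` id and its timestamp come from the state shown earlier:

```typescript
// Profile id -> cooldownUntil (epoch ms). Hypothetical structure.
type CooldownMap = Record<string, number | undefined>;

// Return the first profile of `provider` that is not cooling down at `now`.
function pickProfile(
  provider: string,
  profileIds: string[],
  cooldowns: CooldownMap,
  now: number,
): string | undefined {
  return profileIds.find(
    (id) => id.startsWith(provider + ":") && (cooldowns[id] ?? 0) <= now,
  );
}

const cooldowns: CooldownMap = { "zai:default": 1773061477264 };
const now = 1773060000000; // inside the cooldown window of zai:default

// "zai:backup" is an invented second profile, used only for illustration.
console.log(pickProfile("zai", ["zai:default", "zai:backup"], cooldowns, now)); // "zai:backup"
console.log(pickProfile("zai", ["zai:default"], cooldowns, now)); // undefined
```

Note that with a single profile the lookup still returns nothing, so for fallback between models on the same key the cooldown would probably also need to be keyed by model.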

2. Add CLI Commands for Recovery

openclaw provider reset zai        # Clear cooldown for specific provider
openclaw auth refresh zai          # Revalidate API key
openclaw cooldown status zai       # Check cooldown status

3. Fix Rate Limit Error Classification

  • Distinguish between rate limit errors (429) and auth errors (403)
  • Allow fallback models when primary model hits rate limits
  • Implement proper exponential backoff for 429 errors
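
The distinction in the list above could be sketched roughly as follows. The status codes have their standard HTTP meanings; the `FailureKind` labels other than `rate_limit`, the function names, and the policy itself are assumptions for illustration, not OpenClaw's actual classifier:

```typescript
type FailureKind = "rate_limit" | "auth" | "other";
type Action = "try_fallback_model" | "disable_profile" | "retry_with_backoff";

// Map an HTTP status code to a failure kind.
function classifyFailure(status: number): FailureKind {
  if (status === 429) return "rate_limit";
  if (status === 401 || status === 403) return "auth";
  return "other";
}

// Hypothetical policy: a rate limit on one model moves on to the next
// fallback model, while an auth failure disables the profile because
// every model using that key would fail the same way.
function nextAction(kind: FailureKind): Action {
  switch (kind) {
    case "rate_limit":
      return "try_fallback_model";
    case "auth":
      return "disable_profile";
    default:
      return "retry_with_backoff";
  }
}

console.log(nextAction(classifyFailure(429))); // "try_fallback_model"
console.log(nextAction(classifyFailure(403))); // "disable_profile"
```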

4. Gateway Restart Behavior

  • Add --clear-cooldowns flag to gateway restart
  • Automatically clear temporary cooldowns on restart
  • Add cooldown state validation during startup
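
Startup validation might look something like the sketch below, which clears cooldowns that have already expired or that lie implausibly far in the future. The `ProfileStats` type, the `validateCooldowns` name, and the one-hour `maxCooldownMs` default are assumptions for the example:

```typescript
interface ProfileStats {
  cooldownUntil?: number | undefined; // epoch milliseconds
}

// Return a copy of `stats` in which a cooldown is kept only if it is still
// in the future and no more than maxCooldownMs away from `now`.
function validateCooldowns(
  stats: Record<string, ProfileStats>,
  now: number,
  maxCooldownMs = 60 * 60 * 1000,
): Record<string, ProfileStats> {
  const result: Record<string, ProfileStats> = {};
  for (const [id, entry] of Object.entries(stats)) {
    const until = entry.cooldownUntil;
    const keep =
      until !== undefined && until > now && until - now <= maxCooldownMs;
    result[id] = keep ? { ...entry } : { ...entry, cooldownUntil: undefined };
  }
  return result;
}

// At the time of the restart in the timeline the cooldown is still valid,
// so it would be kept; once it has expired it is cleared.
const loaded = { "zai:default": { cooldownUntil: 1773061477264 } };
console.log(validateCooldowns(loaded, 1773060000000)); // cooldown kept
console.log(validateCooldowns(loaded, 1773061477265)); // cooldown cleared
```

Whether a still-valid cooldown should survive a restart is a separate policy decision; this sketch only removes state that can no longer be correct.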

5. Monitoring and Diagnostics

  • Add openclaw doctor command to detect cooldown loops
  • Provide clear error messages distinguishing error types
  • Add cooldown status to openclaw models status

Additional Context

This issue affects users with:

  • Multiple models from the same provider
  • API key authentication (not OAuth)
  • High-volume usage patterns
  • Production environments requiring reliability

The problem is particularly severe for ZAI provider users as it affects all available models simultaneously, unlike other providers that may have better fallback mechanisms.

Verification Steps

  1. Configure multiple ZAI models as fallbacks
  2. Trigger rate limit by making rapid API calls
  3. Verify all models become unavailable
  4. Restart gateway and confirm cooldown persists
  5. Check that fallback models don't activate

Environment Details

  • System: Linux 6.17.0-14-generic x86_64
  • OpenClaw: v2026.3.7 (PID changes confirm restarts occurred)
  • Models: zai/glm-4.7 (default), zai/glm-4.5-flash, zai/glm-4.5
  • Auth: API key authentication
  • Timeline: Issues persisting from 21:02 to 21:39+ (37+ minutes)
