[Bug] ZAI API provider enters infinite cooldown loop due to rate limit misclassification #40989
Environment
- OpenClaw Version: v2026.3.7
- Node Version: v22.22.1
- OS: Linux 6.17.0-14-generic (x64)
- Installation: Local system (VPS)
- Auth Method: API Key (not OAuth)
Bug Description
The ZAI API provider enters an infinite cooldown loop when encountering rate limit errors, despite having multiple fallback models configured. The cooldown mechanism incorrectly persists across model switches and gateway restarts.
Expected Behavior
- Rate limit errors should trigger model fallback to zai/glm-4.5-flash or zai/glm-4.5
- Cooldown should be per-profile, not per-provider
- Gateway restart should clear cooldown state
- Exponential backoff should prevent repeated hammering of API
Actual Behavior
- Rate limit errors cause provider-wide cooldown: All ZAI models (glm-4.7, glm-4.5-flash, glm-4.5) become unavailable simultaneously
- Infinite cooldown loop: Cooldown persists across gateway restarts (verified by PID changes in logs)
- No model fallback occurs: System continues retrying the same rate-limited model instead of switching to fallbacks
- Misclassified errors: Rate limit errors are treated as cooldown errors, preventing proper recovery
Step-by-step Reproduction
- Start with default model: zai/glm-4.7
- Make several API calls in quick succession
- Observe rate limit error: ⚠️ API rate limit reached. Please try again later.
- Check model status: All ZAI models show "cooldown" state
- Restart gateway: Cooldown persists
- Try to use any ZAI model: All fail with cooldown errors
Log Evidence
Key Error Patterns from journalctl:
21:02:53 [diagnostic] FailoverError: ⚠️ API rate limit reached. Please try again later.
21:02:53 [diagnostic] FailoverError: No available auth profile for zai (all in cooldown or unavailable).
21:02:53 [agent] Embedded agent failed: All models failed (3):
zai/glm-4.7: ⚠️ API rate limit reached. Please try again later. (rate_limit)
zai/glm-4.5-flash: No available auth profile for zai (all in cooldown or unavailable). (rate_limit)
zai/glm-4.5: No available auth profile for zai (all in cooldown or unavailable). (rate_limit)
Auth Profile State:
"zai:default": {
"type": "api_key",
"provider": "zai",
"key": "af9a0bdb53024f20947c1d98a281c8cb.CxG68nzFh5vmwmbP"
},
"usageStats": {
"zai:default": {
"errorCount": 3,
"lastFailureAt": 1773059977264,
"failureCounts": {
"rate_limit": 3
},
"cooldownUntil": 1773061477264 // 25 minute cooldown (1,500,000 ms after lastFailureAt)
}
}
Timeline of Events:
- 21:02:53: First rate limit error triggers cooldown
- 21:03:03: Gateway restart (PID 373433 → 415945)
- 21:03:08: Gateway restarts but ZAI still in cooldown
- 21:04:57: Retry attempts continue to fail
- 21:10:38: Same pattern repeats
- 21:18:53: Pattern continues
- 21:22:23: Pattern continues
- 21:32:23: Timeout errors begin appearing
- 21:39:37: Still experiencing cooldown issues
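The length of the cooldown window can be checked directly from the two timestamps in the usageStats excerpt above. This is a minimal sketch of that arithmetic; the helper names `cooldownMinutes` and `remainingMs` are invented for illustration and are not OpenClaw functions.

```typescript
// Values copied from the usageStats excerpt in this report (epoch milliseconds).
const lastFailureAt = 1773059977264;
const cooldownUntil = 1773061477264;

// Length of a cooldown window in whole minutes.
function cooldownMinutes(startMs: number, endMs: number): number {
  return Math.round((endMs - startMs) / 60000);
}

// Time left in a cooldown at a given instant, never negative.
function remainingMs(untilMs: number, nowMs: number): number {
  return Math.max(0, untilMs - nowMs);
}

// The two timestamps differ by 1,500,000 ms, which is 25 minutes.
const windowMinutes = cooldownMinutes(lastFailureAt, cooldownUntil);
```

Comparing `remainingMs(cooldownUntil, Date.now())` against zero after a gateway restart would show whether the stored deadline is still in force.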
Root Cause Analysis
1. Provider vs Profile Cooldown Issue
The system implements provider-level cooldown instead of profile-level cooldown. When zai:default hits rate limits, all ZAI models are marked as unavailable.
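To illustrate what profile-scoped cooldown could look like, here is a rough sketch that keys cooldown deadlines by profile ID instead of by provider. The `CooldownTable` type, both function names, and the second profile `zai:backup` are assumptions made for this example; they are not taken from the OpenClaw codebase.

```typescript
// Hypothetical shape: cooldown deadlines (epoch ms) keyed by profile ID, e.g. "zai:default".
type CooldownTable = Record<string, number>;

// A profile is usable when it has no deadline or its deadline has passed.
function isProfileAvailable(table: CooldownTable, profileId: string, nowMs: number): boolean {
  const until: number | undefined = table[profileId];
  return until === undefined || until <= nowMs;
}

// Return the first usable profile, or undefined if every profile is cooling down.
function pickProfile(table: CooldownTable, profileIds: string[], nowMs: number): string | undefined {
  for (const id of profileIds) {
    if (isProfileAvailable(table, id, nowMs)) return id;
  }
  return undefined;
}

// "zai:backup" is an invented second profile used only for illustration.
const cooldowns: CooldownTable = { "zai:default": 1773061477264 };
const chosenProfile = pickProfile(cooldowns, ["zai:default", "zai:backup"], 1773060000000);
```

Note that when a provider has only one profile, as in this report, profile-scoped cooldown still blocks every model that shares that key, so model fallback would likely also need a per-model or per-request check.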
2. Rate Limit Misclassification
Rate limit errors (429) are being treated as cooldown errors, preventing proper model fallback. The error message shows:
zai/glm-4.5-flash: No available auth profile for zai (all in cooldown or unavailable). (rate_limit)
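The fallback behavior this report expects can be sketched as a loop that moves to the next configured model when an attempt fails, and only gives up after every model has been tried. This is an assumption-laden illustration: `AttemptResult` and `runWithFallback` are hypothetical names, and the attempt callback is synchronous for brevity, whereas a real provider call would be asynchronous.

```typescript
// Hypothetical result type; `reason` mirrors log labels such as "rate_limit".
type AttemptResult<T> =
  | { status: "ok"; value: T }
  | { status: "failed"; reason: string };

// Try each model in order and return the first success.
function runWithFallback<T>(
  models: string[],
  attempt: (model: string) => AttemptResult<T>
): { model: string; value: T } {
  const failures: string[] = [];
  for (const model of models) {
    const result = attempt(model);
    if (result.status === "ok") {
      return { model: model, value: result.value };
    }
    // Record the failure and continue with the next fallback model.
    failures.push(model + ": " + result.reason);
  }
  throw new Error("All models failed (" + failures.length + "): " + failures.join("; "));
}
```

In this sketch a rate limit on zai/glm-4.7 does not stop zai/glm-4.5-flash from being attempted, which is the opposite of what the log line above shows.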
3. Lack of Exponential Backoff
The system retries failed requests immediately, creating a hammer effect that exacerbates rate limiting.
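For reference, a common way to avoid this hammering is capped exponential backoff, optionally with random jitter. The sketch below shows the delay calculation only; the base delay, the cap, and the function names are illustrative assumptions, not OpenClaw settings.

```typescript
// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs: number, maxMs: number): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// "Full jitter": a random delay in [0, capped delay) so that clients do not retry in lockstep.
// `random` is injectable to keep the function testable.
function jitteredDelayMs(
  attempt: number,
  baseMs: number,
  maxMs: number,
  random: () => number = Math.random
): number {
  return Math.floor(random() * backoffDelayMs(attempt, baseMs, maxMs));
}

// With a 1 s base and a 30 s cap: 1000, 2000, 4000, 8000, 16000, 30000.
const backoffSchedule = [0, 1, 2, 3, 4, 5].map((n) => backoffDelayMs(n, 1000, 30000));
```

If the provider sends a `Retry-After` header with its 429 responses, honoring that value would probably be preferable to a locally computed delay.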
4. Persistent Cooldown State
Cooldown state persists across gateway restarts, indicating it's stored in auth-profiles.json and not properly cleared.
Impact
- User Experience: Complete service interruption for 20+ minutes
- Productivity: Unable to complete tasks requiring AI assistance
- API Costs: Repeated error responses may still incur costs
- System Stability: Frequent gateway restarts needed for recovery
Suggested Fixes
1. Implement Per-Profile Cooldown
// Current: Provider cooldown
cooldownUntil: 1773061477264 // All ZAI models affected
// Should be: Profile cooldown
cooldownUntil: 1773061477264 // Only zai:default affected
2. Add CLI Commands for Recovery
openclaw provider reset zai # Clear cooldown for specific provider
openclaw auth refresh zai # Revalidate API key
openclaw cooldown status zai # Check cooldown status
3. Fix Rate Limit Error Classification
- Distinguish between rate limit errors (429) and auth errors (403)
- Allow fallback models when primary model hits rate limits
- Implement proper exponential backoff for 429 errors
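One possible shape for the classification step is a small function that maps an HTTP status to a failure reason before any cooldown decision is made. The reason labels, `classifyFailure`, and `isRetryable` are hypothetical names chosen for this sketch, and the retry policy is an assumption rather than documented OpenClaw behavior.

```typescript
// Hypothetical failure labels; only "rate_limit" appears in the logs of this report.
type FailureReason = "rate_limit" | "auth" | "timeout" | "server_error" | "unknown";

// Map a response status (or a timeout) to a failure reason.
function classifyFailure(status: number | undefined, timedOut: boolean): FailureReason {
  if (timedOut) return "timeout";
  if (status === 429) return "rate_limit";
  if (status === 401 || status === 403) return "auth";
  if (status !== undefined && status >= 500 && status <= 599) return "server_error";
  return "unknown";
}

// Assumed policy: transient failures are worth retrying with backoff or a fallback model;
// auth failures are not, because the same key would fail again.
function isRetryable(reason: FailureReason): boolean {
  return reason === "rate_limit" || reason === "timeout" || reason === "server_error";
}
```

Keeping the reason separate from the cooldown state would also let the logs report why a profile was benched, instead of labeling every failure as rate_limit.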
4. Gateway Restart Behavior
- Add --clear-cooldowns flag to gateway restart
- Automatically clear temporary cooldowns on restart
- Add cooldown state validation during startup
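Startup validation could be as simple as dropping cooldown deadlines that have already expired or that lie implausibly far in the future. The sketch below assumes a `ProfileUsage` shape modeled on the usageStats excerpt in this report; the interface, the function name, and the `maxCooldownMs` limit are illustrative assumptions.

```typescript
// Hypothetical shape, modeled on the usageStats excerpt in this report.
interface ProfileUsage {
  errorCount: number;
  lastFailureAt?: number;
  failureCounts: Record<string, number>;
  cooldownUntil?: number;
}

// Return a copy of the stats with expired or implausibly long cooldowns removed.
// Entries are shallow copies, so `failureCounts` objects are shared with the input.
function sanitizeCooldowns(
  stats: Record<string, ProfileUsage>,
  nowMs: number,
  maxCooldownMs: number
): Record<string, ProfileUsage> {
  const cleaned: Record<string, ProfileUsage> = {};
  for (const id of Object.keys(stats)) {
    const entry: ProfileUsage = { ...stats[id] };
    const until = entry.cooldownUntil;
    if (until !== undefined && (until <= nowMs || until - nowMs > maxCooldownMs)) {
      delete entry.cooldownUntil;
    }
    cleaned[id] = entry;
  }
  return cleaned;
}
```

Running such a pass once when the gateway starts would keep legitimate short cooldowns while discarding stale state left over from an earlier process.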
5. Monitoring and Diagnostics
- Add openclaw doctor command to detect cooldown loops
- Provide clear error messages distinguishing error types
- Add cooldown status to openclaw models status
Additional Context
This issue affects users with:
- Multiple models from the same provider
- API key authentication (not OAuth)
- High-volume usage patterns
- Production environments requiring reliability
The problem is particularly severe for ZAI provider users as it affects all available models simultaneously, unlike other providers that may have better fallback mechanisms.
Related Issues
- #5240: [Bug]: Provider cooldown reports 'rate_limit' regardless of actual failure reason (timeout, billing, etc.)
- #5159: No exponential backoff on 429 rate limit errors
- #32828: [Bug]: False 'API rate limit reached' on all models despite APIs being fully functional
- #13909: [Bug]: OAuth 403 misclassified as rate_limit → infinite cooldown loop, no CLI recovery
Verification Steps
- Configure multiple ZAI models as fallbacks
- Trigger rate limit by making rapid API calls
- Verify all models become unavailable
- Restart gateway and confirm cooldown persists
- Check that fallback models don't activate
Environment Details
- System: Linux 6.17.0-14-generic x86_64
- OpenClaw: v2026.3.7 (PID changes confirm restarts occurred)
- Models: zai/glm-4.7 (default), zai/glm-4.5-flash, zai/glm-4.5
- Auth: API key authentication
- Timeline: Issues persisting from 21:02 to 21:39+ (37+ minutes)