Summary
When a cron task with sessionTarget="isolated" fails, the cron scheduler retries it infinitely. Each retry spawns a new isolated agent session and calls the LLM API, quickly triggering a provider-wide rate_limit cooldown that freezes the entire OpenClaw instance.
Steps to Reproduce
- Create a cron task that will fail:

```json
{
  "name": "Test failing isolated task",
  "schedule": { "kind": "at", "atMs": 1770165091000 },
  "sessionTarget": "isolated",
  "payload": {
    "kind": "agentTurn",
    "message": "Execute: pkill -f 'process_that_does_not_exist_xyz'"
  }
}
```

- Wait for the scheduled time
- Task fails because the process doesn't exist
- Observe the logs:
  - Cron immediately retries the failed task
  - Each retry creates a new isolated session
  - Each session calls the LLM API
  - Rapid 429 rate_limit errors flood the logs
  - API enters cooldown: "Provider anthropic is in cooldown (all profiles unavailable)"
  - All sessions become unresponsive
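The loop above can be modeled in a few lines. This is an illustrative reconstruction of the assumed scheduler behavior, not actual OpenClaw code; `simulateRetries` and its parameters are hypothetical:

```typescript
// Illustrative model of the current (buggy) behavior: a failed isolated
// task is rescheduled immediately, with no failure cap and no backoff,
// so LLM calls accumulate one per tick until the provider rate-limits.
function simulateRetries(taskSucceeds: boolean, ticks: number): number {
  let llmCalls = 0;
  let pending = true;
  for (let t = 0; t < ticks && pending; t++) {
    llmCalls++; // each attempt spawns a session and makes an LLM call
    if (taskSucceeds) pending = false;
    // on failure: nothing stops the loop, so it retries next tick
  }
  return llmCalls;
}

// simulateRetries(true, 60)  === 1   (succeeds on first attempt)
// simulateRetries(false, 60) === 60  (one API call per tick, unbounded)
```

A failure cap or backoff (see Proposed Fix) would bound `llmCalls` instead of letting it grow with elapsed time.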
Current Behavior
```
Error: All models failed (2): anthropic/claude-haiku-4-5: Provider anthropic is in cooldown (all profiles unavailable) (rate_limit) | anthropic/claude-opus-4-5: Provider anthropic is in cooldown (all profiles unavailable) (rate_limit)
```
This error repeats indefinitely, preventing all message processing.
Expected Behavior
- Cron tasks should NOT retry infinitely on failure
- After N failures (e.g., 3), the task should:
  - Be automatically disabled/marked as failed
  - Stop retrying
  - Notify the user with error details
- A failing task should NOT trigger a provider-wide cooldown
- Other sessions should remain responsive
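The expected retry policy can be sketched as follows. All names here (`JobState`, `onJobFailure`, `MAX_CONSECUTIVE_FAILURES`) are illustrative assumptions, not OpenClaw's actual API:

```typescript
// Hedged sketch of the expected behavior: retry a bounded number of
// times, then disable the job instead of looping forever.
interface JobState {
  enabled: boolean;
  consecutiveFailures: number;
}

const MAX_CONSECUTIVE_FAILURES = 3;

// Returns what the scheduler should do after a failed run.
function onJobFailure(state: JobState): "retry" | "disabled" {
  state.consecutiveFailures += 1;
  if (state.consecutiveFailures >= MAX_CONSECUTIVE_FAILURES) {
    state.enabled = false; // stop retrying
    return "disabled";     // caller should notify the user with error details
  }
  return "retry";
}
```

With this policy, the third consecutive failure disables the job rather than starting another isolated session.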
Impact
- Severity: CRITICAL 🔴
- Affected users: Anyone using isolated cron tasks (agentTurn payload)
- Trigger: Creating duplicate cron tasks or tasks with invalid commands
- Recovery: Manual intervention in web UI to disable the failed task
Example Log Output
```
18:31:31 PM - cron task 32982bf7-... started (isolated)
18:31:31 PM - agentTurn failed (exit code 1)
18:31:32 PM - cron retry attempt 1
18:31:33 PM - HTTP 429 rate_limit_error
18:31:34 PM - cron retry attempt 2
18:31:34 PM - HTTP 429 rate_limit_error
18:31:35 PM - Provider anthropic is in cooldown
18:31:35 PM - [No response] API frozen
```
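The log shows retries landing roughly once per second. For comparison, an exponential backoff schedule like the one proposed in Option B below would space attempts out quickly; `retryDelayMs` is a hypothetical helper, not OpenClaw code:

```typescript
// Hypothetical backoff schedule: 2^n seconds after the n-th consecutive
// failure, capped at one hour.
function retryDelayMs(consecutiveFailures: number): number {
  return Math.min(1000 * 2 ** consecutiveFailures, 3_600_000);
}

// retryDelayMs(0)  === 1000     (1 s after the first failure)
// retryDelayMs(5)  === 32000    (32 s after the sixth)
// retryDelayMs(12) === 3600000  (capped at 1 h)
```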
Proposed Fix
Option A: Failure limit + auto-disable

```ts
// cron/scheduler.ts
const MAX_CONSECUTIVE_FAILURES = 3;

if (job.state.consecutiveFailures >= MAX_CONSECUTIVE_FAILURES) {
  job.enabled = false;
  logger.error(`Job ${job.id} disabled after ${MAX_CONSECUTIVE_FAILURES} failures`);
  notifyUser(`Cron job "${job.name}" disabled: max retries exceeded`);
  return;
}
```

Option B: Exponential backoff on isolated tasks

```ts
// cron/scheduler.ts
if (sessionTarget === "isolated") {
  const delay = Math.min(
    1000 * Math.pow(2, job.state.consecutiveFailures),
    3600000 // max 1 hour
  );
  scheduleRetry(job, delay);
}
```

Option C: Rate limit circuit breaker

If a single cron task triggers a rate_limit error, pause all cron execution until the rate limit clears, rather than continuing to hammer the API.

Environment
- OpenClaw version: 2026-01-31
- Node: v22.22.0
- OS: macOS 24.6.0 (arm64)
- Provider: Anthropic
Additional Context
This mirrors the documented issues #5159 (no exponential backoff) and #5744 (provider-wide cooldown), but is specific to the cron scheduler's handling of isolated task failures.
Related Issues
- #5159 - No exponential backoff on 429 rate limit errors
- #5744 - Per-model rate limit on one Google model triggers full provider cooldown, blocking other Google models with available quota
- #4410 - [Feature]: Auto-restart gateway on stuck sessions