Skip to content

[Bug]: ⚠️Cron isolated tasks without failure limit trigger infinite retry loop → API cooldown #8520

@Baoxd123

Description

@Baoxd123

Summary

When a cron task with sessionTarget="isolated" fails, the cron scheduler retries it infinitely. Each retry spawns a new isolated agent session and calls the LLM API, causing rapid rate_limit cooldown that freezes the entire OpenClaw instance.

Steps to Reproduce

  1. Create a cron task that will fail:
{
  "name": "Test failing isolated task",
  "schedule": { "kind": "at", "atMs": 1770165091000 },
  "sessionTarget": "isolated",
  "payload": {
    "kind": "agentTurn",
    "message": "Execute: pkill -f 'process_that_does_not_exist_xyz'"
  }
}
  1. Wait for the scheduled time
  2. Task fails because the process doesn't exist
  3. Observe the logs:
    • Cron immediately retries the failed task
    • Each retry creates a new isolated session
    • Each session calls the LLM API
    • Rapid 429 rate_limit errors flood the logs
    • API enters cooldown: "Provider anthropic is in cooldown (all profiles unavailable)"
    • All sessions become unresponsive

Current Behavior

Error: All models failed (2): anthropic/claude-haiku-4-5: Provider anthropic is in cooldown (all profiles unavailable) (rate_limit) | anthropic/claude-opus-4-5: Provider anthropic is in cooldown (all profiles unavailable) (rate_limit)

This error repeats indefinitely, preventing all message processing.

Expected Behavior

  1. Cron tasks should NOT infinitely retry on failure
  2. After N failures (e.g., 3), the task should:
    • Be automatically disabled/marked as failed
    • Stop retrying
    • Notify the user with error details
  3. Failing tasks should NOT trigger full provider cooldown
  4. Other sessions should remain responsive

Impact

  • Severity: CRITICAL 🔴
  • Affected users: Anyone using isolated cron tasks (agentTurn payload)
  • Trigger: Creating duplicate cron tasks or tasks with invalid commands
  • Recovery: Manual intervention in web UI to disable the failed task

Example Log Output

18:31:31 PM - cron task 32982bf7-... started (isolated)
18:31:31 PM - agentTurn failed (exit code 1)
18:31:32 PM - cron retry attempt 1
18:31:33 PM - HTTP 429 rate_limit_error
18:31:34 PM - cron retry attempt 2
18:31:34 PM - HTTP 429 rate_limit_error
18:31:35 PM - Provider anthropic is in cooldown
18:31:35 PM - [No response] API frozen

Proposed Fix

Option A: Failure limit + auto-disable

// cron/scheduler.ts
const MAX_CONSECUTIVE_FAILURES = 3;

if (job.state.consecutiveFailures >= MAX_CONSECUTIVE_FAILURES) {
  job.enabled = false;
  logger.error(`Job ${job.id} disabled after ${MAX_CONSECUTIVE_FAILURES} failures`);
  notifyUser(`Cron job "${job.name}" disabled: max retries exceeded`);
  return;
}

Option B: Exponential backoff on isolated tasks

// cron/scheduler.ts
if (sessionTarget === "isolated") {
  const delay = Math.min(
    1000 * Math.pow(2, job.state.consecutiveFailures),
    3600000 // max 1 hour
  );
  scheduleRetry(job, delay);
}

Option C: Rate limit circuit breaker

// If a single cron task triggers rate_limit, pause all cron execution
// until rate limit clears (rather than keep hammering the API)

Environment

  • OpenClaw version: 2026-01-31
  • Node: v22.22.0
  • OS: macOS 24.6.0 (arm64)
  • Provider: Anthropic

Additional Context

This mirrors the documented issues #5159 (no exponential backoff) and #5744 (provider-wide cooldown), but is specific to the cron scheduler's handling of isolated task failures.

Related Issues

Logs or screenshots

Paste relevant logs or add screenshots (redact secrets).

Image Image Image

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions