Cron: add retry policy for transient failures (rate limit, network) #24355

@mediateo

Problem

When a cron job fails due to a transient error (e.g., model provider rate limit cooldown, temporary network outage), the job is immediately set to enabled: false with no retry. This is especially problematic for one-shot jobs (schedule.kind: "at"): there is no next scheduled run, so the job is effectively lost.

Observed behavior

  • Cron job fires on schedule ✅
  • Model provider returns rate limit (429) → OpenClaw enters cooldown
  • Job state: lastStatus: "error", enabled: false
  • One-shot job with deleteAfterRun: true → permanently disabled, never retries

Expected behavior

For transient/retryable errors (rate limit, network timeout, provider cooldown), the scheduler should:

  1. Automatically retry with increasing (roughly exponential) backoff (e.g., 1m → 2m → 5m → 10m)
  2. Disable or permanently fail the job only after the maximum number of retries is exhausted
  3. Distinguish transient from permanent errors (auth failure = permanent, rate limit = transient)
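
The three points above could be sketched roughly as follows. This is an illustration, not OpenClaw's actual scheduler code: `ErrorKind`, `retryCount`, `nextRunAtMs`, and `onFailure` are hypothetical names, and only `enabled` and `lastStatus` come from the observed job state.

```typescript
// Sketch only: everything except `enabled` and `lastStatus` is a hypothetical name.
type ErrorKind = "rate_limit" | "network" | "auth" | "other";

interface JobState {
  enabled: boolean;
  lastStatus: "ok" | "error";
  retryCount: number;   // hypothetical: transient failures already retried
  nextRunAtMs?: number; // hypothetical: timestamp at which the retry should fire
}

// Errors treated as transient; everything else is considered permanent.
const TRANSIENT = new Set<ErrorKind>(["rate_limit", "network"]);

// Backoff steps taken from the example above: 1m, 2m, 5m, 10m.
const BACKOFF_MS = [60_000, 120_000, 300_000, 600_000];

function onFailure(job: JobState, kind: ErrorKind, nowMs: number): JobState {
  const canRetry = TRANSIENT.has(kind) && job.retryCount < BACKOFF_MS.length;
  if (!canRetry) {
    // Permanent error, or retries exhausted: disable as the scheduler does today.
    return { ...job, enabled: false, lastStatus: "error", nextRunAtMs: undefined };
  }
  // Transient error: keep the job enabled and schedule the next attempt.
  return {
    ...job,
    enabled: true,
    lastStatus: "error",
    retryCount: job.retryCount + 1,
    nextRunAtMs: nowMs + BACKOFF_MS[job.retryCount],
  };
}
```

With this shape, a one-shot job that hits a 429 stays enabled and gets a new run time instead of being lost, while an auth failure still disables it immediately.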

Context

This is standard in most modern schedulers:

  • AWS EventBridge: retry policy with up to 185 retries
  • Kubernetes CronJob: backoffLimit on the Job template sets the retry count
  • Celery/Bull: exponential backoff available as a built-in retry option

Suggestion

A possible configuration could look like:

{
  "retry": {
    "maxAttempts": 3,
    "backoffMs": [60000, 120000, 300000],
    "retryOn": ["rate_limit", "network"]
  }
}
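
One way such a config might be interpreted, as a sketch under assumptions: `maxAttempts` is taken to count retries after the first failure, `backoffMs[i]` is the wait before retry `i + 1`, and the last entry is reused if the list is shorter than `maxAttempts`. The `RetryPolicy` type and `retryDelayMs` function are hypothetical names, not an existing OpenClaw API.

```typescript
// Hypothetical type mirroring the suggested "retry" config block.
interface RetryPolicy {
  maxAttempts: number; // assumed: number of retries after the first failure
  backoffMs: number[]; // assumed: delay before retry 1, 2, 3, ...
  retryOn: string[];   // error kinds that are considered transient
}

const config = JSON.parse(`{
  "retry": {
    "maxAttempts": 3,
    "backoffMs": [60000, 120000, 300000],
    "retryOn": ["rate_limit", "network"]
  }
}`);
const policy: RetryPolicy = config.retry;

// Returns the delay before retry number `attempt` (1-based),
// or null when the error should not be retried.
function retryDelayMs(p: RetryPolicy, errorKind: string, attempt: number): number | null {
  if (!p.retryOn.includes(errorKind)) return null;         // not a retryable error kind
  if (attempt < 1 || attempt > p.maxAttempts) return null; // retries exhausted
  if (p.backoffMs.length === 0) return 0;                  // no delays configured
  // Reuse the last entry if the list is shorter than maxAttempts.
  return p.backoffMs[Math.min(attempt, p.backoffMs.length) - 1];
}
```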

Or a simpler global default: retry transient errors up to 3 times with exponential backoff.
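
For the global default, the delay could be computed rather than listed. A minimal sketch, assuming a 1-minute base that doubles per retry and a 10-minute cap (both values are illustrative, and `exponentialBackoffMs` is a hypothetical helper):

```typescript
// Delay before retry number `attempt` (1-based): base * 2^(attempt - 1), capped.
function exponentialBackoffMs(attempt: number, baseMs = 60_000, capMs = 600_000): number {
  return Math.min(capMs, baseMs * 2 ** (attempt - 1));
}
```

With these defaults, three retries would wait 1, 2, and 4 minutes.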

Environment

  • OpenClaw 2026.2.15
  • macOS (Darwin arm64)
