Cron: kind:"at" jobs with non-"ok" terminal status loop forever at 100% CPU #11452

@kikemarti-eu

Summary

One-shot cron jobs (kind: "at") that complete with a non-"ok" status (e.g., "skipped", "error") are never disabled. This causes computeJobNextRunAtMs() to keep returning the past scheduled time, which triggers a setTimeout(..., 0) tight loop that pegs the gateway at 100%+ CPU indefinitely.

Environment

  • OpenClaw version: 2026.2.6-3
  • OS: Raspberry Pi OS (Linux 6.12.62+rpt-rpi-2712, aarch64)
  • Install method: npm global

Steps to Reproduce

  1. Create a one-shot cron job scheduled for a time in the near future:

    {
      "name": "test-reminder",
      "schedule": {"kind": "at", "atMs": <timestamp_2_minutes_from_now>},
      "sessionTarget": "isolated",
      "wakeMode": "now",
      "payload": {"kind": "systemEvent", "text": "test"}
    }
  2. Wait for the scheduled time to pass

  3. If the job completes with lastStatus: "skipped" (or any status other than "ok"), the loop begins

Expected Behavior

  • kind: "at" jobs should be disabled after any terminal execution — whether "ok", "skipped", or "error"
  • At minimum, there should be a retry cap or exponential backoff for past-due one-shot jobs

Actual Behavior

The job stays enabled: true and enters an infinite tight loop:

  1. armTimer() → nextWakeAtMs() returns past time → setTimeout(fn, 0) (fires immediately)
  2. onTimer() → executes job → status is "skipped" → nextRunAtMs recomputed to same past time
  3. armTimer() again → goto 1

This repeats ~4,800 times/second, driving the gateway to 100%+ CPU.

Evidence

CPU Profile (--cpu-prof, 105-second capture)

Top functions by sample count during the busy loop:

| Function | % of samples | Source |
|---|---|---|
| croner.js (g, C, partToArray) | ~16% | Schedule evaluation |
| json5/parse.js (lex, string, push) | ~6% | loadCronStore |
| saveCronStore | ~2% | Writing jobs.json |
| loadCronStore | ~1.6% | Reading jobs.json |
| File I/O (copyFile, rename, stat) | ~5% | Per-iteration persistence |
| Idle | 38.6% | Should be ~99% at rest |

strace confirmation

The loop manifests as ~5,200 reads/second on an eventfd (io_uring completion):

read(19, "\1\0\0\0\0\0\0\0", 8) = 8   # repeated thousands of times/sec

The eventfd activity is driven by the constant file I/O from saveCronStore/loadCronStore on every iteration.

Before/After

| State | CPU usage |
|---|---|
| With stuck job | ~116% (single core pegged) |
| After removing job | ~3% (normal idle: Telegram polling + heartbeat) |

Root Cause

In the cron timer logic, kind: "at" jobs are only auto-disabled when status === "ok". The relevant logic (pseudocode from bundled source):

// After job execution:
if (job.schedule.kind === "at" && lastStatus === "ok") {
  job.enabled = false;  // ← only disables on "ok"
}

// Next run calculation:
if (job.schedule.kind === "at" && job.enabled) {
  return job.schedule.atMs;  // ← returns the original (past) time
}

Since "skipped" ≠ "ok", the job remains enabled. computeJobNextRunAtMs() returns the original atMs (which is in the past), causing setTimeout to fire with delay 0 on every cycle.
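
To make the failure mode concrete, here is a minimal, self-contained model of the logic described above. It is a sketch, not the bundled source: only computeJobNextRunAtMs is a name taken from this report, while finishJob, timerDelayMs, simulate, and the job shape are hypothetical stand-ins. The loop is bounded so the example terminates.

```javascript
// Simplified model of the behavior described in this issue.
// Only computeJobNextRunAtMs is a name from the report; the other
// helpers and the job shape are illustrative assumptions.

function computeJobNextRunAtMs(job) {
  // Mirrors the pseudocode: an enabled "at" job always reports its original time.
  if (job.schedule.kind === "at" && job.enabled) {
    return job.schedule.atMs;
  }
  return undefined;
}

function finishJob(job, lastStatus) {
  job.state.lastStatus = lastStatus;
  // Buggy rule: only "ok" disables the one-shot job.
  if (job.schedule.kind === "at" && lastStatus === "ok") {
    job.enabled = false;
  }
  job.state.nextRunAtMs = computeJobNextRunAtMs(job);
}

function timerDelayMs(job, nowMs) {
  const next = job.state.nextRunAtMs;
  if (next === undefined) return null; // nothing left to arm
  return Math.max(0, next - nowMs);    // past-due time => delay 0
}

function simulate(lastStatus, iterations) {
  const nowMs = 1_000_000;
  const job = {
    enabled: true,
    schedule: { kind: "at", atMs: nowMs - 1 }, // already past due
    state: {},
  };
  job.state.nextRunAtMs = computeJobNextRunAtMs(job);

  let runs = 0;
  for (let i = 0; i < iterations; i++) {
    if (timerDelayMs(job, nowMs) === null) break; // timer would not be re-armed
    runs++;
    finishJob(job, lastStatus);
  }
  return { runs, enabled: job.enabled };
}

console.log(simulate("ok", 5));      // { runs: 1, enabled: false }
console.log(simulate("skipped", 5)); // { runs: 5, enabled: true }
```

With "ok" the job runs once and the timer is never re-armed. With "skipped" every iteration sees a delay of 0 and runs again, which is the tight loop observed on the gateway.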

Suggested Fix

// Disable kind:"at" jobs on ANY terminal status, not just "ok":
if (job.schedule.kind === "at") {
  job.enabled = false;
}

Or more conservatively:

  • Add a retry counter with a cap (e.g., 3 attempts)
  • Add exponential backoff for past-due one-shot jobs
  • Treat "skipped" and "error" as terminal for one-shot jobs
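
The conservative options above could be combined roughly as follows. This is only a sketch under stated assumptions: finishAtJob, state.attempts, MAX_ATTEMPTS, and BASE_BACKOFF_MS are hypothetical names and values chosen for illustration, not existing OpenClaw fields or settings.

```javascript
// Sketch of a retry cap plus exponential backoff for one-shot jobs.
// All names and constants here are assumptions for illustration.
const MAX_ATTEMPTS = 3;       // assumed retry cap
const BASE_BACKOFF_MS = 1000; // assumed base delay

function finishAtJob(job, lastStatus, nowMs) {
  if (job.schedule.kind !== "at") return job; // recurring jobs are untouched
  job.state.lastStatus = lastStatus;

  if (lastStatus === "ok") {
    // Success: one-shot job is done.
    job.enabled = false;
    job.state.nextRunAtMs = undefined;
    return job;
  }

  // "skipped" or "error": count the attempt.
  job.state.attempts = (job.state.attempts ?? 0) + 1;
  if (job.state.attempts >= MAX_ATTEMPTS) {
    // Cap reached: treat the status as terminal.
    job.enabled = false;
    job.state.nextRunAtMs = undefined;
  } else {
    // Schedule the retry in the future (1s, 2s, 4s, ...) so the timer
    // is never armed with a past-due time.
    job.state.nextRunAtMs =
      nowMs + BASE_BACKOFF_MS * 2 ** (job.state.attempts - 1);
  }
  return job;
}

const job = { enabled: true, schedule: { kind: "at", atMs: 0 }, state: {} };
finishAtJob(job, "skipped", 10_000);
console.log(job.enabled, job.state.nextRunAtMs); // true 11000
finishAtJob(job, "skipped", 11_000);
console.log(job.enabled, job.state.nextRunAtMs); // true 13000
finishAtJob(job, "error", 13_000);
console.log(job.enabled, job.state.nextRunAtMs); // false undefined
```

The key property is that a non-"ok" result either moves nextRunAtMs strictly into the future or disables the job, so setTimeout can no longer be re-armed with delay 0 indefinitely.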

Impact

  • Severity: High — a single stuck job silently pegs an entire CPU core
  • In our case, the job ran undetected for 4+ days at 100% CPU on a Raspberry Pi 5, causing significant NVMe write amplification and heat
  • The only fix is manual removal of the stuck job from cron/jobs.json

Workaround

Identify and remove (or disable) stuck one-shot jobs:

# Find enabled at-jobs with nextRunAtMs in the past:
openclaw cron list --all --json | jq '[.[] | select(.schedule.kind == "at" and .enabled == true and .state.nextRunAtMs < (now * 1000))]'
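
If jq is not available, the same check can be sketched in Node once the JSON output has been parsed. The job shape (schedule.kind, enabled, state.nextRunAtMs) is inferred from the jq filter above; findStuckAtJobs, the sample data, and the "cron" kind used for the recurring entry are hypothetical. Unlike the jq filter, this version skips jobs that have no numeric nextRunAtMs.

```javascript
// Sketch: select enabled one-shot jobs whose next run time is in the past.
// Field names follow the jq filter above; everything else is an assumption.
function findStuckAtJobs(jobs, nowMs = Date.now()) {
  return jobs.filter(
    (job) =>
      job.schedule?.kind === "at" &&
      job.enabled === true &&
      typeof job.state?.nextRunAtMs === "number" &&
      job.state.nextRunAtMs < nowMs
  );
}

// Made-up sample data for illustration only.
const sample = [
  { name: "stuck", enabled: true, schedule: { kind: "at", atMs: 1000 }, state: { nextRunAtMs: 1000 } },
  { name: "done", enabled: false, schedule: { kind: "at", atMs: 1000 }, state: { nextRunAtMs: 1000 } },
  { name: "future", enabled: true, schedule: { kind: "at", atMs: 9000 }, state: { nextRunAtMs: 9000 } },
  { name: "recurring", enabled: true, schedule: { kind: "cron" }, state: { nextRunAtMs: 1000 } },
];

console.log(findStuckAtJobs(sample, 5000).map((j) => j.name)); // [ 'stuck' ]
```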
