Summary
One-shot cron jobs (kind: "at") that complete with a non-"ok" status (e.g., "skipped", "error") are never disabled. This causes computeJobNextRunAtMs() to keep returning the past scheduled time, which triggers a setTimeout(..., 0) tight loop that pegs the gateway at 100%+ CPU indefinitely.
Environment
- OpenClaw version: 2026.2.6-3
- OS: Raspberry Pi OS (Linux 6.12.62+rpt-rpi-2712, aarch64)
- Install method: npm global
Steps to Reproduce
1. Create a one-shot cron job scheduled for a time in the near future:

```json
{
  "name": "test-reminder",
  "schedule": { "kind": "at", "atMs": <timestamp_2_minutes_from_now> },
  "sessionTarget": "isolated",
  "wakeMode": "now",
  "payload": { "kind": "systemEvent", "text": "test" }
}
```

2. Wait for the scheduled time to pass.
3. If the job completes with `lastStatus: "skipped"` (or any status other than `"ok"`), the loop begins.
Expected Behavior
kind: "at"jobs should be disabled after any terminal execution — whether"ok","skipped", or"error"- At minimum, there should be a retry cap or exponential backoff for past-due one-shot jobs
Actual Behavior
The job stays enabled: true and enters an infinite tight loop:
1. `armTimer()` → `nextWakeAtMs()` returns a past time → `setTimeout(fn, 0)` (fires immediately)
2. `onTimer()` → executes the job → status is `"skipped"` → `nextRunAtMs` is recomputed to the same past time
3. `armTimer()` again → go to step 1
This repeats ~4,800 times/second, driving the gateway to 100%+ CPU.
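The mechanism can be demonstrated with a minimal model. The function names (`computeJobNextRunAtMs`, `onTimer`) come from the issue; the harness itself is an illustrative sketch, not the actual OpenClaw source:

```javascript
// Minimal model of the tight loop: a non-"ok" status never disables
// the job, so the next-run time stays in the past forever.
function computeJobNextRunAtMs(job) {
  if (job.schedule.kind === "at" && job.enabled) {
    return job.schedule.atMs; // keeps returning the original (past) time
  }
  return null;
}

function onTimer(job) {
  job.lastStatus = "skipped"; // any non-"ok" terminal status
  if (job.schedule.kind === "at" && job.lastStatus === "ok") {
    job.enabled = false; // never reached for "skipped"
  }
}

const job = { enabled: true, schedule: { kind: "at", atMs: 1000 } };
const nowMs = 2000; // scheduled time is already in the past

let iterations = 0;
// Cap at 5 iterations for the demo; the real loop never exits.
while (iterations < 5) {
  const next = computeJobNextRunAtMs(job);
  if (next === null || next > nowMs) break; // would be setTimeout(fn, 0)
  onTimer(job);
  iterations++;
}

console.log(iterations, job.enabled); // 5 true — the job never disables itself
```

The demo hits its iteration cap with the job still enabled, which is exactly why the real scheduler spins at full speed.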
Evidence
CPU Profile (--cpu-prof, 105-second capture)
Top functions by sample count during the busy loop:
| Function | % of samples | Source |
|---|---|---|
| `croner.js` (`g`, `C`, `partToArray`) | ~16% | Schedule evaluation |
| `json5/parse.js` (`lex`, `string`, `push`) | ~6% | `loadCronStore` |
| `saveCronStore` | ~2% | Writing `jobs.json` |
| `loadCronStore` | ~1.6% | Reading `jobs.json` |
| File I/O (`copyFile`, `rename`, `stat`) | ~5% | Per-iteration persistence |
| Idle | ~38.6% | Should be ~99% at rest |
strace confirmation
The loop manifests as ~5,200 reads/second on an eventfd (io_uring completion):
```
read(19, "\1\0\0\0\0\0\0\0", 8) = 8   # repeated thousands of times/sec
```
The eventfd activity is driven by the constant file I/O from saveCronStore/loadCronStore on every iteration.
Before/After
| State | CPU usage |
|---|---|
| With stuck job | ~116% (single core pegged) |
| After removing job | ~3% (normal idle: Telegram polling + heartbeat) |
Root Cause
In the cron timer logic, kind: "at" jobs are only auto-disabled when status === "ok". The relevant logic (pseudocode from bundled source):
```js
// After job execution:
if (job.schedule.kind === "at" && lastStatus === "ok") {
  job.enabled = false; // ← only disables on "ok"
}

// Next-run calculation:
if (job.schedule.kind === "at" && job.enabled) {
  return job.schedule.atMs; // ← returns the original (past) time
}
```

Since `"skipped"` ≠ `"ok"`, the job remains enabled. `computeJobNextRunAtMs()` returns the original `atMs` (which is in the past), causing `setTimeout` to fire with delay 0 on every cycle.
Suggested Fix
// Disable kind:"at" jobs on ANY terminal status, not just "ok":
if (job.schedule.kind === "at") {
job.enabled = false;
}Or more conservatively:
- Add a retry counter with a cap (e.g., 3 attempts)
- Add exponential backoff for past-due one-shot jobs
- Treat
"skipped"and"error"as terminal for one-shot jobs
Impact
- Severity: High — a single stuck job silently pegs an entire CPU core
- In our case, the job ran undetected for 4+ days at 100% CPU on a Raspberry Pi 5, causing significant NVMe write amplification and heat
- The only fix is manual removal of the stuck job from `cron/jobs.json`
Workaround
Identify and remove (or disable) stuck one-shot jobs:
```sh
# Find enabled at-jobs with nextRunAtMs in the past:
openclaw cron list --all --json | jq '[.[] | select(.schedule.kind == "at" and .enabled == true and .state.nextRunAtMs < (now * 1000))]'
```

Related Issues
- Cron scheduler enters infinite retry loop on model rejection #11438 — Same infinite loop mechanism, triggered by model rejection errors
- [Bug]: Cron Reminders Skipping/Silent Failure + Temporary Fix #8298 — Case 2 describes the same "skipped" loop behavior
- Cron jobs created via tool are disabled by default (enabled not defaulting to true) #6483 — Previous cron bug (enabled default), fixed in v2026.2.6