Cron: disable one-shot at jobs after any terminal status to prevent retry storms (#11438, #11452) #11459

Closed
lailoo wants to merge 2 commits into openclaw:main from lailoo:fix/cron-at-retry-storm-11438

Conversation

@lailoo
Contributor

@lailoo lailoo commented Feb 7, 2026

Summary

Fixes #11438
Fixes #11452

One-shot at cron jobs that complete with a non-"ok" status (e.g. "error", "skipped") enter an infinite retry loop because computeJobNextRunAtMs returns the original (past) atMs, causing findDueJobs to immediately re-trigger the job. This generates 1,900+ failed attempts in 5 seconds and pegs the gateway at 100% CPU.

Problem

When an at job finishes with any non-"ok" status:

  1. onTimer calls computeJobNextRunAtMs(job, result.endedAt)
  2. For at jobs, this returns the original atMs (already in the past)
  3. armTimer fires immediately, findDueJobs picks up the job again
  4. Repeat → infinite retry storm with ~2-3ms between attempts
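The four steps above can be modelled in a few lines. This is a simplified sketch, not the actual openclaw source: the names computeJobNextRunAtMs and findDueJobs come from the description, but their bodies, the AtJobSketch type, and the simulateBuggyTimer helper are assumptions made for illustration. A tick cap stands in for the real timer so the loop terminates.

```typescript
// Simplified model of the pre-fix behaviour; NOT the real openclaw code.
type Status = "ok" | "error" | "skipped";

// Hypothetical job shape, reduced to the fields the loop needs.
interface AtJobSketch {
  id: string;
  atMs: number;                    // the one-shot scheduled time
  enabled: boolean;
  nextRunAtMs: number | undefined;
}

// Assumed behaviour: an enabled `at` job always reports its original atMs,
// even when that time is already in the past.
function computeJobNextRunAtMs(job: AtJobSketch, _nowMs: number): number | undefined {
  return job.enabled ? job.atMs : undefined;
}

// A job is due when it is enabled and its next run time has passed.
function findDueJobs(jobs: AtJobSketch[], nowMs: number): AtJobSketch[] {
  return jobs.filter(
    (j) => j.enabled && j.nextRunAtMs !== undefined && j.nextRunAtMs <= nowMs,
  );
}

// Hypothetical stand-in for onTimer/armTimer: each tick runs every due job.
// Only an "ok" result disables the job, which is the bug being described.
function simulateBuggyTimer(
  jobs: AtJobSketch[],
  nowMs: number,
  maxTicks: number,
  run: (job: AtJobSketch) => Status,
): number {
  let attempts = 0;
  for (let tick = 0; tick < maxTicks; tick++) {
    const due = findDueJobs(jobs, nowMs);
    if (due.length === 0) break;   // nothing due, so the timer would go idle
    for (const job of due) {
      attempts++;
      const status = run(job);
      if (status === "ok") job.enabled = false;
      job.nextRunAtMs = computeJobNextRunAtMs(job, nowMs);
    }
  }
  return attempts;
}

const stormJob: AtJobSketch = { id: "one-shot", atMs: 1000, enabled: true, nextRunAtMs: 1000 };
const buggyAttempts = simulateBuggyTimer([stormJob], 2000, 50, () => "error");
console.log(buggyAttempts); // 50: the failing job is re-run on every tick until the cap
```

In this model a job that returns "ok" runs once and stops, while a failing job consumes every available tick, which matches the behaviour reported for main.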

Solution

Disable one-shot at jobs after any terminal status (ok, error, skipped), not just ok. One-shot jobs are inherently run-once — if the job fails or is skipped, the user can manually re-enable or re-create it.

This applies to both onTimer (timer-triggered) and executeJob (cron run command) code paths.
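A minimal sketch of the post-execution rule described above, assuming a hypothetical finishRun helper and CronJobSketch type (neither is taken from src/cron/service/timer.ts, and the "recurring" kind is only a placeholder for any non-at schedule). The point it illustrates is that the status is recorded but no longer influences whether an at job is disabled.

```typescript
// Illustrative only: names and shapes here are assumptions, not the PR's diff.
type TerminalStatus = "ok" | "error" | "skipped";

interface CronJobSketch {
  kind: "at" | "recurring";        // "recurring" is a placeholder for non-at kinds
  enabled: boolean;
  nextRunAtMs: number | undefined;
  lastStatus: TerminalStatus | undefined;
}

// Hypothetical post-execution step shared by the timer path and the manual run path.
function finishRun(
  job: CronJobSketch,
  status: TerminalStatus,
  recomputeNext: () => number | undefined,
): void {
  job.lastStatus = status;
  if (job.kind === "at") {
    // One-shot jobs are run-once: disable on ok, error, and skipped alike,
    // and clear the next run time so the job can never be found due again.
    job.enabled = false;
    job.nextRunAtMs = undefined;
    return;
  }
  // Other job kinds keep their normal rescheduling.
  job.nextRunAtMs = recomputeNext();
}

const failedJob: CronJobSketch = { kind: "at", enabled: true, nextRunAtMs: 1000, lastStatus: undefined };
finishRun(failedJob, "error", () => 1000);
console.log(failedJob.enabled, failedJob.nextRunAtMs); // false undefined
```

Routing both code paths through one helper like this is a design choice of the sketch; the PR only states that the same rule applies to onTimer and executeJob.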

Changes

  • src/cron/service/timer.ts: Simplify at job post-execution to disable on any terminal status
  • src/cron/service.runs-one-shot-main-job-disables-it.test.ts: Add regression test
  • CHANGELOG.md: Add entry referencing both issues

Testing

  • New unit test: "disables a one-shot at job after failure to prevent retry storm"
  • All 87 cron tests pass (pnpm vitest run src/cron/)
  • Real-environment verification with standalone script:

| Branch | Run attempts (5s) | Job enabled | Job status |
| --- | --- | --- | --- |
| main (before fix) | 1,933 | true (still retrying) | error |
| fix/cron-at-retry-storm-11438 | 1 | false (disabled) | error |

@lailoo
Contributor Author

lailoo commented Feb 7, 2026

Real-environment verification

Reproduced the retry storm on main and verified the fix on the PR branch using a standalone script that creates a CronService with a failing runIsolatedAgentJob and a past-due at job:

| Branch | Run attempts (5s) | Job enabled | Job status |
| --- | --- | --- | --- |
| main (before fix) | 1,933 | true (still retrying) | error |
| fix/cron-at-retry-storm-11438 | 1 | false (disabled) | error |

The fix correctly disables one-shot at jobs after failure, preventing the infinite retry loop.

@lailoo changed the title from "Cron: disable one-shot at jobs on failure to prevent retry storms (#11438)" to "Cron: disable one-shot at jobs after any terminal status to prevent retry storms (#11438, #11452)" on Feb 7, 2026
@tyler6204
Member

Superseded by #11641 (merge commit: 8fae55e). Closing to reduce duplicate PR noise. Please open a new PR only if there is additional scope beyond this fix.

@tyler6204 tyler6204 closed this Feb 8, 2026
Development

Successfully merging this pull request may close these issues.

  • Cron: kind:"at" jobs with non-"ok" terminal status loop forever at 100% CPU
  • Cron scheduler enters infinite retry loop on model rejection