fix(cron): add failure limit and exponential backoff for isolated tasks #8578

Open

Baoxd123 wants to merge 3 commits into openclaw:main from Baoxd123:fix/cron-failure-protection

Conversation

@Baoxd123 Baoxd123 commented Feb 4, 2026

Fixes #8520

What's Fixed

Prevents infinite retry loops in the cron scheduler for isolated tasks.

Changes

  • Added MAX_CONSECUTIVE_FAILURES (3 failures max)
  • Track consecutive failures in CronJobState
  • Auto-disable jobs after 3 consecutive failures
  • Implement exponential backoff (1s, 2s, 4s retry delays); see the sketch below
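
As a rough sketch of how these pieces fit together (stand-in types and helper, not the actual CronJob/CronJobState definitions or the finish() path in src/cron/service/timer.ts; the isolated flag name and the ordering of the disable check relative to the backoff are assumptions based on the description above):

// Minimal sketch of the failure-protection idea; stand-in types, not the
// real implementation in src/cron/service/timer.ts.
const MAX_CONSECUTIVE_FAILURES = 3;
const BASE_BACKOFF_MS = 1000; // 1s, doubled per consecutive failure

type RunStatus = "ok" | "error" | "skipped";

interface SketchJobState {
  consecutiveFailures?: number;
  nextRunAtMs?: number;
}

interface SketchJob {
  enabled: boolean;
  isolated: boolean; // assumed flag distinguishing isolated from main-session tasks
  state: SketchJobState;
}

// Decide what happens after a run finishes: disable the job, retry with
// backoff, or fall back to the regular schedule.
function handleJobFinish(
  job: SketchJob,
  status: RunStatus,
  endedAtMs: number,
  regularNextRunAtMs: number,
): void {
  if (status === "error") {
    job.state.consecutiveFailures = (job.state.consecutiveFailures ?? 0) + 1;
  } else {
    // Any non-error outcome clears the counter.
    job.state.consecutiveFailures = 0;
  }

  const failures = job.state.consecutiveFailures ?? 0;

  if (failures >= MAX_CONSECUTIVE_FAILURES) {
    // Too many consecutive errors: stop rescheduling to avoid API cooldowns.
    job.enabled = false;
    return;
  }

  if (status === "error" && job.isolated) {
    // Exponential backoff: 1000 * 2^(n-1) ms, i.e. 1s after the first
    // failure, 2s after the second.
    job.state.nextRunAtMs = endedAtMs + BASE_BACKOFF_MS * 2 ** (failures - 1);
    return;
  }

  job.state.nextRunAtMs = regularNextRunAtMs;
}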

Testing

  • All 5054 tests pass ✅
  • 6 new comprehensive test cases pass

🤖 AI-Assisted Development

Built with Gemini 3 Pro, Claude 4.5 Haiku & Opus, and OpenClaw. The code is fully tested and ready for production, and I confirm understanding of the implementation.

Greptile Overview

Greptile Summary

This PR adds failure protection to the cron scheduler by introducing CronJobState.consecutiveFailures, enforcing a max consecutive error threshold (3) that auto-disables a job, and applying exponential backoff for isolated jobs on error. It also adds a focused vitest suite covering failure counting, disabling, and backoff behavior.

The changes are centered in src/cron/service/timer.ts’s executeJob() finish path, which now mutates job state based on status and adjusts nextRunAtMs scheduling decisions accordingly. src/cron/types.ts extends the persisted job state to track failures.
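
The persisted-state change is small; roughly (consecutiveFailures and nextRunAtMs appear in the summary above, the remaining field is an illustrative placeholder rather than the real shape of src/cron/types.ts):

// Sketch of the CronJobState extension; only consecutiveFailures is new.
interface CronJobState {
  nextRunAtMs?: number;
  lastRunStatus?: "ok" | "error" | "skipped"; // placeholder, not the actual field set
  // New in this PR: count of consecutive failed runs, reset on success.
  consecutiveFailures?: number;
}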

Confidence Score: 3/5

  • This PR is likely safe to merge, but there are a couple of behavioral edge cases worth clarifying before landing.
  • Core approach (failure counter + disable + backoff) is straightforward and well-covered by tests, but skipped handling can unintentionally carry failures forward, and removing the finally-block resync may affect long-running job scheduling/metadata. Neither is a compile-time issue, but both can cause surprising runtime behavior.
  • src/cron/service/timer.ts (2/5)

- Add MAX_CONSECUTIVE_FAILURES (3) to prevent infinite retry loops
- Track consecutiveFailures in CronJobState
- Auto-disable jobs after 3 consecutive failures
- Implement exponential backoff for isolated tasks (1s, 2s, 4s)
- Add comprehensive test suite (6 test cases)
- Prevents API rate limit cooldown from failed cron tasks

Fixes openclaw#8520
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 3 comments

Comment on lines 101 to 107 in src/cron/service/timer.ts
// Track consecutive failures
if (status === "error") {
  job.state.consecutiveFailures = (job.state.consecutiveFailures ?? 0) + 1;
} else if (status === "ok") {
  // Reset failure counter on success
  job.state.consecutiveFailures = 0;
}

[P0] consecutiveFailures is not reset on status === "skipped", so skips can carry forward prior failures and cause an auto-disable sooner than intended.

Example: error → error → skipped → error will disable the job on that last error (counter goes 2→2→3) even though the skipped run likely indicates “not applicable / no-op” rather than continued failure. If the intent is “only consecutive errors count,” consider resetting on non-error statuses (e.g., treat skipped like ok for the counter), or explicitly decide and document otherwise.
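
If the intent is indeed "only consecutive errors count", one concise shape for the fix would be (a fragment against the snippet quoted above, not the actual patch):

// Track consecutive failures; any non-error outcome clears the counter,
// so skipped runs no longer carry prior failures forward.
if (status === "error") {
  job.state.consecutiveFailures = (job.state.consecutiveFailures ?? 0) + 1;
} else {
  // "ok" and "skipped" both reset the counter.
  job.state.consecutiveFailures = 0;
}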

Comment on lines 265 to 268 in src/cron/service/timer.ts
} finally {
  job.updatedAtMs = nowMs;
  if (!opts.forced && job.enabled && !deleted) {
    // Keep nextRunAtMs in sync in case the schedule advanced during a long run.
    job.state.nextRunAtMs = computeJobNextRunAtMs(job, state.deps.nowMs());
  }
  // Note: Do not override job.state.nextRunAtMs here, as it has already been
  // set correctly in the finish() callback with proper failure protection logic.

[P1] The finally block no longer recomputes nextRunAtMs using the current time (state.deps.nowMs()), and now always sets job.updatedAtMs = nowMs (the function arg) even if the job ran for a long time.

Previously this acted as a safety net (“schedule advanced during a long run”) and kept updatedAtMs aligned with completion. With the change, long-running jobs can end with stale updatedAtMs, and nextRunAtMs is locked to endedAt computed inside finish(). If your scheduler expects nextRunAtMs to be based on the wall clock at completion (not start), consider using endedAt for updatedAtMs and/or restoring the sync logic (while still preserving failure-protection behavior).
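
One possible shape for that, keeping the failure-protection decisions made in finish(), would be (a fragment sketch; the consecutiveFailures guard is an assumption, not the actual patch):

} finally {
  const endedAtMs = state.deps.nowMs();
  // Reflect the actual completion time rather than the nowMs captured at run start.
  job.updatedAtMs = endedAtMs;
  if (!opts.forced && job.enabled && !deleted && (job.state.consecutiveFailures ?? 0) === 0) {
    // Resync only healthy jobs; failing jobs keep the backoff/disable
    // decision already made in finish().
    job.state.nextRunAtMs = computeJobNextRunAtMs(job, endedAtMs);
  }
}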

Comment on lines +105 to +113 in src/cron/service/timer.failure-protection.test.ts

// First failure: should schedule retry after 1s (1000 * 2^0)
await executeJob(mockState as CronServiceState, testJob, nowMs, { forced: false });
expect(testJob.state.consecutiveFailures).toBe(1);
const firstBackoff = testJob.state.nextRunAtMs ? testJob.state.nextRunAtMs - nowMs : 0;
expect(firstBackoff).toBe(1000); // 1 second

// Second failure: should schedule retry after 2s (1000 * 2^1)
nowMs += 1000;

[P3] This test name/comment and expectation conflict: the test is titled "should not auto-disable main session tasks at same threshold" but it ultimately asserts expect(testJob.enabled).toBe(false) after repeated failures.

Consider renaming to reflect the actual behavior being verified (e.g., that main jobs do respect MAX_CONSECUTIVE_FAILURES) to avoid confusion when maintaining the suite.

Andy added 2 commits February 4, 2026 01:28
- Reset consecutiveFailures on skipped status (skipped = not applicable/no-op)
- Use state.deps.nowMs() for updatedAtMs to reflect actual completion time
- Rename test to match actual behavior (main tasks also auto-disable)

Development

Successfully merging this pull request may close these issues.

[Bug]: ⚠️ Cron isolated tasks without failure limit trigger infinite retry loop → API cooldown
