fix(cron): add failure limit and exponential backoff for isolated tasks #8578
Baoxd123 wants to merge 3 commits into openclaw:main from
Conversation
- Add MAX_CONSECUTIVE_FAILURES (3) to prevent infinite retry loops
- Track consecutiveFailures in CronJobState
- Auto-disable jobs after 3 consecutive failures
- Implement exponential backoff for isolated tasks (1s, 2s, 4s)
- Add comprehensive test suite (6 test cases)
- Prevents API rate limit cooldown from failed cron tasks

Fixes openclaw#8520
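A minimal sketch of how the threshold and backoff described above could fit together. The constant values come from the description and tests, but the helper names and wiring are illustrative assumptions, not the project's actual timer code.

```ts
// Illustrative sketch only; names and structure are assumptions.
const MAX_CONSECUTIVE_FAILURES = 3;
const BASE_BACKOFF_MS = 1_000;

// 1st failure -> 1000ms, 2nd -> 2000ms, 3rd -> 4000ms
function backoffMsFor(consecutiveFailures: number): number {
  return BASE_BACKOFF_MS * 2 ** (consecutiveFailures - 1);
}

// Once the counter reaches the limit, the job is auto-disabled instead of retried.
function shouldAutoDisable(consecutiveFailures: number): boolean {
  return consecutiveFailures >= MAX_CONSECUTIVE_FAILURES;
}
```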
src/cron/service/timer.ts

```ts
// Track consecutive failures
if (status === "error") {
  job.state.consecutiveFailures = (job.state.consecutiveFailures ?? 0) + 1;
} else if (status === "ok") {
  // Reset failure counter on success
  job.state.consecutiveFailures = 0;
}
```
[P0] `consecutiveFailures` is not reset on `status === "skipped"`, so skips can carry forward prior failures and cause an auto-disable sooner than intended.
Example: error → error → skipped → error will disable the job on that last error (counter goes 2→2→3) even though the skipped run likely indicates "not applicable / no-op" rather than continued failure. If the intent is "only consecutive *errors* count," consider resetting on non-error statuses (e.g., treat `skipped` like `ok` for the counter), or explicitly decide and document otherwise.
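A minimal sketch of the suggested counter fix, assuming the `status` values shown in the snippet above; the helper is hypothetical, not the project's code.

```ts
// Only consecutive errors count; "ok" and "skipped" both clear the streak,
// since a skipped run is a no-op rather than a continued failure.
type RunStatus = "ok" | "skipped" | "error";

function nextConsecutiveFailures(prev: number | undefined, status: RunStatus): number {
  return status === "error" ? (prev ?? 0) + 1 : 0;
}

// e.g. in the finish path:
// job.state.consecutiveFailures =
//   nextConsecutiveFailures(job.state.consecutiveFailures, status);
```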
src/cron/service/timer.ts
Outdated
```ts
} finally {
  job.updatedAtMs = nowMs;
  if (!opts.forced && job.enabled && !deleted) {
    // Keep nextRunAtMs in sync in case the schedule advanced during a long run.
    job.state.nextRunAtMs = computeJobNextRunAtMs(job, state.deps.nowMs());
  }
  // Note: Do not override job.state.nextRunAtMs here, as it has already been
  // set correctly in the finish() callback with proper failure protection logic.
```
[P1] The `finally` block no longer recomputes `nextRunAtMs` using the *current* time (`state.deps.nowMs()`), and now always sets `job.updatedAtMs = nowMs` (the function arg) even if the job ran for a long time.
Previously this acted as a safety net ("schedule advanced during a long run") and kept `updatedAtMs` aligned with completion. With the change, long-running jobs can end with stale `updatedAtMs`, and `nextRunAtMs` is locked to `endedAt` computed inside `finish()`. If your scheduler expects `nextRunAtMs` to be based on the wall clock at completion (not start), consider using `endedAt` for `updatedAtMs` and/or restoring the sync logic (while still preserving failure-protection behavior).
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
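A self-contained sketch of one way to address this; the helper name, parameter shapes, and the "backoff already applied" guard are assumptions, not the project's actual `executeJob()` code.

```ts
// Hypothetical shapes standing in for the real cron job types.
type JobState = { nextRunAtMs?: number; consecutiveFailures?: number };
type Job = { enabled: boolean; updatedAtMs: number; state: JobState };

function finalizeRun(
  job: Job,
  deleted: boolean,
  forced: boolean,
  nowMs: () => number, // e.g. state.deps.nowMs
  computeNextRunAtMs: (job: Job, atMs: number) => number, // e.g. computeJobNextRunAtMs
): void {
  // Stamp completion with the wall clock at the end of the run,
  // not the timestamp captured when the run started.
  const endedAtMs = nowMs();
  job.updatedAtMs = endedAtMs;

  // Keep the original safety net (the schedule may have advanced during a long run),
  // but skip it when the finish() path already applied backoff/disable scheduling.
  const backoffApplied = (job.state.consecutiveFailures ?? 0) > 0;
  if (!forced && job.enabled && !deleted && !backoffApplied) {
    job.state.nextRunAtMs = computeNextRunAtMs(job, endedAtMs);
  }
}
```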
src/cron/service/timer.failure-protection.test.ts

```ts
// First failure: should schedule retry after 1s (1000 * 2^0)
await executeJob(mockState as CronServiceState, testJob, nowMs, { forced: false });
expect(testJob.state.consecutiveFailures).toBe(1);
const firstBackoff = testJob.state.nextRunAtMs ? testJob.state.nextRunAtMs - nowMs : 0;
expect(firstBackoff).toBe(1000); // 1 second

// Second failure: should schedule retry after 2s (1000 * 2^1)
nowMs += 1000;
```
[P3] This test name/comment and expectation conflict: the test is titled `"should not auto-disable main session tasks at same threshold"` but it ultimately asserts `expect(testJob.enabled).toBe(false)` after repeated failures.
Consider renaming to reflect the actual behavior being verified (e.g., that main jobs *do* respect `MAX_CONSECUTIVE_FAILURES`) to avoid confusion when maintaining the suite.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
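For example, a hypothetical rename in the suite's vitest style; the title is illustrative wording only, and the existing test body would stay as-is.

```ts
import { it } from "vitest";

// Hypothetical title that matches what the assertions verify:
// main session tasks also auto-disable at the failure threshold.
it("auto-disables main session tasks after MAX_CONSECUTIVE_FAILURES consecutive errors", async () => {
  // ...existing arrangement: repeated failing runs via executeJob(), then
  // expect(testJob.enabled).toBe(false);
});
```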
- Reset consecutiveFailures on skipped status (skipped = not applicable/no-op)
- Use state.deps.nowMs() for updatedAtMs to reflect actual completion time
- Rename test to match actual behavior (main tasks also auto-disable)
bfc1ccb to f92900f
Fixes #8520
What's Fixed
Prevents infinite retry loops in the cron scheduler for isolated tasks.
Changes
Testing
🤖 AI-Assisted Development
Built with Gemini 3 Pro, Claude 4.5 Haiku & Opus, and OpenClaw. The code is fully tested and ready for production. The author confirms understanding of the implementation.
Greptile Overview
Greptile Summary
This PR adds failure protection to the cron scheduler by introducing `CronJobState.consecutiveFailures`, enforcing a max consecutive error threshold (3) that auto-disables a job, and applying exponential backoff for isolated jobs on error. It also adds a focused vitest suite covering failure counting, disabling, and backoff behavior.

The changes are centered in `src/cron/service/timer.ts`'s `executeJob()` finish path, which now mutates job state based on `status` and adjusts `nextRunAtMs` scheduling decisions accordingly. `src/cron/types.ts` extends the persisted job state to track failures.

Confidence Score: 3/5

`skipped` handling can unintentionally carry failures forward, and removing the `finally`-block resync may affect long-running job scheduling/metadata. Neither is a compile-time issue, but both can cause surprising runtime behavior. (2/5)