fix: prevent cron jobs from skipping execution when nextRunAtMs advances#14019
fix: prevent cron jobs from skipping execution when nextRunAtMs advances#14019WalterSumbon wants to merge 1 commit intoopenclaw:mainfrom
Conversation
Fixes openclaw#13992 **Problem:** Cron jobs were having their nextRunAtMs advanced to future times without executing. The scheduler would wake up, mark the time as processed, but never run the job - causing silent skips. **Root Cause:** When runDueJobs() found no due jobs (due to timing issues, race conditions, or bugs in job filtering), onTimer() would still call recomputeNextRuns() which unconditionally recomputed nextRunAtMs for all enabled jobs. This advanced past-due nextRunAtMs values to their next scheduled times WITHOUT executing the jobs first. **Fix:** 1. Modified runDueJobs() to return the count of executed jobs 2. Created recomputeNextRunsForMaintenance() that only performs maintenance tasks (clearing disabled jobs, stuck markers) but does NOT recompute nextRunAtMs for enabled jobs with existing values 3. In onTimer(), use maintenance-only recompute when no jobs were executed to prevent silently advancing past-due nextRunAtMs values **Testing:** - Added comprehensive regression tests for openclaw#13992 - Tests verify past-due nextRunAtMs are NOT advanced during maintenance - Tests verify missing nextRunAtMs ARE still computed - Tests verify disabled jobs and stuck markers are still cleaned up
Additional Comments (3)
Prompt To Fix With AIThis is a comment left during a code review.
Path: src/cron/service/timer.ts
Line: 1:3
Comment:
**Regression: scheduler behavior changed**
`src/cron/service/timer.ts` in this commit appears to be an older/simplified timer implementation that drops prior behavior from `main` (e.g., max 60s wake clamp + drift recovery, debug logging, job execution timeout/backoff logic, session reaper piggyback, and `findDueJobs` handling of missing `job.state`). If this file replacement is intentional it needs a clear migration path; otherwise it looks like an accidental overwrite that will change cron semantics and remove safety nets.
How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix With AIThis is a comment left during a code review.
Path: src/cron/service/timer.ts
Line: 41:44
Comment:
**Unsafe `job.state` access**
`runDueJobs()` assumes `j.state` exists (`j.state.runningAtMs`, `j.state.nextRunAtMs`), but in the existing codebase jobs defensively initialize `job.state = {}` when missing. If a persisted/migrated job lacks `state`, this will throw at runtime during a timer tick. Consider mirroring the `findDueJobs()` pattern from `main` (ensure `j.state` is initialized before reads).
How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix With AIThis is a comment left during a code review.
Path: src/cron/service/jobs.ts
Line: 15:17
Comment:
**Stuck-run threshold mismatch**
`recomputeNextRunsForMaintenance()` uses `STUCK_RUN_MS` which is currently `2 * 60 * 60 * 1000` (2 hours) in this module, but the new regression test asserts a 10-minute threshold. As written, the test description/expectations don’t match production behavior, and the “stuck marker” test will be flaky/incorrect if the threshold is actually 2 hours.
How can I resolve this? If you propose a fix, please make it concise. |
- openclaw#14019: prevent cron jobs from skipping execution (nextRunAtMs regression) - openclaw#13931: drain active agent turns before gateway restart - openclaw#14023: filter skills watcher file types to prevent FD exhaustion - openclaw#13983: use requested agentId for isolated cron job auth resolution - openclaw#13956: fix Grok web search returning empty content (xAI Responses API) - openclaw#14018: auto-remind AI to check workspace context files on session reset - openclaw#13960: preserve structured config validation errors in UI - openclaw#13951: improve provider error logging in gateway logs - openclaw#14021: optional memory flush before manual /compact command Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
Closing this PR due to base branch sync issues. The fix has been updated to preserve all existing scheduler logic. Will create a new PR with the corrected version. |
Fixes #13992
Problem:
Cron jobs were having their
nextRunAtMsadvanced to future times without executing. The scheduler would wake up, mark the time as processed, but never run the job - causing silent skips.Root Cause:
When
runDueJobs()found no due jobs (due to timing issues, race conditions, or bugs in job filtering),onTimer()would still callrecomputeNextRuns()which unconditionally recomputednextRunAtMsfor all enabled jobs. This advanced past-duenextRunAtMsvalues to their next scheduled times WITHOUT executing the jobs first.Fix:
runDueJobs()to return the count of executed jobsrecomputeNextRunsForMaintenance()that only performs maintenance tasks (clearing disabled jobs, stuck markers) but does NOT recomputenextRunAtMsfor enabled jobs with existing valuesonTimer(), use maintenance-only recompute when no jobs were executed to prevent silently advancing past-duenextRunAtMsvaluesTesting:
nextRunAtMsare NOT advanced during maintenancenextRunAtMsARE still computedChanges:
src/cron/service/timer.ts- Modified onTimer and runDueJobssrc/cron/service/jobs.ts- Added recomputeNextRunsForMaintenance functionsrc/cron/service.issue-13992-regression.test.ts- Complete regression test suiteGreptile Overview
Greptile Summary
This PR changes the cron scheduler tick behavior to avoid advancing
nextRunAtMsfor enabled jobs when a timer tick finds no due jobs, by introducing a maintenance-only recompute path and makingrunDueJobs()return an executed count.Key additions:
recomputeNextRunsForMaintenance()insrc/cron/service/jobs.tsto only clear disabled jobs / stuck markers and fill missingnextRunAtMswithout recalculating existing values.src/cron/service/timer.tsuses maintenance-only recompute whenrunDueJobs()executes zero jobs.However, the current
timer.tsin this commit appears to replace a much more feature-complete implementation frommain(timeouts/backoff, drift handling, logging, session reaper, defensivejob.stateinit). If unintentional, this is a merge-blocking regression.Confidence Score: 2/5
src/cron/service/timer.tsin this commit appears to drop substantial existing scheduler logic frommain(timeouts/backoff, drift recovery, session reaping, and defensive state handling). Additionally, the newrunDueJobs()readsjob.statewithout initialization and the regression test expectations don’t match the production stuck-run threshold constant.