Skip to content

fix: prevent cron jobs from skipping execution when nextRunAtMs advances#14019

Closed
WalterSumbon wants to merge 1 commit intoopenclaw:mainfrom
WalterSumbon:fix/cron-nextrunatms-skip-bug
Closed

fix: prevent cron jobs from skipping execution when nextRunAtMs advances#14019
WalterSumbon wants to merge 1 commit intoopenclaw:mainfrom
WalterSumbon:fix/cron-nextrunatms-skip-bug

Conversation

@WalterSumbon
Copy link
Contributor

@WalterSumbon WalterSumbon commented Feb 11, 2026

Fixes #13992

Problem:
Cron jobs were having their nextRunAtMs advanced to future times without executing. The scheduler would wake up, mark the time as processed, but never run the job - causing silent skips.

Root Cause:
When runDueJobs() found no due jobs (due to timing issues, race conditions, or bugs in job filtering), onTimer() would still call recomputeNextRuns() which unconditionally recomputed nextRunAtMs for all enabled jobs. This advanced past-due nextRunAtMs values to their next scheduled times WITHOUT executing the jobs first.

Fix:

  1. Modified runDueJobs() to return the count of executed jobs
  2. Created recomputeNextRunsForMaintenance() that only performs maintenance tasks (clearing disabled jobs, stuck markers) but does NOT recompute nextRunAtMs for enabled jobs with existing values
  3. In onTimer(), use maintenance-only recompute when no jobs were executed to prevent silently advancing past-due nextRunAtMs values

Testing:

  • Added comprehensive regression tests for Cron jobs advance nextRunAtMs without executing #13992
  • Tests verify past-due nextRunAtMs are NOT advanced during maintenance
  • Tests verify missing nextRunAtMs ARE still computed
  • Tests verify disabled jobs and stuck markers are still cleaned up

Changes:

  • src/cron/service/timer.ts - Modified onTimer and runDueJobs
  • src/cron/service/jobs.ts - Added recomputeNextRunsForMaintenance function
  • src/cron/service.issue-13992-regression.test.ts - Complete regression test suite

Greptile Overview

Greptile Summary

This PR changes the cron scheduler tick behavior to avoid advancing nextRunAtMs for enabled jobs when a timer tick finds no due jobs, by introducing a maintenance-only recompute path and making runDueJobs() return an executed count.

Key additions:

  • recomputeNextRunsForMaintenance() in src/cron/service/jobs.ts to only clear disabled jobs / stuck markers and fill missing nextRunAtMs without recalculating existing values.
  • src/cron/service/timer.ts uses maintenance-only recompute when runDueJobs() executes zero jobs.
  • A new regression test suite for issue Cron jobs advance nextRunAtMs without executing #13992.

However, the current timer.ts in this commit appears to replace a much more feature-complete implementation from main (timeouts/backoff, drift handling, logging, session reaper, defensive job.state init). If unintentional, this is a merge-blocking regression.

Confidence Score: 2/5

  • Not safe to merge due to likely functional regressions in cron scheduler behavior.
  • The fix idea is reasonable, but the src/cron/service/timer.ts in this commit appears to drop substantial existing scheduler logic from main (timeouts/backoff, drift recovery, session reaping, and defensive state handling). Additionally, the new runDueJobs() reads job.state without initialization and the regression test expectations don’t match the production stuck-run threshold constant.
  • src/cron/service/timer.ts, src/cron/service/jobs.ts, src/cron/service.issue-13992-regression.test.ts

Fixes openclaw#13992

**Problem:**
Cron jobs were having their nextRunAtMs advanced to future times without
executing. The scheduler would wake up, mark the time as processed, but
never run the job - causing silent skips.

**Root Cause:**
When runDueJobs() found no due jobs (due to timing issues, race conditions,
or bugs in job filtering), onTimer() would still call recomputeNextRuns()
which unconditionally recomputed nextRunAtMs for all enabled jobs. This
advanced past-due nextRunAtMs values to their next scheduled times WITHOUT
executing the jobs first.

**Fix:**
1. Modified runDueJobs() to return the count of executed jobs
2. Created recomputeNextRunsForMaintenance() that only performs maintenance
   tasks (clearing disabled jobs, stuck markers) but does NOT recompute
   nextRunAtMs for enabled jobs with existing values
3. In onTimer(), use maintenance-only recompute when no jobs were executed
   to prevent silently advancing past-due nextRunAtMs values

**Testing:**
- Added comprehensive regression tests for openclaw#13992
- Tests verify past-due nextRunAtMs are NOT advanced during maintenance
- Tests verify missing nextRunAtMs ARE still computed
- Tests verify disabled jobs and stuck markers are still cleaned up
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 11, 2026

Additional Comments (3)

src/cron/service/timer.ts
Regression: scheduler behavior changed

src/cron/service/timer.ts in this commit appears to be an older/simplified timer implementation that drops prior behavior from main (e.g., max 60s wake clamp + drift recovery, debug logging, job execution timeout/backoff logic, session reaper piggyback, and findDueJobs handling of missing job.state). If this file replacement is intentional it needs a clear migration path; otherwise it looks like an accidental overwrite that will change cron semantics and remove safety nets.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cron/service/timer.ts
Line: 1:3

Comment:
**Regression: scheduler behavior changed**

`src/cron/service/timer.ts` in this commit appears to be an older/simplified timer implementation that drops prior behavior from `main` (e.g., max 60s wake clamp + drift recovery, debug logging, job execution timeout/backoff logic, session reaper piggyback, and `findDueJobs` handling of missing `job.state`). If this file replacement is intentional it needs a clear migration path; otherwise it looks like an accidental overwrite that will change cron semantics and remove safety nets.

How can I resolve this? If you propose a fix, please make it concise.

src/cron/service/timer.ts
Unsafe job.state access

runDueJobs() assumes j.state exists (j.state.runningAtMs, j.state.nextRunAtMs), but in the existing codebase jobs defensively initialize job.state = {} when missing. If a persisted/migrated job lacks state, this will throw at runtime during a timer tick. Consider mirroring the findDueJobs() pattern from main (ensure j.state is initialized before reads).

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cron/service/timer.ts
Line: 41:44

Comment:
**Unsafe `job.state` access**

`runDueJobs()` assumes `j.state` exists (`j.state.runningAtMs`, `j.state.nextRunAtMs`), but in the existing codebase jobs defensively initialize `job.state = {}` when missing. If a persisted/migrated job lacks `state`, this will throw at runtime during a timer tick. Consider mirroring the `findDueJobs()` pattern from `main` (ensure `j.state` is initialized before reads).

How can I resolve this? If you propose a fix, please make it concise.

src/cron/service/jobs.ts
Stuck-run threshold mismatch

recomputeNextRunsForMaintenance() uses STUCK_RUN_MS which is currently 2 * 60 * 60 * 1000 (2 hours) in this module, but the new regression test asserts a 10-minute threshold. As written, the test description/expectations don’t match production behavior, and the “stuck marker” test will be flaky/incorrect if the threshold is actually 2 hours.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cron/service/jobs.ts
Line: 15:17

Comment:
**Stuck-run threshold mismatch**

`recomputeNextRunsForMaintenance()` uses `STUCK_RUN_MS` which is currently `2 * 60 * 60 * 1000` (2 hours) in this module, but the new regression test asserts a 10-minute threshold. As written, the test description/expectations don’t match production behavior, and the “stuck marker” test will be flaky/incorrect if the threshold is actually 2 hours.

How can I resolve this? If you propose a fix, please make it concise.

TWFBusiness added a commit to TWFBusiness/openclaw-tw that referenced this pull request Feb 11, 2026
- openclaw#14019: prevent cron jobs from skipping execution (nextRunAtMs regression)
- openclaw#13931: drain active agent turns before gateway restart
- openclaw#14023: filter skills watcher file types to prevent FD exhaustion
- openclaw#13983: use requested agentId for isolated cron job auth resolution
- openclaw#13956: fix Grok web search returning empty content (xAI Responses API)
- openclaw#14018: auto-remind AI to check workspace context files on session reset
- openclaw#13960: preserve structured config validation errors in UI
- openclaw#13951: improve provider error logging in gateway logs
- openclaw#14021: optional memory flush before manual /compact command

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@WalterSumbon
Copy link
Contributor Author

Closing this PR due to base branch sync issues. The fix has been updated to preserve all existing scheduler logic. Will create a new PR with the corrected version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cron jobs advance nextRunAtMs without executing

1 participant

Comments