Skip to content

fix: clamp cron timer delay to 60s as drift guard#10580

Closed
zhiyuanw101 wants to merge 1 commit intoopenclaw:mainfrom
zhiyuanw101:fix/cron-drift-guard
Closed

fix: clamp cron timer delay to 60s as drift guard#10580
zhiyuanw101 wants to merge 1 commit intoopenclaw:mainfrom
zhiyuanw101:fix/cron-drift-guard

Conversation

@zhiyuanw101
Copy link

@zhiyuanw101 zhiyuanw101 commented Feb 6, 2026

Problem

Fixes #10564

The cron scheduler relies solely on setTimeout to trigger job execution. If the timer is silently lost — due to macOS sleep/wake, GC pause, process throttling, or any other runtime disruption — no recovery mechanism exists. The missed job slot is never executed until an external API call (e.g. cron list) happens to call ensureLoadedrecomputeNextRunsarmTimer.

Observed behavior: A cron job scheduled for 7:00 AM PST (0 7-19 * * 1-5) did not fire until manually triggered 2+ hours later. Gateway logs showed the timer was correctly armed after job creation but never fired, likely due to macOS power management suspending the Node.js event loop during sleep.

Solution

Clamp the maximum setTimeout delay in armTimer() to 60 seconds via DRIFT_GUARD_MAX_DELAY_MS. This means:

  • When a job is due in <60s: timer fires at the exact due time (no change)
  • When a job is due in >60s: timer fires every 60s, onTimer() finds no overdue jobs, does nothing, and re-arms
  • When a timer was lost (sleep/wake): the next 60s wake catches the overdue job within at most 1 minute

This is simpler and more robust than a separate setInterval drift guard:

  • No second timer to manage or leak
  • No interaction issues with fake timers in tests
  • Same setTimeout + re-arm pattern already used by the scheduler
  • Minimal overhead (~1 timer wake/minute when idle)

Changes

  • src/cron/service/timer.ts: Add DRIFT_GUARD_MAX_DELAY_MS = 60_000 constant, clamp delay in armTimer()
  • src/cron/service.drift-guard-catches-overdue-jobs.test.ts: Two new tests validating the drift guard behavior

Test Results

All 60 tests pass (58 existing + 2 new):

 Test Files  14 passed (14)
      Tests  60 passed (60)

Greptile Overview

Greptile Summary

  • Adds a drift-guard to the cron scheduler by clamping setTimeout delays in armTimer() to a maximum of 60s, so overdue jobs can be detected even if the original timer was lost (sleep/GC/throttling).
  • Introduces DRIFT_GUARD_MAX_DELAY_MS and updates timer arming to use the minimum of computed delay, Node’s max timeout, and the drift-guard cap.
  • Adds two Vitest tests that simulate timer loss and validate that the scheduler re-arms and eventually runs overdue jobs, while not executing far-future jobs early.

Confidence Score: 4/5

  • This PR is generally safe to merge; the core timer change is small and the added tests cover the intended drift-guard behavior.
  • The change is localized to clamping the timer delay and is backed by new tests that exercise overdue-job recovery and far-future behavior. The main outstanding concern is test resource cleanup patterns that can leak temp dirs/timers on assertion failure, which can cause flakiness in CI or local runs.
  • src/cron/service.drift-guard-catches-overdue-jobs.test.ts

(2/5) Greptile learns from your feedback when you react with thumbs up/down!

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment on lines 34 to 36
afterEach(() => {
vi.useRealTimers();
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleanup skipped on failure

Both tests call cron.stop()/store.cleanup() at the end of the test body, so if an assertion throws earlier the temp dir is leaked and the CronService may keep timers open until afterEach runs. This can make the suite flaky over time (especially when running individual tests repeatedly). Consider moving cron.stop() + store.cleanup() into afterEach or a try/finally around the test body so cleanup always happens even on failure.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cron/service.drift-guard-catches-overdue-jobs.test.ts
Line: 34:36

Comment:
**Cleanup skipped on failure**

Both tests call `cron.stop()`/`store.cleanup()` at the end of the test body, so if an assertion throws earlier the temp dir is leaked and the CronService may keep timers open until `afterEach` runs. This can make the suite flaky over time (especially when running individual tests repeatedly). Consider moving `cron.stop()` + `store.cleanup()` into `afterEach` or a `try/finally` around the test body so cleanup always happens even on failure.

How can I resolve this? If you propose a fix, please make it concise.

The cron scheduler relies solely on setTimeout to trigger job execution.
If the timer is silently lost (macOS sleep/wake, GC pause, process
throttling), no recovery mechanism exists until an external API call
happens to re-arm the timer.

Clamp the maximum setTimeout delay in armTimer() to 60 seconds. When the
timer fires before a job is due, onTimer() finds no overdue jobs, does
nothing, and re-arms — effectively polling every 60s. When a job IS
overdue (because the previous timer was lost), onTimer() catches it
within at most 60 seconds of the missed slot.

This is simpler and more robust than a separate setInterval drift guard:
- No second timer to manage or leak
- No interaction issues with fake timers in tests
- Same setTimeout + re-arm pattern already used by the scheduler
- Minimal overhead (one timer wake per minute when idle)
@tyler6204
Copy link
Member

Superseded by #11641 (merge commit: 8fae55e). Closing to reduce duplicate PR noise. Please open a new PR only if there is additional scope beyond this fix.

@tyler6204 tyler6204 closed this Feb 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cron scheduler stuck with stale nextWakeAtMs after gateway restarts — misses all scheduled slots until external API call

2 participants

Comments