fix: Re-arm cron service timer with a delay if already running to prevent missed events. #2990

kuzeyc wants to merge 2 commits into openclaw:main
Conversation
Pull request overview
This PR addresses a race condition in the cron service timer mechanism where onTimer() calls that occurred while another cron operation was in progress (state.running = true) would return silently without re-arming the timer, causing scheduled jobs to be missed.
Changes:
- Added a 500ms retry mechanism in onTimer() when state.running is true to ensure the timer eventually gets re-armed
- Added test coverage for the retry behavior when concurrent operations overlap
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/cron/service/timer.ts | Implements retry logic that schedules a new timer after 500ms when onTimer() is called during an active operation |
| src/cron/service.timer-rearm-on-running.test.ts | Adds test cases to verify retry scheduling and normal execution paths |
```ts
setTimeout(() => {
  void onTimer(state).catch((err) => {
    state.deps.log.error({ err: String(err) }, "cron: timer retry failed");
  });
}, 500).unref?.();
```
The retry mechanism lacks a maximum retry limit, which could lead to infinite recursion if state.running never becomes false (e.g., due to an unhandled exception preventing the finally block from executing, or state corruption). Consider adding a retry counter and a maximum retry limit (e.g., 10 retries) before giving up and logging an error. This would prevent indefinite recursion and resource exhaustion.
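For illustration, a minimal sketch of that bounded retry, assuming a hypothetical retryCount field is added to CronServiceState (neither the field nor the MAX_TIMER_RETRIES constant exists in the codebase today):

```ts
// Sketch only: retryCount and MAX_TIMER_RETRIES are assumed additions, not existing code.
const MAX_TIMER_RETRIES = 10;

export async function onTimer(state: CronServiceState) {
  if (state.running) {
    const attempts = state.retryCount ?? 0;
    if (attempts >= MAX_TIMER_RETRIES) {
      state.deps.log.error("cron: giving up on re-arming timer after repeated contention");
      return;
    }
    state.retryCount = attempts + 1;
    setTimeout(() => {
      void onTimer(state).catch((err) => {
        state.deps.log.error({ err: String(err) }, "cron: timer retry failed");
      });
    }, 500).unref?.();
    return;
  }
  state.retryCount = 0; // reset once a tick actually runs
  // ... existing tick logic ...
}
```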
```ts
setTimeout(() => {
  void onTimer(state).catch((err) => {
    state.deps.log.error({ err: String(err) }, "cron: timer retry failed");
  });
}, 500).unref?.();
```
The retry timer is not stored in state.timer or any other state field, which means it cannot be cleared when the service is stopped. This could lead to the retry timer firing after the service has been stopped, potentially causing unexpected behavior. Consider storing the retry timer reference so it can be properly cleaned up during service shutdown, similar to how the main timer is stored in state.timer.
Suggested change:

```diff
-setTimeout(() => {
-  void onTimer(state).catch((err) => {
-    state.deps.log.error({ err: String(err) }, "cron: timer retry failed");
-  });
-}, 500).unref?.();
+if (state.timer) {
+  clearTimeout(state.timer);
+}
+state.timer = setTimeout(() => {
+  void onTimer(state).catch((err) => {
+    state.deps.log.error({ err: String(err) }, "cron: timer retry failed");
+  });
+}, 500);
+state.timer.unref?.();
```
```ts
setTimeout(() => {
  void onTimer(state).catch((err) => {
    state.deps.log.error({ err: String(err) }, "cron: timer retry failed");
  });
}, 500).unref?.();
```
The choice of 500ms for the retry delay is not justified in the code or PR description, and a fixed 500ms retry interval could add unnecessary load if the state.running condition persists. Based on codebase patterns (e.g., src/gateway/client.ts:358 uses a backoff starting at 750ms for reconnection), consider a longer initial delay (e.g., 1000ms) or exponential backoff with a maximum delay to reduce system load during extended contention periods.
Suggested change:

```diff
-}, 500).unref?.();
+}, 1000).unref?.();
```
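If backoff is preferred, a small helper can compute the delay from a retry counter; the names below (retryCount, BASE_DELAY_MS, MAX_DELAY_MS) are illustrative assumptions rather than existing code:

```ts
const BASE_DELAY_MS = 750;   // starting point comparable to the gateway reconnect backoff noted above
const MAX_DELAY_MS = 10_000; // cap so contention never grows the wait unbounded

function retryDelayMs(retryCount: number): number {
  // 750ms, 1500ms, 3000ms, ... capped at 10s
  return Math.min(BASE_DELAY_MS * 2 ** retryCount, MAX_DELAY_MS);
}

// Usage inside the running-guard branch (sketch):
// setTimeout(() => { void onTimer(state).catch(...); }, retryDelayMs(state.retryCount ?? 0)).unref?.();
```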
| it("schedules a retry when state.running is true", async () => { | ||
| const store = await makeStorePath(); | ||
| const enqueueSystemEvent = vi.fn(); | ||
| const requestHeartbeatNow = vi.fn(); | ||
| const runIsolatedAgentJob = vi.fn(async () => ({ status: "ok" as const })); | ||
|
|
||
| const deps: CronServiceDeps = { | ||
| storePath: store.storePath, | ||
| cronEnabled: true, | ||
| log: noopLogger, | ||
| enqueueSystemEvent, | ||
| requestHeartbeatNow, | ||
| runIsolatedAgentJob, | ||
| }; | ||
|
|
||
| const state = createCronServiceState(deps); | ||
|
|
||
| // Manually set state.running to true to simulate concurrent operation | ||
| state.running = true; | ||
|
|
||
| // Call onTimer - it should schedule a retry instead of returning silently | ||
| const onTimerPromise = onTimer(state); | ||
|
|
||
| // The function should return immediately since running is true | ||
| await onTimerPromise; | ||
|
|
||
| // Verify that running is still true (we didn't enter the critical section) | ||
| expect(state.running).toBe(true); | ||
|
|
||
| // Now simulate the retry timer firing by advancing time 500ms | ||
| state.running = false; // concurrent operation finished | ||
|
|
||
| // Create a mock store for the retry | ||
| state.store = { version: 1, jobs: [] }; | ||
|
|
||
| await vi.advanceTimersByTimeAsync(600); | ||
|
|
||
| // The retry should have set running back to false after completing | ||
| // (We can't easily verify the retry was scheduled, but we can verify | ||
| // the mechanism works in the integration test below) | ||
|
|
||
| await store.cleanup(); |
The test doesn't adequately validate the retry mechanism. On line 72, state.running is set to false before advancing timers, which means the retry won't actually encounter the state.running === true condition again. To properly test the retry mechanism, the test should verify that when state.running remains true for some time, the retry timer keeps firing, and only succeeds once state.running becomes false. Consider using a mock or spy to verify that onTimer is called multiple times, or simulate a scenario where state.running transitions from true to false after the first retry.
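One way to tighten the assertions, as a sketch: it assumes the test already runs under vi.useFakeTimers() (advancing timers implies it does) and that a successful tick re-arms state.timer, which is an assumption about the service internals rather than confirmed behavior.

```ts
// Sketch only; fake timers are assumed to be installed before spying,
// so the spy wraps the fake setTimeout and advanceTimersByTimeAsync still fires it.
const setTimeoutSpy = vi.spyOn(globalThis, "setTimeout");

state.running = true;
await onTimer(state);

// The guard branch should have scheduled a 500ms retry instead of giving up.
expect(setTimeoutSpy).toHaveBeenCalledWith(expect.any(Function), 500);

// Let the "concurrent operation" finish, then fire the retry.
state.running = false;
await vi.advanceTimersByTimeAsync(600);

// After the retry runs, the main timer should be re-armed (assumed observable side effect).
expect(state.timer).toBeDefined();
```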
```ts
setTimeout(() => {
  void onTimer(state).catch((err) => {
    state.deps.log.error({ err: String(err) }, "cron: timer retry failed");
  });
}, 500).unref?.();
```
The retry mechanism doesn't check if the cron service has been disabled or stopped before scheduling the retry. If state.deps.cronEnabled becomes false or the service is stopped after the initial check but before the retry fires, the retry will still execute. Consider adding a check for state.deps.cronEnabled and potentially a state.stopped flag before scheduling the retry to prevent unnecessary retries after service shutdown.
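A minimal sketch of such a guard, assuming a hypothetical state.stopped flag is introduced alongside the existing deps.cronEnabled dependency (the flag does not exist today):

```ts
export async function onTimer(state: CronServiceState) {
  if (!state.deps.cronEnabled || state.stopped) {
    return; // service disabled or shut down; do not re-arm or retry
  }
  if (state.running) {
    setTimeout(() => {
      // Re-check before retrying: stop()/disable may have happened in the meantime.
      if (!state.deps.cronEnabled || state.stopped) return;
      void onTimer(state).catch((err) => {
        state.deps.log.error({ err: String(err) }, "cron: timer retry failed");
      });
    }, 500).unref?.();
    return;
  }
  // ... existing tick logic ...
}
```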
```ts
};

async function makeStorePath() {
  const dir = await fs.mkdtemp(path.join(os.tmpdir(), "moltbot-cron-"));
```
The test uses 'moltbot' in the temporary directory path, but other cron tests in this directory use 'openclaw' (e.g., service.prevents-duplicate-timers.test.ts:15 uses 'openclaw-cron-'). For consistency with the existing test suite, consider changing 'moltbot-cron-' to 'openclaw-cron-'.
```ts
setTimeout(() => {
  void onTimer(state).catch((err) => {
    state.deps.log.error({ err: String(err) }, "cron: timer retry failed");
  });
}, 500).unref?.();
```
The solution addresses a symptom rather than the root cause. The timer should only fire once per armTimer call, so onTimer being called while state.running is true suggests either: (1) the timer is being armed multiple times without clearing the previous timer, or (2) the operation is taking longer than the scheduled interval. The fix adds a retry mechanism, but doesn't prevent the underlying race condition. Consider investigating why onTimer is being called concurrently in the first place. If operations routinely take longer than the scheduled interval, the 500ms retry could compound the problem. A more robust solution might involve: (1) checking and clearing any existing timer before arming a new one (which armTimer already does), (2) adding logging to detect concurrent calls, or (3) implementing a queuing mechanism instead of immediate retry.
Suggested change:

```diff
-setTimeout(() => {
-  void onTimer(state).catch((err) => {
-    state.deps.log.error({ err: String(err) }, "cron: timer retry failed");
-  });
-}, 500).unref?.();
+state.deps.log.warn(
+  "cron: onTimer invoked while a previous run is still in progress; skipping concurrent execution",
+);
```
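Building on the queuing idea above, a sketch of a "pending tick" flag in place of a timed retry; the pendingTick field is hypothetical, and the try/finally below stands in for wherever the current code clears state.running:

```ts
export async function onTimer(state: CronServiceState) {
  if (state.running) {
    // Remember that a tick arrived while we were busy instead of retrying on a timer.
    state.pendingTick = true;
    return;
  }
  state.running = true;
  try {
    // ... existing tick logic: run due jobs, re-arm the main timer ...
  } finally {
    state.running = false;
    if (state.pendingTick) {
      state.pendingTick = false;
      // Handle the tick we deferred; errors are logged, not thrown, as in the retry path.
      void onTimer(state).catch((err) => {
        state.deps.log.error({ err: String(err) }, "cron: deferred tick failed");
      });
    }
  }
}
```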
```ts
export async function onTimer(state: CronServiceState) {
  if (state.running) {
    setTimeout(() => {
      void onTimer(state).catch((err) => {
        state.deps.log.error({ err: String(err) }, "cron: timer retry failed");
      });
    }, 500).unref?.();
    return;
```
[P0] onTimer() retry scheduling bypasses service stop/disable lifecycle
At src/cron/service/timer.ts:33-40, when state.running is true you schedule a 500ms retry via setTimeout(() => void onTimer(state), 500) but this retry isn’t gated on deps.cronEnabled and isn’t tracked in state.timer (so stopTimer() won’t cancel it). If CronService.stop() is called while state.running stays true (or is accidentally left true), the retry loop can keep firing indefinitely while the process is alive, doing unexpected work and/or spamming cron: timer retry failed logs.
```ts
it("schedules a retry when state.running is true", async () => {
  const store = await makeStorePath();
  const enqueueSystemEvent = vi.fn();
  const requestHeartbeatNow = vi.fn();
  const runIsolatedAgentJob = vi.fn(async () => ({ status: "ok" as const }));

  const deps: CronServiceDeps = {
    storePath: store.storePath,
    cronEnabled: true,
    log: noopLogger,
    enqueueSystemEvent,
    requestHeartbeatNow,
    runIsolatedAgentJob,
  };

  const state = createCronServiceState(deps);

  // Manually set state.running to true to simulate concurrent operation
  state.running = true;

  // Call onTimer - it should schedule a retry instead of returning silently
  const onTimerPromise = onTimer(state);

  // The function should return immediately since running is true
  await onTimerPromise;

  // Verify that running is still true (we didn't enter the critical section)
  expect(state.running).toBe(true);
```
[P1] This test doesn’t actually assert the retry was scheduled
In src/cron/service.timer-rearm-on-running.test.ts:42-84, the test never checks that setTimeout was called with a 500ms delay (or that onTimer is invoked again), and it has no assertion after advancing timers. As written it would still pass if onTimer() simply returned early when running=true (the original regression), so it won’t protect the fix.
```ts
async function makeStorePath() {
  const dir = await fs.mkdtemp(path.join(os.tmpdir(), "moltbot-cron-"));
  return {
    storePath: path.join(dir, "cron", "jobs.json"),
    cleanup: async () => {
      await fs.rm(dir, { recursive: true, force: true });
```
[P2] Temp dir prefix looks copy/pasted
src/cron/service.timer-rearm-on-running.test.ts:19 uses "moltbot-cron-" as the mkdtemp prefix, which seems unrelated to this repo and can make local debugging confusing.
Summary
Schedules a retry timer when onTimer() is called while another cron operation is in progress (state.running = true). The retry occurs after 500ms, ensuring the timer is eventually re-armed even when concurrent operations overlap.
Problem
The cron timer mechanism had a race condition: when onTimer() was called while state.running was already true, the function returned immediately without re-arming the timer. This caused scheduled jobs to be missed.
Solution
In onTimer(), when state.running is true, a retry is scheduled after 500ms so the timer is eventually re-armed once the concurrent operation finishes.
Test Plan
Added service.timer-rearm-on-running.test.ts with 2 tests: one covering the retry scheduling path when state.running is true, and one covering the normal (non-concurrent) execution path.
Fixes #2813
Greptile Overview
Greptile Summary
This PR updates the cron service timer logic so that if onTimer() is invoked while another cron operation is already in progress (state.running === true), it schedules a delayed retry (500ms) rather than returning and potentially leaving the system unarmed. A new Vitest file adds coverage around the intended behavior and verifies normal job execution in a non-concurrent case.

The change sits in src/cron/service/timer.ts alongside the existing armTimer() scheduling and onTimer() tick handler, and aims to prevent missed cron events caused by overlapping operations.

Confidence Score: 3/5
The retry is not tracked in state.timer or gated on cronEnabled, so it can continue firing after stop()/disable in certain states. The added test does not assert the retry scheduling and could miss regressions.

Context used:
- dashboard - CLAUDE.md (source)
- dashboard - AGENTS.md (source)