Skip to content

fix: Re-arm cron service timer with a delay if already running to prevent missed events.#2990

Closed
kuzeyc wants to merge 2 commits intoopenclaw:mainfrom
kuzeyc:fix/cron_job
Closed

fix: Re-arm cron service timer with a delay if already running to prevent missed events.#2990
kuzeyc wants to merge 2 commits intoopenclaw:mainfrom
kuzeyc:fix/cron_job

Conversation

@kuzeyc
Copy link

@kuzeyc kuzeyc commented Jan 27, 2026

Summary

Schedules a retry timer when onTimer() is called while another cron operation is in progress (state.running = true). The retry occurs after 500ms, ensuring the timer is eventually re-armed even when concurrent operations overlap.

Problem

The cron timer mechanism had a race condition where when onTimer() was called but state.running was already true, the function returned immediately without re-arming the timer. This caused:

  • The event loop to go completely silent
  • Scheduled cron jobs to never fire
  • No log activity during the window when jobs should have executed
  • Mission-critical healthcare cron jobs to be missed

Solution
In onTimer(), when state.running is true:

  • Schedule a 500ms retry timer instead of returning silently
  • The retry timer is also unref()'ed to avoid keeping the process alive unnecessarily
  • Retry logic recursively calls
  • onTimer()
  • which will eventually succeed when state.running becomes false

Test Plan

Added service.timer-rearm-on-running.test.ts with 2 tests:

  • Verifies retry is scheduled when state.running is true
  • Verifies normal execution when no concurrent operation
  • All existing cron tests pass (789/790 total; 1 unrelated failure in canvas-host)
  • Lint passes

  • AI-assisted

Fixes #2813

Greptile Overview

Greptile Summary

This PR updates the cron service timer logic so that if onTimer() is invoked while another cron operation is already in progress (state.running === true), it schedules a delayed retry (500ms) rather than returning and potentially leaving the system unarmed. A new Vitest file adds coverage around the intended behavior and verifies normal job execution in a non-concurrent case.

The change sits in src/cron/service/timer.ts alongside the existing armTimer() scheduling and onTimer() tick handler, and aims to prevent missed cron events caused by overlapping operations.

Confidence Score: 3/5

  • This PR is likely safe but has a lifecycle edge case and a weak regression test that should be addressed before merge.
  • Core behavior change is small and localized, but the new retry timer is not tied to state.timer or cronEnabled, so it can continue firing after stop()/disable in certain states. The added test does not assert the retry scheduling and could miss regressions.
  • src/cron/service/timer.ts; src/cron/service.timer-rearm-on-running.test.ts

Context used:

  • Context from dashboard - CLAUDE.md (source)
  • Context from dashboard - AGENTS.md (source)

Copilot AI review requested due to automatic review settings February 1, 2026 13:39
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a race condition in the cron service timer mechanism where onTimer() calls that occurred while another cron operation was in progress (state.running = true) would return silently without re-arming the timer, causing scheduled jobs to be missed.

Changes:

  • Added a 500ms retry mechanism in onTimer() when state.running is true to ensure the timer eventually gets re-armed
  • Added test coverage for the retry behavior when concurrent operations overlap

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

File Description
src/cron/service/timer.ts Implements retry logic that schedules a new timer after 500ms when onTimer() is called during an active operation
src/cron/service.timer-rearm-on-running.test.ts Adds test cases to verify retry scheduling and normal execution paths

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +35 to +39
setTimeout(() => {
void onTimer(state).catch((err) => {
state.deps.log.error({ err: String(err) }, "cron: timer retry failed");
});
}, 500).unref?.();
Copy link

Copilot AI Feb 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry mechanism lacks a maximum retry limit, which could lead to infinite recursion if state.running never becomes false (e.g., due to an unhandled exception preventing the finally block from executing, or state corruption). Consider adding a retry counter and a maximum retry limit (e.g., 10 retries) before giving up and logging an error. This would prevent indefinite recursion and resource exhaustion.

Copilot uses AI. Check for mistakes.
Comment on lines +35 to +39
setTimeout(() => {
void onTimer(state).catch((err) => {
state.deps.log.error({ err: String(err) }, "cron: timer retry failed");
});
}, 500).unref?.();
Copy link

Copilot AI Feb 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry timer is not stored in state.timer or any other state field, which means it cannot be cleared when the service is stopped. This could lead to the retry timer firing after the service has been stopped, potentially causing unexpected behavior. Consider storing the retry timer reference so it can be properly cleaned up during service shutdown, similar to how the main timer is stored in state.timer.

Suggested change
setTimeout(() => {
void onTimer(state).catch((err) => {
state.deps.log.error({ err: String(err) }, "cron: timer retry failed");
});
}, 500).unref?.();
if (state.timer) {
clearTimeout(state.timer);
}
state.timer = setTimeout(() => {
void onTimer(state).catch((err) => {
state.deps.log.error({ err: String(err) }, "cron: timer retry failed");
});
}, 500);
state.timer.unref?.();

Copilot uses AI. Check for mistakes.
void onTimer(state).catch((err) => {
state.deps.log.error({ err: String(err) }, "cron: timer retry failed");
});
}, 500).unref?.();
Copy link

Copilot AI Feb 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The choice of 500ms for the retry delay is not justified in the code or PR description. Based on codebase patterns (e.g., src/gateway/client.ts:358 uses a backoff starting at 750ms for reconnection), consider using a longer initial delay or implementing exponential backoff. A 500ms retry interval could cause excessive CPU usage if the state.running condition persists. Consider using a longer delay (e.g., 1000ms) or implementing exponential backoff with a maximum delay to reduce system load during extended contention periods.

Suggested change
}, 500).unref?.();
}, 1000).unref?.();

Copilot uses AI. Check for mistakes.
Comment on lines +42 to +83
it("schedules a retry when state.running is true", async () => {
const store = await makeStorePath();
const enqueueSystemEvent = vi.fn();
const requestHeartbeatNow = vi.fn();
const runIsolatedAgentJob = vi.fn(async () => ({ status: "ok" as const }));

const deps: CronServiceDeps = {
storePath: store.storePath,
cronEnabled: true,
log: noopLogger,
enqueueSystemEvent,
requestHeartbeatNow,
runIsolatedAgentJob,
};

const state = createCronServiceState(deps);

// Manually set state.running to true to simulate concurrent operation
state.running = true;

// Call onTimer - it should schedule a retry instead of returning silently
const onTimerPromise = onTimer(state);

// The function should return immediately since running is true
await onTimerPromise;

// Verify that running is still true (we didn't enter the critical section)
expect(state.running).toBe(true);

// Now simulate the retry timer firing by advancing time 500ms
state.running = false; // concurrent operation finished

// Create a mock store for the retry
state.store = { version: 1, jobs: [] };

await vi.advanceTimersByTimeAsync(600);

// The retry should have set running back to false after completing
// (We can't easily verify the retry was scheduled, but we can verify
// the mechanism works in the integration test below)

await store.cleanup();
Copy link

Copilot AI Feb 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test doesn't adequately validate the retry mechanism. On line 72, state.running is set to false before advancing timers, which means the retry won't actually encounter the state.running === true condition again. To properly test the retry mechanism, the test should verify that when state.running remains true for some time, the retry timer keeps firing, and only succeeds once state.running becomes false. Consider using a mock or spy to verify that onTimer is called multiple times, or simulate a scenario where state.running transitions from true to false after the first retry.

Copilot uses AI. Check for mistakes.
Comment on lines +35 to +39
setTimeout(() => {
void onTimer(state).catch((err) => {
state.deps.log.error({ err: String(err) }, "cron: timer retry failed");
});
}, 500).unref?.();
Copy link

Copilot AI Feb 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry mechanism doesn't check if the cron service has been disabled or stopped before scheduling the retry. If state.deps.cronEnabled becomes false or the service is stopped after the initial check but before the retry fires, the retry will still execute. Consider adding a check for state.deps.cronEnabled and potentially a state.stopped flag before scheduling the retry to prevent unnecessary retries after service shutdown.

Copilot uses AI. Check for mistakes.
};

async function makeStorePath() {
const dir = await fs.mkdtemp(path.join(os.tmpdir(), "moltbot-cron-"));
Copy link

Copilot AI Feb 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test uses 'moltbot' in the temporary directory path, but other cron tests in this directory use 'openclaw' (e.g., service.prevents-duplicate-timers.test.ts:15 uses 'openclaw-cron-'). For consistency with the existing test suite, consider changing 'moltbot-cron-' to 'openclaw-cron-'.

Copilot uses AI. Check for mistakes.
Comment on lines +35 to +39
setTimeout(() => {
void onTimer(state).catch((err) => {
state.deps.log.error({ err: String(err) }, "cron: timer retry failed");
});
}, 500).unref?.();
Copy link

Copilot AI Feb 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The solution addresses a symptom rather than the root cause. The timer should only fire once per armTimer call, so onTimer being called while state.running is true suggests either: (1) the timer is being armed multiple times without clearing the previous timer, or (2) the operation is taking longer than the scheduled interval. The fix adds a retry mechanism, but doesn't prevent the underlying race condition. Consider investigating why onTimer is being called concurrently in the first place. If operations routinely take longer than the scheduled interval, the 500ms retry could compound the problem. A more robust solution might involve: (1) checking and clearing any existing timer before arming a new one (which armTimer already does), (2) adding logging to detect concurrent calls, or (3) implementing a queuing mechanism instead of immediate retry.

Suggested change
setTimeout(() => {
void onTimer(state).catch((err) => {
state.deps.log.error({ err: String(err) }, "cron: timer retry failed");
});
}, 500).unref?.();
state.deps.log.warn(
"cron: onTimer invoked while a previous run is still in progress; skipping concurrent execution",
);

Copilot uses AI. Check for mistakes.
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Comment on lines 33 to 40
export async function onTimer(state: CronServiceState) {
if (state.running) {
setTimeout(() => {
void onTimer(state).catch((err) => {
state.deps.log.error({ err: String(err) }, "cron: timer retry failed");
});
}, 500).unref?.();
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P0] onTimer() retry scheduling bypasses service stop/disable lifecycle

At src/cron/service/timer.ts:33-40, when state.running is true you schedule a 500ms retry via setTimeout(() => void onTimer(state), 500) but this retry isn’t gated on deps.cronEnabled and isn’t tracked in state.timer (so stopTimer() won’t cancel it). If CronService.stop() is called while state.running stays true (or is accidentally left true), the retry loop can keep firing indefinitely while the process is alive, doing unexpected work and/or spamming cron: timer retry failed logs.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cron/service/timer.ts
Line: 33:40

Comment:
[P0] `onTimer()` retry scheduling bypasses service stop/disable lifecycle

At `src/cron/service/timer.ts:33-40`, when `state.running` is true you schedule a 500ms retry via `setTimeout(() => void onTimer(state), 500)` but this retry isn’t gated on `deps.cronEnabled` and isn’t tracked in `state.timer` (so `stopTimer()` won’t cancel it). If `CronService.stop()` is called while `state.running` stays true (or is accidentally left true), the retry loop can keep firing indefinitely while the process is alive, doing unexpected work and/or spamming `cron: timer retry failed` logs.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +42 to +69
it("schedules a retry when state.running is true", async () => {
const store = await makeStorePath();
const enqueueSystemEvent = vi.fn();
const requestHeartbeatNow = vi.fn();
const runIsolatedAgentJob = vi.fn(async () => ({ status: "ok" as const }));

const deps: CronServiceDeps = {
storePath: store.storePath,
cronEnabled: true,
log: noopLogger,
enqueueSystemEvent,
requestHeartbeatNow,
runIsolatedAgentJob,
};

const state = createCronServiceState(deps);

// Manually set state.running to true to simulate concurrent operation
state.running = true;

// Call onTimer - it should schedule a retry instead of returning silently
const onTimerPromise = onTimer(state);

// The function should return immediately since running is true
await onTimerPromise;

// Verify that running is still true (we didn't enter the critical section)
expect(state.running).toBe(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P1] This test doesn’t actually assert the retry was scheduled

In src/cron/service.timer-rearm-on-running.test.ts:42-84, the test never checks that setTimeout was called with a 500ms delay (or that onTimer is invoked again), and it has no assertion after advancing timers. As written it would still pass if onTimer() simply returned early when running=true (the original regression), so it won’t protect the fix.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cron/service.timer-rearm-on-running.test.ts
Line: 42:69

Comment:
[P1] This test doesn’t actually assert the retry was scheduled

In `src/cron/service.timer-rearm-on-running.test.ts:42-84`, the test never checks that `setTimeout` was called with a 500ms delay (or that `onTimer` is invoked again), and it has no assertion after advancing timers. As written it would still pass if `onTimer()` simply returned early when `running=true` (the original regression), so it won’t protect the fix.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +18 to +23
async function makeStorePath() {
const dir = await fs.mkdtemp(path.join(os.tmpdir(), "moltbot-cron-"));
return {
storePath: path.join(dir, "cron", "jobs.json"),
cleanup: async () => {
await fs.rm(dir, { recursive: true, force: true });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] Temp dir prefix looks copy/pasted

src/cron/service.timer-rearm-on-running.test.ts:19 uses "moltbot-cron-" as the mkdtemp prefix, which seems unrelated to this repo and can make local debugging confusing.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cron/service.timer-rearm-on-running.test.ts
Line: 18:23

Comment:
[P2] Temp dir prefix looks copy/pasted

`src/cron/service.timer-rearm-on-running.test.ts:19` uses `"moltbot-cron-"` as the mkdtemp prefix, which seems unrelated to this repo and can make local debugging confusing.

How can I resolve this? If you propose a fix, please make it concise.

@kuzeyc kuzeyc closed this Feb 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cron job scheduled but never fired during event loop silence

1 participant

Comments