
fix(cron): use per-job timeout for stuck lock threshold#43816

Open
shariqriazz wants to merge 3 commits into openclaw:main from
shariqriazz:fix/cron-stale-lock-stuck-threshold-30096

Conversation

@shariqriazz

Summary

  • Replace the fixed 2-hour STUCK_RUN_MS constant with a per-job stuck threshold derived from resolveCronJobTimeoutMs() + a 5-minute buffer
  • systemEvent jobs (10 min timeout) now clear stale locks after ~15 minutes instead of 2 hours
  • agentTurn jobs with custom timeoutSeconds (e.g. 30 min) clear after ~35 minutes instead of 2 hours
  • Jobs with no resolved timeout use a 30-minute fallback (down from 2 hours)

Fixes #30096

Problem

When the gateway crashes mid-execution and restarts within 2 hours, the runningAtMs lock persists in jobs.json. The startup path (start() in ops.ts) unconditionally clears all locks, but the timer-tick maintenance path (normalizeJobTickState in jobs.ts) only clears locks older than STUCK_RUN_MS (2 hours). If the startup clear succeeds but a subsequent timer tick re-reads a stale on-disk state, or if the lock gets re-set during runMissedJobs and the process crashes again, the 2-hour window keeps the job permanently blocked.

Users with 30-minute job timeouts observed jobs stuck for hours after unclean shutdowns, with no warning.

Fix

The stuck threshold now scales with each job's actual timeout:

  • resolveCronJobTimeoutMs(job) + STUCK_RUN_BUFFER_MS (5 min) for jobs with a resolved timeout
  • STUCK_RUN_FALLBACK_MS (30 min) for jobs where the resolver returns undefined

The startup unconditional clear in ops.ts is unchanged — this fix tightens the secondary safety net in the timer-tick path.
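
The two bullets above can be sketched as a single helper; this is an illustrative reconstruction based on the constant names in the PR description, not the literal contents of src/cron/service/jobs.ts:

```typescript
// Sketch of the per-job stuck-lock threshold described above. Constant names
// come from the PR; exact values and shape in jobs.ts may differ.
const STUCK_RUN_BUFFER_MS = 5 * 60_000; // 5-minute grace on top of the job's timeout
const STUCK_RUN_FALLBACK_MS = 30 * 60_000; // used when no timeout resolves

// jobTimeoutMs is the result of resolveCronJobTimeoutMs(job); undefined means
// the resolver found no applicable execution timeout for this job.
function stuckThresholdMs(jobTimeoutMs: number | undefined): number {
  return typeof jobTimeoutMs === "number"
    ? jobTimeoutMs + STUCK_RUN_BUFFER_MS
    : STUCK_RUN_FALLBACK_MS;
}

console.log(stuckThresholdMs(10 * 60_000)); // systemEvent default → 900000 (~15 min)
console.log(stuckThresholdMs(30 * 60_000)); // custom 30-min timeout → 2100000 (~35 min)
console.log(stuckThresholdMs(undefined)); // no resolved timeout → 1800000 (30 min)
```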

Testing

AI Disclosure

This PR was authored with AI assistance (OpenCode/Claude). The changes have been reviewed and tested locally. Degree of testing: fully tested against existing + new test suite.

Copilot AI review requested due to automatic review settings March 12, 2026 07:42
@greptile-apps
Contributor

greptile-apps bot commented Mar 12, 2026

Greptile Summary

This PR replaces the fixed 2-hour STUCK_RUN_MS constant with a per-job stuck-lock threshold (resolveCronJobTimeoutMs(job) + 5-minute buffer), so jobs with shorter execution timeouts no longer have to wait 2 hours for a stale lock to be cleared after an unclean shutdown.

Key changes:

  • STUCK_RUN_MS (2 h) removed; replaced with STUCK_RUN_BUFFER_MS (5 min) added on top of each job's resolved timeout, and STUCK_RUN_FALLBACK_MS (30 min) used when the timeout resolver returns undefined.
  • resolveCronJobTimeoutMs is now imported into jobs.ts and called per-job on every timer tick that has a runningAtMs marker.
  • Four new regression tests added covering the custom-timeout clearing path, the preserve-within-threshold path, and the systemEvent vs agentTurn comparison.

Potential concern: STUCK_RUN_FALLBACK_MS = 30 min applies to any job where resolveCronJobTimeoutMs returns undefined — specifically agentTurn jobs with timeoutSeconds ≤ 0 (i.e. explicitly disabled execution timeout). Before this PR those jobs used the 2-hour window; after this PR they use 30 minutes. A job that legitimately runs longer than 30 minutes with no execution timeout will have its runningAtMs lock cleared by the maintenance tick, leaving it eligible to run again while the original execution is still in progress. The PR description calls this intentional, but the scenario is worth careful validation against real-world timeoutSeconds: 0 usage in production.

Confidence Score: 3/5

  • Mostly safe to merge for standard jobs; needs validation for production use of timeoutSeconds: 0 or negative values before merging.
  • The implementation is correct and well-tested for the primary use-cases (jobs with positive timeoutSeconds or using the platform defaults). The new tests are mathematically sound. However, the STUCK_RUN_FALLBACK_MS = 30 min fallback represents a silent regression for jobs configured with timeoutSeconds ≤ 0 — their stuck window drops from 2 hours to 30 minutes, increasing the risk of duplicate concurrent executions for any such job that legitimately runs past 30 minutes. If no production jobs use timeoutSeconds ≤ 0, the risk is low; if they do, this could cause real double-execution bugs.
  • Pay close attention to src/cron/service/jobs.ts lines 44-48 (STUCK_RUN_FALLBACK_MS) and any callers that may create agentTurn jobs with timeoutSeconds: 0 or negative values.
Inline review comment on src/cron/service/jobs.ts, lines 44-48:

**Fallback threshold may clear locks for indefinitely-running jobs**

`STUCK_RUN_FALLBACK_MS = 30 min` applies to any job where `resolveCronJobTimeoutMs` returns `undefined` — which happens when `timeoutSeconds <= 0`. A user who sets `timeoutSeconds: 0` to mean "no execution timeout" (i.e. the job is allowed to run indefinitely) will now have their lock cleared after 30 minutes, down from the previous 2-hour default.

If such a job is legitimately running past the 30-minute mark, `normalizeJobTickState` will clear `runningAtMs`, `isJobDue` will see no lock, and the job can be scheduled again — producing a duplicate concurrent execution that the lock was specifically meant to prevent.

Before this PR, the fixed `STUCK_RUN_MS = 2h` applied uniformly, so an indefinite-timeout job had a 2-hour safety window. The PR's intent of tightening things is sound, but this particular edge case swings in the wrong direction: the one class of jobs most likely to run long (those with `timeoutSeconds ≤ 0`) now gets the *shortest* fallback threshold.

Consider using a larger fallback (e.g. the old `STUCK_RUN_MS = 2h`) specifically for `undefined`-timeout jobs, or documenting clearly that `timeoutSeconds: 0` should not be used with long-running jobs.
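
The edge case turns on when the resolver returns undefined. A minimal hypothetical model of resolveCronJobTimeoutMs — the real resolver may also apply per-kind defaults, e.g. 10 minutes for systemEvent — would be:

```typescript
type CronJob = { timeoutSeconds?: number };

// Hypothetical stand-in for resolveCronJobTimeoutMs; illustrative only.
function resolveCronJobTimeoutMs(job: CronJob): number | undefined {
  // Per the review, timeoutSeconds <= 0 means "no execution timeout".
  if (job.timeoutSeconds !== undefined && job.timeoutSeconds <= 0) return undefined;
  return job.timeoutSeconds !== undefined ? job.timeoutSeconds * 1000 : undefined;
}

console.log(resolveCronJobTimeoutMs({ timeoutSeconds: 0 })); // undefined → fallback threshold applies
console.log(resolveCronJobTimeoutMs({ timeoutSeconds: 1800 })); // 1800000 → timeout + buffer applies
```

Under this model, the "run indefinitely" jobs are exactly the ones routed to the fallback, which is why the fallback's size matters.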


Last reviewed commit: 82d454b

Contributor

Copilot AI left a comment

Pull request overview

This PR updates cron maintenance behavior to clear stale runningAtMs markers using each job’s configured execution timeout (plus a small buffer) rather than a fixed 2-hour window, helping the scheduler recover faster after crashes without leaving jobs blocked for long periods.

Changes:

  • Replace the fixed stuck-run threshold with a per-job threshold derived from resolveCronJobTimeoutMs(job) + a buffer, with a fallback when no timeout applies.
  • Expand regression coverage to assert stuck-marker clearing differs between job kinds and honors explicit timeoutSeconds.
  • Update test comments to reflect the new “per-job stuck threshold” semantics.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Files reviewed:

  • src/cron/service/jobs.ts — Clears stale runningAtMs markers based on the per-job timeout policy (plus buffer) instead of a fixed constant.
  • src/cron/service.issue-13992-regression.test.ts — Adds/updates regression tests to validate per-job stuck thresholds (including agentTurn vs systemEvent).
  • src/cron/service.armtimer-tight-loop.test.ts — Updates comment text to align with the new stuck-marker policy.



@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 82d454bcf0

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Replace fixed 2-hour STUCK_RUN_MS with per-job threshold based on
resolveCronJobTimeoutMs + 5-minute buffer. Jobs with 30-minute timeouts
now clear stale locks after ~35 minutes instead of waiting 2 hours,
preventing silent cron failures after fast gateway restarts.

Address review feedback:
- Bump STUCK_RUN_FALLBACK_MS from 30 min to 2h to preserve the safety
  window for jobs with timeoutSeconds <= 0 (no execution timeout).
- Use 60_000 numeric separator style to match codebase convention.
@shariqriazz shariqriazz force-pushed the fix/cron-stale-lock-stuck-threshold-30096 branch from d22d68f to 822a3b0 Compare March 12, 2026 08:26

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 822a3b04ba


Comment on lines +388 to +389:

    const stuckThresholdMs =
      typeof jobTimeoutMs === "number" ? jobTimeoutMs + STUCK_RUN_BUFFER_MS : STUCK_RUN_FALLBACK_MS;


P2: Clamp stuck threshold to runtime timeout ceiling

stuckThresholdMs now uses the raw resolveCronJobTimeoutMs(job) value, but timeoutSeconds is unbounded in the API schema (src/gateway/protocol/schema/cron.ts:12) while execution timeouts are enforced via setTimeout(..., jobTimeoutMs) (src/cron/service/timer.ts:80-84), which overflows above ~2^31-1 ms. In that overflow range, runs time out almost immediately but stale-lock cleanup waits for the huge computed threshold, so a crash/interrupted persist can leave runningAtMs blocking the job for days; this regression did not exist with the previous fixed 2-hour cleanup window.
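
The overflow Codex describes comes from Node's timer ceiling: setTimeout delays above 2^31 - 1 ms (~24.8 days) are clamped, so the execution timeout fires almost immediately while the computed stuck threshold stays enormous. A quick illustrative check (not code from this PR):

```typescript
// Node's maximum setTimeout delay; larger values are clamped to 1 ms.
const TIMEOUT_MAX_MS = 2 ** 31 - 1; // 2147483647

// Would this job's timeoutSeconds overflow the Node timer used in timer.ts?
function overflowsNodeTimer(timeoutSeconds: number): boolean {
  return timeoutSeconds * 1000 > TIMEOUT_MAX_MS;
}

console.log(overflowsNodeTimer(1_800)); // false — a 30-min timeout is fine
console.log(overflowsNodeTimer(3_000_000)); // true — ~34.7 days overflows
```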


@shariqriazz
Author

Fixed the P2 (clamp stuck threshold): clamped stuckThresholdMs to Math.min(jobTimeout + buffer, STUCK_RUN_FALLBACK_MS) so absurdly large timeoutSeconds values that overflow setTimeout's 2^31-1 ms limit degrade to the 2-hour fallback instead of leaving stuck locks for days.
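
A sketch of the clamped computation, assuming (per the earlier review commit) STUCK_RUN_FALLBACK_MS was bumped to 2 hours; the exact code in jobs.ts may differ:

```typescript
const STUCK_RUN_BUFFER_MS = 5 * 60_000; // 5 min
const STUCK_RUN_FALLBACK_MS = 2 * 60 * 60_000; // 2 h after review feedback

// Clamped per-job threshold: never exceeds the 2-hour fallback ceiling,
// so absurd timeoutSeconds values cannot stretch stale-lock cleanup for days.
function stuckThresholdMs(jobTimeoutMs: number | undefined): number {
  if (typeof jobTimeoutMs !== "number") return STUCK_RUN_FALLBACK_MS;
  return Math.min(jobTimeoutMs + STUCK_RUN_BUFFER_MS, STUCK_RUN_FALLBACK_MS);
}

console.log(stuckThresholdMs(30 * 60_000)); // 2100000 — normal timeouts unaffected
console.log(stuckThresholdMs(2 ** 40)); // 7200000 — absurd values clamp to 2 h
```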

@Takhoffman Takhoffman requested a review from a team as a code owner March 24, 2026 20:16


Development

Successfully merging this pull request may close these issues.

bug: Stale cron locks persist across gateway restarts
