
fix(cron): use per-job timeout for stuck lock threshold#43816

Open
shariqriazz wants to merge 3 commits into openclaw:main from
shariqriazz:fix/cron-stale-lock-stuck-threshold-30096

Conversation

@shariqriazz

Summary

  • Replace the fixed 2-hour STUCK_RUN_MS constant with a per-job stuck threshold derived from resolveCronJobTimeoutMs() + a 5-minute buffer
  • systemEvent jobs (10 min timeout) now clear stale locks after ~15 minutes instead of 2 hours
  • agentTurn jobs with custom timeoutSeconds (e.g. 30 min) clear after ~35 minutes instead of 2 hours
  • Jobs with no resolved timeout use a 30-minute fallback (down from 2 hours)

Fixes #30096

Problem

When the gateway crashes mid-execution and restarts within 2 hours, the runningAtMs lock persists in jobs.json. The startup path (start() in ops.ts) unconditionally clears all locks, but the timer-tick maintenance path (normalizeJobTickState in jobs.ts) only clears locks older than STUCK_RUN_MS (2 hours). If the startup clear succeeds but a subsequent timer tick re-reads a stale on-disk state, or if the lock gets re-set during runMissedJobs and the process crashes again, the 2-hour window keeps the job permanently blocked.

Users with 30-minute job timeouts observed jobs stuck for hours after unclean shutdowns, with no warning.

Fix

The stuck threshold now scales with each job's actual timeout:

  • resolveCronJobTimeoutMs(job) + STUCK_RUN_BUFFER_MS (5 min) for jobs with a resolved timeout
  • STUCK_RUN_FALLBACK_MS (30 min) for jobs where the resolver returns undefined

The startup unconditional clear in ops.ts is unchanged — this fix tightens the secondary safety net in the timer-tick path.
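
The two bullets above can be sketched as a single helper; this is an illustrative reconstruction based on the constant names in the PR description, not the literal contents of src/cron/service/jobs.ts:

```typescript
// Sketch of the per-job stuck-lock threshold described above. Constant names
// come from the PR; exact values and shape in jobs.ts may differ.
const STUCK_RUN_BUFFER_MS = 5 * 60_000; // 5-minute grace on top of the job's timeout
const STUCK_RUN_FALLBACK_MS = 30 * 60_000; // used when no timeout resolves

// jobTimeoutMs is the result of resolveCronJobTimeoutMs(job); undefined means
// the resolver found no applicable execution timeout for this job.
function stuckThresholdMs(jobTimeoutMs: number | undefined): number {
  return typeof jobTimeoutMs === "number"
    ? jobTimeoutMs + STUCK_RUN_BUFFER_MS
    : STUCK_RUN_FALLBACK_MS;
}

console.log(stuckThresholdMs(10 * 60_000)); // systemEvent default → 900000 (~15 min)
console.log(stuckThresholdMs(30 * 60_000)); // custom 30-min timeout → 2100000 (~35 min)
console.log(stuckThresholdMs(undefined)); // no resolved timeout → 1800000 (30 min)
```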

Testing

AI Disclosure

This PR was authored with AI assistance (OpenCode/Claude). The changes have been reviewed and tested locally. Degree of testing: fully tested against existing + new test suite.

Copilot AI review requested due to automatic review settings March 12, 2026 07:42
@greptile-apps
Contributor

greptile-apps bot commented Mar 12, 2026

Greptile Summary

This PR replaces the fixed 2-hour STUCK_RUN_MS constant with a per-job stuck-lock threshold (resolveCronJobTimeoutMs(job) + 5-minute buffer), so jobs with shorter execution timeouts no longer have to wait 2 hours for a stale lock to be cleared after an unclean shutdown.

Key changes:

  • STUCK_RUN_MS (2 h) removed; replaced with STUCK_RUN_BUFFER_MS (5 min) added on top of each job's resolved timeout, and STUCK_RUN_FALLBACK_MS (30 min) used when the timeout resolver returns undefined.
  • resolveCronJobTimeoutMs is now imported into jobs.ts and called per-job on every timer tick that has a runningAtMs marker.
  • Four new regression tests added covering the custom-timeout clearing path, the preserve-within-threshold path, and the systemEvent vs agentTurn comparison.

Potential concern: STUCK_RUN_FALLBACK_MS = 30 min applies to any job where resolveCronJobTimeoutMs returns undefined — specifically agentTurn jobs with timeoutSeconds ≤ 0 (i.e. explicitly disabled execution timeout). Before this PR those jobs used the 2-hour window; after this PR they use 30 minutes. A job that legitimately runs longer than 30 minutes with no execution timeout will have its runningAtMs lock cleared by the maintenance tick, leaving it eligible to run again while the original execution is still in progress. The PR description calls this intentional, but the scenario is worth careful validation against real-world timeoutSeconds: 0 usage in production.

Confidence Score: 3/5

  • Mostly safe to merge for standard jobs; needs validation for production use of timeoutSeconds: 0 or negative values before merging.
  • The implementation is correct and well-tested for the primary use-cases (jobs with positive timeoutSeconds or using the platform defaults). The new tests are mathematically sound. However, the STUCK_RUN_FALLBACK_MS = 30 min fallback represents a silent regression for jobs configured with timeoutSeconds ≤ 0 — their stuck window drops from 2 hours to 30 minutes, increasing the risk of duplicate concurrent executions for any such job that legitimately runs past 30 minutes. If no production jobs use timeoutSeconds ≤ 0, the risk is low; if they do, this could cause real double-execution bugs.
  • Pay close attention to src/cron/service/jobs.ts lines 44-48 (STUCK_RUN_FALLBACK_MS) and any callers that may create agentTurn jobs with timeoutSeconds: 0 or negative values.
Inline review comment on src/cron/service/jobs.ts, lines 44-48:

**Fallback threshold may clear locks for indefinitely-running jobs**

`STUCK_RUN_FALLBACK_MS = 30 min` applies to any job where `resolveCronJobTimeoutMs` returns `undefined` — which happens when `timeoutSeconds <= 0`. A user who sets `timeoutSeconds: 0` to mean "no execution timeout" (i.e. the job is allowed to run indefinitely) will now have their lock cleared after 30 minutes, down from the previous 2-hour default.

If such a job is legitimately running past the 30-minute mark, `normalizeJobTickState` will clear `runningAtMs`, `isJobDue` will see no lock, and the job can be scheduled again — producing a duplicate concurrent execution that the lock was specifically meant to prevent.

Before this PR, the fixed `STUCK_RUN_MS = 2h` applied uniformly, so an indefinite-timeout job had a 2-hour safety window. The PR's intent of tightening things is sound, but this particular edge case swings in the wrong direction: the one class of jobs most likely to run long (those with `timeoutSeconds ≤ 0`) now gets the *shortest* fallback threshold.

Consider using a larger fallback (e.g. the old `STUCK_RUN_MS = 2h`) specifically for `undefined`-timeout jobs, or documenting clearly that `timeoutSeconds: 0` should not be used with long-running jobs.
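
The edge case turns on when the resolver returns undefined. A minimal hypothetical model of resolveCronJobTimeoutMs — the real resolver may also apply per-kind defaults, e.g. 10 minutes for systemEvent — would be:

```typescript
type CronJob = { timeoutSeconds?: number };

// Hypothetical stand-in for resolveCronJobTimeoutMs; illustrative only.
function resolveCronJobTimeoutMs(job: CronJob): number | undefined {
  // Per the review, timeoutSeconds <= 0 means "no execution timeout".
  if (job.timeoutSeconds !== undefined && job.timeoutSeconds <= 0) return undefined;
  return job.timeoutSeconds !== undefined ? job.timeoutSeconds * 1000 : undefined;
}

console.log(resolveCronJobTimeoutMs({ timeoutSeconds: 0 })); // undefined → fallback threshold applies
console.log(resolveCronJobTimeoutMs({ timeoutSeconds: 1800 })); // 1800000 → timeout + buffer applies
```

Under this model, the "run indefinitely" jobs are exactly the ones routed to the fallback, which is why the fallback's size matters.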


Last reviewed commit: 82d454b

Contributor

Copilot AI left a comment

Pull request overview

This PR updates cron maintenance behavior to clear stale runningAtMs markers using each job’s configured execution timeout (plus a small buffer) rather than a fixed 2-hour window, helping the scheduler recover faster after crashes without leaving jobs blocked for long periods.

Changes:

  • Replace the fixed stuck-run threshold with a per-job threshold derived from resolveCronJobTimeoutMs(job) + a buffer, with a fallback when no timeout applies.
  • Expand regression coverage to assert stuck-marker clearing differs between job kinds and honors explicit timeoutSeconds.
  • Update test comments to reflect the new “per-job stuck threshold” semantics.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Files reviewed:

  • src/cron/service/jobs.ts — Clears stale runningAtMs markers based on the per-job timeout policy (plus buffer) instead of a fixed constant.
  • src/cron/service.issue-13992-regression.test.ts — Adds/updates regression tests to validate per-job stuck thresholds (including agentTurn vs systemEvent).
  • src/cron/service.armtimer-tight-loop.test.ts — Updates comment text to align with the new stuck-marker policy.



@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 82d454bcf0

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Replace fixed 2-hour STUCK_RUN_MS with per-job threshold based on
resolveCronJobTimeoutMs + 5-minute buffer. Jobs with 30-minute timeouts
now clear stale locks after ~35 minutes instead of waiting 2 hours,
preventing silent cron failures after fast gateway restarts.

Address review feedback:
- Bump STUCK_RUN_FALLBACK_MS from 30 min to 2h to preserve the safety
  window for jobs with timeoutSeconds <= 0 (no execution timeout).
- Use 60_000 numeric separator style to match codebase convention.
@shariqriazz shariqriazz force-pushed the fix/cron-stale-lock-stuck-threshold-30096 branch from d22d68f to 822a3b0 Compare March 12, 2026 08:26

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 822a3b04ba


Comment on lines +388 to +389:

    const stuckThresholdMs =
      typeof jobTimeoutMs === "number" ? jobTimeoutMs + STUCK_RUN_BUFFER_MS : STUCK_RUN_FALLBACK_MS;


P2: Clamp stuck threshold to runtime timeout ceiling

stuckThresholdMs now uses the raw resolveCronJobTimeoutMs(job) value, but timeoutSeconds is unbounded in the API schema (src/gateway/protocol/schema/cron.ts:12) while execution timeouts are enforced via setTimeout(..., jobTimeoutMs) (src/cron/service/timer.ts:80-84), which overflows above ~2^31-1 ms. In that overflow range, runs time out almost immediately but stale-lock cleanup waits for the huge computed threshold, so a crash/interrupted persist can leave runningAtMs blocking the job for days; this regression did not exist with the previous fixed 2-hour cleanup window.
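
The overflow Codex describes comes from Node's timer ceiling: setTimeout delays above 2^31 - 1 ms (~24.8 days) are clamped, so the execution timeout fires almost immediately while the computed stuck threshold stays enormous. A quick illustrative check (not code from this PR):

```typescript
// Node's maximum setTimeout delay; larger values are clamped to 1 ms.
const TIMEOUT_MAX_MS = 2 ** 31 - 1; // 2147483647

// Would this job's timeoutSeconds overflow the Node timer used in timer.ts?
function overflowsNodeTimer(timeoutSeconds: number): boolean {
  return timeoutSeconds * 1000 > TIMEOUT_MAX_MS;
}

console.log(overflowsNodeTimer(1_800)); // false — a 30-min timeout is fine
console.log(overflowsNodeTimer(3_000_000)); // true — ~34.7 days overflows
```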


@shariqriazz
Author

Fixed the P2 (clamp stuck threshold): clamped stuckThresholdMs to Math.min(jobTimeout + buffer, STUCK_RUN_FALLBACK_MS) so absurdly large timeoutSeconds values that overflow setTimeout's 2^31-1 ms limit degrade to the 2-hour fallback instead of leaving stuck locks for days.
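
A sketch of the clamped computation, assuming (per the earlier review commit) STUCK_RUN_FALLBACK_MS was bumped to 2 hours; the exact code in jobs.ts may differ:

```typescript
const STUCK_RUN_BUFFER_MS = 5 * 60_000; // 5 min
const STUCK_RUN_FALLBACK_MS = 2 * 60 * 60_000; // 2 h after review feedback

// Clamped per-job threshold: never exceeds the 2-hour fallback ceiling,
// so absurd timeoutSeconds values cannot stretch stale-lock cleanup for days.
function stuckThresholdMs(jobTimeoutMs: number | undefined): number {
  if (typeof jobTimeoutMs !== "number") return STUCK_RUN_FALLBACK_MS;
  return Math.min(jobTimeoutMs + STUCK_RUN_BUFFER_MS, STUCK_RUN_FALLBACK_MS);
}

console.log(stuckThresholdMs(30 * 60_000)); // 2100000 — normal timeouts unaffected
console.log(stuckThresholdMs(2 ** 40)); // 7200000 — absurd values clamp to 2 h
```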

@Takhoffman Takhoffman requested a review from a team as a code owner March 24, 2026 20:16


Development

Successfully merging this pull request may close these issues.

bug: Stale cron locks persist across gateway restarts
