fix(cron): use per-job timeout for stuck lock threshold #43816

shariqriazz wants to merge 3 commits into openclaw:main
Conversation
Greptile Summary

This PR replaces the fixed 2-hour stuck-run threshold with a per-job threshold derived from each job's execution timeout.

Confidence Score: 3/5
This is a comment left during a code review.
Path: src/cron/service/jobs.ts
Line: 44-48
Comment:
**Fallback threshold may clear locks for indefinitely-running jobs**
`STUCK_RUN_FALLBACK_MS = 30 min` applies to any job where `resolveCronJobTimeoutMs` returns `undefined` — which happens when `timeoutSeconds <= 0`. A user who sets `timeoutSeconds: 0` to mean "no execution timeout" (i.e. the job is allowed to run indefinitely) will now have their lock cleared after 30 minutes, down from the previous 2-hour default.
If such a job is legitimately running past the 30-minute mark, `normalizeJobTickState` will clear `runningAtMs`, `isJobDue` will see no lock, and the job can be scheduled again — producing a duplicate concurrent execution that the lock was specifically meant to prevent.
Before this PR, the fixed `STUCK_RUN_MS = 2h` applied uniformly, so an indefinite-timeout job had a 2-hour safety window. The PR's intent of tightening things is sound, but this particular edge case swings in the wrong direction: the one class of jobs most likely to run long (those with `timeoutSeconds ≤ 0`) now gets the *shortest* fallback threshold.
Consider using a larger fallback (e.g. the old `STUCK_RUN_MS = 2h`) specifically for `undefined`-timeout jobs, or documenting clearly that `timeoutSeconds: 0` should not be used with long-running jobs.
How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: 82d454b
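One way to implement the comment's suggestion is sketched below. It is not the PR's actual code: the constant names follow the diff in this thread, the 2-hour fallback value is the pre-PR `STUCK_RUN_MS` default, and the helper is a stand-in for the threshold expression.

```typescript
// Sketch: per-job stuck threshold with a conservative fallback.
// STUCK_RUN_BUFFER_MS mirrors the PR's 5-minute buffer; the fallback is
// raised to the old 2-hour STUCK_RUN_MS so jobs whose resolver returns
// undefined (timeoutSeconds <= 0, i.e. "run indefinitely") keep the
// pre-PR safety window instead of the shortest one.
const STUCK_RUN_BUFFER_MS = 5 * 60_000;
const STUCK_RUN_FALLBACK_MS = 2 * 60 * 60_000; // old STUCK_RUN_MS, not 30 min

function stuckThresholdMs(jobTimeoutMs: number | undefined): number {
  return typeof jobTimeoutMs === "number"
    ? jobTimeoutMs + STUCK_RUN_BUFFER_MS
    : STUCK_RUN_FALLBACK_MS;
}

// A 30-minute job clears after ~35 minutes; an indefinite job keeps 2 hours.
console.log(stuckThresholdMs(30 * 60_000)); // 2100000
console.log(stuckThresholdMs(undefined)); // 7200000
```

With this shape, only the fallback constant changes; jobs with a resolved timeout keep the tightened per-job window the PR introduces.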
Pull request overview
This PR updates cron maintenance behavior to clear stale runningAtMs markers using each job’s configured execution timeout (plus a small buffer) rather than a fixed 2-hour window, helping the scheduler recover faster after crashes without leaving jobs blocked for long periods.
Changes:
- Replace the fixed stuck-run threshold with a per-job threshold derived from `resolveCronJobTimeoutMs(job)` + a buffer, with a fallback when no timeout applies.
- Expand regression coverage to assert stuck-marker clearing differs between job kinds and honors explicit `timeoutSeconds`.
- Update test comments to reflect the new "per-job stuck threshold" semantics.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/cron/service/jobs.ts | Clears stale runningAtMs markers based on per-job timeout policy (plus buffer) instead of a fixed constant. |
| src/cron/service.issue-13992-regression.test.ts | Adds/updates regression tests to validate per-job stuck thresholds (including agentTurn vs systemEvent). |
| src/cron/service.armtimer-tight-loop.test.ts | Updates comment text to align with the new stuck-marker policy. |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 82d454bcf0
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Replace fixed 2-hour STUCK_RUN_MS with per-job threshold based on resolveCronJobTimeoutMs + 5-minute buffer. Jobs with 30-minute timeouts now clear stale locks after ~35 minutes instead of waiting 2 hours, preventing silent cron failures after fast gateway restarts.
Address review feedback:
- Bump STUCK_RUN_FALLBACK_MS from 30 min to 2 h to preserve the safety window for jobs with timeoutSeconds <= 0 (no execution timeout).
- Use 60_000 numeric-separator style to match codebase convention.
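Taken together, the two commits imply tick-time cleanup along these lines. This is a sketch, not the PR's source: the job shape and the resolver body are hypothetical stand-ins inferred from the discussion, and only the threshold expression appears verbatim in the diff excerpt later in this thread.

```typescript
// Sketch of the timer-tick stale-lock cleanup described in this PR.
interface CronJobState {
  runningAtMs?: number; // lock: set when a run starts
  timeoutSeconds?: number; // <= 0 means "no execution timeout"
}

const STUCK_RUN_BUFFER_MS = 5 * 60_000;
const STUCK_RUN_FALLBACK_MS = 2 * 60 * 60_000; // bumped from 30 min per review

// Stand-in for the real resolver: undefined when no execution timeout applies.
function resolveCronJobTimeoutMs(job: CronJobState): number | undefined {
  if (job.timeoutSeconds === undefined || job.timeoutSeconds <= 0) return undefined;
  return job.timeoutSeconds * 1000;
}

function clearStuckLock(job: CronJobState, nowMs: number): void {
  const jobTimeoutMs = resolveCronJobTimeoutMs(job);
  const stuckThresholdMs =
    typeof jobTimeoutMs === "number" ? jobTimeoutMs + STUCK_RUN_BUFFER_MS : STUCK_RUN_FALLBACK_MS;
  if (job.runningAtMs !== undefined && nowMs - job.runningAtMs > stuckThresholdMs) {
    delete job.runningAtMs; // stale lock: let the scheduler re-arm the job
  }
}
```

Under this sketch, a job with `timeoutSeconds: 1800` whose lock is 36 minutes old is cleared, while the same lock at 34 minutes survives.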
Force-pushed d22d68f to 822a3b0
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 822a3b04ba
src/cron/service/jobs.ts (outdated)

    const stuckThresholdMs =
      typeof jobTimeoutMs === "number" ? jobTimeoutMs + STUCK_RUN_BUFFER_MS : STUCK_RUN_FALLBACK_MS;
Clamp stuck threshold to runtime timeout ceiling
`stuckThresholdMs` now uses the raw `resolveCronJobTimeoutMs(job)` value, but `timeoutSeconds` is unbounded in the API schema (src/gateway/protocol/schema/cron.ts:12), while execution timeouts are enforced via `setTimeout(..., jobTimeoutMs)` (src/cron/service/timer.ts:80-84), which overflows above ~2^31-1 ms. In that overflow range, runs time out almost immediately, but stale-lock cleanup waits for the huge computed threshold, so a crash or interrupted persist can leave `runningAtMs` blocking the job for days. This regression did not exist with the previous fixed 2-hour cleanup window.
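A minimal clamp addressing this could look like the sketch below. The ceiling matches the `setTimeout` limit mentioned above; `MAX_STUCK_THRESHOLD_MS` is a hypothetical name, not a constant from the PR.

```typescript
// Sketch: clamp the per-job stuck threshold so an unbounded timeoutSeconds
// cannot push stale-lock cleanup out for days. Node's setTimeout clamps
// delays above 2^31 - 1 ms (firing almost immediately), so a threshold past
// that ceiling cannot correspond to a real execution timeout anyway.
const STUCK_RUN_BUFFER_MS = 5 * 60_000;
const STUCK_RUN_FALLBACK_MS = 2 * 60 * 60_000;
const MAX_STUCK_THRESHOLD_MS = 2 ** 31 - 1; // hypothetical ceiling constant

function clampedStuckThresholdMs(jobTimeoutMs: number | undefined): number {
  const raw =
    typeof jobTimeoutMs === "number" ? jobTimeoutMs + STUCK_RUN_BUFFER_MS : STUCK_RUN_FALLBACK_MS;
  return Math.min(raw, MAX_STUCK_THRESHOLD_MS);
}
```

This keeps the per-job behavior for sane timeouts and restores a bounded cleanup window for pathological ones.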
Fixed the P2 (clamp stuck threshold).
Summary

- Replaces the fixed `STUCK_RUN_MS` constant with a per-job stuck threshold derived from `resolveCronJobTimeoutMs()` + a 5-minute buffer
- Jobs with an explicit `timeoutSeconds` (e.g. 30 min) clear after ~35 minutes instead of 2 hours

Fixes #30096
Problem

When the gateway crashes mid-execution and restarts within 2 hours, the `runningAtMs` lock persists in `jobs.json`. The startup path (`start()` in `ops.ts`) unconditionally clears all locks, but the timer-tick maintenance path (`normalizeJobTickState` in `jobs.ts`) only clears locks older than `STUCK_RUN_MS` (2 hours). If the startup clear succeeds but a subsequent timer tick re-reads stale on-disk state, or if the lock gets re-set during `runMissedJobs` and the process crashes again, the 2-hour window keeps the job permanently blocked.

Users with 30-minute job timeouts observed jobs stuck for hours after unclean shutdowns, with no warning.
Fix

The stuck threshold now scales with each job's actual timeout:

- `resolveCronJobTimeoutMs(job) + STUCK_RUN_BUFFER_MS` (5 min) for jobs with a resolved timeout
- `STUCK_RUN_FALLBACK_MS` (30 min) for jobs where the resolver returns `undefined`

The startup unconditional clear in `ops.ts` is unchanged — this fix tightens the secondary safety net in the timer-tick path.

Testing

- New regression tests in `service.issue-13992-regression.test.ts` covering per-job stuck thresholds (including `agentTurn` vs `systemEvent`)
- Updated comment in `service.armtimer-tight-loop.test.ts` to reflect the per-job threshold
- `pnpm build`, `pnpm check` pass

AI Disclosure
This PR was authored with AI assistance (OpenCode/Claude). The changes have been reviewed and tested locally. Degree of testing: fully tested against existing + new test suite.