Skip to content

fix(cron): return accepted run ticket and reconcile timeout (AI-assisted)#29546

Closed
HOYALIM wants to merge 3 commits intoopenclaw:mainfrom
HOYALIM:codex/pr-cron-run-timeout-acceptance
Closed

fix(cron): return accepted run ticket and reconcile timeout (AI-assisted)#29546
HOYALIM wants to merge 3 commits intoopenclaw:mainfrom
HOYALIM:codex/pr-cron-run-timeout-acceptance

Conversation

@HOYALIM
Copy link
Copy Markdown
Contributor

@HOYALIM HOYALIM commented Feb 28, 2026

Summary

  • Problem: cron run can time out at transport level while the job still executes later, causing clients/users to misclassify runs as failed and retrigger duplicates.
  • Why it matters: false failure signaling can cause duplicate force-runs and duplicate notifications.
  • What changed:
  • cron.run now returns an immediate accepted payload (accepted, requestId/runId, queuedAt, reconcile hint) and executes asynchronously.
  • Added force-run dedupe window (60s) for identical manual runs.
  • Added explicit not-accepted payload (RUN_NOT_ACCEPTED) for missing/running/not-due cases.
  • Added transport-timeout taxonomy in gateway call path (GatewayTransportTimeoutError) and CLI reconciliation (single cron.runs check on timeout-like responses).
  • Added runId propagation/filtering across cron events, run logs, schema, validators, and cron.runs.
  • What did NOT change (scope boundary):
  • No scheduler timing semantics rewrite.
  • No job payload execution model changes.
  • No auth/token/capability expansion.
  • AI assistance/testing note: AI-assisted. Testing degree: lightly tested (pnpm build, pnpm check, and scoped tests), with known unrelated baseline failures in full local pnpm test.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #
  • Related #

User-visible / Behavior Changes

  • cron.run now returns an immediate accepted ticket-style payload including requestId, runId, queuedAt, and reconciliation hint.
  • Duplicate manual force-run requests within 60s are deduped.
  • cron.runs supports runId filtering for reconciliation.
  • CLI cron run performs one reconciliation call on timeout-like transport errors instead of immediately presenting a hard failure.

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: local Node + pnpm
  • Model/provider: N/A
  • Integration/channel (if any): Gateway cron + CLI
  • Relevant config (redacted): local dev defaults

Steps

  1. Add a cron job and call cron.run in force mode.
  2. Simulate/observe transport-timeout path.
  3. Query cron.runs with returned runId.

Expected

  • Immediate accepted response with requestId/runId/queuedAt.
  • Timeout-like path is reconciled once via cron.runs.
  • Duplicate force-run request within dedupe window returns deduped accepted payload.

Actual

  • Matches expected behavior in scoped tests.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios:
  • Accepted response contract for cron.run
  • Deduped duplicate force runs
  • RUN_NOT_ACCEPTED payload shape
  • Timeout reconciliation in CLI
  • runId filtering in cron.runs
  • Edge cases checked:
  • Missing job id target
  • Job already running
  • Not-due run in due mode
  • What you did not verify:
  • Full local pnpm test green run (current local suite has unrelated baseline failures outside this PR scope)

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps:

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly:
  • Revert commits:
  • 8b3026708 (cron reliability changes)
  • fb5fae88f (type-check cast fix for test)
  • Files/config to restore:
  • src/gateway/server-methods/cron.ts
  • src/cli/cron-cli/register.cron-simple.ts
  • src/gateway/call.ts
  • src/cron/* touched files
  • Known bad symptoms reviewers should watch for:
  • cron.run returns synchronous execution result instead of accepted payload
  • Missing runId in run-log filtering or reconciliation paths

Risks and Mitigations

  • Risk: force-run dedupe window may suppress intentionally repeated manual runs within 60s.
  • Mitigation: dedupe is force-only, per-job, short-window, and response includes deduped: true.
  • Risk: clients expecting old cron.run immediate execution semantics may need adjustment.
  • Mitigation: accepted payload includes explicit reconcile contract and stable fields (requestId, runId, queuedAt).

@openclaw-barnacle openclaw-barnacle bot added app: web-ui App: web-ui gateway Gateway runtime cli CLI command changes agents Agent runtime and tooling size: L labels Feb 28, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8b3026708b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +391 to +395
if (
mode === "due" &&
typeof job.state.nextRunAtMs === "number" &&
Number.isFinite(job.state.nextRunAtMs) &&
job.state.nextRunAtMs > now
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Gate due-mode acceptance on full due predicate

The new cron.run fast-ack path only rejects due-mode requests when nextRunAtMs > now, but CronService.run() considers a job due only when it is enabled and has a numeric nextRunAtMs at or before now. As a result, disabled jobs (or jobs with missing/invalid nextRunAtMs) can be returned as status:"accepted" here and then silently resolve to ran:false in the async execution path, leaving a runId that never appears in cron.runs and causing incorrect reconciliation behavior.

Useful? React with 👍 / 👎.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Feb 28, 2026

Greptile Summary

This PR successfully addresses the cron run transport timeout problem by converting it to an async ticket-based model. The implementation adds immediate acceptance responses with requestId/runId, force-run deduplication (60s window), explicit not-accepted payloads, and CLI timeout reconciliation via cron.runs.

Key changes:

  • cron.run now returns accepted payload immediately and executes asynchronously (lines 410-446 in src/gateway/server-methods/cron.ts)
  • Added GatewayTransportTimeoutError class for proper timeout error taxonomy
  • Implemented 60s dedupe window for force runs to prevent duplicate manual triggers
  • Added runId propagation through events, run logs, and filtering in cron.runs
  • CLI now reconciles once via cron.runs on timeout-like responses instead of failing immediately

Architecture decisions:

  • Force mode checks "already running" before acceptance; due mode accepts first and lets cron service reject later. This means due mode clients may receive "accepted" for jobs that won't actually execute, requiring reconciliation
  • Dedupe check precedes "already running" check, so duplicate force-run requests return cached accepted payload even if job is currently running
  • runId equals requestId in current implementation (line 411), maintaining separate fields for potential future divergence

Testing coverage:
Tests cover acceptance tickets, deduplication, not-accepted payloads, runId filtering, and CLI timeout reconciliation. The type-cast changes in pi-embedded-runner-extraparams.test.ts are unrelated test fixes.

Confidence Score: 4/5

  • This PR is safe to merge with low risk - it's a well-structured reliability improvement with proper error handling and reconciliation mechanisms
  • Score of 4 reflects a solid implementation with comprehensive testing and proper error handling. Not a 5 because: (1) introduces a behavioral change from synchronous to asynchronous execution that requires client adaptation, (2) semantic inconsistency between force/due modes regarding "already running" checks could cause confusion, and (3) the ran: true field in accepted payload is semantically imprecise (indicates job will run, not that it has run). However, these are design trade-offs rather than bugs, and the reconciliation mechanism properly handles edge cases.
  • All files appear well-implemented. Key files to review for understanding: src/gateway/server-methods/cron.ts (core async logic), src/cli/cron-cli/register.cron-simple.ts (CLI reconciliation), and src/gateway/call.ts (timeout error taxonomy)

Last reviewed commit: 8b30267

@HOYALIM HOYALIM force-pushed the codex/pr-cron-run-timeout-acceptance branch from 8b30267 to 5eb02cf Compare February 28, 2026 08:04
@openclaw-barnacle openclaw-barnacle bot removed the agents Agent runtime and tooling label Feb 28, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5eb02cf40e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@openclaw-barnacle openclaw-barnacle bot added the app: macos App: macos label Feb 28, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 739c6659c6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

thomasxm pushed a commit to thomasxm/openclaw that referenced this pull request Feb 28, 2026
…cation, hook bridge

Comprehensive cron subsystem reliability overhaul addressing 9 open issues:

**Configurable retry policy (openclaw#29527, openclaw#29546)**
- Extract backoff logic into `retry-policy.ts` with configurable `cron.retryBackoff`
  schedule and `cron.stuckRunTimeoutMs` config fields
- Classify execution errors as transient (retry with backoff) or terminal
  (disable job immediately) to prevent futile retry storms
- Add Zod validation for new config fields

**Fix delivery status reporting (openclaw#29660)**
- `resolveDeliveryStatus` now accepts `deliveryAttempted` and returns
  "not-delivered" when delivery was attempted but not confirmed
- Add explicit `delivered: false` to all error paths in delivery-dispatch.ts
- Pass `deliveryAttempted` through the full result chain

**Configurable response prefix (openclaw#29687)**
- Add `responsePrefix` to CronServiceDeps (defaults to "Cron")
- Use dep-injected prefix for main session summary messages

**Strip directive tags from cron output (openclaw#29646)**
- Strip `[[reply_to_current]]`, `[[reply_to:<id>]]`, `[[audio_as_voice]]`
  from isolated cron run output before delivery

**Startup schedule preservation (openclaw#29690)**
- Use maintenance-only recompute in `start()` to avoid advancing past-due
  `nextRunAtMs` values for jobs that `runMissedJobs` intentionally skipped
- Clear `nextRunAtMs` for interrupted jobs so they get fresh schedule computation

**Internal hook bridge (openclaw#29682)**
- Add "cron" to `InternalHookEventType` with `CronExecutionHookEvent` type
- Bridge `emitJobFinished` to `triggerInternalHook` so extensions can react
  to cron job completions (alerting, chaining, etc.)
- Add `isCronExecutionEvent` type guard

**Per-job skip-if-active flag (openclaw#29659)**
- Add `skipIfRunActive?: boolean` to CronJob type for semantic overlap prevention

Tests: 290 cron tests + 21 new retry-policy tests pass.

Fixes: openclaw#29527, openclaw#29546, openclaw#29601, openclaw#29646, openclaw#29659, openclaw#29660, openclaw#29682, openclaw#29687, openclaw#29690
@Takhoffman Takhoffman added the close:not-planned PR close reason label Feb 28, 2026
@Takhoffman
Copy link
Copy Markdown
Contributor

Thanks for the detailed proposal and tests.

I’m closing this for now because it introduces a broad contract shift for cron.run (ticketing/acceptance/reconciliation semantics) that is too large to merge safely in one PR. We need smaller, incremental changes with tighter compatibility control.

Appreciate the thoughtful effort here.

@Takhoffman Takhoffman closed this Feb 28, 2026
@HOYALIM
Copy link
Copy Markdown
Contributor Author

HOYALIM commented Mar 1, 2026

Thanks for the thoughtful review and context, really appreciate you taking the time to dig into this. I’ll keep this branch around as a reference and will look for opportunities to upstream smaller, compatibility-focused pieces in future PRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

app: macos App: macos app: web-ui App: web-ui cli CLI command changes close:not-planned PR close reason gateway Gateway runtime size: L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants