fix(cron): treat transient tool error payloads as recoverable by Sid-Qin · Pull Request #29527 · openclaw/openclaw

Sid-Qin · 2026-02-28T06:48:10Z

Summary

Problem: Isolated cron runs marked status as error whenever any payload had isError=true, even if a later successful final payload existed.
Why it matters: False-positive tool errors (for example write-tool misclassification) can mark successful cron jobs as failed despite files being written.
What changed: In src/cron/isolated-agent/run.ts, payload-error handling now treats error payloads as terminal only when no later successful payload follows; added regression test in src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts.
What did NOT change: No change to pure error runs (still error), no delivery routing changes, and no changes to write tool implementation.

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Closes Cron job: write tool returns error but file is actually written #29462

User-visible / Behavior Changes

Cron jobs with transient tool error payloads followed by successful final output now report ok instead of error.

Security Impact (required)

New permissions/capabilities? No
Secrets/tokens handling changed? No
New/changed network calls? No
Command/tool execution surface changed? No
Data access scope changed? No

Repro + Verification

Environment

OS: macOS (dev)
Runtime: Node.js + pnpm/vitest
Integration/channel: cron isolated agent turn

Steps

Run isolated cron turn returning an isError=true payload.
Follow it with a later non-error payload containing successful text.

Expected

Final run status should be ok with successful summary.

Actual

Before fix: any error payload forced error status.
After fix: only terminal error payloads force error; recovered runs are ok.

Evidence

pnpm vitest run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts
Added test: treats transient error payloads as non-fatal when a later success payload exists.

Human Verification (required)

Verified scenarios: pure error payload still returns error; error-then-success returns ok.
Edge cases checked: summary still uses later success payload text.
What I did not verify: live cron execution against real write tool in production environment.

Compatibility / Migration

Backward compatible? Yes
Config/env changes? No
Migration needed? No

Failure Recovery (if this breaks)

How to disable/revert: revert this PR/commit.
Files/config to restore: src/cron/isolated-agent/run.ts and its test file.
Known bad symptoms: cron runs with mixed payloads may be incorrectly classified.

Risks and Mitigations

Risk: recovered runs could mask genuinely failed tasks that emit a misleading success payload later.
Mitigation: behavior only changes when a later non-error payload exists; pure-error paths remain unchanged and covered by tests.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ebb4561143

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-02-28T06:52:41Z

src/cron/isolated-agent/run.ts

+  const hasSuccessfulPayloadAfterLastError =
+    lastErrorPayloadIndex >= 0 &&
+    payloads
+      .slice(lastErrorPayloadIndex + 1)
+      .some((payload) => payload?.isError !== true && Boolean(payload?.text?.trim()));


Keep error status when post-error text is not true recovery

The new hasSuccessfulPayloadAfterLastError check treats any later non-error text as recovery, but embedded runs can legitimately produce that shape for failed turns: buildEmbeddedRunPayloads prepends an isError payload when lastAssistant.stopReason === "error" and may still append assistant text afterward (src/agents/pi-embedded-runner/run/payloads.ts, error push and answer-text append blocks). In that case cron now reports ok even though the underlying run ended in error, which can suppress failure alerts/retries for real model/tool failures. The recovery condition should be narrowed to known transient false-positive patterns rather than all post-error text.

Useful? React with 👍 / 👎.

greptile-apps · 2026-02-28T06:54:19Z

Greptile Summary

Refines cron error handling to distinguish between fatal and transient tool errors. The logic now only marks runs as error when an error payload is truly terminal (no successful payload follows). This prevents false-positive tool error payloads (like write-tool misclassifications) from incorrectly failing otherwise successful cron jobs.

Key changes:

Tracks the last error payload index and checks if any successful payload exists afterward
Replaces hasErrorPayload with hasFatalErrorPayload for status determination
Adds regression test verifying error-then-success scenarios return ok status
Preserves existing behavior: pure error runs still return error status

Why this matters:
Tool wrappers can emit transient error payloads before emitting a valid final success payload (e.g., write tool reports failure but file is actually written). Previously, any error payload forced the entire run to error status, even when later payloads indicated success.

Implementation quality:
The logic correctly handles various edge cases: multiple errors before success, empty payloads between error and success, and success-then-error scenarios. The code includes a clear explanatory comment and the test coverage adequately verifies both the new behavior and that existing error handling remains unchanged.

Confidence Score: 5/5

Safe to merge - well-tested bug fix with backward-compatible error handling
Clean implementation with sound logic verified across multiple edge cases. The change only affects error classification when successful payloads exist after errors, preserving all existing pure-error behavior. Test coverage is adequate with both new and existing tests confirming correctness. No security, compatibility, or architectural concerns.
No files require special attention

_{Last reviewed commit: ebb4561}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5532f4e993

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-02-28T07:56:18Z

src/cron/isolated-agent/run.ts

+    !runLevelError &&
+    lastErrorPayloadIndex >= 0 &&
+    payloads
+      .slice(lastErrorPayloadIndex + 1)
+      .some((payload) => payload?.isError !== true && Boolean(payload?.text?.trim()));


Keep stopReason error payloads from being marked recovered

Fresh evidence: runEmbeddedPiAgent can return payloads without meta.error even after a failover-classified assistant failure when rotation is exhausted and no fallback is configured (src/agents/pi-embedded-runner/run.ts), and buildEmbeddedRunPayloads prepends an isError payload for lastAssistant.stopReason === "error" while still appending later assistant text (src/agents/pi-embedded-runner/run/payloads.ts). With this check, that shape is now treated as recovered (ok) solely because later non-error text exists, so failed turns with partial output are misclassified and can skip cron failure handling/retries.

Useful? React with 👍 / 👎.

…cation, hook bridge Comprehensive cron subsystem reliability overhaul addressing 9 open issues: **Configurable retry policy (openclaw#29527, openclaw#29546)** - Extract backoff logic into `retry-policy.ts` with configurable `cron.retryBackoff` schedule and `cron.stuckRunTimeoutMs` config fields - Classify execution errors as transient (retry with backoff) or terminal (disable job immediately) to prevent futile retry storms - Add Zod validation for new config fields **Fix delivery status reporting (openclaw#29660)** - `resolveDeliveryStatus` now accepts `deliveryAttempted` and returns "not-delivered" when delivery was attempted but not confirmed - Add explicit `delivered: false` to all error paths in delivery-dispatch.ts - Pass `deliveryAttempted` through the full result chain **Configurable response prefix (openclaw#29687)** - Add `responsePrefix` to CronServiceDeps (defaults to "Cron") - Use dep-injected prefix for main session summary messages **Strip directive tags from cron output (openclaw#29646)** - Strip `[[reply_to_current]]`, `[[reply_to:<id>]]`, `[[audio_as_voice]]` from isolated cron run output before delivery **Startup schedule preservation (openclaw#29690)** - Use maintenance-only recompute in `start()` to avoid advancing past-due `nextRunAtMs` values for jobs that `runMissedJobs` intentionally skipped - Clear `nextRunAtMs` for interrupted jobs so they get fresh schedule computation **Internal hook bridge (openclaw#29682)** - Add "cron" to `InternalHookEventType` with `CronExecutionHookEvent` type - Bridge `emitJobFinished` to `triggerInternalHook` so extensions can react to cron job completions (alerting, chaining, etc.) - Add `isCronExecutionEvent` type guard **Per-job skip-if-active flag (openclaw#29659)** - Add `skipIfRunActive?: boolean` to CronJob type for semantic overlap prevention Tests: 290 cron tests + 21 new retry-policy tests pass. Fixes: openclaw#29527, openclaw#29546, openclaw#29601, openclaw#29646, openclaw#29659, openclaw#29660, openclaw#29682, openclaw#29687, openclaw#29690

Introduce a recovery heuristic for cron isolated runs: when an isError payload is followed by a non-error payload with real text AND the run itself did not report a model/context-level error (meta.error), downgrade the error to non-fatal and report "ok". This prevents write-tool false-positive error payloads from marking the entire cron job as failed while still preserving genuine model/context errors even when partial assistant text follows them. Includes regression tests for both the transient-recovery and the run-level-error scenarios.

Takhoffman · 2026-03-01T12:52:23Z

PR #29527 - fix(cron): treat transient tool error payloads as recoverable (#29527)

Merged via squash.

Merge commit: d509a81
Verified: pnpm install --frozen-lockfile, pnpm check, pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts
Changes made:
M CHANGELOG.md
M src/cron/isolated-agent/run.ts
M src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts
Why these changes were made:
Resolve false-positive cron failures when transient tool error payloads are followed by a successful payload, while keeping run-level errors fatal.
Changelog: CHANGELOG.md updated=true required=true opt_out=false

Thanks @Sid-Qin!

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b3df75a5dc

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-03-01T12:55:34Z

src/cron/isolated-agent/run.ts

+    !runLevelError &&
+    lastErrorPayloadIndex >= 0 &&
+    payloads
+      .slice(lastErrorPayloadIndex + 1)
+      .some((payload) => payload?.isError !== true && Boolean(payload?.text?.trim()));


Keep stopReason errors from being treated as recovered

This recovery check keys only on runResult.meta.error, but embedded runs that end with lastAssistant.stopReason === "error" can still take the normal return path without meta.error (src/agents/pi-embedded-runner/run.ts), and buildEmbeddedRunPayloads prepends an isError payload for that stop reason while still appending assistant text (src/agents/pi-embedded-runner/run/payloads.ts). In that payload shape, later text makes hasSuccessfulPayloadAfterLastError true, so cron reports ok and skips failure handling/retries even though the underlying model/tool turn failed.

Useful? React with 👍 / 👎.

@Sid-Qin