Skip to content

fix(cron): treat transient tool error payloads as recoverable#29527

Merged
Takhoffman merged 1 commit intoopenclaw:mainfrom
Sid-Qin:fix/29462-cron-write-false-error
Mar 1, 2026
Merged

fix(cron): treat transient tool error payloads as recoverable#29527
Takhoffman merged 1 commit intoopenclaw:mainfrom
Sid-Qin:fix/29462-cron-write-false-error

Conversation

@Sid-Qin
Copy link
Copy Markdown
Contributor

@Sid-Qin Sid-Qin commented Feb 28, 2026

Summary

  • Problem: Isolated cron runs marked status as error whenever any payload had isError=true, even if a later successful final payload existed.
  • Why it matters: False-positive tool errors (for example write-tool misclassification) can mark successful cron jobs as failed despite files being written.
  • What changed: In src/cron/isolated-agent/run.ts, payload-error handling now treats error payloads as terminal only when no later successful payload follows; added regression test in src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts.
  • What did NOT change: No change to pure error runs (still error), no delivery routing changes, and no changes to write tool implementation.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • Cron jobs with transient tool error payloads followed by successful final output now report ok instead of error.

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: macOS (dev)
  • Runtime: Node.js + pnpm/vitest
  • Integration/channel: cron isolated agent turn

Steps

  1. Run isolated cron turn returning an isError=true payload.
  2. Follow it with a later non-error payload containing successful text.

Expected

  • Final run status should be ok with successful summary.

Actual

  • Before fix: any error payload forced error status.
  • After fix: only terminal error payloads force error; recovered runs are ok.

Evidence

  • pnpm vitest run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts
  • Added test: treats transient error payloads as non-fatal when a later success payload exists.

Human Verification (required)

  • Verified scenarios: pure error payload still returns error; error-then-success returns ok.
  • Edge cases checked: summary still uses later success payload text.
  • What I did not verify: live cron execution against real write tool in production environment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Failure Recovery (if this breaks)

  • How to disable/revert: revert this PR/commit.
  • Files/config to restore: src/cron/isolated-agent/run.ts and its test file.
  • Known bad symptoms: cron runs with mixed payloads may be incorrectly classified.

Risks and Mitigations

  • Risk: recovered runs could mask genuinely failed tasks that emit a misleading success payload later.
  • Mitigation: behavior only changes when a later non-error payload exists; pure-error paths remain unchanged and covered by tests.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ebb4561143

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +548 to +552
const hasSuccessfulPayloadAfterLastError =
lastErrorPayloadIndex >= 0 &&
payloads
.slice(lastErrorPayloadIndex + 1)
.some((payload) => payload?.isError !== true && Boolean(payload?.text?.trim()));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep error status when post-error text is not true recovery

The new hasSuccessfulPayloadAfterLastError check treats any later non-error text as recovery, but embedded runs can legitimately produce that shape for failed turns: buildEmbeddedRunPayloads prepends an isError payload when lastAssistant.stopReason === "error" and may still append assistant text afterward (src/agents/pi-embedded-runner/run/payloads.ts, error push and answer-text append blocks). In that case cron now reports ok even though the underlying run ended in error, which can suppress failure alerts/retries for real model/tool failures. The recovery condition should be narrowed to known transient false-positive patterns rather than all post-error text.

Useful? React with 👍 / 👎.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Feb 28, 2026

Greptile Summary

Refines cron error handling to distinguish between fatal and transient tool errors. The logic now only marks runs as error when an error payload is truly terminal (no successful payload follows). This prevents false-positive tool error payloads (like write-tool misclassifications) from incorrectly failing otherwise successful cron jobs.

Key changes:

  • Tracks the last error payload index and checks if any successful payload exists afterward
  • Replaces hasErrorPayload with hasFatalErrorPayload for status determination
  • Adds regression test verifying error-then-success scenarios return ok status
  • Preserves existing behavior: pure error runs still return error status

Why this matters:
Tool wrappers can emit transient error payloads before emitting a valid final success payload (e.g., write tool reports failure but file is actually written). Previously, any error payload forced the entire run to error status, even when later payloads indicated success.

Implementation quality:
The logic correctly handles various edge cases: multiple errors before success, empty payloads between error and success, and success-then-error scenarios. The code includes a clear explanatory comment and the test coverage adequately verifies both the new behavior and that existing error handling remains unchanged.

Confidence Score: 5/5

  • Safe to merge - well-tested bug fix with backward-compatible error handling
  • Clean implementation with sound logic verified across multiple edge cases. The change only affects error classification when successful payloads exist after errors, preserving all existing pure-error behavior. Test coverage is adequate with both new and existing tests confirming correctness. No security, compatibility, or architectural concerns.
  • No files require special attention

Last reviewed commit: ebb4561

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5532f4e993

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +550 to +554
!runLevelError &&
lastErrorPayloadIndex >= 0 &&
payloads
.slice(lastErrorPayloadIndex + 1)
.some((payload) => payload?.isError !== true && Boolean(payload?.text?.trim()));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep stopReason error payloads from being marked recovered

Fresh evidence: runEmbeddedPiAgent can return payloads without meta.error even after a failover-classified assistant failure when rotation is exhausted and no fallback is configured (src/agents/pi-embedded-runner/run.ts), and buildEmbeddedRunPayloads prepends an isError payload for lastAssistant.stopReason === "error" while still appending later assistant text (src/agents/pi-embedded-runner/run/payloads.ts). With this check, that shape is now treated as recovered (ok) solely because later non-error text exists, so failed turns with partial output are misclassified and can skip cron failure handling/retries.

Useful? React with 👍 / 👎.

thomasxm pushed a commit to thomasxm/openclaw that referenced this pull request Feb 28, 2026
…cation, hook bridge

Comprehensive cron subsystem reliability overhaul addressing 9 open issues:

**Configurable retry policy (openclaw#29527, openclaw#29546)**
- Extract backoff logic into `retry-policy.ts` with configurable `cron.retryBackoff`
  schedule and `cron.stuckRunTimeoutMs` config fields
- Classify execution errors as transient (retry with backoff) or terminal
  (disable job immediately) to prevent futile retry storms
- Add Zod validation for new config fields

**Fix delivery status reporting (openclaw#29660)**
- `resolveDeliveryStatus` now accepts `deliveryAttempted` and returns
  "not-delivered" when delivery was attempted but not confirmed
- Add explicit `delivered: false` to all error paths in delivery-dispatch.ts
- Pass `deliveryAttempted` through the full result chain

**Configurable response prefix (openclaw#29687)**
- Add `responsePrefix` to CronServiceDeps (defaults to "Cron")
- Use dep-injected prefix for main session summary messages

**Strip directive tags from cron output (openclaw#29646)**
- Strip `[[reply_to_current]]`, `[[reply_to:<id>]]`, `[[audio_as_voice]]`
  from isolated cron run output before delivery

**Startup schedule preservation (openclaw#29690)**
- Use maintenance-only recompute in `start()` to avoid advancing past-due
  `nextRunAtMs` values for jobs that `runMissedJobs` intentionally skipped
- Clear `nextRunAtMs` for interrupted jobs so they get fresh schedule computation

**Internal hook bridge (openclaw#29682)**
- Add "cron" to `InternalHookEventType` with `CronExecutionHookEvent` type
- Bridge `emitJobFinished` to `triggerInternalHook` so extensions can react
  to cron job completions (alerting, chaining, etc.)
- Add `isCronExecutionEvent` type guard

**Per-job skip-if-active flag (openclaw#29659)**
- Add `skipIfRunActive?: boolean` to CronJob type for semantic overlap prevention

Tests: 290 cron tests + 21 new retry-policy tests pass.

Fixes: openclaw#29527, openclaw#29546, openclaw#29601, openclaw#29646, openclaw#29659, openclaw#29660, openclaw#29682, openclaw#29687, openclaw#29690
Introduce a recovery heuristic for cron isolated runs: when an
isError payload is followed by a non-error payload with real text
AND the run itself did not report a model/context-level error
(meta.error), downgrade the error to non-fatal and report "ok".

This prevents write-tool false-positive error payloads from
marking the entire cron job as failed while still preserving
genuine model/context errors even when partial assistant text
follows them.

Includes regression tests for both the transient-recovery and
the run-level-error scenarios.
@Takhoffman Takhoffman force-pushed the fix/29462-cron-write-false-error branch from 5532f4e to b3df75a Compare March 1, 2026 12:52
@Takhoffman Takhoffman merged commit d509a81 into openclaw:main Mar 1, 2026
9 checks passed
@Takhoffman
Copy link
Copy Markdown
Contributor

PR #29527 - fix(cron): treat transient tool error payloads as recoverable (#29527)

Merged via squash.

  • Merge commit: d509a81
  • Verified: pnpm install --frozen-lockfile, pnpm check, pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts
  • Changes made:
    M CHANGELOG.md
    M src/cron/isolated-agent/run.ts
    M src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts
  • Why these changes were made:
    Resolve false-positive cron failures when transient tool error payloads are followed by a successful payload, while keeping run-level errors fatal.
  • Changelog: CHANGELOG.md updated=true required=true opt_out=false

Thanks @Sid-Qin!

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b3df75a5dc

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +576 to +580
!runLevelError &&
lastErrorPayloadIndex >= 0 &&
payloads
.slice(lastErrorPayloadIndex + 1)
.some((payload) => payload?.isError !== true && Boolean(payload?.text?.trim()));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep stopReason errors from being treated as recovered

This recovery check keys only on runResult.meta.error, but embedded runs that end with lastAssistant.stopReason === "error" can still take the normal return path without meta.error (src/agents/pi-embedded-runner/run.ts), and buildEmbeddedRunPayloads prepends an isError payload for that stop reason while still appending assistant text (src/agents/pi-embedded-runner/run/payloads.ts). In that payload shape, later text makes hasSuccessfulPayloadAfterLastError true, so cron reports ok and skips failure handling/retries even though the underlying model/tool turn failed.

Useful? React with 👍 / 👎.

hwb96 pushed a commit to hwb96/openclaw that referenced this pull request Mar 1, 2026
…aw#29527) thanks @Sid-Qin

Verified:
- pnpm install --frozen-lockfile
- pnpm check
- pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts

Co-authored-by: Sid-Qin <[email protected]>
Co-authored-by: Tak Hoffman <[email protected]>
zooqueen added a commit to hanzoai/bot that referenced this pull request Mar 1, 2026
…aw#29527)

Manual port from upstream openclaw/openclaw d509a81.
Tool wrappers can emit false-positive error payloads; when a non-error
payload follows and there is no run-level error, mark the run "ok"
instead of "error".

Co-Authored-By: Claude Opus 4.6 <[email protected]>
ansh pushed a commit to vibecode/openclaw that referenced this pull request Mar 2, 2026
…aw#29527) thanks @Sid-Qin

Verified:
- pnpm install --frozen-lockfile
- pnpm check
- pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts

Co-authored-by: Sid-Qin <[email protected]>
Co-authored-by: Tak Hoffman <[email protected]>
steipete pushed a commit to Sid-Qin/openclaw that referenced this pull request Mar 2, 2026
…aw#29527) thanks @Sid-Qin

Verified:
- pnpm install --frozen-lockfile
- pnpm check
- pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts

Co-authored-by: Sid-Qin <[email protected]>
Co-authored-by: Tak Hoffman <[email protected]>
safzanpirani pushed a commit to safzanpirani/clawdbot that referenced this pull request Mar 2, 2026
…aw#29527) thanks @Sid-Qin

Verified:
- pnpm install --frozen-lockfile
- pnpm check
- pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts

Co-authored-by: Sid-Qin <[email protected]>
Co-authored-by: Tak Hoffman <[email protected]>
steipete pushed a commit to Sid-Qin/openclaw that referenced this pull request Mar 2, 2026
…aw#29527) thanks @Sid-Qin

Verified:
- pnpm install --frozen-lockfile
- pnpm check
- pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts

Co-authored-by: Sid-Qin <[email protected]>
Co-authored-by: Tak Hoffman <[email protected]>
robertchang-ga pushed a commit to robertchang-ga/openclaw that referenced this pull request Mar 2, 2026
…aw#29527) thanks @Sid-Qin

Verified:
- pnpm install --frozen-lockfile
- pnpm check
- pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts

Co-authored-by: Sid-Qin <[email protected]>
Co-authored-by: Tak Hoffman <[email protected]>
hanqizheng pushed a commit to hanqizheng/openclaw that referenced this pull request Mar 2, 2026
…aw#29527) thanks @Sid-Qin

Verified:
- pnpm install --frozen-lockfile
- pnpm check
- pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts

Co-authored-by: Sid-Qin <[email protected]>
Co-authored-by: Tak Hoffman <[email protected]>
execute008 pushed a commit to execute008/openclaw that referenced this pull request Mar 2, 2026
…aw#29527) thanks @Sid-Qin

Verified:
- pnpm install --frozen-lockfile
- pnpm check
- pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts

Co-authored-by: Sid-Qin <[email protected]>
Co-authored-by: Tak Hoffman <[email protected]>
dorgonman pushed a commit to kanohorizonia/openclaw that referenced this pull request Mar 3, 2026
…aw#29527) thanks @Sid-Qin

Verified:
- pnpm install --frozen-lockfile
- pnpm check
- pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts

Co-authored-by: Sid-Qin <[email protected]>
Co-authored-by: Tak Hoffman <[email protected]>
sachinkundu pushed a commit to sachinkundu/openclaw that referenced this pull request Mar 6, 2026
…aw#29527) thanks @Sid-Qin

Verified:
- pnpm install --frozen-lockfile
- pnpm check
- pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts

Co-authored-by: Sid-Qin <[email protected]>
Co-authored-by: Tak Hoffman <[email protected]>
zooqueen added a commit to hanzoai/bot that referenced this pull request Mar 6, 2026
…aw#29527)

Manual port from upstream openclaw/openclaw d509a81.
Tool wrappers can emit false-positive error payloads; when a non-error
payload follows and there is no run-level error, mark the run "ok"
instead of "error".
zooqueen pushed a commit to hanzoai/bot that referenced this pull request Mar 6, 2026
…aw#29527) thanks @Sid-Qin

Verified:
- pnpm install --frozen-lockfile
- pnpm check
- pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts

Co-authored-by: Sid-Qin <[email protected]>
Co-authored-by: Tak Hoffman <[email protected]>
Mateljan1 pushed a commit to Mateljan1/openclaw that referenced this pull request Mar 7, 2026
…aw#29527) thanks @Sid-Qin

Verified:
- pnpm install --frozen-lockfile
- pnpm check
- pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts

Co-authored-by: Sid-Qin <[email protected]>
Co-authored-by: Tak Hoffman <[email protected]>
alexey-pelykh pushed a commit to remoteclaw/remoteclaw that referenced this pull request Mar 16, 2026
…aw#29527) thanks @Sid-Qin

Verified:
- pnpm install --frozen-lockfile
- pnpm check
- pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts

Co-authored-by: Sid-Qin <[email protected]>
Co-authored-by: Tak Hoffman <[email protected]>
(cherry picked from commit d509a81)
alexey-pelykh added a commit to remoteclaw/remoteclaw that referenced this pull request Mar 16, 2026
* fix(cron): treat transient tool error payloads as recoverable (openclaw#29527) thanks @Sid-Qin

Verified:
- pnpm install --frozen-lockfile
- pnpm check
- pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts

Co-authored-by: Sid-Qin <[email protected]>
Co-authored-by: Tak Hoffman <[email protected]>
(cherry picked from commit d509a81)

* fix: adapt runLevelError to fork's AgentDeliveryResult type

Upstream uses runResult.meta?.error (Pi-embedded style), fork has
top-level runResult.error on AgentDeliveryResult.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

---------

Co-authored-by: Sid <[email protected]>
Co-authored-by: Sid-Qin <[email protected]>
Co-authored-by: Tak Hoffman <[email protected]>
Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cron job: write tool returns error but file is actually written

2 participants