fix(cron): treat transient tool error payloads as recoverable#29527
fix(cron): treat transient tool error payloads as recoverable#29527Takhoffman merged 1 commit intoopenclaw:mainfrom
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ebb4561143
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const hasSuccessfulPayloadAfterLastError = | ||
| lastErrorPayloadIndex >= 0 && | ||
| payloads | ||
| .slice(lastErrorPayloadIndex + 1) | ||
| .some((payload) => payload?.isError !== true && Boolean(payload?.text?.trim())); |
There was a problem hiding this comment.
Keep error status when post-error text is not true recovery
The new hasSuccessfulPayloadAfterLastError check treats any later non-error text as recovery, but embedded runs can legitimately produce that shape for failed turns: buildEmbeddedRunPayloads prepends an isError payload when lastAssistant.stopReason === "error" and may still append assistant text afterward (src/agents/pi-embedded-runner/run/payloads.ts, error push and answer-text append blocks). In that case cron now reports ok even though the underlying run ended in error, which can suppress failure alerts/retries for real model/tool failures. The recovery condition should be narrowed to known transient false-positive patterns rather than all post-error text.
Useful? React with 👍 / 👎.
Greptile SummaryRefines cron error handling to distinguish between fatal and transient tool errors. The logic now only marks runs as Key changes:
Why this matters: Implementation quality: Confidence Score: 5/5
Last reviewed commit: ebb4561 |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5532f4e993
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| !runLevelError && | ||
| lastErrorPayloadIndex >= 0 && | ||
| payloads | ||
| .slice(lastErrorPayloadIndex + 1) | ||
| .some((payload) => payload?.isError !== true && Boolean(payload?.text?.trim())); |
There was a problem hiding this comment.
Keep stopReason error payloads from being marked recovered
Fresh evidence: runEmbeddedPiAgent can return payloads without meta.error even after a failover-classified assistant failure when rotation is exhausted and no fallback is configured (src/agents/pi-embedded-runner/run.ts), and buildEmbeddedRunPayloads prepends an isError payload for lastAssistant.stopReason === "error" while still appending later assistant text (src/agents/pi-embedded-runner/run/payloads.ts). With this check, that shape is now treated as recovered (ok) solely because later non-error text exists, so failed turns with partial output are misclassified and can skip cron failure handling/retries.
Useful? React with 👍 / 👎.
…cation, hook bridge Comprehensive cron subsystem reliability overhaul addressing 9 open issues: **Configurable retry policy (openclaw#29527, openclaw#29546)** - Extract backoff logic into `retry-policy.ts` with configurable `cron.retryBackoff` schedule and `cron.stuckRunTimeoutMs` config fields - Classify execution errors as transient (retry with backoff) or terminal (disable job immediately) to prevent futile retry storms - Add Zod validation for new config fields **Fix delivery status reporting (openclaw#29660)** - `resolveDeliveryStatus` now accepts `deliveryAttempted` and returns "not-delivered" when delivery was attempted but not confirmed - Add explicit `delivered: false` to all error paths in delivery-dispatch.ts - Pass `deliveryAttempted` through the full result chain **Configurable response prefix (openclaw#29687)** - Add `responsePrefix` to CronServiceDeps (defaults to "Cron") - Use dep-injected prefix for main session summary messages **Strip directive tags from cron output (openclaw#29646)** - Strip `[[reply_to_current]]`, `[[reply_to:<id>]]`, `[[audio_as_voice]]` from isolated cron run output before delivery **Startup schedule preservation (openclaw#29690)** - Use maintenance-only recompute in `start()` to avoid advancing past-due `nextRunAtMs` values for jobs that `runMissedJobs` intentionally skipped - Clear `nextRunAtMs` for interrupted jobs so they get fresh schedule computation **Internal hook bridge (openclaw#29682)** - Add "cron" to `InternalHookEventType` with `CronExecutionHookEvent` type - Bridge `emitJobFinished` to `triggerInternalHook` so extensions can react to cron job completions (alerting, chaining, etc.) - Add `isCronExecutionEvent` type guard **Per-job skip-if-active flag (openclaw#29659)** - Add `skipIfRunActive?: boolean` to CronJob type for semantic overlap prevention Tests: 290 cron tests + 21 new retry-policy tests pass. Fixes: openclaw#29527, openclaw#29546, openclaw#29601, openclaw#29646, openclaw#29659, openclaw#29660, openclaw#29682, openclaw#29687, openclaw#29690
Introduce a recovery heuristic for cron isolated runs: when an isError payload is followed by a non-error payload with real text AND the run itself did not report a model/context-level error (meta.error), downgrade the error to non-fatal and report "ok". This prevents write-tool false-positive error payloads from marking the entire cron job as failed while still preserving genuine model/context errors even when partial assistant text follows them. Includes regression tests for both the transient-recovery and the run-level-error scenarios.
5532f4e to
b3df75a
Compare
|
PR #29527 - fix(cron): treat transient tool error payloads as recoverable (#29527) Merged via squash.
Thanks @Sid-Qin! |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b3df75a5dc
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| !runLevelError && | ||
| lastErrorPayloadIndex >= 0 && | ||
| payloads | ||
| .slice(lastErrorPayloadIndex + 1) | ||
| .some((payload) => payload?.isError !== true && Boolean(payload?.text?.trim())); |
There was a problem hiding this comment.
Keep stopReason errors from being treated as recovered
This recovery check keys only on runResult.meta.error, but embedded runs that end with lastAssistant.stopReason === "error" can still take the normal return path without meta.error (src/agents/pi-embedded-runner/run.ts), and buildEmbeddedRunPayloads prepends an isError payload for that stop reason while still appending assistant text (src/agents/pi-embedded-runner/run/payloads.ts). In that payload shape, later text makes hasSuccessfulPayloadAfterLastError true, so cron reports ok and skips failure handling/retries even though the underlying model/tool turn failed.
Useful? React with 👍 / 👎.
…aw#29527) thanks @Sid-Qin Verified: - pnpm install --frozen-lockfile - pnpm check - pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts Co-authored-by: Sid-Qin <[email protected]> Co-authored-by: Tak Hoffman <[email protected]>
…aw#29527) Manual port from upstream openclaw/openclaw d509a81. Tool wrappers can emit false-positive error payloads; when a non-error payload follows and there is no run-level error, mark the run "ok" instead of "error". Co-Authored-By: Claude Opus 4.6 <[email protected]>
…aw#29527) thanks @Sid-Qin Verified: - pnpm install --frozen-lockfile - pnpm check - pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts Co-authored-by: Sid-Qin <[email protected]> Co-authored-by: Tak Hoffman <[email protected]>
…aw#29527) thanks @Sid-Qin Verified: - pnpm install --frozen-lockfile - pnpm check - pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts Co-authored-by: Sid-Qin <[email protected]> Co-authored-by: Tak Hoffman <[email protected]>
…aw#29527) thanks @Sid-Qin Verified: - pnpm install --frozen-lockfile - pnpm check - pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts Co-authored-by: Sid-Qin <[email protected]> Co-authored-by: Tak Hoffman <[email protected]>
…aw#29527) thanks @Sid-Qin Verified: - pnpm install --frozen-lockfile - pnpm check - pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts Co-authored-by: Sid-Qin <[email protected]> Co-authored-by: Tak Hoffman <[email protected]>
…aw#29527) thanks @Sid-Qin Verified: - pnpm install --frozen-lockfile - pnpm check - pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts Co-authored-by: Sid-Qin <[email protected]> Co-authored-by: Tak Hoffman <[email protected]>
…aw#29527) thanks @Sid-Qin Verified: - pnpm install --frozen-lockfile - pnpm check - pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts Co-authored-by: Sid-Qin <[email protected]> Co-authored-by: Tak Hoffman <[email protected]>
…aw#29527) thanks @Sid-Qin Verified: - pnpm install --frozen-lockfile - pnpm check - pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts Co-authored-by: Sid-Qin <[email protected]> Co-authored-by: Tak Hoffman <[email protected]>
…aw#29527) thanks @Sid-Qin Verified: - pnpm install --frozen-lockfile - pnpm check - pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts Co-authored-by: Sid-Qin <[email protected]> Co-authored-by: Tak Hoffman <[email protected]>
…aw#29527) thanks @Sid-Qin Verified: - pnpm install --frozen-lockfile - pnpm check - pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts Co-authored-by: Sid-Qin <[email protected]> Co-authored-by: Tak Hoffman <[email protected]>
…aw#29527) thanks @Sid-Qin Verified: - pnpm install --frozen-lockfile - pnpm check - pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts Co-authored-by: Sid-Qin <[email protected]> Co-authored-by: Tak Hoffman <[email protected]>
…aw#29527) thanks @Sid-Qin Verified: - pnpm install --frozen-lockfile - pnpm check - pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts Co-authored-by: Sid-Qin <[email protected]> Co-authored-by: Tak Hoffman <[email protected]>
…aw#29527) thanks @Sid-Qin Verified: - pnpm install --frozen-lockfile - pnpm check - pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts Co-authored-by: Sid-Qin <[email protected]> Co-authored-by: Tak Hoffman <[email protected]> (cherry picked from commit d509a81)
* fix(cron): treat transient tool error payloads as recoverable (openclaw#29527) thanks @Sid-Qin Verified: - pnpm install --frozen-lockfile - pnpm check - pnpm test -- --run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts Co-authored-by: Sid-Qin <[email protected]> Co-authored-by: Tak Hoffman <[email protected]> (cherry picked from commit d509a81) * fix: adapt runLevelError to fork's AgentDeliveryResult type Upstream uses runResult.meta?.error (Pi-embedded style), fork has top-level runResult.error on AgentDeliveryResult. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> --------- Co-authored-by: Sid <[email protected]> Co-authored-by: Sid-Qin <[email protected]> Co-authored-by: Tak Hoffman <[email protected]> Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Summary
errorwhenever any payload hadisError=true, even if a later successful final payload existed.src/cron/isolated-agent/run.ts, payload-error handling now treats error payloads as terminal only when no later successful payload follows; added regression test insrc/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts.error), no delivery routing changes, and no changes to write tool implementation.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
User-visible / Behavior Changes
okinstead oferror.Security Impact (required)
NoNoNoNoNoRepro + Verification
Environment
Steps
isError=truepayload.Expected
okwith successful summary.Actual
errorstatus.error; recovered runs areok.Evidence
pnpm vitest run src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.tstreats transient error payloads as non-fatal when a later success payload exists.Human Verification (required)
error; error-then-success returnsok.Compatibility / Migration
YesNoNoFailure Recovery (if this breaks)
src/cron/isolated-agent/run.tsand its test file.Risks and Mitigations