
fix(delivery-queue): increment retryCount on entries deferred when time budget exceeded #31192

Merged
steipete merged 9 commits into openclaw:main from stephenschoettler:fix/delivery-queue-retry-count
Mar 23, 2026

Conversation

@stephenschoettler
Contributor

Problem

When delivery recovery exceeded the 60s time budget, remaining pending entries were silently skipped with a log warning but no retryCount increment. This caused them to loop forever across restarts — they'd always defer, never hit MAX_RETRIES, and never move to failed/.

Two code paths were affected:

  1. The deadline check (`now >= deadline`)
  2. The backoff-exceeds-deadline check (`now + backoff >= deadline`)
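The two paths above can be sketched as follows. This is an illustrative reconstruction, not the actual delivery-queue.ts source; the names (`pending`, `deadline`, `backoffFor`, `Entry`) are assumed:

```typescript
type Entry = { id: string; retryCount: number };

// Returns the entries deferred by the time budget. Note the bug this PR
// fixes: deferred entries are returned with retryCount untouched, so on
// the next restart they look exactly like fresh entries.
function recoverSketch(
  pending: Entry[],
  deadline: number,
  backoffFor: (e: Entry) => number,
  now: () => number,
): Entry[] {
  for (let i = 0; i < pending.length; i++) {
    const t = now();
    // (1) deadline check: the budget is already spent
    if (t >= deadline) return pending.slice(i);
    // (2) backoff-exceeds-deadline check: waiting out the backoff would overrun the budget
    if (t + backoffFor(pending[i]) >= deadline) return pending.slice(i);
    // ...attempt delivery of pending[i] here...
  }
  return [];
}
```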

Fix

Call failDelivery() on each remaining entry before breaking out of the loop. This increments retryCount so that entries eventually exhaust MAX_RETRIES and are permanently skipped instead of spinning forever.

The existing failDelivery helper already handles the retryCount increment and stateDir write — no new logic needed.
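A minimal sketch of the fix, under the same assumed Entry shape: bump retryCount on each remaining entry (standing in for failDelivery(), which also persists the updated entry to stateDir) before breaking out of the loop:

```typescript
type Entry = { id: string; retryCount: number };

// Defer everything from fromIndex onward, incrementing retryCount so that
// persistently deferred entries eventually reach MAX_RETRIES.
function deferRemaining(pending: Entry[], fromIndex: number): Entry[] {
  const remaining = pending.slice(fromIndex);
  for (const e of remaining) {
    e.retryCount += 1; // what failDelivery() does, minus the disk write
  }
  return remaining;
}
```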

Fixes #24353

@openclaw-barnacle bot added labels gateway (Gateway runtime) and size: XS on Mar 2, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dba6aa848b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +238 to +240
```ts
for (const r of remaining) {
  try {
    await failDelivery(r.id, "Recovery time budget exceeded — deferred", opts.stateDir);
```

P1: Stop processing deferred entries after recovery deadline

Once now >= deadline, this loop performs a failDelivery read/write/rename for every remaining queue item before returning, which can take arbitrarily long on large queues or slow disks and effectively nullifies the documented maxRecoveryMs wall-clock cap for startup recovery. The same unbounded work pattern is also present in the now + backoff >= deadline branch, so recovery can still block startup well past the configured budget.


@greptile-apps
Contributor

greptile-apps bot commented Mar 2, 2026

Greptile Summary

Fixed critical bug where delivery queue entries deferred due to time budget exhaustion would loop forever without incrementing retryCount. The fix calls failDelivery() on remaining entries at both deadline checkpoints (immediate deadline and backoff-exceeds-deadline), ensuring they eventually reach MAX_RETRIES and move to failed/ instead of spinning indefinitely across restarts.

  • delivery-queue.ts: Increments retryCount for deferred entries at both time budget checkpoints, preventing infinite retry loops
  • chrome.ts: Changed stdio from "pipe" to explicit ignore array to prevent buffer blocking in Docker environments
  • usage.ts & session-cost-usage.ts: Added null guards to sort comparators to handle undefined totals gracefully

The implementation correctly reuses existing failDelivery() helper which handles atomic file updates, and wraps calls in try/catch for best-effort resilience.
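The null-guard change described for the sort comparators can be sketched like this; the property names (`totals`, `totalTokens`) are assumed from the crash message, not verified against usage.ts:

```typescript
// Null-safe descending comparator: treat a missing totals object as 0
// instead of crashing on `undefined.totalTokens`.
interface Row {
  name: string;
  totals?: { totalTokens: number };
}

function byTotalTokensDesc(a: Row, b: Row): number {
  return (b.totals?.totalTokens ?? 0) - (a.totals?.totalTokens ?? 0);
}

const rows: Row[] = [{ name: "empty" }, { name: "busy", totals: { totalTokens: 5 } }];
rows.sort(byTotalTokensDesc); // no crash when totals is undefined
```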

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The fix addresses a well-defined bug with a straightforward solution that reuses existing, tested helpers (failDelivery, moveToFailed). The logic correctly handles both deadline scenarios and maintains consistency with existing error handling patterns. Additional changes are defensive improvements with low risk.
  • No files require special attention

Last reviewed commit: dba6aa8

@stephenschoettler force-pushed the fix/delivery-queue-retry-count branch from dba6aa8 to 098d276 on March 2, 2026 04:04

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review


Reviewed commit: d8876ec76b


```diff
 if (now >= deadline) {
   const deferred = pending.length - recovered - failed - skippedMaxRetries - deferredBackoff;
-  opts.log.warn(`Recovery time budget exceeded — ${deferred} entries deferred to next restart`);
+  opts.log.warn(`Recovery time budget exceeded — remaining entries deferred to next startup`);
```

P2: Keep recovery warning text consistent with existing assertions

This changes the timeout warning from deferred to next restart to deferred to next startup, but recoverPendingDeliveries is still asserted against the old wording in src/infra/outbound/outbound.test.ts:448 (expect.stringContaining("deferred to next restart")). When maxRecoveryMs is exhausted, that test path now fails deterministically, so this commit will break the current recovery test suite unless the assertion or warning text is updated together.


@stephenschoettler
Contributor Author

The install-smoke failure is not caused by changes in this PR. The error is a pre-existing TypeScript type mismatch in src/browser/chrome.ts (line 319):

```
TS2322: Type 'ChildProcessByStdio<null, null, null>' is not assignable to type 'ChildProcessWithoutNullStreams'.
  Types of property 'stdin' are incompatible. Type 'null' is not assignable to type 'Writable'.
```

This PR only touches src/infra/outbound/delivery-queue.ts — no browser code at all. The chrome.ts issue is addressed in PR #27899. Merging #27899 first should clear the install-smoke check for this PR as well.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review


Reviewed commit: b7a97f1378



```diff
 return spawn(exe.path, args, {
-  stdio: "pipe",
+  stdio: ["ignore", "ignore", "ignore"],
```

P2: Keep Chrome stderr piped during launch attempts

When Chrome fails to come up (for example because of a bad flag, missing display, or sandbox issue), launchOpenClawChrome still tries to read proc.stderr and append it to the thrown startup error, but this stdio setting makes proc.stderr null. That regresses a previously actionable failure path into an opaque Failed to start Chrome CDP... message with no browser diagnostics, which will make real startup issues much harder to debug.
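A middle ground consistent with this point would be to discard only stdin and stdout while keeping stderr piped. This is a sketch, not the PR's actual change; `echo` stands in for the Chrome binary:

```typescript
import { spawn } from "node:child_process";

// stdin and stdout discarded; stderr stays readable so a failed launch
// can still surface browser diagnostics in the thrown startup error.
const proc = spawn("echo", ["ok"], {
  stdio: ["ignore", "ignore", "pipe"],
});

let diagnostics = "";
proc.stderr?.on("data", (chunk) => {
  diagnostics += chunk; // would be appended to the startup error message
});
```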


@stephenschoettler force-pushed the fix/delivery-queue-retry-count branch from b7a97f1 to 384bbba on March 23, 2026 00:22

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review


Reviewed commit: 384bbba21d


Comment on lines 346 to 348
```diff
 if (now >= deadline) {
   const deferred = pending.length - recovered - failed - skippedMaxRetries - deferredBackoff;
-  opts.log.warn(`Recovery time budget exceeded — ${deferred} entries deferred to next restart`);
+  opts.log.warn(`Recovery time budget exceeded — remaining entries deferred to next startup`);
   break;
```

P1: Update deferred retries before breaking on recovery timeout

When maxRecoveryMs is exhausted, recoverPendingDeliveries() still only logs and breaks, so every queued item after the current cursor keeps its old retryCount. In the exact large-queue/slow-startup case this commit claims to fix, those same tail entries can be deferred on every restart and never reach MAX_RETRIES, because failDelivery() is the only path here that increments the counter and it is never called from the timeout branch.


@stephenschoettler force-pushed the fix/delivery-queue-retry-count branch from 93f868b to fe3f992 on March 23, 2026 00:47

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review


Reviewed commit: fe3f992390


Comment on lines +352 to +355
```ts
await Promise.allSettled(
  pending
    .slice(i)
    .map((e) => failDelivery(e.id, "recovery time budget exceeded", opts.stateDir)),
```

P1: Stop consuming retry budget for entries never retried

When recovery hits the startup time budget, this now runs failDelivery(...) on every untouched queue entry. failDelivery increments both retryCount and lastAttemptAt, so repeated restarts with a small maxRecoveryMs can exhaust MAX_RETRIES and move messages to failed/ without a single delivery attempt. I verified this by replaying recoverPendingDeliveries(..., maxRecoveryMs: 0) five times: the next normal startup archives the entry instead of calling deliver.


@openclaw-barnacle bot added labels channel: discord, channel: nextcloud-talk, channel: telegram, cli (CLI command changes), scripts (Repository scripts), and extensions: tavily on Mar 23, 2026
Prevents crash when totals is undefined in byModel/byProvider/byAgent
sort comparators. Fixes 'Cannot read properties of undefined (reading
totalTokens)' crash that causes context overflow in active sessions.
…me budget exceeded

When delivery recovery ran out of the 60s time budget, remaining pending
entries were silently deferred to the next restart with no retryCount
increment. This caused them to loop forever across restarts, never hitting
MAX_RETRIES and never moving to failed/.

Fix: call failDelivery() on each remaining entry before breaking out of
the recovery loop (both the deadline check and the backoff-exceeds-deadline
check). This increments retryCount so that entries eventually exhaust
MAX_RETRIES and are permanently skipped.

Fixes openclaw#24353
… all remaining entries

P1-C: After now >= deadline, the old code would iterate all remaining queue
entries and call failDelivery() on each — O(n) work that nullified the
maxRecoveryMs wall-clock cap on large queues.

Fix: break out of the recovery loop immediately when the deadline is exceeded.
Remaining entries are picked up on next startup unchanged (retryCount not
incremented). The deadline means 'stop here', not 'fail everything remaining'.
… compat

The stdio tuple overload resolves differently across @types/node versions
(v20 vs v24/v25). Cast the spawn() result to ChildProcessWithoutNullStreams
to ensure proc.stderr?.on/off type-checks regardless of installed @types/node.
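The cast described in that commit message could look roughly like this (a sketch under assumed call shapes; the actual chrome.ts code may differ):

```typescript
import { spawn, type ChildProcessWithoutNullStreams } from "node:child_process";

// The tuple overload of spawn() infers different stream nullability across
// @types/node versions, so normalize the static type with a cast. `echo`
// stands in for the real executable path and arguments.
const proc = spawn("echo", ["ok"], {
  stdio: ["ignore", "pipe", "pipe"],
}) as unknown as ChildProcessWithoutNullStreams;

proc.stderr?.on("data", () => {}); // type-checks regardless of @types/node version
```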
stephenschoettler and others added 3 commits March 23, 2026 01:34
Codex P1: entries deferred by the recovery time budget kept retryCount=0
forever, so they could loop across restarts without ever reaching MAX_RETRIES.

After breaking on deadline, call failDelivery() for all remaining entries
so retryCount is incremented. Entries stay in queue until MAX_RETRIES is
reached and they are pruned normally.

Also updates the maxRecoveryMs test to assert retryCount=1 on deferred entries.
@steipete force-pushed the fix/delivery-queue-retry-count branch from 83e19dd to 30113a8 on March 23, 2026 01:39
@steipete steipete merged commit 2dd3425 into openclaw:main Mar 23, 2026
3 checks passed
@openclaw-barnacle bot removed labels channel: discord, channel: nextcloud-talk, channel: telegram, cli, and extensions: tavily on Mar 23, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review


Reviewed commit: 83e19dd045


Comment on lines +352 to +355
```ts
await Promise.allSettled(
  pending
    .slice(i)
    .map((e) => failDelivery(e.id, "recovery time budget exceeded", opts.stateDir)),
```

P1: Move timed-out entries to failed once they hit MAX_RETRIES

When maxRecoveryMs is already exhausted (for example 0 or a consistently overloaded startup), this branch only calls failDelivery(...) on the remaining entries and then breaks before the later entry.retryCount >= MAX_RETRIES path can run. That means the exact scenario this fix is trying to stop can still loop forever: after 7 consecutive recoverPendingDeliveries(..., maxRecoveryMs: 0) runs, I still had the entry in pending/ with retryCount: 7 and nothing in failed/. The timeout path needs to archive maxed-out entries instead of blindly incrementing them again.


@steipete
Contributor

Landed via temp rebase onto main.

  • Gate: pnpm check and OPENCLAW_TEST_PROFILE=low OPENCLAW_TEST_SERIAL_GATEWAY=1 pnpm test
  • Land commit: 30113a8
  • Merge commit: 2dd3425

Thanks @stephenschoettler!


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review


Reviewed commit: 30113a8156


Comment on lines 347 to +351
```diff
 if (now >= deadline) {
   const deferred = pending.length - recovered - failed - skippedMaxRetries - deferredBackoff;
-  opts.log.warn(`Recovery time budget exceeded — ${deferred} entries deferred to next restart`);
+  opts.log.warn(`Recovery time budget exceeded — remaining entries deferred to next startup`);
+  // Increment retryCount for all remaining entries so that entries which
+  // are consistently deferred by the time budget eventually reach
+  // MAX_RETRIES and are pruned rather than looping forever.
```

P1: Check MAX_RETRIES before deferring timed-out entries

If recovery starts already past the deadline (for example maxRecoveryMs: 0 or a consistently too-small budget), this timeout branch runs before the entry.retryCount >= MAX_RETRIES check below, so queued files are only rewritten with ever-higher retryCount values and are never moved to failed/. I reproduced this by calling recoverPendingDeliveries(..., maxRecoveryMs: 0) seven times: the entry reached retryCount=7 and still remained queued, so the advertised “eventually pruned” behavior never happens under persistent budget exhaustion.
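The guard this review asks for could be sketched as follows; the MAX_RETRIES value and the Entry shape are assumptions, and the archive/increment steps stand in for the real file moves:

```typescript
const MAX_RETRIES = 7; // assumed value, not confirmed by the source
type Entry = { id: string; retryCount: number };

// On timeout: archive entries that have already exhausted their retries
// instead of incrementing them again; only the rest get another bump.
function deferOrArchive(remaining: Entry[]): { deferred: Entry[]; archived: Entry[] } {
  const deferred: Entry[] = [];
  const archived: Entry[] = [];
  for (const e of remaining) {
    if (e.retryCount >= MAX_RETRIES) {
      archived.push(e); // would move the queue file to failed/
    } else {
      e.retryCount += 1; // would call failDelivery()
      deferred.push(e);
    }
  }
  return { deferred, archived };
}
```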



Labels

gateway (Gateway runtime), scripts (Repository scripts), size: S


Development

Successfully merging this pull request may close these issues.

[Bug]: Missing delivery queue monitoring leads to periodic gateway hangs

2 participants