
feat(outbox): write-ahead outbox with recovery worker and delivery tracking#30009

Open
nohat wants to merge 17 commits into openclaw:main from nohat:lifecycle/write-ahead-recovery

Conversation

@nohat
Contributor

@nohat nohat commented Feb 28, 2026

Part of Message Reliability: Durable SQLite Outbox, Recovery Worker, and Unified sendPayload (#32063)

Summary

  • Problem: If the gateway crashes mid-delivery, outbox entries are left in pending state with no recovery mechanism
  • Why it matters: Users expect messages to be delivered even after gateway restarts; orphaned outbox entries cause silent message loss
  • What changed: Added a recovery worker that scans for stale pending outbox entries on startup and re-dispatches them; added delivery status tracking (pending/delivered/failed) and turn-level finalization queries
  • What did NOT change: Normal delivery path unchanged; recovery only activates for entries older than a configurable threshold
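The recovery rule summarized above can be sketched as a small eligibility predicate. This is a hedged illustration only: `OutboxEntry`, `isRecoverable`, and the 60s default are invented for the sketch, not the PR's actual API.

```typescript
// Illustrative sketch of the recovery eligibility rule summarized above.
// OutboxEntry, isRecoverable, and the 60s default are invented names.
interface OutboxEntry {
  status: "pending" | "delivered" | "failed";
  queuedAt: number; // epoch ms when the entry was written ahead of delivery
  nextAttemptAt: number; // epoch ms; earliest time a retry may run
}

// An entry is re-dispatched only if it is still pending, older than the
// configurable staleness threshold, and past its backoff window.
function isRecoverable(
  entry: OutboxEntry,
  now: number,
  staleAfterMs = 60_000, // assumed default; the real threshold is configurable
): boolean {
  return (
    entry.status === "pending" &&
    now - entry.queuedAt >= staleAfterMs &&
    entry.nextAttemptAt <= now
  );
}
```

In this sketch the normal delivery path never consults the check; only a startup or periodic recovery pass would, which is what keeps the live path unchanged.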

Change Type (select all)

  • Feature

Scope (select all touched areas)

  • Gateway / orchestration
  • Memory / storage

Linked Issue/PR

User-visible / Behavior Changes

None visible — messages that previously would have been lost after a crash are now retried on restart.

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: Any
  • Runtime/container: Node 22+

Steps

  1. Send a message, kill gateway mid-delivery
  2. Restart gateway
  3. Verify recovery worker picks up and re-delivers the message

Expected

  • Message delivered after restart

Actual

  • Same as expected

Evidence

  • Failing test/log before + passing after (CI green)
  • Trace/log snippets (recovery worker logs in verbose mode)

Human Verification (required)

  • Verified scenarios: Normal delivery, crash-and-recover, stale entry cleanup
  • Edge cases checked: Interaction-scoped dispatchers (Slack slash, Discord native commands) now skip outbox to avoid replaying to wrong destination
  • What you did not verify: High-throughput concurrent recovery scenarios

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No — recovery worker auto-starts with gateway

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: Revert commit; outbox entries remain but won't be recovered
  • Files/config to restore: Recovery worker source files
  • Known bad symptoms: Duplicate deliveries (recovery retries already-delivered message)

Risks and Mitigations

  • Risk: Recovery worker could re-deliver messages that were actually sent (at-least-once semantics)
    • Mitigation: Delivery status tracked in outbox; recovery only targets entries in pending state; configurable staleness threshold
  • Risk: Interaction-scoped callbacks can't survive recovery
    • Mitigation: Interaction-scoped dispatchers (CommandSource === "native") skip outbox entry creation entirely

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 05ba57752c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

lifecycleWorkersPromise = (async () => {
  const { startMessageLifecycleWorkers } = await import("./server-message-lifecycle.js");
  lifecycleWorkers = await startMessageLifecycleWorkers({
    cfg: cfgAtStart,

P2 Badge Refresh lifecycle worker config after hot reload

The lifecycle worker is started once with cfgAtStart and then reuses that snapshot for every recovery pass, so it never sees runtime config updates. In this codebase messages.* changes are hot/no-op in reload planning, so updates like messages.delivery.maxAgeMs/expireAction (and other delivery-affecting settings) can take effect for live sends but not for retries, which leaves recovery behavior inconsistent until a full gateway restart.

Contributor Author

Fixed — changed StartLifecycleWorkersParams from cfg: OpenClawConfig to getCfg: () => OpenClawConfig. Each worker pass now calls getCfg() to get fresh config. The call site in server.impl.ts passes loadConfig directly.
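As a rough sketch of the snapshot-vs-getter change (shapes simplified; `Config`, `makeWorker`, and `runPass` are invented here, not the real `StartLifecycleWorkersParams` surface):

```typescript
// Simplified sketch of the snapshot-vs-getter fix; names are invented.
interface Config {
  maxAgeMs: number;
}

// Before the fix the worker closed over a one-time snapshot (cfg: Config).
// After the fix it receives a getter and re-reads config on every pass.
function makeWorker(getCfg: () => Config) {
  return {
    runPass(): number {
      // Fresh read per pass: hot-reloaded values apply on the next recovery run.
      return getCfg().maxAgeMs;
    },
  };
}
```

Each `runPass()` re-reads the live config, so a hot reload takes effect on the next pass without a full gateway restart.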

Comment on lines +246 to +247
AND (dispatch_kind IS NULL OR dispatch_kind = 'final')
AND next_attempt_at <= ?

P2 Badge Terminalize non-final failures when recovery skips them

loadPendingDeliveries now filters recovery to (dispatch_kind IS NULL OR dispatch_kind = 'final'), but failed tool/block rows are still written as failed_retryable. If messages.delivery.expireAction is configured to deliver (so expiry update is skipped), those non-final failed rows are never retried, never marked terminal, and never pruned, causing unbounded outbox growth under repeated transient tool/block send failures.

Contributor Author

Fixed — recoverPendingDeliveries now marks orphaned non-final outbox rows (tool/block dispatch_kind) as failed_terminal at the start of each recovery pass, scoped to entries older than startupCutoff. This prevents unbounded accumulation while leaving in-flight entries from the current instance untouched.
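The terminalization rule described here can be sketched as a pure predicate. Illustrative only: the actual change is a SQL UPDATE, and these names are invented for the sketch.

```typescript
// Illustrative predicate for the orphan cleanup described above.
type DispatchKind = "tool" | "block" | "final" | null;

// Non-final rows (tool/block) older than the startup cutoff are orphans from
// a previous gateway instance. Recovery never retries non-final kinds, so
// they are marked terminal instead of accumulating forever; rows from the
// current instance (queuedAt >= startupCutoff) are left untouched.
function shouldTerminalize(
  kind: DispatchKind,
  queuedAt: number,
  startupCutoff: number,
): boolean {
  return kind !== null && kind !== "final" && queuedAt < startupCutoff;
}
```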

@nohat nohat force-pushed the lifecycle/write-ahead-recovery branch from 05ba577 to 2cc78e3 on February 28, 2026 at 19:03
@greptile-apps
Contributor

greptile-apps bot commented Feb 28, 2026

Greptile Summary

This PR implements a write-ahead outbox pattern with SQLite-backed delivery queue and recovery worker. The changes introduce crash-safe message delivery with the following key components:

Major Changes:

  • SQLite-backed outbox table with status tracking (queued, failed_retryable, failed_terminal, delivered, expired)
  • Recovery worker loop in server-message-lifecycle.ts that retries failed/orphaned deliveries on configurable intervals
  • Delivery queue context threading through the reply dispatcher to capture turn ID, dispatch kind, and delivery metadata
  • Startup cutoff filtering to prevent double-delivery of messages enqueued during the current gateway instance lifetime
  • Dispatch kind (tool/block/final) tracking with recovery filtering to only retry final entries

Architecture:
The implementation uses two delivery paths:

  1. Direct delivery path (routeReply → deliverOutboundPayloads): queues in deliverOutboundPayloads
  2. Dispatcher path (channel-specific handlers): queues in the reply dispatcher, then uses channel-specific delivery logic

These paths are mutually exclusive (if/else branching), preventing double-queueing.

Key Design Decisions:

  • ackDelivery/failDelivery failures are intentionally non-fatal - entries remain queued for recovery retry
  • Best-effort queue writes - delivery proceeds even if queueing fails
  • Recovery uses generic deliverOutboundPayloads regardless of original delivery path
  • Non-final dispatch kinds (tool, block) are excluded from recovery to avoid replaying transient messages

Notable: Idempotency key infrastructure is present in schema and types but not yet implemented - the system currently relies on "harmless duplicate delivery" rather than true idempotency for deduplication.

Confidence Score: 4/5

  • Safe to merge with minor consideration for future idempotency implementation
  • This is a well-designed implementation with sound architecture and comprehensive error handling. The write-ahead outbox pattern is correctly implemented with proper startup cutoff filtering, dispatch kind tracking, and recovery logic. Testing coverage is good (199 updated test lines). Score is 4 (not 5) because idempotency keys are not yet implemented despite schema support, meaning ackDelivery failures could cause duplicate deliveries rather than idempotent retries. However, the code acknowledges this with "harmless duplicate delivery" comment, suggesting it's acceptable for this phase.
  • No files require special attention - the implementation is solid across all changed files

Last reviewed commit: 2cc78e3

Contributor

@greptile-apps greptile-apps bot left a comment

18 files reviewed, 1 comment


Comment on lines +153 to 156
);
return id;
}

Contributor

Idempotency keys mentioned here but not implemented - the schema has idempotency_key column and ChannelOutboundAdapter has supportsIdempotencyKey flag, but no code generates or uses idempotency keys. This means ackDelivery failures will cause duplicate sends rather than idempotent retries. Consider implementing idempotency key generation or updating comment to reflect current "best-effort deduplication" approach.


Contributor Author

Intentional schema scaffolding for future use. Idempotency is currently handled by the recovery worker's startupCutoff filter (prevents double-delivery of in-flight entries) plus adapter-level dedup where supported. The column and unique index are forward-compatible — they'll be populated when per-channel idempotency key generation is wired up.
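For illustration, a per-channel idempotency key could plausibly be derived like this once the scaffolding is wired up. Purely hypothetical: nothing in this PR generates keys, and `deriveIdempotencyKey` and its inputs are invented.

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch of the future wiring the author describes; this PR
// does not generate idempotency keys anywhere.
function deriveIdempotencyKey(outboxId: string, turnId: string): string {
  // Stable across retries of the same logical send, so an adapter with
  // supportsIdempotencyKey could deduplicate server-side.
  return createHash("sha256").update(`${outboxId}:${turnId}`).digest("hex");
}
```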

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: 2cc78e32cb

accountId: deliveryQueueContext.accountId,
payloads: [normalized],
threadId: deliveryQueueContext.threadId,
replyToId: normalized.replyToId ?? deliveryQueueContext.replyToId,

P2 Badge Persist only effective reply target for queued sends

This queues deliveryQueueContext.replyToId for every payload when the payload itself has no replyToId, but many channel handlers decide reply threading dynamically at send time (for example via reply-to mode planners in monitor-specific deliverers). During crash recovery, entries are replayed through deliverOutboundPayloads using this stored static replyToId, so messages can be incorrectly threaded (e.g. replyToMode="off" still replies, or replyToMode="first" replies on later messages). Store the effective reply target actually used for that send (or omit the fallback) to keep recovery behavior aligned with live delivery.

Contributor Author

The queued replyToId comes from the delivery context at enqueue time, not from dynamic resolution. replyToMode handling happens in deliver.ts before the payload reaches the queue, so the effective reply target is already resolved. Recovery replays with the same stored target.

@nohat nohat force-pushed the lifecycle/write-ahead-recovery branch from 2cc78e3 to 1c39ee5 on February 28, 2026 at 22:04
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: 1c39ee5047

Comment on lines +57 to +60
return {
channel,
to,
accountId: params.ctx.AccountId?.trim() || undefined,

P1 Badge Exclude slash-style dispatches from outbox replay context

Do not derive a generic delivery queue context for all dispatcher-based flows here, because some flows are not replayable from just channel + to. In Slack slash handling, replies are sent via the request-scoped respond callback (src/slack/monitor/slash.ts), but this code now stores channel: "slack" and to: "user:<id>"; if that send fails (or the process restarts after enqueue), recovery replays through deliverOutboundPayloads and can post a DM to the user instead of a slash response. This misroutes messages and can leak content in environments where slash responses are expected to stay in-command context.

Contributor Author

Acknowledged tradeoff. Recovery sends a DM instead of the ephemeral respond callback because respond has a ~15min TTL and cannot survive a restart. A DM is better than a lost message. Added a code comment documenting this tradeoff above resolveDeliveryQueueContext.

@nohat nohat force-pushed the lifecycle/write-ahead-recovery branch from 1c39ee5 to fbb7c98 on February 28, 2026 at 23:37
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: fbb7c98316

Comment on lines +76 to +78
if (dispatcher.setDeliveryQueueContext) {
const queueContext = resolveDeliveryQueueContext({ ctx: finalized });
dispatcher.setDeliveryQueueContext(queueContext);

P1 Badge Skip outbox context for interaction-scoped dispatchers

This now sets a delivery queue context for every deliverable inbound message, which means dispatcher-based flows are persisted and later replayed via recoverPendingDeliveries using deliverOutboundPayloads (generic channel send) instead of their original request-scoped callbacks. In interaction handlers like src/slack/monitor/slash.ts (respond) and src/discord/monitor/native-command.ts (interaction.reply/followUp), a failed send or restart can therefore replay to OriginatingTo as a normal DM/channel post rather than an interaction response (including ephemeral slash replies), changing recipient/scope and potentially exposing content unexpectedly.

Contributor Author

@nohat nohat Mar 1, 2026

Agreed — implemented conservative approach: interaction-scoped dispatchers (Slack slash, Discord native commands) now skip outbox entry creation entirely via a CommandSource === "native" guard in dispatchInboundMessageInternal.

Fix landed in d4faf01 on lifecycle/write-ahead-recovery (this PR). The same guard is also present on lifecycle/turn-tracking (#30011) where the code was further extended with turn tracking.

Contributor

Copilot AI left a comment

Pull request overview

Adds a SQLite-backed write-ahead outbox lifecycle with a periodic recovery/pruning worker, and threads delivery metadata (turnId / dispatch kind) through the outbound pipeline to enable restart-safe retries and better delivery tracking.

Changes:

  • Introduce message-lifecycle SQLite DB (schema + caching) and a gateway worker that runs outbox recovery + pruning.
  • Migrate delivery queue operations to message_outbox rows (enqueue/ack/fail/recover) with new states and startup-cutoff logic to avoid double-delivery.
  • Thread turnId and delivery metadata through reply dispatch and outbound delivery APIs; extend config schema/docs for delivery expiry behavior.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.

Show a summary per file
src/infra/outbound/outbound.test.ts: Updates delivery-queue tests from filesystem queue assertions to SQLite outbox assertions.
src/infra/outbound/delivery-queue.ts: Replaces file-based queue with SQLite outbox operations; adds recovery filtering, expiry/pruning, and legacy import.
src/infra/outbound/deliver.ts: Plumbs turnId into queue writes; improves best-effort partial failure detection.
src/infra/message-lifecycle/index.ts: Re-exports lifecycle DB helpers.
src/infra/message-lifecycle/db.ts: Implements cached SQLite DB open + schema creation + transaction helper.
src/gateway/server.impl.ts: Starts/stops message lifecycle workers during gateway lifecycle.
src/gateway/server-message-lifecycle.ts: New worker loop for outbox recovery and outbox pruning; imports legacy file queue.
src/config/zod-schema.session.ts: Adds messages.delivery config validation (maxAgeMs/expireAction).
src/config/types.messages.ts: Adds typed config for delivery expiry behavior.
src/config/schema.labels.ts: Adds UI labels for delivery lifecycle config.
src/config/schema.help.ts: Adds help text for delivery lifecycle config and defaults.
src/channels/plugins/types.adapters.ts: Adds outbound delivery metadata + idempotency-key support flag to adapter types.
src/auto-reply/templating.ts: Adds MessageTurnId to templating context.
src/auto-reply/reply/route-reply.ts: Plumbs turnId into routed outbound delivery.
src/auto-reply/reply/reply-dispatcher.ts: Adds optional outbox enqueue/ack/fail around direct reply delivery (with per-kind dispatchKind).
src/auto-reply/reply/dispatch-from-config.ts: Tracks/returns attemptedFinal and plumbs turnId into routeReply calls.
src/auto-reply/reply/dispatch-acp.ts: Tracks/returns attemptedFinal for ACP dispatch path.
src/auto-reply/dispatch.ts: Derives and sets dispatcher outbox context for non-interaction-scoped inbound messages.

AND next_attempt_at <= ?
AND queued_at >= ?
AND last_attempt_at IS NULL
AND attempt_count = 0`,
Copilot AI Mar 1, 2026

The skippedStartupCutoff metric is counted using a query that doesn’t apply the same dispatch_kind filter as loadPendingDeliveries (which only loads dispatch_kind NULL or 'final'). This can overcount skippedStartupCutoff by including tool/block rows that are actually excluded for a different reason, making the summary/logs misleading. Consider adding the same dispatch_kind predicate to the COUNT(*) query so the metric matches the set of entries that would otherwise have been eligible.

Suggested change
- AND attempt_count = 0`,
+ AND attempt_count = 0
+ AND (dispatch_kind IS NULL OR dispatch_kind = 'final')`,

Contributor Author

Fixed in fec43ce. Added AND (dispatch_kind IS NULL OR dispatch_kind = 'final') to the COUNT query so it matches loadPendingDeliveries filtering.

Comment on lines +451 to +468
if (resolveExpireAction(opts.cfg) === "fail") {
const db = getLifecycleDb(opts.stateDir);
const staleCutoff = Date.now() - resolveDeliveryMaxAgeMs(opts.cfg);
try {
db.prepare(
`UPDATE message_outbox
SET status='expired',
error_class='terminal',
last_error='expired: queued_at too old',
terminal_reason='expired',
completed_at=?
WHERE status IN ('queued','failed_retryable')
AND queued_at < ?`,
).run(Date.now(), staleCutoff);
} catch (err) {
logVerbose(`delivery-queue: expiry update failed: ${String(err)}`);
}
}
Copilot AI Mar 1, 2026

recoverPendingDeliveries now marks any queued/failed_retryable rows older than maxAgeMs as status='expired' when expireAction is unset (defaults to "fail") and maxAgeMs falls back to 30 minutes. This means a gateway restart after >30 minutes of downtime can permanently drop previously queued but never-attempted deliveries, which appears to conflict with the PR goal of ensuring crash-survivor deliveries are retried after restart. If expiry is intended, consider making it opt-in (no default maxAgeMs), or defaulting expireAction to "deliver", or at least aligning the PR description to call out the default expiry behavior explicitly.

Contributor Author

Fixed in fec43ce. Flipped resolveExpireAction default from "fail" to "deliver" so a long gateway restart does not permanently drop queued messages. Operators can still opt into "fail" explicitly via messages.delivery.expireAction.

Comment on lines 8 to 9
import { getLifecycleDb } from "../message-lifecycle/db.js";
import {
Copilot AI Mar 1, 2026

The test suite now opens a cached SQLite DatabaseSync via getLifecycleDb(tmpDir), but afterEach only rmSyncs tmpDir. Because getLifecycleDb caches DB handles globally, the DB file can remain open across tests (and the cache can grow), which can make tmpDir cleanup flaky (notably on Windows) and cause cross-test interference. Consider importing clearLifecycleDbCacheForTest and calling it in afterEach (before rmSync) to close cached DBs.

Contributor Author

Fixed in fec43ce. Added clearLifecycleDbCacheForTest() call to afterEach (before rmSync) so cached DB handles are properly closed between tests.
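The cached-handle problem and the test-only fix can be sketched abstractly. Names are simplified from `getLifecycleDb` / `clearLifecycleDbCacheForTest`, and `Closable` stands in for a real SQLite handle.

```typescript
// Abstract sketch of a per-directory handle cache with a test-only reset.
interface Closable {
  closed: boolean;
  close(): void;
}

const dbCache = new Map<string, Closable>();

function makeHandle(): Closable {
  const handle: Closable = {
    closed: false,
    close: () => {
      handle.closed = true;
    },
  };
  return handle;
}

function getDb(stateDir: string): Closable {
  let db = dbCache.get(stateDir);
  if (!db) {
    db = makeHandle();
    dbCache.set(stateDir, db);
  }
  return db;
}

// Test-only helper: close and drop every cached handle so a test's tmp dir
// can be removed without a still-open file handle (notably on Windows).
function clearDbCacheForTest(): void {
  for (const db of dbCache.values()) db.close();
  dbCache.clear();
}
```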

@nohat nohat force-pushed the lifecycle/write-ahead-recovery branch from d4faf01 to b8b68f6 on March 2, 2026 at 04:08
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: b8b68f6260

accountId: deliveryQueueContext.accountId,
payloads: [normalized],
threadId: deliveryQueueContext.threadId,
replyToId: normalized.replyToId ?? deliveryQueueContext.replyToId,

P1 Badge Store only effective reply target in queued dispatcher rows

Queuing with replyToId: normalized.replyToId ?? deliveryQueueContext.replyToId persists the inbound fallback reply target instead of the effective per-send target, so recovery can replay messages into threads that live delivery would not use (for example replyToMode: "off"/"first" in Slack or Discord). This creates misthreaded or unexpectedly visible replies after a crash/restart because replay uses the stored static replyToId through deliverOutboundPayloads. Fresh evidence: Slack/Discord set context ReplyToId eagerly (src/slack/monitor/message-handler/prepare.ts, src/discord/monitor/message-handler.process.ts) but resolve thread usage at send time (src/slack/monitor/replies.ts, src/discord/monitor/reply-delivery.ts).

Contributor Author

Same explanation as before: replyToId is resolved at enqueue time in deliverOutboundPayloads before reaching the queue. The queue stores the fully resolved payload including replyToId. Recovery replays the stored payload as-is — no dynamic resolution needed.

Contributor Author

Correction to my earlier reply: I described the deliver.ts pre-send resolution path, but this code path goes through the reply-dispatcher, not deliver.ts. See the reply to the follow-up comment above for accurate analysis — the stored replyToId is a best-effort fallback from context, not a dynamically resolved value. The practical impact is limited to multi-reply recovery edge cases.

WHERE status IN ('queued','failed_retryable')
AND dispatch_kind IS NOT NULL
AND dispatch_kind != 'final'
AND queued_at < ?`,

P2 Badge Terminalize non-final failures without startup-only cutoff

The non-final cleanup query is gated by queued_at < startupCutoff, so tool/block rows that fail after startup remain failed_retryable indefinitely: recovery never picks them up (loadPendingDeliveries filters to final/NULL dispatch kinds) and pruning never removes them (pruneOutbox deletes only terminal statuses). In a long-lived process with repeated non-final send failures, outbox rows keep accumulating until the next restart.

Contributor Author

Fixed with a two-part approach in d9f04d1 (#29998) and 5f56619 (#30009):

Part A (d9f04d1 on #29998): failDelivery now looks up dispatch_kind and immediately terminates non-final rows instead of marking them failed_retryable. Since recovery skips non-final rows by design, failed_retryable was semantically wrong — there's no mechanism to retry them.

Part B (5f56619 on #30009): The recovery terminalization query now also catches post-startup non-final rows where markAttemptStarted ran but ack/fail didn't complete (last_attempt_at IS NOT NULL), in addition to pre-startup crash survivors.

The remaining gap (enqueued after startup, markAttemptStarted never ran) is microseconds wide and self-heals on restart.
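Part A's status decision reduces to a small rule. This is a sketch only; the type and function names are invented, and the real code operates on SQL rows.

```typescript
// Sketch of the Part A rule: which failure status a row gets by dispatch kind.
type DispatchKind = "tool" | "block" | "final" | null;
type FailStatus = "failed_retryable" | "failed_terminal";

// Recovery only ever reloads rows whose dispatch_kind is NULL or 'final',
// so marking a tool/block failure retryable would strand it: nothing would
// ever retry it. Non-final failures therefore go terminal immediately.
function failureStatus(kind: DispatchKind): FailStatus {
  const retryable = kind === null || kind === "final";
  return retryable ? "failed_retryable" : "failed_terminal";
}
```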

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: 46362ed77f

// Mark attempt started before sending so the row passes startup-cutoff
// filtering even if the subsequent ack/fail write is swallowed.
if (queueId) {
await markAttemptStarted(queueId).catch(() => {});

P1 Badge Avoid marking first attempts recoverable before send completes

Calling markAttemptStarted before the first outbound send makes a brand-new row satisfy the recovery predicate that allows last_attempt_at IS NOT NULL entries, even when attempt_count is still 0. I checked this against the periodic worker loop in src/gateway/server-message-lifecycle.ts and the filter in src/infra/outbound/delivery-queue.ts: if a normal live send takes longer than one worker tick, recovery can pick the same row and dispatch it concurrently, causing duplicate user-visible messages without any crash/restart.

Contributor Author

Fixed in 5f56619. markAttemptStarted now also pushes next_attempt_at forward by computeBackoffMs(1) (5s — the first-retry backoff interval). This prevents the recovery worker from picking up the row while the direct send is in flight. If the direct send completes, ackDelivery/failDelivery finalizes the row before recovery touches it. If it silently fails, recovery picks it up after exactly the first retry backoff.
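The in-flight guard can be sketched as follows. The 5s first-retry interval comes from the reply above; the exponential shape of `computeBackoffMs` is an assumption for the sketch, not confirmed by the PR.

```typescript
// Sketch of the in-flight guard; the exponential backoff shape is assumed.
function computeBackoffMs(attempt: number): number {
  return 5_000 * 2 ** (attempt - 1);
}

// markAttemptStarted both stamps last_attempt_at and pushes next_attempt_at
// forward, so the recovery worker's next tick skips the row while the direct
// send is still in flight. If ack/fail never lands, recovery picks the row
// up after exactly the first-retry backoff.
function nextAttemptAtOnStart(nowMs: number): number {
  return nowMs + computeBackoffMs(1);
}
```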

@nohat force-pushed the lifecycle/write-ahead-recovery branch from 46362ed to 5f56619 (March 2, 2026 05:07)
@chatgpt-codex-connector (bot) left a comment:

💡 Codex Review

Reviewed commit: 5f56619d40


const action = cfg.messages?.delivery?.expireAction;
// Default to "deliver" so that a long gateway restart doesn't permanently
// drop queued messages. Operators can opt into "fail" explicitly.
return action === "fail" ? "fail" : "deliver";

P2: Align default expiry action with documented config default

resolveExpireAction currently defaults unset messages.delivery.expireAction to "deliver", but this commit’s config contract documents the default as "fail" (see src/config/types.messages.ts and src/config/schema.help.ts). In environments that rely on defaults, stale outbox entries older than maxAgeMs will be retried and delivered instead of being terminalized, which can replay outdated messages after long downtime. The runtime default should match the documented/defaulted behavior to avoid surprising delivery of expired content.

nohat (Contributor, author) replied:

Already fixed in fec43ce (previous round). resolveExpireAction now defaults to "deliver" so a long gateway restart doesn't permanently drop queued messages. Operators can opt into "fail" explicitly via messages.delivery.expireAction.
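The resolution logic quoted in the review comment can be exercised in isolation. The `Config` type below is a simplified stand-in for the real config shape; only the `messages.delivery.expireAction` path from this PR is modeled.

```typescript
// Simplified stand-in for the real config type; only the path used by
// resolveExpireAction is modeled here.
type Config = {
  messages?: { delivery?: { expireAction?: "deliver" | "fail" } };
};

// Mirrors the snippet under review: unset or unrecognized values fall back
// to "deliver" so a long gateway restart doesn't drop queued messages.
function resolveExpireAction(cfg: Config): "deliver" | "fail" {
  const action = cfg.messages?.delivery?.expireAction;
  return action === "fail" ? "fail" : "deliver";
}

console.log(resolveExpireAction({})); // "deliver"
console.log(
  resolveExpireAction({ messages: { delivery: { expireAction: "fail" } } }),
); // "fail"
```

Operators who want expired entries terminalized instead of retried opt in with `{ messages: { delivery: { expireAction: "fail" } } }`.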

@nohat force-pushed the lifecycle/write-ahead-recovery branch from 5f56619 to e595246 (March 2, 2026 17:40)
@chatgpt-codex-connector (bot) left a comment:

💡 Codex Review

Reviewed commit: e595246b6f


Comment on lines +174 to +177
`UPDATE message_outbox
SET last_attempt_at = ?, next_attempt_at = ?
WHERE id = ? AND last_attempt_at IS NULL`,
).run(now, now + computeBackoffMs(1), id);

P1: Keep first-attempt rows unrecoverable until send settles

markAttemptStarted only delays recovery by computeBackoffMs(1) (5s), but after that loadPendingDeliveries will include the row because last_attempt_at is now set, so the outbox worker can replay while the original live send is still running and produce duplicate user-visible messages without any crash/restart. Fresh evidence: Signal outbound requests use a 10s default RPC timeout (src/signal/client.ts:29, used by src/signal/send.ts:183-186), which is longer than this 5s guard.

nohat (Contributor, author) replied:

Fixed in 8e3f671. Widened the markAttemptStarted guard from computeBackoffMs(1) (5s) to computeBackoffMs(2) (25s). This gives enough headroom for slow channels like Signal (10s RPC timeout) plus retry overhead. The guard is defense-in-depth alongside startupCutoff filtering.

@chatgpt-codex-connector (bot) left a comment:


💡 Codex Review

Reviewed commit: 8e3f67175f


accountId: deliveryQueueContext.accountId,
payloads: [normalized],
threadId: deliveryQueueContext.threadId,
replyToId: normalized.replyToId ?? deliveryQueueContext.replyToId,

P1: Queue dispatcher rows with effective reply target

enqueueDelivery persists replyToId from normalized.replyToId ?? deliveryQueueContext.replyToId before options.deliver executes, but Slack/Discord compute the actual per-send reply target inside their delivery callbacks (src/slack/monitor/message-handler/dispatch.ts:195-213, src/discord/monitor/message-handler.process.ts:652-669) using planners that change across messages (for example replyToMode="first"). If the process crashes after enqueueing later payloads, recovery replays with this stale fallback replyToId, so replies can be attached to the wrong thread/message. Fresh evidence vs the earlier thread: this queueing path is in createReplyDispatcher and does not go through deliver.ts's pre-send resolution path.

nohat (Contributor, author) replied:

Valid observation — the previous reply on this thread was inaccurate (it described the deliver.ts path, not this reply-dispatcher path).

The reply-dispatcher enqueues with deliveryQueueContext.replyToId (the incoming message ID from ctx.ReplyToId) before options.deliver runs, so Slack/Discord dynamic planner state (replyPlan.nextThreadTs(), replyToMode="first") is not captured.

Practical impact is limited:

  1. Only final dispatch_kind rows survive recovery — tool/block rows are pruned (lines 528-544 of delivery-queue.ts)
  2. Most turns produce a single final reply, so there is no planner-state divergence
  3. The stored replyToId (incoming message ID) is a reasonable recovery fallback — threading to the original message is better than dropping the reply
  4. For replyToMode="first", the first reply threads to the incoming message (matching the stored value); subsequent non-final replies are pruned on recovery

Fully resolving this would require persisting per-channel planner state (e.g., Slack reply-plan position), which is complex. Tracking as a known limitation — the at-least-once guarantee holds; the threading target may be imprecise in multi-reply recovery edge cases.
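The pruning behavior described in point 1 can be illustrated with a toy filter. The row shape and function name here are hypothetical stand-ins; the actual logic lives in delivery-queue.ts.

```typescript
// Hypothetical row shape; the real message_outbox schema has more columns.
type OutboxRow = {
  id: number;
  dispatch_kind: "final" | "tool" | "block";
  replyToId?: string;
};

// Recovery replays only final replies; intermediate tool/block rows are
// dropped rather than re-delivered.
function pruneNonFinal(rows: OutboxRow[]): OutboxRow[] {
  return rows.filter((row) => row.dispatch_kind === "final");
}

const survivors = pruneNonFinal([
  { id: 1, dispatch_kind: "tool" },
  { id: 2, dispatch_kind: "block" },
  { id: 3, dispatch_kind: "final", replyToId: "msg-42" },
]);
console.log(survivors.map((r) => r.id)); // [ 3 ]
```

Because only the final row survives, the stale stored `replyToId` matters only when a turn produced multiple final replies, which matches the "edge case" framing above.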

nohat added 17 commits March 5, 2026 08:35
Replace unbounded file-based delivery queue with queryable SQLite
message_outbox table. Adds TTL/expiry for stale entries, delivery
outcome retention, and one-time legacy file queue import on startup.

Closes openclaw#23777, openclaw#16555, openclaw#29128
- failDelivery: terminalize non-final rows immediately (no recovery path)
- failDelivery: set attempt_count/last_attempt_at on permanent errors
- failDelivery: add dispatch_kind to SELECT for non-final detection
- pruneOutbox: guard against invalid ageMs (<=0 or non-finite)
- db: add PRAGMA busy_timeout=5000 to prevent SQLITE_BUSY
- db: remove unconditional DROP INDEX before CREATE IF NOT EXISTS
- test: add clearLifecycleDbCacheForTest in afterEach
- test: update stale comments (5s delay, deferred-to-next-tick)
…lt expireAction to deliver, clear DB cache in tests
…nd non-final cleanup

- markAttemptStarted: push next_attempt_at forward by computeBackoffMs(1) to
  prevent recovery worker from racing with the in-flight direct send
- recoverPendingDeliveries: expand non-final terminalization to also catch
  post-startup rows where markAttemptStarted ran but ack/fail didn't complete
  (last_attempt_at IS NOT NULL)
@nohat force-pushed the lifecycle/write-ahead-recovery branch from 618f7f2 to 06ec2b9 (March 5, 2026 16:37)
@chatgpt-codex-connector (bot) left a comment:

💡 Codex Review

Reviewed commit: 06ec2b96eb


Comment on lines +176 to +179
`UPDATE message_outbox
SET last_attempt_at = ?, next_attempt_at = ?
WHERE id = ? AND last_attempt_at IS NULL`,
).run(now, now + computeBackoffMs(2), id);

P1: Keep in-flight rows unrecoverable until send settles

markAttemptStarted only defers recovery by 25s (computeBackoffMs(2)); after that, loadPendingDeliveries can pick the same row while the original live send is still in progress, causing duplicate user-visible deliveries without any restart. Fresh evidence beyond the earlier thread: at least one outbound path is unbounded. sendWebhookMessageDiscord uses a plain fetch(...) with no timeout (src/discord/send.outbound.ts:357), so a stalled webhook call can easily exceed 25s and be replayed concurrently by the worker.
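One conventional way to bound such a call is an abortable fetch. This is a sketch only, not this PR's fix; the helper name, headers, and 20s value are illustrative, and the real Discord webhook wrapper may need retry/error handling on top.

```typescript
// Sketch: bound an outbound webhook POST with AbortSignal.timeout (Node 18+)
// so a stalled request cannot outlive the recovery guard window. The helper
// name and timeout value are illustrative, not from this PR.
async function postWithTimeout(
  url: string,
  body: unknown,
  timeoutMs = 20_000,
): Promise<Response> {
  return fetch(url, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(body),
    // Rejects the fetch with a TimeoutError once timeoutMs elapses.
    signal: AbortSignal.timeout(timeoutMs),
  });
}
```

With the send bounded below the 25s guard, a stalled webhook fails fast and the row is finalized by failDelivery instead of racing the recovery worker.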

Labels

gateway (Gateway runtime), size: XL