
feat(outbox): write-ahead outbox with recovery worker and delivery tracking#30009

Open
nohat wants to merge 17 commits into openclaw:main from nohat:lifecycle/write-ahead-recovery

Conversation

@nohat
Contributor

@nohat nohat commented Feb 28, 2026

Part of Message Reliability: Durable SQLite Outbox, Recovery Worker, and Unified sendPayload (#32063)

Summary

  • Problem: If the gateway crashes mid-delivery, outbox entries are left in pending state with no recovery mechanism
  • Why it matters: Users expect messages to be delivered even after gateway restarts; orphaned outbox entries cause silent message loss
  • What changed: Added a recovery worker that scans for stale pending outbox entries on startup and re-dispatches them; added delivery status tracking (pending/delivered/failed) and turn-level finalization queries
  • What did NOT change: Normal delivery path unchanged; recovery only activates for entries older than a configurable threshold
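The recovery rule summarized above can be sketched as a small eligibility predicate. This is a hedged illustration only: `OutboxEntry`, `isRecoverable`, and the 60s default are invented for the sketch, not the PR's actual API.

```typescript
// Illustrative sketch of the recovery eligibility rule summarized above.
// OutboxEntry, isRecoverable, and the 60s default are invented names.
interface OutboxEntry {
  status: "pending" | "delivered" | "failed";
  queuedAt: number; // epoch ms when the entry was written ahead of delivery
  nextAttemptAt: number; // epoch ms; earliest time a retry may run
}

// An entry is re-dispatched only if it is still pending, older than the
// configurable staleness threshold, and past its backoff window.
function isRecoverable(
  entry: OutboxEntry,
  now: number,
  staleAfterMs = 60_000, // assumed default; the real threshold is configurable
): boolean {
  return (
    entry.status === "pending" &&
    now - entry.queuedAt >= staleAfterMs &&
    entry.nextAttemptAt <= now
  );
}
```

In this sketch the normal delivery path never consults the check; only a startup or periodic recovery pass would, which is what keeps the live path unchanged.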

Change Type (select all)

  • Feature

Scope (select all touched areas)

  • Gateway / orchestration
  • Memory / storage

Linked Issue/PR

User-visible / Behavior Changes

None visible — messages that previously would have been lost after a crash are now retried on restart.

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: Any
  • Runtime/container: Node 22+

Steps

  1. Send a message, kill gateway mid-delivery
  2. Restart gateway
  3. Verify recovery worker picks up and re-delivers the message

Expected

  • Message delivered after restart

Actual

  • Same as expected

Evidence

  • Failing test/log before + passing after (CI green)
  • Trace/log snippets (recovery worker logs in verbose mode)

Human Verification (required)

  • Verified scenarios: Normal delivery, crash-and-recover, stale entry cleanup
  • Edge cases checked: Interaction-scoped dispatchers (Slack slash, Discord native commands) now skip outbox to avoid replaying to wrong destination
  • What you did not verify: High-throughput concurrent recovery scenarios

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No — recovery worker auto-starts with gateway

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: Revert commit; outbox entries remain but won't be recovered
  • Files/config to restore: Recovery worker source files
  • Known bad symptoms: Duplicate deliveries (recovery retries already-delivered message)

Risks and Mitigations

  • Risk: Recovery worker could re-deliver messages that were actually sent (at-least-once semantics)
    • Mitigation: Delivery status tracked in outbox; recovery only targets entries in pending state; configurable staleness threshold
  • Risk: Interaction-scoped callbacks can't survive recovery
    • Mitigation: Interaction-scoped dispatchers (CommandSource === "native") skip outbox entry creation entirely

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 05ba57752c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

lifecycleWorkersPromise = (async () => {
  const { startMessageLifecycleWorkers } = await import("./server-message-lifecycle.js");
  lifecycleWorkers = await startMessageLifecycleWorkers({
    cfg: cfgAtStart,

P2 Badge Refresh lifecycle worker config after hot reload

The lifecycle worker is started once with cfgAtStart and then reuses that snapshot for every recovery pass, so it never sees runtime config updates. In this codebase messages.* changes are hot/no-op in reload planning, so updates like messages.delivery.maxAgeMs/expireAction (and other delivery-affecting settings) can take effect for live sends but not for retries, which leaves recovery behavior inconsistent until a full gateway restart.

Contributor Author

Fixed — changed StartLifecycleWorkersParams from cfg: OpenClawConfig to getCfg: () => OpenClawConfig. Each worker pass now calls getCfg() to get fresh config. The call site in server.impl.ts passes loadConfig directly.
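As a rough sketch of the snapshot-vs-getter change (shapes simplified; `Config`, `makeWorker`, and `runPass` are invented here, not the real `StartLifecycleWorkersParams` surface):

```typescript
// Simplified sketch of the snapshot-vs-getter fix; names are invented.
interface Config {
  maxAgeMs: number;
}

// Before the fix the worker closed over a one-time snapshot (cfg: Config).
// After the fix it receives a getter and re-reads config on every pass.
function makeWorker(getCfg: () => Config) {
  return {
    runPass(): number {
      // Fresh read per pass: hot-reloaded values apply on the next recovery run.
      return getCfg().maxAgeMs;
    },
  };
}
```

Each `runPass()` re-reads the live config, so a hot reload takes effect on the next pass without a full gateway restart.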

Comment on lines +246 to +247
AND (dispatch_kind IS NULL OR dispatch_kind = 'final')
AND next_attempt_at <= ?

P2 Badge Terminalize non-final failures when recovery skips them

loadPendingDeliveries now filters recovery to (dispatch_kind IS NULL OR dispatch_kind = 'final'), but failed tool/block rows are still written as failed_retryable. If messages.delivery.expireAction is configured to deliver (so expiry update is skipped), those non-final failed rows are never retried, never marked terminal, and never pruned, causing unbounded outbox growth under repeated transient tool/block send failures.

Contributor Author

Fixed — recoverPendingDeliveries now marks orphaned non-final outbox rows (tool/block dispatch_kind) as failed_terminal at the start of each recovery pass, scoped to entries older than startupCutoff. This prevents unbounded accumulation while leaving in-flight entries from the current instance untouched.
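The terminalization rule described here can be sketched as a pure predicate. Illustrative only: the actual change is a SQL UPDATE, and these names are invented for the sketch.

```typescript
// Illustrative predicate for the orphan cleanup described above.
type DispatchKind = "tool" | "block" | "final" | null;

// Non-final rows (tool/block) older than the startup cutoff are orphans from
// a previous gateway instance. Recovery never retries non-final kinds, so
// they are marked terminal instead of accumulating forever; rows from the
// current instance (queuedAt >= startupCutoff) are left untouched.
function shouldTerminalize(
  kind: DispatchKind,
  queuedAt: number,
  startupCutoff: number,
): boolean {
  return kind !== null && kind !== "final" && queuedAt < startupCutoff;
}
```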

@nohat nohat force-pushed the lifecycle/write-ahead-recovery branch from 05ba577 to 2cc78e3 on February 28, 2026 at 19:03
@greptile-apps
Contributor

greptile-apps bot commented Feb 28, 2026

Greptile Summary

This PR implements a write-ahead outbox pattern with SQLite-backed delivery queue and recovery worker. The changes introduce crash-safe message delivery with the following key components:

Major Changes:

  • SQLite-backed outbox table with status tracking (queued, failed_retryable, failed_terminal, delivered, expired)
  • Recovery worker loop in server-message-lifecycle.ts that retries failed/orphaned deliveries on configurable intervals
  • Delivery queue context threading through the reply dispatcher to capture turn ID, dispatch kind, and delivery metadata
  • Startup cutoff filtering to prevent double-delivery of messages enqueued during the current gateway instance lifetime
  • Dispatch kind (tool/block/final) tracking with recovery filtering to only retry final entries

Architecture:
The implementation uses two delivery paths:

  1. Direct delivery path (routeReply → deliverOutboundPayloads): queues in deliverOutboundPayloads
  2. Dispatcher path (channel-specific handlers): queues in the reply dispatcher, then uses channel-specific delivery logic

These paths are mutually exclusive (if/else branching), preventing double-queueing.

Key Design Decisions:

  • ackDelivery/failDelivery failures are intentionally non-fatal - entries remain queued for recovery retry
  • Best-effort queue writes - delivery proceeds even if queueing fails
  • Recovery uses generic deliverOutboundPayloads regardless of original delivery path
  • Non-final dispatch kinds (tool, block) are excluded from recovery to avoid replaying transient messages

Notable: Idempotency key infrastructure is present in schema and types but not yet implemented - the system currently relies on "harmless duplicate delivery" rather than true idempotency for deduplication.

Confidence Score: 4/5

  • Safe to merge with minor consideration for future idempotency implementation
  • This is a well-designed implementation with sound architecture and comprehensive error handling. The write-ahead outbox pattern is correctly implemented with proper startup cutoff filtering, dispatch kind tracking, and recovery logic. Testing coverage is good (199 updated test lines). Score is 4 (not 5) because idempotency keys are not yet implemented despite schema support, meaning ackDelivery failures could cause duplicate deliveries rather than idempotent retries. However, the code acknowledges this with "harmless duplicate delivery" comment, suggesting it's acceptable for this phase.
  • No files require special attention - the implementation is solid across all changed files

Last reviewed commit: 2cc78e3

Contributor

@greptile-apps greptile-apps bot left a comment

18 files reviewed, 1 comment


Comment on lines +153 to 156
);
return id;
}

Contributor

Idempotency keys mentioned here but not implemented - the schema has idempotency_key column and ChannelOutboundAdapter has supportsIdempotencyKey flag, but no code generates or uses idempotency keys. This means ackDelivery failures will cause duplicate sends rather than idempotent retries. Consider implementing idempotency key generation or updating comment to reflect current "best-effort deduplication" approach.


Contributor Author

Intentional schema scaffolding for future use. Idempotency is currently handled by the recovery worker's startupCutoff filter (prevents double-delivery of in-flight entries) plus adapter-level dedup where supported. The column and unique index are forward-compatible — they'll be populated when per-channel idempotency key generation is wired up.
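For illustration, a per-channel idempotency key could plausibly be derived like this once the scaffolding is wired up. Purely hypothetical: nothing in this PR generates keys, and `deriveIdempotencyKey` and its inputs are invented.

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch of the future wiring the author describes; this PR
// does not generate idempotency keys anywhere.
function deriveIdempotencyKey(outboxId: string, turnId: string): string {
  // Stable across retries of the same logical send, so an adapter with
  // supportsIdempotencyKey could deduplicate server-side.
  return createHash("sha256").update(`${outboxId}:${turnId}`).digest("hex");
}
```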

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: 2cc78e32cb

accountId: deliveryQueueContext.accountId,
payloads: [normalized],
threadId: deliveryQueueContext.threadId,
replyToId: normalized.replyToId ?? deliveryQueueContext.replyToId,

P2 Badge Persist only effective reply target for queued sends

This queues deliveryQueueContext.replyToId for every payload when the payload itself has no replyToId, but many channel handlers decide reply threading dynamically at send time (for example via reply-to mode planners in monitor-specific deliverers). During crash recovery, entries are replayed through deliverOutboundPayloads using this stored static replyToId, so messages can be incorrectly threaded (e.g. replyToMode="off" still replies, or replyToMode="first" replies on later messages). Store the effective reply target actually used for that send (or omit the fallback) to keep recovery behavior aligned with live delivery.

Contributor Author

The queued replyToId comes from the delivery context at enqueue time, not from dynamic resolution. replyToMode handling happens in deliver.ts before the payload reaches the queue, so the effective reply target is already resolved. Recovery replays with the same stored target.

@nohat nohat force-pushed the lifecycle/write-ahead-recovery branch from 2cc78e3 to 1c39ee5 on February 28, 2026 at 22:04
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: 1c39ee5047

Comment on lines +57 to +60
return {
channel,
to,
accountId: params.ctx.AccountId?.trim() || undefined,

P1 Badge Exclude slash-style dispatches from outbox replay context

Do not derive a generic delivery queue context for all dispatcher-based flows here, because some flows are not replayable from just channel + to. In Slack slash handling, replies are sent via the request-scoped respond callback (src/slack/monitor/slash.ts), but this code now stores channel: "slack" and to: "user:<id>"; if that send fails (or the process restarts after enqueue), recovery replays through deliverOutboundPayloads and can post a DM to the user instead of a slash response. This misroutes messages and can leak content in environments where slash responses are expected to stay in-command context.

Contributor Author

Acknowledged tradeoff. Recovery sends a DM instead of the ephemeral respond callback because respond has a ~15min TTL and cannot survive a restart. A DM is better than a lost message. Added a code comment documenting this tradeoff above resolveDeliveryQueueContext.

@nohat nohat force-pushed the lifecycle/write-ahead-recovery branch from 1c39ee5 to fbb7c98 on February 28, 2026 at 23:37
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: fbb7c98316

Comment on lines +76 to +78
if (dispatcher.setDeliveryQueueContext) {
const queueContext = resolveDeliveryQueueContext({ ctx: finalized });
dispatcher.setDeliveryQueueContext(queueContext);

P1 Badge Skip outbox context for interaction-scoped dispatchers

This now sets a delivery queue context for every deliverable inbound message, which means dispatcher-based flows are persisted and later replayed via recoverPendingDeliveries using deliverOutboundPayloads (generic channel send) instead of their original request-scoped callbacks. In interaction handlers like src/slack/monitor/slash.ts (respond) and src/discord/monitor/native-command.ts (interaction.reply/followUp), a failed send or restart can therefore replay to OriginatingTo as a normal DM/channel post rather than an interaction response (including ephemeral slash replies), changing recipient/scope and potentially exposing content unexpectedly.

Contributor Author

@nohat nohat Mar 1, 2026

Agreed — implemented conservative approach: interaction-scoped dispatchers (Slack slash, Discord native commands) now skip outbox entry creation entirely via a CommandSource === "native" guard in dispatchInboundMessageInternal.

Fix landed in d4faf01 on lifecycle/write-ahead-recovery (this PR). The same guard is also present on lifecycle/turn-tracking (#30011) where the code was further extended with turn tracking.

Contributor

Copilot AI left a comment

Pull request overview

Adds a SQLite-backed write-ahead outbox lifecycle with a periodic recovery/pruning worker, and threads delivery metadata (turnId / dispatch kind) through the outbound pipeline to enable restart-safe retries and better delivery tracking.

Changes:

  • Introduce message-lifecycle SQLite DB (schema + caching) and a gateway worker that runs outbox recovery + pruning.
  • Migrate delivery queue operations to message_outbox rows (enqueue/ack/fail/recover) with new states and startup-cutoff logic to avoid double-delivery.
  • Thread turnId and delivery metadata through reply dispatch and outbound delivery APIs; extend config schema/docs for delivery expiry behavior.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.

Show a summary per file
src/infra/outbound/outbound.test.ts: Updates delivery-queue tests from filesystem queue assertions to SQLite outbox assertions.
src/infra/outbound/delivery-queue.ts: Replaces file-based queue with SQLite outbox operations; adds recovery filtering, expiry/pruning, and legacy import.
src/infra/outbound/deliver.ts: Plumbs turnId into queue writes; improves best-effort partial failure detection.
src/infra/message-lifecycle/index.ts: Re-exports lifecycle DB helpers.
src/infra/message-lifecycle/db.ts: Implements cached SQLite DB open + schema creation + transaction helper.
src/gateway/server.impl.ts: Starts/stops message lifecycle workers during gateway lifecycle.
src/gateway/server-message-lifecycle.ts: New worker loop for outbox recovery and outbox pruning; imports legacy file queue.
src/config/zod-schema.session.ts: Adds messages.delivery config validation (maxAgeMs/expireAction).
src/config/types.messages.ts: Adds typed config for delivery expiry behavior.
src/config/schema.labels.ts: Adds UI labels for delivery lifecycle config.
src/config/schema.help.ts: Adds help text for delivery lifecycle config and defaults.
src/channels/plugins/types.adapters.ts: Adds outbound delivery metadata + idempotency-key support flag to adapter types.
src/auto-reply/templating.ts: Adds MessageTurnId to templating context.
src/auto-reply/reply/route-reply.ts: Plumbs turnId into routed outbound delivery.
src/auto-reply/reply/reply-dispatcher.ts: Adds optional outbox enqueue/ack/fail around direct reply delivery (with per-kind dispatchKind).
src/auto-reply/reply/dispatch-from-config.ts: Tracks/returns attemptedFinal and plumbs turnId into routeReply calls.
src/auto-reply/reply/dispatch-acp.ts: Tracks/returns attemptedFinal for ACP dispatch path.
src/auto-reply/dispatch.ts: Derives and sets dispatcher outbox context for non-interaction-scoped inbound messages.

AND next_attempt_at <= ?
AND queued_at >= ?
AND last_attempt_at IS NULL
AND attempt_count = 0`,
Copilot AI Mar 1, 2026

The skippedStartupCutoff metric is counted using a query that doesn’t apply the same dispatch_kind filter as loadPendingDeliveries (which only loads dispatch_kind NULL or 'final'). This can overcount skippedStartupCutoff by including tool/block rows that are actually excluded for a different reason, making the summary/logs misleading. Consider adding the same dispatch_kind predicate to the COUNT(*) query so the metric matches the set of entries that would otherwise have been eligible.

Suggested change
- AND attempt_count = 0`,
+ AND attempt_count = 0
+ AND (dispatch_kind IS NULL OR dispatch_kind = 'final')`,

Contributor Author

Fixed in fec43ce. Added AND (dispatch_kind IS NULL OR dispatch_kind = 'final') to the COUNT query so it matches loadPendingDeliveries filtering.

Comment on lines +451 to +468
if (resolveExpireAction(opts.cfg) === "fail") {
const db = getLifecycleDb(opts.stateDir);
const staleCutoff = Date.now() - resolveDeliveryMaxAgeMs(opts.cfg);
try {
db.prepare(
`UPDATE message_outbox
SET status='expired',
error_class='terminal',
last_error='expired: queued_at too old',
terminal_reason='expired',
completed_at=?
WHERE status IN ('queued','failed_retryable')
AND queued_at < ?`,
).run(Date.now(), staleCutoff);
} catch (err) {
logVerbose(`delivery-queue: expiry update failed: ${String(err)}`);
}
}
Copilot AI Mar 1, 2026

recoverPendingDeliveries now marks any queued/failed_retryable rows older than maxAgeMs as status='expired' when expireAction is unset (defaults to "fail") and maxAgeMs falls back to 30 minutes. This means a gateway restart after >30 minutes of downtime can permanently drop previously queued but never-attempted deliveries, which appears to conflict with the PR goal of ensuring crash-survivor deliveries are retried after restart. If expiry is intended, consider making it opt-in (no default maxAgeMs), or defaulting expireAction to "deliver", or at least aligning the PR description to call out the default expiry behavior explicitly.

Contributor Author

Fixed in fec43ce. Flipped resolveExpireAction default from "fail" to "deliver" so a long gateway restart does not permanently drop queued messages. Operators can still opt into "fail" explicitly via messages.delivery.expireAction.

Comment on lines 8 to 9
import { getLifecycleDb } from "../message-lifecycle/db.js";
import {
Copilot AI Mar 1, 2026

The test suite now opens a cached SQLite DatabaseSync via getLifecycleDb(tmpDir), but afterEach only rmSyncs tmpDir. Because getLifecycleDb caches DB handles globally, the DB file can remain open across tests (and the cache can grow), which can make tmpDir cleanup flaky (notably on Windows) and cause cross-test interference. Consider importing clearLifecycleDbCacheForTest and calling it in afterEach (before rmSync) to close cached DBs.

Contributor Author

Fixed in fec43ce. Added clearLifecycleDbCacheForTest() call to afterEach (before rmSync) so cached DB handles are properly closed between tests.
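The cached-handle problem and the test-only fix can be sketched abstractly. Names are simplified from `getLifecycleDb` / `clearLifecycleDbCacheForTest`, and `Closable` stands in for a real SQLite handle.

```typescript
// Abstract sketch of a per-directory handle cache with a test-only reset.
interface Closable {
  closed: boolean;
  close(): void;
}

const dbCache = new Map<string, Closable>();

function makeHandle(): Closable {
  const handle: Closable = {
    closed: false,
    close: () => {
      handle.closed = true;
    },
  };
  return handle;
}

function getDb(stateDir: string): Closable {
  let db = dbCache.get(stateDir);
  if (!db) {
    db = makeHandle();
    dbCache.set(stateDir, db);
  }
  return db;
}

// Test-only helper: close and drop every cached handle so a test's tmp dir
// can be removed without a still-open file handle (notably on Windows).
function clearDbCacheForTest(): void {
  for (const db of dbCache.values()) db.close();
  dbCache.clear();
}
```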

@nohat nohat force-pushed the lifecycle/write-ahead-recovery branch from d4faf01 to b8b68f6 on March 2, 2026 at 04:08
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: b8b68f6260

accountId: deliveryQueueContext.accountId,
payloads: [normalized],
threadId: deliveryQueueContext.threadId,
replyToId: normalized.replyToId ?? deliveryQueueContext.replyToId,

P1 Badge Store only effective reply target in queued dispatcher rows

Queuing with replyToId: normalized.replyToId ?? deliveryQueueContext.replyToId persists the inbound fallback reply target instead of the effective per-send target, so recovery can replay messages into threads that live delivery would not use (for example replyToMode: "off"/"first" in Slack or Discord). This creates misthreaded or unexpectedly visible replies after a crash/restart because replay uses the stored static replyToId through deliverOutboundPayloads. Fresh evidence: Slack/Discord set context ReplyToId eagerly (src/slack/monitor/message-handler/prepare.ts, src/discord/monitor/message-handler.process.ts) but resolve thread usage at send time (src/slack/monitor/replies.ts, src/discord/monitor/reply-delivery.ts).

Contributor Author

Same explanation as before: replyToId is resolved at enqueue time in deliverOutboundPayloads before reaching the queue. The queue stores the fully resolved payload including replyToId. Recovery replays the stored payload as-is — no dynamic resolution needed.

Contributor Author

Correction to my earlier reply: I described the deliver.ts pre-send resolution path, but this code path goes through the reply-dispatcher, not deliver.ts. See the reply to the follow-up comment above for accurate analysis — the stored replyToId is a best-effort fallback from context, not a dynamically resolved value. The practical impact is limited to multi-reply recovery edge cases.

WHERE status IN ('queued','failed_retryable')
AND dispatch_kind IS NOT NULL
AND dispatch_kind != 'final'
AND queued_at < ?`,

P2 Badge Terminalize non-final failures without startup-only cutoff

The non-final cleanup query is gated by queued_at < startupCutoff, so tool/block rows that fail after startup remain failed_retryable indefinitely: recovery never picks them up (loadPendingDeliveries filters to final/NULL dispatch kinds) and pruning never removes them (pruneOutbox deletes only terminal statuses). In a long-lived process with repeated non-final send failures, outbox rows keep accumulating until the next restart.

Contributor Author

Fixed with a two-part approach in d9f04d1 (#29998) and 5f56619 (#30009):

Part A (d9f04d1 on #29998): failDelivery now looks up dispatch_kind and immediately terminates non-final rows instead of marking them failed_retryable. Since recovery skips non-final rows by design, failed_retryable was semantically wrong — there's no mechanism to retry them.

Part B (5f56619 on #30009): The recovery terminalization query now also catches post-startup non-final rows where markAttemptStarted ran but ack/fail didn't complete (last_attempt_at IS NOT NULL), in addition to pre-startup crash survivors.

The remaining gap (enqueued after startup, markAttemptStarted never ran) is microseconds wide and self-heals on restart.
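Part A's status decision reduces to a small rule. This is a sketch only; the type and function names are invented, and the real code operates on SQL rows.

```typescript
// Sketch of the Part A rule: which failure status a row gets by dispatch kind.
type DispatchKind = "tool" | "block" | "final" | null;
type FailStatus = "failed_retryable" | "failed_terminal";

// Recovery only ever reloads rows whose dispatch_kind is NULL or 'final',
// so marking a tool/block failure retryable would strand it: nothing would
// ever retry it. Non-final failures therefore go terminal immediately.
function failureStatus(kind: DispatchKind): FailStatus {
  const retryable = kind === null || kind === "final";
  return retryable ? "failed_retryable" : "failed_terminal";
}
```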

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: 46362ed77f

// Mark attempt started before sending so the row passes startup-cutoff
// filtering even if the subsequent ack/fail write is swallowed.
if (queueId) {
await markAttemptStarted(queueId).catch(() => {});

P1 Badge Avoid marking first attempts recoverable before send completes

Calling markAttemptStarted before the first outbound send makes a brand-new row satisfy the recovery predicate that allows last_attempt_at IS NOT NULL entries, even when attempt_count is still 0. I checked this against the periodic worker loop in src/gateway/server-message-lifecycle.ts and the filter in src/infra/outbound/delivery-queue.ts: if a normal live send takes longer than one worker tick, recovery can pick the same row and dispatch it concurrently, causing duplicate user-visible messages without any crash/restart.

Contributor Author

Fixed in 5f56619. markAttemptStarted now also pushes next_attempt_at forward by computeBackoffMs(1) (5s — the first-retry backoff interval). This prevents the recovery worker from picking up the row while the direct send is in flight. If the direct send completes, ackDelivery/failDelivery finalizes the row before recovery touches it. If it silently fails, recovery picks it up after exactly the first retry backoff.
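The in-flight guard can be sketched as follows. The 5s first-retry interval comes from the reply above; the exponential shape of `computeBackoffMs` is an assumption for the sketch, not confirmed by the PR.

```typescript
// Sketch of the in-flight guard; the exponential backoff shape is assumed.
function computeBackoffMs(attempt: number): number {
  return 5_000 * 2 ** (attempt - 1);
}

// markAttemptStarted both stamps last_attempt_at and pushes next_attempt_at
// forward, so the recovery worker's next tick skips the row while the direct
// send is still in flight. If ack/fail never lands, recovery picks the row
// up after exactly the first-retry backoff.
function nextAttemptAtOnStart(nowMs: number): number {
  return nowMs + computeBackoffMs(1);
}
```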

@nohat force-pushed the lifecycle/write-ahead-recovery branch from 46362ed to 5f56619 (March 2, 2026 05:07)
@chatgpt-codex-connector (bot) left a comment:

💡 Codex Review

Reviewed commit: 5f56619d40


const action = cfg.messages?.delivery?.expireAction;
// Default to "deliver" so that a long gateway restart doesn't permanently
// drop queued messages. Operators can opt into "fail" explicitly.
return action === "fail" ? "fail" : "deliver";

P2: Align default expiry action with documented config default

resolveExpireAction currently defaults unset messages.delivery.expireAction to "deliver", but this commit’s config contract documents the default as "fail" (see src/config/types.messages.ts and src/config/schema.help.ts). In environments that rely on defaults, stale outbox entries older than maxAgeMs will be retried and delivered instead of being terminalized, which can replay outdated messages after long downtime. The runtime default should match the documented/defaulted behavior to avoid surprising delivery of expired content.

nohat (Contributor, author) replied:

Already fixed in fec43ce (previous round). resolveExpireAction now defaults to "deliver" so a long gateway restart doesn't permanently drop queued messages. Operators can opt into "fail" explicitly via messages.delivery.expireAction.
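The resolution logic quoted in the review comment can be exercised in isolation. The `Config` type below is a simplified stand-in for the real config shape; only the `messages.delivery.expireAction` path from this PR is modeled.

```typescript
// Simplified stand-in for the real config type; only the path used by
// resolveExpireAction is modeled here.
type Config = {
  messages?: { delivery?: { expireAction?: "deliver" | "fail" } };
};

// Mirrors the snippet under review: unset or unrecognized values fall back
// to "deliver" so a long gateway restart doesn't drop queued messages.
function resolveExpireAction(cfg: Config): "deliver" | "fail" {
  const action = cfg.messages?.delivery?.expireAction;
  return action === "fail" ? "fail" : "deliver";
}

console.log(resolveExpireAction({})); // "deliver"
console.log(
  resolveExpireAction({ messages: { delivery: { expireAction: "fail" } } }),
); // "fail"
```

Operators who want expired entries terminalized instead of retried opt in with `{ messages: { delivery: { expireAction: "fail" } } }`.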

@nohat force-pushed the lifecycle/write-ahead-recovery branch from 5f56619 to e595246 (March 2, 2026 17:40)
@chatgpt-codex-connector (bot) left a comment:

💡 Codex Review

Reviewed commit: e595246b6f


Comment on lines +174 to +177
`UPDATE message_outbox
SET last_attempt_at = ?, next_attempt_at = ?
WHERE id = ? AND last_attempt_at IS NULL`,
).run(now, now + computeBackoffMs(1), id);

P1: Keep first-attempt rows unrecoverable until send settles

markAttemptStarted only delays recovery by computeBackoffMs(1) (5s), but after that loadPendingDeliveries will include the row because last_attempt_at is now set, so the outbox worker can replay while the original live send is still running and produce duplicate user-visible messages without any crash/restart. Fresh evidence: Signal outbound requests use a 10s default RPC timeout (src/signal/client.ts:29, used by src/signal/send.ts:183-186), which is longer than this 5s guard.

nohat (Contributor, author) replied:

Fixed in 8e3f671. Widened the markAttemptStarted guard from computeBackoffMs(1) (5s) to computeBackoffMs(2) (25s). This gives enough headroom for slow channels like Signal (10s RPC timeout) plus retry overhead. The guard is defense-in-depth alongside startupCutoff filtering.

@chatgpt-codex-connector (bot) left a comment:


💡 Codex Review

Reviewed commit: 8e3f67175f


accountId: deliveryQueueContext.accountId,
payloads: [normalized],
threadId: deliveryQueueContext.threadId,
replyToId: normalized.replyToId ?? deliveryQueueContext.replyToId,

P1: Queue dispatcher rows with effective reply target

enqueueDelivery persists replyToId from normalized.replyToId ?? deliveryQueueContext.replyToId before options.deliver executes, but Slack/Discord compute the actual per-send reply target inside their delivery callbacks (src/slack/monitor/message-handler/dispatch.ts:195-213, src/discord/monitor/message-handler.process.ts:652-669) using planners that change across messages (for example replyToMode="first"). If the process crashes after enqueueing later payloads, recovery replays with this stale fallback replyToId, so replies can be attached to the wrong thread/message. Fresh evidence vs the earlier thread: this queueing path is in createReplyDispatcher and does not go through deliver.ts's pre-send resolution path.

nohat (Contributor, author) replied:

Valid observation — the previous reply on this thread was inaccurate (it described the deliver.ts path, not this reply-dispatcher path).

The reply-dispatcher enqueues with deliveryQueueContext.replyToId (the incoming message ID from ctx.ReplyToId) before options.deliver runs, so Slack/Discord dynamic planner state (replyPlan.nextThreadTs(), replyToMode="first") is not captured.

Practical impact is limited:

  1. Only final dispatch_kind rows survive recovery — tool/block rows are pruned (lines 528-544 of delivery-queue.ts)
  2. Most turns produce a single final reply, so there is no planner-state divergence
  3. The stored replyToId (incoming message ID) is a reasonable recovery fallback — threading to the original message is better than dropping the reply
  4. For replyToMode="first", the first reply threads to the incoming message (matching the stored value); subsequent non-final replies are pruned on recovery

Fully resolving this would require persisting per-channel planner state (e.g., Slack reply-plan position), which is complex. Tracking as a known limitation — the at-least-once guarantee holds; the threading target may be imprecise in multi-reply recovery edge cases.
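The pruning behavior described in point 1 can be illustrated with a toy filter. The row shape and function name here are hypothetical stand-ins; the actual logic lives in delivery-queue.ts.

```typescript
// Hypothetical row shape; the real message_outbox schema has more columns.
type OutboxRow = {
  id: number;
  dispatch_kind: "final" | "tool" | "block";
  replyToId?: string;
};

// Recovery replays only final replies; intermediate tool/block rows are
// dropped rather than re-delivered.
function pruneNonFinal(rows: OutboxRow[]): OutboxRow[] {
  return rows.filter((row) => row.dispatch_kind === "final");
}

const survivors = pruneNonFinal([
  { id: 1, dispatch_kind: "tool" },
  { id: 2, dispatch_kind: "block" },
  { id: 3, dispatch_kind: "final", replyToId: "msg-42" },
]);
console.log(survivors.map((r) => r.id)); // [ 3 ]
```

Because only the final row survives, the stale stored `replyToId` matters only when a turn produced multiple final replies, which matches the "edge case" framing above.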

nohat added 17 commits March 5, 2026 08:35
Replace unbounded file-based delivery queue with queryable SQLite
message_outbox table. Adds TTL/expiry for stale entries, delivery
outcome retention, and one-time legacy file queue import on startup.

Closes openclaw#23777, openclaw#16555, openclaw#29128
- failDelivery: terminalize non-final rows immediately (no recovery path)
- failDelivery: set attempt_count/last_attempt_at on permanent errors
- failDelivery: add dispatch_kind to SELECT for non-final detection
- pruneOutbox: guard against invalid ageMs (<=0 or non-finite)
- db: add PRAGMA busy_timeout=5000 to prevent SQLITE_BUSY
- db: remove unconditional DROP INDEX before CREATE IF NOT EXISTS
- test: add clearLifecycleDbCacheForTest in afterEach
- test: update stale comments (5s delay, deferred-to-next-tick)
…lt expireAction to deliver, clear DB cache in tests
…nd non-final cleanup

- markAttemptStarted: push next_attempt_at forward by computeBackoffMs(1) to
  prevent recovery worker from racing with the in-flight direct send
- recoverPendingDeliveries: expand non-final terminalization to also catch
  post-startup rows where markAttemptStarted ran but ack/fail didn't complete
  (last_attempt_at IS NOT NULL)
@nohat force-pushed the lifecycle/write-ahead-recovery branch from 618f7f2 to 06ec2b9 (March 5, 2026 16:37)
@chatgpt-codex-connector (bot) left a comment:

💡 Codex Review

Reviewed commit: 06ec2b96eb


Comment on lines +176 to +179
`UPDATE message_outbox
SET last_attempt_at = ?, next_attempt_at = ?
WHERE id = ? AND last_attempt_at IS NULL`,
).run(now, now + computeBackoffMs(2), id);

P1: Keep in-flight rows unrecoverable until send settles

markAttemptStarted only defers recovery by 25s (computeBackoffMs(2)); after that, loadPendingDeliveries can pick the same row while the original live send is still in progress, causing duplicate user-visible deliveries without any restart. Fresh evidence beyond the earlier thread: at least one outbound path is unbounded. sendWebhookMessageDiscord uses a plain fetch(...) with no timeout (src/discord/send.outbound.ts:357), so a stalled webhook call can easily exceed 25s and be replayed concurrently by the worker.
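One conventional way to bound such a call is an abortable fetch. This is a sketch only, not this PR's fix; the helper name, headers, and 20s value are illustrative, and the real Discord webhook wrapper may need retry/error handling on top.

```typescript
// Sketch: bound an outbound webhook POST with AbortSignal.timeout (Node 18+)
// so a stalled request cannot outlive the recovery guard window. The helper
// name and timeout value are illustrative, not from this PR.
async function postWithTimeout(
  url: string,
  body: unknown,
  timeoutMs = 20_000,
): Promise<Response> {
  return fetch(url, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(body),
    // Rejects the fetch with a TimeoutError once timeoutMs elapses.
    signal: AbortSignal.timeout(timeoutMs),
  });
}
```

With the send bounded below the 25s guard, a stalled webhook fails fast and the row is finalized by failDelivery instead of racing the recovery worker.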

Labels

gateway (Gateway runtime), size: XL