feat(outbox): write-ahead outbox with recovery worker and delivery tracking #30009

nohat wants to merge 17 commits into openclaw:main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 05ba57752c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
src/gateway/server.impl.ts
Outdated
```ts
lifecycleWorkersPromise = (async () => {
  const { startMessageLifecycleWorkers } = await import("./server-message-lifecycle.js");
  lifecycleWorkers = await startMessageLifecycleWorkers({
    cfg: cfgAtStart,
```
Refresh lifecycle worker config after hot reload
The lifecycle worker is started once with cfgAtStart and then reuses that snapshot for every recovery pass, so it never sees runtime config updates. In this codebase messages.* changes are hot/no-op in reload planning, so updates like messages.delivery.maxAgeMs/expireAction (and other delivery-affecting settings) can take effect for live sends but not for retries, which leaves recovery behavior inconsistent until a full gateway restart.
Fixed — changed StartLifecycleWorkersParams from cfg: OpenClawConfig to getCfg: () => OpenClawConfig. Each worker pass now calls getCfg() to get fresh config. The call site in server.impl.ts passes loadConfig directly.
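The fix can be sketched as below; `StartLifecycleWorkersParams` and `OpenClawConfig` mirror names from the PR, but the shapes shown are simplified assumptions:

```typescript
// Sketch of the getter-based config plumbing described above. The real
// OpenClawConfig is much larger; only the delivery settings are modeled here.
type OpenClawConfig = { messages?: { delivery?: { maxAgeMs?: number } } };

interface StartLifecycleWorkersParams {
  // A getter instead of a snapshot: each worker pass re-reads config,
  // so hot-reloaded values apply without a gateway restart.
  getCfg: () => OpenClawConfig;
  intervalMs: number;
}

// One recovery pass resolving maxAgeMs from fresh config; the 30-minute
// fallback matches the default mentioned later in this thread.
function runRecoveryPass(params: StartLifecycleWorkersParams): number {
  const cfg = params.getCfg();
  return cfg.messages?.delivery?.maxAgeMs ?? 30 * 60 * 1000;
}
```

With this shape the call site can pass `loadConfig` directly, as the reply above describes.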
```sql
AND (dispatch_kind IS NULL OR dispatch_kind = 'final')
AND next_attempt_at <= ?
```
Terminalize non-final failures when recovery skips them
loadPendingDeliveries now filters recovery to (dispatch_kind IS NULL OR dispatch_kind = 'final'), but failed tool/block rows are still written as failed_retryable. If messages.delivery.expireAction is configured to deliver (so expiry update is skipped), those non-final failed rows are never retried, never marked terminal, and never pruned, causing unbounded outbox growth under repeated transient tool/block send failures.
Fixed — recoverPendingDeliveries now marks orphaned non-final outbox rows (tool/block dispatch_kind) as failed_terminal at the start of each recovery pass, scoped to entries older than startupCutoff. This prevents unbounded accumulation while leaving in-flight entries from the current instance untouched.
Force-pushed 05ba577 to 2cc78e3 (compare)
Greptile Summary

This PR implements a write-ahead outbox pattern with a SQLite-backed delivery queue and recovery worker, introducing crash-safe message delivery.

Architecture: the direct-send and queued paths are mutually exclusive (if/else branching), preventing double-queueing.

Notable: idempotency-key infrastructure is present in the schema and types but not yet implemented — the system currently relies on "harmless duplicate delivery" rather than true idempotency for deduplication.

Confidence Score: 4/5
Last reviewed commit: 2cc78e3
```ts
  );
  return id;
}
```
Idempotency keys mentioned here but not implemented - the schema has idempotency_key column and ChannelOutboundAdapter has supportsIdempotencyKey flag, but no code generates or uses idempotency keys. This means ackDelivery failures will cause duplicate sends rather than idempotent retries. Consider implementing idempotency key generation or updating comment to reflect current "best-effort deduplication" approach.
Path: src/infra/outbound/delivery-queue.ts, lines 153-156
Intentional schema scaffolding for future use. Idempotency is currently handled by the recovery worker's startupCutoff filter (prevents double-delivery of in-flight entries) plus adapter-level dedup where supported. The column and unique index are forward-compatible — they'll be populated when per-channel idempotency key generation is wired up.
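If per-channel key generation is wired up later, one hypothetical shape is a deterministic hash over the send coordinates, so retries of the same logical send reuse the same key. None of these helpers exist in the PR; the names and fields are illustrative only:

```typescript
import { createHash } from "node:crypto";

// Hypothetical idempotency-key generator for the forward-compatible
// idempotency_key column discussed above. Deterministic across retries of
// the same logical send; unique across distinct sends within a turn.
function makeIdempotencyKey(params: {
  channel: string;
  to: string;
  turnId: string;
  payloadIndex: number;
}): string {
  const h = createHash("sha256");
  h.update(`${params.channel}\n${params.to}\n${params.turnId}\n${params.payloadIndex}`);
  // 32 hex chars keeps the key short enough for most channel APIs.
  return h.digest("hex").slice(0, 32);
}
```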
💡 Codex Review
Reviewed commit: 2cc78e32cb
```ts
accountId: deliveryQueueContext.accountId,
payloads: [normalized],
threadId: deliveryQueueContext.threadId,
replyToId: normalized.replyToId ?? deliveryQueueContext.replyToId,
```
Persist only effective reply target for queued sends
This queues deliveryQueueContext.replyToId for every payload when the payload itself has no replyToId, but many channel handlers decide reply threading dynamically at send time (for example via reply-to mode planners in monitor-specific deliverers). During crash recovery, entries are replayed through deliverOutboundPayloads using this stored static replyToId, so messages can be incorrectly threaded (e.g. replyToMode="off" still replies, or replyToMode="first" replies on later messages). Store the effective reply target actually used for that send (or omit the fallback) to keep recovery behavior aligned with live delivery.
The queued replyToId comes from the delivery context at enqueue time, not from dynamic resolution. replyToMode handling happens in deliver.ts before the payload reaches the queue, so the effective reply target is already resolved. Recovery replays with the same stored target.
Force-pushed 2cc78e3 to 1c39ee5 (compare)
💡 Codex Review
Reviewed commit: 1c39ee5047
```ts
return {
  channel,
  to,
  accountId: params.ctx.AccountId?.trim() || undefined,
```
Exclude slash-style dispatches from outbox replay context
Do not derive a generic delivery queue context for all dispatcher-based flows here, because some flows are not replayable from just channel + to. In Slack slash handling, replies are sent via the request-scoped respond callback (src/slack/monitor/slash.ts), but this code now stores channel: "slack" and to: "user:<id>"; if that send fails (or the process restarts after enqueue), recovery replays through deliverOutboundPayloads and can post a DM to the user instead of a slash response. This misroutes messages and can leak content in environments where slash responses are expected to stay in-command context.
Acknowledged tradeoff. Recovery sends a DM instead of the ephemeral respond callback because respond has a ~15min TTL and cannot survive a restart. A DM is better than a lost message. Added a code comment documenting this tradeoff above resolveDeliveryQueueContext.
Force-pushed 1c39ee5 to fbb7c98 (compare)
💡 Codex Review
Reviewed commit: fbb7c98316
src/auto-reply/dispatch.ts
Outdated
```ts
if (dispatcher.setDeliveryQueueContext) {
  const queueContext = resolveDeliveryQueueContext({ ctx: finalized });
  dispatcher.setDeliveryQueueContext(queueContext);
```
Skip outbox context for interaction-scoped dispatchers
This now sets a delivery queue context for every deliverable inbound message, which means dispatcher-based flows are persisted and later replayed via recoverPendingDeliveries using deliverOutboundPayloads (generic channel send) instead of their original request-scoped callbacks. In interaction handlers like src/slack/monitor/slash.ts (respond) and src/discord/monitor/native-command.ts (interaction.reply/followUp), a failed send or restart can therefore replay to OriginatingTo as a normal DM/channel post rather than an interaction response (including ephemeral slash replies), changing recipient/scope and potentially exposing content unexpectedly.
Agreed — implemented conservative approach: interaction-scoped dispatchers (Slack slash, Discord native commands) now skip outbox entry creation entirely via a CommandSource === "native" guard in dispatchInboundMessageInternal.
Fix landed in d4faf01 on lifecycle/write-ahead-recovery (this PR). The same guard is also present on lifecycle/turn-tracking (#30011) where the code was further extended with turn tracking.
Pull request overview
Adds a SQLite-backed write-ahead outbox lifecycle with a periodic recovery/pruning worker, and threads delivery metadata (turnId / dispatch kind) through the outbound pipeline to enable restart-safe retries and better delivery tracking.
Changes:
- Introduce `message-lifecycle` SQLite DB (schema + caching) and a gateway worker that runs outbox recovery + pruning.
- Migrate delivery queue operations to `message_outbox` rows (enqueue/ack/fail/recover) with new states and startup-cutoff logic to avoid double-delivery.
- Thread `turnId` and delivery metadata through reply dispatch and outbound delivery APIs; extend config schema/docs for delivery expiry behavior.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/infra/outbound/outbound.test.ts | Updates delivery-queue tests from filesystem queue assertions to SQLite outbox assertions. |
| src/infra/outbound/delivery-queue.ts | Replaces file-based queue with SQLite outbox operations; adds recovery filtering, expiry/pruning, and legacy import. |
| src/infra/outbound/deliver.ts | Plumbs turnId into queue writes; improves best-effort partial failure detection. |
| src/infra/message-lifecycle/index.ts | Re-exports lifecycle DB helpers. |
| src/infra/message-lifecycle/db.ts | Implements cached SQLite DB open + schema creation + transaction helper. |
| src/gateway/server.impl.ts | Starts/stops message lifecycle workers during gateway lifecycle. |
| src/gateway/server-message-lifecycle.ts | New worker loop for outbox recovery and outbox pruning; imports legacy file queue. |
| src/config/zod-schema.session.ts | Adds messages.delivery config validation (maxAgeMs/expireAction). |
| src/config/types.messages.ts | Adds typed config for delivery expiry behavior. |
| src/config/schema.labels.ts | Adds UI labels for delivery lifecycle config. |
| src/config/schema.help.ts | Adds help text for delivery lifecycle config and defaults. |
| src/channels/plugins/types.adapters.ts | Adds outbound delivery metadata + idempotency-key support flag to adapter types. |
| src/auto-reply/templating.ts | Adds MessageTurnId to templating context. |
| src/auto-reply/reply/route-reply.ts | Plumbs turnId into routed outbound delivery. |
| src/auto-reply/reply/reply-dispatcher.ts | Adds optional outbox enqueue/ack/fail around direct reply delivery (with per-kind dispatchKind). |
| src/auto-reply/reply/dispatch-from-config.ts | Tracks/returns attemptedFinal and plumbs turnId into routeReply calls. |
| src/auto-reply/reply/dispatch-acp.ts | Tracks/returns attemptedFinal for ACP dispatch path. |
| src/auto-reply/dispatch.ts | Derives and sets dispatcher outbox context for non-interaction-scoped inbound messages. |
```sql
AND next_attempt_at <= ?
AND queued_at >= ?
AND last_attempt_at IS NULL
AND attempt_count = 0`,
```
The skippedStartupCutoff metric is counted using a query that doesn’t apply the same dispatch_kind filter as loadPendingDeliveries (which only loads dispatch_kind NULL or 'final'). This can overcount skippedStartupCutoff by including tool/block rows that are actually excluded for a different reason, making the summary/logs misleading. Consider adding the same dispatch_kind predicate to the COUNT(*) query so the metric matches the set of entries that would otherwise have been eligible.
Suggested change:

```diff
-  AND attempt_count = 0`,
+  AND attempt_count = 0
+  AND (dispatch_kind IS NULL OR dispatch_kind = 'final')`,
```
Fixed in fec43ce. Added AND (dispatch_kind IS NULL OR dispatch_kind = 'final') to the COUNT query so it matches loadPendingDeliveries filtering.
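A lightweight way to keep the two queries from drifting again is to define the recoverability predicate once and interpolate it into both statements. The SQL below is illustrative, with column names taken from the snippets in this thread:

```typescript
// Shared predicate so loadPendingDeliveries and the skippedStartupCutoff
// COUNT(*) metric always agree on which rows are recovery-eligible.
const RECOVERABLE_DISPATCH_KIND = `(dispatch_kind IS NULL OR dispatch_kind = 'final')`;

const loadPendingSql = `
  SELECT * FROM message_outbox
  WHERE status IN ('queued','failed_retryable')
    AND ${RECOVERABLE_DISPATCH_KIND}
    AND next_attempt_at <= ?`;

const skippedCutoffCountSql = `
  SELECT COUNT(*) AS n FROM message_outbox
  WHERE status IN ('queued','failed_retryable')
    AND ${RECOVERABLE_DISPATCH_KIND}
    AND next_attempt_at <= ?
    AND queued_at >= ?
    AND last_attempt_at IS NULL
    AND attempt_count = 0`;
```

Any future change to the predicate then updates both the selection and the metric in one place.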
```ts
if (resolveExpireAction(opts.cfg) === "fail") {
  const db = getLifecycleDb(opts.stateDir);
  const staleCutoff = Date.now() - resolveDeliveryMaxAgeMs(opts.cfg);
  try {
    db.prepare(
      `UPDATE message_outbox
       SET status='expired',
           error_class='terminal',
           last_error='expired: queued_at too old',
           terminal_reason='expired',
           completed_at=?
       WHERE status IN ('queued','failed_retryable')
         AND queued_at < ?`,
    ).run(Date.now(), staleCutoff);
  } catch (err) {
    logVerbose(`delivery-queue: expiry update failed: ${String(err)}`);
  }
}
```
recoverPendingDeliveries now marks any queued/failed_retryable rows older than maxAgeMs as status='expired' when expireAction is unset (defaults to "fail") and maxAgeMs falls back to 30 minutes. This means a gateway restart after >30 minutes of downtime can permanently drop previously queued but never-attempted deliveries, which appears to conflict with the PR goal of ensuring crash-survivor deliveries are retried after restart. If expiry is intended, consider making it opt-in (no default maxAgeMs), or defaulting expireAction to "deliver", or at least aligning the PR description to call out the default expiry behavior explicitly.
Fixed in fec43ce. Flipped resolveExpireAction default from "fail" to "deliver" so a long gateway restart does not permanently drop queued messages. Operators can still opt into "fail" explicitly via messages.delivery.expireAction.
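The post-fix resolver is small enough to sketch in full; the config shape here is a simplified assumption:

```typescript
// Sketch of resolveExpireAction after the fix: unset expireAction resolves
// to "deliver", and only an explicit "fail" opts into terminalizing stale rows.
type ExpireAction = "deliver" | "fail";
interface Cfg {
  messages?: { delivery?: { expireAction?: ExpireAction } };
}

function resolveExpireAction(cfg: Cfg): ExpireAction {
  const action = cfg.messages?.delivery?.expireAction;
  // Default to "deliver" so a long gateway restart doesn't permanently
  // drop queued messages; operators can opt into "fail" explicitly.
  return action === "fail" ? "fail" : "deliver";
}
```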
src/infra/outbound/outbound.test.ts
Outdated
```ts
import { getLifecycleDb } from "../message-lifecycle/db.js";
import {
```
The test suite now opens a cached SQLite DatabaseSync via getLifecycleDb(tmpDir), but afterEach only rmSyncs tmpDir. Because getLifecycleDb caches DB handles globally, the DB file can remain open across tests (and the cache can grow), which can make tmpDir cleanup flaky (notably on Windows) and cause cross-test interference. Consider importing clearLifecycleDbCacheForTest and calling it in afterEach (before rmSync) to close cached DBs.
Fixed in fec43ce. Added clearLifecycleDbCacheForTest() call to afterEach (before rmSync) so cached DB handles are properly closed between tests.
Force-pushed d4faf01 to b8b68f6 (compare)
💡 Codex Review
Reviewed commit: b8b68f6260
```ts
accountId: deliveryQueueContext.accountId,
payloads: [normalized],
threadId: deliveryQueueContext.threadId,
replyToId: normalized.replyToId ?? deliveryQueueContext.replyToId,
```
Store only effective reply target in queued dispatcher rows
Queuing with replyToId: normalized.replyToId ?? deliveryQueueContext.replyToId persists the inbound fallback reply target instead of the effective per-send target, so recovery can replay messages into threads that live delivery would not use (for example replyToMode: "off"/"first" in Slack or Discord). This creates misthreaded or unexpectedly visible replies after a crash/restart because replay uses the stored static replyToId through deliverOutboundPayloads. Fresh evidence: Slack/Discord set context ReplyToId eagerly (src/slack/monitor/message-handler/prepare.ts, src/discord/monitor/message-handler.process.ts) but resolve thread usage at send time (src/slack/monitor/replies.ts, src/discord/monitor/reply-delivery.ts).
Same explanation as before: replyToId is resolved at enqueue time in deliverOutboundPayloads before reaching the queue. The queue stores the fully resolved payload including replyToId. Recovery replays the stored payload as-is — no dynamic resolution needed.
Correction to my earlier reply: I described the deliver.ts pre-send resolution path, but this code path goes through the reply-dispatcher, not deliver.ts. See the reply to the follow-up comment above for accurate analysis — the stored replyToId is a best-effort fallback from context, not a dynamically resolved value. The practical impact is limited to multi-reply recovery edge cases.
src/infra/outbound/delivery-queue.ts
Outdated
```sql
WHERE status IN ('queued','failed_retryable')
  AND dispatch_kind IS NOT NULL
  AND dispatch_kind != 'final'
  AND queued_at < ?`,
```
Terminalize non-final failures without startup-only cutoff
The non-final cleanup query is gated by queued_at < startupCutoff, so tool/block rows that fail after startup remain failed_retryable indefinitely: recovery never picks them up (loadPendingDeliveries filters to final/NULL dispatch kinds) and pruning never removes them (pruneOutbox deletes only terminal statuses). In a long-lived process with repeated non-final send failures, outbox rows keep accumulating until the next restart.
Fixed with a two-part approach in d9f04d1 (#29998) and 5f56619 (#30009):
Part A (d9f04d1 on #29998): failDelivery now looks up dispatch_kind and immediately terminates non-final rows instead of marking them failed_retryable. Since recovery skips non-final rows by design, failed_retryable was semantically wrong — there's no mechanism to retry them.
Part B (5f56619 on #30009): The recovery terminalization query now also catches post-startup non-final rows where markAttemptStarted ran but ack/fail didn't complete (last_attempt_at IS NOT NULL), in addition to pre-startup crash survivors.
The remaining gap (enqueued after startup, markAttemptStarted never ran) is microseconds wide and self-heals on restart.
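Part A's decision logic can be sketched as below, with an in-memory row standing in for the SQLite table; statuses and fields follow the snippets in this thread:

```typescript
// Sketch of failDelivery after Part A: non-final (tool/block) rows are
// terminalized immediately, since recovery only ever replays final/NULL
// dispatch kinds and failed_retryable would strand them forever.
type DispatchKind = "final" | "tool" | "block" | null;

interface OutboxRow {
  id: string;
  dispatchKind: DispatchKind;
  status: "queued" | "failed_retryable" | "failed_terminal";
  attemptCount: number;
}

function failDelivery(row: OutboxRow, permanent: boolean): OutboxRow {
  const nonFinal = row.dispatchKind !== null && row.dispatchKind !== "final";
  return {
    ...row,
    attemptCount: row.attemptCount + 1,
    // Permanent errors and non-final rows both go terminal; only transient
    // failures on recoverable (final/NULL) rows stay retryable.
    status: permanent || nonFinal ? "failed_terminal" : "failed_retryable",
  };
}
```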
💡 Codex Review
Reviewed commit: 46362ed77f
```ts
// Mark attempt started before sending so the row passes startup-cutoff
// filtering even if the subsequent ack/fail write is swallowed.
if (queueId) {
  await markAttemptStarted(queueId).catch(() => {});
```
Avoid marking first attempts recoverable before send completes
Calling markAttemptStarted before the first outbound send makes a brand-new row satisfy the recovery predicate that allows last_attempt_at IS NOT NULL entries, even when attempt_count is still 0. I checked this against the periodic worker loop in src/gateway/server-message-lifecycle.ts and the filter in src/infra/outbound/delivery-queue.ts: if a normal live send takes longer than one worker tick, recovery can pick the same row and dispatch it concurrently, causing duplicate user-visible messages without any crash/restart.
Fixed in 5f56619. markAttemptStarted now also pushes next_attempt_at forward by computeBackoffMs(1) (5s — the first-retry backoff interval). This prevents the recovery worker from picking up the row while the direct send is in flight. If the direct send completes, ackDelivery/failDelivery finalizes the row before recovery touches it. If it silently fails, recovery picks it up after exactly the first retry backoff.
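The PR does not show computeBackoffMs itself; one reconstruction consistent with the values quoted in this thread (attempt 1 gives 5s, attempt 2 gives 25s) is an exponential base-5 schedule with a cap. Treat this as an assumption, not the actual implementation:

```typescript
// Hypothetical computeBackoffMs matching the 5s/25s values quoted above:
// 5s * 5^(attempt-1), capped so late retries don't grow unboundedly.
// The 30-minute cap is an illustrative choice, not taken from the PR.
function computeBackoffMs(attempt: number, capMs = 30 * 60 * 1000): number {
  const baseMs = 5_000;
  const raw = baseMs * Math.pow(5, Math.max(0, attempt - 1));
  return Math.min(raw, capMs);
}
```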
Force-pushed 46362ed to 5f56619 (compare)
💡 Codex Review
Reviewed commit: 5f56619d40
```ts
const action = cfg.messages?.delivery?.expireAction;
// Default to "deliver" so that a long gateway restart doesn't permanently
// drop queued messages. Operators can opt into "fail" explicitly.
return action === "fail" ? "fail" : "deliver";
```
Align default expiry action with documented config default
resolveExpireAction currently defaults unset messages.delivery.expireAction to "deliver", but this commit’s config contract documents the default as "fail" (see src/config/types.messages.ts and src/config/schema.help.ts). In environments that rely on defaults, stale outbox entries older than maxAgeMs will be retried and delivered instead of being terminalized, which can replay outdated messages after long downtime. The runtime default should match the documented/defaulted behavior to avoid surprising delivery of expired content.
Already fixed in fec43ce (previous round). resolveExpireAction now defaults to "deliver" so a long gateway restart doesn't permanently drop queued messages. Operators can opt into "fail" explicitly via messages.delivery.expireAction.
Force-pushed 5f56619 to e595246 (compare)
💡 Codex Review
Reviewed commit: e595246b6f
src/infra/outbound/delivery-queue.ts
Outdated
```ts
`UPDATE message_outbox
 SET last_attempt_at = ?, next_attempt_at = ?
 WHERE id = ? AND last_attempt_at IS NULL`,
).run(now, now + computeBackoffMs(1), id);
```
Keep first-attempt rows unrecoverable until send settles
markAttemptStarted only delays recovery by computeBackoffMs(1) (5s), but after that loadPendingDeliveries will include the row because last_attempt_at is now set, so the outbox worker can replay while the original live send is still running and produce duplicate user-visible messages without any crash/restart. Fresh evidence: Signal outbound requests use a 10s default RPC timeout (src/signal/client.ts:29, used by src/signal/send.ts:183-186), which is longer than this 5s guard.
Fixed in 8e3f671. Widened the markAttemptStarted guard from computeBackoffMs(1) (5s) to computeBackoffMs(2) (25s). This gives enough headroom for slow channels like Signal (10s RPC timeout) plus retry overhead. The guard is defense-in-depth alongside startupCutoff filtering.
💡 Codex Review
Reviewed commit: 8e3f67175f
```ts
accountId: deliveryQueueContext.accountId,
payloads: [normalized],
threadId: deliveryQueueContext.threadId,
replyToId: normalized.replyToId ?? deliveryQueueContext.replyToId,
```
Queue dispatcher rows with effective reply target
enqueueDelivery persists replyToId from normalized.replyToId ?? deliveryQueueContext.replyToId before options.deliver executes, but Slack/Discord compute the actual per-send reply target inside their delivery callbacks (src/slack/monitor/message-handler/dispatch.ts:195-213, src/discord/monitor/message-handler.process.ts:652-669) using planners that change across messages (for example replyToMode="first"). If the process crashes after enqueueing later payloads, recovery replays with this stale fallback replyToId, so replies can be attached to the wrong thread/message. Fresh evidence vs the earlier thread: this queueing path is in createReplyDispatcher and does not go through deliver.ts's pre-send resolution path.
Valid observation — the previous reply on this thread was inaccurate (it described the deliver.ts path, not this reply-dispatcher path).
The reply-dispatcher enqueues with deliveryQueueContext.replyToId (the incoming message ID from ctx.ReplyToId) before options.deliver runs, so Slack/Discord dynamic planner state (replyPlan.nextThreadTs(), replyToMode="first") is not captured.
Practical impact is limited:
- Only `final` dispatch_kind rows survive recovery — tool/block rows are pruned (lines 528-544 of `delivery-queue.ts`).
- Most turns produce a single final reply, so there is no planner-state divergence.
- The stored `replyToId` (incoming message ID) is a reasonable recovery fallback — threading to the original message is better than dropping the reply.
- For `replyToMode="first"`, the first reply threads to the incoming message (matching the stored value); subsequent non-final replies are pruned on recovery.
Fully resolving this would require persisting per-channel planner state (e.g., Slack reply-plan position), which is complex. Tracking as a known limitation — the at-least-once guarantee holds; the threading target may be imprecise in multi-reply recovery edge cases.
Replace unbounded file-based delivery queue with queryable SQLite message_outbox table. Adds TTL/expiry for stale entries, delivery outcome retention, and one-time legacy file queue import on startup. Closes openclaw#23777, openclaw#16555, openclaw#29128
- failDelivery: terminalize non-final rows immediately (no recovery path)
- failDelivery: set attempt_count/last_attempt_at on permanent errors
- failDelivery: add dispatch_kind to SELECT for non-final detection
- pruneOutbox: guard against invalid ageMs (<=0 or non-finite)
- db: add PRAGMA busy_timeout=5000 to prevent SQLITE_BUSY
- db: remove unconditional DROP INDEX before CREATE IF NOT EXISTS
- test: add clearLifecycleDbCacheForTest in afterEach
- test: update stale comments (5s delay, deferred-to-next-tick)
…lt expireAction to deliver, clear DB cache in tests
…nd non-final cleanup
- markAttemptStarted: push next_attempt_at forward by computeBackoffMs(1) to prevent the recovery worker from racing with the in-flight direct send
- recoverPendingDeliveries: expand non-final terminalization to also catch post-startup rows where markAttemptStarted ran but ack/fail didn't complete (last_attempt_at IS NOT NULL)
Force-pushed 618f7f2 to 06ec2b9 (compare)
💡 Codex Review
Reviewed commit: 06ec2b96eb
```ts
`UPDATE message_outbox
 SET last_attempt_at = ?, next_attempt_at = ?
 WHERE id = ? AND last_attempt_at IS NULL`,
).run(now, now + computeBackoffMs(2), id);
```
Keep in-flight rows unrecoverable until send settles
markAttemptStarted only defers recovery by 25s (computeBackoffMs(2)), after which loadPendingDeliveries can pick the same row while the original live send is still in progress, causing duplicate user-visible deliveries without any restart. Fresh evidence beyond the earlier thread: at least one outbound path is unbounded, sendWebhookMessageDiscord uses a plain fetch(...) with no timeout (src/discord/send.outbound.ts:357), so a stalled webhook call can easily exceed 25s and be replayed concurrently by the worker.
Part of Message Reliability: Durable SQLite Outbox, Recovery Worker, and Unified sendPayload (#32063)
Summary
Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
User-visible / Behavior Changes
None visible — messages that previously would have been lost after a crash are now retried on restart.
Security Impact (required)
Repro + Verification
Environment
Steps
Expected
Actual
Evidence
Human Verification (required)
Compatibility / Migration
Failure Recovery (if this breaks)
Risks and Mitigations