fix(gateway): message delivery reliability with SQLite journal (inbound dedup + orphan recovery) by nohat · Pull Request #27939 · openclaw/openclaw

nohat · 2026-02-26T21:33:52Z

Summary

Problem: OpenClaw has two reliability gaps that have surfaced repeatedly in bug reports and support threads:

Outbound: The delivery queue stores pending messages as delivery-queue/*.json files. Files are not atomic, not queryable, and accumulate silently — there is no status tracking, no per-entry retry state, and no pruning. A crash mid-write produces a corrupt entry; entries that will never succeed (blocked bot, deleted chat) are retried forever.

Inbound: Duplicate-delivery suppression is in-memory only. Channels like Telegram and WhatsApp redeliver unacknowledged messages; after a gateway restart the seen-set is cleared and the bot replies twice. There is also no record of turns that were accepted for dispatch but not yet completed when the gateway stopped — those are silently dropped, the user sees no reply.
Why it matters: Silent message loss and duplicate replies are the most-reported reliability issues.
What changed:

Both queues replaced with a single SQLite journal (message-journal.db, in the state dir). Uses node:sqlite — Node 22 built-in, no new dependency.

inbound_events — one row per accepted inbound turn. A UNIQUE index on dedupe_key (built from provider + account + peer/thread + message id) enforces dedup across restarts with correct per-chat/per-thread scope. When DB writes fail, an in-memory fallback with throttled warnings prevents duplicate spam. Status transitions: processing → delivered / aborted / failed. On startup, rows still in processing are recovered with bounded attempts (max 3), and stale processing rows are marked failed.

outbound_messages — replaces delivery-queue/*.json. Indexed status/retry/error tracking with permanent-error quarantine (entries that match known-unrecoverable patterns are moved to failed immediately rather than cycling through max retries). Retry backoff uses last_attempt_at so elapsed time carries across restarts — entries whose window hasn't elapsed are skipped rather than blocking recovery with an inline sleep. Abort errors are acked immediately without retry. Outbound rows are linked to inbound turns via inbound_id, and startup reconciliation uses outbound status (queued/delivered/failed) to prevent double replay while still recovering true orphans.

dispatchInboundMessageInternal now journals every real inbound turn (heartbeats and recovery replays excluded), releases dispatcher reservations on dedupe-skip paths, and marks the inbound turn delivered only after dispatcher drain completes. Journal writes are synchronous best-effort; errors are logged at verbose level and never block message processing. Atomic transactions protect multi-step operations (failDelivery read+update, recovery failure accounting).

Migration: On first startup after upgrade, any remaining delivery-queue/*.json files are imported into outbound_messages and deleted. Migrated entries preserve retry metadata including lastAttemptAt, so backoff timing remains correct post-upgrade. No manual steps required.

Pruning: Startup expires stale queued outbound rows (>30 min old) to failed, and marks stale inbound processing rows as failed (max recovery age 24h). The existing dedupeCleanup maintenance timer runs a prune every 6h, removing terminal rows older than 48h from both tables.
What did NOT change (scope boundary): Message routing, channel adapters, plugin hooks, agent logic, session management.

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

User-visible / Behavior Changes

None. This is an internal reliability improvement. No config changes required. Migration from old file queue happens automatically on first startup.

Security Impact (required)

New permissions/capabilities? (No)
Secrets/tokens handling changed? (No)
New/changed network calls? (No)
Command/tool execution surface changed? (No)
Data access scope changed? (No)
If any Yes, explain risk + mitigation:

Repro + Verification

Environment

OS: macOS 15
Runtime/container: Node 22.5+ (local dev)
Model/provider: N/A (gateway-level change)
Integration/channel (if any): Telegram (for E2E verification)
Relevant config (redacted): Default gateway config

Steps

Start gateway, send a message to trigger a reply.
While reply is in-flight (before completeInboundTurn), kill -9 the gateway.
Restart the gateway.

Expected

Orphaned turn recovered and a reply sent on restart.
No duplicate reply if delivery completed before the kill.
Orphan recovery stops after 3 failed attempts (no infinite retry).

Actual

Log line: [orphan-recovery] Recovered orphan turn <id> (session=<key>)
DB row transitions processing → delivered.
Over-attempt orphans marked failed and skipped.

Evidence

Failing test / passing after — added/expanded coverage in inbound.test.ts, outbound.test.ts, and dispatch.test.ts, including dedupe scope, orphan replay gating, migration retry-timestamp preservation, malformed queued-row quarantine, and stale processing-row terminalization.
Trace/log snippets — E2E scripts: https://gist.github.com/nohat/f10a2687aa3807bf1b2dd864297eddb1
- run-journal-e2e.sh — 6 before/after scenarios: migration, inbound dedup, abort, orphan recovery, delivery recovery, pruning
- test-orphan-recovery-e2e.sh — automated kill-9 + restart + log assertion
Screenshot/recording
Perf numbers (if relevant)

Human Verification (required)

Verified scenarios: First-startup migration from delivery-queue/ → journal; inbound dedup persists across restart (no duplicate reply); orphan turn recovered after kill -9 + restart; delivery retry after crash; journal pruning removes old rows; abort mid-turn recorded correctly.
Edge cases checked: No prior state dir; delivery-queue/ absent; same message id across different peers is not deduped incorrectly; permanent error patterns quarantine immediately; backoff window not elapsed on restart (entry deferred to next restart, not replayed inline); DB write failure uses in-memory dedupe fallback; abort errors skip retry queue; startup does not double-replay when outbound state already exists for an inbound turn; malformed queued payload rows are quarantined without blocking recovery of valid rows.
What I did not verify: Multi-account concurrent channel connections; Windows path behavior (covered by CI); queue depths >1000 entries.

Compatibility / Migration

Backward compatible? (Yes)
Config/env changes? (No)
Migration needed? (No) — automatic on first startup via migrateFileQueueToJournal.
If yes, exact upgrade steps: N/A

Downgrade: if reverted, migrated delivery-queue/ files are gone but failure mode is identical to pre-patch (no queue).

Failure Recovery (if this breaks)

How to disable/revert quickly: delete ~/.openclaw/message-journal.db (gateway degrades to no-queue behavior) or revert this commit.
Files/config to restore: none — delivery-queue/ entries are migrated and deleted on first startup.
Known bad symptoms reviewers should watch for: getJournalDb() throwing on every message (SQLite file corrupt or missing write permission); startup log showing no journal operations where expected.

Risks and Mitigations

Risk: SQLite write contention if process forks or crashes mid-write.
- Mitigation: WAL + NORMAL synchronous; all writes go through a single DatabaseSync singleton per process. Journal writes are best-effort (errors logged, never thrown to callers). Atomic transactions for multi-step operations.
Risk: node:sqlite unavailable in environments running Node < 22.5.
- Mitigation: requireNodeSqlite() throws early at first use with a clear error; the memory subsystem already requires this. No silent degradation.
Risk: Orphan recovery loops forever on persistent failures.
- Mitigation: Hard limit of 3 recovery attempts; rows marked failed after limit reached.
Risk: Crash boundary can leave inbound/outbound state out of sync and cause replay ambiguity.
- Mitigation: Outbound rows are linked via inbound_id; startup reconciliation inspects outbound status for each orphan inbound row (queued skip replay, delivered mark inbound delivered, failed mark inbound failed, none replay).

AI-assisted. Reviewed and E2E-verified by @nohat.

AI-assisted
Testing: fully E2E-verified against a live test gateway (6 scenarios)
E2E scripts: https://gist.github.com/nohat/f10a2687aa3807bf1b2dd864297eddb1
Author (@nohat) understands the code and verified the behavior

The file-based delivery-queue/ stored pending outbound messages as JSON files: not atomic, not queryable, no status tracking, and no pruning. Inbound dedup was in-memory only, so channel redeliveries were not caught across restarts. In-flight turns at crash time were silently dropped. Replace both with a SQLite journal (message-journal.db in the state dir): - outbound_messages replaces delivery-queue/*.json with indexed status/retry tracking and permanent-error quarantine - inbound_events persists dedup via UNIQUE(channel, account_id, external_id) and tracks per-turn status for orphan recovery - dispatchInboundMessageInternal now journals every real inbound turn and marks it 'delivered' after the dispatcher fully drains - Startup: stale queued entries expire (30m TTL), orphaned 'processing' turns are re-dispatched, old delivery-queue files are migrated in place - Periodic prune every 6h keeps both tables bounded; terminal rows older than 48h are removed Journal errors are best-effort and never block message processing. Closes openclaw#22376, openclaw#9208, openclaw#14827, openclaw#16555

greptile-apps · 2026-02-27T02:42:06Z

Greptile Summary

Replaced file-based message queuing with SQLite journal to fix duplicate message delivery and orphaned turn recovery issues. The implementation adds persistent deduplication for inbound messages and reliable retry logic for outbound deliveries.

Key improvements:

Inbound deduplication now persists across restarts via UNIQUE index on (channel, account_id, external_id)
Orphaned turns (in-flight when gateway crashed) are recovered on restart with 3-attempt limit
Outbound retries use exponential backoff that persists across restarts
Permanent error patterns (blocked bot, deleted chat) skip retry queue immediately
Automatic migration from old delivery-queue/*.json files on first startup
Journal pruning every 6h keeps DB bounded (48h retention for terminal rows)

Test coverage:

850+ lines of new tests across inbound.test.ts, outbound.test.ts, dispatch.test.ts
E2E verification scripts provided in PR description

Error handling:

Journal write failures fall back to in-memory dedup with throttled warnings
All journal operations use try-catch to prevent blocking message flow
Atomic transactions for multi-step DB operations

Confidence Score: 4/5

Safe to merge with one minor style improvement recommended for SQL injection prevention
Score reflects thorough implementation with comprehensive test coverage (850+ new test lines), proper error handling with fallbacks, bounded retry attempts, and automatic migration. One non-critical style issue around SQL interpolation in a private function that only receives hardcoded values. The PR successfully addresses long-standing reliability gaps with persistent deduplication and orphan recovery.
No files require special attention - all critical paths have proper error handling and test coverage

_{Last reviewed commit: 785fd30}

greptile-apps

_{22 files reviewed, 1 comment}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-02-27T02:42:10Z

src/infra/message-journal/db.ts

+  const rows = db.prepare(`PRAGMA table_info(${tableName})`).all() as Array<{
+    name?: string | null;
+  }>;
+  const hasColumn = rows.some((row) => row.name === columnName);
+  if (hasColumn) {
+    return;
+  }
+  db.exec(`ALTER TABLE ${tableName} ADD COLUMN ${columnName} ${definition}`);


string interpolation in SQL with ${tableName}, ${columnName}, and ${definition} could allow injection if called with user input

currently safe since only called with hardcoded literals, but add parameter binding or validation to prevent future misuse

Suggested change

const rows = db.prepare(`PRAGMA table_info(${tableName})`).all() as Array<{

name?: string | null;

}>;

const hasColumn = rows.some((row) => row.name === columnName);

if (hasColumn) {

return;

}

db.exec(`ALTER TABLE ${tableName} ADD COLUMN ${columnName} ${definition}`);

const rows = db.prepare('PRAGMA table_info(?)').all(tableName) as Array<{

name?: string | null;

}>;

const hasColumn = rows.some((row) => row.name === columnName);

if (hasColumn) {

return;

}

// Note: SQLite doesn't support parameter binding for DDL; validate inputs instead

if (!/^[a-zA-Z_][a-zA-Z0-9_]*$/.test(tableName) || !/^[a-zA-Z_][a-zA-Z0-9_]*$/.test(columnName)) {

throw new Error('Invalid table or column name');

}

db.exec(`ALTER TABLE ${tableName} ADD COLUMN ${columnName} ${definition}`);

Prompt To Fix With AI

This is a comment left during a code review. Path: src/infra/message-journal/db.ts Line: 109-116 Comment: string interpolation in SQL with `${tableName}`, `${columnName}`, and `${definition}` could allow injection if called with user input currently safe since only called with hardcoded literals, but add parameter binding or validation to prevent future misuse ```suggestion const rows = db.prepare('PRAGMA table_info(?)').all(tableName) as Array<{ name?: string | null; }>; const hasColumn = rows.some((row) => row.name === columnName); if (hasColumn) { return; } // Note: SQLite doesn't support parameter binding for DDL; validate inputs instead if (!/^[a-zA-Z_][a-zA-Z0-9_]*$/.test(tableName) || !/^[a-zA-Z_][a-zA-Z0-9_]*$/.test(columnName)) { throw new Error('Invalid table or column name'); } db.exec(`ALTER TABLE ${tableName} ADD COLUMN ${columnName} ${definition}`); ``` How can I resolve this? If you propose a fix, please make it concise.

Addressed in 7e0351e. Added identifier validation via SQL_IDENTIFIER regex before interpolating tableName and columnName into DDL. Throws on invalid identifiers. The definition parameter remains allowlisted at call sites (INTEGER, TEXT, etc.).

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 785fd30f05

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-02-27T02:44:29Z

src/auto-reply/dispatch.ts

+        logVerbose(
+          `dispatch: deduped inbound turn — channel=${channel} external_id=${externalId} account=${finalized.AccountId ?? ""} turn=${pendingReplyId}`,
+        );
+        return { queuedFinal: false, counts: dispatcher.getQueuedCounts() };


Release dispatcher reservations when dedupe skips a turn

When acceptInboundOrSkip returns false, this early return bypasses withReplyDispatcher, so markComplete()/waitForIdle() are never called for that dispatcher instance. Since createReplyDispatcher starts with a reserved pending slot, duplicate inbound deliveries will leave permanently registered dispatchers and inflate getTotalPendingReplies(), which can continuously defer restart/reload logic and leak registry entries over time in channels that redeliver duplicates.

Useful? React with 👍 / 👎.

Fixed in 7e0351e. When acceptInboundOrSkip returns false, we now call dispatcher.markComplete() and await dispatcher.waitForIdle() before returning, so deduped turns release their reservation and unregister from the dispatcher registry. Added an assertion in the dispatch test to cover this path.

chatgpt-codex-connector · 2026-02-27T02:44:29Z

src/infra/message-journal/db.ts

+    CREATE UNIQUE INDEX IF NOT EXISTS idx_inbound_dedup
+      ON inbound_events(channel, account_id, external_id)


Scope inbound dedupe index by peer/thread, not message id alone

This unique index treats (channel, account_id, external_id) as globally unique, but some providers do not guarantee that scope (for example Telegram maps MessageSid to msg.message_id in src/telegram/bot-message-context.ts, where IDs are per chat). Two different chats in the same account can therefore share external_id and the second message will be incorrectly dropped as a duplicate; dedupe needs peer/thread context (like OriginatingTo/thread) or a stored composite key equivalent to buildInboundDedupeKey.

Useful? React with 👍 / 👎.

Fixed in 7e0351e. Replaced the unique index on (channel, account_id, external_id) with one on dedupe_key, which comes from buildInboundDedupeKey and includes peer/thread (OriginatingTo, MessageThreadId, etc.). This correctly scopes dedup for providers like Telegram where message IDs are per-chat. Added a test: "different peers with same message_id are not duplicates".

…validation

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7e0351e154

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-02-27T02:57:35Z

src/gateway/server.impl.ts

+              await deliverOutboundPayloads({
+                cfg: cfgAtStart,
+                channel: row.channel,
+                to: ctx.To ?? "",


Route recovered replies to originating peer

Orphan recovery rebuilds context and then sends recovered output to ctx.To, but inbound contexts can legitimately have To set to the bot endpoint while the actual peer lives in OriginatingTo (for example WhatsApp inbound mapping in src/web/auto-reply/monitor/process-message.ts). In those cases, a recovered turn after crash/restart is delivered to the wrong target (or fails), so the user never receives the reply the recovery path is meant to restore.

Useful? React with 👍 / 👎.

Fixed in 573ab2d. Orphan recovery now uses ctx.OriginatingTo ?? ctx.To ?? "" for the delivery target, so channels where To is the bot endpoint (e.g. WhatsApp) deliver recovered replies to the correct peer.

chatgpt-codex-connector · 2026-02-27T02:57:35Z

src/gateway/server.impl.ts

+
+      // Find turns still marked 'processing' — these are orphans from a crash.
+      // Skip very recent rows (minAgeMs=5s) in case a turn is still actively running.
+      const orphans = findProcessingInbound({ minAgeMs: 5_000 });


Recover fresh orphan turns on immediate restarts

This startup path runs orphan recovery once and filters candidates with minAgeMs: 5_000; if the gateway is restarted quickly after a crash (common with process supervisors), in-flight rows younger than 5s are excluded and never retried in this process. Because those rows remain status='processing' and there is no subsequent orphan-recovery pass, the affected turn can be dropped indefinitely until another restart happens much later.

Useful? React with 👍 / 👎.

Fixed in 573ab2d. Reduced minAgeMs from 5_000 to 0. After a crash, the process is gone, so there’s no overlap. All processing rows are orphans; quick restarts (e.g. process supervisor) now recover in-flight turns instead of skipping them.

…rphans (minAgeMs=0)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 68f9cd5246

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-02-27T05:19:54Z

src/gateway/server.impl.ts

+              });
+            },
+          });
+          await dispatchRecoveredPendingReply({ ctx, cfg: cfgAtStart, dispatcher });


Prevent double replay when outbound retry already exists

This startup path re-dispatches every processing inbound row without checking whether that turn already has a queued outbound retry. A crash between enqueueDelivery(...) and completeInboundTurn(...) leaves both artifacts (queued outbound + processing inbound), so restart will recover the queued outbound message and then replay the full turn again here, producing duplicate user replies. Please gate orphan replay on outbound journal state for the same turn (for example by linking outbound_messages.inbound_id and skipping re-dispatch when queued retries exist).

Useful? React with 👍 / 👎.

Fixed. We now:

Add inbound_id to outbound journal rows when enqueueing (via routeReply → deliverOutboundPayloads → enqueueDelivery).

Run delivery recovery before orphan recovery so queued outbound is delivered first.

Before re-dispatching each orphan, call hasQueuedOutboundForInbound(row.id). If true, skip re-dispatch and log that the queued outbound will deliver.

When delivery recovery acks a delivery, ackDelivery reads inbound_id and calls completeInboundTurn(inbound_id, "delivered").

inboundId is passed from dispatch-from-config and dispatch-acp into all routeReply calls that originate from an inbound turn.

chatgpt-codex-connector · 2026-02-27T05:19:54Z

src/infra/message-journal/outbound.ts

+             (id, channel, account_id, target, payload, queued_at, status, retry_count, last_error)
+           VALUES (?, ?, ?, ?, ?, ?, 'queued', ?, ?)`,


Preserve retry attempt timestamp during queue migration

The migration inserts retry_count but drops entry.lastAttemptAt, even though recovery backoff relies on last_attempt_at and falls back to enqueuedAt when it is missing. For migrated entries that already retried before upgrade, this makes them look older than their real last failure and can trigger immediate retries that bypass the intended backoff window. Include last_attempt_at in the migrated insert so post-upgrade recovery honors existing retry timing.

Useful? React with 👍 / 👎.

Fixed. The migration now includes last_attempt_at in the INSERT, using entry.lastAttemptAt ?? null. Recovery backoff now respects the last attempt time for migrated entries. The test "preserves lastAttemptAt so recovery honors backoff" covers this path.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1b13a48cf6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-02-27T05:58:20Z

src/gateway/server.impl.ts

+      for (const row of orphans) {
+        const outboundStatus = getOutboundStatusForInbound(row.id);
+        if (outboundStatus.queued > 0) {
+          logOrphan.info(
+            `Skipping orphan ${row.id} — queued outbound exists (will deliver without re-dispatch)`,


Avoid replaying orphans that have no outbound journal coverage

This recovery path treats outboundStatus == 0 as safe to replay, but some channel handlers still send replies without deliverOutboundPayloads/enqueueDelivery (for example Slack dispatch uses a direct deliver callback in src/slack/monitor/message-handler/dispatch.ts), so they never create outbound_messages rows. If the process crashes after such a send but before completeInboundTurn runs, restart will see a processing row with no outbound history and call dispatchRecoveredPendingReply again, causing a duplicate user-visible reply. Even after linking inbound_id for journaled sends, this branch still replays non-journaled channels; replay should be gated to turns with reliable outbound journaling (or another persisted post-send marker).

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 07ea6367a2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-02-27T06:14:15Z

src/gateway/server.impl.ts

+            },
+          });
+          await dispatchRecoveredPendingReply({ ctx, cfg: cfgAtStart, dispatcher });
+          completeInboundTurn(row.id, "delivered");


Only mark recovered turn delivered after a successful send

dispatchRecoveredPendingReply can resolve even when provider delivery fails because ReplyDispatcher swallows deliver(...) rejections in its internal .catch(...) path, so this unconditional completeInboundTurn(..., "delivered") can run during startup recovery after transient send errors (for example, API/network outages). In that case the orphan row is finalized as delivered instead of incrementing recovery attempts, and the turn is never retried, causing silent message loss.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 07ea6367a2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-02-27T06:14:16Z

src/gateway/server.impl.ts

+            },
+          });
+          await dispatchRecoveredPendingReply({ ctx, cfg: cfgAtStart, dispatcher });
+          completeInboundTurn(row.id, "delivered");


Only mark recovered turn delivered after a successful send

dispatchRecoveredPendingReply can resolve even when provider delivery fails because ReplyDispatcher swallows deliver(...) rejections in its internal .catch(...) path, so this unconditional completeInboundTurn(..., "delivered") can run during startup recovery after transient send errors (for example, API/network outages). In that case the orphan row is finalized as delivered instead of incrementing recovery attempts, and the turn is never retried, causing silent message loss.

Useful? React with 👍 / 👎.

nohat · 2026-03-01T00:20:13Z

Closing — superseded by the full lifecycle stack:

feat(outbox): migrate delivery queue from file-based to SQLite outbox #29998 (SQLite outbox migration)
feat(outbox): write-ahead outbox with recovery worker and delivery tracking #30009 (write-ahead outbox with recovery worker)
feat(lifecycle): inbound turn tracking with orphan recovery and abort coordination #30011 (inbound turn tracking with orphan recovery)
feat(lifecycle): persistent inbound dedup across gateway restarts #30012 (persistent inbound dedup)

The inbound dedup + orphan recovery goals from this PR are now covered by the turn-tracking and persistent-dedup layers.

openclaw-barnacle bot added gateway Gateway runtime size: XL labels Feb 26, 2026

nohat force-pushed the feat/message-journal branch 5 times, most recently from e142249 to 1b4da1c Compare February 27, 2026 01:21

nohat changed the title ~~infra: replace delivery-queue file store with SQLite message journal (inbound dedup + orphan recovery)~~ fix(gateway): message delivery reliability with SQLite journal (inbound dedup + orphan recovery) Feb 27, 2026

nohat force-pushed the feat/message-journal branch from 1b4da1c to 37aee7a Compare February 27, 2026 01:37

nohat force-pushed the feat/message-journal branch from 37aee7a to 785fd30 Compare February 27, 2026 02:12

nohat marked this pull request as ready for review February 27, 2026 02:38

greptile-apps bot reviewed Feb 27, 2026

View reviewed changes

chatgpt-codex-connector bot reviewed Feb 27, 2026

View reviewed changes

fix(gateway): address PR review — dispatcher leak, dedupe scope, SQL …

7e0351e

…validation

chatgpt-codex-connector bot reviewed Feb 27, 2026

View reviewed changes

nohat added 2 commits February 26, 2026 21:11

fix(gateway): route orphan recovery to OriginatingTo, recover fresh o…

573ab2d

…rphans (minAgeMs=0)

Merge branch 'main' into feat/message-journal

68f9cd5

chatgpt-codex-connector bot reviewed Feb 27, 2026

View reviewed changes

fix(gateway): harden journal recovery invariants

1b13a48

chatgpt-codex-connector bot reviewed Feb 27, 2026

View reviewed changes

fix(gateway): persist post-send marker for orphan replay

07ea636

chatgpt-codex-connector bot reviewed Feb 27, 2026

View reviewed changes

nohat mentioned this pull request Feb 27, 2026

fix(gateway): message delivery reliability with SQLite journal #28400

Closed

22 tasks

nohat marked this pull request as draft February 27, 2026 06:57

github-actions bot mentioned this pull request Feb 27, 2026

🦞 OpenClaw Ecosystem Digest 2026-02-27 softwarecatscomgm/agents-radar#3

Closed

nohat mentioned this pull request Feb 27, 2026

feat(gateway): unified durable message lifecycle — SQLite turns+outbox, continuous workers, plugin compat layer #28941

Closed

18 tasks

nohat closed this Mar 1, 2026

nohat mentioned this pull request Mar 2, 2026

Message Reliability: Durable SQLite Outbox, Recovery Worker, and Unified sendPayload #32063

Open

		CREATE UNIQUE INDEX IF NOT EXISTS idx_inbound_dedup
		ON inbound_events(channel, account_id, external_id)

		(id, channel, account_id, target, payload, queued_at, status, retry_count, last_error)
		VALUES (?, ?, ?, ?, ?, ?, 'queued', ?, ?)`,

Uh oh!

Conversation

nohat commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

User-visible / Behavior Changes

Security Impact (required)

Repro + Verification

Environment

Steps

Expected

Actual

Evidence

Human Verification (required)

Compatibility / Migration

Failure Recovery (if this breaks)

Risks and Mitigations

Uh oh!

greptile-apps bot commented Feb 27, 2026

Greptile Summary

Confidence Score: 4/5

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

nohat Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

nohat Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

nohat Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

nohat Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

nohat Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

nohat Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

nohat Feb 27, 2026

nohat commented Feb 26, 2026 •

edited

Loading