[Bug]: Delivery Queue Retries Permanently-Failed Entries Indefinitely #23777

@chadmbrown

Summary

The delivery queue retries failed entries on every gateway restart, even when the failure is permanent and non-recoverable. This causes two problems: (1) stale messages accumulate indefinitely, and (2) previously-delivered messages can be re-sent as duplicates after a restart.

Steps to reproduce

  1. Have a cron job or agent task attempt to send a message via Teams (msteams channel) without a stored conversation reference.
  2. The send fails immediately with: No conversation reference found for user:<id>
  3. Entry is written to the delivery queue.
  4. Gateway restarts (upgrade, crash, or manual restart).
  5. On startup: "Found N pending delivery entries — starting recovery."
  6. Recovery hits its time budget: "Recovery time budget exceeded — N entries deferred to next restart."
  7. Repeat indefinitely — same entries roll forward on every subsequent restart.

Expected behavior

  1. Permanent failures should be discarded, not retried. Errors such as "No conversation reference found" or "chat not found" are structural: retrying will never succeed. These should be moved to a failed/ archive or discarded after 1–2 attempts.
  2. Delivery should be idempotent. If an entry is being recovered after a restart, the system should confirm it wasn't already delivered before re-sending. A delivery ID or sent-at timestamp could prevent duplicate sends.
  3. Exceeding the recovery time budget should discard entries, not defer them. If the budget is consistently exceeded, deferring the same entries forever compounds the problem.
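
To make points 2 and 3 concrete, here is a minimal sketch of a recovery pass that skips entries already recorded as delivered and discards entries that have been deferred too many times. It is not OpenClaw's actual code: `recoverOnce`, `deliveryId`, `deferrals`, `maxDeferrals`, and `wasDelivered` are hypothetical names, and `send` is synchronous here only to keep the example short.

```typescript
// Hypothetical entry shape; field names are assumptions for illustration.
interface PendingEntry {
  deliveryId: string; // stable ID assigned when the entry is first queued
  deferrals: number;  // how many recovery passes have already deferred it
}

interface RecoveryResult {
  sent: string[];
  skipped: string[];   // already delivered, nothing re-sent
  deferred: string[];  // budget exceeded, will be tried on the next restart
  discarded: string[]; // budget exceeded too many times, given up
}

function recoverOnce(
  entries: PendingEntry[],
  wasDelivered: (deliveryId: string) => boolean,
  send: (entry: PendingEntry) => void,
  budgetMs: number,
  now: () => number,
  maxDeferrals = 2,
): RecoveryResult {
  const result: RecoveryResult = { sent: [], skipped: [], deferred: [], discarded: [] };
  const deadline = now() + budgetMs;

  for (const entry of entries) {
    // Idempotency check first, so delivered entries leave the queue
    // even when the time budget is already spent.
    if (wasDelivered(entry.deliveryId)) {
      result.skipped.push(entry.deliveryId);
      continue;
    }
    if (now() >= deadline) {
      if (entry.deferrals >= maxDeferrals) {
        result.discarded.push(entry.deliveryId);
      } else {
        entry.deferrals += 1; // caller is expected to persist the new count
        result.deferred.push(entry.deliveryId);
      }
      continue;
    }
    send(entry);
    result.sent.push(entry.deliveryId);
  }
  return result;
}
```

With a cap on deferrals, the "N entries deferred to next restart" count would shrink over successive restarts instead of staying constant.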

Actual behavior

  • 15 entries accumulated in ~/.openclaw/delivery-queue/ over 5 days.
  • Entries from Feb 17–21 were still present and being deferred on Feb 22.
  • Two messages that had already been successfully delivered via an alternate path were re-delivered as duplicates after a gateway restart (recipient received the same message twice).
  • Gateway log showed the same count ("Found 12 pending delivery entries") across 4+ separate restart cycles, never resolving.

OpenClaw version

2026.2.21-2

Operating system

Mac M4, macOS Tahoe 26.3

Install method

npm and Mac app

Logs, screenshots, and evidence

**Gateway startup (delivery-recovery subsystem):**
[delivery-recovery] Found 12 pending delivery entries — starting recovery
[delivery-recovery] Recovery time budget exceeded — 12 entries deferred to next restart
*(Observed at 2026-02-21T15:58Z, 15:59Z, 20:34Z, 20:38Z — same count each time)*

**Example stuck entry error (Teams):**
No conversation reference found for user:XXXX.
The bot can only send proactive messages to users who have previously messaged it.

**Example stuck entry error (Telegram):**
Telegram send failed: chat not found (chat_id=user:XXXX)

Impact and severity

  • Duplicate messages sent to end users/recipients after gateway restarts.
  • Delivery queue grows unbounded with unresolvable entries.
  • Each gateway restart carries performance overhead from attempting to recover a stale queue.

Additional information

Suggested Fix

  • Add a maxRetries threshold per delivery entry (e.g., 3). After N failures, move to delivery-queue/failed/ and stop retrying.
  • Before re-sending a recovered entry, check if a sentAt timestamp exists on the entry. If so, skip re-delivery.
  • Classify error types: permanent (discard after 1 attempt) vs. transient (retry with backoff).
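
A rough sketch of how the three suggestions could fit together in one decision function. This is an illustration under assumptions, not the existing implementation: only `maxRetries` and `sentAt` come from the suggestions above, while `QueueEntry`, `retryCount`, `lastError`, `classifyError`, `decideAction`, and `backoffMs` are hypothetical names. The permanent-error patterns are taken from the two error messages quoted in this report.

```typescript
type ErrorKind = "permanent" | "transient";
type Action = "skip" | "archive" | "retry";

// Hypothetical entry shape; only sentAt is named in the suggestions above.
interface QueueEntry {
  id: string;
  retryCount: number;
  lastError?: string;
  sentAt?: number; // epoch ms, set once a send has succeeded
}

// Messages that can never succeed on retry (from the logs in this report).
const PERMANENT_PATTERNS: RegExp[] = [
  /no conversation reference found/i,
  /chat not found/i,
];

function classifyError(message: string): ErrorKind {
  return PERMANENT_PATTERNS.some((p) => p.test(message)) ? "permanent" : "transient";
}

function decideAction(entry: QueueEntry, maxRetries = 3): Action {
  // Already sent: never re-deliver.
  if (entry.sentAt !== undefined) return "skip";
  // Permanent failure: move to delivery-queue/failed/ after the first attempt.
  if (entry.lastError !== undefined && classifyError(entry.lastError) === "permanent") {
    return "archive";
  }
  // Transient failure: give up once the retry threshold is reached.
  if (entry.retryCount >= maxRetries) return "archive";
  return "retry";
}

// Exponential backoff for transient failures, capped at capMs.
function backoffMs(retryCount: number, baseMs = 5000, capMs = 300000): number {
  return Math.min(capMs, baseMs * 2 ** retryCount);
}
```

Matching on message text is brittle; if the channel adapters expose structured error codes, classifying on those would be a safer choice.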

Labels

bug (Something isn't working), stale (Marked as stale due to inactivity)