[Bug]: Delivery Queue Retries Permanently-Failed Entries Indefinitely #23777
Closed as not planned
Labels
bug, stale
Description
Summary
The delivery queue retries failed entries on every gateway restart, even when the failure is permanent and non-recoverable. This causes two problems: (1) stale messages accumulate indefinitely, and (2) previously-delivered messages can be re-sent as duplicates after a restart.
Steps to reproduce
- Have a cron job or agent task attempt to send a message via Teams (msteams channel) without a stored conversation reference.
- The send fails immediately with: `No conversation reference found for user:<id>`
- The entry is written to the delivery queue.
- Gateway restarts (upgrade, crash, or manual restart).
- On startup: "Found N pending delivery entries — starting recovery."
- Recovery hits its time budget: "Recovery time budget exceeded — N entries deferred to next restart."
- Repeat indefinitely — same entries roll forward on every subsequent restart.
Expected behavior
- Permanent failures should be discarded, not retried. Errors such as `No conversation reference found` or `chat not found` are structural: retrying will never succeed. These should be moved to a `failed/` archive or discarded after 1–2 attempts.
- Delivery should be idempotent. If an entry is being recovered after a restart, the system should confirm it wasn't already delivered before re-sending. A delivery ID or sent-at timestamp could prevent duplicate sends.
- Recovery time budget exceeded should discard, not defer. If the budget is consistently exceeded, deferring the same entries forever compounds the problem.
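
To illustrate the first point, here is a minimal sketch of how permanent failures might be told apart from transient ones. `classifyFailure`, `FailureKind`, and `PERMANENT_ERROR_PATTERNS` are hypothetical names, not existing OpenClaw APIs; the two patterns are taken from the error messages quoted in this report, and a real implementation would probably prefer structured error codes over string matching.

```typescript
// Hypothetical classifier (not an existing OpenClaw API).
// The patterns are the two permanent errors quoted in this issue.
type FailureKind = "permanent" | "transient";

const PERMANENT_ERROR_PATTERNS: RegExp[] = [
  /No conversation reference found/i, // Teams: no stored conversation reference
  /chat not found/i,                  // Telegram: unknown chat_id
];

// Returns "permanent" when the message matches a known structural error,
// otherwise "transient" (assumed to be worth retrying).
function classifyFailure(errorMessage: string): FailureKind {
  return PERMANENT_ERROR_PATTERNS.some((pattern) => pattern.test(errorMessage))
    ? "permanent"
    : "transient";
}
```

Anything the classifier does not recognize is treated as transient, so an unknown error would still be retried rather than silently dropped.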
Actual behavior
- 15 entries accumulated in `~/.openclaw/delivery-queue/` over 5 days.
- Entries from Feb 17–21 were still present and being deferred on Feb 22.
- Two messages that had already been successfully delivered via an alternate path were re-delivered as duplicates after a gateway restart (recipient received the same message twice).
- Gateway log showed the same count ("Found 12 pending delivery entries") across 4+ separate restart cycles, never resolving.
OpenClaw version
2026.2.21-2
Operating system
Mac M4, macOS Tahoe 26.3
Install method
npm and Mac app
Logs, screenshots, and evidence
**Gateway startup (delivery-recovery subsystem):**
[delivery-recovery] Found 12 pending delivery entries — starting recovery
[delivery-recovery] Recovery time budget exceeded — 12 entries deferred to next restart
*(Observed at 2026-02-21T15:58Z, 15:59Z, 20:34Z, 20:38Z — same count each time)*
**Example stuck entry error (Teams):**
No conversation reference found for user:XXXX.
The bot can only send proactive messages to users who have previously messaged it.
**Example stuck entry error (Telegram):**
Telegram send failed: chat not found (chat_id=user:XXXX)
Impact and severity
- Duplicate messages sent to end users/recipients after gateway restarts.
- Delivery queue grows unbounded with unresolvable entries.
- Each gateway restart carries performance overhead from attempting to recover a stale queue.
Additional information
Suggested Fix
- Add a `maxRetries` threshold per delivery entry (e.g., 3). After N failures, move to `delivery-queue/failed/` and stop retrying.
- Before re-sending a recovered entry, check if a `sentAt` timestamp exists on the entry. If so, skip re-delivery.
- Classify error types: permanent (discard after 1 attempt) vs. transient (retry with backoff).
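
The suggested fix could look roughly like the sketch below. It is only an illustration under assumptions: the `QueueEntry` shape, the `attempts` and `permanentFailure` fields, `decideRecoveryAction`, and `backoffMs` are hypothetical and do not reflect the actual on-disk format of entries in the delivery queue. Only `maxRetries` and `sentAt` come from the proposal above.

```typescript
// Hypothetical entry shape; the real delivery-queue format may differ.
interface QueueEntry {
  id: string;
  attempts: number;          // failed delivery attempts so far
  sentAt?: string;           // ISO timestamp, set once a send succeeded
  permanentFailure: boolean; // true if the last error was classified as permanent
}

type RecoveryAction = "skip" | "archive" | "retry";

// Decide what recovery should do with one entry after a restart.
function decideRecoveryAction(entry: QueueEntry, maxRetries: number = 3): RecoveryAction {
  // Already delivered: never re-send (prevents duplicates).
  if (entry.sentAt !== undefined) return "skip";
  // Structural error: retrying cannot succeed, so move it to failed/.
  if (entry.permanentFailure) return "archive";
  // Transient error, but the retry budget is used up.
  if (entry.attempts >= maxRetries) return "archive";
  return "retry";
}

// Exponential backoff for transient failures, capped at capMs.
function backoffMs(attempts: number, baseMs: number = 1000, capMs: number = 60000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempts));
}
```

Checking `sentAt` first means an entry that was delivered but never removed from the queue is skipped regardless of its error state, which addresses the duplicate-send case described under "Actual behavior".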