Message runs interrupted by network errors are not retried, causing silent message loss #9208

@MuhsinunC

Summary

When a message processing run is interrupted by a network error (e.g., TLS connection failure), the run is silently dropped and never retried. The user's message goes unanswered with no notification or recovery.

Steps to Reproduce

  1. Send a message via Telegram (or any channel)
  2. While the agent is processing (making API calls to the model provider), a network error occurs
  3. The in-progress run is interrupted and lost
  4. The Telegram channel restarts, but the message is never retried

Observed Behavior

From gateway logs on 2026-02-05:

00:32:27 - embedded run start: runId=15f58535... sessionId=cdd2ca67... messageChannel=telegram
00:32:40 - embedded run start: runId=6db4c19d... sessionId=ba705afb... messageChannel=telegram
00:33:02 - [openclaw] Uncaught exception: TypeError: Cannot read properties of null (reading 'setSession')
           at TLSSocket.setSession (node:_tls_wrap:1132:16)
           at Object.connect (node:_tls_wrap:1826:13)
           at Client.connect (.../undici@<version>/node_modules/undici/lib/core/connect.js:70:20)
00:33:09 - [default] starting provider  (Telegram channel restarted)

Key observations:

  • Two message runs were in progress when the TLS error occurred
  • Neither run has a corresponding run_completed log entry
  • The Telegram channel restarted automatically
  • The two interrupted runs were never retried after the restart
  • User messages went unanswered for 10+ minutes, until the gateway was restarted manually

Expected Behavior

  1. When a run fails due to a transient network error, it should be retried (with exponential backoff)
  2. If retry fails after N attempts, the message should be moved to a dead-letter queue
  3. User should receive a notification that their message couldn't be processed
  4. At minimum, a warning should be logged when runs fail without completion

Proposed Solutions

Option 1: Automatic Retry

  • Track in-progress runs with their original message payload
  • On network error, re-enqueue the message for retry
  • Use exponential backoff (e.g., 1s, 2s, 4s, max 3 retries)
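
A minimal sketch of Option 1, assuming the network error surfaces as a rejection of the run's promise. The log above shows an uncaught exception instead, so the real fix may also need to route that error back to the owning run. All names here (`runWithRetry`, `RetryOptions`, `isTransient`) are hypothetical and not taken from the OpenClaw codebase.

```typescript
// Hypothetical retry helper; none of these names come from OpenClaw.
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

interface RetryOptions {
  maxRetries: number; // retries after the first attempt (3 in the proposal)
  baseDelayMs: number; // first delay (1000 ms in the proposal)
  isTransient: (err: unknown) => boolean; // assumption: caller classifies errors
  sleep?: Sleep; // injectable so tests do not have to wait
}

// Delay before retry number `attempt` (0-based): base, 2*base, 4*base, ...
function backoffDelayMs(attempt: number, baseDelayMs: number): number {
  return baseDelayMs * Math.pow(2, attempt);
}

async function runWithRetry<T>(
  run: () => Promise<T>,
  opts: RetryOptions,
): Promise<T> {
  const sleep = opts.sleep ?? realSleep;
  for (let attempt = 0; ; attempt++) {
    try {
      return await run();
    } catch (err) {
      // Give up on non-transient errors or once the retry budget is spent.
      if (attempt >= opts.maxRetries || !opts.isTransient(err)) throw err;
      await sleep(backoffDelayMs(attempt, opts.baseDelayMs));
    }
  }
}
```

With `maxRetries: 3` and `baseDelayMs: 1000`, a run that keeps failing is attempted four times in total, with waits of 1s, 2s, and 4s between attempts. The error thrown after the last attempt is the point where a dead-letter queue or user notification could take over.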

Option 2: Dead-Letter Queue

  • Failed messages are stored in a persistent queue
  • Agent can be configured to notify user of failed messages
  • Admin can manually retry or inspect failed messages
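
Option 2 could look roughly like the sketch below. It keeps entries in memory only to stay short; the proposal calls for a persistent queue, so a real version would write to disk or a database. `FailedMessage`, `DeadLetterQueue`, and the `notify` callback are hypothetical names, not existing OpenClaw APIs.

```typescript
// Hypothetical shapes; not taken from the OpenClaw codebase.
interface FailedMessage {
  messageId: string;
  channel: string; // e.g. "telegram"
  payload: string; // original user message
  error: string; // last error seen
  attempts: number; // attempts made before giving up
  failedAtMs: number; // epoch milliseconds
}

class DeadLetterQueue {
  // In-memory only; persistence is left out of this sketch.
  private readonly entries = new Map<string, FailedMessage>();
  private readonly notify: ((msg: FailedMessage) => void) | undefined;

  constructor(notify?: (msg: FailedMessage) => void) {
    this.notify = notify;
  }

  // Store a failed message and optionally notify the user or an admin.
  add(msg: FailedMessage): void {
    this.entries.set(msg.messageId, msg);
    this.notify?.(msg);
  }

  // Inspect everything currently parked in the queue.
  list(): FailedMessage[] {
    return Array.from(this.entries.values());
  }

  // Remove and return a message so it can be retried manually.
  take(messageId: string): FailedMessage | undefined {
    const msg = this.entries.get(messageId);
    if (msg !== undefined) this.entries.delete(messageId);
    return msg;
  }
}
```

The `notify` callback is where the "message couldn't be processed" reply from Expected Behavior would be sent, and `list` and `take` cover the admin inspect and manual retry cases.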

Option 3: Health-Check Recovery

  • Periodically check for runs that started but never completed
  • If a run is "stuck" for > N minutes, attempt recovery
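
As a rough illustration of Option 3, the tracker below records each run at start, forgets it on completion, and reports runs older than a threshold. `RunTracker` and its methods are hypothetical, the run IDs in any usage are placeholders, and the clock is passed in explicitly so the check is easy to test.

```typescript
// Hypothetical tracker; not part of the OpenClaw codebase.
interface RunRecord {
  runId: string;
  sessionId: string;
  startedAtMs: number; // epoch milliseconds
}

class RunTracker {
  private readonly active = new Map<string, RunRecord>();

  // Call where the "embedded run start" log line is emitted.
  start(runId: string, sessionId: string, nowMs: number): void {
    this.active.set(runId, { runId, sessionId, startedAtMs: nowMs });
  }

  // Call when the run finishes, successfully or not.
  complete(runId: string): void {
    this.active.delete(runId);
  }

  // Runs that started more than maxAgeMs ago and never completed.
  findStuck(nowMs: number, maxAgeMs: number): RunRecord[] {
    return Array.from(this.active.values()).filter(
      (run) => nowMs - run.startedAtMs > maxAgeMs,
    );
  }
}
```

A periodic timer could call `findStuck(Date.now(), N * 60 * 1000)` and, for each result, log a warning and hand the run to whatever recovery path is chosen. Even without recovery, this would satisfy the minimum expectation that incomplete runs are logged.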

Environment

  • OpenClaw version: 2026.2.2
  • Platform: macOS (Darwin 25.2.0)
  • Node.js: 22.22.0
  • Channel: Telegram (polling mode)
  • Model provider: Custom Anthropic proxy

Workaround

Currently, the only workaround is to manually restart the gateway when messages go unanswered:

launchctl kickstart -k gui/$UID/ai.openclaw.gateway

Impact

  • Severity: High - users lose messages with no indication of failure
  • Frequency: Rare (TLS errors are uncommon) but impactful when they occur
  • User experience: Very poor - messages silently disappear

Metadata

    Labels

    bug: Something isn't working
    stale: Marked as stale due to inactivity
