Message runs interrupted by network errors are not retried, causing silent message loss

## Summary

When a message-processing run is interrupted by a network error (e.g., a TLS connection failure), the run is silently dropped and never retried. The user's message goes unanswered, with no notification and no recovery.

## Steps to Reproduce

1. Send a message via Telegram (or any channel)
2. While the agent is processing (making API calls to the model provider), a network error occurs
3. The in-progress run is interrupted and lost
4. The Telegram channel restarts, but the message is never retried

## Observed Behavior

From gateway logs on 2026-02-05:

```
00:32:27 - embedded run start: runId=15f58535... sessionId=cdd2ca67... messageChannel=telegram
00:32:40 - embedded run start: runId=6db4c19d... sessionId=ba705afb... messageChannel=telegram
00:33:02 - [openclaw] Uncaught exception: TypeError: Cannot read properties of null (reading 'setSession')
           at TLSSocket.setSession (node:_tls_wrap:1132:16)
           at Object.connect (node:_tls_wrap:1826:13)
           at Client.connect (.../undici@7.20.0/node_modules/undici/lib/core/connect.js:70:20)
00:33:09 - [default] starting provider  (Telegram channel restarted)
```

**Key observations:**
- Two message runs were in progress when the TLS error occurred
- Neither run has a corresponding `run_completed` log entry
- The Telegram channel restarted automatically, but the two interrupted runs were never retried
- User messages went unanswered for 10+ minutes, until the gateway was restarted manually

## Expected Behavior

1. When a run fails due to a transient network error, it should be retried (with exponential backoff)
2. If retry fails after N attempts, the message should be moved to a dead-letter queue
3. The user should receive a notification that their message couldn't be processed
4. At minimum, a warning should be logged when runs fail without completion

## Proposed Solutions

### Option 1: Automatic Retry
- Track in-progress runs with their original message payload
- On network error, re-enqueue the message for retry
- Use exponential backoff (e.g., 1s, 2s, 4s, max 3 retries)
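
A minimal sketch of the retry loop described above, in TypeScript. All names here (`runWithRetry`, `backoffDelayMs`, `RetryOptions`, `isTransient`) are hypothetical and are not existing OpenClaw APIs. It also assumes the network error reaches the caller as a rejected promise, which is not the case for the uncaught exception in the log above.

```typescript
// Hypothetical sketch: none of these names exist in OpenClaw.
type RunFn<T> = (attempt: number) => Promise<T>;

interface RetryOptions {
  maxRetries: number; // retries allowed after the first attempt (e.g., 3)
  baseDelayMs: number; // delay before the first retry (e.g., 1000)
  isTransient: (err: unknown) => boolean; // decides whether an error is worth retrying
  sleep?: (ms: number) => Promise<void>; // injectable so tests do not have to wait
}

// Delay before retry number `retry` (1-based): 1s, 2s, 4s when baseDelayMs is 1000.
function backoffDelayMs(retry: number, baseDelayMs: number): number {
  return baseDelayMs * 2 ** (retry - 1);
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `run` once, then retries up to `maxRetries` times on transient errors.
// The last error is rethrown so the caller can log it or dead-letter the message.
async function runWithRetry<T>(run: RunFn<T>, opts: RetryOptions): Promise<T> {
  const sleep = opts.sleep ?? defaultSleep;
  for (let attempt = 0; ; attempt++) {
    try {
      return await run(attempt);
    } catch (err) {
      if (attempt >= opts.maxRetries || !opts.isTransient(err)) throw err;
      await sleep(backoffDelayMs(attempt + 1, opts.baseDelayMs));
    }
  }
}
```

Because this loop keeps the message payload in memory, it would not survive a process restart on its own. Covering that case would need the payload to be persisted before the run starts.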

### Option 2: Dead-Letter Queue
- Failed messages are stored in a persistent queue
- The agent can be configured to notify the user of failed messages
- An admin can manually inspect or retry failed messages
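
The shape of such a queue could look roughly like the following TypeScript sketch. The `DeadLetter` fields and the `DeadLetterQueue` class are hypothetical. The store here is in-memory only to keep the example self-contained; a real implementation would need to write entries to disk or a database to be persistent.

```typescript
// Hypothetical sketch: field and class names are illustrative, not OpenClaw APIs.
interface DeadLetter {
  messageId: string;
  channel: string; // e.g., "telegram"
  payload: string; // original message, kept so it can be retried later
  error: string; // last error seen before giving up
  attempts: number;
  failedAt: number; // epoch milliseconds
}

class DeadLetterQueue {
  // In-memory stand-in for persistent storage (an assumption for brevity).
  private entries = new Map<string, DeadLetter>();
  private notify?: (entry: DeadLetter) => void;

  constructor(notify?: (entry: DeadLetter) => void) {
    this.notify = notify;
  }

  // Record a failed message and, if configured, notify the user.
  add(entry: DeadLetter): void {
    this.entries.set(entry.messageId, entry);
    if (this.notify) this.notify(entry);
  }

  // Let an admin inspect everything that failed.
  list(): DeadLetter[] {
    return Array.from(this.entries.values());
  }

  // Remove an entry and hand it back so the caller can re-enqueue it.
  takeForRetry(messageId: string): DeadLetter | undefined {
    const entry = this.entries.get(messageId);
    if (entry) this.entries.delete(messageId);
    return entry;
  }
}
```

The optional `notify` callback is where a "your message couldn't be processed" reply would be sent back over the originating channel.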

### Option 3: Health-Check Recovery
- Periodically check for runs that started but never completed
- If a run has been "stuck" for more than N minutes, attempt recovery
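
As a rough illustration, the stuck-run check could be a pure function over run records, sketched here in TypeScript. `RunRecord` and `findStuckRuns` are hypothetical names, and the sketch assumes the gateway records a start time and a completion time for each run.

```typescript
// Hypothetical sketch: assumes each run records when it started and completed.
interface RunRecord {
  runId: string;
  startedAt: number; // epoch milliseconds
  completedAt?: number; // undefined while the run is in progress (or lost)
}

// Returns runs that started more than `maxAgeMs` ago and never completed.
function findStuckRuns(runs: RunRecord[], now: number, maxAgeMs: number): RunRecord[] {
  return runs.filter(
    (run) => run.completedAt === undefined && now - run.startedAt > maxAgeMs
  );
}
```

A periodic timer could call `findStuckRuns(runs, Date.now(), N * 60_000)` and, for each result, log a warning and re-enqueue or dead-letter the original message.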

## Environment

- OpenClaw version: 2026.2.2
- Platform: macOS (Darwin 25.2.0)
- Node.js: 22.22.0
- Channel: Telegram (polling mode)
- Model provider: Custom Anthropic proxy

## Workaround

Currently, the only workaround is to manually restart the gateway when messages go unanswered:

```bash
launchctl kickstart -k gui/$UID/ai.openclaw.gateway
```

## Impact

- **Severity:** High - users lose messages with no indication of failure
- **Frequency:** Rare (TLS errors are uncommon) but impactful when they occur
- **User experience:** Very poor - messages silently disappear
