Bug: Gateway enters infinite restart loop after agent timeout (250 retries in 42 minutes)

## Bug Description

After an agent run times out, the OpenClaw Gateway enters a restart loop with no retry limit. In this incident it attempted to restart ~250 times over 42 minutes at a fixed 10-11 second interval, leaving the system completely unresponsive for the duration.

## Environment

- **OpenClaw version**: 2026.2.24
- **OS**: macOS (Darwin kernel 24.5.0, arm64)
- **Node**: v22.22.0
- **Gateway mode**: local (ws://127.0.0.1:18789)
- **Launch method**: LaunchAgent (ai.openclaw.gateway.plist)

## Timeline (2026-02-26)

```
16:08:15 - First anomaly: embedded run timeout (runId=slug-gen-1772093280356)
16:08:47 - Restart loop begins
16:50:27 - Restart loop ends (42 minutes total)
```
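As a sanity check on the numbers above, the attempt count can be derived from the timeline and the observed interval. This is only arithmetic on the reported timestamps; the helper name `toSeconds` is my own.

```typescript
// Convert an HH:MM:SS timestamp into seconds since midnight.
function toSeconds(hms: string): number {
  const [h, m, s] = hms.split(":").map(Number);
  return h * 3600 + m * 60 + s;
}

// Duration of the restart loop, taken from the timeline (16:08:47 to 16:50:27).
const loopSeconds = toSeconds("16:50:27") - toSeconds("16:08:47"); // 2500

// Attempt counts at the two ends of the observed 10-11 second interval.
const attemptsAt10s = Math.floor(loopSeconds / 10); // 250
const attemptsAt11s = Math.floor(loopSeconds / 11); // 227
```

The reported ~250 attempts sits at the 10-second end of that range.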

## Error Log Pattern

```
Gateway failed to start: gateway already running (pid 40836); lock timeout after 5000ms
Port 18789 is already in use.
- pid 40836 tingjing: openclaw-gateway (127.0.0.1:18789)
- Gateway already running locally. Stop it (openclaw gateway stop) or use a different port.
```

Source files: `gateway-cli-CHQJpgpN.js:1009` and `gateway-cli-CK1Ri8SQ.js`

## Root Cause Analysis

1. **Watchdog detection failure**: The agent timeout at 16:08:15 triggered the internal health check, which began restarting the Gateway even though the existing process (pid 40836) was still running and holding port 18789
2. **State detection bug**: The restart logic does not detect that the Gateway is already running, so every attempt fails on the lock (5000ms timeout) and the port check
3. **No backoff**: Restart attempts repeat at a fixed 10-11 second interval, with no exponential backoff
4. **No max retry limit**: The loop continued for 42 minutes (~250 attempts) with no upper bound

## Impact

- Gateway completely unresponsive for 42 minutes
- No Feishu/Lark messages could be processed during the outage
- Log file bloat (gateway.err.log reached 156MB)
- System resources wasted on ~250 failed start attempts

## Suggested Fixes

### Short-term
1. Add running-state detection in `gateway start` — if already running, return success immediately
2. Add exponential backoff (1s, 2s, 4s, 8s, 16s, 30s, 60s...)
3. Add max retry limit (e.g., stop after 10 attempts)
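The three short-term fixes can be combined into one small decision function. The following is a minimal sketch, not the actual OpenClaw restart code: `StartDecision`, `backoffDelayMs`, `decideNextStep` and the constants are all hypothetical names. It uses pure doubling capped at 60s, so the sixth delay is 32s rather than the 30s listed above.

```typescript
// Hypothetical types and constants; none of these names come from OpenClaw.
type StartDecision =
  | { kind: "already-running" } // treat as success, do not retry
  | { kind: "retry"; delayMs: number } // wait, then try again
  | { kind: "give-up" }; // stop and surface the failure

const BASE_DELAY_MS = 1000;
const MAX_DELAY_MS = 60000;
const MAX_ATTEMPTS = 10;

// Delay after the given number of failed attempts (1-based):
// 1s, 2s, 4s, 8s, 16s, 32s, then capped at 60s.
function backoffDelayMs(failedAttempts: number): number {
  const exponent = Math.max(0, failedAttempts - 1);
  return Math.min(BASE_DELAY_MS * 2 ** exponent, MAX_DELAY_MS);
}

// Decide what to do after a start attempt did not bring up a new process.
function decideNextStep(
  gatewayAlreadyRunning: boolean,
  failedAttempts: number,
): StartDecision {
  // Fix 1: an already-running Gateway is a success, not an error to retry.
  if (gatewayAlreadyRunning) return { kind: "already-running" };
  // Fix 3: stop once the retry budget is exhausted.
  if (failedAttempts >= MAX_ATTEMPTS) return { kind: "give-up" };
  // Fix 2: otherwise wait progressively longer between attempts.
  return { kind: "retry", delayMs: backoffDelayMs(failedAttempts) };
}
```

With this policy, the "gateway already running (pid 40836)" case from the error log would end the loop on the first attempt instead of being retried every 10-11 seconds.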

### Long-term
1. Implement proper health check (HTTP endpoint or WebSocket ping/pong)
2. Use `openclaw gateway status` in watchdog logic
3. Consider delegating restarts to the service manager (launchd `KeepAlive` on macOS, systemd `Restart=` on Linux) instead of custom restart logic
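Whichever probe is used (HTTP endpoint, WebSocket ping/pong, or `openclaw gateway status`), the watchdog should probably require several consecutive failures before restarting, so that a single agent timeout does not trigger a restart. A rough sketch of that idea follows. `HealthTracker` is a hypothetical name, and the probe itself is left out and represented only by a boolean result.

```typescript
// Hypothetical watchdog state; the probe is supplied by the caller as a boolean.
class HealthTracker {
  private consecutiveFailures = 0;
  private readonly failureThreshold: number;

  constructor(failureThreshold: number) {
    this.failureThreshold = failureThreshold;
  }

  // Record one probe result and report what the watchdog should do.
  record(probeSucceeded: boolean): "healthy" | "degraded" | "restart" {
    if (probeSucceeded) {
      this.consecutiveFailures = 0;
      return "healthy";
    }
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) {
      // Reset so the next restart needs a full run of failures again.
      this.consecutiveFailures = 0;
      return "restart";
    }
    return "degraded";
  }
}
```

For example, with `new HealthTracker(3)`, two failed probes followed by a success return to `"healthy"` without any restart.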

## Log File Locations

- Main log: `/tmp/openclaw/openclaw-YYYY-MM-DD.log`
- Error log: `~/.openclaw/logs/gateway.err.log`
- Gateway log: `~/.openclaw/logs/gateway.log`

## Workaround

I built an external watchdog script (`gateway-watchdog-v2.sh`) that:
- Health-checks every 10 seconds
- Detects restart loops and backs off
- Attempts auto-repair via Codex CLI for config issues
- Falls back to config rollback if repair fails
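The restart-loop detection step can be expressed as a sliding-window counter. The sketch below shows the idea only. It is not the contents of `gateway-watchdog-v2.sh`, and the class name and thresholds are illustrative assumptions.

```typescript
// Hypothetical sliding-window detector; thresholds are illustrative.
class RestartLoopDetector {
  private restartTimesMs: number[] = [];
  private readonly windowMs: number;
  private readonly maxRestarts: number;

  constructor(windowMs: number, maxRestarts: number) {
    this.windowMs = windowMs;
    this.maxRestarts = maxRestarts;
  }

  // Record a restart at `nowMs` and return true when the number of restarts
  // inside the trailing window exceeds the allowed maximum.
  recordRestart(nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    this.restartTimesMs = this.restartTimesMs.filter((t) => t > cutoff);
    this.restartTimesMs.push(nowMs);
    return this.restartTimesMs.length > this.maxRestarts;
  }
}
```

With a 60-second window and a limit of 3 restarts, the 10-second cadence seen in this incident would be flagged on the fourth restart, about 30 seconds into the loop.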

