Bug: Gateway enters infinite restart loop after agent timeout (250 retries in 42 minutes) #27590
Closed
Description
Bug Description
OpenClaw Gateway enters an infinite restart loop after an agent run timeout, attempting to restart ~250 times over 42 minutes with a fixed 10-11 second interval, rendering the system completely unresponsive.
Environment
- OpenClaw version: 2026.2.24
- OS: macOS 24.5.0 (arm64)
- Node: v22.22.0
- Gateway mode: local (ws://127.0.0.1:18789)
- Launch method: LaunchAgent (ai.openclaw.gateway.plist)
Timeline (2026-02-26)
16:08:15 - First anomaly: embedded run timeout (runId=slug-gen-1772093280356)
16:08:47 - Restart loop begins
16:50:27 - Restart loop ends (42 minutes total)
Error Log Pattern
Gateway failed to start: gateway already running (pid 40836); lock timeout after 5000ms
Port 18789 is already in use.
- pid 40836 tingjing: openclaw-gateway (127.0.0.1:18789)
- Gateway already running locally. Stop it (openclaw gateway stop) or use a different port.
Source files: gateway-cli-CHQJpgpN.js:1009 and gateway-cli-CK1Ri8SQ.js
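The first line of this pattern already names the process that holds the lock, so a supervisor could treat it as "gateway is up" rather than as a failed start. A minimal sketch of that check, assuming the message format stays exactly as quoted above; the function names are hypothetical and not part of OpenClaw:

```python
import re

# Matches the message quoted in this issue, e.g.
# "Gateway failed to start: gateway already running (pid 40836); lock timeout after 5000ms"
ALREADY_RUNNING = re.compile(r"gateway already running \(pid (\d+)\)")


def running_pid_from_error(line):
    """Return the pid named in an 'already running' error line, or None."""
    match = ALREADY_RUNNING.search(line)
    return int(match.group(1)) if match else None


def should_retry_start(line):
    """Hypothetical policy: do not retry when another instance holds the lock."""
    return running_pid_from_error(line) is None
```

With the line from this report, `running_pid_from_error` yields 40836 and `should_retry_start` is false, which would have stopped the loop at the first attempt.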
Root Cause Analysis
- Watchdog detection failure: After an agent timeout at 16:08:15, the internal health check mechanism was triggered
- State detection bug: The restart logic does not correctly detect that Gateway is already running
- No backoff: Fixed 10-11 second interval between restart attempts, no exponential backoff
- No max retry limit: Loop continued for 42 minutes (~250 attempts) with no upper bound
Impact
- Gateway completely unresponsive for 42 minutes
- All Feishu/Lark messages unprocessable
- Log file bloat (gateway.err.log reached 156MB)
- System resource waste
Suggested Fixes
Short-term
- Add running-state detection in gateway start: if already running, return success immediately
- Add exponential backoff (1s, 2s, 4s, 8s, 16s, 30s, 60s...)
- Add max retry limit (e.g., stop after 10 attempts)
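The backoff and retry-limit items can be sketched together. This is an illustration only, not the gateway's actual restart code: `try_start` is a hypothetical callable that returns True once the gateway is up, and the schedule uses plain doubling capped at 60s, so it produces 32s where the list above says 30s.

```python
import time


def backoff_delays(base=1.0, factor=2.0, cap=60.0, max_attempts=10):
    """Yield one capped, exponentially growing delay (seconds) per attempt."""
    delay = base
    for _ in range(max_attempts):
        yield min(delay, cap)
        delay *= factor


def restart_with_backoff(try_start, sleep=time.sleep, max_attempts=10):
    """Call try_start() until it succeeds or max_attempts is reached.

    Returns the 1-based attempt number that succeeded, or None if all failed.
    """
    delays = backoff_delays(max_attempts=max_attempts)
    for attempt, delay in enumerate(delays, start=1):
        if try_start():
            return attempt
        if attempt < max_attempts:
            sleep(delay)  # wait longer after each failure; no wait after the last
    return None
```

With the defaults, ten failed attempts spend about five minutes waiting in total and then give up, instead of retrying every 10-11 seconds for 42 minutes.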
Long-term
- Implement proper health check (HTTP endpoint or WebSocket ping/pong)
- Use openclaw gateway status in watchdog logic
- Consider leveraging systemd/LaunchAgent KeepAlive instead of custom restart logic
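As a rough starting point for the health check, a watchdog could probe the gateway's listen address before deciding to restart anything. The sketch below only opens a TCP connection to the address from the Environment section; it is weaker than the WebSocket ping/pong suggested above, because it shows that something is listening, not that the gateway is responsive. The function name is hypothetical.

```python
import socket


def gateway_port_open(host="127.0.0.1", port=18789, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

If this returns True, the watchdog should skip the start attempt entirely, since a second start can only fail on the lock.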
Log File Locations
- Main log: /tmp/openclaw/openclaw-YYYY-MM-DD.log
- Error log: ~/.openclaw/logs/gateway.err.log
- Gateway log: ~/.openclaw/logs/gateway.log
Workaround
I built an external watchdog script (gateway-watchdog-v2.sh) that:
- Health-checks every 10 seconds
- Detects restart loops and backs off
- Attempts auto-repair via Codex CLI for config issues
- Falls back to config rollback if repair fails
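For anyone who wants something similar, the loop-detection part can be approximated with a sliding window over restart timestamps. This is not the code from gateway-watchdog-v2.sh (which is a shell script and is not reproduced here); the class name and thresholds are assumptions chosen for illustration.

```python
from collections import deque


class RestartLoopDetector:
    """Flag a restart loop when too many restarts fall inside a time window."""

    def __init__(self, max_restarts=5, window_seconds=120.0):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._stamps = deque()

    def record(self, now):
        """Record a restart at time `now` (seconds); return True if looping."""
        self._stamps.append(now)
        # Drop restarts that are older than the window.
        while self._stamps and now - self._stamps[0] > self.window_seconds:
            self._stamps.popleft()
        return len(self._stamps) > self.max_restarts
```

At the 10-11 second cadence seen in this incident, these thresholds would trip on the sixth restart, roughly a minute in, at which point the watchdog can back off or stop.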