Bug: Gateway enters infinite restart loop after agent timeout (250 retries in 42 minutes) #27590

@flockmaster

Bug Description

After an agent run times out, the OpenClaw Gateway enters an unbounded restart loop: it attempted to restart ~250 times over 42 minutes at a fixed 10-11 second interval, leaving the system completely unresponsive.

Environment

  • OpenClaw version: 2026.2.24
  • OS: macOS, Darwin kernel 24.5.0 (arm64)
  • Node: v22.22.0
  • Gateway mode: local (ws://127.0.0.1:18789)
  • Launch method: LaunchAgent (ai.openclaw.gateway.plist)

Timeline (2026-02-26)

16:08:15 - First anomaly: embedded run timeout (runId=slug-gen-1772093280356)
16:08:47 - Restart loop begins
16:50:27 - Restart loop ends (42 minutes total)

Error Log Pattern

Gateway failed to start: gateway already running (pid 40836); lock timeout after 5000ms
Port 18789 is already in use.
- pid 40836 tingjing: openclaw-gateway (127.0.0.1:18789)
- Gateway already running locally. Stop it (openclaw gateway stop) or use a different port.

Source files: gateway-cli-CHQJpgpN.js:1009 and gateway-cli-CK1Ri8SQ.js

Root Cause Analysis

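  1. Watchdog misfire: The agent timeout at 16:08:15 triggered the internal health check mechanism, which apparently treated the Gateway as down even though the process (pid 40836) was still running
  2. State detection bug: The restart logic does not detect that the Gateway is already running, so every start attempt fails on the lock (timeout after 5000ms) and on the port (18789 already in use)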
  3. No backoff: Fixed 10-11 second interval between restart attempts, no exponential backoff
  4. No max retry limit: Loop continued for 42 minutes (~250 attempts) with no upper bound

Impact

  • Gateway completely unresponsive for 42 minutes
  • All Feishu/Lark messages unprocessable
  • Log file bloat (gateway.err.log reached 156MB)
  • System resource waste

Suggested Fixes

Short-term

  1. Add running-state detection to gateway start: if the Gateway is already running, return success immediately
  2. Add exponential backoff (1s, 2s, 4s, 8s, 16s, 30s, 60s...)
  3. Add max retry limit (e.g., stop after 10 attempts)
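
To make fixes 1 to 3 concrete, here is a minimal sketch of a bounded, backed-off start loop. It is not OpenClaw's actual code: `StartOutcome`, `nextDelayMs`, `startWithBackoff`, and the injected `tryStart`/`sleep` callbacks are hypothetical names. The delay schedule and the limit of 10 attempts come from the suggestions above.

```typescript
// Hypothetical sketch; none of these names exist in OpenClaw.
// Schedule from the suggestion above: 1s, 2s, 4s, 8s, 16s, 30s, 60s, then stay at 60s.
const BACKOFF_SCHEDULE_MS: readonly number[] = [1, 2, 4, 8, 16, 30, 60].map((s) => s * 1000);
const MAX_ATTEMPTS = 10;

type StartOutcome = "started" | "already-running" | "failed";

/**
 * Delay to wait after `failedAttempts` consecutive failures (1-based),
 * or null once the attempt budget is used up.
 */
function nextDelayMs(failedAttempts: number): number | null {
  if (failedAttempts >= MAX_ATTEMPTS) return null;
  const last = BACKOFF_SCHEDULE_MS.length - 1;
  const index = Math.min(Math.max(failedAttempts - 1, 0), last);
  return BACKOFF_SCHEDULE_MS[index] ?? null;
}

/**
 * Tries to start the gateway at most MAX_ATTEMPTS times.
 * "already-running" is treated as success, so it never triggers a retry.
 */
async function startWithBackoff(
  tryStart: () => Promise<StartOutcome>,
  sleep: (ms: number) => Promise<void>,
): Promise<StartOutcome> {
  let failed = 0;
  for (;;) {
    const outcome = await tryStart();
    if (outcome !== "failed") return outcome;
    failed += 1;
    const delay = nextDelayMs(failed);
    if (delay === null) return "failed"; // give up instead of looping forever
    await sleep(delay);
  }
}
```

With these values a persistently failing start gives up after 10 attempts and 241 seconds of waiting in total, rather than retrying every 10-11 seconds indefinitely.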

Long-term

  1. Implement proper health check (HTTP endpoint or WebSocket ping/pong)
  2. Use openclaw gateway status in watchdog logic
  3. Consider relying on the service manager's own supervision (launchd KeepAlive, or systemd Restart=) instead of custom restart logic
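
The error log shows start being attempted while pid 40836 still held the lock and the port, a state in which a plain start cannot succeed. The sketch below shows one way a watchdog could pick its action from a status snapshot. The `GatewayStatus` fields and `decideAction` are hypothetical; how each field is obtained (for example from openclaw gateway status or a WebSocket ping) is left open.

```typescript
// Hypothetical status snapshot; the field names are illustrative, not OpenClaw's API.
interface GatewayStatus {
  processAlive: boolean;  // the pid recorded by the lock still exists
  portListening: boolean; // something accepts connections on the gateway port
  pingAnswered: boolean;  // a health request (HTTP or WebSocket ping) was answered in time
}

type WatchdogAction = "none" | "start" | "stop-then-start";

/** Chooses the watchdog action so that start is never issued against a live process. */
function decideAction(status: GatewayStatus): WatchdogAction {
  // Healthy: the gateway answers, so leave it alone.
  if (status.pingAnswered) return "none";
  // Hung but still holding the lock or the port: a plain start would fail with
  // "gateway already running", so the old process has to be stopped first.
  if (status.processAlive || status.portListening) return "stop-then-start";
  // Nothing is running: a plain start is safe.
  return "start";
}
```

Under these assumptions the state seen during the incident (process alive, port bound, no healthy response) maps to "stop-then-start" rather than to repeated start attempts.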

Log File Locations

  • Main log: /tmp/openclaw/openclaw-YYYY-MM-DD.log
  • Error log: ~/.openclaw/logs/gateway.err.log
  • Gateway log: ~/.openclaw/logs/gateway.log

Workaround

I built an external watchdog script (gateway-watchdog-v2.sh) that:

  • Health-checks every 10 seconds
  • Detects restart loops and backs off
  • Attempts auto-repair via Codex CLI for config issues
  • Falls back to config rollback if repair fails
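
The script itself is not included here. As an illustration of the restart-loop detection step only, a sliding-window check could look like the sketch below, written in TypeScript rather than shell. `isRestartLoop`, the 60-second window, and the threshold of 3 are assumptions, not values taken from gateway-watchdog-v2.sh.

```typescript
/**
 * Returns true when at least `threshold` restarts fall inside the last
 * `windowMs` milliseconds. Timestamps are epoch milliseconds.
 * Hypothetical helper; the defaults are illustrative.
 */
function isRestartLoop(
  restartTimesMs: readonly number[],
  nowMs: number,
  windowMs: number = 60_000,
  threshold: number = 3,
): boolean {
  const recent = restartTimesMs.filter((t) => t <= nowMs && nowMs - t <= windowMs);
  return recent.length >= threshold;
}
```

With restarts arriving every 10-11 seconds as in this incident, the check would fire on the third attempt, at which point the watchdog can back off instead of restarting again.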
