[Bug]: openclaw gateway restart always times out due to MainPID vs child PID mismatch in health check #24279
Description
Summary
openclaw gateway restart unconditionally kills the healthy gateway process, retries the restart, and then times out, because the health check compares the port listener's PID against systemd's MainPID, which belongs to a different process.
Steps to reproduce
- Install OpenClaw and start the gateway (openclaw gateway start).
- Confirm it is healthy (openclaw gateway status).
- Run openclaw gateway restart.
Expected behavior
The gateway restarts cleanly and the command exits with a success message.
Actual behavior
The command kills the running gateway child process, issues a second restart, waits 60 s, and exits with an error — even though the gateway is running and healthy at that point:
Found stale gateway process(es): <pid>.
Stopping stale process(es) and retrying restart...
Restarted systemd service: openclaw-gateway.service
Timed out after 60s waiting for gateway port 18789 to become healthy.
Service runtime: status=running, state=active, pid=<supervisor-pid>, lastExit=0
Port 18789 is already in use.
- pid <gateway-pid> ubuntu: openclaw-gateway (127.0.0.1:18789)
- Gateway already running locally. Stop it (openclaw gateway stop) or use a different port.
Gateway restart timed out after 60s waiting for health checks.
OpenClaw version
2026.2.22-2 (45febec)
Operating system
Ubuntu 24.04, Linux 6.17, systemd user service (--user)
Install method
npm global
Logs, screenshots, and evidence
Restarted systemd service: openclaw-gateway.service
Found stale gateway process(es): 301584.
Stopping stale process(es) and retrying restart...
Restarted systemd service: openclaw-gateway.service
Timed out after 60s waiting for gateway port 18789 to become healthy.
Service runtime: status=running, state=active, pid=302262, lastExit=0
Port 18789 is already in use.
- pid 302272 ubuntu: openclaw-gateway (127.0.0.1:18789)
- Gateway already running locally. Stop it (openclaw gateway stop) or use a different port.
Gateway restart timed out after 60s waiting for health checks.
Impact and severity
No response
Additional information
Root cause
The gateway runs as a two-process tree:
PID A openclaw ← node entry.js gateway (systemd MainPID)
PID B openclaw-gateway ← child, actually binds to port 18789
inspectGatewayRestart obtains the port listener PID via lsof (PID B) and the service PID via systemctl show --property MainPID (PID A). Because the port is bound by a child process rather than the main process tracked by systemd, these are structurally always different, causing two cascading failures:
1. healthy is always false:
const ownsPort = runtime.pid != null
? portUsage.listeners.some(listener => listener.pid === runtime.pid) // B === A → false
: ...
const healthy = running && ownsPort // true && false → false
2. The child is always classified as stale:
const staleGatewayPids = gatewayListeners
.map(l => l.pid) // [B]
.filter(pid =>
runtime.pid == null ||
pid !== runtime.pid || // B !== A → true, so B is always "stale"
!running
)
Because healthy is always false and staleGatewayPids is always [B], runDaemonRestart kills the healthy child (PID B), issues a second systemctl restart, then waits 60 s for healthy to become true, which it never does. The gateway recovers via systemd's Restart=always, but the CLI always reports a timeout.
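The two predicates can be reproduced in isolation. The following is a minimal sketch, not the actual OpenClaw source: the Listener and Runtime shapes and the inspect helper are hypothetical names introduced for illustration, and the runtime.pid == null branch (elided in the snippets above) is left out. The PIDs are the ones from the log above.

```typescript
// Hypothetical shapes for illustration; the real OpenClaw types may differ.
interface Listener { pid: number }
interface Runtime { pid: number; running: boolean }

// Mirrors the two checks quoted above for the case where runtime.pid is set.
function inspect(runtime: Runtime, listeners: Listener[]) {
  // Direct PID equality: only true if MainPID itself holds the port.
  const ownsPort = listeners.some((l) => l.pid === runtime.pid);
  const healthy = runtime.running && ownsPort;

  // Any listener whose PID differs from MainPID is treated as stale.
  const staleGatewayPids = listeners
    .map((l) => l.pid)
    .filter((pid) => pid !== runtime.pid || !runtime.running);

  return { healthy, staleGatewayPids };
}

// From the log: systemd MainPID (A) = 302262, port listener (B) = 302272.
const result = inspect({ pid: 302262, running: true }, [{ pid: 302272 }]);
console.log(result);
```

With these inputs healthy is false and staleGatewayPids contains only 302272, matching the behavior seen in the log: the listening child is flagged as stale even though the service is active.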
Suggested fix
Walk the process tree from MainPID instead of comparing listener PID directly against it:
// current — breaks with parent-child process architecture
const ownsPort = runtime.pid != null
? portUsage.listeners.some(l => l.pid === runtime.pid)
: ...
// proposed — treat any listener whose ancestor is MainPID as owned
const ownsPort = runtime.pid != null
? portUsage.listeners.some(l => l.pid === runtime.pid || isDescendantOf(l.pid, runtime.pid))
: ...
Apply the same fix to the staleGatewayPids filter. On Linux, ancestry can be resolved cheaply via /proc/<pid>/status (PPid: field) without shelling out.
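One possible shape for isDescendantOf is sketched below, assuming Linux and Node.js. Only the isDescendantOf name comes from the proposal above; readParentPid, isOwnedBy, staleListenerPids, the injectable parentOf lookup, and the maxDepth bound are assumptions added for illustration and are not part of OpenClaw.

```typescript
import { readFileSync } from "node:fs";

type ParentLookup = (pid: number) => number | null;

// Reads the parent PID from the "PPid:" field of /proc/<pid>/status (Linux only).
// Returns null if the process has exited or the field is missing.
function readParentPid(pid: number): number | null {
  try {
    const status = readFileSync(`/proc/${pid}/status`, "utf8");
    const match = /^PPid:\s+(\d+)$/m.exec(status);
    return match ? Number(match[1]) : null;
  } catch {
    return null;
  }
}

// True if `ancestor` appears anywhere in the parent chain of `pid`.
// maxDepth (an assumed safety bound) stops the walk if the chain is unexpectedly long.
function isDescendantOf(
  pid: number,
  ancestor: number,
  parentOf: ParentLookup = readParentPid,
  maxDepth = 32,
): boolean {
  let current = parentOf(pid);
  for (let depth = 0; current !== null && current > 0 && depth < maxDepth; depth++) {
    if (current === ancestor) return true;
    current = parentOf(current);
  }
  return false;
}

// A listener is owned if it is MainPID itself or one of its descendants.
function isOwnedBy(pid: number, mainPid: number, parentOf: ParentLookup = readParentPid): boolean {
  return pid === mainPid || isDescendantOf(pid, mainPid, parentOf);
}

// Same idea applied to the stale filter: only listeners outside the service's tree are stale.
function staleListenerPids(
  listenerPids: number[],
  mainPid: number | null,
  running: boolean,
  parentOf: ParentLookup = readParentPid,
): number[] {
  return listenerPids.filter(
    (pid) => mainPid == null || !running || !isOwnedBy(pid, mainPid, parentOf),
  );
}

// Usage with a fake process table so the logic can be checked without /proc:
// 302272 (listener) is a child of 302262 (MainPID); 999 is unrelated.
const table = new Map<number, number>([[302272, 302262], [302262, 1], [999, 1]]);
const fakeParentOf: ParentLookup = (pid) => table.get(pid) ?? null;

console.log(isOwnedBy(302272, 302262, fakeParentOf));                 // true
console.log(staleListenerPids([302272, 999], 302262, true, fakeParentOf)); // only 999 remains
```

Injecting the parent lookup keeps the ancestry walk testable and leaves room for a non-Linux fallback, which this sketch does not attempt to provide.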
Workaround
Use systemd directly — the service restarts correctly, only the CLI health check is broken:
systemctl --user restart openclaw-gateway.service