[Bug]: openclaw gateway restart always times out due to MainPID vs child PID mismatch in health check #24279

@Xyri1

Summary

openclaw gateway restart unconditionally kills the healthy gateway process, retries the restart, and then times out, because the health check compares the port listener's PID (a child process) against the systemd MainPID (its parent), and the two never match.

Steps to reproduce

  1. Install OpenClaw and start the gateway (openclaw gateway start).
  2. Confirm it is healthy (openclaw gateway status).
  3. Run openclaw gateway restart.

Expected behavior

The gateway restarts cleanly and the command exits with a success message.

Actual behavior

The command kills the running gateway child process, issues a second restart, waits 60 s, and exits with an error — even though the gateway is running and healthy at that point:

Found stale gateway process(es): <pid>.
Stopping stale process(es) and retrying restart...
Restarted systemd service: openclaw-gateway.service
Timed out after 60s waiting for gateway port 18789 to become healthy.
Service runtime: status=running, state=active, pid=<supervisor-pid>, lastExit=0
Port 18789 is already in use.
- pid <gateway-pid> ubuntu: openclaw-gateway (127.0.0.1:18789)
- Gateway already running locally. Stop it (openclaw gateway stop) or use a different port.
Gateway restart timed out after 60s waiting for health checks.

OpenClaw version

2026.2.22-2 (45febec)

Operating system

Ubuntu 24.04, Linux 6.17, systemd user service (--user)

Install method

npm global

Logs, screenshots, and evidence

Restarted systemd service: openclaw-gateway.service
Found stale gateway process(es): 301584.
Stopping stale process(es) and retrying restart...
Restarted systemd service: openclaw-gateway.service
Timed out after 60s waiting for gateway port 18789 to become healthy.
Service runtime: status=running, state=active, pid=302262, lastExit=0
Port 18789 is already in use.
- pid 302272 ubuntu: openclaw-gateway (127.0.0.1:18789)
- Gateway already running locally. Stop it (openclaw gateway stop) or use a different port.
Gateway restart timed out after 60s waiting for health checks.

Impact and severity

No response

Additional information

Root cause

The gateway runs as a two-process tree:

PID A  openclaw          ← node entry.js gateway  (systemd MainPID)
  PID B  openclaw-gateway  ← child, actually binds to port 18789

inspectGatewayRestart obtains the port listener PID via lsof (PID B) and the service PID via systemctl show --property MainPID (PID A). Because the port is bound by a child process rather than the main process tracked by systemd, these are structurally always different, causing two cascading failures:

1. healthy is always false:

const ownsPort = runtime.pid != null
  ? portUsage.listeners.some(listener => listener.pid === runtime.pid)  // B === A → false
  : ...
const healthy = running && ownsPort  // true && false → false

2. The child is always classified as stale:

const staleGatewayPids = gatewayListeners
  .map(l => l.pid)   // [B]
  .filter(pid =>
    runtime.pid == null ||
    pid !== runtime.pid ||  // B !== A → true, so B is always "stale"
    !running
  )

Because healthy is always false and staleGatewayPids is always [B], runDaemonRestart kills the healthy child (PID B), issues a second systemctl restart, then waits 60 s for healthy to become true — which it never does. The gateway recovers via systemd's Restart=always, but the CLI always reports a timeout.
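To make the mismatch concrete, here is a minimal sketch that replays the two quoted predicates with the PIDs from the log above (MainPID 302262, listener 302272). The Listener and Runtime shapes and the evaluate function are hypothetical stand-ins, not OpenClaw's actual types, and the elided branch for a null runtime.pid is assumed to yield false here.

```typescript
// Hypothetical minimal shapes; the real OpenClaw types are not shown in this issue.
interface Listener { pid: number }
interface Runtime { pid: number | null }

function evaluate(runtime: Runtime, listeners: Listener[], running: boolean) {
  // Same comparison as the quoted snippet for a non-null runtime.pid.
  // Assumption: the elided null branch is treated as "does not own the port".
  const ownsPort =
    runtime.pid != null && listeners.some(l => l.pid === runtime.pid);
  const healthy = running && ownsPort;

  // Same filter as the quoted snippet.
  const staleGatewayPids = listeners
    .map(l => l.pid)
    .filter(pid => runtime.pid == null || pid !== runtime.pid || !running);

  return { ownsPort, healthy, staleGatewayPids };
}

// PIDs taken from the log: systemd MainPID 302262, port listener 302272.
const result = evaluate({ pid: 302262 }, [{ pid: 302272 }], true);
console.log(result); // ownsPort: false, healthy: false, staleGatewayPids: [302272]
```

With a running service and a listening child, the check still reports unhealthy and flags the child as stale, which matches the observed kill-and-timeout sequence.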

Suggested fix

Treat a listener as owned when it is MainPID or a descendant of MainPID, instead of comparing the listener PID directly against MainPID:

// current — breaks with parent-child process architecture
const ownsPort = runtime.pid != null
  ? portUsage.listeners.some(l => l.pid === runtime.pid)
  : ...

// proposed — treat any listener whose ancestor is MainPID as owned
const ownsPort = runtime.pid != null
  ? portUsage.listeners.some(l => l.pid === runtime.pid || isDescendantOf(l.pid, runtime.pid))
  : ...

Apply the same fix to the staleGatewayPids filter. On Linux, ancestry can be resolved cheaply via /proc/<pid>/status (PPid: field) without shelling out.
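A possible shape for the isDescendantOf helper and the adjusted stale filter, as a rough sketch rather than a tested patch. It assumes Linux with /proc mounted; readParentPid, findStalePids, the injectable getParent parameter, and the maxDepth guard are all hypothetical names introduced for illustration, and the parent of 302262 in the usage example is a made-up value.

```typescript
import { readFileSync } from "node:fs";

// Read the parent PID from the "PPid:" line of /proc/<pid>/status.
// Returns null if the file is unreadable (process exited, non-Linux host).
function readParentPid(pid: number): number | null {
  try {
    const status = readFileSync(`/proc/${pid}/status`, "utf8");
    const match = /^PPid:\s+(\d+)$/m.exec(status);
    return match ? Number(match[1]) : null;
  } catch {
    return null;
  }
}

// True if ancestorPid appears among the parents of pid.
// A PID is not counted as its own descendant. getParent is injectable
// so the walk can be exercised without a real process tree.
function isDescendantOf(
  pid: number,
  ancestorPid: number,
  getParent: (pid: number) => number | null = readParentPid,
  maxDepth = 32, // guard against unexpectedly deep or cyclic lookups
): boolean {
  let current: number | null = getParent(pid);
  for (let depth = 0; current != null && current > 0 && depth < maxDepth; depth++) {
    if (current === ancestorPid) return true;
    current = getParent(current);
  }
  return false;
}

// The stale filter with the same ownership rule: a listener is stale only
// if the service is not running, MainPID is unknown, or the listener is
// neither MainPID nor one of its descendants.
function findStalePids(
  listenerPids: number[],
  mainPid: number | null,
  running: boolean,
  getParent: (pid: number) => number | null = readParentPid,
): number[] {
  return listenerPids.filter(pid =>
    mainPid == null ||
    !running ||
    (pid !== mainPid && !isDescendantOf(pid, mainPid, getParent)));
}

// Usage with the PIDs from the log; the parent of 302262 is invented.
const parents = new Map<number, number>([[302272, 302262], [302262, 1000]]);
const lookup = (pid: number): number | null => parents.get(pid) ?? null;
console.log(isDescendantOf(302272, 302262, lookup));        // true
console.log(findStalePids([302272], 302262, true, lookup)); // []
```

Walking upward from the listener needs one small file read per level, so the common parent-child case costs a single read. A listener that exits mid-walk simply reads as "not a descendant".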


Workaround

Use systemd directly — the service restarts correctly, only the CLI health check is broken:

systemctl --user restart openclaw-gateway.service
