Gateway: fix restart-loop by terminating wss clients before close and triggering supervisor restart on Linux#33149
… triggering supervisor restart on Linux
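Greptile Summary
This PR fixes a critical gateway restart-loop by making two targeted changes: (1) force-terminating all WebSocket connections in `server-close.ts` before calling `wss.close()`, and (2) calling `triggerOpenClawRestart()` from the Linux supervised respawn path in `process-respawn.ts`. The core bug fix is correct and well-tested. However, there is one behavioral regression on unsupported platforms: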
Confidence Score: 3/5
Last reviewed commit: f7b4a03
const restart = triggerOpenClawRestart();
if (!restart.ok) {
  return {
    mode: "failed",
    detail: restart.detail ?? `${restart.method} restart failed`,
  };
}
triggerOpenClawRestart() returns ok: false on unsupported platforms
With the platform guard removed, triggerOpenClawRestart() is now called unconditionally for every supervised process. On platforms that are neither darwin nor linux (e.g. Windows), restart.ts explicitly returns { ok: false, method: "supervisor", detail: "unsupported platform restart" }. The new code treats this as a failure and returns { mode: "failed" }, which in run-loop.ts (line 73-75) triggers a warning and falls back to in-process restart.
Behavioral change:
- Before: Unsupported platforms → { mode: "supervised" } → clean exit code 0 → supervisor restarts
- After: Unsupported platforms → { mode: "failed" } → warning → in-process restart
The OPENCLAW_SERVICE_MARKER in supervisor-markers.ts is platform-agnostic. If a user sets it on a non-Linux/macOS platform expecting supervisor-managed restart, they'll silently get in-process restart instead.
Suggested fix: Guard the triggerOpenClawRestart() call by platform, or treat "unsupported platform restart" as non-error so the function returns { mode: "supervised" } on unknown platforms:
- const restart = triggerOpenClawRestart();
- if (!restart.ok) {
-   return {
-     mode: "failed",
-     detail: restart.detail ?? `${restart.method} restart failed`,
-   };
- }
+ const restart = triggerOpenClawRestart();
+ if (!restart.ok && restart.detail !== "unsupported platform restart") {
+   return {
+     mode: "failed",
+     detail: restart.detail ?? `${restart.method} restart failed`,
+   };
+ }
+ return { mode: "supervised" };
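The other option named in the suggested fix, guarding the call by platform, might look roughly like the sketch below. It is an illustration only and not the code in `process-respawn.ts`: `respawnUnderSupervisor`, `RestartAttempt`, and `RespawnResult` are hypothetical names, and the restart trigger is passed in as a parameter so the example stays self-contained.

```typescript
// Hypothetical shapes, modeled on the fields quoted in the review comment.
type RestartAttempt = { ok: boolean; method: string; detail?: string };
type RespawnResult = { mode: "supervised" } | { mode: "failed"; detail: string };

// Sketch: only ask the supervisor for a restart on platforms where the
// trigger is implemented (macOS and Linux). Everywhere else, keep the old
// behavior of reporting "supervised" so the process exits cleanly.
function respawnUnderSupervisor(
  platform: string,
  triggerRestart: () => RestartAttempt,
): RespawnResult {
  const canTrigger = platform === "darwin" || platform === "linux";
  if (canTrigger) {
    const restart = triggerRestart();
    if (!restart.ok) {
      return {
        mode: "failed",
        detail: restart.detail ?? `${restart.method} restart failed`,
      };
    }
  }
  return { mode: "supervised" };
}
```

Compared with matching on the `"unsupported platform restart"` detail string, a platform guard avoids coupling the caller to an error message, at the cost of duplicating the platform list in two places.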
…x/macOS exit cleanly (openclaw#33149)
Summary
- `wss.close()` in `server-close.ts` waits for all WebSocket clients to disconnect gracefully. If any client never sends a close-frame acknowledgment (e.g. a browser tab with the control UI, a stalled chat client, or a connection still in the WS upgrade handshake), `wss.close()` blocks indefinitely. Because `httpServer.close()` is only called after `wss.close()` returns, the TCP listening socket on port 18789 is never released. Every new gateway process fails with `EADDRINUSE`, systemd restarts it again, and the cycle repeats (30+ failed attempts) until the old process is manually `pkill -9`'d.
- `WebSocketServer` is created with `{ noServer: true }`, so it has no reference to the HTTP server; `wss.close()` never touches the HTTP listening socket. The close sequence (`wss.close()` → `httpServer.close()`) means a hung `wss.close()` leaves the port bound indefinitely.
- `server-close.ts` (primary): Before calling `wss.close()`, forcefully `terminate()` every connection still in `wss.clients`. This covers both connections that did not respond to the close frame and connections in the WS upgrade handshake that were never added to `params.clients`. After termination, `wss.close()` returns immediately. Also upgraded `closeIdleConnections()` to `closeAllConnections()` (Node 18.2+; the project requires Node 22+) so keep-alive HTTP connections are flushed without an extra timeout.
- `process-respawn.ts` (secondary): In Linux systemd supervised mode, now calls `triggerOpenClawRestart()` (i.e. `systemctl restart <unit>`) before exiting, mirroring the existing macOS `launchctl kickstart -k` path. This (a) runs `cleanStaleGatewayProcessesSync()` first to remove any leftover upstream process still on the port, and (b) ensures the service restarts even when configured with `Restart=on-failure` (which does not restart on a clean exit code 0).
Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
User-visible / Behavior Changes
- `pkill -9` to recover.
Security Impact (required)
- `triggerOpenClawRestart` was already called from the `!restart` command path; the supervised respawn path now calls it too)
- Yes, explain risk + mitigation: N/A
Repro + Verification
Environment
- `Restart=on-failure` or `Restart=always`)
- `prefix: "gateway", kind: "restart"` in `config-reload-plan.ts`, e.g. `gateway.controlUi.allowedOrigins`)
- `gateway.controlUi.allowedOrigins` changed while a browser tab is open on the control UI
Steps to reproduce (before fix)
- `openclaw gateway run` under a systemd user service with at least one WebSocket client connected (e.g. open the web control UI).
- `~/.openclaw/config.json` to change `gateway.controlUi.allowedOrigins`.
- `requestGatewayRestart()` emits SIGUSR1, shutdown begins, `wss.close()` stalls, port 18789 remains bound.
- `EADDRINUSE` → fails → repeat 30+ times.
- `pkill -9 openclaw-gateway`.
Steps to verify (after fix)
- `pnpm build`).
- `pnpm test -- src/gateway/server-close.test.ts src/infra/process-respawn.test.ts src/cli/gateway-cli/run-loop.test.ts`: all pass.
Expected
- `wss.close()` returns immediately after `terminate()` is called on all clients.
Actual
Evidence
- `src/gateway/server-close.test.ts` cover: terminate-before-close ordering, orphan connections (in `wss.clients` but not `params.clients`), terminate-throws-safely, `closeAllConnections` preferred over `closeIdleConnections`.
- `src/infra/process-respawn.test.ts` cover Linux systemd supervised path calling `triggerOpenClawRestart()` and propagating failure to `mode: "failed"`.
- `run-loop.test.ts`.
Human Verification (required)
- `terminate()` throwing (caught and ignored); `wss.clients` empty (loop is a no-op); `closeAllConnections` unavailable (falls back to `closeIdleConnections`); `triggerOpenClawRestart()` failing on Linux (returns `mode: "failed"`, falls back to in-process restart).
- `wss.terminate` behavior difference expected).
Compatibility / Migration
Failure Recovery (if this breaks)
- `git revert f7b4a03b2`
- `src/gateway/server-close.ts`, `src/infra/process-respawn.ts`
- `terminate()` is somehow breaking the close sequence; revert immediately); Linux supervised restarts now returning `mode: "failed"` unexpectedly if `systemctl` is unavailable (gateway falls back to in-process restart, which is safe).
Risks and Mitigations
- `ws.terminate()` is forceful (TCP RST). Clients receive no graceful close reason. Acceptable: we already sent a WS close frame (code 1012 "service restart") before terminate, so well-behaved clients have had their chance to close cleanly. The gateway is about to exit anyway.
- `triggerOpenClawRestart()` calls `systemctl restart` synchronously (up to 2 s timeout). This is the same call already used by the `!restart` command; it blocks only for the systemctl acknowledgment, not for the full restart cycle. Low risk.
- `Restart=on-failure` services now explicitly get a `systemctl restart` call. This is intentional: previously a clean exit (code 0) would silently leave the service stopped. The new behavior matches user expectation of a live gateway after config change.
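
The shutdown order described in the Summary (close frame, then `terminate()`, then `wss.close()`, then the HTTP server) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `server-close.ts`: `ClientLike`, `WssLike`, `HttpServerLike`, and `closeGatewaySockets` are hypothetical names, and the interfaces model only the handful of methods the sequence relies on.

```typescript
// Hypothetical minimal shapes; the real objects come from the WebSocket
// library and Node's HTTP server and have many more members.
interface ClientLike {
  close(code: number, reason: string): void;
  terminate(): void;
}
interface WssLike {
  clients: Set<ClientLike>;
  close(cb: (err?: Error) => void): void;
}
interface HttpServerLike {
  closeAllConnections?: () => void;
  closeIdleConnections?: () => void;
  close(cb: (err?: Error) => void): void;
}

// Sketch of the close order: polite close frame, forced terminate,
// then close the WebSocket server, then release the listening socket.
async function closeGatewaySockets(
  wss: WssLike,
  httpServer: HttpServerLike,
): Promise<void> {
  wss.clients.forEach((client) => {
    try {
      client.close(1012, "service restart"); // give well-behaved clients a reason
    } catch {
      // ignore: the connection may already be gone
    }
    try {
      client.terminate(); // do not wait for a close-frame acknowledgment
    } catch {
      // ignore: a throwing terminate() must not block shutdown
    }
  });

  // With every client terminated, this should no longer stall.
  await new Promise<void>((resolve) => wss.close(() => resolve()));

  // Prefer closeAllConnections(); fall back when it is unavailable.
  if (typeof httpServer.closeAllConnections === "function") {
    httpServer.closeAllConnections();
  } else if (typeof httpServer.closeIdleConnections === "function") {
    httpServer.closeIdleConnections();
  }

  await new Promise<void>((resolve, reject) =>
    httpServer.close((err) => (err ? reject(err) : resolve())),
  );
}
```

Iterating over `wss.clients` rather than a separately tracked client list is what lets the sketch reach connections that were never registered elsewhere, which is the orphan case the tests listed under Evidence are said to cover.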