Skip to content

Gateway: fix restart-loop by terminating wss clients before close and triggering supervisor restart on Linux#33149

Open
chengzhichao-xydt wants to merge 2 commits intoopenclaw:mainfrom
chengzhichao-xydt:fix/33103-gateway-restart-loop-upstream
Open

Gateway: fix restart-loop by terminating wss clients before close and triggering supervisor restart on Linux#33149
chengzhichao-xydt wants to merge 2 commits intoopenclaw:mainfrom
chengzhichao-xydt:fix/33103-gateway-restart-loop-upstream

Conversation

@chengzhichao-xydt
Copy link
Copy Markdown
Contributor

Summary

  • Problem: When a config change triggers a gateway restart (SIGUSR1), wss.close() in server-close.ts waits for all WebSocket clients to disconnect gracefully. If any client never sends a close-frame acknowledgment (e.g. a browser tab with the control UI, a stalled chat client, or a connection still in the WS upgrade handshake), wss.close() blocks indefinitely. Because httpServer.close() is only called after wss.close() returns, the TCP listening socket on port 18789 is never released. Every new gateway process fails with EADDRINUSE, systemd restarts it again, and the cycle repeats — 30+ failed attempts — until the old process is manually pkill -9'd.
  • Why it matters: Any single stalled WebSocket connection makes a config-change restart permanently unrecoverable without manual intervention.
  • Root cause detail: The WebSocketServer is created with { noServer: true }, so it has no reference to the HTTP server; wss.close() never touches the HTTP listening socket. The close sequence (wss.close()httpServer.close()) means a hung wss.close() leaves the port bound indefinitely.
  • What changed (two fixes):
    1. server-close.ts (primary): Before calling wss.close(), forcefully terminate() every connection still in wss.clients. This covers both connections that did not respond to the close frame and connections in the WS upgrade handshake that were never added to params.clients. After termination wss.close() returns immediately. Also upgraded closeIdleConnections() to closeAllConnections() (Node 18.2+, required Node 22+) so keep-alive HTTP connections are flushed without an extra timeout.
    2. process-respawn.ts (secondary): In Linux systemd supervised mode, now calls triggerOpenClawRestart() (i.e. systemctl restart <unit>) before exiting, mirroring the existing macOS launchctl kickstart -k path. This (a) runs cleanStaleGatewayProcessesSync() first to remove any leftover upstream process still on the port, and (b) ensures the service restarts even when configured with Restart=on-failure (which does not restart on a clean exit code 0).
  • What did NOT change (scope boundary): Gateway behavior, WebSocket protocol, auth, routing, config schema, and all non-shutdown code paths are unchanged.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • Gateway config-change restarts now complete cleanly even when a WebSocket client is stalled. Previously the gateway would hang with port 18789 occupied and require pkill -9 to recover.
  • No change to normal (non-stuck-client) restart behavior.

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No (triggerOpenClawRestart was already called from the !restart command path; the supervised respawn path now calls it too)
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: Ubuntu 24.04 LTS (primary repro); macOS also exercised
  • Runtime/container: Node 22+, systemd user service (Restart=on-failure or Restart=always)
  • Model/provider: N/A
  • Integration/channel (if any): Any (config change that maps to prefix: "gateway", kind: "restart" in config-reload-plan.ts, e.g. gateway.controlUi.allowedOrigins)
  • Relevant config (redacted): gateway.controlUi.allowedOrigins changed while a browser tab is open on the control UI

Steps to reproduce (before fix)

  1. Start openclaw gateway run under a systemd user service with at least one WebSocket client connected (e.g. open the web control UI).
  2. Edit ~/.openclaw/config.json to change gateway.controlUi.allowedOrigins.
  3. Observe: chokidar fires, requestGatewayRestart() emits SIGUSR1, shutdown begins, wss.close() stalls, port 18789 remains bound.
  4. Systemd restarts → new process → EADDRINUSE → fails → repeat 30+ times.
  5. Only exit: pkill -9 openclaw-gateway.

Steps to verify (after fix)

  1. Apply patch; rebuild (pnpm build).
  2. Repeat steps 1–2 above.
  3. Observe: shutdown completes within ~1 s; new gateway starts cleanly on port 18789.
  4. Run pnpm test -- src/gateway/server-close.test.ts src/infra/process-respawn.test.ts src/cli/gateway-cli/run-loop.test.ts — all pass.

Expected

  • Gateway restarts cleanly after config change even with connected WS clients.
  • wss.close() returns immediately after terminate() is called on all clients.
  • New process binds port 18789 on first attempt.

Actual

  • Matches expected.

Evidence

  • New tests in src/gateway/server-close.test.ts cover: terminate-before-close ordering, orphan connections (in wss.clients but not params.clients), terminate-throws-safely, closeAllConnections preferred over closeIdleConnections.
  • Updated tests in src/infra/process-respawn.test.ts cover Linux systemd supervised path calling triggerOpenClawRestart() and propagating failure to mode: "failed".
  • All 24 affected tests pass; no regressions in run-loop.test.ts.

Human Verification (required)

  • Verified scenarios: Config change restart with open browser control UI tab; config change restart with no connected clients; SIGTERM stop path (unchanged); in-process restart path (unchanged).
  • Edge cases checked: terminate() throwing (caught and ignored); wss.clients empty (loop is a no-op); closeAllConnections unavailable (falls back to closeIdleConnections); triggerOpenClawRestart() failing on Linux (returns mode: "failed", falls back to in-process restart).
  • What you did NOT verify: Long-running soak test with hundreds of connected clients; Windows (no wss.terminate behavior difference expected).

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: git revert f7b4a03b2
  • Files/config to restore: src/gateway/server-close.ts, src/infra/process-respawn.ts
  • Known bad symptoms reviewers should watch for: gateway fails to start after restart (would indicate terminate() is somehow breaking the close sequence — revert immediately); Linux supervised restarts now returning mode: "failed" unexpectedly if systemctl is unavailable (gateway falls back to in-process restart, which is safe).

Risks and Mitigations

  • ws.terminate() is forceful (TCP RST). Clients receive no graceful close reason. Acceptable: we already sent a WS close frame (code 1012 "service restart") before terminate, so well-behaved clients have had their chance to close cleanly. The gateway is about to exit anyway.
  • Linux triggerOpenClawRestart() calls systemctl restart synchronously (up to 2 s timeout). This is the same call already used by the !restart command; it blocks only for the systemctl acknowledgment, not for the full restart cycle. Low risk.
  • Restart=on-failure services now explicitly get a systemctl restart call. This is intentional — previously a clean exit (code 0) would silently leave the service stopped. The new behavior matches user expectation of a live gateway after config change.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 3, 2026

Greptile Summary

This PR fixes a critical gateway restart-loop by making two targeted changes: (1) force-terminating all WebSocket connections in wss.clients before calling wss.close() in server-close.ts, preventing indefinite hangs when a client never sends a close-frame acknowledgment, and (2) extending the triggerOpenClawRestart() supervisor-kickstart call — previously limited to macOS launchd — to all supervised environments including Linux systemd, ensuring the service is explicitly restarted regardless of the Restart= policy.

The core bug fix is correct and well-tested. However, there is one behavioral regression on unsupported platforms:

  • server-close.ts: ws.terminate() is called on every entry in wss.clients (with per-client error swallowing) before wss.close(), resolving the indefinite hang. closeIdleConnections is also upgraded to closeAllConnections (Node 18.2+) for faster HTTP teardown.
  • process-respawn.ts: The darwin-only guard is removed so triggerOpenClawRestart() is called for any supervised process. On Linux this runs systemctl restart, matching the launchd path on macOS. However, on platforms that are neither darwin nor linux (e.g. Windows), triggerOpenClawRestart() returns { ok: false, detail: "unsupported platform restart" }. The new code now treats this as a failure and returns { mode: "failed" } instead of the previous { mode: "supervised" }, causing the process to log a warning and fall back to in-process restart. The generic OPENCLAW_SERVICE_MARKER in supervisor-markers.ts is not platform-scoped, so this change affects any supervised environment on an unsupported OS.
  • Tests are thorough and cover the happy path, but the unsupported-platform regression is not tested.

Confidence Score: 3/5

  • Safe to merge for Linux and macOS deployments; behavioral regression for unsupported platforms.
  • The core bug fix is correct and well-tested. The primary WebSocket termination approach in server-close.ts is straightforward and safe. The Linux systemd integration correctly mirrors the macOS launchd path. However, the removal of the platform guard causes a behavioral regression on non-Linux, non-macOS platforms: unsupported platforms will now return mode: "failed" instead of mode: "supervised", triggering in-process restart instead of supervisor-managed restart. While the in-process restart fallback is safe, this silently changes expected behavior for users on unsupported platforms who set the generic OPENCLAW_SERVICE_MARKER environment variable.
  • src/infra/process-respawn.ts — the unconditional triggerOpenClawRestart() call at line 46 causes behavioral regression on non-Linux, non-macOS platforms. Consider guarding by platform or treating "unsupported platform restart" as non-error.

Last reviewed commit: f7b4a03

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment on lines +46 to 52
const restart = triggerOpenClawRestart();
if (!restart.ok) {
return {
mode: "failed",
detail: restart.detail ?? `${restart.method} restart failed`,
};
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

triggerOpenClawRestart() returns ok: false on unsupported platforms

With the platform guard removed, triggerOpenClawRestart() is now called unconditionally for every supervised process. On platforms that are neither darwin nor linux (e.g. Windows), restart.ts explicitly returns { ok: false, method: "supervisor", detail: "unsupported platform restart" }. The new code treats this as a failure and returns { mode: "failed" }, which in run-loop.ts (line 73-75) triggers a warning and falls back to in-process restart.

Behavioral change:

  • Before: Unsupported platforms → { mode: "supervised" } → clean exit code 0 → supervisor restarts
  • After: Unsupported platforms → { mode: "failed" } → warning → in-process restart

The OPENCLAW_SERVICE_MARKER in supervisor-markers.ts is platform-agnostic. If a user sets it on a non-Linux/macOS platform expecting supervisor-managed restart, they'll silently get in-process restart instead.

Suggested fix: Guard the triggerOpenClawRestart() call by platform, or treat "unsupported platform restart" as non-error so the function returns { mode: "supervised" } on unknown platforms:

Suggested change
const restart = triggerOpenClawRestart();
if (!restart.ok) {
return {
mode: "failed",
detail: restart.detail ?? `${restart.method} restart failed`,
};
}
const restart = triggerOpenClawRestart();
if (!restart.ok && restart.detail !== "unsupported platform restart") {
return {
mode: "failed",
detail: restart.detail ?? `${restart.method} restart failed`,
};
}
return { mode: "supervised" };
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/process-respawn.ts
Line: 46-52

Comment:
## `triggerOpenClawRestart()` returns `ok: false` on unsupported platforms

With the platform guard removed, `triggerOpenClawRestart()` is now called unconditionally for every supervised process. On platforms that are neither `darwin` nor `linux` (e.g. Windows), `restart.ts` explicitly returns `{ ok: false, method: "supervisor", detail: "unsupported platform restart" }`. The new code treats this as a failure and returns `{ mode: "failed" }`, which in `run-loop.ts` (line 73-75) triggers a warning and falls back to in-process restart.

**Behavioral change:**
- **Before**: Unsupported platforms → `{ mode: "supervised" }` → clean exit code 0 → supervisor restarts
- **After**: Unsupported platforms → `{ mode: "failed" }` → warning → in-process restart

The `OPENCLAW_SERVICE_MARKER` in `supervisor-markers.ts` is platform-agnostic. If a user sets it on a non-Linux/macOS platform expecting supervisor-managed restart, they'll silently get in-process restart instead.

**Suggested fix:** Guard the `triggerOpenClawRestart()` call by platform, or treat `"unsupported platform restart"` as non-error so the function returns `{ mode: "supervised" }` on unknown platforms:

```suggestion
const restart = triggerOpenClawRestart();
if (!restart.ok && restart.detail !== "unsupported platform restart") {
  return {
    mode: "failed",
    detail: restart.detail ?? `${restart.method} restart failed`,
  };
}
return { mode: "supervised" };
```

How can I resolve this? If you propose a fix, please make it concise.

@openclaw-barnacle openclaw-barnacle bot added gateway Gateway runtime size: M labels Mar 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gateway Gateway runtime size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Gateway restart infinite loop with zombie process on config change

1 participant