fix(gateway): support openclaw gateway restart/stop in containers without systemd #36140

chalawbot wants to merge 2 commits into openclaw:main

Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 182691f826
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```typescript
let port: number;
try {
  port = resolveGatewayPort(undefined, process.env);
} catch {
```
Use configured port when resolving gateway PID
resolveGatewayPid calls resolveGatewayPort(undefined, process.env), so in manager-less environments it ignores gateway.port from config unless an env var is set. That makes gateway restart/stop miss a running gateway on non-default ports and can target whatever process is bound to the default port instead. This regresses the new container path for users who only set the port in config.
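A hedged sketch of the fix this comment suggests: resolve the port from loaded config before falling back to environment and default. The names `GatewayConfig`, `resolvePortForPidLookup`, and the `OPENCLAW_GATEWAY_PORT` variable are illustrative assumptions, not the PR's actual identifiers; the default 18789 comes from the PR's repro steps.

```typescript
// Illustrative precedence: explicit argument > config file > environment > default.
// All names below are assumptions for this sketch; the PR's real helpers may differ.
interface GatewayConfig {
  gateway?: { port?: number };
}

const DEFAULT_GATEWAY_PORT = 18789; // default port used in the PR's repro steps

export function resolvePortForPidLookup(
  config: GatewayConfig,
  env: Record<string, string | undefined>,
  explicit?: number,
): number {
  if (explicit !== undefined) {
    return explicit;
  }
  if (config.gateway?.port !== undefined) {
    return config.gateway.port; // honor gateway.port from config before env/default
  }
  const fromEnv = Number.parseInt(env.OPENCLAW_GATEWAY_PORT ?? "", 10);
  if (Number.isInteger(fromEnv) && fromEnv > 0) {
    return fromEnv;
  }
  return DEFAULT_GATEWAY_PORT;
}
```

With this ordering, a gateway configured on a non-default port is found even when no environment variable is set.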
```typescript
const pidPattern = /pid=(\d+)/g;
let match: RegExpExecArray | null;
while ((match = pidPattern.exec(res.stdout)) !== null) {
  const pid = Number.parseInt(match[1], 10);
  if (isValidPid(pid) && pid !== process.pid) {
```
Restrict ss PID parsing to OpenClaw processes
The ss fallback collects any pid=<n> on the port and returns it without validating the process identity. In minimal containers (where lsof is missing), gateway stop/restart can send SIGTERM/SIGUSR1 to an unrelated service that happens to listen on the configured port, which is an unsafe process-targeting regression.
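For reference, the `pid=` extraction from `ss -ltnp`-style output can be isolated as a pure function. `parseSsPids` below is an illustrative sketch, not the PR's actual helper.

```typescript
// Extract candidate PIDs from `ss -ltnp`-style output, e.g.
//   LISTEN 0 511 *:18789 *:* users:(("openclaw",pid=1234,fd=23))
// Skips the caller's own PID so the CLI never signals itself.
export function parseSsPids(stdout: string, selfPid: number): number[] {
  const pids: number[] = [];
  const pidPattern = /pid=(\d+)/g;
  let match: RegExpExecArray | null;
  while ((match = pidPattern.exec(stdout)) !== null) {
    const pid = Number.parseInt(match[1], 10);
    if (Number.isInteger(pid) && pid > 0 && pid !== selfPid) {
      pids.push(pid);
    }
  }
  return pids;
}
```

As the review notes, each candidate still needs a process-identity check (for example against `/proc/<pid>/cmdline`) before any signal is sent.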
```typescript
ctx.emit({
  ok: false,
  error: `Failed to send restart signal to gateway (PID ${pid}): ${String(err)}`,
});
```
Propagate signal-delivery failures as command errors
When direct signal delivery fails, this path emits an error payload and returns false instead of calling fail(...). Since runDaemonRestart’s caller does not inspect the boolean, the CLI exits 0 in non-JSON mode even though restart failed (same pattern exists in direct stop), which breaks scripting and health automation that rely on exit status.
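A minimal sketch of the suggested fix, assuming a `fail(...)` helper that marks the command as errored; the context shape and helper names are illustrative, not the PR's actual code.

```typescript
// Route signal-delivery errors through fail(...) so the CLI exit code is
// non-zero, instead of only emitting an error payload and returning false.
type EmitPayload = { ok: boolean; error?: string };
type CommandCtx = {
  emit: (payload: EmitPayload) => void;
  fail: (message: string) => void; // assumed to set a non-zero exit status
};

export function reportSignalFailure(ctx: CommandCtx, pid: number, err: unknown): false {
  const message = `Failed to send restart signal to gateway (PID ${pid}): ${String(err)}`;
  ctx.emit({ ok: false, error: message });
  ctx.fail(message); // propagate as a command error for scripts and automation
  return false;
}
```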
Greptile Summary

This PR adds direct signal-based restart and stop paths for the gateway in environments without a service manager. Two verified issues should be addressed before merging:
Confidence Score: 3/5
Last reviewed commit: 182691f
```typescript
// Phase 1: wait for old process to go DOWN (ok: false).
while (Date.now() < deadline) {
  const result = await probe();
  if (!result.ok) {
    break;
  }
  await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
}

// Phase 2: wait for new process to come UP (ok: true).
while (Date.now() < deadline) {
  const result = await probe();
  if (result.ok) {
    return true;
  }
  await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
}
```
Phase 1 can starve Phase 2 for the entire timeout budget
If the gateway performs a true in-process SIGUSR1 restart (no process death, just connection reset) and restarts faster than the polling interval (500 ms), Phase 1 will never observe ok: false. The Phase 1 while loop then runs until the 45-second deadline expires.
When Phase 2 then starts, Date.now() >= deadline is already true, so the loop body never executes. The function returns false and the user sees "Restart signal sent but gateway did not confirm healthy within timeout" for a restart that actually succeeded.
A simple mitigation is to cap how long Phase 1 is allowed to run — for example, give it a fraction of the total budget (e.g. phase1DeadlineMs = timeoutMs / 3) and fall through to Phase 2 regardless:
```typescript
const phase1DeadlineMs = params.timeoutMs / 3;
const phase1Deadline = Date.now() + phase1DeadlineMs;
// Phase 1: wait for old process DOWN — but don't starve Phase 2.
while (Date.now() < deadline && Date.now() < phase1Deadline) {
  const result = await probe();
  if (!result.ok) {
    break;
  }
  await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
}
// Phase 2: wait for new process UP.
while (Date.now() < deadline) {
  const result = await probe();
  if (result.ok) {
    return true;
  }
  await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
}
return false;
```

This ensures Phase 2 always gets to run even when Phase 1 misses the down window.
Path: src/cli/daemon-cli/sigusr1-restart.ts (lines 122-138)

```diff
 // When service manager is not available (container/foreground), try SIGUSR1 directly.
 // The gateway process handles SIGUSR1 natively for in-process restarts.
 if (!loaded) {
-  await handleServiceNotLoaded({
-    serviceNoun: params.serviceNoun,
-    service: params.service,
-    loaded,
-    renderStartHints: params.renderStartHints,
-    json,
-    emit,
-  });
-  return false;
+  return runDirectSigusr1Restart({ params, json, emit });
 }
```
The --hard flag is silently ignored when no service manager is available
When loaded === false (container / foreground mode), the code immediately delegates to runDirectSigusr1Restart without inspecting opts.hard. Inside runDirectSigusr1Restart, the opts field is present in the context but hard is never read.
A user who explicitly runs openclaw gateway restart --hard in a container will silently receive a graceful SIGUSR1 restart instead of the hard restart they requested, with no indication that their flag was ignored.
At minimum, log a message when opts.hard is set but cannot be honoured:
```typescript
if (!loaded) {
  if (params.opts?.hard && !json) {
    defaultRuntime.log(
      "\n⚠️ No service manager detected — --hard is not available in this environment. Falling back to SIGUSR1.\n",
    );
  }
  return runDirectSigusr1Restart({ params, json, emit });
}
```
Path: src/cli/daemon-cli/lifecycle-core.ts (lines 259-263)

fix(gateway): support openclaw gateway restart/stop in containers without systemd

When no service manager (systemd/launchd) is available — such as in Docker, Podman, or other container environments — `openclaw gateway restart` and `openclaw gateway stop` previously bailed with "Gateway service disabled" because `isLoaded()` returned false.

This change adds direct signal-based restart and stop paths that bypass the service manager requirement:

**PID discovery** (`sigusr1-restart.ts`):
- Add `resolveGatewayPid()` with multi-strategy PID discovery:
  1. Service manager runtime (systemd/launchd) — existing path
  2. Port-based discovery via lsof — common on full Linux/macOS installs
  3. Port-based discovery via ss — fallback for minimal containers without lsof
- Load config to resolve correct gateway port (not just env/default)
- Validate process identity via /proc/<pid>/cmdline before signaling (ss path)

**Restart** (`lifecycle-core.ts`):
- When the service is not "loaded", discover the gateway process on the configured port and send SIGUSR1 directly for graceful in-process restart
- Add `--hard` flag to force service-manager restart (systemctl/launchctl)
- Warn when `--hard` is used in a container (no service manager available)
- Add two-phase health polling (wait-for-down then wait-for-up) after SIGUSR1
- Default to graceful SIGUSR1 restart even when a service manager IS available

**Stop** (`lifecycle-core.ts`):
- When the service is not "loaded", discover the gateway process and send SIGTERM directly instead of showing "service disabled"
- Report "not running" when no gateway process is found on the port

The gateway process already handles SIGUSR1 natively for in-process restarts (run-loop.ts), so this connects the CLI to that existing mechanism.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Force-pushed 182691f to 8f6a5e0.
Thanks for the automated reviews. Addressed the valid points in the latest push.

Fixed:

Not changed (by design):
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8f6a5e0a39
```diff
 try {
-  await params.service.restart({ env: process.env, stdout });
-  if (params.postRestartCheck) {
-    await params.postRestartCheck({ json, stdout, warnings, fail });
+  process.kill(pid, "SIGUSR1");
```
Force hard restart for upgrade-driven restarts
This now makes SIGUSR1 the default restart path on service-managed installs, which only performs an in-process loop restart instead of re-execing the gateway binary (src/cli/gateway-cli/run-loop.ts:186-195). As a result, upgrade flows can report a successful restart while still running old code, because openclaw update invokes runDaemonRestart() without hard (src/cli/update-cli/update-command.ts:542). For post-upgrade restarts, this should use the service-manager restart path by default (or force --hard) so the new build is actually loaded.
Good catch on the upgrade flow — but this is out of scope for this PR.
This PR specifically fixes container environments where restart and stop were completely broken (no service manager → commands bail with "service not loaded"). The SIGUSR1 default for service-managed installs is intentional: it's the graceful path that avoids downtime (drain connections → restart loop), which is the correct default for openclaw gateway restart.
The openclaw update command (update-command.ts:542) is a separate concern — it should pass { hard: true } (or the equivalent) when it knows a binary upgrade happened. That's a one-line fix in update-command.ts, not something that should change the default restart semantics for all callers. I'll file a follow-up issue for that.
```typescript
ok: true,
result: healthy ? "restarted" : "restarted-unverified",
message: healthy
  ? "Gateway restarted (graceful)."
  : "Restart signal sent; health check timed out.",
```
Return failure when graceful restart health times out
When the health probe times out, this branch still emits ok: true with restarted-unverified and the function later returns success, so CLI callers get a zero exit status even though restart confirmation failed. That is a regression from the previous postRestartCheck behavior (src/cli/daemon-cli/lifecycle.ts:139-143), and it can silently mask broken restarts (for example crash loops or port bind failures) in scripts and health automation.
This is by design, not a regression.
When we reach the restarted-unverified state, the SIGUSR1 signal has already been sent and the process is already restarting. At that point:
- We cannot undo the restart (signal already delivered)
- The gateway is restarting — we just couldn't confirm it via health probe (network timeout, slow startup, etc.)
- Returning a failure exit code would be misleading — the restart did happen, we just couldn't verify it
The old postRestartCheck behavior is different: it ran after a service-manager restart where the manager itself confirms process lifecycle. Here we're doing direct signal delivery where success = signal sent.
The restarted-unverified status in the JSON output gives scripts the information they need to distinguish "confirmed healthy" from "signal sent but unverified" — they can act on the status field rather than exit code. A non-zero exit would cause unnecessary failures in automation when the gateway is actually fine but just slow to respond to health probes.
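To make that contract concrete: a hypothetical consumer of the JSON output might branch on the `result` field rather than the exit code. Field names follow the hunk quoted earlier in this thread; `classifyRestart` itself is illustrative, not part of the PR.

```typescript
// Automation can distinguish "confirmed healthy" from "signal sent but
// unverified" via the result field instead of the process exit code.
type RestartPayload = { ok: boolean; result?: string; message?: string };

export function classifyRestart(payload: RestartPayload): "healthy" | "unverified" | "failed" {
  if (!payload.ok) {
    return "failed";
  }
  return payload.result === "restarted" ? "healthy" : "unverified";
}
```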
- Update lifecycle-core.test.ts to mock new dependencies (isRestartEnabled, buildGatewayConnectionDetails, resolveGatewayCredentialsFromConfig, resolveGatewayPid, pollUntilGatewayHealthy)
- Fix "skips drift warning" test: loadConfig IS now called (hoisted for restart-enabled check)
- Add tests for graceful SIGUSR1 restart path, --hard flag, container fallback, credential forwarding, health poll timeout
- Add sigusr1-restart.test.ts with tests for resolveGatewayPid (service runtime, lsof fallback, config-based port) and pollUntilGatewayHealthy (two-phase polling, timeouts, credential forwarding)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 829b7c4b19
```typescript
} catch {
  // /proc not available (non-Linux) or permission denied — assume it's ours.
  return true;
}
```
Reject unreadable /proc entries during ss PID filtering
The ss fallback still accepts unrelated listeners when /proc/<pid>/cmdline is unreadable, because this catch block returns true and effectively disables process-identity validation. In Linux environments with hidepid/permission restrictions (a common container hardening setting) and no lsof, gateway stop/restart can signal a non-OpenClaw process bound to the target port. This path should treat read failures as non-matches (or require a positive openclaw match) before signaling.
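A fail-closed version of that check might look like the sketch below. The helper name is illustrative; the PR's actual code differs as the comment describes.

```typescript
import { readFileSync } from "node:fs";

// Only treat a PID as the gateway when /proc/<pid>/cmdline positively
// mentions "openclaw". Unreadable entries (hidepid, non-Linux, ENOENT)
// are rejected rather than assumed to be ours.
export function isOpenclawProcess(pid: number): boolean {
  try {
    // cmdline is argv joined with NUL bytes on Linux.
    const cmdline = readFileSync(`/proc/${pid}/cmdline`, "utf8");
    return cmdline.includes("openclaw");
  } catch {
    return false; // fail closed: never signal a process we cannot identify
  }
}
```

The trade-off is that hardened containers may then report "not running" instead of signaling, which is the safer failure mode.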
```typescript
  defaultRuntime.log("\nGateway restart is disabled (commands.restart=false).");
  defaultRuntime.log("  Use --hard to force a service restart via the service manager.\n");
}
return false;
```
Return non-zero when restart is disabled by policy
When commands.restart=false, this path logs an error and returns false instead of calling fail(...). Because the caller does not enforce the returned boolean for CLI exit behavior, openclaw gateway restart can exit successfully even though restart was explicitly rejected, which makes shell automation and health scripts mis-handle policy failures. This branch should terminate with a command error status.
Summary

- `openclaw gateway restart` and `openclaw gateway stop` fail with "Gateway service disabled" in containers without systemd/launchd, even when the gateway process is running; the only workaround was manual `pkill`
- `start`, `install`, `uninstall` commands are unchanged; service-manager paths are preserved as the primary path when available

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

- `openclaw gateway restart/stop` fail in containers without systemd #36137

User-visible / Behavior Changes

- `openclaw gateway restart` now works in containers: discovers gateway PID via lsof/ss and sends SIGUSR1 for graceful in-process restart
- `openclaw gateway stop` now works in containers: discovers gateway PID and sends SIGTERM
- New `--hard` flag on `openclaw gateway restart` to force service-manager restart (systemctl/launchctl)

Security Impact (required)

- No
- No
- No — health polling reuses existing `probeGatewayStatus` (WS RPC)
- Yes — adds `ss` as a fallback for PID discovery via `spawnSync("ss", [...])`. Only reads socket info, no write operations; `lsof` was already used.
- No

Repro + Verification

Environment

Steps

1. `openclaw gateway --port 18789 --verbose`
2. `openclaw gateway restart` → should show "Restart signal sent to gateway (PID xxx)." then "Gateway restarted successfully."
3. `openclaw gateway stop` → should show "Gateway stopped (PID xxx)."
4. `openclaw gateway stop` again → should show "Gateway is not running."

Expected

All commands work as described above.

Actual (before fix)

All commands fail with "Gateway service disabled."

Evidence

- `gateway stop`
- (`pnpm build`)

Human Verification (required)

- `--hard` flag

Compatibility / Migration

- Yes — service-manager paths are unchanged; new paths only activate when `isLoaded()` returns false
- No
- No

Failure Recovery (if this breaks)

- `pkill` remains as manual fallback
- If `ss` output format changes across distros, PID parsing could fail, but this gracefully falls back to "not running" hints

Risks and Mitigations

- `ss` output format varies across Linux distros; mitigation: parse `pid=<number>`, which is the standard `ss -p` format across all major distros, and return null on parse failure

🤖 AI-assisted (Claude Opus 4.6). Tested in Docker container. Prompts and code reviewed by human.