fix(gateway): signal-based stop/restart fallback for containers without service manager #36202

Closed
RealKai42 wants to merge 8 commits into openclaw:main from RealKai42:fix/gateway-stop-restart-container-36137

Conversation

@RealKai42
Contributor

Summary

Closes #36137.

openclaw gateway stop and openclaw gateway restart both failed in Docker/Podman/LXC containers that lack a service manager (systemd/launchd). isLoaded() returns false in these environments, causing both commands to immediately print "Gateway service disabled" and exit without touching the running process.

Fix: Add an optional onNotLoaded callback to runServiceStop and runServiceRestart in lifecycle-core.ts. When the service manager does not recognize the service as loaded, the gateway-specific lifecycle functions in lifecycle.ts fall back to port-based PID discovery (via inspectPortUsage) and send OS signals directly:

  • gateway stop → sends SIGTERM to all gateway processes listening on the configured port
  • gateway restart → sends SIGUSR1 to trigger the gateway's in-process restart loop (the same signal used for hot restarts; handled by run-loop.ts's onSigusr1 handler)

This matches the behavior already described in the issue: the gateway already handles both signals natively — the CLI just couldn't reach them.
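
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `signalGatewayOnPort`, `findPids`, and `sendSignal` are hypothetical names standing in for `findGatewayPidsOnPort` and the OS signal call, whose real signatures are not shown in this thread.

```typescript
// Hypothetical sketch of the signal-based fallback; names are illustrative only.
type GatewaySignal = "SIGTERM" | "SIGUSR1";

interface FallbackDeps {
  // Stand-in for port-based PID discovery (the PR uses findGatewayPidsOnPort).
  findPids: (port: number) => Promise<number[]>;
  // Stand-in for the OS signal call; assumed to throw if delivery fails.
  sendSignal: (pid: number, signal: GatewaySignal) => void;
}

// Resolves true when at least one process on the port received the signal.
async function signalGatewayOnPort(
  port: number,
  signal: GatewaySignal,
  deps: FallbackDeps,
): Promise<boolean> {
  const pids = await deps.findPids(port);
  let delivered = 0;
  for (const pid of pids) {
    try {
      deps.sendSignal(pid, signal);
      delivered += 1;
    } catch {
      // The process may have exited between discovery and signalling; skip it.
    }
  }
  return delivered > 0;
}
```

In a Node.js CLI, `sendSignal` would presumably wrap `process.kill(pid, signal)`; stop would pass `"SIGTERM"` and restart would pass `"SIGUSR1"`.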

Changes

  • src/cli/daemon-cli/lifecycle-core.ts: add optional onNotLoaded callback to runServiceStop and runServiceRestart
  • src/cli/daemon-cli/lifecycle.ts: implement signal-based fallback using findGatewayPidsOnPort (port inspection via inspectPortUsage)

Test plan

  • pnpm tsgo — no TypeScript errors
  • pnpm check — passes lint and format
  • Live: in a Docker container without systemd, run openclaw gateway stop / openclaw gateway restart with a foreground gateway process and verify signals are sent

@openclaw-barnacle bot added the labels channel: discord, channel: matrix, channel: telegram, gateway, cli, agents, size: M on Mar 5, 2026
@greptile-apps
Contributor

greptile-apps bot commented Mar 5, 2026

Greptile Summary

This PR fixes openclaw gateway stop and openclaw gateway restart in containers without a service manager (systemd/launchd) by adding an onNotLoaded callback to lifecycle-core.ts that lets the gateway-specific lifecycle code fall back to port-based PID discovery and OS signals. The PR also bundles several unrelated fixes: Matrix rate-limit retry, provider config override improvements, config unknown-key stripping, SQLite WAL mode, Discord audio abort racing, and a channel health restart-reason fix.

Issues in the core signal-based fallback implementation:

  1. Restart health check is skipped silently: The onNotLoaded callback sends SIGUSR1 and immediately returns true, causing runServiceRestart to exit before postRestartCheck is invoked. The health-polling logic, stale-PID cleanup, and timeout/failure handling are entirely bypassed. The log message "Gateway restart signal sent. Waiting for health..." is misleading—no health check actually runs.

  2. No JSON response on signal-based stop: When onNotLoaded handles the stop and returns true, the function exits without calling emit(). In --json mode, this produces no output, breaking JSON consumers (CI scripts, wrappers).

  3. No JSON response on signal-based restart: When onNotLoaded handles the restart and returns true, the function exits without calling emit(). In --json mode, this produces no output.

Confidence Score: 2/5

  • The PR has three concrete logic bugs in the primary feature: restart health checks are skipped, and JSON output is missing for both stop and restart in the signal-based fallback paths.
  • The PR contains three concrete, reproducible logic bugs in the signal-based stop/restart fallback—the feature being shipped to fix containers without service managers. These are not edge cases: (1) the restart health check is mandatory for production use, and (2) missing JSON output breaks CLI automation in the exact container environments this PR targets. While the bundled changes (Matrix retry, config handling, Discord audio, etc.) appear correct, these three bugs must be fixed before merging.
  • src/cli/daemon-cli/lifecycle-core.ts and src/cli/daemon-cli/lifecycle.ts need fixes for the missing health-check invocation and missing JSON emit calls on the fallback paths.

Last reviewed commit: b16ae1b

Comment on lines +154 to +157
}
if (!json) {
  defaultRuntime.log(`Gateway restart signal sent. Waiting for health...`);
}

The log message "Gateway restart signal sent. Waiting for health..." is misleading — no health check actually runs in this code path.

When onNotLoaded returns true (line 157), the runServiceRestart function in lifecycle-core.ts exits immediately with return true; before the postRestartCheck callback is ever invoked. The postRestartCheck is only called inside the try block of the normal restart path (after service restart), so the entire health-polling loop, stale-PID cleanup, and timeout/failure logic is completely bypassed.

To fix this:

  1. Either move the health-check logic into the onNotLoaded callback itself (after sending SIGUSR1, before returning), or
  2. Modify the lifecycle-core.ts contract so that postRestartCheck is still invoked even when onNotLoaded handles the restart.
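
Option 2 might look roughly like the sketch below. It is a simplified stand-in for the not-loaded branch of `runServiceRestart`, not the actual function: the `RestartParams` shape and the `restartWhenNotLoaded` name are assumptions made for illustration, and only `onNotLoaded` and `postRestartCheck` are names taken from this PR.

```typescript
// Simplified, hypothetical model of the not-loaded branch of a restart.
interface RestartParams {
  // Resolves true when the fallback handled the restart (e.g. sent SIGUSR1).
  onNotLoaded?: () => Promise<boolean>;
  // Health verification; assumed to throw when the gateway never becomes healthy.
  postRestartCheck?: () => Promise<void>;
}

// Resolves true when the fallback handled the restart and the check passed.
// Resolves false when nothing handled it; rejects if the check throws.
async function restartWhenNotLoaded(params: RestartParams): Promise<boolean> {
  if (!params.onNotLoaded) {
    return false;
  }
  const handled = await params.onNotLoaded().catch(() => false);
  if (!handled) {
    return false;
  }
  // Run the same verification as the service-manager path instead of returning early.
  if (params.postRestartCheck) {
    await params.postRestartCheck();
  }
  return true;
}
```

Keeping the check in the core function means every caller that supplies `onNotLoaded` gets the verification without having to repeat it inside its own callback.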

Comment on lines +223 to +227
  const handled = await params.onNotLoaded().catch(() => false);
  if (handled) {
    return;
  }
}

Missing JSON response on signal-based stop. When onNotLoaded returns true and handles the stop, the function returns at line 225 without calling emit(...). In --json mode, this produces no output at all, breaking any JSON consumer (CI scripts, wrappers, etc.).

The normal stop path always emits { ok: true, result: "not-loaded" | "stopped", ... } before returning. The fallback path should do the same.

Add an emit call before the return; statement, similar to the normal "not-loaded" path on line 229:

if (params.onNotLoaded) {
  const handled = await params.onNotLoaded().catch(() => false);
  if (handled) {
    emit({
      ok: true,
      result: "stopped",
      message: "Gateway stopped via SIGTERM.",
      service: buildDaemonServiceSnapshot(params.service, false),
    });
    return;
  }
}

Comment on lines +286 to +290
if (params.onNotLoaded) {
  const handled = await params.onNotLoaded().catch(() => false);
  if (handled) {
    return true;
  }

Missing JSON response on signal-based restart. When onNotLoaded returns true and handles the restart, the function returns true at line 289 without calling emit(...). In --json mode, this produces no output at all.

The normal restart path always emits a structured JSON response before returning. The fallback path should do the same.

Add an emit call before the return true; statement:

if (params.onNotLoaded) {
  const handled = await params.onNotLoaded().catch(() => false);
  if (handled) {
    emit({
      ok: true,
      result: "restarted",
      message: "Gateway restarted via SIGUSR1.",
      service: buildDaemonServiceSnapshot(params.service, true),
      warnings: warnings.length ? warnings : undefined,
    });
    return true;
  }
}


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b16ae1bb1a

Comment on lines +80 to +82
const port = await resolveGatewayRestartPort().catch(() =>
  resolveGatewayPort(loadConfig(), process.env),
);

[P1] Defer port resolution until stop fallback is needed

runDaemonStop now resolves port before calling runServiceStop, which eagerly executes loadConfig() via resolveGatewayRestartPort(). That makes openclaw gateway stop fail on malformed/legacy config even when the service manager is loaded and could stop the process without reading config, which is a regression in exactly the recovery scenario where stop is needed most. Resolve the port inside onNotLoaded so config parsing only happens for the container fallback path.
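
A minimal sketch of the deferred resolution suggested here, assuming a hypothetical factory `makeStopFallback` that builds the `onNotLoaded` callback. The `resolvePort` and `stopOnPort` parameters are illustrative stand-ins for the PR's port resolution and signal-sending code, not its real functions.

```typescript
// Hypothetical sketch: resolve the port lazily, inside the fallback only.
type PortResolver = () => Promise<number>;

// Builds a callback suitable for onNotLoaded. Nothing is resolved at build time,
// so config is only read if the service manager path does not apply.
function makeStopFallback(
  resolvePort: PortResolver,
  stopOnPort: (port: number) => Promise<boolean>,
): () => Promise<boolean> {
  return async () => {
    const port = await resolvePort();
    return stopOnPort(port);
  };
}
```

With this shape, a malformed config can only fail the container fallback path. A loaded service manager would stop the gateway without the resolver ever running.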

Comment on lines +288 to +289
if (handled) {
  return true;

[P1] Continue restart verification after fallback signal

In the not-loaded path, a handled onNotLoaded restart returns immediately here, which skips the normal postRestartCheck health verification. In containers without systemd/launchd, this allows openclaw gateway restart to report success even if SIGUSR1 was not delivered/handled and the gateway never becomes healthy (despite lifecycle.ts logging “Waiting for health…”). The fallback path should still run restart health checks before returning success.
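
One way to verify health after the signal is a bounded polling loop. The helper below is a hypothetical sketch: `waitForHealthy` and its `probe` parameter are illustrative names, and the gateway's actual health check is not shown in this thread.

```typescript
// Hypothetical polling helper; the probe stands in for a real gateway health check.
async function waitForHealthy(
  probe: () => Promise<boolean>,
  timeoutMs: number,
  intervalMs: number,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    // Treat a failing probe as "not healthy yet" rather than aborting the wait.
    const healthy = await probe().catch(() => false);
    if (healthy) {
      return true;
    }
    // Give up once another sleep would overrun the deadline.
    if (Date.now() + intervalMs > deadline) {
      return false;
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

A restart fallback could report success only when this resolves `true`, and emit a failure otherwise, so a lost or ignored SIGUSR1 would surface as an error instead of a silent success.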

Comment on lines +224 to +225
if (handled) {
  return;

[P2] Emit JSON output when stop fallback handles the request

When onNotLoaded returns true, this branch exits before any emit(...) call. As a result, in service-manager-less environments where the signal fallback succeeds, openclaw gateway stop --json produces no JSON payload, breaking automation that expects one structured response per command. Emit a success response for this handled-fallback path (and mirror the same behavior for restart).


Development

Successfully merging this pull request may close these issues.

[Bug]: openclaw gateway restart/stop fail in containers without systemd

2 participants