Skip to content

Gateway: refresh service env after update.run#32575

Closed
zwright8 wants to merge 1 commit intoopenclaw:mainfrom
zwright8:codex/pr26712-gateway-refresh-env-after-update
Closed

Gateway: refresh service env after update.run#32575
zwright8 wants to merge 1 commit intoopenclaw:mainfrom
zwright8:codex/pr26712-gateway-refresh-env-after-update

Conversation

@zwright8
Copy link
Copy Markdown

@zwright8 zwright8 commented Mar 3, 2026

Summary

Validation

  • pnpm exec vitest run src/gateway/server-methods/update.test.ts

Context

This PR is one focused slice extracted from:

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 3, 2026

Greptile Summary

This PR adds a best-effort service-env refresh step to update.run: after a successful runGatewayUpdate, the handler now invokes node <openclaw.mjs> gateway install --force --json against the updated install so the daemon service picks up the new environment values before the gateway restarts. The refresh is intentionally non-blocking — failures are logged as warnings and the SIGUSR1 restart is always scheduled regardless.

Key points:

  • resolveNodeRunner() correctly prefers process.execPath when it is Node and falls back to "node" (PATH lookup) for bundled-binary scenarios.
  • resolveGatewayInstallEntrypointCandidates() builds a deduplicated candidate list from the resolved update root, cwd, and argv[1]-relative locations.
  • The first candidate that pathExists confirms is used; if its command exits non-zero the loop stops immediately (no fallback to the next candidate), which is a deliberate design choice but worth noting.
  • Test coverage is solid for the new paths: success, update failure, service not installed, and command failure scenarios.
  • Two minor observability gaps: the warning label reads "service env refresh skipped" even when the command was actually attempted and failed (should be "failed"), and res.stderr from a failed command is not included in the reason string forwarded to the log.

Confidence Score: 4/5

  • PR is safe to merge; the refresh is best-effort and failures cannot break the update or restart flow.
  • The logic is sound and well-tested. The two flagged issues are purely observability concerns (misleading log label and missing stderr in the reason string) that do not affect correctness or runtime behavior.
  • No files require special attention beyond the two style comments on src/gateway/server-methods/update.ts.

Last reviewed commit: d9188dc

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +143 to +147
if (!refreshResult.ok) {
context?.logGateway?.warn(
`update.run service env refresh skipped ${formatControlPlaneActor(actor)} reason=${refreshResult.reason ?? "unknown"}`,
);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misleading log label for attempted-but-failed refresh

The warning always says "service env refresh skipped", but when attempted: true and ok: false (a candidate file was found, the command ran, and returned a non-zero exit code) the refresh was not skipped — it was attempted and failed. The word "skipped" only accurately describes the cases where the service is not installed (ok: true) or the command was never invoked (service check threw, or no candidate file existed).

Consider branching on refreshResult.attempted to distinguish the two situations:

if (!refreshResult.ok) {
  const verb = refreshResult.attempted ? "failed" : "skipped";
  context?.logGateway?.warn(
    `update.run service env refresh ${verb} ${formatControlPlaneActor(actor)} reason=${refreshResult.reason ?? "unknown"}`,
  );
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/gateway/server-methods/update.ts
Line: 143-147

Comment:
**Misleading log label for attempted-but-failed refresh**

The warning always says `"service env refresh skipped"`, but when `attempted: true` and `ok: false` (a candidate file was found, the command ran, and returned a non-zero exit code) the refresh was not skipped — it was attempted and failed. The word "skipped" only accurately describes the cases where the service is not installed (`ok: true`) or the command was never invoked (service check threw, or no candidate file existed).

Consider branching on `refreshResult.attempted` to distinguish the two situations:

```typescript
if (!refreshResult.ok) {
  const verb = refreshResult.attempted ? "failed" : "skipped";
  context?.logGateway?.warn(
    `update.run service env refresh ${verb} ${formatControlPlaneActor(actor)} reason=${refreshResult.reason ?? "unknown"}`,
  );
}
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +83 to +87
return {
attempted: true,
ok: false,
reason: `refresh failed (${candidate}): exit=${res.code}`,
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stderr omitted from failure reason

When runCommandWithTimeout returns a non-zero exit code, res.stderr is available but not captured in the reason string. Since this reason flows directly into the warning log, any error message printed by openclaw.mjs gateway install will be silently discarded, making production failures harder to diagnose.

    return {
      attempted: true,
      ok: false,
      reason: `refresh failed (${candidate}): exit=${res.code}${res.stderr ? ` stderr=${res.stderr.trim()}` : ""}`,
    };
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/gateway/server-methods/update.ts
Line: 83-87

Comment:
**`stderr` omitted from failure reason**

When `runCommandWithTimeout` returns a non-zero exit code, `res.stderr` is available but not captured in the `reason` string. Since this `reason` flows directly into the warning log, any error message printed by `openclaw.mjs gateway install` will be silently discarded, making production failures harder to diagnose.

```typescript
    return {
      attempted: true,
      ok: false,
      reason: `refresh failed (${candidate}): exit=${res.code}${res.stderr ? ` stderr=${res.stderr.trim()}` : ""}`,
    };
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d9188dc99e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

import { assertValidParams } from "./validation.js";

const SERVICE_REFRESH_TIMEOUT_MS = 60_000;
const SERVICE_REFRESH_ARGS = ["gateway", "install", "--force", "--json"] as const;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid self-restarting gateway while handling update.run

The refresh command now runs gateway install --force (src/gateway/server-methods/update.ts), but that install path actively restarts the service (src/daemon/systemd.ts:262 calls systemctl --user restart, and src/daemon/launchd.ts:411 bootouts/reloads the LaunchAgent). Because this refresh runs before sentinel writing and before the explicit SIGUSR1 scheduling, calling update.run from the daemonized gateway process can terminate the in-flight handler, which can drop the response/sentinel and bypass the controlled restart flow.

Useful? React with 👍 / 👎.

ok: boolean;
reason?: string;
}> {
const service = resolveGatewayService();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Catch unsupported service resolution in env refresh path

refreshGatewayServiceEnvFromUpdatedInstall calls resolveGatewayService() outside any try/catch, but resolveGatewayService throws on unsupported platforms (src/daemon/service.ts). That means a successful runGatewayUpdate can now be turned into an unhandled error in update.run on those platforms, which is a regression from the previous behavior where the handler always continued to build and return the update payload.

Useful? React with 👍 / 👎.

@zwright8 zwright8 force-pushed the codex/pr26712-gateway-refresh-env-after-update branch from d9188dc to 8f8d356 Compare March 3, 2026 05:07
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8f8d356fe3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +77 to +79
const res = await runCommandWithTimeout([nodeRunner, candidate, ...SERVICE_REFRESH_ARGS], {
timeoutMs: SERVICE_REFRESH_TIMEOUT_MS,
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Catch service-refresh spawn failures in update.run

refreshGatewayServiceEnvFromUpdatedInstall treats non-zero exits as recoverable, but the runCommandWithTimeout(...) call can also reject on child-process startup errors (see src/process/exec.ts where child.on("error") rejects). In environments where the refresh command cannot be spawned (for example missing node when the gateway runtime is Bun, or ENOENT/EACCES for the resolved entrypoint), this throws out of update.run after a successful update, so the handler can return an UNAVAILABLE error instead of writing the restart sentinel and scheduling the controlled restart.

Useful? React with 👍 / 👎.

@openclaw-barnacle
Copy link
Copy Markdown

Closing this PR because the author has more than 10 active PRs in this repo. Please reduce the active PR queue and reopen or resubmit once it is back under the limit. You can close your own PRs to get back under the limit.

@openclaw-barnacle openclaw-barnacle bot closed this Mar 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants