
Commit 103cdd9

NikolaFC and galiniliev authored
fix(gateway): add safe restart coordinator (#76923)
Add a safe restart coordinator that preflights active Gateway work before restart.

- expose gateway.restart.preflight and gateway.restart.request RPC methods
- add explicit openclaw gateway restart --safe / openclaw daemon restart --safe path
- narrow restart blockers to running non-ended tasks so queued records no longer block indefinitely
- keep existing restart behavior unchanged; --force remains the immediate override

Co-authored-by: NikolaFC <[email protected]>
Co-authored-by: galiniliev <[email protected]>
1 parent 0e702f1 commit 103cdd9
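
The blocker narrowing described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the coordinator's actual code: the `TaskRecord` shape, the `buildPreflight` name, and every blocker `kind` and message other than the `queue` one are assumptions. Only the `counts`/`blockers`/`summary` layout and the `"restart deferred: 1 queued or active operation(s)"` wording are taken from the test fixture in this commit.

```typescript
// Hypothetical task record; the real task shape is not shown in this commit.
type TaskRecord = { id: string; status: "queued" | "running"; endedAt?: number };

type RestartBlocker = { kind: string; count: number; message: string };

type RestartPreflight = {
  safe: boolean;
  counts: {
    queueSize: number;
    pendingReplies: number;
    embeddedRuns: number;
    activeTasks: number;
    totalActive: number;
  };
  blockers: RestartBlocker[];
  summary: string;
};

// Only running tasks that have not ended count as restart blockers;
// queued records are ignored so they cannot defer a restart forever.
function countActiveTasks(tasks: TaskRecord[]): number {
  return tasks.filter((t) => t.status === "running" && t.endedAt === undefined).length;
}

function buildPreflight(input: {
  queueSize: number;
  pendingReplies: number;
  embeddedRuns: number;
  tasks: TaskRecord[];
}): RestartPreflight {
  const activeTasks = countActiveTasks(input.tasks);
  const totalActive = input.queueSize + input.pendingReplies + input.embeddedRuns + activeTasks;
  const blockers: RestartBlocker[] = [];
  if (input.queueSize > 0) {
    blockers.push({
      kind: "queue",
      count: input.queueSize,
      message: `${input.queueSize} queued or active operation(s)`,
    });
  }
  // The kinds and messages below are illustrative assumptions.
  if (input.pendingReplies > 0) {
    blockers.push({
      kind: "replies",
      count: input.pendingReplies,
      message: `${input.pendingReplies} pending reply delivery(ies)`,
    });
  }
  if (input.embeddedRuns > 0) {
    blockers.push({
      kind: "embedded-runs",
      count: input.embeddedRuns,
      message: `${input.embeddedRuns} active embedded run(s)`,
    });
  }
  if (activeTasks > 0) {
    blockers.push({
      kind: "tasks",
      count: activeTasks,
      message: `${activeTasks} active task run(s)`,
    });
  }
  const safe = blockers.length === 0;
  const summary = safe
    ? "restart safe: no active work"
    : `restart deferred: ${blockers.map((b) => b.message).join(", ")}`;
  return {
    safe,
    counts: {
      queueSize: input.queueSize,
      pendingReplies: input.pendingReplies,
      embeddedRuns: input.embeddedRuns,
      activeTasks,
      totalActive,
    },
    blockers,
    summary,
  };
}
```

Under these assumptions, a Gateway holding only queued task records reports `safe: true`, while a single running task with no end timestamp yields one blocker.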

22 files changed

Lines changed: 519 additions & 16 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -457,6 +457,7 @@ Docs: https://docs.openclaw.ai
 - Status/update: resolve beta update-channel checks from the installed version when config still says `stable`, and let `status --deep` reuse live gateway channel credential state instead of warning on command-path-only token misses.
 - Doctor/plugins: preserve unmanaged third-party plugin `node_modules` during `doctor --fix`, while still pruning OpenClaw-managed runtime dependency caches.
 - Gateway/restart: add `openclaw gateway restart --force` and `--wait <duration>`, log active task run IDs before restart deferral timers, and report timeout restarts as explicit forced restarts.
+- Gateway/restart: align `gateway.restart.safe` preflight with scheduled restart deferral by counting only active restart blockers (running non-ended tasks), so queued task records no longer keep "safe" restarts deferred indefinitely.
 - Discord: persist slash-command deploy hashes across process restarts so unchanged command sets skip redeploy and avoid restart-loop 429s.
 - Providers/LM Studio: normalize binary `off`/`on` reasoning metadata from Gemma 4 and other local models to LM Studio's accepted OpenAI-compatible `reasoning_effort` values.
 - Plugins/externalization: keep official external install docs, update examples, and live Codex npm checks on default npm tags instead of `@beta`. Thanks @vincentkoc.

docs/cli/daemon.md

Lines changed: 2 additions & 1 deletion
@@ -36,7 +36,7 @@ openclaw daemon uninstall

 - `status`: `--url`, `--token`, `--password`, `--timeout`, `--no-probe`, `--require-rpc`, `--deep`, `--json`
 - `install`: `--port`, `--runtime <node|bun>`, `--token`, `--force`, `--json`
-- `restart`: `--force`, `--wait <duration>`, `--json`
+- `restart`: `--safe`, `--force`, `--wait <duration>`, `--json`
 - lifecycle (`uninstall|start|stop`): `--json`

 Notes:
@@ -53,6 +53,7 @@ Notes:
 - If both `gateway.auth.token` and `gateway.auth.password` are configured and `gateway.auth.mode` is unset, install is blocked until mode is set explicitly.
 - On macOS, `install` keeps LaunchAgent plists owner-only and loads managed service environment values through an owner-only file and wrapper instead of serializing API keys or auth-profile env refs into `EnvironmentVariables`.
 - If you intentionally run multiple gateways on one host, isolate ports, config/state, and workspaces; see [/gateway#multiple-gateways-same-host](/gateway#multiple-gateways-same-host).
+- `restart --safe` asks the running Gateway to preflight active work and schedule one coalesced restart after active work drains. Plain `restart` keeps the existing service-manager behavior; `--force` remains the immediate override path.

 ## Prefer

docs/cli/gateway.md

Lines changed: 10 additions & 0 deletions
@@ -105,6 +105,16 @@ openclaw gateway run
 Raw stream jsonl path.
 </ParamField>

+## Restart the Gateway
+
+```bash
+openclaw gateway restart
+openclaw gateway restart --safe
+openclaw gateway restart --force
+```
+
+`openclaw gateway restart --safe` asks the running Gateway to preflight active OpenClaw work before restarting. If queued operations, reply delivery, embedded runs, or task runs are active, the Gateway reports the blockers, coalesces duplicate safe restart requests, and restarts once the active work drains. Plain `restart` keeps the existing service-manager behavior for compatibility. Use `--force` only when you explicitly want the immediate override path.
+
 <Warning>
 Inline `--password` can be exposed in local process listings. Prefer `--password-file`, env, or a SecretRef-backed `gateway.auth.password`.
 </Warning>
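
The coalescing behavior documented in this hunk (duplicate safe restart requests join one pending restart, which fires once active work drains) could look roughly like the sketch below. The class, its method names, and the `"scheduled"` status label are hypothetical; only the `"deferred"` and `"coalesced"` status strings appear in this commit, and the real coordinator in `restart-coordinator.js` is not shown here.

```typescript
// "deferred" and "coalesced" appear in the commit; "scheduled" is an assumed
// label for the case where nothing blocks the restart.
type SafeRestartStatus = "scheduled" | "deferred" | "coalesced";

class RestartCoordinatorSketch {
  private readonly countActiveWork: () => number;
  private pending = false;
  private restarts = 0;

  constructor(countActiveWork: () => number) {
    this.countActiveWork = countActiveWork;
  }

  // Handle one safe restart request.
  request(): SafeRestartStatus {
    if (this.pending) {
      // A restart is already waiting for work to drain; join it.
      return "coalesced";
    }
    if (this.countActiveWork() > 0) {
      this.pending = true;
      return "deferred";
    }
    this.restart();
    return "scheduled";
  }

  // Call whenever a unit of work finishes; fires the single pending restart
  // once nothing is active anymore.
  notifyWorkDrained(): void {
    if (this.pending && this.countActiveWork() === 0) {
      this.pending = false;
      this.restart();
    }
  }

  get restartCount(): number {
    return this.restarts;
  }

  private restart(): void {
    // Stand-in for signalling the Gateway process.
    this.restarts += 1;
  }
}
```

With one active operation, the first `request()` returns `"deferred"`, any further request returns `"coalesced"`, and exactly one restart happens after the work count reaches zero and `notifyWorkDrained()` runs.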

src/cli/daemon-cli/lifecycle.test.ts

Lines changed: 54 additions & 1 deletion
@@ -49,6 +49,7 @@ const probeGateway = vi.fn<
     configSnapshot: unknown;
   }>
 >();
+const callGatewayCli = vi.fn();
 const isRestartEnabled = vi.fn<(config?: { commands?: unknown }) => boolean>(() => true);
 const loadConfig = vi.hoisted(() => vi.fn(() => ({})));
 const recoverInstalledLaunchAgent = vi.hoisted(() => vi.fn());
@@ -77,6 +78,10 @@ vi.mock("../../gateway/probe.js", () => ({
   }) => probeGateway(opts),
 }));

+vi.mock("../../gateway/call.js", () => ({
+  callGatewayCli: (opts: unknown) => callGatewayCli(opts),
+}));
+
 vi.mock("../../config/commands.js", () => ({
   isRestartEnabled: (config?: { commands?: unknown }) => isRestartEnabled(config),
 }));
@@ -113,7 +118,11 @@ vi.mock("./lifecycle-core.js", () => ({

 describe("runDaemonRestart health checks", () => {
   let runDaemonStart: (opts?: { json?: boolean }) => Promise<void>;
-  let runDaemonRestart: (opts?: { json?: boolean }) => Promise<boolean>;
+  let runDaemonRestart: (opts?: {
+    json?: boolean;
+    safe?: boolean;
+    force?: boolean;
+  }) => Promise<boolean>;
   let runDaemonStop: (opts?: { json?: boolean }) => Promise<void>;
   let envSnapshot: ReturnType<typeof captureEnv>;

@@ -162,6 +171,7 @@ describe("runDaemonRestart health checks", () => {
     signalVerifiedGatewayPidSync.mockReset();
     formatGatewayPidList.mockReset();
     probeGateway.mockReset();
+    callGatewayCli.mockReset();
     isRestartEnabled.mockReset();
     loadConfig.mockReset();
     recoverInstalledLaunchAgent.mockReset();
@@ -204,6 +214,31 @@ describe("runDaemonRestart health checks", () => {
       ok: true,
       configSnapshot: { commands: { restart: true } },
     });
+    callGatewayCli.mockResolvedValue({
+      ok: true,
+      status: "deferred",
+      preflight: {
+        safe: false,
+        counts: {
+          queueSize: 1,
+          pendingReplies: 0,
+          embeddedRuns: 0,
+          activeTasks: 0,
+          totalActive: 1,
+        },
+        blockers: [{ kind: "queue", count: 1, message: "1 queued or active operation(s)" }],
+        summary: "restart deferred: 1 queued or active operation(s)",
+      },
+      restart: {
+        ok: true,
+        pid: 123,
+        signal: "SIGUSR1",
+        delayMs: 0,
+        mode: "emit",
+        coalesced: false,
+        cooldownMsApplied: 0,
+      },
+    });
     isRestartEnabled.mockReturnValue(true);
     signalVerifiedGatewayPidSync.mockImplementation(() => {});
     formatGatewayPidList.mockImplementation((pids) => pids.join(", "));
@@ -230,6 +265,24 @@ describe("runDaemonRestart health checks", () => {
     expect(recoverInstalledLaunchAgent).toHaveBeenCalledWith({ result: "started" });
   });

+  it("requests a safe gateway restart over RPC without touching the service manager", async () => {
+    await runDaemonRestart({ json: true, safe: true });
+
+    expect(callGatewayCli).toHaveBeenCalledWith({
+      method: "gateway.restart.request",
+      params: { reason: "gateway.restart.safe" },
+      timeoutMs: 10_000,
+    });
+    expect(runServiceRestart).not.toHaveBeenCalled();
+  });
+
+  it("keeps force restart on the existing non-safe path", async () => {
+    await runDaemonRestart({ json: true, force: true });
+
+    expect(callGatewayCli).not.toHaveBeenCalled();
+    expect(runServiceRestart).toHaveBeenCalled();
+  });
+
   it("repairs stale loaded service definitions from gateway start", async () => {
     repairLoadedGatewayServiceForStart.mockResolvedValue({
       result: "started",

src/cli/daemon-cli/lifecycle.ts

Lines changed: 49 additions & 0 deletions
@@ -1,12 +1,14 @@
 import { isRestartEnabled } from "../../config/commands.flags.js";
 import { readBestEffortConfig, resolveGatewayPort } from "../../config/config.js";
 import { resolveGatewayService } from "../../daemon/service.js";
+import { callGatewayCli } from "../../gateway/call.js";
 import { probeGateway } from "../../gateway/probe.js";
 import {
   findVerifiedGatewayListenerPidsOnPortSync,
   formatGatewayPidList,
   signalVerifiedGatewayPidSync,
 } from "../../infra/gateway-processes.js";
+import type { SafeGatewayRestartRequestResult } from "../../infra/restart-coordinator.js";
 import { type GatewayRestartIntent, writeGatewayRestartIntentSync } from "../../infra/restart.js";
 import { defaultRuntime } from "../../runtime.js";
 import { normalizeOptionalString } from "../../shared/string-coerce.js";
@@ -139,6 +141,50 @@ function resolveGatewayRestartIntentOptions(
   return undefined;
 }

+function formatSafeRestartWarnings(result: SafeGatewayRestartRequestResult): string[] | undefined {
+  if (result.preflight.blockers.length === 0) {
+    return undefined;
+  }
+  return [result.preflight.summary];
+}
+
+async function requestSafeGatewayRestart(opts: DaemonLifecycleOptions): Promise<boolean> {
+  if (opts.force) {
+    throw new Error("--safe cannot be combined with --force; omit --safe to force restart now");
+  }
+  if (opts.wait !== undefined) {
+    throw new Error("--safe cannot be combined with --wait; safe restart uses gateway deferral");
+  }
+  const result = await callGatewayCli<SafeGatewayRestartRequestResult>({
+    method: "gateway.restart.request",
+    params: { reason: "gateway.restart.safe" },
+    timeoutMs: 10_000,
+  });
+  const message =
+    result.status === "coalesced"
+      ? "safe restart request joined an existing pending gateway restart"
+      : result.status === "deferred"
+        ? "safe restart requested; gateway will restart after active work drains"
+        : "safe restart requested; gateway will restart momentarily";
+  const payload = {
+    ok: true,
+    result: result.status,
+    message,
+    preflight: result.preflight,
+    restart: result.restart,
+    warnings: formatSafeRestartWarnings(result),
+  };
+  if (opts.json) {
+    defaultRuntime.log(JSON.stringify(payload, null, 2));
+  } else {
+    defaultRuntime.log(message);
+    if (result.preflight.blockers.length > 0) {
+      defaultRuntime.log(theme.warn(result.preflight.summary));
+    }
+  }
+  return true;
+}
+
 async function restartGatewayWithoutServiceManager(
   port: number,
   restartIntent?: GatewayRestartIntent,
@@ -218,6 +264,9 @@ export async function runDaemonStop(opts: DaemonLifecycleOptions = {}) {
  * Throws/exits on check or restart failures.
  */
 export async function runDaemonRestart(opts: DaemonLifecycleOptions = {}): Promise<boolean> {
+  if (opts.safe) {
+    return await requestSafeGatewayRestart(opts);
+  }
   const json = Boolean(opts.json);
   const service = resolveGatewayService();
   let restartedWithoutServiceManager = false;
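
The flag handling in this file reduces to a small decision rule: `--safe` routes the restart through the Gateway RPC and rejects `--force` and `--wait`, while everything else stays on the service-manager path. The standalone sketch below restates that rule so it can be checked in isolation. `selectRestartPath` and the `RestartPath` labels are hypothetical names introduced for illustration; the error messages are copied from the diff.

```typescript
type RestartFlags = { safe?: boolean; force?: boolean; wait?: string };

// Hypothetical labels for the two code paths in runDaemonRestart.
type RestartPath = "gateway-rpc" | "service-manager";

function selectRestartPath(flags: RestartFlags): RestartPath {
  if (!flags.safe) {
    // Plain restart and --force keep the existing service-manager behavior.
    return "service-manager";
  }
  if (flags.force) {
    throw new Error("--safe cannot be combined with --force; omit --safe to force restart now");
  }
  if (flags.wait !== undefined) {
    throw new Error("--safe cannot be combined with --wait; safe restart uses gateway deferral");
  }
  return "gateway-rpc";
}
```

Because the `safe` check comes first, `--force` and `--wait` without `--safe` never reach the validation and behave exactly as before this commit.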

src/cli/daemon-cli/register-service-commands.test.ts

Lines changed: 11 additions & 0 deletions
@@ -70,6 +70,17 @@ describe("addGatewayServiceCommands", () => {
         );
       },
     },
+    {
+      name: "forwards restart safe control",
+      argv: ["restart", "--safe"],
+      assert: () => {
+        expect(runDaemonRestart).toHaveBeenCalledWith(
+          expect.objectContaining({
+            safe: true,
+          }),
+        );
+      },
+    },
     {
       name: "forwards restart force control",
       argv: ["restart", "--force"],

src/cli/daemon-cli/register-service-commands.ts

Lines changed: 2 additions & 0 deletions
@@ -49,6 +49,7 @@ function resolveRestartOptions(cmdOpts: DaemonLifecycleOptions, command?: Comman
   return {
     ...cmdOpts,
     force: Boolean(cmdOpts.force || parentForce),
+    safe: Boolean(cmdOpts.safe),
   };
 }

@@ -122,6 +123,7 @@ export function addGatewayServiceCommands(parent: Command, opts?: { statusDescri
     .command("restart")
     .description("Restart the Gateway service (launchd/systemd/schtasks)")
     .option("--force", "Restart immediately without waiting for active gateway work", false)
+    .option("--safe", "Request an OpenClaw-aware restart after active work drains", false)
     .option(
       "--wait <duration>",
       "Wait duration before forcing restart (ms, 10s, 5m; 0 waits indefinitely)",

src/cli/daemon-cli/types.ts

Lines changed: 1 addition & 0 deletions
@@ -27,5 +27,6 @@ export type DaemonInstallOptions = {
 export type DaemonLifecycleOptions = {
   json?: boolean;
   force?: boolean;
+  safe?: boolean;
   wait?: string;
 };

src/cli/gateway-cli/run-loop.test.ts

Lines changed: 3 additions & 1 deletion
@@ -390,7 +390,9 @@ describe("runGatewayLoop", () => {
     expect(waitForActiveEmbeddedRuns).not.toHaveBeenCalled();
     expect(abortEmbeddedPiRun).toHaveBeenCalledWith(undefined, { mode: "all" });
     expect(gatewayLog.warn).toHaveBeenCalledWith(
-      expect.stringContaining("restart blocked by active task run(s): taskId=task-force"),
+      expect.stringContaining(
+        "restart blocked by active background task run(s): taskId=task-force",
+      ),
     );
     expect(gatewayLog.warn).toHaveBeenCalledWith(
       "forced restart requested; skipping active work drain",

src/cli/gateway-cli/run-loop.ts

Lines changed: 1 addition & 1 deletion
@@ -392,7 +392,7 @@ export async function runGatewayLoop(params: {
       `draining ${activeTasks} active task(s) and ${activeRuns} active embedded run(s) before restart ${formatRestartDrainBudget()}`,
     );
     if (taskBlockers) {
-      gatewayLog.warn(`restart blocked by active task run(s): ${taskBlockers}`);
+      gatewayLog.warn(`restart blocked by active background task run(s): ${taskBlockers}`);
     }
     if (restartIntent?.force) {
       gatewayLog.warn("forced restart requested; skipping active work drain");
