Skip to content

fix(windows): use detached restart script for scheduled task restart#46869

Open
EvaZenith wants to merge 5 commits intoopenclaw:mainfrom
EvaZenith:fix/windows-scheduled-task-restart
Open

fix(windows): use detached restart script for scheduled task restart#46869
EvaZenith wants to merge 5 commits intoopenclaw:mainfrom
EvaZenith:fix/windows-scheduled-task-restart

Conversation

@EvaZenith
Copy link
Copy Markdown

Summary

Fixes #5065, #27802, #41804

On Windows, openclaw gateway restart (and the in-chat /restart command) stops the gateway but never relaunches it. The gateway remains offline until manually restarted.

Root Cause

restartScheduledTask() in src/daemon/schtasks.ts calls schtasks /End and schtasks /Run inline in the same process. On Windows, the agent process is owned by the Scheduled Task — so schtasks /End terminates the current process immediately, before schtasks /Run ever executes.

Fix

Call relaunchGatewayScheduledTask() (already present in src/infra/windows-task-restart.ts) before calling schtasks /End. This function:

  1. Writes a temporary .cmd script to the OpenClaw tmp directory
  2. Spawns it in a detached process (stdio: ignore, child.unref())
  3. The script waits 1 second, then calls schtasks /Run with up to 12 retries

The gateway process then shuts down cleanly, and the detached script handles the relaunch independently.

Changes

  • src/daemon/schtasks.ts: import relaunchGatewayScheduledTask and call it before schtasks /End; remove the now-unnecessary inline schtasks /Run call

Testing

Verified locally on Windows 10 (x64) with OpenClaw v2026.3.8 — gateway successfully restarts after openclaw gateway restart.

@openclaw-barnacle openclaw-barnacle bot added gateway Gateway runtime size: XS labels Mar 15, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 15, 2026

Greptile Summary

This PR fixes a real Windows-specific bug where openclaw gateway restart left the gateway permanently offline because schtasks /End killed the agent process before the inline schtasks /Run call could execute. The approach — spawning a detached .cmd retry script before ending the task — is a sound solution to the process-ownership problem.

Two issues are worth addressing before merging:

  • Port-release regression: The second waitForGatewayPortRelease(restartPort, 2_000) guard (which verified the port was actually free after force-killing busy listeners) was removed. The detached script's 12-retry loop only repeats on schtasks /Run command failures; it does not retry when schtasks /Run exits 0 but the gateway process crashes immediately because the port is still occupied. This can lead to a silent failure that looks like a successful restart.
  • Potentially dead cleanup code: terminateScheduledTaskGatewayListeners, terminateInstalledStartupRuntime, and the port-wait block all run after execSchtasks(["/End", …]). If the process is indeed killed synchronously (as the PR description states), these lines never execute, leaving port listeners and the startup runtime potentially still alive when the detached script fires schtasks /Run.

Confidence Score: 3/5

  • The PR solves the core restart bug, but removes a port-release safety net that can cause silent gateway startup failures in edge cases.
  • The fundamental approach is correct and locally verified. However, removing the second waitForGatewayPortRelease check introduces a regression where a still-occupied port produces a silent failure (the detached script reports success but the gateway never comes up), and the post-/End cleanup block is likely dead code that may leave resources in a partially-stopped state.
  • src/daemon/schtasks.ts — specifically the port-release logic and the ordering of cleanup steps relative to schtasks /End.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/daemon/schtasks.ts
Line: 727-730

Comment:
**Missing port-release verification after force-kill**

The original code had a second `waitForGatewayPortRelease(restartPort, 2_000)` guard after calling `terminateBusyPortListeners`. That 2-second check was the only mechanism that confirmed the port was actually free *before* the task was scheduled to start again. It has been removed here without a replacement.

The detached restart script's retry loop only re-runs `schtasks /Run` if the command itself fails (e.g. the task is still in a "stopping" state). It has no awareness of port availability. If `schtasks /Run` succeeds but the gateway process can't bind the port because a listener was not yet fully released, the gateway will exit immediately and silently — and the retry loop won't fire again because the initial `schtasks /Run` call returned exit code 0.

Consider restoring the port-release guard (or an equivalent check) before the function returns, for example:

```typescript
if (restartPort) {
  const released = await waitForGatewayPortRelease(restartPort);
  if (!released) {
    await terminateBusyPortListeners(restartPort);
    const releasedAfterForce = await waitForGatewayPortRelease(restartPort, 2_000);
    if (!releasedAfterForce) {
      // Port still busy; the detached script will likely start a gateway that can't bind.
      // Log a warning, or throw so the caller surfaces the issue.
    }
  }
}
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/daemon/schtasks.ts
Line: 723-731

Comment:
**Potentially unreachable cleanup code**

The PR description states: *"`schtasks /End` terminates the current process immediately"* — yet all four cleanup steps (`terminateScheduledTaskGatewayListeners`, `terminateInstalledStartupRuntime`, `waitForGatewayPortRelease`, `terminateBusyPortListeners`) sit after the `/End` call and would never run if the process is killed synchronously.

If the process is indeed killed immediately on Windows, these lines are dead code in the common path, and any cleanup they were meant to do (releasing port listeners, stopping the startup runtime) is skipped entirely. The detached restart script then races against those still-running resources.

Consider either:
- Moving critical cleanup steps **before** `execSchtasks(["/End", …])` (mirrors what `relaunchGatewayScheduledTask` itself does by spawning before ending), or
- Adding a brief comment explaining these are intentionally best-effort and may not execute.

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: 266ff0f

Comment on lines 727 to 730
const released = await waitForGatewayPortRelease(restartPort);
if (!released) {
await terminateBusyPortListeners(restartPort);
const releasedAfterForce = await waitForGatewayPortRelease(restartPort, 2_000);
if (!releasedAfterForce) {
throw new Error(`gateway port ${restartPort} is still busy before restart`);
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing port-release verification after force-kill

The original code had a second waitForGatewayPortRelease(restartPort, 2_000) guard after calling terminateBusyPortListeners. That 2-second check was the only mechanism that confirmed the port was actually free before the task was scheduled to start again. It has been removed here without a replacement.

The detached restart script's retry loop only re-runs schtasks /Run if the command itself fails (e.g. the task is still in a "stopping" state). It has no awareness of port availability. If schtasks /Run succeeds but the gateway process can't bind the port because a listener was not yet fully released, the gateway will exit immediately and silently — and the retry loop won't fire again because the initial schtasks /Run call returned exit code 0.

Consider restoring the port-release guard (or an equivalent check) before the function returns, for example:

if (restartPort) {
  const released = await waitForGatewayPortRelease(restartPort);
  if (!released) {
    await terminateBusyPortListeners(restartPort);
    const releasedAfterForce = await waitForGatewayPortRelease(restartPort, 2_000);
    if (!releasedAfterForce) {
      // Port still busy; the detached script will likely start a gateway that can't bind.
      // Log a warning, or throw so the caller surfaces the issue.
    }
  }
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/daemon/schtasks.ts
Line: 727-730

Comment:
**Missing port-release verification after force-kill**

The original code had a second `waitForGatewayPortRelease(restartPort, 2_000)` guard after calling `terminateBusyPortListeners`. That 2-second check was the only mechanism that confirmed the port was actually free *before* the task was scheduled to start again. It has been removed here without a replacement.

The detached restart script's retry loop only re-runs `schtasks /Run` if the command itself fails (e.g. the task is still in a "stopping" state). It has no awareness of port availability. If `schtasks /Run` succeeds but the gateway process can't bind the port because a listener was not yet fully released, the gateway will exit immediately and silently — and the retry loop won't fire again because the initial `schtasks /Run` call returned exit code 0.

Consider restoring the port-release guard (or an equivalent check) before the function returns, for example:

```typescript
if (restartPort) {
  const released = await waitForGatewayPortRelease(restartPort);
  if (!released) {
    await terminateBusyPortListeners(restartPort);
    const releasedAfterForce = await waitForGatewayPortRelease(restartPort, 2_000);
    if (!releasedAfterForce) {
      // Port still busy; the detached script will likely start a gateway that can't bind.
      // Log a warning, or throw so the caller surfaces the issue.
    }
  }
}
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 723 to 731
await execSchtasks(["/End", "/TN", taskName]);
await terminateScheduledTaskGatewayListeners(effectiveEnv);
await terminateInstalledStartupRuntime(effectiveEnv);
if (restartPort) {
const released = await waitForGatewayPortRelease(restartPort);
if (!released) {
await terminateBusyPortListeners(restartPort);
const releasedAfterForce = await waitForGatewayPortRelease(restartPort, 2_000);
if (!releasedAfterForce) {
throw new Error(`gateway port ${restartPort} is still busy before restart`);
}
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially unreachable cleanup code

The PR description states: "schtasks /End terminates the current process immediately" — yet all four cleanup steps (terminateScheduledTaskGatewayListeners, terminateInstalledStartupRuntime, waitForGatewayPortRelease, terminateBusyPortListeners) sit after the /End call and would never run if the process is killed synchronously.

If the process is indeed killed immediately on Windows, these lines are dead code in the common path, and any cleanup they were meant to do (releasing port listeners, stopping the startup runtime) is skipped entirely. The detached restart script then races against those still-running resources.

Consider either:

  • Moving critical cleanup steps before execSchtasks(["/End", …]) (mirrors what relaunchGatewayScheduledTask itself does by spawning before ending), or
  • Adding a brief comment explaining these are intentionally best-effort and may not execute.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/daemon/schtasks.ts
Line: 723-731

Comment:
**Potentially unreachable cleanup code**

The PR description states: *"`schtasks /End` terminates the current process immediately"* — yet all four cleanup steps (`terminateScheduledTaskGatewayListeners`, `terminateInstalledStartupRuntime`, `waitForGatewayPortRelease`, `terminateBusyPortListeners`) sit after the `/End` call and would never run if the process is killed synchronously.

If the process is indeed killed immediately on Windows, these lines are dead code in the common path, and any cleanup they were meant to do (releasing port listeners, stopping the startup runtime) is skipped entirely. The detached restart script then races against those still-running resources.

Consider either:
- Moving critical cleanup steps **before** `execSchtasks(["/End", …])` (mirrors what `relaunchGatewayScheduledTask` itself does by spawning before ending), or
- Adding a brief comment explaining these are intentionally best-effort and may not execute.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 266ff0f353

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines 727 to 730
const released = await waitForGatewayPortRelease(restartPort);
if (!released) {
await terminateBusyPortListeners(restartPort);
const releasedAfterForce = await waitForGatewayPortRelease(restartPort, 2_000);
if (!releasedAfterForce) {
throw new Error(`gateway port ${restartPort} is still busy before restart`);
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Re-check port release before marking restart complete

This restart path now stops after terminateBusyPortListeners(restartPort) without verifying that the gateway port was actually freed, then reports success. In cases where a listener cannot be killed (for example, permission-restricted PID or a non-gateway process holding the port), the detached helper still calls schtasks /Run; that command can return success even if the gateway immediately dies on bind, so users get a false "Restarted" result and the service stays down. The previous logic explicitly re-waited and threw when the port remained busy.

Useful? React with 👍 / 👎.

@EvaZenith
Copy link
Copy Markdown
Author

Thanks for the thorough review! Both points are valid — addressed in the latest commit.

Re: missing port-release verification
Restored the second waitForGatewayPortRelease(restartPort, 2_000) guard after erminateBusyPortListeners. If the port is still busy after force-kill, the function now throws before the restart script fires, rather than silently handing off to a script that would start a gateway that can't bind.

Re: potentially unreachable cleanup code
Moved all cleanup steps ( erminateScheduledTaskGatewayListeners, erminateInstalledStartupRuntime, port-release checks) to before
elaunchGatewayScheduledTask() and schtasks /End. The new order is:

  1. Cleanup + port-release (fully awaited, guaranteed to run)
  2. Spawn detached restart script
  3. schtasks /End (kills this process — nothing meaningful after this line)

This eliminates the race between the detached script and any still-running listeners.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 235a36a5bd

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +745 to 746
await execSchtasks(["/End", "/TN", taskName]);
stdout.write(`${formatLine("Restarted Scheduled Task", taskName)}\n`);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate task stop before reporting restart success

restartScheduledTask now unconditionally reports success right after execSchtasks(["/End", ...]), but it no longer checks any synchronous /Run result in this code path. When task control actually fails (for example, ACL-denied task control, or the task being removed/disabled between the earlier existence check and restart), the detached helper can fail silently and this function still returns { outcome: "completed" } with "Restarted Scheduled Task", which is a regression from the prior behavior that surfaced restart failures via the inline /Run status check.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c719874ea7

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

// task. On Windows, `schtasks /End` terminates the current process
// immediately (the agent runs inside the scheduled task), so any code after
// that call is unreliable dead code in the common path.
await terminateScheduledTaskGatewayListeners(effectiveEnv);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Spawn restart handoff before pre-end listener termination

restartScheduledTask now terminates gateway listeners before calling relaunchGatewayScheduledTask, which can kill the current process in the exact Windows fallback case where verified PID discovery misses. resolveScheduledTaskGatewayListenerPids falls back to inspectPortUsage and does not filter out process.pid, so when restart is triggered from the running scheduled-task gateway, terminateScheduledTaskGatewayListeners can end the caller before the detached /Run script is created, leaving the task stopped with no relaunch attempt.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bda85f4a43

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +741 to +742
if (!attempt.ok) {
throw new Error(`failed to prepare restart script: ${attempt.detail ?? "unknown error"}`);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Provide inline /Run fallback when handoff setup fails

In restartScheduledTask, all stop/cleanup work runs before this check, so when relaunchGatewayScheduledTask returns ok: false (for example, temp script write/spawn failures handled in src/infra/windows-task-restart.ts), this throw aborts without any /Run attempt and can leave the gateway down after a restart request. This regression is specific to environments where detached-script setup fails, and an inline /Run fallback (when still possible) would avoid turning a transient handoff error into an outage.

Useful? React with 👍 / 👎.

@EvaZenith
Copy link
Copy Markdown
Author

The failing CI check (checks-windows (node, test, 3, 6, pnpm test)) is unrelated to this PR. The 6 failing tests are all in src/tts/tts.test.ts (summary model tests), which are not touched by this change. The other 5 Windows shards all pass. This appears to be a pre-existing flaky test or an upstream issue.

@EvaZenith
Copy link
Copy Markdown
Author

Hi maintainers,

The CI failure on checks-windows (node, test, 3, 6) is unrelated to this PR. The failing tests are in src/tts/tts.test.ts with this error:

[vitest] No resolveModelAsync export is defined on the ../agents/pi-embedded-runner/model.js mock

This is a pre-existing mock issue in the TTS module, not caused by my changes to src/daemon/schtasks.ts.

My PR only modifies the Windows gateway restart logic and has been verified locally on Windows 10. Could you please review and merge manually? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gateway Gateway runtime size: XS

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Windows Gateway fails to relaunch after restart

1 participant