fix(gateway): validate config before restart + run loop crash guard (#35862)#38699
Conversation
Greptile SummaryThis PR adds two complementary safety layers for the macOS TCC (Full Disk Access) permission issue caused by gateway process crashes: a CLI pre-flight config validation guard in Changes:
Minor issues found:
Confidence Score: 4/5
Last reviewed commit: a22a891 |
| if (configError) { | ||
| fail( | ||
| `${params.serviceNoun} aborted: config is invalid.\n${configError}\nFix the config and retry, or run "openclaw doctor" to repair.`, | ||
| ); | ||
| return false; | ||
| } |
There was a problem hiding this comment.
return false is dead code and inconsistent with the rest of runServiceStart
fail() calls defaultRuntime.exit(1) which terminates the process in production, so return false on line 207 is never reached at runtime. In tests, exit throws, so again return false is unreachable. All other early-exit paths in runServiceStart use a bare return (returning undefined), and the function has no declared return type. The return false looks like an accidental copy-paste from runServiceRestart, which has an explicit Promise<boolean> return type.
| if (configError) { | |
| fail( | |
| `${params.serviceNoun} aborted: config is invalid.\n${configError}\nFix the config and retry, or run "openclaw doctor" to repair.`, | |
| ); | |
| return false; | |
| } | |
| fail( | |
| `${params.serviceNoun} aborted: config is invalid.\n${configError}\nFix the config and retry, or run "openclaw doctor" to repair.`, | |
| ); | |
| return; |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/daemon-cli/lifecycle-core.ts
Line: 203-208
Comment:
**`return false` is dead code and inconsistent with the rest of `runServiceStart`**
`fail()` calls `defaultRuntime.exit(1)` which terminates the process in production, so `return false` on line 207 is never reached at runtime. In tests, `exit` throws, so again `return false` is unreachable. All other early-exit paths in `runServiceStart` use a bare `return` (returning `undefined`), and the function has no declared return type. The `return false` looks like an accidental copy-paste from `runServiceRestart`, which has an explicit `Promise<boolean>` return type.
```suggestion
fail(
`${params.serviceNoun} aborted: config is invalid.\n${configError}\nFix the config and retry, or run "openclaw doctor" to repair.`,
);
return;
```
How can I resolve this? If you propose a fix, please make it concise.| } catch (err) { | ||
| // If startup fails (e.g., invalid config after a config-triggered | ||
| // restart), keep the process alive and wait for the next SIGUSR1 | ||
| // instead of crashing. A crash here would respawn a new process that | ||
| // loses macOS Full Disk Access (TCC permissions are PID-bound). (#35862) | ||
| server = null; | ||
| gatewayLog.error( | ||
| `gateway startup failed: ${err instanceof Error ? err.message : String(err)}. ` + | ||
| "Process will stay alive; fix the issue and restart.", | ||
| ); | ||
| } |
There was a problem hiding this comment.
Stack trace dropped from crash-guard error log
Only err.message is captured in the logged string. If startup fails due to an unexpected internal error (not just a config issue), the stack trace is silently discarded, making it harder to diagnose the root cause. Consider including the full stack when available:
| } catch (err) { | |
| // If startup fails (e.g., invalid config after a config-triggered | |
| // restart), keep the process alive and wait for the next SIGUSR1 | |
| // instead of crashing. A crash here would respawn a new process that | |
| // loses macOS Full Disk Access (TCC permissions are PID-bound). (#35862) | |
| server = null; | |
| gatewayLog.error( | |
| `gateway startup failed: ${err instanceof Error ? err.message : String(err)}. ` + | |
| "Process will stay alive; fix the issue and restart.", | |
| ); | |
| } | |
| server = null; | |
| gatewayLog.error( | |
| `gateway startup failed: ${err instanceof Error ? err.message : String(err)}. ` + | |
| "Process will stay alive; fix the issue and restart.", | |
| err instanceof Error ? err : undefined, | |
| ); |
(only if gatewayLog.error accepts a second cause argument — otherwise logging err.stack inline is an equivalent alternative)
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/gateway-cli/run-loop.ts
Line: 193-203
Comment:
**Stack trace dropped from crash-guard error log**
Only `err.message` is captured in the logged string. If startup fails due to an unexpected internal error (not just a config issue), the stack trace is silently discarded, making it harder to diagnose the root cause. Consider including the full stack when available:
```suggestion
server = null;
gatewayLog.error(
`gateway startup failed: ${err instanceof Error ? err.message : String(err)}. ` +
"Process will stay alive; fix the issue and restart.",
err instanceof Error ? err : undefined,
);
```
_(only if `gatewayLog.error` accepts a second cause argument — otherwise logging `err.stack` inline is an equivalent alternative)_
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a22a8910c2
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| */ | ||
| async function getConfigValidationError(): Promise<string | null> { | ||
| try { | ||
| const snapshot = await readConfigFileSnapshot(); |
There was a problem hiding this comment.
Validate config using service env before aborting start/restart
This pre-flight check reads config with readConfigFileSnapshot() against the current CLI environment, so configs that intentionally use ${ENV_VAR} values provided only by launchd/systemd are treated as invalid and gateway/node daemon start|restart are aborted. readConfigFileSnapshot marks missing env substitutions as invalid (see src/config/io.ts around env-substitution failure handling), so operators can no longer restart a healthy service unless they also export all runtime secrets in their shell. That is a regression from previous behavior and should use the service runtime env (or ignore substitution-only failures) before blocking.
Useful? React with 👍 / 👎.
src/cli/gateway-cli/run-loop.ts
Outdated
| try { | ||
| server = await params.start(); | ||
| } catch (err) { | ||
| // If startup fails (e.g., invalid config after a config-triggered | ||
| // restart), keep the process alive and wait for the next SIGUSR1 |
There was a problem hiding this comment.
Re-throw initial gateway startup failures
Catching all params.start() exceptions here prevents startup failures from ever reaching runGatewayCommand's outer try/catch (src/cli/gateway-cli/run.ts), which is what reports "Gateway failed to start" and exits non-zero. After this change, an initial bind/auth/config failure leaves the process alive but not serving (and still holding the gateway lock) until a manual restart signal, so supervised deployments can get stuck in a permanently "running but dead" state instead of failing fast and recovering.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6620f34046
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| */ | ||
| async function getConfigValidationError(): Promise<string | null> { | ||
| try { | ||
| const snapshot = await readConfigFileSnapshot(); |
There was a problem hiding this comment.
Use daemon env for pre-flight config snapshot
getConfigValidationError() validates via readConfigFileSnapshot() using the current CLI environment, but gateway start/restart acts on the installed service, which may run with a different OPENCLAW_STATE_DIR/OPENCLAW_CONFIG_PATH and env substitutions from its persisted service environment. In that mismatch case, this guard can validate the wrong file (allowing a restart that still crashes) or block a healthy restart because only the shell config is invalid, so the regression this patch is trying to prevent remains in multi-profile/service-env setups.
Useful? React with 👍 / 👎.
|
Re: Codex P1 (use daemon env for config snapshot): Acknowledged. This is intentionally best-effort — The current behavior catches the most common case (user edits This could be a follow-up improvement, but shouldn't block the core crash prevention fix. |
6620f34 to
ab67b62
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ab67b62801
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
src/cli/daemon-cli/lifecycle-core.ts
Outdated
| // Pre-flight config validation (#35862) | ||
| { | ||
| const configError = await getConfigValidationError(); | ||
| if (configError) { |
There was a problem hiding this comment.
Validate config before signaling unmanaged restart
runServiceRestart performs the new pre-flight config check only after onNotLoaded has already run, but the gateway caller’s onNotLoaded path (runDaemonRestart → restartGatewayWithoutServiceManager) sends SIGUSR1 immediately to the unmanaged process; with an invalid config, this means the restart is triggered before we fail with "aborted: config is invalid," so the guard does not actually prevent the risky restart in that flow and can report a misleading abort after side effects have already happened.
Useful? React with 👍 / 👎.
ab67b62 to
c02f635
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c02f63535d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| */ | ||
| async function getConfigValidationError(): Promise<string | null> { | ||
| try { | ||
| const snapshot = await readConfigFileSnapshot(); |
There was a problem hiding this comment.
Validate restart config with service environment
getConfigValidationError() reads readConfigFileSnapshot() from the current CLI environment, but runServiceStart/runServiceRestart operate on a service that may have different persisted env (notably OPENCLAW_STATE_DIR, OPENCLAW_CONFIG_PATH, and service-only substitutions), so this check can both block healthy restarts and miss invalid service-side configs that still crash after restart. Fresh evidence in this commit is the new function comment explicitly stating validation is done in the CLI environment.
Useful? React with 👍 / 👎.
| if (isFirstStart) { | ||
| throw err; | ||
| } | ||
| server = null; |
There was a problem hiding this comment.
Avoid trapping restart failures in non-listening loop state
After the first successful boot, this branch swallows params.start() failures and keeps the process alive with server = null; in unmanaged/in-process restart mode (for example OPENCLAW_NO_RESPAWN=1 or respawn fallback), daemon restart/stop can only signal PIDs discovered from listeners on the gateway port (restartGatewayWithoutServiceManager), so a failed restart before bind leaves no signalable PID while the process still holds the gateway lock, forcing manual process cleanup.
Useful? React with 👍 / 👎.
… permission loss (openclaw#35862) When 'openclaw gateway restart' is run with an invalid config, the new process crashes on startup due to config validation failure. On macOS, this causes Full Disk Access (TCC) permissions to be lost because the respawned process has a different PID. Add getConfigValidationError() helper and pre-flight config validation in both runServiceRestart() and runServiceStart(). If config is invalid, abort with a clear error message instead of crashing. The config watcher's hot-reload path already had this guard (handleInvalidSnapshot), but the CLI restart/start commands did not. AI-assisted (OpenClaw agent, fully tested)
…it (openclaw#35862) When an in-process restart (SIGUSR1) triggers a config-triggered restart and the new config is invalid, params.start() throws and the while loop exits, killing the process. On macOS this loses TCC permissions. Wrap params.start() in try/catch: on failure, set server=null, log the error, and wait for the next SIGUSR1 instead of crashing.
Address Greptile review: add test coverage for runServiceStart path. The error message copy-paste issue was already fixed in the DRY refactor (uses params.serviceNoun instead of hardcoded 'restart').
- Remove dead 'return false' in runServiceStart (Greptile) - Include stack trace in run-loop crash guard error log (Greptile) - Only catch startup errors on subsequent restarts, not initial start (Codex P1) - Add JSDoc note about env var false positive edge case (Codex P1)
…Codex P2) The config check was positioned after onNotLoaded, which could send SIGUSR1 to an unmanaged process before config was validated.
- Release gateway lock when in-process restart fails, so daemon restart/stop can still manage the process (Codex P2) - P1 (env mismatch) already addressed: best-effort by design, documented in JSDoc
c02f635 to
ca0cc13
Compare
|
Addressed latest Codex reviews: P1 (env mismatch): Already addressed — this is best-effort by design. P2 (lock held with no listener): Fixed — now calls |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ca0cc13a32
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| // discovers PIDs via the gateway port) can still manage the process. | ||
| // Without this, the process holds the lock but is not listening, | ||
| // forcing manual cleanup. (#35862) | ||
| await releaseLockIfHeld(); |
There was a problem hiding this comment.
Reacquire lock before retrying a failed in-process restart
Calling releaseLockIfHeld() in the restart-failure catch path drops the gateway lock, but later restart attempts only reacquire when handleRestartAfterServerClose() sees hadLock === true. In OPENCLAW_NO_RESPAWN=1 (or respawn failed) mode, a second SIGUSR1 after this failure will restart the loop without reacquiring the lock, so the gateway can come back up lockless and allow duplicate/coordinated-management breakage across processes.
Useful? React with 👍 / 👎.
|
Landed via temp rebase onto main.
Thanks @lml2468! |
What
Add pre-flight config validation before
gateway restartandgateway start, plus crash guard in the gateway run loop.Why
Fixes #35862 — When
openclaw gateway restartis run with invalid config, the new process crashes on startup. On macOS, this causes Full Disk Access (TCC) permissions to be lost because launchd respawns with a different PID.How
Two layers of defense:
lifecycle-core.ts):getConfigValidationError()validates config beforerunServiceRestart()andrunServiceStart(). If invalid, aborts with clear error message.run-loop.ts): Wrapsparams.start()in try/catch. On failure, setsserver=null, logs error, and waits for next SIGUSR1 instead of crashing.The config watcher's hot-reload path already had this guard (
handleInvalidSnapshot), but CLI commands and the run loop did not.Testing
pnpm build+pnpm checkpassReplaces #35924 which was auto-closed by barnacle bot (PR slot limit).