Skip to content

fix(gateway): validate config before restart + run loop crash guard (#35862)#38699

Merged
steipete merged 6 commits intoopenclaw:mainfrom
lml2468:fix/issue-35862-restart-config-validation
Mar 9, 2026
Merged

fix(gateway): validate config before restart + run loop crash guard (#35862)#38699
steipete merged 6 commits intoopenclaw:mainfrom
lml2468:fix/issue-35862-restart-config-validation

Conversation

@lml2468
Copy link
Copy Markdown
Contributor

@lml2468 lml2468 commented Mar 7, 2026

What

Add pre-flight config validation before gateway restart and gateway start, plus crash guard in the gateway run loop.

Why

Fixes #35862 — When openclaw gateway restart is run with invalid config, the new process crashes on startup. On macOS, this causes Full Disk Access (TCC) permissions to be lost because launchd respawns with a different PID.

How

Two layers of defense:

  1. CLI pre-check (lifecycle-core.ts): getConfigValidationError() validates config before runServiceRestart() and runServiceStart(). If invalid, aborts with clear error message.
  2. Run loop crash guard (run-loop.ts): Wraps params.start() in try/catch. On failure, sets server=null, logs error, and waits for next SIGUSR1 instead of crashing.

The config watcher's hot-reload path already had this guard (handleInvalidSnapshot), but CLI commands and the run loop did not.

Testing

  • pnpm build + pnpm check pass
  • 6 new tests (restart + start, invalid/valid/missing/error cases)
  • 3 existing lifecycle-core tests pass
  • AI-assisted (OpenClaw agent, fully tested)

Replaces #35924 which was auto-closed by barnacle bot (PR slot limit).

@openclaw-barnacle openclaw-barnacle bot added cli CLI command changes size: M labels Mar 7, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 7, 2026

Greptile Summary

This PR adds two complementary safety layers for the macOS TCC (Full Disk Access) permission issue caused by gateway process crashes: a CLI pre-flight config validation guard in lifecycle-core.ts and a run-loop crash guard in run-loop.ts. The approach is well-reasoned and the implementation fits cleanly into the existing pattern used by the config watcher's hot-reload path.

Changes:

  • getConfigValidationError() performs a best-effort readConfigFileSnapshot() call before runServiceStart and runServiceRestart, aborting with a clear error message if the config is invalid (missing/unreadable configs are intentionally allowed through).
  • The while(true) run loop in run-loop.ts wraps params.start() in a try/catch; on failure it sets server = null, logs the error, and blocks on the restart-resolver promise instead of propagating the exception and crashing.
  • 6 new tests cover the abort (invalid config) and proceed (valid/missing/error) paths for both runServiceRestart and runServiceStart.

Minor issues found:

  • runServiceStart's config-error branch uses return false — dead code since fail() always calls process.exit(1) (or throws in tests) before return is reached, and inconsistent with the bare return used in all other early-exit paths in the same function.
  • The crash-guard in run-loop.ts logs only err.message, discarding the stack trace, which can make it harder to diagnose unexpected startup failures.

Confidence Score: 4/5

  • This PR is safe to merge — the two changes are well-scoped, correctly implemented, and guarded by new tests.
  • The core logic is sound: the pre-flight check correctly gates only on exists && !valid, gracefully handling missing files and read errors, and the run-loop catch block correctly interacts with the existing signal handler / restart-resolver machinery. The only issues are minor style inconsistencies (a dead return false and a missing stack trace in the error log) that don't affect correctness.
  • No files require special attention beyond the minor style notes on lifecycle-core.ts (dead return false) and run-loop.ts (stack trace logging).

Last reviewed commit: a22a891

Comment on lines +203 to +208
if (configError) {
fail(
`${params.serviceNoun} aborted: config is invalid.\n${configError}\nFix the config and retry, or run "openclaw doctor" to repair.`,
);
return false;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return false is dead code and inconsistent with the rest of runServiceStart

fail() calls defaultRuntime.exit(1) which terminates the process in production, so return false on line 207 is never reached at runtime. In tests, exit throws, so again return false is unreachable. All other early-exit paths in runServiceStart use a bare return (returning undefined), and the function has no declared return type. The return false looks like an accidental copy-paste from runServiceRestart, which has an explicit Promise<boolean> return type.

Suggested change
if (configError) {
fail(
`${params.serviceNoun} aborted: config is invalid.\n${configError}\nFix the config and retry, or run "openclaw doctor" to repair.`,
);
return false;
}
fail(
`${params.serviceNoun} aborted: config is invalid.\n${configError}\nFix the config and retry, or run "openclaw doctor" to repair.`,
);
return;
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/daemon-cli/lifecycle-core.ts
Line: 203-208

Comment:
**`return false` is dead code and inconsistent with the rest of `runServiceStart`**

`fail()` calls `defaultRuntime.exit(1)` which terminates the process in production, so `return false` on line 207 is never reached at runtime. In tests, `exit` throws, so again `return false` is unreachable. All other early-exit paths in `runServiceStart` use a bare `return` (returning `undefined`), and the function has no declared return type. The `return false` looks like an accidental copy-paste from `runServiceRestart`, which has an explicit `Promise<boolean>` return type.

```suggestion
      fail(
        `${params.serviceNoun} aborted: config is invalid.\n${configError}\nFix the config and retry, or run "openclaw doctor" to repair.`,
      );
      return;
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +193 to +203
} catch (err) {
// If startup fails (e.g., invalid config after a config-triggered
// restart), keep the process alive and wait for the next SIGUSR1
// instead of crashing. A crash here would respawn a new process that
// loses macOS Full Disk Access (TCC permissions are PID-bound). (#35862)
server = null;
gatewayLog.error(
`gateway startup failed: ${err instanceof Error ? err.message : String(err)}. ` +
"Process will stay alive; fix the issue and restart.",
);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stack trace dropped from crash-guard error log

Only err.message is captured in the logged string. If startup fails due to an unexpected internal error (not just a config issue), the stack trace is silently discarded, making it harder to diagnose the root cause. Consider including the full stack when available:

Suggested change
} catch (err) {
// If startup fails (e.g., invalid config after a config-triggered
// restart), keep the process alive and wait for the next SIGUSR1
// instead of crashing. A crash here would respawn a new process that
// loses macOS Full Disk Access (TCC permissions are PID-bound). (#35862)
server = null;
gatewayLog.error(
`gateway startup failed: ${err instanceof Error ? err.message : String(err)}. ` +
"Process will stay alive; fix the issue and restart.",
);
}
server = null;
gatewayLog.error(
`gateway startup failed: ${err instanceof Error ? err.message : String(err)}. ` +
"Process will stay alive; fix the issue and restart.",
err instanceof Error ? err : undefined,
);

(only if gatewayLog.error accepts a second cause argument — otherwise logging err.stack inline is an equivalent alternative)

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/gateway-cli/run-loop.ts
Line: 193-203

Comment:
**Stack trace dropped from crash-guard error log**

Only `err.message` is captured in the logged string. If startup fails due to an unexpected internal error (not just a config issue), the stack trace is silently discarded, making it harder to diagnose the root cause. Consider including the full stack when available:

```suggestion
        server = null;
        gatewayLog.error(
          `gateway startup failed: ${err instanceof Error ? err.message : String(err)}. ` +
            "Process will stay alive; fix the issue and restart.",
          err instanceof Error ? err : undefined,
        );
```

_(only if `gatewayLog.error` accepts a second cause argument — otherwise logging `err.stack` inline is an equivalent alternative)_

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a22a8910c2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

*/
async function getConfigValidationError(): Promise<string | null> {
try {
const snapshot = await readConfigFileSnapshot();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Validate config using service env before aborting start/restart

This pre-flight check reads config with readConfigFileSnapshot() against the current CLI environment, so configs that intentionally use ${ENV_VAR} values provided only by launchd/systemd are treated as invalid and gateway/node daemon start|restart are aborted. readConfigFileSnapshot marks missing env substitutions as invalid (see src/config/io.ts around env-substitution failure handling), so operators can no longer restart a healthy service unless they also export all runtime secrets in their shell. That is a regression from previous behavior and should use the service runtime env (or ignore substitution-only failures) before blocking.

Useful? React with 👍 / 👎.

Comment on lines +191 to +195
try {
server = await params.start();
} catch (err) {
// If startup fails (e.g., invalid config after a config-triggered
// restart), keep the process alive and wait for the next SIGUSR1
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Re-throw initial gateway startup failures

Catching all params.start() exceptions here prevents startup failures from ever reaching runGatewayCommand's outer try/catch (src/cli/gateway-cli/run.ts), which is what reports "Gateway failed to start" and exits non-zero. After this change, an initial bind/auth/config failure leaves the process alive but not serving (and still holding the gateway lock) until a manual restart signal, so supervised deployments can get stuck in a permanently "running but dead" state instead of failing fast and recovering.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6620f34046

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

*/
async function getConfigValidationError(): Promise<string | null> {
try {
const snapshot = await readConfigFileSnapshot();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Use daemon env for pre-flight config snapshot

getConfigValidationError() validates via readConfigFileSnapshot() using the current CLI environment, but gateway start/restart acts on the installed service, which may run with a different OPENCLAW_STATE_DIR/OPENCLAW_CONFIG_PATH and env substitutions from its persisted service environment. In that mismatch case, this guard can validate the wrong file (allowing a restart that still crashes) or block a healthy restart because only the shell config is invalid, so the regression this patch is trying to prevent remains in multi-profile/service-env setups.

Useful? React with 👍 / 👎.

@lml2468
Copy link
Copy Markdown
Contributor Author

lml2468 commented Mar 8, 2026

Re: Codex P1 (use daemon env for config snapshot):

Acknowledged. This is intentionally best-effort — readConfigFileSnapshot() doesn't accept a custom env parameter, and threading the service environment through would require changes to the config I/O layer.

The current behavior catches the most common case (user edits openclaw.json with an invalid key, then runs gateway restart). The env-mismatch edge case (service uses different OPENCLAW_STATE_DIR / env substitutions) is documented in the JSDoc comment. If a false positive blocks a valid restart, the error message directs users to openclaw doctor.

This could be a follow-up improvement, but shouldn't block the core crash prevention fix.

@lml2468 lml2468 force-pushed the fix/issue-35862-restart-config-validation branch from 6620f34 to ab67b62 Compare March 8, 2026 11:30
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ab67b62801

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +391 to +394
// Pre-flight config validation (#35862)
{
const configError = await getConfigValidationError();
if (configError) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate config before signaling unmanaged restart

runServiceRestart performs the new pre-flight config check only after onNotLoaded has already run, but the gateway caller’s onNotLoaded path (runDaemonRestartrestartGatewayWithoutServiceManager) sends SIGUSR1 immediately to the unmanaged process; with an invalid config, this means the restart is triggered before we fail with "aborted: config is invalid," so the guard does not actually prevent the risky restart in that flow and can report a misleading abort after side effects have already happened.

Useful? React with 👍 / 👎.

@lml2468 lml2468 force-pushed the fix/issue-35862-restart-config-validation branch from ab67b62 to c02f635 Compare March 9, 2026 00:26
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c02f63535d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

*/
async function getConfigValidationError(): Promise<string | null> {
try {
const snapshot = await readConfigFileSnapshot();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Validate restart config with service environment

getConfigValidationError() reads readConfigFileSnapshot() from the current CLI environment, but runServiceStart/runServiceRestart operate on a service that may have different persisted env (notably OPENCLAW_STATE_DIR, OPENCLAW_CONFIG_PATH, and service-only substitutions), so this check can both block healthy restarts and miss invalid service-side configs that still crash after restart. Fresh evidence in this commit is the new function comment explicitly stating validation is done in the CLI environment.

Useful? React with 👍 / 👎.

Comment on lines +202 to +205
if (isFirstStart) {
throw err;
}
server = null;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid trapping restart failures in non-listening loop state

After the first successful boot, this branch swallows params.start() failures and keeps the process alive with server = null; in unmanaged/in-process restart mode (for example OPENCLAW_NO_RESPAWN=1 or respawn fallback), daemon restart/stop can only signal PIDs discovered from listeners on the gateway port (restartGatewayWithoutServiceManager), so a failed restart before bind leaves no signalable PID while the process still holds the gateway lock, forcing manual process cleanup.

Useful? React with 👍 / 👎.

lml2468 added 6 commits March 9, 2026 12:35
… permission loss (openclaw#35862)

When 'openclaw gateway restart' is run with an invalid config, the new
process crashes on startup due to config validation failure. On macOS,
this causes Full Disk Access (TCC) permissions to be lost because the
respawned process has a different PID.

Add getConfigValidationError() helper and pre-flight config validation
in both runServiceRestart() and runServiceStart(). If config is invalid,
abort with a clear error message instead of crashing.

The config watcher's hot-reload path already had this guard
(handleInvalidSnapshot), but the CLI restart/start commands did not.

AI-assisted (OpenClaw agent, fully tested)
…it (openclaw#35862)

When an in-process restart (SIGUSR1) triggers a config-triggered restart
and the new config is invalid, params.start() throws and the while loop
exits, killing the process. On macOS this loses TCC permissions.

Wrap params.start() in try/catch: on failure, set server=null, log the
error, and wait for the next SIGUSR1 instead of crashing.
Address Greptile review: add test coverage for runServiceStart path.
The error message copy-paste issue was already fixed in the DRY refactor
(uses params.serviceNoun instead of hardcoded 'restart').
- Remove dead 'return false' in runServiceStart (Greptile)
- Include stack trace in run-loop crash guard error log (Greptile)
- Only catch startup errors on subsequent restarts, not initial start (Codex P1)
- Add JSDoc note about env var false positive edge case (Codex P1)
…Codex P2)

The config check was positioned after onNotLoaded, which could send
SIGUSR1 to an unmanaged process before config was validated.
- Release gateway lock when in-process restart fails, so daemon
  restart/stop can still manage the process (Codex P2)
- P1 (env mismatch) already addressed: best-effort by design, documented
  in JSDoc
@lml2468 lml2468 force-pushed the fix/issue-35862-restart-config-validation branch from c02f635 to ca0cc13 Compare March 9, 2026 04:36
@lml2468
Copy link
Copy Markdown
Contributor Author

lml2468 commented Mar 9, 2026

Addressed latest Codex reviews:

P1 (env mismatch): Already addressed — this is best-effort by design. readConfigFileSnapshot() validates in CLI env; the JSDoc documents this limitation. A false positive is safer than a crash. Follow-up could thread service env through, but requires config I/O changes.

P2 (lock held with no listener): Fixed — now calls releaseLockIfHeld() after restart failure so daemon restart/stop can still discover and manage the process. Commit ca0cc13.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ca0cc13a32

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

// discovers PIDs via the gateway port) can still manage the process.
// Without this, the process holds the lock but is not listening,
// forcing manual cleanup. (#35862)
await releaseLockIfHeld();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reacquire lock before retrying a failed in-process restart

Calling releaseLockIfHeld() in the restart-failure catch path drops the gateway lock, but later restart attempts only reacquire when handleRestartAfterServerClose() sees hadLock === true. In OPENCLAW_NO_RESPAWN=1 (or respawn failed) mode, a second SIGUSR1 after this failure will restart the loop without reacquiring the lock, so the gateway can come back up lockless and allow duplicate/coordinated-management breakage across processes.

Useful? React with 👍 / 👎.

@steipete steipete merged commit f84adcb into openclaw:main Mar 9, 2026
25 of 28 checks passed
@steipete
Copy link
Copy Markdown
Contributor

steipete commented Mar 9, 2026

Landed via temp rebase onto main.

  • Gate: pnpm check && pnpm build; pnpm test OOMed locally, then OPENCLAW_TEST_PROFILE=serial OPENCLAW_TEST_VM_FORKS=0 OPENCLAW_TEST_SHARDS=4 OPENCLAW_TEST_MAX_OLD_SPACE_SIZE_MB=8192 pnpm test passed shards 1-2 and then hung idle in shard 3 with no failing test output.
  • Land commit: 67e22fa
  • Merge commit: f84adcb

Thanks @lml2468!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cli CLI command changes size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: gateway crash on invalid config causes macOS Full Disk Access loss

2 participants