Skip to content

fix(gateway): validate config before restart to prevent crash + macOS permission loss (#35862)#35924

Closed
lml2468 wants to merge 3 commits intoopenclaw:mainfrom
lml2468:fix/issue-35862-restart-config-validation
Closed

fix(gateway): validate config before restart to prevent crash + macOS permission loss (#35862)#35924
lml2468 wants to merge 3 commits intoopenclaw:mainfrom
lml2468:fix/issue-35862-restart-config-validation

Conversation

@lml2468
Copy link
Copy Markdown
Contributor

@lml2468 lml2468 commented Mar 5, 2026

What

Add pre-flight config validation before gateway restart and gateway start to prevent crashes from invalid config.

Why

Fixes #35862 — When openclaw gateway restart is run with an invalid config (e.g., unrecognized key like agents.defaults.pdfModel), the new process crashes on startup due to config validation (.strict() rejection). On macOS, this causes Full Disk Access (TCC) permissions to be lost because launchd respawns a new process with a different PID, which doesn't inherit TCC grants.

How

  • runServiceRestart() and runServiceStart() in lifecycle-core.ts now call readConfigFileSnapshot() before proceeding
  • If config exists and is invalid: abort with a clear error message (including the specific validation issues) and suggest openclaw doctor to repair
  • If snapshot read fails: proceed anyway (best-effort, don't block on infra errors)
  • The config watcher's hot-reload path already had this guard (handleInvalidSnapshot in config-reload.ts), but the CLI commands did not

Testing

  • pnpm build passes
  • pnpm check passes
  • 4 new tests:
    • Aborts restart when config is invalid ✅
    • Proceeds when config is valid ✅
    • Proceeds when config file doesn't exist ✅
    • Proceeds when snapshot read throws (best-effort) ✅
  • 3 existing lifecycle-core tests still pass
  • AI-assisted (OpenClaw agent, fully tested)

Files changed

File Change
src/cli/daemon-cli/lifecycle-core.ts Pre-flight config validation in runServiceRestart + runServiceStart
src/cli/daemon-cli/lifecycle-core.config-guard.test.ts 4 new test cases

@openclaw-barnacle openclaw-barnacle bot added cli CLI command changes size: M labels Mar 5, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 5, 2026

Greptile Summary

This PR adds pre-flight config validation to runServiceRestart and runServiceStart to prevent the gateway from crashing on startup with invalid config. This is important on macOS where crashes cause TCC/Full Disk Access permissions to be lost.

Key findings:

  1. User-facing error message bug in runServiceStart (line 198): The validation error message says "Gateway restart aborted" when it should say "Gateway start aborted". Users running openclaw gateway start will see a confusing "restart" verb that doesn't match their command.

  2. Stale comments in runServiceStart (lines 178–180): The inline comments copy-pasted from runServiceRestart still say "refuse to restart" and "after restart" when they should describe the start path.

  3. Test coverage gap: The test file only covers runServiceRestart, not runServiceStart. This gap is what allowed the error message bug above to slip through.

The core fix for runServiceRestart is correct and well-tested. Once the two runServiceStart issues are addressed, the PR will be safe to merge.

Confidence Score: 3/5

  • Safe to merge after correcting the error message and comments in runServiceStart, and ideally adding test coverage for that path.
  • The core fix for runServiceRestart is correct and well-tested. However, the same validation block was copy-pasted into runServiceStart with two issues: (1) user-facing error message still says "restart aborted" instead of "start aborted", and (2) inline comments still describe the restart path. These are copy-paste oversights rather than logic errors, but they need correction. The missing test cases for runServiceStart prevented these issues from being caught.
  • src/cli/daemon-cli/lifecycle-core.ts (fix error message and comments in runServiceStart)

Last reviewed commit: 8516b73

Comment on lines +197 to +199
fail(
`${params.serviceNoun} restart aborted: config is invalid.\n${configIssues}\nFix the config and retry, or run "openclaw doctor" to repair.`,
);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message says "restart aborted" but this code is in runServiceStart(). A user running openclaw gateway start would see a confusing "Gateway restart aborted: config is invalid." message, suggesting they ran a restart command instead.

Suggested change
fail(
`${params.serviceNoun} restart aborted: config is invalid.\n${configIssues}\nFix the config and retry, or run "openclaw doctor" to repair.`,
);
fail(
`${params.serviceNoun} start aborted: config is invalid.\n${configIssues}\nFix the config and retry, or run "openclaw doctor" to repair.`,
);
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/daemon-cli/lifecycle-core.ts
Line: 197-199

Comment:
The error message says `"restart aborted"` but this code is in `runServiceStart()`. A user running `openclaw gateway start` would see a confusing `"Gateway restart aborted: config is invalid."` message, suggesting they ran a restart command instead.

```suggestion
      fail(
        `${params.serviceNoun} start aborted: config is invalid.\n${configIssues}\nFix the config and retry, or run "openclaw doctor" to repair.`,
      );
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +178 to +180
// Pre-flight config validation: refuse to restart if config is invalid.
// A crash on startup after restart would spawn a new process that loses
// macOS Full Disk Access (TCC permissions are PID-bound). (#35862)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inline comments say "refuse to restart" and "after restart", but this is runServiceStart(). They should describe the start path:

Suggested change
// Pre-flight config validation: refuse to restart if config is invalid.
// A crash on startup after restart would spawn a new process that loses
// macOS Full Disk Access (TCC permissions are PID-bound). (#35862)
// Pre-flight config validation: refuse to start if config is invalid.
// A crash on startup would spawn a new process that loses
// macOS Full Disk Access (TCC permissions are PID-bound). (#35862)
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/daemon-cli/lifecycle-core.ts
Line: 178-180

Comment:
The inline comments say "refuse to restart" and "after restart", but this is `runServiceStart()`. They should describe the start path:

```suggestion
  // Pre-flight config validation: refuse to start if config is invalid.
  // A crash on startup would spawn a new process that loses
  // macOS Full Disk Access (TCC permissions are PID-bound). (#35862)
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +45 to +145
describe("runServiceRestart config pre-flight (#35862)", () => {
let runServiceRestart: typeof import("./lifecycle-core.js").runServiceRestart;

beforeAll(async () => {
({ runServiceRestart } = await import("./lifecycle-core.js"));
});

beforeEach(() => {
runtimeLogs.length = 0;
readConfigFileSnapshotMock.mockReset();
readConfigFileSnapshotMock.mockResolvedValue({
exists: true,
valid: true,
config: {},
issues: [],
});
loadConfig.mockReset();
loadConfig.mockReturnValue({});
service.isLoaded.mockClear();
service.readCommand.mockClear();
service.restart.mockClear();
service.isLoaded.mockResolvedValue(true);
service.readCommand.mockResolvedValue({ environment: {} });
service.restart.mockResolvedValue(undefined);
vi.unstubAllEnvs();
vi.stubEnv("OPENCLAW_GATEWAY_TOKEN", "");
vi.stubEnv("CLAWDBOT_GATEWAY_TOKEN", "");
});

it("aborts restart when config is invalid", async () => {
readConfigFileSnapshotMock.mockResolvedValue({
exists: true,
valid: false,
config: {},
issues: [{ path: "agents.defaults.pdfModel", message: "Unrecognized key" }],
});

await expect(
runServiceRestart({
serviceNoun: "Gateway",
service,
renderStartHints: () => [],
opts: { json: true },
}),
).rejects.toThrow("__exit__:1");

expect(service.restart).not.toHaveBeenCalled();
});

it("proceeds with restart when config is valid", async () => {
readConfigFileSnapshotMock.mockResolvedValue({
exists: true,
valid: true,
config: {},
issues: [],
});

const result = await runServiceRestart({
serviceNoun: "Gateway",
service,
renderStartHints: () => [],
opts: { json: true },
});

expect(result).toBe(true);
expect(service.restart).toHaveBeenCalledTimes(1);
});

it("proceeds with restart when config file does not exist", async () => {
readConfigFileSnapshotMock.mockResolvedValue({
exists: false,
valid: true,
config: {},
issues: [],
});

const result = await runServiceRestart({
serviceNoun: "Gateway",
service,
renderStartHints: () => [],
opts: { json: true },
});

expect(result).toBe(true);
expect(service.restart).toHaveBeenCalledTimes(1);
});

it("proceeds with restart when snapshot read throws", async () => {
readConfigFileSnapshotMock.mockRejectedValue(new Error("read failed"));

const result = await runServiceRestart({
serviceNoun: "Gateway",
service,
renderStartHints: () => [],
opts: { json: true },
});

expect(result).toBe(true);
expect(service.restart).toHaveBeenCalledTimes(1);
});
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new validation block was added to both runServiceStart and runServiceRestart, but test coverage only exists for runServiceRestart. The missing test cases for runServiceStart allowed the copy-paste error in lifecycle-core.ts line 198 (incorrect "restart aborted" message in the start path) to go undetected.

Consider adding a parallel describe("runServiceStart config pre-flight...") block with the same four test cases (invalid config aborts, valid config proceeds, file not found proceeds, snapshot read throws proceeds) to ensure consistency between both functions.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/daemon-cli/lifecycle-core.config-guard.test.ts
Line: 45-145

Comment:
The new validation block was added to both `runServiceStart` and `runServiceRestart`, but test coverage only exists for `runServiceRestart`. The missing test cases for `runServiceStart` allowed the copy-paste error in lifecycle-core.ts line 198 (incorrect "restart aborted" message in the start path) to go undetected.

Consider adding a parallel `describe("runServiceStart config pre-flight...")` block with the same four test cases (invalid config aborts, valid config proceeds, file not found proceeds, snapshot read throws proceeds) to ensure consistency between both functions.

How can I resolve this? If you propose a fix, please make it concise.

… permission loss (openclaw#35862)

When 'openclaw gateway restart' is run with an invalid config, the new
process crashes on startup due to config validation failure. On macOS,
this causes Full Disk Access (TCC) permissions to be lost because the
respawned process has a different PID.

Add getConfigValidationError() helper and pre-flight config validation
in both runServiceRestart() and runServiceStart(). If config is invalid,
abort with a clear error message instead of crashing.

The config watcher's hot-reload path already had this guard
(handleInvalidSnapshot), but the CLI restart/start commands did not.

AI-assisted (OpenClaw agent, fully tested)
@lml2468 lml2468 force-pushed the fix/issue-35862-restart-config-validation branch from 8516b73 to 251bd68 Compare March 5, 2026 10:24
…it (openclaw#35862)

When an in-process restart (SIGUSR1) triggers a config-triggered restart
and the new config is invalid, params.start() throws and the while loop
exits, killing the process. On macOS this loses TCC permissions.

Wrap params.start() in try/catch: on failure, set server=null, log the
error, and wait for the next SIGUSR1 instead of crashing.
Address Greptile review: add test coverage for runServiceStart path.
The error message copy-paste issue was already fixed in the DRY refactor
(uses params.serviceNoun instead of hardcoded 'restart').
@openclaw-barnacle
Copy link
Copy Markdown

Closing this PR because the author has more than 10 active PRs in this repo. Please reduce the active PR queue and reopen or resubmit once it is back under the limit. You can close your own PRs to get back under the limit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: gateway crash on invalid config causes macOS Full Disk Access loss

2 participants