Skip to content

fix(browser): harden chrome shutdown across SIGUSR1 restarts#32520

Closed
bmendonca3 wants to merge 2 commits intoopenclaw:mainfrom
bmendonca3:bm/browser-soft-restart-kill-stale-proc-32458
Closed

fix(browser): harden chrome shutdown across SIGUSR1 restarts#32520
bmendonca3 wants to merge 2 commits intoopenclaw:mainfrom
bmendonca3:bm/browser-soft-restart-kill-stale-proc-32458

Conversation

@bmendonca3
Copy link
Copy Markdown
Contributor

Summary

Describe the problem and fix in 2–5 bullets:

  • Problem: soft restart paths can leave a previously-launched browser process alive, so a later restart can inherit a stale proc.killed=true state and skip real shutdown work.
  • Why it matters: stale browser processes can conflict with fresh restarts and lead to unreliable CDP behavior after restart.
  • What changed: hardened stopOpenClawChrome() to key shutdown on exitCode (actual process exit) instead of proc.killed (signal attempt flag), and to continue TERM/KILL escalation when killed===true but the process is still running.
  • What did NOT change (scope boundary): no changes to browser launch flags, profile selection, or gateway restart orchestration logic.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • Browser teardown during restart is more reliable when prior stop attempts already marked proc.killed=true but the process has not actually exited.

Security Impact (required)

  • New permissions/capabilities? (Yes/No) No
  • Secrets/tokens handling changed? (Yes/No) No
  • New/changed network calls? (Yes/No) No
  • Command/tool execution surface changed? (Yes/No) No
  • Data access scope changed? (Yes/No) No
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: macOS/Linux (unit-level validation)
  • Runtime/container: Node + Vitest
  • Model/provider: N/A
  • Integration/channel (if any): Browser control / CDP
  • Relevant config (redacted): default browser control settings

Steps

  1. Start browser control and launch a local browser profile.
  2. Trigger stop/restart sequences where process signaling may set proc.killed=true before actual process exit.
  3. Observe teardown behavior for subsequent stop attempts.

Expected

  • Stop logic should only no-op when process already exited (exitCode !== null).
  • Stop should continue TERM/KILL escalation when process is still running even if proc.killed === true.

Actual

  • Previous logic returned early when proc.killed === true, which can skip shutdown work for still-running processes.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: pnpm vitest run src/browser/chrome.test.ts passes with updated stop semantics and new regression case.
  • Edge cases checked: already-exited process no-op, CDP-down fast return, CDP-still-reachable SIGKILL escalation, and killed=true + exitCode=null retry path.
  • What you did not verify: full live SIGUSR1 restart on an external host with real Chrome process table inspection.

Compatibility / Migration

  • Backward compatible? (Yes/No) Yes
  • Config/env changes? (Yes/No) No
  • Migration needed? (Yes/No) No
  • If yes, exact upgrade steps:

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: revert commit Browser: harden SIGUSR1 chrome shutdown.
  • Files/config to restore: src/browser/chrome.ts, src/browser/chrome.test.ts.
  • Known bad symptoms reviewers should watch for: browser process not receiving SIGTERM/SIGKILL in tests or unexpected stop-time hangs.

Risks and Mitigations

  • Risk: stricter reliance on exitCode could behave differently on unusual process wrappers.
    • Mitigation: retained existing CDP reachability checks and SIGKILL fallback; added regression coverage for the key stale-killed state.

@aisle-research-bot
Copy link
Copy Markdown

aisle-research-bot bot commented Mar 3, 2026

🤖 We're reviewing this PR with Aisle

We're running a security check on the changes in this PR now. This usually takes a few minutes. ⌛
We'll post the results here as soon as they're ready.

Progress:

  • Analysis
  • Triage
  • Finalization

Latest run failed. Keeping previous successful results. Trace ID: 019cb1ebd5f5a8af767840e92593a488.

Last updated on: 2026-03-03T04:19:20Z

Latest run failed. Keeping previous successful results. Trace ID: 019cb20500b64aaf1e7f3c341f626f49.

Last updated on: 2026-03-03T04:46:48Z

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 3, 2026

Greptile Summary

This PR hardens stopOpenClawChrome() in src/browser/chrome.ts to gate all shutdown logic on proc.exitCode !== null (actual process exit) rather than proc.killed (signal-attempt flag). The fix correctly handles the stale-killed scenario — a process that had a prior SIGTERM attempt (killed=true) but never actually exited would previously be skipped by the shutdown path, leaving a zombie browser alive across SIGUSR1-triggered restarts.

Key changes:

  • Early-exit guard changed from proc.killed to proc.exitCode !== null, so a stale killed=true no longer prevents re-escalation.
  • Mid-loop break condition replaced with proc.exitCode !== null (also fixes a pre-existing subtle bug: the old !proc.exitCode would treat exitCode=0 as falsy, causing an unnecessary SIGKILL on an already-cleanly-exited process).
  • An explicit exitCode !== null guard added before SIGKILL to short-circuit after the timeout loop in the common case where the process exited during the wait.
  • The corresponding test is updated: the no-op test now supplies exitCode: 0 to match the new guard semantics, and a new regression test validates that killed=true, exitCode=null correctly reaches both SIGTERM and SIGKILL escalation.

Confidence Score: 5/5

  • This PR is safe to merge — the changes are minimal, well-scoped, and backed by a targeted regression test.
  • The fix is a straightforward guard condition swap in a single function. It resolves a real bug (stale killed flag bypassing shutdown), also fixes a latent exitCode=0-is-falsy issue in the old loop condition, and introduces no new external dependencies or surface area changes. The test updates directly cover the regression path and all assertions are correct.
  • No files require special attention.

Last reviewed commit: f64b2ab

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f64b2ab1f5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: Soft restart (SIGUSR1) does not kill previously-launched browser processes, causing port conflicts

2 participants