Skip to content

fix: resolve exec PATH fallback, layered browser diagnostics, and cron force-run deadlock#42027

Open
1998xiangcai-cmd wants to merge 3 commits intoopenclaw:mainfrom
1998xiangcai-cmd:xiangcai/exec-browser-cron-fixes-41549-41553-41558
Open

fix: resolve exec PATH fallback, layered browser diagnostics, and cron force-run deadlock#42027
1998xiangcai-cmd wants to merge 3 commits intoopenclaw:mainfrom
1998xiangcai-cmd:xiangcai/exec-browser-cron-fixes-41549-41553-41558

Conversation

@1998xiangcai-cmd
Copy link
Copy Markdown

Summary

Describe the problem and fix in 2�C4 bullets:

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • Exec tool can now find login-shell-managed binaries when the default sandbox host falls back to direct local execution.
  • openclaw browser status now surfaces layered diagnostics for Control UI auth/origin configuration, browser profile attach mode, and remote/local CDP reachability.
  • Browser HTTP errors now summarize proxy HTML error pages instead of dumping raw markup.
  • Detached cron.run --force no longer self-deadlocks when isolated cron work reuses the cron execution lane.

Security Impact (required)

  • New permissions/capabilities? (Yes/No) No
  • Secrets/tokens handling changed? (Yes/No) No
  • New/changed network calls? (Yes/No) No
  • Command/tool execution surface changed? (Yes/No) Yes
  • Data access scope changed? (Yes/No) Yes
  • If any Yes, explain risk + mitigation:
    • Exec PATH recovery now also runs for local fallback from the default sandbox host, but only when no sandbox runtime exists and no explicit PATH override was provided.
    • Browser CLI status now reads the existing redacted config.get snapshot through the authenticated gateway path to derive operator diagnostics; it does not expose unredacted secrets.

Repro + Verification

Environment

  • OS: Windows workstation, repo validation run locally
  • Runtime/container: Node 22 / pnpm
  • Model/provider: N/A for repro validation
  • Integration/channel (if any): Browser control + cron
  • Relevant config (redacted): default exec host, browser remote CDP profile, recurring isolated cron jobs

Steps

  1. Leave tools.exec.host unset, disable sandbox runtime, and run an exec command that depends on login-shell PATH.
  2. Run openclaw browser status against a remote CDP profile in a WSL2 + Windows style setup.
  3. Force-run a recurring isolated cron job whose nested work reuses the cron lane.

Expected

  • Exec finds login-shell-managed binaries.
  • Browser status points to the failing layer instead of a generic connection failure.
  • Force-run completes instead of timing out behind its own lane.

Actual

  • Covered by the targeted tests below; all now pass.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: pnpm check; pnpm test -- --run src/agents/bash-tools.exec.path.test.ts src/browser/status-diagnostics.test.ts src/browser/server.status-diagnostics.test.ts src/cli/browser-cli-manage.status-diagnostics.test.ts src/browser/client.test.ts src/cron/service.issue-regressions.test.ts; pnpm build.
  • Edge cases checked: phantom sandbox PATH fallback, remote/local CDP diagnostics ordering, HTML proxy error summarization, detached cron force-run lane reuse.
  • What you did not verify: live remote CDP against a real Windows browser host, or a live LLM-backed recurring cron on production credentials.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes/No) Yes
  • Config/env changes? (Yes/No) No
  • Migration needed? (Yes/No) No
  • If yes, exact upgrade steps:

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: revert the three commits in this PR.
  • Files/config to restore: src/agents/bash-tools.exec.ts, src/browser/*status*, src/cli/browser-cli-*, src/cron/service/ops.ts, src/process/lanes.ts, gateway lane config wiring.
  • Known bad symptoms reviewers should watch for: missing PATH recovery on macOS launch agents, noisy browser diagnostics ordering, cron force-runs stuck until timeout.

Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write None.

  • Risk: Browser status now includes more diagnostics and merges CLI-derived gateway config checks with server-side CDP checks.
    • Mitigation: diagnostics are additive, optional in the payload, redacted, and covered by route + CLI tests.
  • Risk: The new detached cron manual lane could drift from cron concurrency settings.
    • Mitigation: gateway startup and hot-reload now set CommandLane.Cron and CommandLane.CronManual together, with regression coverage for the deadlock case.

@openclaw-barnacle openclaw-barnacle bot added gateway Gateway runtime cli CLI command changes agents Agent runtime and tooling size: L labels Mar 10, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 10, 2026

Greptile Summary

This PR addresses three targeted bug fixes: (1) login-shell PATH recovery now triggers when host === "sandbox" falls back to direct local execution (not just host === "gateway"), (2) openclaw browser status gains layered diagnostics covering Control UI auth/origin config, profile attach-mode, and CDP reachability, with HTML proxy error pages now summarized instead of dumped raw, and (3) cron.run --force is moved to a new CommandLane.CronManual lane so isolated cron jobs can still enqueue on CommandLane.Cron without self-deadlocking.

Key changes:

  • bash-tools.exec.ts: extends host === "gateway" PATH probe guard to also cover host === "sandbox" when no sandbox runtime exists
  • status-diagnostics.ts (new): pure-logic module for deriveBrowserStatusDiagnostics, deriveGatewayControlUiDiagnostics, and combineBrowserStatusDiagnostics; no network calls, well-typed, well-tested
  • client-fetch.ts: summarizeBrowserServiceHttpError correctly identifies and summarizes nginx/proxy HTML error pages; the /<[^>]+>/g tag-stripping regex can be tricked by unencoded > in HTML attribute values and the fallback branch (no <title> or <h1>) has no length cap — minor robustness concerns worth addressing
  • cron/service/ops.ts + process/lanes.ts + gateway wiring: CronManual lane cleanly separates manual force-runs from scheduled cron execution; both startup and hot-reload set CronManual concurrency in lockstep with Cron, preventing config drift
  • Test coverage is thorough: unit tests for all three diagnostic paths, a regression test that directly exercises the deadlock scenario, and a server integration test for CDP reachability diagnostics

Confidence Score: 4/5

  • Safe to merge with one minor robustness improvement suggested for HTML error message handling.
  • All three fixes are narrowly scoped, backward-compatible, and covered by regression tests. The exec PATH change is guarded by existing conditions. Browser diagnostics are additive and optional. The cron lane split resolves a deadlock without altering scheduling semantics. One style/robustness concern identified in stripHtmlErrorText: the tag-stripping regex can be confused by unencoded > in attribute values, and the fallback path has no length cap, but this is a minor improvement opportunity rather than a blocking issue for merge.
  • src/browser/client-fetch.ts — consider applying the suggested robustness improvement to the stripHtmlErrorText function (cap output length, improve regex handling of attributes with > characters)

Last reviewed commit: 3e8ab2b

Comment on lines +157 to +163
function stripHtmlErrorText(input: string): string {
return input
.replace(/<script[\s\S]*?<\/script>/gi, " ")
.replace(/<style[\s\S]*?<\/style>/gi, " ")
.replace(/<[^>]+>/g, " ")
.replace(/\s+/g, " ")
.trim();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

> in attribute values can defeat the tag-stripping regex.

/<[^>]+>/g stops at the first > it encounters, so an attribute value that contains an unencoded > (e.g., aria-label="a > b", which some proxy error pages include) will leave the tail of the attribute and the closing > in the output. This is the fallback path that fires when neither <title> nor <h1> is found, so the resulting "summary" could contain raw attribute fragments and could be unboundedly long if the page body has significant text content.

Consider using a tighter pattern and capping the output length in the fallback:

Suggested change
function stripHtmlErrorText(input: string): string {
return input
.replace(/<script[\s\S]*?<\/script>/gi, " ")
.replace(/<style[\s\S]*?<\/style>/gi, " ")
.replace(/<[^>]+>/g, " ")
.replace(/\s+/g, " ")
.trim();
function stripHtmlErrorText(input: string): string {
return input
.replace(/<script[\s\S]*?<\/script>/gi, " ")
.replace(/<style[\s\S]*?<\/style>/gi, " ")
.replace(/<[^>]*>/g, " ")
.replace(/&[a-z]+;/gi, " ")
.replace(/\s+/g, " ")
.trim()
.slice(0, 200);
}

Note: [^>]* (zero-or-more instead of one-or-more) is a minor alignment fix, but the more actionable change is adding .slice(0, 200) so that an HTML page without a title or heading can't push a multi-kilobyte text dump into the thrown error message.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/browser/client-fetch.ts
Line: 157-163

Comment:
`>` in attribute values can defeat the tag-stripping regex.

`/<[^>]+>/g` stops at the first `>` it encounters, so an attribute value that contains an unencoded `>` (e.g., `aria-label="a > b"`, which some proxy error pages include) will leave the tail of the attribute and the closing `>` in the output. This is the fallback path that fires when neither `<title>` nor `<h1>` is found, so the resulting "summary" could contain raw attribute fragments and could be unboundedly long if the page body has significant text content.

Consider using a tighter pattern and capping the output length in the fallback:

```suggestion
function stripHtmlErrorText(input: string): string {
  return input
    .replace(/<script[\s\S]*?<\/script>/gi, " ")
    .replace(/<style[\s\S]*?<\/style>/gi, " ")
    .replace(/<[^>]*>/g, " ")
    .replace(/&[a-z]+;/gi, " ")
    .replace(/\s+/g, " ")
    .trim()
    .slice(0, 200);
}
```

Note: `[^>]*` (zero-or-more instead of one-or-more) is a minor alignment fix, but the more actionable change is adding `.slice(0, 200)` so that an HTML page without a title or heading can't push a multi-kilobyte text dump into the thrown error message.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3e8ab2b689

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

};
};
}> {
const payload = await callGatewayFromCli("config.get", opts, {});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reuse short status timeout for config snapshot call

The new browser status path now always does config.get via callGatewayConfigSnapshot, but this helper does not pass a bounded timeout and therefore inherits the CLI default timeout from callGatewayFromCli (30s), unlike fetchBrowserStatus which explicitly uses 1500ms; when config.get is slow or partially unavailable, openclaw browser status can hang for tens of seconds before falling back to existing diagnostics, which is a responsiveness regression introduced by this change.

Useful? React with 👍 / 👎.

@Owlock
Copy link
Copy Markdown

Owlock commented Mar 24, 2026

Additional real-world validation from today (Mar 24): this PR’s layered diagnostics would have helped a lot in a WSL2 + Windows CDP outage triggered by virtualization mode toggling.

Repro chain:

  • bcdedit /set hypervisorlaunchtype off (game mode) -> reboot
  • bcdedit /set hypervisorlaunchtype auto -> reboot
  • Chrome CDP still healthy on Windows localhost (127.0.0.1:9222), but unreachable from WSL host endpoint (172.30.144.1:9222) until bridge recovery

Recovery that restored remote tabs:

  • ensure iphlpsvc running
  • recreate portproxy (0.0.0.0:9222 -> 127.0.0.1:9222)
  • allow inbound TCP 9222 in firewall
  • verify both /json/version checks (localhost and WSL-reachable host address)

I posted full field notes in #41553 as follow-up evidence. +1 on this direction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling cli CLI command changes gateway Gateway runtime size: L

Projects

None yet

2 participants