fix: improve gateway upgrade diagnostics and recovery #28990

Open
ogenev wants to merge 13 commits into openclaw:main from ogenev:fix/issue-26252

@ogenev ogenev commented Feb 27, 2026

Summary

Describe the problem and fix in 2–5 bullets:

  • Problem: upgrades in PM2/NVM and proxy-forwarded setups could fail in multiple ways (legacy openclaw-gateway launch mismatch, opaque plugin dependency failures, and unclear nonce-handshake recovery guidance).
  • Why it matters: operators could be blocked on startup/connectivity and receive recovery commands that did not match their actual deployment port/path.
  • What changed:
    • Added a dedicated openclaw-gateway wrapper entrypoint and hardened legacy argv normalization in run-main.
    • Synced normalized gateway argv into process.argv so preAction/config-guard/lazy command registration all see the same command path.
    • Added explicit passthroughs for root update/version/help aliases (update, --update, -v, -V, --version, -h, --help) and made rewrite logic root-flag aware (--no-color, --log-level, --profile, --dev).
    • Added plugin missing-dependency hints (gateway startup + openclaw plugins doctor) with actionable install guidance.
    • Added nonce/proxy handshake diagnostics plus smarter tunnel port selection:
      • explicit listener port (env/config) wins for non-loopback URLs,
      • otherwise non-loopback probe URL ports are used when available,
      • loopback detection now covers broader local forms (including the full 127.0.0.0/8 range and IPv4-mapped IPv6 loopback).
    • Added/updated Linux troubleshooting docs for PM2/NVM wrapper behavior and recovery.
  • What did NOT change (scope boundary): no auth/pairing trust model was loosened; this PR focuses on compatibility, diagnostics, and operator guidance.
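The argv normalization described in the bullets above can be sketched roughly as follows. This is an illustration only, not the code in src/cli/run-main.ts: the function names are hypothetical, and it assumes that --log-level and --profile each take a value while --no-color and --dev are boolean flags.

```typescript
// Hypothetical sketch; names and exact rules are assumptions, not the PR's code.

// Root-level aliases that must pass through untouched (listed in the summary).
const ROOT_PASSTHROUGH = new Set(["update", "--update", "-v", "-V", "--version", "-h", "--help"]);
// Root flags that may precede the command. Assumed: these two consume a value.
const ROOT_FLAGS_WITH_VALUE = new Set(["--log-level", "--profile"]);
const ROOT_BOOLEAN_FLAGS = new Set(["--no-color", "--dev"]);

/** Index of the first token that is neither a root flag nor a root flag's value. */
function firstCommandIndex(args: string[]): number {
  let i = 0;
  while (i < args.length) {
    const arg = args[i];
    if (ROOT_BOOLEAN_FLAGS.has(arg)) {
      i += 1;
    } else if (ROOT_FLAGS_WITH_VALUE.has(arg)) {
      i += 2; // skip the flag and its value
    } else {
      break;
    }
  }
  return Math.min(i, args.length);
}

/** Rewrites the arguments of a legacy `openclaw-gateway` launch into `openclaw` form. */
function normalizeLegacyGatewayArgs(args: string[]): string[] {
  const idx = firstCommandIndex(args);
  const head: string | undefined = args[idx];
  // Root aliases stay at root level, and an existing `gateway` is not prefixed twice.
  if (head !== undefined && (ROOT_PASSTHROUGH.has(head) || head === "gateway")) {
    return [...args];
  }
  // Otherwise insert `gateway` after any leading root flags.
  return [...args.slice(0, idx), "gateway", ...args.slice(idx)];
}
```

Under these assumptions, `--no-color --update` is returned unchanged, while a non-root argument such as `--port 1234` (an illustrative flag) becomes `gateway --port 1234`. The sketch does not handle the `--flag=value` spelling.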

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • openclaw-gateway legacy launch paths are now compatible across shim/wrapper variants.
  • openclaw-gateway root update/version/help invocations now remain valid even with leading root flags.
  • Plugin startup failures now include dependency-specific install hints.
  • openclaw status nonce recovery guidance now provides more accurate tunnel commands across loopback, custom-port, and proxy-facing scenarios.
  • Linux troubleshooting docs include PM2/NVM wrapper recovery guidance.

Security Impact (required)

  • New permissions/capabilities? (Yes/No): No
  • Secrets/tokens handling changed? (Yes/No): No
  • New/changed network calls? (Yes/No): No
  • Command/tool execution surface changed? (Yes/No): Yes
  • Data access scope changed? (Yes/No): No
  • If any Yes, explain risk + mitigation:
    • CLI dispatch behavior for legacy gateway entrypoints changed. Mitigation: regression tests cover rewrite/passthrough cases and command-path consistency.

Repro + Verification

Environment

  • OS: macOS 14.x (local verification), Linux PM2/NVM behavior covered by docs + command-path tests
  • Runtime/container: Node 22+
  • Model/provider: N/A
  • Integration/channel (if any): Feishu dependency hint path
  • Relevant config (redacted): gateway local mode, proxy/nonce failure scenarios

Steps

  1. Verify legacy gateway argv normalization (openclaw-gateway ...) and root passthrough behavior.
  2. Verify nonce recovery messaging for loopback/non-loopback/proxy-facing URL variants.
  3. Verify plugin load-error hints and doctor output guidance.
  4. Run tests + checks.

Expected

  • Legacy gateway launch/update/version paths remain valid.
  • Nonce recovery guidance uses correct tunnel target/port context.
  • Plugin missing dependencies produce actionable hints.
  • Checks pass.

Actual

  • Matches expected after this PR.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios:
    • pnpm exec vitest run src/cli/run-main.test.ts
    • pnpm exec vitest run src/commands/status.test.ts
    • pnpm exec vitest run src/infra/infra-parsing.test.ts
    • pnpm check
  • Edge cases checked:
    • root-flag-ordered gateway update passthrough (--no-color --update, --log-level debug --update)
    • root version alias passthrough for gateway wrapper (-v)
    • loopback URL variants (127.0.0.1, 127.0.0.2, [::1], [::ffff:127.0.0.1])
    • non-loopback nonce tunnel guidance with and without explicit listener port
  • What you did not verify:
    • Live remote PM2/NVM deployment and live proxy appliance behavior in this PR run.
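The loopback variants listed under the edge cases can be covered by a check along the following lines. This is a minimal sketch under stated assumptions, not the PR's implementation: `isLoopbackHost` is a hypothetical name, and the hex branch is included because URL parsers may serialize an IPv4-mapped address such as [::ffff:127.0.0.1] in hexadecimal form.

```typescript
// Hypothetical sketch of broad loopback detection; not the code from this PR.

/** Returns true when `host` (as it appears in a URL) refers to the loopback interface. */
function isLoopbackHost(host: string): boolean {
  let h = host.trim().toLowerCase();
  // Strip the brackets that surround IPv6 literals in URLs.
  if (h.startsWith("[") && h.endsWith("]")) {
    h = h.slice(1, -1);
  }
  if (h === "localhost" || h === "::1") {
    return true;
  }
  // IPv4-mapped IPv6 in hex form, e.g. ::ffff:7f00:1 for 127.0.0.1.
  const hex = /^::ffff:([0-9a-f]{1,4}):[0-9a-f]{1,4}$/.exec(h);
  if (hex) {
    // The high byte of the first group is the first IPv4 octet.
    return (parseInt(hex[1], 16) >> 8) === 127;
  }
  // IPv4-mapped IPv6 in dotted form, or plain IPv4.
  const dotted = h.startsWith("::ffff:") ? h.slice("::ffff:".length) : h;
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(dotted);
  if (!m) {
    return false;
  }
  const octets = m.slice(1).map(Number);
  // Any address in 127.0.0.0/8 is loopback.
  return octets[0] === 127 && octets.every((o) => o <= 255);
}
```

The four variants from the edge-case list (127.0.0.1, 127.0.0.2, [::1], [::ffff:127.0.0.1]) all return true here, while an address such as 192.168.1.10 returns false.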

Compatibility / Migration

  • Backward compatible? (Yes/No): Yes
  • Config/env changes? (Yes/No): No
  • Migration needed? (Yes/No): No
  • If yes, exact upgrade steps:
    • N/A

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly:
    • Revert this PR's commits and republish.
  • Files/config to restore:
    • src/cli/run-main.ts
    • src/commands/status.command.ts
    • src/plugins/load-error-hints.ts
    • src/gateway/server/ws-connection/message-handler.ts
    • src/gateway/server-plugins.ts
  • Known bad symptoms reviewers should watch for:
    • openclaw-gateway commands mapping to invalid gateway update paths
    • wrong nonce tunnel port hints on loopback/custom-port URLs
    • missing plugin dependency remediation hints on startup

Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write None.

  • Risk: argv rewrite could alter root command semantics for gateway wrapper paths.
    • Mitigation: root passthrough set + root-flag-aware command detection + targeted rewrite tests.
  • Risk: nonce guidance could prefer incorrect port in mixed proxy/direct deployments.
    • Mitigation: explicit listener-port precedence + URL-port fallback + loopback variant tests.
  • Risk: plugin hints could become stale if dependency error signatures change.
    • Mitigation: hint logic constrained to known missing-module signatures and covered by unit tests.
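The port precedence named in the nonce-guidance mitigation (explicit listener port first, then the URL port, then a default) can be illustrated for the non-loopback case as below. The interface, function names, and the caller-supplied default are hypothetical; the PR's actual logic in the status command may differ.

```typescript
// Hypothetical sketch of the precedence for non-loopback probe URLs; not the PR's code.

interface TunnelPortInput {
  /** Port from env/config, if the operator set one explicitly. */
  explicitListenerPort?: number;
  /** Port parsed from the probe URL, if the URL carried one. */
  probeUrlPort?: number;
  /** Fallback port, supplied by the caller. */
  defaultPort: number;
}

/** Type guard: a usable TCP port is an integer in 1..65535. */
function isValidPort(p: number | undefined): p is number {
  return p !== undefined && Number.isInteger(p) && p >= 1 && p <= 65535;
}

/** Picks the tunnel port: explicit listener port, then probe URL port, then default. */
function selectTunnelPort(input: TunnelPortInput): number {
  if (isValidPort(input.explicitListenerPort)) {
    return input.explicitListenerPort;
  }
  if (isValidPort(input.probeUrlPort)) {
    return input.probeUrlPort;
  }
  return input.defaultPort;
}
```

Keeping the explicit value first means an operator who configured a listener port is never overridden by whatever port a proxy-facing URL happens to expose.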

openclaw-barnacle bot added the docs (Improvements or additions to documentation), gateway (Gateway runtime), cli (CLI command changes), commands (Command implementations), and size: M labels Feb 27, 2026
greptile-apps bot commented Feb 27, 2026

Greptile Summary

This PR improves gateway upgrade diagnostics and backward compatibility for operators upgrading from older versions. The implementation adds three key capabilities:

  • Legacy binary compatibility: Preserves openclaw-gateway command compatibility by adding a package.json alias and argv rewriting logic in src/cli/run-main.ts. The rewrite correctly handles root flags (--update, --version) and avoids double-prefixing when gateway is already present.

  • Plugin dependency diagnostics: Introduces actionable hints when plugins fail to load due to missing dependencies. The src/plugins/load-error-hints.ts module parses error messages to extract missing package names and suggests openclaw plugins update <plugin-id> commands. These hints appear in both gateway startup logs and the openclaw plugins doctor output.

  • Nonce recovery guidance: When device nonce handshakes fail (commonly due to proxies stripping challenge data), the openclaw status command now provides specific recovery instructions, including SSH tunnel commands with the correct port. The port detection logic prioritizes explicit configuration, then probe URL ports, then defaults, and handles both loopback and non-loopback scenarios.

The changes include comprehensive test coverage (107 new test cases in run-main.test.ts, 180 additions in status.test.ts, etc.) and documentation updates for PM2+NVM troubleshooting scenarios. A minor typing fix in pi-embedded-runner-extraparams.test.ts ensures CI passes.
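To make the plugin dependency diagnostics described above more concrete, here is a minimal sketch of extracting a missing package name from a Node.js module-resolution error and turning it into a hint. The function names and the hint wording are hypothetical and are not taken from src/plugins/load-error-hints.ts; only the two Node.js message shapes and the `openclaw plugins update <plugin-id>` suggestion come from known behavior and the summary above.

```typescript
// Hypothetical sketch; the real module may match different signatures.

/** Extracts the missing package name from a Node.js module-resolution error message. */
function extractMissingPackage(message: string): string | undefined {
  // CommonJS: "Cannot find module 'x'"; ESM: "Cannot find package 'x' imported from ...".
  const m = /Cannot find (?:module|package) '([^']+)'/.exec(message);
  if (!m) {
    return undefined;
  }
  const specifier = m[1];
  // Relative or absolute paths are not installable packages.
  if (specifier.startsWith(".") || specifier.startsWith("/")) {
    return undefined;
  }
  const parts = specifier.split("/");
  // Scoped packages keep two segments (@scope/name); others keep one.
  return specifier.startsWith("@") ? parts.slice(0, 2).join("/") : parts[0];
}

/** Builds an operator-facing hint, or undefined when the error is not a missing dependency. */
function buildDependencyHint(pluginId: string, message: string): string | undefined {
  const pkg = extractMissingPackage(message);
  if (pkg === undefined) {
    return undefined;
  }
  return `Plugin "${pluginId}" is missing dependency "${pkg}". Try: openclaw plugins update ${pluginId}`;
}
```

Returning undefined for unrecognized messages keeps the hint limited to known signatures, which matches the mitigation stated in the PR description.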

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The implementation is well-architected with defensive coding practices throughout. All new functionality has comprehensive test coverage including edge cases (IPv4/IPv6 loopback detection, argv rewriting scenarios, port detection logic). No security vulnerabilities introduced - the argv rewriting is purely syntactic and doesn't execute user-controlled input. The changes are backward compatible and provide multiple escape hatches (explicit configuration) for edge cases. The code follows repository conventions and includes clear documentation updates.
  • No files require special attention

Last reviewed commit: 4bc4ac3

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0f775acf84

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

openclaw-barnacle bot added the agents (Agent runtime and tooling) label Feb 27, 2026
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fa4d972293

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b7c7d97989

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f5a4f8cfc0

@ogenev ogenev marked this pull request as draft February 27, 2026 18:18
@ogenev ogenev marked this pull request as ready for review February 27, 2026 20:10
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4bc4ac3a65

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d0f2d3f527

    }
    runtime.log(
      theme.muted(
        `Tunnel: ssh -L ${nonceRecovery.tunnelPort}:127.0.0.1:${nonceRecovery.tunnelPort} <user>@<host>`,
P2: Use a non-privileged local port in SSH recovery command

When nonce recovery derives tunnelPort from a wss:// or ws:// URL without an explicit port, it can become 443 or 80, and this line uses that value for both sides of ssh -L (local:remote). On typical Linux/macOS setups, non-root users cannot bind local privileged ports (below 1024), so the recommended recovery command fails immediately even though the remote target may be correct. Keep the remote port derived from the URL, but choose a non-privileged local port (for example the configured gateway/default port) for the left-hand side.
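One way to apply this suggestion is to separate the local and remote sides when building the command. The sketch below is illustrative only: `buildTunnelCommand` and the caller-supplied fallback port are hypothetical and do not appear in the PR.

```typescript
// Hypothetical sketch of the reviewer's suggestion; not code from this PR.

// Ports below 1024 are privileged on typical Unix-like systems.
const FIRST_UNPRIVILEGED_PORT = 1024;

/**
 * Builds the SSH recovery command. The remote port stays as derived from the URL;
 * the local bind port falls back to `fallbackLocalPort` when the remote port is privileged.
 */
function buildTunnelCommand(remotePort: number, fallbackLocalPort: number): string {
  const localPort = remotePort < FIRST_UNPRIVILEGED_PORT ? fallbackLocalPort : remotePort;
  return `ssh -L ${localPort}:127.0.0.1:${remotePort} <user>@<host>`;
}
```

For a wss:// URL without an explicit port, a remote port of 443 with an example fallback of 8080 would yield `ssh -L 8080:127.0.0.1:443 <user>@<host>`, which a non-root user can run.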


@openclaw-barnacle
This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

openclaw-barnacle bot added the stale (Marked as stale due to inactivity) label Mar 5, 2026
@Takhoffman Takhoffman requested a review from a team as a code owner March 24, 2026 20:16
openclaw-barnacle bot removed the stale (Marked as stale due to inactivity) label Mar 28, 2026
@openclaw-barnacle
This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

openclaw-barnacle bot added the stale (Marked as stale due to inactivity) label Apr 2, 2026
Labels

  • agents (Agent runtime and tooling)
  • cli (CLI command changes)
  • commands (Command implementations)
  • docs (Improvements or additions to documentation)
  • gateway (Gateway runtime)
  • size: L
  • stale (Marked as stale due to inactivity)

Development

Successfully merging this pull request may close these issues.

Bug Report: Cascading deployment failures on Linux + NVM + PM2 environment
