-
-
Notifications
You must be signed in to change notification settings - Fork 69.4k
Bug Report: Cascading deployment failures on Linux + NVM + PM2 environment #26252
Description
Bug Report: Cascading deployment failures on Linux + NVM + PM2 environment
Environment
- OS: OpenCloudOS (Linux) / Tencent Cloud
- Node: v22.22.0 (managed via NVM)
- Process Manager: PM2
- Gateway Version: Upgraded from 2026.2.21-2 to 2026.2.23
- Primary Integration: Feishu (Lark) Webhook
Summary
During a routine gateway upgrade, we encountered four cascading failures that took significant time to diagnose. Each fix exposed the next issue. Documenting here to help improve the deployment experience for Linux + NVM + PM2 users.
Failure 1: Corrupted global NPM package + outdated start command
Symptom: openclaw-gateway process fails to start, command not found.
Root Cause: npm list -g --depth=0 showed openclaw@(no version) — the package was corrupted with no bin links generated. Additionally, the old startup script used the hyphenated openclaw-gateway command, but the new version changed to the subcommand openclaw gateway.
Fix: npm uninstall -g openclaw && npm install -g openclaw, then corrected the start command.
Failure 2: Reinstalling core package drops plugin SDK dependencies
Symptom: Gateway starts successfully, but Feishu stops responding. PM2 logs show: Error: Cannot find module '@larksuiteoapi/node-sdk'.
Root Cause: Reinstalling openclaw globally did not restore the separately installed Feishu SDK dependency, causing the plugin to crash at runtime.
Fix: npm install -g @larksuiteoapi/node-sdk and restart.
Failure 3: Cloud proxy strips WebSocket security headers from Dashboard
Symptom: Accessing the Dashboard via Tencent Cloud OrcaTerm's public web proxy (wss://forward-...), the UI loads but immediately errors with 1008: device nonce required or 1008: pairing required. No pairing request appears in server logs.
Root Cause: The cloud provider's security gateway strips/blocks custom security headers (Nonce) during WSS forwarding. OpenClaw rejects the connection because the security handshake data is incomplete.
Fix: Abandoned the cloud web proxy. Used a clean SSH tunnel instead: ssh -L 18789:127.0.0.1:18789 root@<ip>, then accessed via http://127.0.0.1:18789.
Failure 4: PM2 "ghost crash" after upgrade — NVM environment variables lost
Symptom: After upgrading to 2026.2.23, pm2 restart openclaw-gateway causes the gateway to crash instantly (memory drops to 0-8MB) with zero error log output. Running openclaw gateway directly in the foreground works perfectly.
Root Cause: On servers using NVM, PM2 forks processes without sourcing ~/.bashrc, losing critical environment variables (PATH, etc.). The process dies so fast it can't even write an error log.
Fix: Created a shell wrapper script that sources the environment before launching:
#!/bin/bash
source ~/.bashrc
export PATH="$HOME/.nvm/versions/node/v22.22.0/bin:$PATH"
openclaw gatewayThen configured PM2 to manage this script instead of the bare command.
Suggestions for OpenClaw
-
Enhanced dependency self-check (
openclaw doctor):
When plugins like Feishu are enabled, auto-detect whether their underlying SDKs (e.g.,@larksuiteoapi/node-sdk) are installed. Show a friendlynpm installhint on missing dependencies instead of crashing at runtime. -
Dashboard proxy detection hint:
When the Web UI encounters1008: device nonce requiredand appears to be behind a reverse proxy, display a UI message: "WebSocket handshake failed. Your proxy may be stripping security headers. Try accessing via SSH local port forwarding instead." -
PM2 + NVM deployment best practice:
For Linux + NVM users, recommend the shell wrapper script approach as the default PM2 setup in documentation. This prevents 99% of "ghost crash" issues caused by missing environment variables. Example in docs would save many users hours of debugging.