Bug Report: Cascading deployment failures on Linux + NVM + PM2 environment #26252

@Henryyu119

Environment

  • OS: OpenCloudOS (Linux) / Tencent Cloud
  • Node: v22.22.0 (managed via NVM)
  • Process Manager: PM2
  • Gateway Version: Upgraded from 2026.2.21-2 to 2026.2.23
  • Primary Integration: Feishu (Lark) Webhook

Summary

During a routine gateway upgrade, we encountered four cascading failures that took significant time to diagnose. Each fix exposed the next issue. Documenting here to help improve the deployment experience for Linux + NVM + PM2 users.


Failure 1: Corrupted global NPM package + outdated start command

Symptom: The openclaw-gateway process fails to start with a "command not found" error.

Root Cause: npm list -g --depth=0 showed openclaw@(no version) — the package was corrupted with no bin links generated. Additionally, the old startup script used the hyphenated openclaw-gateway command, but the new version changed to the subcommand openclaw gateway.

Fix: npm uninstall -g openclaw && npm install -g openclaw, then corrected the start command.
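The corruption check described above can be scripted. The following is a minimal sketch, assuming a healthy entry looks like `openclaw@<version>` in the `npm list -g --depth=0` output; the helper name `has_versioned_entry` is hypothetical and not part of npm or OpenClaw.

```shell
#!/bin/bash
# Sketch: detect a global package that is listed without a version.
# Reads `npm list -g --depth=0` output on stdin; the package name is the first argument.
# Returns 0 if the package appears followed by "@" and a digit, 1 otherwise.
# Note: the name is used inside a regex, so this is only suitable for simple names.
has_versioned_entry() {
  local pkg="$1"
  # Matches e.g. "├── openclaw@2026.2.23" but not "├── openclaw@".
  grep -Eq "(^|[[:space:]])${pkg}@[0-9]"
}

# Usage (not run here):
#   if ! npm list -g --depth=0 | has_versioned_entry openclaw; then
#     npm uninstall -g openclaw && npm install -g openclaw
#   fi
#   command -v openclaw   # confirm the bin link exists on PATH
```

Running `command -v openclaw` afterwards is a quick way to confirm that the bin link was actually generated.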


Failure 2: Reinstalling core package drops plugin SDK dependencies

Symptom: Gateway starts successfully, but Feishu stops responding. PM2 logs show: Error: Cannot find module '@larksuiteoapi/node-sdk'.

Root Cause: Reinstalling openclaw globally did not restore the separately installed Feishu SDK dependency, causing the plugin to crash at runtime.

Fix: npm install -g @larksuiteoapi/node-sdk and restart.
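A pre-restart check for missing plugin dependencies might look like the sketch below. It assumes the gateway resolves plugin SDKs from the global node_modules root (which the fix above suggests, but which is not confirmed here); `missing_packages` is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: report which of the given packages are missing from a node_modules root.
# First argument: the root directory (for global installs, the output of `npm root -g`).
# Remaining arguments: package names, scoped names included.
# Prints one missing package name per line; prints nothing if all are present.
missing_packages() {
  local root="$1"
  shift
  local pkg
  for pkg in "$@"; do
    # An installed package has a package.json under <root>/<name>.
    [ -f "$root/$pkg/package.json" ] || printf '%s\n' "$pkg"
  done
}

# Usage (not run here):
#   missing=$(missing_packages "$(npm root -g)" @larksuiteoapi/node-sdk)
#   [ -z "$missing" ] || npm install -g $missing
```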


Failure 3: Cloud proxy strips WebSocket security headers from Dashboard

Symptom: When accessing the Dashboard through Tencent Cloud OrcaTerm's public web proxy (wss://forward-...), the UI loads but immediately fails with 1008: device nonce required or 1008: pairing required. No pairing request appears in the server logs.

Root Cause: The cloud provider's security gateway strips/blocks custom security headers (Nonce) during WSS forwarding. OpenClaw rejects the connection because the security handshake data is incomplete.

Fix: Abandoned the cloud web proxy. Used a clean SSH tunnel instead: ssh -L 18789:127.0.0.1:18789 root@<ip>, then accessed via http://127.0.0.1:18789.
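For repeated use, the tunnel command can be wrapped in a small helper. This is a sketch only: `tunnel_args` is a hypothetical function, and the added `-N` flag (a standard OpenSSH option that skips running a remote command) is a suggestion beyond the original fix.

```shell
#!/bin/bash
# Sketch: build the ssh arguments for a local forward to the Dashboard port.
# -N: do not run a remote command.
# -L: forward a local port to 127.0.0.1:<port> on the server.
# First argument: ssh target (user@host). Second argument: port, default 18789.
tunnel_args() {
  local target="$1" port="${2:-18789}"
  printf '%s\n' "-N -L ${port}:127.0.0.1:${port} ${target}"
}

# Usage (not run here; SERVER_IP is a placeholder for your server address):
#   ssh $(tunnel_args root@SERVER_IP)
#   then open http://127.0.0.1:18789 in a local browser
```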


Failure 4: PM2 "ghost crash" after upgrade — NVM environment variables lost

Symptom: After upgrading to 2026.2.23, pm2 restart openclaw-gateway causes the gateway to crash instantly (memory drops to 0-8MB) with zero error log output. Running openclaw gateway directly in the foreground works perfectly.

Root Cause: On servers using NVM, PM2 forks processes without sourcing ~/.bashrc, losing critical environment variables (PATH, etc.). The process dies so fast it can't even write an error log.
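Because the crash leaves no log, it can help to reproduce the missing-environment condition by hand. The sketch below approximates it by resolving a command with only a minimal PATH; the helper name `resolves_in_clean_env` and the choice of `/usr/bin:/bin` as the "clean" PATH are assumptions, not how PM2 itself builds the environment.

```shell
#!/bin/bash
# Sketch: check whether a command resolves when only a minimal PATH is available,
# which approximates a process started without the NVM setup from ~/.bashrc.
# Returns 0 if the command is found, non-zero otherwise.
resolves_in_clean_env() {
  env -i HOME="$HOME" PATH="/usr/bin:/bin" \
    /bin/sh -c 'command -v "$1" >/dev/null 2>&1' _ "$1"
}

# Usage (not run here):
#   resolves_in_clean_env openclaw || echo "openclaw is only reachable via the NVM PATH"
#   resolves_in_clean_env node     || echo "node is only reachable via the NVM PATH"
```

If both checks fail while the commands work in an interactive shell, the environment is the likely culprit.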

Fix: Created a shell wrapper script that sources the environment before launching:

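#!/bin/bash
# Load the shell environment. Many ~/.bashrc files return early in non-interactive shells.
source ~/.bashrc
# So also put the NVM-managed Node binaries on PATH explicitly.
export PATH="$HOME/.nvm/versions/node/v22.22.0/bin:$PATH"
# exec replaces the shell, so PM2 supervises the gateway process directly.
exec openclaw gateway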

Then configured PM2 to manage this script instead of the bare command.
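Registering the wrapper with PM2 could look like the following sketch. The wrapper path is hypothetical; the process name openclaw-gateway matches the one used in the restart command above. `pm2 start --name --interpreter`, `pm2 delete`, and `pm2 save` are standard PM2 commands.

```shell
#!/bin/bash
# Sketch: register the wrapper script with PM2.
# WRAPPER is a hypothetical location; point it at wherever the script was saved.
WRAPPER="${WRAPPER:-$HOME/start-openclaw.sh}"

register_gateway() {
  chmod +x "$WRAPPER"
  # Remove the old entry that ran the bare command, if one exists.
  pm2 delete openclaw-gateway 2>/dev/null
  # Run the wrapper with bash so the environment setup executes first.
  pm2 start "$WRAPPER" --name openclaw-gateway --interpreter bash
  # Persist the process list so it can be restored after a reboot.
  pm2 save
}

# Usage (not run here):
#   register_gateway
#   pm2 logs openclaw-gateway
```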


Suggestions for OpenClaw

  1. Enhanced dependency self-check (openclaw doctor):
    When plugins like Feishu are enabled, auto-detect whether their underlying SDKs (e.g., @larksuiteoapi/node-sdk) are installed. Show a friendly npm install hint on missing dependencies instead of crashing at runtime.

  2. Dashboard proxy detection hint:
    When the Web UI encounters 1008: device nonce required and appears to be behind a reverse proxy, display a UI message: "WebSocket handshake failed. Your proxy may be stripping security headers. Try accessing via SSH local port forwarding instead."

  3. PM2 + NVM deployment best practice:
    For Linux + NVM users, recommend the shell wrapper script approach as the default PM2 setup in the documentation. This would prevent most "ghost crash" issues caused by missing environment variables. An example in the docs would save many users hours of debugging.
