
[Bug]: health-monitor stale-socket breaks webhook mode with self-signed certs #39303

@wanggold


Bug type

Regression (worked before, now fails)

Summary

Version: 2026.3.2 (stable), Node v22.22.1, arm64

Setup: 3 Telegram bots using webhook mode with self-signed cert (direct IP, nginx reverse proxy on port 8443)

Problem: health-monitor fires stale-socket every ~35 min even in webhook mode (no long-polling socket exists). When it restarts channels, it calls setWebhook WITHOUT re-uploading the self-signed certificate. Telegram then can't verify SSL → stops delivering webhooks → all bots go dead.

Logs (49 events in 7 hours):

[health-monitor] [telegram:default] restarting (reason: stale-socket)
[health-monitor] [telegram:dev] restarting (reason: stale-socket)
[health-monitor] [telegram:finance] restarting (reason: stale-socket)
[health-monitor] [whatsapp:default] restarting (reason: stale-socket)
(repeats every ~35 min)
After each restart: getWebhookInfo shows has_custom_certificate: false

Workaround: A cron job runs every minute, checks has_custom_certificate via getWebhookInfo, and re-registers the webhook with the certificate if the flag is false.
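For reference, the workaround can be sketched roughly as follows. This is a minimal sketch, not the exact script in use: it assumes Node 22 (global fetch, FormData, Blob), and the function names and the way the token, webhook URL, and PEM contents are passed in are placeholders chosen for illustration. The only Telegram Bot API methods used are getWebhookInfo and setWebhook with its certificate parameter.

```typescript
// Subset of the getWebhookInfo result that the workaround looks at.
interface WebhookInfo {
  url: string;
  has_custom_certificate: boolean;
}

// Pure decision step: re-register whenever Telegram no longer holds the custom cert.
function needsReRegistration(info: WebhookInfo): boolean {
  return !info.has_custom_certificate;
}

async function getWebhookInfo(token: string): Promise<WebhookInfo> {
  const res = await fetch(`https://api.telegram.org/bot${token}/getWebhookInfo`);
  const body = (await res.json()) as { ok: boolean; result?: WebhookInfo; description?: string };
  if (!body.ok || !body.result) {
    throw new Error(`getWebhookInfo failed: ${body.description ?? res.status}`);
  }
  return body.result;
}

// Uploads the self-signed PEM as multipart form data together with the webhook URL.
async function setWebhookWithCert(token: string, url: string, certPem: string): Promise<void> {
  const form = new FormData();
  form.append("url", url);
  form.append("certificate", new Blob([certPem]), "cert.pem");
  const res = await fetch(`https://api.telegram.org/bot${token}/setWebhook`, {
    method: "POST",
    body: form,
  });
  const body = (await res.json()) as { ok: boolean; description?: string };
  if (!body.ok) {
    throw new Error(`setWebhook failed: ${body.description ?? res.status}`);
  }
}

// One cron tick for one bot: returns true if the webhook had to be re-registered.
async function ensureWebhook(token: string, url: string, certPem: string): Promise<boolean> {
  const info = await getWebhookInfo(token);
  if (!needsReRegistration(info)) return false;
  await setWebhookWithCert(token, url, certPem);
  return true;
}
```

Each bot account needs its own call to ensureWebhook with its own token and webhook URL. Even with a 1-minute interval, updates can still be missed between a restart and the next tick.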

Suggested fix (any of these):

  1. Skip stale-socket detection in webhook mode (no socket to go stale)
  2. Persist + re-upload self-signed cert on channel restart
  3. Add config option webhookCertPath so cert is auto-included in every setWebhook call
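To illustrate option 1, here is a hedged sketch of a stale-socket check that is skipped for webhook channels. Every name in it (ChannelHealth, mode, lastSocketActivityMs, isStaleSocket) is hypothetical and not taken from the OpenClaw source, and the stale threshold is a parameter because the real value is not known from this report.

```typescript
// Hypothetical shape of what a health monitor might track per channel.
type TransportMode = "polling" | "webhook";

interface ChannelHealth {
  id: string;
  mode: TransportMode;
  lastSocketActivityMs: number; // timestamp of the last activity on the long-lived socket
}

// A webhook channel has no long-lived socket, so it can never be "stale" in this sense.
function isStaleSocket(channel: ChannelHealth, nowMs: number, staleAfterMs: number): boolean {
  if (channel.mode === "webhook") return false;
  return nowMs - channel.lastSocketActivityMs > staleAfterMs;
}
```

With a guard like this, a webhook channel would only be restarted for other reasons, so the certificate loss described above would not be triggered by stale-socket detection.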

Steps to reproduce

  1. Configure 3 Telegram bot accounts with webhook mode using self-signed certificate (direct IP, no domain)
  2. Set webhookUrl, webhookPort, webhookPath per account in openclaw.json
  3. Manually register webhooks via Telegram Bot API setWebhook with certificate parameter (self-signed PEM)
  4. Wait ~35 minutes

Expected behavior

In webhook mode, health-monitor should not perform stale-socket detection (there is no long-polling socket). If it must restart channels, it should re-upload the self-signed certificate when calling setWebhook.
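A sketch of what re-uploading the certificate on restart could look like, assuming the proposed webhookCertPath option existed. The TelegramAccountConfig shape and buildSetWebhookForm are hypothetical; only webhookUrl comes from the current config, and the form fields url and certificate are the documented setWebhook parameters. The file reader is injected so the sketch does not depend on a particular file API.

```typescript
// Hypothetical per-account config; webhookCertPath is the option proposed in this issue.
interface TelegramAccountConfig {
  webhookUrl: string;
  webhookCertPath?: string;
}

// Builds the multipart body for setWebhook, attaching the PEM whenever a cert path is configured.
function buildSetWebhookForm(
  account: TelegramAccountConfig,
  readPem: (path: string) => string,
): FormData {
  const form = new FormData();
  form.append("url", account.webhookUrl);
  if (account.webhookCertPath) {
    const pem = readPem(account.webhookCertPath);
    form.append("certificate", new Blob([pem]), "cert.pem");
  }
  return form;
}
```

If every code path that calls setWebhook, including the restart path, built its request this way, has_custom_certificate should stay true across restarts. On Node, readPem could simply wrap fs.readFileSync(path, "utf8").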

Actual behavior

health-monitor detects "stale-socket" every ~35 min even in webhook mode. When it restarts channels, it calls setWebhook WITHOUT re-uploading the self-signed certificate. Telegram then reports has_custom_certificate: false, SSL verification fails, and all webhook deliveries stop. All bots become unresponsive.

49 stale-socket events observed in 7 hours:

[health-monitor] [telegram:default] restarting (reason: stale-socket)
[health-monitor] [telegram:dev] restarting (reason: stale-socket)
[health-monitor] [telegram:finance] restarting (reason: stale-socket)
[health-monitor] [whatsapp:default] restarting (reason: stale-socket)
(repeats every ~35 min)

OpenClaw version

2026.3.2 (build 85377a2)

Operating system

Ubuntu 24.04 arm64 (AWS EC2 m7g.2xlarge)

Install method

npm global

Logs, screenshots, and evidence

# health-monitor fires stale-socket every ~35 min in webhook mode (49 events in 7 hours):

Mar 07 17:30:52 [health-monitor] [telegram:default] restarting (reason: stale-socket)
Mar 07 17:30:52 [health-monitor] [telegram:dev] restarting (reason: stale-socket)
Mar 07 17:30:52 [health-monitor] [telegram:finance] restarting (reason: stale-socket)
Mar 07 17:30:52 [health-monitor] [whatsapp:default] restarting (reason: stale-socket)
Mar 07 18:05:52 [health-monitor] [telegram:default] restarting (reason: stale-socket)
Mar 07 18:40:52 [health-monitor] [telegram:default] restarting (reason: stale-socket)
Mar 07 19:15:52 [health-monitor] [telegram:default] restarting (reason: stale-socket)
... (continues every ~35 min)

# After each restart, Telegram webhook cert is lost:
# getWebhookInfo shows has_custom_certificate: false
# Telegram stops delivering webhooks → all bots unresponsive

# Workaround: cron job every 1 min re-registers webhook with cert

Impact and severity

Affected: All Telegram webhook users with self-signed certificates (direct IP, no domain)
Severity: High (blocks all message delivery)
Frequency: 100% repro — triggers every ~35 minutes automatically
Consequence: All Telegram bots become completely unresponsive after each health-monitor restart. Users receive no replies. Messages sent during the dead window are lost. In our case, 49 outages in 7 hours across 3 bot accounts.

Additional information

First known bad version: 2026.3.2 (only version tested).

Root causes: (1) stale-socket detection runs in webhook mode, where no long-polling socket exists; (2) channel restart calls setWebhook without re-uploading the self-signed certificate.

Temporary workaround: a cron job running every 1 minute that checks getWebhookInfo for has_custom_certificate: false and re-registers the webhook with the cert file if needed.

Suggested fix: add a webhookCertPath config option and include the cert in every setWebhook call, or skip stale-socket detection when webhook mode is active.

Labels: bug, regression
