Clawdbot Gateway Crash Bug Report
Date: 2026-01-29
Reporter: Parker (@parkerati)
Clawdbot Version: 2026.1.24-3
Platform: macOS (Darwin 24.6.0)
Node Version: v22.22.0
Summary
The Clawdbot gateway crashes repeatedly due to unhandled promise rejections from network failures. Any failed HTTP request (Telegram API, web_fetch, etc.) causes the entire gateway process to terminate with no graceful recovery.
Severity: CRITICAL - Gateway requires manual restarts multiple times per session
Crash Timeline (2026-01-29)
Crash #1: ~00:16 EST (05:16 UTC)
- Trigger: Telegram setMyCommands API failures
- Pattern: Repeated network fetch failures starting at 05:11 UTC
- Result: Silent crash, no error logged for actual exit
Crash #2: ~00:48 EST (05:48 UTC)
- Trigger: Unknown (silent crash during normal operation)
- Last Log: 05:48:33 UTC - exec tool call, then process died
- Result: No error message, no exception logged
Crash #3: 01:27 EST (06:27 UTC)
- Trigger: web_fetch 403 error from Investing.com
- Log Entry:
06:15:41 [tools] web_fetch failed: Web fetch failed (403): Just a moment...
06:27:03 [clawdbot] Unhandled promise rejection: TypeError: fetch failed
at node:internal/deps/undici/undici:14902:13
at processTicksAndRejections (node:internal/process/task_queues:105:5)
Crash #4: 01:31 EST (06:31 UTC)
- Trigger: Unknown network fetch failure during normal operation
- Log Entry:
06:28:52 [hooks] loaded 3 internal hook handlers
06:28:53 [telegram] [default] starting provider (@lisaparkerbot)
06:29:07 [agent/embedded] Removed orphaned user message
06:31:25 [clawdbot] Unhandled promise rejection: TypeError: fetch failed
at node:internal/deps/undici/undici:14902:13
at processTicksAndRejections (node:internal/process/task_queues:105:5)
- Note: Crash occurred during normal conversation, not during tool use
Crash #5+: 01:36-01:38 EST
- Trigger: Local file exceptions / file operations
- Pattern: Gateway also crashes when local file operations fail or throw exceptions
- Note: Not just network failures - ANY unhandled exception crashes the gateway
Root Cause
Network operations (Telegram API, web_fetch, etc.) and local file operations produce unhandled promise rejections when they fail. Since Node.js 15, an unhandled rejection terminates the process by default.
Crash triggers include:
- Network fetch failures (Telegram API, web_fetch tool)
- Local file exceptions (reading non-existent files, permission errors)
- Any unhandled promise rejection from any operation
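The default Node.js behavior described above can be observed in isolation. The following is a minimal sketch, not taken from the Clawdbot codebase: it spawns a throwaway Node process whose only action is a rejected promise with no handler attached, using the same `TypeError: fetch failed` message seen in the logs. The helper name `runUnhandledRejection` is hypothetical.

```javascript
const { spawnSync } = require('node:child_process');

// Run a child Node process that rejects a promise without any .catch(),
// mirroring the "fetch failed" TypeError from the gateway logs.
function runUnhandledRejection() {
  const script = "Promise.reject(new TypeError('fetch failed'));";
  return spawnSync(process.execPath, ['-e', script], { encoding: 'utf8' });
}

const result = runUnhandledRejection();
console.log('child exit code:', result.status);
console.log('stderr mentions the error:', result.stderr.includes('fetch failed'));
```

On Node.js 15 or later with default settings, the child should exit with a non-zero code and print the error to stderr, which is consistent with the crash pattern in this report.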
Example Log Pattern (Telegram crashes):
{
"subsystem": "gateway/channels/telegram",
"message": "telegram setMyCommands failed: HttpError: Network request for 'setMyCommands' failed!",
"logLevelName": "ERROR",
"time": "2026-01-29T05:11:13.656Z"
}
{
"message": "Unhandled promise rejection: TypeError: fetch failed\n at node:internal/deps/undici/undici:14902:13\n at processTicksAndRejections (node:internal/process/task_queues:105:5)",
"logLevelName": "ERROR",
"time": "2026-01-29T05:11:13.656Z"
}
This pattern repeated 10+ times between 05:11 and 05:22 UTC, with gateway crash-looping until Telegram channel was disabled.
Reproduction Steps
- Start gateway with Telegram enabled
- Trigger network failure (disconnect internet, block Telegram API, etc.)
- Gateway attempts Telegram API call on startup
- API call fails with network error
- Unhandled promise rejection crashes entire gateway process
Alternative: Use the web_fetch tool on a URL that returns 403 or times out → same crash pattern
Impact
User Experience
- Gateway requires manual restart after each crash
- Web UI disconnects and cannot reconnect until manual restart
- Telegram channel becomes unusable
- No automatic recovery despite LaunchAgent supervision (stale locks prevent restart)
Current Workarounds
- Disable Telegram channel temporarily
- Avoid web_fetch tools on unreliable endpoints
- Manual restarts via clawdbot gateway stop && clawdbot gateway start
Expected Behavior
Network failures should:
- Be caught and logged - not crash the process
- Retry with backoff - especially for startup operations like Telegram init
- Gracefully degrade - disable failing channel/tool instead of killing gateway
- Clean up locks - allow supervisor to restart if crash occurs
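As an illustration of the "retry with backoff" point, here is a minimal sketch of a retry helper. It is an assumption about how this could be done, not Clawdbot's actual code: `retryWithBackoff`, its option names, and the result shape are all hypothetical. It returns a result object instead of throwing, so a caller can disable the failing channel rather than let the error propagate.

```javascript
// Resolves after the given number of milliseconds.
function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Hypothetical helper: runs `operation` up to `retries + 1` times,
// doubling the wait between attempts (baseDelayMs, 2x, 4x, ...).
// Never throws; the final failure is returned as { ok: false, error }.
async function retryWithBackoff(
  operation,
  { retries = 3, baseDelayMs = 500, sleep = defaultSleep } = {}
) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return { ok: true, value: await operation(attempt) };
    } catch (error) {
      lastError = error;
      if (attempt < retries) {
        await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  return { ok: false, error: lastError };
}
```

A startup call might then look like `const r = await retryWithBackoff(() => telegram.setMyCommands(commands));`, followed by marking the channel as disabled when `r.ok` is false.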
Suggested Fixes
1. Global Unhandled Rejection Handler
Add process-level handler to catch and log unhandled rejections:
process.on('unhandledRejection', (reason, promise) => {
logger.error('Unhandled Promise Rejection:', reason);
// Don't exit - log and continue
});
2. Wrap Network Operations
All fetch/HTTP operations should use try-catch or .catch():
// Telegram init
try {
await telegram.setMyCommands(commands);
} catch (error) {
logger.error('Telegram init failed:', error);
// Disable channel or retry, don't throw
}
// web_fetch tool
async function webFetch(url) {
try {
return await fetch(url);
} catch (error) {
logger.error(`web_fetch failed for ${url}:`, error);
return { status: 'error', error: error.message };
}
}
3. Startup Resilience
Channel initialization should not block gateway startup:
- Try to init channels asynchronously
- If channel fails to init, mark as disabled and log error
- Continue gateway startup with remaining channels
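The three steps above could be sketched with `Promise.allSettled`, which waits for every channel to finish initializing and reports failures without rejecting. The channel shape (`{ name, init }`) and the function name `initChannels` are assumptions made for illustration; the gateway's real channel interface may differ.

```javascript
// Hypothetical channel shape: { name: string, init: () => Promise<void> }.
// Initializes all channels concurrently and never rejects: failed channels
// are logged and reported as disabled instead of aborting startup.
async function initChannels(channels, logger = console) {
  const results = await Promise.allSettled(
    // The async wrapper turns a synchronous throw from init() into a rejection.
    channels.map(async (channel) => channel.init())
  );

  const enabled = [];
  const disabled = [];
  results.forEach((result, index) => {
    const { name } = channels[index];
    if (result.status === 'fulfilled') {
      enabled.push(name);
    } else {
      logger.error(`channel ${name} failed to init, disabling:`, result.reason);
      disabled.push(name);
    }
  });
  return { enabled, disabled };
}
```

The gateway would continue starting with the `enabled` list, and could retry the `disabled` channels later.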
4. Lock File Cleanup
On crash, stale lock files prevent LaunchAgent auto-restart. Either:
- Use process monitoring instead of file locks
- Clean stale locks on startup (check if PID is actually running)
- Implement lock timeout/expiration
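For the "clean stale locks on startup" option, a minimal sketch follows. It assumes the lock file contains only the owning process's PID, which is an assumption about the lock format and not something confirmed in this report; `isProcessRunning` and `removeStaleLock` are hypothetical names. The liveness check uses `process.kill(pid, 0)`, which tests whether a process exists without sending it a signal.

```javascript
const fs = require('node:fs');

// Returns true if a process with this PID currently exists.
function isProcessRunning(pid) {
  try {
    process.kill(pid, 0); // signal 0: existence check only, nothing is delivered
    return true;
  } catch (error) {
    // EPERM means the process exists but is owned by another user.
    return error.code === 'EPERM';
  }
}

// Assumed lock format: a text file holding the owner's PID.
// Returns 'absent' (no lock), 'held' (owner alive), or 'removed' (stale).
function removeStaleLock(lockPath) {
  let contents;
  try {
    contents = fs.readFileSync(lockPath, 'utf8');
  } catch (error) {
    if (error.code === 'ENOENT') return 'absent';
    throw error;
  }
  const pid = Number.parseInt(contents.trim(), 10);
  if (Number.isInteger(pid) && pid > 0 && isProcessRunning(pid)) {
    return 'held';
  }
  // Unparseable PID or dead owner: treat the lock as stale.
  fs.rmSync(lockPath, { force: true });
  return 'removed';
}
```

A PID check alone can be fooled by PID reuse after a reboot or a long uptime, so combining it with a lock timeout, as listed above, would be more robust.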
Additional Context
LaunchAgent Configuration
Gateway is supervised by macOS LaunchAgent with KeepAlive = true, but auto-restart fails due to stale lock conflicts.
System Resources
Not a resource issue - crashes happen with plenty of available memory/CPU. Purely error handling problem.
Frequency
Tonight (2026-01-29): 5+ crashes in ~1.5 hours of active use. Gateway completely unstable - requires manual restart every 10-15 minutes on average. System is unusable for production.
Log Files
Full logs available at:
- /tmp/clawdbot/clawdbot-2026-01-29.log
- /Users/parker/.clawdbot/logs/gateway.log
- /Users/parker/.clawdbot/logs/gateway.err.log
Relevant excerpts included above.
Priority
CRITICAL - Gateway is unusable in production without constant manual intervention. This affects:
- All channel integrations (Telegram, etc.)
- Tool reliability (web_fetch, web_search)
- User confidence in system stability