[Bug]: Clawdbot Gateway Crashes Repeatedly

# Clawdbot Gateway Crash Bug Report

**Date:** 2026-01-29  
**Reporter:** Parker (@parkerati)  
**Clawdbot Version:** 2026.1.24-3  
**Platform:** macOS (Darwin 24.6.0)  
**Node Version:** v22.22.0

---

## Summary

The Clawdbot gateway crashes repeatedly due to unhandled promise rejections, most often from network failures. Any failed HTTP request (Telegram API, web_fetch, etc.), and also failed local file operations, can terminate the entire gateway process with no graceful recovery.

**Severity:** CRITICAL - Gateway requires manual restarts multiple times per session

---

## Crash Timeline (2026-01-29)

### Crash #1: ~00:16 EST (05:16 UTC)
- **Trigger:** Telegram `setMyCommands` API failures
- **Pattern:** Repeated network fetch failures starting at 05:11 UTC
- **Result:** Silent crash; no error was logged for the actual exit

### Crash #2: ~00:48 EST (05:48 UTC)
- **Trigger:** Unknown (silent crash during normal operation)
- **Last Log:** 05:48:33 UTC - exec tool call, then process died
- **Result:** No error message, no exception logged

### Crash #3: 01:27 EST (06:27 UTC)
- **Trigger:** web_fetch 403 error from Investing.com (suspected; the 403 was logged at 06:15:41, about 11 minutes before the unhandled rejection)
- **Log Entry:**
```
06:15:41 [tools] web_fetch failed: Web fetch failed (403): Just a moment...
06:27:03 [clawdbot] Unhandled promise rejection: TypeError: fetch failed
    at node:internal/deps/undici/undici:14902:13
    at processTicksAndRejections (node:internal/process/task_queues:105:5)
```

### Crash #4: 01:31 EST (06:31 UTC)
- **Trigger:** Unknown network fetch failure during normal operation
- **Log Entry:**
```
06:28:52 [hooks] loaded 3 internal hook handlers
06:28:53 [telegram] [default] starting provider (@lisaparkerbot)
06:29:07 [agent/embedded] Removed orphaned user message
06:31:25 [clawdbot] Unhandled promise rejection: TypeError: fetch failed
    at node:internal/deps/undici/undici:14902:13
    at processTicksAndRejections (node:internal/process/task_queues:105:5)
```
- **Note:** Crash occurred during normal conversation, not during tool use

### Crash #5+: 01:36-01:38 EST (06:36-06:38 UTC)
- **Trigger:** Failed local file operations
- **Pattern:** The gateway also crashes when local file operations fail or throw exceptions
- **Note:** Not just network failures; ANY unhandled exception crashes the gateway

---

## Root Cause

Network operations (Telegram API, web_fetch, etc.) AND local file operations produce promise rejections that nothing handles when they fail. Since Node.js 15, the default `--unhandled-rejections=throw` mode treats an unhandled rejection as an uncaught exception, which terminates the process.

**Crash triggers include:**
- Network fetch failures (Telegram API, web_fetch tool)
- Local file exceptions (reading non-existent files, permission errors)
- Any unhandled promise rejection from any operation
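The default termination behavior described above can be checked in isolation. The sketch below assumes Node.js 15 or later (the reporter runs v22.22.0). It runs two tiny scripts in child processes, one with a floating rejection and one that installs an `unhandledRejection` handler first, and then compares their exit codes. The helper name `exitCodeOf` is illustrative only and is not part of Clawdbot.

```javascript
const { spawnSync } = require('node:child_process');

// Run a JavaScript snippet in a fresh Node.js process and return its exit code.
function exitCodeOf(snippet) {
  return spawnSync(process.execPath, ['-e', snippet], { encoding: 'utf8' }).status;
}

// A rejection that nothing awaits or .catch()es, like the "fetch failed" rejections in the logs.
const floatingRejection = "Promise.reject(new TypeError('fetch failed'));";

// Default mode: the rejection is raised as an uncaught exception, so the process exits with code 1.
const unhandledExit = exitCodeOf(floatingRejection);

// Same rejection after installing a process-level handler: it is only logged, so the exit code is 0.
const handledExit = exitCodeOf(
  "process.on('unhandledRejection', (r) => console.error('logged:', r.message));" +
    floatingRejection
);

console.log({ unhandledExit, handledExit }); // { unhandledExit: 1, handledExit: 0 }
```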

### Example Log Pattern (Telegram crashes):

```json
{
  "subsystem": "gateway/channels/telegram",
  "message": "telegram setMyCommands failed: HttpError: Network request for 'setMyCommands' failed!",
  "logLevelName": "ERROR",
  "time": "2026-01-29T05:11:13.656Z"
}

{
  "message": "Unhandled promise rejection: TypeError: fetch failed\n    at node:internal/deps/undici/undici:14902:13\n    at processTicksAndRejections (node:internal/process/task_queues:105:5)",
  "logLevelName": "ERROR",
  "time": "2026-01-29T05:11:13.656Z"
}
```

This pattern repeated 10+ times between 05:11 and 05:22 UTC, with the gateway crash-looping until the Telegram channel was disabled.

---

## Reproduction Steps

1. Start the gateway with Telegram enabled
2. Trigger a network failure (disconnect from the internet, block the Telegram API, etc.)
3. The gateway attempts a Telegram API call (`setMyCommands`) on startup
4. The API call fails with a network error
5. The unhandled promise rejection crashes the entire gateway process

**Alternative:** Use the web_fetch tool on a URL that returns a 403 or times out → same crash pattern
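For a local reproduction of the underlying error that needs neither a Telegram token nor an external site, the sketch below points Node's built-in `fetch` at a local port with no listener. The helper names `closedPort` and `reproduceFetchFailure` are illustrative and not part of Clawdbot. Assuming Node.js 18 or later, this should yield the same `TypeError: fetch failed` that appears in the crash logs. Here the rejection is caught, so it does not exit the process.

```javascript
const net = require('node:net');

// Find a local TCP port with nothing listening: let the OS assign a free
// port (port 0), then close the listener again.
function closedPort() {
  return new Promise((resolve, reject) => {
    const server = net.createServer();
    server.once('error', reject);
    server.listen(0, '127.0.0.1', () => {
      const { port } = server.address();
      server.close(() => resolve(port));
    });
  });
}

// Simulate step 2 (network failure): the connection is refused, so fetch()
// rejects instead of resolving with a response.
async function reproduceFetchFailure() {
  const port = await closedPort();
  try {
    await fetch(`http://127.0.0.1:${port}/`);
    return 'request unexpectedly succeeded';
  } catch (error) {
    // Node's built-in fetch (undici) reports this as "TypeError: fetch failed";
    // the underlying system error (e.g. ECONNREFUSED) is in error.cause.
    return `${error.name}: ${error.message} (cause: ${error.cause?.code})`;
  }
}

reproduceFetchFailure().then(console.log);
```

Removing the `try`/`catch` (or calling `fetch` without awaiting it) turns this into the crash described in the steps above.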

---

## Impact

### User Experience
- Gateway requires manual restart after each crash
- Web UI disconnects and cannot reconnect until manual restart
- Telegram channel becomes unusable
- No automatic recovery despite LaunchAgent supervision (stale locks prevent restart)

### Current Workarounds
1. Disable Telegram channel temporarily
2. Avoid web_fetch tools on unreliable endpoints
3. Manual restarts via `clawdbot gateway stop && clawdbot gateway start`

---

## Expected Behavior

Network failures should:
1. **Be caught and logged** - not crash the process
2. **Retry with backoff** - especially for startup operations like Telegram init
3. **Gracefully degrade** - disable failing channel/tool instead of killing gateway
4. **Clean up locks** - allow the supervisor to restart the gateway if a crash does occur
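Point 2 (retry with backoff) could look like the following. This is a minimal sketch, not Clawdbot code. `retryWithBackoff` and its option names are hypothetical, and it uses plain exponential backoff without jitter. Each failure is logged. After the last attempt, the error is rethrown so that the caller can decide to disable the channel (point 3) rather than let the rejection go unhandled.

```javascript
// Retry an async operation with exponential backoff (no jitter). Each failed
// attempt is logged; after `retries` extra attempts the last error is rethrown,
// so the caller must still catch it (e.g. to disable the channel).
async function retryWithBackoff(operation, { retries = 5, baseDelayMs = 500, maxDelayMs = 30_000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      if (attempt >= retries) throw error;
      // With the defaults: 500 ms, 1 s, 2 s, 4 s, ... capped at maxDelayMs
      const delayMs = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
      console.error(`attempt ${attempt + 1} failed (${error.message}); retrying in ${delayMs} ms`);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

At startup, the `telegram.setMyCommands(commands)` call from the Suggested Fixes section might then become `await retryWithBackoff(() => telegram.setMyCommands(commands))`, placed inside a `try`/`catch` that disables the channel if every attempt fails.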

---

## Suggested Fixes

### 1. Global Unhandled Rejection Handler
Add a process-level handler to catch and log unhandled rejections:

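```javascript
// Process-level safety net. Without a handler, Node.js (15+) treats an
// unhandled rejection as an uncaught exception and exits.
process.on('unhandledRejection', (reason) => {
  console.error('Unhandled promise rejection:', reason);
  // Don't exit: log and continue. This is a last resort; failing calls
  // should still be caught where they happen (fix 2).
});
```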

### 2. Wrap Network Operations
All fetch/HTTP operations should be wrapped in `try`/`catch` or use `.catch()`:

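```javascript
// Telegram init: log a failed setMyCommands call instead of rethrowing it
async function initTelegram(telegram, commands) {
  try {
    await telegram.setMyCommands(commands);
  } catch (error) {
    console.error('Telegram init failed:', error);
    // Disable channel or retry, don't throw
  }
}

// web_fetch tool: fetch() rejects only on network-level failures
// ("TypeError: fetch failed"); HTTP errors such as 403 resolve normally
async function webFetch(url) {
  try {
    const response = await fetch(url);
    if (!response.ok) {
      return { status: 'error', error: `HTTP ${response.status}` };
    }
    return response;
  } catch (error) {
    console.error(`web_fetch failed for ${url}:`, error);
    return { status: 'error', error: error.message };
  }
}
```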
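The report lists timeouts among the web_fetch triggers, so a variant that also limits how long a request may hang could be useful. The hypothetical `safeFetch` below is not Clawdbot's API. It never rejects and instead labels each failure as an HTTP error status, a network error, or a timeout. It assumes a Node.js version with global `fetch` and `AbortSignal.timeout()` (both available in v22).

```javascript
// Hypothetical wrapper (not Clawdbot's API): never rejects, and labels the
// failure kinds this report mentions: HTTP error status, network error, timeout.
async function safeFetch(url, timeoutMs = 10_000) {
  try {
    const response = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    if (!response.ok) {
      await response.body?.cancel(); // body not needed; release the connection
      return { ok: false, kind: 'http', status: response.status };
    }
    return { ok: true, status: response.status, text: await response.text() };
  } catch (error) {
    // AbortSignal.timeout() rejects fetch with a DOMException named "TimeoutError";
    // anything else (e.g. "TypeError: fetch failed") is treated as a network failure.
    const kind = error.name === 'TimeoutError' ? 'timeout' : 'network';
    return { ok: false, kind, error: error.message };
  }
}
```

Returning a result object instead of throwing means a single forgotten `await` or `.catch()` at the call site cannot turn into an unhandled rejection.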

### 3. Startup Resilience
Channel initialization should not block gateway startup:
- Try to init channels asynchronously
- If channel fails to init, mark as disabled and log error
- Continue gateway startup with remaining channels
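The three startup steps above can be sketched with `Promise.allSettled`. The channel shape (`{ name, init }`) and the function name `startChannels` are assumptions made for illustration, not Clawdbot's actual internals. All channels initialize concurrently, and a failure disables only the channel that failed.

```javascript
// Hypothetical channel shape: { name: string, init: () => Promise<void> }.
// Every channel is initialized concurrently; a failure disables only that channel.
async function startChannels(channels) {
  // The async wrapper turns a synchronous throw inside init() into a rejection too.
  const results = await Promise.allSettled(channels.map(async (channel) => channel.init()));
  const active = [];
  results.forEach((result, i) => {
    if (result.status === 'fulfilled') {
      active.push(channels[i].name);
    } else {
      console.error(`channel ${channels[i].name} failed to init; disabling it:`, result.reason);
    }
  });
  return active; // names of channels that started successfully
}
```

To keep initialization from blocking startup, the gateway could begin serving first and call `startChannels(...)` without awaiting it. In that case, the returned promise still needs a `.then()`/`.catch()` so that it is never left unhandled.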

### 4. Lock File Cleanup
After a crash, stale lock files prevent the LaunchAgent from restarting the gateway. Options:
- Use process monitoring instead of file locks
- Clean stale locks on startup (check if PID is actually running)
- Implement lock timeout/expiration
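The second option (check whether the lock's PID is still running) could work as shown below. This is a sketch under an assumed lock format, namely a file that contains only the owner's PID as text. Clawdbot's real lock format and path are not known here, and `isProcessRunning`/`clearStaleLock` are hypothetical names. It relies on `process.kill(pid, 0)`, which checks whether a process exists without sending it a signal. PID reuse can still cause a false "running" result, which is one reason to also consider the timeout option.

```javascript
const fs = require('node:fs');

// True if a process with this PID exists. Signal 0 only performs the
// existence/permission check; nothing is actually sent.
function isProcessRunning(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (error) {
    // EPERM: the process exists but belongs to another user. ESRCH: no such process.
    return error.code === 'EPERM';
  }
}

// Assumed lock format: the file contains the owner's PID as text.
// Removes the lock if its owner is gone (or the PID is unreadable) and
// returns true if the lock is now free, false if a live process holds it.
function clearStaleLock(lockPath) {
  let pid;
  try {
    pid = Number.parseInt(fs.readFileSync(lockPath, 'utf8'), 10);
  } catch (error) {
    if (error.code === 'ENOENT') return true; // no lock file at all
    throw error;
  }
  if (Number.isInteger(pid) && pid > 0 && isProcessRunning(pid)) return false;
  fs.rmSync(lockPath, { force: true });
  return true;
}
```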

---

## Additional Context

### LaunchAgent Configuration
The gateway is supervised by a macOS LaunchAgent with `KeepAlive = true`, but auto-restart fails because of stale lock conflicts.

### System Resources
This is not a resource issue: crashes happen with plenty of available memory and CPU. It is purely an error-handling problem.

### Frequency
Tonight (2026-01-29): **5+ crashes in ~1.5 hours of active use**. The gateway is completely unstable and requires a manual restart every 10-15 minutes on average. The system is unusable for production.

---

## Log Files

Full logs available at:
- `/tmp/clawdbot/clawdbot-2026-01-29.log`
- `/Users/parker/.clawdbot/logs/gateway.log`
- `/Users/parker/.clawdbot/logs/gateway.err.log`

Relevant excerpts included above.

---

## Priority

**CRITICAL** - Gateway is unusable in production without constant manual intervention. This affects:
- All channel integrations (Telegram, etc.)
- Tool reliability (web_fetch, web_search)
- User confidence in system stability

---


