Description
Problem
After a SIGUSR1 in-process restart, the gateway can enter a permanent zombie state where:
- Heartbeat scheduler stops firing — `[heartbeat] started` logs but no heartbeats actually execute
- Incoming messages are received but never processed — Signal messages arrive via signal-cli but are permanently queued
- Cron scheduler keeps polling — the event loop stays alive, masking the problem
- Only a SIGTERM (full process restart) recovers — the zombie state persists indefinitely
This happens when SIGUSR1 fires while work is in-flight (heartbeat running, messages being processed).
Root Cause
Two pieces of module-level state in the dist bundle survive the in-process restart but don't get reset:
1. `running` flag in `heartbeat-wake.ts`
The `running` flag is set to `true` when a heartbeat handler is executing. If SIGUSR1 fires mid-execution:
- The in-flight handler may be abandoned before its `finally` block runs
- `running` stays `true` in the new lifecycle
- Every `schedule()` call sees `running === true` → sets `scheduled = true` → re-calls `schedule()` → but the timer callback just loops on the `running` check forever
- Heartbeats never fire again
Note: The upstream drain mechanism (`waitForActiveTasks`) does not cover heartbeat runs — heartbeats execute directly via the wake handler, not through command queue lanes.
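The stuck loop is easier to see in a reduced model. The sketch below is not the actual `heartbeat-wake.ts` source; it is a minimal reconstruction of the scheduling pattern described above, with the same flag names but simplified signatures and no retry handling.

```ts
// Minimal sketch of the wedged scheduler (simplified; the real module also
// tracks timerDueAt/timerKind and a handler generation).
type WakeHandler = () => Promise<void>;

let handler: WakeHandler | null = null;
let running = false;   // module-level: survives an in-process restart
let scheduled = false;
let timer: ReturnType<typeof setTimeout> | null = null;

function schedule(delayMs: number): void {
  if (running) {
    // A handler is believed to be in flight: remember that another run is
    // wanted and bail out. If the abandoned handler never cleared `running`,
    // every future call ends here.
    scheduled = true;
    return;
  }
  if (timer) {
    return;
  }
  timer = setTimeout(async () => {
    timer = null;
    running = true;
    try {
      await handler?.();
    } finally {
      // Never reached if the old lifecycle abandons the handler mid-flight,
      // which is exactly how `running` gets stuck at `true`.
      running = false;
      if (scheduled) {
        scheduled = false;
        schedule(delayMs);
      }
    }
  }, delayMs);
}
```

Once `running` is stuck at `true`, every path through `schedule()` only toggles `scheduled`, and the timer callback that would clear the flag never runs again.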
2. Lane `active` counters in `command-queue.ts`
Lane states track `active` (number of in-flight tasks) and use `maxConcurrent` to limit parallelism. If SIGUSR1 fires while tasks are executing:
- `waitForActiveTasks` has a 30-second timeout — if tasks don't finish, the restart proceeds anyway
- Interrupted/abandoned tasks may never decrement `active`
- The new lifecycle inherits elevated `active` counts
- `drainLane()` sees `active >= maxConcurrent` → never dequeues new work → messages permanently stuck (see the sketch below)
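For context, here is a hedged sketch of how a stale counter starves a lane; the `LaneState` shape below is illustrative and omits fields like `activeTaskIds` and `draining` that the real module tracks.

```ts
// Illustrative lane model, not the actual command-queue.ts types.
interface LaneState {
  queue: Array<() => Promise<void>>;
  active: number;        // in-flight task count, module-level across restarts
  maxConcurrent: number;
}

const lanes = new Map<string, LaneState>();

function drainLane(lane: string): void {
  const state = lanes.get(lane);
  if (!state) {
    return;
  }
  // If `active` was inherited from a previous lifecycle and never decremented,
  // this guard stays false and queued work is never dequeued.
  while (state.active < state.maxConcurrent && state.queue.length > 0) {
    const task = state.queue.shift()!;
    state.active += 1;
    void task().finally(() => {
      state.active -= 1; // an abandoned/interrupted task never reaches this line
      drainLane(lane);
    });
  }
}
```

With `active` stuck at or above `maxConcurrent`, the guard is false on every drain attempt, which matches the permanently queued messages seen after the restart.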
The 2026.2.12 drain fix (#13931) helps but doesn't fully solve this because:
- The drain has a finite timeout (30s) — tasks that don't complete in time leave stale counters
- Race window between drain completion and new lifecycle startup
- Heartbeat runs bypass the drain entirely
Fix
`heartbeat-wake.ts` — Reset state on new handler registration
In `setHeartbeatWakeHandler()`, when registering a new (non-null) handler:
- Clear stale timer metadata (`timer`, `timerDueAt`, `timerKind`) so old retry cooldowns do not delay new lifecycle work
- Reset the module-level `running` and `scheduled` flags
- Call `resetAllLanes()` to recover command queues
This is safe because:
- A new non-null handler is only registered during lifecycle startup
- The old handler has been disposed (generation-based disposer from #15108)
- Any old in-flight work is either drained or abandoned
```ts
export function setHeartbeatWakeHandler(next: HeartbeatWakeHandler | null): () => void {
  handlerGeneration += 1;
  const generation = handlerGeneration;
  handler = next;
  if (next) {
    // Clear stale timer metadata from the previous lifecycle so old retry
    // cooldowns cannot delay the new lifecycle's first heartbeat.
    if (timer) {
      clearTimeout(timer);
    }
    timer = null;
    timerDueAt = null;
    timerKind = null;
    // Reset the flags that can survive an in-process restart.
    running = false;
    scheduled = false;
    // Recover command queues whose active counters may be stale.
    resetAllLanes();
  }
  if (handler && pendingWake) {
    schedule(DEFAULT_COALESCE_MS, "normal");
  }
  // ... disposer unchanged
}
```

`command-queue.ts` — Add a `resetAllLanes()` function
Add and export a function that resets lane runtime counters and immediately re-drains queued work:
```ts
export function resetAllLanes(): void {
  for (const [lane, state] of lanes) {
    // Drop runtime counters inherited from the previous lifecycle, then
    // immediately re-drain so queued work starts flowing again.
    state.active = 0;
    state.activeTaskIds.clear();
    state.draining = false;
    drainLane(lane);
  }
}
```

This intentionally does not clear lane queues; pending user work should still run after restart.
Orphaned process recovery
Add a utility script that scans for orphaned coding agent processes (Claude Code, Codex CLI) after gateway restarts. These background processes can outlive the gateway session that spawned them.
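A rough sketch of what such a scanner could look like is below; the process-name patterns, the parent-PID heuristic (orphans get re-parented to PID 1), and the report-only behavior are illustrative assumptions, not the shipped script.

```ts
// Sketch of an orphan scanner. Assumptions: a process whose parent is PID 1
// and whose command line matches a known agent pattern is treated as orphaned;
// the patterns below are examples only.
import { execFileSync } from "node:child_process";

const AGENT_PATTERNS = [/claude/i, /codex/i];

interface ProcInfo {
  pid: number;
  ppid: number;
  command: string;
}

function listProcesses(): ProcInfo[] {
  // `ps -eo pid=,ppid=,args=` works on Linux/macOS; adjust per platform.
  const out = execFileSync("ps", ["-eo", "pid=,ppid=,args="], { encoding: "utf8" });
  return out
    .split("\n")
    .map((line) => line.trim())
    .filter(Boolean)
    .map((line) => {
      const [pid, ppid, ...rest] = line.split(/\s+/);
      return { pid: Number(pid), ppid: Number(ppid), command: rest.join(" ") };
    });
}

function findOrphanedAgents(): ProcInfo[] {
  return listProcesses().filter(
    (p) => p.ppid === 1 && AGENT_PATTERNS.some((re) => re.test(p.command)),
  );
}

for (const proc of findOrphanedAgents()) {
  console.log(`orphaned agent candidate: pid=${proc.pid} cmd=${proc.command}`);
  // Intentionally report-only; killing candidates is left to the operator.
}
```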
Reproduction
- Start a heartbeat-enabled gateway with `commands.restart: true`
- Wait for a heartbeat to start executing (or send a message to create in-flight work)
- Send SIGUSR1: `kill -USR1 $(pgrep -f openclaw-gateway)`
- Observe: `[heartbeat] started` appears in logs, but no heartbeat polls fire
- Send a Signal message → never processed, no response
- Only `systemctl restart openclaw` (SIGTERM) recovers
Related
- Heartbeat scheduler dies silently when runOnce() throws during session compaction #14892 — Original heartbeat death report
- fix: prevent heartbeat scheduler death when runOnce throws #14901 — Heartbeat stall fix (error handling + retry)
- fix: prevent heartbeat scheduler silent death from wake handler race #15108 — Wake handler race fix (generation-based disposer)
- fix(gateway): drain active turns before restart to prevent message loss #13931 — Drain active turns before restart
This is the remaining piece: resetting in-flight state that survives the in-process restart boundary.