
SIGUSR1 in-process restart leaves gateway in zombie state #15177

@joeykrug

Description


Problem

After a SIGUSR1 in-process restart, the gateway can enter a permanent zombie state where:

  1. Heartbeat scheduler stops firing: [heartbeat] started appears in the logs, but no heartbeats actually execute
  2. Incoming messages are received but never processed — Signal messages arrive via signal-cli but are permanently queued
  3. Cron scheduler keeps polling — the event loop stays alive, masking the problem
  4. Only a SIGTERM (full process restart) recovers — the zombie state persists indefinitely

This happens when SIGUSR1 fires while work is in-flight (heartbeat running, messages being processed).

Root Cause

Two pieces of module-level state in the dist bundle survive the in-process restart but don't get reset:

1. running flag in heartbeat-wake.ts

The running flag is set to true when a heartbeat handler is executing. If SIGUSR1 fires mid-execution:

  • The in-flight handler may be abandoned before its finally block runs
  • running stays true in the new lifecycle
  • Every schedule() call sees running === true → sets scheduled = true → re-calls schedule() → but the timer callback just loops on the running check forever
  • Heartbeats never fire again

Note: The upstream drain mechanism (waitForActiveTasks) does not cover heartbeat runs — heartbeats execute directly via the wake handler, not through command queue lanes.
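
For illustration, here is a minimal sketch of the module-level state involved and how it gets stuck. The variable and function names come from the issue above; the body of schedule() is an assumption about the general shape, not the actual implementation:

// heartbeat-wake.ts (simplified sketch, not the real module)
let handler: (() => Promise<void>) | null = null;
let running = false;   // true while a heartbeat handler is executing
let scheduled = false; // a wake is parked behind the running handler
let timer: ReturnType<typeof setTimeout> | null = null;

function schedule(delayMs: number): void {
  if (running) {
    // A handler is believed to be in flight, so defer instead of firing.
    scheduled = true;
    return;
  }
  if (timer) {
    return;
  }
  timer = setTimeout(async () => {
    timer = null;
    running = true;
    try {
      await handler?.();
    } finally {
      // Never reached if SIGUSR1 abandons the old lifecycle mid-run,
      // so `running` stays true in the new lifecycle.
      running = false;
      if (scheduled) {
        scheduled = false;
        schedule(delayMs);
      }
    }
  }, delayMs);
}

With running stuck at true, every later schedule() call takes the early-return branch and no heartbeat ever fires again, which matches the behavior described above.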

2. Lane active counters in command-queue.ts

Lane states track active (number of in-flight tasks) and use maxConcurrent to limit parallelism. If SIGUSR1 fires while tasks are executing:

  • waitForActiveTasks has a 30-second timeout — if tasks don't finish, restart proceeds anyway
  • Interrupted/abandoned tasks may never decrement active
  • New lifecycle inherits elevated active counts
  • drainLane() sees active >= maxConcurrent → never dequeues new work → messages permanently stuck
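
A minimal sketch of the gating described above. The LaneState field names follow the issue; drainLane()'s body is an assumption for illustration only:

// command-queue.ts (simplified sketch)
interface LaneState {
  active: number;                     // in-flight task count
  maxConcurrent: number;              // concurrency limit for the lane
  activeTaskIds: Set<string>;
  draining: boolean;
  queue: Array<() => Promise<void>>;  // queued, not-yet-started tasks
}

const lanes = new Map<string, LaneState>();

function drainLane(lane: string): void {
  const state = lanes.get(lane);
  if (!state) {
    return;
  }
  // A stale `active` count inherited from the old lifecycle makes this
  // condition permanently false, so queued work is never dequeued.
  while (state.active < state.maxConcurrent && state.queue.length > 0) {
    const task = state.queue.shift()!;
    state.active += 1;
    void task().finally(() => {
      state.active -= 1; // skipped for tasks abandoned across the restart
      drainLane(lane);
    });
  }
}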

The 2026.2.12 drain fix (#13931) helps but doesn't fully solve this because:

  • The drain has a finite timeout (30s) — tasks that don't complete in time leave stale counters
  • Race window between drain completion and new lifecycle startup
  • Heartbeat runs bypass the drain entirely

Fix

heartbeat-wake.ts — Reset state on new handler registration

In setHeartbeatWakeHandler(), when registering a new (non-null) handler:

  • Clear stale timer metadata (timer, timerDueAt, timerKind) so old retry cooldowns do not delay new lifecycle work
  • Reset module-level running and scheduled flags
  • Call resetAllLanes() to recover command queues

This is safe because the reset only runs when a new non-null handler is registered, i.e. at the start of a new lifecycle, and the handlerGeneration bump already guards against stale callbacks from the old lifecycle acting on the new state:

export function setHeartbeatWakeHandler(next: HeartbeatWakeHandler | null): () => void {
  handlerGeneration += 1;
  const generation = handlerGeneration;
  handler = next;
  if (next) {
    // A new lifecycle is registering its handler: discard any state left
    // behind by an in-flight run abandoned in the previous lifecycle.
    if (timer) {
      clearTimeout(timer);
    }
    timer = null;
    timerDueAt = null;
    timerKind = null;
    running = false;
    scheduled = false;
    // Recover command-queue lanes whose counters were abandoned.
    resetAllLanes();
  }
  if (handler && pendingWake) {
    schedule(DEFAULT_COALESCE_MS, "normal");
  }
  // ... disposer unchanged
}

command-queue.ts — Add resetAllLanes() function

Add and export a function that resets lane runtime counters and immediately re-drains queued work:

export function resetAllLanes(): void {
  for (const [lane, state] of lanes) {
    // Drop counters that may still reflect tasks abandoned across the restart.
    state.active = 0;
    state.activeTaskIds.clear();
    state.draining = false;
    // Immediately resume any queued work for this lane.
    drainLane(lane);
  }
}

This intentionally does not clear lane queues; pending user work should still run after restart.

Orphaned process recovery

Add a utility script that scans for orphaned coding agent processes (Claude Code, Codex CLI) after gateway restarts. These background processes can outlive the gateway session that spawned them.
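
What such a script might look like, as a hedged sketch: the process name patterns and the PPID-1 heuristic below are assumptions for illustration, not the actual utility:

// scripts/find-orphaned-agents.ts (illustrative sketch only)
import { execSync } from "node:child_process";

// Coding agent process names that may outlive the gateway session.
const PATTERNS = ["claude", "codex"];

for (const pattern of PATTERNS) {
  let out = "";
  try {
    // -a prints "PID command args" for each process whose name matches.
    out = execSync(`pgrep -a ${pattern}`, { encoding: "utf8" });
  } catch {
    continue; // pgrep exits non-zero when nothing matches
  }
  for (const line of out.trim().split("\n")) {
    const [pidStr, ...cmd] = line.split(" ");
    // PPID 1 means the process was reparented to init, i.e. the gateway
    // session that spawned it is gone.
    const ppid = execSync(`ps -o ppid= -p ${pidStr}`, { encoding: "utf8" }).trim();
    if (ppid === "1") {
      console.log(`orphaned: pid=${pidStr} cmd=${cmd.join(" ")}`);
    }
  }
}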

Reproduction

  1. Start a heartbeat-enabled gateway with commands.restart: true
  2. Wait for a heartbeat to start executing (or send a message to create in-flight work)
  3. Send SIGUSR1: kill -USR1 $(pgrep -f openclaw-gateway)
  4. Observe: [heartbeat] started appears in logs, but no heartbeat polls fire
  5. Send a Signal message → never processed, no response
  6. Only systemctl restart openclaw (SIGTERM) recovers

Related

Follows up on the 2026.2.12 drain fix (#13931). This is the remaining piece: resetting in-flight state that survives the in-process restart boundary.
