
SIGUSR1 in-process restart leaves gateway in zombie state #15177

@joeykrug

Description


Problem

After a SIGUSR1 in-process restart, the gateway can enter a permanent zombie state where:

  1. Heartbeat scheduler stops firing: [heartbeat] started appears in the logs, but no heartbeats actually execute
  2. Incoming messages are received but never processed — Signal messages arrive via signal-cli but are permanently queued
  3. Cron scheduler keeps polling — the event loop stays alive, masking the problem
  4. Only a SIGTERM (full process restart) recovers — the zombie state persists indefinitely

This happens when SIGUSR1 fires while work is in-flight (heartbeat running, messages being processed).

Root Cause

Two pieces of module-level state in the dist bundle survive the in-process restart but don't get reset:

1. running flag in heartbeat-wake.ts

The running flag is set to true when a heartbeat handler is executing. If SIGUSR1 fires mid-execution:

  • The in-flight handler may be abandoned before its finally block runs
  • running stays true in the new lifecycle
  • Every schedule() call sees running === true → sets scheduled = true → re-calls schedule() → but the timer callback just loops on the running check forever
  • Heartbeats never fire again

Note: The upstream drain mechanism (waitForActiveTasks) does not cover heartbeat runs — heartbeats execute directly via the wake handler, not through command queue lanes.
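
For illustration, here is a minimal sketch of the module-level state involved and how it gets stuck. The variable and function names come from the issue above; the body of schedule() is an assumption about the general shape, not the actual implementation:

// heartbeat-wake.ts (simplified sketch, not the real module)
let handler: (() => Promise<void>) | null = null;
let running = false;   // true while a heartbeat handler is executing
let scheduled = false; // a wake is parked behind the running handler
let timer: ReturnType<typeof setTimeout> | null = null;

function schedule(delayMs: number): void {
  if (running) {
    // A handler is believed to be in flight, so defer instead of firing.
    scheduled = true;
    return;
  }
  if (timer) {
    return;
  }
  timer = setTimeout(async () => {
    timer = null;
    running = true;
    try {
      await handler?.();
    } finally {
      // Never reached if SIGUSR1 abandons the old lifecycle mid-run,
      // so `running` stays true in the new lifecycle.
      running = false;
      if (scheduled) {
        scheduled = false;
        schedule(delayMs);
      }
    }
  }, delayMs);
}

With running stuck at true, every later schedule() call takes the early-return branch and no heartbeat ever fires again, which matches the behavior described above.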

2. Lane active counters in command-queue.ts

Lane states track active (number of in-flight tasks) and use maxConcurrent to limit parallelism. If SIGUSR1 fires while tasks are executing:

  • waitForActiveTasks has a 30-second timeout — if tasks don't finish, restart proceeds anyway
  • Interrupted/abandoned tasks may never decrement active
  • New lifecycle inherits elevated active counts
  • drainLane() sees active >= maxConcurrent → never dequeues new work → messages permanently stuck
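
A minimal sketch of the gating described above. The LaneState field names follow the issue; drainLane()'s body is an assumption for illustration only:

// command-queue.ts (simplified sketch)
interface LaneState {
  active: number;                     // in-flight task count
  maxConcurrent: number;              // concurrency limit for the lane
  activeTaskIds: Set<string>;
  draining: boolean;
  queue: Array<() => Promise<void>>;  // queued, not-yet-started tasks
}

const lanes = new Map<string, LaneState>();

function drainLane(lane: string): void {
  const state = lanes.get(lane);
  if (!state) {
    return;
  }
  // A stale `active` count inherited from the old lifecycle makes this
  // condition permanently false, so queued work is never dequeued.
  while (state.active < state.maxConcurrent && state.queue.length > 0) {
    const task = state.queue.shift()!;
    state.active += 1;
    void task().finally(() => {
      state.active -= 1; // skipped for tasks abandoned across the restart
      drainLane(lane);
    });
  }
}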

The 2026.2.12 drain fix (#13931) helps but doesn't fully solve this because:

  • The drain has a finite timeout (30s) — tasks that don't complete in time leave stale counters
  • Race window between drain completion and new lifecycle startup
  • Heartbeat runs bypass the drain entirely

Fix

heartbeat-wake.ts — Reset state on new handler registration

In setHeartbeatWakeHandler(), when registering a new (non-null) handler:

  • Clear stale timer metadata (timer, timerDueAt, timerKind) so old retry cooldowns do not delay new lifecycle work
  • Reset module-level running and scheduled flags
  • Call resetAllLanes() to recover command queues

This is safe because the reset only runs when a new non-null handler is registered, i.e. at the start of a new lifecycle, and the handlerGeneration bump already guards against stale callbacks from the old lifecycle acting on the new state:

export function setHeartbeatWakeHandler(next: HeartbeatWakeHandler | null): () => void {
  handlerGeneration += 1;
  const generation = handlerGeneration;
  handler = next;
  if (next) {
    // A new lifecycle is registering its handler: discard any state left
    // behind by an in-flight run abandoned in the previous lifecycle.
    if (timer) {
      clearTimeout(timer);
    }
    timer = null;
    timerDueAt = null;
    timerKind = null;
    running = false;
    scheduled = false;
    // Recover command-queue lanes whose counters were abandoned.
    resetAllLanes();
  }
  if (handler && pendingWake) {
    schedule(DEFAULT_COALESCE_MS, "normal");
  }
  // ... disposer unchanged
}

command-queue.ts — Add resetAllLanes() function

Add and export a function that resets lane runtime counters and immediately re-drains queued work:

export function resetAllLanes(): void {
  for (const [lane, state] of lanes) {
    // Drop counters that may still reflect tasks abandoned across the restart.
    state.active = 0;
    state.activeTaskIds.clear();
    state.draining = false;
    // Immediately resume any queued work for this lane.
    drainLane(lane);
  }
}

This intentionally does not clear lane queues; pending user work should still run after restart.

Orphaned process recovery

Add a utility script that scans for orphaned coding agent processes (Claude Code, Codex CLI) after gateway restarts. These background processes can outlive the gateway session that spawned them.
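
What such a script might look like, as a hedged sketch: the process name patterns and the PPID-1 heuristic below are assumptions for illustration, not the actual utility:

// scripts/find-orphaned-agents.ts (illustrative sketch only)
import { execSync } from "node:child_process";

// Coding agent process names that may outlive the gateway session.
const PATTERNS = ["claude", "codex"];

for (const pattern of PATTERNS) {
  let out = "";
  try {
    // -a prints "PID command args" for each process whose name matches.
    out = execSync(`pgrep -a ${pattern}`, { encoding: "utf8" });
  } catch {
    continue; // pgrep exits non-zero when nothing matches
  }
  for (const line of out.trim().split("\n")) {
    const [pidStr, ...cmd] = line.split(" ");
    // PPID 1 means the process was reparented to init, i.e. the gateway
    // session that spawned it is gone.
    const ppid = execSync(`ps -o ppid= -p ${pidStr}`, { encoding: "utf8" }).trim();
    if (ppid === "1") {
      console.log(`orphaned: pid=${pidStr} cmd=${cmd.join(" ")}`);
    }
  }
}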

Reproduction

  1. Start a heartbeat-enabled gateway with commands.restart: true
  2. Wait for a heartbeat to start executing (or send a message to create in-flight work)
  3. Send SIGUSR1: kill -USR1 $(pgrep -f openclaw-gateway)
  4. Observe: [heartbeat] started appears in logs, but no heartbeat polls fire
  5. Send a Signal message → never processed, no response
  6. Only systemctl restart openclaw (SIGTERM) recovers

Related

Follows up on the 2026.2.12 drain fix (#13931). This is the remaining piece: resetting in-flight state that survives the in-process restart boundary.
