fix: reset stale execution state after SIGUSR1 in-process restart #15195

Merged
gumadeiras merged 17 commits into openclaw:main from joeykrug:fix/sigusr1-zombie-state
Feb 13, 2026

Conversation

@joeykrug joeykrug commented Feb 13, 2026

Summary

Fixes #15177. After a SIGUSR1 in-process restart, the gateway can enter a zombie state where:

  1. Heartbeat scheduler blocks permanently: the module-level running flag in heartbeat-wake.ts stays true when an in-flight heartbeat is interrupted, blocking all future schedule() attempts.
  2. Message lanes block permanently: active counters in command-queue.ts stay elevated from interrupted tasks, blocking all message dequeuing.

Approach

Generation-based lane reset (addresses Greptile review)

Previous version used resetAllLanes() which zeroed counters and immediately called drainLane(). This violated concurrency invariants: if old tasks were still executing, new tasks could start alongside them, breaking main-lane serialization.

New approach: Each lane has a generation counter. When resetAllLanes() fires:

  • Generation increments per lane
  • Active counters reset to 0
  • No immediate drain — new work resumes naturally on next enqueueCommandInLane()

When old tasks' finally blocks run, completeTask() checks generation — stale tasks are no-ops. This preserves concurrency invariants at all times.
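
The generation check described above could look roughly like the following sketch. Only completeTask() and resetAllLanes() are named in this PR; LaneState's fields, startTask(), getLane(), and the TaskToken shape are assumptions made for illustration, and the real signatures in command-queue.ts may differ.

```typescript
// Hypothetical lane bookkeeping; the real LaneState in command-queue.ts may differ.
interface LaneState {
  generation: number; // bumped on every reset
  active: number; // tasks currently counted as in flight
}

// Hypothetical token handed to a task when it starts, so completion can be validated later.
interface TaskToken {
  lane: string;
  generation: number;
}

const lanes = new Map<string, LaneState>();

function getLane(name: string): LaneState {
  let lane = lanes.get(name);
  if (lane === undefined) {
    lane = { generation: 0, active: 0 };
    lanes.set(name, lane);
  }
  return lane;
}

function startTask(name: string): TaskToken {
  const lane = getLane(name);
  lane.active += 1;
  return { lane: name, generation: lane.generation };
}

// Intended to be called from a task's finally block.
// Returns false when the task predates the last reset; in that case nothing changes.
function completeTask(token: TaskToken): boolean {
  const lane = getLane(token.lane);
  if (token.generation !== lane.generation) {
    return false;
  }
  lane.active -= 1;
  return true;
}

// Bump every lane's generation and zero its counter, without draining.
function resetAllLanes(): void {
  lanes.forEach((lane) => {
    lane.generation += 1;
    lane.active = 0;
  });
}
```

Under these assumptions, a task interrupted by a restart cannot push the counter below zero when its finally block eventually runs, because its token no longer matches the lane's generation.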

Heartbeat wake handler improvements

  • setHeartbeatWakeHandler() now returns a generation-scoped disposer — stale disposers from previous registrations can't clear a newer handler (fixes the ownership race)
  • Clears stale timer metadata (retry cooldowns) when registering a new handler
  • Resets running/scheduled flags on new registration
  • Priority-based wake reason queueing (action > default > interval > retry)
  • Timer preemption: sooner requests preempt later ones, but retry cooldowns are preserved
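
A generation-scoped disposer, as described in the first bullet, might be sketched as follows. This is a minimal illustration, not the code in heartbeat-wake.ts: setWakeHandler(), getWakeHandler(), and the WakeHandler type are hypothetical stand-ins for setHeartbeatWakeHandler() and its internals.

```typescript
// Hypothetical handler type; the real handler signature may differ.
type WakeHandler = (reason: string) => Promise<void>;

let currentHandler: WakeHandler | null = null;
let handlerGeneration = 0;

// Registers a handler and returns a disposer bound to this registration only.
function setWakeHandler(next: WakeHandler | null): () => void {
  handlerGeneration += 1;
  const registeredGeneration = handlerGeneration;
  currentHandler = next;
  return () => {
    // A disposer from an older registration must not clear a newer handler.
    if (registeredGeneration !== handlerGeneration) {
      return;
    }
    currentHandler = null;
  };
}

function getWakeHandler(): WakeHandler | null {
  return currentHandler;
}
```

The design choice here is that ownership is decided by the registration counter rather than by comparing handler references, so re-registering the same function still invalidates the old disposer.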

Orphan recovery script

scripts/recover-orphaned-processes.sh scans for coding agent processes (Claude Code, Codex) that outlived their session. Output: JSON object with orphaned array and ts timestamp.
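
A consumer of that output could validate it along these lines. The sketch assumes only what is stated above (an object with an orphaned array and a ts field); the shape of each orphaned entry and the type of ts are not specified here, so both are left as unknown. parseOrphanReport() and OrphanReport are hypothetical names.

```typescript
// Only the two documented keys are modelled; entry fields are not assumed.
interface OrphanReport {
  orphaned: unknown[];
  ts: unknown;
}

// Returns the parsed report, or null when the text is not the expected JSON object.
function parseOrphanReport(raw: string): OrphanReport | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof value !== "object" || value === null) {
    return null;
  }
  const record = value as Record<string, unknown>;
  const orphaned = record["orphaned"];
  if (!Array.isArray(orphaned) || !("ts" in record)) {
    return null;
  }
  return { orphaned, ts: record["ts"] };
}
```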

Test plan

  • 21 tests (10 command-queue, 11 heartbeat-wake), all passing
  • Tests cover: generation mismatch, stale disposer no-op, timer preemption, retry cooldown preservation, active task counting, drain-after-reset flow

Files changed

  • src/process/command-queue.ts — generation-based lane reset, completeTask(), resetAllLanes(), getActiveTaskCount(), waitForActiveTasks()
  • src/infra/heartbeat-wake.ts — generation-scoped disposer, priority wake reasons, timer preemption
  • src/process/command-queue.test.ts — 6 new tests
  • src/infra/heartbeat-wake.test.ts — 11 new tests (new file)
  • scripts/recover-orphaned-processes.sh — orphan scanner (new file)
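
The priority ordering and timer preemption listed for heartbeat-wake.ts (action > default > interval > retry) could be sketched as a pure merge function. This is one possible reading, stated as an assumption: a pending retry cooldown keeps its due time even when a higher-priority reason arrives. PendingWake, mergeWake(), and the isRetryCooldown flag are hypothetical names.

```typescript
type WakeReason = "action" | "default" | "interval" | "retry";

// Higher number wins when two wake requests are coalesced.
const REASON_PRIORITY: Record<WakeReason, number> = {
  retry: 0,
  interval: 1,
  default: 2,
  action: 3,
};

interface PendingWake {
  reason: WakeReason;
  dueAt: number; // epoch milliseconds
  isRetryCooldown: boolean; // assumption: a cooldown timer must not be shortened
}

function mergeWake(pending: PendingWake | null, reason: WakeReason, dueAt: number): PendingWake {
  if (pending === null) {
    return { reason, dueAt, isRetryCooldown: reason === "retry" };
  }
  const winner =
    REASON_PRIORITY[reason] > REASON_PRIORITY[pending.reason] ? reason : pending.reason;
  if (pending.isRetryCooldown) {
    // Keep the cooldown's due time; only the recorded reason may be upgraded.
    return { reason: winner, dueAt: pending.dueAt, isRetryCooldown: true };
  }
  // Otherwise the sooner request preempts the later one.
  return { reason: winner, dueAt: Math.min(pending.dueAt, dueAt), isRetryCooldown: false };
}
```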

Greptile Overview

Greptile Summary

This change hardens SIGUSR1 in-process restarts by resetting stale runtime state that can otherwise block progress after an interrupted lifecycle.

  • src/process/command-queue.ts replaces the separate active counter with a generation + activeTaskIds.size model, and adds resetAllLanes() to bump generations, clear stale in-flight bookkeeping, and immediately drain preserved queued work.
  • Restart coordinators (src/cli/gateway-cli/run-loop.ts, src/macos/gateway-daemon.ts) now call resetAllLanes() only on restart iterations (not initial boot).
  • src/infra/heartbeat-wake.ts makes handler disposal generation-scoped, resets stale timer/execution flags when registering a new handler, and refines wake coalescing/priorities/timer preemption.
  • Adds tests for the above behaviors and introduces scripts/recover-orphaned-processes.sh to report likely orphaned agent processes as JSON.
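
The generation + activeTaskIds.size model from the first bullet, including the immediate drain of preserved queued work, might be sketched as below. The Lane class, QueueEntry, and the method names are assumptions for illustration and are not taken from command-queue.ts.

```typescript
interface QueueEntry {
  id: number;
  run: () => Promise<void>;
}

class Lane {
  generation = 0;
  maxConcurrent = 1;
  readonly activeTaskIds = new Set<number>();
  readonly queue: QueueEntry[] = [];

  // Derived count: a Set's size cannot go negative, and deleting a missing id is harmless.
  get active(): number {
    return this.activeTaskIds.size;
  }

  enqueue(entry: QueueEntry): void {
    this.queue.push(entry);
    this.drain();
  }

  drain(): void {
    while (this.active < this.maxConcurrent) {
      const entry = this.queue.shift();
      if (entry === undefined) {
        return;
      }
      const startedIn = this.generation;
      this.activeTaskIds.add(entry.id);
      const done = (): void => {
        // A task started before a reset belongs to an old generation; ignore it.
        if (startedIn !== this.generation) {
          return;
        }
        this.activeTaskIds.delete(entry.id);
        this.drain();
      };
      entry.run().then(done, done);
    }
  }

  // Bump the generation, forget stale in-flight bookkeeping, then drain what is still queued.
  reset(): void {
    this.generation += 1;
    this.activeTaskIds.clear();
    this.drain();
  }
}
```

In this sketch a hung task no longer blocks the lane after reset(), because the work queued behind it is started by the drain at the end of the reset.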

No new merge-blocking issues were found in the current diff based on the reviewed code paths.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk.
  • Changes are narrowly scoped to restart recovery and wake scheduling, with state reset gated to restart iterations, generation-based stale-task handling in the command queue, and extensive new unit coverage for the new semantics.
  • No files require special attention

Last reviewed commit: d62a97a

@greptile-apps greptile-apps bot left a comment

5 files reviewed, 3 comments

@openclaw-barnacle openclaw-barnacle bot added the cli CLI command changes label Feb 13, 2026
@joeykrug joeykrug force-pushed the fix/sigusr1-zombie-state branch from 8bf59b2 to d14be67 Compare February 13, 2026 05:10
@openclaw-barnacle openclaw-barnacle bot removed the cli CLI command changes label Feb 13, 2026
joeykrug commented Feb 13, 2026

/review @greptile, issues in the comments and rating are fixed

@joeykrug joeykrug marked this pull request as draft February 13, 2026 05:26
@joeykrug joeykrug marked this pull request as ready for review February 13, 2026 05:27
@greptile-apps greptile-apps bot left a comment

5 files reviewed, 3 comments

@joeykrug joeykrug marked this pull request as draft February 13, 2026 05:32
@joeykrug joeykrug force-pushed the fix/sigusr1-zombie-state branch from d14be67 to 8bf59b2 Compare February 13, 2026 05:33
@openclaw-barnacle openclaw-barnacle bot added the cli CLI command changes label Feb 13, 2026
joeykrug commented Feb 13, 2026

/review @greptile

@joeykrug joeykrug marked this pull request as ready for review February 13, 2026 05:34
@greptile-apps greptile-apps bot left a comment

7 files reviewed, 2 comments

@joeykrug joeykrug marked this pull request as draft February 13, 2026 05:37
@joeykrug joeykrug marked this pull request as ready for review February 13, 2026 05:47
@greptile-apps greptile-apps bot left a comment

7 files reviewed, 1 comment

@joeykrug joeykrug marked this pull request as draft February 13, 2026 05:56
@joeykrug joeykrug marked this pull request as ready for review February 13, 2026 05:56
@greptile-apps greptile-apps bot left a comment

7 files reviewed, 1 comment

@joeykrug joeykrug marked this pull request as draft February 13, 2026 06:16
@joeykrug joeykrug marked this pull request as ready for review February 13, 2026 06:16
@greptile-apps greptile-apps bot left a comment

7 files reviewed, 2 comments

greptile-apps bot commented Feb 13, 2026

Additional Comments (1)

src/infra/heartbeat-wake.ts
Registration resets in-flight handler

setHeartbeatWakeHandler(next) resets running/scheduled and clears any pending timer state whenever next is non-null (src/infra/heartbeat-wake.ts:149-165). This allows a second handler to start while the previous handler is still executing (the PR’s own test simulates a hung handler), which can cause overlapping heartbeat runs if a handler is replaced for reasons other than a full lifecycle restart. If handler replacement is possible during normal operation, consider gating this reset behind an explicit “new lifecycle” signal (or only doing it when next changes from null→non-null) so a mid-flight handler can’t be bypassed accidentally.
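
The reviewer's first suggestion, gating the reset behind an explicit "new lifecycle" signal, could be sketched like this. The newLifecycle parameter and the helper functions are hypothetical; the PR does not say setHeartbeatWakeHandler() accepts such an argument.

```typescript
type HeartbeatHandler = () => Promise<void>;

let handler: HeartbeatHandler | null = null;
let running = false;
let scheduled = false;

// Test/illustration helpers, not part of the PR.
function markRunning(): void {
  running = true;
}

function snapshot(): { hasHandler: boolean; running: boolean; scheduled: boolean } {
  return { hasHandler: handler !== null, running, scheduled };
}

// Hypothetical variant: execution flags are only cleared when the caller
// states that a new lifecycle (for example an in-process restart) has begun.
function setHandler(next: HeartbeatHandler | null, newLifecycle: boolean): void {
  handler = next;
  if (next !== null && newLifecycle) {
    running = false;
    scheduled = false;
  }
}
```

With this gating, swapping handlers during normal operation leaves an in-flight run marked as running, so a second run cannot start alongside it.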

@joeykrug joeykrug marked this pull request as draft February 13, 2026 06:22
@joeykrug joeykrug marked this pull request as ready for review February 13, 2026 06:28
@joeykrug
Contributor Author

Took some iteration to get this one working smoothly but this should be the final batch of fixes for the various zombie states where the bot gets stuck and isn’t responding. Tagging @gumadeiras

@gumadeiras gumadeiras force-pushed the fix/sigusr1-zombie-state branch from d62a97a to d5c114b Compare February 13, 2026 20:16
@gumadeiras gumadeiras force-pushed the fix/sigusr1-zombie-state branch from d5c114b to eb5fa3b Compare February 13, 2026 20:17
joeykrug and others added 17 commits February 13, 2026 15:29
…ators

Addresses review concern that setHeartbeatWakeHandler() had a surprising
cross-cutting side effect by calling resetAllLanes(), coupling heartbeat
handler registration to command-queue global state.

The lane reset now lives in the restart loop (run-loop.ts and
gateway-daemon.ts), which is the correct abstraction level — only
in-process restart coordinators need to know about stale lane state.

setHeartbeatWakeHandler() still resets its own module-level state
(running, scheduled, timer) which is properly scoped.

Address two additional review concerns:

1. Remove separate 'active' counter from LaneState; derive it from
   activeTaskIds.size instead. This makes negative-underflow impossible
   — the Set is the single source of truth for active task count.
   Previously, a double-reset scenario could drive 'active' negative,
   violating the concurrency check in pump().

2. Replace unbounded 'ps -axo pid=,command=' with targeted pgrep
   pre-filter in orphan scanner. Only fetches full command info for
   candidate PIDs matching 'codex|claude', avoiding O(all-processes)
   overhead on large hosts.

resetAllLanes() now calls drainLane() for lanes with pending queue
entries after resetting generation/activeTaskIds. This prevents queued
work from stalling indefinitely when no subsequent enqueueCommandInLane()
call arrives after a SIGUSR1 restart.

Safe because the drain happens after all lanes are fully reset (generation
bumped, activeTaskIds cleared, draining=false), so concurrency invariants
are preserved.

Use command -v guard and fallback to "unknown" timestamp when
neither node nor date is available. Suppresses stderr and ensures
valid JSON output in all minimal environments.

When pgrep returns empty (no matches, exit code 1), the script now
correctly reports zero orphans instead of falling through to a full
ps scan. The ps fallback only triggers when pgrep is genuinely
unavailable (ENOENT).
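
The decision described in this commit message can be illustrated as a small pure function. The actual scanner is a shell script, so this TypeScript version only mirrors the logic; PgrepOutcome, ScanPlan, and planScan() are hypothetical names, and the handling of other exit codes is an assumption. In Node, such an outcome could be built from the result of child_process.spawnSync.

```typescript
// Hypothetical summary of one pgrep invocation.
interface PgrepOutcome {
  status: number | null; // process exit code, or null if it never ran
  stdout: string;
  errorCode?: string; // e.g. "ENOENT" when the binary is missing
}

type ScanPlan =
  | { kind: "candidates"; pids: number[] }
  | { kind: "fallback-to-ps" }
  | { kind: "error"; status: number | null };

function planScan(outcome: PgrepOutcome): ScanPlan {
  // Only a missing pgrep binary justifies the full ps scan.
  if (outcome.errorCode === "ENOENT") {
    return { kind: "fallback-to-ps" };
  }
  // Exit code 1 means pgrep ran and matched nothing: zero orphans, no fallback.
  if (outcome.status === 1) {
    return { kind: "candidates", pids: [] };
  }
  if (outcome.status === 0) {
    const pids = outcome.stdout
      .split("\n")
      .map((line) => line.trim())
      .filter((line) => /^\d+$/.test(line))
      .map((line) => Number(line));
    return { kind: "candidates", pids };
  }
  // Assumption: any other exit code is surfaced as an error rather than guessed at.
  return { kind: "error", status: outcome.status };
}
```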

Also: heartbeat handler state reset was already gated on replacement
(prev !== null) in a prior commit — no additional change needed.
@gumadeiras gumadeiras force-pushed the fix/sigusr1-zombie-state branch from 01dd2f1 to 676f9ec Compare February 13, 2026 20:29
@gumadeiras gumadeiras merged commit 4e9f933 into openclaw:main Feb 13, 2026
6 checks passed
@gumadeiras
Member

Merged via squash.

Thanks @joeykrug!

zhangyang-crazy-one pushed a commit to zhangyang-crazy-one/openclaw that referenced this pull request Feb 13, 2026
…enclaw#15195)

Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 676f9ec
Co-authored-by: joeykrug <[email protected]>
Co-authored-by: gumadeiras <[email protected]>
Reviewed-by: @gumadeiras
steipete pushed a commit to azade-c/openclaw that referenced this pull request Feb 14, 2026
…enclaw#15195)

Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 676f9ec
Co-authored-by: joeykrug <[email protected]>
Co-authored-by: gumadeiras <[email protected]>
Reviewed-by: @gumadeiras
GwonHyeok pushed a commit to learners-superpumped/openclaw that referenced this pull request Feb 15, 2026
…enclaw#15195)

Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 676f9ec
Co-authored-by: joeykrug <[email protected]>
Co-authored-by: gumadeiras <[email protected]>
Reviewed-by: @gumadeiras
cloud-neutral pushed a commit to cloud-neutral-toolkit/openclawbot.svc.plus that referenced this pull request Feb 15, 2026
…enclaw#15195)

Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 676f9ec
Co-authored-by: joeykrug <[email protected]>
Co-authored-by: gumadeiras <[email protected]>
Reviewed-by: @gumadeiras
Labels

cli CLI command changes scripts Repository scripts size: L

Development

Successfully merging this pull request may close these issues.

SIGUSR1 in-process restart leaves gateway in zombie state

2 participants