fix: reset stale execution state after SIGUSR1 in-process restart#15195
fix: reset stale execution state after SIGUSR1 in-process restart#15195gumadeiras merged 17 commits intoopenclaw:mainfrom
Conversation
8bf59b2 to
d14be67
Compare
|
/review @greptile, issues in the comments and rating are fixed |
d14be67 to
8bf59b2
Compare
|
/review @greptile |
Additional Comments (1)
Prompt To Fix With AIThis is a comment left during a code review.
Path: src/infra/heartbeat-wake.ts
Line: 145:165
Comment:
**Registration resets in-flight handler**
`setHeartbeatWakeHandler(next)` resets `running`/`scheduled` and clears any pending timer state whenever `next` is non-null (src/infra/heartbeat-wake.ts:149-165). This allows a second handler to start while the previous handler is still executing (the PR’s own test simulates a hung handler), which can cause overlapping heartbeat runs if a handler is replaced for reasons other than a full lifecycle restart. If handler replacement is possible during normal operation, consider gating this reset behind an explicit “new lifecycle” signal (or only doing it when `next` changes from null→non-null) so a mid-flight handler can’t be bypassed accidentally.
How can I resolve this? If you propose a fix, please make it concise. |
|
Took some iteration to get this one working smoothly but this should be the final batch of fixes for the various zombie states where the bot gets stuck and isn’t responding. Tagging @gumadeiras |
d62a97a to
d5c114b
Compare
d5c114b to
eb5fa3b
Compare
…ators Addresses review concern that setHeartbeatWakeHandler() had a surprising cross-cutting side effect by calling resetAllLanes(), coupling heartbeat handler registration to command-queue global state. The lane reset now lives in the restart loop (run-loop.ts and gateway-daemon.ts), which is the correct abstraction level — only in-process restart coordinators need to know about stale lane state. setHeartbeatWakeHandler() still resets its own module-level state (running, scheduled, timer) which is properly scoped.
Address two additional review concerns: 1. Remove separate 'active' counter from LaneState; derive it from activeTaskIds.size instead. This makes negative-underflow impossible — the Set is the single source of truth for active task count. Previously, a double-reset scenario could drive 'active' negative, violating the concurrency check in pump(). 2. Replace unbounded 'ps -axo pid=,command=' with targeted pgrep pre-filter in orphan scanner. Only fetches full command info for candidate PIDs matching 'codex|claude', avoiding O(all-processes) overhead on large hosts.
resetAllLanes() now calls drainLane() for lanes with pending queue entries after resetting generation/activeTaskIds. This prevents queued work from stalling indefinitely when no subsequent enqueueCommandInLane() call arrives after a SIGUSR1 restart. Safe because the drain happens after all lanes are fully reset (generation bumped, activeTaskIds cleared, draining=false), so concurrency invariants are preserved.
Use command -v guard and fallback to "unknown" timestamp when neither node nor date is available. Suppresses stderr and ensures valid JSON output in all minimal environments.
When pgrep returns empty (no matches, exit code 1), the script now correctly reports zero orphans instead of falling through to a full ps scan. The ps fallback only triggers when pgrep is genuinely unavailable (ENOENT). Also: heartbeat handler state reset was already gated on replacement (prev !== null) in a prior commit — no additional change needed.
01dd2f1 to
676f9ec
Compare
…enclaw#15195) Merged via /review-pr -> /prepare-pr -> /merge-pr. Prepared head SHA: 676f9ec Co-authored-by: joeykrug <[email protected]> Co-authored-by: gumadeiras <[email protected]> Reviewed-by: @gumadeiras
…enclaw#15195) Merged via /review-pr -> /prepare-pr -> /merge-pr. Prepared head SHA: 676f9ec Co-authored-by: joeykrug <[email protected]> Co-authored-by: gumadeiras <[email protected]> Reviewed-by: @gumadeiras
…enclaw#15195) Merged via /review-pr -> /prepare-pr -> /merge-pr. Prepared head SHA: 676f9ec Co-authored-by: joeykrug <[email protected]> Co-authored-by: gumadeiras <[email protected]> Reviewed-by: @gumadeiras
…enclaw#15195) Merged via /review-pr -> /prepare-pr -> /merge-pr. Prepared head SHA: 676f9ec Co-authored-by: joeykrug <[email protected]> Co-authored-by: gumadeiras <[email protected]> Reviewed-by: @gumadeiras
Summary
Fixes #15177. After a SIGUSR1 in-process restart, the gateway can enter a zombie state where:
runningflag inheartbeat-wake.tsstaystruewhen an in-flight heartbeat is interrupted, blocking all futureschedule()attempts.activecounters incommand-queue.tsstay elevated from interrupted tasks, blocking all message dequeuing.Approach
Generation-based lane reset (addresses Greptile review)
Previous version used
resetAllLanes()which zeroed counters and immediately calleddrainLane(). This violated concurrency invariants: if old tasks were still executing, new tasks could start alongside them, breaking main-lane serialization.New approach: Each lane has a
generationcounter. WhenresetAllLanes()fires:enqueueCommandInLane()When old tasks'
finallyblocks run,completeTask()checks generation — stale tasks are no-ops. This preserves concurrency invariants at all times.Heartbeat wake handler improvements
setHeartbeatWakeHandler()now returns a generation-scoped disposer — stale disposers from previous registrations can't clear a newer handler (fixes the ownership race)running/scheduledflags on new registrationOrphan recovery script
scripts/recover-orphaned-processes.shscans for coding agent processes (Claude Code, Codex) that outlived their session. Output: JSON object withorphanedarray andtstimestamp.Test plan
Files changed
src/process/command-queue.ts— generation-based lane reset,completeTask(),resetAllLanes(),getActiveTaskCount(),waitForActiveTasks()src/infra/heartbeat-wake.ts— generation-scoped disposer, priority wake reasons, timer preemptionsrc/process/command-queue.test.ts— 6 new testssrc/infra/heartbeat-wake.test.ts— 11 new tests (new file)scripts/recover-orphaned-processes.sh— orphan scanner (new file)Greptile Overview
Greptile Summary
This change hardens SIGUSR1 in-process restarts by resetting stale runtime state that can otherwise block progress after an interrupted lifecycle.
src/process/command-queue.tsreplaces the separateactivecounter with a generation +activeTaskIds.sizemodel, and addsresetAllLanes()to bump generations, clear stale in-flight bookkeeping, and immediately drain preserved queued work.src/cli/gateway-cli/run-loop.ts,src/macos/gateway-daemon.ts) now callresetAllLanes()only on restart iterations (not initial boot).src/infra/heartbeat-wake.tsmakes handler disposal generation-scoped, resets stale timer/execution flags when registering a new handler, and refines wake coalescing/priorities/timer preemption.scripts/recover-orphaned-processes.shto report likely orphaned agent processes as JSON.No new merge-blocking issues were found in the current diff based on the reviewed code paths.
Confidence Score: 5/5
Last reviewed commit: d62a97a