[Bug] SIGUSR1 config reload aborts in-flight subagent LLM calls, leaving sessions orphaned with no retry #47711
Bug Description
After a SIGUSR1 gateway reload, in-flight subagent LLM calls are aborted and the orphaned sessions are never resumed correctly. They can sit idle until the archive sweeper cleans them up, silently losing in-progress work, or remain stuck as "running" forever under the old run ID.
Root Cause (Updated)
Three SIGUSR1 events occurred in sequence, with three underlying bugs:
Timeline
- 18:59:25 - openclaw gateway restart CLI command sent SIGUSR1 immediately (bypassing deferral)
- 18:59:31 - manual kill -USR1 from same deploy subagent (ignored during shutdown)
- 19:00:18 - config watcher deferral timeout (90s) expired and fired SIGUSR1
Bug 1: openclaw gateway restart CLI bypasses deferral
The CLI command in src/cli/daemon-cli/lifecycle.ts called signalVerifiedGatewayPidSync(pid, "SIGUSR1") directly, bypassing the deferral logic that the config watcher uses. The config watcher checks for active embedded runs and waits up to the deferral timeout before sending SIGUSR1 — but the CLI restart command sent it immediately.
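The deferral check described above can be sketched roughly as follows. This is only an illustration: decideRestart and DeferralState are hypothetical names, not the code in src/infra/restart.ts, and only the "wait for active embedded runs, up to a timeout" behaviour is taken from this report.

```typescript
// Hypothetical sketch of the deferral decision; names are illustrative only.
type RestartDecision = "restart-now" | "keep-waiting";

interface DeferralState {
  activeEmbeddedRuns: number; // in-flight runs at the time of the check
  waitedMs: number; // time elapsed since the restart was requested
  maxWaitMs: number; // deferral timeout (90_000 at the time of the incident)
}

function decideRestart(state: DeferralState): RestartDecision {
  // Idle gateway: nothing would be aborted, so restart immediately.
  if (state.activeEmbeddedRuns === 0) return "restart-now";
  // Timeout reached: restart even though runs are still in flight.
  if (state.waitedMs >= state.maxWaitMs) return "restart-now";
  // Otherwise keep deferring and check again later.
  return "keep-waiting";
}
```

In this model the CLI path behaved as if decideRestart always returned "restart-now", regardless of activeEmbeddedRuns.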
Bug 2: Post-restart orphan recovery fails silently / tracking stays on the dead run
After restart, restored subagent runs are re-armed under the original run ID, but aborted sessions need a brand-new resumed run. Without explicitly remapping registry tracking to the new resumed run ID, lifecycle handling keeps watching the dead/original run, so resumed completion events are ignored and the old tracked run can later time out or appear stuck as running forever.
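A minimal sketch of why the remap matters, assuming a registry keyed by run ID. RunRegistry, TrackedRun, and the method names are hypothetical and do not reflect the actual shape of src/agents/subagent-registry.ts.

```typescript
// Hypothetical registry keyed by run ID; illustrative only.
interface TrackedRun {
  runId: string;
  sessionKey: string;
  status: "running" | "completed";
}

class RunRegistry {
  private runs = new Map<string, TrackedRun>();

  track(run: TrackedRun): void {
    this.runs.set(run.runId, run);
  }

  // Move tracking from the aborted run to the resumed run.
  remap(oldRunId: string, newRunId: string): boolean {
    const entry = this.runs.get(oldRunId);
    if (!entry) return false;
    this.runs.delete(oldRunId);
    this.runs.set(newRunId, { ...entry, runId: newRunId });
    return true;
  }

  // Completion events for unknown run IDs are dropped, which is the
  // failure mode described in Bug 2 when no remap has happened.
  complete(runId: string): boolean {
    const entry = this.runs.get(runId);
    if (!entry) return false;
    entry.status = "completed";
    return true;
  }

  status(runId: string): string | undefined {
    return this.runs.get(runId)?.status;
  }
}
```

Without remap(), a completion event for the resumed run finds no entry and is ignored, while the old entry stays "running".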
Bug 3: No auto-resume for aborted turns
Sessions whose last LLM turn has stopReason: "aborted" (set by the restart abort) are never sent a resume message. They sit idle until the archive sweeper cleans them up.
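For illustration, detecting such a session could look roughly like the sketch below. The Turn shape and lastTurnWasAborted are assumptions; only the stopReason: "aborted" value comes from this report.

```typescript
// Hypothetical transcript shape; only stopReason "aborted" is from the report.
interface Turn {
  role: "user" | "assistant";
  stopReason?: string;
}

function lastTurnWasAborted(turns: Turn[]): boolean {
  // Walk backwards to the most recent assistant (LLM) turn.
  for (let i = turns.length - 1; i >= 0; i--) {
    const turn = turns[i];
    if (turn && turn.role === "assistant") {
      return turn.stopReason === "aborted";
    }
  }
  // No assistant turn yet, so there is nothing to resume.
  return false;
}
```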
Impact
- In-flight subagent work (potentially hours of LLM computation) is silently lost
- No retry mechanism exists — the session is abandoned
- Completion / outcome reporting can attach to the wrong run ID after restart
- Users are not notified that their background tasks were killed
Fix
Part A: CLI restart uses deferral path (#47719)
- New gateway.restart RPC method that routes through scheduleGatewaySigusr1Restart() (same deferral logic as config watcher)
- CLI restartGatewayWithoutServiceManager() calls the RPC instead of sending SIGUSR1 directly
- Falls back to direct SIGUSR1 only for backward compatibility with older gateways
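The RPC-first path with a signal fallback might be structured along these lines. The RestartDeps callbacks are hypothetical stand-ins for the real RPC client and signalVerifiedGatewayPidSync, and the sketch is kept synchronous for brevity; a real RPC call would presumably be asynchronous.

```typescript
// Hypothetical sketch of "try RPC, fall back to SIGUSR1"; illustrative only.
type RestartPath = "rpc-deferred" | "signal-direct";

interface RestartDeps {
  // Assumed to return false when the gateway does not support gateway.restart.
  callRestartRpc: () => boolean;
  // Assumed to send SIGUSR1 to the verified gateway PID.
  sendSigusr1: () => void;
}

function restartGateway(deps: RestartDeps): RestartPath {
  // Preferred path: the gateway applies its own deferral logic.
  if (deps.callRestartRpc()) return "rpc-deferred";
  // Older gateway without the RPC: fall back to the direct signal.
  deps.sendSigusr1();
  return "signal-direct";
}
```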
Part B: Post-reload orphan recovery + run tracking repair (#47719)
- New module src/agents/subagent-orphan-recovery.ts
- After restore, scans for sessions with abortedLastRun: true
- Sends synthetic resume message to trigger a new LLM turn
- Captures the resumed run ID and remaps registry tracking from old run ID to new run ID
- Retries recovery with exponential backoff if gateway is not yet ready
- Flag only cleared after confirmed successful resume
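The recovery steps above can be sketched as a retry loop. Everything here is an assumption made for illustration (the RecoveryDeps callbacks, the 1s base delay, the doubling factor, the attempt limit); the report only states that recovery retries with exponential backoff and clears the flag after a confirmed resume. The wait callback is injected so the sketch stays synchronous and testable.

```typescript
// Hypothetical recovery loop; delays and names are illustrative only.
interface OrphanSession {
  sessionKey: string;
  oldRunId: string;
  abortedLastRun: boolean;
}

interface RecoveryDeps {
  // Assumed to return the new run ID, or null if the gateway is not ready.
  sendResume: (sessionKey: string) => string | null;
  // Assumed to pause for the given delay (a timer in a real implementation).
  wait: (ms: number) => void;
  // Assumed to move registry tracking from the old run ID to the new one.
  remap: (oldRunId: string, newRunId: string) => void;
}

function recoverOrphan(
  session: OrphanSession,
  deps: RecoveryDeps,
  maxAttempts = 5,
  baseDelayMs = 1000,
): boolean {
  if (!session.abortedLastRun) return false; // nothing to recover
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const newRunId = deps.sendResume(session.sessionKey);
    if (newRunId !== null) {
      deps.remap(session.oldRunId, newRunId);
      session.abortedLastRun = false; // cleared only after a confirmed resume
      return true;
    }
    // Exponential backoff: base, 2x base, 4x base, ...
    if (attempt < maxAttempts - 1) deps.wait(baseDelayMs * 2 ** attempt);
  }
  return false; // flag stays set so a later pass can retry
}
```

Leaving abortedLastRun set on failure means a later scan can pick the session up again instead of silently dropping it.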
Part C: Configurable deferral timeout (#47719)
- DEFAULT_DEFERRAL_MAX_WAIT_MS increased from 90s to 300s (5 minutes)
- New config key: gateway.reload.deferralTimeoutMs
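As a rough sketch of how the new key could be resolved against the default: the nested config shape is inferred from the dotted key name, and resolveDeferralTimeoutMs and its validation rule are assumptions rather than the actual implementation.

```typescript
// Default from the report: 300s (5 minutes).
const DEFAULT_DEFERRAL_MAX_WAIT_MS = 300_000;

// Assumed config shape, inferred from the key gateway.reload.deferralTimeoutMs.
interface GatewayConfig {
  gateway?: { reload?: { deferralTimeoutMs?: number } };
}

function resolveDeferralTimeoutMs(config: GatewayConfig): number {
  const value = config.gateway?.reload?.deferralTimeoutMs;
  // Assumed validation: ignore missing, non-finite, or negative values.
  if (typeof value === "number" && Number.isFinite(value) && value >= 0) {
    return value;
  }
  return DEFAULT_DEFERRAL_MAX_WAIT_MS;
}
```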
Key Source Files
- src/cli/daemon-cli/lifecycle.ts: restartGatewayWithoutServiceManager()
- src/infra/restart.ts: deferGatewayRestartUntilIdle(), DEFAULT_DEFERRAL_MAX_WAIT_MS
- src/agents/subagent-orphan-recovery.ts: orphan recovery + resumed run remap
- src/gateway/server-methods/system.ts: new gateway.restart RPC method
- src/agents/subagent-registry.ts: hooks orphan recovery into restore path and remaps run tracking