[Bug] SIGUSR1 config reload aborts in-flight subagent LLM calls, leaving sessions orphaned with no retry

## Bug Description

After a SIGUSR1 gateway reload, in-flight subagent LLM calls are aborted and the orphaned sessions are never resumed correctly. They can sit idle until the archive sweeper cleans them up, silently losing in-progress work, or remain stuck as "running" forever under the old run ID.

## Root Cause (Updated)

Three SIGUSR1 events occurred in sequence and exposed **three underlying bugs**.

### Timeline
1. **18:59:25** — `openclaw gateway restart` CLI command sent SIGUSR1 immediately (bypassing deferral)
2. **18:59:31** — manual `kill -USR1` from same deploy subagent (ignored during shutdown)
3. **19:00:18** — config watcher deferral timeout (90s) expired and fired SIGUSR1

### Bug 1: `openclaw gateway restart` CLI bypasses deferral
The CLI command in `src/cli/daemon-cli/lifecycle.ts` called `signalVerifiedGatewayPidSync(pid, "SIGUSR1")` directly, bypassing the deferral logic that the config watcher uses. The config watcher checks for active embedded runs and waits up to the deferral timeout before sending SIGUSR1 — but the CLI restart command sent it immediately.
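
The gap between the two paths can be sketched as a small decision function. This is an illustration only: `decideRestart` and its parameters are hypothetical names, not the code in `src/infra/restart.ts`. The config watcher effectively evaluates all three branches, while the CLI path behaved as if the first branch always applied.

```typescript
// Hypothetical sketch of the deferral decision; names are assumptions,
// not the actual openclaw implementation.
type DeferralDecision = "restart-now" | "keep-waiting" | "restart-on-timeout";

function decideRestart(
  activeRuns: number, // embedded runs still in flight
  waitedMs: number, // time elapsed since the restart was requested
  maxWaitMs: number, // deferral timeout (90s at the time of the incident)
): DeferralDecision {
  if (activeRuns === 0) return "restart-now"; // nothing to interrupt
  if (waitedMs >= maxWaitMs) return "restart-on-timeout"; // give up waiting
  return "keep-waiting"; // defer while runs are active
}
```

With two active runs and one second elapsed, this returns `"keep-waiting"`; a direct `SIGUSR1` from the CLI skips that check entirely.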

### Bug 2: Post-restart orphan recovery fails silently / tracking stays on the dead run
After restart, restored subagent runs are re-armed under the original run ID, but aborted sessions need a brand-new resumed run. Without explicitly remapping registry tracking to the new resumed run ID, lifecycle handling keeps watching the dead/original run, so resumed completion events are ignored and the old tracked run can later time out or appear stuck as running forever.
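
A minimal sketch of why the remap matters, assuming a registry keyed by run ID. `RunRegistry`, `TrackedRun`, and the method names are hypothetical and do not mirror `src/agents/subagent-registry.ts`.

```typescript
// Hypothetical in-memory registry keyed by run ID (illustration only).
interface TrackedRun {
  runId: string;
  sessionKey: string;
  status: "running" | "done";
}

class RunRegistry {
  private runs = new Map<string, TrackedRun>();

  track(run: TrackedRun): void {
    this.runs.set(run.runId, run);
  }

  // Move tracking from the dead run ID to the resumed run ID.
  remap(oldRunId: string, newRunId: string): boolean {
    const run = this.runs.get(oldRunId);
    if (!run) return false;
    this.runs.delete(oldRunId);
    this.runs.set(newRunId, { ...run, runId: newRunId });
    return true;
  }

  // Completion events for run IDs that are not tracked are dropped.
  complete(runId: string): boolean {
    const run = this.runs.get(runId);
    if (!run) return false;
    run.status = "done";
    return true;
  }

  statusOf(runId: string): TrackedRun["status"] | undefined {
    return this.runs.get(runId)?.status;
  }
}
```

Without `remap`, a completion event for the resumed run finds no entry and is ignored, while the entry for the original run stays `"running"`.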

### Bug 3: No auto-resume for aborted turns
Sessions whose last LLM turn has `stopReason: "aborted"` (set by the restart abort) are never sent a resume message. They sit idle until the archive sweeper cleans them up.
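
The detection condition could look roughly like the following. The `Turn` and `Session` shapes are assumptions made for illustration; only the `stopReason: "aborted"` value comes from the observed behavior.

```typescript
// Hypothetical transcript shapes; the real session format may differ.
interface Turn {
  role: "user" | "assistant";
  stopReason?: string;
}

interface Session {
  key: string;
  turns: Turn[];
}

// True when the most recent assistant (LLM) turn was cut off by an abort.
function needsResume(session: Session): boolean {
  for (let i = session.turns.length - 1; i >= 0; i--) {
    const turn = session.turns[i];
    if (turn && turn.role === "assistant") {
      return turn.stopReason === "aborted";
    }
  }
  return false; // no LLM turn yet, nothing to resume
}
```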

## Impact

- In-flight subagent work (potentially hours of LLM computation) is silently lost
- No retry mechanism exists — the session is abandoned
- Completion / outcome reporting can attach to the wrong run ID after restart
- Users are not notified that their background tasks were killed

## Fix

### Part A: CLI restart uses deferral path (#47719)
- New `gateway.restart` RPC method that routes through `scheduleGatewaySigusr1Restart()` (same deferral logic as config watcher)
- CLI `restartGatewayWithoutServiceManager()` calls the RPC instead of sending SIGUSR1 directly
- Falls back to direct SIGUSR1 only for backward compatibility with older gateways
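
The RPC-first flow with a signal fallback can be sketched as below. The dependency interface and result codes are hypothetical, and the sketch is synchronous for brevity; a real RPC call would be asynchronous.

```typescript
// Hypothetical dependency-injected sketch of the CLI restart path.
type RpcResult = "ok" | "method-not-found";
type RestartPath = "rpc" | "signal-fallback";

interface RestartDeps {
  callRpc: (method: string) => RpcResult; // assumed RPC client
  sendSignal: (signal: string) => void; // assumed direct signal sender
}

function restartGateway(deps: RestartDeps): RestartPath {
  // Preferred path: let the gateway apply its own deferral logic.
  if (deps.callRpc("gateway.restart") === "ok") return "rpc";
  // Older gateways do not know the method, so signal directly.
  deps.sendSignal("SIGUSR1");
  return "signal-fallback";
}
```

Keeping the fallback confined to the "method not found" case means a newer gateway never receives an undeferred `SIGUSR1` from the CLI.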

### Part B: Post-reload orphan recovery + run tracking repair (#47719)
- New module `src/agents/subagent-orphan-recovery.ts`
- After restore, scans for sessions with `abortedLastRun: true`
- Sends a synthetic resume message to trigger a new LLM turn
- Captures the resumed run ID and remaps registry tracking from the old run ID to the new run ID
- Retries recovery with exponential backoff if the gateway is not yet ready
- Clears the `abortedLastRun` flag only after a confirmed successful resume
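
The recovery steps above can be sketched as a single attempt plus a backoff schedule. This is a simplified illustration, not the contents of `src/agents/subagent-orphan-recovery.ts`: the function names, the callback signatures, and the 1s base / 30s cap are all assumptions.

```typescript
// Hypothetical sketch of one recovery attempt and its retry delay.
interface OrphanSession {
  key: string;
  abortedLastRun: boolean;
}

// Delay before retry number `attempt` (0-based): doubles up to a cap.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Returns true once the session no longer needs recovery.
function tryRecover(
  session: OrphanSession,
  oldRunId: string,
  resume: (sessionKey: string) => string | null, // new run ID, or null if gateway not ready
  remap: (oldRunId: string, newRunId: string) => void,
): boolean {
  if (!session.abortedLastRun) return true; // nothing to do
  const newRunId = resume(session.key);
  if (newRunId === null) return false; // keep the flag so a retry can pick it up
  remap(oldRunId, newRunId); // track the resumed run, not the dead one
  session.abortedLastRun = false; // clear only after a confirmed resume
  return true;
}
```

A caller would invoke `tryRecover` again after `backoffDelayMs(attempt)` whenever it returns `false`.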

### Part C: Configurable deferral timeout (#47719)
- `DEFAULT_DEFERRAL_MAX_WAIT_MS` increased from 90s to 300s (5 minutes)
- New config key: `gateway.reload.deferralTimeoutMs`
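
Resolving the effective timeout from config might look like this. The `GatewayConfig` shape follows the key path above, but the validation rule (fall back to the default for missing, negative, or non-finite values) is an assumption, not confirmed behavior.

```typescript
// Hypothetical resolver for gateway.reload.deferralTimeoutMs.
const DEFAULT_DEFERRAL_MAX_WAIT_MS = 300000; // 5 minutes

interface GatewayConfig {
  gateway?: { reload?: { deferralTimeoutMs?: number } };
}

function resolveDeferralTimeoutMs(config: GatewayConfig): number {
  const value = config.gateway?.reload?.deferralTimeoutMs;
  // Assumed validation: accept only finite, non-negative numbers.
  if (typeof value === "number" && isFinite(value) && value >= 0) return value;
  return DEFAULT_DEFERRAL_MAX_WAIT_MS;
}
```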

## Key Source Files
- `src/cli/daemon-cli/lifecycle.ts` — `restartGatewayWithoutServiceManager()`
- `src/infra/restart.ts` — `deferGatewayRestartUntilIdle()`, `DEFAULT_DEFERRAL_MAX_WAIT_MS`
- `src/agents/subagent-orphan-recovery.ts` — orphan recovery + resumed run remap
- `src/gateway/server-methods/system.ts` — new `gateway.restart` RPC method
- `src/agents/subagent-registry.ts` — hooks orphan recovery into restore path and remaps run tracking
