[Bug] SIGUSR1 config reload aborts in-flight subagent LLM calls, leaving sessions orphaned with no retry #47711

@joeykrug

Bug Description

After a SIGUSR1 gateway reload, in-flight subagent LLM calls are aborted and the orphaned sessions are never resumed correctly. They can sit idle until the archive sweeper cleans them up, silently losing in-progress work, or remain stuck as "running" forever under the old run ID.

Root Cause (Updated)

Three SIGUSR1 events occurred in sequence, with three underlying bugs:

Timeline

  1. 18:59:25 — openclaw gateway restart CLI command sent SIGUSR1 immediately (bypassing deferral)
  2. 18:59:31 — manual kill -USR1 from same deploy subagent (ignored during shutdown)
  3. 19:00:18 — config watcher deferral timeout (90s) expired and fired SIGUSR1

Bug 1: openclaw gateway restart CLI bypasses deferral

The CLI command in src/cli/daemon-cli/lifecycle.ts called signalVerifiedGatewayPidSync(pid, "SIGUSR1") directly, bypassing the deferral logic that the config watcher uses. The config watcher checks for active embedded runs and waits up to the deferral timeout before sending SIGUSR1 — but the CLI restart command sent it immediately.
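
To make the difference between the two paths concrete, here is a minimal sketch of the decision the deferral path makes on each check. The function name, parameters, and return values are hypothetical and only illustrate the behavior described above; they are not the code in src/infra/restart.ts.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not the project's API.
type RestartDecision = "send-sigusr1" | "keep-waiting";

/** Decide whether a deferred restart may fire yet. */
function decideDeferredRestart(
  activeEmbeddedRuns: number,
  waitedMs: number,
  maxWaitMs: number,
): RestartDecision {
  // Idle gateway: nothing would be aborted, so restart right away.
  if (activeEmbeddedRuns === 0) return "send-sigusr1";
  // Deferral budget exhausted: restart even though runs are still active.
  if (waitedMs >= maxWaitMs) return "send-sigusr1";
  // Otherwise give in-flight runs more time to finish.
  return "keep-waiting";
}

// Two active runs, 6s into a 90s budget: the watcher path would still wait.
console.log(decideDeferredRestart(2, 6000, 90000)); // keep-waiting
// Same runs once the 90s budget is spent: the restart fires anyway.
console.log(decideDeferredRestart(2, 90000, 90000)); // send-sigusr1
```

The buggy CLI path behaved as if this check did not exist: it signalled the gateway regardless of active runs.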

Bug 2: Post-restart orphan recovery fails silently / tracking stays on the dead run

After restart, restored subagent runs are re-armed under the original run ID, but aborted sessions need a brand-new resumed run. Without explicitly remapping registry tracking to the new resumed run ID, lifecycle handling keeps watching the dead/original run, so resumed completion events are ignored and the old tracked run can later time out or appear stuck as running forever.
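
The effect of the missing remap can be shown with a toy registry. This is a sketch under assumptions: `RunRegistry`, `TrackedRun`, and their methods are hypothetical stand-ins, and the real registry in src/agents/subagent-registry.ts may be structured quite differently.

```typescript
// Hypothetical sketch; not the actual subagent registry.
interface TrackedRun {
  runId: string;
  sessionKey: string;
  status: "running" | "done";
}

class RunRegistry {
  private runs = new Map<string, TrackedRun>();

  track(run: TrackedRun): void {
    this.runs.set(run.runId, run);
  }

  /** Move tracking from the dead run ID to the resumed run ID. */
  remap(oldRunId: string, newRunId: string): boolean {
    const run = this.runs.get(oldRunId);
    if (!run) return false;
    this.runs.delete(oldRunId);
    this.runs.set(newRunId, { ...run, runId: newRunId });
    return true;
  }

  /** Completion events for run IDs that are not tracked are ignored. */
  complete(runId: string): boolean {
    const run = this.runs.get(runId);
    if (!run) return false;
    run.status = "done";
    return true;
  }

  status(runId: string): "running" | "done" | undefined {
    return this.runs.get(runId)?.status;
  }
}

const registry = new RunRegistry();
registry.track({ runId: "run-old", sessionKey: "s1", status: "running" });

// Without a remap, the resumed run's completion is dropped
// and "run-old" stays "running".
registry.complete("run-new");

// With the remap, the same completion event lands on the tracked entry.
registry.remap("run-old", "run-new");
registry.complete("run-new");
```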

Bug 3: No auto-resume for aborted turns

Sessions whose last LLM turn has stopReason: "aborted" (set by the restart abort) are never sent a resume message. They sit idle until the archive sweeper cleans them up.
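
A detection check for this condition might look like the sketch below. The `Turn` and `SessionTranscript` shapes are assumptions made for illustration; only the stopReason: "aborted" value comes from this report.

```typescript
// Hypothetical transcript shapes; the real session format may differ.
interface Turn {
  role: "user" | "assistant";
  stopReason?: string;
}

interface SessionTranscript {
  key: string;
  turns: Turn[];
}

/** True when the most recent LLM (assistant) turn ended with stopReason "aborted". */
function lastTurnWasAborted(session: SessionTranscript): boolean {
  // Walk backwards to find the latest assistant turn.
  for (let i = session.turns.length - 1; i >= 0; i--) {
    const turn = session.turns[i];
    if (turn && turn.role === "assistant") {
      return turn.stopReason === "aborted";
    }
  }
  // No assistant turn at all: nothing was aborted.
  return false;
}
```

Sessions for which this returns true are the candidates that currently receive no resume message.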

Impact

  • In-flight subagent work (potentially hours of LLM computation) is silently lost
  • No retry mechanism exists — the session is abandoned
  • Completion / outcome reporting can attach to the wrong run ID after restart
  • Users are not notified that their background tasks were killed

Fix

Part A: CLI restart uses deferral path (#47719)

  • New gateway.restart RPC method that routes through scheduleGatewaySigusr1Restart() (same deferral logic as config watcher)
  • CLI restartGatewayWithoutServiceManager() calls the RPC instead of sending SIGUSR1 directly
  • Falls back to direct SIGUSR1 only for backward compatibility with older gateways
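
The RPC-first flow with a fallback could be sketched as follows. The dependency names (`callRestartRpc`, `sendSigusr1`, `isMethodNotFound`) are hypothetical, and treating a "method not found" error as the marker of an older gateway is an assumption; the PR may detect older gateways differently.

```typescript
// Hypothetical sketch of the CLI restart flow; dependencies are injected
// so the control flow can be read (and tested) in isolation.
interface RestartDeps {
  /** Assumed wrapper around the new gateway.restart RPC. */
  callRestartRpc: () => Promise<void>;
  /** Assumed wrapper around the direct SIGUSR1 signal. */
  sendSigusr1: () => void;
  /** Assumption: older gateways reject the RPC as an unknown method. */
  isMethodNotFound: (err: unknown) => boolean;
}

async function restartViaRpcWithFallback(
  deps: RestartDeps,
): Promise<"rpc" | "signal"> {
  try {
    // Preferred path: the gateway applies its own deferral logic.
    await deps.callRestartRpc();
    return "rpc";
  } catch (err) {
    // Any other failure is surfaced rather than silently bypassing deferral.
    if (!deps.isMethodNotFound(err)) throw err;
    // Older gateway without the RPC: fall back to the direct signal.
    deps.sendSigusr1();
    return "signal";
  }
}
```

Rethrowing unrelated errors keeps the fallback narrow, so a transient RPC failure does not reintroduce the immediate-signal behavior from Bug 1.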

Part B: Post-reload orphan recovery + run tracking repair (#47719)

  • New module src/agents/subagent-orphan-recovery.ts
  • After restore, scans for sessions with abortedLastRun: true
  • Sends synthetic resume message to trigger a new LLM turn
  • Captures the resumed run ID and remaps registry tracking from old run ID to new run ID
  • Retries recovery with exponential backoff if gateway is not yet ready
  • Flag only cleared after confirmed successful resume
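
The retry and flag-clearing behavior in the list above can be sketched like this. Everything except the abortedLastRun flag is an assumption: the `resume` and `sleep` callbacks, the base delay, the cap, and the attempt limit are illustrative values, not those used in src/agents/subagent-orphan-recovery.ts.

```typescript
// Hypothetical sketch; not the actual orphan recovery module.
interface OrphanSession {
  key: string;
  abortedLastRun: boolean;
}

/** Delay before retry number `attempt` (0-based): base * 2^attempt, capped. */
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

async function recoverOrphan(
  session: OrphanSession,
  // Assumed to resolve to the resumed run ID and reject while the gateway is not ready.
  resume: (sessionKey: string) => Promise<string>,
  sleep: (ms: number) => Promise<void>,
  maxAttempts = 5,
): Promise<string | undefined> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const newRunId = await resume(session.key);
      // Clear the flag only after the resume is confirmed.
      session.abortedLastRun = false;
      return newRunId;
    } catch {
      // Back off before the next attempt; skip the wait after the final one.
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  // All attempts failed: the flag stays set so a later pass can retry.
  return undefined;
}
```

The returned run ID is what a caller would then use to remap registry tracking from the old run to the resumed one.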

Part C: Configurable deferral timeout (#47719)

  • DEFAULT_DEFERRAL_MAX_WAIT_MS increased from 90s to 300s (5 minutes)
  • New config key: gateway.reload.deferralTimeoutMs
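
Resolving the effective timeout from the new key might look like the sketch below. The nested config shape is inferred from the dotted key name, and the validation rule (fall back to the default for missing, non-finite, or negative values) is an assumption rather than documented behavior.

```typescript
// Default taken from this issue: 300s (5 minutes).
const DEFAULT_DEFERRAL_MAX_WAIT_MS = 300000;

// Hypothetical config shape inferred from the key gateway.reload.deferralTimeoutMs.
interface GatewayConfig {
  gateway?: {
    reload?: {
      deferralTimeoutMs?: number;
    };
  };
}

/** Use the configured timeout when it is a sane number, else the default. */
function resolveDeferralTimeoutMs(config: GatewayConfig): number {
  const configured = config.gateway?.reload?.deferralTimeoutMs;
  if (
    typeof configured === "number" &&
    Number.isFinite(configured) &&
    configured >= 0
  ) {
    return configured;
  }
  return DEFAULT_DEFERRAL_MAX_WAIT_MS;
}

// Example: a deployment that allows runs up to 10 minutes to finish before reload.
const timeoutMs = resolveDeferralTimeoutMs({
  gateway: { reload: { deferralTimeoutMs: 600000 } },
});
console.log(timeoutMs); // 600000
```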

Key Source Files

  • src/cli/daemon-cli/lifecycle.ts — restartGatewayWithoutServiceManager()
  • src/infra/restart.ts — deferGatewayRestartUntilIdle(), DEFAULT_DEFERRAL_MAX_WAIT_MS
  • src/agents/subagent-orphan-recovery.ts — orphan recovery + resumed run remap
  • src/gateway/server-methods/system.ts — new gateway.restart RPC method
  • src/agents/subagent-registry.ts — hooks orphan recovery into restore path and remaps run tracking
