fix: resume orphaned subagent sessions after SIGUSR1 reload #47719
steipete merged 6 commits into openclaw:main from
Greptile Summary
This PR implements post-restart orphan recovery for subagent sessions (Part B) and a configurable deferral timeout (Part C) to address three root causes of silent in-flight work loss during gateway SIGUSR1 reloads. The new
Confidence Score: 4/5
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4d725ec153
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
    // Clear the aborted flag before resuming
    await updateSessionStore(storePath, (currentStore) => {
      const current = currentStore[childSessionKey];
      if (current) {
        current.abortedLastRun = false;
Clear aborted markers only after a successful resume
This clears abortedLastRun before attempting the synthetic resume turn. If the subsequent gateway call fails (e.g., gateway still booting or transient RPC error), the session is counted as failed but no longer has abortedLastRun=true, so later scans skip it (if (!entry.abortedLastRun) continue) and the orphaned run is effectively unrecoverable. Keep the flag set until resume succeeds (or revert it on failure) so retries remain possible.
Fixed in 52e8e6d — flag clearing now happens only after successful resume.
Already addressed: abortedLastRun was moved to clear only after successful resume in c3eae1d. The flag is preserved on failure so the next restart can retry. Test coverage verifies both paths.
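The ordering this thread settles on (resume first, clear the flag only on success) can be sketched as below. This is a minimal illustration, not the PR's code: `SessionStore`, `resumeOrphan`, and the `sendResume` callback are hypothetical stand-ins for the real session store and gateway call.

```typescript
// Hypothetical shapes for illustration; the real store and gateway APIs may differ.
type SessionEntry = { abortedLastRun?: boolean };
type SessionStore = Record<string, SessionEntry>;

// Attempts one resume and clears the flag only when the resume succeeded.
async function resumeOrphan(
  store: SessionStore,
  sessionKey: string,
  sendResume: (key: string) => Promise<void>, // stand-in for the gateway call
): Promise<"recovered" | "failed" | "skipped"> {
  const entry = store[sessionKey];
  if (!entry?.abortedLastRun) return "skipped";
  try {
    await sendResume(sessionKey);
  } catch {
    // Leave abortedLastRun=true so a later scan can retry this session.
    return "failed";
  }
  entry.abortedLastRun = false;
  return "recovered";
}
```

Because the flag survives a failed attempt, the `if (!entry.abortedLastRun) continue` scan quoted in the review still picks the session up on the next pass.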
💡 Codex Review
Reviewed commit: 52e8e6d19e
💡 Codex Review
Reviewed commit: 13680006cd
💡 Codex Review
Reviewed commit: a792291486
a792291 to 38bb216
38bb216 to 1eb6ce7
💡 Codex Review
Reviewed commit: 1eb6ce7d2d
💡 Codex Review
Reviewed commit: 02c0c994bc
d39ffaa to c3eae1d
💡 Codex Review
Reviewed commit: c3eae1d019
…ume context and config idempotency guard
💡 Codex Review
Reviewed commit: 53d038a218
    const cfg = await readBestEffortConfig().catch(() => undefined);
    const tlsEnabled = !!(cfg as { gateway?: { tls?: { enabled?: unknown } } } | undefined)?.gateway
      ?.tls?.enabled;
    const scheme = tlsEnabled ? "wss" : "ws";
Raw type assertion bypasses typed config
readBestEffortConfig() already returns a strongly-typed config object. Using a manual structural cast here creates a silent mismatch risk — if the TLS config shape ever changes (e.g. tls.enabled is renamed or moved), this cast will silently evaluate to undefined instead of surfacing an error.
The existing typed config should already expose gateway.tls.enabled via the imported types — using the typed accessor would be both safer and cleaner:
```suggestion
const cfg = await readBestEffortConfig().catch(() => undefined);
const tlsEnabled = !!(cfg?.gateway?.tls?.enabled);
const scheme = tlsEnabled ? "wss" : "ws";
```
Path: src/cli/daemon-cli/lifecycle.ts (lines 53-56)
💡 Codex Review
Reviewed commit: c26966def8
      getActiveRuns: () => Map<string, SubagentRunRecord>;
    }): Promise<{ recovered: number; failed: number; skipped: number }> {
      const result = { recovered: 0, failed: 0, skipped: 0 };
      const resumedSessionKeys = new Set<string>();
Persist resumed-session dedupe across recovery retries
Fresh evidence after the earlier duplicate-resume fix: resumedSessionKeys is recreated on every recoverOrphanedSubagentSessions invocation, but scheduleOrphanRecovery can invoke this function multiple times when any session in the batch fails. If one session resumes successfully but updateSessionStore fails to clear abortedLastRun, and another session causes result.failed > 0, the next retry pass will re-resume the already-running session and create a second run for the same orphaned work.
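One way to keep the dedupe state alive across retries is to create the set once in the scheduler and hand it to every pass. The sketch below assumes a hypothetical `RecoverPass` callback in place of `recoverOrphanedSubagentSessions`, whose real parameters are only partly visible in this thread. The retry count and delays (3 retries, 5s, 10s, 20s) follow the commit log quoted elsewhere on this page, and the injectable `sleep` is an addition for testability.

```typescript
type RecoveryResult = { recovered: number; failed: number; skipped: number };

// Hypothetical signature: one recovery pass that records resumed keys in the shared set.
type RecoverPass = (resumedSessionKeys: Set<string>) => Promise<RecoveryResult>;

async function scheduleOrphanRecoverySketch(
  recoverPass: RecoverPass,
  opts: { maxRetries?: number; baseDelayMs?: number; sleep?: (ms: number) => Promise<void> } = {},
): Promise<RecoveryResult> {
  const maxRetries = opts.maxRetries ?? 3;
  const baseDelayMs = opts.baseDelayMs ?? 5_000;
  const sleep = opts.sleep ?? ((ms: number) => new Promise<void>((r) => setTimeout(r, ms)));

  // Created once, outside the retry loop, so every pass sees earlier resumes.
  const resumedSessionKeys = new Set<string>();

  let result = await recoverPass(resumedSessionKeys);
  for (let attempt = 0; attempt < maxRetries && result.failed > 0; attempt++) {
    await sleep(baseDelayMs * 2 ** attempt); // 5s, 10s, 20s with the defaults
    result = await recoverPass(resumedSessionKeys);
  }
  return result; // counts from the last pass only
}
```

With the set owned by the scheduler, a session that resumed successfully but whose flag could not be cleared is skipped on the next pass instead of being resumed a second time.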
1. [P1] Treat remap failures as resume failures: if replaceSubagentRunAfterSteer returns false, do NOT clear abortedLastRun, increment failed count.
2. [P2] Count scan-level exceptions as retryable failures: set result.failed > 0 in the outer catch block so scheduleOrphanRecovery retry logic triggers.
3. [P2] Persist resumed-session dedupe across recovery retries: accept resumedSessionKeys as a parameter; scheduleOrphanRecovery lifts the Set to its own scope and passes it through retries.
4. [Greptile] Use typed config accessors instead of raw structural cast for TLS check in lifecycle.ts.
5. [Greptile] Forward gateway.reload.deferralTimeoutMs to deferGatewayRestartUntilIdle in scheduleGatewaySigusr1Restart so user-configured value is not silently ignored.
6. [Greptile] Same as #4: already addressed by the typed config fix.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
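Item 5 is the easiest of these to get wrong silently, so here is a small sketch of forwarding the configured timeout. Only the key path `gateway.reload.deferralTimeoutMs` and the 5 minute default come from this page; the `GatewayConfig` shape, `resolveDeferralTimeoutMs`, `scheduleRestartSketch`, and the `defer` callback (standing in for `deferGatewayRestartUntilIdle`) are assumptions, as is the choice to reject negative or non-finite values.

```typescript
// Hypothetical config shape; only the key path gateway.reload.deferralTimeoutMs comes from the PR.
type GatewayConfig = { gateway?: { reload?: { deferralTimeoutMs?: number } } };

const DEFAULT_DEFERRAL_TIMEOUT_MS = 5 * 60_000; // 5 minutes, per the commit message

// Returns the configured timeout, or the default when it is missing or unusable.
function resolveDeferralTimeoutMs(cfg: GatewayConfig | undefined): number {
  const value = cfg?.gateway?.reload?.deferralTimeoutMs;
  return typeof value === "number" && Number.isFinite(value) && value >= 0
    ? value
    : DEFAULT_DEFERRAL_TIMEOUT_MS;
}

// `defer` stands in for deferGatewayRestartUntilIdle; its real signature is not shown in this thread.
function scheduleRestartSketch(
  cfg: GatewayConfig | undefined,
  defer: (opts: { timeoutMs: number }) => void,
): void {
  // Pass the resolved value through explicitly instead of relying on the callee's default.
  defer({ timeoutMs: resolveDeferralTimeoutMs(cfg) });
}
```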
💡 Codex Review
Reviewed commit: 745b01d775
      const tlsEnabled = !!cfg?.gateway?.tls?.enabled;
      const scheme = tlsEnabled ? "wss" : "ws";
      const probe = await probeGateway({
    -   url: `ws://127.0.0.1:${port}`,
    +   url: `${scheme}://127.0.0.1:${port}`,
Probe unmanaged gateway with both ws and wss schemes
assertUnmanagedGatewayRestartEnabled now picks a single probe scheme from local config, so when the running unmanaged gateway still uses the previous scheme during a pending TLS transition, the probe fails and the commands.restart guard is skipped. In that state, restartGatewayWithoutServiceManager still sends SIGUSR1, but the run loop ignores unauthorized external restarts when commands.restart=false, so openclaw gateway restart can report success without an actual restart. Please try both schemes (or otherwise verify against the live listener) before deciding the precheck is unavailable.
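A possible shape for the "try both schemes" suggestion is sketched below. The `Probe` type is a stand-in for `probeGateway`, whose real options and result shape may differ, and `probeEitherScheme` is a hypothetical helper name. The config-preferred scheme is tried first so the common case still costs a single probe.

```typescript
type ProbeResult = { ok: boolean };

// Stand-in for probeGateway; the real function's options and result shape may differ.
type Probe = (opts: { url: string }) => Promise<ProbeResult>;

async function probeEitherScheme(
  probe: Probe,
  port: number,
  preferTls: boolean,
): Promise<{ scheme: "ws" | "wss"; result: ProbeResult } | undefined> {
  // Try the scheme suggested by local config first, then the other one.
  const schemes: Array<"ws" | "wss"> = preferTls ? ["wss", "ws"] : ["ws", "wss"];
  for (const scheme of schemes) {
    try {
      const result = await probe({ url: `${scheme}://127.0.0.1:${port}` });
      if (result.ok) return { scheme, result };
    } catch {
      // Treat a thrown probe like a failed one and try the other scheme.
    }
  }
  return undefined; // neither scheme answered; only now is the precheck unavailable
}
```

Only an `undefined` result would then justify skipping the commands.restart guard, which closes the window where a pending TLS transition makes the restart report success without restarting.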
* refactor: make setup the primary wizard surface
* test: move setup surface coverage
* docs: prefer setup wizard command
* fix: follow up shared interactive regressions (openclaw#47715)
* fix(plugins): resolve lazy runtime from package root
* fix(daemon): accept 'Last Result' schtasks key variant on Windows (openclaw#47726)
  Some Windows locales/versions emit 'Last Result' instead of 'Last Run Result' in schtasks output, causing gateway status to falsely report 'Runtime: unknown'. Fall back to the shorter key when the canonical key is absent.
* fix: accept schtasks Last Result key on Windows (openclaw#47844) (thanks @MoerAI)
* refactor(plugins): move auth profile hooks into providers
* fix: resume orphaned subagent sessions after SIGUSR1 reload
  Closes openclaw#47711
  After a SIGUSR1 gateway reload aborts in-flight subagent LLM calls, the gateway now scans for orphaned sessions and sends a synthetic resume message to restart their work. Also makes the deferral timeout configurable via gateway.reload.deferralTimeoutMs (default: 5 minutes, up from 90s).
* fix: address Greptile review feedback
  - Remove unrelated pnpm-lock.yaml changes
  - Move abortedLastRun flag clearing to AFTER successful resume (prevents permanent session loss on transient gateway failures)
  - Use dynamic import for orphan recovery module to avoid startup memory overhead
  - Add test assertion that flag is preserved on resume failure
* fix: add retry with exponential backoff for orphan recovery
  Addresses Codex review feedback: if recovery fails (e.g. gateway still booting), retries up to 3 times with exponential backoff (5s → 10s → 20s) before giving up.
* fix: address all review comments on PR openclaw#47719 + implement resume context and config idempotency guard
* fix: address 6 review comments on PR openclaw#47719
  1. [P1] Treat remap failures as resume failures: if replaceSubagentRunAfterSteer returns false, do NOT clear abortedLastRun, increment failed count.
  2. [P2] Count scan-level exceptions as retryable failures: set result.failed > 0 in the outer catch block so scheduleOrphanRecovery retry logic triggers.
  3. [P2] Persist resumed-session dedupe across recovery retries: accept resumedSessionKeys as a parameter; scheduleOrphanRecovery lifts the Set to its own scope and passes it through retries.
  4. [Greptile] Use typed config accessors instead of raw structural cast for TLS check in lifecycle.ts.
  5. [Greptile] Forward gateway.reload.deferralTimeoutMs to deferGatewayRestartUntilIdle in scheduleGatewaySigusr1Restart so user-configured value is not silently ignored.
  6. [Greptile] Same as #4: already addressed by the typed config fix.
  Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
* fix: land SIGUSR1 orphan recovery regressions (openclaw#47719) (thanks @joeykrug)
* refactor: move channel messaging hooks into plugins
* fix: stabilize windows parallels smoke harness
* refactor(plugins): move provider onboarding auth into plugins
* docs: restore onboard as canonical setup command
* docs: restore onboard docs references
* refactor: move channel capability diagnostics into plugins
* refactor: split slack block action handling
* refactor: split plugin interactive dispatch adapters
* refactor: unify reply content checks
* refactor: extract discord shared interactive mapper
* refactor: unify telegram interactive button resolution
* build: add land gate parity script
* test: fix setup wizard smoke mocks
* docs: sync config baseline
* Status: split heartbeat summary helpers
* Security: trim audit policy import surfaces
* Security: lazy-load deep skill audit helpers
* Security: lazy-load audit config snapshot IO
* Config: keep native command defaults off heavy channel registry
* Status: split lightweight gateway agent list
* refactor(plugins): simplify provider auth choice metadata
* test: add openshell sandbox e2e smoke
* feishu: add structured card actions and interactive approval
flows (openclaw#47873) * feishu: add structured card actions and interactive approval flows * feishu: address review fixes and test-gate regressions * feishu: hold inflight card dedup until completion * feishu: restore fire-and-forget bot menu handling * feishu: format card interaction helpers * Feishu: add changelog entry for card interactions * Feishu: add changelog entry for ACP session binding * build: remove land gate script * Status: lazy-load tailscale and memory scan deps * Tests: fix Feishu full registration mock * Tests: cover plugin capability matrix * Gateway: import normalizeAgentId in hooks * fix: recover bonjour advertiser from ciao announce loops * fix: preserve loopback gateway scopes for local auth * Status: lazy-load summary session helpers * Status: lazy-load security audit commands * refactor: move channel delivery and ACP seams into plugins * Security: split audit runtime surfaces * Tests: add channel actions contract helper * Tests: add channel plugin contract helper * Tests: add Slack channel contract suite * Tests: add Mattermost channel contract suite * Tests: add Telegram channel contract suite * Tests: add Discord channel contract suite * fix(session): preserve external channel route when webchat views session (openclaw#47745) When a Telegram/WhatsApp/iMessage session was viewed or messaged from the dashboard/webchat, resolveLastChannelRaw() unconditionally returned 'webchat' for any isDirectSessionKey() or isMainSessionKey() match, overwriting the persisted external delivery route. This caused subagent completion events to be delivered to the webchat/dashboard instead of the original channel (Telegram, WhatsApp, etc.), silently dropping messages for the channel user. Fix: only allow webchat to own routing when no external delivery route has been established (no persisted external lastChannel, no external channel hint in the session key). 
If an external route exists, webchat is treated as admin/monitoring access and must not mutate the delivery route. Updated/added tests to document the correct behaviour. Fixes openclaw#47745 * fix: address bot nit on session route preservation (openclaw#47797) (thanks @brokemac79) * Tests: add plugin contract suites * Tests: add plugin contract registry * Tests: add global plugin contract suite * Tests: add global actions contract suite * Tests: add global setup contract suite * Tests: add global status contract suite * Tests: replace local channel contracts * refactor(plugins): move onboarding auth metadata to manifests * refactor: move remaining channel seams into plugins * refactor: add plugin-owned outbound adapters * fix: scope localStorage settings key by basePath to prevent cross-deployment conflicts - Add settingsKeyForGateway() function similar to tokenSessionKeyForGateway() - Use scoped key format: openclaw.control.settings.v1:https://example.com/gateway-a - Add migration from legacy static key on load - Fixes openclaw#47481 * Tests: add provider contract suites * Tests: add provider contract registry * Tests: add global provider contract suite * Tests: add global web search contract suite * fix: stabilize ci gate * Tests: add provider registry contract suite * Tests: relax provider auth hint contract * !refactor(browser): remove Chrome extension path and add MCP doctor migration (openclaw#47893) * Browser: replace extension path with Chrome MCP * Browser: clarify relay stub and doctor checks * Docs: mark browser MCP migration as breaking * Browser: reject unsupported profile drivers * Browser: accept clawd alias on profile create * Doctor: narrow legacy browser driver migration * feishu: harden media support and align capability docs (openclaw#47968) * feishu: harden media support and action surface * feishu: format media action changes * feishu: fix review follow-ups * fix: scope Feishu target aliases to Feishu (openclaw#47968) (thanks @Takhoffman) * 
fix: make docs i18n use gpt-5.4 overrides * docs: regenerate zh-CN onboarding references * Tests: tighten provider wizard contracts * Tests: add plugin loader contract suite * refactor: remove dock shim and move session routing into plugins * Plugins: add provider runtime contracts * GitHub Copilot: move runtime tests to provider contracts * Z.ai: move runtime tests to provider contracts * Anthropic: move runtime tests to provider contracts * Google: move runtime tests to provider contracts * OpenAI: move runtime tests to provider contracts * Qwen Portal: move runtime tests to provider contracts * fix(core): restore outbound fallbacks and gate checks * style(core): normalize rebase fallout * fix: accept sandbox plugin id hints * Plugins: capture tool registrations in test registry * Plugins: cover Firecrawl tool ownership * Firecrawl: drop local registration contract test * Plugins: add provider catalog contracts * Plugins: narrow provider runtime contracts * refactor(plugin-sdk): use scoped core imports for bundled channels * fix: unblock ci gates * Plugins: add provider wizard contracts * fix: unblock docs and registry checks * refactor: finish plugin-owned channel runtime seams * refactor(plugin-sdk): clean shared core imports * Plugins: add provider auth contracts * Plugins: dedupe routing imports in channel adapters * Plugins: add provider discovery contracts * fix: stop bonjour before re-advertising * Plugins: extend provider discovery contracts * docs: codify macOS parallels discord smoke * refactor: move session lifecycle and outbound fallbacks into plugins * refactor(plugins): derive compat provider ids from manifests * Plugins: cover catalog discovery providers * Tests: type auth contract prompt mocks * fix: mount CLI auth dirs in docker live tests * refactor: route shared channel sdk imports through plugin seams * Plugins: add auth choice contracts * Plugins: restore routing seams and discovery fixtures * fix: harden bonjour retry recovery * Runtime: 
lazy-load channel runtime singletons * refactor: tighten plugin sdk channel seams * refactor: route remaining channel imports through plugin sdk * refactor(plugins): finish provider auth boundary cleanup * fix(infra): wire gaxios-fetch-compat shim to prevent node-fetch crash on Node.js 25 * fix(infra): also wire gaxios-fetch-compat shim into src/index.ts (gateway entry) * fix: keep gaxios compat off the package root (openclaw#47914) (thanks @pdd-cli) * refactor: shrink public channel plugin sdk surfaces * refactor: add private channel sdk bridges * fix: restore effective setup wizard lazy import * docs: add frontmatter to parallels discord skill * fix: retry runtime postbuild skill copy races * Gateway: lazily resolve channel runtime * Gateway: cover lazy channel runtime resolution * Plugins: add Claude marketplace registry installs (openclaw#48058) * Changelog: note Claude marketplace plugin support * Plugins: add Claude marketplace installs * E2E: cover marketplace plugin installs in Docker * UI: keep thinking helpers browser-safe * Infra: restore check after gaxios compat * Docs: refresh generated config baseline * docs: reorder unreleased changelog entries * fix: split browser-safe thinking helpers * fix(macos): restore debug build helpers (openclaw#48046) * Docs: add Claude marketplace plugin install guidance * Channels: expand contract suites * Channels: add contract surface coverage * Channels: centralize outbound payload contracts * Channels: centralize group policy contracts * Channels: centralize inbound context contracts * Tests: add contract runner * Tests: harden WhatsApp inbound contract cleanup * Tests: add extension test runner * Runtime: lazy-load Discord channel ops * Docs: use placeholders for marketplace plugin examples * Release: trim generated docs from npm pack * Runtime: lazy-load Telegram and Slack channel ops * Tests: detect changed extensions * Tests: cover changed extension detection * Docs: add extension test workflow * CI: add changed 
extension test lane * BlueBubbles: lazy-load channel runtime paths * Plugin SDK: restore scoped imports for bundled channels * Plugin SDK: consolidate shared channel exports * Channels: fix surface contract plugin lookup * Status: stabilize startup memory probes * Media: avoid slow auth misses in auto-detect * Tests: scope Codex bundle loader fixture * Tests: isolate bundle surface fixtures * feat(telegram): add configurable silent error replies (openclaw#19776) Port and complete openclaw#19776 on top of the current Telegram extension layout. Adds a default-off `channels.telegram.silentErrorReplies` setting. When enabled, Telegram bot replies marked as errors are delivered silently across the regular bot reply flow, native/slash command replies, and fallback sends. Thanks @auspic7 Co-authored-by: Myeongwon Choi <[email protected]> Co-authored-by: ImLukeF <[email protected]> * fix(ui): auto load Usage tab data on navigation * fix(telegram): keep silent error fallback replies quiet * test(gateway): restore agent request route mock * Cron: isolate active-model delivery tests * Tests: align media auth fixture with selection checks * Plugins: preserve lazy runtime provider resolution * Bootstrap: report nested entry import misses * fix(channels): parse bundled targets without plugin registry * test(telegram): cover shared parsing without registry * Plugin SDK: split setup and sandbox subpaths * Providers: centralize setup defaults and helper boundaries * Plugins: decouple bundled web search discovery * Plugin SDK: update entrypoint metadata * Secrets: honor caller env during runtime validation * Tests: align Docker cache checks with non-root images * Plugin SDK: keep root alias reflection lazy * Providers: scope compat resolution to owning plugins * Plugin SDK: add narrow setup subpaths * Plugin SDK: update entrypoint metadata * fix(slack): harden bolt import interop (openclaw#45953) * fix(slack): harden bolt import interop * fix(slack): simplify bolt interop resolver * 
…@joeykrug) (cherry picked from commit 680eff6)
…ndler Pattern from PR openclaw#47719

The revised proposal addresses the 'missing recovery handler' bug by extracting the restart logic into a robust helper function `triggerGatewayRestart`. This helper is consumed by both the deferred-restart timeout path (fixing the race condition in `onTimeout`) and the immediate-restart path (ensuring consistent behavior).

I have addressed the specific sequencing error by ensuring the abort signal is processed only after the gateway decides to restart (or immediately, in the immediate path), and I have replaced the formatted string usage with the raw session state objects to enable precise recovery. A safety check is added for the recovery handler to prevent runtime errors if the dependency is missing.
…ume context and config idempotency guard
1. [P1] Treat remap failures as resume failures: if replaceSubagentRunAfterSteer returns false, do NOT clear abortedLastRun, increment failed count.
2. [P2] Count scan-level exceptions as retryable failures: set result.failed > 0 in the outer catch block so scheduleOrphanRecovery retry logic triggers.
3. [P2] Persist resumed-session dedupe across recovery retries: accept resumedSessionKeys as a parameter; scheduleOrphanRecovery lifts the Set to its own scope and passes it through retries.
4. [Greptile] Use typed config accessors instead of raw structural cast for TLS check in lifecycle.ts.
5. [Greptile] Forward gateway.reload.deferralTimeoutMs to deferGatewayRestartUntilIdle in scheduleGatewaySigusr1Restart so user-configured value is not silently ignored.
6. [Greptile] Same as #4; already addressed by the typed config fix.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
1. [P1] Treat remap failures as resume failures: if replaceSubagentRunAfterSteer returns false, do NOT clear abortedLastRun, increment failed count.
2. [P2] Count scan-level exceptions as retryable failures: set result.failed > 0 in the outer catch block so scheduleOrphanRecovery retry logic triggers.
3. [P2] Persist resumed-session dedupe across recovery retries: accept resumedSessionKeys as a parameter; scheduleOrphanRecovery lifts the Set to its own scope and passes it through retries.
4. [Greptile] Use typed config accessors instead of raw structural cast for TLS check in lifecycle.ts.
5. [Greptile] Forward gateway.reload.deferralTimeoutMs to deferGatewayRestartUntilIdle in scheduleGatewaySigusr1Restart so user-configured value is not silently ignored.
6. [Greptile] Same as #4; already addressed by the typed config fix.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>

(cherry picked from commit 98f6ec5)
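The review follow-ups in the commit message above (keep `abortedLastRun` set on failure, count failures so a retry triggers, and share the dedupe set across retries) can be sketched roughly as follows. This is a simplified illustration, not the code from the PR: `SessionEntry`, `RecoveryResult`, `recoverOrphans`, `recoverWithRetries`, and the injected `resume` callback are hypothetical names, and the sketch is synchronous for brevity, whereas the real recovery path awaits the session store and the gateway call.

```typescript
// Hypothetical shapes for illustration; the real session entry has more fields.
interface SessionEntry {
  abortedLastRun?: boolean;
}

type SessionStore = Record<string, SessionEntry>;

interface RecoveryResult {
  recovered: number;
  failed: number;
  skipped: number;
}

// `resume` stands in for the gateway call plus the run remap. It returns
// false (or throws) when either step fails.
function recoverOrphans(
  store: SessionStore,
  resume: (sessionKey: string) => boolean,
  resumedSessionKeys: Set<string>,
): RecoveryResult {
  const result: RecoveryResult = { recovered: 0, failed: 0, skipped: 0 };
  for (const [sessionKey, entry] of Object.entries(store)) {
    if (!entry.abortedLastRun) continue; // not an orphan
    if (resumedSessionKeys.has(sessionKey)) {
      // Assumption: guards against a stale entry that still carries the flag.
      result.skipped += 1;
      continue;
    }
    let ok = false;
    try {
      ok = resume(sessionKey);
    } catch {
      ok = false;
    }
    if (!ok) {
      // Leave abortedLastRun untouched so a later scan can retry.
      result.failed += 1;
      continue;
    }
    entry.abortedLastRun = false; // cleared only after a successful resume
    resumedSessionKeys.add(sessionKey);
    result.recovered += 1;
  }
  return result;
}

// Retry wrapper: the dedupe Set is created once, outside the retry loop,
// so it survives across attempts. Returns the result of the last attempt.
function recoverWithRetries(
  store: SessionStore,
  resume: (sessionKey: string) => boolean,
  maxAttempts: number,
): RecoveryResult {
  const resumedSessionKeys = new Set<string>();
  let last: RecoveryResult = { recovered: 0, failed: 0, skipped: 0 };
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    last = recoverOrphans(store, resume, resumedSessionKeys);
    if (last.failed === 0) break; // nothing left to retry
  }
  return last;
}
```

Because the flag is cleared only on success, a session whose resume fails stays visible to the next scan, which is the behaviour the review comments asked for.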
Closes #47711
Problem
Three bugs cause in-flight subagent work to be silently lost during gateway restarts:
- `openclaw gateway restart` CLI bypasses deferral: sends SIGUSR1 immediately without checking for active embedded runs

Changes
Part B: Post-reload orphan recovery + run tracking repair (Bug 2)
New module `src/agents/subagent-orphan-recovery.ts`:

- Scans for sessions with `abortedLastRun: true`
- Uses `callGateway({ method: "agent" })` to trigger a new LLM turn for the aborted session
- … `restoreSubagentRunsOnce()` in the subagent registry

Part C: Configurable deferral timeout (Bug 3)
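A minimal sketch of how a configurable deferral timeout of this kind might be resolved and applied. Only `DEFAULT_DEFERRAL_MAX_WAIT_MS`, its 300s value, and the `gateway.reload.deferralTimeoutMs` config path come from this PR; the `GatewayConfig` shape, `resolveDeferralTimeoutMs`, `decideDeferral`, and the fallback rule for invalid values are assumptions made for illustration.

```typescript
// From the PR text: the default wait is 300s (5 minutes), up from 90s.
const DEFAULT_DEFERRAL_MAX_WAIT_MS = 5 * 60 * 1000;

// Hypothetical config shape; only the gateway.reload.deferralTimeoutMs
// path itself is taken from the PR.
interface GatewayConfig {
  gateway?: { reload?: { deferralTimeoutMs?: number } };
}

function resolveDeferralTimeoutMs(config: GatewayConfig): number {
  const configured = config.gateway?.reload?.deferralTimeoutMs;
  // Assumption: missing, non-finite, or negative values fall back to the default.
  if (typeof configured === "number" && Number.isFinite(configured) && configured >= 0) {
    return configured;
  }
  return DEFAULT_DEFERRAL_MAX_WAIT_MS;
}

type DeferralDecision = "restart-idle" | "restart-timeout" | "wait";

// Pure decision step for one poll of the deferral loop.
function decideDeferral(
  pendingRuns: number,
  elapsedMs: number,
  timeoutMs: number,
): DeferralDecision {
  if (pendingRuns <= 0) return "restart-idle"; // nothing in flight, restart now
  if (elapsedMs >= timeoutMs) return "restart-timeout"; // waited long enough
  return "wait"; // runs still active and budget remains
}
```

Keeping the decision step pure makes the cases in the test list (default timeout, custom timeout, drain before timeout, zero pending) easy to check without timers.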
- `DEFAULT_DEFERRAL_MAX_WAIT_MS` increased from 90s to 300s (5 minutes)
- Configurable via `gateway.reload.deferralTimeoutMs`

Testing
24 tests passing across targeted coverage:
- `subagent-orphan-recovery.test.ts` (10 tests): orphan detection, resume message injection, multi-session recovery, error handling with flag preservation, missing resumed runId, task truncation, recovered-run callback
- `restart.deferral-timeout.test.ts` (5 tests): default 5-minute timeout, custom timeout via config, drain-before-timeout, immediate restart on zero pending, error handling
- `lifecycle.test.ts` (9 tests): unmanaged graceful restart over RPC, fallback SIGUSR1 path, existing restart/stop coverage

Scope
This PR now covers all three root-cause bugs from #47711: