fix: resume orphaned subagent sessions after SIGUSR1 reload #47719

Merged
steipete merged 6 commits into openclaw:main from joeykrug:fix/sigusr1-orphan-recovery
Mar 16, 2026

Contributor

@joeykrug joeykrug commented Mar 16, 2026

Closes #47711

Problem

Three bugs cause in-flight subagent work to be silently lost during gateway restarts:

  1. openclaw gateway restart CLI bypasses deferral — sends SIGUSR1 immediately without checking for active embedded runs
  2. Post-restart orphan recovery fails silently — restored runs can get stuck because the original wait path is re-armed before the gateway is ready, and aborted turns are never actively resumed
  3. Default deferral timeout too short — 90s default can expire while long-running subagent turns are still in progress

Changes

Part B: Post-reload orphan recovery + run tracking repair (Bug 2)

New module src/agents/subagent-orphan-recovery.ts:

  • After subagent runs are restored on restart, schedules an orphan recovery scan (5s delay for gateway bootstrap)
  • Scans active subagent run records for sessions with abortedLastRun: true
  • Sends a synthetic resume message via callGateway({ method: "agent" }) to trigger a new LLM turn for the aborted session
  • Captures the new resumed runId and hands it back to the registry, which remaps tracking from the old runId to the new runId and re-arms completion waiting on the correct run
  • This prevents resumed lifecycle events from being ignored and avoids incorrect timeout-based completion for recovered runs
  • Flag only cleared after confirmed successful resume (prevents permanent session loss on transient failures)
  • Retry with exponential backoff (3 attempts: 5s → 10s → 20s) if gateway is not yet ready
  • Dynamic import to avoid startup memory overhead
  • Integrated into restoreSubagentRunsOnce() in the subagent registry
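
The retry schedule in the list above can be sketched as follows. This is a minimal TypeScript illustration, not the module's code: `backoffDelaysMs`, `runWithBackoff`, the `ScanResult` shape, and the injected `sleep` are hypothetical names, and treating the first 5s wait as the gateway bootstrap delay is an assumption.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not taken from the PR.
type ScanResult = { recovered: number; failed: number; skipped: number };

/** Delays before each attempt: base, 2*base, 4*base, ... */
function backoffDelaysMs(baseMs: number, attempts: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(baseMs * Math.pow(2, i));
  }
  return delays;
}

/** Waits, scans, and stops as soon as a scan reports no failures. */
async function runWithBackoff(
  scan: () => Promise<ScanResult>,
  opts: { baseMs: number; attempts: number; sleep: (ms: number) => Promise<void> },
): Promise<ScanResult> {
  let last: ScanResult = { recovered: 0, failed: 0, skipped: 0 };
  for (const delayMs of backoffDelaysMs(opts.baseMs, opts.attempts)) {
    await opts.sleep(delayMs);
    last = await scan();
    if (last.failed === 0) break; // nothing left to retry
  }
  return last;
}
```

Injecting `sleep` keeps the loop testable without real timers; a real caller would presumably wrap `setTimeout`.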

Part C: Configurable deferral timeout (Bug 3)

  • DEFAULT_DEFERRAL_MAX_WAIT_MS increased from 90s to 300s (5 minutes)
  • New config key: gateway.reload.deferralTimeoutMs
  • Wired through config schema, types, help text, and caller site
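
One way the new key could be resolved against the default, as a hedged sketch. Only the key path `gateway.reload.deferralTimeoutMs` and the 300s default come from this PR; the `GatewayConfigLike` type, the `resolveDeferralTimeoutMs` helper, and the rule that missing or invalid values fall back to the default are assumptions.

```typescript
// Default from the PR description: 300s (5 minutes).
const DEFAULT_DEFERRAL_MAX_WAIT_MS = 300_000;

// Hypothetical, simplified config shape; the real typed config is larger.
type GatewayConfigLike = {
  gateway?: { reload?: { deferralTimeoutMs?: number } };
};

/** Returns the configured timeout, or the default when it is missing or invalid. */
function resolveDeferralTimeoutMs(cfg: GatewayConfigLike | undefined): number {
  const value = cfg?.gateway?.reload?.deferralTimeoutMs;
  // Assumed validation rule: only positive finite numbers are accepted.
  if (typeof value === "number" && isFinite(value) && value > 0) {
    return value;
  }
  return DEFAULT_DEFERRAL_MAX_WAIT_MS;
}
```

Under these assumptions, `resolveDeferralTimeoutMs({ gateway: { reload: { deferralTimeoutMs: 60000 } } })` yields `60000`, and an empty config yields the 300s default.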

Testing

24 tests passing across targeted coverage:

  • subagent-orphan-recovery.test.ts (10 tests): orphan detection, resume message injection, multi-session recovery, error handling with flag preservation, missing resumed runId, task truncation, recovered-run callback
  • restart.deferral-timeout.test.ts (5 tests): default 5-minute timeout, custom timeout via config, drain-before-timeout, immediate restart on zero pending, error handling
  • lifecycle.test.ts (9 tests): unmanaged graceful restart over RPC, fallback SIGUSR1 path, existing restart/stop coverage

Scope

This PR now covers all three root-cause bugs from #47711:

  • CLI restart bypass
  • post-restart orphan recovery / stuck tracking
  • missing auto-resume for aborted turns

@openclaw-barnacle openclaw-barnacle bot added the gateway (Gateway runtime), agents (Agent runtime and tooling), and size: L labels Mar 16, 2026
Contributor

greptile-apps bot commented Mar 16, 2026

Greptile Summary

This PR implements post-restart orphan recovery for subagent sessions (Part B) and a configurable deferral timeout (Part C) to address three root causes of silent in-flight work loss during gateway SIGUSR1 reloads. The new subagent-orphan-recovery.ts module is well-structured: it correctly gates flag clearing on confirmed resume success, uses exponential backoff for retries, and deduplicates within a single scan pass. Config schema, types, and help text are consistently updated for the new gateway.reload.deferralTimeoutMs key.

  • The orphan recovery logic is sound and addresses the key correctness concern (previously flagged) of not clearing abortedLastRun before a confirmed successful resume.
  • The configurable deferralTimeoutMs is only wired through the requestGatewayRestart / server-reload-handlers.ts path. The parallel scheduleGatewaySigusr1Restart path in restart.ts calls deferGatewayRestartUntilIdle without forwarding maxWaitMs, so the user-configured value is silently ignored for restarts routed through that function.
  • The TLS probe fix in lifecycle.ts uses a raw structural type cast (cfg as { gateway?: { tls?: ... } }) rather than the already-typed config, which is a minor maintenance risk.

Confidence Score: 4/5

  • Safe to merge with awareness that deferralTimeoutMs config has no effect on the scheduleGatewaySigusr1Restart code path.
  • The orphan recovery implementation is correct and well-tested (24 tests). The critical flag-preservation fix addressed in prior review threads is properly implemented. The main gap is an incomplete application of the new deferralTimeoutMs config — it is wired in server-reload-handlers.ts but silently ignored in scheduleGatewaySigusr1Restart, which could surprise operators who configure the timeout expecting uniform behaviour across all restart paths. No data-loss or runtime-crash risks identified.
  • src/infra/restart.ts — the scheduleGatewaySigusr1Restart path at line 479 does not forward maxWaitMs to deferGatewayRestartUntilIdle, causing the new deferralTimeoutMs config to have no effect for restarts triggered through that function.

Comments Outside Diff (1)

  1. src/infra/restart.ts, line 479 (link)

    deferralTimeoutMs ignored on scheduleGatewaySigusr1Restart path

    deferGatewayRestartUntilIdle is called here without maxWaitMs, so the user-configured gateway.reload.deferralTimeoutMs is silently ignored for any restart that flows through scheduleGatewaySigusr1Restart. Only the requestGatewayRestart path in server-reload-handlers.ts correctly forwards the custom timeout.

    If a user sets deferralTimeoutMs to, say, 60000 to reduce wait time on forced restarts, that value will have no effect here — the call always falls back to DEFAULT_DEFERRAL_MAX_WAIT_MS (300s). The config key and the default now disagree on which code paths honour it.

    To wire deferralTimeoutMs here consistently, scheduleGatewaySigusr1Restart would need to accept (or read) the config value and forward it:

    Current:

    deferGatewayRestartUntilIdle({ getPendingCount: pendingCheck });

    Suggested:

    deferGatewayRestartUntilIdle({ getPendingCount: pendingCheck, maxWaitMs: opts?.maxWaitMs });

    with scheduleGatewaySigusr1Restart accepting an optional maxWaitMs from callers that have config access.

Last reviewed commit: c26966d

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4d725ec153

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +132 to +136
// Clear the aborted flag before resuming
await updateSessionStore(storePath, (currentStore) => {
  const current = currentStore[childSessionKey];
  if (current) {
    current.abortedLastRun = false;

[P1] Clear aborted markers only after a successful resume

This clears abortedLastRun before attempting the synthetic resume turn. If the subsequent gateway call fails (e.g., gateway still booting or transient RPC error), the session is counted as failed but no longer has abortedLastRun=true, so later scans skip it (if (!entry.abortedLastRun) continue) and the orphaned run is effectively unrecoverable. Keep the flag set until resume succeeds (or revert it on failure) so retries remain possible.
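
A minimal sketch of the ordering suggested here, assuming an in-memory store and an injected `sendResume` callback. Both are hypothetical stand-ins for `updateSessionStore` and the gateway call, and `resumeOrphan` is an illustrative name rather than a function in the diff.

```typescript
// Hypothetical shapes; only the abortedLastRun flag name comes from the PR.
type SessionEntry = { abortedLastRun?: boolean };
type SessionStore = Record<string, SessionEntry>;
type Outcome = "recovered" | "failed" | "skipped";

async function resumeOrphan(
  store: SessionStore,
  sessionKey: string,
  sendResume: (sessionKey: string) => Promise<{ runId?: string }>,
): Promise<Outcome> {
  const entry = store[sessionKey];
  if (!entry || !entry.abortedLastRun) return "skipped";
  try {
    const { runId } = await sendResume(sessionKey);
    // No confirmed run: leave the flag set so a later scan can retry.
    if (!runId) return "failed";
    // Clear the flag only after the resume is confirmed.
    entry.abortedLastRun = false;
    return "recovered";
  } catch {
    // Flag untouched, so the orphan stays visible to the next scan.
    return "failed";
  }
}
```

Because the flag is written last, a gateway that is still booting or a transient RPC error leaves the session eligible for the next recovery pass.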


Contributor Author

Fixed in 52e8e6d — flag clearing now happens only after successful resume.

Contributor Author

Already addressed: as of c3eae1d, abortedLastRun is cleared only after a successful resume. The flag is preserved on failure so the next restart can retry. Test coverage verifies both paths.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 52e8e6d19e

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 13680006cd

@openclaw-barnacle openclaw-barnacle bot added the cli (CLI command changes) label Mar 16, 2026
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a792291486

@joeykrug joeykrug force-pushed the fix/sigusr1-orphan-recovery branch from a792291 to 38bb216 Compare March 16, 2026 01:41
@openclaw-barnacle openclaw-barnacle bot removed the cli (CLI command changes) label Mar 16, 2026
@joeykrug joeykrug force-pushed the fix/sigusr1-orphan-recovery branch from 38bb216 to 1eb6ce7 Compare March 16, 2026 01:43
@openclaw-barnacle openclaw-barnacle bot added the cli (CLI command changes) label Mar 16, 2026
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1eb6ce7d2d

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 02c0c994bc

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c3eae1d019

joeykrug added a commit to joeykrug/openclaw that referenced this pull request Mar 16, 2026
…ume context and config idempotency guard
@openclaw-barnacle openclaw-barnacle bot added the cli (CLI command changes) label Mar 16, 2026
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 53d038a218

@joeykrug joeykrug marked this pull request as draft March 16, 2026 03:15
@joeykrug joeykrug marked this pull request as ready for review March 16, 2026 03:15
Comment on lines +53 to +56
const cfg = await readBestEffortConfig().catch(() => undefined);
const tlsEnabled = !!(cfg as { gateway?: { tls?: { enabled?: unknown } } } | undefined)?.gateway
?.tls?.enabled;
const scheme = tlsEnabled ? "wss" : "ws";

Raw type assertion bypasses typed config

readBestEffortConfig() already returns a strongly-typed config object. Using a manual structural cast here creates a silent mismatch risk — if the TLS config shape ever changes (e.g. tls.enabled is renamed or moved), this cast will silently evaluate to undefined instead of surfacing an error.

The existing typed config should already expose gateway.tls.enabled via the imported types — using the typed accessor would be both safer and cleaner:

Suggested change
- const cfg = await readBestEffortConfig().catch(() => undefined);
- const tlsEnabled = !!(cfg as { gateway?: { tls?: { enabled?: unknown } } } | undefined)?.gateway
-   ?.tls?.enabled;
- const scheme = tlsEnabled ? "wss" : "ws";
+ const cfg = await readBestEffortConfig().catch(() => undefined);
+ const tlsEnabled = !!(cfg?.gateway?.tls?.enabled);
+ const scheme = tlsEnabled ? "wss" : "ws";

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c26966def8

  getActiveRuns: () => Map<string, SubagentRunRecord>;
}): Promise<{ recovered: number; failed: number; skipped: number }> {
  const result = { recovered: 0, failed: 0, skipped: 0 };
  const resumedSessionKeys = new Set<string>();

[P2] Persist resumed-session dedupe across recovery retries

Fresh evidence after the earlier duplicate-resume fix: resumedSessionKeys is recreated on every recoverOrphanedSubagentSessions invocation, but scheduleOrphanRecovery can invoke this function multiple times when any session in the batch fails. If one session resumes successfully but updateSessionStore fails to clear abortedLastRun, and another session causes result.failed > 0, the next retry pass will re-resume the already-running session and create a second run for the same orphaned work.
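
The fix this comment points toward can be sketched by letting the retry loop own the dedupe set and pass it into each pass. The snippet is a simplified illustration under assumptions: `recoverPass`, `recoverWithRetries`, the `Orphan` type, and the boolean `resume` callback are hypothetical, and backoff delays are omitted for brevity.

```typescript
// Hypothetical sketch: only the resumedSessionKeys idea comes from the review comment.
type Orphan = { sessionKey: string };
type PassResult = { recovered: number; failed: number; skipped: number };

async function recoverPass(
  orphans: Orphan[],
  resume: (sessionKey: string) => Promise<boolean>,
  resumedSessionKeys: Set<string>, // owned by the caller so it survives across passes
): Promise<PassResult> {
  const result: PassResult = { recovered: 0, failed: 0, skipped: 0 };
  for (const { sessionKey } of orphans) {
    if (resumedSessionKeys.has(sessionKey)) {
      result.skipped += 1; // already resumed in an earlier pass
      continue;
    }
    let ok = false;
    try {
      ok = await resume(sessionKey);
    } catch {
      ok = false;
    }
    if (ok) {
      resumedSessionKeys.add(sessionKey);
      result.recovered += 1;
    } else {
      result.failed += 1;
    }
  }
  return result;
}

async function recoverWithRetries(
  orphans: Orphan[],
  resume: (sessionKey: string) => Promise<boolean>,
  maxPasses: number,
): Promise<PassResult> {
  const resumedSessionKeys = new Set<string>(); // created once, shared by every pass
  let last: PassResult = { recovered: 0, failed: 0, skipped: 0 };
  for (let pass = 0; pass < maxPasses; pass++) {
    last = await recoverPass(orphans, resume, resumedSessionKeys);
    if (last.failed === 0) break;
  }
  return last;
}
```

With the set created once per recovery cycle, a session that resumed in pass one is skipped in pass two even if its persisted flag was never cleared.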


joeykrug pushed a commit to joeykrug/openclaw that referenced this pull request Mar 16, 2026
1. [P1] Treat remap failures as resume failures — if replaceSubagentRunAfterSteer
   returns false, do NOT clear abortedLastRun, increment failed count.

2. [P2] Count scan-level exceptions as retryable failures — set result.failed > 0
   in the outer catch block so scheduleOrphanRecovery retry logic triggers.

3. [P2] Persist resumed-session dedupe across recovery retries — accept
   resumedSessionKeys as a parameter; scheduleOrphanRecovery lifts the Set to
   its own scope and passes it through retries.

4. [Greptile] Use typed config accessors instead of raw structural cast for TLS
   check in lifecycle.ts.

5. [Greptile] Forward gateway.reload.deferralTimeoutMs to deferGatewayRestartUntilIdle
   in scheduleGatewaySigusr1Restart so user-configured value is not silently ignored.

6. [Greptile] Same as openclaw#4 — already addressed by the typed config fix.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
steipete pushed a commit that referenced this pull request Mar 16, 2026
steipete pushed a commit that referenced this pull request Mar 16, 2026
@steipete
Contributor

Landed via temp rebase onto main.

  • Gate: pnpm check && pnpm build && pnpm test
  • Land commit: 745b01d
  • Merge commit: 680eff6

Thanks @joeykrug!

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 745b01d775

Comment on lines +54 to +57
+ const tlsEnabled = !!cfg?.gateway?.tls?.enabled;
+ const scheme = tlsEnabled ? "wss" : "ws";
  const probe = await probeGateway({
-   url: `ws://127.0.0.1:${port}`,
+   url: `${scheme}://127.0.0.1:${port}`,

[P2] Probe unmanaged gateway with both ws and wss schemes

assertUnmanagedGatewayRestartEnabled now picks a single probe scheme from local config, so when the running unmanaged gateway still uses the previous scheme during a pending TLS transition, the probe fails and the commands.restart guard is skipped. In that state, restartGatewayWithoutServiceManager still sends SIGUSR1, but the run loop ignores unauthorized external restarts when commands.restart=false, so openclaw gateway restart can report success without an actual restart. Please try both schemes (or otherwise verify against the live listener) before deciding the precheck is unavailable.
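
The "try both schemes" suggestion could look roughly like this. It is a sketch under assumptions: `probeEitherScheme` is a hypothetical helper, and the injected `probe` callback stands in for `probeGateway`, whose real signature and result shape are not shown in this thread beyond the `url` option.

```typescript
// Hypothetical sketch: the probe callback and result shape are assumptions.
type Scheme = "ws" | "wss";
type ProbeResult = { ok: boolean };

async function probeEitherScheme(
  port: number,
  preferred: Scheme,
  probe: (url: string) => Promise<ProbeResult>,
): Promise<{ scheme: Scheme; result: ProbeResult } | undefined> {
  // Try the scheme implied by local config first, then the other one.
  const order: Scheme[] = preferred === "wss" ? ["wss", "ws"] : ["ws", "wss"];
  for (const scheme of order) {
    try {
      const result = await probe(`${scheme}://127.0.0.1:${port}`);
      if (result.ok) return { scheme, result };
    } catch {
      // A thrown probe is treated as "not reachable on this scheme".
    }
  }
  return undefined; // neither scheme answered, so the precheck is unavailable
}
```

Only when both attempts fail would the caller treat the precheck as unavailable, which covers a gateway still listening on the previous scheme during a TLS transition.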


pengwork pushed a commit to pengwork/openclaw that referenced this pull request Mar 16, 2026
pengwork pushed a commit to pengwork/openclaw that referenced this pull request Mar 16, 2026
pengwork pushed a commit to pengwork/openclaw that referenced this pull request Mar 16, 2026
vincentkoc pushed a commit to vincentkoc/openclaw that referenced this pull request Mar 16, 2026
vincentkoc pushed a commit to vincentkoc/openclaw that referenced this pull request Mar 16, 2026
vincentkoc pushed a commit to vincentkoc/openclaw that referenced this pull request Mar 16, 2026
romeroej2 pushed a commit to romeroej2/openclaw that referenced this pull request Mar 16, 2026
romeroej2 pushed a commit to romeroej2/openclaw that referenced this pull request Mar 16, 2026
romeroej2 pushed a commit to romeroej2/openclaw that referenced this pull request Mar 16, 2026
jonBoone added a commit to jonBoone/openclaw that referenced this pull request Mar 17, 2026
* refactor: make setup the primary wizard surface

* test: move setup surface coverage

* docs: prefer setup wizard command

* fix: follow up shared interactive regressions (openclaw#47715)

* fix(plugins): resolve lazy runtime from package root

* fix(daemon): accept 'Last Result' schtasks key variant on Windows (openclaw#47726)

Some Windows locales/versions emit 'Last Result' instead of 'Last Run Result' in schtasks output, causing gateway status to falsely report 'Runtime: unknown'. Fall back to the shorter key when the canonical key is absent.

* fix: accept schtasks Last Result key on Windows (openclaw#47844) (thanks @MoerAI)

* refactor(plugins): move auth profile hooks into providers

* fix: resume orphaned subagent sessions after SIGUSR1 reload

Closes openclaw#47711

After a SIGUSR1 gateway reload aborts in-flight subagent LLM calls, the gateway now scans for orphaned sessions and sends a synthetic resume message to restart their work. Also makes the deferral timeout configurable via gateway.reload.deferralTimeoutMs (default: 5 minutes, up from 90s).

* fix: address Greptile review feedback

- Remove unrelated pnpm-lock.yaml changes
- Move abortedLastRun flag clearing to AFTER successful resume
  (prevents permanent session loss on transient gateway failures)
- Use dynamic import for orphan recovery module to avoid startup
  memory overhead
- Add test assertion that flag is preserved on resume failure

* fix: add retry with exponential backoff for orphan recovery

Addresses Codex review feedback — if recovery fails (e.g. gateway
still booting), retries up to 3 times with exponential backoff
(5s → 10s → 20s) before giving up.
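
The retry schedule above can be sketched as follows. This is not the PR's code: `backoffDelaysMs` and `retryWithBackoff` are hypothetical names, the `sleep` function is injected only so the sketch can run without real timers, and waiting before every attempt (including the first) is an assumption.

```typescript
// Delay before each attempt, doubling every time.
// With baseMs = 5000 and attempts = 3 this yields 5000, 10000, 20000.
function backoffDelaysMs(baseMs: number, attempts: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(baseMs * Math.pow(2, i));
  }
  return delays;
}

// Hypothetical retry loop: waits, runs `task`, and stops at the first success.
// A thrown error is treated like a failed attempt. Returns false once all
// attempts are used up.
async function retryWithBackoff(
  task: () => Promise<boolean>,
  sleep: (ms: number) => Promise<void>,
  baseMs: number = 5000,
  attempts: number = 3,
): Promise<boolean> {
  for (const delay of backoffDelaysMs(baseMs, attempts)) {
    await sleep(delay);
    try {
      if (await task()) return true;
    } catch (err) {
      // Swallow and retry; a real implementation would likely log `err`.
    }
  }
  return false;
}
```

In production the `sleep` argument would typically wrap `setTimeout`; a test can pass a function that records the requested delays and resolves immediately.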

* fix: address all review comments on PR openclaw#47719 + implement resume context and config idempotency guard

* fix: address 6 review comments on PR openclaw#47719

1. [P1] Treat remap failures as resume failures — if replaceSubagentRunAfterSteer
   returns false, do NOT clear abortedLastRun, increment failed count.

2. [P2] Count scan-level exceptions as retryable failures: set result.failed to a
   non-zero value in the outer catch block so scheduleOrphanRecovery retry logic triggers.

3. [P2] Persist resumed-session dedupe across recovery retries — accept
   resumedSessionKeys as a parameter; scheduleOrphanRecovery lifts the Set to
   its own scope and passes it through retries.

4. [Greptile] Use typed config accessors instead of raw structural cast for TLS
   check in lifecycle.ts.

5. [Greptile] Forward gateway.reload.deferralTimeoutMs to deferGatewayRestartUntilIdle
   in scheduleGatewaySigusr1Restart so user-configured value is not silently ignored.

6. [Greptile] Same as item 4; already addressed by the typed config fix.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
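
Items 1 to 3 above can be illustrated with a simplified, synchronous sketch. Every name here (`recoverOrphansSketch`, `OrphanSessionSketch`, the `resume` and `remap` callbacks) is hypothetical and stands in for the real gateway call and registry remap. Recording a session in `resumedSessionKeys` as soon as the resume message is sent is also an assumption about how the dedupe works.

```typescript
interface OrphanSessionSketch {
  sessionKey: string;
  abortedLastRun: boolean;
}

interface RecoveryResultSketch {
  resumed: number;
  failed: number;
}

function recoverOrphansSketch(
  listSessions: () => OrphanSessionSketch[],
  resume: (session: OrphanSessionSketch) => string, // returns the new runId, throws on failure
  remap: (sessionKey: string, newRunId: string) => boolean,
  resumedSessionKeys: Set<string>, // owned by the caller so it survives retries (item 3)
): RecoveryResultSketch {
  const result: RecoveryResultSketch = { resumed: 0, failed: 0 };
  try {
    for (const session of listSessions()) {
      if (!session.abortedLastRun) continue;
      // Skip sessions that already received a resume message in an earlier pass.
      if (resumedSessionKeys.has(session.sessionKey)) continue;
      try {
        const newRunId = resume(session);
        resumedSessionKeys.add(session.sessionKey);
        if (!remap(session.sessionKey, newRunId)) {
          // Item 1: a failed remap counts as a failed resume; the flag stays set.
          result.failed += 1;
          continue;
        }
        // The flag is cleared only after resume and remap both succeeded.
        session.abortedLastRun = false;
        result.resumed += 1;
      } catch (err) {
        result.failed += 1;
      }
    }
  } catch (err) {
    // Item 2: a scan-level exception is reported as a failure so the caller retries.
    result.failed += 1;
  }
  return result;
}
```

A scheduler would create the `Set` once, pass it to every retry, and retry whenever `result.failed` is greater than zero.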

* fix: land SIGUSR1 orphan recovery regressions (openclaw#47719) (thanks @joeykrug)

* refactor: move channel messaging hooks into plugins

* fix: stabilize windows parallels smoke harness

* refactor(plugins): move provider onboarding auth into plugins

* docs: restore onboard as canonical setup command

* docs: restore onboard docs references

* refactor: move channel capability diagnostics into plugins

* refactor: split slack block action handling

* refactor: split plugin interactive dispatch adapters

* refactor: unify reply content checks

* refactor: extract discord shared interactive mapper

* refactor: unify telegram interactive button resolution

* build: add land gate parity script

* test: fix setup wizard smoke mocks

* docs: sync config baseline

* Status: split heartbeat summary helpers

* Security: trim audit policy import surfaces

* Security: lazy-load deep skill audit helpers

* Security: lazy-load audit config snapshot IO

* Config: keep native command defaults off heavy channel registry

* Status: split lightweight gateway agent list

* refactor(plugins): simplify provider auth choice metadata

* test: add openshell sandbox e2e smoke

* feishu: add structured card actions and interactive approval flows (openclaw#47873)

* feishu: add structured card actions and interactive approval flows

* feishu: address review fixes and test-gate regressions

* feishu: hold inflight card dedup until completion

* feishu: restore fire-and-forget bot menu handling

* feishu: format card interaction helpers

* Feishu: add changelog entry for card interactions

* Feishu: add changelog entry for ACP session binding

* build: remove land gate script

* Status: lazy-load tailscale and memory scan deps

* Tests: fix Feishu full registration mock

* Tests: cover plugin capability matrix

* Gateway: import normalizeAgentId in hooks

* fix: recover bonjour advertiser from ciao announce loops

* fix: preserve loopback gateway scopes for local auth

* Status: lazy-load summary session helpers

* Status: lazy-load security audit commands

* refactor: move channel delivery and ACP seams into plugins

* Security: split audit runtime surfaces

* Tests: add channel actions contract helper

* Tests: add channel plugin contract helper

* Tests: add Slack channel contract suite

* Tests: add Mattermost channel contract suite

* Tests: add Telegram channel contract suite

* Tests: add Discord channel contract suite

* fix(session): preserve external channel route when webchat views session (openclaw#47745)

When a Telegram/WhatsApp/iMessage session was viewed or messaged from the
dashboard/webchat, resolveLastChannelRaw() unconditionally returned 'webchat'
for any isDirectSessionKey() or isMainSessionKey() match, overwriting the
persisted external delivery route.

This caused subagent completion events to be delivered to the webchat/dashboard
instead of the original channel (Telegram, WhatsApp, etc.), silently dropping
messages for the channel user.

Fix: only allow webchat to own routing when no external delivery route has been
established (no persisted external lastChannel, no external channel hint in the
session key). If an external route exists, webchat is treated as admin/monitoring
access and must not mutate the delivery route.
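
The routing rule in the fix can be sketched as below. This is a simplified illustration, not the actual `resolveLastChannelRaw()` implementation: the function name `resolveLastChannelSketch` and its parameter object are hypothetical.

```typescript
interface RouteInputsSketch {
  incomingChannel: string;         // channel the current message arrived on
  persistedLastChannel?: string;   // previously stored delivery route, if any
  sessionKeyChannelHint?: string;  // channel implied by the session key, if any
}

function resolveLastChannelSketch(inputs: RouteInputsSketch): string {
  // Messages from an external channel always own the route.
  if (inputs.incomingChannel !== "webchat") return inputs.incomingChannel;

  // Webchat is treated as admin/monitoring access when an external route exists.
  const persisted = inputs.persistedLastChannel;
  if (persisted && persisted !== "webchat") return persisted;

  const hint = inputs.sessionKeyChannelHint;
  if (hint && hint !== "webchat") return hint;

  // No external route has been established, so webchat may own routing.
  return "webchat";
}
```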

Updated/added tests to document the correct behaviour.

Fixes openclaw#47745

* fix: address bot nit on session route preservation (openclaw#47797) (thanks @brokemac79)

* Tests: add plugin contract suites

* Tests: add plugin contract registry

* Tests: add global plugin contract suite

* Tests: add global actions contract suite

* Tests: add global setup contract suite

* Tests: add global status contract suite

* Tests: replace local channel contracts

* refactor(plugins): move onboarding auth metadata to manifests

* refactor: move remaining channel seams into plugins

* refactor: add plugin-owned outbound adapters

* fix: scope localStorage settings key by basePath to prevent cross-deployment conflicts

- Add settingsKeyForGateway() function similar to tokenSessionKeyForGateway()
- Use scoped key format: openclaw.control.settings.v1:https://example.com/gateway-a
- Add migration from legacy static key on load
- Fixes openclaw#47481
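
A rough sketch of the scoped key and the legacy migration, under stated assumptions: the legacy static key is taken to be the prefix of the scoped format shown above, `settingsKeyForGateway()` is assumed to accept the gateway URL as a string, and `loadSettingsSketch` plus the `KeyValueStoreSketch` interface are hypothetical.

```typescript
// Assumed legacy key: the unscoped prefix of the new key format.
const LEGACY_SETTINGS_KEY = "openclaw.control.settings.v1";

// Minimal subset of the Web Storage interface used here.
interface KeyValueStoreSketch {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

function settingsKeyForGateway(gatewayUrl: string): string {
  return LEGACY_SETTINGS_KEY + ":" + gatewayUrl;
}

function loadSettingsSketch(store: KeyValueStoreSketch, gatewayUrl: string): string | null {
  const scopedKey = settingsKeyForGateway(gatewayUrl);
  const scoped = store.getItem(scopedKey);
  if (scoped !== null) return scoped;

  // Migration: copy the legacy value under the scoped key on first load.
  const legacy = store.getItem(LEGACY_SETTINGS_KEY);
  if (legacy !== null) store.setItem(scopedKey, legacy);
  return legacy;
}
```

In a browser, `window.localStorage` satisfies the `KeyValueStoreSketch` shape, so it could be passed in directly.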

* Tests: add provider contract suites

* Tests: add provider contract registry

* Tests: add global provider contract suite

* Tests: add global web search contract suite

* fix: stabilize ci gate

* Tests: add provider registry contract suite

* Tests: relax provider auth hint contract

* !refactor(browser): remove Chrome extension path and add MCP doctor migration (openclaw#47893)

* Browser: replace extension path with Chrome MCP

* Browser: clarify relay stub and doctor checks

* Docs: mark browser MCP migration as breaking

* Browser: reject unsupported profile drivers

* Browser: accept clawd alias on profile create

* Doctor: narrow legacy browser driver migration

* feishu: harden media support and align capability docs (openclaw#47968)

* feishu: harden media support and action surface

* feishu: format media action changes

* feishu: fix review follow-ups

* fix: scope Feishu target aliases to Feishu (openclaw#47968) (thanks @Takhoffman)

* fix: make docs i18n use gpt-5.4 overrides

* docs: regenerate zh-CN onboarding references

* Tests: tighten provider wizard contracts

* Tests: add plugin loader contract suite

* refactor: remove dock shim and move session routing into plugins

* Plugins: add provider runtime contracts

* GitHub Copilot: move runtime tests to provider contracts

* Z.ai: move runtime tests to provider contracts

* Anthropic: move runtime tests to provider contracts

* Google: move runtime tests to provider contracts

* OpenAI: move runtime tests to provider contracts

* Qwen Portal: move runtime tests to provider contracts

* fix(core): restore outbound fallbacks and gate checks

* style(core): normalize rebase fallout

* fix: accept sandbox plugin id hints

* Plugins: capture tool registrations in test registry

* Plugins: cover Firecrawl tool ownership

* Firecrawl: drop local registration contract test

* Plugins: add provider catalog contracts

* Plugins: narrow provider runtime contracts

* refactor(plugin-sdk): use scoped core imports for bundled channels

* fix: unblock ci gates

* Plugins: add provider wizard contracts

* fix: unblock docs and registry checks

* refactor: finish plugin-owned channel runtime seams

* refactor(plugin-sdk): clean shared core imports

* Plugins: add provider auth contracts

* Plugins: dedupe routing imports in channel adapters

* Plugins: add provider discovery contracts

* fix: stop bonjour before re-advertising

* Plugins: extend provider discovery contracts

* docs: codify macOS parallels discord smoke

* refactor: move session lifecycle and outbound fallbacks into plugins

* refactor(plugins): derive compat provider ids from manifests

* Plugins: cover catalog discovery providers

* Tests: type auth contract prompt mocks

* fix: mount CLI auth dirs in docker live tests

* refactor: route shared channel sdk imports through plugin seams

* Plugins: add auth choice contracts

* Plugins: restore routing seams and discovery fixtures

* fix: harden bonjour retry recovery

* Runtime: lazy-load channel runtime singletons

* refactor: tighten plugin sdk channel seams

* refactor: route remaining channel imports through plugin sdk

* refactor(plugins): finish provider auth boundary cleanup

* fix(infra): wire gaxios-fetch-compat shim to prevent node-fetch crash on Node.js 25

* fix(infra): also wire gaxios-fetch-compat shim into src/index.ts (gateway entry)

* fix: keep gaxios compat off the package root (openclaw#47914) (thanks @pdd-cli)

* refactor: shrink public channel plugin sdk surfaces

* refactor: add private channel sdk bridges

* fix: restore effective setup wizard lazy import

* docs: add frontmatter to parallels discord skill

* fix: retry runtime postbuild skill copy races

* Gateway: lazily resolve channel runtime

* Gateway: cover lazy channel runtime resolution

* Plugins: add Claude marketplace registry installs (openclaw#48058)

* Changelog: note Claude marketplace plugin support

* Plugins: add Claude marketplace installs

* E2E: cover marketplace plugin installs in Docker

* UI: keep thinking helpers browser-safe

* Infra: restore check after gaxios compat

* Docs: refresh generated config baseline

* docs: reorder unreleased changelog entries

* fix: split browser-safe thinking helpers

* fix(macos): restore debug build helpers (openclaw#48046)

* Docs: add Claude marketplace plugin install guidance

* Channels: expand contract suites

* Channels: add contract surface coverage

* Channels: centralize outbound payload contracts

* Channels: centralize group policy contracts

* Channels: centralize inbound context contracts

* Tests: add contract runner

* Tests: harden WhatsApp inbound contract cleanup

* Tests: add extension test runner

* Runtime: lazy-load Discord channel ops

* Docs: use placeholders for marketplace plugin examples

* Release: trim generated docs from npm pack

* Runtime: lazy-load Telegram and Slack channel ops

* Tests: detect changed extensions

* Tests: cover changed extension detection

* Docs: add extension test workflow

* CI: add changed extension test lane

* BlueBubbles: lazy-load channel runtime paths

* Plugin SDK: restore scoped imports for bundled channels

* Plugin SDK: consolidate shared channel exports

* Channels: fix surface contract plugin lookup

* Status: stabilize startup memory probes

* Media: avoid slow auth misses in auto-detect

* Tests: scope Codex bundle loader fixture

* Tests: isolate bundle surface fixtures

* feat(telegram): add configurable silent error replies (openclaw#19776)

Port and complete openclaw#19776 on top of the current Telegram extension layout.

Adds a default-off `channels.telegram.silentErrorReplies` setting. When enabled, Telegram bot replies marked as errors are delivered silently across the regular bot reply flow, native/slash command replies, and fallback sends.

Thanks @auspic7 

Co-authored-by: Myeongwon Choi <[email protected]>
Co-authored-by: ImLukeF <[email protected]>

* fix(ui): auto load Usage tab data on navigation

* fix(telegram): keep silent error fallback replies quiet

* test(gateway): restore agent request route mock

* Cron: isolate active-model delivery tests

* Tests: align media auth fixture with selection checks

* Plugins: preserve lazy runtime provider resolution

* Bootstrap: report nested entry import misses

* fix(channels): parse bundled targets without plugin registry

* test(telegram): cover shared parsing without registry

* Plugin SDK: split setup and sandbox subpaths

* Providers: centralize setup defaults and helper boundaries

* Plugins: decouple bundled web search discovery

* Plugin SDK: update entrypoint metadata

* Secrets: honor caller env during runtime validation

* Tests: align Docker cache checks with non-root images

* Plugin SDK: keep root alias reflection lazy

* Providers: scope compat resolution to owning plugins

* Plugin SDK: add narrow setup subpaths

* Plugin SDK: update entrypoint metadata

* fix(slack): harden bolt import interop (openclaw#45953)

* fix(slack): harden bolt import interop

* fix(slack): simplify bolt interop resolver

* fix(slack): harden startup bolt interop

* fix(slack): place changelog entry at section end

---------

Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Altay <[email protected]>

* Tests: fix green check typing regressions

* Plugins: avoid booting bundled providers for catalog hooks

* fix: bypass telegram runtime proxy during health checks

* fix: align telegram probe test mock

* test: remove stale synology zod mock

* fix(android): reduce chat recomposition churn

* fix(android): preserve chat message identity on refresh

* fix(android): shrink chat image attachments

* Browser: support non-Chrome existing-session profiles via userDataDir (openclaw#48170)

Merged via squash.

Prepared head SHA: e490035
Co-authored-by: velvet-shark <[email protected]>
Co-authored-by: velvet-shark <[email protected]>
Reviewed-by: @velvet-shark

* fix(local-storage): improve VITEST environment check for localStorage access

* fix: normalize discord commands allowFrom auth

* test: update discord subagent hook mocks

* test: mock telegram native command reply pipeline

* fix(android): lazy-init node runtime after onboarding

* docs(config): refresh generated baseline

* Plugins: share channel plugin id resolution

* Gateway: defer full channel plugins until after listen

* Gateway: gate deferred channel startup behind opt-in

* Docs: document deferred channel startup opt-in

* feat(skills): preserve all skills in prompt via compact fallback before dropping (openclaw#47553)

* feat(skills): add compact format fallback for skill catalog truncation

When the full-format skill catalog exceeds the character budget,
applySkillsPromptLimits now tries a compact format (name + location
only, no description) before binary-searching for the largest fitting
prefix. This preserves full model awareness of registered skills in
the common overflow case.

Three-tier strategy:
1. Full format fits → use as-is
2. Compact format fits → switch to compact, keep all skills
3. Compact still too large → binary search largest compact prefix

Other changes:
- escapeXml() utility for safe XML attribute values
- formatSkillsCompact() emits same XML structure minus <description>
- Compact char-budget check reserves 150 chars for the warning line
  the caller prepends, preventing prompt overflow at the boundary
- 13 tests covering all tiers, edge cases, and budget reservation
- docs/.generated/config-baseline.json: fix pre-existing oxfmt issue
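
The three-tier strategy can be sketched as follows. This is an illustration under assumptions, not the body of `applySkillsPromptLimits`: the `<skill>` element layout, the `SkillSketch` type, and the names `formatFullSketch`, `formatCompactSketch`, and `fitSkillsSketch` are hypothetical. Only the tier order, the 150-character reserve, and the binary search over compact prefixes come from the description above.

```typescript
interface SkillSketch {
  name: string;
  location: string;
  description: string;
}

// Escape the five XML special characters; '&' must be replaced first.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

function formatCompactSketch(skills: SkillSketch[]): string {
  return skills
    .map((s) => `<skill><name>${escapeXml(s.name)}</name><location>${escapeXml(s.location)}</location></skill>`)
    .join("\n");
}

function formatFullSketch(skills: SkillSketch[]): string {
  return skills
    .map((s) => `<skill><name>${escapeXml(s.name)}</name><description>${escapeXml(s.description)}</description><location>${escapeXml(s.location)}</location></skill>`)
    .join("\n");
}

// Characters reserved for the warning line the caller prepends.
const WARNING_RESERVE_CHARS = 150;

function fitSkillsSketch(
  skills: SkillSketch[],
  maxChars: number,
): { text: string; mode: "full" | "compact" | "truncated"; kept: number } {
  // Tier 1: the full format fits.
  const full = formatFullSketch(skills);
  if (full.length <= maxChars) return { text: full, mode: "full", kept: skills.length };

  // Tier 2: the compact format fits within the reduced budget.
  const budget = maxChars - WARNING_RESERVE_CHARS;
  const compact = formatCompactSketch(skills);
  if (compact.length <= budget) return { text: compact, mode: "compact", kept: skills.length };

  // Tier 3: binary search for the largest compact prefix that fits.
  let lo = 0;
  let hi = skills.length;
  while (lo < hi) {
    const mid = Math.ceil((lo + hi) / 2);
    if (formatCompactSketch(skills.slice(0, mid)).length <= budget) {
      lo = mid;
    } else {
      hi = mid - 1;
    }
  }
  return { text: formatCompactSketch(skills.slice(0, lo)), mode: "truncated", kept: lo };
}
```

The binary search relies on the compact output growing with each added skill, so "prefix of length k fits" holds for every k up to some cutoff and fails beyond it.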

* docs: document compact skill prompt fallback

---------

Co-authored-by: Frank Yang <[email protected]>

* Gateway: simplify startup and stabilize mock responses tests

* test: fix stale web search and boot-md contracts

* Gateway tests: centralize mock responses provider setup

* Gateway tests: share ordered client teardown helper

* Infra: ignore ciao probing cancellations

* Docs: repair unreleased changelog attribution

* Docs: normalize unreleased changelog refs

* Channels: ignore enabled-only disabled plugin config

* perf: reduce status json startup memory

* perf: lazy-load status route startup helpers

* Plugins: stage local bundled runtime tree

* Build: share root dist chunks across tsdown entries

* fix(logging): make logger import browser-safe

* fix(changelog): add entry for Control UI logger import fix (openclaw#48469)

* fix(changelog): note Control UI logger import fix

* fix(changelog): attribute Control UI logger fix entry

* fix(changelog): credit original Control UI fix author

* Plugins: remove public extension-api surface (openclaw#48462)

* Plugins: remove public extension-api surface

* Plugins: fix loader setup routing follow-ups

* CI: ignore non-extension helper dirs in extension-fast

* Docs: note extension-api removal as breaking

* fix(ui): language dropdown selection not persisting after refresh (openclaw#48019)

Merged via squash.

Prepared head SHA: 06c8258
Co-authored-by: git-jxj <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf

* fix(plugins): late-binding subagent runtime for non-gateway load paths (openclaw#46648)

Merged via squash.

Prepared head SHA: 4474265
Co-authored-by: jalehman <[email protected]>
Co-authored-by: jalehman <[email protected]>
Reviewed-by: @jalehman

* fix: enable auto-scroll during assistant response streaming

Fix auto-scroll behavior when AI assistant streams responses in the web UI.
Previously, the viewport would remain at the sent message position and users
had to manually click a badge to see streaming responses.

Fixes openclaw#14959

Changes:
- Reset chat scroll state before sending message to ensure viewport readiness
- Force scroll to bottom after message send to position viewport correctly
- Detect streaming start (chatStream: null -> string) and trigger auto-scroll
- Ensure smooth scroll-following during entire streaming response

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(ui): align chatStream lifecycle type with nullable state

* fix(whatsapp): restore implicit reply mentions for LID identities (openclaw#48494)

Threads selfLid from the Baileys socket through the inbound WhatsApp
pipeline and adds LID-format matching to the implicit mention check
in group gating, so reply-to-bot detection works when WhatsApp sends
the quoted sender in @lid format.

Also fixes the device-suffix stripping regex (was a silent no-op).

Closes openclaw#23029

Co-authored-by: sparkyrider <[email protected]>
Reviewed-by: @ademczuk

* fix(compaction): stabilize toolResult trim/prune flow in safeguard (openclaw#44133)

Merged via squash.

Prepared head SHA: ec789c6
Co-authored-by: SayrWolfridge <[email protected]>
Co-authored-by: jalehman <[email protected]>
Reviewed-by: @jalehman

* Fix launcher startup regressions (openclaw#48501)

* Fix launcher startup regressions

* Fix CI follow-up regressions

* Fix review follow-ups

* Fix workflow audit shell inputs

* Handle require resolve gaxios misses

* fix: remove orphaned tool_result blocks during compaction (openclaw#15691) (openclaw#16095)

Merged via squash.

Prepared head SHA: b772432
Co-authored-by: claw-sylphx <[email protected]>
Co-authored-by: jalehman <[email protected]>
Reviewed-by: @jalehman

* docs: rename onboarding user-facing wizard copy

Co-authored-by: Tak <[email protected]>

* fix(plugins): keep built plugin loading on one module graph (openclaw#48595)

* Plugins: stabilize global catalog contracts

* Channels: add global threading and directory contracts

* Tests: improve extension runner discovery

* CI: run global contract lane

* Plugins: speed up auth-choice contracts

* Plugins: fix catalog contract mocks

* refactor: move provider catalogs into extensions

* refactor: route bundled channel setup helpers through private sdk bridges

* test: fix check contract type drift

* Tests: lock plugin slash commands to one runtime graph

* Tests: cover Discord provider plugin registry

* Tests: pin loader command activation semantics

* Tests: cover Telegram plugin auth on real registry

* refactor(slack): share setup helpers

* refactor(whatsapp): reuse shared normalize helpers

* Tlon: lazy-load channel runtime paths

* Tests: document Discord plugin auth gating

* feat(plugins): add speech provider registration

* docs(plugins): document capability ownership model

* fix: detect Ollama "prompt too long" as context overflow error (openclaw#34019)

Merged via squash.

Prepared head SHA: 825a402
Co-authored-by: lishuaigit <[email protected]>
Co-authored-by: jalehman <[email protected]>
Reviewed-by: @jalehman

* agent: preemptive context overflow detection during tool loops (openclaw#29371)

Merged via squash.

Prepared head SHA: 19661b8
Co-authored-by: keshav55 <[email protected]>
Co-authored-by: jalehman <[email protected]>
Reviewed-by: @jalehman

---------

Co-authored-by: Peter Steinberger <[email protected]>
Co-authored-by: MoerAI <[email protected]>
Co-authored-by: Joey Krug <[email protected]>
Co-authored-by: bot_apk <[email protected]>
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Vincent Koc <[email protected]>
Co-authored-by: Tak Hoffman <[email protected]>
Co-authored-by: brokemac79 <[email protected]>
Co-authored-by: ObitaBot <[email protected]>
Co-authored-by: Prompt Driven <[email protected]>
Co-authored-by: Nimrod Gutman <[email protected]>
Co-authored-by: Gustavo Madeira Santana <[email protected]>
Co-authored-by: Myeongwon Choi <[email protected]>
Co-authored-by: Myeongwon Choi <[email protected]>
Co-authored-by: ImLukeF <[email protected]>
Co-authored-by: 郑耀宏 <[email protected]>
Co-authored-by: Ayaan Zaidi <[email protected]>
Co-authored-by: huntharo <[email protected]>
Co-authored-by: Yauheni Shauchenka <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Altay <[email protected]>
Co-authored-by: Radek Sienkiewicz <[email protected]>
Co-authored-by: velvet-shark <[email protected]>
Co-authored-by: Val Alexander <[email protected]>
Co-authored-by: Hung-Che Lo <[email protected]>
Co-authored-by: Frank Yang <[email protected]>
Co-authored-by: git-jxj <[email protected]>
Co-authored-by: git-jxj <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Co-authored-by: Josh Lehman <[email protected]>
Co-authored-by: jalehman <[email protected]>
Co-authored-by: Jaewon Hwang <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: sparkyrider <[email protected]>
Co-authored-by: sparkyrider <[email protected]>
Co-authored-by: Sayr Wolfridge <[email protected]>
Co-authored-by: SayrWolfridge <[email protected]>
Co-authored-by: Clayton Shaw <[email protected]>
Co-authored-by: claw-sylphx <[email protected]>
Co-authored-by: Tak <[email protected]>
Co-authored-by: lishuaigit <[email protected]>
Co-authored-by: lishuaigit <[email protected]>
Co-authored-by: Keshav Rao <[email protected]>
Co-authored-by: keshav55 <[email protected]>
analysoor-assistant pushed a commit to analysoor-assistant/openclaw that referenced this pull request Mar 19, 2026
Maple778 added a commit to Maple778/openclaw that referenced this pull request Mar 19, 2026
…ndler

Pattern from PR openclaw#47719

The revised proposal addresses the 'missing recovery handler' bug by extracting the restart logic into a robust helper function `triggerGatewayRestart`. This helper is consumed by both the deferred-restart timeout path (fixing the race condition in `onTimeout`) and the immediate-restart path (ensuring consistent behavior). I have addressed the specific sequencing error by ensuring the abort signal is processed only after the gateway decides to restart (or immediately, in the immediate path), and I have replaced the formatted string usage with the raw session state objects to enable precise recovery. A safety check is added for the recovery handler to prevent runtime errors if the dependency is missing.
dustin-olenslager pushed a commit to dustin-olenslager/ironclaw-supreme that referenced this pull request Mar 24, 2026
alexey-pelykh pushed a commit to remoteclaw/remoteclaw that referenced this pull request Mar 25, 2026
(cherry picked from commit 98f6ec5)
sbezludny pushed a commit to sbezludny/openclaw that referenced this pull request Mar 27, 2026
Labels

agents (Agent runtime and tooling), cli (CLI command changes), gateway (Gateway runtime), size: L

Development

Successfully merging this pull request may close these issues.

[Bug] SIGUSR1 config reload aborts in-flight subagent LLM calls, leaving sessions orphaned with no retry
