Skip to content

Zalo: fix provider lifecycle restarts#39892

Merged
darkamenosa merged 7 commits intomainfrom
tuyenhx/fix-zalo-provider-lifecycle
Mar 8, 2026
Merged

Zalo: fix provider lifecycle restarts#39892
darkamenosa merged 7 commits intomainfrom
tuyenhx/fix-zalo-provider-lifecycle

Conversation

@darkamenosa
Copy link
Copy Markdown
Member

Summary

  • Problem: monitorZaloProvider returned immediately with a { stop } object instead of blocking until abort, causing the gateway lifecycle manager to restart the provider in a tight loop.
  • Why it matters: Rapid restarts flood logs, burn API quota, and prevent the Zalo channel from functioning reliably.
  • What changed:
    • monitorZaloProvider now returns Promise<void> and blocks via waitForAbort(abortSignal) until the abort signal fires.
    • Polling mode checks getWebhookInfo before calling deleteWebhook (avoids unnecessary API calls; gracefully handles 404).
    • Added sendChatAction API support and wired typingCallbacks into the reply dispatcher for typing indicators.
    • Webhook/API return types corrected (ZaloWebhookInfo instead of boolean).
    • Structured lifecycle logging (mode=, path=, target=, stack traces on errors).
    • Proper finally block ensures cleanup and "provider stopped" log.
  • What did NOT change (scope boundary): Webhook mode registration logic, message processing pipeline, pairing/auth flows.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor

Scope (select all touched areas)

  • Gateway / orchestration
  • Integrations
  • API / contracts

Linked Issue/PR

  • N/A

User-visible / Behavior Changes

  • Zalo provider no longer restarts in a loop; stays alive until the gateway signals abort.
  • Zalo bot now shows "typing..." indicator while generating replies.
  • Polling startup only deletes webhook if one is actually configured (fewer unnecessary API calls).

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? Yes — sendChatAction (typing indicator) and getWebhookInfo (pre-polling check)
  • Command/tool execution surface changed? No
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: Both are standard Zalo Bot API calls using the existing token; no new credentials or elevated permissions.

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node 22+ / Bun
  • Integration/channel: Zalo

Steps

  1. Configure a Zalo bot account in polling mode.
  2. Start the gateway.
  3. Observe logs — provider should start once and block.

Expected

  • Single Zalo provider init mode=polling log entry; provider stays alive.

Actual

  • Before fix: rapid Zalo provider stopped → restart loop.

Evidence

  • Failing test/log before + passing after
  • New lifecycle tests in extensions/zalo/src/monitor.lifecycle.test.ts covering: polling blocks until abort, existing webhook cleanup, 404 graceful handling.
  • New API tests in extensions/zalo/src/api.test.ts verifying POST method for webhook endpoints.

Human Verification (required)

  • Verified scenarios: Lifecycle test suite passes; polling mode blocks correctly; webhook cleanup conditional on existing webhook.
  • Edge cases checked: getWebhookInfo 404, empty/whitespace webhook URL, already-aborted signal.
  • What you did not verify: Live Zalo environment end-to-end.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: Revert the two commits on this branch.
  • Files/config to restore: extensions/zalo/src/monitor.ts, extensions/zalo/src/api.ts, src/plugin-sdk/zalo.ts
  • Known bad symptoms reviewers should watch for: Provider not starting at all (startup exception), or still looping (lifecycle await not working).

Risks and Mitigations

  • Risk: getWebhookInfo may behave differently on untested Zalo API environments.
    • Mitigation: 404 is handled gracefully; other errors fall through to the existing catch-all that logs and continues polling.

@openclaw-barnacle openclaw-barnacle bot added channel: zalo Channel integration: zalo size: M maintainer Maintainer-authored PR labels Mar 8, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 8, 2026

Greptile Summary

This PR fixes the Zalo provider lifecycle restart loop. Previously, monitorZaloProvider returned an immediate handle object instead of blocking, causing the gateway lifecycle manager to restart the provider in a tight loop. The function now returns Promise<void> and suspends via waitForAbort until the abort signal fires.

Additional changes included:

  • Polling startup now checks getWebhookInfo before calling deleteWebhook, avoiding unnecessary API calls and handling 404 gracefully
  • sendChatAction API support added and wired into the reply dispatcher for typing indicators
  • Webhook and getWebhookInfo return types corrected from boolean to ZaloWebhookInfo
  • Structured lifecycle logging and a finally block ensuring consistent cleanup

Key observations from review:

  • The core waitForAbort blocking pattern is correct and handles the already-aborted edge case
  • In webhook mode, the deleteWebhook call on abort is fire-and-forget — monitorZaloProvider resolves before the Zalo server-side webhook is confirmed removed; see inline comment for details
  • Lifecycle tests use consecutive await Promise.resolve() ticks to advance the async state machine, which is fragile if the internal async structure gains additional awaits; see inline comment on the test file
  • Webhook mode lifecycle (start, abort, cleanup) has no test coverage — only polling is tested
  • Typing callbacks, nested 404-tolerant try-catch, and structured logging are all cleanly implemented

Confidence Score: 4/5

  • Safe to merge — core lifecycle bug is correctly fixed; remaining concerns are minor cleanup and test robustness issues.
  • The primary bug (tight restart loop) is fixed with a correct and well-tested waitForAbort approach. The new features (typing indicators, conditional webhook cleanup, structured logging) are clean. Score is 4 rather than 5 because: (1) webhook-mode cleanup is fire-and-forget with no test coverage, and (2) the lifecycle test timing is fragile and coupled to internal microtask counts.
  • extensions/zalo/src/monitor.ts (webhook abort cleanup) and extensions/zalo/src/monitor.lifecycle.test.ts (timing fragility, missing webhook mode tests).

Last reviewed commit: 608133b

Comment on lines +730 to +747
abortSignal.addEventListener(
"abort",
() => {
runtime.log?.(`[${account.accountId}] Zalo abort received; deleting webhook`);
void deleteWebhook(token, fetcher)
.then(() => {
runtime.log?.(`[${account.accountId}] Zalo webhook deleted`);
})
.catch((err) => {
runtime.error?.(
`[${account.accountId}] Zalo webhook delete failed: ${formatZaloError(err)}`,
);
});
},
{ once: true },
);
await waitForAbort(abortSignal);
return;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Webhook cleanup resolves before deletion completes

deleteWebhook is fired-and-forgotten (void) inside the abort listener, then waitForAbort resolves immediately when the signal fires. This means monitorZaloProvider returns its Promise<void> — signalling to the lifecycle manager that the provider has fully stopped — while the deleteWebhook API call is still in flight on Zalo's servers.

In practice the risk is low (re-calling setWebhook overwrites the old registration), but if the lifecycle manager restarts the provider before the deletion finishes there is a brief window where Zalo can deliver events to a webhook route that no longer has a handler (the local unregister() already ran in finally). Consider awaiting the deletion before waitForAbort resolves, or at minimum documenting the intentional fire-and-forget in a comment so future readers don't try to "fix" it:

abortSignal.addEventListener(
  "abort",
  () => {
    runtime.log?.(`[${account.accountId}] Zalo abort received; deleting webhook`);
    // Intentionally fire-and-forget: we do not block shutdown on webhook cleanup.
    void deleteWebhook(token, fetcher)
      .then(() => { ... })
      .catch((err) => { ... });
  },
  { once: true },
);
await waitForAbort(abortSignal);
Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/zalo/src/monitor.ts
Line: 730-747

Comment:
**Webhook cleanup resolves before deletion completes**

`deleteWebhook` is fired-and-forgotten (`void`) inside the abort listener, then `waitForAbort` resolves immediately when the signal fires. This means `monitorZaloProvider` returns its `Promise<void>` — signalling to the lifecycle manager that the provider has fully stopped — while the `deleteWebhook` API call is still in flight on Zalo's servers.

In practice the risk is low (re-calling `setWebhook` overwrites the old registration), but if the lifecycle manager restarts the provider before the deletion finishes there is a brief window where Zalo can deliver events to a webhook route that no longer has a handler (the local `unregister()` already ran in `finally`). Consider awaiting the deletion before `waitForAbort` resolves, or at minimum documenting the intentional fire-and-forget in a comment so future readers don't try to "fix" it:

```ts
abortSignal.addEventListener(
  "abort",
  () => {
    runtime.log?.(`[${account.accountId}] Zalo abort received; deleting webhook`);
    // Intentionally fire-and-forget: we do not block shutdown on webhook cleanup.
    void deleteWebhook(token, fetcher)
      .then(() => { ... })
      .catch((err) => { ... });
  },
  { once: true },
);
await waitForAbort(abortSignal);
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +56 to +63
await Promise.resolve();
await Promise.resolve();

expect(getWebhookInfoMock).toHaveBeenCalledTimes(1);
expect(deleteWebhookMock).not.toHaveBeenCalled();
expect(getUpdatesMock).toHaveBeenCalledTimes(1);
expect(settled).toBe(false);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fragile microtask-count assumption

The two consecutive await Promise.resolve() calls work today because there is exactly one await between monitorZaloProvider startup and the call to startPollingLoop (namely await getWebhookInfo(...)). If a future change adds another await before the polling loop begins (e.g. for logging, a config fetch, etc.), these assertions will fire before getUpdatesMock has been called, producing a confusing false-negative.

A more resilient approach is to wait for a specific observable side-effect rather than counting microtask ticks — for example, spinning on getUpdatesMock being called or using a manually resolved deferred:

// Give the internal machinery time to reach the poll loop
await vi.waitFor(() => expect(getUpdatesMock).toHaveBeenCalledTimes(1));

This pattern is also more self-documenting about why we're yielding.

Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/zalo/src/monitor.lifecycle.test.ts
Line: 56-63

Comment:
**Fragile microtask-count assumption**

The two consecutive `await Promise.resolve()` calls work today because there is exactly one `await` between `monitorZaloProvider` startup and the call to `startPollingLoop` (namely `await getWebhookInfo(...)`). If a future change adds another `await` before the polling loop begins (e.g. for logging, a config fetch, etc.), these assertions will fire before `getUpdatesMock` has been called, producing a confusing false-negative.

A more resilient approach is to wait for a specific observable side-effect rather than counting microtask ticks — for example, spinning on `getUpdatesMock` being called or using a manually resolved deferred:

```ts
// Give the internal machinery time to reach the poll loop
await vi.waitFor(() => expect(getUpdatesMock).toHaveBeenCalledTimes(1));
```

This pattern is also more self-documenting about *why* we're yielding.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 608133b425

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +766 to +770
if (err instanceof ZaloApiError && err.errorCode === 404) {
// Some Zalo environments do not expose webhook inspection for polling bots.
runtime.log?.(
`[${account.accountId}] Zalo polling mode webhook inspection unavailable; continuing without webhook cleanup`,
);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Fall back to delete webhook when inspection fails

In polling mode, a 404 from getWebhookInfo now skips cleanup entirely and proceeds to long-polling; this is a regression from the previous unconditional deleteWebhook attempt. If an account was previously in webhook mode and webhook inspection is unavailable (or intermittently failing), the old webhook can remain active and polling may not receive updates, leaving the channel effectively offline until manual cleanup. Please attempt deleteWebhook as a fallback when inspection cannot confirm state.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong review.

Copy link
Copy Markdown
Contributor

@gambletan gambletan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Overall solid fix for the provider restart loop. A few suggestions:

  1. Duplicated waitForAbort utilityextensions/zalo/src/monitor.ts introduces a local waitForAbort that's nearly identical to waitForAbortSignal in src/infra/abort-signal.ts. Other channel plugins (Telegram, Line, Nextcloud Talk) already reuse the one from src/infra/abort-signal.ts. Consider using that instead of duplicating.

  2. startAccount return type change — Changed from Promise<ZaloMonitorResult> (with { stop }) to Promise<void>. Worth verifying this is compatible with the ChannelPlugin lifecycle contract — the gateway can no longer explicitly call stop(), only signal via abort.

  3. Nested try/catch readability — Three levels of try/catch in the polling branch could be flattened by extracting webhook cleanup into a named helper function.

  4. Test timing fragilityawait Promise.resolve(); await Promise.resolve(); to settle microtasks is fragile. Consider vi.waitFor() or a polling helper.

Minor: blank line removal in deliverZaloReply (line 628) is cosmetic-only and unrelated to the PR scope.

Core fix (blocking via waitForAbort + proper finally cleanup) is correct. Typing indicator support follows established createTypingCallbacks pattern. Good structured logging improvements.

@darkamenosa
Copy link
Copy Markdown
Member Author

@gambletan I'm fixing

@darkamenosa
Copy link
Copy Markdown
Member Author

@gambletan Here is the problem with the old startAccount:

  • Old model: startAccount() meant "startup completed" and returned { stop }.
  • Gateway interpretation: "promise settled = channel exited".
  • Result: immediate or repeated auto-restarts.
  • Current model: startAccount() means "channel is alive until abort".
  • Result: one task per account, no false exit, and cleanup can finish before the gateway considers the channel stopped.

I just added test to verify compatible.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 86617c2083

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

);
throw err;
} finally {
await cleanupWebhook?.();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid blocking shutdown on webhook cleanup

In webhook mode, monitorZaloProvider now awaits cleanupWebhook in finally, and that cleanup performs deleteWebhook(token, fetcher) without a request timeout. If the Zalo API call hangs during network issues, the provider promise never resolves, so stopChannel (which awaits the account task) can hang indefinitely and prevent clean shutdown/restart of that account. This should be best-effort (or bounded by a timeout) rather than a hard await on teardown.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f70b8294e7

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

});
const typingCallbacks = createTypingCallbacks({
start: async () => {
await sendChatAction(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid blocking reply dispatch on typing API call

typingCallbacks.start awaits sendChatAction directly, but this API call has no timeout (sendChatAction -> callZaloApi without timeoutMs). dispatchReplyWithBufferedBlockDispatcher awaits onReplyStart, so if the Zalo API hangs (network stall/proxy issue), the actual reply delivery path is blocked and users stop receiving responses even though typing indicators are best-effort. Please bound this call (or make it fire-and-forget) so typing failures cannot stall message replies.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fb55263fb6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

resolveWebhookPath,
warnMissingProviderGroupPolicyFallbackOnce,
} from "openclaw/plugin-sdk/zalo";
import { waitForAbortSignal } from "../../../src/infra/abort-signal.js";
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Replace root-relative abort import with plugin-safe path

monitor.ts now imports waitForAbortSignal from ../../../src/infra/abort-signal.js, which only works in the monorepo layout; when @openclaw/zalo is installed as an npm plugin, that relative path resolves outside the package and the module cannot be loaded, so provider startup fails before monitoring begins. I checked the plugin loader wiring (src/plugins/loader.ts) and it only aliases openclaw/plugin-sdk*, so this deep relative import is not remapped in plugin runtime environments.

Useful? React with 👍 / 👎.

@darkamenosa darkamenosa merged commit 67b2e81 into main Mar 8, 2026
28 of 30 checks passed
@darkamenosa darkamenosa deleted the tuyenhx/fix-zalo-provider-lifecycle branch March 8, 2026 15:33
Saitop pushed a commit to NomiciAI/openclaw that referenced this pull request Mar 8, 2026
* Zalo: fix provider lifecycle restarts

* Zalo: add typing indicators, smart webhook cleanup, and API type fixes

* fix review

* add allow list test secrect

* Zalo: bound webhook cleanup during shutdown

* Zalo: bound typing chat action timeout

* Zalo: use plugin-safe abort helper import
GordonSH-oss pushed a commit to GordonSH-oss/openclaw that referenced this pull request Mar 9, 2026
* Zalo: fix provider lifecycle restarts

* Zalo: add typing indicators, smart webhook cleanup, and API type fixes

* fix review

* add allow list test secrect

* Zalo: bound webhook cleanup during shutdown

* Zalo: bound typing chat action timeout

* Zalo: use plugin-safe abort helper import
jenawant pushed a commit to jenawant/openclaw that referenced this pull request Mar 10, 2026
* Zalo: fix provider lifecycle restarts

* Zalo: add typing indicators, smart webhook cleanup, and API type fixes

* fix review

* add allow list test secrect

* Zalo: bound webhook cleanup during shutdown

* Zalo: bound typing chat action timeout

* Zalo: use plugin-safe abort helper import
sauerdaniel pushed a commit to sauerdaniel/openclaw that referenced this pull request Mar 11, 2026
* Zalo: fix provider lifecycle restarts

* Zalo: add typing indicators, smart webhook cleanup, and API type fixes

* fix review

* add allow list test secrect

* Zalo: bound webhook cleanup during shutdown

* Zalo: bound typing chat action timeout

* Zalo: use plugin-safe abort helper import
Moshiii pushed a commit to Moshiii/openclaw that referenced this pull request Mar 11, 2026
* Zalo: fix provider lifecycle restarts

* Zalo: add typing indicators, smart webhook cleanup, and API type fixes

* fix review

* add allow list test secrect

* Zalo: bound webhook cleanup during shutdown

* Zalo: bound typing chat action timeout

* Zalo: use plugin-safe abort helper import
Moshiii pushed a commit to Moshiii/openclaw that referenced this pull request Mar 11, 2026
* Zalo: fix provider lifecycle restarts

* Zalo: add typing indicators, smart webhook cleanup, and API type fixes

* fix review

* add allow list test secrect

* Zalo: bound webhook cleanup during shutdown

* Zalo: bound typing chat action timeout

* Zalo: use plugin-safe abort helper import
dhoman pushed a commit to dhoman/chrono-claw that referenced this pull request Mar 11, 2026
* Zalo: fix provider lifecycle restarts

* Zalo: add typing indicators, smart webhook cleanup, and API type fixes

* fix review

* add allow list test secrect

* Zalo: bound webhook cleanup during shutdown

* Zalo: bound typing chat action timeout

* Zalo: use plugin-safe abort helper import
senw-developers pushed a commit to senw-developers/va-openclaw that referenced this pull request Mar 17, 2026
* Zalo: fix provider lifecycle restarts

* Zalo: add typing indicators, smart webhook cleanup, and API type fixes

* fix review

* add allow list test secrect

* Zalo: bound webhook cleanup during shutdown

* Zalo: bound typing chat action timeout

* Zalo: use plugin-safe abort helper import
V-Gutierrez pushed a commit to V-Gutierrez/openclaw-vendor that referenced this pull request Mar 17, 2026
* Zalo: fix provider lifecycle restarts

* Zalo: add typing indicators, smart webhook cleanup, and API type fixes

* fix review

* add allow list test secrect

* Zalo: bound webhook cleanup during shutdown

* Zalo: bound typing chat action timeout

* Zalo: use plugin-safe abort helper import
alexey-pelykh pushed a commit to remoteclaw/remoteclaw that referenced this pull request Mar 22, 2026
* Zalo: fix provider lifecycle restarts

* Zalo: add typing indicators, smart webhook cleanup, and API type fixes

* fix review

* add allow list test secrect

* Zalo: bound webhook cleanup during shutdown

* Zalo: bound typing chat action timeout

* Zalo: use plugin-safe abort helper import

(cherry picked from commit 67b2e81)
alexey-pelykh pushed a commit to remoteclaw/remoteclaw that referenced this pull request Mar 22, 2026
* Zalo: fix provider lifecycle restarts

* Zalo: add typing indicators, smart webhook cleanup, and API type fixes

* fix review

* add allow list test secrect

* Zalo: bound webhook cleanup during shutdown

* Zalo: bound typing chat action timeout

* Zalo: use plugin-safe abort helper import

(cherry picked from commit 67b2e81)
0x666c6f added a commit to 0x666c6f/openclaw that referenced this pull request Mar 26, 2026
* refactor: share Apple talk config parsing

* refactor: add canonical talk config payload

* refactor: centralize talk silence timeout defaults

* test: cover invalid talk config inputs

* test: decouple ios talk parsing coverage

* fix: resolve live config paths in status and gateway metadata (openclaw#39952)

* fix: resolve live config paths in status and gateway metadata

* fix: resolve remaining runtime config path references

* test: cover gateway config.set config path response

* fix(web-search): restore OpenRouter compatibility for Perplexity (openclaw#39937) (openclaw#39937)

* Zalo: fix provider lifecycle restarts (openclaw#39892)

* Zalo: fix provider lifecycle restarts

* Zalo: add typing indicators, smart webhook cleanup, and API type fixes

* fix review

* add allow list test secrect

* Zalo: bound webhook cleanup during shutdown

* Zalo: bound typing chat action timeout

* Zalo: use plugin-safe abort helper import

* fix(plugins): ship Feishu bundled runtime dependency (openclaw#39990)

* fix: ship feishu bundled runtime dependency

* test: align feishu bundled dependency specs

* fix(hooks): use resolveAgentIdFromSessionKey in runBeforeReset  (openclaw#39875)

Merged via squash.

Prepared head SHA: 00a2b24
Co-authored-by: rbutera <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf

* CLI: include commit hash in --version output (openclaw#39712)

* CLI: include commit hash in --version output

* fix(version): harden commit SHA resolution and keep output consistent

* CLI: keep install checks compatible with commit-tagged version output

* fix(cli): include commit hash in root version fast path

* test(cli): allow null commit-hash mocks

* Installer: share version parser across install scripts

* Installer: avoid sourcing helpers from stdin cwd

* CLI: note commit-tagged version output

* CLI: anchor commit hash resolution to module root

* CLI: harden commit hash resolution

* CLI: fix commit hash lookup edge cases

* CLI: prefer live git metadata in dev builds

* CLI: keep git lookup inside package root

* Infra: tolerate invalid moduleUrl hints

* CLI: cache baked commit metadata fallbacks

* CLI: align changelog attribution with prep gate

* CLI: restore changelog contributor credit

---------

Co-authored-by: echoVic <[email protected]>
Co-authored-by: echoVic <[email protected]>

* fix: fail closed talk provider selection

* fix: align talk config secret schemas

* refactor: require canonical talk resolved payload

* test: add talk config contract fixtures

* refactor: split talk gateway config loaders

* refactor: avoid checkout during prep head verification

* refactor: dedupe prep branch push flow

* fix: treat model api drift as baseUrl refresh

* refactor: extract pure models config merge helpers

* refactor: expand provider capability registry

* refactor: reuse one models.json read per write

* fix: publish models.json atomically

* refactor: scope prep push results to env artifacts

* refactor: fold implicit provider injection into resolver

* fix(telegram): use message previews in DMs

* CI: scope CodeQL JavaScript analysis

* feat: allow compaction model override via config (openclaw#38753)

Merged via squash.

Prepared head SHA: a3d6d6c
Co-authored-by: starbuck100 <[email protected]>
Co-authored-by: jalehman <[email protected]>
Reviewed-by: @jalehman

* fix(sessions): clear stale contextTokens on model switch (openclaw#38044)

Merged via squash.

Prepared head SHA: bac2df4
Co-authored-by: yuweuii <[email protected]>
Co-authored-by: jalehman <[email protected]>
Reviewed-by: @jalehman

* fix: prefer bundled channel plugins over npm duplicates (openclaw#40094)

* fix: prefer bundled channel plugins over npm duplicates

* fix: tighten bundled plugin review follow-ups

* fix: address check gate follow-ups

* docs: add changelog for bundled plugin install fix

* fix: align lifecycle test formatting with CI oxfmt

* docs: update Brave Search API docs for Feb 2026 plan restructuring (openclaw#40111)

Merged via squash.

Prepared head SHA: c651f07
Co-authored-by: remusao <[email protected]>
Co-authored-by: gumadeiras <[email protected]>
Reviewed-by: @gumadeiras

* Add too-many-prs override label handling

* CI: satisfy provider merge fixture typing

* Tests: reduce web search secret-scan noise

* Web search: rename Perplexity auth source helper

* Docs: use placeholder OpenRouter key in Perplexity guide

* Docs: use placeholder OpenRouter key in web tool docs

* Fixtures: normalize talk config API key placeholder

* Tests: lower entropy git commit fixtures

* Chore: widen xxxxx detect-secrets allowlist

* Chore: refresh detect-secrets baseline

* Web search: allowlist Perplexity auth source type name

* Chore: refresh detect-secrets baseline after docs line changes

* Chore: refresh detect-secrets baseline after final scan

* Chore: refresh detect-secrets baseline for Feishu docs

* CLI: set local gateway mode in setup

* Tests: format daemon lifecycle CLI coverage

* refactor: use model compat for anthropic tool payload normalization

* refactor: move bundled extension gap allowlists into manifests

* refactor: thread config runtime env through models config

* refactor: split models registry loading from persistence

* refactor: centralize transcript provider quirks

* refactor: decompose implicit provider resolution

* refactor: extract openai stream wrappers

* refactor: validate bundled extension release metadata

* refactor: extract static provider builders

* refactor: extract provider stream wrappers

* refactor: extract bundled extension manifest parser

* test: standardize hermetic provider env snapshots

* test: add implicit provider matrix coverage

* fix(acp): persist spawned child session history (openclaw#40137)

Merged via squash.

Prepared head SHA: 62de5d5
Co-authored-by: mbelinky <[email protected]>
Co-authored-by: mbelinky <[email protected]>
Reviewed-by: @mbelinky

* fix: require talk resolved payload

* refactor: dedupe android talk config parsing

* test: expand talk config contract fixtures

* refactor: split android talk voice resolution

* test: isolate plugin loader from mocked module cache

* test: isolate legacy plugin-sdk root import check

* test: isolate git commit resolution fallbacks

* refactor: simplify plugin sdk compatibility aliases

* refactor: centralize acp session resolution guards

* refactor: neutralize context engine runtime bridge

* refactor: split doctor config analysis helpers

* refactor: extract ios watch reply coordinator

* fix: restore acp session meta narrowing

* refactor: extract qmd process runner

* fix: restore gate after rebase

* docs: add Browserbase as hosted remote CDP option

Add Browserbase documentation section alongside the existing Browserless
section in the browser docs. Includes signup instructions, CDP connection
configuration, and environment variable setup for both English and Chinese
(zh-CN) translations.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* Revert "docs: add Browserbase as hosted remote CDP option"

This reverts commit c469657.

* docs: add Browserbase as hosted remote CDP option

Add Browserbase documentation section alongside the existing Browserless
section in the browser docs. Includes signup instructions, CDP connection
configuration, and environment variable setup.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* docs: fix duplicate heading lint error

Rename "Configuration" sub-heading to "Profile setup" to avoid
MD024/no-duplicate-heading conflict with the existing top-level
"Configuration" heading.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* docs: fix Browserbase section to match official docs

Browserbase requires creating a session via their API to get a CDP
connect URL, unlike Browserless which uses a static endpoint. Updated
to show the correct curl-based session creation flow, removed
unverified static WebSocket URL, and added the 5-minute connect
timeout note from official docs.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* docs: restore direct wss://connect.browserbase.com URL

Browserbase exposes a direct WebSocket connect endpoint that
auto-creates a session, similar to how Browserless works. Simplified
the section to use this static URL pattern instead of requiring
manual session creation via the API.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* docs: fact-check Browserbase section against official docs

- Fix CAPTCHA/stealth/proxy claims: these are Developer plan+ only,
  not available on free tier
- Fix free tier limits: 1 browser hour, 15-min session duration
  (not "60 minutes of monthly usage")
- Add link to pricing page for paid plan details
- Simplify structure to match Browserless section format
- Remove sub-headings to match Browserless section style

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* docs: simplify Browserbase section, drop pricing details

Restore platform-level feature description (CAPTCHA solving, stealth
mode, proxies) without plan-specific pricing gating. Keep free tier
note brief.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* feat(browser): support direct WebSocket CDP URLs for Browserbase

Browserbase uses direct WebSocket connections (wss://) rather than the
standard HTTP-based /json/version CDP discovery flow used by Browserless.
This change teaches the browser tool to accept ws:// and wss:// URLs as
cdpUrl values: when a WebSocket URL is detected, OpenClaw connects
directly instead of attempting HTTP discovery.

Changes:
- config.ts: accept ws:// and wss:// in cdpUrl validation
- cdp.helpers.ts: add isWebSocketUrl() helper
- cdp.ts: skip /json/version when cdpUrl is already a WebSocket URL
- chrome.ts: probe WSS endpoints via WebSocket handshake instead of HTTP
- cdp.test.ts: add test for direct WebSocket target creation
- docs/tools/browser.md: update Browserbase section with correct URL
  format and notes

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* test+docs: comprehensive coverage and generic framing

- Add 12 new tests covering: isWebSocketUrl detection, parseHttpUrl WSS
  acceptance/rejection, direct WS target creation with query params,
  SSRF enforcement on WS URLs, WS reachability probing bypasses HTTP
- Reframe docs section as generic "Direct WebSocket CDP providers" with
  Browserbase as one example — any WSS-based provider works
- Update security tips to mention WSS alongside HTTPS

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(browser): update existing tests for ws/wss protocol support

Two pre-existing tests still expected ws:// URLs to be rejected by
parseHttpUrl, which now accepts them. Switch the invalid-protocol
fixture to ftp:// and tighten the assertion to match the full
"must be http(s) or ws(s)" error message.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(browser): preserve wss:// cdpUrl in legacy default profile resolution

* chore: remove vendor-specific references from code comments

* style(browser): fix oxfmt formatting in config.ts

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: preserve loopback ws cdp tab ops (openclaw#31085) (thanks @shrey150)

* fix: share context engine registry across bundled chunks (openclaw#40115)

Merged via squash.

Prepared head SHA: 6af4820
Co-authored-by: jalehman <[email protected]>
Co-authored-by: jalehman <[email protected]>
Reviewed-by: @jalehman

* fix(browser): rewrite 0.0.0.0 and [::] wildcard addresses in CDP WebSocket URLs

Containerized browsers (e.g. browserless in Docker) report
`ws://0.0.0.0:<internal-port>` in their `/json/version` response.
`normalizeCdpWsUrl` rewrites loopback WS hosts to the external
CDP host:port, but `0.0.0.0` and `[::]` were not treated as
addresses needing rewriting, causing OpenClaw to try connecting
to `ws://0.0.0.0:3000` literally — which always fails.

Fixes openclaw#17752

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: normalize wildcard remote CDP websocket URLs (openclaw#17760) (thanks @joeharouni)

* fix(browser): wait for extension tabs after relay drop (openclaw#32331)

* fix: wait for extension relay tab reconnects (openclaw#32461) (thanks @AaronWander)

* fix(infra): make browser relay bind address configurable

Add browser.relayBindHost config option so the Chrome extension relay
server can bind to a non-loopback address (e.g. 0.0.0.0 for WSL2).
Defaults to 127.0.0.1 when unset, preserving current behavior.

Closes openclaw#39214

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(browser): add IP validation, fix upgrade handler for non-loopback bind

- Zod schema: validate relayBindHost with ipv4/ipv6 instead of bare string
- Upgrade handler: allow non-loopback connections when bindHost is explicitly
  non-loopback (e.g. 0.0.0.0 for WSL2), keeping loopback-only default
- Test: verify actual bind address via relay.bindHost instead of just checking
  reachability on 127.0.0.1 which passes regardless
- Expose bindHost on ChromeExtensionRelayServer type for inspection

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: make browser relay bind address configurable (openclaw#39364) (thanks @mvanhorn)

* docs: add WSL2 + Windows remote Chrome CDP troubleshooting (openclaw#39407) (thanks @Owlock)

* macos: add remote gateway token field for remote mode

* macos: clarify remote token placeholder text

* macos: add mode-toggle remote token sync coverage

* tests: document remote token persistence across mode toggle

* fix(macos): preserve unsupported remote gateway tokens

* docs(changelog): credit macos remote token author

* fix(macos): improve tailscale gateway discovery (openclaw#40167)

Sanitized test tailnet hostnames and re-ran the targeted macOS gateway discovery test suite before merge.

* refactor(browser): scope CDP sessions and harden stale target recovery

* feat: add local backup CLI (openclaw#40163)

Merged via squash.

Prepared head SHA: ed46625
Co-authored-by: shichangs <[email protected]>
Co-authored-by: gumadeiras <[email protected]>
Reviewed-by: @gumadeiras

* fix(ci): scope secrets scan to branch changes

* fix(ci): refresh detect-secrets baseline

* fix: harden backup verify path validation

* fix(plugin-sdk): lazily load legacy root alias

* fix(setup-podman): cd to TMPDIR before podman load to avoid cwd permission error (openclaw#39435)

* fix(setup-podman): cd to TMPDIR before podman load to avoid inherited cwd permission error

* fix(podman): safe cwd in run_as_user to prevent chdir errors

Co-Authored-By: Claude Opus 4.6  <[email protected]>
Signed-off-by: sallyom <[email protected]>

---------

Signed-off-by: sallyom <[email protected]>
Co-authored-by: sallyom <[email protected]>
Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

* fix(cron): consolidate announce delivery, fire-and-forget trigger, and minimal prompt mode (openclaw#40204)

* fix(cron): consolidate announce delivery and detach manual runs

* fix: queue detached cron runs (openclaw#40204)

* Gateway/iOS: replay queued foreground actions safely after resume (openclaw#40281)

Merged via squash.

- Local validation: `pnpm exec vitest run --config vitest.gateway.config.ts src/gateway/server-methods/nodes.invoke-wake.test.ts`
- Local validation: `pnpm build`
- mb-server validation: `pnpm exec vitest run --config vitest.gateway.config.ts src/gateway/server-methods/nodes.invoke-wake.test.ts`
- mb-server validation: `pnpm build`
- mb-server validation: `pnpm protocol:check`

* iOS: auto-load the scoped gateway canvas with safe fallback (openclaw#40282)

Merged via squash.

- mb-server validation: `swift test --package-path apps/shared/OpenClawKit --filter GatewayNodeSessionTests`
- mb-server validation: `pnpm build`
- Scope note: top-level `RootTabs` shell change was intentionally removed from this PR before merge

* Update CHANGELOG.md

* Update CHANGELOG.md

* fix(run-openclaw-podman): add SELinux :Z mount option on enforcing/permissive hosts (openclaw#39449)

* fix(run-openclaw-podman): add SELinux :Z mount option on Linux with enforcing/permissive SELinux

* fix(quadlet): add SELinux :Z label to openclaw.container.in volume mount

* fix(podman): add SELinux :Z mount option for Fedora/RHEL hosts

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: sallyom <[email protected]>

---------

Signed-off-by: sallyom <[email protected]>
Co-authored-by: sallyom <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>

* Docker: trim runtime image payload (openclaw#40307)

* Docker: shrink runtime image payload

* Docker: add runtime pnpm opt-in

* Docker: collapse helper entrypoint chmod layers

* Docker: restore bundled pnpm runtime

* Update CHANGELOG.md

* docs(changelog): move post-2026.3.8 entries to unreleased (openclaw#40342)

* docs(changelog): move post-2026.3.8 entries to unreleased

* Update CHANGELOG.md

* fix(tui): improve color contrast for light-background terminals (openclaw#40345)

* fix(tui): improve colour contrast for light-background terminals (openclaw#38636)

Detect light terminal backgrounds via COLORFGBG and apply a WCAG
AA-compliant light palette. Adds OPENCLAW_THEME=light|dark env var
override for terminals without auto-detection.

Uses proper sRGB linearisation and WCAG 2.1 contrast ratios to pick
whichever text palette (dark or light) has higher contrast against
the detected background colour.

Co-authored-by: ademczuk <[email protected]>

* Update CHANGELOG.md

---------

Co-authored-by: ademczuk <[email protected]>
Co-authored-by: ademczuk <[email protected]>

* fix(models): keep --all aligned with synthetic catalog rows

* test(models): refresh list assertions after main sync

* fix: normalize openai-codex gpt-5.4 transport overrides

* docs: add refactor cluster backlog

* refactor: dedupe plugin runtime stores

* refactor: share gateway argv parsing

* refactor: extract gateway port diagnostics helper

* refactor: reuse broadcast route key construction

* refactor: share multi-account config schema fragments

* test: dedupe brave llm-context rejection cases

* refactor: share channel config adapter base

* fix(agents): bootstrap runtime plugins before context-engine resolution

* docs(changelog): remove rebase marker

* refactor: harden browser relay CDP flows

* fix(config): refresh runtime snapshot from disk after write. Fixes openclaw#37175 (openclaw#37313)

Merged via squash.

Prepared head SHA: 69e1861
Co-authored-by: bbblending <[email protected]>
Co-authored-by: gumadeiras <[email protected]>
Reviewed-by: @gumadeiras

* refactor: harden browser runtime profile handling

* refactor(models): extract list row builders

* refactor(agents): extract provider model normalization

* refactor(models): split models.json planning from writes

* refactor(models): split provider discovery helpers

* chore(docs): drop refactor cleanup tracker

* gateway: fix global Control UI 404s for symlinked wrappers and bundled package roots (openclaw#40385)

Merged via squash.

Prepared head SHA: 567b3ed
Co-authored-by: velvet-shark <[email protected]>
Co-authored-by: velvet-shark <[email protected]>
Reviewed-by: @velvet-shark

* Docker: improve build cache reuse (openclaw#40351)

* Docker: improve build cache reuse

* Tests: cover Docker build cache layout

* Docker: fix sandbox cache mount continuations

* Docker: document qr-import manifest scope

* Docker: narrow e2e install inputs

* CI: cache Docker builds in workflows

* CI: route sandbox smoke through setup script

* CI: keep sandbox smoke on script path

* fix(tests): correct security check failure

* docs(changelog): correct Control UI contributor credit (openclaw#40420)

Merged via squash.

Prepared head SHA: e4295fe
Co-authored-by: velvet-shark <[email protected]>
Co-authored-by: velvet-shark <[email protected]>
Reviewed-by: @velvet-shark

* fix(models): use 1M context for openai-codex gpt-5.4 (openclaw#37876)

Merged via squash.

Prepared head SHA: c410207
Co-authored-by: yuweuii <[email protected]>
Co-authored-by: jalehman <[email protected]>
Reviewed-by: @jalehman

* fix(telegram): add download timeout to prevent polling loop hang (openclaw#40098)

Merged via squash.

Prepared head SHA: abdfa1a
Co-authored-by: tysoncung <[email protected]>
Co-authored-by: obviyus <[email protected]>
Reviewed-by: @obviyus

* ACP: add optional ingress provenance receipts (openclaw#40473)

Merged via squash.

Prepared head SHA: b63e46d
Co-authored-by: mbelinky <[email protected]>
Co-authored-by: mbelinky <[email protected]>
Reviewed-by: @mbelinky

* alphabetize web search providers (openclaw#40259)

Merged via squash.

Prepared head SHA: be6350e
Co-authored-by: kesku <[email protected]>
Co-authored-by: obviyus <[email protected]>
Reviewed-by: @obviyus

* fix(plugin-sdk): remove remaining bundled plugin src imports (openclaw#39638)

Verified:
- pnpm build
- pnpm check
- pnpm test:macmini

Co-authored-by: Kyle <[email protected]>
Co-authored-by: Tak Hoffman <[email protected]>

* test: fix android talk config contract fixture

* chore(acpx): move runtime test fixtures to test-utils (openclaw#40548)

Verified:
- pnpm install --frozen-lockfile
- pnpm build
- pnpm check
- pnpm test:macmini

* fix(agents): re-expose configured tools under restrictive profiles

* fix(media): accept reader read result type

* build(protocol): sync generated swift models

* fix: dedupe inbound Telegram DM replies per agent (openclaw#40519)

Merged via squash.

Prepared head SHA: 6e235e7
Co-authored-by: obviyus <[email protected]>
Co-authored-by: obviyus <[email protected]>
Reviewed-by: @obviyus

* fix(matrix): restore robust DM routing without the memberCount heuristic (openclaw#19736)

* fix(matrix): remove memberCount heuristic from DM detection

The memberCount === 2 check in isDirectMessage() misclassifies 2-person
group rooms (admin channels, monitoring rooms) as DMs, routing them to
the main session instead of their room-specific session.

Matrix already distinguishes DMs from groups at the protocol level via
m.direct account data and is_direct member state flags. Both are already
checked by client.dms.isDm() and hasDirectFlag(). The memberCount
heuristic only adds false positives for 2-person groups.

Move resolveMemberCount() below the protocol-level checks so it is only
reached for rooms not matched by m.direct or is_direct. This narrows its
role to diagnostic logging for confirmed group rooms.

Refs: openclaw#19739

* fix(matrix): add conservative fallback for broken DM flags

Some homeservers (notably Continuwuity) have broken m.direct account
data or never set is_direct on invite events. With the memberCount
heuristic removed, these DMs are no longer detected.

Add a conservative fallback that requires two signals before classifying
as DM: memberCount === 2 AND no explicit m.room.name. Group rooms almost
always have explicit names; DMs almost never do.

Error handling distinguishes M_NOT_FOUND (missing state event, expected
for unnamed rooms) from network/auth errors. Non-404 errors fall through
to group classification rather than guessing.

This is independently revertable — removing this commit restores pure
protocol-based detection without any heuristic fallback.

* fix(matrix): add parentPeer for DM room binding support

Add parentPeer to DM routes so conversations are bindable by room ID
while preserving DM trust semantics (secure 1:1, no group restrictions).

Suggested by @KirillShchetinin.

* fix(matrix): override DM detection for explicitly configured rooms

Builds on @robertcorreiro's config-driven approach from openclaw#9106.

Move resolveMatrixRoomConfig() before the DM check. If a room matches
a non-wildcard config entry (matchSource === "direct") and was
classified as DM, override the classification to group. This gives users
a deterministic escape hatch for misclassified rooms.

Wildcards are excluded from the override to avoid breaking DM routing
when a "*" catch-all exists. roomConfig is gated behind isRoom so DMs
never inherit group settings (skills, systemPrompt, autoReply).

This commit is independently droppable if the scope is too broad.

* test(matrix): add DM detection and config override tests

- 15 unit tests for direct.ts: all detection paths, priority order,
  M_NOT_FOUND vs network error handling, edge cases (whitespace names,
  API failures)
- 8 unit tests for rooms.ts: matchSource classification, wildcard
  safety for DM override, direct match priority over wildcard

* Changelog: note matrix DM routing follow-up

* fix(matrix): preserve DM fallback and room bindings

---------

Co-authored-by: Tak Hoffman <[email protected]>

* Fix cron text announce delivery for Telegram targets (openclaw#40575)

Merged via squash.

Prepared head SHA: 54b1513
Co-authored-by: obviyus <[email protected]>
Co-authored-by: obviyus <[email protected]>
Reviewed-by: @obviyus

* fix: clear plugin discovery cache after plugin installation (openclaw#39752)

Verified:
- pnpm build
- pnpm check
- pnpm test:macmini

Co-authored-by: GazeKingNuWu <[email protected]>
Co-authored-by: Tak Hoffman <[email protected]>

* test: fix windows secrets runtime ci

* fix(daemon): enable LaunchAgent before bootstrap on restart

restartLaunchAgent was missing the launchctl enable call that
installLaunchAgent already performs. launchd can persist a "disabled"
state after bootout, causing bootstrap to silently fail and leaving the
gateway unloaded until a manual reinstall.

Fixes openclaw#39211

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(daemon): also enable LaunchAgent in repairLaunchAgentBootstrap

The repair/recovery path had the same missing `enable` guard as
`restartLaunchAgent`.  If launchd persists a "disabled" state after a
previous `bootout`, the `bootstrap` call in `repairLaunchAgentBootstrap`
fails silently, leaving the gateway unloaded in the recovery flow.

Add the same `enable` guard before `bootstrap` that was already applied
to `installLaunchAgent` and (in this PR) `restartLaunchAgent`.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(gateway): exit non-zero on restart shutdown timeout

When a config-change restart hits the force-exit timeout, exit with
code 1 instead of 0 so launchd/systemd treats it as a failure and
triggers a clean process restart. Stop-timeout stays at exit(0)
since graceful stops should not cause supervisor recovery.

Closes openclaw#36822

* test(secrets): skip ACL-dependent runtime snapshot tests on windows

* fix: add changelog for restart timeout recovery (openclaw#40380) (thanks @dsantoreis)

* fix(browser): enforce redirect-hop SSRF checks

* fix(cron): restore owner-only tools for isolated runs

* test(cron): cover owner-only tool availability

* fix(msteams): enforce sender allowlists with route allowlists

* fix(gateway): validate config before restart to prevent crash + macOS permission loss (openclaw#35862)

When 'openclaw gateway restart' is run with an invalid config, the new
process crashes on startup due to config validation failure. On macOS,
this causes Full Disk Access (TCC) permissions to be lost because the
respawned process has a different PID.

Add getConfigValidationError() helper and pre-flight config validation
in both runServiceRestart() and runServiceStart(). If config is invalid,
abort with a clear error message instead of crashing.

The config watcher's hot-reload path already had this guard
(handleInvalidSnapshot), but the CLI restart/start commands did not.

AI-assisted (OpenClaw agent, fully tested)

* fix(gateway): catch startup failure in run loop to prevent process exit (openclaw#35862)

When an in-process restart (SIGUSR1) triggers a config-triggered restart
and the new config is invalid, params.start() throws and the while loop
exits, killing the process. On macOS this loses TCC permissions.

Wrap params.start() in try/catch: on failure, set server=null, log the
error, and wait for the next SIGUSR1 instead of crashing.

* test: add runServiceStart config pre-flight tests (openclaw#35862)

Address Greptile review: add test coverage for runServiceStart path.
The error message copy-paste issue was already fixed in the DRY refactor
(uses params.serviceNoun instead of hardcoded 'restart').

* fix: address bot review feedback on openclaw#35862

- Remove dead 'return false' in runServiceStart (Greptile)
- Include stack trace in run-loop crash guard error log (Greptile)
- Only catch startup errors on subsequent restarts, not initial start (Codex P1)
- Add JSDoc note about env var false positive edge case (Codex P1)

* fix: move config pre-flight before onNotLoaded in runServiceRestart (Codex P2)

The config check was positioned after onNotLoaded, which could send
SIGUSR1 to an unmanaged process before config was validated.

* fix: release gateway lock on restart failure + reply to Codex reviews

- Release gateway lock when in-process restart fails, so daemon
  restart/stop can still manage the process (Codex P2)
- P1 (env mismatch) already addressed: best-effort by design, documented
  in JSDoc

* fix(gateway): detect launchd supervision via XPC_SERVICE_NAME

On macOS, launchd sets XPC_SERVICE_NAME on managed processes but does
not set LAUNCH_JOB_LABEL or LAUNCH_JOB_NAME. Without checking
XPC_SERVICE_NAME, isLikelySupervisedProcess() returns false for
launchd-managed gateways, causing restartGatewayProcessWithFreshPid()
to fork a detached child instead of returning "supervised". The
detached child holds the gateway lock while launchd simultaneously
respawns the original process (KeepAlive=true), leading to an infinite
lock-timeout / restart loop.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: detect launchd supervision via xpc service name (openclaw#20555) (thanks @dimat)

* fix(node-host): bind bun and deno approval scripts

* fix(skills): pin validated download roots

* fix(telegram): abort in-flight getUpdates fetch on shutdown

When the gateway receives SIGTERM, runner.stop() stops the grammY polling
loop but does not abort the in-flight getUpdates HTTP request. That request
hangs for up to 30 seconds (the Telegram API timeout). If a new gateway
instance starts polling during that window, Telegram returns a 409 Conflict
error, causing message loss and requiring exponential backoff recovery.

This is especially problematic with service managers (launchd, systemd)
that restart the process immediately after SIGTERM.

Wire an AbortController into the fetch layer so every Telegram API request
(especially the long-polling getUpdates) aborts immediately on shutdown:

- bot.ts: Accept optional fetchAbortSignal in TelegramBotOptions; wrap
  the grammY fetch with AbortSignal.any() to merge the shutdown signal.
- monitor.ts: Create a per-iteration AbortController, pass its signal to
  createTelegramBot, and abort it from the SIGTERM handler, force-restart
  path, and finally block.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(telegram): use manual signal forwarding to avoid cross-realm AbortSignal

AbortSignal.any() fails in Node.js when signals come from different module
contexts (grammY's internal signal vs local AbortController), producing:
"The signals[0] argument must be an instance of AbortSignal. Received an
instance of AbortSignal".

Replace with manual event forwarding that works across all realms.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: abort telegram getupdates on shutdown (openclaw#23950) (thanks @Gkinthecodeland)

* fix(cron): stagger missed jobs on restart to prevent gateway overload

When the gateway restarts with many overdue cron jobs, they are now
executed with staggered delays to prevent overwhelming the gateway.

- Add missedJobStaggerMs config (default 5s between jobs)
- Add maxMissedJobsPerRestart limit (default 5 jobs immediately)
- Prioritize most overdue jobs by sorting by nextRunAtMs
- Reschedule deferred jobs to fire gradually via normal timer

Fixes openclaw#18892

* fix: stagger missed cron jobs on restart (openclaw#18925) (thanks @rexlunae)

* build: update app deps except carbon

* refactor: extract telegram polling session

* refactor: split cron startup catch-up flow

* refactor: flatten supervisor marker hints

* docs: reorder 2026.3.8 changelog by impact

* build: sync pnpm lockfile

* chore: refresh secrets baseline

* docs: move 2026.3.8 entries back to unreleased

* test: fix Node 24+ test runner and subagent registry mocks

* test: fix Windows fake runtime bin fixtures

* fix: normalize windows runtime shim executables

* chore: prepare 2026.3.8-beta.1 release

* chore: update appcast for 2026.3.8-beta.1

* test: fix windows runtime and restart loop harnesses

* fix(update): re-enable launchd service before updater bootstrap

* chore: prepare 2026.3.8 npm release

* test: narrow gateway loop signal harness

* fix(launchd): harden macOS launchagent install permissions

* fix(onboard): avoid persisting talk fallback on fresh setup

* build: bump unreleased version to 2026.3.9

* fix: stabilize launchd paths and appcast secret scan

* build: sync plugin versions for 2026.3.9

* fix(ui): preserve control-ui auth across refresh (openclaw#40892)

Merged via squash.

Prepared head SHA: f9b2375
Co-authored-by: velvet-shark <[email protected]>
Co-authored-by: velvet-shark <[email protected]>
Reviewed-by: @velvet-shark

* fix(kimi-coding): fix kimi tool format: use native Anthropic tool schema instead of OpenAI … (openclaw#40008)

Verified:
- pnpm install --frozen-lockfile
- pnpm build
- pnpm check
- pnpm test:macmini

Co-authored-by: opriz <[email protected]>
Co-authored-by: Tak Hoffman <[email protected]>

* fix(swiftformat): exclude HostEnvSecurityPolicy.generated.swift from formatters (openclaw#39969)

* test(context-engine): add bundle chunk isolation tests for registry (openclaw#40460)

Merged via squash.

Prepared head SHA: 44622ab
Co-authored-by: dsantoreis <[email protected]>
Co-authored-by: jalehman <[email protected]>
Reviewed-by: @jalehman

* fix(agents): bound compaction retry wait and drain embedded runs on restart (openclaw#40324)

Merged via squash.

Prepared head SHA: cfd9956
Co-authored-by: cgdusek <[email protected]>
Co-authored-by: jalehman <[email protected]>
Reviewed-by: @jalehman

* Allow ACP sessions.patch lineage fields on ACP session keys (openclaw#40995)

Merged via squash.

Prepared head SHA: c1191ed
Co-authored-by: xaeon2026 <[email protected]>
Co-authored-by: mbelinky <[email protected]>
Reviewed-by: @mbelinky

* Update CONTRIBUTING.md

* Add Robin Waslander to maintainers

* Update CONTRIBUTING.md

* fix(acp): map error states to end_turn instead of unconditional refusal (openclaw#41187)

* fix(acp): map error states to end_turn instead of unconditional refusal

* fix: map ACP error stop reason to end_turn (openclaw#41187) (thanks @pejmanjohn)

---------

Co-authored-by: Pejman Pour-Moezzi <[email protected]>
Co-authored-by: Onur <[email protected]>

* fix(acp): propagate setSessionMode gateway errors to client (openclaw#41185)

* fix(acp): propagate setSessionMode gateway errors to client

* fix: add changelog entry for ACP setSessionMode propagation (openclaw#41185) (thanks @pejmanjohn)

---------

Co-authored-by: Pejman Pour-Moezzi <[email protected]>
Co-authored-by: Onur <[email protected]>

* plugins: harden global hook runner state (openclaw#40184)

* fix(telegram): bridge direct delivery to internal message:sent hooks (openclaw#40185)

* telegram: bridge direct delivery message hooks

* telegram: align sent hooks with command session

* Cron: enforce cron-owned delivery contract (openclaw#40998)

Merged via squash.

Prepared head SHA: 5877389
Co-authored-by: mbelinky <[email protected]>
Co-authored-by: mbelinky <[email protected]>
Reviewed-by: @mbelinky

* Agents: add embedded error observations (openclaw#41336)

Merged via squash.

Prepared head SHA: 4900042
Co-authored-by: altaywtf <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf

* Doctor: fix non-interactive cron repair gating (openclaw#41386)

* iOS: reconnect gateway on foreground return (openclaw#41384)

Merged via squash.

Prepared head SHA: 0e2e0dc
Co-authored-by: mbelinky <[email protected]>
Co-authored-by: mbelinky <[email protected]>
Reviewed-by: @mbelinky

* fix(cron): do not misclassify empty/NO_REPLY as interim acknowledgement (openclaw#41401)

* fix(cron): do not misclassify empty/NO_REPLY as interim acknowledgement

When a cron task's agent returns NO_REPLY, the payload filter strips the
silent token, leaving an empty text string. isLikelyInterimCronMessage()
previously returned true for empty input, causing the cron runner to
inject a forced rerun prompt ('Your previous response was only an
acknowledgement...').

Change the empty-string branch to return false: empty text after payload
filtering means the agent deliberately chose silent completion, not that
it sent an interim 'on it' message.

Fixes openclaw#41246

* fix(cron): do not misclassify empty/NO_REPLY as interim acknowledgement

Fixes openclaw#41246. (openclaw#41383) thanks @jackal092927.

---------

Co-authored-by: xaeon2026 <[email protected]>

* fix(auth): reset cooldown error counters on expiry to prevent infinite escalation (openclaw#41028)

Merged via squash.

Prepared head SHA: 89bd83f
Co-authored-by: zerone0x <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf

* Gateway: add pending node work primitives (openclaw#41409)

Merged via squash.

Prepared head SHA: a6d7ca9
Co-authored-by: mbelinky <[email protected]>
Co-authored-by: mbelinky <[email protected]>
Reviewed-by: @mbelinky

* Gateway: tighten node pending drain semantics (openclaw#41429)

Merged via squash.

Prepared head SHA: 361c2eb
Co-authored-by: mbelinky <[email protected]>
Co-authored-by: mbelinky <[email protected]>
Reviewed-by: @mbelinky

* acp: fail honestly in bridge mode (openclaw#41424)

Merged via squash.

Prepared head SHA: b5e6e13
Co-authored-by: mbelinky <[email protected]>
Co-authored-by: mbelinky <[email protected]>
Reviewed-by: @mbelinky

* acp: restore session context and controls (openclaw#41425)

Merged via squash.

Prepared head SHA: fcabdf7
Co-authored-by: mbelinky <[email protected]>
Co-authored-by: mbelinky <[email protected]>
Reviewed-by: @mbelinky

* Sandbox: import STATE_DIR from paths directly (openclaw#41439)

* acp: enrich streaming updates for ide clients (openclaw#41442)

Merged via squash.

Prepared head SHA: 0764368
Co-authored-by: mbelinky <[email protected]>
Co-authored-by: mbelinky <[email protected]>
Reviewed-by: @mbelinky

* acp: forward attachments into ACP runtime sessions (openclaw#41427)

Merged via squash.

Prepared head SHA: f2ac51d
Co-authored-by: mbelinky <[email protected]>
Co-authored-by: mbelinky <[email protected]>
Reviewed-by: @mbelinky

* acp: add regression coverage and smoke-test docs (openclaw#41456)

Merged via squash.

Prepared head SHA: 514d587
Co-authored-by: mbelinky <[email protected]>
Co-authored-by: mbelinky <[email protected]>
Reviewed-by: @mbelinky

* fix(agents): probe single-provider billing cooldowns (openclaw#41422)

Merged via squash.

Prepared head SHA: bbc4254
Co-authored-by: altaywtf <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf

* acp: harden follow-up reliability and attachments (openclaw#41464)

Merged via squash.

Prepared head SHA: 7d167df
Co-authored-by: mbelinky <[email protected]>
Co-authored-by: mbelinky <[email protected]>
Reviewed-by: @mbelinky

* Agents: add fallback error observations (openclaw#41337)

Merged via squash.

Prepared head SHA: 852469c
Co-authored-by: altaywtf <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf

* build(protocol): regenerate Swift models after pending node work schemas (openclaw#41477)

Merged via squash.

Prepared head SHA: cae0aaf
Co-authored-by: mbelinky <[email protected]>
Co-authored-by: mbelinky <[email protected]>
Reviewed-by: @mbelinky

* fix(discord): apply effective maxLinesPerMessage in live replies (openclaw#40133)

Merged via squash.

Prepared head SHA: 031d032
Co-authored-by: rbutera <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf

* Logging: harden probe suppression for observations (openclaw#41338)

Merged via squash.

Prepared head SHA: d18356c
Co-authored-by: altaywtf <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf

* ci(sre:PLA-760): fix smoke workflow fallbacks

* test(sre:PLA-760): allowlist secret-scan false positives

* test(sre:PLA-760): refresh secret baseline for upstream sync

* test(sre:PLA-760): update detect-secrets baseline

* ci(sre:PLA-760): exclude auto-response from zizmor

* fix(ci:PLA-760): unblock audit and bun test lane

* ci(sre:PLA-760): run linux-only ci

---------

Signed-off-by: sallyom <[email protected]>
Co-authored-by: Peter Steinberger <[email protected]>
Co-authored-by: Tak Hoffman <[email protected]>
Co-authored-by: Ayaan Zaidi <[email protected]>
Co-authored-by: darkamenosa <[email protected]>
Co-authored-by: Hermione <[email protected]>
Co-authored-by: rbutera <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Co-authored-by: Altay <[email protected]>
Co-authored-by: echoVic <[email protected]>
Co-authored-by: echoVic <[email protected]>
Co-authored-by: Ayaan Zaidi <[email protected]>
Co-authored-by: Vincent Koc <[email protected]>
Co-authored-by: GitBuck <[email protected]>
Co-authored-by: starbuck100 <[email protected]>
Co-authored-by: jalehman <[email protected]>
Co-authored-by: yuweuii <[email protected]>
Co-authored-by: Rémi <[email protected]>
Co-authored-by: remusao <[email protected]>
Co-authored-by: gumadeiras <[email protected]>
Co-authored-by: Mariano <[email protected]>
Co-authored-by: Shrey Pandya <[email protected]>
Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Co-authored-by: Josh Lehman <[email protected]>
Co-authored-by: Joe Harouni <[email protected]>
Co-authored-by: AaronWander <[email protected]>
Co-authored-by: Matt Van Horn <[email protected]>
Co-authored-by: Charles Dusek <[email protected]>
Co-authored-by: Nimrod Gutman <[email protected]>
Co-authored-by: shichangs <[email protected]>
Co-authored-by: Gustavo Madeira Santana <[email protected]>
Co-authored-by: langdon <[email protected]>
Co-authored-by: sallyom <[email protected]>
Co-authored-by: Tyler Yust <[email protected]>
Co-authored-by: langdon <[email protected]>
Co-authored-by: ademczuk <[email protected]>
Co-authored-by: ademczuk <[email protected]>
Co-authored-by: Doruk Ardahan <[email protected]>
Co-authored-by: 0xsline <[email protected]>
Co-authored-by: bbblending <[email protected]>
Co-authored-by: Radek Sienkiewicz <[email protected]>
Co-authored-by: velvet-shark <[email protected]>
Co-authored-by: Tyson Cung <[email protected]>
Co-authored-by: obviyus <[email protected]>
Co-authored-by: Mariano <[email protected]>
Co-authored-by: Kesku <[email protected]>
Co-authored-by: Kyle <[email protected]>
Co-authored-by: Kyle <[email protected]>
Co-authored-by: Bronko <[email protected]>
Co-authored-by: GazeKingNuWu <[email protected]>
Co-authored-by: GazeKingNuWu <[email protected]>
Co-authored-by: scoootscooob <[email protected]>
Co-authored-by: Daniel dos Santos Reis <[email protected]>
Co-authored-by: DevMac <[email protected]>
Co-authored-by: merlin <[email protected]>
Co-authored-by: dimatu <[email protected]>
Co-authored-by: George Kalogirou <[email protected]>
Co-authored-by: rexlunae <[email protected]>
Co-authored-by: opriz <[email protected]>
Co-authored-by: opriz <[email protected]>
Co-authored-by: Joshua Lelon Mitchell <[email protected]>
Co-authored-by: dsantoreis <[email protected]>
Co-authored-by: Charles Dusek <[email protected]>
Co-authored-by: xaeon2026 <[email protected]>
Co-authored-by: xaeon2026 <[email protected]>
Co-authored-by: Robin Waslander <[email protected]>
Co-authored-by: Pejman Pour-Moezzi <[email protected]>
Co-authored-by: Pejman Pour-Moezzi <[email protected]>
Co-authored-by: Onur <[email protected]>
Co-authored-by: zerone0x <[email protected]>
Co-authored-by: zerone0x <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channel: zalo Channel integration: zalo maintainer Maintainer-authored PR size: L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants