
fix(msteams): keep provider promise pending until abort to stop auto-restart loop#22605

Closed
OpakAlex wants to merge 2 commits into openclaw:main from OpakAlex:fix/msteams-provider-promise-lifecycle
Conversation


@OpakAlex OpakAlex commented Feb 21, 2026

Summary

Describe the problem and fix in 2–5 bullets:

  • Problem: MS Teams provider logs "starting provider (port 3978)" then immediately "auto-restart attempt N/10 in Xs" in a loop; health monitor may log "restarting (reason: stopped)" and reset the attempt counter.
  • Why it matters: The channel never stays "running"; the gateway keeps restarting the provider until "giving up after 10 restart attempts."
  • What changed: monitorMSTeamsProvider returns a promise that stays pending until abort + shutdown (not on first listen); added gateway-lifecycle doc and a server-channels test that a pending startAccount does not trigger auto-restart.
  • What did NOT change (scope boundary): No gateway or health-monitor logic changes; no config/API; other channels unchanged.
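The change described above can be sketched as the following pending-until-abort pattern (a minimal sketch; `MonitorOptions` and `startServer` are illustrative placeholders, not the real monitor.ts API):

```typescript
// Hedged sketch of the pending-until-abort pattern; startServer and the
// returned shutdown function are placeholders, not the real monitor.ts API.

interface MonitorOptions {
  abortSignal?: AbortSignal;
}

// Stand-in for expressApp.listen(...): resolves once listening and
// returns an async shutdown function.
async function startServer(): Promise<() => Promise<void>> {
  return async () => {};
}

async function monitorProvider(opts: MonitorOptions): Promise<void> {
  const shutdown = await startServer();

  // Do NOT resolve after listen. Stay pending so the gateway keeps
  // treating the account as running; resolve only after abort + shutdown.
  return new Promise<void>((resolve) => {
    if (!opts.abortSignal) return; // no signal: promise never resolves (documented)
    opts.abortSignal.addEventListener(
      "abort",
      () => {
        void shutdown().then(resolve); // resolve only after cleanup finishes
      },
      { once: true },
    );
  });
}
```

The key point is that resolving right after `listen()` is what the gateway reads as "channel exited", so the promise must outlive the healthy run of the server.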

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #
  • Related #

User-visible / Behavior Changes

MS Teams channel no longer enters an auto-restart loop; provider stays running until the user stops the channel or the gateway exits. New doc for extension authors: gateway channel lifecycle (startAccount contract).

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: any
  • Runtime/container: Node 22+
  • Integration/channel: msteams enabled with valid credentials
  • Relevant config (redacted): channels.msteams.enabled, webhook.port, credentials

Steps

  1. Enable MS Teams channel and start the gateway.
  2. Watch logs for "msteams" and "auto-restart".

Expected

One "starting provider (port 3978)" and "msteams provider started on port 3978"; no repeated "auto-restart attempt N/10" unless the provider actually crashes.

Actual (before fix)

"starting provider (port 3978)" followed immediately by "auto-restart attempt 1/10 in 5s", then cycle repeats with backoff up to 10 attempts.

Evidence

  • New test: server-channels.test.ts — "does not auto-restart when startAccount promise stays pending" (startAccount returns never-resolving promise → one call, running stays true).
  • Spec: docs/channels/gateway-lifecycle.md — startAccount promise contract and MS Teams fix.
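The shape of that guardrail test can be sketched as follows (hedged; `observe` and its return shape are illustrative, not the real API in server-channels.test.ts):

```typescript
// Hedged sketch of what the new guardrail test checks. A startAccount
// promise that settles without an abort is treated as an exit, which is
// what would schedule an auto-restart in the real gateway.
type StartAccount = (opts: { abortSignal: AbortSignal }) => Promise<void>;

async function observe(start: StartAccount, windowMs: number) {
  const ac = new AbortController();
  let exited = false;
  start({ abortSignal: ac.signal }).then(() => {
    exited = true;
  });
  await new Promise((r) => setTimeout(r, windowMs));
  // Settled without an abort => the gateway would auto-restart.
  const wouldRestart = exited && !ac.signal.aborted;
  return { running: !exited, wouldRestart };
}
```

A never-resolving `startAccount` should observe as `{ running: true, wouldRestart: false }`, while one that resolves immediately after listen should observe as `wouldRestart: true` — the pre-fix behavior.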

Human Verification (required)

  • Verified scenarios: Unit test passes; code review of promise lifecycle (pending until abort, then resolve after shutdown).
  • Edge cases checked: No abort signal → promise never resolves (documented); normal stop uses abort.
  • What you did not verify: Live gateway with real MS Teams app (no credentials in env).

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

Failure Recovery (if this breaks)

  • How to disable/revert: Disable msteams channel or revert this commit.
  • Files/config to restore: None.
  • Known bad symptoms: If the abort listener failed to fire, stopping the channel could hang; in practice shutdown is still invoked on abort, so stop behavior is unchanged.

Risks and Mitigations

  • Risk: Callers that awaited the old return value for immediate use could break.
    • Mitigation: Gateway only awaits for lifecycle (stopChannel); it does not use the resolved value. No such callers in repo.
  • Risk: None otherwise.
    • Mitigation: N/A

Greptile Summary

Fixes the MS Teams provider auto-restart loop by making monitorMSTeamsProvider return a promise that stays pending while the server is running, matching the gateway's startAccount contract. Previously, the promise resolved immediately after expressApp.listen(), causing the gateway to treat the channel as "exited" and enter a restart loop.

  • extensions/msteams/src/monitor.ts: Wraps the return value in a Promise that stays pending until the abort signal fires and shutdown completes. Without an abort signal, returns a never-resolving promise.
  • src/gateway/server-channels.test.ts: Adds a test confirming that a pending startAccount promise does not trigger auto-restart.
  • docs/channels/gateway-lifecycle.md: New documentation describing the startAccount promise contract for extension channel plugins.

Confidence Score: 5/5

  • This PR is safe to merge — it's a targeted, well-tested fix that correctly aligns the MS Teams provider with the gateway's startAccount promise contract.
  • The change is minimal and focused: it wraps the existing return value in a pending promise (with abort-based resolution), which is the documented correct pattern. The fix is backed by a new test case. No gateway or health-monitor logic was changed. Early return paths for disabled/unconfigured channels are guarded by the gateway's own checks. The abort signal is always freshly created by the gateway, so there's no risk of pre-aborted signals. No new dependencies, no API changes, backward compatible.
  • No files require special attention

Last reviewed commit: 0dc50ed


@openclaw-barnacle openclaw-barnacle bot added docs Improvements or additions to documentation channel: msteams Channel integration: msteams gateway Gateway runtime size: XS size: S channel: discord Channel integration: discord app: macos App: macos security Security documentation commands Command implementations agents Agent runtime and tooling size: XL and removed size: XS size: S labels Feb 21, 2026
@OpakAlex OpakAlex force-pushed the fix/msteams-provider-promise-lifecycle branch from 6cb542d to 612f04a Compare February 21, 2026 12:12
@openclaw-barnacle openclaw-barnacle bot added size: S and removed channel: discord Channel integration: discord app: macos App: macos security Security documentation commands Command implementations agents Agent runtime and tooling size: XL labels Feb 21, 2026
@OpakAlex
Author

@steipete your commit adds HTML tags: 073651fb570a9ef555c44e4f1ea54b3d78d84ec2

## Sponsors
+
+<table align="center">
+  <tr>
+    <td align="center" valign="middle" bgcolor="#111827" width="240" height="72">
+      <a href="https://openai.com/" target="_blank" rel="noopener">
+        <img src="docs/assets/sponsors/openai.svg" alt="OpenAI" height="34" />
+      </a>
+    </td>
+    <td width="16"></td>
+    <td align="center" valign="middle" bgcolor="#111827" width="240" height="72">
+      <a href="https://blacksmith.sh/" target="_blank" rel="noopener">
+        <img src="docs/assets/sponsors/blacksmith.svg" alt="Blacksmith" height="34" />
+      </a>
+    </td>
+  </tr>
+</table>
+

Should we allow HTML tags in CI?

Thanks

@OpakAlex
Author

@obviyus Can you please check?

@OpakAlex OpakAlex force-pushed the fix/msteams-provider-promise-lifecycle branch from eb13354 to 67183f7 Compare February 21, 2026 18:05
Alexandr Opak and others added 2 commits February 21, 2026 19:17
…restart loop

- monitorMSTeamsProvider now returns a promise that stays pending until
  opts.abortSignal fires and shutdown() completes, so the gateway no
  longer treats the channel as exited and restarts it in a loop.
- Add docs/channels/gateway-lifecycle.md describing startAccount contract.
- Gateway test: startAccount that resolves on abort does not trigger
  auto-restart; call stopChannel so test cleanup exits.
- MSTeams test: use file URL for OneDrive mediaUrl so isLocalPath works
  on all platforms (e.g. Windows CI).
- Apply oxfmt to gateway-lifecycle.md and src/browser/* (format check).

Co-authored-by: Cursor <[email protected]>
@OpakAlex OpakAlex force-pushed the fix/msteams-provider-promise-lifecycle branch from 67183f7 to c94c7b3 Compare February 21, 2026 18:17
@Glucksberg
Contributor

Just noticed a connection:

Several other PRs seem to address the same problem:

Issue #22169 reports the msteams provider starting twice on gateway boot, causing EADDRINUSE; PR #22182 fixes the false auto-restart loop by keeping the provider promise pending until abort.

Both approaches have merit — might be worth coordinating.

Related issue(s): #22169

If any of these links don't look right, let me know and I'll correct them.

@openclaw-barnacle

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

@openclaw-barnacle openclaw-barnacle bot added the stale Marked as stale due to inactivity label Mar 1, 2026
@pandego
Contributor

pandego commented Mar 1, 2026

Closing this PR as superseded by follow-up work on the same issue path.

The msteams lifecycle fix is covered in #22182, so I am closing this one to keep the queue clean and avoid duplicate maintenance.

Thanks everyone for the review context and cross-links.

@BradGroux
Contributor

Field report from a live production recovery (sanitized, no secrets / no env values). Posting this in case it helps maintainers and others triaging the same restart-loop class.

Executive summary

We hit a Microsoft Teams provider auto-restart loop with the same user-visible signature described here:

  • provider logs startup and directory/channel resolution
  • then immediately reports auto-restart attempts (1/10, 2/10, ...)
  • loop persists despite successful startup-side lookups

In our incident, there were three independent contributors. Fixing only one did not fully resolve the loop:

  1. Package collision (legacy global package co-installed with current package)
  2. Missing Teams bot credential field (app secret not set)
  3. Upstream lifecycle bug pattern (provider promise appears to resolve too early, matching this issue family)

What we observed (timeline-style)

Phase A — restart loop with repeated startup success signals

  • Teams provider repeatedly logged startup, user resolution, and channel resolution.
  • Immediately after these success logs, gateway health/auto-restart kicked in.
  • Backoff pattern matched known restart-loop behavior.

Phase B — eliminated local package-collision factor

  • Found old global package and current package both installed.
  • Removed legacy global package.
  • Result: removed one clear conflict vector, but loop still present.

Phase C — fixed missing credential config

  • Added missing Teams bot app password field (client secret value) in channel config.
  • Triggered config reload / gateway restart.
  • Result: credential state improved, but restart loop still present.

Phase D — remaining behavior matches known monitor/lifecycle bug class

  • Even after cleanup + credentials corrected, pattern remained:
    • start provider
    • resolve users/channels
    • immediate auto-restart
  • This aligns with issue reports where monitor/start function resolves prematurely and channel manager interprets that as provider exit.

Distinguishing signals that helped triage

These indicators were the most useful to separate local misconfiguration from upstream bug behavior:

  1. Success-before-failure pattern: if user/group/channel resolution consistently succeeds before each restart, network/auth is unlikely to be the primary blocker.
  2. Stable repeating loop shape: a consistent startup → resolution → restart sequence with backoff strongly suggests a lifecycle contract mismatch.
  3. Persistence across remediation layers: if the loop persists after removing duplicate installs, setting required credentials, and a clean restart, the upstream monitor lifecycle is likely involved.

Secure remediation checklist that worked best for us

(Generalized to avoid host-specific details)

1) Eliminate package duplication first

  • Verify only one canonical global installation is active.
  • Remove legacy/conflicting package names that may still register plugins.
  • Re-check plugin load path consistency.

2) Validate Teams auth fields completely

  • Ensure app ID, tenant ID, and app password (client secret value) are all present.
  • Confirm config validation passes before restart.

3) Restart and verify by behavior, not process state

  • Don’t trust “running” status alone.
  • Verify no new auto-restart attempt lines over a timed observation window.
  • Verify inbound/outbound Teams flow during that same window.

4) If still looping, classify as likely upstream lifecycle bug

  • Capture exact log sequence around each restart boundary.
  • Attach sanitized sequence to issue/PR for maintainers.
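Step 3 of the checklist above can be checked mechanically by counting restart-scheduling lines in a captured log window (a hedged sketch; the regex matches the "auto-restart attempt N/10 in Xs" message quoted in this thread, which may differ across versions):

```typescript
// Count auto-restart scheduling lines in a captured log window.
// Pattern matches the message quoted in this thread:
//   "auto-restart attempt 1/10 in 5s"
const RESTART_LINE = /auto-restart attempt (\d+)\/(\d+) in \d+s/;

function countRestartAttempts(logWindow: string): number {
  return logWindow.split("\n").filter((line) => RESTART_LINE.test(line))
    .length;
}
```

A healthy timed observation window should report zero attempts; anything above zero within the window means the loop is still active, regardless of what the process status says.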

What to include in repro data (high value for maintainers)

Recommend sharing these artifacts (all sanitized):

  • startup line for msteams provider
  • user/group/channel resolution lines
  • first line indicating restart scheduling
  • health-monitor line indicating “reason: stopped” (if present)
  • whether duplicate package installations were found
  • whether credentials were complete at time of test
  • whether loop persists after those are corrected

Suggested maintainer-facing acceptance test

A robust guardrail test for this bug class would assert:

  1. startAccount() for Teams does not resolve while provider is healthy.
  2. Resolver success (users/channels) alone does not trigger subsystem restart logic.
  3. Restart path only activates on explicit stop, fatal error, or abort signal.
  4. No duplicate listener bind occurs during healthy run (prevents EADDRINUSE cascades).
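Assertion 4 can be sketched as a standalone check (hedged; this uses node:net directly for illustration, while the real provider binds via Express on port 3978): a second bind on an already-bound port must surface EADDRINUSE rather than silently double-listening.

```typescript
// Hedged sketch of assertion 4: a second bind on the same port must fail
// with EADDRINUSE, the cascade the acceptance test guards against.
import net from "node:net";

function listen(port: number): Promise<net.Server> {
  return new Promise((resolve, reject) => {
    const srv = net.createServer();
    srv.once("error", reject);
    srv.listen(port, "127.0.0.1", () => resolve(srv));
  });
}

async function bindTwice(): Promise<string> {
  const first = await listen(0); // ephemeral port
  const { port } = first.address() as net.AddressInfo;
  try {
    const second = await listen(port);
    second.close();
    return "double-bind"; // would indicate the bug class
  } catch (err) {
    return (err as NodeJS.ErrnoException).code ?? "unknown";
  } finally {
    first.close();
  }
}
```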

Current status from this field report

  • Local config/package hygiene issues: addressed.
  • Remaining loop signature: still consistent with upstream lifecycle bug described in this issue family.
  • Net: this report supports merging a monitor/start lifecycle fix that keeps provider task pending until true shutdown.

If helpful, I can provide a compact sanitized log excerpt in follow-up showing exact line ordering (startup → resolution → restart) without any identifiers.

@steipete
Contributor

steipete commented Mar 2, 2026

Superseded by already-merged MSTeams lifecycle work:

The provider lifecycle now stays pending until abort/shutdown, with monitor lifecycle regression coverage in main. This addresses the same restart-loop root cause.

Closing as duplicate/superseded to keep the queue focused. Thank you for the thorough write-up and docs context.

@steipete steipete closed this Mar 2, 2026

Labels

channel: msteams Channel integration: msteams docs Improvements or additions to documentation gateway Gateway runtime size: S stale Marked as stale due to inactivity
