
fix(msteams): await abort signal to prevent EADDRINUSE restart loop#25582

Closed
byungsker wants to merge 1 commit into openclaw:main from byungsker:fix/msteams-eaddrinuse-await-until-abort

Conversation


@byungsker byungsker commented Feb 24, 2026

Problem

monitorMSTeamsProvider() returned as soon as expressApp.listen() bound to the port. The gateway's startAccount runner treats a resolved promise as "provider stopped" and schedules an auto-restart; the second bind attempt then fails with EADDRINUSE, producing an infinite restart loop until MAX_RESTART_ATTEMPTS is exhausted:

[default] auto-restart attempt 1/10 in 5s
EADDRINUSE: address already in use :::3978
[default] auto-restart attempt 2/10 in 11s
...

Fix

Replace the fire-and-forget abort listener with an await-until-abort block so the startAccount promise stays pending while the HTTP server is running:

-  // Handle abort signal
+  // Keep the provider alive until the abort signal fires.
+  // Without this await the startAccount promise resolves immediately after
+  // expressApp.listen() binds to the port, causing the gateway to interpret
+  // the provider as "stopped" and triggering an auto-restart loop that quickly
+  // fails with EADDRINUSE on the second bind attempt.
   if (opts.abortSignal) {
-    opts.abortSignal.addEventListener("abort", () => {
-      void shutdown();
-    });
+    if (!opts.abortSignal.aborted) {
+      await new Promise<void>((resolve) => {
+        opts.abortSignal!.addEventListener("abort", () => resolve(), { once: true });
+      });
+    }
+    await shutdown();
   }

Behaviour matrix:

| abortSignal | aborted on entry | Result |
| --- | --- | --- |
| provided | no | awaits abort → calls `shutdown()` |
| provided | yes | calls `shutdown()` immediately |
| absent | — | returns immediately (unchanged) |

This matches the pattern already used by the zalouser monitor (extensions/zalouser/src/monitor.ts).
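The pattern above can be exercised in isolation. Below is a minimal, self-contained sketch (the `monitor`/`shutdown` names and the event log are illustrative, not the actual extension code) showing that the promise stays pending until the signal fires:

```typescript
// Minimal sketch of the await-until-abort pattern. `monitor` and
// `shutdown` are illustrative stand-ins, not the real provider code.
async function monitor(opts: { abortSignal?: AbortSignal }): Promise<string[]> {
  const events: string[] = [];
  const shutdown = async () => { events.push("shutdown"); };

  events.push("listening"); // stands in for expressApp.listen()

  if (opts.abortSignal) {
    if (!opts.abortSignal.aborted) {
      // Stay pending until the signal fires, so the caller cannot
      // mistake a resolved promise for "provider stopped".
      await new Promise<void>((resolve) => {
        opts.abortSignal!.addEventListener("abort", () => resolve(), { once: true });
      });
    }
    await shutdown();
  }
  return events;
}

// Usage: the promise only settles after abort() is called.
const ctrl = new AbortController();
const run = monitor({ abortSignal: ctrl.signal });
ctrl.abort();
run.then((events) => console.log(events.join(","))); // "listening,shutdown"
```

Note the `{ once: true }` option: the listener is removed after firing, so a signal that aborts repeatedly (or a reused controller) cannot resolve the promise twice.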

Fixes #25527

Greptile Summary

Fixes the MS Teams provider's EADDRINUSE restart loop by keeping the monitorMSTeamsProvider promise pending until the abort signal fires, rather than resolving immediately after expressApp.listen() binds to the port.

  • Replaces the fire-and-forget addEventListener("abort", ...) with an await new Promise that blocks until abort, followed by await shutdown() — matching the established pattern in extensions/zalouser/src/monitor.ts
  • Correctly handles edge cases: signal not yet aborted (awaits), already aborted on entry (immediate shutdown), and absent signal (returns immediately, unchanged behavior)
  • No issues found — the fix is minimal, well-targeted, and well-documented with a clear comment explaining the rationale

Confidence Score: 5/5

  • This PR is safe to merge — it's a focused, minimal fix that follows an established pattern in the codebase.
  • The change is small (7 lines net), well-documented, and directly addresses a clear bug (EADDRINUSE restart loop). It follows the same await-until-abort pattern already used by the zalouser monitor. All three abort-signal cases (provided/not-aborted, provided/already-aborted, absent) are handled correctly. The gateway always passes abortSignal when calling this function, so the fix covers the production path. No new dependencies, no behavioral changes to other code paths.
  • No files require special attention.

Last reviewed commit: 1d92f36

monitorMSTeamsProvider() returned immediately after expressApp.listen()
bound to the port. The gateway's startAccount runner treats a resolved
promise as "provider stopped" and schedules an auto-restart; the second
bind attempt then fails with EADDRINUSE, producing an infinite restart
loop until MAX_RESTART_ATTEMPTS is exhausted.

Replace the fire-and-forget abort listener with an await-until-abort
block: when an abortSignal is provided the function now stays pending
until the signal fires, then calls shutdown() before returning. If the
signal is already aborted on entry the shutdown is called immediately.
When no abortSignal is provided the existing behaviour is preserved
(server keeps running; caller can invoke shutdown() directly).

Fixes openclaw#25527
@BradGroux
Contributor

Field report from a live production recovery (sanitized, no secrets / no env values). Posting this in case it helps maintainers and others triaging the same restart-loop class.

Executive summary

We hit a Microsoft Teams provider auto-restart loop with the same user-visible signature described here:

  • provider logs startup and directory/channel resolution
  • then immediately reports auto-restart attempts (1/10, 2/10, ...)
  • loop persists despite successful startup-side lookups

In our incident, there were three independent contributing factors. Fixing only one did not fully resolve the loop:

  1. Package collision (legacy global package co-installed with current package)
  2. Missing Teams bot credential field (app secret not set)
  3. Upstream lifecycle bug pattern (provider promise appears to resolve too early, matching this issue family)

What we observed (timeline-style)

Phase A — restart loop with repeated startup success signals

  • Teams provider repeatedly logged startup, user resolution, and channel resolution.
  • Immediately after these success logs, gateway health/auto-restart kicked in.
  • Backoff pattern matched known restart-loop behavior.

Phase B — eliminated local package-collision factor

  • Found old global package and current package both installed.
  • Removed legacy global package.
  • Result: removed one clear conflict vector, but loop still present.

Phase C — fixed missing credential config

  • Added missing Teams bot app password field (client secret value) in channel config.
  • Triggered config reload / gateway restart.
  • Result: credential state improved, but restart loop still present.

Phase D — remaining behavior matches known monitor/lifecycle bug class

  • Even after cleanup + credentials corrected, pattern remained:
    • start provider
    • resolve users/channels
    • immediate auto-restart
  • This aligns with issue reports where monitor/start function resolves prematurely and channel manager interprets that as provider exit.

Distinguishing signals that helped triage

These indicators were the most useful to separate local misconfiguration from upstream bug behavior:

  1. Success-before-failure pattern

    • If user/group/channel resolution consistently succeeds before restart, network/auth may not be the primary blocker.
  2. Stable repeating loop shape

    • Consistent startup → resolution → restart sequence with backoff strongly suggests lifecycle contract mismatch.
  3. Persistence across remediation layers

    • If loop persists after:
      • removing duplicate installs,
      • setting required credentials,
      • clean restart,
        then upstream monitor lifecycle is likely involved.

Secure remediation checklist that worked best for us

(Generalized to avoid host-specific details)

1) Eliminate package duplication first

  • Verify only one canonical global installation is active.
  • Remove legacy/conflicting package names that may still register plugins.
  • Re-check plugin load path consistency.

2) Validate Teams auth fields completely

  • Ensure app ID, tenant ID, and app password (client secret value) are all present.
  • Confirm config validation passes before restart.

3) Restart and verify by behavior, not process state

  • Don’t trust “running” status alone.
  • Verify no new auto-restart attempt lines over a timed observation window.
  • Verify inbound/outbound Teams flow during that same window.
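The behavioral check in step 3 can be automated against captured gateway logs. A rough sketch (the log format is illustrative, taken from the restart lines quoted earlier; `windowLogs` is a hypothetical sample from the observation window):

```typescript
// Rough sketch: declare the provider healthy only if no auto-restart
// attempt line appears in the timed observation window.
// The regex matches lines like "[default] auto-restart attempt 1/10 in 5s".
function hasRestartAttempt(logLines: string[]): boolean {
  return logLines.some((line) => /auto-restart attempt \d+\/\d+/.test(line));
}

// Hypothetical log sample from a healthy observation window.
const windowLogs = [
  "[default] msteams provider started",
  "[default] resolved 3 channels",
];
console.log(hasRestartAttempt(windowLogs) ? "still looping" : "no restarts observed");
```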

4) If still looping, classify as likely upstream lifecycle bug

  • Capture exact log sequence around each restart boundary.
  • Attach sanitized sequence to issue/PR for maintainers.

What to include in repro data (high value for maintainers)

Recommend sharing these artifacts (all sanitized):

  • startup line for msteams provider
  • user/group/channel resolution lines
  • first line indicating restart scheduling
  • health-monitor line indicating “reason: stopped” (if present)
  • whether duplicate package installations were found
  • whether credentials were complete at time of test
  • whether loop persists after those are corrected

Suggested maintainer-facing acceptance test

A robust guardrail test for this bug class would assert:

  1. startAccount() for Teams does not resolve while provider is healthy.
  2. Resolver success (users/channels) alone does not trigger subsystem restart logic.
  3. Restart path only activates on explicit stop, fatal error, or abort signal.
  4. No duplicate listener bind occurs during healthy run (prevents EADDRINUSE cascades).
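Guardrail 1 above can be sketched as a plain async assertion without any test framework. `fakeMonitor` is a hypothetical stand-in for the real provider entry point; the 50 ms window is an arbitrary "healthy runtime" interval for illustration:

```typescript
// Sketch of guardrail 1: the monitor promise must not settle while the
// provider is healthy, and must settle promptly on explicit abort.
async function fakeMonitor(signal: AbortSignal): Promise<void> {
  if (!signal.aborted) {
    await new Promise<void>((resolve) =>
      signal.addEventListener("abort", () => resolve(), { once: true }),
    );
  }
}

async function assertStaysPending(): Promise<void> {
  const ctrl = new AbortController();
  let settled = false;
  const run = fakeMonitor(ctrl.signal).then(() => { settled = true; });

  // Give the provider a few ticks of "healthy" runtime.
  await new Promise((r) => setTimeout(r, 50));
  if (settled) throw new Error("monitor resolved while provider was healthy");

  ctrl.abort();  // explicit stop
  await run;     // now it must settle promptly
  if (!settled) throw new Error("monitor did not settle after abort");
  console.log("ok");
}

assertStaysPending();
```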

Current status from this field report

  • Local config/package hygiene issues: addressed.
  • Remaining loop signature: still consistent with upstream lifecycle bug described in this issue family.
  • Net: this report supports merging a monitor/start lifecycle fix that keeps provider task pending until true shutdown.

If helpful, I can provide a compact sanitized log excerpt in follow-up showing exact line ordering (startup → resolution → restart) without any identifiers.

@steipete
Contributor

steipete commented Mar 2, 2026

Superseded by already-merged MSTeams monitor lifecycle fixes:

monitorMSTeamsProvider now keeps the provider run pending until abort/shutdown, resolves startup/close races, and includes regression coverage in main.

Closing this as duplicate to keep the queue focused. Thank you for the patch and write-up.

@steipete steipete closed this Mar 2, 2026

Labels

channel: msteams (Channel integration: msteams) · size: XS

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MS Teams provider: EADDRINUSE restart loop — missing await-until-abort in monitorMSTeamsProvider

3 participants