fix(msteams): keep monitor alive until shutdown#24580

Merged
steipete merged 2 commits into openclaw:main from chilu18:fix/msteams-monitor-lifecycle on Mar 2, 2026

Conversation

@chilu18 (Contributor) commented Feb 23, 2026

Summary

  • keep monitorMSTeamsProvider alive until server close so the channel manager does not treat startup as an immediate exit
  • fail fast on startup bind/listen errors (for example EADDRINUSE) so restart backoff can classify a real startup failure
  • wire abort handling with a stable listener and cleanup on close
  • add lifecycle regression tests covering:
    • monitor stays pending until abort
    • startup rejects on listen error

Fixes #24522.
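The lifecycle described above can be sketched as follows. This is a minimal illustration of the contract, not the PR's actual implementation; `monitorSketch` and `MonitorOptions` are hypothetical names.

```typescript
import http from "node:http";

interface MonitorOptions {
  port: number;
  abortSignal?: AbortSignal;
}

function monitorSketch(opts: MonitorOptions): Promise<void> {
  return new Promise((resolve, reject) => {
    const server = http.createServer();
    const onAbort = () => server.close();

    // Fail fast on bind/listen errors (for example EADDRINUSE) so restart
    // backoff can classify a real startup failure.
    server.once("error", reject);

    server.listen(opts.port, () => {
      // Startup succeeded: register a stable abort listener and keep the
      // promise pending until the server actually closes.
      opts.abortSignal?.addEventListener("abort", onAbort, { once: true });
      server.once("close", () => {
        // Clean up the listener in case close came from shutdown, not abort.
        opts.abortSignal?.removeEventListener("abort", onAbort);
        resolve();
      });
    });
  });
}
```

The key point is that the returned promise settles only on server close or startup error, never on successful startup alone.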

Tests

  • corepack pnpm --dir /tmp/openclaw-next4-CwcfMD exec vitest run extensions/msteams/src/monitor.lifecycle.test.ts src/gateway/server-channels.test.ts

Greptile Summary

Fixed lifecycle issue where monitorMSTeamsProvider returned immediately after startup, causing the channel manager to treat it as an exit and trigger restart loops. The fix keeps the monitor promise pending until the HTTP server closes (via abort signal or shutdown), and fails fast on startup errors like EADDRINUSE.

Confidence Score: 5/5

  • Safe to merge - fixes critical lifecycle bug preventing MSTeams monitor from staying alive
  • Implementation correctly addresses the root cause (monitor returning immediately instead of staying pending), includes comprehensive lifecycle tests covering both normal operation and error scenarios, and follows existing patterns from other channel monitors
  • No files require special attention

Last reviewed commit: d839de4

@greptile-apps (bot) left a comment


2 files reviewed, 1 comment


resolve();
});
});
opts.abortSignal?.removeEventListener("abort", onAbort);

removeEventListener unnecessary since listener added with { once: true } on line 317

Suggested change
opts.abortSignal?.removeEventListener("abort", onAbort);
// Listener auto-removed by { once: true } option
Inline comment location: extensions/msteams/src/monitor.ts, line 327
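The reviewer's point can be shown with a standalone snippet: a listener registered with `{ once: true }` is removed automatically after its first invocation, so a later `removeEventListener` call is redundant (though harmless).

```typescript
const target = new EventTarget();
let calls = 0;
const handler = () => { calls += 1; };

target.addEventListener("ping", handler, { once: true });
target.dispatchEvent(new Event("ping")); // handler fires and is auto-removed
target.dispatchEvent(new Event("ping")); // no effect: listener already gone

console.log(calls); // prints 1
```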

@openclaw-barnacle

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

@openclaw-barnacle openclaw-barnacle bot added the stale Marked as stale due to inactivity label Mar 1, 2026
@BradGroux (Contributor)

Field report from a live production recovery (sanitized, no secrets / no env values). Posting this in case it helps maintainers and others triaging the same restart-loop class.

Executive summary

We hit a Microsoft Teams provider auto-restart loop with the same user-visible signature described here:

  • provider logs startup and directory/channel resolution
  • then immediately reports auto-restart attempts (1/10, 2/10, ...)
  • loop persists despite successful startup-side lookups

In our incident, there were three independent contributors. Fixing only one did not fully resolve the loop:

  1. Package collision (legacy global package co-installed with current package)
  2. Missing Teams bot credential field (app secret not set)
  3. Upstream lifecycle bug pattern (provider promise appears to resolve too early, matching this issue family)

What we observed (timeline-style)

Phase A — restart loop with repeated startup success signals

  • Teams provider repeatedly logged startup, user resolution, and channel resolution.
  • Immediately after these success logs, gateway health/auto-restart kicked in.
  • Backoff pattern matched known restart-loop behavior.

Phase B — eliminated local package-collision factor

  • Found old global package and current package both installed.
  • Removed legacy global package.
  • Result: removed one clear conflict vector, but loop still present.

Phase C — fixed missing credential config

  • Added missing Teams bot app password field (client secret value) in channel config.
  • Triggered config reload / gateway restart.
  • Result: credential state improved, but restart loop still present.

Phase D — remaining behavior matches known monitor/lifecycle bug class

  • Even after cleanup + credentials corrected, pattern remained:
    • start provider
    • resolve users/channels
    • immediate auto-restart
  • This aligns with issue reports where monitor/start function resolves prematurely and channel manager interprets that as provider exit.

Distinguishing signals that helped triage

These indicators were the most useful to separate local misconfiguration from upstream bug behavior:

  1. Success-before-failure pattern

    • If user/group/channel resolution consistently succeeds before restart, network/auth may not be the primary blocker.
  2. Stable repeating loop shape

    • Consistent startup → resolution → restart sequence with backoff strongly suggests lifecycle contract mismatch.
  3. Persistence across remediation layers

    • If the loop persists after removing duplicate installs, setting required credentials, and a clean restart, then the upstream monitor lifecycle is likely involved.

Secure remediation checklist that worked best for us

(Generalized to avoid host-specific details)

1) Eliminate package duplication first

  • Verify only one canonical global installation is active.
  • Remove legacy/conflicting package names that may still register plugins.
  • Re-check plugin load path consistency.

2) Validate Teams auth fields completely

  • Ensure app ID, tenant ID, and app password (client secret value) are all present.
  • Confirm config validation passes before restart.

3) Restart and verify by behavior, not process state

  • Don’t trust “running” status alone.
  • Verify no new auto-restart attempt lines over a timed observation window.
  • Verify inbound/outbound Teams flow during that same window.

4) If still looping, classify as likely upstream lifecycle bug

  • Capture exact log sequence around each restart boundary.
  • Attach sanitized sequence to issue/PR for maintainers.

What to include in repro data (high value for maintainers)

Recommend sharing these artifacts (all sanitized):

  • startup line for msteams provider
  • user/group/channel resolution lines
  • first line indicating restart scheduling
  • health-monitor line indicating “reason: stopped” (if present)
  • whether duplicate package installations were found
  • whether credentials were complete at time of test
  • whether loop persists after those are corrected

Suggested maintainer-facing acceptance test

A robust guardrail test for this bug class would assert:

  1. startAccount() for Teams does not resolve while provider is healthy.
  2. Resolver success (users/channels) alone does not trigger subsystem restart logic.
  3. Restart path only activates on explicit stop, fatal error, or abort signal.
  4. No duplicate listener bind occurs during healthy run (prevents EADDRINUSE cascades).

Current status from this field report

  • Local config/package hygiene issues: addressed.
  • Remaining loop signature: still consistent with upstream lifecycle bug described in this issue family.
  • Net: this report supports merging a monitor/start lifecycle fix that keeps provider task pending until true shutdown.

If helpful, I can provide a compact sanitized log excerpt in follow-up showing exact line ordering (startup → resolution → restart) without any identifiers.

@aymiee

aymiee commented Mar 2, 2026

I've validated this PR against latest main and it resolves the restart loop for our MSTeams deployment. This is currently blocking our testing of the MSTeams provider. Is there anything preventing a merge?

@BradGroux (Contributor)

> I've validated this PR against latest main and it resolves the restart loop for our MSTeams deployment. This is currently blocking our testing of the MSTeams provider. Is there anything preventing a merge?

I posted an update to fix the Teams issue three weeks ago - #13089

I even got Peter to reply to a request to review Teams issues and PRs, but that hasn't happened. Sadly, Teams isn't a priority; it has been broken for over a month and there have been dozens of issues and PRs submitted.

@steipete steipete merged commit 1567713 into openclaw:main Mar 2, 2026
26 checks passed
@steipete

steipete commented Mar 2, 2026

Landed via temp rebase onto main.

  • Gate: pnpm vitest run extensions/msteams/src/monitor.lifecycle.test.ts src/gateway/server-channels.test.ts --maxWorkers=1; pnpm check && pnpm build
  • Land commit: b745bc2
  • Merge commit: 1567713

Thanks @chilu18!

@aymiee

aymiee commented Mar 3, 2026

Thank you all! I'm back up and running with MSTeams.

john-ver pushed a commit to apmcoin/apmclaw that referenced this pull request Mar 9, 2026
dustin-olenslager pushed a commit to dustin-olenslager/ironclaw-supreme that referenced this pull request Mar 24, 2026

Labels

  • channel: msteams (Channel integration: msteams)
  • size: M
  • stale (Marked as stale due to inactivity)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: MSTeams adapter: startMonitor() returns immediately causing EADDRINUSE restart loop

4 participants