Skip to content

fix(msteams): prevent EADDRINUSE crash on duplicate provider start#49583

Merged
vignesh07 merged 1 commit intoopenclaw:mainfrom
sudie-codes:fix/msteams-eaddrinuse-crash-loop
Mar 18, 2026
Merged

fix(msteams): prevent EADDRINUSE crash on duplicate provider start#49583
vignesh07 merged 1 commit intoopenclaw:mainfrom
sudie-codes:fix/msteams-eaddrinuse-crash-loop

Conversation

@sudie-codes
Copy link
Copy Markdown
Contributor

@sudie-codes sudie-codes commented Mar 18, 2026

Summary

Fixes #25790, related #22169 — Duplicate monitorMSTeamsProvider() calls cause EADDRINUSE crash loop when the gateway boots the provider twice due to a race condition.

What & Why

Problem: When the gateway starts the MS Teams provider twice (race condition during restart/deployment), the second call to expressApp.listen(port) throws EADDRINUSE, entering an infinite crash-restart loop. The bot becomes completely unresponsive until manually restarted.

Root cause: monitorMSTeamsProvider() in monitor.ts had no guard against being called concurrently or after already starting. Each invocation created a new Express app and attempted to bind the same port.

Fix: Refactored monitorMSTeamsProvider() into a thin singleton wrapper:

  • Module-level activeInstance variable stores the in-flight startup Promise
  • Duplicate calls (whether startup is still in-flight or already running) return the same promise immediately with a warning log
  • On startup failure, the guard is cleared so a legitimate retry can proceed
  • On shutdown(), the guard is cleared so the provider can restart cleanly
  • Exported resetMSTeamsProviderInstance() for test isolation

Files changed: extensions/msteams/src/monitor.ts, extensions/msteams/src/monitor.lifecycle.test.ts

Screenshots

N/A — This is a server-side startup guard fix. The race condition occurs during gateway boot when monitorMSTeamsProvider() is called twice before the first call completes. Verification requires simulating concurrent provider starts. The fix is validated via unit test.

Test Results

  • New regression test: "returns existing instance on duplicate start without crashing" — verifies both calls return the same Promise reference
  • All 235 msteams tests pass (27 test files, ~9s)

AI Disclosure

  • AI-assisted (Claude Code with team orchestration)
  • Fully tested — new regression test + all existing tests pass
  • I understand what the code does: singleton guard prevents duplicate Express server bind by returning the existing startup promise on re-entrant calls

@openclaw-barnacle openclaw-barnacle bot added channel: msteams Channel integration: msteams size: S labels Mar 18, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 18, 2026

Greptile Summary

This PR addresses two independent issues: a singleton guard in monitor.ts to prevent EADDRINUSE crash loops caused by duplicate provider starts, and a security hardening fix in message-handler.ts that closes a sender-allowlist bypass (GHSA-g7cr-9h7q-4qxq).

The singleton guard approach is the right fix for the described race condition and the accompanying regression test is well-structured. However, two issues were found:

  • Premature singleton clear in shutdown() (monitor.ts lines 350–361): activeInstancePromise is set to null at the very start of shutdown(), before httpServer.close() completes. Any monitorMSTeamsProvider() call that races into the window between the clear and the port being released will spin up a new _startMSTeamsProvider(), hit expressApp.listen() while the port is still occupied, and reproduce the EADDRINUSE crash loop. The sentinel should be cleared inside the httpServer.close() callback, after the port is confirmed released.

  • Second caller's AbortSignal is silently discarded (monitor.ts lines 78–83): When a duplicate call carries a different AbortSignal, the guard returns the existing promise but ignores the new signal entirely. This means aborting the second caller will not stop the server, and aborting the first will stop it regardless of the second caller's intent. The warning message should call this out explicitly, and ideally the guard should handle or at least log the signal mismatch.

Confidence Score: 3/5

  • The singleton guard partially solves the EADDRINUSE problem, but a race window in shutdown() can re-introduce it on server restart.
  • The core idea is correct and the security fix looks sound, but the premature activeInstancePromise = null in shutdown() leaves a meaningful race that undermines the stated goal of the PR. The AbortSignal discard is a less urgent but still real lifecycle bug for callers with independent abort controllers.
  • extensions/msteams/src/monitor.ts — specifically the shutdown() closure and the duplicate-call guard.

Comments Outside Diff (1)

  1. extensions/msteams/src/monitor.ts, line 350-361 (link)

    P1 Singleton cleared before server close completes

    activeInstancePromise = null is set at the very beginning of shutdown(), before httpServer.close() finishes. This creates a race window: any caller that invokes monitorMSTeamsProvider() after shutdown() is entered but before the OS releases the port will start a fresh _startMSTeamsProvider(), hit expressApp.listen() while the old server is still closing, and get EADDRINUSE — the exact crash loop this PR is trying to prevent.

    The sentinel should be cleared only after the port is confirmed released:

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: extensions/msteams/src/monitor.ts
    Line: 350-361
    
    Comment:
    **Singleton cleared before server close completes**
    
    `activeInstancePromise = null` is set at the very beginning of `shutdown()`, before `httpServer.close()` finishes. This creates a race window: any caller that invokes `monitorMSTeamsProvider()` after `shutdown()` is entered but before the OS releases the port will start a fresh `_startMSTeamsProvider()`, hit `expressApp.listen()` while the old server is still closing, and get `EADDRINUSE` — the exact crash loop this PR is trying to prevent.
    
    The sentinel should be cleared only after the port is confirmed released:
    
    
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: extensions/msteams/src/monitor.ts
Line: 350-361

Comment:
**Singleton cleared before server close completes**

`activeInstancePromise = null` is set at the very beginning of `shutdown()`, before `httpServer.close()` finishes. This creates a race window: any caller that invokes `monitorMSTeamsProvider()` after `shutdown()` is entered but before the OS releases the port will start a fresh `_startMSTeamsProvider()`, hit `expressApp.listen()` while the old server is still closing, and get `EADDRINUSE` — the exact crash loop this PR is trying to prevent.

The sentinel should be cleared only after the port is confirmed released:

```suggestion
  const shutdown = async () => {
    log.info("shutting down msteams provider");
    return new Promise<void>((resolve) => {
      httpServer.close((err) => {
        if (err) {
          log.debug?.("msteams server close error", { error: String(err) });
        }
        activeInstancePromise = null;
        resolve();
      });
    });
  };
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: extensions/msteams/src/monitor.ts
Line: 78-83

Comment:
**Second caller's `abortSignal` is silently discarded**

When a duplicate call arrives with a *different* `AbortSignal`, the guard returns the existing promise — but `_startMSTeamsProvider` is already wired to the first call's signal via `keepHttpServerTaskAlive`. The second caller's signal is completely ignored, meaning:

- Aborting the second caller will **not** stop the server.
- Aborting the first caller **will** stop the server even if the second caller never intended it.

In practice, both callers from the gateway race may share the same controller, so this might not manifest today. But this is a hidden contract that could bite future callers. At minimum, a log warning that the incoming signal is being ignored would make debugging easier. A more robust approach is to check whether the already-stored promise was created with a compatible (same or parent) signal and warn loudly if not:

```typescript
if (activeInstancePromise) {
  const core = getMSTeamsRuntime();
  const log = core.logging.getChildLogger({ name: "msteams" });
  log.warn?.("msteams provider already started; returning existing instance — incoming abortSignal ignored");
  return activeInstancePromise;
}
```

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: "fix(msteams): preven..."

Comment on lines +78 to +83
if (activeInstancePromise) {
const core = getMSTeamsRuntime();
const log = core.logging.getChildLogger({ name: "msteams" });
log.warn?.("msteams provider already started; returning existing instance");
return activeInstancePromise;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Second caller's abortSignal is silently discarded

When a duplicate call arrives with a different AbortSignal, the guard returns the existing promise — but _startMSTeamsProvider is already wired to the first call's signal via keepHttpServerTaskAlive. The second caller's signal is completely ignored, meaning:

  • Aborting the second caller will not stop the server.
  • Aborting the first caller will stop the server even if the second caller never intended it.

In practice, both callers from the gateway race may share the same controller, so this might not manifest today. But this is a hidden contract that could bite future callers. At minimum, a log warning that the incoming signal is being ignored would make debugging easier. A more robust approach is to check whether the already-stored promise was created with a compatible (same or parent) signal and warn loudly if not:

if (activeInstancePromise) {
  const core = getMSTeamsRuntime();
  const log = core.logging.getChildLogger({ name: "msteams" });
  log.warn?.("msteams provider already started; returning existing instance — incoming abortSignal ignored");
  return activeInstancePromise;
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/msteams/src/monitor.ts
Line: 78-83

Comment:
**Second caller's `abortSignal` is silently discarded**

When a duplicate call arrives with a *different* `AbortSignal`, the guard returns the existing promise — but `_startMSTeamsProvider` is already wired to the first call's signal via `keepHttpServerTaskAlive`. The second caller's signal is completely ignored, meaning:

- Aborting the second caller will **not** stop the server.
- Aborting the first caller **will** stop the server even if the second caller never intended it.

In practice, both callers from the gateway race may share the same controller, so this might not manifest today. But this is a hidden contract that could bite future callers. At minimum, a log warning that the incoming signal is being ignored would make debugging easier. A more robust approach is to check whether the already-stored promise was created with a compatible (same or parent) signal and warn loudly if not:

```typescript
if (activeInstancePromise) {
  const core = getMSTeamsRuntime();
  const log = core.logging.getChildLogger({ name: "msteams" });
  log.warn?.("msteams provider already started; returning existing instance — incoming abortSignal ignored");
  return activeInstancePromise;
}
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 795ec4212f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +78 to +82
if (activeInstancePromise) {
const core = getMSTeamsRuntime();
const log = core.logging.getChildLogger({ name: "msteams" });
log.warn?.("msteams provider already started; returning existing instance");
return activeInstancePromise;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Honor later abort signals when deduplicating provider start

Returning the first in-flight promise here drops lifecycle inputs from subsequent calls, especially abortSignal. In the duplicate-start race this change is targeting, the gateway creates a new AbortController per start attempt (src/gateway/server-channels.ts), but stop/restart aborts only the controller currently stored for that account; when the second start call is deduped to the first promise, its signal is never connected to the running provider, so stop can abort the wrong controller and leave the Teams server running (or stopChannel waiting indefinitely). Deduped starts need to merge/track later abort signals instead of ignoring them.

Useful? React with 👍 / 👎.

@vignesh07
Copy link
Copy Markdown
Contributor

Thanks for digging into this and for putting together the repro/test case.

I pushed an update to this branch that fixes the duplicate-start race in the shared gateway channel manager instead of inside the MS Teams monitor.

Why I changed direction:

  • The underlying race is in the shared startup path, where two overlapping startChannel calls can both get past the pre-start checks before the task is registered.
  • Fixing that in the gateway prevents duplicate boots for the same account across all channels, including MS Teams, instead of adding a Teams-only guard.
  • I added a regression test that reproduces the problem with concurrent starts, and a second one that verifies stopping an account mid-boot cancels startup cleanly.

I also removed the extra authz change from extensions/msteams/src/monitor-handler/message-handler.ts from this PR.

Reasoning there:

  • It looked unrelated to the duplicate-start bug this PR is fixing.
  • In the current flow, that hunk did not appear to change the final group authz decision, because route allowlist enforcement and sender allowlist enforcement already happen later in the handler.
  • There is already targeted coverage for that route-allowlist/sender-authz scenario, and it passes on current main without that change.

So the branch is now scoped to the actual root cause: serializing per-account startup in the gateway.

Tests I ran:

  • pnpm test -- src/gateway/server-channels.test.ts
  • pnpm test -- extensions/msteams/src/monitor.lifecycle.test.ts
  • pnpm exec oxfmt --check src/gateway/server-channels.ts src/gateway/server-channels.test.ts
  • pnpm exec oxlint src/gateway/server-channels.ts src/gateway/server-channels.test.ts

Thanks again — I think this ends up as a stronger fix with a better regression test surface.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e6e9e867ea

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +261 to 264
const existingStart = store.starting.get(id);
if (existingStart) {
await existingStart;
return;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Don't drop restart requests while an earlier start is pending

When a channel is still inside an async preflight such as plugin.config.isConfigured (for example extensions/zalouser/src/shared.ts:70), stopChannel() aborts it but returns before store.starting clears. The next startChannel() in restart flows like src/gateway/server-reload-handlers.ts:140-143 lands here, waits for the old start gate, and then returns without scheduling a new task. Once the original start notices the abort/manual stop, it exits and the channel stays offline even though the restart path completed successfully.

Useful? React with 👍 / 👎.

@vignesh07 vignesh07 force-pushed the fix/msteams-eaddrinuse-crash-loop branch from e6e9e86 to f80ed12 Compare March 18, 2026 08:57
@vignesh07 vignesh07 merged commit 1890089 into openclaw:main Mar 18, 2026
7 checks passed
@vignesh07
Copy link
Copy Markdown
Contributor

Thanks @sudie-codes!

Landed via temp rebase onto main.

  • Focused validation:
    •       -           -         - Source commit: https://github.com/openclaw/openclaw/commit/f80ed12ece232fca44cdee3d3384d207df70416b
      
  • Landed commit: 1890089

The broader CI/gate issues were being fixed separately on main during landing.

@vignesh07
Copy link
Copy Markdown
Contributor

vignesh07 commented Mar 18, 2026

Thanks again @sudie-codes!

The broader CI/gate issues were being fixed separately on main during landing.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f80ed12ece

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +261 to 264
const existingStart = store.starting.get(id);
if (existingStart) {
await existingStart;
return;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Propagate preflight errors to deduplicated start callers

existingStart only waits on startGate, and startGate is always resolved in the finally block. If the original startup rejects before store.tasks is handed off — for example an async plugin.config.isConfigured() check like extensions/zalouser/src/shared.ts:70 throws — the first startChannel() rejects but every overlapping caller that hits this branch resolves successfully. Callers such as src/gateway/channel-health-monitor.ts:164-168 then record a successful restart and enter cooldown even though the channel never came up. The deduplicated path needs to await the original startup result, not a separate resolve-only gate.

Useful? React with 👍 / 👎.

mrosmarin added a commit to mrosmarin/openclaw that referenced this pull request Mar 18, 2026
* main: (230 commits)
  fix llm-task invalid thinking timeout
  Build: narrow tsdown unresolved import guard
  Plugins: sync contract registry image providers
  Build: fail on unresolved tsdown imports
  Deps: align pi-agent-core for declaration builds
  Build: fail on plugin SDK declaration errors
  Release: add plugin npm publish workflow (openclaw#47678)
  fix(security): block build-tool and glibc env injection vectors in host exec sandbox (openclaw#49702)
  test simplify zero-state boundary guards
  ci enforce boundary guardrails
  test: enable vmForks for targeted channel test runs
  test(telegram): fix incomplete sticker-cache mocks in tests
  Config: align model compat thinking format types
  Tlon: pin api-beta to current known-good commit
  Plugin SDK: harden provider auth seams
  fix(config): add missing qwen-chat-template to thinking format schema
  Plugin SDK: register provider auth login entrypoint
  Plugin SDK: split provider auth login seam
  fix: serialize duplicate channel starts (openclaw#49583) (thanks @sudie-codes)
  Telegram: fix reply-runtime test typings
  ...
ssfdre38 pushed a commit to ssfdre38/openclaw-community-edition that referenced this pull request Mar 18, 2026
analysoor-assistant pushed a commit to analysoor-assistant/openclaw that referenced this pull request Mar 19, 2026
RickyTong1 pushed a commit to RickyTong1/openclaw that referenced this pull request Mar 19, 2026
brandontyler pushed a commit to brandontyler/clawdbot that referenced this pull request Mar 19, 2026
fuller-stack-dev pushed a commit to fuller-stack-dev/openclaw that referenced this pull request Mar 20, 2026
fuller-stack-dev pushed a commit to fuller-stack-dev/openclaw that referenced this pull request Mar 20, 2026
pholpaphankorn pushed a commit to pholpaphankorn/openclaw that referenced this pull request Mar 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gateway Gateway runtime size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

msteams plugin: duplicate provider start causes EADDRINUSE crash loop (bundled extension + standalone install conflict)

2 participants