fix(msteams): prevent EADDRINUSE crash on duplicate provider start #49583
Conversation
Greptile Summary: This PR addresses two independent issues. The singleton guard approach is the right fix for the described race condition, and the accompanying regression test is well-structured. However, two issues were found:
Confidence Score: 3/5
extensions/msteams/src/monitor.ts
Outdated
```typescript
if (activeInstancePromise) {
  const core = getMSTeamsRuntime();
  const log = core.logging.getChildLogger({ name: "msteams" });
  log.warn?.("msteams provider already started; returning existing instance");
  return activeInstancePromise;
}
```
**Second caller's `abortSignal` is silently discarded**
When a duplicate call arrives with a different AbortSignal, the guard returns the existing promise — but _startMSTeamsProvider is already wired to the first call's signal via keepHttpServerTaskAlive. The second caller's signal is completely ignored, meaning:
- Aborting the second caller will not stop the server.
- Aborting the first caller will stop the server even if the second caller never intended it.
In practice, both callers from the gateway race may share the same controller, so this might not manifest today. But this is a hidden contract that could bite future callers. At minimum, a log warning that the incoming signal is being ignored would make debugging easier. A more robust approach is to check whether the already-stored promise was created with a compatible (same or parent) signal and warn loudly if not:
```typescript
if (activeInstancePromise) {
  const core = getMSTeamsRuntime();
  const log = core.logging.getChildLogger({ name: "msteams" });
  log.warn?.("msteams provider already started; returning existing instance — incoming abortSignal ignored");
  return activeInstancePromise;
}
```

Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/msteams/src/monitor.ts
Line: 78-83
Comment:
**Second caller's `abortSignal` is silently discarded**
When a duplicate call arrives with a *different* `AbortSignal`, the guard returns the existing promise — but `_startMSTeamsProvider` is already wired to the first call's signal via `keepHttpServerTaskAlive`. The second caller's signal is completely ignored, meaning:
- Aborting the second caller will **not** stop the server.
- Aborting the first caller **will** stop the server even if the second caller never intended it.
In practice, both callers from the gateway race may share the same controller, so this might not manifest today. But this is a hidden contract that could bite future callers. At minimum, a log warning that the incoming signal is being ignored would make debugging easier. A more robust approach is to check whether the already-stored promise was created with a compatible (same or parent) signal and warn loudly if not:
```typescript
if (activeInstancePromise) {
  const core = getMSTeamsRuntime();
  const log = core.logging.getChildLogger({ name: "msteams" });
  log.warn?.("msteams provider already started; returning existing instance — incoming abortSignal ignored");
  return activeInstancePromise;
}
```
How can I resolve this? If you propose a fix, please make it concise.
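For context, the singleton-guard shape this comment discusses can be modeled in isolation. The sketch below is simplified and self-contained — the real module wires logging and the Express server, so `Provider`, `startProviderOnce`, and `resetProviderInstance` here are illustrative stand-ins, not the extension's actual API:

```typescript
// A simplified model of the singleton guard: duplicate start calls get
// the in-flight promise instead of binding the port a second time.
// Names and types are illustrative, not the real msteams module.

type Provider = { id: number; shutdown: () => void };

let activeInstancePromise: Promise<Provider> | null = null;
let nextId = 0;

function startProviderOnce(): Promise<Provider> {
  if (activeInstancePromise) {
    // Duplicate caller: reuse the existing (possibly still starting) instance.
    return activeInstancePromise;
  }
  activeInstancePromise = (async () => {
    const id = ++nextId; // stands in for creating the Express app / listen()
    const provider: Provider = {
      id,
      shutdown() {
        // Clearing the guard on shutdown lets the provider restart cleanly.
        activeInstancePromise = null;
      },
    };
    return provider;
  })();
  return activeInstancePromise;
}

// Test-only reset, analogous to resetMSTeamsProviderInstance() in the PR.
function resetProviderInstance(): void {
  activeInstancePromise = null;
}
```

Concurrent callers observe the same instance, and a shutdown-then-start pair yields a fresh one — which is exactly the lifecycle the regression test needs to pin down.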
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 795ec4212f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
extensions/msteams/src/monitor.ts
Outdated
```typescript
if (activeInstancePromise) {
  const core = getMSTeamsRuntime();
  const log = core.logging.getChildLogger({ name: "msteams" });
  log.warn?.("msteams provider already started; returning existing instance");
  return activeInstancePromise;
```
Honor later abort signals when deduplicating provider start
Returning the first in-flight promise here drops lifecycle inputs from subsequent calls, especially abortSignal. In the duplicate-start race this change is targeting, the gateway creates a new AbortController per start attempt (src/gateway/server-channels.ts), but stop/restart aborts only the controller currently stored for that account; when the second start call is deduped to the first promise, its signal is never connected to the running provider, so stop can abort the wrong controller and leave the Teams server running (or stopChannel waiting indefinitely). Deduped starts need to merge/track later abort signals instead of ignoring them.
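One way to merge later signals (a sketch, not the PR's implementation): keep a single internal `AbortController` for the running provider and forward every caller's signal into it, so aborting any caller still reaches the server. All names below are illustrative assumptions:

```typescript
// Sketch: dedupe the start call but forward every caller's AbortSignal
// into one internal controller, so a later caller's abort is not dropped.

let inflight: Promise<string> | null = null;
let internal: AbortController | null = null;

function forwardAbort(signal: AbortSignal | undefined): void {
  if (!signal || !internal) return;
  const ctl = internal;
  if (signal.aborted) {
    ctl.abort(signal.reason);
  } else {
    signal.addEventListener("abort", () => ctl.abort(signal.reason), { once: true });
  }
}

async function startDeduped(signal?: AbortSignal): Promise<string> {
  if (inflight) {
    forwardAbort(signal); // honor the second caller's signal instead of ignoring it
    return inflight;
  }
  internal = new AbortController();
  forwardAbort(signal);
  inflight = Promise.resolve("provider-instance"); // stands in for the real startup
  return inflight;
}

function providerAborted(): boolean {
  return internal?.signal.aborted ?? false;
}
```

The trade-off: with merged signals, *any* caller's abort stops the provider, which may be broader than intended — the alternative of reference-counting callers and stopping only when all have aborted is also viable but more invasive.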
Thanks for digging into this and for putting together the repro/test case. I pushed an update to this branch that fixes the duplicate-start race in the shared gateway channel manager instead of inside the MS Teams monitor. Why I changed direction:
I also removed the extra authz change. Reasoning there:
So the branch is now scoped to the actual root cause: serializing per-account startup in the gateway. Tests I ran:
Thanks again — I think this ends up as a stronger fix with a better regression test surface.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e6e9e867ea
```typescript
const existingStart = store.starting.get(id);
if (existingStart) {
  await existingStart;
  return;
```
Don't drop restart requests while an earlier start is pending
When a channel is still inside an async preflight such as plugin.config.isConfigured (for example extensions/zalouser/src/shared.ts:70), stopChannel() aborts it but returns before store.starting clears. The next startChannel() in restart flows like src/gateway/server-reload-handlers.ts:140-143 lands here, waits for the old start gate, and then returns without scheduling a new task. Once the original start notices the abort/manual stop, it exits and the channel stays offline even though the restart path completed successfully.
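A minimal sketch of one possible remedy, assuming a `store` with `starting` and `tasks` maps as described above (the shape is inferred from this comment, not copied from the repo): after waiting on an earlier start gate, re-check whether the channel actually came up, and fall through to a fresh start if it did not:

```typescript
// Sketch: avoid dropping restarts. If the earlier start was aborted, its
// gate resolves without registering a task, so we start again rather
// than returning early. Store shape is an assumption for illustration.

type Store = {
  starting: Map<string, Promise<void>>;
  tasks: Map<string, { running: boolean }>;
};

async function startChannel(
  store: Store,
  id: string,
  boot: () => Promise<void>,
): Promise<void> {
  const existingStart = store.starting.get(id);
  if (existingStart) {
    await existingStart;
    // Only short-circuit if the earlier start actually produced a task;
    // an aborted start leaves no task, so schedule a new one.
    if (store.tasks.has(id)) return;
  }
  const gate = (async () => {
    await boot();
    store.tasks.set(id, { running: true });
  })();
  store.starting.set(id, gate);
  try {
    await gate;
  } finally {
    store.starting.delete(id);
  }
}
```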
Force-pushed e6e9e86 to f80ed12
Thanks @sudie-codes! Landed via temp rebase onto main. The broader CI/gate issues were being fixed separately on main during landing.
Thanks again @sudie-codes! The broader CI/gate issues were being fixed separately on main during landing.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f80ed12ece
```typescript
const existingStart = store.starting.get(id);
if (existingStart) {
  await existingStart;
  return;
```
Propagate preflight errors to deduplicated start callers
existingStart only waits on startGate, and startGate is always resolved in the finally block. If the original startup rejects before store.tasks is handed off — for example an async plugin.config.isConfigured() check like extensions/zalouser/src/shared.ts:70 throws — the first startChannel() rejects but every overlapping caller that hits this branch resolves successfully. Callers such as src/gateway/channel-health-monitor.ts:164-168 then record a successful restart and enter cooldown even though the channel never came up. The deduplicated path needs to await the original startup result, not a separate resolve-only gate.
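A sketch of the fix shape this comment suggests, under the same assumed store layout (not the repo's actual code): store the real startup promise — one that can reject — so every deduplicated caller awaits the same outcome as the first:

```typescript
// Sketch: deduped callers await the actual startup promise, so a
// preflight failure rejects for all of them instead of resolving a
// separate resolve-only gate. Names are illustrative assumptions.

const starting = new Map<string, Promise<void>>();

async function startOnce(id: string, preflight: () => Promise<void>): Promise<void> {
  const existing = starting.get(id);
  if (existing) {
    return existing; // a rejection propagates to this caller too
  }
  const p = (async () => {
    await preflight(); // may throw, e.g. an isConfigured() check failing
  })();
  starting.set(id, p);
  try {
    await p;
  } finally {
    starting.delete(id);
  }
}
```

With this shape, a caller like a health monitor cannot record a successful restart off a start that actually failed, because the failure reaches every deduplicated awaiter.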
* main: (230 commits)
  - fix llm-task invalid thinking timeout
  - Build: narrow tsdown unresolved import guard
  - Plugins: sync contract registry image providers
  - Build: fail on unresolved tsdown imports
  - Deps: align pi-agent-core for declaration builds
  - Build: fail on plugin SDK declaration errors
  - Release: add plugin npm publish workflow (openclaw#47678)
  - fix(security): block build-tool and glibc env injection vectors in host exec sandbox (openclaw#49702)
  - test simplify zero-state boundary guards
  - ci enforce boundary guardrails
  - test: enable vmForks for targeted channel test runs
  - test(telegram): fix incomplete sticker-cache mocks in tests
  - Config: align model compat thinking format types
  - Tlon: pin api-beta to current known-good commit
  - Plugin SDK: harden provider auth seams
  - fix(config): add missing qwen-chat-template to thinking format schema
  - Plugin SDK: register provider auth login entrypoint
  - Plugin SDK: split provider auth login seam
  - fix: serialize duplicate channel starts (openclaw#49583) (thanks @sudie-codes)
  - Telegram: fix reply-runtime test typings
  - ...
Summary
Fixes #25790, related #22169 — Duplicate `monitorMSTeamsProvider()` calls cause an EADDRINUSE crash loop when the gateway boots the provider twice due to a race condition.

What & Why
Problem: When the gateway starts the MS Teams provider twice (race condition during restart/deployment), the second call to `expressApp.listen(port)` throws `EADDRINUSE`, entering an infinite crash-restart loop. The bot becomes completely unresponsive until manually restarted.

Root cause: `monitorMSTeamsProvider()` in `monitor.ts` had no guard against being called concurrently or after already starting. Each invocation created a new Express app and attempted to bind the same port.
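The double-bind failure itself is easy to reproduce outside the extension. A minimal standalone Node sketch (nothing here comes from the extension code; the port is OS-assigned):

```typescript
import * as net from "node:net";

// Standalone reproduction of the root cause: two servers binding the same
// port in one process — the second listen() fails with EADDRINUSE.
async function bindTwice(): Promise<string> {
  const first = net.createServer();
  await new Promise<void>((resolve) => first.listen(0, resolve)); // OS picks a free port
  const port = (first.address() as net.AddressInfo).port;

  const second = net.createServer();
  const code = await new Promise<string>((resolve) => {
    second.once("error", (err: NodeJS.ErrnoException) => resolve(err.code ?? "unknown"));
    second.listen(port); // same port, same process: rejected by the OS
  });

  first.close();
  return code;
}
```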
Fix: Refactored `monitorMSTeamsProvider()` into a thin singleton wrapper:
- an `activeInstance` variable stores the in-flight startup `Promise`
- on `shutdown()`, the guard is cleared so the provider can restart cleanly
- `resetMSTeamsProviderInstance()` for test isolation

Files changed:
`extensions/msteams/src/monitor.ts`, `extensions/msteams/src/monitor.lifecycle.test.ts`

Screenshots
N/A — This is a server-side startup guard fix. The race condition occurs during gateway boot when `monitorMSTeamsProvider()` is called twice before the first call completes. Verification requires simulating concurrent provider starts. The fix is validated via unit test.

Test Results
AI Disclosure