Graceful degradation for multi-server proxy setup #3546
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ef2a0de915
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```diff
-        except Exception:
-            # Clean up any transports created before the failure
-            for t in self._transports:
-                await t.close()
-            self._transports = []
-            raise
+        except Exception:  # Broad catch is intentional: failure modes
+            # are diverse (OSError, TimeoutError, RuntimeError, etc.)
+            # and the whole point is to skip any server that can't connect.
+            logger.warning(
+                "Failed to connect to MCP server %r, skipping",
+                name,
+                exc_info=True,
+            )
+            continue
```
Preserve config errors instead of skipping the server
This `except Exception` is broader than the graceful-degradation goal: `_create_proxy()` also performs local setup such as `ToolTransform(...)`, and that code raises deterministic config errors (for example, duplicate rename targets raise `ValueError` in `ToolTransform.__init__`). With this change, a multi-server config with one malformed server and one healthy server now succeeds with the bad server silently omitted, instead of surfacing the misconfiguration to the caller. That makes bad configs much harder to detect and debug.
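To make the suggested narrowing concrete, here is a minimal sketch that separates the network connection from local transform setup, so deterministic config errors still propagate. `build_proxies`, `connect`, and `apply_transforms` are hypothetical stand-ins rather than actual fastmcp API, and the exception tuple is illustrative only:

```python
import logging

logger = logging.getLogger(__name__)

async def build_proxies(servers: dict, connect, apply_transforms) -> dict:
    """Sketch: skip only connection failures; let config errors surface."""
    proxies = {}
    for name, server in servers.items():
        try:
            transport = await connect(name, server)  # network I/O: may fail
        except (OSError, TimeoutError):  # illustrative, not exhaustive
            logger.warning(
                "Failed to connect to MCP server %r, skipping",
                name,
                exc_info=True,
            )
            continue
        # Local, deterministic setup runs outside the catch, so a malformed
        # config (e.g. duplicate rename targets) raises to the caller.
        proxies[name] = apply_transforms(name, server, transport)
    return proxies
```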
Intentional — the whole point of graceful degradation is to skip any server that can't connect, and the failure modes are too diverse to enumerate meaningfully (`OSError`, `TimeoutError`, `RuntimeError`, MCP errors, etc.). The warning log with `exc_info=True` ensures config errors are visible, not silent.
Test Failure Analysis
Summary: The Windows Python 3.10 test job failed with a timeout in test_mcp_config.py.
Root Cause: The failure is not caused by this PR's logic changes; the PR only modifies the MCPConfigTransport setup path and its tests, and the hang itself is a Windows subprocess-spawn issue. However, the new tests added by this PR in test_mcp_config.py spawn subprocesses without the timeout marker that the existing subprocess tests carry, so the hang stalls the whole job instead of failing fast.
Suggested Solution: add @pytest.mark.timeout(15) to the new subprocess-spawning tests, matching the existing pattern (sketched after the detailed analysis below).
Detailed Analysis
Failing job: Windows, Python 3.10.
Timeout log excerpt: 15 tests passed and the 16th timed out. The 16th test in the file is the new test_multi_server_partial_failure. The stack trace shows the test waiting in asyncio's selector event loop for the subprocess connection — a Windows subprocess spawn issue.
Existing timeout pattern in test_mcp_config.py — the existing subprocess tests carry the decorator:

```python
@pytest.mark.timeout(15)
async def test_multi_server_config_transport(tmp_path: Path):
```

The new tests lack this:

```python
async def test_multi_server_partial_failure(tmp_path: Path, server_order: dict):
    # spawns a Python subprocess — no timeout marker
```

All other jobs passed: Python 3.10 on Ubuntu, Python 3.13 on Ubuntu, Integration tests, and Tests with lowest-direct dependencies.
Related Files: test_mcp_config.py (new tests), config.py (proxy setup).
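A minimal sketch of the suggested fix, assuming the repo's existing pytest-timeout setup; the test body is elided, and `server_order` is the PR's fixture, shown only for the signature:

```python
import pytest
from pathlib import Path

# Suggested fix (sketch): give the new subprocess-spawning test the same
# timeout guard as the existing ones, so a Windows spawn hang fails fast
# instead of stalling the whole job.
@pytest.mark.timeout(15)
async def test_multi_server_partial_failure(tmp_path: Path, server_order: dict):
    ...  # body unchanged; spawns a Python subprocess
```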
Force-pushed from ef2a0de to 6b24ee4.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6b24ee413c
| "Failed to connect to MCP server %r, skipping", | ||
| name, | ||
| exc_info=True, | ||
| ) | ||
| except Exception: | ||
| # Clean up any transports created before the failure | ||
| for t in self._transports: | ||
| await t.close() | ||
| self._transports = [] | ||
| raise | ||
| continue |
Disconnect partially connected backends before skipping them
If a transforming server fails after `_create_proxy()` has already connected it — for example, a `tools` mapping that renames two backend tools to the same target makes `ToolTransform(...)` raise at config.py:183-197 — this handler only logs and continues. In that case the disconnect callbacks already pushed onto the shared `AsyncExitStack` do not run until the composite client exits, so the skipped server's subprocess/HTTP session stays alive for the caller's entire session even though that backend was omitted from the router. A sketch of one way to release those resources follows.
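For illustration, a hedged sketch of one possible cleanup, assuming `_create_proxy()` registers its disconnects on whatever stack it is handed; `mount_server` and `create_proxy` are hypothetical stand-ins, not fastmcp API:

```python
from contextlib import AsyncExitStack

async def mount_server(shared_stack: AsyncExitStack, name: str, server, create_proxy):
    # Stage disconnects privately; if create_proxy raises, exiting the
    # `async with` unwinds `staging`, closing the partially connected
    # subprocess/HTTP session before the caller logs and skips this server.
    async with AsyncExitStack() as staging:
        proxy = await create_proxy(staging, name, server)  # may raise
        # Success: move the disconnects to the shared stack so they run at
        # session end, the same lifecycle as other mounted servers.
        shared_stack.push_async_callback(staging.pop_all().aclose)
        return proxy
```

The `pop_all()` transfer means a fully mounted server keeps exactly the lifecycle the author describes below, while a failed one is torn down immediately.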
The `AsyncExitStack` handles cleanup when the session exits — the subprocess isn't leaked, just kept alive until session end. This is the same lifecycle as successfully mounted servers. The edge case (`ToolTransform` raising after connection) is narrow enough that the mild resource waste doesn't warrant additional complexity.
`MCPConfigTransport` currently fails the entire proxy setup if any single upstream server can't connect — even when other servers in the config are perfectly healthy. This contradicts the graceful-failure philosophy that `AggregateProvider` already implements at runtime, where individual provider failures are logged and skipped.

The setup phase now matches that philosophy: each `_create_proxy()` call is individually wrapped, failures are logged at WARNING level with full tracebacks, and the composite is built from whichever servers succeed. If all servers fail, a `ConnectionError` is raised rather than silently returning an empty composite.

Closes #3533
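For readers skimming the diff, a minimal sketch of the setup loop described above; `create_proxy` and `compose` are hypothetical stand-ins for `_create_proxy` and the composite construction, not the actual implementation:

```python
import logging

logger = logging.getLogger(__name__)

async def connect_all(servers: dict, create_proxy, compose):
    mounted = {}
    for name, server in servers.items():
        try:
            mounted[name] = await create_proxy(name, server)
        except Exception:
            # Graceful degradation: log the full traceback and move on.
            logger.warning(
                "Failed to connect to MCP server %r, skipping",
                name,
                exc_info=True,
            )
    if servers and not mounted:
        # All upstreams failed: surface it rather than returning an
        # empty composite.
        raise ConnectionError("Failed to connect to any MCP server")
    return compose(mounted)
```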