
Graceful degradation for multi-server proxy setup #3546

Merged
jlowin merged 2 commits into main from fix/graceful-multi-server-proxy-setup on Mar 18, 2026

Conversation

@jlowin (Member) commented Mar 18, 2026

MCPConfigTransport currently fails the entire proxy setup if any single upstream server can't connect — even when other servers in the config are perfectly healthy. This contradicts the graceful-failure philosophy that AggregateProvider already implements at runtime, where individual provider failures are logged and skipped.

The setup phase now matches that philosophy: each _create_proxy() call is individually wrapped, failures are logged at WARNING level with full tracebacks, and the composite is built from whichever servers succeed. If all servers fail, a ConnectionError is raised rather than silently returning an empty composite.

```python
config = {
    "mcpServers": {
        "healthy": {"command": "python", "args": ["server.py"]},
        "unreachable": {"command": "nonexistent-binary"},
    }
}

# Before: ConnectionError (total failure)
# After: works — tools from "healthy" are available,
#        WARNING log for "unreachable"
client = Client(config)
async with client:
    tools = await client.list_tools()
```
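The setup loop itself reduces to the pattern below. This is a simplified sketch, not the actual config.py code: `create_proxy` and `setup_proxies` are hypothetical stand-ins for `_create_proxy()` and the surrounding setup logic.

```python
import logging

logger = logging.getLogger(__name__)

async def create_proxy(name: str, spec: dict) -> str:
    # Hypothetical stand-in for MCPConfigTransport._create_proxy().
    if spec.get("command") == "nonexistent-binary":
        raise ConnectionError(f"cannot launch server {name!r}")
    return f"proxy:{name}"

async def setup_proxies(servers: dict) -> list[str]:
    transports: list[str] = []
    for name, spec in servers.items():
        try:
            transports.append(await create_proxy(name, spec))
        except Exception:
            # Per-server failure: log with full traceback and keep going.
            logger.warning(
                "Failed to connect to MCP server %r, skipping", name, exc_info=True
            )
            continue
    if not transports:
        # All servers failed: raise rather than return an empty composite.
        raise ConnectionError("could not connect to any MCP server in the config")
    return transports
```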

Closes #3533

@jlowin jlowin added the enhancement Improvement to existing functionality. For issues and smaller PR improvements. label Mar 18, 2026
@marvin-context-protocol marvin-context-protocol Bot added the client Related to the FastMCP client SDK or client-side functionality. label Mar 18, 2026

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ef2a0de915


Comment on lines +115 to +123
```diff
+            except Exception:  # Broad catch is intentional: failure modes
+                # are diverse (OSError, TimeoutError, RuntimeError, etc.)
+                # and the whole point is to skip any server that can't connect.
+                logger.warning(
+                    "Failed to connect to MCP server %r, skipping",
+                    name,
+                    exc_info=True,
+                )
-        except Exception:
-            # Clean up any transports created before the failure
-            for t in self._transports:
-                await t.close()
-            self._transports = []
-            raise
+                continue
```

P2: Preserve config errors instead of skipping the server

This except Exception is broader than the graceful-degradation goal: _create_proxy() also performs local setup such as ToolTransform(...), and that code raises deterministic config errors (for example, duplicate rename targets raise ValueError in ToolTransform.__init__). With this change, a multi-server config that has one malformed server and one healthy server now succeeds with the bad server silently omitted instead of surfacing the misconfiguration to the caller, which makes bad configs much harder to detect and debug.
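A self-contained illustration of this concern (all names here are invented, not FastMCP's API): to a broad except, a deterministic ValueError from local validation looks exactly like a transient connection failure, so the misconfigured server vanishes without an error.

```python
def create_proxy(name: str, spec: dict) -> str:
    # Hypothetical stand-in: local validation raises deterministically,
    # while launching the server can fail transiently.
    if spec.get("duplicate_rename"):
        raise ValueError(f"duplicate rename target in {name!r}")  # config bug
    if spec.get("unreachable"):
        raise OSError(f"cannot spawn {name!r}")  # environmental failure
    return f"proxy:{name}"

servers = {"good": {}, "typo": {"duplicate_rename": True}}
proxies = []
for name, spec in servers.items():
    try:
        proxies.append(create_proxy(name, spec))
    except Exception:
        continue  # swallows the ValueError along with connection errors

assert proxies == ["proxy:good"]  # the config bug never surfaced to the caller
```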


@jlowin (Member, Author) replied:

Intentional — the whole point of graceful degradation is to skip any server that can't connect, and the failure modes are too diverse to enumerate meaningfully (OSError, TimeoutError, RuntimeError, MCP errors, etc.). The warning log with exc_info=True ensures config errors are visible, not silent.
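For reference, exc_info=True is what carries the traceback into the log record. This standalone snippet (hypothetical logger name) shows the effect:

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("fastmcp.demo")  # hypothetical logger name

try:
    raise ValueError("duplicate rename target")  # stand-in for a config error
except Exception:
    # exc_info=True attaches the full traceback to the WARNING record,
    # so the swallowed exception is still visible in the logs.
    logger.warning("Failed to connect to MCP server %r, skipping", "typo", exc_info=True)
```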

@marvin-context-protocol (Contributor) commented:

Test Failure Analysis

Summary: The Windows Python 3.10 test job failed with a timeout in tests/server/test_server.py, specifically at TestSettingsFromEnvironment::test_server_starts_without_auth. This test was the 16th test executed in the file and hit the default 5-second timeout while spawning a Python subprocess on Windows.

Root Cause: The failure is not caused by this PR's changes. The PR only modifies src/fastmcp/client/transports/config.py and tests/test_mcp_config.py. The timing out test (test_server_starts_without_auth) was not changed and uses PythonStdioTransport to spawn a Python subprocess — a notoriously slow operation on Windows where 5 seconds is an insufficient timeout.

However, the new tests added by this PR in tests/test_mcp_config.py (test_multi_server_partial_failure, test_multi_server_partial_failure_logs_warning, test_multi_server_partial_failure_cleanup) also spawn Python subprocesses and do not have a @pytest.mark.timeout(15) decorator. The existing subprocess tests in that file (e.g., test_multi_server_config_transport at line 843) use @pytest.mark.timeout(15). The new tests did not get the same treatment. The test runner never reached test_mcp_config.py in this run, but if it did, those tests would also likely time out on Windows.

Suggested Solution:

  1. For the pre-existing flaky test (test_server_starts_without_auth in tests/server/test_server.py): Add @pytest.mark.timeout(15) to the test — it spawns a Python subprocess and 5 seconds is insufficient on Windows.

  2. For the new tests in this PR (tests/test_mcp_config.py): Add @pytest.mark.timeout(15) to the three new tests that spawn Python subprocesses:

    • test_multi_server_partial_failure
    • test_multi_server_partial_failure_logs_warning
    • test_multi_server_partial_failure_cleanup

    These tests follow the same pattern as existing subprocess-based tests in the same file (e.g., test_multi_server_config_transport) which already have this marker.
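Concretely, the fix is just the marker the file already uses, applied to the new tests (sketch, assuming the suite's existing pytest-timeout setup):

```python
import pytest

@pytest.mark.timeout(15)  # subprocess spawn on Windows regularly exceeds the 5s default
async def test_multi_server_partial_failure(tmp_path):
    ...  # spawns a Python subprocess, same pattern as test_multi_server_config_transport
```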

Detailed Analysis

Failing job: Tests: Python 3.10 on windows-latest (Job ID: 67597415991)

Timeout log excerpt:

```
tests\server\test_server.py ..............+++++++++++++++++++++++++++++++++++ Timeout +++++++++++++++++++++++++++++++++++
                    DEBUG    Stdio transport connected             stdio.py:191
```

15 tests passed, the 16th timed out. The 16th test in the file is TestSettingsFromEnvironment::test_server_starts_without_auth.

Stack trace shows the test was waiting in asyncio's selector/event loop waiting for the subprocess connection — a Windows subprocess spawn issue.

Existing timeout pattern in test_mcp_config.py — existing subprocess tests have the decorator:

```python
@pytest.mark.timeout(15)
async def test_multi_server_config_transport(tmp_path: Path):
```

New tests lack this:

```python
async def test_multi_server_partial_failure(tmp_path: Path, server_order: dict):
    # spawns python subprocess — no timeout marker
```

All other jobs passed: Python 3.10 on Ubuntu, Python 3.13 on Ubuntu, Integration tests, Tests with lowest-direct dependencies.

Related Files
  • tests/server/test_server.py (line 250): test_server_starts_without_auth — pre-existing test without timeout decorator, spawns subprocess via PythonStdioTransport
  • tests/test_mcp_config.py (new tests ~line 1009+): test_multi_server_partial_failure, test_multi_server_partial_failure_logs_warning, test_multi_server_partial_failure_cleanup — new tests that spawn Python subprocesses without @pytest.mark.timeout(15)
  • tests/test_mcp_config.py (line 843, 422, 366): existing subprocess tests that correctly use @pytest.mark.timeout(15)
  • src/fastmcp/client/transports/config.py: the main logic change in this PR (not related to the failure)

@jlowin jlowin force-pushed the fix/graceful-multi-server-proxy-setup branch from ef2a0de to 6b24ee4 on March 18, 2026 16:33

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6b24ee413c


Comment on lines +119 to +123
"Failed to connect to MCP server %r, skipping",
name,
exc_info=True,
)
except Exception:
# Clean up any transports created before the failure
for t in self._transports:
await t.close()
self._transports = []
raise
continue

P2: Disconnect partially connected backends before skipping them

If a transforming server fails after _create_proxy() has already connected it—for example, a tools mapping that renames two backend tools to the same target makes ToolTransform(...) raise at config.py:183-197—this handler only logs and continues. In that case the disconnect callbacks already pushed onto the shared AsyncExitStack do not run until the composite client exits, so the skipped server's subprocess/HTTP session stays alive for the caller's entire session even though that backend was omitted from the router.


@jlowin (Member, Author) replied:

The AsyncExitStack handles cleanup when the session exits — the subprocess isn't leaked, just kept alive until session end. This is the same lifecycle as successfully-mounted servers. The edge case (ToolTransform raising after connection) is narrow enough that the mild resource waste doesn't warrant additional complexity.
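For context, the lifecycle being described is standard contextlib.AsyncExitStack behavior, not anything FastMCP-specific: callbacks pushed onto the stack run only when the stack itself exits. A generic sketch:

```python
import asyncio
from contextlib import AsyncExitStack

async def main():
    async with AsyncExitStack() as stack:
        async def disconnect():
            print("disconnect runs at session end")

        # The callback is deferred: even if this backend is later skipped,
        # its disconnect does not run until the stack unwinds.
        stack.push_async_callback(disconnect)
        print("session body: the backend process is still alive here")
    # Exiting the `async with` block unwinds the stack and runs disconnect().

asyncio.run(main())
```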

@jlowin jlowin merged commit 02d55de into main Mar 18, 2026
9 of 12 checks passed
@jlowin jlowin deleted the fix/graceful-multi-server-proxy-setup branch March 18, 2026 19:25

Labels

  • client: Related to the FastMCP client SDK or client-side functionality.
  • enhancement: Improvement to existing functionality. For issues and smaller PR improvements.


Development

Successfully merging this pull request may close these issues.

Warn instead of fail connection for single mcp failure on multi-server proxies
