Skip to content

Fix RetryMiddleware not retrying tool errors#3858

Merged
jlowin merged 3 commits intomainfrom
fix/retry-middleware-cause-chain
Apr 14, 2026
Merged

Fix RetryMiddleware not retrying tool errors#3858
jlowin merged 3 commits intomainfrom
fix/retry-middleware-cause-chain

Conversation

@strawgate
Copy link
Copy Markdown
Collaborator

Summary

RetryMiddleware never retried tool calls because FastMCP wraps tool exceptions as ToolError(...) from <original> before control returns to the middleware. The middleware's _should_retry() checked isinstance(error, retry_exceptions) but received ToolError, not the original ConnectionError.

The fix: also check error.__cause__. This matches the pattern already used by ErrorHandlingMiddleware._transform_error() (line 92).

# Before: tool called 1 time (no retries)
# After:  tool called 4 times (1 initial + 3 retries)

mcp.add_middleware(RetryMiddleware(max_retries=3, retry_exceptions=(ConnectionError,)))

@mcp.tool
def flaky() -> str:
    raise ConnectionError("transient")

Why wasn't this caught by tests?

The existing test_retry_middleware_with_transient_failures had zero assertions. It used a random fail rate (0.7) so some calls succeeded on the first try, creating the illusion that retries worked. The test passed whether retries fired or not.

What's in this PR

Source fix (4 lines): _should_retry() now checks error.__cause__ in addition to the error itself.

New unit test: test_should_retry_checks_cause_chain — verifies _should_retry returns True for ToolError wrapping ConnectionError, False for ToolError wrapping ValueError, False for bare ToolError.

New integration test: test_retry_actually_retries_through_server_pipeline — creates a tool that fails twice then succeeds, verifies the tool is called exactly 3 times (2 retries + 1 success). This is deterministic, not random.

Replaced broken test: The old random-based test is replaced by the deterministic one above.

Closes #3855

🤖 Generated with Claude Code

@marvin-context-protocol marvin-context-protocol Bot added bug Something isn't working. Reports of errors, unexpected behavior, or broken functionality. server Related to FastMCP server implementation or server-side functionality. DON'T MERGE PR is not ready for merging. Used by authors to prevent premature merging. labels Apr 11, 2026
strawgate added a commit that referenced this pull request Apr 11, 2026
- Delete entirely commented-out test_run_server.py (99 lines dead code)
- Fix test_pydantic_model_with_stringified_json_no_strict: replace
  try/except-both-branches-pass with clear pytest.raises assertion
- Fix test_path_traversal_blocked: remove dead assertions after
  pytest.raises (lines after raise never execute)

Error handling middleware test fixes are in a separate PR (#3858)
which also fixes the underlying RetryMiddleware bug.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@strawgate strawgate changed the title [DNM] Fix RetryMiddleware not retrying tool errors Fix RetryMiddleware not retrying tool errors Apr 11, 2026
jlowin pushed a commit that referenced this pull request Apr 12, 2026
- Delete entirely commented-out test_run_server.py (99 lines dead code)
- Fix test_pydantic_model_with_stringified_json_no_strict: replace
  try/except-both-branches-pass with clear pytest.raises assertion
- Fix test_path_traversal_blocked: remove dead assertions after
  pytest.raises (lines after raise never execute)

Error handling middleware test fixes are in a separate PR (#3858)
which also fixes the underlying RetryMiddleware bug.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
jlowin pushed a commit that referenced this pull request Apr 12, 2026
- Delete entirely commented-out test_run_server.py (99 lines dead code)
- Fix test_pydantic_model_with_stringified_json_no_strict: replace
  try/except-both-branches-pass with clear pytest.raises assertion
- Fix test_path_traversal_blocked: remove dead assertions after
  pytest.raises (lines after raise never execute)

Error handling middleware test fixes are in a separate PR (#3858)
which also fixes the underlying RetryMiddleware bug.

🤖 Generated with Claude Code

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
FastMCP wraps tool exceptions as ToolError(...) from <original>,
so _should_retry() only saw ToolError and never matched the
retryable ConnectionError/TimeoutError in __cause__.

Now checks error.__cause__ in addition to the error itself,
matching the pattern already used by ErrorHandlingMiddleware.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@strawgate strawgate force-pushed the fix/retry-middleware-cause-chain branch from 6b506c9 to 5899eed Compare April 13, 2026 01:26
@strawgate strawgate removed the DON'T MERGE PR is not ready for merging. Used by authors to prevent premature merging. label Apr 13, 2026
- Move FastMCP/Client imports to module level in test_error_handling.py
  per the project's 'all imports at top' rule.
- Strengthen test_retry_middleware_with_permanent_failures: assert
  call_count == 1 to prove the non-retryable error actually doesn't
  retry (was passing whether retries fired or not).
- Add a one-line note on _should_retry explaining the single-level
  __cause__ assumption so future middleware doesn't accidentally
  re-wrap and hide the real error type.
@jlowin jlowin merged commit 031c7e0 into main Apr 14, 2026
8 checks passed
@jlowin jlowin deleted the fix/retry-middleware-cause-chain branch April 14, 2026 00:08
strawgate added a commit that referenced this pull request Apr 14, 2026
Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Co-authored-by: Jeremiah Lowin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working. Reports of errors, unexpected behavior, or broken functionality. server Related to FastMCP server implementation or server-side functionality.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RetryMiddleware doesn't retry tool errors because FastMCP wraps exceptions before middleware sees them

2 participants