
fix: skip max_completion_tokens when maxTokens is None#3284

Merged
jlowin merged 1 commit into PrefectHQ:main from eon01:fix/sampling-handler-max-completion-tokens on Feb 24, 2026

Conversation

Contributor

@eon01 commented Feb 24, 2026

Description

When maxTokens is not specified by the caller, the OpenAI sampling handler was sending "max_completion_tokens": null explicitly in the API request. OpenAI treats an explicit null differently from an omitted key: it returns an empty response with finish_reason: "stop" and no content, which causes the handler to raise ValueError("No content in response from completion").

Fix: guard max_completion_tokens with if params.maxTokens is not None, consistent with how temperature and stopSequences are already handled in the same block.
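
For illustration, here is a minimal sketch of the guarded parameter construction. The function name and surrounding structure are hypothetical; only the if params.maxTokens is not None guard reflects the actual change:

```python
from typing import Any


def build_request_kwargs(params: Any, model: str, messages: list[dict]) -> dict:
    """Sketch: build chat-completion kwargs, omitting options the caller left unset."""
    kwargs: dict = {"model": model, "messages": messages}
    if params.temperature is not None:
        kwargs["temperature"] = params.temperature
    if params.stopSequences:
        kwargs["stop"] = params.stopSequences
    # The fix: include max_completion_tokens only when maxTokens was provided.
    # Sending an explicit null makes OpenAI return an empty completion.
    if params.maxTokens is not None:
        kwargs["max_completion_tokens"] = params.maxTokens
    return kwargs
```

Omitting the key lets the API fall back to its default, whereas an explicit null triggers the empty-response behavior described above.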

This is a follow-up to the rename fix in #3252.

Contributors Checklist

Review Checklist

  • I have self-reviewed my changes
  • My Pull Request is ready for review

Copilot AI review requested due to automatic review settings February 24, 2026 09:08
@marvin-context-protocol (bot) added the bug and client labels Feb 24, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes a bug in the OpenAI sampling handler where sending an explicit null value for max_completion_tokens causes the API to return empty responses. The fix ensures that when maxTokens is not specified (None), the parameter is omitted from the API request entirely, which is the expected behavior.

Changes:

  • Modified the OpenAI sampling handler to conditionally include max_completion_tokens only when params.maxTokens is not None, making it consistent with how temperature and stopSequences are handled.

Comment thread: src/fastmcp/client/sampling/handlers/openai.py
Member

@jlowin left a comment


Thanks!

@jlowin merged commit 549f48b into PrefectHQ:main on Feb 24, 2026
12 of 13 checks passed
@marvin-context-protocol
Contributor

Test Failure Analysis

Summary: One test failed on Python 3.13 in the "Run unit tests" step: tests/server/middleware/test_rate_limiting.py::TestRateLimitingMiddlewareIntegration::test_rate_limiting_with_different_operations — expected a ToolError to be raised but none was raised.

Root Cause: The test is timing-sensitive and incorrectly calibrated for Python 3.13. It configures RateLimitingMiddleware(max_requests_per_second=9.0, burst_capacity=5) and expects the 3rd explicit call_tool to hit the rate limit. However, the token bucket starts with 5 tokens and the high refill rate (9/sec) means any measurable wall-clock time between async calls replenishes tokens. The number of implicit MCP protocol requests (e.g., initialize, list_tools) made during async with Client(...) likely differs between Python 3.10 and 3.13, leaving one extra token available on 3.13 — causing the 3rd call to succeed instead of raising.

Suggested Solution: Fix test_rate_limiting_with_different_operations to use a near-zero refill rate so that token replenishment is negligible during the test, and set burst_capacity to exactly the number of expected requests (based on how many implicit protocol requests the client makes); see the sketch after this list:

  • File: tests/server/middleware/test_rate_limiting.py (lines 383–384)
  • Change: Replace RateLimitingMiddleware(max_requests_per_second=9.0, burst_capacity=5) with something like RateLimitingMiddleware(max_requests_per_second=0.001, burst_capacity=4) (or whichever burst value exhausts after init + list_tools + 2 calls on all Python versions)
  • Why: A near-zero refill rate makes timing irrelevant; the bucket simply counts requests without replenishing during the test
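
A sketch of the recalibrated setup; the burst_capacity value is an assumption and must be verified against the number of implicit protocol requests the client actually makes:

```python
# Sketch of the proposed recalibration. burst_capacity=4 assumes exactly
# 2 implicit requests (initialize + list_tools) plus the 2 explicit calls
# that should succeed, so the 3rd explicit call reliably hits the limit.
rate_limit_server.add_middleware(
    RateLimitingMiddleware(
        max_requests_per_second=0.001,  # near-zero refill: timing becomes irrelevant
        burst_capacity=4,
    )
)
```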

This failure is unrelated to the max_completion_tokens fix in this PR — it's a pre-existing flaky test that happened to fail on this CI run for Python 3.13.

Detailed Analysis

Failing test (tests/server/middleware/test_rate_limiting.py:381):
```python
async def test_rate_limiting_with_different_operations(self, rate_limit_server):
    rate_limit_server.add_middleware(
        RateLimitingMiddleware(max_requests_per_second=9.0, burst_capacity=5)
    )
    async with Client(rate_limit_server) as client:
        await client.call_tool("quick_action", {"message": "test"})
        await client.call_tool("heavy_computation")
        # Should be rate limited regardless of operation type
        with pytest.raises(ToolError, match="Rate limit exceeded"):  # <-- DID NOT RAISE
            await client.call_tool("batch_process", {"items": ["a", "b", "c"]})
```

Log excerpt:
```
FAILED tests/server/middleware/test_rate_limiting.py::TestRateLimitingMiddlewareIntegration::test_rate_limiting_with_different_operations
Failed: DID NOT RAISE <class 'fastmcp.exceptions.ToolError'>
= 1 failed, 4379 passed, 1 skipped, 1 xfailed, 6 warnings in 104.53s (0:01:44) =
```

Token bucket logic (src/fastmcp/server/middleware/rate_limiting.py:38):
```python
async def consume(self, tokens: int = 1) -> bool:
    async with self._lock:
        now = time.time()
        elapsed = now - self.last_refill
        # Add tokens based on elapsed time
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        ...
```

With refill_rate=9.0, just ~0.11 seconds between requests restores a full token. A comparable test (test_sliding_window_rate_limiting) comments # 1 init + 1 list_tools + 3 calls + 1 to fail for max_requests=6, suggesting 5 requests are expected before the failing one. But the rate-limited version uses burst_capacity=5 with only 2 explicit calls, so on Python 3.13 the third call still finds a token available and succeeds.
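
A back-of-envelope simulation under the test's settings makes the sensitivity concrete; the inter-request gaps below are assumed values, not measurements:

```python
# Hypothetical walk-through of the token bucket with the test's settings.
capacity, refill_rate = 5.0, 9.0
tokens = capacity
for gap in (0.0, 0.02, 0.05, 0.11):  # assumed gaps between requests, in seconds
    tokens = min(capacity, tokens + gap * refill_rate)  # time-based refill
    allowed = tokens >= 1
    if allowed:
        tokens -= 1
    print(f"gap={gap:.2f}s tokens_left={tokens:.2f} allowed={allowed}")
# A ~0.11 s gap restores nearly a full token (0.11 * 9.0 = 0.99), so any
# measurable delay between async calls keeps the bucket from draining.
```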

Only fails on: Python 3.13 / ubuntu-latest
Passes on: Python 3.10 / ubuntu-latest, Python 3.10 / windows-latest

Related Files
  • tests/server/middleware/test_rate_limiting.py — contains the flaky test; lines 381–394 are the failing test case
  • src/fastmcp/server/middleware/rate_limiting.py — TokenBucketRateLimiter.consume() at line 38 performs time-based refill that makes the test timing-sensitive


Development

Successfully merging this pull request may close these issues.

OpenAI handler fails with GPT-5 and O1 models - uses max_tokens instead of max_completion_tokens
