fix(polling): check rate limits before creating polling ID#24106
Sameerlite merged 3 commits into litellm_dev_sameer_16_march_week
Conversation
Move pre-call checks (rate limits, guardrails, budget) to run BEFORE polling ID creation in the background streaming flow. This prevents the edge case where a rate-limited request receives a polling ID that immediately fails.

Changes:
- Add a `skip_pre_call_logic` parameter to `base_process_llm_request` to allow skipping pre-call checks (avoiding double-counting of RPM/parallel requests)
- Run `common_processing_pre_call_logic` before generating the polling ID in the responses API endpoint. If rate limits/guardrails fail, return the error immediately without creating a polling ID
- The background streaming task passes `skip_pre_call_logic=True` to avoid re-running pre-call checks that were already done before polling ID creation
- Add tests verifying the `skip_pre_call_logic` parameter works correctly

Fixes the edge case where polling_via_cache would return a polling ID for a request that immediately fails due to rate limiting.
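The reordering described above can be sketched in miniature. This is a hedged, simplified sketch, not litellm's actual implementation: the `Processor` class, `pre_call_logic`, and the counter are illustrative stand-ins for the real pre-call checks and RPM/parallel-request accounting that must not run twice.

```python
# Minimal sketch of the "pre-call before polling ID" flow (hypothetical names).

class RateLimitError(Exception):
    pass

class Processor:
    def __init__(self, data):
        self.data = data
        self.pre_call_runs = 0  # stand-in for RPM/parallel-request counters

    def pre_call_logic(self, limit_exceeded=False):
        self.pre_call_runs += 1
        if limit_exceeded:
            raise RateLimitError("TPM limit exceeded")
        self.data["litellm_logging_obj"] = object()

    def process_request(self, skip_pre_call_logic=False):
        if skip_pre_call_logic:
            # pre-call already ran before the polling ID was issued;
            # re-running it here would double-count rate limits
            logging_obj = self.data["litellm_logging_obj"]
        else:
            self.pre_call_logic()
            logging_obj = self.data["litellm_logging_obj"]
        return "streaming-response"

def responses_api(data, limit_exceeded=False):
    proc = Processor(data)
    # NEW: pre-call checks run BEFORE a polling ID is created,
    # so a rate-limited request errors out synchronously
    proc.pre_call_logic(limit_exceeded=limit_exceeded)
    polling_id = "litellm_poll_123"  # only reached if checks pass
    # background task skips pre-call to avoid double-counting
    proc.process_request(skip_pre_call_logic=True)
    return polling_id, proc.pre_call_runs
```

A rate-limited call raises before `polling_id` is ever assigned; a successful call runs the pre-call logic exactly once.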
Greptile Summary

This PR fixes a race-condition edge case in the polling/background-streaming flow: previously, pre-call checks (rate limits, guardrails, budget) ran inside the background task after a polling ID was already issued to the client, meaning a rate-limited request would receive a polling ID that immediately transitioned to a failed state.

Key changes and observations:
Confidence Score: 4/5
| Filename | Overview |
|---|---|
| litellm/proxy/response_api_endpoints/endpoints.py | Adds pre-call logic (rate limits, guardrails, budget) before polling ID creation in the polling path. The updated data dict is correctly threaded through to create_initial_state and the background task. Minor: the call omits model=None, a previously-discussed style inconsistency with other call sites. Exception handling wraps all pre-call errors via _handle_llm_api_exception, which is consistent with the rest of the file. |
| litellm/proxy/common_request_processing.py | Adds skip_pre_call_logic parameter to base_process_llm_request. When True, retrieves litellm_logging_obj from self.data and raises ValueError if absent — a safety guard for the footgun. The during_call_hook continues to run unconditionally regardless of this flag, which is correct. |
| litellm/proxy/response_polling/background_streaming.py | Minimal change: adds skip_pre_call_logic=True to the base_process_llm_request call. Since data is the pre-processed copy (including litellm_logging_obj) from responses_api, the guard in base_process_llm_request will not raise. Comment is clear about the intent. |
| tests/proxy_unit_tests/test_response_polling_pre_call_checks.py | Three tests added. TestSkipPreCallLogic tests the skip_pre_call_logic plumbing on base_process_llm_request. TestPollingEndpointPreCallGuard now includes a meaningful end-to-end test that calls responses_api() and asserts generate_polling_id is never called when pre-call raises. However, there is a subtle mock patching issue — see inline comment. |
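The safety guard noted for `common_request_processing.py` can be shown in isolation. This is a hedged sketch with a drastically simplified signature; the real `base_process_llm_request` takes many more parameters and the logging object is a real litellm type, not `object()`.

```python
# Sketch of the skip_pre_call_logic guard (hypothetical, simplified).

def base_process_llm_request(data: dict, skip_pre_call_logic: bool = False):
    if skip_pre_call_logic:
        # Caller claims pre-call already ran; fail fast if the logging
        # object it should have produced is missing, instead of hitting
        # an AttributeError deeper in the request path.
        logging_obj = data.get("litellm_logging_obj")
        if logging_obj is None:
            raise ValueError(
                "skip_pre_call_logic=True requires litellm_logging_obj in data"
            )
    else:
        # stand-in for common_processing_pre_call_logic
        logging_obj = object()
        data["litellm_logging_obj"] = logging_obj
    # during_call_hook + routing would run unconditionally after this point
    return logging_obj
```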
Sequence Diagram
```mermaid
sequenceDiagram
    participant Client
    participant responses_api
    participant common_processing_pre_call_logic
    participant ResponsePollingHandler
    participant background_streaming_task
    participant base_process_llm_request
    Client->>responses_api: POST /v1/responses (background=true)
    responses_api->>responses_api: should_use_polling_for_request()
    Note over responses_api,common_processing_pre_call_logic: NEW: Pre-call checks before polling ID creation
    responses_api->>common_processing_pre_call_logic: check rate limits, guardrails, budget
    alt Pre-call check fails (e.g. rate limit)
        common_processing_pre_call_logic-->>responses_api: raise RateLimitError / HTTPException
        responses_api->>responses_api: _handle_llm_api_exception(e)
        responses_api-->>Client: 429 (synchronous error, no polling ID)
    else Pre-call checks pass
        common_processing_pre_call_logic-->>responses_api: (updated_data, logging_obj)
        responses_api->>ResponsePollingHandler: generate_polling_id()
        ResponsePollingHandler-->>responses_api: polling_id
        responses_api->>ResponsePollingHandler: create_initial_state(polling_id, data)
        responses_api->>background_streaming_task: asyncio.create_task(..., skip_pre_call_logic=True)
        responses_api-->>Client: initial_state (with polling_id)
        Note over background_streaming_task,base_process_llm_request: Background: skip pre-call to avoid double-counting
        background_streaming_task->>base_process_llm_request: base_process_llm_request(skip_pre_call_logic=True)
        base_process_llm_request->>base_process_llm_request: logging_obj = data["litellm_logging_obj"]
        base_process_llm_request->>base_process_llm_request: during_call_hook + route_request
        base_process_llm_request-->>background_streaming_task: StreamingResponse
        background_streaming_task->>ResponsePollingHandler: update_state(polling_id, status, output...)
    end
```
Last reviewed commit: "fix(test): rewrite p..."
- Guard `logging_obj` for None when `skip_pre_call_logic=True`: raise ValueError if `litellm_logging_obj` is not in `data`, preventing an AttributeError downstream
- Add `model=None` to the `common_processing_pre_call_logic` call in endpoints.py to match the style of other call sites
- Add a test verifying a rate-limited request never receives a polling ID
```python
async def test_rate_limit_error_prevents_polling_id_creation(self):
    """When pre-call checks raise, generate_polling_id must not be called"""
    from litellm.proxy.response_polling.polling_handler import ResponsePollingHandler

    rate_limit_exc = litellm.RateLimitError(
        message="TPM limit exceeded",
        llm_provider="",
        model="gpt-4",
    )

    generate_polling_id_mock = MagicMock(return_value="litellm_poll_test")

    with (
        patch.object(
            ProxyBaseLLMRequestProcessing,
            "common_processing_pre_call_logic",
            new_callable=AsyncMock,
            side_effect=rate_limit_exc,
        ),
        patch.object(
            ProxyBaseLLMRequestProcessing,
            "_handle_llm_api_exception",
            new_callable=AsyncMock,
            return_value=HTTPException(status_code=429, detail="Rate limit exceeded"),
        ),
        patch.object(ResponsePollingHandler, "generate_polling_id", generate_polling_id_mock),
    ):
        # Simulate the endpoint logic directly (avoids proxy_server import complexity)
        data = {"model": "gpt-4", "background": True}
        processor = ProxyBaseLLMRequestProcessing(data=data)

        raised_exc = None
        try:
            await processor.common_processing_pre_call_logic(
                request=MagicMock(spec=Request),
                general_settings={},
                proxy_logging_obj=AsyncMock(),
                user_api_key_dict=MagicMock(spec=UserAPIKeyAuth),
                version="1.0.0",
                proxy_config=MagicMock(),
                user_model=None,
                user_temperature=None,
                user_request_timeout=None,
                user_max_tokens=None,
                user_api_base=None,
                model=None,
                route_type="aresponses",
                llm_router=MagicMock(),
            )
        except litellm.RateLimitError as e:
            raised_exc = e

        # The exception was raised before generate_polling_id could be called
        assert raised_exc is not None
        generate_polling_id_mock.assert_not_called()
```
generate_polling_id_mock.assert_not_called() is vacuously true — the test never invokes responses_api()
The test sets up patches for common_processing_pre_call_logic (raises RateLimitError), _handle_llm_api_exception, and ResponsePollingHandler.generate_polling_id. It then calls processor.common_processing_pre_call_logic(...) directly, which raises due to the mock. The final assertion generate_polling_id_mock.assert_not_called() is trivially true because generate_polling_id is not called anywhere in the test body — it would only be called if responses_api() were invoked.
The actual guard introduced in endpoints.py (the try/except block that wraps common_processing_pre_call_logic at line 126–149 and returns early before ResponsePollingHandler.generate_polling_id()) has zero test coverage. A meaningful test would need to:
- Mock or inject all `proxy_server` module-level globals that `responses_api()` imports
- Call `responses_api()` (or the relevant code path)
- Assert `generate_polling_id` was not called
- Assert the returned exception has status 429
Without this, a regression that removes the try/except guard entirely would not be caught.
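The vacuity described above is easy to reproduce in miniature (hypothetical `Handler` class; nothing here is litellm code): `assert_not_called()` only means something if the test actually exercises a code path that could have called the mock.

```python
from unittest.mock import MagicMock, patch

class Handler:
    def generate_polling_id(self):
        return "real-id"

mock = MagicMock(return_value="poll-id")

with patch.object(Handler, "generate_polling_id", mock):
    pass  # the code path that would call generate_polling_id never runs

# Trivially true: nothing in the `with` body could have called it.
mock.assert_not_called()

# Only when the patched method is actually reachable does the
# assertion carry information about the code under test:
with patch.object(Handler, "generate_polling_id", mock):
    Handler().generate_polling_id()

mock.assert_called_once()
```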
…) directly

Previously the test called common_processing_pre_call_logic in isolation, making generate_polling_id.assert_not_called() vacuously true. Now the test calls responses_api() end-to-end so it actually verifies that a rate-limited request never receives a polling ID.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
```python
with (
    patch.multiple("litellm.proxy.proxy_server", **{
        k.split(".")[-1]: v for k, v in proxy_server_patches.items()
    }),
```
patch.multiple key derivation is fragile
The proxy_server_patches dict uses fully-qualified dotted names as keys (e.g. "litellm.proxy.proxy_server._read_request_body"), then strips all but the last segment with k.split(".")[-1]. If two entries ever share the same trailing segment (e.g. two different version keys under different sub-modules), only the last one wins in the resulting dict — silently. Consider building the kwargs dict directly with short names to make the intent explicit and avoid any future silent collision.
```python
except Exception as e:
    raise await processor._handle_llm_api_exception(
        e=e,
        user_api_key_dict=user_api_key_dict,
        proxy_logging_obj=proxy_logging_obj,
        version=version,
    )
```
HTTPException from pre-call may be re-wrapped by _handle_llm_api_exception
common_processing_pre_call_logic can raise HTTPException directly (e.g. a 429 rate-limit or 403 budget exceeded). When that exception reaches this except Exception block it is passed to _handle_llm_api_exception, which may alter the status code, add logging overhead, or change the response body — all undesired for an exception that is already a well-formed HTTP error.
The synchronous non-polling path (lines 281–287) uses the same broad except Exception pattern, but in that path the exception originates from the LLM call itself. Here it comes from a pre-call gate that already produces correct HTTP error objects. Adding a leading except HTTPException: raise clause would let those pass through unmodified.
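The suggested pass-through clause can be sketched as follows. This is an illustrative, self-contained sketch: the `HTTPException` class and `handle_llm_api_exception` below are stand-ins for `fastapi.HTTPException` and litellm's `_handle_llm_api_exception`, not the real implementations.

```python
import asyncio

class HTTPException(Exception):  # stand-in for fastapi.HTTPException
    def __init__(self, status_code, detail=""):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail

async def handle_llm_api_exception(e):
    # stand-in for _handle_llm_api_exception: wraps arbitrary errors
    return HTTPException(status_code=500, detail=f"wrapped: {e}")

async def guarded_pre_call(pre_call):
    try:
        return await pre_call()
    except HTTPException:
        # The pre-call gate already produced a well-formed HTTP error
        # (429 rate limit, 403 budget): let it propagate untouched.
        raise
    except Exception as e:
        raise await handle_llm_api_exception(e)

async def rate_limited():
    raise HTTPException(status_code=429, detail="Rate limit exceeded")

try:
    asyncio.run(guarded_pre_call(rate_limited))
except HTTPException as e:
    assert e.status_code == 429  # passed through, not re-wrapped
```

Non-HTTP exceptions still fall through to the wrapper, preserving the existing behavior for genuine LLM-call failures.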
Merged commit 8ad2068 into litellm_dev_sameer_16_march_week
Summary
Move pre-call checks (rate limits, guardrails, budget) to run BEFORE polling ID creation in the background streaming flow. This fixes the edge case where a rate-limited request would receive a polling ID that immediately fails in the background task.
Fixes LIT-2251
Changes
- Add a `skip_pre_call_logic` parameter to `base_process_llm_request` to allow skipping pre-call checks and avoiding double-counting of RPM/parallel request counters
- Run `common_processing_pre_call_logic` before generating the polling ID in the responses API endpoint. If rate limits/guardrails fail, return the error immediately without creating a polling ID
- The background streaming task passes `skip_pre_call_logic=True` to avoid re-running pre-call checks
- Tests verify the `skip_pre_call_logic` parameter works correctly

Testing

- Added `tests/proxy_unit_tests/test_response_polling_pre_call_checks.py` verifying the skip parameter

Fixes: Rate-limited requests no longer receive a polling ID that immediately fails