
fix(polling): check rate limits before creating polling ID#24106

Merged
Sameerlite merged 3 commits into litellm_dev_sameer_16_march_week from Sameerlite/pre-ratelimit-bg
Mar 20, 2026
Conversation

Sameerlite (Collaborator) commented Mar 19, 2026

Summary

Move pre-call checks (rate limits, guardrails, budget) to run BEFORE polling ID creation in the background streaming flow. This fixes the edge case where a rate-limited request would receive a polling ID that immediately fails in the background task.
Fixes LIT-2251

Changes

  • Add skip_pre_call_logic parameter to base_process_llm_request to allow skipping pre-call checks and avoiding double-counting of RPM/parallel request counters
  • Run common_processing_pre_call_logic before generating polling ID in the responses API endpoint. If rate limits/guardrails fail, return error immediately without creating a polling ID
  • Background streaming task passes skip_pre_call_logic=True to avoid re-running pre-call checks
  • Add tests verifying skip_pre_call_logic parameter works correctly
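The reordering described in the changes above can be sketched as follows. This is a minimal, hypothetical sketch: run_pre_call_checks and the flat responses_api function are simplified stand-ins for common_processing_pre_call_logic and the real endpoint, not the actual litellm signatures.

```python
# Minimal sketch of the reordered flow; all names are illustrative stand-ins.
import asyncio
import uuid

class RateLimitError(Exception):
    pass

async def run_pre_call_checks(data: dict) -> dict:
    # stand-in for common_processing_pre_call_logic: raises on rate limit,
    # otherwise attaches the logging object the background task will reuse
    if data.get("over_limit"):
        raise RateLimitError("TPM limit exceeded")
    data["litellm_logging_obj"] = object()
    return data

def generate_polling_id() -> str:
    return f"litellm_poll_{uuid.uuid4().hex[:8]}"

async def responses_api(data: dict) -> dict:
    # NEW ordering: checks run BEFORE a polling ID exists, so a rate-limited
    # request is rejected synchronously and never receives a polling ID
    data = await run_pre_call_checks(data)
    polling_id = generate_polling_id()
    # the background task later passes skip_pre_call_logic=True and reuses
    # `data`, so RPM/parallel-request counters are not incremented twice
    return {"polling_id": polling_id, "data": data}
```

With this ordering, a rate-limited request (simulated here by `over_limit=True`) raises before generate_polling_id is ever reached.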

Testing

  • Added 2 new tests in tests/proxy_unit_tests/test_response_polling_pre_call_checks.py verifying the skip parameter
  • All existing polling tests pass (73 tests)

Fixes: Rate-limited requests no longer receive a polling ID that immediately fails

vercel bot commented Mar 19, 2026


Project litellm: deployment Ready, updated Mar 19, 2026 9:02am (UTC)



codspeed-hq bot commented Mar 19, 2026

Merging this PR will not alter performance

✅ 16 untouched benchmarks


Comparing Sameerlite/pre-ratelimit-bg (66f97a0) with main (e5baa22)


greptile-apps bot commented Mar 19, 2026

Greptile Summary

This PR fixes a race-condition edge case in the polling/background-streaming flow: previously, pre-call checks (rate limits, guardrails, budget) ran inside the background task after a polling ID was already issued to the client, meaning a rate-limited request would receive a polling ID that immediately transitioned to failed. The fix runs common_processing_pre_call_logic synchronously in responses_api() before generate_polling_id() is called, and passes skip_pre_call_logic=True to the background task so RPM/TPM counters are not double-counted.

Key changes and observations:

  • Core fix is sound: the data flow is correct — common_processing_pre_call_logic returns an updated data dict (including litellm_logging_obj), which is shallow-copied and passed to the background task, satisfying the ValueError guard added to base_process_llm_request.
  • during_call_hook is unaffected: it continues to run unconditionally in the background task regardless of skip_pre_call_logic, which is the correct behaviour — content guardrails still apply during the actual LLM call.
  • HTTPException re-wrapping: the new except Exception block around the pre-call check passes HTTPException through _handle_llm_api_exception, which may inadvertently transform a well-formed 429 or 403 into a different error shape. A leading except HTTPException: raise clause would avoid this (see inline comment).
  • Tests are materially improved: TestPollingEndpointPreCallGuard.test_rate_limit_error_prevents_polling_id_creation now calls responses_api() end-to-end and asserts generate_polling_id is never reached — this is a meaningful regression guard for the core fix.
  • patch.multiple key derivation: the k.split(".")[-1] shortcut in the test is fragile and could silently drop mocks if two keys share a suffix; consider using explicit short-key kwargs (see inline comment).

Confidence Score: 4/5

  • Safe to merge — the core logic is correct, counters are not double-counted, and the main regression test covers the fixed path. The two style issues noted are low-risk.
  • The fix is straightforward and well-contained: one new parameter, one early-exit guard, and one guard in the existing exception handler. The data flow through the pre-call → polling ID → background task pipeline is correct. The end-to-end test in TestPollingEndpointPreCallGuard meaningfully validates the fix. Minor deductions for the HTTPException re-wrapping concern and the fragile patch.multiple key derivation, neither of which affects correctness in the current codebase.
  • litellm/proxy/response_api_endpoints/endpoints.py lines 143–149 (exception handling) and tests/proxy_unit_tests/test_response_polling_pre_call_checks.py lines 143–146 (mock patching).

Important Files Changed

Filename Overview
litellm/proxy/response_api_endpoints/endpoints.py Adds pre-call logic (rate limits, guardrails, budget) before polling ID creation in the polling path. The updated data dict is correctly threaded through to create_initial_state and the background task. Minor: the call omits model=None, a style inconsistency with other call sites. Exception handling wraps all pre-call errors via _handle_llm_api_exception, which is consistent with the rest of the file.
litellm/proxy/common_request_processing.py Adds skip_pre_call_logic parameter to base_process_llm_request. When True, retrieves litellm_logging_obj from self.data and raises ValueError if absent — a safety guard for the footgun. The during_call_hook continues to run unconditionally regardless of this flag, which is correct.
litellm/proxy/response_polling/background_streaming.py Minimal change: adds skip_pre_call_logic=True to the base_process_llm_request call. Since data is the pre-processed copy (including litellm_logging_obj) from responses_api, the guard in base_process_llm_request will not raise. Comment is clear about the intent.
tests/proxy_unit_tests/test_response_polling_pre_call_checks.py Three tests added. TestSkipPreCallLogic tests the skip_pre_call_logic plumbing on base_process_llm_request. TestPollingEndpointPreCallGuard now includes a meaningful end-to-end test that calls responses_api() and asserts generate_polling_id is never called when pre-call raises. However, there is a subtle mock patching issue — see inline comment.
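The ValueError guard described for common_request_processing.py can be sketched roughly as below. BaseProcessor and the return value are illustrative assumptions, not the real base_process_llm_request signature.

```python
# Rough sketch of the skip_pre_call_logic guard; names are illustrative.
class BaseProcessor:
    def __init__(self, data: dict):
        self.data = data

    def base_process_llm_request(self, skip_pre_call_logic: bool = False):
        if skip_pre_call_logic:
            # the caller must already have run pre-call checks and attached
            # the logging object; fail loudly rather than hit an
            # AttributeError downstream
            logging_obj = self.data.get("litellm_logging_obj")
            if logging_obj is None:
                raise ValueError(
                    "skip_pre_call_logic=True requires litellm_logging_obj in data"
                )
        else:
            logging_obj = "created-by-pre-call-checks"  # stand-in
        # during_call_hook and route_request would run unconditionally here
        return logging_obj
```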

Sequence Diagram

sequenceDiagram
    participant Client
    participant responses_api
    participant common_processing_pre_call_logic
    participant ResponsePollingHandler
    participant background_streaming_task
    participant base_process_llm_request

    Client->>responses_api: POST /v1/responses (background=true)
    responses_api->>responses_api: should_use_polling_for_request()
    
    Note over responses_api,common_processing_pre_call_logic: NEW: Pre-call checks before polling ID creation
    responses_api->>common_processing_pre_call_logic: check rate limits, guardrails, budget
    alt Pre-call check fails (e.g. rate limit)
        common_processing_pre_call_logic-->>responses_api: raise RateLimitError / HTTPException
        responses_api->>responses_api: _handle_llm_api_exception(e)
        responses_api-->>Client: 429 (synchronous error, no polling ID)
    else Pre-call checks pass
        common_processing_pre_call_logic-->>responses_api: (updated_data, logging_obj)
        responses_api->>ResponsePollingHandler: generate_polling_id()
        ResponsePollingHandler-->>responses_api: polling_id
        responses_api->>ResponsePollingHandler: create_initial_state(polling_id, data)
        responses_api->>background_streaming_task: asyncio.create_task(..., skip_pre_call_logic=True)
        responses_api-->>Client: initial_state (with polling_id)
        
        Note over background_streaming_task,base_process_llm_request: Background: skip pre-call to avoid double-counting
        background_streaming_task->>base_process_llm_request: base_process_llm_request(skip_pre_call_logic=True)
        base_process_llm_request->>base_process_llm_request: logging_obj = data["litellm_logging_obj"]
        base_process_llm_request->>base_process_llm_request: during_call_hook + route_request
        base_process_llm_request-->>background_streaming_task: StreamingResponse
        background_streaming_task->>ResponsePollingHandler: update_state(polling_id, status, output...)
    end

Last reviewed commit: "fix(test): rewrite p..."

Comment thread litellm/proxy/common_request_processing.py
Comment thread tests/proxy_unit_tests/test_response_polling_pre_call_checks.py
Comment thread litellm/proxy/response_api_endpoints/endpoints.py
- Guard logging_obj for None when skip_pre_call_logic=True: raise ValueError
  if litellm_logging_obj not in data, preventing AttributeError downstream
- Add model=None to common_processing_pre_call_logic call in endpoints.py
  to match style of other call sites
- Add test verifying rate-limited request never receives polling ID
Comment on lines +110 to +164
async def test_rate_limit_error_prevents_polling_id_creation(self):
    """When pre-call checks raise, generate_polling_id must not be called"""
    from litellm.proxy.response_polling.polling_handler import ResponsePollingHandler

    rate_limit_exc = litellm.RateLimitError(
        message="TPM limit exceeded",
        llm_provider="",
        model="gpt-4",
    )

    generate_polling_id_mock = MagicMock(return_value="litellm_poll_test")

    with (
        patch.object(
            ProxyBaseLLMRequestProcessing,
            "common_processing_pre_call_logic",
            new_callable=AsyncMock,
            side_effect=rate_limit_exc,
        ),
        patch.object(
            ProxyBaseLLMRequestProcessing,
            "_handle_llm_api_exception",
            new_callable=AsyncMock,
            return_value=HTTPException(status_code=429, detail="Rate limit exceeded"),
        ),
        patch.object(ResponsePollingHandler, "generate_polling_id", generate_polling_id_mock),
    ):
        # Simulate the endpoint logic directly (avoids proxy_server import complexity)
        data = {"model": "gpt-4", "background": True}
        processor = ProxyBaseLLMRequestProcessing(data=data)

        raised_exc = None
        try:
            await processor.common_processing_pre_call_logic(
                request=MagicMock(spec=Request),
                general_settings={},
                proxy_logging_obj=AsyncMock(),
                user_api_key_dict=MagicMock(spec=UserAPIKeyAuth),
                version="1.0.0",
                proxy_config=MagicMock(),
                user_model=None,
                user_temperature=None,
                user_request_timeout=None,
                user_max_tokens=None,
                user_api_base=None,
                model=None,
                route_type="aresponses",
                llm_router=MagicMock(),
            )
        except litellm.RateLimitError as e:
            raised_exc = e

        # The exception was raised before generate_polling_id could be called
        assert raised_exc is not None
        generate_polling_id_mock.assert_not_called()

P1 generate_polling_id_mock.assert_not_called() is vacuously true — the test never invokes responses_api()

The test sets up patches for common_processing_pre_call_logic (raises RateLimitError), _handle_llm_api_exception, and ResponsePollingHandler.generate_polling_id. It then calls processor.common_processing_pre_call_logic(...) directly, which raises due to the mock. The final assertion generate_polling_id_mock.assert_not_called() is trivially true because generate_polling_id is not called anywhere in the test body — it would only be called if responses_api() were invoked.

The actual guard introduced in endpoints.py (the try/except block that wraps common_processing_pre_call_logic at line 126–149 and returns early before ResponsePollingHandler.generate_polling_id()) has zero test coverage. A meaningful test would need to:

  1. Mock or inject all proxy_server module-level globals that responses_api() imports
  2. Call responses_api() (or the relevant code path)
  3. Assert generate_polling_id was not called
  4. Assert the returned exception has status 429

Without this, a regression that removes the try/except guard entirely would not be caught.
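The shape of a non-vacuous test, per the four steps above, can be sketched like this. The flat responses_api stand-in is fabricated for illustration; the real test would patch and invoke the actual endpoint.

```python
# Hedged sketch of the non-vacuous test shape: patch the collaborators, invoke
# the *endpoint*, and only then assert generate_polling_id was never reached.
# All names here are simplified stand-ins for the litellm internals.
from unittest.mock import MagicMock

class RateLimitError(Exception):
    pass

def responses_api(data, pre_call, polling_handler):
    try:
        data = pre_call(data)            # the guard under test
    except Exception:
        return {"status_code": 429}      # early exit: no polling ID created
    return {"polling_id": polling_handler.generate_polling_id()}

def test_rate_limited_request_gets_no_polling_id():
    pre_call = MagicMock(side_effect=RateLimitError("TPM limit exceeded"))
    polling_handler = MagicMock()

    resp = responses_api({"model": "gpt-4"}, pre_call, polling_handler)

    assert resp["status_code"] == 429
    polling_handler.generate_polling_id.assert_not_called()

test_rate_limited_request_gets_no_polling_id()
```

Because the assertion runs after the endpoint itself was exercised, deleting the try/except guard would now make the test fail.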

…) directly

Previously the test called common_processing_pre_call_logic in isolation,
making generate_polling_id.assert_not_called() vacuously true. Now the test
calls responses_api() end-to-end so it actually verifies that a rate-limited
request never receives a polling ID.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Comment on lines +143 to +146
with (
    patch.multiple("litellm.proxy.proxy_server", **{
        k.split(".")[-1]: v for k, v in proxy_server_patches.items()
    }),

P2 patch.multiple key derivation is fragile

The proxy_server_patches dict uses fully-qualified dotted names as keys (e.g. "litellm.proxy.proxy_server._read_request_body"), then strips all but the last segment with k.split(".")[-1]. If two entries ever share the same trailing segment (e.g. two different version keys under different sub-modules), only the last one wins in the resulting dict — silently. Consider building the kwargs dict directly with short names to make the intent explicit and avoid any future silent collision.
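The suggested explicit-kwargs shape can be demonstrated with a stand-in module. fake_proxy_server below is fabricated for the example; the real target would be litellm.proxy.proxy_server.

```python
# Demonstrates patch.multiple with explicit short attribute names instead of
# deriving them from dotted paths; the module is fabricated for the example.
import sys
import types
from unittest.mock import MagicMock, patch

mod = types.ModuleType("fake_proxy_server")
mod._read_request_body = lambda: "real-body"
mod.version = "real-version"
sys.modules["fake_proxy_server"] = mod

# short keys written out explicitly: no k.split(".")[-1], no silent collisions
proxy_server_patches = {
    "_read_request_body": MagicMock(return_value={}),
    "version": "1.0.0-test",
}

with patch.multiple("fake_proxy_server", **proxy_server_patches):
    assert mod.version == "1.0.0-test"
    assert mod._read_request_body() == {}

assert mod.version == "real-version"  # restored on context exit
```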

Comment on lines +143 to +149
except Exception as e:
    raise await processor._handle_llm_api_exception(
        e=e,
        user_api_key_dict=user_api_key_dict,
        proxy_logging_obj=proxy_logging_obj,
        version=version,
    )

P2 HTTPException from pre-call may be re-wrapped by _handle_llm_api_exception

common_processing_pre_call_logic can raise HTTPException directly (e.g. a 429 rate-limit or 403 budget exceeded). When that exception reaches this except Exception block it is passed to _handle_llm_api_exception, which may alter the status code, add logging overhead, or change the response body — all undesired for an exception that is already a well-formed HTTP error.

The synchronous non-polling path (lines 281–287) uses the same broad except Exception pattern, but in that path the exception originates from the LLM call itself. Here it comes from a pre-call gate that already produces correct HTTP error objects. Adding a leading except HTTPException: raise clause would let those pass through unmodified.
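The suggested leading clause can be sketched as below. HTTPException and _handle_llm_api_exception here are simplified synchronous stand-ins, not the real FastAPI/litellm objects.

```python
# Sketch of the suggested fix: a leading `except HTTPException: raise` lets
# well-formed HTTP errors (e.g. 429/403) pass through unmodified, while
# everything else is still wrapped. All names are stand-ins.
class HTTPException(Exception):
    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail

def _handle_llm_api_exception(e: Exception) -> HTTPException:
    # stand-in for the proxy's generic wrapper
    return HTTPException(status_code=500, detail=f"wrapped: {e}")

def guarded_pre_call(pre_call):
    try:
        return pre_call()
    except HTTPException:
        raise  # already a well-formed HTTP error: do not re-wrap
    except Exception as e:
        raise _handle_llm_api_exception(e)
```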

@Sameerlite Sameerlite changed the base branch from main to litellm_dev_sameer_16_march_week March 20, 2026 10:56
@Sameerlite Sameerlite merged commit 8ad2068 into litellm_dev_sameer_16_march_week Mar 20, 2026
36 of 39 checks passed
@ishaan-berri ishaan-berri deleted the Sameerlite/pre-ratelimit-bg branch March 26, 2026 22:29