
fix(polling): check rate limits before creating polling ID#24106

Merged
Sameerlite merged 3 commits into litellm_dev_sameer_16_march_week from Sameerlite/pre-ratelimit-bg
Mar 20, 2026
Conversation

Sameerlite (Collaborator) commented Mar 19, 2026

Summary

Move pre-call checks (rate limits, guardrails, budget) to run BEFORE polling ID creation in the background streaming flow. This fixes the edge case where a rate-limited request would receive a polling ID that immediately fails in the background task.
Fixes LIT-2251

Changes

  • Add skip_pre_call_logic parameter to base_process_llm_request to allow skipping pre-call checks and avoiding double-counting of RPM/parallel request counters
  • Run common_processing_pre_call_logic before generating polling ID in the responses API endpoint. If rate limits/guardrails fail, return error immediately without creating a polling ID
  • Background streaming task passes skip_pre_call_logic=True to avoid re-running pre-call checks
  • Add tests verifying skip_pre_call_logic parameter works correctly
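The reordering described in the changes above can be sketched as follows. This is a minimal, hypothetical sketch: run_pre_call_checks and the flat responses_api function are simplified stand-ins for common_processing_pre_call_logic and the real endpoint, not the actual litellm signatures.

```python
# Minimal sketch of the reordered flow; all names are illustrative stand-ins.
import asyncio
import uuid

class RateLimitError(Exception):
    pass

async def run_pre_call_checks(data: dict) -> dict:
    # stand-in for common_processing_pre_call_logic: raises on rate limit,
    # otherwise attaches the logging object the background task will reuse
    if data.get("over_limit"):
        raise RateLimitError("TPM limit exceeded")
    data["litellm_logging_obj"] = object()
    return data

def generate_polling_id() -> str:
    return f"litellm_poll_{uuid.uuid4().hex[:8]}"

async def responses_api(data: dict) -> dict:
    # NEW ordering: checks run BEFORE a polling ID exists, so a rate-limited
    # request is rejected synchronously and never receives a polling ID
    data = await run_pre_call_checks(data)
    polling_id = generate_polling_id()
    # the background task later passes skip_pre_call_logic=True and reuses
    # `data`, so RPM/parallel-request counters are not incremented twice
    return {"polling_id": polling_id, "data": data}
```

With this ordering, a rate-limited request (simulated here by `over_limit=True`) raises before generate_polling_id is ever reached.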

Testing

  • Added 2 new tests in tests/proxy_unit_tests/test_response_polling_pre_call_checks.py verifying the skip parameter
  • All existing polling tests pass (73 tests)

Fixes: Rate-limited requests no longer receive a polling ID that immediately fails

vercel bot commented Mar 19, 2026


Project litellm: deployment Ready, updated Mar 19, 2026 9:02am (UTC)



codspeed-hq bot commented Mar 19, 2026

Merging this PR will not alter performance

✅ 16 untouched benchmarks


Comparing Sameerlite/pre-ratelimit-bg (66f97a0) with main (e5baa22)


greptile-apps bot commented Mar 19, 2026

Greptile Summary

This PR fixes a race-condition edge case in the polling/background-streaming flow: previously, pre-call checks (rate limits, guardrails, budget) ran inside the background task after a polling ID was already issued to the client, meaning a rate-limited request would receive a polling ID that immediately transitioned to failed. The fix runs common_processing_pre_call_logic synchronously in responses_api() before generate_polling_id() is called, and passes skip_pre_call_logic=True to the background task so RPM/TPM counters are not double-counted.

Key changes and observations:

  • Core fix is sound: the data flow is correct — common_processing_pre_call_logic returns an updated data dict (including litellm_logging_obj), which is shallow-copied and passed to the background task, satisfying the ValueError guard added to base_process_llm_request.
  • during_call_hook is unaffected: it continues to run unconditionally in the background task regardless of skip_pre_call_logic, which is the correct behaviour — content guardrails still apply during the actual LLM call.
  • HTTPException re-wrapping: the new except Exception block around the pre-call check passes HTTPException through _handle_llm_api_exception, which may inadvertently transform a well-formed 429 or 403 into a different error shape. A leading except HTTPException: raise clause would avoid this (see inline comment).
  • Tests are materially improved: TestPollingEndpointPreCallGuard.test_rate_limit_error_prevents_polling_id_creation now calls responses_api() end-to-end and asserts generate_polling_id is never reached — this is a meaningful regression guard for the core fix.
  • patch.multiple key derivation: the k.split(".")[-1] shortcut in the test is fragile and could silently drop mocks if two keys share a suffix; consider using explicit short-key kwargs (see inline comment).

Confidence Score: 4/5

  • Safe to merge — the core logic is correct, counters are not double-counted, and the main regression test covers the fixed path. The two style issues noted are low-risk.
  • The fix is straightforward and well-contained: one new parameter, one early-exit guard, and one guard in the existing exception handler. The data flow through the pre-call → polling ID → background task pipeline is correct. The end-to-end test in TestPollingEndpointPreCallGuard meaningfully validates the fix. Minor deductions for the HTTPException re-wrapping concern and the fragile patch.multiple key derivation, neither of which affects correctness in the current codebase.
  • litellm/proxy/response_api_endpoints/endpoints.py lines 143–149 (exception handling) and tests/proxy_unit_tests/test_response_polling_pre_call_checks.py lines 143–146 (mock patching).

Important Files Changed

Filename Overview
litellm/proxy/response_api_endpoints/endpoints.py Adds pre-call logic (rate limits, guardrails, budget) before polling ID creation in the polling path. The updated data dict is correctly threaded through to create_initial_state and the background task. Minor: the call omits model=None, a style inconsistency with other call sites. Exception handling wraps all pre-call errors via _handle_llm_api_exception, which is consistent with the rest of the file.
litellm/proxy/common_request_processing.py Adds skip_pre_call_logic parameter to base_process_llm_request. When True, retrieves litellm_logging_obj from self.data and raises ValueError if absent — a safety guard for the footgun. The during_call_hook continues to run unconditionally regardless of this flag, which is correct.
litellm/proxy/response_polling/background_streaming.py Minimal change: adds skip_pre_call_logic=True to the base_process_llm_request call. Since data is the pre-processed copy (including litellm_logging_obj) from responses_api, the guard in base_process_llm_request will not raise. Comment is clear about the intent.
tests/proxy_unit_tests/test_response_polling_pre_call_checks.py Three tests added. TestSkipPreCallLogic tests the skip_pre_call_logic plumbing on base_process_llm_request. TestPollingEndpointPreCallGuard now includes a meaningful end-to-end test that calls responses_api() and asserts generate_polling_id is never called when pre-call raises. However, there is a subtle mock patching issue — see inline comment.
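The ValueError guard described for common_request_processing.py can be sketched roughly as below. BaseProcessor and the return value are illustrative assumptions, not the real base_process_llm_request signature.

```python
# Rough sketch of the skip_pre_call_logic guard; names are illustrative.
class BaseProcessor:
    def __init__(self, data: dict):
        self.data = data

    def base_process_llm_request(self, skip_pre_call_logic: bool = False):
        if skip_pre_call_logic:
            # the caller must already have run pre-call checks and attached
            # the logging object; fail loudly rather than hit an
            # AttributeError downstream
            logging_obj = self.data.get("litellm_logging_obj")
            if logging_obj is None:
                raise ValueError(
                    "skip_pre_call_logic=True requires litellm_logging_obj in data"
                )
        else:
            logging_obj = "created-by-pre-call-checks"  # stand-in
        # during_call_hook and route_request would run unconditionally here
        return logging_obj
```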

Sequence Diagram

sequenceDiagram
    participant Client
    participant responses_api
    participant common_processing_pre_call_logic
    participant ResponsePollingHandler
    participant background_streaming_task
    participant base_process_llm_request

    Client->>responses_api: POST /v1/responses (background=true)
    responses_api->>responses_api: should_use_polling_for_request()
    
    Note over responses_api,common_processing_pre_call_logic: NEW: Pre-call checks before polling ID creation
    responses_api->>common_processing_pre_call_logic: check rate limits, guardrails, budget
    alt Pre-call check fails (e.g. rate limit)
        common_processing_pre_call_logic-->>responses_api: raise RateLimitError / HTTPException
        responses_api->>responses_api: _handle_llm_api_exception(e)
        responses_api-->>Client: 429 (synchronous error, no polling ID)
    else Pre-call checks pass
        common_processing_pre_call_logic-->>responses_api: (updated_data, logging_obj)
        responses_api->>ResponsePollingHandler: generate_polling_id()
        ResponsePollingHandler-->>responses_api: polling_id
        responses_api->>ResponsePollingHandler: create_initial_state(polling_id, data)
        responses_api->>background_streaming_task: asyncio.create_task(..., skip_pre_call_logic=True)
        responses_api-->>Client: initial_state (with polling_id)
        
        Note over background_streaming_task,base_process_llm_request: Background: skip pre-call to avoid double-counting
        background_streaming_task->>base_process_llm_request: base_process_llm_request(skip_pre_call_logic=True)
        base_process_llm_request->>base_process_llm_request: logging_obj = data["litellm_logging_obj"]
        base_process_llm_request->>base_process_llm_request: during_call_hook + route_request
        base_process_llm_request-->>background_streaming_task: StreamingResponse
        background_streaming_task->>ResponsePollingHandler: update_state(polling_id, status, output...)
    end

Last reviewed commit: "fix(test): rewrite p..."

Comment thread litellm/proxy/common_request_processing.py
Comment thread tests/proxy_unit_tests/test_response_polling_pre_call_checks.py
Comment thread litellm/proxy/response_api_endpoints/endpoints.py
- Guard logging_obj for None when skip_pre_call_logic=True: raise ValueError
  if litellm_logging_obj not in data, preventing AttributeError downstream
- Add model=None to common_processing_pre_call_logic call in endpoints.py
  to match style of other call sites
- Add test verifying rate-limited request never receives polling ID
Comment on lines +110 to +164
async def test_rate_limit_error_prevents_polling_id_creation(self):
    """When pre-call checks raise, generate_polling_id must not be called"""
    from litellm.proxy.response_polling.polling_handler import ResponsePollingHandler

    rate_limit_exc = litellm.RateLimitError(
        message="TPM limit exceeded",
        llm_provider="",
        model="gpt-4",
    )

    generate_polling_id_mock = MagicMock(return_value="litellm_poll_test")

    with (
        patch.object(
            ProxyBaseLLMRequestProcessing,
            "common_processing_pre_call_logic",
            new_callable=AsyncMock,
            side_effect=rate_limit_exc,
        ),
        patch.object(
            ProxyBaseLLMRequestProcessing,
            "_handle_llm_api_exception",
            new_callable=AsyncMock,
            return_value=HTTPException(status_code=429, detail="Rate limit exceeded"),
        ),
        patch.object(ResponsePollingHandler, "generate_polling_id", generate_polling_id_mock),
    ):
        # Simulate the endpoint logic directly (avoids proxy_server import complexity)
        data = {"model": "gpt-4", "background": True}
        processor = ProxyBaseLLMRequestProcessing(data=data)

        raised_exc = None
        try:
            await processor.common_processing_pre_call_logic(
                request=MagicMock(spec=Request),
                general_settings={},
                proxy_logging_obj=AsyncMock(),
                user_api_key_dict=MagicMock(spec=UserAPIKeyAuth),
                version="1.0.0",
                proxy_config=MagicMock(),
                user_model=None,
                user_temperature=None,
                user_request_timeout=None,
                user_max_tokens=None,
                user_api_base=None,
                model=None,
                route_type="aresponses",
                llm_router=MagicMock(),
            )
        except litellm.RateLimitError as e:
            raised_exc = e

        # The exception was raised before generate_polling_id could be called
        assert raised_exc is not None
        generate_polling_id_mock.assert_not_called()

P1 generate_polling_id_mock.assert_not_called() is vacuously true — the test never invokes responses_api()

The test sets up patches for common_processing_pre_call_logic (raises RateLimitError), _handle_llm_api_exception, and ResponsePollingHandler.generate_polling_id. It then calls processor.common_processing_pre_call_logic(...) directly, which raises due to the mock. The final assertion generate_polling_id_mock.assert_not_called() is trivially true because generate_polling_id is not called anywhere in the test body — it would only be called if responses_api() were invoked.

The actual guard introduced in endpoints.py (the try/except block that wraps common_processing_pre_call_logic at line 126–149 and returns early before ResponsePollingHandler.generate_polling_id()) has zero test coverage. A meaningful test would need to:

  1. Mock or inject all proxy_server module-level globals that responses_api() imports
  2. Call responses_api() (or the relevant code path)
  3. Assert generate_polling_id was not called
  4. Assert the returned exception has status 429

Without this, a regression that removes the try/except guard entirely would not be caught.
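The shape of a non-vacuous test, per the four steps above, can be sketched like this. The flat responses_api stand-in is fabricated for illustration; the real test would patch and invoke the actual endpoint.

```python
# Hedged sketch of the non-vacuous test shape: patch the collaborators, invoke
# the *endpoint*, and only then assert generate_polling_id was never reached.
# All names here are simplified stand-ins for the litellm internals.
from unittest.mock import MagicMock

class RateLimitError(Exception):
    pass

def responses_api(data, pre_call, polling_handler):
    try:
        data = pre_call(data)            # the guard under test
    except Exception:
        return {"status_code": 429}      # early exit: no polling ID created
    return {"polling_id": polling_handler.generate_polling_id()}

def test_rate_limited_request_gets_no_polling_id():
    pre_call = MagicMock(side_effect=RateLimitError("TPM limit exceeded"))
    polling_handler = MagicMock()

    resp = responses_api({"model": "gpt-4"}, pre_call, polling_handler)

    assert resp["status_code"] == 429
    polling_handler.generate_polling_id.assert_not_called()

test_rate_limited_request_gets_no_polling_id()
```

Because the assertion runs after the endpoint itself was exercised, deleting the try/except guard would now make the test fail.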

…) directly

Previously the test called common_processing_pre_call_logic in isolation,
making generate_polling_id.assert_not_called() vacuously true. Now the test
calls responses_api() end-to-end so it actually verifies that a rate-limited
request never receives a polling ID.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Comment on lines +143 to +146
with (
    patch.multiple("litellm.proxy.proxy_server", **{
        k.split(".")[-1]: v for k, v in proxy_server_patches.items()
    }),

P2 patch.multiple key derivation is fragile

The proxy_server_patches dict uses fully-qualified dotted names as keys (e.g. "litellm.proxy.proxy_server._read_request_body"), then strips all but the last segment with k.split(".")[-1]. If two entries ever share the same trailing segment (e.g. two different version keys under different sub-modules), only the last one wins in the resulting dict — silently. Consider building the kwargs dict directly with short names to make the intent explicit and avoid any future silent collision.
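The suggested explicit-kwargs shape can be demonstrated with a stand-in module. fake_proxy_server below is fabricated for the example; the real target would be litellm.proxy.proxy_server.

```python
# Demonstrates patch.multiple with explicit short attribute names instead of
# deriving them from dotted paths; the module is fabricated for the example.
import sys
import types
from unittest.mock import MagicMock, patch

mod = types.ModuleType("fake_proxy_server")
mod._read_request_body = lambda: "real-body"
mod.version = "real-version"
sys.modules["fake_proxy_server"] = mod

# short keys written out explicitly: no k.split(".")[-1], no silent collisions
proxy_server_patches = {
    "_read_request_body": MagicMock(return_value={}),
    "version": "1.0.0-test",
}

with patch.multiple("fake_proxy_server", **proxy_server_patches):
    assert mod.version == "1.0.0-test"
    assert mod._read_request_body() == {}

assert mod.version == "real-version"  # restored on context exit
```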

Comment on lines +143 to +149
except Exception as e:
    raise await processor._handle_llm_api_exception(
        e=e,
        user_api_key_dict=user_api_key_dict,
        proxy_logging_obj=proxy_logging_obj,
        version=version,
    )

P2 HTTPException from pre-call may be re-wrapped by _handle_llm_api_exception

common_processing_pre_call_logic can raise HTTPException directly (e.g. a 429 rate-limit or 403 budget exceeded). When that exception reaches this except Exception block it is passed to _handle_llm_api_exception, which may alter the status code, add logging overhead, or change the response body — all undesired for an exception that is already a well-formed HTTP error.

The synchronous non-polling path (lines 281–287) uses the same broad except Exception pattern, but in that path the exception originates from the LLM call itself. Here it comes from a pre-call gate that already produces correct HTTP error objects. Adding a leading except HTTPException: raise clause would let those pass through unmodified.
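The suggested leading clause can be sketched as below. HTTPException and _handle_llm_api_exception here are simplified synchronous stand-ins, not the real FastAPI/litellm objects.

```python
# Sketch of the suggested fix: a leading `except HTTPException: raise` lets
# well-formed HTTP errors (e.g. 429/403) pass through unmodified, while
# everything else is still wrapped. All names are stand-ins.
class HTTPException(Exception):
    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail

def _handle_llm_api_exception(e: Exception) -> HTTPException:
    # stand-in for the proxy's generic wrapper
    return HTTPException(status_code=500, detail=f"wrapped: {e}")

def guarded_pre_call(pre_call):
    try:
        return pre_call()
    except HTTPException:
        raise  # already a well-formed HTTP error: do not re-wrap
    except Exception as e:
        raise _handle_llm_api_exception(e)
```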

@Sameerlite Sameerlite changed the base branch from main to litellm_dev_sameer_16_march_week March 20, 2026 10:56
@Sameerlite Sameerlite merged commit 8ad2068 into litellm_dev_sameer_16_march_week Mar 20, 2026
36 of 39 checks passed
@ishaan-berri ishaan-berri deleted the Sameerlite/pre-ratelimit-bg branch March 26, 2026 22:29