feat(file_search): align emulated Responses behavior with native output #23969
Conversation
…hase 2 emulated fallback

Phase 1 (native passthrough):
- _decode_vector_store_ids_in_tools(): decode LiteLLM-managed unified vector_store_ids to provider-native IDs in file_search tools
- Split update_responses_tools_with_model_file_ids() into a decode pass (always runs) + a code_interpreter mapping pass (guarded)
- BaseResponsesAPIConfig.supports_native_file_search() → False by default; OpenAIResponsesAPIConfig overrides to True
- ManagedFiles.async_pre_call_hook(): batch team-level access check for unified vector_store_ids in file_search tools (no N+1)
- Docs: file_search section in response_api.md

Phase 2 (emulated fallback for non-native providers):
- litellm/responses/file_search/emulated_handler.py: converts the file_search tool into a function tool, intercepts the tool call, runs asearch(), makes a follow-up call, and synthesizes OpenAI-format output (file_search_call + message + file_citation annotations)
- responses/main.py: routes to the emulated handler when the provider doesn't support file_search natively

Tests: 41 unit tests across 8 families (A-H) in test_file_search_responses.py

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
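The Phase 2 tool conversion described above (file_search tool → provider-callable function tool) can be sketched as follows. This is an illustrative reconstruction based on the commit description, not litellm's actual implementation; the function name and parameter schema are assumptions.

```python
# Hypothetical sketch of the file_search → function-tool conversion the
# emulated handler performs. Non-file_search tools pass through unchanged;
# vector store IDs are collected so the handler can run asearch() later.

FILE_SEARCH_FUNCTION_NAME = "litellm_file_search"  # assumed name

def replace_file_search_tools(tools):
    """Swap each file_search tool for a plain function tool and collect its store IDs."""
    transformed, vector_store_ids = [], []
    for tool in tools:
        if tool.get("type") == "file_search":
            vector_store_ids.extend(tool.get("vector_store_ids", []))
            transformed.append({
                "type": "function",
                "name": FILE_SEARCH_FUNCTION_NAME,
                "description": "Search the attached vector stores for relevant passages.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "queries": {"type": "array", "items": {"type": "string"}}
                    },
                    "required": ["queries"],
                },
            })
        else:
            transformed.append(tool)
    return transformed, vector_store_ids
```

The provider then "calls" this function like any other tool, which is what lets the handler intercept the call and run the real vector search itself.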
Covers both paths:
- Native passthrough (OpenAI/Azure): create vector store, run via SDK and proxy
- Emulated fallback (Anthropic/any): register managed store, run via SDK and proxy

Includes an output format validation script and a troubleshooting section.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
…avior

Ensure non-OpenAI emulated file_search matches native Responses output by populating search_results (when requested), fixing TypedDict field access, and supporting multi-query searches from tool calls.

Made-with: Cursor

Drop tools=[] from transformed chat-completion requests so providers like Anthropic return normal assistant text after tool_result turns.

Made-with: Cursor

…d Q&A

Replace duplicate path-by-path sections with a single usage-first doc format that includes SDK/Proxy tabs, an architecture diagram, and a focused Q&A section.

Made-with: Cursor

Replace inline file_search documentation in response_api.md with a canonical link and add the new tutorial to sidebars so users discover the usage-first guide.

Made-with: Cursor

Include all function_call items when building emulated follow-up input and update tests to assert real emulated routing + Responses-format function tool structure.

Made-with: Cursor

Forward explicit responses() params on emulated file search calls and preserve hidden params on synthesized responses so callback billing/logging context is retained.

Made-with: Cursor

Strip internal logging ids from emulated sub-calls, dedupe included search_results by file_id, clean unused imports, and add unit coverage for dedupe behavior.

Made-with: Cursor
…ext, cost tracking

- Remove dead `should_use_emulated_file_search` (main.py uses its own inline guard)
- Remove dead `fallback_vector_store_ids` param from `_run_vector_searches`
- Include all first_response.output items in follow_up_input so text blocks/reasoning from providers like Anthropic aren't dropped from conversation context
- Accumulate the first provider call's response_cost into the synthesized _hidden_params so billing callbacks see the total cost of both emulated-flow LLM calls
- Remove the broad tools=[] filter from transformation.py (backward-incompatible); the follow-up call already passes tools=None, which is filtered by the `v is not None` guard

Made-with: Cursor
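The cost-accumulation point above can be illustrated with a minimal sketch. The `response_cost` key and the dict shapes are assumptions for illustration, not litellm's exact `_hidden_params` internals:

```python
# Sketch of the cost-accumulation idea: the synthesized response's hidden
# params carry the summed cost of both emulated-flow LLM calls, so a
# billing callback that fires once on the outer call sees the total.

def accumulate_hidden_cost(first_hidden: dict, final_hidden: dict) -> dict:
    merged = dict(final_hidden)
    merged["response_cost"] = (
        (first_hidden.get("response_cost") or 0.0)
        + (final_hidden.get("response_cost") or 0.0)
    )
    return merged
```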
…follow-up input

Pydantic model instances (ResponseFunctionToolCall, etc.) from first_response.output were included raw in follow_up_input; the transformation layer expects plain dicts and called .get() on them, raising AttributeError. Serialize via model_dump(exclude_none=True).

Made-with: Cursor
…act DB helper, clean docstring

- Re-add should_use_emulated_file_search() to emulated_handler.py so the H5/H6/H7/H13 tests don't fail with ImportError
- Remove per-file-id deduplication from _build_search_results_for_include so all chunks are returned (matching OpenAI native file_search behaviour); update test_H14 to assert 2 results
- Extract the raw prisma DB query in check_vector_store_ids_access into a static _fetch_managed_vector_stores_by_uuids helper so the hot request path uses a named, testable function instead of an inline prisma_client.db.* call
- Remove the developer-local path from the test module docstring

Made-with: Cursor
…ueries-plural test

- Promote _fetch_managed_vector_stores_by_uuids from @staticmethod to a module-level async helper, get_managed_vector_store_rows_by_uuids, following the same standalone-helper pattern as get_team_object / get_key_object, so the hot-path DB read is a named importable function rather than an inline prisma_client.db.* call
- Pass no-log=True to both inner _call_aresponses sub-calls so they do not fire independent billing/monitoring callbacks; cost is accumulated in the synthesized response's _hidden_params for the outer responses() call
- Add test_H11b covering the primary queries (plural array) function-tool schema, complementing H11, which exercises only the backward-compat singular query path

Made-with: Cursor
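A rough sketch of the internal-call gating these commits reference; the real logic lives in `litellm.utils.wrapper_async` and is more involved, so this simplified version only illustrates the flag check:

```python
# Simplified illustration of the _is_litellm_internal_call gate: when an
# emulated-flow sub-call sets the flag, success/failure callbacks are
# skipped so only the outer responses() call fires billing/monitoring.

def run_callbacks(call_kwargs: dict, callbacks: list) -> list:
    if call_kwargs.get("_is_litellm_internal_call"):
        return []  # internal sub-call: suppress billing/monitoring callbacks
    return [callback() for callback in callbacks]
```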
Greptile Summary

This PR introduces emulated file_search support for providers without native Responses API file_search. Key issues found:
Confidence Score: 2/5
| Filename | Overview |
|---|---|
| litellm/responses/file_search/emulated_handler.py | New emulated file_search handler: correctly replaces file_search tools with a function tool, handles multi-query execution, and synthesizes OpenAI-format output. should_use_emulated_file_search is defined but dead code (not called from production routing). include forwarding to sub-calls (unstripped "file_search_call.results") was flagged in prior review and is still present. |
| litellm/responses/main.py | Adds emulated file_search routing gate. Issue: update_responses_tools_with_model_file_ids (which now decodes unified vector-store IDs) runs before the emulated path check, so the emulated handler receives decoded provider-native IDs instead of the original LiteLLM unified IDs, breaking managed vector store lookups in the emulated flow. |
| enterprise/litellm_enterprise/proxy/hooks/managed_files.py | Adds vector-store access control for Responses API calls. Logic is well-structured using batch DB fetch and cache-first lookup. Silent no-op when prisma_client is None creates a security bypass where access checks are skipped entirely when DB is unavailable. |
| litellm/proxy/auth/auth_checks.py | New get_managed_vector_store_rows_by_uuids helper follows established cache-first / DB-fallback pattern correctly. Cache key scheme and TTL usage are consistent with existing helpers. |
| litellm/litellm_core_utils/prompt_templates/common_utils.py | New _decode_vector_store_ids_in_tools correctly decodes LiteLLM-managed unified vector-store IDs to provider-native IDs. The decode runs as Pass 1 in update_responses_tools_with_model_file_ids, but this placement causes issues for the emulated path (see responses/main.py comment). |
| litellm/utils.py | Adds _is_litellm_internal_call flag handling in wrapper_async to suppress billing/logging callbacks for emulated file-search sub-calls. Pattern is clean and correctly gates both success and failure handlers. |
| tests/test_litellm/llms/test_file_search_responses.py | Comprehensive mock-only test suite covering decoding, access control, routing, and emulated handler flows. No real network calls. Tests H5–H7 and H13 cover should_use_emulated_file_search, which is dead code in production — these tests do not protect the actual routing logic in responses/main.py. |
| docs/my-website/docs/tutorials/file_search_responses_api.md | New tutorial with architecture diagram, SDK/Proxy usage tabs, and Q&A. Minor inconsistency: JSON example shows "search_results": null but the surrounding code uses include=["file_search_call.results"] which would produce populated results. |
Sequence Diagram
```mermaid
sequenceDiagram
    participant Client
    participant responses_main as responses/main.py
    participant decode as _decode_vector_store_ids_in_tools
    participant emulated as aresponses_with_emulated_file_search
    participant provider as Provider (e.g. Anthropic)
    participant vsearch as vector_stores.main.asearch

    Client->>responses_main: responses(model, tools=[file_search], include=[...])
    responses_main->>decode: update_responses_tools_with_model_file_ids(tools)
    decode-->>responses_main: tools (unified IDs → native IDs)
    Note over responses_main: ⚠️ emulated path now gets decoded (native) IDs
    alt Provider supports native file_search (OpenAI/Azure)
        responses_main->>provider: Forward request unchanged
        provider-->>Client: Native file_search response
    else Provider does NOT support native file_search
        responses_main->>emulated: aresponses_with_emulated_file_search(tools=decoded_tools)
        emulated->>emulated: _replace_file_search_tools → function tool
        emulated->>provider: 1st call: aresponses(tools=[litellm_file_search fn])
        provider-->>emulated: function_call(queries=[...], vector_store_id=...)
        emulated->>vsearch: asearch(vector_store_id=decoded_native_id, query)
        vsearch-->>emulated: search results
        emulated->>provider: 2nd call: aresponses(input=context+results, tools=None)
        provider-->>emulated: final answer text
        emulated->>emulated: synthesize file_search_call + message output
        emulated-->>Client: OpenAI-format ResponsesAPIResponse
    end
```
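The routing gate shown in the diagram corresponds roughly to a check like the following. This is a sketch: the real guard lives inline in `responses/main.py`, and only the `supports_native_file_search()` config method name comes from this PR.

```python
# Sketch of the emulated-path routing decision: emulate only when a
# file_search tool is present and the provider config does not report
# native support (or no config is available at all).

def should_emulate_file_search(tools, provider_config) -> bool:
    has_file_search = any(
        (tool.get("type") if isinstance(tool, dict) else getattr(tool, "type", None))
        == "file_search"
        for tool in (tools or [])
    )
    return has_file_search and (
        provider_config is None
        or not provider_config.supports_native_file_search()
    )
```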
Comments Outside Diff (1)
- **enterprise/litellm_enterprise/proxy/hooks/managed_files.py, lines 385-412** (link): `check_vector_store_ids_access` silently no-ops when `prisma_client` is `None`:

  ```python
  if not vector_store_ids or prisma_client is None:
      return  # ← access check is silently skipped
  ```

  In production, `prisma_client` is `None` when the DB has not been initialised (e.g. during tests, or if the proxy is started without `--config`). When that happens, any caller — regardless of team — can use any managed vector store without any access check. This is a security bypass. Consider raising an explicit error (or at least logging a warning with `verbose_logger.warning(...)`) when `prisma_client is None` and there are unified vector-store IDs to verify, rather than silently permitting access. Other access helpers in `auth_checks.py` follow a similar guard, so this should be consistent with the existing pattern for security-sensitive skips.
Last reviewed commit: "fix doc"
```python
            isinstance(item, dict)
            and item.get("type") == "function_call"
            and item.get("name") == FILE_SEARCH_FUNCTION_NAME
        )
        or (
            hasattr(item, "type")
            and getattr(item, "type") == "function_call"
            and getattr(item, "name", None) == FILE_SEARCH_FUNCTION_NAME
        )
    ]
```
`search_results` is `None` instead of `[]` when no results returned

When `include=["file_search_call.results"]` is requested but the vector search finds nothing (empty results list), the condition `if include_search_results and results` is falsy, so `search_results` stays `None`. Clients that request the include parameter expect a list type, not `None`. An empty `[]` is the correct sentinel value matching OpenAI's native API behaviour.
```python
            isinstance(item, dict)
            and item.get("type") == "function_call"
            and item.get("name") == FILE_SEARCH_FUNCTION_NAME
        )
        or (
            hasattr(item, "type")
            and getattr(item, "type") == "function_call"
            and getattr(item, "name", None) == FILE_SEARCH_FUNCTION_NAME
        )
    ]
    search_results: Optional[List[Dict[str, Any]]] = None
    if include_search_results:
        search_results = _build_search_results_for_include(results) if results else []
```
```python
        message_output=_build_message_output(response_text, all_results),
        first_response=first_response,
    )
```
`_call_aresponses` passes positional args, but the `aresponses` signature is keyword-only

`_call_aresponses(input, model, tools, **kwargs)` forwards `input`, `model`, and `tools` as positional arguments to `aresponses`. If the `aresponses` signature ever reorders these parameters (common during refactors), this will silently pass wrong values. Prefer explicit keyword arguments:
```python
        message_output=_build_message_output(response_text, all_results),
        first_response=first_response,
    )

async def _call_aresponses(input, model, tools, **kwargs):  # pragma: no cover – thin wrapper for patching in tests
    from litellm.responses.main import aresponses

    return await aresponses(input=input, model=model, tools=tools, **kwargs)
```
…rch-emulated-alignment
```python
    # 2. First provider call — provider will call the file_search function.
    # Mark as an internal sub-call so wrapper_async skips billing callbacks;
    # the parent litellm_logging_obj (propagated via kwargs) fires once at the end.
    first_response: ResponsesAPIResponse = cast(
        ResponsesAPIResponse,
        await _call_aresponses(
            input=input,
            model=model,
            tools=transformed_tools or None,
            **{**kwargs, "_is_litellm_internal_call": True},
        ),
    )
```
`include=["file_search_call.results"]` is forwarded to provider sub-calls

`emulated_kwargs` (built in `responses/main.py`) propagates the caller's `include` list directly to both internal `_call_aresponses` sub-calls:

```python
await _call_aresponses(
    input=input,
    model=model,
    tools=transformed_tools or None,
    **{**kwargs, "_is_litellm_internal_call": True},  # kwargs contains include=["file_search_call.results"]
)
```

`"file_search_call.results"` is an OpenAI-specific Responses API include value that non-native providers won't understand. When the sub-calls reach the underlying provider (e.g. Anthropic), the include param may be forwarded in the request body, causing a validation error or unexpected behavior.

The emulated handler should strip provider-unfriendly include values before forwarding to sub-calls:

```python
# Strip emulation-only include values before forwarding to sub-calls
_provider_safe_include = [v for v in _include if v != "file_search_call.results"]
internal_kwargs = {**kwargs, "_is_litellm_internal_call": True}
if _provider_safe_include != _include:
    internal_kwargs["include"] = _provider_safe_include or None
```
```python
    if _has_file_search_tool(tools) and (
        responses_api_provider_config is None
        or not responses_api_provider_config.supports_native_file_search()
    ):
        from litellm.responses.file_search.emulated_handler import (
            aresponses_with_emulated_file_search,
        )

        _internal_skip = {"litellm_call_id", "aresponses"}
        emulated_kwargs = {
            "include": include,
            "instructions": instructions,
            "max_output_tokens": max_output_tokens,
            "prompt": prompt,
            "metadata": metadata,
            "parallel_tool_calls": parallel_tool_calls,
            "previous_response_id": previous_response_id,
            "reasoning": reasoning,
            "store": store,
            "background": background,
            "stream": stream,
            "temperature": temperature,
            "text": text,
            "tool_choice": tool_choice,
            "top_p": top_p,
            "truncation": truncation,
            "user": user,
            "service_tier": service_tier,
            "safety_identifier": safety_identifier,
            "text_format": text_format,
            "allowed_openai_params": allowed_openai_params,
            "extra_headers": extra_headers,
            "extra_query": extra_query,
            "extra_body": extra_body,
            "timeout": timeout,
            "custom_llm_provider": custom_llm_provider,
            **{k: v for k, v in kwargs.items() if k not in _internal_skip},
        }
        if _is_async:
            return aresponses_with_emulated_file_search(
                input=input, model=model, tools=tools, **emulated_kwargs
            )
        return run_async_function(
            aresponses_with_emulated_file_search,
            input=input,
            model=model,
            tools=tools,
            **emulated_kwargs,
        )
```
Vector store IDs are decoded to native form before reaching the emulated handler
update_responses_tools_with_model_file_ids (which now decodes unified vector-store IDs to provider-native IDs via _decode_vector_store_ids_in_tools) runs at lines 656–665, before the emulated-path check. This means tools passed to aresponses_with_emulated_file_search already contains decoded provider-native IDs (e.g. vs_openai_real_abc) instead of LiteLLM-managed unified IDs.
Inside the emulated handler, these decoded IDs are extracted by _replace_file_search_tools and eventually passed to litellm.vector_stores.main.asearch(vector_store_id=vs_id, ...). If asearch requires the original LiteLLM-managed unified ID to route and authenticate the search correctly, the emulated flow will fail silently (warning logged, empty results returned) for any managed vector store.
The decode step should only run for the native path, or the emulated handler should receive the original tools before decoding. One approach: skip _decode_vector_store_ids_in_tools when the emulated path will be taken, or save the pre-decoded tools and pass those to the emulated handler.
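One way the "save the pre-decoded tools" approach could look is sketched below. The function and parameter names are hypothetical, not litellm's actual API; the point is only that the snapshot is taken before the decode pass mutates or replaces IDs.

```python
# Hypothetical sketch of the suggested ordering fix: snapshot the caller's
# tools before the decode pass so the emulated handler still receives the
# original LiteLLM unified vector-store IDs.
import copy

def select_tools_for_path(tools, supports_native_file_search, decode_fn):
    original_tools = copy.deepcopy(tools)  # snapshot before decoding
    decoded_tools = decode_fn(tools)       # native path needs provider-native IDs
    if supports_native_file_search:
        return decoded_tools
    return original_tools                  # emulated path keeps unified IDs
```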
```python
    rows = await prisma_client.db.litellm_managedvectorstorestable.find_many(
        where={"vector_store_id": {"in": cache_misses}},
        take=len(cache_misses),
    )
```
Cache result list is not merged before the not-found check
result accumulates rows found in cache before the DB fetch, but get_managed_vector_store_rows_by_uuids returns the combined result list. The caller (check_vector_store_ids_access) builds found_uuids from this return value and then iterates over uuid_to_unified to find missing entries. This is correct today, but the function returns only after processing cache_misses — the cache-hit items are included in result from the start.
This is fine, but it's worth noting that if the combined result is later filtered or re-ordered, the not-found assertion may raise false 403s. Consider adding a short inline comment explaining that result already contains cached hits and the function returns the merged list, to avoid future misreads of the logic.
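The cache-first / DB-fallback merge the comment describes can be sketched like this. Names are illustrative (the real helper is `get_managed_vector_store_rows_by_uuids` backed by Prisma and a TTL cache):

```python
# Sketch of the cache-first / DB-fallback pattern: cached rows land in
# `result` up front, only cache misses hit the DB, fetched rows warm the
# cache, and the merged list is returned.

def get_rows_by_ids(ids, cache, fetch_from_db):
    result, cache_misses = [], []
    for row_id in ids:
        row = cache.get(row_id)
        if row is not None:
            result.append(row)  # cache hit: included from the start
        else:
            cache_misses.append(row_id)
    if cache_misses:
        db_rows = fetch_from_db(cache_misses)
        for row in db_rows:
            cache[row["vector_store_id"]] = row  # warm the cache
        result.extend(db_rows)
    return result  # merged list: cache hits + DB rows
```

IDs found in neither the cache nor the DB simply don't appear in the return value, which is what lets the caller diff the requested IDs against the returned rows to detect not-found entries.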
Merged af036ef into BerriAI:litellm_dev_sameer_16_march_week
fixes LIT-2136
Summary
- Align emulated `file_search` behavior for non-native providers with OpenAI Responses output by fixing tool formatting and TypedDict field access
- Populate `file_search_call.search_results` when `include=["file_search_call.results"]` is requested, and support multi-query execution in the emulated flow

All previous Greptile review comments have been addressed.