
feat(file_search): align emulated Responses behavior with native output#23969

Merged
Sameerlite merged 19 commits into BerriAI:litellm_dev_sameer_16_march_week from
Sameerlite:litellm_file-search-emulated-alignment
Mar 20, 2026

Conversation


Sameerlite (Collaborator) commented Mar 18, 2026

fixes LIT-2136

Summary

  • align emulated file_search behavior for non-native providers with OpenAI Responses output by fixing tool formatting and TypedDict field access
  • populate file_search_call.search_results when include=["file_search_call.results"] is requested, and support multi-query execution in emulated flow
  • refresh docs with usage-first SDK/Proxy tabs, architecture diagram, and Q&A while removing duplicated path-by-path walkthroughs
    All previous Greptile review comments have been addressed.
[screenshots] This was the final code change; everything after it was doc fixes. The issues Greptile raises are not reproducing for me, but I will still open a follow-up PR to address them and bring the confidence score to 4+/5.

Sameerlite and others added 16 commits March 17, 2026 11:41
…hase 2 emulated fallback

Phase 1 (native passthrough):
- _decode_vector_store_ids_in_tools(): decode LiteLLM-managed unified
  vector_store_ids to provider-native IDs in file_search tools
- Split update_responses_tools_with_model_file_ids() into decode pass
  (always runs) + code_interpreter mapping pass (guarded)
- BaseResponsesAPIConfig.supports_native_file_search() → False by default;
  OpenAIResponsesAPIConfig overrides to True
- ManagedFiles.async_pre_call_hook(): batch team-level access check for
  unified vector_store_ids in file_search tools (no N+1)
- Docs: file_search section in response_api.md

Phase 2 (emulated fallback for non-native providers):
- litellm/responses/file_search/emulated_handler.py: converts file_search
  tool → function tool, intercepts tool call, runs asearch(), makes
  follow-up call, synthesizes OpenAI-format output (file_search_call +
  message + file_citation annotations)
- responses/main.py: routes to emulated handler when provider doesn't
  support file_search natively

Tests: 41 unit tests across 8 families (A-H) in test_file_search_responses.py

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
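The base-config default plus per-provider override described above can be sketched as follows. The class and method names mirror the commit message; the method bodies are assumptions, not LiteLLM's actual code.

```python
# Hedged sketch of the extension point: providers opt in to native
# file_search by overriding a base-config method that defaults to False.
class BaseResponsesAPIConfig:
    def supports_native_file_search(self) -> bool:
        # Default: assume the provider cannot run file_search natively,
        # so responses() falls back to the emulated handler.
        return False


class OpenAIResponsesAPIConfig(BaseResponsesAPIConfig):
    def supports_native_file_search(self) -> bool:
        # OpenAI's Responses API executes file_search server-side.
        return True
```

With this shape, the routing gate in responses/main.py only needs a single capability check rather than a provider allowlist.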
Covers both paths:
- Native passthrough (OpenAI/Azure): create vector store, run via SDK and proxy
- Emulated fallback (Anthropic/any): register managed store, run via SDK and proxy

Includes output format validation script and troubleshooting section.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
…avior

Ensure non-OpenAI emulated file_search matches native Responses output by populating search_results (when requested), fixing TypedDict field access, and supporting multi-query searches from tool calls.

Made-with: Cursor
Drop tools=[] from transformed chat-completion requests so providers like Anthropic return normal assistant text after tool_result turns.

Made-with: Cursor
…d Q&A

Replace duplicate path-by-path sections with a single usage-first doc format that includes SDK/Proxy tabs, an architecture diagram, and a focused Q&A section.

Made-with: Cursor
Replace inline file_search documentation in response_api.md with a canonical link and add the new tutorial to sidebars so users discover the usage-first guide.

Made-with: Cursor
Include all function_call items when building emulated follow-up input and update tests to assert real emulated routing + Responses-format function tool structure.

Made-with: Cursor
Forward explicit responses() params on emulated file search calls and preserve hidden params on synthesized responses so callback billing/logging context is retained.

Made-with: Cursor
Strip internal logging ids from emulated sub-calls, dedupe included search_results by file_id, clean unused imports, and add unit coverage for dedupe behavior.

Made-with: Cursor
…ext, cost tracking

- Remove dead `should_use_emulated_file_search` (main.py uses its own inline guard)
- Remove dead `fallback_vector_store_ids` param from `_run_vector_searches`
- Include all first_response.output items in follow_up_input so text blocks/reasoning
  from providers like Anthropic aren't dropped from conversation context
- Accumulate first provider call's response_cost into synthesized _hidden_params so
  billing callbacks see the total cost of both emulated-flow LLM calls
- Remove broad tools=[] filter from transformation.py (backward-incompatible); the
  follow-up call already passes tools=None which is filtered by the v is not None guard

Made-with: Cursor
…follow-up input

Pydantic model instances (ResponseFunctionToolCall, etc.) from first_response.output
were included raw in follow_up_input; the transformation layer expects plain dicts and
called .get() on them, raising AttributeError. Serialize via model_dump(exclude_none=True).
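A minimal illustration of this fix, using a stand-in pydantic model (ToolCall is hypothetical; LiteLLM's actual ResponseFunctionToolCall has more fields):

```python
from typing import Any, Dict, List, Union

from pydantic import BaseModel


class ToolCall(BaseModel):
    # Stand-in for ResponseFunctionToolCall; this field set is an assumption.
    type: str = "function_call"
    name: str
    arguments: Union[str, None] = None


def normalize_output_items(
    items: List[Union[BaseModel, Dict[str, Any]]]
) -> List[Dict[str, Any]]:
    """Serialize pydantic items to plain dicts so downstream .get() calls work."""
    return [
        item.model_dump(exclude_none=True) if isinstance(item, BaseModel) else item
        for item in items
    ]
```

After normalization, a transformation layer can safely call item.get("type") on every element without raising AttributeError on model instances.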

Made-with: Cursor
…act DB helper, clean docstring

- Re-add should_use_emulated_file_search() to emulated_handler.py so H5/H6/H7/H13 tests don't fail with ImportError
- Remove per-file-id deduplication from _build_search_results_for_include so all chunks are returned (matching OpenAI native file_search behaviour); update test_H14 to assert 2 results
- Extract raw prisma DB query in check_vector_store_ids_access into a static _fetch_managed_vector_stores_by_uuids helper so the hot request path uses a named, testable function instead of an inline prisma_client.db.* call
- Remove developer-local path from test module docstring

Made-with: Cursor
…ueries-plural test

- Promote _fetch_managed_vector_stores_by_uuids from @staticmethod to a module-level
  async helper get_managed_vector_store_rows_by_uuids, following the same standalone
  helper pattern as get_team_object / get_key_object so the hot-path DB read is a
  named importable function rather than an inline prisma_client.db.* call
- Pass no-log=True to both inner _call_aresponses sub-calls so they do not fire
  independent billing/monitoring callbacks; cost is accumulated in the synthesized
  response's _hidden_params for the outer responses() call
- Add test_H11b covering the primary queries (plural array) function-tool schema,
  complementing H11 which exercises only the backward-compat singular query path

Made-with: Cursor

vercel bot commented Mar 18, 2026

The latest updates on your projects.

Project: litellm
Deployment status: Error
Updated (UTC): Mar 20, 2026 11:10am



codspeed-hq bot commented Mar 18, 2026

Merging this PR will not alter performance

✅ 16 untouched benchmarks


Comparing Sameerlite:litellm_file-search-emulated-alignment (32ded9b) with main (cec3e9e)


greptile-apps bot commented Mar 18, 2026

Greptile Summary

This PR introduces emulated file_search support for non-native providers (Anthropic, Bedrock, etc.) in the Responses API. The emulated handler converts file_search tools to function tools, intercepts the model's tool call, runs vector search, and synthesizes an OpenAI-compatible response. It also adds team-scoped access control for managed vector stores in the pre-call hook, a new supports_native_file_search() extension point on provider configs, and a _is_litellm_internal_call flag to suppress duplicate billing callbacks for the two internal sub-calls the emulated flow makes.

Key issues found:

  • Decoded tools forwarded to emulated handler (responses/main.py): update_responses_tools_with_model_file_ids (which now calls _decode_vector_store_ids_in_tools) executes before the emulated-path routing check. The emulated handler therefore receives provider-native vector store IDs instead of the original LiteLLM unified IDs. If asearch requires unified IDs to route managed store lookups, this will cause silent search failures and uninformed model responses.
  • Silent access-control bypass (managed_files.py): check_vector_store_ids_access silently returns (no-ops) when prisma_client is None, allowing any caller to use any managed vector store without an access check when the DB is unavailable.
  • Dead code: should_use_emulated_file_search (emulated_handler.py): Defined and tested (H5–H7, H13), but never called from production routing in responses/main.py. The production gate uses its own inline duplicate of the same logic. Tests for this function do not protect the actual production routing.
  • Docs inconsistency (file_search_responses_api.md): JSON example shows "search_results": null in a section where the surrounding code uses include=["file_search_call.results"], which would actually produce populated results.

Confidence Score: 2/5

  • Not safe to merge without addressing the decoded-tools issue and the silent access-control bypass in the managed files hook.
  • Two P1 issues were identified: (1) the emulated handler receives decoded provider-native vector store IDs after _decode_vector_store_ids_in_tools runs before the emulated path check, which would silently break managed vector store lookups in emulated mode; (2) the team-scoped access control in check_vector_store_ids_access silently no-ops when prisma_client is None, creating a security bypass. The first issue directly undermines the core feature being added (emulated file_search with managed vector stores). Both need resolution before merge.
  • litellm/responses/main.py (decoded tools before emulated routing), enterprise/litellm_enterprise/proxy/hooks/managed_files.py (silent access-control bypass when DB unavailable)

Important Files Changed

litellm/responses/file_search/emulated_handler.py
  New emulated file_search handler: correctly replaces file_search tools with a function tool, handles multi-query execution, and synthesizes OpenAI-format output. should_use_emulated_file_search is defined but dead code (not called from production routing). include forwarding to sub-calls (unstripped "file_search_call.results") was flagged in prior review and is still present.

litellm/responses/main.py
  Adds emulated file_search routing gate. Issue: update_responses_tools_with_model_file_ids (which now decodes unified vector-store IDs) runs before the emulated path check, so the emulated handler receives decoded provider-native IDs instead of the original LiteLLM unified IDs, breaking managed vector store lookups in the emulated flow.

enterprise/litellm_enterprise/proxy/hooks/managed_files.py
  Adds vector-store access control for Responses API calls. Logic is well-structured, using batch DB fetch and cache-first lookup. Silent no-op when prisma_client is None creates a security bypass where access checks are skipped entirely when the DB is unavailable.

litellm/proxy/auth/auth_checks.py
  New get_managed_vector_store_rows_by_uuids helper follows the established cache-first / DB-fallback pattern correctly. Cache key scheme and TTL usage are consistent with existing helpers.

litellm/litellm_core_utils/prompt_templates/common_utils.py
  New _decode_vector_store_ids_in_tools correctly decodes LiteLLM-managed unified vector-store IDs to provider-native IDs. The decode runs as Pass 1 in update_responses_tools_with_model_file_ids, but this placement causes issues for the emulated path (see the responses/main.py comment).

litellm/utils.py
  Adds _is_litellm_internal_call flag handling in wrapper_async to suppress billing/logging callbacks for emulated file-search sub-calls. Pattern is clean and correctly gates both success and failure handlers.

tests/test_litellm/llms/test_file_search_responses.py
  Comprehensive mock-only test suite covering decoding, access control, routing, and emulated handler flows. No real network calls. Tests H5–H7 and H13 cover should_use_emulated_file_search, which is dead code in production; these tests do not protect the actual routing logic in responses/main.py.

docs/my-website/docs/tutorials/file_search_responses_api.md
  New tutorial with architecture diagram, SDK/Proxy usage tabs, and Q&A. Minor inconsistency: JSON example shows "search_results": null but the surrounding code uses include=["file_search_call.results"], which would produce populated results.
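The callback-suppression flag described for litellm/utils.py could look roughly like this; the wrapper shape, flag plumbing, and callback list are assumptions used for illustration, not LiteLLM's actual implementation:

```python
import functools
from typing import Any, Callable, List

fired_callbacks: List[str] = []  # stand-in for real billing/logging callbacks


def wrapper_async(func: Callable[..., Any]) -> Callable[..., Any]:
    @functools.wraps(func)
    async def wrapped(*args: Any, **kwargs: Any) -> Any:
        # Internal sub-calls (e.g. the two LLM calls the emulated file_search
        # flow makes) set this flag so they skip billing/logging callbacks;
        # only the outer responses() call reports the accumulated cost.
        is_internal = kwargs.pop("_is_litellm_internal_call", False)
        try:
            result = await func(*args, **kwargs)
            if not is_internal:
                fired_callbacks.append("success")
            return result
        except Exception:
            if not is_internal:
                fired_callbacks.append("failure")
            raise

    return wrapped
```

Gating both the success and failure paths on the same flag matches the reviewer's note that the pattern "correctly gates both success and failure handlers".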

Sequence Diagram

sequenceDiagram
    participant Client
    participant responses_main as responses/main.py
    participant decode as _decode_vector_store_ids_in_tools
    participant emulated as aresponses_with_emulated_file_search
    participant provider as Provider (e.g. Anthropic)
    participant vsearch as vector_stores.main.asearch

    Client->>responses_main: responses(model, tools=[file_search], include=[...])
    responses_main->>decode: update_responses_tools_with_model_file_ids(tools)
    decode-->>responses_main: tools (unified IDs → native IDs)
    Note over responses_main: ⚠️ emulated path now gets decoded (native) IDs

    alt Provider supports native file_search (OpenAI/Azure)
        responses_main->>provider: Forward request unchanged
        provider-->>Client: Native file_search response
    else Provider does NOT support native file_search
        responses_main->>emulated: aresponses_with_emulated_file_search(tools=decoded_tools)
        emulated->>emulated: _replace_file_search_tools → function tool
        emulated->>provider: 1st call: aresponses(tools=[litellm_file_search fn])
        provider-->>emulated: function_call(queries=[...], vector_store_id=...)
        emulated->>vsearch: asearch(vector_store_id=decoded_native_id, query)
        vsearch-->>emulated: search results
        emulated->>provider: 2nd call: aresponses(input=context+results, tools=None)
        provider-->>emulated: final answer text
        emulated->>emulated: synthesize file_search_call + message output
        emulated-->>Client: OpenAI-format ResponsesAPIResponse
    end

Comments Outside Diff (1)

  1. enterprise/litellm_enterprise/proxy/hooks/managed_files.py, line 385-412 (link)

    P1 check_vector_store_ids_access silently no-ops when prisma_client is None

    if not vector_store_ids or prisma_client is None:
        return  # ← access check is silently skipped

    In production, prisma_client is None when the DB has not been initialised (e.g. during tests, or if the proxy is started without --config). When that happens, any caller — regardless of team — can use any managed vector store without any access check.

    This is a security bypass. Consider raising an explicit error (or at least logging a warning with verbose_logger.warning(...)) when prisma_client is None and there are unified vector-store IDs to verify, rather than silently permitting access. Other access helpers in auth_checks.py follow a similar guard, so this should be consistent with the existing pattern for security-sensitive skips.
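A hedged sketch of the fail-closed variant suggested here; the exception type and function signature are assumptions, not the hook's actual API:

```python
from typing import List, Optional


async def check_vector_store_ids_access(
    vector_store_ids: List[str],
    prisma_client: Optional[object],
) -> None:
    if not vector_store_ids:
        return  # nothing to verify
    if prisma_client is None:
        # Fail closed: refuse rather than silently granting access when
        # the DB (and therefore the access check) is unavailable.
        raise RuntimeError(
            "Cannot verify access to managed vector stores "
            f"{vector_store_ids}: DB client is not initialised"
        )
    # ... existing batch DB fetch + team-level check would continue here ...
```

Whether to raise or merely log a warning is a policy decision; the point is that the skip should be explicit, not silent.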

Last reviewed commit: "fix doc"

Comment on lines +426 to +435
isinstance(item, dict)
and item.get("type") == "function_call"
and item.get("name") == FILE_SEARCH_FUNCTION_NAME
)
or (
hasattr(item, "type")
and getattr(item, "type") == "function_call"
and getattr(item, "name", None) == FILE_SEARCH_FUNCTION_NAME
)
]

P1 search_results is None instead of [] when no results returned

When include=["file_search_call.results"] is requested but the vector search finds nothing (empty results list), the condition if include_search_results and results is falsy, so search_results stays None. Clients that request the include parameter expect a list type, not None. An empty [] is the correct sentinel value matching OpenAI's native API behaviour.

Suggested change
isinstance(item, dict)
and item.get("type") == "function_call"
and item.get("name") == FILE_SEARCH_FUNCTION_NAME
)
or (
hasattr(item, "type")
and getattr(item, "type") == "function_call"
and getattr(item, "name", None) == FILE_SEARCH_FUNCTION_NAME
)
]
search_results: Optional[List[Dict[str, Any]]] = None
if include_search_results:
search_results = _build_search_results_for_include(results) if results else []

Comment on lines +542 to +544
message_output=_build_message_output(response_text, all_results),
first_response=first_response,
)

P2 _call_aresponses passes positional args, but aresponses signature is keyword-only

_call_aresponses(input, model, tools, **kwargs) forwards input, model, and tools as positional arguments to aresponses. If the aresponses signature ever reorders these parameters (common during refactors), this will silently pass wrong values. Prefer explicit keyword arguments:

Suggested change
message_output=_build_message_output(response_text, all_results),
first_response=first_response,
)
async def _call_aresponses(input, model, tools, **kwargs): # pragma: no cover – thin wrapper for patching in tests
from litellm.responses.main import aresponses
return await aresponses(input=input, model=model, tools=tools, **kwargs)

@Sameerlite Sameerlite changed the base branch from main to litellm_dev_sameer_16_march_week March 20, 2026 10:56
Comment on lines +408 to +418
# 2. First provider call — provider will call the file_search function.
# Mark as an internal sub-call so wrapper_async skips billing callbacks;
# the parent litellm_logging_obj (propagated via kwargs) fires once at the end.
first_response: ResponsesAPIResponse = cast(
ResponsesAPIResponse,
await _call_aresponses(
input=input,
model=model,
tools=transformed_tools or None,
**{**kwargs, "_is_litellm_internal_call": True},
),

P1 include=["file_search_call.results"] is forwarded to provider sub-calls

emulated_kwargs (built in responses/main.py) propagates the caller's include list directly to both internal _call_aresponses sub-calls:

await _call_aresponses(
    input=input,
    model=model,
    tools=transformed_tools or None,
    **{**kwargs, "_is_litellm_internal_call": True},  # kwargs contains include=["file_search_call.results"]
)

"file_search_call.results" is an OpenAI-specific Responses API include value that non-native providers won't understand. When the sub-calls reach the underlying provider (e.g. Anthropic), the include param may be forwarded in the request body, causing a validation error or unexpected behavior.

The emulated handler should strip provider-unfriendly include values before forwarding to sub-calls:

# Strip emulation-only include values before forwarding to sub-calls
_provider_safe_include = [v for v in _include if v != "file_search_call.results"]
internal_kwargs = {**kwargs, "_is_litellm_internal_call": True}
if _provider_safe_include != _include:
    internal_kwargs["include"] = _provider_safe_include or None

Comment on lines 724 to +775
)
)

if _has_file_search_tool(tools) and (
responses_api_provider_config is None
or not responses_api_provider_config.supports_native_file_search()
):
from litellm.responses.file_search.emulated_handler import (
aresponses_with_emulated_file_search,
)

_internal_skip = {"litellm_call_id", "aresponses"}
emulated_kwargs = {
"include": include,
"instructions": instructions,
"max_output_tokens": max_output_tokens,
"prompt": prompt,
"metadata": metadata,
"parallel_tool_calls": parallel_tool_calls,
"previous_response_id": previous_response_id,
"reasoning": reasoning,
"store": store,
"background": background,
"stream": stream,
"temperature": temperature,
"text": text,
"tool_choice": tool_choice,
"top_p": top_p,
"truncation": truncation,
"user": user,
"service_tier": service_tier,
"safety_identifier": safety_identifier,
"text_format": text_format,
"allowed_openai_params": allowed_openai_params,
"extra_headers": extra_headers,
"extra_query": extra_query,
"extra_body": extra_body,
"timeout": timeout,
"custom_llm_provider": custom_llm_provider,
**{k: v for k, v in kwargs.items() if k not in _internal_skip},
}
if _is_async:
return aresponses_with_emulated_file_search(
input=input, model=model, tools=tools, **emulated_kwargs
)
return run_async_function(
aresponses_with_emulated_file_search,
input=input,
model=model,
tools=tools,
**emulated_kwargs,
)

P1 Vector store IDs are decoded to native form before reaching the emulated handler

update_responses_tools_with_model_file_ids (which now decodes unified vector-store IDs to provider-native IDs via _decode_vector_store_ids_in_tools) runs at lines 656–665, before the emulated-path check. This means tools passed to aresponses_with_emulated_file_search already contains decoded provider-native IDs (e.g. vs_openai_real_abc) instead of LiteLLM-managed unified IDs.

Inside the emulated handler, these decoded IDs are extracted by _replace_file_search_tools and eventually passed to litellm.vector_stores.main.asearch(vector_store_id=vs_id, ...). If asearch requires the original LiteLLM-managed unified ID to route and authenticate the search correctly, the emulated flow will fail silently (warning logged, empty results returned) for any managed vector store.

The decode step should only run for the native path, or the emulated handler should receive the original tools before decoding. One approach: skip _decode_vector_store_ids_in_tools when the emulated path will be taken, or save the pre-decoded tools and pass those to the emulated handler.
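One way to sketch the "save the pre-decoded tools" option; the function name and decode callback here are illustrative, not the PR's actual code:

```python
import copy
from typing import Any, Callable, Dict, List, Optional, Tuple

Tools = List[Dict[str, Any]]


def prepare_tools_for_routing(
    tools: Tools,
    supports_native_file_search: bool,
    decode_fn: Callable[[Tools], Tools],
) -> Tuple[Tools, Optional[Tools]]:
    """Return (tools for the native provider call, tools for the emulated handler)."""
    # Snapshot before decoding so unified LiteLLM IDs survive for asearch().
    original_tools = copy.deepcopy(tools)
    decoded_tools = decode_fn(tools)
    if supports_native_file_search:
        return decoded_tools, None
    # Emulated path gets the pre-decode snapshot so managed-store lookups
    # can still route by the unified ID.
    return decoded_tools, original_tools
```

This keeps the decode pass unconditional for the native path while giving the emulated handler the IDs it actually needs.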

Comment on lines +2336 to +2339
rows = await prisma_client.db.litellm_managedvectorstorestable.find_many(
where={"vector_store_id": {"in": cache_misses}},
take=len(cache_misses),
)

P2 Cache result list is not merged before the not-found check

result accumulates rows found in cache before the DB fetch, but get_managed_vector_store_rows_by_uuids returns the combined result list. The caller (check_vector_store_ids_access) builds found_uuids from this return value and then iterates over uuid_to_unified to find missing entries. This is correct today, but the function returns only after processing cache_misses — the cache-hit items are included in result from the start.

This is fine, but it's worth noting that if the combined result is later filtered or re-ordered, the not-found assertion may raise false 403s. Consider adding a short inline comment explaining that result already contains cached hits and the function returns the merged list, to avoid future misreads of the logic.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
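The cache-first / DB-fallback shape under discussion, sketched with stand-in cache and fetch functions (not the helper's real signature):

```python
from typing import Any, Awaitable, Callable, Dict, List


async def get_rows_by_ids(
    ids: List[str],
    cache: Dict[str, Dict[str, Any]],
    fetch_from_db: Callable[[List[str]], Awaitable[List[Dict[str, Any]]]],
) -> List[Dict[str, Any]]:
    result: List[Dict[str, Any]] = []  # cache hits accumulate here first
    cache_misses: List[str] = []
    for _id in ids:
        row = cache.get(_id)
        if row is not None:
            result.append(row)
        else:
            cache_misses.append(_id)
    if cache_misses:
        # `result` already holds the cached hits; DB rows for the misses are
        # merged in, so the caller's not-found check sees the combined list.
        result.extend(await fetch_from_db(cache_misses))
    return result
```

The inline comment before the merge is exactly the kind of note the reviewer suggests adding, so future readers do not assume the return value covers only the DB fetch.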

@Sameerlite
Collaborator Author

[screenshot] This was the final code change; after that, all were doc fixes.

@Sameerlite Sameerlite merged commit af036ef into BerriAI:litellm_dev_sameer_16_march_week Mar 20, 2026
58 of 65 checks passed
