fix: surface Anthropic code execution results as code_interpreter_call in Responses API #23784
Greptile Summary

This PR closes three gaps that prevented Anthropic server-side code execution results from reaching the Responses API output. What changed:
Confidence Score: 4/5
| Filename | Overview |
|---|---|
| litellm/llms/anthropic/chat/handler.py | Adds _server_tool_inputs, tool_results, _current_server_tool_id, and _container_id to ModelResponseIterator.__init__. Tracks server-tool-use inputs (assembled from input_json_delta deltas at content_block_stop) and builds cumulative code_interpreter_results at content_block_start for tool results. The streaming architecture is sound and tests cover delta assembly, multiple executions, empty output, and non-bash tool types. |
| litellm/llms/anthropic/chat/transformation.py | Non-streaming path fix: builds code_interpreter_results (list of OutputCodeInterpreterCall) alongside tool_results in provider_specific_fields, then propagates the whole dict into _hidden_params before assigning it to model_response. This closes the gap where the Responses API adapter could never see these fields. |
| litellm/responses/litellm_completion_transformation/streaming_iterator.py | Adds _accumulated_provider_specific_fields and a _merge_provider_specific_fields helper using last-value-wins semantics (matching stream_chunk_builder). Accumulates fields from both chunk.provider_specific_fields and chunk.choices[0].delta.provider_specific_fields in both async and sync iteration paths. On assembly, merges into response._hidden_params["provider_specific_fields"] via setdefault+update. |
| litellm/responses/litellm_completion_transformation/transformation.py | Adds _extract_tool_result_output_items to read pre-built OutputCodeInterpreterCall objects from message.provider_specific_fields["code_interpreter_results"] and performs in-place substitution of matching function_call items to preserve output ordering. |
| litellm/types/responses/main.py | Adds OutputCodeInterpreterCallLog, OutputCodeInterpreterCall Pydantic models matching OpenAI's ResponseCodeInterpreterToolCall shape, and the build_code_interpreter_log_outputs helper. The helper contains Anthropic-specific field names and is placed in a shared types file outside llms/, slightly violating the provider-specific code placement rule. |
| litellm/types/llms/openai.py | Adds OutputCodeInterpreterCall to the ResponsesAPIResponse.output union type, allowing the type system to recognise these items in responses. Remainder of changes are formatting-only (Black reformatting of long type annotations). |
| tests/test_litellm/llms/anthropic/chat/test_code_interpreter_results_extraction.py | New test file with thorough unit and mock end-to-end tests: Pydantic/dict extraction, empty list, ordering preservation (in-place substitution), and a full streaming pipeline mock test exercising the ModelResponseIterator → stream_chunk_builder → _extract_tool_result_output_items chain. All tests use mocks with no real network calls. |
| tests/test_litellm/llms/anthropic/chat/test_anthropic_chat_handler.py | New streaming tests cover: single code execution producing code_interpreter_results, multiple executions emitting cumulative lists (last-value-wins), delta-assembled input, empty stdout/stderr producing outputs=None, and non-bash tool results being skipped. All tests are pure mock tests with no network calls. |
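The last-value-wins accumulation described for `streaming_iterator.py` can be sketched as follows (a minimal sketch; the real helper is a method on `LiteLLMCompletionStreamingIterator`, and the names here are illustrative):

```python
from typing import Any, Dict, Optional

def merge_provider_specific_fields(
    accumulated: Dict[str, Any],
    incoming: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    # Last value wins for every key, including list-valued keys such as
    # code_interpreter_results: each chunk is expected to carry the
    # cumulative list, so replacing (not extending) matches
    # stream_chunk_builder semantics.
    if incoming:
        accumulated.update(incoming)
    return accumulated

# Two chunks each carrying a cumulative list: the later one replaces the earlier.
acc: Dict[str, Any] = {}
merge_provider_specific_fields(acc, {"code_interpreter_results": [{"id": "a"}]})
merge_provider_specific_fields(
    acc, {"code_interpreter_results": [{"id": "a"}, {"id": "b"}]}
)
```

This is why the handler must emit cumulative lists: if a chunk carried only its own new items, the merge would silently drop earlier executions.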
Sequence Diagram
```mermaid
sequenceDiagram
    participant Client
    participant ResponsesAPI as Responses API<br/>(transformation.py)
    participant StreamIter as LiteLLMCompletionStreamingIterator
    participant AnthropicHandler as Anthropic ModelResponseIterator<br/>(handler.py / transformation.py)
    participant SCB as stream_chunk_builder
    Client->>ResponsesAPI: responses.create()
    alt Non-streaming path
        AnthropicHandler->>AnthropicHandler: extract_response_content()<br/>→ tool_calls, tool_results
        AnthropicHandler->>AnthropicHandler: Build code_interpreter_results<br/>(OutputCodeInterpreterCall list)
        AnthropicHandler->>AnthropicHandler: provider_specific_fields["code_interpreter_results"] = [...]
        AnthropicHandler->>AnthropicHandler: _hidden_params["provider_specific_fields"] = provider_specific_fields
        AnthropicHandler-->>ResponsesAPI: ModelResponse (with _hidden_params)
        ResponsesAPI->>ResponsesAPI: _extract_tool_result_output_items()<br/>reads message.provider_specific_fields
        ResponsesAPI->>ResponsesAPI: In-place substitute function_call → code_interpreter_call
        ResponsesAPI-->>Client: ResponsesAPIResponse (with code_interpreter_call items)
    else Streaming path
        loop Each SSE chunk
            AnthropicHandler->>AnthropicHandler: chunk_parser(): content_block_start<br/>for server_tool_use → track _server_tool_inputs
            AnthropicHandler->>AnthropicHandler: content_block_stop: assemble input_json_delta → _server_tool_inputs[id]
            AnthropicHandler->>AnthropicHandler: content_block_start for tool_result<br/>→ _build_code_interpreter_results()<br/>delta.provider_specific_fields["code_interpreter_results"] = [...]
            AnthropicHandler-->>StreamIter: ModelResponseStream chunk
            StreamIter->>StreamIter: _merge_provider_specific_fields()<br/>(last-value-wins accumulation)
        end
        StreamIter->>SCB: stream_chunk_builder(collected_chunks)
        SCB->>SCB: last-value-wins merge of<br/>delta.provider_specific_fields → message.provider_specific_fields
        SCB-->>StreamIter: ModelResponse
        StreamIter->>StreamIter: setdefault("provider_specific_fields",{}).update(<br/>_accumulated_provider_specific_fields)
        StreamIter-->>ResponsesAPI: ModelResponse (with _hidden_params + message.psf)
        ResponsesAPI->>ResponsesAPI: _extract_tool_result_output_items()<br/>reads message.provider_specific_fields["code_interpreter_results"]
        ResponsesAPI->>ResponsesAPI: In-place substitute function_call → code_interpreter_call
        ResponsesAPI-->>Client: response.completed event (with code_interpreter_call items)
    end
```
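The diagram's "In-place substitute function_call → code_interpreter_call" step can be sketched as follows (illustrative names and shapes, not litellm's actual signatures):

```python
from typing import Any, Dict, List

def substitute_in_place(
    output: List[Dict[str, Any]],
    interpreter_calls: Dict[str, Dict[str, Any]],
) -> List[Dict[str, Any]]:
    # Replace each function_call item whose id has a matching pre-built
    # code_interpreter_call item, keeping every other item and the overall
    # ordering intact.
    for i, item in enumerate(output):
        if item.get("type") == "function_call" and item.get("id") in interpreter_calls:
            output[i] = interpreter_calls[item["id"]]
    return output

# A message, a server-side tool call, and another message: only the middle
# item is substituted, so ordering is preserved.
output = [
    {"type": "message", "id": "msg_1"},
    {"type": "function_call", "id": "srv_1"},
    {"type": "message", "id": "msg_2"},
]
substitute_in_place(
    output, {"srv_1": {"type": "code_interpreter_call", "id": "srv_1"}}
)
```

Substituting in place, rather than appending, is what keeps the output ordering identical to what the model produced.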
Comments Outside Diff (1)
litellm/types/responses/main.py, lines 70-85: provider-specific helper function in a shared types file.

`build_code_interpreter_log_outputs` contains Anthropic-specific knowledge: it parses the `stdout` and `stderr` keys that come from Anthropic's `bash_code_execution_result` format, and even its own docstring names it "Anthropic bash_code_execution." Placing this logic in `litellm/types/responses/main.py` (a provider-agnostic types file outside `llms/`) violates the rule to keep provider-specific code inside `litellm/llms/`. Since both callers (`handler.py` and `transformation.py`) are already in `litellm/llms/anthropic/chat/`, a straightforward fix is to move the function there (e.g., into a small `utils.py` in that package) and import it from each caller:

```python
# litellm/llms/anthropic/chat/utils.py
from litellm.types.responses.main import OutputCodeInterpreterCallLog

def build_code_interpreter_log_outputs(content):
    ...
```

Then in `handler.py` and `transformation.py`, import from `litellm.llms.anthropic.chat.utils` instead of `litellm.types.responses.main`.

Rule Used: Avoid writing provider-specific code outside... (source)
Last reviewed commit: "refactor: extract du..."
When the shared aiohttp session closes (due to network interruption, idle timeout, or Redis failover side effects), the proxy permanently falls back to creating a new HTTPS connection per request, losing the benefit of connection pooling for the entire pod lifetime. Fix: make add_shared_session_to_data() async and recreate the session when it is found closed, restoring connection pooling automatically. Fixes BerriAI#23806
… recreation

When multiple requests detect a closed shared session simultaneously, they would each create a new aiohttp.ClientSession, leaking intermediate sessions and their TCP connectors. Added a double-checked locking pattern with asyncio.Lock to ensure only one coroutine recreates the session. Added a concurrent recreation test case.
Address Greptile P1 review: tests that exercise the closed-session code path need to reset the module-level lock to avoid RuntimeError on Python < 3.10 when asyncio.Lock is reused across different event loops.
@greptileai review
@greptileai review

Changes since last review:
Address review feedback from greptile — use new_callable=AsyncMock on the concurrent test's patch.object to ensure the mock is properly typed as async, even though side_effect already handles the coroutine.
…arams-anthropic-document-file-message-blocks
[Infra] Security and Proxy Extras for Nightly: only known flaky tests failing. The fix for security and proxy extras worked.
…session-auto-recovery

fix: auto-recover shared aiohttp session when closed
…l in Responses API

PR BerriAI#18945 added support for capturing Anthropic server-side tool results (bash_code_execution_tool_result, etc.) in provider_specific_fields, but the data never reached the Responses API output because:

1. Non-streaming: provider_specific_fields wasn't copied into _hidden_params
2. Streaming: chunk delta's provider_specific_fields wasn't accumulated
3. Tool results weren't mapped to standard output items

This fix:

- Copies provider_specific_fields to _hidden_params in transform_response()
- Accumulates provider_specific_fields from streaming chunk deltas
- Maps bash_code_execution_tool_result to code_interpreter_call output items with code and outputs (matching OpenAI's native shape)
- Removes redundant function_call items for server-side tools
- Adds OutputCodeInterpreterCall type to the output union
…cutions
stream_chunk_builder uses "last value wins" for list-valued
provider_specific_fields keys. _build_code_interpreter_results was
emitting only new items (incremental), so earlier results were silently
dropped when multiple sequential code executions occurred.
- Emit cumulative list from _build_code_interpreter_results, matching
web_search_results pattern
- Assemble server_tool_use input from input_json_delta deltas at
content_block_stop (Anthropic streams input: {} in start block)
- Handle dict items in _extract_tool_result_output_items after
model_dump() serialization in stream_chunk_builder
- Simplify _merge_provider_specific_fields to last-value-wins for lists,
matching stream_chunk_builder semantics
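The delta-assembly step described above can be sketched as follows (a hypothetical reconstruction; the fragment contents are made up):

```python
import json

# Anthropic streams `input: {}` in the server_tool_use start block and the
# real arguments as input_json_delta partial_json fragments, so the input is
# only parseable once content_block_stop arrives.
fragments = ['{"com', 'mand": "ec', 'ho hello"}']

buffer = ""
for partial_json in fragments:   # one append per input_json_delta event
    buffer += partial_json
tool_input = json.loads(buffer)  # parsed at content_block_stop
```

Parsing any single fragment would fail with a `json.JSONDecodeError`, which is why assembly must be deferred until the block stops.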
When both stdout and stderr are empty strings, the `if parts else str(content)` fallback produced the raw dict representation as logs. Drop the fallback so logs is correctly empty.
- Empty stdout/stderr now produces outputs=None (matching OpenAI parity)
instead of outputs=[{logs:""}], in both streaming and non-streaming paths
- Fix test fixture to use real Anthropic type "bash_code_execution_tool_result"
instead of "code_execution_tool_result"
- Add test for empty-output → outputs=None behavior
- Add unit tests for _extract_tool_result_output_items: Pydantic objects,
plain dicts (post-model_dump), empty/missing provider_specific_fields,
and in-place substitution preserving output ordering
Replace str(content) fallback with empty string so non-dict content (e.g. list-shaped text_editor results) produces outputs=None instead of raw Python object representations in logs.
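The stdout/stderr-to-logs mapping described in these messages can be sketched as follows (a hypothetical stand-in for `build_code_interpreter_log_outputs`, not litellm's exact code):

```python
from typing import Any, Dict, List, Optional

def build_log_outputs(content: Any) -> Optional[List[Dict[str, str]]]:
    # Returns None (OpenAI parity) instead of raw object reprs or
    # [{"logs": ""}] when there is nothing to log.
    if not isinstance(content, dict):
        return None  # e.g. list-shaped text_editor results
    parts = [content.get(key, "") for key in ("stdout", "stderr")]
    logs = "".join(part for part in parts if part)
    if not logs:
        return None  # empty stdout/stderr -> outputs=None
    return [{"type": "logs", "logs": logs}]
```

The two `None` returns correspond to the two fixes above: dropping the `str(content)` fallback for empty output, and dropping it again for non-dict content.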
… only

Skip non-bash tool result types (e.g. text_editor_code_execution_tool_result) to avoid producing empty code_interpreter_call items in Responses API output.
…n test

- test_non_bash_tool_result_skipped: verifies text_editor results produce zero code_interpreter_call items
- test_end_to_end_streaming_chunks_to_code_interpreter_output: exercises the full path from Anthropic SSE chunks through ModelResponseIterator, stream_chunk_builder, and _extract_tool_result_output_items without a live server
- Populate container_id on streaming code_interpreter_results by re-emitting at message_delta when container info arrives
- Reconstruct Pydantic OutputCodeInterpreterCall objects from plain dicts in _extract_tool_result_output_items so responses_output has uniform types across streaming and non-streaming paths
Merged ef3b05b into BerriAI:litellm_oss_staging_03_18_2026
Relevant issues
Extends #18945: Anthropic tool_results were captured in `provider_specific_fields` but never reached the Responses API output.

Pre-Submission checklist
- Added at least 1 test in the `tests/litellm/` directory (a hard requirement - see details)
- `make test-unit` passes
- Reviewed by `@greptileai` and received a Confidence Score of at least 4/5 before requesting a maintainer review

Type
🐛 Bug Fix
Changes
PR #18945 added `provider_specific_fields["tool_results"]` for Anthropic server-side tool results (`bash_code_execution_tool_result`, `text_editor_code_execution_tool_result`). However, consumers using the Responses API (`responses.create()`) never see this data because of three gaps:

1. Non-streaming: `provider_specific_fields` not in `_hidden_params`

File: `litellm/llms/anthropic/chat/transformation.py`

`transform_response()` builds `provider_specific_fields` and sets it on the message, but never copies it into `_hidden_params`. The Responses API adapter at `transformation.py:1585` checks `_hidden_params.get("provider_specific_fields")` and always finds `None`.

Fix: Add `_hidden_params["provider_specific_fields"] = provider_specific_fields` before `model_response._hidden_params = _hidden_params`.

2. Streaming: delta `provider_specific_fields` not accumulated

File: `litellm/responses/litellm_completion_transformation/streaming_iterator.py`

The Anthropic streaming handler sets `tool_results` on `delta.provider_specific_fields`, but `LiteLLMCompletionStreamingIterator` never accumulates these across chunks. The assembled `ModelResponse` has no `provider_specific_fields` in its `_hidden_params`.

Fix: Add an `_accumulated_provider_specific_fields` dict, collect from chunk and delta in both `__anext__` and `__next__`, and inject into `response._hidden_params` in `create_litellm_model_response()`.

3. Tool results mapped to standard `code_interpreter_call` output items

File: `litellm/responses/litellm_completion_transformation/transformation.py`

Even with the above fixes, `tool_results` only appears in `provider_specific_fields`, not as standard output items. Anthropic's `bash_code_execution_tool_result` is semantically identical to OpenAI's `code_interpreter_call`.

Fix: Add `_extract_tool_result_output_items()` that maps each `bash_code_execution_tool_result` to an `OutputCodeInterpreterCall` instance with `code` (from the matching tool_call) and `outputs` (logs from stdout/stderr via `OutputCodeInterpreterCallLog`). The redundant `function_call` items for server-side tools are removed from the output, so the final shape matches OpenAI's native Responses API.

4. New types: `OutputCodeInterpreterCall` and `OutputCodeInterpreterCallLog`

Files: `litellm/types/responses/main.py`, `litellm/types/llms/openai.py`

Added `OutputCodeInterpreterCall` and `OutputCodeInterpreterCallLog` Pydantic models matching the OpenAI SDK's `ResponseCodeInterpreterToolCall` shape. `OutputCodeInterpreterCall` is added to the `ResponsesAPIResponse.output` union and is used by `_extract_tool_result_output_items()` to produce properly typed output items.

Testing done
Unit tests
- `test_code_execution_tool_results_in_hidden_params`: verifies `tool_results` reaches `_hidden_params` for the Responses API adapter (new test)
- `test_code_execution_tool_results_extraction`: existing test from PR #18945 (Add: missing anthropic tool results in response) still passes

Live integration tests (local litellm proxy)

- `responses.create()` with `code_execution` tool → `response.provider_specific_fields.tool_results` contains `bash_code_execution_tool_result` with stdout/stderr
- `responses.create(stream=True)` → the `response.completed` event's response has `provider_specific_fields.tool_results` with stdout
- `code_interpreter_call` with `code="echo hello"` and `outputs=[{type: "logs", logs: "hello\n"}]`; no redundant `function_call` item
- `responses.create(stream=True)` returns correct text; `provider_specific_fields` is `None` (expected)
- `code_interpreter_call` output items pass through correctly (`outputs=null` is OpenAI's own API behavior, confirmed by testing directly against OpenAI without litellm)
- `CodeExecutionWidget` renders with syntax-highlighted code and stdout output
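The final output shape exercised by the tests above can be consumed like this (a minimal sketch; the ids and values are illustrative stand-ins, and no live call is made):

```python
# One code_interpreter_call item in the OpenAI Responses API shape described
# above: code from the server_tool_use block, outputs as logs entries.
response_output = [
    {
        "type": "code_interpreter_call",
        "id": "srv_toolu_123",          # illustrative id
        "code": "echo hello",
        "outputs": [{"type": "logs", "logs": "hello\n"}],
        "container_id": "container_abc",  # illustrative id
        "status": "completed",
    }
]

# Collect all execution logs, tolerating outputs=None (OpenAI parity for
# executions with no stdout/stderr).
logs = [
    out["logs"]
    for item in response_output
    if item["type"] == "code_interpreter_call" and item["outputs"]
    for out in item["outputs"]
    if out["type"] == "logs"
]
```

A client written against OpenAI's native `code_interpreter_call` shape should work unchanged against this output.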