
fix: surface Anthropic code execution results as code_interpreter_call in Responses API #23784

Merged
Chesars merged 30 commits into BerriAI:litellm_oss_staging_03_18_2026 from
andrzej-pomirski-yohana:fix/surface-anthropic-tool-results-responses-api
Mar 19, 2026

Conversation

@andrzej-pomirski-yohana
Contributor

andrzej-pomirski-yohana commented Mar 16, 2026

Relevant issues

Extends #18945 — Anthropic tool_results were captured in provider_specific_fields but never reached the Responses API output.

Pre-Submission checklist

  • I have added testing in the tests/litellm/ directory (adding at least 1 test is a hard requirement)
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Type

🐛 Bug Fix

Changes

PR #18945 added provider_specific_fields["tool_results"] for Anthropic server-side tool results (bash_code_execution_tool_result, text_editor_code_execution_tool_result). However, consumers using the Responses API (responses.create()) never see this data because of three gaps:

1. Non-streaming: provider_specific_fields not in _hidden_params

File: litellm/llms/anthropic/chat/transformation.py

transform_response() builds provider_specific_fields and sets it on the message, but never copies it into _hidden_params. The Responses API adapter at transformation.py:1585 checks _hidden_params.get("provider_specific_fields") and always finds None.

Fix: Add _hidden_params["provider_specific_fields"] = provider_specific_fields before model_response._hidden_params = _hidden_params.
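A plain-dict sketch of this fix may make the gap concrete. `ModelResponse` here is a minimal stand-in for litellm's class, and the surrounding transform_response() logic is elided; only the one-line copy mirrors the actual change.

```python
class ModelResponse:
    """Minimal stand-in for litellm's ModelResponse (illustrative only)."""
    def __init__(self):
        self._hidden_params = {}

def transform_response_sketch(tool_results):
    # Built by the Anthropic transformation, set on the message...
    provider_specific_fields = {"tool_results": tool_results}
    _hidden_params = {}
    # ...and the fix: copy it into _hidden_params so the Responses API
    # adapter's _hidden_params.get("provider_specific_fields") lookup
    # no longer returns None.
    _hidden_params["provider_specific_fields"] = provider_specific_fields
    response = ModelResponse()
    response._hidden_params = _hidden_params
    return response

resp = transform_response_sketch([{"type": "bash_code_execution_tool_result"}])
print(resp._hidden_params["provider_specific_fields"]["tool_results"][0]["type"])
# → bash_code_execution_tool_result
```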

2. Streaming: delta provider_specific_fields not accumulated

File: litellm/responses/litellm_completion_transformation/streaming_iterator.py

The Anthropic streaming handler sets tool_results on delta.provider_specific_fields, but LiteLLMCompletionStreamingIterator never accumulates these across chunks. The assembled ModelResponse has no provider_specific_fields in its _hidden_params.

Fix: Add _accumulated_provider_specific_fields dict, collect from chunk and delta in both __anext__ and __next__, inject into response._hidden_params in create_litellm_model_response().
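The accumulation can be sketched as follows; the real helper lives in streaming_iterator.py and its exact signature may differ, but the last-value-wins semantics match the stream_chunk_builder contract described in the review below.

```python
def merge_provider_specific_fields(accumulated, new_fields):
    """Last-value-wins merge across chunks (sketch of the PR's helper)."""
    if new_fields:
        accumulated.update(new_fields)

accumulated = {}
delta_fields_per_chunk = [
    {"code_interpreter_results": [{"id": "exec_1"}]},
    None,  # a text-only chunk carries no provider_specific_fields
    {"code_interpreter_results": [{"id": "exec_1"}, {"id": "exec_2"}]},
]
for fields in delta_fields_per_chunk:
    merge_provider_specific_fields(accumulated, fields)

# The cumulative list from the last chunk wins; earlier results survive
# because each emission is cumulative, not incremental.
assert len(accumulated["code_interpreter_results"]) == 2
```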

3. Tool results mapped to standard code_interpreter_call output items

File: litellm/responses/litellm_completion_transformation/transformation.py

Even with the above fixes, tool_results only appears in provider_specific_fields — not as standard output items. Anthropic's bash_code_execution_tool_result is semantically identical to OpenAI's code_interpreter_call.

Fix: Add _extract_tool_result_output_items() that maps each bash_code_execution_tool_result to an OutputCodeInterpreterCall instance with code (from the matching tool_call) and outputs (logs from stdout/stderr via OutputCodeInterpreterCallLog). The redundant function_call items for server-side tools are removed from the output, so the final shape matches OpenAI's native Responses API.
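A plain-dict sketch of the mapping described above; the PR's actual _extract_tool_result_output_items produces typed OutputCodeInterpreterCall objects and also substitutes them in place for the redundant function_call items.

```python
def to_code_interpreter_call(tool_result, code):
    """Map one bash_code_execution_tool_result to an OpenAI-shaped
    code_interpreter_call dict (simplified sketch)."""
    content = tool_result.get("content") or {}
    logs = "".join(
        part for part in (content.get("stdout", ""), content.get("stderr", "")) if part
    )
    # Empty stdout/stderr yields outputs=None, matching OpenAI parity.
    outputs = [{"type": "logs", "logs": logs}] if logs else None
    return {"type": "code_interpreter_call", "code": code, "outputs": outputs}

item = to_code_interpreter_call(
    {"type": "bash_code_execution_tool_result",
     "content": {"stdout": "hello\n", "stderr": ""}},
    code="echo hello",
)
# → {"type": "code_interpreter_call", "code": "echo hello",
#    "outputs": [{"type": "logs", "logs": "hello\n"}]}
```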

4. New types: OutputCodeInterpreterCall and OutputCodeInterpreterCallLog

Files: litellm/types/responses/main.py, litellm/types/llms/openai.py

Added OutputCodeInterpreterCall and OutputCodeInterpreterCallLog Pydantic models matching the OpenAI SDK's ResponseCodeInterpreterToolCall shape. OutputCodeInterpreterCall is added to the ResponsesAPIResponse.output union and is used by _extract_tool_result_output_items() to produce properly typed output items.
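A dataclass sketch of the two new types; the real definitions are Pydantic models in litellm/types/responses/main.py, so the field names here follow the PR text but validation behaviour is omitted.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class OutputCodeInterpreterCallLog:
    """One logs-type output, mirroring OpenAI's shape."""
    logs: str
    type: str = "logs"

@dataclass
class OutputCodeInterpreterCall:
    """Matches OpenAI's ResponseCodeInterpreterToolCall shape (sketch)."""
    type: str = "code_interpreter_call"
    id: Optional[str] = None
    code: Optional[str] = None
    container_id: Optional[str] = None
    outputs: Optional[List[OutputCodeInterpreterCallLog]] = None

call = OutputCodeInterpreterCall(
    code="echo hello",
    outputs=[OutputCodeInterpreterCallLog(logs="hello\n")],
)
assert call.outputs[0].type == "logs"
```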

Testing done

Unit tests

  • test_code_execution_tool_results_in_hidden_params — verifies tool_results reaches _hidden_params for the Responses API adapter (new test)
  • test_code_execution_tool_results_extraction — existing test from PR #18945 ("Add: missing anthropic tool results in response") still passes
  • All 48 targeted Anthropic + OpenAI Responses API tests pass on latest main

Live integration tests (local litellm proxy)

  • Anthropic non-streaming: responses.create() with code_execution tool → response.provider_specific_fields.tool_results contains bash_code_execution_tool_result with stdout/stderr
  • Anthropic streaming: responses.create(stream=True) → the response.completed event's response has provider_specific_fields.tool_results with stdout
  • Anthropic output items: Response output contains code_interpreter_call with code="echo hello" and outputs=[{type: "logs", logs: "hello\n"}] — no redundant function_call item
  • OpenAI gpt-5 streaming: No regression — responses.create(stream=True) returns correct text, provider_specific_fields is None (expected)
  • OpenAI gpt-5 non-streaming: No regression
  • OpenAI code_interpreter: code_interpreter_call output items pass through correctly (outputs=null is OpenAI's own API behavior, confirmed by testing directly against OpenAI without litellm)
  • TUI headless test: Full stack test through gRPC agent → litellm proxy → Anthropic API. CodeExecutionWidget renders with syntax-highlighted code and stdout output


@CLAassistant

CLAassistant commented Mar 16, 2026

CLA assistant check
All committers have signed the CLA.

@codspeed-hq
Contributor

codspeed-hq bot commented Mar 16, 2026

Merging this PR will not alter performance

✅ 16 untouched benchmarks


Comparing andrzej-pomirski-yohana:fix/surface-anthropic-tool-results-responses-api (4770b65) with main (488b93c) [1]


Footnotes

  1. No successful run was found on litellm_oss_staging_03_18_2026 (fc315ab) during the generation of this report, so main (488b93c) was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@andrzej-pomirski-yohana
Contributor Author

@greptileai

andrzej-pomirski-yohana force-pushed the fix/surface-anthropic-tool-results-responses-api branch 2 times, most recently from ba88625 to e77a14b on March 16, 2026 at 22:22
@andrzej-pomirski-yohana
Contributor Author

@greptileai

@greptile-apps
Contributor

greptile-apps bot commented Mar 16, 2026

Greptile Summary

This PR closes three gaps that prevented Anthropic server-side code execution results (e.g. bash_code_execution_tool_result) from appearing as code_interpreter_call items in the Responses API output, extending the work started in #18945.

What changed:

  • Non-streaming (transformation.py): provider_specific_fields (including the new code_interpreter_results list) is now copied into _hidden_params before the response is returned, so the Responses API adapter can find it.
  • Streaming (handler.py): ModelResponseIterator accumulates server-tool-use inputs from input_json_delta deltas and emits cumulative code_interpreter_results (list of provider-neutral OutputCodeInterpreterCall objects) on each bash_code_execution_tool_result content block, matching stream_chunk_builder's last-value-wins contract.
  • Streaming iterator (streaming_iterator.py): Adds _accumulated_provider_specific_fields with a last-value-wins merge helper, injecting the accumulated data into response._hidden_params after stream_chunk_builder assembles the final ModelResponse.
  • Responses API layer (transformation.py): Adds provider-neutral _extract_tool_result_output_items() that reads pre-built OutputCodeInterpreterCall objects from message.provider_specific_fields["code_interpreter_results"] and substitutes them in-place for the corresponding function_call items, preserving output ordering.
  • New types (types/responses/main.py): Adds OutputCodeInterpreterCall, OutputCodeInterpreterCallLog, and build_code_interpreter_log_outputs. The helper function parses Anthropic-specific stdout/stderr fields and is placed in a shared types file outside llms/; ideally it would live in litellm/llms/anthropic/chat/ to respect the provider-specific code boundary.
  • Tests: Comprehensive mock-only unit and end-to-end tests covering delta input assembly, multiple sequential executions (cumulative emission), empty output (outputs=None), non-bash tool skipping, in-place ordering, and the full ModelResponseIterator → stream_chunk_builder → _extract_tool_result_output_items pipeline.

Confidence Score: 4/5

  • Safe to merge; one minor code-organization concern (Anthropic-specific helper in a shared types file) but no logic bugs or regressions identified.
  • The three-gap fix is well-designed, with correct last-value-wins streaming semantics, in-place substitution preserving output ordering, and comprehensive mock-only unit tests. The only deduction is for build_code_interpreter_log_outputs living in litellm/types/responses/main.py with Anthropic-specific field names, which violates the project's provider-specific code placement rule. No backwards-incompatible changes and no real-network-call tests detected.
  • litellm/types/responses/main.py — build_code_interpreter_log_outputs contains Anthropic-specific parsing logic and should be moved inside litellm/llms/anthropic/.

Important Files Changed

  • litellm/llms/anthropic/chat/handler.py: Adds _server_tool_inputs, tool_results, _current_server_tool_id, and _container_id to ModelResponseIterator.__init__. Tracks server-tool-use inputs (assembled from input_json_delta deltas at content_block_stop) and builds cumulative code_interpreter_results at content_block_start for tool results. The streaming architecture is sound and tests cover delta assembly, multiple executions, empty output, and non-bash tool types.
  • litellm/llms/anthropic/chat/transformation.py: Non-streaming path fix: builds code_interpreter_results (list of OutputCodeInterpreterCall) alongside tool_results in provider_specific_fields, then propagates the whole dict into _hidden_params before assigning it to model_response. This closes the gap where the Responses API adapter could never see these fields.
  • litellm/responses/litellm_completion_transformation/streaming_iterator.py: Adds _accumulated_provider_specific_fields and a _merge_provider_specific_fields helper using last-value-wins semantics (matching stream_chunk_builder). Accumulates fields from both chunk.provider_specific_fields and chunk.choices[0].delta.provider_specific_fields in both async and sync iteration paths. On assembly, merges into response._hidden_params["provider_specific_fields"] via setdefault+update.
  • litellm/responses/litellm_completion_transformation/transformation.py: Adds _extract_tool_result_output_items to read pre-built OutputCodeInterpreterCall objects from message.provider_specific_fields["code_interpreter_results"] and performs in-place substitution of matching function_call items to preserve output ordering.
  • litellm/types/responses/main.py: Adds OutputCodeInterpreterCallLog, OutputCodeInterpreterCall Pydantic models matching OpenAI's ResponseCodeInterpreterToolCall shape, and the build_code_interpreter_log_outputs helper. The helper contains Anthropic-specific field names and is placed in a shared types file outside llms/, slightly violating the provider-specific code placement rule.
  • litellm/types/llms/openai.py: Adds OutputCodeInterpreterCall to the ResponsesAPIResponse.output union type, allowing the type system to recognise these items in responses. The remainder of the changes are formatting-only (Black reformatting of long type annotations).
  • tests/test_litellm/llms/anthropic/chat/test_code_interpreter_results_extraction.py: New test file with thorough unit and mock end-to-end tests: Pydantic/dict extraction, empty list, ordering preservation (in-place substitution), and a full streaming pipeline mock test exercising the ModelResponseIterator → stream_chunk_builder → _extract_tool_result_output_items chain. All tests use mocks with no real network calls.
  • tests/test_litellm/llms/anthropic/chat/test_anthropic_chat_handler.py: New streaming tests cover: single code execution producing code_interpreter_results, multiple executions emitting cumulative lists (last-value-wins), delta-assembled input, empty stdout/stderr producing outputs=None, and non-bash tool results being skipped. All tests are pure mock tests with no network calls.

Sequence Diagram

sequenceDiagram
    participant Client
    participant ResponsesAPI as Responses API<br/>(transformation.py)
    participant StreamIter as LiteLLMCompletionStreamingIterator
    participant AnthropicHandler as Anthropic ModelResponseIterator<br/>(handler.py / transformation.py)
    participant SCB as stream_chunk_builder

    Client->>ResponsesAPI: responses.create()

    alt Non-streaming path
        AnthropicHandler->>AnthropicHandler: extract_response_content()<br/>→ tool_calls, tool_results
        AnthropicHandler->>AnthropicHandler: Build code_interpreter_results<br/>(OutputCodeInterpreterCall list)
        AnthropicHandler->>AnthropicHandler: provider_specific_fields["code_interpreter_results"] = [...]
        AnthropicHandler->>AnthropicHandler: _hidden_params["provider_specific_fields"] = provider_specific_fields
        AnthropicHandler-->>ResponsesAPI: ModelResponse (with _hidden_params)
        ResponsesAPI->>ResponsesAPI: _extract_tool_result_output_items()<br/>reads message.provider_specific_fields
        ResponsesAPI->>ResponsesAPI: In-place substitute function_call → code_interpreter_call
        ResponsesAPI-->>Client: ResponsesAPIResponse (with code_interpreter_call items)
    else Streaming path
        loop Each SSE chunk
            AnthropicHandler->>AnthropicHandler: chunk_parser(): content_block_start<br/>for server_tool_use → track _server_tool_inputs
            AnthropicHandler->>AnthropicHandler: content_block_stop: assemble input_json_delta → _server_tool_inputs[id]
            AnthropicHandler->>AnthropicHandler: content_block_start for tool_result<br/>→ _build_code_interpreter_results()<br/>delta.provider_specific_fields["code_interpreter_results"] = [...]
            AnthropicHandler-->>StreamIter: ModelResponseStream chunk
            StreamIter->>StreamIter: _merge_provider_specific_fields()<br/>(last-value-wins accumulation)
        end
        StreamIter->>SCB: stream_chunk_builder(collected_chunks)
        SCB->>SCB: last-value-wins merge of<br/>delta.provider_specific_fields → message.provider_specific_fields
        SCB-->>StreamIter: ModelResponse
        StreamIter->>StreamIter: setdefault("provider_specific_fields",{}).update(<br/>_accumulated_provider_specific_fields)
        StreamIter-->>ResponsesAPI: ModelResponse (with _hidden_params + message.psf)
        ResponsesAPI->>ResponsesAPI: _extract_tool_result_output_items()<br/>reads message.provider_specific_fields["code_interpreter_results"]
        ResponsesAPI->>ResponsesAPI: In-place substitute function_call → code_interpreter_call
        ResponsesAPI-->>Client: response.completed event (with code_interpreter_call items)
    end

Comments Outside Diff (1)

  1. litellm/types/responses/main.py, lines 70-85

    Provider-specific helper function in shared types file

    build_code_interpreter_log_outputs contains Anthropic-specific knowledge — it parses the stdout and stderr keys that come from Anthropic's bash_code_execution_result format, and even its own docstring names it "Anthropic bash_code_execution." Placing this logic in litellm/types/responses/main.py (a provider-agnostic types file outside llms/) violates the rule to keep provider-specific code inside litellm/llms/.

    Since both callers (handler.py and transformation.py) are already in litellm/llms/anthropic/chat/, a straightforward fix is to move the function there (e.g., into a small utils.py in that package) and import it from each caller:

    # litellm/llms/anthropic/chat/utils.py
    from litellm.types.responses.main import OutputCodeInterpreterCallLog
    
    def build_code_interpreter_log_outputs(content):
        ...

    Then in handler.py and transformation.py, import from litellm.llms.anthropic.chat.utils instead of litellm.types.responses.main.
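A hedged completion of the reviewer's sketch might look like the following; the actual helper body in the PR may differ, and plain dicts stand in for the OutputCodeInterpreterCallLog models for brevity.

```python
# litellm/llms/anthropic/chat/utils.py (illustrative completion of the
# reviewer's suggestion, not the PR's actual implementation)
def build_code_interpreter_log_outputs(content):
    """Parse Anthropic bash_code_execution_result stdout/stderr into
    OpenAI-style log outputs; return None when both streams are empty
    or the content is not a dict (e.g. list-shaped text_editor results)."""
    if not isinstance(content, dict):
        return None
    logs = "".join(
        part for part in (content.get("stdout", ""), content.get("stderr", "")) if part
    )
    return [{"type": "logs", "logs": logs}] if logs else None

assert build_code_interpreter_log_outputs({"stdout": "ok\n", "stderr": ""}) == [
    {"type": "logs", "logs": "ok\n"}
]
assert build_code_interpreter_log_outputs({"stdout": "", "stderr": ""}) is None
```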

    Rule Used: What: Avoid writing provider-specific code outside... (source)

Last reviewed commit: "refactor: extract du..."

andrzej-pomirski-yohana force-pushed the fix/surface-anthropic-tool-results-responses-api branch from e77a14b to d7ab3d3 on March 16, 2026 at 22:40
@andrzej-pomirski-yohana
Contributor Author

@greptileai

voidborne-d and others added 3 commits March 17, 2026 03:11
When the shared aiohttp session closes (due to network interruption,
idle timeout, or Redis failover side effects), the proxy permanently
falls back to creating a new HTTPS connection per request, losing the
benefit of connection pooling for the entire pod lifetime.

Fix: make add_shared_session_to_data() async and recreate the session
when it is found closed, restoring connection pooling automatically.

Fixes BerriAI#23806
… recreation

When multiple requests detect a closed shared session simultaneously,
they would each create a new aiohttp.ClientSession, leaking intermediate
sessions and their TCP connectors. Added double-checked locking pattern
with asyncio.Lock to ensure only one coroutine recreates the session.

Added concurrent recreation test case.
Address Greptile P1 review: tests that exercise the closed-session code
path need to reset the module-level lock to avoid RuntimeError on
Python < 3.10 when asyncio.Lock is reused across different event loops.
@andrzej-pomirski-yohana
Contributor Author

@greptileai review

andrzej-pomirski-yohana force-pushed the fix/surface-anthropic-tool-results-responses-api branch from ebd8655 to 7955ff2 on March 17, 2026 at 12:37
andrzej-pomirski-yohana force-pushed the fix/surface-anthropic-tool-results-responses-api branch from 7955ff2 to 702e8ad on March 17, 2026 at 12:43
@andrzej-pomirski-yohana
Contributor Author

@greptileai review

Changes since last review:

  • Fixed output item ordering: in-place substitution instead of remove+append
  • Moved streaming conversion to Anthropic handler (_build_code_interpreter_results in handler.py), removed _ensure_code_interpreter_results_on_message from shared layer
  • Deep merge for list-valued provider_specific_fields (_merge_provider_specific_fields)
  • All imports hoisted to top-level (no more inline imports)
  • Added streaming unit test: test_streaming_code_execution_produces_code_interpreter_results

yuneng-jiang and others added 10 commits March 17, 2026 17:37
Address review feedback from greptile — use new_callable=AsyncMock
on the concurrent test's patch.object to ensure the mock is properly
typed as async, even though side_effect already handles the coroutine.
…arams-anthropic-document-file-message-blocks
[Infra] Security and Proxy Extras for Nightly

Only known flaky tests failing. The fix for security and proxy extras worked
…session-auto-recovery

fix: auto-recover shared aiohttp session when closed
…l in Responses API

PR BerriAI#18945 added support for capturing Anthropic server-side tool results
(bash_code_execution_tool_result, etc.) in provider_specific_fields, but
the data never reached the Responses API output because:

1. Non-streaming: provider_specific_fields wasn't copied into _hidden_params
2. Streaming: chunk delta's provider_specific_fields wasn't accumulated
3. Tool results weren't mapped to standard output items

This fix:
- Copies provider_specific_fields to _hidden_params in transform_response()
- Accumulates provider_specific_fields from streaming chunk deltas
- Maps bash_code_execution_tool_result to code_interpreter_call output items
  with code and outputs (matching OpenAI's native shape)
- Removes redundant function_call items for server-side tools
- Adds OutputCodeInterpreterCall type to the output union
…cutions

stream_chunk_builder uses "last value wins" for list-valued
provider_specific_fields keys. _build_code_interpreter_results was
emitting only new items (incremental), so earlier results were silently
dropped when multiple sequential code executions occurred.

- Emit cumulative list from _build_code_interpreter_results, matching
  web_search_results pattern
- Assemble server_tool_use input from input_json_delta deltas at
  content_block_stop (Anthropic streams input: {} in start block)
- Handle dict items in _extract_tool_result_output_items after
  model_dump() serialization in stream_chunk_builder
- Simplify _merge_provider_specific_fields to last-value-wins for lists,
  matching stream_chunk_builder semantics
When both stdout and stderr are empty strings, the `if parts else
str(content)` fallback produced the raw dict representation as logs.
Drop the fallback so logs is correctly empty.
- Empty stdout/stderr now produces outputs=None (matching OpenAI parity)
  instead of outputs=[{logs:""}], in both streaming and non-streaming paths
- Fix test fixture to use real Anthropic type "bash_code_execution_tool_result"
  instead of "code_execution_tool_result"
- Add test for empty-output → outputs=None behavior
- Add unit tests for _extract_tool_result_output_items: Pydantic objects,
  plain dicts (post-model_dump), empty/missing provider_specific_fields,
  and in-place substitution preserving output ordering
Replace str(content) fallback with empty string so non-dict content
(e.g. list-shaped text_editor results) produces outputs=None instead
of raw Python object representations in logs.
… only

Skip non-bash tool result types (e.g. text_editor_code_execution_tool_result)
to avoid producing empty code_interpreter_call items in Responses API output.
…n test

- test_non_bash_tool_result_skipped: verifies text_editor results produce
  zero code_interpreter_call items
- test_end_to_end_streaming_chunks_to_code_interpreter_output: exercises
  full path from Anthropic SSE chunks through ModelResponseIterator,
  stream_chunk_builder, and _extract_tool_result_output_items without
  a live server
- Populate container_id on streaming code_interpreter_results by
  re-emitting at message_delta when container info arrives
- Reconstruct Pydantic OutputCodeInterpreterCall objects from plain
  dicts in _extract_tool_result_output_items so responses_output
  has uniform types across streaming and non-streaming paths
Chesars changed the base branch from main to litellm_oss_staging_03_18_2026 on March 19, 2026 at 01:06
Chesars merged commit ef3b05b into BerriAI:litellm_oss_staging_03_18_2026 on Mar 19, 2026
16 of 39 checks passed
