…l in Responses API

PR #18945 added support for capturing Anthropic server-side tool results (bash_code_execution_tool_result, etc.) in provider_specific_fields, but the data never reached the Responses API output because:

1. Non-streaming: provider_specific_fields wasn't copied into _hidden_params
2. Streaming: the chunk delta's provider_specific_fields wasn't accumulated
3. Tool results weren't mapped to standard output items

This fix:

- Copies provider_specific_fields to _hidden_params in transform_response()
- Accumulates provider_specific_fields from streaming chunk deltas
- Maps bash_code_execution_tool_result to code_interpreter_call output items with code and outputs (matching OpenAI's native shape)
- Removes redundant function_call items for server-side tools
- Adds OutputCodeInterpreterCall type to the output union
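As a rough illustration of the mapping this commit describes (the helper name and exact field handling here are assumptions for the sketch, not the actual litellm implementation), an Anthropic bash_code_execution_tool_result block could be reshaped into an OpenAI-style code_interpreter_call item like this:

```python
from typing import Any, Dict, List, Optional


def to_code_interpreter_call(
    tool_result: Dict[str, Any], code: str, container_id: Optional[str] = None
) -> Dict[str, Any]:
    """Sketch: reshape an Anthropic bash tool result into an
    OpenAI-shaped code_interpreter_call output item."""
    content = tool_result.get("content")
    parts: List[str] = []
    if isinstance(content, dict):
        if content.get("stdout"):
            parts.append(content["stdout"])
        if content.get("stderr"):
            parts.append("STDERR: " + content["stderr"])
    # Empty stdout/stderr produces outputs=None, matching OpenAI parity
    outputs = [{"type": "logs", "logs": "\n".join(parts)}] if parts else None
    return {
        "type": "code_interpreter_call",
        "id": tool_result.get("tool_use_id", ""),
        "code": code,
        "container_id": container_id,
        "status": "completed",
        "outputs": outputs,
    }


item = to_code_interpreter_call(
    {
        "type": "bash_code_execution_tool_result",
        "tool_use_id": "srvtoolu_1",
        "content": {"stdout": "hello\n", "stderr": ""},
    },
    code="echo hello",
)
```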
…cutions
stream_chunk_builder uses "last value wins" for list-valued
provider_specific_fields keys. _build_code_interpreter_results was
emitting only new items (incremental), so earlier results were silently
dropped when multiple sequential code executions occurred.
- Emit cumulative list from _build_code_interpreter_results, matching
web_search_results pattern
- Assemble server_tool_use input from input_json_delta deltas at
content_block_stop (Anthropic streams input: {} in start block)
- Handle dict items in _extract_tool_result_output_items after
model_dump() serialization in stream_chunk_builder
- Simplify _merge_provider_specific_fields to last-value-wins for lists,
matching stream_chunk_builder semantics
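The last-value-wins semantics described above can be sketched as follows (the helper name mirrors the commit message but the body is an assumption, not the actual litellm code): each streaming chunk carries the full cumulative list, so a plain overwrite replaces earlier values instead of extending them.

```python
from typing import Any, Dict


def merge_provider_specific_fields(
    accumulated: Dict[str, Any], incoming: Dict[str, Any]
) -> Dict[str, Any]:
    """Last value wins: each chunk is expected to carry the full
    cumulative list, so later chunks simply replace earlier ones."""
    for key, value in incoming.items():
        accumulated[key] = value
    return accumulated


acc: Dict[str, Any] = {}
# First execution result arrives...
merge_provider_specific_fields(acc, {"code_interpreter_results": [{"id": "a"}]})
# ...later chunks re-emit the cumulative list, so nothing is dropped
merge_provider_specific_fields(
    acc, {"code_interpreter_results": [{"id": "a"}, {"id": "b"}]}
)
```

This only stays correct because _build_code_interpreter_results now emits the cumulative list; with incremental emission, the overwrite would silently drop earlier results, which is exactly the bug described above.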
When both stdout and stderr are empty strings, the `if parts else str(content)` fallback produced the raw dict representation as logs. Drop the fallback so logs is correctly empty.
- Empty stdout/stderr now produces outputs=None (matching OpenAI parity)
instead of outputs=[{logs:""}], in both streaming and non-streaming paths
- Fix test fixture to use real Anthropic type "bash_code_execution_tool_result"
instead of "code_execution_tool_result"
- Add test for empty-output → outputs=None behavior
- Add unit tests for _extract_tool_result_output_items: Pydantic objects,
plain dicts (post-model_dump), empty/missing provider_specific_fields,
and in-place substitution preserving output ordering
Replace str(content) fallback with empty string so non-dict content (e.g. list-shaped text_editor results) produces outputs=None instead of raw Python object representations in logs.
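A minimal sketch of the logs-building rule these two commits converge on (function name hypothetical; only string stdout/stderr contribute, and there is no str(content) fallback):

```python
from typing import Any, List, Optional


def build_logs(content: Any) -> Optional[str]:
    """Return joined stdout/stderr logs, or None when there is nothing to report."""
    if not isinstance(content, dict):
        # e.g. list-shaped text_editor results: no raw-object fallback
        return None
    parts: List[str] = []
    if content.get("stdout"):
        parts.append(content["stdout"])
    if content.get("stderr"):
        parts.append("STDERR: " + content["stderr"])
    # Empty stdout/stderr yields None, not a stringified dict
    return "\n".join(parts) if parts else None
```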
… only

Skip non-bash tool result types (e.g. text_editor_code_execution_tool_result) to avoid producing empty code_interpreter_call items in Responses API output.
…n test

- test_non_bash_tool_result_skipped: verifies text_editor results produce zero code_interpreter_call items
- test_end_to_end_streaming_chunks_to_code_interpreter_output: exercises the full path from Anthropic SSE chunks through ModelResponseIterator, stream_chunk_builder, and _extract_tool_result_output_items without a live server
- Populate container_id on streaming code_interpreter_results by re-emitting at message_delta when container info arrives
- Reconstruct Pydantic OutputCodeInterpreterCall objects from plain dicts in _extract_tool_result_output_items so responses_output has uniform types across streaming and non-streaming paths
LangSmith reads the Cost column from outputs.usage_metadata.total_cost, but LangsmithLogger._prepare_log_data never wrote to that key. The response_cost was already computed in StandardLoggingPayload but was not forwarded to the outputs dict.

Inject usage_metadata with input_tokens, output_tokens, total_tokens, and total_cost into the outputs dict so LangSmith can display cost.

Fixes #24001

Made-with: Cursor
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
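A minimal sketch of the injection this commit describes (helper name and payload keys are taken from the commit message; the surrounding LangsmithLogger wiring is omitted):

```python
from typing import Any, Dict, Optional


def build_outputs_with_usage(
    outputs: Dict[str, Any],
    prompt_tokens: int,
    completion_tokens: int,
    response_cost: Optional[float],
) -> Dict[str, Any]:
    # LangSmith reads the Cost column from outputs.usage_metadata.total_cost
    outputs["usage_metadata"] = {
        "input_tokens": prompt_tokens,
        "output_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
        "total_cost": response_cost,
    }
    return outputs


enriched = build_outputs_with_usage({"content": "hi"}, 10, 5, 0.00042)
```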
…thropic-tool-results-responses-api

fix: surface Anthropic code execution results as code_interpreter_call in Responses API
The check `content.get("thinking", None) is not None` incorrectly
drops thinking blocks when the `thinking` key is explicitly null or
absent. Changed to `content.get("type") == "thinking"` to match
the fix already applied in the experimental pass-through path (PR #15501).
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Fix thinking blocks dropped when thinking field is null
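The difference between the two checks can be shown with a toy content block (simplified shapes, not the actual transformation code):

```python
def is_thinking_block_old(content: dict) -> bool:
    # Old check: keys off the "thinking" value, so a block whose
    # thinking field is explicitly null is wrongly treated as non-thinking
    return content.get("thinking", None) is not None


def is_thinking_block_new(content: dict) -> bool:
    # New check: discriminate on the block's type field
    return content.get("type") == "thinking"


block = {"type": "thinking", "thinking": None, "signature": "sig"}
```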
…gging

Preserve router model_group in generic API logs
Fix/proxy only failure call type
…adata

fix(langsmith): populate usage_metadata in outputs for Cost column
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6ef440c2f5
```python
responses_output = [
    (
        result_by_id[getattr(item, "call_id", None)]
        if (
            getattr(item, "type", None) == "function_call"
```
Preserve reusable tool-call items in response.output
For Anthropic code-execution responses, this replaces the original function_call entry with a code_interpreter_call. That breaks the standard response.output continuation path: _transform_responses_api_input_item_to_chat_completion_message() only recognizes function_call / function_call_output (litellm/responses/litellm_completion_transformation/transformation.py:967-1009), so reusing response.output as next-turn input—exactly what tests/llm_responses_api_testing/base_responses_api.py:527-543 does—drops the code-interpreter item entirely and loses the executed tool call on the next request.
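A toy illustration of this concern (not the actual litellm transformation): a continuation-path transformer that recognizes only function_call / function_call_output silently drops a code_interpreter_call item when response.output is fed back as next-turn input.

```python
from typing import Any, Dict, List


def transform_input_items(items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the item types the continuation path recognizes."""
    kept = []
    for item in items:
        if item.get("type") in ("function_call", "function_call_output"):
            kept.append(item)
        # code_interpreter_call falls through here and is silently dropped
    return kept


previous_output = [
    {"type": "function_call", "call_id": "c1", "name": "bash"},
    {"type": "code_interpreter_call", "id": "srvtoolu_1", "code": "echo hi"},
]
next_turn_input = transform_input_items(previous_output)
```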
```python
result_by_id = {item.id: item for item in tool_result_items}
replaced_ids = set(result_by_id.keys())
responses_output = [
    (
        result_by_id[getattr(item, "call_id", None)]
```
Keep streaming event item types consistent with final output
In streaming mode, this rewrite only affects the final response.completed.output. The iterator still emits response.output_item.added/done and response.function_call_arguments.* as function_call items for the same call_id (litellm/responses/litellm_completion_transformation/streaming_iterator.py:204-224 and 293-360), so clients that build state from the stream will see output index 1 as a function_call and then have response.completed silently change that slot into code_interpreter_call for Anthropic code-execution streams.
```python
type="code_interpreter_call",
id=call_id,
code=code_by_id.get(call_id, ""),
container_id=container_id,
status="completed",
```
Mark non-zero bash executions as failed
When Anthropic returns a bash_code_execution_tool_result with return_code != 0, this still hard-codes the Responses API item to status="completed". That misreports failed executions as successful in both the streaming and non-streaming paths, so any caller that keys off the normalized status field will miss code-interpreter failures and continue as if the command succeeded.
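The mapping this review asks for could look like the following sketch (function name hypothetical; the return_code key is part of Anthropic's bash result content per the comment above):

```python
from typing import Any, Dict


def status_from_tool_result(content: Dict[str, Any]) -> str:
    """Derive the normalized Responses API status from Anthropic's
    return_code instead of hard-coding "completed"."""
    return "completed" if content.get("return_code", 0) == 0 else "failed"
```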
Alexey seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have already signed the CLA but the status is still pending? Let us recheck it.
Greptile Summary

This staging PR bundles several related improvements across the Anthropic provider, Responses API, streaming infrastructure, and integrations. The major themes are:
Issues found:
Confidence Score: 4/5
| Filename | Overview |
|---|---|
| litellm/llms/anthropic/chat/handler.py | Adds streaming code_interpreter_results tracking via _server_tool_inputs, tool_results, and _build_code_interpreter_results(). Input assembly from input_json_delta chunks is performed at content_block_stop. Logic appears correct with tests covering delta accumulation and container_id propagation. |
| litellm/llms/anthropic/chat/transformation.py | Refactors transform_parsed_response removing the else-branch, extracts _build_provider_specific_fields/_build_code_interpreter_results helpers, and fixes thinking-block detection (type-based instead of key-presence). Also adds provider_specific_fields to _hidden_params. Clean refactor with good test coverage. |
| litellm/responses/streaming_iterator.py | Adds handling for RESPONSE_INCOMPLETE and RESPONSE_FAILED events; introduces _handle_logging_failed_response() which routes to failure handlers. Status code is always hardcoded to 500 regardless of actual error type, which may misclassify client-side errors. |
| litellm/responses/litellm_completion_transformation/streaming_iterator.py | Accumulates provider_specific_fields across streaming chunks using last-value-wins semantics in _merge_provider_specific_fields, stored in _accumulated_provider_specific_fields and merged into _hidden_params on create_litellm_model_response. End-to-end test confirms stream_chunk_builder properly propagates to message.provider_specific_fields. |
| litellm/responses/litellm_completion_transformation/transformation.py | Adds _extract_tool_result_output_items and in-place substitution logic to replace function_call items with OutputCodeInterpreterCall items in Responses API output. Tests verify both non-streaming (Pydantic) and streaming (dict) paths, including ordering preservation. |
| litellm/integrations/langsmith.py | Extracts _extract_metadata_fields, _build_extra_metadata, _build_outputs_with_usage, and _ensure_required_ids helpers. _build_outputs_with_usage enriches LangSmith outputs with usage_metadata (cost, token counts). Also fixes the original operator precedence bug in trace_id/dotted_order guard conditions. |
| litellm/litellm_core_utils/streaming_handler.py | Replaces O(N) safety_checker with O(1) raise_on_model_repetition using a rolling consecutive-count (_repeated_messages_count). Correctly handles None/short content resets and is thoroughly tested with parametrized cases covering edge conditions. |
| litellm/types/responses/main.py | Adds OutputCodeInterpreterCallLog, OutputCodeInterpreterCall types, and build_code_interpreter_log_outputs() helper shared by streaming and non-streaming paths. The helper contains Anthropic-specific parsing logic (stdout/stderr keys, STDERR: prefix) in the shared types layer — noted as a concern in previous threads. |
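The O(1) repetition detection noted for streaming_handler.py above can be sketched as follows (class and attribute names are modeled on the table's description; the exact threshold and reset rules are assumptions):

```python
from typing import Optional

REPEATED_CHUNK_LIMIT = 100  # assumed threshold for the sketch


class RepetitionGuard:
    """Rolling consecutive-count check: O(1) per chunk instead of
    rescanning all accumulated chunks."""

    def __init__(self) -> None:
        self._last_content: Optional[str] = None
        self._repeated_messages_count = 0

    def check(self, content: Optional[str]) -> None:
        if not content or len(content) < 2:
            # None/short content resets the counter
            self._last_content = None
            self._repeated_messages_count = 0
            return
        if content == self._last_content:
            self._repeated_messages_count += 1
            if self._repeated_messages_count > REPEATED_CHUNK_LIMIT:
                raise RuntimeError("model repeated the same chunk too many times")
        else:
            self._last_content = content
            self._repeated_messages_count = 0
```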
Sequence Diagram
```mermaid
sequenceDiagram
    participant Anthropic as Anthropic SSE
    participant MRI as ModelResponseIterator
    participant CSW as CustomStreamWrapper
    participant LCSI as LiteLLMCompletionStreamingIterator
    participant SCB as stream_chunk_builder
    participant Trans as LiteLLMCompletionResponsesConfig
    Anthropic->>MRI: content_block_start (server_tool_use)
    MRI->>MRI: Store _current_server_tool_id + empty input
    Anthropic->>MRI: content_block_delta (input_json_delta)
    MRI->>MRI: Accumulate partial_json in content_blocks
    Anthropic->>MRI: content_block_stop
    MRI->>MRI: Assemble JSON → _server_tool_inputs[id]
    Anthropic->>MRI: content_block_start (bash_code_execution_tool_result)
    MRI->>MRI: Append to tool_results
    MRI->>MRI: _build_code_interpreter_results() → cumulative list
    MRI-->>CSW: chunk with delta.provider_specific_fields[code_interpreter_results]
    Anthropic->>MRI: message_delta (container)
    MRI->>MRI: Store container_id, re-emit code_interpreter_results
    MRI-->>CSW: chunk with updated container_id
    CSW-->>LCSI: ModelResponseStream chunks
    LCSI->>LCSI: _merge_provider_specific_fields() (last-value-wins)
    LCSI->>LCSI: _ensure_output_item_for_chunk()
    LCSI->>SCB: collected_chat_completion_chunks
    SCB-->>LCSI: ModelResponse (message.provider_specific_fields populated)
    LCSI->>LCSI: Merge _accumulated_provider_specific_fields → _hidden_params
    LCSI->>Trans: _get_output_items_from_chat_completion_response(ModelResponse)
    Trans->>Trans: _extract_tool_result_output_items()
    Trans->>Trans: Replace function_call → OutputCodeInterpreterCall in-place
    Trans-->>LCSI: ResponsesAPIResponse with code_interpreter_call items
```
Comments Outside Diff (1)
- litellm/llms/anthropic/chat/handler.py, lines 888-910 (link): `content_blocks` iteration silently discards `KeyError` on missing `delta` key

  When `content_block_stop` fires for a `server_tool_use` block, every element in `self.content_blocks` is expected to carry a `delta` dict. The direct `block["delta"]` access on line 893 will raise `KeyError` for any block that doesn't have `delta` (e.g. an unexpected chunk shape). That exception is caught by the outer `chunk_parser` try/except, silently leaving `self._server_tool_inputs[self._current_server_tool_id]` pointing at the empty dict set during `content_block_start`, which means `code` will be `""` in the final `OutputCodeInterpreterCall`.

  A previous thread suggested using `.get()`:

  ```python
  for block in self.content_blocks:
      delta = block.get("delta", {})
      if delta.get("type") == "input_json_delta":
          partial_json = delta.get("partial_json")
          if isinstance(partial_json, str):
              args += partial_json
  ```
Last reviewed commit: "Merge branch 'main' ..."
```python
code_by_id: Dict[str, str] = {}
for tc in tool_calls:
    try:
        args = json.loads(tc.get("function", {}).get("arguments", "{}"))
        code_by_id[tc.get("id", "")] = args.get("command", "")
    except Exception:
        pass
```
Silent failure may mask missing `code` field

The `try/except Exception: pass` silently swallows failures when building `code_by_id`. If `tc` is not a plain dict (e.g. a Pydantic model whose `.get()` behaves differently), or if `arguments` is invalid JSON, the resulting `OutputCodeInterpreterCall` will have `code=""` with no observable error.
Consider at minimum logging the exception so debugging is possible, and also guard the outer loop with tool_calls or [] as a defensive measure:
```diff
-code_by_id: Dict[str, str] = {}
-for tc in tool_calls:
-    try:
-        args = json.loads(tc.get("function", {}).get("arguments", "{}"))
-        code_by_id[tc.get("id", "")] = args.get("command", "")
-    except Exception:
-        pass
+code_by_id: Dict[str, str] = {}
+for tc in (tool_calls or []):
+    try:
+        args = json.loads(tc.get("function", {}).get("arguments", "{}"))
+        code_by_id[tc.get("id", "")] = args.get("command", "")
+    except Exception:
+        verbose_logger.debug(
+            "Failed to extract code from tool call: %s", tc
+        )
```
```python
    ),
    index=self.tool_index,
)
# Update server_tool_inputs with fully assembled input
# from input_json_delta chunks (content_block_start has {})
if (
    self.current_content_block_type == "server_tool_use"
    and self._current_server_tool_id
):
    args = ""
    for block in self.content_blocks:
        if block["delta"]["type"] == "input_json_delta":
            args += block["delta"].get("partial_json", "")
    if args:
        try:
            self._server_tool_inputs[
                self._current_server_tool_id
            ] = json.loads(args)
        except (json.JSONDecodeError, TypeError):
            pass
    self._current_server_tool_id = None
# Reset response_format tool tracking when block stops
self.is_response_format_tool = False
# Reset current content block type
```
content_blocks iteration assumes every block has a delta key
In `content_block_stop`, the loop:

```python
for block in self.content_blocks:
    if block["delta"]["type"] == "input_json_delta":
```

uses a direct `block["delta"]` access. While `self.content_blocks` is populated exclusively from `content_block_delta` events (which always carry `delta`), there is no guard if an unexpected chunk shape slips in. The outer `chunk_parser` try/except will catch the resulting `KeyError`, but the effect is that the whole `content_block_stop` processing is silently skipped for that block, which would leave `_server_tool_inputs` unpopulated and the `code` field empty.
A safer pattern:

```python
for block in self.content_blocks:
    delta = block.get("delta", {})
    if delta.get("type") == "input_json_delta":
        args += delta.get("partial_json", "")
```

…etection-performance

Fix model repetition detection performance
fix: fix logging for response incomplete streaming + custom pricing on /v1/messages and /v1/responses
```python
    id: str
    code: Optional[str]
    container_id: Optional[str]
    status: Literal["in_progress", "completed", "incomplete", "failed"]
    outputs: Optional[List[OutputCodeInterpreterCallLog]]


def build_code_interpreter_log_outputs(
    content: Any,
) -> Optional[List[OutputCodeInterpreterCallLog]]:
    """Convert Anthropic bash_code_execution stdout/stderr to log outputs.

    Shared by streaming (handler.py) and non-streaming (transformation.py) paths.
    """
    if not isinstance(content, dict):
        return None
```
Provider-specific parsing logic in the `types/` layer
build_code_interpreter_log_outputs understands Anthropic's internal bash_code_execution_result format (stdout, stderr keys) and the "STDERR: " prefix convention. This is Anthropic-specific knowledge; placing it in litellm/types/responses/main.py leaks provider semantics into the shared types layer.
Per the repository guideline, provider-specific logic should live under litellm/llms/<provider>/. Consider moving this helper to litellm/llms/anthropic/ (e.g., litellm/llms/anthropic/common_utils.py) and importing it from there in both handler.py and transformation.py.
```python
# Better placement: litellm/llms/anthropic/common_utils.py
def build_code_interpreter_log_outputs(content: Any) -> Optional[List[OutputCodeInterpreterCallLog]]:
    ...
```

This keeps `types/responses/main.py` as pure data-model definitions with no provider-specific parsing.
Rule used: Avoid writing provider-specific code outside... (source)
- Extract helper methods in langsmith._prepare_log_data to reduce from 51 to <50 statements
- Extract helper methods in anthropic.transform_parsed_response to reduce from 57 to <50 statements
- Fixes PLR0915 linter errors
- All existing tests pass (10 langsmith tests, 126 anthropic tests)

Made-with: Cursor
The CI/CD run from main that was used as reference: https://app.circleci.com/pipelines/github/BerriAI/litellm/69965 (latest at the time of testing). The current CI/CD of this branch has to be aligned and is not failing any extra tests. Main is also failing some tests at the moment, hence the reference run is included.
Relevant issues
Pre-Submission checklist
Please complete all items before asking a LiteLLM maintainer to review your PR
- I have added a relevant test in the `tests/test_litellm/` directory (adding at least 1 test is a hard requirement - see details)
- I have run `make test-unit`
- I have tagged `@greptileai` and received a Confidence Score of at least 4/5 before requesting a maintainer review

Delays in PR merge?
If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack (#pr-review).
CI (LiteLLM team)
Branch creation CI run
Link:
CI run for the last commit
Link:
Merge / cherry-pick CI run
Links:
Type
🆕 New Feature
🐛 Bug Fix
🧹 Refactoring
📖 Documentation
🚄 Infrastructure
✅ Test
Changes