
Litellm oss staging 03 18 2026 #24081

Merged
30 commits merged into main from litellm_oss_staging_03_18_2026
Mar 20, 2026
Conversation


@ghost ghost commented Mar 19, 2026

Relevant issues

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have added testing in the tests/test_litellm/ directory. Adding at least 1 test is a hard requirement - see details
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Delays in PR merge?

If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack (#pr-review).

CI (LiteLLM team)

CI status guideline:

  • 50-55 passing tests: main is stable with minor issues.
  • 45-49 passing tests: acceptable, but needs attention.
  • <= 40 passing tests: unstable; be careful with your merges and assess the risk.
  • Branch creation CI run
    Link:

  • CI run for the last commit
    Link:

  • Merge / cherry-pick CI run
    Links:

Type

🆕 New Feature
🐛 Bug Fix
🧹 Refactoring
📖 Documentation
🚄 Infrastructure
✅ Test

Changes

hytromo and others added 21 commits December 17, 2025 10:02
…l in Responses API

PR #18945 added support for capturing Anthropic server-side tool results
(bash_code_execution_tool_result, etc.) in provider_specific_fields, but
the data never reached the Responses API output because:

1. Non-streaming: provider_specific_fields wasn't copied into _hidden_params
2. Streaming: chunk delta's provider_specific_fields wasn't accumulated
3. Tool results weren't mapped to standard output items

This fix:
- Copies provider_specific_fields to _hidden_params in transform_response()
- Accumulates provider_specific_fields from streaming chunk deltas
- Maps bash_code_execution_tool_result to code_interpreter_call output items
  with code and outputs (matching OpenAI's native shape)
- Removes redundant function_call items for server-side tools
- Adds OutputCodeInterpreterCall type to the output union
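The forwarding step in (1) above can be sketched in isolation. This is not litellm's actual code: `forward_provider_fields` is a hypothetical helper name, while `provider_specific_fields` and `_hidden_params` are the real field names from the commit message.

```python
# Sketch only: copy provider_specific_fields into the hidden-params dict
# so downstream Responses API code can see server-side tool results.
# forward_provider_fields is a made-up helper name for this example.

def forward_provider_fields(message: dict, hidden_params: dict) -> dict:
    """Copy a message's provider_specific_fields into hidden params, if any."""
    fields = message.get("provider_specific_fields")
    if fields:
        hidden_params["provider_specific_fields"] = dict(fields)
    return hidden_params


message = {
    "provider_specific_fields": {
        "code_interpreter_results": [
            {"id": "srvtoolu_1", "type": "bash_code_execution_tool_result"}
        ]
    }
}
hidden = forward_provider_fields(message, {})
```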
…cutions

stream_chunk_builder uses "last value wins" for list-valued
provider_specific_fields keys. _build_code_interpreter_results was
emitting only new items (incremental), so earlier results were silently
dropped when multiple sequential code executions occurred.

- Emit cumulative list from _build_code_interpreter_results, matching
  web_search_results pattern
- Assemble server_tool_use input from input_json_delta deltas at
  content_block_stop (Anthropic streams input: {} in start block)
- Handle dict items in _extract_tool_result_output_items after
  model_dump() serialization in stream_chunk_builder
- Simplify _merge_provider_specific_fields to last-value-wins for lists,
  matching stream_chunk_builder semantics
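The "last value wins" hazard described above can be reproduced with a toy merge; `merge_last_value_wins` is a hypothetical stand-in for stream_chunk_builder's semantics, not the real function.

```python
# Toy reproduction of the bug: under "last value wins" merging, an
# incrementally-emitted list loses earlier items; a cumulative list does not.

def merge_last_value_wins(chunk_fields):
    merged = {}
    for fields in chunk_fields:
        merged.update(fields)  # later values overwrite earlier ones wholesale
    return merged

# Incremental emission: the second execution's chunk replaces the first result.
incremental = merge_last_value_wins([
    {"code_interpreter_results": [{"id": "a"}]},
    {"code_interpreter_results": [{"id": "b"}]},
])
# Cumulative emission (the fix): each chunk carries all results so far.
cumulative = merge_last_value_wins([
    {"code_interpreter_results": [{"id": "a"}]},
    {"code_interpreter_results": [{"id": "a"}, {"id": "b"}]},
])
```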
When both stdout and stderr are empty strings, the `if parts else
str(content)` fallback produced the raw dict representation as logs.
Drop the fallback so logs is correctly empty.
- Empty stdout/stderr now produces outputs=None (matching OpenAI parity)
  instead of outputs=[{logs:""}], in both streaming and non-streaming paths
- Fix test fixture to use real Anthropic type "bash_code_execution_tool_result"
  instead of "code_execution_tool_result"
- Add test for empty-output → outputs=None behavior
- Add unit tests for _extract_tool_result_output_items: Pydantic objects,
  plain dicts (post-model_dump), empty/missing provider_specific_fields,
  and in-place substitution preserving output ordering
Replace str(content) fallback with empty string so non-dict content
(e.g. list-shaped text_editor results) produces outputs=None instead
of raw Python object representations in logs.
… only

Skip non-bash tool result types (e.g. text_editor_code_execution_tool_result)
to avoid producing empty code_interpreter_call items in Responses API output.
…n test

- test_non_bash_tool_result_skipped: verifies text_editor results produce
  zero code_interpreter_call items
- test_end_to_end_streaming_chunks_to_code_interpreter_output: exercises
  full path from Anthropic SSE chunks through ModelResponseIterator,
  stream_chunk_builder, and _extract_tool_result_output_items without
  a live server
- Populate container_id on streaming code_interpreter_results by
  re-emitting at message_delta when container info arrives
- Reconstruct Pydantic OutputCodeInterpreterCall objects from plain
  dicts in _extract_tool_result_output_items so responses_output
  has uniform types across streaming and non-streaming paths
LangSmith reads the Cost column from outputs.usage_metadata.total_cost,
but LangsmithLogger._prepare_log_data never wrote to that key. The
response_cost was already computed in StandardLoggingPayload but was
not forwarded to the outputs dict.

Inject usage_metadata with input_tokens, output_tokens, total_tokens,
and total_cost into the outputs dict so LangSmith can display cost.

Fixes #24001

Made-with: Cursor
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
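A minimal standalone sketch of the injection described above; `build_outputs_with_usage` is a hypothetical version of the helper this PR adds, and the OpenAI-style usage key names (`prompt_tokens`, etc.) are an assumption for illustration.

```python
# Sketch (not litellm's actual _prepare_log_data) of enriching LangSmith
# outputs with usage_metadata so the Cost column can be populated.

def build_outputs_with_usage(outputs: dict, usage: dict, total_cost: float) -> dict:
    """Attach usage_metadata with token counts and cost to the outputs dict."""
    enriched = dict(outputs)
    enriched["usage_metadata"] = {
        "input_tokens": usage.get("prompt_tokens", 0),
        "output_tokens": usage.get("completion_tokens", 0),
        "total_tokens": usage.get("total_tokens", 0),
        "total_cost": total_cost,
    }
    return enriched


outputs = build_outputs_with_usage(
    {"content": "hi"},
    {"prompt_tokens": 10, "completion_tokens": 5, "total_tokens": 15},
    0.000123,
)
```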
…thropic-tool-results-responses-api

fix: surface Anthropic code execution results as code_interpreter_call in Responses API
The check `content.get("thinking", None) is not None` incorrectly
drops thinking blocks when the `thinking` key is explicitly null or
absent. Changed to `content.get("type") == "thinking"` to match
the fix already applied in the experimental pass-through path (PR #15501).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Fix thinking blocks dropped when thinking field is null
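The difference between the two checks can be demonstrated with toy predicates (not the actual transformation code):

```python
# Why the key-presence check drops valid blocks while the type check keeps them.

def kept_by_old_check(content: dict) -> bool:
    return content.get("thinking", None) is not None  # buggy: value-based

def kept_by_new_check(content: dict) -> bool:
    return content.get("type") == "thinking"  # fixed: type-based

# A thinking block whose "thinking" value is explicitly null:
null_thinking_block = {"type": "thinking", "thinking": None}
```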
…gging

Preserve router model_group in generic API logs

vercel bot commented Mar 19, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
litellm Ready Ready Preview, Comment Mar 20, 2026 0:59am


Contributor

codspeed-hq bot commented Mar 19, 2026

Merging this PR will not alter performance

✅ 16 untouched benchmarks


Comparing litellm_oss_staging_03_18_2026 (8d92d86) with main (b1731b6) [1]

Open in CodSpeed

Footnotes

  1. No successful run was found on main (8d92d86) during the generation of this report, so b1731b6 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

Krish Dholakia added 3 commits March 18, 2026 21:29

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6ef440c2f5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1743 to +1747
responses_output = [
    (
        result_by_id[getattr(item, "call_id", None)]
        if (
            getattr(item, "type", None) == "function_call"

P1 Badge Preserve reusable tool-call items in response.output

For Anthropic code-execution responses, this replaces the original function_call entry with a code_interpreter_call. That breaks the standard response.output continuation path: _transform_responses_api_input_item_to_chat_completion_message() only recognizes function_call / function_call_output (litellm/responses/litellm_completion_transformation/transformation.py:967-1009), so reusing response.output as next-turn input—exactly what tests/llm_responses_api_testing/base_responses_api.py:527-543 does—drops the code-interpreter item entirely and loses the executed tool call on the next request.


Comment on lines +1741 to +1745
result_by_id = {item.id: item for item in tool_result_items}
replaced_ids = set(result_by_id.keys())
responses_output = [
    (
        result_by_id[getattr(item, "call_id", None)]

P2 Badge Keep streaming event item types consistent with final output

In streaming mode, this rewrite only affects the final response.completed.output. The iterator still emits response.output_item.added/done and response.function_call_arguments.* as function_call items for the same call_id (litellm/responses/litellm_completion_transformation/streaming_iterator.py:204-224 and 293-360), so clients that build state from the stream will see output index 1 as a function_call and then have response.completed silently change that slot into code_interpreter_call for Anthropic code-execution streams.


Comment on lines +1777 to +1781
type="code_interpreter_call",
id=call_id,
code=code_by_id.get(call_id, ""),
container_id=container_id,
status="completed",

P2 Badge Mark non-zero bash executions as failed

When Anthropic returns a bash_code_execution_tool_result with return_code != 0, this still hard-codes the Responses API item to status="completed". That misreports failed executions as successful in both the streaming and non-streaming paths, so any caller that keys off the normalized status field will miss code-interpreter failures and continue as if the command succeeded.
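The suggested behavior could be sketched as a small mapping; `status_from_return_code` is a hypothetical helper, not part of the PR:

```python
# Derive the Responses API status from the bash execution's return code
# instead of hard-coding "completed".

def status_from_return_code(return_code):
    if return_code is None:
        return "completed"  # no exit information available; assume success
    return "completed" if return_code == 0 else "failed"
```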



CLAassistant commented Mar 19, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
6 out of 9 committers have signed the CLA.

✅ hytromo
✅ andrzej-pomirski-yohana
✅ xr843
✅ Chesars
✅ emerzon
✅ Sameerlite
❌ github-actions[bot]
❌ avik2004
❌ Alexey


Alexey does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Contributor

greptile-apps bot commented Mar 19, 2026

Greptile Summary

This staging PR bundles several related improvements across the Anthropic provider, Responses API, streaming infrastructure, and integrations. The major themes are:

  • Anthropic code interpreter support: Adds OutputCodeInterpreterCall / OutputCodeInterpreterCallLog types and propagates code_interpreter_results from both the streaming (ModelResponseIterator) and non-streaming (AnthropicConfig.transform_parsed_response) paths into provider_specific_fields. The Responses API layer (LiteLLMCompletionResponsesConfig) then replaces matching function_call items with code_interpreter_call items in the output. An end-to-end mock test (test_end_to_end_streaming_chunks_to_code_interpreter_output) verifies the full data path including stream_chunk_builder's last-value-wins semantics.
  • Responses API failure/incomplete handling: BaseResponsesAPIStreamingIterator now stores completed_response for RESPONSE_INCOMPLETE and RESPONSE_FAILED events (not only RESPONSE_COMPLETED), and routes RESPONSE_FAILED to failure handlers instead of success handlers. The logging layer (litellm_logging.py) is updated to handle all three event types in cost calculation.
  • Streaming safety checker refactor: safety_checker is replaced with raise_on_model_repetition, an O(1) rolling counter (_repeated_messages_count) that avoids holding the last N chunks in memory.
  • LangSmith enrichment: _prepare_log_data now adds usage_metadata (token counts + cost) to Langsmith run outputs, fixing a missing Cost column in LangSmith. Several helper methods are extracted for testability, and a pre-existing operator-precedence bug in the trace_id/dotted_order guard conditions is fixed.
  • Custom pricing for Anthropic passthrough: _create_anthropic_response_logging_payload now forwards custom_pricing and router_model_id to litellm.completion_cost(), ensuring custom deployment pricing is applied for /messages pass-through calls.
  • Router generic API call improvements: _generic_api_call now calls _update_kwargs_before_fallbacks and passes request_kwargs to get_available_deployment, aligning it with the standard completion code path for metadata and deployment selection.
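The O(1) rolling counter described in the streaming safety refactor can be sketched as follows; `RepetitionGuard` is a hypothetical standalone class, not litellm's `raise_on_model_repetition` implementation.

```python
class RepetitionGuard:
    """O(1) rolling counter: raise after N identical consecutive chunks,
    instead of holding the last N chunks in memory."""

    def __init__(self, limit: int = 100):
        self.limit = limit
        self._last = None
        self._count = 0

    def check(self, content):
        if not content:  # None/empty content resets the counter
            self._last, self._count = None, 0
            return
        if content == self._last:
            self._count += 1
        else:
            self._last, self._count = content, 1
        if self._count >= self.limit:
            raise RuntimeError("model appears stuck repeating the same chunk")
```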

Issues found:

  • _handle_logging_failed_response always creates the exception with status_code=500, regardless of what the actual response error payload contains. Client-side failures (e.g. content policy) will be misclassified in failure logs.
  • The content_block_stop loop in handler.py accesses block["delta"] directly; a previous thread suggested using .get() — the fix has not been applied yet.
  • _build_code_by_id_map silently swallows all exceptions including ones that would result in an empty code field; consider at minimum debug-logging the exception.

Confidence Score: 4/5

  • Mostly safe to merge; one logic issue in failure-event status code classification and two pre-existing defensive-coding gaps that are low-risk in practice.
  • The PR is well-tested — it adds 2,515 lines of tests covering streaming, non-streaming, end-to-end, and edge-case scenarios across all major changed modules. The core data flows (code_interpreter_results accumulation, LangSmith enrichment, streaming safety refactor) are verified by mock unit tests. The one P1 issue (hardcoded 500 in _handle_logging_failed_response) affects failure-log accuracy rather than request correctness, and is unlikely to cause user-visible regressions. The two P2 style concerns around silent exception handling are present in new code but are non-blocking. No backwards-incompatible changes or DB calls in the critical path were found.
  • litellm/responses/streaming_iterator.py (hardcoded status_code=500 in failure handler), litellm/llms/anthropic/chat/handler.py (content_block_stop direct dict access noted in previous thread still present)

Important Files Changed

Filename Overview
litellm/llms/anthropic/chat/handler.py Adds streaming code_interpreter_results tracking via _server_tool_inputs, tool_results, and _build_code_interpreter_results(). Input assembly from input_json_delta chunks is performed at content_block_stop. Logic appears correct with tests covering delta accumulation and container_id propagation.
litellm/llms/anthropic/chat/transformation.py Refactors transform_parsed_response removing the else-branch, extracts _build_provider_specific_fields/_build_code_interpreter_results helpers, and fixes thinking-block detection (type-based instead of key-presence). Also adds provider_specific_fields to _hidden_params. Clean refactor with good test coverage.
litellm/responses/streaming_iterator.py Adds handling for RESPONSE_INCOMPLETE and RESPONSE_FAILED events; introduces _handle_logging_failed_response() which routes to failure handlers. Status code is always hardcoded to 500 regardless of actual error type, which may misclassify client-side errors.
litellm/responses/litellm_completion_transformation/streaming_iterator.py Accumulates provider_specific_fields across streaming chunks using last-value-wins semantics in _merge_provider_specific_fields, stored in _accumulated_provider_specific_fields and merged into _hidden_params on create_litellm_model_response. End-to-end test confirms stream_chunk_builder properly propagates to message.provider_specific_fields.
litellm/responses/litellm_completion_transformation/transformation.py Adds _extract_tool_result_output_items and in-place substitution logic to replace function_call items with OutputCodeInterpreterCall items in Responses API output. Tests verify both non-streaming (Pydantic) and streaming (dict) paths, including ordering preservation.
litellm/integrations/langsmith.py Extracts _extract_metadata_fields, _build_extra_metadata, _build_outputs_with_usage, and _ensure_required_ids helpers. _build_outputs_with_usage enriches LangSmith outputs with usage_metadata (cost, token counts). Also fixes the original operator precedence bug in trace_id/dotted_order guard conditions.
litellm/litellm_core_utils/streaming_handler.py Replaces O(N) safety_checker with O(1) raise_on_model_repetition using a rolling consecutive-count (_repeated_messages_count). Correctly handles None/short content resets and is thoroughly tested with parametrized cases covering edge conditions.
litellm/types/responses/main.py Adds OutputCodeInterpreterCallLog, OutputCodeInterpreterCall types, and build_code_interpreter_log_outputs() helper shared by streaming and non-streaming paths. The helper contains Anthropic-specific parsing logic (stdout/stderr keys, STDERR: prefix) in the shared types layer — noted as a concern in previous threads.

Sequence Diagram

sequenceDiagram
    participant Anthropic as Anthropic SSE
    participant MRI as ModelResponseIterator
    participant CSW as CustomStreamWrapper
    participant LCSI as LiteLLMCompletionStreamingIterator
    participant SCB as stream_chunk_builder
    participant Trans as LiteLLMCompletionResponsesConfig

    Anthropic->>MRI: content_block_start (server_tool_use)
    MRI->>MRI: Store _current_server_tool_id + empty input
    Anthropic->>MRI: content_block_delta (input_json_delta)
    MRI->>MRI: Accumulate partial_json in content_blocks
    Anthropic->>MRI: content_block_stop
    MRI->>MRI: Assemble JSON → _server_tool_inputs[id]

    Anthropic->>MRI: content_block_start (bash_code_execution_tool_result)
    MRI->>MRI: Append to tool_results
    MRI->>MRI: _build_code_interpreter_results() → cumulative list
    MRI-->>CSW: chunk with delta.provider_specific_fields[code_interpreter_results]

    Anthropic->>MRI: message_delta (container)
    MRI->>MRI: Store container_id, re-emit code_interpreter_results
    MRI-->>CSW: chunk with updated container_id

    CSW-->>LCSI: ModelResponseStream chunks
    LCSI->>LCSI: _merge_provider_specific_fields() (last-value-wins)
    LCSI->>LCSI: _ensure_output_item_for_chunk()

    LCSI->>SCB: collected_chat_completion_chunks
    SCB-->>LCSI: ModelResponse (message.provider_specific_fields populated)
    LCSI->>LCSI: Merge _accumulated_provider_specific_fields → _hidden_params

    LCSI->>Trans: _get_output_items_from_chat_completion_response(ModelResponse)
    Trans->>Trans: _extract_tool_result_output_items()
    Trans->>Trans: Replace function_call → OutputCodeInterpreterCall in-place
    Trans-->>LCSI: ResponsesAPIResponse with code_interpreter_call items

Comments Outside Diff (1)

  1. litellm/llms/anthropic/chat/handler.py, line 888-910 (link)

    content_blocks iteration silently discards KeyError on missing delta key

    When content_block_stop fires for a server_tool_use block, every element in self.content_blocks is expected to carry a delta dict. The direct block["delta"] access on line 893 will raise KeyError for any block that doesn't have delta (e.g. an unexpected chunk shape). That exception is caught by the outer chunk_parser try/except, silently leaving self._server_tool_inputs[self._current_server_tool_id] pointing at the empty dict set during content_block_start, which means code will be "" in the final OutputCodeInterpreterCall.

    A previous thread suggested using .get():

    for block in self.content_blocks:
        delta = block.get("delta", {})
        if delta.get("type") == "input_json_delta":
            partial_json = delta.get("partial_json")
            if isinstance(partial_json, str):
                args += partial_json

Last reviewed commit: "Merge branch 'main' ..."

Comment on lines +1761 to +1767
code_by_id: Dict[str, str] = {}
for tc in tool_calls:
    try:
        args = json.loads(tc.get("function", {}).get("arguments", "{}"))
        code_by_id[tc.get("id", "")] = args.get("command", "")
    except Exception:
        pass

P2 Silent failure may mask missing code field

The try/except Exception: pass silently swallows failures when building code_by_id. If tc is not a plain dict (e.g. a Pydantic model whose .get() behaves differently), or if arguments is invalid JSON, the resulting OutputCodeInterpreterCall will have code="" with no observable error.

Consider at minimum logging the exception so debugging is possible, and also guard the outer loop with tool_calls or [] as a defensive measure:

Suggested change

# before
code_by_id: Dict[str, str] = {}
for tc in tool_calls:
    try:
        args = json.loads(tc.get("function", {}).get("arguments", "{}"))
        code_by_id[tc.get("id", "")] = args.get("command", "")
    except Exception:
        pass

# after
code_by_id: Dict[str, str] = {}
for tc in (tool_calls or []):
    try:
        args = json.loads(tc.get("function", {}).get("arguments", "{}"))
        code_by_id[tc.get("id", "")] = args.get("command", "")
    except Exception:
        verbose_logger.debug(
            "Failed to extract code from tool call: %s", tc
        )

Comment on lines 888 to 911
    ),
    index=self.tool_index,
)
# Update server_tool_inputs with fully assembled input
# from input_json_delta chunks (content_block_start has {})
if (
    self.current_content_block_type == "server_tool_use"
    and self._current_server_tool_id
):
    args = ""
    for block in self.content_blocks:
        if block["delta"]["type"] == "input_json_delta":
            args += block["delta"].get("partial_json", "")
    if args:
        try:
            self._server_tool_inputs[
                self._current_server_tool_id
            ] = json.loads(args)
        except (json.JSONDecodeError, TypeError):
            pass
    self._current_server_tool_id = None
# Reset response_format tool tracking when block stops
self.is_response_format_tool = False
# Reset current content block type

P2 content_blocks iteration assumes every block has a delta key

In content_block_stop, the loop:

for block in self.content_blocks:
    if block["delta"]["type"] == "input_json_delta":

uses a direct block["delta"] access. While self.content_blocks is populated exclusively from content_block_delta events (which always carry delta), there is no guard if an unexpected chunk shape slips in. The outer chunk_parser try/except will catch the resulting KeyError, but the effect is that the whole content_block_stop processing is silently skipped for that block — which would leave _server_tool_inputs unpopulated and the code field empty.

A safer pattern:

for block in self.content_blocks:
    delta = block.get("delta", {})
    if delta.get("type") == "input_json_delta":
        args += delta.get("partial_json", "")

…etection-performance

Fix model repetition detection performance
fix: fix logging for response incomplete streaming + custom pricing on /v1/messages and /v1/responses
Comment on lines +63 to +78
id: str
code: Optional[str]
container_id: Optional[str]
status: Literal["in_progress", "completed", "incomplete", "failed"]
outputs: Optional[List[OutputCodeInterpreterCallLog]]


def build_code_interpreter_log_outputs(
    content: Any,
) -> Optional[List[OutputCodeInterpreterCallLog]]:
    """Convert Anthropic bash_code_execution stdout/stderr to log outputs.

    Shared by streaming (handler.py) and non-streaming (transformation.py) paths.
    """
    if not isinstance(content, dict):
        return None

P2 Provider-specific parsing logic in the types/ layer

build_code_interpreter_log_outputs understands Anthropic's internal bash_code_execution_result format (stdout, stderr keys) and the "STDERR: " prefix convention. This is Anthropic-specific knowledge; placing it in litellm/types/responses/main.py leaks provider semantics into the shared types layer.

Per the repository guideline, provider-specific logic should live under litellm/llms/<provider>/. Consider moving this helper to litellm/llms/anthropic/ (e.g., litellm/llms/anthropic/common_utils.py) and importing it from there in both handler.py and transformation.py.

# Better placement: litellm/llms/anthropic/common_utils.py
def build_code_interpreter_log_outputs(content: Any) -> Optional[List[OutputCodeInterpreterCallLog]]:
    ...

This keeps types/responses/main.py as pure data-model definitions with no provider-specific parsing.

Rule Used: What: Avoid writing provider-specific code outside... (source)

- Extract helper methods in langsmith._prepare_log_data to reduce from 51 to <50 statements
- Extract helper methods in anthropic.transform_parsed_response to reduce from 57 to <50 statements
- Fixes PLR0915 linter errors
- All existing tests pass (10 langsmith tests, 126 anthropic tests)

Made-with: Cursor
@Sameerlite
Copy link
Copy Markdown
Collaborator

Sameerlite commented Mar 19, 2026

The CI/CD run from main that was used as reference: https://app.circleci.com/pipelines/github/BerriAI/litellm/69965 (latest at the time of testing). The current CI/CD run of this branch is aligned with it and is not failing any extra tests. Main is also failing some tests at the moment, hence the reference run above.

@ghost ghost merged commit 482b77e into main Mar 20, 2026
33 of 36 checks passed
@ishaan-berri ishaan-berri deleted the litellm_oss_staging_03_18_2026 branch March 26, 2026 22:29
