Images in Anthropic tool_result content blocks are silently dropped for VLM models #393
Bug Description
When using oMLX as an Anthropic Messages API proxy for VLM models (e.g., Qwen 3.5-VL), images embedded inside `tool_result` content blocks are silently dropped. This means VLM-capable models cannot see images returned by tool calls (such as Claude Code's `Read` tool reading PNG/JPG files).
Steps to Reproduce
- Use a VLM model via the `/v1/messages` endpoint (e.g., `qwen3.5-VL-397b-a17b-vlm-6bit-300gb`)
- Send a conversation containing a `tool_result` with image content:

```json
{
  "role": "user",
  "content": [
    {
      "type": "tool_result",
      "tool_use_id": "toolu_123",
      "content": [
        {
          "type": "image",
          "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": "<base64 data>"
          }
        },
        {
          "type": "text",
          "text": "Image content from /tmp/screenshot.png"
        }
      ]
    }
  ]
}
```

- The model receives only the text content; the image is lost
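The steps above can be reproduced with a small script that builds the request body (a sketch; the base64 payload is a stand-in, not a real image, and the model name is the one from this report):

```python
import base64
import json

# Stand-in image bytes; any base64 payload triggers the same code path,
# since the bug is in message conversion, not image decoding.
fake_png = base64.b64encode(b"\x89PNG fake image bytes").decode()

payload = {
    "model": "qwen3.5-VL-397b-a17b-vlm-6bit-300gb",
    "max_tokens": 256,
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "tool_result",
                    "tool_use_id": "toolu_123",
                    "content": [
                        {
                            "type": "image",
                            "source": {
                                "type": "base64",
                                "media_type": "image/png",
                                "data": fake_png,
                            },
                        },
                        {
                            "type": "text",
                            "text": "Image content from /tmp/screenshot.png",
                        },
                    ],
                }
            ],
        }
    ],
}

# POST this body to the proxy, e.g.:
#   curl -X POST http://localhost:<port>/v1/messages \
#        -H "Content-Type: application/json" -d @payload.json
print(json.dumps(payload, indent=2)[:120])
```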
Root Cause
In `omlx/api/anthropic_utils.py`, the `_extract_tool_result_content()` function only extracts `type: "text"` blocks from list content and ignores `type: "image"` blocks:
```python
elif isinstance(content, list):
    text_parts = []
    for item in content:
        if isinstance(item, dict):
            if item.get("type") == "text":
                text_parts.append(item.get("text", ""))
            # <-- "image" blocks are silently skipped
```

Additionally, in `convert_anthropic_to_internal()`, even when `preserve_images=True`, the `tool_result` handlers don't check for image blocks inside the tool result's content; they only call `_extract_tool_result_content()`, which strips images.
This affects both the native tool calling path (line ~185) and the non-native fallback path (line ~232).
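The dropping behavior can be demonstrated standalone with a simplified reimplementation of the extraction logic described above (not the actual oMLX source, just the same control flow):

```python
def extract_tool_result_content(content):
    """Simplified stand-in for _extract_tool_result_content()."""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        text_parts = []
        for item in content:
            if isinstance(item, dict):
                if item.get("type") == "text":
                    text_parts.append(item.get("text", ""))
                # "image" blocks fall through here and are lost
        return "\n".join(text_parts)
    return ""


blocks = [
    {
        "type": "image",
        "source": {"type": "base64", "media_type": "image/png", "data": "..."},
    },
    {"type": "text", "text": "Image content from /tmp/screenshot.png"},
]

# Only the text block survives; the image block is dropped with no warning.
print(extract_tool_result_content(blocks))
```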
Impact
This breaks image support for any client that sends images via tool results, including:
- Claude Code: When the `Read` tool reads an image file, it returns the image data inside a `tool_result` block. VLM models never see the image.
- Any other Anthropic API client that returns images as tool outputs
Suggested Fix
- Add a helper to extract images from tool result content:
```python
def _extract_images_from_tool_result_content(content, image_parts):
    """Extract image blocks from tool result content for VLM processing."""
    if isinstance(content, list):
        for item in content:
            if isinstance(item, dict) and item.get("type") == "image":
                _append_anthropic_image_part(image_parts, item)
    elif isinstance(content, dict) and content.get("type") == "image":
        _append_anthropic_image_part(image_parts, content)
```

- In both `tool_result` handlers within `convert_anthropic_to_internal()`, after processing the tool result text, extract images when `preserve_images=True`:
```python
# After processing tool_result text...
if preserve_images:
    _extract_images_from_tool_result_content(tool_result_content, image_parts)
```

The extracted images accumulate in `image_parts` and get included in the next user message via `_build_message_from_parts()`, where `extract_images_from_messages()` in the VLM engine picks them up.
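The suggested fix can be sanity-checked in isolation. The sketch below wires the proposed helper to a stubbed `_append_anthropic_image_part` (the real function lives in `omlx/api/anthropic_utils.py`; its stubbed body here is an assumption about what it does, not the actual implementation):

```python
def _append_anthropic_image_part(image_parts, block):
    # Stub: the real helper in omlx/api/anthropic_utils.py presumably
    # normalizes the Anthropic image block into an internal part.
    source = block.get("source", {})
    if source.get("type") == "base64":
        image_parts.append(
            {"media_type": source.get("media_type"), "data": source.get("data")}
        )


def _extract_images_from_tool_result_content(content, image_parts):
    """Extract image blocks from tool result content for VLM processing."""
    if isinstance(content, list):
        for item in content:
            if isinstance(item, dict) and item.get("type") == "image":
                _append_anthropic_image_part(image_parts, item)
    elif isinstance(content, dict) and content.get("type") == "image":
        _append_anthropic_image_part(image_parts, content)


image_parts = []
tool_result_content = [
    {
        "type": "image",
        "source": {"type": "base64", "media_type": "image/png", "data": "AAAA"},
    },
    {"type": "text", "text": "Image content from /tmp/screenshot.png"},
]

preserve_images = True
if preserve_images:
    _extract_images_from_tool_result_content(tool_result_content, image_parts)

# With the fix, one image part is recovered instead of zero.
print(len(image_parts))
```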
Environment
- oMLX version: 0.2.20.dev2
- Model: `qwen3.5-VL-397b-a17b-vlm-6bit-300gb`
- Client: Claude Code via Anthropic Messages API (`/v1/messages`)