feat: Gemma 4 reasoning parser and agentic tool calling #565
jundot merged 1 commit into jundot:main
Conversation
Force-pushed from dc2274f to 6499c0b
Adds generic output parser sessions, Gemma 4 reasoning channel parsing,
and complete agentic tool calling support for Gemma 4 VLM models.
## Reasoning parser
Refactors the scheduler to support pluggable per-model output parser
sessions. First implementation handles Gemma 4's reasoning channel:
strips <|channel>thought…<channel|> from streamed output and re-emits
it as <think>…</think> in reasoning_content.
- omlx/adapter/output_parser.py (new): detect_output_parser() factory
- omlx/adapter/gemma4.py (new): Gemma4OutputParserSession
- omlx/scheduler.py: threads parser session through generation loop
- omlx/utils/tokenizer.py: helpers for the session
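The streaming behavior of such a session can be sketched roughly as follows. The class name and `feed()` interface here are illustrative assumptions, not omlx's actual API; only the marker strings come from the description above.

```python
# Hypothetical sketch of a streaming reasoning-channel parser session.
# Marker strings follow the PR description; the feed() interface is an
# assumption for illustration.

OPEN = "<|channel>thought"
CLOSE = "<channel|>"

class ReasoningParserSession:
    """Splits streamed model output into visible content and reasoning.

    Text between OPEN and CLOSE is re-emitted wrapped in <think>...</think>
    on the reasoning side; everything else passes through as content.
    """

    def __init__(self):
        self._buf = ""           # holds text that may still be a partial marker
        self._in_thought = False

    def feed(self, chunk: str):
        """Consume one streamed chunk; return (content, reasoning_content)."""
        self._buf += chunk
        content, reasoning = [], []
        while self._buf:
            marker = CLOSE if self._in_thought else OPEN
            idx = self._buf.find(marker)
            if idx != -1:
                piece, self._buf = self._buf[:idx], self._buf[idx + len(marker):]
                self._emit(piece, content, reasoning)
                reasoning.append("</think>" if self._in_thought else "<think>")
                self._in_thought = not self._in_thought
                continue
            # No full marker: hold back a tail that could still become one.
            keep = self._partial_tail(marker)
            cut = len(self._buf) - keep
            piece, self._buf = self._buf[:cut], self._buf[cut:]
            self._emit(piece, content, reasoning)
            break
        return "".join(content), "".join(reasoning)

    def _emit(self, piece, content, reasoning):
        (reasoning if self._in_thought else content).append(piece)

    def _partial_tail(self, marker: str) -> int:
        # Longest suffix of the buffer that is a prefix of the marker.
        for n in range(min(len(marker) - 1, len(self._buf)), 0, -1):
            if marker.startswith(self._buf[-n:]):
                return n
        return 0
```

Holding back a partial-marker tail is the key detail: a marker split across two streamed chunks must not leak half-emitted into content.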
## Tool call output parsing
_inject_tool_calling was using mlx_lm._infer_tool_parser, which has no
knowledge of Gemma 4's <|tool_call> marker: it returned None,
has_tool_calling was never set, and raw markup leaked into response content.
The pinned mlx-vlm (43b9b20) ships mlx_vlm.tool_parsers — a superset
that adds <|tool_call> -> gemma4 detection and a correct per-model
parser. _inject_tool_calling now prefers this, falling back to the
mlx_lm path for older installs.
## Tool result ingestion (message extractor pattern)
The Gemma 4 chat template has no handling for role=tool messages. Tool
results must appear on a model-role turn as tool_responses:
{"role": "assistant", "tool_responses": [{"name": fn, "response": ...}]}
Passing raw role=tool messages caused the template to emit
<|tool_response> literals in content and halt after the first tool call.
Two secondary bugs: _merge_consecutive_roles was collapsing the
tool_calls and tool_responses turns into one (fixed with
_PRESERVE_BOUNDARY_KEY); _drop_void_assistant_messages was stripping
the tool_responses turn (fixed with a tool_responses guard).
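The required conversion can be illustrated with a small self-contained helper. The function name is hypothetical; the real logic (including the merge and drop guards above) lives in omlx/adapter/gemma4.py:

```python
# Hypothetical sketch: fold OpenAI-style role=tool messages into the
# model-role tool_responses turn the Gemma 4 chat template expects.

def fold_tool_results(messages: list[dict]) -> list[dict]:
    out = []
    pending = []  # consecutive role=tool results awaiting a fold
    for msg in messages:
        if msg.get("role") == "tool":
            pending.append({"name": msg.get("name"),
                            "response": msg.get("content")})
            continue
        if pending:
            # Emit collected results as their own assistant-role turn,
            # kept separate from the preceding tool_calls turn.
            out.append({"role": "assistant", "tool_responses": pending})
            pending = []
        out.append(msg)
    if pending:
        out.append({"role": "assistant", "tool_responses": pending})
    return out
```

After folding, no role=tool message reaches the chat template, so the `<|tool_response>` literals never appear in rendered content.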
Design mirrors detect_output_parser — model-specific logic stays out of
server.py:
- detect_message_extractor() in output_parser.py returns the extractor
- BatchedEngine/VLMBatchedEngine expose it as message_extractor property
- server.py uses getattr(engine, 'message_extractor', None)
- All Gemma4 logic lives in omlx/adapter/gemma4.py
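The detection pattern from the bullets above can be sketched as follows; the function and property names follow the list, while all internals are stand-in assumptions:

```python
# Sketch of the message-extractor detection pattern. Names mirror the
# PR description; bodies are placeholders, not the real implementations.

def gemma4_message_extractor(messages):
    """Stand-in for the real extractor in omlx/adapter/gemma4.py."""
    return messages  # real code folds tool results, merges roles, etc.

def detect_message_extractor(model_type: str):
    # Return the model-specific extractor, or None when no rewriting
    # is needed (the common case).
    if model_type == "gemma4":
        return gemma4_message_extractor
    return None

class VLMBatchedEngine:
    def __init__(self, model_type: str):
        self.message_extractor = detect_message_extractor(model_type)

def prepare_messages(engine, messages):
    # server.py stays model-agnostic: it never names gemma4, it only
    # asks whether the engine exposes an extractor.
    extractor = getattr(engine, "message_extractor", None)
    return extractor(messages) if extractor else messages
```

The getattr default keeps server.py working against engines that predate the property, which is what makes the change backward compatible.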
## Tests
- tests/test_gemma4_messages.py: 11 tests covering message conversion, tool result folding, name resolution, and multi-turn agentic loops
- tests/test_output_parser.py: reasoning parser session
- tests/test_scheduler.py: scheduler integration
- tests/test_utils_tokenizer.py: tokenizer helpers
Force-pushed from 2917d21 to 397287e
Note there is an underlying bug in mlx-vlm that I discovered, so some cases of tool use are broken even with this implementation. See Blaizzy/mlx-vlm#914
Note on my earlier comment: Blaizzy/mlx-vlm#914 has been fixed, and tool calls are now properly parsed from mlx-vlm as of commit b8c0c5d.
@TipKnuckle I've updated mlx-lm and mlx-vlm to the latest versions, which I expect will resolve many of the existing issues. I'm currently working through the related changes needed on the omlx side. Please allow me some time - will follow up here soon!
@jundot Thanks. I realize this is an architectural change, but I do think it's the right direction for handling custom parsing like Gemma 4 has introduced. SOLAR is another model that would benefit from this (though it would require its own parser as well). Updating mlx-lm and mlx-vlm will unfortunately not help with Gemma 4 tool calling on your current main (compare upstream main vs. this PR in omlx/engine/vlm.py).
The problem: the latest mlx-lm commit (4469ad4, "Add gemma 4") doesn't add tool parser support; it's model architecture support, not tool calling. So even after updating mlx-lm to latest, Gemma 4 tool calling still won't work without the mlx-vlm tool parser change.
jundot left a comment
Reviewed the full diff against current main (6 commits ahead of the v0.3.2 base). No merge conflicts, no functional overlap with recent changes.
Really clean work. The adapter pattern generalizing Harmony-only code into pluggable output parser sessions is well thought out. Model-specific logic stays out of server.py, Harmony backward compatibility is fully preserved, and the message extractor property is a nice touch. Tool calling coverage is thorough; all the edge cases I can think of are handled. Test coverage is solid too.
Merging.
Great, thanks!
Adds generic output parser sessions and complete Gemma 4 reasoning + agentic tool calling support. Supersedes #561.