
Conversation

@hksdpc255
Contributor

@hksdpc255 hksdpc255 commented Nov 2, 2025

Generalized and streaming-capable XML-style tool-call parsing with grammar enforcement and automatic template fixing.

Based on PR #15904, this patch introduces a generalized implementation for almost all XML-style tool-call formats.

Supported models

  • GLM 4.5/4.6
  • MiniMax M2
  • SeedOSS
  • Kimi-K2 (Thinking and non-thinking)
  • Qwen3-Coder (Thinking and non-thinking)
  • Apriel-1.5
  • Xiaomi-MiMo

Grammar-constrained tool-call outputs

Tool-call messages generated by the model are now strictly validated against a defined grammar.
A new automatic grammar generator simplifies the process of creating grammars for new models.
This ensures that all tool-call outputs are well-formed, structurally consistent, and reliably parsed.
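
For illustration only, here is a minimal sketch of the grammar-generation idea (this is not the generator added by this PR; the helper name, the Qwen3-Coder-style tag layout, and the crude value rule are placeholder assumptions):

#include <string>
#include <vector>

// Build a GBNF-style grammar string that only admits one well-formed
// XML tool call for a single known tool (illustrative sketch only).
static std::string xml_toolcall_grammar(const std::string & tool,
                                        const std::vector<std::string> & params) {
    std::string g;
    g += "root ::= \"<tool_call>\" \"<function=" + tool + ">\" ";
    for (const auto & p : params) {
        g += "\"<parameter=" + p + ">\" value \"</parameter>\" ";
    }
    g += "\"</function>\" \"</tool_call>\"\n";
    g += "value ::= [^<]*\n"; // crude: any text that does not open another tag
    return g;
}

A real generator additionally has to handle multiple tools, optional parameters, typed argument values, and lazy-grammar triggers, which is what the automatic grammar generator in this PR abstracts away.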

Streaming support for tool-call parsing

The parser now supports streaming parsing, enabling incremental processing of tool-call messages as they are generated.
This enhancement improves responsiveness and allows real-time interaction during model inference.
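
As a rough illustration of the streaming idea (a sketch only, not this PR's parser; the struct and tag constant below are hypothetical), the key point is that a possible tag prefix is held back so it never leaks into content deltas:

#include <string>
#include <string_view>

struct stream_scanner {
    static constexpr std::string_view open_tag = "<tool_call>";
    std::string pending;        // bytes that may still be a prefix of open_tag
    bool in_tool_call = false;  // set once the full open tag has been seen

    // Feed one decoded character; returns content that is safe to emit now.
    std::string feed(char c) {
        pending.push_back(c);
        if (pending == open_tag) {       // full tag seen: switch to tool-call parsing
            pending.clear();
            in_tool_call = true;
            return "";
        }
        if (open_tag.substr(0, pending.size()) == pending) {
            return "";                   // still a possible tag prefix: keep buffering
        }
        std::string out = pending;       // not a tag after all: flush buffered text
        pending.clear();
        return out;                      // (a real parser would re-scan the flushed text)
    }
};

The actual implementation also has to track reasoning tags, argument bodies, and partial tool-call deltas; the sketch only shows why buffering partial matches is necessary.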

Automatic chat-template fixing

A lightweight Jinja2-based patcher has been added to automatically fix official chat templates before use.
With this change, official templates now work out of the box, eliminating the need for custom modifications.
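
The idea can be pictured as a targeted textual rewrite applied before the template is compiled. A minimal sketch, assuming a plain find-and-replace is sufficient (the patcher in this PR is more careful about context):

#include <string>

// Rewrite constructs the bundled Jinja engine rejects into supported forms,
// e.g. "dict.items()" into the "dict | items" filter form mentioned later in this thread.
static std::string patch_chat_template(std::string tmpl) {
    const std::string from = ".items()";
    const std::string to   = " | items";
    for (size_t pos = 0; (pos = tmpl.find(from, pos)) != std::string::npos; pos += to.size()) {
        tmpl.replace(pos, from.size(), to);
    }
    return tmpl;
}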

In-context reasoning

The parser now supports multiple reasoning blocks within a single generation, even when interleaved with tool calls.
All reasoning content is preserved. No information is lost during parsing or streaming.

Enhanced unit tests

Add a unit test for the streaming-mode parser. It simulates the generation phase by feeding content character by character, comparing the parsed results, and verifying that streaming and non-streaming modes reach the same final state.
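
A condensed sketch of what such a test can look like (the parser type here is a stand-in for illustration, not the actual test fixture added in this PR; the real tests compare full parsed chat messages):

#include <cassert>
#include <string>

// Stand-in "parser" whose parsed state is just the content before any tool call.
struct toy_parser {
    std::string buffer;
    void consume(const std::string & chunk) { buffer += chunk; }
    std::string result() const { return buffer.substr(0, buffer.find("<tool_call>")); }
};

// Feed the generated text one character at a time and require the same final
// state as a single-shot parse of the whole string.
static void check_streaming_matches_oneshot(const std::string & generated) {
    toy_parser streamed;
    for (char c : generated) {
        streamed.consume(std::string(1, c));
    }
    toy_parser oneshot;
    oneshot.consume(generated);
    assert(streamed.result() == oneshot.result());
}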

Additional Notes

  • All unit tests have passed.
  • Community testing is welcome! Please try it out with your model integrations.
  • If your OpenAI-compatible client does not support sending reasoning_content back to the server, use the option --reasoning-format none.
  • When reporting issues, it’s recommended to add -lv 1 to the command line to enable more detailed logging.

Please use the chat template included in this PR, or any other chat template that you are certain will work correctly.

@MikeLP

MikeLP commented Nov 2, 2025

I'm looking forward to getting this PR merged!

@hksdpc255 Does it require the custom Jinja template from the previous PR, or does it work well as is?

@hksdpc255
Contributor Author

hksdpc255 commented Nov 2, 2025

For now, I’d recommend using a custom template if you’re running more complex workloads.
As for the embedded/official template, it won’t fail at the start, but it may be missing some features that your agent requires.

Edit: The official template is now working properly. There’s no longer any need for a custom template.

Edit2: Official template support for Minimax-M2 has been removed. See comment and ochafik/minja#7 (comment) for details.

@ochafik
Collaborator

ochafik commented Nov 2, 2025

FYI I've updated (my fork of) Minja w/ support for GLM 4.6's template.
Might affect how you deal w/ the polyfills, as it should now detect GLM's tool call capability properly.

@hksdpc255
Contributor Author

@ochafik Excellent work! Once llama.cpp syncs your changes, some parts of this PR can be safely removed.

However, there are still a few small patches needed — for example, replacing dict.items() with dict | items.

@hksdpc255
Contributor Author

Currently, the official Minimax-M2 chat template fails to run tool calls because dict.items() and list[-1] are not supported by llama.cpp’s Jinja2 rendering engine.

@ochafik
Collaborator

ochafik commented Nov 3, 2025

Currently, the official Minimax-M2 chat template fails to run tool calls because dict.items() and list[-1] are not supported by llama.cpp’s Jinja2 rendering engine.

@hksdpc255 Both should be supported. The confusing error you probably got was because minja implements items() on dict but not on str. It should detect whether the template expects arguments to be an object instead of a more common json string of said object (see requires_object_arguments), and adjust the inputs accordingly: now hopefully works for GLM 4.6.
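
For illustration, that normalization can be pictured like this (a sketch only, with a hypothetical helper name; it is not minja's actual code):

#include <string>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Give the template the argument shape it expects: an object when it iterates
// over arguments (e.g. with items()), a JSON-encoded string otherwise.
static json normalize_arguments(const json & args, bool requires_object_arguments) {
    if (requires_object_arguments && args.is_string()) {
        return json::parse(args.get<std::string>()); // OpenAI-style string -> object
    }
    if (!requires_object_arguments && !args.is_string()) {
        return args.dump();                          // object -> JSON-encoded string
    }
    return args;                                     // already in the expected shape
}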

As for list[-1], it's supported, but MinMax M2's template has a bug, see this comment.

And please feel free to file bugs on https://github.com/ochafik/minja; it should be cleaner to add syntax support there than to patch things up in llama.cpp.

@hksdpc255
Contributor Author

@ochafik Thank you for pointing that out. I’m currently applying your suggested fix in llama.cpp and will test whether it works as expected. Thanks again for the help!

@hksdpc255
Contributor Author

Good news! The Minimax M2 tool call is now working.

I’ll push the fix later.

@hksdpc255
Contributor Author

hksdpc255 commented Nov 3, 2025

Screenshot from the Zed editor.

Model: unsloth's UD-Q3_K_XL

@hksdpc255 hksdpc255 mentioned this pull request Nov 3, 2025
@emuchogu

emuchogu commented Nov 3, 2025

Hi @hksdpc255 ,
I cloned your repo https://github.com/hksdpc255/llama.cpp/tree/xml_toolcall and unfortunately it's still not producing the initial <think> tag, at least in the CLI. See below.

Model: unsloth--MiniMax-M2-GGUF Q8_0

./llama-cli \
  -m /models/hub/models--unsloth--MiniMax-M2-GGUF/snapshots/*/Q8_0/MiniMax-M2-Q8_0-00001-of-00005.gguf \
  -ngl 99 \
  -sm layer \
  -ts 1,1,1,1,1,1,1,1 \
  -c 78000 \
  -t 16 \
  --jinja \
  -i

Output:

> what is the capital of france?
Okay, the user asked a straightforward question: "What is the capital of France?" This is basic geography knowledge, so the answer should be simple. I don't need to overcomplicate things. 

Hmm, maybe the user is just testing if I know basic facts, or perhaps they're new to this kind of question. Either way, the response should be clear and concise. No need for extra details unless they ask follow-ups. 

I recall that Paris is the capital of France. It's one of the most well-known capitals globally, so this should be an easy one. The user might be a student working on homework, or someone prepping for trivia. Or maybe they're just curious—either way, I should confirm it confidently. 

No signs of confusion or deeper needs here. The question is very direct. I'll just state the answer plainly. If they want more info later, like landmarks or history, they'll ask. For now, keep it simple: Paris is the capital. 

Wait, should I add that it's also a major cultural hub? Nah, overcomplicating it. Just the fact. Done.
</think>

The capital of France is **Paris**. 

Paris is not only the political center but also a major cultural, economic, and gastronomic hub, famous for landmarks like the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and the Champs-Élysées.

@hksdpc255
Contributor Author

@emuchogu Sorry, I haven’t tested it with llama-cli — only with llama-server.

If you want <think> and </think> to appear in the content, append --reasoning-format none when running llama-server.

I’m not sure whether llama-cli uses the same parsing logic.

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Nov 3, 2025

I’ve reverted my previous PR (reasoning-format-minimax-m2) and merged PR #16932 into my testing-branch16 for isolated testing.
I’m running llama-swap with the new XML tool-call parser to check MiniMax-M2 compatibility without any synthetic injection, using --reasoning-format none to observe the parser’s raw behavior.

sendLoadingState: true

macros:
  llama-server: >
    ../llama.cpp.pascal/build/bin/llama-server
    --port 8081
    -ngl 999
    -ctk q8_0
    -ctv q8_0
    -fa on
    --mlock
    -np 1
    --jinja
  models: /var/www/ia/models
  proxy: http://127.0.0.1:8081

  MoE-MiniMax-M2-230B-A10B:
    cmd: |
      ${llama-server}
      -m ${models}/unsloth/MiniMax-M2-GGUF/MiniMax-M2-UD-Q2_K_XL-00001-of-00002.gguf
      --temp 1.0
      --top-p 0.95
      --top-k 40
      --n-cpu-moe 50
      --ctx-size 65536
      --reasoning-format none
    proxy: ${proxy}
    filters:
      strip_params: "temperature, top_p, top_k"

Without this PR:

Streaming, no initial <think> tag in the output.

Curl without streaming, no initial <think> tag in the output:

(root|~/llama.cpp.pascal) curl http://127.0.0.1:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoE-MiniMax-M2-230B-A10B",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "stream": false
  }' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1192  100   973  100   219    259     58  0:00:03  0:00:03 --:--:--   317
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The user asks: \"What is the capital of France?\" The answer is Paris. This is a simple question. There's no disallowed content. So the answer is \"Paris.\" Possibly also mention that it's Paris. So answer: \"The capital of France is Paris.\" There's no reason to go beyond that. There's no conflict with policy. So final answer: \"Paris.\"\n</think>\n\nThe capital of France is **Paris**."
      }
    }
  ],
  "created": 1762152163,
  "model": "MoE-MiniMax-M2-230B-A10B",
  "system_fingerprint": "b6942-5698549e7",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 85,
    "prompt_tokens": 29,
    "total_tokens": 114
  },
  "id": "chatcmpl-gfe455eld4ThdT1D7Ji6CtuJm6md4V7W",
  "timings": {
    "cache_n": 15,
    "prompt_n": 14,
    "prompt_ms": 273.966,
    "prompt_per_token_ms": 19.569,
    "prompt_per_second": 51.1012315396801,
    "predicted_n": 85,
    "predicted_ms": 3458.452,
    "predicted_per_token_ms": 40.6876705882353,
    "predicted_per_second": 24.577469920068282
  }
}
(root|~/llama.cpp.pascal)

With this PR:

Streaming: reasoning goes into reasoning_content.

Curl without streaming, no initial <think> tag in the output:

(root|~/llama.cpp.pascal) curl http://127.0.0.1:8081/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "MoE-MiniMax-M2-230B-A10B",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "stream": false
  }' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1265  100  1046  100   219    251     52  0:00:04  0:00:04 --:--:--   304
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I'm looking at how to respond to the question: \"What is the capital of France?\" The user expects a straightforward answer, which is \"Paris.\" I’ll keep it simple and concise, but I might consider adding a brief note about the Eiffel Tower. However, since the user didn't ask for extra information, I’ll focus on just saying \"Paris\" to fulfill their request. I want to ensure I’m following their guidelines accurately.\n</think>\n\nParis."
      }
    }
  ],
  "created": 1762152603,
  "model": "MoE-MiniMax-M2-230B-A10B",
  "system_fingerprint": "b6943-0619a5b7d",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 92,
    "prompt_tokens": 29,
    "total_tokens": 121
  },
  "id": "chatcmpl-WqvR2S73aa7cZEyIN7lm42yuuatYZwqO",
  "timings": {
    "cache_n": 15,
    "prompt_n": 14,
    "prompt_ms": 278.533,
    "prompt_per_token_ms": 19.895214285714285,
    "prompt_per_second": 50.263344020277664,
    "predicted_n": 92,
    "predicted_ms": 3852.551,
    "predicted_per_token_ms": 41.87555434782609,
    "predicted_per_second": 23.88028088401685
  }
}
(root|~/llama.cpp.pascal)

@hksdpc255
Contributor Author

Oh! It seems you’re using non-streaming mode. I can now reproduce your issue with stream: false.

Let me dig into what’s happening…

@ServeurpersoCom
Collaborator

Oh! It seems you’re using non-streaming mode. I can now reproduce your issue with stream: false.

Let me dig into what’s happening…

Yes, exactly: it works correctly in streaming mode (tested through the SvelteUI, which is specifically designed to be debug-friendly without needing curl -N), but not in non-streaming mode.
So the initial <think> tag still doesn’t appear when stream: false.

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Nov 3, 2025

Toolcall debug on SvelteUI with your #16932 + #16618 :)

Custom JSON:

{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "simple_addition_tool",
        "description": "A dummy calculator tool used for testing multi-argument tool call streaming.",
        "parameters": {
          "type": "object",
          "properties": {
            "a": {
              "type": "number",
              "description": "The first number to add."
            },
            "b": {
              "type": "number",
              "description": "The second number to add."
            }
          },
          "required": ["a", "b"]
        }
      }
    }
  ]
}

@hksdpc255
Contributor Author

hksdpc255 commented Nov 3, 2025

@ServeurpersoCom The problem is that I added some code that makes it fall back to llama.cpp’s original parser when there are no tools, so the new parser is never called.

llama.cpp/common/chat.cpp

Lines 2748 to 2753 in af5216e

    if (!builder.syntax().parse_tool_calls) {
        // MiniMax-M2 uses <think>...</think> tags for reasoning content
        builder.try_parse_reasoning("<think>", "</think>");
        builder.add_content(builder.consume_rest());
        return;
    }

Simply deleting the code above should fix the issue. I’ll run more tests before pushing a new commit.


@ServeurpersoCom
Collaborator

ServeurpersoCom commented Nov 3, 2025

@ServeurpersoCom The problem is that I added some code that makes it fall back to llama.cpp’s original parser when there are no tools, so the new parser is never called.

I’ve successfully tested it without these lines of code and confirmed it works as expected for streaming / non-streaming / reasoning_content / tool calls.

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Nov 3, 2025

I just realized this, and it seems strange: shouldn’t --reasoning-format none completely bypass any parsing logic instead of still going through it? It’s meant to be the raw passthrough mode for observing the model’s native output.

The .cpp files are already becoming huge and monolithic, making them harder to touch or refactor safely. The --reasoning-format options are also poorly named and not very explicit. In the long run, a modular templating system would help avoid piling up even more C++ parsing code.

If this work is meant to unify several next-generation parsers, maybe we could add a new keyword to --reasoning-format instead? It’s important to keep none as a truly no-parsing mode, since it’s essential for debugging new models.

Also, the current "auto" mode is actually just "deepseek" in practice, so it might be clearer to rename or document it that way to avoid confusion: and your unified detection logic could be implemented directly under auto (or deepseek, since they’re basically aliases) ?

@hksdpc255
Contributor Author

I re-ran the web_search tool call with GLM-4.5-Air and the custom Jinja template from the PR.

I'm getting this error:

 ERROR:    Error during streaming response: Error code: 500 - {'error': {'code': 500, 'message': 'Invalid tool call arguments passed: {"query":"\\"From Zero\\" Linkin Park album tracklist complete songs"limit":3,"type":"text"} at row 76, column 80:\n{% if _args is not mapping %}\n    {{ raise_exception("Invalid tool call arguments passed: " + _args | string) }}\n                                                                               ^\n{% endif %}\n at row 76, column 5:\n{% if _args is not mapping %}\n    {{ raise_exception("Invalid tool call arguments passed: " + _args | string) }}\n    ^\n{% endif %}\n at row 75, column 30:\n{% set _args = tc.arguments or {} %}\n{% if _args is not mapping %}\n                             ^\n    {{ raise_exception("Invalid tool call arguments passed: " + _args | string) }}\n at row 75, column 1:\n{% set _args = tc.arguments or {} %}\n{% if _args is not mapping %}\n^\n    {{ raise_exception("Invalid tool call arguments passed: " + _args | string) }}\n at row 69, column 29:\n{% if m.tool_calls %}\n{% for tc in m.tool_calls %}\n                            ^\n{%- if tc.function %}\n at row 69, column 1:\n{% if m.tool_calls %}\n{% for tc in m.tool_calls %}\n^\n{%- if tc.function %}\n at row 68, column 22:\n{%- endif -%}\n{% if m.tool_calls %}\n                     ^\n{% for tc in m.tool_calls %}\n at row 68, column 1:\n{%- endif -%}\n{% if m.tool_calls %}\n^\n{% for tc in m.tool_calls %}\n at row 48, column 35:\n{{- \'/nothink\' if (enable_thinking is defined and not enable_thinking and not visible_text(m.content).endswith("/nothink")) else \'\' -}}\n{%- elif m.role == \'assistant\' -%}\n                                  ^\n<|assistant|>\n at row 45, column 1:\n{% for m in messages %}\n{%- if m.role == \'user\' -%}<|user|>\n^\n{{ visible_text(m.content) }}\n at row 44, column 24:\n{%- endfor %}\n{% for m in messages %}\n                       ^\n{%- if m.role == \'user\' -%}<|user|>\n at row 44, column 1:\n{%- endfor %}\n{% for m in messages %}\n^\n{%- if m.role == \'user\' -%}<|user|>\n at row 1, column 1:\n[gMASK]<sop>\n^\n{%- if tools -%}\n', 'type': 'server_error'}}

I use --chat-template-file glm-4.6.jinja (the path is correct). Obviously the issue is in {"query":"\\"From Zero\\" Linkin Park album tracklist complete songs"limit":3,"type":"text"} with the missing comma. But when I used the earliest versions of this PR, I never had this issue with the same model weights/params.

What am I doing wrong?

P.S.

I used the older template from the previous closed PR and it doesn't trigger this issue, at least for now. I'm going to test it more.

@MikeLP @CISC Wait, this problem seems to be caused by the parser. I may have introduced a regression while optimizing for Kimi-K2. I’ll take a look and prepare a quick fix.

In a rare case, the model may emit a raw string that begins with a valid JSON string. This commit adds unit tests to cover that scenario and fixes the regression introduced during the Kimi-K2 adaptation.
@hksdpc255
Contributor Author

@MikeLP @CISC Fixed. Unit test for this case also added.

@CISC
Collaborator

CISC commented Nov 18, 2025

Great catch! Congested CIs were useful for once... :)

ikawrakow pushed a commit to ikawrakow/ik_llama.cpp that referenced this pull request Nov 18, 2025
#958)

* port upstream ggml-org/llama.cpp#16932

* Add fixed chat templates.

* fix grammar when tool have no argument

* Insert additional stops for Kimi-K2

* Fix `no triggers set for lazy grammar!` for GLM4.5/4.6

* update chat.cpp

* fix grammar for GLM 4.5/4.6

* chat: Fix streaming parser for granite models (#15682)

* fix(chat): fix streaming parser for granite models

* tests: add test cases for Granite models chat parser

* common : Fix corrupted memory error on json grammar initialization (#16038)

Initializing RESERVED_NAME in is_reserved_name() is not thread-safe and leads to corrupted memory when used from multiple threads,
as can be seen in the ASan trace below. This fixes the initialization
to make it thread-safe.

    #0 0x000100abd018 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) __hash_table:1565
    #1 0x000100ab0320 in SchemaConverter::visit(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) json-schema-to-grammar.cpp:802
    #2 0x000100aafc48 in std::__1::__function::__func<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2, std::__1::allocator<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
    #3 0x000100a2c938 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&), std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>, void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
    #4 0x000100a139f8 in foreach_function(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::function<void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)> const&) chat.cpp:762
    #5 0x000100a2a7f4 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0, std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0>, void (common_grammar_builder const&)>::operator()(common_grammar_builder const&) function.h:319
    #6 0x000100aa98f4 in build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&) json-schema-to-grammar.cpp:982
    #7 0x0001009c9314 in common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool) chat.cpp:1110
    #8 0x0001009b8afc in common_chat_templates_apply_jinja(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:1992
    #9 0x0001009b533c in common_chat_templates_apply(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:2074
    #10 0x000100810120 in llamacpp_apply_chat_template+0x724 (predict_oai-98384e17fb94e863:arm64+0x100090120)
    ...

==45482==Register values:
 x[0] = 0x00006020004147f8   x[1] = 0x00006080000013c8   x[2] = 0x0000000000000000   x[3] = 0x0000604006289738
 x[4] = 0x0000000000000002   x[5] = 0x0000000000000001   x[6] = 0x04034000004b4000   x[7] = 0x0000000000000001
 x[8] = 0xbebebebebebebebe   x[9] = 0x17d7d7d7d7d7d7d7  x[10] = 0x00000c04000828ff  x[11] = 0x0000000000000001
x[12] = 0x000000002018d383  x[13] = 0x0000000000000000  x[14] = 0xfa0000000000fafa  x[15] = 0x000010700001ffff
x[16] = 0x000000019dc012c0  x[17] = 0x00000001021284f8  x[18] = 0x0000000000000000  x[19] = 0x00000001700acdc0
x[20] = 0x0000000000000002  x[21] = 0x000000002018d384  x[22] = 0x16dd16fd2e731151  x[23] = 0x0000007000020000
x[24] = 0x0000000100c69c08  x[25] = 0x0000000100c69c20  x[26] = 0x00006080000013c7  x[27] = 0x0000000100c69c00
x[28] = 0x00000001700acd60     fp = 0x00000001700aceb0     lr = 0x0000000100abce30     sp = 0x00000001700acd60
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV __hash_table:1565 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&)
Thread T5 created by T0 here:
    #0 0x0001020b99d4 in pthread_create+0x5c (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x359d4)
    #1 0x000100873910 in std::sys::pal::unix::thread::Thread::new::h77254fdd87a28e05+0x118 (predict_oai-98384e17fb94e863:arm64+0x1000f3910)
    #2 0x0001007c7a1c in test::run_test::haeb3c2bcd5ed6cf6+0x76c (predict_oai-98384e17fb94e863:arm64+0x100047a1c)
    #3 0x0001007aedb0 in test::console::run_tests_console::he9d142d704f3a986+0x149c (predict_oai-98384e17fb94e863:arm64+0x10002edb0)
    #4 0x0001007c5758 in test::test_main::hf86a5e20735245b9+0x118 (predict_oai-98384e17fb94e863:arm64+0x100045758)
    #5 0x0001007c5da0 in test::test_main_static::h61ee9c8fd30abca0+0x54 (predict_oai-98384e17fb94e863:arm64+0x100045da0)
    ...

==45482==ABORTING

* common : fix reasoning before forced tool call via tool_choice = required (#16264)

* common : fix reasoning before forced tool call via tool_choice = required

* common : improve reasoning and commentary handling when tool_choice is required

(cherry picked from commit c746984956d6882c2de73d53ae2bb3bdf889e475)

---------

Co-authored-by: Alde Rojas <[email protected]>

* Try fix Jinja template for GLM

* Improve Kimi-K2 chat template

* Fix "Invalid tool call arguments passed" in a rare case.

In a rare case, the model may emit a raw string that begins with a valid JSON string. This commit adds unit tests to cover that scenario and fixes the regression introduced during the Kimi-K2 adaptation.

---------

Co-authored-by: shun095 <[email protected]>
Co-authored-by: David Ribeiro Alves <[email protected]>
Co-authored-by: crat0z <[email protected]>
Co-authored-by: Alde Rojas <[email protected]>
@CISC CISC merged commit 1920345 into ggml-org:master Nov 18, 2025
71 of 72 checks passed
@aaronnewsome

Congrats to everyone on FINALLY getting this into main. I really appreciate all your work on this, @hksdpc255. I've been on a build of this PR for 6 days straight of HEAVY usage and not a single crash with MiniMax-M2. Bravo!

@pwilkin
Collaborator

pwilkin commented Nov 18, 2025

Did a little advertisement for this PR on Reddit, kudos to @hksdpc255 on all his hard work and all the adjustments he made to make this PR work.

@hksdpc255 hksdpc255 deleted the xml_toolcall branch November 19, 2025 03:05
ronaldmannak pushed a commit to PicoMLX/llama.cpp that referenced this pull request Nov 19, 2025
…rt (GLM 4.5/4.6 + MiniMax M2 + SeedOSS + Kimi-K2 + Qwen3-Coder + Apriel-1.5 + Xiaomi-MiMo) (ggml-org#16932)

* Add files via upload

* fix unit test

* fix crashes for --reasoning-format=none

* Patch buggy official MiniMax-M2 chat template

* add upstream minja fix: ochafik/minja#7

* Fix <think> token not generated

* add test copied from ggml-org#16946

* cleanup

* Hopes to fix the compilation error on CI

* Delete chat template patching since it’s fixed by upstream Minja

* Remove undeeded Minimax-M2 template patch

ochafik/minja#7 (comment)

* Add proper handling of optional parameters with test
merged tests from: ggml-org@23d4bb7

* Fix making all tool parameters optional

* Move xml tool parser to separate file

* cleanup & add tests for GLM4.5

* add streaming tests & enhancement & cleanups

Add streaming test for both GLM 4.5 and minimax-m2.
Cleanup for preserved_tokens.
Cleanup for grammar rule name.
Enhance the parser's stability.

* cleanup & add support for Kimi-K2 Qwen3-Coder Apriel-1.5 Xiaomi-MiMo

* apply suggestions from reviewers

* fix a misuse for data.grammar_lazy

* fix grammar when tool have no argument

* Fix `no triggers set for lazy grammar!` for GLM4.5/4.6. Insert additional stops for Kimi-K2

* update chat.cpp

* fix grammar for GLM 4.5/4.6

* Try fix Jinja template for GLM

* Try fix GLM-4.6.jinja

* Update common/chat-parser-xml-toolcall.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Update tests/test-chat.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* improve chat template for GLM, rename Kimi-K2 template to Kimi-K2-Thinking

* Improve Kimi-K2 chat template

* Fix unit test

* Fix "Invalid tool call arguments passed" in a rare case.

In a rare case, the model may emit a raw string that begins with a valid JSON string. This commit adds unit tests to cover that scenario and fixes the regression introduced during the Kimi-K2 adaptation.

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
@Mushoz

Mushoz commented Nov 21, 2025

I have tried these changes with MiniMax-M2 and GLM 4.5 Air, with Codex as the agentic coding client, but both models sometimes fail to emit any tool calls after they intend to. For example, this is a final response from GLM 4.5 Air:

Let me check the transformation handler that should handle the bridging:

And here is an example of a final response of minimax-m2:

Great! I can see there's a proxy directory and a responses directory in the litellm folder. Let me examine these to understand how responses API is handled.\n</think>\n\n

In both cases the model generates an EOS token instead of the tool call it is intending to make, so it looks like a model issue. But I am very confused about why this is happening with both models, as they are both supposedly strong agentic coders.

I have tried playing with the sampler settings, but even at a min-p of 0.2 it was still happening. Pushing it even higher introduced looping, which I guess I could combat with repetition penalties, but at that point it feels like I am just hacking away instead of properly solving it.

I am sure I am not the only one in this thread trying to actually use this. How are the experiences of other people? Has anyone managed to solve this issue? Or is anyone using a different client with better experiences? Thanks!

@marceldev89

Tested this with Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL and it seems to be working well. Happy to see something merged finally. :)

@Mushoz

Mushoz commented Nov 21, 2025

I have tried these changes with MiniMax-M2 and GLM 4.5 Air, with Codex as the agentic coding client, but both models sometimes fail to emit any tool calls after they intend to. For example, this is a final response from GLM 4.5 Air:

Let me check the transformation handler that should handle the bridging:

And here is an example of a final response of minimax-m2:

Great! I can see there's a proxy directory and a responses directory in the litellm folder. Let me examine these to understand how responses API is handled.\n</think>\n\n

In both cases the model generates an EOS token instead of the tool call it is intending to make, so it looks like a model issue. But I am very confused about why this is happening with both models, as they are both supposedly strong agentic coders.

I have tried playing with the sampler settings, but even at a min-p of 0.2 it was still happening. Pushing it even higher introduced looping, which I guess I could combat with repetition penalties, but at that point it feels like I am just hacking away instead of properly solving it.

I am sure I am not the only one in this thread trying to actually use this. How are the experiences of other people? Has anyone managed to solve this issue? Or is anyone using a different client with better experiences? Thanks!

To add to this, Qwen3-Coder-30B-A3B-Instruct-UD_Q8_K_XL seems to be affected by the exact same issue, so it's not limited to thinking models. When I triggered it with this model just now under Codex, I saw 5 successful tool calls and then it just stopped. llama.cpp showed the following for the last request it received:

srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: Qwen3 Coder
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.878 (> 0.100 thold), f_keep = 0.995
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
slot launch_slot_: id  3 | task 368 | processing task
slot update_slots: id  3 | task 368 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 12104
slot update_slots: id  3 | task 368 | n_tokens = 10628, memory_seq_rm [10628, end)
slot update_slots: id  3 | task 368 | prompt processing progress, n_tokens = 12104, batch.n_tokens = 1476, progress = 1.000000
slot update_slots: id  3 | task 368 | prompt done, n_tokens = 12104, batch.n_tokens = 1476
slot print_timing: id  3 | task 368 | 
prompt eval time =    6144.98 ms /  1476 tokens (    4.16 ms per token,   240.20 tokens per second)
       eval time =       0.00 ms /     1 tokens (    0.00 ms per token, 1000000.00 tokens per second)
      total time =    6144.98 ms /  1477 tokens
slot      release: id  3 | task 368 | stop processing: n_tokens = 12104, truncated = 0
srv  update_slots: all slots are idle

As you can see, only a single token was generated (the EOS token). I am really not sure what's causing this, but I am hesitant to call it a model issue. It's happening across 3 different models now, including Qwen3-Coder-30B, which is not a thinking model and at Q8 should be close to full-precision quality.

@Mushoz

Mushoz commented Nov 26, 2025

It turns out I am not the only one experiencing these issues where the model just stops during agentic workflows. Here is another user experiencing it: #17155 (comment)

Given that we now have 4 models (Qwen3-Coder-30B, Kimi K2 Thinking, GLM 4.5 Air, and MiniMax-M2) and 3 agentic coding tools (Codex, opencode, and OpenHands) all experiencing the same issue, across two setups (mine and @createthis's), it's very likely there is a bug in the current implementation. Is there anything I can provide to help debug this issue?

@aaronnewsome

aaronnewsome commented Nov 26, 2025

I've been using this PR heavily, daily. Once it was merged, I went back to main. I'm currently running b7166, and before that I was using the version where it was first merged. I'm primarily using Opencode but also Cline. I don't see these issues. What are you serving the model with, @Mushoz, and what chat template are you using? I'm serving with llama.cpp and using the chat template below.

llama.cpp:

llama-server \
  --model /mnt/data/models/MiniMax-M2-MXFP4_MOE-GGUF/MiniMax-M2-MXFP4_MOE-00001-of-00007.gguf \
  --alias minimax-m2 \
  --log-verbosity 0 \
  --threads -1 \
  --ctx-size 204800 \
  --n-gpu-layers 69 \
  --temp 1.0 \
  --min-p 0.0 \
  --top-p 0.95 \
  --top-k 40 \
  --repeat-penalty 1.05 \
  --context-shift \
  --host 0.0.0.0 \
  --reasoning-format auto \
  --flash-attn on \
  --jinja --chat-template-file /mnt/data/models/MiniMax-M2-MXFP4_MOE-GGUF/chat_template.jinja

chat_template.jinja:

{# ----------‑‑‑ special token variables ‑‑‑---------- #}
{%- set toolcall_begin_token   = '<minimax:tool_call>'         -%}
{%- set toolcall_end_token     = '</minimax:tool_call>'        -%}
{#- Tool Rendering Functions ============================================== -#}
{%- macro render_tool_namespace(namespace_name, tool_list) -%}
{%- for tool in tool_list -%}
<tool>{{ tool.function | tojson(ensure_ascii=False) }}</tool>
{% endfor -%}
{%- endmacro -%}
{%- macro visible_text(content) -%}
    {%- if content is string -%}
        {{ content }}
    {%- elif content is iterable and content is not mapping -%}
        {%- for item in content -%}
            {%- if item is mapping and item.type == 'text' -%}
                {{- item.text }}
            {%- elif item is string -%}
                {{- item }}
            {%- endif -%}
        {%- endfor -%}
    {%- else -%}
        {{- content }}
    {%- endif -%}
{%- endmacro -%}
{#- System Message Construction ============================================ -#}
{%- macro build_system_message(system_message) -%}
    {%- if system_message and system_message.content -%}
        {{- visible_text(system_message.content) }}
    {%- else -%}
        {%- if model_identity is not defined -%}
            {%- set model_identity = "You are a helpful assistant." -%}
        {%- endif -%}
        {{- model_identity }}
    {%- endif -%}
    
    {#- Handle current_date -#}
    {%- if system_message and system_message.current_date -%}
        {{- '\n' ~ 'Current date: ' + system_message.current_date }}
    {%- endif -%}
    {#- Handle current_location -#}
    {%- if system_message and system_message.current_location -%}
        {{- '\n' ~ 'Current location: ' + system_message.current_location }}
    {%- endif -%}
{%- endmacro -%}
{#- Main Template Logic ================================================= -#}
{#- Extract system message (only first message if it's system) -#}
{%- set system_message = none -%}
{%- set conversation_messages = messages -%}
{%- if messages and messages[0].role == "system" -%}
    {%- set system_message = messages[0] -%}
    {%- set conversation_messages = messages[1:] -%}
{%- endif -%}
{#- Get the last user message turn, for interleved thinking -#}
{%- set ns = namespace(last_user_index=-1) %}
{% for m in conversation_messages %}
    {%- if m.role == 'user' %}
        {% set ns.last_user_index = loop.index0 -%}
    {%- endif %}
{%- endfor %}
{#- Render system message -#}
{{- ']~!b[' ~ ']~b]system' ~ '\n' }}
{{- build_system_message(system_message) }}
{#- Render tools if available -#}
{%- if tools -%}
    {{- '\n\n' ~ '# Tools' ~ '\n' ~ 'You may call one or more tools to assist with the user query.\nHere are the tools available in JSONSchema format:' ~ '\n' }}
    {{- '\n' ~ '<tools>' ~ '\n' }}
    {{- render_tool_namespace("functions", tools) }}
    {{- '</tools>' ~ '\n\n' }}
{{- 'When making tool calls, use XML format to invoke tools and pass parameters:' ~ '\n' }}
{{- '\n' ~ toolcall_begin_token }}
<invoke name="tool-name-1">
<parameter name="param-key-1">param-value-1</parameter>
<parameter name="param-key-2">param-value-2</parameter>
...
</invoke>
{{- '\n' ~ toolcall_end_token }}
{%- endif -%}
{{- '[e~[\n' }}

{#- Render messages -#}
{%- set last_tool_call = namespace(name=none) -%}
{%- for message in conversation_messages -%}
    {%- if message.role == 'assistant' -%}
        {#- Only render reasoning_content if no user message follows -#}
        {{- ']~b]ai' ~ '\n' }}

        {%- set reasoning_content = '' %}
        {%- set content = visible_text(message.content) %}
        {%- if message.reasoning_content is string %}
            {%- set reasoning_content = message.reasoning_content %}
        {%- else %}
            {%- if '</think>' in content %}
                {%- set reasoning_content = content.split('</think>')[0].strip('\n').split('<think>')[-1].strip('\n') %}
                {%- set content = content.split('</think>')[-1].strip('\n') %}
            {%- endif %}
        {%- endif %}
        {%- if reasoning_content and loop.index0 > ns.last_user_index -%}
            {{- '<think>' ~ '\n' ~ reasoning_content ~ '\n' ~ '</think>' ~ '\n\n' }}
        {%- endif -%}
        {%- if content -%}
            {{- content }}
        {%- endif -%}
        {%- if message.tool_calls -%}
            {{- '\n' ~ toolcall_begin_token ~ '\n' }}

            {%- for tool_call in message.tool_calls -%}
                {%- if tool_call.function %}
                    {%- set tool_call = tool_call.function %}
                {%- endif %}
                {{- '<invoke name="' + tool_call.name + '">' }}
                {% set _args = tool_call.arguments %}
                {%- for k, v in _args.items() %}
                {{- '<parameter name="' + k + '">' }}
                {{- v | tojson(ensure_ascii=False) if v is not string else v }}
                {{- '</parameter>' }}
                {% endfor %}
                {{- '</invoke>' ~ '\n' }}
            {%- endfor -%}
            
            {{- toolcall_end_token}}
            {%- set last_tool_call.name = message.tool_calls[-1].name -%}
        {%- else -%}
            {%- set last_tool_call.name = none -%}
        {%- endif -%}
        {{- '[e~[' ~ '\n' }}
        
    {%- elif message.role == 'tool' -%}
    {%- if last_tool_call.name is none -%}
        {{- raise_exception("Message has tool role, but there was no previous assistant message with a tool call!") }}
    {%- endif -%}
    {%- if loop.first or (conversation_messages[loop.index0 - 1].role != 'tool') -%}
        {{- ']~b]tool' }}
    {%- endif -%}
    {%- if message.content is string -%}
        {{- '\n<response>' }}
        {{- message.content }}
        {{- '</response>' }}
    {%- else -%}
        {%- for tr in message.content -%}
            {{- '\n<response>' }}
            {{- tr.output if tr.output is defined else (tr.text if tr.type == 'text' and tr.text is defined else tr) }}
            {{- '\n</response>' }}
        {%- endfor -%}
    {%- endif -%}
    {%- if loop.last or (conversation_messages[loop.index0 + 1].role != 'tool') -%}
        {{- '[e~[\n' -}}
    {%- endif -%}
        
    {%- elif message.role == 'user' -%}
        {{- ']~b]user' ~ '\n' }}
        {{- visible_text(message.content) }}
        {{- '[e~[' ~ '\n' }}
    {%- endif -%}
{%- endfor -%}

{#- Generation prompt -#}
{%- if add_generation_prompt -%}
{{- ']~b]ai' ~ '\n' ~ '<think>' ~ '\n' }}
{%- endif -%}

in opencode.json:

"ai": {
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "http://ai.swift.local:8080/v1",
        "includeUsage": true,
        "timeout": 10000000
      },
      "models": {
        "minimax-m2": {
          "name": "minimax-m2",
          "tool_call": true,
          "reasoning": true,
          "cost": {
            "input": 0.10,
            "output": 1.20
          },
          "options": {
            "num_ctx": 204800,
            "temperature": 1.0,
            "top_p": 0.95,
            "top_k": 40
          }
        }

@Mushoz
Copy link

Mushoz commented Nov 27, 2025

@aaronnewsome Thank you so much for your detailed post! I have adopted your settings almost 1:1, and so far it actually managed a long-running agentic task just fine without requiring me to type "continue" or "proceed". I am going to rerun this a few more times to verify it keeps behaving. If so, I will slowly revert back to my old settings and investigate what's causing it to break. At least I now seem to have a very helpful set of settings that allows me to debug this further and see whether this is a bug or simply a configuration error on my part. I will report back once I know more.

@Mushoz
Copy link

Mushoz commented Nov 28, 2025

So in my quest to debug this issue (which I haven't completed yet), I do seem to have found a bug. Opencode correctly sends tool responses as messages with the tool role (verified via tcpdump). However, llama.cpp apparently probes whether the template supports rendering tools in vendor/minja/chat-template.hpp by trying to render a dummy payload. If that test fails, messages with the tool role are actually converted to the user role.

The template that you recommended, @aaronnewsome, actually does the following:

{%- if reasoning_content and loop.index0 > ns.last_user_index -%}
      {{- '<think>' ~ '\n' ~ reasoning_content ~ '\n' ~ '</think>' ~ '\n\n' }}
{%- endif -%}

However, because the tool response messages are being rewritten with the user role, and tool responses always come AFTER any assistant messages with reasoning (since the reasoning leads to the model performing a tool call), the check loop.index0 > ns.last_user_index always returns false: last_user_index ends up far too high, since tool responses are now also being picked up as user messages. As a result, the CoT is never inserted into assistant messages that come after the last real user message, effectively breaking interleaved thinking.
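
A tiny worked illustration of that index shift (hypothetical role sequences, not the real probing code):

#include <cstdio>
#include <string>
#include <vector>

// Mirrors the template's ns.last_user_index computation: the index of the
// last message whose role is "user".
static int last_user_index(const std::vector<std::string> & roles) {
    int idx = -1;
    for (size_t i = 0; i < roles.size(); ++i) {
        if (roles[i] == "user") idx = (int) i;
    }
    return idx;
}

int main() {
    // What the client sends vs. what gets rendered once tool messages are
    // rewritten to the user role by the fallback described above.
    std::vector<std::string> as_sent     = {"user", "assistant", "tool", "assistant", "tool"};
    std::vector<std::string> as_rendered = {"user", "assistant", "user", "assistant", "user"};
    std::printf("as sent:     last_user_index = %d\n", last_user_index(as_sent));     // 0 -> both assistant turns keep their <think> block
    std::printf("as rendered: last_user_index = %d\n", last_user_index(as_rendered)); // 4 -> no assistant index exceeds it, so all reasoning is dropped
    return 0;
}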

According to the creators of MiniMax-M2, interleaved thinking is crucial for good agentic performance; e.g., see: https://www.minimax.io/news/why-is-interleaved-thinking-important-for-m2

While I was able to find this bug, fixing it is probably beyond my abilities. I am unsure where the issue lies. Is the check incorrect? Is the check fine, but the template at fault? Something else entirely? Since the people here managed to get tool calling supported for MiniMax-M2 in the first place, I was hoping someone could chime in.

If it's easier to track this through a new issue please let me know, and I will create one. Thanks!

@pwilkin
Collaborator

pwilkin commented Nov 28, 2025

@Mushoz can you please paste the full payload of the message that is triggering a failed Minja check for tool calling? You can redact any sensitive data if needed; I just want to test it against the mechanism.

@Mushoz

Mushoz commented Nov 28, 2025

Let me see if I can dump a raw request with mitmproxy; I will get back to you.

@Mushoz

Mushoz commented Nov 28, 2025

curl_request.sh

I am using --reasoning-format none so as to leave the <think>COT</think> as plain text in the response. This way, I do not need client support, since the MiniMax-M2 Jinja template splits on the </think> block to extract it.

This is a curl command with the raw request. I am using LLAMA_SERVER_SLOTS_DEBUG=1 so that I can inspect the actual prompt being processed.

When I inspect the prompt via llama.cpp's /slots endpoint, searching for <think> gives me only a single result, namely the single <think> tag the template automatically adds as part of the assistant's reply. That means the earlier CoT was NOT inserted into the previous assistant messages. Searching for <think> in the raw curl request shows multiple hits.

@Mushoz

Mushoz commented Nov 28, 2025

Also, changing:

{%- if reasoning_content and loop.index0 > ns.last_user_index -%}
      {{- '<think>' ~ '\n' ~ reasoning_content ~ '\n' ~ '</think>' ~ '\n\n' }}
{%- endif -%}

to

{%- if reasoning_content -%}
      {{- '<think>' ~ '\n' ~ reasoning_content ~ '\n' ~ '</think>' ~ '\n\n' }}
{%- endif -%}

allows the CoT to be successfully inserted into earlier assistant messages. However, this obviously reinserts it into ALL assistant messages, including the ones that happened before the last user message, which is not correct.

@Mushoz

Mushoz commented Dec 1, 2025

@pwilkin did you manage to reproduce the bug with my example? If not, I'd be happy to share additional examples if that would be helpful.

@imweijh

imweijh commented Dec 7, 2025

So what commands would allow the agent to run correctly? Like this?

git pull
build with latest version
build/bin/llama-server -m ~/.cache/llama.cpp/bartowski_MiniMax-M2-IQ4_XS-00001-of-00004.gguf \
    --jinja --chat-template-file ./models/templates/MiniMax-M2.jinja -c 0

Or should I use the official MiniMax-M2 chat template?

@emuchogu

This works brilliantly. It's able to make multiple tool calls with no issues at all from Dify, and I’m actually using it as a replacement for Claude 4.5 on my AMD MI100 setup.

These are my settings:

/app/llama-server \
  --model /root/.cache/huggingface/hub/models--unsloth--MiniMax-M2-GGUF/snapshots/9665e9dac18382b115bd7eadced6a845253d0b1c/UD-Q6_K_XL/MiniMax-M2-UD-Q6_K_XL-00001-of-00004.gguf \
  --alias minimax-m2 \
  --threads -1 \
  --ctx-size 204800 \
  --n-gpu-layers 69 \
  --temp 1.0 \
  --min-p 0.0 \
  --top-p 0.95 \
  --top-k 40 \
  --repeat-penalty 1.05 \
  --flash-attn on \
  --jinja \
  --reasoning-format auto \
  --host 0.0.0.0
  
  
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimax-m2",
    "messages": [
      {"role": "user", "content": "Why is the sky blue?"}
    ]
  }' | jq .
