Misc. bug: Getting 500 - Failed to parse input at pos x when tool calling #20650
Description
Name and Version
./llama-cli --version
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24112 MiB):
Device 0: NVIDIA RTX A5000, compute capability 8.6, VMM: yes, VRAM: 24112 MiB
version: 8382 (1bbec6a)
built with GNU 13.3.0 for Linux x86_64
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
[Unit]
Description=llama.cpp Qwen3-30B Server
After=network.target
[Service]
User=root
Environment=GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
Environment=GGML_CUDA_GRAPH_OPT=1
WorkingDirectory=/var/opt/lib/fold/llama.cpp.cuda
ExecStart=/var/opt/lib/fold/llama.cpp.cuda/build/bin/llama-server \
--threads 22 \
--threads-batch 8 \
--jinja \
--cont-batching \
--flash-attn on \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--model /root/models/glm/GLM-4.7-Flash-Q6_K.gguf.1 \
--ctx-size 70000 \
--parallel 1 \
--n-cpu-moe 5 \
--batch-size 8192 \
--ubatch-size 4096 \
--port 8050 \
--metrics \
--no-mmap \
--mlock \
-mg 0 \
-ts 20,22 \
-ot ".ffn_(up)_exps.=CPU" \
--cache-ram 0 \
--temp 0.6 \
--top-p 0.95 \
--min-p 0.01 \
--repeat-penalty 1.0 \
--spec-type ngram-mod \
--spec-ngram-size-n 24 \
--draft-min 48 \
--draft-max 64 \
--host 0.0.0.0
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
Problem description & steps to reproduce
Using the latest commit, when I try to run my ADK agent I get 'operator(): got exception: {"error":{"code":500,"message":"Failed to parse input at pos 1225: <tool_call>read_file_diff_md</tool_call>","type":"server_error"}}'. The same thing happens in LibreChat, where I get 'Something went wrong. Here's the specific error message we encountered: An error occurred while processing the request: Failed to parse input at pos 111: <tool_call>listRepos_action_MTAuMTAzLj</tool_call>'.
Note that it happens on GLM-4.7-Flash and also on Qwen3.5-35B.
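From the error messages, the generated tool call seems to carry only a bare tool name between the tags, with no arguments payload. A rough sketch of why such output would fail to parse (this is an illustration, not the actual peg-native grammar in llama-server; the assumption that the parser expects a structured JSON payload inside the tags is mine):

```python
import json
import re

# Hypothetical illustration: a parser that expects JSON inside <tool_call>
# tags will reject output that contains only a bare tool name, e.g.
# "<tool_call>read_file_diff_md</tool_call>".
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def try_parse_tool_call(text: str):
    """Return the parsed payload, or None if the inner content is not valid JSON."""
    m = TOOL_CALL_RE.search(text)
    if m is None:
        return None
    try:
        return json.loads(m.group(1))
    except json.JSONDecodeError:
        return None

well_formed = '<tool_call>{"name": "read_file_diff_md", "arguments": {}}</tool_call>'
bare_name = '<tool_call>read_file_diff_md</tool_call>'

print(try_parse_tool_call(well_formed))  # parses to a dict
print(try_parse_tool_call(bare_name))    # None: bare name is not valid JSON
```

If that reading is right, the question is whether the model is emitting a malformed tool call or whether the new parser is stricter than the previous chat-format handling.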
First Bad Commit
Relevant log output
Logs
Mar 16 21:07:04 server llama-server[2827312]: srv update_slots: all slots are idle
Mar 16 21:07:04 server llama-server[2827312]: srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
Mar 16 21:07:04 server llama-server[2827312]: srv params_from_: Chat format: peg-native
Mar 16 21:07:04 server llama-server[2827312]: slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 285099151736
Mar 16 21:07:04 server llama-server[2827312]: slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
Mar 16 21:07:04 server llama-server[2827312]: slot launch_slot_: id 0 | task 2143 | processing task, is_child = 0
Mar 16 21:07:04 server llama-server[2827312]: slot update_slots: id 0 | task 2143 | new prompt, n_ctx_slot = 70144, n_keep = 0, task.n_tokens = 4175
Mar 16 21:07:04 server llama-server[2827312]: slot update_slots: id 0 | task 2143 | n_tokens = 52, memory_seq_rm [52, end)
Mar 16 21:07:04 server llama-server[2827312]: slot init_sampler: id 0 | task 2143 | init sampler, took 1.64 ms, tokens: text = 4175, total = 4175
Mar 16 21:07:04 server llama-server[2827312]: slot update_slots: id 0 | task 2143 | prompt processing done, n_tokens = 4175, batch.n_tokens = 4123
Mar 16 21:07:06 server llama-server[2827312]: begin: ngram_mod occupancy = 9638/4194304 (0.00)
Mar 16 21:07:14 server llama-server[2827312]: slot print_timing: id 0 | task 2143 |
Mar 16 21:07:14 server llama-server[2827312]: prompt eval time = 1828.37 ms / 4123 tokens ( 0.44 ms per token, 2255.02 tokens per second)
Mar 16 21:07:14 server llama-server[2827312]: eval time = 7843.73 ms / 313 tokens ( 25.06 ms per token, 39.90 tokens per second)
Mar 16 21:07:14 server llama-server[2827312]: total time = 9672.10 ms / 4436 tokens
Mar 16 21:07:14 server llama-server[2827312]: draft acceptance rate = 0.15625 ( 10 accepted / 64 generated)
Mar 16 21:07:14 server llama-server[2827312]: statistics ngram_mod: #calls(b,g,a) = 11 2424 2, #gen drafts = 2, #acc drafts = 2, #gen tokens = 128, #acc tokens = 12, dur(b,g,a) = 1.460, 2.450, 0.001 ms
Mar 16 21:07:14 server llama-server[2827312]: slot release: id 0 | task 2143 | stop processing: n_tokens = 4487, truncated = 0
Mar 16 21:07:14 server llama-server[2827312]: srv update_slots: all slots are idle
Mar 16 21:07:14 server llama-server[2827312]: srv stop: cancel task, id_task = 2143
Mar 16 21:07:14 server llama-server[2827312]: srv update_slots: all slots are idle
Mar 16 21:07:14 server llama-server[2827312]: srv operator(): got exception: {"error":{"code":500,"message":"Failed to parse input at pos 1225: <tool_call>read_file_diff_md</tool_call>","type":"server_error"}}
Mar 16 21:07:14 server llama-server[2827312]: srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500
Mar 16 21:07:14 server llama-server[2827312]: srv params_from_: Chat format: peg-native
Mar 16 21:07:14 server llama-server[2827312]: slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 0.930
Mar 16 21:07:14 server llama-server[2827312]: slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
Mar 16 21:07:14 server llama-server[2827312]: slot launch_slot_: id 0 | task 2448 | processing task, is_child = 0
Mar 16 21:07:14 server llama-server[2827312]: slot update_slots: id 0 | task 2448 | new prompt, n_ctx_slot = 70144, n_keep = 0, task.n_tokens = 4175
Mar 16 21:07:14 server llama-server[2827312]: slot update_slots: id 0 | task 2448 | need to evaluate at least 1 token for each active slot (n_past = 4175, task.n_tokens() = 4175)
Mar 16 21:07:14 server llama-server[2827312]: slot update_slots: id 0 | task 2448 | n_past was set to 4174
Mar 16 21:07:14 server llama-server[2827312]: slot update_slots: id 0 | task 2448 | n_tokens = 4174, memory_seq_rm [4174, end)
Mar 16 21:07:14 server llama-server[2827312]: slot init_sampler: id 0 | task 2448 | init sampler, took 0.79 ms, tokens: text = 4175, total = 4175
Mar 16 21:07:14 server llama-server[2827312]: slot update_slots: id 0 | task 2448 | prompt processing done, n_tokens = 4175, batch.n_tokens = 1
Mar 16 21:07:14 server llama-server[2827312]: begin: ngram_mod occupancy = 9918/4194304 (0.00)
Mar 16 21:07:18 server llama-server[2827312]: slot print_timing: id 0 | task 2448 |
Mar 16 21:07:18 server llama-server[2827312]: prompt eval time = 28.98 ms / 1 tokens ( 28.98 ms per token, 34.51 tokens per second)
Mar 16 21:07:18 server llama-server[2827312]: eval time = 3690.38 ms / 141 tokens ( 26.17 ms per token, 38.21 tokens per second)
Mar 16 21:07:18 server llama-server[2827312]: total time = 3719.35 ms / 142 tokens
Mar 16 21:07:18 server llama-server[2827312]: draft acceptance rate = 0.00000 ( 0 accepted / 64 generated)
Mar 16 21:07:18 server llama-server[2827312]: statistics ngram_mod: #calls(b,g,a) = 12 2564 2, #gen drafts = 3, #acc drafts = 2, #gen tokens = 192, #acc tokens = 12, dur(b,g,a) = 1.777, 2.586, 0.001 ms
Mar 16 21:07:18 server llama-server[2827312]: slot release: id 0 | task 2448 | stop processing: n_tokens = 4315, truncated = 0
Mar 16 21:07:18 server llama-server[2827312]: srv update_slots: all slots are idle
Mar 16 21:07:18 server llama-server[2827312]: srv stop: cancel task, id_task = 2448
Mar 16 21:07:18 server llama-server[2827312]: srv operator(): got exception: {"error":{"code":500,"message":"Failed to parse input at pos 604: <tool_call>read_file_diff_md</tool_call>","type":"server_error"}}
Mar 16 21:07:18 server llama-server[2827312]: srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500
Mar 16 21:07:18 server llama-server[2827312]: srv update_slots: all slots are idle
Mar 16 21:07:19 server llama-server[2827312]: srv params_from_: Chat format: peg-native
Mar 16 21:07:19 server llama-server[2827312]: slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 0.968
Mar 16 21:07:19 server llama-server[2827312]: slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
Mar 16 21:07:19 server llama-server[2827312]: slot launch_slot_: id 0 | task 2591 | processing task, is_child = 0
Mar 16 21:07:19 server llama-server[2827312]: slot update_slots: id 0 | task 2591 | new prompt, n_ctx_slot = 70144, n_keep = 0, task.n_tokens = 4175
Mar 16 21:07:19 server llama-server[2827312]: slot update_slots: id 0 | task 2591 | need to evaluate at least 1 token for each active slot (n_past = 4175, task.n_tokens() = 4175)
Mar 16 21:07:19 server llama-server[2827312]: slot update_slots: id 0 | task 2591 | n_past was set to 4174
Mar 16 21:07:19 server llama-server[2827312]: slot update_slots: id 0 | task 2591 | n_tokens = 4174, memory_seq_rm [4174, end)
Mar 16 21:07:19 server llama-server[2827312]: slot init_sampler: id 0 | task 2591 | init sampler, took 1.69 ms, tokens: text = 4175, total = 4175
Mar 16 21:07:19 server llama-server[2827312]: slot update_slots: id 0 | task 2591 | prompt processing done, n_tokens = 4175, batch.n_tokens = 1
Mar 16 21:07:19 server llama-server[2827312]: begin: ngram_mod occupancy = 10051/4194304 (0.00)
Mar 16 21:07:19 server llama-server[2827312]: accept: low acceptance streak (3) – resetting ngram_mod
Mar 16 21:07:22 server llama-server[2827312]: slot print_timing: id 0 | task 2591 |
Mar 16 21:07:22 server llama-server[2827312]: prompt eval time = 27.07 ms / 1 tokens ( 27.07 ms per token, 36.94 tokens per second)
Mar 16 21:07:22 server llama-server[2827312]: eval time = 2596.60 ms / 114 tokens ( 22.78 ms per token, 43.90 tokens per second)
Mar 16 21:07:22 server llama-server[2827312]: total time = 2623.67 ms / 115 tokens
Mar 16 21:07:22 server llama-server[2827312]: draft acceptance rate = 0.42188 ( 27 accepted / 64 generated)
Mar 16 21:07:22 server llama-server[2827312]: statistics ngram_mod: #calls(b,g,a) = 13 2650 3, #gen drafts = 4, #acc drafts = 3, #gen tokens = 256, #acc tokens = 39, dur(b,g,a) = 2.087, 2.700, 1.221 ms
Mar 16 21:07:22 server llama-server[2827312]: slot release: id 0 | task 2591 | stop processing: n_tokens = 4288, truncated = 0
Mar 16 21:07:22 server llama-server[2827312]: srv update_slots: all slots are idle
Mar 16 21:07:22 server llama-server[2827312]: srv stop: cancel task, id_task = 2591
Mar 16 21:07:22 server llama-server[2827312]: srv operator(): got exception: {"error":{"code":500,"message":"Failed to parse input at pos 508: <tool_call>read_file_diff_md</tool_call>","type":"server_error"}}
Mar 16 21:07:22 server llama-server[2827312]: srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500
Mar 16 21:07:22 server llama-server[2827312]: srv update_slots: all slots are idle