Misc. bug: Getting 500 - Failed to parse input at pos x when tool calling #20650

@mvatafu

Description

Name and Version

./llama-cli --version
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24112 MiB):
Device 0: NVIDIA RTX A5000, compute capability 8.6, VMM: yes, VRAM: 24112 MiB
version: 8382 (1bbec6a)
built with GNU 13.3.0 for Linux x86_64

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

[Unit]
Description=llama.cpp Qwen3-30B Server
After=network.target

[Service]
User=root
Environment=GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
Environment=GGML_CUDA_GRAPH_OPT=1
WorkingDirectory=/var/opt/lib/fold/llama.cpp.cuda
ExecStart=/var/opt/lib/fold/llama.cpp.cuda/build/bin/llama-server \
  --threads 22 \
  --threads-batch 8 \
  --jinja \
  --cont-batching \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --model /root/models/glm/GLM-4.7-Flash-Q6_K.gguf.1 \
  --ctx-size 70000 \
  --parallel 1 \
  --n-cpu-moe 5 \
  --batch-size 8192 \
  --ubatch-size 4096 \
  --port 8050 \
  --metrics \
  --no-mmap \
  --mlock \
  -mg 0 \
  -ts 20,22 \
  -ot ".ffn_(up)_exps.=CPU" \
  --cache-ram 0 \
  --temp 0.6 \
  --top-p 0.95 \
  --min-p 0.01 \
  --repeat-penalty 1.0 \
  --spec-type ngram-mod \
  --spec-ngram-size-n 24 \
  --draft-min 48 \
  --draft-max 64 \
  --host 0.0.0.0

Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Problem description & steps to reproduce

Using the latest commit, when I try to run my ADK agent I get ' operator(): got exception: {"error":{"code":500,"message":"Failed to parse input at pos 1225: <tool_call>read_file_diff_md</tool_call>","type":"server_error"}}'. The same thing happens in LibreChat, where I get 'Something went wrong. Here's the specific error message we encountered: An error occurred while processing the request: Failed to parse input at pos 111: <tool_call>listRepos_action_MTAuMTAzLj</tool_call>'.

Note that this happens with GLM-4.7-Flash and also with Qwen3.5-35B.
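For reference, the error can be triggered with any OpenAI-style tool-calling request against the server; below is a minimal sketch of such a request. The tool schema and prompt are made up for illustration (the real agent sends its own tools), and the port matches the unit file above. On an affected build, the server answers with HTTP 500 and the "Failed to parse input at pos ..." body instead of a completion.

```python
import json
import urllib.request
import urllib.error

# Hypothetical minimal tool-calling request; tool name taken from the
# error message, everything else is an illustrative placeholder.
payload = {
    "model": "GLM-4.7-Flash",
    "messages": [{"role": "user", "content": "Show me the diff for README.md"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "read_file_diff_md",
            "description": "Read a file diff rendered as markdown.",
            "parameters": {"type": "object", "properties": {}},
        },
    }],
}

req = urllib.request.Request(
    "http://127.0.0.1:8050/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

try:
    with urllib.request.urlopen(req, timeout=30) as resp:
        print(resp.status, resp.read().decode())
except urllib.error.HTTPError as e:
    # Affected builds return the 500 parse error in the response body.
    print(e.code, e.read().decode())
except urllib.error.URLError as e:
    # No server reachable at this address.
    print("connection failed:", e.reason)
```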

First Bad Commit

Last commit: #20612 (1bbec6a)

Relevant log output

Logs
Mar 16 21:07:04 server llama-server[2827312]: srv  update_slots: all slots are idle
Mar 16 21:07:04 server llama-server[2827312]: srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
Mar 16 21:07:04 server llama-server[2827312]: srv  params_from_: Chat format: peg-native
Mar 16 21:07:04 server llama-server[2827312]: slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 285099151736
Mar 16 21:07:04 server llama-server[2827312]: slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
Mar 16 21:07:04 server llama-server[2827312]: slot launch_slot_: id  0 | task 2143 | processing task, is_child = 0
Mar 16 21:07:04 server llama-server[2827312]: slot update_slots: id  0 | task 2143 | new prompt, n_ctx_slot = 70144, n_keep = 0, task.n_tokens = 4175
Mar 16 21:07:04 server llama-server[2827312]: slot update_slots: id  0 | task 2143 | n_tokens = 52, memory_seq_rm [52, end)
Mar 16 21:07:04 server llama-server[2827312]: slot init_sampler: id  0 | task 2143 | init sampler, took 1.64 ms, tokens: text = 4175, total = 4175
Mar 16 21:07:04 server llama-server[2827312]: slot update_slots: id  0 | task 2143 | prompt processing done, n_tokens = 4175, batch.n_tokens = 4123
Mar 16 21:07:06 server llama-server[2827312]: begin: ngram_mod occupancy = 9638/4194304 (0.00)
Mar 16 21:07:14 server llama-server[2827312]: slot print_timing: id  0 | task 2143 |
Mar 16 21:07:14 server llama-server[2827312]: prompt eval time =    1828.37 ms /  4123 tokens (    0.44 ms per token,  2255.02 tokens per second)
Mar 16 21:07:14 server llama-server[2827312]:        eval time =    7843.73 ms /   313 tokens (   25.06 ms per token,    39.90 tokens per second)
Mar 16 21:07:14 server llama-server[2827312]:       total time =    9672.10 ms /  4436 tokens
Mar 16 21:07:14 server llama-server[2827312]: draft acceptance rate = 0.15625 (   10 accepted /    64 generated)
Mar 16 21:07:14 server llama-server[2827312]: statistics ngram_mod: #calls(b,g,a) = 11 2424 2, #gen drafts = 2, #acc drafts = 2, #gen tokens = 128, #acc tokens = 12, dur(b,g,a) = 1.460, 2.450, 0.001 ms
Mar 16 21:07:14 server llama-server[2827312]: slot      release: id  0 | task 2143 | stop processing: n_tokens = 4487, truncated = 0
Mar 16 21:07:14 server llama-server[2827312]: srv  update_slots: all slots are idle
Mar 16 21:07:14 server llama-server[2827312]: srv          stop: cancel task, id_task = 2143
Mar 16 21:07:14 server llama-server[2827312]: srv  update_slots: all slots are idle
Mar 16 21:07:14 server llama-server[2827312]: srv    operator(): got exception: {"error":{"code":500,"message":"Failed to parse input at pos 1225: <tool_call>read_file_diff_md</tool_call>","type":"server_error"}}
Mar 16 21:07:14 server llama-server[2827312]: srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500
Mar 16 21:07:14 server llama-server[2827312]: srv  params_from_: Chat format: peg-native
Mar 16 21:07:14 server llama-server[2827312]: slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 0.930
Mar 16 21:07:14 server llama-server[2827312]: slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
Mar 16 21:07:14 server llama-server[2827312]: slot launch_slot_: id  0 | task 2448 | processing task, is_child = 0
Mar 16 21:07:14 server llama-server[2827312]: slot update_slots: id  0 | task 2448 | new prompt, n_ctx_slot = 70144, n_keep = 0, task.n_tokens = 4175
Mar 16 21:07:14 server llama-server[2827312]: slot update_slots: id  0 | task 2448 | need to evaluate at least 1 token for each active slot (n_past = 4175, task.n_tokens() = 4175)
Mar 16 21:07:14 server llama-server[2827312]: slot update_slots: id  0 | task 2448 | n_past was set to 4174
Mar 16 21:07:14 server llama-server[2827312]: slot update_slots: id  0 | task 2448 | n_tokens = 4174, memory_seq_rm [4174, end)
Mar 16 21:07:14 server llama-server[2827312]: slot init_sampler: id  0 | task 2448 | init sampler, took 0.79 ms, tokens: text = 4175, total = 4175
Mar 16 21:07:14 server llama-server[2827312]: slot update_slots: id  0 | task 2448 | prompt processing done, n_tokens = 4175, batch.n_tokens = 1
Mar 16 21:07:14 server llama-server[2827312]: begin: ngram_mod occupancy = 9918/4194304 (0.00)
Mar 16 21:07:18 server llama-server[2827312]: slot print_timing: id  0 | task 2448 |
Mar 16 21:07:18 server llama-server[2827312]: prompt eval time =      28.98 ms /     1 tokens (   28.98 ms per token,    34.51 tokens per second)
Mar 16 21:07:18 server llama-server[2827312]:        eval time =    3690.38 ms /   141 tokens (   26.17 ms per token,    38.21 tokens per second)
Mar 16 21:07:18 server llama-server[2827312]:       total time =    3719.35 ms /   142 tokens
Mar 16 21:07:18 server llama-server[2827312]: draft acceptance rate = 0.00000 (    0 accepted /    64 generated)
Mar 16 21:07:18 server llama-server[2827312]: statistics ngram_mod: #calls(b,g,a) = 12 2564 2, #gen drafts = 3, #acc drafts = 2, #gen tokens = 192, #acc tokens = 12, dur(b,g,a) = 1.777, 2.586, 0.001 ms
Mar 16 21:07:18 server llama-server[2827312]: slot      release: id  0 | task 2448 | stop processing: n_tokens = 4315, truncated = 0
Mar 16 21:07:18 server llama-server[2827312]: srv  update_slots: all slots are idle
Mar 16 21:07:18 server llama-server[2827312]: srv          stop: cancel task, id_task = 2448
Mar 16 21:07:18 server llama-server[2827312]: srv    operator(): got exception: {"error":{"code":500,"message":"Failed to parse input at pos 604: <tool_call>read_file_diff_md</tool_call>","type":"server_error"}}
Mar 16 21:07:18 server llama-server[2827312]: srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500
Mar 16 21:07:18 server llama-server[2827312]: srv  update_slots: all slots are idle
Mar 16 21:07:19 server llama-server[2827312]: srv  params_from_: Chat format: peg-native
Mar 16 21:07:19 server llama-server[2827312]: slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 0.968
Mar 16 21:07:19 server llama-server[2827312]: slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
Mar 16 21:07:19 server llama-server[2827312]: slot launch_slot_: id  0 | task 2591 | processing task, is_child = 0
Mar 16 21:07:19 server llama-server[2827312]: slot update_slots: id  0 | task 2591 | new prompt, n_ctx_slot = 70144, n_keep = 0, task.n_tokens = 4175
Mar 16 21:07:19 server llama-server[2827312]: slot update_slots: id  0 | task 2591 | need to evaluate at least 1 token for each active slot (n_past = 4175, task.n_tokens() = 4175)
Mar 16 21:07:19 server llama-server[2827312]: slot update_slots: id  0 | task 2591 | n_past was set to 4174
Mar 16 21:07:19 server llama-server[2827312]: slot update_slots: id  0 | task 2591 | n_tokens = 4174, memory_seq_rm [4174, end)
Mar 16 21:07:19 server llama-server[2827312]: slot init_sampler: id  0 | task 2591 | init sampler, took 1.69 ms, tokens: text = 4175, total = 4175
Mar 16 21:07:19 server llama-server[2827312]: slot update_slots: id  0 | task 2591 | prompt processing done, n_tokens = 4175, batch.n_tokens = 1
Mar 16 21:07:19 server llama-server[2827312]: begin: ngram_mod occupancy = 10051/4194304 (0.00)
Mar 16 21:07:19 server llama-server[2827312]: accept: low acceptance streak (3) – resetting ngram_mod
Mar 16 21:07:22 server llama-server[2827312]: slot print_timing: id  0 | task 2591 |
Mar 16 21:07:22 server llama-server[2827312]: prompt eval time =      27.07 ms /     1 tokens (   27.07 ms per token,    36.94 tokens per second)
Mar 16 21:07:22 server llama-server[2827312]:        eval time =    2596.60 ms /   114 tokens (   22.78 ms per token,    43.90 tokens per second)
Mar 16 21:07:22 server llama-server[2827312]:       total time =    2623.67 ms /   115 tokens
Mar 16 21:07:22 server llama-server[2827312]: draft acceptance rate = 0.42188 (   27 accepted /    64 generated)
Mar 16 21:07:22 server llama-server[2827312]: statistics ngram_mod: #calls(b,g,a) = 13 2650 3, #gen drafts = 4, #acc drafts = 3, #gen tokens = 256, #acc tokens = 39, dur(b,g,a) = 2.087, 2.700, 1.221 ms
Mar 16 21:07:22 server llama-server[2827312]: slot      release: id  0 | task 2591 | stop processing: n_tokens = 4288, truncated = 0
Mar 16 21:07:22 server llama-server[2827312]: srv  update_slots: all slots are idle
Mar 16 21:07:22 server llama-server[2827312]: srv          stop: cancel task, id_task = 2591
Mar 16 21:07:22 server llama-server[2827312]: srv    operator(): got exception: {"error":{"code":500,"message":"Failed to parse input at pos 508: <tool_call>read_file_diff_md</tool_call>","type":"server_error"}}
Mar 16 21:07:22 server llama-server[2827312]: srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500
Mar 16 21:07:22 server llama-server[2827312]: srv  update_slots: all slots are idle

Metadata

Labels: bug (Something isn't working)