Misc. bug: Server 500 error (gpt-oss, AutoParser, cache) #20532
Description
Name and Version
$ llama-server --version
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 122502 MiB):
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 122502 MiB (52688 MiB free)
version: 8334 (463b6a963)
built with GNU 13.3.0 for Linux aarch64
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server --host 0.0.0.0 --port 8033 --log-colors on --log-prefix --log-timestamps -ngl 999 --threads -1 --no-mmap --mlock --direct-io --cache-ram 0 -hf ggml-org/gpt-oss-120b-GGUF --jinja --flash-attn 1 --ubatch-size 2048 --batch-size 2048 --temp 1.0 --min-p 0.01 --top-p 1.0 --top-k 0.0 --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64 --chat-template-kwargs '{"reasoning_effort": "low"}'
Problem description & steps to reproduce
Summary
For the last few days I've been getting server 500 errors with the message "Failed to parse input at pos" while running the Aider Polyglot eval. I had difficulty finding a minimal reproducible test case; it turned out that a sequence of requests, rather than a single request, causes the issue. It appears to me that this is some sort of slot prompt cache corruption that the new AutoParser uncovered. All of my tests were run on a DGX Spark.
I am still working to find the root cause, but decided to post my findings in case someone else can shortcut me to the fix.
Error samples
567.38.283.954 W srv operator(): got exception: {"error":{"code":500,"message":"Failed to parse input at pos 0: ...<|message|><|start|>assistant<|channel|>analysis<|message|>We need to produce a summary per instructions. The user wants a brief summary of the partial conversation, focusing more on recent messages. Must include function names, libraries, packages, filenames referenced. No code fences. Must start with \"I asked you...\". Write as user, referring to assistant as you. Summarize: initial request about Zebra Puzzle, assistant gave simple solve implementation in zebra_puzzle.cpp with namespace and solve returning Solution. User said updated files. Assistant responded \"Ok.\" Then user posted build output showing compile errors: solve not a member of zebra_puzzle. Assistant then gave header file with struct Solution and declaration of solve. Need to include that. Also mention files: zebra_puzzle.cpp, zebra_puzzle.h, zebra_puzzle_test.cpp, CMake, etc. Mention standard library <string>. No code fences. Provide new paragraph when topic changes. Let's craft.<|end|><|start|>assistant<|channel|>final<|message|>I asked you to implement the Zebra Puzzle solution, so you added a zebra_puzzle.cpp file that defines a zebra_puzzle::solve() function returning a Solution object with the hard‑coded answers “Norwegian” and “Japanese”. \n\nI then told you I had updated the source files, and you replied briefly that you noted the change. \n\nAfter that I ran the CMake build and posted the compilation output, which showed that zebra_puzzle_test.cpp could not find a solve function in the zebra_puzzle namespace. To fix the errors you provided a new zebra_puzzle.h header that declares the Solution struct (with std::string members drinksWater and ownsZebra) and the solve() function prototype inside the zebra_puzzle namespace, using the standard <string> library.","type":"server_error"}}
45.27.987.355 W srv operator(): got exception: {"error":{"code":500,"message":"Failed to parse input at pos 0: ...\n<|end|><|start|>assistant<|channel|>analysis<|message|>We need summary.<|end|><|start|>assistant<|channel|>final<|message|>I asked you to implement parallel letter‑frequency counting in C++. You first gave me a header **parallel_letter_frequency.h** declaring `std::array<std::size_t, 26> count_letter_frequencies(const std::vector<std::string>& texts);` and a source file **parallel_letter_frequency.cpp** that uses `<algorithm>`, `<execution>`, and helper functions `detail::count_in_string` and `detail::combine_counts` to perform a parallel `std::transform_reduce`. \n\nLater you pointed out a compilation issue: the header was missing `<cstddef>`. You responded with an updated **parallel_letter_frequency.h** that now includes `<cstddef>` along with `<array>`, `<string>`, and `<vector>`. The function declaration remains the same, and the namespace and include guards are unchanged.","type":"server_error"}}
3.36.810.109 W srv operator(): got exception: {"error":{"code":500,"message":"Failed to parse input at pos 0: assistant…\n\n<|end|><|start|>assistant<|channel|>analysis<|message|>We need to produce summary.<|end|><|start|>assistant<|channel|>final<|message|>I asked you to implement a 0/1 knapsack solution. You created **knapsack.h** with the `Item` struct (fields `weight` and `value`) and declared the function `int maxValue(const std::vector<Item>& items, int capacity);` inside the `knapsack` namespace. In **knapsack.cpp** you defined `maxValue` using a one‑dimensional DP array, handling edge cases, skipping items heavier than the capacity, and returning the best achievable value.\n\nLater I told you I had updated the files and pointed out an error marker at line 25 of **knapsack.h** where the `maxValue` declaration appears. You responded by reposting the corrected **knapsack.h** content, confirming the declaration and surrounding documentation remain unchanged.","type":"server_error"}}
3.16.010.983 W srv operator(): got exception: {"error":{"code":500,"message":"Failed to parse input at pos 55: <|start|>final<|channel|>final<|message|>I asked you to implement a 0/1 knapsack solution and you provided the header **knapsack.h** defining the `knapsack::Item` struct and the function `int maxValue(const std::vector<Item>& items, int capacity);`, along with documentation comments. You also gave the implementation in **knapsack.cpp**, using a one‑dimensional DP array to compute the maximum value without exceeding the capacity.\n\nLater I told you I had updated the files and pointed out an error marker at line 25 in **knapsack.h** where the declaration of `maxValue` appears. You responded by reposting the corrected **knapsack.h** file, ensuring the function declaration is properly placed and the file compiles.","type":"server_error"}}
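When scanning a long server log, it can help to pull out only the parse failures and the position each one reports. The following is a minimal sketch, assuming the failure lines look like the samples above; the function name list_parse_failures is hypothetical and not part of llama.cpp.

```shell
#!/bin/bash
# Hypothetical helper: print one "pos N" entry per parse failure found in a llama-server log.
# Assumes each failure line contains the text "Failed to parse input at pos <number>".
list_parse_failures() {
    local log_file="$1"
    grep -o 'Failed to parse input at pos [0-9][0-9]*' "$log_file" \
        | sed 's/^Failed to parse input at //'
}
```

For example, `list_parse_failures ~/failLog.log | sort | uniq -c` would show how often each position is reported.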
Findings:
- Error does not occur with --no-cache-prompt
- Error does repeat with same prompt until cache in best match slot is replaced
- Error does not occur on same prompt with --slot-save-path and /slots/X?action=erase API call
- Error does occur with ggml-org and unsloth gpt-oss-120b, and all combinations of their chat templates
- Error does occur with gpt-oss-120b medium and low thinking
- Error does occur with ggml-org gpt-oss-20b (medium thinking only)
- Error does occur with and without speculative decoding (ngram-mod)
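The slot erase finding above can be scripted so that a suspect slot is cleared before the failing prompt is re-sent. This is a minimal sketch, assuming the server from the command line above is reachable on localhost:8033 and that the slot id is known (slot 0 in the usage note is an assumption); the helper names slot_erase_url and erase_slot are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers for clearing a slot's cached prompt between replays.
# Host, port and slot id are assumptions; adjust them to the running server.

# Build the URL for the slot erase endpoint named in the findings (/slots/X?action=erase).
slot_erase_url() {
    local host="$1" port="$2" slot_id="$3"
    printf 'http://%s:%s/slots/%s?action=erase\n' "$host" "$port" "$slot_id"
}

# Ask the server to erase the given slot; the return value is curl's exit status.
erase_slot() {
    curl -s --fail-with-body -X POST "$(slot_erase_url "$1" "$2" "$3")"
}
```

Calling `erase_slot localhost 8033 0` and then re-sending the failing prompt is one way to check whether the error follows the cached slot contents rather than the prompt itself.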
Steps to Recreate
In an attempt to nail down the exact call, I added temporary logging of the JSON message posted to /v1/chat/completions. I don't recommend this for production and there might be a better way, but this is what I used to recreate the issue.
Danger prompt logging
Logging
NOTE: Only apply this patch if you want to create your own replay logs. I've attached the logs I created using this method, and the patch is not needed for playback.
prompt-log.patch
diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
index b4373c101..eee06d542 100644
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -3651,6 +3651,7 @@ void server_routes::init_routes() {
auto res = create_response();
std::vector<raw_buffer> files;
json body = json::parse(req.body);
+ SRV_INF("Request: %s\n", body.dump().c_str());
json body_parsed = oaicompat_chat_params_parse(
body,
meta->chat_params,
And apply it with:
git apply prompt-log.patch
Using a log file created with the above method, we can spit out the sequence of prompts with the following:
replayLog.sh
#!/bin/bash
FAIL_LOG="failLog.log"
# Extract the JSON body from every log line that starts with the "Request: " header
# (note: creating such a log requires prompt-log.patch to be applied)
PROMPTS=$(sed -n 's/^[^srv]*srv operator(): Request: //p' "$FAIL_LOG")
while IFS= read -r line; do
    echo -e "\n\e[31m*** PROMPT ***\e[0m"
    echo "$line"
    echo -e "\n\e[34m*** RESPONSE ***\e[0m"
    # Replay the request; --fail-with-body makes curl exit non-zero on HTTP errors such as 500
    echo "$line" | curl -s --fail-with-body -X POST http://localhost:8033/v1/chat/completions -H "Content-Type: application/json" -d @-
    if [[ "$?" -ne 0 ]]; then
        echo -e "\n\nExiting due to server error...\n\n"
        break
    fi
    echo ""
done <<< "$PROMPTS"
NOTE: Be sure to pipe all of the llama-server output to a file by appending &> ~/failLog.log to the llama-server command.
I will attach my entire log file which can be used with the above script to recreate the error seen.
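Because the failure depends on a sequence of requests, it may help to replay only a prefix of the log when looking for the shortest sequence that still triggers the error. This is a minimal sketch, assuming a log produced with prompt-log.patch applied; the function name first_requests is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the first N logged request bodies, one JSON document per line.
# Uses the same sed expression as replayLog.sh, so the log must come from a patched server.
first_requests() {
    local log_file="$1" count="$2"
    sed -n 's/^[^srv]*srv operator(): Request: //p' "$log_file" | head -n "$count"
}
```

For example, `first_requests failLog.log 10` prints the first ten request bodies, which can be fed to the same curl loop used in replayLog.sh.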
First Bad Commit
Errors only started appearing after 566059a, but only because of the additional error checking present from that commit on. The error appears sporadically at different points through later commits (any small change to the caching or the parsing seems to affect how it manifests).
Relevant log output
Logs that I used to recreate the issue with the above script: