Eval bug: Constantly uses more tokens than expected in llama.cpp with Qwen3.5-35B-A3B models #20099
Description
Name and Version
❯ llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 890M Graphics (RADV STRIX1) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
version: 8398 (81d3d6c)
built with GNU 15.2.1 for Linux x86_64
Operating systems
Linux
GGML backends
Vulkan & ROCm
Hardware
AMD Ryzen AI 360 w/890M
Models
Qwen3.5-35B-A3B (Thinking disabled) and Qwen3-Next-80B-A3B-Instruct
Problem description & steps to reproduce
I'm using the Immersive Translate Firefox extension to translate webpage articles via API, with a local llama-server (from llama.cpp) serving that API.
I have Immersive Translate restrict each API request to a maximum of 1300 characters, with a 20-second delay between calls to respect rate limits. To handle longer articles, the extension splits the content into segments of up to 1300 characters, which usually keeps the word count per chunk in a safe, manageable range. I've observed that each API call uses roughly 700–900 tokens, and on my setup the actual time per call ends up being about 16–21 seconds (likely network latency plus inference time), which is why I chose the 20-second delay between calls.
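For reference, the request pattern looks roughly like this. This is a minimal sketch, not Immersive Translate's actual implementation: the port and model name are taken from the server logs below, while the splitting logic, prompt, and use of the `requests` library are my assumptions.

```python
import time
import requests

API_URL = "http://localhost:11434/v1/chat/completions"  # llama-server's OpenAI-compatible endpoint
MAX_CHARS = 1300      # Immersive Translate's per-request limit
DELAY_SECONDS = 20    # enforced interval between calls

def split_chunks(text: str, limit: int = MAX_CHARS) -> list[str]:
    # naive splitter; the real extension splits on sentence/paragraph boundaries
    return [text[i:i + limit] for i in range(0, len(text), limit)]

def translate(article: str) -> list[str]:
    results = []
    for chunk in split_chunks(article):
        resp = requests.post(API_URL, json={
            "model": "Qwen3.5-35B-A3B-GGUF",
            "messages": [
                {"role": "system", "content": "Translate the following text."},
                {"role": "user", "content": chunk},
            ],
        })
        results.append(resp.json()["choices"][0]["message"]["content"])
        time.sleep(DELAY_SECONDS)  # respect the rate limit
    return results
```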
Here is the log:
Log for each API call:
07:15:44 llama-server: total time = 14096.61 ms / 759 tokens
07:15:58 llama-server: total time = 6989.59 ms / 148 tokens
07:16:21 llama-server: total time = 10367.31 ms / 215 tokens
07:16:33 llama-server: total time = 1816.04 ms / 261 tokens
07:17:05 llama-server: total time = 14180.87 ms / 753 tokens
07:17:32 llama-server: total time = 20762.03 ms / 940 tokens
07:17:47 llama-server: total time = 14930.24 ms / 317 tokens
07:17:58 llama-server: total time = 7517.04 ms / 502 tokens
07:18:27 llama-server: total time = 16116.15 ms / 827 tokens
07:18:46 llama-server: total time = 14809.73 ms / 757 tokens
07:19:11 llama-server: total time = 20363.96 ms / 951 tokens
07:19:15 llama-server: total time = 4081.01 ms / 261 tokens
07:19:51 llama-server: total time = 19766.75 ms / 861 tokens
07:20:09 llama-server: total time = 18776.18 ms / 848 tokens
07:20:31 llama-server: total time = 20001.11 ms / 865 tokens
07:20:45 llama-server: total time = 13900.11 ms / 593 tokens
07:20:55 llama-server: total time = 4437.38 ms / 290 tokens
07:21:24 llama-server: total time = 13455.90 ms / 665 tokens
07:21:43 llama-server: total time = 11957.84 ms / 577 tokens
07:22:05 llama-server: total time = 15125.87 ms / 690 tokens
07:22:27 llama-server: total time = 16229.08 ms / 700 tokens
07:22:46 llama-server: total time = 15604.36 ms / 686 tokens
07:22:54 llama-server: total time = 3242.95 ms / 242 tokens
07:23:25 llama-server: total time = 14001.47 ms / 649 tokens
07:23:49 llama-server: total time = 18765.17 ms / 822 tokens
07:24:02 llama-server: total time = 10860.97 ms / 537 tokens
07:24:14 llama-server: total time = 3224.33 ms / 248 tokens
07:24:41 llama-server: total time = 10833.63 ms / 555 tokens
07:25:07 llama-server: total time = 16381.44 ms / 734 tokens
07:25:24 llama-server: total time = 13377.58 ms / 619 tokens
07:25:45 llama-server: total time = 14145.33 ms / 642 tokens
07:26:06 llama-server: total time = 15536.54 ms / 703 tokens
07:26:18 llama-server: total time = 8195.68 ms / 441 tokens
07:26:43 llama-server: total time = 12365.73 ms / 599 tokens
07:27:09 llama-server: total time = 19161.65 ms / 843 tokens
07:27:26 llama-server: total time = 15589.03 ms / 668 tokens
07:27:47 llama-server: total time = 17136.63 ms / 779 tokens
07:28:09 llama-server: total time = 19048.27 ms / 804 tokens
07:28:27 llama-server: total time = 16278.56 ms / 722 tokens
07:28:33 llama-server: total time = 2642.20 ms / 218 tokens
07:29:07 llama-server: total time = 17140.73 ms / 764 tokens
07:29:22 llama-server: total time = 11749.06 ms / 598 tokens
07:29:44 llama-server: total time = 13600.18 ms / 713 tokens
07:30:07 llama-server: total time = 17045.18 ms / 798 tokens
07:30:37 llama-server: total time = 26454.97 ms / 1162 tokens
07:31:01 llama-server: total time = 23859.52 ms / 739 tokens
07:31:31 llama-server: total time = 30384.73 ms / 1289 tokens
07:31:44 llama-server: total time = 13305.07 ms / 625 tokens
07:31:57 llama-server: total time = 12464.76 ms / 579 tokens
07:32:26 llama-server: total time = 29054.23 ms / 1161 tokens
07:32:50 llama-server: total time = 23511.21 ms / 860 tokens
07:33:36 llama-server: total time = 45909.49 ms / 1626 tokens
07:33:49 llama-server: total time = 13538.13 ms / 627 tokens
07:34:08 llama-server: total time = 18626.50 ms / 806 tokens
07:34:40 llama-server: total time = 32001.76 ms / 1388 tokens
07:34:52 llama-server: total time = 12224.19 ms / 576 tokens
07:35:08 llama-server: total time = 15785.83 ms / 668 tokens
07:35:21 llama-server: total time = 12529.78 ms / 579 tokens
07:35:40 llama-server: total time = 19683.33 ms / 823 tokens
07:36:00 llama-server: total time = 19929.24 ms / 836 tokens
07:36:11 llama-server: total time = 10528.97 ms / 435 tokens
07:36:23 llama-server: total time = 12319.67 ms / 472 tokens
07:36:38 llama-server: total time = 14544.96 ms / 544 tokens
07:36:53 llama-server: total time = 15121.09 ms / 576 tokens
07:37:11 llama-server: total time = 17584.80 ms / 759 tokens
07:37:28 llama-server: total time = 17736.19 ms / 761 tokens
07:37:45 llama-server: total time = 14463.15 ms / 648 tokens
07:38:08 llama-server: total time = 17503.07 ms / 802 tokens
07:38:21 llama-server: total time = 10337.53 ms / 515 tokens
07:38:43 llama-server: total time = 12920.57 ms / 635 tokens
07:38:59 llama-server: total time = 8814.36 ms / 365 tokens
07:39:24 llama-server: total time = 13557.90 ms / 546 tokens
07:39:41 llama-server: total time = 10424.04 ms / 528 tokens
07:40:02 llama-server: total time = 11173.59 ms / 550 tokens
07:40:19 llama-server: total time = 9135.88 ms / 471 tokens
07:40:38 llama-server: total time = 7527.62 ms / 404 tokens
07:41:05 llama-server: total time = 14752.71 ms / 668 tokens
07:41:20 llama-server: total time = 10103.80 ms / 407 tokens
07:41:47 llama-server: total time = 15663.15 ms / 621 tokens
07:42:03 llama-server: total time = 11359.15 ms / 564 tokens
07:42:28 llama-server: total time = 17399.32 ms / 776 tokens
07:42:48 llama-server: total time = 17059.90 ms / 789 tokens
07:43:19 llama-server: total time = 28266.73 ms / 1153 tokens
07:43:42 llama-server: total time = 22899.02 ms / 889 tokens
07:44:10 llama-server: total time = 27041.02 ms / 1090 tokens
07:44:19 llama-server: total time = 9441.02 ms / 482 tokens
07:44:36 llama-server: total time = 16885.56 ms / 760 tokens
07:45:03 llama-server: total time = 26911.67 ms / 1132 tokens
07:45:40 llama-server: total time = 37277.13 ms / 1239 tokens
07:46:31 llama-server: total time = 51052.86 ms / 1782 tokens
07:46:46 llama-server: total time = 14480.60 ms / 556 tokens
07:47:02 llama-server: total time = 15874.06 ms / 563 tokens
07:47:34 llama-server: total time = 32411.07 ms / 1358 tokens
07:47:50 llama-server: total time = 15425.22 ms / 574 tokens
07:48:02 llama-server: total time = 12001.29 ms / 466 tokens
07:48:19 llama-server: total time = 17453.73 ms / 609 tokens
07:48:36 llama-server: total time = 16762.52 ms / 639 tokens
07:48:54 llama-server: total time = 17468.69 ms / 715 tokens
07:49:10 llama-server: total time = 16651.48 ms / 766 tokens
07:49:22 llama-server: total time = 11977.52 ms / 586 tokens
07:49:41 llama-server: total time = 18467.26 ms / 851 tokens
07:50:00 llama-server: total time = 19333.21 ms / 700 tokens
07:50:14 llama-server: total time = 14220.44 ms / 530 tokens
07:50:21 llama-server: total time = 6926.85 ms / 396 tokens
07:50:31 llama-server: total time = 9046.57 ms / 456 tokens
07:50:40 llama-server: total time = 9817.99 ms / 492 tokens
07:51:07 llama-server: total time = 15682.36 ms / 707 tokens
07:51:23 llama-server: total time = 11560.64 ms / 562 tokens
07:51:43 llama-server: total time = 11680.91 ms / 587 tokens
07:52:02 llama-server: total time = 11095.06 ms / 538 tokens
07:52:26 llama-server: total time = 15961.12 ms / 743 tokens
07:52:41 llama-server: total time = 10145.25 ms / 502 tokens
07:53:08 llama-server: total time = 17327.05 ms / 726 tokens
07:53:25 llama-server: total time = 14794.46 ms / 697 tokens
07:53:52 llama-server: total time = 21936.63 ms / 891 tokens
07:54:04 llama-server: total time = 11669.32 ms / 561 tokens
07:54:27 llama-server: total time = 16300.08 ms / 706 tokens
07:54:43 llama-server: total time = 12669.06 ms / 595 tokens
07:55:10 llama-server: total time = 19038.73 ms / 838 tokens
07:55:20 llama-server: total time = 8790.97 ms / 439 tokens
07:56:02 llama-server: total time = 31286.78 ms / 1186 tokens
07:56:08 llama-server: total time = 6372.69 ms / 248 tokens
07:56:38 llama-server: total time = 26920.87 ms / 1025 tokens
07:56:47 llama-server: total time = 8664.55 ms / 467 tokens
07:57:11 llama-server: total time = 20533.50 ms / 860 tokens
07:57:38 llama-server: total time = 26934.49 ms / 1143 tokens
07:58:08 llama-server: total time = 29944.37 ms / 1100 tokens
07:58:53 llama-server: total time = 44804.45 ms / 1560 tokens
07:59:13 llama-server: total time = 19956.16 ms / 720 tokens
07:59:26 llama-server: total time = 12370.58 ms / 604 tokens
08:00:05 llama-server: total time = 39777.19 ms / 1547 tokens
08:00:23 llama-server: total time = 17947.41 ms / 777 tokens
08:00:31 llama-server: total time = 7945.46 ms / 420 tokens
08:00:42 llama-server: total time = 10251.82 ms / 496 tokens
08:00:52 llama-server: total time = 10574.39 ms / 494 tokens
08:01:01 llama-server: total time = 8436.00 ms / 425 tokens
08:01:14 llama-server: total time = 12974.18 ms / 661 tokens
08:01:29 llama-server: total time = 15496.11 ms / 666 tokens
08:01:43 llama-server: total time = 12203.77 ms / 606 tokens
08:02:07 llama-server: total time = 15676.75 ms / 697 tokens
08:02:29 llama-server: total time = 18966.88 ms / 675 tokens
08:02:48 llama-server: total time = 17710.54 ms / 669 tokens
08:03:06 llama-server: total time = 15106.71 ms / 601 tokens
08:03:25 llama-server: total time = 14701.68 ms / 560 tokens
08:03:50 llama-server: total time = 19364.92 ms / 727 tokens
08:04:07 llama-server: total time = 16387.78 ms / 616 tokens
08:04:31 llama-server: total time = 20383.19 ms / 746 tokens
08:04:42 llama-server: total time = 10931.60 ms / 538 tokens
08:05:12 llama-server: total time = 21272.71 ms / 911 tokens
08:05:24 llama-server: total time = 11466.04 ms / 584 tokens
08:05:50 llama-server: total time = 19907.46 ms / 866 tokens
08:06:07 llama-server: total time = 15835.01 ms / 772 tokens
08:06:27 llama-server: total time = 16255.03 ms / 735 tokens
08:06:43 llama-server: total time = 12149.81 ms / 590 tokens
08:07:07 llama-server: total time = 16571.01 ms / 789 tokens
08:07:22 llama-server: total time = 11728.34 ms / 573 tokens
08:07:47 llama-server: total time = 16303.15 ms / 746 tokens
08:08:06 llama-server: total time = 15596.87 ms / 693 tokens
08:08:43 llama-server: total time = 32767.01 ms / 1294 tokens
08:09:06 llama-server: total time = 22272.77 ms / 970 tokens
08:09:22 llama-server: total time = 16187.79 ms / 727 tokens
08:09:34 llama-server: total time = 11968.40 ms / 541 tokens
08:10:02 llama-server: total time = 27999.42 ms / 1143 tokens
08:10:30 llama-server: total time = 28521.68 ms / 1179 tokens
08:11:09 llama-server: total time = 38071.22 ms / 1311 tokens
08:11:25 llama-server: total time = 16007.86 ms / 599 tokens
08:11:43 llama-server: total time = 18812.15 ms / 702 tokens
08:12:31 llama-server: total time = 47642.79 ms / 1696 tokens
But after the API server has been running for some time, there are periodic spikes in the tokens used. Here are the highlights; the filter used to extract them is sketched after the excerpt:
Filtered log:
<- 15 minutes after the translation start ->
07:30:37 llama-server: total time = 26454.97 ms / 1162 tokens
07:31:31 llama-server: total time = 30384.73 ms / 1289 tokens
07:32:26 llama-server: total time = 29054.23 ms / 1161 tokens
07:33:36 llama-server: total time = 45909.49 ms / 1626 tokens
07:34:40 llama-server: total time = 32001.76 ms / 1388 tokens
<- 9 minutes after the previous spike ->
07:43:19 llama-server: total time = 28266.73 ms / 1153 tokens
07:44:10 llama-server: total time = 27041.02 ms / 1090 tokens
07:45:03 llama-server: total time = 26911.67 ms / 1132 tokens
07:45:40 llama-server: total time = 37277.13 ms / 1239 tokens
07:46:31 llama-server: total time = 51052.86 ms / 1782 tokens
07:47:34 llama-server: total time = 32411.07 ms / 1358 tokens
<- 9 minutes after the previous spike ->
07:56:02 llama-server: total time = 31286.78 ms / 1186 tokens
07:56:38 llama-server: total time = 26920.87 ms / 1025 tokens
07:57:38 llama-server: total time = 26934.49 ms / 1143 tokens
07:58:08 llama-server: total time = 29944.37 ms / 1100 tokens
07:58:53 llama-server: total time = 44804.45 ms / 1560 tokens
08:00:05 llama-server: total time = 39777.19 ms / 1547 tokens
<- 8 minutes after the previous spike ->
08:08:43 llama-server: total time = 32767.01 ms / 1294 tokens
08:10:02 llama-server: total time = 27999.42 ms / 1143 tokens
08:10:30 llama-server: total time = 28521.68 ms / 1179 tokens
08:11:09 llama-server: total time = 38071.22 ms / 1311 tokens
08:12:31 llama-server: total time = 47642.79 ms / 1696 tokens
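The spikes above were pulled out of the full log with a simple filter. A sketch, assuming the log format shown above; the file name is hypothetical and the 1000-token threshold is an arbitrary cutoff I chose, since normal calls stay in the 700–900 token range:

```python
import re

SPIKE_THRESHOLD = 1000  # arbitrary cutoff above the normal 700-900 token range
PATTERN = re.compile(
    r"^(\d{2}:\d{2}:\d{2}) llama-server: total time = ([\d.]+) ms / (\d+) tokens"
)

with open("llama-server.log") as f:  # hypothetical path to the captured log
    for line in f:
        m = PATTERN.match(line.strip())
        if m and int(m.group(3)) > SPIKE_THRESHOLD:
            print(line.rstrip())  # keep only the spike entries
```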
Here is a screenshot of the translation:
The screenshot shows a stray </think> tag in the translated output. Since thinking is disabled (--reasoning-budget 0), my primary suspect is that thinking mode is somehow re-enabled during each spike interval.
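To check whether thinking leaks into a given response, the server can be queried directly and the output inspected for the tag. A sketch; the prompt is arbitrary and I assume the router is reachable on port 11434 as configured below:

```python
import requests

resp = requests.post("http://localhost:11434/v1/chat/completions", json={
    "model": "Qwen3.5-35B-A3B-GGUF",
    "messages": [{"role": "user", "content": "Translate to English: 你好，世界"}],
})
body = resp.json()
content = body["choices"][0]["message"]["content"]
usage = body.get("usage", {})

# With --reasoning-budget 0, no thinking tags should appear in the content
print("thinking leaked:", "</think>" in content)
print("completion tokens:", usage.get("completion_tokens"))
```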
First Bad Commit
No response
Relevant log output
Logs
❯ llama-server \
--models-dir ~/unsloth \
--ctx-size 16384 \
--temp 0 \
--top-p 0.8 \
--top-k 20 \
--min-p 0.00 \
--reasoning-budget 0 \
--port 11434 \
--presence_penalty 1.5 \
--repeat_penalty 1.0 \
-np 1 \
--timeout 100 \
--host 0.0.0.0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 890M Graphics (RADV STRIX1) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
build: 8192 (8ade892) with GNU 15.2.1 for Linux x86_64
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Running without SSL
init: using 15 threads for HTTP server
srv load_models: Loaded 0 cached model presets
srv load_models: Loaded 2 local model presets from ~/unsloth
srv load_models: Available models (2) (*: custom preset)
srv load_models: Qwen3-30B-A3B-Instruct-2507-GGUF
srv load_models: Qwen3.5-35B-A3B-GGUF
main: starting router server, no model will be loaded in this process
start: binding port with default address family
main: router server is listening on http://0.0.0.0:11434
main: NOTE: router mode is experimental
main: it is not recommended to use this mode in untrusted environments
srv ensure_model: model name=Qwen3.5-35B-A3B-GGUF is not loaded, loading...
srv load: spawning server instance with name=Qwen3.5-35B-A3B-GGUF on port 58673
srv load: spawning server instance with args:
srv load: /usr/bin/llama-server
srv load: --host
srv load: 127.0.0.1
srv load: --min-p
srv load: 0.00
srv load: --port
srv load: 58673
srv load: --presence-penalty
srv load: 1.5
srv load: --reasoning-budget
srv load: 0
srv load: --repeat-penalty
srv load: 1.0
srv load: --temperature
srv load: 0
srv load: --top-k
srv load: 20
srv load: --top-p
srv load: 0.8
srv load: --alias
srv load: Qwen3.5-35B-A3B-GGUF
srv load: --ctx-size
srv load: 16384
srv load: --model
srv load: ~/unsloth/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf
srv load: --parallel
srv load: 1
srv load: --timeout
srv load: 100
srv ensure_model: waiting until model name=Qwen3.5-35B-A3B-GGUF is fully loaded...
[58673] ggml_vulkan: Found 1 Vulkan devices:
[58673] ggml_vulkan: 0 = AMD Radeon 890M Graphics (RADV STRIX1) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
[58673] build: 8192 (8ade892) with GNU 15.2.1 for Linux x86_64
[58673] system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
[58673]
[58673] system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[58673]
[58673] Running without SSL
[58673] init: using 15 threads for HTTP server
[58673] start: binding port with default address family
[58673] main: loading model
[58673] srv load_model: loading model '~/unsloth/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf'
[58673] common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
[58673] llama_params_fit_impl: projected to use 21566 MiB of device memory vs. 55263 MiB of free device memory
[58673] llama_params_fit_impl: will leave 33697 >= 1024 MiB of free device memory, no changes needed
[58673] llama_params_fit: successfully fit params to free device memory
[58673] llama_params_fit: fitting params to free memory took 0.59 seconds
[58673] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon 890M Graphics (RADV STRIX1)) (0000:c3:00.0) - 55327 MiB free
[58673] llama_model_loader: loaded meta data with 52 key-value pairs and 733 tensors from ~/unsloth/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf (version GGUF V3 (latest))
[58673] llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[58673] llama_model_loader: - kv 0: general.architecture str = qwen35moe
[58673] llama_model_loader: - kv 1: general.type str = model
[58673] llama_model_loader: - kv 2: general.sampling.top_k i32 = 20
[58673] llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.950000
[58673] llama_model_loader: - kv 4: general.sampling.temp f32 = 1.000000
[58673] llama_model_loader: - kv 5: general.name str = Qwen3.5-35B-A3B
[58673] llama_model_loader: - kv 6: general.basename str = Qwen3.5-35B-A3B
[58673] llama_model_loader: - kv 7: general.quantized_by str = Unsloth
[58673] llama_model_loader: - kv 8: general.size_label str = 35B-A3B
[58673] llama_model_loader: - kv 9: general.license str = apache-2.0
[58673] llama_model_loader: - kv 10: general.license.link str = https://huggingface.co/Qwen/Qwen3.5-3...
[58673] llama_model_loader: - kv 11: general.repo_url str = https://huggingface.co/unsloth
[58673] llama_model_loader: - kv 12: general.base_model.count u32 = 1
[58673] llama_model_loader: - kv 13: general.base_model.0.name str = Qwen3.5 35B A3B
[58673] llama_model_loader: - kv 14: general.base_model.0.organization str = Qwen
[58673] llama_model_loader: - kv 15: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3.5-3...
[58673] llama_model_loader: - kv 16: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
[58673] llama_model_loader: - kv 17: qwen35moe.block_count u32 = 40
[58673] llama_model_loader: - kv 18: qwen35moe.context_length u32 = 262144
[58673] llama_model_loader: - kv 19: qwen35moe.embedding_length u32 = 2048
[58673] llama_model_loader: - kv 20: qwen35moe.attention.head_count u32 = 16
[58673] llama_model_loader: - kv 21: qwen35moe.attention.head_count_kv u32 = 2
[58673] llama_model_loader: - kv 22: qwen35moe.rope.dimension_sections arr[i32,4] = [11, 11, 10, 0]
[58673] llama_model_loader: - kv 23: qwen35moe.rope.freq_base f32 = 10000000.000000
[58673] llama_model_loader: - kv 24: qwen35moe.attention.layer_norm_rms_epsilon f32 = 0.000001
[58673] llama_model_loader: - kv 25: qwen35moe.expert_count u32 = 256
[58673] llama_model_loader: - kv 26: qwen35moe.expert_used_count u32 = 8
[58673] llama_model_loader: - kv 27: qwen35moe.attention.key_length u32 = 256
[58673] llama_model_loader: - kv 28: qwen35moe.attention.value_length u32 = 256
[58673] llama_model_loader: - kv 29: qwen35moe.expert_feed_forward_length u32 = 512
[58673] llama_model_loader: - kv 30: qwen35moe.expert_shared_feed_forward_length u32 = 512
[58673] llama_model_loader: - kv 31: qwen35moe.ssm.conv_kernel u32 = 4
[58673] llama_model_loader: - kv 32: qwen35moe.ssm.state_size u32 = 128
[58673] llama_model_loader: - kv 33: qwen35moe.ssm.group_count u32 = 16
[58673] llama_model_loader: - kv 34: qwen35moe.ssm.time_step_rank u32 = 32
[58673] llama_model_loader: - kv 35: qwen35moe.ssm.inner_size u32 = 4096
[58673] llama_model_loader: - kv 36: qwen35moe.full_attention_interval u32 = 4
[58673] llama_model_loader: - kv 37: qwen35moe.rope.dimension_count u32 = 64
[58673] llama_model_loader: - kv 38: tokenizer.ggml.model str = gpt2
[58673] llama_model_loader: - kv 39: tokenizer.ggml.pre str = qwen35
[58673] llama_model_loader: - kv 40: tokenizer.ggml.tokens arr[str,248320] = ["!", "\"", "#", "$", "%", "&", "'", ...
[58673] llama_model_loader: - kv 41: tokenizer.ggml.token_type arr[i32,248320] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[58673] llama_model_loader: - kv 42: tokenizer.ggml.merges arr[str,247587] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
[58673] llama_model_loader: - kv 43: tokenizer.ggml.eos_token_id u32 = 248046
[58673] llama_model_loader: - kv 44: tokenizer.ggml.padding_token_id u32 = 248055
[58673] llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set image_count = namespace(value...
[58673] llama_model_loader: - kv 46: general.quantization_version u32 = 2
[58673] llama_model_loader: - kv 47: general.file_type u32 = 15
[58673] llama_model_loader: - kv 48: quantize.imatrix.file str = Qwen3.5-35B-A3B-GGUF/imatrix_unsloth....
[58673] llama_model_loader: - kv 49: quantize.imatrix.dataset str = unsloth_calibration_Qwen3.5-35B-A3B.txt
[58673] llama_model_loader: - kv 50: quantize.imatrix.entries_count u32 = 510
[58673] llama_model_loader: - kv 51: quantize.imatrix.chunks_count u32 = 30
[58673] llama_model_loader: - type f32: 301 tensors
[58673] llama_model_loader: - type q8_0: 312 tensors
[58673] llama_model_loader: - type q4_K: 78 tensors
[58673] llama_model_loader: - type q5_K: 41 tensors
[58673] llama_model_loader: - type q6_K: 1 tensors
[58673] print_info: file format = GGUF V3 (latest)
[58673] print_info: file type = Q4_K - Medium
[58673] print_info: file size = 20.70 GiB (5.13 BPW)
[58673] load: 0 unused tokens
[58673] load: printing all EOG tokens:
[58673] load: - 248044 ('<|endoftext|>')
[58673] load: - 248046 ('<|im_end|>')
[58673] load: - 248063 ('<|fim_pad|>')
[58673] load: - 248064 ('<|repo_name|>')
[58673] load: - 248065 ('<|file_sep|>')
[58673] load: special tokens cache size = 33
[58673] load: token to piece cache size = 1.7581 MB
[58673] print_info: arch = qwen35moe
[58673] print_info: vocab_only = 0
[58673] print_info: no_alloc = 0
[58673] print_info: n_ctx_train = 262144
[58673] print_info: n_embd = 2048
[58673] print_info: n_embd_inp = 2048
[58673] print_info: n_layer = 40
[58673] print_info: n_head = 16
[58673] print_info: n_head_kv = 2
[58673] print_info: n_rot = 64
[58673] print_info: n_swa = 0
[58673] print_info: is_swa_any = 0
[58673] print_info: n_embd_head_k = 256
[58673] print_info: n_embd_head_v = 256
[58673] print_info: n_gqa = 8
[58673] print_info: n_embd_k_gqa = 512
[58673] print_info: n_embd_v_gqa = 512
[58673] print_info: f_norm_eps = 0.0e+00
[58673] print_info: f_norm_rms_eps = 1.0e-06
[58673] print_info: f_clamp_kqv = 0.0e+00
[58673] print_info: f_max_alibi_bias = 0.0e+00
[58673] print_info: f_logit_scale = 0.0e+00
[58673] print_info: f_attn_scale = 0.0e+00
[58673] print_info: n_ff = 0
[58673] print_info: n_expert = 256
[58673] print_info: n_expert_used = 8
[58673] print_info: n_expert_groups = 0
[58673] print_info: n_group_used = 0
[58673] print_info: causal attn = 1
[58673] print_info: pooling type = 0
[58673] print_info: rope type = 40
[58673] print_info: rope scaling = linear
[58673] print_info: freq_base_train = 10000000.0
[58673] print_info: freq_scale_train = 1
[58673] print_info: n_ctx_orig_yarn = 262144
[58673] print_info: rope_yarn_log_mul = 0.0000
[58673] print_info: rope_finetuned = unknown
[58673] print_info: mrope sections = [11, 11, 10, 0]
[58673] print_info: ssm_d_conv = 4
[58673] print_info: ssm_d_inner = 4096
[58673] print_info: ssm_d_state = 128
[58673] print_info: ssm_dt_rank = 32
[58673] print_info: ssm_n_group = 16
[58673] print_info: ssm_dt_b_c_rms = 0
[58673] print_info: model type = ?B
[58673] print_info: model params = 34.66 B
[58673] print_info: general.name = Qwen3.5-35B-A3B
[58673] print_info: vocab type = BPE
[58673] print_info: n_vocab = 248320
[58673] print_info: n_merges = 247587
[58673] print_info: BOS token = 11 ','
[58673] print_info: EOS token = 248046 '<|im_end|>'
[58673] print_info: EOT token = 248046 '<|im_end|>'
[58673] print_info: PAD token = 248055 '<|vision_pad|>'
[58673] print_info: LF token = 198 'Ċ'
[58673] print_info: FIM PRE token = 248060 '<|fim_prefix|>'
[58673] print_info: FIM SUF token = 248062 '<|fim_suffix|>'
[58673] print_info: FIM MID token = 248061 '<|fim_middle|>'
[58673] print_info: FIM PAD token = 248063 '<|fim_pad|>'
[58673] print_info: FIM REP token = 248064 '<|repo_name|>'
[58673] print_info: FIM SEP token = 248065 '<|file_sep|>'
[58673] print_info: EOG token = 248044 '<|endoftext|>'
[58673] print_info: EOG token = 248046 '<|im_end|>'
[58673] print_info: EOG token = 248063 '<|fim_pad|>'
[58673] print_info: EOG token = 248064 '<|repo_name|>'
[58673] print_info: EOG token = 248065 '<|file_sep|>'
[58673] print_info: max token length = 256
[58673] load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
[58673] load_tensors: offloading output layer to GPU
[58673] load_tensors: offloading 39 repeating layers to GPU
[58673] load_tensors: offloaded 41/41 layers to GPU
[58673] load_tensors: CPU_Mapped model buffer size = 515.31 MiB
[58673] load_tensors: Vulkan0 model buffer size = 20685.78 MiB
[58673] ..................................................................................................
[58673] common_init_result: added <|endoftext|> logit bias = -inf
[58673] common_init_result: added <|im_end|> logit bias = -inf
[58673] common_init_result: added <|fim_pad|> logit bias = -inf
[58673] common_init_result: added <|repo_name|> logit bias = -inf
[58673] common_init_result: added <|file_sep|> logit bias = -inf
[58673] llama_context: constructing llama_context
[58673] llama_context: n_seq_max = 1
[58673] llama_context: n_ctx = 16384
[58673] llama_context: n_ctx_seq = 16384
[58673] llama_context: n_batch = 2048
[58673] llama_context: n_ubatch = 512
[58673] llama_context: causal_attn = 1
[58673] llama_context: flash_attn = auto
[58673] llama_context: kv_unified = false
[58673] llama_context: freq_base = 10000000.0
[58673] llama_context: freq_scale = 1
[58673] llama_context: n_ctx_seq (16384) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
[58673] llama_context: Vulkan_Host output buffer size = 0.95 MiB
[58673] llama_kv_cache: Vulkan0 KV buffer size = 320.00 MiB
[58673] llama_kv_cache: size = 320.00 MiB ( 16384 cells, 10 layers, 1/1 seqs), K (f16): 160.00 MiB, V (f16): 160.00 MiB
[58673] llama_memory_recurrent: Vulkan0 RS buffer size = 62.81 MiB
[58673] llama_memory_recurrent: size = 62.81 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 2.81 MiB, S (f32): 60.00 MiB
[58673] sched_reserve: reserving ...
[58673] sched_reserve: Flash Attention was auto, set to enabled
[58673] sched_reserve: Vulkan0 compute buffer size = 498.00 MiB
[58673] sched_reserve: Vulkan_Host compute buffer size = 40.03 MiB
[58673] sched_reserve: graph nodes = 9399 (with bs=512), 4389 (with bs=1)
[58673] sched_reserve: graph splits = 2
[58673] sched_reserve: reserve took 44.62 ms, sched copies = 1
[58673] common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[58673] srv load_model: initializing slots, n_slots = 1
[58673] common_speculative_is_compat: the target context does not support partial sequence removal
[58673] srv load_model: speculative decoding not supported by this context
[58673] slot load_model: id 0 | task -1 | new slot, n_ctx = 16384
[58673] srv load_model: prompt cache is enabled, size limit: 8192 MiB
[58673] srv load_model: use `--cache-ram 0` to disable the prompt cache
[58673] srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
[58673] init: chat template, example_format: '<|im_start|>system
[58673] You are a helpful assistant<|im_end|>
[58673] <|im_start|>user
[58673] Hello<|im_end|>
[58673] <|im_start|>assistant
[58673] Hi there<|im_end|>
[58673] <|im_start|>user
[58673] How are you?<|im_end|>
[58673] <|im_start|>assistant
[58673] <think>
[58673] '
[58673] srv init: init: chat template, thinking = 0
[58673] main: model loaded
[58673] main: server is listening on http://127.0.0.1:58673
[58673] main: starting the main loop...
[58673] cmd_child_to_router:ready
srv proxy_reques: proxying request to model Qwen3.5-35B-A3B-GGUF on port 58673
[58673] srv update_slots: all slots are idle
[58673] srv operator(): child server monitoring thread started, waiting for EOF on stdin...
[58673] srv params_from_: Chat format: peg-constructed
[58673] slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
[58673] slot launch_slot_: id 0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
[58673] slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
[58673] slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 16384, n_keep = 0, task.n_tokens = 441
[58673] slot update_slots: id 0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
[58673] slot init_sampler: id 0 | task 0 | init sampler, took 0.09 ms, tokens: text = 441, total = 441
[58673] slot update_slots: id 0 | task 0 | prompt processing done, n_tokens = 441, batch.n_tokens = 441
srv proxy_reques: proxying request to model Qwen3.5-35B-A3B-GGUF on port 58673
[58673] srv params_from_: Chat format: peg-constructed
[58673] slot print_timing: id 0 | task 0 |
[58673] prompt eval time = 2070.27 ms / 441 tokens ( 4.69 ms per token, 213.02 tokens per second)
[58673] eval time = 6776.65 ms / 113 tokens ( 59.97 ms per token, 16.67 tokens per second)
[58673] total time = 8846.92 ms / 554 tokens
[58673] slot release: id 0 | task 0 | stop processing: n_tokens = 553, truncated = 0
[58673] slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.765 (> 0.100 thold), f_keep = 0.564
[58673] slot launch_slot_: id 0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
[58673] slot launch_slot_: id 0 | task 49 | processing task, is_child = 0
[58673] slot update_slots: id 0 | task 49 | new prompt, n_ctx_slot = 16384, n_keep = 0, task.n_tokens = 408
[58673] slot update_slots: id 0 | task 49 | n_past = 312, slot.prompt.tokens.size() = 553, seq_id = 0, pos_min = 552, n_swa = 1
[58673] slot update_slots: id 0 | task 49 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
[58673] slot update_slots: id 0 | task 49 | n_tokens = 0, memory_seq_rm [0, end)
[58673] srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
[58673] slot init_sampler: id 0 | task 49 | init sampler, took 0.07 ms, tokens: text = 408, total = 408
[58673] slot update_slots: id 0 | task 49 | prompt processing done, n_tokens = 408, batch.n_tokens = 408
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv proxy_reques: proxying request to model Qwen3.5-35B-A3B-GGUF on port 58673
[58673] srv params_from_: Chat format: peg-constructed
[58673] slot print_timing: id 0 | task 49 |
[58673] prompt eval time = 1969.10 ms / 408 tokens ( 4.83 ms per token, 207.20 tokens per second)
[58673] eval time = 5705.37 ms / 95 tokens ( 60.06 ms per token, 16.65 tokens per second)
[58673] total time = 7674.48 ms / 503 tokens
[58673] slot release: id 0 | task 49 | stop processing: n_tokens = 502, truncated = 0
[58673] slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.780 (> 0.100 thold), f_keep = 0.622
[58673] slot launch_slot_: id 0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
[58673] slot launch_slot_: id 0 | task 184 | processing task, is_child = 0
[58673] slot update_slots: id 0 | task 184 | new prompt, n_ctx_slot = 16384, n_keep = 0, task.n_tokens = 400
[58673] slot update_slots: id 0 | task 184 | n_past = 312, slot.prompt.tokens.size() = 502, seq_id = 0, pos_min = 501, n_swa = 1
[58673] slot update_slots: id 0 | task 184 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
[58673] slot update_slots: id 0 | task 184 | n_tokens = 0, memory_seq_rm [0, end)
[58673] srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
[58673] slot init_sampler: id 0 | task 184 | init sampler, took 0.09 ms, tokens: text = 400, total = 400
[58673] slot update_slots: id 0 | task 184 | prompt processing done, n_tokens = 400, batch.n_tokens = 400
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
[58673] slot print_timing: id 0 | task 184 |
[58673] prompt eval time = 2002.51 ms / 400 tokens ( 5.01 ms per token, 199.75 tokens per second)
[58673] eval time = 4974.63 ms / 83 tokens ( 59.94 ms per token, 16.68 tokens per second)
[58673] total time = 6977.13 ms / 483 tokens
[58673] slot release: id 0 | task 184 | stop processing: n_tokens = 482, truncated = 0
[58673] srv update_slots: all slots are idle
[58673] srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv proxy_reques: proxying request to model Qwen3.5-35B-A3B-GGUF on port 58673
[58673] srv params_from_: Chat format: peg-constructed
[58673] slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.794 (> 0.100 thold), f_keep = 0.647
[58673] slot launch_slot_: id 0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
[58673] slot launch_slot_: id 0 | task 294 | processing task, is_child = 0
[58673] slot update_slots: id 0 | task 294 | new prompt, n_ctx_slot = 16384, n_keep = 0, task.n_tokens = 393
[58673] slot update_slots: id 0 | task 294 | n_past = 312, slot.prompt.tokens.size() = 482, seq_id = 0, pos_min = 481, n_swa = 1
[58673] slot update_slots: id 0 | task 294 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
[58673] slot update_slots: id 0 | task 294 | n_tokens = 0, memory_seq_rm [0, end)
[58673] slot init_sampler: id 0 | task 294 | init sampler, took 0.11 ms, tokens: text = 393, total = 393
[58673] slot update_slots: id 0 | task 294 | prompt processing done, n_tokens = 393, batch.n_tokens = 393
srv proxy_reques: proxying request to model Qwen3.5-35B-A3B-GGUF on port 58673
[58673] srv params_from_: Chat format: peg-constructed
srv proxy_reques: proxying request to model Qwen3.5-35B-A3B-GGUF on port 58673
[58673] srv params_from_: Chat format: peg-constructed
srv proxy_reques: proxying request to model Qwen3.5-35B-A3B-GGUF on port 58673
[58673] srv params_from_: Chat format: peg-constructed
srv proxy_reques: proxying request to model Qwen3.5-35B-A3B-GGUF on port 58673
[58673] srv params_from_: Chat format: peg-constructed
srv proxy_reques: proxying request to model Qwen3.5-35B-A3B-GGUF on port 58673
[58673] srv params_from_: Chat format: peg-constructed
srv proxy_reques: proxying request to model Qwen3.5-35B-A3B-GGUF on port 58673
[58673] srv params_from_: Chat format: peg-constructed
srv proxy_reques: proxying request to model Qwen3.5-35B-A3B-GGUF on port 58673
[58673] srv params_from_: Chat format: peg-constructed
srv proxy_reques: proxying request to model Qwen3.5-35B-A3B-GGUF on port 58673
[58673] srv params_from_: Chat format: peg-constructed
srv proxy_reques: proxying request to model Qwen3.5-35B-A3B-GGUF on port 58673
[58673] srv params_from_: Chat format: peg-constructed