
0cc4m (Collaborator) commented Oct 31, 2025

Add k-quant mul_mat_vec support and enable the MUL_MAT_ID integer dot vector path.

Tuning this is quite difficult. I've included an attempt, but I'm not done. I'll add performance numbers later.

Q3_K and Q6_K currently don't perform well at all; I'm still trying to figure out why.
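For context, the MMVQ path evaluates the matrix-vector product directly on quantized data: the activation vector is quantized to 8-bit blocks on the fly, packed 8-bit products are accumulated with the GPU's integer dot product instructions (dp4a on CUDA; GL_EXT_integer_dot_product in Vulkan GLSL), and the per-block scales are applied to the integer result afterwards. A minimal scalar model of the core operation (illustrative C++, not the shader code):

```cpp
#include <cstdint>

// Scalar stand-in for a packed 4x8-bit signed integer dot product with
// accumulate, the primitive the MMVQ shaders are built around.
static int32_t dp4a(uint32_t a, uint32_t b, int32_t acc) {
    for (int i = 0; i < 4; ++i) {
        const int8_t av = (int8_t) ((a >> (8 * i)) & 0xff);
        const int8_t bv = (int8_t) ((b >> (8 * i)) & 0xff);
        acc += (int32_t) av * (int32_t) bv;
    }
    return acc;
}
```

This replaces a chain of float dequantize-multiply-add operations per value with one integer instruction per four values, which is why it helps most where the fp16/fp32 vector path is math-limited rather than bandwidth-limited.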

github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Oct 31, 2025
0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec-k-quants branch from d5192bf to d2f8f00 on November 1, 2025 11:31
0cc4m (Collaborator, Author) commented Nov 1, 2025

AMD Radeon Pro VII

| model | size | params | ngl | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 63.49 ± 0.20 | 71.40 ± 0.24 | 83.84 ± 0.26 | +17.4% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 64.74 ± 0.12 | 67.75 ± 0.09 | 78.96 ± 0.20 | +16.5% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 0 | tg128 | 48.80 ± 0.08 | 60.59 ± 0.14 | 59.91 ± 0.24 | -1.1% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 1 | tg128 | 49.47 ± 0.44 | 58.06 ± 0.11 | 57.43 ± 0.04 | -1.1% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 65.92 ± 0.15 | 72.60 ± 0.17 | 76.77 ± 0.24 | +5.7% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 67.66 ± 0.18 | 69.41 ± 0.12 | 72.90 ± 0.19 | +5.0% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 0 | tg128 | 19.10 ± 0.16 | 19.11 ± 0.09 | 24.50 ± 0.16 | +28.2% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 | 19.00 ± 0.05 | 18.24 ± 0.21 | 23.61 ± 0.22 | +29.4% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 0 | tg128 | 80.04 ± 0.02 | 90.66 ± 0.17 | 87.32 ± 0.46 | -3.7% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 80.24 ± 0.10 | 86.01 ± 5.01 | 86.50 ± 0.53 | +0.6% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 67.68 ± 0.06 | 82.89 ± 0.22 | 85.36 ± 0.61 | +3.0% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 70.80 ± 0.03 | 75.71 ± 0.17 | 77.52 ± 0.12 | +2.4% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 107.99 ± 0.65 | 127.26 ± 0.27 | 128.89 ± 0.75 | +1.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 114.36 ± 0.11 | 125.49 ± 0.07 | 126.27 ± 0.37 | +0.6% |

AMD Radeon RX 6800 XT

| model | size | params | ngl | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 93.30 ± 0.25 | 115.95 ± 3.40 | 122.98 ± 0.14 | +6.1% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 95.99 ± 0.11 | 109.65 ± 1.76 | 113.62 ± 0.02 | +3.6% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 75.50 ± 0.01 | 93.13 ± 0.05 | 90.81 ± 0.01 | -2.5% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 77.68 ± 0.00 | 88.41 ± 0.04 | 86.52 ± 0.01 | -2.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 101.67 ± 0.04 | 148.71 ± 0.08 | 151.96 ± 0.03 | +2.2% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 106.92 ± 0.01 | 136.12 ± 0.39 | 137.91 ± 0.04 | +1.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 120.05 ± 0.05 | 145.28 ± 0.05 | 145.86 ± 0.02 | +0.4% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 124.10 ± 0.00 | 142.70 ± 0.06 | 143.23 ± 0.04 | +0.4% |

Intel A770

| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 29.90 ± 0.32 | 44.53 ± 0.74 | +48.9% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 19.55 ± 0.01 | 26.37 ± 0.00 | +34.9% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 0 | tg128 | 15.91 ± 0.01 | 15.92 ± 0.02 | +0.1% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 1 | tg128 | 12.52 ± 0.03 | 12.56 ± 0.01 | +0.3% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 38.36 ± 0.04 | 47.72 ± 0.05 | +24.4% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 29.89 ± 0.01 | 34.91 ± 0.02 | +16.8% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 0 | tg128 | 12.00 ± 0.01 | 14.29 ± 1.43 | +19.1% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 | 10.46 ± 0.02 | 11.90 ± 0.34 | +13.8% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 0 | tg128 | 46.88 ± 2.27 | 49.79 ± 5.03 | +6.2% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 47.69 ± 0.42 | 51.01 ± 0.11 | +7.0% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 43.62 ± 0.04 | 41.81 ± 0.21 | -4.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 28.22 ± 0.05 | 28.22 ± 0.01 | +0.0% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 23.94 ± 0.03 | 39.25 ± 0.02 | +64.0% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 22.87 ± 0.05 | 36.10 ± 0.01 | +57.8% |

RTX 3090

| model | size | params | ngl | fa | test | t/s (CUDA) | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 138.00 ± 0.66 | 114.32 ± 0.45 | 112.74 ± 0.36 | -1.4% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 136.82 ± 0.35 | 116.74 ± 0.35 | 114.95 ± 0.29 | -1.5% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 0 | tg128 | 105.80 ± 0.29 | 98.13 ± 0.18 | 95.82 ± 0.58 | -2.4% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 1 | tg128 | 105.10 ± 0.27 | 100.27 ± 0.37 | 96.59 ± 0.37 | -3.7% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 145.41 ± 0.43 | 123.22 ± 0.41 | 121.58 ± 2.54 | -1.3% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 144.52 ± 0.09 | 125.32 ± 0.18 | 126.04 ± 0.19 | +0.6% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 0 | tg128 | 48.59 ± 0.03 | 38.82 ± 0.63 | 41.02 ± 0.18 | +5.7% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 | 48.44 ± 0.06 | 39.31 ± 0.14 | 41.31 ± 0.09 | +5.1% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 0 | tg128 | 141.75 ± 0.46 | 143.90 ± 0.91 | 145.12 ± 1.67 | +0.8% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 141.72 ± 0.44 | 144.40 ± 0.24 | 145.24 ± 0.20 | +0.6% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 165.61 ± 1.53 | 151.74 ± 7.18 | 153.97 ± 0.99 | +1.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 162.49 ± 0.32 | 159.56 ± 1.25 | 159.13 ± 0.85 | -0.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 205.45 ± 1.12 | 153.52 ± 12.40 | 160.16 ± 17.99 | +4.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 210.33 ± 0.86 | 159.12 ± 0.81 | 172.44 ± 0.27 | +8.4% |

0cc4m marked this pull request as ready for review on November 1, 2025 11:47
0cc4m requested a review from jeffbolznv on November 1, 2025 11:48
jeffbolznv (Collaborator) left a comment:
I only did a quick read-through. I'll do some perf testing soon.

0cc4m (Collaborator, Author) commented Nov 2, 2025

As usual, I appear to have caused an llvmpipe issue. I'll look into it.

jeffbolznv (Collaborator) commented:

Some initial perf results:

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |       239.48 ± 11.34 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        201.44 ± 7.81 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        129.84 ± 4.07 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       872.67 ± 15.33 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       845.99 ± 13.20 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       391.09 ± 24.08 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       265.33 ± 14.59 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |       251.59 ± 17.44 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       305.19 ± 28.81 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       301.64 ± 24.09 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |       356.71 ± 17.34 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        273.06 ± 2.17 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       317.10 ± 15.70 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         91.93 ± 0.22 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         49.29 ± 0.22 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         91.03 ± 1.52 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         70.20 ± 0.40 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |         48.53 ± 0.66 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       431.26 ± 28.74 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       397.86 ± 23.85 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        167.72 ± 3.56 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       153.41 ± 10.78 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        103.66 ± 3.49 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       173.04 ± 12.22 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |         37.22 ± 0.54 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        159.48 ± 1.35 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        108.88 ± 0.43 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        125.48 ± 0.54 |

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |       238.12 ± 12.03 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        202.69 ± 5.07 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        133.12 ± 4.19 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       855.76 ± 15.46 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |      641.24 ± 260.16 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       396.68 ± 14.22 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        264.39 ± 8.21 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |       250.60 ± 18.72 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       317.92 ± 10.59 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       325.54 ± 12.60 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |       358.63 ± 16.21 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        277.27 ± 4.62 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        327.73 ± 7.12 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         92.43 ± 2.13 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         50.05 ± 0.23 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         91.30 ± 0.94 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         71.16 ± 0.26 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |         49.35 ± 0.18 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        461.59 ± 1.94 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        420.99 ± 1.95 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        167.92 ± 2.62 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        152.94 ± 8.52 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        106.06 ± 3.89 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       178.63 ± 16.11 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |         41.86 ± 1.68 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        160.77 ± 1.69 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        108.78 ± 1.08 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        125.95 ± 0.12 |

I reran some of the models with the biggest deltas. Most of them seem to be noise, but the improvement for gpt-oss MXFP4 is real:

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       314.61 ± 23.74 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        323.84 ± 1.17 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        322.33 ± 2.26 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        319.46 ± 2.80 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        318.55 ± 3.96 |

build: 5d8bb900b (6910)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\llama-3.2-3b-instruct-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        332.90 ± 5.17 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        333.56 ± 0.96 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        330.42 ± 7.14 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        330.52 ± 6.45 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        334.98 ± 1.17 |

build: 5d8bb900b (6910)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       327.08 ± 19.41 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        334.18 ± 5.79 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        339.58 ± 3.17 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        338.76 ± 2.68 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        337.12 ± 5.83 |

build: 5d8bb900b (6910)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        132.41 ± 3.78 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        131.42 ± 0.73 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        131.74 ± 0.18 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        131.36 ± 0.23 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        131.26 ± 0.30 |

after:
Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       331.53 ± 16.17 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        335.87 ± 1.67 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        334.85 ± 4.53 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        334.90 ± 2.64 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        333.53 ± 3.58 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\llama-3.2-3b-instruct-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        333.99 ± 2.56 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        333.84 ± 1.31 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        330.21 ± 5.07 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        327.78 ± 6.82 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        334.95 ± 1.13 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       321.82 ± 31.23 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        329.96 ± 4.85 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        335.48 ± 2.55 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        334.77 ± 6.32 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        334.00 ± 5.05 |

build: b153aac38 (6921)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        131.75 ± 3.42 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        130.28 ± 0.68 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        130.52 ± 0.39 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        130.62 ± 0.41 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        130.60 ± 0.40 |

0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec-k-quants branch from b153aac to 1b78909 on November 7, 2025 19:51
0cc4m (Collaborator, Author) commented Nov 7, 2025

> Most of them seem to be noise, but the improvement for gpt-oss MXFP4 is real

The funny thing about that is that I didn't even enable the MMVQ path for Nvidia Turing+ on MXFP4. Not sure what is going on there.

I still have some tuning to do here; my Strix Halo device isn't liking this PR yet.

0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec-k-quants branch from 1b78909 to 937f992 on November 15, 2025 13:26
0cc4m (Collaborator, Author) commented Nov 15, 2025

The tuning seems okay now, even though I didn't change anything. @jeffbolznv, please take another look. Did you have any concerns with your benchmarks?

Here are updated results:

AMD Radeon 8060S

| model | size | params | ngl | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 44.26 ± 0.95 | 56.34 ± 2.35 | 60.17 ± 1.18 | +6.8% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 45.61 ± 0.05 | 52.61 ± 1.51 | 58.76 ± 0.07 | +11.7% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 0 | tg128 | 34.70 ± 0.14 | 46.78 ± 1.25 | 48.09 ± 1.42 | +2.8% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 1 | tg128 | 35.60 ± 0.03 | 46.05 ± 0.23 | 47.44 ± 0.22 | +3.0% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 35.68 ± 0.05 | 43.27 ± 0.23 | 43.42 ± 0.32 | +0.3% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 37.39 ± 0.05 | 42.69 ± 0.12 | 43.55 ± 0.06 | +2.0% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 0 | tg128 | 68.22 ± 0.22 | 91.84 ± 5.25 | 90.58 ± 1.56 | -1.4% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 68.39 ± 0.40 | 89.59 ± 0.81 | 89.12 ± 0.93 | -0.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 55.58 ± 0.44 | 91.88 ± 0.94 | 95.63 ± 0.64 | +4.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 59.90 ± 0.38 | 88.72 ± 0.91 | 93.37 ± 0.39 | +5.2% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 61.70 ± 0.17 | 69.55 ± 1.05 | 72.25 ± 0.74 | +3.9% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 64.48 ± 0.33 | 70.75 ± 0.44 | 73.16 ± 0.17 | +3.4% |

AMD RX 6800 XT

| model | size | params | ngl | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 91.24 ± 0.24 | 117.51 ± 0.54 | 123.59 ± 0.21 | +5.2% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 96.87 ± 0.10 | 111.03 ± 0.15 | 114.65 ± 0.01 | +3.3% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 73.90 ± 0.01 | 93.52 ± 0.05 | 91.08 ± 0.02 | -2.6% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 78.24 ± 0.00 | 88.73 ± 0.02 | 86.83 ± 0.01 | -2.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 97.29 ± 0.07 | 154.58 ± 1.14 | 162.02 ± 0.03 | +4.8% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 105.90 ± 0.00 | 139.83 ± 0.03 | 145.91 ± 0.02 | +4.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 116.54 ± 0.01 | 145.73 ± 0.76 | 147.00 ± 0.01 | +0.9% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 122.11 ± 0.01 | 143.62 ± 0.02 | 144.69 ± 0.05 | +0.7% |

AMD Radeon Pro VII

| model | size | params | ngl | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 61.77 ± 0.21 | 72.03 ± 0.31 | 86.22 ± 0.39 | +19.7% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 64.29 ± 0.13 | 68.72 ± 0.16 | 81.18 ± 0.89 | +18.1% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 0 | tg128 | 47.98 ± 0.08 | 61.70 ± 0.18 | 60.99 ± 0.59 | -1.2% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 1 | tg128 | 49.86 ± 0.00 | 59.22 ± 0.08 | 58.43 ± 0.24 | -1.3% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 65.02 ± 0.17 | 73.87 ± 0.39 | 78.50 ± 0.44 | +6.3% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 68.17 ± 0.11 | 70.42 ± 0.25 | 75.18 ± 0.17 | +6.8% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 0 | tg128 | 19.27 ± 0.16 | 19.19 ± 0.02 | 24.82 ± 0.09 | +29.3% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 | 19.41 ± 0.03 | 18.88 ± 0.05 | 24.08 ± 0.13 | +27.5% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 0 | tg128 | 80.06 ± 0.09 | 83.66 ± 5.57 | 82.35 ± 2.57 | -1.6% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 80.35 ± 0.11 | 78.49 ± 1.53 | 82.49 ± 2.99 | +5.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 62.83 ± 0.02 | 85.69 ± 0.94 | 90.46 ± 1.04 | +5.6% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 68.49 ± 0.00 | 77.31 ± 0.79 | 81.04 ± 0.60 | +4.8% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 101.34 ± 0.04 | 128.84 ± 0.12 | 128.34 ± 1.83 | -0.4% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 113.33 ± 0.08 | 125.89 ± 0.23 | 126.20 ± 1.51 | +0.2% |

Intel A770

| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 30.20 ± 0.43 | 44.36 ± 0.85 | +46.9% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 20.15 ± 0.01 | 27.51 ± 0.06 | +36.5% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 0 | tg128 | 15.83 ± 0.03 | 15.88 ± 0.04 | +0.3% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 1 | tg128 | 12.77 ± 0.02 | 12.79 ± 0.02 | +0.2% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 38.31 ± 0.05 | 46.81 ± 0.73 | +22.2% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 31.17 ± 0.12 | 36.67 ± 0.10 | +17.6% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 0 | tg128 | 12.02 ± 0.01 | 14.82 ± 1.23 | +23.3% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 | 10.70 ± 0.00 | 12.08 ± 0.33 | +12.9% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 0 | tg128 | 46.69 ± 3.24 | 48.46 ± 2.09 | +3.8% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 49.76 ± 0.08 | 51.98 ± 0.06 | +4.5% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 23.34 ± 0.02 | 38.08 ± 0.12 | +63.1% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 22.62 ± 0.06 | 36.32 ± 0.08 | +60.6% |

Nvidia RTX 3090

| model | size | params | ngl | fa | test | t/s (CUDA) | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 137.34 ± 0.39 | 118.30 ± 0.45 | 117.37 ± 0.28 | -0.8% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 139.96 ± 0.52 | 120.61 ± 0.18 | 119.81 ± 0.30 | -0.7% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 0 | tg128 | 106.20 ± 0.37 | 101.26 ± 0.43 | 100.46 ± 0.36 | -0.8% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 1 | tg128 | 107.55 ± 0.21 | 102.67 ± 0.69 | 101.69 ± 0.70 | -1.0% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 143.67 ± 0.30 | 120.73 ± 4.47 | 121.30 ± 5.42 | +0.5% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 146.97 ± 0.18 | 125.54 ± 0.94 | 126.59 ± 2.29 | +0.8% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 0 | tg128 | 48.35 ± 0.01 | 40.43 ± 0.48 | 42.24 ± 0.13 | +4.5% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 | 48.82 ± 0.05 | 40.85 ± 0.14 | 42.32 ± 0.05 | +3.6% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 0 | tg128 | 141.25 ± 0.92 | 142.11 ± 12.00 | 145.57 ± 12.59 | +2.4% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 142.04 ± 0.34 | 149.71 ± 1.52 | 151.22 ± 0.42 | +1.0% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 155.74 ± 1.15 | 153.97 ± 16.85 | 153.30 ± 16.31 | -0.4% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 164.32 ± 3.05 | 167.49 ± 1.56 | 167.34 ± 1.11 | -0.1% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 197.83 ± 0.79 | 154.47 ± 12.95 | 163.89 ± 14.45 | +6.1% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 207.36 ± 0.49 | 164.30 ± 0.59 | 175.28 ± 1.24 | +6.7% |

0cc4m (Collaborator, Author) commented Nov 16, 2025

Something's broken in the nvidia-vulkan-cm and cm2 runs; I'll look into it.

0cc4m (Collaborator, Author) commented Nov 16, 2025

I can't reproduce the problem, even on my RTX 3090, with coopmat2, coopmat, or without coopmat. Not sure what is going on. It looks like incoherence, but the example runs just fine for me. @jeffbolznv, any ideas?

jeffbolznv (Collaborator) commented:

I pulled the branch but wasn't able to reproduce the failure. I don't have any great ideas - maybe some missing bounds checking?

0cc4m (Collaborator, Author) commented Nov 16, 2025

Bounds checking is much simpler in MMVQ, since all the inputs come in blocks of 256, 128, or 32 values. I didn't change how the output is stored, so I don't think that's the cause.
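To illustrate the point (a sketch with hypothetical names, not the actual shader): because every input row is a whole number of fixed-size blocks, the per-block loop bound is exact and the inputs never need per-element guards.

```cpp
// Sketch, hypothetical names: with block-granular inputs the loop bound is
// exact, so no input-side bounds check is needed.
constexpr int QK_K = 256;  // k-quant superblock size; other quants use 128 or 32

static float mmvq_row_ref(const int8_t * row_q, const int8_t * vec_q, int ncols) {
    int32_t acc = 0;
    const int nblocks = ncols / QK_K;  // ncols is always a multiple of the block size
    for (int ib = 0; ib < nblocks; ++ib) {
        for (int i = 0; i < QK_K; ++i) {
            acc += (int32_t) row_q[ib * QK_K + i] * (int32_t) vec_q[ib * QK_K + i];
        }
    }
    return (float) acc;  // the real kernels also fold in per-block scales here
}
```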

0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec-k-quants branch from 3c22e38 to e086733 on November 19, 2025 15:27
0cc4m (Collaborator, Author) commented Nov 19, 2025

I would like to merge this, but the CI keeps failing in a way I can't reproduce or understand. cmake-vulkan now failed with illegal instruction crashes on llvmpipe. What is going on there?

Acly (Collaborator) commented Nov 19, 2025

I had those "Illegal instruction" failures once in a PR, and it turned out to be a bad ccache. Maybe you can clear it and re-run that test.

0cc4m (Collaborator, Author) commented Nov 20, 2025

How do I clear it?

Acly (Collaborator) commented Nov 20, 2025

My guess: find the ccache entry related to this PR in https://github.com/ggml-org/llama.cpp/actions/caches and delete it. I don't have the required permissions; maybe you do. @slaren did it at the time.

0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec-k-quants branch from e086733 to e69d645 on November 22, 2025 09:57
0cc4m marked this pull request as draft on November 22, 2025 12:11
0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec-k-quants branch from 9d0f9af to 9cbe4f8 on November 23, 2025 09:39
github-actions bot added the testing (Everything test related) label on Nov 23, 2025
0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec-k-quants branch from 9cbe4f8 to ad5127d on November 27, 2025 05:35
0cc4m marked this pull request as ready for review on November 27, 2025 05:35
0cc4m requested a review from jeffbolznv on November 27, 2025 14:57
jeffbolznv (Collaborator) commented:

A quick before/after; I may not have time to review until tomorrow.

before

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        253.63 ± 0.38 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        210.66 ± 4.25 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        136.78 ± 1.85 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        955.02 ± 4.19 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        951.54 ± 3.72 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        433.50 ± 1.27 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       311.57 ± 14.30 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        269.96 ± 7.90 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       353.91 ± 15.40 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        340.22 ± 0.56 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        391.69 ± 1.45 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        290.78 ± 2.08 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        350.00 ± 0.67 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         94.54 ± 0.83 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         50.24 ± 0.14 |

after

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        255.27 ± 1.76 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        209.36 ± 8.68 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        135.55 ± 3.51 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        952.00 ± 3.88 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        950.21 ± 5.38 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        433.62 ± 1.05 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        307.18 ± 7.16 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        273.64 ± 1.35 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       347.02 ± 26.65 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       355.92 ± 33.07 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        390.81 ± 1.94 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        294.21 ± 0.42 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        349.87 ± 2.96 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         94.99 ± 1.15 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         50.96 ± 0.06 |

I still see the speedup for gpt-oss MXFP4. Is this still unexpected? If so, I can try to dig in and find out what's going on.

0cc4m (Collaborator, Author) commented Nov 28, 2025

I disabled MMVQ for MXFP4 on modern Nvidia, so the only difference is that I enabled subgroup paths for mul_mat_vec_id. Feel free to look into it if you want. I didn't look much into tuning; for now I just copied the approach from mul_mat_vec.
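Conceptually, the subgroup path swaps the shared-memory staging of partial sums for a single subgroup reduction (subgroupAdd in GLSL); a scalar model of what that reduction computes:

```cpp
// Scalar model of a subgroup reduction over per-lane partial sums; on the
// GPU this is one subgroupAdd rather than a shared-memory round trip.
static float subgroup_add_model(const float * lane_partial, int subgroup_size) {
    float sum = 0.0f;
    for (int lane = 0; lane < subgroup_size; ++lane) {
        sum += lane_partial[lane];
    }
    return sum;
}
```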

jeffbolznv (Collaborator) left a comment:

I didn't review all the shader logic in detail, but I reviewed the rest, and also ran test-backend-ops with GGML_VK_FORCE_MMVQ and it passed.

return false;
}

// General issue with q3_k and q6_k
jeffbolznv (Collaborator) commented:

This is just a performance issue, right?

0cc4m (Collaborator, Author) replied:

Yes, I'll rephrase it to be clearer. The reason is simply that those two quants can only use 2-byte loads. Maybe it'd be worth repacking all the 2-byte/1-byte-aligned quants at some point.
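For illustration of the constraint (hypothetical helper): with only 2-byte alignment, every 32-bit word the integer dot product consumes has to be assembled from two 16-bit loads, roughly doubling the load count relative to 4-byte-aligned quants.

```cpp
#include <cstdint>

// Assembling a 32-bit dot-product operand from two aligned 16-bit loads, as
// a 2-byte-aligned quant layout forces; a 4-byte-aligned layout gets the
// same word in a single load.
static inline uint32_t load_u32_from_u16(const uint16_t * p) {
    return (uint32_t) p[0] | ((uint32_t) p[1] << 16);
}
```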

// the number of rows computed per shader depends on GPU model and quant
uint32_t rm_stdq = 1;
uint32_t rm_kq = 2;
uint32_t rm_stdq_int = 1;
jeffbolznv (Collaborator) commented:

My WSL build is still using an old glslc without int dot support, and gets these errors:


/mnt/c/github/jeffbolznv/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp: In function ‘void ggml_vk_load_shaders(vk_device&)’:
/mnt/c/github/jeffbolznv/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:3515:14: error: variable ‘rm_stdq_int’ set but not used [-Werror=unused-but-set-variable]
 3515 |     uint32_t rm_stdq_int = 1;
      |              ^~~~~~~~~~~
/mnt/c/github/jeffbolznv/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:3516:14: error: unused variable ‘rm_kq_int’ [-Werror=unused-variable]
 3516 |     uint32_t rm_kq_int = 1;
      |              ^~~~~~~~~
cc1plus: all warnings being treated as errors
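A typical way to avoid this class of warning is to tie the declarations to the same feature guard as their uses; a sketch, assuming the GGML_VULKAN_INTEGER_DOT_GLSLC_SUPPORT macro that gates the integer-dot shaders (not necessarily the exact fix applied here):

```cpp
// Sketch (assumed macro name): only declare the int-dot row counts when the
// shaders that consume them are compiled in.
#if defined(GGML_VULKAN_INTEGER_DOT_GLSLC_SUPPORT)
    uint32_t rm_stdq_int = 1;
    uint32_t rm_kq_int = 1;
#endif
```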

0cc4m (Collaborator, Author) replied:

Should be fixed.

jeffbolznv (Collaborator) commented:

I set GGML_VK_FORCE_MMVQ and saw a big speedup (+15%) on Qwen2.5-7B-Instruct-1M-Q2_K.gguf. In the past I've seen that Q2_K is small enough that it can still be math-limited rather than bandwidth-limited, so it might be worthwhile to enable MMVQ for that type on more GPUs. But that can happen in a later change.
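For reference, forcing the path during a benchmark run looks like this (Windows cmd, mirroring the invocations above):

set GGML_VK_FORCE_MMVQ=1
llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf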

0cc4m (Collaborator, Author) commented Nov 29, 2025

I didn't disable Q2_K specifically, but I disabled MMVQ for smaller k values. The threshold for enabling it should definitely be tuned further; I just roughly guessed it for my hardware.
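Conceptually the gate is just a threshold on the reduction dimension (hypothetical names; the actual value and condition live in the device tuning code):

```cpp
// Hypothetical sketch of the heuristic: take MMVQ only when k is large
// enough that the fp16/fp32 vector path would be math-limited.
static bool use_mmvq_for_k(uint32_t k, uint32_t mmvq_min_k /* device-tuned */) {
    return k >= mmvq_min_k;
}
```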

0cc4m merged commit 47a268e into master on Nov 29, 2025 (71 of 74 checks passed)
0cc4m deleted the 0cc4m/vulkan-mmq-dp4a-vec-k-quants branch on November 29, 2025 08:37