
0cc4m (Collaborator) commented Oct 31, 2025

Add k-quant mul_mat_vec support and enable the MUL_MAT_ID integer dot vector path.

Tuning this is quite difficult. I've included an attempt, but I'm not done. I'll add performance numbers later.

Q3_K and Q6_K currently don't perform well at all; I'm still trying to figure out why.
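For context, the MMVQ path evaluates the matrix-vector product directly on quantized data: the activation vector is quantized to 8-bit blocks on the fly, packed 8-bit products are accumulated with the GPU's integer dot product instructions (dp4a on CUDA; GL_EXT_integer_dot_product in Vulkan GLSL), and the per-block scales are applied to the integer result afterwards. A minimal scalar model of the core operation (illustrative C++, not the shader code):

```cpp
#include <cstdint>

// Scalar stand-in for a packed 4x8-bit signed integer dot product with
// accumulate, the primitive the MMVQ shaders are built around.
static int32_t dp4a(uint32_t a, uint32_t b, int32_t acc) {
    for (int i = 0; i < 4; ++i) {
        const int8_t av = (int8_t) ((a >> (8 * i)) & 0xff);
        const int8_t bv = (int8_t) ((b >> (8 * i)) & 0xff);
        acc += (int32_t) av * (int32_t) bv;
    }
    return acc;
}
```

This replaces a chain of float dequantize-multiply-add operations per value with one integer instruction per four values, which is why it helps most where the fp16/fp32 vector path is math-limited rather than bandwidth-limited.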

github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Oct 31, 2025
0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec-k-quants branch from d5192bf to d2f8f00 on November 1, 2025 11:31
0cc4m (Collaborator, Author) commented Nov 1, 2025

AMD Radeon Pro VII

| model | size | params | ngl | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 63.49 ± 0.20 | 71.40 ± 0.24 | 83.84 ± 0.26 | +17.4% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 64.74 ± 0.12 | 67.75 ± 0.09 | 78.96 ± 0.20 | +16.5% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 0 | tg128 | 48.80 ± 0.08 | 60.59 ± 0.14 | 59.91 ± 0.24 | -1.1% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 1 | tg128 | 49.47 ± 0.44 | 58.06 ± 0.11 | 57.43 ± 0.04 | -1.1% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 65.92 ± 0.15 | 72.60 ± 0.17 | 76.77 ± 0.24 | +5.7% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 67.66 ± 0.18 | 69.41 ± 0.12 | 72.90 ± 0.19 | +5.0% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 0 | tg128 | 19.10 ± 0.16 | 19.11 ± 0.09 | 24.50 ± 0.16 | +28.2% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 | 19.00 ± 0.05 | 18.24 ± 0.21 | 23.61 ± 0.22 | +29.4% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 0 | tg128 | 80.04 ± 0.02 | 90.66 ± 0.17 | 87.32 ± 0.46 | -3.7% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 80.24 ± 0.10 | 86.01 ± 5.01 | 86.50 ± 0.53 | +0.6% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 67.68 ± 0.06 | 82.89 ± 0.22 | 85.36 ± 0.61 | +3.0% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 70.80 ± 0.03 | 75.71 ± 0.17 | 77.52 ± 0.12 | +2.4% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 107.99 ± 0.65 | 127.26 ± 0.27 | 128.89 ± 0.75 | +1.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 114.36 ± 0.11 | 125.49 ± 0.07 | 126.27 ± 0.37 | +0.6% |

AMD Radeon RX 6800 XT

| model | size | params | ngl | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 93.30 ± 0.25 | 115.95 ± 3.40 | 122.98 ± 0.14 | +6.1% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 95.99 ± 0.11 | 109.65 ± 1.76 | 113.62 ± 0.02 | +3.6% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 75.50 ± 0.01 | 93.13 ± 0.05 | 90.81 ± 0.01 | -2.5% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 77.68 ± 0.00 | 88.41 ± 0.04 | 86.52 ± 0.01 | -2.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 101.67 ± 0.04 | 148.71 ± 0.08 | 151.96 ± 0.03 | +2.2% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 106.92 ± 0.01 | 136.12 ± 0.39 | 137.91 ± 0.04 | +1.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 120.05 ± 0.05 | 145.28 ± 0.05 | 145.86 ± 0.02 | +0.4% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 124.10 ± 0.00 | 142.70 ± 0.06 | 143.23 ± 0.04 | +0.4% |

Intel A770

| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 29.90 ± 0.32 | 44.53 ± 0.74 | +48.9% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 19.55 ± 0.01 | 26.37 ± 0.00 | +34.9% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 0 | tg128 | 15.91 ± 0.01 | 15.92 ± 0.02 | +0.1% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 1 | tg128 | 12.52 ± 0.03 | 12.56 ± 0.01 | +0.3% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 38.36 ± 0.04 | 47.72 ± 0.05 | +24.4% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 29.89 ± 0.01 | 34.91 ± 0.02 | +16.8% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 0 | tg128 | 12.00 ± 0.01 | 14.29 ± 1.43 | +19.1% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 | 10.46 ± 0.02 | 11.90 ± 0.34 | +13.8% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 0 | tg128 | 46.88 ± 2.27 | 49.79 ± 5.03 | +6.2% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 47.69 ± 0.42 | 51.01 ± 0.11 | +7.0% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 43.62 ± 0.04 | 41.81 ± 0.21 | -4.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 28.22 ± 0.05 | 28.22 ± 0.01 | +0.0% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 23.94 ± 0.03 | 39.25 ± 0.02 | +64.0% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 22.87 ± 0.05 | 36.10 ± 0.01 | +57.8% |

RTX 3090

| model | size | params | ngl | fa | test | t/s (CUDA) | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 138.00 ± 0.66 | 114.32 ± 0.45 | 112.74 ± 0.36 | -1.4% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 136.82 ± 0.35 | 116.74 ± 0.35 | 114.95 ± 0.29 | -1.5% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 0 | tg128 | 105.80 ± 0.29 | 98.13 ± 0.18 | 95.82 ± 0.58 | -2.4% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 1 | tg128 | 105.10 ± 0.27 | 100.27 ± 0.37 | 96.59 ± 0.37 | -3.7% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 145.41 ± 0.43 | 123.22 ± 0.41 | 121.58 ± 2.54 | -1.3% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 144.52 ± 0.09 | 125.32 ± 0.18 | 126.04 ± 0.19 | +0.6% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 0 | tg128 | 48.59 ± 0.03 | 38.82 ± 0.63 | 41.02 ± 0.18 | +5.7% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 | 48.44 ± 0.06 | 39.31 ± 0.14 | 41.31 ± 0.09 | +5.1% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 0 | tg128 | 141.75 ± 0.46 | 143.90 ± 0.91 | 145.12 ± 1.67 | +0.8% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 141.72 ± 0.44 | 144.40 ± 0.24 | 145.24 ± 0.20 | +0.6% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 165.61 ± 1.53 | 151.74 ± 7.18 | 153.97 ± 0.99 | +1.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 162.49 ± 0.32 | 159.56 ± 1.25 | 159.13 ± 0.85 | -0.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 205.45 ± 1.12 | 153.52 ± 12.40 | 160.16 ± 17.99 | +4.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 210.33 ± 0.86 | 159.12 ± 0.81 | 172.44 ± 0.27 | +8.4% |

0cc4m marked this pull request as ready for review on November 1, 2025 11:47
0cc4m requested a review from jeffbolznv on November 1, 2025 11:48
jeffbolznv (Collaborator) left a comment:
I only did a quick read-through. I'll do some perf testing soon.

0cc4m (Collaborator, Author) commented Nov 2, 2025

As usual, I appear to have caused an llvmpipe issue. I'll look into it.

jeffbolznv (Collaborator) commented:

Some initial perf results:

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |       239.48 ± 11.34 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        201.44 ± 7.81 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        129.84 ± 4.07 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       872.67 ± 15.33 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       845.99 ± 13.20 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       391.09 ± 24.08 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       265.33 ± 14.59 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |       251.59 ± 17.44 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       305.19 ± 28.81 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       301.64 ± 24.09 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |       356.71 ± 17.34 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        273.06 ± 2.17 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       317.10 ± 15.70 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         91.93 ± 0.22 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         49.29 ± 0.22 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         91.03 ± 1.52 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         70.20 ± 0.40 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |         48.53 ± 0.66 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       431.26 ± 28.74 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       397.86 ± 23.85 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        167.72 ± 3.56 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       153.41 ± 10.78 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        103.66 ± 3.49 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       173.04 ± 12.22 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |         37.22 ± 0.54 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        159.48 ± 1.35 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        108.88 ± 0.43 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        125.48 ± 0.54 |

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |       238.12 ± 12.03 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        202.69 ± 5.07 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        133.12 ± 4.19 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       855.76 ± 15.46 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |      641.24 ± 260.16 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       396.68 ± 14.22 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        264.39 ± 8.21 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |       250.60 ± 18.72 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       317.92 ± 10.59 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       325.54 ± 12.60 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |       358.63 ± 16.21 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        277.27 ± 4.62 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        327.73 ± 7.12 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         92.43 ± 2.13 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         50.05 ± 0.23 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         91.30 ± 0.94 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         71.16 ± 0.26 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |         49.35 ± 0.18 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        461.59 ± 1.94 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        420.99 ± 1.95 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        167.92 ± 2.62 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        152.94 ± 8.52 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        106.06 ± 3.89 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       178.63 ± 16.11 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |         41.86 ± 1.68 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        160.77 ± 1.69 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        108.78 ± 1.08 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        125.95 ± 0.12 |

I reran some of the models with the biggest deltas. Most of them seem to be noise, but the improvement for gpt-oss MXFP4 is real:

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       314.61 ± 23.74 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        323.84 ± 1.17 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        322.33 ± 2.26 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        319.46 ± 2.80 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        318.55 ± 3.96 |

build: 5d8bb900b (6910)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\llama-3.2-3b-instruct-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        332.90 ± 5.17 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        333.56 ± 0.96 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        330.42 ± 7.14 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        330.52 ± 6.45 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        334.98 ± 1.17 |

build: 5d8bb900b (6910)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       327.08 ± 19.41 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        334.18 ± 5.79 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        339.58 ± 3.17 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        338.76 ± 2.68 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        337.12 ± 5.83 |

build: 5d8bb900b (6910)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        132.41 ± 3.78 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        131.42 ± 0.73 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        131.74 ± 0.18 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        131.36 ± 0.23 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        131.26 ± 0.30 |

after:
Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       331.53 ± 16.17 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        335.87 ± 1.67 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        334.85 ± 4.53 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        334.90 ± 2.64 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        333.53 ± 3.58 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\llama-3.2-3b-instruct-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        333.99 ± 2.56 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        333.84 ± 1.31 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        330.21 ± 5.07 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        327.78 ± 6.82 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        334.95 ± 1.13 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       321.82 ± 31.23 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        329.96 ± 4.85 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        335.48 ± 2.55 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        334.77 ± 6.32 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        334.00 ± 5.05 |

build: b153aac38 (6921)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        131.75 ± 3.42 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        130.28 ± 0.68 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        130.52 ± 0.39 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        130.62 ± 0.41 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        130.60 ± 0.40 |

0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec-k-quants branch from b153aac to 1b78909 on November 7, 2025 19:51
0cc4m (Collaborator, Author) commented Nov 7, 2025

> Most of them seem to be noise, but the improvement for gpt-oss MXFP4 is real

The funny thing about that is that I didn't even enable the MMVQ path for Nvidia Turing+ on MXFP4. Not sure what is going on there.

I still have some tuning to do here; my Strix Halo device isn't liking this PR yet.

0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec-k-quants branch from 1b78909 to 937f992 on November 15, 2025 13:26
0cc4m (Collaborator, Author) commented Nov 15, 2025

The tuning seems okay now, even though I didn't change anything. @jeffbolznv, please take another look. Did you have any concerns with your benchmarks?

Here are updated results:

AMD Radeon 8060S

| model | size | params | ngl | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 44.26 ± 0.95 | 56.34 ± 2.35 | 60.17 ± 1.18 | +6.8% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 45.61 ± 0.05 | 52.61 ± 1.51 | 58.76 ± 0.07 | +11.7% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 0 | tg128 | 34.70 ± 0.14 | 46.78 ± 1.25 | 48.09 ± 1.42 | +2.8% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 1 | tg128 | 35.60 ± 0.03 | 46.05 ± 0.23 | 47.44 ± 0.22 | +3.0% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 35.68 ± 0.05 | 43.27 ± 0.23 | 43.42 ± 0.32 | +0.3% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 37.39 ± 0.05 | 42.69 ± 0.12 | 43.55 ± 0.06 | +2.0% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 0 | tg128 | 68.22 ± 0.22 | 91.84 ± 5.25 | 90.58 ± 1.56 | -1.4% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 68.39 ± 0.40 | 89.59 ± 0.81 | 89.12 ± 0.93 | -0.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 55.58 ± 0.44 | 91.88 ± 0.94 | 95.63 ± 0.64 | +4.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 59.90 ± 0.38 | 88.72 ± 0.91 | 93.37 ± 0.39 | +5.2% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 61.70 ± 0.17 | 69.55 ± 1.05 | 72.25 ± 0.74 | +3.9% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 64.48 ± 0.33 | 70.75 ± 0.44 | 73.16 ± 0.17 | +3.4% |

AMD RX 6800 XT

| model | size | params | ngl | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 91.24 ± 0.24 | 117.51 ± 0.54 | 123.59 ± 0.21 | +5.2% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 96.87 ± 0.10 | 111.03 ± 0.15 | 114.65 ± 0.01 | +3.3% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 73.90 ± 0.01 | 93.52 ± 0.05 | 91.08 ± 0.02 | -2.6% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 78.24 ± 0.00 | 88.73 ± 0.02 | 86.83 ± 0.01 | -2.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 97.29 ± 0.07 | 154.58 ± 1.14 | 162.02 ± 0.03 | +4.8% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 105.90 ± 0.00 | 139.83 ± 0.03 | 145.91 ± 0.02 | +4.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 116.54 ± 0.01 | 145.73 ± 0.76 | 147.00 ± 0.01 | +0.9% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 122.11 ± 0.01 | 143.62 ± 0.02 | 144.69 ± 0.05 | +0.7% |

AMD Radeon Pro VII

| model | size | params | ngl | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 61.77 ± 0.21 | 72.03 ± 0.31 | 86.22 ± 0.39 | +19.7% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 64.29 ± 0.13 | 68.72 ± 0.16 | 81.18 ± 0.89 | +18.1% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 0 | tg128 | 47.98 ± 0.08 | 61.70 ± 0.18 | 60.99 ± 0.59 | -1.2% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 1 | tg128 | 49.86 ± 0.00 | 59.22 ± 0.08 | 58.43 ± 0.24 | -1.3% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 65.02 ± 0.17 | 73.87 ± 0.39 | 78.50 ± 0.44 | +6.3% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 68.17 ± 0.11 | 70.42 ± 0.25 | 75.18 ± 0.17 | +6.8% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 0 | tg128 | 19.27 ± 0.16 | 19.19 ± 0.02 | 24.82 ± 0.09 | +29.3% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 | 19.41 ± 0.03 | 18.88 ± 0.05 | 24.08 ± 0.13 | +27.5% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 0 | tg128 | 80.06 ± 0.09 | 83.66 ± 5.57 | 82.35 ± 2.57 | -1.6% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 80.35 ± 0.11 | 78.49 ± 1.53 | 82.49 ± 2.99 | +5.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 62.83 ± 0.02 | 85.69 ± 0.94 | 90.46 ± 1.04 | +5.6% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 68.49 ± 0.00 | 77.31 ± 0.79 | 81.04 ± 0.60 | +4.8% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 101.34 ± 0.04 | 128.84 ± 0.12 | 128.34 ± 1.83 | -0.4% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 113.33 ± 0.08 | 125.89 ± 0.23 | 126.20 ± 1.51 | +0.2% |

Intel A770

| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 30.20 ± 0.43 | 44.36 ± 0.85 | +46.9% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 20.15 ± 0.01 | 27.51 ± 0.06 | +36.5% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 0 | tg128 | 15.83 ± 0.03 | 15.88 ± 0.04 | +0.3% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 1 | tg128 | 12.77 ± 0.02 | 12.79 ± 0.02 | +0.2% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 38.31 ± 0.05 | 46.81 ± 0.73 | +22.2% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 31.17 ± 0.12 | 36.67 ± 0.10 | +17.6% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 0 | tg128 | 12.02 ± 0.01 | 14.82 ± 1.23 | +23.3% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 | 10.70 ± 0.00 | 12.08 ± 0.33 | +12.9% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 0 | tg128 | 46.69 ± 3.24 | 48.46 ± 2.09 | +3.8% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 49.76 ± 0.08 | 51.98 ± 0.06 | +4.5% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 23.34 ± 0.02 | 38.08 ± 0.12 | +63.1% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 22.62 ± 0.06 | 36.32 ± 0.08 | +60.6% |

Nvidia RTX 3090

| model | size | params | ngl | fa | test | t/s (CUDA) | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 137.34 ± 0.39 | 118.30 ± 0.45 | 117.37 ± 0.28 | -0.8% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 139.96 ± 0.52 | 120.61 ± 0.18 | 119.81 ± 0.30 | -0.7% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 0 | tg128 | 106.20 ± 0.37 | 101.26 ± 0.43 | 100.46 ± 0.36 | -0.8% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 1 | tg128 | 107.55 ± 0.21 | 102.67 ± 0.69 | 101.69 ± 0.70 | -1.0% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 143.67 ± 0.30 | 120.73 ± 4.47 | 121.30 ± 5.42 | +0.5% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 146.97 ± 0.18 | 125.54 ± 0.94 | 126.59 ± 2.29 | +0.8% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 0 | tg128 | 48.35 ± 0.01 | 40.43 ± 0.48 | 42.24 ± 0.13 | +4.5% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 | 48.82 ± 0.05 | 40.85 ± 0.14 | 42.32 ± 0.05 | +3.6% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 0 | tg128 | 141.25 ± 0.92 | 142.11 ± 12.00 | 145.57 ± 12.59 | +2.4% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 142.04 ± 0.34 | 149.71 ± 1.52 | 151.22 ± 0.42 | +1.0% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 155.74 ± 1.15 | 153.97 ± 16.85 | 153.30 ± 16.31 | -0.4% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 164.32 ± 3.05 | 167.49 ± 1.56 | 167.34 ± 1.11 | -0.1% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 197.83 ± 0.79 | 154.47 ± 12.95 | 163.89 ± 14.45 | +6.1% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 207.36 ± 0.49 | 164.30 ± 0.59 | 175.28 ± 1.24 | +6.7% |

0cc4m (Collaborator, Author) commented Nov 16, 2025

Something's broken in the nvidia-vulkan-cm and cm2 runs; I'll look into it.

0cc4m (Collaborator, Author) commented Nov 16, 2025

I can't reproduce the problem, even on my RTX 3090, with coopmat2, coopmat, or without coopmat. Not sure what is going on. It looks like incoherence, but the example runs just fine for me. @jeffbolznv, any ideas?

jeffbolznv (Collaborator) commented:

I pulled the branch but wasn't able to reproduce the failure. I don't have any great ideas - maybe some missing bounds checking?

0cc4m (Collaborator, Author) commented Nov 16, 2025

Bounds checking is much simpler in MMVQ, since all the inputs come in blocks of 256, 128, or 32 values. I didn't change how the output is stored, so I don't think that's the cause.
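To illustrate the point (a sketch with hypothetical names, not the actual shader): because every input row is a whole number of fixed-size blocks, the per-block loop bound is exact and the inputs never need per-element guards.

```cpp
// Sketch, hypothetical names: with block-granular inputs the loop bound is
// exact, so no input-side bounds check is needed.
constexpr int QK_K = 256;  // k-quant superblock size; other quants use 128 or 32

static float mmvq_row_ref(const int8_t * row_q, const int8_t * vec_q, int ncols) {
    int32_t acc = 0;
    const int nblocks = ncols / QK_K;  // ncols is always a multiple of the block size
    for (int ib = 0; ib < nblocks; ++ib) {
        for (int i = 0; i < QK_K; ++i) {
            acc += (int32_t) row_q[ib * QK_K + i] * (int32_t) vec_q[ib * QK_K + i];
        }
    }
    return (float) acc;  // the real kernels also fold in per-block scales here
}
```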

0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec-k-quants branch from 3c22e38 to e086733 on November 19, 2025 15:27
0cc4m (Collaborator, Author) commented Nov 19, 2025

I would like to merge this, but the CI keeps failing in a way I can't reproduce or understand. cmake-vulkan now failed with illegal instruction crashes on llvmpipe. What is going on there?

Acly (Collaborator) commented Nov 19, 2025

I had those "Illegal instruction" failures once in a PR, and it turned out to be a bad ccache. Maybe you can clear it and re-run that test.

0cc4m (Collaborator, Author) commented Nov 20, 2025

How do I clear it?

Acly (Collaborator) commented Nov 20, 2025

My guess: find the ccache entry related to this PR in https://github.com/ggml-org/llama.cpp/actions/caches and delete it. I don't have the required permissions; maybe you do. @slaren did it at the time.

0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec-k-quants branch from e086733 to e69d645 on November 22, 2025 09:57
0cc4m marked this pull request as draft on November 22, 2025 12:11
0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec-k-quants branch from 9d0f9af to 9cbe4f8 on November 23, 2025 09:39
github-actions bot added the testing (Everything test related) label on Nov 23, 2025
0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec-k-quants branch from 9cbe4f8 to ad5127d on November 27, 2025 05:35
0cc4m marked this pull request as ready for review on November 27, 2025 05:35
0cc4m requested a review from jeffbolznv on November 27, 2025 14:57
jeffbolznv (Collaborator) commented:

A quick before/after; I may not have time to review until tomorrow.

before

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        253.63 ± 0.38 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        210.66 ± 4.25 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        136.78 ± 1.85 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        955.02 ± 4.19 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        951.54 ± 3.72 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        433.50 ± 1.27 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       311.57 ± 14.30 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        269.96 ± 7.90 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       353.91 ± 15.40 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        340.22 ± 0.56 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        391.69 ± 1.45 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        290.78 ± 2.08 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        350.00 ± 0.67 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         94.54 ± 0.83 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         50.24 ± 0.14 |

after

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        255.27 ± 1.76 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        209.36 ± 8.68 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        135.55 ± 3.51 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        952.00 ± 3.88 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        950.21 ± 5.38 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        433.62 ± 1.05 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        307.18 ± 7.16 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        273.64 ± 1.35 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       347.02 ± 26.65 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       355.92 ± 33.07 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        390.81 ± 1.94 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        294.21 ± 0.42 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        349.87 ± 2.96 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         94.99 ± 1.15 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         50.96 ± 0.06 |

I still see the speedup for gpt-oss MXFP4. Is this still unexpected? If so, I can try to dig in and find out what's going on.

0cc4m (Collaborator, Author) commented Nov 28, 2025

I disabled MMVQ for MXFP4 on modern Nvidia, so the only difference is that I enabled subgroup paths for mul_mat_vec_id. Feel free to look into it if you want. I didn't look much into tuning; for now I just copied the approach from mul_mat_vec.
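Conceptually, the subgroup path swaps the shared-memory staging of partial sums for a single subgroup reduction (subgroupAdd in GLSL); a scalar model of what that reduction computes:

```cpp
// Scalar model of a subgroup reduction over per-lane partial sums; on the
// GPU this is one subgroupAdd rather than a shared-memory round trip.
static float subgroup_add_model(const float * lane_partial, int subgroup_size) {
    float sum = 0.0f;
    for (int lane = 0; lane < subgroup_size; ++lane) {
        sum += lane_partial[lane];
    }
    return sum;
}
```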

jeffbolznv (Collaborator) left a comment:

I didn't review all the shader logic in detail, but I reviewed the rest, and also ran test-backend-ops with GGML_VK_FORCE_MMVQ and it passed.

return false;
}

// General issue with q3_k and q6_k
jeffbolznv (Collaborator) commented:

This is just a performance issue, right?

0cc4m (Collaborator, Author) replied:

Yes, I'll rephrase it to be clearer. The reason is simply that those two quants can only use 2-byte loads. Maybe it'd be worth repacking all the 2-byte/1-byte-aligned quants at some point.
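For illustration of the constraint (hypothetical helper): with only 2-byte alignment, every 32-bit word the integer dot product consumes has to be assembled from two 16-bit loads, roughly doubling the load count relative to 4-byte-aligned quants.

```cpp
#include <cstdint>

// Assembling a 32-bit dot-product operand from two aligned 16-bit loads, as
// a 2-byte-aligned quant layout forces; a 4-byte-aligned layout gets the
// same word in a single load.
static inline uint32_t load_u32_from_u16(const uint16_t * p) {
    return (uint32_t) p[0] | ((uint32_t) p[1] << 16);
}
```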

// the number of rows computed per shader depends on GPU model and quant
uint32_t rm_stdq = 1;
uint32_t rm_kq = 2;
uint32_t rm_stdq_int = 1;
jeffbolznv (Collaborator) commented:

My WSL build is still using an old glslc without int dot support, and gets these errors:


/mnt/c/github/jeffbolznv/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp: In function ‘void ggml_vk_load_shaders(vk_device&)’:
/mnt/c/github/jeffbolznv/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:3515:14: error: variable ‘rm_stdq_int’ set but not used [-Werror=unused-but-set-variable]
 3515 |     uint32_t rm_stdq_int = 1;
      |              ^~~~~~~~~~~
/mnt/c/github/jeffbolznv/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:3516:14: error: unused variable ‘rm_kq_int’ [-Werror=unused-variable]
 3516 |     uint32_t rm_kq_int = 1;
      |              ^~~~~~~~~
cc1plus: all warnings being treated as errors
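A typical way to avoid this class of warning is to tie the declarations to the same feature guard as their uses; a sketch, assuming the GGML_VULKAN_INTEGER_DOT_GLSLC_SUPPORT macro that gates the integer-dot shaders (not necessarily the exact fix applied here):

```cpp
// Sketch (assumed macro name): only declare the int-dot row counts when the
// shaders that consume them are compiled in.
#if defined(GGML_VULKAN_INTEGER_DOT_GLSLC_SUPPORT)
    uint32_t rm_stdq_int = 1;
    uint32_t rm_kq_int = 1;
#endif
```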

0cc4m (Collaborator, Author) replied:

Should be fixed.

jeffbolznv (Collaborator) commented:

I set GGML_VK_FORCE_MMVQ and saw a big speedup (+15%) on Qwen2.5-7B-Instruct-1M-Q2_K.gguf. In the past I've seen that Q2_K is small enough that it can still be math-limited rather than bandwidth-limited, so it might be worthwhile to enable MMVQ for that type on more GPUs. But that can happen in a later change.
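For reference, forcing the path during a benchmark run looks like this (Windows cmd, mirroring the invocations above):

set GGML_VK_FORCE_MMVQ=1
llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf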

0cc4m (Collaborator, Author) commented Nov 29, 2025

I didn't disable Q2_K specifically, but I disabled MMVQ for smaller k values. The threshold for enabling it should definitely be tuned further; I just roughly guessed it for my hardware.
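Conceptually the gate is just a threshold on the reduction dimension (hypothetical names; the actual value and condition live in the device tuning code):

```cpp
// Hypothetical sketch of the heuristic: take MMVQ only when k is large
// enough that the fp16/fp32 vector path would be math-limited.
static bool use_mmvq_for_k(uint32_t k, uint32_t mmvq_min_k /* device-tuned */) {
    return k >= mmvq_min_k;
}
```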

0cc4m merged commit 47a268e into master on Nov 29, 2025 (71 of 74 checks passed)
0cc4m deleted the 0cc4m/vulkan-mmq-dp4a-vec-k-quants branch on November 29, 2025 08:37