Vulkan: MMVQ Integer Dot K-Quant and MUL_MAT_ID support #16900
Conversation
force-pushed from d5192bf to d2f8f00
[Benchmark results: AMD Radeon Pro VII, AMD Radeon RX 6800 XT, Intel A770, RTX 3090]
jeffbolznv left a comment:
I only did a quick read through. I'll do some perf testing soon.
As usual, I appear to have caused an llvmpipe issue. I'll look into it.
Some initial perf results: I reran some of the models with the biggest deltas. Most seem to be noise, except the improvement for gpt-oss MXFP4 is real:
force-pushed from b153aac to 1b78909
The funny thing about that is that I didn't even enable the MMVQ path for Nvidia Turing+ on MXFP4. Not sure what is going on there. I still have some tuning to do here; my Strix Halo device isn't liking this PR yet.
force-pushed from 1b78909 to 937f992
The tuning seems okay now, even though I didn't change anything. @jeffbolznv Please take another look. Did you have any concerns with your benchmarks? Here are updated results:

[Benchmark results: AMD Radeon 8060S, AMD RX 6800 XT, AMD Radeon Pro VII, Intel A770, Nvidia RTX 3090]
force-pushed from b99726c to 3c22e38
Something's broken in the nvidia-vulkan-cm and cm2 runs; I'll look into it.
I can't reproduce the problem, even on my RTX 3090, with coopmat2, coopmat, or without coopmat. Not sure what is going on. It looks like incoherence, but for me the example runs just fine. @jeffbolznv any ideas?
I pulled the branch but wasn't able to reproduce the failure. I don't have any great ideas - maybe some missing bounds checking?
Bounds checking in MMVQ is much simpler, since all the inputs come in blocks of 256, 128, or 32. I didn't change how the output is stored, so I don't think that's likely.
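For illustration, here is a minimal CPU-side sketch of why whole-block inputs make bounds checking trivial in an MMVQ-style kernel. `block_q4_K_sketch` is a simplified stand-in, not ggml's real `block_q4_K`, and the loop is schematic rather than the actual shader logic:

```cpp
#include <cstdint>

// Illustrative only: ggml K-quants pack QK_K = 256 weights per block, so a
// row whose length is a multiple of 256 always consists of whole blocks.
static constexpr int QK_K = 256;

struct block_q4_K_sketch {      // simplified stand-in for ggml's block_q4_K
    uint8_t qs[QK_K / 2];       // 4-bit quants, two per byte
    // (the real block also carries scales/mins; omitted here)
};

// Each worker consumes whole blocks, so the only bound it needs is the block
// count: there is no partial tail that would require a range check per load.
float dot_row_sketch(const block_q4_K_sketch * row, const float * x, int64_t ncols) {
    const int64_t nblocks = ncols / QK_K;   // exact, because ncols is block-aligned
    float sum = 0.0f;
    for (int64_t b = 0; b < nblocks; ++b) {
        for (int i = 0; i < QK_K; ++i) {
            const uint8_t byte = row[b].qs[i / 2];
            const int     q    = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
            sum += q * x[b * QK_K + i];     // dequant scales omitted for brevity
        }
    }
    return sum;
}
```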
force-pushed from 3c22e38 to e086733
I would like to merge this, but the CI keeps failing in a way I can't reproduce or understand.
I had those illegal-instruction failures once in a PR, and it turned out to be related to a bad ccache. Maybe you can clear it and re-run that test.
How do I clear it?
I'm guessing you find the ccache entry related to this PR in https://github.com/ggml-org/llama.cpp/actions/caches and delete it. I don't have the required permissions; maybe you do. @slaren did it at the time.
force-pushed from e086733 to e69d645
force-pushed from 9d0f9af to 9cbe4f8
force-pushed from 9cbe4f8 to ad5127d
A quick before/after; I may not have time to review until tomorrow. I still see the speedup for gpt-oss MXFP4. Is this still unexpected? If so, I can try to dig in and find out what's going on.
I disabled MMVQ for MXFP4 on modern Nvidia, so the only difference is that I enabled subgroup paths for mul_mat_vec_id. Feel free to look into it if you want. I didn't look much into tuning; I just copied the approach from mul_mat_vec for now.
jeffbolznv left a comment:
I didn't review all the shader logic in detail, but I reviewed the rest, and also ran test-backend-ops with GGML_VK_FORCE_MMVQ and it passed.
ggml/src/ggml-vulkan/ggml-vulkan.cpp (outdated):

```cpp
        return false;
    }

    // General issue with q3_k and q6_k
```
This is just a performance issue, right?
Yes, I'll rephrase it to be clearer. The reason is simply that those two quants can only use 2-byte loads. Maybe it'd be worth repacking all the 2-byte/1-byte-aligned quants at some point.
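To spell out the alignment issue, here is a sketch reproducing the two block layouts as I understand them from ggml-common.h (field names from memory; treat them as an assumption). Both block sizes fall short of a multiple of 4, so consecutive blocks alternate alignment and only 16-bit loads are always safe:

```cpp
#include <cstdint>

typedef uint16_t ggml_half;   // fp16 storage stand-in, same size as ggml's ggml_half

// Layouts as in ggml-common.h with QK_K = 256 (reproduced from memory):
struct block_q3_K { uint8_t hmask[32]; uint8_t qs[64]; uint8_t scales[12]; ggml_half d; }; // 110 bytes
struct block_q6_K { uint8_t ql[128];   uint8_t qh[64]; int8_t  scales[16]; ggml_half d; }; // 210 bytes

// 110 % 4 != 0 and 210 % 4 != 0, so block b and block b+1 can never both sit
// on a 4-byte boundary; a shader walking a row can only count on 2-byte loads.
// Repacking to a padded multiple of 4 (or 16) bytes would lift that limit.
static_assert(sizeof(block_q3_K) == 110, "q3_K blocks are only 2-byte aligned");
static_assert(sizeof(block_q6_K) == 210, "q6_K blocks are only 2-byte aligned");
```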
```cpp
    // the number of rows computed per shader depends on GPU model and quant
    uint32_t rm_stdq = 1;
    uint32_t rm_kq = 2;
    uint32_t rm_stdq_int = 1;
```
My WSL build is still using an old glslc without int dot support, and gets these errors:
/mnt/c/github/jeffbolznv/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp: In function ‘void ggml_vk_load_shaders(vk_device&)’:
/mnt/c/github/jeffbolznv/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:3515:14: error: variable ‘rm_stdq_int’ set but not used [-Werror=unused-but-set-variable]
3515 | uint32_t rm_stdq_int = 1;
| ^~~~~~~~~~~
/mnt/c/github/jeffbolznv/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:3516:14: error: unused variable ‘rm_kq_int’ [-Werror=unused-variable]
3516 | uint32_t rm_kq_int = 1;
| ^~~~~~~~~
cc1plus: all warnings being treated as errors
Should be fixed.
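For reference, one way such a fix typically looks (a sketch only; the guard macro GGML_VULKAN_INTEGER_DOT_GLSLC_SUPPORT and the GGML_UNUSED marker are assumed from the surrounding ggml code, and the actual commit may differ):

```cpp
// Sketch: keep the declarations, but tell the compiler they are intentionally
// unused when the toolchain's glslc lacks integer dot support.
uint32_t rm_stdq_int = 1;
uint32_t rm_kq_int   = 1;
#if !defined(GGML_VULKAN_INTEGER_DOT_GLSLC_SUPPORT)
GGML_UNUSED(rm_stdq_int);   // GGML_UNUSED(x) expands to (void)(x) in ggml
GGML_UNUSED(rm_kq_int);
#endif
```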
I set GGML_VK_FORCE_MMVQ and saw a big speedup (+15%) on Qwen2.5-7B-Instruct-1M-Q2_K.gguf. In the past I've seen that Q2_K is small enough that it can still be math-limited rather than bandwidth-limited, and it might be worthwhile to enable MMVQ for that type on more GPUs. But that can happen in a later change.
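A back-of-envelope for why a very small quant can be math-limited (illustrative numbers; the 84-byte block size matches ggml's block_q2_K, the rest is plain arithmetic):

```cpp
#include <cstdio>

// Q2_K packs 256 weights into an 84-byte block, so every byte fetched from
// memory feeds roughly three multiply-adds plus the dequantization work.
// That high ALU-per-byte ratio is why trimming ALU cost with integer dot
// products can pay off even where wider quants are purely bandwidth-bound.
int main() {
    const double bytes_per_block   = 84.0;   // sizeof(block_q2_K) in ggml
    const double weights_per_block = 256.0;  // QK_K
    printf("Q2_K: %.3f bits/weight, %.2f weights per byte fetched\n",
           8.0 * bytes_per_block / weights_per_block,    // ~2.625
           weights_per_block / bytes_per_block);         // ~3.05
    return 0;
}
```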
I didn't disable q2_k specifically, but I did disable MMVQ for smaller k. The threshold for when to enable it should definitely be tuned further; I just roughly guessed it for my hardware.
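In dispatch terms that presumably amounts to a cutoff like the sketch below; the constant and function names here are hypothetical placeholders, not the PR's actual code or values:

```cpp
#include <cstdint>

// Hypothetical tuning constant: the PR chooses its own per-GPU value,
// and the thread above notes it still needs proper tuning.
static constexpr int64_t MMVQ_MIN_K = 4096;   // placeholder

// ne00 is the shared dimension k (columns of a weight row) in ggml naming.
static bool use_mmvq_sketch(int64_t ne00, bool device_has_integer_dot) {
    // For small k there are too few integer-dot accumulations to amortize the
    // setup cost, so fall back to the float mul_mat_vec path.
    return device_has_integer_dot && ne00 >= MMVQ_MIN_K;
}
```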
Add k-quant mul_mat_vec support, and enable MUL_MAT_ID integer dot vector path.
Tuning this is quite difficult. I've included an attempt, but I'm not done. I'll add performance numbers later.
Q3_K and Q6_K currently don't work well at all; I'm still trying to figure out why.
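For readers new to MMVQ, the core idea (presumably mirroring the CUDA backend's MMVQ path) is to quantize the activation vector to 8-bit blocks once, then replace float dequantize-and-multiply with integer dot products, applying the scales once per block. Below is a schematic CPU-side sketch using a simplified Q4_0-like weight block; the Vulkan shaders would use GL_EXT_integer_dot_product instructions instead of the scalar loop:

```cpp
#include <cstdint>

// Simplified weight block: 32 4-bit quants plus one float scale (the real
// q4_0 stores the scale as fp16). Activations are pre-quantized to int8.
struct blk_w { float d; uint8_t qs[16]; };   // weights, two nibbles per byte
struct blk_a { float d; int8_t  qs[32]; };   // activations

float vec_dot_sketch(const blk_w & w, const blk_a & a) {
    int32_t acc = 0;                          // all-integer accumulation
    for (int i = 0; i < 16; ++i) {
        const int w0 = (w.qs[i] & 0x0F) - 8;  // low nibble, zero-point 8
        const int w1 = (w.qs[i] >>   4) - 8;  // high nibble
        acc += w0 * a.qs[i] + w1 * a.qs[i + 16];
    }
    return w.d * a.d * acc;                   // one float multiply per block
}
```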