CUDA: use mul_mat_q kernels by default #2683
Conversation
I imagine that cuBLAS is still significantly faster with f16 and f32 models. The same will be true for CPU and BLAS; in fact, for quantized models it should already be quite competitive with BLAS. Another name for the flag could be
There is no mul_mat_q implementation for f16 and f32. For those data types cuBLAS is always used. More generally, I don't think we can realistically write kernels that are faster for those standard data formats; instead we can reduce VRAM usage by transforming the hidden state, which is much smaller. I personally don't think this PR is very urgent, so I'm fine with just waiting until we have more of the pieces for compiling without BLAS libraries. Related: when I previously reported good performance for my initial matrix-matrix multiplication implementation after adding restrict, I must have done something wrong. When I re-tested, the performance was still only 50% that of cuBLAS.
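A rough sketch of the dispatch being described, with hypothetical names rather than ggml's actual API:

```cpp
// Illustrative only: quantized weight types go through the custom mul_mat_q
// kernels, while f16/f32 weights always fall back to cuBLAS, as described
// above. All names here are hypothetical stand-ins, not ggml's actual API.
#include <cstdio>

enum class weight_type { F32, F16, Q4_0, Q4_1, Q8_0 /* k-quants, ... */ };

static bool is_quantized(weight_type t) {
    return t != weight_type::F32 && t != weight_type::F16;
}

// Stand-ins for the real kernel launch / cuBLAS code paths.
static void launch_mul_mat_q()    { std::puts("mul_mat_q kernel"); }
static void launch_mul_mat_blas() { std::puts("cuBLAS GEMM"); }

// use_mul_mat_q becomes the default with this PR; -nommq would set it to false.
void mul_mat_dispatch(weight_type wt, bool use_mul_mat_q = true) {
    if (use_mul_mat_q && is_quantized(wt)) {
        // works directly on the quantized weights, no dequantized copy in VRAM
        launch_mul_mat_q();
    } else {
        // f16/f32 weights (or quantized weights with -nommq): use cuBLAS
        launch_mul_mat_blas();
    }
}
```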
ggerganov left a comment
I imagine that cuBLAS is still significantly faster with f16 and f32 models.
Yes, on that note, when I wrote ggml-org/ggml#293 I forgot to specify that we want to avoid BLAS mainly for quantized types. For F16 and F32 we will likely still need it in the near term, unless we manage to close the gap to a good extent. But as people have noted, this will be difficult to do in this project.
So technically, the BLAS dependency as a library will likely stay, but we should limit its usage as much as possible.
Maybe before merging, double-check that the perplexity results with the mul_mat_q kernels are consistent. I am running a few tests and the deviation looks within expectations, i.e. a ppl delta of
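As a minimal sketch of the kind of check meant by "ppl delta" here, comparing a run with mul_mat_q against a cuBLAS run (the 0.1% tolerance is only a placeholder, not a project-defined threshold):

```cpp
// Relative perplexity deviation between a mul_mat_q run and a cuBLAS run.
// The tolerance is a placeholder; the point is only that the delta should
// be tiny compared to the perplexity itself.
#include <cmath>
#include <cstdio>

bool ppl_within_expectations(double ppl_mmq, double ppl_cublas, double rel_tol = 1e-3) {
    const double delta = std::fabs(ppl_mmq - ppl_cublas) / ppl_cublas;
    std::printf("ppl delta: %+.5f (relative: %.4f%%)\n", ppl_mmq - ppl_cublas, 100.0 * delta);
    return delta <= rel_tol;
}
```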
If I compile for cuBLAS only, are these not available?
Here is a PPL comparison for LLaMA-v2-7B and various k_quants:
And here is a timing comparison for LLaMA-v2-7B on an RTX 4080, computing PPL for a context length of 512 on Wikitext. Time is in seconds.
Force-pushed from f80a3fe to f6a8864
@ikawrakow Thank you for checking the perplexity and performance.
@JohannesGaessler Possible bug in the kernels: #2765
There seem to have been no further reports of problems with the mul_mat_q kernels, so I think it's fine to use them by default. This PR does just that and replaces the -mmq/--mul-mat-q CLI argument with -nommq/--no-mul-mat-q. Unless I'm mistaken, the long-term plan is to also add equivalent CPU kernels for matrix-matrix multiplications. Ideally, the same CLI argument should then be used for switching the algorithm. So if you think that "mul_mat_q" is a bad name for matrix multiplications using quantized data, now would be a good time to tell me.
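A simplified sketch of what the flag change amounts to in argument parsing (hypothetical struct and function names, not the actual common.cpp code):

```cpp
// mul_mat_q is now on by default; -nommq/--no-mul-mat-q turns it off,
// replacing the old opt-in -mmq/--mul-mat-q flag. Names are illustrative.
#include <cstring>

struct gpt_params_sketch {
    bool mul_mat_q = true;   // new default: use the mul_mat_q kernels
};

void parse_arg(gpt_params_sketch & params, const char * arg) {
    if (std::strcmp(arg, "-nommq") == 0 || std::strcmp(arg, "--no-mul-mat-q") == 0) {
        params.mul_mat_q = false;   // fall back to cuBLAS for quantized mat muls
    }
}
```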