
Conversation

@JohannesGaessler
Collaborator

There seem to have been no further reports of problems with the mul_mat_q kernels, so I think it's fine to use them by default. This PR does just that and replaces the -mmq/--mul-mat-q CLI argument with -nommq/--no-mul-mat-q. Unless I'm mistaken, the long-term plan is to also add equivalent CPU kernels for matrix-matrix multiplications. Ideally, I think the same CLI argument should then be used for switching the algorithm. So if you think that "mul_mat_q" is a bad name for matrix multiplications using quantized data, now would be a good time to tell me.
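
As a rough illustration of the flag flip described above: only the -nommq/--no-mul-mat-q argument names come from this PR; the struct and function names below are hypothetical placeholders, not the actual common.cpp parsing code.

```cpp
#include <string>
#include <vector>

// Sketch only: mul_mat_q is now on by default, and the CLI argument only turns it off.
struct gpt_params_sketch {
    bool mul_mat_q = true; // previously off by default
};

static void parse_args_sketch(const std::vector<std::string> & args, gpt_params_sketch & params) {
    for (const std::string & arg : args) {
        if (arg == "-nommq" || arg == "--no-mul-mat-q") {
            params.mul_mat_q = false; // fall back to cuBLAS for quantized matrix multiplications
        }
    }
}
```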

@slaren
Member

slaren commented Aug 20, 2023

I imagine that cuBLAS is still significantly faster with f16 and f32 models. The same will be true for the CPU and BLAS; in fact, for quantized models it should already be quite competitive with BLAS. Another name for the flag could be `use-blas`.

@JohannesGaessler
Collaborator Author

There is no mul_mat_q implementation for f16 and f32. For those data types cuBLAS is always used. More generally, I don't think we can realistically write kernels that are faster for those standard data formats; instead we can reduce VRAM usage by transforming the hidden state, which is much smaller. I personally don't think this PR is very urgent, so I'm fine with just waiting until we have more of the pieces for compiling without BLAS libraries.

Related: when I previously reported good performance for my initial matrix-matrix multiplication implementation after adding restrict, I must have done something wrong. When I re-tested the performance it was still only about 50% of cuBLAS.
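
To make the first point concrete, here is a minimal dispatch sketch under hypothetical names (not the actual ggml-cuda code): quantized weight types can go through mul_mat_q, while F16/F32 always take the cuBLAS path regardless of the flag.

```cpp
// Illustrative only: enum values and function names are placeholders.
enum class weight_type { F32, F16, Q4_0, Q4_K, Q6_K };

static bool is_quantized(weight_type t) {
    return t != weight_type::F32 && t != weight_type::F16;
}

static const char * pick_mat_mul_backend(weight_type t, bool mul_mat_q_enabled) {
    if (is_quantized(t) && mul_mat_q_enabled) {
        return "mul_mat_q"; // custom kernel that works directly on the quantized data
    }
    return "cuBLAS";        // F16/F32 (or -nommq): use the cuBLAS GEMM path
}
```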

Member

@ggerganov ggerganov left a comment


> I imagine that cuBLAS is still significantly faster with f16 and f32 models.

Yes, on that note, when I wrote ggml-org/ggml#293 I forgot to specify that we want to avoid BLAS mainly for quantized types. For F16 and F32 we will likely still need it in the near term, unless we manage to close the gap to a good extent. But as people have noted, this will be difficult to do in this project.

So technically, the BLAS dependency as a library will likely stay, but we should limit its usage as much as possible.

@ggerganov
Member

> Maybe before merging, double-check that the perplexity results with --mmq and without are similar for different quantization types (if we haven't done so yet).

I am running a few tests and the deviation looks within expectations:

  • Q4_0 with BLAS: 5.9635
  • Q4_0 without BLAS (--mmq): 5.9698

i.e. a PPL delta of +0.0063, which is similar to what we got on the CPU back when we switched from BLAS to custom kernels (#951).

@Ph0rk0z

Ph0rk0z commented Aug 22, 2023

If I compile for cuBLAS only, are these kernels not available?

@ikawrakow
Contributor

Here is a PPL comparison for LLaMA-v2-7B and various k_quants:

| Quantization | PPL (BLAS) | PPL (-mmq) | PPL delta |
|---|---|---|---|
| Q2_K   | 6.4139 | 6.4149 | +0.0010 |
| Q3_K_S | 6.2925 | 6.2944 | +0.0014 |
| Q3_K_M | 6.0240 | 6.0314 | +0.0074 |
| Q3_K_L | 5.9816 | 5.9883 | +0.0067 |
| Q4_K_S | 5.8838 | 5.8919 | +0.0081 |
| Q4_K_M | 5.8738 | 5.8790 | +0.0052 |
| Q5_K_S | 5.8175 | 5.8243 | +0.0069 |
| Q5_K_M | 5.8221 | 5.8279 | +0.0058 |
| Q6_K   | 5.8067 | 5.8093 | +0.0026 |

@ikawrakow
Contributor

And here is a timing comparison for LLaMA-v2-7B on an RTX 4080 (perplexity over Wikitext with a context length of 512). Time is in seconds.

| Quantization | Time (BLAS) | Time (-mmq) | Speedup |
|---|---|---|---|
| Q2_K   | 186.0 | 164.2 | +13.3% |
| Q3_K_S | 185.3 | 157.2 | +17.9% |
| Q3_K_M | 183.5 | 153.1 | +19.9% |
| Q3_K_L | 184.7 | 156.8 | +17.8% |
| Q4_K_S | 186.6 | 143.9 | +29.8% |
| Q4_K_M | 184.3 | 144.3 | +27.7% |
| Q5_K_S | 184.0 | 152.9 | +20.3% |
| Q5_K_M | 185.8 | 152.5 | +21.8% |
| Q6_K   | 185.8 | 149.3 | +24.4% |

@JohannesGaessler
Collaborator Author

@ikawrakow Thank you for checking the perplexity and performance.

@JohannesGaessler JohannesGaessler merged commit c63bb1d into ggml-org:master Aug 22, 2023
@klosax
Contributor

klosax commented Aug 25, 2023

@JohannesGaessler Possible bug in the kernels: #2765

@slaren
Member

slaren commented Aug 30, 2023

`llama_context_default_params` should probably be changed as well, so that libraries and other users also enable `mul_mat_q` by default.

https://github.com/ggerganov/llama.cpp/blob/b532a69b2fd08067f34f32f37a2fd9b37678a34a/llama.cpp#L5290
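
For reference, a minimal sketch of that change, assuming a `llama_context_params`-style struct with a boolean `mul_mat_q` field as in the linked line; the field layout and names below are simplified placeholders, not the actual llama.cpp definitions.

```cpp
// Sketch only: mirrors the suggestion that the library default should also flip to true.
struct llama_context_params_sketch {
    bool mul_mat_q;
    // ... other fields omitted ...
};

static llama_context_params_sketch llama_context_default_params_sketch() {
    llama_context_params_sketch result = {
        /*.mul_mat_q =*/ true, // was false; enable the custom quantized kernels by default
    };
    return result;
}
```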
