
Conversation

@JohannesGaessler
Collaborator

There seem to have been no further reports of problems with the mul_mat_q kernels, so I think it's fine to use them by default. This PR does just that and replaces the -mmq/--mul-mat-q CLI argument with -nommq/--no-mul-mat-q. Unless I'm mistaken, the long-term plan is to also add equivalent CPU kernels for matrix-matrix multiplications. Ideally, I think the same CLI argument should then be used for switching the algorithm. So if you think that "mul_mat_q" is a bad name for matrix multiplications using quantized data, now would be a good time to tell me.
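
As a rough illustration of the flag flip described above: only the -nommq/--no-mul-mat-q argument names come from this PR; the struct and function names below are hypothetical placeholders, not the actual common.cpp parsing code.

```cpp
#include <string>
#include <vector>

// Sketch only: mul_mat_q is now on by default, and the CLI argument only turns it off.
struct gpt_params_sketch {
    bool mul_mat_q = true; // previously off by default
};

static void parse_args_sketch(const std::vector<std::string> & args, gpt_params_sketch & params) {
    for (const std::string & arg : args) {
        if (arg == "-nommq" || arg == "--no-mul-mat-q") {
            params.mul_mat_q = false; // fall back to cuBLAS for quantized matrix multiplications
        }
    }
}
```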

@slaren
Member

slaren commented Aug 20, 2023

I imagine that cuBLAS is still significantly faster with f16 and f32 models. The same will be true for the CPU and BLAS; in fact, for quantized models it should already be quite competitive with BLAS. Another name for the flag could be `use-blas`.

@JohannesGaessler
Collaborator Author

There is no mul_mat_q implementation for f16 and f32. For those data types cuBLAS is always used. More generally, I don't think we can realistically write kernels that are faster for those standard data formats; instead we can reduce VRAM usage by transforming the hidden state, which is much smaller. I personally don't think this PR is very urgent, so I'm fine with just waiting until we have more of the pieces for compiling without BLAS libraries.

Related: when I previously reported good performance for my initial matrix-matrix multiplication implementation after adding restrict, I must have done something wrong. When I re-tested the performance it was still only about 50% of cuBLAS.
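
To make the first point concrete, here is a minimal dispatch sketch under hypothetical names (not the actual ggml-cuda code): quantized weight types can go through mul_mat_q, while F16/F32 always take the cuBLAS path regardless of the flag.

```cpp
// Illustrative only: enum values and function names are placeholders.
enum class weight_type { F32, F16, Q4_0, Q4_K, Q6_K };

static bool is_quantized(weight_type t) {
    return t != weight_type::F32 && t != weight_type::F16;
}

static const char * pick_mat_mul_backend(weight_type t, bool mul_mat_q_enabled) {
    if (is_quantized(t) && mul_mat_q_enabled) {
        return "mul_mat_q"; // custom kernel that works directly on the quantized data
    }
    return "cuBLAS";        // F16/F32 (or -nommq): use the cuBLAS GEMM path
}
```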

Member

@ggerganov ggerganov left a comment


> I imagine that cuBLAS is still significantly faster with f16 and f32 models.

Yes, on that note, when I wrote ggml-org/ggml#293 I forgot to specify that we want to avoid BLAS mainly for quantized types. For F16 and F32 we will likely still need it in the near term, unless we manage to close the gap to a good extent. But as people have noted, this will be difficult to do in this project.

So technically, the BLAS dependency as a library will likely stay, but we should limit its usage as much as possible.

@ggerganov
Member

> Maybe before merging, double-check that the perplexity results with --mmq and without are similar for different quantization types (if we haven't done so yet).

I am running a few tests and the deviation looks within expectations:

  • Q4_0 with BLAS: 5.9635
  • Q4_0 without BLAS (--mmq): 5.9698

i.e. a PPL delta of +0.0063, which is similar to what we got on the CPU back when we switched from BLAS to custom kernels (#951).

@Ph0rk0z

Ph0rk0z commented Aug 22, 2023

If I compile for cuBLAS only, are these kernels not available?

@ikawrakow
Contributor

Here is a PPL comparison for LLaMA-v2-7B and various k_quants:

| Quantization | PPL (BLAS) | PPL (-mmq) | PPL delta |
|---|---|---|---|
| Q2_K   | 6.4139 | 6.4149 | +0.0010 |
| Q3_K_S | 6.2925 | 6.2944 | +0.0014 |
| Q3_K_M | 6.0240 | 6.0314 | +0.0074 |
| Q3_K_L | 5.9816 | 5.9883 | +0.0067 |
| Q4_K_S | 5.8838 | 5.8919 | +0.0081 |
| Q4_K_M | 5.8738 | 5.8790 | +0.0052 |
| Q5_K_S | 5.8175 | 5.8243 | +0.0069 |
| Q5_K_M | 5.8221 | 5.8279 | +0.0058 |
| Q6_K   | 5.8067 | 5.8093 | +0.0026 |

@ikawrakow
Contributor

And here is a timing comparison for LLaMA-v2-7B on an RTX 4080 (perplexity over Wikitext with a context length of 512). Time is in seconds.

| Quantization | Time (BLAS) | Time (-mmq) | Speedup |
|---|---|---|---|
| Q2_K   | 186.0 | 164.2 | +13.3% |
| Q3_K_S | 185.3 | 157.2 | +17.9% |
| Q3_K_M | 183.5 | 153.1 | +19.9% |
| Q3_K_L | 184.7 | 156.8 | +17.8% |
| Q4_K_S | 186.6 | 143.9 | +29.8% |
| Q4_K_M | 184.3 | 144.3 | +27.7% |
| Q5_K_S | 184.0 | 152.9 | +20.3% |
| Q5_K_M | 185.8 | 152.5 | +21.8% |
| Q6_K   | 185.8 | 149.3 | +24.4% |

@JohannesGaessler
Collaborator Author

@ikawrakow Thank you for checking the perplexity and performance.

@JohannesGaessler JohannesGaessler merged commit c63bb1d into ggml-org:master Aug 22, 2023
@klosax
Contributor

klosax commented Aug 25, 2023

@JohannesGaessler Possible bug in the kernels: #2765

@slaren
Member

slaren commented Aug 30, 2023

`llama_context_default_params` should probably be changed as well, so that libraries and other users also enable `mul_mat_q` by default.

https://github.com/ggerganov/llama.cpp/blob/b532a69b2fd08067f34f32f37a2fd9b37678a34a/llama.cpp#L5290
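
For reference, a minimal sketch of that change, assuming a `llama_context_params`-style struct with a boolean `mul_mat_q` field as in the linked line; the field layout and names below are simplified placeholders, not the actual llama.cpp definitions.

```cpp
// Sketch only: mirrors the suggestion that the library default should also flip to true.
struct llama_context_params_sketch {
    bool mul_mat_q;
    // ... other fields omitted ...
};

static llama_context_params_sketch llama_context_default_params_sketch() {
    llama_context_params_sketch result = {
        /*.mul_mat_q =*/ true, // was false; enable the custom quantized kernels by default
    };
    return result;
}
```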
