Compile time option to use bf16 for quants without MMQ kernels #261

ikawrakow · 2025-03-17T18:55:23Z

The IQ2_KS, IQ2_K, ..., IQ6_K quantization types do not have MMQ kernels, so matrix multiplications for model weights quantized with these types are done via dequantization to fp16 and cublasGemmEx GEMM using fp16 precision. For the DeepSeek series of MoE models this leads to NaNs.
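A minimal sketch of how fp16's limited range can turn into NaNs, assuming overflow is the mechanism (the PR reports the NaNs but does not spell out the cause). The helpers `to_fp16` and `fp16_dot` and the input values are hypothetical and only emulate fp16 rounding in plain Python; they are not the cuBLAS code path.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(x: float) -> float:
    """Round x to the nearest fp16 value; out-of-range inputs become +/-inf."""
    try:
        # "e" is the struct format code for IEEE 754 binary16
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the fp16 range; a
        # round-to-nearest IEEE conversion yields infinity there, so mimic it
        return math.copysign(math.inf, x)


def fp16_dot(a, b):
    """Dot product with every input, product and partial sum rounded to fp16."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp16(acc + to_fp16(to_fp16(x) * to_fp16(y)))
    return acc


# Hypothetical values, chosen only to trigger the overflow
small = fp16_dot([1.0, 2.0], [3.0, 4.0])           # stays in range
overflowed = fp16_dot([300.0, 1.0], [300.0, 1.0])  # 300 * 300 exceeds FP16_MAX
poisoned = overflowed - overflowed                 # inf - inf is NaN
```

Once a single infinity appears, any later subtraction of infinities or multiplication by zero yields NaN, which then propagates through the rest of the computation.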

Ideally I should add MMQ kernels for these quantization types. But for now, the PR provides a quick fix: dequantize to bf16 and use bf16 cuBLAS GEMM. This is added as a compile time option enabled via

cmake -DGGML_CUDA_IQK_FORCE_BF16=ON $other_cmake_options

(or if, like me, you prefer using ccmake: after pulling the PR, run cmake .. && ccmake . and then set GGML_CUDA_IQK_FORCE_BF16 to ON).

I have tested with DeepSeek-Lite quantized with IQ4_KSS and IQ4_K. In both cases I get NaNs when running perplexity on the main branch. Turning on the GGML_CUDA_IQK_FORCE_BF16 option provided by this PR results in meaningful PPL values.
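A rough illustration of the trade-off behind the bf16 option, under the assumption that the fp16 NaNs come from values exceeding the fp16 range. The helper `to_bf16` is hypothetical: it emulates bf16 by rounding a float32 bit pattern to its top 16 bits, and the sample values are made up.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite fp16 value, for comparison


def to_bf16(x: float) -> float:
    """Round x to bfloat16 (round-to-nearest-even); x must fit in float32."""
    if not math.isfinite(x):
        return x  # inf and NaN pass through unchanged
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bf16 keeps the sign, the 8 exponent bits and the top 7 mantissa bits
    # of float32; round the 16 dropped bits to nearest, ties to even
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Hypothetical values well above FP16_MAX remain finite in bf16 ...
large = to_bf16(70000.0)
huge = to_bf16(1e30)

# ... at the cost of precision: bf16 keeps only 8 significant bits,
# so neighbouring integers around 1000 collapse onto the same value
coarse = to_bf16(1001.0)
```

bf16 shares float32's exponent range, so magnitudes that would overflow in fp16 stay finite, while each value carries a relative rounding error of up to about 2^-8.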

@davidsyoung This should solve the issues with the IQ4_KSS DeepSeek-R1 model you created.

davidsyoung · 2025-03-17T23:38:28Z

Awesome! Will re-quant overnight and test tomorrow!

saood06 · 2025-03-17T23:43:23Z

> Awesome! Will re-quant overnight and test tomorrow!

If you still have the old quants, you can use them with the new code; you don't have to make new ones.

davidsyoung · 2025-03-17T23:45:25Z

Unfortunately I don’t! My cache drive is limited so I tend to delete pretty soon.

Co-authored-by: Iwan Kawrakow

Compile time option to use bf16 for quants without MMQ kernels

f326a5e

ikawrakow merged commit bdcae90 into main Mar 18, 2025

ikawrakow pushed a commit that referenced this pull request Mar 18, 2025

Fix #261

55b2cf9

ikawrakow added a commit that referenced this pull request Mar 18, 2025

Fix #261 (#262)

f4ebf13

Co-authored-by: Iwan Kawrakow

ikawrakow mentioned this pull request Mar 26, 2025

Use bf16 instead of fp16 block scales for q8_1 #292

Merged

ikawrakow mentioned this pull request May 14, 2025

CUDA: quantized GEMM for IQ4_K, IQ5_K, IQ6_K #417

Merged
