Conversation

@ikawrakow (Owner)

The IQ2_KS, IQ2_K, ..., IQ6_K quantization types do not have MMQ kernels, so matrix multiplications for model weights quantized with these types are done by dequantizing to fp16 and running a cublasGemmEx GEMM in fp16 precision. For the DeepSeek series of MoE models this leads to NaNs.
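For illustration, here is a minimal sketch of that fp16 path. This is not the actual ggml-cuda code; the wrapper name, matrix layout, and GEMM algorithm choice are assumptions made for the example.

```cpp
// Sketch of the current path: weights dequantized to fp16, then a cuBLAS GEMM
// with fp16 storage and fp16 accumulation (CUBLAS_COMPUTE_16F). fp16 tops out
// around 65504, so large intermediate values can overflow to Inf and end up as
// NaN in the output (a plausible source of the NaNs described above).
#include <cublas_v2.h>
#include <cuda_fp16.h>

// C (m x n) = A^T (m x k) * B (k x n), all operands fp16, column-major.
cublasStatus_t gemm_f16(cublasHandle_t handle, int m, int n, int k,
                        const half * A, const half * B, half * C) {
    const half alpha = 1.0f;   // scale factors must be fp16 for CUBLAS_COMPUTE_16F
    const half beta  = 0.0f;
    return cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                        &alpha, A, CUDA_R_16F, k,
                                B, CUDA_R_16F, k,
                        &beta,  C, CUDA_R_16F, m,
                        CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT);
}
```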

Ideally I should add MMQ kernels for these quantization types, but for now this PR provides a quick fix: dequantize to bf16 and use a bf16 cuBLAS GEMM instead. This is added as a compile-time option enabled via

cmake -DGGML_CUDA_IQK_FORCE_BF16=ON $other_cmake_options

(Or, if like me you prefer ccmake: after pulling the PR, run cmake .. && ccmake . and set GGML_CUDA_IQK_FORCE_BF16 to ON.)
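For reference, a minimal sketch of the bf16 path the option switches to, assuming a GGML_CUDA_IQK_FORCE_BF16 preprocessor definition; the function name and dispatch here are illustrative, not the PR's actual code.

```cpp
// Sketch of the bf16 fallback: weights dequantized to bf16, GEMM accumulates
// in fp32 (CUBLAS_COMPUTE_32F). bf16 has the same exponent range as fp32,
// so the fp16 overflow problem does not arise.
#include <cublas_v2.h>
#include <cuda_bf16.h>

#ifdef GGML_CUDA_IQK_FORCE_BF16
// C (m x n) = A^T (m x k) * B (k x n); A and B were dequantized to bf16.
cublasStatus_t gemm_bf16(cublasHandle_t handle, int m, int n, int k,
                         const nv_bfloat16 * A, const nv_bfloat16 * B,
                         nv_bfloat16 * C) {
    const float alpha = 1.0f;  // fp32 scale factors for CUBLAS_COMPUTE_32F
    const float beta  = 0.0f;
    return cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                        &alpha, A, CUDA_R_16BF, k,
                                B, CUDA_R_16BF, k,
                        &beta,  C, CUDA_R_16BF, m,
                        CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
}
#endif
```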

I have tested with DeepSeek-Lite quantized with IQ4_KSS and IQ4_K. In both cases I get NaNs when running perplexity on the main branch. Turning on the GGML_CUDA_IQK_FORCE_BF16 option provided by this PR results in meaningful PPL values.

@davidsyoung This should solve the issues with the IQ4_KSS DeepSeek-R1 model you created.

@davidsyoung

Awesome! Will re-quant over night and test tomorrow!

@saood06 (Collaborator) commented Mar 17, 2025

> Awesome! Will re-quant over night and test tomorrow!

In case you still have the old quants, you can just use those with the new code; you don't have to make new quants.

@davidsyoung

Unfortunately I don’t! My cache drive is limited so I tend to delete pretty soon.

ikawrakow merged commit bdcae90 into main on Mar 18, 2025
ikawrakow pushed a commit that referenced this pull request Mar 18, 2025
ikawrakow added a commit that referenced this pull request Mar 18, 2025