Adding IQ6_K #14
Merged
Conversation
We get a slightly better PPL for LLaMA-3.1-8B compared to q6_K (0.14% vs 0.26% quantization error).
90.2 t/s for LLaMA-3.1-8B. Q6_K gives 91.2 t/s, so we are good.
We need to do 4 shuffles to get the non-uniform values, so this makes it slower than the other iqX_k quants. And then I realized that I was using the standard Zen4 template for all iqX_k quants. The standard template converts the 32-bit integers obtained after _mm512_dpbusds_epi32 back to 16 bits, and then multiplies with the 16-bit block scales. But this can overflow for iq4_k, iq5_k, and iq6_k. I guess I did not notice with iq4_k and iq5_k because the PPL difference to CUDA was relatively small, and I attributed it to Q8_K not being accurate enough for the activations. But for iq6_k the PPL difference was much too big to be attributable to Q8_K inaccuracies, so that's when I realized that I cannot pack the _mm512_dpbusds_epi32 result into 16 bits for the 4-, 5-, and 6-bit iqX_k quants. For now I have fixed it for iq6_k, but the outcome is that it is significantly slower than Q6_K: I get PP-512 = 125 t/s for LLaMA-3.1-8B vs 180 t/s for Q6_K, so I need to look for a better approach.
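To make the truncation concrete, here is a minimal scalar sketch, not the actual kernel; the value ranges are assumptions chosen for illustration. Each 32-bit lane of _mm512_dpbusds_epi32 accumulates four u8*s8 products, and with the wider non-linear value ranges four such products can already exceed what int16_t can hold.

```cpp
// Scalar sketch of the 16-bit overflow. Illustrative numbers only -- not the
// real IQ6_K lookup table and not the actual matrix multiplication template.
#include <cstdint>
#include <cstdio>

int main() {
    // One 32-bit lane of _mm512_dpbusds_epi32 accumulates four u8*s8 products.
    // Assume non-linear quant values stored as unsigned bytes near 250 and
    // Q8_K quants near +127 (hypothetical worst case).
    int32_t lane = 0;
    for (int i = 0; i < 4; ++i) lane += 250 * 127;   // 127000

    // The old template packed such lanes to 16 bits before multiplying with
    // the 16-bit block scales; 127000 does not fit into int16_t and wraps.
    int16_t packed = (int16_t)lane;
    printf("32-bit lane sum = %d, truncated to 16 bits = %d\n", lane, packed);

    // With genuine k-quants (e.g. Q6_K stores at most 6-bit values, <= 63 as
    // unsigned) four products stay within int16_t, so the packing is safe there.
    return 0;
}
```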
We now arrive at PP-512 = 147 t/s for LLaMA-3.1-8B. TG-128 is 9.5 t/s. This is better than the last commit, but still kind of slow compared to Q6_K. My last commit message is wrong: iq3_k also needs a fix for overflow.
Respectable performance, only slightly slower than Q6_K.
About 4% slower than Q6_K for PP-512, but 10% faster for TG-128. Has someone screwed up Q6_K TG performance on Metal? With the continuous "improvements" in ggml I wouldn't be surprised. Need to look into it later.
See comments in f3a823c
frost555 pushed a commit to frost555/ik_llama.cpp that referenced this pull request on Aug 6, 2025: Glm 4.5 testing
This PR:

- Adds IQ6_K - see New quantization types IQ2_K, IQ3_K, IQ4_K, IQ5_K #8 for motivation
- Fixes the Zen4 implementation of IQ3_K, IQ4_K and IQ5_K

New IQ6_K

The graph below is a copy of the graph in #8 with the quantization error of the new IQ6_K non-linear quantization type added (cyan circle near 6.6 bpw). We observe a significant improvement compared to Q6_K (0.4% vs 0.65%). The LLaMA-3.1-8B quantization error is better too (0.15% vs 0.26%), so I think this is a worthwhile addition.

Fixing the Zen4 implementation of IQ3_K, IQ4_K and IQ5_K

While working on IQ6_K, I noticed that there is a problem with the Zen4 implementation of the IQ3_K, IQ4_K and IQ5_K quants. I was using the standard k-quants matrix multiplication template (mul_mat_qX_K_q8_K_AVX512). On Zen4, this template uses the _mm512_dpbusd_epi32 instruction to perform the dot product between the quants of the left matrix and the Q8_K quants of the right matrix, which produces a SIMD vector of 32-bit integer results. For k-quants these 32-bit integers fall within int16_t range, so they are packed to 16 bits and then multiplied with the block scales. But for the 3+ bit non-linear quants, the _mm512_dpbusd_epi32 result may go outside the int16_t range, which leads to truncation and a wrong result. I have now corrected the implementation, at the cost of a small performance regression. The table below shows a performance comparison for LLaMA-3.1-8B between the original Zen4 implementation and the corrected Zen4 implementation for IQ3_K on a Ryzen-7950X (using 16 threads for PP-512 and 4 threads for TG-128).
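For reference, the snippet below is a hedged sketch of the two accumulation strategies, not the actual mul_mat_qX_K_q8_K_AVX512 template; the function names, argument layout, and the way the block scales are broadcast are invented for illustration. It needs AVX-512F, AVX2 and AVX-512 VNNI (e.g. compile with -mavx512f -mavx512vnni -mavx2).

```cpp
// Sketch of "pack to 16 bits, then apply 16-bit scales" (the old path, safe
// for k-quants) versus "stay in 32 bits" (the corrected path for iqX_k).
#include <immintrin.h>

// Old pattern: the 32-bit dot products are narrowed to 16 bits and then
// combined with 16-bit block scales via _mm256_madd_epi16 (adjacent pairs are
// multiplied and summed into 32-bit results). Fine while every dot product
// fits into int16_t; wrong once it does not.
static inline __m256i dot_times_scales_16bit(__m512i quants_u8, __m512i q8_s8,
                                             __m256i scales_16) {
    __m512i dot32 = _mm512_dpbusd_epi32(_mm512_setzero_si512(), quants_u8, q8_s8);
    __m256i dot16 = _mm512_cvtepi32_epi16(dot32);   // truncates out-of-range lanes
    return _mm256_madd_epi16(dot16, scales_16);
}

// Corrected pattern: keep the full 32-bit dot products and multiply with
// 32-bit scales, at the price of wider (and somewhat slower) multiplies.
static inline __m512i dot_times_scales_32bit(__m512i quants_u8, __m512i q8_s8,
                                             __m512i scales_32) {
    __m512i dot32 = _mm512_dpbusd_epi32(_mm512_setzero_si512(), quants_u8, q8_s8);
    return _mm512_mullo_epi32(dot32, scales_32);
}
```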