Adding IQ6_K #14
Merged
Conversation
We get a slightly better PPL for LLaMA-3.1-8B compared to q6_K (0.14% vs 0.26% quantization error).
90.2 t/s for LLaMA-3.1-8B. Q6_K gives 91.2 t/s, so we are good.
We need to do 4 shuffles to get the non-uniform values, so this makes it slower than the other iqX_k quants. And then I realized that I was using the standard Zen4 template for all iqX_k quants. The standard template converts the 32-bit integers obtained after _mm512_dpbusds_epi32 back to 16 bits, and then multiplies with the 16-bit block scales. But this can overflow for iq4_k, iq5_k, and iq6_k. I guess I did not notice with iq4_k and iq5_k because the PPL difference to CUDA was relatively small, and I attributed it to Q8_K not being accurate enough for the activations. But for iq6_k the PPL difference was much too big to be attributable to Q8_K inaccuracies, so that's when I realized that I cannot pack the _mm512_dpbusds_epi32 result into 16 bits for the 4-, 5-, and 6-bit iqX_k quants. For now I have fixed it for iq6_k, but the outcome is that it is significantly slower than Q6_K: I get PP-512 = 125 t/s for LLaMA-3.1-8B vs 180 t/s for Q6_K, so I need to look for a better approach.
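To make the truncation concrete, here is a minimal scalar sketch, not the actual kernel; the value ranges are assumptions chosen for illustration. Each 32-bit lane of _mm512_dpbusds_epi32 accumulates four u8*s8 products, and with the wider non-linear value ranges four such products can already exceed what int16_t can hold.

```cpp
// Scalar sketch of the 16-bit overflow. Illustrative numbers only -- not the
// real IQ6_K lookup table and not the actual matrix multiplication template.
#include <cstdint>
#include <cstdio>

int main() {
    // One 32-bit lane of _mm512_dpbusds_epi32 accumulates four u8*s8 products.
    // Assume non-linear quant values stored as unsigned bytes near 250 and
    // Q8_K quants near +127 (hypothetical worst case).
    int32_t lane = 0;
    for (int i = 0; i < 4; ++i) lane += 250 * 127;   // 127000

    // The old template packed such lanes to 16 bits before multiplying with
    // the 16-bit block scales; 127000 does not fit into int16_t and wraps.
    int16_t packed = (int16_t)lane;
    printf("32-bit lane sum = %d, truncated to 16 bits = %d\n", lane, packed);

    // With genuine k-quants (e.g. Q6_K stores at most 6-bit values, <= 63 as
    // unsigned) four products stay within int16_t, so the packing is safe there.
    return 0;
}
```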
We now arrive at PP-512 = 147 t/s for LLaMA-3.1-8B. TG-128 is 9.5 t/s. This is better than the last commit, but still kind of slow compared to Q6_K. My last commit message is wrong: iq3_k also needs a fix for overflow.
Respectable performance, only slightly slower than Q6_K.
About 4% slower than Q6_K for PP-512, but 10% faster for TG-128. Has someone screwed up Q6_K TG performance on Metal? With the continuous "improvements" in ggml I wouldn't be surprised. Need to look into it later.
See comments in f3a823c
frost555 pushed a commit to frost555/ik_llama.cpp that referenced this pull request on Aug 6, 2025: Glm 4.5 testing
This PR:

- Adds IQ6_K - see New quantization types IQ2_K, IQ3_K, IQ4_K, IQ5_K #8 for motivation
- Fixes the Zen4 implementation of IQ3_K, IQ4_K and IQ5_K

New IQ6_K

The graph below is a copy of the graph in #8 with the quantization error of the new IQ6_K non-linear quantization type added (cyan circle near 6.6 bpw). We observe a significant improvement compared to Q6_K (0.4% vs 0.65%). The LLaMA-3.1-8B quantization error is better too (0.15% vs 0.26%), so I think this is a worthwhile addition.

Fixing the Zen4 implementation of IQ3_K, IQ4_K and IQ5_K

While working on IQ6_K, I noticed that there is a problem with the Zen4 implementation of the IQ3_K, IQ4_K and IQ5_K quants. I was using the standard k-quants matrix multiplication template (mul_mat_qX_K_q8_K_AVX512). On Zen4, this template uses the _mm512_dpbusd_epi32 instruction to perform the dot product between the quants of the left matrix and the Q8_K quants of the right matrix, which produces a SIMD vector of 32-bit integer results. For k-quants these 32-bit integers fall within int16_t range, so they are packed to 16 bits and then multiplied with the block scales. But for the 3+ bit non-linear quants, the _mm512_dpbusd_epi32 result may go outside the int16_t range, which leads to truncation and a wrong result. I have now corrected the implementation, at the cost of a small performance regression. The table below shows a performance comparison for LLaMA-3.1-8B between the original Zen4 implementation and the corrected Zen4 implementation for IQ3_K on a Ryzen-7950X (using 16 threads for PP-512 and 4 threads for TG-128).
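For reference, the snippet below is a hedged sketch of the two accumulation strategies, not the actual mul_mat_qX_K_q8_K_AVX512 template; the function names, argument layout, and the way the block scales are broadcast are invented for illustration. It needs AVX-512F, AVX2 and AVX-512 VNNI (e.g. compile with -mavx512f -mavx512vnni -mavx2).

```cpp
// Sketch of "pack to 16 bits, then apply 16-bit scales" (the old path, safe
// for k-quants) versus "stay in 32 bits" (the corrected path for iqX_k).
#include <immintrin.h>

// Old pattern: the 32-bit dot products are narrowed to 16 bits and then
// combined with 16-bit block scales via _mm256_madd_epi16 (adjacent pairs are
// multiplied and summed into 32-bit results). Fine while every dot product
// fits into int16_t; wrong once it does not.
static inline __m256i dot_times_scales_16bit(__m512i quants_u8, __m512i q8_s8,
                                             __m256i scales_16) {
    __m512i dot32 = _mm512_dpbusd_epi32(_mm512_setzero_si512(), quants_u8, q8_s8);
    __m256i dot16 = _mm512_cvtepi32_epi16(dot32);   // truncates out-of-range lanes
    return _mm256_madd_epi16(dot16, scales_16);
}

// Corrected pattern: keep the full 32-bit dot products and multiply with
// 32-bit scales, at the price of wider (and somewhat slower) multiplies.
static inline __m512i dot_times_scales_32bit(__m512i quants_u8, __m512i q8_s8,
                                             __m512i scales_32) {
    __m512i dot32 = _mm512_dpbusd_epi32(_mm512_setzero_si512(), quants_u8, q8_s8);
    return _mm512_mullo_epi32(dot32, scales_32);
}
```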