I was playing with Deepseek-Lite and noticed that some of its tensors get quantized with `IQ4_NL`, and that after repacking to `IQ4_NL_R4` they do not work for row sizes that are not a multiple of 128 (4 blocks). So, I fixed that (AVX2 and Zen4), and applied the same fix to `Q5_0_R4` and `Q6_0_R4`.
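Not the actual kernel, but a minimal sketch of the shape of the fix, assuming blocks of 32 elements and a main loop that consumes 4 blocks (128 elements) of a row per iteration; `dot_one_block` and `row_dot` are made-up names standing in for the real repacked GEMM code:

```cpp
// Sketch only: handle rows whose length is a multiple of the block size (32)
// but not necessarily a multiple of 128 (4 blocks). The main loop takes 4
// blocks per iteration; the tail loop picks up the 1-3 leftover blocks that
// previously were silently dropped.
#include <cstdio>
#include <vector>

constexpr int kBlockSize     = 32;  // elements per quantization block (assumed)
constexpr int kBlocksPerStep = 4;   // blocks consumed per main-loop iteration (assumed)

// Stand-in for the real per-block dot-product kernel.
static float dot_one_block(const float* x, const float* y) {
    float sum = 0.0f;
    for (int i = 0; i < kBlockSize; ++i) sum += x[i] * y[i];
    return sum;
}

// Row dot product that works for any row size that is a multiple of 32,
// not just multiples of 128.
float row_dot(const float* x, const float* y, int n) {
    const int nblocks = n / kBlockSize;
    const int nsteps  = nblocks / kBlocksPerStep;   // full groups of 4 blocks
    float sum = 0.0f;
    int ib = 0;
    for (int s = 0; s < nsteps; ++s)                // fast path: 4 blocks at a time
        for (int k = 0; k < kBlocksPerStep; ++k, ++ib)
            sum += dot_one_block(x + ib * kBlockSize, y + ib * kBlockSize);
    for (; ib < nblocks; ++ib)                      // tail: remaining 1-3 blocks
        sum += dot_one_block(x + ib * kBlockSize, y + ib * kBlockSize);
    return sum;
}

int main() {
    const int n = 160;                              // 5 blocks: not a multiple of 128
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    std::printf("dot = %g (expected %g)\n", row_dot(x.data(), y.data(), n), 2.0f * n);
}
```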
Quantization error as measured by PPL is surprisingly low for the low-bit quants; even `IQ1_S` is kind of semi-usable. It is not a "true" `IQ1_S` quantization, as quite a few tensors get quantized to `IQ4_NL`, and I changed the attention tensors, which represent a tiny fraction of the overall model size, to be quantized with much higher bpw. We end up using 2.525 bpw for the repeating layers, and `PPL(IQ1_S)/PPL(fp16) - 1 = 49.4%`. But now I understand the hype around the Internet when, the other day, somebody was pretending to have invented 1-bit quantization and quantization mixes by using `IQ1_S` in `llama.cpp` for Deepseek-R1.
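As an aside, the quantization-error figure quoted above is just the relative PPL increase over `fp16`; a trivial sketch of that computation, with placeholder PPL values rather than measurements from this PR:

```cpp
// Relative PPL increase: PPL(quant)/PPL(fp16) - 1. Inputs are placeholders.
#include <cstdio>

int main() {
    const double ppl_fp16  = 6.00;  // placeholder, not a measured value
    const double ppl_iq1_s = 8.96;  // placeholder, chosen to land near 49%
    std::printf("PPL(IQ1_S)/PPL(fp16) - 1 = %.1f%%\n",
                100.0 * (ppl_iq1_s / ppl_fp16 - 1.0));
}
```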