
Conversation

@ikawrakow
Owner

In DeepSeek models there are two additional tensors, *attn_k_b.weight and *attn_v_b.weight, that are required for MLA. When MLA is enabled, these get used for the attention computation. When standard attention is used, the *attn_kv_b.weight tensors are used instead. Hence, when one has used standard attention to compute the imatrix, there will be no data for *attn_k_b.weight and *attn_v_b.weight; if one uses MLA, there will be no data for *attn_kv_b.weight. As the *attn_v_b.weight tensors are simply the lower half of *attn_kv_b.weight (i.e., the second half of rows), they "see" the exact same activations as the *attn_kv_b.weight tensors. This PR takes advantage of that and enables using *attn_kv_b.weight imatrix data for *attn_v_b.weight and vice versa.
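A minimal sketch of the kind of fallback this enables (not the actual PR code; the map type, function name, and name matching are assumptions for illustration): when a lookup for one of the two tensors finds no imatrix entry, the sibling tensor's entry can be used, since both tensors multiply the same 512-dimensional activations.

```cpp
// Sketch only: imatrix data keyed by tensor name, one per-column
// statistic per input column of the tensor.
#include <map>
#include <string>
#include <vector>

using imatrix_map = std::map<std::string, std::vector<float>>;

// Look up imatrix data for `name`; if absent, fall back to the sibling
// tensor (attn_kv_b.weight <-> attn_v_b.weight), whose statistics are
// interchangeable because both tensors see the same activations.
static const std::vector<float> * get_imatrix_with_fallback(
        const imatrix_map & data, const std::string & name) {
    if (auto it = data.find(name); it != data.end()) {
        return &it->second;
    }
    std::string alt = name;
    if (auto pos = alt.find("attn_kv_b.weight"); pos != std::string::npos) {
        alt.replace(pos, 16, "attn_v_b.weight");
    } else if (auto pos = alt.find("attn_v_b.weight"); pos != std::string::npos) {
        alt.replace(pos, 15, "attn_kv_b.weight");
    } else {
        return nullptr; // no fallback for other tensors (e.g. attn_k_b.weight)
    }
    if (auto it = data.find(alt); it != data.end()) {
        return &it->second;
    }
    return nullptr;
}
```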

The situation with *attn_k_b.weight is more tricky and will require a much bigger change to fix. *attn_k_b.weight is the transposed upper half of *attn_kv_b.weight. The *attn_kv_b.weight tensors have a shape of 512 x 4096, so the upper half is 512 x 2048. At run time this upper half multiplies activations X to produce a 2048 x n_token tensor, which is then viewed as 128 x n_token x 16 for further processing by the 16 attention heads. On the other hand, *attn_k_b.weight is stored as 128 x 8192 and is viewed as 128 x 512 x 16 for multiplication with the query Q, so the imatrix data collection function sees a matrix with just 128 columns, which is quite useless for actually guiding the quantization process. To make this useful, a modification in the imatrix tool is required to collect data for 128 x 16 columns, along with a modification in the quantization function to make use of imatrix data with 128 x 16 columns. This is left for a future PR, so for now there will be no imatrix data for *attn_k_b.weight even if the imatrix was computed with MLA enabled.
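To make the shape bookkeeping concrete, here is a rough sketch (names, layout, and signature are assumptions, not the actual imatrix tool code) of what per-head collection for *attn_k_b.weight could look like: instead of pooling all 16 heads into the same 128 slots, one statistic is kept per (column, head) pair, giving the 128 x 16 values mentioned above.

```cpp
#include <cstddef>
#include <vector>

// x holds the activations multiplied by attn_k_b, laid out as
// [n_embd_head x n_tokens x n_head] (contiguous, head index outermost).
// Accumulate one sum of squares per (column, head) pair, i.e.
// n_embd_head * n_head values (128 * 16 = 2048 for DeepSeek), instead of
// the 128 values the current collector would see.
static void accumulate_k_b_imatrix(const float * x,
        int n_embd_head, int n_tokens, int n_head,
        std::vector<float> & sums) {
    sums.assign((size_t)n_embd_head * n_head, 0.0f);
    for (int h = 0; h < n_head; ++h) {
        for (int t = 0; t < n_tokens; ++t) {
            const float * col = x + ((size_t)h * n_tokens + t) * n_embd_head;
            for (int j = 0; j < n_embd_head; ++j) {
                sums[(size_t)h * n_embd_head + j] += col[j] * col[j];
            }
        }
    }
}
```

The quantization function would then need the matching change on its side, indexing these 2048 statistics per head rather than expecting one value per raw column of the 128 x 8192 tensor.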

@davidsyoung

This is great. For lack of better understanding: if I am using an imatrix file that I assume was computed with standard attention, and I re-compute it now, should I see better performance due to the attn_v_b.weight tensor now having imatrix data?

It is of course still lacking imatrix data for the attn_k_b.weight tensor. It would be interesting to understand what difference these changes make to perplexity.

@ikawrakow
Owner Author

If you are quantizing the attention tensors to q8_0 you will not see a difference. The imatrix helps a lot for 1-, 2-, and 3-bit quantization, has a more modest impact at 4 bits, has almost no impact at 5 bits, and has basically no impact at 6+ bits.

@davidsyoung

> If you are quantizing the attention tensors to q8_0 you will not see a difference. The imatrix helps a lot for 1-, 2-, and 3-bit quantization, has a more modest impact at 4 bits, has almost no impact at 5 bits, and has basically no impact at 6+ bits.

Great to know, thank you!
