DeepSeek imatrix stuff #250
Merged
In DeepSeek models there are two additional tensors, `*attn_k_b.weight` and `*attn_v_b.weight`, required for MLA. When MLA is enabled, these get used for the attention computation; when standard attention is used, the `*attn_kv_b.weight` tensors are used instead. Hence, when the imatrix was computed with standard attention, there is no data for `*attn_k_b.weight` and `*attn_v_b.weight`; when it was computed with MLA, there is no data for `*attn_kv_b.weight`. As the `*attn_v_b.weight` tensors are simply the lower half of `*attn_kv_b.weight` (i.e., the second half of the rows), they "see" exactly the same activations as the `*attn_kv_b.weight` tensors. This PR takes advantage of this and enables using `*attn_kv_b.weight` imatrix data for `*attn_v_b.weight` and vice versa, as sketched below.
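A minimal sketch of what such a fallback could look like at quantization time, assuming the imatrix data is kept in a map keyed by tensor name; the `get_imatrix_for` helper and the exact name handling below are illustrative, not the code from this PR:

```cpp
// Sketch only: fall back to the sibling tensor's imatrix data when the
// requested tensor has none. get_imatrix_for() is a hypothetical helper.
#include <string>
#include <unordered_map>
#include <vector>

using imatrix_map = std::unordered_map<std::string, std::vector<float>>;

// Both *attn_v_b.weight and *attn_kv_b.weight multiply the same activations
// (same number of columns), so their per-column imatrix data is interchangeable.
static const std::vector<float> * get_imatrix_for(const imatrix_map & imatrix,
                                                  const std::string & tensor_name) {
    auto it = imatrix.find(tensor_name);
    if (it != imatrix.end()) return &it->second;

    // No data for this tensor: try the sibling name.
    std::string sibling = tensor_name;
    auto replace = [&](const std::string & from, const std::string & to) {
        auto pos = sibling.find(from);
        if (pos == std::string::npos) return false;
        sibling.replace(pos, from.size(), to);
        return true;
    };
    if (replace("attn_v_b.weight", "attn_kv_b.weight") ||
        replace("attn_kv_b.weight", "attn_v_b.weight")) {
        it = imatrix.find(sibling);
        if (it != imatrix.end()) return &it->second;
    }
    return nullptr; // quantize without imatrix guidance
}
```

With a fallback along these lines, `*attn_v_b.weight` can be quantized with guidance from an imatrix computed with standard attention, and `*attn_kv_b.weight` from one computed with MLA.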
The situation with `*attn_k_b.weight` is trickier and will require a much bigger change to fix. `*attn_k_b.weight` is the transposed upper half of `*attn_kv_b.weight`. The `*attn_kv_b.weight` tensors have a shape of `512 x 4096`, so the upper half is `512 x 2048`. At run time this upper half multiplies the activations `X` to produce a `2048 x n_token` tensor, which is then viewed as `128 x n_token x 16` for further processing by the 16 attention heads. `*attn_k_b.weight`, on the other hand, is stored as `128 x 8192` and is viewed as `128 x 512 x 16` for the multiplication with the query `Q`, so the imatrix data collection function sees a matrix with just 128 columns, which is quite useless for actually guiding the quantization process. To make this useful, the `imatrix` tool needs to be modified to collect data for `128 x 16` columns, along with a matching modification in the quantization function to make use of imatrix data with `128 x 16` columns. This is left for a future PR, so for now there will be no imatrix data for `*attn_k_b.weight` even if the imatrix was computed with MLA enabled.
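For illustration, a minimal sketch of what per-head collection for `*attn_k_b.weight` could look like; the activation layout assumed here and the `accumulate_head_imatrix` helper are assumptions for this sketch, not code from the `imatrix` tool or this PR:

```cpp
// Sketch only: accumulate per-head imatrix statistics, assuming the
// activations are laid out as [n_embd_head = 128, n_tokens, n_head = 16],
// i.e. the per-head view used for the K_b * Q multiplication.
#include <cstdint>
#include <vector>

static void accumulate_head_imatrix(const float * activations,
                                    int64_t n_embd_head, int64_t n_tokens, int64_t n_head,
                                    std::vector<float> & values) {
    // One entry per (head, column) pair: 128 x 16 = 2048 columns instead of
    // the 128 columns the current collection code sees.
    values.resize((size_t)(n_embd_head * n_head), 0.0f);
    for (int64_t h = 0; h < n_head; ++h) {
        float * v = values.data() + h * n_embd_head;
        for (int64_t t = 0; t < n_tokens; ++t) {
            const float * x = activations + (h * n_tokens + t) * n_embd_head;
            for (int64_t i = 0; i < n_embd_head; ++i) {
                v[i] += x[i] * x[i]; // per-head, per-column sum of squares
            }
        }
    }
}
```

The quantization function would then need a matching change to interpret these `128 x 16` columns per head, which is what this PR defers to future work.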