Do not quantize activations if not necessary #79
Merged
It has always bugged me that `ggml` unnecessarily repeats the "quantization" of activations when the corresponding matrix multiplication cannot be done directly. E.g., `Q`, `K` and `V` all multiply the input to the self-attention layer. Similarly, `ffn_up` and `ffn_gate` multiply the same activations for parallel FFNs. "Quantization" is in quotes, because it applies to `fp16` and `bf16` tensors when the matrix multiplication function used does not work directly with `fp32` activations. There are typically 7 tensors per layer in a transformer model, so basically 3 out of 7 "quantizations" are unnecessary.

This PR remedies this unfortunate situation by storing "quantized" activations in a dedicated part of the work buffer (so the data cannot be trashed by other ops that also need a work buffer), and by remembering the name of the last tensor that was quantized. I was hoping that by avoiding the unnecessary quantization we can also skip the thread synchronization barrier that we have in `ggml_compute_forward_mul_mat` after quantization, but I guess I'm missing something because skipping the barrier may hang the inference pipeline, so for now the barrier is still there.

Quantization takes a relatively small fraction of the overall graph evaluation time, so performance gains are typically in the ~1% range. But for a `bf16` model with a long context I'm finding a non-trivial performance improvement when running on a CPU with native `bf16` support (Ryzen-7950X). Here is a comparison for LLaMA-3.1-8B with a context of 8192 tokens:

A 5.4% gain in performance is nothing to sneeze at, especially considering how minor the necessary code change is.
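To make the idea concrete, here is a minimal, self-contained sketch of the "remember the last quantized tensor" approach. It is not the code from this PR: `act_cache`, `quantize_row`, `get_quantized`, `count_layer_conversions` and the tensor names are all hypothetical, the toy 8-bit conversion merely stands in for whatever conversion the real matrix multiplication needs, and a single-threaded loop replaces the real thread pool and barrier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_NAME  64
#define MAX_ELEMS 256

/* Hypothetical cache, standing in for a dedicated part of the work buffer. */
typedef struct {
    char   last_name[MAX_NAME]; /* name of the last activation tensor converted */
    int8_t data[MAX_ELEMS];     /* converted ("quantized") activations */
    float  scale;               /* scale used by the toy conversion */
    int    n_conversions;       /* how many conversions were actually performed */
} act_cache;

/* Toy symmetric 8-bit quantization; an assumption, not ggml's real routine. */
static void quantize_row(const float *x, size_t n, int8_t *q, float *scale) {
    float amax = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > amax) amax = a;
    }
    *scale = amax / 127.0f;
    float inv = amax > 0.0f ? 127.0f / amax : 0.0f;
    for (size_t i = 0; i < n; i++) {
        float v = x[i] * inv;
        q[i] = (int8_t)(v < 0.0f ? v - 0.5f : v + 0.5f); /* round to nearest */
    }
}

/* Convert the activations only if they differ (by name) from the last ones. */
static const int8_t *get_quantized(act_cache *c, const char *name,
                                   const float *x, size_t n) {
    assert(n <= MAX_ELEMS);
    if (strncmp(c->last_name, name, MAX_NAME) != 0) {
        quantize_row(x, n, c->data, &c->scale);
        strncpy(c->last_name, name, MAX_NAME - 1);
        c->last_name[MAX_NAME - 1] = '\0';
        c->n_conversions++;
    }
    return c->data;
}

/* Walk the 7 matrix multiplications of one layer and count conversions.
 * The activation names below are illustrative only. */
int count_layer_conversions(void) {
    static const char *inputs[7] = {
        "attn_norm-0",    /* Q        */
        "attn_norm-0",    /* K        */
        "attn_norm-0",    /* V        */
        "kqv_out-0",      /* attention output projection */
        "ffn_norm-0",     /* ffn_up   */
        "ffn_norm-0",     /* ffn_gate */
        "ffn_gate_par-0", /* ffn_down */
    };
    const float x[8] = {0.5f, -1.0f, 2.0f, 0.0f, 3.5f, -2.25f, 1.0f, 0.125f};
    act_cache cache = {0};
    for (int i = 0; i < 7; i++) {
        (void)get_quantized(&cache, inputs[i], x, 8);
    }
    return cache.n_conversions;
}
```

Calling `count_layer_conversions()` from a `main` returns 4 rather than 7, which matches the "3 out of 7" estimate above. Matching by name alone is only safe if a name uniquely identifies the activation data for the lifetime of the cached copy, which this sketch simply assumes.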