Do not quantize activations if not necessary #79

ikawrakow · 2024-10-04T07:08:59Z

It has always bugged me that ggml unnecessarily repeats the "quantization" of activations when the corresponding matrix multiplication cannot be done directly. E.g., Q, K and V all multiply the input to the self-attention layer. Similarly, ffn_up and ffn_gate multiply the same activations for parallel FFNs. "Quantization" is in quotes, because it applies to fp16 and bf16 tensors when the matrix multiplication function used does not work directly with fp32 activations. There are typically 7 tensors per layer in a transformer model, so basically 3 out of 7 "quantizations" are unnecessary.

This PR remedies this unfortunate situation by storing "quantized" activations in a dedicated part of the work buffer (so the data cannot be trashed by other ops that also need a work buffer), and by remembering the name of the last tensor that was quantized. I was hoping that by avoiding the unnecessary quantization we can also skip the thread synchronization barrier that we have in ggml_compute_forward_mul_mat after quantization, but I guess I'm missing something because skipping the barrier may hang the inference pipeline, so for now the barrier is still there.

Quantization takes a relatively small fraction of the overall graph evaluation time, so performance gains are typically in the ~1% range. But for a bf16 model with a long context I'm finding a non-trivial performance improvement when running on a CPU with native bf16 support (Ryzen-7950X). Here is a comparison for LLaMA-3.1-8B with a context of 8192 tokens

model	size	params	backend	threads	type_k	type_v	fa	test	t/s
llama 8B BF16 (main)	14.96 GiB	8.03 B	CPU	16	bf16	bf16	1	pp8192	178.64 ± 0.69
llama 8B BF16 (PR)	14.96 GiB	8.03 B	CPU	16	bf16	bf16	1	pp8192	188.28 ± 0.49

5.4% gain in performance is nothing to sneeze at, especially considering how minor the necessary code change is.

This reverts commit 0bf4d99.

* Revert "Do not quantize activations if not necessary (#79)" This reverts commit 0bf4d99. * Fixed compilation after revert --------- Co-authored-by: Iwan Kawrakow <[email protected]>

Iwan Kawrakow added 2 commits October 4, 2024 09:45

Do not quantize activations if not necessary

0b79e5e

Do not quantize activations if not necessary also for MoE models

ed477f1

ikawrakow merged commit 0bf4d99 into main Oct 4, 2024

ikawrakow pushed a commit that referenced this pull request Feb 7, 2025

Revert "Do not quantize activations if not necessary (#79)"

4daff2f

This reverts commit 0bf4d99.

ikawrakow mentioned this pull request Feb 7, 2025

Revert #79 #192

Merged

ikawrakow added a commit that referenced this pull request Feb 8, 2025

Revert #79 (#192)

6d7b58e

* Revert "Do not quantize activations if not necessary (#79)" This reverts commit 0bf4d99. * Fixed compilation after revert --------- Co-authored-by: Iwan Kawrakow <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Do not quantize activations if not necessary #79

Do not quantize activations if not necessary #79

Uh oh!

ikawrakow commented Oct 4, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Do not quantize activations if not necessary #79

Do not quantize activations if not necessary #79

Uh oh!

Conversation

ikawrakow commented Oct 4, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants