
Conversation

@Valentine233 (Collaborator) commented Jul 22, 2024

Add an int8 woq mm pattern for llama, which successfully hits all 675 woq linears.

Implementation

Differences with previous patterns (a sketch of the matched computation follows this list):

  • scale is fp32 instead of bf16.
  • no reshape for x and output.
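
For context, here is a minimal standalone sketch (not the actual pattern-matcher code in this PR) of the decomposed computation such a woq pattern targets and the fused ATen op it maps to. The shapes are llama-like ones discussed later in the thread, and casting the scale in the fused call is an assumption about what the ATen kernel expects, not something stated in this PR.

```python
import torch

# Illustrative llama-like shapes: [M, K] activations, [N, K] int8 weight, per-output-channel scale.
M, K, N = 128, 4096, 4096
x = torch.randn(M, K, dtype=torch.bfloat16)                  # bf16 activations, used as-is (no reshape)
w_int8 = torch.randint(-128, 127, (N, K), dtype=torch.int8)  # weight-only-quantized weight
scale = torch.rand(N, dtype=torch.float32)                   # fp32 scale, per this PR

# Decomposed form the pattern is meant to match:
decomposed = torch.nn.functional.linear(x, w_int8.to(x.dtype)) * scale

# Fused form after the match; casting the scale to the activation dtype is an
# assumption here (see the fp32-vs-bf16 discussion further down in the thread).
fused = torch.ops.aten._weight_int8pack_mm(x, w_int8, scale.to(x.dtype))
```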

Performance

Performance data on llama, with 1 NUMA node, in freezing mode:

  • Before pattern match:
    ---------- Summary: ----------
    inference-latency: 4.810 sec.
    first-token-latency: 0.280 sec.
    rest-token-latency: 0.146 sec.
    P90-rest-token-latency: 0.148 sec.

  • After pattern match:
    ---------- Summary: ----------
    inference-latency: 2.537 sec.
    first-token-latency: 0.964 sec.
    rest-token-latency: 0.051 sec.
    P90-rest-token-latency: 0.052 sec.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

pytorch-bot commented Jul 22, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/131310

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f3b3961 with merge base c4bf400:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@leslie-fang-intel (Collaborator) commented:

Why is the first-token latency even worse after this PR?

@leslie-fang-intel left a review comment:

Just curious: why has the data type of the scales changed from bf16 to fp32? Using fp32 here incurs the overhead of two conversions, and I'm not sure how much overhead it will induce. Have you tested the UT added in TorchAO? Can it hit the pattern matcher now?

@albanD added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Jul 23, 2024
@Valentine233 (Collaborator, Author) commented:

Summary

  1. The fp32-related discussions above are expected to be resolved by ao#534 ([int8 woq] make the scale type the same as the input for bf16 autocast).
  2. The first-token regression needs further analysis.

A comment from @sanchitintel was marked as outdated.

@Valentine233 (Collaborator, Author) commented:

> Hi @Valentine233, can you please add info about the precise benchmark you used, perhaps from your bash history? I used the following command, but the benchmark goes on for a long time, and only posts E2E latency. I'm guessing computing per-token latency requires modifying the transformers library. Please confirm, thanks!
>
> numactl --membind=0 --cpunodebind=0 -C 0-31 python run_llm.py -m "meta-llama/Llama-2-7b-hf" --device cpu --dtype bf16 --output_dir $(pwd)/tmp --batch-size 256 --ws-total-cores 32 --ws-cores-per-instance 32 --weight-dtype INT8 --benchmark --torchao --inductor --num-iter 10 --num-warmup 1 --profile
>
> I'm also seeing this warning. Please advise if it's normal. Thanks!
>
> W0804 17:20:34.220000 2951501 torch/_dynamo/convert_frame.py:828] [0/8] torch._dynamo hit config.cache_size_limit (8)
> W0804 17:20:34.220000 2951501 torch/_dynamo/convert_frame.py:828] [0/8]    function: 'forward' (transformers/src/transformers/models/llama/modeling_llama.py:639)
> W0804 17:20:34.220000 2951501 torch/_dynamo/convert_frame.py:828] [0/8]    last reason: 0/0: ___check_obj_id(L['past_key_values'], 94511674602464)
> W0804 17:20:34.220000 2951501 torch/_dynamo/convert_frame.py:828] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
> W0804 17:20:34.220000 2951501 torch/_dynamo/convert_frame.py:828] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.

This is the script to run on one node:

numactl -C 56-111 -m 1 python ../../../../../../models/language_modeling/pytorch/llama/inference/cpu/run_llm.py --benchmark --num-warmup 1 --num-iter 2 --token-latency --dtype 'bf16' -m 'meta-llama/Llama-2-7b-hf' --max-new-tokens 32 --input-tokens 32 --batch-size 1 --weight-only-quant --torchao --weight-dtype INT8

I can't remember about the warning.

@sanchitintel (Collaborator) commented:

After offline discussion with @Valentine233, it turns out that the runtime error I'm encountering at my end (`torch._dynamo.exc.InternalTorchDynamoError: 'PlainAQTLayout' object has no attribute 'layout_type'`) is similar to pytorch/ao#534 (comment).

@Valentine233, is it possible for you to rebase your local PyTorch repo & verify whether or not you're also encountering the same problem? Thanks

@Valentine233 (Collaborator, Author) commented:

Update on the first-token regression:

The first token has a regression because aten::_weight_int8pack_mm is slower than the decomposed computation for certain input shapes. cc @mingfeima @jgong5 @leslie-fang-intel

First token

For the first token, the input shapes are x: [128, 4096], w: [4096, 4096], scale: [4096].
Profiling:
  • With woq pattern: (profiling screenshot)
  • Without woq pattern: (profiling screenshot)

Next token

For the next token, the input shapes are x: [4, 4096], w: [4096, 4096], scale: [4096].
Profiling:
  • With woq pattern: (profiling screenshot)
  • Without woq pattern: (profiling screenshot)

Reproduce

You could reproduce the result by running the UT added in this PR.
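
For anyone without the UT handy, here is a rough standalone micro-benchmark sketch (not the PR's actual unit test) comparing the fused ATen op with the decomposed computation for the two shapes above; absolute numbers depend on the machine, thread count, and memory binding.

```python
import torch
from torch.utils.benchmark import Timer

def bench(M, K=4096, N=4096, dtype=torch.bfloat16):
    # Random data with the shapes from the profiling runs above.
    x = torch.randn(M, K, dtype=dtype)
    w = torch.randint(-128, 127, (N, K), dtype=torch.int8)
    scale = torch.rand(N, dtype=dtype)
    env = {"torch": torch, "x": x, "w": w, "scale": scale}
    fused = Timer("torch.ops.aten._weight_int8pack_mm(x, w, scale)", globals=env).blocked_autorange()
    decomposed = Timer("torch.nn.functional.linear(x, w.to(x.dtype)) * scale", globals=env).blocked_autorange()
    print(f"M={M}: fused {fused.median * 1e6:.1f} us, decomposed {decomposed.median * 1e6:.1f} us")

bench(128)  # first-token (prefill) shape
bench(4)    # next-token (decode) shape
```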

A comment from @sanchitintel was marked as off-topic.

@sanchitintel (Collaborator) commented Aug 8, 2024

While auto-tuning would be 1.5x faster for the first token, it too would have resulted in a regression for the first token.
The shape x: [128, 4096], w: [4096, 4096], scale: [4096] seems to do better if the weights are dequantized upfront.

However, the overall performance order (higher is better) is:

Auto-tuning int8 WoQ GEMM (another PR that needs this PR for LLaMA2) > ATen int8 WoQ GEMM kernel (this PR) > Dequantizing weights upfront
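
For clarity, here is a hedged sketch of the "dequantizing weights upfront" baseline referred to above: the int8 weight is converted to the activation dtype once (e.g. at freezing/load time) with the scale folded in, so the hot path becomes a plain bf16 GEMM.

```python
import torch

w_int8 = torch.randint(-128, 127, (4096, 4096), dtype=torch.int8)
scale = torch.rand(4096, dtype=torch.bfloat16)

# Done once and cached (e.g. during constant folding at freezing time), not per call:
w_dequant = w_int8.to(torch.bfloat16) * scale.unsqueeze(1)

x = torch.randn(128, 4096, dtype=torch.bfloat16)  # large-M first-token shape favors this path
y = torch.nn.functional.linear(x, w_dequant)      # hot path is a plain bf16 GEMM
```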

pytorchmergebot pushed a commit that referenced this pull request Aug 10, 2024
## Summary

As part of #125683, this PR modifies existing CPU GEMM cpp template & micro-kernel template to enable int8 WoQ GEMM auto-tuning with AVX2, AVX512 & AMX ISAs (the latter is only available on Xeon 4th generation & beyond).

WoQ GEMM takes FP16/BF16 activations, int8 weights, and scale of the same dtype as activations.
The operation is equivalent to `torch.nn.functional.linear(x, w.to(x.dtype)) * scale`, which is essentially what the ATen op `torch.ops.aten._weight_int8pack_mm` currently does (except that weights are not cached by it). Weights will be considered constant & cached, so this implementation is suitable for inference, and not QAT. `scale` is supported as a `mul` epilogue.

Only BF16 activations are supported in this PR because, for FP16 & FP32, the weight is dequantized during the constant-folding pass of freezing, and after auto-tuning, performance with a large `M` dimension may then be better than either torch.ops.aten._weight_int8pack_mm or the WoQ micro-kernel support introduced in this PR, which dequantizes `w` within the micro-kernel.
While even BF16 activations with a large `M` dimension may benefit from dequantizing `w` beforehand, for now they use the WoQ support in the GEMM templates for auto-tuning; a subsequent PR will add logic for deciding whether or not to dequantize weights beforehand.

### Performance
#### AMX
Op-level speedup due to AMX micro-kernel (selected during auto-tuning) on 32 physical cores of Intel(R) Xeon(R) Platinum 8468H (of Xeon 4th generation series, codenamed Sapphire Rapids) vs. ATen kernel `torch.ops.aten._weight_int8pack_mm`. Intel OpenMP & tcmalloc were preloaded.

In a few cases with an odd `K`, the implementation added in this PR may not perform as well as the ATen kernel. This is unrelated to this PR, though, since `test_linear_amx` exhibits similar datapoints. In those cases, the AMX micro-kernel might be slower than the AVX512 micro-kernel, so if such sets of shapes are used for auto-tuning, either the AVX512 micro-kernel implementation or the ATen kernel would be chosen instead.

Benchmarked with unit-tests.

Tabular data at https://gist.github.com/sanchitintel/294811a86c8ff6b867c668ae2107c405?permalink_comment_id=5142442#gistcomment-5142442

The AVX512 micro-kernel was disabled to collect data for AMX micro-kernel.

#### AVX2/AVX512 micro-kernels

Tabular data at https://gist.github.com/sanchitintel/52b5fa9c66f791be19e48e2aa6423dc4?permalink_comment_id=5142437#gistcomment-5142437

### Follow-up
1. int4 WoQ GEMM micro-kernel will also be added in a separate PR.
2. A subsequent PR would add logic for deciding whether or not to dequantize weights beforehand.

E2E perf measurement should be done with #131310.

Pull Request resolved: #131887
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
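
As a usage note, here is a hedged sketch of how the cpp GEMM template auto-tuning described in the commit message above is typically exercised. The inductor config flags below are the usual freezing/max-autotune knobs, but exact names, defaults, and backend strings may differ across PyTorch versions.

```python
import torch
import torch._inductor.config as inductor_config

inductor_config.freezing = True                           # treat weights as constants (inference only)
inductor_config.max_autotune = True                       # enable GEMM auto-tuning
inductor_config.max_autotune_gemm_backends = "CPP,ATEN"   # compete cpp template against the ATen kernel

class WoqLinear(torch.nn.Module):
    def __init__(self, K=4096, N=4096):
        super().__init__()
        self.register_buffer("w", torch.randint(-128, 127, (N, K), dtype=torch.int8))
        self.register_buffer("scale", torch.rand(N, dtype=torch.bfloat16))

    def forward(self, x):
        # Decomposed int8 WoQ linear; the scale ends up as a mul epilogue after the GEMM.
        return torch.nn.functional.linear(x, self.w.to(x.dtype)) * self.scale

with torch.no_grad():
    mod = torch.compile(WoqLinear().eval())
    out = mod(torch.randn(128, 4096, dtype=torch.bfloat16))
```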
pytorchmergebot pushed a commit that referenced this pull request Aug 14, 2024
@sanchitintel (Collaborator) commented Aug 19, 2024

Potential fix - since Xeon CPUs are used on machines with a large amount of RAM, we can optionally cache both quantized & dequantized weights for large values of M (based on some heuristic), and lower the corresponding FX pattern to a custom function (like the way quantized_decomposed custom functions have been defined) that accepts both quantized & dequantized weights. Then that custom function should be lowered to a template-based auto-tuning implementation of WoQ GEMM.

In such a case, the auto-tuning GEMM template should allow a fallback that could be used for large values of M, and use the cached dequantized weights if a large value of M would be encountered at runtime.
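
To make the idea concrete, here is a hedged sketch of such a wrapper; the class name, the M threshold, and the dispatch logic are hypothetical illustrations, not an existing PyTorch API.

```python
import torch

M_THRESHOLD = 64  # hypothetical cutoff; a real heuristic would be tuned per platform

class CachedWoqLinear(torch.nn.Module):
    def __init__(self, w_int8: torch.Tensor, scale: torch.Tensor):
        super().__init__()
        self.register_buffer("w_int8", w_int8)
        self.register_buffer("scale", scale)
        # Dequantized copy cached up front, trading RAM for large-M (prefill) speed.
        self.register_buffer("w_dequant", w_int8.to(scale.dtype) * scale.unsqueeze(1))

    def forward(self, x):
        if x.shape[0] >= M_THRESHOLD:
            # Large M (e.g. first token / prefill): plain GEMM on the cached dequantized weight.
            return torch.nn.functional.linear(x, self.w_dequant)
        # Small M (decoding): fused int8 WoQ kernel.
        return torch.ops.aten._weight_int8pack_mm(x, self.w_int8, self.scale)

# Usage with the shapes discussed earlier in the thread:
lin = CachedWoqLinear(torch.randint(-128, 127, (4096, 4096), dtype=torch.int8),
                      torch.rand(4096, dtype=torch.bfloat16))
print(lin(torch.randn(128, 4096, dtype=torch.bfloat16)).shape)  # prefill path
print(lin(torch.randn(4, 4096, dtype=torch.bfloat16)).shape)    # decode path
```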

Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 26, 2024
github-actions bot (Contributor) commented:

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

github-actions bot added the Stale label Oct 18, 2024
github-actions bot closed this Nov 17, 2024
github-actions bot deleted the woq_mm_cpu branch December 18, 2024