Conversation

@leslie-fang-intel leslie-fang-intel commented Jun 19, 2024

Stack from ghstack (oldest at bottom):

Summary
Based on the previous PR, add the configuration to support int8 output and unary post-op fusion with ReLU and GELU

  • Activation dtype: uint8
  • Weight dtype: int8
  • Output dtype: float32/bfloat16/uint8
  • Post Op Fusion: with unary post operator fusion

Test Plan

clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_with_pointwise

Next Step

  • [✓] Unary post op fusion
  • [✓] Int8 output
  • Binary Fusion
  • AMX int8 MicroGEMM Kernel

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang
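The configuration above (uint8 activation × int8 weight, int32 accumulation, requantized uint8 output with a fused unary post-op) can be sketched in pure Python. This is an illustrative model of the arithmetic, not the template's actual code; the scale/zero-point parameter names are assumptions.

```python
# Hedged sketch: emulate a quantized linear with uint8 activation, int8
# weight, and a fused ReLU post-op, requantizing the int32 accumulator
# back to uint8. All names here are illustrative, not the real schema.
def qlinear_relu_uint8(x_u8, x_scale, x_zp, w_s8, w_scale, y_scale, y_zp):
    # x_u8: M x K uint8 activations; w_s8: N x K int8 weights (lists of lists)
    M, K, N = len(x_u8), len(x_u8[0]), len(w_s8)
    out = []
    for m in range(M):
        row = []
        for n in range(N):
            # Integer GEMM: accumulate in int32 after removing the zero point.
            acc = sum((x_u8[m][k] - x_zp) * w_s8[n][k] for k in range(K))
            y = acc * (x_scale * w_scale)       # dequantize the accumulator
            y = max(y, 0.0)                     # fused unary post-op: ReLU
            q = round(y / y_scale) + y_zp       # requantize to uint8 output
            row.append(min(255, max(0, q)))     # clamp to uint8 range
        out.append(row)
    return out

y = qlinear_relu_uint8([[10, 200]], 0.1, 128, [[5, -3], [-7, 2]],
                       0.05, 0.2, 0)
```

Because the ReLU runs on the dequantized accumulator before requantization, no separate elementwise kernel is needed, which is the point of fusing the unary op into the GEMM template.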

pytorch-bot bot commented Jun 19, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129048

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (4 Unrelated Failures)

As of commit 1e6d1a7 with merge base dabaebd:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

leslie-fang-intel added a commit to leslie-fang-intel/pytorch that referenced this pull request Jun 21, 2024
…t and Unary Post Op

ghstack-source-id: 234e69c
Pull Request resolved: pytorch#129048
```
exact_dtype=True,
)
-        self.assertEqual(counters["inductor"]["select_algorithm_autotune"], 1)
+        self.assertEqual(counters["inductor"]["select_algorithm_autotune"], 2)
```
Why?
Also, please add the check: `self.assertEqual(counters["inductor"]["cpp_epilogue_fusion_counter"], 1)`

The test case has been changed from a single Linear module to two adjacent Linear modules, which hits autotuning twice.
Thanks; added the check `self.assertEqual(counters["inductor"]["cpp_epilogue_fusion_counter"], 0)`, since the ReLU/GELU is fused into `onednn.qlinear` as part of the template rather than fused by the scheduler as an epilogue.

@leslie-fang-intel leslie-fang-intel requested a review from jansel June 28, 2024 01:23
@leslie-fang-intel

Hi @jansel, this PR may also need your kind review since I have modified torch/_inductor/utils.py.

@leslie-fang-intel

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Jul 2, 2024
**Summary**
We change the schema of QLinear Binary so it will be easier to enable the corresponding GEMM template.

- The extra input of the binary post-op is a tensor that needs to be an input node for autotuning, so we move it in front of `output_scale`, which is a scalar.
- We also move it in front of `bias`, since `bias` is an optional tensor for this fusion while `other` is required for linear binary fusion.
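The ordering rationale above can be illustrated with a simplified Python signature. This is not the real `onednn.qlinear` schema; the function and parameter names are assumptions chosen to mirror the description.

```python
# Hedged sketch of the reordered signature: the required binary-post-op
# tensor 'other' comes before the optional 'bias' and the scalar
# 'output_scale', so every leading argument is a tensor input node.
def qlinear_binary_sketch(x, w, other, bias=None, output_scale=1.0):
    acc = [xi * w + oi for xi, oi in zip(x, other)]  # gemm + binary add
    if bias is not None:
        acc = [a + bias for a in acc]                # optional bias
    return [a * output_scale for a in acc]           # scalar requant scale

out = qlinear_binary_sketch([1, 2], 3, other=[10, 20])
```

Putting required tensors first means an autotuning harness can treat the leading positional arguments uniformly as input nodes, without special-casing a scalar or an absent optional in the middle of the list.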

**Test Plan**
```
python -u -m pytest -s -v test/quantization/core/test_quantized_op.py -k qlinear
python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k qlinear
```

Pull Request resolved: #129049
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #128825, #129048
pytorchmergebot pushed a commit that referenced this pull request Jul 2, 2024
…ion (#129103)

**Summary**
Based on the previous PR, add the config to support quantized linear binary (with optional unary) post-op fusion.

- Activation dtype: uint8
- Weight dtype: int8
- Output dtype: float32/bfloat16/uint8
- Post Op Fusion: with binary and optional[Unary] post operator fusion

**Test Plan**
```
clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_with_pointwise_binary
```

**Next Step**
- [✓] Unary post op fusion
- [✓] Int8 output
- [✓] Binary Fusion
- [ ] AMX int8 MicroGEMM Kernel

Pull Request resolved: #129103
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #128825, #129048, #129049
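The binary-plus-optional-unary pattern described in this commit can be modeled as a small pure-Python sketch. The names are illustrative, not the fusion's actual API.

```python
# Hedged sketch of the fusion pattern: GEMM accumulator -> binary post-op
# (elementwise add with 'other') -> optional unary post-op (ReLU).
def fused_binary_optional_unary(acc, other, unary=None):
    out = [a + o for a, o in zip(acc, other)]  # binary post-op
    if unary == "relu":                        # optional unary post-op
        out = [max(v, 0) for v in out]
    return out

r1 = fused_binary_optional_unary([-5, 2], [1, 1])
r2 = fused_binary_optional_unary([-5, 2], [1, 1], unary="relu")
```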
pytorchmergebot pushed a commit that referenced this pull request Jul 2, 2024
**Summary**
Add the AMX micro gemm kernel with int8 data type.

**Test Plan**
```
clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_amx
```

**Next Step**
- [✓] Unary post op fusion
- [✓] Int8 output
- [✓] Binary Fusion
- [✓] AMX int8 MicroGEMM Kernel

Pull Request resolved: #129220
Approved by: https://github.com/jgong5
ghstack dependencies: #128825, #129048, #129049, #129103
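The contract the AMX micro-GEMM kernel implements can be modeled in plain Python: uint8 × int8 products accumulated into an int32 C tile. The real kernel uses AMX tile instructions and blocked tiles; the sizes and names below are illustrative.

```python
# Hedged model of the int8 micro-GEMM contract: widening uint8 x int8
# multiply-adds accumulated in int32 (the real kernel does this with
# AMX tile instructions; this loop nest only shows the arithmetic).
def micro_gemm_s32(a_u8, b_s8):
    # a_u8: M x K uint8, b_s8: K x N int8; returns M x N int32 accumulator.
    M, K, N = len(a_u8), len(b_s8), len(b_s8[0])
    c = [[0] * N for _ in range(M)]
    for m in range(M):
        for k in range(K):
            for n in range(N):
                c[m][n] += a_u8[m][k] * b_s8[k][n]  # widening multiply-add
    return c

c = micro_gemm_s32([[255, 1]], [[1], [-128]])
```

Accumulating in int32 is what makes the uint8 × int8 GEMM safe: a single product can be as large as 255 × 127, far outside int8/uint8 range.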
pytorchmergebot pushed a commit that referenced this pull request Jul 2, 2024
…129221)

**Summary**
This PR mainly refactors two things:

1. Pass the weight's data type explicitly to `create_micro_gemm` as `input2.dtype`. When registering `CppMicroGemmConfig`, we reuse `input.dtype` if `input2.dtype` is not explicitly registered.
2. Add a utility function to get the output data type and compute data type from the input data type.

Pull Request resolved: #129221
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #128825, #129048, #129049, #129103, #129220
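The second refactor point can be sketched as a small lookup helper. The function name and the exact mapping are assumptions for illustration, not the actual helper in `torch/_inductor`.

```python
# Hedged sketch of a dtype utility: derive (output dtype, compute dtype)
# from the micro-GEMM input dtype. Names and mapping are illustrative.
def get_gemm_out_and_compute_dtype(input_dtype):
    if input_dtype in ("uint8", "int8"):
        return ("int32", "int32")      # int8 GEMM accumulates in int32
    if input_dtype in ("float32", "bfloat16", "float16"):
        return ("float32", "float32")  # fp GEMM accumulates in fp32
    raise ValueError(f"unsupported input dtype: {input_dtype}")

pair = get_gemm_out_and_compute_dtype("uint8")
```

Centralizing this mapping keeps the int8 and floating-point GEMM templates from each hard-coding their accumulator types.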
pytorchmergebot pushed a commit that referenced this pull request Jul 2, 2024
… template (#129470)

**Summary**
Remove redundant INT8-specific logic in the INT8 GEMM template to unify the code structure with FP32/BF16/FP16 GEMM Template.

**Test Plan**
```
numactl -C 56-111 -m 1 python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear
```

Pull Request resolved: #129470
Approved by: https://github.com/jgong5
ghstack dependencies: #128825, #129048, #129049, #129103, #129220, #129221
pytorchmergebot pushed a commit to khushi-411/pytorch that referenced this pull request Jul 2, 2024
…t and Unary Post Op (pytorch#129048)

Pull Request resolved: pytorch#129048
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: pytorch#128825
@github-actions github-actions bot deleted the gh/leslie-fang-intel/118/head branch July 31, 2024 01:50