[Intel GPU] Support RegisterXPU.cpp codegen and compile for the in-tree XPU structured GEMM OPs. #139025
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139025
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 3912761 with merge base 2ede4c9.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Attention! native_functions.yaml was changed
If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs: one that adds the new C++ functionality, and one that makes use of it from Python; land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.
Caused by:
torchgen/gen.py (Outdated)

```python
if options.backend_whitelist and str(DispatchKey.XPU) in options.backend_whitelist:
    options.xpu = True
```

Invalid code changes.
```python
xpu_in_whitelist = (
    options.backend_whitelist and str(DispatchKey.XPU) in options.backend_whitelist
)
if (not options.xpu) and (not xpu_in_whitelist):
    ignore_keys.add(DispatchKey.XPU)
```

```python
if DispatchKey.XPU in dispatch_keys:
    del dispatch_keys[dispatch_keys.index(DispatchKey.XPU)]
```

Please add some comments to elaborate on the motivation.
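A commented version of these snippets might look like the sketch below. The comments are an illustration of the motivation described in this PR (XPU ops are split between in-tree and out-of-tree registrations), not the exact wording that was committed; `options`, `ignore_keys`, `dispatch_keys`, and `DispatchKey` come from the surrounding torchgen/gen.py context.

```python
# XPU codegen is opt-in: unless XPU was requested explicitly (--xpu) or
# appears in the backend whitelist, ignore the key so that no RegisterXPU.cpp
# output is declared for non-XPU builds.
xpu_in_whitelist = (
    options.backend_whitelist and str(DispatchKey.XPU) in options.backend_whitelist
)
if (not options.xpu) and (not xpu_in_whitelist):
    ignore_keys.add(DispatchKey.XPU)

# XPU kernels are registered from two sources (in-tree native_functions.yaml
# and out-of-tree torch-xpu-ops), so take XPU out of the generic per-backend
# loop and handle it on its own path.
if DispatchKey.XPU in dispatch_keys:
    del dispatch_keys[dispatch_keys.index(DispatchKey.XPU)]
```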
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
[AOTI] Introduce an extensibility mechanism for the c shim codegen to make it easy to produce c shims for out-of-tree OP kernels as well. Add c shim for XPU. (#136742)

### Motivation
The current c shim codegen only produces C wrappers for ops registered in `aten/src/ATen/native/native_functions.yaml`. For the same backend, a portion of out-of-tree ops are not registered in that file but are registered externally, for example in `third_party/torch-xpu-ops/yaml/native_functions.yaml`. In this case, the existing codegen cannot extend the already-produced in-tree c shims with shims for the out-of-tree ops.

### Design
To extend the c shim with more ops for a backend from out-of-tree, this PR provides a bool option `--aoti-extend` to indicate that the codegen is extending the c shim from out-of-tree. The generated c shim is stored in the `extend` subdirectory, for example:

```
torch/include/torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.h
torch/include/torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.cpp
torch/include/torch/csrc/inductor/aoti_torch/generated/extend/c_shim_xpu.h
torch/include/torch/csrc/inductor/aoti_torch/generated/extend/c_shim_xpu.cpp
```

Example usage:

```
python -m torchgen.gen --source-path third_party/torch-xpu-ops/yaml/ --xpu --aoti-extend --update-aoti-c-shim
```

- `--xpu`: generate the c shim for XPU.
- `--aoti-extend`: the out-of-tree ops (defined in `third_party/torch-xpu-ops/yaml/native_functions.yaml`) extend the in-tree ops (defined in `aten/src/ATen/native/native_functions.yaml`).
- `--update-aoti-c-shim`: always generate c_shim_xpu.h for the extend c_shim.

Pull Request resolved: #136742
Approved by: https://github.com/EikanWang, https://github.com/desertfire
ghstack dependencies: #139025
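For illustration, the `extend` subdirectory selection described above could be wired up roughly as in the sketch below; the helper name `aoti_shim_output_dir` and its signature are hypothetical, not the actual torchgen internals.

```python
from pathlib import Path

def aoti_shim_output_dir(generated_root: str, aoti_extend: bool) -> Path:
    """Hypothetical helper: pick where c_shim_xpu.{h,cpp} are written.

    With --aoti-extend, out-of-tree shims land in an `extend` subdirectory
    so they never clobber the in-tree c shims generated from
    aten/src/ATen/native/native_functions.yaml.
    """
    root = Path(generated_root)
    return root / "extend" if aoti_extend else root

# Example: aoti_shim_output_dir("generated", aoti_extend=True) -> generated/extend
```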
Before this change, if one builds PyTorch without XPU, the build process will perpetually regenerate code because of a reference to a non-existent file, which makes the autograd codegened files always out of date. See part of the `ninja -d explain torch_cpu` output:

```
ninja explain: output ../torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.cpp doesn't exist
ninja explain: output third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl of phony edge with no inputs doesn't exist
ninja explain: third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl is dirty
ninja explain: /Users/malfet/git/pytorch/pytorch/torch/csrc/autograd/generated/Functions.cpp is dirty
```

This is a regression introduced by #139025. After this change, incremental rebuilds with no changes cause no build actions:

```
% ninja -j1 -v -d explain -n torch_cpu
ninja explain: output third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl of phony edge with no inputs doesn't exist
ninja explain: third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl is dirty
ninja: no work to do.
```

Test plan: Wait for at least one XPU build to finish.

Fixes #140432

Pull Request resolved: #140438
Approved by: https://github.com/kit1980, https://github.com/huydhn
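The idea behind the fix can be sketched as below. This is a minimal illustration of the principle, using a hypothetical `aoti_generated_shim_sources` helper; it is not the actual #140438 diff, which lives in the build/codegen integration.

```python
def aoti_generated_shim_sources(enabled_backends: set[str]) -> list[str]:
    """Declare shim sources only for backends the build will actually generate.

    Listing c_shim_xpu.cpp unconditionally makes ninja depend on a file that
    a non-XPU build never produces, so codegen reruns on every build; gating
    it keeps no-change rebuilds at "ninja: no work to do."
    """
    gen = "torch/csrc/inductor/aoti_torch/generated"
    sources = [f"{gen}/c_shim_cpu.cpp"]
    if "XPU" in enabled_backends:
        sources.append(f"{gen}/c_shim_xpu.cpp")
    return sources
```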
[Intel GPU] Support RegisterXPU.cpp codegen and compile for the in-tree XPU structured GEMM ops. (pytorch#139025)

Motivation: There are two parts of ATen ops for XPU: one is in-tree ops, like the GEMM-related ops, and the other is out-of-tree ops in torch-xpu-ops. For the in-tree part, since PyTorch uses native_functions.yaml registration and is equipped with convenient codegen capabilities, we want to take advantage of these benefits as well. At the same time, since AOT Inductor also uses native_functions.yaml to generate c shim wrappers, we also need to enable this mechanism for XPU.

Pull Request resolved: pytorch#139025
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
Stack from ghstack (oldest at bottom):
[Intel GPU] Support RegisterXPU.cpp codegen and compile for the in-tree XPU structured GEMM ops.
Motivation: There are two parts of ATen ops for XPU: one is in-tree ops, like the GEMM-related ops, and the other is out-of-tree ops in torch-xpu-ops. For the in-tree part, since PyTorch uses native_functions.yaml registration and is equipped with convenient codegen capabilities, we want to take advantage of these benefits as well.
At the same time, since AOT Inductor also uses native_functions.yaml to generate c shim wrappers, we also need to enable this mechanism for XPU.
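As a quick sketch of what this enables (assuming a PyTorch build with XPU support and an available XPU device):

```python
import torch

# With this PR, in-tree structured GEMM ops such as addmm dispatch to the
# XPU backend through the codegenned RegisterXPU.cpp registrations.
a = torch.randn(4, 8, device="xpu")
b = torch.randn(8, 16, device="xpu")
bias = torch.zeros(4, 16, device="xpu")
out = torch.addmm(bias, a, b)  # executes via the XPU dispatch key
```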
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov