# [Intel GPU] xpu-ops codegen via backend whitelist #130082
## Conversation
🔗 Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/130082. ✅ No failures as of commit 55aa37a with merge base 6cbb143. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
ghstack-source-id: 3629701
Pull Request resolved: pytorch/pytorch#130082
# Motivation

Structured codegen makes it easier to decouple tensor meta setting from kernel implementation. At present, XPU operators have to handle tensor metas in a hand-written way. We plan to leverage the codegen system to auto-generate structured operators. This PR adds `DispatchStub` support for Intel GPUs; based on that, XPU operators can register kernel functors to operator stubs. This is a prerequisite of PR #130082, where we will modify the codegen system to generate the XPU-needed source files and headers.

Pull Request resolved: #130019
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
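As an illustration of what this enables, here is a minimal sketch of registering an XPU kernel functor to a `DispatchStub`. The operator and functor names are hypothetical, and the real registrations live in `third_party/torch-xpu-ops`; only the overall macro pattern is taken from the PR series.

```c++
// Minimal sketch, not actual PyTorch source: the stub/functor names are
// hypothetical; the macros follow the usual ATen DispatchStub pattern
// that this PR extends to the XPU backend.
#include <ATen/native/DispatchStub.h>
#include <ATen/native/TensorIterator.h>

namespace at::native {

// Stub declaration/definition (normally split across the op's .h/.cpp).
using my_op_fn = void (*)(TensorIteratorBase&);
DECLARE_DISPATCH(my_op_fn, my_op_stub);
DEFINE_DISPATCH(my_op_stub);

// XPU kernel functor (would live in third_party/torch-xpu-ops).
static void my_op_kernel_xpu(TensorIteratorBase& iter) {
  // ... launch the SYCL kernel through `iter` ...
}

// What this PR makes possible: registering the XPU functor to the stub.
REGISTER_XPU_DISPATCH(my_op_stub, &my_op_kernel_xpu);

} // namespace at::native
```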
**atalman** left a comment:
lgtm
```python
elif backend_index.dispatch_key == DispatchKey.MPS:
    headers.append("#include <ATen/mps/EmptyTensor.h>")
elif backend_index.dispatch_key == DispatchKey.XPU:
    # XPU specific, this header resides in third_party/torch-xpu-ops
```
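The XPU branch is truncated in the diff view above; by analogy with the MPS branch, it presumably appends the corresponding XPU header. The exact header path below is an assumption, not confirmed by this excerpt:

```python
elif backend_index.dispatch_key == DispatchKey.XPU:
    # XPU specific, this header resides in third_party/torch-xpu-ops
    headers.append("#include <ATen/xpu/EmptyTensor.h>")  # assumed path
```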
This is not a real folder in this repo?
@albanD, it is not a real folder. FYI: #120891 (review)
Hi @albanD, great thanks for your comments. The header here resides in the `third_party/torch-xpu-ops` repository. During the compilation process, the operators in …
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
Hi @albanD, I may merge this PR first. If you need further changes, please give me your valuable suggestions and I will append a commit to refactor them. Great thanks!
@ZhiweiYan-96, please add documentation describing the usage of this PR.
# Motivation

The `copy`, `cdist`, and `index_put_impl` operators use `op_stub` for runtime dispatching inside the operators. Each contains an extra device list to ensure accuracy, and XPU was not in those lists. This PR makes them accept XPU as a supported device.

Pull Request resolved: #130088
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #130019, #130082
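As a rough illustration (not the actual ATen source), the kind of runtime device allow-list this PR extends looks like the following; the function name is hypothetical:

```c++
// Illustrative only: a device allow-list guarding stub dispatch, extended
// to accept XPU. The real checks live inside the copy / cdist /
// index_put_impl implementations in ATen.
#include <c10/core/DeviceType.h>

static bool device_supported_for_stub(c10::DeviceType t) {
  return t == c10::DeviceType::CPU || t == c10::DeviceType::CUDA ||
         t == c10::DeviceType::XPU;  // XPU newly allowed by this PR series
}
```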
This PR is a supplement to #130082. The previous PR #130082 fulfilled the basic functionality of the codegen, but we found that it fails to handle the device-sameness check in many UTs. The current PR facilitates XPU device-guard code generation. With this PR, the code snippet in `RegisterXPU.cpp` is as follows, where we can see the device guard is successfully generated:

```c++
namespace {
at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) {
  std::optional<Device> common_device = std::nullopt;
  (void)common_device; // Suppress unused variable warning
  c10::impl::check_and_update_common_device(common_device, out, "wrapper_XPU_Tensor_float_out_normal_out", "out");
  c10::impl::check_and_update_common_device(common_device, mean, "wrapper_XPU_Tensor_float_out_normal_out", "mean");
  const OptionalDeviceGuard device_guard(device_of(out));
  return at::native::normal_out(mean, std, generator, out);
}
} // anonymous namespace
```

Without the current change, the generated code is:

```c++
namespace {
at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) {
  // No device check
  // DeviceGuard omitted
  return at::native::normal_out(mean, std, generator, out);
}
} // anonymous namespace
```

Pull Request resolved: #133980
Approved by: https://github.com/EikanWang, https://github.com/malfet
# Motivation

This PR enhances the codegen to generate code for the XPU backend.

XPU operators currently have to be registered in a hand-written way. Developers have no chance to take advantage of shared code for tensor meta setting (strides, proxy outputs, structured kernels), and manually porting that code is error-prone and leads to high maintenance effort.

We utilize the `backend_whitelist` argument of `gen.py` to generate the XPU-needed headers and source files.

# Usage
XPU ops live in `third_party/torch-xpu-ops`, and the codegen process is triggered before the compilation of `torch-xpu-ops`. We use the following command to generate the XPU operators:

```bash
python -m torchgen.gen --source-path path/to/yaml/of/xpu --install-dir build/xpu --per-operator-headers --static-dispatch-backend --backend-whitelist=XPU
```

The difference lies in `--backend-whitelist=XPU`; the backend-whitelist key is a pre-existing argument in torchgen.

The inputs of `gen.py` are code templates and an operators yaml. We share the same templates as `aten`. A simplified yaml lies in `third_party/torch-xpu-ops`, which only includes the supported XPU operators. This yaml is a copy-and-modify of `native_functions.yaml`: no extra entries are added, and the format is the same as the one in `aten`.
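For illustration, an entry in such a yaml would look like the following. This sample is hypothetical (not copied from the actual file) but follows the `native_functions.yaml` format, with the dispatch restricted to XPU:

```yaml
# Hypothetical entry, same format as aten's native_functions.yaml;
# only the XPU dispatch line differs from the aten original.
- func: abs.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!)
  structured: True
  structured_inherits: TensorIteratorBase
  dispatch:
    XPU: abs_out
```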
# Result

All operator headers are generated independently in `build/xpu/ATen/ops`, which does not affect operators declared/defined by CPU/CUDA or any other backend. XPU operators only include headers from this folder.
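For example, an XPU kernel source would then include only these generated per-operator headers; the header names below are illustrative assumptions:

```c++
// Illustrative includes of per-operator headers generated under
// build/xpu/ATen/ops by the whitelist-filtered codegen; names assumed.
#include <ATen/ops/abs_native.h>        // native function declarations
#include <ATen/ops/abs_xpu_dispatch.h>  // XPU static dispatch header
```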
# Verification

In `third_party/torch-xpu-ops`, we migrate all supported kernels to the structured-kernel style, where they are registered through `REGISTER_XPU_DISPATCH` or `TORCH_IMPL_FUNC`, and we have UT verification based on `test_ops.py`.

Stack from ghstack (oldest at bottom):
cc @gujinghui @EikanWang @fengyuan14 @guangyey