
Conversation

@ZhiweiYan-96
Collaborator

@ZhiweiYan-96 ZhiweiYan-96 commented Jul 4, 2024

Motivation

This PR enhances the codegen so that it can generate code for the XPU backend.

Currently, XPU operators have to be registered by hand. Developers cannot take advantage of the shared code that handles tensor meta setting (strides, proxy outputs, structured kernels). Manually porting this code is error-prone and leads to high maintenance effort.

We use the backend_whitelist argument of gen.py to generate the headers and source files that XPU needs.

Usage

XPU ops live in third_party/torch-xpu-ops; the codegen process is triggered before torch-xpu-ops is compiled.

We use the following command to generate XPU operators:

python -m torchgen.gen --source-path path/to/yaml/of/xpu --install-dir build/xpu --per-operator-headers --static-dispatch-backend --backend-whitelist=XPU

The key difference is backend-whitelist=XPU. The backend-whitelist key is an existing argument in torchgen.

The inputs of gen.py are the code templates and the operators yaml. We share the same templates as ATen. A simplified yaml lives in third_party/torch-xpu-ops and only includes the supported XPU operators. This yaml is a copy-and-modify of native_functions.yaml: no extra entries are added, and the format is the same as the one in ATen.

Result

All operator headers are generated independently in build/xpu/ATen/ops, which does not affect operators declared/defined by CPU/CUDA or any other backend. XPU operators only include headers from this folder.
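
To make the include layout concrete, below is a hypothetical sketch of an XPU kernel translation unit in third_party/torch-xpu-ops. The operator (abs) and the exact generated header name are assumptions for illustration; the point is that only per-operator headers from build/xpu/ATen/ops are pulled in.

```c++
// Hypothetical XPU kernel source file in third_party/torch-xpu-ops.
// It includes only per-operator headers generated into build/xpu/ATen/ops,
// never the monolithic ATen/Functions.h, so CPU/CUDA codegen stays untouched.
#include <ATen/core/Tensor.h>
#include <ATen/ops/abs_native.h>  // generated declaration for this single operator

namespace at::native {
// The XPU implementation and registration of abs would live here.
} // namespace at::native
```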

Verification

  • In third_party/torch-xpu-ops, we migrate all supported kernels to the structured-kernel style, where they are registered through REGISTER_XPU_DISPATCH or TORCH_IMPL_FUNC, and we have UT verification based on test_ops.py; a rough sketch of both registration styles follows below.
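
As an illustration of the two registration styles named in the bullet above (not the actual torch-xpu-ops code): the operator names, kernel bodies, and the exact argument form of REGISTER_XPU_DISPATCH are assumptions.

```c++
#include <ATen/native/DispatchStub.h>
#include <ATen/native/TensorIterator.h>
#include <ATen/native/UnaryOps.h>    // declares abs_stub
#include <ATen/ops/addmm_native.h>   // generated header declaring the structured class
                                     // (the XPU variant would come from build/xpu/ATen/ops)

namespace at::native {

// (1) Structured kernel: torchgen generates the meta/out plumbing, and the
//     backend only supplies the impl body through TORCH_IMPL_FUNC.
TORCH_IMPL_FUNC(addmm_out_xpu)
(const Tensor& self, const Tensor& mat1, const Tensor& mat2,
 const Scalar& beta, const Scalar& alpha, const Tensor& result) {
  // launch the SYCL matmul kernel here
}

// (2) Stub-based kernel: the shared operator dispatches through a DispatchStub,
//     and the XPU kernel functor is attached to that stub.
static void abs_kernel_xpu(TensorIteratorBase& iter) {
  // elementwise SYCL implementation
}
REGISTER_XPU_DISPATCH(abs_stub, &abs_kernel_xpu);

} // namespace at::native
```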

Stack from ghstack (oldest at bottom):

cc @gujinghui @EikanWang @fengyuan14 @guangyey

@pytorch-bot

pytorch-bot bot commented Jul 4, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/130082

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 55aa37a with merge base 6cbb143:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@ZhiweiYan-96 ZhiweiYan-96 added the module: xpu Intel XPU related issues label Jul 4, 2024
@ZhiweiYan-96 ZhiweiYan-96 added ciflow/xpu Run XPU CI tasks ciflow/trunk Trigger trunk jobs on your pull request labels Jul 5, 2024
@ZhiweiYan-96 ZhiweiYan-96 marked this pull request as ready for review July 12, 2024 06:17
@EikanWang EikanWang requested review from albanD, atalman and malfet July 17, 2024 02:22
francograndegmailcom pushed a commit to francograndegmailcom/pytorch-pytorch that referenced this pull request Jul 23, 2024
pytorchmergebot pushed a commit that referenced this pull request Jul 29, 2024
# Motivation
Structured codegen makes it easier to decouple tensor meta setting from kernel implementation. At present, XPU operators have to handle tensor metas in hand-written code.

We plan to leverage the codegen system to auto-generate structured operators. This PR adds `DispatchStub` support for Intel GPUs; based on that, XPU operators can register kernel functors to operator stubs.

This is a prerequisite of PR #130082, where we will modify the codegen system to generate the source files and headers that XPU needs.

Pull Request resolved: #130019
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
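
For readers unfamiliar with `DispatchStub`, here is a minimal sketch of the pattern this enables, using a made-up stub name (my_unary_stub); the REGISTER_XPU_DISPATCH macro follows the description above, but its exact argument form is an assumption.

```c++
#include <ATen/native/DispatchStub.h>
#include <ATen/native/TensorIterator.h>

namespace at::native {

// Shared, device-agnostic ATen code declares and defines a stub once.
using my_unary_fn = void (*)(TensorIteratorBase&);
DECLARE_DISPATCH(my_unary_fn, my_unary_stub);
DEFINE_DISPATCH(my_unary_stub);

// The generic operator then calls through the stub, e.g.
//   my_unary_stub(iter.device_type(), iter);
// and the stub routes to whichever backend registered a kernel functor.

// With this PR, an XPU kernel can be attached to the stub as well:
static void my_unary_kernel_xpu(TensorIteratorBase& iter) {
  // SYCL implementation would go here
}
REGISTER_XPU_DISPATCH(my_unary_stub, &my_unary_kernel_xpu);

} // namespace at::native
```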
Contributor

@atalman atalman left a comment


lgtm

elif backend_index.dispatch_key == DispatchKey.MPS:
    headers.append("#include <ATen/mps/EmptyTensor.h>")
elif backend_index.dispatch_key == DispatchKey.XPU:
    # XPU specific, this header resides in third_party/torch-xpu-ops
Collaborator


This is not a real folder in this repo?

Collaborator


@albanD , it is not a real folder. FYI: #120891 (review)

@ZhiweiYan-96
Collaborator Author

ZhiweiYan-96 commented Jul 31, 2024

Hi @albanD, thanks a lot for your comments. The header here resides in third_party/torch-xpu-ops, where most XPU operators live. The header is exposed to the PyTorch build process by the cmake change here: https://github.com/pytorch/pytorch/pull/130082/files#diff-c5ee05f1e918772792ff6f2a3f579fc2f182e57b1709fd786ef6dc711fd68b27R1054.

During compilation, both the operators in third_party/torch-xpu-ops and the generated source files can see these headers. We have verified this with many UTs. Here is a log fragment from tests we ran in the torch-xpu-ops repo together with the change in the current PR.

test_ops_xpu.py::TestSelfKwarg::test_self_kwargs PASSED                                                                                                                                                                                          [  0%]
test_ops_xpu.py::TestCommonXPU::test_compare_cpu_H_xpu_float32 PASSED                                                                                                                                                                            [  0%]
test_ops_xpu.py::TestCommonXPU::test_compare_cpu_T_xpu_float32 PASSED                                                                                                                                                                            [  0%]
test_ops_xpu.py::TestCommonXPU::test_compare_cpu___getitem___xpu_float32

@ZhiweiYan-96
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@ZhiweiYan-96
Collaborator Author

Hi @albanD, I will merge this PR first. If you need further changes, please share your suggestions and I will append commits to refactor accordingly. Thanks!

@EikanWang
Collaborator

@ZhiweiYan-96, please add documentation describing the usage introduced by this PR.

pytorchmergebot pushed a commit that referenced this pull request Aug 6, 2024
# Motivation
The `copy`, `cdist`, and `index_put_impl` operators use `op_stub` for runtime dispatching inside the operators. They carry an explicit device list to ensure accuracy, and XPU was not in it. This PR makes them accept XPU as a supported device.

Pull Request resolved: #130088
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #130019, #130082
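
For context, below is a hedged sketch of the style of device allow-list such operators use; the helper name and message are illustrative, not the literal ATen code this commit touches.

```c++
#include <ATen/core/Tensor.h>
#include <c10/util/Exception.h>

// Illustrative only: a device allow-list check in the style used inside ops
// such as cdist. Previously only CPU and CUDA were accepted; adding kXPU lets
// the op_stub dispatch reach the XPU kernel.
static void check_supported_device(const at::Tensor& t) {
  const auto dev = t.device().type();
  TORCH_CHECK(dev == at::kCPU || dev == at::kCUDA || dev == at::kXPU,
              "this operator only supports CPU, CUDA and XPU devices, got: ", dev);
}
```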
@github-actions github-actions bot deleted the gh/ZhiweiYan-96/16/head branch August 31, 2024 02:00
pytorchmergebot pushed a commit that referenced this pull request Sep 5, 2024
This PR is a supplement to #130082. The previous PR #130082 fulfills the basic functionality of the codegen, but we found that it fails to handle the device-sameness check in many UTs. The current PR adds XPU device guard code generation.

With the current PR, the code snippet in `RegisterXPU.cpp` is as follows; the device guard is now generated correctly.
```c++
namespace {
at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) {
  std::optional<Device> common_device = std::nullopt;
(void)common_device; // Suppress unused variable warning
  c10::impl::check_and_update_common_device(common_device, out, "wrapper_XPU_Tensor_float_out_normal_out", "out");
  c10::impl::check_and_update_common_device(common_device, mean, "wrapper_XPU_Tensor_float_out_normal_out", "mean");
  const OptionalDeviceGuard device_guard(device_of(out));
  return at::native::normal_out(mean, std, generator, out);
}
} // anonymous namespace
```
By contrast, without this change, the generated code is:
```c++
namespace {
at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) {
    // No device check
  // DeviceGuard omitted
  return at::native::normal_out(mean, std, generator, out);
}
} // anonymous namespace
```

Pull Request resolved: #133980
Approved by: https://github.com/EikanWang, https://github.com/malfet
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024

Labels

  • ciflow/trunk: Trigger trunk jobs on your pull request
  • ciflow/xpu: Run XPU CI tasks
  • Merged
  • module: xpu: Intel XPU related issues
  • open source
  • topic: not user facing (topic category)

Projects

Status: Done


8 participants