[Intel GPU]device guard codegen for XPU #133980

ZhiweiYan-96 · 2024-08-20T07:15:10Z

This PR is a supplement to #130082. The previous PR #130082 implemented the basic codegen functionality, but we found that it fails to handle the device sameness check in many unit tests (UTs). This PR enables device guard code generation for XPU.

With this PR, the generated code in RegisterXPU.cpp looks as follows; the device check and the device guard are now generated.

```c++
namespace {
at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) {
  std::optional<Device> common_device = std::nullopt;
  (void)common_device; // Suppress unused variable warning
  c10::impl::check_and_update_common_device(common_device, out, "wrapper_XPU_Tensor_float_out_normal_out", "out");
  c10::impl::check_and_update_common_device(common_device, mean, "wrapper_XPU_Tensor_float_out_normal_out", "mean");
  const OptionalDeviceGuard device_guard(device_of(out));
  return at::native::normal_out(mean, std, generator, out);
}
} // anonymous namespace
```

Without this change, the generated code is:

```c++
namespace {
at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) {
  // No device check
  // DeviceGuard omitted
  return at::native::normal_out(mean, std, generator, out);
}
} // anonymous namespace
```
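
For readers unfamiliar with the generated pattern, here is a minimal, self-contained sketch of what the device sameness check and the device guard do. It does not use PyTorch: `MockDevice`, `MockTensor`, `MockDeviceGuard`, `current_device_index`, and the `*_sketch` functions are hypothetical stand-ins introduced only for illustration, and the real `c10::impl::check_and_update_common_device` and `OptionalDeviceGuard` may differ in detail.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for c10::Device and at::Tensor.
struct MockDevice { int index; };
inline bool operator==(const MockDevice& a, const MockDevice& b) {
  return a.index == b.index;
}
struct MockTensor { MockDevice device; };

// Hypothetical "current device" state that a real backend runtime would own.
inline int& current_device_index() {
  static int index = 0;
  return index;
}

// Assumed behaviour of the sameness check: the first tensor fixes the common
// device, and every later tensor must live on that same device.
inline void check_and_update_common_device_sketch(
    std::optional<MockDevice>& common_device, const MockTensor& tensor,
    const std::string& method_name, const std::string& arg_name) {
  if (!common_device.has_value()) {
    common_device = tensor.device;
    return;
  }
  if (!(*common_device == tensor.device)) {
    throw std::runtime_error(
        method_name + ": argument '" + arg_name + "' is on a different device");
  }
}

// RAII guard: switch to the given device, restore the previous one on exit.
class MockDeviceGuard {
 public:
  explicit MockDeviceGuard(MockDevice device)
      : previous_(current_device_index()) {
    current_device_index() = device.index;
  }
  ~MockDeviceGuard() { current_device_index() = previous_; }
  MockDeviceGuard(const MockDeviceGuard&) = delete;
  MockDeviceGuard& operator=(const MockDeviceGuard&) = delete;

 private:
  int previous_;
};

// Mirrors the shape of the generated wrapper: check devices, guard, call kernel.
// The "kernel" here just reports which device was current while it ran.
inline int normal_out_wrapper_sketch(const MockTensor& mean, MockTensor& out) {
  std::optional<MockDevice> common_device = std::nullopt;
  check_and_update_common_device_sketch(common_device, out, "normal_out_wrapper_sketch", "out");
  check_and_update_common_device_sketch(common_device, mean, "normal_out_wrapper_sketch", "mean");
  assert(common_device.has_value());
  const MockDeviceGuard device_guard(out.device);
  return current_device_index();
}
```

In this sketch, calling the wrapper with `mean` and `out` on the same device runs the "kernel" on that device and restores the previous device afterwards, while mismatched devices raise an error before the kernel is reached. Without the generated check and guard, neither of these would happen.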

Stack from ghstack (oldest at bottom):

-> [Intel GPU]device guard codegen for XPU #133980

[ghstack-poisoned]

pytorch-bot · 2024-08-20T07:15:13Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133980

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit 64f254d with merge base e000cf0:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

- linux-binary-libtorch-cxx11-abi / libtorch-cpu-shared-with-deps-cxx11-abi-test / test (gh) (similar failure)
  `RuntimeError: Didn't find enought cxx11 symbols`
- linux-binary-libtorch-pre-cxx11 / libtorch-cpu-shared-with-deps-pre-cxx11-test / test (gh) (detected as infra flaky with no log or failing log classifier)
- pull / linux-focal-py3.9-clang10 / test (dynamo, 3, 3, linux.2xlarge) (gh) (disabled by #134602)
  `test_transformers.py::TestSDPAPrivateUse1Only::test_scaled_dot_product_fused_attention_overrideable_backward`

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 8477a21 Pull Request resolved: #133980

[ghstack-poisoned]

EikanWang · 2024-08-23T07:24:23Z

@ZhiweiYan-96, please add a detailed description explaining why Intel GPU requires these changes. In addition, please keep the PR as simple as possible.

EikanWang

I would prefer to split this PR into two PRs:

- torchgen/gen.py and torchgen/model.py go in one PR.
- TensorAdvancedIndexing.cpp goes in another PR. In addition, it would be better to merge #134072 into this PR.

[ghstack-poisoned]

ghstack-source-id: 5c45efe Pull Request resolved: #133980

ZhiweiYan-96 · 2024-08-23T08:54:05Z

@EikanWang Thanks for your suggestion! I have appended our background information to the PR description. The indexing changes have been split out into a separate PR: https://github.com/pytorch/pytorch/pull/134072/files

[ghstack-poisoned]

EikanWang · 2024-09-04T13:46:04Z

@atalman, @malfet, may I know if you have any comments on this PR?

malfet · 2024-09-04T13:59:52Z

Looks fine, though I'm a bit dismayed by the fact that one needs 5 dispatch keys per device

EikanWang · 2024-09-04T14:09:51Z

@pytorchbot rebase

pytorchmergebot · 2024-09-04T14:11:30Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

[ghstack-poisoned]

pytorchmergebot · 2024-09-04T14:11:44Z

Successfully rebased gh/ZhiweiYan-96/25/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/133980)

ghstack-source-id: 1c89097 Pull Request resolved: #133980

EikanWang · 2024-09-05T01:45:03Z

@pytorchbot merge

pytorchmergebot · 2024-09-05T01:47:52Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

Pull Request resolved: #134453 Approved by: https://github.com/EikanWang, https://github.com/albanD ghstack dependencies: #133980

ghstack-source-id: d814f1b Pull Request resolved: pytorch/pytorch#133980

Pull Request resolved: pytorch#133980 Approved by: https://github.com/EikanWang, https://github.com/malfet

…#134453) Pull Request resolved: pytorch#134453 Approved by: https://github.com/EikanWang, https://github.com/albanD ghstack dependencies: pytorch#133980

…138802)

This PR is a supplement to #133980. The previous PR implemented the basic functionality of the XPU device guard, but we found that it fails to cover structured operators. With this PR, the generated code in RegisterXPU.cpp looks as follows; the device guard is now generated.

```c++
struct structured_exp_out_functional final : public at::native::structured_exp_out {
    void set_output_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        auto current_device = guard_.current_device();
        if (C10_UNLIKELY(current_device.has_value())) {
          TORCH_INTERNAL_ASSERT(*current_device == options.device(),
            "structured kernels don't support multi-device outputs");
        } else {
          guard_.reset_device(options.device());
        }
        outputs_[output_idx] = create_out(sizes, strides, options);
        if (!names.empty()) {
          namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        at::native::structured_exp_out::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    void set_output_raw_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        auto current_device = guard_.current_device();
        if (C10_UNLIKELY(current_device.has_value())) {
          TORCH_INTERNAL_ASSERT(*current_device == options.device(),
            "structured kernels don't support multi-device outputs");
        } else {
          guard_.reset_device(options.device());
        }
        outputs_[output_idx] = create_out(sizes, strides, options);
        if (!names.empty()) {
          namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        at::native::structured_exp_out::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    const Tensor& maybe_get_output(int64_t output_idx) override {
      return outputs_[output_idx];
    }
    std::array<Tensor, 1> outputs_;
    c10::OptionalDeviceGuard guard_;
};
```

Without this change, the generated code is:

```c++
struct structured_exp_out_functional final : public at::native::structured_exp_out {
    void set_output_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        outputs_[output_idx] = create_out(sizes, strides, options);
        if (!names.empty()) {
          namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        at::native::structured_exp_out::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    void set_output_raw_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        outputs_[output_idx] = create_out(sizes, strides, options);
        if (!names.empty()) {
          namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        at::native::structured_exp_out::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    const Tensor& maybe_get_output(int64_t output_idx) override {
      return outputs_[output_idx];
    }
    std::array<Tensor, 1> outputs_;
};
```

Pull Request resolved: #138802 Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/ezyang
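
The guard logic in the structured-operator snippet can be read as "the first output picks the device, later outputs must agree". Below is a rough, PyTorch-free sketch of that idea. `MockDevice`, `MockOptionalDeviceGuard`, `StructuredOpSketch`, and `active_device` are hypothetical names used only for illustration; the real `c10::OptionalDeviceGuard` is assumed, not guaranteed, to behave along these lines.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>

// Hypothetical stand-in for c10::Device.
struct MockDevice { int index; };
inline bool operator==(const MockDevice& a, const MockDevice& b) {
  return a.index == b.index;
}

// Hypothetical process-wide "active device" that a backend runtime would own.
inline MockDevice& active_device() {
  static MockDevice device{0};
  return device;
}

// Stand-in for an optional guard: starts empty, switches device on
// reset_device(), and restores the original device on destruction.
class MockOptionalDeviceGuard {
 public:
  MockOptionalDeviceGuard() = default;
  ~MockOptionalDeviceGuard() {
    if (original_.has_value()) {
      active_device() = *original_;
    }
  }
  std::optional<MockDevice> current_device() const { return current_; }
  void reset_device(MockDevice device) {
    if (!original_.has_value()) {
      original_ = active_device();  // remember what to restore later
    }
    active_device() = device;
    current_ = device;
  }

 private:
  std::optional<MockDevice> original_;
  std::optional<MockDevice> current_;
};

// Mirrors the shape of the generated set_output_* methods: the first output
// initializes the guard, later outputs must be on the same device.
struct StructuredOpSketch {
  void set_output(std::size_t output_idx, MockDevice device) {
    auto current_device = guard_.current_device();
    if (current_device.has_value()) {
      if (!(*current_device == device)) {
        throw std::logic_error(
            "structured kernels don't support multi-device outputs");
      }
    } else {
      guard_.reset_device(device);
    }
    outputs_[output_idx] = device;  // stands in for create_out(...)
  }

  std::array<std::optional<MockDevice>, 2> outputs_;
  MockOptionalDeviceGuard guard_;
};
```

Keeping the guard as a member of the operator struct means the device stays switched for the lifetime of the operator object and is restored when that object is destroyed, which is the property the sketch is meant to show.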

Update

88f7901

[ghstack-poisoned]

ZhiweiYan-96 added a commit that referenced this pull request Aug 20, 2024

[Intel GPU]device guard codegen, scatter xpu device check

4ec53f4

ghstack-source-id: 8477a21 Pull Request resolved: #133980

ZhiweiYan-96 marked this pull request as draft August 20, 2024 07:15

pytorchbot added the open source label Aug 20, 2024

Update

d7d953c

[ghstack-poisoned]

EikanWang requested changes Aug 23, 2024

Update

ade977a

[ghstack-poisoned]

pytorch-bot bot added the topic: not user facing topic category label Aug 23, 2024

ZhiweiYan-96 added a commit that referenced this pull request Aug 23, 2024

[Intel GPU]device guard codegen, scatter xpu device check

c0d0aea

ghstack-source-id: 5c45efe Pull Request resolved: #133980

ZhiweiYan-96 changed the title ~~[Intel GPU]device guard codegen, scatter xpu device check~~ [Intel GPU]device guard codegen for XPU Aug 23, 2024

ZhiweiYan-96 added ciflow/trunk Trigger trunk jobs on your pull request ciflow/xpu Run XPU CI tasks labels Aug 24, 2024

Update

2fc97fd

[ghstack-poisoned]

ZhiweiYan-96 mentioned this pull request Aug 26, 2024

[Intel GPU] Customized XPU behaviour in indexing, group norm #134453

Closed

EikanWang approved these changes Aug 26, 2024

EikanWang marked this pull request as ready for review August 26, 2024 14:45

EikanWang requested review from atalman and malfet August 26, 2024 14:45

ZhiweiYan-96 added 4 commits August 27, 2024 05:35

Update

9b32a9e

[ghstack-poisoned]

Update

44de03f

[ghstack-poisoned]

Update

1c4b795

[ghstack-poisoned]

Update

e7f98c6

[ghstack-poisoned]

malfet approved these changes Sep 4, 2024

Update

64f254d

[ghstack-poisoned]

pytorchmergebot pushed a commit that referenced this pull request Sep 4, 2024

[Intel GPU]device guard codegen, scatter xpu device check

a8af320

ghstack-source-id: 1c89097 Pull Request resolved: #133980

pytorchmergebot added the merging label Sep 5, 2024

pytorchmergebot closed this in a7a53b7 Sep 5, 2024

pytorchmergebot added Merged and removed merging labels Sep 5, 2024

enter-ctrl9 pushed a commit to enter-ctrl9/pytorch11 that referenced this pull request Sep 15, 2024

[Intel GPU]device guard codegen, scatter xpu device check

c8c6dab

ghstack-source-id: d814f1b Pull Request resolved: pytorch/pytorch#133980

github-actions bot deleted the gh/ZhiweiYan-96/25/head branch October 5, 2024 02:06

toyxu mentioned this pull request Oct 30, 2024

[Intel GPU] Add device guard for XPU structured operator in torchgen #138802

Closed
