[Inductor] simplify CPP backend Tile2D code and support non-contiguous load/store #97626
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/97626
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 9a75b92.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
test/inductor/test_torchinductor.py:

```diff
 compiled = compile_fx_inner(traced, [x1, x2])
 assert same(fn(x1, x2)[0], compiled([x1, x2])[0], equal_nan=True)
-assert metrics.generated_cpp_vec_kernel_count == 1
+assert metrics.generated_cpp_vec_kernel_count == 2
```
Why did this change?
This is because the tail kernel of tile2d is now also vectorized along the outer loop index (`i0` below); originally it was a scalar kernel. The non-contiguous load falls back to a scalar inner loop (`tmp11` below), while the rest of the code is vectorized as much as possible. Therefore both the tile2d main kernel and the tile2d tail kernel are vectorized, which is counted as 2 now.
Generated tail kernel of tile2d below:
```cpp
#pragma GCC ivdep
for(long i1=static_cast<long>(0); i1<static_cast<long>(10); i1+=static_cast<long>(1))
{
    // contiguous access: vectorized load along i0
    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>((16*i0) + (20*i1)));
    // non-contiguous access: scalar fallback that gathers into an aligned buffer, then loads it as a vector
    auto tmp11 = ([&]() { __at_align__ float tmpbuf[16]; for (long i0_inner = 0; i0_inner < 16; i0_inner++) tmpbuf[i0_inner] = in_ptr1[static_cast<long>(i1 + (10*i0_inner) + (160*i0))]; return at::vec::Vectorized<float>::loadu(tmpbuf); })();
    auto tmp1 = tmp0.abs();
    auto tmp2 = tmp1.sin();
    auto tmp3 = tmp2.neg();
    auto tmp4 = tmp3 * tmp3;
    auto tmp5 = decltype(tmp4)(1)/(decltype(tmp4)(1) + tmp4.neg().exp());
    auto tmp6 = at::vec::clamp_min(tmp5, decltype(tmp5)(0));
    auto tmp7 = tmp6.cos();
    auto tmp8 = tmp7.exp();
    auto tmp9 = tmp8.sqrt();
    auto tmp10 = tmp9 + tmp0;
    auto tmp12 = tmp10 - tmp11;
    auto tmp13 = tmp12 * tmp0;
    auto tmp14 = tmp13 / tmp0;
    auto tmp15 = tmp14 * tmp14;
    auto tmp16 = tmp15 * tmp15;
    auto tmp17 = tmp16 * tmp14;
    auto tmp18 = tmp17 * tmp17;
    auto tmp19 = tmp18.log();
    auto tmp20 = tmp19.floor();
    auto tmp21 = tmp20.ceil();
    auto tmp22 = tmp21.trunc();
    auto tmp23 = tmp22.lgamma();
    auto tmp24 = tmp23.fmod(tmp11);
    auto tmp25 = decltype(tmp24)::blendv(decltype(tmp24)(0), decltype(tmp24)(1), decltype(tmp24)(0) < tmp24);
    auto tmp26 = decltype(tmp24)::blendv(decltype(tmp24)(0), decltype(tmp24)(1), tmp24 < decltype(tmp24)(0));
    auto tmp27 = tmp25 - tmp26;
    auto tmp28 = tmp27 + tmp11;
    // contiguous access: vectorized store along i0
    tmp28.store(out_ptr0 + static_cast<long>((16*i0) + (20*i1)));
}
```

The review thread below refers to this snippet from the contiguity check:

```python
return []
stride = self.stride_at(var, index)
if stride == 1:
    contig_vars.add(int(var.name[1:]))
```
Variable names can sometimes be stuff like `indirect0`, so I don't think converting vars to ints will work in general. Suggest adding a sort function based on something like:

```python
def sort_key(x):
    return (*[int(m) for m in re.findall(r"\d+", x)], x)
```
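A quick check of the suggested key (the variable names here are made up for illustration):

```python
import re

def sort_key(x):
    return (*[int(m) for m in re.findall(r"\d+", x)], x)

names = ["i10", "indirect0", "i2", "i1"]
# Numeric-aware ordering: "indirect0" sorts by its embedded 0 instead of
# crashing the int(var.name[1:]) conversion, and "i10" lands after "i2".
print(sorted(names, key=sort_key))  # ['indirect0', 'i1', 'i2', 'i10']
```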
I intended only to check the loop variables and get the indices from them. So I revised the logic to make sure they are loop variables. Please check.
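A hypothetical sketch of that kind of guard, assuming the induction variables follow the `i<N>` naming seen in the generated kernel above (`itervars` and `loop_var_index` are illustrative names, not the actual revision):

```python
import re

def loop_var_index(var, itervars):
    # Only extract an index from genuine loop induction variables;
    # names such as "indirect0" are filtered out here.
    if var in itervars and re.fullmatch(r"i\d+", var.name):
        return int(var.name[1:])
    return None
```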
…n-contiguous load/store" Remove `CppTile2DTailKernel` and `CppTile2DKernelChecker` and reuse `CppVecKernel` and `CppVecKernelChecker` for them. Add vectorization with fallback for load/store in CppVecKernel for the non-contiguous load/store needed by `CppTile2DTailKernel`. This PR also adds a functional support for transposed copy of bfloat16 data types. Better performance requires vectorized intrinsics implemented for at::vec::transpose_mxn. cc zhuhaozhe cc soumith voznesenskym penguinwu anijain2305 EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire [ghstack-poisoned]
…n-contiguous load/store" Remove `CppTile2DTailKernel` and `CppTile2DKernelChecker` and reuse `CppVecKernel` and `CppVecKernelChecker` for them. Add vectorization with fallback for load/store in CppVecKernel for the non-contiguous load/store needed by `CppTile2DTailKernel`. This PR also adds a functional support for transposed copy of bfloat16 data types. Better performance requires vectorized intrinsics implemented for at::vec::transpose_mxn. cc zhuhaozhe cc soumith voznesenskym penguinwu anijain2305 EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire [ghstack-poisoned]
|
@pytorchbot merge

Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: Command. Raised by workflow job.
…n-contiguous load/store" Remove `CppTile2DTailKernel` and `CppTile2DKernelChecker` and reuse `CppVecKernel` and `CppVecKernelChecker` for them. Add vectorization with fallback for load/store in CppVecKernel for the non-contiguous load/store needed by `CppTile2DTailKernel`. This PR also adds a functional support for transposed copy of bfloat16 data types. Better performance requires vectorized intrinsics implemented for at::vec::transpose_mxn. cc soumith voznesenskym penguinwu anijain2305 EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire cc soumith voznesenskym penguinwu anijain2305 EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire [ghstack-poisoned]
|
@pytorchbot merge

Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: 1 job has failed, the first of them being: inductor / cuda11.8-py3.10-gcc7-sm86 / test (inductor_distributed, 1, 1, linux.g5.12xlarge.nvidia.gpu). Raised by workflow job.
@pytorchbot merge

Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Not sure what the idea was behind the `{self.tiling_factor}*sizeof(float)/sizeof({DTYPE_TO_CPP[dtype]})` size calculation (perhaps a copy-and-paste error during the refactor made by #97626), but `Vectorized::store(ptr, tiling_factor)` needs at least `tiling_factor` elements, not `tiling_factor/2` (which is what the original calculation yields when the data type is a 64-bit value such as int64).

Discovered while trying to enable aarch64 vectorized inductor.
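To make the mismatch concrete, here is the arithmetic with an illustrative `tiling_factor` of 16:

```python
tiling_factor = 16            # elements written by Vectorized::store(ptr, tiling_factor)
sizeof_float, sizeof_int64 = 4, 8

buf_elems = tiling_factor * sizeof_float // sizeof_int64  # 16 * 4 / 8 = 8
assert buf_elems == 8
# The store touches 16 int64 elements but the buffer only holds 8,
# so the write runs past the end of the allocation.
```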
Minimal reproducer (reproducible on ARMv8 or any x86_64 machine that does not support AVX512):

```python
import torch

def do_ds(x, y):
    return torch.diagonal_scatter(x, y)

x = torch.ones(10, 10, dtype=torch.int64)
y = torch.tensor([1, 2, -8, 8, 5, 5, -7, -8, 7, 0])
dsc = torch.compile(do_ds)
assert torch.allclose(torch.diagonal_scatter(x, y), dsc(x, y))
```
Pull Request resolved: #122580
Approved by: https://github.com/Skylion007, https://github.com/jansel
Stack from ghstack (oldest at bottom):

Remove `CppTile2DTailKernel` and `CppTile2DKernelChecker` and reuse `CppVecKernel` and `CppVecKernelChecker` for them. Add vectorization with fallback for load/store in `CppVecKernel` for the non-contiguous load/store needed by `CppTile2DTailKernel`.

This PR also adds functional support for transposed copy of bfloat16 data types. Better performance requires vectorized intrinsics implemented for `at::vec::transpose_mxn`.

cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire
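For reference, a minimal scalar sketch of the tile transpose that `at::vec::transpose_mxn` is meant to accelerate; the Python signature and flat-buffer representation below are illustrative, not ATen's actual API:

```python
def transpose_mxn(src, ld_src, dst, ld_dst, m, n):
    # Read an m x n tile whose rows sit ld_src elements apart in src and
    # write the transposed n x m tile with rows ld_dst elements apart in dst.
    for i in range(m):
        for j in range(n):
            dst[j * ld_dst + i] = src[i * ld_src + j]

# Transpose a 2x3 tile held in a flat row-major buffer.
src = [1, 2, 3,
       4, 5, 6]
dst = [0] * 6
transpose_mxn(src, ld_src=3, dst=dst, ld_dst=2, m=2, n=3)
assert dst == [1, 4, 2, 5, 3, 6]
```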