
Conversation


@jgong5 (Collaborator) commented Mar 26, 2023

Stack from ghstack (oldest at bottom):

Remove `CppTile2DTailKernel` and `CppTile2DKernelChecker` and reuse `CppVecKernel` and `CppVecKernelChecker` for them. Add vectorization with fallback for load/store in `CppVecKernel` for the non-contiguous load/store needed by `CppTile2DTailKernel`.

This PR also adds functional support for transposed copies of the bfloat16 data type. Better performance will require vectorized intrinsics implemented for `at::vec::transpose_mxn`. cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire
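As a rough illustration of the contract (not the ATen implementation), `at::vec::transpose_mxn` semantically copies an m x n tile to its transpose. A scalar Python sketch, with flat buffers and row strides as assumed parameters:

```python
# Scalar sketch of the transpose_mxn contract (illustrative only; the real
# at::vec::transpose_mxn is a C++ template that vectorized intrinsics would
# accelerate). src/dst are flat buffers with row strides ld_src/ld_dst.
def transpose_mxn(src, ld_src, dst, ld_dst, m, n):
    for i in range(m):
        for j in range(n):
            # element (i, j) of the source tile lands at (j, i) in dst
            dst[j * ld_dst + i] = src[i * ld_src + j]

src = list(range(6))   # a 2x3 tile, row-major: [[0, 1, 2], [3, 4, 5]]
dst = [0] * 6
transpose_mxn(src, 3, dst, 2, 2, 3)
# dst is now the 3x2 transpose, row-major: [0, 3, 1, 4, 2, 5]
```

The scalar double loop is exactly what a vectorized intrinsic would replace for better performance.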



pytorch-bot bot commented Mar 26, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/97626

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9a75b92:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jgong5 pushed a commit that referenced this pull request Mar 27, 2023
ghstack-source-id: 1fe2deb
Pull Request resolved: #97626
Jiong Gong added 3 commits March 30, 2023 14:09
@jgong5 added the topic: not user facing label Mar 31, 2023
@jgong5 changed the title from "[Inductor] simplify CPP backend Tile2D code" to "[Inductor] simplify CPP backend Tile2D code and support non-contiguous load/store" Mar 31, 2023
Jiong Gong added 3 commits March 31, 2023 13:00
```python
compiled = compile_fx_inner(traced, [x1, x2])
assert same(fn(x1, x2)[0], compiled([x1, x2])[0], equal_nan=True)
assert metrics.generated_cpp_vec_kernel_count == 2  # was: == 1
```
Contributor


Why did this change?

Collaborator Author


This is because the tail kernel of the tile2d is now also vectorized along the outer loop index (i0 below); originally it was a scalar kernel. The non-contiguous load falls back to a scalar inner loop (tmp11 below) while the rest of the code is vectorized as much as possible. As a result, both the tile2d main kernel and the tile2d tail kernel are vectorized, so the count is now 2.

Generated tail kernel of tile2d below:

```cpp
                #pragma GCC ivdep
                for(long i1=static_cast<long>(0); i1<static_cast<long>(10); i1+=static_cast<long>(1))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>((16*i0) + (20*i1)));
                    auto tmp11 = ([&]() { __at_align__ float tmpbuf[16]; for (long i0_inner = 0; i0_inner < 16; i0_inner++) tmpbuf[i0_inner] = in_ptr1[static_cast<long>(i1 + (10*i0_inner) + (160*i0))]; return at::vec::Vectorized<float>::loadu(tmpbuf); })();
                    auto tmp1 = tmp0.abs();
                    auto tmp2 = tmp1.sin();
                    auto tmp3 = tmp2.neg();
                    auto tmp4 = tmp3 * tmp3;
                    auto tmp5 = decltype(tmp4)(1)/(decltype(tmp4)(1) + tmp4.neg().exp());
                    auto tmp6 = at::vec::clamp_min(tmp5, decltype(tmp5)(0));
                    auto tmp7 = tmp6.cos();
                    auto tmp8 = tmp7.exp();
                    auto tmp9 = tmp8.sqrt();
                    auto tmp10 = tmp9 + tmp0;
                    auto tmp12 = tmp10 - tmp11;
                    auto tmp13 = tmp12 * tmp0;
                    auto tmp14 = tmp13 / tmp0;
                    auto tmp15 = tmp14 * tmp14;
                    auto tmp16 = tmp15 * tmp15;
                    auto tmp17 = tmp16 * tmp14;
                    auto tmp18 = tmp17 * tmp17;
                    auto tmp19 = tmp18.log();
                    auto tmp20 = tmp19.floor();
                    auto tmp21 = tmp20.ceil();
                    auto tmp22 = tmp21.trunc();
                    auto tmp23 = tmp22.lgamma();
                    auto tmp24 = tmp23.fmod(tmp11);
                    auto tmp25 = decltype(tmp24)::blendv(decltype(tmp24)(0), decltype(tmp24)(1), decltype(tmp24)(0) < tmp24);
                    auto tmp26 = decltype(tmp24)::blendv(decltype(tmp24)(0), decltype(tmp24)(1), tmp24 < decltype(tmp24)(0));
                    auto tmp27 = tmp25 - tmp26;
                    auto tmp28 = tmp27 + tmp11;
                    tmp28.store(out_ptr0 + static_cast<long>((16*i0) + (20*i1)));
                }
```
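The tmp11 line is the fallback path this PR adds: the non-contiguous lanes are gathered one by one into an aligned temporary buffer, after which a single vector load (`loadu(tmpbuf)`) proceeds as usual. A rough Python model of that gather (names and sizes taken from the snippet above, function name hypothetical):

```python
# Model of the tmpbuf gather in the tail kernel above: in_ptr1 is indexed
# as i1 + 10*i0_inner + 160*i0, i.e. stride 10 along the vectorized
# i0_inner dimension, so a contiguous vector load is not possible directly.
def gather_lanes(in_ptr1, i0, i1, lanes=16, stride=10, block=160):
    tmpbuf = [in_ptr1[i1 + stride * i0_inner + block * i0]
              for i0_inner in range(lanes)]
    return tmpbuf  # contiguous now; the kernel does loadu(tmpbuf) on this

flat = list(range(320))                    # stand-in for in_ptr1
gathered = gather_lanes(flat, i0=0, i1=3)  # elements 3, 13, 23, ..., 153
```

Only this load is scalar; every operation on the gathered vector (tmp12 onward) stays vectorized, which is why the tail kernel now counts as a vectorized kernel.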

```python
return []
stride = self.stride_at(var, index)
if stride == 1:
    contig_vars.add(int(var.name[1:]))
```
Contributor


Variable names can sometimes be things like `indirect0`, so I don't think converting vars to ints will work in general. Suggest adding a sort function based on something like:

```python
def sort_key(x):
    return (*[int(m) for m in re.findall(r"\d+", x)], x)
```
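For reference, the suggested key gives a numeric-aware ordering, so `i10` sorts after `i2` and a name like `indirect0` is still handled. A quick self-contained check (names here are made up; note the key tuples only compare cleanly when all names embed digits):

```python
import re

def sort_key(x):
    # Integers embedded in the name sort numerically; the full string
    # breaks ties. Works for loop vars ("i2", "i10") and names like
    # "indirect0" that a plain int(var.name[1:]) would choke on.
    return (*[int(m) for m in re.findall(r"\d+", x)], x)

names = ["i10", "i2", "indirect0", "i1"]
ordered = sorted(names, key=sort_key)
# ordered == ["indirect0", "i1", "i2", "i10"]
```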

Collaborator Author


I intended only to check the loop variables and get the indices from them. So I revised the logic to make sure they are loop variables. Please check.

@jgong5 jgong5 requested a review from jansel April 1, 2023 08:18

jgong5 commented Apr 2, 2023

@pytorchbot merge

pytorch-bot added the ciflow/trunk label Apr 2, 2023
@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


@pytorchmergebot

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x 13f4a515d3b16960f14d775ec1139b3133bdc05c returned non-zero exit code 1

```
Auto-merging test/inductor/test_torchinductor.py
CONFLICT (content): Merge conflict in test/inductor/test_torchinductor.py
error: could not apply 13f4a515d3b... [Inductor] simplify CPP backend Tile2D code
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
```


jgong5 commented Apr 2, 2023

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot

Merge failed

Reason: 1 jobs have failed, first few of them are: inductor / cuda11.8-py3.10-gcc7-sm86 / test (inductor_distributed, 1, 1, linux.g5.12xlarge.nvidia.gpu)



jgong5 commented Apr 3, 2023

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@facebook-github-bot facebook-github-bot deleted the gh/jgong5/12/head branch June 8, 2023 17:41
malfet added a commit that referenced this pull request Mar 24, 2024
Not sure what was the idea behind the `{self.tiling_factor}*sizeof(float)/sizeof({DTYPE_TO_CPP[dtype]})` size calculation (perhaps a copy-n-paste error during the refactor made by #97626), but `Vectorized::store(ptr, tiling_factor)` needs at least `tiling_factor` elements, not `tiling_factor/2` (if dtype is int64).

Discovered while trying to enable aarch64 vectorized inductor.
pytorchmergebot pushed a commit that referenced this pull request Mar 25, 2024
Not sure what was the idea behind the `{self.tiling_factor}*sizeof(float)/sizeof({DTYPE_TO_CPP[dtype]})` size calculation (perhaps a copy-n-paste error during the refactor made by #97626), but `Vectorized::store(ptr, tiling_factor)` needs at least `tiling_factor` elements, not `tiling_factor/2` (which would be the case with the original calculation if the data type is a 64-bit value such as int64).
Discovered while trying to enable aarch64 vectorized inductor.
Minimal reproducer (reproducible on ARMv8 or any x86_64 machine that does not support AVX512):
```python
import torch
def do_ds(x, y):
    return torch.diagonal_scatter(x, y)

x=torch.ones(10, 10, dtype=torch.int64)
y=torch.tensor([ 1,  2, -8,  8,  5,  5, -7, -8,  7,  0])
dsc = torch.compile(do_ds)
assert torch.allclose(torch.diagonal_scatter(x, y), dsc(x, y))
```

Pull Request resolved: #122580
Approved by: https://github.com/Skylion007, https://github.com/jansel
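The arithmetic behind the off-by-2x described above can be checked directly, under the usual assumption that float is 4 bytes and int64_t is 8:

```python
# Assumed sizes: sizeof(float) == 4, sizeof(int64_t) == 8.
tiling_factor = 16
sizeof_float, sizeof_int64 = 4, 8

# Original (buggy) buffer size in elements for an int64 temporary:
buggy_elems = tiling_factor * sizeof_float // sizeof_int64  # 16*4/8 == 8

# Vectorized::store(ptr, tiling_factor) writes tiling_factor elements:
needed_elems = tiling_factor  # 16

# For 64-bit dtypes the buffer is half the required size: a buffer overrun.
assert buggy_elems == needed_elems // 2
```

For 4-byte dtypes the two expressions coincide, which is presumably why the bug went unnoticed until wider dtypes were exercised on aarch64.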