[Inductor] simplify CPP backend Tile2D code and support non-contiguous load/store #97626
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/97626
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 9a75b92.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
test/inductor/test_torchinductor.py:

```diff
 compiled = compile_fx_inner(traced, [x1, x2])
 assert same(fn(x1, x2)[0], compiled([x1, x2])[0], equal_nan=True)
-assert metrics.generated_cpp_vec_kernel_count == 1
+assert metrics.generated_cpp_vec_kernel_count == 2
```
Why did this change?
This is because the tail kernel of tile2d is now also vectorized along the outer loop index (`i0` below); originally it was a scalar kernel. The non-contiguous load falls back to a scalar inner loop (`tmp11` below), while the rest of the code is vectorized as much as possible. Therefore both the tile2d main kernel and the tile2d tail kernel are vectorized, which is counted as 2 now.
Generated tail kernel of tile2d below:
```cpp
#pragma GCC ivdep
for(long i1=static_cast<long>(0); i1<static_cast<long>(10); i1+=static_cast<long>(1))
{
    // contiguous access: vectorized load along i0
    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>((16*i0) + (20*i1)));
    // non-contiguous access: scalar fallback that gathers into an aligned buffer, then loads it as a vector
    auto tmp11 = ([&]() { __at_align__ float tmpbuf[16]; for (long i0_inner = 0; i0_inner < 16; i0_inner++) tmpbuf[i0_inner] = in_ptr1[static_cast<long>(i1 + (10*i0_inner) + (160*i0))]; return at::vec::Vectorized<float>::loadu(tmpbuf); })();
    auto tmp1 = tmp0.abs();
    auto tmp2 = tmp1.sin();
    auto tmp3 = tmp2.neg();
    auto tmp4 = tmp3 * tmp3;
    auto tmp5 = decltype(tmp4)(1)/(decltype(tmp4)(1) + tmp4.neg().exp());
    auto tmp6 = at::vec::clamp_min(tmp5, decltype(tmp5)(0));
    auto tmp7 = tmp6.cos();
    auto tmp8 = tmp7.exp();
    auto tmp9 = tmp8.sqrt();
    auto tmp10 = tmp9 + tmp0;
    auto tmp12 = tmp10 - tmp11;
    auto tmp13 = tmp12 * tmp0;
    auto tmp14 = tmp13 / tmp0;
    auto tmp15 = tmp14 * tmp14;
    auto tmp16 = tmp15 * tmp15;
    auto tmp17 = tmp16 * tmp14;
    auto tmp18 = tmp17 * tmp17;
    auto tmp19 = tmp18.log();
    auto tmp20 = tmp19.floor();
    auto tmp21 = tmp20.ceil();
    auto tmp22 = tmp21.trunc();
    auto tmp23 = tmp22.lgamma();
    auto tmp24 = tmp23.fmod(tmp11);
    auto tmp25 = decltype(tmp24)::blendv(decltype(tmp24)(0), decltype(tmp24)(1), decltype(tmp24)(0) < tmp24);
    auto tmp26 = decltype(tmp24)::blendv(decltype(tmp24)(0), decltype(tmp24)(1), tmp24 < decltype(tmp24)(0));
    auto tmp27 = tmp25 - tmp26;
    auto tmp28 = tmp27 + tmp11;
    // contiguous access: vectorized store along i0
    tmp28.store(out_ptr0 + static_cast<long>((16*i0) + (20*i1)));
}
```

The review thread below refers to this snippet from the contiguity check:

```python
return []
stride = self.stride_at(var, index)
if stride == 1:
    contig_vars.add(int(var.name[1:]))
```
Variable names can sometimes be stuff like `indirect0`, so I don't think converting vars to ints will work in general. Suggest adding a sort function based on something like:

```python
def sort_key(x):
    return (*[int(m) for m in re.findall(r"\d+", x)], x)
```
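A quick check of the suggested key (the variable names here are made up for illustration):

```python
import re

def sort_key(x):
    return (*[int(m) for m in re.findall(r"\d+", x)], x)

names = ["i10", "indirect0", "i2", "i1"]
# Numeric-aware ordering: "indirect0" sorts by its embedded 0 instead of
# crashing the int(var.name[1:]) conversion, and "i10" lands after "i2".
print(sorted(names, key=sort_key))  # ['indirect0', 'i1', 'i2', 'i10']
```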
I intended only to check the loop variables and get the indices from them. So I revised the logic to make sure they are loop variables. Please check.
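A hypothetical sketch of that kind of guard, assuming the induction variables follow the `i<N>` naming seen in the generated kernel above (`itervars` and `loop_var_index` are illustrative names, not the actual revision):

```python
import re

def loop_var_index(var, itervars):
    # Only extract an index from genuine loop induction variables;
    # names such as "indirect0" are filtered out here.
    if var in itervars and re.fullmatch(r"i\d+", var.name):
        return int(var.name[1:])
    return None
```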
…n-contiguous load/store" Remove `CppTile2DTailKernel` and `CppTile2DKernelChecker` and reuse `CppVecKernel` and `CppVecKernelChecker` for them. Add vectorization with fallback for load/store in CppVecKernel for the non-contiguous load/store needed by `CppTile2DTailKernel`. This PR also adds a functional support for transposed copy of bfloat16 data types. Better performance requires vectorized intrinsics implemented for at::vec::transpose_mxn. cc zhuhaozhe cc soumith voznesenskym penguinwu anijain2305 EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire [ghstack-poisoned]
…n-contiguous load/store" Remove `CppTile2DTailKernel` and `CppTile2DKernelChecker` and reuse `CppVecKernel` and `CppVecKernelChecker` for them. Add vectorization with fallback for load/store in CppVecKernel for the non-contiguous load/store needed by `CppTile2DTailKernel`. This PR also adds a functional support for transposed copy of bfloat16 data types. Better performance requires vectorized intrinsics implemented for at::vec::transpose_mxn. cc zhuhaozhe cc soumith voznesenskym penguinwu anijain2305 EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire [ghstack-poisoned]
|
@pytorchbot merge

Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: Command. Raised by workflow job.
…n-contiguous load/store" Remove `CppTile2DTailKernel` and `CppTile2DKernelChecker` and reuse `CppVecKernel` and `CppVecKernelChecker` for them. Add vectorization with fallback for load/store in CppVecKernel for the non-contiguous load/store needed by `CppTile2DTailKernel`. This PR also adds a functional support for transposed copy of bfloat16 data types. Better performance requires vectorized intrinsics implemented for at::vec::transpose_mxn. cc soumith voznesenskym penguinwu anijain2305 EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire cc soumith voznesenskym penguinwu anijain2305 EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire [ghstack-poisoned]
|
@pytorchbot merge

Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: 1 job has failed, the first of them being: inductor / cuda11.8-py3.10-gcc7-sm86 / test (inductor_distributed, 1, 1, linux.g5.12xlarge.nvidia.gpu). Raised by workflow job.
@pytorchbot merge

Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Not sure what the idea was behind the `{self.tiling_factor}*sizeof(float)/sizeof({DTYPE_TO_CPP[dtype]})` size calculation (perhaps a copy-and-paste error during the refactor made by #97626), but `Vectorized::store(ptr, tiling_factor)` needs at least `tiling_factor` elements, not `tiling_factor/2` (which is what the original calculation yields when the data type is a 64-bit value such as int64).

Discovered while trying to enable aarch64 vectorized inductor.
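To make the mismatch concrete, here is the arithmetic with an illustrative `tiling_factor` of 16:

```python
tiling_factor = 16            # elements written by Vectorized::store(ptr, tiling_factor)
sizeof_float, sizeof_int64 = 4, 8

buf_elems = tiling_factor * sizeof_float // sizeof_int64  # 16 * 4 / 8 = 8
assert buf_elems == 8
# The store touches 16 int64 elements but the buffer only holds 8,
# so the write runs past the end of the allocation.
```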
Minimal reproducer (reproducible on ARMv8 or any x86_64 machine that does not support AVX512):

```python
import torch

def do_ds(x, y):
    return torch.diagonal_scatter(x, y)

x = torch.ones(10, 10, dtype=torch.int64)
y = torch.tensor([1, 2, -8, 8, 5, 5, -7, -8, 7, 0])
dsc = torch.compile(do_ds)
assert torch.allclose(torch.diagonal_scatter(x, y), dsc(x, y))
```
Pull Request resolved: #122580
Approved by: https://github.com/Skylion007, https://github.com/jansel
Stack from ghstack (oldest at bottom):

Remove `CppTile2DTailKernel` and `CppTile2DKernelChecker` and reuse `CppVecKernel` and `CppVecKernelChecker` for them. Add vectorization with fallback for load/store in `CppVecKernel` for the non-contiguous load/store needed by `CppTile2DTailKernel`.

This PR also adds functional support for transposed copy of bfloat16 data types. Better performance requires vectorized intrinsics implemented for `at::vec::transpose_mxn`.

cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire
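For reference, a minimal scalar sketch of the tile transpose that `at::vec::transpose_mxn` is meant to accelerate; the Python signature and flat-buffer representation below are illustrative, not ATen's actual API:

```python
def transpose_mxn(src, ld_src, dst, ld_dst, m, n):
    # Read an m x n tile whose rows sit ld_src elements apart in src and
    # write the transposed n x m tile with rows ld_dst elements apart in dst.
    for i in range(m):
        for j in range(n):
            dst[j * ld_dst + i] = src[i * ld_src + j]

# Transpose a 2x3 tile held in a flat row-major buffer.
src = [1, 2, 3,
       4, 5, 6]
dst = [0] * 6
transpose_mxn(src, ld_src=3, dst=dst, ld_dst=2, m=2, n=3)
assert dst == [1, 4, 2, 5, 3, 6]
```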