-
Notifications
You must be signed in to change notification settings - Fork 26.3k
[Inductor] Add config option to force higher-dimensional tiling #132937
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…to brister/prefer_tiling
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/132937
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 8543226 with merge base b01402b ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
Perf workflow with ND tiling always enabled: https://github.com/pytorch/pytorch/actions/runs/10293993383 |
|
@pytorchbot merge |
Merge failedReason: This PR needs a If not, please add the To add a label, you can comment to pytorchbot, for example For more information, see Details for Dev Infra teamRaised by workflow job |
|
@pytorchbot merge |
Merge failedReason: This PR needs a If not, please add the To add a label, you can comment to pytorchbot, for example For more information, see Details for Dev Infra teamRaised by workflow job |
|
@pytorchbot merge |
|
@blaine-rister NBD but wondering, numel/index/mask no longer are used, may be nice to rm them? Not to block landing the PR, more of a question. |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
I've wondered about this as well. The way the triton codegen is currently structured makes it a bit awkward to remove dead code. We generate these numels in the beginning, without knowing whether they are actually going to be used. The numels are treated like an artifact of codegen, and are not directly visible as ops in the IR. So we can't use standard dead code elimination to remove them. The triton compiler eliminates dead code under the hood, so there shouldn't be a perf impact to leaving some dead code in the kernel. They do make the kernels a bit harder to read and debug, though. We might be able to post-process the kernels and determine whether numels were actually used, or update some data structure tracking their uses as we're generating code. But there's a question of whether the debugging benefit justifies additional complexity. cc @eellison @shunting314 do you have any thoughts on this? |
|
@blaine no strong opinion, but yea not especially harmful right now. maybe file an issue ? could be a ramp up task |
@nandesuka expressed an interest in this. I'll create an issue for it. |
Fixes #125077
Feature
This PR creates a new Inductor config,
config.triton.prefer_nd_tiling, which is disabled by default. When enabled, this encourages the Triton code to use as many tiling dimensions as possible. This simplifies indexing expressions for discontiguous tensors, resulting in expressions like5 * x + 8 * yas opposed to5 * (x // 7) + 8 * (y % 9). This allows us to find more block pointers than we normally would. We should now see simplified indexing expressions as long as:config.triton.max_tiles.Here's an example kernel (elementwise add of views) with ND tiling disabled:
And here's the version with it enabled:
With this feature enabled, we get a discontiguous strided block pointer. Previously, this would only have worked for specific shapes, like powers of 2 or multiples of the maximum block size. With this PR, we can support arbitrary shapes so long as we have enough tiles to cover all discontiguous dimensions.
Test plan
This PR adds some tests for pointwise ops with discontiguous tensors.
(5,7),(9,3,5), etc.This PR also parametrizes some existing tests to run with and without
triton.prefer_nd_tiling. That way, we ensure this feature doesn't break existing usage.Since this setting isn't enabled on most tests, I also created #132935 to test what happens when
triton.prefer_nd_tiling=Trueby default. None of the failures seem related to invalid tiling, so I think this feature is safe to merge.Limitations and follow-ups
I can see two main improvements which would expand the usefulness of this feature:
This feature currently only works for pointwise kernels, since reductions are never tiled. As a follow-up, we could enable tiled reductions to extend these benefits to reduction kernels.
The usefulness of this feature depends on
triton.config.max_tiles. This is currently restricted to 2 by default, although it can be increased to 3 in certain cases. To support more discontiguous dims, we might consider expanding support for 3D tiling, or even supporting ND tiling, by mapping an ND "virtual" launch grid onto Triton's 3D launch grid.cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang