[Inductor] Emit strided block pointer from ModularIndexing and FloorDiv #127342
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127342
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (4 Unrelated Failures)
As of commit 2b5a2f1 with merge base 732b4e9:
BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D56739375
…iv (#127342)

Summary: Inductor currently uses modulo and division to compute indices into certain multi-dimensional tensors, such as those arising from row padding. This PR matches on that indexing pattern, replacing it with an N-D block pointer. This should be more efficient than computing indices with division and modulo, and it can easily map to DMAs on non-GPU hardware targets.

Because the 1D block size needs to map to an integer block shape in ND, and triton block shapes must be powers of 2, this only works if the iteration range's dims are all powers of 2. However, this feature is still worthwhile since powers of 2 dims are commonly seen in practice.

Feature proposal and discussion: #125077

Example kernel:
```
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4096
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    tmp0 = tl.reshape(tl.load(tl.make_block_ptr(in_ptr0, shape=[32, 16, 8], strides=[1024, 32, 1], block_shape=[32 * (32 <= ((127 + XBLOCK) // 128)) + ((127 + XBLOCK) // 128) * (((127 + XBLOCK) // 128) < 32), 16 * (16 <= ((7 + XBLOCK) // 8)) + ((7 + XBLOCK) // 8) * (((7 + XBLOCK) // 8) < 16), 8 * (8 <= XBLOCK) + XBLOCK * (XBLOCK < 8)], order=[0, 1, 2], offsets=[(xoffset // 128), (xoffset // 8) % 16, xoffset % 8]), boundary_check=[0, 1, 2]), [XBLOCK])
    tmp1 = tmp0 + tmp0
    tl.store(tl.make_block_ptr(out_ptr0, shape=[4096], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp1, [XBLOCK]).to(tl.float32))
''', device_str='cuda')
```

Test Plan: Added some new CI tests to cover this feature. The tests check that block pointers are generated for strided loads of the appropriate shapes. TODO add some more complex tests, like 2 different strided reads of different sizes. (Maybe read a small matrix, tile it up to size of larger matrix, then add together?)

Differential Revision: D56739375
Force-pushed from 2facbb4 to cb68245.
This pull request was exported from Phabricator. Differential Revision: D56739375
…iv (#127342)

Summary: Inductor currently uses modulo and division to compute indices into certain multi-dimensional tensors, such as those arising from row padding. This PR matches on that indexing pattern, replacing it with an N-D block pointer. This should be more efficient than computing indices with division and modulo, and it can easily map to DMAs on non-GPU hardware targets.

Because the 1D block size needs to map to an integer block shape in ND, and triton block shapes must be powers of 2, this only works if the iteration range's dims are all powers of 2. However, this feature is still worthwhile since powers of 2 dims are commonly seen in practice.

Feature proposal and discussion: #125077

Example kernel:
```
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4096
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    tmp0 = tl.reshape(tl.load(tl.make_block_ptr(in_ptr0, shape=[32, 16, 8], strides=[1024, 32, 1], block_shape=[32 * (32 <= ((127 + XBLOCK) // 128)) + ((127 + XBLOCK) // 128) * (((127 + XBLOCK) // 128) < 32), 16 * (16 <= ((7 + XBLOCK) // 8)) + ((7 + XBLOCK) // 8) * (((7 + XBLOCK) // 8) < 16), 8 * (8 <= XBLOCK) + XBLOCK * (XBLOCK < 8)], order=[0, 1, 2], offsets=[(xoffset // 128), (xoffset // 8) % 16, xoffset % 8]), boundary_check=[0, 1, 2]), [XBLOCK])
    tmp1 = tmp0 + tmp0
    tl.store(tl.make_block_ptr(out_ptr0, shape=[4096], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp1, [XBLOCK]).to(tl.float32))
''', device_str='cuda')
```

Test Plan: Added some new CI tests to cover this feature.
- Check that block pointers are generated for `x + y` where `x` and `y` are views.
- Check that `x + y` still works for odd sizes where we don't generate block pointers.
- Check some cases with view args of different sizes:
  - If sizes are all power of two, check that we generate block pointers for both.
  - If one size is a power of two and the other isn't, check that we generate one block pointer.
- Check that we can handle `torch.sum(x)` where `x` is a view of shape `(3 * TRITON_MAX_BLOCK["Y"], 2)`. Besides powers of two, we should also be able to handle multiples of the max block size.

Differential Revision: D56739375
Force-pushed from cb68245 to 2a261cf.
This pull request was exported from Phabricator. Differential Revision: D56739375
@blaine-rister has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@jansel @shunting314 I think I've addressed your comments from last time. Could you please take another look at this?
…PA test. Also, do not use strides to determine data ordering, since we already get this from the range trees.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
It seems that in 2.4.1 this causes a problem because
Summary
Inductor currently uses modulo and division to compute indices into certain multi-dimensional tensors, such as those arising from row padding. This PR matches on that indexing pattern, replacing it with an N-D block pointer. This should be more efficient than computing indices with division and modulo, and it can easily map to DMAs on non-GPU hardware targets.
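To illustrate the pattern being replaced, here is a small, hypothetical Python sketch using the shape `[32, 16, 8]` and strides `[1024, 32, 1]` from the example kernel below; the helper name is made up, and `ModularIndexing(base, divisor, modulus)` is read as `(base // divisor) % modulus`.

```python
# Hypothetical illustration: a flat iteration index decomposed with floor-division
# and modulo (Inductor's FloorDiv / ModularIndexing) into per-dimension offsets of
# a tensor with shape [32, 16, 8] and strides [1024, 32, 1].
def flat_to_strided_offset(xindex: int) -> int:
    i0 = xindex // 128        # FloorDiv(xindex, 16 * 8)
    i1 = (xindex // 8) % 16   # ModularIndexing(xindex, 8, 16)
    i2 = xindex % 8           # ModularIndexing(xindex, 1, 8)
    return i0 * 1024 + i1 * 32 + i2 * 1

# This PR matches that pattern and emits an N-D block pointer
# (shape=[32, 16, 8], strides=[1024, 32, 1]) instead of computing
# the divisions and modulos per element.
```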
Because the 1D block size needs to map to an integer block shape in ND, we need to know that the ND block size evenly divides the size of the iteration range. This PR only generates ND block pointers when it can guarantee that the iteration order and number of elements loaded are unchanged. This means that the number of elements in a slice of the iteration range must either be:
- a power of 2 (Triton block shapes must be powers of 2), or
- a multiple of the maximum possible block size, so that `CeilDiv(x, y)` rounds up to 1.

Note that a slice of the iteration range does not include the leading dimension. Thus we can support arbitrary leading dimensions like `(5, 8)`. (A sketch of the resulting block-shape computation follows below.)

Feature proposal and discussion: #125077
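To make the mapping concrete, here is a rough Python sketch (assumed semantics, not Inductor's actual code) of how a 1D `XBLOCK` could map to an N-D block shape for dims `(32, 16, 8)`; it mirrors the `block_shape` expressions in the example kernel below.

```python
# Rough sketch: each dimension gets min(dim, ceil(XBLOCK / numel of trailing dims)).
# For a power-of-2 XBLOCK, the resulting block tiles the flat iteration range in
# the same order, so the iteration order and element count are unchanged.
def ceil_div(x: int, y: int) -> int:
    return -(-x // y)

def nd_block_shape(xblock: int, dims=(32, 16, 8)) -> list:
    shape = []
    trailing = 1  # number of elements in the dimensions to the right
    for dim in reversed(dims):
        shape.append(min(dim, ceil_div(xblock, trailing)))
        trailing *= dim
    return list(reversed(shape))

# Examples: nd_block_shape(1024) == [8, 16, 8]; nd_block_shape(64) == [1, 8, 8].
```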
Example kernel:
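```
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4096
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    tmp0 = tl.reshape(tl.load(tl.make_block_ptr(in_ptr0, shape=[32, 16, 8], strides=[1024, 32, 1], block_shape=[32 * (32 <= ((127 + XBLOCK) // 128)) + ((127 + XBLOCK) // 128) * (((127 + XBLOCK) // 128) < 32), 16 * (16 <= ((7 + XBLOCK) // 8)) + ((7 + XBLOCK) // 8) * (((7 + XBLOCK) // 8) < 16), 8 * (8 <= XBLOCK) + XBLOCK * (XBLOCK < 8)], order=[0, 1, 2], offsets=[(xoffset // 128), (xoffset // 8) % 16, xoffset % 8]), boundary_check=[0, 1, 2]), [XBLOCK])
    tmp1 = tmp0 + tmp0
    tl.store(tl.make_block_ptr(out_ptr0, shape=[4096], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp1, [XBLOCK]).to(tl.float32))
''', device_str='cuda')
```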
Test Plan
This PR adds a new CI test script to cover this feature. The tests can be grouped into a few main categories:
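For illustration, here is a minimal, hypothetical sketch of the kind of check involved (not the PR's actual test file), assuming `torch._inductor.utils.run_and_get_triton_code`, the `triton.use_block_ptr` config, and a CUDA device are available:

```python
# Hypothetical sketch: compile a pointwise op over a strided view with power-of-2
# dims and check that the generated Triton code contains a block pointer.
import torch
import torch._inductor.config as inductor_config
from torch._inductor.utils import run_and_get_triton_code

inductor_config.triton.use_block_ptr = True  # block-pointer codegen is gated on this flag

def f(x):
    return x + x

full = torch.randn(32, 16, 16, device="cuda")
# A row-padded view: logical size (32, 16, 8) stored with the last dim padded to 16.
view = torch.as_strided(full, size=(32, 16, 8), stride=(256, 16, 1))

code = run_and_get_triton_code(torch.compile(f), view)
assert "tl.make_block_ptr" in code
```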
Follow-ups
There are a few important cases which this PR can't handle. I'm hoping these can be deferred to follow-up PRs:
- `triton.use_block_ptr=False`. I'm guessing we can still avoid `%` and `/` without requiring block pointers. Maybe we could compute block indices with arange and broadcast instead? (See the sketch below.)

Differential Revision: D56739375
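As a rough illustration of the arange-and-broadcast idea from the last follow-up item, here is a hypothetical Triton sketch (kernel and parameter names are made up, and program-id/masking details are omitted):

```python
# Hypothetical sketch: index a strided 2D tile with per-dimension aranges that are
# broadcast together, instead of deriving indices from a flat index with // and %.
import triton
import triton.language as tl

@triton.jit
def copy_tile_kernel(in_ptr, out_ptr, stride_y, stride_x,
                     BLOCK_Y: tl.constexpr, BLOCK_X: tl.constexpr):
    ys = tl.arange(0, BLOCK_Y)[:, None]      # shape [BLOCK_Y, 1]
    xs = tl.arange(0, BLOCK_X)[None, :]      # shape [1, BLOCK_X]
    offsets = ys * stride_y + xs * stride_x  # broadcasts to [BLOCK_Y, BLOCK_X]
    vals = tl.load(in_ptr + offsets)
    tl.store(out_ptr + offsets, vals)
```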
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang