[inductor] Cooperative reductions #137756
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137756
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit ec25231 with merge base 4cd985a.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot revert -m "ROCM tests are timing out :(" -c nosignal
@pytorchbot successfully started a revert job. Check the current status here.
@jansel your PR has been successfully reverted.
This reverts commit fed37db. Reverted #137756 on behalf of https://github.com/jeanschmidt due to ROCM tests are timing out :(
I am skipping the ROCM tests and filing an issue to add AMD support: #139099
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Hi @jansel, do we need to do more than skip on ROCm to make sure we don't see issues in other workloads? (i.e. do we need to add a way to disable this functionality until ROCm support can be added?) cc: @jeffdaily @jithunnair-amd
Example generated code for `(x+y).sum()`:
```py
@triton.jit
def triton_unk_fused_add_sum_0(in_ptr0, in_ptr1, out_ptr0, ws_ptr, semaphores_ptr, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr, RSPLIT : tl.constexpr):
    xnumel = 1
    rnumel = 1048576
    # RSPLIT program instances cooperate on a single reduction; each one
    # takes a contiguous chunk of the r dimension
    rsplit_id = tl.program_id(0)
    num_rblocks = (rnumel + RBLOCK - 1) // RBLOCK
    rsplit_chunk = (num_rblocks + RSPLIT - 1) // RSPLIT * RBLOCK
    rsplit_start = rsplit_chunk * rsplit_id
    rsplit_end = rsplit_chunk * (rsplit_id + 1)
    xoffset = tl.program_id(1) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = tl.full([XBLOCK, RBLOCK], True, tl.int1)
    rbase = tl.arange(0, RBLOCK)[None, :]
    _tmp4 = tl.full([XBLOCK, RBLOCK], 0, tl.float32)
    # local phase: accumulate this instance's chunk of r
    for roffset in range(rsplit_start, rsplit_end, RBLOCK):
        rindex = roffset + rbase
        rmask = rindex < rnumel
        r0 = rindex
        tmp0 = tl.load(in_ptr0 + (r0), rmask, eviction_policy='evict_first', other=0.0)
        tmp1 = tl.load(in_ptr1 + (r0), rmask, eviction_policy='evict_first', other=0.0)
        tmp2 = tmp0 + tmp1
        tmp3 = tl.broadcast_to(tmp2, [XBLOCK, RBLOCK])
        tmp5 = _tmp4 + tmp3
        _tmp4 = tl.where(rmask, tmp5, _tmp4)
    tmp4 = tl.sum(_tmp4, 1)[:, None]
    if RSPLIT > 1:
        # publish this instance's partial sum to the global-memory workspace
        tmp4_ws = (ws_ptr + 0).to(tl.pointer_type(tl.float32))
        tl.store(tmp4_ws + (xindex * RSPLIT + rsplit_id), tmp4, None)
    if RSPLIT > 1:
        # wait until all RSPLIT peers have published their partials
        triton_helpers.gpu_barrier(semaphores_ptr + (2 * tl.program_id(1) + 0), RSPLIT, True)
    if RSPLIT > 1:
        # combine phase: every instance reduces the peers' partial sums
        tmp4_peers = tl.load(tmp4_ws + (xindex * RSPLIT + tl.arange(0, RSPLIT)[None,:]), None, eviction_policy='evict_first')
        tmp4 = tl.sum(tmp4_peers, 1)[:, None]
    if rsplit_id == (0 % RSPLIT):
        # a single instance writes the final result
        tl.store(out_ptr0 + (tl.full([XBLOCK, 1], 0, tl.int32)), tmp4, None)
```
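To make the RSPLIT bookkeeping above easier to follow, here is a minimal NumPy model of the same chunk arithmetic and two-phase combine. All names are chosen for illustration; the real kernel runs the RSPLIT instances in parallel and synchronizes them with `gpu_barrier`, whereas this loop runs them sequentially:

```py
import numpy as np

def cooperative_sum(x, RSPLIT=4, RBLOCK=8):
    # same chunk arithmetic as the generated kernel
    rnumel = x.size
    num_rblocks = (rnumel + RBLOCK - 1) // RBLOCK
    rsplit_chunk = (num_rblocks + RSPLIT - 1) // RSPLIT * RBLOCK
    workspace = np.zeros(RSPLIT, dtype=x.dtype)  # stands in for ws_ptr
    for rsplit_id in range(RSPLIT):  # each iteration models one program instance
        start = rsplit_chunk * rsplit_id
        end = min(rsplit_chunk * (rsplit_id + 1), rnumel)
        workspace[rsplit_id] = x[start:end].sum()  # local phase
    return workspace.sum()  # combine phase, after the barrier

x = np.ones(1048576, dtype=np.float32)
assert cooperative_sum(x) == x.sum()
```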
Pull Request resolved: pytorch#137756
Approved by: https://github.com/eellison
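A note on the `gpu_barrier` call in the kernel above: it is what makes the split instances cooperative. Each peer publishes its partial sum, then waits on a semaphore until all RSPLIT peers have arrived. Below is a rough host-side model of that protocol (assumed behavior; the real `triton_helpers.gpu_barrier` is a device-side spin on an atomic counter in global memory):

```py
import threading

class SpinBarrier:
    """Host-side stand-in for the device semaphore used by gpu_barrier."""
    def __init__(self, num_peers):
        self.num_peers = num_peers
        self.count = 0
        self.lock = threading.Lock()

    def wait(self):
        with self.lock:  # models the atomic increment of the semaphore
            self.count += 1
        while self.count < self.num_peers:  # models the spin until all arrive
            pass  # real kernels spin on a global-memory load
```

With RSPLIT peers each calling `wait()` once, no peer proceeds to the combine phase until every partial sum is visible; this is also why all RSPLIT blocks must be resident on the GPU at the same time.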
```py
xnumel, rnumel = self.numels
# TODO(jansel): base this on num_bytes_read rather than numel
xhint = V.graph.sizevars.size_hint(xnumel, fallback=2)
if xhint <= 8:
```
this heuristic probably should be different depending on inner/outer reduction
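A hedged sketch of the distinction the comment is drawing (all names and thresholds here are illustrative, not Inductor's actual heuristic): the same `xhint` cutoff need not apply to both kinds of reduction, so one could thread the reduction type through the decision.

```py
def should_cooperate(xhint, is_inner_reduction):
    # illustrative only: per-type cutoffs, since inner and outer reductions
    # expose different amounts of parallelism along x
    threshold = 8 if is_inner_reduction else 1  # assumed values
    return xhint <= threshold
```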
@jataylo so far this functionality is not turned on by default and is only exercised in tests; you might want to throw an error if it's manually turned on. The fixes should likely be on the Triton side: probably a grid that cannot simultaneously run on the device is being launched without being guarded by the cooperative launch attribute.
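For context on the cooperative-launch point: a grid-wide barrier like the one in the generated kernel can only make progress if every block in the grid is resident on the device at once; otherwise the barrier deadlocks. A minimal sketch of the kind of guard one could add, using only the public torch.cuda API (an assumed approach, not the actual fix):

```py
import torch

def max_simultaneous_blocks(blocks_per_sm=1):
    # upper bound on a grid that can be fully resident; a real guard would
    # derive blocks_per_sm from the kernel's measured occupancy
    props = torch.cuda.get_device_properties(torch.cuda.current_device())
    return props.multi_processor_count * blocks_per_sm

# before launching: require RSPLIT * x_grid <= max_simultaneous_blocks(...)
```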
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov