[inductor] Reduce block sizes when using Triton CPU backend #136612

int3 · 2024-09-25T06:20:58Z

Stack from ghstack (oldest at bottom):

This greatly reduces compile time; TorchBench models that were previously 50-100x slower (vs the cpp backend) are now ~20x slower. More work needs to be done on the Triton side, but smaller block sizes will still be helpful.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

pytorch-bot · 2024-09-25T06:21:01Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136612

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit c418224 with merge base 6966811:

NEW FAILURE - The following job has failed:

inductor / linux-jammy-cpu-py3.12-gcc11-inductor-triton-cpu / test (inductor-triton-cpu, 1, 1, lf.linux.12xlarge) (gh)
#75 193.7 urllib.error.HTTPError: HTTP Error 524:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

inductor-periodic / cuda12.1-py3.10-gcc9-sm80 / test (inductor_torchbench_smoketest_perf, 1, 1, linux.gcp.a100) (gh) (similar failure)
moco

This comment was automatically generated by Dr. CI and updates every 15 minutes.

int3 · 2024-10-03T01:46:45Z

@pytorchbot merge -f "test failure looks unrelated"

pytorchmergebot · 2024-10-03T01:48:23Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

Chillee · 2024-10-03T04:03:02Z

Are you primarily running with max-autotune?

int3 · 2024-10-03T05:09:36Z

Yeah, the compile time measurements are using max-autotune.

Chillee · 2024-10-03T06:23:16Z

Perhaps it's better to just turn off max-autotune? Its biggest impact is on generating code for Triton matmuls (and autotuning those).

If you just want to search for better block sizes, perhaps it'd be better to try coordinate descent tuning?

pytorch/torch/_inductor/config.py, line 346 in 7dc1788:

coordinate_descent_tuning = (
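
For readers unfamiliar with the idea, here is a toy sketch of coordinate descent over block sizes. It is not Inductor's implementation: the `coordinate_descent` helper, the `toy_cost` function, and the `XBLOCK`/`RBLOCK` starting values are all hypothetical, and a real tuner would benchmark a compiled kernel instead of evaluating a formula.

```python
from typing import Callable, Dict, Tuple

Config = Dict[str, int]


def coordinate_descent(
    config: Config,
    cost: Callable[[Config], float],
    min_size: int = 1,
    max_size: int = 1024,
) -> Tuple[Config, float]:
    """Tune one field at a time: try halving and doubling it, and keep
    the change only if the cost strictly improves. Repeat until a full
    pass over all fields yields no improvement."""
    best = dict(config)
    best_cost = cost(best)
    improved = True
    while improved:
        improved = False
        for name in list(best):
            for value in (best[name] // 2, best[name] * 2):
                if not (min_size <= value <= max_size):
                    continue
                candidate = {**best, name: value}
                candidate_cost = cost(candidate)
                if candidate_cost < best_cost:
                    best, best_cost = candidate, candidate_cost
                    improved = True
    return best, best_cost


def toy_cost(cfg: Config) -> float:
    # Hypothetical stand-in for a benchmark run: pretend the fastest
    # kernel uses XBLOCK=16 and RBLOCK=8.
    return abs(cfg["XBLOCK"] - 16) + abs(cfg["RBLOCK"] - 8)


best, best_cost = coordinate_descent({"XBLOCK": 128, "RBLOCK": 64}, toy_cost)
print(best, best_cost)
```

Each step only evaluates neighbours of the current best configuration, so the number of candidates that have to be compiled and timed grows with the length of the search path rather than with the size of the full configuration grid.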

int3 · 2024-10-05T03:10:42Z

Oh interesting. Switching from max-autotune to coordinate descent tuning makes 60% of the models faster at runtime. Compile times improve for just above 50% of cases, but there are also major regressions for some models...

[inductor] Reduce block sizes when using Triton CPU backend

d7337c6

This greatly reduces compile time. [ghstack-poisoned]

This was referenced Sep 25, 2024

Make test_skip_data_serialization regex more flexible #136580

Closed

Add Triton CPU as an Inductor backend #133408

Closed

int3 mentioned this pull request Sep 25, 2024

Add CI for Triton CPU backend #135342

Closed

pytorch-bot bot added ciflow/inductor module: inductor labels Sep 25, 2024

int3 mentioned this pull request Sep 20, 2024

[not for commit] Benchmark Triton CPU backend #134725

Closed

pytorchbot mentioned this pull request Sep 26, 2024

Make test_skip_data_serialization regex more flexible #136710

Merged

int3 added 2 commits September 30, 2024 20:16

int3 added a commit that referenced this pull request Oct 1, 2024

[inductor] Reduce block sizes when using Triton CPU backend

98b4f22

This greatly reduces compile time. ghstack-source-id: cb27c66 Pull Request resolved: #136612

int3 added the topic: not user facing topic category label Oct 1, 2024

injiiiiil pushed a commit to injiiiiil/654 that referenced this pull request Oct 1, 2024

[inductor] Reduce block sizes when using Triton CPU backend

0f94f41

This greatly reduces compile time. ghstack-source-id: 0144cdd Pull Request resolved: pytorch/pytorch#136612

int3 added a commit that referenced this pull request Oct 1, 2024

[inductor] Reduce block sizes when using Triton CPU backend

13098a7

This greatly reduces compile time. ghstack-source-id: fb22c2f Pull Request resolved: #136612

int3 requested review from desertfire and jansel October 2, 2024 04:41

desertfire approved these changes Oct 2, 2024

pytorchmergebot added the merging label Oct 3, 2024

pytorchmergebot closed this in b3953ff Oct 3, 2024

pytorchmergebot added Merged and removed merging labels Oct 3, 2024

int3 mentioned this pull request Oct 3, 2024

Have Triton CPU backend respect max_autotune setting #137276

Closed

github-actions bot deleted the gh/int3/108/head branch November 6, 2024 02:08
