@int3 int3 commented Sep 25, 2024

Stack from ghstack (oldest at bottom):

This greatly reduces compile time: TorchBench models that previously compiled 50-100x slower than with the cpp backend now compile ~20x slower. More work is needed on the Triton side, but smaller block sizes will still help.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

pytorch-bot bot commented Sep 25, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136612

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit c418224 with merge base 6966811 (image):

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

int3 added a commit that referenced this pull request Oct 1, 2024
This greatly reduces compile time.

ghstack-source-id: cb27c66
Pull Request resolved: #136612
@int3 int3 added the topic: not user facing topic category label Oct 1, 2024
injiiiiil pushed a commit to injiiiiil/654 that referenced this pull request Oct 1, 2024
This greatly reduces compile time.

ghstack-source-id: 0144cdd
Pull Request resolved: pytorch/pytorch#136612
int3 added a commit that referenced this pull request Oct 1, 2024
This greatly reduces compile time.

ghstack-source-id: fb22c2f
Pull Request resolved: #136612
@int3 int3 requested review from desertfire and jansel October 2, 2024 04:41

int3 commented Oct 3, 2024

@pytorchbot merge -f "test failure looks unrelated"

@pytorchmergebot

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Chillee commented Oct 3, 2024

Are you primarily running with max-autotune?

int3 commented Oct 3, 2024

Yeah, the compile time measurements are using max-autotune.

Chillee commented Oct 3, 2024

Perhaps it's better to just turn off max-autotune? Its biggest impact is on generating Triton matmul kernels (and autotuning them).

If you just want to search for better block sizes, perhaps it'd be better to try coordinate descent tuning?

coordinate_descent_tuning = (
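
For readers unfamiliar with the idea behind the `coordinate_descent_tuning` setting quoted above, here is a minimal sketch of coordinate descent over block sizes. It is not Inductor's implementation: the parameter names `XBLOCK` and `RBLOCK`, the halve-or-double step, and the toy cost function are all assumptions made for illustration. In a real tuner the cost would come from benchmarking a compiled kernel.

```python
import math
from typing import Callable, Dict


def coordinate_descent(
    start: Dict[str, int],
    cost: Callable[[Dict[str, int]], float],
    min_value: int = 1,
    max_value: int = 1024,
) -> Dict[str, int]:
    """Tune one parameter at a time, halving or doubling it while the cost drops."""
    best = dict(start)
    best_cost = cost(best)
    improved = True
    while improved:
        improved = False
        for name in list(start):
            # Try the two neighbours of the current value along this coordinate.
            for value in (best[name] // 2, best[name] * 2):
                if not (min_value <= value <= max_value):
                    continue
                candidate = dict(best)
                candidate[name] = value
                candidate_cost = cost(candidate)
                if candidate_cost < best_cost:
                    best, best_cost = candidate, candidate_cost
                    improved = True
    return best


def hypothetical_cost(config: Dict[str, int]) -> float:
    """Toy stand-in for a benchmark; its minimum is at XBLOCK=32, RBLOCK=8."""
    return abs(math.log2(config["XBLOCK"]) - 5) + abs(math.log2(config["RBLOCK"]) - 3)


tuned = coordinate_descent({"XBLOCK": 1024, "RBLOCK": 1}, hypothetical_cost)
print(tuned)
```

Each candidate costs one evaluation, so a search like this visits far fewer configurations than an exhaustive sweep over every combination, at the risk of stopping in a local minimum when the parameters interact.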

int3 commented Oct 5, 2024

Oh, interesting. Switching from max-autotune to coordinate descent tuning makes 60% of the models faster at runtime. Compile times improve in just over 50% of cases, but some models also show major regressions...
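
A comparison like this can be summarized with a small helper. The sketch below is hypothetical: the function name, the 1.5x regression threshold, and the timings are placeholders and are not the measurements discussed in this thread.

```python
from typing import Dict, List, Tuple


def summarize_change(
    before: Dict[str, float],
    after: Dict[str, float],
    regression_threshold: float = 1.5,
) -> Tuple[float, List[Tuple[str, float]]]:
    """Return the fraction of models that got faster and the worst regressions."""
    # Ratio below 1.0 means the model got faster under the new configuration.
    ratios = {name: after[name] / before[name] for name in before if name in after}
    if not ratios:
        raise ValueError("no models in common between the two runs")
    improved = [name for name, ratio in ratios.items() if ratio < 1.0]
    regressions = sorted(
        ((name, ratio) for name, ratio in ratios.items() if ratio >= regression_threshold),
        key=lambda item: item[1],
        reverse=True,
    )
    return len(improved) / len(ratios), regressions


# Placeholder timings in seconds, invented purely to exercise the helper.
before = {"model_a": 10.0, "model_b": 20.0, "model_c": 30.0, "model_d": 40.0}
after = {"model_a": 5.0, "model_b": 10.0, "model_c": 33.0, "model_d": 120.0}

fraction_improved, regressions = summarize_change(before, after)
print(fraction_improved, regressions)
```

Reporting the regressions alongside the fraction improved keeps a headline percentage from hiding the models that became much slower.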

@github-actions github-actions bot deleted the gh/int3/108/head branch November 6, 2024 02:08