[AOTI] Fix an autotune block grid computation issue #143098
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143098
Note: Links to docs will display an error until the doc builds have completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 8eacd41 with merge base 84f7913:
BROKEN TRUNK - The following job failed but was already failing on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D67120987
Force-pushed from 492b7b4 to 8eacd41
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Summary: There is a grid computation issue after switching to one-pass codegen in pytorch#141980. When max-autotune is turned on, the generated grid is incorrect in some cases.
Reviewed By: henrylhtsang
Differential Revision: D67120987
Pull Request resolved: pytorch#143098
Approved by: https://github.com/henrylhtsang
@desertfire I don't know if this is connected. With this PR, compile + AOTI export with autotuning now works for me, but I still hit this issue:

E1216 site-packages/torch/_inductor/select_algorithm.py:1756] [0/0] Exception out of resource: shared memory, Required: 131072, Hardware limit: 101376. Reducing block sizes or `num_stages` may help. for benchmark choice TritonTemplateCaller(/tmp/torchinductor_root/tc/.....py, ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=8)
W1216 site-packages/torch/_inductor/select_algorithm.py:1997] [0/0] out of resource: shared memory, Required: 131072, Hardware limit: 101376. Reducing block sizes or `num_stages` may help.
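For reference, a rough back-of-the-envelope estimate of why this choice overflows shared memory. This is my own approximation, not something from the log: `smem_estimate` and `num_buffers` are hypothetical names, and the real number depends on Triton's pipelining details.

```python
def smem_estimate(block_m: int, block_n: int, block_k: int,
                  dtype_size: int, num_buffers: int) -> int:
    # Each pipelined buffer holds one A tile (BLOCK_M x BLOCK_K) and one
    # B tile (BLOCK_K x BLOCK_N) in shared memory.
    return (block_m * block_k + block_k * block_n) * dtype_size * num_buffers

# Failing choice from the log: BLOCK_M=128, BLOCK_N=128, BLOCK_K=64.
# Assuming fp32 tiles (dtype_size=4) and double buffering, the estimate
# already matches the reported requirement of 131072 bytes, well over the
# 101376-byte hardware limit -- hence "reducing block sizes or num_stages
# may help".
print(smem_estimate(128, 128, 64, dtype_size=4, num_buffers=2))  # 131072
```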
Summary: There is a grid computation issue after switching to one-pass codegen in #141980. When max-autotune is turned on, the generated grid is incorrect in some cases.
Reviewed By: henrylhtsang
Differential Revision: D67120987
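For context, the launch grid of a Triton GEMM template is a function of the autotuned block sizes, so the generated grid computation has to use the block constants of the config that actually won autotuning. Below is a minimal sketch of that computation (illustrative names and shapes, not the exact Inductor helper):

```python
def cdiv(a: int, b: int) -> int:
    # Ceiling division: number of blocks needed to cover `a` elements.
    return (a + b - 1) // b

def mm_grid(m: int, n: int, meta: dict) -> tuple:
    # One Triton program per (BLOCK_M x BLOCK_N) output tile, flattened
    # into a 1D grid; the K dimension is looped over inside the kernel.
    return (cdiv(m, meta["BLOCK_M"]) * cdiv(n, meta["BLOCK_N"]), 1, 1)

# With the block sizes from the log above (BLOCK_M=128, BLOCK_N=128) and a
# hypothetical 512x512 output, the grid is (16, 1, 1). Pairing the grid
# with the wrong config's block sizes launches the wrong number of programs.
print(mm_grid(512, 512, {"BLOCK_M": 128, "BLOCK_N": 128}))  # (16, 1, 1)
```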
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @chauhang @aakhundov