[Dynamo] Support torch.{cuda/cpu}.amp.autocast#95416
yanboliang wants to merge 10 commits into pytorch:master
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/95416
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 Failures, 2 Pending
As of commit a7360a7:
BROKEN TRUNK - The following jobs failed but were present on the merge base 076792a:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Currently blocked by #95837
fdfe88a to 93f3088
@pytorchbot merge -f "flaky gcp problem"
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
@pytorchbot revert -m 'Sorry for reverting your PR. But it seems that the smoke test issue is related as it starts to fail consistently in trunk https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=50&name_filter=inductor_torchbench_smoketest_perf' -c ignoredsignal
Here is the error snippet:
@pytorchbot successfully started a revert job. Check the current status here.
@yanboliang your PR has been successfully reverted.
This reverts commit c88aa33. Reverted #95416 on behalf of https://github.com/huydhn due to Sorry for reverting your PR. But it seems that the smoke test issue is related as it starts to fail consistently in trunk https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=50&name_filter=inductor_torchbench_smoketest_perf
@pytorchbot merge -f "irrelevant failure"
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
For Meta internal use cases. Pull Request resolved: pytorch/pytorch#95416 Approved by: https://github.com/jansel
…rch#96289) Fixes issues like the following: https://github.com/pytorch/pytorch/actions/runs/4362155257/jobs/7627059487 has a more serious core dump failure, but the log of curl failures (GCP Linux trying to get EC2-specific metadata such as EC2 AMI-ID, Instance ID, and Instance Type) confused the HUD. This PR gets rid of those curl failures, which may have contributed to the impression of "flaky GCP" in pytorch#95416. Pull Request resolved: pytorch#96289 Approved by: https://github.com/huydhn, https://github.com/yanboliang
This caused a number of failures on the dashboard:
This caused a number of dynamo + eager backend accuracy failures on the timm suite (bisected using regnety_002, but possibly others); cc @Chillee to investigate. BERT_pytorch in torchbench is affected as well.
@davidberard98 thanks for your investigation. This PR actually fixed a serious bug in the dynamo benchmark: before this PR, all AMP benchmarks fell back to eager mode. Some metrics may change since we truly support
Fixes #97382. #95416 fixed a critical bug in the dynamo benchmark: before that PR, AMP tests fell back to eager mode. After that PR, however, we found that [a list of TIMM models failed amp + eager + training testing](https://docs.google.com/spreadsheets/d/1DEhirVOkj15Lu4UNawIUon9MqkVLaWqyT-DQPif5NHk/edit#gid=0). We have now identified the root cause: high loss values make gradient checking harder, because small changes in accumulation order upset the accuracy checks. We should switch to the helper function `reduce_to_scalar_loss`, which the Torchbench tests already use. After switching to `reduce_to_scalar_loss`, the TIMM accuracy pass rate grows from 67.74% to 91.94% in my local test. The remaining 5 failing models (ese_vovnet19b_dw, fbnetc_100, mnasnet_100, mobilevit_s, sebotnet33ts_256) need further investigation and handling, but I think the reason is likely similar. Pull Request resolved: #97423 Approved by: https://github.com/Chillee
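To illustrate the idea behind reducing a model output to a scalar loss, here is a minimal sketch. It is not the PyTorch helper itself: `scalar_loss_sketch` is a hypothetical name, and the assumption is that the reduction averages every output tensor (and every container of tensors) down to one value, so the loss magnitude no longer grows with the number of output elements.

```python
import torch


def scalar_loss_sketch(out):
    """Hypothetical stand-in for a reduce-to-scalar-loss helper (illustration only)."""
    if isinstance(out, torch.Tensor):
        # sum / numel rather than mean(), which is not defined for integer tensors
        return out.sum() / out.numel()
    if isinstance(out, (list, tuple)):
        # Average the per-element scalar losses of a sequence of outputs
        return sum(scalar_loss_sketch(x) for x in out) / len(out)
    if isinstance(out, dict):
        # Average over the values of a dict of outputs
        return sum(scalar_loss_sketch(v) for v in out.values()) / len(out)
    raise TypeError(f"unsupported output type: {type(out)!r}")


torch.manual_seed(0)
logits = torch.randn(8, 1000) * 50.0    # stand-in for a model output with large values

summed = logits.sum()                   # magnitude scales with the element count
averaged = scalar_loss_sketch(logits)   # magnitude stays on the scale of one element
```

With a plain sum, the loss (and therefore the absolute size of any rounding difference from a changed accumulation order) grows with the number of elements; averaging keeps the value small, which is friendlier to a fixed-tolerance accuracy check.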

For Meta internal use cases.
cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire