[Dynamo] Support torch.{cuda/cpu}.amp.autocast by yanboliang · Pull Request #95416 · pytorch/pytorch

yanboliang · 2023-02-23T22:24:01Z

For Meta internal use cases.

cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire

pytorch-bot · 2023-02-23T22:24:04Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/95416

❌ 1 Failures, 2 Pending

As of commit a7360a7:

BROKEN TRUNK - The following jobs failed but were present on the merge base 076792a:

👉 Rebase onto the `viable/strict` branch to avoid these failures

manywheel-py3_8-cuda11_7-test / test (gh)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

torch/_dynamo/variables/torch.py

yanboliang · 2023-03-01T23:22:33Z

Currently blocked by #95837

yanboliang · 2023-03-08T01:38:30Z

@pytorchbot merge -f "flaky gcp problem"

pytorchmergebot · 2023-03-08T01:40:22Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

huydhn · 2023-03-08T06:41:32Z

@pytorchbot revert -m 'Sorry for reverting your PR. But it seems that the smoke test issue is related as it starts to fail consistently in trunk https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=50&name_filter=inductor_torchbench_smoketest_perf' -c ignoredsignal

weiwangmeta · 2023-03-08T06:44:59Z

Real failure is https://github.com/pytorch/pytorch/actions/runs/4360278495/jobs/7623296244#step:11:1070
not the

huydhn · 2023-03-08T06:47:48Z

Here is the error snippet:

+ python benchmarks/dynamo/check_memory_compression_ratio.py --actual /var/lib/jenkins/workspace/test/test-reports/inductor_training_smoketest_hf_Albert.csv --expected benchmarks/dynamo/expected_ci_perf_inductor_torchbench.csv

            hf_Albert                         :
                actual_memory_compression=1.19,
                expected_memory_compression=1.26,
                FAIL
            

Error: 1 models below expected memory compression ratio:
    hf_Albert
If this drop is expected, you can update `benchmarks/dynamo/expected_ci_perf_inductor_torchbench.csv`.
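
The check in the log above can be read as a per-model threshold comparison. The following is a minimal sketch under assumptions, not the actual contents of `check_memory_compression_ratio.py`: the function name `find_memory_regressions`, the dict inputs, and the `tolerance` factor of 0.95 are all hypothetical and were chosen only so that the example reproduces the `hf_Albert` result shown.

```python
def find_memory_regressions(actual, expected, tolerance=0.95):
    """Return the models whose measured ratio falls below tolerance * expected.

    `actual` and `expected` map model name -> memory compression ratio.
    The tolerance value is an assumption made for this illustration.
    """
    failed = []
    for model, expected_ratio in expected.items():
        actual_ratio = actual.get(model)
        if actual_ratio is None:
            continue  # model was not measured in this run
        if actual_ratio < expected_ratio * tolerance:
            failed.append(model)
    return failed


# Values taken from the error snippet above.
failed = find_memory_regressions({"hf_Albert": 1.19}, {"hf_Albert": 1.26})
if failed:
    print(f"Error: {len(failed)} models below expected memory compression ratio:")
    for name in failed:
        print(f"    {name}")
```

Under this reading, a drop that is expected is handled by lowering the stored expected value rather than by loosening the comparison, which is what the last line of the log suggests.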

pytorchmergebot · 2023-03-08T06:51:51Z

@pytorchbot successfully started a revert job. Check the current status here.

pytorchmergebot · 2023-03-08T06:52:02Z

@yanboliang your PR has been successfully reverted.

This reverts commit c88aa33. Reverted #95416 on behalf of https://github.com/huydhn due to Sorry for reverting your PR. But it seems that the smoke test issue is related as it starts to fail consistently in trunk https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=50&name_filter=inductor_torchbench_smoketest_perf

yanboliang · 2023-03-10T21:46:17Z

@pytorchbot merge -f "irrelevant failure"

pytorchmergebot · 2023-03-10T21:48:00Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

For Meta internal use cases. Pull Request resolved: pytorch/pytorch#95416 Approved by: https://github.com/jansel

This reverts commit c88aa33. Reverted pytorch/pytorch#95416 on behalf of https://github.com/huydhn due to Sorry for reverting your PR. But it seems that the smoke test issue is related as it starts to fail consistently in trunk https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=50&name_filter=inductor_torchbench_smoketest_perf

…rch#96289) Fixes issues like the following: https://github.com/pytorch/pytorch/actions/runs/4362155257/jobs/7627059487 has a more serious core dump failure, but the log of curl failures (GCP linux trying to get EC2-specific metadata like EC2 AMI-ID, Instance ID, and Instance Type) confused the HUD. This PR gets rid of those curl failures. This may have contributed to the impression of "flaky GCP" in pytorch#95416 Pull Request resolved: pytorch#96289 Approved by: https://github.com/huydhn, https://github.com/yanboliang

davidberard98 · 2023-03-15T04:15:31Z

This caused a number of failures on the dashboard:

- Many dynamo eager-backend accuracy failures on the amp timm suite (bisected with regnety_002, but possibly others)
- Dynamo eager-backend accuracy failure on BERT_pytorch from the amp torchbench suite
- Slowdown on inductor-no-cudagraphs on lennard_jones (amp torchbench)
- Memory compression regression on ElectraForCausalLM

This caused a number of dynamo + eager backend accuracy failures on the timm suite (bisected using regnety_002, but possibly others) cc @Chillee to investigate. Also BERT_pytorch in torchbench.

yanboliang · 2023-03-17T22:43:53Z

@davidberard98 thanks for investigating. This PR actually fixed a serious bug in the dynamo benchmarks: before this PR, all AMP benchmarks fell back to eager mode. Some metrics may change now that torch.cuda.amp.autocast is truly supported in dynamo and the benchmarks, and we should definitely first identify where the failures come from.

Fixes #97382

#95416 fixed a critical bug in the dynamo benchmarks, where AMP tests fell back to eager mode before that PR. However, after that PR, we found that [a list of TIMM models failed amp + eager + training testing](https://docs.google.com/spreadsheets/d/1DEhirVOkj15Lu4UNawIUon9MqkVLaWqyT-DQPif5NHk/edit#gid=0). We have now identified the root cause: high loss values make gradient checking harder, as small changes in accumulation order upset accuracy checks. We should switch to the helper function `reduce_to_scalar_loss`, which is already used by the Torchbench tests. After switching to `reduce_to_scalar_loss`, the TIMM models' accuracy pass rate grows from 67.74% to 91.94% in my local test. The remaining 5 failing models (ese_vovnet19b_dw, fbnetc_100, mnasnet_100, mobilevit_s, sebotnet33ts_256) need further investigation and handling, but I think the cause should be similar.

Pull Request resolved: #97423 Approved by: https://github.com/Chillee
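
The idea behind reducing to a scalar loss can be illustrated without PyTorch. The sketch below is an assumption-laden analogue, not the real `reduce_to_scalar_loss`: the name `reduce_to_scalar` is hypothetical, and it works on nested Python lists of floats instead of tensors. It averages at every level, so the resulting loss stays on the scale of a single element instead of growing with the size of the output.

```python
def reduce_to_scalar(out):
    """Recursively average a nested structure of numbers into one float.

    Hypothetical, torch-free stand-in used only to illustrate the
    normalization idea; it is not the helper referenced above.
    """
    if isinstance(out, (int, float)):
        return float(out)
    if isinstance(out, (list, tuple)):
        if not out:
            raise ValueError("cannot reduce an empty sequence")
        return sum(reduce_to_scalar(x) for x in out) / len(out)
    if isinstance(out, dict):
        if not out:
            raise ValueError("cannot reduce an empty mapping")
        return sum(reduce_to_scalar(v) for v in out.values()) / len(out)
    raise TypeError(f"unsupported output type: {type(out).__name__}")


outputs = [2.0] * 1000
summed_loss = sum(outputs)                 # grows with the number of elements
averaged_loss = reduce_to_scalar(outputs)  # stays on the scale of one element
print(summed_loss, averaged_loss)
```

A loss with a smaller magnitude leaves less room for reordering of floating-point accumulation to push a comparison past a fixed tolerance, which matches the root cause described above.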

github-actions bot added ciflow/inductor module: dynamo labels Feb 23, 2023

yanboliang added the topic: not user facing topic category label Feb 23, 2023

yanboliang requested review from jansel and mlazos February 23, 2023 22:30

jansel approved these changes Feb 25, 2023

yanboliang force-pushed the amp branch from e9c6cc7 to 4d7c4ce Compare February 27, 2023 02:06

yanboliang added the ciflow/trunk Trigger trunk jobs on your pull request label Feb 27, 2023

yanboliang force-pushed the amp branch from ca5673a to 591b6a6 Compare March 1, 2023 02:29

yanboliang mentioned this pull request Mar 1, 2023

Dynamo AutocastModeVariable bug: missing with context on the graph break instruction #95837

Closed

yanboliang force-pushed the amp branch 2 times, most recently from fdfe88a to 93f3088 Compare March 7, 2023 22:14

pytorchmergebot added the Merged label Mar 8, 2023

pytorchmergebot closed this in c88aa33 Mar 8, 2023

yanboliang deleted the amp branch March 8, 2023 01:41

pytorch deleted a comment from pytorch-bot bot Mar 8, 2023

pytorchmergebot added the Reverted label Mar 8, 2023

weiwangmeta mentioned this pull request Mar 8, 2023

Make setup linux action be more friendly with gcp linux runners #96289

Closed

yanboliang added 5 commits March 10, 2023 05:54

Fix unit tests

5365cfd

Fix unit test

3a51157

Skip the unit test that pollute env

b628e34

enable sdpa test

55577c7

Update expected compression ratio for hf_Albert

a7360a7

yanboliang force-pushed the amp branch from 7fdb832 to a7360a7 Compare March 10, 2023 05:55

yanboliang added the ciflow/inductor-perf-compare label Mar 10, 2023

pytorchmergebot closed this in 7fcf8b1 Mar 10, 2023

yanboliang deleted the amp branch March 10, 2023 21:48

Chillee mentioned this pull request Mar 22, 2023

amp + eager backend + training failing for some timm models #97382

Closed

SherlockNoMad mentioned this pull request Mar 22, 2023

Torch Dynamo Error when Compiling/Exporting Module which Uses Amp Utility #97320

Closed

yanboliang mentioned this pull request Mar 23, 2023

[Dynamo] Fix TIMM benchmark compute_loss #97423

Closed

6 participants
