Skip to content

[Dynamo] Support torch.{cuda/cpu}.amp.autocast#95416

Closed
yanboliang wants to merge 10 commits intopytorch:masterfrom
yanboliang:amp
Closed

[Dynamo] Support torch.{cuda/cpu}.amp.autocast#95416
yanboliang wants to merge 10 commits intopytorch:masterfrom
yanboliang:amp

Conversation

@yanboliang
Copy link
Contributor

@yanboliang yanboliang commented Feb 23, 2023

@pytorch-bot
Copy link

pytorch-bot bot commented Feb 23, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/95416

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Failures, 2 Pending

As of commit a7360a7:

BROKEN TRUNK - The following jobs failed but were present on the merge base 076792a:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@yanboliang yanboliang added the ciflow/trunk Trigger trunk jobs on your pull request label Feb 27, 2023
@yanboliang
Copy link
Contributor Author

Currently blocked by #95837

@yanboliang
Copy link
Contributor Author

@pytorchbot merge -f "flaky gcp problem"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@yanboliang yanboliang deleted the amp branch March 8, 2023 01:41
@pytorch pytorch deleted a comment from pytorch-bot bot Mar 8, 2023
@pytorch pytorch deleted a comment from pytorch-bot bot Mar 8, 2023
@huydhn
Copy link
Contributor

huydhn commented Mar 8, 2023

@pytorchbot revert -m 'Sorry for reverting your PR. But it seems that the smoke test issue is related as it starts to fail consistently in trunk https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=50&name_filter=inductor_torchbench_smoketest_perf' -c ignoredsignal

@weiwangmeta
Copy link
Contributor

@huydhn
Copy link
Contributor

huydhn commented Mar 8, 2023

Here is the error snippet:

+ python benchmarks/dynamo/check_memory_compression_ratio.py --actual /var/lib/jenkins/workspace/test/test-reports/inductor_training_smoketest_hf_Albert.csv --expected benchmarks/dynamo/expected_ci_perf_inductor_torchbench.csv

            hf_Albert                         :
                actual_memory_compression=1.19,
                expected_memory_compression=1.26,
                FAIL
            

Error: 1 models below expected memory compression ratio:
    hf_Albert
If this drop is expected, you can update `benchmarks/dynamo/expected_ci_perf_inductor_torchbench.csv`.

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Copy link
Collaborator

@yanboliang your PR has been successfully reverted.

@yanboliang
Copy link
Contributor Author

@pytorchbot merge -f "irrelevant failure"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@yanboliang yanboliang deleted the amp branch March 10, 2023 21:48
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 12, 2023
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 12, 2023
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 12, 2023
ydwu4 added a commit to ydwu4/pytorch that referenced this pull request Mar 13, 2023
ydwu4 added a commit to ydwu4/pytorch that referenced this pull request Mar 13, 2023
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 14, 2023
@davidberard98
Copy link
Contributor

davidberard98 commented Mar 15, 2023

This caused a number of failures on the dashboard:

  • many dynamo eager-backend accuracy failures on amp timm suite (bisected with regnety_002, but possibly others)
  • dynamo eager-backend accuracy failure on BERT_pytorch from amp torchbench suite
  • Slowdown on inductor-no-cudagraphs on lennard_jones (amp torchbench)
  • Memory compression regression on ElectraForCausalLM

This caused a number of dynamo + eager backend accuracy failures on the timm suite (bisected using regnety_002, but possibly others) cc @Chillee to investigate. Also BERT_pytorch in torchbench.

@yanboliang
Copy link
Contributor Author

@davidberard98 thanks for your investigation. This PR actually fixed a serious bug in dynamo benchmark: for all AMP benchmarks, actually we fallback to eager mode before this PR. It may have some metric changes since we truly support torch.cuda.autocast in dynamo and benchmarks, and definitely we should identify where is the failure from firstly.

pytorchmergebot pushed a commit that referenced this pull request Mar 24, 2023
Fixes #97382

#95416 fixed a critical bug in dynamo benchmark, where AMP tests fall back to eager mode before that PR. However, after that PR, we found [a list of TIMM models amp + eager + training testing failed](https://docs.google.com/spreadsheets/d/1DEhirVOkj15Lu4UNawIUon9MqkVLaWqyT-DQPif5NHk/edit#gid=0).
Now we identified the root cause is: high loss values make gradient checking harder, as small changes in accumulation order upset accuracy checks. We should switch to the helper function ```reduce_to_scalar_loss``` which has been used by Torchbench tests.
After switching to ```reduce_to_scalar_loss```, TIMM models accuracy pass rate grows from 67.74% to 91.94% in my local test. The rest 5 failed models(ese_vovnet19b_dw, fbnetc_100, mnasnet_100, mobilevit_s, sebotnet33ts_256) need further investigation and handling, but I think it should be similar reason.

Pull Request resolved: #97423
Approved by: https://github.com/Chillee
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants