[Traceable FSDP2][Compiled Autograd] Check Dynamo stance to decide whether to fallback to eager #138113

yf225 · 2024-10-16T20:59:02Z

Dynamo stance was recently added in #137504. When the Dynamo stance is "force_eager" (the user explicitly wants to fall back to eager), Compiled Autograd should fall back to eager as well. This lets the Traceable FSDP2 use case work, since "eager forward + compiled autograd backward" is not supported for Traceable FSDP2.

In general, a user who wants "eager forward + compiled autograd backward" should explicitly run the forward in eager, rather than applying compile and then setting the stance to "force_eager".
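The decision described above can be modelled in plain Python. This is only an illustrative sketch, not the code changed by this PR: `Backward` and `choose_backward` are hypothetical names, and the only value taken from the description is the stance string "force_eager".

```python
from enum import Enum


class Backward(Enum):
    """Which engine runs the backward pass (hypothetical, for illustration)."""
    EAGER = "eager autograd"
    COMPILED = "compiled autograd"


def choose_backward(stance: str, compiled_autograd_enabled: bool) -> Backward:
    """Model the rule: a "force_eager" stance overrides Compiled Autograd.

    `stance` stands in for the current Dynamo stance, and
    `compiled_autograd_enabled` for whether the user turned on
    Compiled Autograd. Both parameters are assumptions of this sketch.
    """
    if stance == "force_eager":
        # The user asked for eager everywhere, so the backward is eager too.
        return Backward.EAGER
    # Otherwise Compiled Autograd is used only if it was enabled.
    return Backward.COMPILED if compiled_autograd_enabled else Backward.EAGER


if __name__ == "__main__":
    for stance in ("default", "force_eager"):
        print(stance, "->", choose_backward(stance, True).value)
```

Under this model, a forward that falls back to eager because of the stance is never paired with a compiled backward, which is the combination Traceable FSDP2 does not support.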

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @rec @xmfan

pytorch-bot · 2024-10-16T20:59:06Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138113

✅ You can merge normally! (1 Unrelated Failure)

As of commit 0481773 with merge base d531bd5:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

inductor-periodic / cuda12.1-py3.10-gcc9-sm80 / test (inductor_torchbench_smoketest_perf, 1, 1, linux.gcp.a100) (gh) (detected as infra flaky with no log or failing log classifier)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

yf225 · 2024-10-16T21:24:11Z

@pytorchbot merge

pytorchmergebot · 2024-10-16T21:25:59Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2024-10-16T21:56:42Z

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

pull / linux-focal-rocm6.2-py3.10 / build

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

yf225 · 2024-10-17T00:17:50Z

@pytorchbot merge

pytorchmergebot · 2024-10-17T00:19:34Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

yf225 · 2024-10-17T03:42:23Z

@pytorchbot merge -f "stuck in CI, also low-risk to merge since Dynamo stance is a new API"

pytorchmergebot · 2024-10-17T03:42:41Z

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

yf225 · 2024-10-17T18:38:26Z

@pytorchbot merge

pytorchmergebot · 2024-10-17T18:40:24Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot · 2024-10-17T19:42:04Z

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

pull / linux-jammy-py3.10-clang15-asan / test (default, 5, 6, linux.4xlarge)

Failing merge rule: Core Maintainers

yf225 · 2024-10-17T20:09:34Z

@pytorchbot merge

pytorchmergebot · 2024-10-17T20:11:34Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot · 2024-10-17T20:37:20Z

Merge failed

Reason: New commits were pushed while merging. Please rerun the merge command.

yf225 · 2024-10-17T20:45:02Z

@pytorchbot merge

pytorchmergebot · 2024-10-17T20:47:40Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

yf225 · 2024-10-18T00:09:33Z

@pytorchbot merge -f "stuck in inductor_torchbench_smoketest_perf CI job, also low-risk to merge since Dynamo stance is a new API"

pytorchmergebot · 2024-10-18T00:09:51Z

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

yf225 · 2024-10-18T00:11:13Z

@pytorchbot merge -f "stuck in inductor_torchbench_smoketest_perf CI job, also low-risk to merge since Dynamo stance is a new API"

pytorchmergebot · 2024-10-18T00:11:17Z

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

pytorchmergebot · 2024-10-18T00:12:52Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This allows currently pending tests to finish and report signal before the merge.

[Compiled Autograd] Check stance to decide whether to fallback to eager

7fb6a87

yf225 mentioned this pull request Oct 16, 2024

[Traceable FSDP2] Add compiled_autograd_enabled helper function #138105

Closed

pytorch-bot bot added the ciflow/inductor, module: compiled autograd, module: dynamo, oncall: distributed, and release notes: distributed (fsdp) labels Oct 16, 2024

yf225 requested a review from xmfan October 16, 2024 20:59

yf225 changed the title ~~[Compiled Autograd] Check stance to decide whether to fallback to eager~~ [Compiled Autograd] Check Dynamo stance to decide whether to fallback to eager Oct 16, 2024

xmfan approved these changes Oct 16, 2024

yf225 added a commit that referenced this pull request Oct 16, 2024

[Compiled Autograd] Check stance to decide whether to fallback to eager

e295782

ghstack-source-id: 707bef0 Pull Request resolved: #138113

yf225 added the keep-going label and removed the release notes: distributed (fsdp) label Oct 16, 2024

pytorch-bot bot added the ciflow/trunk label Oct 16, 2024

yf225 added the topic: not user facing label Oct 16, 2024

pytorchmergebot added the merging label Oct 16, 2024

pytorchmergebot removed the merging label Oct 16, 2024

pytorchmergebot added the merging label Oct 17, 2024

yf225 mentioned this pull request Oct 17, 2024

[compiled autograd] have context manager turn on dynamo config #138241

Closed

pytorchmergebot removed the merging label Oct 17, 2024

pytorchmergebot added the merging label Oct 17, 2024

yf225 mentioned this pull request Oct 17, 2024

[Traceable FSDP2] Add _compiled_autograd_enabled global state variable #138187

Closed

pytorchmergebot removed the merging label Oct 17, 2024

pytorchmergebot added the merging label Oct 17, 2024

pytorchmergebot added the Merged label Oct 18, 2024

pytorchmergebot closed this in 2f91d7c Oct 18, 2024

pytorchmergebot removed the merging label Oct 18, 2024

github-actions bot deleted the gh/yf225/141/head branch November 18, 2024 02:10

yf225 changed the title ~~[Compiled Autograd] Check Dynamo stance to decide whether to fallback to eager~~ [Traceable FSDP2][Compiled Autograd] Check Dynamo stance to decide whether to fallback to eager Jan 6, 2025
