
Conversation

@ankurneog

@ankurneog ankurneog commented Nov 18, 2024

This is in line with the changes introduced in #130714; additional files are included to support non-CUDA devices.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames

@pytorch-bot

pytorch-bot bot commented Nov 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140929

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 1 Pending, 1 Unrelated Failure

As of commit a23bf65 with merge base 0c05832 (image):

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@ankurneog
Author

@anijain2305 : Can you please help with the review and approval? This is along the lines of #130714.
Thanks

@bdhirsh bdhirsh requested a review from anijain2305 November 26, 2024 00:24
@bdhirsh bdhirsh added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Nov 26, 2024
@ankurneog
Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased dynamo_backend_hpu onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout dynamo_backend_hpu && git pull --rebase)

@netlify

netlify bot commented Dec 6, 2024

Deploy Preview for chimerical-cranachan-793287 ready!

Name Link
🔨 Latest commit 364e8c19eb73d276ed6ba91ea9d7282a3eb4f923
🔍 Latest deploy log https://app.netlify.com/sites/chimerical-cranachan-793287/deploys/6752882b5a95220008248e49
😎 Deploy Preview https://deploy-preview-140929--chimerical-cranachan-793287.netlify.app

@ankurneog
Author

@anijain2305 : Gentle reminder, can you please help with the approval? Thanks!

@ankurneog
Author

@anijain2305 : Can you please help with the approval? Thanks

@ankurneog
Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/140929/head returned non-zero exit code 1

Rebasing (1/4)
Auto-merging test/dynamo/test_activation_checkpointing.py
Auto-merging test/dynamo/test_backends.py
Auto-merging test/dynamo/test_export.py
Auto-merging test/dynamo/test_modules.py
CONFLICT (content): Merge conflict in test/dynamo/test_modules.py
error: could not apply 00ba5897923... Add facility to run dynamo UTs for non-cuda devices
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply 00ba5897923... Add facility to run dynamo UTs for non-cuda devices

Raised by https://github.com/pytorch/pytorch/actions/runs/12390920075

@ankurneog
Author

ankurneog commented Jan 17, 2025

Hello @kwen2501 , @guangyey , @anijain2305 , @albanD , @EikanWang : can any one of you please help with this approval? It has been pending for a long time. Thank you.

Collaborator

@kwen2501 kwen2501 left a comment

LGTM. @anijain2305 @angelayi @ydwu4 @tugsbayasgalan please review to confirm, thanks!

@kwen2501 kwen2501 requested review from angelayi and ydwu4 January 17, 2025 07:05
Collaborator

@EikanWang EikanWang left a comment

In general, it is fine to add device information to the test cases for flexibility. However, the only_for of instantiate_device_type_tests does not align with the original semantics: it does not serve other devices. For example, it cannot support the privateuse1 backend.
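To illustrate the concern, here is a minimal sketch of how an only_for-style allow-list behaves; instantiate_for_devices is a toy stand-in for PyTorch's instantiate_device_type_tests, not the real API, and shows why a backend registered as privateuse1 can never opt in:

```python
# Simplified sketch of only_for filtering: any device type not on the
# allow-list is dropped, even if that backend is present and capable.
def instantiate_for_devices(test_names, available_devices, only_for):
    """Generate one per-device test name for each allow-listed device."""
    selected = [dev for dev in available_devices if dev in only_for]
    return [f"{name}_{dev}" for name in test_names for dev in selected]

# "privateuse1" is silently excluded because it is not on the allow-list.
generated = instantiate_for_devices(
    ["test_backend"],
    ["cpu", "cuda", "hpu", "privateuse1"],
    only_for=("cuda", "hpu"),
)
print(generated)  # ['test_backend_cuda', 'test_backend_hpu']
```

With an except_for-style deny-list instead, new backends would participate by default rather than needing to be added explicitly.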

Collaborator

ditto

Collaborator

ditto

Collaborator

ditto

Comment on lines +397 to +399
Collaborator

ditto

@EikanWang
Collaborator

In terms of the landed #130714, it introduced a similar issue. We need to track it and refine the test cases by removing only_for or extending it.

@ankurneog
Author

In terms of the landed #130714, it introduced a similar issue. We need to track it and refine the test cases by removing only_for or extending it.

These were running only for CUDA until now; if other accelerators want the support, they can verify and add themselves to the list.

@ankurneog
Author

@EikanWang : Note that these tests were written only for CUDA. This PR removes that restriction and generalizes them for other accelerators; if other accelerators support the test cases, they can add themselves to the list. I hope the intent is clear.

@EikanWang
Collaborator

EikanWang commented Jan 17, 2025

Per my understanding, it is HPU-biased. Let's focus on HPU enabling rather than limiting the scope to cuda and hpu.

@EikanWang : Note that these tests were written only for CUDA. This PR removes that restriction and generalizes them for other accelerators; if other accelerators support the test cases, they can add themselves to the list. I hope the intent is clear.

Actually, no. For example, https://github.com/pytorch/pytorch/blob/main/test/dynamo/test_activation_checkpointing.py#L1116 works for CPU as well. Why does this PR limit the case to CUDA?

@ankurneog
Author

@EikanWang : The CPU test-case-related comment is a valid one and will be addressed, but the general device-related comment is unacceptable. This works toward the goal of device abstraction in the UTs. The scheme is already followed in all op-related modules, e.g. test_ops.py, and was most recently added for FSDP and DTensor as well. I would request @albanD to moderate here. Thanks.

@EikanWang
Collaborator

EikanWang commented Jan 17, 2025

@EikanWang : The CPU test-case-related comment is a valid one and will be addressed, but the general device-related comment is unacceptable. This works toward the goal of device abstraction in the UTs. The scheme is already followed in all op-related modules, e.g. test_ops.py, and was most recently added for FSDP and DTensor as well. I would request @albanD to moderate here. Thanks.

@ankurneog , I agree with using instantiate_device_type_tests to support different devices. My point is that only_for may not align with the original semantics.

  • only_for("cuda", "hpu") indicates that all the test cases work on cuda and hpu only. However, some test cases are intended to test CPU, like test_compile_selective_checkpoint_parametrization and test_compile_selective_checkpoint_invalid_context.
  • only_for("cuda", "hpu") intends to enable cuda and hpu. However, most of the test cases are decorated with requires_cuda, so hpu will not work as expected: most test cases will be skipped.
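The interaction in the second bullet can be sketched with toy stand-ins (requires_cuda and instantiate below are illustrative simplifications, not the real PyTorch helpers): a generic class instantiated for hpu has every requires_cuda-guarded test skipped.

```python
import unittest

# Toy requires_cuda: skips unless the instantiated copy's device is cuda.
def requires_cuda(fn):
    def wrapper(self):
        if self.device != "cuda":
            raise unittest.SkipTest("requires CUDA")
        return fn(self)
    return wrapper

class GenericTests(unittest.TestCase):
    device = None  # filled in by per-device instantiation

    @requires_cuda
    def test_checkpoint(self):
        self.assertTrue(True)

def instantiate(cls, device):
    # Create a per-device copy of the generic test class.
    return type(f"{cls.__name__}{device.upper()}", (cls,), {"device": device})

def run(cls):
    result = unittest.TestResult()
    unittest.defaultTestLoader.loadTestsFromTestCase(cls).run(result)
    return result

hpu_result = run(instantiate(GenericTests, "hpu"))    # everything skipped
cuda_result = run(instantiate(GenericTests, "cuda"))  # runs normally
```

Under this sketch, enabling hpu via only_for achieves nothing while the per-test guard remains CUDA-specific.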

@AnantGulati
Contributor

@EikanWang

  • only_for("cuda", "hpu") indicates that all the test cases work on cuda and hpu only. However, some test cases are intended to test CPU, like test_compile_selective_checkpoint_parametrization and test_compile_selective_checkpoint_invalid_context.

With respect to editing the semantics to enable CPU tests, we can simply maintain two lists, which are passed to different classes as and when required.

instantiate_device_type_tests will automatically run tests for all devices present in the list passed to it (only_for), and hence gives us the flexibility to handle cases that require multi-device tests as well as single-device ones.

One way to approach this is as done in #138216; it would be great to discuss better approaches as well.

  • only_for("cuda", "hpu") intends to enable cuda and hpu. However, most of the test cases are decorated with requires_cuda, so hpu will not work as expected: most test cases will be skipped.

I agree that we should remove requires_cuda, as it makes the tests CUDA-specific.
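The "two lists" idea above can be sketched as follows (the list names and the second test name are illustrative, not from the PR): CPU-capable generic classes are instantiated with one device list, accelerator-only classes with another.

```python
# Hypothetical device lists; in practice each would be passed as only_for
# to instantiate_device_type_tests for the matching generic test class.
DEVICES_WITH_CPU = ("cpu", "cuda", "hpu")
ACCELERATOR_ONLY = ("cuda", "hpu")

def instantiate_for(only_for, test_names):
    """Toy per-device instantiation: one test name per allow-listed device."""
    return [f"{name}_{dev}" for name in test_names for dev in only_for]

cpu_capable = instantiate_for(
    DEVICES_WITH_CPU, ["test_compile_selective_checkpoint_parametrization"]
)
accel_only = instantiate_for(ACCELERATOR_ONLY, ["test_distributed_checkpoint"])
```

This keeps the CPU-intended cases runnable on CPU without widening the device list for cases that genuinely need an accelerator.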

@ankurneog
Author

@EikanWang : The CPU test-case-related comment is a valid one and will be addressed, but the general device-related comment is unacceptable. This works toward the goal of device abstraction in the UTs. The scheme is already followed in all op-related modules, e.g. test_ops.py, and was most recently added for FSDP and DTensor as well. I would request @albanD to moderate here. Thanks.

@ankurneog , I agreed with that to use instantiate_device_type_tests to support different devices. My point is only_for may not align with the original semantics.

  • only_for("cuda", "hpu") skips test cases like test_compile_selective_checkpoint_parametrization and test_compile_selective_checkpoint_invalid_context.
  • only_for("cuda", "hpu") intends to enable cuda and hpu. However, most of the test cases are decorated with requires_cuda, so hpu will not work as expected: most test cases will be skipped.

@EikanWang , @guangyey : Please see my comment on the reason for including @requires_cuda: the decorator ultimately checks for Triton-capable NVIDIA hardware (this needs to be decoupled eventually), hence the CI fails.
We can clean this up later after modifying the whole requires_cuda logic. In the meantime we can leave it as is, since the original functionality is unchanged.
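Decoupling that check from CUDA could look roughly like this; has_triton_for and the capability table are hypothetical, sketched only to show the shape of a device-agnostic guard rather than any real PyTorch API.

```python
import unittest

# Hypothetical capability table; a real version would query each backend.
_TRITON_CAPABLE = {"cuda": True, "hpu": True, "cpu": False}

def has_triton_for(device):
    """Hypothetical helper: does this device type have a Triton backend?"""
    return _TRITON_CAPABLE.get(device, False)

def requires_triton(device):
    """Skip the decorated test unless the device type supports Triton."""
    return unittest.skipUnless(
        has_triton_for(device), f"no Triton backend for {device}"
    )

@requires_triton("cpu")
def cpu_case(self):
    pass

@requires_triton("hpu")
def hpu_case(self):
    pass
```

With a guard keyed on the device type rather than on NVIDIA hardware, the same decorated tests could run wherever a Triton backend exists.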

@EikanWang
Collaborator

@ankurneog , I just triggered the CI. Let's wait for the CI signal.

@ankurneog
Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jan 20, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@ankurneog ankurneog deleted the dynamo_backend_hpu branch January 21, 2025 06:55

Labels

ciflow/trunk, Merged, module: dynamo, open source, topic: not user facing, triaged

8 participants