Conversation

@amathewc
Contributor

@amathewc amathewc commented Jan 8, 2025

MOTIVATION

We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at #126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices.

Other accelerators can also extend this functionality by adding their device to the devices list (e.g., xpu).

CHANGES

  • Create a separate class for test functions running on CUDA devices
  • Extend the functionality of these tests to include HPUs
  • Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances within the new classes (a minimal sketch of this pattern follows the list)
  • Apply the skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices
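
Below is a minimal sketch of the class split and instantiation pattern described above (not the PR's actual diff: the class and test names are illustrative, and the helper import paths are assumed and may differ across PyTorch versions):

import torch
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests


class ReproTests(TestCase):
    # Device-agnostic (CPU) tests stay in the original class.
    def test_cpu_only_behavior(self):
        self.assertEqual(torch.ones(2).sum().item(), 2.0)


class ReproTestsDevice(TestCase):
    # Generic device tests take a `device` argument ("cuda:0", "hpu:0", ...)
    # once they are instantiated per device type.
    def test_add_on_device(self, device):
        x = torch.ones(4, device=device)
        self.assertEqual((x + x).sum().item(), 8.0)


# Generates classes such as ReproTestsDeviceCUDA and ReproTestsDeviceHPU; other
# backends (e.g. xpu) can be enabled by extending this tuple.
devices = ("cuda", "hpu")
instantiate_device_type_tests(ReproTestsDevice, globals(), only_for=devices)

if __name__ == "__main__":
    run_tests()

Running such a file produces per-device test IDs like ReproTestsDeviceCUDA::test_add_on_device_cuda, matching the naming seen in the CI logs later in this thread.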

We previously submitted these changes in #140131 but deleted that PR due to merge conflicts and other issues.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @ankurneog

@pytorch-bot

pytorch-bot bot commented Jan 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/144387

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit 025617a with merge base d95a6ba:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla

linux-foundation-easycla bot commented Jan 8, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@janeyx99 janeyx99 requested a review from yanboliang January 12, 2025 03:53
@janeyx99 janeyx99 added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Jan 12, 2025
@amathewc
Contributor Author

@yanboliang: Did you get a chance to review this one?

@yanboliang
Contributor

Can you fix these failed unit tests?

@amathewc
Contributor Author

@yanboliang: I have fixed the unit test issues.

@amathewc
Contributor Author

@yanboliang: The two new failures in CI are not related to the changes in this PR.

@amathewc
Contributor Author

amathewc commented Jan 20, 2025

@pytorchbot rebase

@AnantGulati
Contributor

"Try to land this since the failure is unrelated."
@pytorchbot merge

@pytorch-bot

pytorch-bot bot commented Jan 20, 2025

This PR needs to be approved by an authorized maintainer before merge.

@ankurneog

@EikanWang: Can you please help with the review and approval? Thanks.

Contributor

@yanboliang yanboliang left a comment

LGTM, please fix the failed test.

@amathewc
Contributor Author

LGTM, please fix the failed test.

@yanboliang: It looks like the failing tests are not related to the changes we made.
"2025-01-20T06:51:50.7450315Z FAILED [0.0034s] dynamo/test_repros.py::ReproTestsDeviceCUDA::test_flash_attn_backward_mixed_strides_cuda - RuntimeError: FlashAttention only supports Ampere GPUs or newer. " --> Probably the CI tests are running on an older CUDA device?

@yanboliang
Contributor

@amathewc This test should run only when PLATFORM_SUPPORTS_FLASH_ATTENTION is true, but after your change it seems to run whenever any GPU is available.

@unittest.skipIf(
    TEST_WITH_ROCM or not PLATFORM_SUPPORTS_FLASH_ATTENTION,
    "flash attention not supported",
)
def test_flash_attn_backward_mixed_strides(self):
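
For reference, one way to preserve that constraint once the test lives in the device-generic class is sketched below; this is a simplified, hypothetical repro (the class name and test body are illustrative, and the PR's actual fix may differ):

import unittest

import torch
from torch.testing._internal.common_cuda import PLATFORM_SUPPORTS_FLASH_ATTENTION
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TEST_WITH_ROCM, TestCase, run_tests


class FlashAttnReproSketch(TestCase):  # hypothetical class, not the PR's code
    # Keep the original guard so the CUDA instantiation still requires
    # flash-attention support instead of running on any available GPU.
    @unittest.skipIf(
        TEST_WITH_ROCM or not PLATFORM_SUPPORTS_FLASH_ATTENTION,
        "flash attention not supported",
    )
    def test_flash_attn_backward_mixed_strides(self, device):
        # Simplified stand-in body: run SDPA forward and backward on the device.
        q = torch.randn(1, 2, 16, 8, device=device, dtype=torch.float16, requires_grad=True)
        k = torch.randn(1, 2, 16, 8, device=device, dtype=torch.float16, requires_grad=True)
        v = torch.randn(1, 2, 16, 8, device=device, dtype=torch.float16, requires_grad=True)
        out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        out.sum().backward()
        self.assertIsNotNone(q.grad)


# Restrict this sketch to CUDA; the decorator above handles the capability check.
instantiate_device_type_tests(FlashAttnReproSketch, globals(), only_for=("cuda",))

if __name__ == "__main__":
    run_tests()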

Collaborator

It seems that only the following test cases are applied to hpu; the other test cases in ReproTestsDevice still require CUDA. I question why we must move the other test cases from ReproTests to ReproTestsDevice.

  • test_sub_alpha_scalar_repro
  • test_guard_default_device
  • test_megablocks_moe

Contributor Author

@EikanWang: The idea was to keep all device tests (whether the device is cuda, hpu, xpu, or any other) within the ReproTestsDevice class. It felt more logical to do so, and if any of these tests are enabled on additional devices in the future, they can be enabled easily.

Collaborator

The idea makes sense. However, I cannot see a clear rule for which test cases should be moved to ReproTestsDevice. What is the logic for moving these particular test cases from ReproTests to ReproTestsDevice, even though some of them are still decorated with requires_cuda?

  • If this PR intends to generalize ReproTests, the current state is only intermediate.
  • If this PR intends to enable HPU through instantiate_device_type_tests, it should not move CUDA-specific cases to ReproTestsDevice.

Contributor Author

The intention is to keep CPU-only tests in ReproTests, and device-based tests, which are instantiated using instantiate_device_type_tests, in ReproTestsDevice.
Right now the devices tuple contains only ("cuda", "hpu"), and some specific test cases work only on CUDA devices today, hence they are marked with requires_cuda. This approach gives us several kinds of flexibility (a sketch of this arrangement follows the list):

  • If we have to add test cases in the future that work only on hpu, they can be added to the ReproTestsDevice class later with a similar requires_hpu tag.
  • If any of the test cases marked requires_cuda becomes supported on HPU, the decorator can simply be removed.
  • If another device is added, its test cases can be enabled easily by adding the device to the devices tuple.
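
A small sketch of this arrangement, assuming the standard device-type helpers (onlyCUDA is used here as a stand-in for the requires_cuda marking mentioned above, and all names are illustrative):

import torch
from torch.testing._internal.common_device_type import (
    instantiate_device_type_tests,
    onlyCUDA,
)
from torch.testing._internal.common_utils import TestCase, run_tests


class ReproTestsDevice(TestCase):
    def test_runs_on_every_listed_device(self, device):
        # Instantiated once per entry in the `devices` tuple below.
        x = torch.arange(4, device=device)
        self.assertEqual(x.sum().item(), 6)

    @onlyCUDA  # stand-in for requires_cuda; drop it once the case works on HPU
    def test_cuda_only_for_now(self, device):
        self.assertIn("cuda", device)


# Enabling another backend (e.g. xpu) is a one-entry change here.
devices = ("cuda", "hpu")
instantiate_device_type_tests(ReproTestsDevice, globals(), only_for=devices)

if __name__ == "__main__":
    run_tests()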

Collaborator

Thanks for the elaboration. LGTM.

@ankurneog ankurneog left a comment

LGTM

@amathewc
Contributor Author

@EikanWang, @kwen2501: Could you review this PR?

@amathewc
Contributor Author

@pytorchmergebot : merge

@pytorch-bot

pytorch-bot bot commented Jan 22, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: ':' (choose from 'merge', 'revert', 'rebase', 'label', 'drci', 'cherry-pick', 'close')

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try @pytorchbot --help for more info.

@amathewc
Contributor Author

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) Jan 22, 2025
@pytorchmergebot
Collaborator

Merge failed

Reason: Approvers from one of the following sets are needed:

  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10, ...)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet, ...)
Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

@amathewc
Contributor Author

@kwen2501, @guangyey: The PR has been approved. Could you help in merging this?

@guangyey
Collaborator

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

Fix unit test failures.
Remove TEST_HPU flag.
Fix test_debug_utils.py.
Fix issue in test_repros.py which was causing test_flash_attn_backward_mixed_strides to fail.
Remove skipIfHpu decorator for tests which already have requires_cuda decorator, as per review comments.
@pytorchmergebot
Collaborator

Successfully rebased dynamo_changes onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout dynamo_changes && git pull --rebase)

@guangyey
Collaborator

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@amathewc amathewc deleted the dynamo_changes branch January 23, 2025 09:38
pytorchmergebot pushed a commit that referenced this pull request Apr 4, 2025
This PR is related to #145476. That PR had two files (test_functions.py and test_misc.py); test_functions.py was causing CI/rebase/merge issues and hence was removed for now. This PR contains only test_misc.py.

This is a continuation of #144387.

## MOTIVATION
We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at #126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices.

Other accelerators can also extend this functionality by adding their device to the devices list (e.g., xpu).

## CHANGES
Create a separate class for test functions running on CUDA devices
Extend the functionality of these tests to include HPUs
Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances within the new classes
Apply skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices

PS: Most of these changes were initially part of #147609, but that PR was closed due to merge conflicts. The review comments were handled in this PR.

Pull Request resolved: #149499
Approved by: https://github.com/EikanWang, https://github.com/desertfire, https://github.com/cyyever
timocafe pushed a commit to timocafe/pytorch that referenced this pull request Apr 16, 2025
amathewc added a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025