Conversation

@amathewc
Contributor

@amathewc amathewc commented Jan 8, 2025

MOTIVATION

We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at #126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices.

Other accelerators can also extend this functionality by adding their device to the devices list (e.g., xpu).

CHANGES

  • Create a separate class for test functions running on CUDA devices
  • Extend the functionality of these tests to include HPUs
  • Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances within the new classes (a minimal sketch of this pattern follows the list)
  • Apply the skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices
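
Below is a minimal sketch of the class split and instantiation pattern described above (not the PR's actual diff: the class and test names are illustrative, and the helper import paths are assumed and may differ across PyTorch versions):

import torch
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests


class ReproTests(TestCase):
    # Device-agnostic (CPU) tests stay in the original class.
    def test_cpu_only_behavior(self):
        self.assertEqual(torch.ones(2).sum().item(), 2.0)


class ReproTestsDevice(TestCase):
    # Generic device tests take a `device` argument ("cuda:0", "hpu:0", ...)
    # once they are instantiated per device type.
    def test_add_on_device(self, device):
        x = torch.ones(4, device=device)
        self.assertEqual((x + x).sum().item(), 8.0)


# Generates classes such as ReproTestsDeviceCUDA and ReproTestsDeviceHPU; other
# backends (e.g. xpu) can be enabled by extending this tuple.
devices = ("cuda", "hpu")
instantiate_device_type_tests(ReproTestsDevice, globals(), only_for=devices)

if __name__ == "__main__":
    run_tests()

Running such a file produces per-device test IDs like ReproTestsDeviceCUDA::test_add_on_device_cuda, matching the naming seen in the CI logs later in this thread.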

We previously submitted these changes in #140131 but deleted that PR due to merge conflicts and other issues.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @ankurneog

@pytorch-bot

pytorch-bot bot commented Jan 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/144387

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit 025617a with merge base d95a6ba:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla

linux-foundation-easycla bot commented Jan 8, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@janeyx99 janeyx99 requested a review from yanboliang January 12, 2025 03:53
@janeyx99 janeyx99 added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Jan 12, 2025
@amathewc
Contributor Author

@yanboliang: Did you get a chance to review this one?

@yanboliang
Contributor

Can you fix these failed unit tests?

@amathewc
Contributor Author

@yanboliang: I have fixed the unit test issues.

@amathewc
Contributor Author

@yanboliang: The two new failures in CI are not related to the changes in this PR.

@amathewc
Contributor Author

amathewc commented Jan 20, 2025

@pytorchbot rebase

@AnantGulati
Contributor

"Try to land this since the failure is unrelated."
@pytorchbot merge

@pytorch-bot

pytorch-bot bot commented Jan 20, 2025

This PR needs to be approved by an authorized maintainer before merge.

@ankurneog

@EikanWang: Can you please help with the review and approval? Thanks.

Contributor

@yanboliang yanboliang left a comment

LGTM, please fix the failed test.

@amathewc
Contributor Author

LGTM, please fix the failed test.

@yanboliang: It looks like the failing tests are not related to the changes we made.
"2025-01-20T06:51:50.7450315Z FAILED [0.0034s] dynamo/test_repros.py::ReproTestsDeviceCUDA::test_flash_attn_backward_mixed_strides_cuda - RuntimeError: FlashAttention only supports Ampere GPUs or newer. " --> Probably the CI tests are running on an older CUDA device?

@yanboliang
Contributor

@amathewc This test should run only when PLATFORM_SUPPORTS_FLASH_ATTENTION is true, but after your change it seems to run whenever any GPU is available.

@unittest.skipIf(
    TEST_WITH_ROCM or not PLATFORM_SUPPORTS_FLASH_ATTENTION,
    "flash attention not supported",
)
def test_flash_attn_backward_mixed_strides(self):
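
For reference, one way to preserve that constraint once the test lives in the device-generic class is sketched below; this is a simplified, hypothetical repro (the class name and test body are illustrative, and the PR's actual fix may differ):

import unittest

import torch
from torch.testing._internal.common_cuda import PLATFORM_SUPPORTS_FLASH_ATTENTION
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TEST_WITH_ROCM, TestCase, run_tests


class FlashAttnReproSketch(TestCase):  # hypothetical class, not the PR's code
    # Keep the original guard so the CUDA instantiation still requires
    # flash-attention support instead of running on any available GPU.
    @unittest.skipIf(
        TEST_WITH_ROCM or not PLATFORM_SUPPORTS_FLASH_ATTENTION,
        "flash attention not supported",
    )
    def test_flash_attn_backward_mixed_strides(self, device):
        # Simplified stand-in body: run SDPA forward and backward on the device.
        q = torch.randn(1, 2, 16, 8, device=device, dtype=torch.float16, requires_grad=True)
        k = torch.randn(1, 2, 16, 8, device=device, dtype=torch.float16, requires_grad=True)
        v = torch.randn(1, 2, 16, 8, device=device, dtype=torch.float16, requires_grad=True)
        out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        out.sum().backward()
        self.assertIsNotNone(q.grad)


# Restrict this sketch to CUDA; the decorator above handles the capability check.
instantiate_device_type_tests(FlashAttnReproSketch, globals(), only_for=("cuda",))

if __name__ == "__main__":
    run_tests()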

Collaborator

It seems that only the following test cases are applied to hpu; the other test cases in ReproTestsDevice still require CUDA. I question why we must move the other test cases from ReproTests to ReproTestsDevice.

  • test_sub_alpha_scalar_repro
  • test_guard_default_device
  • test_megablocks_moe

Contributor Author

@EikanWang: The idea was to keep all device tests (whether the device is cuda, hpu, xpu, or any other) within the ReproTestsDevice class. It felt more logical to do so, and if any of these tests are enabled on additional devices in the future, they can be enabled easily.

Collaborator

The idea makes sense. However, I cannot see a clear rule for which test cases should be moved to ReproTestsDevice. What is the logic for moving these particular test cases from ReproTests to ReproTestsDevice, even though some of them are still decorated with requires_cuda?

  • If this PR intends to generalize ReproTests, the current state is only intermediate.
  • If this PR intends to enable HPU through instantiate_device_type_tests, it should not move CUDA-specific cases to ReproTestsDevice.

Contributor Author

The intention is to keep CPU-only tests in ReproTests, and device-based tests, which are instantiated using instantiate_device_type_tests, in ReproTestsDevice.
Right now the devices tuple contains only ("cuda", "hpu"), and some specific test cases work only on CUDA devices today, hence they are marked with requires_cuda. This approach gives us several kinds of flexibility (a sketch of this arrangement follows the list):

  • If we have to add test cases in the future that work only on hpu, they can be added to the ReproTestsDevice class later with a similar requires_hpu tag.
  • If any of the test cases marked requires_cuda becomes supported on HPU, the decorator can simply be removed.
  • If another device is added, its test cases can be enabled easily by adding the device to the devices tuple.
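
A small sketch of this arrangement, assuming the standard device-type helpers (onlyCUDA is used here as a stand-in for the requires_cuda marking mentioned above, and all names are illustrative):

import torch
from torch.testing._internal.common_device_type import (
    instantiate_device_type_tests,
    onlyCUDA,
)
from torch.testing._internal.common_utils import TestCase, run_tests


class ReproTestsDevice(TestCase):
    def test_runs_on_every_listed_device(self, device):
        # Instantiated once per entry in the `devices` tuple below.
        x = torch.arange(4, device=device)
        self.assertEqual(x.sum().item(), 6)

    @onlyCUDA  # stand-in for requires_cuda; drop it once the case works on HPU
    def test_cuda_only_for_now(self, device):
        self.assertIn("cuda", device)


# Enabling another backend (e.g. xpu) is a one-entry change here.
devices = ("cuda", "hpu")
instantiate_device_type_tests(ReproTestsDevice, globals(), only_for=devices)

if __name__ == "__main__":
    run_tests()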

Collaborator

Thanks for the elaboration. LGTM.

@ankurneog ankurneog left a comment

LGTM

@amathewc
Contributor Author

@EikanWang, @kwen2501: Could you review this PR?

@amathewc
Contributor Author

@pytorchmergebot : merge

@pytorch-bot

pytorch-bot bot commented Jan 22, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: ':' (choose from 'merge', 'revert', 'rebase', 'label', 'drci', 'cherry-pick', 'close')

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try @pytorchbot --help for more info.

@amathewc
Contributor Author

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) Jan 22, 2025
@pytorchmergebot
Collaborator

Merge failed

Reason: Approvers from one of the following sets are needed:

  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10, ...)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet, ...)
Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

@amathewc
Contributor Author

@kwen2501, @guangyey: The PR has been approved. Could you help in merging this?

@guangyey
Collaborator

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

Fix unit test failures.
Remove TEST_HPU flag.
Fix test_debug_utils.py.
Fix issue in test_repros.py which was causing test_flash_attn_backward_mixed_strides to fail.
Remove skipIfHpu decorator for tests which already have requires_cuda decorator, as per review comments.
@pytorchmergebot
Collaborator

Successfully rebased dynamo_changes onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout dynamo_changes && git pull --rebase)

@guangyey
Collaborator

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@amathewc amathewc deleted the dynamo_changes branch January 23, 2025 09:38
pytorchmergebot pushed a commit that referenced this pull request Apr 4, 2025
This PR is related to #145476. That PR had two files (test_functions.py and test_misc.py); test_functions.py was causing CI/rebase/merge issues and hence was removed for now. This PR contains only test_misc.py.

This is a continuation of #144387.

## MOTIVATION
We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at #126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices.

Other accelerators can also extend this functionality by adding their device to the devices list (e.g., xpu).

## CHANGES
Create a separate class for test functions running on CUDA devices
Extend the functionality of these tests to include HPUs
Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances within the new classes
Apply skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices

PS: Most of these changes were initially part of #147609, but that PR was closed due to merge conflicts. The review comments were handled in this PR.

Pull Request resolved: #149499
Approved by: https://github.com/EikanWang, https://github.com/desertfire, https://github.com/cyyever
timocafe pushed a commit to timocafe/pytorch that referenced this pull request Apr 16, 2025
amathewc added a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025