Generalization of distributed test cases for non-CUDA devices #138216
Conversation
@pytorchbot label "topic: not user facing"

@pytorchbot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.

Successfully rebased 6426826 to 8b2be13.

@pytorchbot rebase

You don't have permissions to rebase this PR since you are a first-time contributor. If you think this is a mistake, please contact PyTorch Dev Infra.
kwen2501
left a comment
Thanks much for the generalization effort!
I put some comments in the code.
@yifuwang do you mind having a look at the changes to test_functional_api.py? Thanks!
Mind sharing the reason for changing the base test class?
This was not required.
I accounted for this change along with changes mentioned by @yifuwang in #138216 (comment), in the following commit: 7afa48037bfcdc0f2f8837cebcaefcfbf633de89
Thanks
For better readability, can we keep the original line (as the default) and do a conditional overwrite?

```python
BACKEND = dist.Backend.NCCL if torch.cuda.is_available() else dist.Backend.GLOO
if TEST_HPU:
    BACKEND = dist.Backend.HCCL
elif ...
```
Done in commit: 7e6c741b9efe19c7d3fc0d6326becec6879c98c9
Would this else branch skip "cpu" tests that might originally have been running?
It seems we can just remove the "else" branch to keep the behavior the same as before?
Done in commit: 9383603de38f60ce5943681110ecb806f6736af4
Why DDP-based?
You are right, it is meant to be distributed. Fixed in commit: 65e16deb07c5ca9acef323058a99d26e89633acb
Maybe we can leave this to the child class?
The rationale behind adding this here was to allow for cleaner and shorter code in the child class. As this class is meant to be derived from, the method can be overridden by the child class wherever required.
I would love to hear your thoughts on this, though.
There are two reasons:
- Some TestClass may want to init distributed only once for the entire test suite (to save time).
- Some TestClass may want to pass different init options to init_process_group.
Regarding init options for TestClasses:
The current implementation allows child classes to override the method, providing flexibility for different init options. However, if you think it would be beneficial, we could consider passing init options as parameters to the function, which could make this easier for future users.
Concerning single initialization of the distributed environment:
I understand the concern about initializing the distributed environment only once for the entire test suite.
To accommodate this, we could explore creating a separate class, modeled after MultiProcContinuousTest, which would allow a single process-group creation for the entire file. Alternatively, we could integrate this functionality within the existing class structure to allow for a more uniform implementation across test files.
The current changes align with the existing implementation of MultiProcessTestCase while enabling easier device-agnostic implementations, and they do not interfere with classes that initialize the process group only once, so I believe it makes sense to limit the scope of this PR as it is.
I'm more than willing to work on a follow-up PR to better accommodate these functionalities. This will allow other developers to benefit from these changes while I work on adding the new functionality. Your input on this would be valuable.
All in all, having these functionalities in the base class enables easier use of these test classes in independent test files without significantly reducing flexibility, for the reasons mentioned above, and hence I feel they should remain as they are rather than being left to the child class.
I'm open to further discussion on these points and any other concerns you may have.
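To make the trade-off concrete, here is a minimal sketch of the override-based flexibility described above. It is an illustration only: the hook name create_pg and its return value are assumptions, not necessarily the exact API added in this PR.

```python
import torch.distributed as dist
from torch.testing._internal.common_distributed import MultiProcessTestCase


class MyBackendTest(MultiProcessTestCase):
    # Hypothetical override (hook name assumed): a child class supplying its
    # own init options while keeping the rest of the base-class machinery.
    def create_pg(self, device):
        dist.init_process_group(
            backend="gloo",
            rank=self.rank,
            world_size=self.world_size,
            init_method=f"file://{self.file_name}",
        )
        return dist.group.WORLD
```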
Same here: leave this to the child class?
c10d has changed to support only one device per process (rank). I feel this logic can be simplified under that assumption.
e.g. as simple as rank % num_visible_devices
This makes a lot of sense.
I have applied the changes in: 67b6adbb78be2c2c6cebdf8005305c4ce1798830
Thanks
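As a side note, a minimal sketch of the rank-to-device mapping suggested above, assuming torch.get_device_module is available (the helper name and its default argument are illustrative):

```python
import torch


def rank_to_device(rank: int, device_type: str = "cuda") -> torch.device:
    # One device per process (rank): fold the rank onto the visible devices.
    mod = torch.get_device_module(device_type)  # e.g. torch.cuda or torch.hpu
    num_visible_devices = mod.device_count()
    return torch.device(device_type, rank % num_visible_devices)
```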
jgong5
left a comment
Do you mind adding some review comments to explain why we want to skip some tests on HPU?
This test is meant to run with a single process using a fake process group. I don't think it falls within the scope of this PR.
You are right, this was changed unnecessarily.
Reverted to the old test in: 7afa48037bfcdc0f2f8837cebcaefcfbf633de89
Thanks
kwen2501
left a comment
LGTM.
Can you address the two new comments before merge?
(1) Remove set_device.
(2) Remove barrier.
Please remove the set_device call before merging. We want to make sure our library works even without the user setting the device beforehand.
Removing this causes errors on CUDA, which is why I had added it.
NCCL appears to assign multiple distributed processes to the same device. The error I am getting is:
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1a000
The same error is not seen on HPU. My understanding is that this has something to do with how NCCL assigns ranks to devices when there is no explicit set_device call.
I would appreciate any insight you have on this.
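For context, here is a minimal sketch of the per-rank device pinning under discussion. The setup helper and the env-var rendezvous are assumptions for illustration, not this PR's code:

```python
import os

import torch
import torch.distributed as dist


def setup(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    # Pin each rank to its own GPU before init_process_group, so NCCL
    # never sees two ranks mapped to the same device.
    torch.cuda.set_device(rank % torch.cuda.device_count())
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
```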
Considering this is a base test class, can you please remove the barrier() call? We want to make sure our tests work as expected even without a barrier at the end. If a specific test needs a barrier, it should be called explicitly in that test.
And if there is only one line left, maybe there is no need to wrap it in a method?
Done in commit: e97d3a38089a266cf7f37d69309e208d9c58ee0b
@pytorchbot merge
```python
def wrapper(*args, **kwargs):
    if torch.cuda.is_available() and torch.cuda.device_count() >= x:
        return func(*args, **kwargs)
    if TEST_HPU and torch.hpu.device_count() >= x:
```
Why didn't we use get_device_module here as well, as is done in the rank_to_device function?
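A hedged sketch of what this suggestion might look like. The decorator name and skip behavior are assumptions; torch.get_device_module is the real API being referenced:

```python
import functools
import unittest

import torch


def requires_n_devices(x: int, device_type: str = "cuda"):
    # Hypothetical decorator: run the test only if >= x devices are visible.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # get_device_module covers cuda, hpu, xpu, ... without per-device branches.
            mod = torch.get_device_module(device_type)
            if mod.is_available() and mod.device_count() >= x:
                return func(*args, **kwargs)
            raise unittest.SkipTest(f"requires at least {x} {device_type} devices")

        return wrapper

    return decorator
```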
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: pytorch#138216
Approved by: https://github.com/kwen2501, https://github.com/guangyey
In this series of PRs we intend to refactor distributed test cases to be completely device agnostic. These changes will include the following approaches:
- Allowing for multiple device types using instantiate_device_type_test
- Replacing calls to the CUDA stream with torch.get_device_module(device) wherever it applies
- Skipping set-up steps required while using MultiProcessTestCase with DistributedTestBase (#138216) wherever applicable
- Replacing explicit calls to a distributed backend (NCCL, HCCL, etc.) with get_default_backend_for_device (#140536); see the sketch after this list
This should result in a significant improvement in usability for all devices.
Pull Request resolved: #145222
Approved by: https://github.com/kwen2501
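A minimal sketch of that last point, assuming torch.distributed.get_default_backend_for_device (referenced in #140536) is available in the build:

```python
import torch
import torch.distributed as dist

# Resolve the device type from whatever accelerator is present, falling back to CPU.
device_type = "cuda" if torch.cuda.is_available() else "cpu"

# Derive the backend from the device instead of hard-coding "nccl"/"hccl"/"gloo".
backend = dist.get_default_backend_for_device(device_type)
print(backend)  # e.g. "nccl" for cuda, "gloo" for cpu
```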
# MOTIVATION
To generalize distributed test cases for non-CUDA devices, we are leveraging the DistributedTestBase class introduced in PR #138216. This new class is derived from MultiProcessTestCase and abstracts the creation/deletion of process groups and other functionality for specific devices. In this PR, we extend the scope of these tests to support HPUs.
# CHANGES
- Replaced MultiProcessTestCase with the DistributedTestBase class.
- Extended test functionality to include support for HPUs.
- Utilized instantiate_device_type_tests with targeted attributes to generate device-specific test instances; see the sketch after this list.
- Applied the skipIfHPU decorator to skip tests that are not yet compatible with HPU devices.
Pull Request resolved: #145056
Approved by: https://github.com/kwen2501, https://github.com/guangyey
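A hedged sketch of the instantiate_device_type_tests pattern mentioned above, using a plain TestCase for brevity (the class and test body are illustrative, not this PR's code):

```python
import torch
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests


class ExampleDeviceTest(TestCase):
    # Illustrative body; the PR applies the same pattern to distributed tests.
    def test_ones(self, device):
        t = torch.ones(4, device=device)
        self.assertEqual(t.sum().item(), 4.0)


# Generate one concrete test class per device type of interest.
instantiate_device_type_tests(ExampleDeviceTest, globals(), only_for=["cuda", "hpu"])

if __name__ == "__main__":
    run_tests()
```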
Motivation
This PR is an extension of #131758. As described there, these changes aim to make distributed UTs more accessible to users of all device types.
It demonstrates a few changes discussed by @kwen2501 and @jgong5 in the discussion for #131758 (#131758 (comment)).
This PR contains two types of changes. The first is to the common distributed folder, where we have added a new class derived from MultiProcessTestCase which helps abstract out process-group creation/deletion and other functionality for a given device.
New generalized content can be added by deriving from this base class.
It also includes other misc changes for Gaudi support.
The second changed file is test_functional_api.py, a test file in common distributed. This file is a POC for how we can use this new class to write more device-agnostic distributed test cases.
The following changes have been made to test_functional_api.py:
- Functionality has been added to test non-CUDA devices, using Intel HPU as an example
- Multiple set-up steps previously required by MultiProcessTestCase have been abstracted out
- Misc adaptations to allow for a general call to accelerators, adding test skips instead of explicitly skipping for multiple GPUs
- skipIfHPU flags have been added to skip a few multithreaded test cases which are not yet supported on HPUs
NOTE: Within test_functional_api.py, there are tests which require the use of some multithreading functions that are not yet supported on HPUs. These have been skipped for HPU using the skipHPU decorator.
I will be raising a separate PR to improve the usability of said decorators in a device-agnostic setting, in the manner suggested by @kwen2501 in a comment on this PR.
This PR is a cleaned-up version of a previous PR (#136988), which I closed due to human error. I have addressed some of the comments made by @kwen2501 in it as well.
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o