Conversation

@AnantGulati (Contributor) commented Oct 17, 2024

Motivation

This PR is an extension of #131758. As described there, these changes aim to make distributed UTs more accessible to users of all device types.

It demonstrates a few of the changes discussed by @kwen2501 and @jgong5 in the discussion on #131758 (#131758 (comment)).

This PR contains two types of changes. The first is to the common distributed folder, where we have added a new class derived from MultiProcessTestCase that helps abstract out process group creation/deletion and other functionality for a given device.

New generalized content can be added by deriving from this base class. The PR also includes other miscellaneous changes for Gaudi support.
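
To make the pattern concrete, here is a minimal sketch (not the actual code in this PR; the class name DistributedTestBase comes from the follow-up work referenced at the bottom of this page, while the helper names and the backend table are assumptions):

import torch.distributed as dist
from torch.testing._internal.common_distributed import MultiProcessTestCase


class DistributedTestBase(MultiProcessTestCase):
    def setUp(self):
        super().setUp()
        self._spawn_processes()  # one process per rank, as in MultiProcessTestCase

    def create_pg(self, device_type):
        # Pick a backend suited to the device and create the process group,
        # so individual tests no longer repeat this boilerplate.
        backend = {"cuda": "nccl", "hpu": "hccl"}.get(device_type, "gloo")
        dist.init_process_group(
            backend=backend,
            world_size=self.world_size,
            rank=self.rank,
            store=dist.FileStore(self.file_name, self.world_size),
        )

    def destroy_pg(self):
        # Tear down the group so the next test starts from a clean state.
        dist.destroy_process_group()

A device-specific test class then derives from this and calls create_pg at the start of each test.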

The second changed file is test_functional_api.py, a test file in common distributed. This file is a POC for how we can use this new class to write more device-agnostic distributed test cases.

The following changes have been made to test_functional_api.py:

- Functionality has been added to test non-CUDA devices, using Intel HPU as an example
- Multiple setup steps previously required by MultiProcessTestCase have been abstracted out
- Miscellaneous adaptations allow for generic calls to accelerators, adding test skips instead of explicitly skipping when multiple GPUs are required
- skipIfHPU flags have been added to skip a few multithreaded test cases that are not yet supported on HPUs

NOTE: Within test_functional_api there are tests that require multithreading functions not yet supported on HPUs. These have been skipped for HPU using the skipIfHPU decorator.

I will be raising a separate PR to improve the usability of said decorators in a device-agnostic setting, in the manner suggested by @kwen2501 in a comment on this PR.

This PR is a cleaned-up version of a previous PR (#136988), which I closed due to human error. I have addressed some of the comments made by @kwen2501 there as well.

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

pytorch-bot bot commented Oct 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138216

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ You can merge normally! (2 Unrelated Failures)

As of commit 37cf515 with merge base 43edb94:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed label Oct 17, 2024
@AnantGulati AnantGulati marked this pull request as draft October 17, 2024 14:29
@AnantGulati (Contributor Author):

@pytorchbot label "topic: not user facing"

@ankurneog:

@pytorchbot rebase

@pytorchmergebot (Collaborator):

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator):

Successfully rebased AnantGulati_clean_distributed_test_base onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout AnantGulati_clean_distributed_test_base && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the AnantGulati_clean_distributed_test_base branch from 6426826 to 8b2be13 October 22, 2024 03:36
@AnantGulati (Contributor Author):

@pytorchbot rebase

pytorch-bot bot commented Oct 23, 2024

You don't have permissions to rebase this PR since you are a first time contributor. If you think this is a mistake, please contact PyTorch Dev Infra.

@AnantGulati AnantGulati changed the title from "[DRAFT] Generalization of distributed test cases for non-CUDA devices" to "Generalization of distributed test cases for non-CUDA devices" Oct 23, 2024
@AnantGulati AnantGulati marked this pull request as ready for review October 23, 2024 03:36
@colesbury colesbury requested review from jgong5 and kwen2501 October 23, 2024 20:12
@colesbury colesbury added the triaged label Oct 23, 2024
@AnantGulati AnantGulati requested a review from jgong5 October 24, 2024 08:57
@kwen2501 (Collaborator) left a comment:

Thanks much for the generalization effort!
I put some comments in the code.
@yifuwang do you mind having a look at the changes to test_functional_api.py? Thanks!

Comment on lines 387 to 405
Collaborator:

Mind sharing the reason for changing the base test class?

Contributor Author:

This was not required.

I accounted for this change, along with the changes mentioned by @yifuwang in #138216 (comment), in the following commit: 7afa48037bfcdc0f2f8837cebcaefcfbf633de89

Thanks

Comment on lines 447 to 451
@kwen2501 (Collaborator) commented Oct 25, 2024:

For better readability, can we keep the original line (as the default) and do a conditional overwrite?

BACKEND = dist.Backend.NCCL if torch.cuda.is_available() else dist.Backend.GLOO

if TEST_HPU:
    BACKEND = dist.Backend.HCCL
elif ...

Contributor Author:

Done in commit: 7e6c741b9efe19c7d3fc0d6326becec6879c98c9
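
(For reference, the later follow-up series referenced at the bottom of this page replaced this hand-maintained table entirely; a hedged sketch of that direction, assuming the get_default_backend_for_device helper from #140536, with DEVICE_TYPE as a placeholder name:)

# Later direction (#140536): derive the backend from the device type itself.
BACKEND = dist.get_default_backend_for_device(DEVICE_TYPE)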

Comment on lines 462 to 463
Collaborator:

Would this else branch skip "cpu" tests which might be originally running?

Collaborator:

It seems we can just remove the "else" branch to keep it the same as previous?

Contributor Author:

Done in commit: 9383603de38f60ce5943681110ecb806f6736af4

Collaborator:

Why DDP based?

Contributor Author:

You are right, it is meant to be distributed. Fixed in commit: 65e16deb07c5ca9acef323058a99d26e89633acb

Comment on lines 925 to 929
Collaborator:

Maybe we can leave this to child class?

Contributor Author:

The rationale behind adding this here was to allow for cleaner and shorter code in the child class. Since the class is meant to be derived from, the method can be overridden by a child class wherever required.

I would love to hear your thoughts on this though.

Collaborator:

There are two reasons:

  1. Some TestClass may want to init distributed only once for the entire test suite (to save time).
  2. Some TestClass may want to pass different init options to init_process_group.

Contributor Author:

Regarding init options for TestClasses:

The current implementation allows child classes to override the method, providing flexibility for different init options. However, if you think it would be beneficial, we could consider passing init options as parameters to the function, which could make this easier for future users.
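
For example, a child class that needs different init options could simply override the helper; a sketch, reusing the assumed create_pg name from the sketch above (MyBackendTest is hypothetical):

import datetime

import torch.distributed as dist


class MyBackendTest(DistributedTestBase):
    def create_pg(self, device_type):
        # Override the base helper to pass suite-specific init options,
        # e.g. a shorter collective timeout.
        dist.init_process_group(
            backend="gloo",
            world_size=self.world_size,
            rank=self.rank,
            store=dist.FileStore(self.file_name, self.world_size),
            timeout=datetime.timedelta(seconds=60),
        )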

Concerning single initialization of the distributed environment:

I understand the concern about initializing the distributed environment only once for the entire test suite.

To accommodate this, we could explore creating a separate class, modeled after MultiProcContinuousTest, which would allow for a single process group creation for the entire file. Alternatively, we could integrate this functionality within the existing class structure to allow for a more uniform implementation across test files.

As the current changes align with the existing implementation of MultiProcessTestCase while enabling easier device-agnostic implementation, and as they do not interfere with classes that initialize the process group only once, I believe it makes sense to limit the scope of this PR as it is.

I'm more than willing to work on a follow-up PR to better accommodate these functionalities. This will allow other developers to benefit from these changes while I work on adding these functionalities. Your input on this would be valuable.

All in all, having this functionality in the base class enables easier use of these test classes in independent test files without significantly reducing flexibility, for the reasons mentioned above; hence I feel it should remain as it is and not be left to the child class.

I'm open to further discussion on these points and any other concerns you may have.

Comment on lines 948 to 950
Collaborator:

Same, leave this to child class?

Comment on lines 938 to 946
Collaborator:

c10d has changed to support only 1 device per process (rank). I feel this logic can be simplified under that assumption.

Collaborator:

e.g. as simple as rank % num_visible_devices

Contributor Author:

This makes a lot of sense

I have applied the changes in: 67b6adbb78be2c2c6cebdf8005305c4ce1798830

Thanks
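
The simplification amounts to something like the following (a sketch; rank_to_device is the helper name mentioned later in this conversation, and its exact signature is an assumption):

import torch


def rank_to_device(rank, device_type):
    # One device per rank: map ranks round-robin onto the visible devices.
    num_visible = torch.get_device_module(device_type).device_count()
    return f"{device_type}:{rank % num_visible}"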

@kwen2501 kwen2501 requested a review from yifuwang October 25, 2024 01:49
@jgong5 (Collaborator) left a comment:

Do you mind adding some review comments to explain why we want to skip some tests on HPU?

Collaborator:

This test is meant to run with a single process using a fake process group. I don't think it falls in the scope of this PR.

Contributor Author:

You are right; this was changed unnecessarily.

Reverted back to the old test in: 7afa48037bfcdc0f2f8837cebcaefcfbf633de89

Thanks

@kwen2501 (Collaborator) left a comment:

LGTM.
Can you address the two new comments before merge?
(1) Remove set_device.
(2) Remove barrier.

Comment on lines +955 to +927
Collaborator:

Please remove the set_device call before merging. We do want to make sure our library works even without user setting the device beforehand.

Contributor Author:

Removing this causes errors on CUDA, which is why I had added it.

NCCL appears to be assigning multiple distributed processes to the same device. The error I am getting is:
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1a000

The same error is not seen on HPU. My understanding is that this has something to do with how NCCL allocates ranks to devices when no device is set explicitly.

I would appreciate any insight you have on this
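
For context, the workaround under discussion looks roughly like this (a sketch; the helper name is hypothetical):

import torch


def bind_rank_to_device(rank):
    # Without an explicit set_device, NCCL can place two ranks on the same
    # GPU, producing the "Duplicate GPU detected" error quoted above.
    if torch.cuda.is_available():
        torch.cuda.set_device(rank % torch.cuda.device_count())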

Collaborator:

Considering this is a base test class, can you please remove the barrier() call? We want to make sure our tests work as expected even without a barrier at the end. If a specific test needs a barrier, it should be called explicitly in that test.

Collaborator:

And if there is only one line left, maybe there is no need to wrap it in a method anymore?

Contributor Author:

Done in commit: e97d3a38089a266cf7f37d69309e208d9c58ee0b

@ankurneog:

@pytorchbot merge

def wrapper(*args, **kwargs):
    if torch.cuda.is_available() and torch.cuda.device_count() >= x:
        return func(*args, **kwargs)
    if TEST_HPU and torch.hpu.device_count() >= x:

Why didn't we use get_device_module here as well, as is done in rank_to_device?
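
A hedged sketch of that generalization (the decorator name and signature are assumptions, not the code in this PR):

import functools
import unittest

import torch


def skip_if_lt_x_devices(device_type, x):
    # Device-agnostic variant: query the device module for its count
    # instead of special-casing CUDA and HPU.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if torch.get_device_module(device_type).device_count() >= x:
                return func(*args, **kwargs)
            raise unittest.SkipTest(f"requires >= {x} {device_type} devices")
        return wrapper
    return decorator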

@ankurneog:

@pytorchbot merge

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
Pull Request resolved: pytorch#138216
Approved by: https://github.com/kwen2501, https://github.com/guangyey
pytorchmergebot pushed a commit that referenced this pull request Feb 5, 2025
In this series of PRs we intend to refactor distributed test cases to be completely device-agnostic.

These changes include the following approaches:

- Allowing for multiple device types using instantiate_device_type_tests
- Replacing calls to the CUDA stream with torch.get_device_module(device) wherever it applies
- Skipping setup steps required while using MultiProcessTestCase with DistributedTestBase (#138216) wherever applicable
- Replacing explicit calls to distributed backends (NCCL, HCCL, etc.) with get_default_backend_for_device (#140536)

This should result in a significant improvement in usability for all devices.
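
Put together, a device-agnostic test written in this style might look like the following sketch (the DistributedTestBase import path and the create_pg helper are assumptions based on the description above, not verified against the final API):

import torch
import torch.distributed as dist
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_distributed import DistributedTestBase  # assumed path


class TestCollectives(DistributedTestBase):
    def test_all_reduce(self, device):
        self.create_pg(torch.device(device).type)  # assumed helper; picks a backend per device
        t = torch.ones(2, device=device)
        dist.all_reduce(t)
        self.assertEqual(t, torch.full((2,), float(self.world_size), device=device))


instantiate_device_type_tests(TestCollectives, globals(), only_for=("cpu", "cuda", "hpu"))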

Pull Request resolved: #145222
Approved by: https://github.com/kwen2501
pytorchmergebot pushed a commit that referenced this pull request Feb 13, 2025
…#145056)

# MOTIVATION
To generalize distributed test cases for non-CUDA devices, we are leveraging the DistributedTestBase class introduced in [PR #138216](#138216). This new class is derived from MultiProcessTestCase and abstracts the creation/deletion of process groups and other functionality for specific devices. In this PR, we extend the scope of these tests to support HPUs.

# CHANGES

- Replaced MultiProcessTestCase with the DistributedTestBase class.
- Extended test functionality to include support for HPUs.
- Utilized instantiate_device_type_tests with targeted attributes to generate device-specific test instances.
- Applied the skipIfHPU decorator to skip tests that are not yet compatible with HPU devices.

Pull Request resolved: #145056
Approved by: https://github.com/kwen2501, https://github.com/guangyey

Labels

- ciflow/trunk (Trigger trunk jobs on your pull request)
- Merged
- oncall: distributed (Add this issue/PR to distributed oncall triage queue)
- open source
- topic: not user facing (topic category)
- triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
