Conversation

@AnantGulati (Contributor) commented Oct 17, 2024

Motivation

This PR is an extension of #131758. As described there, these changes aim to make distributed UTs more accessible to users of all device types.

It demonstrates a few of the changes discussed by @kwen2501 and @jgong5 in the discussion on #131758 (#131758 (comment)).

This PR contains two types of changes. The first is to the common distributed folder, where we have added a new class derived from MultiProcessTestCase that helps abstract out process group creation/deletion and other functionality for a given device.

New generalized content can be added by deriving from this base class. The PR also includes other miscellaneous changes for Gaudi support.
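
To make the pattern concrete, here is a minimal sketch (not the actual code in this PR; the class name DistributedTestBase comes from the follow-up work referenced at the bottom of this page, while the helper names and the backend table are assumptions):

import torch.distributed as dist
from torch.testing._internal.common_distributed import MultiProcessTestCase


class DistributedTestBase(MultiProcessTestCase):
    def setUp(self):
        super().setUp()
        self._spawn_processes()  # one process per rank, as in MultiProcessTestCase

    def create_pg(self, device_type):
        # Pick a backend suited to the device and create the process group,
        # so individual tests no longer repeat this boilerplate.
        backend = {"cuda": "nccl", "hpu": "hccl"}.get(device_type, "gloo")
        dist.init_process_group(
            backend=backend,
            world_size=self.world_size,
            rank=self.rank,
            store=dist.FileStore(self.file_name, self.world_size),
        )

    def destroy_pg(self):
        # Tear down the group so the next test starts from a clean state.
        dist.destroy_process_group()

A device-specific test class then derives from this and calls create_pg at the start of each test.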

The second changed file is test_functional_api.py, a test file in common distributed. This file is a POC for how we can use this new class to write more device-agnostic distributed test cases.

The following changes have been made to test_functional_api.py:

- Functionality has been added to test non-CUDA devices, using Intel HPU as an example
- Multiple setup steps previously required by MultiProcessTestCase have been abstracted out
- Miscellaneous adaptations allow for generic calls to accelerators, adding test skips instead of explicitly skipping when multiple GPUs are required
- skipIfHPU flags have been added to skip a few multithreaded test cases that are not yet supported on HPUs

NOTE: Within test_functional_api there are tests that require multithreading functions not yet supported on HPUs. These have been skipped for HPU using the skipIfHPU decorator.

I will be raising a separate PR to improve the usability of said decorators in a device-agnostic setting, in the manner suggested by @kwen2501 in a comment on this PR.

This PR is a cleaned-up version of a previous PR (#136988), which I closed due to human error. I have addressed some of the comments made by @kwen2501 there as well.

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

pytorch-bot bot commented Oct 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138216

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ You can merge normally! (2 Unrelated Failures)

As of commit 37cf515 with merge base 43edb94:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed label Oct 17, 2024
@AnantGulati AnantGulati marked this pull request as draft October 17, 2024 14:29
@AnantGulati (Contributor Author):

@pytorchbot label "topic: not user facing"

@ankurneog:

@pytorchbot rebase

@pytorchmergebot (Collaborator):

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator):

Successfully rebased AnantGulati_clean_distributed_test_base onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout AnantGulati_clean_distributed_test_base && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the AnantGulati_clean_distributed_test_base branch from 6426826 to 8b2be13 October 22, 2024 03:36
@AnantGulati (Contributor Author):

@pytorchbot rebase

pytorch-bot bot commented Oct 23, 2024

You don't have permissions to rebase this PR since you are a first time contributor. If you think this is a mistake, please contact PyTorch Dev Infra.

@AnantGulati AnantGulati changed the title from "[DRAFT] Generalization of distributed test cases for non-CUDA devices" to "Generalization of distributed test cases for non-CUDA devices" Oct 23, 2024
@AnantGulati AnantGulati marked this pull request as ready for review October 23, 2024 03:36
@colesbury colesbury requested review from jgong5 and kwen2501 October 23, 2024 20:12
@colesbury colesbury added the triaged label Oct 23, 2024
@AnantGulati AnantGulati requested a review from jgong5 October 24, 2024 08:57
@kwen2501 (Collaborator) left a comment:

Thanks much for the generalization effort!
I put some comments in the code.
@yifuwang do you mind having a look at the changes to test_functional_api.py? Thanks!

Comment on lines 387 to 405
Collaborator:

Mind sharing the reason for changing the base test class?

Contributor Author:

This was not required.

I accounted for this change, along with the changes mentioned by @yifuwang in #138216 (comment), in the following commit: 7afa48037bfcdc0f2f8837cebcaefcfbf633de89

Thanks

Comment on lines 447 to 451
@kwen2501 (Collaborator) commented Oct 25, 2024:

For better readability, can we keep the original line (as the default) and do a conditional overwrite?

BACKEND = dist.Backend.NCCL if torch.cuda.is_available() else dist.Backend.GLOO

if TEST_HPU:
    BACKEND = dist.Backend.HCCL
elif ...

Contributor Author:

Done in commit: 7e6c741b9efe19c7d3fc0d6326becec6879c98c9
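
(For reference, the later follow-up series referenced at the bottom of this page replaced this hand-maintained table entirely; a hedged sketch of that direction, assuming the get_default_backend_for_device helper from #140536, with DEVICE_TYPE as a placeholder name:)

# Later direction (#140536): derive the backend from the device type itself.
BACKEND = dist.get_default_backend_for_device(DEVICE_TYPE)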

Comment on lines 462 to 463
Collaborator:

Would this else branch skip "cpu" tests which might be originally running?

Collaborator:

It seems we can just remove the "else" branch to keep it the same as previous?

Contributor Author:

Done in commit: 9383603de38f60ce5943681110ecb806f6736af4

Collaborator:

Why DDP based?

Contributor Author:

You are right, it is meant to be distributed. Fixed in commit: 65e16deb07c5ca9acef323058a99d26e89633acb

Comment on lines 925 to 929
Collaborator:

Maybe we can leave this to child class?

Contributor Author:

The rationale behind adding this here was to allow for cleaner and shorter code in the child class. Since the class is meant to be derived from, the method can be overridden by a child class wherever required.

I would love to hear your thoughts on this though.

Collaborator:

There are two reasons:

  1. Some TestClass may want to init distributed only once for the entire test suite (to save time).
  2. Some TestClass may want to pass different init options to init_process_group.

Contributor Author:

Regarding init options for TestClasses:

The current implementation allows child classes to override the method, providing flexibility for different init options. However, if you think it would be beneficial, we could consider passing init options as parameters to the function, which could make this easier for future users.
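
For example, a child class that needs different init options could simply override the helper; a sketch, reusing the assumed create_pg name from the sketch above (MyBackendTest is hypothetical):

import datetime

import torch.distributed as dist


class MyBackendTest(DistributedTestBase):
    def create_pg(self, device_type):
        # Override the base helper to pass suite-specific init options,
        # e.g. a shorter collective timeout.
        dist.init_process_group(
            backend="gloo",
            world_size=self.world_size,
            rank=self.rank,
            store=dist.FileStore(self.file_name, self.world_size),
            timeout=datetime.timedelta(seconds=60),
        )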

Concerning single initialization of the distributed environment:

I understand the concern about initializing the distributed environment only once for the entire test suite.

To accommodate this, we could explore creating a separate class, modeled after MultiProcContinuousTest, which would allow for a single process group creation for the entire file. Alternatively, we could integrate this functionality within the existing class structure to allow for a more uniform implementation across test files.

As the current changes align with the existing implementation of MultiProcessTestCase while enabling easier device-agnostic implementation, and as they do not interfere with classes that initialize the process group only once, I believe it makes sense to limit the scope of this PR as it is.

I'm more than willing to work on a follow-up PR to better accommodate these functionalities. This will allow other developers to benefit from these changes while I work on adding these functionalities. Your input on this would be valuable.

All in all, having this functionality in the base class enables easier use of these test classes in independent test files without significantly reducing flexibility, for the reasons mentioned above; hence I feel it should remain as it is and not be left to the child class.

I'm open to further discussion on these points and any other concerns you may have.

Comment on lines 948 to 950
Collaborator:

Same, leave this to child class?

Comment on lines 938 to 946
Collaborator:

c10d has changed to support only 1 device per process (rank). I feel this logic can be simplified under that assumption.

Collaborator:

e.g. as simple as rank % num_visible_devices

Contributor Author:

This makes a lot of sense

I have applied the changes in: 67b6adbb78be2c2c6cebdf8005305c4ce1798830

Thanks
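
The simplification amounts to something like the following (a sketch; rank_to_device is the helper name mentioned later in this conversation, and its exact signature is an assumption):

import torch


def rank_to_device(rank, device_type):
    # One device per rank: map ranks round-robin onto the visible devices.
    num_visible = torch.get_device_module(device_type).device_count()
    return f"{device_type}:{rank % num_visible}"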

@kwen2501 kwen2501 requested a review from yifuwang October 25, 2024 01:49
@jgong5 (Collaborator) left a comment:

Do you mind adding some review comments to explain why we want to skip some tests on HPU?

Collaborator:

This test is meant to run with a single process using a fake process group. I don't think it falls in the scope of this PR.

Contributor Author:

You are right; this was changed unnecessarily.

Reverted back to the old test in: 7afa48037bfcdc0f2f8837cebcaefcfbf633de89

Thanks

@kwen2501 (Collaborator) left a comment:

LGTM.
Can you address the two new comments before merge?
(1) Remove set_device.
(2) Remove barrier.

Comment on lines +955 to +927
Collaborator:

Please remove the set_device call before merging. We do want to make sure our library works even without user setting the device beforehand.

Contributor Author:

Removing this causes errors on CUDA, which is why I had added it.

NCCL appears to be assigning multiple distributed processes to the same device. The error I am getting is:
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1a000

The same error is not seen on HPU. My understanding is that this has something to do with how NCCL allocates ranks to devices when no device is set explicitly.

I would appreciate any insight you have on this
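
For context, the workaround under discussion looks roughly like this (a sketch; the helper name is hypothetical):

import torch


def bind_rank_to_device(rank):
    # Without an explicit set_device, NCCL can place two ranks on the same
    # GPU, producing the "Duplicate GPU detected" error quoted above.
    if torch.cuda.is_available():
        torch.cuda.set_device(rank % torch.cuda.device_count())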

Collaborator:

Considering this is a base test class, can you please remove the barrier() call? We want to make sure our tests work as expected even without a barrier at the end. If a specific test needs a barrier, it should be called explicitly in that test.

Collaborator:

And if there is only one line left, maybe there is no need to wrap it in a method anymore?

Contributor Author:

Done in commit: e97d3a38089a266cf7f37d69309e208d9c58ee0b

@ankurneog:

@pytorchbot merge

def wrapper(*args, **kwargs):
    if torch.cuda.is_available() and torch.cuda.device_count() >= x:
        return func(*args, **kwargs)
    if TEST_HPU and torch.hpu.device_count() >= x:

Why didn't we use get_device_module here as well, as is done in rank_to_device?
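
A hedged sketch of that generalization (the decorator name and signature are assumptions, not the code in this PR):

import functools
import unittest

import torch


def skip_if_lt_x_devices(device_type, x):
    # Device-agnostic variant: query the device module for its count
    # instead of special-casing CUDA and HPU.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if torch.get_device_module(device_type).device_count() >= x:
                return func(*args, **kwargs)
            raise unittest.SkipTest(f"requires >= {x} {device_type} devices")
        return wrapper
    return decorator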

@ankurneog:

@pytorchbot merge

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
Pull Request resolved: pytorch#138216
Approved by: https://github.com/kwen2501, https://github.com/guangyey
pytorchmergebot pushed a commit that referenced this pull request Feb 5, 2025
In this series of PRs we intend to refactor distributed test cases to be completely device-agnostic.

These changes include the following approaches:

- Allowing for multiple device types using instantiate_device_type_tests
- Replacing calls to the CUDA stream with torch.get_device_module(device) wherever it applies
- Skipping setup steps required while using MultiProcessTestCase with DistributedTestBase (#138216) wherever applicable
- Replacing explicit calls to distributed backends (NCCL, HCCL, etc.) with get_default_backend_for_device (#140536)

This should result in a significant improvement in usability for all devices.
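
Put together, a device-agnostic test written in this style might look like the following sketch (the DistributedTestBase import path and the create_pg helper are assumptions based on the description above, not verified against the final API):

import torch
import torch.distributed as dist
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_distributed import DistributedTestBase  # assumed path


class TestCollectives(DistributedTestBase):
    def test_all_reduce(self, device):
        self.create_pg(torch.device(device).type)  # assumed helper; picks a backend per device
        t = torch.ones(2, device=device)
        dist.all_reduce(t)
        self.assertEqual(t, torch.full((2,), float(self.world_size), device=device))


instantiate_device_type_tests(TestCollectives, globals(), only_for=("cpu", "cuda", "hpu"))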

Pull Request resolved: #145222
Approved by: https://github.com/kwen2501
pytorchmergebot pushed a commit that referenced this pull request Feb 13, 2025
…#145056)

# MOTIVATION
To generalize distributed test cases for non-CUDA devices, we are leveraging the DistributedTestBase class introduced in [PR #138216](#138216). This new class is derived from MultiProcessTestCase and abstracts the creation/deletion of process groups and other functionality for specific devices. In this PR, we extend the scope of these tests to support HPUs.

# CHANGES

- Replaced MultiProcessTestCase with the DistributedTestBase class.
- Extended test functionality to include support for HPUs.
- Utilized instantiate_device_type_tests with targeted attributes to generate device-specific test instances.
- Applied the skipIfHPU decorator to skip tests that are not yet compatible with HPU devices.

Pull Request resolved: #145056
Approved by: https://github.com/kwen2501, https://github.com/guangyey

Labels

- ciflow/trunk (Trigger trunk jobs on your pull request)
- Merged
- oncall: distributed (Add this issue/PR to distributed oncall triage queue)
- open source
- topic: not user facing (topic category)
- triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
