Generalization of distributed UT content to enable non cuda device execution #131758

ankurneog · 2024-07-25T11:16:41Z

Motivation

The distributed UT content is primarily targeted for cuda devices , however other devices like the intel Gaudi do support much of the functionality validated.
Since the code has explicit cuda api calls , it becomes hard to adapt these for non cuda devices.

Here we introduce a new class derived from MultiProcessTestCase, which helps abstracts out the process group creation /deletion and other functionality for a given device.

The tests can be instantiated per device using existing utilities such as instantiate_device_type_tests , which will pass the device as argument and accordingly create the PG with the right backend

The new generalized content can be added by deriving from this base class.
Also includes other misc changes for gaudi support

Note that this is a follow-up of functionality introduced with PR : #126970

Example Use : Here is a snippet from existing content (distributed/tests/test_c10d_nccl.py) to illustrate the use

from torch.testing._internal.common_device_type import instantiate_device_type_tests
class ProcessGroupNCCLTest(DistributedTestBase):
 
    def test_empty_tensors(self,device):
        #store = c10d.FileStore(self.file_name, self.world_size)
        pg = self.create_pg(device)
        local_device_idx = self.rank_to_device(device)[self.rank][0]
 
        xs = [torch.FloatTensor([]).to(device)]
        pg.broadcast(xs).wait()
        self.assertEqual(0, xs[0].numel())
 
        pg.allreduce(xs).wait()
        self.assertEqual(0, xs[0].numel())
 
        pg.reduce(xs).wait()
        self.assertEqual(0, xs[0].numel())
 
        ys = [[torch.FloatTensor([]).to(device) for _ in range(self.world_size)]]
        pg.allgather(ys, xs).wait()
        for y in ys[0]:
            self.assertEqual(0, y.numel())
 
        ys = [torch.FloatTensor([]).to(device)]
        xs = [[torch.FloatTensor([]).to(device) for _ in range(self.world_size)]]
        pg.reduce_scatter(ys, xs).wait()
        self.assertEqual(0, ys[0].numel())
 
 
devices = ["cuda"]
if TEST_HPU:
    devices.append("hpu")
instantiate_device_type_tests(ProcessGroupNCCLTest,globals(),only_for=devices)

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames

pytorch-bot · 2024-07-25T11:16:44Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/131758

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit 5fb4048 with merge base beb46de ():

NEW FAILURES - The following jobs have failed:

Check Labels / Check labels (gh)
# This PR needs a release notes: label
Lint / lintrunner-clang / linux-job (gh)
RuntimeError: Command docker exec -t cdebd9ecdb8208a0b1f9dbdf0e1252d52a18aaee856e5465c0b9c1a4af2f21eb /exec failed with exit code 1
Lint / lintrunner-noclang / linux-job (gh)
RuntimeError: Command docker exec -t cfb4140fe9e4d310377aecb4aafcc072fa2345252403ee18ef96ce5ecd9e633b /exec failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ankurneog · 2024-07-25T11:50:31Z

@pytorchbot rebase

pytorchmergebot · 2024-07-25T11:51:59Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2024-07-25T11:52:02Z

Successfully rebased ankurneog_hpu_distributed_common_1 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout ankurneog_hpu_distributed_common_1 && git pull --rebase)

ankurneog · 2024-07-25T11:54:05Z

@albanD : could you please help with the review. Thanks

torch/testing/_internal/distributed/_tensor/common_dtensor.py

torch/testing/_internal/common_distributed.py

torch/testing/_internal/distributed/_tensor/common_dtensor.py

torch/testing/_internal/common_distributed.py

wconstab · 2024-07-30T16:22:03Z

torch/testing/_internal/common_distributed.py

i wonder if we really need a new class, or we should just update the behavior of MultiProcessTestCase in a safe way?

It seems like all of these changes are designed to be 'safe' but the current PR doesn't test this. Modifying the existing class would force running it on all the existing tests and make sure.

@wconstab : thanks for your comment. Yes, I was also thinking on similar lines initially, however that would need extensive changes to all the already existing files. The idea is to incrementally introduce change to the existing files with the new class and eventually merge the MultiProcessTestCase with the new class.

i think its fine to make all the extensive changes at once, given they are exercised by CI. its actually riskier to land a new thing that isn't well tested in case it doesn't really work as advertised or in case we only half-migrate to it.

also cc @fegin who has been thinking more about test regfactors

i think its fine to make all the extensive changes at once, given they are exercised by CI. its actually riskier to land a new thing that isn't well tested in case it doesn't really work as advertised or in case we only half-migrate to it.

also cc @fegin who has been thinking more about test base refactors. We should probably decide if we want to adopt DistributedTestBase more widely or build more on top of MPTC.

Thanks @wconstab , will check how much is the impact of adapting the change in all files.

But in general there are lot of different ways this is getting implemented

For eg: class MultiProcContinousTest introduced with #125648 implements a mechanism that do not use MPTC but devices related nuances are passed to the individual test file for implementation

Whereas class DTensorTestBase in torch/testing/_internal/distributed/_tensor/common_dtensor.py , inherits MPTC but tries put the devices related nuances in the common code.

and there are a few more

hence, the suggestion of staggered migration to a common base class that takes into account all the challenges module by module. Since Gaudi/XPU wants to use distributed test content for validation, we can take up this migration effort.

Perhaps gradually refactoring things out of MPTC like what this PR proposes is better considering there are multiple features coupled in the MPTC base class right now?

I guess our consideration is the number of TestCases to maintain.
What tests are you interested in today? test_c10d_... ? If so, it seems we can integrate the changes into MPTC first. Then gradually to other TCs.

wconstab · 2024-07-30T16:22:46Z

torch/testing/_internal/common_distributed.py

The device-backend logic should probably live in distributed_c10d.py somewhere. I'm not sure the best way to do it, but it would seem reasonable to have a helper that is usable not only in tests

Sure, I think in general we have lot of scope of harmonizing the APIs for different devices in torch.distributed , can I add that as part of a follow up PR?

wconstab · 2024-07-30T16:25:56Z

torch/testing/_internal/common_distributed.py

I was thinking i've seen this code snippet before. i found at least one case in profiler-

if self.use_device and hasattr(torch, self.use_device): device_module = getattr(torch, self.use_device)

I think it'd be better to provide a utility for this perhaps in torch.device module. Something like this

def get_module(device: Union[str, torch.device]): -> Module

Maybe something you could add in a follow up PR?

@wconstab I think it already exists in utils

pytorch/torch/_utils.py

Line 911 in e6cddc9

def _get_device_module(device_type: str):

@wconstab : there are multiple places where the handle is retrieved from device ,
eg: torch/distributed/fsdp/_common_utils.py
_get_device_handle

Agree that _get_device_module seems to be a more general way than calling:
if TEST_ABC and torch.abc.device_count() > 0 everywhere.

ankurneog · 2024-08-05T10:02:42Z

@wconstab : Could you please let me know if you are ok with my responses?

ankurneog · 2024-09-17T02:53:56Z

@wconstab : could you please provide your response. thank you.

kwen2501

I like the idea of using device_handle() or _get_device_module() to make our tests device agnostic. I recommend picking one TestCase of most interest to refactor first.

kwen2501 · 2024-09-17T06:05:09Z

torch/testing/_internal/common_distributed.py

Can we use the device_handle way (as later in this PR) for every device type?
We can have a global device_handle that gets set based on TEST_HPU, TEST_CUDA, etc.

kwen2501 · 2024-09-17T06:05:30Z

torch/testing/_internal/common_distributed.py

kwen2501 · 2024-09-17T06:05:46Z

torch/testing/_internal/common_distributed.py

kwen2501 · 2024-09-17T06:07:28Z

torch/testing/_internal/common_distributed.py

It seems this mapping is available through the backend_feature map?

kwen2501 · 2024-09-17T06:11:01Z

torch/testing/_internal/common_distributed.py

Agree that _get_device_module seems to be a more general way than calling:
if TEST_ABC and torch.abc.device_count() > 0 everywhere.

kwen2501 · 2024-09-17T06:18:50Z

torch/testing/_internal/common_distributed.py

I guess our consideration is the number of TestCases to maintain.
What tests are you interested in today? test_c10d_... ? If so, it seems we can integrate the changes into MPTC first. Then gradually to other TCs.

ankurneog · 2024-09-27T03:00:23Z

@pytorchbot rebase

pytorchmergebot · 2024-09-27T03:01:59Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2024-09-27T03:02:05Z

Successfully rebased ankurneog_hpu_distributed_common_1 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout ankurneog_hpu_distributed_common_1 && git pull --rebase)

github-actions · 2024-09-27T03:02:53Z

This PR needs a `release notes:` label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

ankurneog · 2024-10-24T03:11:17Z

@wconstab , @kwen2501 : all comments are now addressed via this PR : #138216, Could you please have a look and help with the approval. I will close this PR. Thank you.

@kwen2501

# Motivation This pr is an extension of #131758. As described in #131758, these changes are looking to make distributed UTs more accessible to users of all device types. It is a demonstration of a few changes discussed by @kwen2501 and @jgong5 in the discussion for #131758(#131758 (comment)) This PR contains two types of changes, the first is to the common distributed folder where we have added a new class derived from MultiProcessTestCase which helps abstracts out the process group creation /deletion and other functionality for a given device. The new generalized content can be added by deriving from this base class. Also includes other misc changes for gaudi support The second changed file is test_functional_api. a test file in common distributed. This file is a POC for how we can use this new class to write more device agnostic distributed test cases. The following changes have been made to test_functional_api.py: -Functionality has been added to test for non cuda devices using intel HPU as an example -Multiple set up steps previously required by MultiProcessTestCase have been abstracted out -Misc adaptations to allow for general call to accelerators while adding test skips instead explicitly skipping for multiple GPUs -Skipifhpu flags have been added to enable skipping a few Multithreaded test cases which are as yet not supported on HPUs NOTE: Within test functional api, there are tests which require the use of some multithreading functions which are as yet not supported on HPUs. These have been skipped for hpu using skipHPU decorator. I will be raising a separate PR to improve usability pf said decorators in a device agnostic setting in the manner suggested by @kwen2501 in a comment on this PR. This pr is a cleaned up version of a previous PR(#136988) which I closed due to human error. I have addressed some of the comments made by @kwen2501 in this as well Pull Request resolved: #138216 Approved by: https://github.com/kwen2501, https://github.com/guangyey

@kwen2501

…h#138216) # Motivation This pr is an extension of pytorch#131758. As described in pytorch#131758, these changes are looking to make distributed UTs more accessible to users of all device types. It is a demonstration of a few changes discussed by @kwen2501 and @jgong5 in the discussion for pytorch#131758(pytorch#131758 (comment)) This PR contains two types of changes, the first is to the common distributed folder where we have added a new class derived from MultiProcessTestCase which helps abstracts out the process group creation /deletion and other functionality for a given device. The new generalized content can be added by deriving from this base class. Also includes other misc changes for gaudi support The second changed file is test_functional_api. a test file in common distributed. This file is a POC for how we can use this new class to write more device agnostic distributed test cases. The following changes have been made to test_functional_api.py: -Functionality has been added to test for non cuda devices using intel HPU as an example -Multiple set up steps previously required by MultiProcessTestCase have been abstracted out -Misc adaptations to allow for general call to accelerators while adding test skips instead explicitly skipping for multiple GPUs -Skipifhpu flags have been added to enable skipping a few Multithreaded test cases which are as yet not supported on HPUs NOTE: Within test functional api, there are tests which require the use of some multithreading functions which are as yet not supported on HPUs. These have been skipped for hpu using skipHPU decorator. I will be raising a separate PR to improve usability pf said decorators in a device agnostic setting in the manner suggested by @kwen2501 in a comment on this PR. This pr is a cleaned up version of a previous PR(pytorch#136988) which I closed due to human error. I have addressed some of the comments made by @kwen2501 in this as well Pull Request resolved: pytorch#138216 Approved by: https://github.com/kwen2501, https://github.com/guangyey

pytorch-bot bot added module: dynamo oncall: distributed Add this issue/PR to distributed oncall triage queue labels Jul 25, 2024

pytorchbot added the open source label Jul 25, 2024

ankurneog force-pushed the ankurneog_hpu_distributed_common_1 branch from 559aadf to 45baafa Compare July 25, 2024 11:49

pytorchmergebot force-pushed the ankurneog_hpu_distributed_common_1 branch from 45baafa to 0fcc085 Compare July 25, 2024 11:52

ankurneog marked this pull request as ready for review July 25, 2024 11:53

bsochack reviewed Jul 25, 2024

View reviewed changes

torch/testing/_internal/distributed/_tensor/common_dtensor.py Outdated Show resolved Hide resolved

torch/testing/_internal/common_distributed.py Outdated Show resolved Hide resolved

bsochack reviewed Jul 25, 2024

View reviewed changes

torch/testing/_internal/common_distributed.py Outdated Show resolved Hide resolved

torch/testing/_internal/common_distributed.py Outdated Show resolved Hide resolved

torch/testing/_internal/distributed/_tensor/common_dtensor.py Outdated Show resolved Hide resolved

ankurneog force-pushed the ankurneog_hpu_distributed_common_1 branch from 0fcc085 to fee499d Compare July 25, 2024 12:08

bsochack reviewed Jul 25, 2024

View reviewed changes

torch/testing/_internal/common_distributed.py Outdated Show resolved Hide resolved

ankurneog force-pushed the ankurneog_hpu_distributed_common_1 branch 3 times, most recently from 7b2e1cd to c6401b0 Compare July 26, 2024 08:17

wconstab self-requested a review July 29, 2024 20:46

jbschlosser added module: testing Issues related to the torch.testing module (not tests) triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module labels Jul 30, 2024

wconstab reviewed Jul 30, 2024

View reviewed changes

ankurneog force-pushed the ankurneog_hpu_distributed_common_1 branch from c6401b0 to 677086a Compare August 5, 2024 09:59

VRSinghHabana mentioned this pull request Aug 6, 2024

Added dist utility API to get backend from a device object #132735

Closed

ankurneog force-pushed the ankurneog_hpu_distributed_common_1 branch from 677086a to d93f5d7 Compare August 20, 2024 09:43

kwen2501 reviewed Sep 17, 2024

View reviewed changes

Generalization of distributed UT content for Intel Gaudi -1

5fb4048

pytorchmergebot force-pushed the ankurneog_hpu_distributed_common_1 branch from d93f5d7 to 5fb4048 Compare September 27, 2024 03:02

AnantGulati mentioned this pull request Sep 30, 2024

Generalization of distributed test cases for non CUDA devices #136988

Closed

AnantGulati mentioned this pull request Oct 17, 2024

Generalization of distributed test cases for non-CUDA devices #138216

Closed

ankurneog closed this Oct 25, 2024

ankurneog deleted the ankurneog_hpu_distributed_common_1 branch October 25, 2024 04:13

ankurneog mentioned this pull request Mar 5, 2025

[RFC] Generalize pytorch content for non-native device execution pytorch/rfcs#66

Open

harikodali mentioned this pull request Jul 23, 2025

[RFC] Refactor test_c10d_nccl.py to Support Backend-Agnostic Testing for PyTorch #158911

Open

Generalization of distributed UT content to enable non cuda device execution #131758

Generalization of distributed UT content to enable non cuda device execution #131758

Uh oh!

Conversation

ankurneog commented Jul 25, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Example Use : Here is a snippet from existing content (distributed/tests/test_c10d_nccl.py) to illustrate the use

Uh oh!

pytorch-bot bot commented Jul 25, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/131758

❌ 3 New Failures

Uh oh!

ankurneog commented Jul 25, 2024

Uh oh!

pytorchmergebot commented Jul 25, 2024

Uh oh!

pytorchmergebot commented Jul 25, 2024

Uh oh!

ankurneog commented Jul 25, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ankurneog commented Aug 5, 2024

Uh oh!

ankurneog commented Sep 17, 2024

Uh oh!

kwen2501 left a comment

Choose a reason for hiding this comment

Uh oh!

kwen2501 Sep 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ankurneog commented Sep 27, 2024

Uh oh!

pytorchmergebot commented Sep 27, 2024

Uh oh!

ankurneog commented Jul 25, 2024 •

edited

Loading

pytorch-bot bot commented Jul 25, 2024 •

edited

Loading

kwen2501 Sep 17, 2024 •

edited

Loading

This PR needs a `release notes:` label