Conversation

@rahulsingh-intel
Copy link
Contributor

@rahulsingh-intel rahulsingh-intel commented Nov 5, 2024

Motivation: Generalize the unit tests so that they can be executed on both CUDA and non-CUDA devices.
Changes: General changes in the common_dtensor module for device-type generalization so that the tests can also be executed on non-CUDA devices.

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot
Copy link

pytorch-bot bot commented Nov 5, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139749

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit bbd0afc with merge base 73278e6:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Nov 5, 2024
@rahulsingh-intel
Copy link
Contributor Author

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Nov 5, 2024
@rahulsingh-intel
Copy link
Contributor Author

@pytorchbot label "topic: not user facing"

Hi @kwen2501, @fegin, please review the changes.

@rahulsingh-intel
Copy link
Contributor Author

@pytorch-bot
Copy link

pytorch-bot bot commented Nov 7, 2024

You don't have permissions to rebase this PR since you are a first time contributor. If you think this is a mistake, please contact PyTorch Dev Infra.

@ankurneog
Copy link

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased dtensor_common onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout dtensor_common && git pull --rebase)

@rahulsingh-intel
Copy link
Contributor Author

Hi @kwen2501, can you please review and approve the changes?

@colesbury colesbury requested a review from wconstab November 8, 2024 17:04
@colesbury colesbury added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Nov 8, 2024
Copy link
Collaborator

@kwen2501 kwen2501 left a comment

Thanks for the effort. Overall this looks good to me. I left a couple of comments. I think the goal is to use device_type as much as possible and reduce the dependency on "cuda" or "hpu" strings. You did that well in most places; just a few remaining spots need to be covered. Thanks!
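
Concretely, that pattern could look something like the following minimal sketch (the class and method names are illustrative, not the exact ones in common_dtensor, and it assumes the torch.accelerator API available in recent PyTorch):

    import torch

    class DeviceAgnosticTestBase:  # hypothetical mixin, for illustration only
        @property
        def device_type(self) -> str:
            # Prefer whichever accelerator is present; fall back to CPU.
            if torch.accelerator.is_available():
                return torch.accelerator.current_accelerator().type
            return "cpu"

        def make_input(self) -> torch.Tensor:
            # No "cuda"/"hpu" literals at the call site.
            return torch.randn(4, 4, device=self.device_type)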

Copy link
Collaborator

For my education, why is there a numerical difference?

Copy link
Contributor Author

Corrected.

Comment on lines 984 to 990
Copy link
Collaborator

Can you please restore the original format?

Copy link
Collaborator

Is this device_type definition used somewhere given that there is a

    @property
    def device_type(self) -> str:

below?
Nit: can you restore the original two-line formatting?

Copy link
Collaborator

Can you make device_id a property of the class?

@property
def device_id(self) -> torch.device:
    device_count = torch.get_device_module(self.device_type).device_count()
    return torch.device(self.device_type, self.rank % device_count)

Thanks!
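
If that property were adopted, a call site could then look roughly like this (a sketch only; build_mesh is a made-up method on the hypothetical test class, and the mesh shape is illustrative):

    import torch
    from torch.distributed.device_mesh import init_device_mesh

    def build_mesh(self):
        # Pin this rank to its local device and build a 1-D mesh without any
        # cuda-specific calls; device_type and device_id come from the class.
        torch.get_device_module(self.device_type).set_device(self.device_id)
        return init_device_mesh(self.device_type, (self.world_size,))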

Comment on lines 351 to 352
Copy link
Collaborator

Same

Copy link
Collaborator

I wonder if we could truly generalize this line rather than adding another device string here?

Comment on lines 40 to 48
Copy link
Collaborator

I wonder if we could first define DEVICE_TYPE and then DEVICE_COUNT, so that we can call _get_device_module(DEVICE_TYPE) and then device_count()?
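
For illustration, the suggested ordering might look like the sketch below, using the public torch.get_device_module and torch.accelerator helpers as stand-ins for the internal ones:

    import torch

    # Resolve the device type first, then take the count from the matching
    # device module instead of hard-coding torch.cuda.device_count().
    DEVICE_TYPE = (
        torch.accelerator.current_accelerator().type
        if torch.accelerator.is_available()
        else "cpu"
    )
    DEVICE_COUNT = (
        torch.get_device_module(DEVICE_TYPE).device_count()
        if DEVICE_TYPE != "cpu"
        else 1
    )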

Comment on lines 49 to 51
Copy link
Collaborator

I wonder if we could build a map instead? Thanks!
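
For example, a small device-type-to-backend map in place of an if/elif chain; the keys and values below are a guess at what the harness needs, based on the backends mentioned later in this thread:

    # Illustrative map; out-of-tree devices would extend it in one place.
    DEVICE_TO_PG_BACKEND = {
        "cpu": "gloo",
        "cuda": "nccl",
        "hpu": "hccl",
        "xpu": "xccl",
    }

    DEVICE_TYPE = "cuda"  # e.g. resolved as in the earlier sketch
    PG_BACKEND = DEVICE_TO_PG_BACKEND.get(DEVICE_TYPE, "gloo")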

Comment on lines 304 to 318
Copy link
Collaborator

How about reusing the PG_BACKEND map above?

Copy link
Contributor

@zhangxiaoli73 zhangxiaoli73 Nov 14, 2024

If we still keep those if...else statements, then a mapping list has to be maintained and changed for every newly added backend, even if it is out-of-tree.

Then, is it possible to use an undefined backend, which will find the corresponding device backend by device type during process group init? E.g. nccl for cuda, gloo for cpu, hccl for hpu, and xccl for xpu if hccl and xccl are registered out-of-tree.

In code logic:

  1. Init the process group with an undefined backend.
    https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L1569

  2. Use the backend name to query a backend_config, which has a device_backend_map.
    https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L1816

An undefined backend name will use the default device backend map, which is gloo for cpu and nccl for cuda. Out-of-tree backends will also be added to this default map via https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L254

  3. Register the available backends from backend_config.get_device_backend_map() with the process group.
    https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L1818
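
A rough sketch of that flow (single-process setup; the address and port values are placeholders):

    import os
    import torch.distributed as dist

    # Initialize without naming a backend: c10d then consults its
    # device -> backend map (gloo for cpu, nccl for cuda, plus whatever an
    # out-of-tree device registered via dist.Backend.register_backend).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    dist.init_process_group(rank=0, world_size=1)  # note: no backend= argument
    # ... run the DTensor / collective tests against the default group ...
    dist.destroy_process_group()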

Copy link
Contributor Author

How about reusing the PG_BACKEND map above?

Hi @kwen2501, modified. Can you please review?

This can be a straight one-to-one mapping for the default backends: #140536

Copy link
Contributor Author

Hi @kwen2501, please review.

Copy link
Contributor Author

Hi @kwen2501, CI ran fine; please approve after review.

@ankurneog
Copy link

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased dtensor_common onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout dtensor_common && git pull --rebase)

@rahulsingh-intel
Copy link
Contributor Author

Hi @kwen2501, please review.

@ankurneog
Copy link

@kwen2501: Could you please help with the approval? It would be good if we could get this in before the code freeze for v2.6.0. Thank you.

@rahulsingh-intel
Copy link
Contributor Author

@kwen2501 Gentle reminder!

@ankurneog
Copy link

@kwen2501: Gentle reminder, could you please help with the approval? Thank you.

Copy link
Collaborator

@kwen2501 kwen2501 left a comment

LGTM. Sorry about the delay.

@rahulsingh-intel
Copy link
Contributor Author

@pytorchmergebot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/139749/head returned non-zero exit code 1

Rebasing (1/12)
Auto-merging test/distributed/_tensor/test_dtensor_compile.py
Auto-merging test/distributed/_tensor/test_random_ops.py
CONFLICT (content): Merge conflict in test/distributed/_tensor/test_random_ops.py
Auto-merging test/distributed/_tensor/test_redistribute.py
Auto-merging torch/testing/_internal/distributed/_tensor/common_dtensor.py
error: could not apply 57761392580... Tests Generelization for multiple accelerator devices
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply 57761392580... Tests Generelization for multiple accelerator devices

Raised by https://github.com/pytorch/pytorch/actions/runs/12279796394

@rahulsingh-intel
Copy link
Contributor Author

@pytorchmergebot rebase

@rahulsingh-intel
Copy link
Contributor Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@rahulsingh-intel
Copy link
Contributor Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@guangyey
Copy link
Collaborator

"Try to land this since the failure is unrelated."
@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Labels

ciflow/trunk (Trigger trunk jobs on your pull request), Merged, oncall: distributed (Add this issue/PR to distributed oncall triage queue), open source, topic: not user facing (topic category), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

9 participants