
Conversation

@rahulsingh-intel
Contributor

@rahulsingh-intel rahulsingh-intel commented Oct 29, 2024

Motivation: Generalize the unit tests so that they can be executed on CUDA and non-CUDA devices.
Dependency: #133209 (merged now).
There was a previous PR, #135242, for these changes that was closed due to incorrect commits. I have incorporated the changes suggested in its comments.
@kwen2501 @zeshengzong Please review the changes.

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot

pytorch-bot bot commented Oct 29, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139184

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f033ba2 with merge base a97c6a7:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (fsdp) labels on Oct 29, 2024
@ankurneog

@kwen2501, @jgong5, @awgu: could you please help with the review? Thank you.

@mikaylagawarecki mikaylagawarecki added the triaged label on Oct 30, 2024
@kwen2501 kwen2501 requested review from awgu and weifengpy November 1, 2024 19:59
@kwen2501
Collaborator

kwen2501 commented Nov 1, 2024

Adding @weifengpy and @awgu to review since most changes are around FSDP tests.

@kwen2501
Copy link
Collaborator

kwen2501 commented Nov 1, 2024

Thanks for your big effort!

General comment:
There are 90 additions of TEST_HPU in the change. Can we think of a way to reduce their number?
We cannot have an if branch for each possible device in code like this:

src = torch.randn((10, 1, 1024), device=device_type if TEST_HPU else "cuda")

Can we have a DEVICE_TYPE = ... defined at top and just do:

src = torch.randn((10, 1, 1024), device=torch.device(DEVICE_TYPE, index))

in test code?
Thanks for your consideration!
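For concreteness, a minimal sketch of such a module-level constant (illustrative only, assuming the TEST_CUDA / TEST_HPU flags from torch.testing._internal.common_utils; not the final code in this PR):

import torch
from torch.testing._internal.common_utils import TEST_CUDA, TEST_HPU

# Resolve the accelerator once at import time instead of branching per call site.
if TEST_HPU:
    DEVICE_TYPE = "hpu"
elif TEST_CUDA:
    DEVICE_TYPE = "cuda"
else:
    DEVICE_TYPE = "cpu"

def make_src(index: int = 0) -> torch.Tensor:
    # Every test builds tensors the same way, regardless of backend.
    return torch.randn((10, 1, 1024), device=torch.device(DEVICE_TYPE, index))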

@rahulsingh-intel
Contributor Author


Hi @kwen2501, thanks for your comments.

  • The device id accepted by CUDA and HPU differs: for HPU it should be "hpu:0" on all ranks, while for CUDA it should be the rank number (self.rank / cuda.current_device()).
  • So for HPU the device is fixed and can be defined globally, but not for CUDA. An "if TEST_HPU" check is therefore still needed, but it can be done once instead of being repeated everywhere (see the sketch below).
  • I have updated the files further; please review.
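A minimal sketch of such a one-time check hidden behind a helper (illustrative only, assuming the TEST_HPU flag; rank_to_device is a hypothetical name, not code from this PR):

import torch
from torch.testing._internal.common_utils import TEST_HPU

def rank_to_device(rank: int) -> torch.device:
    # HPU tests use "hpu:0" on every rank; CUDA tests map the rank to a local device index.
    if TEST_HPU:
        return torch.device("hpu", 0)
    return torch.device("cuda", rank % torch.cuda.device_count())

Tests would then call rank_to_device(self.rank) instead of repeating the if TEST_HPU branch at every call site.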

Collaborator

@jgong5 jgong5 left a comment



Is it possible to factor out some utility functions that hide these TEST_HPU device checks and have the UTs call them, instead of introducing the device checks directly into the UTs?

@ankurneog

ankurneog commented Nov 7, 2024


Thanks @jgong5 for your comment. Can we address this as part of a different PR?

Some background:
Even if we remove TEST_HPU, there would still be references to other devices such as CUDA (TEST_CUDNN), HIP (TEST_ROCM), etc. in these files. Also, being an out-of-tree device, we need additional checks for library availability before accessing APIs such as torch.hpu.device_count(); this is what TEST_HPU accommodates.

What we really need are platform-independent APIs, so that the tests have no references to device-specific APIs.

We plan to do this abstraction as part of a follow-up PR, cleaning up the files to be platform independent by replacing dependencies such as torch.cuda.device_count() with generic APIs (e.g. torch.device.count()). That would require some changes in the PyTorch Python frontend, which needs effort and prototyping, but it would make the tests truly platform independent, with all these checks hidden under the wrappers.

Please share your views.

@rahulsingh-intel
Contributor Author

rahulsingh-intel commented Nov 7, 2024


Hi @jgong5, agreed with @ankurneog; we will address this feature enhancement in a follow-up. Since the adaptations already took a lot of effort, could you please approve the changes for now?
cc: @kwen2501

@rahulsingh-intel
Contributor Author


@pytorchbot rebase

@pytorch-bot

pytorch-bot bot commented Nov 7, 2024

You don't have permissions to rebase this PR since you are a first time contributor. If you think this is a mistake, please contact PyTorch Dev Infra.

@ankurneog

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased ditributed_torch onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout ditributed_torch && git pull --rebase)

@kwen2501
Collaborator

kwen2501 commented Nov 8, 2024

The device id accepted by CUDA and HPU differs: for HPU it should be "hpu:0" on all ranks.

Thanks for the explanation. I am still a bit confused. Does the following rule work for HPU or not?

index = rank % device_count()

@kwen2501
Collaborator

kwen2501 commented Nov 8, 2024

Does the following code work for HPU or not?

if TEST_HPU:
  device_type = "hpu"
elif ...  # others
  ...

device_module = torch.get_device_module(device_type)
device_index = rank % device_module.device_count()
device = torch.device(device_type, device_index)

@rahulsingh-intel
Contributor Author


Hi @kwen2501, torch.get_device_module("hpu") doesn't work for HPU:
(screenshot of the resulting error attached)


As @kwen2501 mentioned in one of the comments, torch.get_device_module(device.type).reset_peak_memory_stats() should work. We just need to check once whether there is any issue with an out-of-tree device like HPU. I just checked, and it seems to be working:

>>> device=torch.device("hpu:0")
>>> torch.get_device_module(device.type).is_available()
True
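A minimal sketch of the device-agnostic pattern being discussed here (illustrative only; note that torch.cpu does not expose reset_peak_memory_stats, which is why the guard is added; see the CPU-only failure discussed further below):

import torch

def reset_peak_memory_stats(device: torch.device) -> None:
    # Look up the backend module (torch.cuda, torch.hpu, ...) from the device type
    # instead of hard-coding torch.cuda.
    mod = torch.get_device_module(device.type)
    # Not every backend module has this API (torch.cpu does not), so guard the call.
    if hasattr(mod, "reset_peak_memory_stats"):
        mod.reset_peak_memory_stats()

device = torch.device("cuda", 0) if torch.cuda.is_available() else torch.device("cpu")
reset_peak_memory_stats(device)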

Contributor

I like this change. It’s a good idea to use torch.get_device_module(...) instead of relying on if...else statements with device-specific namespaces like torch.cuda, torch.hpu, or torch.xpu.


Let's keep it as:

devices = ["cuda"]
if TEST_HPU:
    devices.append("hpu") 


Same as the previous comment:

devices = ["cuda"]
if TEST_HPU:
   devices.append("hpu")

That way, new devices can just be added to the list.
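A minimal sketch of how such a list could be consumed (run_on_devices is a hypothetical helper, not part of this PR):

import torch
from torch.testing._internal.common_utils import TEST_HPU

# Devices to exercise; new backends only need to append themselves here.
devices = ["cuda"]
if TEST_HPU:
    devices.append("hpu")

def run_on_devices(fn):
    # Run a test body once per configured device.
    for dev in devices:
        fn(torch.device(dev, 0))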


Same as commented before.


device_id is already hpu:0 or cuda:0, I think, so

Sequential(FSDP(Linear(5, 5))).to(device)

should work, shouldn't it?


Same as the previous comment.


Can we add this directly to the base class FSDPTest?


Remove this and use self.device instead.


Same here; .to(self.device) should work.


Let's update device in the FSDPTest base class.

Contributor

Suggested change:
- device_type = "hpu:0" if TEST_HPU else torch.cuda.current_device()
+ device_type = "hpu:0" if TEST_HPU else torch.get_device_module().current_device()

Contributor

Suggested change:
- device_type = "hpu:0" if TEST_HPU else torch.cuda.current_device()
+ device_type = "hpu:0" if TEST_HPU else torch.get_device_module().current_device()

@rahulsingh-intel
Contributor Author


Hi @kwen2501, I removed almost all of the if/else and TEST_HPU & TEST_CUDA statements. Can you please review now?

@rahulsingh-intel
Contributor Author


Hi @kwen2501, please review.

@rahulsingh-intel
Contributor Author


Hi @kwen2501, CI ran fine; please approve after review.

@clee2000
Contributor

Ke is on PTO; @awgu, could you take a look at this?

@clee2000
Contributor

clee2000 commented Dec 12, 2024

I found test_root_module_is_not_FSDP_cuda logs on the main branch, so I think it's something to do with test instantiation via instantiate_device_type_tests and the internal test collection picking up the non-instantiated tests.

@rahulsingh-intel
Contributor Author

@clee2000, can you please re-merge now? This class only instantiates tests based on the device type.
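For context, a minimal sketch of how instantiate_device_type_tests generates per-device test classes (test names here are illustrative, not this PR's actual tests):

import torch
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests

class TestExample(TestCase):
    def test_add(self, device):
        # `device` is injected per backend ("cpu", "cuda:0", ...).
        x = torch.ones(2, device=device)
        self.assertEqual((x + x).sum().item(), 4.0)

# Generates TestExampleCPU, TestExampleCUDA, ... only for devices present on the machine;
# the generic class itself is not meant to be collected or run directly.
instantiate_device_type_tests(TestExample, globals())

if __name__ == "__main__":
    run_tests()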

@rahulsingh-intel
Contributor Author

@pytorchmergebot merge

@pytorch-bot

pytorch-bot bot commented Dec 19, 2024

This PR needs to be approved by an authorized maintainer before merge.

@rahulsingh-intel
Contributor Author


@clee2000, can you please re-merge?

@rahulsingh-intel
Contributor Author

Hi @kwen2501, @awgu, this PR was merged but reverted due to an issue faced by @clee2000. Can you please re-approve?

@clee2000
Contributor

@awgu has suggested @fegin as a POC for internal testing.

@fegin, I believe a change like D67285322 will help with some of these tests. That being said, I do not know whether we have multi-GPU machines for internal testing, so I don't know if there is any value in running them.

I will defer to Andrew and Chien-Chin on what to do next, as this is the extent of my knowledge of distributed testing.

torch.cuda.reset_peak_memory_stats()
).to(device_type.type)
x = torch.randn(10000, 256, requires_grad=True).to(device_type.type)
torch.get_device_module(device_type.type).reset_peak_memory_stats()
Contributor

Because you removed @unittest.skipIf(not torch.cuda.is_available(), "Test requires CUDA"), this line should fail when running on CPU-only devices. I got a module 'torch.cpu' has no attribute 'reset_peak_memory_stats' error in the internal CPU tests. I'm not sure why CI didn't fail, though. cc @clee2000

Contributor

It is disabled in CI due to #79510

Contributor

I see. So we should also disable this internally.

Contributor

@rahulsingh-intel Can you add an unconditional skip to this test? Thanks!

Contributor Author

Hi @fegin, added. Thanks.
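For reference, a minimal sketch of what such an unconditional skip looks like (class and test names are stand-ins, not the actual change):

import unittest

class TestFSDPMemory(unittest.TestCase):  # stand-in for the real FSDP test class
    @unittest.skip("Also disabled in CI; see #79510")
    def test_fsdp_memory(self):
        ...

if __name__ == "__main__":
    unittest.main()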

@fegin
Contributor

fegin commented Dec 24, 2024

We can retry this PR after the internal fix is landed. @clee2000

@rahulsingh-intel
Contributor Author


Hi @fegin, can you please approve this now? Resolving conflicts takes a lot of effort, and we also have a dependency (#139749) on this PR. Once you approve, I will make the same changes in the remaining FSDP test modules in another PR.

@rahulsingh-intel
Contributor Author

Hi @clee2000, @fegin, can you please approve it now?

@ankurneog

@kwen2501: could you please help with the approval of this PR?

Collaborator

@kwen2501 kwen2501 left a comment


Re-approving. Sorry, I have been on PTO since mid-December, and I don't know why my previous approval was gone (perhaps a re-request of review nullified it).

@rahulsingh-intel
Contributor Author

@pytorchmergebot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@ankurneog


Thank you @kwen2501

pytorchmergebot pushed a commit that referenced this pull request May 19, 2025
…#149848)

The motivation for this PR is to refactor the existing test cases in the folder test/distributed/_composable/fsdp/, i.e. fsdp2 (as referred to in torchtitan), to be device agnostic so that any accelerator type is supported (e.g. CUDA, HPU, XPU, etc.).

The changes are in line with the previously merged changes for the FSDP test cases (in the folder test/distributed/fsdp/): #139184.

Pull Request resolved: #149848
Approved by: https://github.com/kwen2501, https://github.com/guangyey

Labels

ci-no-td, ciflow/trunk, Merged, oncall: distributed, open source, release notes: distributed (fsdp), Reverted, triaged
