
Conversation

@rahulsingh-intel
Contributor

@rahulsingh-intel rahulsingh-intel commented Oct 29, 2024

Motivation: Generalize the unit tests so that they can be executed on CUDA and non-CUDA devices.
Dependency: #133209 (merged now).
There was a previous PR, #135242, for these changes that was closed due to incorrect commits. I have incorporated the changes suggested in its comments.
@kwen2501 @zeshengzong Please review the changes.

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot

pytorch-bot bot commented Oct 29, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139184

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f033ba2 with merge base a97c6a7:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (fsdp) labels on Oct 29, 2024
@ankurneog

@kwen2501, @jgong5, @awgu: could you please help with the review? Thank you.

@mikaylagawarecki mikaylagawarecki added the triaged label on Oct 30, 2024
@kwen2501 kwen2501 requested review from awgu and weifengpy November 1, 2024 19:59
@kwen2501
Collaborator

kwen2501 commented Nov 1, 2024

Adding @weifengpy and @awgu to review since most changes are around FSDP tests.

@kwen2501
Copy link
Collaborator

kwen2501 commented Nov 1, 2024

Thanks for your big effort!

General comment:
There are 90 additions of TEST_HPU in the change. Can we think of a way to reduce their number?
We cannot have an if branch for each possible device in code like this:

src = torch.randn((10, 1, 1024), device=device_type if TEST_HPU else "cuda")

Can we have a DEVICE_TYPE = ... defined at top and just do:

src = torch.randn((10, 1, 1024), device=torch.device(DEVICE_TYPE, index))

in test code?
Thanks for your consideration!
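For concreteness, a minimal sketch of such a module-level constant (illustrative only, assuming the TEST_CUDA / TEST_HPU flags from torch.testing._internal.common_utils; not the final code in this PR):

import torch
from torch.testing._internal.common_utils import TEST_CUDA, TEST_HPU

# Resolve the accelerator once at import time instead of branching per call site.
if TEST_HPU:
    DEVICE_TYPE = "hpu"
elif TEST_CUDA:
    DEVICE_TYPE = "cuda"
else:
    DEVICE_TYPE = "cpu"

def make_src(index: int = 0) -> torch.Tensor:
    # Every test builds tensors the same way, regardless of backend.
    return torch.randn((10, 1, 1024), device=torch.device(DEVICE_TYPE, index))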

@rahulsingh-intel
Contributor Author


Hi @kwen2501, thanks for your comments.

  • The device id accepted by CUDA and HPU differs: for HPU it should be "hpu:0" on all ranks, while for CUDA it should be the rank number (self.rank / cuda.current_device()).
  • So for HPU the device is fixed and can be defined globally, but not for CUDA. An "if TEST_HPU" check is therefore still needed, but it can be done once instead of being repeated everywhere (see the sketch below).
  • I have updated the files further; please review.
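A minimal sketch of such a one-time check hidden behind a helper (illustrative only, assuming the TEST_HPU flag; rank_to_device is a hypothetical name, not code from this PR):

import torch
from torch.testing._internal.common_utils import TEST_HPU

def rank_to_device(rank: int) -> torch.device:
    # HPU tests use "hpu:0" on every rank; CUDA tests map the rank to a local device index.
    if TEST_HPU:
        return torch.device("hpu", 0)
    return torch.device("cuda", rank % torch.cuda.device_count())

Tests would then call rank_to_device(self.rank) instead of repeating the if TEST_HPU branch at every call site.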

Collaborator

@jgong5 jgong5 left a comment



Is it possible to factor out some utility functions that hide these TEST_HPU device checks and have the UTs call them, instead of introducing the device checks directly into the UTs?

@ankurneog

ankurneog commented Nov 7, 2024


Thanks @jgong5 for your comment. Can we address this as part of a different PR?

Some background:
Even if we remove TEST_HPU, there would still be references to other devices such as CUDA (TEST_CUDNN), HIP (TEST_ROCM), etc. in these files. Also, being an out-of-tree device, we need additional checks for library availability before accessing APIs such as torch.hpu.device_count(); this is what TEST_HPU accommodates.

What we really need are platform-independent APIs, so that the tests have no references to device-specific APIs.

We plan to do this abstraction as part of a follow-up PR, cleaning up the files to be platform independent by replacing dependencies such as torch.cuda.device_count() with generic APIs (e.g. torch.device.count()). That would require some changes in the PyTorch Python frontend, which needs effort and prototyping, but it would make the tests truly platform independent, with all these checks hidden under the wrappers.

Please share your views.

@rahulsingh-intel
Contributor Author

rahulsingh-intel commented Nov 7, 2024


Hi @jgong5, agreed with @ankurneog; we will address this feature enhancement in a follow-up. Since the adaptations already took a lot of effort, could you please approve the changes for now?
cc: @kwen2501

@rahulsingh-intel
Contributor Author


@pytorchbot rebase

@pytorch-bot

pytorch-bot bot commented Nov 7, 2024

You don't have permissions to rebase this PR since you are a first time contributor. If you think this is a mistake, please contact PyTorch Dev Infra.

@ankurneog

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased ditributed_torch onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout ditributed_torch && git pull --rebase)

@kwen2501
Collaborator

kwen2501 commented Nov 8, 2024

The device id accepted by CUDA and HPU differs: for HPU it should be "hpu:0" on all ranks.

Thanks for the explanation. I am still a bit confused. Does the following rule work for HPU or not?

index = rank % device_count()

@kwen2501
Collaborator

kwen2501 commented Nov 8, 2024

Does the following code work for HPU or not?

if TEST_HPU:
  device_type = "hpu"
elif ...  # others
  ...

device_module = torch.get_device_module(device_type)
device_index = rank % device_module.device_count()
device = torch.device(device_type, device_index)

@rahulsingh-intel
Contributor Author


Hi @kwen2501, torch.get_device_module("hpu") doesn't work for HPU:
(screenshot of the resulting error attached)


As @kwen2501 mentioned in one of the comments, torch.get_device_module(device.type).reset_peak_memory_stats() should work. We just need to check once whether there is any issue with an out-of-tree device like HPU. I just checked, and it seems to be working:

>>> device=torch.device("hpu:0")
>>> torch.get_device_module(device.type).is_available()
True
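A minimal sketch of the device-agnostic pattern being discussed here (illustrative only; note that torch.cpu does not expose reset_peak_memory_stats, which is why the guard is added; see the CPU-only failure discussed further below):

import torch

def reset_peak_memory_stats(device: torch.device) -> None:
    # Look up the backend module (torch.cuda, torch.hpu, ...) from the device type
    # instead of hard-coding torch.cuda.
    mod = torch.get_device_module(device.type)
    # Not every backend module has this API (torch.cpu does not), so guard the call.
    if hasattr(mod, "reset_peak_memory_stats"):
        mod.reset_peak_memory_stats()

device = torch.device("cuda", 0) if torch.cuda.is_available() else torch.device("cpu")
reset_peak_memory_stats(device)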

Contributor

I like this change. It’s a good idea to use torch.get_device_module(...) instead of relying on if...else statements with device-specific namespaces like torch.cuda, torch.hpu, or torch.xpu.


Let's keep it as:

devices = ["cuda"]
if TEST_HPU:
    devices.append("hpu") 


Same as the previous comment:

devices = ["cuda"]
if TEST_HPU:
   devices.append("hpu")

That way, new devices can just be added to the list.
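A minimal sketch of how such a list could be consumed (run_on_devices is a hypothetical helper, not part of this PR):

import torch
from torch.testing._internal.common_utils import TEST_HPU

# Devices to exercise; new backends only need to append themselves here.
devices = ["cuda"]
if TEST_HPU:
    devices.append("hpu")

def run_on_devices(fn):
    # Run a test body once per configured device.
    for dev in devices:
        fn(torch.device(dev, 0))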


Same as commented before.


device_id is already hpu:0 or cuda:0, I think, so

Sequential(FSDP(Linear(5, 5))).to(device)

should work, shouldn't it?


Same as the previous comment.


Can we add this directly to the base class FSDPTest?


Remove this and use self.device instead.


Same here; .to(self.device) should work.


Let's update device in the FSDPTest base class.

Contributor

Suggested change:
- device_type = "hpu:0" if TEST_HPU else torch.cuda.current_device()
+ device_type = "hpu:0" if TEST_HPU else torch.get_device_module().current_device()

Contributor

Suggested change:
- device_type = "hpu:0" if TEST_HPU else torch.cuda.current_device()
+ device_type = "hpu:0" if TEST_HPU else torch.get_device_module().current_device()

@rahulsingh-intel
Contributor Author


Hi @kwen2501, I removed almost all of the if/else and TEST_HPU & TEST_CUDA statements. Can you please review now?

@rahulsingh-intel
Contributor Author


Hi @kwen2501, please review.

@rahulsingh-intel
Contributor Author


Hi @kwen2501, CI ran fine; please approve after review.

@clee2000
Contributor

Ke is on PTO; @awgu, could you take a look at this?

@clee2000
Contributor

clee2000 commented Dec 12, 2024

I found test_root_module_is_not_FSDP_cuda logs on the main branch, so I think it's something to do with test instantiation via instantiate_device_type_tests and the internal test collection picking up the non-instantiated tests.

@rahulsingh-intel
Contributor Author

@clee2000, can you please re-merge now? This class only instantiates tests based on the device type.
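For context, a minimal sketch of how instantiate_device_type_tests generates per-device test classes (test names here are illustrative, not this PR's actual tests):

import torch
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests

class TestExample(TestCase):
    def test_add(self, device):
        # `device` is injected per backend ("cpu", "cuda:0", ...).
        x = torch.ones(2, device=device)
        self.assertEqual((x + x).sum().item(), 4.0)

# Generates TestExampleCPU, TestExampleCUDA, ... only for devices present on the machine;
# the generic class itself is not meant to be collected or run directly.
instantiate_device_type_tests(TestExample, globals())

if __name__ == "__main__":
    run_tests()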

@rahulsingh-intel
Contributor Author

@pytorchmergebot merge

@pytorch-bot

pytorch-bot bot commented Dec 19, 2024

This PR needs to be approved by an authorized maintainer before merge.

@rahulsingh-intel
Contributor Author


@clee2000, can you please re-merge?

@rahulsingh-intel
Contributor Author

Hi @kwen2501, @awgu, this PR was merged but reverted due to an issue faced by @clee2000. Can you please re-approve?

@clee2000
Contributor

@awgu has suggested @fegin as a POC for internal testing.

@fegin, I believe a change like D67285322 will help with some of these tests. That being said, I do not know whether we have multi-GPU machines for internal testing, so I don't know if there is any value in running them.

I will defer to Andrew and Chien-Chin on what to do next, as this is the extent of my knowledge of distributed testing.

torch.cuda.reset_peak_memory_stats()
).to(device_type.type)
x = torch.randn(10000, 256, requires_grad=True).to(device_type.type)
torch.get_device_module(device_type.type).reset_peak_memory_stats()
Contributor

Because you removed @unittest.skipIf(not torch.cuda.is_available(), "Test requires CUDA"), this line should fail when running on CPU-only devices. I got a module 'torch.cpu' has no attribute 'reset_peak_memory_stats' error in the internal CPU tests. I'm not sure why CI didn't fail, though. cc @clee2000

Contributor

It is disabled in CI due to #79510

Contributor

I see. So we should also disable this internally.

Contributor

@rahulsingh-intel Can you add an unconditional skip to this test? Thanks!

Contributor Author

Hi @fegin, added. Thanks.
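For reference, a minimal sketch of what such an unconditional skip looks like (class and test names are stand-ins, not the actual change):

import unittest

class TestFSDPMemory(unittest.TestCase):  # stand-in for the real FSDP test class
    @unittest.skip("Also disabled in CI; see #79510")
    def test_fsdp_memory(self):
        ...

if __name__ == "__main__":
    unittest.main()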

@fegin
Contributor

fegin commented Dec 24, 2024

We can retry this PR after the internal fix is landed. @clee2000

@rahulsingh-intel
Contributor Author


Hi @fegin, can you please approve this now? Resolving conflicts takes a lot of effort, and we also have a dependency (#139749) on this PR. Once you approve, I will make the same changes in the remaining FSDP test modules in another PR.

@rahulsingh-intel
Contributor Author

Hi @clee2000, @fegin, can you please approve it now?

@ankurneog

@kwen2501: could you please help with the approval of this PR?

Collaborator

@kwen2501 kwen2501 left a comment


Re-approving. Sorry, I have been on PTO since mid-December, and I don't know why my previous approval was gone (perhaps a re-request of review nullified it).

@rahulsingh-intel
Contributor Author

@pytorchmergebot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@ankurneog


Thank you @kwen2501

pytorchmergebot pushed a commit that referenced this pull request May 19, 2025
…#149848)

The motivation for this PR is to refactor the existing test cases in the folder test/distributed/_composable/fsdp/, i.e. fsdp2 (as referred to in torchtitan), to be device agnostic so that any accelerator type is supported (e.g. CUDA, HPU, XPU, etc.).

The changes are in line with the previously merged changes for the FSDP test cases (in the folder test/distributed/fsdp/): #139184.

Pull Request resolved: #149848
Approved by: https://github.com/kwen2501, https://github.com/guangyey

Labels

ci-no-td, ciflow/trunk, Merged, oncall: distributed, open source, release notes: distributed (fsdp), Reverted, triaged
