Making FSDP device-agnostic for custom backends that implement CUDA semantics #99024
Conversation
awgu
left a comment
The approach looks good to me! I left some nits and can approve on the next review.
Wondering whether CI has some setup to test non-CUDA devices? If so, could we add some unit tests for non-CUDA devices?
This might be challenging since I believe @medivh-xp uses custom hardware (correct me if I am wrong), and otherwise I am not sure there are easily accessible CUDA-like devices that we can use in CI. Personally, as long as this does not regress the CUDA code path, I am okay with landing. This is similar to adding the …
awgu
left a comment
This looks good to me! I just left one more nit.
Yes! We use custom hardware and will ensure that it supports the semantics of CUDA, so that we can directly benefit from the excellent features of the community.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
Consider a custom backend implemented on top of the privateuse1 dispatch key with semantics identical to CUDA (since CUDA is so widely used), named for example 'my_device' and registered under the module name torch.my_device.
This PR aims to satisfy the constraints of such a backend so that it can be integrated directly into the current FSDP implementation.
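For context, a rough sketch of what such a registration could look like (my_device_backend is a hypothetical extension module, and the calls shown are one possible way to wire up a privateuse1 backend; a real backend may do this differently):

```python
import torch

# Hypothetical out-of-tree extension that exposes a CUDA-like Python module
# (Stream, current_stream, synchronize, ...) for the custom hardware.
import my_device_backend  # assumed name, provided by the backend package

# Rename the privateuse1 dispatch key to "my_device" and expose the module
# as torch.my_device, mirroring how torch.cuda is reachable.
torch.utils.rename_privateuse1_backend("my_device")
torch._register_device_module("my_device", my_device_backend)
```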
The main issues addressed are:
1. Device decision for FSDP wrapping of Modules without Parameters
Users typically organize FSDP code as follows:
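A minimal sketch of this first pattern (the nn.Linear model, the "my_device" name, and the process-group setup are illustrative, not part of this PR):

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes torch.distributed has already been initialized with a backend
# that supports the custom device.
model = nn.Linear(16, 16).to("my_device")  # Parameters moved to the custom device
fsdp_model = FSDP(model)                   # FSDP infers the device from the Parameters
```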
or like this:
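A sketch of the second pattern, where the target device is passed explicitly via device_id:

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = nn.Linear(16, 16)  # module still on CPU
fsdp_model = FSDP(model, device_id=torch.device("my_device", 0))  # FSDP moves it
```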
If the model has Parameters, everything works fine because FSDP prioritizes the device where the Parameters reside. However, for Modules without Parameters, the to() call gives FSDP nothing to infer the device from, and FSDP falls back to the current CUDA device, which makes it impossible to use any device other than the current CUDA device for such Modules. Therefore, when FSDP is called with a device_id argument, that configuration now takes top priority (see the sketch below).
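As an illustrative (hypothetical) example of the parameterless case described above:

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

act = nn.ReLU()  # no Parameters, so .to() gives FSDP nothing to infer the device from
# Without device_id, FSDP would fall back to the current CUDA device; passing
# device_id makes the custom device take priority.
fsdp_act = FSDP(act, device_id=torch.device("my_device", 0))
```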
2. Abstraction of a cuda-like device
In addition to compute_device, _FSDPState now includes a device_handler member, which is simply a reference to either torch.cuda or torch.my_device. From now on, code that works on _FSDPState should use state.device_handler to create, wait on, and synchronize streams, just as it previously used torch.cuda directly.
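A hedged sketch of how this is meant to be used (the helper function is hypothetical; the only assumption is that the handle mirrors the torch.cuda stream API):

```python
def _copy_on_side_stream(state, flat_param):
    # state.device_handler is torch.cuda on CUDA builds and torch.my_device on
    # the custom backend, so the same code path works on both.
    handler = state.device_handler
    side_stream = handler.Stream()                      # was: torch.cuda.Stream()
    side_stream.wait_stream(handler.current_stream())   # order after current work
    with handler.stream(side_stream):
        copied = flat_param.detach().clone()
    handler.current_stream().wait_stream(side_stream)   # rejoin the main stream
    return copied
```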