Making FSDP device-agnostic for custom backends that implement CUDA semantics #99024
Conversation
awgu
left a comment
The approach looks good to me! I left some nits and can approve on the next review.
Wondering whether CI has some setup to test non-CUDA devices? If so, could we add some unit tests for non-CUDA devices?
This might be challenging since I believe @medivh-xp uses custom hardware (correct me if I am wrong), and otherwise I am not sure there are easily accessible CUDA-like devices that we can use in CI. Personally, as long as this does not regress the CUDA code path, I am okay with landing. This is similar to adding the …
awgu
left a comment
This looks good to me! I just left one more nit.
Yes! We use custom hardware and will ensure that it supports the semantics of CUDA, so that we can directly benefit from the excellent features of the community.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
Consider a custom backend implemented on top of the privateuse1 dispatch key with semantics identical to CUDA (since CUDA is so widely used), named for example 'my_device' and registered under the module name torch.my_device.
This PR aims to satisfy the constraints of such a backend so that it can be integrated directly into the current FSDP implementation.
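For context, a rough sketch of what such a registration could look like (my_device_backend is a hypothetical extension module, and the calls shown are one possible way to wire up a privateuse1 backend; a real backend may do this differently):

```python
import torch

# Hypothetical out-of-tree extension that exposes a CUDA-like Python module
# (Stream, current_stream, synchronize, ...) for the custom hardware.
import my_device_backend  # assumed name, provided by the backend package

# Rename the privateuse1 dispatch key to "my_device" and expose the module
# as torch.my_device, mirroring how torch.cuda is reachable.
torch.utils.rename_privateuse1_backend("my_device")
torch._register_device_module("my_device", my_device_backend)
```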
The main issues addressed are:
1. Device decision for FSDP wrapping of Modules without Parameters
Users typically organize FSDP code as follows:
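A minimal sketch of this first pattern (the nn.Linear model, the "my_device" name, and the process-group setup are illustrative, not part of this PR):

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes torch.distributed has already been initialized with a backend
# that supports the custom device.
model = nn.Linear(16, 16).to("my_device")  # Parameters moved to the custom device
fsdp_model = FSDP(model)                   # FSDP infers the device from the Parameters
```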
or like this:
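A sketch of the second pattern, where the target device is passed explicitly via device_id:

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = nn.Linear(16, 16)  # module still on CPU
fsdp_model = FSDP(model, device_id=torch.device("my_device", 0))  # FSDP moves it
```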
If the model has Parameters, everything works fine because FSDP prioritizes the device where the Parameters reside. However, for Modules without Parameters, the to() call gives FSDP nothing to infer the device from, and FSDP falls back to the current CUDA device, which makes it impossible to use any device other than the current CUDA device for such Modules. Therefore, when FSDP is called with a device_id argument, that configuration now takes top priority (see the sketch below).
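As an illustrative (hypothetical) example of the parameterless case described above:

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

act = nn.ReLU()  # no Parameters, so .to() gives FSDP nothing to infer the device from
# Without device_id, FSDP would fall back to the current CUDA device; passing
# device_id makes the custom device take priority.
fsdp_act = FSDP(act, device_id=torch.device("my_device", 0))
```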
2. Abstraction of a cuda-like device
In addition to compute_device, _FSDPState now includes a device_handler member, which is simply a reference to either torch.cuda or torch.my_device. From now on, code that works on _FSDPState should use state.device_handler to create, wait on, and synchronize streams, just as it previously used torch.cuda directly.
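A hedged sketch of how this is meant to be used (the helper function is hypothetical; the only assumption is that the handle mirrors the torch.cuda stream API):

```python
def _copy_on_side_stream(state, flat_param):
    # state.device_handler is torch.cuda on CUDA builds and torch.my_device on
    # the custom backend, so the same code path works on both.
    handler = state.device_handler
    side_stream = handler.Stream()                      # was: torch.cuda.Stream()
    side_stream.wait_stream(handler.current_stream())   # order after current work
    with handler.stream(side_stream):
        copied = flat_param.detach().clone()
    handler.current_stream().wait_stream(side_stream)   # rejoin the main stream
    return copied
```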