Add wrappers for synchronous GPUDirect Storage APIs #130633
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/130633
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 0c18902 with merge base c047bdd.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
    _CUDAToolkit_find_and_add_import_lib(cublas_static DEPS culibos)
  endif()

  if(CUDAToolkit_VERSION VERSION_GREATER_EQUAL 11.4)
Is this file vendored from somewhere? Did you take that update from there?
Yep, from here: https://gitlab.kitware.com/cmake/cmake/-/blob/master/Modules/FindCUDAToolkit.cmake#L1245-1251
The cuFile changes only landed in CMake 3.25, but it seems like we are on version 3.17.
Ok!
Did you update the full thing to 3.25? Or did you just pick the subset of the changes you needed?
@malfet what do you think is the best way to do this? A full update to a given version to make sure we're not in a weird in-between. Or just what we need to reduce churn?
I just copy-pasted the subset I needed. FWIW, I am not sure the initial file was copied directly from upstream, as I see this at the top:
# This module is back-ported from CMake 3.17 and above to work with CMake 3.10
But in the upstream file from around the date when the PR that added this file was merged, I don't see a similar comment: https://gitlab.kitware.com/cmake/cmake/-/blob/21b102c77d85897c2500488180e58de077447b4c/Modules/FindCUDAToolkit.cmake
So I'm not sure whether any changes were made, or which CMake commit it was taken from.
@mikaylagawarecki has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Based in part on NVIDIA/apex#1774
Pull Request resolved: pytorch#130633
Approved by: https://github.com/albanD
…130633)"
This reverts commit 5b5e069.
Reverted pytorch#130633 on behalf of https://github.com/clee2000 due to breaking a lot of jobs and build rules internally D60085885, possibly needs to update some bazel build? ([comment](pytorch#130633 (comment)))
The build failures present on D60085885 do not exist on the imported D60155434 (I also verified this by running some of the builds locally on the diff, and they succeeded). The service_lab signal that is failing previously succeeded, so it is flaky. Going to rebase and merge.

@pytorchbot merge

Merge failed
Reason: This PR has internal changes and must be landed via Phabricator! Please try reimporting/rexporting the PR!
Details for Dev Infra team: Raised by workflow job

@pytorchbot merge

Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchbot revert -m "still failing internally D60265673" -c ghfirst

@pytorchbot successfully started a revert job. Check the current status here.

@mikaylagawarecki your PR has been successfully reverted.
This reverts commit 709ddf7.
Reverted #130633 on behalf of https://github.com/clee2000 due to still failing internally D60265673 ([comment](#130633 (comment)))
Reland #130633
USE_CUFILE turned off by default in this version
Pull Request resolved: #133489
Approved by: https://github.com/albanD
Based in part on NVIDIA/apex#1774
Stack from ghstack (oldest at bottom):
Differential Revision: D60155434