pin_memory malloc now uses existing context if available. #22229
Conversation
Adding a review from @ngimel in case our nvidia expert has something to say :)
test/test_cuda_primary_ctx.py
Outdated
👍
you cannot link
@soumith I may have described this change wrongly. What I did was adding in … Is this also going to be a problem?
yeah this does not seem the correct way to link things. the cuda build CI (which is on a CPU-only machine I think) fails with … @soumith What is the correct way to do this? If I don't link, the local build fails with …
also cc @ezyang, do you know the answer?
cuda builds in cpu-only building environments can link to the libcuda stub that is part of the toolkit, and I think something like this is already done in pytorch, but I don't exactly know how.
@ngimel yep, cpu builds are working on cpu machines. It seems that GPU-enabled builds on cpu machines should work up until trying to initialize cuda, but this change makes it fail at import time. The error happens when the building machine tries to import pytorch to build the docs, an operation that should not initialize cuda. There must be something I am missing...
we link to libnvrtc.so (which depends on libcuda.so) via our stub. You can refer to the CUDA Driver API functions there, and use them via the thnvrtc mechanism we have.
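For context on the stub approach described above, here is a minimal sketch of the general dlopen/dlsym pattern; this is not the actual thnvrtc code, and the local type aliases and the function name `loadPrimaryCtxGetState` are illustrative. Resolving the driver symbol lazily means the binary has no hard link-time dependency on `libcuda.so`, so CPU-only machines can still build and import it:

```cpp
// Illustrative sketch, not PyTorch's actual stub: resolve a CUDA driver API
// symbol at runtime via dlopen/dlsym so there is no link-time dependency on
// libcuda.so. Link with -ldl.
#include <dlfcn.h>
#include <cstdio>

// Simplified local declarations so this compiles without cuda.h.
using CUresult = int;
using CUdevice = int;
using cuDevicePrimaryCtxGetState_t = CUresult (*)(CUdevice, unsigned int*, int*);

cuDevicePrimaryCtxGetState_t loadPrimaryCtxGetState() {
  // Lazily open the driver library; on machines without a GPU driver this
  // simply fails and the caller can fall back to CPU-only behavior.
  void* handle = dlopen("libcuda.so.1", RTLD_LAZY | RTLD_GLOBAL);
  if (handle == nullptr) {
    std::fprintf(stderr, "libcuda not available: %s\n", dlerror());
    return nullptr;
  }
  return reinterpret_cast<cuDevicePrimaryCtxGetState_t>(
      dlsym(handle, "cuDevicePrimaryCtxGetState"));
}

int main() {
  auto fn = loadPrimaryCtxGetState();
  std::printf("cuDevicePrimaryCtxGetState %s\n", fn ? "resolved" : "unavailable");
  return 0;
}
```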
@soumith I see. It seems that there is no easy way to access that in ATen, right?
@ssnl unlikely
@pytorchbot rebase this please
This PR will fail CI. Here is the plan: I will port the stub to ATen in a separate PR, and update this one afterwards.
Summary: Having the NVRTC stub in ATen is necessary to call driver APIs in ATen. This is currently blocking pytorch/pytorch#22229. `DynamicLibrary` is also moved as it is used in the stub code, and seems general enough.
Pull Request resolved: pytorch/pytorch#22362
Differential Revision: D16131787
Pulled By: ezyang
fbshipit-source-id: add2ee8a8865229578aa00001a00d5a6671e0e73
@VitalyFedyunin The build PR has been merged. I also moved the device guard to the host allocator. I think this is ready for another review :). CI should pass since I only rebased.
VitalyFedyunin
left a comment
Everything good except two tiny things
@pytorchbot merge this please. The pytorch_linux_trusty_py3_6_gcc5_4_test failure is already reported at #22052.
facebook-github-bot
left a comment
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
EDIT: Sorry, I posted this on the wrong PR
@ezyang When you have time, could you land this again? Ty :)
please fix merge conflicts
@pytorchbot rebase this please
@pytorchbot merge this please
facebook-github-bot
left a comment
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary:
This is achieved by using `cuDevicePrimaryCtxGetState` as a way to check whether a primary context exists on a device. It is not too slow; here is a benchmark of a single call to it on CUDA 10.1, Titan Xp, driver 415.27:
```
---------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------
BM_cuDevicePrimaryCtxGetState 301 ns 301 ns 2319746
```
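For illustration, a minimal standalone sketch of this query (not the ATen implementation; the helper name `hasPrimaryContext` and the build line are assumptions). The key point is that the call only reads the primary-context state and never creates a context:

```cpp
// Minimal sketch (not the ATen code): query whether each device's primary
// context is active using the CUDA driver API, without creating a context.
// Build (assumed): g++ check_ctx.cpp -I/usr/local/cuda/include -lcuda
#include <cuda.h>
#include <cstdio>

bool hasPrimaryContext(int device_index) {
  unsigned int flags = 0;
  int active = 0;
  // cuDevicePrimaryCtxGetState only inspects state; it does not initialize
  // a context on the device, which is why pin_memory can call it cheaply.
  CUresult err = cuDevicePrimaryCtxGetState(device_index, &flags, &active);
  return err == CUDA_SUCCESS && active != 0;
}

int main() {
  cuInit(0);  // initializes the driver API but creates no context
  int count = 0;
  cuDeviceGetCount(&count);
  for (int i = 0; i < count; ++i) {
    std::printf("device %d primary context active: %d\n", i, (int)hasPrimaryContext(i));
  }
  return 0;
}
```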
Commits:
1. Add `CUDAHooks::getDeviceWithPrimaryContext`, which returns a device index that already has a primary context (if one exists).
   Link `c10/cuda` against `libcuda` for device API calls.
2. Use `getDeviceWithPrimaryContext` to check for an existing primary context in `pin_memory` (a sketch of this logic follows the summary below).
   Fix the `OptionalDeviceGuard` doc.
3. Refactor `test_cuda_primary_ctx.py` to support multiple tests.
   Add a test for this change in that file.
Fixes pytorch/pytorch#21081.
Pull Request resolved: pytorch/pytorch#22229
Differential Revision: D16170194
Pulled By: zou3519
fbshipit-source-id: 485a45f211b7844c9e69c63f3b3b75194a796c5d
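To make commits 1 and 2 concrete, here is a hedged, standalone sketch of the selection policy and its use when pinning host memory. The real code lives in ATen's `CUDAHooks` and the CUDA caching host allocator; the function names mirror the commit messages, but the fallback behavior when no device has a context is an assumption here. The idea: prefer the current device if it already has a primary context, otherwise any device that does, and save/restore the current device around the allocation (the "device guard in the host allocator" mentioned in the conversation):

```cpp
// Sketch only: approximates getDeviceWithPrimaryContext and its use when
// pinning host memory. Not the actual PyTorch implementation.
#include <cuda.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <optional>

static bool hasPrimaryContext(int device_index) {
  unsigned int flags = 0;
  int active = 0;
  return cuDevicePrimaryCtxGetState(device_index, &flags, &active) == CUDA_SUCCESS &&
         active != 0;
}

// Prefer the current device if it already has a primary context, otherwise
// any other device that does; nullopt means no device has one yet.
static std::optional<int> getDeviceWithPrimaryContext() {
  cuInit(0);  // initializes the driver API but creates no context
  int count = 0, current = 0;
  if (cudaGetDeviceCount(&count) != cudaSuccess || cudaGetDevice(&current) != cudaSuccess) {
    return std::nullopt;
  }
  if (hasPrimaryContext(current)) return current;
  for (int i = 0; i < count; ++i) {
    if (i != current && hasPrimaryContext(i)) return i;
  }
  return std::nullopt;
}

// Pin host memory against a device that already has a context, restoring the
// previously current device afterwards, like a scoped device guard would.
// Falling back to the current device when no context exists is an assumption.
static void* pinnedAlloc(std::size_t nbytes) {
  int saved = 0;
  cudaGetDevice(&saved);
  if (auto dev = getDeviceWithPrimaryContext()) {
    cudaSetDevice(*dev);
  }
  void* ptr = nullptr;
  cudaHostAlloc(&ptr, nbytes, cudaHostAllocDefault);
  cudaSetDevice(saved);  // restore the caller's device
  return ptr;
}

int main() {
  void* p = pinnedAlloc(1 << 20);  // pin 1 MiB without spawning a context if one already exists
  cudaFreeHost(p);
  return 0;
}
```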