
Conversation

@ssnl (Collaborator) commented Jun 25, 2019

This is achieved by using `cuDevicePrimaryCtxGetState` to check whether a primary context exists on a device. The call is cheap, as this benchmark of a single call on CUDA 10.1 (Titan Xp, driver 415.27) shows:

```
---------------------------------------------------------------------
Benchmark                              Time           CPU Iterations
---------------------------------------------------------------------
BM_cuDevicePrimaryCtxGetState        301 ns        301 ns    2319746
```
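
For illustration, here is a minimal C++ sketch (hypothetical code, not this PR's implementation) of the kind of check `cuDevicePrimaryCtxGetState` enables; it assumes the CUDA driver API headers and library are available:

```cpp
#include <cuda.h>

// Hypothetical helper (not PyTorch's actual code): returns true iff the
// device already has an active primary context. The driver call only
// queries the state; unlike most runtime-API calls, it never creates a
// context as a side effect.
bool hasActivePrimaryContext(CUdevice device) {
  unsigned int flags = 0;
  int active = 0;
  CUresult err = cuDevicePrimaryCtxGetState(device, &flags, &active);
  return err == CUDA_SUCCESS && active != 0;
}
```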

Commits:

1. Add `CUDAHooks::getDeviceWithPrimaryContext`, which returns a device index that has a primary context (if one exists); a rough sketch of this check follows the list.
   Link `c10/cuda` against `libcuda` for device API calls.
2. Use `getDeviceWithPrimaryContext` to check for a primary context in `pin_memory`.
   Fix the `OptionalDeviceGuard` doc.
3. Refactor `test_cuda_primary_ctx.py` to support multiple tests.
   Add a test for this change in that file.
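
As a rough sketch of what commit 1 describes (hypothetical code; `hasActivePrimaryContext` is the helper sketched above, and the current-device/device-count arguments stand in for the real device-management calls):

```cpp
#include <optional>

// Hypothetical sketch of the getDeviceWithPrimaryContext idea: prefer the
// current device if its primary context is active; otherwise return any
// device that has one, or nullopt if none does (in which case pin_memory
// cannot disturb an existing context no matter which device it touches).
std::optional<int> deviceWithPrimaryContext(int currentDevice, int deviceCount) {
  if (hasActivePrimaryContext(currentDevice)) {
    return currentDevice;
  }
  for (int i = 0; i < deviceCount; ++i) {
    if (i != currentDevice && hasActivePrimaryContext(i)) {
      return i;
    }
  }
  return std::nullopt;
}
```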

Fixes #21081.

@ssnl requested review from colesbury and gchanan June 25, 2019 20:44
@pytorchbot added the module: build, module: cuda, module: internals, module: operators, and module: tests labels Jun 25, 2019
@ssnl requested a review from ngimel June 25, 2019 21:04
@ssnl (Collaborator Author) commented Jun 25, 2019

Adding @ngimel as a reviewer in case our NVIDIA expert has something to say :)

A Member replied:

👍

@pytorchbot added the module: rocm label Jun 25, 2019
@ssnl force-pushed the pin_memory_device branch from beceff6 to 078c35b June 25, 2019 21:37
@soumith (Contributor) commented Jun 25, 2019

You cannot link libtorch against libcuda, because CPU-only machines won't have libcuda.so: it ships with the NVIDIA driver (and for the same reason we cannot bundle libcuda.so in the PyTorch wheel).

@ssnl (Collaborator Author) commented Jun 25, 2019

@soumith I may have described this change incorrectly. What I did was add the following to c10/cuda/CMakeLists.txt:

```
target_link_libraries(c10_cuda INTERFACE caffe2::cuda)
```

Is this also going to be a problem?

@ngimel previously approved these changes Jun 25, 2019
@ssnl (Collaborator Author) commented Jun 25, 2019

Yeah, this does not seem to be the correct way to link things. The CUDA build CI (which runs on a CPU-only machine, I think) fails with:

```
Jun 25 23:11:02 + [[ -n '' ]]
Jun 25 23:11:02 + [[ pytorch-linux-xenial-cuda9-cudnn7-py3-build == *xenial-cuda9-cudnn7-py3* ]]
Jun 25 23:11:02 + pushd docs
Jun 25 23:11:02 + pip install -q -r requirements.txt
Jun 25 23:11:02 ~/workspace/docs ~/workspace
Jun 25 23:11:10 + LC_ALL=C
Jun 25 23:11:10 + make html
Jun 25 23:11:10 Traceback (most recent call last):
Jun 25 23:11:10   File "source/scripts/build_activation_images.py", line 8, in <module>
Jun 25 23:11:10     import torch.nn.modules.activation
Jun 25 23:11:10   File "/opt/conda/lib/python3.6/site-packages/torch/__init__.py", line 79, in <module>
Jun 25 23:11:10     from torch._C import *
Jun 25 23:11:10 ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
Jun 25 23:11:10 Makefile:17: recipe for target 'figures' failed
```

@soumith What is the correct way to do this? If I don't link, the local build fails with:

```
FAILED: bin/generate_proposals_op_util_nms_gpu_test: && /home/ssnl/ccache/lib/c++
-Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -O2 -fPIC
-Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits
-Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter
-Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow
-Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow
-Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast
-fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable
-Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow
-DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3  -rdynamic    -rdynamic
caffe2/CMakeFiles/generate_proposals_op_util_nms_gpu_test.dir/operators/generate_proposals_op_util_nms_gpu_test.cc.o
-o bin/generate_proposals_op_util_nms_gpu_test
-L/home/ssnl/anaconda3/lib -Wl,rpath,/usr/local/cuda/lib64:/home/ssnl/anaconda3/lib:/data/packages/pytorch/build/lib:
/usr/local/cuda/lib64/libcudart.so  lib/libgtest_main.a
-Wl,--no-as-needed,/data/packages/pytorch/build/lib/libtorch.so
-Wl,--as-needed lib/libprotobuf.a
-lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -fopenmp -lpthread -lm -ldl
/home/ssnl/anaconda3/lib/libirc.so lib/libmkldnn.a lib/libc10_cuda.so lib/libc10.so
/usr/local/cuda/lib64/libnvToolsExt.so /usr/local/cuda/lib64/libcudart.so
/usr/local/cuda/lib64/libcufft.so /usr/local/cuda/lib64/libcurand.so
/usr/local/cuda/lib64/libcudnn.so.7 /usr/local/cuda/lib64/libcublas.so lib/libgtest.a -pthread && :

/data/packages/pytorch/build/lib/libtorch.so: undefined reference to `cuGetErrorString'
/data/packages/pytorch/build/lib/libtorch.so: undefined reference to `cuDevicePrimaryCtxGetState'
```

@ssnl (Collaborator Author) commented Jun 25, 2019

Also cc @ezyang, do you know the answer?

@ngimel (Collaborator) commented Jun 26, 2019

CUDA builds in CPU-only build environments can link against the libcuda stub that ships with the toolkit, and I think something like this is already done in PyTorch, but I don't know exactly how.
CPU-only builds, however, should not attempt to link against libcuda at all.

@ssnl (Collaborator Author) commented Jun 26, 2019

@ngimel Yep, CPU builds work on CPU-only machines. It seems that GPU-enabled builds on CPU-only machines should work up until CUDA is actually initialized, but this change makes them fail at import. The error happens when the build machine imports PyTorch to build the docs, an operation that should not initialize CUDA. There must be something I am missing...

@soumith (Contributor) commented Jun 26, 2019

We link to libnvrtc.so (which depends on libcuda.so) via our stub libthnvrtc.so, which is loaded dynamically, and only in the context of the JIT fuser. See: https://github.com/pytorch/pytorch/blob/b2197ef2b0b384f863d5d830d61cfc08c994b2ec/torch/csrc/jit/fuser/cuda/thnvrtc.h

You can declare the CUDA driver API functions there and use them via the thnvrtc mechanism we have.
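
For context, a minimal sketch of that kind of mechanism (hypothetical code, not the actual thnvrtc stub): resolve the driver-API symbol with dlopen/dlsym at runtime, so the binary carries no link-time dependency on libcuda.so:

```cpp
#include <dlfcn.h>
#include <cuda.h>  // only for the CUresult/CUdevice types

// Hypothetical sketch in the spirit of the thnvrtc stub: open the driver
// library lazily and look the symbol up by name. On a machine without the
// NVIDIA driver, dlopen simply fails and we report the symbol as missing.
using cuDevicePrimaryCtxGetState_fn =
    CUresult (*)(CUdevice, unsigned int*, int*);

cuDevicePrimaryCtxGetState_fn loadPrimaryCtxGetState() {
  void* handle = dlopen("libcuda.so.1", RTLD_LAZY | RTLD_LOCAL);
  if (handle == nullptr) {
    return nullptr;  // no driver installed, e.g. a CPU-only machine
  }
  return reinterpret_cast<cuDevicePrimaryCtxGetState_fn>(
      dlsym(handle, "cuDevicePrimaryCtxGetState"));
}
```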

@ssnl (Collaborator Author) commented Jun 26, 2019

@soumith I see. It seems that there is no easy way to access that in ATen, right?

@soumith (Contributor) commented Jun 26, 2019

@ssnl unlikely

@ailzhang added the triaged label Jun 27, 2019
@li-roy (Contributor) commented Jun 28, 2019

@pytorchbot rebase this please

@ssnl (Collaborator Author) commented Jun 28, 2019

This PR will fail CI as-is. Here is the plan: I will port the stub to ATen in a separate PR, then update this one afterwards.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 9, 2019
Summary:
Having the NVRTC stub in ATen is necessary to call driver APIs in ATen. This is currently blocking pytorch/pytorch#22229.

`DynamicLibrary` is also moved as it is used in the stub code, and seems general enough.
Pull Request resolved: pytorch/pytorch#22362

Differential Revision: D16131787

Pulled By: ezyang

fbshipit-source-id: add2ee8a8865229578aa00001a00d5a6671e0e73
@ssnl removed the module: internals and module: operators labels Jul 9, 2019
@ssnl (Collaborator Author) commented Jul 9, 2019

@VitalyFedyunin The build PR has been merged. I also moved the device guard to the host allocator. I think this is ready for another review :). CI should pass since I only rebased.

@VitalyFedyunin (Contributor) left a review:

Everything looks good except two tiny things.

@pytorchbot added the module: internals and module: operators labels Jul 9, 2019
@ssnl (Collaborator Author) commented Jul 9, 2019

@pytorchbot merge this please

The pytorch_linux_trusty_py3_6_gcc5_4_test failure is already reported at #22052.

@pytorchbot added the merge-this-please label Jul 9, 2019
@facebook-github-bot (Contributor) left a comment:

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang (Contributor) commented Jul 10, 2019

So, when we landed this internally, some people reported hitting this problem:

```
Error in dlopen or dlsym: libcaffe2_nvrtc.so: cannot open shared object file: No such file or directory
```

EDIT: Sorry, I posted this on the wrong PR

@ssnl (Collaborator Author) commented Jul 10, 2019

@ezyang When you have time, could you land this again? Ty :)

@jerryzh168 (Contributor) commented:

Please fix merge conflicts.

@ssnl (Collaborator Author) commented Jul 15, 2019

@pytorchbot rebase this please

@ssnl (Collaborator Author) commented Jul 15, 2019

@pytorchbot merge this please

@facebook-github-bot (Contributor) left a comment:

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ssnl deleted the pin_memory_device branch July 16, 2019 17:22
@facebook-github-bot (Contributor) commented:

@zou3519 merged this pull request in 8482efb.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 16, 2019
Summary:
This is achieved by using `cuDevicePrimaryCtxGetState` to check whether a primary context exists on a device. The call is cheap, as this benchmark of a single call on CUDA 10.1 (Titan Xp, driver 415.27) shows:
```
---------------------------------------------------------------------
Benchmark                              Time           CPU Iterations
---------------------------------------------------------------------
BM_cuDevicePrimaryCtxGetState        301 ns        301 ns    2319746
```

Commits:

1. Add `CUDAHooks::getDeviceWithPrimaryContext`, which returns a device index that has a primary context (if one exists).
    Link `c10/cuda` against `libcuda` for device API calls.
2. Use `getDeviceWithPrimaryContext` to check for a primary context in `pin_memory`.
    Fix the `OptionalDeviceGuard` doc.
3. Refactor `test_cuda_primary_ctx.py` to support multiple tests.
    Add a test for this change in that file.

Fixes pytorch/pytorch#21081.
Pull Request resolved: pytorch/pytorch#22229

Differential Revision: D16170194

Pulled By: zou3519

fbshipit-source-id: 485a45f211b7844c9e69c63f3b3b75194a796c5d