
Conversation

@ssnl (Collaborator) commented Jun 25, 2019

This is achieved by using `cuDevicePrimaryCtxGetState` to check whether a primary context exists on a device. The call is cheap, as this benchmark of a single call on CUDA 10.1 (Titan Xp, driver 415.27) shows:

```
---------------------------------------------------------------------
Benchmark                              Time           CPU Iterations
---------------------------------------------------------------------
BM_cuDevicePrimaryCtxGetState        301 ns        301 ns    2319746
```
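
For illustration, here is a minimal C++ sketch (hypothetical code, not this PR's implementation) of the kind of check `cuDevicePrimaryCtxGetState` enables; it assumes the CUDA driver API headers and library are available:

```cpp
#include <cuda.h>

// Hypothetical helper (not PyTorch's actual code): returns true iff the
// device already has an active primary context. The driver call only
// queries the state; unlike most runtime-API calls, it never creates a
// context as a side effect.
bool hasActivePrimaryContext(CUdevice device) {
  unsigned int flags = 0;
  int active = 0;
  CUresult err = cuDevicePrimaryCtxGetState(device, &flags, &active);
  return err == CUDA_SUCCESS && active != 0;
}
```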

Commits:

1. Add `CUDAHooks::getDeviceWithPrimaryContext`, which returns a device index that has a primary context (if one exists); a rough sketch of this check follows the list.
   Link `c10/cuda` against `libcuda` for device API calls.
2. Use `getDeviceWithPrimaryContext` to check for a primary context in `pin_memory`.
   Fix the `OptionalDeviceGuard` doc.
3. Refactor `test_cuda_primary_ctx.py` to support multiple tests.
   Add a test for this change in that file.
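
As a rough sketch of what commit 1 describes (hypothetical code; `hasActivePrimaryContext` is the helper sketched above, and the current-device/device-count arguments stand in for the real device-management calls):

```cpp
#include <optional>

// Hypothetical sketch of the getDeviceWithPrimaryContext idea: prefer the
// current device if its primary context is active; otherwise return any
// device that has one, or nullopt if none does (in which case pin_memory
// cannot disturb an existing context no matter which device it touches).
std::optional<int> deviceWithPrimaryContext(int currentDevice, int deviceCount) {
  if (hasActivePrimaryContext(currentDevice)) {
    return currentDevice;
  }
  for (int i = 0; i < deviceCount; ++i) {
    if (i != currentDevice && hasActivePrimaryContext(i)) {
      return i;
    }
  }
  return std::nullopt;
}
```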

Fixes #21081.

@ssnl requested review from colesbury and gchanan June 25, 2019 20:44
@pytorchbot added the module: build, module: cuda, module: internals, module: operators, and module: tests labels Jun 25, 2019
@ssnl requested a review from ngimel June 25, 2019 21:04
@ssnl (Collaborator Author) commented Jun 25, 2019

Adding @ngimel as a reviewer in case our NVIDIA expert has something to say :)

A Member replied:

👍

@pytorchbot added the module: rocm label Jun 25, 2019
@ssnl force-pushed the pin_memory_device branch from beceff6 to 078c35b June 25, 2019 21:37
@soumith (Contributor) commented Jun 25, 2019

You cannot link libtorch against libcuda, because CPU-only machines won't have libcuda.so: it ships with the NVIDIA driver (and for the same reason we cannot bundle libcuda.so in the PyTorch wheel).

@ssnl (Collaborator Author) commented Jun 25, 2019

@soumith I may have described this change incorrectly. What I did was add the following to c10/cuda/CMakeLists.txt:

```
target_link_libraries(c10_cuda INTERFACE caffe2::cuda)
```

Is this also going to be a problem?

@ngimel previously approved these changes Jun 25, 2019
@ssnl (Collaborator Author) commented Jun 25, 2019

Yeah, this does not seem to be the correct way to link things. The CUDA build CI (which runs on a CPU-only machine, I think) fails with:

```
Jun 25 23:11:02 + [[ -n '' ]]
Jun 25 23:11:02 + [[ pytorch-linux-xenial-cuda9-cudnn7-py3-build == *xenial-cuda9-cudnn7-py3* ]]
Jun 25 23:11:02 + pushd docs
Jun 25 23:11:02 + pip install -q -r requirements.txt
Jun 25 23:11:02 ~/workspace/docs ~/workspace
Jun 25 23:11:10 + LC_ALL=C
Jun 25 23:11:10 + make html
Jun 25 23:11:10 Traceback (most recent call last):
Jun 25 23:11:10   File "source/scripts/build_activation_images.py", line 8, in <module>
Jun 25 23:11:10     import torch.nn.modules.activation
Jun 25 23:11:10   File "/opt/conda/lib/python3.6/site-packages/torch/__init__.py", line 79, in <module>
Jun 25 23:11:10     from torch._C import *
Jun 25 23:11:10 ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
Jun 25 23:11:10 Makefile:17: recipe for target 'figures' failed
```

@soumith What is the correct way to do this? If I don't link, the local build fails with:

```
FAILED: bin/generate_proposals_op_util_nms_gpu_test: && /home/ssnl/ccache/lib/c++
-Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -O2 -fPIC
-Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits
-Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter
-Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow
-Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow
-Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast
-fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable
-Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow
-DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3  -rdynamic    -rdynamic
caffe2/CMakeFiles/generate_proposals_op_util_nms_gpu_test.dir/operators/generate_proposals_op_util_nms_gpu_test.cc.o
-o bin/generate_proposals_op_util_nms_gpu_test
-L/home/ssnl/anaconda3/lib -Wl,rpath,/usr/local/cuda/lib64:/home/ssnl/anaconda3/lib:/data/packages/pytorch/build/lib:
/usr/local/cuda/lib64/libcudart.so  lib/libgtest_main.a
-Wl,--no-as-needed,/data/packages/pytorch/build/lib/libtorch.so
-Wl,--as-needed lib/libprotobuf.a
-lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -fopenmp -lpthread -lm -ldl
/home/ssnl/anaconda3/lib/libirc.so lib/libmkldnn.a lib/libc10_cuda.so lib/libc10.so
/usr/local/cuda/lib64/libnvToolsExt.so /usr/local/cuda/lib64/libcudart.so
/usr/local/cuda/lib64/libcufft.so /usr/local/cuda/lib64/libcurand.so
/usr/local/cuda/lib64/libcudnn.so.7 /usr/local/cuda/lib64/libcublas.so lib/libgtest.a -pthread && :

/data/packages/pytorch/build/lib/libtorch.so: undefined reference to `cuGetErrorString'
/data/packages/pytorch/build/lib/libtorch.so: undefined reference to `cuDevicePrimaryCtxGetState'
```

@ssnl (Collaborator Author) commented Jun 25, 2019

Also cc @ezyang, do you know the answer?

@ngimel (Collaborator) commented Jun 26, 2019

CUDA builds in CPU-only build environments can link against the libcuda stub that ships with the toolkit, and I think something like this is already done in PyTorch, but I don't know exactly how.
CPU-only builds, however, should not attempt to link against libcuda at all.

@ssnl (Collaborator Author) commented Jun 26, 2019

@ngimel Yep, CPU builds work on CPU-only machines. It seems that GPU-enabled builds on CPU-only machines should work up until CUDA is actually initialized, but this change makes them fail at import. The error happens when the build machine imports PyTorch to build the docs, an operation that should not initialize CUDA. There must be something I am missing...

@soumith (Contributor) commented Jun 26, 2019

We link to libnvrtc.so (which depends on libcuda.so) via our stub libthnvrtc.so, which is loaded dynamically, and only in the context of the JIT fuser. See: https://github.com/pytorch/pytorch/blob/b2197ef2b0b384f863d5d830d61cfc08c994b2ec/torch/csrc/jit/fuser/cuda/thnvrtc.h

You can declare the CUDA driver API functions there and use them via the thnvrtc mechanism we have.
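
For context, a minimal sketch of that kind of mechanism (hypothetical code, not the actual thnvrtc stub): resolve the driver-API symbol with dlopen/dlsym at runtime, so the binary carries no link-time dependency on libcuda.so:

```cpp
#include <dlfcn.h>
#include <cuda.h>  // only for the CUresult/CUdevice types

// Hypothetical sketch in the spirit of the thnvrtc stub: open the driver
// library lazily and look the symbol up by name. On a machine without the
// NVIDIA driver, dlopen simply fails and we report the symbol as missing.
using cuDevicePrimaryCtxGetState_fn =
    CUresult (*)(CUdevice, unsigned int*, int*);

cuDevicePrimaryCtxGetState_fn loadPrimaryCtxGetState() {
  void* handle = dlopen("libcuda.so.1", RTLD_LAZY | RTLD_LOCAL);
  if (handle == nullptr) {
    return nullptr;  // no driver installed, e.g. a CPU-only machine
  }
  return reinterpret_cast<cuDevicePrimaryCtxGetState_fn>(
      dlsym(handle, "cuDevicePrimaryCtxGetState"));
}
```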

@ssnl (Collaborator Author) commented Jun 26, 2019

@soumith I see. It seems that there is no easy way to access that in ATen, right?

@soumith (Contributor) commented Jun 26, 2019

@ssnl unlikely

@ailzhang added the triaged label Jun 27, 2019
@li-roy (Contributor) commented Jun 28, 2019

@pytorchbot rebase this please

@ssnl (Collaborator Author) commented Jun 28, 2019

This PR will fail CI as-is. Here is the plan: I will port the stub to ATen in a separate PR, then update this one afterwards.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 9, 2019
Summary:
Having the NVRTC stub in ATen is necessary to call driver APIs in ATen. This is currently blocking pytorch/pytorch#22229.

`DynamicLibrary` is also moved as it is used in the stub code, and seems general enough.
Pull Request resolved: pytorch/pytorch#22362

Differential Revision: D16131787

Pulled By: ezyang

fbshipit-source-id: add2ee8a8865229578aa00001a00d5a6671e0e73
@ssnl removed the module: internals and module: operators labels Jul 9, 2019
@ssnl (Collaborator Author) commented Jul 9, 2019

@VitalyFedyunin The build PR has been merged. I also moved the device guard to the host allocator. I think this is ready for another review :). CI should pass since I only rebased.

@VitalyFedyunin (Contributor) left a review:

Everything looks good except two tiny things.

@pytorchbot added the module: internals and module: operators labels Jul 9, 2019
@ssnl (Collaborator Author) commented Jul 9, 2019

@pytorchbot merge this please

The pytorch_linux_trusty_py3_6_gcc5_4_test failure is already reported at #22052.

@pytorchbot added the merge-this-please label Jul 9, 2019
@facebook-github-bot (Contributor) left a comment:

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang (Contributor) commented Jul 10, 2019

So, when we landed this internally, some people reported hitting this problem:

```
Error in dlopen or dlsym: libcaffe2_nvrtc.so: cannot open shared object file: No such file or directory
```

EDIT: Sorry, I posted this on the wrong PR

@ssnl (Collaborator Author) commented Jul 10, 2019

@ezyang When you have time, could you land this again? Ty :)

@jerryzh168 (Contributor) commented:

Please fix merge conflicts.

@ssnl (Collaborator Author) commented Jul 15, 2019

@pytorchbot rebase this please

@ssnl (Collaborator Author) commented Jul 15, 2019

@pytorchbot merge this please

@facebook-github-bot (Contributor) left a comment:

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ssnl deleted the pin_memory_device branch July 16, 2019 17:22
@facebook-github-bot (Contributor) commented:

@zou3519 merged this pull request in 8482efb.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 16, 2019
Summary:
This is achieved by using `cuDevicePrimaryCtxGetState` to check whether a primary context exists on a device. The call is cheap, as this benchmark of a single call on CUDA 10.1 (Titan Xp, driver 415.27) shows:
```
---------------------------------------------------------------------
Benchmark                              Time           CPU Iterations
---------------------------------------------------------------------
BM_cuDevicePrimaryCtxGetState        301 ns        301 ns    2319746
```

Commits:

1. Add `CUDAHooks::getDeviceWithPrimaryContext`, which returns a device index that has a primary context (if one exists).
    Link `c10/cuda` against `libcuda` for device API calls.
2. Use `getDeviceWithPrimaryContext` to check for a primary context in `pin_memory`.
    Fix the `OptionalDeviceGuard` doc.
3. Refactor `test_cuda_primary_ctx.py` to support multiple tests.
    Add a test for this change in that file.

Fixes pytorch/pytorch#21081.
Pull Request resolved: pytorch/pytorch#22229

Differential Revision: D16170194

Pulled By: zou3519

fbshipit-source-id: 485a45f211b7844c9e69c63f3b3b75194a796c5d