Use a pool of per-thread cudnn handles for each device #14861

mcarilli · 2018-12-06T23:37:32Z

Real PR at #15080

This PR fixes a potential race condition when multiple threads issue cudnn calls to the same device. (@slayton58 and I believe a similar danger exists for cublas as well, which this PR does NOT address, and which we will investigate and fix in a later PR if the danger is real.)

Currently, aten/src/ATen/cudnn/Handle.cpp maintains a single cudnn handle per device, lazily created whenever a thread requests it. This can lead to race conditions if different threads using different streams share the same device. For example:

1. Thread 0, using stream s0, calls setCuDNNStreamToCurrent().
2. Thread 1, using stream s1, calls setCuDNNStreamToCurrent().
3. Thread 0 launches its raw convolution, which it thinks will run in s0, but which actually runs in s1.
4. Thread 0 enqueues another op, which DOES run in s0, but now races with its convolution.
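The interleaving above can be replayed deterministically with a toy model. This is only a sketch: FakeHandle, set_stream and launch_conv are hypothetical stand-ins invented for illustration, not cudnn or ATen APIs, and the "threads" are replayed sequentially on one thread so the outcome does not depend on scheduling.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a cudnn handle: it only remembers the stream
// that was most recently bound to it.
struct FakeHandle {
  std::string stream;
};

// Hypothetical stand-in for setCuDNNStreamToCurrent(): binds the caller's
// current stream to the handle.
void set_stream(FakeHandle& h, const std::string& current_stream) {
  h.stream = current_stream;
}

// Hypothetical stand-in for launching a convolution: the work runs on
// whichever stream the handle is bound to at launch time.
std::string launch_conv(const FakeHandle& h) {
  return h.stream;
}

// Replays the problematic interleaving with ONE handle shared by both threads.
// Returns the stream that thread 0's convolution ends up on.
std::string conv_stream_with_shared_handle() {
  FakeHandle shared;
  set_stream(shared, "s0");    // thread 0 binds s0
  set_stream(shared, "s1");    // thread 1 overwrites the binding with s1
  return launch_conv(shared);  // thread 0 launches, expecting s0
}

// Replays the same interleaving with one handle per thread.
std::string conv_stream_with_private_handles() {
  FakeHandle h0, h1;
  set_stream(h0, "s0");        // thread 0 binds s0 on its own handle
  set_stream(h1, "s1");        // thread 1 cannot disturb h0
  return launch_conv(h0);      // thread 0 launches on s0 as intended
}
```

In this model, the shared-handle replay puts thread 0's convolution on s1, while the per-thread-handle replay keeps it on s0.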

Cudnn documentation contains a general warning against multiple threads using the same handle, so there may be other dangers I'm not aware of.

This PR avoids race conditions by maintaining a pool of cudnn handles. Handles are lazily created as different threads request them, released back into the pool when threads exit, but not destroyed until the end of the process. This is the desired behavior IMO: We want unused handles to be reusable by other threads later. If multiple threads attempt to use the same device, each thread will issue cudnn calls through a distinct handle, so race conditions are avoided.

In addition to the preexisting RAII helper Handle, the pool is implemented using three new data structures:

std::unordered_map<int, std::stack<Handle>> created_handles;

created_handles is a global variable, whose only purpose in life is to contain the set of all Handle wrappers that have been created so far for each device, and ensure that their destructors are called at the end of the process. Accesses to created_handles are mutexed.

std::unordered_map<int, std::stack<cudnnHandle_t>> available_handles;

available_handles is also a global object, storing the pool of currently-unused cudnnHandle_ts for each device. Threads pop handles from this pool, and release handles back into this pool when they exit. Accesses to available_handles are also mutexed.

thread_local PoolWindow myPoolWindow;

myPoolWindow is an object that controls access to the pool for each thread. It contains a member std::unordered_map<int, cudnnHandle_t> my_handles, which stores the cudnnHandle_t that this thread currently owns for each device, if any. Threads request cudnn handles by calling myPoolWindow.reserve(int device), which implements the following sequence:

1. If myPoolWindow::my_handles already contains a handle for that device, return it.
2. If not, grab the mutex that protects available_handles and created_handles. Check if available_handles has a free handle for this device. If so, stash that in my_handles, pop it from available_handles, and return it.
3. If no free handle is available for this device, emplace a new Handle into created_handles, stash the new cudnn handle in my_handles, and return it.

myPoolWindow's destructor releases this thread's handles back into the global available_handles pool.
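The reserve/release sequence can be sketched without cudnn as follows. This is a minimal illustration under stated assumptions, not the code in this PR: FakeHandle is a hypothetical integer stand-in for cudnnHandle_t, created_handles stores those integers rather than RAII Handle wrappers, and pool_mutex, next_handle_id and reserve_on_new_thread are names invented for the example.

```cpp
#include <cassert>
#include <mutex>
#include <stack>
#include <thread>
#include <unordered_map>

// Hypothetical stand-in for cudnnHandle_t (assumption for illustration).
using FakeHandle = int;

// One mutex guards both global maps, as described above.
std::mutex pool_mutex;
int next_handle_id = 0;  // hypothetical counter used to mint fake handles
std::unordered_map<int, std::stack<FakeHandle>> created_handles;
std::unordered_map<int, std::stack<FakeHandle>> available_handles;

struct PoolWindow {
  // Handles this thread currently owns, keyed by device.
  std::unordered_map<int, FakeHandle> my_handles;

  FakeHandle reserve(int device) {
    // Step 1: this thread already owns a handle for the device.
    auto it = my_handles.find(device);
    if (it != my_handles.end()) return it->second;

    std::lock_guard<std::mutex> guard(pool_mutex);
    FakeHandle h;
    auto& free_handles = available_handles[device];
    if (!free_handles.empty()) {
      // Step 2: reuse a handle released by a thread that has exited.
      h = free_handles.top();
      free_handles.pop();
    } else {
      // Step 3: nothing free, so create a handle and record it.
      h = next_handle_id++;
      created_handles[device].push(h);
    }
    my_handles[device] = h;
    return h;
  }

  // On thread exit, return every owned handle to the global pool.
  ~PoolWindow() {
    std::lock_guard<std::mutex> guard(pool_mutex);
    for (const auto& kv : my_handles) {
      available_handles[kv.first].push(kv.second);
    }
  }
};

thread_local PoolWindow myPoolWindow;

// Hypothetical helper: reserve a handle from a short-lived thread and report
// which handle that thread received.
FakeHandle reserve_on_new_thread(int device) {
  FakeHandle h = -1;
  std::thread t([&h, device] { h = myPoolWindow.reserve(device); });
  t.join();  // the thread's PoolWindow destructor has run by the time join returns
  return h;
}
```

In this sketch, two short-lived threads that run one after the other receive the same handle, while a thread that starts while another thread still holds a handle for the device receives a newly created one.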

For the usual one-thread-per-device usage, I don't believe this pattern incurs any more overhead than the current pattern of lazy Handle creation. For multiple-threads-per-device, it only adds overhead each time a new high-water mark of simultaneous threads using a single device is reached; otherwise, handles are reused.
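To make the high-water-mark claim concrete, here is a small counting model for a single device. It is a sketch only: handles_created and its event encoding (1 for a thread that starts using the device, -1 for a thread that exits) are hypothetical and exist just for this illustration.

```cpp
#include <cassert>
#include <vector>

// Counts how many handles a pool with the reuse policy described above would
// create for one device. Each event is 1 (a thread starts and reserves a
// handle) or -1 (a thread exits and releases its handle).
int handles_created(const std::vector<int>& events) {
  int free_handles = 0;  // handles sitting unused in the pool
  int created = 0;       // handles created so far
  for (int e : events) {
    if (e > 0) {
      if (free_handles > 0) {
        --free_handles;  // reuse a released handle: no creation cost
      } else {
        ++created;       // new high-water mark: pay for one creation
      }
    } else {
      ++free_handles;    // an exiting thread returns its handle
    }
  }
  return created;
}
```

Under this model the number of handles created equals the peak number of threads that use the device simultaneously, regardless of how many threads come and go in total.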

I've included a test that exposes the possible shared-handle race condition. It reliably fails on current master and succeeds on a build with this PR.

Summary: Pull Request resolved: pytorch#14746 Reviewed By: ezyang Differential Revision: D13318644 fbshipit-source-id: b703d7dc67e75d9e9571c80d62a100c5fc4e84df

Summary: Otherwise, these tests will fail, even though they are never meant to run on single-GPU machines. Pull Request resolved: pytorch#14860 Differential Revision: D13369060 Pulled By: teng-li fbshipit-source-id: 8a637a6d57335491ba8602cd09927700b2bbf8a0

Summary: - allow gradcheck to take sparse tensor as input - sparse output is not allowed yet at gradcheck - add backward for `to_dense()` to get around sparse output - calling gradcheck at test_sparse, so that we can use `_gen_sparse()` and also easily cover coalesced / uncoalesced test cases Pull Request resolved: pytorch#14596 Differential Revision: D13271904 Pulled By: weiyangfb fbshipit-source-id: 5317484104404fd38058884c86e987546011dd86

Summary: This PR removes some expect files that aren't really testing anything Pull Request resolved: pytorch#14871 Differential Revision: D13373762 Pulled By: driazati fbshipit-source-id: e3537ee83df23b3b3b854f9b1253fd0cc8e9dd33

Reviewed By: yns88 fbshipit-source-id: 7da015701f18f8a0b5a8092aae02a42ede7bfd44

Summary: Fixes pytorch#14099 I attempted to be as consistent as possible with the formatting, hence why my equation reads d*(k - 1) instead of (k - 1)*d. Also there is an unused variable on line 46: `n = self.in_channels`. I could fix that here too if that's not too out of scope. Pull Request resolved: pytorch#14876 Differential Revision: D13374317 Pulled By: soumith fbshipit-source-id: a9f110acafa58cdb4206956dbe3ab4738d48292d

Summary: Pull Request resolved: pytorch#14873 Differential Revision: D13375053 Pulled By: bddppq fbshipit-source-id: f3051640386667bbf0566856ed433eb83276c39e

Summary: Fixes pytorch#14859 . Differential Revision: D13376915 Pulled By: zou3519 fbshipit-source-id: f1fc0e8492a159431a3fc0a19a41aa10429ecc80

Differential Revision: D13205604 Original commit changeset: 54166492d318 fbshipit-source-id: 89b6833518c0b554668c88ae38d97fbc47e2de17

…pe issue. Summary: Pull Request resolved: pytorch#14407 Reviewed By: yinghai Differential Revision: D13364364 Pulled By: wesolwsk fbshipit-source-id: e69bcd1bc52e35b2f0e45e5dc40184f1bd66605d

Summary: Pull Request resolved: pytorch#14515 Differential Revision: D13247966 Pulled By: goldsborough fbshipit-source-id: 7a127c508fc576a7a92626dd6b729f660162d628

Summary: _th_tensor is moving off Type, so these calls need to be replaced. Unfortunately, replacing these with a full-fledged solution [e.g. from_storage(..., TensorOptions)] is a bit complicated because the storage itself fully defines the Type (modulo variable). It's simpler to just wait for the Variable/Tensor merge rather than to solve this now, so instead I changed the call sites to: at::empty({0}, type.options()).set_(storage...). This isn't great because we are also trying to get rid of Type::options, but this seems to be the lesser-of-two-evils. Pull Request resolved: pytorch#14877 Differential Revision: D13374310 Pulled By: gchanan fbshipit-source-id: eb953ed041507e6190d6f32e383912e5a08311cd

ezyang · 2018-12-07T21:32:05Z

cc @iotamudelta @petrex I don't know what invariants miopen gives, but they almost certainly are broken in the same way.

Summary: This will let us install tests and other Caffe2 python code as a part of running Caffe2 tests in PyTorch. Broken out of pytorch#13733 cc pjh5 yf225 Pull Request resolved: pytorch#14898 Reviewed By: pjh5 Differential Revision: D13381123 Pulled By: orionr fbshipit-source-id: 0ec96629b0570f6cc2abb1d1d6fce084e7464dbe

ezyang · 2018-12-07T21:44:28Z