[CUDNN] PoolWindow::reserve crash, vector out of range. Race condition #19394

@xsacha

Description

🐛 Bug

When using a JIT model from two threads at the same time, the first concurrent forward call crashes here.

PoolWindow::reserve seems to try to access a vector out of range.

The responsible code was added in #14861

Backtrace:

 	msvcp140d.dll!std::_Debug_message(const wchar_t * message, const wchar_t * file, unsigned int line) Line 9	C++
>	caffe2_gpu.dll!std::vector<std::_List_unchecked_iterator<std::_List_val<std::_List_simple_types<std::pair<int const ,cudnnContext * __ptr64> > > >,std::allocator<std::_List_unchecked_iterator<std::_List_val<std::_List_simple_types<std::pair<int const ,cudnnContext * __ptr64> > > > > >::operator[](const unsigned __int64 _Pos) Line 1796	C++
 	caffe2_gpu.dll!std::_Hash<std::_Umap_traits<int,cudnnContext * __ptr64,std::_Uhash_compare<int,std::hash<int>,std::equal_to<int> >,std::allocator<std::pair<int const ,cudnnContext * __ptr64> >,0> >::_Vec_lo(unsigned __int64 _Bucket) Line 822	C++
 	caffe2_gpu.dll!std::_Hash<std::_Umap_traits<int,cudnnContext * __ptr64,std::_Uhash_compare<int,std::hash<int>,std::equal_to<int> >,std::allocator<std::pair<int const ,cudnnContext * __ptr64> >,0> >::_Begin(unsigned __int64 _Bucket) Line 841	C++
 	caffe2_gpu.dll!std::_Hash<std::_Umap_traits<int,cudnnContext * __ptr64,std::_Uhash_compare<int,std::hash<int>,std::equal_to<int> >,std::allocator<std::pair<int const ,cudnnContext * __ptr64> >,0> >::lower_bound(const int & _Keyval) Line 647	C++
 	caffe2_gpu.dll!std::_Hash<std::_Umap_traits<int,cudnnContext * __ptr64,std::_Uhash_compare<int,std::hash<int>,std::equal_to<int> >,std::allocator<std::pair<int const ,cudnnContext * __ptr64> >,0> >::find(const int & _Keyval) Line 630	C++
 	caffe2_gpu.dll!at::native::`anonymous namespace'::PoolWindow::reserve(int device) Line 88	C++
 	caffe2_gpu.dll!at::native::getCudnnHandle() Line 149	C++
 	caffe2_gpu.dll!at::native::setCuDNNStreamToCurrent() Line 13	C++
 	caffe2_gpu.dll!at::native::cudnn_convolution(const at::Tensor & input_t, const at::Tensor & weight_t, const at::Tensor & bias_t, c10::ArrayRef<__int64> padding, c10::ArrayRef<__int64> stride, c10::ArrayRef<__int64> dilation, __int64 groups, bool benchmark, bool deterministic) Line 930	C++
 	caffe2_gpu.dll!at::CUDAFloatType::cudnn_convolution(const at::Tensor & self, const at::Tensor & weight, const at::Tensor & bias, c10::ArrayRef<__int64> padding, c10::ArrayRef<__int64> stride, c10::ArrayRef<__int64> dilation, __int64 groups, bool benchmark, bool deterministic) Line 5315	C++
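
For context, a hypothetical sketch of the failure mode (not the actual PyTorch source, just an illustration): if the per-device handle cache that PoolWindow::reserve consults is a shared std::unordered_map with no synchronization, one thread's insert can rehash the map while another thread's find() is walking its buckets, which matches the out-of-range bucket access that the debug assertion above reports. A lock around the lookup, as sketched below, would avoid that.

    #include <unordered_map>
    #include <mutex>

    struct cudnnContext;                 // opaque stand-in for the real cuDNN handle type
    using Handle = cudnnContext*;

    class PoolWindowSketch {
        std::unordered_map<int, Handle> handles_;  // shared per-device handle cache
        std::mutex mutex_;                         // the synchronization the racy path lacks

    public:
        Handle reserve(int device) {
            // Without the lock, one thread's emplace() can rehash the map while
            // another thread's find() walks its buckets -> out-of-range access.
            std::lock_guard<std::mutex> guard(mutex_);
            auto it = handles_.find(device);
            if (it != handles_.end())
                return it->second;
            Handle handle = nullptr;      // real code would call cudnnCreate(&handle) here
            handles_.emplace(device, handle);
            return handle;
        }
    };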

To Reproduce

Steps to reproduce the behavior:

  1. Load a JIT model (once, on one thread)
  2. After the model is loaded, forward it from two threads simultaneously
  3. Observe the crash above (the out-of-range assertion fires in the MSVC <vector> header)

Called from two threads at the same time:

    static std::once_flag model_flag;
    std::call_once(model_flag, [&modelFile]() {
        model = torch::jit::load(modelFile, torch::kCUDA);
    });
    // All works fine up until here.
    model->forward({torch::randn({1, 3, 1024, 1024}, torch::kCUDA)});
    // Crashes when called from a thread other than the one the model was created on.
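
Roughly, the two-thread call pattern looks like the sketch below. The global `model`, the `modelFile` path, and the 1.1-era API in which `torch::jit::load` returns a `std::shared_ptr<torch::jit::script::Module>` are assumptions standing in for the real application code.

    #include <torch/script.h>
    #include <thread>
    #include <string>

    // Assumed globals, matching the snippet above.
    std::shared_ptr<torch::jit::script::Module> model;
    std::string modelFile = "model.pt";

    void run_forward() {
        // The first forward on each new thread goes through getCudnnHandle().
        model->forward({torch::randn({1, 3, 1024, 1024}, torch::kCUDA)});
    }

    int main() {
        // Load once, on the main thread.
        model = torch::jit::load(modelFile, torch::kCUDA);
        // Forward from two threads at the same time -> crash in PoolWindow::reserve.
        std::thread t1(run_forward);
        std::thread t2(run_forward);
        t1.join();
        t2.join();
        return 0;
    }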

Expected behavior

Forwarding the model from any thread should work once it has been loaded on one thread, without having to wait some arbitrary amount of time to avoid the race condition.

Environment

  • PyTorch Version (e.g., 1.0): 1.1.0-pre
  • OS (e.g., Linux): Windows 64-bit
  • How you installed PyTorch (conda, pip, source): nightly
  • Build command you used (if compiling from source): N/A
  • Python version: 3.7
  • CUDA/cuDNN version: 10.0/7.5
  • GPU models and configuration: GTX1060

Workarounds

Other models of mine run through a dedicated model thread that batches their inputs, and those work fine because that thread ensures forward calls never run simultaneously.
This particular model takes tensors of different sizes, which I am unable to batch together.
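
A cruder way to get the same "never run simultaneously" guarantee, sketched under the same assumed globals as the repro above (`guarded_forward` and `forward_mutex` are hypothetical names), is to serialize all forward() calls on this model with a mutex, at the cost of any concurrency:

    #include <mutex>
    #include <vector>

    // Hypothetical guard; every caller goes through guarded_forward() instead
    // of calling model->forward() directly.
    static std::mutex forward_mutex;

    torch::jit::IValue guarded_forward(std::vector<torch::jit::IValue> inputs) {
        std::lock_guard<std::mutex> lock(forward_mutex);  // serialize forward() calls
        return model->forward(std::move(inputs));
    }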

Labels

  • high priority
  • module: cudnn
  • module: multithreading
  • module: windows
  • triaged