Lock the cudaFree mutex. #880
Conversation
Prevents NCCL calls from overlapping with cudaFree(), which can lead to deadlocks.
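For illustration, here is a minimal sketch of the locking pattern this change describes (function names are hypothetical; in PyTorch the mutex is the one exposed by the caching allocator's getFreeMutex()): a single process-wide mutex is held both around cudaFree() and around NCCL collective launches, so the two can never run concurrently.

```cpp
// Minimal sketch (hypothetical names): serialize cudaFree() against NCCL
// collective launches with one process-wide mutex.
#include <cstddef>
#include <mutex>
#include <cuda_runtime.h>
#include <nccl.h>

static std::mutex cuda_free_mutex;  // plays the role of the allocator's free mutex

// Allocator side: every cudaFree() happens while the mutex is held.
void free_device_block(void* ptr) {
  std::lock_guard<std::mutex> lock(cuda_free_mutex);
  cudaFree(ptr);
}

// NCCL side: hold the same mutex for the duration of the collective launch,
// so it cannot interleave with a concurrent cudaFree().
void launch_all_reduce(const float* send, float* recv, size_t count,
                       ncclComm_t comm, cudaStream_t stream) {
  std::lock_guard<std::mutex> lock(cuda_free_mutex);
  ncclAllReduce(send, recv, count, ncclFloat, ncclSum, comm, stream);
}
```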
apaszke left a comment
Is this a bug in NCCL that is going to be fixed sometime, or is it an expected interaction? Is it only a problem with cudaFree or with cudaMalloc too?
Also, it is still possible to deadlock, right? We might have some other calls, e.g. the allocator compacting the memory.
It's an expected interaction at this point. I don't expect it to be fixed. This covers all the cudaFree calls in PyTorch, including when the allocator compacts memory. cudaMalloc isn't an issue, but there might be other calls that could trigger a deadlock if they overlap with NCCL launches. I haven't tested all the likely suspects (cudaDeviceSynchronize, cudaStreamDestroy), but they are much less likely to be accidentally overlapped with NCCL calls.
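As a concrete illustration of that coverage, here is a hedged sketch (hypothetical names; the real caching allocator is more involved) of a compaction path that releases cached blocks through the same locked cudaFree() path, so it is protected by the same mutex as every other free.

```cpp
// Sketch (hypothetical names): allocator compaction frees cached blocks under
// the same mutex that guards every other cudaFree() call.
#include <mutex>
#include <vector>
#include <cuda_runtime.h>

extern std::mutex cuda_free_mutex;        // same mutex as in the earlier sketch
static std::vector<void*> cached_blocks;  // hypothetical cache of freed device blocks

void empty_cache() {
  std::lock_guard<std::mutex> lock(cuda_free_mutex);
  for (void* block : cached_blocks) {
    cudaFree(block);  // compaction cannot overlap an NCCL launch
  }
  cached_blocks.clear();
}
```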
…NcclGroup. (#22173)

Summary: There are two mutexes within CUDACachingAllocator that cause a deadlock. One of the mutexes was added in order to work around the issue of NCCL interacting poorly with cudaFree. See:
- ROCm@68ff58d
- #880

As of NCCL version 2 and its new group start/end APIs, the protection surrounding cudaFree() is no longer needed. The PyTorch code was updated to use the NCCL 2 group start/end API, but the corresponding cuda_free_mutex and its getter getFreeMutex() were not revised. This PR removes the use of getFreeMutex() when NCCL 2 is used by moving the calls to getFreeMutex() into AutoNcclGroup. That way, depending on the NCCL version in use, we either take the mutex or use the new group APIs.

The race condition is as follows, thanks to skeelyamd:

The deadlock occurs between hip_free_mutex (a.k.a. cuda_free_mutex in the upstream GitHub repo) (https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L165) and mutex (https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L162). hip_free_mutex is exported from THCCachingAllocator in getFreeMutex (https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L660) and is acquired in ProcessGroupNCCL::collective (https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/ProcessGroupNCCL.cpp#L397), which then calls back into THCCachingAllocator via c10::cuda::CUDACachingAllocator::recordStream (https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/ProcessGroupNCCL.cpp#L416 to https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L655 to https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L379). At that point it acquires mutex (https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L384). This ordering requires hip_free_mutex to be locked before mutex.

However, in free_blocks (https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L505) THCCachingAllocator locks hip_free_mutex, and free_blocks is called from emptyCache (https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L328), which locks mutex. That ordering requires mutex to be locked before hip_free_mutex.

emptyCache and ProcessGroupNCCL::collective therefore must not execute concurrently, but this is occurring and deadlocking the CPU. free_blocks is also called by malloc (via cuda_malloc_retry -> free_cached_blocks -> free_blocks), which likewise locks mutex first, so malloc must not execute concurrently with ProcessGroupNCCL::collective.

Pull Request resolved: #22173
Differential Revision: D16357177
Pulled By: pietern
fbshipit-source-id: f4ca9cd46cc6d5e15290d99577d88be3f4fa8972
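A rough sketch of the AutoNcclGroup approach that summary describes (simplified; the actual PyTorch class and the exact signature of getFreeMutex() may differ): with NCCL 2 the guard brackets the launch with ncclGroupStart()/ncclGroupEnd(), and with older NCCL it falls back to holding the allocator's free mutex.

```cpp
// Simplified sketch of an AutoNcclGroup-style RAII guard. Assumption: the
// allocator exposes its cudaFree mutex via getFreeMutex() returning a
// std::mutex* (the real PyTorch API may differ in detail).
#include <mutex>
#include <nccl.h>

std::mutex* getFreeMutex();  // provided by the caching allocator (assumed)

struct AutoNcclGroup {
  AutoNcclGroup() {
#if defined(NCCL_MAJOR) && (NCCL_MAJOR >= 2)
    ncclGroupStart();        // NCCL 2+: batch launches, no cudaFree lock needed
#else
    getFreeMutex()->lock();  // older NCCL: serialize against cudaFree()
#endif
  }
  ~AutoNcclGroup() {
#if defined(NCCL_MAJOR) && (NCCL_MAJOR >= 2)
    ncclGroupEnd();
#else
    getFreeMutex()->unlock();
#endif
  }
};

// Usage: wrap every NCCL collective launch in the guard, e.g.
//   { AutoNcclGroup guard; ncclAllReduce(/* ... */); }
```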