
Conversation

@colesbury
Member

Prevents NCCL calls from overlapping with cudaFree() which can lead to
deadlocks.
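To illustrate the serialization this change adds, here is a minimal sketch assuming a single process-wide lock analogous to the allocator's free mutex (getFreeMutex()); the wrapper names below are illustrative, not the actual PyTorch code:

```cpp
#include <mutex>
#include <cuda_runtime.h>
#include <nccl.h>

// Hypothetical stand-in for the mutex returned by getFreeMutex().
static std::mutex cuda_free_mutex;

// Every cudaFree() in the allocator takes the lock...
void cachedFree(void* ptr) {
  std::lock_guard<std::mutex> lock(cuda_free_mutex);
  cudaFree(ptr);
}

// ...and every NCCL launch takes the same lock, so the two can never overlap.
void launchAllReduce(ncclComm_t comm, float* buf, size_t count,
                     cudaStream_t stream) {
  std::lock_guard<std::mutex> lock(cuda_free_mutex);
  ncclAllReduce(buf, buf, count, ncclFloat, ncclSum, comm, stream);
}
```

With both paths holding the same mutex, an NCCL collective can never be launched while another thread is inside cudaFree().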

Contributor

@apaszke apaszke left a comment


Is this a bug in NCCL that is going to be fixed sometime, or is it an expected interaction? Is it only a problem with cudaFree or with cudaMalloc too?

Also, it is still possible to deadlock, right? We might have some other calls, e.g. the allocator compacting memory.

@colesbury
Copy link
Member Author

It's an expected interaction at this point. I don't expect it to be fixed.

This covers all the cudaFree calls in PyTorch, including when the allocator compacts memory.

cudaMalloc isn't an issue, but there might be other calls that could trigger a deadlock if they overlap with NCCL launches. I haven't tested all the likely suspects (cudaDeviceSynchronize, cudaStreamDestroy), but they are much less likely to be accidentally overlapped with NCCL calls.

@colesbury colesbury merged commit fc6fcf2 into pytorch:master Mar 1, 2017
@colesbury colesbury deleted the mutex branch March 1, 2017 22:54
onnxbot added a commit that referenced this pull request May 4, 2018
Jorghi12 pushed a commit to wsttiger/pytorch that referenced this pull request May 10, 2018
weiyangfb pushed a commit to weiyangfb/pytorch that referenced this pull request Jun 11, 2018
facebook-github-bot pushed a commit that referenced this pull request Jul 18, 2019
…NcclGroup. (#22173)

Summary:
There are two mutexes within CUDACachingAllocator that cause a deadlock.  One of the mutexes was added in order to work around the issue of NCCL interacting poorly with cudaFree.  See

- ROCm@68ff58d
- #880

As of NCCL version 2 and its new group start/end APIs, the protection surrounding cudaFree() is no longer needed.  The PyTorch code was updated to use the NCCL2 group start/end API, but the corresponding cuda_free_mutex and its getter getFreeMutex() were not revised.  This PR removes the use of getFreeMutex() when NCCL2 is used by moving the calls to getFreeMutex() into AutoNcclGroup.  That way, depending on the NCCL version in use, we either take the mutex or rely on the new group APIs.
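A hedged sketch of the version-dependent guard described above, assuming NCCL_MAJOR is provided by nccl.h; this illustrates the idea rather than reproducing the actual AutoNcclGroup implementation:

```cpp
#include <mutex>
#include <nccl.h>

struct AutoNcclGroupSketch {
  explicit AutoNcclGroupSketch(std::mutex& free_mutex) : free_mutex_(free_mutex) {
#if defined(NCCL_MAJOR) && (NCCL_MAJOR >= 2)
    ncclGroupStart();    // NCCL 2: fence the launches with the group API
#else
    free_mutex_.lock();  // NCCL 1: fall back to serializing against cudaFree()
#endif
  }
  ~AutoNcclGroupSketch() {
#if defined(NCCL_MAJOR) && (NCCL_MAJOR >= 2)
    ncclGroupEnd();
#else
    free_mutex_.unlock();
#endif
  }
  std::mutex& free_mutex_;
};
```

Because the choice is made inside the guard, callers on NCCL2 builds no longer take getFreeMutex() themselves, which removes one edge of the lock-ordering cycle described below.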

The race condition is as follows, thanks to skeelyamd:

The deadlock occurs between hip_free_mutex (aka cuda_free_mutex in github) (https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L165) and mutex (https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L162).

hip_free_mutex is exported from THCCachingAllocator in getFreeMutex (https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L660) and is acquired in ProcessGroupNCCL::collective (https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/ProcessGroupNCCL.cpp#L397), which then calls back into THCCachingAllocator via c10::cuda::CUDACachingAllocator::recordStream (https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/ProcessGroupNCCL.cpp#L416 to https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L655 to https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L379).  At this point it acquires mutex (https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L384).

This requires hip_free_mutex to be locked before mutex.

However, in free_blocks (https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L505) THCCachingAllocator locks hip_free_mutex.  free_blocks is called from emptyCache (https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L328), which locks mutex.

That requires mutex to be locked before hip_free_mutex.

emptyCache and ProcessGroupNCCL::collective must not execute concurrently, but this is occurring and deadlocking the CPU.

free_blocks is also called by malloc (via cuda_malloc_retry -> free_cached_blocks -> free_blocks), which also locks mutex first, so malloc must not execute concurrently with ProcessGroupNCCL::collective.
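Stripped of the allocator details, the failure is the classic AB/BA lock-order inversion; a minimal standalone sketch (the mutex names below stand in for hip_free_mutex/cuda_free_mutex and the allocator's internal mutex):

```cpp
#include <mutex>
#include <thread>

std::mutex cuda_free_mutex;  // stand-in for the mutex returned by getFreeMutex()
std::mutex alloc_mutex;      // stand-in for the allocator's internal mutex

// ProcessGroupNCCL::collective: free mutex first, then allocator mutex
// (via recordStream()).
void collective_path() {
  std::lock_guard<std::mutex> a(cuda_free_mutex);
  std::lock_guard<std::mutex> b(alloc_mutex);
}

// emptyCache -> free_blocks: allocator mutex first, then free mutex.
void empty_cache_path() {
  std::lock_guard<std::mutex> b(alloc_mutex);
  std::lock_guard<std::mutex> a(cuda_free_mutex);
}

int main() {
  std::thread t1(collective_path);
  std::thread t2(empty_cache_path);
  t1.join();
  t2.join();  // may never return: each thread holds the lock the other needs
}
```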
Pull Request resolved: #22173

Differential Revision: D16357177

Pulled By: pietern

fbshipit-source-id: f4ca9cd46cc6d5e15290d99577d88be3f4fa8972
mrshenli pushed a commit to mrshenli/pytorch that referenced this pull request Apr 11, 2020
Add section about C++ classes as op args
hubertlu-tw pushed a commit to hubertlu-tw/pytorch that referenced this pull request Nov 1, 2022
add streaming support for softmax kernels
