
Conversation

@colesbury
Member

Prevents NCCL calls from overlapping with cudaFree() which can lead to
deadlocks.
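To illustrate the serialization this change adds, here is a minimal sketch assuming a single process-wide lock analogous to the allocator's free mutex (getFreeMutex()); the wrapper names below are illustrative, not the actual PyTorch code:

```cpp
#include <mutex>
#include <cuda_runtime.h>
#include <nccl.h>

// Hypothetical stand-in for the mutex returned by getFreeMutex().
static std::mutex cuda_free_mutex;

// Every cudaFree() in the allocator takes the lock...
void cachedFree(void* ptr) {
  std::lock_guard<std::mutex> lock(cuda_free_mutex);
  cudaFree(ptr);
}

// ...and every NCCL launch takes the same lock, so the two can never overlap.
void launchAllReduce(ncclComm_t comm, float* buf, size_t count,
                     cudaStream_t stream) {
  std::lock_guard<std::mutex> lock(cuda_free_mutex);
  ncclAllReduce(buf, buf, count, ncclFloat, ncclSum, comm, stream);
}
```

With both paths holding the same mutex, an NCCL collective can never be launched while another thread is inside cudaFree().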

Contributor

@apaszke apaszke left a comment


Is this a bug in NCCL that is going to be fixed sometime, or is it an expected interaction? Is it only a problem with cudaFree or with cudaMalloc too?

Also, it is still possible to deadlock, right? We might have some other calls, e.g. the allocator compacting memory.

@colesbury
Copy link
Member Author

It's an expected interaction at this point. I don't expect it to be fixed.

This covers all the cudaFree calls in PyTorch, including when the allocator compacts memory.

cudaMalloc isn't an issue, but there might be other calls that could trigger a deadlock if they overlap with NCCL launches. I haven't tested all the likely suspects (cudaDeviceSynchronize, cudaStreamDestroy), but they are much less likely to be accidentally overlapped with NCCL calls.

@colesbury colesbury merged commit fc6fcf2 into pytorch:master Mar 1, 2017
@colesbury colesbury deleted the mutex branch March 1, 2017 22:54
onnxbot added a commit that referenced this pull request May 4, 2018
Jorghi12 pushed a commit to wsttiger/pytorch that referenced this pull request May 10, 2018
weiyangfb pushed a commit to weiyangfb/pytorch that referenced this pull request Jun 11, 2018
facebook-github-bot pushed a commit that referenced this pull request Jul 18, 2019
…NcclGroup. (#22173)

Summary:
There are two mutexes within CUDACachingAllocator that cause a deadlock.  One of the mutexes was added in order to work around the issue of NCCL interacting poorly with cudaFree.  See

- ROCm@68ff58d
- #880

As of NCCL version 2 and its new group start/end APIs, the protection surrounding cudaFree() is no longer needed.  The PyTorch code was updated to use the NCCL2 group start/end API, but the corresponding cuda_free_mutex and its getter getFreeMutex() were not revised.  This PR removes the use of getFreeMutex() when NCCL2 is used by moving the calls to getFreeMutex() into AutoNcclGroup.  That way, depending on the NCCL version in use, we either take the mutex or rely on the new group APIs.
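A hedged sketch of the version-dependent guard described above, assuming NCCL_MAJOR is provided by nccl.h; this illustrates the idea rather than reproducing the actual AutoNcclGroup implementation:

```cpp
#include <mutex>
#include <nccl.h>

struct AutoNcclGroupSketch {
  explicit AutoNcclGroupSketch(std::mutex& free_mutex) : free_mutex_(free_mutex) {
#if defined(NCCL_MAJOR) && (NCCL_MAJOR >= 2)
    ncclGroupStart();    // NCCL 2: fence the launches with the group API
#else
    free_mutex_.lock();  // NCCL 1: fall back to serializing against cudaFree()
#endif
  }
  ~AutoNcclGroupSketch() {
#if defined(NCCL_MAJOR) && (NCCL_MAJOR >= 2)
    ncclGroupEnd();
#else
    free_mutex_.unlock();
#endif
  }
  std::mutex& free_mutex_;
};
```

Because the choice is made inside the guard, callers on NCCL2 builds no longer take getFreeMutex() themselves, which removes one edge of the lock-ordering cycle described below.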

The race condition is as follows, thanks to skeelyamd:

The deadlock occurs between hip_free_mutex (aka cuda_free_mutex in github) (https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L165) and mutex (https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L162).

hip_free_mutex is exported from THCCachingAllocator in getFreeMutex (https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L660) and is acquired in ProcessGroupNCCL::collective (https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/ProcessGroupNCCL.cpp#L397), which then calls back into THCCachingAllocator via c10::cuda::CUDACachingAllocator::recordStream (https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/ProcessGroupNCCL.cpp#L416 to https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L655 to https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L379).  At this point it acquires mutex (https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L384).

This requires hip_free_mutex to be locked before mutex.

However, in free_blocks (https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L505) THCCachingAllocator locks hip_free_mutex.  free_blocks is called from emptyCache (https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L328), which locks mutex.

That requires mutex to be locked before hip_free_mutex.

emptyCache and ProcessGroupNCCL::collective must not execute concurrently, but this is occurring and deadlocking the CPU.

free_blocks is also called by malloc (via cuda_malloc_retry -> free_cached_blocks -> free_blocks), which also locks mutex first, so malloc must not execute concurrently with ProcessGroupNCCL::collective.
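Stripped of the allocator details, the failure is the classic AB/BA lock-order inversion; a minimal standalone sketch (the mutex names below stand in for hip_free_mutex/cuda_free_mutex and the allocator's internal mutex):

```cpp
#include <mutex>
#include <thread>

std::mutex cuda_free_mutex;  // stand-in for the mutex returned by getFreeMutex()
std::mutex alloc_mutex;      // stand-in for the allocator's internal mutex

// ProcessGroupNCCL::collective: free mutex first, then allocator mutex
// (via recordStream()).
void collective_path() {
  std::lock_guard<std::mutex> a(cuda_free_mutex);
  std::lock_guard<std::mutex> b(alloc_mutex);
}

// emptyCache -> free_blocks: allocator mutex first, then free mutex.
void empty_cache_path() {
  std::lock_guard<std::mutex> b(alloc_mutex);
  std::lock_guard<std::mutex> a(cuda_free_mutex);
}

int main() {
  std::thread t1(collective_path);
  std::thread t2(empty_cache_path);
  t1.join();
  t2.join();  // may never return: each thread holds the lock the other needs
}
```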
Pull Request resolved: #22173

Differential Revision: D16357177

Pulled By: pietern

fbshipit-source-id: f4ca9cd46cc6d5e15290d99577d88be3f4fa8972
mrshenli pushed a commit to mrshenli/pytorch that referenced this pull request Apr 11, 2020
Add section about C++ classes as op args
hubertlu-tw pushed a commit to hubertlu-tw/pytorch that referenced this pull request Nov 1, 2022
add streaming support for softmax kernels
