
Conversation

@mcarilli (Collaborator) commented Nov 1, 2019

Thanks to @AddyLaddy and @ptrblck for tracking this down.

In torch/csrc/cuda/nccl.cpp and torch/csrc/cuda/python_nccl.cpp, construction of the AutoNcclGroup guard (which calls ncclGroupStart()) precedes a possible call to get_communicators, which may call ncclCommInitAll(). Calling ncclCommInitAll() inside a ncclGroupStart()/ncclGroupEnd() region is incorrect according to our NCCL people.
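For concreteness, here is a minimal sketch against the raw NCCL API (not the PR's actual diff) of the ordering that must hold: communicator creation via ncclCommInitAll() has to happen before the ncclGroupStart()/ncclGroupEnd() region that the AutoNcclGroup guard brackets. Buffer, stream, and device setup are assumed, and error checking is omitted.

```cpp
#include <cstddef>
#include <cuda_runtime.h>
#include <nccl.h>

// Hypothetical helper illustrating the required call order for a
// single-process, two-GPU all-reduce; error checking omitted for brevity.
void allreduce_two_gpus(float* sendbuf[2], float* recvbuf[2], size_t count,
                        cudaStream_t streams[2]) {
  ncclComm_t comms[2];
  int devs[2] = {0, 1};

  // Correct ordering: create the communicators first, outside any group.
  ncclCommInitAll(comms, 2, devs);

  // Only then open the group that batches the per-device collective launches
  // (this is what the AutoNcclGroup guard brackets in nccl.cpp).
  ncclGroupStart();
  for (int i = 0; i < 2; ++i) {
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  }
  ncclGroupEnd();

  // The reported bug was the reverse nesting: ncclGroupStart() had already
  // been issued by the guard when get_communicators ran, so a first-time
  // ncclCommInitAll() could execute inside the open group, which hangs
  // under NCCL 2.5.x.

  for (int i = 0; i < 2; ++i) {
    ncclCommDestroy(comms[i]);
  }
}
```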

It seemed OK (the relevant tests were silently passing) as long as PyTorch was compiled and linked against NCCL 2.4.x (which is currently what's pinned in your third_party/nccl submodule). However, when we tried to compile and link against NCCL 2.5.x in internal builds, we began to see test hangs (TestAutogradDeviceTypeCUDA.test_unused_output_device_cuda was what initially brought it to our attention).

The present PR fixes those hangs, as far as we know, and will prevent a nasty surprise down the road when you start building against NCCL 2.5.

The backend affected by this PR is exposed via https://github.com/pytorch/pytorch/blob/master/torch/cuda/nccl.py. I'm not sure whether that exposure is actually used anywhere (I think the distributed frontend is now backed by ProcessGroupNCCL in torch/lib/c10d), so this PR may affect code that is already dead or dying, but it still appears to be tested.

I skimmed ProcessGroupNCCL.cpp for potential similar vulnerabilities and didn't spot anything obvious.

@ezyang requested a review from @pietern on Nov 1, 2019, 18:13
@facebook-github-bot (Contributor) left a comment


@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor) commented

@ezyang merged this pull request in 314066b.

@pietern (Contributor) commented Nov 4, 2019

@mcarilli Thanks for the fix! The code here is used by single-process multi-GPU code.
